A Taxonomy for Noise in Images of Paper Documents
Part I - The Physical Noises
Rafael Dueire Lins
Universidade Federal de Pernambuco, Recife - PE, Brazil
Abstract. A taxonomy encompasses not only a classification methodology, but
also an explicative theory of the phenomena that justify such classification. This
paper introduces a taxonomy for noise in images of paper documents and
details the Physical Noises according to the proposed taxonomy.
Keywords: noise, document images, paper documents, digital libraries.
1. Introduction
Legacy paper documents are being digitalized at a very fast pace to integrate them
with digital documents, bridging the gap between past and present
technologies. The pioneering work of Baird [1] discusses several document image defect
models, which approximate some aspects of quality deterioration in printing and
scanning, such as jitter, blurring, and physical deformation. The recent paper by
Cheriet and Moghaddam [2] is an important attempt to model degradation in
documents. They propose a classification of degradation into two distinct classes: those
that have an external source and those originating in the document itself. Besides that
classification, they model back-to-front interference [3], a noise that falls in the latter
class of degradation and that was first described in the literature in reference [4].
This paper presents an ontology or taxonomy that is more general than [1, 2] and that,
besides providing an explanation of how such noise appeared in the final image, may
provide pointers to the literature that shows ways of avoiding or removing it. Noise is
defined here as any phenomenon that degrades document information. In the
classification proposed here, there are four kinds of noise:
1. The physical noise – whatever “damages” the physical integrity and
readability of the original information of a document. The physical noise
may be further split into the two sub-categories proposed in [2] as internal
and external.
2. The digitalization noise – the noise introduced by the digitalization process.
Several problems may be clustered in this group such as: inadequate
digitalization resolution, unsuitable palette, framing noises, skew and
orientation, lens distortion, geometrical warping, out-of-focus digitized
images, motion noises.
3. The filtering noise – unsuitable manipulation of the digital file may degrade
the information that exists in the digital version of the document (instead of
increasing it). The introduction of colors not originally present in the
document due to arithmetic manipulation or overflow is an example of such
a noise.
4. The storage/transmission noise – the noise that arises either from lossy storage
algorithms or from network transmission. JPEG artifacts are a
typical example of this kind of undesirable interference.
It is true, however, that the idea of “information” in general is vague. Consequently,
defining noise as some sort of loss of information of a document will inherit such
imprecision. Even so, the author believes that the taxonomy proposed herein
helps to understand the phenomena that yield the degradation of the digital document
image. Due to space restrictions, only the physical noises, of the four classes
proposed, are detailed herein. In what follows, they are explained and exemplified,
and guidelines on how to remove or reduce their effects are provided, pointing to
references in the literature that address them, whenever available.
2. The physical noises
Paper documents are very sensitive to degradation of their integrity. In this paper,
such degradation is seen as a physical noise that damages the original information.
For instance, paper aging is one of those physical noises that has an impact on OCR
transcription [2]; thus it reduces the original information acquired from the document.
A list of physical noises includes:
1. Folding marks
2. Paper translucency
3. Paper aging
4. Paper texture
5. Paper punching
6. Stains
7. Torn-off regions
8. Worm holes
9. Readers’ annotations
10. Physical blur
11. Carbon copy effect
12. Scratches and cracks
13. Sunburn
14. Inadequate printing
In what follows, a brief description of the characterization of such noises and how to
overcome their effects is presented.
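The list above, together with the internal/external label that each of the following subsections assigns to its noise, can be summarized as a small lookup table. The sketch below is only an illustration: the identifier names and the helper function are hypothetical, and paper translucency is split into its two sub-cases because they receive different labels.

```python
from enum import Enum


class Source(Enum):
    """Origin of a physical noise, following the internal/external split of [2]."""
    INTERNAL = "internal"
    EXTERNAL = "external"


# Labels as stated in Sections 2.1 to 2.14; the key names are illustrative only.
PHYSICAL_NOISES = {
    "folding_marks": Source.EXTERNAL,
    "translucency_one_sided": Source.EXTERNAL,
    "back_to_front_interference": Source.INTERNAL,
    "paper_aging": Source.INTERNAL,
    "paper_texture": Source.INTERNAL,
    "filing_and_staple_punching": Source.EXTERNAL,
    "stains": Source.EXTERNAL,
    "torn_off_regions": Source.EXTERNAL,
    "worm_holes": Source.EXTERNAL,
    "readers_annotations": Source.EXTERNAL,
    "physical_blur": Source.EXTERNAL,
    "carbon_copy_effect": Source.INTERNAL,
    "scratches_and_cracks": Source.INTERNAL,
    "sunburn": Source.EXTERNAL,
    "inadequate_printing": Source.INTERNAL,
}


def noises_by_source(source):
    """Return the sorted names of the physical noises with the given source."""
    return sorted(name for name, src in PHYSICAL_NOISES.items() if src is source)
```

Calling noises_by_source(Source.INTERNAL) lists the noises that originate in the document itself, which may be handy when deciding whether better handling or only image processing can address a given noise.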
2.1. Folding Marks
Very frequently documents are folded to fit envelopes, to be stored, etc. Folding
marks on information (printed or handwritten parts of documents) may degrade OCR
response and make image alignment virtually impossible when handling the
images of both sides of a document simultaneously. Folding marks may be considered
an external degradation factor. Figure 1 shows an example of a part of a document
with a folding mark.
Figure 1. Zoom into a folding mark of a historical document from Nabuco's bequest
2.2. Paper translucency
If a document is printed on paper with a low degree of opacity, problems
may arise in the digital version of that document.
2.2.1. One-sided printed documents
If the document is written only on one side of the paper sheet, special
care should be taken to avoid introducing “background” noise in the
digitalization process. For instance, if such a document is digitalized using a
camera, the mechanical support on which the document lies should be white
and opaque (assuming the document is written on white paper with dark
ink). A non-white background will add a non-original color to the
document, which acts as an external degradation.
2.2.2. Back-to-front interference
If a document is written on both sides of translucent paper, an internal
degradation, first described in the literature in reference [4], appears. That
noise is known as back-to-front interference, bleeding or show-through. The
human eye is able to “filter out” the interference in the true-color version of
documents. However, it degrades OCR response [2], and straightforward
binarization yields unreadable documents, in general. Figure 2 presents an
example of a document with back-to-front interference and its binary
counterpart.
Figure 2. Document with back-to-front interference and its binary version.
Several algorithms in the literature address the removal of such noise in the
digital image of documents. They range from automatic to semi-automatic,
global to local thresholding, watermarking to wavelet-based algorithms, etc.
One of the techniques suggested in the literature is a mirror-filtering of the
images of both sides simultaneously [5]. Folding marks make the adoption of
such a solution unsuitable, as it becomes impossible to perfectly align the two
images. No algorithm is an all-case winner. Reference [6] presents a
comparative analysis of some of the most used techniques, together with
some criteria for choosing among them based on the strength of the
interference, inferred from the percentage of black pixels in the binary version of
the document.
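As a rough illustration of the quantity mentioned above, the percentage of black pixels in a binarized page can be computed as in the following sketch. It is not the method of reference [6]: the fixed global threshold, the function name and the toy page are all assumptions made only for this example.

```python
import numpy as np


def black_pixel_percentage(gray, threshold=128):
    """Percentage of pixels darker than `threshold` in a grayscale image.

    `gray` is a 2-D array with values in 0..255. The fixed global
    threshold is an assumption made for this sketch only.
    """
    gray = np.asarray(gray)
    binary_black = gray < threshold  # True where the binarized pixel is black
    return 100.0 * binary_black.mean()


# Toy page: light background (230), dark front-side text (20) and a
# mid-gray band (110) standing in for ink showing through from the back.
page = np.full((100, 100), 230, dtype=np.uint8)
page[10:20, :] = 20    # front-side text covers 10% of the page
page[50:80, :] = 110   # show-through covers 30% of the page

with_interference = black_pixel_percentage(page)      # show-through counted as black
text_only = black_pixel_percentage(page, threshold=64)  # stricter threshold
```

In this toy page, the default threshold counts the show-through band as black, while the stricter one keeps only the front-side text; the gap between the two values gives a crude feeling for how strong the interference is.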
2.3. Paper aging
Paper surface tends to become yellow with age, the older the darker, as may be
observed in the image of the document presented on the left-hand side of Figure 1. In
the case of historical documents, paper aging is considered of iconographic value; thus
it should be preserved in the true-color version of the document. Paper aging may be
considered an internal degradation to OCR response [2], as the contrast between the
printed part and its surroundings tends to decrease. In the case of non-historical
documents, paper aging is considered an undesirable artifact. In the case of
documents written only on one side of the sheet of paper, gamma correction may
work effectively. To the best of the knowledge of the author of this paper, there is no
reference in the literature that addresses this problem in a systematic and focused way
that allows its automatic removal.
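For reference, gamma correction maps each normalized intensity v in [0, 1] to v raised to the power gamma. The sketch below applies it to an 8-bit grayscale image; the function name is hypothetical, and a suitable value of gamma is assumed to be chosen per document, since the text does not prescribe one.

```python
import numpy as np


def gamma_correct(gray, gamma):
    """Apply out = 255 * (in / 255) ** gamma to an 8-bit grayscale image."""
    normalized = np.asarray(gray, dtype=np.float64) / 255.0
    corrected = 255.0 * normalized ** gamma
    return np.round(corrected).astype(np.uint8)


# Toy example: an aged background (180) with faded ink (90).
aged = np.array([[180, 180, 90],
                 [180, 90, 180]], dtype=np.uint8)

brighter = gamma_correct(aged, 0.5)  # gamma < 1 lightens mid-tones
darker = gamma_correct(aged, 2.0)    # gamma > 1 darkens mid-tones
```

Pure black (0) and pure white (255) are left unchanged for any gamma, so the transformation only redistributes the intermediate tones where the aged background and the ink lie.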
2.4. Paper texture
In the not too distant past, it was not unusual to print, especially photos, on textured
paper. A kind of beehive texture was in widespread use from the 1940s to the
1960s for photo printing, for instance. The digitalization of such documents gives rise
to a texture noise that degrades image recognition. An example of such internal noise
may be found in Figure 3, where a zoom into the paper texture is also shown on its
right-hand side. It seems that this problem has not been addressed in the technical
literature so far.
Figure 3. Part of photo with texture noise at left-hand side and zoom.
2.5. Filing and staple punching
Very often bureaucratic documents are filed. In general, the filing process is done
by punching two or more holes in the left margin of the document so that it remains
attached to the file. This process may be seen as an external degradation factor to the
document [2]. The digitalization of such documents generally carries the “memory” of
this filing process, as the image of the holes appears in the document image. The regular
shape of such holes, which are around 5 mm in diameter and are placed at standard
distances apart, allows their automatic removal. Sometimes, careless filing punches
through document information, damaging its content.
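The regularity of filing holes can be exploited only once their physical size is translated into pixels, which depends on the digitalization resolution. The sketch below shows that arithmetic and a simple size check; it is not the algorithm of reference [7]. The function names and the 20% tolerance are assumptions, and the bounding boxes are assumed to come from some earlier blob-detection step that is not shown.

```python
MM_PER_INCH = 25.4


def mm_to_pixels(length_mm, dpi):
    """Convert a physical length in millimetres to pixels at a given resolution."""
    return length_mm / MM_PER_INCH * dpi


def looks_like_filing_hole(width_px, height_px, dpi, diameter_mm=5.0, tolerance=0.2):
    """Check whether a blob's bounding box matches a round hole of the expected size.

    `tolerance` is the accepted relative deviation; 0.2 is an arbitrary choice.
    """
    expected = mm_to_pixels(diameter_mm, dpi)
    return all(abs(side - expected) <= tolerance * expected
               for side in (width_px, height_px))


# At 300 dpi a 5 mm hole spans roughly 59 pixels.
hole_size_at_300_dpi = mm_to_pixels(5.0, 300)
```

A blob of 60 by 58 pixels at 300 dpi would pass this check, whereas a small speck or an elongated stroke would not.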
Similarly to the filing punches, one also often finds staple punches, which tend to be
found in the border of documents, often on the same side as the file punches. Staple
punches also appear in pairs, at a regular distance from each other.
Figure 4. Document with file punches and staple holes.
Figure 4 presents an example of a document image where the filing punches may be
found in its bottom and several staple punches in its left-bottom part. The removal of
staple punches with a salt-and-pepper filter is not adequate, as the size of such
punches may be larger than the artifacts removed by that filter. Reference [7] presents
a solution to remove filing punches in monochromatic documents. The solution may
serve as an intermediate step for true-color or grayscale documents. The literature
presents no algorithm to remove staple noises so far.
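The size argument above can be checked on a toy image. The sketch below uses a plain 3x3 median filter as a stand-in for a salt-and-pepper filter (an assumption, since the text does not name a specific filter) and a 5x5 black square as a stand-in for a staple punch.

```python
import numpy as np


def median_filter_3x3(binary):
    """3x3 median filter for a 2-D array of 0/1 values (edge pixels replicated)."""
    padded = np.pad(binary, 1, mode="edge")
    rows, cols = binary.shape
    # Stack the nine shifted views that make up each pixel's 3x3 neighbourhood.
    windows = np.stack([padded[r:r + rows, c:c + cols]
                        for r in range(3) for c in range(3)])
    return np.median(windows, axis=0).astype(binary.dtype)


page = np.zeros((20, 20), dtype=np.uint8)  # 0 = white paper, 1 = black
page[3, 3] = 1                             # isolated "pepper" pixel
page[10:15, 10:15] = 1                     # 5x5 blob standing in for a staple punch

filtered = median_filter_3x3(page)
pepper_removed = filtered[3, 3] == 0
blob_pixels_left = int(filtered[10:15, 10:15].sum())
```

The isolated pixel disappears, but only the four corners of the blob are eroded, so most of the punch mark survives the filter.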
2.6. Stains
The manipulation of “real-world” documents provides several opportunities for this
external noise to appear. Unfortunately, there is no automatic way either to detect or
to remove it, as it may permanently damage document information depending on
several aspects, such as whether the stain reaches information areas, how “strong” the
stain is in relation to the printed part, etc.
2.7. Torn-off regions
Intensive document handling often damages the integrity of a document, causing a
permanent external degradation. Very often filing holes expand to the margins, tearing
pieces of the document apart. The possibility of automatic recovery of torn-off
regions in the document image depends on a number of factors, such as where the
lost piece lies, whether the tear reached the content of the document, and whether the
size of the undamaged document is known. Figure 5 presents part of a document in which
there is a torn-off region that reaches document information. As one may observe, on
the top part of that image there is a black area, which corresponds to the digitalization
noise that encompasses the torn-off region. The removal of such noise should take into
account the possibility of the torn-off region reaching document information, so as not to
remove further document content. Such care is observed in the algorithms presented
in references [16] and [17].
Figure 5. Torn-off document due to unsuitable filing and handling
2.8. Worm holes
Paper is one of the favorite meals of termites and their relatives. They dig tunnels
of a very particular shape in paper, at random positions in documents, that in general
span several pages. Figure 6 shows an image of part of a document in which the
top corners exhibit worm holes in the margin and content areas. Although worm holes
are easily recognized by humans, their automatic detection, which could be used to
help OCR transcription, for instance, is still distant. There is no report in the literature
that addresses such external noise in documents.
Figure 6. Two pages of a scanned book with worm holes at their external corners
2.9. Reader's Annotations and Highlighting
Very frequently readers make annotations and highlight sentences in documents, for
different reasons. Except in very rare cases, such as Fermat's annotations in the
margins of the Arithmetica of Diophantus of Alexandria, those annotations add little or
nothing to the information of the document per se. The document presented in Figure
1 exhibits an example of such noise. Figure 7 zooms into the penciled mark made by
historians on the document to file it. That mark is of restricted interest and ought to be
removed from the document image.
Figure 7. Zoom into reader's annotations in the document shown in Figure 1.
The automatic removal of such external noise is possible, but depends on how easy
it is to distinguish between the information and the annotations. In the case of the
document shown in Figure 4, noise removal will leave blank spaces that
need to be filled in such a way that the document looks “natural”. Several methods
of annotation extraction have been proposed in the literature [8, 9, 10, 11]. These
methods have achieved good responses by limiting either the colors or the types of
annotations. A more general solution for typed or printed documents is offered by
[12] through the layout analysis of the document, as the printed part shows a more
“regular” layout pattern. Underline removal is the focus of reference [13].
Figure 8. Document with highlighting in yellow
Document highlighting is something that appeared in the last 40 years. The author of
this paper found no reference in the literature on highlighting removal in document
images. Figure 8 shows part of a document with words highlighted in yellow.
2.10. Physical Blur
Most often, one associates blur either with a digitalization noise due to out-of-focus
image acquisition or with a filtering noise resulting from inadequate image
manipulation. Although less frequent than those two other sources of blur, physical
causes may also produce the same effect. For instance, if a document printed using
water-soluble ink, such as in an ink-jet printer, is exposed to excessive humidity, it
yields a document in which region definitions become smoother, looking blurred.
Figure 9. Blurred document image due to “washing”
Figure 9 zooms into part of a document that was originally printed on an ink-jet
printer and was later washed, causing a blurred effect in the original image. The ways to
remove the physical blur noise are possibly similar to the compensation of the other
kinds of blur, and depend on how strong the noise is and on its degradation power over the
original information. The physical blur is in the category of external physical noise.
2.11. Carbon copy effect
Several legacy documents in historical bequests are not the original ones, but
carbon copies: very often the original document was sent
away and the author kept a carbon copy of it. Such documents bring an extra degree
of difficulty for automatic transcription or even keyword spotting, as they exhibit a
very particular kind of blur, as may be observed in the document shown in Figure 10.
Figure 10. Transport list of WWII prisoners between concentration camps.
To the best of the knowledge of the author of this paper, the compensation of the carbon
copy noise, an internal noise, is still an open research topic not yet addressed in the
literature.
2.12. Scratches and cracks
Scratches and cracks have similar effects in documents. They are a sort of internal
physical noise that is very difficult to classify automatically in images of printed
documents, although they may appear more often than one expects in all sorts of
printed material, especially in glossy documents such as photos and posters. Scratches
tend to be made by some external action, while cracks tend to appear due to aging of
the physical medium. The recent paper by Bruni, Ferrara and Vitulano [14] analyzes
color scratches in the context of digital film restoration by comparing a sequence of
frames. Wirth and Bobier [15] suggest a fuzzy truncated-median filter to remove
cracks in old photos.
2.13. Sunburn
The first general use of thermal printers was in fax machines. They are still in
use today in all sorts of devices, from Automatic Teller Machines and Cash
Dispensers to boarding pass printers. They provide a simple and cheap technological
solution for printing documents that are supposed to be short-lived, without any need for
toner or ink cartridges. The left-hand side of Figure 11 presents part of a boarding pass
printed on thermal paper.
Figure 11. (left) Document printed in thermal paper.
(right) Same document from the left exposed a few minutes to direct sunlight.
If kept in ideal conditions, the document is supposed to last up to 3 or 4 years. After
that time the printed part tends to fade. However, sometimes the document needs to
be kept longer or is stored in non-ideal situations, such as under direct sunlight or in
warm places. In this case, as shown in the right-hand side of Figure 11, the paper
background becomes darker. To the best of the knowledge of the author of this paper, the
sunburn noise is an external noise that has not been previously described in the
literature; its effect may not be hard to model, but it is quite hard to compensate
automatically.
2.14. Inadequate printing
Although the taxonomy proposed herein encompasses the problems found in paper
documents assuming they were correctly printed, there is a large number of printing
problems that may lead to unsatisfactory results and that may also be classified as
belonging to the class of internal physical noises in documents. The inadequate
printing noise ranges from incorrect printer set-up (paper quality, document palette,
draft/economical/final mode, printing head alignment, bad quality or low toner/ink
cartridge, paper jam, incorrect feeding, etc.) to paper humidity, old thermal paper, etc.
Compensating or correcting such noises may be extremely hard in many cases and
should only be attempted in those cases in which there is no chance of adequately
re-printing the document.
3. Conclusions
This paper presents a general taxonomy for noise in paper document images and
discusses methods to address them, whenever possible, in the case of physical noises.
Besides that, it points out several lines for further work in the area, as some of the noises
described have not been addressed in the technical literature so far.
Sometimes it is difficult to distinguish some of the noises in the proposed
taxonomy. For instance, it may be impossible to distinguish a sheet of paper that was
completely stained by English breakfast tea from an old sheet of paper. However, this
may be the case even for physical documents, as they may be forged. Taxonomies in
general may suffer from this kind of problem. A more detailed version of the
taxonomy proposed herein may be found in reference [18].
Acknowledgements
The author is grateful to Steve Simske (HP Labs. US) and to Serene Banerjee (HP
Labs. India) for their comments on a previous version of this paper. Thanks also to
Josep Lladós (CVC, Universitat Autónoma de Barcelona) for providing the image of
Figure 8.
The research reported herein was partly sponsored by CNPq (Brazil) and the HP-UFPE
TechDoc Project.
4. References
[1] H.S.Baird. Document image defect models and their uses. ICDAR 1993, pp. 62-67, 1993.
[2] M.Cheriet and R.F.Moghaddam. DIAR: Advances in Degradation Modeling and
Processing, ICIAR 2008, LNCS(5112):1-10, Springer Verlag, 2008.
[3] J. da Silva; et al. “A New and Efficient Algorithm to Binarize Document Images
Removing Back-to-Front Interference”. Journal of Universal Computer Science,
v(14):299-313, 2008.
[4] R. D. Lins, et al. “An Environment for Processing Images of Historical Documents”.
Microprocessing & Microprogramming, v(40):939-942, N-Holland, 1993.
[5] G.Sharma, “Show-through cancellation in scans of duplex printed documents”, IEEE
Transactions on Image Processing, v10(5):736-754, 2001.
[6] R.D.Lins, et al. “Detailing a Quantitative Method for Assessing Algorithms to Remove
Back-to-Front Interference in Documents”. Journal of Universal Computer Science, v. 14,
p. 266-283, 2008.
[7] G.Meng et al. Circular Noises Removal from Scanned Document Images. ICDAR 2007,
pp. 183-187, IEEE Press, 2007.
[8] D.Möri and H.Bunke. Automatic interpretation and execution of manual corrections on
text documents. Handbook of Character Recognition and Document Image Analysis, pp
679-702. World Scientific, 1997.
[9] J.Stevens, A.Gee, C.Dance. Automatic processing of document annotations. British
Machine Vision Conference, v(2): 438-448, 1998.
[10] J.K.Guo and M.Y.Ma. Separating handwritten material from machine printed text using
hidden markov models. ICDAR 2001, pp.436-443, 2001.
[11] Y.Zheng, H.Li, and D.Doermann. The segmentation and identification of handwriting in
noisy document images. DAS02, LNCS 2423, pp.95-105, Springer Verlag, 2002.
[12] T.Nakai, K.Kise, M.Iwamura. A method of annotation extraction from paper documents
using alignment based on local etc, ICDAR 2007, pp.23-27, IEEE Press, 2007.
[13] J.R.Caldas Pinto et al. Underline Removal on Old Documents. ICIAR 2004, LNCS(3212),
v(2):226-233, 2004.
[14] V.Bruni, P.Ferrara, and D. Vitulano. Color Scratches Removal Using Human Perception,
LNCS(5112):33-42, Springer Verlag, 2008.
[15] M.Wirth and B.Bobier. Suppression of Noise in Historical Photographs Using a Fuzzy
Truncated-Median Filter. ICIAR 2007, LNCS(4633):1206-1216, Springer Verlag, 2007.
[16] B.T.Ávila and R.D.Lins, A New Algorithm for Removing Noisy Borders from
Monochromatic Documents, ACM-SAC’2004, pp 1219-1225, ACM Press, March, 2004.
[17] B.T.Ávila and R.D.Lins, Efficient Removal of Noisy Borders from Monochromatic
Documents, LNCS(3212):249-256, Springer Verlag, 2004.
[18] R.D.Lins. A Global Taxonomy for Noises in Paper Documents, in preparation.