A Taxonomy for Noise in Images of Paper Documents
Part I - The Physical Noises

Rafael Dueire Lins
Universidade Federal de Pernambuco, Recife - PE, Brazil
[email protected]

Abstract. A taxonomy encompasses not only a classification methodology, but also an explanatory theory of the phenomena that justify such classification. This paper introduces a taxonomy for noise in images of paper documents and details the Physical Noises according to the proposed taxonomy.

Keywords: noise, document images, paper documents, digital libraries.

1. Introduction

Legacy paper documents are being digitalized at a very fast pace to integrate them with digital documents, bridging the gap between past and present technologies. The pioneering work of Baird [1] discusses several document image defect models, which approximate some aspects of quality deterioration in printing and scanning, such as jitter, blurring, and physical deformation. The recent paper by Cheriet and Moghaddam [2] is an important attempt to model degradation in documents. They classify degradation into two distinct classes: those that have an external source and those originating in the document itself. Besides that classification, they model back-to-front interference [3], a noise that falls in the latter class and that was first described in the literature in reference [4].

This paper presents an ontology or taxonomy that is more general than those of [1, 2] and that, besides explaining how a given noise appeared in the final image, may provide pointers to the literature that show ways of avoiding or removing it. Noise is defined here as any phenomenon that degrades document information. In the classification proposed here, there are four kinds of noise:

1. The physical noise – whatever “damages” the physical integrity and readability of the original information of a document.
The physical noise may be further split into the two sub-categories proposed in [2]: internal and external.
2. The digitalization noise – the noise introduced by the digitalization process. Several problems may be clustered in this group, such as inadequate digitalization resolution, unsuitable palette, framing noises, skew and orientation, lens distortion, geometrical warping, out-of-focus digitized images, and motion noises.
3. The filtering noise – unsuitable manipulation of the digital file may degrade the information that exists in the digital version of the document (instead of increasing it). The introduction of colors not originally present in the document due to arithmetic manipulation or overflow is an example of such a noise.
4. The storage/transmission noise – the noise that arises either from lossy storage algorithms or from network transmission. JPEG artifacts are a typical example of this kind of undesirable interference.

It is true, however, that the idea of “information” in general is vague. Consequently, defining noise as some sort of loss of information of a document inherits that imprecision. Even so, the author believes that the taxonomy proposed herein helps to understand the phenomena that degrade the digital document image. In what follows, each of those four highest-level kinds of noise is explained and exemplified. Guidelines on how to remove or minimize their effects are provided, pointing to references in the literature that addressed them, whenever available. Due to space restrictions, only the physical noises, out of the four classes proposed, are detailed herein.

2. The physical noises

Paper documents are very sensitive to degradation of their integrity. In this paper, such degradation is seen as a physical noise that damages the original information.
For instance, paper aging is one of those physical noises that affects OCR transcription [2]; thus it reduces the original information acquired from the document. A list of physical noises includes:

1. Folding marks
2. Paper translucency
3. Paper aging
4. Paper texture
5. Paper punching
6. Stains
7. Torn-off regions
8. Worm holes
9. Readers' annotations
10. Physical blur
11. Carbon copy effect
12. Scratches and cracks
13. Sunburn
14. Inadequate printing

In what follows, each of these noises is briefly characterized, together with ways to overcome its effects.

2.1. Folding Marks

Very frequently documents are folded to fit envelopes, to be stored, etc. Folding marks on information (printed or handwritten parts of documents) may degrade OCR response and make image alignment virtually impossible when the images of both sides of a document are handled simultaneously. Folding marks may be considered an external degradation factor. Figure 1 shows an example of a part of a document with a folding mark.

Figure 1. Zoom into a folding mark of a historical document from Nabuco's bequest

2.2. Paper translucency

If a document is printed on paper with a low degree of opacity, problems may arise in the digital version of that document.

2.2.1. One-sided printed documents

If the document is written on only one side of the paper sheet, special care should be taken to avoid introducing “background” noise in the digitalization process. For instance, if such a document is digitalized using a camera, the mechanical support on which the document lies should be white and opaque (assuming the document is written on white paper with dark ink). A non-white background will add a non-original color to the document, which acts as an external degradation.

2.2.2.
Back-to-front interference

If a document is written on both sides of translucent paper, an internal degradation appears, first described in the literature in reference [4]. That noise is known as back-to-front interference, bleeding, or show-through. The human eye is able to “filter out” the interference in the true-color version of documents. However, it degrades OCR response [2], and straightforward binarization yields unreadable documents, in general. Figure 2 presents an example of a document with back-to-front interference and its binary counterpart.

Figure 2. Document with back-to-front interference and its binary version.

Several algorithms in the literature address the removal of such noise in the digital image of documents. They range from automatic to semi-automatic, global to local thresholding, watermarking to wavelet-based algorithms, etc. One of the techniques suggested in the literature is mirror-filtering of the images of both sides simultaneously [5]. Folding marks make that solution unsuitable, as it becomes impossible to align the two images perfectly. No algorithm is an all-case winner. Reference [6] presents a comparative analysis of some of the most used techniques, together with criteria for choosing among them based on the strength of the interference, inferred from the percentage of black pixels in the binary version of the document.

2.3. Paper aging

Paper surface tends to become yellow with age, the older the darker, as may be observed in the image of the document presented on the left-hand side of Figure 1. In the case of historical documents, paper aging is considered of iconographic value; thus it should be preserved in the true-color version of the document. Paper aging may be considered an internal degradation to OCR response [2], as the contrast between the printed part and its surroundings tends to decrease.
In the case of non-historical documents, paper aging is considered an undesirable artifact. In the case of documents written on only one side of the sheet of paper, gamma correction may work effectively. To the best of the author's knowledge, there is no reference in the literature that addresses this problem in a systematic and focused way that allows its automatic removal.

2.4. Paper texture

In the not too distant past, it was not unusual to print, above all photos, on textured paper. A kind of beehive texture was in widespread use from the 1940s to the 1960s for photo printing, for instance. The digitalization of such documents gives rise to a texture noise that degrades image recognition. An example of such internal noise may be found in Figure 3, where a zoom into the paper texture is also shown on the right-hand side. It seems that this problem has not been addressed in the technical literature so far.

Figure 3. Part of a photo with texture noise on the left-hand side and zoom.

2.5. Filing and staple punching

Very often bureaucratic documents are filed. In general, the filing process is done by punching two or more holes in the left margin of the document so that it remains attached to the file. This process may be seen as an external degradation factor to the document [2]. The digitalization of such documents generally carries the “memory” of this filing process, as the image of the holes appears in the document image. The regular shape of such holes, which are around 5 mm in diameter and are placed at standard distances apart, allows their automatic removal. Sometimes, careless filing punches through document information, damaging its content.

Similarly to filing punches, one also often finds staple punches, which tend to be found at the border of documents, often on the same side as the file punches. Staple punches also appear in pairs, a fixed distance apart.

Figure 4. Document with file punches and staple holes.
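
The regularity argument above (holes of roughly 5 mm, hence a predictable size in pixels at a known scanning resolution) can be illustrated with a minimal Python sketch. It is not the algorithm of reference [7]; the function names, the 25% tolerance, and the assumption that connected components (pixel area and bounding box) have already been extracted from a binary image are hypothetical choices made only for illustration.

```python
import math

MM_PER_INCH = 25.4


def expected_hole_diameter_px(dpi, hole_mm=5.0):
    """Convert the nominal punch-hole diameter (about 5 mm in the text) to pixels."""
    return hole_mm / MM_PER_INCH * dpi


def looks_like_filing_hole(area_px, bbox_w, bbox_h, dpi, hole_mm=5.0, tol=0.25):
    """Heuristic test on one connected component of a binary image.

    area_px        : number of foreground pixels in the component
    bbox_w, bbox_h : width and height of its bounding box, in pixels
    tol            : relative tolerance (an assumed value, not taken from the paper)
    """
    if bbox_w <= 0 or bbox_h <= 0:
        return False
    d = expected_hole_diameter_px(dpi, hole_mm)
    # Size check: both bounding-box sides should be close to the expected diameter.
    size_ok = abs(bbox_w - d) <= tol * d and abs(bbox_h - d) <= tol * d
    # Shape check: a filled disc covers pi/4 (about 0.785) of its bounding square.
    fill = area_px / (bbox_w * bbox_h)
    shape_ok = abs(fill - math.pi / 4) <= tol * (math.pi / 4)
    return size_ok and shape_ok
```

For example, at 200 dpi a 5 mm hole spans about 39 pixels, so a 39 x 39 component with about 1195 foreground pixels passes the test, whereas a completely filled square of the same size, or a 3 x 3 speck, does not. The spacing between holes is left out of the sketch because the text gives no value for it.
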
Figure 4 presents an example of a document image in which the filing punches may be found at the bottom and several staple punches in the bottom-left part. Removing staple punches with a salt-and-pepper filter is not adequate, as such punches may be larger than the artifacts removed by that filter. Reference [7] presents a solution to remove filing punches in monochromatic documents. The solution may serve as an intermediate step for true-color or grayscale documents. The literature presents no algorithm to remove staple noises so far.

2.6. Stains

The manipulation of “real-world” documents provides several opportunities for this external noise to appear. Unfortunately, there is no automatic way either to detect or to remove it, as it may permanently damage document information, depending on several aspects such as whether the stain reaches information areas and how “strong” the stain is in relation to the printed part.

2.7. Torn-off regions

Intensive document handling often damages its integrity, causing a permanent external degradation. Very often filing holes expand to the margins, tearing pieces of the document apart. The possibility of automatically recovering torn-off regions in the document image depends on a number of factors, such as where the lost piece lies, whether it reached the content of the document, and whether the size of the undamaged document is known. Figure 5 presents part of a document in which a torn-off region reaches document information. As one may observe, at the top of that image there is a black area, which corresponds to the digitalization noise that encompasses the torn-off region. The removal of such noise should take into account the possibility of the torn-off region reaching document information, so as not to remove further document content; such care is observed in the algorithms presented in references [16] and [17].

Figure 5.
Torn-off document due to unsuitable filing and handling

2.8. Worm holes

Paper is one of the favorite meals of termites and their relatives. They dig tunnels of a very particular shape in paper, at random positions in documents, that in general run through several pages. Figure 6 shows an image of part of a document in which the top corners exhibit worm holes in the margin and content areas. Although worm holes are easily recognized by humans, their automatic detection is still a distant goal; it could be used to help OCR transcription, for instance. There is no report in the literature that addresses such external noise in documents.

Figure 6. Two pages of a scanned book with worm holes at their external corners

2.9. Reader's Annotations and Highlighting

Very frequently readers make annotations and highlight sentences in documents, for different reasons. Except in very rare cases, such as Fermat's annotations in the margins of the Arithmetica of Diophantus of Alexandria, those annotations add little or nothing to the information of the document per se. The document presented in Figure 1 exhibits an example of such noise. Figure 7 zooms into the penciled mark made by historians on the document to file it. That mark is of restricted interest and ought to be removed from the document image.

Figure 7. Zoom into reader's annotations in the document shown in Figure 1.

The automatic removal of such external noise is possible, but depends on how easy it is to distinguish between the information and the annotations. In the case of the document shown in Figure 4, noise removal will leave blank spaces that need to be filled in such a way that the document looks “natural”. Several methods of annotation extraction have been proposed in the literature [8, 9, 10, 11]. These methods have achieved good results by limiting either the colors or the types of annotations.
A more general solution for typed or printed documents is offered by [12] through the layout analysis of the document, as the printed part shows a more “regular” layout pattern. Underline removal is the focus of reference [13].

Figure 8. Document with highlighting in yellow

Document highlighting is something that appeared in the last 40 years. The author of this paper found no reference in the literature on highlighting removal in document images. Figure 8 shows part of a document with words highlighted in yellow.

2.10. Physical Blur

Most often, one associates blur either with a digitalization noise due to out-of-focus image acquisition or with a filtering noise resulting from inadequate image manipulation. Although less frequent than these two sources of blur, physical causes may make images exhibit the same effect. For instance, if a document printed using water-soluble ink, such as in an ink-jet printer, is exposed to excessive humidity, it yields a document in which region boundaries become smoother, looking blurred.

Figure 9. Blurred document image due to “washing”

Figure 9 zooms into part of a document that was originally printed on an ink-jet printer and was later washed, causing a blurred effect in the original image. The ways to remove the physical blur noise are possibly similar to those used to compensate for the other kinds of blur, and depend on how strong the noise is and on how much it degrades the original information. The physical blur is in the category of external physical noise.

2.11. Carbon copy effect

Several legacy documents in historical bequests are not the originals, but carbon copies: very often the original document was sent away and the author kept a carbon copy of it. Such documents bring an extra degree of difficulty for automatic transcription or even keyword spotting, as they exhibit a very particular kind of blur, as may be observed in the document shown in Figure 10.
Figure 10. Transport list of WWII prisoners between concentration camps.

To the best of the author's knowledge, the compensation of the carbon copy noise, an internal noise, is still an original research topic not yet addressed in the literature.

2.12. Scratches and cracks

Scratches and cracks have similar effects in documents. They are a sort of internal physical noise that is very difficult to classify automatically in images of printed documents, although they may appear more often than one expects in all sorts of printed material, above all in glossy documents such as photos and posters. Scratches tend to be made by some external action, while cracks tend to appear due to aging of the physical medium. The recent paper by Bruni, Ferrara and Vitulano [14] analyzes color scratches in the context of digital film restoration by comparing a sequence of frames. Wirth and Bobier [15] suggest a fuzzy truncated-median filter to remove cracks in old photos.

2.13. Sunburn

The first general use of thermal printers was in fax machines. They are still in use today in all sorts of devices, from automatic teller machines and cash dispensers to boarding pass printers. They provide a simple and cheap technological solution for printing documents that are supposed to be short-lived, without any need for toner or ink cartridges. The left-hand side of Figure 11 presents part of a boarding pass printed on thermal paper.

Figure 11. (left) Document printed on thermal paper. (right) The same document after a few minutes of exposure to direct sunlight.

If kept in ideal conditions, the document is supposed to last up to 3 or 4 years. After that time the printed part tends to fade. However, sometimes the document needs to be kept longer or is stored in non-ideal conditions, such as under direct sunlight or in warm places. In this case, as shown on the right-hand side of Figure 11, the paper background becomes darker.
To the best of the author's knowledge, the sunburn noise is an external noise that has not been previously described in the literature; its effect may not be hard to model, but is quite hard to compensate automatically.

2.14. Inadequate printing

Although the taxonomy proposed herein encompasses the problems found in paper documents assuming they were correctly printed, there is a large number of printing problems that may lead to unsatisfactory results and that may also be classified as internal physical noises in documents. The inadequate printing noises range from incorrect printer set-up (paper quality, document palette, draft/economical/final mode, printing head alignment, bad-quality or low toner/ink cartridge, paper jam, incorrect feeding, etc.) to paper humidity, old thermal paper, etc. Compensating for or correcting such noises may be extremely hard in many cases and should only be attempted when there is no chance of adequately re-printing the document.

4. Conclusions

This paper presents a general taxonomy for noise in paper document images and, in the case of physical noises, discusses methods to address them whenever possible. Besides that, it points to several lines for further work in the area, as some of the noises described have not been addressed in the technical literature so far.

Sometimes it is difficult to distinguish between some of the noises in the proposed taxonomy. For instance, it may be impossible to distinguish a sheet of paper that was completely stained by English breakfast tea from an old sheet of paper. However, this may be the case even for physical documents, as they may be forged. Taxonomies in general may suffer from this kind of problem. A more detailed version of the taxonomy proposed herein may be found in reference [18].

Acknowledgements

The author is grateful to Steve Simske (HP Labs, US) and to Serene Banerjee (HP Labs, India) for their comments on a previous version of this paper.
Thanks also to Josep Lladós (CVC, Universitat Autónoma de Barcelona) for providing the image of Figure 8. The research reported herein was partly sponsored by CNPq (Brazil) and the HP-UFPE TechDoc Project.

5. References

[1] H.S. Baird. Document image defect models and their uses. ICDAR 1993, pp. 62-67, 1993.
[2] M. Cheriet and R.F. Moghaddam. DIAR: Advances in Degradation Modeling and Processing. ICIAR 2008, LNCS(5112):1-10, Springer Verlag, 2008.
[3] J. da Silva, et al. A New and Efficient Algorithm to Binarize Document Images Removing Back-to-Front Interference. Journal of Universal Computer Science, v(14):299-313, 2008.
[4] R.D. Lins, et al. An Environment for Processing Images of Historical Documents. Microprocessing & Microprogramming, v(40):939-942, North-Holland, 1993.
[5] G. Sharma. Show-through cancellation in scans of duplex printed documents. IEEE Transactions on Image Processing, v10(5):736-754, 2001.
[6] R.D. Lins, et al. Detailing a Quantitative Method for Assessing Algorithms to Remove Back-to-Front Interference in Documents. Journal of Universal Computer Science, v(14):266-283, 2008.
[7] G. Meng, et al. Circular Noises Removal from Scanned Document Images. ICDAR 2007, pp. 183-187, IEEE Press, 2007.
[8] D. Möri and H. Bunke. Automatic interpretation and execution of manual corrections on text documents. Handbook of Character Recognition and Document Image Analysis, pp. 679-702, World Scientific, 1997.
[9] J. Stevens, A. Gee, C. Dance. Automatic processing of document annotations. British Machine Vision Conference, v(2):438-448, 1998.
[10] J.K. Guo and M.Y. Ma. Separating handwritten material from machine printed text using hidden Markov models. ICDAR 2001, pp. 436-443, 2001.
[11] Y. Zheng, H. Li, and D. Doermann. The segmentation and identification of handwriting in noisy document images. DAS02, LNCS 2423, pp. 95-105, Springer Verlag, 2002.
[12] T. Nakai, K. Kise, M. Iwamura.
A method of annotation extraction from paper documents using alignment based on local etc, ICDAR 2007, pp. 23-27, IEEE Press, 2007.
[13] J.R. Caldas Pinto, et al. Underline Removal on Old Documents. ICIAR 2004, LNCS(3212), v(2):226-233, 2004.
[14] V. Bruni, P. Ferrara, and D. Vitulano. Color Scratches Removal Using Human Perception. LNCS(5112):33-42, Springer Verlag, 2008.
[15] M. Wirth and B. Bobier. Suppression of Noise in Historical Photographs Using a Fuzzy Truncated-Median Filter. ICIAR 2007, LNCS(4633):1206-1216, Springer Verlag, 2007.
[16] B.T. Ávila and R.D. Lins. A New Algorithm for Removing Noisy Borders from Monochromatic Documents. ACM-SAC'2004, pp. 1219-1225, ACM Press, March 2004.
[17] B.T. Ávila and R.D. Lins. Efficient Removal of Noisy Borders from Monochromatic Documents. LNCS(3212):249-256, Springer Verlag, 2004.
[18] R.D. Lins. A Global Taxonomy for Noises in Paper Documents, in preparation.