A Proposition of Retrieval Tools for Historical Document Image Libraries

Nicholas Journet, LI, Tours, France ([email protected])
Rémy Mullot, L3I, La Rochelle, France ([email protected])
Véronique Eglin, LIRIS, INSA de Lyon, Villeurbanne, France ([email protected])
Jean-Yves Ramel, LI, Tours, France ([email protected])

Abstract

In this article we propose a method for characterizing images of old documents based on a texture approach. The characterization is carried out through a multiresolution study of the textures contained in the document images. By extracting five features linked to the frequencies and orientations present in the different parts of a page, it is possible to extract and compare elements of high semantic level without making any hypothesis about the physical or logical structure of the analysed documents. Experiments show the feasibility of building tools that assist navigation and indexing. In these experiments we emphasize the relevance of the texture features and the advance they represent for characterizing the content of a deeply heterogeneous corpus.

1. Introduction

This article addresses the fundamental problem of characterizing the content of images of old documents, considered as an alternative to analysis methods that, until now, have mainly been based on page segmentation and on an interpretation of page structure. The article is organized in two parts. First, it details our proposal for characterizing the content of old document images: with the help of texture features computed at different resolutions, we show that it is possible to characterize image content without making any hypothesis about either the structure or the characteristics of the processed images. In the last part, we show how this texture characterization can be used for content indexing.
2. Our texture approach for the characterization of the content

2.1 Analysis of the content of document images: the texture approach

In the context of a collaboration with the Centre d'Etudes Supérieures de la Renaissance de Tours (http://www.bvh.univ-tours.fr), we had access to more than one hundred digitized works from the 15th and 16th centuries. One strong characteristic of these old documents is the heterogeneity of the available works. The rudimentary techniques and equipment of the period, the deterioration of the documents and the variety of editorial rules are some of the reasons that explain the diversity of old document images. The document images to which we had access span three centuries of printing and history. Complicated layouts (several columns of irregular sizes), specific fonts that are no longer in use, the frequent use of embellishments (illuminations, initial letters, frames), small line spacing, inconsistent spacing between characters and words, and superimposed layers of information (noise, handwritten notes) are just some of the specificities of old documents that make content characterization difficult.

One possibility is to use texture algorithms. For example, the authors of [4] propose a text/picture separation method for Devanagari documents based on horizontal histograms. In [3], the authors classify previously segmented blocks as either drawing or text; the texture measures they extract come from analysing runs of pixels along different angles. In [6], the authors use hidden Markov models to segment images of handwritten documents into labelled regions of interest (text lines, scratches, marginal notes, ...).
This allows them to segment the handwritten drafts of Flaubert, which have the particularity of containing many hatchings and deletions that make classic approaches inefficient. Because texture approaches mainly use (very) low-level information, they free themselves from much of the a priori knowledge required by methods driven exclusively by the data or by a model. Among their other advantages, these tools usually work on grey-level images, so binarisation is not systematically necessary. Note that while these texture tools characterize image content, they do not provide a segmentation into blocks (paragraphs, pictures, titles, ...); that objective can only be reached through post-processing.

2.2 Characterization of the content of the images

We propose a process based on information extracted from an analysis of the textures within the image, without searching for or relying on a priori knowledge of page structure. Five texture features are computed: the first three relate to orientations, the other two to transition frequencies. These features are computed locally at different image resolutions. Using a sliding-window analysis (the window size being the only parameter of our method), every image pixel is associated with meta-data corresponding to the extracted texture attributes. The analysis is carried out at 4 different resolutions, finally yielding 20 numeric values describing every pixel. Once all the pages of a work are analysed, the extracted information is stored in a meta-data base.
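The per-pixel extraction described above can be sketched as follows; this is a minimal illustration, in which `pixel_descriptor`, `stand_ins` and the crude subsampling scheme are our own illustrative choices (the real five features are defined in sections 2.2.1 and 2.2.2, and the paper does not specify its resolution-reduction method):

```python
import numpy as np

def pixel_descriptor(img, i, j, win, feature_fns, n_levels=4):
    """Evaluate each feature function on the window centred at (i, j),
    at n_levels successively halved resolutions, and concatenate the
    results (5 features x 4 levels = 20 values per pixel)."""
    half = win // 2
    values = []
    for level in range(n_levels):
        step = 2 ** level
        sub = img[::step, ::step]              # crude resolution reduction
        ii, jj = i // step, j // step
        patch = sub[max(0, ii - half): ii + half + 1,
                    max(0, jj - half): jj + half + 1]
        values.extend(fn(patch) for fn in feature_fns)
    return np.array(values)

# Five stand-in feature functions, placeholders for the real features.
stand_ins = [np.mean, np.std, np.min, np.max, np.ptp]
img = np.random.default_rng(0).integers(0, 256, (64, 64)).astype(float)
desc = pixel_descriptor(img, 32, 32, win=8, feature_fns=stand_ins)
```

Repeating this at every pixel position yields the meta-data base described above.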
2.2.1 Characterization of the orientations In order to extract texture orientation informations, we chose to use a non-parametric tool based on the autocorrelation function : the rose of directions (proposed by Bres in [2]). The rose of directions is a polar diagram based on the study of the answer of the auto-correlation function when it is applied to a picture. Let (k, l) be the central point of the picture after auto-correlation and the straight line Dorigin be the abscissa axis going through this point. Let θi be the studied orientation, one calculates then the straight line Di such that any couples of points (a, b) respects the following relation : angle formed by the straight line (a, b) and going through Dorigin = θi . The definition of the auto-correlation function for a two-dimensional signal is defined by eq.1. C(k, l) = +∞ X +∞ X x(k 0 , l0 ).x(k 0 + k, l0 + l) (1) k0 =−∞ l0 =−∞ So, a point C(k, l) of the auto-correlation function contains the value of the sum of the products of the grey levels x(k 0 , l0 ) of the points in correspondence after a translation of vector (k, l). At least, for every orientation θi , one calculates the sum of the different values of the auto-correlation function eq.2. X C(a, b) (2) R(θi ) = Di We have then defined pertinent characteristics to describe the content of the document made from this tool. We have then decided to extract 3 signatures wich permits to characterize texture informations relative to the orientations. The first extracted signatures, is the angle matching the main orientation of the rose of directions eq.3. In order not to have to manipulate circular data, this angle is normalized according to the deviation from the horizontal angle. So, at a resolution k, one calculates for every pixel (i, j) of a picture the texture attribute 1. F eature1k (i, j) = |180 − ArgM ax(R(i,j) (θ))| W ith θ ∈ [0, π] (3) The isotropy of the picture is estimated according to the intensity of the auto-correlation function. 
So, at the main orientation found by eq. 3, every pixel (i, j) is characterized with the help of eq. 4:

    Feature2_k(i,j) = R(\mathrm{ArgMax}(R_{(i,j)}(\theta))), with \theta \in [0, \pi]    (4)

The last orientation-related texture feature characterizes the global shape of the rose. For that, one computes the spread of the rose intensities, excluding the orientation of maximal intensity (eq. 5). If this value is high, the rose is deformed and a great number of orientations are present in various proportions:

    Feature3_k(i,j) = \mathrm{STD}(R_{(i,j)}(\theta)), with \theta \in [0, \pi] and \theta \notin \mathrm{ArgMax}(R_{(i,j)}(\theta))    (5)

2.2.2 Characterization of frequencies

For document images, the notion of "frequencies" refers to the transition frequencies between paper and ink. To characterize these transition frequencies, we drew our inspiration from the works [1, 8], whose authors detail how different text styles can be characterized, or text separated from pictures, by studying the properties of grey-level pixel transitions. The first signature characterizes the transition frequencies between ink and paper. For every line of the area covered by the sliding window, one sums the differences between each grey-level pixel and its left neighbour: the higher the sum, the higher the number of transitions on the line. A simple average then yields a signature for the transitions in the studied area (eq. 6):

    Feature4_k(i,j) = \mathrm{Avg}_{i \in I'} \left( \sum_{j \in J'} (p_{ij} - p_{i,j+1}) \right)    (6)

with I' and J' the dimensions of the analysis window and p_{ij} the grey level of the pixel at coordinates (i, j).

The last texture feature is based on an algorithm characterizing the white spaces that separate neighbouring elements. We thus look for a means of obtaining information on the extent of the various background areas that punctuate the pages.
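The transition feature of eq. 6 can be sketched as below. Note one assumption: the signed sum in eq. 6 telescopes to the difference between a row's end pixels, so this sketch takes absolute differences, which actually count transitions:

```python
import numpy as np

def transition_feature(window):
    """Sketch of eq. 6: average, over the rows of the analysis window,
    of the summed grey-level differences between each pixel and its
    left neighbour. Absolute differences are assumed, since a signed
    sum would telescope to the row's end-pixel difference."""
    w = window.astype(float)
    return np.abs(np.diff(w, axis=1)).sum(axis=1).mean()

# Dense ink/paper alternation produces many transitions per row;
# a blank background area produces none.
text_like = np.tile([0.0, 255.0], (8, 8))   # 8 rows of 0,255,0,255,...
background = np.full((8, 16), 255.0)
```

On these toy patches, the text-like patch scores strictly higher than the blank one, which is the separating behaviour the feature is designed for.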
We adopted a recursive approach consisting of 4 iterations of a recursive XY-cut algorithm. At every iteration, the area just analysed is cut into 4 areas of identical size, and for each of them the feature of eq. 7 is computed. This feature is, for every pixel, the average of the sums of grey levels per column and per line:

    Feature5^k = \frac{ \sum_{l \in J'} p^k_{il} + \sum_{h \in I'} p^k_{hj} }{2}    (7)

with I' and J' the dimensions of the analysis window at iteration k of the recursive algorithm.

2.2.3 Discussion about our proposal

In this section, the quality of the content categorization is estimated through a classification of the pixels on the basis of the 20 proposed texture features. Classifying the content elements of a work allows us, first, to verify whether the classification meets the objective of separating the pixels into layers when it is applied to a complete work. Every image pixel carries 20 values, coming from the 5 features computed at 4 different resolutions. Our objective is to group together the pixels corresponding to homogeneous areas, which amounts to grouping feature vectors that are close in the sense of a metric. It is an unsupervised classification problem, for which the point labels needed to build classes are not known a priori. We used a moving-centres classification algorithm, in which only the desired number of classes is specified. Figure 1 shows the kind of result obtained when a complete work is classified.

Figure 1. Pixel classification of a complete work

These tests allowed us, above all, to highlight a genuinely coherent separating power of the extracted features. The main limits of the proposed labelling are located in the transition areas between text and pictures, and in titles containing large isolated characters.
Because of that, a great part of the titles (isolated from the body of the text) are identified as drawings. To measure the relevance of the method, rather than presenting a large quantity of visual results, we propose a simple quantitative estimation: the ability to separate the pixels into 3 classes, text/drawing/background. To that end, we manually captured a ground truth using an application we developed, which allows one to delimit with the mouse the outlines of the drawing areas and the text areas. This ground truth is stored in a file and finally compared with the computed classification. Our tests were run on 400 pages of old documents, extracted from 3 different works. Since we wanted an idea of the relevance of the extracted features, we built a test corpus with contents as varied as possible. The texture features used allow a good separation of the information layers of the documents: the correct-classification rate is 83% for drawing pixels and 92% for text pixels.

3. Towards new applications of content-based information retrieval

3.1 Comparison of pages

The first experiment consists of comparing pages of old document images. It studies whether it is possible, without segmenting and without identifying anything, to compare their content on the basis of texture information. We chose to characterize the pages by the spatial organization of the blocks of text, pictures and background. On the basis of this definition, we propose the use of the partition-comparison tools presented in [9]. In the framework of our work, a partition is the result of a pixel classification carried out on the basis of the generated texture features.
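The unsupervised pixel classification used throughout (section 2.2.3 and the page comparison here) can be sketched as a minimal moving-centres (k-means style) loop; the function name, iteration count and synthetic 20-dimensional data below are our own illustrative choices:

```python
import numpy as np

def moving_centres(X, k, n_iter=50, seed=0):
    """Minimal moving-centres clustering: the desired number of
    classes k is the only information supplied, as in section 2.2.3."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every vector to its nearest centre, then move centres.
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centres[c] = X[labels == c].mean(axis=0)
    return labels

# Synthetic stand-in for per-pixel descriptors: 20 values per pixel
# (5 features at 4 resolutions), drawn from two well-separated groups.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (50, 20)),
               rng.normal(5.0, 0.1, (50, 20))])
labels = moving_centres(X, k=2)
```

With well-separated descriptor groups, the two synthetic "layers" are recovered exactly, which is the behaviour the text/drawing/background separation relies on.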
Let α and β be two images whose pixels have been classified (L_α(i) = u means that pixel i of image α belongs to class u). It is then possible to build a contingency table N^{α,β} for these two images (eq. 8). This table allows one to compare two partitions in a reduced data space (N^{α,β} has dimension p×q, with p the number of classes of image α and q the number of classes of image β) and can be built in O(n):

    N^{\alpha,\beta}_{uv} = \sum_i X^i_{uv}, with X^i_{uv} = 1 if L_\alpha(i) = u and L_\beta(i) = v, and X^i_{uv} = 0 otherwise    (8)

This contingency table is the basis of a similarity measure S(α, β) between two images (eq. 9):

    S(\alpha,\beta) = \frac{ 2\sum_{u,v} N_{uv}^2 - \sum_u N_{u\cdot}^2 - \sum_v N_{\cdot v}^2 + n^2 - n }{ n^2 - n }    (9)

To estimate the quality of the comparison of document images, we drew our inspiration from the work of [5]. We decided to separate the documents into 5 different classes: pages with a frame entirely surrounding the content; pages made up only of text, justified on the right and left; pages made up only of text arranged in two columns; pages made up of an initial letter with the rest of the page containing only text; and pages made up only of drawings. The results shown in the rest of the article were all obtained on the same image database: nearly 400 pages from 9 different works. Every test begins with the application of the classification algorithm with 3 classes (text/graphics/background). Figure 2 shows the ability to detect visually similar pages in a large database: an image is given as a query (surrounded in red in the examples) and the system returns the images that are most similar to it.

Figure 2. Examples of query results
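The contingency table of eq. 8 and the similarity of eq. 9 can be sketched as below; we assume eq. 9 is the Rand index written in contingency-table form, consistent with the partition-comparison methodology the paper cites:

```python
import numpy as np

def contingency(la, lb, p, q):
    """Contingency table of eq. 8: N[u, v] counts the pixels labelled u
    in image alpha and v in image beta (built in O(n))."""
    N = np.zeros((p, q), dtype=np.int64)
    for u, v in zip(la, lb):
        N[u, v] += 1
    return N

def similarity(la, lb, p, q):
    """Partition similarity of eq. 9 (assumed Rand-index form):
    1.0 for identical partitions, smaller for diverging ones."""
    N = contingency(la, lb, p, q)
    n = N.sum()
    num = (2 * (N ** 2).sum()
           - (N.sum(axis=1) ** 2).sum()
           - (N.sum(axis=0) ** 2).sum()
           + n * n - n)
    return num / (n * n - n)

# Identical partitions score 1; renaming the class ids changes nothing,
# since only the grouping of the pixels matters.
a = [0, 0, 1, 1, 2, 2]
b = [2, 2, 0, 0, 1, 1]
```

This invariance to class renaming is exactly what a page comparison needs: two pages are similar if their text/graphics/background layers occupy the same pixels, whatever labels the clustering assigned.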
Table 1 summarizes the rate of good answers obtained for the 5 kinds of queries, as precision rates at Top 5, Top 10 and Top 15. The precision rate is simply the number of good answers returned for a query divided by the number of images considered (the size of the studied Top): Rate = Good answers / Size of the Top. In the tests we have done, the pages of all the works are mixed; the measure used is nevertheless able to categorize structures that are visually very different from each other.

Table 1. Precision rates obtained for 5 styles of queries

                                  Top 5   Top 10   Top 15
    Pages with borders            1       0.93     0.86
    Text on 2 columns             0.93    0.76     0.78
    Pages with only drawings      0.9     0.62     0.6
    Pages with only text          0.74    0.56     0.50
    Pages with text and drop caps 0.65    0.56     0.55

3.2 Comparison of textured pictures

The second experiment is a content-based image search on a base made up of historical drawings from old documents. We built a test base containing more than 400 pictures: more than a third are initial letters, and the rest is divided into several categories (coats of arms, characters, emblems, skulls, various ornamental elements, ...). We wish to compute a similarity between two pictures according to the characteristic textures they are made of. For this, we propose a metric measuring the similarity between two matrices of texture features. The distance d(k, l) of eq. 10 measures the similarity between two pictures k and l, with C^k the matrix describing the textures of picture k. The tests in this section correspond to query-by-example image search:

    d(k,l) = \sqrt{ \mathrm{trace}\left( (C^k - C^l) \cdot {}^t(C^k - C^l) \right) }    (10)

Figure 3 shows the good results obtained on the drawings database. The query picture is surrounded in red; below every answer picture, the (non-normed) similarity between that picture and the query picture is indicated.
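The matrix distance of eq. 10 amounts to the Frobenius norm of the difference of the two texture matrices; a minimal sketch, with hypothetical feature-matrix shapes (the paper does not specify the dimensions of C^k):

```python
import numpy as np

def texture_distance(Ck, Cl):
    """Sketch of eq. 10: sqrt(trace((Ck - Cl) (Ck - Cl)^T)),
    i.e. the Frobenius norm of the difference of the two
    texture feature matrices."""
    D = Ck - Cl
    return float(np.sqrt(np.trace(D @ D.T)))

# Hypothetical feature matrices, e.g. 3 regions x 20 texture values.
rng = np.random.default_rng(0)
Ck = rng.normal(size=(3, 20))
Cl = rng.normal(size=(3, 20))
```

The distance is zero for identical pictures and grows with the element-wise divergence of the feature matrices, which is what ranking query answers by similarity requires.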
After studying the results, we notice that the discrimination of the different picture categories of the base meets expectations: out of more than a hundred tested initial letters, the majority of the answers obtained in a Top 20 are initial letters.

Figure 3. Examples of query results

To allow a global estimation of the queries, we apply the same procedure as for the comparison of pages, computing precision rates at Top 5, 10 and 15 for different textures of the base. Table 2 sums up the average rates obtained.

Table 2. Accuracy rates for different queries

                      Top 5   Top 10   Top 15
    Drop caps         0.95    0.92     0.90
    Characters        0.92    0.90     0.89
    Skulls            0.91    0.86     0.79
    Emblems           0.90    0.87     0.78
    Coats of arms     0.88    0.78     0.73

4. Conclusion

This article presents image-processing tools for characterizing document images without any a priori knowledge. The originality of our proposal lies particularly in the fact that we do not try to segment the analysed documents or extract their structure. We describe how the content of document images can be characterized on the basis of non-parametric texture information and a multiresolution approach. By extracting signatures linked to the frequencies and orientations of the different parts of a page, it is possible to extract and compare elements of content without putting forward hypotheses about the physical or logical structure of the analysed documents. It remains to study their integration into more complete indexing systems (CBIR systems, for example). The first prospect we set ourselves is to finalize an indexing system able to automatically produce descriptive meta-data for document images, comprising our texture features but also other information (linked to colours, shapes, positions, ...).

References

[1] Allier and Emptoz.
Font type extraction and character prototyping using Gabor filters. In Proceedings of ICDAR 2003, page 799, 2003.
[2] Bres. Contributions à la quantification des critères de transparence et d'anisotropie par une approche globale. PhD thesis, LIRIS, Université de Lyon, 1994.
[3] D. Chetverikov, J. Liang, J. Komuves, and R. M. Haralick. Zone classification using texture features. In Proceedings of ICPR 1996, volume III, page 676, Washington, DC, USA, 1996. IEEE Computer Society.
[4] S. Khedekar, V. Ramanaprasad, S. Setlur, and V. Govindaraju. Text-image separation in Devanagari documents. In Proceedings of ICDAR 2003, volume 2, page 1265, Washington, DC, USA, 2003. IEEE Computer Society.
[5] S. Marinai, E. Marino, and G. Soda. Tree clustering for layout-based document image retrieval. In Proceedings of DIAL 2006, pages 243-253, Washington, DC, USA, 2006. IEEE Computer Society.
[6] Nicolas, Kessentini, Paquet, and Heutte. Handwritten document segmentation using hidden Markov random fields. In Proceedings of ICDAR 2005, volume 1, pages 212-216, August 2005.
[7] J. Ramel, S. Busson, and M. Demonet. AGORA: the interactive document image analysis tool of the BVH project. In Proceedings of DIAL 2006, pages 145-155, 2006.
[8] Youness and Saporta. Une méthodologie pour la comparaison de partitions. Revue de Statistique Appliquée, 52:97-120, 2004.