A Proposition of Retrieval Tools for Historical Document Image Libraries
Nicholas Journet
LI, TOURS-FRANCE
[email protected]
Rémy Mullot
L3I, La Rochelle-FRANCE
[email protected]
Véronique Eglin
LIRIS INSA de Lyon
Villeurbanne-FRANCE
[email protected]
Jean-Yves Ramel
LI, TOURS-FRANCE
[email protected]
Abstract

In this article, we propose a method for characterizing images of historical documents based on a texture approach. The characterization relies on a multiresolution study of the textures contained in the document images. By extracting five features linked to frequencies and orientations in the different parts of a page, it is possible to extract and compare elements of high semantic level without making any hypothesis about the physical or logical structure of the analysed documents. Experiments show the feasibility of building tools to assist navigation and indexing. In these experiments, we emphasize the relevance of the texture features and the advance they represent for characterizing the content of a deeply heterogeneous corpus.
1. Introduction

This article addresses the fundamental problem of characterizing the content of images of historical documents, considered as an alternative to analysis methods that, until now, have mainly been based on page segmentation and on an interpretation of page structure. The article has two parts. First, it details our proposal for characterizing the content of historical document images. By computing texture features at different resolutions, we show that it is possible to characterize the content of images without making any hypothesis about either the structure or the characteristics of the processed images. In the last part, we show how this texture characterization can be used for content indexing purposes.
2. Our texture approach for the characterization of the content

2.1 Analysis of the content of document images: the texture approach
In the context of a collaboration with the Centre d'Etudes Superieures de la Renaissance de Tours¹, we had access to more than one hundred digitized works from the 15th and 16th centuries. One striking characteristic of these old documents is the heterogeneity of the available works. The rudimentary techniques and equipment of the period, the deterioration of the documents and the variety of editorial rules are some of the reasons explaining the diversity of historical document images. The document images to which we had access span 3 centuries of printing and history. The complicated layouts (several columns of irregular sizes), the use of specific fonts (no longer in use today), the frequent embellishments (illuminations, initial letters, frames), the small line spacing, the non-constant spacing between characters and words, and the superimposed layers of information (noise, handwritten notes) are just some of the specificities of old documents that make characterizing their content difficult. One possibility is to use texture algorithms. For example, the authors of [4] propose a method for text/picture separation in Devanagari documents based on horizontal histograms. In [3], the author analyses previously cut-out blocks in order to classify them either as drawings or as text; the extracted texture measures come from an analysis of the distribution of pixels along different angles. In [6], the authors use HMMs to segment images of handwritten documents into labelled areas of interest (text lines, scratches, marginal notes...). This allows them to segment the handwritten drafts of Flaubert, which have the particularity of containing many hatchings and deletions that make classic approaches inefficient. Indeed, since texture approaches mainly use (very) low-level information, they free us from much of the a priori knowledge required by purely data-driven or model-driven methods. Among their other advantages, these tools mostly work on grey-level images, so binarisation is not systematically necessary. It must be noted that although these texture tools characterize the image content, they do not provide a segmentation into blocks (paragraphs, pictures, titles...); that objective can only be reached through post-processing.

¹ http://www.bvh.univ-tours.fr
2.2 Characterization of the content of the images
We propose a process based on pieces of information extracted from an analysis of the textures within the image, without looking for, or taking into account, any a priori knowledge of the page structure. Five texture features are computed. The first three relate to orientations, the other two to transition frequencies. These features are computed locally, at different image resolutions. Using a sliding-window analysis (the window size being the only parameter of our method), it is possible to associate with every image pixel the meta-data resulting from the extraction of the texture attributes. The analysis is carried out at 4 different resolutions, finally yielding 20 numeric values describing every pixel. Once all the pages of a work have been analysed, the extracted information is stored in a meta-data base.
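The per-pixel multiresolution extraction described above can be sketched as follows. This is a minimal sketch under our own assumptions: `placeholder_features` stands in for the five texture features defined in the next sections, the resolution pyramid is built by simple decimation, and nearest-neighbour upsampling is used to bring every level back to the original size.

```python
import numpy as np

def placeholder_features(window):
    # Stand-in for the paper's 5 texture features (3 orientation,
    # 2 frequency); here we just return simple window statistics.
    return np.array([window.mean(), window.std(), window.min(),
                     window.max(), np.median(window)])

def multiresolution_features(image, window_size=5, n_resolutions=4,
                             extract=placeholder_features):
    """Associate with every pixel of `image` a vector of
    n_resolutions * 5 texture values (20 by default)."""
    img = image.astype(float)
    h0, w0 = img.shape
    half = window_size // 2
    levels = []
    for _ in range(n_resolutions):
        h, w = img.shape
        feats = np.zeros((h, w, 5))
        for i in range(h):
            for j in range(w):
                # sliding window centred on (i, j), clipped at the borders
                win = img[max(0, i - half):i + half + 1,
                          max(0, j - half):j + half + 1]
                feats[i, j] = extract(win)
        # nearest-neighbour upsampling back to the original image size
        ri = np.arange(h0) * h // h0
        rj = np.arange(w0) * w // w0
        levels.append(feats[np.ix_(ri, rj)])
        img = img[::2, ::2]          # halve the resolution for the next level
    return np.concatenate(levels, axis=2)
```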
2.2.1 Characterization of the orientations
In order to extract texture orientation information, we chose a non-parametric tool based on the autocorrelation function: the rose of directions, proposed by Bres in [2]. The rose of directions is a polar diagram based on the study of the response of the autocorrelation function when applied to an image. Let (k, l) be the central point of the image after autocorrelation and let the straight line D_origin be the abscissa axis going through this point. For a studied orientation θ_i, one considers the straight line D_i such that any pair of points (a, b) on it satisfies the following relation: the angle formed by the line through (a, b) and D_origin equals θ_i. The autocorrelation function of a two-dimensional signal is defined by eq. 1.
C(k, l) = Σ_{k'=−∞..+∞} Σ_{l'=−∞..+∞} x(k', l') · x(k' + k, l' + l)    (1)
Thus, a point C(k, l) of the autocorrelation function contains the sum of the products of the grey levels x(k', l') of the points brought into correspondence by a translation of vector (k, l). Finally, for every orientation θ_i, one computes the sum of the values of the autocorrelation function along D_i (eq. 2).
R(θ_i) = Σ_{(a,b) ∈ D_i} C(a, b)    (2)
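A possible implementation of the rose of directions and of the three orientation signatures defined below is sketched here, under our own assumptions: the autocorrelation is computed by FFT (hence with circular boundary conditions), and the rose is sampled at 180 discrete angles.

```python
import numpy as np

def rose_of_directions(patch, n_angles=180):
    """Rose of directions: for each orientation theta, sum the 2-D
    autocorrelation of the patch along the line D_theta through its centre."""
    x = patch.astype(float)
    # autocorrelation via the Wiener-Khinchin theorem (circular)
    F = np.fft.fft2(x)
    C = np.fft.fftshift(np.fft.ifft2(F * np.conj(F)).real)
    ci, cj = C.shape[0] // 2, C.shape[1] // 2
    rmax = min(ci, cj) - 1
    R = np.zeros(n_angles)
    for a in range(n_angles):
        theta = np.pi * a / n_angles                  # theta in [0, pi)
        di, dj = np.sin(theta), np.cos(theta)
        for r in range(-rmax, rmax + 1):              # walk along D_theta
            R[a] += C[int(round(ci + r * di)), int(round(cj + r * dj))]
    return R

def orientation_features(R):
    """Features 1-3 (eqs. 3-5 below) computed from the rose R."""
    angles = 180.0 * np.arange(len(R)) / len(R)
    k = int(np.argmax(R))
    f1 = abs(180.0 - angles[k])        # deviation of the main orientation
    f2 = R[k]                          # intensity at the main orientation
    f3 = np.std(np.delete(R, k))       # spread of the remaining orientations
    return f1, f2, f3
```

On a patch of horizontal stripes, for instance, the rose peaks at the horizontal orientation.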
From this tool, we then defined pertinent characteristics to describe the content of the document. We decided to extract 3 signatures that characterize texture information relative to the orientations. The first extracted signature is the angle matching the main orientation of the rose of directions (eq. 3). In order to avoid manipulating circular data, this angle is normalized as the deviation from the horizontal. So, at a resolution k, one computes for every pixel (i, j) of an image the texture attribute 1.
Feature1_k(i, j) = |180 − ArgMax(R_(i,j)(θ))|, with θ ∈ [0, π]    (3)
The isotropy of the image is estimated from the intensity of the autocorrelation function: at the main orientation found by eq. 3, every pixel (i, j) is characterized with the help of eq. 4.
Feature2_k(i, j) = R(ArgMax(R_(i,j)(θ))), with θ ∈ [0, π]    (4)
The last texture feature linked to orientation characterizes the global shape of the rose. For that, one computes the variance of the intensities of the rose, excluding the orientation of maximal intensity (eq. 5). If the variance is high, the rose is deformed and a great number of orientations are present in various proportions.
Feature3_k(i, j) = STD(R_(i,j)(θ)), with θ ∈ [0, π] and θ ≠ ArgMax(R_(i,j)(θ))    (5)

2.2.2 Characterization of frequencies
The notion of "frequencies" in document images relates to the frequency of transitions between paper and ink. In order to characterize these transition frequencies, we drew our inspiration from the works [1, 8]. These authors detail how different styles of text can be characterized, or text separated from pictures, by studying the properties of grey-level transitions between pixels. The first signature we use characterizes the frequency of transitions between ink and paper. For every line of the area analysed by the sliding window, one sums the differences between each grey-level pixel and its left neighbour: the higher the sum, the higher the number of transitions on the line. A simple computation then yields a signature for the transitions in the studied area (eq. 6).
Feature4_k(i, j) = Avg_{i ∈ I'} ( Σ_{j ∈ J'} (p_{i,j} − p_{i,j+1}) )    (6)

with I' and J' the size of the analysis window and p_{i,j} the grey level of the pixel of coordinates (i, j).
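Eq. 6 can be sketched as below. Note one assumption on our part: we use absolute grey-level differences, since a signed sum would telescope to the difference between the first and last pixel of a line, whereas the absolute sum grows with the number of ink/paper transitions, as the text describes.

```python
import numpy as np

def feature4(window):
    """Transition-frequency signature (eq. 6): for each line of the
    analysis window, sum the grey-level differences between each pixel
    and its right-hand neighbour, then average over the lines."""
    w = window.astype(float)
    # |p[i, j] - p[i, j+1]| summed along each line, averaged over lines
    return np.mean(np.sum(np.abs(w[:, :-1] - w[:, 1:]), axis=1))
```

On a flat background area this signature is 0; on a finely textured area (small text) it is high.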
The last computed texture feature is based on an algorithm characterizing the white spaces separating neighbouring elements. We are looking for a means to obtain information on the extent of the various background areas which punctuate the pages. We adopted a recursive approach consisting in computing 4 iterations of a recursive XY-cut algorithm. At each iteration, the area just analysed is cut into 4 areas of identical size, and for each of them the feature of eq. 7 is computed. This feature is, for every pixel, equal to the average of the sum of the grey levels along its line and along its column.
Feature5_k(i, j) = ( Σ_{l ∈ J'} p^k_{i,l} + Σ_{h ∈ I'} p^k_{h,j} ) / 2    (7)

with I' and J' the size of the analysis window at iteration k of the recursive algorithm.
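The recursive XY-cut computation can be sketched as follows, under our reading of eq. 7: for every pixel of a quadrant, half the sum of the grey levels along its row plus along its column inside that quadrant, with each analysed area split into 4 equal quadrants per iteration as the text describes.

```python
import numpy as np

def recursive_xy_feature5(image, n_iter=4):
    """At each iteration, every area analysed so far is cut into 4 equal
    quadrants; in each quadrant, eq. 7 assigns to every pixel half the
    sum of the grey levels of its row and of its column.
    Returns one list of per-quadrant feature maps per iteration."""
    results = []
    areas = [image.astype(float)]
    for _ in range(n_iter):
        next_areas, level = [], []
        for area in areas:
            h, w = area.shape
            quads = [area[:h // 2, :w // 2], area[:h // 2, w // 2:],
                     area[h // 2:, :w // 2], area[h // 2:, w // 2:]]
            for q in quads:
                rows = q.sum(axis=1, keepdims=True)   # sum over l in J'
                cols = q.sum(axis=0, keepdims=True)   # sum over h in I'
                level.append((rows + cols) / 2.0)     # eq. 7, per pixel
            next_areas.extend(quads)
        results.append(level)
        areas = next_areas
    return results
```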
2.2.3 Discussion about our proposal

Figure 1. Pixel classification of a work
In this section, the quality of the content categorization is estimated through a classification of the pixels on the basis of the 20 proposed texture features. Classifying the elements of content of the works allows us, first, to verify whether the classification meets the objective of separating the pixels into layers when it is applied to a complete work. Every image pixel carries 20 values, coming from the 5 features computed at 4 different resolutions. Our objective is to group together the pixels corresponding to homogeneous areas, which amounts to grouping together characteristic vectors that are close in the sense of a metric. It is an unsupervised classification problem, for which we do not know a priori the labels from which classes could be built. We used a moving-centres classification algorithm, where only the desired number of classes is specified. Figure 1 shows the kind of results obtained when a classification is run on a complete work. Above all, these tests highlighted a real and coherent separating power of the extracted features. The main limits of the proposed labelling lie in the analysis of transition areas between text and pictures, and of titles containing large isolated characters: because of this, a great part of the titles (isolated from the body of the text) are identified as drawings. In order to measure the pertinence of the method, instead of giving a great quantity of visual results we propose a simple quantitative estimation: the ability to separate the pixels into 3 classes, text/drawing/background. To do that, we manually captured a ground truth with the help of an application we developed, which allows the outlines of the drawing areas and of the text areas to be delimited with the mouse. A file is created to store this ground truth so that it can finally be compared with the computed classification. Our tests were done on 400 pages of historical documents, extracted from 3 different works. Since we wanted an idea of the pertinence of the extracted features, we made up a corpus of test images with contents as varied as possible. The texture features used allow a good separation of the information layers of the documents: the rates of good classification are 83% for drawing pixels and 92% for text pixels.
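The unsupervised grouping of the 20-dimensional pixel vectors can be sketched with a minimal moving-centres (k-means) pass. This is our own toy implementation, not the exact algorithm or initialization used in the paper.

```python
import numpy as np

def kmeans(X, k=3, n_iter=20, seed=0):
    """Minimal moving-centres (k-means) clustering, as used to group the
    20-dimensional per-pixel texture vectors into k layers
    (e.g. text / drawing / background)."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each vector to its nearest centre (Euclidean distance)
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each centre to the mean of its assigned vectors
        for c in range(k):
            if np.any(labels == c):
                centres[c] = X[labels == c].mean(axis=0)
    return labels, centres
```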
3. Towards new applications of content-based information retrieval

3.1 Comparison of pages
The first experiment we wish to carry out is a comparison of pages of historical document images. It will allow us to study whether it is possible, without segmenting and without identifying anything, to compare their content from texture information. We have chosen to characterize the pages by the spatial organization of the blocks of text, pictures and background. On the basis of this definition, we propose to use the tools for comparing partitions presented in [8]. In the framework of our work, a partition is the result of a classification of pixels carried out on the basis of the generated texture features. Let α and β be two images for which a classification of their pixels was carried out (L_α(i) = u means that the class of pixel i of image α is u). It is then possible to build a contingency table N^{α,β} of these two images (eq. 8). This table allows one to compare two partitions in a reduced data space (N^{α,β} is of dimension p×q, with p the number of classes of image α and q the number of classes of image β) and can be built in O(n).

Figure 2. Examples of request results
N^{α,β}_{u,v} = Σ_i X^i_{u,v}, with X^i_{u,v} = 1 if L_α(i) = u and L_β(i) = v, and 0 otherwise    (8)
This contingency table is the basis of a similarity measure S(α, β) between two images (eq. 9).
S(α, β) = ( Σ_u Σ_v N_{u,v}² + Σ_u N_{u,u}² − Σ_u N_{u·}² − Σ_v N_{·v}² + n² ) / n²    (9)

with N_{u·} and N_{·v} the row and column sums of N, and n the total number of pixels. Two identical partitions give S = 1.
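Eqs. 8 and 9 can be sketched as follows. The similarity function follows our reading of the printed (partly garbled) equation; with it, two identical labelings give S = 1.

```python
import numpy as np

def contingency(labels_a, labels_b, p, q):
    """Contingency table of eq. 8: N[u, v] counts the pixels labelled u
    in image alpha and v in image beta.  Built in O(n)."""
    N = np.zeros((p, q), dtype=np.int64)
    for u, v in zip(labels_a, labels_b):
        N[u, v] += 1
    return N

def similarity(N):
    """Similarity S(alpha, beta) of eq. 9 computed from the table N,
    with N_u. / N_.v the row / column sums and n the pixel count."""
    n = N.sum()
    num = ((N ** 2).sum() + (np.diag(N) ** 2).sum()
           - (N.sum(axis=1) ** 2).sum()
           - (N.sum(axis=0) ** 2).sum() + n ** 2)
    return num / n ** 2
```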
To estimate the quality of the comparison of document images, we drew our inspiration from the work of [5]. We decided to separate the documents into 5 different classes: pages with a frame entirely surrounding the content; pages made up only of text, justified on the right and on the left; pages made up only of text but arranged in two columns; pages made up of an initial letter with the rest of the page containing only text; and finally pages made up only of drawings. The results shown in the rest of the article were all obtained on the same image database: nearly 400 pages chosen from 9 different works. Every test begins with the application of the classification algorithm with 3 classes (text/graphics/background). Figure 2 shows the ability to detect visually similar pages in a large database: an image is given as the request (surrounded by red in the examples) and the system returns the images most similar to it. Table 1 summarizes the rates of good answers obtained for 5 kinds of requests; the results give precision rates for a Top 5, Top 10 and Top 15.
We compute a precision rate simply by dividing the number of good answers obtained for a request by the number of images considered (the size of the studied Top): Rate = Good answers / Size of the Top. In our tests, the pages of all the works are mixed together. The measure used allows structures that are visually very different from each other to be distinguished.
Table 1. Precision rates obtained for 5 kinds of requests

                                 Top5   Top10  Top15
Pages with borders               1      0.93   0.86
Text on 2 columns                0.93   0.76   0.78
Pages with only drawings         0.9    0.62   0.6
Pages with only text             0.74   0.56   0.50
Pages with text and drop caps    0.65   0.56   0.55

3.2 Comparison of textured pictures
The second experiment consists in a content-based search for pictures, on a base made up of historical drawings from old documents. We built a test base containing more than 400 pictures. More than a third of the base is made up of initial letters; the rest is divided into several categories: coats of arms, characters, emblems, skulls, various ornamental elements... We wish to compute a similarity between two pictures according to the characteristic textures they are made of. For this, we propose a metric measuring the similarity between two matrices of texture features. The function d(k, l) (eq. 10) measures the similarity between two pictures k and l, with C^k the matrix describing the textures of picture k. In this section, the tests correspond to a query-by-example search for pictures.
d(k, l) = sqrt( trace( (C^k − C^l) · t(C^k − C^l) ) )    (10)
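Eq. 10 is the Frobenius norm of the difference between the two texture-feature matrices; a direct transcription:

```python
import numpy as np

def texture_distance(Ck, Cl):
    """d(k, l) of eq. 10: sqrt(trace((Ck - Cl) @ (Ck - Cl).T)), i.e.
    the Frobenius norm of the difference of the texture matrices."""
    D = Ck - Cl
    return float(np.sqrt(np.trace(D @ D.T)))
```

Identical pictures give d = 0; in a query-by-example search, the base pictures are ranked by increasing d to the request picture.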
Figure 3 shows the good results obtained on the drawings database. The request picture is surrounded by red; below every answer picture is indicated the (non-normalized) similarity measured between that picture and the request. After studying the results, we notice that the discrimination between the different categories of pictures in the base meets expectations. For more than a hundred tested initial letters, the majority of the answers obtained in a Top 20 are initial letters.
Figure 3. Examples of request results

To allow a global estimation of the requests made, we apply the same procedure as the one used for the comparison of pages: precision rates are computed for a Top 5, Top 10 and Top 15 for the different categories of the base. Table 2 sums up the average rates obtained.

Table 2. Accuracy rates for different requests

                    Top5   Top10  Top15
Drop caps           0.95   0.92   0.90
Characters          0.92   0.90   0.89
Skulls              0.91   0.86   0.79
Emblems             0.90   0.87   0.78
Coats of arms       0.88   0.78   0.73

4. Conclusion

This article presents our proposal of image-processing tools for characterizing document images without any a priori knowledge. The originality of our proposal lies in the fact that we do not try to segment the documents or to extract their structure. We describe how the content of document images can be characterized using non-parametric texture information and a multiresolution approach. By extracting signatures linked to the frequencies and the orientations of the different parts of a page, it is possible to extract and to compare elements of content without putting forward any hypothesis about the physical or logical structure of the analysed documents. It now remains to study their integration into more complete indexing devices (CBIR systems, for example). The first prospect we have set ourselves is to finalize an indexing system able to automatically produce descriptive meta-data for document images, comprising our texture features but also other pieces of information (linked to colours, shapes, positions...).

References
[1] Allier and Emptoz. Font type extraction and character prototyping using Gabor filters. ICDAR, 02:799, 2003.
[2] Bres. Contributions à la quantification des critères de transparence et d'anisotropie par une approche globale. PhD thesis, LIRIS, Université de Lyon, 1994.
[3] D. Chetverikov, J. Liang, J. Komuves, and R. M. Haralick. Zone classification using texture features. In ICPR '96: Proceedings of the International Conference on Pattern Recognition, volume III, page 676, Washington, DC, USA, 1996. IEEE Computer Society.
[4] S. Khedekar, V. Ramanaprasad, S. Setlur, and V. Govindaraju. Text-image separation in Devanagari documents. In ICDAR '03: Proceedings of the Seventh International Conference on Document Analysis and Recognition, volume 2, page 1265, Washington, DC, USA, 2003. IEEE Computer Society.
[5] S. Marinai, E. Marino, and G. Soda. Tree clustering for layout-based document image retrieval. In DIAL '06: Proceedings of the Second International Conference on Document Image Analysis for Libraries, pages 243-253, Washington, DC, USA, 2006. IEEE Computer Society.
[6] Nicolas, Kessentini, Paquet, and Heutte. Handwritten document segmentation using hidden Markov random fields. ICDAR, 1:212-216, August 2005.
[7] J. Ramel, S. Busson, and M. Demonet. AGORA: the interactive document image analysis tool of the BVH project. DIAL, 0:145-155, 2006.
[8] Youness and Saporta. Une méthodologie pour la comparaison de partitions. Revue de Statistique Appliquée, 52:97-120, 2004.