International Journal of Pattern Recognition and Artificial Intelligence
Vol. 26, No. 5 (2012) 1263002 (25 pages)
© World Scientific Publishing Company
DOI: 10.1142/S0218001412630025
ON THE INFLUENCE OF WORD REPRESENTATIONS
FOR HANDWRITTEN WORD SPOTTING IN HISTORICAL
DOCUMENTS

JOSEP LLADÓS*, MARÇAL RUSIÑOL†, ALICIA FORNÉS‡,
DAVID FERNANDEZ§ and ANJAN DUTTA¶

Computer Vision Center, Computer Science Department
Edifici O, Universitat Autònoma de Barcelona
08193, Bellaterra, Spain
*[email protected]
[email protected]
[email protected]
§[email protected]
[email protected]
Received 7 August 2011
Accepted 2 May 2012
Published 19 October 2012
Word spotting is the process of retrieving all instances of a queried keyword from a digital
library of document images. In this paper we evaluate the performance of different word
descriptors to assess the advantages and disadvantages of statistical and structural models in a
framework of query-by-example word spotting in historical documents. We compare four word
representation models: sequence alignment using DTW as a baseline reference, a bag-of-visual-words
approach as a statistical model, a pseudo-structural model based on a Loci features
representation, and a structural approach where words are represented by graphs. The four
approaches have been tested on two collections of historical data: the George Washington
database and the marriage records from the Barcelona Cathedral. We experimentally demonstrate
that statistical representations generally give better performance; however, it cannot be
neglected that large descriptors are difficult to implement in a retrieval scenario where
word spotting requires the indexation of data with millions of word images.
Keywords: Handwriting recognition; word spotting; historical documents; feature representation; shape descriptors.
1. Introduction
There is a huge amount of old manuscripts stored in libraries and archives. These
documents have a historical value not only for their physical appearance but also for
their contents. Such contents are studied daily by scholars and communities of users
for history research. Many digitization initiatives exist worldwide to convert historical
documents to digital libraries. The digitization process and the use of
standard formats to store the images in digital libraries ensure the long-term
preservation of the documents. On the other hand, society at large can browse
through collections using web-enabled technologies. But just browsing through billions
of digitized document images is not enough. Digital libraries have to provide
users with tools for mining the contained contents and unlocking the wealth of
information in these resources. Manually typing all the information and recording it into
databases is a tedious and slow process, requiring a lot of resources. There is still an
important gap in the conversion pipeline from paper documents to useful information,
especially when documents are handwritten and degraded due to the impact of ageing.
Handwriting recognition (HWR) is one of the most significant topics within the
field of Document Image Analysis and Recognition (DIAR). It remains, after decades, a
key research activity with still many challenges towards the creation of digital
libraries of historical manuscripts whose contents can be retrieved and crosslinked.
Over the last years, relevant research achievements have been attained and some
systems have been reported in the literature that are capable of transcribing
handwritten documents up to a certain precision. Generally speaking, a HWR system can
be divided into two coupled components, namely the morphological and the language
models. The morphological component formulates the recognition of a handwritten
word as a pattern recognition problem. The language model takes into
account the prior probabilities of the lexical units of a given lexicon. Handwritten
text can be seen as a sequence of observations (e.g. characters, graphemes, or column
features), so following an analytical approach the classification of an unknown
sequence X is formulated in terms of the likelihood that the feature vector sequence
X is generated by the model of word W, and the prior probabilities P(W) of (sub)
words in the lexicon. In this framework, a large amount of data is required to train
both the morphological and the language models. However, in historical documents
such training data may not exist, or may require a tedious manual creation. Moreover,
the information contained in a collection of manuscripts may not be useful to
train recognizers for another collection because of different periods of time with
different script styles, or different disciplines with nonintersecting and restricted
lexicons. On the other hand, some use cases do not require full transcription, or the
existing technologies can hardly transcribe documents with poor quality. Rather
than attempting to transcribe text to its full extent, only checking the existence of
special keywords (names, dates, cities, etc.) may be enough.
1.1. Word spotting
Handwritten word spotting is defined as the pattern analysis task that consists of
finding keywords in handwritten document images. In historical documents, it offers
a solution for searching information in digital libraries of documents. A typical
example is when genealogists search for family names in birth records while they
explore family linkages. A straightforward strategy to search words in handwritten
documents would be to apply a HWR system to the document and then search for
the words in the output text. However, this would be costly in terms of computation
and training set requirements. Manmatha et al. proposed a different philosophy.27
The important contribution of this work is that an image matching approach is
sufficient for retrieving keywords in documents, without the need for recognizing the
text. The problem is then converted to a validation problem rather than a recognition
problem: validate whether a word image matches a given query with a high
score.

Two main approaches to word spotting exist depending on the representation of
the query. Query-by-string (QBS)1 uses an arbitrary string as input. It typically
requires a large amount of training material, since character models are learned
a priori and the model for a query word is built at runtime from the models of its
constituent characters. In query-by-example (QBE)27 the input is one or several
example images of the queried word. This is addressed as an image retrieval
problem. Therefore, it does not require learning, but only collecting one or several
examples of the keyword. QBS is more flexible because it allows searching for any
keyword while QBE does not. In contrast, QBE can be performed "on-the-fly", since
the user can crop a word in a document collection and start searching for similar
ones. In this work we study the relevance of different representation models in a
scenario of QBE handwritten word spotting in historical document collections.
1.2. Related work
Due to its effectiveness, word spotting has been widely used for historical document
indexing and retrieval, not only for old printed documents,28 but also for old
handwritten ones.7,17,22,23,30,33

It is important to remark that, although the use of a language model is very
common in HWR, it may be useless when dealing with historical documents. This may
be due to different reasons: the lack of enough training data to compute lexicon
frequencies; vocabularies that are constrained (names, cities) or not stable over time;
"on-the-fly" querying (the user selects a collection and wants to search for an arbitrary
word in it); etc. In this scenario, a good description of the word images is a key issue, since the
recognition relies only on the morphological model. In this paper we study the influence
of different feature representations when recognizing handwriting. Different
representations can be found in the literature for the description of word images.
Similarly to shape descriptors, they can be classified into statistical and structural.
The former represent the image as an n-dimensional feature vector, whereas the latter
usually represent the image as a set of geometric and topological primitives and
relationships among them.

Statistical descriptors are the most frequent and they can be defined from global
and local features. Global features are computed from the image as a whole; for
example, the width, height, aspect ratio, and number of pixels are global features. In
contrast, local features are those which refer independently to different regions of the
image or primitives extracted from it. For example, the positions or numbers of holes,
valleys, dots or crossings at keypoints or regions are local features. Rath and Manmatha
proposed a Dynamic Time Warping (DTW)-based approach which computed profile
features31,33: the upper and lower profiles, the number of foreground pixels, and the
number of black/white transitions. This set of features showed better performance
than the raw pixels used as input features in their previous work.27 In a similar way,
other approaches also use information about profiles: the QBS approach proposed by
Kesidis et al.17 for machine-printed documents computes 90 features based on zoning
plus 60 features based on upper and lower profiles. Frinken et al.7,8 proposed a word
spotting approach based on neural networks, which uses the set of features proposed
by Marti and Bunke.29 For each window of one-column width, the following nine
geometric features are computed: the 0th, 1st and 2nd moments of the black pixel
distribution within the window; the positions of the top-most and bottom-most black
pixels (upper/lower profiles); the inclination of the top and bottom contours of the
word at the actual window position; the number of vertical black/white transitions;
and the average gray scale value between the top-most and bottom-most black pixels.

Gradient features are also very common: in the HMM-based word spotting
approach of Rodríguez and Perronnin,34 local gradient histogram features were
computed. Similarly, in the multi-lingual QBS approach proposed by Leydier et al.,22,23
gradient features are extracted from ZOIs (zones of interest). Moghaddam and Cheriet30
proposed a word segmentation-free approach that computes a stroke gray-level
estimation and an edge profile estimation (based on gradients). Other interesting
statistical descriptions are proposed in the approach of Rothfeder et al.36 based on the
Harris corner detector, or the work of van der Zant et al.2 where the recognition is
inspired by biological features extracted by Gabor filters.

Holistic features can also be used for word recognition.26 Lavrenko et al.20 propose
the use of holistic features in a simple HMM approach, with only one state per word.
They compute a vector of fixed length (27 dimensions) for each word, composed of
scalar features (height, width, aspect ratio, area, number of ascenders and descenders
of the word) and time series (projection profile, upper and lower profiles), which are
approximated by the lower-order coefficients of the Discrete Fourier Transform
(DFT).

Structural approaches are less common in the literature. Some early works exist
using graph representations for isolated digits5,24 or Chinese characters.12,39 The
approach proposed by Keaton et al.15 is based on the extraction of information about
concavities, although results show that this description is sensitive to noise. The
word spotting approach for printed documents (old and modern) described by
Marinai et al.28 uses string encoding as the input features, and uses a Self-Organizing
Map (SOM) for clustering.

Finally, it must be said that combinations of statistical and structural descriptors
have also been proposed. Kessentini et al.18 proposed a multi-stream HMM-based
approach for off-line handwritten word recognition (tested on Latin and Arabic
manuscripts), combining two different types of features: directional density features
(8 from chain codes, 4 from the structure of the contour, and 3 from the position of
the contour), and density features (16 from foreground densities and transitions, and
10 from concavities). The authors conclude that the combination of these types of
features obtains the best results. Recently, Fischer et al.6 proposed an interesting
approach for HWR in historical documents where graph similarity features are used
as observations in a classical HMM.
1.3. Outline of the paper
In this paper we focus on the morphological model alone, without using a language
model. We analyze the performance of the description when dealing with word
recognition. Hence, the contribution of this paper is the analysis of four families of
morphological models applied to word classification in a QBE word spotting
framework in historical manuscripts. In this scenario we assume that a large amount
of images of digitized handwritten documents is stored in a digital library.
Therefore, given a query word, there may be millions of candidate words in the
database. The performance of the descriptors is therefore assessed not only in terms
of retrieval quality, but also in terms of other attributes such as their computational
cost and their ability to be integrated in a large-scale retrieval application.

As baseline model, we have selected the well-known approach of Rath and
Manmatha,32,33 where word images are represented as sequences of column features
and aligned using a DTW algorithm. We have chosen this representation because it
can be considered the first successful handwritten word spotting approach in the
state of the art.

In addition, other description models are analyzed according to the classical
taxonomy that divides pattern recognition into statistical and structural approaches.
Hence, we present three different description models that can be considered statistical,
pseudo-structural (or hybrid), and structural, respectively. The models presented
in this paper should not be considered as contributions by themselves, since all of
them correspond to implementations of published work adapted to the problem of
handwritten word representation. We do not intend to discuss the goodness of
each model, but to analyze them as representatives of the corresponding families. As
stated previously, the actual contribution of this paper is the evaluation of the
performance of different families of morphological models when applied to a scenario
of handwritten word spotting in historical documents. The three models proposed in
the evaluation are: the bag-of-visual-words representation based on SIFT features
proposed by Rusiñol et al.,37 as statistical descriptor; a descriptor inspired by the Loci
features proposed by Fernández et al.,4 as pseudo-structural representation; and
finally a graph-based representation following the work of Dutta et al.,3 as structural
descriptor.
To perform the experiments, two different datasets of historical handwritten
documents have been used. The first one consists of a set of 20 pages from a collection
of letters by George Washington.33 The second corpus consists of 27 pages from an
18th century collection of marriage records from the Barcelona Cathedral.4 Both
collections are accurately segmented into words and transcribed. In order to evaluate
the performance of the different word representation methods in a word spotting
framework, we have chosen performance measures based on precision and recall.

The rest of the paper is organized as follows. In Sec. 2 we overview the four
representation models used in our performance evaluation protocol. Afterwards,
Sec. 3 describes the experimental framework with historical databases, and we discuss
the results obtained and the pros and cons of the different approaches. Finally,
in Sec. 4 the conclusions are drawn.
2. Representation Models for Handwritten Words
As introduced in Sec. 1, we have implemented four models representative
of different classes to evaluate the importance of shape word representation in a word
spotting context. This section overviews the different approaches. The reader is
referred to the corresponding original works for detailed descriptions.
2.1. Sequences of column features and DTW
As a reference system, we have chosen the well-known word spotting approach
described by Rath and Manmatha,32,33 which is based on the DTW algorithm. The
DTW algorithm was first introduced by Kruskal and Liberman19 for putting samples
into correspondence in the context of speech recognition. It is a robust distance
measure for time series, allowing similar samples to match even if they are out of
phase in the time axis (see Fig. 1). DTW can distort (or warp) the time axis,
compressing it at some places and expanding it at others, finding the best matching
between two samples.

In order to apply DTW for word spotting, a common practice is to represent the
word images as sequences of column-wise feature vectors. In the word spotting
approach described by Rath and Manmatha,33 the following four features are
Fig. 1. Bin-to-bin and DTW alignment.
1263002-6
Word Representations for Handwritten Word Spotting
computed for every column of a word image:

• the number of foreground pixels in the column;
• the upper profile (the distance of the upper pixel in the column to the upper boundary of the word's bounding box);
• the lower profile (the distance of the lower pixel in the column to the lower boundary of the word's bounding box);
• the number of transitions from background to foreground and vice versa.
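The four features above can be sketched as follows; this is a minimal illustration (not the authors' implementation), assuming a binary NumPy array with 1 for ink and 0 for background, and using the bounding-box height as a sentinel value for empty columns:

```python
import numpy as np

def column_features(img):
    """Per-column features for a binary word image (1 = ink, 0 = background).

    Returns an array of shape (n_columns, 4):
    [ink count, upper profile, lower profile, background/ink transitions].
    Empty columns get the full height as a sentinel profile (an assumption).
    """
    h, w = img.shape
    feats = np.zeros((w, 4))
    for j in range(w):
        col = img[:, j]
        ink = np.flatnonzero(col)                          # rows containing ink
        feats[j, 0] = len(ink)                             # foreground pixels
        feats[j, 1] = ink[0] if len(ink) else h            # upper profile
        feats[j, 2] = h - 1 - ink[-1] if len(ink) else h   # lower profile
        feats[j, 3] = np.count_nonzero(np.diff(col))       # transitions
    return feats
```

In the paper, each feature is additionally normalized to [0, 1] before matching.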
In this way, two word images can easily be compared using DTW. Given a word
image A with M columns and a word image B with N columns, a feature vector for
each column is computed and normalized to [0, 1]. Afterwards, the matrix D(i, j) (where
i = 1, ..., M and j = 1, ..., N) of distances is computed using dynamic programming:

    D(i, j) = min{ D(i, j-1), D(i-1, j), D(i-1, j-1) } + d^2(x_i, y_j),   (1)

    d^2(x_i, y_j) = Σ_{k=1}^{4} (f_k(a_i) - f_k(b_j))^2,   (2)

where f_k(a_i) corresponds to the kth feature of column i of image A, and f_k(b_j)
corresponds to the kth feature of column j of image B.
Performing backtracking along the minimum cost index pairs (i, j) starting from
(M, N) yields the warping path. Finally, the matching distance DTWCost(A, B) is
normalized by the length Z of this warping path; otherwise, longer time series would
have a higher matching cost than shorter ones:

    DTWCost(A, B) = D(M, N) / Z.   (3)
For word spotting, every query word is compared with the word images in the
database using the DTW algorithm.

DTW is robust to the elastic deformations typically found in handwritten words,
because the algorithm is able to distort the "time" axis to find the best matching.
In addition, DTW is able to handle word images of unequal length, allowing
comparison without resampling. Although there are some efficient and fast DTW
approaches in the literature,16,38 the original DTW is usually considered very time
consuming. The classic DTW method requires the computation of a matrix for
comparing each pair of words, which has a complexity order of O(n^2). Therefore,
in the case of large datasets, faster similarity measurements between words would be
preferable.
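Equations (1)-(3) can be sketched as follows, assuming two sequences of column features stored as NumPy arrays; this is a plain O(MN) illustration, not an optimized implementation:

```python
import numpy as np

def dtw_cost(A, B):
    """DTW distance between column-feature sequences A (M x d) and B (N x d),
    normalized by the warping-path length as in Eq. (3)."""
    M, N = len(A), len(B)
    D = np.full((M + 1, N + 1), np.inf)   # 1-based DP table with inf borders
    D[0, 0] = 0.0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            d2 = np.sum((A[i - 1] - B[j - 1]) ** 2)                        # Eq. (2)
            D[i, j] = d2 + min(D[i, j - 1], D[i - 1, j], D[i - 1, j - 1])  # Eq. (1)
    # backtrack along minimum-cost index pairs to recover the path length Z
    i, j, Z = M, N, 1
    while (i, j) != (1, 1):
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
        Z += 1
    return D[M, N] / Z                                                     # Eq. (3)
```

Note how two sequences that are merely out of phase (one column repeated) still match at zero cost, which is the robustness property discussed above.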
2.2. Bag-of-visual-words descriptor
In this section, we give the details of a word image representation that is based on the
bag-of-visual-words (BoVW) model powered by SIFT25 descriptors. The BoVW
model has already been used for handwritten word representation.37 The main
motivation for representing handwritten words by this model is that it has obtained
good performance in other Computer Vision scenarios despite its simplicity. The
advantages of the presented method are twofold. On the one hand, by using the
BoVW model we achieve robustness to occlusions and image deformations. This is a
great advantage in the context of word representations for spotting applications,
since we can handle noisy word segmentations. On the other hand, the use of local
descriptors adds invariance to changes of illumination and image noise. This is
important when dealing with historical documents, since the image conditions might
vary from page to page. Besides, the descriptors obtained by the BoVW model can be
compared using standard distances, and subsequently any statistical pattern
recognition technique can be applied.

The BoVW handwritten word representation involves the following steps. Initially
we need a reference set of some of the word images in order to
perform a clustering of the SIFT descriptors to build the codebook. Once we have the
codebook, the word images can be encoded by the BoVW model. In a last step, in
order to produce more robust word descriptors, we add some coarse spatial
information to the orderless BoVW model.
2.2.1. Codebook generation
For each word image in the reference set, we densely calculate SIFT descriptors
over a regular grid with a spacing of 5 pixels, using the method presented by Fulkerson et al.9
Three different scales, using bin sizes of 3, 5 and 10 pixels, are considered. These
parameters are related to the word size, and in our case have been experimentally set.
We can see in Fig. 2 an example of dense SIFT features extracted from a word image.
The selected scales guarantee that a descriptor covers either part of a character, a
complete character, or a character and its surroundings. As shown by Zhang et al.,41
the larger the amount of descriptors we extract from an image, the better the
performance of the BoVW model. Therefore, a dense sampling strategy has a clear
advantage over approaches that define their regions by using interest points. Since the
descriptors are densely sampled, some SIFT descriptors calculated in low-textured
regions are unreliable. Therefore, descriptors having a low gradient magnitude before
normalization are directly discarded.

Fig. 2. Dense SIFT features extracted from a word image.

Once the SIFT descriptors are calculated, by clustering the descriptor feature
space into k clusters we obtain the codebook that quantizes SIFT feature vectors into
visual words. We use the k-means algorithm to perform the clustering of the feature
vectors. In the experiments carried out in this paper, we use a codebook with a
dimensionality of k = 20,000 visual words.
2.2.2. BoVW feature vectors
For each word image, we extract the SIFT descriptors and quantize them
into visual words with the codebook. The visual word associated with a descriptor
corresponds to the index of the cluster the descriptor belongs to. The BoVW
feature vector for a given word is then computed by counting the occurrences of each
of the visual words in the image.
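Assuming a codebook of cluster centers has already been obtained with k-means, the quantization and counting step can be sketched as follows; the toy data and the function name are illustrative, not the authors' code (in the paper the codebook has k = 20,000 entries and the descriptors are 128-dimensional SIFT vectors):

```python
import numpy as np

def bovw_histogram(descriptors, codebook):
    """Quantize local descriptors (n x d) against a codebook of k centers (k x d)
    and return the k-bin visual-word occurrence histogram."""
    # squared Euclidean distance from every descriptor to every center
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)   # index of nearest center = visual word
    return np.bincount(words, minlength=len(codebook))
```

For a real codebook size, a nearest-neighbor index rather than this brute-force distance matrix would be used, but the output histogram is the same.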
2.2.3. Spatial information
The main drawback of bag-of-words-based models is that they do not take into
account the spatial distribution of the features. In order to add spatial information
to the orderless BoVW model, Lazebnik et al.21 proposed the Spatial Pyramid
Matching (SPM) method. This method roughly takes into account the visual word
distribution over the images by creating a pyramid of spatial bins.

This pyramid is recursively constructed by splitting the images into spatial bins
along the vertical and horizontal axes. At each spatial bin, a different BoVW
histogram is extracted. The resulting descriptor is obtained by concatenating all the
BoVW histograms. Therefore, the final dimensionality of the descriptor is determined
by the number of levels used to build the pyramid.

In our experiments, we have adapted the idea of SPM to the context of
handwritten word representation. We use the SPM configuration presented in Fig. 3,
where two different levels are used. The first level is the whole word image, and in the
second level we divide it into its right and left parts and its upper, central and lower
parts. With the proposed configuration we aim to capture information about the
ascenders and descenders of the words, as well as information about the right and left
parts of the words.

Since the amount of visual words assigned to each bin is lower at higher levels of
the pyramid, due to the fact that the spatial bins are smaller, the visual word
contributions are usually weighted. In our configuration, the visual words in the second
Fig. 3. Second level of the proposed SPM configuration. Ascender and descender information and the right and left parts of the words are captured.
level appearing in the upper and lower parts are weighted by w = 8, whereas the ones
appearing in the central parts are weighted by w = 4, while in the first level we use a
weight w = 1.

Since we use a two-level SPM with 7 spatial bins, we therefore obtain a final
descriptor of 140,000 dimensions for each word image.
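The weighted two-level pyramid can be sketched as follows; reading the second level of Fig. 3 as a 2x3 left/right x upper/central/lower grid (6 bins plus the whole image makes the 7 stated bins) is our interpretation, and the bin-indexing arithmetic is illustrative:

```python
import numpy as np

def spm_descriptor(words, positions, shape, k):
    """Two-level spatial-pyramid BoVW descriptor (7 bins -> 7*k dimensions).

    words:     integer visual-word index per sampled descriptor
    positions: (n, 2) array of (row, col) sampling locations
    shape:     (height, width) of the word image
    k:         codebook size (20,000 in the paper; tiny here)

    Level 1 is the whole image (w = 1); level 2 is a 2x3 grid where the
    upper and lower bins get w = 8 and the central bins get w = 4.
    """
    h, w = shape
    hists = [1.0 * np.bincount(words, minlength=k)]   # level 1, weight 1
    vert = np.minimum(positions[:, 0] * 3 // h, 2)    # 0 upper, 1 central, 2 lower
    horiz = np.minimum(positions[:, 1] * 2 // w, 1)   # 0 left, 1 right
    for v, weight in ((0, 8.0), (1, 4.0), (2, 8.0)):  # level-2 row weights
        for hz in (0, 1):
            mask = (vert == v) & (horiz == hz)
            hists.append(weight * np.bincount(words[mask], minlength=k))
    return np.concatenate(hists)
```

With k = 20,000 the concatenation of 7 weighted histograms yields the 140,000-dimensional descriptor quoted above.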
2.2.4. Normalization and distance computation
Finally, all the word descriptors are normalized using the L2-norm. In order
to assess whether two word images are similar or not, we use the cosine distance
between their feature vectors.
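The normalization and matching step can be sketched as follows (a minimal illustration; the function names are ours):

```python
import numpy as np

def l2_normalize(v):
    """Scale a descriptor to unit L2 norm (zero vectors are left unchanged)."""
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def cosine_distance(u, v):
    """1 - cosine similarity: 0 for identical directions, up to 2 for opposite."""
    return 1.0 - float(np.dot(l2_normalize(u), l2_normalize(v)))
```

After L2 normalization, ranking by cosine distance is equivalent to ranking by a dot product, which is convenient for large-scale retrieval.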
2.3. Pseudo-structural descriptor
Fernández et al. have proposed a pseudo-structural descriptor for word spotting4
based on the characteristic Loci features proposed by Glucksman11 for the
classification of mixed-font alphabets. A characteristic Loci feature consists of the number
of intersections in four directions: up, down, right and left (see Fig. 4). For each
background pixel in a binary image, and each direction, we count the number of
intersections (an intersection means a black/white transition between two consecutive
pixels). Hence, each keypoint generates a code, called a Locu number, of length
4. Characteristic Loci features were designed for digit and isolated letter recognition.
In our proposal we adapt the idea to handwritten word images. Thus, we are
encoding more complex images, with a potentially high number of classes and
intraclass variability due to the nature of handwriting. To cope with this, we have
introduced three variations in the basic descriptor:
• We have added the two diagonal directions, as we can see in Fig. 4. This gives more information to the descriptor and more robustness to the method.
• The range of the number of intersections is quantized and bounded in intervals. This allows a compact representation when the indexation structure is constructed, because similar Locu numbers are clustered in the same bucket.

Fig. 4. Characteristic Loci feature of a single point of the word page.
• Two modes are implemented to compute the feature vector: either background or foreground pixels are taken as keypoints.
Before encoding word images by Locu numbers, a preprocessing step binarizes the
image, and skeletons are computed by an iterative thinning operator until lines of
1-pixel width are obtained.

The feature vector (Locu number) is computed by assigning a label to each
background (or foreground) pixel, referred to as keypoints, as shown in Fig. 4. We
generate a Locu number of length 8 for each keypoint. These values correspond to the
counts of intersections with the skeletonized image along the 8 directions starting
from the keypoint. A word image is finally encoded as a histogram of Locu
numbers. Since the histograms of Locu numbers are integrated in an indexation
structure, we reduce the dynamic range of Locu numbers (the length of histograms). The
length of the histograms of Locu numbers increases exponentially with the number
of possible intersection counts. By delimiting the number of possible values
we reduce the number of combinations, and consequently the computational cost.
For example, with 3 possible values and 8 directions, we obtain 3^8 (6561)
combinations; with 4 possible values we have 4^8 (65,536). Thus, to reduce the dimension of
the feature space, the counts for the number of intersections have been limited to 3
values (0, 1 and 2).
Contrary to the original work,11 we do not define fixed upper bounds for
intersection counts. The mapping from the actual count to the range [0, 2] is defined
independently for each direction according to different intervals. The horizontal
direction has a larger interval than the vertical direction: in the original approach the
digits or characters have similar height and width, but in our case the width of
the words is usually bigger than the height, so the ranges of the intervals are made
directly proportional to the dimensions of the words. Diagonal directions use a
combination of the two other intervals. Table 1 shows the intervals for each direction.
According to the above encoding, for each keypoint (background or foreground
pixel), an eight-digit number in base 3 is obtained. For instance, the Locu number of
point P in Fig. 4 is (22111122)_3 = (6170)_10. It generates a vote in the corresponding
entry of the histogram of Locu numbers for this word image. Since the Locu numbers
range between 0 and 6561 (= 3^8), the length of the histograms (the dimension of the
feature space) is 6561. The histograms are used as indexation keys in the word spotting
Table 1. Intervals for each direction in the characteristic Loci features.

Direction     Value 0    Value 1    Value 2
Vertical      {0}        [1, 2]     [3, +∞)
Horizontal    {0}        [1, 4]     [5, +∞)
Diagonal      {0}        [1, 3]     [4, +∞)
process. The indexation strategy is outside the scope of this paper; the reader can find
further details in the original work.4
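As an illustration of the encoding, the following sketch computes a Locu number for one keypoint of a binary skeleton image. The per-direction bounds come from Table 1, but the digit ordering and ray-tracing details are our reading of the method, not the authors' implementation:

```python
import numpy as np

def quantize(count, hi):
    """Map a raw intersection count to {0, 1, 2}: 0 -> 0, [1, hi] -> 1, >hi -> 2
    (the per-direction bounds hi follow Table 1)."""
    if count == 0:
        return 0
    return 1 if count <= hi else 2

def locu_number(img, r, c):
    """Base-3 Locu number of length 8 for one keypoint (r, c) of a binary
    skeleton image: black/white transitions counted along 8 directions."""
    dirs = [(-1, 0), (1, 0), (0, -1), (0, 1),      # up, down, left, right
            (-1, -1), (-1, 1), (1, -1), (1, 1)]    # the four diagonals
    bounds = [2, 2, 4, 4, 3, 3, 3, 3]              # vertical, horizontal, diagonal
    h, w = img.shape
    digits = []
    for (dr, dc), hi in zip(dirs, bounds):
        ray, i, j = [], r, c
        while 0 <= i < h and 0 <= j < w:           # walk the ray to the border
            ray.append(img[i, j])
            i, j = i + dr, j + dc
        transitions = int(np.count_nonzero(np.diff(ray)))
        digits.append(quantize(transitions, hi))
    n = 0                                          # eight base-3 digits -> integer
    for d in digits:
        n = n * 3 + d
    return n
```

A word image would then be encoded as a 6561-bin histogram accumulating the Locu numbers of all its keypoints.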
2.4. Graph-based descriptor
Graphs have long been adopted by the research community as a robust
tool for representing the structural information of images. Graph theory provides efficient
methods to compare structural representations by means of a graph isomorphism
formulation. However, graph representations are rarely used in HWR. The inherent
elastic variability of handwritten strokes has to be modeled in the graphs, and
therefore the recognition involves the implementation of an error-tolerant graph
isomorphism, which is computationally expensive (it belongs to the class
of NP-complete problems). In addition, the use of graphs in a word spotting problem
would require multiple matchings between the query graph and the graphs
representing words in the database.
To have a complete view of the performance of the different families of
representations, we have adapted a graph matching approach3 to the word spotting
problem. The graph representation of a word image is constructed from its skeletal
information. The skeleton of the image is polygonally approximated using a
vectorization algorithm.35 This particular vectorization algorithm has only a
pruning parameter, which is used to eliminate very small noise from the image.
In our case we set the parameter to 5, i.e. all isolated components of total pixel size
less than or equal to 5 are eliminated. Graph nodes then correspond to critical
points in the skeleton (extrema, corners or crossings), and the lines joining them are
considered as the edges of the graph (see Fig. 5). As can be seen in the example,
when the skeletons are vectorized, a rough polygonal approximation is intentionally
done to keep the main structural properties in the graph while neglecting the details.
Otherwise, two instances of the same word would be converted into very different
graphs.

Fig. 5. Graph representation from the skeleton of a word. The critical points detected by the vectorization method are considered as the nodes, and the lines joining them are considered as the edges.

Once words are represented by graphs, word spotting is solved by a graph
matching algorithm. To avoid the computational burden, we have proposed a
method based on graph serialization. Graph serialization consists of decomposing
1263002-12
Word Representations for Handwritten Word Spotting
graphs in one-dimensional structures like attributed strings, so graph matching can
be reduced to the combination of matchings of one-dimensional structures. In addition, graphs are factorized, i.e. represented in terms of common one-dimensional
substructures. Thus, given a graph, we serialize it extracting its acyclic paths. Paths
are then clustered, so the representation can be seen as a bag-of-paths (BoP)
descriptor for describing the words.
An acyclic path between two connected nodes in a graph is the sequence of line segments from the source node to the destination node, in order. To describe a word with the BoP descriptor, we compute all possible acyclic paths between each pair of connected nodes in the graph corresponding to the word image, and repeat the same procedure for all the word images considered for recognition. A vector of attributes capturing its shape information is associated to each path; in our case we use Zernike moment descriptors of order 7 for this purpose. These settings were experimentally chosen to give the best performance. Let P denote the set of descriptors of all graph paths; a set S ⊆ P of prototype paths is selected from P by a random prototype selection technique. Each path in a word is then classified as one of the prototype paths, and the graph representing the word is ultimately encoded as the histogram of the number of occurrences of the prototype paths (see Fig. 6). For this experiment we select |S| = 500, which determines the dimension of the BoP descriptor; this parameter was also chosen to give the best performance. The Euclidean distance is used to compare two BoP representations.

Fig. 6. The histogram of the number of occurrences of the prototype paths represents the graph extracted from the handwritten word image.
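A minimal sketch of the BoP pipeline follows: path enumeration by depth-first search and histogram construction by nearest-prototype assignment. Path descriptors are taken as given (the paper uses order-7 Zernike moments); function names and the toy graph are illustrative assumptions.

```python
import math

def acyclic_paths(adj):
    """All acyclic (simple) paths between every pair of connected
    nodes, as node sequences (standing in for segment sequences)."""
    paths = []
    def dfs(path, visited):
        for v in adj[path[-1]]:
            if v not in visited:
                paths.append(path + [v])
                dfs(path + [v], visited | {v})
    for start in adj:
        dfs([start], {start})
    return paths

def bop_histogram(path_descriptors, prototypes):
    """Assign each path descriptor to its nearest prototype
    (Euclidean) and count occurrences: the bag-of-paths vector."""
    hist = [0] * len(prototypes)
    for d in path_descriptors:
        nearest = min(range(len(prototypes)),
                      key=lambda k: math.dist(d, prototypes[k]))
        hist[nearest] += 1
    return hist

def euclidean(h1, h2):
    """Distance used to compare two BoP representations."""
    return math.dist(h1, h2)

# Toy usage: a three-node chain 0-1-2; each ordered pair of connected
# nodes contributes its simple paths.
adj = {0: {1}, 1: {0, 2}, 2: {1}}
paths = acyclic_paths(adj)
```

In a real setting the prototypes would be the |S| = 500 randomly selected paths, and the descriptors would be the Zernike vectors of each path.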
3. Experimental Results
To carry out the performance study of the different methods we used two different data collections of historical handwritten documents. Let us first give the details of the collections and the evaluation measures, and then proceed with the analysis of the results.
3.1. Datasets and evaluation measures
To perform the experiments, we used two different datasets of historical handwritten documents. The first one consists of a set of 20 pages from a collection of letters by George Washington.33 The second evaluation corpus contains 27 pages from a collection of marriage registers from the Barcelona Cathedral.4 Both collections are accurately segmented into words and transcribed. In the George Washington collection we have a total of 4860 segmented words with 1124 different transcriptions, whereas in the Barcelona Cathedral collection we have 6544 word snippets with 1751 different transcriptions. All words having at least three characters and appearing at least ten times in the collections were selected as queries. For the George Washington collection we have 1847 queries corresponding to 68 different words, and for the Barcelona Cathedral collection we have 514 queries from 32 different words. Both subsets selected for the experiments are single-writer collections.
In order to evaluate the performance of the different word representation methods in a word spotting framework, we have chosen performance assessments based on precision and recall40 measures. Let us briefly review the measures used.
Given a query, let rel be the set of relevant objects with regard to the query and ret the set of elements retrieved from the database. The precision and recall ratios are then defined as follows:
Precision = |ret ∩ rel| / |ret|,    Recall = |ret ∩ rel| / |rel|.    (4)
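Equation (4) can be transcribed directly; a minimal sketch, assuming the retrieved and relevant items are given as lists of identifiers:

```python
def precision_recall(retrieved, relevant):
    """Eq. (4): precision = |ret ∩ rel| / |ret|,
    recall = |ret ∩ rel| / |rel|."""
    hits = len(set(retrieved) & set(relevant))
    return hits / len(retrieved), hits / len(relevant)

# Hypothetical example: 4 retrieved words, 2 of the 3 relevant found.
p, r = precision_recall(["w1", "w2", "w3", "w4"], ["w2", "w4", "w9"])
```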
In order to give a better idea of how the different systems rank the relevant words, we use the following metrics. The precision at n, or P@n, is obtained by computing the precision at a given cut-off rank, considering only the n topmost results returned by the system. In this evaluation we report results at ranks 10 and 20. In a similar fashion, the R-precision is the precision computed at the Rth position of the ranking, R being the number of relevant words for that specific query. The R-precision is usually correlated with the mean average precision (mAP), which is computed from the precision values obtained after truncating the ranked list at each relevant item. For a given query, let r(n) be a binary function on the relevance of the nth item in the returned ranked list; the mean average precision is then defined as follows:

mAP = ( Σ_{n=1}^{|ret|} P@n · r(n) ) / |rel|.    (5)
Finally, we also use the normalized discounted cumulative gain13 (nDCG) evaluation measure. For a given query, the discounted cumulative gain DCG_n at a particular rank position n is defined as:

DCG_n = rel_1 + Σ_{i=2}^{n} rel_i / log2(i).    (6)

The normalized version, nDCG_n, is obtained by dividing by the DCG of an ideal ranking (IDCG) at position n. It is computed as:

nDCG_n = DCG_n / IDCG_n.    (7)
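The ranked-retrieval measures of Eqs. (5)-(7) can be sketched as follows, over a binary relevance list r(1..N); R-precision is simply precision_at(ranked_rel, R). This is an illustrative transcription, not the evaluation code used in the experiments.

```python
import math

def precision_at(ranked_rel, n):
    """P@n over a binary relevance list."""
    return sum(ranked_rel[:n]) / n

def average_precision(ranked_rel, n_relevant):
    """Eq. (5): sum of P@n over relevant positions, divided by |rel|;
    the mAP of a system averages this value over all queries."""
    score = 0.0
    for n, r in enumerate(ranked_rel, start=1):
        if r:
            score += precision_at(ranked_rel, n)
    return score / n_relevant

def ndcg(ranked_rel, n):
    """Eqs. (6)-(7): DCG_n = rel_1 + sum_{i=2..n} rel_i / log2(i),
    normalized by the DCG of an ideal (sorted) ranking."""
    def dcg(rels):
        return rels[0] + sum(r / math.log2(i)
                             for i, r in enumerate(rels[1:], start=2))
    ideal = sorted(ranked_rel, reverse=True)
    return dcg(ranked_rel[:n]) / dcg(ideal[:n])
```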
3.2. Results
We present in Figs. 7 and 8 an example of qualitative results for a given query for the Barcelona Cathedral and the George Washington collections, respectively. We can see that all the methods present some false positives among the first ten responses. However, it is interesting to notice that these false-positive words are in most cases similar to the query in terms of shape. In Fig. 7, when asking for the word "Farrer", we obtain similar results such as "Farer", "Ferrer", "Carner", "Fuster" or "Serra". In the case of the George Washington collection, the behavior is similar: when asking for the word "Company" we obtain as false alarms similar words such as "Conway", "Commissary", "Companies", "Country", "complete", etc.
Fig. 7. Example of qualitative results for the Barcelona Cathedral collection. (a) DTW-based method, (b) BoVW method, (c) pseudo-structural method and (d) structural method.

In Fig. 9 we present the precision and recall plots for both collections, and in Tables 2 and 3 we report the evaluation measures for the George Washington and the Barcelona Cathedral collections, respectively.
We can see that in both scenarios the BoVW method outperforms the other three. It is, however, interesting to notice that the George Washington collection seems to be easier for the BoVW method, whereas the Barcelona Cathedral collection is the one where we obtain better results for the other three methods. The BoVW method extracts the word representation directly from the gray-level word images, whereas the DTW-based, the pseudo-structural and the structural methods need a preliminary binarization step. The George Washington collection contains many degraded word images that might be difficult to binarize, hindering the performance of the word representations whose preprocessing includes binarization. We show an example of the performance of the Otsu binarization process for the two collections in Fig. 11. As we can appreciate, the binarization is quite robust and stable for the Barcelona Cathedral dataset, whereas the George Washington collection is noisier and the binarization performs quite differently for words of the same class.
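For reference, Otsu's global threshold can be sketched as follows: a plain-Python version over a flat list of gray values, illustrating the principle rather than the exact implementation used in the experiments.

```python
def otsu_threshold(gray):
    """Otsu's method: choose the threshold maximizing the
    between-class variance of the gray-level histogram
    (separating ink pixels from paper background)."""
    hist = [0] * 256
    for v in gray:
        hist[v] += 1
    total = len(gray)
    total_sum = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_b = sum_b = 0
    for t in range(256):
        w_b += hist[t]                      # background weight
        if w_b == 0 or w_b == total:
            continue
        sum_b += t * hist[t]
        w_f = total - w_b                   # foreground weight
        m_b = sum_b / w_b                   # background mean
        m_f = (total_sum - sum_b) / w_f     # foreground mean
        var_between = w_b * w_f * (m_b - m_f) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t
```

Degraded images break the bimodality assumption behind this criterion, which is consistent with the unstable results observed on the George Washington collection.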
Fig. 8. Example of qualitative results for the George Washington collection. (a) DTW-based method, (b) BoVW method, (c) pseudo-structural method and (d) structural method.

Even if the BoVW method performs better than the rest in both cases, it also presents an important drawback: its feature vector is huge (140,000 dimensions) compared with the other three methods. This is an issue when facing large collections, since the feature vectors must fit in memory for real-time retrieval. Indexing techniques such as LSH10 or product quantization14 should be taken into account. However, even if those indexing methods reduce the size of the descriptor so that it can be used in real scenarios, they also harm its discriminative power, decreasing its performance.
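To illustrate the kind of compression involved, random-hyperplane LSH turns a high-dimensional descriptor into a short binary signature compared under Hamming distance. This is a generic sketch of the technique behind Ref. 10, not the authors' pipeline, and all names are illustrative.

```python
import random

def random_hyperplanes(dim, n_bits, seed=0):
    """One Gaussian random hyperplane per output bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

def lsh_code(vec, planes):
    """The sign of the projection on each hyperplane gives one bit:
    e.g. a 140,000-d BoVW vector becomes an n_bits signature."""
    return tuple(int(sum(p * v for p, v in zip(plane, vec)) >= 0)
                 for plane in planes)

def hamming(a, b):
    """Hamming distance between two binary codes (approximates the
    angular distance between the original vectors)."""
    return sum(x != y for x, y in zip(a, b))

planes = random_hyperplanes(8, 16, seed=42)
code = lsh_code(list(range(8)), planes)
```

The loss of discriminative power mentioned above corresponds to the quantization error introduced by keeping only the signs of the projections.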
Notice that the pseudo-structural method performs very close to the DTW reference method for the Barcelona Cathedral collection. However, computing distances between words with DTW is very time consuming, since the algorithm has O(n²) complexity. On the other hand, the pseudo-structural word representation is a compact signature that can be easily indexed, and thus scales much better than a DTW-based method.
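The quadratic DTW computation referred to here can be sketched as follows, using 1-D features for brevity; the per-column profile features of Ref. 32 would replace the scalar cost.

```python
def dtw_distance(a, b):
    """Classic O(n*m) dynamic time warping between two feature
    sequences, filling the cumulative cost matrix D."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])   # local dissimilarity
            D[i][j] = cost + min(D[i - 1][j],      # insertion
                                 D[i][j - 1],      # deletion
                                 D[i - 1][j - 1])  # match
    return D[n][m]
```

The nested loops make the quadratic cost explicit: spotting one query against a collection of N words requires N such matrix computations, which motivates the compact, indexable signatures discussed above.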
Fig. 9. Precision and Recall curves for the (a) George Washington collection and (b) Barcelona Cathedral
collection.
Table 2. Retrieval results for the George Washington collection.

Method          P@10    P@20    R-Precision    mAP      nDCG
DTW             0.346   0.286   0.191          0.169    0.539
BoVW            0.606   0.523   0.412          0.422    0.726
Pseudo-Struct   0.183   0.149   0.096          0.072    0.431
Structural      0.059   0.049   0.036          0.028    0.362
Table 3. Retrieval results for the Barcelona Cathedral collection.

Method          P@10    P@20    R-Precision    mAP      nDCG
DTW             0.288   0.201   0.214          0.192    0.503
BoVW            0.378   0.289   0.303          0.300    0.592
Pseudo-Struct   0.273   0.189   0.199          0.178    0.476
Structural      0.155   0.120   0.118          0.097    0.382
Finally, we can see that the structural method performs poorly on both collections. This word representation needs a preliminary step that transforms the raster image into a vectorial image in order to build the graph representing each word. When working with historical documents, even slight degradations strongly affect the vectorization process, leading to an important performance loss for this word representation. Note, however, that noise affects the time complexity of computing the paths more than it affects accuracy: noise in the image can make the vectorization method create spurious critical points, which increases the path computation time, but since spurious points do not change the overall structure of a path, the performance does not change drastically.
Fig. 10. Scalability test results for the (a) George Washington collection and (b) Barcelona Cathedral
collection.
In Fig. 10 we can see the results of a scalability test. For each query with R relevant words in the collection, we performed the retrieval experiments while iteratively increasing the number of negative samples in the considered dataset by a factor of R. In the first step, d = 0, the collection contains just the R relevant items, so all methods yield a mean average precision of 1. In the next iteration, d = 1, we add a random selection of R nonrelevant items; at d = 2 the collection consists of the R relevant words and 2R nonrelevant words, and so on. We can see that, again, the BoVW method outperforms the rest in both collections. For the George Washington collection, the pseudo-structural and structural methods present a sudden drop in performance in the early steps, that is, adding just a few negative samples to the dataset already hinders the performance of these methods considerably. On the other hand, for the
Fig. 11. Binarization examples for different word instances of the same class for the (a) George Washington collection and (b) Barcelona Cathedral collection.
Table 4. Pros and cons summary. Each word representation is rated qualitatively on six criteria: size, time, indexability, preprocessing, performance and scalability. The limiting factor for scalability is the descriptor size for BoVW, the matching time for DTW, and the discriminative power for both the pseudo-structural and the structural methods.
Barcelona Cathedral collection, increasing the dataset does not entail such a performance loss, and both the pseudo-structural and the structural methods show the same behavior in the early stages. We can also note that, even if the methods present a sudden performance drop in the early stages, the performance tends to remain stable in the last stages.
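The scalability protocol described above can be sketched as follows, assuming precomputed query-to-item distances; all names are hypothetical.

```python
import random

def scalability_curve(query_dist, relevant_ids, distractor_ids, steps, seed=0):
    """At step d the database holds the R relevant items plus d*R
    randomly chosen distractors; items are ranked by distance to the
    query and the average precision is recorded at each step.
    `query_dist` maps item id -> distance to the query."""
    rng = random.Random(seed)
    R = len(relevant_ids)
    pool = list(distractor_ids)
    rng.shuffle(pool)
    curve = []
    for d in range(steps + 1):
        database = list(relevant_ids) + pool[:d * R]
        ranked = sorted(database, key=query_dist.get)
        rel = set(relevant_ids)
        hits, ap = 0, 0.0
        for n, item in enumerate(ranked, 1):
            if item in rel:
                hits += 1
                ap += hits / n
        curve.append(ap / R)
    return curve

# At d = 0 the database contains only the relevant items, so AP is 1.
dists = {"r0": 0.1, "r1": 0.2, "d0": 0.05, "d1": 0.9}
curve = scalability_curve(dists, ["r0", "r1"], ["d0", "d1"], steps=1)
```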
In order to summarize the analysis of the results of this study, we present in Table 4 the pros and cons of each word representation, highlighting the strengths and weaknesses of each method. Concerning the complexity of the methods, both the BoVW and the DTW-based methods present some deficiencies: the feature vectors of the BoVW method occupy a lot of memory, whereas the time complexity of the distance computation is high for the DTW method. On the other hand, both the structural and pseudo-structural methods are efficient in terms of space and time, which makes them good candidates for indexing. As shown above, the need for preprocessing steps might severely hinder the performance of the methods. In that sense, the BoVW method, which can be computed directly from the gray-level images, is a better choice than the methods that need a binarization step, and much better than the structural method, which needs a vectorization step in order to build the word graphs. This is also reflected in the observed performance of these methods. Finally, notice that none of the word representations under study scales well, whether because of high dimensionality, the complexity of the distance computation, or the performance drop when increasing the dataset size.
4. Conclusion
HWR approaches usually combine morphological (i.e. shape) and language models. When dealing with historical documents, for different reasons, the language model is not always available or useful. In word spotting applications, where the purpose is to retrieve a page containing a queried word from a large database, the representation of the shape of the word becomes highly relevant. In this paper we have compared different descriptors to assess the strengths and weaknesses of statistical and structural models when recognizing handwritten words in a word spotting application. We have compared four approaches, namely sequence alignment using DTW as a classical reference, a BoVW approach as statistical model, a
pseudo-structural model based on a Loci features representation, and a structural approach where words are represented by graphs. The four approaches have been tested on two collections of historical data.
The overall conclusion we can draw from the results described in Sec. 3 is that the best performance is achieved with a statistical model based on the image photometry. In our work, it has been implemented with a local descriptor based on a BoVW representation constructed from dense SIFT features extracted from the word image. The main drawback of this approach is the high memory requirement of storing such a large descriptor, which makes it unsuitable for a retrieval framework unless dimensionality reduction and hashing strategies are implemented, which would in turn harm its performance. The main drawback of DTW is that it is very time consuming, so it would be difficult to integrate into a real-time retrieval application. However, DTW is robust to the elastic deformations typically found in handwritten words, so it appears to be more robust in multiple-writer scenarios. As we could expect, the methods based on structural features underperform the statistical ones. The ones we have implemented rely on the geometric and topological information of the image contours or skeletons, and this information varies a lot among different instances of the same word class. A good point of the approaches we have presented is their ability to be integrated in a retrieval framework, although, of course, this is not an intrinsic feature of structural methods.
There are many parameters to take into account when implementing a word spotting application. Depending on the characteristics that define the system, the most appropriate method can be selected. We identify the following factors as relevant when designing a word spotting system. First, the availability of a lexicon that allows a language model to be inferred; a language model requires enough statistically representative training data. Intra-class variability and inter-class separability are also key issues. These might depend on whether we have a single- or multiple-writer scenario, whether the number of classes is known a priori, or whether the collection covers a large period of time with strong variation in script styles. When the designed application requires real-time retrieval, it is important to choose a compact descriptor; in that case, the loss of performance can be compensated with a relevance-feedback paradigm. Another relevant factor is the ability to compile off-line the whole set of words of the collection in order to build an indexation structure for the retrieval process. A last consideration is the querying mode, namely query-by-example or query-by-string. The latter requires learning models from individual characters, while the former can be performed "on the fly" by asking the user to crop an example word as a query.
As a final conclusion regarding feature descriptors, there is no universal representation able to cope with all the above criteria. Although in our study a statistical descriptor outperforms the others, the use of structural information cannot be neglected. The indexability of the descriptor is also very important: a representation that is very accurate and performs well on small sets can dramatically reduce its performance when used to retrieve large collections of document images.
Acknowledgments
This work has been partially supported by the Spanish Ministry of Education and Science under projects TIN2008-04998, TIN2009-14633-C03-03 and Consolider Ingenio 2010: MIPRCV (CSD2007-00018), the EU project ERC-2010-AdG-20100407-269796, the grant 2009-SGR-1434 of the Generalitat de Catalunya, and the research grants UAB-471-01-8/09 and AGAUR-2011FI-B-01022.
References
1. H. Cao and V. Govindaraju, Template-free word spotting in low-quality manuscripts, in Proc. Sixth Int. Conf. Advances in Pattern Recognition (2007), pp. 135–139.
2. T. van der Zant, L. Schomaker and K. Haak, Handwritten-word spotting using biologically inspired features, IEEE Trans. Pattern Anal. Mach. Intell. 30(11) (2008) 1945–1957.
3. A. Dutta, J. Lladós and U. Pal, Symbol spotting in line drawings through graph paths hashing, in Proc. Eleventh Int. Conf. Document Analysis and Recognition (2011), pp. 982–986.
4. D. Fernández, J. Lladós and A. Fornés, Handwritten word spotting in old manuscript images using a pseudo-structural descriptor organized in a hash structure, in Pattern Recognition and Image Analysis, Lecture Notes in Computer Science, Vol. 6669 (2011), pp. 628–635.
5. A. Filatov, A. Gitis and I. Kil, Graph-based handwritten digit string recognition, in Proc. Third Int. Conf. Document Analysis and Recognition (1995), pp. 845–849.
6. A. Fischer, K. Riesen and H. Bunke, Graph similarity features for HMM-based handwriting recognition in historical documents, in Proc. Int. Conf. Frontiers in Handwriting Recognition (2010), pp. 253–258.
7. V. Frinken, A. Fischer, H. Bunke and R. Manmatha, Adapting BLSTM neural network based keyword spotting trained on modern data to historical documents, in Proc. Twelfth Int. Conf. Frontiers in Handwriting Recognition (2010), pp. 352–357.
8. V. Frinken, A. Fischer, R. Manmatha and H. Bunke, A novel word spotting method based on recurrent neural networks, IEEE Trans. Pattern Anal. Mach. Intell. 34(2) (2012) 211–224.
9. B. Fulkerson, A. Vedaldi and S. Soatto, Localizing objects with smart dictionaries, in Computer Vision — ECCV, Lecture Notes in Computer Science, Vol. 5302 (2008), pp. 179–192.
10. A. Gionis, P. Indyk and R. Motwani, Similarity search in high dimensions via hashing, in Proc. Twenty-Fifth Int. Conf. Very Large Data Bases (1999), pp. 518–529.
11. H. A. Glucksman, Classification of mixed-font alphabetics by characteristic loci, Digest of the First Annual IEEE Computer Conference (1967), pp. 138–141.
12. A. J. Hsieh, K. C. Fan and T. I. Fan, Bipartite weighted matching for on-line handwritten Chinese character recognition, Pattern Recogn. 28(2) (1995) 143–151.
13. K. Järvelin and J. Kekäläinen, Cumulated gain-based evaluation of IR techniques, ACM Trans. Inform. Syst. 20(4) (2002) 422–446.
14. H. Jégou, M. Douze and C. Schmid, Product quantization for nearest neighbor search, IEEE Trans. Pattern Anal. Mach. Intell. 33(1) (2011) 117–128.
15. P. Keaton, H. Greenspan and R. Goodman, Keyword spotting for cursive document retrieval, in Proc. Workshop on Document Image Analysis (1997), pp. 74–81.
16. E. Keogh and C. A. Ratanamahatana, Exact indexing of dynamic time warping, Knowl. Inform. Syst. 7(3) (2005) 358–386.
17. A. L. Kesidis, E. Galiotou, B. Gatos and I. Pratikakis, A word spotting framework for historical machine-printed documents, Int. J. Doc. Anal. Recogn. 14(2) (2011) 131–144.
18. Y. Kessentini, T. Paquet and A. M. Ben Hamadou, Off-line handwritten word recognition using multi-stream hidden Markov models, Pattern Recogn. Lett. 31(1) (2010) 60–70.
19. J. B. Kruskal and M. Liberman, The symmetric time-warping problem: From continuous to discrete, in Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence Comparison (CSLI Publications, Stanford, CA, 1983), pp. 125–161.
20. V. Lavrenko, T. M. Rath and R. Manmatha, Holistic word recognition for handwritten historical documents, in Proc. First Int. Workshop on Document Image Analysis for Libraries (2004), pp. 278–287.
21. S. Lazebnik, C. Schmid and J. Ponce, Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories, in Proc. Conf. Computer Vision and Pattern Recognition (2006), pp. 2169–2178.
22. Y. Leydier, F. Lebourgeois and H. Emptoz, Text search for medieval manuscript images, Pattern Recogn. 40(12) (2007) 3552–3567.
23. Y. Leydier, A. Ouji, F. LeBourgeois and H. Emptoz, Towards an omnilingual word retrieval system for ancient manuscripts, Pattern Recogn. 42(9) (2009) 2089–2105.
24. D. Lopez and J. M. Sempere, Handwritten digit recognition through inferring graph grammars. A first approach, in Proc. Joint IAPR Int. Workshops on Advances in Pattern Recognition (1998), pp. 483–491.
25. D. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis. 60(2) (2004) 91–110.
26. S. Madhvanath and V. Govindaraju, The role of holistic paradigms in handwritten word recognition, IEEE Trans. Pattern Anal. Mach. Intell. 23(2) (2001) 149–164.
27. R. Manmatha, C. Han and E. M. Riseman, Word spotting: A new approach to indexing handwriting, in Proc. IEEE Conf. Computer Vision and Pattern Recognition (1996), pp. 631–635.
28. S. Marinai, E. Marino and G. Soda, Indexing and retrieval of words in old documents, in Proc. Seventh Int. Conf. Document Analysis and Recognition (2003), pp. 223–227.
29. U. V. Marti and H. Bunke, Using a statistical language model to improve the performance of an HMM-based cursive handwriting recognition system, Int. J. Pattern Recogn. Artif. Intell. 15(1) (2001) 65–90.
30. R. F. Moghaddam and M. Cheriet, Application of multi-level classifiers and clustering for automatic word spotting in historical document images, in Proc. Tenth Int. Conf. Document Analysis and Recognition (2009), pp. 511–515.
31. T. M. Rath and R. Manmatha, Features for word spotting in historical manuscripts, in Proc. Seventh Int. Conf. Document Analysis and Recognition (2003), pp. 218–222.
32. T. M. Rath and R. Manmatha, Word image matching using dynamic time warping, in Proc. Conf. Computer Vision and Pattern Recognition (2003), pp. 521–527.
33. T. M. Rath and R. Manmatha, Word spotting for historical documents, Int. J. Doc. Anal. Recogn. 9(2–4) (2007) 139–152.
34. J. A. Rodríguez-Serrano and F. Perronnin, Handwritten word-spotting using hidden Markov models and universal vocabularies, Pattern Recogn. 42(9) (2009) 2106–2116.
35. P. Rosin and G. West, Segmentation of edges into lines and arcs, Image Vis. Comput. 7(2) (1989) 109–114.
36. J. L. Rothfeder, S. Feng and T. M. Rath, Using corner feature correspondences to rank word images by similarity, in Proc. Computer Vision and Pattern Recognition Workshop (2003), pp. 30–35.
37. M. Rusiñol, D. Aldavert, R. Toledo and J. Lladós, Browsing heterogeneous document collections by a segmentation-free word spotting method, in Proc. Eleventh Int. Conf. Document Analysis and Recognition (2011), pp. 63–67.
38. S. Salvador and P. Chan, Toward accurate dynamic time warping in linear time and space, Intell. Data Anal. 11(5) (2007) 561–580.
39. P. N. Suganthan and H. Yan, Recognition of handprinted Chinese characters by constrained graph matching, Image Vis. Comput. 16(3) (1998) 191–201.
40. C. J. van Rijsbergen, Information Retrieval (Butterworth-Heinemann, Newton, MA, USA, 1979).
41. J. Zhang, M. Marszałek, S. Lazebnik and C. Schmid, Local features and kernels for classification of texture and object categories: A comprehensive study, Int. J. Comput. Vis. 73(2) (2007) 213–238.
Josep Lladós received his degree in Computer Sciences in 1991 from the Universitat Politècnica de Catalunya and his Ph.D. in Computer Sciences in 1997 from the Universitat Autònoma de Barcelona (Spain) and the Université Paris 8 (France). Currently he is an Associate Professor at the Computer Sciences Department of the Universitat Autònoma de Barcelona and a staff researcher at the Computer Vision Center, where he has been the director since January 2009. He is the head of the Pattern Recognition and Document Analysis Group (2009SGR-00418). He is chair holder of Knowledge Transfer of the UAB Research Park and Santander Bank. His current research fields are document analysis, graphics recognition and structural and syntactic pattern recognition. He has been the head of a number of Computer Vision R+D projects and published more than 100 papers in national and international conferences and journals. Dr. Lladós is an active member of the Image Analysis and Pattern Recognition Spanish Association (AERFAI), a member society of the IAPR. He is currently the chairman of the IAPR-ILC (Industrial Liaison Committee). Previously he served as chairman of the IAPR TC-10, the Technical Committee on Graphics Recognition, and he is also a member of the IAPR TC-11 (Reading Systems) and IAPR TC-15 (Graph-based Representations). He serves on the Editorial Board of ELCVIA (Electronic Letters on Computer Vision and Image Analysis) and IJDAR (International Journal on Document Analysis and Recognition), and is also a PC member of a number of international conferences. He was the recipient of the IAPR-ICDAR Young Investigator Award in 2007. Dr. Lladós also has experience in technological transfer and in 2002 he created the company ICAR Vision Systems, a spin-off of the CVC/UAB.
Marçal Rusiñol received his B.Sc. and M.Sc. degrees in Computer Sciences from the Universitat Autònoma de Barcelona (UAB), Barcelona, Spain, in 2004 and 2006, respectively. In 2004 he joined the Computer Vision Center, where he obtained his Ph.D. under the supervision of Dr. Josep Lladós in 2009. He was also a Teaching Assistant at the Computer Sciences Department of the Universitat Autònoma de Barcelona between 2005 and 2009. He currently holds a Marie Curie fellowship at ITESOFT in France. His main research interests include graphics recognition, structural pattern recognition, multimedia retrieval and performance evaluation.
Alicia Fornés received her B.S. degree from the Universitat de les Illes Balears (UIB) in 2003 and her M.S. degree from the Universitat Autònoma de Barcelona (UAB) in 2005. She obtained her Ph.D. on writer identification of old music scores from the UAB in 2009. She was the recipient of the AERFAI (Image Analysis and Pattern Recognition Spanish Association) best thesis award 2009–2010. She is currently a postdoctoral researcher at the Computer Vision Center. Her research interests include document analysis, symbol recognition, optical music recognition, historical documents, handwriting recognition and writer identification.
David Fernández graduated in Computer Science from the Universitat Jaume I of Castellón. He received his M.S. degree in Computer Vision and Artificial Intelligence from the Universitat Autònoma de Barcelona, Barcelona, Spain in 2010. Currently he is pursuing his Ph.D. at the Centre de Visió per Computador, Barcelona, Spain under the supervision of Dr. Josep Lladós and Dr. Alicia Fornés. In his Ph.D. he is working on historical handwritten documents. His main research interests include enhancement and segmentation of documents and word spotting.
Anjan Dutta received his M.S. degree in Computer Vision and Artificial Intelligence from the Universitat Autònoma de Barcelona, Barcelona, Spain in 2010. Currently he is pursuing his Ph.D. at the Centre de Visió per Computador, Barcelona, Spain under the supervision of Dr. Josep Lladós and Dr. Umapada Pal. In his Ph.D. he is working on subgraph matching applied to symbol spotting in graphical documents. His main research interests include efficient subgraph matching, graph indexing, graphics recognition and structural pattern recognition.