Document Segmentation Using Relative Location Features
Francisco Cruz Fernández
Computer Vision Center - Universitat Autònoma de Barcelona
[email protected]
Abstract In this paper we present a generic document segmentation algorithm that uses Relative Location Features
to encode the spatial relationships between the different image entities. We show that using these features improves
the final segmentation result on documents with a Manhattan structure, while also achieving results up to the state
of the art in other reference tasks.
Problem definition Document segmentation is a well-known task in the Document Analysis field which
consists in identifying the different entities or regions that form the document image. In computer vision
terms, we can define it as a labeling problem in which we want to assign to each image pixel one label
from a finite set of k possible labels L = {c1, ..., ck}.
Common issues in this task stem from the great variability in the types of documents to process, together
with other factors that make the problem even more challenging, such as the degradation of ancient documents or complex layouts. Given this situation, it is common to find segmentation works devised for
a particular type of document. Among the usual techniques, projection profiles have proved to obtain good results on documents with a Manhattan layout, while methods based on texture analysis or connected components
are more commonly used on documents with a non-Manhattan layout. In addition to these approaches, it has been shown
that the inclusion of contextual and spatial information into the models improves the quality of the results [1].
In this scope, graphical models such as Conditional Random Fields (CRFs) offer a powerful framework to represent
these relationships.
Proposal In this paper we propose a CRF framework in order to find the optimal labeling configuration
for an input image I that maximizes the a posteriori probability P(C|I), where C is the set composed of the
label values corresponding to each pixel p in the image, given the set of class labels L. Here, the use of a CRF
allows us to encode pairwise dependencies between neighbouring pixels and to compute the previous probability
distribution in terms of energy minimization as follows:

    P(C|I) = (1/Z) exp( − Σi Di(ci) − Σ{i,j}∈N Vi,j(ci, cj) )    (1)

where the term Di(ci) represents how well the label ci fits the i-th pixel of the image I. The term Vi,j
represents the pairwise potential over the set N of interacting neighbor pixels; in other words, the penalty of
assigning labels ci and cj to pixels pi and pj, respectively.
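As a concrete illustration of Eq. (1), the following sketch evaluates the exponent of the CRF posterior (the energy) for a given labeling on a 4-connected pixel grid. The function name and the Potts-style pairwise penalty are our own assumptions for illustration; the paper does not commit to a specific form of Vi,j at this point.

```python
import numpy as np

def crf_energy(labels, unary, potts_weight=1.0):
    """Energy of Eq. (1): sum_i D_i(c_i) + sum_{i,j in N} V_ij(c_i, c_j).

    labels: (H, W) integer array of per-pixel class assignments.
    unary:  (H, W, K) array with unary[y, x, k] = D_i(k) for label k at pixel (y, x).
    The pairwise term is a Potts penalty (a common stand-in): a constant cost
    whenever two 4-connected neighbors take different labels.
    """
    h, w = labels.shape
    # Unary term: cost of the chosen label at every pixel.
    data = unary[np.arange(h)[:, None], np.arange(w)[None, :], labels].sum()
    # Pairwise term over horizontal and vertical neighbor pairs.
    smooth = potts_weight * (
        (labels[:, 1:] != labels[:, :-1]).sum()
        + (labels[1:, :] != labels[:-1, :]).sum()
    )
    return data + smooth
```

A lower energy corresponds to a higher posterior P(C|I) under Eq. (1), up to the normalization constant Z.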
Our proposal is based on computing the term Di(ci) as a combination of both texture features and Relative
Location Features (RLF) [1]. First, we extract texture information by computing Gabor filter
responses for several frequencies and orientations. With these features we train a Gaussian Mixture Model
(GMM) which provides us with an initial class prediction for each image pixel. Second, in order to
compute the RLF we learn a set of probability maps for each pair of classes, which encode the probability of
finding items of one class in a particular location according to the relative position of elements of another class.
Then, combining the initial class prediction with the information provided by the probability maps, we are able
to encode the RLF, which we will call vother and vself. Finally, we are able to compute the term Di(ci) from
(1) as:

    Di(ci) = wapp log P(ci|pi) + wother_ci log vother_ci(pi) + wself_ci log vself_ci(pi),    (2)

where P(ci|pi) is the probability returned by the GMM and the weights wapp, wother_ci and wself_ci are learned
using a logistic regression model from the training dataset. Once we have built the proposed CRF, we perform
energy minimization using the Graph Cut algorithm [2]. The segmentation process and the combination of
these features can be seen in Figure 1.
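The combination in Eq. (2) can be sketched as follows. This is an illustrative reading of the equation, not the authors' implementation: the GMM posterior and the RLF maps vother and vself are assumed to be precomputed (H, W, K) arrays, and the weights would come from the logistic regression described above.

```python
import numpy as np

def unary_term(gmm_posterior, v_other, v_self, w_app, w_other, w_self, eps=1e-10):
    """Evaluate Eq. (2): D[y, x, k] is the unary value for label k at pixel (y, x).

    gmm_posterior, v_other, v_self: (H, W, K) arrays of per-class values in [0, 1].
    w_app: scalar appearance weight; w_other, w_self: length-K per-class weights.
    eps guards against log(0) for pixels where a map assigns zero probability.
    """
    return (
        w_app * np.log(gmm_posterior + eps)
        + w_other * np.log(v_other + eps)  # broadcasts over the class axis
        + w_self * np.log(v_self + eps)
    )
```

The resulting array plugs directly into the unary part of the energy in Eq. (1), which is then minimized with graph cuts.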
Discussion
We have performed experiments devised to check whether the inclusion of RLF helps to improve the segmentation results on documents with both Manhattan and non-Manhattan structures, while at the same time
evaluating the effectiveness of our method on both types of documents. We have used two datasets that fit these
requirements: the 5CofM dataset as a set with a Manhattan layout, and the PRImA dataset as a benchmark set
with a non-Manhattan layout.
The results on 5CofM confirm that the use of the RLF significantly improves the segmentation results for
all the considered class entities. However, in the case of the PRImA dataset the inclusion of the RLF does not
produce significant improvements, although our method still achieves results up to the state of the art.
As ongoing work, we are addressing several aspects to improve the reported results. Our current research
lines focus on the exploration of other inference algorithms, such as Loopy Belief Propagation, and the use of
2D-SCFGs (stochastic context-free grammars) to obtain the final result, as well as on other ways to include more
spatial information between the entities.
[Figure 1 diagram: Training Set → Gabor features + GMM and Relative Location Features / Probability Maps → CRF-based final segmentation]
Figure 1: Segmentation process using Relative Location Features.
References
[1] S. Gould, J. Rodgers, and D. Cohen, "Multi-Class Segmentation with Relative Location Prior," International Journal of Computer Vision, 2008.
[2] Y. Boykov, O. Veksler, and R. Zabih, "Fast Approximate Energy Minimization via Graph Cuts," IEEE PAMI, 23(11), 1222-1239, 2001.