Logical Document Conversion: Combining Functional and Formal Knowledge

Hervé Déjean
Jean-Luc Meunier
Xerox Research Centre Europe
6 chemin de Maupertuis
F-38240 Meylan, France
[email protected]
[email protected]
ABSTRACT
We present in this paper a method for document layout analysis
based on identifying the function of document elements (what
they do). This approach is orthogonal and complementary to the
traditional view based on the form of document elements (how
they are constructed). One key advantage of such functional
knowledge is that the functions of some document elements are
very stable from document to document and over time. Relying
on the stability of such functions, the method is not impacted by
layout variability, a key issue in logical document analysis, and is
thus very robust and versatile. The method starts the recognition
process by using functional knowledge and then uses, in a second
step, formal knowledge as a source of feedback in order to correct
some errors. This allows the method to adapt to specific documents
by exploiting their formal specificities.
Categories and Subject Descriptors
I.7.2 [Document and Text Processing]: Document Preparation -
Markup languages; I.7.4 [Document and Text Processing]:
Electronic Publishing; I.7.5 [Document and Text Processing]:
Document Capture - Document analysis
General Terms
Algorithms, Documentation, Experimentation
Keywords
Logical document analysis, functional analysis, adaptation,
feedback.
1. INTRODUCTION
While document digitization has been investigated for decades, no
robust and high-accuracy system is today able to convert a large
variety of documents. Several reviews in the last fifteen years
detail the state of the art in document analysis [1], [4], [9], [17].
In [1], H. Baird lists a series of difficult problems for document
image analysis, from image capture to analysis of content. The
other reviews usually articulate document analysis into two areas:
physical (or geometric) layout analysis and logical (or functional)
layout analysis.
For Cattoni and Coianiz [4], “the geometric layout analysis aims
at producing a description of the geometric structure of the
document. The last step of this operation produces a
decomposition into maximal homogeneous regions whose
elements belong to different data types (text, graphics,
pictures,…).” And “Purpose of logical layout analysis is to assign
a meaningful label to the homogeneous regions (blocks) of a
document image and to determine their logical relationships,
typically with respect to an a priori description of the document,
i.e. a model.” A typical relationship between elements determines
the reading order of some parts of the page.
According to the latest proceedings of the different conferences
and workshops on this topic such as [10] and [12], optical
character recognition (OCR) is still, after five decades, one of the
most active fields in the document analysis community (now
addressing handwritten characters). Geometric layout analysis and
logical layout analysis are present in the literature (see [17], [22]
for some recent work), but one could expect more publications on
these topics since no system is able to guarantee high accuracy
across a large variety of documents.
As mentioned in [4], functional/logical layout analysis requires
some kind of a priori knowledge in order to label elements
provided by the geometric analysis step. Most of the examples of
systems found in the literature are based on layout knowledge:
they model a document element with its layout attributes:
position, size, style, and sometimes content (typography for
instance). Features may be absolute (e.g. size = 12 points) or
relative (e.g. its size is larger than the common size, as used in
[26]). An emphasis on visual features or visual perception is used
in [21], [24] and [25]. Contextual information (surrounding
elements) as well as document models are also used. The way this
knowledge is used varies significantly from one system to another
(from rule-based to statistical learning methods). In this paper we
call this kind of information provided by document layout (or
form) formal knowledge.
Currently, however, the conclusion is that no system (most being
based on formal knowledge) is accurate, versatile, and robust
enough, except in some very specific applications and for some
homogeneous document collections. As raised by [4], the key issue
is to select
pieces of knowledge which are generic but also accurate enough
to guarantee a robust and accurate system. But even for specific
collections, formal variations occur over time due to new or
modified document models as illustrated by [18], [27].
Robustness appears to be a real issue and publications are starting
to focus on this problem [1], [2], [14].
The first objective of this paper is to emphasize the importance of
functional knowledge for logical layout analysis. This kind of
knowledge is not based on the form of elements, but is based on
the purpose or function of elements in their environment. As the
rest of this paper will show, this kind of information addresses the
key issue of the layout variability encountered in documents. We
do not aim at opposing functional against formal knowledge, but
at leveraging both of them. We argue that starting the recognition
task (i.e. document analysis) with functional knowledge and then
refining the analysis with formal knowledge provides a very
robust method for document functional analysis. The second
objective of this paper is to present this method, which can be
viewed as a design pattern or best practice for functional analysis,
developed at XRCE over the last four years.
The rest of this paper is organized as follows: the next section
explains what functional knowledge is, and why it is valuable to
use it. It will delimit its field of application, and compare it with
traditional formal knowledge. Section 3 details the method for
recognizing/labeling document elements by first using functional
knowledge and then refining recognition with formal knowledge.
The method is illustrated with several examples. Section 4
discusses the interest of a component-centric approach and the
advantage of using structural redundancy in documents.
2. DO NOT FORGET FUNCTION AND
ENVIRONMENT
2.1 Functional Considerations
Imagine a system which is able to label document elements
without information about their position, font, and style.
Impossible? While many approaches try to recognize elements by
answering the question: “how does it look” or “how is it made?”
we propose here to address this recognition problem with the
complementary and orthogonal question: “What does it do?” This
functional approach (teleological as defined in [20]: what is the
purpose or function of this element in the document?) allows us to
reformulate the traditional pattern recognition problem as a
function recognition problem: elements will be primarily
recognized by their function in a document and secondly by their
form.
This work follows and uses concepts developed by Herbert A.
Simon in his book “The Sciences of the Artificial” [23], from which
the following citation is extracted:
Let us look a little at the functional or purposeful aspect of
artificial things. Fulfillment of purpose or adaptation to a goal
involves a relation among three terms: the purpose or goal, the
character of the artifact, and the environment, in which the
artifact performs. [23]
An artifact, in our context the document element we try to
recognize, is defined by three elements: its purpose or function, its
inner environment (character) and its outer environment, where
inner and outer environments are defined as:
[...] an “inner” environment [is] the substance and organization
of the artifact itself, and an “outer” environment [is] the
surroundings in which it operates.

The inner environment is the kind of knowledge traditionally used
in logical analysis, as explained in the introduction. The outer
environment is very often ignored or atrophied: elements are
considered in isolation or with their immediate surrounding context
(previous or next elements). But the real outer environment of an
element is the set of elements with which it interacts.
The notion of function is not new (one speaks about functional
layout analysis), but very often no functional considerations are
used during the recognition step, only formal ones: only after the
elements have been recognized are their functions identified.
The method we are presenting and illustrating here explicitly
makes use of two usually underexploited notions: the functional
purpose of a document element, and its environment. Paraphrasing
Simon, we can often predict the element label from knowledge of
the element’s goal and its outer environment “with only minimal
assumptions about the inner environment” (see footnote 1 for the
original sentence). Simon’s instant corollary describes well the
original sentence). Simon’s instant corollary describes well the
issue of variability of document layout (inner environment). For
instance, we often find quite different tables of contents from a
layout point of view accomplishing identical or similar goals
(navigation) in similar outer environments (documents).
We do not claim here that we will totally ignore formal
knowledge, but we claim that, firstly, the combination of both
kinds of knowledge, functional and formal, allows one to design
robust components, and, secondly, that functional information
allows one to bootstrap in a reliable way the recognition process.
One advantage of this approach is to offer a robust way to face the
crucial and traditional problem of layout variability: we will see
that functions are less variable from document to document and
also over time. For instance, the function of a running title is still
the same after almost two millennia.
2.2 Conditions for Using Functional
Knowledge
We describe here some conditions required for working with
functional knowledge.
First, the use of a functional characteristic is often based on the
use of content: function is often expressed through relations
between textual elements. This then requires performing OCR on
document images, or extracting textual elements from digital
formats such as PDF.
Secondly, these relations drawn between textual elements may
span over the whole document. This requires that the granularity
is no longer the page, but the whole document (Simon’s outer
environment). The term Document Layout Analysis is now taken
literally. This can be seen as a drawback since the traditional level
(page) can provide more flexibility.
1. “[...] we can often predict behavior from knowledge of the
system’s goals and its outer environment, with only minimal
assumptions about the inner environment. An instant corollary is
that we often find quite different inner environments
accomplishing identical or similar goals in identical or similar
outer environment [...]” [23]
2.3 What Kind of Document Components?
One category of document elements naturally covered by this
functional analysis corresponds to those organizational elements
introduced in the course of the history of books in order to help
readers read them: running titles, page numbers, indexes,
organizational tables, headers and footers. We refer to [5] for an
excellent history of books and reading in the west. These artifacts
were designed to give the reader better and quicker access to some
elements of a book, avoiding linear access. These elements are
ancient (several centuries old and more) and, from a functional
point of view, stable across documents: they have been assuming
essentially one function, have one purpose, and are well known
among readers. A method able to use this functional purpose for
recognizing those elements will then be robust and able to face
form/layout variability.
The “small” number of elements concerned by this method
may appear to be a limitation, but it should be pointed out that
these components are very frequent, strongly related to the logical
organization of a document, and are thus of particular interest for
high-level logical structuring. This is particularly true for
technical documents where rapid access to some document unit is
required, whereas this quick direct access is not required for some
documents such as novels.
This method has been applied to the following document
components: page number, table of contents, page header and
page footer, caption, footnote. Work currently under way
addresses lists. These components are among the traditional ones
usually recognized by other work in logical analysis.
Other traditional components recognized in logical analysis are
metadata, often bibliographical metadata. Among them, an
interesting case is the recognition of document titles. The
traditional solution is to use rules such as “Titles usually occur in
a large font, near the beginning of a paper” [3]. If we want to
design a title recognizer using functional knowledge, we first have
to functionally characterize such a title: originally (see footnote 2),
its purpose was to refer to the document and, first of all, to help
find it in the library. Its outer environment is then no longer the
page nor the document, but the library (or, in a first approximation,
its catalog). And indeed such an environment is used in several
works such as [11] and [13]: in order to correct errors due to OCR,
they use an existing repository (the library catalog) for finding
various bibliographical metadata (such as the author’s affiliation,
which is prone to OCR errors). This can be extended to all
traditional bibliographical metadata of a document (title, author,
year, etc.).
The library/repository catalog can be seen as a table of contents,
and similarly to the TOC component which also labels document
headings, we can use the catalog so as to label document
metadata.
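To make this catalog-based labeling concrete, here is a minimal sketch in the spirit of [11] and [13]; it is our illustration, not their implementation, and the `catalog` list and the 0.8 cutoff are assumptions:

```python
from difflib import get_close_matches

# Minimal sketch of catalog-based metadata correction: an OCRed title is
# matched against a library catalog, and the closest known title (if any)
# replaces the noisy OCR output. Catalog contents and cutoff are illustrative.
def correct_title(ocr_title: str, catalog: list[str]) -> str:
    match = get_close_matches(ocr_title, catalog, n=1, cutoff=0.8)
    return match[0] if match else ocr_title

# Example: an OCR confusion ("ci" read as "d") is corrected by the catalog.
print(correct_title("The Sciences of the Artifidal",
                    ["The Sciences of the Artificial", "A History of Reading"]))
```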
The recognition problem then becomes simpler if we are willing
to move it and to use a more complex environment: for instance,
to use the document level as the environment for recognizing
document elements, or to use the digital library level for
recognizing metadata. Of course, it may not always be possible to
access such an environment, but whenever it is possible, its usage
is of primary interest.

2. Historically, the title (Greek sillybos, Latin titulus) is a leather
tag attached to the roll, providing the title of the work (and other
“metadata” such as the number of lines of the scroll). Its function
was to help one find the scroll quickly (without opening the roll).
2.4 An Example of a Functional View and a
Formal View
In order to illustrate the methodological differences between
traditional formal knowledge and functional knowledge, we will
take several publications focusing on the detection of tables of
contents. The work described in [16] illustrates the formal
approach and the publications [7] and [15] illustrate the
functional approach. Even if the objectives of that work are not
the same (the first one aims at identifying tables of contents
without any OCR step whereas [7] and [15] require textual
elements), the way they consider table of contents (TOC)
illustrates well the differences between the formal and the
functional perspective: [16] defines a TOC as a text with a
structured format (or in a previous publication as nothing but text
lines with a structured format). On the functional side, [7] defines
a TOC as a series of contiguous references to some other parts of
the document itself. And for [15] a TOC is simply a collection of
references to individual articles of a document no matter what
layout it follows.
These definitions show very well the main difference between the
two perspectives: in the first one, one tries to describe the form of
the elements to be recognized (often taken in isolation), while the
second one puts the element into its environment and
characterizes its purpose/function in the document or identifies
properties inferred from its purpose/function. To define a TOC as
a text with a (specific) structured format is fully correct, since a
human being is able to recognize a TOC just by looking at the
page where it appears, without checking each entry of the TOC.
But here we are in the context of automatic recognition performed
by machines, and the issue is to perform this automatic
recognition in a robust and accurate way. As argued by [15],
designing a system whose goal is to recognize a form is not easy
when the form varies. And the following remark from [15] remains
true when generalized to many of the functional (logical)
document elements: “Most previous TOC analysis methods only
concentrate on the TOC itself without taking into account the
other pages in the same document.”
We will now explain our function-based method.
3. DESCRIPTION OF THE METHOD
In this section we will describe the different steps of the method.
We will first give a general description of the method, and the
next sections will describe each of its steps with two examples:
the detection of page numbers and of tables of contents. Several
articles from different authors will be mentioned, and a complete
description of the algorithms referred to can be found in [6], [7],
[8], and [15].
As input, the method requires a document on which some
geometrical analysis has been performed. This analysis is not
required to be perfectly accurate, the current method being robust
to noise.
In this section, the term element will refer to the geometrical
elements of the document provided as input structure: words,
lines, pages, whereas the term component will refer to the specific
elements of the specific document structure (such as page
numbers, TOC). Component recognizer is used to refer to the
software program performing this labeling. The elements will be
recognized as being part of a component (they belong to the
component). A component can contain several elements: a table of
contents component consists of several elements (lines) of a
document.
The method first tries to functionally characterize a component of
a document (Step 1). Paraphrasing Simon again, we might hope to
be able to characterize the main properties of the system and its
behavior without elaborating the detail of either the outer or the
inner environment. Step 2 provides a rough representation of the
inner and outer environments. The properties defined in Step 1
are used for identifying solution alternatives consistent with these
functional properties (Step 3). A way to select an optimal
alternative is defined, and this optimal alternative labels some
elements of the document (Step 4). The formal knowledge of
these elements (their style, size, position) is used in order to
provide feedback to the system and to improve or correct accuracy
(Step 5). This feedback can impact steps 2, 3, or 4.
1. Prerequisite: find functional properties
2. Select elements and environment
3. Build up solution alternatives (based on functional knowledge)
4. Select the optimal alternative
5. Generate formal feedback
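The overall control flow can be sketched as follows; this is our reading of the method, with hypothetical hook names (`select_elements`, `build_alternatives`, ...) standing in for the component-specific logic:

```python
# Skeleton of the five-step method for one component recognizer.
# The hook names are hypothetical; each component (page numbers, TOC, ...)
# would supply its own implementation of them.
def recognize_component(document, recognizer):
    # Step 1 is a prerequisite: check that the functional properties hold.
    if not recognizer.functional_properties_hold(document):
        return None                      # e.g. a one-page document
    elements = recognizer.select_elements(document)           # Step 2
    candidates = recognizer.build_alternatives(elements)      # Step 3
    best = recognizer.select_optimal(candidates)              # Step 4
    refined = recognizer.formal_feedback(best, elements)      # Step 5
    if refined is not None:              # feedback: redo Steps 3 and 4
        best = recognizer.select_optimal(
            recognizer.build_alternatives(refined))
    return best
```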
Step 4 (select the optimal alternative) is in practice already
accurate (at least 90% in most cases). But when a very high
accuracy is required, the feedback step has to be performed and
negative feedback is provided based on the formal characteristics
of the identified elements. One assumption here is that these
elements will share common formal characteristics.
In the following sections, the first part will provide a general
description, and the second part will provide illustrations with
recognition of page numbers and tables of contents. The work
described in [7] and [8] will be used as main illustration, and
quickly compared with work described in [6] and [15].
3.1 Functional Definition
3.1.1 Explanation
While formal characterization may look trivial at first glance (at
least when the form does not change), a functional
characterization of an element is not always obvious and
immediate. The primary question to start with is “What is the
purpose of this component?” This question will point out relations
between different elements of the document. We will use these
relations in order to recognize elements belonging to the
component. All elements that will share these relations belong to a
potential solution. This step allows the identification of difficult
cases or cases where the functional approach will not work at all
(if we have only one page for instance).
3.1.2 Example with Page Numbers
The purpose of page numbers is to let the reader find a page, or
more exactly some information referred to by page number,
quickly. To achieve this purpose, a unique page number is
associated to each page. The set of page numbers forms an
incremental sequence. Multiple sequences of several kinds can
occur in a document, but they do not overlap. Typically the first
sequence can cover the front-matter of a book and uses Roman
numbers, and the rest of the document is covered by a sequence
using Arabic numbers. In order to recognize page numbers, we
will thus focus on incremental sequences of numbers. The type of
numbers used by a sequence is called its numbering schema. This
functional characterization relying on incremental sequences is
used in [6], [8], [15], and [19]. In order to be robust to OCR errors
(relatively frequent for small numbers), sequences can admit holes
(pages without numbers). [8] and [19] use the sequence level in
order to induce missing page numbers.
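As an illustration of numbering schemas, here is a minimal sketch (ours, not the code of the cited systems) that maps a token to the integer it denotes under an Arabic or Roman schema; the ASCII fallback schema mentioned in Section 3.2.2 would be handled separately:

```python
import re

# Roman numeral pattern and digit values, used to classify tokens.
ROMAN_RE = re.compile(r"m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})")
ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}

def roman_to_int(token: str) -> int:
    total, prev = 0, 0
    for ch in reversed(token.lower()):
        value = ROMAN_VALUES[ch]
        total += value if value >= prev else -value   # subtractive notation
        prev = max(prev, value)
    return total

def schema_value(token: str):
    """Return (schema, value) if the token fits a numbering schema, else None."""
    if token.isdigit():
        return ("arabic", int(token))
    if token and ROMAN_RE.fullmatch(token.lower()):
        return ("roman", roman_to_int(token))
    return None

assert schema_value("42") == ("arabic", 42)
assert schema_value("xiv") == ("roman", 14)
```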
3.1.3 Example with Table of Contents
The purpose of a table of contents is to let the reader directly
access a part of a document referred to in the table of contents
(TOC). A TOC is then defined as a zone of text referring to other
elements in the rest of the document. Several properties derive
from this definition; the most important one used in [7] is that the
order of the TOC entries and the order of the corresponding entries
in the document (headings) are identical. In order to deal with
noise, some entries without links are allowed in the TOC.
Links between each TOC entry and one body element are essential
to methods proposed in [7] and [15]. These links are based on
textual similarity. One advantage of this formalization is that the
TOC recognizer is able to recognize not only the TOC itself, but
also body elements that are referred to in the TOC: document
headings. This approach allows one to logically structure a
document according to its TOC. [15] also uses page number
matching when page numbers appear in TOC entries.
3.2 Element and Environment Selection
3.2.1 Explanation
The objective of this step is to identify document elements that
will be considered in the next processing steps. First we have to
decide the granularity of the elements, and second the set of
elements we will consider in order to define the outer
environment.
The geometrical analysis performed beforehand provides the
elements we will try to recognize/label. In practice, the line and
token (word) levels are sufficient for (very) good results for most
of the components. The paragraph level is sometimes preferable,
if available (e.g. for the table of contents, where entries or
headings can span over several lines).
The initial outer environment is very often the whole document,
and all its elements can be considered. Some elements can be
ignored depending on the component we are trying to recognize.
For instance, elements in the middle of the page are not
considered for recognizing page headers, but initially all the pages
are considered. This is often used in order to optimize processing
time.
Either the geometrical segmentation or the initial environment
selection can introduce noise: line or paragraph segmentation can
be wrong, and considering all the pages of a document can also
introduce noise (cover pages differ from body pages). But it is up
to the method to be robust to this noise and adaptive, since such
noise cannot be avoided in practice. This follows Simon’s point of
view:
We might hope to be able to characterize the main properties of
the system and its behavior without elaborating the detail of
either the outer or inner environment.[23]
A good approximation is often enough to bootstrap the
recognition task. Step 5 provides an adaptive mechanism
(feedback) to improve the recognition.
Before the feedback step, no formal knowledge is used, but
constraints over content can be used. This selection can be refined
in a later step by the feedback provided in Step 5 (see Section 3.5),
or by external components for the environment delimitation.
3.2.2 Illustration with Page Numbers
The natural level for the page numbering component is the token.
We also add some constraints over token content in order to
classify the different types of numbering: Arabic or Roman. A
default numbering schema called ASCII is also used.
3.2.3 Illustration with Table of Contents
For the recognition of tables of contents, the level used in [7] is
paragraph if available or by default the line level. In [15], the
method works at the page level (and answers the question: which
pages of the document contain the TOC?).
3.3 Build Solution Alternatives
3.3.1 Explanation
We use here the functional characterization of the component in
order to identify elements which belong to it. We do not expect to
recognize only one solution, and this step usually builds several
alternatives. Often a greedy algorithm is used in order to generate
these solutions. This two-step approach of first generating
alternatives and then selecting an optimal one is common to [6],
[7], [8], and [15].
3.3.2 Illustration with Page Numbers
For page numbers, the solution proposed in [8] enumerates all
possible sequences, over consecutive document pages, of
elements that all belong to the same numbering scheme (e.g. all
Roman numerals). The method enumerates in a greedy manner all
the longest possible sequences of text fragments occurring on
consecutive pages and fitting one of the predefined numbering
schemes. Figure 1 shows the list of sequences generated from this
artificial 5-page document.
Figure 1: All page number sequences are generated from an
artificial 5-page document: (1, 2, x, 4), (a, b), (1, 2), (foo, fop),
(Doc-1, Doc-2)
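The enumeration can be sketched as follows; this is our illustration of the idea, not the code of [8], reusing the `schema_value()` helper from the sketch in Section 3.1.2 and assuming each page is given as a list of its tokens:

```python
# Greedy enumeration of page number sequence candidates: from each starting
# token, extend the sequence over consecutive pages while the expected next
# value is found, tolerating a few holes (pages without numbers).
def enumerate_sequences(pages, max_holes=2):
    candidates = []
    for start, tokens in enumerate(pages):
        for token in tokens:
            fit = schema_value(token)
            if fit is None:
                continue
            schema, value = fit
            sequence, holes = [(start, value)], 0
            expected = value + 1
            for page in range(start + 1, len(pages)):
                found = any(schema_value(t) == (schema, expected)
                            for t in pages[page])
                if found:
                    sequence.append((page, expected))
                    holes = 0
                else:
                    holes += 1          # a page without its number
                    if holes > max_holes:
                        break
                expected += 1           # one page, one number
            candidates.append((schema, sequence))
    return candidates
```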
[6] presents a more local solution where only the previous page is
used to check the incrementality of the sequence. Using a larger
window (the whole document), as in [8], allows the method to
provide better accuracy, as will be explained in Section 3.4.
3.3.3 Illustration with Table of Contents
For tables of contents, the algorithm in [7] consists in identifying
a list of contiguous elements which refer (via a similarity measure)
to other elements of the document. First, links are defined between
each pair of elements in the whole document satisfying a textual
similarity criterion. Each link includes a source element and a
target element. Whenever the similarity (based on an edit distance)
is above a predefined threshold, a pair of symmetric links is created
(Figure 2). Secondly, all possible TOC candidate areas are
enumerated. A brute-force approach works fine: it consists in
testing each element as a possible TOC start and extending this
TOC candidate further in the document until it is no longer
possible to comply with the constraints. A TOC candidate is then
a set of contiguous elements from which it is possible to select one
link per element so as to provide an ascending order for the target
text elements (Figure 3).

Figure 2: A similarity score sim(x, y) is computed for each pair
of elements.

Figure 3: A TOC candidate is composed of a set of contiguous
elements from which it is possible to select one link per block
so as to provide an ascending order for the target text blocks.
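A minimal sketch of these two operations, under our own simplifications (difflib's ratio stands in for the edit-distance measure, and entries without links are not handled):

```python
from difflib import SequenceMatcher

def similarity(x: str, y: str) -> float:
    # Stand-in for the edit-distance-based measure sim(x, y) of Figure 2.
    return SequenceMatcher(None, x.lower(), y.lower()).ratio()

def build_links(elements, threshold=0.8):
    """Symmetric links between every sufficiently similar pair of elements."""
    links = {i: set() for i in range(len(elements))}
    for i in range(len(elements)):
        for j in range(i + 1, len(elements)):
            if similarity(elements[i], elements[j]) >= threshold:
                links[i].add(j)
                links[j].add(i)
    return links

def extend_candidate(start, links, n):
    """Greedily extend a TOC candidate from `start` while each contiguous
    entry can still be assigned a target in ascending order."""
    last_target, end = -1, start
    while end < n and any(t > last_target for t in links[end]):
        last_target = min(t for t in links[end] if t > last_target)
        end += 1
    return start, end                   # candidate covers elements [start, end)
```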
In [15], the candidate TOC page selection is at first much simpler,
since the candidates are pages and no longer lines or paragraphs:
the first twenty pages are selected in an initial step. If the last few
selected pages are considered as real TOC pages, the next 20
pages are taken into account. In a second step, links are computed
between each candidate TOC page and the other pages (the link
score is based on text match length, ordering of body pages, etc.).
This approach does not require any previous geometrical layout
analysis. A second score is computed when page numbers occur in
TOC entries (based on the number of extracted page numbers).
3.4 Select one (Optimal) Solution
3.4.1 Explanation
Once all the alternatives have been computed, an optimal one is
selected. While the optimality criterion is specific to each
component, a common characteristic emerges from the design of
the different components: the number of occurrences is often
positively correlated with confidence. For instance, a sequence of
page numbers spanning 50 pages (50 being arbitrarily chosen), or
a TOC composed of 100 entries, is more reliable than a 2-element
sequence (of page numbers or TOC entries). Hence long elements
and long sequences are preferred. This use of the number of
element occurrences for selecting optimal solutions can be found
in [6], [7], [8], and [15].
The notion of number of occurrences can also be used with
traditional association measures when no sequence is built. There
are of course several ways to select the best solution, and the
method we propose here does not impose any specific algorithm.
In practice, a Viterbi-like algorithm is often used in order to select
the optimal solution.
3.4.2 Illustration with Page Numbers
The optimal selection for page numbers consists in selecting the
subset of non-overlapping sequences so as to optimally cover the
whole document. Eventually it associates each page with its
corresponding token in the sequence, possibly extrapolating the
missing numbers (using the incremental property). A variant of
the Viterbi algorithm is used in [8], giving a high score to long
sequences. Figure 4 shows the best set of page number sequences
found for the example. The page number for page 3 is
extrapolated from the values of other elements of its sequence.
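The selection can be sketched as weighted interval scheduling, where each enumerated sequence is an interval over pages and its weight is its length; this is our simplification, whereas [8] uses a Viterbi variant:

```python
from bisect import bisect_right

# Choose a maximum-weight subset of non-overlapping page-range intervals.
# Each interval is (first_page, last_page, weight); using the sequence
# length as weight favors long sequences, as discussed in Section 3.4.1.
def select_sequences(intervals):
    intervals = sorted(intervals, key=lambda iv: iv[1])
    ends = [iv[1] for iv in intervals]
    dp = [(0, [])]              # dp[k]: best (score, selection) over first k
    for k, (first, last, weight) in enumerate(intervals):
        j = bisect_right(ends, first - 1, 0, k)  # last interval ending before `first`
        take = (dp[j][0] + weight, dp[j][1] + [(first, last, weight)])
        dp.append(max(dp[k], take))
    return dp[-1][1]

# Example: the length-4 sequence and the non-overlapping length-2 one are
# kept; the overlapping middle one is dropped.
print(select_sequences([(0, 3, 4), (2, 4, 3), (4, 4, 2)]))
```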
Figure 4: Selection of the optimal sequences: (1, 2, 3, 4, Doc-1,
Doc-2)

3.4.3 Illustration with Table of Contents
For tables of contents, a scoring function is used to rank the TOC
alternatives. The scoring function is the sum of entry weights,
where an entry weight is inversely proportional to the number of
outgoing links. This entry weight characterizes the certainty of
any of its associated links, under the assumption that the more
links originate from a given source text block, the less likely it is
that any one of those links is a "true" link of a table of contents.
Since this scoring function is a sum of entry weights, the selected
TOC is often a long TOC.

In [15], textual link scores are summed up and combined with the
page number score. This combination provides the final
confidence score for each TOC candidate page. This score is used
in order to decide whether this page is a real TOC or not.
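A sketch of this entry-weight scoring (ours, not the code of [7]), reusing the `links` mapping and the `(start, end)` candidate representation from the sketch in Section 3.3.3:

```python
# Score a TOC candidate as the sum of its entry weights; an entry with
# many outgoing links is less certain, so its weight is the inverse of
# its number of links. Long candidates therefore tend to score higher.
def toc_score(candidate, links):
    start, end = candidate
    return sum(1.0 / len(links[i]) for i in range(start, end) if links[i])
```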
3.5 Generate Formal Feedback
3.5.1 Explanation
The objective of this step is to use formal knowledge in order to
provide feedback for improving the current solution. The
underlying assumption for this step is that document elements
belonging to the same component will share some formal
regularity: for instance, page numbers will not be spread out
randomly over pages, and each entry of the TOC will not have a
different style or size. The issue for approaches only using formal
knowledge is that these regularities depend on the document and
are not consistent from document to document. This issue is
avoided here, since we use the fact that some elements have
already been identified (in the previous steps) as belonging to the
same component, and we will find the formal regularity which
distinguishes them from other elements.
A key assumption during this step is that the selected optimal
solution (previous step) is accurate enough. This is usually the
case since all authors of the papers cited in the beginning of
Section 3 report accuracy around 90% and above. But we know
from Step 1 that, in some particular cases, the solution fails or is
not reliable. Suspicious solutions should thus be ignored. This is
done by setting up the heuristics of the optimal solution selection
so that it favors precise solutions (rather than solutions with
optimal recall).
Feedback can be used in different ways and can impact different
steps: It can help improve element selection in Step 2, or help
generate better alternatives and help select an optimal solution.
The formal characterization uses the traditional features: position,
style, size, typographic features (among others).
3.5.2 Illustration with Page Numbers
For page numbers, the feedback is used in order to better select
elements in step 2. During the first loop, the method is set up so
that only sequences of length greater than 2 are recognized. This
threshold is in practice reliable enough.
A Machine Learning method (for example logistic regression) is
applied with the following input data (a sketch follows the list):

a. Positive examples correspond to the already-recognized page
number elements.
b. Negative examples are drawn randomly among the rest of the
textual elements of the pages (their number can be proportional to
the number of positive examples).
c. The feature used (how to characterize a textual element for the
Machine Learning method) is only the vertical position (Y). Other
features could be used.
d. The Machine Learning method is trained with these positive
and negative data and a model is generated.
e. The model is applied over the data in order to recognize page
numbers.
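A minimal sketch of items (a) to (e), using scikit-learn's LogisticRegression as a stand-in for the unspecified learner; the function name and data layout are our assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Train on the Y positions of already-recognized page numbers (positives)
# and of randomly drawn other elements (negatives), then predict which
# candidate elements are page numbers.
def feedback_filter(positive_y, negative_y, candidate_y):
    X = np.array(positive_y + negative_y, dtype=float).reshape(-1, 1)
    y = np.array([1] * len(positive_y) + [0] * len(negative_y))
    model = LogisticRegression().fit(X, y)                 # item (d)
    return model.predict(np.array(candidate_y, dtype=float)
                         .reshape(-1, 1)) == 1             # item (e)
```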
Feedback impacts Step 2, where only the page numbers
recognized by the Machine Learning algorithm and the already
recognized elements (Step 4) are selected. Steps 3 and 4 are
performed again, taking into account only those elements, but,
this time, accepting page number sequences of length 1 and more.
This feedback step allows one to recognize all page number
sequences in a robust way. An evaluation is provided in [8] and
shows more accurate results for documents with short sequences.
3.5.3 Illustration with Table of Contents
The implementation of this feedback step is currently under way
for the table of contents component.
4. USING DOCUMENT REDUNDANCY
FOR IMPROVING QUALITY
The different component recognizers presented in this article
achieve on average 90% accuracy. In real-world applications, this
accuracy level may be insufficient. One solution is to improve the
efficiency of each recognizer, but we all know how difficult it is
to increase accuracy at this level (above 95%). The other solution
we envisage is to use the organizational redundancy present in the
different document components. In order to reach a very high
accuracy, we combine several components in order to improve
and to check the final conversion quality. This relies on the fact,
illustrated in Figure 5, that a document has (potentially) a redundant
organization, where many components reflect the logical structure
of the document (in order to help readers navigate). A document
is considered here as an organized system with many
interrelations between its sub-systems. This is particularly true for
technical documents.

Figure 5: Document components (title, bibliography, TOC,
references, pagination, running titles, index, library catalog)
share manifold relations with the logical structure, and this
redundancy can be exploited during document conversion.

Several components can provide the same kind of information.
For instance, each logical section of a document has its title in the
page headers and as a table of contents entry, and may force page
numbering to restart. These three components provide clues for
section segmentation. A voting system can produce the best
solution.
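A minimal sketch of such a voting scheme (ours): each component proposes candidate section-start pages, and pages reaching a quorum of votes are kept:

```python
from collections import Counter

# Each recognizer (TOC, page headers, page numbering, ...) proposes a set
# of section-start pages; keep the pages proposed by at least `quorum`
# components.
def vote_section_starts(proposals, quorum=2):
    votes = Counter(page for pages in proposals for page in pages)
    return sorted(page for page, count in votes.items() if count >= quorum)

# Example: pages 3 and 10 are confirmed by two of the three components.
print(vote_section_starts([{3, 10}, {3, 11}, {3, 10}]))   # -> [3, 10]
```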
A more ambitious way to make use of organizational redundancy
would be to improve component outputs using other correlated
components in a collaborative environment: feedback as
performed in Step 5 of the method would no longer be internal to
a component but also external. Considering the previous example,
errors of the TOC component could be corrected using
information from the page number component. We are currently
investigating this collaborative approach.
Another direct application of this redundancy is Quality
Assurance: outputs from components providing the same or
equivalent result can be cross-validated and inconsistencies
detected. For instance, the logical segmentation provided by the
table of contents can be checked against the segmentation
provided by page numbers for documents where each logical
structure starts a new page number sequence (in technical
documents). This validation is currently performed in an ad hoc
way, but we are investigating ways to automatically detect
relations between document components.
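The cross-validation itself can be as simple as comparing the section starts produced by two components and flagging the disagreements; a minimal sketch (ours), not the ad hoc validation currently in place:

```python
# Flag pages where the TOC-based segmentation and the page-numbering-based
# segmentation disagree; these inconsistencies are candidates for review.
def inconsistent_starts(toc_starts, numbering_starts):
    return sorted(set(toc_starts) ^ set(numbering_starts))

print(inconsistent_starts({3, 10, 25}, {3, 10, 26}))      # -> [25, 26]
```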
5. CONCLUSION
Most document analysis systems rely on formal knowledge
(layout characteristics) for labeling elements. Due to the large
variability of the forms of these document elements, no robust
system is available today for a large variety of documents. We
presented in this article a method for functional document analysis
combining two kinds of knowledge: functional knowledge, which
allows the identification of elements according to their purpose,
and traditional formal knowledge. The functional stability of
some elements in documents allows them to be recognized.
Traditional formal knowledge is used with a feedback mechanism
in order to correct and improve the recognition task. Some kinds
of noise from previous steps (OCR, geometrical analysis) can be
tolerated. We are now investigating methods to combine the
results of each component recognizer in order to improve the
overall quality of document conversion.
6. REFERENCES
[1] H. Baird, Difficult and Urgent Open Problems in Document
Image Analysis for Libraries, in Proceedings of the First
International Workshop on Document Image Analysis for
Libraries (DIAL'04), 2004.
[2] H. Baird, D. Lopresti, Robust Document Image Understanding
Technologies, in Proceedings of HDP'04, Washington, DC,
USA, 2004.
[3] D. Bergmark, Automatic Extraction of Reference Linking
Information from Online Documents, Cornell CSTR 2000-1821,
2000.
[4] R. Cattoni, T. Coianiz, Geometric Layout Analysis Techniques
for Document Image Understanding: a Review, ITC-IRST
Technical Report #9703-09, 1998.
[5] G. Cavallo, R. Chartier, A History of Reading in the West,
Cambridge: Polity Press, 1999.
[6] K. Collins-Thompson, R. Nickolov, A Clustering-Based
Algorithm for Automatic Document Separation, in Proceedings
of the SIGIR 2002 Workshop on Information Retrieval and
OCR: From Converting Content to Grasping Meaning, Tampere,
Finland, 2002.
[7] H. Déjean, J.-L. Meunier, Structuring Documents According
to Their Table of Contents, in Proceedings of the ACM
Symposium on Document Engineering (DocEng'05), Bristol,
UK, 2005.
[8] H. Déjean, J.-L. Meunier, Versatile Page Numbering Analysis,
submitted to ICDAR 2007.
[9] R. Haralick, Document Image Understanding: Geometric and
Logical Layout, in Proceedings of the IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, 1994.
[10] Document Analysis Systems VII, Seventh International
Workshop (DAS 2006), Nelson, New Zealand, February 13-15,
2006, Proceedings, Lecture Notes in Computer Science, Vol.
3872.
[11] S. Hauser, T. Sabir, G. Thoma, OCR Correction Using
Historical Relationship from Verified Text in Biomedical
Citations, in Proceedings of the 2003 Symposium on Document
Image Understanding Technology, Greenbelt, Maryland, 2003.
[12] Eighth International Conference on Document Analysis and
Recognition (ICDAR 2005), August 29 - September 1, 2005,
Seoul, Korea, IEEE Computer Society.
[13] A. Kawtrakul, C. Yingsaeree, A Unified Framework for
Automatic Metadata Extraction from Electronic Document, in
Proceedings of the International Advanced Digital Library
Conference (IADLC), Nagoya, Japan, 2005.
[14] X. Lin, Quality Assurance in High Volume Document
Digitization: a Survey, in Proceedings of the Second
International Conference on Document Image Analysis for
Libraries (DIAL'06), Lyon, France, 2006.
[15] X. Lin, Y. Xiong, Detection and Analysis of Table of
Contents Based on Content Association, IJDAR 8(2-3): 132-143,
2006.
[16] S. Mandal, S.P. Chowdhury, A.K. Das, B. Chanda, Detection
and Segmentation of Table of Contents and Index Pages from
Document Images, in Proceedings of the Second International
Conference on Document Image Analysis for Libraries
(DIAL'06), Lyon, France, 2006.
[17] S. Mao, A. Rosenfeld, T. Kanungo, Document Structure
Analysis Algorithms: a Literature Survey, in Proceedings of
SPIE Electronic Imaging, January 2003, SPIE Vol. 5010:
197-207.
[18] S. Mao, J. Woo Kim, G.R. Thoma, Style-Independent
Document Labeling: Design and Performance Evaluation, SPIE
2004.
[19] G. Mühlberger, Automated Digitisation of Printed Material
for Everyone: The METADATA ENGINE Project, RLG
DigiNews, Volume 6, Number 3, 2002.
[20] A. Rosenblueth, N. Wiener, J. Bigelow, Behavior, Purpose
and Teleology, Philosophy of Science, 10, 1943.
[21] P. Sarkar, E. Saund, Perceptual Organization in Semantic
Role Labeling, Symposium on Document Image Understanding
Technology, Maryland, USA, 2005.
[22] F. Shafait, D. Keysers, T.M. Breuel, Performance
Comparison of Six Algorithms for Page Segmentation, Seventh
International Workshop on Document Analysis Systems
(DAS'06), Nelson, New Zealand, 2006.
[23] H.A. Simon, The Sciences of the Artificial, 3rd ed.,
Cambridge, MA: The MIT Press, 1997.
[24] D. Slocombe, R. Boyd, "There is no unstructured
documents," XML Europe 2002, Paris, France, 2002.
[25] K. Summers, Using White Space for Automated Document
Structuring, Cornell University Computer Science Technical
Report TR 94-1452.
[26] L. Todoran, M. Worring, M. Aiello, C. Monz, Document
Understanding for a Broad Class of Documents, ISIS Technical
Report Series, Vol. 2001-15, 2001.
[27] S. Yacoub, J. Abad, P. Faraboschi, J. Burns, D. Ortega, V.
Saxena, Document Digitization Lifecycle for Complex Magazine
Collection, in Proceedings of the ACM Symposium on Document
Engineering (DocEng'05), Bristol, UK, 2005.