Logical Document Conversion: Combining Functional and Formal Knowledge

Hervé Déjean, Jean-Luc Meunier
Xerox Research Centre Europe, 6 chemin de Maupertuis, F-38240 Meylan, France
[email protected], [email protected]

ABSTRACT
We present in this paper a method for document layout analysis based on identifying the function of document elements (what they do). This approach is orthogonal and complementary to the traditional view based on the form of document elements (how they are constructed). One key advantage of such functional knowledge is that the functions of some document elements are very stable from document to document and over time. Relying on the stability of such functions, the method is not impacted by layout variability, a key issue in logical document analysis, and is thus very robust and versatile. The method starts the recognition process with functional knowledge and, in a second step, uses formal knowledge as a source of feedback in order to correct some errors. This allows the method to adapt to specific documents by exploiting their formal specificities.

Categories and Subject Descriptors
I.7.2 [Computing Methodologies]: Document and Text Processing - Document preparation, Markup languages; I.7.4 [Computing Methodologies]: Document and Text Processing - Electronic Publishing; I.7.5 [Computing Methodologies]: Document Capture - Document analysis

General Terms
Algorithms, Documentation, Experimentation

Keywords
Logical document analysis, functional analysis, adaptation, feedback.

1. INTRODUCTION
While document digitization has been investigated for decades, no robust and high-accuracy system is today able to convert a large variety of documents. Several reviews over the last fifteen years detail the state of the art in document analysis [1], [4], [9], [17]. In [1], H. Baird lists a series of difficult problems for document image analysis, from image capture to analysis of content. The other reviews usually divide document analysis into two areas: physical (or geometric) layout analysis and logical (or functional) layout analysis. For Cattoni and Coianiz [4], "the geometric layout analysis aims at producing a description of the geometric structure of the document. The last step of this operation produces a decomposition into maximal homogeneous regions whose elements belong to different data types (text, graphics, pictures, ...)." And "Purpose of logical layout analysis is to assign a meaningful label to the homogeneous regions (blocks) of a document image and to determine their logical relationships, typically with respect to an a priori description of the document, i.e. a model." A typical relationship between elements determines the reading order of some parts of the page.
According to the latest proceedings of the conferences and workshops in this area, such as [10] and [12], optical character recognition (OCR) is still, after five decades, one of the most active fields in the document analysis community (now addressing handwritten characters). Geometric layout analysis and logical layout analysis are present in the literature (see [17], [22] for some recent work), but one could expect more publications on these topics, since no system is able to guarantee high accuracy across a large variety of documents. As mentioned in [4], functional/logical layout analysis requires some kind of a priori knowledge in order to label the elements provided by the geometric analysis step. Most of the systems found in the literature are based on layout knowledge: they model a document element with its layout attributes: position, size, style, and sometimes content (typography for instance). Features may be absolute (e.g. size = 12 points) or relative (e.g. the size is larger than the common size, as used in [26]). An emphasis on visual features or visual perception is found in [21], [24] and [25]. Contextual information (surrounding elements) as well as document models are also used. The way this knowledge is exploited varies significantly from one system to another (from rule-based to statistical learning methods). In this paper we call this kind of information, provided by document layout (or form), formal knowledge. But the current conclusion is that no system (most being based on formal knowledge) is accurate, versatile and robust enough, except in some very specific applications and for some homogeneous document collections. As raised by [4], the key issue is to select pieces of knowledge which are generic but also accurate enough to guarantee a robust and accurate system. But even for specific collections, formal variations occur over time due to new or modified document models, as illustrated by [18], [27]. Robustness appears to be a real issue, and publications are starting to focus on this problem [1], [2], [14].

The first objective of this paper is to emphasize the importance of functional knowledge for logical layout analysis. This kind of knowledge is not based on the form of elements, but on the purpose or function of elements in their environment. As the rest of this paper will show, this kind of information addresses the key issue of the layout variability encountered in documents. We do not aim at opposing functional and formal knowledge, but at leveraging both of them. We argue that starting the recognition task (i.e. document analysis) with functional knowledge and then refining the analysis with formal knowledge provides a very robust method for document functional analysis. The second objective of this paper is to present this method, which can be viewed as a design pattern or best practice for functional analysis, developed at XRCE over the last four years.

The rest of this paper is organized as follows: the next section explains what functional knowledge is and why it is valuable, delimits its field of application, and compares it with traditional formal knowledge.
Section 3 details the method for recognizing/labeling document elements by first using functional knowledge and then refining the recognition with formal knowledge. The method is illustrated with several examples. Section 4 discusses the interest of a component-centric approach and the advantage of using the structural redundancy present in documents.

2. DO NOT FORGET FUNCTION AND ENVIRONMENT

2.1 Functional Considerations
Imagine a system which is able to label document elements without information about their position, font, and style. Impossible? While many approaches try to recognize elements by answering the question "how does it look?" or "how is it made?", we propose here to address this recognition problem with the complementary and orthogonal question: "what does it do?" This functional approach (teleological, as defined in [20]: what is the purpose or function of this element in the document?) allows us to reformulate the traditional pattern recognition problem as a function recognition problem: elements will be recognized primarily by their function in a document and secondarily by their form. This work follows and uses concepts developed by Herbert A. Simon in his book "The Sciences of the Artificial" [23], from which this citation is extracted:

Let us look a little at the functional or purposeful aspect of artificial things. Fulfillment of purpose or adaptation to a goal involves a relation among three terms: the purpose or goal, the character of the artifact, and the environment, in which the artifact performs. [23]

An artifact, in our context the document element we try to recognize, is defined by three elements: its purpose or function, its inner environment (character) and its outer environment, where inner and outer environments are defined as:

[...] an "inner" environment [is] the substance and organization of the artifact itself, and an "outer" environment [is] the surroundings in which it operates.

The inner environment corresponds to the kind of knowledge traditionally used in logical analysis, as explained in the introduction. The outer environment is very often ignored or atrophied: elements are considered in isolation or only with their immediately surrounding context (previous or next elements). But the real outer environment of an element is the set of elements with which it interacts.

Function itself is not new (one speaks about functional layout analysis), but very often no functional considerations are used during the recognition step, only formal ones: only after the elements have been recognized can their functions be identified. The method we are presenting and illustrating here explicitly makes use of two usually underexploited notions: the functional purpose of a document element and its environment. Paraphrasing Simon, we can often predict the element label from knowledge of the element's goal and its outer environment "with only minimal assumptions about the inner environment" (see footnote 1 for the original sentence). Simon's instant corollary describes well the issue of the variability of document layout (the inner environment). For instance, we often find tables of contents that are quite different from a layout point of view accomplishing identical or similar goals (navigation) in similar outer environments (documents). We do not claim here that we will totally ignore formal knowledge; we claim, firstly, that the combination of both kinds of knowledge, functional and formal, allows one to design robust components, and, secondly, that functional information allows one to bootstrap the recognition process in a reliable way. One advantage of this approach is to offer a robust way to face the crucial and traditional problem of layout variability: we will see that functions are far less variable from document to document and also over time. For instance, the function of a running title is still the same after almost two millennia.
2.2 Conditions for Using Functional Knowledge
We describe here some conditions required for working with functional knowledge. First, the use of a functional characteristic is often based on the use of content: function is often expressed through relations between textual elements. This requires performing OCR on document images, or extracting textual elements from digital formats such as PDF. Secondly, the relations drawn between textual elements may span the whole document. This requires that the granularity is no longer the page, but the whole document (Simon's outer environment). The term Document Layout Analysis is now taken literally. This can be seen as a drawback, since the traditional level (the page) can provide more flexibility.

1 [...] we can often predict behavior from knowledge of the system's goals and its outer environment, with only minimal assumptions about the inner environment. An instant corollary is that we often find quite different inner environments accomplishing identical or similar goals in identical or similar outer environment [...] [23]

2.3 What Kind of Document Components?
One category of document elements naturally covered by this functional analysis corresponds to the organizational elements introduced in the course of the history of books in order to help readers read them: running titles, page numbers, indexes, organizational tables, headers and footers. We refer to [5] for an excellent history of books and reading in the West. These artifacts were designed to give the reader better and quicker access to some elements of a book, avoiding linear access. These elements are ancient (several centuries and more) and functionally stable across documents: they have assumed essentially one function, have one purpose, and are well known among readers. A method able to use this functional purpose for recognizing these elements will thus be robust and able to face form/layout variability. The "small" number of elements covered by this method may appear to be a limitation, but it should be pointed out that these components are very frequent, strongly related to the logical organization of a document, and thus of particular interest for high-level logical structuring. This is particularly true for technical documents, where rapid access to some document unit is required, whereas this quick direct access is not needed for documents such as novels. The method has been applied to the following document components: page number, table of contents, page header and page footer, caption, footnote. Work currently under way addresses lists. These components are among the traditional ones usually recognized by other work in logical analysis. Other traditional components recognized in logical analysis are metadata, often bibliographical metadata. Among them, an interesting case is the recognition of document titles. The traditional solution is to use rules such as "Titles usually occur in a large font, near the beginning of a paper" [3]. If we want to design a title recognizer using functional knowledge, we first have to functionally characterize such a title: originally (see footnote 2), its purpose was to refer to the document and, first of all, to help find it in the library. Its outer environment is then no longer the page nor the document, but the library (or, in a first approximation, its catalog).
And indeed such an environment is used in several works such as [11] and [13]: in order to correct errors due to OCR, they use an existing repository (the library catalog) for finding various bibliographical metadata (such as the author's affiliation, which is prone to OCR errors). This can be extended to all traditional bibliographical metadata of a document (title, author, year, etc.). The library/repository catalog can be seen as a table of contents, and just as the TOC component labels document headings, we can use the catalog to label document metadata. The recognition problem then becomes simpler if we are willing to move it and to use a more complex environment: for instance, to use the document level as the environment for recognizing document elements, or to use the digital library level for recognizing metadata. Of course, it may not always be possible to access such an environment, but whenever it is possible, its usage is of primary interest.

2 Historically, the title (Greek sillybos, Latin titulus) is a leather tag attached to the roll, providing the title of the work (and other "metadata" such as the number of lines of the scroll). Its function was to help one find the scroll quickly (without opening the roll).

2.4 An Example of a Functional View and a Formal View
In order to illustrate the methodological differences between traditional formal knowledge and functional knowledge, we take several publications focusing on the detection of tables of contents. The work described in [16] illustrates the formal approach, and the publications [7] and [15] illustrate the functional approach. Even if the objectives of these works are not the same (the first aims at identifying tables of contents without any OCR step, whereas [7] and [15] require textual elements), the way they consider the table of contents (TOC) illustrates well the differences between the formal and the functional perspective: [16] defines a TOC as a text with a structured format (or, in a previous publication, as nothing but text lines with a structured format). On the functional side, [7] defines a TOC as a series of contiguous references to some other parts of the document itself. And for [15], a TOC is simply a collection of references to individual articles of a document, no matter what layout it follows. These definitions show very well the main difference between the two perspectives: the first one tries to describe the form of the elements to be recognized (often taken in isolation), while the second one puts the element into its environment and characterizes its purpose/function in the document, or identifies properties inferred from that purpose/function. To define a TOC as a text with a (specific) structured format is fully correct, since a human being is able to recognize a TOC just by looking at the page where it appears, without checking each entry of the TOC. But here we are in the context of automatic recognition performed by machines, and the issue is to perform this recognition in a robust and accurate way. As argued by [15], designing a system whose goal is to recognize a form is not easy when the form varies. And the following remark remains true when generalized to many of the functional (logical) document elements:

Most previous TOC analysis methods only concentrate on the TOC itself without taking into account the other pages in the same document.
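To make the contrast concrete, here is a toy sketch of the functional definition: a zone is a TOC if its entries refer, in ascending order, to later parts of the document. All names are hypothetical, and the similarity test (exact containment) is a deliberately crude stand-in for the textual-similarity measures used in [7] and [15].

```python
# A toy rendering of the functional TOC definition (hypothetical names).
# The "body" is a list of text blocks; a zone is functionally a TOC if each
# entry matches some body block and the matches appear in ascending order.

def is_toc_functional(zone, body):
    positions = []
    for entry in zone:
        # crude textual-similarity stand-in: exact containment
        match = next((i for i, block in enumerate(body) if entry in block), None)
        if match is None:
            return False          # an entry that refers to nothing
        positions.append(match)
    return positions == sorted(positions)

# Usage: is_toc_functional(["Introduction", "Method"],
#                          ["1 Introduction ...", "2 Method ..."])  -> True
```

Note that nothing in this predicate inspects fonts, indentation, or position: it tests only what the zone does with respect to the rest of the document.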
We will now explain our function-based method.

3. DESCRIPTION OF THE METHOD
In this section we describe the different steps of the method. We first give a general description, and the following sections then describe each step with two examples: the detection of page numbers and of tables of contents. Several articles from different authors will be mentioned; a complete description of the algorithms referred to can be found in [6], [7], [8], and [15].

As input, the method requires a document on which some geometrical analysis has been performed. This analysis is not required to be perfectly accurate, the method being robust to noise. In this section, the term element refers to the geometrical elements of the document provided as input structure (words, lines, pages), whereas the term component refers to the specific elements of the specific document structure (such as page numbers or the TOC). The term component recognizer refers to the software program performing this labeling. Elements are recognized as being part of a component (they belong to the component). A component can contain several elements: a table of contents component consists of several elements (lines) of a document.

The method first tries to functionally characterize a component of a document (Step 1). Paraphrasing Simon again, we might hope to be able to characterize the main properties of the system and its behavior without elaborating the detail of either the outer or the inner environment. Step 2 provides a rough representation of the inner and outer environment. The properties defined in Step 1 are used for identifying solution alternatives consistent with these functional properties (Step 3). A way to select an optimal alternative is defined, and this optimal alternative labels some elements of the document (Step 4). The formal knowledge of these elements (their style, size, position) is then used in order to provide feedback to the system and to improve or correct accuracy (Step 5). This feedback can impact Steps 2, 3, or 4. A schematic sketch of this five-step loop is given at the end of Section 3.1.1.

1. Prerequisite: find functional properties
2. Select elements and environment
3. Build up solution alternatives (based on functional knowledge)
4. Select the optimal alternative
5. Generate formal feedback

Step 4 (select the optimal alternative) is in practice already accurate (at least 90% in most cases). But when a very high accuracy is required, the feedback step has to be performed, and negative feedback is provided based on the formal characteristics of the identified elements. One assumption here is that these elements share common formal characteristics. In each of the following sections, the first part provides a general description, and the second part illustrates it with the recognition of page numbers and tables of contents. The work described in [7] and [8] is used as the main illustration and quickly compared with the work described in [6] and [15].

3.1 Functional Definition
3.1.1 Explanation
While formal characterization may look trivial at first glance (at least when the form does not change), a functional characterization of an element is not always obvious and immediate. The primary question to start with is: "What is the purpose of this component?" This question points out relations between different elements of the document. We use these relations in order to recognize the elements belonging to the component: all elements that share these relations belong to a potential solution. This step also allows the identification of difficult cases, or cases where the functional approach will not work at all (if we have only one page, for instance).
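Here is a minimal sketch of how the five steps might be chained, with the feedback loop closing back on Step 2. All names are hypothetical; the paper does not prescribe any particular implementation.

```python
# Minimal sketch of the five-step method (hypothetical names).
# Step 1 is a design-time prerequisite: each recognizer encodes the
# functional properties of its component (e.g. "page numbers form
# incremental sequences over consecutive pages").

def recognize_component(document, recognizer, max_loops=2):
    elements = recognizer.select_elements(document)             # Step 2
    best = None
    for _ in range(max_loops):
        alternatives = recognizer.build_alternatives(elements)  # Step 3
        best = recognizer.select_optimal(alternatives)          # Step 4
        refined = recognizer.formal_feedback(best, document)    # Step 5
        if refined == elements:   # feedback changed nothing: stop
            break
        elements = refined        # feedback impacts Step 2 on the next pass
    return best
```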
3.1.2 Example with Page Numbers
The purpose of page numbers is to let the reader quickly find a page, or more exactly some information referred to by page number. To achieve this purpose, a unique page number is associated with each page, and the set of page numbers forms an incremental sequence. Multiple sequences of several kinds can occur in a document, but they do not overlap. Typically, a first sequence covers the front matter of a book and uses Roman numerals, and the rest of the document is covered by a sequence using Arabic numerals. In order to recognize page numbers, we thus focus on incremental sequences of numbers. The type of numbers used by a sequence is called its numbering schema. This functional characterization relying on incremental sequences is used in [6], [8], [15], and [19]. In order to be robust to OCR errors (relatively frequent for small numbers), sequences can admit holes (pages without numbers). [8] and [19] use the sequence level in order to induce missing page numbers.

3.1.3 Example with Table of Contents
The purpose of a table of contents is to let the reader directly access a part of the document referred to in the table of contents (TOC). A TOC is then defined as a zone of text referring to other elements in the rest of the document. This definition induces several properties; the most important one, used in [7], is that the order of the TOC entries and the order of the corresponding entries in the document (the headings) are identical. In order to deal with noise, some entries without a link are allowed in the TOC. Links between each TOC entry and one body element are essential to the methods proposed in [7] and [15]. These links are based on textual similarity. One advantage of this formalization is that the TOC recognizer is able to recognize not only the TOC itself, but also the body elements that are referred to in the TOC: the document headings. This allows one to logically structure a document according to its TOC. [15] also uses page number matching when page numbers appear in TOC entries.

3.2 Element and Environment Selection
3.2.1 Explanation
The objective of this step is to identify the document elements that will be considered in the next processing steps. First we have to decide the granularity of the elements, and second the set of elements we will consider in order to define the outer environment. The geometrical analysis performed beforehand provides the elements we will try to recognize/label. In practice, the line and token (word) levels are sufficient for (very) good results for most of the components. The paragraph level is sometimes preferable, if available (e.g. for the table of contents, where entries or headings can span several lines). The initial outer environment is very often the whole document, and all its elements can be considered. Some elements can be ignored depending on the component we are trying to recognize. For instance, elements in the middle of the page are not considered for recognizing page headers, but initially all the pages are considered. This is often used in order to optimize processing time. Either the geometrical segmentation or the initial environment selection can introduce noise: line or paragraph segmentation can be wrong, and considering all the pages of a document can introduce noise (cover pages are different from body pages).
But it is up to the method to be robust to noise and adaptive, since such noise cannot be avoided in practice. This follows Simon's point of view:

We might hope to be able to characterize the main properties of the system and its behavior without elaborating the detail of either the outer or inner environment. [23]

A good approximation is often enough to bootstrap the recognition task, and Step 5 provides an adaptive mechanism (feedback) to improve the recognition. Before the feedback step, no formal knowledge is used, but constraints over content can be used. The selection can be refined, in a later step, by the feedback provided by Step 5 (see Section 3.5), or by external components for the environment delimitation.

3.2.2 Illustration with Page Numbers
The natural level for the page numbering component is the token. We also add some constraints over token content in order to classify the different types of numbering: Arabic or Roman. A default numbering schema called ASCII is also used.

3.2.3 Illustration with Table of Contents
For the recognition of tables of contents, the level used in [7] is the paragraph if available, or by default the line level. In [15], the method works at the page level (and answers the question: which pages of the document contain the TOC?).

3.3 Build Solution Alternatives
3.3.1 Explanation
We use here the functional characterization of the component in order to identify the elements which belong to it. We do not expect to recognize only one solution, and this step usually builds several alternatives; often a greedy algorithm is used to generate them. This two-step approach of generating alternatives and then selecting an optimal one is common to [6], [7], [8], and [15].

3.3.2 Illustration with Page Numbers
For page numbers, the solution proposed in [8] enumerates all possible sequences, over consecutive document pages, of elements that all belong to the same numbering scheme (e.g. all Roman numerals). The method enumerates in a greedy manner all the longest possible sequences of text fragments occurring on consecutive pages and fitting one of the predefined numbering schemes. Figure 1 shows the list of sequences generated from an artificial 5-page document.

Figure 1: All page number sequences are generated: (1, 2, x, 4), (a, b), (1, 2), (foo, fop), (Doc-1, Doc-2)

[6] presents a more local solution where only the previous page is used to check the incrementality of the sequence. Using a larger window (the whole document) as in [8] allows the method to provide better accuracy, as will be explained in Section 3.4.
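The greedy enumeration can be sketched as follows, under simplifying assumptions: each page is a list of token strings, only the Arabic scheme is shown (the Roman and ASCII schemes would plug in as other `value` functions), and a bounded number of holes is tolerated. This is an illustrative reconstruction, not the implementation of [8].

```python
import re

def arabic_value(token):
    """Return the integer value of an Arabic page-number token, else None."""
    return int(token) if re.fullmatch(r"\d+", token) else None

def enumerate_sequences(pages, value=arabic_value, max_holes=1):
    """Greedily grow, from every possible start token, a maximal sequence
    of incrementing tokens found on consecutive pages (holes tolerated)."""
    sequences = []
    for start, tokens in enumerate(pages):
        for token in tokens:
            v = value(token)
            if v is None:
                continue
            seq, holes = [(start, token)], 0
            expected, page = v + 1, start + 1
            while page < len(pages):
                match = next((t for t in pages[page] if value(t) == expected), None)
                if match is not None:
                    seq.append((page, match))
                    holes = 0
                elif holes < max_holes:
                    holes += 1      # a page whose number was lost (e.g. OCR miss)
                else:
                    break
                expected, page = expected + 1, page + 1
            if len(seq) > 1:        # keep multi-page sequences, as in Figure 1
                sequences.append(seq)
    return sequences
```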
3.3.3 Illustration with Table of Contents
For tables of contents, the algorithm in [7] consists in identifying a list of contiguous elements which refer (through a similarity measure) to other elements of the document. First, links are defined between each pair of elements in the whole document satisfying a textual similarity criterion. Each link includes a source element and a target element; whenever the similarity (based on an edit distance) is above a predefined threshold, a pair of symmetric links is created (Figure 2). Secondly, all possible TOC candidate areas are enumerated. A brute-force approach works fine: it consists in testing each element as a possible TOC start and extending this TOC candidate further in the document until it is no longer possible to comply with the constraints. A TOC candidate is then a set of contiguous elements from which it is possible to select one link per element so as to provide an ascending order for the target text elements (Figure 3).

Figure 2: A similarity score sim(x, y) is computed for each element pair.

Figure 3: A TOC candidate is composed of a set of contiguous elements from which it is possible to select one link per block so as to provide an ascending order for the target text blocks.

In [15], the candidate TOC page selection is much simpler, since the candidates are pages and no longer lines or paragraphs: the first twenty pages are selected in an initial step, and if the last few selected pages are considered real TOC pages, the next twenty pages are taken into account. In a second step, links are computed between each candidate TOC page and the other pages. The link score is based on text match length, the ordering of body pages, etc. It does not require any previous geometrical layout analysis. A second score is computed when page numbers occur in TOC entries (based on the number of extracted page numbers).

3.4 Select One (Optimal) Solution
3.4.1 Explanation
Once all the alternatives have been computed, an optimal one is selected. While the optimality criterion is specific to each component, a common characteristic emerges from the design of the different components: the number of occurrences is positively correlated with confidence. For instance, a sequence of page numbers covering 50 pages (50 being arbitrarily chosen), or a TOC composed of 100 entries, is more reliable than a 2-element sequence of page numbers or TOC entries. Hence long elements and long sequences are preferred. This use of the number of element occurrences for selecting optimal solutions can be found in [6], [7], [8], and [15]. The notion of number of occurrences can also be used with traditional association measures when no sequence is built. There are of course several ways to select the best solution, and the method we propose here does not impose any specific algorithm. In practice, a Viterbi-like algorithm is often used in order to select the optimal solution.

3.4.2 Illustration with Page Numbers
The optimal selection for page numbers consists in selecting the subset of non-overlapping sequences that optimally covers the whole document. Eventually it associates each page with its corresponding token in the sequence, possibly extrapolating the missing numbers (using the incremental property). A variant of the Viterbi algorithm is used in [8], giving a high score to long sequences. Figure 4 shows the best set of page number sequences found for the example; the page number for page 3 is extrapolated from the values of the other elements of its sequence.

Figure 4: Selection of the optimal sequences: (1, 2, 3, 4, Doc-1, Doc-2)
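Under the simplifying assumption that each candidate sequence is reduced to a (first_page, last_page, score) triple, with the score favoring long sequences (e.g. the sequence length), the selection of an optimal set of non-overlapping sequences can be sketched as a classic weighted-interval-scheduling dynamic program. [8] uses a Viterbi variant; this stand-in only illustrates the principle.

```python
from bisect import bisect_right

def select_optimal(candidates):
    """candidates: list of (first_page, last_page, score) triples.
    Returns a maximum-score subset of candidates with disjoint page ranges."""
    cands = sorted(candidates, key=lambda c: c[1])    # sort by last page
    ends = [c[1] for c in cands]
    best = [0.0] * (len(cands) + 1)   # best[i]: best score over first i candidates
    keep = [None] * (len(cands) + 1)  # backtracking info
    for i, (first, last, score) in enumerate(cands, start=1):
        # rightmost candidate (among the first i-1) ending before `first`
        j = bisect_right(ends, first - 1, 0, i - 1)
        if best[j] + score > best[i - 1]:
            best[i], keep[i] = best[j] + score, (i - 1, j)
        else:
            best[i] = best[i - 1]
    chosen, i = [], len(cands)        # backtrack the selected sequences
    while i > 0:
        if keep[i] is None:
            i -= 1
        else:
            idx, j = keep[i]
            chosen.append(cands[idx])
            i = j
    return list(reversed(chosen))
```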
3.4.3 Illustration with Table of Contents
For tables of contents, a scoring function is used to rank the TOC alternatives. The scoring function is the sum of the entry weights, where an entry weight is inversely proportional to the number of outgoing links of the entry. This weight characterizes the certainty of any of its associated links, under the assumption that the more links originate from a given source text block, the less likely any one of those links is a "true" link of a table of contents. Since this scoring function is a sum of entry weights, the selected TOC is often a long TOC. In [15], the textual link scores are summed up and combined with the page number score. This combination provides the final confidence score for each TOC candidate page, which is used to decide whether the page is a real TOC page or not.

3.5 Generate Formal Feedback
3.5.1 Explanation
The objective of this step is to use formal knowledge in order to provide feedback for improving the current solution. The underlying assumption is that document elements belonging to the same component share some formal regularity: for instance, page numbers will not be spread out randomly over the pages, and the entries of a TOC will not each have a different style or size. The issue for approaches using only formal knowledge is that these regularities depend on the document and are not consistent from document to document. This issue is avoided here, since some elements have already been identified (in the previous step) as belonging to the same component, and we find out the formal regularity which distinguishes them from the other elements.

A key assumption during this step is that the selected optimal solution (previous step) is accurate enough. This is usually the case, since all the authors of the papers cited at the beginning of Section 3 report accuracy around 90% and above. But we know from Step 1 that, in some particular cases, the solution fails or is not reliable. Suspicious solutions should thus be ignored. This is done by setting up the heuristics of the optimal solution selection so that they favor precise solutions (rather than solutions with optimal recall). Feedback can be used in different ways and can impact different steps: it can help improve the element selection of Step 2, or help generate better alternatives and select an optimal solution. The formal characterization uses the traditional features: position, style, size, typographic features (among others).

3.5.2 Illustration with Page Numbers
For page numbers, the feedback is used in order to better select elements in Step 2. During the first loop, the method is set up so that only sequences of length greater than 2 are recognized. This threshold is in practice reliable enough. A Machine Learning method (for example logistic regression) is then applied with the following input data:

a. Positive examples correspond to the already recognized page number elements.
b. Negative examples are drawn randomly among the rest of the textual elements of the pages (their number can be proportional to the number of positive examples).
c. The feature used (how a textual element is characterized for the Machine Learning method) is only the vertical position (Y). Other features could be used.
d. The Machine Learning method is trained with these positive and negative data and a model is generated.
e. The model is applied over the data in order to recognize page numbers.

The feedback impacts Step 2, where only the page numbers recognized by the Machine Learning algorithm and the already recognized elements (Step 4) are selected. Steps 3 and 4 are performed again, taking into account only those elements, but this time accepting page number sequences of length 1 and more. This feedback step allows one to recognize all page number sequences in a robust way. An evaluation is provided in [8] and shows more accurate results for documents with short sequences. A sketch of this loop is given after Section 3.5.3.

3.5.3 Illustration with Table of Contents
The implementation of this feedback step is currently under way for the table of contents component.
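Here is a minimal sketch of the page-number feedback loop of Section 3.5.2, assuming scikit-learn for the learner (the paper names logistic regression only as an example) and a hypothetical token representation of (text, y) pairs per page. It assumes both positive and negative examples exist.

```python
import random
from sklearn.linear_model import LogisticRegression

def feedback_filter(pages, recognized):
    """pages: list of pages, each a list of (text, y) tokens;
    recognized: set of (page_index, text, y) triples already labeled as
    page numbers by Step 4. Returns the tokens to keep for the second pass."""
    # a. Positive examples: the already-recognized page number elements.
    pos = [[y] for (_, _, y) in recognized]
    # b. Negative examples: drawn randomly among the remaining tokens.
    rest = [[y] for p, tokens in enumerate(pages)
            for (text, y) in tokens if (p, text, y) not in recognized]
    neg = random.sample(rest, min(len(rest), len(pos)))
    # c. and d. Train on a single feature: the vertical position (Y).
    model = LogisticRegression().fit(pos + neg, [1] * len(pos) + [0] * len(neg))
    # e. Apply the model; Steps 3 and 4 are then rerun on the kept tokens,
    # this time accepting sequences of length 1 and more.
    return [(p, text, y) for p, tokens in enumerate(pages)
            for (text, y) in tokens if model.predict([[y]])[0] == 1]
```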
We are now investigating methods to combine the results of the different component recognizers in order to improve the overall quality of document conversion.

4. USING DOCUMENT REDUNDANCY FOR IMPROVING QUALITY
The different component recognizers presented in this article achieve on average 90% accuracy. In real-world applications, this accuracy level may be insufficient. One solution is to improve the efficiency of each recognizer, but we all know how difficult it is to increase accuracy at this level (above 95%). The other solution we envisage is to use the organizational redundancy present in the different document components: in order to reach a very high accuracy, we combine several components in order to improve and to check the final conversion quality. This relies on the fact that a document has (potentially) a redundant organization, where many components reflect the logical structure of the document (in order to help readers navigate). A document is considered here as an organized system with many interrelations between its sub-systems (Figure 5). This is particularly true for technical documents. Several components can provide the same kind of information. For instance, each logical section of a document has its title in its page headers and as a table of contents entry, and may force page numbering to restart; these three components provide clues for section segmentation, and a voting system can produce the best solution. A more ambitious way to make use of organizational redundancy would be to improve component outputs using other correlated components in a collaborative environment: feedback, as performed in Step 5 of the method, would no longer be internal to a component but also external. Considering the previous example, errors of the TOC component could be corrected using information from the page number component. We are currently investigating this collaborative approach. Another direct application of this redundancy is Quality Assurance: the outputs of components providing the same or equivalent results can be cross-validated and inconsistencies detected. For instance, the logical segmentation provided by the table of contents can be checked against the segmentation provided by the page numbers, for documents where each logical unit starts a new page number sequence (as in some technical documents). This validation is currently performed in an ad hoc way, but we are investigating ways to automatically detect relations between document components.

Figure 5: Document components (library, title, bibliography, TOC, logical structure, references, pagination, running title, index) share manifold relations, and this redundancy can be exploited during document conversion.

5. CONCLUSION
Most document analysis systems rely on formal knowledge (layout characteristics) for labeling elements. Due to the large variability of the forms of these document elements, no robust system is available today for a large variety of documents. We presented in this article a method for functional document analysis combining two kinds of knowledge: functional knowledge, which allows the identification of elements according to their purpose, and traditional formal knowledge. The functional stability of some elements in documents allows them to be recognized. Traditional formal knowledge is used with a feedback mechanism in order to correct and improve the recognition task. Some kinds of noise from previous steps (OCR, geometrical analysis) can thus be tolerated.

6. REFERENCES
[1] H. Baird, Difficult and Urgent Open Problems in Document Image Analysis for Libraries, Proceedings of the First International Workshop on Document Image Analysis for Libraries (DIAL'04), 2004.
[2] H. Baird, D. Lopresti, Robust Document Image Understanding Technologies, HDP '04, Washington, DC, USA, 2004.
[3] D. Bergmark, Automatic Extraction of Reference Linking Information from Online Documents, CSTR 2000-1821, 2000.
[4] R. Cattoni, T. Coianiz, Geometric Layout Analysis Techniques for Document Image Understanding: a Review, ITC-IRST Technical Report #9703-09, 1998.
[5] G. Cavallo, R. Chartier, A History of Reading in the West, Cambridge: Polity Press, 1999.
[6] K. Collins-Thompson, R. Nickolov, A Clustering-Based Algorithm for Automatic Document Separation, Proceedings of the SIGIR 2002 Workshop on Information Retrieval and OCR: From Converting Content to Grasping Meaning, Tampere, Finland, 2002.
[7] H. Déjean, J.-L. Meunier, Structuring Documents According to Their Table of Contents, ACM Symposium on Document Engineering (DocEng'05), Bristol, UK, 2005.
[8] H. Déjean, J.-L. Meunier, Versatile Page Numbering Analysis, submitted to ICDAR 2007.
[9] R. Haralick, Document Image Understanding: Geometric and Logical Layout, IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1994.
[10] Seventh International Workshop on Document Analysis Systems (DAS 2006), Nelson, New Zealand, February 13-15, 2006, Proceedings, Lecture Notes in Computer Science, Vol. 3872.
[11] S. Hauser, T. Sabir, G. Thoma, OCR Correction Using Historical Relationship from Verified Text in Biomedical Citations, Proceedings of the 2003 Symposium on Document Image Understanding Technology, Greenbelt, Maryland, 2003.
[12] Eighth International Conference on Document Analysis and Recognition (ICDAR 2005), August 29 - September 1, 2005, Seoul, Korea, IEEE Computer Society.
[13] A. Kawtrakul, C. Yingsaeree, A Unified Framework for Automatic Metadata Extraction from Electronic Document, Proceedings of the International Advanced Digital Library Conference (IADLC), Nagoya, Japan, 2005.
[14] X. Lin, Quality Assurance in High Volume Document Digitization: a Survey, Proceedings of the Second International Conference on Document Image Analysis for Libraries (DIAL'06), Lyon, France, 2006.
[15] X. Lin, Y. Xiong, Detection and Analysis of Table of Contents Based on Content Association, IJDAR 8(2-3): 132-143, 2006.
[16] S. Mandal, S. P. Chowdhury, A. K. Das, B. Chanda, Detection and Segmentation of Table of Contents and Index Pages from Document Images, Proceedings of the Second International Conference on Document Image Analysis for Libraries (DIAL'06), Lyon, France, 2006.
[17] S. Mao, A. Rosenfeld, T. Kanungo, Document Structure Analysis Algorithms: a Literature Survey, Proc. SPIE Electronic Imaging, January 2003, SPIE Vol. 5010: 197-207.
[18] S. Mao, J. W. Kim, G. R. Thoma, Style-Independent Document Labeling: Design and Performance Evaluation, SPIE 2004.
[19] G. Mühlberger, Automated Digitisation of Printed Material for Everyone: The METADATA ENGINE Project, RLG DigiNews, Volume 6, Number 3, 2002.
[20] A. Rosenblueth, N. Wiener, J. Bigelow, Behavior, Purpose and Teleology, Philosophy of Science, 10, 1943.
[21] P. Sarkar, E. Saund, Perceptual Organization in Semantic Role Labeling, Symposium on Document Image Understanding Technology, Maryland, USA, 2005.
[22] F. Shafait, D. Keysers, T. M. Breuel, Performance Comparison of Six Algorithms for Page Segmentation, Workshop on Document Analysis Systems (DAS'06), Nelson, New Zealand, 2006.
[23] H. A. Simon, The Sciences of the Artificial, 3rd ed., Cambridge, MA: The MIT Press, 1997.
[24] D. Slocombe, R. Boyd, "There Is No Unstructured Document," XML Europe 2002, Paris, France, 2002.
[25] K. Summers, Using White Space for Automated Document Structuring, Cornell University Computer Science Technical Report TR 94-1452.
[26] L. Todoran, M. Worring, M. Aiello, C. Monz, Document Understanding for a Broad Class of Documents, ISIS Technical Report Series, Vol. 2001-15, 2001.
[27] S. Yacoub, J. Abad, P. Faraboschi, J. Burns, D. Ortega, V. Saxena, Document Digitization Lifecycle for Complex Magazine Collection, ACM Symposium on Document Engineering (DocEng'05), Bristol, UK, 2005.