Using a Generic Document Recognition Method for Mathematical Formulae Recognition

Pascal Garcia and Bertrand Coüasnon
IRISA / INSA-Département Informatique
20, Avenue des buttes de Coësmes, CS 14315, F-35043 Rennes Cedex, France
[email protected]

Abstract. We present in this paper how to apply to mathematical formulae a generic recognition method already used for musical scores, table structures and old forms. We propose to use this method to recognize the structure of formulae and also to recognize some symbols made of line segments. This offers two possibilities: improving symbol recognition when there are many symbol classes, as in mathematics; and overcoming the segmentation problems usually found in old mathematical formulae.

1 Introduction

We presented in [3] DMOS (Description and MOdification of Segmentation), a generic recognition method for structured documents. This method is made of:
– the grammatical formalism EPF (Enhanced Position Formalism), which can be seen as a description language for structured documents. EPF makes it possible to give, at the same time, a graphical, a syntactical and even a semantical description of a class of documents;
– the associated parser, which is able to change the parsed structure during parsing. This allows the system to try other segmentations with the help of context to improve recognition.

We have implemented this DMOS method to build an automatic generator of structured document recognition systems. Using this generator, adapting to a new kind of document simply requires describing the document with an EPF grammar. This grammar is then compiled to produce a new structured document recognition system. Each system can then use a classifier to recognize symbols which can be seen as characters.
By changing only the EPF grammar, and training a classifier when needed, we automatically produced various recognition systems: one for musical scores [4] (figure 1a), one for recursive table structures with any number of rows or columns (figure 1c), and one for military forms of the 19th century (figure 1b). We successfully tested this military forms recognition system on 5,000 pages of forms, even though they were quite damaged [6].

D. Blostein and Y.-B. Kwon (Eds.): GREC 2002, LNCS 2390, pp. 236–244, 2002. © Springer-Verlag Berlin Heidelberg 2002

(a) On musical scores: the original image (top) and the score reconstructed from the recognized structure (bottom)
(b) On a military form of the 19th century: the original image and its correctly recognized structure, in spite of the added sheets of paper
(c) On table structure: recognition of the table hierarchy (original image; detected tables at levels 1, 2 and 3)
Fig. 1. Applications of DMOS, a generic document recognition method

We present in this paper the way we used this generic method to automatically produce a mathematical formulae recognition system. Thanks to DMOS we were able to strongly reduce the development time, which is one of the main benefits of using a generic method. Moreover, we can propose an answer to some unsolved problems of mathematical formulae recognition:
– by describing some symbols with EPF, we can limit the number of mathematical symbols recognized by classifiers. This is very important, as mathematical notation can easily use 250 symbols and sometimes many more (the LaTeX documentation presents around 500 symbols). Besides, some symbols can have very different sizes in the same document. All this makes it difficult to build good classifiers for mathematical symbols.
The grammatical description we propose for some symbols can appreciably decrease the number of symbol classes handled by classifiers and therefore increase their quality;
– the use of the DMOS method makes it possible to deal with some of the over- and under-segmentation problems found in old mathematical books.

Although the literature contains quite a lot of work on mathematical formulae recognition [8,2], only a few systems have tried to deal with a large number of symbol classes. For formula structure recognition, various grammatical methods have been proposed (for example [7,9]), but none of them is able to use the context to resolve segmentation problems and therefore to improve the recognition quality.

The system we propose in this paper is limited to isolated, printed formulae with the following vocabulary: basic arithmetical expressions with subscript and superscript, trigonometric expressions, relational expressions, sums, products, square roots and integrals. The system generated by compiling the EPF grammar outputs the recognized formula in LaTeX.

We start this paper with a brief presentation of the EPF formalism. We then show how we used it to describe mathematical formulae so as to recognize their structure. Section 4 shows that this description can improve symbol recognition and can deal with segmentation problems. Before concluding, we present some results.

2 DMOS Method and EPF Formalism

To rapidly develop a mathematical formulae recognition system, we used DMOS, the generic method we proposed for structured document recognition. We only had to define an EPF grammar describing mathematical notation. The EPF formalism (presented in more detail in [5,3]) can be seen as a grammatical language to describe a structured class of documents.
From an EPF grammar we can automatically produce an adapted recognition system by using the automatic generator we developed. EPF can be seen as the addition of several operators to one-dimensional grammars, such as:

Position Operator (encapsulated by AT):
    A && AT(pos) && B
means A and, at the position pos in relation to A, we find B. Here && is the concatenation in the grammar, and A and B each represent a terminal or a non-terminal. The grammar writer can define as many position operators as necessary, just as for non-terminals.

Factorization Operator ## (in association with the position operators):
    A && (AT(pos1) && B ## AT(pos2) && C)
means
    (A && AT(pos1) && B) and (A && AT(pos2) && C)

Space Reduction Operator (IN ... DO): EPF also offers an operator to optionally reduce the area where a rule should be applied:
    IN(area_definition) DO(rule)
This is very useful, for example, to write recursive descriptions. Thus we can describe what a square root notation is (where ::= is the constructor of a grammar rule):
    squareRoot ::= termSqrtSign && AT(rightSqrt) && IN(areaUnderSqrt) DO(expression).
The termSqrtSign describes the square root symbol, and the IN DO operator restricts the recursive description to the part of the image under the square root. The expression is positioned relative to termSqrtSign by the position operator rightSqrt.

(a) The beginning symbol of each formula is the leftmost
(b) The beginning symbol of each formula is NOT the leftmost
Fig. 2. The beginning symbol

3 Grammar of Mathematical Formulae

With the help of EPF we defined a grammatical description of mathematical formulae. This description follows the natural reading order of a formula. The grammar needs to be able to define three kinds of descriptions:
– the beginning of a formula;
– the alignment, subscript and superscript;
– the recursivity of mathematical formulae.
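The third item, recursivity, relies on the IN ... DO operator of Section 2. As a rough illustration only, the following Python sketch mimics the effect of the squareRoot rule on a list of already labelled bounding boxes: when a square root sign is met, the parse is restarted on the symbols lying in the area under the sign. This is not the DMOS parser or EPF itself; the `Symbol` class, the `inside` test and the left-to-right reading are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Symbol:
    """A labelled bounding box (hypothetical; image coordinates, y grows downward)."""
    label: str
    left: float
    top: float
    right: float
    bottom: float

    @property
    def cx(self) -> float:
        return (self.left + self.right) / 2.0

    @property
    def cy(self) -> float:
        return (self.top + self.bottom) / 2.0


def inside(area: Symbol, s: Symbol) -> bool:
    """Assumed stand-in for areaUnderSqrt: the centre of s lies in the box of the sign."""
    return area.left < s.cx < area.right and area.top < s.cy < area.bottom


def parse_expression(symbols: List[Symbol]) -> str:
    """Read symbols left to right; recurse into the area under each square root sign."""
    remaining = sorted(symbols, key=lambda s: s.left)
    out = []
    while remaining:
        head = remaining.pop(0)
        if head.label == "sqrt":
            # Analogue of IN(areaUnderSqrt) DO(expression): restrict, then recurse.
            under = [s for s in remaining if inside(head, s)]
            remaining = [s for s in remaining if not inside(head, s)]
            out.append("\\sqrt{" + parse_expression(under) + "}")
        else:
            out.append(head.label)
    return " ".join(out)


formula = [
    Symbol("2", 0, 10, 8, 30),
    Symbol("sqrt", 10, 0, 60, 32),
    Symbol("x", 20, 10, 28, 30),
    Symbol("+", 32, 14, 40, 26),
    Symbol("1", 44, 10, 50, 30),
    Symbol("=", 64, 16, 74, 24),
    Symbol("y", 78, 10, 86, 30),
]
print(parse_expression(formula))
```

Restricting the area before recursing keeps the inner parse from consuming symbols that belong to the enclosing expression, which is the role the text gives to IN DO.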
3.1 Description of the Beginning of a Formula

As we work on isolated formulae, we can describe the symbol starting the formula as the leftmost symbol in the image, except when a Σ, a Π or a fraction line starts the formula (see figure 2). When a fraction line is the beginning of a formula, it is not necessarily the leftmost symbol, because of the scanning skew. We can then describe the beginning symbol of a formula by these EPF grammar rules:
    formBegin ::= sumBegin.
    formBegin ::= prodBegin.
    formBegin ::= fracLineBegin.
    formBegin ::= leftMostSymbol.
This means that the beginning of a formula can be a Σ, a Π or a fraction line under certain conditions; if none of these conditions holds, it is the leftmost symbol in the image. The conditions for a sigma to be the beginning symbol are defined by:
    sumBegin ::= termSigma && ( AT(leftSide) && noSymbol ## AT(topBottom) && noFracRule).
Here termSigma is the non-terminal for the symbol Σ. The position operators leftSide and topBottom define areas, not necessarily closed, relative to termSigma (see figure 3), and are combined with the factorization notation. This grammar rule states that a sigma is the beginning of a formula if:
– the symbol Σ exists;
– at the left side there is no symbol;
– above and below there is no big fraction line.
The description of a formula is then defined using EPF relative to this beginning symbol (formBegin).

(a) Area defined by leftSide
(b) Area defined by topBottom
Fig. 3. Position operators for the description of Σ as the beginning symbol

3.2 Description of Alignment, Subscript and Superscript

To describe the relative position (alignment, subscript or superscript) of symbols in a mathematical expression, we use position operators (defined by AT) to find the next symbol after the current one. We then add a condition on the positions of the baselines of these two consecutive symbols.
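The text does not give the exact baseline condition, so the following Python sketch only illustrates the kind of test that could be used. It assumes that the baseline of a symbol is the bottom of its bounding box (which is wrong for letters with descenders such as "y" or "p"), that image coordinates grow downward, and that a tolerance of a quarter of the current symbol's height separates "aligned" from "shifted". The names `Box` and `relation` and the threshold value are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Bounding box of a symbol (hypothetical; y grows downward as in an image)."""
    left: float
    top: float
    right: float
    bottom: float

    @property
    def baseline(self) -> float:
        # Simplifying assumption: the baseline is the bottom of the box.
        return self.bottom

    @property
    def height(self) -> float:
        return self.bottom - self.top


def relation(current: Box, following: Box, tolerance: float = 0.25) -> str:
    """Classify the following symbol relative to the current one by baseline shift.

    The tolerance (a fraction of the current symbol's height) is an assumed value.
    """
    shift = current.baseline - following.baseline  # > 0: following symbol is higher
    if abs(shift) <= tolerance * current.height:
        return "aligned"
    return "superscript" if shift > 0 else "subscript"


x = Box(0, 10, 12, 30)                 # base symbol, baseline at y = 30
print(relation(x, Box(13, 2, 20, 14)))   # raised symbol
print(relation(x, Box(13, 26, 20, 38)))  # lowered symbol
print(relation(x, Box(14, 12, 26, 30)))  # same baseline
```

Expressing the tolerance relative to the symbol height keeps the test independent of the scanning resolution and of the font size.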
3.3 Description of the Recursivity

Mathematical notation is very recursive. For example, we can find a mathematical formula in a superscript, in a denominator or under a square root. Using the IN DO operator of EPF, we can easily describe this recursivity, as is done in the squareRoot rule presented previously. For example, a mathematical expression can contain a variable with a subscript or superscript. It can then be described by:
    expression ::= ... | atomExpr | ...
    atomExpr ::= variable && ( AT(subPos) && subExpressionOrNot ## AT(supPos) && supExpressionOrNot ).
    supExpressionOrNot ::= IN(areaSup) DO(expression).
    supExpressionOrNot.
The position operator supPos defines the area where a superscript can be found. supExpressionOrNot uses the IN DO operator to recursively call expression. As for the beginning of the whole formula, the recursive call looks for the beginning symbol of the formula inside areaSup (see figure 4). This beginning symbol is used to check the conditions for being a superscript.

Fig. 4. Recursivity in superscript: area defined by areaSup

4 Mathematical Symbol Recognition

This mathematical formulae recognition system needs a mathematical symbol classifier. We decided to use the classifier (able to deal with reject notions) that we presented in [1] but, for lack of time, we have not yet been able to carry out the training phase. However, we propose to use the EPF grammar to recognize symbols made of line segments.

4.1 Symbol Recognition Using Line Segments

Mathematical symbol recognition is quite difficult because of the large number of symbol classes (LaTeX uses around 500 symbols for mathematics). It is therefore important to reduce the number of mathematical symbol classes given to classifiers, in order to get better recognition rates. This is even more crucial as redundancy is quite weak in mathematical notation [8]. We can notice that quite a lot of mathematical symbols are made of line segments.
As it is relatively easy with the DMOS method to describe (and then recognize) graphical objects made of line segments, we propose to describe those symbols with EPF. For the notation we implemented, we can recognize these symbols with EPF: +, -, ×, /, fraction line, square root symbol, >, <, ≥, ≤, =, =, , and the absolute value. The classifier then only has to recognize these symbols: 0 to 9, a to z, the dot, (, ) and . This reduces the number of classes recognized by the classifier from 54 to 40. This reduction will be even greater when we extend the vocabulary (to add Greek letters, for example). For instance, there are 60 symbols in LaTeX (AMS) for negation of binary relations, and 44 of those 60 symbols can be described with EPF. Moreover, those descriptions are size invariant.

4.2 Resolving Some Segmentation Problems

Using the DMOS method to recognize symbols made of line segments makes it possible to deal with over- and under-segmentation problems, which are found mostly in old printings of mathematical formulae.

Fig. 5. Square root over segmented

Over Segmented Symbols: Figure 5 presents an example of an over-segmented square root symbol found in a book from 1962. This is not a problem for the system we propose, because describing a graphical object with EPF (as for the square root symbol) produces a recognition system which does not depend on connectivity.

Fig. 6. Fraction lines touching symbols ("x" and ")")

Under Segmented Symbols: Figure 6 is from a book from 1970 and shows a fraction line touching some symbols. As the DMOS method can deal with these kinds of segmentation problems [5] by automatically changing the parsed structure during parsing, all the symbols described by EPF can touch other symbols, whether made of line segments or not, and still be recognized.
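To illustrate why a description based on line segments can be size invariant and independent of connectivity, here is a small Python sketch that tests whether two detected segments look like an "=" sign. It is not an EPF description and not the authors' implementation: the `Segment` class and all thresholds (10 degrees, 0.8, 0.1 to 1.0) are assumptions chosen for the example. Every condition is expressed as a ratio of segment lengths, so scaling the image does not change the answer, and no connected-component information is used.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A detected line segment given by its two end points (hypothetical type)."""
    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def length(self) -> float:
        return math.hypot(self.x2 - self.x1, self.y2 - self.y1)

    @property
    def mid_y(self) -> float:
        return (self.y1 + self.y2) / 2.0


def is_horizontal(seg: Segment, max_angle_deg: float = 10.0) -> bool:
    """True if the segment is within max_angle_deg of the horizontal (assumed tolerance)."""
    angle = abs(math.degrees(math.atan2(seg.y2 - seg.y1, seg.x2 - seg.x1)))
    return min(angle, 180.0 - angle) <= max_angle_deg


def x_overlap(a: Segment, b: Segment) -> float:
    """Length of the common horizontal extent of two segments."""
    left = max(min(a.x1, a.x2), min(b.x1, b.x2))
    right = min(max(a.x1, a.x2), max(b.x1, b.x2))
    return max(0.0, right - left)


def looks_like_equals(a: Segment, b: Segment) -> bool:
    """Two stacked, roughly horizontal segments of similar length (assumed thresholds)."""
    if not (is_horizontal(a) and is_horizontal(b)):
        return False
    shorter, longer = sorted((a.length, b.length))
    if longer == 0 or shorter / longer < 0.8:
        return False
    if x_overlap(a, b) < 0.8 * shorter:
        return False
    gap = abs(a.mid_y - b.mid_y)
    return 0.1 * longer <= gap <= 1.0 * longer


small = looks_like_equals(Segment(0, 0, 10, 0), Segment(0, 4, 10, 4))
large = looks_like_equals(Segment(0, 0, 70, 0), Segment(0, 28, 70, 28))
plus = looks_like_equals(Segment(0, 5, 10, 5), Segment(5, 0, 5, 10))
print(small, large, plus)
```

The same pair of segments is accepted at both scales, while the horizontal and vertical strokes of a "+" are rejected because one of them is not horizontal.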
5 Results

This description of mathematical notation, written with EPF, was compiled by the generator of structured document recognition systems. This automatically produced the recognition system for mathematical formulae. Due to time constraints, we have not yet been able to integrate a classifier for symbols that are not made of line segments. Therefore we used manual labelling of those symbols to validate the grammatical part. We thus show that it is possible to use the generic DMOS method for mathematical formulae recognition. Figure 7 shows some examples of recognized formulae, taken from a set of more than 60 formulae (a LaTeX representation of each formula is produced by the recognition system).

(Formulae (1) to (5); among them |a + b| ≤ |a| + |b| and cos^2(x) + sin^2(x) = 1.)
Fig. 7. Five examples of recognized mathematical formulae

6 Conclusion

The grammatical and generic DMOS method has been used to automatically produce three recognition systems for document structure: one for musical scores, one for recursive table structures and one for military forms of the 19th century. We have presented in this paper how to use the same DMOS method to automatically produce a mathematical formulae recognition system. Thanks to the genericity of the method, we were able to strongly reduce the development time. Moreover, by using EPF (the grammatical formalism associated with DMOS) to recognize symbols made of line segments, we can limit the number of symbols recognized by a classifier. This grammatical description of symbols can also deal with size variation. Thanks to the DMOS method it is also possible to overcome some segmentation problems on symbols. We still have to integrate a classifier for symbols without line segments and increase the number of classes to be able to validate the whole system on a real test set.
We will also study the possibility of recognizing non-isolated formulae.

References

1. E. Anquetil, B. Coüasnon, and F. Dambreville. A symbol classifier able to reject wrong shapes for document recognition systems. In Atul K. Chhabra and Dov Dori, editors, Graphics Recognition, Recent Advances, volume 1941 of Lecture Notes in Computer Science, pages 209–218. Springer, 2000.
2. K. F. Chan and D. Y. Yeung. Mathematical expression recognition: a survey. International Journal on Document Analysis and Recognition, 3:3–15, 2000.
3. B. Coüasnon. DMOS: A generic document recognition method, application to an automatic generator of musical scores, mathematical formulae and table structures recognition systems. In ICDAR, International Conference on Document Analysis and Recognition, pages 215–220, Seattle, USA, September 2001.
4. B. Coüasnon and J. Camillerapp. Using grammars to segment and recognize music scores. In L. Spitz and A. Dengel, editors, Document Analysis Systems. World Scientific, 1995.
5. B. Coüasnon and J. Camillerapp. A way to separate knowledge from program in structured document analysis: application to optical music recognition. In ICDAR, International Conference on Document Analysis and Recognition, volume 2, pages 1092–1097, Montréal, Canada, August 1995.
6. B. Coüasnon and L. Pasquer. A real-world evaluation of a generic document recognition method applied to a military form of the 19th century. In ICDAR, International Conference on Document Analysis and Recognition, pages 779–783, Seattle, USA, September 2001.
7. A. Grbavec and D. Blostein. Mathematics recognition using graph rewriting. In ICDAR, International Conference on Document Analysis and Recognition, volume 1, pages 417–421, Montréal, Canada, August 1995.
8. D. Blostein and A. Grbavec. Recognition of mathematical notation. In Handbook of Character Recognition and Document Image Analysis, pages 557–582, 1997.
9. S. Lavirotte and L. Pottier. Mathematical formula recognition using graph grammar. Electronic Imaging, 1998.