Using a Generic Document Recognition Method
for Mathematical Formulae Recognition
Pascal Garcia and Bertrand Coüasnon
IRISA / INSA-Département Informatique
20, Avenue des buttes de Coësmes, CS 14315
F-35043 Rennes Cedex, France
[email protected]
Abstract. We present in this paper how to apply to mathematical formulae a generic recognition method already used for musical scores, table
structure and old forms recognition. We propose to use this method to
recognize the structure of formulae and also to recognize some symbols
made of line segments. This offers two possibilities: improving symbol recognition when there are many symbol classes, as in mathematics; and
overcoming the segmentation problems usually found in old mathematical
formulae.
1 Introduction
We presented in [3] DMOS (Description and MOdification of Segmentation), a
generic recognition method for structured documents. This method consists of:
– the grammatical formalism EPF (Enhanced Position Formalism), which can
be seen as a description language for structured documents. EPF makes
possible at the same time a graphical, a syntactical or even a semantical
description of a class of documents;
– the associated parser which is able to change the parsed structure during the
parsing. This allows the system to try other segmentations with the help of
context to improve recognition.
We have implemented this DMOS method to build an automatic generator of
structured document recognition systems. Using this generator, adaptation to
a new kind of document is then simply done by defining a description of the
document with an EPF grammar. This grammar is then compiled to produce
a new structured document recognition system. Each system can then use a
classifier to recognize symbols which can be seen as characters.
By only changing the EPF grammar, and when needed by training a classifier,
we produced automatically various recognition systems: one for musical scores [4]
(figure 1a), one on recursive table structures whatever the number of rows or
columns (figure 1c), and one on military forms of the 19th century (figure 1b).
We have been able to successfully test this military forms recognition system on
5,000 pages of forms even if they were quite damaged [6].
D. Blostein and Y.-B. Kwon (Eds.): GREC 2002, LNCS 2390, pp. 236–244, 2002.
© Springer-Verlag Berlin Heidelberg 2002
Fig. 1. Applications of DMOS, a generic document recognition method: (a) on musical scores: the original image (top) and the score reconstructed from the recognized structure (bottom); (b) on a military form of the 19th century: the original image and its correctly recognized structure in spite of the added sheets of paper; (c) on table structure: recognition of the table hierarchy (original image and detected tables at levels 1, 2 and 3)
We present in this paper the way we used this generic method to automatically produce a mathematical formulae recognition system. Thanks to DMOS we
have been able to strongly reduce the development time, which is one of the main
benefits of using a generic method. Moreover, we could propose an answer to
some unsolved problems of mathematical formulae recognition:
– by describing some symbols with EPF, we could limit the number of mathematical symbols recognized by classifiers. This is very important as mathematical notation can easily use 250 symbols and sometimes many more
(the LaTeX documentation presents around 500 symbols). Besides, some symbols
can have very different sizes within the same document. All this makes it difficult
to build good classifiers for mathematical symbols. The grammatical description we propose for some symbols can appreciably decrease the number
of symbol classes handled by classifiers and therefore increase their quality;
– the use of the DMOS method makes it possible to deal with some of
the over- and under-segmentation problems found in old mathematical
books.
Although the literature contains quite a lot of work on mathematical formulae recognition [8,2], only a few of these works tried to deal with a large
number of symbol classes. For formula structure recognition, various grammatical methods have been proposed (for example [7,9]), but none of them is
able to use the context to resolve segmentation problems and therefore to improve the recognition quality.
The system we propose in this paper is limited to isolated and printed formulae with the following vocabulary: basic arithmetical expressions with subscript
and superscript, trigonometric expressions, relational expressions, sums, products, square roots and integrals. The system generated by compiling the EPF
grammar outputs the recognized formula in LaTeX.
We will start this paper with a brief presentation of the EPF formalism. Then
we will see how we used it to define a description of mathematical formulae
in order to recognize their structure. Section 4 will show that this description
can improve symbol recognition and can deal with segmentation problems.
Before concluding we will present some results.
2 DMOS Method and EPF Formalism
To rapidly develop a mathematical formulae recognition system, we used DMOS,
the generic method we proposed for structured document recognition. We only
had to define an EPF grammar describing mathematical notation.
The EPF formalism (presented in more detail in [5,3]) can be seen
as a grammatical language to describe a structured class of documents. From an
EPF grammar we can automatically produce an adapted recognition system by
using the automatic generator we developed.
EPF can be seen as the addition of several operators to one-dimensional
grammars, such as:
Position Operator (encapsulated by AT):
A && AT(pos) && B
means A, and at the position pos in relation to A, we find B.
Here && is the concatenation in the grammar, and A and B each represent a terminal
or a non-terminal. The grammar writer can define as many position operators
as necessary, just as for non-terminals.
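As an illustration only, a position operator can be pictured as a predicate that, given the bounding box of A, selects the boxes where B may be found. The Python sketch below is not the EPF implementation: the names Box, right_of and at, the bounding-box representation and the max_gap threshold are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of a symbol (y grows downward)."""
    label: str
    x0: float
    y0: float
    x1: float
    y1: float


# A position operator maps the box of A to a test on candidate boxes for B.
PositionOp = Callable[[Box], Callable[[Box], bool]]


def right_of(max_gap: float) -> PositionOp:
    """Hypothetical operator: B starts at most max_gap to the right of A
    and overlaps A vertically."""
    def op(a: Box) -> Callable[[Box], bool]:
        def accepts(b: Box) -> bool:
            overlaps = b.y0 < a.y1 and b.y1 > a.y0
            return overlaps and 0 <= b.x0 - a.x1 <= max_gap
        return accepts
    return op


def at(a: Box, pos: PositionOp, symbols: List[Box]) -> Optional[Box]:
    """A && AT(pos) && B: return the closest symbol accepted by pos, if any."""
    accepts = pos(a)
    candidates = [s for s in symbols if s is not a and accepts(s)]
    return min(candidates, key=lambda s: s.x0, default=None)


symbols = [
    Box("x", 0, 10, 8, 20),
    Box("+", 12, 12, 18, 18),
    Box("y", 40, 10, 48, 20),
]
found = at(symbols[0], right_of(max_gap=10), symbols)
print(found.label if found else None)
```

In this toy layout the symbol found to the right of "x" is "+", while "y" is too far from "+" to be accepted with the same threshold.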
Factorization Operator ## (in association with the position operators):
A && (AT(pos1) && B ##
AT(pos2) && C)
means
(A && AT(pos1) && B) and (A && AT(pos2) && C)
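The factorization can be sketched in Python as matching several branches against one shared anchor. This is a simplified reading of the operator, not the EPF parser: the Symbol tuple, the below and above tests, the 10-pixel tolerance and the function factorized are hypothetical names and values chosen for the example.

```python
from typing import Callable, List, Optional, Tuple

# A symbol is (label, x, y): a label and the centre of its box (y grows downward).
Symbol = Tuple[str, float, float]
# A branch pairs a position test relative to the anchor with an expected label.
Branch = Tuple[Callable[[Symbol, Symbol], bool], str]


def below(a: Symbol, b: Symbol) -> bool:
    """Hypothetical position test: b is under a, roughly centred on it."""
    return b[2] > a[2] and abs(b[1] - a[1]) < 10


def above(a: Symbol, b: Symbol) -> bool:
    """Hypothetical position test: b is over a, roughly centred on it."""
    return b[2] < a[2] and abs(b[1] - a[1]) < 10


def factorized(anchor: Symbol, branches: List[Branch],
               symbols: List[Symbol]) -> Optional[List[Symbol]]:
    """A && (AT(pos1) && B ## AT(pos2) && C): every branch is matched
    relative to the same anchor A; the rule fails if any branch fails."""
    matched = []
    for position, label in branches:
        hit = next((s for s in symbols
                    if s[0] == label and position(anchor, s)), None)
        if hit is None:
            return None
        matched.append(hit)
    return matched


symbols = [("sigma", 50, 50), ("lower", 50, 80), ("upper", 50, 20)]
result = factorized(symbols[0], [(below, "lower"), (above, "upper")], symbols)
print(result)
```

Both branches are positioned relative to the same anchor, which is what the expansion (A && AT(pos1) && B) and (A && AT(pos2) && C) expresses.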
Space Reduction Operator (IN ... DO):
EPF also offers an operator to optionally reduce the area where a rule should
be applied:
IN(area_definition) DO(rule)
This is very useful, for example, to make recursive descriptions. Thus we can
describe what a square root notation is (where ::= is the constructor of a
grammar rule):
squareRoot ::=
termSqrtSign &&
AT(rightSqrt) && IN(areaUnderSqrt)
DO(expression).
The termSqrtSign describes the square root symbol, and the IN DO operator restricts the recursive description to the part of the image under
the square root. The expression is positioned relative to termSqrtSign
by the position operator rightSqrt.
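The effect of IN ... DO on a recursive rule can be sketched in Python as filtering the symbols to an area before applying the rule again. This is a toy model under stated assumptions: symbols are bounding boxes, the area under the square root is taken to be the bounding box of the sign itself, and Box, inside, in_do and expression are hypothetical names, not part of EPF.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of a symbol (y grows downward)."""
    label: str
    x0: float
    y0: float
    x1: float
    y1: float


def inside(inner: Box, area: Box) -> bool:
    """True when inner lies entirely within area."""
    return (area.x0 <= inner.x0 and inner.x1 <= area.x1
            and area.y0 <= inner.y0 and inner.y1 <= area.y1)


def in_do(area: Box, rule: Callable[[List[Box]], str],
          symbols: List[Box]) -> str:
    """IN(area) DO(rule): apply rule only to the symbols lying inside area."""
    return rule([s for s in symbols if inside(s, area)])


def expression(symbols: List[Box]) -> str:
    """Toy rule: read symbols left to right; a square root sign opens a
    recursive call restricted to the area it covers."""
    ordered = sorted(symbols, key=lambda s: s.x0)
    if not ordered:
        return ""
    head, rest = ordered[0], ordered[1:]
    if head.label == "sqrt":
        # Assumption: the area under the root is the sign's own bounding box.
        radicand = in_do(head, expression, rest)
        after = expression([s for s in rest if not inside(s, head)])
        return "\\sqrt{" + radicand + "}" + after
    return head.label + expression(rest)


symbols = [
    Box("sqrt", 0, 0, 40, 20),
    Box("x", 10, 5, 18, 18),
    Box("+", 20, 8, 26, 14),
    Box("1", 28, 5, 34, 18),
    Box("y", 45, 5, 52, 18),
]
print(expression(symbols))
```

Because the recursive call only sees the symbols inside the restricted area, the symbol "y" to the right of the root is left for the enclosing expression.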
Fig. 2. The beginning symbol: (a) the beginning symbol of each formula is the leftmost; (b) the beginning symbol of each formula is NOT the leftmost
3 Grammar of Mathematical Formulae
With the help of EPF we defined a grammatical description of mathematical
formulae. This description follows the natural reading order of a formula. The
grammar needs to be able to define three kinds of descriptions:
– the beginning of a formula;
– the alignment, subscript and superscript;
– the recursivity of mathematical formulae.
3.1 Description of the Beginning of a Formula
As we work on isolated formulae, we can describe the symbol starting the formula
as the leftmost symbol in the image, except when a Σ, a Π or a fraction line
starts the formula (see figure 2). When a fraction line is the beginning of a
formula, it is not necessarily the leftmost symbol because of the scanning skew. We
can then describe the beginning symbol of a formula by these EPF grammatical
rules:
formBegin ::= sumBegin.
formBegin ::= prodBegin.
formBegin ::= fracLineBegin.
formBegin ::= leftMostSymbol.
This means that the beginning of a formula can be a Σ, a Π or a fraction
line under certain conditions, and if none of these conditions is met it is the
leftmost symbol in the image. The conditions for a sigma to
be the beginning symbol are defined by:
sumBegin ::= termSigma && (
AT(leftSide) && noSymbol ##
AT(topBottom) && noFracRule).
Here termSigma is the non-terminal for the symbol Σ. The position operators
leftSide and topBottom define areas (not necessarily closed), as in figure 3,
relative to termSigma, to which both apply through the factorization notation. This grammar rule states that a sigma is the beginning of a formula if:
– the symbol Σ exists;
– at the left side there is no symbol;
– above and below there is no big fraction line.
The description of a formula is then defined using EPF relatively to this
beginning symbol (formBegin).
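The selection of the beginning symbol can be sketched in Python as a list of alternatives tried in order, with the leftmost symbol as the fallback. This is only an approximation of the rules above: the geometric definitions of leftSide and topBottom are assumptions, the alternatives prodBegin and fracLineBegin are omitted because their conditions are not detailed here, and Box, sum_begin, left_most_symbol and form_begin are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of a symbol (y grows downward)."""
    label: str
    x0: float
    y0: float
    x1: float
    y1: float


def sum_begin(symbols: List[Box]) -> Optional[Box]:
    """sumBegin: a sigma with no symbol in leftSide and no fraction line
    in topBottom (both areas are simplified assumptions)."""
    for sigma in (s for s in symbols if s.label == "sigma"):
        others = [s for s in symbols if s is not sigma]
        # leftSide (assumed): left of the sigma and overlapping it vertically.
        left_clear = not any(
            o.x1 <= sigma.x0 and o.y0 < sigma.y1 and o.y1 > sigma.y0
            for o in others)
        # topBottom (assumed): a fraction line above or below the sigma
        # that spans its whole width.
        no_frac = not any(
            o.label == "fracline"
            and o.x0 <= sigma.x0 and o.x1 >= sigma.x1
            and (o.y1 <= sigma.y0 or o.y0 >= sigma.y1)
            for o in others)
        if left_clear and no_frac:
            return sigma
    return None


def left_most_symbol(symbols: List[Box]) -> Optional[Box]:
    """leftMostSymbol: the symbol with the smallest left coordinate."""
    return min(symbols, key=lambda s: s.x0, default=None)


def form_begin(symbols: List[Box]) -> Optional[Box]:
    """formBegin: try each alternative in order, leftmost symbol last."""
    for rule in (sum_begin, left_most_symbol):
        hit = rule(symbols)
        if hit is not None:
            return hit
    return None


# A sum whose lower bound starts further left than the sigma itself.
with_bound = [Box("0", 0, 30, 5, 36), Box("sigma", 8, 5, 22, 28),
              Box("i", 26, 10, 30, 24)]
# A sigma preceded by another symbol on the same line.
with_prefix = [Box("2", 0, 10, 6, 24), Box("sigma", 8, 5, 22, 28)]
print(form_begin(with_bound).label, form_begin(with_prefix).label)
```

In the first layout the sigma is chosen although the lower bound is the leftmost symbol; in the second the sigma has a symbol on its left side, so the leftmost symbol is used.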
Fig. 3. Position operators for the description of Σ as the beginning symbol: (a) area defined by leftSide; (b) area defined by topBottom
3.2 Description of Alignment, Subscript and Superscript
To describe the relative position (alignment, subscript or superscript) of symbols
in a mathematical expression, we use position operators (defined by AT) to find
the next symbol after the current one. Then we add a condition on the positions
of the baselines of these two consecutive symbols.
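One possible form of such a baseline condition is sketched below in Python. The exact condition used in the grammar is not given here, so the normalization by the height of the previous symbol, the tolerance of 0.2 and the function name relation are assumptions for illustration.

```python
def relation(prev_baseline: float, prev_height: float,
             next_baseline: float, tolerance: float = 0.2) -> str:
    """Classify the next symbol relative to the previous one by comparing
    baselines (y grows downward, so a raised symbol has a smaller baseline)."""
    # Vertical shift of the next baseline, in units of the previous height;
    # positive when the next symbol is raised.
    shift = (prev_baseline - next_baseline) / prev_height
    if shift > tolerance:
        return "superscript"
    if shift < -tolerance:
        return "subscript"
    return "aligned"


print(relation(20, 10, 20), relation(20, 10, 14), relation(20, 10, 24))
```

Normalizing by the symbol height keeps the test independent of the scanning resolution, which is a design choice of this sketch rather than a documented property of the system.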
3.3 Description of the Recursivity
Mathematical notation is highly recursive. For example, we can find a mathematical formula in a superscript, in a denominator or under a square root. Using the
IN DO operator of EPF, we can easily describe this recursion, as is done in
the squareRoot rule previously presented.
For example, a mathematical expression can contain a variable with a subscript or superscript. It can then be described by:
expression ::= ... | atomExpr | ...
atomExpr ::=
variable && (
AT(subPos) && subExpressionOrNot ##
AT(supPos) && supExpressionOrNot
).
supExpressionOrNot ::=
IN(areaSup) DO(expression).
supExpressionOrNot.
The position operator supPos defines the area where a superscript can be found.
supExpressionOrNot uses the IN DO operator to recursively call expression.
As for the beginning of the whole formula, the recursive call looks for the beginning
symbol of the formula inside areaSup (see figure 4). This beginning symbol is then used to
check the conditions for being a superscript.
Fig. 4. Recursivity in superscript : area defined by areaSup
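To make the recursive shape of atomExpr concrete, the Python sketch below models a recognized structure as a tree and prints it in LaTeX. It only mirrors the form of the rules above; the classes Atom and Expr and the function to_latex are hypothetical and say nothing about how the generated system actually builds its output.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Atom:
    """A variable with an optional subscript and superscript, each of
    which is again a full expression (as in atomExpr)."""
    variable: str
    sub: Optional["Expr"] = None
    sup: Optional["Expr"] = None


@dataclass
class Expr:
    """A sequence of atoms read from left to right."""
    items: List[Atom]


def to_latex(expr: Expr) -> str:
    """Render the tree recursively, descending into sub- and superscripts."""
    out = []
    for atom in expr.items:
        text = atom.variable
        if atom.sub is not None:
            text += "_{" + to_latex(atom.sub) + "}"
        if atom.sup is not None:
            text += "^{" + to_latex(atom.sup) + "}"
        out.append(text)
    return "".join(out)


# x with superscript y, where y itself carries the superscript 2.
tower = Expr([Atom("x", sup=Expr([Atom("y", sup=Expr([Atom("2")]))]))])
print(to_latex(tower))
```

A superscript containing its own superscript needs no special case, which is the point of describing it with a recursive call to expression.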
4 Mathematical Symbol Recognition
In this mathematical formulae recognition system we need a mathematical symbol classifier.
For this we decided to use the classifier (able to reject unknown shapes) we
presented in [1], but for lack of time we have not yet carried out its training
phase. However, we propose to use the EPF grammar to recognize symbols
made of line segments.
4.1 Symbol Recognition Using Line Segments
Mathematical symbol recognition is quite difficult because of the large number of symbol classes (LaTeX uses around 500 symbols for mathematics). It is
therefore important to reduce the number of mathematical symbol classes given to
classifiers, in order to get better recognition rates. This is even more crucial as
redundancy is quite weak in mathematical notation [8].
We can notice that quite a lot of mathematical symbols are made of line
segments. As with the DMOS method it is relatively easy to describe (and then
recognize) graphical objects made of line segments, we propose to describe those
symbols with EPF. For the notation we implemented we can recognize those
symbols
with EPF: +, -, ×, /, fraction line, square root symbol, >, <, ≥, ≤, =,
=, ,
and the absolute value. The classifier will then only have to recognize
those symbols: 0 to 9, a to z, the dot, (, ) and .
This reduces the number of classes recognized by the classifier from 54 to 40.
This reduction can become even larger when we extend the vocabulary
(for example to add Greek letters). For example there are 60 symbols in LaTeX (AMS) for
negation of binary relations, and of those 60 symbols 44 can be described with
EPF. Moreover those descriptions are size invariant.
4.2 Resolving Some Segmentation Problems
Using the DMOS method to recognize symbols made of line segments offers the
possibility to deal with over and under segmentation problems which can be
found mostly in old versions of mathematical formulae.
Fig. 5. Square root over segmented
Over Segmented Symbols: Figure 5 presents an example of an over segmented square root symbol found in a book from 1962. This is not a problem for
the system we propose because a description of a graphical object with EPF (as
for the square root symbol) produces a recognition system which does not depend
on connectivity.
Fig. 6. Fraction lines touching symbols (“x” and “)”)
Under Segmented Symbols: Figure 6 is from a book from 1970 and shows
a fraction line touching some symbols. As the DMOS method can deal with
these kinds of segmentation problems [5] by automatically changing the parsed
structure during parsing, all the symbols described by EPF can touch other
symbols, whether made of line segments or not, and still be recognized.
5 Results
This description of mathematical notation written in EPF was compiled by the
generator of structured document recognition systems. This automatically produced the recognition system for mathematical formulae.
Due to time constraints, we have not yet been able to integrate a classifier
for symbols not made of line segments. Therefore we labelled those
symbols manually in order to validate the grammatical part. This shows that it is possible to use
the generic method DMOS for mathematical formulae recognition. We present
in figure 7 some examples of recognized formulae taken from a set of more than 60 formulae
(a LaTeX representation of the formula is produced by the recognition system).
Fig. 7. Five examples of recognized mathematical formulae
6 Conclusion
The grammatical and generic DMOS method has been used to automatically
produce three recognition systems of document structure: one for musical scores,
one on recursive table structures and one on military forms of the 19th century.
We have presented in this paper how to use the same DMOS method to automatically produce a mathematical formulae recognition system. Due to the
genericity of the method we have been able to strongly reduce the development
time. Moreover by using EPF (the grammatical formalism associated with DMOS)
to recognize symbols made of line segments, we can limit the number of symbols
recognized by a classifier. This grammatical description of symbols can also deal
with size variation. Thanks to the DMOS method it is also possible to overcome
some segmentation problems on symbols.
We still have to integrate a classifier for symbols not made of line segments and
increase the number of classes to be able to validate the whole system on a real
test set. We will also study the possibility of recognizing non-isolated formulae.
References
1. E. Anquetil, B. Coüasnon, and F. Dambreville. A symbol classifier able to reject
wrong shapes for document recognition systems. In Atul K. Chhabra and Dov Dori,
editors, Graphics Recognition, Recent Advances, volume 1941 of Lecture Notes in
Computer Science, pages 209–218. Springer, 2000.
2. K. F. Chan and D. Y. Yeung. Mathematical expression recognition: a survey.
International Journal on Document Analysis and Recognition, 3:3–15, 2000.
3. B. Coüasnon. DMOS: A generic document recognition method, application to an
automatic generator of musical scores, mathematical formulae and table structures
recognition systems. In ICDAR, International Conference on Document Analysis
and Recognition, pages 215–220, Seattle, USA, September 2001.
4. B. Coüasnon and J. Camillerapp. Using grammars to segment and recognize music
scores. In L. Spitz and A. Dengel, editors, Document Analysis Systems. World
Scientific, 1995.
5. B. Coüasnon and J. Camillerapp. A way to separate knowledge from program in
structured document analysis: application to optical music recognition. In ICDAR,
International Conference on Document Analysis and Recognition, volume 2, pages
1092–1097, Montréal, Canada, August 1995.
6. B. Coüasnon and L. Pasquer. A real-world evaluation of a generic document recognition method applied to a military form of the 19th century. In ICDAR, International Conference on Document Analysis and Recognition, pages 779–783, Seattle,
USA, September 2001.
7. A. Grbavec and D. Blostein. Mathematics recognition using graph rewriting. In ICDAR, International Conference on Document Analysis and Recognition, volume 1,
pages 417–421, Montréal, Canada, August 1995.
8. D. Blostein and A. Grbavec. Recognition of mathematical notation. Handbook of
character recognition and document image analysis, pages 557–582, 1997.
9. S. Lavirotte and L. Pottier. Mathematical formula recognition using graph grammar.
Electronic Imaging, 1998.