Improving Formula Analysis with Line and Mathematics Identification

Mohamed Alkalai, Josef B. Baker and Volker Sorge
School of Computer Science, University of Birmingham, UK
Email: M.A.Alkalai|J.Baker|[email protected]
URL: www.cs.bham.ac.uk/∼maa897|∼jbb|∼vxs

Xiaoyan Lin
Institute of Computer Science and Technology, Peking University, Beijing, China
Email: [email protected]
Abstract—The explosive growth of the internet and electronic publishing has led to a huge number of scientific documents being available to users; however, they are usually inaccessible to those with visual impairments and often only partially compatible with software and modern hardware such as tablets and e-readers. In this paper we revisit Maxtract, a tool for analysing and converting documents into accessible formats, and combine it with two advanced segmentation techniques: statistical line identification and machine-learning formula identification. We show how these advanced techniques improve the quality of both Maxtract's underlying document analysis and its output. We re-run and compare experimental results over a number of data-sets, presenting a qualitative review of the improved output and drawing conclusions.

Keywords—Math formula recognition, line segmentation, formula identification
I. INTRODUCTION
Through the explosive growth of the internet and electronic publishing, a wealth of scientific documents are now
available to users. Furthermore, Adobe has given us a format
to share such documents, the ubiquitous Portable Document
Format, PDF, with a specification for creating fully structured,
accessible documents. However, the majority of PDF authoring software does not offer these options, instead producing files with very limited capabilities. Whilst this does not cause too many issues with standard text-rich files, it does with scientific documents, especially those containing mathematics.
Such documents become extremely difficult to view on modern
electronic devices such as smart phones, tablets and e-book
readers because they do not reflow correctly and formulae
cannot be enlarged or copied. Furthermore, for those with
visual impairments, the documents become almost completely
inaccessible due to incompatibility with screen readers and
other accessibility tools.
This has led to a relatively new field of research, that of the analysis and conversion of native-digital scientific documents. In previous work we have described Maxtract [1], a tool for automatically analysing PDF files through a combination of PDF and image analysis. Whilst it performs extremely well compared to contemporary state-of-the-art OCR-based systems [2], there are still a number of shortcomings that we aim to address in this paper. We combine Maxtract with our previously proposed methods for line segmentation [3] and formula identification [4], [5] to improve both the underlying structure of the recognised document and the visual output.

In Section II we introduce line segmentation and the problems that occur when traditional techniques are applied to documents rich in mathematical notation. We then show how the method previously implemented in Maxtract, projection profile cutting (PPC), performed poorly in certain cases and present a statistical histogram-based alternative which dramatically improves accuracy. This is followed by an experimental analysis of the method over a large and diverse data-set in comparison to PPC. Section III reviews formula identification techniques, and shows how a machine learning and heuristic technique improves the formula recognition rate of Maxtract, which previously used a solely heuristic approach based on fonts and geometric features. The new approach is evaluated by comparing the experimental results to those originally reported in [6], then combined with the line segmentation technique in Section II and evaluated further. Section IV presents a qualitative evaluation of Maxtract's output, with examples of the LaTeX produced after implementing the aforementioned changes. Finally, Section V summarises the paper, drawing conclusions over our work.
II. LINE SEGMENTATION

Line segmentation is often considered a prerequisite step for document analysis of both printed and handwritten documents, and a wide range of techniques have been proposed for this process, including those based on PPC, run-length smoothing, grouping and Hough transforms [7], [8], [9], [10]. Whilst these techniques have been shown to perform well on regular text documents, they do not suffice when dealing with those rich in mathematical notation, as many of the assumptions used for text are no longer applicable due to the inherent two-dimensional structure of mathematics. A further review of the segmentation of mathematical documents is covered in [11]; however, this is more concerned with the layout of mathematical structures than their corresponding and surrounding lines.
A. Previous Approach
Maxtract previously used a PPC-based approach to segment each page into lines [2]. Whilst this was a fast and efficient technique, which was modified to deal correctly with fractions and most mathematical accents through a series of heuristics, limits were incorrectly cut from a base expression whenever a horizontal cut could be made. This led to the incorrect analysis of many isolated formulae containing limit structures.
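As a rough illustration of the idea behind projection profile cutting, the following Python sketch cuts a binarised page into horizontal bands wherever an entirely empty pixel row occurs. It is a minimal sketch of the general PPC technique, not Maxtract's implementation; the page array and its encoding are assumptions for illustration, and the heuristics for fractions and accents mentioned above are omitted.

import numpy as np

def ppc_lines(page):
    """Split a binarised page (1 = ink, 0 = background) into horizontal
    bands separated by completely empty rows (a basic projection profile cut)."""
    profile = page.sum(axis=1)          # amount of ink per pixel row
    lines, start = [], None
    for y, ink in enumerate(profile):
        if ink > 0 and start is None:   # first inked row of a new band
            start = y
        elif ink == 0 and start is not None:
            lines.append((start, y))    # band ended at the previous row
            start = None
    if start is not None:               # band running to the bottom edge
        lines.append((start, len(profile)))
    return lines

# Toy example: two text bands separated by blank rows
page = np.zeros((7, 10), dtype=int)
page[1:3, 2:8] = 1
page[5:6, 1:9] = 1
print(ppc_lines(page))                  # [(1, 3), (5, 6)]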
B. A Histogrammatic Approach to Line Segmentation
The basic idea of our approach is to first identify all possible individual lines and then merge them into single lines if there is an indication that they are components of mathematical expressions. In doing so we rely neither on knowledge of the content of lines, nor on font information, nor on vertical distance. Instead we construct a histogram of horizontal spaces between glyphs on each line and use a statistical analysis to determine which lines are to be merged. In addition, we use simple character-height considerations in a post-processing step to re-classify lines that are wrongly marked for merging by the statistical analysis.
The first two steps of the procedure are quite similar
to those previously used; the bounding boxes of all glyphs
on a page are calculated and initial lines are determined by
separating sets of glyphs with respect to vertical whitespace.
We then construct a histogram for each page that captures
the horizontal distances between pairs of successive glyphs in
each line. We thereby omit elements in a line that overlap
horizontally, i.e., where the distance is 0. Figure 1 shows two
examples for such histograms, where the x-axis denotes the
values for the horizontal distance between pairs of glyphs in
pixels, and the y-axis denotes the number of occurrences of a
particular distance.
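The gap histogram described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: glyphs are hypothetical (x0, y0, x1, y1) bounding boxes already grouped into initial lines and sorted left to right, and horizontally overlapping pairs are skipped as described.

from collections import Counter

def gap_histogram(lines):
    """Histogram of horizontal gaps (in pixels) between successive glyphs.

    `lines` is a list of lines, each a list of glyph bounding boxes
    (x0, y0, x1, y1) sorted by x0. Pairs that overlap horizontally
    (gap <= 0) are omitted, as in the text above.
    """
    hist = Counter()
    for line in lines:
        for left, right in zip(line, line[1:]):
            gap = right[0] - left[2]     # next glyph's left edge minus current right edge
            if gap > 0:
                hist[gap] += 1
    return hist

# Toy example: one line with three glyphs
line = [(0, 0, 10, 12), (13, 0, 20, 12), (40, 0, 55, 12)]
print(gap_histogram([line]))             # Counter({3: 1, 20: 1})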
Fig. 1. Examples of pages and their histograms of the gaps between glyphs.
We can observe a general pattern in these histograms; they
can be split into two parts by a global minimum that is roughly
in the middle of the x-axis. This leaves two parts, each with
a global maximum. Furthermore, in the right part one can
identify another global minimum. While this can be at the very
end of the x-axis it usually is not. We call these two minimal
points v1 and v2 , respectively, and use them to classify lines
as follows:
We define a line to be a principal line P if there exists a horizontal distance value d between two neighbouring glyphs in the line that satisfies v1 < d < v2; otherwise it is a non-principal line N. Whilst the former constitutes a proper line, the latter should be merged with its closest principal line.
The intuition behind this heuristic is that values in the histogram which are less than v1 represent distances between single characters in a word or a mathematical expression, whereas the area between v1 and v2 represents the distances between single words. These spaces generally do not occur in lines that only constitute part of a mathematical formula. While this measure alone already yields good results, it still suffers from occasionally misclassifying lines.
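A minimal sketch of this classification step, under the simplifying assumption that v1 is taken as the global minimum of the gap histogram and v2 as the global minimum of the part to its right; the statistical analysis described above may be more involved, and the helper names are hypothetical.

def split_points(hist, max_gap):
    # v1: global minimum of the histogram; v2: global minimum of the part
    # to the right of v1 (see the description above).
    counts = [hist.get(g, 0) for g in range(max_gap + 1)]
    v1 = min(range(1, max_gap), key=lambda g: counts[g])
    v2 = min(range(v1 + 1, max_gap + 1), key=lambda g: counts[g])
    return v1, v2

def is_principal(line, v1, v2):
    # A line is principal if some gap between neighbouring glyphs falls
    # strictly between v1 and v2, i.e. a word-sized space occurs in it.
    gaps = (right[0] - left[2] for left, right in zip(line, line[1:]))
    return any(v1 < g < v2 for g in gaps)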
To correct lines wrongly classified as non-principal lines, we use a simple height ratio between glyphs of consecutive principal lines P and non-principal lines N. Let H(N) and H(P) be the maximum height of glyphs in N and P, respectively. If H(N) > H(P)/T1, where 1 < T1 < 2 is a threshold value, we reclassify N as a principal line. The value for T1 is determined empirically by experiments on a small sample set. For the experiments we present in this paper the value was set to 1.7.
To re-classify lines that were wrongly labelled as principal lines, we use a height bound T2 that is the maximum height of all non-principal lines in a page. If a principal line P is of height less than or equal to T2 it is converted to a non-principal line.

Once the classification of lines is completed, we merge non-principal lines with their horizontally closest neighbouring principal line, but only if they overlap horizontally. If a non-principal line has no horizontal overlap with any neighbouring principal line it is converted to a principal line itself.
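The two corrections and the final merge can be sketched as a post-processing pass over the initially classified lines. This is a simplified illustration of the rules above rather than Maxtract's actual code: lines are hypothetical lists of glyph bounding boxes, adjacency stands in for "consecutive" lines, and a non-principal line is simply attached to the preceding line it overlaps.

def max_glyph_height(line):
    return max(y1 - y0 for (_, y0, _, y1) in line)

def h_overlap(a, b):
    # Horizontal overlap (in pixels) between two lines of glyph boxes.
    a_x0, a_x1 = min(g[0] for g in a), max(g[2] for g in a)
    b_x0, b_x1 = min(g[0] for g in b), max(g[2] for g in b)
    return min(a_x1, b_x1) - max(a_x0, b_x0)

def post_process(lines, principal, t1=1.7):
    # lines: top-to-bottom list of lines; principal[i] is the histogram
    # classification of line i (True = principal).
    heights = [max_glyph_height(l) for l in lines]

    # Height-ratio correction: promote a non-principal line whose glyphs are
    # nearly as tall as those of an adjacent principal line (T1 = 1.7).
    for i in range(len(lines)):
        if not principal[i]:
            for j in (i - 1, i + 1):
                if 0 <= j < len(lines) and principal[j] and heights[i] > heights[j] / t1:
                    principal[i] = True
                    break

    # Height-bound correction: demote principal lines no taller than the
    # tallest non-principal line on the page (T2).
    non_heights = [h for h, p in zip(heights, principal) if not p]
    if non_heights:
        t2 = max(non_heights)
        principal = [p and h > t2 for h, p in zip(heights, principal)]

    # Merging: attach each non-principal line to the preceding line if they
    # overlap horizontally; otherwise promote it to a line of its own.
    merged = []
    for i, line in enumerate(lines):
        if principal[i] or not merged or h_overlap(merged[-1], line) <= 0:
            merged.append(list(line))
        else:
            merged[-1].extend(line)
    return merged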
C. Experimentation

For an initial evaluation of our approach we experimented over a manually ground-truthed data-set containing 5801 lines in 200 pages. The pages were taken from 12 documents comprising a mixture of books and journal articles. Table I presents the experimental results of the original PPC approach, the histogrammatic approach alone, as well as improved using first the height ratio and then the height bound for correction. The value T1 for the height ratio was set to 1.7, while the height bound T2 was computed independently for every page in question.

TABLE I. EXPERIMENTAL RESULTS FOR LINE RECOGNITION

Method         Actual lines   Lines found   Correct lines   Accuracy
PPC            5801           6987          5015            86.4%
Histogram      5801           5727          5265            90.7%
Height Ratio   5801           5910          5587            96.3%
Height Bound   5801           5863          5625            96.9%
Not surprisingly PPC results in a larger number of lines
and, as there is no subsequent merging, in a relatively low
accuracy of 86.4%. Using the histogram alone improves this
accuracy, but it generally merges too many lines. This is
corrected by the addition of the height ratio that re-classifies
some of the lines incorrectly assumed to be non-principal as
principal lines. As a consequence we get a slightly higher
number of lines but also a higher accuracy of 96.3%. A further
slight improvement in this accuracy to 96.9% is obtained using
the height bound.
To further examine the robustness of our technique and, in particular, to rule out over-fitting on our original data-set, we experimented with a second, independent and larger data-set containing 1000 pages composed from more than 60 mathematical papers. We ran our full method over the second data-set and then manually checked each result. Consequently, we have done this comparison only for the full classification including both the height ratio and height bound corrections. Whilst we cannot rule out some classification mistakes due to human error, we are confident that the experimental results given in Table II are accurate.
TABLE II. EXPERIMENTAL RESULTS OF 1000 PAGES

Actual lines   Lines found   Correct lines   Accuracy
34146          34526         33678           98.6%
The experimental results over the second data-set show an
increase in recognition rate of approximately 2%. The increase
in performance is attributed to a more representative data-set,
and certainly gives us confidence about the effectiveness of
our technique over a wide range of documents.
The incorrectly identified lines fall into two categories.

Incorrect non-principal lines: The most common error is incorrectly classifying a non-principal line as a principal line. This occurs when gaps between two neighbouring glyphs satisfy the horizontal distance condition. Two examples of such errors are shown below.
√(…) ≤ C Σ_{i=1}^{∞} Σ_{j=1}^{l} …

B̃_{i+1} B̃_i = 2M …,  E[|x + S(l) …
Although these expressions should each be detected as a single line, the limits under the two summation symbols are at a distance that coincides with the distance identified by the histogram for the entire page. Likewise, in the second expression the same problem is caused by the spacing of the tilde accents.
Incorrect principal lines: This is the opposite of the previous error and occurs when a line does not contain any glyph gaps that coincide with the distance measure derived from the histogram. Examples of such lines are those with single words, page numbers, single expressions, etc. Whilst these are often corrected by the height ratio, this is not always the case; sometimes they are missed as they do not satisfy the ratio condition. Below is an example taken from our data-set:
+ 3/5 V_1^{k−1,k,2}(n; (1)).(L_3^{k−1,k,2} − L_{n−3}^{k−1,k,2})
12
Here the page number 12 is merged as a non-principal line into the expression above, as firstly it does not exhibit a glyph gap satisfying the distance measure, and secondly its height is significantly smaller than the height of the open parenthesis in the mathematical expression.
III. FORMULA IDENTIFICATION
For mathematical document analysis, formula identification is a vital step that partitions the input for further analysis by either a text recognition or math recognition engine. Learning-based approaches are being increasingly studied, with Jin et al. [12] exploiting the Parzen Windows technique to identify isolated formulae and Drake et al. [13] adopting computational geometry features of the neighbour graph of connected components resulting from a Voronoi graph analysis. Although these methods show good performance, they are generally used over images of documents rather than digitally born documents, thus ignoring the extra information available from such documents. Recently, learning-based formula identification methods have been proposed specifically for PDF documents [4], [5]. Although they have been shown to perform well through experimentation, they lack a comparison to other methods.

A. Previous Approach

In our previous work, we used a rule-based approach to discriminate isolated formula lines from ordinary text lines automatically [6]. In this method, rules are constructed according to the number of embedded math formulae, which are identified by an LALR parser. Consequently, the performance of both embedded and isolated formula identification is affected by the performance of the LALR parser. In addition, the rules being used are too simplistic to identify a wide range of isolated formulae.

B. Proposed method

In this paper, we integrate the mathematics identification methods proposed in [4], [5], with modifications, into our mathematical formula recognition system Maxtract in order to overcome the following two unsolved problems.

The first problem is that isolated formulae are split or identified only partially due to mis-identified text lines. Previously PPC was used to identify text lines; we have now adopted the line detection method proposed in Section II for more reliable recognition. Additionally, merging of successive isolated formula lines is introduced as a post-processing step for isolated formula identification.

The other unsolved problem was that classification performance was reduced by imbalanced training data for the learning-based formula identification method. In this method, classifiers are trained to identify words and formulae. When training the classifiers, it is inevitable to introduce imbalanced training data, where there are many more negative instances, in this case ordinary text lines or words, than positive instances, i.e. math lines or formulae, since there is usually less mathematics than text in standard documents. To overcome this problem, re-sampling techniques have been adopted to improve the performance of classification, and thereby the overall performance of the learning-based formula identification methods.
The workflow of the proposed mathematical formula identification method is shown in Figure 2. The input text lines are obtained using the method presented in Section II. To identify isolated formulae, a classifier is built to predict whether a text line is an isolated formula or an ordinary text line. To train such a classifier, an unsupervised subsampling technique, which randomly filters negative instances, is first adopted to balance the training data. Then, a feature vector containing nine geometric layout features and one character feature is extracted for each text line, as shown in Table III. Next, the classifier is trained using LibSVM (http://www.csie.ntu.edu.tw/∼cjlin/libsvm/). To decide whether a text line is display math or not, features of the text line are extracted and predicted by the classifier. After that, successive isolated formula lines are merged to form the finalised isolated formulae.
Fig. 2. Workflow of the proposed mathematical formula identification method (modifications compared with previous work are shown in red).
TABLE III. FEATURES OF A TEXT LINE

Geometric layout features:
  AlignCenter   The horizontal distance between the line's center and the column's center.
  LeftSpace     The left space of the text line (normalized by the column's width).
  RightSpace    The right space of the text line (normalized by the column's width).
  AboveSpace    The space between the current text line and the text line above it (normalized by the most commonly seen line space).
  BelowSpace    The space between the current text line and the text line below it (normalized by the most commonly seen line space).
  Height        The line's height (normalized by the main font size of the page).
  SparseRatio   The ratio of the characters' area to the text line's area.
  V-FontSize    The variance of the font size of text objects in the text line.
  I-SerialNo    Whether there is a formula serial number at the end of the text line.
Character feature:
  I-Math        Whether the text line contains any math functions or math symbols.

Detailed definitions of these features can be found in [5].
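The isolated-formula pipeline described above (random subsampling of negative lines, feature extraction, SVM training and prediction, merging of successive formula lines) could be sketched as follows. The sketch uses scikit-learn's SVC, which wraps LIBSVM, as a stand-in for the LibSVM set-up actually used; line_features is a hypothetical function producing the ten features of Table III, and the data structures are simplified assumptions.

import random
import numpy as np
from sklearn.svm import SVC

def undersample(X, y, seed=0):
    # Randomly drop negative (ordinary text) lines until the classes balance.
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    keep = pos + rng.sample(neg, min(len(neg), len(pos)))
    return X[keep], y[keep]

def train_line_classifier(X, y):
    # Train an SVM (LIBSVM via scikit-learn) on the balanced line features.
    Xb, yb = undersample(np.asarray(X), np.asarray(y))
    return SVC(kernel="rbf").fit(Xb, yb)

def isolated_formulae(clf, lines, line_features):
    # Predict every line, then merge runs of successive formula lines.
    labels = clf.predict([line_features(l) for l in lines])
    regions, current = [], []
    for line, is_math in zip(lines, labels):
        if is_math:
            current.append(line)
        elif current:
            regions.append(current)
            current = []
    if current:
        regions.append(current)
    return regions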
To identify embedded formulae, text lines are first segmented into words. Then, a feature vector containing five geometric layout features, five character features and two context features is extracted for each word, as shown in Table IV. Next, a supervised oversampling technique, SMOTE [14], is adopted to generate more positive instances and balance the training data, and a classifier is trained to predict whether a word is an inline math fragment or an ordinary word. To identify inline formulae, words are first predicted by the classifier to be inline math fragments or ordinary text words. Lastly, the embedded formula locations are finalised by merging the adjacent inline math fragments.
TABLE IV. FEATURES OF A WORD

Geometric layout features:
  V-Fontsize    Variance of the font size of the symbols in a word.
  V-Position    Variance of the Y-coordinates of the baselines of the symbols in a word.
  V-Space       Variance of the spaces between the bounding boxes of the symbols in a word.
  V-Width       Variance of the widths of the bounding boxes of the symbols in a word.
  V-Height      Variance of the heights of the bounding boxes of the symbols in a word.
Character features:
  D-Purity      Degree to which the symbols in a word belong to the same type.
  P-Latin       Percentage of Latin characters in a word.
  I-Math        Same as I-Math, defined in Table III.
  T-Leftmost    Type of the leftmost symbol of a word.
  T-Rightmost   Type of the rightmost symbol of a word.
Context features:
  T-LeftAdj     Type of the rightmost symbol of the word preceding the current word.
  T-RightAdj    Type of the leftmost symbol of the word following the current word.

Detailed definitions of these features can be found in [4].
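A corresponding sketch for the word-level classifier, this time balancing the training data with SMOTE as implemented in the imbalanced-learn package (an assumed choice; only the use of SMOTE [14] itself is stated above). The word_features function and the merge of adjacent math fragments are again simplified stand-ins.

import numpy as np
from imblearn.over_sampling import SMOTE   # assumption: SMOTE via imbalanced-learn
from sklearn.svm import SVC

def train_word_classifier(X, y):
    # Oversample the minority class (math fragments) with SMOTE, then train an SVM.
    X_bal, y_bal = SMOTE(random_state=0).fit_resample(np.asarray(X), np.asarray(y))
    return SVC(kernel="rbf").fit(X_bal, y_bal)

def embedded_formulae(clf, words, word_features):
    # Label every word, then merge runs of adjacent inline math fragments.
    labels = clf.predict([word_features(w) for w in words])
    fragments, current = [], []
    for word, is_math in zip(words, labels):
        if is_math:
            current.append(word)
        elif current:
            fragments.append(current)
            current = []
    if current:
        fragments.append(current)
    return fragments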
C. Experimentation

To evaluate the performance of mathematics identification, experiments are carried out on two ground-truth data-sets used in [15] and [2], referred to hereafter as D1 [15] and D2 [2] respectively. D1 [15] contains 184 document pages with 792 display formulae labelled. D2 [2] contains ten pages which include 373 inline and 53 display formulae. Different result types of formula identification are adopted as evaluation metrics, including Correct, Split Regions, Merged Regions, False Positive, and False Negative.

We compare the proposed SVM-based method, with (denoted “S+L”) and without (denoted “SVM”) the new line detection method integrated, against our previous rule-based method [6] on the two data-sets, as shown in Tables V and VI. 1) It can be seen from Table V that on D1 [15] the SVM-based method identifies more correct isolated formulae than the rule-based method [6] by more than 12%. Furthermore, by integrating the text line detection method, about 2% more correct results are identified and split regions are decreased by about 3%. 2) Table VI shows that on D2 [2] the SVM-based method outperforms the rule-based method [6] in both isolated and embedded formula identification. Moreover, after integrating the line detection method, about 30% of split regions are resolved and more than 24% more correct isolated formulae are therefore identified. Some split embedded formulae are also handled, and the correct rate is increased by about 4%. The relatively high missed rate of the proposed SVM-based method is mainly caused by the low recall rate of the word classifier trained on the imbalanced data of embedded formulae.
TABLE V. RESULTS OF ISOLATED FORMULA DETECTION ON D1 [15]

Isolated   Total   Correct   Split    Merged   FalsePos   FalseNeg
[6]        713     59.29%    6.67%    17.14%   16.90%     2.40% (19/792)
SVM        719     71.77%    9.32%    11.27%   7.65%      0.88% (7/792)
S+L        701     73.18%    6.56%    12.41%   7.85%      1.26% (10/792)

TABLE VI. RESULTS OF FORMULA IDENTIFICATION ON D2 [2]

Isolated   Total   Correct   Split    Merged   FalsePos   FalseNeg
[6]        60      41.67%    16.67%   30.00%   11.67%     9.43% (5/53)
SVM        53      54.72%    41.51%   3.77%    0.00%      1.89% (1/53)
S+L        52      78.85%    11.54%   7.69%    1.92%      3.77% (2/53)

Inline     Total   Correct   Split    Merged   FalsePos   FalseNeg
[6]        321     31.15%    16.20%   24.61%   28.04%     5.90% (22/373)
SVM        362     31.22%    22.93%   22.93%   22.93%     19.57% (73/373)
S+L        441     35.60%    19.05%   19.27%   26.08%     10.99% (41/373)
IV. QUALITATIVE COMPARISON
In this section we show how the modifications identified in this paper improve the output of Maxtract in two ways: improving the readability and semantics of the generated LaTeX code and underlying structure, and consequently creating a more accurate reconstruction of the original document. Each of the examples shown is generated from the data-sets mentioned previously.

Table VII shows two examples of how the output of Maxtract has improved after implementing the new techniques. The first column is a screenshot of the original formula, the second column shows the LaTeX and its rendering generated by the original Maxtract, and the third column shows the LaTeX and its rendering generated after the modifications.
TABLE VII. COMPARISON OF PREVIOUSLY GENERATED AND NEW LATEX

Example 1:
  Previous output:
    \begin{align*}
    \mathrm{and} F (-\infty) < \infty.
    \end{align*}
  rendered as: andF (−∞) < ∞.
  New output:
    and $ F (-\infty) < \infty. $
  rendered as: and F (−∞) < ∞.

Example 2:
  Previous output:
    \[\begin{aligned}
    N_{i} = \sum W_{j i} X_{j}.\\
    j
    \end{aligned}\]
  rendered as: Ni = Σ Wji Xj . (with the limit j dropped onto a separate line)
  New output:
    \[N_{i} = \sum_{j} W_{j i} X_{j}.\]
  rendered as: Ni = Σj Wji Xj .
The first example shows how originally, a text line with
an embedded formula was incorrectly recognised entirely as
mathematics, and wrapped in a mathematical align environment. Whilst this only made a small difference to the
rendering, with a different font and lack of space between
F and and, it showed that the semantics of this fragment had
been incorrectly analysed by Maxtract. After the embedded
formula had been correctly identified using the machine learning method, the fragment was correctly recognised as a text
line containing inline math.
The second example shows where a formula with a lower limit was originally incorrectly split into two separate lines by PPC. This error was then propagated through to the formula recognition stage, where it was erroneously treated as a multiline formula. When converted into LaTeX the component was wrapped in an aligned environment, resulting not only in an incorrect structural analysis but also, when rendered, in a formula that is no longer understandable. The histogrammatic approach correctly identified that the limit was in fact a non-principal line and merged it with the previous principal line. This allowed the formula analysis to correctly identify the structure of an equation with a limit.
Other examples of corrections are very similar to those shown in Table VII. None of the modifications to Maxtract resulted in additional errors, either in terms of line or formula identification; however, the recognition is still not perfect, as indicated by the experimental results shown in the previous sections. Furthermore, some of the previously reported layout and segmentation issues in Maxtract, such as touching lines, are not addressed in this paper and still remain [15], [2].
V. CONCLUSIONS
In this paper we revisit Maxtract, a tool for the automatic analysis of mathematical PDF documents, and identify two significant areas of weakness: correctly segmenting text and math lines, and precisely identifying the locations of mathematical formulae. For each of these issues we have proposed and implemented more advanced techniques, then re-run previously reported experiments and in each case reported significant improvements.

To improve line segmentation we implemented a histogram-based statistical approach, increasing the line recognition rate of Maxtract from 86.4% to 96.9%. To solve the issue of formula identification, we used a machine-learning-based approach, improving the number of correctly identified embedded formulae by up to 15% and isolated formulae by up to 95%.

We finally show how these changes have improved the overall output of Maxtract, with comparisons of the generated LaTeX. Improvements are made to the readability and semantics of the generated LaTeX code and underlying structure, and consequently the output is a more accurate reconstruction of the original document. Whilst there are further issues still to be dealt with, in this paper we have proposed and implemented significant improvements to the accuracy of Maxtract.

ACKNOWLEDGEMENT

This work was supported by the co-supervised Ph.D. student scholarship program of the China Scholarship Council (CSC) as well as via a JISC OER Rapid Innovation Grant.
REFERENCES

[1] “Maxtract,” 2013. [Online]. Available: http://www.cs.bham.ac.uk/research/groupings/reasoning/sdag/maxtract.php
[2] J. B. Baker, A. P. Sexton, V. Sorge, and M. Suzuki, “Comparing approaches of mathematical formula recognition from PDF,” in Proc. of ICDAR ’11. IEEE Computer Society, 2011, pp. 463–467.
[3] M. A. I. Alkalai and V. Sorge, “Issues in mathematical table recognition,” in Proc. of CICM ’12, MIR Workshop, 2012.
[4] X. Lin, L. Gao, Z. Tang, X. Lin, and X. Hu, “Mathematical formula identification in PDF documents,” in Proc. of ICDAR ’11. Washington, DC, USA: IEEE Computer Society, 2011, pp. 1419–1423.
[5] X. Lin, L. Gao, Z. Tang, X. Hu, and X. Lin, “Identification of embedded mathematical formulas in PDF documents using SVM,” in Proc. of DRR XIX, 2012, pp. 82970D-1–8.
[6] J. B. Baker, A. P. Sexton, and V. Sorge, “Towards reverse engineering of PDF documents,” in Towards a Digital Mathematics Library. Masaryk University Press, 2011.
[7] U.-V. Marti and H. Bunke, “On the influence of vocabulary size and language models in unconstrained handwritten text recognition,” in Proc. of ICDAR ’01. IEEE Computer Society, 2001, pp. 260–265.
[8] K. Wong, R. Casey, and F. Wahl, “Document analysis system,” IBM Journal of Research and Development, vol. 26, no. 6, pp. 647–656, 1982.
[9] L. O’Gorman, “The document spectrum for page layout analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15, no. 11, pp. 1162–1173, 1993.
[10] P. V. C. Hough, “Method and means for recognizing complex patterns,” 1962. [Online]. Available: www.freepatentsonline.com/3069654.html
[11] R. Zanibbi and D. Blostein, “Recognition and retrieval of mathematical expressions,” IJDAR, vol. 15, no. 4, pp. 331–357, 2012.
[12] J. Jin, X. Han, and Q. Wang, “Mathematical formulas extraction,” in Proc. of ICDAR ’03. IEEE, 2003, pp. 1138–1141.
[13] D. Drake and H. Baird, “Distinguishing mathematics notation from English text using computational geometry,” in Proc. of ICDAR ’05. IEEE, 2005, pp. 1270–1274.
[14] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
[15] J. B. Baker, A. P. Sexton, and V. Sorge, “A linear grammar approach to mathematical formula recognition from PDF,” in Proc. of CICM ’08, MKM, ser. LNAI, vol. 5625. Springer, 2009, pp. 201–216.