2013 12th International Conference on Document Analysis and Recognition

Improving Formula Analysis with Line and Mathematics Identification

Mohamed Alkalai, Josef B. Baker and Volker Sorge
School of Computer Science, University of Birmingham, UK
Email: M.A.Alkalai|J.Baker|[email protected]
URL: www.cs.bham.ac.uk/~maa897|~jbb|~vxs

Xiaoyan Lin
Institute of Computer Science and Technology, Peking University, Beijing, China
Email: [email protected]

Abstract—The explosive growth of the internet and electronic publishing has led to a huge number of scientific documents being available to users. However, they are usually inaccessible to those with visual impairments and often only partially compatible with software and modern hardware such as tablets and e-readers. In this paper we revisit Maxtract, a tool for analysing and converting documents into accessible formats, and combine it with two advanced segmentation techniques: statistical line identification and machine-learning formula identification. We show how these advanced techniques improve the quality of both Maxtract's underlying document analysis and its output. We rerun and compare experimental results over a number of data-sets, presenting a qualitative review of the improved output and drawing conclusions.

Keywords—Math formula recognition, line segmentation, formula identification

I. INTRODUCTION

Through the explosive growth of the internet and electronic publishing, a wealth of scientific documents are now available to users. Furthermore, Adobe has given us a format to share such documents, the ubiquitous Portable Document Format (PDF), with a specification for creating fully structured, accessible documents. However, the majority of PDF authoring software does not offer these options, instead producing files with very limited capabilities. Whilst this does not cause too many issues with standard text-rich files, it does with scientific documents, especially those containing mathematics. Such documents become extremely difficult to view on modern electronic devices such as smart phones, tablets and e-book readers because they do not reflow correctly and formulae cannot be enlarged or copied. Furthermore, for those with visual impairments, the documents become almost completely inaccessible due to incompatibility with screen readers and other accessibility tools.

This has led to a relatively new field of research: the analysis and conversion of native-digital scientific documents. In previous work we have described Maxtract [1], a tool for automatically analysing PDF files through a combination of PDF and image analysis. Whilst it performs extremely well compared to contemporary state-of-the-art OCR-based systems [2], there are still a number of shortcomings that we aim to address in this paper. We combine Maxtract with our previously proposed methods for line segmentation [3] and formula identification [4], [5] to improve both the underlying structure of the recognised document and the visual output.

In Section II we introduce line segmentation and the problems that occur when traditional techniques are applied to documents rich in mathematical notation. We then show how the method previously implemented in Maxtract, projection profile cutting (PPC), performed poorly in certain cases, and present a statistical, histogram-based alternative which dramatically improves accuracy. This is followed by an experimental analysis of the method over a large and diverse data-set in comparison to PPC. Section III reviews formula identification techniques and shows how a combined machine-learning and heuristic technique improves the formula recognition rate of Maxtract, which previously used a solely heuristic approach based on fonts and geometric features. The new approach is evaluated by comparing the experimental results to those originally reported in [6], then combined with the line segmentation technique of Section II and evaluated further. Section IV presents a qualitative evaluation of Maxtract's output, with examples of the LaTeX produced after implementing the aforementioned changes. Finally, Section V summarises the paper, drawing conclusions from our work.
II. LINE SEGMENTATION

Line segmentation is often considered a prerequisite step for document analysis of both printed and handwritten documents, and a wide range of techniques have been proposed for this process, including those based on PPC, run-length smoothing, grouping and Hough transforms [7], [8], [9], [10]. Whilst these techniques have been shown to perform well on regular text documents, they do not suffice when dealing with documents rich in mathematical notation, as many of the assumptions used for text are no longer applicable due to the inherent two-dimensional structure of mathematics. A further review of the segmentation of mathematical documents is given in [11]; however, it is more concerned with the layout of mathematical structures than with their corresponding and surrounding lines.

A. Previous Approach

Maxtract previously used a PPC-based approach to segment each page into lines [2]. Whilst this was a fast and efficient technique, which was modified to deal correctly with fractions and most mathematical accents through a series of heuristics, limits were incorrectly cut from a base expression whenever a horizontal cut could be made. This led to the incorrect analysis of many isolated formulae containing limit structures.

B. A Histogrammatic Approach to Line Segmentation

The basic idea of our approach is to first identify all possible individual lines and then merge them into single lines if there is an indication that they are components of mathematical expressions. Thereby we rely neither on knowledge of the content of lines, nor on font information, nor on vertical distance. Instead we construct a histogram of horizontal spaces between glyphs on each line and use a statistical analysis to determine which lines are to be merged. In addition, we use simple character-height considerations in a post-processing step to re-classify lines that are wrongly marked for merging by the statistical analysis.

The first two steps of the procedure are quite similar to those previously used: the bounding boxes of all glyphs on a page are calculated, and initial lines are determined by separating sets of glyphs with respect to vertical whitespace. We then construct a histogram for each page that captures the horizontal distances between pairs of successive glyphs in each line. We thereby omit elements in a line that overlap horizontally, i.e., where the distance is 0. Figure 1 shows two examples of such histograms, where the x-axis denotes the values of the horizontal distance between pairs of glyphs in pixels, and the y-axis denotes the number of occurrences of a particular distance.

Fig. 1. Examples of pages and their histograms of the gaps between glyphs.
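The histogram construction itself is straightforward. The following Python fragment is a minimal sketch, assuming glyphs are given as (x0, y0, x1, y1) bounding boxes grouped into initial lines and sorted left to right; the function name and data layout are ours, not Maxtract's.

```python
from collections import Counter

def gap_histogram(lines):
    """Per-page histogram of horizontal distances between successive
    glyphs. `lines` is a list of initial lines, each a list of glyph
    bounding boxes (x0, y0, x1, y1) in left-to-right order.
    Horizontally overlapping pairs (distance <= 0) are omitted, as
    described above."""
    hist = Counter()
    for line in lines:
        for left, right in zip(line, line[1:]):
            gap = right[0] - left[2]  # next glyph's x0 minus previous glyph's x1
            if gap > 0:
                hist[gap] += 1
    return hist
```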
We can observe a general pattern in these histograms: they can be split into two parts by a global minimum that lies roughly in the middle of the x-axis. This leaves two parts, each with a global maximum. Furthermore, in the right part one can identify another global minimum. While this can be at the very end of the x-axis, it usually is not. We call these two minimal points v1 and v2, respectively, and use them to classify lines as follows.

We define a line to be a principal line P if there exists a horizontal distance value d between two neighbouring glyphs in the line that satisfies the condition v1 < d < v2; otherwise it is a non-principal line N. Whilst the former constitutes a proper line, the latter should be merged with its closest principal line. The intuition behind this heuristic is that values in the histogram which are less than v1 represent distances between single characters in a word or a mathematical expression, whereas the area between v1 and v2 represents the distance between single words. These spaces generally do not occur in lines that only constitute part of a mathematical formula. While this measure alone already yields good results, it still suffers from occasionally misclassifying lines.

To correct lines wrongly classified as non-principal lines, we use a simple height ratio between glyphs of consecutive principal lines P and non-principal lines N. Let H(N) and H(P) be the maximum height of glyphs in N and P, respectively. If H(N) > H(P)/T1, where 1 < T1 < 2 is a threshold value, we reclassify N as a principal line. The value for T1 is determined empirically by experiments on a small sample set. For the experiments we present in this paper the value was set to 1.7.

To re-classify lines that were wrongly labelled as principal lines, we use a height bound T2 that is the maximum height of all non-principal lines in a page. If a principal line P is of height less than or equal to T2, it is converted to a non-principal line.

Once the classification of lines is completed, we merge non-principal lines with their horizontally closest neighbouring principal line, but only if they overlap horizontally. If a non-principal line has no horizontal overlap with any neighbouring principal line, it is converted to a principal line itself.
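The classification and the two corrections can be sketched as follows. This is an illustration under our own simplifying assumptions, not the Maxtract implementation: v1 is searched for in the middle region of the histogram, v2 in the part to its right, "consecutive" lines are taken to be adjacent in reading order, and the histogram is assumed to contain several distinct gap values.

```python
def split_points(hist):
    """Locate the two minima v1 and v2 described above. Sketch only:
    v1 is the global minimum of the middle half of the histogram,
    v2 the global minimum of the part to the right of v1."""
    xs = sorted(hist)
    counts = [hist[x] for x in xs]
    lo, hi = len(xs) // 4, 3 * len(xs) // 4    # "roughly in the middle"
    i1 = min(range(lo, max(hi, lo + 1)), key=lambda i: counts[i])
    i2 = min(range(i1 + 1, len(xs)), key=lambda i: counts[i],
             default=len(xs) - 1)
    return xs[i1], xs[i2]

def is_principal(line, v1, v2):
    """A line is principal if some glyph gap d satisfies v1 < d < v2,
    i.e. the line contains word-like spacing."""
    return any(v1 < b[0] - a[2] < v2 for a, b in zip(line, line[1:]))

def height(line):
    """Maximum glyph height H(.) of a line."""
    return max(y1 - y0 for (_, y0, _, y1) in line)

def classify(lines, hist, t1=1.7):
    """Classify lines (top-to-bottom order), then apply the
    height-ratio and height-bound corrections of Section II-B."""
    v1, v2 = split_points(hist)
    flags = [is_principal(l, v1, v2) for l in lines]
    hs = [height(l) for l in lines]
    # Height ratio: promote a non-principal line that is nearly as
    # tall as an adjacent principal line, i.e. H(N) > H(P) / T1.
    for i in range(len(lines)):
        if not flags[i]:
            for j in (i - 1, i + 1):
                if 0 <= j < len(lines) and flags[j] and hs[i] > hs[j] / t1:
                    flags[i] = True
                    break
    # Height bound: demote principal lines no taller than the tallest
    # non-principal line on the page (T2).
    non_principal = [h for h, f in zip(hs, flags) if not f]
    if non_principal:
        t2 = max(non_principal)
        flags = [f and h > t2 for f, h in zip(flags, hs)]
    return flags
```

The final merging step is then a single scan over the classified lines, attaching each non-principal line to its nearest principal neighbour whose x-extent overlaps it.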
C. Experimentation

For an initial evaluation of our approach we experimented over a manually ground-truthed data-set containing 5801 lines in 200 pages. The pages were taken from 12 documents comprising a mixture of books and journal articles. Table I presents the experimental results of the original PPC approach, of the histogrammatic approach alone, and of the histogrammatic approach improved first with the height-ratio and then with the height-bound correction. The value T1 for the height ratio was set to 1.7, while the height bound T2 was computed independently for every page in question.

TABLE I. EXPERIMENTAL RESULTS FOR LINE RECOGNITION

  Method        Actual lines   Lines found   Correct lines   Accuracy
  PPC           5801           6987          5015            86.4%
  Histogram     5801           5727          5265            90.7%
  Height Ratio  5801           5910          5587            96.3%
  Height Bound  5801           5863          5625            96.9%

Not surprisingly, PPC results in a larger number of lines and, as there is no subsequent merging, in a relatively low accuracy of 86.4%. Using the histogram alone improves this accuracy, but it generally merges too many lines. This is corrected by the addition of the height ratio, which re-classifies some of the lines incorrectly assumed to be non-principal as principal lines. As a consequence we get a slightly higher number of lines but also a higher accuracy of 96.3%. A further slight improvement in this accuracy, to 96.9%, is obtained using the height bound.

To further examine the robustness of our technique and, in particular, to rule out over-fitting to our original data-set, we experimented with a second, independent and larger data-set, containing 1000 pages composed from more than 60 mathematical papers. We ran our full method over the second data-set and then manually checked each result. Consequently we have done this comparison only for the full classification including both the height-ratio and the height-bound correction. Whilst we cannot rule out some classification mistakes due to human error, we are confident that the experimental results given in Table II are accurate.

TABLE II. EXPERIMENTAL RESULTS OF 1000 PAGES

  Actual lines   Lines found   Correct lines   Accuracy
  34146          34526         33678           98.6%

The experimental results over the second data-set show an increase in recognition rate of approximately 2%. The increase in performance is attributed to a more representative data-set, and certainly gives us confidence about the effectiveness of our technique over a wide range of documents.

The incorrectly identified lines fall into two categories.

Incorrect non-principal lines: The most common error is incorrectly classifying a non-principal line as a principal one. This occurs when gaps between two neighbouring glyphs satisfy the horizontal distance condition. Two examples of such errors are shown below:

\[ \sqrt{\cdots} \;\le\; C \sum_{i=1}^{l} \sum_{j=1}^{\infty} \cdots \qquad \tilde{B}_i + \cdots + \tilde{B}_{i+1}\,\tilde{B}_i = 2M\tilde{B}\; E[|x + S(l)\cdots \]
Although each of these expressions should be detected as a single line, the limits under the two summation symbols are at a distance that coincides with the distance identified by the histogram for the entire page. Likewise, in the second expression the same problem is caused by the spacing of the tilde accents.

Incorrect principal lines: This is the opposite of the previous error and occurs when a line does not contain any glyph gaps that coincide with the distance measure derived from the histogram. Examples of such lines are those with single words, page numbers, single expressions, etc. Whilst these are often corrected by the height ratio, this is not always the case: sometimes they are missed as they do not satisfy the ratio condition. Below is an example taken from our data-set:

\[ \cdots + 35\, V_1^{k-1,k,2}(n;(1)) \cdot \bigl(L_{k-1,k,2} - L_{k-1.k.2}\bigr) \binom{n-3}{3} \]
12

Here the page number 12 is merged as a non-principal line into the expression above, as firstly it does not exhibit a glyph gap satisfying the distance measure and secondly its height is significantly smaller than the height of the open parenthesis in the mathematical expression.

III. FORMULA IDENTIFICATION

For mathematical document analysis, formula identification is a vital step that partitions the input for further analysis by either a text recognition or a math recognition engine. Learning-based approaches are being increasingly studied, with Jin et al. [12] exploiting the Parzen windows technique to identify isolated formulae and Drake et al. [13] adopting computational geometry features of the neighbour graph of connected components resulting from Voronoi graph analysis. Although these methods show good performance, they are generally used over images of documents rather than digitally born documents, thus ignoring the extra information available from such documents. Recently, learning-based formula identification methods have been proposed specifically for PDF documents [4], [5]. Although they have been shown through experimentation to perform well, they lack a comparison to other methods.

A. Previous Approach

In our previous work, we used a rule-based approach to automatically discriminate isolated formula lines from ordinary text lines [6]. In this method, rules are constructed according to the number of embedded math formulae, which are identified by a LALR parser. Consequently, the performance of both embedded and isolated formula identification is affected by the performance of the LALR parser. In addition, the rules used are too simplistic to identify a wide range of isolated formulae.

B. Proposed Method

In this paper, we integrate the mathematics identification methods proposed in [4], [5], with modifications, into our mathematical formula recognition system Maxtract, in order to overcome the following two unsolved problems.

The first problem is that isolated formulae are split or identified only partially due to mis-identified text lines. Previously PPC was used to identify text lines; we have now adopted the line detection method proposed in Section II for more reliable recognition. Additionally, merging of successive isolated formula lines is introduced as a post-processing step for isolated formula identification.

The other unsolved problem is that classification performance was reduced by imbalanced training data for the learning-based formula identification method. In this method, classifiers are trained to identify words and formulae. When training the classifiers, it is inevitable to introduce imbalanced training data, where there are many more negative instances (in this case ordinary text lines or words) than positive instances (i.e. math lines or formulae), since there is usually less mathematics than text in standard documents. To overcome this problem, re-sampling techniques have been adopted to improve the performance of classification, and thus the overall performance of the learning-based formula identification methods.

The workflow of the proposed mathematical formula identification method is shown in Figure 2. The input text lines are obtained using the method presented in Section II. To identify isolated formulae, a classifier is built to predict whether a text line is an isolated formula or an ordinary text line. To train such a classifier, an unsupervised subsampling technique, which randomly filters negative instances, is first adopted to balance the training data. Then, a feature vector containing nine geometric layout features and one character feature is extracted for each text line, as shown in Table III. Next, the classifier is trained using LibSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/). To decide whether a text line is display math or not, the features of the text line are extracted and predicted by the classifier. After that, successive isolated formula lines are merged into the finalised isolated formulae.

Fig. 2. Workflow of the proposed mathematical formula identification method (modifications compared with previous work are in red).

TABLE III. FEATURES OF A TEXT LINE

  Geometric layout features:
    AlignCenter: horizontal distance between the line's centre and the column's centre.
    LeftSpace:   left space of the text line (normalised by the column's width).
    RightSpace:  right space of the text line (normalised by the column's width).
    AboveSpace:  space between the current text line and the text line above it (normalised by the most commonly seen line space).
    BelowSpace:  space between the current text line and the text line below it (normalised by the most commonly seen line space).
    Height:      the line's height (normalised by the main font size of the page).
    SparseRatio: ratio of the characters' area to the text line's area.
    V-FontSize:  variance of the font size of text objects in the text line.
    I-SerialNo:  whether there is a formula serial number at the end of the text line.
  Character feature:
    I-Math:      whether the text line contains any math functions or math symbols.

  Detailed definitions of these features can be found in [5].
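The line-level training pipeline can be pictured with the following sketch. We use scikit-learn's SVC, which is backed by LibSVM, as a stand-in for the LibSVM binaries used here; the feature extraction, the labels, and the default kernel are our own assumptions for illustration.

```python
import random
from sklearn.svm import SVC

def subsample_negatives(X, y, seed=0):
    """Unsupervised subsampling: randomly drop negative instances
    (ordinary text lines, label 0) until both classes have the same
    size, balancing the training data."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    random.Random(seed).shuffle(neg)
    keep = sorted(pos + neg[:len(pos)])
    return [X[i] for i in keep], [y[i] for i in keep]

def train_line_classifier(X, y):
    """Train the isolated-formula line classifier. X holds one
    10-dimensional feature vector per text line (Table III)."""
    Xb, yb = subsample_negatives(X, y)
    clf = SVC()          # LibSVM-backed; kernel and parameters assumed
    clf.fit(Xb, yb)
    return clf

def merge_isolated(lines, predictions):
    """Post-processing: runs of successive lines predicted to be
    isolated formulae (label 1) become single formula regions."""
    regions, current = [], []
    for line, p in zip(lines, predictions):
        if p == 1:
            current.append(line)
        else:
            if current:
                regions.append(current)
            current = []
    if current:
        regions.append(current)
    return regions
```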
To identify embedded formulae, text lines are first segmented into words. Then, a feature vector containing five geometric layout features, five character features and two context features is extracted for each word, as shown in Table IV. Next, a classifier is trained to predict whether a word is an inline math fragment or an ordinary word. Here, a supervised oversampling technique, SMOTE [14], is adopted to generate additional positive instances, balancing the training data. To identify inline formulae, words are first predicted by the classifier to be inline math fragments or ordinary text words. Lastly, the embedded formula locations are finalised by merging the adjacent inline math fragments.

TABLE IV. FEATURES OF A WORD

  Geometric layout features:
    V-Fontsize:  variance of the font size of the symbols in a word.
    V-Position:  variance of the y-coordinates of the baselines of the symbols in a word.
    V-Space:     variance of the spacing of the bounding boxes of the symbols in a word.
    V-Width:     variance of the width of the bounding boxes of the symbols in a word.
    V-Height:    variance of the height of the bounding boxes of the symbols in a word.
  Character features:
    D-Purity:    degree to which the symbols in a word belong to the same type.
    P-Latin:     percentage of Latin characters in a word.
    I-Math:      same as I-Math defined in Table III.
    T-Leftmost:  type of the leftmost symbol of a word.
    T-Rightmost: type of the rightmost symbol of a word.
  Context features:
    T-LeftAdj:   type of the rightmost symbol of the word preceding the current word.
    T-RightAdj:  type of the leftmost symbol of the word following the current word.

  Detailed definitions of these features can be found in [4].
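The word-level pipeline differs from the line-level one only in its features and in the re-sampling direction. A sketch, assuming the SMOTE implementation from the imbalanced-learn package as a stand-in for [14] and again scikit-learn's LibSVM-backed SVC; merging the predicted fragments follows the same run-merging pattern as merge_isolated above.

```python
from imblearn.over_sampling import SMOTE   # implementation of [14]
from sklearn.svm import SVC

def train_word_classifier(X, y):
    """Train the inline-math word classifier. X holds one
    12-dimensional feature vector per word (Table IV). SMOTE
    synthesises extra positive (math, label 1) instances so the two
    classes are balanced before training; this assumes enough
    positive samples exist for SMOTE's nearest-neighbour step."""
    X_bal, y_bal = SMOTE().fit_resample(X, y)
    clf = SVC()          # LibSVM-backed; kernel and parameters assumed
    clf.fit(X_bal, y_bal)
    return clf
```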
C. Experimentation

To evaluate the performance of mathematics identification, experiments are carried out on two ground-truth data-sets, used in [15] and [2] and hereafter referred to as D1 [15] and D2 [2], respectively. D1 [15] contains 184 document pages with 792 display formulae labelled. D2 [2] contains ten pages which include 373 inline and 53 display formulae. Different result types of formula identification are adopted as evaluation metrics, namely Correct, Split Regions, Merged Regions, False Positive, and False Negative.

We compare the proposed SVM-based method, with (denoted "S+L") and without (denoted "SVM") the new line detection method integrated, to our previous rule-based method [6] on the two data-sets, as shown in Tables V and VI.

TABLE V. RESULTS OF ISOLATED FORMULA DETECTION ON D1 [15]

  Method  Total  Correct  Split   Merged  FalsePos  FalseNeg
  [6]     713    59.29%   6.67%   17.14%  16.90%    2.40% (19/792)
  SVM     719    71.77%   9.32%   11.27%  7.65%     0.88% (7/792)
  S+L     701    73.18%   6.56%   12.41%  7.85%     1.26% (10/792)

TABLE VI. RESULTS OF FORMULA IDENTIFICATION ON D2 [2]

  Isolated:
  Method  Total  Correct  Split   Merged  FalsePos  FalseNeg
  [6]     60     41.67%   16.67%  30.00%  11.67%    9.43% (5/53)
  SVM     53     54.72%   41.51%  3.77%   0.00%     1.89% (1/53)
  S+L     52     78.85%   11.54%  7.69%   1.92%     3.77% (2/53)

  Inline:
  Method  Total  Correct  Split   Merged  FalsePos  FalseNeg
  [6]     321    31.15%   16.20%  24.61%  28.04%    5.90% (22/373)
  SVM     362    31.22%   22.93%  22.93%  22.93%    19.57% (73/373)
  S+L     441    35.60%   19.05%  19.27%  26.08%    10.99% (41/373)

1) It is seen from Table V that on D1 [15] the SVM-based method identifies more than 12% more correct isolated formulae than the rule-based method [6]. Furthermore, by integrating the text line detection method, about 2% more correct results are identified and split regions are decreased by about 3%.

2) It is shown in Table VI that on D2 [2] the SVM-based method outperforms the rule-based method [6] in both isolated and embedded formula identification. Moreover, after integrating the line detection method, about 30% of split regions are overcome and more than 24% more correct isolated formulae are therefore identified. Also, some split embedded formulae are resolved and the correct rate is increased by about 4%. The relatively high miss rate of the proposed SVM-based method is mainly caused by the low recall of the word classifier trained on the imbalanced data of embedded formulae.

IV. QUALITATIVE COMPARISON

In this section we show how the modifications identified in this paper improve the output of Maxtract in two ways: improving the readability and semantics of the generated LaTeX code and underlying structure, and consequentially creating a more accurate reconstruction of the original document. Each of the examples shown is generated from the data-sets mentioned previously.

Table VII shows two examples of how the output of Maxtract has improved after implementing the new techniques. The first column is a screenshot of the original formula, the second column shows the LaTeX and its rendering generated by the original Maxtract, and the third column shows the LaTeX and its rendering generated after the modifications.

TABLE VII. COMPARISON OF PREVIOUSLY GENERATED AND NEW LATEX

  Example 1
    Original image:  and F(-\infty) < \infty.
    Previous output: \begin{align*} \mathrm{and} F (-\infty) < \infty. \end{align*}
                     rendered: andF(-\infty) < \infty.
    New output:      and $ F (-\infty) < \infty. $
                     rendered: and F(-\infty) < \infty.

  Example 2
    Original image:  N_i = \sum_j W_{ji} X_j.
    Previous output: \[\begin{aligned} N_{i} = \sum W_{j i} X_{j}.\\ j \end{aligned}\]
                     rendered with the summation index j on a separate line
    New output:      \[N_{i} = \sum_{j} W_{j i} X_{j}.\]
                     rendered: N_i = \sum_j W_{ji} X_j.

The first example shows how, originally, a text line with an embedded formula was incorrectly recognised entirely as mathematics and wrapped in a mathematical align environment. Whilst this only made a small difference to the rendering, with a different font and a lack of space between F and "and", it showed that the semantics of this fragment had been incorrectly analysed by Maxtract. After the embedded formula had been correctly identified using the machine-learning method, the fragment was correctly recognised as a text line containing inline math.

The second example shows where a formula with a lower limit was originally incorrectly split into two separate lines by PPC. This error was then propagated through to the formula recognition stage, where it was erroneously treated as a multi-line formula. When converted into LaTeX, the component was wrapped in an aligned environment, resulting not only in an incorrect structural analysis but also, when rendered, in a formula that is no longer understandable. The histogrammatic approach correctly identified that the limit was in fact a non-principal line and merged it with the previous principal line. This allowed the formula analysis to correctly identify the structure of an equation with a limit.

Other examples of corrections are very similar to those shown in Table VII. None of the modifications to Maxtract resulted in additional errors, either in terms of line or formula identification; however, the recognition is still not perfect, as indicated by the experimental results shown in the previous sections. Furthermore, some of the reported layout and segmentation issues in Maxtract, such as touching lines, are not addressed in this paper and still remain [15], [2].
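The effect of the classification on the generated LaTeX, as seen in Table VII, can be pictured with a small emitter. This is purely illustrative and not Maxtract's code generator; the function and its behaviour are a simplification of ours.

```python
def line_to_latex(words, is_math):
    """Emit LaTeX for a recognised line. A line that is entirely math
    becomes display math; otherwise math words are wrapped in $...$
    and text words are left alone (cf. Table VII, example 1)."""
    if all(is_math):
        return r"\[" + " ".join(words) + r"\]"
    return " ".join(f"${w}$" if m else w for w, m in zip(words, is_math))

# With "and" now classified as text, the line is no longer forced
# into a display environment:
print(line_to_latex(["and", r"F(-\infty) < \infty."], [False, True]))
# -> and $F(-\infty) < \infty.$
```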
V. CONCLUSIONS

In this paper we revisit Maxtract, a tool for the automatic analysis of mathematical PDF documents, and identify two significant areas of weakness: correctly segmenting text and math lines, and precisely identifying the locations of mathematical formulae. For each of these issues we have proposed and implemented more advanced techniques, then rerun previously reported experiments, and in each case reported significant improvements.

To improve line segmentation we implemented a histogram-based statistical approach, increasing the line recognition rate of Maxtract from 86.4% to 96.9%. To solve the issue of formula identification, we used a machine-learning based approach, improving the number of correctly identified embedded formulae by up to 15% and isolated formulae by up to 95%.

We finally show how these changes have improved the overall output of Maxtract, with comparisons of the generated LaTeX. Improvements are made to the readability and semantics of the generated LaTeX code and underlying structure, and consequentially the output is a more accurate reconstruction of the original document. Whilst there are further issues still to be dealt with, in this paper we have proposed and implemented significant improvements to the accuracy of Maxtract.

ACKNOWLEDGEMENT

This work was supported by the co-supervised Ph.D. student scholarship program of the China Scholarship Council (CSC) as well as via a JISC OER Rapid Innovation Grant.

REFERENCES

[1] "Maxtract," 2013. [Online]. Available: http://www.cs.bham.ac.uk/research/groupings/reasoning/sdag/maxtract.php
[2] J. B. Baker, A. P. Sexton, V. Sorge, and M. Suzuki, "Comparing approaches of mathematical formula recognition from PDF," in Proc. of ICDAR '11. IEEE Computer Society, 2011, pp. 463–467.
[3] M. A. I. Alkalai and V. Sorge, "Issues in mathematical table recognition," in Proc. of CICM '12, MIR Workshop, 2012.
[4] X. Lin, L. Gao, Z. Tang, X. Lin, and X. Hu, "Mathematical formula identification in PDF documents," in Proc. of ICDAR '11. Washington, DC, USA: IEEE Computer Society, 2011, pp. 1419–1423.
[5] X. Lin, L. Gao, Z. Tang, X. Hu, and X. Lin, "Identification of embedded mathematical formulas in PDF documents using SVM," in Proc. of DRR XIX, 2012, pp. 82970D 1–8.
[6] J. B. Baker, A. P. Sexton, and V. Sorge, "Towards reverse engineering of PDF documents," in Towards a Digital Mathematics Library. Masaryk University Press, 2011.
[7] U.-V. Marti and H. Bunke, "On the influence of vocabulary size and language models in unconstrained handwritten text recognition," in Proc. of ICDAR '01. IEEE Computer Society, 2001, pp. 260–265.
[8] K. Wong, R. Casey, and F. Wahl, "Document analysis system," IBM Journal of Research and Development, vol. 26, no. 6, pp. 647–656, 1982.
[9] L. O'Gorman, "The document spectrum for page layout analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15, no. 11, pp. 1162–1173, 1993.
[10] P. V. C. Hough, "Method and means for recognizing complex patterns," 1962. [Online]. Available: www.freepatentsonline.com/3069654.html
[11] R. Zanibbi and D. Blostein, "Recognition and retrieval of mathematical expressions," IJDAR, vol. 15, no. 4, pp. 331–357, 2012.
[12] J. Jin, X. Han, and Q. Wang, "Mathematical formulas extraction," in Proc. of ICDAR '03. IEEE, 2003, pp. 1138–1141.
[13] D. Drake and H. Baird, "Distinguishing mathematics notation from English text using computational geometry," in Proc. of ICDAR '05. IEEE, 2005, pp. 1270–1274.
[14] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
[15] J. B. Baker, A. P. Sexton, and V. Sorge, "A linear grammar approach to mathematical formula recognition from PDF," in Proc. of CICM '08, MKM, ser. LNAI, vol. 5625. Springer, 2009, pp. 201–216.