Chapter-4 Segmentation of Superscript and Subscript Characters within Offline Handwritten Mathematical Expression

Contents
4.1 Introduction
4.2 Overview of the proposed system
  4.2.1 Image Acquisition
  4.2.2 Data Set
  4.2.3 ME Pre-processing
    4.2.3.1 Algorithm used for Binarization of offline HME
  4.2.4 Segmentation Technique for superscript and subscript characters within offline Handwritten Mathematical Expressions
    4.2.4.1 Proposed algorithm Developed for Segmentation with respect to superscript, subscript and main character positions: SANSEG (image)
    4.2.4.2 Experimental Implementation of Segmentation Algorithm
  4.2.5 Symmetric Density Based Feature Extraction
  4.2.6 K-NN Classification
    4.2.6.1 The Nearest-Neighbor algorithm
    4.2.6.2 The k-Nearest-Neighbor algorithm
    4.2.6.3 The value of k
    4.2.6.4 Advantages of K-NN classifier
    4.2.6.5 Steps Performed in Classification
  4.2.7 Performance Evaluation and Recognition of Offline HME with respect to their superscript and subscript characters
    4.2.7.1 Ambiguities observed in Confusion matrix
4.3 Results and Discussion
4.4 Applications of the Study
4.5 Conclusion
References

The proposed segmentation algorithm in Section 4.2.4.1 has been published in International Journal of Technology (IJTech), Volume 6, Number 3, ISSN 2086-9614, pp. 336-347, July 2015, Indonesia (Scopus Indexed).

Ph.D.
Thesis on Segmentation and Recognition of Handwritten Mathematical Expression Page 82
Chapter 4: Segmentation of Superscript and Subscript Characters within Offline Handwritten Mathematical Expression

4.1 Introduction
This chapter deals with Conception 1: segmentation with respect to the superscripts and subscripts of a mathematical expression improves recognition accuracy and supports the reconstruction of Mathematical Expressions (ME). Recognition of handwritten mathematical expressions has been an important research topic for decades, and it remains one of the most challenging areas in pattern recognition. In the recognition of offline handwritten mathematical expressions, segmentation is the most important process, and resolving superscript and subscript ambiguities in complex offline expressions remains one of its main open problems. To the best of our knowledge, little research has addressed the segmentation of complex offline handwritten mathematical expressions with respect to superscripts and subscripts. In this chapter, an efficient segmentation technique for superscript, subscript and main characters within offline handwritten mathematical expressions is proposed. The technique generates predictions for the superscript, subscript and main-character positions, which helps to reconstruct the expression, with its spatial interrelationships, during the recognition process. The chapter therefore addresses how such segmentation improves recognition accuracy and supports the reconstruction of mathematical expressions.
Today's classification techniques achieve good recognition rates, but the ambiguities that arise among the characters (numerals, symbols and letters) of a mathematical expression are caused by incorrect segmentation and feature extraction. A better segmentation technique therefore improves classification accuracy. In this chapter, we propose a novel segmentation technique for identifying superscript and subscript characters within offline handwritten mathematical expressions. With reference to the literature review of Chapter 2, three issues with existing segmentation techniques are addressed in this chapter. The first is that existing segmentation techniques do not support unconstrained handwriting. The second relates to slant correction: most existing techniques require slant correction to be performed. The third is that existing segmentation techniques do not support reconstruction of the mathematical expression. The proposed segmentation technique is evaluated in an experiment on a database of 400 scanned mathematical expressions comprising 8,000 symbols of 28 different types (numerals, operators and letters). The elements were classified with the K-NN classification algorithm using symmetric density features.

4.2 Overview of the proposed system
Figure 4.1 shows the architecture of the proposed system, which consists of Image Acquisition, Pre-processing, Segmentation, Feature Extraction, Classification, and Recognition.
[Figure 4.1: Overview of Proposed Recognition system for Offline HME. Blocks: Database of ME; Module 1: Image Acquisition and Pre-processing; Module 2: Segmentation; Module 3: Feature Extraction; Module 4: Classification (K-NN, SVM); Recognition.]

4.2.1 Image Acquisition
Each handwritten ME was scanned at 600 dpi and saved in either .png or .bmp format. The scanned offline HME, typically a grey scale image, is the input to the pre-processing step. No constraints on the type of ink, the size of the characters or a superscript/subscript frame are imposed on the writer, who writes normally in his or her own style.

4.2.2 Data Set
Data were collected from 150 different writers. A total of 400 mathematical expressions were obtained, each consisting of 5 to 16 isolated mathematical characters (numerals, alphabets, operators, parentheses). Refer to Appendix-3 of this thesis for the database collected for this recognition process; some of the equations are shown in Figure 4.2.

[Figure 4.2: Types of ME used for Experiment]

From these MEs, which cover the possible combinations of superscripts and subscripts, a total of 28 types of mathematical symbols are extracted. Table 4.1 lists the symbol types obtained from the mathematical expressions collected for this experiment.

Table 4.1: Sample types of ME symbols for the experiment
Type of Symbol | Samples used From ME
Digits (4) | 0, 1, 2, 9
Alphabets (3) | X, Y, Z
Operators (17) | +, -, ±, =, *, ≤, ≥, ≠, ×, ., /, >, <, ∝, , , ∑
Parenthesis (4) | [ ] ( )

4.2.3 ME Pre-processing
Conversion of paper-based scanned Mathematical Expressions (ME) into electronic images is an important process in the recognition system.
Pre-processing techniques enhance the contrast of the image, remove noise and isolate the components of the mathematical expression that are of interest for further processing. Pre-processing reduces undesirable variability that can affect the recognition process. Here the pre-processing steps are image acquisition, RGB to grayscale conversion, binarization, noise removal and morphological operations (thinning and dilation), which put the mathematical expressions in a suitable format for the segmentation process.

Algorithm for Pre-Processing image:
1. The scanned image is in .png or .bmp format.
2. Noise removal using a median filter.
3. Conversion of the RGB image to a grayscale image, and conversion of the grayscale image into a binary image using Otsu's method.
4. Apply morphological operations.

In this experiment, mathematical expressions containing superscript and subscript characters are studied. Figure 4.3 shows sample images for three categories: superscript only, superscript and subscript, and subscript only.

[Figure 4.3: Input expressions for pre-processing. (a) ME with superscript characters, (b) ME with superscript and subscript characters, (c) ME with subscript characters]

Noise removal operations on gray-level images are used for blurring and smoothing. Blurring removes small details from an image. In binary images, smoothing straightens the edges of the characters, which includes filling small gaps. Filtering is a neighbourhood operation used to remove noise and smooth the input image.
In filtering, the value of any given pixel in the output image is determined by applying a filtering algorithm to the values of the pixels in the neighbourhood of the corresponding input pixel, where a neighbourhood is a set of pixels defined by their locations relative to that pixel. In simple linear filtering, the value of an output pixel is a linear combination of the values of the pixels in the input pixel's neighbourhood. The median filter used here is a non-linear neighbourhood filter with a low-pass effect: it replaces each pixel by the median of its neighbourhood, which removes small pieces of noise (salt-and-pepper noise) in the scanned mathematical expression and can also blur the image to remove unwanted details.

[Figure 4.4: Output of pre-processing steps such as conversion to gray scale, noise removal, thinning and dilation. (a) ME with superscript characters, (b) ME with superscript and subscript characters, (c) ME with subscript characters]

Binarization is the process of converting the image into a binary image, in which each pixel has the value 0 or 1. It uses two colours, black and white, as foreground and background; such an image is also referred to as bi-tonal. Thresholding is carried out by Otsu's method (RE, R. G., 2002). Otsu's algorithm assumes that an image has two classes of pixels, foreground and background, and calculates the optimum threshold that separates these two classes. It is based on a bi-modal histogram and minimizes the intra-class variance, defined as the weighted sum of the variances of the two classes, given by equation (1).

\sigma_w^2(t) = w_0(t)\,\sigma_0^2(t) + w_1(t)\,\sigma_1^2(t)    (1)

In this:
w_i = probabilities of the two classes separated by the threshold
t = threshold
\sigma_i^2 = variances of these classes
Minimizing the intra-class variance is the same as maximizing the inter-class variance, which is given in equation (2).

\sigma_b^2(t) = \sigma^2 - \sigma_w^2(t) = w_0(\mu_0 - \mu_T)^2 + w_1(\mu_1 - \mu_T)^2 = w_0(t)\,w_1(t)\,[\mu_0(t) - \mu_1(t)]^2    (2)

In this:
w_i = probabilities of the two classes separated by the threshold
t = threshold
\sigma^2 = total variance of the image
\mu_i = class means
\mu_T = mean of the whole image

The class means can be computed iteratively. The following algorithm shows the steps for threshold identification.

4.2.3.1 Algorithm used for Binarization of offline HME
1. Compute the histogram and the probabilities of each intensity level.
2. Set up initial w_i(0) and \mu_i(0).
3. Step through all possible thresholds t = 1, ..., maximum intensity:
   i. Update w_i and \mu_i.
   ii. Compute \sigma_b^2(t).
4. The desired threshold corresponds to the maximum \sigma_b^2(t).

The binary images obtained from the binarization process are shown in Figure 4.5.

[Figure 4.5: Binarized images of ME]

In the normalization process, the image is mapped onto a standard plane of predefined size so that it has a fixed dimensionality for classification. The goal is to reduce the within-class variation of the characters or digits before feature extraction, which in turn increases classification accuracy. After thinning and binarization, the image is normalized. An ME image is normalized to a fixed-size window of (100 X 400) pixels, chosen according to the size of the handwritten mathematical expressions collected from the writers.
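As an illustration of the threshold search in Section 4.2.3.1, the following Python sketch computes the Otsu threshold from a histogram by maximizing the between-class variance of equation (2). It is a minimal sketch and not the implementation used in this work: the function names otsu_threshold and binarize, the 256 grey levels, and the convention that dark ink (values at or below the threshold) becomes foreground 1 are assumptions made for the example.

```python
def otsu_threshold(pixels, levels=256):
    """Return the grey level t that maximizes the between-class variance.

    `pixels` is a flat iterable of integer grey levels in [0, levels - 1].
    Pixels with value <= t form class 0, the remaining pixels form class 1.
    """
    # Step 1: histogram of intensity levels.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    if total == 0:
        raise ValueError("empty image")
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    count0 = 0  # Step 2: running pixel count of class 0
    sum0 = 0    # running sum of grey levels of class 0
    # Step 3: step through all candidate thresholds.
    for t in range(levels):
        count0 += hist[t]
        sum0 += t * hist[t]
        count1 = total - count0
        if count0 == 0 or count1 == 0:
            continue  # one class is empty, variance undefined
        w0 = count0 / total
        w1 = count1 / total
        mu0 = sum0 / count0
        mu1 = (sum_all - sum0) / count1
        var_between = w0 * w1 * (mu0 - mu1) ** 2  # equation (2)
        # Step 4: keep the first threshold with the largest variance.
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def binarize(image, threshold):
    """Map a 2-D list of grey levels to 0/1.

    Assumption: ink is darker than paper, so values <= threshold become 1.
    """
    return [[1 if p <= threshold else 0 for p in row] for row in image]
```

For a scanned page, the flat pixel list would be obtained by concatenating the image rows; the running sums make the search a single pass over the histogram rather than a recomputation of the class statistics for every t.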
Normalization is carried out in two steps: the first normalizes the image containing the complete mathematical expression, and the second normalizes the individual segments after the segmentation process.

4.2.4 Segmentation Technique for superscript and subscript characters within offline Handwritten Mathematical Expressions
Segmentation is a key technique in recognizing the characters within a mathematical expression. It splits the scanned mathematical expression into sub-images, that is, into individual characters, which are then given as input to the rest of the recognition process (Lu, C., Mohan, K., 2013). Segmentation of a complex ME is challenging because of unconstrained handwriting, overlapping and touching components, different character sizes, varied skew angles of characters and the need to identify the spatial relations of symbols within the expression, which is one of the critical issues (Simistira et al., 2014). As noted in the literature review of Chapter 2, existing segmentation techniques are applied to isolated simple mathematical expressions (Surendra et al., 2012; Sanjay S. Gharde et al., 2013; RE, R. G., 2002), and most of them target offline printed mathematical expressions. A technique for complex offline handwritten mathematical expressions is therefore needed. In view of this, a segmentation algorithm was developed that segments unconstrained isolated handwritten mathematical expressions according to their superscript, subscript and main character positions. The pre-processed input image is segmented into isolated characters, and these isolated characters are labelled using a labelling process. In this thesis, the proposed algorithm is named SANSEG (image).
The proposed segmentation algorithm predicts the positions of the components for use in the reconstruction phase. It also provides the number of components within the handwritten mathematical expression. A dynamic label matrix stores the labels of the components within the ME; it consists of 3 rows and n columns, where the value of n depends on the number of components within the ME. The characters predicted by the recognition process are mapped to the label matrix to identify the appropriate position of each component within the handwritten mathematical expression. This label matrix is useful for reconstructing the mathematical expression. With reference to the literature review of Chapter 2, three challenges of existing segmentation approaches that were not addressed by previous research are listed below, discussed in brief, and resolved by the proposed algorithm.
1. They do not support unconstrained handwritten expressions in terms of superscript and subscript components.
2. Slant correction needs to be performed.
3. The segmentation process does not take care of the reconstruction phase in terms of ME component positions such as superscripts, subscripts and main characters.

The first issue is that existing segmentation techniques do not support unconstrained handwritten mathematical expressions in terms of superscript and subscript characters. These techniques impose constraints on the handwritten expression: in earlier techniques, a frame window is given to the writer in which to write the mathematical expression.
The writer has to write within the given fixed-size window. In the proposed segmentation algorithm, there is no such constraint: writers are allowed to write the mathematical expressions in their own style.

The second issue concerns slant correction. Existing segmentation techniques require slant correction before the segmentation process. The proposed algorithm does not require slant correction at the pre-processing stage.

The third issue relates to the reconstruction stage of offline handwritten mathematical expressions. Existing segmentation techniques do not focus on the reconstruction logic of the mathematical expression. Identifying the superscript and subscript positions of the characters after recognition is a challenging task, because the mathematical expression needs to be reconstructed according to its original structure. The proposed segmentation algorithm provides the logic for reconstruction, which helps the recognition system to reconstruct the mathematical expression after the characters within it have been recognized.

4.2.4.1 Proposed algorithm Developed for Segmentation with respect to superscript, subscript and main character positions: SANSEG (image)

INPUT
Pre-processed handwritten mathematical expression, which may contain superscript and subscript components.

PROCESS
1. Apply the bounding box, vertical segmentation technique to the ME.
2. Compute the midpoint, top-left, top-right, bottom-left and bottom-right locations of each component.
3. Scan the first component, treat it as the main character, go to step (5).
4. Scan the second component, and do
   If (top_left (second_component) < top_right (first_component)) && (bottom_left (second_component) <= mid_point (first_component))
   Then second_component = superscript
   Go to Step (5)
   Else If (top_left (second_component) >= mid_point (first_component)) && (bottom_left (second_component) > bottom_right (first_component))
   Then second_component = subscript
   Go to Step (5)
   Else If (top_right (second_component) < mid_point (first_component)) && (bottom_left (second_component) > mid_point (second_component))
   Then second_component = main character
   Go to Step (5)
5. Locate the sequence of symbols in the 3×n matrix at the appropriate positions for main character, superscript and subscript, with label numbers.
6. Repeat step (4) till the end of the expression.

OUTPUT
3×n matrix with the appropriate locations of the superscript, subscript and main characters within the HME. The segmented characters are the sub-images within the handwritten mathematical expression.

4.2.4.2 Experimental Implementation of Segmentation Algorithm

Input Image
[Figure 4.6: Input Image with superscript characters]

Detection of Components
[Figure 4.7: Segmentation using Bounding Box]

Table 4.2: Computation of components within ME
Character | Bounding box coordinates [top-left, bottom-left, top-right, bottom-right] | Position calculated as minimum and maximum value for bounding box | Mid-point for each character
X | [25.5000 17.5000 54 57] | Min=18, Max=74 | mid = 28
2 | [83.5000 0.5000 60 32] | Min=1, Max=32 | mid = 15.5000
> | [165.5000 29.5000 52 49] | Min=30, Max=78 | mid = 24
Y | [275.5000 20.5000 52 60] | Min=21, Max=80 | mid = 29.5000
2 | [336.5000 2.5000 47 21] | Min=3, Max=23 | mid = 10

Table 4.3: Conditions for Extracting Segments (conditions used in the algorithm)
Type | Condition
Main character | If top_right (second_component) < mid_point (first_component) && bottom_left (second_component) > mid_point (second_component)
Superscript | If (top_left (second_component) < top_right (first_component)) && (bottom_left (second_component) <= mid_point (first_component))
Subscript | If top_left (second_component) >= mid_point (first_component) && bottom_left (second_component) > bottom_right (first_component)

Table 4.4: Output of Algorithm shows position of superscript, subscript (output 3 X N matrix)
Row | 1st component | 2nd component | 3rd component | 4th component | 5th component
Superscript | <48x48 double> | <48x48 logical> | <48x48 double> | <48x48 double> | <48x48 logical>
Main character | <48x48 logical> | <48x48 double> | <48x48 logical> | <48x48 logical> | <48x48 double>
Subscript | <48x48 double> | <48x48 double> | <48x48 double> | <48x48 double> | <48x48 double>

Here, <48 X 48 logical> represents the existence of a character within the ME. In Table 4.4, the first row indicates the location of the superscript characters, the second row the location of the main characters, and the third row the location of the subscripts within the mathematical expression. These positions are used to predict the number of superscript, subscript and main characters within the mathematical expression, which is used in the reconstruction phase after the expression has been recognized. A pre-processed binary image is resized to (100 X 400) pixels as the input to the segmentation process. A bounding box technique is used to segment the characters within the ME. The segmented components are resized to (48 X 48) pixels.
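The positional rules of SANSEG can be illustrated with a simplified Python sketch. It assumes that each component is reduced to its top and bottom row (the Min and Max values of Table 4.2), that rows increase downwards, that the vertical centre (top + bottom) / 2 of the most recent main character serves as the mid-point, and that each new component is compared with that main character. These simplifications and the names classify_components and label_matrix are assumptions for illustration; the sketch does not reproduce the exact conditions of Table 4.3.

```python
def classify_components(boxes):
    """Label each component as 'super', 'main' or 'sub'.

    `boxes` is a left-to-right list of (top, bottom) row coordinates,
    with row numbers increasing downwards (assumption of this sketch).
    """
    labels = []
    ref_top = ref_bottom = None  # extent of the most recent main character
    for top, bottom in boxes:
        if ref_top is None:
            label = "main"  # step 3: the first component is a main character
        else:
            centre = (ref_top + ref_bottom) / 2
            if bottom <= centre:
                label = "super"  # lies entirely above the reference centre
            elif top >= centre and bottom > ref_bottom:
                label = "sub"    # starts below the centre and hangs lower
            else:
                label = "main"
        if label == "main":
            ref_top, ref_bottom = top, bottom
        labels.append(label)
    return labels


def label_matrix(labels):
    """Build the 3 x n label matrix (rows: superscript, main, subscript).

    A cell holds the 1-based label number of the component placed there
    and 0 where no component is present.
    """
    row_of = {"super": 0, "main": 1, "sub": 2}
    matrix = [[0] * len(labels) for _ in range(3)]
    for col, label in enumerate(labels):
        matrix[row_of[label]][col] = col + 1
    return matrix
```

With the Min and Max values of Table 4.2 as input, this sketch labels the five components as main, superscript, main, main, superscript, which agrees with the occupied cells of Table 4.4.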
The algorithm's prediction of superscript, subscript and main characters constructs a 3×N matrix that holds the structure of the elements within the mathematical expression, as shown in Table 4.4. This matrix is dynamic: it can grow in columns depending on the number of characters within the mathematical expression. The overall segmentation accuracy achieved using the proposed segmentation algorithm is given in Table 4.5.

Table 4.5: Segmentation Accuracy
No. of MEs | Total number of MEs segmented | Segmentation accuracy
400 | 382 | 95.50%

4.2.5 Symmetric Density Based Feature Extraction
The selection of appropriate features is an important step in pattern recognition. Features are the information retrieved from an image, and the accuracy of the recognition process depends on the type of information retrieved. A feature extraction technique is therefore needed that extracts efficient and accurate information from the image of the character to be recognized. Here, a symmetric density based feature extraction technique is proposed. The technique works on the segmented components of the mathematical expression, each of which is treated as a binary image normalized to a nominal size of (48×48) pixels. The normalized segment is divided into n equal zones, with n = 4, 9, 16 and 36 considered for calculating the recognition rates. The density of a zone is the ratio of the number of object pixels (i.e. pixels representing the character, viz. binary 1) to the total number of pixels in the zone. This is computed for all the zones in the image. A total of 65 features are extracted from each image, and a feature vector is created for each image.
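The zone density computation described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the image is a 48×48 nested list of 0/1 values, the n = 4, 9, 16 and 36 zones are taken as square grids of 2×2, 3×3, 4×4 and 6×6 equal blocks, and the zones are visited row by row. The function names and the zone ordering are illustrative and are not specified in the text.

```python
def zone_densities(image, zones_per_side):
    """Return the density of each zone of a square binary image.

    The image is split into zones_per_side x zones_per_side equal blocks,
    visited row by row (ordering is an assumption of this sketch).
    Density = object pixels (value 1) / total pixels in the zone.
    """
    size = len(image)
    if size % zones_per_side != 0:
        raise ValueError("image size must be divisible by zones_per_side")
    step = size // zones_per_side
    features = []
    for zr in range(zones_per_side):
        for zc in range(zones_per_side):
            ones = sum(
                image[r][c]
                for r in range(zr * step, (zr + 1) * step)
                for c in range(zc * step, (zc + 1) * step)
            )
            features.append(ones / (step * step))
    return features


def feature_vector(image):
    """Concatenate the 4, 9, 16 and 36 zone densities (65 values)."""
    vector = []
    for zones_per_side in (2, 3, 4, 6):  # 2*2=4, 3*3=9, 4*4=16, 6*6=36
        vector.extend(zone_densities(image, zones_per_side))
    return vector
```

Because 48 is divisible by 2, 3, 4 and 6, every zone has the same number of pixels (576, 256, 144 and 64 respectively), so the densities of different zones within one division are directly comparable.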
Based on the feature vectors, training sets and test sets are created for classification. The density of each zone is computed using equation (3), where N is the number of object pixels in zone Z and T is the total number of pixels in that zone.

Density(Z) = N / T    (3)

The steps involved in calculating the feature vector are as follows:
1. Take the segmented input image of size 48×48.
2. Calculate the density for n = 4, 9, 16 and 36, which provides 4, 9, 16 and 36 features.
3. The images for each zone division are given below.

For each image, a feature vector is created as the concatenation of all the features for the 4, 9, 16 and 36 zone divisions. A total of 65 features are collected and a feature vector is formed. Figures 4.8, 4.9, 4.10 and 4.11 show the 4, 9, 16 and 36 zone feature spaces respectively.

[Figure 4.8: 4 Zone Feature Space. Zone densities as extracted: 0.060764, 0.069444, 0.032986, 0.001736]

Figure 4.8 gives the details of the features extracted from the sample input image. The input image is divided into Zone 1 (Z1), Zone 2 (Z2), Zone 3 (Z3) and Zone 4 (Z4). These zones contain foreground pixels and background pixels. The foreground pixels carry the data about the character and have binary value 1; the background pixels have binary value 0. The values shown in the corresponding window (0.060764, 0.069444, etc.) are computed using equation (3) as the number of data pixels divided by the total number of pixels in each zone. These values are the extracted information about the character, termed features.

[Figure 4.9: 9 Zone Feature Space. Zone densities as extracted: 0.058594, 0, 0.032548, 0.109375, 0, 0.152344, 0.050781, 0.20895, 0]

Figure 4.9 gives the details of the features extracted from the sample input image.
In this, the input image is divided into 9 Zones represented by Z1, Z2 , Z3...z9. By applying equation (3) the corresponding window demonstrates the values computed for each zone.The value 0 represents there is no existence of image data into that particular zone.Thus, from the input image, 9 feature values are obtained. 0.076389 0 0 0.028620 0 0.166667 0.277778 0.20898 0 0.006944 0.1391944 0.003598 0.015635 0.30869 0 Figure 4.10: 16 Zone Feature Space Ph.D. Thesis on Segmentation and Recognition of Handwritten Mathematical Expression Page 95 0 Chapter 4: Segmentation of Superscript and Subscript Characters within Offline Handwritten Mathematical Expression 0.12986 0.025879 0 0 0 0 0 0.357149 0.47619 0.190476 0.012478 0 0 0 0 0.452381 0.309524 0.25487 0 0 0.452381 0.02381 0 0 0.03459 0.13584 0.014754 0.002141 0 0 0 0 0 0 0 0.001278 Figure 4.11: 36 Zone Feature Space In Figure 4.10 and 4.11 , demonstrate the 16 zone and 36 zone feature space. After computing features for each zone 4,9,16 and 36, a feature vector is created as concatenation of features from 4 zones, 9 zones, 16 zones and 36 zones using following equation (4). Feature Vector (V) = 4 Zone Feature Space + 9 Zone Feature Space + 16 Zone Feature Space + 36 Zone Feature Space ---------------------------(4) A sample Feature Vector of size 65 given in Table 4.6 Table 4.6: A sample Feature Vector of size 65 for input segment ‘ >’ 0.060764 0.069444 0.032986 0.001736 0.058594 0.109375 0 0 0.152344 0.050781 0.032548 0.20895 0 0.076389 0.02862 0 0 0 0.166667 0.277778 0.20898 0 0.006944 0.139194 0.003598 0.015635 0.30869 0 0 0.12986 0.025879 0 0 0 0 0 0.357149 0.47619 0.190476 0.012478 0 0 0 0 0.452381 0.309524 0.25487 0 0 0.452381 0.02381 0 0 0.03459 0.13584 0.014754 0.002141 0 0 0.001278 0 0 0 0 0 The feature extraction process is carried out for each input segment.Each feature vector for input character is represented by the a feature vector of size 65 feature values. The feature Ph.D. 
Thesis on Segmentation and Recognition of Handwritten Mathematical Expression Page 96 Chapter 4: Segmentation of Superscript and Subscript Characters within Offline Handwritten Mathematical Expression extraction process is conducted for all the segments obtained from mathematical expression to be recognised. The input data to be recognised consists of two types i.e. , training data and testing data. The feature vector for each input character is obtained for testing data and training data separately. 4.2.6 K-NN Classification The set of known m classes that are identified by some type of description about the objects. In terms of character classification system, we consider the description or information about the character about its appearance to assign the classes for these objects. For the classification, we have set of these characters that are used for training and testing purpose. In a case, there will be existence of a reject class of objects that is not part of any known class. In general, the known classes are denoted by the class labels. Classification process of assigning labels to the different objects that is based on the properties of the object. The reject class is a generic class that cannot be placed in any known class (Shapiro, L. (2002)). In the simplest form, these algorithms do not perform any computation during training. The computation is performed only when a test example is presented, so it is with input that contains the training examples and one test example. The expensive step in these algorithms is the computation of nearest-neighbours. Efficient implementations exist with sophisticated data structures for efficient computation of nearest-neighbours. Input: m training examples, given as the pairs (xi; yi), where xi is an n-dimensional feature vector and yi is its label and x is a test example. Output: y, the computed label of x. 4.2.6.1 The Nearest-Neighbor algorithm a. Determine xi nearest to x. 
The nearest xi is the one that minimizes the distance to x under a pre-defined norm, as shown in equation (5).

\[ \mathrm{Distance}(x_i, x) = \lVert x_i - x \rVert \qquad (5) \]

b. Return y = yi.

The most commonly used norm in (5) is the Euclidean norm:

\[ \lVert x_i - x \rVert = \sqrt{\sum_{j=1}^{n} (x_{ij} - x_j)^2} \qquad (6) \]

where xij and xj are the j-th components of xi and x. There are multiple approaches for handling the case in which more than one training example is nearest to x.

4.2.6.2 The k-Nearest-Neighbor algorithm

a. Determine xi1, ..., xik, the k training examples nearest to x according to a pre-defined norm.
b. Let yi1, ..., yik be the labels of the k nearest neighbours. Choose y as the label that appears most frequently among yi1, ..., yik.

There are multiple approaches for handling the case in which no label has a clear majority in step b.

4.2.6.3 The value of k

In many practical problems, k-NN with k > 1 performs better than simple 1-NN. The most effective method of estimating a useful value of k is cross-validation. For k = 1, the K-NN classifier reduces to the basic nearest-neighbour classifier. Increasing the value of k helps to reduce the effect of noise within the data set.

4.2.6.4 Advantages of K-NN classifier

• It has a simple implementation.
• It works well for large data samples.
• It does not require parameter estimation.
• Its sensitivity to noise in the training data can be reduced by varying the value of k.
• It lends itself easily to parallel implementation.
• It uses local information, which yields adaptive behaviour.

4.2.6.5 Steps Performed in Classification

1. Assign class labels to mathematical operators, numerals and characters.
2. Prepare the training and testing sets using the cross-validation technique.
3. Supply the training and testing data sets as input to K-NN classification.
4. Apply the K-NN rule, i.e. K = 1.
5. Prepare the confusion matrix.
6. Calculate the error rate on the testing set.

The experiment was repeated with varying numbers of neighbours, i.e. K = 1, 3, 5. The performance of the algorithm is optimal at k = 3.

4.2.7 Performance Evaluation and Recognition of Offline HME with respect to their superscript and subscript characters

The proposed steps for recognition of handwritten mathematical expressions are given below:

1. Take a scanned handwritten mathematical expression as input.
2. Apply the pre-processing, segmentation, feature extraction and classification techniques.
3. Map the symbols and characters predicted by classification to the appropriate positions within the matrix created during segmentation.
4. Display the reconstructed handwritten mathematical expression using the position numbers given in the (3 × n) segmentation matrix.

In the matrix (Table 4.4), ‘n’ represents the number of input segments from the expression. This matrix records the relative superscript, subscript and main character positions of the handwritten mathematical expression with respect to its original structure.

In earlier work, the recognition process focused only on recognition of the mathematical characters. Recognition systems for HME also need reconstruction logic that identifies the superscript, subscript and main characters within HME. The proposed segmentation method is an innovative method for reconstruction of mathematical expressions.

The accuracy of the proposed recognition system is studied using the k-fold cross-validation technique. Cross-validation is a statistical method for evaluating and comparing learning algorithms. It is a popular technique for testing an algorithm on data that was not used for training.
This method randomly partitions the data into k groups. Training is performed k times, each time leaving out the samples from one of the k groups. The algorithm works as follows:

a. Partition the data set into k groups.
b. For each group i = 1, ..., k:
   i. Train the algorithm on all the data except group i.
   ii. Test the trained algorithm using group i as the test set.
c. Count the number of errors and calculate the mean error over all k test sets.

The correctness of the algorithm depends on a suitable value of k. The size of the data set is considered when choosing the value of k.

The feature vectors obtained from the feature extraction stage are used for training and testing with K-NN classification. The cross-validation technique discussed in section 4.2.7 is used to create the training and testing feature vector sets. The K-NN classifier is trained with the training feature vector set, and the testing feature vector set is used to test the characters for recognition. In the classification stage, a label is assigned to each type of mathematical character, as per Table 4.7.

Table 4.7: Classification scheme for HME recognition system

Label Number   Symbol                               Label Number   Symbol
1                                                   15
2                                                   16
3                                                   17
4                                                   18
5                                                   19
6                                                   20
7                                                   21
8                                                   22
9                                                   23
10                                                  24
11                                                  25
12                                                  26
13                                                  27
14             Letter and Multiplication operator

Classification is the major step in the recognition system. The classification scheme includes labelling of the training and testing data sets. Each symbol class is assigned a unique label number.
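The cross-validation steps a to c and the k-nearest-neighbour rule of section 4.2.6.2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the thesis: the function names (`euclidean`, `knn_predict`, `cross_validate`), the fixed random seed and the tie-breaking rule are hypothetical choices, and the two-dimensional toy vectors in the usage note stand in for the 65-value feature vectors.

```python
import math
import random
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def knn_predict(train_x, train_y, x, k=3):
    """Label x by majority vote among its k nearest training examples."""
    # Sort training indices by distance to x and keep the k closest.
    order = sorted(range(len(train_x)), key=lambda i: euclidean(train_x[i], x))
    nearest = order[:k]
    votes = Counter(train_y[i] for i in nearest)
    top = max(votes.values())
    # Tie-break (an assumption; the text leaves this open): among the labels
    # with the most votes, return the one whose neighbour is closest to x.
    for i in nearest:
        if votes[train_y[i]] == top:
            return train_y[i]


def cross_validate(xs, ys, folds=5, k=3, seed=0):
    """Mean k-NN error rate over `folds` random groups (needs len(xs) >= folds)."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    # Step a: partition the shuffled indices into `folds` groups.
    groups = [idx[f::folds] for f in range(folds)]
    errors = []
    for f in range(folds):
        # Step b: hold out group f for testing, use the rest for training.
        test = groups[f]
        train = [i for g in range(folds) if g != f for i in groups[g]]
        tx = [xs[i] for i in train]
        ty = [ys[i] for i in train]
        wrong = sum(knn_predict(tx, ty, xs[i], k) != ys[i] for i in test)
        errors.append(wrong / len(test))
    # Step c: mean error over all test groups.
    return sum(errors) / folds
```

As a usage example, calling `cross_validate(xs, ys, folds=5, k=3)` on two well-separated toy clusters of five points each (for instance, points near (0, 0) labelled "minus" and points near (1, 1) labelled "dot") returns an error rate of 0.0, because every held-out point still has three same-class neighbours closer than any point of the other class. Nearest-neighbour search here is a plain linear scan; the efficient data structures mentioned in section 4.2.6 are not attempted.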
In this experiment, the multiplication operator ‘x’ and the capital letter ‘X’ are assigned the same label number because of the similarity between these symbols. For this experiment, a total of 8000 mathematical symbols scanned from 400 mathematical expressions are considered as the sample for the classification process. The K-NN classification of 2700 testing data segments using five-fold cross-validation gives an error rate of 0.0230, i.e. a recognition accuracy of 97.70%. Table 4.8 gives the overall recognition rates obtained in the experiment.

Table 4.8: Recognition Result (in %) using K-NN classifier for Mathematical Symbols

Symbols   Cross Validation (Fold-1   Fold-2   Fold-3   Fold-4   Fold-5)   Average Recognition

100 100 100 100 100 100 100 100 100 100
96 100 100 97 95 95 93 95 100 93
95 98 95 100 100 91 100 100 100 100
100 100 98 100 100 100 100 100 100 97
100 100 100 100 100 100 95 100 100 100
100 100 100 96 100 100 100 100 100 91
97 100 100 100 100 100 100 100 100 93
100 100 100 81 100 100 100 80 100 100
100 84 100 100 100 84 100 100 100 85
100 100 100 100 100 100 100 100 100 100
96 100 100 100 94 100 86 100 95 99
98 100 100 100 96 100 89 100 96 98
95 100 100 100 87 100 93 100 93 98
95 100 100 100 91 100 95 100 98 95
98 100 100 100 94 100 93 100 100 100
95 100 100 96 95 100 94 100 97 98
99 99

Average Recognition   98.36   97.91   97.59   98.09   97.95   97.70

Based on the predictions of the classifier, a confusion matrix is constructed, shown in Table 4.9 (refer Page No. 105). The confusion matrix tabulates the number of test data segments against the predicted misclassified mathematical characters. For testing, 28 different types of mathematical symbols were obtained from the database, with a testing data size of 100 for each symbol. The number of labels assigned is 27, because the multiplication operator ‘x’ and the capital letter ‘X’ are assigned the same label number. The confusion matrix gives a detailed analysis of the misclassified mathematical symbols in the testing data set.

4.2.7.1 Ambiguities observed in Confusion matrix

The ambiguities among mathematical characters identified from the confusion matrix in Table 4.9 (refer Page No. 105) are shown in Table 4.10.

Table 4.10: Ambiguities observed in confusion matrix for recognition system

Symbol   Ambiguous with Symbol   No. of times ambiguous with other symbol
                                 5
                                 6
                                 3
                                 15
                                 4
                                 4
                                 6

In this experiment, the operator ‘.’ is most ambiguous with the minus operator. These ambiguities need to be resolved. The issue of ambiguities among mathematical symbols is addressed specifically in Chapter 6.

4.3 Results and Discussion

In this chapter, a novel segmentation technique for recognition of mathematical expressions with respect to superscript and subscript characters is proposed. The proposed segmentation algorithm predicts the position of the components, which helps in the reconstruction of HME. The algorithm addresses three challenges identified in existing segmentation techniques:

1. Existing methods do not support unconstrained handwritten expressions in terms of superscript and subscript components.
2. Slant correction needs to be performed at the pre-processing step.
3. The segmentation process does not take care of the reconstruction phase in terms of ME character positions such as superscripts, subscripts and main characters.

The recognition system is built on the symmetric density based feature extraction technique. Five-fold cross-validation is used to study the accuracy of the proposed feature extraction method with K-NN classification in handling the above three problems. The data set for this experiment contains 8000 mathematical characters scanned from 400 mathematical expressions. The output is based on the classification of 2700 testing mathematical characters, which gives a K-NN classification error rate of 0.0230, i.e. a recognition accuracy of 97.70%.

4.4 Applications of the Study

The proposed recognition system can be applied to the following:

• Digitization of mathematical formulae that contain superscript and subscript characters.
• With reference to the Literature Review in Chapter 2 (Section 2.1 to Section 2.4), segmentation techniques with respect to superscript and subscript characters within mathematical expressions are not available in the existing literature for online HMEs. The proposed offline HME segmentation technique can be applied to online recognition systems, which are useful for many digital devices.

4.5 Conclusion

In this chapter, an efficient predictive segmentation technique for superscript, subscript and main characters within offline handwritten mathematical expressions has been proposed.
This technique is based on generating predictions for superscript, subscript and main characters within handwritten mathematical expressions, which helps to reconstruct mathematical expressions, together with their spatial interrelationships, during the recognition process. It has also been observed that segmentation with respect to superscripts and subscripts helps to improve the recognition rate and provides the logic for reconstructing mathematical expressions.

References

Álvaro, F., & Sánchez, J. A. (2010, August). Comparing several techniques for offline recognition of printed mathematical symbols. In 2010 International Conference on Pattern Recognition (pp. 1953-1956). IEEE.

Mohan, K., & Lu, C. (2013). Recognition of Online Handwritten Mathematical Expressions. Technical report, Stanford.

Shapiro, L. G., & Stockman, G. C. (2002). Computer Vision. Prentice Hall. ISBN 0-13-030796-3.

Simistira, F., Papavassiliou, V., Katsouros, V., & Carayannis, G. (2014, September). Recognition of spatial relations in mathematical formulas. In 14th International Conference on Frontiers in Handwriting Recognition (ICFHR), 2014 (pp. 164-168). IEEE.

Ramteke, S. P., Patil, D. V., & Patil, N. P. (2012, December). Neural Network Approach To Mathematical Expression Recognition System. In International Journal of Engineering Research and Technology (Vol. 1, No. 10 (December-2012)). ESRSA Publications.

Gharde, S. S., Baviskar, P. V., & Adhiya, K. P. (2013). Identification of Handwritten Simple Mathematical Equation Based on SVM and Projection Histogram. International Journal of Soft Computing and Engineering (IJSCE), ISSN: 2231-2307, Volume-3, Issue-2, May.

Gonzalez, R. C., & Woods, R. E. (2002). Digital Image Processing.