Journal of Computing and Information Technology - CIT 13, 2005, 3, 211–223

Arabic Font Recognition using Decision Trees Built from Common Words

Ibrahim S. I. Abuhaiba
Department of Electrical and Computer Engineering, Islamic University of Gaza, Gaza, Palestine

We present an algorithm for a priori Arabic optical Font Recognition (AFR). The basic idea is to recognize the fonts of some common Arabic words. Once these fonts are known, they can be generalized to lines, paragraphs, or neighboring non-common words, since these components of a textual material almost always have the same font. A decision tree is our approach to recognizing Arabic fonts. A set of 48 features is used to learn the tree. These features include horizontal projections, Walsh coefficients, invariant moments, and geometrical attributes. A set of 36 fonts is investigated. The overall success rate is 90.8%. Some fonts show a 100% success rate. The average time required to recognize the font of a word is approximately 0.30 seconds.

Keywords: Arabic fonts, optical font recognition, vertical normalization, projections, Walsh coefficients, invariant moments, geometrical features, decision tree learning.

1. Introduction

Optical font recognition can be addressed through two complementary approaches: the a priori approach, in which characters of the analyzed text are not yet known, and the a posteriori approach, where the content of the given text is used to recognize the font. To our knowledge, there have been nearly no studies of the Arabic Font Recognition (AFR) problem. Available studies deal with Latin fonts, which have different characteristics than Arabic fonts. Therefore, in this paper, we present a novel solution to the a priori AFR problem. Often, the font style is not the same for a whole document; it is a word feature, rather than a document feature, and its detection can be used to discriminate between different regions of the document, such as title, figure caption, or normal text.
The detection of the font style of a word can also be used to improve character recognition: we know that mono-font OCR systems achieve better results than multi-font ones, so the recognition of a document can be done using first an OFR and then a mono-font OCR. To our knowledge, the only work that addresses the AFR problem is that of reference [1], where an algorithm for AFR is presented. First, words in the training set of documents for each font are segmented into symbols that are rescaled. Next, templates are constructed, where every new training symbol that is not similar to existing templates is considered a new template. Templates are sharable between fonts. To classify the font of a word, its symbols are matched to the templates and the fonts of the best matching templates are retained. The most frequent font is the word font. The overall error, rejection, and success rates were 15.0%, 7.6%, and 77.4%, respectively. The high error rate is mainly due to errors in size recognition. In [2], font recognition is developed to enhance the recognition accuracy of a text recognition system. Font information is extracted from two sources: one is the global page properties, and the other is the graph matching result of recognized short words such as a, it, and of. In [3], a multi-font OCR system to be used for document processing is presented. The system performs, at the same time, both character recognition and font-style detection of the digits belonging to a subset of the existing fonts. The detection of the font-style of the document words can guide a rough automatic classification of documents, and can also be used to improve character recognition. The system uses the tangent distance as a classification function in a nearest neighbour approach. The nearest neighbour approach is always able to recognize the digit, but the performance in font detection is not optimal.
To improve the performance, they used a discriminant model, the TD-Neuron, which is used to discriminate between two similar classes. In [6], a study of the effects of image degradations on the performance of a font recognition system is presented. The evaluation that has been carried out shows that the system is robust against natural degradations such as those introduced by scanning and photocopying, but its performance decreases with very degraded document images. In order to avoid this weakness, a degradation modeling strategy has been adopted, allowing an automatic adaptation of the system to these degradations. The adaptation is derived from statistical analysis of feature behavior against degradations and is performed by specific transformations applied to the system knowledge base. In [4], a texture analysis-based approach is used for font recognition. Existing methods are typically based on local features that often require connected component analysis. In this work, the document is taken as an image containing some special textures, and font recognition is treated as texture identification. The method is content independent and involves no local feature analysis. The well-established 2-D Gabor filtering technique is applied to extract such features, and a weighted Euclidean distance classifier is used in the recognition task. The reported average recognition accuracy over 24 fonts and 6,000 samples is 98.6%. All previous OFR studies, except the first, deal with Latin fonts, and there have been no similar studies on Arabic fonts, which have different characteristics than Latin fonts. The most impeding characteristic for Arabic OCR systems is the cursive nature of Arabic script, which makes characters not ready for direct OCR or OFR. In this paper, we present a novel contribution to the a priori AFR problem. In [5], a statistical approach based on global typographical features is proposed for font recognition.
It aims at the identification of the typeface, weight, slope, and size of the text from an image block without any knowledge of the content of that text. The recognition is based on a multivariate Bayesian classifier and operates on a given set of known fonts. The effectiveness of the adopted approach was evaluated on a set of 280 fonts. Font recognition accuracies of about 97% are reached on high-quality images. Rates higher than 99.9% were obtained for weight and slope detection. Experiments have also shown the system's robustness to document language and text content, and its sensitivity to text length. Due to difficulties in segmenting cursive Arabic text, the basic idea in our method is to recognize the fonts of some common Arabic words. Once these fonts are known, they can be generalized to lines, paragraphs, or neighboring non-common words, since these components of a textual material almost always have a single font. Figure 1 shows one line of Arabic text written in three different fonts, from top to bottom: Simplified Arabic, Traditional Arabic, and Tahoma. This line contains the words “ ”, “ ”, and “ ”, which are all common Arabic words used in all contexts irrespective of the subject of the material presented. If we could recognize the font of these words only, it could be assigned to the rest of the words in the line.

Fig. 1. One line of Arabic text written in three different fonts, all Roman and Regular, from top to bottom: (a) Simplified Arabic, (b) Traditional Arabic, and (c) Tahoma.

Decision tree learning is one of the most widely used and practical methods for inductive inference. It is a method for approximating functions that is robust to noisy data. So, a decision tree is our approach to recognizing Arabic fonts.
In our AFR system, a font is identified by four attributes: typeface (Simplified Arabic, Traditional Arabic, etc.), size expressed in typographic points, slant (Roman, Italic), and weight (Regular, Bold). The rest of the paper is organized as follows. Common Arabic words are selected in Section 2. The feature set to be used in the font classification decision tree is presented in Section 3. The learning algorithm of the font decision tree is described in Section 4. The recognition process is explained in Section 5. In Section 6, experimental results are reported. Finally, the paper is concluded in Section 7.

2. Finding Common Arabic Words

Font recognition of every character is a very difficult and time-consuming problem, especially in cursive scripts such as Arabic. The reason for this difficulty is that words have to be segmented first into characters, which is itself another difficult problem, and no efficient solution to the segmentation problem is available. Actually, the details of character segmentation depend on the word font which, in current OCR systems, remains unknown until the characters are recognized. Therefore, we chose not to detect the font per character. In document analysis systems, the stage of layout analysis results in segmenting the page into logical components. Examples of such components are non-textual components such as graphs, photos, line art, etc., and textual components such as titles, section headings, and paragraphs. When we talk about font recognition, we are interested in textual components. Assuming that a textual component is written almost entirely in one font, it is not necessary to detect the font of every word in a textual component. It is sufficient to detect the font of some representative words and generalize that font to the whole textual component, which can be faster. The set of representative words are the most common words used in Arabic texts irrespective of the subject.
For easier segmentation, we selected words that do not contain a separation character in the middle; see Figure 2 for a list of these characters. These are characters which, when they exist in the middle of a word, separate it into more than one non-connected sub-word. Such discontinuity causes difficulties when segmenting lines into complete words. A word without separation characters is connected, and it is almost always easy to segment it successfully. So, we compiled very long files of Arabic texts from many disciplines such as Islamic publications, arts, engineering, science, journalism, etc. The frequency of every such word was computed. The top 100 words having the highest frequencies were selected; see Figure 3. These words were printed in different fonts to constitute the learning dataset and one testing dataset. Details of the learning and testing datasets can be found in Section 6.

Fig. 2. Arabic separation characters.

Fig. 3. The 100 most common Arabic words not having a separation character in the middle.

3. Feature Set

Some features to be described in this section need words to be normalized first in order to get a fixed number of features, which facilitates comparison processes. In the context of document understanding, the normalization operation usually means to normalize in two directions: x and y. Actually, this can be problematic since information is lost due to this kind of two-dimensional normalization. In our system, we perform a one-dimensional normalization, in the vertical direction, such that the height/width ratio is preserved, which results in normalized words that retain the relative geometrical attributes of the original un-normalized words. Assume that the image of a word is enclosed by a minimum bounding rectangle and is defined by the function:

I(x, y) = 1 for foreground pixels, 0 for background pixels        (1)

In the following, there is a complete description of our set of features.

3.1. Projections

After vertically normalizing a word image, I(x, y), its horizontal projection, h(y), is calculated using the equation:

h(y) = \sum_x I(x, y),    y = 0, 1, ..., N - 1        (2)

where N is the word height after normalization. This yields N features, f_i, i = 0, 1, ..., N - 1. We don't use the vertical projection since the number of columns varies from word to word, which produces a different number of features for different words.

3.2. Walsh Coefficients

Since the Walsh transform has proved to be powerful in shape discrimination, we used the 1-D discrete Walsh transform of the horizontal projections, h(y), described above, to find N Walsh coefficients. This provides N other features, f_i, i = N, N + 1, ..., 2N - 1. Walsh coefficients are found using the equation:

W(u) = (1/N) \sum_{y=0}^{N-1} h(y) \prod_{i=0}^{n-1} (-1)^{b_i(y) b_{n-1-i}(u)}        (3)

where N = 2^n, and b_k(z) is the kth bit in the binary representation of z. We don't use Walsh coefficients for the vertical projections since we perform vertical normalization, resulting in a varying number of columns from word to word.

In practice, Equation (3) is not used directly. Instead, the fast Walsh transform is found using the successive doubling method in a way similar to the FFT [7]. Thus, Equation (3) becomes:

W(u) = (1/2)[W_even(u) + W_odd(u)]        (4)

and

W(u + M) = (1/2)[W_even(u) - W_odd(u)]        (5)

where M = N/2 and u = 0, 1, ..., M - 1.

3.3. Invariant Moments

For a 2-D discrete word image, I(x, y), the moment of order (p + q) is defined as

m_{pq} = \sum_x \sum_y x^p y^q I(x, y)        (6)

The central moments can be expressed as

µ_{pq} = \sum_x \sum_y (x - x̄)^p (y - ȳ)^q I(x, y)        (7)

where

x̄ = m_{10} / m_{00}        (8)

and

ȳ = m_{01} / m_{00}        (9)

The central moments of order up to 3 are:

µ_{00} = m_{00}
µ_{10} = 0
µ_{01} = 0
µ_{20} = m_{20} - x̄ m_{10}
µ_{02} = m_{02} - ȳ m_{01}
µ_{11} = m_{11} - ȳ m_{10}
µ_{30} = m_{30} - 3x̄ m_{20} + 2x̄² m_{10}
µ_{12} = m_{12} - 2ȳ m_{11} - x̄ m_{02} + 2ȳ² m_{10}
µ_{21} = m_{21} - 2x̄ m_{11} - ȳ m_{20} + 2x̄² m_{01}
µ_{03} = m_{03} - 3ȳ m_{02} + 2ȳ² m_{01}

The normalized central moments, denoted η_{pq}, are defined as

η_{pq} = µ_{pq} / µ_{00}^γ        (10)

where

γ = (p + q)/2 + 1  for  p + q = 2, 3, ...        (11)

In our font recognition system, we used the following seven invariant moments, which are derived from the second and third moments:

φ_1 = η_{20} + η_{02}        (12)

φ_2 = (η_{20} - η_{02})² + 4η_{11}²        (13)

φ_3 = (η_{30} - 3η_{12})² + (3η_{21} - η_{03})²        (14)

φ_4 = (η_{30} + η_{12})² + (η_{21} + η_{03})²        (15)

φ_5 = (η_{30} - 3η_{12})(η_{30} + η_{12})[(η_{30} + η_{12})² - 3(η_{21} + η_{03})²] + (3η_{21} - η_{03})(η_{21} + η_{03})[3(η_{30} + η_{12})² - (η_{21} + η_{03})²]        (16)

φ_6 = (η_{20} - η_{02})[(η_{30} + η_{12})² - (η_{21} + η_{03})²] + 4η_{11}(η_{30} + η_{12})(η_{21} + η_{03})        (17)

φ_7 = (3η_{21} - η_{03})(η_{30} + η_{12})[(η_{30} + η_{12})² - 3(η_{21} + η_{03})²] + (3η_{12} - η_{30})(η_{21} + η_{03})[3(η_{30} + η_{12})² - (η_{21} + η_{03})²]        (18)

This set of moments, which will be denoted by f_i, i = 2N, 2N + 1, ..., 2N + 6, is invariant to translation, rotation, and scale change [7].

3.4. Geometrical Features

These include: width, height, aspect ratio, area, x-coordinate of center of area, y-coordinate of center of area, perimeter, thinness ratio, and direction of axis of least second moment. After word segmentation, the limits, x_min, x_max, y_min, and y_max, of the bounding rectangle become known. The width, D, and height, H, of a word are defined as:

D = x_max - x_min + 1        (19)

and

H = y_max - y_min + 1        (20)

Width and height features provide information about the size of the bounding rectangle of the word. The aspect ratio, α, is defined by

α = D / H        (21)

The area, A, which equals the number of 1 pixels in the image, indicates the relative size of the word, and is defined by:

A = \sum_x \sum_y I(x, y)        (22)

Note that the area equals m_{00} defined earlier. The x and y coordinates of the center of area are the parameters x̄ and ȳ defined previously. These two features locate where in the bounding rectangle the center of area lies. To find the perimeter, P, for every 1 pixel we find the number of 0 pixels in its 4-neighbors. The sum of these numbers over all 1 pixels in the word yields the perimeter, P. The thinness ratio, T, is calculated by

T = 4πA / P²        (23)

As the perimeter becomes larger relative to the area, this ratio decreases and the word gets thinner. This can be useful in discriminating between Regular and Bold fonts and between thin and thick fonts. The axis of least second moment feature provides information about the word's orientation, which can be helpful in discriminating between Roman and Italic fonts. This axis corresponds to the line about which it takes the least amount of energy to spin the word, or the axis of least inertia. If we move our origin to the center of area (x̄, ȳ), the angle, θ, of the axis of least second moment is defined as follows:

θ = (1/2) tan^{-1} [ 2 \sum_x \sum_y x y I(x, y) / ( \sum_x \sum_y x² I(x, y) - \sum_x \sum_y y² I(x, y) ) ]        (24)

where θ is measured in the counterclockwise direction from the vertical axis. Geometrical features provide 9 more features and will be denoted by f_i, i = 2N + 7, 2N + 8, ..., 2N + 15. Therefore, the total number of features becomes N horizontal projections + N Walsh coefficients + 7 invariant moments + 9 geometrical features = 2N + 16 features. In our system, N was set to 16, yielding 48 total features, f_i, i = 0, 1, ..., 47.

4. Decision Tree Learning

Decision tree learning is one of the most widely used and practical methods for inductive inference.
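To make the feature set concrete, here is a minimal Python sketch of the horizontal-projection features of Eq. (2) and the Walsh coefficients of Eq. (3), computed both directly and by the successive-doubling recursion of Eqs. (4) and (5). The function names are ours, and the vertical normalization and the remaining feature groups are omitted.

```python
def horizontal_projection(img):
    """h(y) = sum over x of I(x, y) for a binary image given as rows (Eq. (2))."""
    return [sum(row) for row in img]

def walsh_direct(h):
    """Walsh coefficients straight from Eq. (3); len(h) must be a power of two."""
    N = len(h)
    n = N.bit_length() - 1            # N = 2^n
    bit = lambda z, k: (z >> k) & 1   # b_k(z): kth bit of z
    W = []
    for u in range(N):
        acc = 0.0
        for y in range(N):
            sign = (-1) ** sum(bit(y, i) * bit(u, n - 1 - i) for i in range(n))
            acc += h[y] * sign
        W.append(acc / N)
    return W

def walsh_fast(h):
    """Successive-doubling form of Eqs. (4)-(5), analogous to the FFT."""
    N = len(h)
    if N == 1:
        return [float(h[0])]
    we = walsh_fast(h[0::2])  # transform of even-indexed samples
    wo = walsh_fast(h[1::2])  # transform of odd-indexed samples
    M = N // 2
    return ([0.5 * (we[u] + wo[u]) for u in range(M)] +
            [0.5 * (we[u] - wo[u]) for u in range(M)])
```

For a word image of normalized height N = 16, horizontal_projection followed by walsh_fast would yield the first 2N = 32 features; the two Walsh routines agree, with the fast one running in O(N log N).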
Decision tree learning is a method for approximating functions that is robust to noisy data and capable of learning disjunctive expressions [8]. Decision trees classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance. Each node in the tree specifies a test of some feature of the instance, and each branch descending from that node corresponds to one of the possible values for this feature. Most algorithms that have been developed for learning decision trees are variations on a core algorithm that employs a top-down, greedy search through the space of possible decision trees [8-10]. We will use an algorithm that learns the decision tree by constructing it top-down, beginning with the question "which feature should be tested at the root of the tree?" To answer this question, each instance feature is evaluated using a statistical test to determine how well it alone classifies the training examples. The best feature is selected and used as the test at the root node of the tree. A descendent of the root node is then created for each possible suitable range of this feature, and the training examples are sorted to the appropriate descendent node (i.e., down the branch corresponding to the example's range for this feature). The entire process is then repeated using the training examples associated with each descendent node to select the best feature to test at that point in the tree. This forms a greedy search for an acceptable decision tree, in which the algorithm never backtracks to reconsider earlier choices. The central choice in the algorithm is selecting which feature to test at each node in the tree. We would like to select the feature that is most useful for classifying examples. There are several quantitative measures of the worth of a feature [8].
We will use a statistical property, called information gain, that measures how well a given feature separates the learning examples according to their target font classification. The algorithm uses this measure to select among the candidate features at each step while growing the tree. In order to define information gain precisely, we begin by defining a measure commonly used in information theory, called entropy, which characterizes the impurity of an arbitrary collection of examples. Given a collection S of font examples where the target font classification can take on c different values, the entropy of S relative to this c-wise classification is defined as:

Entropy(S) = \sum_{i=1}^{c} -p_i \log_2 p_i        (25)

where p_i is the proportion of S belonging to font class i. Given the entropy as a measure of the impurity in a collection of training examples, we can now define a measure of the effectiveness of a feature in classifying the training examples. The measure we will use, called information gain, is simply the expected reduction in entropy caused by partitioning the examples according to this feature. More precisely, the information gain, Gain(S, f), of a feature f relative to a collection of examples S, is defined as

Gain(S, f) ≡ Entropy(S) - \sum_{v ∈ Values(f)} (|S_v| / |S|) Entropy(S_v)        (26)

where Values(f) is the set of all possible values for feature f, and S_v is the subset of S for which feature f has value v (i.e., S_v = {s ∈ S | f(s) = v}). Note that the first term in Equation (26) is just the entropy of the original collection, S, and the second term is the expected value of the entropy after S is partitioned using feature f. The expected entropy described by this second term is simply the sum of the entropies of each subset, S_v, weighted by the fraction |S_v|/|S| of examples that belong to S_v. Gain(S, f) is therefore the expected reduction in entropy caused by knowing the value of feature f.
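As an illustration (not part of the paper's code), Eqs. (25) and (26) can be sketched in Python for a discrete-valued feature; the function names are ours:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S) = sum_i -p_i log2(p_i) over the font classes (Eq. (25))."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(pairs):
    """Gain(S, f) for one feature, given (feature value, font label) pairs (Eq. (26))."""
    n = len(pairs)
    partitions = {}
    for value, label in pairs:
        partitions.setdefault(value, []).append(label)
    expected = sum(len(sub) / n * entropy(sub) for sub in partitions.values())
    return entropy([label for _, label in pairs]) - expected
```

A feature that splits two balanced font classes perfectly has a gain equal to the full entropy (1 bit for two classes), while a feature whose values are independent of the class has a gain of 0.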
To give a formal description of our font decision tree learning algorithm, let S = {s_i, i = 0, 1, ..., n - 1} be a set of training examples, i.e., instance word images, where n is the total number of examples and each instance, s_i, is a 2-tuple (V, τ), V = (v_0, v_1, ..., v_47), where v_j is the value of feature f_j, j = 0, 1, ..., 47, and τ is the value of an additional Font feature, i.e., the font classification of this example. Notice that the Font feature is the target function value to be learned and then predicted by the tree. Initially, the set S contains the whole set of training examples. A node in the tree to be created contains the following fields:

1. feature: used by interior nodes to store the number of the feature to be tested at that node.

2. threshold: used by interior nodes to decide which branch to follow. If the value of the feature to be tested at that node is less than threshold, then the left branch is followed; otherwise, the right branch is followed.

3. font: used by leaf nodes to store the target font classification.

4. Examples: used by a leaf node to store the examples sorted to that node. Here, we store only the V vector, i.e., the 48 feature values, per example. The font classification per example is not stored since it is common to all examples sorted to this node and it is retained in the font field.

5. left and right: pointers used by an interior node to point to the left and right child nodes of that node, respectively.

A formal description of our font decision tree learning algorithm follows.

Algorithm I. Font Decision Tree Learning (FDTL)

DecisionTreeNode FDTL(S)

Step 1. New Node Creation
Create a Root node for the tree.

Step 2. Leaf Node Check
If all examples have the same font classification, τ, then
    Mark Root as a leaf node
    Let Root→font = τ
    Store the V vectors, i.e., the 48 feature values of each example in S, in Root→Examples
    Return Root

Step 3. Best Classifying Feature Selection
Find the feature, f_best, of the features f_0 to f_47 which best classifies the set S, and the corresponding dividing threshold, t_best.

Step 4. Interior Node and Descendent Nodes Creation
Mark Root as an interior node
Set Root→feature = f_best and Root→threshold = t_best
Let S_{<t_best} ⊆ S be the set of examples for which v_{f_best} < t_best
Let Root→left = FDTL(S_{<t_best})
Let S_{≥t_best} ⊆ S be the set of examples for which v_{f_best} ≥ t_best
Let Root→right = FDTL(S_{≥t_best})
Return Root

End of Algorithm I.

Step 3 of Algorithm I needs explanation. Notice that the feature set, f_0 to f_47, in our problem is continuous-valued. The decision tree creation can be made easier if these features are converted into discrete-valued features. Actually, this is accomplished by dynamically defining new discrete-valued features that partition the continuous-valued features into a discrete set of intervals. In particular, for a feature f_j that is continuous-valued, the algorithm creates a new Boolean feature b that is true if the value of f_j is less than a certain threshold t and false otherwise. We pick a threshold, t, which produces the greatest information gain of feature b. This can be achieved by sorting the examples according to the continuous feature, f_j, and then identifying adjacent examples that differ in their target font classification; we can generate a set of candidate thresholds midway between the corresponding values of f_j. The value of t that maximizes the information gain of feature b must lie at such a boundary [11]. These candidate thresholds are evaluated by computing the information gain associated with each, and the best threshold can be selected. This is repeated for every feature, f_j, and the feature which yields the maximum information gain is selected.
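The candidate-threshold search just described can be sketched as follows. This is an illustrative sketch with our own names, using the basic adjacent-pair scan (the paper additionally steps through the sorted examples in increments of δ = ⌊log2 n⌋ to reduce the number of candidate thresholds):

```python
from collections import Counter
from math import log2

def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def _gain(pairs):
    # information gain of a discrete feature given (value, label) pairs
    n = len(pairs)
    parts = {}
    for value, label in pairs:
        parts.setdefault(value, []).append(label)
    return _entropy([label for _, label in pairs]) - \
        sum(len(sub) / n * _entropy(sub) for sub in parts.values())

def best_threshold(examples):
    """Best dividing threshold for one continuous feature.

    examples: (feature value, font label) pairs. Candidate thresholds lie
    midway between adjacent sorted examples whose labels differ; each
    induces a Boolean feature b = (value < t) whose gain is measured.
    """
    pairs = sorted(examples)
    t_best, gain_max = None, 0.0
    for (v1, l1), (v2, l2) in zip(pairs, pairs[1:]):
        if l1 != l2 and v1 != v2:
            t = (v1 + v2) / 2
            g = _gain([(v < t, label) for v, label in pairs])
            if g > gain_max:
                t_best, gain_max = t, g
    return t_best, gain_max
```

Running this for each of the 48 features and keeping the feature with the largest returned gain gives f_best and t_best of Step 3.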
Therefore, in Step 3 of Algorithm I, the best feature, f_best, which best classifies the set S, and the corresponding dividing threshold, t_best, are found as follows. For every feature, f_j, in the 48 features:

1. Let Y = {y_i, i = 0, 1, ..., n - 1}, where y_i = (v_ij, τ_i) and v_ij is the jth feature value in s_i.

2. Sort Y in ascending order according to feature f_j.

3. Let gainMax = 0.

4. In the following For loop, instead of comparing the font classifications of adjacent examples, we compare the font classifications of y_i and y_{i+δ} to speed up the learning phase, since otherwise a great number of candidate thresholds arises, which highly increases the learning time. This is achieved by defining δ = ⌊log2 n⌋ instead of δ = 1, where n is the number of examples in S:

For (i = 0; i < n - δ; i = i + δ)
    Let τ_i and τ_{i+δ} be the font classes of y_i and y_{i+δ}, respectively.
    If τ_i ≠ τ_{i+δ} then
        t = (v_ij + v_{i+δ,j}) / 2
        Define the set of examples Z = {z_i = (b_i, τ_i)}, where τ_i is the font class in y_i = (v_ij, τ_i) and b_i equals 1 if v_ij < t and 0 otherwise.
        Let g be equal to the information gain of feature b relative to the set Z.
        If g > gainMax, then set gainMax = g and t_best = t.

The feature, f_best, which achieves the maximum information gain, and the corresponding threshold, t_best, are retained.

Pages of common Arabic words, defined in Section 2, are printed and scanned to constitute the learning dataset. Some image preprocessing is required in both the learning and testing stages. First, an image is skew-corrected using the algorithm of [12]. Next, horizontal and vertical solid lines are removed. Third, pepper noise is removed. The image is segmented into lines using horizontal white cuts, where each line consists of a sequence of words. Then, every line is segmented into words using vertical white cuts. Words which don't satisfy certain size constraints are filtered out.
For example, if the height or width of a word is less than some specified threshold, or is greater than some other threshold, then the word is discarded; see Section 6 for the values of these thresholds. This process is repeated for every image of the training dataset. Features of extracted words are calculated to obtain the complete set of training examples. The range of each feature, to be used in the recognition stage, is found. A formal description of the complete learning algorithm follows.

Algorithm II. Arabic Font Learning (AFL)

DecisionTreeNode AFL()

Step 1. Preprocessing
a. Correct the skew of the input image
b. Remove horizontal and vertical solid lines
c. Remove pepper noise
d. Segment the image into lines using horizontal white cuts
e. Segment every line into words using vertical white cuts
f. Repeat steps (a-e) for all images of the training dataset

Step 2. Feature Calculation
a. Find the 48 feature values of every extracted word passing some tests to be mentioned in Section 6. At the end of this step, a set of training example words, S = {s_i, i = 0, 1, ..., n - 1}, is obtained, where n is the total number of examples and each instance, s_i, is a 2-tuple (V, τ), V = (v_0, v_1, ..., v_47), where v_j is the value of feature f_j, j = 0, 1, ..., 47, and τ is the value of the additional Font feature, i.e., the font classification of this example.
b. Find the full range, r_j, of every feature, f_j, j = 0, 1, ..., 47, where r_j = max_{i=0}^{n-1}(v_ij) - min_{i=0}^{n-1}(v_ij), and v_ij is the value of feature f_j in training example s_i.

Step 3. Decision Tree Learning
Apply the decision tree learning algorithm, Algorithm I, to the set of examples S to obtain the target font recognition decision tree. Let Root point to the root of this tree.
Return Root

End of Algorithm II.

5. Recognition

The same preprocessing operations used in the learning stage are also used in the recognition stage.
However, it is not necessary that all examples extracted in the recognition stage belong to the set of common Arabic words defined in Section 2. To recognize the font of a word, its features are calculated. Feature testing starts at the root of the tree. Let the feature which must be tested at the root be f eaturej and the corresponding threshold be thresholdj . The value vj of feature fj of the word is compared against thresholdj . If vj < thresholdj then the left branch is followed, otherwise, the right branch is followed and the testing process is repeated at new interior nodes we arrive at. If a leaf node is reached, a check is made to make sure that the word being tested belongs to the set of common Arabic words defined in Section 2. This is achieved by finding whether the distance between the word being tested and any training example, in the set of training examples found at that leaf node, is not greater than a previously defined distance threshold. If so, the font saved at the leaf node is returned as the recognized font. Otherwise, the word is rejected. We define our distance threshold as: DistThreshold = α Nf (27) where α is some constant that is empirically determined and N f is the number of features, which is 48 in our case. The distance measure we incorporated in our test is the rangenormalized city block distance metric defined as 47 X vj ; uj (28) D(V U ) rj j=0 where vj is the value of feature fj in a training example found in the leaf node and uj is the corresponding feature value of the word being tested. This metric is computationally faster than the Euclidean distance, but gives similar results. The following is a formal description of the recognition algorithm per word. Before we call the algorithm, the set of feature values, U = (u0 u1 : : : u47 ), of the word to be tested is calculated. Algorithm III. Arabic Font Recognition (AFR) font AFR(DecisionTreeNode U ) f If DecisionTreeNode is an interior node, then f g Let f = DecisionTreeNode ! 
f eature If (u f < DecisionTreeNode ! threshold), then Return AFR(DecisionTreeNode ! lef t ) Else Return AFR(DecisionTreeNode ! right ) Else 220 f g g Arabic Font Recognition using Decision Trees Built from Common Words Learning Dataset If the range-normalized city block distance between U and any of the examples in the set DecisionTreeNode ! Examples is not greater than DistThreshold, then Return DecisionTreeNode ! f ont Else Return rejected Here rejected is a dedicated font value to indicate an unknown font value. End of Algorithm III. Words whose fonts are recognized using the above algorithm can be used to predict the fonts of constituent headlines, text lines, and paragraphs of a page in many ways. For example, following a maximum voting strategy, a line/paragraph can be assigned the most frequent font in the set of recognized words in that line/paragraph. Or, a rejected word can be assigned the font of the nearest recognized word. 6. Results In our system, a font, τ , is characterized by the 4-tuple ( p z s w ), where p, z, s, w are the typeface, size, slant, and weight, respectively. In the fonts used in experimentation, we have three typefaces: Simplified Arabic, Traditional Arabic, and Tahoma. The slant is either Roman or Italic. The weight is either Regular or Bold. The size is 12, 13, or 14 points. Thus, a total of 36 fonts were investigated. Table I summarizes the fonts used in our AFR system. Typeface Size The most frequent 100 Arabic words found in Section 2, see Figure 3, were printed once for every font in Table I, i.e., they were printed 3 typefaces 3 point sizes 2 slants 2 weights = 36 times. In every font, 20 samples were printed for each word. Thus, the total number of learning examples is 36 fonts 20 examples per word 100 words per font = 72,000 examples. Testing Datasets A dataset, dataset no. 1, similar to that of the learning dataset was created, but this time 10 samples per word were created. 
Testing dataset no. 1 thus comprises 36,000 samples. Another dataset, dataset no. 2, was created by compiling a file of normal Arabic text, i.e., text not constrained to the common Arabic words of Figure 3. This file contained 15 paragraphs and, when printed in the fonts of Table I, yields two to three A4 pages per font. The learning and testing datasets were printed using a laser jet printer. Then, they were scanned as binary images at a resolution of 300 dpi in both directions, using an HP ScanJet 3400C scanner. Figure 4 shows one sample of a scanned paragraph from testing dataset no. 2. In the learning and testing stages, a word that does not satisfy the following constraints is filtered out: the word width and height are at least 12 pixels each, and the maximum width and maximum height are 600 and 200 pixels, respectively. These values were empirically determined and proved adequate to eliminate flecks and some non-textual content. It is worth mentioning that our font recognition decision tree is not designed to recognize every Arabic word. It only recognizes instances of words which belong to the set of common Arabic words defined in Section 2. If a word is outside that set, it is rejected by the tree.

Typeface             Size         Slant           Weight
Simplified Arabic    12, 13, 14   Roman, Italic   Regular, Bold
Traditional Arabic   12, 13, 14   Roman, Italic   Regular, Bold
Tahoma               12, 13, 14   Roman, Italic   Regular, Bold

Table I. Arabic fonts used in our AFR system.

Fig. 4. Sample scanned paragraph of testing dataset no. 2, printed in (Simplified Arabic, Roman, Regular).
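The word-size filter described above can be sketched as follows. The function names and the (width, height) representation of a word's bounding box are hypothetical; the pixel bounds are those stated in the text.

```python
# Sketch of the word-size filter: a word is kept only if its bounding box
# is at least 12x12 pixels and at most 600x200 pixels. These bounds were
# empirically determined to eliminate flecks and some non-textual content.
# Function names and the (width, height) tuple layout are hypothetical.

MIN_SIDE = 12     # minimum width and height, in pixels
MAX_WIDTH = 600   # maximum word width, in pixels
MAX_HEIGHT = 200  # maximum word height, in pixels

def keep_word(width, height):
    """Return True if a word's bounding box passes the size constraints."""
    return (MIN_SIDE <= width <= MAX_WIDTH and
            MIN_SIDE <= height <= MAX_HEIGHT)

def filter_words(boxes):
    """Keep only the (width, height) bounding boxes that pass the filter."""
    return [b for b in boxes if keep_word(*b)]
```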
Therefore, in defining the recognition rate we do not depend on its usual definition,

R = (No. of words correctly recognized / Total no. of words) × 100     (29)

This is because, in usual Arabic text, most words do not belong to our set of common words and are hence rejected by the tree, which would give the incorrect impression that our algorithm is not practical. Instead, we use the following definition:

R' = (No. of words correctly recognized / (Total no. of words - No. of rejected words)) × 100     (30)

The recognition algorithm was run with the threshold DistThreshold set to 0.01 × N_f = 0.01 × 48 = 0.48, where N_f is the number of features. For this threshold value, 46.6% of the total words in dataset no. 1 were rejected by the tree, although the training and testing datasets look similar. In font recognition, however, very small degradations in the intensity values of image pixels produce new variants of word images that no longer pass the threshold tests of the decision tree. The rejection rate can be decreased by increasing DistThreshold; however, as empirically verified, this increases the error rate. In fact, a high rejection rate does little harm as long as an adequate number of common words is successfully recognized, and it is not necessary to recognize every common Arabic word in a paragraph. For the set of words accepted by the decision tree, the overall success rate on testing dataset no. 1 is 93.9%. Error is mainly due to errors in size and slant recognition. Errors in typeface are almost absent, while very few errors in weight are noticed. Simplified Arabic and Traditional Arabic fonts show high size, slant, and overall error rates, a high rejection rate, and a low recognition success rate compared to Tahoma fonts.

For testing dataset no. 2, 89.5% of input words were rejected by the decision tree, which is higher than the rejection rate of dataset no. 1. The reason is that dataset no. 2 contains many classes of words other than the common words on which the decision tree is trained. Of the accepted words, the overall success rate on testing dataset no. 2 is 90.8%. Some fonts show a 100% success rate. Error is mainly due to errors in size, in contrast to the results of dataset no. 1, in which error was mainly due to errors in size and slant recognition. Errors in typeface and weight are almost absent. Some Traditional Arabic fonts show high size and overall error rates and a low recognition success rate compared to Simplified Arabic and Tahoma fonts. The error rate can be decreased by increasing the size of the training dataset.

The algorithm was implemented and run on a Pentium III 866 MHz PC with 128 MB RAM. The average time required to recognize a word font is approximately 0.31 and 0.29 seconds for testing datasets nos. 1 and 2, respectively. This time does not include disk overhead and preprocessing times, as these are counted towards the overall document understanding software and not towards every constituent task in that software. The recognition time can be reduced by using programming optimization techniques and more powerful computers.

The size of the learned decision tree is 15633 nodes, 7816 of which are interior nodes and the rest are leaves. In each interior node, one feature is tested, and the same feature may be used more than once in a path from the root to a leaf node. We have the following important observations about the usage of features in interior nodes of the decision tree:

1. All defined features were used in the tree. The most heavily used feature is the horizontal projection of row no. 0, h(0), of the vertically-normalized word image, which was used in 657 of the 7816 interior nodes. The least used feature is Walsh coefficient no. 9, W(9), which was used in 30 interior nodes.

2.
Most of the heavily used features are very simple features that are not computationally expensive, such as horizontal projections, width, area, and height.

3. The invariant moments are also among the heavily used features; these, however, are computationally expensive.

Table II summarizes feature usage; it shows that horizontal projections occupy 47.1% of the interior nodes. This is a useful result, since it indicates that we can depend on simple features to classify fonts. Next come the invariant moments and Walsh coefficients, which constitute 19.3% and 12.7%, respectively, of the features tested at interior nodes.

7. Conclusion

In this paper, we presented a novel contribution to the a priori AFR problem. A decision tree was our approach to recognizing Arabic fonts. A set of 48 features was used to learn the tree. These features include horizontal projections, Walsh coefficients, invariant moments, and geometrical attributes. A set of 36 fonts was investigated. The total number of learning examples was 72,000. Two datasets were used to test the method. The overall success rate is 90.8%, and some fonts show a 100% success rate. The error rate can be decreased by increasing the size of the training dataset. Recognized fonts of common words can be generalized to their containing paragraphs, assuming that each paragraph contains only a single font, which is practical for printed materials such as newspapers and old books that were not previously stored in electronic format.

Feature Name                            Symbol           Nodes   Percent
Horizontal projections                  h(0) to h(15)    3685    47.1
Invariant moments                       φ1 to φ7         1509    19.3
Walsh coefficients                      W(0) to W(15)    996     12.7
Width                                   D                354     4.5
Area                                    A                265     3.4
Height                                  H                233     3.0
Center of area x coordinate             x                220     2.8
Center of area y coordinate             y                129     1.7
Perimeter                               P                117     1.5
Angle of axis of least second moment    θ                109     1.4
Aspect ratio                            α                103     1.3
Thinness ratio                          T                96      1.2
All features                                             7816    100

Table II. Summary of feature usage in interior nodes of the learned font decision tree.
Nodes: no. of interior nodes using the corresponding feature; Percent: percentage of interior nodes using the feature.

The algorithm was implemented and run on a Pentium III 866 MHz PC with 128 MB RAM. The average time required to recognize a word font is approximately 0.30 seconds. This time can be reduced by using programming optimization techniques and more powerful computers. Regarding feature usage in the decision tree, horizontal projections occupy 47.1% of the interior nodes. This is a useful result, since it indicates that we can depend on simple features to classify fonts. Next come the invariant moments and Walsh coefficients, which constitute 19.3% and 12.7%, respectively, of the features tested at interior nodes.

Acknowledgement

The author thanks the referees for their valuable comments, which helped to improve this paper.

References

1] I.S.I. ABUHAIBA, Arabic Font Recognition Based on Templates, The International Arab Journal of Information Technology, 1 (2003), pp. 33–39.
2] H. SHI AND T. PAVLIDIS, Font Recognition and Contextual Processing for More Accurate Text Recognition, Fourth International Conference on Document Analysis and Recognition (ICDAR'97), (1997) Ulm, Germany.
3] S.L. MANNA, A.M. COLLA, AND A. SPERDUTI, Optical Font Recognition for Multi-Font OCR and Document Processing, 10th International Workshop on Database & Expert Systems Applications, (1999) Florence, Italy.
4] Y. ZHU, T. TAN, AND Y. WANG, Font Recognition Based on Global Texture Analysis, Fifth International Conference on Document Analysis and Recognition (ICDAR'99), (1999) Bangalore, India.
5] A. ZRAMDINI AND R. INGOLD, Optical Font Recognition Using Typographical Features, IEEE Transactions on Pattern Analysis and Machine Intelligence, 20 (1998), pp. 877–882.
6] A. ZRAMDINI AND R. INGOLD, A Study of Document Image Degradation Effects on Font Recognition, Third International Conference on Document Analysis and Recognition (ICDAR'95), (1995) Montreal, Canada.
7] R.C. GONZALEZ AND R.E. WOODS, Digital Image Processing, Prentice Hall, 2002.
8] T.M. MITCHELL, Machine Learning, The McGraw-Hill Companies, Inc., Singapore, 1997.
9] J.R. QUINLAN, Induction of Decision Trees, Machine Learning, 1 (1986), pp. 81–106.
10] J.R. QUINLAN, C4.5: Programs for Machine Learning, San Mateo, CA: Morgan Kaufmann, 1993.
11] U.M. FAYYAD AND K.B. IRANI, On the Handling of Continuous-Valued Attributes in Decision Tree Generation, Machine Learning, 8 (1992), pp. 87–102.
12] I.S.I. ABUHAIBA, Skew Correction of Textual Documents, King Saud University Journal: Computer and Information Sciences Division, 15 (2003), pp. 67–86.

Received: September, 2004
Revised: December, 2004
Accepted: February, 2005

Contact address:
Ibrahim S. I. Abuhaiba
P. O. Box 1276
Department of Electrical and Computer Engineering
Islamic University of Gaza
Gaza, Palestine
[email protected]

IBRAHIM S. I. ABUHAIBA is an associate professor at the Islamic University of Gaza, Department of Electrical and Computer Engineering, Gaza, Palestine. He obtained his Master of Philosophy and Doctor of Philosophy degrees in Britain, in the field of document understanding and pattern recognition. His research interests include computer vision, image processing, document analysis and understanding, pattern recognition, artificial intelligence, and other fields. Dr. Abuhaiba has presented important theorems and more than 30 algorithms in document understanding, and has published many original contributions in the field of document understanding in well-reputed international journals and conferences.