CHAPTER 4
FONT CLASSIFICATION FOR KANNADA LANGUAGE

4.1 Introduction

Character recognition by computer is a challenging problem that has been studied since the introduction of digital computers. With the rapid development of the internet and multimedia, Optical Character Recognition (OCR) has advanced considerably. Some OCR systems handle high-quality printed characters with recognition rates good enough that they are now available commercially. However, most existing OCR systems cannot extract typographical attributes of fonts such as type, size, serifness and so on. Font recognition plays an important role in producing re-editable text, and correct identification of the font helps to improve the recognition rate of OCR systems. One of the challenges for researchers is to reduce rejection rates and to achieve lower substitution error rates even on good-quality machine-printed documents. The selection of features plays a significant role in achieving a high recognition rate. OCR development also requires that systems distinguish font styles such as italic, bold and sans serif.

Typographically, a font is a particular representation of a typeface design and may have a particular size, style, weight, character spacing, x-height proportion, serif shape and loop axes. The font features, which determine the accuracy with which an OCR system recognizes a font, can be derived from:
i. Local attributes of fonts such as boldness and serifness.
ii. Local and/or global typographical features.
iii. Texture analysis.
The kind of features used to recognize fonts depends on the method used. Many research papers have been published on contour following, classification algorithms, line thinning etc. Most of this research has focused on isolated characters, either printed or handwritten.
Even a good OCR system requires a considerable amount of user interaction regarding the layout, and some systems expect each field to be demarcated prior to the OCR stage. In trainable systems, considerable user time is required to identify the characters, and the user feels that the same steps are repeated again and again.

Font classification plays an important role in automatic document analysis, and it is often a tedious and time-consuming task. The OCR should know the font type a priori for successful recognition. Font type plays an important role in character recognition and script identification. To achieve a better recognition rate, an automatic data processing system should take into account the font type as well as the content of the document. Font classification also reduces the number of alternative shapes for each character class, which in turn reduces the effort needed to recognize characters printed in various fonts.

A well-developed OCR system adopts an integrated treatment of the entire document and well-organized typographic knowledge. To develop such a system, it is necessary to address some of the problems associated with text data. If texts have unusual typefaces, characters may be segmented wrongly. Formulae and equations in the text pose different problems and may be broken apart. Mathematical symbols, and lines containing drop caps, subscripts or superscripts, lead to improper recognition and may be missed altogether. Further, figure captions may be misplaced, paragraphs in multicolumn documents may appear in the wrong order, headings may be misrecognized and tables may be disconnected. In such scenarios, omitted characters need to be keyed in and misidentified characters need to be replaced. Many methods have been developed to recognize Kannada characters.
However perfect such methods may be, they are insufficient for converting complex archival documents into a form that can be read by a computer. The development of OCR for the Kannada language is at an early stage and poses more challenges to researchers because of the large character set. There is much scope for the development of OCR systems for Kannada, and a stage needs to be reached where printed pages are fed into an OCR system and a coded file compatible with the keyed-in version is obtained. Even though it is clear that font recognition is an important step in automatic processing, not much work has been carried out on font recognition for the Kannada language.

4.2 Review of some methods to recognize the font type

Font recognition is a fundamental issue in document analysis and recognition. The effectiveness of an OCR system depends on identification of the font. The increasing diversity of font styles in today's business scenario limits OCR performance. With advances in print media and new approaches to information presentation, a large number of different fonts are used in written documents. OCR systems that have been developed to work with a single font or a limited number of font types tend to misinterpret such documents, leading to poor functioning of the automation system.

Many methods have been proposed in the literature. Most of them were developed for languages such as English, Chinese and Japanese. A good success rate has been achieved only for a few font types, and researchers continue to seek better results across a variety of font types. Most of these methods are based on typographical features local to the font. Local features such as serifness and boldness have been used to estimate font attributes in an OCR system [Cooperman]. Font types have also been identified using page properties such as the histogram of word lengths and stroke slopes [Shi and Pavlidis].
Yet another method identifies the font type by matching clusters of word images obtained from an input document against a database of function words derived from fonts and document images [Khoubyari and Hull]. Considering the impact of font type on the performance of OCR systems, the clustering behavior of fonts has been investigated [Serdar Ozturk, Bulent Sankur and A. Toygar Abak]; this is useful for designing robust OCR systems and for reproducing documents in their original appearance. Font features are extracted using bitmaps, Fourier descriptors, DCT coefficients and eigencharacters. All these methods depend on local features associated with fonts.

A few researchers have attempted to recognize the font based on its global characteristics [Yong Zhu, Tieniu Tan and Yunhong Wang]. In this method, an image containing a specific structure is used as input. Features are extracted using a multichannel Gabor filter technique and are then used for font identification.

An OCR system for printed documents in the Kannada language has been developed using support vector machines. It works independently of font type and size [T.V. Ashwin and P.S. Sastry]. The features are extracted by splitting each segment image into a number of zones and then finding the distribution of the ON pixels in the radial and angular directions.

Much work remains to be carried out on the development of OCR for the Kannada language. When a document supplied to the OCR contains different font types, the OCR fails to recognize all the characters properly and may misinterpret one character as another. To enhance the capability of the OCR, the font type should be identified. When this information is supplied to the OCR, it enables the necessary routines and the characters are identified properly. This is a bottleneck for the Kannada language.
When documents are prepared in the Kannada language using either the NUDI or the BARAHA software (the most popularly used Kannada software), many font types are used for better presentation. An OCR developed for Kannada may not work properly with multiple font types. This has motivated us to develop a method to identify the font type for the Kannada language, which will be useful in the preprocessing stage of an OCR system. Sample documents in the fonts Nudi11 and Nudi28 are shown in Figure 4.1. The font Nudi10 is shown in the Regular, Bold, Italic and Bold Italic styles in Figure 4.2.

Figure 4.1: Font types (a) Nudi11 and (b) Nudi28

Figure 4.2: Font Nudi10 in (a) Regular, (b) Bold, (c) Italic and (d) Bold Italic styles

The organization of the chapter is as follows. Section 4.3 gives an outline of the proposed method. Details of the algorithm to compute the font features are discussed in Section 4.4. Classification of the features using the Euclidean distance method is presented in Section 4.5. Section 4.6 discusses the experimental results and the chapter is concluded in Section 4.7.

4.3 Outline of the proposed method

Data analysis is an essential step in understanding physical processes and is important for both theoretical and experimental studies, since the link between theory and reality is provided only by data. The methods used for this purpose should give insight into the process that actually generates the data, and they should yield physically meaningful results that can be used for a better understanding of that process. The Hilbert-Huang Transform (HHT) was designed to fulfil these requirements, and especially to deal with processes that are nonlinear and non-stationary in nature. The HHT decomposes a signal into intrinsic mode functions (IMFs), from which instantaneous frequency data are obtained.
The HHT was proposed by Huang, N. E., Shen, Z. and Long, S. R. specifically for analyzing data that exhibit nonlinear and non-stationary characteristics. It overcomes the limitations of earlier data analysis methods such as Fourier-based methods. The Hilbert-Huang transform combines empirical mode decomposition (EMD) with Hilbert spectral analysis. The EMD method is the more fundamental of the two: it is the necessary step that reduces any given data into a collection of intrinsic mode functions to which the Hilbert analysis can be applied. An IMF represents a simple oscillatory mode as a counterpart to the simple harmonic function, but it is much more general: by definition, an IMF is any function with the same number of extrema and zero-crossings, whose envelopes, as defined by all the local maxima and minima, are symmetric with respect to zero.

EMD consists of the following steps. For any data, we first identify all the local extrema and then connect all the local maxima by a cubic spline to form the upper envelope. We repeat the procedure for the local minima to produce the lower envelope. The upper and lower envelopes should cover all the data between them. Their mean is designated $m_1$, and the difference between the data and $m_1$ is the first proto-IMF (PIMF) component, $h_1$:

$h_1 = x(t) - m_1$    (4.1)

This procedure of extracting an IMF is called sifting. By construction, the PIMF $h_1$ should satisfy the definition of an IMF, but the change of its reference frame from rectangular to curvilinear coordinates may cause anomalies in which multiple extrema still exist between successive zero-crossings. To eliminate such anomalies, the sifting process is repeated as many times as necessary to eliminate all the riding waves. In the subsequent sifting steps, $h_1$ is treated as the data. Then

$h_{11} = h_1 - m_{11}$    (4.2)

where $m_{11}$ is the mean of the upper and lower envelopes of $h_1$.
When this process is repeated up to $k$ times, $h_{1k}$ is given by

$h_{1k} = h_{1(k-1)} - m_{1k}$    (4.3)

Each time the procedure is repeated, the mean moves closer to zero. Theoretically, this step can go on for many iterations. However, as the iterations make the mean approach zero, they also make the amplitude variations of the individual waves more even. Thus the iteration procedure, though serving the useful purpose of making the mean zero, also drains the physical meaning out of the resulting components if carried too far. To attain the delicate balance between achieving a reasonably small mean and retaining enough physical meaning in the resulting component, two stoppage criteria are used. The first is met when the sum of the differences between the present and previous waveforms is smaller than a pre-defined value. The second is based on the number of consecutive siftings in which the numbers of zero-crossings and extrema are equal or differ at most by one.

The usefulness of the empirical mode decomposition/Hilbert-Huang transform for the analysis of nonlinear and non-stationary data has been extended to include the analysis of image data. Because image data can be expressed as an array of rows and columns, this robust concept is applied to these arrays row by row. Each slice of the image data, either row-wise or column-wise, represents local variations of the image being analyzed. Data from natural phenomena are either nonlinear or non-stationary, or both, and so are the data that form images of natural processes. Thus, the EMD/HHT approach is especially well suited to image data, giving frequencies, inverse distances or wavenumbers as a function of time or distance, along with the amplitudes or energy values associated with these, as well as a sharp identification of embedded structures.
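The sifting procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names (`envelope_mean`, `sift_imf`, `emd`), the threshold `sd_threshold`, the iteration cap `max_sift`, and the choice of pinning both envelopes to the end points of the signal are all assumptions made here. Only the first stoppage criterion (a small normalized difference between successive waveforms) is implemented.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import argrelextrema


def envelope_mean(h):
    """Mean of the cubic-spline upper and lower envelopes of h.

    Returns None when h has fewer than two interior extrema.
    """
    n = len(h)
    maxima = argrelextrema(h, np.greater)[0]
    minima = argrelextrema(h, np.less)[0]
    if len(maxima) + len(minima) < 2:
        return None
    # Assumption: pin both envelopes to the end points so that the
    # splines never have to extrapolate beyond the outermost extrema.
    maxima = np.r_[0, maxima, n - 1]
    minima = np.r_[0, minima, n - 1]
    t = np.arange(n)
    upper = CubicSpline(maxima, h[maxima])(t)
    lower = CubicSpline(minima, h[minima])(t)
    return (upper + lower) / 2.0


def sift_imf(r, sd_threshold=0.2, max_sift=50):
    """Extract one IMF from r by repeated sifting (Equations 4.1 to 4.3)."""
    h = r.copy()
    for _ in range(max_sift):
        m = envelope_mean(h)
        if m is None:
            break
        h_new = h - m
        # Normalized squared difference between successive waveforms.
        sd = np.sum((h - h_new) ** 2) / (np.sum(h ** 2) + 1e-12)
        h = h_new
        if sd < sd_threshold:
            break
    return h


def emd(x, max_imfs=4):
    """Decompose a 1-D signal into at most max_imfs IMFs plus a residue."""
    r = np.asarray(x, dtype=float).copy()
    imfs = []
    for _ in range(max_imfs):
        if envelope_mean(r) is None:  # residue has too few extrema
            break
        imf = sift_imf(r)
        imfs.append(imf)
        r = r - imf
    return imfs, r
```

By construction the IMFs and the residue add up to the input signal. For a text block stored as a 2-D array `img` (a hypothetical name), a single row could be decomposed with `emd(img[row])`.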
The EMD/HHT method is used here to identify the font type in a given Kannada text, which will be useful for automatic identification of the text by machine. To achieve better recognition results and processing of fonts, it is important to identify the correct features of the font type. This in turn improves the performance of the OCR system. To recognize the different fonts in a text page, images are first acquired and subjected to empirical mode decomposition to produce IMFs. The IMFs produced differ for each font type, and this can be used to identify the font type. The IMFs are used to compute the total average energy, which serves as a feature. For the experiments we have considered the first four IMFs and the energy associated with them. This decision is based on the fact that the contribution of the remaining IMFs to the energy is small compared with the time required to compute them from the data.

4.4 Selection of features using empirical mode decomposition

Font type identification is vital in the construction of an OCR system that automatically converts text in image format to editable form. As different types of fonts are used in text messages to draw the attention of customers, processing of such text by an OCR yields poor results. A few lines of text having the same font type are treated as a block, which is useful in the development of a robust optical character recognition system. Hence a text block with a single font type may be treated as an image of unique texture. In many text documents the text is typed in a single font type; in certain situations, to highlight important passages or paragraphs, the text may be presented in different font types.

EMD is empirical, intuitive, direct and adaptive, with an a posteriori defined basis derived from the data.
The decomposition is designed to seek the different simple intrinsic modes of oscillation in any data, based on the principle of scale separation. Each of the discrete oscillatory modes is referred to as an intrinsic mode function and satisfies two conditions:
(i) In the whole data set, the number of extrema and the number of zero-crossings must either be equal or differ at most by one, and
(ii) At any point, the mean value of the envelope defined by the local maxima and the envelope defined by the local minima is zero.

To extract the features of the font type, text images of size 162×72 are used in the experiments. Noise in the image, if any, is removed. The line spacing is normalized. Lines are padded with characters to avoid the influence of blank spaces in the line. The text is represented as a grayscale image with 256 gray levels. The text blocks are processed using the empirical mode decomposition method. It is observed that the high-frequency energy differs between font types. The total number of IMF components is roughly limited to $\log_2 N$, where $N$ is the total number of data points. EMD decomposes the original signal $x(t)$ into a set of IMFs through an iterative procedure. The algorithm is used to extract the features of each font style. For each font style, its IMFs and residue are computed and stored. These are used in the identification step. The EMD algorithm is shown below.

Algorithm 4.1: Empirical mode decomposition
Input: Image containing the font whose features are to be extracted
Step 1: Initialize $r_0(t) = x(t)$, $i = 1$.
Step 2: Procedure to extract the $i$th intrinsic mode function
(a) Initialize: $h_0(t) = r_{i-1}(t)$, $j = 1$.
(b) Extract all the local minima and maxima of $h_{j-1}(t)$.
(c) Interpolate the local maxima and the local minima by cubic splines to form $u_{j-1}(t)$ and $l_{j-1}(t)$, the upper and lower envelopes of $h_{j-1}(t)$, respectively.
(d) Calculate the mean $m_{j-1}(t)$ of the upper and lower envelopes.
(e) $h_j(t) = h_{j-1}(t) - m_{j-1}(t)$
(f) If the stopping criterion is satisfied, i.e. $h_j(t)$ is an IMF, set $\mathrm{imf}_i(t) = h_j(t)$; else go to (b) with $j = j + 1$.
Step 3: Let $r_i(t) = r_{i-1}(t) - \mathrm{imf}_i(t)$.
Step 4: If $r_i(t)$ still has at least 2 extrema, go to Step 2 with $i = i + 1$; else the decomposition procedure ends and $r_i(t)$ is the residue.
Output: The IMFs of the given font type

Many IMFs are generated by the above process. Of these, the first few IMFs contain the significant information and the subsequent IMFs contribute less to characterizing each font. For each font, we have used the first four IMFs to compute the feature vector. The algorithm for computing the feature vector is shown below.

Algorithm 4.2: Computation of feature vector
Input: Image whose feature vector is to be computed
Step 1: Preprocess the text and perform normalization.
Step 2: Calculate the energy function of each character. Calculate the high-frequency energy $e_x$:

$e_x = \frac{1}{2N} \sum_{i=1}^{N} \left[ A_x^1(i) + A_x^2(i) \right]$    (4.4)

where

$A_x^j(i) = [\mathrm{imf}_x^j(i)]^2 + [H(\mathrm{imf}_x^j(i))]^2$, for $j = 1, 2$    (4.5)

Here $N$ is the number of observations, $N \approx 10 \max(\text{width of the image}, \text{height of the image})$, and $H(\mathrm{imf}_x^j(i))$ is the Hilbert transform of $\mathrm{imf}_x^j(i)$.
Step 3: Let $V = [e_x]$, which is the feature vector for the character.
Output: High-frequency energy

4.5 Minimum distance classifier

A major problem in image recognition is finding the distance between images. Tangent distance, Euclidean distance, artificial neural networks or support vector machines may be used for this purpose. Among all image metrics, Euclidean distance is the most commonly used because of its simplicity. Let $x$ and $y$ be two $M \times N$ images, $x = (x_1, x_2, \ldots, x_{MN})$, $y = (y_1, y_2, \ldots, y_{MN})$, where $x_{kN+l}$ and $y_{kN+l}$ are the gray levels at location $(k, l)$. The distance between the two images is found from the distances between their corresponding pixels.
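As an illustration of Step 2 of Algorithm 4.2, the energy of Equations (4.4) and (4.5) can be sketched in Python. This is a sketch under assumptions rather than the code used for the reported results: the function name `high_frequency_energy` is hypothetical, $N$ is taken to be the number of samples in each IMF, and the Hilbert transform is obtained from `scipy.signal.hilbert`, which returns the analytic signal $\mathrm{imf} + iH(\mathrm{imf})$, so its squared magnitude equals $A_x^j(i)$.

```python
import numpy as np
from scipy.signal import hilbert


def high_frequency_energy(imf1, imf2):
    """Energy e_x of Equation (4.4) computed from the first two IMFs.

    Assumption: N is the number of samples in each IMF.
    """
    total = 0.0
    for imf in (imf1, imf2):
        # Analytic signal imf + i*H(imf); |.|**2 = imf**2 + H(imf)**2,
        # which is A_x^j(i) of Equation (4.5).
        analytic = hilbert(np.asarray(imf, dtype=float))
        total += np.sum(np.abs(analytic) ** 2)
    return total / (2 * len(imf1))
```

For two pure sinusoids of amplitudes 1 and 0.5 with a whole number of cycles in the window, the two $A$ terms are 1 and 0.25 at every sample, so the function returns (1 + 0.25)/2 = 0.625.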
The Weighted Euclidean Distance (WED) depends on the statistical distribution of the samples. It can be used to compare two templates, especially if the template is composed of integer values. A feature vector is computed for each font type and acts as the metric for that font type. The WED gives a measure of how similar the collections of values in two templates are. After the metrics are obtained, the Weighted Euclidean Distance is computed as follows:

$\mathrm{WED}(k) = \sum_{i=1}^{N} \frac{\left( f_i - f_i^{(k)} \right)^2}{\left( \delta_i^{(k)} \right)^2}$    (4.6)

where
$f_i$ is the $i$th feature of the unknown font,
$f_i^{(k)}$ is the $i$th feature of font type $k$,
$\delta_i^{(k)}$ is the standard deviation of the $i$th feature in template $k$.
The unknown template is found to match font template $k$ when the WED is minimum at $k$.

4.6 Experimental results

Font recognition using empirical mode decomposition identifies the font type by finding the nearest font type for a given unknown font using the Euclidean distance classifier. As no standard database is available for Kannada font types, texts containing different font types are used in the experiments. To demonstrate the significance of EMD for font classification, fonts that are frequently used in Kannada text are tested in the Regular, Bold, Italic and Bold Italic styles. If the proposed system matches the font type of the text to the stored one, the experiment is a success; otherwise it is a failure.

To test the suitability of the algorithm, experiments have been conducted using computer-generated text blocks and scanner-generated text blocks. For each font type, four font styles are used, namely regular, bold, italic and bold italic. The Kannada Nudi3 software is used to type the Kannada text, which is then printed and scanned using an HP scanner. Adobe Photoshop is used for the computer-generated images. An image size of 162×72 is used in the experiments. Separate text blocks are created for the different fonts.
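The minimum distance rule of Equation (4.6) can be sketched as follows. This is an illustrative sketch only: the function names, the template layout (a dictionary mapping a font name to its mean feature vector and standard deviations) and the numeric values in the usage example are hypothetical and are not taken from the experiments.

```python
import numpy as np


def weighted_euclidean_distance(f, template_mean, template_std):
    """WED of Equation (4.6) between feature vector f and one font template."""
    f = np.asarray(f, dtype=float)
    mean = np.asarray(template_mean, dtype=float)
    std = np.asarray(template_std, dtype=float)
    return float(np.sum((f - mean) ** 2 / std ** 2))


def classify_font(f, templates):
    """Return the template name k for which WED(k) is minimum.

    templates maps a font name to a (mean vector, std vector) pair.
    """
    return min(
        templates,
        key=lambda k: weighted_euclidean_distance(f, *templates[k]),
    )


# Hypothetical templates, for illustration only.
templates = {
    "font_A": ([0.20], [0.01]),
    "font_B": ([0.28], [0.02]),
}
label = classify_font([0.235], templates)
```

In this example the unknown feature 0.235 is numerically closer to the mean of `font_A`, but `font_B` has the larger standard deviation, so its weighted distance is smaller and `font_B` is selected. This is the sense in which the WED depends on the statistical distribution of the samples.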
For each font, 10 computer-generated text blocks and 10 scanner-generated text blocks are used. The average energy of each font type is calculated.

Figure 4.3(a) shows the font type Nudi03 and Figure 4.3(b) shows its empirical mode decomposition into the first four IMFs. Figure 4.4(a) shows the text block for the font type Nudi06 and its first four IMFs are depicted in Figure 4.4(b). It is observed that the IMFs are different for different fonts. Among these IMFs, the first few (the higher-frequency IMFs) contribute more energy than the lower-frequency ones. Hence only the first four IMFs are used to compute energies. Along with this, the energy associated with the residue components is used as a feature to classify different fonts.

Figure 4.3(a): Text printed with the font type Nudi03

Figure 4.3(b): The first four IMFs (IMF1, IMF2, IMF3, IMF4) for the font Nudi03 Regular

Figure 4.4(a): Text printed with the font type Nudi28

Figure 4.4(b): The first four IMFs (IMF1, IMF2, IMF3, IMF4) for the font Nudi03 Regular

The feature of each font is computed by calculating the energy of the IMFs of that font. The energy values are shown in Table 4.1.

Table 4.1: Feature vectors of the different font text blocks

             Energy values
Font type    Regular   Bold     Italic   Bold Italic   Average
N01          0.1827    0.2196   0.1991   0.2270        0.2071
N03          0.1871    0.2423   0.2093   0.2494        0.2220
N06          0.2168    0.2363   0.2174   0.2500        0.2301
N07          0.2124    0.2409   0.2116   0.2546        0.2299
N08          0.1963    0.2248   0.1945   0.2210        0.2091
N09          0.1913    0.2239   0.2006   0.2324        0.2120
N10          0.2687    0.2915   0.2796   0.2998        0.2849
N11          0.2060    0.2500   0.2057   0.2569        0.2296
N13          0.2757    0.2682   0.2881   0.2880        0.2800
N22          0.2423    0.2615   0.2663   0.2856        0.2639

From the experiments it is clear that a few font types, such as Nudi06 and Nudi07, and Nudi08 and Nudi09, closely resemble each other and have a high confusion rate.
Nudi09, Nudi10 and Nudi13 have features distinct from those of the other fonts and have better recognition rates. The recognition rates for a few fonts are shown in Table 4.2.

Table 4.2: Recognition rate of fonts and font styles

             Font styles (recognition rate, %)
Font type    Regular   Bold   Italic   Bold Italic
N01          93.2      93.9   91.8     92.7
N03          86.4      87.1   85.9     86.2
N07          83.5      83.9   84.1     84.6
N09          91.4      93.4   92.1     92.8
N10          90.3      91.7   90.7     91.3
N13          88.6      89.5   87.6     88.4
N22          91.8      92.6   92.2     92.6

4.7 Conclusion

The identification of font type plays an important role in the design of an optical character recognizer. It helps in the automatic conversion of text images into machine-editable form. Here a novel method, empirical mode decomposition, is used for the identification of fonts in the Kannada language. This method identifies the font based on global information rather than local attributes. A block of text printed in each font can be seen as a specific structure. The numerous font types present in text pose various problems in the construction of optical character recognizers. Researchers are attempting to identify the font type in a text document so that OCR systems can identify the characters more efficiently. The underlying view is that each font has its own features, and new techniques are needed to separate these features so that they can be used effectively for font separation.

Each Kannada font type has its own appearance, and this characteristic can be used to identify the font type. The empirical mode decomposition method is used to extract the features of a given font type by decomposing it into intrinsic mode functions. The first few IMFs contain significant information and their average energy is computed. These values represent the feature of the font type. A minimum distance classifier is used to recognize the font. The main advantages of this method are:
i. Low feature dimension.
ii. Better success rate.
iii. No need to convert the gray image into binary format.

Early detection of the font type helps in better recognition of the text by an optical character recognizer. Since not much work has been carried out in this area for the Kannada language so far, we expect that this new approach will assist the development of optical character recognizers for Kannada. Here, printed Kannada characters are considered for experimentation. The method is content independent, so the contents of the training and testing documents need not be the same. Since Kannada writing consists of three levels, the presence of too many consonant conjuncts affects the results. The spacing between text lines also affects the computed results, and the necessary care should be taken. This problem can be overcome either by computing the features for a single line of text rather than multiple lines, or by keeping the space between lines constant; this requires additional computation time. The method exhibits strong robustness against noise and requires no detailed local feature analysis.