font classification for kannada language

CHAPTER 4
FONT CLASSIFICATION FOR KANNADA LANGUAGE
4.1 Introduction
Character recognition by computer is a challenging problem that has been studied since the
introduction of digital computers. With the rapid development of the internet and multimedia,
Optical Character Recognition (OCR) has advanced considerably. Some OCR systems handle
high-quality printed characters with high recognition rates and are mature enough to be
available commercially. However, most existing OCR systems cannot extract typographical
attributes of fonts such as type, size, serifness and so on. Font recognition plays an
important role in producing re-editable text. Correct identification of the font helps
improve the recognition rate of OCR systems. One of the challenges for researchers is to
reduce rejection rates and achieve lower substitution error rates even on good-quality
machine-printed documents.
Selection of features plays a significant role in achieving a high recognition rate. OCR
development requires that systems distinguish font styles such as italic, bold and sans serif.
Typographically, a font is a particular representation of a typeface design and may have a
particular size, style, weight, spacing of characters, x-height proportion, shape of serifs
and loop axes. The font features, which determine the accuracy with which an OCR system
recognizes a font, can be derived based on:
i. Local attributes of fonts such as boldness, serifness.
ii. Local and/or global typographical features.
iii. Texture analysis.
The kind of features used to recognize fonts depends on the method used. Many research
papers have been published on contour following, classification algorithms, line thinning
etc. Most of this research has focused on isolated characters, either printed or handwritten.
Even a good OCR system requires a considerable amount of user interaction regarding the
layout, and others expect each field to be demarcated prior to the OCR stage. In trainable
systems, considerable user time is required to identify the characters, and the user feels
that the same steps are repeated again and again.
Font classification plays an important role in automatic document analysis, and it is often
a tedious and time-consuming task. For successful recognition, the OCR should know the font
type a priori. Font type plays an important role in character recognition and script
identification. To achieve a better recognition rate, an automatic data processing system
should take into account the font type as well as the content of the document. Furthermore,
font classification can reduce the number of alternative shapes for each class, which in turn
simplifies the recognition of characters across various fonts.
A well-developed OCR system adopts an integrated treatment of the entire document and
well-organized typographic knowledge. To develop such a system, it is necessary to address
some of the problems associated with text data. When texts have unusual typefaces, characters
may be segmented wrongly. Formulae and equations present in the text pose different problems
and may be broken up. Other problems for automatic recognition are mathematical symbols and
lines with drop caps, subscripts or superscripts, which lead to improper recognition and may
be missed altogether. Further, figure captions may be misplaced, paragraphs in multicolumn
documents may appear in the wrong order, headings may be misrecognized and tables may be
disconnected. In such scenarios, omitted characters need to be keyed in and misidentified
characters need to be replaced.
Many methods have been developed to recognize Kannada characters. However perfect a method
may be, it is insufficient for converting complex archival documents into a form that can be
read by a computer. Development of OCR for the Kannada language is at a nascent stage and
poses more challenges to researchers because of its large character set. There is much scope
for the development of OCR systems for Kannada, and a stage needs to be reached where printed
pages are fed into an OCR system and a coded file compatible with the keyed-in version is
obtained. Even though it is clear that font recognition is an important step in automatic
processing, not much work has been carried out on font recognition for the Kannada language.
4.2 Review of some methods to recognize the font type
Font recognition is a fundamental issue in document analysis and recognition. The
effectiveness of an OCR depends on identification of the font. The increasing diversity of
font styles in today’s business scenario limits the performance of an OCR. With advances in
print media and new approaches to information presentation, a large number of different font
types are used in written documents. OCRs that have been developed to work with a single font
or a limited number of font types tend to misinterpret such documents, leading to poor
functioning of the automation system. Many methods have been proposed in the literature. Most
of these works were developed for languages such as English, Chinese and Japanese. A good
success rate is achieved only for a few font types, and the area continues to attract
researchers seeking better results with a wider variety of font types. Most of these methods
are based on typographical features local to the font. Local features of the font such as
serifness, boldness etc. have been used for estimation of font attributes in an OCR system
[Cooperman].
Font types are also identified by using page properties such as the histogram of word
lengths and stroke slopes [Shi and Pavlidis]. Yet another method identifies the font type by
matching clusters of word images obtained from an input document with a database of function
words derived from fonts and document images [Khoubyari and Hull]. Considering the impact of
font type on the performance of OCR systems, the clustering behavior of fonts has been
investigated [Serdar Ozturk, Bulent Sankur, A. Toygar Abak]; this is useful in the design of
robust OCR systems and in reproducing documents in their original appearance. Font features
are extracted using bitmaps, Fourier descriptors, DCT coefficients and eigencharacters. All
these methods depend on local features associated with fonts. A few researchers have
attempted to recognize the font based on its global characteristics [Yong Zhu, Tieniu Tan and
Yunhong Wang]. In this method, an image containing a specific structure is used as input.
Features are then extracted using a multichannel Gabor filter technique and used for font
identification.
An OCR system for printed documents in the Kannada language has been developed using
support vector machines. It works independently of font type and size [T.V. Ashwin and P.S.
Sastry]. The features are extracted by splitting each segment image into a number of zones and
then finding the distribution of the ON pixels in the radial and angular directions. Much work
remains to be carried out on the development of OCR for the Kannada language. When the
document supplied to the OCR contains different font types, the OCR fails to recognize all the
characters properly and may misinterpret one character as another. To enhance the capability
of the OCR, the font type should be identified. When this information is supplied to the OCR,
it enables the necessary routines and identifies the characters properly. This is a
bottleneck for the Kannada language. When documents are prepared in Kannada using the NUDI or
BARAHA software (the most popular Kannada software packages), many font types are used for
better presentation. An OCR developed for Kannada may not work properly with multiple font
types. This has motivated us to develop a method to identify the font type for the Kannada
language, which will be useful in the preprocessing stage of an OCR system. Sample documents
in the fonts Nudi11 and Nudi28 are shown in Figure 4.1. The font Nudi10 is shown in the
Regular, Italic, Bold and Bold Italic styles in Figure 4.2.
Figure 4.1: Font types (a) Nudi11 and (b) Nudi28
Figure 4.2: Font Nudi10 in (a) Regular, (b) Bold, (c) Italic and (d) Bold Italic styles
The organization of the chapter is as follows: Section 4.3 gives an outline of the proposed
method. Details of the algorithm to compute the font features are discussed in Section 4.4.
Classification of the features using the Euclidean distance method is presented in Section
4.5. Section 4.6 discusses the experimental results, and the chapter is concluded in Section
4.7.
4.3 Outline of the proposed method
Data analysis is an essential step in understanding physical processes and is important for
both theoretical and experimental studies (the link between theory and reality is provided
only by data). The methods used for this purpose should give better insight into the process
that actually generates the data and should yield physically meaningful results. The
Hilbert-Huang Transform (HHT) has been designed to fulfill these requirements, and especially
to deal with processes that are nonlinear and non-stationary in nature.
HHT decomposes the signal into intrinsic mode functions (IMFs), which yield instantaneous
frequency data. It was proposed by Huang, N. E., Shen, Z. and Long, S. R. specifically for
analyzing data that exhibit nonlinear and non-stationary characteristics. This technique
overcomes limitations of earlier data analysis methods such as Fourier-based methods.
The Hilbert-Huang transform is the combination of empirical mode decomposition (EMD) and
Hilbert spectral analysis. EMD is the more fundamental part: it is a necessary step that
reduces any given data into a collection of intrinsic mode functions to which the Hilbert
analysis can be applied. An IMF represents a simple oscillatory mode as a counterpart to the
simple harmonic function, but it is much more general: by definition, an IMF is any function
with the same number of extrema and zero crossings, whose envelopes, as defined by all the
local maxima and minima, are symmetric with respect to zero. EMD consists of the following
steps. For any data, we first identify all the local extrema and then connect all the local
maxima by a cubic spline to form the upper envelope. We repeat the procedure for the local
minima to produce the lower envelope. The upper and lower envelopes should cover all the data
between them. Their mean is designated $m_1$, and the difference between the data and $m_1$
is the first proto-IMF (PIMF) component, $h_1$:
$$h_1 = x(t) - m_1 \tag{4.1}$$
This procedure of extracting an IMF is called sifting. By construction, this PIMF, $h_1$,
should satisfy the definition of an IMF, but the change of its reference frame from
rectangular to curvilinear coordinates may cause anomalies in which multiple extrema still
exist between successive zero crossings. To eliminate such anomalies, the sifting process is
repeated as many times as necessary to remove all the riding waves. In the subsequent sifting
steps, $h_1$ is treated as the data. Then
$$h_{11} = h_1 - m_{11} \tag{4.2}$$
where $m_{11}$ is the mean of the upper and lower envelopes of $h_1$.
When this process is repeated up to $k$ times, $h_{1k}$ is given by
$$h_{1k} = h_{1(k-1)} - m_{1k} \tag{4.3}$$
Each time the procedure is repeated, the mean moves closer to zero. Theoretically, this step
can go on for many iterations. But each time, as the iterations make the mean approach zero,
they also make the amplitude variations of the individual waves more even. Thus the iteration
procedure, though it serves the useful purpose of making the mean zero, also drains the
physical meaning out of the resulting components if carried too far. To attain the delicate
balance of achieving a reasonably small mean while retaining enough physical meaning in the
resulting component, two stoppage criteria have been used. The first stops sifting when the
sum of the differences between the present and previous waveforms is smaller than a
predefined value. The second is based on the number of consecutive siftings for which the
numbers of zero crossings and extrema are equal or differ by at most one.
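The first stoppage criterion can be sketched as follows. This is a minimal illustration, not the exact test used in the experiments: the normalisation by the energy of the previous waveform, the helper names `sifting_difference` and `should_stop`, and the threshold value 0.2 are all assumptions introduced here.

```python
import numpy as np


def sifting_difference(h_prev, h_curr, eps=1e-12):
    """Normalised squared difference between two successive sifting results."""
    h_prev = np.asarray(h_prev, dtype=float)
    h_curr = np.asarray(h_curr, dtype=float)
    num = np.sum((h_prev - h_curr) ** 2)
    den = np.sum(h_prev ** 2) + eps  # eps guards against an all-zero waveform
    return float(num / den)


def should_stop(h_prev, h_curr, threshold=0.2):
    """Stop sifting once successive waveforms differ by less than threshold.

    The threshold is an illustrative value, not one taken from the chapter.
    """
    return sifting_difference(h_prev, h_curr) < threshold
```

A smaller threshold forces more sifting passes, which pushes the envelope mean closer to zero at the cost of flattening amplitude variations, which is the trade-off described above.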
The usefulness of empirical mode decomposition and the Hilbert-Huang transform for the
analysis of nonlinear and non-stationary data has been extended to include the analysis of
image data. Because image data can be expressed as an array of rows and columns, this robust
concept is applied to these arrays row by row. Each slice of the image, taken either row-wise
or column-wise, represents local variations of the image being analyzed. Data from natural
phenomena are either nonlinear or non-stationary, or both, and so are the data that form
images of natural processes. Thus, the EMD/HHT approach is especially well suited to image
data, giving frequencies, inverse distances or wavenumbers as a function of time or distance,
along with the amplitudes or energy values associated with these, as well as a sharp
identification of embedded structures. This method is used to identify the font type in a
given Kannada text, which will be useful for automatic identification of the text by the
machine.
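To make the row-by-row treatment concrete, the sketch below turns a grey-level image into a list of one-dimensional signals, each of which could then be passed to a one-dimensional EMD routine. The function name `image_slices` and its `by` parameter are hypothetical and introduced only for illustration.

```python
import numpy as np


def image_slices(image, by="row"):
    """Return the 1-D slices (rows or columns) of a 2-D grey-level image."""
    img = np.asarray(image, dtype=float)
    if img.ndim != 2:
        raise ValueError("expected a 2-D grey-level image")
    if by == "row":
        return [img[r, :] for r in range(img.shape[0])]
    if by == "column":
        return [img[:, c] for c in range(img.shape[1])]
    raise ValueError("by must be 'row' or 'column'")
```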
To achieve better recognition results and processing of fonts, it is important to identify
the correct features of the font type. This in turn improves the performance of the OCR
system. To recognize the different fonts in a text page, images are first acquired and
subjected to empirical mode decomposition to produce IMFs. The IMFs produced differ for each
font type, and this can be used to identify the font type. The IMFs can be used to compute
their total average energy, which serves as a feature. For experimental purposes we have
considered the first four IMFs and the energy associated with them. This decision is based on
the fact that the contribution of the remaining IMFs to the energy is small compared with
the time required to compute them from the data.
4.4 Selection of features using empirical mode decomposition
Font type identification is vital in the construction of an OCR, which automatically
converts text in image format into editable form. Because different types of fonts are used
in text messages to draw the attention of customers, processing such text with an OCR yields
poor results. A few lines of text having the same font type are treated as a block, which is
useful in the development of a robust optical character recognition system. Hence a text
block with a single font type may be treated as an image with a unique texture. In many text
documents the text is typed in a single font type, but in certain situations some important
passages or paragraphs may be presented in different font types to highlight them.
EMD is empirical, intuitive, direct and adaptive, with an a posteriori defined basis derived
from the data. The decomposition is designed to seek the different simple intrinsic modes of
oscillation in any data based on the principle of scale separation. Each of the discrete
oscillatory modes is referred to as an intrinsic mode function (IMF) and satisfies the
following two conditions:
(i) In the whole data set, the number of extrema and the number of zero crossings must
be either equal or differ at most by one, and
(ii) At any point, the mean value of the envelope defined by the local maxima and the
envelope defined by the local minima is zero.
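Condition (i) is easy to check numerically. The following sketch counts extrema and zero crossings of a sampled signal; the function names are hypothetical, and condition (ii) on the envelope mean is not checked here.

```python
import numpy as np


def count_zero_crossings(x):
    """Number of sign changes in x, ignoring samples that are exactly zero."""
    s = np.sign(np.asarray(x, dtype=float))
    s = s[s != 0]
    return int(np.sum(s[1:] != s[:-1]))


def count_extrema(x):
    """Number of local maxima plus minima, treating flat runs as one point."""
    d = np.diff(np.asarray(x, dtype=float))
    s = np.sign(d[d != 0])
    return int(np.sum(s[1:] != s[:-1]))


def satisfies_count_condition(x):
    """True if the extrema and zero-crossing counts differ by at most one."""
    return abs(count_extrema(x) - count_zero_crossings(x)) <= 1
```

A pure sinusoid passes this check, whereas the same sinusoid riding on a constant offset fails it because it never crosses zero.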
To extract the features of the font type, text images of size 162×72 are used for the
experiments. Any noise in the image is removed. The line spacing is normalized. The lines are
padded with characters to avoid the influence of blank spaces in the line. Text is taken as a
gray scale image with 256 gray levels. The text blocks are processed using the empirical mode
decomposition method. It is observed that the high frequency energy differs between font
types.
The total number of IMF components is roughly limited to $\log_2 N$, where $N$ is the total
number of data points. EMD decomposes the original signal $x(t)$ into a set of IMFs through
an iterative procedure. The algorithm is used to extract the features of each font style. For
each font style, its IMFs and residues are computed and stored. These are used for
identification of characters. The EMD algorithm is shown below:
Algorithm 4.1: Empirical mode decomposition
Input: Image containing the font whose features are to be extracted
Step 1: Initialize $r_0(t) = x(t)$, $i = 1$.
Step 2: Extract the $i$-th intrinsic mode function:
(a) Initialize $h_0(t) = r_{i-1}(t)$, $j = 1$.
(b) Extract all the local minima and maxima of $h_{j-1}(t)$.
(c) Interpolate the local maxima and the local minima by cubic splines to form $u_{j-1}(t)$
and $l_{j-1}(t)$, the upper and lower envelopes of $h_{j-1}(t)$, respectively.
(d) Calculate the mean $m_{j-1}(t)$ of the upper and lower envelopes.
(e) Set $h_j(t) = h_{j-1}(t) - m_{j-1}(t)$.
(f) If the stopping criterion is satisfied, i.e. $h_j(t)$ is an IMF, set
$imf_i(t) = h_j(t)$; else go to (b) with $j = j + 1$.
Step 3: Let $r_i(t) = r_{i-1}(t) - imf_i(t)$.
Step 4: If $r_i(t)$ still has at least 2 extrema, go to Step 2 with $i = i + 1$;
else the decomposition procedure ends and $r_i(t)$ is the residue.
Output: The IMFs of the given font type
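The steps of Algorithm 4.1 can be sketched for a one-dimensional signal as below. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments: the envelopes are pinned to the first and last samples to avoid spline extrapolation, the stopping rule is a normalised difference with an illustrative tolerance of 0.2, and the names `emd`, `max_imfs`, `max_sift` and `tol` are introduced here.

```python
import numpy as np
from scipy.interpolate import CubicSpline


def local_extrema(h):
    """Return indices of strict interior local maxima and minima (step b)."""
    d = np.diff(h)
    maxima = np.where((d[:-1] > 0) & (d[1:] < 0))[0] + 1
    minima = np.where((d[:-1] < 0) & (d[1:] > 0))[0] + 1
    return maxima, minima


def envelope_mean(h, maxima, minima):
    """Mean of the cubic-spline upper and lower envelopes (steps c and d)."""
    t = np.arange(len(h))
    last = len(h) - 1
    # Assumption: pin both envelopes to the end samples so that the
    # splines never have to extrapolate beyond the outermost extrema.
    up = np.concatenate(([0], maxima, [last]))
    lo = np.concatenate(([0], minima, [last]))
    upper = CubicSpline(up, h[up])(t)
    lower = CubicSpline(lo, h[lo])(t)
    return 0.5 * (upper + lower)


def emd(x, max_imfs=4, max_sift=50, tol=0.2):
    """Decompose a 1-D signal into at most max_imfs IMFs plus a residue."""
    residue = np.asarray(x, dtype=float).copy()      # Step 1: r_0 = x
    imfs = []
    while len(imfs) < max_imfs:
        maxima, minima = local_extrema(residue)
        if len(maxima) + len(minima) < 2:            # Step 4: too few extrema
            break
        h = residue.copy()                           # Step 2(a)
        for _ in range(max_sift):
            maxima, minima = local_extrema(h)
            if len(maxima) == 0 or len(minima) == 0:
                break
            h_new = h - envelope_mean(h, maxima, minima)   # Step 2(e)
            change = np.sum((h - h_new) ** 2) / (np.sum(h ** 2) + 1e-12)
            h = h_new
            if change < tol:                         # Step 2(f): accept as IMF
                break
        imfs.append(h)
        residue = residue - h                        # Step 3
    return np.array(imfs), residue
```

By construction the IMFs and the residue add back up to the input signal, which gives a simple sanity check on any decomposition. Capping the number of IMFs at four mirrors the choice made in this chapter.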
Many IMFs are generated by the above process. Of these, the first few IMFs contain the
significant information, and subsequent IMFs contribute less to characterizing each font. For
each font, we have used the first four IMFs for computing the feature vector. The algorithm
for computing the feature vector is shown below:
Algorithm 4.2: Computation of feature vector
Input: Image whose feature vector is to be computed
Step 1: Preprocess the text and perform normalization.
Step 2: Calculate the energy functions of each character. Calculate the high frequency
energy $e_x$:
$$e_x = \frac{1}{2N} \sum_{i=1}^{N} \left[ A_{x1}(i) + A_{x2}(i) \right] \tag{4.4}$$
where
$$A_{xj}(i) = \left[ \mathrm{imf}_{xj}(i) \right]^2 + \left[ H(\mathrm{imf}_{xj}(i)) \right]^2 \quad \text{for } j = 1, 2 \tag{4.5}$$
Here $N$ is the number of observations, $N \approx 10 \max(\text{width of the image}, \text{height of the image})$, and
$H(\mathrm{imf}_{xj}(i))$ is the Hilbert transform of $\mathrm{imf}_{xj}(i)$.
Step 3: Let $V = [e_x]$, which is the feature vector for the character.
Output: High frequency energy
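Equations (4.4) and (4.5) can be evaluated directly from the analytic signal, whose squared magnitude equals $\mathrm{imf}^2 + H(\mathrm{imf})^2$. The sketch below assumes the IMFs are stored as rows of an array; the function name `high_frequency_energy` and the `n_modes` parameter are hypothetical, and dividing by `n_modes * N` is an assumed generalisation that reduces to $2N$ for the two modes used in (4.4).

```python
import numpy as np
from scipy.signal import hilbert


def high_frequency_energy(imfs, n_modes=2):
    """Average energy of the first n_modes IMFs, following (4.4) and (4.5)."""
    modes = np.asarray(imfs, dtype=float)[:n_modes]
    n_points = modes.shape[1]
    # hilbert() returns the analytic signal imf + j*H(imf) along each row,
    # so its squared magnitude is imf^2 + H(imf)^2, i.e. A_xj(i) in (4.5).
    amplitude_sq = np.abs(hilbert(modes, axis=1)) ** 2
    return float(amplitude_sq.sum() / (modes.shape[0] * n_points))
```

For two unit-free test modes such as a cosine of amplitude 1 and a cosine of amplitude 0.5, each with a whole number of cycles, the squared amplitudes are 1 and 0.25, so the function returns (1 + 0.25) / 2 = 0.625.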
4.5 Minimum distance classifier
A major problem in image recognition is finding the distance between images. Tangent
distance, Euclidean distance, artificial neural networks or support vector machines may be
used for this purpose. Among all the image metrics, Euclidean distance is the most commonly
used because of its simplicity. Let $x$, $y$ be two $M \times N$ images,
$x = (x_1, x_2, \ldots, x_{MN})$, $y = (y_1, y_2, \ldots, y_{MN})$, where $x_{kN+l}$, $y_{kN+l}$
are the gray levels at location $(k, l)$. The distance between the two images is computed
from the differences between their corresponding pixels. The Weighted Euclidean Distance
(WED) depends on the statistical distribution of the samples. It can be used to compare two
templates, especially if the template is composed of integer values. A feature vector is
computed for each font type and acts as the metric for that font type. The WED measures how
similar the values of two templates are. After obtaining the metrics, the Weighted Euclidean
Distance is computed as follows:
$$WED(k) = \sum_{i=1}^{N} \frac{\left( f_i - f_i^{(k)} \right)^2}{\left( \delta_i^{(k)} \right)^2} \tag{4.6}$$
where
$f_i$ is the $i$-th feature of the unknown font,
$f_i^{(k)}$ is the $i$-th feature of font type $k$,
$\delta_i^{(k)}$ is the standard deviation of the $i$-th feature in template $k$.
The unknown template is matched to font template $k$ when WED is minimum at $k$.
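A minimum distance classifier based on (4.6) can be sketched as follows. The template dictionary layout, the function names and the numbers in the usage example are illustrative assumptions and are not taken from the experimental data.

```python
import numpy as np


def weighted_euclidean_distance(f, mean_k, std_k):
    """WED(k) from (4.6): squared differences scaled by per-feature variance."""
    f, mean_k, std_k = (np.asarray(a, dtype=float) for a in (f, mean_k, std_k))
    return float(np.sum((f - mean_k) ** 2 / std_k ** 2))


def classify(f, templates):
    """Return the template name with the smallest WED, plus all distances.

    templates maps a font name to (mean feature vector, std feature vector).
    """
    distances = {
        name: weighted_euclidean_distance(f, mean_k, std_k)
        for name, (mean_k, std_k) in templates.items()
    }
    return min(distances, key=distances.get), distances


# Hypothetical one-feature templates, for illustration only.
templates = {"font_a": ([0.20], [0.01]), "font_b": ([0.28], [0.02])}
label, distances = classify([0.21], templates)
```

In this toy example the unknown feature 0.21 lies one standard deviation from `font_a` and three and a half from `font_b`, so `font_a` is selected. Scaling by the standard deviation means a feature that varies widely within a font contributes less to the decision.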
4.6 Experimental results
Font recognition using empirical mode decomposition identifies the font type by finding the
nearest font type for a given unknown font using the Euclidean distance classifier. As no
standard database is available for Kannada font types, texts containing different font types
are used for the experiments. To demonstrate the significance of EMD for font classification,
different fonts that are frequently used in Kannada text, along with the font styles Regular,
Italic, Bold and Bold Italic, are used. If the proposed system matches the font type of the
text to the stored one, the experiment is a success; otherwise it is a failure.
To test the suitability of the algorithm, experiments have been conducted using
computer-generated text blocks and scanner-generated text blocks. For each font type, four
font styles, namely regular, bold, italic and bold italic, are used. Kannada Nudi3 software is
used to type the Kannada text, which is then printed and scanned using an HP scanner. Adobe
Photoshop is used for the computer-generated images. An image size of 162×72 is used in the
experiments. Separate text blocks are created for the different fonts. For each font, 10
computer-generated text blocks and 10 scanner-generated text blocks are used. The average
energy of each font type is calculated.
Figure 4.3(a) shows the font type Nudi03, and Figure 4.3(b) shows its empirical mode
decomposition with the first four IMFs. Figure 4.4(a) shows the text block for the font type
Nudi06, and its first four IMFs are depicted in Figure 4.4(b). It is observed that the IMFs
differ for different fonts. Among these IMFs, the first few (the higher frequency IMFs)
contribute more energy than the lower frequency IMFs. Hence only the first four IMFs are used
to compute energies. Along with this, the energy associated with the residue components is
utilized as a feature to classify different fonts.
Figure 4.3(a): Text printed with the font type Nudi03
Figure 4.3(b): The first four IMFs (IMF1 to IMF4) for the font Nudi03 Regular
Figure 4.4(a): Text printed with the font type Nudi28
Figure 4.4(b): The first four IMFs (IMF1 to IMF4) for the font Nudi03 Regular
The feature of each font is computed by calculating the energy of the IMFs of that font. The
energy values are shown in Table 4.1.
Table 4.1: Feature vectors of the different font text blocks
Font Type   Energy Values
            Regular   Bold     Italic   Bold Italic   Average
N01         0.1827    0.2196   0.1991   0.2270        0.2071
N03         0.1871    0.2423   0.2093   0.2494        0.2220
N06         0.2168    0.2363   0.2174   0.2500        0.2301
N07         0.2124    0.2409   0.2116   0.2546        0.2299
N08         0.1963    0.2248   0.1945   0.2210        0.2091
N09         0.1913    0.2239   0.2006   0.2324        0.2120
N10         0.2687    0.2915   0.2796   0.2998        0.2849
N11         0.2060    0.2500   0.2057   0.2569        0.2296
N13         0.2757    0.2682   0.2881   0.2880        0.2800
N22         0.2423    0.2615   0.2663   0.2856        0.2639
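As a quick check on how separable these values are, the sketch below ranks font pairs by the gap between their average energies. The numbers are copied from the Average column of Table 4.1; the function name `closest_pairs` is hypothetical, and using the average alone is a simplification of the feature vector used for classification.

```python
from itertools import combinations

# Average energies copied from the Average column of Table 4.1.
average_energy = {
    "N01": 0.2071, "N03": 0.2220, "N06": 0.2301, "N07": 0.2299, "N08": 0.2091,
    "N09": 0.2120, "N10": 0.2849, "N11": 0.2296, "N13": 0.2800, "N22": 0.2639,
}


def closest_pairs(energies, count=3):
    """Return the count font pairs with the smallest average-energy gap."""
    gaps = [
        (abs(energies[a] - energies[b]), a, b)
        for a, b in combinations(energies, 2)
    ]
    return sorted(gaps)[:count]


for gap, a, b in closest_pairs(average_energy):
    print(f"{a} vs {b}: gap = {gap:.4f}")
```

With the tabulated averages, the smallest gap is between N06 and N07 (0.0002), so these two fonts are the hardest to separate on this single value.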
From the experiment it is clear that a few font types, such as Nudi06 and Nudi07, and Nudi08
and Nudi09, closely resemble each other and have a high confusion rate. Nudi09, Nudi10 and
Nudi13 have features distinct from other fonts and have better recognition rates. The
recognition rates for a few fonts are shown in Table 4.2.
Table 4.2: Recognition rate of fonts and font styles
Font Type   Font styles (recognition rate, %)
            Regular   Bold   Italic   Bold Italic
N01         93.2      93.9   91.8     92.7
N03         86.4      87.1   85.9     86.2
N07         83.5      83.9   84.1     84.6
N09         91.4      93.4   92.1     92.8
N10         90.3      91.7   90.7     91.3
N13         88.6      89.5   87.6     88.4
N22         91.8      92.6   92.2     92.6
4.7 Conclusion
The identification of font type plays an important role in the design of an optical
character recognizer. It helps in the automatic conversion of text images into machine
editable form. Here, a novel method, empirical mode decomposition, is used for identification
of fonts in the Kannada language. This method identifies the font based on global information
rather than local attributes. A block of text printed in each font can be seen as a specific
structure. The numerous font types present in text pose various problems in the construction
of optical character recognizers. Researchers are attempting to identify the font type in a
text document, which will help OCRs identify the characters more efficiently. This is with
the view that each font has its own features, and it is necessary to develop new techniques
to separate these features so that they can be used effectively for font separation.
Each Kannada font type has its own appearance, and this characteristic can be used to
identify the font type. The empirical mode decomposition method is used to extract the
features of a given font type by decomposing it into intrinsic mode functions. The first few
IMFs contain significant information, and their average energy is computed. These values
represent the feature of the font type. A minimum distance classifier is used to recognize
the font. The main advantages of this method are:
i. Lower feature dimension.
ii. Better success rate.
iii. No need to convert the gray image into binary format.
Early detection of the font type helps in better recognition of the text by an optical
character recognizer. Since not much work has been carried out in this area for the Kannada
language so far, we are confident that this new approach will assist the development of
better optical character recognizers for Kannada.
Here, printed Kannada characters are considered for experimentation. The method is content
independent, so the contents of the training and testing documents need not be the same.
Since Kannada writing consists of three levels, the presence of too many consonant conjuncts
affects the results. Line spacing between text lines also affects the computed results, and
necessary care should be taken. This problem can be overcome by computing the features for a
single line of text rather than multiple lines, or by keeping constant spacing between the
lines. This requires additional computation time. The method exhibits strong robustness
against noise and requires no detailed local feature analysis.