Morphology Based Handwritten Line Segmentation

Morphology Based Handwritten Line Segmentation Using
Foreground and Background Information
Partha Pratim Roy
Computer Vision Centre,
Universitat Autònoma de
Barcelona, 08193, Bellaterra
(Barcelona), Spain.
[email protected]
Umapada Pal
Computer Vision and Pattern
Recognition Unit, Indian
Statistical Institute,
Kolkata-108, India
[email protected]
Josep Lladós
Computer Vision Centre,
Universitat Autònoma de
Barcelona, 08193, Bellaterra
(Barcelona), Spain.
[email protected]
Abstract
Text line segmentation is currently an important topic
of research in historical document processing. Because of
inter-line distance variability and base-line skew
variability, line segmentation in unconstrained
handwritten documents is very difficult. The task becomes
more complicated when two consecutive text lines overlap
or inter-penetrate. In this paper we propose a method, based
mostly on morphological operations and the run-length
smearing algorithm (RLSA), to segment individual text lines
from unconstrained handwritten document images. First,
RLSA is applied to obtain each individual word as a component.
Next, the foreground portion of this smoothed image is
eroded to get seed components from the individual
words of the document. Erosion is also done on the
background portions to find boundary information of the
text lines. Finally, using the positional information of the
seed components and the boundary information, the lines
are segmented. We tested our scheme on images of five
different scripts and obtained encouraging results from
the experiments.
Keywords: Handwritten Document, Mathematical
morphology, RLSA, Handwritten Line Segmentation.
1. Introduction
At present, text line segmentation is an important topic
of research in the area of historical document processing. It
greatly affects the quality of word and character segmentation.
Segmentation of unconstrained handwritten text lines is
difficult because of inter-line distance variability and
base-line skew variability. Components of two consecutive
text lines may touch or overlap in unconstrained
handwritten text. In Indian languages such situations occur
frequently because of several modified characters. These
overlapping or touching characters greatly complicate the
line segmentation task. There are many methods for text
line segmentation [1,3-4,6,10,13-14,16]. Global projection
analysis of black pixels is often used for text line
segmentation, but this method does not work properly when
overlapping or skewed text lines occur. Some researchers
have modified this method using a partial projection
method [16]: the input image is divided into vertical
stripes and segmentation is done based on the projection
profiles of these stripes. Pal and Datta [10] proposed a
modified stripe-based technique using the water
reservoir concept. Some studies decompose the text into
individual components [14], which are then grouped into
individual text lines by means of a hierarchical clustering
procedure. This method cannot assign a character
to its correct group when overlapping or inter-penetration
occurs between text lines. Techniques based
on statistical modeling [4], thinning [13], linear
programming [15], level set [3], HMM [6], etc. are also
used for text line segmentation.
In this paper we propose a scheme to segment
unconstrained handwritten document pages of Indian
scripts into individual text lines. We use both
foreground and background information to segment each
line. First, a horizontal Run Length Smearing
Algorithm (RLSA) is applied to the input image. The
threshold for RLSA is computed from the height
of the text lines, which is estimated using the
water reservoir concept. Next, the foreground portion of
this smoothed image is eroded to get seed
components from the individual words of the document.
These seed components generally represent the central
portion of individual words of a text line. This erosion also
reduces the touching effect of modified characters and
makes the line segmentation task easier. Erosion is also
done in the background region to find the upper and lower
boundary information of a text line. Finally, using the
positional information of the seed components and the
boundary information, individual lines are segmented.
The rest of the paper is organized as follows. Section 2
discusses the properties of the different scripts used in our
experiment. Estimation of text line height is described in
Section 3. Section 4 explains the proposed line
segmentation method. The experimental
results are discussed in Section 5, and the conclusion is
given in Section 6.
2. Properties of Different Scripts Used in
Our Experiment
In our experiment we consider text lines of Devnagari,
Bangla, Oriya, Gujarati and English scripts. We briefly
discuss the properties of the Devnagari,
Bangla, Oriya and Gujarati scripts here. Devnagari is the
most popular script in India, and
the most popular Indian language, Hindi, is written in
Devnagari script. Nepali, Sanskrit and Marathi are also
written in Devnagari script. Moreover, Hindi is the national
language of India and the third most popular language in
the world [9]. In modern Devnagari script there are 14
vowels and 37 consonants. These characters may be called
basic characters.
Bangla, the second most popular language in India and
the fifth most popular language in the world, is an ancient
Indo-Aryan language. The Bangla script is used to write
the Bangla, Assamese and Manipuri languages. Bangla
is the national language of Bangladesh and the
official language of the Indian state of West Bengal. The
alphabet of the modern Bangla script consists of 50 basic
characters (11 vowels and 39 consonants).
Figure 1. Basic characters of the (a) Bangla and (b) Devnagari
alphabets. The first eleven characters are vowels
and the rest are consonants in both alphabet sets.
In Devnagari and Bangla scripts, most of the characters
have a horizontal line at the upper part. See Fig.1, where
the basic characters of the Devnagari and Bangla scripts are
shown. When two or more characters sit side by side to
form a word, these horizontal lines touch and generate a
long line called the head-line. In Devnagari/Bangla script a
vowel following a consonant takes a modified shape,
which, depending on the vowel, is placed at the left, right
(or both) or bottom of the consonant [2]. These are called
modified characters. A consonant or vowel following a
consonant sometimes takes a compound orthographic
shape, which we call a compound character. A set of
modified characters of the Bangla and Devnagari scripts is
shown in Fig.2.
Gujarati is a popular language spoken by about 46
million people in the Indian states of Gujarat, Maharashtra,
Rajasthan, Karnataka and Madhya Pradesh. There are 46
basic characters (12 vowels and 34 consonants) in Gujarati.
Oriya is another popular language and script of India.
It is used mainly in the state of Orissa, where
it is the official language. The alphabet of
the modern Oriya script consists of 52 basic characters (11
vowels and 41 consonants). As in the Devnagari and Bangla
scripts, modified and compound characters are also present
in the Oriya and Gujarati scripts.
Since modified characters may sit at the top or bottom
of the consonant in these scripts, words of two consecutive
text lines may touch because of these modified characters.
Such touching through modified characters complicates our
line segmentation task.
Figure 2. Examples of Bangla and Devnagari modified
characters.
3. Estimation of Text Line Height
To compute the height of the text lines in a
document page, we apply the water reservoir concept. The
water reservoir principle is as follows. If water is poured
from the top (bottom) of a component, the cavity regions of the
component where water will be stored are considered as
top (bottom) reservoirs [8]. For an illustration see Fig.3.
For each reservoir we compute its height. By the height of a
reservoir we mean the perpendicular distance of the base
point (the deepest border point of a reservoir) from the
water flow level of the reservoir. The heights of the
reservoirs obtained from every component of a document
page are computed, and a height histogram is built
from these reservoir heights. In handwritten documents
there is a variety of character sizes, and many characters
may touch because of the handwriting styles of different
individuals. As a result, connected component analysis
cannot give proper text line height information. To get proper
height information we take the average height (HL) of those
reservoirs whose heights lie in the right half of the height
histogram. This is done to ignore the small reservoirs in
our height computation. HL gives an estimate of the
height of a text line. It is used to determine the
parameters of the smoothing algorithm and the size of
the structuring element of the morphological operations used in
our segmentation approach.
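The averaging step above can be illustrated with a minimal Python sketch. It assumes that the reservoir heights have already been measured, and it interprets the "right half of the height histogram" as the heights at or above the midpoint between the smallest and largest height; this interpretation and the function name estimate_line_height are assumptions made for illustration only.

```python
def estimate_line_height(reservoir_heights):
    """Average the reservoir heights lying in the upper half of their range."""
    if not reservoir_heights:
        raise ValueError("need at least one reservoir height")
    lowest, highest = min(reservoir_heights), max(reservoir_heights)
    # Assumption: "right half" of the histogram = heights >= range midpoint.
    midpoint = (lowest + highest) / 2.0
    tall = [h for h in reservoir_heights if h >= midpoint]
    return sum(tall) / len(tall)


# Small reservoirs (3, 4, 5) are ignored; only 20, 22 and 24 are averaged.
heights = [3, 4, 5, 20, 22, 24]
hl = estimate_line_height(heights)
print(hl)
```

With these hypothetical heights, the three small reservoirs fall in the left half of the range and do not influence HL.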
Figure 3. A top water reservoir and its different features
are shown. Water reservoir is marked by grey shade.
4. Proposed Line Detection Algorithm
We work on binary images. To convert
the original grey-level document images into binary images,
we apply the algorithm due to Otsu [7]. The binary
image may contain some small components, and we
remove such components before line segmentation. The
original image and the resultant binary image are shown in
Fig.4(a) and 4(b).
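For readers unfamiliar with Otsu's method, the following is a minimal, generic Python sketch of histogram-based threshold selection by maximising the between-class variance. It is not the authors' implementation; it assumes 8-bit grey levels and dark text on a light background, and the names otsu_threshold and binarize are hypothetical. Removal of small components is not shown.

```python
def otsu_threshold(histogram):
    """Return the grey level t that maximises the between-class variance.

    histogram[g] is the number of pixels with grey level g; pixels with
    level <= t form one class and the remaining pixels the other.
    """
    total = sum(histogram)
    sum_all = sum(level * count for level, count in enumerate(histogram))
    weight_low = 0
    sum_low = 0
    best_t, best_variance = 0, -1.0
    for t, count in enumerate(histogram):
        weight_low += count
        sum_low += t * count
        weight_high = total - weight_low
        if weight_low == 0:
            continue  # no pixels in the low class yet
        if weight_high == 0:
            break  # all pixels are in the low class
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_t, best_variance = t, variance
    return best_t


def binarize(grey, t):
    """Mark dark pixels (grey level <= t) as object pixels (1)."""
    return [[1 if g <= t else 0 for g in row] for row in grey]


# Tiny hypothetical image: dark strokes (10) on a light page (200).
grey = [[200, 200, 10, 200],
        [200, 10, 10, 200]]
histogram = [0] * 256
for row in grey:
    for g in row:
        histogram[g] += 1
t = otsu_threshold(histogram)
binary = binarize(grey, t)
print(t, binary)
```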
4.1. Foreground Smoothing
RLSA links together neighboring
black areas that are separated by less than a given distance
(the smoothing threshold). In other words,
it replaces a sequence of background pixels between two
object pixels by the object pixel value in a specified direction.
Normally, it is used to fill background pixel-runs of
length less than a certain pre-defined threshold in the
horizontal or vertical direction. In our approach, this
method is applied only in the horizontal direction, i.e.
row-by-row smoothing is done. The threshold for the smoothing is
set to 2.5*HL in our experiment. The smoothed image is
shown in Fig.4(c). From the figure it can be noted that
the middle parts of the text lines are mostly black because of
RLSA.
Sometimes, because of overlapping text from two
consecutive lines, two or more components may touch
vertically after smoothing. We detect such
touching from the horizontal histogram of each smoothed
component. When two smoothed components of two different
text lines touch, the histogram generally shows a
peak-valley-peak shape. We analyze the histogram to find
possible valleys between peaks. Generally, the peaks are
obtained from the parts of the different text lines that touch.
If a valid valley is found between two peaks, the valley
region is marked as a possible touching area.
Next, we analyze a
touching area to detect whether the touching was caused
by RLSA or not. To do that we consider the
corresponding area of the initial image (before RLSA is
applied) and trace its contour. If, starting from the topmost
point of the considered area, we can reach its
bottommost point by tracing, we conclude that the
touching existed before RLSA. To segment such a
touching we analyze the contour points of that area and
detect the touching point from the structural shape of the
contour. The touching is then segmented using the angular
information of the contour points and run-length information. If
the touching was caused by RLSA, we replace that
touching area by the corresponding area of the initial image.
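The row-by-row smoothing described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the image is assumed to be a list of rows with 1 for object pixels and 0 for background, and the function name rlsa_horizontal is hypothetical. In the paper the threshold would be 2.5*HL; a small value is used here so that the effect is visible on a toy row.

```python
def rlsa_horizontal(image, threshold):
    """Fill background runs shorter than `threshold` that lie between
    two object pixels of the same row (1 = object, 0 = background)."""
    result = [row[:] for row in image]  # work on a copy
    for row in result:
        last_object = None  # column of the most recent object pixel
        for col, value in enumerate(row):
            if value == 1:
                if last_object is not None:
                    gap = col - last_object - 1
                    if 0 < gap < threshold:
                        for k in range(last_object + 1, col):
                            row[k] = 1
                last_object = col
    return result


# The run of 2 background pixels is filled; the run of 4 is kept.
image = [[1, 0, 0, 1, 0, 0, 0, 0, 1, 0]]
smoothed = rlsa_horizontal(image, threshold=3)
print(smoothed)
```

Background runs that touch the left or right image border are never filled, because they do not lie between two object pixels.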
4.2. Morphological Operation for Foreground
and Background Information Extraction
We have used morphological operations, mainly
erosion, to extract useful foreground and background
information. Erosion is one of the two fundamental operations
in morphological image processing, on which all other
morphological operations are based. For details
see [12]. After RLSA, we have a smoothed image
in which the foreground consists of black text regions and
the background consists of white regions. By erosion, we
extract information from the foreground
and background portions that is very helpful for line
segmentation.
Figure 4. (a) Example of Bangla handwritten document
image. (b) Binary image after pre-processing of (a). (c)
Horizontal RLSA result of (b).
4.2.1. Background Information Extraction
The background region of the run-length smoothed image
is eroded to extract some obstacle lines. These obstacle
lines act as “Separator Lines” (SL) between two
consecutive text lines. The structuring element
for erosion is rectangular, with height
0.5*HL and width 5*HL. These
values were determined experimentally. The anchor
point is set at the centre of the structuring element.
The background-eroded image of Fig.4(c) is shown in Fig.5(a).
From each of the eroded components of the background
region, we take the upper and lower profiles in each
column and compute the mid-point between these
profiles in each column. A line is fitted to these
mid-points, and the resultant line is the SL.
The SLs obtained from Fig.5(a) are shown in
Fig.5(b). The left and right end points of an SL are extended
horizontally in both directions until (i) the SL touches the
foreground, or (ii) it touches the left or right profile
of the smoothed foreground image, or (iii) it meets another SL
within a vertical distance of HL. The extended SLs of
Fig.5(b) are shown in Fig.5(c). Note that we have
computed the left and right profiles of the smoothed image,
and this profile information has been used for line
extraction.
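The background erosion and the mid-point line fitting can be sketched as follows. This is a minimal Python illustration under stated assumptions: the structuring element must fit entirely inside the image (the paper does not specify border handling), the line is fitted by ordinary least squares (the paper only says "a line fitting algorithm"), and the mask passed to separator_line is assumed to contain a single eroded component. The names erode and separator_line are hypothetical, and the SL extension step is not shown.

```python
def erode(mask, se_height, se_width):
    """Binary erosion of `mask` (rows of 0/1) by a se_height x se_width
    rectangle whose anchor is at its centre."""
    rows, cols = len(mask), len(mask[0])
    top, left = se_height // 2, se_width // 2
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            r0, c0 = r - top, c - left
            r1, c1 = r0 + se_height, c0 + se_width
            # Assumption: the rectangle must lie fully inside the image.
            if r0 < 0 or c0 < 0 or r1 > rows or c1 > cols:
                continue
            if all(mask[i][j] for i in range(r0, r1) for j in range(c0, c1)):
                out[r][c] = 1
    return out


def separator_line(component):
    """Least-squares line y = a*x + b through the per-column mid-points
    of the upper and lower profiles of one eroded component."""
    xs, ys = [], []
    for c in range(len(component[0])):
        rows_on = [r for r in range(len(component)) if component[r][c]]
        if rows_on:
            xs.append(c)
            ys.append((rows_on[0] + rows_on[-1]) / 2.0)
    if not xs:
        raise ValueError("component is empty")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = 0.0 if sxx == 0 else sxy / sxx
    return a, mean_y - a * mean_x


# Toy smoothed image: two text bands separated by three background rows.
HL = 2
smoothed = [[1] * 12, [1] * 12, [0] * 12, [0] * 12, [0] * 12, [1] * 12, [1] * 12]
background = [[1 - v for v in row] for row in smoothed]
se_h, se_w = max(1, round(0.5 * HL)), max(1, round(5 * HL))
eroded = erode(background, se_h, se_w)
a, b = separator_line(eroded)
print(a, b)
```

In this toy case the fitted SL is horizontal and runs through the middle row of the gap between the two text bands.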
Figure 5. (a) The eroded portions of the background are
shown by black regions. (b) Separator Lines (SLs) obtained by
joining the mid-points of the upper and lower profiles of the eroded
components. (c) Extended SLs are shown along with the text
image.
4.2.2. Foreground Information Extraction
Morphological erosion has also been applied to the
foreground part of the smoothed image to extract
foreground seed components (FSCs). By an FSC, we mean an
isolated eroded component obtained from the smoothed
foreground part. These FSCs are generally
representative of the word components in the document. The
structuring element chosen for foreground erosion is
rectangular, with height 0.5*HL and width
0.65*HL. The anchor point is set at the centre of the
structuring element.
Let R be the set of all FSCs of an image. In handwritten
documents, the text lines sometimes touch or overlap each
other because of the modified characters of the scripts as well
as the ascending and descending parts of characters. As a
result, the smoothed text lines also touch, and sometimes we
may get a big FSC even though some touching caused by
RLSA is removed as discussed in Section 4.1. To detect the FSC
portion coming from such a touching part, we scan the FSC
region column-wise. The columns where the height of the FSC is
greater than HL are removed, so that touching portions do
not affect our line segmentation method. In some cases, a
very small FSC may be obtained. We also delete these small
FSCs for better line segmentation. Generally, an FSC
should lie on a text line, but because of touching and
modified characters, some FSCs may appear in the region
between two text lines. To handle such situations, the
FSCs which touch an SL are also removed.
The remaining FSCs of R are used for line extraction, and we
call them the candidate FSCs. The
candidate FSCs obtained from the image given
in Fig.4(a) are shown in Fig.6. We use these candidate
FSCs for line segmentation.
Figure 6. Foreground seed components are shown by
black regions.
4.3. Line Segmentation
By joining the candidate FSCs we get
the segmented lines. The SLs guide the FSC joining to obtain a
proper segmentation. For each candidate FSC, we compute
two reference points named the “left” (LR) and “right” (RR)
reference points. The LR of an FSC is the centre
of gravity of the pixels of the FSC which lie
between its leftmost column and the column at a distance HL
from the leftmost column. RR is calculated similarly from the
rightmost column of the FSC. If the width of an FSC is less than
2*HL, then the LR (RR) of the FSC is computed from the left
(right) half of the FSC. The left and right reference points
of the candidate FSCs of Fig.6 are shown in Fig.7. The two
reference points of each FSC are marked by black points
on the FSC.
Figure 7. Left and right reference points are shown by
black dots on each FSC.
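The computation of the two reference points can be sketched as below. This is a minimal Python illustration, assuming that an FSC is given as a list of (x, y) pixel coordinates; the function name reference_points and this representation are hypothetical choices made for the example.

```python
def reference_points(pixels, hl):
    """Return (LR, RR) for one FSC given as a list of (x, y) pixels.

    LR (RR) is the centre of gravity of the pixels within HL columns of
    the leftmost (rightmost) column, or of the left (right) half of the
    FSC when the FSC is narrower than 2*HL.
    """
    xs = [x for x, _ in pixels]
    x_min, x_max = min(xs), max(xs)
    width = x_max - x_min + 1
    band = hl if width >= 2 * hl else width / 2.0
    left = [(x, y) for x, y in pixels if x < x_min + band]
    right = [(x, y) for x, y in pixels if x > x_max - band]

    def centroid(points):
        n = len(points)
        return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

    return centroid(left), centroid(right)


# Hypothetical FSC: a 10 x 2 pixel rectangle, with HL = 3.
fsc = [(x, y) for x in range(10) for y in range(2)]
lr, rr = reference_points(fsc, hl=3)
print(lr, rr)
```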
Let A and B be two candidate FSCs. For
joining B with A, we compute a searching zone (as shown
in Fig.8) from the right reference point (RR) of A. The
length (L) of the searching zone is taken as 10*HL and the
width (W) of the searching zone is set to 1.5*HL.
We do not consider the full rectangular area at the beginning
of the searching zone. The beginning part extends up to a length
HL from RR, and the searching zone is triangular in shape in
this part. The rest of the searching zone, of length 9*HL, is
rectangular in shape. This is done so that two vertically
adjacent candidate FSCs will not be joined. See Fig.8, where the
searching zone of the RR of ‘A’ is shown by hatched lines. We
join B with A if the following conditions are satisfied:
(i) The LR of B lies in the searching zone of the RR of A. (ii)
The searching zone of the RR of A does not cross any SL.
Figure 8. Searching zone of the right reference point of ‘A’ is
shown here.
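Condition (i) can be sketched as a simple point-in-zone test. Since Fig.8 is not reproduced here, the geometry below is an assumption: the zone is taken to be centred vertically on RR, with the triangular part having its apex at RR and widening linearly to the full width W at a distance HL. Condition (ii) is passed in as a flag, because the SL crossing test depends on how SLs are stored. The names in_searching_zone and can_join are hypothetical.

```python
def in_searching_zone(rr, point, hl):
    """Check whether `point` lies in the searching zone to the right of `rr`.

    Assumed geometry: length 10*HL, width 1.5*HL centred on rr; the first
    HL of the length is a triangle with its apex at rr.
    """
    dx = point[0] - rr[0]
    dy = abs(point[1] - rr[1])
    length, half_width = 10.0 * hl, 0.75 * hl
    if dx < 0 or dx > length:
        return False
    if dx < hl:
        # Triangular part: allowed vertical offset grows linearly with dx.
        return dy <= half_width * dx / hl
    return dy <= half_width


def can_join(rr_a, lr_b, hl, zone_crosses_sl):
    """Join B to A only if both conditions (i) and (ii) hold."""
    return in_searching_zone(rr_a, lr_b, hl) and not zone_crosses_sl


rr_a = (0.0, 0.0)
print(can_join(rr_a, (50.0, 5.0), 10, zone_crosses_sl=False))  # same line
print(can_join(rr_a, (0.0, 5.0), 10, zone_crosses_sl=False))   # directly below
```

The triangular start is what rejects the second candidate: a point directly above or below RR has no horizontal offset and therefore no allowed vertical offset.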
In the first condition, we consider the foreground
positional information of the FSCs, and this is done in terms of
component overlapping. In the second condition we utilize the
SLs, which are based on background information. If the
above conditions are satisfied, we join the two components A
and B. As soon as B is joined to A, we try to join another
FSC which is near B and satisfies the above two
conditions with respect to B. In this way all possible FSCs of a
line are joined. After joining the FSCs we extend the LR point of
the leftmost FSC of the joined group towards the left and the RR
point of the rightmost FSC towards the right. If, during this
extension, we can reach the border of the image, then a text
line is detected.
In this way, the FSCs of each line are clustered
by joining them through a line segment to obtain the individual
text line. To get the characters of a text line, we collect all
the RLSA components through which this line passes and
assign the character portions of these components to the
segmented line. Sometimes, some small RLSA components may not
touch this line. We assign such components to their nearest line.
If two or more line segments are
obtained for a single line because of a long distance between
two FSCs, we group them together into a single line by checking the
positional relationship of the leftmost and rightmost points
of these line segments along with the gap
from the line lying above them. If the line-gap between these
line segments and the segmented line lying above them is
similar, we group these line segments into one line. By line-gap, we
mean the distance between two consecutive lines obtained
after FSC joining. The line-gap between the first two lines of Fig.9
is shown by an arrow. Note that the joining of FSCs is done in
top-to-bottom fashion, starting from the topmost line. The line
joining result of Fig.7 is shown in Fig.9.
Figure 9. Result of FSC joining.
5. Experimental Results and Discussions
For the experiment, 125 handwritten document images
were collected from individuals of different professions.
The data cover five different scripts:
Bangla, Devnagari, Oriya, Gujarati and English. We noted
that the dataset contains a variety of writing styles. For the
experiment we considered single-column document pages.
To check whether a text line is detected correctly or not, we
draw a marker on the text line after its detection. By
viewing the results on the computer display we calculate
the line detection accuracy manually. From the experiment
we have found that, on average, our system detects the text
line correctly in 92.68% of cases. Line detection
results of our proposed scheme on different
scripts are shown in Fig.10.
We considered 520, 195, 406, 210 and 115 text lines of
Bangla, Devnagari, English, Oriya and Gujarati scripts, and
we noted that 94.23%, 93.33%, 96.06%, 94.76% and
93.04% of the text lines, respectively, were correctly
identified in our experiments. For the experiment on English
documents, we also considered some images from the IAM
database [5] as well as some images from examination
answer sheets. From our experiment we noted that most of
the errors occur when two consecutive text lines are
very close to each other. Our system detects all the lines of a
document accurately when a gap is present between
consecutive lines. The SLs are very useful in preventing an
FSC from being joined with an FSC of a nearby next line.
If an SL is missing and the candidate FSCs of two
consecutive lines are too close, an erroneous result
occurs; this is the main drawback of our approach.
Our method computes a global text size, and the line
detection thresholds depend on the text-line height. If
text lines of multiple sizes exist in a single document, the
method needs to be modified to determine the size of the
structuring element. Layout understanding methodologies
may help in detecting different uniform text zones, and
a different text line size can then be computed for each
individual zone. However, we noted that variation in
handwriting size within a document page of a single writer
is very rare in our documents.
There is not much work on handwritten text line
extraction for Indian languages. Recently Basu et al. [1]
proposed a method for text line extraction and
obtained 90.34% accuracy on Bangla script and 91.44% on
English script. We obtained 92.68% overall accuracy with
our proposed approach when tested on five different
scripts.
6. Conclusion
In this paper, a script-independent line segmentation
algorithm for off-line unconstrained
handwritten documents is presented. The proposed scheme is
based on morphological operations on the foreground and
background portions of the document. We tested our
scheme on different scripts, namely Bangla, Devnagari, Oriya,
Gujarati and English, and obtained encouraging results.
Figure 10. Results of line segmentation are shown on (a)
Bangla (b) Gujarati (c) Oriya (d) English text.
7. References
[1] S. Basu, C. Chaudhuri, M. Kundu, M. Nasipuri and D. K.
Basu, “Text line extraction from multi-skewed handwritten
documents”, Pattern Recognition, vol. 40(6), June 2007,
pp. 1825-1839.
[2] B. Chaudhuri and U. Pal, “Skew angle detection of
digitized Indian Script documents”, IEEE PAMI, vol. 19,
1997, pp. 182-186.
[3] Y. Li, Y. Zheng, D. Doermann and S. Jaeger, “A new
algorithm for detecting text line in handwritten documents”,
In Proc. 10th IWFHR, 2006, pp. 35-40.
[4] J. Liang, I. Philips and R. M. Haralick, “A statistically based
highly accurate text-line segmentation method”, In Proc.
5th ICDAR, 1999, pp. 551-554.
[5] U. Marti and H. Bunke, “A full English sentence database
for off-line handwriting recognition”, In Proc. 5th ICDAR,
1999, pp. 705-708.
[6] S. Nicolas, Y. Kessentini, T. Paquet and L. Heutte,
“Handwritten Document using Hidden Markov Random
Fields”, In Proc. 8th ICDAR, 2005, pp. 212-216.
[7] N. Otsu, “A Threshold selection method from grey level
histogram”, IEEE Trans. on SMC, vol. 9, 1979, pp. 62-66.
[8] U. Pal, A. Belaïd and C. Choisy, “Touching numeral
segmentation using water reservoir concept”, Pattern
Recognition Letters, vol. 24, 2003, pp. 261-272.
[9] U. Pal and B. B. Chaudhuri, “Indian script character
recognition: A Survey”, Pattern Recognition, vol. 37, 2004,
pp. 1887-1899.
[10] U. Pal and S. Datta, “Segmentation of Bangla Unconstrained
Handwritten Text”, In Proc. 7th ICDAR, 2003, pp. 1128-1132.
[11] U. Pal and P. P. Roy, “Multioriented and curved text lines
extraction from Indian documents”, IEEE Trans. on SMC,
Part B, vol. 34, 2004, pp. 1676-1684.
[12] J. Serra, Image Analysis and Mathematical Morphology,
Academic Press, London, 1982.
[13] S. Tsuruoka, Y. Adachi and T. Yoshikawa, “The
Segmentation of a Text line for a Handwritten
Unconstrained Document using Thinning Algorithm”, In
Proc. 7th IWFHR, 2000, pp. 505-510.
[14] W. Xiaoying and C. G. Leedham, “Separating lines and
words in unconstrained handwriting”, In Proc. 8th IGS,
1997, pp. 117-118.
[15] B. Yanikoglu and P. A. Sandon, “Segmentation of off-line
cursive handwriting using linear programming”, Pattern
Recognition, vol. 31, 1998, pp. 1825-1833.
[16] A. Zahour, B. Taconet, P. Mercy and S. Ramdane, “Arabic
hand-written text-line extraction”, In Proc. 6th ICDAR,
2001, pp. 281-285.