SEPARATION OF TOUCHING AND OVERLAPPING WORDS IN ADJACENT
LINES OF HANDWRITTEN TEXT
KALYAN TAKRU and GRAHAM LEEDHAM
School of Computer Engineering, Nanyang Technological University,
N4-2A-32 Nanyang Avenue, Singapore 839798
Email: [email protected], [email protected]
This paper reports on a novel technique for the separation
of characters and words that are connected through
touching or overlapping of characters between adjacent
lines of text. The technique employs structural knowledge
of handwriting styles where overlap is most frequently
observed. The method is shown to work well in the most
usual cases and resolve many of the more difficult cases
observed in very poor quality handwritten documents.
1. Introduction
An important preprocessing stage in off-line
automatic handwriting recognition systems is separating
the lines and words in a section of scanned handwritten
text, so that the location of all individual words is known
for later processing. The main difficulties in automatic
line separation are due to slanting lines, crooked lines and
touching or overlapping lines. In general,
touching/overlapping lines are found when ascenders of
the lower line touch or cross the upper line, or when
descenders of the upper line touch or cross the lower line.
If the line spacing is small, handwritten lines may
frequently overlap each other. Even in handwritten text
with adequate line spacing, overlapping line problems
may also exist due to long ascenders and descenders.
Numerous methods have been proposed for dealing
with slanting and crooked lines [1][2][3]. One method for
separating overlapping characters employs contour
information [4]. Another method that has been used relies
upon connected component grouping and splitting [5]. In
this paper we are concerned with the accurate separation
of touching and overlapping lines where characters in
adjacent lines extend upwards or downwards to touch and
interfere with the line above or below. In these cases the
touching region will contain pixels which belong to the
upper, lower or both text lines. An example image
containing five occurrences of touching and overlapping
lines is shown in Figure 1. The images were scanned as
binary images at a resolution of 300 dpi (dots per inch).
Figure 1. Example image illustrating five occurrences
of touching and overlapping handwritten lines.
2. Algorithm
The entire algorithm, including initial line separation,
can be divided into 4 main stages:
1. Connected component labeling. The whole image is
divided into a set of connected components. The
connected components are then used as the unit of
processing.
2. Detect and split touching/overlapping lines.
Following initial line separation some components
will extend over more than one text line due to
touching/overlapping components. Components with
greater than average line height indicate regions
where overlapping or touching of lines may occur.
Pixels belonging to different words in adjacent lines
will be included in a single component if there is any
form of interconnection. The locations of these
components are marked for further analysis by
splitting the component into two new components.
3. Mark the connected pixels. For each marked
component, analysis of the immediate area around it
is performed to locate the boundary area at which the
interconnection takes place between the two
components. This is marked with gray color for
identification.
Proceedings of the Eighth International Workshop on Frontiers in Handwriting Recognition (IWFHR’02)
0-7695-1692-0/02 $17.00 © 2002 IEEE
4. Separate the two components. Once the boundary
region between the two lines is established, further
analysis is carried out to determine whether each
pixel belongs to the upper or lower line.
These stages are described in more detail below.
Figure 2. Detail of a typical connected component and
its horizontal histogram. (Figure labels: up, v, down; upper region,
overlapping region, lower region.)
2.1 Extract Connected Components
Connected component labeling is a basic and widely
used technique in image processing and pattern
recognition. The classical algorithm of connected
component labeling requires two passes through the
image, and a table to store equivalent labels [6]. In this
work, the classical approach to connected component
labeling was used. The important parameters recorded for
each component were its size and position in the original
image.
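The classical two-pass scheme can be sketched as follows. This is a minimal illustration rather than the implementation used in this work: the image is assumed to be a list of rows of 0/1 values, 8-connectivity is assumed (the text does not state which connectivity was used), and the name label_components and the bounding-box format (top, left, bottom, right) are hypothetical.

```python
def label_components(img):
    """Two-pass labeling of a binary image given as rows of 0/1 values.

    Returns (labels, boxes): labels has the same shape as img, and boxes
    maps each final label to its bounding box (top, left, bottom, right).
    """
    h, w = len(img), len(img[0])
    labels = [[0] * w for _ in range(h)]
    parent = [0]  # equivalence table: parent[k] == k marks a root label

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]  # shorten the chain as we walk it
            k = parent[k]
        return k

    # Pass 1: assign provisional labels and record equivalences.
    for r in range(h):
        for c in range(w):
            if not img[r][c]:
                continue
            roots = []
            # Already-visited 8-neighbors: three above, one to the left.
            for dr, dc in ((-1, -1), (-1, 0), (-1, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < h and 0 <= cc < w and labels[rr][cc]:
                    roots.append(find(labels[rr][cc]))
            if roots:
                root = min(roots)
                for k in roots:
                    parent[k] = root  # merge equivalent labels
            else:
                root = len(parent)
                parent.append(root)   # open a new provisional label
            labels[r][c] = root

    # Pass 2: resolve every label to its root and collect bounding boxes.
    boxes = {}
    for r in range(h):
        for c in range(w):
            if labels[r][c]:
                k = find(labels[r][c])
                labels[r][c] = k
                t, l, b, rt = boxes.get(k, (r, c, r, c))
                boxes[k] = (min(t, r), min(l, c), max(b, r), max(rt, c))
    return labels, boxes
```

The bounding box gives both the position of a component and its size; the component height used in the next stage is bottom - top + 1.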
2.2 Detect and split touching/overlapping lines
Due to overlapping lines, pixels from two different
lines are grouped into a single component during
connected component extraction. The task of marking
crossing-line components is not easy. The algorithm
employs structural properties observed in crossing-line
components (see section 3). Crossing-line components
have a height larger than the average height of
components in the image. Also, the horizontal projection
of the crossing-line component will usually exhibit an
obvious valley and two obvious peaks beside the valley.
Based on these assumptions, the following method is used
to detect crossing-line components:
1. If a component’s height is larger than a threshold,
then it is considered as a possible crossing-line
component and is sent to the next step. The threshold
is defined based on the average height of lines
obtained during the initial line separation.
2. The possible crossing-line component is confirmed if its
horizontal histogram has an obvious valley and the
two portions beside the valley have peaks of
sufficient size.
For a possible crossing-line component (i.e., one that has
passed step 1), its horizontal histogram is obtained and
examined. The valley is detected as the row v with the
minimum histogram value. Given the presence of a valley
and two attendant peaks along with a greater than average
height, the component is marked as being a crossing-line
component. The marking is done by splitting the
component into two components along a row in the
vicinity of the valley. Figure 2 shows one connected
component along with its smoothed horizontal
histogram and detected valley.
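The two-step test could be sketched as below. The height factor, the valley-to-peak ratio, the moving-average smoothing and the restriction of the valley search to interior rows are all assumptions made for illustration; the text does not give the threshold values or the smoothing method that were used, and the function names are hypothetical.

```python
def horizontal_histogram(comp):
    """Count ink pixels in every row of a component bitmap (rows of 0/1)."""
    return [sum(row) for row in comp]


def smooth(hist, radius=1):
    """Moving average; the window is clipped at both ends of the histogram."""
    out = []
    for i in range(len(hist)):
        window = hist[max(0, i - radius): i + radius + 1]
        out.append(sum(window) / len(window))
    return out


def find_split_row(comp, avg_line_height, height_factor=1.5,
                   valley_ratio=0.5, radius=1):
    """Return the valley row v of a crossing-line component, else None."""
    height = len(comp)
    # Step 1: the component must be clearly taller than an average line.
    if height < 3 or height <= height_factor * avg_line_height:
        return None
    hist = smooth(horizontal_histogram(comp), radius)
    # Step 2: the valley is the interior row with the minimum value.
    v = min(range(1, height - 1), key=lambda r: hist[r])
    peak_above = max(hist[:v])
    peak_below = max(hist[v + 1:])
    # Both peaks must clearly dominate the valley.
    if hist[v] <= valley_ratio * min(peak_above, peak_below):
        return v
    return None
```

A component that passes both steps would then be split along the returned row (or a row near it), producing the two components analyzed in the following stage.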
2.3 Marking the boundary of connected pixels
The first step is to find the exact locations of the
marked components for analysis. The method employed is
to scan the image for the telltale signs of a split
component. A crossing-line component will have been
split into two components (stage 2 above) and, as a result,
the physical boundaries of the two components
overlap: the bottom-most row of the component in
the higher line is the same as the top-most row of
the component in the lower line.
To determine the boundary, an
analysis of the immediate area around the row where the
split has taken place (henceforth the split-row) is
performed. It is first necessary to know in which direction
the analysis should work. The vertical distance between
the split-row and the center of gravity (COG) of each of the two
components involved is calculated. The analysis is then
carried out in the direction which has the greater distance.
The rationale is that since there is greater distance
between the COG and the split-row, it is highly likely that
that direction will provide more material to work with for
separating the two components. If the split-row is closer
to component B instead of component A, then pixels
belonging to component B are to be found on component
A’s side of the split-row. Thus, the analysis is carried out
towards component A.
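A minimal sketch of this direction rule is given below. It assumes each component is available as a list of (row, column) pixel coordinates; the function names and the tie-breaking choice (downwards) are illustrative assumptions, not taken from the original program.

```python
def cog_row(pixels):
    """Row coordinate of the center of gravity of a list of (row, col) pixels."""
    return sum(r for r, _ in pixels) / len(pixels)


def analysis_direction(upper_pixels, lower_pixels, split_row):
    """Return "up" or "down": the side whose COG lies further from the split-row."""
    dist_up = abs(split_row - cog_row(upper_pixels))
    dist_down = abs(cog_row(lower_pixels) - split_row)
    # Ties go downwards here; this choice is an assumption.
    return "up" if dist_up > dist_down else "down"
```

For example, an upper component spanning rows 0 to 10 and a lower component spanning rows 10 to 12, with the split-row at row 10, gives distances of 5 and 1, so the analysis would proceed upwards.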
The algorithm looks for a number of different pixel
patterns, which indicate the kind of interconnection
involved and hence where the
boundary might lie. Depending upon the type of
interconnection at a specific crossing-line component, the
rationale for marking the boundary is different. Figure 3
shows the marking of the connected pixels.
Note that the boundary is not a single
row of gray pixels; it always spans more than one row. The
rows that are marked always belong to the same
component and serve as a stepping-stone towards step 4
where the entire component is marked with gray to
differentiate it from the other component involved, which
remains black. The area analyzed is rectangular and its
dimensions are determined in the following manner:
1. Width: The column boundaries of the two components
involved in the interconnection are used to calculate
the common columns between the two interconnected
components. Some leeway is provided on both sides
for cases where the handwriting is slanted heavily.
2. Height: The rows used extend from the split-row
towards the peak of the component with the furthest
COG. Thus, the rows used are only on one side of the
split-row. For a case involving an ascender and a
descender, more rows are allocated than a case where
an ascender/descender connects with a normal letter.
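The rectangle described by these two rules might be computed as in the sketch below. Bounding boxes are assumed to be (top, left, bottom, right) tuples, and the leeway and the number of rows are left as parameters because the text does not give their values; all names are hypothetical. For the upward direction the window is placed so that it ends at the split-row, and it is then scanned top to bottom like any other window.

```python
def analysis_window(box_upper, box_lower, split_row, direction,
                    n_rows, leeway, img_height, img_width):
    """Return the analysis rectangle as (top, left, bottom, right)."""
    _, left_u, _, right_u = box_upper
    _, left_l, _, right_l = box_lower
    # Width: columns common to both components, widened by the leeway.
    left = max(left_u, left_l) - leeway
    right = min(right_u, right_l) + leeway
    # Height: n_rows rows lying on one side of the split-row only.
    if direction == "down":
        top, bottom = split_row, split_row + n_rows - 1
    else:
        top, bottom = split_row - n_rows + 1, split_row
    # Clip the rectangle to the image.
    return (max(top, 0), max(left, 0),
            min(bottom, img_height - 1), min(right, img_width - 1))
```

A caller would pass a larger n_rows when an ascender meets a descender than when an ascender or descender meets a normal letter.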
3. Types of Interconnections
The interconnections can range from touching letters to
overlapping letters. Within both categories, there are
further sub-categories. The sub-categories define the kinds
of interconnections that exist and determine the logic
employed for marking the boundary in each case. The
sub-categories are:
1. A descender with a loop overlapping a vertical
ascender: in such a case, the analysis works
downwards and the pattern that is searched for is
the presence of the ascender among the loop of
the descender. Once the ascender has been found
with respect to the descender, it remains to mark
the boundary between the ascender and the
descender. The ascender is found by scanning the
immediate area row-by-row (beginning at the
split-row) and searching for vertical lines
representing the descender. Once it has been
established how many such lines exist and
another one is found at a lower row, that new line
must belong to the ascender and is marked as
such. This method has a high success rate
compared to others used in this work.
2. A descender with a loop touching a vertical
ascender or a lower-case letter in the word
below: this is slightly easier to solve than the
overlapping case. The logic used here is
different. One of the characteristics of a loop,
complete or incomplete, is that it consists of a
horizontal curve at the bottom followed by an
upward sweep. The algorithm searches for the
bottom of the loop and marks the boundary
accordingly. The pixels below the bottom of the
loop belong to the ascender or the lower-case
letter, as the case may be. This method has a high
success rate.
3. A vertical descender touching lower-case letters
or the curving top of an ascender or a capital
letter: the analysis, as in the previous cases, is
again downwards. The criterion searched for is the
row where the descender meets the lower-case
letter, the curve of the ascender or the capital
letter. The descender may be a loop or a single
vertical line. When it touches the bottom letter
(whichever type it may be), there is a distinct
change in the pixel pattern as the vertical line(s)
of the descender are replaced by the horizontal
pixel formation of the lower letter or the top of
the ascender. That is the point at which the
boundary is marked.
Figure 3. The marked boundary of the joined region. The gray
area represents the boundary. Notice that, in this particular
case, all the gray pixels belong to the lower letter.
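For the first sub-category (a looped descender overlapping a vertical ascender), the row-by-row search for an additional vertical stroke could be sketched as follows. The window is assumed to be a list of rows of 0/1 values starting at the split-row, a "stroke" is taken to be a horizontal run of ink pixels within one row, and the rule that a new stroke must not touch any stroke in the row above is an assumption added for this illustration.

```python
def runs(row):
    """Return (start, end) column spans of consecutive ink pixels in one row."""
    spans, start = [], None
    for c, v in enumerate(row):
        if v and start is None:
            start = c
        elif not v and start is not None:
            spans.append((start, c - 1))
            start = None
    if start is not None:
        spans.append((start, len(row) - 1))
    return spans


def find_new_stroke(window):
    """Scan rows top to bottom; return (row, span) of the first extra stroke."""
    prev = runs(window[0])  # strokes of the descender at the split-row
    for r in range(1, len(window)):
        cur = runs(window[r])
        if len(cur) > len(prev):
            for span in cur:
                # A span touching no span in the row above is a new stroke.
                if all(span[1] < p[0] - 1 or span[0] > p[1] + 1 for p in prev):
                    return r, span
        prev = cur
    return None
```

A loop that merely widens or splits into two strokes does not trigger the rule, because each resulting span still touches a span in the row above.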
2.4 Separating the two components
Once the boundary is marked successfully and the
direction of analysis is known, the rest of the component
can be marked. In all but the toughest cases, extending the
boundary to the entire component accurately is fairly
straightforward. The logic used is that once the boundary
is marked, the area is analyzed for the lowest or the
highest marked gray row depending upon which direction
the analysis is performed. Once that row is determined, all
the pixels that lie within the component towards which the
analysis is concentrated are marked gray. Care has to be
taken to ensure that pixels from the other component,
which may lie within the boundaries, are not marked
accidentally. For that reason, the separation algorithm
colors the pixels which are neighbors of previously
colored pixels, while also using information about the
kind of interconnection involved. As the first colored
pixel is found in the boundary, the coloring of the
component moves from there towards the center of the
component, ultimately leading to a fully gray component.
Figure 4 shows the final separation result for the example
shown in Figure 3.
Figure 4. Detail of the final separation of pixels. Gray pixels
represent the lower letter. Notice how the gray color has
propagated downwards from the boundary (Figure 3) to cover
the entire component.
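The neighbor-by-neighbor coloring can be illustrated with a breadth-first fill that starts from the gray boundary pixels. This is a simplified sketch: the pixel codes, the function name and the rule that the fill never crosses back over the outermost gray row are assumptions, and the additional use of information about the kind of interconnection is not modeled.

```python
from collections import deque

BLACK, GRAY = 1, 2  # hypothetical pixel codes; 0 is background


def propagate_gray(img, direction="down"):
    """Grow GRAY from the marked boundary through touching BLACK pixels, in place."""
    seeds = [(r, c) for r, row in enumerate(img)
             for c, v in enumerate(row) if v == GRAY]
    if not seeds:
        return img
    seed_rows = [r for r, _ in seeds]
    # Never cross back over the boundary into the other component.
    limit = min(seed_rows) if direction == "down" else max(seed_rows)
    queue = deque(seeds)
    while queue:
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                if not (0 <= rr < len(img) and 0 <= cc < len(img[0])):
                    continue
                if direction == "down" and rr < limit:
                    continue
                if direction == "up" and rr > limit:
                    continue
                if img[rr][cc] == BLACK:
                    img[rr][cc] = GRAY  # color the neighbor and continue from it
                    queue.append((rr, cc))
    return img
```

Pixels of the other component that lie beyond the boundary rows remain black, so the two components end up distinguished by color.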
4. A vertical descender meets a vertical ascender:
this case is quite straightforward. In this case, the
two letters merge into one another and thus the
boundary is set as the middle of the vertical line
connecting the two.
Even in the difficult images, where the histogram test fails
more often, the success rate of separation given detection is still high.
Some images with no interconnections whatsoever were
also included to check that the program does not identify
interconnections where none exist. The results of the
analysis are tabulated in Table 1 and a number of
successful example separations are shown in Figure 5.
The testing of the complete program revealed certain
flaws in some cases. These are elucidated below along
with the reasons why the algorithm fails to deal with those
interconnections correctly and completely.
Cases 2 and 3 also have corresponding cases
where the relative placement of the two components is
reversed. In cases where the direction of analysis is
calculated to be upwards, the logic used is to shift the
analysis area higher such that it now ends at the split-row
instead of starting at the split-row (as is otherwise the case
when the direction of analysis is calculated to be
downwards) and then analyze. This approach yields better
results than actually analyzing upwards from the split-row,
as it is more intuitive to analyze downwards with
reference to the descender.
1. The only case in which the program could report
interconnections where there are none is when
the handwriting has exceptionally large letters
causing the height of the component to far
exceed the limit used as a criterion for
identification. However, this did not happen in
any of the images.
2. The program fails when the loop of the
descender coincides with the loop of the
ascender. Such a case is extremely difficult to
handle with the approach described in this paper.
In such a case, the major issue is that there is no
way to show that there exists a boundary
between the two letters because there is actually
no physical boundary since they coincide almost
exactly.
4. Results and Observations on Failed Cases
Preliminary testing has been carried out on a set of 20
different handwriting images that represent the various
forms of touching and overlapping lines. Most had a
number of cases of interconnecting lines. Some (like the
example shown in Figure 1, included in Image Nos. 1-7 in
Table 1) were relatively easy as the
touching/overlapping regions were clear due to the lines
being well spaced vertically. Some of the images (Image
Nos. 15-20) were more difficult as the writing was
cramped vertically, leading to a higher failure rate for the
histogram test.
Image   Connections   Connections   Connections   Boundaries         Separations
No.     present       detected      missed        correctly marked   done correctly
1       5             5             0             4                  3
2       4             4             0             3                  2
3       1             1             0             1                  1
4       5             5             0             5                  4
5       1             1             0             1                  1
6       5             4             1             3                  2
7       1             1             0             1                  1
Statistics of images with no connections present:
8-14    0             0             0             0                  0
Statistics of images with low success at detecting connections:
15      12            9             3             7                  5
16      9             3             6             1                  1
17      7             2             5             2                  1
18      5             2             3             1                  1
19      6             4             2             2                  2
20      4             0             4             0                  0
Table 1. Summary of the results of the analysis.
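As a worked check of the figures in Table 1, the per-group totals can be accumulated as in the short sketch below. The tuples are the table values read column by column; the grouping into images 1-7 and images 15-20 follows the table, while the names EASY, HARD and summarize are illustrative only.

```python
# Per-image counts read from Table 1:
# (present, detected, boundaries correctly marked, separations done correctly)
EASY = [(5, 5, 4, 3), (4, 4, 3, 2), (1, 1, 1, 1), (5, 5, 5, 4),
        (1, 1, 1, 1), (5, 4, 3, 2), (1, 1, 1, 1)]             # images 1-7
HARD = [(12, 9, 7, 5), (9, 3, 1, 1), (7, 2, 2, 1), (5, 2, 1, 1),
        (6, 4, 2, 2), (4, 0, 0, 0)]                           # images 15-20


def summarize(rows):
    """Sum each column and derive the detection and separation rates."""
    present, detected, marked, separated = (sum(col) for col in zip(*rows))
    return {
        "present": present,
        "detected": detected,
        "marked": marked,
        "separated": separated,
        "detection_rate": detected / present,
        "separation_given_detection": separated / detected,
    }


for name, rows in (("images 1-7", EASY), ("images 15-20", HARD)):
    s = summarize(rows)
    print(f"{name}: detected {s['detected']}/{s['present']}, "
          f"separated {s['separated']}/{s['detected']} of those detected")
```

With these counts, 21 of 22 connections are detected in images 1-7 and 20 of 43 in images 15-20.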
3. The major cause of failures in testing was the
incorrect identification of the area of analysis.
This may happen when the split-row is too near
to one component. This causes the analysis area
to not cover the parts of the other component,
which are critical to identifying the pixel pattern.
Thus, the boundary is not determined correctly
and the whole separation process goes awry.
4. The algorithm does not work for handwriting
which is extremely badly formed. For example,
the lines of the letters may not be smooth but
irregular and shaky leading to badly formed
letters. Such writing is frequently illegible to
humans who have a lifetime of experience at
reading handwriting. It is quite beyond the scope
of an algorithm that uses pixel patterns and
maybe even one that employs a high order of
artificial intelligence.
5. The algorithm fails when the ascenders and
descenders criss-cross each other repeatedly at
several places. This may be found in cases where
two loops meet each other at different slants.
This is somewhat similar to case 2. It is equally
rare and just as difficult to resolve.
Figure 6 shows image 16 from Table 1. An
examination of the image shows the cramped style of
writing (relative to the image in Figure 1) and the extremely
high pixelation of the letters. Only 3 of the 9 connections
involving crossing-line components were detected.
The detection failures were caused primarily by the fact
that the writing on the scanned paper was so small that the
scanned image had poorly formed letters. As an extreme
case, consider letters so small that
each consists of only a few pixels. In such a case,
the program would obviously fail. Figure 6 represents a
test image intermediate in difficulty between the successful
images and the extreme case considered above.
Figure 6. Example of an image with low detection rate.
Figure 5. Eight examples of separated touching lines.
5. Conclusions
The described method for detecting and separating
touching lines in unconstrained handwritten documents
has been shown to be effective in all but the most
difficult cases. The remaining problem of separating
these rare, more difficult cases is the subject of current
research, as is the detection and appropriate allocation of
pixel regions which are common to both upper and lower
lines. Details of these algorithms and results will be
available soon and reported in the final paper.
References
1. H. S. Baird, “The skew angle of printed documents”,
   Proc. Conf. Photographic Scientists and Engineers,
   SPIE, Bellingham, Wa., pp. 14-21, 1987.
2. V. Shapiro, G. Gluhchev, and V. Sgurev,
   “Handwritten document image segmentation and
   analysis”, Pattern Recognition Letters, Vol. 14, No. 1,
   pp. 71-78, 1993.
3. L. Likforman-Sulem, A. Hanimyan and C. Faure, “A
   Hough based algorithm for extracting text lines in
   handwritten documents”, Proc. Int. Conference on
   Document Analysis and Recognition, Montreal, pp.
   774-777, August, 1995.
4. Venturelli F., Kovacs-V M. Zs., “An unconstrained
   handwritten line segmentation technique”, Fifth
   international workshop on frontiers in handwriting
   recognition, University of Essex, England, pp. 385-388, 1996.
5. Bruzzone E., Coffetti M. C., “An algorithm for
   extracting cursive text lines”, Fifth international
   conference on Document Analysis and Recognition,
   Bangalore, India, pp. 749-752, 1999.
6. Rosenfeld, A. and Pfaltz, J. L. “Sequential operations
   in digital image processing”. Journal of the ACM,
   13(4), 471-494, 1966.