SEPARATION OF TOUCHING AND OVERLAPPING WORDS IN ADJACENT LINES OF HANDWRITTEN TEXT

KALYAN TAKRU and GRAHAM LEEDHAM
School of Computer Engineering, Nanyang Technological University, N4-2A-32 Nanyang Avenue, Singapore 839798
Email: [email protected], [email protected]

This paper reports on a novel technique for the separation of characters and words that are connected through touching or overlapping of characters between adjacent lines of text. The technique employs structural knowledge of handwriting styles where overlap is most frequently observed. The method is shown to work well in the most usual cases and to resolve many of the more difficult cases observed in very poor quality handwritten documents.

1. Introduction

An important preprocessing stage in off-line automatic handwriting recognition systems is separating the lines and words in a section of scanned handwritten text, so that the location of all individual words is known for later processing. The main difficulties in automatic line separation are due to slanting lines, crooked lines and touching or overlapping lines. In general, touching/overlapping lines are found when ascenders of the lower line touch or cross the upper line, or when descenders of the upper line touch or cross the lower line. If the line spacing is small, handwritten lines may frequently overlap each other. Even in handwritten text with adequate line spacing, overlapping lines may still occur due to long ascenders and descenders.

Numerous methods have been proposed for dealing with slanting and crooked lines [1][2][3]. One method for separating overlapping characters employs contour information [4]. Another method relies upon connected component grouping and splitting [5]. In this paper we are concerned with the accurate separation of touching and overlapping lines where characters in adjacent lines extend upwards or downwards to touch and interfere with the line above or below.
In these cases the touching region will contain pixels which belong to the upper, lower or both text lines. An example image containing five occurrences of touching and overlapping lines is shown in Figure 1. The images were scanned as binary images at a resolution of 300 dpi (dots per inch).

Figure 1. Example image illustrating five occurrences of touching and overlapping handwritten lines.

2. Algorithm

The entire algorithm, including initial line separation, can be divided into 4 main stages:

1. Connected component labeling. The whole image is divided into a set of connected components. The connected components are then used as the unit of processing.

2. Detect and split touching/overlapping lines. Following initial line separation some components will extend over more than one text line due to touching/overlapping components. Components with greater than average line height indicate regions where overlapping or touching of lines may occur. Pixels belonging to different words in adjacent lines will be included in a single component if there is any form of interconnection. The locations of these components are marked for further analysis by splitting the component into two new components.

3. Mark the connected pixels. For each marked component, analysis of the immediate area around it is performed to locate the boundary area at which the interconnection takes place between the two components. This is marked with gray color for identification.

4. Separate the two components. Once the boundary region between the two lines is established, further analysis is carried out to determine whether each pixel belongs to the upper or lower line.

These stages are described in more detail below.

Proceedings of the Eighth International Workshop on Frontiers in Handwriting Recognition (IWFHR’02) 0-7695-1692-0/02 $17.00 © 2002 IEEE

Figure 2. Detail of a typical connected component and its horizontal histogram. (Labels in the figure: up, v, down; upper region, overlapping region, lower region.)
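As a rough illustration of stage 2 and of the horizontal histogram in Figure 2, the sketch below tests whether a single binary component looks like a crossing-line component and, if so, returns a candidate row at which to split it. It is a minimal sketch under stated assumptions, not the implementation used in this work: the function names, the moving-average smoothing, the factor of 1.5 on the average line height, the peak-to-valley ratio of 2 and the restriction of the valley search to the central rows are all hypothetical choices made for the example.

```python
from typing import List, Optional

Image = List[List[int]]  # binary component image: 1 = ink, 0 = background


def horizontal_histogram(component: Image) -> List[int]:
    """Count the ink pixels in each row of the component."""
    return [sum(row) for row in component]


def smooth(hist: List[int], radius: int = 1) -> List[float]:
    """Moving-average smoothing (the window size is an assumption)."""
    out = []
    for i in range(len(hist)):
        lo, hi = max(0, i - radius), min(len(hist), i + radius + 1)
        out.append(sum(hist[lo:hi]) / (hi - lo))
    return out


def find_split_row(component: Image,
                   avg_line_height: float,
                   height_factor: float = 1.5,   # assumed threshold factor
                   peak_ratio: float = 2.0,      # assumed peak/valley ratio
                   radius: int = 1) -> Optional[int]:
    """Return the valley row of a likely crossing-line component, else None."""
    height = len(component)

    # Height test: only components taller than a threshold derived from
    # the average line height are candidates.
    if height <= height_factor * avg_line_height:
        return None

    hist = smooth(horizontal_histogram(component), radius)

    # Valley: the row with the minimum histogram value. Searching only the
    # central rows is an assumption, made so that thin stroke tips at the
    # very top or bottom of the component are not mistaken for the valley.
    start = max(1, height // 4)
    stop = min(height - 1, height - height // 4)
    if start >= stop:
        return None
    v = min(range(start, stop), key=lambda r: hist[r])

    # Peak test: both portions beside the valley must contain a peak that
    # is clearly larger than the valley itself.
    upper_peak = max(hist[:v])
    lower_peak = max(hist[v + 1:])
    if min(upper_peak, lower_peak) >= peak_ratio * max(hist[v], 1.0):
        return v
    return None


# Toy component: two 6-pixel-wide bodies joined by a 1-pixel-wide stroke.
toy = [[1] * 6] * 4 + [[1, 0, 0, 0, 0, 0]] * 4 + [[1] * 6] * 4
print(find_split_row(toy, avg_line_height=5))  # prints 5
```

In the toy example the component is 12 rows high against an assumed average line height of 5, so it passes the height test, and the thin stroke in the middle produces the valley. A component of uniform width fails the peak test and is left unsplit.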
2.1 Extract Connected Components

Connected component labeling is a basic and widely used technique in image processing and pattern recognition. The classical algorithm for connected component labeling requires two passes through the image and a table to store equivalent labels [6]. In this work, the classical approach to connected component labeling was used. The important parameters recorded for each component were its size and position in the original image.

2.2 Detect and split touching/overlapping lines

Due to overlapping lines, pixels from two different lines are grouped into a single component during connected component extraction. The task of marking crossing-line components is not easy. The algorithm employs structural properties observed in crossing-line components (see section 3). Crossing-line components have a height larger than the average height of components in the image. Also, the horizontal projection of a crossing-line component will usually exhibit an obvious valley and two obvious peaks beside the valley. Based on these assumptions, the following method is used to detect crossing-line components:

1. If a component’s height is larger than a threshold, then it is considered a possible crossing-line component and is sent to the next step. The threshold is defined based on the average height of lines obtained during the initial line separation.

2. The possible crossing-line component is confirmed if its horizontal histogram has an obvious valley and the two portions beside the valley have peaks of sufficient size. For a possible crossing-line component (i.e. step 1 has been passed), its horizontal histogram is obtained and examined. The valley is detected as row v with the minimum histogram value. Given the presence of a valley and two attendant peaks along with a greater than average height, the component is marked as being a crossing-line component. The marking is done by splitting the component into two components along a row in the vicinity of the valley.

Figure 2 shows one connected component along with the smoothed horizontal histogram and detected valley.

2.3 Marking the boundary of connected pixels

The first step is to find the exact locations of the marked components for analysis. The method employed is to scan the image for the telltale signs of a split component. A crossing-line component will have been split into two components (step 2 above) and, as a result, the physical boundaries of the two components overlap: the bottom-most row of the component in the higher line is the same as the top-most row of the component in the lower line.

For the purpose of determining the boundary, an analysis of the immediate area around the row where the split has taken place (henceforth the split-row) is performed. It is first necessary to know in which direction the analysis should work. The vertical distance between the split-row and the Center of Gravity (COG) of each of the two components involved is calculated. The analysis is then carried out in the direction which has the greater distance. The rationale is that, since there is a greater distance between the COG and the split-row, it is highly likely that this direction will provide more material to work with for separating the two components. If the split-row is closer to component B than to component A, then pixels belonging to component B are to be found on component A’s side of the split-row. Thus, the analysis is carried out towards component A.

The algorithm looks for a number of different pixel patterns, which indicate the kind of interconnection involved and therefore where exactly the boundary might lie. Depending upon the type of interconnection at a specific crossing-line component, the rationale for marking the boundary is different.

Figure 3 shows the marking of the connected pixels. It must be explained that the boundary is not a single row of gray pixels.
It is always more than one row. The rows that are marked always belong to the same component and serve as a stepping-stone towards step 4 where the entire component is marked with gray to differentiate it from the other component involved, which remains black.

The area analyzed is rectangular and its dimensions are determined in the following manner:

1. Width: The column boundaries of the two components involved in the interconnection are used to calculate the common columns between the two interconnected components. Some leeway is provided on both sides for cases where the handwriting is slanted heavily.

2. Height: The rows used extend from the split-row towards the peak of the component with the furthest COG. Thus, the rows used are only on one side of the split-row. For a case involving an ascender and a descender, more rows are allocated than for a case where an ascender/descender connects with a normal letter.

Figure 3. The marked boundary of the joined region. Gray area represents the boundary. Notice that, in this particular case, all the gray pixels belong to the lower letter.

2.4 Separating the two components

Once the boundary is marked successfully and the direction of analysis is known, the remaining component can be marked. In all but the toughest cases, extending the boundary to the entire component accurately is fairly straightforward. The logic used is that once the boundary is marked, the area is analyzed for the lowest or the highest marked gray row, depending upon the direction in which the analysis is performed. Once that row is determined, all the pixels that lie within the component towards which the analysis is directed are marked gray. Care has to be taken to ensure that pixels from the other component, which may lie within the boundaries, are not marked accidentally. For that reason, the separation algorithm colors the pixels which are neighbors of previously colored pixels, while also using information about the kind of interconnection involved. As the first colored pixel is found in the boundary, the coloring of the component moves from there towards the center of the component, ultimately leading to a fully gray component. Figure 4 shows the final separation result for the example shown in Figure 3.

Figure 4. Detail of the final separation of pixels. Gray pixels represent the lower letter. Notice how the gray color has propagated downwards from the boundary (Figure 3) to cover the entire component.

3. Types of Interconnections

The interconnections can range from touching letters to overlapping letters. Within both categories, there are further sub-categories. The sub-categories define the kinds of interconnections that exist and determine the logic employed for marking the boundary in each case. The sub-categories are:

1. A descender with a loop overlapping a vertical ascender: in such a case, the analysis works downwards and the pattern that is searched for is the presence of the ascender among the loop of the descender. Once the ascender has been found with respect to the descender, it remains to mark the boundary between the ascender and the descender. The ascender is found by scanning the immediate area row-by-row (beginning at the split-row) and searching for vertical lines representing the descender. Once it has been established how many such lines exist and another one is found at a lower row, that new line must belong to the ascender and is marked as such. This method has a high success rate compared to others used in this work.

2. A descender with a loop touching a vertical ascender or a lower-case letter in the word below: this is slightly easier to solve than the overlapping case. The logic used here is different. One of the characteristics of a loop, complete or incomplete, is that it consists of a horizontal curve at the bottom followed by an upward sweep. The algorithm searches for the bottom of the loop and marks the boundary accordingly. The pixels below the bottom of the loop belong to the ascender or the lower-case letter, as the case may be. This method has a high success rate.

3. A vertical descender touching lower-case letters or the curving top of an ascender or a capital letter: the analysis, as in the previous cases, is again downwards. The criterion searched for is the row where the descender meets the lower-case letter, the curve of the ascender or the capital letter. The descender may be a loop or a single vertical line. When it touches the bottom letter (whichever type it may be), there is a distinct change in the pixel pattern as the vertical line(s) of the descender are replaced by the horizontal pixel formation of the lower letter or the top of the ascender. That is the point at which the boundary is marked.

4. A vertical descender meets a vertical ascender: this case is quite straightforward. In this case, the two letters merge into one another and thus the boundary is set as the middle of the vertical line connecting the two.

Cases 2 and 3 also have corresponding cases where the relative placement of the two components is reversed. In cases where the direction of analysis is calculated to be upwards, the logic used is to shift the analysis area higher such that it now ends at the split-row instead of starting at the split-row (as is otherwise the case when the direction of analysis is calculated to be downwards) and then analyze. This approach yields better results than actually analyzing upwards from the split-row, as it is more intuitive to analyze downwards with reference to the descender.

4. Results and Observations on Failed Cases

Preliminary testing has been carried out on a set of 20 different handwriting images that represent the various forms of touching and overlapping lines. Most had a number of cases of interconnecting lines. Some (like the example shown in Figure 1, included in Image Nos. 1-7 in Table 1) were relatively easy as the touching/overlapping regions were clear due to the lines being well spaced vertically. Some of the images (Image Nos. 15-20) were more difficult as the writing was cramped vertically, leading to a higher failure rate for the histogram test. However, even in the difficult images, the success rate of separation given detection is still high. Some images with no interconnections whatsoever were also included to check that the program does not identify interconnections where none exist. The results of the analysis are tabulated in Table 1 and a number of successful example separations are shown in Figure 5.

Image No.   Connections   Connections   Connections   Boundaries          Separations
            present       detected      missed        correctly marked    done correctly
1           5             5             0             4                   3
2           4             4             0             3                   2
3           1             1             0             1                   1
4           5             5             0             5                   4
5           1             1             0             1                   1
6           5             4             1             3                   2
7           1             1             0             1                   1
Statistics of images with no connections present:
8-14        0             0             0             0                   0
Statistics of images with low success at detecting connections:
15          12            9             3             7                   5
16          9             3             6             1                   1
17          7             2             5             2                   1
18          5             2             3             1                   1
19          6             4             2             2                   2
20          4             0             4             0                   0

Table 1. Summary of the results of the analysis.

The testing of the complete program revealed certain flaws in some cases. These are elucidated below along with the reasons why the algorithm fails to deal with those interconnections correctly and completely.

1. The only case in which the program could report interconnections where there are none is when the handwriting has outrageously large letters, causing the height of the component to far exceed the limit used as a criterion for identification. However, this did not happen in any of the images.

2. The program fails when the loop of the descender coincides with the loop of the ascender. Such a case is extremely difficult to handle with the approach described in this paper. The major issue is that there is no way to show that a boundary exists between the two letters because there is actually no physical boundary, since they coincide almost exactly.

3. The major cause of failures in testing was the incorrect identification of the area of analysis. This may happen when the split-row is too near to one component. This causes the analysis area not to cover the parts of the other component which are critical to identifying the pixel pattern. Thus, the boundary is not determined correctly and the whole separation process goes awry.

4. The algorithm does not work for handwriting which is extremely badly formed. For example, the lines of the letters may not be smooth but irregular and shaky, leading to badly formed letters. Such writing is frequently illegible to humans who have a lifetime of experience at reading handwriting. It is quite beyond the scope of an algorithm that uses pixel patterns, and maybe even one that employs a high order of artificial intelligence.

5. The algorithm fails when the ascenders and descenders criss-cross each other repeatedly at several places. This may be found in cases where two loops meet each other at different slants. This is somewhat similar to case 2. It is equally rare and just as difficult to resolve.

Figure 6 shows image 16 from the table. An examination of the image shows the cramped style of writing (relative to the image in Figure 1) and extremely high pixelation of the letters. 3 of the connections were detected out of 9 cases involving cross-line components.
The detection failures were caused primarily by the fact that the writing in the scanned paper was so small that the scanned image had poorly formed letters. As an extreme case, consider the case where the letters are so small that each letter consists of only a few pixels. In such a case, the program would obviously fail. Figure 6 represents a test image intermediate in difficulty between the successful images and the extreme case considered above.

Figure 5. Eight examples of separated touching lines.

Figure 6. Example of an image with low detection rate.

5. Conclusions

The described method for detecting and separating touching lines in unconstrained handwritten documents has been demonstrated to be effective in all but the most difficult of cases. The remaining problem of separating these rare, more difficult cases is the subject of current research, as is the detection and appropriate allocation of pixel regions which are common to both upper and lower lines. Details of these algorithms and results will be available soon and reported in the final paper.

6. References

1. H. S. Baird, “The skew angle of printed documents”, Proc. Conf. Photographic Scientists and Engineers, SPIE, Bellingham, Wa., pp. 14-21, 1987.
2. V. Shapiro, G. Gluhchev, and V. Sgurev, “Handwritten document image segmentation and analysis”, Pattern Recognition Letters, Vol. 14, No. 1, pp. 71-78, 1993.
3. L. Likforman-Sulem, A. Hanimyan and C. Faure, “A Hough based algorithm for extracting text lines in handwritten documents”, Proc. Int. Conference on Document Analysis and Recognition, Montreal, pp. 774-777, August, 1995.
4. Venturelli F., Kovacs-V M. Zs., “An unconstrained handwritten line segmentation technique”, Fifth international workshop on frontiers in handwriting recognition, University of Essex, England, pp. 385-388, 1996.
5. Bruzzone E., Coffetti M. C., “An algorithm for extracting cursive text lines”, Fifth international conference on Document Analysis and Recognition, Bangalore, India, pp. 749-752, 1999.
6. Rosenfeld, A. and Pfaltz, J. L., “Sequential operations in digital image processing”, Journal of the ACM, 13(4), pp. 471-494, 1966.