
Document Image Analysis
Lecture 20: Intro to Layout
Richard J. Fateman, University of California – Berkeley
Henry S. Baird, Xerox Palo Alto Research Center
UC Berkeley CS294-9 Fall 2000
Page layout analysis
• Structural (Physical, Geometric) Layout Analysis
• Functional (Syntactic, Logical) Layout Analysis
• Read-order determination
Structural
• Isolation of columns, paragraphs, lines, words, tables, figures. Maybe letters.
• Without some layout analysis, much of the previous work would be impossible!
• Without layout analysis, what is the sequence of words in a multi-column format?
Functional
• Typically domain-dependent
• May require merging or splitting of syntactic components
• Encoding into ODA (Open Document Architecture) or SGML (a DTD describes components like section, title, …)
Functional Components
• First page of a technical article may have
– Title
– Author
– Abstract, body/column 1, body/column 2, footnotes
– Pagination
– Journal name/volume/date…
• Business letter might have
– Sender
– Date
– Logo
– Recipient
– Body
– Signature
Finding structural blocks
Common Approaches
• Top-down analysis
– Horizontal and vertical projection profiles (a sketch follows this list)
– Recursive: columns, then paragraphs/lines/words
– As illustrated earlier
• Bottom-up analysis
– Use adjacency, based on:
• Pixels / morphological dilation (millions)
• RLE / merging runs across scan lines (thousands)
• Connected components (hundreds)
• Look at the background (shape-directed covers)
• Also, human hints.
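To make the top-down route concrete, here is a minimal Python sketch (the function names and the min_gap/threshold parameters are illustrative, not taken from the course tools): compute row and column ink profiles of a binarized page, then treat long empty runs in a profile as candidate cuts between columns, lines, or paragraphs, recursing on the pieces.

import numpy as np

def projection_profiles(page):
    # page: 2-D numpy array of 0/1, where 1 = ink
    horizontal = page.sum(axis=1)   # ink per row    -> text lines
    vertical = page.sum(axis=0)     # ink per column -> columns / word gaps
    return horizontal, vertical

def gap_spans(profile, min_gap=5, threshold=0):
    # Return (start, end) index spans where the profile stays at or below
    # `threshold` for at least `min_gap` consecutive positions.
    spans, start = [], None
    for i, v in enumerate(profile):
        if v <= threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_gap:
                spans.append((start, i))
            start = None
    if start is not None and len(profile) - start >= min_gap:
        spans.append((start, len(profile)))
    return spans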
Standard images… the scanned input
Smear character boxes
Smear words to get lines
Smear lines to get paragraphs
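A minimal sketch of the smearing operation behind these three slides (my own code, not the course tool; the page is assumed to be a binary numpy array with 1 = ink): fill short horizontal runs of background so character boxes merge into word boxes and word boxes into line boxes; the same operation applied vertically merges lines into paragraphs.

import numpy as np

def smear_rows(page, gap):
    # Horizontal run-length smearing: within each row, fill any run of
    # background pixels no longer than `gap` lying between two ink pixels.
    out = page.copy()
    for row in out:
        ink = np.flatnonzero(row)
        for a, b in zip(ink[:-1], ink[1:]):
            if 0 < b - a - 1 <= gap:
                row[a + 1:b] = 1
    return out

def smear_cols(page, gap):
    # Vertical smearing (lines -> paragraphs) is the same thing transposed.
    return smear_rows(page.T, gap).T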
Issues:
• Sensitivity to noise. Solutions:
– Clean up via kfill or similar filtering, ruthlessly
– Divide the page (artificially) so noise cannot affect the document globally
• Slanted lines. Solution(s):
– Deskew (since it is not too hard(?)); a sketch follows this list
– Use nearest-neighbor “docstrum” analysis
• Concave regions (text flowing around a box). Solution(?): look at the background
• Variation in font and spacing can throw off the analysis
– Allow for local analysis
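For the deskew option, a common projection-based estimate (a sketch under my own assumptions, not necessarily what the course tools do): try small rotations and keep the angle whose horizontal profile is sharpest.

import numpy as np
from scipy import ndimage

def estimate_skew(page, angles=np.arange(-5.0, 5.25, 0.25)):
    # Rotate the binary page by each candidate angle and score the result by
    # the variance of its horizontal projection; a peaky profile means the
    # text lines are horizontal.  Returns the best angle in degrees.
    best_angle, best_score = 0.0, -1.0
    for a in angles:
        rotated = ndimage.rotate(page, a, reshape=False, order=0)
        score = rotated.sum(axis=1).var()
        if score > best_score:
            best_angle, best_score = a, score
    return best_angle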
Interactive semi-automatic zoning (RJF)
Zoom in
Scroll around
View individual pixels
Semi…
Turn up the noise filter until we start to kill some of the punctuation. How?
As we turn up the threshold, the number of connected components drops, then reaches a stable plateau once the noise is gone, and then drops again as we remove punctuation, the dots above the “i”, etc.
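A sketch of how the component count could be tracked as the size filter is raised (scipy's labelling stands in for whatever the tool actually uses); the operator, or a simple heuristic, then picks the threshold at the start of the plateau.

import numpy as np
from scipy import ndimage

def counts_vs_size_threshold(page, max_size=20):
    # Label connected components of the binary page once, measure their pixel
    # counts, then report how many survive each candidate size threshold.
    labels, n = ndimage.label(page)
    sizes = ndimage.sum(page, labels, index=np.arange(1, n + 1))
    return [(t, int((sizes > t).sum())) for t in range(max_size + 1)]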
auto…
Turn the horizontal smear knob until the number of components drops suddenly from about 3000 to about 600. Character boxes have been merged into word boxes.
Turn the horizontal smear knob further until the number of components drops from about 600 to about 100. Word boxes have become line boxes.
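The same bookkeeping automates the smear knob (this reuses smear_rows from the earlier sketch; the gap range is arbitrary): sweep the horizontal gap and look for the settings where the component count collapses, roughly 3000 to 600 (characters to words) and 600 to 100 (words to lines) on the page shown.

from scipy import ndimage

def counts_vs_smear_gap(page, gaps=range(2, 60, 2)):
    # For each candidate smear gap, smear horizontally and count the
    # resulting connected components.
    counts = []
    for gap in gaps:
        _, n = ndimage.label(smear_rows(page, gap))
        counts.append((gap, n))
    return counts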
matic..
Tweak the vertical smear knob. Lines become paragraphs. (Turn further, and paragraphs become columns.)
Specify read order
Interactive functional tagging: mark subject/author/etc.?
Here we attempt automatic identification of math…
Automatic math zone. This is a challenge because the zone is in two parts, containing the math … f(p)=F(p)
Docstrum / L. O'Gorman
5 nearest neighbors (ogorman93)
Example of “spectrum”
Each point represents the distance and angle of a cc (connected component).
N^2 pairwise comparisons, but not so bad. The statistics give estimates of skew and spacing.
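A brute-force sketch of the docstrum measurement (my code; O'Gorman's implementation differs in detail): for every connected-component centroid, record the distance and angle to each of its k nearest neighbors. Peaks in the angle histogram estimate skew; peaks in the distance histogram estimate character and line spacing.

import numpy as np

def docstrum_pairs(centroids, k=5):
    # centroids: (N, 2) array of connected-component centers (x, y).
    # Returns an array of (distance, angle_in_degrees) pairs, k per component.
    pts = np.asarray(centroids, dtype=float)
    diffs = pts[None, :, :] - pts[:, None, :]        # diffs[i, j] = pts[j] - pts[i]
    dists = np.hypot(diffs[..., 0], diffs[..., 1])   # O(N^2) pairwise distances
    np.fill_diagonal(dists, np.inf)                  # ignore self-pairs
    pairs = []
    for i in range(len(pts)):
        for j in np.argsort(dists[i])[:k]:
            angle = np.degrees(np.arctan2(diffs[i, j, 1], diffs[i, j, 0]))
            pairs.append((dists[i, j], angle))
    return np.array(pairs)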
Extract lines, group into paragraphs
• Components statistically close enough horizontally become words, then lines
• If two lines are statistically close enough, parallel enough, and of similar length, group them into the same text block (a sketch follows)
• (Arguably saves time by not deskewing, and deals with non-constant skew)
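One way the pairwise grouping test might look in code; the field names and thresholds are hypothetical, chosen only to mirror the three conditions above.

def same_block(a, b, median_spacing, angle_tol=3.0, gap_factor=1.5, overlap_frac=0.7):
    # a, b: extracted text lines as dicts with 'angle' (degrees), 'y'
    # (baseline), 'x0', 'x1'.  Two lines join the same block when they are
    # parallel enough, vertically close enough, and of similar extent.
    parallel = abs(a['angle'] - b['angle']) <= angle_tol
    close = abs(a['y'] - b['y']) <= gap_factor * median_spacing
    overlap = min(a['x1'], b['x1']) - max(a['x0'], b['x0'])
    shorter = min(a['x1'] - a['x0'], b['x1'] - b['x0'])
    similar = shorter > 0 and overlap / shorter >= overlap_frac
    return parallel and close and similar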
Example follows..
Sections with different skew
6 business cards, with nearest-neighbor vectors
Extracted text lines, blocks
Useful? General?
Without layout analysis
• Reading across columns
• Misplacing captions
• Misplacing footnotes
• Misunderstanding page numbers (which should be REMOVED in the reformatting process)
• Need extraction of biblio data: title, author, abstract, keywords
• Nearly every subsequent step is compromised by lack of context.
A Diversion: Separating Math from Text
• Why separate math from text?
• Types of mathematics encountered
• Previous Work
• Two approaches
– post-processing commercial OCR
– character-based (details!)
• Errors and their correction
• Ambiguities
Why separate math/text/images/..
• OCR programs do not work for math
The displayed equation becomes, in TextBridge:
fli
l~F(P)=(~ ~(P)
j(P) -~7~(p)
Designation as a “picture” is only a partial solution
Mathematics on a Page
Inline math is harder to pick out because it may look like italic text
Previous Work
• Isolation by hand (most math parser papers)
• Texture- / statistics-based heuristics
– useful for display math “paragraphs”
– not useful for in-line math
• Character-based pseudo-parsing (but without font information or true parsing feedback)
• Incomplete
Proposal: Post-Processing of OCR
• Start with commercial best-effort recognition
• Reprocess the intermediate data structure (e.g. for TextBridge, the XDOC file)
• Accept recognition of text zones with high recognition certainty. (Lines with no errors, surrounded by lines with no errors, are considered solved; a sketch follows this list)
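A sketch of the "solved lines" rule, assuming each recognized line carries a boolean suspect flag derived from the OCR engine's uncertainty marks (the field name is hypothetical).

def solved_line_indices(lines):
    # A line is accepted as solved when neither it nor its neighbors
    # contain any suspect characters.
    clean = [not ln['suspect'] for ln in lines]
    return [i for i, ok in enumerate(clean)
            if ok
            and (i == 0 or clean[i - 1])
            and (i == len(lines) - 1 or clean[i + 1])]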
Separate uncertain areas
• Re-consider “the rest of the image” as potential mathematics zones: uncertain regions (including nearby “certain” characters/lines)
• Isolate characters, identify fonts, etc.
• Apply heuristic rules for separating text and math zones.
• Consider eradicating the math and re-submitting the text; separately recognizing the math and reinserting it into the XDOC
Alternatively, starting from our own naïve OCR
• Connected-component recognition
• Separate characters by initial classification
• Repeatedly re-examine via rules
• Determine text zones, remove math / feed the remainder to commercial OCR
– How best to blank out math? XXX
• Most likely, human interaction remains
Two bags: Math vs Text
• Initially Math
– + - = /, Greek and scientific symbols, 0-9, italics, bold, (), [], sin, cos, tan, dots, commas, decimal points
• Initially Text
– Roman Letters, junk
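A sketch of the initial two-bag assignment; the symbol sets and the token attributes (text, italic, bold) are assumptions standing in for the real rule set.

# Hypothetical symbol sets for the initial pass.
MATH_CHARS = set('+-=/()[]0123456789.,') | set('αβγδθλμπσφω')
MATH_WORDS = {'sin', 'cos', 'tan'}

def initial_bag(token):
    # token: dict with the recognized 'text' and boolean 'italic'/'bold' flags.
    if token.get('italic') or token.get('bold'):
        return 'math'
    s = token['text']
    if s and (s in MATH_WORDS or all(c in MATH_CHARS for c in s)):
        return 'math'
    return 'text'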
Sample Text Bag
Sample Math Bag
Second Pass
• Correct for too much Math
• Grow “clumps” (expand bounding boxes) to categorize (a sketch follows this list)
– 3.14159 vs. “end of sentence.”
– (comment) vs. f(x)
– hyphenated words vs. x^2 - y^2
– horizontal lines generally
– an isolated 1: or is it l (“ell”) or I (“eye”)?
• “bags” or “zones” of geometric-relation boxes containing either words or potential math
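A sketch of the clump-growing step (the box format and growth amounts are assumptions): expand every bounding box a little, then merge boxes that now overlap, repeating until nothing more merges.

def grow_and_merge(boxes, dx, dy):
    # boxes: list of (x0, y0, x1, y1) tuples.  Grow each box by (dx, dy) on
    # every side, then repeatedly merge overlapping boxes until stable.
    boxes = [(x0 - dx, y0 - dy, x1 + dx, y1 + dy) for x0, y0, x1, y1 in boxes]
    changed = True
    while changed:
        changed, merged = False, []
        for b in boxes:
            for i, m in enumerate(merged):
                if b[0] <= m[2] and m[0] <= b[2] and b[1] <= m[3] and m[1] <= b[3]:
                    merged[i] = (min(m[0], b[0]), min(m[1], b[1]),
                                 max(m[2], b[2]), max(m[3], b[3]))
                    changed = True
                    break
            else:
                merged.append(b)
        boxes = merged
    return boxes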
Importance of Context
Here are 12 L’s and a 1
Third Pass
• Too much is in the text bag now
– blur the math to allow for embedded Roman text like “sin” or “l”
• Re-clump the mathematics to see if new bridges have been formed
• Some italics in the math bag may really be
– English words in theorems
– emphasized text
On Ambiguity and Correctness
• Can we find the math in ad - bc by ad hoc methods?
• If we are unable to disambiguate English words, why should we be able to disambiguate mathematics?
• Abuse of mathematical notation is widespread: can we insist that new papers either use a non-ambiguous notation or carry an underlying, non-ambiguous electronic notation?
Conclusions
• We can make a first cut at separating math from text
• If we wish to “enliven” math publications with semantic underpinnings, this may help in their production
• Incorporation of AI rule-based transformations, as well as hand correction, is likely to be important