1/25
Detection and Extraction of Artificial
Text for Semantic Indexing
Christian Wolf and Jean-Michel Jolion
Laboratoire Reconnaissance de Formes et Vision
Bât. Jules Verne, INSA de Lyon
69621 Villeurbanne cedex, France
January 9th 2002
Dagstuhl Seminar on Content-Based Image and Video Retrieval
This presentation can be downloaded from:
http://rfv.insa-lyon.fr/~wolf/presentations
2/25
Plan of the presentation (number of slides per section)
Introduction: 6
Detection and tracking: 3
Enhancement and binarization of the text boxes: 4
Experiments and results: 2
Open problems: 9
Conclusion and Outlook: 1
Total: 25
This work resulted in a patent submitted by France
Télécom on May 23rd, 2001 under the reference
FR 01 06776.
Introduction
Detection
Enh/Binarization
Exp.Results
Open problems
Conclusion
3/25
Content based image retrieval
[Diagram: Example image, Similarity function, Indexing phase, Result]
4/25
Similarity measures
[Figure: three image comparisons, labeled similar, similar, not similar]
5/25
Indexing using Text
[Diagram: Key word ("Patrick Mayhew"), Keyword based search, Result. Indexing phase with example texts: "Patrick Mayhew", "Min. chargé de l´irlande de Nord", "ISRAEL", "Jerusalem", "montage", "T.Nouel", ...]
6/25
Video properties
[Figure: video frame examples annotated with the sizes 80 px, 12 px and 8 px]
7/25
Text extraction: general scheme
1. Video
2. Detection of the text in single frames
3. Tracking
4. Image enhancement (multiple frame integration)
5. Segmentation/Binarization
6. OCR
Example output: "EVENEMENT", "ACTU", "SPELEOS", "Gouffre Berger (Isére)", "aujourd'hui", "France 3 Alpes", "un spéléologue sauveteur"
8/25
Text detection by accumulation of horizontal gradients (LeBourgeois, 1997).
Justification: Text forms a regular texture containing vertical edges which are aligned horizontally.
Post-processing by mathematical morphology.
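The idea of accumulating horizontal gradients can be illustrated with a minimal pure-Python sketch. It is not the LeBourgeois implementation: the forward-difference gradient, the window width, the threshold and the 1-D closing are all assumptions chosen for illustration, and all function names are hypothetical.

```python
def horizontal_gradient(image):
    """Absolute forward difference along each row (0 in the last column)."""
    return [
        [abs(row[x + 1] - row[x]) for x in range(len(row) - 1)] + [0]
        for row in image
    ]


def accumulate_rows(grad, window=5):
    """Sum the gradient magnitudes inside a horizontal window around each pixel."""
    half = window // 2
    return [
        [sum(row[max(0, x - half):x + half + 1]) for x in range(len(row))]
        for row in grad
    ]


def close_row(row, half=1):
    """1-D morphological closing (dilation, then erosion) of a binary row."""
    n = len(row)
    dilated = [int(any(row[max(0, x - half):x + half + 1])) for x in range(n)]
    return [int(all(dilated[max(0, x - half):x + half + 1])) for x in range(n)]


def detect_text_mask(image, window=5, threshold=200, half=1):
    """Binary mask: 1 where the accumulated horizontal gradient is high.

    window, threshold and half are illustrative values, not taken from the slides.
    """
    acc = accumulate_rows(horizontal_gradient(image), window)
    mask = [[int(v >= threshold) for v in row] for row in acc]
    return [close_row(row, half) for row in mask]


# Row 0 imitates dense vertical strokes, row 1 is a flat background.
image = [[0, 255, 0, 255, 0, 255, 0, 255], [100] * 8]
mask = detect_text_mask(image)
```

In this toy input the striped row is marked as text and the flat row is not. The closing step fills small gaps between strokes so that a word forms one connected region.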
9/25
Detection in video sequences
Detection per single frame: list of rectangles per frame
[Diagram: text occurrences plotted against frame nr. (time)]
Tracking: keeping track of text occurrences
Suppression of false alarms
Image enhancement: multiple frame integration
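A sketch of how per-frame rectangle lists could be linked into text occurrences is given here. The slides do not state the matching rule, so the overlap criterion, both thresholds and the dictionary layout are assumptions made for illustration. Occurrences that live for too few frames are dropped as false alarms.

```python
def overlap_ratio(a, b):
    """Intersection area divided by the smaller box area; boxes are (x1, y1, x2, y2)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    if w <= 0 or h <= 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return w * h / min(area_a, area_b)


def track(frames, min_overlap=0.5, min_length=3):
    """Link rectangles of consecutive frames into text occurrences.

    frames: one list of rectangles per frame.
    min_overlap and min_length are hypothetical thresholds.
    """
    active, finished = [], []
    for t, rects in enumerate(frames):
        unmatched = list(rects)
        still_active = []
        for occ in active:
            last_box = occ["boxes"][-1]
            match = next(
                (r for r in unmatched if overlap_ratio(last_box, r) >= min_overlap),
                None,
            )
            if match is None:
                finished.append(occ)      # the occurrence ended in the previous frame
            else:
                unmatched.remove(match)
                occ["boxes"].append(match)
                occ["last"] = t
                still_active.append(occ)
        # Every rectangle that matched nothing starts a new occurrence.
        for r in unmatched:
            still_active.append({"first": t, "last": t, "boxes": [r]})
        active = still_active
    finished.extend(active)
    # Suppression of false alarms: keep only sufficiently long occurrences.
    return [occ for occ in finished if len(occ["boxes"]) >= min_length]


frames = [
    [(10, 10, 50, 20)],
    [(11, 10, 51, 20), (100, 100, 120, 110)],  # second box appears only once
    [(10, 11, 50, 21)],
    [],
]
occurrences = track(frames)
```

The box lists collected per occurrence are what a later multiple frame integration step would work on.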
10/25
Image enhancement
Integration of multiple frames to create a single image of higher quality.
Multiple frame integration: Averaging
Super-resolution (interpolation)
[Figure: interpolation scheme with the points M1, M2, M3 and M4]
An additional weight is included in the interpolation scheme, which decreases the weights of temporal outlier pixels.
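The effect of down-weighting temporal outliers during frame averaging can be sketched as follows. The slides do not give the weight formula, so the weight used here (inverse of the normalized deviation from the temporal mean) is purely an assumption, and the interpolation step for super-resolution is left out.

```python
from statistics import mean, pstdev


def integrate_frames(frames):
    """Weighted temporal average of aligned text boxes (lists of rows of gray values).

    Each sample is weighted by 1 / (1 + |v - mean| / (1 + std)), a hypothetical
    weight that shrinks for values far from the temporal mean of that pixel.
    """
    height, width = len(frames[0]), len(frames[0][0])
    result = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            values = [frame[y][x] for frame in frames]
            mu, sigma = mean(values), pstdev(values)
            weights = [1.0 / (1.0 + abs(v - mu) / (1.0 + sigma)) for v in values]
            result[y][x] = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    return result


# Five aligned 1x2 boxes; the first pixel is disturbed in the last frame.
frames = [[[100, 50]], [[100, 50]], [[100, 50]], [[100, 50]], [[200, 50]]]
enhanced = integrate_frames(frames)
plain_mean = mean(frame[0][0] for frame in frames)
```

Compared with the plain mean, the weighted result stays closer to the stable value of the pixel, while a pixel that never changes is returned unchanged.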
11/25
Binarization
Niblack: $T = m + k \cdot s$
Sauvola et al.: $T = m \cdot \left(1 + k \cdot \left(\frac{s}{R} - 1\right)\right)$
m: mean of the window
s: standard deviation of the window
k: parameter
R: dynamics of the gray values of the image
M: minimum gray value of the image
Contrast in the center of the image: $C_L = \frac{I - m}{s}$
The contrast of the window: $C_F = \frac{M - m}{R}$
The maximum local contrast: $C_{max} = \frac{M - m}{s}$
$I : C_L = a \, (C_{max} - C_F)$
$T = (1 - a)\, m + a M + a \frac{s}{R} (m - M)$
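The three thresholds on this slide, Niblack's T = m + k·s, Sauvola's T = m·(1 + k·(s/R − 1)) and the contrast-based T = (1 − a)·m + a·M + a·(s/R)·(m − M), can be compared numerically with the sketch below. The default parameter values (k = -0.2, k = 0.5, a = 0.5) and the use of the population standard deviation are assumptions, not values stated on the slide.

```python
from statistics import mean, pstdev


def niblack_threshold(m, s, k=-0.2):
    """T = m + k*s (k is an assumed default)."""
    return m + k * s


def sauvola_threshold(m, s, k=0.5, R=128.0):
    """T = m * (1 + k*(s/R - 1)) (k is an assumed default)."""
    return m * (1.0 + k * (s / R - 1.0))


def contrast_threshold(m, s, M, R, a=0.5):
    """T = (1 - a)*m + a*M + a*(s/R)*(m - M) (a is an assumed default)."""
    return (1.0 - a) * m + a * M + a * (s / R) * (m - M)


def window_statistics(window):
    """Mean and standard deviation of a flat list of gray values."""
    return mean(window), pstdev(window)


# Hypothetical window statistics: m = 100, s = 20, image minimum M = 10, R = 40.
t_niblack = niblack_threshold(100.0, 20.0)
t_sauvola = sauvola_threshold(100.0, 20.0)
t_contrast = contrast_threshold(100.0, 20.0, M=10.0, R=40.0)
```

A pixel would be labeled as text when its gray value lies below the threshold, assuming dark text on a lighter background. When s equals R, the contrast-based threshold reduces to the window mean m. When s is close to 0, it moves towards (1 − a)·m + a·M.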
12/25
Binarization methods: examples
[Figure panels: Original image, Fisher, Fisher (windowed), Yanowitz B., Niblack, Sauvola et al., Our method]
13/25
Binarization using a priori knowledge
Bayesian MAP estimation using prior
knowledge on the spatial relationships in the
image, modeled as a Markov random field.
(In collaboration with David Doermann from the
Language and Media Processing Laboratory of
the University of Maryland)
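To make the idea of MAP estimation with a Markov random field prior concrete, here is a generic sketch. It is not the model developed in the collaboration. It assumes a Gaussian observation model with fixed class means, an Ising smoothness prior on the 4-neighborhood, and iterated conditional modes (ICM) as an approximate optimizer. All parameter values are hypothetical.

```python
def icm_binarize(image, threshold=128, beta=1.0, sigma=40.0, sweeps=5,
                 text_mean=0.0, background_mean=255.0):
    """Approximate MAP labeling (1 = text, 0 = background) by ICM.

    Energy per pixel: (I - mean_label)^2 / (2*sigma^2)
                      + beta * (number of 4-neighbors with a different label).
    """
    height, width = len(image), len(image[0])
    means = {1: text_mean, 0: background_mean}
    # Initial labeling from a simple global threshold.
    labels = [[1 if v < threshold else 0 for v in row] for row in image]
    neighbors = ((-1, 0), (1, 0), (0, -1), (0, 1))
    for _ in range(sweeps):
        for y in range(height):
            for x in range(width):
                best_label, best_energy = labels[y][x], None
                for label in (0, 1):
                    data = (image[y][x] - means[label]) ** 2 / (2.0 * sigma ** 2)
                    disagree = sum(
                        1
                        for dy, dx in neighbors
                        if 0 <= y + dy < height and 0 <= x + dx < width
                        and labels[y + dy][x + dx] != label
                    )
                    energy = data + beta * disagree
                    if best_energy is None or energy < best_energy:
                        best_label, best_energy = label, energy
                labels[y][x] = best_label
    return labels


# A dark text patch whose center pixel is corrupted by noise.
noisy = [[0, 0, 0], [0, 140, 0], [0, 0, 0]]
thresholded = [[1 if v < 128 else 0 for v in row] for row in noisy]
restored = icm_binarize(noisy)
```

The global threshold alone labels the noisy center pixel as background. With the spatial prior, the agreement of its four neighbors outweighs the data term and the pixel is labeled as text.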
14/25
5 different MPEG 1 videos of resolution 384x288.
62 minutes
93000 frames
413 text appearances
15/25
Detection and OCR results

Detection results (columns: Pred. Text, Pred. Non-Text, Total, %)
Positives: 301, 21, 322, 93.5
Other row labels: False alarms, Logos, Scene text, Pos+Log+Scene, Total
Other values, in the order listed: 350, 947, 75, 72, 497, 1444, 34.4
Cell annotations: True pos., True neg., False pos., False neg.

OCR results, classified by binarization method

Input  Bin. method    Recall  Precision  Cost
AIM2   Niblack        67.4    87.5       499
AIM2   Sauvola R=128  53.8    87.6       616.5
AIM2   R=ad           75.0    87.8       384.5
AIM2   R=ad, shift    78.4    90.4       344.5
AIM3   Niblack        92.5    78.1       196
AIM3   Sauvola R=128  69.9    89.6       206
AIM3   R=ad           85.3    92.5       110
AIM3   R=ad, shift    96.2    95.3       51.00
AIM4   Niblack        78.5    92.0       252.00
AIM4   Sauvola R=128  48.6    87.7       490.50
AIM4   R=ad           69.8    84.8       360.50
AIM4   R=ad, shift    80.1    90.4       211.50
AIM5   Niblack        62.1    71.4       501.50
AIM5   Sauvola R=128  66.7    89.3       324.50
AIM5   R=ad           64.8    90.1       328.00
AIM5   R=ad, shift    69.0    91.0       294.50
Total  Niblack        73.1    82.6       1448.5
Total  Sauvola R=128  58.4    88.5       1637.5
Total  R=ad           73.0    88.4       1183
Total  R=ad, shift    79.6    91.5       901.5
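Two of the figures on this slide can be cross-checked with simple arithmetic. The sketch below assumes that 301 and 21 are the detected and missed positives out of 322, and that the cost values are grouped per input video (AIM2 to AIM5) followed by the totals. Both readings are interpretations of the slide, not statements from it.

```python
def recall_percent(detected, missed):
    """Recall in percent, rounded to one decimal place."""
    return round(100.0 * detected / (detected + missed), 1)


# Cost per binarization method for AIM2, AIM3, AIM4, AIM5 (assumed grouping).
costs = {
    "Niblack": [499, 196, 252.0, 501.5],
    "Sauvola R=128": [616.5, 206, 490.5, 324.5],
    "R=ad": [384.5, 110, 360.5, 328.0],
    "R=ad, shift": [344.5, 51.0, 211.5, 294.5],
}

# Totals as reported on the slide.
reported_totals = {
    "Niblack": 1448.5,
    "Sauvola R=128": 1637.5,
    "R=ad": 1183,
    "R=ad, shift": 901.5,
}

detection_recall = recall_percent(301, 21)
computed_totals = {name: sum(values) for name, values in costs.items()}
```

Under these assumptions the detection recall matches the 93.5 given on the slide, and the per-video costs add up to the reported totals for all four methods.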
16/25
Open questions
• Scene text (general orientations, deformations)
• Moving text
17/25
What is scene text?
[Diagram: video frames, with the subsets of frames containing scene text and frames containing artificial text]
We do not have enough information about the importance of text in the
destination domain. How many frames contain text, and how many contain scene text?
18/25
Detection: From artificial text to scene text
Several constraints have to be removed when passing from artificial text to scene text:
• The constraints on temporal stability need to be abandoned or at least softened (no initial frame integration).
• Text can be aligned in all orientations (creation of an oriented feature in multiple directions, similar to invariant features).
• Contrast is possibly lower because scene text is not designed to be read easily (is detection of unreadable text necessary?).
19/25
Text models
Simple models: sets of edges or vertical strokes...
+ Generalize well, respond to many kinds of text
- Many false alarms
Complex models: templates, probabilistic models (MRF)...
+ Powerful, fewer false alarms
- Do not generalize well
Main problem: Distinction between characters and structures similar to text according to the chosen model.
⇒ Assumptions are necessary (on the font, size, style, contrast, color, length, etc.) but not sufficient.
20/25
Sven Dickinson: evolution of models
21/25
What is text?
Whatever model we choose, we cannot detect/recognize all kinds of text without solving the general image understanding problem. The best thing we can do is to include richer features into the detection process: a composite model for text.
• Structural analysis (e.g. detection and recognition of characters by strokes). Very hard and very unlikely to work in the case of noisy images, low resolutions and difficult fonts.
• Statistical modeling of text features (e.g. by learning techniques). Problem: robust detection needs large neighborhood sizes, which lead to combinatorial explosion.
E.g.: texture-based methods for small text; segmentation + perceptual grouping and structural methods for big text.
22/25
Learning techniques: pros and cons
Bibliography:
• Learning directly the gray levels of the input image (Jung 2001)
• Learning features, i.e. coefficients of the Haar wavelet (Li and Doermann 2000) or edge strength (Lienhart 2000)
+ Learning is an easy way to handle the complexity of text.
- Text can appear in videos in many different fonts, sizes, styles, colors, orientations etc. Learning all the different forms may not be feasible.
23/25
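Color processing for detection?
[Figure panels: Original image, Sobel on grayscale image, Sobel on L*u*v* image]
Sobel on grayscale image: $\begin{pmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{pmatrix}$
Sobel on L*u*v* image: $\begin{pmatrix} 1 \\ 2 \\ 1 \end{pmatrix} \otimes D_{euclid}(I_{x-1,0}, I_{x+1,0})$
• Saturating distance or non saturating distance?
• Reflection processing?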
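A color version of the horizontal Sobel response can be sketched by replacing the gray value difference between the left and right neighbor with their Euclidean distance in L*u*v* space, weighted vertically by (1, 2, 1). The sketch assumes the image is already converted to L*u*v* triples, uses a non-saturating distance, and leaves border pixels at zero. The function names are hypothetical.

```python
from math import sqrt


def euclid(p, q):
    """Euclidean distance between two color triples."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def color_sobel_x(image):
    """Horizontal edge strength of an image given as rows of (L, u, v) triples.

    For each interior pixel, the distances between the left and right
    neighbors in the rows above, at, and below are combined with the
    vertical weights 1, 2, 1.
    """
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            out[y][x] = sum(
                weight * euclid(image[y + dy][x - 1], image[y + dy][x + 1])
                for dy, weight in ((-1, 1), (0, 2), (1, 1))
            )
    return out


# Same lightness on both sides, different chromaticity: a grayscale
# operator working on L alone would see no edge here.
left, right = (50.0, 0.0, 0.0), (50.0, 30.0, 40.0)
image = [[left, left, right] for _ in range(3)]
response = color_sobel_x(image)
```

Because a distance is never negative, this response gives the edge strength but not the edge polarity, unlike the signed grayscale Sobel kernel.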
24/25
Tracking of moving scene text
Do we detect the text in single frames (as for artificial text), or do we treat the video stream as a whole?
• Single frames: Multiple frame integration of moving text needs robust registration of the text boxes in different frames (e.g. rough segmentation into text and background pixels before the registration of the text pixels only). Robust methods, which are able to track objects in clutter, are needed.
• Detection of moving objects, e.g. by optical flow or spatio-temporal methods.
• Mosaicing techniques can be employed for image enhancement.
25/25
Conclusion and Outlook
• We developed a system for detection, tracking, enhancement and binarization of artificial text in videos.
• The total recognition rate for artificial text is surprisingly high, given the quality of the text, but not yet good enough for indexing purposes.
• The remaining problems in text extraction seem to be typical for applications in visual information management: We went as far as we could with low level features. We cannot take the necessary step to semantic information. What is text? Possible definition: text is what a human or an OCR system can recognize as text.
• We have to include as much a priori knowledge as possible into the process.