
Rejection Measures for Handwriting Sentence Recognition
S. Marukatat, T. Artières, P. Gallinari
LIP6, 8, rue du Capitaine Scott, 75015 Paris, France
E-mail: {Sanparith.Marukatat, Thierry.Artieres, Patrick.Gallinari}@lip6.fr
B. Dorizzi
EPH, Institut National des Télécommunications, 9, rue Charles Fourier, 91011, Evry, France
E-mail: [email protected]
Abstract
In this paper we study the use of confidence measures
for an on-line handwriting recognizer. We investigate various confidence measures and their integration in an isolated
word recognition system as well as in a sentence recognition system. In isolated word recognition tasks, the rejection mechanism is designed in order to reject the outputs
of the recognizer that are possibly wrong, which is the case
for badly written words, out-of-vocabulary words or general drawings. In sentence recognition tasks, the rejection
mechanism allows rejecting parts of the decoded sentence.
1. Introduction
In a pen-based user interface, an on-line Handwriting
Recognition (HWR) system (i.e. a HWR system dealing
with the temporal sequence of points) may be used to interpret the input signals. These signals may be either texts,
drawings or command gestures. However, even if such systems are designed as accurately as possible, they produce errors, which may be due to badly written or out-of-vocabulary words, or even to drawn forms without meaning.
Thus, in order for the system to be accepted by users,
a rejection procedure has to be implemented to complete
the pure recognition process. If HWR is used to recognize
gestures, rejection is essential to avoid executing a wrong
action. In the case of text recognition, comfort for the users
will be increased if the system can propose a certain number of alternatives related to the confidence of its outputs
(the lower the confidence, the more alternatives). Measuring the confidence in the recognizer's output is thus an important part of a pen-based interface. Such a measure should allow rejecting parts or the totality of the input signal.
A rejection mechanism relies on the definition of a confidence measure (in the recognizer's output) and on its comparison with a threshold in order to decide whether to accept or reject the input signal. The core of this study
consists in building a rejection mechanism at the outputs
of a handwriting sentence recognizer. We investigate a few
confidence measures based on the use of letter anti-models.
An anti-model of a letter is a model of anything but this
letter.
At the word level, the rejection mechanism is used to reject or accept the recognizer's output. This rejection mechanism is integrated into the sentence recognition process to decide whether each word in the decoded sentence is correct or not; this means that parts of a decoded sentence may be accepted while other parts are rejected.
The paper is organized as follows. In section 2, we
briefly describe our on-line handwriting sentence recognition system that we will use as an input to the rejection
mechanism. Then we discuss the definition of word confidence measures in section 3. Section 4 deals with the rejection thresholds, section 5 describes the integration of the rejection mechanism into the sentence recognizer, and section 6 presents a set
of experimental results for isolated word recognition and
sentence recognition tasks. Finally section 7 concludes this
paper.
2. Handwriting Recognition System
All experiments reported in this study have been performed with an on-line HWR system developed in our team.
It is a hybrid Hidden Markov Models (HMMs) / Neural Networks (NNs) system, where emission probability densities are estimated through mixtures of predictive neural networks (see [10] for a general presentation). First, the input signal is processed into a sequence of frames; this corresponds to the observation sequence for our HMM/NN recognizer.
Word recognition is performed through a lexicon driven
procedure where the dictionary is organized in prefix form.
This structure is explored using a frame-synchronous beam
search and results in a ranked list of most likely words. Sentence recognition is performed through an extension of this
scheme where, all along the beam search, best word hypotheses (together with their boundaries and likelihoods)
are stored in an intermediate structure called word graph.
This structure is processed later in order to introduce a language model and to determine the most likely sequence of words (i.e. the recognized sentence).
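As a rough illustration of the prefix-organized dictionary mentioned above (the actual implementation details are given in [10]; the class and function names below are purely illustrative), a prefix tree groups words sharing the same first letters so that the beam search can extend common prefixes only once:

```python
class TrieNode:
    """One node of a prefix-organized lexicon; illustrative sketch, not the system's code."""
    def __init__(self):
        self.children = {}    # letter -> TrieNode
        self.is_word = False  # True if the path from the root spells a dictionary word

def build_lexicon(words):
    """Build a prefix tree so that words sharing a prefix share the same search states."""
    root = TrieNode()
    for word in words:
        node = root
        for letter in word:
            node = node.children.setdefault(letter, TrieNode())
        node.is_word = True
    return root

# Example: 'card' and 'care' share the prefix 'car' and thus the same first three nodes.
lexicon = build_lexicon(["card", "care", "cat"])
```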
3. Confidence Measures
There are many ways to define a confidence measure.
Some methods that have been proposed exploit different
features obtained from the decoding step, for example the
lattice density [7, 16], N-best lists [14] or language models [15]. Other works use a post-classifier to combine features such as the likelihood and other statistics gathered from the decoding process (e.g. the number of letters in a word) into one measure [6, 8, 13]. By far, however, the most popular techniques are based on building a so-called anti-model or alternate model [1, 2, 4, 9].
Such an anti-model is used to normalize the likelihood of an unknown observation sequence $O_1^T = (o_1, o_2, \ldots, o_T)$ by computing a ratio between the joint probabilities of the hypothesized word $W$ and its alternate model (or anti-model) $\overline{W}$, namely $P(O_1^T, W) / P(O_1^T, \overline{W})$, or, more frequently, to form a likelihood ratio $P(O_1^T \mid W) / P(O_1^T \mid \overline{W})$ under the assumption of equal priors.
One can try to learn explicitly the anti-model for every
word, but this is possible only in very limited applications
[4, 9, 12]. For a more general task, implicit alternate models
can be derived from other models in the system or from the
competing hypotheses [3, 4, 11]. It should be noted that
some experimental studies (e.g. [3]) have shown that an
implicit model can outperform an explicit alternate model
in performance and in computational cost.
In this study we will focus on the confidence measures
based on the use of implicit anti-models. However, since
different letters may be variously modeled in the Handwriting Recognition engine, we investigated the use of
letter-level confidence measures, that may be combined to
compute word-level confidence measures. A comparable
scheme has already been used in the speech recognition
field [2, 11] and has shown interesting results. In the following, we first present how a word-level confidence measure may be derived from letter-level confidence measures
(§3.1). Then, we define some anti-models that we used to compute letter-level confidence measures (§3.2).
3.1. Word Confidence Measure
Consider that, for an input observation sequence $O_1^T = (o_1, o_2, \ldots, o_T)$, the recognizer outputs a hypothesized word $W$, with $l_W$ the number of letters in $W$, and $(w_i)_{i=1,\ldots,l_W}$ the letters in $W$. Furthermore, let $(b_i)_{i=1,\ldots,l_W}$ and $(e_i)_{i=1,\ldots,l_W}$ be the beginning and ending times of the different letters in $W$, found by dynamic programming (with $b_1 = 1$ and $e_{l_W} = T$).
We will note $m_{\mathrm{word}}(O_1^T, W)$ the word-level confidence measure, which is related to the confidence we have that the sequence of observations $O_1^T$ really corresponds to the handwriting of the word $W$. This confidence is defined as the log-likelihood normalization ratio:

$$ m_{\mathrm{word}}(O_1^T, W) = \frac{1}{T} \log\left[ \frac{P(O_1^T \mid W)}{P(O_1^T \mid \overline{W})} \right] = \frac{1}{T} \sum_{i=1}^{l_W} \log\left[ \frac{P(O_{b_i}^{e_i} \mid w_i)}{P(O_{b_i}^{e_i} \mid \overline{w_i})} \right] \qquad (1) $$
In a similar way, we will note $m_{\mathrm{letter}}(O_{b_i}^{e_i}, w_i)$ the letter-level confidence measure, which is related to the confidence we have that the letter $w_i$ was really written between the beginning time $b_i$ and the ending time $e_i$. The letter confidence measure is defined as the log-likelihood ratio normalized by the length of the letter, which allows comparing letters of different lengths.
$$ m_{\mathrm{letter}}(O_b^e, l) = \frac{1}{e - b + 1} \log\left[ \frac{P(O_b^e \mid l)}{P(O_b^e \mid \overline{l})} \right] \qquad (2) $$
Replacing this in the word-confidence formula, the word confidence measure may be expressed as a function of the letter confidence measures:
$$ m_{\mathrm{word}}(O_1^T, W) = \sum_{i=1}^{l_W} \frac{e_i - b_i + 1}{T}\, m_{\mathrm{letter}}(O_{b_i}^{e_i}, w_i) \qquad (3) $$
However, since it may be interesting to give various weights to the letters in a word, we will make use of a more general formulation by introducing letter weighting coefficients $(\alpha_i)_{i=1,\ldots,l_W}$ such that word confidence measures are defined using:
$$ m_{\mathrm{word}}(O_1^T, W) = \sum_{i=1}^{l_W} \alpha_i\, m_{\mathrm{letter}}(O_{b_i}^{e_i}, w_i), \qquad \text{with } \sum_{i=1}^{l_W} \alpha_i = 1 \qquad (4) $$
In this work we considered two kinds of coefficients: uniform coefficients (i.e. $\alpha_i = 1/l_W$) and duration coefficients (i.e. $\alpha_i = (e_i - b_i + 1)/T$). Uniform coefficients allow all letters to contribute equally to the confidence measure, while duration coefficients give longer letters more importance than shorter letters.
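To make the combination concrete, here is a minimal sketch of Eqs. (3) and (4) (the letter confidences and segment boundaries are assumed to come from the decoder; the function name is illustrative):

```python
def word_confidence(letter_confidences, boundaries, weighting="duration"):
    """Combine letter-level confidences m_letter into a word-level confidence (Eq. (4)).

    letter_confidences: list of m_letter(O_{b_i}^{e_i}, w_i) values, one per letter.
    boundaries: list of (b_i, e_i) frame indices per letter, with b_1 = 1 and e_{l_W} = T.
    weighting: "uniform" (alpha_i = 1/l_W) or "duration" (alpha_i = (e_i - b_i + 1)/T).
    """
    l_w = len(letter_confidences)
    T = boundaries[-1][1]  # ending frame of the last letter
    if weighting == "uniform":
        alphas = [1.0 / l_w] * l_w
    else:  # duration weighting, which recovers Eq. (3)
        alphas = [(e - b + 1) / T for (b, e) in boundaries]
    return sum(a * m for a, m in zip(alphas, letter_confidences))

# Example with three letters: duration weighting favours the longer middle letter.
print(word_confidence([-0.2, 0.5, -0.1], [(1, 4), (5, 14), (15, 18)], weighting="duration"))
```

With duration coefficients this reduces exactly to Eq. (3), since the weights $(e_i - b_i + 1)/T$ sum to 1 when the letter segments cover the whole sequence.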
3.2. Letter Confidence Measure
In this work we focused on the use of implicit anti-models to compute letter-level confidence measures. Let $O_b^e = (o_b, o_{b+1}, \ldots, o_e)$ be the observation sequence associated to a letter $l$ in $W$ after the decoding step, and let $\overline{l}$ be the anti-model of $l$. We investigated a few anti-models $\overline{l}$:
The first anti-model is obtained by considering the letter, among all letters except $l$, giving the highest likelihood for $O_b^e$. We call this anti-model the single letter anti-model.
The second anti-model is slightly more sophisticated in that the anti-model of a letter $l$ is another letter, a sequence of letters, or even a part of a letter giving the highest probability for $O_b^e$. This allows, for example, the anti-model of letter 'd' to be 'cl', or the anti-model of letter 'i' to be a part of the letter 'u' or 'w'. We will call such an anti-model the general handwriting anti-model, since the anti-model of a letter may be any part of a handwriting signal.
The third anti-model exploits the fact that we use a closed vocabulary. Recall that the recognizer outputs a list of dictionary words. The anti-model for a letter $l$ is then the part of the next word hypothesis corresponding to the same sequence of frames $O_b^e$. We will call this type of anti-model the dictionary-dependent anti-model.
The last anti-model, classically used in the speech literature, is called the on-line garbage model [1]. It
consists in a frame-by-frame normalization, where the
emission probability of a frame is defined as the average emission probability among the n-best probabilities (computed from all letter models) for this observation.
These methods were compared with a benchmark confidence measure based on the simple duration-normalized likelihood, i.e. $\frac{1}{e-b+1} \log P(O_b^e \mid l)$.
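To make the anti-model idea concrete, the following sketch computes the letter confidence of Eq. (2) with the on-line garbage anti-model, i.e. replacing the denominator by a frame-by-frame average of the n best letter emission probabilities. The per-frame log-likelihoods are assumed to be provided by the recognizer, and the function name is illustrative:

```python
import math

def letter_confidence_online_garbage(frame_loglikes, letter, n_best=5):
    """Letter confidence (Eq. (2)) using the on-line garbage anti-model.

    frame_loglikes: list (one entry per frame o_b..o_e) of dicts mapping each
        letter model to log P(o_t | letter).
    letter: the hypothesized letter l whose segment O_b^e is being scored.
    n_best: number of best per-frame probabilities averaged into the garbage score.
    """
    duration = len(frame_loglikes)
    log_p_letter = 0.0
    log_p_garbage = 0.0
    for frame in frame_loglikes:
        log_p_letter += frame[letter]
        # Garbage emission probability: average of the n best emission probabilities.
        best = sorted((math.exp(ll) for ll in frame.values()), reverse=True)[:n_best]
        log_p_garbage += math.log(sum(best) / len(best))
    # Log-likelihood ratio normalized by the letter duration (e - b + 1).
    return (log_p_letter - log_p_garbage) / duration
```

The benchmark measure mentioned above corresponds to dropping the garbage term, i.e. returning log_p_letter / duration.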
4. Rejection Threshold
The decision to accept or reject the recognizer's output is taken by comparing its confidence value with a threshold. Let $TH$ denote the rejection threshold; the rejection mechanism then consists of the following rules: if $m_{\mathrm{word}}(O_1^T, W) < TH$, reject the recognizer's output; if $m_{\mathrm{word}}(O_1^T, W) \geq TH$, accept it.
However, using such a fixed threshold for every word makes sense only if letter confidence measures are more or less comparable. Since letters have various durations and are modelled with various accuracies, this property is not guaranteed, although the anti-model normalization is expected to alleviate this problem. In order to investigate the anti-model quality, we used two kinds of thresholds, letter-independent thresholds and letter-dependent thresholds, the latter taking into account the variation in letter-level confidence measures. The word-level threshold is formalized as a combination of letter-dependent thresholds:
$$ TH_{\mathrm{word}}(W) = \sum_{i=1}^{l_W} \alpha_i\, TH_{\mathrm{letter}}(w_i), \qquad \text{with } \sum_{i=1}^{l_W} \alpha_i = 1 \qquad (5) $$
where the $\alpha_i$ are the same coefficients as in Eq. (4) and $TH_{\mathrm{letter}}(w_i)$ is the threshold for letter $w_i$. For letter-independent thresholds, $TH_{\mathrm{letter}}(w_i) = Th$ for all $i$. To compute the letter-dependent threshold for a letter $l$, we considered two sets of letter confidence values, computed from signals corresponding to the letter $l$ and to other letters. $TH_{\mathrm{letter}}(l)$ is chosen to be a frontier value between these two sets.
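The paper does not detail how this frontier value is chosen; one plausible choice, sketched below with illustrative names, is the value that minimizes the number of misclassified training samples between the two sets:

```python
def estimate_letter_threshold(conf_same_letter, conf_other_letters):
    """Choose TH_letter(l) as a frontier between confidences of genuine samples of
    letter l and confidences of other letters, here the value that minimizes the
    number of misclassified training samples (one plausible criterion, assumed here)."""
    candidates = sorted(set(conf_same_letter) | set(conf_other_letters))
    best_th, best_errors = candidates[0], float("inf")
    for th in candidates:
        errors = sum(1 for c in conf_same_letter if c < th)      # genuine letters rejected
        errors += sum(1 for c in conf_other_letters if c >= th)  # other letters accepted
        if errors < best_errors:
            best_th, best_errors = th, errors
    return best_th
```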
In order to study the accuracy of the rejection method, we will investigate the system performance for various rejection rates by letting the thresholds vary according to an additional parameter that controls the sensitivity of the rejection mechanism. The word-level threshold is then computed as:
$$ TH_{\mathrm{word}}(W, \delta) = \sum_{i=1}^{l_W} \alpha_i \left( TH_{\mathrm{letter}}(w_i) + \delta \right) \qquad (6) $$
where $\delta$ is an adjustable parameter that controls the rejection rate.
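Putting Eqs. (4) and (6) together, the word-level decision can be sketched as follows (letter confidences, thresholds and weights are assumed to be computed as described above; the names are illustrative):

```python
def accept_word(letter_confidences, letter_thresholds, alphas, delta=0.0):
    """Accept the recognizer's output if m_word >= TH_word(W, delta) (Eqs. (4) and (6)).

    letter_confidences: m_letter values for the letters of the hypothesized word W.
    letter_thresholds: TH_letter(w_i) per letter (a constant for letter-independent thresholds).
    alphas: weighting coefficients summing to 1 (uniform or duration based).
    delta: sensitivity parameter controlling the rejection rate.
    """
    m_word = sum(a * m for a, m in zip(alphas, letter_confidences))
    th_word = sum(a * (t + delta) for a, t in zip(alphas, letter_thresholds))
    return m_word >= th_word
```

Since the coefficients $\alpha_i$ sum to 1, adding $\delta$ to every letter threshold simply shifts the word threshold by $\delta$, which is what makes it a convenient global sensitivity control.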
5. Integration into the sentence recognizer
The rejection mechanism is integrated into the sentence recognizer as a post-processor: we use it to reject or accept each word in the best sentence. This choice incurs a lower computational cost than integrating the rejection mechanism into the word graph building stage, and is more meaningful than using word confidence in the selection of the best sentence. Fig. 1 summarizes the procedure. First, the handwritten signal is decoded, resulting in a word graph including the best word hypotheses. Then the word graph is processed using the rejection mechanism to reject words inside the best sentence.
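In code, this post-processing step amounts to a simple loop over the words of the best sentence, reusing the accept_word sketch of Section 4; the word-hypothesis fields below are hypothetical placeholders for the information stored in the word graph:

```python
def filter_best_sentence(best_sentence, delta=0.0):
    """Accept or reject each word of the decoded sentence independently.

    best_sentence: list of word hypotheses; each is assumed to carry the decoded
        word, its letter confidences, letter thresholds and weighting coefficients.
    Returns the list of (word, accepted) decisions.
    """
    decisions = []
    for hyp in best_sentence:
        accepted = accept_word(hyp["letter_confidences"],
                               hyp["letter_thresholds"],
                               hyp["alphas"],
                               delta)
        decisions.append((hyp["word"], accepted))
    return decisions
```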
6. Experiments
6.1. Database
Experiments reported in this study have been performed on the UNIPEN international database [5]. The recognizer is trained on a database composed of 30k isolated words written by 256 writers from various countries. Letter-dependent thresholds are estimated using the same training set.
For isolated word recognition tests, we used a test set of 2k words. For sentence recognition we used a test set of 200 sentences. Isolated word recognition as well as sentence recognition are performed using a dictionary of about 3k words.
Figure 1. Illustration of the rejection mechanism embedded in the sentence recognition
process. The handwritten signal (top) is decoded resulting in a word graph (middle). The
word graph is processed using the rejection
mechanism to produce an acceptance or rejection decision for each word (bottom).
6.2. Performance Criteria
To evaluate the efficiency of rejection mechanisms, we used well-known measures based on the confusion matrix shown in Table 1.

                 Recognizer's output is
                 Correct      Not correct
   Accept           a              b
   Reject           c              d

Table 1. Rejection statistics: a, b, c, d represent the number of words that were correctly recognized (or not), and that are accepted (or rejected) according to the rejection mechanism.

From this matrix, many statistics can be derived, for example the rejection rate $(c+d)/(a+b+c+d)$, the recognition rate $a/(a+b)$, and the well-known type 1 and type 2 error rates. Type 1 error consists in accepting an incorrect response and is defined as $b/(b+d)$. Type 2 error consists in rejecting a correct response and is defined as $c/(a+c)$. These statistics are computed for various rejection rates by varying the sensitivity of the rejection method, i.e. $\delta$ in Eq. (6). The so-called detection and trade-off curve (DET curve) is obtained by plotting the type 1 error rate against the type 2 error rate.
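As a sketch, these statistics can be computed from the test words as follows, assuming that for each word we know whether the recognizer's output was correct and whether it was accepted at a given sensitivity $\delta$ (names are illustrative):

```python
def rejection_statistics(results):
    """Compute the statistics of Table 1 from (correct, accepted) pairs.

    results: list of (correct, accepted) booleans, one per test word.
    Returns type 1 error, type 2 error, rejection rate and recognition rate.
    """
    a = sum(1 for correct, accepted in results if correct and accepted)
    b = sum(1 for correct, accepted in results if not correct and accepted)
    c = sum(1 for correct, accepted in results if correct and not accepted)
    d = sum(1 for correct, accepted in results if not correct and not accepted)
    type1 = b / (b + d) if (b + d) else 0.0   # incorrect outputs that were accepted
    type2 = c / (a + c) if (a + c) else 0.0   # correct outputs that were rejected
    rejection_rate = (c + d) / (a + b + c + d)
    recognition_rate = a / (a + b) if (a + b) else 0.0
    return type1, type2, rejection_rate, recognition_rate

# A DET curve is obtained by recomputing (type1, type2) while sweeping delta.
```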
6.3. Experimental Results
Throughout this paper we have presented various letter-level confidence measures leading to different word-level confidence measures. In this section, we focus on combining these different components to build a rejection mechanism. We first compare letter confidence measures, using uniform coefficients and letter-independent thresholds. Then, using the best measure, we compare uniform and duration coefficients. After that, we investigate the use of letter-dependent thresholds. Finally, we provide results for handwritten sentence recognition.
In the first experiment, we investigated the comparative qualities of the letter-level confidence measures. We used uniform coefficients to build the word-level confidence measure, together with letter-independent thresholds. Fig. 2 plots the DET curves for the four anti-model based letter confidence measures and for the benchmark method. According to these results, the dictionary-dependent anti-model is significantly the best method; we will use this measure in the following experiments. Note that the single letter anti-model and the on-line garbage model perform comparably, and that all anti-model based confidence measures significantly outperform the benchmark method.
Figure 2. Comparison of anti-model based
confidence measures. Uniform coefficients
are used to combine letter-level confidences
and letter-independent thresholds are used
in the rejection mechanism.
The next experiment concerns the coefficients used to build the word-level confidence measure. Using the best letter-level confidence measure found in the experiment above (i.e. the dictionary-dependent anti-model), we investigated the use of uniform and duration coefficients as discussed in §3.1; letter-independent thresholds were used here. Fig. 3 shows clearly that the duration coefficients outperform the uniform coefficients, especially in the middle area (where both error types are below 30%). Duration coefficients will be used in the next experiments.
Figure 3. Comparison of letter weighting coefficients: uniform and duration coefficients.
Next, we investigated the use of letter-dependent thresholds. Fig. 4 compares letter-independent and letter-dependent thresholds, where the dictionary-dependent anti-model and duration coefficients were used. One can see that the letter-dependent thresholds outperform the single-threshold method in the low type 2 error rate region, while both methods perform equivalently in the high type 2 error rate region. However, this is only a slight improvement, which suggests that the anti-model normalization is already an efficient way to normalize the likelihood.
Figure 4. Comparison of letter-independent thresholds and letter-dependent thresholds.
In order to get more insight into the rejection mechanism, we plotted in Fig. 5 different statistics as a function of the sensitivity parameter $\delta$ (Eq. (6)) for letter-independent thresholds. From these results, it can be seen that the word recognition rate may be improved from 80% to 90% with a rejection rate of about 20%. Furthermore, if the rejection rate reaches 30%, the recognition rate increases to 95%.
Figure 5. Different statistics (type 1 error rate, type 2 error rate, recognition rate, rejection rate) obtained using letter-independent thresholds, as a function of the rejection sensitivity $\delta$.
Finally, we evaluated the rejection mechanism embedded in the sentence recognition process; the results are summarized in Fig. 6. It can be seen that the rejection mechanism allows improving word accuracy from about 70% to 90% with a rejection rate of around 30%. Furthermore, the behavior of the rejection mechanism for sentence recognition is roughly the same as for isolated word recognition.
Figure 6. Word recognition accuracy in sentence recognition with the embedded rejection mechanism.
7. Conclusion
In this paper, we have investigated rejection methods
for on-line handwriting recognition systems dealing with
isolated word recognition and sentence recognition tasks.
We have proposed and compared different letter confidence
measures derived from the definition of anti-models. These
letter confidence measures have been used to define word-level confidence measures. We have studied a few confidence measures and shown that anti-model based measures
were an efficient way to measure the confidence in the recognizer’s output. At the word level, the rejection mechanism allows improving accuracy from 80% to almost 95%
with a rejection rate of about 30%. We embedded the rejection procedure in the sentence recognition process. In
this case, the rejection mechanism allows rejecting parts
(i.e. words) of the recognized sentence. Experimental results have shown improvements similar to those observed in
the isolated word recognition case. Rejecting about 30% of
words allows improving word accuracy from 70% to 90%.
References
[1] J.-M. Boite, H. Bourlard, B. D'hoore, and M. Haesen. A New Approach Towards Keyword Spotting. In European Conference on Speech Communication and Technology (EUROSPEECH), volume 2, pages 1273–1276, 1993.
[2] G. Bouwman, B. L., and K. J. Weighting phone confidence
measure for automatic speech recognition. In Workshop on
Voice Operated Telecom Services, pages 59–62, Ghent, Belgium, 2000.
[3] J. Caminero, C. de la Torre, L. Villarubia, C. Martin, and L. Hernandez. On-line Garbage Modeling with Discriminant Analysis for Utterance Verification. In International Conference on Spoken Language Processing (ICSLP),
1996.
[4] S. Cox and R. Rose. Confidence Measures for The Switchboard Database. In International Conference on Acoustics,
Speech, and Signal Processing (ICASSP), volume 1, pages
511–514, Atlanta, Georgia, USA, May 1996.
[5] I. Guyon, L. Schomaker, R. Plamondon, M. Liberman, and
S. Janet. UNIPEN project of on-line data exchange and recognizer benchmarks. In International Conference on Pattern
Recognition (ICPR), volume 2, pages 29–33, Jerusalem, Israel, October 1994.
[6] G. Hernández-Ábrego and J. B. Mariño. Fuzzy reasoning
in confidence evaluation of speech recognition. In IEEE International Workshop on Intelligent Signal Processing, Budapest, Hungary, 1999.
[7] I. Hetherington. A characterization of the problem of new,
out-of-vocabulary words in continuous-speech recognition and understanding. PhD thesis, Massachusetts Institute of
Technology, February 1995.
[8] S. O. Kamppari and T. J. Hazen. Word and phone level
acoustic confidence scoring. In International Conference on
Acoustics, Speech, and Signal Processing (ICASSP), Istanbul, Turkey, June 2000.
[9] E. Lleida and R. C. Rose. Efficient Decoding and Training
Procedures for Utterance Verification in Continuous Speech
Recognition. In International Conference on Acoustics,
Speech, and Signal Processing (ICASSP), volume 1, pages
507–510, Atlanta, Georgia, USA, May 1996.
[10] S. Marukatat, T. Artières, B. Dorizzi, and P. Gallinari. Sentence Recognition through hybrid neuro-Markovian modelling. In International Conference on Document Analysis
and Recognition (ICDAR), pages 731–735, Seattle, Washington, USA, September 2001.
[11] Z. Rivlin, M. Cohen, V. Abrash, and C. T. A phone-dependent confidence measure for utterance rejection. In
International Conference on Acoustics, Speech, and Signal
Processing (ICASSP), 1996.
[12] R. Rose. Discriminant wordspotting techniques for rejecting non-vocabulary utterances in unconstrained speech. In
International Conference on Acoustics, Speech, and Signal
Processing (ICASSP), 1992.
[13] M. Weintraub, F. Beaufays, Z. Rivlin, Y. Konig, and A. Stolcke. Neural-Network Based Measures of Confidence for
Word Recognition. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 887–
890, Munich, Germany, 1997.
[14] F. Wessel, K. Macherey, and H. Ney. A comparison of word
graph and n-best list based confidence measures. In European Conference on Speech Communication and Technology
(EUROSPEECH), 1999.
[15] A. Willet, A. Worm, C. Neukirchen, and G. Rigoll. Confidence Measures for HMM-Based Speech Recognition. In
International Conference on Spoken Language Processing (ICSLP), 1998.
[16] G. Williams. A Study of the Use and Evaluation of Confidence Measures in Automatic Speech Recognition. Technical report, University of Sheffield, March 1998.