A Corpus Based Morphological Analyzer for Unvocalized Modern Hebrew
Alon Itai and Erel Segal
Department of Computer Science
The Technion—Israel Institute of Technology
Haifa, Israel
Abstract
Most words in Modern Hebrew texts are morphologically ambiguous. Some words have up to 13 different analyses, and
the average number is more than 2. The correct analysis of a word depends on its context. We describe a method for
finding the correct morphological analysis of each word in a Modern Hebrew text. The program first uses a small tagged
corpus to estimate the probability of each possible analysis of each word regardless of its context and chooses the most
probable analysis. It then applies automatically learned rules to correct the analysis of each word according to its
neighbors. Lastly, it uses a simple syntactical analyzer to further correct the analysis, thus combining statistical methods
with rule-based syntactic analysis. It is shown that this combination greatly improves the accuracy of the morphological
analysis—achieving up to 96.2% accuracy.
The software described in this article is available at Segal (2001).
1. Introduction
The morphological analysis of words is the first stage of most natural language applications. In Modern Hebrew
(henceforth Hebrew), the problem is more difficult than in most languages. This is due to the rich morphology
of the Hebrew language and the inadequacy of the common way in which Hebrew is written—the unvocalized
script—which results in a great degree of morphological ambiguity. The average number of analyses of a word
is 2.1, and in some extreme cases can reach 13 analyses. The problem has not yet found a satisfactory solution
(see Levinger et al, 1995).
In Hebrew, morphological analysis partitions a word token into morphemes: one carries the lexical
value, while others carry linguistic markers for tense, person, etc. Still others represent short words, such as
determiners, prepositions and conjunctions prepended to the word token, or possessives and object clitics
appended to it. The part-of-speech (POS) of the lexical value and the linguistic markers of a word define a tag.
These tags correspond to POS tags in English.
POS tagging in English has been successfully attacked by corpus based methods. Thus we hoped to adapt
successful part-of-speech tagging methodologies to the morphological analysis of Modern Hebrew. There are
two basic approaches: Markov Models (Church, 1988; Brants, 2000) and acquired rule-based systems (Brill,
1995). Markov Model based POS tagging methods were not applicable, since such methods require large
tagged corpora for training. Weischedel et al. (1994) use a tagged corpus of 64,000 words, the smallest
corpus we found in the literature for HMM tagging, but still very large. Such corpora do not yet exist for
Hebrew. We therefore preferred to adapt Brill's rule-based method, which seems to require a smaller training
corpus. Brill's method starts with assigning to each word its most probable tag, and then applies a series of
“transformation rules”. These rules are automatically acquired in advance from a modestly sized training
corpus, see Section 4.
In this work, we find the correct morphological analysis by combining probabilistic methods with syntactic
analysis. The solution consists of three consecutive stages:
1. The word stage: In this stage we find all possible morphological analyses of each word in the analyzed
text. Then we approximate, for each possible analysis, the probability that it is the correct analysis,
independent of the context of the word. For this purpose, we use a small analyzed training corpus. After
approximating the probabilities, we assign each word the analysis with the highest approximated
probability (this stage is inspired by Levinger et al, 1995).
2. The pair stage: In this stage we use transformation rules, which correct the analysis of a word according
to its immediate neighbors. The transformation rules are learned automatically from a training corpus
(this stage is based on Brill, 1995).
3. The sentence stage: In this stage we use a rudimentary syntactical analyzer to evaluate different
alternatives for the analysis of whole sentences. We use a hill-climbing algorithm to find the analysis
which best matches both the syntactical information obtained from the syntactical analysis and the
probabilistic information obtained from the previous two stages.
The data for the first two stages is acquired automatically, while the sentence stage uses a manually created
parser.
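As an illustration only, the control flow of these three stages can be sketched in Python. All names, probabilities and the toy rule below are invented, and the real sentence stage uses hill climbing with a syntactic parser rather than the simple per-word argmax used here:

```python
def word_stage(words, unigram_probs):
    """Assign each word a score dict over its possible analyses."""
    return [dict(unigram_probs[w]) for w in words]

def pair_stage(scored, rules):
    """Apply ordered transformation rules to adjacent word pairs."""
    for rule in rules:
        for i in range(len(scored) - 1):
            rule(scored[i], scored[i + 1])
    return scored

def sentence_stage(scored):
    """Stand-in for the hill-climbing syntactic stage: here we simply
    pick the highest-scored analysis of each word."""
    return [max(s, key=s.get) for s in scored]

# Toy data: two words, each with two candidate analyses.
probs = {"YWSP": {"proper-name": 0.9, "verb": 0.1},
         "&DR": {"noun": 0.55, "verb": 0.45}}

def proper_name_then_verb(s1, s2):
    # If w1 currently reads as a proper name and w2 has a verb analysis,
    # boost the verb analysis of w2.
    if max(s1, key=s1.get) == "proper-name" and "verb" in s2:
        s2["verb"] += 0.5

scored = word_stage(["YWSP", "&DR"], probs)
analysis = sentence_stage(pair_stage(scored, [proper_name_then_verb]))
print(analysis)   # ['proper-name', 'verb']
```

The toy rule overturns the context-free choice for "&DR" (noun, 0.55) in favor of the verb reading once its proper-name neighbor is taken into account.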
Using all three stages results in a morphological analysis that is correct for about 96% of the word
tokens. This result approaches those reported for English probabilistic part-of-speech tagging, and it does so
using a very small training corpus – only 4900 words, similar in size to the corpus used by Brill, and much
smaller than the million-word corpora used for HMM based POS tagging of English. The results show that
combining probabilistic methods with syntactic information improves the accuracy of morphological analysis.
In addition to solving a practical problem of Modern Hebrew and of other scripts that lack vocalization (such as
Arabic and Farsi), we show how several learning methods can be combined to solve a problem that cannot be
solved by any one of them alone.
a. Previous Work
Both academic and commercial systems have attempted to attack the problem posed by Hebrew morphology.
The Rav Millim (Choueka, 2002) commercial system provided a morphological analyzer and disambiguator.
Within a Machine Translation project, the IBM Haifa Scientific Center developed a morphological analyzer
(Bentur et al., 1992), which was later used in several commercial products. Since the sources are proprietary, we used
Segal's publicly available morphological analyzer (Segal, 2001).
Other works attempt to find the correct analysis in context. Choueka and Lusignan (Choueka and Lusignan,
1985) proposed to consider the immediate context of a word and to take advantage of the observation that,
quite often, when a pair of adjacent words appears more than once, the same analysis is the correct one in all
these contexts. Orly Albeck (Albeck ) attempted to mimic the way humans analyze texts by manually
constructing rules that would allow finding the right analysis without backtracking. Levinger (1992) gathered
statistics to find the probability of each word, and then used hand crafted rules to rule out ungrammatical
analyses.
2. Modern Hebrew Morphology
a. The problem
This section follows Levinger et al (1995).
Because of its extent, morphological ambiguity is a severe problem in Hebrew. Finding methods to reduce
the morphological ambiguity in the language is thus a great challenge for researchers in the field and for people
who wish to develop natural language applications for Hebrew.
Number of Analyses    Number of Word-Tokens    %
 1                    17551                    45.1
 2                     9876                    25.4
 3                     6401                    16.5
 4                     2493                     7.1
 5                     1309                     3.37
 6                      760                     1.27
 7                      337                     0.87
 8                      134                     0.34
 9                       10                     0.02
10                       18                     0.05
11                        1                     0.002
12                        3                     0.007
13                        5                     0.01
Table 1: The dimension of morphological ambiguity in Hebrew
Table 1, copied from Levinger et al (1995), demonstrates the dimension of morphological ambiguity in
Hebrew. The data was obtained by analyzing large texts, randomly chosen from the Hebrew press, consisting
of nearly 40,000 word-tokens. According to this table, the average number of possible analyses per word-token
was 2.1, while 55% of the word-tokens were morphologically ambiguous. The main reason for this
degree of ambiguity is the standard writing system used in Hebrew (unvocalized script). In this writing system,
not all the vowels are represented, several letters represent both consonants and different vowels, and gemination
is not represented at all. In Hebrew, a word token consists of several morphemes that represent lexical values;
functional words such as prepositions and pronouns; and linguistic markers such as articles, gender, number
and possessive cases. The morphemes undergo morphological transformations that further add to the ambiguity.
Moreover, Hebrew has two standards for unvocalized orthography: one in which more vowels are explicit
("ktiv male", full script), and one that more closely resembles the vocalized script ("ktiv xaser", partial script). In
Modern Hebrew, these two writing systems tend to be intermingled, so that many words have, in practice,
more than one form.
To demonstrate the complexity of the problem, we should take a closer look at Hebrew words and morphology.
In order to overcome technical difficulties, we write Hebrew texts using a Latin transliteration, in which each
Hebrew letter is represented by a unique Latin letter or a single special symbol. This is a 1-1 transliteration and
not a phonological transcription. See appendix A for the details of this transliteration.
For example, the morphological analysis of the word token וכשראיתי, transcribed as WK$RAITI
(pronounced ukshera'iti), is as follows:

W       K$          RAI        TI
conj.   time adv.   see-past   1sg-subject
and     when        saw        I
= 'and when I saw'
In general, a morphological analysis of a Hebrew word should extract the following information:
- Lexical entry.
- Part of speech.
- Attached particles (i.e. conjunctions, prepositions, determiners).
- Gender and number.
- Status -- a flag indicating whether a noun or adjective is in its construct or absolute state.
- Person (for verbs and pronouns).
- Tense (for verbs only).
- Gender, number and person of possessive and object suffixes.
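To make this inventory concrete, the fields above can be collected into a record. This is only an illustrative sketch; the field names are ours and do not reflect the analyzer's actual API:

```python
# One possible record type for a single morphological analysis, mirroring
# the fields listed above (field names are our own invention).
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Analysis:
    lexical_entry: str                   # e.g. 'QM' (got up)
    pos: str                             # part of speech
    particles: List[str] = field(default_factory=list)  # prepended particles
    gender: Optional[str] = None         # 'm' / 'f'
    number: Optional[str] = None         # 'sg' / 'pl'
    status: Optional[str] = None         # 'construct' / 'absolute'
    person: Optional[str] = None         # verbs and pronouns
    tense: Optional[str] = None          # verbs only
    suffix: Optional[str] = None         # possessive/object suffix

# The verb reading of $QMTI ('that I got up'):
verb_reading = Analysis(lexical_entry="QM", pos="verb", particles=["$"],
                        number="sg", person="1", tense="past")
```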
The above word token is non-ambiguous: it has exactly one morphological analysis. But most Hebrew strings
have more than one analysis. Consider, for example, the following word token – HQPH (הקפה), which has
three possible analyses:
- The definite article H + the noun QPH (ה-קפה – the coffee).
- The noun HQPH (הקפה – encirclement).
- The noun HQP + the feminine possessive suffix H (הקף-ה – her perimeter).
For another example, consider the word token '$QMTI', which can be analyzed both as a noun and as a verb:
- The noun '$QMH' (sycamore) with a first person possessive suffix 'I' ($QMT-I = my sycamore; here we
  witness a morphological transformation by which the letter H at the end of a feminine noun becomes a T
  when a suffix is added).
- The connective '$' (that) + the past-tense verb 'QM' (got up) + 'TI', the first person singular suffix
  ($-QMTI = that I got up).
Note that in a given context only one of the possible analyses is correct, and a native speaker easily identifies it.
For example, in the sentence "@IPSTI &L $QMTI", the string $QMTI is analyzed as a noun ($QMT-I):

@IPS         TI            &L      $QMT       I
climb-past   1sg-subject   prep.   sycamore   1sg-possessive
climbed      I             above   sycamore   of-me
= 'I-climbed my-sycamore'
while in the sentence "AMRTI LW $QMTI HBWQR MWQDM", it is analyzed as a verb ($-QMTI):

AMR        TI            L       W      $       QM            TI
say-past   1sg-subject   prep.   3sg    conn.   get-up-past   1sg-subject
said       I             to      him    that    got up        I
= 'I said to him that-I-got-up [this-morning early]'
b. The basic morphological analyzer
A basic morphological analyzer is a function that gets a word token and returns the set of all its possible
morphological analyses. Several basic morphological analyzers for Hebrew have been developed over the last
decade. We used a basic morphological analyzer that is in the public domain.¹
The analyzer we used supplies all the morphological information described in the previous section, except for
object clitics. We found only 2 object clitics in a 4900-word corpus of a Hebrew newspaper, so we concluded
that adding object clitics to the analysis would not add much to its coverage, while it would increase the ambiguity
substantially. Other analyzers, such as Rav-Milim, identify the object clitic in some but not all of the words.
In this work, we didn't intend to tackle the problem of the two standards for unvocalized orthography, so we
used a conservative analyzer that identified only "full-script" unvocalized words (ktiv male).
3. The word stage
Our purpose in this project is to find the most probable morphological analysis for a given sentence. Formally,
we define the analysis of a sentence with n words, w[1..n], to be:
analysis(w[1..n]) = argmax_{t[1..n]} P(t[1..n] | w[1..n]),

where the maximization is done over all arrays of n tags, t[1..n].
In the first stage, we ignore the dependencies between neighbouring words, and choose for each word its most
probable tag:

analysis(w) = argmax_{t : t is an analysis of w} P(t)
To use this formula, we must estimate P(t) for each possible analysis t of each word w in the text. For example,
for the word “$QMTI” we should find the probabilities:
- P(“$-QMTI”) – for the analysis as a verb.
- P(“$QMT-I”) – for the analysis as a noun.

¹ Java source code of the analyzer is available at Segal (2001).

a. Acquisition of the probabilities
We cannot calculate the required probabilities theoretically because we do not have a probabilistic model of the
Hebrew language. However, we can calculate an empirical probability, which will estimate the true probability.
We estimate these probabilities by counting:

P̂(t) = C(t) / N,

where C(t) is the number of times the word w was seen in the training corpus with the analysis t, and N is the
number of words in the training corpus.
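A minimal sketch of this counting estimate, over an invented toy corpus of (word, analysis) pairs:

```python
# Sketch of the counting estimate P̂(t) = C(t)/N over a tagged corpus,
# here a toy list of (word, analysis) pairs (the data is invented).
from collections import Counter

corpus = [("$QMTI", "$-QMTI"), ("HQPH", "H-QPH"), ("$QMTI", "$-QMTI"),
          ("HQPH", "HQPH"), ("$QMTI", "$QMT-I")]

N = len(corpus)
counts = Counter(corpus)          # C(t) for each (word, analysis) pair

def p_hat(word, analysis):
    return counts[(word, analysis)] / N

print(p_hat("$QMTI", "$-QMTI"))   # 2/5 = 0.4
```

Note that any analysis absent from the toy corpus gets probability exactly zero, which is precisely the sparseness problem discussed next.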
This formula assigns zero probability to analyses that do not appear in the training corpus, although their true
probability may be positive. This is a problem in many statistical works, and it was a severe problem here,
because of the small size of the training corpus (only 4900 words). For the sake of comparison, works on HMM-based
methods for English part-of-speech tagging use corpora of at least 64,000 words (Weischedel et al., 1994),
and most of them use million-words corpora. Therefore, many analyses did not appear in the training corpus,
although their probability is larger than 0. However, even if we had a larger corpus many words would be
absent.
This is due to the large number of conjugations in which each word can appear: In Hebrew each noun has 24
different inflections, and each verb has over 30; added particles greatly increase the number of word tokens.
Contrast this with English, where each noun has two forms (singular and plural) and each verb at most five.
Thus, the number of Hebrew word tokens is much larger than in English. This raises a sparseness problem: the
probability of encountering a yet unseen word token is non-negligible.
English corpus-based methods may treat each conjugation as a separate word without adding much to the
sparseness problem (although some English taggers do use a simple morphological analyzer - see, for instance,
Cutting et al, 1992); morphologically rich languages, such as Hebrew, don't have this privilege, as this will
greatly increase the data sparseness.
Manning and Schutze (1999, chapter 6) describe many approaches to the sparse data problem. We addressed
this problem in two complementary ways. First, we gathered information about every morpheme of an analysis
t. For example, we consider the lexical value (t_lv) of t and its other morphemes (t_mf), such that:

P(t) = P(t_lv ∩ t_mf),

where the intersection symbol denotes a combination of features (i.e. tag t has both the lexical value t_lv and the
morphological features t_mf).
Assuming that the lexical value is statistically independent of the other morphemes, we can write:

P(t) = P(t_lv) · P(t_mf)
The statistical independence assumption is, of course, not always true: a preliminary result by Alon Altman
(2002) indicates that it's true only for about 50% of the word tokens in a Hebrew journal text. We still use it as
a heuristic, though, because our main goal here is to rank the tags in order of their probabilities, and for this
goal we don't need the absolute probability value, but only a value that will have the correct rank relative to
other tags.
Returning to the above example, we get:

P̂($QMT-I) = P̂($QMH) · P̂(noun-with-1sg-possessive-suffix)
P̂($-QMTI) = P̂(QM) · P̂($+verb-past-1sg)
The above probabilities can also be estimated by counting:

P̂($QMH) = C($QMH)/N,   P̂(QM) = C(QM)/N, ...
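The gain from splitting an analysis into its parts can be sketched as follows. The corpus and tag names are invented; the point is only that an analysis never seen as a whole can still receive a nonzero estimate:

```python
# Sketch of the independence approximation P(t) ≈ P(t_lv) · P(t_mf):
# lexical values and morpheme sets are counted separately (toy data).
from collections import Counter

# (lexical value, morpheme set) for each token of a toy tagged corpus
tagged = [("QM", "$+verb-past-1sg"), ("$QMH", "noun-abs"),
          ("QM", "verb-past-3sg"), ("BIT", "noun-with-1sg-possessive")]
N = len(tagged)
lv_counts = Counter(lv for lv, _ in tagged)
mf_counts = Counter(mf for _, mf in tagged)

def p_split(lv, mf):
    return (lv_counts[lv] / N) * (mf_counts[mf] / N)

# "$QMT-I" = lexical value $QMH + noun-with-1sg-possessive: the whole
# analysis never occurs in the toy corpus, but both of its parts do.
print(p_split("$QMH", "noun-with-1sg-possessive"))   # (1/4)*(1/4) = 0.0625
```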
Obviously, there is much more information about lexical values alone and morphemes alone than about
whole analyses. However, with a corpus as small as ours, there were still many lexical values and
morphemes with zero counts.
To overcome this problem, we estimated the probabilities of both the lexical values and the morphemes using
the Good-Turing formula (Good, 1953). This formula approximates the probability that a word selected at
random from the test corpus has a lexical value (or a morpheme set, or any other tag) that occurred r times in
the tagged corpus. The approximated probability is:

P_r = E[N_{r+1}] / N,

where the numerator is the expected number of values that appeared r+1 times in the tagged corpus, and the
denominator is the total number of words in the corpus.
Calculating the N_r's in the tagged corpus is easy, but estimating the expectations of the N_r's is not. In
this work, we used the Simple Good-Turing algorithm developed by Gale and Sampson (1995).
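For intuition, the unsmoothed Good-Turing estimate can be computed directly from the frequency-of-frequencies counts N_r. The sketch below uses raw N_r counts in place of the expectations E[N_r], deliberately omitting the smoothing of the N_r sequence that distinguishes Gale and Sampson's Simple Good-Turing:

```python
# Basic (unsmoothed) Good-Turing sketch: the total probability mass of
# items seen r times is estimated as P_r = N_{r+1} / N.
from collections import Counter

tokens = list("abracadabra")             # toy "corpus" of lexical values
counts = Counter(tokens)                 # a:5, b:2, r:2, c:1, d:1
N = len(tokens)                          # 11
freq_of_freq = Counter(counts.values())  # N_1 = 2, N_2 = 2, N_5 = 1

def mass_seen_r_times(r):
    return freq_of_freq.get(r + 1, 0) / N

print(mass_seen_r_times(0))   # N_1 / N = 2/11: mass reserved for unseen items
```

In particular, r = 0 yields the probability mass reserved for items never seen in training, which is exactly what the raw counting estimate fails to provide.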
The above formula gives the probability that a new word will have any lexical value that appeared r times in the
tagged corpus. But what we really need is the probability that a new word will have a specific lexical value that
appeared r times in the tagged corpus. For r > 0 this poses no problem: all we have to do is divide the
probability P_r by the expected number of lexical values with r occurrences in the tagged corpus, to get:

P_r = E[N_{r+1}] / (N · E[N_r]).

However, for r = 0 we have to find the number of lexical values with 0 occurrences in the tagged corpus, which
is much more difficult to estimate.
A possible solution is to take the number of lexical values in the lexicon and subtract the number of different
lexical values in the training corpus. However, this solution would give arbitrary results, since the number of
lexical values depends on subjective decisions and on the effort we put in building the lexicon. Currently, our
lexicon doesn’t contain all words in Hebrew but an arbitrary subset of about 20,000 lexical values. Had we
invested more time in entering values to the lexicon, we could have had up to 80,000 lexical values, which
would have made any probability calculated this way 4 times smaller! On the other hand, we could have taken a
much smaller lexicon, containing only lexical values that belong to Modern Hebrew, and this would make all
probabilities calculated this way much larger.
For this reason, we used a more general method called Empirical Bayes, described by Efron and Thisted (1976).
We used the Euler transformation described in formula (4.3) there.
b. Our method compared with other methods of dealing with sparse data
Levinger et al (1995) achieved a similar effect by a different method. They developed a method for estimating
word probabilities using a raw (untagged) corpus. They defined, for each analysis t, a group of similar analyses
-- analyses that have the same lexical value as t but differ in a single morpheme (not the lexical value), and are
therefore supposed to have similar frequency in Hebrew. In their method, the probability of each analysis is
initialized to 1/k, where k is the number of analyses of the word. It is then recalculated using the probabilities of
the similar analyses. The process is performed iteratively until it converges. There are three main differences
between Levinger's method and ours:
1. Levinger et al. used a large untagged corpus, while we used a small tagged corpus.
2. In order to avoid counts from words with different lexical values, Levinger et al. preferred
unambiguous similar words. Since we look at all words, our method is easier to justify
probabilistically.
3. When the number of unambiguous similar words in Levinger et al.’s corpus was too low to make
reliable statistical estimations, they used ambiguous words and performed several iterations to
overcome the noise introduced by words with different lexical values. We did not need to resort
to more than one iteration.
A similar effect was achieved by Merialdo (1994) for the problem of part-of-speech tagging in English.
Merialdo implemented the Forward-Backward algorithm. This algorithm assumes that the untagged training
corpus was generated by a hidden Markov model. The parameters of this model are initialized randomly, and
the resulting model is used for calculating the probability of generating the training corpus. The parameters are
then changed in order to increase that probability. The algorithm proceeds iteratively until it converges. The
resulting Markov model can be used to tag new articles. There are two differences between this method and
ours:
1. Our algorithm has only a single iteration, as explained above.
2. The Forward-Backward algorithm is more general -- it can be used to calculate the probability of
pairs or triples of parts-of-speech and it does not rely on splitting analyses into their features. We
address the problem of pairs of analyses in a more efficient way, which will be described in the
next section.
c. Computational complexity and evaluation
The process of acquiring the probabilities requires going over the training corpus at least once and keeping
counts for each lexical value and each set of morphological features. This takes O(N) time and space, where N
is the number of words in the training corpus.
Having estimated the probabilities of every possible analysis of every word, we can now assign each word its
most probable analysis. This takes time O(n), where n is the number of words in the analyzed sentence.
This scheme was tested using a manually tagged training corpus of about 4900 words, and manually tagged test
texts ranging from 100 to 800 words. On average, 83% of the word-tokens in the test texts were assigned
the correct morphological analysis.
4. The pair stage
After estimating the context-free probability of each analysis, we try to re-rank the analyses by looking at their
neighbors.
a. Using transformation rules to improve the analysis
The concept of rules was introduced by Brill (1995), who first used acquired transformation rules to build a
rule-based part-of-speech tagger. He argued that transformation rules have several advantages over other
context-dependent part-of-speech tagging methods (such as Markov models):
1. The transformation-rules method keeps only the most relevant data. This both saves a lot of
memory space and enables the development of more efficient learning algorithms.
2. Transformation rules can be acquired using a relatively small training corpus.
We too use transformation rules, but in contrast to Brill, our transformation rules do not automatically change
the analyses of the matching words. In order to use the probabilistic information gathered in the word stage, we
assign each possible analysis of each word a morphological score. The score of each analysis is initialized to
1000 × its probability as determined at the word-token stage, rounded to the nearest integer (the factor 1000
was introduced so that we can work with integers instead of fractions). The transformation rules modify the
scores of the analyses. The modified scores can be used to select a single analysis for each word (that with the
highest score), or used as an input to a higher level analyzer (such as the syntactic analyzer that will be
described in the next section).
A transformation rule operates on a pair of adjacent words (let's mark them: w1 w2). Here is an example of a
transformation rule:
if the current analysis of w1 is a proper-noun and the current analysis of w2 is a noun:
    find an analysis of w2 as a verb that matches w1 by gender,
    add 500 to its morphological score, and
    normalize the scores of w2 so that they sum up to 1000.
In the normalization step, each score is divided by the sum of all scores and multiplied by 1000. The normalization is needed so that the
next transformations will have the same relative influence on all words. For example, if a certain transformation adds 200
to the morphological score of all past verbs, this "200" should be 20% of the sum of all scores, regardless of how many
transformations were previously executed on each word.
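The score bookkeeping can be sketched as follows (the probabilities are toy values; because of rounding, normalized sums may be off by one in general):

```python
# Sketch of the score bookkeeping: initialize each analysis's score to
# round(1000 * probability), and renormalize a word's scores to sum to
# (approximately) 1000 after a rule fires.

def init_scores(probs):
    """probs: {analysis: probability} -> integer scores summing to ~1000."""
    return {a: round(1000 * p) for a, p in probs.items()}

def normalize(scores):
    """Divide each score by the sum of all scores and multiply by 1000."""
    total = sum(scores.values())
    return {a: round(1000 * s / total) for a, s in scores.items()}

scores = init_scores({"noun": 0.55, "verb": 0.45})   # {'noun': 550, 'verb': 450}
scores["verb"] += 500                                # a rule adds 500 to the verb
print(normalize(scores))                             # {'noun': 367, 'verb': 633}
```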
Consider the combination “YWSP &DR” (‫)יוסף עדר‬. The word “&DR” has two possible analyses: one as a
masculine noun (= herd) and the other as a masculine past-tense verb (= hoed). Suppose our analyzer has found,
in the word stage, that the most probable analysis of the word “YWSP” is a masculine proper name (=Joseph),
and the most probable analysis of the word “&DR” is a noun (= herd). The current analysis of this combination
is “Joseph herd”, which is ungrammatical. However, this combination of analyses matches the first part of the
transformation rule: the current analysis of w1 is a proper noun and the current analysis of w2 is a noun.
Moreover, w2 has an analysis that matches the second part of the rule: a verb that matches w1 by gender.
Therefore, the rule will add 500 to the morphological score of the other analysis of w2. If the difference between
the scores of the two analyses was less than 500 – the highest-scored analysis of w2 will now be the verb, so
that the analysis of the entire combination will be “Joseph hoed”. If the difference between the scores was
greater than 500 – the analysis will not be changed.
The general syntax used in this work for a transformation rule is:

pattern1 pattern2 [agreement] → newpattern1(+inc1) newpattern2(+inc2) [newagreement]
A rule has a left-hand side and a right-hand side. Both sides contain two analysis-patterns and an optional
agreement-pattern. An analysis-pattern is any pattern that matches morphological analyses of single words, for
example "proper-name", "noun without prepended particles", "anything but absolute noun", etc. An agreement-pattern is a
pattern that indicates how two adjacent analyses agree, for example "agreeing-by-gender", "agreeing-by-gender-and-number", etc.
A rule comes into effect only for pairs of adjacent analyses, where the first analysis matches "pattern1", the
second analysis matches "pattern2", and the two analyses agree according to "agreement" (if it exists). In this case
we say that the two analyses "match the left-hand-side of the rule".
The semantics of the general rule is:

for each pair of adjacent words w1 w2:
    if the current analyses of w1 and w2 match the left-hand side of the rule:
        find an analysis of w1 that matches newpattern1,
        find an analysis of w2 that matches newpattern2, such that the newagreement pattern is satisfied,
        add inc1 to the score of the new analysis of w1,
        add inc2 to the score of the new analysis of w2;
        finally, normalize the scores of the analyses of w1 and w2, so that the sum of the scores of the analyses
        for each word will be 1000.
Using this syntax, the above example will be written:

proper-name noun → proper-name(+0) verb(+500) agreeing-in-gender
In this example, inc1 = 0, which means that the rule will not change the analysis of w1. inc2 = 500, which
means that the rule will try to change the analysis of w2 from a noun to a verb that matches w1 by gender, by
adding 500 to the matching analysis (if such analysis exists).
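The application of one rule to a pair of adjacent words can be reconstructed schematically, with patterns modeled as predicates over analyses. This is our own sketch, not the authors' implementation, and it omits agreement patterns for brevity:

```python
# Schematic application of one transformation rule to an adjacent pair.

def apply_rule(rule, scores1, scores2):
    """scores1/scores2: {analysis: score} for the two adjacent words."""
    cur1 = max(scores1, key=scores1.get)   # current (highest-scored) analyses
    cur2 = max(scores2, key=scores2.get)
    if not (rule["pattern1"](cur1) and rule["pattern2"](cur2)):
        return                             # left-hand side does not match
    for a in scores1:
        if rule["newpattern1"](a):
            scores1[a] += rule["inc1"]
    for a in scores2:
        if rule["newpattern2"](a):
            scores2[a] += rule["inc2"]
    for s in (scores1, scores2):           # normalize each word to sum ~1000
        total = sum(s.values())
        for a in s:
            s[a] = round(1000 * s[a] / total)

# proper-name noun -> proper-name(+0) verb(+500):
rule = {"pattern1": lambda a: a == "proper-name",
        "pattern2": lambda a: a == "noun",
        "newpattern1": lambda a: a == "proper-name", "inc1": 0,
        "newpattern2": lambda a: a == "verb", "inc2": 500}

w1 = {"proper-name": 800, "noun": 200}
w2 = {"noun": 600, "verb": 400}
apply_rule(rule, w1, w2)
print(w2)   # {'noun': 400, 'verb': 600}: the verb reading now wins
```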
Here are some other examples of transformation rules:
construct-noun noun → absolute-noun(+30) adjective(+70) agreeing-in-gender-and-number
This rule will have an effect only if w1 has an analysis as an absolute noun, and w2 has an analysis as an
adjective with the same gender and number. It will try to make these analyses the most probable ones, by
adding 30 to the first analysis and 70 to the second.
anything-but-aux-verb verb → aux-verb(+700) verb
This rule will become active when w2 is analyzed as a verb, and w1 has an analysis as an auxiliary verb,
regardless of its current analysis.
Negative rules can also be defined:
absolute-noun noun → anything-but-absolute-noun(+400) noun(+0)
This rule means:
if w1 is currently analyzed as an absolute noun and w2 is analyzed as a noun:
    find an analysis of w1 that is not an absolute noun,
    add 400 to its score, and normalize to 1000 as usual.
This rule detects and corrects a word-pair that is very improbable in Hebrew.
Rules can depend on the lexical value of the words:
'HWA' noun → 'HWA'(+0) verb-agreeing-in-person-gender-and-number(+500)
This rule means:
if the current analysis of w1 belongs to the lexical entry “HWA” (I, you, he, she or they)
and the current analysis of w2 is a noun:
    find an analysis of w2 as a verb that agrees with the pronoun in person, gender and number,
    add 500 to its score, and normalize to 1000 as usual.
The reasoning behind this rule is that the lexical entry "HWA" (third personal singular masculine pronoun – he)
is usually used as a subject, so it is probably followed by an agreeing verb.
b. Acquiring the transformation rules
Transformation rules are acquired automatically using an analyzed training corpus. The learning algorithm uses
the following input:
1. For each word in the training corpus: the set of its analyses, and the morphological score of each analysis
(this is initialized, as said above, by assigning each analysis its estimated probability multiplied by
1000).
2. The training corpus, separated into words, where punctuation marks (including periods at ends of
sentences) are treated as separate words.
3. The correct analysis of each word in the training corpus.
4. The current analysis of each word in the training corpus (this is initialized at the beginning of the
algorithm).
The output of the algorithm is an ordered list of transformation rules.
The learning algorithm proceeds as follows:
a. (Initialization): Assign each word its most probable analysis.
b. (Loop start) Initialize an empty set S, to hold transformation rules.
c. (Transformation rule generation): loop over all words in the corpus.
For each word w1, let w2 be the word that follows it. {
Compare the current analysis of the pair (w1,w2) to its correct analysis. If the current analysis for either w1 or w2 or
both is different from the correct analysis: {
Generate a set of transformation rules that can fix at least one of the errors (see subsection c below).
Put all generated rules into the set S.
}
}
d. (Transformation rule evaluation): loop over all transformation rules in the set S. For each rule, r: {
Find the score of rule r, by applying it to the current analysis of the entire training corpus, and then counting the
number of errors that remain. The score of rule r is defined as the current number of errors minus the number of
errors after applying r.
}
e. Find the transformation rule with the highest score, r_best. If the score of r_best is less than 2 -- quit.²
Otherwise: {
    Apply r_best to the current analysis.
    Return to stage (b).
}
² The threshold of 2 was chosen because a transformation rule whose score is less than 2 is very specific: it can fix only the error that
triggered its generation. Therefore, it will probably not be able to fix errors in new texts. This is an example of the overfitting
problem, a general problem in the domain of automatic learning.
To show that the algorithm terminates, note that in stage (e), if r_best is applied to the current analysis, the
number of errors must decrease by at least 2 (by the definition of the score of r_best). The number of errors is
finite, so the algorithm must terminate.
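The greedy loop of steps b-e can be sketched schematically. Here `generate_rules`, `apply` and `count_errors` are stand-ins for the components described in the text, passed in as parameters:

```python
# Schematic of the greedy learning loop: repeatedly generate candidate
# rules from the current errors, score each rule by the net number of
# errors it removes over the whole corpus, and keep the best rule until
# no rule gains at least 2 (the overfitting threshold).

def learn_rules(corpus, current, correct, generate_rules, apply, count_errors):
    learned = []
    while True:
        candidates = set()
        for i in range(len(corpus) - 1):       # step c: scan adjacent pairs
            if current[i] != correct[i] or current[i + 1] != correct[i + 1]:
                candidates |= generate_rules(corpus, current, correct, i)
        best, best_score = None, 0
        base_errors = count_errors(current, correct)
        for r in candidates:                   # step d: score each rule
            score = base_errors - count_errors(apply(r, current), correct)
            if score > best_score:
                best, best_score = r, score
        if best is None or best_score < 2:     # step e: stop
            return learned
        current = apply(best, current)
        learned.append(best)
```

With trivial stand-ins (a rule is a tag substitution, errors are counted by position), the loop learns a rule, applies it, and stops once no further rule removes at least two errors.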
The only stage in the algorithm that needs further elaboration is the generation of transformation rules in stage
(c).
c. Error-driven generation of transformation rules
The process of generating transformation rules is best explained by an example. Consider the pair of strings:
"MCD AXD". Each string has several possible analyses:
- "MCD" can be an absolute noun ("fort"), a construct noun ("the fort of"), a prepositional particle M
  attached to an absolute noun ("from side") or a prepositional particle M attached to a construct noun
  ("from the side of").
- "AXD" can be a quantifying adjective ("one") or a noun ("somebody").
In the pair “MCD AXD”, the correct analysis of “MCD” is a prepositional particle attached to an absolute noun,
and the correct analysis of “AXD” is a quantifying adjective. Therefore, the literal translation of this pair of
strings is “from one side”. (The common use of this phrase is idiomatic, equivalent to “on one hand”.)
Suppose the word-token stage assigned the following probabilities to the analyses of the relevant words:
- MCD: 0.45 to the correct analysis, 0.55 to the analysis as a construct noun ("the fort of...").
- AXD: 0.4 to the correct analysis, 0.6 to the analysis as a noun ("somebody").
The initial scores are:
For MCD: 450 to the correct analysis, 550 to the analysis as a construct noun
For AXD: 400 to the correct analysis, 600 to the analysis as a noun
The initial analysis has two errors in this phrase: analyzing 'MCD' as a construct noun and analyzing 'AXD'
as a noun. To create a transformation rule that fixes the error, the algorithm works as follows:
- Put the current (wrong) analysis on the left-hand side of the rule.
- Omit from the left-hand side all morphological features except:
  - the parts of speech;
  - the features that connect the first word with the second word (i.e., the status of the first word, and the particles attached to the second word).
- Put the correct analysis on the right-hand side of the rule.
- Omit from the right-hand side all morphological features except the parts of speech and the features that connect the first word with the second word, as above.
- Set the score-increment for each word to the difference between the score of the current analysis of the word and the score of its correct analysis. Theoretically this is the most specific rule that can still fix the error.
  - However, due to round-off errors, the rule may be too specific: if the same error repeats with slightly different scores, the rule might not be able to fix it. To avoid this round-off error, add a small quantity of 10 to the score-increment.
The resulting rule is:
construct-noun noun → absolute-noun(+110) adjective(+210)
The score 110 stems from the difference between the correct analysis and the current analysis of the word "MCD", 550-450, plus 10 to avoid round-off errors. The score 210 stems from the difference between the correct analysis and the current analysis of the word "AXD", 600-400, plus 10 to avoid round-off errors.
Then, the algorithm creates less general rules, by adding all possible relevant agreement-patterns to the above rule:
construct-noun noun → absolute-noun(+110) adjective-matching-noun-by-gender(+210)
construct-noun noun → absolute-noun(+110) adjective-matching-noun-by-number(+210)
construct-noun noun → absolute-noun(+110) adjective-matching-noun-by-gender-and-number(+210)
and so on. All such rules will fix the error in the combination 'MCD AXD' as well.
The basic rule is then generalized, by replacing each pattern on its left-hand side with a negative pattern based on its right-hand side. This creates rules such as:
NOT-absolute-noun noun → absolute-noun(+110) adjective(+210)
construct-noun NOT-adjective → absolute-noun(+110) adjective(+210)
and so on.
The rules created so far all tried to fix both errors. After creating all such rules, the algorithm creates two new basic rules, each of which fixes only one error:
construct-noun noun-without-prefixes → absolute-noun(+0) adjective-without-prefixes(+210)
construct-noun noun-without-prefixes → absolute-noun(+110) adjective-without-prefixes(+0)
All the versions described above will be created for each of these basic rules.
The entire process described above is repeated three more times, in order to take lexical values into account. Three more basic rules will be created:
'MCD'=construct-noun noun → 'CD'=absolute-noun(+110) adjective(+210)
construct-noun 'AXD'=noun → absolute-noun(+110) 'AXD'=adjective(+210)
'MCD'=construct-noun 'AXD'=noun → 'CD'=absolute-noun(+110) 'AXD'=adjective(+210)
and all versions described above will be created for each.
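The expansion of one basic rule into its family of variants can be sketched as follows. The pattern strings and the function name `variants` are illustrative only; the real system also lexicalizes the right-hand side (e.g. 'CD'=absolute-noun) and attaches score increments to each action.

```python
# Agreement patterns used to specialize the second word's action; these names
# are illustrative stand-ins for the analyzer's real feature patterns.
AGREEMENT_PATTERNS = ['matching-gender', 'matching-number',
                      'matching-gender-and-number']

def variants(lhs1, lhs2, rhs1, rhs2, lex1=None, lex2=None):
    """Expand one basic rule ((lhs1, lhs2) -> (rhs1, rhs2)) into its family:
    agreement specializations, negative generalizations, and lexicalized
    versions keyed on the surface strings lex1/lex2."""
    rules = [((lhs1, lhs2), (rhs1, rhs2))]                 # the basic rule
    for pat in AGREEMENT_PATTERNS:                         # specializations
        rules.append(((lhs1, lhs2), (rhs1, rhs2 + '-' + pat)))
    rules.append((('NOT-' + rhs1, lhs2), (rhs1, rhs2)))    # generalizations
    rules.append(((lhs1, 'NOT-' + rhs2), (rhs1, rhs2)))
    if lex1:                                               # lexicalized versions
        rules.append(((lex1 + '=' + lhs1, lhs2), (rhs1, rhs2)))
    if lex2:
        rules.append(((lhs1, lex2 + '=' + lhs2), (rhs1, rhs2)))
    return rules
```

For the "MCD AXD" example this yields eight variants of the basic rule, matching the families enumerated above.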
d. Complexity analysis and evaluation
- Applying a transformation rule to the training corpus is a linear process: we go over each word in the corpus in turn, and check the word and the following word against the left-hand side of the rule. It takes time O(N), where N is the number of words in the training corpus.
- The number of transformation rules generated in stage (b) is proportional to the number of errors found in the current analysis (each error stimulates the generation of about 20 transformation rules). Therefore, stage (c) takes time O(N·e), where e is the number of errors in the current analysis.
- In stage (b), the algorithm iterates over all words in the corpus, and does a constant-bounded amount of work at each iteration (checking if there is an error in the current analysis, and creating a constant-bounded number of rules to correct the error). It takes time O(N).
- Each iteration decreases the number of errors by at least one. Therefore, the number of iterations is bounded by E, where E is the number of errors in the initial analysis. The number of errors in each iteration, e, is also bounded by E.
Hence, the total complexity of the learning algorithm is O(N·E²). E obviously cannot exceed N, so the worst-case complexity of the algorithm is O(N³).
To test the algorithm we used an analyzed corpus of 5361 word tokens, which contained 16 articles on various subjects from a Hebrew daily newspaper. We ran the experiment twice, each time with a different test article:
- Article A, with 469 word tokens (leaving 4892 word tokens in the training corpus);
- Article B, with 764 word tokens (leaving 4597 word tokens in the training corpus).
In each experiment, we ran the learning algorithm 6 times, each time using a different fraction of the training
corpus. We examined how the size of the training corpus affects the number of transformation rules learned and
the final accuracy. The results are shown in Figure 1.
The x axis in all graphs is the number of words used for training, starting with a minimum of 0 (for no training), to a maximum of 4892 (for article A) or 4597 (for article B). The full results are:

Article A (469 words):
words for training   learned rules   initial errors   initial %   final errors   final %   effect of rules
0                    0               80               17.1%       80             17.1%     0.0%
701                  22              79               16.8%       51             10.9%     6.0%
1073                 30              77               16.4%       42             9.0%      7.5%
2105                 48              81               17.3%       42             9.0%      8.3%
2929                 63              79               16.8%       39             8.3%      8.5%
3851                 78              70               14.9%       39             8.3%      6.6%
4892                 93              68               14.5%       29             6.2%      8.3%

Article B (764 words):
words for training   learned rules   initial errors   initial %   final errors   final %   effect of rules
0                    0               140              18.3%       140            18.3%     0.0%
468                  14              129              16.9%       90             11.8%     5.1%
695                  18              129              16.9%       86             11.3%     5.6%
1713                 45              130              17.0%       73             9.6%      7.5%
2640                 53              128              16.8%       71             9.3%      7.5%
3562                 76              131              17.1%       64             8.4%      8.8%
4597                 90              126              16.5%       53             6.9%      9.6%

[Three graphs plot, against the number of words used for training: the number of learned rules, the initial and final error rates, and the error-rate decrease (effect of rules), for both articles.]
The first graph shows the number of rules learned for each training-corpus size. The second graph shows the initial and final error rate for each training-corpus size. The third graph shows the effect of the rules, defined as the error rate after the word stage minus the error rate after the pair stage.
Figure 1: The results of the pair-stage experiment
The final error rate for the first experiment (article A) was 6.2%, which means that 93.8% of the words in
article A got their correct morphological analysis after running the transformation rules. We calculate a
confidence interval for this figure using a standard statistical formula:
p ≥ p̂ - z_{1-α}·sqrt(var(p̂)) = p̂ - z_{1-α}·sqrt(p̂(1-p̂)/N)
where p̂ is the proportion of words analyzed correctly (0.938), N is the total number of words in the article (469) and α is the confidence level. Taking a confidence level of 95% we get (using statistical tables):
z_{1-α} = z_{0.05} = 1.65
Substituting into the confidence interval formula we get:
p ≥ 0.938 - 1.65·sqrt(0.938·0.062/469) ≈ 0.92
This means that with 95% confidence the error rate is at most 8%.
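The calculation above is easy to reproduce. A small sketch (the function name is ours) of the one-sided normal approximation:

```python
import math

def lower_confidence_bound(p_hat, n, z=1.65):
    """One-sided lower confidence bound for a proportion, using the normal
    approximation p_hat - z * sqrt(p_hat * (1 - p_hat) / n)."""
    return p_hat - z * math.sqrt(p_hat * (1 - p_hat) / n)

print(round(lower_confidence_bound(0.938, 469), 2))  # → 0.92
```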
The error-rate graphs are quite flat even for this small training corpus, which suggests that a much larger training corpus would not improve the results by much. At present, however, we cannot verify this conclusion, because we do not have a much larger tagged corpus. Such a corpus is in the process of being prepared (SIWAN, 2001).
e. Timing
Learning transformation rules takes a lot of time. Here are three time figures, measured on a Celeron 533MHz computer with 256MB RAM:
- Learning from a corpus of 500 words: about 40 seconds.
- Learning from a corpus of 1000 words: about 4 minutes.
- Learning from a corpus of 5000 words: about 8 hours.
However, the learning algorithm should be run only once for each training corpus. After the rules are learned,
they can be used to analyze new articles very quickly.
5. The sentence stage
The aim of this stage is to improve the accuracy of the analysis using syntactic information. We start this stage
with the output of the pair stage – a sentence in which each word is assigned a morphological analysis, such
that 93.8% of the words on average are assigned their correct analysis. A correct analysis must correspond to a
syntactically correct sentence. We therefore try to syntactically parse the sentence (actually the POS’s of the
sentence). If the parse fails, we would like to conclude that the proposed analysis is incorrect and try another
morphological analysis. However, since no syntactic parser is perfect, we do not reject sentences for which the
parsing fails. We use the syntactic grammaticality, estimated by the syntactic parser, as one of two measures for
the correctness of the analysis, and combine this with the score that results from the pair phase.
a.
Syntactic grammaticality
We use a rudimentary syntactic parser to evaluate the syntactic grammaticality of a given morphological
analysis of a sentence. This parser uses a handcrafted grammar with about 150 rules. There are about 10
nonterminals (one for each basic part-of-speech), and about 30 terminals (these 10, plus about 20 words with special syntactic meaning, such as "AW" (or), "HWA" (he, a pronoun), etc.).
The parser is implemented by a bottom-up chart-parsing algorithm, see the references in Allen (1995, chapter
3).
The main element in the algorithm is the constituent. A constituent represents a single POS of the original
sentence, or a series of adjacent POS’s that form a complete grammatical unit. The input to the algorithm is a
sequence of POS tags as determined by the previous stages of the morphological analyzer, i.e., for each word
14
token we take the POS of the lexical morpheme and additional linguistic morphemes that contribute to
agreement. For example, the word token $QMTI, “my sycamore”, will be represented as the POS tag
[noun:definite:feminine:singular], the second analysis of that word token, “that I got up”, will consist of a pair
of tags [connective] [verb:first-person:singular].
The algorithm examines pairs of adjacent constituents to see if they match a right-hand side of a grammar rule.
If so – they can be combined into a larger constituent, whose type is determined by the left-hand side of this
grammar rule. The algorithm stops when no adjacent constituents can be combined in this way.
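The core reduction step can be sketched as follows. A real chart parser keeps all possible constituents and explores every combination order; this greedy left-to-right loop, with a three-rule toy grammar, only illustrates the pair-combination step described above.

```python
# Toy binary grammar: (right-hand-side pair) -> left-hand side. These three
# rules are illustrative, not the analyzer's 150-rule handcrafted grammar.
RULES = {
    ('Noun-Phrase', 'Adjective'): 'Noun-Phrase',
    ('Verb-Phrase', 'Adverb'): 'Verb-Phrase',
    ('Noun-Phrase', 'Verb-Phrase'): 'Sentence',
}

def reduce_constituents(constituents, rules=RULES):
    """Repeatedly combine an adjacent pair of constituents that matches the
    right-hand side of a rule, until no pair matches."""
    cs = list(constituents)
    changed = True
    while changed:
        changed = False
        for i in range(len(cs) - 1):
            lhs = rules.get((cs[i], cs[i + 1]))
            if lhs is not None:
                cs[i:i + 2] = [lhs]   # combine the pair into one constituent
                changed = True
                break
    return cs
```

A fully grammatical input reduces to a single constituent, while an ungrammatical one leaves several constituents behind, which is exactly the signal used for the grammaticality judgment.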
We will now present an example of how we use the parser to calculate the grammaticality of sentences.
Consider the following sentence:
&$RWT &WBDIM ZRIM HGI&W ATMWL MTAILND AL I$RAL.
The correct morphological analysis of this sentence is:
(1)
&$RWT     number:construct             Tens-of
&WBDIM    noun:masculine:plural        workers
ZRIM      adjective:masculine:plural   foreign
HGI&W     verb:past:plural             came
ATMWL     adverb                       yesterday
M TAILND  prep. + proper noun          from Thailand
AL I$RAL  prep. + proper noun          to Israel
= 'Tens of foreign workers came yesterday from Thailand to Israel'.
The initial sequence of constituents is therefore:
[number:construct] [noun:masculine:plural] [adj] [verb:plural] [adv] [preposition-proper noun]
[preposition] [proper noun].
A grammar rule that can be applied here is:
Noun-Phrase → Noun-Phrase Adjective.
According to this rule, the doubly underlined constituents (“[noun:masculine:plural] [adj]”, “workers foreign”)
can be combined to form a single Noun-Phrase constituent. We represent this constituent by its head component
[according to HPSG theory, described by Sag and Wasow (1999)], which is the Noun-Phrase component
([noun:masculine:plural]), and get a reduced sequence (where the combined component is singly underlined):
[number:construct] [noun:masculine:plural] [verb:plural] [adv] [preposition-proper noun] [preposition] [proper noun]
that corresponds to the sentence:
&$RWT &WBDIM HGI&W ATMWL MTAILND AL I$RAL = Tens of workers came yesterday
from Thailand to Israel
Another grammar rule that can be applied is:
Verb-Phrase → Verb-Phrase Adverb
Applying this rule to the output of the previous rule allows us to combine two other constituents (those with a
double underline) and get a reduced sentence:
[number:construct] [noun:masculine:plural] [verb:plural] [preposition:proper noun]…
corresponding to the sentence
&$RWT &WBDIM HGI&W MTAILND …= Tens-of workers came from Thailand ...
Applying three more grammar rules results in a sequence of two constituents:
[noun:masculine:plural] [verb:plural]
corresponding to simple sentence:
&WBDIM HGI&W =
Workers came.
Applying the rule:
Sentence → Noun-Phrase Verb-Phrase,
we combine the two remaining constituents to a single constituent that represents the entire sentence. This
result suggests that the sentence is syntactically correct. Note that our syntactic analyzer does not produce a
parse tree for the sentence: its only role is to evaluate the grammaticality of the sentence.
Suppose we have a mistake in the morphological analysis of the second word, and we think its analysis is:
&WBDIM    verb:masculine:plural    [they]-work
= 'Tens of they-work came yesterday from Thailand to Israel'.
There is no grammar rule whose right-hand side includes a construct number followed by a verb, or a verb
followed by an adjective. Therefore the final result of the algorithm will be a sequence of four constituents:
"[number:construct] [verb:masculine:plural] [adj] [verb].",
corresponding to the non-grammatical sentence:
"&$RWT &WBDIM ZRIM HGI&W" =
"Tens-of work foreign came."
This result indicates that the syntax of the sentence is bad.
The parser is rudimentary in the sense that it does not completely parse most grammatical sentences. For
example, the sentence:
"&$RWT &WBDIM ZRIM HGI&W AL I$RAL K$ZRXH H$M$" =
"Tens of foreign workers came to Israel when the sun rose"
will only be reduced to:
"HGI&W K$-ZRXH" =
"came when rose"
Generally speaking, only 66 out of 290 correctly-tagged sentences in our training corpus receive a full parse.
Constructs that do not receive a full parse include sentences connected by connective particles (as the above
example), two consecutive prepositions, and several special phrases.
A detailed complexity analysis of the bottom-up chart-parsing algorithm is given in Allen (1995, chapter 3).
Intuitively, the algorithm generates every possible constituent, and checks each pair of adjacent constituents to
see if they match a right-hand side of a grammar rule. If there are n words in the sentence, there can be O(n²) different constituents, and each constituent has O(n) adjacent constituents, so the complexity of the algorithm is O(n³).
b. The syntactic score, the morphological score and the final score
Using the syntactic parsing described above, we assign to each morphological analysis of a sentence a syntactic
score. The syntactic score is calculated by performing the above syntactic parsing, and applying a predefined
score function to the final sequence of constituents. The highest syntactic score is 0; it is assigned to a
morphological analysis of the sentence whose syntactic parsing results in a single constituent (such as example
(1) above), or to several such elements that are connected by conjunctions. The score is negative when the
reduced sentence contains other elements.
Recall that after the pair stage we have an updated morphological score for each analysis of each word in a
sentence. We use this information to calculate, for each morphological analysis of the entire sentence, a
morphological score for that analysis. The morphological score is defined as the logarithm of the product of all
the morphological scores of the analyses in the sentence. The motivation for this definition is as follows:
Consider an analysis α of a sentence s = w1…wn. If we interpret score(α(wi))/1000 as the probability that α(wi) is the correct analysis of wi, and in addition we assume that the probabilities of the analyses of different words of s are statistically independent, then the product over i = 1…n of score(α(wi))/1000 is the probability that α is the correct analysis of the entire sentence, and the morphological score of s is the logarithm of this probability.
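The definition above amounts to the following one-liner (score/1000 interpreted as a probability; the function name is ours):

```python
import math

def morphological_score(word_scores):
    """Log of the product of the per-word scores divided by 1000, i.e. the
    log-probability of the sentence analysis under the independence assumption."""
    return sum(math.log(s / 1000.0) for s in word_scores)
```

Summing logarithms rather than multiplying probabilities avoids numerical underflow on long sentences.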
We now have a problem of combining the two scores into one score, the final score.
...
In the absence of a better scheme, we chose to average the two scores to get the final score.
c. Finding the best analysis for the sentence
Using the above-mentioned score functions, we can correct the morphological analysis of a given sentence
using the following algorithm:
Find all morphological analyses of the given sentence.
For each analysis:
find its morphological score and its syntactic score, and combine them to find its final score.
Choose the morphological analysis with the highest final score.
This brute-force algorithm is exponential in the number of ambiguous words in the sentence, and is therefore impractical. We therefore resorted to heuristic search methods and used the simple hill-climbing method described below:
a. (Initialization): Let α be the current morphological analysis of the sentence. Initialize α to the analysis where each word is assigned the morphological analysis with the highest morphological score (this is the result of the pair stage).
b. Find the final score of α (this stage requires a syntactic analysis).
c. Loop over all words in the sentence. For each word w:
- Let α0(w) be the analysis of w with the highest morphological score.
- (*) If α(w) ≠ α0(w), then continue to the next word in the sentence.
- Otherwise, loop over all analyses of w except α0(w). For each such analysis α'(w):
  - Replace α(w) by α'(w). Let α' be the resulting analysis of the entire sentence.
  - Find the final score of α'. This stage requires a partial syntactic analysis; however, we may reuse parts of the analysis that were not affected by the change.
d. Among all the analyses found in step (c), let α̂ be the one with the highest final score. If its final score is higher than that of α, replace α by α̂ and return to step (b); otherwise, quit and return α.
In step (c) of the algorithm, there is a condition marked with an asterisk (*). It means that after the analysis of a word has been changed, it will not be changed again. Therefore, in the worst case, the analysis of each word is changed once, so step (c) will be performed at most n times, where n is the number of words in the sentence.
d. Computational complexity
- As explained above, step (c) will be performed at most n times, where n is the number of words in the sentence.
- In step (c), we perform a partial syntactic analysis once for each word in the sentence. In doing so we insert every word in turn into the parse-chart, as a new constituent. This is exactly what we do when we perform a single full syntactic analysis. Therefore, the complexity of step (c) is equal to the complexity of a single full syntactic analysis, which is O(n³) (see section a above).
- Therefore the total complexity of the entire algorithm is O(n⁴).

e. Evaluation
The algorithm for finding the best morphological analysis was run on two articles whose analysis had been corrected using transformation rules learned from a corpus of about 4900 words (as described at the end of the previous section). The accuracy of the analysis increased from 93.8% after the pair stage to 96.2% after the sentence stage.
A statistical analysis similar to that done in the previous section yields (for a confidence coefficient of 95%) the
following confidence interval:
p ≥ 0.962 - 1.65·sqrt(0.962·0.038/469) ≈ 0.947
This means that the error rate is less than 5.3% with a confidence level of 95%.
f. Timing
To give some idea of the time required for syntactic analysis, we give some time measurements made on a Celeron 533MHz computer with 256MB RAM. The measurements are not accurate, and are only intended to show the general order of magnitude. All times are in seconds:
[Table and scatter plot omitted: time to analyze (in seconds) plotted against the number of words in the sentence, for sentences of 11 to 32 words.]

6. Comparing the three stages
In order to test the relative contribution of each stage to the final result, we ran the program eight times, each
time using a different combination of the three stages. We got the following results:
Word Stage   Pair Stage   Sentence Stage   Error rate (%)
No*          No           No               36.0
Yes          No           No               14.0
No           Yes          No               21.0
Yes          Yes          No               7.0
No           No           Yes              20.0
Yes          No           Yes              5.3
No           Yes          Yes              14.0
Yes          Yes          Yes              3.8
(* "No" in the word stage means that a random analysis is chosen for each string.)
[Cube diagram: the eight error rates (36, 21, 14, 20, 14, 7, 5.3, 3.8) placed at the corners of a cube whose three axes correspond to the word, pair and sentence stages.]
The results are summarized in the above cube, where the vertical axis represents the word stage (down=no, up=yes), the horizontal axis represents the pair stage (left=no, right=yes) and the third axis represents the sentence stage (towards us=no, away from us=yes). These results indicate that all three stages are important: the lowest error rate is achieved only when all stages are applied. The results also indicate that the word stage has the highest contribution, and the pair stage has the lowest contribution.
7. Conclusions and Further research
We presented a system for morphological analysis of Hebrew unvocalized text. The system uses information
from three different levels: the single-word level, the word-pair level and the sentence level. Each information
source contributes to the final accuracy of 96.2%. We conclude that the best accuracy is achieved by combining
information from all levels.
a. Further research
1. The effectiveness of the pair phase can be improved if we find a way to learn correction commands from an
untagged corpus.
2. The string phase requires a lot of memory (to hold the estimated probabilities of each lexical value). It may
be possible to reduce its memory requirements by using correction commands that correct a single word.
This improvement also requires learning correction commands from an untagged corpus.
3. Analyzing the errors shows that about 25% of the remaining errors are due to idioms. It is possible but
tedious to create a lexicon of idioms manually, but it would be much better to find a method to learn
idioms automatically, preferably from an untagged corpus (because we don't have a large analyzed
corpus). Note that it is not enough to find pairs of strings that are idioms (as Manning and Schutze (1999,
chapter 5) do); it is also required to find the correct analysis of each string in the idiom, which is a much
harder task if the corpus is not analyzed (see Justeson and Katz (1995)).
4. The accuracy of the syntactic parser would improve if it knew the subcategorization information of verbs and nouns. Again, it is possible to create this information manually, but it would be better to learn it automatically from an untagged corpus.
5. The sentence stage is very time-consuming. To make the system more useful in practice, it is important to
find a way to incorporate syntactic information into the analysis without performing a full syntactic
analysis.
6. The information in the first two stages is acquired automatically, while the sentence level uses a manually
created analyzer. The accuracy and portability of the system can be improved by using a statistical
syntactic analyzer in which the rules and the probabilities are learned automatically. Such an analyzer can
be built using the Hebrew treebank [SIWAN (2001)].
8. Bibliography
Albeck
Orly Albeck. Formal analysis by a restriction grammar on one of the stages of Modern Hebrew.
Computerized analysis of Hebrew words. Israel Science and Technology Ministry. (In Hebrew.)
Allen (1995)
James Allen: Natural language understanding, 2nd ed., The Benjamin-Cummings Publishing Company,
CA 1995
Ben-Tur et al., (1992)
Esther Ben-Tur, Aviela Angel, Danit Ben-Ari, and Alon Lavie. Computerized analysis of Hebrew
words. Israel Science and Technology Ministry. (In Hebrew.)
Brants (2000)
Thorsten Brants: TnT – A Statistical Part-of-Speech Tagger. Proceedings of the Sixth Applied Natural
Language Processing Conference ANLP-2000, Seattle, WA.
http://www.coli.uni-sb.de/~thorsten/publications/index.html
Brill (1995)
Eric Brill: transformation-based error-driven learning and natural language processing: a case study in
part-of-speech tagging. Computational Linguistics 21, 543-565, (1995).
Charniak et al (1993)
Eugene Charniak, Curtis Hendrickson, Neil Jacobson, and Mike Perkowitz: Equations for part-of-speech tagging. In Proceedings of the Eleventh National Conference on Artificial Intelligence, Menlo Park: AAAI Press/MIT Press (1993) 784-789.
Choueka and Lusignan (1985).
Yaacov Choueka and S. Lusignan Disambiguation by short context.
Computers and the Humanities, 19(3), (1985).
Choueka (2002)
Choueka, Y., Rav Millim. Technical report, The Center for Educational Technology in Israel.
http://www.cet.ac.il/rav-milim/.
Church (1988)
Kenneth W. Church: A stochastic parts program and noun phrase parser for unrestricted text. In ANLP
2. 136-143 (1988).
Cutting et al (1992)
Doug Cutting, Julian Kupiec, Jan Pedersen and Penelope Sibun: A Practical Part-of-Speech Tagger. In Proceedings of ANLP-92 (1992).
Dagan and Itai (1994)
I. Dagan and A. Itai: Word sense disambiguation using a second language monolingual corpus,
Computational Linguistics 20, 563-596, (1994).
DeRose (1988)
Steven J. DeRose, Grammatical category disambiguation by statistical optimization, Computational
Linguistics 14, 31-39, (1988).
Efron and Thisted (1976)
Bradley Efron and Ronald Thisted: Estimating the number of unseen species: How many words did
Shakespeare know?, Biometrika (1976) 63:3, 435-447.
Essen and Steinbiss (1992)
Ute Essen and Volker Steinbiss: Co-occurrence smoothing for stochastic language modeling, in
ICASSP 92, volume 1, 161-164. (1992).
Gale and Sampson (1995)
William A. Gale and Geoffrey Sampson: Good-Turing Estimation Without Tears, Journal of
Quantitative Linguistics, 1995, Vol. 2, No. 3, 217-237.
Good (1953)
I. J. Good: The population frequencies of species and the estimation of population parameters,
Biometrika 40, 237-264.
Hopcroft and Ullman (1979)
John E. Hopcroft and Jeffrey D. Ullman: Introduction to automata theory, languages and computation,
Addison-Wesley 1979.
ISO (1999)
“Information and documentation – Conversion of Hebrew characters into Latin characters – Part 3: Phonemic Conversion”, ISO/FDIS 259-3 (E).
Justeson and Katz (1995)
John S. Justeson and Slava M. Katz: Technical terminology: some linguistic properties and an
algorithm for identification in text. Natural Language Engineering 1, 9-27 (1995).
Kaalep et al (2000)
Heiki-Jaan Kaalep, Tomaz Erjavec, Jozef Stefan, Nancy Ide, Vladimir Petkevic, Dan Tufis: “HMM
Tagging of Inflectionally Rich Languages: Results on Estonian, Romanian, Czech and Slovene”,
http://nl.ijs.si/et/project/York/emnlp3.5.text
Kasami (1965)
T. Kasami: An efficient recognition and syntax algorithm for context-free languages, Scientific Report,
AFCRL-65-758, Air Force Cambridge Research Lab., Bedford, Mass.
Levinger (1992)
Moshe Levinger: Morphological disambiguation in Hebrew. Master's thesis, Computer Science
Department, Technion, Haifa, Israel. (In Hebrew.)
Levinger et al (1995)
Moshe Levinger, Uzzi Ornan and Alon Itai: Morphological disambiguation in Hebrew using a priori
probabilities, Computational Linguistics 21, 383-404, (1995).
Manning and Schutze (1999)
Christopher D. Manning and Hinrich Schutze: Foundations of Statistical Natural Language Processing, MIT Press (1999).
Merialdo (1994)
Bernard Merialdo: Tagging English text with a probabilistic model, Computational Linguistics 20, 155-171, (1994).
Ornan and Katz (1994)
Uzzi Ornan and Michael Katz: A new program for Hebrew index based on the phonemic script,
Technical Report #LCL 94-7, Laboratory for Computational Linguistics, Technion, Israel (1994).
http://www.multitext.co.il
Sag and Wasow (1999)
Ivan Sag and Tom Wasow : Syntactic theory: a formal introduction, CSLI publications, Ventura Hall,
Stanford University, Stanford CA 94305 (1999).
Segal (2001)
A Hebrew morphological analyzer (includes free source code)
http://come.to/balshanut or
http://www.cs.technion.ac.il/~erelsgl/bxi/hmntx/teud.html
SIWAN (2001)
Khalil Sima'an, Alon Itai, Yoad Winter, Alon Altman and Noa Nativ. Building a Tree-Bank of Modern
Hebrew Text. To appear in Traitement Automatique des Langues.
http://www.cs.technion.ac.il/~itai/Corpus-Project/project-description.html
Weischedel et al (1994)
Ralph M. Weischedel, Marie Meteer, Richard L. Schwartz, Lance Ramshaw, Jeff Palmucci: Coping
with Ambiguity and Unknown Words through Probabilistic Models. Computational Linguistics 19(2):
359-382 (1994)
Wiium (2000)
Villi Wiium, “Applied economic techniques II”, Department of Economics, National University of
Ireland, Galway.
http://www.ucg.ie/ecn/staff/wiium/teaching/ec216/notes/chapter_06/sld078.htm
Younger
D. H. Younger: Recognition and parsing of context-free languages in time n³, Information and Control 10:2, 189-208.
Appendix A: The Hebrew-Latin transliteration
Working with Hebrew letters is still a non-trivial task on some computer systems. In order to overcome this
technical problem, and help readers not familiar with the Hebrew alphabet, we used a 1-1 transliteration, in
which a single capital Latin letter or a single special symbol represents each Hebrew letter.
The 1-1 correspondence is shown in the following table:

Hebrew letter   Hebrew name   Latin transliteration
‫א‬               Alef          A
‫ב‬               Bet           B
‫ג‬               Gimmel        G
‫ד‬               Dalet         D
‫ה‬               Hei           H
‫ו‬               Waw           W
‫ז‬               Zayn          Z
‫ח‬               Xet           X
‫ט‬               Tet           @
‫י‬               Yud           I
‫כ‬               Kaf           K
‫ל‬               Lamed         L
‫מ‬               Mem           M
‫נ‬               Nun           N
‫ס‬               Samek         S
‫ע‬               Ayn           &
‫פ‬               Pei           P
‫צ‬               Tsadiq        C
‫ק‬               Quf           Q
‫ר‬               Reish         R
‫ש‬               Shin          $
‫ת‬               Taw           T
The end-of-word Hebrew letters are not differentiated from their middle-of-word counterparts (K, M, N, P, C).
Also, as in the undotted Hebrew script, there is no distinction between the letter "Shin" and the letter "Sin".
Note that this is not a phonologic transcription. The vowels ‘a’ and ‘e’, which are usually not represented in the
Hebrew unvocalized script, are also not represented in our Latin transliteration. For example, the word:
‫שלג‬
which is pronounced “sheleg”, is transliterated:
$LG
However, the 1-1 correspondence was chosen so that most Latin letters are similar in sound to the Hebrew
letters they replace. This makes it quite easy to read transliterated Hebrew scripts after a short practice.