
RERANKER
HW4
SIYU QIU
1. Motivation
Given n-best hypothesis sentences, we can analyze them along many meaningful dimensions, extracting features and computing a weighted score so as to find the single best sentence matching the source. The question is then how to set the weights of those features. In order to improve the BLEU score, I use MERT to find the optimal feature weights, computing 8 features for each sentence.
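As a minimal sketch (with illustrative names, not the actual homework code), the rescoring step is just a dot product between each hypothesis's feature vector and the weight vector, followed by an argmax over the n-best list:

def best_hypothesis(hypotheses, weights):
    """hypotheses: list of (sentence, features) pairs; weights: one
    float per feature. Returns the highest-scoring sentence."""
    def score(features):
        return sum(w * f for w, f in zip(weights, features))
    return max(hypotheses, key=lambda h: score(h[1]))[0]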
2. MERT
The algorithm used in this homework to achieve minimum error rate training is Powell search. The basic idea is that, for one target weight, each hypothesis defines a line, with the weight as x, the corresponding feature value as the slope, and the fixed contribution of the remaining features, Σ_{j≠i} w_j f_j, as the y-intercept. We then find the upper envelope of these lines and the threshold points where the top hypothesis changes. Combining the threshold points of all sentences and sorting them segments the whole x-axis into several intervals; in each interval we compute the total BLEU score, and the highest-ranking intervals give our candidate weights.
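The per-sentence threshold search can be sketched as follows, under the assumption that each hypothesis is stored as a plain feature vector (degenerate ties between parallel lines are ignored here):

def threshold_points(hyps, weights, i):
    """hyps: feature vectors of one sentence's n-best list. Returns the
    sorted x values where the top hypothesis switches as weight i varies."""
    lines = []
    for f in hyps:
        slope = f[i]                         # feature i is the slope
        intercept = sum(w * v for j, (w, v) in enumerate(zip(weights, f))
                        if j != i)           # fixed part of the score
        lines.append((slope, intercept))
    # at x -> -inf the flattest line (largest intercept on ties) is on top
    current = min(range(len(lines)), key=lambda k: (lines[k][0], -lines[k][1]))
    thresholds, x = [], float("-inf")
    while True:
        s0, b0 = lines[current]
        best_x, best_k = float("inf"), None
        for k, (s, b) in enumerate(lines):
            if s > s0:                       # only steeper lines can overtake
                cross = (b0 - b) / (s - s0)  # intersection with current top
                if x < cross < best_x:
                    best_x, best_k = cross, k
        if best_k is None:                   # current line stays on top forever
            return thresholds
        thresholds.append(best_x)
        current, x = best_k, best_x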
One thing to pay attention to is that the candidate weight list will contain more than one entry in most cases. Heuristically, the weights should not fall outside the interval [−5, 5], so the candidates are filtered before selection. I then choose randomly from the filtered candidate list, iterating these steps until a weight yielding a better BLEU score is found.
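This outer loop can be sketched as follows, assuming the threshold search above has already produced candidate values for weight i and that corpus_bleu is a stand-in callable that scores the dev set under a weight vector (neither is the homework's exact interface):

import random

def search_weight(weights, i, candidates, corpus_bleu, best_bleu):
    """Try filtered candidates for weight i until one improves best_bleu."""
    filtered = [w for w in candidates if -5 <= w <= 5]  # heuristic bound
    random.shuffle(filtered)
    for w in filtered:
        trial = weights[:i] + [w] + weights[i + 1:]
        bleu = corpus_bleu(trial)
        if bleu > best_bleu:
            return trial, bleu   # keep the first improving weight found
    return weights, best_bleu    # no candidate improved; keep old weights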
3. Feature
Besides the three given features (the language model weight, the translation model p(e|f) weight, and the lexical translation model p_lex(f|e) weight), 5 more features are added.
• Number of words
The number of words in the hypothesis prevents the hypothesis from being too short.
• Length of reference sentence / length of hypothesis sentence
There should be some penalty for hypothesis sentences whose length differs greatly from that of the reference.
• Number of untranslated words
Using the alignment file, I built a dictionary mapping each source Russian word to its English translations. Then, for each hypothesis, we can count the words in the source sentence that were not translated (see the sketch after this list).
• OOV
Words that never occurred in the training data are treated as out-of-vocabulary words, and the number of such words in a hypothesis is counted.
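The following is a sketch of the untranslated-word (uncover) feature, assuming a word-aligned bitext with one "i-j" alignment string per sentence pair; the file layout and names are assumptions rather than the homework's exact format:

def build_dictionary(src_lines, tgt_lines, align_lines):
    """Map each Russian source word to the set of English words it was
    aligned to anywhere in the training data."""
    translations = {}
    for src, tgt, align in zip(src_lines, tgt_lines, align_lines):
        src_toks, tgt_toks = src.split(), tgt.split()
        for pair in align.split():
            i, j = map(int, pair.split("-"))
            translations.setdefault(src_toks[i], set()).add(tgt_toks[j])
    return translations

def untranslated_count(source, hypothesis, translations):
    """Count source words with no known translation in the hypothesis."""
    hyp_words = set(hypothesis.split())
    return sum(1 for w in source.split()
               if not translations.get(w, set()) & hyp_words)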
4. Experiment and Analysis
4.1. Experiment Setting. After training with MERT, the weights used are as follows:
Table 1. Settings

Feature             Weight
OOV                -0.3252
number of words     0.1416
uncover            -0.4597
LM                 -1
P_lex(f|e)         -0.5
rl/hl               1.4240
P(e|f)             -0.5061
4.2. Experiment Result. These weights are interpretable: the OOV weight is below zero, meaning the number of OOV words is penalized; likewise, the weight for uncover is also negative. With these settings, the final BLEU score on the test set is 28.09. Some other settings that performed better on the training data were also tried, but they did worse on the test set, which may indicate overfitting.