Paraphrasing and Translation

Paraphrasing and Translation
Chris Callison-Burch
July 2, 2007
NI VER
S
E
Chris Callison-Burch
R
G
O F
H
Y
TH
IT
E
U
D I
U
N B
Paraphrasing and Translation
July 2, 2007
NI VER
S
H
E
Main Contributions & Talk Outline
R
G
O F
1
Y
TH
IT
E
U
D I
U
N B
• Paraphrasing with bilingual parallel corpora
– Generating paraphrases from data
– Ranking paraphrases with probabilities
– Evaluating quality of paraphrases
– Results
• Improved statistical MT with paraphrases
– The problem of coverage in SMT
– Paraphrases as a potential solution
– Experimental design and evaluation methodologies
– Results and conclusions
• (Identification of flaws in automatic evaluation metrics for MT)
Chris Callison-Burch
Paraphrasing and Translation
July 2, 2007
NI VER
S
H
E
Usefulness of paraphrases
R
G
O F
2
Y
TH
IT
E
U
D I
U
N B
• Paraphrases are alternative ways of conveying the same information
• Useful in NLP applications such as:
– Generation
– Question answering
– Multidocument summarization
– Automatic evaluation of MT and summaries
• How to get them?
Chris Callison-Burch
Paraphrasing and Translation
July 2, 2007
NI VER
S
H
E
R
G
O F
3
Y
TH
IT
E
U
D I
U
N B
Paraphrasing with monolingual parallel data
• Previous work used monolingual parallel corpora
• Multiple translations of the same source
– Multiple translations of French novels into English
– Evaluation data for Bleu metric
• Comparable corpora are similar
– Encyclopedia articles on same topic
– Different newspapers’ accounts of one event
Chris Callison-Burch
Paraphrasing and Translation
July 2, 2007
NI VER
S
H
E
R
G
O F
4
Y
TH
IT
E
U
D I
U
N B
Paraphrasing with monolingual parallel data
• Find identical contexts in aligned sentences:
Emma burst into tears and he tried to comfort her, saying things to
make her smile.
Emma cried, and he tried to console her, adorning his words with puns.
• Extract burst into tears = cried and comfort = console
Chris Callison-Burch
Paraphrasing and Translation
July 2, 2007
NI VER
S
H
E
Potential problems with this method
R
G
O F
5
Y
TH
IT
E
U
D I
U
N B
• Monolingual parallel corpora are relatively uncommon
• Limits what paraphrases we can generated
– Limited number of paraphrases
– Constrained to a few genres
Chris Callison-Burch
Paraphrasing and Translation
July 2, 2007
NI VER
S
H
E
R
G
O F
6
Y
TH
IT
E
U
D I
U
N B
Paraphrasing with bilingual parallel corpora
• Bilingual parallel corpora are much more common
• However, no longer contain identical contexts
• Adopt techniques from from phrase-based statistical MT
• Use aligned foreign language phrase as pivot
Chris Callison-Burch
Paraphrasing and Translation
July 2, 2007
NI VER
S
H
E
R
G
O F
7
Y
TH
IT
E
U
D I
U
N B
Paraphrasing with bilingual parallel corpora
I do not believe in mutilating dead bodies
no soy partidaria de mutilar
cadáveres
El mar arroja tantos
cadáveres
So many
Chris Callison-Burch
corpses
de inmigrantes ilegales ahogados a la playa ...
of drowned illegals get washed up on beaches ...
Paraphrasing and Translation
July 2, 2007
NI VER
S
H
E
Example paraphrases
R
G
O F
8
Y
TH
IT
E
U
D I
U
N B
• dead bodies → corpses, carcasses, bodies, skeletons, people
• military force → armed forces, defense, force, forces, peace-keeping personnel,
military forces
• sooner or later → at some point, eventually
• great care → a careful approach, greater emphasis, particular attention,
specific attention, special attention, very careful
• at work → at the workplace, employment, held, holding, in the work sphere,
organised, operate, taken place, took place, working
Chris Callison-Burch
Paraphrasing and Translation
July 2, 2007
NI VER
S
H
E
Multiple paraphrases
R
G
O F
9
Y
TH
IT
E
U
D I
U
N B
• Generate multiple paraphrases for a given phrase
• Want to be able to rank them
• Define a paraphrase probability
Chris Callison-Burch
Paraphrasing and Translation
July 2, 2007
NI VER
S
H
E
Paraphrase probability
R
G
O F
10
Y
TH
IT
E
U
D I
U
N B
• Defined in terms of translation model probability
• Naturally falls out of parallel corpus
eˆ2 = arg max p(e2|e1)
e2 !=e1
= arg max
e2 !=e1
≈ arg max
e2 !=e1
Chris Callison-Burch
!
f
p(f |e1)p(e2|f )
! count(f, e1)
f
count(e2, f )
∗
total(e1)
total(f )
Paraphrasing and Translation
July 2, 2007
NI VER
S
H
E
Context matters
R
G
O F
11
Y
TH
IT
E
U
D I
U
N B
• Further refined method to use context
• Use surrounding words in context with language model probability
• Use WSD algorithm to restrict paraphrases to same word sense
Chris Callison-Burch
Paraphrasing and Translation
July 2, 2007
NI VER
S
H
E
Paraphrase evaluation
R
G
O F
12
Y
TH
IT
E
U
D I
U
N B
• Previous work used poor evaluation methodology
• Asked judges whether paraphrases had “approximate conceptual equivalence”
or were “roughly interchangeable given the genre”
• Show paraphrases out of context, or only substitute paraphrases into original
context
Chris Callison-Burch
Paraphrasing and Translation
July 2, 2007
NI VER
S
H
E
Paraphrase evaluation
R
G
O F
13
Y
TH
IT
E
U
D I
U
N B
Is console a good paraphrase for comfort?
Emma burst into tears and he tried to comfort her, saying things
to make her smile.
Emma burst into tears and he tried to console her, saying things
to make her smile.
Chris Callison-Burch
Paraphrasing and Translation
July 2, 2007
NI VER
S
H
E
Paraphrase evaluation
R
G
O F
14
Y
TH
IT
E
U
D I
U
N B
Is console a good paraphrase for comfort?
Emma burst into tears and he tried to comfort her, saying things
to make her smile.
Emma burst into tears and he tried to console her, saying things
to make her smile.
What about now?
George Bush said Democrats provide comfort to our enemies.
George Bush said Democrats provide console to our enemies.
Chris Callison-Burch
Paraphrasing and Translation
July 2, 2007
NI VER
S
H
E
Paraphrase evaluation
R
G
O F
15
Y
TH
IT
E
U
D I
U
N B
• We substituted paraphrases into a number of new contexts
• Asked judges whether paraphrases preserved meaning and remained
grammatical
• Two measures: accuracy and correct meaning
Chris Callison-Burch
Paraphrasing and Translation
July 2, 2007
NI VER
S
H
E
Experimental conditions
R
G
O F
16
Y
TH
IT
E
U
D I
U
N B
• Extracted paraphrases from parallel corpora
• Evaluated quality under various conditions
1. Ranking paraphrases with paraphrase probability
2. Ranking paraphrases in context with a language model
3. Using multiple parallel corpora
4. Limiting paraphrases by word sense
5. Manual word alignments
Chris Callison-Burch
Paraphrasing and Translation
July 2, 2007
NI VER
S
90
80
H
G
E
Paraphrase probability
Paraphrase prob + LM
Using multiple parallel corpora
Word sense controlled
Manual word alignments
R
17
O F
100
Y
TH
IT
E
U
D I
U
N B
70
64.5
60
50
48.9
40
30
20
10
0
Chris Callison-Burch
Accuracy
Correct Meaning
Paraphrasing and Translation
July 2, 2007
NI VER
S
90
80
H
G
E
Paraphrase probability
Paraphrase prob + LM
Using multiple parallel corpora
Word sense controlled
Manual word alignments
R
18
O F
100
Y
TH
IT
E
U
D I
U
N B
70
64.5
60
60.8
55.3
50
48.9
40
30
20
10
0
Chris Callison-Burch
Accuracy
Correct Meaning
Paraphrasing and Translation
July 2, 2007
NI VER
S
90
80
H
G
E
Paraphrase probability
Paraphrase prob + LM
Using multiple parallel corpora
Word sense controlled
Manual word alignments
R
19
O F
100
Y
TH
IT
E
U
D I
U
N B
70
64.5
60
55.3
50
57.3
65.4
60.8
48.9
40
30
20
10
0
Chris Callison-Burch
Accuracy
Correct Meaning
Paraphrasing and Translation
July 2, 2007
NI VER
S
90
80
H
G
E
Paraphrase probability
Paraphrase prob + LM
Using multiple parallel corpora
Word sense controlled
Manual word alignments
70
R
20
O F
100
Y
TH
IT
E
U
D I
U
N B
70.5
60
61.9
55.3
50
64.5
57.3
65.4
60.8
48.9
40
30
20
10
0
Chris Callison-Burch
Accuracy
Correct Meaning
Paraphrasing and Translation
July 2, 2007
NI VER
S
90
80
H
G
E
Paraphrase probability
Paraphrase prob + LM
Using multiple parallel corpora
Word sense controlled
Manual word alignments
R
21
O F
100
Y
TH
IT
E
U
D I
U
N B
84.7
75.0
70
70.5
60
61.9
55.3
50
64.5
57.3
65.4
60.8
48.9
40
30
20
10
0
Chris Callison-Burch
Accuracy
Correct Meaning
Paraphrasing and Translation
July 2, 2007
NI VER
S
H
E
Conclusions for Part 1
R
G
O F
22
Y
TH
IT
E
U
D I
U
N B
• Can generate high quality paraphrases bilingual parallel corpora
• Quality improvements seen with refined paraphrase probability
• Data is more common than in previous approaches
– Can generate more paraphrases for wide variety of genres
– Can be easily applied to different languages
• How useful are paraphrases when applied to a task?
Chris Callison-Burch
Paraphrasing and Translation
July 2, 2007
NI VER
S
H
E
R
G
O F
23
Y
TH
IT
E
U
D I
U
N B
Part 2:
Using Paraphrases to Improve
Statistical Machine Translation
Chris Callison-Burch
Paraphrasing and Translation
July 2, 2007
NI VER
S
H
E
Problem of coverage in SMT
R
G
O F
24
Y
TH
IT
E
U
D I
U
N B
• Statistical machine translation learns the translations of words and phrases
from parallel corpora, which are often limited in size
• Currently if a word is unseen then SMT will be unable to translate it
• If a phrase is unseen, but its individual words are, then SMT will be less likely
to produce a correct translation for it
Chris Callison-Burch
Paraphrasing and Translation
July 2, 2007
NI VER
S
H
E
The extent of the problem
Test Set Items with Translations (%)
100
90
80
D I
U
N B
unigrams
bigrams
trigrams
4-grams
70
60
50
40
30
20
10
0
10000
Chris Callison-Burch
R
G
O F
25
Y
TH
IT
E
U
100000
1e+06
Training Corpus Size (num words)
Paraphrasing and Translation
1e+07
July 2, 2007
NI VER
S
H
E
Behavior on unseen words
R
G
O F
26
Y
TH
IT
E
U
D I
U
N B
• A system trained on 10,000 sentences (≈200,000 words) may translate
Es positivo llegar a un acuerdo sobre los procedimientos, pero debemos
encargarnos de que este sistema no sea susceptible de ser usado como arma
polı́tica.
as
It is good reach an agreement on procedures, but we must encargarnos that
this system is not susceptible to be usado as political weapon.
• Since the translations of encargarnos and usado were not learned, they are
either reproduced in the translation, or omitted entirely.
Chris Callison-Burch
Paraphrasing and Translation
July 2, 2007
NI VER
S
H
E
R
G
O F
27
Y
TH
IT
E
U
D I
U
N B
Substituting paraphrases then translating
encargarnos
?
usado
?
It is good reach an agreement on procedures, but we must encargarnos that this
system is not susceptible to be usado as political weapon.
Chris Callison-Burch
Paraphrasing and Translation
July 2, 2007
NI VER
S
H
E
R
G
O F
28
Y
TH
IT
E
U
D I
U
N B
Substituting paraphrases then translating
encargarnos
garantizar
velar
procurar
asegurarnos
usado
utilizado
empleado
uso
utiliza
?
?
It is good reach an agreement on procedures, but we must encargarnos that this
system is not susceptible to be usado as political weapon.
Chris Callison-Burch
Paraphrasing and Translation
July 2, 2007
NI VER
S
H
E
R
G
O F
29
Y
TH
IT
E
U
D I
U
N B
Substituting paraphrases then translating
encargarnos
garantizar
velar
procurar
asegurarnos
usado
utilizado
empleado
uso
utiliza
?
guarantee, ensure, guaranteed, assure, provided
ensure, ensuring, safeguard, making sure
ensure that, try to, ensure, endeavour to
ensure, secure, make certain
?
used, use, spent, utilized
used, spent, employee
use, used, usage
used, uses, used, being used
It is good reach an agreement on procedures, but we must guarantee that this
system is not susceptible to be used as political weapon.
Chris Callison-Burch
Paraphrasing and Translation
July 2, 2007
NI VER
S
H
E
Wait a minute... isn’t this circular?
R
G
O F
30
Y
TH
IT
E
U
D I
U
N B
• How can we improve coverage when paraphrases come from same data?
• Paraphrases can come from other parallel corpora
• Imagine we’re translating from English into Maltese, and have a very small
English-Maltese parallel corpus
• We can increase coverage by generating English paraphrases from
– English-Arabic
– English-French
– English-German
– etc.
Chris Callison-Burch
Paraphrasing and Translation
July 2, 2007
NI VER
S
H
E
Our Data
R
G
O F
31
Y
TH
IT
E
U
D I
U
N B
• Used various sized portions of the Europarl Spanish-English parallel corpus to
train translation models / create phrase tables
• Used all other Spanish parallel corpora to train the paraphrase model
• Specifically, we used these parallel corpora for paraphrasing Spanish:
– Spanish-Danish
– Spanish-Dutch
– Spanish-Finnish
– Spanish-French
– Spanish-German
– Spanish-Italian
– Spanish-Portuguese
– Spanish-Swedish
Chris Callison-Burch
Paraphrasing and Translation
July 2, 2007
NI VER
S
H
E
Experimental Setup
R
G
O F
32
Y
TH
IT
E
U
D I
U
N B
• Generated paraphrases for every Spanish test phrase f1 for which we failed to
learn any translations
• We expanded the phrase table by adding an entry for f1 with the English
translations of each paraphrase f2 for which we had learned translations
• New phrase tables entries were augmented with additional feature function
that incorporated the paraphrase probability
Chris Callison-Burch
Paraphrasing and Translation
July 2, 2007
NI VER
S
H
E
Additional feature function
R
G
O F
33
Y
TH
IT
E
U
D I
U
N B
• Paraphrase probability can be used as additional feature function in the log
linear formulation of SMT

 p(f2|f1) If phrase table entry (e, f1)
is generated from (e, f2)
h(e, f1) =

1
Otherwise
Chris Callison-Burch
Paraphrasing and Translation
July 2, 2007
NI VER
S
H
E
Conditions
R
G
O F
34
Y
TH
IT
E
U
D I
U
N B
• Baseline: Pharaoh decoder, standard feature functions, minimum error rate
training, unknown words output as-is
• Single word paraphrases: Pharaoh, standard FFs plus paraphrase prob FF,
MERT, phrase table expanded by paraphrasing unknown words other than
names, numbers
• Multi-word paraphrases: As above, but with phrase table expanded with
translations of unknown words and phrases
• Various sized training corpora
Chris Callison-Burch
Paraphrasing and Translation
July 2, 2007
NI VER
S
H
E
Evaluation methodology
R
G
O F
35
Y
TH
IT
E
U
D I
U
N B
• Measured translation quality improvements in terms of
– Improvements in Bleu score over baseline
– Increase in coverage of expanded phrase table
– Accuracy of newly translated phrases
• Judged accuracy of newly translated phrases through targeted manual
evaluation
Chris Callison-Burch
Paraphrasing and Translation
July 2, 2007
NI VER
S
H
E
Focused manual evaluation
R
G
O F
36
Y
TH
IT
E
U
D I
U
N B
enumeradas
en
el
mismo.
causas
El
artículo
combate
la
discriminación
y
el
trato
desigual
de
los
ciudadanos
por
las
Alignment Tool
The
article
combats
discrimination
and
inequality
in
the
treatment
of
citizens
for
the
reasons
listed
therein.
Chris Callison-Burch
Paraphrasing and Translation
July 2, 2007
NI VER
S
H
E
Focused manual evaluation
R
G
O F
37
Y
TH
IT
E
U
D I
U
N B
The article combats discrimination and
inequality in the treatment of citizens for
the reasons listed therein.
The article combats discrimination and
the different treatment of citizens for the
reasons mentioned in the same.
The article fights against uneven and the
treatment of citizens for the reasons set
out in the same.
The article is countering discrimination
and the unequal treatment of citizens for
the reasons that in the same.
Chris Callison-Burch
Paraphrasing and Translation
July 2, 2007
NI VER
S
H
E
Results: Bleu Scores
R
G
O F
38
Y
TH
IT
E
U
D I
U
N B
28
27
Bleu Score
26
25
24
23
22
10000
Chris Callison-Burch
Baseline
20000 30000 40000 50000 60000 70000
Training Corpus Size (num sentences)
Paraphrasing and Translation
80000
July 2, 2007
NI VER
S
H
E
Results: Bleu Scores
R
G
O F
39
Y
TH
IT
E
U
D I
U
N B
28
27
Bleu Score
26
25
24
23
22
10000
Chris Callison-Burch
Baseline
Single word paraphrases
20000 30000 40000 50000 60000 70000
Training Corpus Size (num sentences)
Paraphrasing and Translation
80000
July 2, 2007
NI VER
S
H
E
Results: Bleu Scores
R
G
O F
40
Y
TH
IT
E
U
D I
U
N B
28
27
Bleu Score
26
25
24
23
22
10000
Chris Callison-Burch
Baseline
Single word paraphrases
Multi-word paraphrases
20000 30000 40000 50000 60000 70000
Training Corpus Size (num sentences)
Paraphrasing and Translation
80000
July 2, 2007
NI VER
S
H
E
Results: Improvements in coverage
R
G
O F
41
Y
TH
IT
E
U
D I
U
N B
Test set items with translations (%)
90
80
Before paraphrasing
70
After paraphrasing
60
50
48
40
30
25
20
10
10
0
unigrams
Chris Callison-Burch
bigrams
trigrams
Paraphrasing and Translation
3
4-grams
July 2, 2007
NI VER
S
H
E
Results: Improvements in coverage
Test set items with translations (%)
90
D I
U
N B
90
80
Before paraphrasing
70
67
After paraphrasing
60
50
48
40
37
30
25
20
16
10
10
0
unigrams
Chris Callison-Burch
R
G
O F
42
Y
TH
IT
E
U
bigrams
trigrams
Paraphrasing and Translation
3
4-grams
July 2, 2007
NI VER
S
H
E
Accuracy of Newly Translated Phrases (%)
Results: Targeted Manual Evaluation
Chris Callison-Burch
90
R
G
O F
43
Y
TH
IT
E
U
D I
U
N B
Single word paraphrases
80
Multi-word paraphrases
70
53
50
66
65
64
60
67
71
57
48
40
30
20
10
0
10k
20k
40k
80k
Training Corpus Sizes (num sentences)
Paraphrasing and Translation
July 2, 2007
NI VER
S
H
E
Accuracy of Newly Translated Phrases (%)
Results: Targeted Manual Evaluation
Chris Callison-Burch
90
R
G
O F
44
Y
TH
IT
E
U
D I
U
N B
Single word paraphrases
80
Multi-word paraphrases
70
53
50
66
65
64
60
67
71
57
48
40
30
Bleu was insensitive to improvements
more than half of the time
20
10
0
10k
20k
40k
80k
Training Corpus Sizes (num sentences)
Paraphrasing and Translation
July 2, 2007
NI VER
S
H
E
Conclusions
R
G
O F
45
Y
TH
IT
E
U
D I
U
N B
• Paraphrasing is an effective solution to deal with the problem of coverage in
statistical machine translation
• Paraphrasing with bilingual corpora ideally suited to this task because of its
multilinguality, probabilistic formulation, multi-word phrases, and high recall
• For small parallel corpus vocab coverage increases from 48% to 90%, with
more than half of newly covered items translating accurately
Chris Callison-Burch
Paraphrasing and Translation
July 2, 2007
NI VER
S
H
E
Future Work & Broader Vision
R
G
O F
46
Y
TH
IT
E
U
D I
U
N B
• Future Work:
– Improve statistical models of translation through increased generalization
– Generate more sophisticated paraphrases including syntactic transformations
• Broader Vision:
– Unlike data for other statistical NLP tasks, such as parsing, parallel corpora
are created naturally by other human industry
– Harnessing the data through savvy manipulation will improve translation,
and can potentially improve all other language technologies
Chris Callison-Burch
Paraphrasing and Translation
July 2, 2007
NI VER
S
H
E
Thank you!
Chris Callison-Burch
Paraphrasing and Translation
R
G
O F
47
Y
TH
IT
E
U
D I
U
N B
July 2, 2007
NI VER
S
H
E
Selected Publications
R
G
O F
48
Y
TH
IT
E
U
D I
U
N B
Callison-Burch, Koehn and Osborne. “Improved Statistical Machine Translation
Using Paraphrases” in NAACL-2006.
Callison-Burch, Osborne and Koehn. “Re-evaluating the Role of Bleu in Machine
Translation Research” in EACL-2006.
Bannard and Callison-Burch. “Paraphrasing with Bilingual Parallel Corpora” in
ACL-2005.
Callison-Burch, Bannard and Schroeder. “Scaling Phrase-Based Statistical
Machine Translation to Larger Corpora and Longer Phrases” in ACL-2005.
Callison-Burch, Talbot and Osborne. “Statistical Machine Translation with Wordand Sentence-Aligned Parallel Corpora” in ACL-2004.
Chris Callison-Burch
Paraphrasing and Translation
July 2, 2007
NI VER
S
H
E
Structural paraphrases
PRP VBP IN
I believe that
DT NN1 IN DT
NN2
the office of the president
MD
VB
DT
NN
will reformulate the question
la oficina del presidente
va a reformular la pregunta
De hecho
la oficina del presidente
lo ha investigado ya
In fact
IN NN
the president 's office
DT
NN2 POS NN1
Creo que
Chris Callison-Burch
R
G
O F
49
Y
TH
IT
E
U
D I
U
N B
has already investigated this
VBZ RB
VBN
DT
Paraphrasing and Translation
July 2, 2007
NI VER
S
SBAR
Structural Sparaphrases
NP
NP
VP
PRP VBP IN
I believe that
NP
NP
NP
MD
VB
DT
NN
will reformulate the question
la oficina del presidente
va a reformular la pregunta
De hecho
la oficina del presidente
lo ha investigado ya
In fact
IN NN
the president 's office
DT
NN POS NN
PP
has already investigated this
VBZ RB
VBN
DT
NP
NP
NP
VP
NP
S
Chris Callison-Burch
H
VP
DT NN IN DT
NN
the office of the president
Creo que
D I
U
N B
NP
PP
NP
G
E
R
S
O F
50
Y
TH
IT
E
U
ADVP
VP
Paraphrasing and Translation
July 2, 2007
NI VER
S
H
E
Structural paraphrases
R
G
O F
51
Y
TH
IT
E
U
D I
U
N B
NP
NP
NP
PP
NP
NP
DT
the
Chris Callison-Burch
NN
X
POS NN
's Y
NP
DT
the
NN IN DT
Y of the
Paraphrasing and Translation
NP
NN
X
July 2, 2007
NI VER
S
H
E
Structural paraphrases
R
G
O F
52
Y
TH
IT
E
U
D I
U
N B
In addition to lexical paraphrases we can learn structural paraphrases:
JJ PRP MD VB TO VB
JJ
PRP
MD
VB
TO
IN
VBG
VBP
RB
-
→ MD PRP VB IN VBG
→ PRP VBP TO VB IN VBG
→ RB PRP MD VB TO VB
(.09)
(.07)
(.06)
Adjective
Personal pronoun
Modal
Verb, base form
to
Preposition or subordinating conjunction
Verb, gerund or present participle
Verb, non-3rd person singular present
Adverb
Chris Callison-Burch
Paraphrasing and Translation
July 2, 2007
NI VER
S
H
E
Structural paraphrases
R
G
O F
53
Y
TH
IT
E
U
D I
U
N B
• Given JJ PRP MD VB TO VB → MD PRP VB IN VBG we might be
able to transform
• First I would like to thank → May I begin by thanking
• This is quite a complex transformation:
– The infinitival verb becomes a gerund
– The the verb begin takes the place of the adverb first
– The selection of which modal verbs is quite complex
• Promising result, which requires further research
Chris Callison-Burch
Paraphrasing and Translation
July 2, 2007
NI VER
S
H
E
R
G
O F
54
Y
TH
IT
E
U
D I
U
N B
Amount of data used to generate paraphrases
• We did use a tremendous amount of data to create our Spanish paraphrases
• While the scenario of using all other Europarl language pairs may seem
idealized, it is precisely the situation that arises when adding new official
languages to the EU
Chris Callison-Burch
Paraphrasing and Translation
July 2, 2007
NI VER
S
H
E
Usefulness of FF
Spanish-English
Corpus size
10k
20k
Baseline
22.6 25.0
Single word with FF
23.1 25.2
Multi-word with FF
23.3 26.0
Single word without FF
23.0 25.1
Multi-word without FF
20.6 22.6
Chris Callison-Burch
Paraphrasing and Translation
40k
26.5
26.6
27.2
26.7
21.9
R
G
O F
55
Y
TH
IT
E
U
D I
U
N B
80k
26.5
28.0
28.0
28.0
24.0
July 2, 2007