Homework #2: Solution notes
Problem 1
Question 6:
This sentence pair can be biparsed. E.g.:
< < [ the sky ] was > blue >
Question 7:
The sentence pair cannot be biparsed: we have two sentences of length 4, with
a word alignment in an inside-outside configuration. This is one of the two
configurations that ITGs cannot represent.
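One way to double-check this claim (not part of the required solution): the
small Python sketch below tests whether a 1-to-1 alignment, written as a
permutation of target positions, can be decomposed with binary straight or
inverted ITG rules. The function name itg_parseable is an illustrative
choice. A source span is an ITG constituent only if the target positions it
covers are contiguous and it can be split into two such constituents.

from functools import lru_cache

def itg_parseable(perm):
    """Check whether a 1-to-1 alignment, given as a permutation mapping
    each source position to its target position, can be biparsed with
    binary ITG rules (straight or inverted)."""
    n = len(perm)

    def is_contiguous(i, j):
        # The target positions covered by source span [i, j] must form a
        # contiguous range for the span to be an ITG constituent.
        targets = perm[i:j + 1]
        return max(targets) - min(targets) == j - i

    @lru_cache(maxsize=None)
    def parseable(i, j):
        if i == j:
            return True
        if not is_contiguous(i, j):
            return False
        # Try every binary split; each half must itself be parseable.
        return any(parseable(i, k) and parseable(k + 1, j)
                   for k in range(i, j))

    return parseable(0, n - 1)

# Monotone alignment (as in Questions 6 and 8): parseable.
print(itg_parseable((0, 1, 2, 3)))  # True
# Inside-outside alignment from Question 7: not parseable.
print(itg_parseable((1, 3, 0, 2)))  # False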
Question 8:
This sentence pair can be biparsed. E.g.:
< [ the cat ] < ate [ the mouse ] > >
Question 9:
Biparsing the sentence pair in this question requires lexical rules that let
us pair words that are not identical. We can either add a general rule
(A -> w1, w2) or a rule that specifically handles the substitution needed for
this example (A -> today, yesterday).
Problem 2
Questions 10-16:
The named entity in the example is “New York”.
Building a named entity recognizer with annotated data can be framed as a
supervised sequence labeling task. As we have seen in readings, linear chain
Conditional Random Fields (CRFs) are a popular model to address this task.
They can be trained using gradient descent algorithms such as L-BFGS. CRFs
let us define rich representations of the input sequences (words in the
English sentence) and of the previous output sequence (BIO tags of previous
words). We can therefore define many useful features on the current word and
its neighbors (identity, part-of-speech tag, capitalization, alphanumeric
patterns, etc.), as well as features based on the previous tags in the
sequence (e.g., the tag of the previous word).
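As an illustration, here is a minimal sketch of such a feature-based tagger.
It assumes the third-party sklearn-crfsuite package (any linear-chain CRF
toolkit with L-BFGS training would do), and the feature set and the helper
names word2features / train_crf are illustrative choices, not required ones.

import sklearn_crfsuite  # assumed toolkit; any linear-chain CRF with L-BFGS works

def word2features(sent, i):
    """Features for the i-th token of a sentence given as (word, POS) pairs."""
    word, pos = sent[i]
    feats = {
        'bias': 1.0,
        'word.lower': word.lower(),
        'word.istitle': word.istitle(),
        'word.isupper': word.isupper(),
        'word.isdigit': word.isdigit(),
        'pos': pos,
        'suffix3': word[-3:],
    }
    if i > 0:
        prev_word, prev_pos = sent[i - 1]
        feats.update({'-1:word.lower': prev_word.lower(), '-1:pos': prev_pos})
    else:
        feats['BOS'] = True
    if i < len(sent) - 1:
        next_word, next_pos = sent[i + 1]
        feats.update({'+1:word.lower': next_word.lower(), '+1:pos': next_pos})
    else:
        feats['EOS'] = True
    return feats

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def train_crf(train_sents):
    """train_sents: list of sentences, each a list of ((word, POS), BIO-tag) pairs."""
    X = [sent2features([tok for tok, _ in sent]) for sent in train_sents]
    y = [[tag for _, tag in sent] for sent in train_sents]
    crf = sklearn_crfsuite.CRF(algorithm='lbfgs', c1=0.1, c2=0.1,
                               max_iterations=100)
    crf.fit(X, y)
    return crf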
Question 17:
The goal of this question is to carefully think through the steps required to
perform annotation projection. We have seen examples of annotation projection
for POS tagging and semantic role labeling in class. The named-entity
recognition task in this problem is most similar to POS tagging, since it is
framed as a sequence labeling task.
Key steps:
1. Automatically annotate English side of parallel corpus
Input: English side of parallel corpus
Output: English side of parallel corpus where each word is tagged with
a B, I or O tag
NER tagger: Conditional Random Field tagger, defined and trained as
specified above.
2. Word align English-Spanish parallel corpus
Input: sentence-aligned English-Spanish corpus
Output: word alignment links mapping English and Spanish word positions
Word aligner: One option is to use EM-trained IBM models (e.g., IBM
Model 2 or HMM-based models) to map English positions to Spanish
positions (an asymmetric 1-to-n mapping from English to Spanish).
Another option is to learn alignment models in both directions
(possibly using the Liang et al. approach to encourage agreement) and
intersect their predictions to get higher-confidence alignments at the
expense of coverage (the intersection step appears in the sketch after
this list).
3. Project BIO tags from English side to Spanish side of corpus
Input: parallel corpus annotated with (1) word alignment links, and (2)
BIO tags on English side
Output: Spanish side of parallel corpus annotated with BIO tags
Projection algorithm: we could assign to each Spanish word the tag of
the English word that it is aligned to. But this is not sufficient to
fully specify the projection procedure:
(1) depending on the word aligner used, we might need heuristics to
select a tag if the Spanish word is aligned to more than one English
word
(2) the words that form a named entity might not occur in the same
order in English and in Spanish. We therefore need additional
processing for multi-word named entities. We can scan the output of
the projection and fix the B and I tags to make sure that Spanish
named entities always start with a B and are represented by contiguous
B and I tags (a sketch of this projection and repair step is given
after this list).
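A minimal sketch of steps 2-3, assuming alignment links are given as sets of
(English index, Spanish index) pairs and plain B/I/O tags. The leftmost-tag
heuristic and the repair rule are one possible instantiation of the
heuristics described above, and the function names are illustrative.

def intersect_alignments(e2s, s2e):
    """Step 2 (intersection option): keep only links predicted in both
    directions. Both inputs are sets of (english_index, spanish_index)
    pairs, oriented consistently."""
    return e2s & s2e

def repair_bio(tags):
    """Minimal repair: an I that does not follow a B or I becomes a B,
    so every projected Spanish entity starts with a B tag."""
    repaired = []
    prev = 'O'
    for tag in tags:
        if tag == 'I' and prev == 'O':
            tag = 'B'
        repaired.append(tag)
        prev = tag
    return repaired

def project_bio(english_tags, spanish_len, links):
    """Step 3: project BIO tags from English to Spanish along alignment
    links, given English tags, the Spanish sentence length, and a set of
    (english_index, spanish_index) links."""
    spanish_tags = ['O'] * spanish_len
    for s in range(spanish_len):
        aligned = sorted(e for e, s2 in links if s2 == s)
        if not aligned:
            continue  # unaligned Spanish words stay O
        # Heuristic: if aligned to several English words, take the tag of
        # the leftmost one (a majority vote would be another option).
        spanish_tags[s] = english_tags[aligned[0]]
    return repair_bio(spanish_tags)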
Question 18:
Here we want to obtain a type-level dictionary.
Input: We can use the Spanish side of the parallel corpus, tagged with BIO
tags using the procedure defined in Question 17.
Output: a dictionary that maps each Spanish word type to one or more BIO
tags (possibly with frequencies).
Approach: Aggregate counts for observed (Spanish word type, BIO tag) pairs.
The pairs and their counts could directly be used as a noisy dictionary. The
dictionary could be improved by using the counts to filter out noise (e.g.,
keep only the 2 most frequent tags per word, or keep only tags that have been
observed more than 10 times with a given Spanish word).
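A minimal sketch of this aggregation and filtering, assuming the tagged
Spanish sentences are given as lists of (word, tag) pairs; the function name
build_tag_dictionary and the default thresholds are illustrative, following
the examples above.

from collections import Counter, defaultdict

def build_tag_dictionary(tagged_sents, top_k=2, min_count=10):
    """Build a type-level dictionary from Spanish sentences tagged with
    BIO tags (the output of Question 17)."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word.lower()][tag] += 1

    dictionary = {}
    for word, tag_counts in counts.items():
        # Keep only the top_k most frequent tags that were observed at
        # least min_count times with this word type (noise filtering).
        kept = [(tag, c) for tag, c in tag_counts.most_common(top_k)
                if c >= min_count]
        if kept:
            dictionary[word] = dict(kept)
    return dictionary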