Homework #2: Solution notes

Problem 1

Question 6: This sentence pair can be biparsed, e.g.:
< < [ the sky ] was > blue >

Question 7: The sentence pair cannot be biparsed: we have two sentences of length 4 with a word alignment in an inside-out configuration. This is one of the two configurations that ITGs cannot represent.

Question 8: This sentence pair can be biparsed, e.g.:
< [ the cat ] < ate [ the mouse ] > >

Question 9: Biparsing the sentence pair in this question requires lexical rules that let us create pairs of words that are not identical. We can either add a general rule (A -> w1, w2) or a rule that specifically handles the substitution needed for this example (A -> today, yesterday).

Problem 2

Questions 10-16: The named entity in the example is "New York". Building a named entity recognizer from annotated data can be framed as a supervised sequence labeling task. As we have seen in the readings, linear-chain Conditional Random Fields (CRFs) are a popular model for this task. They can be trained using gradient-based algorithms such as L-BFGS. CRFs let us define rich representations of the input sequence (words in the English sentence) and of the previous output tags (BIO tags of preceding words). We can therefore define many useful features on the current word and its neighbors (identity, part-of-speech tag, capitalization, alphanumerical patterns, etc.) as well as features based on previous tags in the sequence (e.g., the tag of the previous word).

Question 17: The goal of this question is to carefully think through the steps required to perform annotation projection. We have seen examples of annotation projection for POS tagging and semantic role labeling in class. The named-entity recognition task in this problem is most similar to POS tagging, since it is framed as a sequence labeling task.

Key steps:

1.
Automatically annotate the English side of the parallel corpus.
Input: English side of the parallel corpus.
Output: English side of the parallel corpus where each word is tagged with a B, I, or O tag.
NER tagger: Conditional Random Field tagger, defined and trained as specified above.

2.
Word-align the English-Spanish parallel corpus.
Input: sentence-aligned English-Spanish corpus.
Output: word alignment links mapping English and Spanish word positions.
Word aligner: One option is to use EM-trained IBM models (e.g., IBM Model 2 or HMM-based models) to map English positions to Spanish positions (an asymmetric 1-to-n mapping from English to Spanish). Another option is to learn alignment models in both directions (possibly using the Liang et al. approach to encourage agreement) and intersect their predictions to get higher-confidence alignments at the expense of coverage.

3.
Project BIO tags from the English side to the Spanish side of the corpus.
Input: parallel corpus annotated with (1) word alignment links and (2) BIO tags on the English side.
Output: Spanish side of the parallel corpus annotated with BIO tags.
Projection algorithm: we could assign to each Spanish word the tag of the English word it is aligned to. But this alone does not fully specify the projection procedure: (1) depending on the word aligner used, we may need heuristics to select a tag when a Spanish word is aligned to more than one English word; (2) the words that form a named entity might not occur in the same order in English and in Spanish, so we need additional processing for named entities spanning more than one word. We can scan the output of the projection and fix B and I tags to make sure that Spanish named entities always start with a B and consist of contiguous B and I tags.

Question 18: Here we want to obtain a type-level dictionary.
Input: the Spanish side of the parallel corpus, tagged with BIO tags using the procedure defined in Question 17.
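As a concrete illustration, the projection-and-repair procedure from Question 17 could be sketched as follows. This is a minimal sketch: the tie-breaking heuristic for multiply-aligned words and the exact form of the B/I repair pass are assumptions, since the solution leaves them open.

```python
from collections import defaultdict

def project_bio(en_tags, alignment, es_len):
    """Project BIO tags from English to Spanish through word alignment links.

    en_tags   -- BIO tag per English position, e.g. ["B", "I", "O"]
    alignment -- iterable of (english_pos, spanish_pos) links
    es_len    -- number of Spanish words
    """
    # Collect the English tags aligned to each Spanish position.
    aligned = defaultdict(list)
    for en_i, es_j in alignment:
        aligned[es_j].append(en_tags[en_i])

    es_tags = ["O"] * es_len  # unaligned Spanish words default to O
    for es_j, tags in aligned.items():
        # Heuristic (an assumption): when a Spanish word has several links,
        # prefer an entity tag (B or I) over O.
        entity = [t for t in tags if t != "O"]
        es_tags[es_j] = entity[0] if entity else "O"

    # Repair pass: every entity must start with a B and be contiguous,
    # so promote any I that does not follow a B or I to a B.
    prev_in_entity = False
    for j, tag in enumerate(es_tags):
        if tag == "I" and not prev_in_entity:
            es_tags[j] = "B"
        prev_in_entity = es_tags[j] != "O"
    return es_tags

# A reordering alignment: English positions 0,1,2 map to Spanish 1,2,0.
print(project_bio(["B", "I", "O"], [(0, 1), (1, 2), (2, 0)], 3))
# -> ['O', 'B', 'I']
```

The repair pass implements the final check described in step 3: after projection, any entity-internal tag that lacks a preceding B is promoted so that Spanish entities are well-formed BIO spans.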
Output: a dictionary that maps each Spanish word type to one or more BIO tags (possibly with frequencies).
Approach: aggregate counts for observed (Spanish word type, BIO tag) pairs. The pairs and their counts could be used directly as a noisy dictionary. The dictionary could be improved by using the counts to filter out noise (e.g., keep only the two most frequent tags per word, or keep only tags that have been observed more than 10 times with a given Spanish word).
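The aggregate-and-filter approach for Question 18 can be sketched as follows; the two thresholds mirror the example filters given above, and the exact filtering policy is otherwise an open design choice.

```python
from collections import Counter, defaultdict

def build_tag_dictionary(tagged_corpus, top_k=2, min_count=10):
    """Build a type-level dictionary mapping Spanish word types to BIO tags.

    tagged_corpus -- iterable of (spanish_word, bio_tag) token pairs,
                     i.e. the projected output from Question 17
    top_k         -- keep at most this many tags per word type
    min_count     -- drop (word, tag) pairs seen fewer than this many times
    """
    # Aggregate counts for observed (word type, tag) pairs.
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1

    # Filter out noise: most frequent tags only, above a count threshold.
    dictionary = {}
    for word, tag_counts in counts.items():
        kept = [t for t, c in tag_counts.most_common(top_k) if c >= min_count]
        if kept:
            dictionary[word] = kept
    return dictionary

corpus = [("Nueva", "B")] * 12 + [("Nueva", "O")] * 3 + [("York", "I")] * 15
print(build_tag_dictionary(corpus))
# -> {'Nueva': ['B'], 'York': ['I']}
```

Note that the spurious ("Nueva", "O") pairs are filtered out by the count threshold, which is exactly the kind of projection noise the filtering step is meant to remove.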