Honor Code of the School of Engineering

"All students taking courses in the School of Engineering agree, individually and collectively, that they will not give or receive unpermitted aid in examinations or other course work that is to be used by the instructor as the basis of grading."
-From the Graduate/Undergraduate Bulletin

I have read, understood, and agree to abide by the Honor Code of the School of Engineering.

Name: ___________________________ ID: ___________________________
Signature: ___________________________ Date: ______________________________

1. [10 points]
2. [30 points]
3. [30 points]
4. [30 points]
5. [20 points]
6. [30 points]
7. [20 points]
8. [30 points]
Total Score:

Midterm Examination
COEN 296 Natural Language Processing
Department of Computer Engineering
Santa Clara University
Fall Quarter 2016

Dr. Ming-Hwa Wang
Phone: (408) 526-4844
Email address: [email protected]
Course website: http://www.cse.scu.edu/~mwang2/nlp/
Office Hours: Tuesday & Thursday 9:00-9:30pm

1. [10 points] Please give the full names for the following acronyms: OOV, MEMM, TTS, MLE, EM.

2. [30 points] True or false (yes or no, 1 or 0) problems with wrong-answer penalties:
   a) To represent a regular expression, we can use either an NFA or a DFA. The NFA needs 2N states if its corresponding DFA has N states.
   b) The higher the conditional probability of the word sequence, the higher the perplexity.
   c) The more information the N-gram gives us about the word sequence, the lower the perplexity.
   d) A rule-based approach handles the most common or frequent natural language usages, but has difficulty covering rare corner cases.
   e) Text is the dominant mode of human social bonding and information exchange.
   f) Speech recognition is a more difficult task than cross-language translation.

3. [30 points] Write regular expressions for the following languages. You may use either Perl/Python notation or the minimal “algebraic” notation of Section 2.3, but make sure to say which one you are using. By “word”, we mean an alphabetic string separated from other words by whitespace, any relevant punctuation, line breaks, and so forth.
   a) the set of all alphabetic strings
   b) the set of all lower case alphabetic strings ending in a b
   c) the set of all strings with two consecutive repeated words (e.g., “Humbert Humbert” and “the the” but not “the bug” or “the big bug”)
   d) the set of all strings from the alphabet a, b such that each a is immediately preceded by and immediately followed by a b
   e) all strings that start at the beginning of the line with an integer and that end at the end of the line with a word
   f) all strings that have both the word grotto and the word raven in them (but not, e.g., words like grottos that merely contain the word grotto)
   g) a pattern that places the first word of an English sentence in a register; deal with punctuation

4. [30 points] Consider the following probabilistic context free grammar (PCFG):
      S → S S        2/3
      S → “I do.”    1/3
   a) What strings will this grammar generate?
   b) What are the probabilities of generating the three shortest sentences of this grammar?
   c) What is the total probability of generating all possible strings from this grammar?

5. [20 points] Suppose you have an ‘unfair’ soft drink machine. It can be in two states, cola preferring (CP) and iced tea preferring (TP), but it switches between them randomly before each purchase with the probabilities below:
      CP → CP 0.7, CP → TP 0.3, TP → CP 0.5, TP → TP 0.5
   When you put in your coin, the machine puts out a soft drink depending on the current state and the following probabilities:
            cola    iced tea    lemonade
      CP    0.6     0.1         0.3
      TP    0.1     0.7         0.2
   What is the probability of seeing the output sequence {lemonade, iced tea} if the machine always starts off in the CP state?

6. [30 points] Consider a simple HMM for an ice cream task with two hidden states (hot and cold) corresponding to the weather, and observations (drawn from the alphabet O = {1, 2, 3}) corresponding to the number of ice creams eaten by Jason on a given day. Let the starting state be q0, the hot state be q1, and the cold state be q2.
      a01 = 0.8, a02 = 0.2
      a11 = 0.7, a12 = 0.3
      a21 = 0.4, a22 = 0.6
      b1(1) = 0.2, b1(2) = 0.4, b1(3) = 0.4
      b2(1) = 0.5, b2(2) = 0.4, b2(3) = 0.1
   Please implement the Forward algorithm and run it with the HMM for the ice cream task to compute the probability of the observation sequences 331122313 and 331123312. Which is more likely?

7. [20 points] Write a program to compute unsmoothed unigrams and bigrams.

8. [30 points] Computing minimum edit distances by hand, figure out whether drive is closer to brief or to divers, and what each edit distance is. You may use any version of distance (e.g., using 1-insertion, 1-deletion, 2-substitution costs) that you like.
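For question 6, the Forward recursion might be sketched roughly as follows. This is only one possible reading of the task, not a model answer: it assumes there is no explicit end state (the final probability is the sum of the last forward values), and the names START, TRANS, EMIT, forward, and parse_digits are illustrative choices rather than anything given in the exam.

```python
# Parameters copied from question 6; index 0 = hot (q1), index 1 = cold (q2).
START = [0.8, 0.2]                     # a01, a02
TRANS = [[0.7, 0.3],                   # a11, a12
         [0.4, 0.6]]                   # a21, a22
EMIT = [{1: 0.2, 2: 0.4, 3: 0.4},      # b1(o)
        {1: 0.5, 2: 0.4, 3: 0.1}]      # b2(o)
STATES = range(2)


def forward(observations):
    """Return P(observations), summed over all hidden state paths."""
    if not observations:
        return 1.0
    # Initialization: leave the start state q0 and emit the first observation.
    alpha = [START[s] * EMIT[s][observations[0]] for s in STATES]
    # Recursion: fold in one observation at a time.
    for obs in observations[1:]:
        alpha = [
            sum(alpha[prev] * TRANS[prev][s] for prev in STATES) * EMIT[s][obs]
            for s in STATES
        ]
    # Termination (assumption: no end state, so just sum the final alphas).
    return sum(alpha)


def parse_digits(text):
    """Turn a digit string such as '331122313' into a list of ints."""
    return [int(ch) for ch in text]


if __name__ == "__main__":
    for seq in ("331122313", "331123312"):
        print(seq, forward(parse_digits(seq)))
```

Running the script prints one probability per sequence, so the two can be compared directly. For longer sequences a log-space or scaled variant would be safer against underflow, but nine observations are well within floating-point range.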
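For question 7, a minimal sketch of unsmoothed (maximum-likelihood) unigram and bigram estimates is given below. It assumes the input is already tokenized into a list of strings, and it normalizes each bigram by how often its first word occurs as a history (that is, followed by another token), which is one common convention. The function name ngram_probabilities is hypothetical.

```python
from collections import Counter


def ngram_probabilities(tokens):
    """Return (unigrams, bigrams) as unsmoothed relative-frequency estimates."""
    total = len(tokens)
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    # Count each word only where it is followed by another token, so that
    # the conditional probabilities P(w2 | w1) sum to 1 for every w1.
    history_counts = Counter(tokens[:-1])

    unigrams = {word: count / total for word, count in unigram_counts.items()}
    bigrams = {
        (w1, w2): count / history_counts[w1]
        for (w1, w2), count in bigram_counts.items()
    }
    return unigrams, bigrams


if __name__ == "__main__":
    uni, bi = ngram_probabilities("the cat saw the dog".split())
    print(uni)
    print(bi)
```

If sentence-boundary markers are wanted, they can be added to the token list before calling the function; the exam does not say whether they are expected.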
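Question 8 asks for a hand computation, but a small dynamic-programming sketch can be used to check the table afterwards. The default costs follow the example in the question (insertion 1, deletion 1, substitution 2); the function name min_edit_distance and its parameter names are illustrative assumptions.

```python
def min_edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=2):
    """Return the minimum edit distance from source to target."""
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] = cost of turning source[:i] into target[:j].
    dist = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = dist[i - 1][0] + del_cost
    for j in range(1, cols):
        dist[0][j] = dist[0][j - 1] + ins_cost
    for i in range(1, rows):
        for j in range(1, cols):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + del_cost,                       # delete
                dist[i][j - 1] + ins_cost,                       # insert
                dist[i - 1][j - 1] + (0 if same else sub_cost),  # copy or substitute
            )
    return dist[-1][-1]


if __name__ == "__main__":
    for other in ("brief", "divers"):
        print("drive ->", other, min_edit_distance("drive", other))
```

Passing sub_cost=1 gives the plain Levenshtein variant instead, which the question also permits.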