
Honor Code of the School of Engineering
"All students taking courses in the School of Engineering agree, individually and collectively, that they will not give or receive unpermitted aid in examinations or other course work that is to be used by the instructor as the basis of grading."
- From the Graduate/Undergraduate Bulletin
I have read, understood, and agree to abide by the Honor Code of the School of Engineering.
Name: ___________________________
ID: ___________________________
Signature: ___________________________
Date: ___________________________

1. [10 points]
2. [30 points]
3. [30 points]
4. [30 points]
5. [20 points]
6. [30 points]
7. [20 points]
8. [30 points]
Total Score:

Midterm Examination
COEN 296 Natural Language Processing
Department of Computer Engineering
Santa Clara University
Fall Quarter 2016
Dr. Ming-Hwa Wang
Phone: (408) 526-4844
Email address: [email protected]
Course website: http://www.cse.scu.edu/~mwang2/nlp/
Office Hours: Tuesday & Thursday 9:00-9:30pm
1. [10 points] Please give the full names for the following acronyms: OOV, MEMM, TTS, MLE, EM.
2. [30 points] True or false (yes or no, 1 or 0) questions, with wrong-answer penalties:
a) To represent a regular expression, we can use either an NFA or a
DFA. The NFA needs 2^N states if its corresponding DFA has N states.
b) The higher the conditional probability of the word sequence, the
higher the perplexity.
c) The more information the N-gram gives us about the word sequence,
the lower the perplexity.
d) A rule-based approach handles the most common or frequent natural
language usages, but it is difficult for it to cover rare corner cases.
e) Text is the dominant mode of human social bonding and information
exchange.
f) Speech recognition is a more difficult task than cross-language
translation.
3. [30 points] Write regular expressions for the following languages. You
may use either Perl/Python notation or the minimal “algebraic” notation
of Section 2.3, but make sure to say which one you are using. By
“word”, we mean an alphabetic string separated from other words by
whitespace, any relevant punctuation, line breaks, and so forth.
a) the set of all alphabetic strings
b) the set of all lower case alphabetic strings ending in a b
c) the set of all strings with two consecutive repeated words (e.g.,
“Humbert Humbert” and “the the” but not “the bug” or “the big
bug”)
d) the set of all strings from the alphabet a, b such that each a is
immediately preceded by and immediately followed by a b
e) all strings that start at the beginning of the line with an integer and
that end at the end of the line with a word
f) all strings that have both the word grotto and the word raven in
them (but not, e.g., words like grottos that merely contain the word
grotto)
g) write a pattern that places the first word of an English sentence in a
register. Deal with punctuation.
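
(For reference: patterns like these can be prototyped with Python's re module. The sketch below illustrates part (c) only; the pattern shown is one possible answer, not the required notation.)

import re

# Illustrative check for part (c): two consecutive repeated words.
# \b marks word boundaries and \1 is a back-reference to the captured word;
# this is one possible pattern, not the only acceptable answer.
repeated = re.compile(r"\b([A-Za-z]+)\s+\1\b")

for text in ["Humbert Humbert", "the the", "the bug", "the big bug"]:
    print(text, "->", bool(repeated.search(text)))
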
4. [30 points] Consider the following probabilistic context-free grammar
(PCFG):
S → S S        2/3
S → “I do.”    1/3
a) What strings will this grammar generate?
b) What are the probabilities of generating the three shortest sentences
from this grammar?
c) What is the total probability of generating all possible strings from this
grammar?
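
(Reference sketch, not an answer key: a string made of k copies of “I do.” has Catalan(k-1) distinct derivation trees, each using S → S S exactly k-1 times and S → “I do.” exactly k times, so its probability can be tabulated directly, and partial sums over k hint at the quantity asked for in part (c).)

from math import comb

def catalan(n):
    # number of binary derivation trees with n + 1 leaves
    return comb(2 * n, n) // (n + 1)

def p_k_copies(k, p_branch=2/3, p_leaf=1/3):
    # probability that the grammar yields exactly k copies of "I do."
    return catalan(k - 1) * p_branch ** (k - 1) * p_leaf ** k

for k in (1, 2, 3):
    print(k, "copies:", p_k_copies(k))

# partial sum over many k suggests the limit asked about in part (c)
print(sum(p_k_copies(k) for k in range(1, 200)))
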
5.
[20 points] Suppose you have ‘unfair’ soft drink machine, it can be in
two states, cola preferring (CP) and iced tea preferring (TP), but it
switches between them randomly before each purchase with
probabilities below: (CP → CP 0.7, CP → TP 0.3, TP → CP 0.5, and TP →
TP 0.5). When you put in your coin, the machine puts out a soft drink
depending on the current state and the following probabilities:
cola iced tea lemonade
CP 0.6
0.1
0.3
TP 0.1
0.7
0.2
What is the probability of seeing the output sequence {lemonade, iced
tea} if the machine always starts off in the CP state?
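
(Reference sketch only. One reading of the problem is that the machine emits a drink from its current state and then switches state before the next purchase; if the switch instead happens before the very first purchase, the transition step would be applied once before the loop. The tables below are the transition and output probabilities from the problem.)

# Forward-style computation over the two hidden states.
trans = {"CP": {"CP": 0.7, "TP": 0.3},
         "TP": {"CP": 0.5, "TP": 0.5}}
emit = {"CP": {"cola": 0.6, "iced tea": 0.1, "lemonade": 0.3},
        "TP": {"cola": 0.1, "iced tea": 0.7, "lemonade": 0.2}}

def sequence_prob(outputs, start="CP"):
    alpha = {s: (1.0 if s == start else 0.0) for s in trans}
    for t, o in enumerate(outputs):
        alpha = {s: alpha[s] * emit[s][o] for s in alpha}      # emit from current state
        if t < len(outputs) - 1:                               # then switch before next purchase
            alpha = {s2: sum(alpha[s1] * trans[s1][s2] for s1 in trans) for s2 in trans}
    return sum(alpha.values())

print(sequence_prob(["lemonade", "iced tea"]))
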
6. [30 points] A simple HMM for an ice cream task has two hidden states
(hot and cold) corresponding to the weather, and the observations
(drawn from the alphabet O = {1, 2, 3}) correspond to the number of
ice creams eaten by Jason on a given day. Let the starting state be q0,
the hot state q1, and the cold state q2, with a01 = 0.8, a02 = 0.2, a11
= 0.7, a12 = 0.3, a21 = 0.4, a22 = 0.6, b1(1) = 0.2, b1(2) = 0.4, b1(3) =
0.4, b2(1) = 0.5, b2(2) = 0.4, b2(3) = 0.1. Please implement the
Forward algorithm and run it with this HMM for the ice cream task to
compute the probabilities of the observation sequences 331122313 and
331123312. Which is more likely?
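
(A minimal Forward-algorithm sketch under the parameters above, assuming q0 is a dedicated start state that is never returned to and that no special end state is required, so P(O) is the sum of the final forward values.)

# Transition probabilities a[i][j] and emission probabilities b[j][obs]
# are taken from the problem statement (hot = state 1, cold = state 2).
a = {0: {1: 0.8, 2: 0.2},
     1: {1: 0.7, 2: 0.3},
     2: {1: 0.4, 2: 0.6}}
b = {1: {1: 0.2, 2: 0.4, 3: 0.4},
     2: {1: 0.5, 2: 0.4, 3: 0.1}}

def forward(obs):
    # initialization from the start state q0
    alpha = {j: a[0][j] * b[j][obs[0]] for j in (1, 2)}
    # recursion over the remaining observations
    for o in obs[1:]:
        alpha = {j: sum(alpha[i] * a[i][j] for i in (1, 2)) * b[j][o] for j in (1, 2)}
    # termination: sum over final states
    return sum(alpha.values())

for seq in ("331122313", "331123312"):
    print(seq, forward([int(c) for c in seq]))
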
7. [20 points] Write a program to compute unsmoothed unigrams and
bigrams.
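
(A minimal sketch of what such a program might look like: unsmoothed maximum-likelihood unigram and bigram counts over a whitespace-tokenized toy corpus; the corpus string and tokenization are placeholders.)

from collections import Counter

def ngram_counts(tokens):
    # raw counts; no smoothing of any kind
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(w1, w2, unigrams, bigrams):
    # P(w2 | w1) = count(w1 w2) / count(w1)
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

tokens = "<s> I do . I do . </s>".split()
unigrams, bigrams = ngram_counts(tokens)
print(unigrams)
print(bigram_prob("I", "do", unigrams, bigrams))
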
8. [30 points] Computing minimum edit distances by hand, figure out
whether drive is closer to brief or to divers and what the edit distance is.
You may use any version of distance (e.g., using 1-insertion, 1-deletion,
2-substitution costs) that you like.
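
(A standard dynamic-programming sketch of minimum edit distance with the 1-insertion, 1-deletion, 2-substitution cost version mentioned above, usable to check a hand computation.)

def min_edit_distance(source, target, ins=1, dele=1, sub=2):
    # d[i][j] = cost of editing source[:i] into target[:j]
    n, m = len(source), len(target)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + dele
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub_cost = 0 if source[i - 1] == target[j - 1] else sub
            d[i][j] = min(d[i - 1][j] + dele,          # deletion
                          d[i][j - 1] + ins,           # insertion
                          d[i - 1][j - 1] + sub_cost)  # substitution / match
    return d[n][m]

for other in ("brief", "divers"):
    print("drive ->", other, min_edit_distance("drive", other))
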