
CSE 190
Fall 2015
Midterm
DO NOT TURN THIS PAGE UNTIL YOU ARE TOLD TO
START!!!!
November 18, 2015
• THE EXAM IS CLOSED BOOK.
• Once the exam has started, SORRY, NO TALKING!!!
• No, you can’t even say “see ya at Porter’s!” (Especially now that UCSD, in their infinite
wisdom, kicked them off campus...what were they thinking???)
• There are 5 problems: Make sure you have all of them - AFTER YOU ARE TOLD TO
START!
• Read each question carefully.
• Remain calm at all times!
Problem   Type                            Points   Score
1         True/False                      15
2         Short Answer                    20
3         Multiple Choice                 10
4         The Delta Rule                  10
5         Forward/Backward Propagation    15
          Total                           70
Problem 1: True/False (15 pts)
(15 pts: +1 for correct, -0.5 for incorrect, 0 for no answer) If you would like to justify an
answer, feel free.
Similar to learning in neural networks with the backpropagation procedure, the perceptron
learning algorithm also ensures that the network output will be nearer to the target at
each iteration.
Following the perceptron learning algorithm, a perceptron is guaranteed to perfectly
learn a given linearly separable data set within a finite number of training steps. (A
sketch of the perceptron learning rule appears at the end of this problem.)
The sigmoid function, $y = g(x) = \frac{1}{1 + e^{-w^T x}}$, may be simply interpreted as the probability
of the input, x, given the output, y.
It is best to have as many hidden units as there are patterns to be learned by a multilayer
neural network.
Robbie Jacobs's adaptive learning rate method resulted in a different learning rate for
every weight in the network.
The backpropagation procedure is a powerful optimization technique that can be applied to hidden activation functions like sigmoid, tanh and binary threshold.
Stochastic gradient descent will typically provide a more accurate estimate of the gradient of a loss function than the full gradient calculated over all examples - that is why
this method is generally preferred.
Overfitting occurs when the model learns the regularities present only in the training
data, or in other words, the model fits the sampling error of the training set.
In backpropagation learning, we should start with a small learning rate and slowly
increase it during the learning process.
People use the Rectified Linear Unit (ReLU) as an activation function in deep networks
because 1) it works; and 2) it makes computing the slope trivial.
While implementing backpropagation, it is a mistake to compute the deltas for a layer,
change the weights, and then propagate the deltas back to the next layer.
Unfortunately, minibatch learning is difficult to parallelize.
In a deep neural network, while the error surface may be very complicated and nonconvex, locally, it may be well-approximated by a quadratic surface.
A convolutional neural network learns features with shared weights (filters) in order to
reduce the number of free parameters.
One of the biggest puzzles in machine learning is who hid the hidden layers, and why.
Wherever they are, they are probably buried deep, very deep. Some suspect Wally did
it.
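For reference, a minimal sketch of the perceptron learning rule referred to in the statements above (a sketch only: it assumes ±1 targets, a fixed learning rate, and a bias term, and the names are illustrative):

```python
import numpy as np

def perceptron_train(X, t, lr=1.0, max_epochs=100):
    """Perceptron learning rule: update only on misclassified examples.

    X: (N, d) array of inputs; t: (N,) array of targets in {-1, +1}.
    If the data are linearly separable, the loop stops after a finite
    number of updates (the perceptron convergence theorem).
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in zip(X, t):
            y = 1 if (w @ x + b) >= 0 else -1   # binary threshold output
            if y != target:                      # no update when correct
                w += lr * target * x
                b += lr * target
                mistakes += 1
        if mistakes == 0:                        # perfectly classified
            break
    return w, b
```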
Problem 2: Short answer (20 pts)
Only a very brief explanation is necessary!
a) (2 pts) Explain why dropout in a neural network acts as a regularizer.
b) (2 pts) Explain why backpropagation of the deltas is a linear operation.
c) (3 pts) Describe two distinct advantages of stochastic gradient descent over the batch
method.
d) (2 pts) Fill in the value for w in this example of gradient descent on E(w). Calculate
the weight for Iteration 2 of gradient descent, where the step size is η = 1.0 and the
momentum coefficient is 0.5. Assume the momentum is initialized to 0.0.

Iteration   w     −∇w E
0           1.0   1.0
1           2.0   0.5
2                 0.25
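One common convention for gradient descent with momentum, consistent with iterations 0 and 1 in the table above (a sketch; the function name is just for illustration):

```python
def momentum_step(w, neg_grad, velocity, lr=1.0, mu=0.5):
    """One step of gradient descent with classical momentum.

    neg_grad is the negative gradient -dE/dw at the current w;
    velocity is initialized to 0.0 before the first step.
    """
    velocity = mu * velocity + lr * neg_grad   # accumulate a moving step
    w = w + velocity                           # apply it
    return w, velocity
```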
e) (2 pts) Explain why we should use weight decay when training a neural network.
f) (3 pts) A graduate student is competing in the ImageNet Challenge with 1000 classes;
however, he is puzzled as to why his network doesn’t work. He has two tanh hidden
units in the final layer before the 1000-way output, but does not think this is a problem,
since he has many units and layers leading up to this point. Explain the error in his
thinking.
g) (4 pts) In the Efficient Backprop paper, preprocessing of the input data is recommended.
Illustrate this process by starting with an elongated, oval-shaped cloud of points tilted
at about 45 degrees, and showing the effect of the mean cancellation step, the PCA step,
and the variance scaling step (so you should end up with 4 pictures from start to
finish).
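A minimal NumPy sketch of those three steps, assuming the cloud is stored as one point per row (the PCA rotation here uses an eigendecomposition of the covariance; this is an illustration, not the only way to do it):

```python
import numpy as np

def preprocess(X):
    """Mean cancellation, PCA rotation, then per-axis variance scaling.

    X: (N, d) data matrix, one point per row.
    """
    X = X - X.mean(axis=0)                   # 1) mean cancellation
    cov = np.cov(X, rowvar=False)            # covariance of the centered cloud
    eigvals, eigvecs = np.linalg.eigh(cov)   # principal directions
    X = X @ eigvecs                          # 2) rotate onto the principal axes
    X = X / np.sqrt(eigvals + 1e-8)          # 3) scale each axis to unit variance
    return X
```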
h) (2 pts) What is wrong with using the logistic sigmoid in the hidden layers of a deep
network? Give at least two reasons why it should be avoided.
Problem 3: Multiple Choice (10 pts, 2 each)
a. Which of the following is the delta rule for the hidden units?
i. $\delta_i = (t_i - y_i)$
ii. $\delta_j = \sum_k w_{jk}\,\delta_k$
iii. $\delta_j = y'(a_j)\sum_k w_{jk}\,\delta_k$
b. In a convolutional neural network, the image is of dimension $\vec{x} = 100 \times 100$ and one of
the learned filters is of dimension $10 \times 10$ with a stride of 5. The resulting feature map
of this filter over the image will have dimension,
i. 21 × 21
ii. 19 × 19
iii. 5 × 5
iv. 20 × 20
v. 100 × 5 × 5
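For reference, the usual size arithmetic for a valid (no-padding) convolution over a square image, written as a small helper (the function name is hypothetical):

```python
def feature_map_size(image_size, filter_size, stride):
    """Width/height of the feature map: floor((N - F) / S) + 1, no padding."""
    return (image_size - filter_size) // stride + 1

# e.g. a 28x28 image, a 5x5 filter, stride 1 -> a 24x24 feature map
assert feature_map_size(28, 5, 1) == 24
```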
c. Assume we have an error function E and modify our cost function C by adding an
L2-weight penalty, or specifically $C = E + \frac{\lambda}{2}\sum_j w_j^2$. The cost function is minimized
with respect to $w_i$ when,
i. $w_i = -\frac{1}{\lambda}\,\frac{\partial E}{\partial w_i}$
ii. $w_i = +\frac{\partial E}{\partial w_i}$
iii. $w_i = \frac{\lambda}{2}\,\frac{\partial E}{\partial w_i}$
iv. $w_i = -\frac{1}{2}\,\frac{\partial E}{\partial w_i}$
v. $w_i = 0$
which describes how our weight magnitude should vary. HINT: recall that C is
minimized when its derivative is 0.
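A worked step consistent with that hint: differentiating the penalty term $\frac{\lambda}{2}\sum_j w_j^2$ with respect to $w_i$ contributes $\lambda w_i$, so

$$\frac{\partial C}{\partial w_i} = \frac{\partial E}{\partial w_i} + \lambda\, w_i,$$

and this is the quantity that must equal zero at the minimum.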
d. The best objective function for classification is
i. Sum Squared Error
ii. Cross-Entropy
iii. Rectified linear unit
iv. Logistic
v. Funny tanh
e. Suppose we have a 3-dimensional input $\vec{x} = (x_1, x_2, x_3)$ connected to 4 neurons with the
exact same weights $\vec{w} = (w_1, w_2, w_3)$, where $x_1 = 2$, $w_1 = 1$, $x_2 = -1$, $w_2 = -0.5$,
$x_3 = 1$, $w_3 = 0$, and the bias $b = 0.5$. We calculate the output of each of the four
neurons using the input $\vec{x}$, weights $\vec{w}$, and bias b.
If $y_1 = 0.95$, $y_2 = 3$, $y_3 = 1$, $y_4 = 3$, then a valid guess for the neuron types of $y_1$, $y_2$, $y_3$
and $y_4$ is:
i Rectified Linear, Logistic Sigmoid, Binary Threshold, Linear
ii Linear, Binary Threshold, Logistic Sigmoid, Rectified Linear
iii Logistic Sigmoid, Linear, Binary Threshold, Rectified Linear
iv Rectified Linear, Linear, Binary Threshold, Logistic Sigmoid
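For reference, a minimal sketch of the four unit types named in the options, all applied to the same pre-activation $a = \vec{w}\cdot\vec{x} + b$ (the dictionary keys are just labels for this illustration):

```python
import numpy as np

def unit_outputs(x, w, b):
    """Compute a = w.x + b once and pass it through four unit types."""
    a = np.dot(w, x) + b
    return {
        "linear": a,
        "rectified_linear": max(0.0, a),
        "binary_threshold": 1.0 if a >= 0 else 0.0,
        "logistic_sigmoid": 1.0 / (1.0 + np.exp(-a)),
    }
```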
Problem 4: The delta rule (10 pts)
Derive the delta rule for the case of a single-layer network with a linear output and the sum
squared error loss function. To make this as simple as possible, assume we are doing this for
one input-output pattern p (then we can simply add these up over all of the patterns). So,
starting with:

$$SSE^p = \frac{1}{2}\,(t^p - y^p)^2 \qquad (1)$$

and

$$y^p = \sum_{j=1}^{d} w_j x_j \qquad (2)$$

derive that:

$$-\frac{\partial SSE^p}{\partial w_i} = (t^p - y^p)\, x_i \qquad (3)$$
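As a sanity check of equation (3) (a sketch only, not the requested derivation), the analytic expression can be compared against a central finite-difference estimate on a randomly drawn pattern; all of the numbers below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
w = rng.normal(size=d)   # weights
x = rng.normal(size=d)   # one input pattern
t = rng.normal()         # its target

def sse(w):
    """Equation (1) with the linear output of equation (2)."""
    y = w @ x
    return 0.5 * (t - y) ** 2

# Analytic form from equation (3): -dSSE/dw_i = (t - y) * x_i
y = w @ x
analytic = (t - y) * x

# Central finite-difference estimate of -dSSE/dw_i
eps = 1e-6
numeric = np.array([
    -(sse(w + eps * np.eye(d)[i]) - sse(w - eps * np.eye(d)[i])) / (2 * eps)
    for i in range(d)
])

assert np.allclose(analytic, numeric, atol=1e-5)
```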
Problem 5: Forward/Backward Propagation. (15 pts)
Consider the simple neural network in Figure 1 with the corresponding initial weights and
biases in Figure 2. Weights are indicated as numbers along connections and biases are
indicated as numbers within a node. All units use the Sigmoid Activation function
$g(a) = f(a) = \frac{1}{1 + e^{-a}}$ and the cost function is the Cross-Entropy Loss.
On the following page, fill in the three panels.
1. (4 pts) In the first panel, record the ai ’s into each of the nodes.
2. (3 pts) In the second panel, record zi = g(ai ) for each of the nodes. You may use the
table of approximate Sigmoid Activation values on the next page.
3. (5 pts) In the third panel, compute the δ for each node. Do this for the training example
X = (1.0, −1.0) with target t = 0.85.
Update the weights.
(3 pts) Given the δ’s you computed, use gradient descent to calculate the new weight from
hidden unit 1 (H1) to the Output (OUT) (currently 1.0). Use gradient descent with no
momentum and learning rate η = 1.0.
Figure 1. (the network)
Figure 2. (the initial weights and biases)
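Since the diagrams in Figures 1 and 2 are not reproduced here, the sketch below walks through the requested quantities on a generic 2-2-1 network with made-up weights and biases (none of these values come from Figure 2): the pre-activations $a_i$, the activations $z_i = g(a_i)$, the deltas for a sigmoid output with cross-entropy loss, and the gradient-descent update of the H1-to-OUT weight.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Placeholder parameters for a 2-2-1 network (NOT the values from Figure 2).
W1 = np.array([[0.5, -0.5],    # weights into H1
               [0.3,  0.8]])   # weights into H2
b1 = np.array([0.1, -0.1])     # hidden biases
W2 = np.array([1.0, 0.6])      # weights from H1, H2 to OUT
b2 = 0.2                       # output bias

x = np.array([1.0, -1.0])      # training example X
t = 0.85                       # target
eta = 1.0                      # learning rate, no momentum

# Panels 1 and 2: forward pass, pre-activations a and activations z = g(a)
a_hidden = W1 @ x + b1
z_hidden = sigmoid(a_hidden)
a_out = W2 @ z_hidden + b2
z_out = sigmoid(a_out)

# Panel 3: deltas. With cross-entropy loss and a sigmoid output unit,
# the output delta simplifies to (t - y).
delta_out = t - z_out
# Hidden deltas: g'(a_j) * w_j,out * delta_out, with g'(a) = z(1 - z).
delta_hidden = z_hidden * (1.0 - z_hidden) * W2 * delta_out

# Weight update: new H1-to-OUT weight via plain gradient descent.
W2[0] = W2[0] + eta * delta_out * z_hidden[0]
```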