
1. [50 points] Consider a linear model of the form
\[
y(\mathbf{x}; \mathbf{w}) = w_0 + \sum_{i=1}^{D} w_i x_i
\]
together with a sum-of-squares error function of the form
\[
\mathrm{error}_D(\mathbf{w}) = \frac{1}{2} \sum_{t=1}^{N} \left\{ y(\mathbf{x}^t; \mathbf{w}) - r^t \right\}^2
\]
Now suppose that Gaussian noise $\epsilon_i$ with zero mean and variance $\sigma^2$ is added independently to each of the input variables $x_i$. Show that minimizing $\mathrm{error}_D$ averaged over the noise distribution is equivalent to minimizing the sum-of-squares error for noise-free input variables with the addition of a weight-decay regularization term (you should clearly identify the regularization term).
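
A minimal Monte Carlo sanity check (not a substitute for the derivation) can make the expected result concrete: averaging the error over input noise should add a term proportional to $\sigma^2 \sum_i w_i^2$. The data sizes, seed, and variable names below are illustrative assumptions, not part of the problem.

```python
import numpy as np

# Sanity check for Problem 1: compare the noise-averaged error against the
# noise-free error plus the weight-decay term (N * sigma^2 / 2) * sum_i w_i^2.
# All quantities below (toy data, seed, weights) are illustrative assumptions.

rng = np.random.default_rng(0)
N, D, sigma = 200, 5, 0.3
X = rng.normal(size=(N, D))            # noise-free inputs x^t
w0, w = 0.5, rng.normal(size=D)        # fixed weights w_0, w_i
r = rng.normal(size=N)                 # arbitrary targets r^t

def sse(Xin):
    y = w0 + Xin @ w                   # y(x; w) = w0 + sum_i w_i x_i
    return 0.5 * np.sum((y - r) ** 2)

# Average the error over many independent draws of Gaussian input noise.
M = 20000
avg = np.mean([sse(X + sigma * rng.normal(size=X.shape)) for _ in range(M)])

predicted = sse(X) + 0.5 * N * sigma**2 * np.sum(w**2)
print(avg, predicted)   # the two values should agree up to Monte Carlo error
```
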
2. [50 points] Consider a 3-layer perceptron (1 hidden layer) for the classification task with K classes.
Suppose the activation function for the output unit is the softmax function:
\[
y_i^t = \frac{\exp[\mathrm{net}_i]}{\sum_{j=1}^{K} \exp[\mathrm{net}_j]}
\]
where $\mathrm{net}_i = \sum_{h=1}^{H} w_{hi} z_h^t + w_{0i}$ is the input to the $i$-th output unit ($1 \le i \le K$). The activation function
for the hidden layer unit is the sigmoid function:
\[
z_h^t = \frac{1}{1 + \exp\!\left(-\left(\sum_{i=1}^{D} w_{ih} x_i^t + w_{0h}\right)\right)}
\]
Derive the learning rule if the error function is the sum-of-squares error, i.e.,
\[
\mathrm{Err}(X \mid \mathbf{w}) = \frac{1}{2} \sum_{t} \sum_{i} \left( r_i^t - y_i^t \right)^2
\]