1. [50 points] Consider a linear model of the form
\[
y(x; w) = w_0 + \sum_{i=1}^{D} w_i x_i
\]
together with a sum-of-squares error function of the form
\[
\mathrm{error}_D(w) = \frac{1}{2} \sum_{t=1}^{N} \{ y(x^t; w) - r^t \}^2 .
\]
Now suppose that Gaussian noise $\epsilon_i$ with zero mean and variance $\sigma^2$ is added independently to each of the input variables $x_i$. Show that minimizing $\mathrm{error}_D$ averaged over the noise distribution is equivalent to minimizing the sum-of-squares error for noise-free input variables with the addition of a weight-decay regularization term (you should clearly identify the regularization term).

2. [50 points] Consider a 3-layer perceptron (one hidden layer) for a classification task with $K$ classes. Suppose the activation function for the output units is the softmax function
\[
y_i^t = \frac{\exp(\mathrm{net}_i)}{\sum_{j=1}^{K} \exp(\mathrm{net}_j)},
\]
where $\mathrm{net}_i = \sum_{h=1}^{H} w_{hi} z_h^t + w_{0i}$ is the input to the $i$-th output unit ($1 \le i \le K$). The activation function for the hidden-layer units is the sigmoid function
\[
z_h^t = \frac{1}{1 + \exp\!\left( -\left( \sum_{i=1}^{D} w_{ih} x_i^t + w_{0h} \right) \right)} .
\]
Derive the learning rule if the error function is the sum-of-squares error
\[
\mathrm{Err}(X \mid w) = \frac{1}{2} \sum_{t} \sum_{i} (r_i^t - y_i^t)^2 .
\]
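The following is a minimal sketch of the expansion behind Problem 1, assuming the noise enters additively as $x_i^t \to x_i^t + \epsilon_i$ and writing $E_\epsilon[\cdot]$ for the average over the noise distribution (a notation not used in the problem statement); it indicates the expected structure rather than a complete solution.
\[
y(x^t + \epsilon; w) = w_0 + \sum_{i=1}^{D} w_i (x_i^t + \epsilon_i) = y(x^t; w) + \sum_{i=1}^{D} w_i \epsilon_i ,
\]
so, using $E[\epsilon_i] = 0$ and $E[\epsilon_i \epsilon_j] = \sigma^2 \delta_{ij}$,
\[
E_\epsilon\big[ \{ y(x^t + \epsilon; w) - r^t \}^2 \big]
= \{ y(x^t; w) - r^t \}^2 + \sigma^2 \sum_{i=1}^{D} w_i^2 .
\]
Summing over the $N$ training examples,
\[
E_\epsilon[\mathrm{error}_D(w)] = \mathrm{error}_D(w) + \frac{N \sigma^2}{2} \sum_{i=1}^{D} w_i^2 ,
\]
i.e. the noise average adds a weight-decay penalty on $\sum_i w_i^2$ (the bias $w_0$ is not penalized).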
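The following is a sketch of the chain-rule structure asked for in Problem 2, assuming batch gradient descent with learning rate $\eta$ and using $\delta_{ki}$ for the Kronecker delta (symbols not defined in the problem statement). Differentiating the softmax gives $\partial y_k^t / \partial \mathrm{net}_i = y_k^t (\delta_{ki} - y_i^t)$, so the per-example error signal at output unit $i$ is
\[
e_i^t \equiv \sum_{k=1}^{K} (r_k^t - y_k^t) \, y_k^t (\delta_{ki} - y_i^t),
\]
and the gradient-descent updates take the form
\[
\Delta w_{hi} = \eta \sum_t e_i^t z_h^t, \qquad
\Delta w_{0i} = \eta \sum_t e_i^t,
\]
\[
\Delta w_{ih} = \eta \sum_t \Big( \sum_{j=1}^{K} e_j^t w_{hj} \Big) z_h^t (1 - z_h^t) \, x_i^t, \qquad
\Delta w_{0h} = \eta \sum_t \Big( \sum_{j=1}^{K} e_j^t w_{hj} \Big) z_h^t (1 - z_h^t),
\]
where the factor $z_h^t (1 - z_h^t)$ comes from differentiating the sigmoid of the hidden layer.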