Introduction to Machine Learning (CSCI-UA.473): Final Exam Solutions

Instructor: Sumit Chopra

December 21st, 2021

1 Probability and Basics (20 Points)

1. Let X and Y be discrete random variables. You are given their joint distribution p(X, Y), and let D denote the training set and θ denote the parameters of your model. Answer the following questions:

(a) [2 Points] Write the expression of the marginal distribution over X, provided p(X, Y).

[Sol:]

p(X) = \sum_{y} p(X, y)    (1)

(b) [2 Points] Write the conditional distribution of X given Y, provided p(X, Y), p(X), and p(Y).

[Sol:]

p(X \mid Y) = \frac{p(X, Y)}{p(Y)}    (2)

(c) [2 Points] Write the posterior distribution of Y given p(X | Y), p(Y), and p(X).

[Sol:]

p(Y \mid X) = \frac{p(X, Y)}{p(X)} = \frac{p(X \mid Y)\, p(Y)}{p(X)}    (3)

(d) [2 Points] Write the expression of the posterior distribution of the parameters θ, given the prior p(θ) and the likelihood of the data.

[Sol:]

p(\theta \mid D) = \frac{p(D, \theta)}{p(D)} = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}    (4)
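As a quick numerical illustration of equations (1)-(3), the sketch below builds a small joint distribution table and computes the marginal, conditional, and posterior with NumPy. The 2x2 table values and variable names are made up for illustration; they are not part of the exam.

```python
# Minimal sketch of equations (1)-(3) on a toy joint table.
# The 2x2 joint distribution below is a made-up example, not exam data.
import numpy as np

p_xy = np.array([[0.10, 0.30],   # rows index X = 0, 1
                 [0.20, 0.40]])  # columns index Y = 0, 1

p_x = p_xy.sum(axis=1)                            # (1) p(X) = sum_y p(X, y)
p_y = p_xy.sum(axis=0)                            # marginal over Y
p_x_given_y = p_xy / p_y                          # (2) p(X | Y) = p(X, Y) / p(Y)
p_y_given_x = p_x_given_y * p_y / p_x[:, None]    # (3) Bayes rule: p(X | Y) p(Y) / p(X)

# The posterior from Bayes' rule matches the direct definition p(X, Y) / p(X).
assert np.allclose(p_y_given_x, p_xy / p_x[:, None])
```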
2. [4 Points] Show that the Tanh function (tanh) and the Logistic Sigmoid function (σ) are related by tanh(a) = 2σ(2a) − 1.

[Sol:]

2\sigma(2a) - 1 = \frac{2}{1 + e^{-2a}} - 1 = \frac{2}{1 + e^{-2a}} - \frac{1 + e^{-2a}}{1 + e^{-2a}} = \frac{1 - e^{-2a}}{1 + e^{-2a}} = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}} = \tanh(a)

3. [8 Points] Show that a general linear combination of Logistic Sigmoid functions of the form

y(x, w) = w_0 + \sum_{j=1}^{M} w_j \, \sigma\left(\frac{x - \mu_j}{s}\right)

is equivalent to a linear combination of tanh functions of the form

y(x, u) = u_0 + \sum_{j=1}^{M} u_j \, \tanh\left(\frac{x - \mu_j}{2s}\right)

and find the expression that relates the new parameters {u_1, ..., u_M} to the original parameters {w_1, ..., w_M}.

[Sol:] If we take a_j = (x - \mu_j)/(2s), we can rewrite the first equation as

y(x, w) = w_0 + \sum_{j=1}^{M} w_j \, \sigma(2a_j) = w_0 + \sum_{j=1}^{M} \frac{w_j}{2} \left( 2\sigma(2a_j) - 1 + 1 \right) = u_0 + \sum_{j=1}^{M} u_j \tanh(a_j),

where u_j = w_j / 2 for j = 1, ..., M and u_0 = w_0 + \sum_{j=1}^{M} w_j / 2.
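The identity and the reparameterization above can be checked numerically. The sketch below does so with randomly chosen (hypothetical) values of w, μ, and s; it is only an illustration of the algebra, not part of the solution.

```python
# Sketch with made-up toy values: checks tanh(a) = 2*sigmoid(2a) - 1 and the
# sigmoid-to-tanh reparameterization u_j = w_j/2, u_0 = w_0 + sum_j w_j/2.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
a = rng.normal(size=1000)
assert np.allclose(np.tanh(a), 2 * sigmoid(2 * a) - 1)

M, s = 5, 0.7                       # hypothetical basis count and scale
w0, w = rng.normal(), rng.normal(size=M)
mu = rng.normal(size=M)
x = rng.normal(size=200)[:, None]   # one row per input point

y_w = w0 + (w * sigmoid((x - mu) / s)).sum(axis=1)        # sigmoid form
u0, u = w0 + w.sum() / 2, w / 2                           # derived parameters
y_u = u0 + (u * np.tanh((x - mu) / (2 * s))).sum(axis=1)  # tanh form
assert np.allclose(y_w, y_u)
```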
2 Parametric Models (20 Points)

1. Let D = {(x_1, y_1), ..., (x_N, y_N)} be the training set, where each training sample (x_i, y_i) is independently and identically distributed. The model is given by \hat{y}_i = w^T x_i + \epsilon_i, where \epsilon_i \sim \mathcal{N}(0, \sigma^2). Answer the following questions:

(a) [4 Points] Let Y = [y_1, ..., y_N] be a vector of all the labels and X = [x_1, ..., x_N] be a matrix of all the inputs. Write the expression for the conditional likelihood p(Y | X).

[Sol:] Y \mid X \sim \mathcal{N}(w^T X, \sigma^2 I), so

p(Y \mid X) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - w^T x_i)^2}{2\sigma^2} \right)

(b) [8 Points] Assume that the prior distribution of the parameters θ is Gaussian: \theta \sim \mathcal{N}(0, \beta^2 I). Show that computing the MAP estimate of the parameters is equivalent to minimizing a loss function composed of the mean squared error and an L2 regularizer.

[Sol:]

\theta_{MAP} = \arg\max_{\theta} \left[ p(Y \mid X, \theta) \cdot p(\theta) \right]

We have already shown the expression for p(Y | X, θ) in the previous part, and p(θ) is given as the Gaussian \mathcal{N}(0, \beta^2 I):

\theta_{MAP} = \arg\max_{\theta} \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - w^T x_i)^2}{2\sigma^2} \right) \cdot \frac{1}{\sqrt{2\pi\beta^2}} \exp\left( -\frac{\theta^T \theta}{2\beta^2} \right)

We convert the argmax to an argmin by taking the negative log, and drop the additive constants to simplify:

\theta_{MAP} = \arg\min_{\theta} \sum_{i=1}^{N} \frac{(w^T x_i - y_i)^2}{2\sigma^2} + \frac{\theta^T \theta}{2\beta^2}

Let \lambda = \sigma^2 / \beta^2; multiplying through by \sigma^2 (which does not change the minimizer), we have

\theta_{MAP} = \arg\min_{\theta} \frac{1}{2} \sum_{i=1}^{N} (w^T x_i - y_i)^2 + \frac{\lambda}{2} \theta^T \theta,

which is the sum-of-squares error plus an L2 regularizer.
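The equivalence can also be seen computationally: under the assumption λ = σ²/β², the MAP estimate is the ordinary ridge-regression solution. The sketch below uses made-up synthetic data and dimensions, solves the regularized least-squares problem in closed form, and sanity-checks that it minimizes the objective above.

```python
# Sketch (synthetic data, assumed shapes): the MAP estimate under a Gaussian likelihood
# and a N(0, beta^2 I) prior is ridge regression with lambda = sigma^2 / beta^2.
# The parameters theta of the exam are written as w here, since the model is y = w^T x.
import numpy as np

rng = np.random.default_rng(1)
N, d = 100, 3
sigma, beta = 0.5, 2.0
lam = sigma**2 / beta**2

X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
y = X @ w_true + rng.normal(scale=sigma, size=N)

# Closed-form minimizer of (1/2)*sum_i (w^T x_i - y_i)^2 + (lam/2)*w^T w:
w_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def loss(w):
    return 0.5 * np.sum((X @ w - y) ** 2) + 0.5 * lam * w @ w

# The closed-form solution should not be beaten by small random perturbations of itself.
assert all(loss(w_map) <= loss(w_map + 1e-3 * rng.normal(size=d)) for _ in range(10))
```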
2. [8 Points] In the following data set, A, B, C are input binary random variables, and y is a binary output whose value we want to predict:

A  B  C  y
0  0  1  0
0  1  0  0
1  1  0  0
0  0  1  1
1  1  1  1
1  0  0  1
1  1  0  1

How will the Naive Bayes Classifier predict y given the input A = 0, B = 0, C = 1?

[Sol:] From the table above we have:

P(y = 0) = 3/7, P(y = 1) = 4/7
P(A = 0 | y = 0) = 2/3, P(B = 0 | y = 0) = 1/3, P(C = 1 | y = 0) = 1/3
P(A = 0 | y = 1) = 1/4, P(B = 0 | y = 1) = 1/2, P(C = 1 | y = 1) = 1/2

The predicted y maximizes P(A = 0 | y) P(B = 0 | y) P(C = 1 | y) P(y).

For y = 0 we have: P(A = 0 | y = 0) P(B = 0 | y = 0) P(C = 1 | y = 0) P(y = 0) = (2/3)(1/3)(1/3)(3/7) ≈ 0.0317.

For y = 1 we have: P(A = 0 | y = 1) P(B = 0 | y = 1) P(C = 1 | y = 1) P(y = 1) = (1/4)(1/2)(1/2)(4/7) ≈ 0.0357.

Thus the Naive Bayes Classifier will predict y = 1.
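The calculation above is easy to reproduce programmatically. The sketch below estimates the prior and the class-conditional probabilities by counting (no smoothing) and scores both classes for the query A = 0, B = 0, C = 1; the array layout and variable names are just one possible encoding of the table.

```python
# Sketch: reproduces the Naive Bayes calculation with plain count-based estimates
# (no smoothing), using the 7-row table from the question.
import numpy as np

# columns: A, B, C, y
data = np.array([[0, 0, 1, 0],
                 [0, 1, 0, 0],
                 [1, 1, 0, 0],
                 [0, 0, 1, 1],
                 [1, 1, 1, 1],
                 [1, 0, 0, 1],
                 [1, 1, 0, 1]])
query = {"A": 0, "B": 0, "C": 1}
features = {"A": 0, "B": 1, "C": 2}

scores = {}
for y in (0, 1):
    rows = data[data[:, 3] == y]
    score = len(rows) / len(data)                      # prior P(y)
    for name, col in features.items():
        score *= np.mean(rows[:, col] == query[name])  # P(feature = value | y)
    scores[y] = score

print(scores)                       # {0: ~0.0317, 1: ~0.0357}
print(max(scores, key=scores.get))  # predicts y = 1
```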
3 Support Vector Machines (20 Points)

1. [10 Points] Kernel functions implicitly define a mapping function φ(·) that transforms an input instance x ∈ ℝ^d to a high-dimensional feature space Q by giving the form of a dot product in Q: K(x_i, x_j) = φ(x_i) · φ(x_j). Assume we use a kernel function of the form K(x_i, x_j) = \exp(-\frac{1}{2} \|x_i - x_j\|^2). Thus we assume that there is some implicit unknown function φ(x) such that

\phi(x_i) \cdot \phi(x_j) = K(x_i, x_j) = \exp\left( -\tfrac{1}{2} \|x_i - x_j\|^2 \right).

Prove that for any two input instances x_i and x_j, the squared Euclidean distance of their corresponding points in the feature space Q is less than 2. That is, prove that \|\phi(x_i) - \phi(x_j)\|^2 < 2.

[Sol:]

\|\phi(x_i) - \phi(x_j)\|^2 = (\phi(x_i) - \phi(x_j)) \cdot (\phi(x_i) - \phi(x_j))
= \phi(x_i) \cdot \phi(x_i) + \phi(x_j) \cdot \phi(x_j) - 2\, \phi(x_i) \cdot \phi(x_j)
= 2 - 2 \exp\left( -\tfrac{1}{2} \|x_i - x_j\|^2 \right)
< 2,

using \phi(x) \cdot \phi(x) = K(x, x) = \exp(0) = 1 and the fact that \exp(-\tfrac{1}{2} \|x_i - x_j\|^2) > 0.
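The bound can be verified numerically: for any pair of points, the feature-space squared distance equals 2 − 2K(x_i, x_j), which lies in [0, 2). The sketch below checks this on random vectors; the dimensionality and sample count are arbitrary choices for illustration.

```python
# Sketch (random vectors are illustrative only): with K(x_i, x_j) = exp(-0.5*||x_i - x_j||^2),
# the implied squared distance ||phi(x_i) - phi(x_j)||^2 = K(x_i,x_i) + K(x_j,x_j) - 2K(x_i,x_j)
# equals 2 - 2*K(x_i, x_j), which is always strictly less than 2.
import numpy as np

def rbf(a, b):
    return np.exp(-0.5 * np.sum((a - b) ** 2))

rng = np.random.default_rng(2)
for _ in range(1000):
    xi, xj = rng.normal(size=4), rng.normal(size=4)
    sq_dist = rbf(xi, xi) + rbf(xj, xj) - 2 * rbf(xi, xj)  # = 2 - 2*K(xi, xj)
    assert 0 <= sq_dist < 2
```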
2. [4 Points] Let M_SVM be an SVM model which you have trained using some training set. Consider a new point that is correctly classified and distant from the decision boundary. Why would the SVM's decision boundary be unaffected by this point, while the one learnt by logistic regression would be affected?

[Sol:] The hinge loss used by the SVM assigns zero loss to this distant, correctly classified point, so it has no effect on the learning of the decision boundary. In contrast, the negative log-likelihood loss optimized by logistic regression assigns a non-zero contribution to this point, so the decision boundary learnt by logistic regression will be affected.

3. Answer the following True/False questions and explain your answer in no more than two lines.

(a) [3 Points] If we switch from a linear kernel to a higher-order polynomial kernel, the support vectors will not change.

[Sol:] False. There is no guarantee that the support vectors would remain the same, because the transformed feature vectors implicitly generated by the polynomial kernel are non-linear functions of the original input vectors, so the support points for maximum-margin separation in the feature space can be quite different.

(b) [3 Points] The maximum-margin decision boundary that SVMs learn provides the best generalization error among all linear classifiers.

[Sol:] False. The maximum-margin separating hyperplane is often a reasonable choice among all possible separating hyperplanes; however, there is no guarantee that it will provide the best generalization error among all possible hyperplanes.