CSE 417T (Machine Learning): Exam 2 Practice Questions

1. For each statement below, select whether the statement is true or false and provide a one-to-two sentence justification.

   (a) True / False: In a bagged ensemble of 100 decision trees trained on a training set with 500 data points, it must be the case that every data point is used in the construction of at least one of the decision trees.

   (b) True / False: The depth of a decision tree learned using ID3 can be larger than the number of training data points used to learn the tree.

   (c) True / False: Neural networks with linear activation functions are essentially linear models.

2. Given the following dataset in 1-d space, which consists of 4 positive data points {0, 1, 2, 3} and 3 negative data points {−3, −2, −1}:

   Suppose that we want to learn a soft-margin linear SVM for this dataset. Consider the choice of the slack penalty term (i.e., C); a sketch for experimenting with different values of C appears after Question 10. Answer the following two questions:

   (a) If C → ∞, how many support vectors do we have for the final hypothesis? Explain why.

   (b) If C = 0, how many support vectors do we have for the final hypothesis? Explain why.

3. Think about weak learners in bagging and boosting. The full justification should involve the property of weak learners, the property of bagging/boosting, and their connections.

   (a) If we use decision stumps as the weak learner in bagging, generally speaking, do you think the test error would increase, decrease, or stay roughly the same compared with random forest (which uses fully grown decision trees as weak learners)? Why?

   (b) If we replace the weak learner in AdaBoost with ID3 (fully grown trees without pruning leaves), generally speaking, do you think the test error would increase, decrease, or stay roughly the same compared with the original AdaBoost? Why?

4. Consider the following neural network with one hidden layer.
   (a) Assume we use the linear function θ(s) = s as the activation function for both the hidden layer and the output layer. Draw a neural network with no hidden layers (but with the same input and output layers) that is equivalent to the given one. You need to clearly specify the network structure and the weights. (A numerical sketch of this collapse appears after Question 9.)

   (b) Assume we use the sigmoid activation function θ(s) = (e^s − e^{−s}) / (e^s + e^{−s}). Can we still construct a neural network with no hidden layers that is equivalent to the given network? If yes, draw it. If no, please provide a brief justification.

5. Suppose you learn a hard-margin SVM using a training data set with N data points. You observe that there are M support vectors. Explain why the following bound on the leave-one-out cross-validation error must be true: E_LOOCV ≤ M/N.

6. Which of the following techniques help prevent neural networks from overfitting?

   - Early stopping
   - Dropout
   - Data augmentation (adding noise to existing data points)
   - All of the above

7. Compare the time efficiency of nearest neighbor and logistic regression. Which of the following is true?

   - Nearest neighbor is less efficient in training but more efficient in testing
   - Nearest neighbor is more efficient in training but less efficient in testing
   - Nearest neighbor is more efficient in both training and testing
   - Nearest neighbor is less efficient in both training and testing

8. What tends to be true when we increase k in k-Nearest Neighbor?

   - The decision boundary tends to be more complex
   - The bias tends to increase
   - The training time tends to increase
   - All of the above
   - None of the above

9. What is the hypothesis space of single-hidden-layer neural networks that use only linear activation functions?

   - Constant functions
   - Linear models
   - Continuous functions
   - Arbitrary functions
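The following is a minimal sketch for experimenting with Questions 4(a) and 9. Since the network in Question 4 is given as a figure, the sketch assumes a hypothetical shape (3 inputs, 4 hidden units, 1 output) and random weights, and assumes NumPy is available; both layers use the identity activation θ(s) = s.

```python
# Minimal sketch for Questions 4(a) and 9 (hypothetical shape: 3 inputs,
# 4 hidden units, 1 output; random weights). Both layers use theta(s) = s.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # input -> hidden
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)   # hidden -> output

def with_hidden_layer(x):
    # Identity activation on both layers.
    return W2 @ (W1 @ x + b1) + b2

# Collapse the two linear maps into an equivalent network with no hidden layer.
W_eq, b_eq = W2 @ W1, W2 @ b1 + b2

def without_hidden_layer(x):
    return W_eq @ x + b_eq

x = rng.normal(size=3)
print(with_hidden_layer(x), without_hidden_layer(x))  # identical up to rounding
```

Whatever weights are chosen, W2(W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2), which is exactly a network with no hidden layer.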
10. Consider the following dataset (with 2-dimensional feature vectors), with each point in the form (x1, x2, y): {(0, 0, −1), (0, 1, −1), (1, 0, −1), (0, 3, +1), (1, 2, +1), (6, 5, +1)}. Assume that we run hard-margin SVM on it (without transformations) and obtain a hypothesis of the form g(x1, x2) = sign(w1 x1 + w2 x2 + b). Also assume that {(0, 1, −1), (1, 0, −1), (0, 3, +1), (1, 2, +1)} are the support vectors. What are the values of w1, w2, and b obtained by the hard-margin SVM model?

   - w1 = 2, w2 = 2, b = −4
   - w1 = 1, w2 = 0, b = −2
   - w1 = 0, w2 = 1, b = −2
   - None of the above
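To check an answer to Question 10 empirically, here is a minimal sketch assuming scikit-learn is available; SVC has no exact hard-margin mode, so a very large C is used to approximate one.

```python
# Sketch: fit a (nearly) hard-margin linear SVM on the Question 10 dataset.
# Assumes scikit-learn; the huge C approximates the hard-margin constraint.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [1, 0], [0, 3], [1, 2], [6, 5]])
y = np.array([-1, -1, -1, +1, +1, +1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("w1, w2 =", clf.coef_[0])       # weight vector of the separating hyperplane
print("b      =", clf.intercept_[0])  # bias term
print("support vectors:")
print(clf.support_vectors_)
```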
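The sketch referenced in Question 2 follows the same pattern on the 1-d dataset; SVC requires C > 0, so a tiny C stands in for C = 0 and a huge C for C → ∞.

```python
# Sketch: count support vectors on the Question 2 dataset for extreme C values.
# Assumes scikit-learn; C = 0 is not allowed by SVC, so 1e-6 approximates it.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0], [1], [2], [3], [-3], [-2], [-1]])  # 1-d inputs
y = np.array([+1, +1, +1, +1, -1, -1, -1])

for C in (1e-6, 1e6):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C = {C:g}: {len(clf.support_)} support vectors at x =",
          clf.support_vectors_.ravel())
```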