Boston University
CS 506 Spring 2021 - HW2: Classification and Dimensionality Reduction

Total: 32 points

Package limitations: none.

1 Least Squares and Logistic Regression

In this section we compare two linear models for classification, in particular when outliers are present in the data. For the context of this problem, and for an explanation of Figure 1, see Section 4.1.3 (page 184) of the book Pattern Recognition and Machine Learning.

[Figure 1: Comparison of least squares and logistic regression for classification of two classes, with and without outliers.]

a) [1 pt.] Generate labeled random 2D points like those shown in the left subfigure of Figure 1. The red crosses and blue circles are points of different classes, so you may need a third column storing the label of each 2D point. Call this data "data without outlier". Then, on top of this data, add a few outliers to the blue circles, as in the right subfigure of Figure 1, and save the result as "data with outlier". You may use code or simply pick some points by hand; your data need not match the plots exactly.

b) [4 pts.] Use both the least squares method and the logistic regression method to classify "data without outlier" and "data with outlier".

c) [2 pts.] Plot the classification results in two figures side by side, as in Figure 1. Did you obtain results similar to Figure 1? Explain briefly why logistic regression is not sensitive to outliers.

2 Logistic Regression and kNN Classification

The goal of this problem is to perform classification on the famous MNIST dataset. We have already preprocessed a sample of this dataset (30% of the original), which you can find here: Download from Google Drive, in the format of NumPy arrays. The file mnist_data.npy contains an array of the data: each row corresponds to a 28 × 28 digit image vectorized into 28 × 28 = 784 features. The file mnist_labels.npy contains the corresponding labels of the images.

a) [0 pts.] Randomly split the dataset, using 20% of the samples as your test set and the remaining 80% as the train set that you will use to fit your models.

b) [2 pts.] Classify the images using logistic regression. Keep in mind that the dataset contains more than 2 labels, so this is a multinomial classification problem. What are your train accuracy and test accuracy?

c) [3 pts.] Now classify the dataset using a k-nearest-neighbor (kNN) classifier. Plot the train and test accuracy as you vary k from 1 to 25 with a step size of 2.

d) [1 pt.] Explain your results.

e) [5 pts.] Now use kNN to explore how a different train-set size affects your results. Plot the accuracy of your model when using only 3,000 of the images in the train set, then repeat the experiment with 6,000, 9,000, and so on, until you are using the full train set. Use whatever value of k worked best in part (c). You will do something similar in Problem 3(d), so it makes sense to run both experiments at the same time.
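As a concrete illustration of Problem 1, the sketch below generates the two datasets of part (a) and compares the two classifiers of part (b) on them. It assumes NumPy and scikit-learn are available; the cluster centers, outlier location, and point counts are arbitrary illustrative choices, and the plotting step of part (c) is omitted.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
n = 40

# Two well-separated Gaussian clusters: red crosses (class 1) and
# blue circles (class 0). Centers and scales are illustrative choices.
red = rng.normal(loc=[-1.0, 1.0], scale=0.5, size=(n, 2))
blue = rng.normal(loc=[1.0, -1.0], scale=0.5, size=(n, 2))
X = np.vstack([red, blue])
y = np.array([1] * n + [0] * n)  # the "third column": class labels

# "Data with outlier": extra blue points far out on the blue side,
# as in the right subfigure of Figure 1.
outliers = rng.normal(loc=[7.0, -6.0], scale=0.4, size=(6, 2))
X_out = np.vstack([X, outliers])
y_out = np.concatenate([y, np.zeros(6, dtype=int)])

def ls_accuracy(Xd, yd):
    # Least squares as a classifier: regress on +/-1 targets and
    # threshold the prediction at 0.
    model = LinearRegression().fit(Xd, 2 * yd - 1)
    pred = (model.predict(Xd) > 0).astype(int)
    return (pred == yd).mean()

def lr_accuracy(Xd, yd):
    model = LogisticRegression(max_iter=1000).fit(Xd, yd)
    return (model.predict(Xd) == yd).mean()

for name, Xd, yd in [("without outliers", X, y), ("with outliers", X_out, y_out)]:
    print(name, "| least squares:", ls_accuracy(Xd, yd),
          "| logistic:", lr_accuracy(Xd, yd))
```

On data like this, the outliers pull the least-squares boundary toward them, because correctly classified points far from the boundary still incur a large squared error; the logistic loss instead saturates for points far on the correct side, so its boundary barely moves.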
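For Problem 2, the sketch below walks through the split / logistic regression / kNN pipeline. It assumes scikit-learn, and its small bundled digits dataset (8 × 8 images) stands in for the provided MNIST sample so the sketch is self-contained; for the actual assignment, load the arrays with np.load("mnist_data.npy") and np.load("mnist_labels.npy") instead, and plot the accuracies that are only printed here.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data; the assignment uses the provided mnist_*.npy files.
X, y = load_digits(return_X_y=True)

# (a) Random 80/20 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# (b) Multinomial logistic regression (scikit-learn handles >2 labels).
logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("logistic train acc:", logreg.score(X_train, y_train))
print("logistic test acc: ", logreg.score(X_test, y_test))

# (c) kNN for k = 1, 3, ..., 25; record train/test accuracy vs. k.
ks = range(1, 26, 2)
train_acc, test_acc = [], []
for k in ks:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_acc.append(knn.score(X_train, y_train))
    test_acc.append(knn.score(X_test, y_test))
best_k = list(ks)[int(np.argmax(test_acc))]
print("best k on the test set:", best_k)

# (e) Re-fit kNN with the best k on growing slices of the train set.
# With the real MNIST sample the sizes would be 3000, 6000, ...; the
# digits stand-in is much smaller, so smaller steps are used here.
sizes = list(range(300, len(X_train), 300)) + [len(X_train)]
for m in sizes:
    knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_train[:m], y_train[:m])
    print(f"train size {m:4d} -> test acc {knn.score(X_test, y_test):.3f}")
```

Taking the first m rows of an already-shuffled train set is one simple way to get the growing subsets; resampling each subset randomly (and averaging over repeats) would give a smoother curve.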