Assignment 3: Learning Naïve Bayes and Neural Networks
CS486/686 – Winter 2024
Out: March 7, 2024
Due: March 22, 2024 at 11:59pm Waterloo Time

Submit your assignment via LEARN (CS486 site) in the Assignment 3 Dropbox folder. No late assignments will be accepted.

PART A [45pts]: NAÏVE BAYES LEARNING

In Assignment 2, you learned a decision tree to classify text documents into two sets given a labeled training set. Here you will learn a Naïve Bayes classifier for the same data.

The data is made from a subset of Reddit posts sourced from https://files.pushshift.io/reddit/ and processed using Google BigQuery. The dataset includes the first 1500 comments of August 2019 from each of the r/books and r/atheism subreddits, cleaned by removing punctuation and some offensive language, and limiting the words to those used more than 3 times among all posts. These 3000 comments are split evenly into training and testing sets (with 1500 documents in each).

To simplify your implementation, these posts have been pre-processed and converted to the bag-of-words model. More precisely, each post is converted to a vector of binary values such that each entry indicates whether the document contains a specific word or not. Each line of the files trainData.txt and testData.txt is formatted "docId wordId", which indicates that word wordId is present in document docId. The files trainLabel.txt and testLabel.txt indicate the label/category (1 = atheism, 2 = books) for each document (docId = line#). The file words.txt indicates which word corresponds to each wordId (denoted by the line#). If you are using Matlab, the file loadScript.m provides a simple script to load the files into appropriate matrices. At the Matlab prompt, just type "loadScript" to execute the script. Feel free to use any other language and to build your own loading script for the data if you prefer.

Implement code to learn a naïve Bayes model by maximum likelihood.[1] More precisely, learn a Bayesian network where the root node is the label/category variable with one child variable per word feature. The word variables should be binary and represent whether that word is present or absent in the document. Learn the parameters of the model by maximizing the likelihood of the training set only. This sets the class probability to the fraction of documents in the training set from each category, and the probability of a word given a document category to the fraction of documents in that category that contain that word. You should use a Laplace correction by adding 1 to the numerator and 2 to the denominator, in order to avoid situations where both classes have a probability of 0. Classify documents by computing the label/category with the highest posterior probability Pr(label | words in document). Report the training and testing accuracy (i.e., the percentage of correctly classified articles).

[1] For the precise equations, see the note on the course webpage: https://cs.uwaterloo.ca/~jhoey/teaching/cs486/naivebayesml.pdf
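For illustration, here is a minimal sketch of the maximum-likelihood learning and classification steps described above, written in Python with NumPy. The array names (X, y, and so on) and the overall structure are assumptions made for this sketch, not part of the provided files; any language and organization that implements the same equations is fine.

# Hypothetical sketch (not the official solution): maximum-likelihood naive Bayes
# with the Laplace correction, assuming the data has already been loaded into a
# binary document-word matrix X (n_docs x n_words) and a label vector y in {1, 2}.
import numpy as np

def train_naive_bayes(X, y):
    n_docs, n_words = X.shape
    log_prior = np.zeros(2)                  # log Pr(label)
    log_theta = np.zeros((2, n_words))       # log Pr(word present | label)
    log_theta_neg = np.zeros((2, n_words))   # log Pr(word absent | label)
    for c, label in enumerate((1, 2)):
        docs = X[y == label]
        log_prior[c] = np.log(len(docs) / n_docs)
        # Laplace correction: add 1 to the numerator and 2 to the denominator.
        theta = (docs.sum(axis=0) + 1.0) / (len(docs) + 2.0)
        log_theta[c] = np.log(theta)
        log_theta_neg[c] = np.log(1.0 - theta)
    return log_prior, log_theta, log_theta_neg

def classify(X, log_prior, log_theta, log_theta_neg):
    # Posterior (up to a normalizing constant): log Pr(label) plus, for each word,
    # log Pr(word present | label) if present, else log Pr(word absent | label).
    scores = log_prior + X @ log_theta.T + (1 - X) @ log_theta_neg.T
    return scores.argmax(axis=1) + 1         # map column index 0/1 back to labels 1/2

def accuracy(pred, y):
    return np.mean(pred == y) * 100.0        # percentage of correctly classified documents

The same classify and accuracy functions can be applied to both the training and the testing matrices to produce the two accuracy numbers requested below.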
What to hand in:

• [10 pts] A printout of your code.

• [10 pts] A printout listing the 10 most discriminative word features, as measured by max_word |log Pr(word | label 1) − log Pr(word | label 2)|. Since the posterior of each label is formed by multiplying in the conditional probabilities Pr(word | label i), a word feature should be more discriminative when the ratio Pr(word | label 1) / Pr(word | label 2) is either large or small, and therefore when the absolute difference between log Pr(word | label 1) and log Pr(word | label 2) is large. In your opinion, are these good word features? (A small sketch for computing this ranking appears after this list.)

• [10 pts] Training and testing accuracy (i.e., two numbers indicating the percentage of correctly classified articles for the training and testing sets).

• [5 pts] The naïve Bayes model assumes that all word features are independent. Is this a reasonable assumption? Explain briefly.

• [5 pts] What could you do to extend the naïve Bayes model to take into account dependencies between words?

• [5 pts] What if, instead of using ML learning, you were to use MAP learning? Explain what you would need to add and how it would work.
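As a hypothetical continuation of the earlier sketch, the discriminativeness ranking from the second item above could be computed as follows; the log_theta array and the words list (whose i-th entry is the word with wordId i+1, loaded from words.txt) are assumptions carried over from that sketch.

# Hypothetical sketch: rank words by |log Pr(word | label 1) - log Pr(word | label 2)|,
# reusing the log_theta array from the Part A sketch and a list `words` indexed by wordId - 1.
import numpy as np

def most_discriminative(log_theta, words, k=10):
    gap = np.abs(log_theta[0] - log_theta[1])   # absolute difference of log conditional probabilities
    top = np.argsort(gap)[::-1][:k]             # indices of the k largest gaps
    return [(words[i], gap[i]) for i in top]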
PART B [80pts]: NEURAL NETWORKS FOR CLASSIFICATION AND REGRESSION

In this part of the assignment, you will implement a feedforward neural network from scratch. Additionally, you will implement activation functions, a loss function, and a performance metric. Lastly, you will train a neural network model to perform a regression problem.

Red Wine Quality – A Regression Problem

The task is to predict the quality of red wine from northern Portugal, given some physical characteristics of the wine. The target y ∈ [0, 10] is a continuous variable, where 10 is the best possible wine, according to human tasters. This dataset was downloaded from the UCI Machine Learning Repository. The features are all real-valued. They are listed below:

• Fixed acidity
• Volatile acidity
• Citric acid
• Residual sugar
• Chlorides
• Free sulfur dioxide
• Total sulfur dioxide
• Density
• pH
• Sulphates
• Alcohol

Training a Neural Network

In Lecture 9b, you learned how to train a neural network using the backpropagation algorithm. In this assignment, you will apply the forward and backward pass to the entire dataset simultaneously (i.e., batch gradient descent). As a result, your forward and backward passes will manipulate tensors whose first dimension is the number of examples in the training set, n. When updating an individual weight W^(l)_{i,j}, you will need to find the average gradient ∂L/∂W^(l)_{i,j} (where L is the error) across all examples in the training set to apply the update.

Algorithm 1 gives the training algorithm in terms of functions that you will implement in this assignment. Further details can be found in the documentation for each function in the provided source code. (See the illustrative sketch at the end of this section.)

Algorithm 1 Training
Require: η > 0                           ▷ Learning rate
Require: n_epochs ∈ N+                   ▷ Number of epochs
Require: X ∈ R^{n×f}                     ▷ Training examples with n examples and f features
Require: y ∈ R^n                         ▷ Targets for training examples
Initialize weight matrices W^(l) randomly for each layer    ▷ Initialize net
for i ∈ {1, 2, ..., n_epochs} do                            ▷ Conduct n_epochs epochs
    A_vals, Z_vals ← net.forward_pass(X)                    ▷ Forward pass
    Ŷ ← Z_vals[-1]                                          ▷ Predictions
    L ← L(Ŷ, Y)                                             ▷ Compute error
    Compute ∇_Ŷ L(Ŷ, Y)                                     ▷ Derivative of error with respect to predictions
    deltas ← backward_pass(A_vals, ∇_Ŷ L(Ŷ, Y))             ▷ Backward pass
    update_gradients()                   ▷ W^(l)_{i,j} ← W^(l)_{i,j} − (η/n) ∂L/∂W^(l)_{i,j} for each weight
end for
return trained weight matrices W^(l)

Activation and Loss Functions

You will implement the following activation functions and their derivatives:
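To make the structure of Algorithm 1 concrete, below is a hypothetical Python sketch of batch gradient descent on a small one-hidden-layer network with a sigmoid activation and a mean-squared-error loss. The provided source code defines the actual classes and function signatures, so every name here (sigmoid, mse, train, and so on) and the specific architecture are assumptions made only for this illustration.

# Hypothetical sketch of Algorithm 1 (batch gradient descent), not the starter-code API:
# a one-hidden-layer network with a sigmoid activation and mean-squared-error loss.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_grad(a):
    s = sigmoid(a)
    return s * (1.0 - s)

def mse(y_hat, y):
    return np.mean((y_hat - y) ** 2)

def mse_grad(y_hat, y):
    return 2.0 * (y_hat - y) / y.shape[0]         # already averaged over the n examples

def train(X, y, hidden=16, eta=0.01, n_epochs=1000, seed=0):
    rng = np.random.default_rng(seed)
    n, f = X.shape
    W1 = rng.normal(scale=0.1, size=(f, hidden))  # input -> hidden weights
    W2 = rng.normal(scale=0.1, size=(hidden, 1))  # hidden -> output weights
    y = y.reshape(-1, 1)
    for _ in range(n_epochs):
        # Forward pass on the whole training set at once (first tensor dimension is n).
        A1 = X @ W1                               # hidden pre-activations, shape (n, hidden)
        Z1 = sigmoid(A1)                          # hidden activations
        y_hat = Z1 @ W2                           # linear output layer for regression, shape (n, 1)
        loss = mse(y_hat, y)                      # "compute error" step (could be logged per epoch)
        # Backward pass: propagate the loss derivative back through the layers.
        delta2 = mse_grad(y_hat, y)               # shape (n, 1)
        delta1 = (delta2 @ W2.T) * sigmoid_grad(A1)   # shape (n, hidden)
        # Weight update with the gradient averaged over all n examples.
        W2 -= eta * (Z1.T @ delta2)
        W1 -= eta * (X.T @ delta1)
    return W1, W2

Since mse_grad already divides by the number of examples, the products Z1.T @ delta2 and X.T @ delta1 are the averaged gradients used in Algorithm 1; an implementation built on the provided source code might instead perform that averaging inside update_gradients().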