Assignment 2: Classification of Textual Data
COMP 551 Winter 2024, McGill University
Contact TAs: Alina Tan and Jonathan Colaco Carr
Released on February 5 at midnight
Due on February 26 at midnight

Please read this entire document before beginning the assignment.

Preamble

• This assignment is due on February 26th at 11:59pm (EST, Montreal time).
• For late submissions, 2^k percent will be deducted for a delay of k days.
• To use your 6-day quota as a team, submit your request by emailing comp551.socs@mcgill.ca with the subject line "A2 extension request", specifying in the email body the number of days (at most 6) you need to submit your assignment. You can only submit the request ONCE. Once you request the days, the quota of every member of the team will be reduced by the days you requested, even if you end up submitting your assignment prior to the extended deadline. Therefore, plan and use your quota wisely.
• This assignment is to be completed in groups of three. All members of a group will receive the same grade, except when a group member is not responding or contributing to the assignment. If this is the case and there are major conflicts, please reach out to the contact TA or the instructor for help and flag this in the submitted report. Please note that it is not expected that all team members will contribute equally; however, every team member should make integral contributions to the assignment, be aware of the content of the submission, and understand the full solution submitted.
• You will submit your assignment on MyCourses as a group. You must register your group on MyCourses, and any group member can submit. See MyCourses or here for details.
• We recommend using Overleaf for writing your report and Google Colab for coding and running the experiments. The latter also gives access to the required computational resources. Both platforms enable remote collaboration.
• You should use Python for this and all assignments. You are free to use libraries with general utilities, such as matplotlib, numpy, and scipy, unless stated otherwise in the description of the task. In particular, in most cases you should implement the models and evaluation functions yourself, which means you should not use pre-existing implementations of the algorithms or functions found in scikit-learn and other packages. The description will specify this on a per-case basis.

Synopsis

In this assignment, you will implement logistic regression and multiclass regression and evaluate these two algorithms against decision trees on two distinct textual datasets. The goal is to gain experience implementing these algorithms from scratch and to get hands-on experience evaluating their performance.

1 Task 1: Data preprocessing

Your first task is to turn the text data into a tabular format, with the selected words as the features and the text documents as the training or test examples. We will use the two datasets described below.

1.1 IMDB Reviews

The IMDB Reviews data can be downloaded from http://ai.stanford.edu/~amaas/data/sentiment/. To train your model, use only the reviews in the "train" folder. Report the performance of your model on the reviews in the "test" folder. Carefully read the README file to get a clear understanding of the data format. Briefly, imdb.vocab contains the vocabulary, with one word per row. The row indices of the words are used as the feature indices that appear in the training and test documents in the "labeledBow.feat" files.

Task 1.1 The entire vocabulary size is 89,526, which is also the total number of features. This is too big for training our custom logistic regression, so as a preprocessing step you will need to decide which features to use. First, you may filter out words that appear in less than 1% of the documents and words that appear in more than 50% of the documents; these are the rare words and the "stopwords", respectively. Stopwords are commonly used words that are not important to our tasks. Second, you need to choose the top D ∈ [100, 1000] features by the absolute values of their regression coefficients against the rating scores (1-10), using the Simple Linear Regression we covered in Module 4.1. In other words, although we will eventually use logistic regression for binary classification on this data, we first perform linear regression (with the rating score as the target variable) in order to find important features. For this step, you must implement the Simple Linear Regression model from scratch (i.e., you cannot use the linear regression model from sklearn). Examine the top features with the most positive simple regression coefficients and the top features with the most negative coefficients. Do they make sense for calling a movie good and bad, respectively? A sketch of this preprocessing pipeline is given below.
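The following is a minimal sketch of one way to do this, not a reference implementation: the file paths assume the dataset was unpacked into a local aclImdb/ folder, the per-word slope uses the closed form beta_j = cov(x_j, y) / var(x_j), and the cutoffs (1%, 50%) and D = 200 are placeholders to tune.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Assumed paths: adjust to wherever you unpacked the dataset.
VOCAB_PATH = "aclImdb/imdb.vocab"
TRAIN_FEAT = "aclImdb/train/labeledBow.feat"

def load_bow(path, vocab_size):
    """Parse a .feat file (LIBSVM-style lines: 'rating idx:count ...')
    into a sparse document-term matrix X and a rating vector y."""
    rows, cols, vals, ratings = [], [], [], []
    with open(path) as f:
        for i, line in enumerate(f):
            parts = line.split()
            ratings.append(int(parts[0]))
            for tok in parts[1:]:
                j, c = tok.split(":")
                rows.append(i); cols.append(int(j)); vals.append(int(c))
    X = csr_matrix((vals, (rows, cols)), shape=(len(ratings), vocab_size))
    return X, np.array(ratings, dtype=float)

vocab = np.array([w.strip() for w in open(VOCAB_PATH, encoding="utf-8")])
X, y = load_bow(TRAIN_FEAT, len(vocab))

# Step 1: document-frequency filtering (rare words and stopwords).
doc_freq = np.asarray((X > 0).mean(axis=0)).ravel()
keep = (doc_freq >= 0.01) & (doc_freq <= 0.50)
X_f = X[:, keep].toarray().astype(np.float32)  # small enough to densify
words = vocab[keep]

# Step 2: one simple (univariate) regression per word, from scratch:
# slope beta_j = cov(x_j, y) / var(x_j), via centered sums.
xc = X_f - X_f.mean(axis=0)
yc = y - y.mean()
beta = (xc.T @ yc) / (xc ** 2).sum(axis=0)

# Step 3: keep the top D words by |beta|, with D in [100, 1000].
D = 200  # placeholder; treat as a hyperparameter
top = np.argsort(-np.abs(beta))[:D]
order = np.argsort(-beta[top])
print("most positive:", words[top][order][:10])        # "good movie" words?
print("most negative:", words[top][order][-10:][::-1])  # "bad movie" words?
X_train = X_f[:, top]
```

Because each simple regression is univariate, all the slopes can be computed at once with vectorized centered sums; no iterative fitting is needed.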
1.2 20 news groups: a multi-class labelled textual dataset

The 20-news-group dataset can be loaded directly using sklearn.datasets.fetch_20newsgroups (https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html). Use the default train subset (subset='train', with remove=(['headers', 'footers', 'quotes']) in sklearn.datasets) to train the multiclass prediction models, and report the final performance on the test subset. Note: you need to start with the text data and convert the text to feature vectors. Please refer to https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html for a tutorial on the steps needed for this.

Task 1.2 For the sake of this assignment, it is OK to work with a partial dataset containing only 5 of the 20 available categories. You may choose your favourite 5 categories. One tip: choosing 5 distinct categories (e.g., comp.graphics, misc.forsale, rec.sport.baseball, sci.med, talk.politics.guns) may make your code easier to debug, because such categories are easy to distinguish by their corresponding key words. Similar to Task 1.1, you can filter out rare words, stopwords, and words that are not relevant to any of the 5 class labels. Since we are dealing with discrete class labels, we will use something other than simple regression to select features. For example, you may use Mutual Information (MI) (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mutual_info_score.html) to select the top M ∈ [10, 100] feature words per class and take the union of all top feature words to train your multiclass model. You may choose other ways to select feature words, as long as you describe what you did in your report. One thing to keep in mind is that our custom multiclass regression may be slow, and without regularization you may want to keep the number of features fairly low: even with 100 feature words per class, the union has at most 500 features in total for the 5 categories. A sketch of this selection procedure is given below.
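Here is a sketch of one such selection procedure, assuming the five example categories listed above; the binary word-presence encoding, M = 50, and the document-frequency cutoffs are illustrative choices to tune. It scores each word with sklearn's mutual_info_score against a one-vs-rest class indicator.

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import mutual_info_score

# Five reasonably distinct categories (swap in your own favourites).
cats = ["comp.graphics", "misc.forsale", "rec.sport.baseball",
        "sci.med", "talk.politics.guns"]
train = fetch_20newsgroups(subset="train", categories=cats,
                           remove=("headers", "footers", "quotes"))

# Document-frequency cutoffs drop rare words and stopwords,
# mirroring the thresholds used in Task 1.1.
vec = CountVectorizer(min_df=0.01, max_df=0.50)
X = (vec.fit_transform(train.data) > 0).toarray().astype(int)  # presence
words = vec.get_feature_names_out()
y = train.target

# For each class, rank words by mutual information between word
# presence and the one-vs-rest class indicator; keep the union.
M = 50  # top words per class; tune within [10, 100]
selected = set()
for c in range(len(cats)):
    indicator = (y == c).astype(int)
    mi = np.array([mutual_info_score(X[:, j], indicator)
                   for j in range(X.shape[1])])
    selected.update(np.argsort(-mi)[:M].tolist())
selected = sorted(selected)
print(len(selected), "features selected, e.g.:",
      [words[j] for j in selected[:10]])
X_train = X[:, selected]
```

Because the per-class top-M lists overlap, the union is typically well under 5M words, which helps keep the custom multiclass regression tractable.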