Assignment 2: Classification of Textual Data
COMP 551 Winter 2024, McGill University
Contact TAs: Alina Tan and Jonathan Colaco Carr
Released on February 5 at midnight
Due on February 26 at midnight

Please read this entire document before beginning the assignment.

Preamble

• This assignment is due on February 26th at 11:59pm (EST, Montreal time).
• For late submissions, 2^k percent will be deducted for a delay of k days.
• To use your 6-day quota as a team, submit your request by emailing comp551.socs@mcgill.ca with the subject line "A2 extension request", specifying in the email body the number of days (at most 6) you need to submit your assignment. You can only submit the request ONCE. Once you request the days, the quota of every member of the team will be reduced by the days you requested, even if you end up submitting your assignment prior to the extended deadline. Therefore, plan and use your quota wisely.
• This assignment is to be completed in groups of three. All members of a group will receive the same grade, except when a group member is not responding or contributing to the assignment. If this is the case and there are major conflicts, please reach out to the contact TA or the instructor for help and flag this in the submitted report. Please note that it is not expected that all team members will contribute equally; however, every team member should make integral contributions to the assignment, be aware of the content of the submission, and understand the full solution submitted.
• You will submit your assignment on MyCourses as a group. You must register your group on MyCourses, and any group member can submit. See MyCourses or here for details.
• We recommend using Overleaf for writing your report and Google Colab for coding and running the experiments. The latter also gives access to the required computational resources. Both platforms enable remote collaboration.
• You should use Python for this and all assignments. You are free to use libraries with general utilities, such as matplotlib, numpy, and scipy, unless stated otherwise in the description of the task. In particular, in most cases you should implement the models and evaluation functions yourself, which means you should not use pre-existing implementations of the algorithms or functions found in scikit-learn and other packages. The description will specify this on a per-case basis.

Synopsis

In this assignment, you will implement logistic regression and multiclass regression and evaluate these two algorithms against decision trees on two distinct textual datasets. The goal is to gain experience implementing these algorithms from scratch and to get hands-on experience evaluating their performance.

1 Task 1: Data preprocessing

Your first task is to turn the text data into a tabular format, with the selected words as the features and the text documents as the training or test examples. We will use the two datasets described below.

1.1 IMDB Reviews

The IMDB Reviews data can be downloaded from http://ai.stanford.edu/~amaas/data/sentiment/. To train your model, use only the reviews in the "train" folder. Report the performance of your model on the reviews in the "test" folder. Carefully read the README file to get a clear understanding of the data format. Briefly, imdb.vocab contains the vocabulary, with one word per row. The row indices of the words are used as the feature indices that appear in the training and test documents in the "labeledBow.feat" files.

Task 1.1 The entire vocabulary size is 89,526, which is also the total number of features. This is too big for training our custom logistic regression, so as a preprocessing step you will need to decide which features to use. First, you may filter out words that appear in less than 1% of the documents and words that appear in more than 50% of the documents; these are the rare words and the "stopwords", respectively. Stopwords are commonly used words that are not important to our tasks. Second, you need to choose the top D ∈ [100, 1000] features by the absolute values of their regression coefficients against the rating scores (1-10), using the Simple Linear Regression we covered in Module 4.1. In other words, although we will eventually use logistic regression for binary classification on this data, we first perform linear regression (with the rating score as the target variable) in order to find important features. For this step, you must implement the Simple Linear Regression model from scratch (i.e., you cannot use the linear regression model from sklearn). Examine the top features with the most positive simple regression coefficients and the top features with the most negative coefficients. Do they make sense for calling a movie good and bad, respectively? A sketch of this preprocessing pipeline is given below.
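The following is a minimal sketch of one way to do this, not a reference implementation: the file paths assume the dataset was unpacked into a local aclImdb/ folder, the per-word slope uses the closed form beta_j = cov(x_j, y) / var(x_j), and the cutoffs (1%, 50%) and D = 200 are placeholders to tune.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Assumed paths: adjust to wherever you unpacked the dataset.
VOCAB_PATH = "aclImdb/imdb.vocab"
TRAIN_FEAT = "aclImdb/train/labeledBow.feat"

def load_bow(path, vocab_size):
    """Parse a .feat file (LIBSVM-style lines: 'rating idx:count ...')
    into a sparse document-term matrix X and a rating vector y."""
    rows, cols, vals, ratings = [], [], [], []
    with open(path) as f:
        for i, line in enumerate(f):
            parts = line.split()
            ratings.append(int(parts[0]))
            for tok in parts[1:]:
                j, c = tok.split(":")
                rows.append(i); cols.append(int(j)); vals.append(int(c))
    X = csr_matrix((vals, (rows, cols)), shape=(len(ratings), vocab_size))
    return X, np.array(ratings, dtype=float)

vocab = np.array([w.strip() for w in open(VOCAB_PATH, encoding="utf-8")])
X, y = load_bow(TRAIN_FEAT, len(vocab))

# Step 1: document-frequency filtering (rare words and stopwords).
doc_freq = np.asarray((X > 0).mean(axis=0)).ravel()
keep = (doc_freq >= 0.01) & (doc_freq <= 0.50)
X_f = X[:, keep].toarray().astype(np.float32)  # small enough to densify
words = vocab[keep]

# Step 2: one simple (univariate) regression per word, from scratch:
# slope beta_j = cov(x_j, y) / var(x_j), via centered sums.
xc = X_f - X_f.mean(axis=0)
yc = y - y.mean()
beta = (xc.T @ yc) / (xc ** 2).sum(axis=0)

# Step 3: keep the top D words by |beta|, with D in [100, 1000].
D = 200  # placeholder; treat as a hyperparameter
top = np.argsort(-np.abs(beta))[:D]
order = np.argsort(-beta[top])
print("most positive:", words[top][order][:10])        # "good movie" words?
print("most negative:", words[top][order][-10:][::-1])  # "bad movie" words?
X_train = X_f[:, top]
```

Because each simple regression is univariate, all the slopes can be computed at once with vectorized centered sums; no iterative fitting is needed.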
1.2 20 news groups: a multi-class labelled textual dataset

The 20-news-group dataset can be loaded directly using sklearn.datasets.fetch_20newsgroups (https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html). Use the default train subset (subset='train', with remove=(['headers', 'footers', 'quotes']) in sklearn.datasets) to train the multiclass prediction models, and report the final performance on the test subset. Note: you need to start with the text data and convert the text to feature vectors. Please refer to https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html for a tutorial on the steps needed for this.

Task 1.2 For the sake of this assignment, it is OK to work with a partial dataset containing only 5 of the 20 available categories. You may choose your favourite 5 categories. One tip: choosing 5 distinct categories (e.g., comp.graphics, misc.forsale, rec.sport.baseball, sci.med, talk.politics.guns) may make your code easier to debug, because such categories are easy to distinguish by their corresponding key words. Similar to Task 1.1, you can filter out rare words, stopwords, and words that are not relevant to any of the 5 class labels. Since we are dealing with discrete class labels, we will use something other than simple regression to select features. For example, you may use Mutual Information (MI) (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mutual_info_score.html) to select the top M ∈ [10, 100] feature words per class and take the union of all top feature words to train your multiclass model. You may choose other ways to select feature words, as long as you describe what you did in your report. One thing to keep in mind is that our custom multiclass regression may be slow, and without regularization you may want to keep the number of features fairly low: even with 100 feature words per class, the union has at most 500 features in total for the 5 categories. A sketch of this selection procedure is given below.
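Here is a sketch of one such selection procedure, assuming the five example categories listed above; the binary word-presence encoding, M = 50, and the document-frequency cutoffs are illustrative choices to tune. It scores each word with sklearn's mutual_info_score against a one-vs-rest class indicator.

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import mutual_info_score

# Five reasonably distinct categories (swap in your own favourites).
cats = ["comp.graphics", "misc.forsale", "rec.sport.baseball",
        "sci.med", "talk.politics.guns"]
train = fetch_20newsgroups(subset="train", categories=cats,
                           remove=("headers", "footers", "quotes"))

# Document-frequency cutoffs drop rare words and stopwords,
# mirroring the thresholds used in Task 1.1.
vec = CountVectorizer(min_df=0.01, max_df=0.50)
X = (vec.fit_transform(train.data) > 0).toarray().astype(int)  # presence
words = vec.get_feature_names_out()
y = train.target

# For each class, rank words by mutual information between word
# presence and the one-vs-rest class indicator; keep the union.
M = 50  # top words per class; tune within [10, 100]
selected = set()
for c in range(len(cats)):
    indicator = (y == c).astype(int)
    mi = np.array([mutual_info_score(X[:, j], indicator)
                   for j in range(X.shape[1])])
    selected.update(np.argsort(-mi)[:M].tolist())
selected = sorted(selected)
print(len(selected), "features selected, e.g.:",
      [words[j] for j in selected[:10]])
X_train = X[:, selected]
```

Because the per-class top-M lists overlap, the union is typically well under 5M words, which helps keep the custom multiclass regression tractable.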