projB1.pdf
School: University of California, Berkeley
Course: C200
Subject: Computer Science
Date: Dec 6, 2023
Type: pdf
Pages: 18
Uploaded by SuperHumanWorld12883
projB1

November 16, 2023

[1]: # Initialize Otter
     import otter
     grader = otter.Notebook("projB1.ipynb")

1 Project B1: Spam/Ham Classification

1.1 Due Date: Thursday, November 16th at 11:59 PM

You must submit this assignment to Gradescope by the on-time deadline, Thursday, November 16th at 11:59 PM. Please read the syllabus for the grace period policy. No late submissions beyond the grace period will be accepted. While course staff is happy to help you if you encounter difficulties with submission, we may not be able to respond to last-minute requests for assistance (TAs need to sleep, after all!). We strongly encourage you to plan to submit your work to Gradescope several hours before the stated deadline. This way, you will have ample time to reach out to staff for submission support.

1.1.1 Collaboration Policy

Data science is a collaborative activity. While you may talk with others about this project, we ask that you write your solutions individually. If you do discuss the assignments with others, please include their names in the collaborators cell below.

Collaborators: list collaborators here

1.2 Introduction

You will use what you've learned in class to create a binary classifier that can distinguish spam (junk, commercial, or bulk) emails from ham (regular, non-spam) emails. In addition to providing some skeleton code to fill in, we will evaluate your work based on your model's accuracy and your written responses in this notebook.

After this project, you should feel comfortable with the following:

• Feature engineering with text data,
• Using the sklearn library to process data and fit models, and
• Validating the performance of your model and minimizing overfitting.
This first part of the project focuses on initial analysis, feature engineering, and logistic regression. In the second part of this project (to be released next week), you will build your own spam/ham classifier.

1.3 Content Warning

This is a real-world dataset – the emails you are trying to classify are actual spam and legitimate emails. As a result, some of the spam emails may be in poor taste or be considered inappropriate. We think the benefit of working with realistic data outweighs these inappropriate emails, and we wanted to give a warning at the beginning of the project so that you are made aware. If you feel uncomfortable with this topic, please contact your TA, the instructors, or reach out via the extenuating circumstances form.

[2]: # Run this cell to suppress all FutureWarnings.
     import warnings
     warnings.filterwarnings("ignore", category=FutureWarning)

     # More readable exceptions.
     %pip install --quiet iwut
     %load_ext iwut
     %wut on

Note: you may need to restart the kernel to use updated packages.

1.4 Grading

Grading is broken down into autograded answers and free responses. For autograded answers, the results of your code are compared to provided and/or hidden tests. For free responses, readers will evaluate how well you answered the question and/or fulfilled the requirements of the question.

Question   Manual   Points
1          Yes      2
2          No       3
3          Yes      3
4          No       2
5          No       2
6a         No       1
6b         No       1
6c         Yes      2
6d         No       2
6e         No       1
6f         Yes      1
6g         Yes      1
6h         Yes      2
Total      6        23
[3]: import numpy as np
     import pandas as pd
     import matplotlib.pyplot as plt
     %matplotlib inline
     import seaborn as sns

     sns.set(style="whitegrid", color_codes=True, font_scale=1.5)

2 The Data

In email classification, our goal is to classify emails as spam or not spam (referred to as "ham") using features generated from the text in the email. The dataset is from SpamAssassin. It consists of email messages and their labels (0 for ham, 1 for spam). Your labeled training dataset contains 8,348 labeled examples, and the unlabeled test set contains 1,000 unlabeled examples.

Note: The dataset is from 2004, so the contents of emails might be very different from those in 2023.

Run the following cells to load the data into a DataFrame. The train DataFrame contains labeled data you will use to train your model. It has four columns:

1. id: An identifier for the training example.
2. subject: The subject of the email.
3. email: The text of the email.
4. spam: 1 if the email is spam, 0 if the email is ham (not spam).

The test DataFrame contains 1,000 unlabeled emails. In Project B2, you will predict labels for these emails and submit your predictions to the autograder for evaluation.

[4]: import zipfile
     with zipfile.ZipFile('spam_ham_data.zip') as item:
         item.extractall()

[5]: # Loading training and test datasets
     original_training_data = pd.read_csv('train.csv')
     test = pd.read_csv('test.csv')

     # Convert the emails to lowercase as the first step of text processing.
     original_training_data['email'] = original_training_data['email'].str.lower()
     test['email'] = test['email'].str.lower()

     original_training_data.head()

[5]:    id                                            subject  \
     0   0  Subject: A&L Daily to be auctioned in bankrupt…
     1   1  Subject: Wired: "Stronger ties between ISPs an…
     2   2                       Subject: It's just too small
     3   3                        Subject: liberal defnitions\n
     4   4  Subject: RE: [ILUG] Newbie seeks advice - Suse…

                                                    email  spam
     0  url: http://boingboing.net/#85534171\n date: n…     0
     1  url: http://scriptingnews.userland.com/backiss…     0
     2  <html>\n <head>\n </head>\n <body>\n <font siz…     1
     3  depends on how much over spending vs. how much…     0
     4  hehe sorry but if you hit caps lock twice the …     0

First, let's check if our data contains any missing values. We have filled in the cell below to print the number of NaN values in each column. If there are NaN values, we replace them with appropriate filler values (i.e., NaN values in the subject or email columns will be replaced with empty strings). Finally, we print the number of NaN values in each column after this modification to verify that there are no NaN values left.

Note: While there are no NaN values in the spam column, we should be careful when replacing NaN labels. Doing so without consideration may introduce significant bias into our model.

[6]: print('Before imputation:')
     print(original_training_data.isnull().sum())
     original_training_data = original_training_data.fillna('')
     print('------------')
     print('After imputation:')
     print(original_training_data.isnull().sum())

Before imputation:
id         0
subject    6
email      0
spam       0
dtype: int64
------------
After imputation:
id         0
subject    0
email      0
spam       0
dtype: int64

3 Part 1: Initial Analysis

In the cell below, we have printed the text of the email field for the first ham and the first spam email in the original training set.
[7]: first_ham = original_training_data.loc[original_training_data['spam'] == 0, 'email'].iloc[0]
     first_spam = original_training_data.loc[original_training_data['spam'] == 1, 'email'].iloc[0]
     print("Ham Email:")
     print(first_ham)
     print("-------------------------------------------------")
     print("Spam Email:")
     print(first_spam)

Ham Email:
url: http://boingboing.net/#85534171
date: not supplied

arts and letters daily, a wonderful and dense blog, has folded up its tent
due to the bankruptcy of its parent company. a&l daily will be auctioned
off by the receivers. link[1] discuss[2] (_thanks, misha!_)

[1] http://www.aldaily.com/
[2] http://www.quicktopic.com/boing/h/zlfterjnd6jf

-------------------------------------------------
Spam Email:
<html>
<head>
</head>
<body>
<font size=3d"4"><b> a man endowed with a 7-8" hammer is simply<br>
better equipped than a man with a 5-6"hammer. <br>
<br>would you rather have<br>more than enough to get the job done or fall =
short. it's totally up<br>to you. our methods are guaranteed to increase y=
our size by 1-3"<br> <a href=3d"http://209.163.187.47/cgi-bin/index.php?10=
004">come in here and see how</a>
</body>
</html>

3.1 Question 1

Discuss one attribute or characteristic you notice that is different between the two emails that might relate to the identification of a spam email.
The ham email provides information about a blog closure with legitimate sources, while the spam email promotes a questionable product using HTML-formatted content and focuses on a sensitive topic.

3.2 Training-Validation Split

The training data we downloaded is all the data we have available for both training models and validating the models that we train. We therefore need to split the training data into separate training and validation datasets. You will need this validation data to assess the performance of your classifier once you are finished training.

Note that we set the seed (random_state) to 42. This will produce a pseudo-random sequence of random numbers that is the same for every student. Do not modify this random seed in the following questions, as our tests depend on it.

[8]: # This creates a 90/10 train-validation split on our labeled data.
     from sklearn.model_selection import train_test_split

     train, val = train_test_split(original_training_data, test_size=0.1, random_state=42)

4 Part 2: Feature Engineering

We want to take the text of an email and predict whether the email is ham or spam. This is a binary classification problem, so we can use logistic regression to train a classifier. Recall that to train a logistic regression model, we need a numeric feature matrix 𝕏 and a vector of corresponding binary labels 𝑌. Unfortunately, our data are text, not numbers. To address this, we can create numeric features derived from the email text and use those features for logistic regression.

Each row of 𝕏 is an email. Each column of 𝕏 contains one feature for all the emails. We'll guide you through creating a simple feature, and you'll create more interesting ones as you try to increase the accuracy of your model.

4.1 Question 2

Create a function words_in_texts that takes in a list of interesting words (words) and a Series of emails (texts). Our goal is to check if each word in words is contained in the emails in texts.
The words_in_texts function should output a 2-dimensional NumPy array that contains one row for each email in texts and one column for each word in words. If the 𝑗-th word in words is present at least once in the 𝑖-th email in texts, the output array should have a value of 1 at position (𝑖, 𝑗). Otherwise, if the 𝑗-th word is not present in the 𝑖-th email, the value at (𝑖, 𝑗) should be 0.

In Project B2, we will be applying words_in_texts to some large datasets, so implementing some form of vectorization (for example, using NumPy arrays, Series.str functions, etc.) is highly recommended. You are allowed to use a single list comprehension or for loop, but you should look into how you could combine that with the vectorized functions discussed above. For example:
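One possible vectorized approach (an illustrative sketch, not the official solution — the toy inputs below are made up for demonstration) combines a single list comprehension with Series.str.contains, as the hint above suggests:

```python
import numpy as np
import pandas as pd

def words_in_texts(words, texts):
    """Return a 2D indicator array: entry (i, j) is 1 if words[j]
    appears anywhere in texts.iloc[i], else 0."""
    # Series.str.contains(word, regex=False) gives one boolean column
    # per word; stacking and transposing yields an (n_emails, n_words)
    # matrix, converted to 0/1 integers.
    indicator = np.array([texts.str.contains(word, regex=False) for word in words]).T
    return indicator.astype(int)

# Hypothetical toy data for illustration:
texts = pd.Series(["hello world", "free money now", "hello friend"])
words_in_texts(["hello", "money"], texts)
# array([[1, 0],
#        [0, 1],
#        [1, 0]])
```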