projB1
November 16, 2023
[1]:
# Initialize Otter
import otter
grader = otter.Notebook("projB1.ipynb")
1 Project B1: Spam/Ham Classification
1.1 Due Date: Thursday, November 16th at 11:59 PM
You must submit this assignment to Gradescope by the on-time deadline, Thursday, November
16th at 11:59 PM.
Please read the syllabus for the grace period policy. No late submissions beyond the grace period will
be accepted. While course staff is happy to help you if you encounter difficulties with submission,
we may not be able to respond to last-minute requests for assistance (TAs need to sleep, after
all!).
We strongly encourage you to plan to submit your work to Gradescope several
hours before the stated deadline. This way, you will have ample time to reach out to staff for
submission support.
1.1.1 Collaboration Policy
Data science is a collaborative activity. While you may talk with others about this project, we
ask that you write your solutions individually. If you do discuss the assignments with others,
please include their names in the collaborators cell below.
Collaborators: list collaborators here
1.2 Introduction
You will use what you’ve learned in class to create a binary classifier that can distinguish spam
(junk or commercial or bulk) emails from ham (regular non-spam) emails. In addition to providing
some skeleton code to fill in, we will evaluate your work based on your model’s accuracy and your
written responses in this notebook.
After this project, you should feel comfortable with the following:
• Feature engineering with text data,
• Using the sklearn library to process data and fit models, and
• Validating the performance of your model and minimizing overfitting.
This first part of the project focuses on initial analysis, Feature Engineering, and Logistic Regression.
In the second part of this project (to be released next week), you will build your own
spam/ham classifier.
1.3 Content Warning
This is a real-world dataset – the emails you are trying to classify are actual spam and legitimate
emails. As a result, some of the spam emails may be in poor taste or be considered inappropriate.
We think the benefit of working with realistic data outweighs these inappropriate emails and wanted
to give a warning at the beginning of the project so that you are made aware.
If you feel uncomfortable with this topic, please contact your TA, the instructors, or reach
out via the extenuating circumstances form.
[2]:
# Run this cell to suppress all FutureWarnings.
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
# More readable exceptions.
%pip install --quiet iwut
%load_ext iwut
%wut on
Note: you may need to restart the kernel to use updated packages.
1.4 Grading
Grading is broken down into autograded answers and free responses.
For autograded answers, the results of your code are compared to provided and/or hidden tests.
For free response, readers will evaluate how well you answered the question and/or fulfilled the
requirements of the question.
Question  Manual  Points
1         Yes     2
2         No      3
3         Yes     3
4         No      2
5         No      2
6a        No      1
6b        No      1
6c        Yes     2
6d        No      2
6e        No      1
6f        Yes     1
6g        Yes     1
6h        Yes     2
Total     6       23
[3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(style="whitegrid", color_codes=True, font_scale=1.5)
2 The Data
In email classification, our goal is to classify emails as spam or not spam (referred to as “ham”)
using features generated from the text in the email. The dataset is from SpamAssassin. It consists
of email messages and their labels (0 for ham, 1 for spam). Your labeled training dataset contains
8,348 labeled examples, and the unlabeled test set contains 1,000 unlabeled examples.
Note: The dataset is from 2004, so the contents of emails might be very different from those in
2023.
Run the following cells to load the data into a DataFrame.
The train DataFrame contains labeled data you will use to train your model. It has four columns:
1. id: An identifier for the training example.
2. subject: The subject of the email.
3. email: The text of the email.
4. spam: 1 if the email is spam, 0 if the email is ham (not spam).
The test DataFrame contains 1,000 unlabeled emails. In Project B2, you will predict labels for
these emails and submit your predictions to the autograder for evaluation.
[4]:
import zipfile
with zipfile.ZipFile('spam_ham_data.zip') as item:
    item.extractall()
[5]:
# Loading training and test datasets
original_training_data = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
# Convert the emails to lowercase as the first step of text processing.
original_training_data['email'] = original_training_data['email'].str.lower()
test['email'] = test['email'].str.lower()
original_training_data.head()
[5]:
   id  subject                                             \
0   0  Subject: A&L Daily to be auctioned in bankrupt…
1   1  Subject: Wired: "Stronger ties between ISPs an…
2   2  Subject: It's just too small …
3   3  Subject: liberal defnitions\n
4   4  Subject: RE: [ILUG] Newbie seeks advice - Suse…

   email                                               spam
0  url: http://boingboing.net/#85534171\n date: n…     0
1  url: http://scriptingnews.userland.com/backiss…     0
2  <html>\n <head>\n </head>\n <body>\n <font siz…     1
3  depends on how much over spending vs. how much…     0
4  hehe sorry but if you hit caps lock twice the …     0
First, let’s check if our data contains any missing values. We have filled in the cell below to print
the number of NaN values in each column. If there are NaN values, we replace them with appropriate
filler values (i.e., NaN values in the subject or email columns will be replaced with empty strings).
Finally, we print the number of NaN values in each column after this modification to verify that
there are no NaN values left.
Note: While there are no NaN values in the spam column, we should be careful when replacing
NaN labels. Doing so without consideration may introduce significant bias into our model.
[6]:
print('Before imputation:')
print(original_training_data.isnull().sum())
original_training_data = original_training_data.fillna('')
print('------------')
print('After imputation:')
print(original_training_data.isnull().sum())
Before imputation:
id         0
subject    6
email      0
spam       0
dtype: int64
------------
After imputation:
id         0
subject    0
email      0
spam       0
dtype: int64
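The cell above calls fillna('') on the whole DataFrame, which would also overwrite a missing spam label if one existed. As a minimal sketch of the more cautious approach described in the note, the fill can be restricted to the text columns. The toy DataFrame and the names `toy`, `text_columns`, and `filled` below are illustrative assumptions, not part of the project's skeleton code.

```python
import pandas as pd

# Toy data standing in for the training set (hypothetical values).
toy = pd.DataFrame({
    "id": [0, 1, 2],
    "subject": ["Subject: hi\n", None, "Subject: offer\n"],
    "email": ["hello there", "buy now", None],
    "spam": [0, 1, 1],
})

# Fill only the text columns; the label column `spam` is never touched.
text_columns = ["subject", "email"]
filled = toy.copy()
filled[text_columns] = filled[text_columns].fillna("")

# Count the remaining missing values per column.
missing_after = filled.isnull().sum()
print(missing_after)
```

With this pattern, a missing label would stay visible as NaN in the spam column, so it could be handled deliberately rather than silently replaced.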
3 Part 1: Initial Analysis
In the cell below, we have printed the text of the email field for the first ham and the first spam
email in the original training set.
[7]:
first_ham = original_training_data.loc[original_training_data['spam'] == 0, 'email'].iloc[0]
first_spam = original_training_data.loc[original_training_data['spam'] == 1, 'email'].iloc[0]
print("Ham Email:")
print(first_ham)
print("-------------------------------------------------")
print("Spam Email:")
print(first_spam)
Ham Email:
url: http://boingboing.net/#85534171
date: not supplied
arts and letters daily, a wonderful and dense blog, has folded up its tent due
to the bankruptcy of its parent company. a&l daily will be auctioned off by the
receivers. link[1] discuss[2] (_thanks, misha!_)
[1] http://www.aldaily.com/
[2] http://www.quicktopic.com/boing/h/zlfterjnd6jf
-------------------------------------------------
Spam Email:
<html>
<head>
</head>
<body>
<font size=3d"4"><b> a man endowed with a 7-8" hammer is simply<br>
better equipped than a man with a 5-6"hammer. <br>
<br>would you rather have<br>more than enough to get the job done or fall =
short. it's totally up<br>to you. our methods are guaranteed to increase y=
our size by 1-3"<br> <a href=3d"http://209.163.187.47/cgi-bin/index.php?10=
004">come in here and see how</a>
</body>
</html>
3.1 Question 1
Discuss one attribute or characteristic you notice that is different between the two emails that
might relate to the identification of a spam email.
The ham email provides information about a blog closure with legitimate sources, while the spam
email promotes a questionable product using HTML-formatted content and focuses on a sensitive
topic.
3.2 Training-Validation Split
The training data we downloaded is all the data we have available for both training models and
validating the models that we train. We, therefore, need to split the training data into separate
training and validation datasets. You will need this validation data to assess the performance of
your classifier once you are finished training. Note that we set the seed (random_state) to 42. This
will produce a pseudo-random sequence of random numbers that is the same for every student.
Do not modify this random seed in the following questions, as our tests depend on it.
[8]:
# This creates a 90/10 train-validation split on our labeled data.
from sklearn.model_selection import train_test_split
train, val = train_test_split(original_training_data, test_size=0.1, random_state=42)
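To illustrate why a fixed seed gives every student the same split, here is a rough NumPy-only sketch of a seeded shuffle-and-split. It is not how train_test_split is implemented, and it will not assign the same rows as the cell above. The helper `seeded_split` and the toy DataFrame are hypothetical names introduced only for this illustration.

```python
import numpy as np
import pandas as pd

def seeded_split(df, val_fraction=0.1, seed=42):
    """Shuffle row positions with a seeded generator, then split (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(df))
    n_val = int(np.ceil(val_fraction * len(df)))
    val_idx, train_idx = shuffled[:n_val], shuffled[n_val:]
    return df.iloc[train_idx], df.iloc[val_idx]

# Toy labeled data (made-up values).
toy = pd.DataFrame({"id": range(20), "spam": [0, 1] * 10})

# Two calls with the same seed select exactly the same validation rows.
train_a, val_a = seeded_split(toy)
train_b, val_b = seeded_split(toy)
print(val_a["id"].tolist() == val_b["id"].tolist())
```

Changing the seed would generally change which rows land in the validation set, which is why tests that depend on the split require the seed to stay fixed.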
4 Part 2: Feature Engineering
We want to take the text of an email and predict whether the email is ham or spam. This is a
binary classification problem, so we can use logistic regression to train a classifier. Recall that to
train a logistic regression model, we need a numeric feature matrix 𝕏 and a vector of corresponding
binary labels 𝑌. Unfortunately, our data are text, not numbers. To address this, we can create
numeric features derived from the email text and use those features for logistic regression.
Each row of 𝕏 is an email. Each column of 𝕏 contains one feature for all the emails. We’ll guide
you through creating a simple feature, and you’ll create more interesting ones as you try to increase
the accuracy of your model.
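As a small sketch of what "numeric features derived from the email text" can look like, the snippet below builds a feature matrix with one row per email and one column per feature. The two toy emails and the three features (character count, number of exclamation marks, presence of an `<html>` tag) are assumptions chosen for illustration; they are not the features the project asks for.

```python
import numpy as np
import pandas as pd

# Two made-up emails, already lowercased as in the loading cell.
emails = pd.Series([
    "url: http://example.com\n date: not supplied\n",
    "<html>\n <body>\n buy now!!!\n </body>\n </html>\n",
])
labels = np.array([0, 1])  # 0 for ham, 1 for spam

# Each column is one numeric feature computed for every email.
X = np.column_stack([
    emails.str.len(),                                        # length in characters
    emails.str.count("!"),                                   # number of exclamation marks
    emails.str.contains("<html>", regex=False).astype(int),  # 1 if an <html> tag appears
])
print(X.shape)
```

A matrix shaped like `X`, together with `labels`, is the kind of input a logistic regression model expects.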
4.1 Question 2
Create a function words_in_texts that takes in a list of interesting words (words) and a Series
of emails (texts). Our goal is to check if each word in words is contained in the emails in texts.
The words_in_texts function should output a 2-dimensional NumPy array that contains one
row for each email in texts and one column for each word in words. If the 𝑗-th word in words is
present at least once in the 𝑖-th email in texts, the output array should have a value of 1 at the
position (𝑖, 𝑗). Otherwise, if the 𝑗-th word is not present in the 𝑖-th email, the value at (𝑖, 𝑗) should
be 0.
In Project B2, we will be applying words_in_texts to some large datasets, so implementing some
form of vectorization (for example, using NumPy arrays, Series.str functions, etc.) is highly
recommended. You are allowed to use a single list comprehension or for loop, but you
should look into how you could combine that with the vectorized functions discussed above.
For example:
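One possible way to combine a single list comprehension with a vectorized Series.str call is sketched below. It assumes that "present" means a plain substring match, which is why regex=False is passed, and the sample inputs are made up for illustration. Treat it as a sketch under those assumptions rather than the reference solution.

```python
import numpy as np
import pandas as pd

def words_in_texts(words, texts):
    """Return a (len(texts), len(words)) array of 0/1 substring indicators."""
    # One vectorized Series.str.contains call per word;
    # regex=False treats each word as a literal substring (an assumption).
    columns = [texts.str.contains(word, regex=False) for word in words]
    # Stacking gives shape (n_words, n_texts); transpose so rows are emails.
    return np.array(columns).T.astype(int)

example = words_in_texts(
    ["hello", "bye", "world"],
    pd.Series(["hello", "hello worldhello"]),
)
print(example)
```

The only Python-level loop runs over the words, while the per-email work happens inside pandas. An empty `words` list is not handled here and would need a special case to return an array with zero columns.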