DMA23 - Lab 1_ Data Preprocessing - Colaboratory
School: University of California, Berkeley
Course: 144
Subject: Industrial Engineering
Date: Jan 9, 2024
12/7/23, 8:05 PM
https://colab.research.google.com/drive/1F3CmxxXhgxuRBO2naN1ogM5FulCvqAkS#scrollTo=vKduyi9dItO4&printMode=true
DATA MINING & ANALYTICS (2023)
Make sure you fill in any place that says YOUR CODE HERE or YOUR ANSWER HERE, as well as your name below:
NAME =
Data transformations are useful for preparing a dataset for answering a particular question. Part of this process involves generating features from the dataset that you find relevant to the question at hand. For this lab, we will be using a Yelp reviews dataset. Each row in the dataset depicts one review along with the features of the review (the reviewer, the review text, etc.). The goal of this lab is to eventually convert this reviews dataset into a reviewers dataset by creating different features describing each reviewer.
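To make the reviews-to-reviewers idea concrete, here is a minimal sketch on a made-up table. The rows, the name toy_reviews, and the three feature names are assumptions for illustration only; the column names mirror the lab's dataset.

```python
import pandas as pd

# Hypothetical toy reviews table: one row per review.
toy_reviews = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3", "u3", "u3"],
    "business_id": ["b1", "b2", "b1", "b1", "b2", "b3"],
    "stars": [4, 5, 3, 5, 4, 2],
    "cool_votes": [0, 1, 0, 2, 0, 1],
})

# Collapse to one row per reviewer, with one column per engineered feature.
toy_reviewers = toy_reviews.groupby("user_id").agg(
    review_count=("stars", "count"),         # number of reviews written
    mean_stars=("stars", "mean"),            # average rating given
    total_cool_votes=("cool_votes", "sum"),  # cool votes received overall
)
print(toy_reviewers)
```

Each row of toy_reviewers now describes a reviewer rather than a review, which is the shape the later questions work toward.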
The submission for this assignment should be done individually, but you are allowed to work in groups of 2.
Google Colab
Colab is a free online platform provided by Google that allows you to execute Python code without any installations on your local machine.
Without Colab (using Jupyter notebooks or the command line), you would have to install various packages and manage dependencies.
In Colab, you can simply import them, or even install them (for that particular session). Colab can be accessed at the link:
https://colab.research.google.com
IMPORTANT: This lab has been shared with only read permissions to you. Make sure to click File -> "Save a Copy in Drive" so that you can get
your own copy that WILL SAVE YOUR PROGRESS in your own Colab environment.
If you download the .ipynb and want to further edit the notebook, you will need to make sure you have Jupyter installed locally so you can view the notebook properly (not as a JSON file).
Environment Setup
Run this cell to set up your environment.
Lab 1 - Data Preprocessing
# Importing libraries
import numpy as np
import pandas as pd
import math
import os
print('Libraries Imported')
#DOWNLOADING DATASET IF NOT PRESENT
!wget -nc http://askoski.berkeley.edu/~zp/yelp_reviews.csv
#!unzip yelp_reviews.zip
print('Dataset Downloaded: yelp_reviews.csv')
df=pd.read_csv('yelp_reviews.csv')
print(df.head())
print('Setup Complete')
Libraries Imported
--2023-09-06 03:11:30--  http://askoski.berkeley.edu/~zp/yelp_reviews.csv
Resolving askoski.berkeley.edu (askoski.berkeley.edu)... 169.229.192.179
Connecting to askoski.berkeley.edu (askoski.berkeley.edu)|169.229.192.179|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 376638166 (359M) [text/csv]
Saving to: ‘yelp_reviews.csv’
yelp_reviews.csv    100%[===================>] 359.19M  38.2MB/s    in 7.6s
2023-09-06 03:11:38 (47.6 MB/s) - ‘yelp_reviews.csv’ saved [376638166/376638166]
Dataset Downloaded: yelp_reviews.csv
     type             business_id                 user_id  stars  \
0  review  mxrXVZWc6PWk81gvOVNOUw  mv7shusL4Xb6TylVYBv4CA      4
1  review  mxrXVZWc6PWk81gvOVNOUw  0aN5QPhs-VwK2vusKG0waQ      5
2  review  kK4AzZ0YWI-U2G-paAL7Fg  0aN5QPhs-VwK2vusKG0waQ      5
3  review  mxrXVZWc6PWk81gvOVNOUw  1JUwyYab-uJzEx_FRd81Zg      5
4  review  mxrXVZWc6PWk81gvOVNOUw  2Zd3Xy8hUVmZkNg7RyNjhg      4

                                                text        date  cool_votes  \
0  Definitely try the duck dish.  I rank it amon...  2011-06-13           0
1  Big Ass Burger was awesome! Great $5 mojitos. ...  2011-06-25           1
2             Unbelievable sandwiches! Good service.  2011-06-25           0
3  Awesome, awesome, awesome! My mom and sister a...  2011-07-18           1
4  I had the ribs they were great.  The beer sele...  2011-07-19           1

   useful_votes  funny_votes
0             0            0
1             0            0
2             0            0
3             1            0
4             0            1
Setup Complete
Q1: What was the highest number of reviews for any one business_id?
For this task, we will need to group the reviews dataset by business_id. This will aggregate data for each business, which is what we need for this task. This can be done using the groupby method. Some pointers on how you could go about this question are listed below (yelp_dataset here stands for the reviews DataFrame, which the setup cell loads as df):
yelp_businesses = yelp_dataset.groupby('business_id').size()
The .size() method counts the number of instances for each business_id, which gives us the number of reviews, since each instance in this dataset is a review.
The following command will sort this list, after which you can take note of the highest value:
sorted_yelp_businesses = yelp_businesses.sort_values(ascending=False, inplace=False)
This approach allows you to see the data structure being used in the sort. A quicker approach to getting the max would be to use the max function:
max(yelp_businesses)
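The pointers above can be tried on a tiny table before running them on the full dataset. This is only a sketch: the toy DataFrame and its variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: six reviews spread over three businesses.
toy = pd.DataFrame({"business_id": ["b1", "b2", "b1", "b3", "b1", "b2"]})

# One entry per business_id; the value is the number of rows (reviews).
counts = toy.groupby("business_id").size()

# Sorting in descending order puts the most-reviewed business first.
sorted_counts = counts.sort_values(ascending=False)

# The Series method .max() gives the same number without sorting.
highest = counts.max()
print(sorted_counts)
print(highest)
```

Here counts is a pandas Series indexed by business_id, so both the built-in max(counts) and counts.max() return the largest review count.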
#Make sure you return the answer value in this function
def q1(df):
    # Group the DataFrame by 'business_id' and count the number of reviews for each business
    yelp_businesses = df.groupby('business_id').size()
    # Sort the counts in descending order
    sorted_yelp_businesses = yelp_businesses.sort_values(ascending=False, inplace=False)
    # Get the highest number of reviews
    highest_reviews = sorted_yelp_businesses.iloc[0]
    # You can also use max(sorted_yelp_businesses)
    return highest_reviews
#This is a graded cell, do not edit
print(q1(df))
4128
Q2: On average, how many reviews did each business get?
#Make sure you return the answer value in this function
def q2(df):
    # Count reviews per business, then average those counts
    average_reviews_per_business = df.groupby('business_id').size().mean()
    return average_reviews_per_business
#This is a graded cell, do not edit
print(q2(df))
12.63413902163123
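As a sanity check on this kind of answer, the mean of the per-business counts should equal the total number of reviews divided by the number of distinct businesses. A minimal sketch on made-up data (the toy DataFrame is an assumption, not part of the lab):

```python
import pandas as pd

# Hypothetical toy data: six reviews spread over three businesses.
toy = pd.DataFrame({"business_id": ["b1", "b2", "b1", "b3", "b1", "b2"]})

# Average of the per-business review counts.
mean_of_counts = toy.groupby("business_id").size().mean()

# Same quantity computed directly: total rows / distinct businesses.
ratio = len(toy) / toy["business_id"].nunique()

print(mean_of_counts, ratio)
```

The two values agree because every row belongs to exactly one group; this would not hold if business_id contained missing values, which groupby drops by default.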
Q3: What is the average number of reviews per reviewer?
#Make sure you return the answer value in this function
def q3(df):
    # Count reviews per reviewer, then average those counts
    average_reviews_per_reviewer = df.groupby('user_id').size().mean()
    return average_reviews_per_reviewer
#This is a graded cell, do not edit
print(q3(df))
3.188511934933203
Q4: Calculate the total number of cool votes per reviewer, then average these totals across reviewers.
#Make sure you return the answer value in this function
def q4(df):
    # Sum cool votes per reviewer, then average the totals across reviewers
    average_cool_votes_per_reviewer = df.groupby('user_id')['cool_votes'].sum().mean()
    return average_cool_votes_per_reviewer
#This is a graded cell, do not edit
print(q4(df))
1.2417282785380945
Q5: Calculate the total number of funny votes per reviewer, then average these totals across reviewers.
#Make sure you return the answer value in this function
def q5(df):
    # Sum funny votes per reviewer, then average the totals across reviewers
    average_funny_votes_per_reviewer = df.groupby('user_id')['funny_votes'].sum().mean()
    return average_funny_votes_per_reviewer
#This is a graded cell, do not edit
print(q5(df))
1.10126486404605
Q6: Calculate the total number of useful votes each business gets, then average these totals across business_ids.
#Make sure you return the answer in this function
def q6(df):
    # Group by business_id, since the question asks for totals per business.
    # Grouping by user_id instead yields the per-reviewer average (2.484476138872867).
    average_useful_votes_per_business = df.groupby('business_id')['useful_votes'].sum().mean()
    return average_useful_votes_per_business
#This is a graded cell, do not edit
print(q6(df))
Q7: On average, what percentage of a reviewer's votes are cool votes?
(hint1: calculate the percentage of cool votes for each reviewer, then average this percentage across reviewers)
(hint2: you should discard reviewers who have absolutely no votes - from cool, funny, or useful votes - from your calculation)
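One way to read the two hints is sketched below on made-up data. The toy rows and variable names are assumptions, and this is an illustration of the approach, not the graded solution; whether the expected answer is expressed as a fraction or multiplied by 100 is also an assumption here.

```python
import pandas as pd

# Hypothetical toy reviews; u2 has no votes of any kind.
toy = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u3"],
    "cool_votes": [1, 1, 0, 0],
    "useful_votes": [2, 0, 0, 3],
    "funny_votes": [0, 0, 0, 1],
})
vote_cols = ["cool_votes", "useful_votes", "funny_votes"]

# Total votes of each type per reviewer.
per_user = toy.groupby("user_id")[vote_cols].sum()

# hint2: keep only reviewers with at least one vote, which avoids 0/0.
total_votes = per_user.sum(axis=1)
voters = per_user[total_votes > 0]

# hint1: cool share per reviewer, then the average across reviewers.
cool_share = voters["cool_votes"] / voters.sum(axis=1)
average_cool_pct = cool_share.mean() * 100
print(average_cool_pct)
```

Filtering before dividing matters: without it, reviewers with zero votes produce NaN shares, and .mean() silently skips them, which hides the fact that they were excluded.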