DMA23 - Lab 1_ Data Preprocessing - Colaboratory


School

University of California, Berkeley

Course

144

Subject

Industrial Engineering

Date

Jan 9, 2024


12/7/23, 8:05 PM DMA23 - Lab 1_ Data Preprocessing - Colaboratory https://colab.research.google.com/drive/1F3CmxxXhgxuRBO2naN1ogM5FulCvqAkS#scrollTo=vKduyi9dItO4&printMode=true 1/6

DATA MINING & ANALYTICS (2023)

Make sure you fill in any place that says YOUR CODE HERE or YOUR ANSWER HERE, as well as your name below:

NAME =

Data transformations are useful for preparing a dataset for answering a particular question. Part of this process involves generating features from the dataset you find relevant to the question at hand. For this lab, we will be using a Yelp reviews dataset. Each row in the dataset depicts one review along with the features of the review (the reviewer, the review text, etc.). The goal of this lab is to eventually convert this reviews dataset into a reviewers dataset by creating different features describing each reviewer.

The submission for this assignment should be done individually, but you are allowed to work in groups of 2.

Google Colab

Colab is a free online platform provided by Google that allows you to execute Python code without any installations on your local machine. Without Colab (using Jupyter notebooks or the command line), you would have to install various packages and manage dependencies. In Colab, you can simply import them, or even install them (for that particular session). Colab can be accessed at the link: https://colab.research.google.com

IMPORTANT: This lab has been shared with only read permissions to you. Make sure to click File -> "Save a Copy in Drive" so that you can get your own copy that WILL SAVE YOUR PROGRESS in your own Colab environment. If you download the .ipynb and want to further edit the notebook, you will need to make sure you have Jupyter installed locally so you can view the notebook properly (not as a JSON file).

Environment Setup

Run this cell to set up your environment.
Lab 1 - Data Preprocessing

    # Importing libraries
    import numpy as np
    import pandas as pd
    import math
    import os
    print('Libraries Imported')

    #DOWNLOADING DATASET IF NOT PRESENT
    !wget -nc http://askoski.berkeley.edu/~zp/yelp_reviews.csv
    #!unzip yelp_reviews.zip
    print('Dataset Downloaded: yelp_reviews.csv')
    df=pd.read_csv('yelp_reviews.csv')
    print(df.head())
    print('Setup Complete')

    Libraries Imported
    --2023-09-06 03:11:30-- http://askoski.berkeley.edu/~zp/yelp_reviews.csv
    Resolving askoski.berkeley.edu (askoski.berkeley.edu)... 169.229.192.179
    Connecting to askoski.berkeley.edu (askoski.berkeley.edu)|169.229.192.179|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 376638166 (359M) [text/csv]
    Saving to: ‘yelp_reviews.csv’
    yelp_reviews.csv 100%[===================>] 359.19M 38.2MB/s in 7.6s
    2023-09-06 03:11:38 (47.6 MB/s) - ‘yelp_reviews.csv’ saved [376638166/376638166]
    Dataset Downloaded: yelp_reviews.csv
         type             business_id                 user_id  stars  \
    0  review  mxrXVZWc6PWk81gvOVNOUw  mv7shusL4Xb6TylVYBv4CA      4
    1  review  mxrXVZWc6PWk81gvOVNOUw  0aN5QPhs-VwK2vusKG0waQ      5
    2  review  kK4AzZ0YWI-U2G-paAL7Fg  0aN5QPhs-VwK2vusKG0waQ      5
    3  review  mxrXVZWc6PWk81gvOVNOUw  1JUwyYab-uJzEx_FRd81Zg      5
    4  review  mxrXVZWc6PWk81gvOVNOUw  2Zd3Xy8hUVmZkNg7RyNjhg      4

                                                    text        date  cool_votes  \
    0   Definitely try the duck dish. I rank it amon...  2011-06-13           0
    1  Big Ass Burger was awesome! Great $5 mojitos. ...  2011-06-25           1
    2             Unbelievable sandwiches! Good service.  2011-06-25           0
    3  Awesome, awesome, awesome! My mom and sister a...  2011-07-18           1
    4   I had the ribs they were great. The beer sele...  2011-07-19           1

       useful_votes  funny_votes
    0             0            0
    1             0            0
    2             0            0
    3             1            0
    4             0            1
    Setup Complete

Q1: What was the highest number of reviews for any one business_id?

For this task, we will need to group the reviews dataset by business_id. This will aggregate data for each business, which is what we need for this task. This can be done using the groupby method. Some pointers of how you could go about this question are listed below:

    yelp_businesses = yelp_dataset.groupby('business_id').size()

The .size() function counts the number of instances for each business_id, which gives us the number of reviews as each instance in this dataset is a review. The following command will sort this list, after which you can take note of the highest value:

    sorted_yelp_businesses = yelp_businesses.sort_values(ascending=False, inplace=False)

This approach allows you to see the data structure being used in the sort.
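To see what the grouping and sorting steps described above return, they can be tried on a small hand-made table first. The sketch below is only an illustration: `toy_reviews` and its values are made up for this example and are not part of the Yelp dataset.

```python
import pandas as pd

# Hypothetical toy table (not the Yelp data): one row per review.
toy_reviews = pd.DataFrame({
    "business_id": ["b1", "b2", "b1", "b3", "b1", "b2"],
    "user_id": ["u1", "u1", "u2", "u2", "u3", "u3"],
})

# size() counts the rows in each group, i.e. the reviews per business.
reviews_per_business = toy_reviews.groupby("business_id").size()

# sort_values returns a new Series; reviews_per_business is left unchanged.
sorted_counts = reviews_per_business.sort_values(ascending=False)

print(type(reviews_per_business))  # the result is a pandas Series
print(sorted_counts)
```

The result of `groupby(...).size()` is a Series indexed by `business_id`, so after sorting in descending order the first entry holds the largest count.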
A quicker approach to getting the max would be to use the max function:

    max(yelp_businesses)

    #Make sure you return the answer value in this function
    def q1(df):
        # Group the DataFrame by 'business_id' and count the number of reviews for each business
        yelp_businesses = df.groupby('business_id').size()
        # Sort the counts in descending order
        sorted_yelp_businesses = yelp_businesses.sort_values(ascending=False, inplace=False)
        # Get the highest number of reviews
        highest_reviews = sorted_yelp_businesses.iloc[0]  # You can also use max(sorted_yelp_businesses)
        return highest_reviews

    #This is a graded cell, do not edit
    print(q1(df))

    4128

Q2: On average, how many reviews did each business get?

    #Make sure you return the answer value in this function
    def q2(df):
        # Count the reviews per business, then average the counts
        average_reviews_per_business = df.groupby('business_id').size().mean()
        return average_reviews_per_business

    #This is a graded cell, do not edit
    print(q2(df))

    12.63413902163123

Q3: What is the average number of reviews per reviewer?

    #Make sure you return the answer value in this function
    def q3(df):
        # Count the reviews per reviewer, then average the counts
        average_reviews_per_reviewer = df.groupby('user_id').size().mean()
        return average_reviews_per_reviewer

    #This is a graded cell, do not edit
    print(q3(df))

    3.188511934933203

Q4: Calculate the total number of cool votes per reviewer, then average these totals across reviewers.

    #Make sure you return the answer value in this function
    def q4(df):
        # Sum cool votes per reviewer, then average the totals
        average_cool_votes_per_reviewer = df.groupby('user_id')['cool_votes'].sum().mean()
        return average_cool_votes_per_reviewer

    #This is a graded cell, do not edit
    print(q4(df))

    1.2417282785380945

Q5: Calculate the total number of funny votes per reviewer, then average these totals across reviewers.

    #Make sure you return the answer value in this function
    def q5(df):
        # Sum funny votes per reviewer, then average the totals
        average_funny_votes_per_reviewer = df.groupby('user_id')['funny_votes'].sum().mean()
        return average_funny_votes_per_reviewer

    #This is a graded cell, do not edit
    print(q5(df))

    1.10126486404605

Q6: Calculate the total number of useful votes each business gets, then average these totals across business_ids.

    #Make sure you return the answer in this function
    def q6(df):
        # Group by business_id, as the question asks, then average the totals
        average_useful_votes_per_business = df.groupby('business_id')['useful_votes'].sum().mean()
        return average_useful_votes_per_business

    #This is a graded cell, do not edit
    print(q6(df))

    2.484476138872867

(This output was recorded with the cell grouping by user_id, so it is the average per reviewer, not per business.)

Q7: On average, what percentage of a reviewer's votes are cool votes?

(hint1: calculate the percentage of cool votes for each reviewer, then average this percentage across reviewers)

(hint2: you should discard reviewers who have absolutely no votes - from cool, funny, or useful votes - from your calculation)
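The two hints for Q7 can be sketched on a tiny hand-made table. This is an illustrative sketch under stated assumptions, not the lab's own solution: `toy_reviews` and its values are hypothetical, and returning the result as a percentage (multiplied by 100) rather than a fraction is also an assumption.

```python
import pandas as pd

# Hypothetical toy data (not the Yelp dataset): one row per review.
toy_reviews = pd.DataFrame({
    "user_id":      ["u1", "u1", "u2", "u3", "u3"],
    "cool_votes":   [1, 1, 0, 0, 3],
    "funny_votes":  [0, 1, 0, 0, 0],
    "useful_votes": [1, 0, 0, 1, 0],
})

vote_columns = ["cool_votes", "funny_votes", "useful_votes"]

# Total votes of each kind per reviewer.
votes_per_reviewer = toy_reviews.groupby("user_id")[vote_columns].sum()

# Hint 2: keep only reviewers with at least one vote (avoids dividing by zero).
total_votes = votes_per_reviewer.sum(axis=1)
has_votes = total_votes > 0

# Hint 1: share of cool votes per reviewer, then the mean across reviewers.
cool_share = votes_per_reviewer.loc[has_votes, "cool_votes"] / total_votes[has_votes]
average_cool_percentage = cool_share.mean() * 100

print(average_cool_percentage)
```

In this toy table, reviewer `u2` has no votes of any kind and is dropped before averaging, so only `u1` and `u3` contribute to the mean.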
