2024_CSE6242_Hw3_Q1
docx
keyboard_arrow_up
School
Georgia Institute Of Technology *
*We aren’t endorsed by this school
Course
6242
Subject
Industrial Engineering
Date
Apr 3, 2024
Type
docx
Pages
11
Uploaded by kevin13mns
# HW3 - Q1 [15 pts]
## Important Notices
<div class="alert alert-block alert-danger">
WARNING: Do <strong>NOT</strong> add any cells to this Jupyter Notebook, because that will crash the autograder.
</div>
All instructions, code comments, etc. in this notebook **are part of the assignment instructions**. That is, if there is instructions about completing a task in this notebook, that task is not optional. <div class="alert alert-block alert-info">
You <strong>must</strong> implement the following functions in this notebook to receive credit.
</div>
`user()`
`clean_data()`
`common_pair()`
`time_of_cheapest_fare()`
`passenger_count_for_most_tip()`
`day_with_traffic()`
Each method will be auto-graded using different sets of parameters or data, to ensure that values are not hard-coded. You may assume we will only use your code to work with data from the NYC-TLC dataset during auto-grading. ### Helper functions
You are permitted to write additional helper functions, or use additional instance variables so long as the previously described functions work as required.
<div class="alert alert-block alert-danger">
WARNING: Do <strong>NOT</strong> remove or modify the following utility functions:
</div>
`load_data()`
#### Pyspark Imports
<span style="color:red">*Please don't modify the below cell*</span>
import pyspark
from pyspark.sql import SQLContext
from pyspark.sql.functions import hour, when, col, date_format, to_timestamp
from pyspark.sql.functions import *
#### Define Spark Context
<span style="color:red">*Please don't modify the below cell*</span>
sc = pyspark.SparkContext(appName="HW3-Q1")
sqlContext = SQLContext(sc)
#### Function to load data
<span style="color:red">*Please don't modify the below cell*</span>
def load_data():
df = sqlContext.read.option("header",True).csv("yellow_tripdata_2019-
01_short.csv")
return df
df = load_data()
#df.printSchema()
# Implement the functions below for this assignment:
## 1. Update the `user()` function
This function should return your GT username, eg: gburdell3
def user():
"""
:return: string
your GTUsername, NOT your 9-Digit GTId """ return 'gburdell3'
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
## 1a. [1 pts] Casting the columns into correct types
To process the data accurately, cast the following columns into given data type: - `passenger_count` - integer - `total_amount` - float - `tip_amount` - float
- `trip_distance` - float - `fare_amount` - float - `tpep_pickup_datetime` - timestamp - `tpep_dropoff_datetime` - timestamp All of the columns in the original data should be returned with the above columns converted to the correct data type.
def clean_data(df):
'''
input: df a dataframe
output: df a dataframe with the all the original columns
'''
# START YOUR CODE HERE ---------
# END YOUR CODE HERE -----------
return df
df = clean_data(df)
#df.select(['passenger_count', 'total_amount', 'tip_amount', 'trip_distance', 'fare_amount', 'tpep_pickup_datetime', 'tpep_pickup_datetime']).printSchema()
## 1b. [4 pts] What are the top 10 pickup-dropoff locations?
Find the top 10 pickup-dropoff location pairs having the most number of trips
(`count`). The location pairs should be ordered by `count` in descending order. If two or more pairs have the same number of trips, break the tie using
the trip amount per distance travelled (`trip_rate`). Use columns `total_amount` and `trip_distance` to calculate the trip amount per distance.
In certain situations, the pickup and dropoff locations may be the same.
Example output showing expected formatting:
```
+------------+------------+-----+------------------+
|PULocationID|DOLocationID|count| trip_rate|
+------------+------------+-----+------------------+
| 5| 7| 24| 5.148195749283391|
| 6| 4| 19| 1.420958193039484|
| 3| 2| 15|9.1928382713049282|
| 8| 8| 14|5.1029384838178493|
| 1| 3| 9|7.4403919838271223|
| 9| 2| 9|4.4039182884283829|
| 5| 7| 6| 5.19283827172823|
| 2| 1| 5| 9.233738511638532|
| 1| 9| 3| 8.293827128489212|
| 6| 6| 1| 4.192847382919223|
+------------+------------+-----+------------------+
```
def common_pair(df):
'''
input: df a dataframe
output: df a dataframe with following columns:
- PULocationID
- DOLocationID
- count
- trip_rate
trip_rate is the average amount (total_amount) per distance (trip_distance)
'''
# START YOUR CODE HERE ---------
# END YOUR CODE HERE -----------
return df
#common_pair(df).show()
## 1c. [4 pts] When is the trip cheapest (day vs night) ?
Divide each day into two parts: Day (from 9am to 8:59:59pm), and Night (from 9pm to 8:59:59am) and find the average total amount per unit distance travelled (use column `total_amount`) for both time periods. Sort
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
the result by `trip_rate` in ascending order to find when the fare rate is cheapest.
Example output showing expected formatting:
```
+---------+-----------------+
|day_night| trip_rate|
+---------+-----------------+
| Day|2.391827482920123|
| Night|4.292818223839121|
+---------+-----------------+
```
def time_of_cheapest_fare(df):
'''
input: df a dataframe
output: df a dataframe with following columns:
- day_night
- trip_rate
day_night will have 'Day' or 'Night' based on following conditions:
- From 9am to 8:59:59pm - Day
- From 9pm to 8:59:59am - Night
trip_rate is the average amount (total_amount) per distance
'''
# START YOUR CODE HERE ---------
# END YOUR CODE HERE -----------
return df
#time_of_cheapest_fare(df).show()
## 1d. [3 pts] Which passenger group size gives the most tips?
Filter the data for trips having fares (`fare_amount`) greater than $2 and the number of passengers (`passenger_count`) greater than 0. Find the average fare and tip (`tip_amount`) for all passenger group sizes and calculate the tip
percent (`tip_amount * 100 / fare_amount`). Sort by the tip percent in descending order to get which group size tips most generously.
Example output showing expected formatting:
```
+---------------+------------------+
|passenger_count| tip_percent|
+---------------+------------------+
| 4|20.129473829283771|
| 2|16.203913838738283|
| 3|14.283814930283822|
| 1|13.393817383918287|
| 6| 12.73928273747182|
| 5|12.402938192848471|
+---------------+------------------+
```
def passenger_count_for_most_tip(df):
'''
input: df a dataframe
output: df a dataframe with following columns:
- passenger_count
- tip_percent
trip_percent is the percent of tip out of fare_amount
'''
# START YOUR CODE HERE ---------
# END YOUR CODE HERE -----------
return df
#passenger_count_for_most_tip(df).show()
## 1e. [3 pts] Which day of week has the most traffic?
Sort the days of the week based on traffic, with the day having the highest traffic on the top. You can estimate traffic on the day of the week based on the average speed of all taxi trips on that day of the week. (Speed can be calculated by using the trip time and trip distance. Make sure to print speed in distance / hour). If the `average_speed` is the same for two or more days, the days should be ordered alphabetically. A day with a low average speed indicates high levels of traffic. The average speed may be 0 indicating very high levels of traffic. Not all days of the week may be present. You should use
`date_format` along with the appropriate [pattern
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
- Access to all documents
- Unlimited textbook solutions
- 24/7 expert homework help
letters](https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html) to
format the day of the week.
Example output showing expected formatting:
```
+-----------+------------------+
|day_of_week| average_speed|
+-----------+------------------+
| Sat| 0.0|
| Tue| 0.0|
| Fri|7.2938133827293934|
| Mon|10.123938472718228|
+-----------+------------------+
```
def day_with_traffic(df):
'''
input: df a dataframe
output: df a dataframe with following columns:
- day_of_week
- average_speed
day_of_week should be day of week e.g.) Mon, Tue, Wed, ...
average_speed (miles/hour) is calculated as distance / time (in hours)
'''
# START YOUR CODE HERE ---------
# END YOUR CODE HERE -----------
return df
#day_with_traffic(df).show()
# DO NOT MODIFY
df.select(['*'])