2024_CSE6242_Hw3_Q1

docx

School

Georgia Institute Of Technology *

*We aren’t endorsed by this school

Course

6242

Subject

Industrial Engineering

Date

Apr 3, 2024

Type

docx

Pages

11

Uploaded by kevin13mns

Report
# HW3 - Q1 [15 pts] ## Important Notices <div class="alert alert-block alert-danger"> WARNING: Do <strong>NOT</strong> add any cells to this Jupyter Notebook, because that will crash the autograder. </div> All instructions, code comments, etc. in this notebook **are part of the assignment instructions**. That is, if there is instructions about completing a task in this notebook, that task is not optional. <div class="alert alert-block alert-info"> You <strong>must</strong> implement the following functions in this notebook to receive credit. </div> `user()` `clean_data()`
`common_pair()` `time_of_cheapest_fare()` `passenger_count_for_most_tip()` `day_with_traffic()` Each method will be auto-graded using different sets of parameters or data, to ensure that values are not hard-coded. You may assume we will only use your code to work with data from the NYC-TLC dataset during auto-grading. ### Helper functions You are permitted to write additional helper functions, or use additional instance variables so long as the previously described functions work as required. <div class="alert alert-block alert-danger"> WARNING: Do <strong>NOT</strong> remove or modify the following utility functions: </div> `load_data()` #### Pyspark Imports <span style="color:red">*Please don't modify the below cell*</span>
import pyspark from pyspark.sql import SQLContext from pyspark.sql.functions import hour, when, col, date_format, to_timestamp from pyspark.sql.functions import * #### Define Spark Context <span style="color:red">*Please don't modify the below cell*</span> sc = pyspark.SparkContext(appName="HW3-Q1") sqlContext = SQLContext(sc) #### Function to load data <span style="color:red">*Please don't modify the below cell*</span> def load_data(): df = sqlContext.read.option("header",True).csv("yellow_tripdata_2019- 01_short.csv") return df df = load_data() #df.printSchema() # Implement the functions below for this assignment: ## 1. Update the `user()` function This function should return your GT username, eg: gburdell3 def user(): """ :return: string your GTUsername, NOT your 9-Digit GTId """ return 'gburdell3'
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
## 1a. [1 pts] Casting the columns into correct types To process the data accurately, cast the following columns into given data type: - `passenger_count` - integer - `total_amount` - float - `tip_amount` - float - `trip_distance` - float - `fare_amount` - float - `tpep_pickup_datetime` - timestamp - `tpep_dropoff_datetime` - timestamp All of the columns in the original data should be returned with the above columns converted to the correct data type. def clean_data(df): ''' input: df a dataframe output: df a dataframe with the all the original columns ''' # START YOUR CODE HERE --------- # END YOUR CODE HERE ----------- return df df = clean_data(df)
#df.select(['passenger_count', 'total_amount', 'tip_amount', 'trip_distance', 'fare_amount', 'tpep_pickup_datetime', 'tpep_pickup_datetime']).printSchema() ## 1b. [4 pts] What are the top 10 pickup-dropoff locations? Find the top 10 pickup-dropoff location pairs having the most number of trips (`count`). The location pairs should be ordered by `count` in descending order. If two or more pairs have the same number of trips, break the tie using the trip amount per distance travelled (`trip_rate`). Use columns `total_amount` and `trip_distance` to calculate the trip amount per distance. In certain situations, the pickup and dropoff locations may be the same. Example output showing expected formatting: ``` +------------+------------+-----+------------------+ |PULocationID|DOLocationID|count| trip_rate| +------------+------------+-----+------------------+ | 5| 7| 24| 5.148195749283391| | 6| 4| 19| 1.420958193039484| | 3| 2| 15|9.1928382713049282| | 8| 8| 14|5.1029384838178493| | 1| 3| 9|7.4403919838271223| | 9| 2| 9|4.4039182884283829| | 5| 7| 6| 5.19283827172823| | 2| 1| 5| 9.233738511638532| | 1| 9| 3| 8.293827128489212| | 6| 6| 1| 4.192847382919223| +------------+------------+-----+------------------+
``` def common_pair(df): ''' input: df a dataframe output: df a dataframe with following columns: - PULocationID - DOLocationID - count - trip_rate trip_rate is the average amount (total_amount) per distance (trip_distance) ''' # START YOUR CODE HERE --------- # END YOUR CODE HERE ----------- return df #common_pair(df).show() ## 1c. [4 pts] When is the trip cheapest (day vs night) ? Divide each day into two parts: Day (from 9am to 8:59:59pm), and Night (from 9pm to 8:59:59am) and find the average total amount per unit distance travelled (use column `total_amount`) for both time periods. Sort
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
the result by `trip_rate` in ascending order to find when the fare rate is cheapest. Example output showing expected formatting: ``` +---------+-----------------+ |day_night| trip_rate| +---------+-----------------+ | Day|2.391827482920123| | Night|4.292818223839121| +---------+-----------------+ ``` def time_of_cheapest_fare(df): ''' input: df a dataframe output: df a dataframe with following columns: - day_night - trip_rate day_night will have 'Day' or 'Night' based on following conditions: - From 9am to 8:59:59pm - Day - From 9pm to 8:59:59am - Night trip_rate is the average amount (total_amount) per distance '''
# START YOUR CODE HERE --------- # END YOUR CODE HERE ----------- return df #time_of_cheapest_fare(df).show() ## 1d. [3 pts] Which passenger group size gives the most tips? Filter the data for trips having fares (`fare_amount`) greater than $2 and the number of passengers (`passenger_count`) greater than 0. Find the average fare and tip (`tip_amount`) for all passenger group sizes and calculate the tip percent (`tip_amount * 100 / fare_amount`). Sort by the tip percent in descending order to get which group size tips most generously. Example output showing expected formatting: ``` +---------------+------------------+ |passenger_count| tip_percent| +---------------+------------------+ | 4|20.129473829283771| | 2|16.203913838738283| | 3|14.283814930283822| | 1|13.393817383918287| | 6| 12.73928273747182| | 5|12.402938192848471| +---------------+------------------+ ```
def passenger_count_for_most_tip(df): ''' input: df a dataframe output: df a dataframe with following columns: - passenger_count - tip_percent trip_percent is the percent of tip out of fare_amount ''' # START YOUR CODE HERE --------- # END YOUR CODE HERE ----------- return df #passenger_count_for_most_tip(df).show() ## 1e. [3 pts] Which day of week has the most traffic? Sort the days of the week based on traffic, with the day having the highest traffic on the top. You can estimate traffic on the day of the week based on the average speed of all taxi trips on that day of the week. (Speed can be calculated by using the trip time and trip distance. Make sure to print speed in distance / hour). If the `average_speed` is the same for two or more days, the days should be ordered alphabetically. A day with a low average speed indicates high levels of traffic. The average speed may be 0 indicating very high levels of traffic. Not all days of the week may be present. You should use `date_format` along with the appropriate [pattern
Your preview ends here
Eager to read complete document? Join bartleby learn and gain access to the full version
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help
letters](https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html) to format the day of the week. Example output showing expected formatting: ``` +-----------+------------------+ |day_of_week| average_speed| +-----------+------------------+ | Sat| 0.0| | Tue| 0.0| | Fri|7.2938133827293934| | Mon|10.123938472718228| +-----------+------------------+ ``` def day_with_traffic(df): ''' input: df a dataframe output: df a dataframe with following columns: - day_of_week - average_speed day_of_week should be day of week e.g.) Mon, Tue, Wed, ... average_speed (miles/hour) is calculated as distance / time (in hours) ''' # START YOUR CODE HERE ---------
# END YOUR CODE HERE ----------- return df #day_with_traffic(df).show() # DO NOT MODIFY df.select(['*'])