mparvez2023-DataScience Assignment 3.pdf

School: Florida Atlantic University
Course: CAP 4613
Subject: Statistics
Date: Feb 20, 2024


7/14/23, 7:19 PM DataScience Assignment 3 - Colaboratory
https://colab.research.google.com/drive/1h0xMUhziGWYy1Du_GL-ABXZNpA7KAi9f?authuser=1#scrollTo=N0bwFTRwsfYA&printMode=true

    import pandas as pd
    # Specify the file path (adjust to wherever nycflights.csv is stored)
    file_path = r'C:\Users\Mohammed Parvez\Desktop\nycflights.csv'
    # Read the data file into a DataFrame
    df = pd.read_csv(file_path)
    # Display the DataFrame
    df

           year  month  day  dep_time  dep_delay  arr_time  arr_delay  carrier  tailnum  flight  origin  des
    0      2013      6   30       940         15      1216         -4       VX   N626VA     407     JFK   LA
    1      2013      5    7      1657         -3      2104         10       DL   N3760C     329     JFK   SJ
    2      2013     12    8       859         -1      1238         11       DL   N712TW     422     JFK   LA
    3      2013      5   14      1841         -4      2122        -34       DL   N914DL    2391     JFK   TP
    4      2013      7   21      1102         -3      1230         -8       9E   N823AY    3652     LGA   OR
    ...     ...    ...  ...       ...        ...       ...        ...      ...      ...     ...     ...  ...
    32730  2013     10    8       752         -8       921        -28       9E   N8505Q    3611     JFK    P
    32731  2013      7    7       812         -3      1043          8       DL   N6713Y    1429     JFK   LA
    32732  2013      9    3      1057         -1      1319        -19       UA   N77871    1545     EWR   IA
    32733  2013     10   15       844         56      1045         60       B6   N258JB    1273     JFK   CH
    32734  2013      3   28      1813         -3      1942        -23       UA   N36272    1053     EWR   CL

    32735 rows × 16 columns

Q1) Read the nycflights.csv file using pandas and name it df. Create a new data frame by selecting dep_time, dep_delay, arr_time, arr_delay, and tailnum, name it newyork_flight_new, and display its first five entries.

    import pandas as pd
    # df was already read from nycflights.csv above
    # df = pd.read_csv(r'C:\Users\Mohammed Parvez\Desktop\nycflights.csv')
    # Create a new DataFrame with selected columns (copy() makes it independent of df)
    newyork_flight_new = df[['dep_time', 'dep_delay', 'arr_time', 'arr_delay', 'tailnum']].copy()
    # Display the first five entries of the new DataFrame
    newyork_flight_new.head()

       dep_time  dep_delay  arr_time  arr_delay  tailnum
    0       940         15      1216         -4   N626VA
    1      1657         -3      2104         10   N3760C
    2       859         -1      1238         11   N712TW
    3      1841         -4      2122        -34   N914DL
    4      1102         -3      1230         -8   N823AY

Q2) Filter newyork_flight_new by selecting rows with a departure time greater than 2000 and name the result dep_time_2000. How many rows were deleted?

    # Filter the DataFrame by selecting rows with departure time greater than 2000
    dep_time_2000 = newyork_flight_new[newyork_flight_new['dep_time'] > 2000]
    # Count the number of rows deleted (not meeting the condition)
    rows_deleted = len(newyork_flight_new) - len(dep_time_2000)
    # Display the number of rows deleted
    print("Number of rows deleted:", rows_deleted)

    Number of rows deleted: 29166

Q3) Do we have any missing values in dep_delay? If yes, replace the missing values with the median of dep_delay.

    import pandas as pd
    import numpy as np

    # Check for missing values in 'dep_delay'
    missing_values = newyork_flight_new['dep_delay'].isnull().sum()
    if missing_values > 0:
        # Calculate the median of 'dep_delay'
        median_dep_delay = newyork_flight_new['dep_delay'].median()
        # Replace missing values with the median; assign() returns a new DataFrame,
        # which avoids calling fillna(inplace=True) on a column selection
        newyork_flight_new = newyork_flight_new.assign(
            dep_delay=newyork_flight_new['dep_delay'].fillna(median_dep_delay))
        print("Missing values in 'dep_delay' column were replaced with the median.")
    else:
        print("No missing values found in 'dep_delay' column.")

    No missing values found in 'dep_delay' column.

Q4) Use the query function to filter all the rows with airtime greater than 120 minutes and distance greater than 700 km.

    import pandas as pd
    # df is the nycflights DataFrame loaded earlier; the path below points to a
    # different dataset and is left commented out
    # file_path = 'C:\\Users\\Mohammed Parvez\\Desktop\\student-por.csv'
    # df = pd.read_csv(file_path, sep=';')
    # Filter the rows using the query function (the column is named air_time)
    filtered_data = df.query('air_time > 120 and distance > 700')
    # Display the filtered DataFrame
    print(filtered_data)

           year  month  day  dep_time  dep_delay  arr_time  arr_delay  carrier  \
    0      2013      6   30       940         15      1216         -4       VX
    1      2013      5    7      1657         -3      2104         10       DL
    2      2013     12    8       859         -1      1238         11       DL
    3      2013      5   14      1841         -4      2122        -34       DL
    5      2013      1    1      1817         -3      2008          3       AA
    ...     ...    ...  ...       ...        ...       ...        ...      ...
    32720  2013      4   17      1023         -7      1341        -24       VX
    32722  2013      7    9       600          0       822         -8       AA
    32726  2013      2    4      1558         -2      1854          4       DL
    32731  2013      7    7       812         -3      1043          8       DL
    32732  2013      9    3      1057         -1      1319        -19       UA

           tailnum  flight  origin  dest  air_time  distance  hour  minute
    0       N626VA     407     JFK   LAX       313      2475     9      40
    1       N3760C     329     JFK   SJU       216      1598    16      57
    2       N712TW     422     JFK   LAX       376      2475     8      59
    3       N914DL    2391     JFK   TPA       135      1005    18      41
    5       N3AXAA     353     LGA   ORD       138       733    18      17
    ...        ...     ...     ...   ...       ...       ...   ...     ...
    32720   N842VA     187     EWR   SFO       351      2565    10      23
    32722   N3ERAA     707     LGA   DFW       178      1389     6       0
    32726   N3737C    1331     JFK   DEN       238      1626    15      58
    32731   N6713Y    1429     JFK   LAS       286      2248     8      12
    32732   N77871    1545     EWR   IAH       180      1400    10      57

    [17840 rows x 16 columns]

Q5) Create a new data frame from dep_time, dep_delay, arr_time, arr_delay, tailnum, and destination using filter() and name it df1.

    import pandas as pd
    # Assuming you already have a DataFrame named 'df'
    # Select the desired columns using filter()
    # Note: the destination column in df is named 'dest'. filter() silently skips
    # labels that do not exist, so 'destination' adds nothing and the output
    # below has only five columns.
    df1 = df.filter(['dep_time', 'dep_delay', 'arr_time', 'arr_delay', 'tailnum', 'destination'])
    # Display the new DataFrame
    print(df1)

           dep_time  dep_delay  arr_time  arr_delay  tailnum
    0           940         15      1216         -4   N626VA
    1          1657         -3      2104         10   N3760C
    2           859         -1      1238         11   N712TW
    3          1841         -4      2122        -34   N914DL
    4          1102         -3      1230         -8   N823AY
    ...         ...        ...       ...        ...      ...

    32730       752         -8       921        -28   N8505Q
    32731       812         -3      1043          8   N6713Y
    32732      1057         -1      1319        -19   N77871
    32733       844         56      1045         60   N258JB
    32734      1813         -3      1942        -23   N36272

    [32735 rows x 5 columns]

Q6) Add a new column "total_delay" to df1 using assign(). total_delay is calculated by adding dep_delay and arr_delay.

    # Assuming you already have a DataFrame named 'df1'
    # Add a new column "total_delay" using assign()
    df1 = df1.assign(total_delay=df1['dep_delay'] + df1['arr_delay'])
    # Display the updated DataFrame
    print(df1)

           dep_time  dep_delay  arr_time  arr_delay  tailnum  total_delay
    0           940         15      1216         -4   N626VA           11
    1          1657         -3      2104         10   N3760C            7
    2           859         -1      1238         11   N712TW           10
    3          1841         -4      2122        -34   N914DL          -38
    4          1102         -3      1230         -8   N823AY          -11
    ...         ...        ...       ...        ...      ...          ...
    32730       752         -8       921        -28   N8505Q          -36
    32731       812         -3      1043          8   N6713Y            5
    32732      1057         -1      1319        -19   N77871          -20
    32733       844         56      1045         60   N258JB          116
    32734      1813         -3      1942        -23   N36272          -26

    [32735 rows x 6 columns]

Q7) Group df according to "months" and find the average air time and maximum distance traveled. Use groupby() and agg() functions.

    # Group the DataFrame by month and calculate average air time and maximum distance
    # (the columns in df are named 'month' and 'air_time')
    grouped_df = df.groupby('month').agg({'air_time': 'mean', 'distance': 'max'})
    # Display the grouped DataFrame
    print(grouped_df)

             air_time  distance
    month
    1      152.026054      4983
    2      149.713911      4983
    3      151.598466      4983
    4      152.737864      4983
    5      147.203474      4983
    6      147.172035      4983
    7      147.390956      4983
    8      146.139583      4983
    9      145.423349      4983
    10     145.775312      4983
    11     158.226857      4983
    12     162.330265      4983

    import pandas as pd
    # Specify the file path
    # file_path_1= 'C:data\student-por.csv'
    file_path_1= './student-por.csv'
    # Read the data file into a DataFrame
    df = pd.read_csv(file_path_1)
    # Display the DataFrame
    df
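
Looking back at Q3: the flights data had no missing dep_delay values, so the median-replacement branch never actually ran. The sketch below exercises that branch on a small made-up DataFrame. The name toy and its values are assumptions for illustration only and are not taken from nycflights.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical delays with two gaps; not taken from nycflights.csv.
toy = pd.DataFrame({"dep_delay": [15.0, np.nan, -1.0, -4.0, np.nan]})

# Count the gaps, then compute the median (NaN values are skipped by default).
n_missing = int(toy["dep_delay"].isna().sum())
median_delay = toy["dep_delay"].median()

# Assign the filled column back instead of calling fillna(inplace=True)
# on a column selection.
toy["dep_delay"] = toy["dep_delay"].fillna(median_delay)

print(n_missing, median_delay, toy["dep_delay"].tolist())
```

Assigning the result of fillna() back to the column is one common way to avoid modifying a temporary selection; using assign() on the whole DataFrame would work as well.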

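
Q4, Q5 and Q7 all depend on using the exact column names that appear in the printed output (air_time, dest, month). The following is a minimal sketch on made-up rows that only borrow those column names. The DataFrame toy, its values, and the result names avg_air_time and max_distance are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data that only mimics the nycflights column names.
toy = pd.DataFrame({
    "month": [1, 1, 2, 2],
    "dep_delay": [15, -3, -1, -4],
    "arr_delay": [-4, 10, 11, -34],
    "dest": ["LAX", "SJU", "LAX", "TPA"],
    "air_time": [313, 100, 376, 135],
    "distance": [2475, 500, 2475, 1005],
})

# Q4: query() must use the real column name, air_time.
long_flights = toy.query("air_time > 120 and distance > 700")

# Q5: filter(items=...) silently skips labels that do not exist,
# so 'destination' contributes no column while 'dest' does.
with_typo = toy.filter(items=["dep_delay", "arr_delay", "destination"])
with_dest = toy.filter(items=["dep_delay", "arr_delay", "dest"])

# Q7: named aggregation gives the result columns descriptive names.
by_month = toy.groupby("month").agg(
    avg_air_time=("air_time", "mean"),
    max_distance=("distance", "max"),
)

print(len(long_flights), list(with_typo.columns), list(with_dest.columns))
print(by_month)
```

If a misspelled label should raise an error instead of being dropped quietly, plain bracket selection such as toy[["dep_delay", "arr_delay", "dest"]] is an alternative, since it raises a KeyError for labels that are missing.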