Nayini01467237_HW4

.pdf

School

George Mason University *

*We aren’t endorsed by this school

Course

409

Subject

Computer Science

Date

Dec 6, 2023

Type

pdf

Pages

1

Uploaded by CountWombatPerson996

HW #4 ( total: 50 pts ) This homework primarily focuses on Descriptive Statistics and Data Visualization in Python. Note: (1) This is an individual assignment. It accounts for 12% of your grade. (2) Due Date: November 20th, 2023 Monday 7 : 20 : 59pm. (3) Make sure to run your code for each cell so that the output/result is visible underneath each code cell. Do not create any new cells on this notebook file. (4) Submissions: you need to submit your completed Jupyter Notebook file (.ipynb) AND a PDF version of your completed Jupyter Notebook to Blackboard. **Both files must be uploaded using one Blackboard submission (these are not to be submitted separately using two submissions)**. Please make sure that the PDF file shows the output of each code cell prior to submitting your files to the system. Submissions without the PDF file as instructed will result in a grade penality of 25%. Part 0: Get Ready (1 pts) T0-1: First, rename this Jupter Notebook file exactly as your last name followed by your G number without “G” and then followed by suffix “_HW4”. There should not be extra spaces or underlines in between. For example, the Jupyter Notebook file you submit should be like Ye12345678_HW4.ipynb. Same applies to the PDF version of the file. You will complete the tasks below on this Jyputer Notebook file, and then submit the completed the Jupyter Notebook file as well as the PDF version of the Juputer Notebook file. (1 pts) T0-2: Run the cell below. /Users/wadeyy03/Downloads Part 1: Descriptive Statistics & Data Visualization - 1 In this part, you will continue to work with the dataset downloaded from the Virginia Department of Education ( https://www.doe.virginia.gov/data-policy-funding/data-reports ), you will import it into DataFrame, and play with it to generate descriptive statistics and appropriate visualizations. The dataset shows the performance of public high schools in 2022 in three dimensions: Math, Science, and Graduation Rate (some schools with missing data are excluded). An explanation of the variables (or columns) in the dataset is listed below: Variable Explanation ZIP zipcode of the school School_Name school name Division the division the school belongs to Math_Pass_Rate proportion of students who pass the math exam, a value of 50 means 50% students pass Math. Science_Pass_Rate proportion of students who pass the science exam, a value of 50 means 50% students pass Science. Graduation_Rate proportion of students who graduates, a value of 50 means 50% students graduate. You need to download the csv file " high_school_2022.csv " provided on Blackboard and import it into a DataFrame first. There are two ways to ensure the csv file is imported correctly. 1. Use the absolute file path. For example, if the full path of your file is *"C:/ipynb/HW/high_school_2022.csv"*, you are going to pass on this entire path with quotations as an argument in the *read_csv()* function: read_csv("C:/ipynb/HW/high_school_2022.csv") 2. copy the csv file to your current working directory (this is the output from T0-2), then you can do relative file path and just use the filename. For example, if T0-2 produces output *"C:\Users\SY\Documents\ipynb"*, you are going to copy the csv file to the folder *"C:\Users\SY\Documents\ipynb"*, then you can just do: read_csv("high_school_2022.csv") (2 pts) T1-1: (To help you get you started and make sure the data is loaded correctly) Download the csv file " high_school_2022.csv ", import the csv file into a DataFrame, print the first 7 rows, then print the last 7 rows. ZIP School_Name Division Math_Pass_Rate \ 0 24263 Lee High Lee County 74.43 1 24502 Brookville High Campbell County 91.27 2 20124 Centreville High Fairfax County 79.49 3 22712 Liberty High Fauquier County 70.49 4 22655 Sherando High Frederick County 78.28 5 22302 Alexandria City High Alexandria City 46.35 6 24614 Grundy High Buchanan County 86.87 Science_Pass_Rate Graduation_Rate 0 65.27 79.67 1 73.82 91.46 2 77.36 89.61 3 65.71 90.00 4 76.96 92.70 5 49.47 80.26 6 62.30 85.57 ================================== ZIP School_Name Division Math_Pass_Rate \ 319 23692 York River Academy York County 90.00 320 23875 Prince George High Prince George County 78.69 321 24641 Richlands High Tazewell County 90.15 322 22847 Mountain View High Shenandoah County 68.28 323 20148 Independence High Loudoun County 83.33 324 20105 Lightridge High Loudoun County 78.31 325 20164 W.O. Robey High Loudoun County 78.31 Science_Pass_Rate Graduation_Rate 319 95.24 89.47 320 76.93 89.68 321 82.52 87.15 322 63.97 90.16 323 81.90 98.22 324 82.77 100.00 325 82.77 100.00 (3 pts) T1-2: Which high school has the highest pass rate of Science ? Use the appropriate DataFrame method(s) to display the result. 98.95 ZIP 22312 School_Name Thomas Jefferson High for Science and Technology Division Fairfax County Math_Pass_Rate 100.0 Science_Pass_Rate 98.95 Graduation_Rate 100.0 Name: 77, dtype: object (4 pts) T1-3: Which high school in the Loudoun County division has the highest pass rate of Math ? Use the appropriate DataFrame method(s) to display the result. 91.62 ZIP 20152 School_Name Freedom High Division Loudoun County Math_Pass_Rate 91.62 Science_Pass_Rate 90.23 Graduation_Rate 97.3 Name: 113, dtype: object (4 pts) T1-4: For each division, what is the average pass rate of Math ? and what is the average path rate of Science ? Use appropriate DataFrame method(s) to print one output that answers both questions. Math_Pass_Rate Science_Pass_Rate Division Accomack County 83.453333 72.373333 Albemarle County 70.880000 70.472500 Alexandria City 46.350000 49.470000 Alleghany County 78.880000 52.820000 Amelia County 75.320000 60.810000 ... ... ... Williamsburg-James City County 75.543333 74.640000 Winchester City 65.060000 62.360000 Wise County 93.183333 80.050000 Wythe County 93.016667 82.413333 York County 86.494000 83.760000 [130 rows x 2 columns] (5 pts) T1-5: Overall, is there a positive or negative or no obvious relationship between pass rate of Math and pass rate of Science ? Use an appropriate visualization to present a graph/plot to the screen to help answer this question. Make sure the graph/plot has at least a meaningful title, and meaningful descriptions of the horizontal axis and the vertial axis. Then based on the graph, use print() to print a statement that explains your answer to the question. In Majority of the Schools the Passing Rates of Maths and Science are Proportional in Nature. (7 pts) T1-6: Focuses on these 4 divisions in Northern Virginia only: Fairfax County, Loudoun County, Arlington County, and Prince William County. How does the graduation rate of a division compare to other divisions ? Use an appropriate visualization to present a graph/plot to the screen to help answer this question. Make sure the graph/plot has at least a meaningful title, and meaningful descriptions of the horizontal axis and the vertial axis. Division Arlington County 80.400000 Fairfax County 85.161071 Loudoun County 95.626111 Prince William County 88.124167 Name: Graduation_Rate, dtype: float64 Part 2: Descriptive Statistics & Data Visualization - 2 In this part, you will import a dataset downloaded from Redfin ( https://www.redfin.com/news/data-center/ ) into DataFrame, and play with it to generate descriptive statistics and appropriate visualizations. The dataset shows the number of homes sold in each month from year 2018 to year 2022 for seven cities in the Fairfax county. An explanation of the variables (or columns) in the dataset is listed below (most are self-explanatory): Variable Explanation Year year of each reported sales record Month numerical month of each reported sales record Month_Name (string) month name of each reported sales record Quarter (string) quarter of each reported sales record City city name Active_Listings number of houses listed for sale in the month Homes_Sold number of houses (in units) sold in the month Median_Sale_Price median sales price ( $ ) of sold houses in the month (2 pts) T2-1: (To help you get you started and make sure the data is loaded correctly) Download the csv file " home_sales.csv ", import the csv file into a DataFrame, print the first 7 rows, then print the last 7 rows. Year Month Month_Name Quarter City Active_Listings Homes_Sold \ 0 2018 1 January Q1 Arlington 599 165 1 2018 1 January Q1 Centreville 121 68 2 2018 1 January Q1 Fairfax 182 59 3 2018 1 January Q1 Herndon 78 20 4 2018 1 January Q1 McLean 220 54 5 2018 1 January Q1 Oakton 92 36 6 2018 1 January Q1 Vienna 68 6 Median_Sale_Price 0 514000 1 374000 2 509000 3 460000 4 1127000 5 460000 6 619000 ====================== Year Month Month_Name Quarter City Active_Listings Homes_Sold \ 413 2022 12 December Q4 Arlington 473 142 414 2022 12 December Q4 Centreville 99 41 415 2022 12 December Q4 Fairfax 95 47 416 2022 12 December Q4 Herndon 53 18 417 2022 12 December Q4 McLean 152 41 418 2022 12 December Q4 Oakton 69 40 419 2022 12 December Q4 Vienna 54 18 Median_Sale_Price 413 536000 414 470000 415 603000 416 605000 417 1499000 418 618000 419 1398000 (3 pts) T2-2: For each city, what is the average number of sold homes per month (i.e., for each city, the mean of all monthly records) ? Use appropriate DataFrame method(s) to print the output that answers this question. (Optional) you can also use the .astype() method to round up the values in a DataFrame or Series to integers: DataFrame_name.astype('int') or Series_name.astype('int'). City Arlington 251 Centreville 91 Fairfax 81 Herndon 43 McLean 67 Oakton 50 Vienna 20 Name: Homes_Sold, dtype: int64 (4 pts) T2-3: For each month of each year, what is the average number of sold houses across all cities (i.e., for each month of each year, the mean of all monthly records including all cities) ? Use appropriate DataFrame method(s) to print the output that answers this question. (Optional) you can also use the .astype() method to round up the values in a DataFrame or Series to integers: DataFrame_name.astype('int') or Series_name.astype('int'). Year Month Month_Name 2018 1 January 58 2 February 55 3 March 84 4 April 98 5 May 123 6 June 123 7 July 106 8 August 96 9 September 70 10 October 78 11 November 76 12 December 83 2019 1 January 58 2 February 56 3 March 72 4 April 101 5 May 113 6 June 109 7 July 102 8 August 92 9 September 76 10 October 73 11 November 63 12 December 75 2020 1 January 49 2 February 55 3 March 82 4 April 76 5 May 69 6 June 87 7 July 110 8 August 107 9 September 99 10 October 100 11 November 90 12 December 89 2021 1 January 63 2 February 77 3 March 103 4 April 123 5 May 125 6 June 134 7 July 123 8 August 107 9 September 95 10 October 98 11 November 94 12 December 91 2022 1 January 58 2 February 65 3 March 90 4 April 95 5 May 107 6 June 99 7 July 81 8 August 81 9 September 67 10 October 57 11 November 47 12 December 49 Name: Homes_Sold, dtype: int64 (7 pts) T2-4: What is the monthly trend of sold homes in the data, i.e., how does the average value of homes sold (across all cities and all years, i.e., there is one average for January, one average for February, etc.) change from month to month ? Use an appropriate visualization to present a graph/plot to the screen to help answer this question. Make sure: (1) the graph/plot has a meaningful title, and meaningful descriptions of the horizontal axis and the vertial axis; (2) the value for each year is clearly represented and marked in the graph/plot; (3) the ticks in the x-axis corresponds exactly to the month. Month 1 57 2 62 3 86 4 99 5 107 6 110 7 104 8 96 9 81 10 81 11 74 12 77 Name: Homes_Sold, dtype: int64 (7 pts) T2-5: In the class, we have learned using .mean() method to calcualte the average/mean of values. Similarly, you can use the .sum() method of DataFrame to do a sum of values to calculate how many homes in total are sold for each city from 2018-2022 (the observation window of the data). What are the top 5 cities that have sold the most number of homes ? Present the results both numerically and visually . Use appropriate DataFrame method(s) and an appropriate visualization to answer this question. Make sure the graph/plot has at least a meaningful title, and meaningful descriptions of the horizontal axis and the vertial axis. For this task, you are also going to learn and use the .nlargest() method of DataFrame. First, read the code & comments below, and run the code to learn and understand the .nlargest() method. Student Score 3 Sarah 97 1 Mike 95 0 90 1 95 2 86 3 97 4 78 Name: Score, dtype: int64 <class 'pandas.core.series.Series'> 3 97 1 95 Name: Score, dtype: int64 Now write your Python code to complete the task T2-5. Again, make sure the graph/plot has at least a meaningful title, and meaningful descriptions of the horizontal axis and the vertial axis. City Arlington 15101 Centreville 5516 Fairfax 4918 McLean 4055 Oakton 3029 Name: Homes_Sold, dtype: int64 End of Homework. Once you have completed it and run all the cells, please remember to print the Jupyter Notebook file as a PDF file and submit it along with your Jupyter Notebook file, as both are required. In [1]: # Before you begin, run this cell with with the code provided below. # This will print the current working directory # This will also help you loate your Jupyter Notebook file on your computer import os print ( os . getcwd ()) In [8]: import pandas as pd # T1-1 python solution code below df = pd . read_csv ( "high_school_2022.csv" ) print ( df . head ( 7 ), "\n" ) print ( '==================================' ) print ( df . tail ( 7 )) In [10]: # T1-2 python solution code below # What is the highest pass rate of Science among all schools? Print it. print ( df [ 'Science_Pass_Rate' ] . max ()) # Find the high school df . loc [ df [ 'Science_Pass_Rate' ] . idxmax ()] Out[10]: In [14]: # T1-3 python solution code below # What is the highest pass rate of Math among all schools in the Loudoun County division? Print it. print ( df [ df [ 'Division' ] == 'Loudoun County' ][ 'Math_Pass_Rate' ] . max ()) # Find the school df . loc [ df [ df [ 'Division' ] == 'Loudoun County' ][ 'Math_Pass_Rate' ] . idxmax ()] Out[14]: In [22]: # T1-4 python solution code below by_division = df . groupby ( 'Division' ) Science_and_Maths_mean_by_division = by_division [[ "Math_Pass_Rate" , "Science_Pass_Rate" ]] . mean () print ( Science_and_Maths_mean_by_division ) In [250… import matplotlib.pyplot as plt # need it to present the graphs to the screen # T1-5 python solution code below # code for visualization below df . plot . scatter ( x = "Math_Pass_Rate" , y = "Science_Pass_Rate" , xlabel = "Maths Passing Rate" , ylabel = "Science Passing Rate" , title = "Scatter plot Science Pass Rate vs Maths Pass Ra plt . show () # Once you have run the visualization code to see the graph, # now add your code to print your answer/conclusion below, and run the entire cell again. statement = "In Majority of the Schools the Passing Rates of Maths and Science are Proportional in Nature." print ( statement ) In [251… # T1-6 python solution code below #County = ["Fairfax County","Loudoun County","Arlington County","Prince William County"] #divisions_4 = df[df['Division'].isin(County) divisions_4 = df [ df [ 'Division' ] . isin ([ "Fairfax County" , "Loudoun County" , "Arlington County" , "Prince William County" ])] by_divisions = divisions_4 . groupby ( "Division" ) mean_graduation_rate_by_division = by_divisions [ "Graduation_Rate" ] . mean () print ( mean_graduation_rate_by_division ) mean_graduation_rate_by_division . plot . bar ( x = "Divison" , y = "Graduation_Rate" , xlabel = "County Name" , ylabel = "Average Graduation Rate" , rot = 10 , title = "Average Graduation Rate by plt . show () In [237… import pandas as pd # T2-1 python solution code below df1 = pd . read_csv ( "home_sales.csv" ) print ( df1 . head ( 7 )) print ( "\n ====================== \n" ) print ( df1 . tail ( 7 )) In [252… # T2-2 python solution code below by_month_city = df1 . groupby ([ 'City' ]) #used "Month" in grouping as "Month_Name" is saved as a 'string' not a 'datetime' variable so grouping by "Month" gives order to the avg_house_sold_by_month_city = by_month_city [ "Homes_Sold" ] . mean () print ( avg_house_sold_by_month_city . astype ( 'int' )) In [253… # T2-3 python solution code below by_year_month = df1 . groupby ([ "Year" , "Month" , "Month_Name" ]) #used "Month" in grouping as "Month_Name" is saved as a 'string' not a 'datetime' variable so grouping by "Mont avg_houses_sold = by_year_month [ "Homes_Sold" ] . mean () print ( avg_houses_sold . astype ( 'int' )) In [254… import matplotlib.pyplot as plt # need it to present the graphs to the screen # T2-4 python solution code below months = [ "January" , "February" , "March" , "April" , "May" , "June" , "July" , "August" , "September" , "October" , "November" , "December" ] homes = df1 . groupby ([ "Month" ])[ "Homes_Sold" ] . mean () . astype ( 'int' ) print ( homes ) homes . plot . line ( x = "Month" , y = "Homes_Sold" , figsize = ( 10 , 10 ), marker = 'x' , color = 'Red' , rot = 45 ) plt . xticks ( range ( 1 , 13 ), months ) plt . show () In [255… # An example of .nlargest() method # Read the code, and run the code in this cell to learn the method. import pandas as pd # df_a is a dataframe containing students and their scores. df_a = pd . DataFrame ( { "Student" : [ "Taylor" , "Mike" , "John" , "Sarah" , "Julie" ], 'Score' : [ 90 , 95 , 86 , 97 , 78 ] } ) # .nlargest() on a DataFrame # print the rows with the two largest values in the Score column in desending order print ( df_a . nlargest ( 2 , 'Score' )) # the above should print the row for Sarah with a score 97, and then the row for Mike with a score of 95 print () # .nlargest() can also be used on a Series, which has only one column of data score = df_a [ 'Score' ] # select a single column from the DataFrame, it returns a Series print ( score ) print ( type ( score )) # .nlargest() on a Series print ( score . nlargest ( 2 )) # because there is only one column in a Series, you do not need to specify column name # the above should print row 3 with score 97 and row 1 with score 95 In [256… # T2-5 python solution code below by_city = df1 . groupby ( "City" ) homes_sold_by_city = by_city [ "Homes_Sold" ] . sum () print ( homes_sold_by_city . nlargest ( 5 )) bar = homes_sold_by_city . nlargest ( 5 ) . plot . bar ( x = 'City' , y = 'Homes_Sold' , title = 'Top 5 Cities with Most Homes Sold (2018-2022)' , ylabel = 'Homes Sold' , xlabel = 'City' , legend = False , rot = 45 ) bar . bar_label ( bar . containers [ 0 ]) plt . show ()
Discover more documents: Sign up today!
Unlock a world of knowledge! Explore tailored content for a richer learning experience. Here's what you'll get:
  • Access to all documents
  • Unlimited textbook solutions
  • 24/7 expert homework help