Nayini01467237_HW3.pdf

School

George Mason University

Course

618

Subject

Computer Science

Date

Dec 6, 2023

Type

pdf


HW #3 (total: 50 pts)

This homework primarily focuses on the basic skills of using DataFrame in the Pandas package in Python.

Note:
(1) This is an individual assignment. It accounts for 12% of your grade.
(2) Due Date: Monday, November 13th, 2023, 7:20:59 pm.
(3) Make sure to run your code for each cell so that the output/result is visible underneath each code cell. Do not create any new cells in this notebook file.
(4) Submissions: you need to submit your completed Jupyter Notebook file (.ipynb) AND a PDF version of your completed Jupyter Notebook to Blackboard. **Both files must be uploaded using one Blackboard submission (these are not to be submitted separately using two submissions)**. Please make sure that the PDF file shows the output of each code cell prior to submitting your files to the system. Submissions without the PDF file as instructed will result in a grade penalty of 25%.

Part 0: Get Ready

(1 pt) T0-1: First, rename this Jupyter Notebook file exactly as your last name followed by your G number without "G" and then followed by the suffix "_HW3". There should not be extra spaces or underscores in between. For example, the Jupyter Notebook file you submit should be named like Ye12345678_HW3.ipynb. The same applies to the PDF version of the file. You will complete the tasks below in this Jupyter Notebook file, and then submit the completed Jupyter Notebook file as well as the PDF version of the Jupyter Notebook file.

(1 pt) T0-2: Run the cell below.

    /Users/wadeyy03/Downloads

Part 1: DataFrame Basics

Run the cell below first to import the pandas package as pd. If at any time you get this error: NameError: name 'pd' is not defined, it means you haven't imported pandas as pd yet. You need to rerun the code in the cell below to import pandas as pd.

(2 pts) T1-1: Create a 2-dimensional DataFrame with 4 columns and 6 rows of data as shown below.

| Team | Points | Standing | Coach |
| :------- | :------ | :----- | :-------------------- |
| Arsenal | 24 | 2 | Mikel Arteta |
| Chelsea | 12 | 12 | Mauricio Pochettino |
| Liverpool | 23 | 4 | Jurgen Klopp |
| Man City | 24 | 3 | Pep Guardiola |
| Man United | 16 | 8 | Erik ten Hag |
| Tottenham | 26 | 1 | Ange Postecoglou |

Print the DataFrame to make sure it is correct.

             Team  Points  Standing                Coach
    0     Arsenal      24         2         Mikel Arteta
    1     Chelsea      12        12  Mauricio Pochettino
    2   Liverpool      23         4         Jurgen Klopp
    3    Man City      24         3        Pep Guardiola
    4  Man United      16         8         Erik ten Hag
    5   Tottenham      26         1     Ange Postecoglou

(2 pts) T1-2: Change the row index labels of the DataFrame to be the abbreviated team names as shown below:

| Team | Abbreviated Team Name |
| :------- | :-------------------- |
| Arsenal | ARS |
| Chelsea | CHE |
| Liverpool | LIV |
| Man City | MCI |
| Man United | MUN |
| Tottenham | TOT |

Note: you are not adding a new column or changing the values in any column of the DataFrame; you are simply changing the row index labels of the DataFrame. Print the DataFrame to make sure the row index labels have been changed successfully.

    ARS       Arsenal
    CHE       Chelsea
    LIV     Liverpool
    MCI      Man City
    MUN    Man United
    TOT     Tottenham
    Name: Team, dtype: object

(4 pts) T1-3: Use .loc[] and .iloc[] respectively to produce a subset of the DataFrame containing the following rows and columns. Print the subset to ensure it is correct.

| Team | Points | Standing |
| :------- | :------ | :----- |
| Chelsea | 12 | 12 |
| Liverpool | 23 | 4 |

              Team  Points  Standing
    CHE    Chelsea      12        12
    LIV  Liverpool      23         4
              Team  Points  Standing
    CHE    Chelsea      12        12
    LIV  Liverpool      23         4

Part 2: Import Data and Descriptive Statistics

In this part, you will import a dataset downloaded from the Virginia Department of Education ( https://www.doe.virginia.gov/data-policy-funding/data-reports ) into a DataFrame, and work with it to generate different descriptive statistics.
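Before moving on to the dataset, here is a small sketch of the T1-3 selection from Part 1. It rebuilds the table from the values given in T1-1 and T1-2; the variable names `table`, `by_label`, and `by_position` are illustrative assumptions, not names required by the assignment. It is meant to show one behavior that is easy to trip over: `.loc[]` slices by label and includes the end label, while `.iloc[]` slices by integer position and excludes the stop position.

```python
import pandas as pd

# Part 1 table, indexed by the abbreviated team names from T1-2.
table = pd.DataFrame(
    {
        "Team": ["Arsenal", "Chelsea", "Liverpool", "Man City", "Man United", "Tottenham"],
        "Points": [24, 12, 23, 24, 16, 26],
        "Standing": [2, 12, 4, 3, 8, 1],
        "Coach": ["Mikel Arteta", "Mauricio Pochettino", "Jurgen Klopp",
                  "Pep Guardiola", "Erik ten Hag", "Ange Postecoglou"],
    },
    index=["ARS", "CHE", "LIV", "MCI", "MUN", "TOT"],
)

# Label-based slicing: the end labels 'LIV' and 'Standing' are included.
by_label = table.loc["CHE":"LIV", "Team":"Standing"]

# Position-based slicing: stop positions (row 3, column 3) are excluded.
by_position = table.iloc[1:3, 0:3]

print(by_label)
print(by_position)
```

Both selections contain the same two rows and three columns, which is why the task output shows the same subset twice.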
The dataset shows the performance of public high schools in 2022 in three dimensions: Math, Science, and Graduation Rate (some schools with missing data are excluded). An explanation of the variables (or columns) in the dataset is listed below:

| Variable | Explanation |
| :------- | :-------------------- |
| ZIP | zipcode of the school |
| School_Name | school name |
| Division | the division the school belongs to |
| Math_Pass_Rate | proportion of students who pass the math exam; a value of 50 means 50% of students pass Math |
| Science_Pass_Rate | proportion of students who pass the science exam; a value of 50 means 50% of students pass Science |
| Graduation_Rate | proportion of students who graduate; a value of 50 means 50% of students graduate |

(4 pts) T2-1: Download the csv file "high_school_2022.csv", then import the csv file into a DataFrame, print the first 3 rows, then print the last 4 rows.

There are two ways to ensure the csv file is imported correctly.

1. Use the absolute file path. For example, if the full path of your file is *"C:/ipynb/HW/high_school_2022.csv"*, pass this entire path, in quotes, as the argument to the *read_csv()* function:

    read_csv("C:/ipynb/HW/high_school_2022.csv")

2. Copy the csv file to your current working directory (this is the output from T0-2), then use a relative file path consisting of just the filename. For example, if T0-2 produces the output *"C:\Users\SY\Documents\ipynb"*, copy the csv file to the folder *"C:\Users\SY\Documents\ipynb"*, then you can just do:

    read_csv("high_school_2022.csv")

         ZIP       School_Name         Division  Math_Pass_Rate  \
    0  24263          Lee High       Lee County           74.43
    1  24502   Brookville High  Campbell County           91.27
    2  20124  Centreville High   Fairfax County           79.49

       Science_Pass_Rate  Graduation_Rate
    0              65.27            79.67
    1              73.82            91.46
    2              77.36            89.61
    ====================================
           ZIP         School_Name           Division  Math_Pass_Rate  \
    322  22847  Mountain View High  Shenandoah County           68.28
    323  20148   Independence High     Loudoun County           83.33
    324  20105     Lightridge High     Loudoun County           78.31
    325  20164     W.O. Robey High     Loudoun County           78.31

         Science_Pass_Rate  Graduation_Rate
    322              63.97            90.16
    323              81.90            98.22
    324              82.77           100.00
    325              82.77           100.00

(2 pts) T2-2: Only print the number of records (i.e., rows) in the DataFrame.

    326

(2 pts) T2-3: Display a concise summary of the DataFrame (such as columns, datatypes, etc.) using the appropriate DataFrame method.

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 326 entries, 0 to 325
    Data columns (total 6 columns):
     #   Column             Non-Null Count  Dtype
    ---  ------             --------------  -----
     0   ZIP                326 non-null    int64
     1   School_Name        326 non-null    object
     2   Division           326 non-null    object
     3   Math_Pass_Rate     326 non-null    float64
     4   Science_Pass_Rate  326 non-null    float64
     5   Graduation_Rate    326 non-null    float64
    dtypes: float64(3), int64(1), object(2)
    memory usage: 15.4+ KB

(6 pts) T2-4: For the first 7 rows, print a customized description of each record like this, where {} is a placeholder for the actual value:

{School_Name} in {Division} has a graduation rate of {Graduation_Rate}%.

Hint: the first line of print output should be: Lee High in Lee County has a graduation rate of 79.67%.

    Lee High in Lee County has a graduation rate of 79.67%.
    Brookville High in Campbell County has a graduation rate of 91.46%.
    Centreville High in Fairfax County has a graduation rate of 89.61%.
    Liberty High in Fauquier County has a graduation rate of 90.0%.
    Sherando High in Frederick County has a graduation rate of 92.7%.
    Alexandria City High in Alexandria City has a graduation rate of 80.26%.
    Grundy High in Buchanan County has a graduation rate of 85.57%.

(4 pts) T2-5: What is the mean Math pass rate across all schools, and what is the mean Science pass rate across all schools? Use the appropriate DataFrame method(s) to display one output that answers this question.

    Maths Avg Pass Rate across All Schools: 76.22282208588958 Science Avg Pass Rate across All Schools: 68.25963190184049

(4 pts) T2-6: How many schools are there in each Division? Use the appropriate DataFrame method to display the output that answers this question.

    Division
    Accomack County                   3
    Albemarle County                  4
    Alexandria City                   1
    Alleghany County                  1
    Amelia County                     1
                                     ..
    Williamsburg-James City County    3
    Winchester City                   1
    Wise County                       3
    Wythe County                      3
    York County                       5
    Name: count, Length: 130, dtype: int64

(6 pts) T2-7: What does the Science pass rate look like for public schools in the Fairfax County division? To answer this question, produce a succinct summary of all descriptive statistics for the Science pass rate of public schools in the Fairfax County division using the appropriate DataFrame method(s).

    count    28.000000
    mean     70.275714
    std      21.293555
    min      17.650000
    25%      57.755000
    50%      74.930000
    75%      87.462500
    max      98.950000
    Name: Science_Pass_Rate, dtype: float64

(6 pts) T2-8: For high schools whose Science pass rate is at least 70 (a value of 70 means 70% of students pass Science), what is the mean value of Graduation Rate? Use the appropriate DataFrame method(s) to display the output that answers this question. Then, for high schools whose Science pass rate is less than 70, what is the mean value of Graduation Rate? Use the appropriate DataFrame method(s) to display the output that answers this question.
    Greater than 70%: 92.23198795180723
    Lesser than 70%: 84.025125

(6 pts) T2-9: How many Divisions are there in Virginia? What are those Divisions? Then use negative indexes to print the last 5 Divisions.

For this task, you are going to learn and use the .nunique() and .unique() methods, which are called here on a single DataFrame column. First, read the code and comments below, and run the code to learn and understand the two methods.

    4
    3
    ['John' 'Mike' 'Sarah' 'Julie']
    ['VA' 'CA' 'TX']

Now write your Python code to complete the task T2-9.

    130
    The Unique Divisions: ['Lee County' 'Campbell County' 'Fairfax County' 'Fauquier County'
     'Frederick County' 'Alexandria City' 'Buchanan County' 'Amherst County'
     'Hampton City' 'Loudoun County' 'Botetourt County' 'Manassas Park City'
     'Bristol City' 'Bedford County' 'Hanover County' 'Bland County'
     'Chesapeake City' 'Chesterfield County' 'Goochland County'
     'Caroline County' 'Albemarle County' 'Bath County' 'Colonial Beach'
     'Floyd County' 'Charlotte County' 'Clarke County' 'Danville City'
     'Appomattox County' 'Amelia County' 'Accomack County'
     'King George County' 'Essex County' 'Gloucester County' 'Henrico County'
     'Halifax County' 'Lancaster County' 'Lynchburg City' 'Dinwiddie County'
     'Giles County' 'Carroll County' 'Buena Vista City' 'Buckingham County'
     'Charlottesville City' 'Colonial Heights City' 'Martinsville City'
     'King William County' 'Craig County' 'Franklin County' 'Franklin City'
     'Arlington County' 'King and Queen County' 'Mathews County'
     'Middlesex County' 'Greensville County' 'Alleghany County'
     'Augusta County' 'Fluvanna County' 'Fredericksburg City'
     'Cumberland County' 'Hopewell City' 'Louisa County' 'Greene County'
     'Culpeper County' 'Highland County' 'Dickenson County'
     'Lunenburg County' 'Harrisonburg City' 'Brunswick County'
     'Falls Church City' 'Covington City' 'Manassas City' 'Grayson County'
     'Henry County' 'Isle of Wight County' 'Madison County' 'Galax City'
     'Charles City County' 'Wise County' 'Richmond City' 'Portsmouth City'
     'Newport News City' 'Radford City' 'Wythe County' 'Virginia Beach City'
     'Smyth County' 'Stafford County' 'Page County' 'Prince William County'
     'Nottoway County' 'Rappahannock County' 'Roanoke County'
     'Petersburg City' 'Orange County' 'Montgomery County' 'Scott County'
     'Spotsylvania County' 'Sussex County' 'Nelson County' 'Roanoke City'
     'Williamsburg-James City County' 'Powhatan County' 'Norfolk City'
     'Rockingham County' 'Russell County' 'Prince Edward County'
     'Norton City' 'West Point' 'Winchester City' 'Suffolk City'
     'York County' 'Tazewell County' 'Northampton County' 'Poquoson City'
     'Staunton City' 'Pulaski County' 'Richmond County' 'Surry County'
     'Pittsylvania County' 'Warren County' 'Washington County'
     'Southampton County' 'Waynesboro City' 'Westmoreland County'
     'Shenandoah County' 'Rockbridge County' 'New Kent County' 'Salem City'
     'Patrick County' 'Northumberland County' 'Prince George County']

    The last 5 divisions are:
    ['New Kent County' 'Salem City' 'Patrick County' 'Northumberland County'
     'Prince George County']

End of Homework. Once you have completed it and run all the cells, please remember to print the Jupyter Notebook file as a PDF file and submit it along with your Jupyter Notebook file, as both are required.

In [1]:
# Before you begin, run this cell with the code provided below.
# This will print the current working directory.
# This will also help you locate your Jupyter Notebook file on your computer.
import os
print(os.getcwd())

In [2]:
# Run this cell first.
# Import the pandas package and rename it as pd, which will be used throughout the homework.
import pandas as pd
# This is important: you need to run this cell to be able to use the pandas package and DataFrame inside pandas.
# Once you have run the above statement, you will use pd to refer to pandas.

In [4]:
# T1-1 python solution code below
PL_data = {
    "Team": ["Arsenal", "Chelsea", "Liverpool", "Man City", "Man United", "Tottenham"],
    "Points": [24, 12, 23, 24, 16, 26],
    "Standing": [2, 12, 4, 3, 8, 1],
    "Coach": ["Mikel Arteta", "Mauricio Pochettino", "Jurgen Klopp",
              "Pep Guardiola", "Erik ten Hag", "Ange Postecoglou"],
}
PL_Table = pd.DataFrame(PL_data)
print(PL_Table)

In [10]:
# T1-2 python solution code below
PL_Table.index = ['ARS', 'CHE', 'LIV', 'MCI', 'MUN', 'TOT']
print(PL_Table['Team'])

Out[10]:

In [18]:
# T1-3 python solution code below
# first use .loc[], write your code below
# Slice by the actual row labels set in T1-2; .loc[] includes both end labels.
LOC = PL_Table.loc['CHE':'LIV', 'Team':'Standing']
print(LOC)
# now use .iloc[], write your code below
# Slice by integer position; .iloc[] excludes the stop positions.
ILOC = PL_Table.iloc[1:3, 0:3]
print(ILOC)

In [117…
# T2-1 python solution code below
# import the csv file into a DataFrame
df = pd.read_csv("high_school_2022.csv")
# print the first 3 rows
print(df.head(3))
# print the last 4 rows
print("====================================")
print(df.tail(4))

In [28]:
# T2-2 python solution code below
print(df.shape[0])

In [35]:
# T2-3 python solution code below
df.info()

In [39]:
# T2-4 python solution code below
for i in range(7):
    line = df['School_Name'][i] + ' in ' + df['Division'][i]
    line = line + ' has a graduation rate of ' + str(df['Graduation_Rate'][i]) + '%.'
    print(line)

In [43]:
# T2-5 python solution code below
print("Maths Avg Pass Rate across All Schools:", df['Math_Pass_Rate'].mean(),
      "Science Avg Pass Rate across All Schools:", df["Science_Pass_Rate"].mean())

In [79]:
# T2-6 python solution code below
by_division = df.groupby("Division")
Schools_by_division = by_division["Division"]
number = Schools_by_division.value_counts()
print(number)

In [92]:
# T2-7 python solution code below
# keep only the Fairfax County rows, then summarize the Science pass rate
Fairfax_County_Science_Pass_Rate = df[df['Division'] == 'Fairfax County']
Fairfax_County_Science_Pass_Rate['Science_Pass_Rate'].describe()

Out[92]:

In [96]:
# T2-8 python solution code below
# output for high schools whose pass rate of Science is at least 70
# "at least 70" includes 70 itself, so the comparison is >= rather than >
Schools_Science_Pass_Rate_more_than_70 = df[df["Science_Pass_Rate"] >= 70]
print("Greater than 70%:", Schools_Science_Pass_Rate_more_than_70['Graduation_Rate'].mean())
# output for high schools whose pass rate of Science is less than 70
Schools_Science_Pass_Rate_less_than_70 = df[df["Science_Pass_Rate"] < 70]
print("Lesser than 70%:", Schools_Science_Pass_Rate_less_than_70['Graduation_Rate'].mean())

In [97]:
# An example of .nunique() and .unique().
# Read the code, and run the code in this cell to learn them.
import pandas as pd
# df_a is a dataframe containing student names and their home states.
# there are 6 students
df_a = pd.DataFrame(
    {
        "Name": ["John", "Mike", "John", "Mike", "Sarah", "Julie"],
        "State": ["VA", "CA", "VA", "CA", "CA", "TX"],
    }
)
# .nunique(): return the number of unique values in a column
print(df_a['Name'].nunique())   # should print 4, as there are 4 unique names
print(df_a['State'].nunique())  # should print 3, as there are 3 unique values in the State column.
# .unique(): return the unique values in a column as a NumPy array.
print(df_a['Name'].unique())    # it should print: ['John' 'Mike' 'Sarah' 'Julie']
unique_state = df_a['State'].unique()  # unique_state should be an array: ['VA' 'CA' 'TX']
print(unique_state)             # it should print: ['VA' 'CA' 'TX']

In [116…
# T2-9 python solution code below
# How many Divisions are there in Virginia? Print the answer. Write your code below.
print(df['Division'].nunique())
# What are those Divisions? Print the Divisions. Write your code below.
# Hint: save it in a variable, because you will need the variable to print the last 5 divisions.
unique_divisions = df['Division'].unique()
print("The Unique Divisions:", unique_divisions)
print("\nThe last 5 divisions are:")
# Use negative indexes to print the last 5 Divisions. You can just print the value of the array. Write your code below.
print(unique_divisions[-5:])
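As a hedged aside on T2-8: "at least 70" and "less than 70" are complementary conditions, so one way to guarantee that every school is counted exactly once is to build a single boolean mask and negate it for the second group. The sketch below uses a tiny made-up table (the `sample` rows are hypothetical, not values from high_school_2022.csv), and the names `at_least_70`, `mean_high`, and `mean_low` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical sample rows; the real values come from high_school_2022.csv.
sample = pd.DataFrame(
    {
        "Science_Pass_Rate": [65.0, 70.0, 82.5, 55.0],
        "Graduation_Rate": [80.0, 90.0, 96.0, 84.0],
    }
)

# "At least 70" includes 70 itself, so the comparison is >=.
at_least_70 = sample["Science_Pass_Rate"] >= 70

# Mean graduation rate for rows where the mask is True.
mean_high = sample.loc[at_least_70, "Graduation_Rate"].mean()

# ~ negates the mask, so the second group is exactly the remaining rows.
mean_low = sample.loc[~at_least_70, "Graduation_Rate"].mean()

print("At least 70:", mean_high)
print("Less than 70:", mean_low)
```

One caveat with this design: a missing pass rate compares as False, so such a row would fall into the negated group. The T2-3 summary reports no null values in this dataset, so that case should not arise here.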

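For the per-division count in T2-6, two commonly used pandas approaches are `Series.value_counts()` and `groupby(...).size()`. The sketch below compares them on a small made-up column (the `sample` data is hypothetical) and also shows `nunique()` from T2-9 agreeing with the number of groups. It is offered as an illustration under those assumptions, not as the required solution.

```python
import pandas as pd

# Hypothetical division labels standing in for the real Division column.
sample = pd.DataFrame(
    {
        "Division": [
            "Fairfax County", "Loudoun County", "Fairfax County",
            "Lee County", "Loudoun County", "Fairfax County",
        ]
    }
)

# value_counts() orders by frequency; sort_index() makes it alphabetical.
counts_vc = sample["Division"].value_counts().sort_index()

# groupby(...).size() counts rows per group, ordered by group key by default.
counts_gb = sample.groupby("Division").size()

# nunique() gives the number of distinct divisions, i.e. the number of groups.
n_divisions = sample["Division"].nunique()

print(counts_vc)
print(counts_gb)
print(n_divisions)
```

The two results carry the same counts per division; they differ mainly in the name attached to the resulting Series.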