
hw4
April 26, 2024

[ ]: import otter
     grader = otter.Notebook()

1 Homework 4: Advanced operations in pandas

Due Date: 11:59PM on the date posted to Canvas

Collaboration Policy
Data science is a collaborative activity. While you may talk with others about the homework, we ask that you write your solutions individually. If you do discuss the assignments with other students please include their names below.

Collaborators: list collaborators here

Grading
Grading is broken down into autograded answers and free response.
For autograded answers, the results of your code are compared to provided and/or hidden tests.
For autograded probability questions, the provided tests will only check that your answer is within a reasonable range.
For free response, readers will evaluate how well you answered the question and/or fulfilled the requirements of the question.
For plots, make sure to be as descriptive as possible: include titles, axes labels, and units wherever applicable.

[ ]: import numpy as np
     import pandas as pd
     import matplotlib
     import matplotlib.pyplot as plt
     import seaborn as sns
     'imports completed'

1.1 Introduction

The purpose of this module is to expand your ‘pandas’ skillset by performing various new and old operations on ‘pandas’ dataframes. A lot of these operations will be things you’ve done before in the datascience package, so you should reference the included notebook to translate between the two if need be.

You are expected to answer all relevant questions programmatically, i.e. use indexing and functions/methods to arrive at your answers. Your answers don’t need to be in one single line; you may use as many intermediate steps as you need.

1.1.1 Question 1

Reading in data from a file is made easy by the pandas package. We have included two datasets in your assignment folder to read in, ‘broadway.csv’ and ‘diseases.txt’.

Question 1.1 Read in broadway using pd.read_csv.

[ ]: broadway = ...
     broadway.head(6)

[ ]: grader.check("q1_1")

Question 1.2 Now read in the diseases dataset. Diseases is not a .csv but a .txt file, i.e. a plain-text file. Because it’s not a .csv, we can’t assume that the values are comma separated. Fortunately, pd.read_csv can be called on any plain-text file. With the default comma separator it may not parse the data correctly, but the output may reveal the character that does separate entries. Identify the separator used in diseases.txt and use it to successfully read in your data with pd.read_csv.

[ ]: separator = ...
     diseases = pd.read_csv("diseases.txt", sep=...)
     diseases.head(6)

[ ]: grader.check("q1_2")

Question 1.3 Read in the DataFrame called nst-est2016-alldata.csv from the course GitHub. The URL path to the repository is https://github.com/oregon-data-science/DSCI101/raw/main/data/. You should do this with pd.read_csv.

[ ]: pop_census = ...

[ ]: grader.check("q1_3")

This DataFrame gives census-based population estimates for each state on both July 1, 2015 and July 1, 2016. The last four columns describe the components of the estimated change in population during this time interval. For all questions below, assume that the word “states” refers to all 52 rows including Puerto Rico & the District of Columbia.

The data was taken from here. If you want to read more about the different column descriptions, click here!

The raw data is a bit messy - run the cell below to clean the DataFrame and make it easier to work with.

[ ]: # Don't change this cell; just run it.
     pop_sum_level = pop_census['SUMLEV'] == 40
     pop = pop_census[pop_sum_level]
     # grab a numbered list of columns to use
     columns_to_use = pop.columns[[1, 4, 12, 13, 27, 34, 62, 69]]
     pop = pop[columns_to_use]
     pop = pop.rename(columns={'POPESTIMATE2015': '2015',
                               'POPESTIMATE2016': '2016',
                               'BIRTHS2016': 'BIRTHS',
                               'DEATHS2016': 'DEATHS',
                               'NETMIG2016': 'MIGRATION',
                               'RESIDUAL2016': 'OTHER'})
     #pop['REGION'].unique()
     pop['REGION'] = pop['REGION'].replace({'1': 1, '2': 2, '3': 3, '4': 4, 'X': 0})
     pop.head(12)

1.1.2 Question 2 - Census data

Question 2.1 Assign us_birth_rate to the total US annual birth rate during this time interval. The annual birth rate for a year-long period is the total number of births in that period as a proportion of the population size at the start of the time period.

Hint: Which year corresponds to the start of the time period?

[ ]: us_birth_rate = ...
     us_birth_rate

[ ]: grader.check("q2_1")

Question 2.2 Assign movers to the number of states for which the absolute value (np.abs) of the annual rate of migration was higher than 1%. The annual rate of migration for a year-long period is the net number of migrations (in and out) as a proportion of the population size at the start of the period. The MIGRATION column contains estimated annual net migration counts by state.

[ ]: ...
     movers = ...
     movers

[ ]: grader.check("q2_2")

Question 2.3 Assign west_births to the total number of births that occurred in region 4 (the Western US).

Hint: Make sure you double check the type of the values in the region column, and appropriately filter (i.e. the types must match!).
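The hint about matching types can be illustrated on a small made-up table. The sketch below is not the census data and not a solution: the `toy` DataFrame, its values, and the choice to store the region labels as strings are all assumptions made for illustration. It shows that a boolean filter selects rows only when the comparison value has the same type as the values in the column.

```python
import pandas as pd

# Toy data invented for illustration; this is NOT the census DataFrame.
# The region labels are deliberately stored as strings here (an assumption).
toy = pd.DataFrame({
    "REGION": ["1", "4", "4", "X"],
    "BIRTHS": [100, 250, 50, 10],
})

# Inspect the column's dtype before deciding what to compare against.
region_dtype = toy["REGION"].dtype

# Comparing string labels with the integer 4 selects no rows.
mismatched = toy[toy["REGION"] == 4]

# Comparing with the string "4" selects the two matching rows.
matched = toy[toy["REGION"] == "4"]
matched_births = matched["BIRTHS"].sum()

print(region_dtype)
print(len(mismatched), matched_births)
```

If the column held integers instead, the comparison value would need to be an integer as well, so checking the dtype (or calling `.unique()` on the column) first is the safer habit.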
