hw04.pdf

School: University of California, Berkeley
Course: 61B
Subject: Statistics
Date: Apr 3, 2024
Type: pdf
Pages: 32

Uploaded by CaptainMonkey3984

0.0.1 Question 1a
What is the granularity of the data (i.e., what does each row represent)?
Hint: Examine all variables present in the dataset carefully before answering this question! Pay special attention to the time-based columns.

The dataset spans multiple days, and each row represents a single hour of a single day. Along with the date and hour, a row records the weather conditions, the season, and other details about that hour or day. Each row also gives the number of rental bikes used during that hour, broken down into casual and registered users.
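One way to check this reading of the granularity is to confirm that no (date, hour) pair appears more than once while the date alone does repeat. The following is a minimal sketch on a made-up frame: the column names `dteday`, `hr`, `casual`, `registered`, and `cnt` and all values are assumptions for illustration, not taken from the assignment's data.

```python
import pandas as pd

# Hypothetical miniature of the hourly table; column names and values are
# assumptions for illustration, not the assignment's actual data.
bike = pd.DataFrame({
    "dteday": ["2011-01-01", "2011-01-01", "2011-01-01", "2011-01-02"],
    "hr": [0, 1, 2, 0],
    "casual": [3, 8, 5, 4],
    "registered": [13, 32, 27, 13],
})
bike["cnt"] = bike["casual"] + bike["registered"]

# If each row is one hour of one day, no (date, hour) pair repeats.
is_hourly = not bike.duplicated(subset=["dteday", "hr"]).any()

# The date alone repeats, so the granularity is finer than daily.
rows_per_day = bike.groupby("dteday").size()

print(is_hourly)
print(rows_per_day.max())
```

If the same check on the real table found repeated (date, hour) pairs, the rows would have to represent something finer than an hour, such as individual trips.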
0.0.2 Question 1b
For this assignment, we’ll be using this data to study bike usage in Washington, DC. Based on the granularity and the variables present in the data, what might some limitations of using this data be? What are two additional data categories/variables that one could collect to address some of these limitations?

The data describes DC as a whole, but DC is a large area and weather conditions may vary across it, even within a single hour. Furthermore, although each row contains rider counts for one hour, some riders may use the service across several consecutive hourly rows. The dataset does not capture this, so we do not know how many riders are “double-counted” across several hours. Two helpful additional variables would be the location of bike usage in each hour and the number of new riders who start using the service in each hour.
0.0.3 Question 3a
Use the sns.histplot (documentation) function to create a plot that overlays the distribution of the daily counts of bike users, using blue to represent casual riders, and green to represent registered riders. The temporal granularity of the records should be daily counts, which you should have after completing question 2.c. In other words, you should be using daily_counts to answer this question.

Hints:
- You will need to set the stat parameter appropriately to match the desired plot.
- The label parameter of sns.histplot allows you to specify, as a string, how the plot should be labeled in the legend. For example, passing in label="My data" would give your plot the label “My data” in the legend.
- You will need to make two calls to sns.histplot.

Include a legend, xlabel, ylabel, and title. Read the seaborn plotting tutorial if you’re not sure how to add these.

After creating the plot, look at it and make sure you understand what the plot is actually telling us, e.g., on a given day, the most likely number of registered riders we expect is ~4000, but it could be anywhere from nearly 0 to 7000.

For all visualizations in Data 100, our grading team will evaluate your plot based on its similarity to the provided example. While your plot does not need to be identical to the example shown, we do expect it to capture its main features, such as the general shape of the distribution, the axis labels, the legend, and the title. It is okay if your plot contains small stylistic differences, such as differences in color, line weight, font, or size/scale.

In [24]: sns.histplot(data=daily_counts, x='casual', stat='density', alpha=0.5, kde=True, label='casual')
         sns.histplot(data=daily_counts, x='registered', stat='density', color='green', alpha=0.5, kde=True, label='registered')
         plt.title('Distribution Comparison of Casual vs Registered Riders');
         plt.xlabel('Rider Count');
         plt.legend();
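The stat hint above can be made concrete: with stat='density', seaborn scales the bars so that the total area of each histogram is 1, which keeps two groups with very different ranges comparable on one axis. The sketch below illustrates that normalization using np.histogram with density=True as a stand-in for the plotting call. The rider counts are randomly generated and the helper density_area is hypothetical; neither comes from the assignment.

```python
import numpy as np

# Made-up daily rider counts, used only to illustrate the normalization.
rng = np.random.default_rng(0)
casual = rng.normal(loc=800, scale=300, size=500).clip(min=0)
registered = rng.normal(loc=4000, scale=1200, size=500).clip(min=0)


def density_area(values, bins=20):
    """Return the total area under a density-normalized histogram."""
    heights, edges = np.histogram(values, bins=bins, density=True)
    # Area of each bar is its height times its bin width.
    return float(np.sum(heights * np.diff(edges)))


casual_area = density_area(casual)
registered_area = density_area(registered)

print(round(casual_area, 6), round(registered_area, 6))
```

Because each call is normalized on its own, the two overlaid histograms each integrate to 1 regardless of how many days or riders fall in each group.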