BIO259-Graphical-Data-Summaries_Tutorial
Toronto Metropolitan University, Course 259, Subject: Statistics
October 16, 2023
Our ability to generate visually stimulating graphical representations that evoke appropriate
responses from the audiences of our data is arguably the most important aspect of biological data
science. These graphical representations not only help us communicate our results to our audience,
but also enable us to gain greater insight into our own data through simplification.
By leveraging the power of computer programming, we can generate and manipulate figures in order
to produce an extremely wide range of plots that simply aren't accessible through most graphical
user interfaces. We can also heavily parallelize our analyses and incorporate these visualizations
into pipelines that enable us to generate the same plots for different datasets.
In today's tutorial we will learn how to generate several different types of graphical data summaries
in R using the ggplot2 package. We will also learn to perform some basic manipulations and
customize these plots where applicable. Our tutorial is divided into seven primary sections, each
of which will cover how to generate and customize a different type of plot:
1. Line Graphs
2. Bar Charts
3. Box and Whisker Plots (with dot plots)
4. Scatter Plots
5. Pie Charts
6. Histograms
7. Heat Maps
#1. Line graphs are typically used for time data
#2. Bar charts are good for looking at categorical data and their numbers
#3. Box and whisker plots show the spread of your data
#4. Scatter plots show the relationship between 2 variables
#5. Pie charts visualize proportions or percentages
#6. Histograms show distributions; the y-axis is typically a count
#7. Heat maps let you visually infer how a value changes across 2 variables
[1]:
# Run the code block below in order to import our dataset, select a subset of
# the columns, and filter out undesirable location data.
# This will be the data that we will work with throughout this tutorial and in
# our practical.
library(dplyr)
library(ggplot2)

covid_df <- read.table("/var/biojupyterhubdata/BIO259/owid-covid-data-15.06.2022.csv",
                       sep = ",", header = TRUE, quote = "")

# Keep only the columns we need
covid_df <- covid_df %>%
  select(c("continent", "location", "date", "total_cases", "new_cases",
           "total_deaths", "new_deaths", "total_cases_per_million",
           "new_cases_per_million", "total_deaths_per_million",
           "new_deaths_per_million", "icu_patients", "icu_patients_per_million",
           "hosp_patients", "hosp_patients_per_million", "total_vaccinations",
           "total_tests", "new_tests", "positive_rate", "people_vaccinated",
           "people_fully_vaccinated", "total_boosters", "new_vaccinations",
           "population", "population_density", "median_age", "gdp_per_capita",
           "diabetes_prevalence", "life_expectancy"))

# Drop aggregate rows (continents, income groups, etc.) so only countries remain
covid_df <- covid_df %>%
  filter(!location %in% c('Africa', 'Asia', 'Europe', 'North America',
                          'South America', 'Oceania', 'Low income',
                          'Lower middle income', 'Upper middle income',
                          'High income', 'European Union', 'International',
                          'Northern Cyprus', 'World', 'Grand Total'))

covid_df <- covid_df %>%
  mutate(date = as.Date(date, format = "%Y-%m-%d")) # convert column to date object

covid_df[is.na(covid_df)] <- 0 # replace NA values with 0
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
0.1 Part 1: Line Graphs
Line graphs are an effective means of displaying changes in a response variable through time. In
this section, we will use our Covid-19 dataset to plot the number of SARS-CoV-2 cases detected each
day through time in order to compare the pandemic curves for Canada and the USA.
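Comparing two countries of very different size only makes sense once counts are scaled by population, which is why the plots in this section use the new_cases_per_million column. The following is a minimal base R sketch of that normalization. The case counts, populations, and the names `toy` and `per_million` are made up for illustration and are not taken from the dataset.

```r
# Hypothetical daily case counts and populations (not real data)
toy <- data.frame(
  location   = c("Country A", "Country B"),
  new_cases  = c(500, 5000),
  population = c(1e6, 20e6)
)

# Cases per million people = cases / population * 1,000,000
toy$per_million <- toy$new_cases / toy$population * 1e6

print(toy)
```

In this toy example Country B reports ten times as many cases as Country A, but only half as many per million people.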
[2]:
# Generate a line graph of daily cases (per million) in Canada using
# geom_line()
# The ggplot process is usually: create a data frame with the part of
# the data you want
# Then add a type of graph, with aesthetic parameters
# (such as values of x, values of y etc)
# colo(u)r will colour depending on the location column
# The answer should look like this, but with some 'corrections'
ggplot(covid_df %>% filter(location == "Canada")) + # create subset line graph
  geom_line(aes(x = date, y = new_cases_per_million, group = location,
                colour = location))
# Write your code here. Be careful to follow the instructions!
[3]:
# Compare line graph from Canada to one from the USA, normalizing for
# population. Note change in scale.
# The answer should look like this, but with some 'corrections'
ggplot(covid_df %>% filter(location == "Canada" | location == "United States")) +
  geom_line(aes(x = date, y = new_cases_per_million,
                group = location, colour = location))
# Write your code here. Be careful to follow the instructions!
[4]:
# Adjust your axes, legends, and colours to improve figure readability.
ggplot(covid_df %>% filter(location == "Canada" | location == "United States")) +
  geom_line(aes(x = date, y = new_cases_per_million, group = location,
                colour = location)) +
  # each option in theme has several underlying options, here we are
  # manipulating several aspects of our axes
  theme(axis.title = element_text(size = 15),
        axis.text = element_text(size = 15),
        axis.ticks.length = unit(.1, "cm")) +
  xlab("Date") + ylab("New Cases Per Million") + # replace axis labels
  theme(legend.position = "bottom", legend.text = element_text(size = 15),
        legend.title = element_blank()) + # manipulate legend
  # specify different "national" colours for each group
  scale_color_manual(values = c("#ff0000", "#0000FF"))
0.2 Part 2: Bar Charts
Bar charts utilize rectangular bars with lengths that are proportional to the values they represent
to compare categorical data. Error bars are commonly added to bar charts to display the variability
or uncertainty of the plotted values.
In this section, we will compare the average number of hospitalized Covid-19 patients per million
across eight countries in Europe and learn how to include error bars on these plots.
[5]:
# We'll pivot our covid_df in order to calculate summary mean, sd, and sem
# values for a subset of European countries.
# Our countries of interest are Belgium, France, Italy, Iceland, Netherlands,
# Portugal, Sweden, and the United Kingdom.
pivot_covid_df <- covid_df %>%
  group_by(location) %>%
  filter(location == 'Belgium' | location == 'France' | location == 'Italy' |
         location == 'Iceland' | location == 'Netherlands' |
         location == 'Portugal' | location == 'Sweden' |
         location == 'United Kingdom') %>%
  summarise(mean_hosp_patients = mean(hosp_patients_per_million),
            sd_hosp_patients = sd(hosp_patients_per_million),
            sem_hosp_patients = sd(hosp_patients_per_million) / sqrt(n()))
pivot_covid_df
A tibble: 8 × 4
location        mean_hosp_patients  sd_hosp_patients  sem_hosp_patients
<chr>           <dbl>               <dbl>             <dbl>
Belgium         161.27678           133.90926         4.558324
France          244.34309           143.84840         4.865745
Iceland          36.31031            53.97307         1.863357
Italy           201.03494           177.41961         6.025486
Netherlands      59.19184            41.76282         1.440954
Portugal        110.12235           130.67054         4.516635
Sweden           87.08845            84.02470         2.855275
United Kingdom  136.17966           122.06000         4.145375
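The sem column is the standard deviation divided by the square root of the number of observations, and the error bars drawn later in this section span the mean plus or minus 1.96 times that value. The arithmetic can be sketched in base R as follows. The vector `x` and the variable names are hypothetical and stand in for one country's hosp_patients_per_million values.

```r
# Hypothetical hospitalization values for one country (not real data)
x <- c(10, 12, 14, 16, 18)

n     <- length(x)          # number of observations
avg   <- mean(x)            # arithmetic mean
stdev <- sd(x)              # sample standard deviation
sem   <- stdev / sqrt(n)    # standard error of the mean

# Approximate 95% confidence interval: mean +/- 1.96 * SEM
ci <- c(lower = avg - 1.96 * sem, upper = avg + 1.96 * sem)
print(ci)
```

Because the SEM shrinks as the number of observations grows, a country with a long daily record can have a large standard deviation and still show narrow error bars.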
[6]:
# Generate a simple bar chart of mean hospitalized patients in
# each country. You can play around with geom_bar settings to see what they
# each do.
# Again the trick is to load the data to ggplot
# Then plot a bar graph with specific aesthetics
# geom_bar() can perform summaries for us, but we can also plot the pivot table
# directly.
# stat = "identity" means we are giving it the y values to plot
# The answer should look like this, but with some 'corrections'
ggplot(pivot_covid_df) +
  geom_bar(aes(x = location, y = mean_hosp_patients), stat = "identity",
           fill = "gray", colour = "black", size = 1, alpha = 1)
# Write your code here. Be careful to follow the instructions!
Warning message:
“Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
Please use `linewidth` instead.”
[7]:
# Add error bars. You can play around with geom_errorbar settings to see what
# they do.
# Adding elements to graphs with ggplot is simple - just pile on ggplot
# graphing commands
# Error bars are the 95% CI: [mean - 1.96*SEM, mean + 1.96*SEM]
# The answer should look like this, but with some 'corrections'
ggplot(pivot_covid_df) +
  geom_bar(aes(x = location, y = mean_hosp_patients), stat = "identity",
           fill = "gray", colour = "black", size = 1, alpha = 1) +
  geom_errorbar(aes(x = location,
                    ymin = mean_hosp_patients - 1.96 * sem_hosp_patients,
                    ymax = mean_hosp_patients + 1.96 * sem_hosp_patients),
                width = 0, colour = "black", alpha = 1, size = 1)
# Write your code here. Be careful to follow the instructions!
[8]:
# Make thematic modifications to optimize your plot. The syntax for this is the
# same for different plot types. For example, let's change the axis label
# sizes etc.
# Update the X and Y axis titles
# Change their font sizes etc.
# Look at one of the graphs we modified above for inspiration.
# The answer should look like this, but with some 'corrections'
ggplot(pivot_covid_df) +
  geom_bar(aes(x = location, y = mean_hosp_patients), stat = "identity",
           fill = "gray", colour = "black", size = 1, alpha = 1) + # manipulate bars
  geom_errorbar(aes(x = location,
                    ymin = mean_hosp_patients - 1.96 * sem_hosp_patients,
                    ymax = mean_hosp_patients + 1.96 * sem_hosp_patients),
                width = 0, colour = "black", alpha = 1, size = 1) + # manipulate error bars
  theme(axis.title = element_text(size = 15),
        axis.text.x = element_text(size = 15, angle = 45, hjust = 1),
        axis.text.y = element_text(size = 15),
        axis.ticks.length = unit(.1, "cm")) + # manipulate axes
  xlab("Country") + ylab("Mean Hospitalized Patients Per Million") # update axis labels
# Write your code here. Be careful to follow the instructions!
0.3 Part 3: Box and Whisker Plots
Like bar charts, box and whisker plots also distribute categorical data on the x-axis and enable
us to compare quantitative variables on the y-axis. However, they also enable us to display more
information about the spread of the data, including the median, the interquartile range, and the
overall range.
In this section, we will compare the same data that we compared in Part 2 of this tutorial, but
here, we will use a box and whisker plot to do so.
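The quantities a box and whisker plot displays can be computed directly in base R. The sketch below uses a hypothetical vector `x`, and the variable names are illustrative. It also applies the 1.5 × IQR rule that geom_boxplot() uses by default to decide which points are drawn as individual dots beyond the whiskers.

```r
# Hypothetical values with one extreme observation (not real data)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 100)

med      <- median(x)                    # line inside the box
quartile <- quantile(x, c(0.25, 0.75))   # lower and upper edges of the box
iqr      <- IQR(x)                       # height of the box

# Points further than 1.5 * IQR beyond the box edges are plotted as dots
lower_fence <- quartile[[1]] - 1.5 * iqr
upper_fence <- quartile[[2]] + 1.5 * iqr
outliers    <- x[x < lower_fence | x > upper_fence]
print(outliers)
```

Here the single value of 100 falls outside the fences, so it would appear as a separate dot while the whiskers stop at 1 and 8.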
[9]:
# Filter the covid_df to extract European countries of interest.
filter_covid_df <- covid_df %>%
  filter(location == 'Belgium' | location == 'France' | location == 'Italy' |
         location == 'Iceland' | location == 'Netherlands' |
         location == 'Portugal' | location == 'Sweden' |
         location == 'United Kingdom')
[10]:
# Generate a basic box plot using geom_boxplot()
# The answer should look like this, but with some 'corrections'
ggplot(filter_covid_df) +
  geom_boxplot(aes(x = location, y = hosp_patients_per_million)) # all default settings
# Write your code here. Be careful to follow the instructions!
[11]:
# Modify some geom_boxplot settings to enhance visual appeal.
# Change the fill color, the size of the lines etc.
# The answer should look like this, but with some 'corrections'
ggplot(filter_covid_df) +
  geom_boxplot(aes(x = location, y = hosp_patients_per_million),
               fill = "blue", colour = "pink", size = 1, alpha = 1)
# Write your code here. Be careful to follow the instructions!
[12]:
# Make thematic modifications to optimize your plot.
# Again, change the x/y axis titles, the label sizes etc.
# The answer should look like this, but with some 'corrections'
ggplot(filter_covid_df, aes(x = location, y = hosp_patients_per_million)) +
  geom_boxplot(fill = "gray", colour = "black", size = 1, alpha = 1) + # modify boxplot features
  theme(axis.title = element_text(size = 15), # size must be named; the first positional argument is the font family
        axis.text.x = element_text(size = 15, angle = 45, hjust = 1),
        axis.text.y = element_text(size = 15),
        axis.ticks.length = unit(.1, "cm")) + # modify axis features
  xlab("Country") + ylab("Hospitalized patients per million") # re-assign axis labels
# Write your code here. Be careful to follow the instructions!
0.4 Part 4: Scatter Plots
Scatter plots are used to illustrate the relationship between two variables, with each dot representing
an individual piece of data. In this section, we will explore the relationship between total Covid-19
deaths per million and the median age of the population in each country. The expectation is that
countries that have a higher median age will have experienced a greater burden of Covid-19 deaths.
[13]:
# We'll pivot our covid_df in order to extract total deaths per million and
# median age for each country.
pivot_covid_df <- covid_df %>%
  group_by(location) %>%
  summarise(total_deaths_per_million = max(total_deaths_per_million),
            median_age = max(median_age))
# filter out rows with missing data on total deaths or median age.
pivot_covid_df[pivot_covid_df == 0] <- NA
pivot_covid_df <- na.omit(pivot_covid_df)
pivot_covid_df
A tibble: 188 × 3
location                total_deaths_per_million  median_age
<chr>                   <dbl>                     <dbl>
Afghanistan              193.546                  18.6
Albania                 1217.223                  38.0
Algeria                  154.091                  29.1
Angola                    55.992                  16.8
Antigua and Barbuda     1418.037                  32.1
Argentina               2828.455                  31.9
Armenia                 2907.220                  35.7
Aruba                   2528.103                  41.2
Australia                357.450                  37.9
Austria                 2208.873                  44.4
Azerbaijan               950.081                  32.4
Bahamas                 2053.342                  34.3
Bahrain                  852.259                  32.4
Bangladesh               175.168                  27.5
Barbados                1637.076                  39.8
Belarus                  738.970                  40.3
Belgium                 2736.768                  41.8
Belize                  1674.425                  25.0
Benin                     13.091                  18.8
Bhutan                    26.927                  28.6
Bolivia                 1855.076                  25.4
Bosnia and Herzegovina  4840.263                  42.5
Botswana                1130.050                  25.8
Brazil                  3124.829                  33.5
Brunei                   509.589                  32.4
Bulgaria                5395.369                  44.7
Burkina Faso              17.863                  17.6
Burundi                    3.101                  17.5
Cambodia                 180.333                  25.6
Cameroon                  70.893                  18.8
...
Spain                   2294.117                  45.5
Sri Lanka                768.422                  34.1
Sudan                    110.222                  19.7
Suriname                2289.633                  29.6
Sweden                  1874.872                  41.0
Switzerland             1585.338                  43.1
Syria                    172.360                  21.7
Taiwan                   190.568                  42.2
Tajikistan                12.821                  23.3
Tanzania                  13.659                  17.7
Thailand                 434.634                  40.1
Timor                     98.968                  18.0
Togo                      32.200                  19.4
Tonga                    112.403                  22.3
Trinidad and Tobago     2827.472                  36.2
Tunisia                 2400.768                  32.7
Turkey                  1164.074                  31.6
Uganda                    76.501                  16.4
Ukraine                 2587.238                  41.4
United Arab Emirates     230.706                  34.0
United Kingdom          2632.966                  40.8
[14]:
# Generate a basic scatter plot using geom_point()
# The answer should look like this, but with some 'corrections'
ggplot(pivot_covid_df) +
  # mostly default settings, with colour, size and transparency specified
  geom_point(aes(x = median_age, y = total_deaths_per_million),
             fill = "black", colour = "black", size = 2, alpha = 1)
# Write your code here. Be careful to follow the instructions!
[15]:
# Make thematic modifications to optimize your plot.
# The answer should look like this, but with some 'corrections'
ggplot(pivot_covid_df) +
  geom_point(aes(x = median_age, y = total_deaths_per_million),
             fill = "black", colour = "black", size = 2,
             alpha = 1) + # alpha must lie between 0 and 1
  theme(axis.title = element_text(size = 15), # size must be named; the first positional argument is the font family
        axis.text.x = element_text(size = 15, angle = 45, hjust = 1),
        axis.text.y = element_text(size = 15),
        axis.ticks.length = unit(.1, "cm")) + # modify axis features
  xlab("Median Age") + ylab("Total Deaths Per Million") # re-assign axis labels
# Write your code here. Be careful to follow the instructions!
[16]:
# Add a trendline using geom_smooth().
# Notice how a trend line is something like a second 'chart' on top of this
# graph, which can have its own aesthetic.
# Although it's possible to have the aesthetic passed from the initial ggplot()
# command, it's easier to learn ggplot by specifying the aesthetic every time.
# The answer should look like this, but with some 'corrections'
ggplot(pivot_covid_df) +
  geom_point(aes(x = median_age, y = total_deaths_per_million), fill = "black",
             colour = "black", size = 2,
             alpha = 1) + # alpha must lie between 0 and 1
  theme(axis.title = element_text(size = 15), # size must be named; the first positional argument is the font family
        axis.text.x = element_text(size = 15, angle = 45, hjust = 1),
        axis.text.y = element_text(size = 15),
        axis.ticks.length = unit(.1, "cm")) + # modify axis features
  xlab("Median Age") + ylab("Total Deaths Per Million") +
  geom_smooth(aes(x = median_age, y = total_deaths_per_million),
              method = lm) # add simple regression line with confidence interval
# Write your code here. Be careful to follow the instructions!
`geom_smooth()` using formula = 'y ~ x'
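With method = lm, geom_smooth() draws a straight line fitted by ordinary least squares using the formula y ~ x reported in the message above. The same kind of fit can be inspected in base R with lm(). The sketch below uses a hypothetical data frame `toy` whose points lie exactly on a known line, so the recovered coefficients are easy to check.

```r
# Hypothetical points lying exactly on y = 2x + 1 (not real data)
toy <- data.frame(x = c(1, 2, 3, 4, 5))
toy$y <- 2 * toy$x + 1

fit   <- lm(y ~ x, data = toy)   # least squares fit of y on x
coefs <- coef(fit)               # intercept and slope of the fitted line
print(coefs)

# Predicted y for a new x value
pred <- predict(fit, newdata = data.frame(x = 10))
print(pred)
```

Running the same call on pivot_covid_df with total_deaths_per_million ~ median_age would give the slope of the trend line shown on the scatter plot.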
[17]:
# Label and highlight a specific point.
# To do so, we'll pass different datasets to the graphing commands
# The answer should look like this, but with some 'corrections'
ggplot() +
  geom_point(data = pivot_covid_df,
             aes(x = median_age, y = total_deaths_per_million),
             fill = "black", colour = "black", size = 2,
             alpha = 1) + # alpha must lie between 0 and 1
  theme(axis.title = element_text(size = 15), # size must be named; the first positional argument is the font family
        axis.text.x = element_text(size = 15, angle = 45, hjust = 1),
        axis.text.y = element_text(size = 15),
        axis.ticks.length = unit(.1, "cm")) + # modify axis features
  xlab("Median Age") + ylab("Total Deaths Per Million") +
  geom_smooth(data = pivot_covid_df,
              aes(x = median_age, y = total_deaths_per_million), method = lm) +
  geom_point(data = pivot_covid_df %>% filter(location == "Canada"),
             aes(x = median_age, y = total_deaths_per_million),
             color = "red", size = 5) + # colour the point for Canada red
  geom_text(data = pivot_covid_df %>% filter(location == "Canada"),
            aes(x = median_age, y = total_deaths_per_million, label = "Canada"),
            nudge_x = 3, size = 5) # add a text label to the Canada point
# Write your code here. Be careful to follow the instructions!
`geom_smooth()` using formula = 'y ~ x'
0.5 Part 5: Pie Charts
Pie charts allow us to display percentage values as slices of a pie. In this section, we will use a pie
chart to display the percentage of worldwide Covid-19 deaths that occurred on each continent.
[18]:
# Let's pivot our covid_df in order to calculate the total deaths that occurred
# in each continent.
pivot_covid_df <- covid_df %>%
  group_by(continent) %>%
  summarise(total_deaths = sum(new_deaths)) %>%
  mutate(percent = total_deaths / sum(total_deaths))
pivot_covid_df
A tibble: 6 × 3
continent      total_deaths  percent
<chr>          <dbl>         <dbl>
Africa          254379       0.040553667
Asia           1432580       0.228385096
Europe         1847300       0.294500682
North America  1448053       0.230851836
Oceania          13253       0.002112823
South America  1277086       0.203595896
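The percent column is each continent's total divided by the sum over all continents, which is why the slices of the pie chart built from it fill the whole circle. A short base R sketch that reproduces the calculation from the total_deaths values in the output above (the variable names are illustrative):

```r
# Total deaths per continent, copied from the tutorial output
total_deaths <- c(
  "Africa" = 254379, "Asia" = 1432580, "Europe" = 1847300,
  "North America" = 1448053, "Oceania" = 13253, "South America" = 1277086
)

# Each continent's share of the worldwide total
percent <- total_deaths / sum(total_deaths)
print(round(percent, 3))

# The shares add up to 1, so the slices cover the full pie
share_total <- sum(percent)
print(share_total)
```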
[19]:
# A stacked bar plot to illustrate the proportion of deaths on each continent
# x must be "" for a stacked bar plot since we want everything plotted on top of
# each other
# The answer should look like this, but with some 'corrections'
ggplot() +
  geom_bar(data = pivot_covid_df, aes(x = "", y = percent, fill = continent),
           stat = "identity") # default for a stacked bar graph
# Write your code here. Be careful to follow the instructions!
[20]:
# Convert to a pie chart.
# The answer should look like this, but with some 'corrections'
ggplot() +
  geom_bar(data = pivot_covid_df, aes(x = "", y = percent, fill = continent),
           stat = "identity") +
  coord_polar(theta = "y", start = 0) # turn into a pie graph
# Write your code here. Be careful to follow the instructions!
[21]:
# Clean up the rest of the chart.
# Add percentages etc
ggplot(pivot_covid_df, aes(x = "", y = percent, fill = continent)) +
  geom_bar(stat = "identity") +
  geom_text(aes(x = 1.6, label = scales::percent(percent, accuracy = .1)),
            position = position_stack(vjust = .5), size = 5) +
  coord_polar("y", start = 0) +
  theme_minimal() + # remove background
  theme(axis.title.x = element_blank(), axis.title.y = element_blank(),
        panel.border = element_blank(), # remove other excess text
        panel.grid = element_blank(), axis.ticks = element_blank(),
        axis.text = element_blank()) +
  theme(legend.text = element_text(size = 15),
        legend.title = element_blank()) # manipulate legend
0.6 Part 6: Histograms
Histograms display frequency distributions of data from one or more variables using adjacent vertical
bars. In this section, we will look at the distribution of normalized daily deaths throughout the
pandemic in Canada and the United States.
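The height of each histogram bar is simply the number of observations that fall inside that bar's interval. The sketch below illustrates this with base R's hist() on a hypothetical vector `x`, using bins of width 0.2. It is only meant to show the counting step: geom_histogram() positions its bin edges differently by default, so its bar heights for the same data may not match exactly.

```r
# Hypothetical daily values (not real data)
x <- c(0.05, 0.15, 0.25, 0.30, 0.55, 0.90)

# Bins of width 0.2 covering 0 to 1; plot = FALSE returns the counts without drawing
h <- hist(x, breaks = seq(0, 1, by = 0.2), plot = FALSE)

counts <- h$counts   # number of observations in each bin
print(counts)
```

A wider bin width pools more observations into each bar and gives a smoother but coarser picture of the distribution.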
[22]:
# Generate a basic histogram for daily deaths per million observed in Canada.
# We use the geom_histogram() function for this.
# The answer should look like this, but with some 'corrections'
ggplot(covid_df %>% filter(location == "Canada")) +
  geom_histogram(aes(x = new_deaths_per_million), binwidth = .2)
# Write your code here. Be careful to follow the instructions!
[23]:
# Plot histograms for daily deaths per million in both Canada and the USA on the
# same plot.
# Write your code here. Be careful to follow the instructions!
ggplot(covid_df %>% filter(location == "Canada" | location == "United States")) +
  geom_histogram(aes(x = new_deaths_per_million, color = location, fill = location),
                 alpha = 0.5, binwidth = .2)
# alpha sets the fill transparency: the lower the alpha, the more transparent the plot
[24]:
# Add a mean line for each distribution.
# We've created a new dataframe with the means of the datasets.
means <- covid_df %>%
  filter(location == "Canada" | location == "United States") %>%
  group_by(location) %>%
  summarise(mean = mean(new_deaths_per_million))
means
# The answer should look like this, but with some 'corrections'
ggplot() +
  geom_histogram(data = covid_df %>%
                   filter(location == "Canada" | location == "United States"),
                 aes(x = new_deaths_per_million, colour = location, fill = location),
                 alpha = 0.5, binwidth = .2) + # alpha makes fill transparent
  geom_vline(data = means, aes(xintercept = mean, colour = location),
             linetype = "dashed") # mean line
# Write your code here. Be careful to follow the instructions!
A tibble: 2 × 2
location       mean
<chr>          <dbl>
Canada         1.259491
United States  3.472203
[25]:
# Clean up the rest of the histogram.
ggplot() +
  geom_histogram(data = covid_df %>%
                   filter(location == "Canada" | location == "United States"),
                 aes(x = new_deaths_per_million, colour = location, fill = location),
                 alpha = 0.5, binwidth = .2) + # alpha makes fill transparent
  geom_vline(data = means, aes(xintercept = mean, colour = location),
             linetype = "dashed") +
  xlab("Daily Deaths Per Million") + ylab("Count") + # modify axis labels
  theme(legend.position = "bottom", legend.text = element_text(size = 15),
        legend.title = element_blank()) + # manipulate legend
  theme(axis.title = element_text(size = 15),
        axis.text.x = element_text(size = 15),
        axis.text.y = element_text(size = 15),
        axis.ticks.length = unit(.1, "cm")) +
  # manually specify colours for each group
  scale_color_manual(values = c("#ff0000", "#0000FF"))
0.7 Part 7: Heat Maps
Heat maps enable us to illustrate the relationship between three variables by separating two
categorical variables on the x and y axes, and displaying a third variable on the 2-dimensional matrix
using a colour gradient. In this section, we will compare the average number of Covid-19 cases
observed each day for each month of the pandemic in Canada. Our goal is to highlight months
where cases were especially high, and those where cases were especially low.
[26]:
# Modify the covid_df dataframe to have years and months as separate columns
covid_df <- covid_df %>%
  mutate(year = format(date, "%Y")) %>%
  mutate(month = format(date, "%m"))
[27]:
# Now let's build a heatmap dataframe. This data frame should only look at data
# from Canada, and then get the average number of new cases for each month/year
# The answer should look like this, but with some 'corrections'
heatmap_df <- covid_df %>%
  filter(location == "Canada") %>%
  group_by(year, month) %>%
  summarise(avg_cases = mean(new_cases))
head(heatmap_df)
# Write your code here. Be careful to follow the instructions!
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
A grouped_df: 6 × 3
year   month  avg_cases
<chr>  <chr>  <dbl>
2020   01        0.4444444
2020   02        1.0000000
2020   03      344.4516129
2020   04     1550.1000000
2020   05     1124.2903226
2020   06      418.0333333
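A heat map needs one value for every combination of its two categorical variables, which is what the year/month averages above provide. The same reshaping can be sketched in base R with format() and tapply(). The data frame `toy` and its values are hypothetical, and combinations with no observations come back as NA.

```r
# Hypothetical daily case counts (not real data)
toy <- data.frame(
  date      = as.Date(c("2020-03-01", "2020-03-02", "2020-04-01", "2021-03-01")),
  new_cases = c(10, 20, 40, 100)
)

# Split the date into year and month strings, as in the tutorial
toy$year  <- format(toy$date, "%Y")
toy$month <- format(toy$date, "%m")

# Mean of new_cases for every year/month combination (NA where no data exist)
avg_cases <- tapply(toy$new_cases, list(toy$year, toy$month), mean)
print(avg_cases)
```

Each cell of the resulting matrix corresponds to one tile of the heat map, with the cell value mapped to the fill colour.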
[28]:
# Generate a basic heat map with mostly default settings.
# The answer should look like this, but with some 'corrections'
ggplot(heatmap_df) +
  geom_tile(aes(year, month, fill = avg_cases), colour = "white") +
  scale_fill_gradient(low = "white", high = "dark red") # specify gradient colours
# Write your code here. Be careful to follow the instructions!
[29]:
# Clean up the heat map to make it more visually appealing.
# this setting will set a new working default for all of the Jupyter Notebook
options(repr.plot.width = 8, repr.plot.height = 4)
ggplot(heatmap_df) +
  geom_tile(aes(year, month, fill = avg_cases), colour = "white") +
  scale_fill_gradient(low = "white", high = "dark red",
                      guide = guide_colorbar(frame.colour = "black",
                                             frame.linewidth = 2,
                                             ticks.colour = "black",
                                             ticks.linewidth = 2,
                                             label = TRUE, barwidth = 2,
                                             barheight = 8)) + # modify your gradient features
  coord_flip() + # turn the plot 90 degrees
  scale_x_discrete(limits = rev) + # flip the order of the year axis
  theme_minimal() + # remove background features
  theme(panel.border = element_rect(colour = "black", fill = NA, linewidth = 2),
        panel.grid = element_blank()) + # add a panel border, remove the grid
  theme(legend.text = element_text(size = 15),
        legend.title = element_text(size = 15)) + # manipulate legend
  theme(axis.title = element_text(size = 15),
        axis.text.x = element_text(size = 15),
        axis.text.y = element_text(size = 15),
        axis.ticks.length = unit(.5, "cm")) # manipulate axis
0.8 Tutorial Summary
Your practical this week will apply the tools that we have learned during this tutorial in ggplot2
to generate appropriate plots for a given scenario. Specifically, you will be asked to identify the
best visualization tool to answer a particular question and then create the corresponding figure.
These activities will be conducted with the guidance of your TAs, but they will be graded.
[ ]: