Assignment 1 - HLTH 605B - Fall 2023 (100 marks) - Solution Set
Write up your own answers to the following questions. Also, where asked to use R, you must do so. Include
your R code as part of your answer, similar to how the code and subsequent results are presented in the
module notes. Submit your answers to the Assignment 1 dropbox in .pdf, .doc, or .docx format.
1. Problem 1 (20 marks)
Reflecting on the required reading article “Scientists rise up against statistical significance” in
Nature, 2019:
a. (10 marks) What are the authors trying to convey when mentioning that scientific decisions should not
just be based on statistical evidence?
Answer: Statistical evidence such as p-values should be only one component in making scientific
decisions. Specifically, according to the authors, we should not interpret a p-value in a purely dichotomous
way, where we assume that there is scientific evidence in favor of the alternative hypothesis if the p-value
is below the pre-established cutoff (e.g., .05) for the hypothesis test or, conversely, that there is no such
evidence if the p-value is above the pre-established cutoff.
b. (10 marks) Do your best to explain the following quote from the authors: “Neither should we conclude
that two studies conflict because one had a statistically significant result and the other did not.”
Answer: There are potentially many reasons why one study could lead to a null hypothesis being rejected
while another, even one that appears to address the same question, might not. For example, simple differences
in the realized background variation of the responses could lead to conflicting results on statistical significance.
Other reasons include differences in sample sizes, differences in the makeup of the samples themselves
(which may or may not reflect differences in the populations from which the samples were obtained), differences
in equipment or in expertise in obtaining measurements between the two studies, and so on.
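A minimal simulation sketch of this point, assuming two hypothetical studies that each draw 30 participants per arm from the same populations, with a true mean difference of 0.5 standard deviations. The sample size, effect size, seed, and the helper name `one_study_p` are illustrative assumptions, not taken from the article. It estimates how often two such identically designed studies disagree on statistical significance at the .05 level:

```r
# Illustrative simulation: two studies of the SAME true effect can disagree
# on statistical significance purely through sampling variation.
set.seed(605)  # arbitrary seed, for reproducibility only

# Hypothetical helper: run one two-arm study and return the t-test p-value.
one_study_p <- function(n_per_arm = 30, true_diff = 0.5) {
  control   <- rnorm(n_per_arm, mean = 0, sd = 1)
  treatment <- rnorm(n_per_arm, mean = true_diff, sd = 1)
  t.test(treatment, control)$p.value
}

n_pairs  <- 2000
p_study1 <- replicate(n_pairs, one_study_p())
p_study2 <- replicate(n_pairs, one_study_p())

# Proportion of study pairs in which exactly one study has p < .05.
disagree <- mean(xor(p_study1 < 0.05, p_study2 < 0.05))
disagree
```

Under these assumed settings each study has a bit under 50% power, so the two studies should "conflict" on significance in roughly half of the simulated pairs, even though the underlying effect is identical.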
2. Problem 2 (20 marks)
Answer the following conceptual questions connected to correlation.
a. (10 marks) Without writing any equations, explain the difference between the covariance of two numeric
variables, X and Y, and the (Pearson’s) correlation of those same two variables.
Answer: Both the covariance and the Pearson correlation of numeric variables X and Y describe the association
between the two variables. However, the size of the covariance is a function of both the association itself
and the units of the two variables. Hence, a covariance between X and Y of 1342.56 may be considered
large for one problem but not large at all for another; all we can say about this covariance,
without further context, is that the association between X and Y is positive, since the covariance is positive.
This lack of interpretability is where the Pearson correlation has the advantage, specifically when a linear
association between X and Y makes sense. The correlation takes the original covariance and
makes it insensitive to the original scales of X and Y, normalizing it to fall between -1 and 1, with specific
interpretations if the correlation is negative, roughly 0, or positive, and with the strongest linear relationships
occurring when the correlation is near its extremes of -1 or 1.
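As a small illustration of the unit dependence described in this answer, the sketch below uses made-up height and weight values (not from the course data) and the base R functions `cov` and `cor`. Re-expressing height in metres instead of centimetres shrinks the covariance by a factor of 100 but leaves the Pearson correlation unchanged:

```r
# Illustrative data (made up): height in centimetres and weight in kilograms.
height_cm <- c(152, 160, 165, 171, 178, 185)
weight_kg <- c(51, 58, 63, 70, 77, 88)

# The same heights expressed in metres: only the units change.
height_m <- height_cm / 100

cov_cm <- cov(height_cm, weight_kg)
cov_m  <- cov(height_m, weight_kg)   # 100 times smaller than cov_cm
cor_cm <- cor(height_cm, weight_kg)
cor_m  <- cor(height_m, weight_kg)   # identical to cor_cm

c(cov_cm = cov_cm, cov_m = cov_m, cor_cm = cor_cm, cor_m = cor_m)
```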
b. (10 marks) In Section 2.3.5 of the Module 2a notes, in the equation for the Spearman rank correlation
coefficient, $r^{(S)}_{X,Y}$, explain the role of the difference, $d_i$, between the ranks of X and Y for each
person i and how those differences can influence the final value of $r^{(S)}_{X,Y}$.
Answer: After ranking X and Y separately, the Spearman correlation looks at how close the two ranks are for
each given person i, on average. This is where the difference $d_i$ comes in. If the rank differences are
small on average, then the term with the squared differences will tend to be small (closer to 0), and since
that term is subtracted from 1, this leads to a Spearman correlation closer to 1.
However, if the differences are very large, so that when $X_i$ is ranked low $Y_i$ is ranked high, and vice
versa, then this indicates a strong negative correlation and pushes the squared term that contains
$d_i$ closer to 2, thereby leading the Spearman correlation to be closer to -1. Finally, if the ranks of
X and Y are unrelated across all i, then the squared term that contains $d_i$ will tend to be near 1,
leading the Spearman correlation to be near 0.
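The three cases in this answer can be checked numerically. The sketch below assumes the Module 2a equation is the standard no-ties form, 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)); the helper name `spearman_from_d` and the data are made up for illustration.

```r
# Hypothetical helper: Spearman correlation from rank differences,
# using the standard no-ties formula 1 - 6 * sum(d^2) / (n * (n^2 - 1)).
spearman_from_d <- function(x, y) {
  d <- rank(x) - rank(y)          # d_i: rank difference for person i
  n <- length(x)
  squared_term <- 6 * sum(d^2) / (n * (n^2 - 1))
  c(squared_term = squared_term, r_s = 1 - squared_term)
}

x <- c(10, 20, 30, 40, 50)        # made-up values with no ties

same_order     <- spearman_from_d(x, c(1, 2, 3, 4, 5))   # ranks agree
reversed_order <- spearman_from_d(x, c(5, 4, 3, 2, 1))   # ranks opposite
mixed_order    <- spearman_from_d(x, c(2, 5, 3, 1, 4))   # ranks scrambled

same_order      # squared term 0, r_s = 1
reversed_order  # squared term 2, r_s = -1
mixed_order     # squared term 1, r_s = 0
```

For data without ties, the `r_s` values should agree with `cor(x, y, method = "spearman")`.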
3. Problem 3 (60 marks)
Let’s go back to the gapminder dataset partially explored in the Module 2a notes.
a. (15 marks) Using R, and the R function ggplot, produce one scatterplot with gdpPercap on the
x-axis and lifeExp on the y-axis for the 1967 data, and a second scatterplot for the 2002 data. Compare
and contrast what you see in these two plots.
NOTE: Unlike what you see in the Module 2a code, you should add your own centered title to each of
these plots. The way to do this is to add the following to the ggplot call used to create scatterplots in
the Example in Section 2.2 of the Module 2a notes:
+ ggtitle("Enter your title here") + theme(plot.title = element_text(hjust = 0.5))
Answer: In FIGURE A1.3a.1 below for the 1967 data, we can see a curved relationship between gdpPercap
and lifeExp. Specifically, when gdpPercap is low (up to roughly 8000 or 9000), there is a steep, roughly
linear increase in lifeExp for small increases in gdpPercap. As gdpPercap increases further, the increase in
lifeExp slows down until a gdpPercap of about 16000-17000, after which it effectively flattens out. There is
one exception, a very large outlier at a gdpPercap of about 81000, whereas all other gdpPercap values fall
below about 23000. We can also see that lifeExp maxes out at just under 75.
In FIGURE A1.3a.2 below for the 2002 data, we can see a more pronounced curved relationship between
gdpPercap and lifeExp, as compared to the 1967 data. In addition, in 2002, there are many more countries
with gdpPercap > 23000. Also in the 2002 plot, lifeExp maxes out at about 82, as compared to closer to 75
in 1967.
Points allocation: 4 marks for each plot, of which 1 mark (in each plot) is allocated for properly adding a
title. The remaining 7 marks go to the explanation comparing and contrasting what is seen in the two
plots.
library(ggplot2)
# Load the gapminder data frame from the gapminder package
data(gapminder, package = "gapminder")
# 1967 scatterplot with a centered title
ggplot(subset(gapminder, year %in% 1967), aes(gdpPercap, lifeExp)) +
  geom_point() +
  ggtitle("FIGURE A1.3a.1: Scatterplot of 1967 life expectancy vs. GDP per capita") +
  theme(plot.title = element_text(hjust = 0.5))
[FIGURE A1.3a.1: Scatterplot of 1967 life expectancy vs. GDP per capita (x-axis: gdpPercap; y-axis: lifeExp)]
# 2002 scatterplot with a centered title
ggplot(subset(gapminder, year %in% 2002), aes(gdpPercap, lifeExp)) +
  geom_point() +
  ggtitle("FIGURE A1.3a.2: Scatterplot of 2002 life expectancy vs. GDP per capita") +
  theme(plot.title = element_text(hjust = 0.5))
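The approximate values read off the two plots (maximum lifeExp, the largest gdpPercap, and the number of countries above 23000) can be cross-checked numerically. The following is a minimal sketch, assuming the gapminder package is installed; the helper name `summarise_year` is hypothetical and not part of the module code.

```r
# Assumes the gapminder package is installed, as in the plotting code.
data(gapminder, package = "gapminder")

# Hypothetical helper: numeric summaries matching what is read off a plot.
summarise_year <- function(yr) {
  d <- subset(gapminder, year == yr)
  c(year          = yr,
    max_lifeExp   = max(d$lifeExp),
    max_gdpPercap = max(d$gdpPercap),
    n_above_23000 = sum(d$gdpPercap > 23000))
}

year_summary <- rbind(summarise_year(1967), summarise_year(2002))
year_summary
```

Each row summarises one year, so the two rows can be compared directly against the descriptions of the 1967 and 2002 scatterplots.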