Assignment_1_solution_set.pdf

School

University of Waterloo

Course

605B

Subject

Health Science

Date

May 17, 2024

Type

pdf

Pages

6

Assignment 1 - HLTH 605B - Fall 2023 (100 marks) - Solution Set

Write up your own answers to the following questions. Also, where asked to use R, you must do so. Include your R code as part of your answer, similar to how the code and subsequent results are presented in the module notes. Submit your answers to the Assignment 1 dropbox in .pdf, .doc, or .docx format.

1. Problem 1 (20 marks)

Reflecting on the required reading article “Scientists rise up against statistical significance” in Nature, 2019:

a. (10 marks) What are the authors trying to convey when mentioning that scientific decisions should not just be based on statistical evidence?

Answer: Statistical evidence such as p-values should be only one component in making scientific decisions. Specifically, according to the authors, we should not interpret a p-value in a purely dichotomous way, where we assume that scientific evidence exists in favor of the alternative hypothesis if the p-value is below the pre-established cutoff (e.g., .05) for the hypothesis test or, conversely, that no such evidence exists if the p-value is above the pre-established cutoff.

b. (10 marks) Do your best to explain the following quote from the authors: “Neither should we conclude that two studies conflict because one had a statistically significant result and the other did not.”

Answer: There are potentially many reasons that one study could lead to a null hypothesis being rejected while another, even one that appears to address the same question, might not. For example, simple differences in the realized background variation of the responses could lead to conflicting results on statistical significance.
Other reasons include differences in sample sizes, differences in the makeup of the samples themselves (which may or may not reflect differences in the populations from which the samples were obtained), differences in equipment or in expertise in obtaining measurements between the two studies, and so on.

2. Problem 2 (20 marks)

Answer the following conceptual questions connected to correlation.

a. (10 marks) Without writing any equations, explain the difference between the covariance of two numeric variables, X and Y, and (Pearson’s) correlation of those same two variables.

Answer: Both the covariance and the Pearson correlation of numeric variables X and Y describe the association between the two variables. However, the size of the covariance is a function of both the association itself and the units of the two variables. Hence, a covariance between X and Y of 1342.56 may be considered large for one problem but not large at all for another; all we can say about this covariance, without further context, is that the association between X and Y is positive, since the covariance is positive. This lack of interpretability is where the Pearson correlation has the advantage, specifically when a linear association between X and Y makes sense. The correlation takes the original covariance and makes it insensitive to the original scales of X and Y, normalizing it to fall between -1 and 1. It has specific interpretations when it is negative, roughly 0, or positive, with the strongest linear relationships occurring when the correlation is near its extremes of -1 or 1.

b. (10 marks) In Section 2.3.5 of the Module 2a notes, in the equation for the Spearman rank correlation coefficient, $r^{(S)}_{X,Y}$, explain the role of the difference, $d_i$, between the ranks of X and Y for each person i and how those differences can influence the final value of $r^{(S)}_{X,Y}$.
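For orientation on part (b), here is a minimal sketch in R of the Spearman calculation in terms of the rank differences $d_i$. It assumes the Module 2a notes use the standard no-ties form, $r^{(S)}_{X,Y} = 1 - 6\sum_i d_i^2 / (n(n^2 - 1))$; the function name `spearman_from_ranks` and the toy vectors are illustrative only and do not come from the notes.

```r
# Spearman rank correlation from rank differences (standard no-ties form).
# Assumption: this matches the equation in Section 2.3.5 of the notes.
spearman_from_ranks <- function(x, y) {
  stopifnot(length(x) == length(y), length(x) > 1)
  n <- length(x)
  d <- rank(x) - rank(y)                      # d_i: rank difference for person i
  scaled_term <- 6 * sum(d^2) / (n * (n^2 - 1))
  1 - scaled_term                             # the scaled term is subtracted from 1
}

# Toy values with no ties (made up for illustration)
x       <- c(2, 4, 6, 8, 10)
y_mixed <- c(3, 1, 5, 2, 4)

spearman_from_ranks(x, x^2)      # identical ranks: every d_i is 0
spearman_from_ranks(x, -x)       # fully reversed ranks: scaled term equals 2
spearman_from_ranks(x, y_mixed)  # partially matching ranks

# Cross-check against the built-in calculation
all.equal(spearman_from_ranks(x, y_mixed), cor(x, y_mixed, method = "spearman"))
```

With tied values the shortcut formula and `cor(..., method = "spearman")` can differ slightly, which is why the toy vectors avoid ties.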
Answer: After ranking X and Y separately, the Spearman correlation looks at how close the two ranks are for each person i, on average. This is where the difference $d_i$ comes in. If the differences of the ranks are small on average, then the term with the squared differences will tend to be small (closer to 0), and since that term is subtracted from 1, this leads to a Spearman correlation closer to 1. However, if the differences are very large, so that when $X_i$ is ranked low $Y_i$ is ranked high, and vice versa, then this is indicative of a strong negative correlation. The term that contains the squared $d_i$ will then be closer to 2, leading the Spearman correlation to be closer to -1. Finally, if the ranks of X and Y are unrelated across all i, on average, then the term that contains the squared $d_i$ will tend to be near 1, leading the Spearman correlation to be near 0.

3. Problem 3 (60 marks)

Let’s go back to the gapminder dataset partially explored in the Module 2a notes.

a. (15 marks) Using R, and the R function ggplot, produce one scatterplot with gdpPercap on the x-axis and lifeExp on the y-axis for the 1967 data, and a second scatterplot for the 2002 data. Compare and contrast what you see in these two plots.

NOTE: Unlike what you see in the Module 2a code, you should add your own centered title to each of these plots. The way to do this is to add the following to the ggplot call used to create scatterplots in the Example in Section 2.2 of the Module 2a notes:

    + ggtitle("Enter your title here") +
      theme(plot.title = element_text(hjust = 0.5))

Answer: In FIGURE A1.3a.1 below for the 1967 data, we can see a curved relationship between gdpPercap and lifeExp. Specifically, there is a steep, roughly linear increase in lifeExp for small increases in gdpPercap when gdpPercap is low (up to roughly 8000 or 9000).
But as gdpPercap increases, this increase in lifeExp slows down until about 16000-17000 gdpPercap, after which it effectively flattens out. There is one exception, a very large outlier at about 81000 gdpPercap, whereas all other gdpPercap values fall below about 23000. We can also see that lifeExp maxes out at just under 75.

In FIGURE A1.3a.2 below for the 2002 data, we can see a more pronounced curved relationship between gdpPercap and lifeExp, as compared to the 1967 data. In addition, in 2002, there are many more countries with gdpPercap > 23000. Also in the 2002 plot, lifeExp maxes out at about 82, as compared to closer to 75 in 1967.

Points allocation: 4 marks for each plot, of which 1 mark (per plot) is allocated for properly adding a title. The remaining 7 marks go to the explanation comparing and contrasting what is seen in the two plots.

    library(ggplot2)
    # Load the gapminder data from the gapminder package
    data(gapminder, package = "gapminder")
    # 1967 scatterplot with a centered title
    ggplot(subset(gapminder, year %in% 1967), aes(gdpPercap, lifeExp)) +
      geom_point() +
      ggtitle("FIGURE A1.3a.1: Scatterplot of 1967 life expectancy vs. GDP per capita") +
      theme(plot.title = element_text(hjust = 0.5))
[Figure: FIGURE A1.3a.1: Scatterplot of 1967 life expectancy vs. GDP per capita (x-axis: gdpPercap; y-axis: lifeExp)]

    # 2002 scatterplot with a centered title
    ggplot(subset(gapminder, year %in% 2002), aes(gdpPercap, lifeExp)) +
      geom_point() +
      ggtitle("FIGURE A1.3a.2: Scatterplot of 2002 life expectancy vs. GDP per capita") +
      theme(plot.title = element_text(hjust = 0.5))
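Since the two plotting calls differ only in the year and the title, one optional way to avoid the repetition is a small helper. This is a sketch only: the function name `plot_life_vs_gdp` and its arguments are hypothetical and are not part of the module notes or the marking scheme. It assumes a data frame with `year`, `gdpPercap`, and `lifeExp` columns, as in gapminder.

```r
library(ggplot2)

# Hypothetical helper (not from the module notes): scatterplot of lifeExp
# against gdpPercap for a single year, with a centered title.
plot_life_vs_gdp <- function(data, yr, title) {
  year_data <- data[data$year == yr, ]        # keep only the requested year
  ggplot(year_data, aes(gdpPercap, lifeExp)) +
    geom_point() +
    ggtitle(title) +
    theme(plot.title = element_text(hjust = 0.5))
}

# Usage, assuming the gapminder package is installed:
# data(gapminder, package = "gapminder")
# plot_life_vs_gdp(gapminder, 1967,
#   "FIGURE A1.3a.1: Scatterplot of 1967 life expectancy vs. GDP per capita")
# plot_life_vs_gdp(gapminder, 2002,
#   "FIGURE A1.3a.2: Scatterplot of 2002 life expectancy vs. GDP per capita")
```

Passing the data frame as an argument keeps the helper independent of how gapminder is loaded, so it can be tried on any data with the same three columns.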