BIOL206 - Lab 11: ANOVA and population dynamics
Fall 2023

Objectives

The learning objectives of this lab are to:
1. Be introduced to population dynamics, an important element in ecology and conservation.
2. Compare the means of more than two groups using a statistical test called an ANOVA.

Population dynamics

Species extinctions are necessarily preceded by declines in abundance of plant and animal populations. Thus, many ecologists are interested in tracking population trends (i.e., whether abundance is increasing or decreasing), which can be done by repeatedly counting populations over time. Repeated counts of this kind are known as “time-series data”.

Population growth or decline can be described mathematically using the equation:

$$N(t+1) = N(t)\,e^{r}$$

where $N(t)$ is the population size at time $t$, $N(t+1)$ is the population size at time $t+1$, and $r$ is the growth rate. The units of $r$ are $t^{-1}$ (for yearly counts, per year).

Obviously, this is a simplified model of population dynamics, as it does not include processes such as random population fluctuations and density dependence. Nonetheless, our main interest is in the coefficient $r$, which determines whether the population is growing or declining over the long term.

Estimating population growth from time series data

We can rearrange the equation above to see how to estimate the growth rate, $r$, from time-series data:

$$N(t+1) = N(t)\,e^{r}$$
$$N(t+1)/N(t) = e^{r}$$
$$\log\big(N(t+1)/N(t)\big) = r$$
$$\log(N(t+1)) - \log(N(t)) = r$$

Therefore, the difference in the log population size from one year to the next is an estimate of the population growth rate. This value is known as the log-difference. Here, log is the natural logarithm.

For example, suppose that a population has the following sizes over five years:

Year t | Population size N(t) | Log of population size log(N(t)) | Log-difference log(N(t+1))-log(N(t))
2000   | 39                   | 3.664                            |
2001   | 61                   | 4.111                            | 4.111-3.664 = 0.447
2002   | 48                   | 3.871                            | 3.871-4.111 = -0.24
2003   | 31                   | 3.434                            | 3.434-3.871 = -0.437
2004   | 32                   | 3.466                            | 3.466-3.434 = 0.032

Notice that:
- If the population increases, the log-difference is positive.
- If the population stays almost the same, the log-difference is close to 0.
- If the population decreases, the log-difference is negative.

The mean log-difference over the five-year time period was -0.05 year$^{-1}$. This value is an estimate of the growth rate, $r$, of the population. In this lab and next week’s lab, we will try to predict population growth rate.

The Living Planet Index

In this lab, we use data sampled from The Living Planet Index database. From the LPI website in 2022:

"The LPI tracks almost 21,000 populations of mammals, birds, fish, reptiles and amphibians around the world. [...] The data is gathered from almost 4,000 sources, using increasingly sophisticated technology such as audio devices to monitor insect sounds; drones and satellite tagging to track populations on the move; and even block-chain technology to track the impact of harvesting on wild populations."
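Returning to the five-year worked example in the section on estimating population growth, the log-difference calculation can be sketched in a few lines of R. This is only an illustration: the variable names `N`, `log_diff`, and `r_hat` are chosen here and do not come from the lab script.

```r
# Population counts from the worked example (years 2000 to 2004)
N <- c(39, 61, 48, 31, 32)

# log() is the natural log; diff() returns log(N(t+1)) - log(N(t))
# for each pair of consecutive years
log_diff <- diff(log(N))
round(log_diff, 3)

# The mean log-difference is the estimated growth rate r (per year)
r_hat <- mean(log_diff)
round(r_hat, 2)

# The differences telescope, so the mean depends only on the first and
# last counts and on the number of yearly steps
all.equal(r_hat, (log(N[5]) - log(N[1])) / 4)
```

The last line points to a limitation of this estimator: intermediate counts cancel out, so a single unusual first or last count can have a large effect on the estimate.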
[Figure: Map of terrestrial & freshwater populations in the Living Planet Database.]

Analysis of Variance

Today, we will use a statistical test called an ANOVA to try to predict population growth rate of animal populations in the Living Planet Database.

So far, you have learned two types of hypothesis tests: t-tests and linear regressions. Today, you will learn a third type of hypothesis test: analysis of variance (ANOVA). To review:
- One-sample t-tests are used to determine whether a population mean is equal to a hypothesized value.
- Two-sample t-tests are used to determine whether two populations have the same mean.
- Linear regressions are used to determine whether two continuous variables are related.

ANOVAs will add a new ability to your repertoire. Similar to a two-sample t-test but more flexible, ANOVAs allow you to determine whether two or more populations have the same mean.

The statistical hypotheses associated with an ANOVA for k groups are:

Null hypothesis ($H_0$): The population mean is the same for all groups. $\mu_1 = \mu_2 = \dots = \mu_k$
Alternative hypothesis ($H_A$): The population mean varies between the groups.
Note: We do not generally specify the alternative hypothesis in mathematical terms for an ANOVA with more than two groups. However, if we want to write it out, we need to list all possible group differences. For example, for an ANOVA with three groups: $H_A$: $\mu_1 \neq \mu_2$ or $\mu_1 \neq \mu_3$ or $\mu_2 \neq \mu_3$.

ANOVAs can test whether any number of groups have different population means. Therefore, they can test whether two groups have different population means, just like two-sample t-tests. In fact, two-sample t-tests and ANOVAs with two groups are mathematically related and can be used interchangeably in many situations.

The F-statistic

ANOVAs use F as a test statistic. Although the calculation and distribution of F are different from those of t, the hypothesis testing framework is the same: You calculate the test statistic for your observations and compare it to a critical value to know whether to reject the null hypothesis.

The formula for F is:

$$F = \frac{MSG}{MSE}$$

where
- MSG is the mean square between groups, a measure of variation between the groups.
- MSE is the mean square within groups, a measure of variation within groups.

[Figure: Sources of variation in ANOVA (modified from https://www.datanovia.com/en/lessons/anova-in-r/)]

Therefore, F is a ratio of variation between and within groups. The more variation there is between groups (relative to within groups), the larger the F. If F is large enough (i.e., greater than the critical value), we reject the null hypothesis that the groups have the same mean.
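To make the ratio concrete, here is a minimal sketch that computes F by hand for a small data set and compares it with the value from R's `aov()`. The data are hypothetical numbers invented for this illustration, and the sums of squares and degrees of freedom it relies on are explained step by step in the Procedure.

```r
# Hypothetical toy data: three groups of four observations each
y <- c(4.1, 5.0, 4.6, 4.3,   # group A
       5.9, 6.4, 6.1, 5.6,   # group B
       4.9, 5.2, 5.5, 4.8)   # group C
g <- factor(rep(c("A", "B", "C"), each = 4))

k <- nlevels(g)   # number of groups
n <- length(y)    # total sample size

# ave() returns, for every observation, the mean of its own group
group_mean_per_obs <- ave(y, g)
grand_mean <- mean(y)

# Between-group variation: group means versus the grand mean
SSG <- sum((group_mean_per_obs - grand_mean)^2)
# Within-group variation: observations versus their own group mean
SSE <- sum((y - group_mean_per_obs)^2)

MSG <- SSG / (k - 1)
MSE <- SSE / (n - k)
F_manual <- MSG / MSE

# Cross-check against the built-in ANOVA table
F_aov <- summary(aov(y ~ g))[[1]][1, "F value"]
c(manual = F_manual, aov = F_aov)
```

In this toy example group B sits well above the other two groups, so the between-group mean square is large relative to the within-group mean square and F is far greater than 1.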
Procedure

This week you will be using ANOVAs to predict growth rates of populations in the Living Planet Index database. We have already calculated the estimated growth rates for the populations for you, by taking the mean log-difference of the time series data. We will refer to the mean log-difference as “population growth rate”, although “estimated mean population growth rate” would be a more accurate (but cumbersome) name. There is one population growth rate value for each population.

Start by downloading the Living Planet data, “LPI.csv”, from MyCourses. Open an R Script, set your working directory, and load the csv as a dataframe. Call it “LPI”.

Each row is a different population. The final column of the dataframe is Pop.growth. This is the mean log-difference of each population. The other columns give other information about the populations, such as their class and biome.

    # Look at the LPI data
    View(LPI)

Scientific question

An interesting observation is that the average population growth of the populations is close to zero. In fact, a one-sample t-test shows that the mean population growth rate is not significantly different from zero.

    # One-sample t-test of population growth
    # H0: mu = 0
    t.test(LPI$Pop.growth, mu = 0, alternative = "two.sided")

This lack of change in average population size across all populations is the result of some populations increasing and offsetting declines in other populations. Looking at a histogram of the population growth rates, we can see that the mean of the distribution is approximately zero, with roughly symmetrical variation around the mean.

    # Install the ggplot2 package (only needed once)
    install.packages("ggplot2")
    # Load ggplot2
    library(ggplot2)
    # Histogram of population growth
    ggplot(data = LPI) +
      geom_histogram(mapping = aes(x = Pop.growth), bins = 10) +
      labs(title = "Histogram of population growth rate",
           x = "Population growth rate (1/year)",
           y = "Frequency")

What explains this variation?
In other words, what predicts whether a population declines, increases, or stays the same over time? Populations do not grow or decline in isolation, but instead are affected by myriad potential factors that may vary across space and time. A better understanding of what factors are associated with population decline could help inform conservation priorities and strategies.

Today, we will test whether taxonomic class and trophic level predict population growth rate. We will start with the scientific question: Are some taxonomic classes of vertebrates experiencing more population decline than other classes?

Biological hypothesis

We hypothesize that some classes of vertebrates are experiencing more population decline than other classes because they are less able to adapt to rapid anthropogenic environmental change.

Exploratory data analysis

Just as with t-tests and linear regression, it’s vital to conduct an EDA before you begin a hypothesis test. The EDA for an ANOVA is similar to the EDA for a two-sample t-test.
- Calculate summary statistics of each group separately, as well as all the groups together.
- Visualize the data: Make a histogram for each group and make a boxplot of all the distributions together. Note whether the distributions are approximately normal.

    # Summary statistics of the full population growth distribution
    mean(LPI$Pop.growth)
    median(LPI$Pop.growth)
    sd(LPI$Pop.growth)
    min(LPI$Pop.growth)
    max(LPI$Pop.growth)

    # Summary statistics for each group separately
    tapply(X = LPI$Pop.growth, INDEX = LPI$Class, FUN = mean)
    tapply(X = LPI$Pop.growth, INDEX = LPI$Class, FUN = median)
    tapply(X = LPI$Pop.growth, INDEX = LPI$Class, FUN = sd)
    tapply(X = LPI$Pop.growth, INDEX = LPI$Class, FUN = var)
    tapply(X = LPI$Pop.growth, INDEX = LPI$Class, FUN = min)
    tapply(X = LPI$Pop.growth, INDEX = LPI$Class, FUN = max)

    # Histograms of each class
    # Amphibians
    ggplot(data = LPI[LPI$Class == "Amphibia", ]) +
      geom_histogram(mapping = aes(x = Pop.growth), bins = 10) +
      labs(title = "Histogram of population growth rate",
           x = "Population growth rate (1/year)",
           y = "Frequency")

    # YOUR CODE HERE! (Make histograms for each class)

    # Boxplot
    ggplot(LPI, aes(x = Class, y = Pop.growth)) +
      geom_boxplot() +
      labs(title = "Population growth rate by animal class", x = "Taxonomic class", y = "Population growth rate (1/year)")

What are your impressions from this EDA? Do you think the ANOVA will find a significant difference in mean population growth rate between the taxonomic classes?

Hypothesis test

State the biological hypothesis as statistical hypotheses:

Statistical null hypothesis ($H_0$): The population mean of population growth rate is the same for all taxonomic classes. $\mu_1 = \mu_2 = \mu_3 = \mu_4$
Statistical alternative hypothesis ($H_A$): The population mean of population growth rate varies between the taxonomic classes.

Note: The wording for these hypotheses is a little awkward because they mention two different populations. The population in “population growth rate” refers to ecological populations (groups of animals of the same species in the same geographic area). The population in “population mean” refers to the statistical populations, which in this case are the populations of all ecological populations within each class.

Choose an appropriate statistical test that will allow you to reject $H_0$ if it is false:

You want to determine whether different groups (taxonomic classes) have different population means (mean population growth rate). Therefore, an ANOVA is probably appropriate. Before we can say for certain whether an ANOVA is appropriate, we need to check the assumptions. ANOVAs have the same assumptions as two-sample t-tests:

1. The observations are independent of one another: For the purposes of this lab, we will assume this assumption is met. In reality, there might be non-independence caused by phylogeny, as we have seen in previous labs. There might also be non-independence due to spatial structure: populations in the same geographic region might be more similar.
2. The variable is approximately normally distributed within each sample: You saw from your histograms in the EDA that the samples are approximately normally distributed.
   The sample size of each group is relatively small, so we do not expect the distributions to follow normal distributions closely.
3. The samples have approximately equal variance (homoscedasticity): You saw from your summary statistics in the EDA that the samples have approximately equal variance. Their variance is roughly the same order of magnitude, and although mammal variance is somewhat higher than the other classes, it is not enough to cause problems.

Therefore, the data meet the assumptions and an ANOVA is appropriate.

Choose a significance level ($\alpha$):

As usual, $\alpha$ = 0.05.
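The equal-variance check described in assumption 3 can be made a little more explicit by comparing the largest and smallest group variances directly. The sketch below is an illustration only: because LPI.csv is not reproduced here, it uses a made-up stand-in data frame called `demo` with the same two column names, and the group sizes and standard deviation are arbitrary assumptions.

```r
set.seed(1)  # arbitrary seed so the made-up data are reproducible

# Hypothetical stand-in for LPI: 135 populations in four classes.
# Group sizes and the sd of 0.09 are invented for this sketch.
demo <- data.frame(
  Class = rep(c("Amphibia", "Aves", "Mammalia", "Reptilia"),
              times = c(20, 45, 40, 30)),
  Pop.growth = rnorm(n = 135, mean = 0, sd = 0.09)
)

# Sample size and variance of each group
group_n   <- table(demo$Class)
group_var <- tapply(X = demo$Pop.growth, INDEX = demo$Class, FUN = var)

# How many times larger is the biggest variance than the smallest?
var_ratio <- max(group_var) / min(group_var)
# The same comparison on the standard deviation scale
sd_ratio <- sqrt(var_ratio)

group_n
round(group_var, 4)
c(var_ratio = var_ratio, sd_ratio = sd_ratio)
```

A ratio close to 1 indicates similar spread in every group. A commonly quoted rule of thumb treats a largest-to-smallest standard deviation ratio below about 2 as acceptable, but this is a guideline rather than a formal test, and the lab's own judgement is based on the summary statistics from the EDA.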
Determine the critical value that the test statistic must exceed to be significant:

Let’s explore what the data should look like if the null hypothesis is true and the population means of all classes are the same. What does a typical sample from this null population look like? What range of F-values do the samples typically have? As we did with the t-distribution, we can simulate the null population and take samples to find out.

    # Simulate a null population where all taxonomic classes have the same
    # mean and standard deviation for population growth. Repeatedly sample
    # from the population.
    null_samp_F = NULL
    for (i in 1:10000){
      Pop.growth_null = rnorm(n = 135, mean = mean(LPI$Pop.growth), sd = sd(LPI$Pop.growth))
      Class_null = sample(x = c("Aves", "Mammalia", "Amphibia", "Reptilia"), size = 135, replace = TRUE)
      null_samp_F[i] = summary(aov(Pop.growth_null ~ Class_null))[[1]][1, 4]
    }
    # (This may take a few seconds to run)

    # Make a histogram of the null sample Fs (sampling distribution)
    ggplot(data = data.frame(null_samp_F)) +
      geom_histogram(mapping = aes(x = null_samp_F), bins = 50) +
      labs(title = "F of samples from the null population", x = "F", y = "Frequency")

Since we chose a significance level of 0.05, we want to reject the null hypothesis only if the F value is unusual enough that it (or a more extreme value) occurs only 5% of the time when the null hypothesis is true. We can use the simulated F-distribution to find the value that the sample Fs fall below 95% of the time when the null hypothesis is true:

    # Put the numbers in increasing order
    ordered_null_samp_F = null_samp_F[order(null_samp_F)]
    # Select the 9,500th value (of 10,000)
    # Only 5% of values in the sample are larger than this value
    ordered_null_samp_F[9500]

This value, which should be roughly 2.7, is the critical F. This means that F is equal to or greater than approximately this value only 5% of the time when the null hypothesis is true.
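Sorting the simulated values and picking the 9,500th is one way to read off the 95th percentile; R's `quantile()` function does the same job in one call. The sketch below repeats the simulation on a smaller scale and compares the two approaches. It is an illustration under stated assumptions: the mean of 0 and standard deviation of 0.09 are placeholders rather than values from LPI.csv (under the null hypothesis the F-distribution does not depend on them), and the names `sim_F`, `crit_sorted`, and `crit_quantile` are chosen here.

```r
set.seed(206)  # arbitrary seed so the sketch is reproducible

n_sims  <- 2000  # fewer than the lab's 10,000 to keep the run short
classes <- c("Aves", "Mammalia", "Amphibia", "Reptilia")

# Simulate F under the null: growth rates come from a single normal
# distribution and class labels are assigned at random.
sim_F <- replicate(n_sims, {
  growth <- rnorm(n = 135, mean = 0, sd = 0.09)
  cls    <- sample(classes, size = 135, replace = TRUE)
  summary(aov(growth ~ cls))[[1]][1, 4]
})

# Approach used in the lab: sort, then take the value 95% of the way along
crit_sorted <- sort(sim_F)[round(0.95 * n_sims)]

# Shortcut: quantile() returns the 95th percentile directly
crit_quantile <- quantile(sim_F, probs = 0.95, names = FALSE)

c(sorted = crit_sorted, quantile = crit_quantile)
```

The two numbers differ slightly because `quantile()` interpolates between neighbouring sorted values by default, and both will vary from run to run if the seed is changed. With more simulated samples they settle closer to the value of roughly 2.7 mentioned above.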
As with the t-distribution, it is educational to use a simulation to generate the F-distribution, but using the theoretical F-distribution is easier and more accurate. In R, we can use the qf() function with the following arguments:
- p = $\alpha$
- df1: ANOVAs have two degrees-of-freedom values. df1 is the degrees of freedom for the F numerator (MSG). df1 = k-1, where k is the number of groups. df1 = 3.
- df2: df2 is the degrees of freedom for the F denominator (MSE). df2 = n-k, where n is the sample size of all the groups put together. df2 = 135-4 = 131.
- lower.tail = FALSE

    # Obtain the critical F
    F_crit = qf(p = 0.05, df1 = 3, df2 = 131, lower.tail = FALSE)
    F_crit

    # Add the critical F to the histogram of the F-distribution
    Fdist = rf(n = 10000, df1 = 3, df2 = 131)
    ggplot(data = data.frame(Fdist)) +
      geom_histogram(mapping = aes(x = Fdist), bins = 50) +
      labs(title = "F of samples from the null population", x = "F", y = "Frequency") +
      geom_vline(xintercept = F_crit, col = "red")

From the theoretical F-distribution, the critical value is 2.674 (which should be pretty close to the value from your simulated F-distribution). This is the cut-off value for the ANOVA. If the F we calculate from the sample is greater than this critical value, we will reject the null hypothesis.

Note: Notice that there is no one-sided test. ANOVAs are always non-directional.

Perform the statistical test:

To calculate F, we create an ANOVA table:

          | df  | SS | MSS | F
Group     |     |    |     |
Residuals |     |    |     |

We fill in each column from left to right, starting with df. This column is simply the degrees of freedom we calculated earlier, with df1 in the Group row and df2 in the Residuals row.

    # Degrees of freedom
    df1 = 3
    df2 = 131

          | df  | SS | MSS | F
Group     | 3   |    |     |
Residuals | 131 |    |     |

The group sum of squares, SSG, goes in the Group row of the SS column, and the sum of squares of the residuals (SSE) goes in the Residuals row of the SS column.

SSG is the sum of squared differences between each observation’s group mean and the overall mean. In other words, it is the sum of squared differences between the predicted values and the mean.
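Written as a formula, the definition of SSG above reads as follows. The notation is assumed here rather than taken from the lab handout: $k$ is the number of groups, $n_j$ is the number of observations in group $j$, $\bar{y}_j$ is the mean of group $j$, and $\bar{y}$ is the overall mean.

```latex
SSG = \sum_{j=1}^{k} n_j \left( \bar{y}_j - \bar{y} \right)^2
```

Each group contributes its squared deviation once per observation, which is why the group size $n_j$ appears as a weight.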
    # SSG:
    # Predicted values of population growth
    pred = predict(aov(Pop.growth ~ Class, data = LPI), newdata = LPI)
    # SSG
    SSG = sum((pred - mean(LPI$Pop.growth))^2)
    SSG

SSE is the sum of the squared residuals. For an ANOVA, the residuals are the differences between the observations and their group means.

    # SSE:
    # Calculate the residuals by running the model
    residuals = aov(Pop.growth ~ Class, data = LPI)$residuals
    # Sum of squared residuals
    SSE = sum(residuals^2)
    SSE

          | df  | SS     | MSS | F
Group     | 3   | 0.0091 |     |
Residuals | 131 | 1.0092 |     |

We calculate the mean squares (MS) by dividing the SS column by the df column. The mean square between groups, MSG, goes in the Group row. This is a measure of how much the group means vary. The mean square within groups, MSE, goes in the Residuals row. This is a measure of how much the observations vary within the groups.

    # MSG
    MSG = SSG / df1
    MSG
    # MSE
    MSE = SSE / df2
    MSE

          | df  | SS     | MSS      | F
Group     | 3   | 0.0091 | 0.003039 |
Residuals | 131 | 1.0092 | 0.007704 |

Notice that MSG is small compared to MSE, which suggests the group means do not vary much, relative to the variation within the groups. We formalize this comparison between MSG and MSE by calculating F. Recall that F = MSG/MSE.
    # F (stored as F_calc so that R's built-in shorthand F for FALSE is not overwritten)
    F_calc = MSG / MSE
    F_calc

          | df  | SS     | MSS      | F
Group     | 3   | 0.0091 | 0.003039 | 0.394
Residuals | 131 | 1.0092 | 0.007704 |

The calculated F is 0.394. From the previous step of the hypothesis testing framework, we found that the critical value of F is 2.674.

    # Add the calculated (blue) and critical (red) F
    # to the histogram of the F-distribution
    ggplot(data = data.frame(Fdist)) +
      geom_histogram(mapping = aes(x = Fdist), bins = 50) +
      labs(title = "F of samples from the null population", x = "F", y = "Frequency") +
      geom_vline(xintercept = F_crit, col = "red") +
      geom_vline(xintercept = 0.394, col = "blue")

The calculated F is smaller than the critical value, so we fail to reject the null hypothesis that the population mean of population growth rate is the same for all taxonomic classes. We conclude that the classes do not differ significantly in their mean population growth rate. Our biological hypothesis that some taxonomic classes of vertebrates are experiencing more population decline than other classes is not supported by this data.

We can verify our calculations using the aov() function in R:

    # Run the ANOVA predicting population growth rate from class
    anova1 = aov(Pop.growth ~ Class, data = LPI)
    summary(anova1)

You should see that the aov() function creates the same ANOVA table with the same values as we calculated.

The aov() ANOVA table has an additional column, Pr(>F). This is the p-value, which provides another method of deciding whether to reject the null hypothesis. Here p = 0.757, which is greater than 0.05, so we fail to reject the null hypothesis. When the null hypothesis is true, a sample F of 0.394 or greater occurs 75.7% of the time (i.e., very often).
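The p-value in the Pr(>F) column can also be obtained directly from the theoretical F-distribution with `pf()`, the counterpart of `qf()`. The sketch below plugs in the rounded F and the degrees of freedom reported in this lab; the variable names `F_obs` and `p_value` are chosen here for illustration.

```r
# Values reported in the lab's ANOVA table
F_obs <- 0.394
df1   <- 3
df2   <- 131

# Upper-tail probability: the chance of an F at least this large
# when the null hypothesis is true
p_value <- pf(q = F_obs, df1 = df1, df2 = df2, lower.tail = FALSE)

# Critical value for a significance level of 0.05
F_crit <- qf(p = 0.05, df1 = df1, df2 = df2, lower.tail = FALSE)

c(p_value = p_value, F_crit = F_crit)

# The two decision rules give the same answer:
# F exceeds the critical value exactly when p is below alpha
(F_obs > F_crit) == (p_value < 0.05)
```

The results should match the lab's values of about 0.757 and 2.674, up to rounding of the F statistic. Comparing F with the critical value and comparing p with the significance level are two views of the same test, so they cannot disagree.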
Assignment

Submit your completed assignment on MyCourses before your next lab session. Write your assignment in Word, then upload it as a PDF. None of your answers should be longer than four sentences.

Perform a statistical hypothesis test that addresses this scientific question: Does the trophic level of a population affect its population growth rate?

1. Below is the time-series data for an example population. Estimate the population growth rate (mean log-difference) for the population. Show your calculations. [0.5 pt]

   Year | Population size
   2019 | 102
   2020 | 91
   2021 | 84
   2022 | 65

2. Make a biological hypothesis. Support your hypothesis with a rationale and at least one reference to a peer-reviewed study. [1 pt]
3. Perform exploratory data analysis. Present your results, including a table of summary statistics, appropriate data visualization, and a short paragraph summarizing your impressions. [1 pt]
4. Perform the statistical hypothesis test.
   a. What are your null ($H_0$) and alternative ($H_A$) hypotheses? Provide both hypotheses in words and provide the null hypothesis in mathematical format. [0.25 pt]
   b. Is an ANOVA appropriate to test these hypotheses? Justify your answer. [0.25 pt]
   c. What significance level did you choose? What are the degrees of freedom? What is the critical F? [0.25 pt]
   d. Provide the completed ANOVA table. Show your calculations underneath the table. (You can use R to calculate the values in the equations, such as the mean and residuals. Do not use aov() except to double-check your answer.) [0.5 pt]
   e. Do you reject the null hypothesis? Why or why not? What do you infer from the results of this test? [0.25 pt]

Total points = 4 pts