School: Toronto Metropolitan University
Course: 259
Subject: Statistics
Date: Apr 3, 2024

BIO259-Graphical-Data-Summaries_Tutorial

October 16, 2023

Our ability to generate visually stimulating graphical representations that evoke appropriate responses from the audiences of our data is arguably the most important aspect of biological data science. These graphical representations not only help us communicate our results to our audience, but also enable us to gain greater insight into our own data through simplification. By leveraging the power of computer programming, we can generate and manipulate figures in order to produce an extremely wide range of plots that simply aren’t accessible through most graphical user interfaces. We can also heavily parallelize our analyses and incorporate these visualizations into pipelines that enable us to generate the same plots for different datasets.

In today’s tutorial we will learn how to generate several different types of graphical data summaries in R using the ggplot2 package. We will also learn to perform some basic manipulations and customize these plots where applicable. Our tutorial is divided into seven primary sections, each of which will cover how to generate and customize a different type of plot:

1. Line Graphs
2. Bar Charts
3. Box and Whisker Plots (with dot plots)
4. Scatter Plots
5. Pie Charts
6. Histograms
7. Heat Maps

#1. Line graphs are typically used for time data
#2. Bar charts are good for looking at categorical data and their counts
#3. Look at the spread of your data
#4. Look at the relationship between 2 variables
#5. Visualize proportions or percentages
#6. The histogram y axis is typically a count
#7. Visually infer correlation between 2 variables

[1]:
    # Run the code block below in order to import our dataset, select a subset of
    # the columns, and filter out undesirable location data.
    # This will be the data that we will work with throughout this tutorial and in
    # our practical.
    library(dplyr)
    library(ggplot2)

    covid_df <- read.table("/var/biojupyterhubdata/BIO259/owid-covid-data-15.06.2022.csv",
                           sep = ",", header = TRUE, quote = "")

    # keep only the columns we need
    covid_df <- covid_df %>%
      select(c("continent", "location", "date", "total_cases", "new_cases",
               "total_deaths", "new_deaths", "total_cases_per_million",
               "new_cases_per_million", "total_deaths_per_million",
               "new_deaths_per_million", "icu_patients", "icu_patients_per_million",
               "hosp_patients", "hosp_patients_per_million", "total_vaccinations",
               "total_tests", "new_tests", "positive_rate", "people_vaccinated",
               "people_fully_vaccinated", "total_boosters", "new_vaccinations",
               "population", "population_density", "median_age", "gdp_per_capita",
               "diabetes_prevalence", "life_expectancy"))

    # drop aggregate rows (continents, income groups, etc.) that are not countries
    covid_df <- covid_df %>%
      filter(!location %in% c('Africa', 'Asia', 'Europe', 'North America',
                              'South America', 'Oceania', 'Low income',
                              'Lower middle income', 'Upper middle income',
                              'High income', 'European Union', 'International',
                              'Northern Cyprus', 'World', 'Grand Total'))

    covid_df <- covid_df %>%
      mutate(date = as.Date(date, format = "%Y-%m-%d"))  # convert column to date object

    covid_df[is.na(covid_df)] <- 0  # replace NA values with 0

    Attaching package: ‘dplyr’

    The following objects are masked from ‘package:stats’:

        filter, lag

    The following objects are masked from ‘package:base’:

        intersect, setdiff, setequal, union

0.1 Part 1: Line Graphs

Line graphs are an effective means of displaying changes in a response variable through time. In this section, we will use our Covid-19 dataset to plot the number of SARS-CoV-2 cases detected each day through time in order to compare the pandemic curves for Canada and the USA.
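Before working with the full dataset, the idea of a per-million line graph can be sketched on a toy data frame. Everything in this sketch is hypothetical: the two locations, their case counts, and their populations are made up for illustration, and it assumes that a per-million value is simply the raw count divided by the population and multiplied by one million.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data: two made-up locations, three days each.
toy_df <- data.frame(
  location   = rep(c("Country A", "Country B"), each = 3),
  date       = rep(as.Date(c("2022-01-01", "2022-01-02", "2022-01-03")), times = 2),
  new_cases  = c(100, 150, 120, 2000, 2600, 2200),
  population = rep(c(1e6, 2e7), each = 3)
)

# Normalize by population so that locations of different sizes share one scale
# (assumption: per million = count / population * 1e6).
toy_df <- toy_df %>%
  mutate(new_cases_per_million = new_cases / population * 1e6)

# One line per location: x is time, y is the response variable.
toy_plot <- ggplot(toy_df) +
  geom_line(aes(x = date, y = new_cases_per_million,
                group = location, colour = location))
toy_plot
```

Although the raw counts of the two toy locations differ by more than an order of magnitude, their per-million curves land on a comparable scale, which is why the exercises in this section plot new_cases_per_million and not new_cases.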
[2]:
    # Generate a line graph of daily cases (per million) in Canada using
    # geom_line()
    # The ggplot process is usually: create a data frame with the part of
    # the data you want
    # Then add a type of graph, with aesthetic parameters
    # (such as values of x, values of y etc)
    # colo(u)r will colour depending on the location column
    # The answer should look like this, but with some 'corrections'
    ggplot(covid_df %>% filter(location == "Canada")) +  # create subset line graph
      geom_line(aes(x = date, y = new_cases_per_million, group = location,
                    colour = location))
    # Write your code here. Be careful to follow the instructions!

[3]:
    # Compare line graph from Canada to one from the USA, normalizing for
    # population. Note change in scale.
    # The answer should look like this, but with some 'corrections'
    ggplot(covid_df %>% filter(location == "Canada" | location == "United States")) +
      geom_line(aes(x = date, y = new_cases_per_million, group = location,
                    colour = location))
    # Write your code here. Be careful to follow the instructions!

[4]:
    # Adjust your axes, legends, and colours to improve figure readability.
    ggplot(covid_df %>% filter(location == "Canada" | location == "United States")) +
      geom_line(aes(x = date, y = new_cases_per_million, group = location,
                    colour = location)) +
      # each option in theme has several underlying options, here we are
      # manipulating several aspects of our axes
      theme(axis.title = element_text(size = 15),
            axis.text = element_text(size = 15),
            axis.ticks.length = unit(.1, "cm")) +
      xlab("Date") + ylab("New Cases Per Million") +  # replace axis labels
      # manipulate legend
      theme(legend.position = "bottom",
            legend.text = element_text(size = 15),
            legend.title = element_blank()) +
      # specify different "national" colours for each group
      scale_color_manual(values = c("#ff0000", "#0000FF"))
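When colours are passed to scale_color_manual() as an unnamed vector, they are assigned to the groups in the order of the scale's levels (alphabetical for a character column). A possible variation, sketched below on hypothetical toy values standing in for the filtered covid_df, is to name each colour so that it stays tied to a specific location. The names toy_df, national_colours, and styled_plot are illustrative only.

```r
library(ggplot2)

# Hypothetical toy data standing in for the filtered covid_df (values are made up).
toy_df <- data.frame(
  location = rep(c("Canada", "United States"), each = 3),
  date     = rep(as.Date(c("2022-01-01", "2022-01-02", "2022-01-03")), times = 2),
  new_cases_per_million = c(10, 20, 15, 30, 25, 40)
)

# Naming each colour ties it to one location instead of relying on level order.
national_colours <- c("Canada" = "#ff0000", "United States" = "#0000FF")

styled_plot <- ggplot(toy_df) +
  geom_line(aes(x = date, y = new_cases_per_million,
                group = location, colour = location)) +
  labs(x = "Date", y = "New Cases Per Million") +  # same effect as xlab() + ylab()
  theme(legend.position = "bottom", legend.title = element_blank()) +
  scale_colour_manual(values = national_colours)
styled_plot
```

With a named vector, adding or removing a location from the filter should not silently shift which colour each country receives.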
0.2 Part 2: Bar Charts

Bar charts compare categorical data using rectangular bars whose lengths are proportional to the values they represent. Error bars are commonly added to bar charts to display variability, such as the standard deviation (sd) or the standard error of the mean (sem). In this section, we will compare the average number of hospitalized Covid-19 patients per million across eight countries in Europe and learn how to include error bars on these plots.

[5]:
    # We'll pivot our covid_df in order to calculate summary mean, sd, and sem
    # values for a subset of European countries.
    # Our countries of interest are Belgium, France, Italy, Iceland, Netherlands,
    # Portugal, Sweden, and the United Kingdom.
    pivot_covid_df <- covid_df %>%
      group_by(location) %>%
      filter(location == 'Belgium' | location == 'France' | location == 'Italy' |
             location == 'Iceland' | location == 'Netherlands' |
             location == 'Portugal' | location == 'Sweden' |
             location == 'United Kingdom') %>%
      summarise(mean_hosp_patients = mean(hosp_patients_per_million),
                sd_hosp_patients = sd(hosp_patients_per_million),
                # sem = sd / sqrt(number of observations)
                sem_hosp_patients = sd(hosp_patients_per_million) / sqrt(n()))
    pivot_covid_df

    A tibble: 8 × 4
    location        mean_hosp_patients  sd_hosp_patients  sem_hosp_patients
    <chr>           <dbl>               <dbl>             <dbl>
    Belgium         161.27678           133.90926         4.558324
    France          244.34309           143.84840         4.865745
    Iceland          36.31031            53.97307         1.863357
    Italy           201.03494           177.41961         6.025486
    Netherlands      59.19184            41.76282         1.440954
    Portugal        110.12235           130.67054         4.516635
    Sweden           87.08845            84.02470         2.855275
    United Kingdom  136.17966           122.06000         4.145375

[6]:
    # Generate a simple bar chart of mean hospitalized patients in
    # each country. You can play around with geom_bar settings to see what they
    # each do.
    # Again the trick is to load the data to ggplot
    # Then plot a bar graph with specific aesthetics
    # geom_bar() can perform summaries for us, but we can also plot the pivot table
    # directly.
    # stat = "identity" means we are giving it the y values to plot
    # The answer should look like this, but with some 'corrections'
    ggplot(pivot_covid_df) +
      geom_bar(aes(x = location, y = mean_hosp_patients), stat = "identity",
               fill = "gray", colour = "black", size = 1, alpha = 1)
    # Write your code here. Be careful to follow the instructions!
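The section introduction mentions error bars, and the summary table already contains a sem column that could supply them. One common way to draw them, offered here only as a sketch and not necessarily the approach the tutorial itself takes, is to layer geom_errorbar() over the bars with ymin and ymax set to the mean minus and plus the sem. The three rows below are copied from the summary table printed above; the names summary_df and error_plot are illustrative.

```r
library(ggplot2)

# Three rows copied from the summary table above (values as printed).
summary_df <- data.frame(
  location           = c("Belgium", "France", "Iceland"),
  mean_hosp_patients = c(161.27678, 244.34309, 36.31031),
  sem_hosp_patients  = c(4.558324, 4.865745, 1.863357)
)

error_plot <- ggplot(summary_df, aes(x = location, y = mean_hosp_patients)) +
  # geom_col() plots the supplied y values, like geom_bar(stat = "identity")
  geom_col(fill = "gray", colour = "black") +
  # error bars span mean - sem to mean + sem
  geom_errorbar(aes(ymin = mean_hosp_patients - sem_hosp_patients,
                    ymax = mean_hosp_patients + sem_hosp_patients),
                width = 0.2)
error_plot
```

Swapping sem_hosp_patients for an sd column in the ymin and ymax expressions would show the standard deviation instead, which gives noticeably longer bars for these data.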