BIO259-Graphical-Data-Summaries_Tutorial

School: Toronto Metropolitan University
Course: 259
Subject: Statistics
Date: Apr 3, 2024
October 16, 2023

Our ability to generate visually stimulating graphical representations that evoke appropriate responses from the audiences of our data is arguably the most important aspect of biological data science. These graphical representations not only help us communicate our results to our audience, but also enable us to gain greater insight into our own data through simplification. By leveraging the power of computer programming, we can generate and manipulate figures to produce an extremely wide range of plots that simply aren’t accessible through most graphical user interfaces. We can also heavily parallelize our analyses and incorporate these visualizations into pipelines that generate the same plots for different datasets.

In today’s tutorial we will learn how to generate several different types of graphical data summaries in R using the ggplot2 package. We will also learn to perform some basic manipulations and customize these plots where applicable.

Our tutorial is divided into seven primary sections, each of which covers how to generate and customize a different type of plot:

1. Line Graphs: typically used for time data
2. Bar Charts: good for looking at categorical data and their counts
3. Box and Whisker Plots (with dot plots): look at the spread of your data
4. Scatter Plots: look at the relationship between two variables
5. Pie Charts: visualize proportions or percentages
6. Histograms: the y axis is typically a count
7. Heat Maps: visually infer the correlation between two variables

[1]: # Run the code block below to import our dataset, select a subset of
     # the columns, and filter out undesirable location data.
     # This is the data that we will work with throughout this tutorial
     # and in our practical.
     library(dplyr)
     library(ggplot2)

     covid_df <- read.table("/var/biojupyterhubdata/BIO259/owid-covid-data-15.06.2022.csv",
                            sep = ",", header = TRUE, quote = "")
     covid_df <- covid_df %>%
       select(c("continent", "location", "date", "total_cases", "new_cases",
                "total_deaths", "new_deaths", "total_cases_per_million",
                "new_cases_per_million", "total_deaths_per_million",
                "new_deaths_per_million", "icu_patients",
                "icu_patients_per_million", "hosp_patients",
                "hosp_patients_per_million", "total_vaccinations",
                "total_tests", "new_tests", "positive_rate",
                "people_vaccinated", "people_fully_vaccinated",
                "total_boosters", "new_vaccinations", "population",
                "population_density", "median_age", "gdp_per_capita",
                "diabetes_prevalence", "life_expectancy"))
     covid_df <- covid_df %>%
       filter(!location %in% c('Africa', 'Asia', 'Europe', 'North America',
                               'South America', 'Oceania', 'Low income',
                               'Lower middle income', 'Upper middle income',
                               'High income', 'European Union',
                               'International', 'Northern Cyprus', 'World',
                               'Grand Total'))
     covid_df <- covid_df %>%
       mutate(date = as.Date(date, format = "%Y-%m-%d"))  # convert column to date object
     covid_df[is.na(covid_df)] <- 0  # replace NA values with 0

     Attaching package: ‘dplyr’

     The following objects are masked from ‘package:stats’:

         filter, lag

     The following objects are masked from ‘package:base’:

         intersect, setdiff, setequal, union

0.1 Part 1: Line Graphs

Line graphs are an effective means of displaying changes in a response variable through time. In this section, we will use our COVID-19 dataset to plot the number of SARS-CoV-2 cases detected each day, in order to compare the pandemic curves for Canada and the USA.
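One side effect of the import cell is worth keeping in mind when reading the line graphs in this part: every NA was replaced with 0, so a day with no reported value is drawn as a day with zero cases. The sketch below illustrates this on a small made-up series (the data frame `toy_df` and its values are hypothetical and not taken from the course dataset). It builds the same plot twice, once with the missing value left as NA and once with it replaced by 0.

```r
library(ggplot2)

# Hypothetical toy series: five days, with the third report missing
toy_df <- data.frame(
  date = as.Date("2022-01-01") + 0:4,
  new_cases_per_million = c(10, 12, NA, 15, 14)
)

# Apply the same NA replacement that the import cell uses
toy_zero <- toy_df
toy_zero[is.na(toy_zero)] <- 0

# NA left in place: geom_line() breaks the line at the missing value
p_gap <- ggplot(toy_df) +
  geom_line(aes(x = date, y = new_cases_per_million))

# NA replaced by 0: the line dips to zero on the day with no report
p_zero <- ggplot(toy_zero) +
  geom_line(aes(x = date, y = new_cases_per_million))

print(p_gap)
print(p_zero)
```

Sharp drops to zero in the plots that follow may therefore reflect missing reports rather than a true absence of cases.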
[2]: # Generate a line graph of daily cases (per million) in Canada using
     # geom_line().
     # The ggplot process is usually: create a data frame with the part of
     # the data you want, then add a type of graph with aesthetic parameters
     # (such as the values of x, the values of y, etc.).
     # colo(u)r will colour the line according to the location column.
     # The answer should look like this, but with some 'corrections'.
     ggplot(covid_df %>% filter(location == "Canada")) +  # create subset
       # line graph
       geom_line(aes(x = date, y = new_cases_per_million,
                     group = location, colour = location))
     # Write your code here. Be careful to follow the instructions!

[3]: # Compare the line graph for Canada to one for the USA, normalizing for
     # population. Note the change in scale.
     # The answer should look like this, but with some 'corrections'.
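To make the idea of "normalizing for population" concrete, here is a minimal sketch on invented numbers. It is not the expected answer to the exercise. The data frame `toy_df`, its case counts, and its population figures are all hypothetical. The location labels are also assumptions; the labels used in the real file can be checked with `unique(covid_df$location)`. The sketch computes cases per million from raw counts and then plots one line per location.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data: counts and populations are invented for illustration
toy_df <- data.frame(
  location   = rep(c("Canada", "United States"), each = 5),
  date       = rep(as.Date("2022-01-01") + 0:4, times = 2),
  new_cases  = c(100, 150, 200, 180, 120, 900, 1400, 2100, 1700, 1000),
  population = rep(c(38e6, 332e6), each = 5)
)

# Normalize raw counts to cases per million inhabitants
toy_df <- toy_df %>%
  mutate(new_cases_per_million = new_cases / population * 1e6)

# One line per location; colour distinguishes the two curves
p <- ggplot(toy_df) +
  geom_line(aes(x = date, y = new_cases_per_million,
                group = location, colour = location))
print(p)
```

Plotting `new_cases` instead of `new_cases_per_million` on the y axis would let the larger population dominate the scale, which is why the per-million columns are used for between-country comparisons.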
