econ140_problem_set_1

September 16, 2022

In this cell, please type your name and SID

ECON 140R - Problem Set 1

This material is closely and gratefully adapted from the work of the UC Berkeley EEP/IAS 118 team, including Jeremy Magruder, Sofia Villas-Boas, James Sears, and many other people working on these materials for EEP118. This is your work. We are in your debt!

INSTRUCTIONS

Please step through this tutorial, copy and paste the code, and run it to produce output. Make sure to write some sentences at the end in response to the questions in the last section. You will earn 100% of the credit on this problem set for completing it.

Tips:
• We recommend you walk through Coding Bootcamp 1 and Coding Bootcamp 2 before tackling this Problem Set. You do not need to turn those in.
• Type your name and SID in the field at the top.
• When done, go to “File: Download As: PDF via LaTeX” or something similar like “File: Save and Export Notebook As: PDF.” You want a PDF you can upload to Gradescope.
• Navigate to Gradescope, find Problem Set 1 there, and submit your PDF. Follow the prompts on Gradescope to select fields. If you’re not sure how, ask a friend or reach out to your GSI.

Learning goals

0.1 ggplot2 and visualizations

We started off using R’s built-in plot function, which let us produce scatterplots and construct histograms of all sorts of variables. However, its output doesn’t look the best, and it has some ugly naming conventions. ggplot2 will give us more control over our figure and allow us to get as in depth with it as we want. Check out the Wikipedia page on ggplot2 for details. The etymology is “gg” for Grammar of Graphics. The package name ends in “2”, but the function itself is ggplot().

GENTLE DISCLAIMER

This notebook asks you to step through these functions in a piecemeal fashion. Don’t be surprised if the early steps don’t look like much. Also, there is no need to get fancy with graphics. Your objective should be to see how ggplot() works, so that you can create useful visualizations.
ggplot2 is part of the tidyverse package, so we’ll need to load that before we get started. Let’s also use the sleep75.dta dataset provided by Jeffrey Wooldridge with his textbook. The dataset comes from J.E. Biddle and D.S. Hamermesh (1990), “Sleep and the Allocation of Time,” Journal of Political Economy 98, 922-943.

If you’re interested in the contents of the dataset, see the PDF codebook at CRAN, which covers all 115 datasets in the wooldridge package. The documentation for this dataset starts on page 130.

This dataset is also available through the wooldridge R package. You can either use the local copy on datahub, as in the code below, or switch to the R package, as shown in the commented-out lines.

[21]: library(tidyverse)  # loads ggplot2 along with the rest of the tidyverse
      library(haven)      # provides read_dta() for Stata .dta files
      # Want to use the wooldridge package instead? Uncomment these 3 rows and comment out the last row below:
      #install.packages('wooldridge')
      #data(sleep75, package = 'wooldridge')
      #sleepdata <- sleep75
      sleepdata <- read_dta("sleep75.dta")

1 ggplot2 Basic Syntax

Let’s start by getting familiar with the basic syntax of ggplot2. Its syntax is a little bit different from some of the functions we’ve used before, but ultimately it makes things nice and easy as we make more and more professional-looking figures. It also plays nicely with pipes!

To start a plot, we start with the function ggplot(). This function initializes an empty plot and passes data to the layers that we’ll add on top. We can also use this function to define our dataset or specify what our 𝑥 and 𝑦 variables are.

Try starting a new plot by running ggplot(), with no arguments, below. You should get a blank gray canvas:

[22]: ggplot()
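The remark that ggplot2 plays nicely with pipes can be sketched as follows. This is only an illustration, not part of the original problem set: the toy data frame and its values are made up so that the example runs without sleep75.dta, and the age cutoff of 30 is arbitrary. It loads ggplot2 and dplyr directly; both are also attached by library(tidyverse).

```r
library(ggplot2)
library(dplyr)

# Hypothetical toy data standing in for sleepdata (values are made up)
toy <- data.frame(
  age    = c(25, 32, 41, 47, 53, 60),
  hrwage = c(6.5, 9.0, 15.2, 12.3, 11.1, 8.4)
)

# The piped data frame becomes the `data` argument of ggplot(),
# so only the aesthetics need to be written out
p <- toy %>%
  filter(age >= 30) %>%            # arbitrary cutoff, for illustration only
  ggplot(aes(x = age, y = hrwage))

# Printing the object draws the plot: axes and a grid, but no geometry yet
p
```

Because the filtering happens before ggplot() is called, the plot only ever sees the rows that survive the pipe.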
We get a little bit more if we specify our data and our 𝑥 and 𝑦 variables. To specify the data, we add the argument data = dataname to the ggplot() function, where dataname is the unquoted name of our data frame. To specify which variable is on the 𝑥 axis and which is on the 𝑦 axis, we use the aes(x = xvar, y = yvar) argument, again with unquoted variable names. aes() is short for “aesthetics” and allows us to automatically pass these variables along as our 𝑥 and 𝑦 variables for the plots we add.

Suppose we’re interested in using our sleepdata to see the relationship between age and the hourly wage in our sample. Note that economists usually take the natural log of the hourly wage, or of income or wealth if that were our variable of interest. The reason is that these variables in levels clearly display heteroskedasticity: variances that, in this case, get larger as the wage and other correlated variables increase.

Copy and paste this code into the field below and run it:
[23]: ggplot(data = sleepdata, aes(x = age, y = hrwage))

Now we have labels on both of our axes corresponding to the assigned variable, and a grid corresponding to possible values of those variables. This makes sense, because we told R with aes() what our 𝑥 variable and 𝑦 variable are, and then R automatically sets up tick marks based on our data.

We will add geometries (sets of points, histograms, lines, etc.) by adding what we call “layers” using a + after our ggplot() function. Let’s take a look at a few of the options.
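As a side note on the earlier remark about taking the natural log of the wage: a transformation can be written directly inside aes(). The sketch below is an illustration under stated assumptions, not a step from the problem set. The toy data frame and its values are made up, and only the column names age and hrwage are borrowed from the text.

```r
library(ggplot2)

# Hypothetical toy data standing in for sleepdata (values are made up)
toy <- data.frame(
  age    = c(25, 32, 41, 47, 53, 60),
  hrwage = c(6.5, 9.0, 15.2, 12.3, 11.1, 8.4)
)

# log() is R's natural logarithm; the y axis is now the log of the wage
p_log <- ggplot(data = toy, aes(x = age, y = log(hrwage)))

# As with the plot in levels, this shows labeled axes and a grid, with no
# geometry until a layer is added
p_log
```

The data frame itself is left unchanged; the log is computed only when the plot is built.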
1.1 Scatterplots

Now let’s add some points! If we want to get a sense of how age and the hourly wage vary in our data, we can do that by just plotting the points. We can add (𝑥, 𝑦) points in what is usually called a “scatterplot” using the function geom_point(). Among spreadsheet programs, Excel, for example, calls this an “X Y (Scatter)” chart.

Since we already declared our two variables, all we need to do is add + geom_point() to our existing code after the last parenthesis:

[25]: ggplot(data = sleepdata, aes(x = age, y = hrwage)) +
        geom_point()

Warning message:
“Removed 174 rows containing missing values (geom_point).”
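The warning above says that rows with missing values were dropped before plotting. One way to see where such a count comes from is to tally the missing values yourself. This is a minimal sketch under stated assumptions: the toy data frame, its values, and the names n_missing and toy_complete are made up for illustration, so the count here is 2 rather than the 174 reported for sleepdata.

```r
library(ggplot2)

# Hypothetical toy data with two missing wages (values are made up)
toy <- data.frame(
  age    = c(25, 32, 41, 47, 53, 60),
  hrwage = c(6.5, NA, 15.2, NA, 11.1, 8.4)
)

# Count rows where either plotted variable is missing
n_missing <- sum(is.na(toy$age) | is.na(toy$hrwage))
n_missing

# Keep only rows that are complete for the two plotted variables
toy_complete <- toy[complete.cases(toy[, c("age", "hrwage")]), ]

# Plotting the complete rows produces the same points without the warning
ggplot(data = toy_complete, aes(x = age, y = hrwage)) +
  geom_point()
```

An alternative is to leave the data alone and call geom_point(na.rm = TRUE), which removes the missing rows silently. Counting them first has the advantage that you know how much of the sample the plot leaves out.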
This is a plot of all our points. This rendering could take a very long time if our dataset were enormous! Also, note that we were warned that some rows contained at least one missing value, and those rows were dropped.

1.1.1 Labels

Often we’d like to change the labels from the variable names to more descriptive labels, and possibly add a title. We can do so by adding the labs() function to our plot. Try pasting this in below and running it:

[29]: ggplot(data = sleepdata, aes(x = age, y = hrwage)) +
        geom_point() +
        labs(title = "Relationship between age and the hourly wage",
             subtitle = "Nonmissing Sample",
             caption = "Note: prepared using Wooldridge's sleep75 data.",
             x = "Age (years)",
             y = "Hourly Wage ($)")

Warning message:
“Removed 174 rows containing missing values (geom_point).”
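Because layers are added with +, a plot can also be stored in an object and extended later, which avoids retyping the whole call each time a label changes. The sketch below illustrates that idea under stated assumptions: the toy data frame and the object names base_plot and labelled_plot are made up for this example and do not appear in the problem set.

```r
library(ggplot2)

# Hypothetical toy data standing in for sleepdata (values are made up)
toy <- data.frame(
  age    = c(25, 32, 41, 47, 53, 60),
  hrwage = c(6.5, 9.0, 15.2, 12.3, 11.1, 8.4)
)

# Store the scatterplot without printing it
base_plot <- ggplot(data = toy, aes(x = age, y = hrwage)) +
  geom_point()

# Add labels to the stored plot; base_plot itself is left unchanged
labelled_plot <- base_plot +
  labs(title = "Relationship between age and the hourly wage",
       x = "Age (years)",
       y = "Hourly Wage ($)")

# Printing the object draws the labelled version
labelled_plot
```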