Rstudio_tutorial

BAMA 520 CUSTOMER ANALYTICS
R and RStudio tutorial
Prepared for: BAMA 520 Customer Analytics
Prepared by: Miremad Soleymanian and Neda Ahmadi
September 8, 2022
Proposal number: Session 2
SOFTWARE INSTALLATION
Objective
The software tools used with this book are written in R, a language and system for computational statistics. R is open source software, developed by a large, worldwide team of developers. The software consists of a base system and thousands of add-on packages. To assist in writing R code and developing analytics solutions, an Integrated Development Environment (IDE) is essential. The most common IDE for R is RStudio. In other words, RStudio is just an interface shell for the R programming language. Therefore, we need to install both R and RStudio before using them.

1- Downloading and Installing R
Use the link below and download the proper installer for your operating system (Windows/macOS/Linux):
https://cran.r-project.org/
Follow the instructions. RStudio will not give you a working environment unless R itself is properly installed on your operating system.

2- Downloading and Installing RStudio
Download the free version of RStudio Desktop from https://www.rstudio.com/products/rstudio/. Follow the installation instructions.
RSTUDIO ENVIRONMENT
1- RStudio panels
An accessible introduction to RStudio for the total beginner. It includes a basic treatment of the four primary panels and a few easy examples to help you get started using R with RStudio.

- Source Panel
The panel on the upper left is called the Source panel. This is where you write and save your code scripts, which are basically text files. Saved scripts let you rerun any part of the code and recreate your analysis and visualizations if your original data changes.

[Figure: RStudio window with four labelled panels: Source Panel, Environment Panel, Console and Terminal Panel, File and Plot Panel]
- Console and Terminal Panel
The panel on the lower left. The Console shows the results of the code you run from the Source panel. You can also run R code directly in the Console interactively; however, that code is not saved and cannot be reused later. The Terminal behaves like the terminal on your computer: you can move between directories, change the location of files and folders, and so on.

- Environment Panel
The Environment panel is on the upper right. It shows the variables and other objects you define while coding. You can also see the history of the commands run from your script and console by clicking the History tab.

- Plots and Files Panel
The Files tab shows the files on your drive. You can change the directory your code runs in and see the folders and files you want to read from or save to. The Plots tab is where you see visualizations and plots, which we will get familiar with as we move forward.

CODING IN A NEW SCRIPT
1- Creating a New Script
You can create a new script by clicking File -> New File -> R Script, or by using the keyboard shortcut shown next to that menu item. This creates a new R script in which you can start coding. When you save the file (Ctrl+S on Windows, Command+S on Mac) you choose the location and name of the script so that you can use it again later.

2- Running Code in RStudio
Before moving on to basic coding skills, let's get familiar with how to run selected lines or the whole script in RStudio.
If you look at the top of the Source panel you will find the buttons shown in the image below. To run only a few lines of your script, select those lines and click the Run button; Ctrl+Enter (Command+Enter on Mac) is the shortcut for Run. To run the whole script, click Source; Ctrl+Shift+Enter (Command+Shift+Enter on Mac) sources the whole script and echoes each line. After using any of these methods you can see what was executed in the console. If you want to work with the created variables interactively and understand them better, use the console panel to show variable values or perform other operations such as mean, max, sum and so on. Running your script will show the defined variables in the Environment panel. You can always clear the Environment panel by clicking the broom icon.
R TUTORIAL - SESSION 2
Before getting started with the course materials, let's get to know a few basic commands that will help you throughout the process.

Data Types
R has a wide variety of data types including scalars, vectors (numerical, character, logical), matrices, data frames, and lists. We will get familiar with a few of these data types throughout the semester.

Set up working directory
R resolves relative file paths against its working directory. Depending on how RStudio was started, the initial working directory may or may not be the folder that holds your script file (it usually is when you open RStudio by double-clicking the script). Any folder or file inside the working directory can be reached directly by name. However, if you need to load or save a file outside it, you need to provide the complete path to that location. A way around this is to set your working directory in RStudio.

Under the "Files Plots Packages Help Viewer" window, usually the bottom right window of RStudio, click the "Files" tab if it is not already selected. Then browse to the location where you wish to store scripts and data. Last, click the "More" button, which is just above your course folder line, and select the option "Set As Working Directory".

You can also set the working directory from the console panel with the setwd( ) command. If you want to double-check the final working directory, use the getwd( ) command in the console panel.

Creating New Variables
Use the assignment operator <- to create new variables. The image below shows a sample of how you can define a variable in an R script or the console.
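The working directory and assignment steps above can be sketched in a few lines. This is only an illustration: the temporary folder stands in for your own course folder, and the variable names and values are made up for the example.

```r
# Remember the current working directory so it can be restored at the end
old_wd <- getwd()

# Hypothetical course folder: a temporary directory stands in for it here.
# In practice you would pass your own path, e.g. setwd("path/to/course/folder")
course_dir <- tempdir()
setwd(course_dir)

# Double-check which directory R is now working in
print(getwd())

# Create variables with the assignment operator <-
price   <- 12.5              # a number
store   <- "Jack and Jill"   # a character string
is_open <- TRUE              # a logical value

print(price)
print(store)
print(is_open)

# Restore the original working directory
setwd(old_wd)
```

After running these lines, price, store and is_open should appear in the Environment panel.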
Print Variables
Write print(variable name) in the console or a script to see a variable in the run results. Typing a variable name on its own line also prints its value in the console or when you run lines with Run, but not when the whole script is run with Source without echo, so use print( ) in scripts.

Vector
Vectors are the most basic R data objects. There are six types, but you only need to know how to define character, double, logical and integer vectors. An example of a one-element vector of each type is shown in the image below.

To save multiple values in one variable, use c( ). The elements can come from any of these types as needed, but all elements of one vector share a single type: if you mix types, R converts them to a common one (for example, numbers become character strings). You can see sample code and its result below.

Accessing Vector Elements
Elements of a vector are accessed using indexing with [ ] brackets. Indexing starts at position 1. A negative index drops that element from the result. A logical vector of TRUE and FALSE values can also be used for indexing, keeping the elements where the value is TRUE. Note that the numbers 0 and 1 are not treated as FALSE and TRUE inside [ ]: index 0 selects nothing and index 1 selects the first element. For instance, the code below assigns "Sat" to z and "Mon", "Tue" and "Fri" to the variable u.

Dataframe
The data frame is the type of data you will work with the most. A data frame is a table, or two-dimensional array-like structure, in which each column contains the values of one variable and each row contains one set of values from each column (the same as a table you would create in Excel). The characteristics of a data frame are:
• The column names should be non-empty.
• The row names should be unique.
• The data stored in a data frame can be of numeric, factor or character type.
• Each column should contain the same number of data items.

Defining a Dataframe on your own
You can define the columns of a data frame using vectors, which were explained above. The command data.frame( ) creates a data frame. Make sure to give each column a proper name so that the values and the type of variable stored in each are easy to understand. Use the code below as an example and look at the resulting variable in the Environment panel.
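The vector, indexing and data frame steps described above can be sketched as follows. The original sample code is not reproduced here, so the variable names (days, customers and so on) and all values are assumptions chosen to match the text.

```r
# One-element vectors of the four types mentioned above
a <- "hello"   # character
b <- 3.14      # double
f <- TRUE      # logical
d <- 5L        # integer (the L suffix marks an integer)
print(c(typeof(a), typeof(b), typeof(f), typeof(d)))

# Combine several values into one vector with c()
days <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")

# Indexing starts at position 1
z <- days[7]              # "Sat"
u <- days[c(2, 3, 6)]     # "Mon" "Tue" "Fri"

# A negative index drops that element
no_sunday <- days[-1]

# A logical vector keeps the positions marked TRUE
weekend <- days[c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)]

# Build a data frame from equal-length vectors (hypothetical columns)
customers <- data.frame(
  Name     = c("Ana", "Ben", "Chen"),
  Age      = c(34L, 51L, 27L),
  Spending = c(120.5, 560, 89.9)
)
print(customers)
```

Each column of customers is itself a vector, which is why every column must have the same number of items.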
Reading a dataframe as an xls or csv file
In R, we can read data from files stored outside the R environment. We can also write data into files which will be stored and accessed by the operating system. R can read and write various file formats such as csv, Excel, xml and so on.

To load an xls file like jackjill.xls, it is better to save it as a csv file first and then load it into RStudio for data analysis tasks. After the file "jackjill.xls" has been downloaded, locate it, open it in Excel and click Enable Editing. If R cannot read a particular file format, changing to a different file type will often solve the problem. To illustrate this first trick in your tool kit, we will convert the .xls file to a comma separated value file, with the file extension csv. Use the pull-down menu command File -> Save As, which will bring up the "Save As" dialog box. Use the Browse button to navigate to the working directory you set up in step 1, use the drop-down menu to select CSV (Comma delimited) (*.csv) as the file type, and save the file. By default, the file will have the name jackjill.csv.
In this section we will learn to read data from a csv file and then write data into a csv file. The file should be in the current working directory so that R can read it by name. Of course, we can also set our own directory and read files from there.

Read a csv file in R
To read a csv file in an R script, use read.csv("file path"). As mentioned before, files in the working directory can be reached directly, so if the csv file is in the working directory you only need to provide its name or a relative path. Otherwise you need to write the complete path to the desired file. The output in the console should look similar to the data frame we defined above. The argument stringsAsFactors = TRUE tells R to read text columns as factors (categorical variables) rather than as plain character strings.

----------------------------------------------------------------
*** For some of the tasks from here on you need to install libraries. Refer to Chapter3.pdf for more explanation. Type the lines below in the console to install the following packages. You only have to do this once. However, each time you want to use any of these packages you need to load them with the library( ) command at the beginning of the script, as shown in the image below.
You can also load R functions from local files using source( ).
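To make the read and write steps concrete, here is a minimal sketch using write.csv( ) and read.csv( ). It does not use the real jackjill.csv: a small made-up table is written to a temporary folder so the example runs anywhere. The column names follow the ones mentioned in this tutorial, but the values are invented.

```r
# Hypothetical stand-in data (not the real jack.jill data set)
example <- data.frame(
  HH.ID    = c(101, 102, 103, 104),
  Gender   = c("Female", "Male", "Female", "Male"),
  Spending = c(250, 1300, 585, 90)
)

# Write the table to a csv file in a temporary folder
csv_path <- file.path(tempdir(), "example_households.csv")
write.csv(example, csv_path, row.names = FALSE)

# Read it back; stringsAsFactors = TRUE turns text columns into factors
households <- read.csv(csv_path, stringsAsFactors = TRUE)
print(households)

# Gender is now a factor with two levels
print(class(households$Gender))
print(levels(households$Gender))
```

With a file in your working directory, the call reduces to read.csv("jackjill.csv", stringsAsFactors = TRUE).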
----------------------------------------------------------------
Getting information about a Dataframe
To start, you can check that a variable really is a data frame using the command is.data.frame( ). You can also find the number of columns or rows in the data frame using the ncol( ) and nrow( ) commands. The image below shows the code and results for the jackjill dataset.

str(Dataframe)
This command gives you all of the information above and more. The str( ) function in R compactly displays the internal structure of an object. Applying it to a data frame shows the number and types of variables, their structure, sample values and so on. Run the command str(dataframe name) in your script or console to see the results.

View(Dataframe) and glimpse(Dataframe)
The View command opens a new tab with a portion of the data. We can get a bit more information on the data using the glimpse command (from the dplyr package). The images below show the output of each command in order.
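The inspection commands above can be tried on any data frame. The sketch below uses a small made-up table rather than the jackjill data, and it only calls glimpse( ) if the dplyr package happens to be installed.

```r
# Hypothetical stand-in for the jackjill data
households <- data.frame(
  HH.ID    = c(101, 102, 103, 104),
  Gender   = factor(c("Female", "Male", "Female", "Male")),
  Spending = c(250, 1300, 585, 90)
)

# Is it a data frame, and how big is it?
print(is.data.frame(households))
print(nrow(households))
print(ncol(households))

# Compact overview: one line per variable with its type and first values
str(households)

# glimpse() lives in the dplyr package, so guard the call
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr::glimpse(households)
}

# View(households) would open the data in a new RStudio tab;
# it is left commented out because it only works interactively
# View(households)
```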
[Image: output of View()]
[Image: output of glimpse()]

Add HH.ID in the data cleaning process
There is one "data cleaning" chore that should be done for most data sets. The jack.jill data set contains the variable HH.ID, which is an example of a case identifier (or observation identifier). For the jack.jill data set, it is a household identification number. Since these "variables" are merely identifiers, they will not actually be used in any analyses that we will do. However, we need to keep them, as they are needed to merge with any outside data, such as personal contact information. We can ensure that we will never accidentally use HH.ID as an analysis variable
by setting it as a record-name data type, or "row name", now. This step should ALWAYS be taken with a new data set that has identification variables, and most will. The following code assigns HH.ID as the row names, deletes the original variable, and checks that the operation was performed as desired. Note that jack.jill$HH.ID tells R to look for the variable HH.ID in the data set jack.jill. Note also that as soon as you type the $ sign after a data frame name, a dropdown list of variables appears, from which you can select the variable you want.

Accessing a Value/Row/Column in a dataframe
Data can be accessed by index. We have already seen how square brackets [ ] can be used to subset data (sometimes also called "slicing"). The generic format is dat[row_numbers, column_numbers]. If we leave out a dimension, R interprets this as a request for all values in that dimension. So dat[, 2] returns all elements in the second column. Another way to return the values of a certain column is the $ sign: df$Column_name returns that specific column. Example: dat$Gender

Visualization
Since the intuition you get from the data is the most important thing in this course, visualizations can be very helpful and will come in handy in many cases. We will become familiar with more visualization tools as we move forward. However, let's get to know a few of them today.

Histogram
The function to create a histogram is hist(..., ..., ...), where a number of arguments can be passed within the brackets. At least one argument, the variable to be plotted, must be specified. We want to generate a histogram for the Spending variable, the only continuous variable in the jack.jill data. The first argument specifies the data set and the variable within the data set. You can also define the number of bins (breaks = n), the color of each bin (col = "color_name") and the color of the border (border = "color_name"). Argument names are case-sensitive, so write col, not Col.
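The row name, indexing and histogram steps above can be sketched together. The rows below are invented for illustration (only the column names and the idea of a skewed Spending variable come from this tutorial), and the bin count and colors are arbitrary choices.

```r
# Hypothetical miniature version of the jack.jill data
jack.jill <- data.frame(
  HH.ID    = c(101, 102, 103, 104, 105, 106),
  Gender   = factor(c("Female", "Male", "Female", "Male", "Female", "Male")),
  Spending = c(250, 1300, 585, 90, 5940, 13)
)

# Use HH.ID as the row names, delete the original column, then check
rownames(jack.jill) <- jack.jill$HH.ID
jack.jill$HH.ID <- NULL
print(head(jack.jill))

# Indexing: dat[row_numbers, column_numbers]
one_value  <- jack.jill[2, 2]     # row 2, column 2
second_col <- jack.jill[, 2]      # all rows of column 2 (Spending)
first_row  <- jack.jill[1, ]      # all columns of row 1
gender     <- jack.jill$Gender    # a column selected by name

# Histogram of the continuous variable, with optional styling arguments
h <- hist(jack.jill$Spending,
          breaks = 5,             # suggested number of bins
          col    = "lightblue",   # fill color of the bars
          border = "white",       # color of the bar borders
          main   = "Histogram of Spending",
          xlab   = "Spending")
```

In RStudio the histogram appears in the Plots tab; hist( ) also returns the bin counts, stored here in h.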
The output of the histogram code should look like the image below.

Usage: Histograms are for continuous numeric variables (the numeric variable is binned into a categorical variable and plotted). For categorical ("factor") variables, however, we can show the number of times each category occurs. This is called the frequency distribution. To plot a frequency distribution we can use the function below.

Plot
The plot command uses the class of the variable (in this case a factor) to decide what type of plot to produce (in this case a bar plot).
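As a small sketch of the frequency distribution idea, the lines below count the categories of a made-up factor with table( ) and then hand the factor to plot( ), which draws a bar plot. The variable name and values are assumptions.

```r
# Hypothetical categorical variable
gender <- factor(c("Female", "Male", "Female", "Female", "Male", "Female"))

# Frequency distribution: how many times each category occurs
freq <- table(gender)
print(freq)

# Because gender is a factor, plot() produces a bar plot of the counts
plot(gender, main = "Households by gender", ylab = "Count")
```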
BCA_functions_source_file
We will also use some very handy functions that are not on CRAN, written by Dan Putler and modified by Ryan Tavakol. These functions are in a file named BCA_functions_source_file.R, which you must first download from your course website into your working directory.

- A most useful summary of the data: To get information on all of the variables in this dataset we will use one of the functions in BCA_functions_source_file.R.

- variable.summary()
This function summarizes all variables in a data frame.

Data preparation is invariably the most time-consuming task in analytic work. The first step is understanding what you have to work with. This table lists the characteristics of all variables in the data set. Inspecting this table is a critical part of the data understanding phase, and helps with identifying data problems, and hence identifying the data preparation steps necessary to fix them. We will work through it slowly.

The variable names appear in the first column. The second column, Class, gives the class of each variable, which incorporates both the measurement scale and the internal computer coding used for the variables. The combination of scales and coding can initially be confusing, but there are only four classes, and getting comfortable with them is essential for data analysis. The possible classes of variables are numeric, integer, factor, and character.
• Numeric and integer variables are ratio-scaled numbers (integer variables do not have fractional values, or are "discrete").
• A factor is a categorical variable. The summary table does not distinguish between ordered (ordinal-scaled) and unordered (nominal-scaled) factors, but we will usually treat them as nominal.
• A character variable only contains text as labels. This is fine if the variables are merely record identifiers, but if they are variables that need to be used in a quantitative analysis,
- numSummary()
This command gives summary statistics for numeric variables.

As well as the mean and standard deviation ("sd" in the new output) that we saw previously, the output shows the value of the variable at each of its quartiles. The 0% quantile is the minimum value of Spending in the data set ($13), the 50% quantile is the median value of Spending ($585), and the 100% quantile is the maximum value of Spending ($5940) in the data set. The large gap between the 75% and 100% quantiles (relative to the other inter-quartile gaps) indicates that the distribution of Spending is likely to be highly skewed, with a very small percentage of households having extremely high spending levels. The fact that the mean is much higher than the median is also a result of this skew.

- numSummary(df, groups = )
Redoes the summary explained above, but summarized by groups.

- cor()
The cor() function below generates the correlation matrix:
cor(select(CCS, YearsGive:DonPerYear))
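The same kinds of numbers can be computed with base R functions, which may help in reading the numSummary( ) output. This sketch deliberately swaps in mean( ), sd( ), quantile( ), tapply( ) and cor( ) instead of numSummary( ); the data, including the Visits column, are invented for illustration.

```r
# Hypothetical data with a skewed Spending variable
households <- data.frame(
  Gender   = factor(c("Female", "Male", "Female", "Male", "Female", "Male")),
  Spending = c(250, 1300, 585, 90, 5940, 13),
  Visits   = c(4, 12, 7, 2, 30, 1)
)

# Mean, standard deviation and the 0%, 25%, 50%, 75%, 100% quantiles
spend_mean <- mean(households$Spending)
spend_sd   <- sd(households$Spending)
spend_q    <- quantile(households$Spending)
print(spend_mean)
print(spend_sd)
print(spend_q)

# A mean well above the median hints at a right-skewed distribution
print(spend_mean > median(households$Spending))

# Summary by group: mean Spending for each Gender
by_gender <- tapply(households$Spending, households$Gender, mean)
print(by_gender)

# Correlation matrix of the numeric columns
cor_mat <- cor(households[, c("Spending", "Visits")])
print(cor_mat)
```

One very large value (5940) is enough to pull the mean far above the median here, which is the pattern described for Spending above.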
Contingency Tables
The purpose of this tutorial is to show you how to produce contingency tables. Contingency tables are also called "crosstabulations" and "cross-tabs." They are the simplest type of multivariate analysis (i.e., methods for studying the relationships among multiple variables) and are common in market research. This tutorial will only consider two-way (two variable) contingency tables. Interpreting three-way tables is often difficult, and four-way or higher-order tables are essentially impossible to interpret. R will calculate a chi-square test statistic to help you evaluate the statistical independence of the two variables under study. You will also learn how to change the level order in a factor variable, to create more easily interpreted and presentation-friendly contingency tables.

To create a contingency table we use the xtabs( ) command. Contingency tables only work with categorical variables, so we will use a binned version of Spending. To create it we can use the binVariable function loaded from BCA_functions_source_file.R. You can follow the steps shown below. Once we have the categorical version we can apply xtabs to it. The output should look something like the image below.
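A minimal sketch of these steps is given here. Because binVariable comes from the course source file, the example substitutes the base R function cut( ) for the binning step; the data, cut points and category labels are assumptions made for illustration.

```r
# Hypothetical data
households <- data.frame(
  Gender   = factor(c("Female", "Male", "Female", "Male",
                      "Female", "Male", "Female", "Male")),
  Spending = c(250, 1300, 585, 90, 5940, 13, 720, 410)
)

# Bin the numeric variable into categories.
# cut() is a base R stand-in for binVariable; break points are arbitrary.
households$Spending.Cat <- cut(
  households$Spending,
  breaks = c(0, 400, 1000, Inf),
  labels = c("Low", "Medium", "High")
)

# Two-way contingency table of Gender by spending category
tab <- xtabs(~ Gender + Spending.Cat, data = households)
print(tab)

# Inspect the level order, then reorder it for presentation.
# factor(..., levels = ) is the base R way; fct_relevel() from the
# forcats package is an alternative.
print(levels(households$Spending.Cat))
households$Spending.Cat <- factor(households$Spending.Cat,
                                  levels = c("High", "Medium", "Low"))
print(xtabs(~ Gender + Spending.Cat, data = households))
```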
If you have an ordinal variable, the levels( ) function shows the ordering of that variable's levels, and fct_relevel( ) (from the forcats package) can be used to reorder the factor levels.

Chi-square test for independence
This test compares two variables in a contingency table to see if they are related. In simpler words, it compares two sets of data to see if a relationship exists between them. A chi-square statistic is one way to show a relationship between two categorical variables. In statistics, there are two types of variables: numerical (countable) variables and non-numerical (categorical) variables. The chi-squared statistic is a single number that tells you how much difference exists between your observed counts and the counts you would expect if there were no relationship at all in the population. A low chi-square value means the observed counts are close to the expected counts, so there is little evidence of a relationship between the two variables; a high value suggests that the variables are related. In theory, if your observed and expected values were equal ("no difference") then chi-square would be zero, an event that is unlikely to happen in real life.

You can use the chisq.test( ) command to calculate the test statistic and the expected counts; the expected component of the result ($expected) returns the expected values. To get proportions you can use prop.table( ). For proportions within each row (each row sums to 1), set margin = 1; for proportions within each column (each column sums to 1), set margin = 2. You can also get the p-value to judge the significance of the relationship between the variables.
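The test and the proportion tables can be sketched on a small made-up table of counts. The counts and category names below are assumptions; with a 2 x 2 table, chisq.test( ) applies a continuity correction by default.

```r
# Hypothetical 2 x 2 table of observed counts
tab <- as.table(matrix(
  c(30, 20,
    10, 40),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("Female", "Male"),
                  Spending.Cat = c("Low", "High"))
))
print(tab)

# Chi-square test of independence
test <- chisq.test(tab)
print(test$statistic)   # the chi-square value
print(test$p.value)     # a small p-value suggests the variables are related
print(test$expected)    # counts expected if the variables were independent

# Proportions within each row (rows sum to 1)
row_props <- prop.table(tab, margin = 1)
print(row_props)

# Proportions within each column (columns sum to 1)
col_props <- prop.table(tab, margin = 2)
print(col_props)
```

Comparing tab with test$expected shows where the observed counts depart most from what independence would predict.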