School: University of Saskatchewan
Course: 311
Subject: Statistics
Date: Apr 3, 2024
Assignment #6: Introduction to Working with R/RStudio
Submission Instructions
Due: Friday, April 6, 2018 at 11:59 PM.
Submit the following four files through Canvas>Assignments>To-Do:
(1) The completed, working R script that produced the analysis in Steps 1 through 9
(2) The output file: descriptivesOutput.txt
(3) Another output file: histogram.pdf
(4) The completed answer sheet, provided on the last page and also as a separate Word file
If you do not follow the instructions, your assignment will be counted late.
o Late Assignment policy: Same as before.
Evaluation
Your submission will be graded based on the correctness of the completed answer sheet, with other files
as supporting documents.
Before you start
For this assignment, you’ll run simple analyses by modifying the R script you used in ICA #11 (Descriptives.r). You will also need a new data set, OnTimeAirport2017Dec.csv, which contains actual on-time flight statistics for 83,915 flights, by airline and airport, for December 2017, collected from the Bureau of Transportation Statistics.[1]
IMPORTANT! When downloading the .csv file, please make sure that its name doesn’t change and that it is in the same folder as the Descriptives.r file that you are modifying.
The metadata for the OnTimeAirport2017Dec.csv spreadsheet is below:
Variable Name   Variable Description
FlightDate      The date of the flight (mm/dd/yyyy)
UniqueCarrier   The unique carrier code
CarrierlName    The name of the carrier
FlightNum       Flight number
Origin          The origin airport of the flight
OriginCity      The origin city of the flight
Dest            The destination airport of the flight
DestCity        The destination city of the flight
DepDelay        The delay in departing from the origin gate (in minutes)
TaxiOut         The minutes spent taxiing out to the runway at origin
TaxiIn          The minutes spent taxiing in from the runway at destination
ArrDelay        The delay in arriving at the destination gate (in minutes)
Cancelled       Whether the flight was cancelled (0 = no, 1 = yes)
AirTime         Flight time (in minutes)
Distance        The total distance of the flight (in miles)
[1] https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236
Modifying the Descriptives.r script
To complete the assignment, modify the Descriptives.r script (used in ICA #11) to perform an analysis of departure delays by origin airport, following the instructions below, and complete the answer sheet on the last page.
1) Use OnTimeAirport2017Dec.csv as the input file.
HINT: In line 21 of the Descriptives.r script, it says:
INPUT_FILENAME <- "NBA14Salaries.csv"
Change that line to:
INPUT_FILENAME <- "OnTimeAirport2017Dec.csv"
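A minimal sketch of what the changed line likely feeds into, assuming the script loads the file with base R's read.csv() into a data frame named dataSet (the name the later hints use). Descriptives.r itself is not reproduced in this handout, so the existence check and the small stand-in file written to a temporary folder are illustrative assumptions, not lines from the original script.

```r
# Stand-in file so the sketch runs anywhere; the values are made up.
demo_path <- file.path(tempdir(), "DemoFlights.csv")  # hypothetical file name
writeLines(c("Dest,ArrDelay",
             "PHL,12",
             "EWR,-3",
             "PHL,45"), demo_path)

INPUT_FILENAME <- demo_path

# Fail early with a readable message if the file is not where R is looking.
if (!file.exists(INPUT_FILENAME)) {
  stop("Cannot find ", INPUT_FILENAME, " (working directory: ", getwd(), ")")
}

dataSet <- read.csv(INPUT_FILENAME)
str(dataSet)  # confirm column names and types before running any analysis
```

With the real data, INPUT_FILENAME would simply hold the .csv name, which is why the file has to sit in the working directory.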
2) Present the number of flights, grouped by destination airport (using Dest).
HINT: In line 61, change the line to read:
summary(dataSet$Dest)
This presents the number of observations/rows (flights) by destination airport. You will need the output from this command to answer the first question on the answer sheet on the last page.
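One caveat worth a quick sketch: summary() only prints per-airport counts when the column is a factor. Since R 4.0.0, read.csv() keeps text columns as character by default, in which case summary() reports just length, class, and mode. The example below uses made-up airport codes and shows two ways to get the counts; whether your copy of the script already converts Dest to a factor is not known from this handout.

```r
# Made-up airport codes; real counts come from the assignment data.
dest <- c("PHL", "EWR", "PHL", "ORD", "PHL", "EWR")

# On a character vector, summary() reports only length, class and mode.
print(summary(dest))

# Converting to a factor restores the per-level counts.
counts_summary <- summary(factor(dest))
print(counts_summary)

# table() gives the same counts directly, for character or factor input.
counts_table <- table(dest)
print(counts_table)
```

Note that summary() on a factor lumps levels beyond its maxsum limit (100 by default) into "(Other)", so table() may be the safer choice when there are many airports.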
3) Present summary statistics for arrival delay (using ArrDelay).
HINT: In line 66, change the line by replacing Salary with ArrDelay:
describe(dataSet$ArrDelay)
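To make the describe() output easier to read, here is a rough base-R sketch of the main numbers it reports, assuming the describe() in the script is the one from the psych package. The delays are made up, and the NA is a hypothetical stand-in for a flight with no recorded arrival (for example, a cancelled one).

```r
# Made-up delays in minutes; NA marks a flight with no recorded arrival.
arr_delay <- c(12, -5, 48, NA, 0, 7)

# Core descriptive statistics, skipping missing values.
stats <- c(
  n      = sum(!is.na(arr_delay)),
  mean   = mean(arr_delay, na.rm = TRUE),
  sd     = sd(arr_delay, na.rm = TRUE),
  median = median(arr_delay, na.rm = TRUE),
  min    = min(arr_delay, na.rm = TRUE),
  max    = max(arr_delay, na.rm = TRUE)
)
print(round(stats, 2))
```

The n here counts only non-missing values, so it can be smaller than the number of rows in the data.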
4) Present summary statistics for arrival delay (using ArrDelay), grouped by airline carrier (using UniqueCarrier).
HINT: Check line 73 in the script:
describeBy(dataSet$Salary,dataSet$Position)
This presents summary statistics for salary by position (for the NBA salary data). Now that we are using a different data set, you should be able to figure out how to change line 73 to present summary statistics for arrival delay (ArrDelay), grouped by airline carrier (UniqueCarrier).
If you get that, you will now be able to answer questions 2 through 4 on the answer sheet!
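The "first argument is the measurement, second argument is the grouping" pattern can be sketched in base R with tapply(). This is an illustration of the idea on made-up data with hypothetical column names (value, group), not the describeBy() call the step asks for.

```r
# Made-up data; the column names are hypothetical.
toy <- data.frame(
  group = c("A", "A", "B", "B", "B"),
  value = c(10, 20, 5, 15, 40)
)

# Mean and maximum of value within each group:
# tapply(measurement, grouping, statistic)
group_mean <- tapply(toy$value, toy$group, mean, na.rm = TRUE)
group_max  <- tapply(toy$value, toy$group, max,  na.rm = TRUE)

print(cbind(mean = group_mean, max = group_max))
```

Each row of the printed result corresponds to one group, which is the same layout to look for in the per-group blocks that describeBy() prints.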
5) Compare, using a t-test, the arrival delays for two airline carriers (using UniqueCarrier), American Airlines (AA) and United Airlines (UA).
HINT: Now please change line 87 and line 93 on your own. Hopefully the first few steps will get you started!
Check line 87:
subset <- dataSet[ which(dataSet$Position=='PG' | dataSet$Position=='SF'), ]
This creates a subset with only two positions: PG and SF (for the NBA salary data). Now that we are using a different data set, you should be able to figure out how to change this line to create a subset with only two airline carriers: AA and UA.
Check line 93:
t.test(subset$Salary~subset$Position)
This runs a t-test using Salary as the dependent variable and Position as the grouping variable (for the NBA salary data). Now with the airport data, you should be able to change this line to use ArrDelay as the dependent variable and UniqueCarrier as the grouping variable.
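The subset-then-test pattern in lines 87 and 93 can be sketched on made-up data as follows. The column names (group, value) and the data frame name two_groups are hypothetical; %in% is used here as an equivalent shorthand for the which(... | ...) filter in the script. By default t.test() runs Welch's two-sample test, which does not assume equal variances.

```r
# Made-up data with three groups; only A and B are compared.
toy <- data.frame(
  group = c("A", "A", "A", "A", "B", "B", "B", "B", "C", "C"),
  value = c(10, 12, 9, 11, 20, 22, 19, 21, 50, 55)
)

# Keep only the rows belonging to the two groups of interest.
two_groups <- toy[toy$group %in% c("A", "B"), ]

# Dependent variable on the left of ~, grouping variable on the right.
result <- t.test(value ~ group, data = two_groups)
print(result)

# The pieces needed for a write-up: group means and the p-value.
print(result$estimate)
print(result$p.value)
```

A p-value below the chosen significance level (commonly 0.05) would indicate a statistically significant difference between the two group means.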
6) Create a histogram, properly labeled, of the overall distribution of arrival delays (using ArrDelay) for all flights.
HINT: You will need to change the hist() function in both line 106 and line 112. You also need to change line 25 and line 27 for the label and title of the histogram. In addition, in line 24, change the number of breaks (NUM_BREAKS) to 50 so you will see more vertical bars in the histogram.
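A small sketch of a labeled histogram written to a PDF, on made-up delays. NUM_BREAKS comes from the hint; the other names (out_file, the title and label text) are hypothetical, and the assumption that one of the script's two hist() calls draws into a PDF device is a guess based on the histogram.pdf deliverable.

```r
# Made-up delays in minutes.
delays <- c(-10, -5, 0, 3, 8, 12, 15, 30, 45, 120)

NUM_BREAKS <- 50
out_file <- file.path(tempdir(), "demo_histogram.pdf")  # hypothetical name

pdf(out_file)                      # open a PDF graphics device
hist(delays,
     breaks = NUM_BREAKS,          # a suggestion; R picks "pretty" cut points
     main = "Distribution of Delays (demo data)",
     xlab = "Delay (minutes)")
dev.off()                          # close the device so the file is written
```

If the PDF comes out empty or unreadable, a missing dev.off() is a common cause.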
Once you’ve completed this part, add several new lines to the script that carry out steps 7), 8), and 9) below:
NOTE: Make sure you add these lines right before the sink() function (line 96) so that the results are included in your text file output.
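Why placement matters can be seen in a short sketch, assuming the sink() on line 96 is the argument-free call that stops redirecting output (with an earlier sink(filename) call that starts it). The file name below is hypothetical.

```r
out_txt <- file.path(tempdir(), "demo_output.txt")  # hypothetical file name

sink(out_txt)                 # start sending console output to the file
cat("inside sink\n")          # lands in the file
print(summary(c(1, 2, 3, 4))) # also lands in the file
sink()                        # stop redirecting; output returns to console

cat("after sink\n")           # shown on the console, NOT in the file

captured <- readLines(out_txt)
print(captured)
```

Anything placed after the closing sink() still runs, but its results never reach the text file.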
7) Use describeBy() to compare the flight distance (Distance) across airlines (using UniqueCarrier).
8) Use describeBy() to compare the taxiing out time (TaxiOut) across origin airports (Origin).
9) Answer this question using a t-test: Do planes spend more time taxiing out to the runway in Newark (EWR) or Philadelphia (PHL) as the origin airport? (Use TaxiOut as the taxiing out time and Origin as the origin airport.)
Once you’ve completed all 9 steps, you can set the working directory and run the script. Based on your script output, answer the 11 questions listed on the answer sheet on the next page.
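Setting the working directory and running a script can also be done from the console, which is sketched below with a throwaway script in a temporary folder. The file name and its one-line contents are made up; in RStudio the same effect is usually achieved through the menus and the Source button.

```r
# Write a throwaway script so the sketch is self-contained.
script_dir  <- tempdir()
script_path <- file.path(script_dir, "demo_script.r")  # hypothetical name
writeLines("demo_result <- 1 + 1", script_path)

old_wd <- setwd(script_dir)  # setwd() returns the previous directory
source("demo_script.r")      # run the script; relative paths now resolve here
setwd(old_wd)                # restore the original working directory

print(demo_result)
```

Because the script refers to its input file by name only, the working directory has to be the folder that holds both the script and the .csv file.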
Answer Sheet on the Next Page……
Answer Sheet for Assignment: Introduction to Working with R/RStudio
Name __________________________________
Answer the questions below based on your script output
Question / Answer
1. How many total flights (including cancelled flights) have Philadelphia (PHL) as the destination airport during December 2017?
Answer: __________
2. What was the average arrival delay (in minutes) across all flights during December 2017?
Answer: __________
3. What was the average arrival delay (in minutes) for American Airlines (with UniqueCarrier code of AA) during December 2017?
Answer: __________
4. What was the longest arrival delay for United Airlines (with UniqueCarrier code of UA) during December 2017?
Answer: __________
5. On average, which airline (using UniqueCarrier) experienced greater arrival delays: American Airlines (AA) or United Airlines (UA)?
Answer: __________
6. For question #5, was this difference statistically significant? What is the p-value? (Answer both questions in the blank.)
Answer: __________
7. Which airline(s) had the longest average flight distance? (You can list more than one if it’s a tie.)
Answer: __________
8. Which airline(s) had the shortest average flight distance? (You can list more than one if it’s a tie.)
Answer: __________
9. On average, which origin airport (using Origin) experienced greater taxi out times: Newark (EWR) or Philadelphia (PHL)?
Answer: __________
10. For question #9, was this difference statistically significant? What is the p-value? (Answer both questions in the blank.)
Answer: __________
11. Looking at the histogram, is the distribution symmetric? Are most flights delayed less than 50 minutes or more than 50 minutes?
Answer: __________