HW2.pdf
School: North Carolina State University
Course: 308
Subject: Statistics
Date: Jan 9, 2024
R: Programming Assignment 2 (57 pts)
In this assignment you will create a .Rmd file and corresponding .html output. Write code to answer the questions
below, and upload both the .Rmd file and the .html file you create to the Moodle assignment link.
• The file submitted must meet the R File Submission Guidelines available in the Resources and Information section of the course.
• If your file doesn’t meet these guidelines, we may deduct up to 50% of your score.
• It is time to put what you’ve learned into practice! You may not work with others on this assignment. You cannot post to the discussion forum or anywhere else to obtain help on this assignment.
• You may obtain help from your instructor (or another class’s instructor) if you are stuck. Class time is the best time to get help!
• Remember that there is a one-business-day turnaround on email.
• No late work will be accepted. If you have a documented emergency that prevents you from completing a homework assignment, please contact your instructor and provide proof of the emergency.
The homework involves two parts:
• One ‘prescriptive’ part where you are asked to perform tasks similar to what we’ve done in class.
• One ‘open-ended’ part where you will discuss a dataset you’ve found and read it into R. This second part will be used in the final project for the course!
Prescriptive Part (32 pts)
Dataset
About the data (more info at http://archive.ics.uci.edu/ml/datasets/Dry+Bean+Dataset):
Seven different types of dry beans were used in this research, taking into account the features such
as form, shape, type, and structure by the market situation. A computer vision system was developed
to distinguish seven different registered varieties of dry beans with similar features in order to obtain
uniform seed classification. For the classification model, images of 13,611 grains of 7 different registered
dry beans were taken with a high-resolution camera. Bean images obtained by computer vision system
were subjected to segmentation and feature extraction stages, and a total of 16 features; 12 dimensions
and 4 shape forms, were obtained from the grains.
There are 17 different variables in the data:
1. Area (A): The area of a bean zone and the number of pixels within its boundaries.
2. Perimeter (P): Bean circumference is defined as the length of its border.
3. Major axis length (L): The distance between the ends of the longest line that can be drawn from a bean.
4. Minor axis length (l): The longest line that can be drawn from the bean while standing perpendicular to the main axis.
5. Aspect ratio (K): Defines the relationship between L and l.
6. Eccentricity (Ec): Eccentricity of the ellipse having the same moments as the region.
7. Convex area (C): Number of pixels in the smallest convex polygon that can contain the area of a bean seed.
8. Equivalent diameter (Ed): The diameter of a circle having the same area as a bean seed area.
9. Extent (Ex): The ratio of the pixels in the bounding box to the bean area.
10. Solidity (S): Also known as convexity. The ratio of the pixels in the convex shell to those found in beans.
11. Roundness (R): Calculated with the following formula: $R = \frac{4\pi A}{P^2}$
12. Compactness (CO): Measures the roundness of an object: Ed/L
13. ShapeFactor1 (SF1)
14. ShapeFactor2 (SF2)
15. ShapeFactor3 (SF3)
16. ShapeFactor4 (SF4)
17. Class (Seker, Barbunya, Bombay, Cali, Dermason, Horoz and Sira)
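As a sanity check on the roundness and compactness definitions above, the formulas can be evaluated on an ideal circle, where both should come out as 1 up to floating-point error. This is only a sketch: the radius and all variable names are illustrative assumptions, not values from the bean data.

```r
# Toy check of the roundness and compactness formulas on an ideal circle.
# The radius is an arbitrary illustrative value, not taken from the data set.
radius <- 10

area      <- pi * radius^2      # A
perimeter <- 2 * pi * radius    # P
major_len <- 2 * radius         # L: for a circle, the major axis is the diameter

# Equivalent diameter (Ed): diameter of a circle with the same area
equiv_diam <- sqrt(4 * area / pi)

roundness   <- (4 * pi * area) / perimeter^2   # R  = 4*pi*A / P^2
compactness <- equiv_diam / major_len          # CO = Ed / L

roundness
compactness
```

Any shape less regular than a circle has a longer perimeter for the same area, so its roundness falls below 1.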
Tasks
For this section, write a brief discussion about what you are going to do (don’t just copy and paste the question prompts) using markdown text followed by code chunks that display the code and results for each question below. Any answers/information that you are asked to provide should be done using markdown.
1. (2 pts) Use tidyverse type functions to read in the Dry_Bean_Dataset.xlsx data set. The data is available at https://www4.stat.ncsu.edu/~online/datasets/Dry_Bean_Dataset.xlsx.
   • (2 pts) After the code chunk write markdown text that includes inline R code. You should output the sentence:
     The data has been read into the object [name of your object here]. This object is a [describe the object class here] that has [use_in-line_R_code_to_render_this_value] variables and [use_in-line_R_code_to_render_this_value] observations.
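The values that such a sentence reports can be computed with ordinary R expressions. The sketch below uses a small made-up tibble (`toy_data` and its columns are hypothetical) in place of the real data, and it leaves out the file-reading step entirely.

```r
library(tibble)

# Hypothetical stand-in for a data set that has already been read in
toy_data <- tibble(
  id    = 1:4,
  value = c(2.5, 3.1, 4.8, 1.9),
  group = c("a", "b", "a", "b")
)

n_vars    <- ncol(toy_data)       # number of variables (columns)
n_obs     <- nrow(toy_data)       # number of observations (rows)
obj_class <- class(toy_data)[1]   # first class label of the object

# In an .Rmd file, inline R code is written in the text as `r ncol(toy_data)`,
# so the rendered sentence always matches the object.
paste("The object has", n_vars, "variables and", n_obs, "observations.")
```

Computing the counts from the object, rather than typing them by hand, keeps the text correct if the data changes.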
2. Use tidyverse functions and chaining to make the following modifications to your data object (saving the result as a new object). The new data object should:
   • Not include the Extent or AspectRatio variables. (2 pts)
   • Rename the ShapeFactor1, ShapeFactor2, ShapeFactor3, and ShapeFactor4 variables to SF1, SF2, SF3, and SF4. (2 pts)
   • Remove any observations in which the bean class is not one of DERMASON, SEKER, SIRA, BOMBAY, or CALI. (2 pts)
   • Create a new variable that is the average of the newly renamed shape factor variables. (2 pts)
   • Create a new categorical variable that bins the new average shape factor variable into three categories (you choose the values for the category names) as follows (3 pts):
     – If the value is larger than 0.6, give the new variable a character string indicating it is in the largest category.
     – If the value is larger than 0.4 but less than or equal to 0.6, give the new variable a character string indicating it is in the middle category.
     – Otherwise, give the new variable a character string indicating it is in the lowest category.
   • Reorder the rows of the data set in descending order of the MajorAxisLength variable. (2 pts)
   Finally, display the new data object so it prints out (just call the new object name). Remember, you should have a bit of markdown text prior to this code chunk that describes what you are going to do (and it shouldn’t just be a copy and paste of the prompt!). Write out what you want to do and which function you are choosing (from the tidyverse) to accomplish it. (2 pts)
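The general pattern of chaining several dplyr verbs can be sketched on a toy tibble. Everything here (the `toy` object, its column names, the category labels, and the cut points) is a made-up illustration of the technique, not the assignment's data or a prescribed solution.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data; column names are invented for illustration
toy <- tibble(
  kind    = c("A", "B", "C", "A", "B"),
  length  = c(10, 30, 20, 50, 40),
  extra   = c(1, 2, 3, 4, 5),
  Factor1 = c(0.2, 0.5, 0.9, 0.7, 0.3),
  Factor2 = c(0.4, 0.5, 0.7, 0.7, 0.3)
)

toy_new <- toy %>%
  select(-extra) %>%                           # drop a column
  rename(F1 = Factor1, F2 = Factor2) %>%       # new_name = old_name
  filter(kind %in% c("A", "B")) %>%            # keep only listed categories
  mutate(
    F_avg = (F1 + F2) / 2,                     # row-wise average of two columns
    F_bin = case_when(                         # conditions are checked in order
      F_avg > 0.6 ~ "high",
      F_avg > 0.4 ~ "medium",
      TRUE        ~ "low"
    )
  ) %>%
  arrange(desc(length))                        # sort rows, largest first

toy_new
```

Because `case_when()` returns the result for the first condition that is true, the second condition only needs the lower bound: any value above 0.6 has already been assigned.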
3. Using your new data object, create a two-way contingency table for the Class variable and the binned variable you created in the above step. (2 pts)
   • When creating the table, only include observations where the SF1 variable is less than 0.008. (1 pt)
   • In a sentence below the code chunk, describe what the number in the top left cell of the table means. (1 pt)
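A two-way table on a filtered subset might look like the following sketch. The toy data frame, its column names, and the cutoff applied to `score` are all hypothetical; base R's `table()` is one way to build the counts.

```r
library(dplyr)

# Hypothetical toy data; names and values are invented for illustration
toy <- data.frame(
  kind  = c("A", "A", "A", "B", "B", "B", "A"),
  size  = c("low", "low", "high", "low", "high", "high", "low"),
  score = c(0.001, 0.004, 0.002, 0.009, 0.003, 0.005, 0.012)
)

# Keep only rows below the cutoff, then cross-tabulate two categorical columns
kept <- toy %>% filter(score < 0.008)
tab  <- table(kept$kind, kept$size)

tab
```

Rows and columns are ordered alphabetically here, so the top left cell counts the kept rows with `kind` equal to "A" and `size` equal to "high".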
4. Find the Mean, Median, and Standard Deviation summary statistics for the Area and Perimeter variables for each Class of bean. (4 pts)
   • In a sentence below the code chunk, describe what all of the statistics found mean for one of the bean classes. (2 pts)
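Grouped summaries of several columns at once can be sketched with `group_by()`, `summarise()`, and `across()`. The toy tibble and its lowercase column names are assumptions made for illustration only.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data; names and values are invented for illustration
toy <- tibble(
  kind      = c("A", "A", "A", "B", "B", "B"),
  area      = c(10, 20, 60, 5, 5, 8),
  perimeter = c(12, 18, 30, 8, 8, 11)
)

stats <- toy %>%
  group_by(kind) %>%
  summarise(
    # apply each named function to each selected column;
    # result columns are named <column>_<function>, e.g. area_mean
    across(c(area, perimeter), list(mean = mean, median = median, sd = sd))
  )

stats
```

A mean well above the median within a group, as for `area` in group "A" here, points to a few large values pulling the average up.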
5. Create a correlation matrix between the Area, Perimeter, MajorAxisLength, and MinorAxisLength variables. Note: This can be easily done using the cor() function. (3 pts)
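When `cor()` is given a data frame of numeric columns, it returns the matrix of pairwise Pearson correlations. A minimal sketch on made-up data (the column names `x`, `y`, and `z` are hypothetical):

```r
# Hypothetical toy data; y is an exact multiple of x, z tends to move opposite to x
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2, 4, 6, 8, 10),
  z = c(5, 3, 4, 1, 2)
)

# Select the numeric columns of interest and compute all pairwise correlations
cor_mat <- cor(toy[, c("x", "y", "z")])

cor_mat
```

The matrix is symmetric with ones on the diagonal, so each pair of variables appears twice.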