HW2

School: North Carolina State University
Course: 308
Subject: Statistics
Date: Jan 9, 2024

R: Programming Assignment 2 (57 pts)

In this assignment you will create a .Rmd file and corresponding .html output. Write code to answer the questions below, and upload both the .Rmd file and the .html file you create to the Moodle assignment link.

• The file submitted must meet the R File Submission Guidelines available in the Resources and Information section of the course.
• If your file doesn’t meet these guidelines, we may deduct up to 50% from your score.
• It is time to put what you’ve learned into practice! You may not work with others on this assignment. You may not post to the discussion forum, or anywhere else, to obtain help on this assignment.
• You may obtain help from your instructor (or another class’s instructor) if you are stuck. Class time is the best time to get help!
• Remember that there is a one-business-day turnaround on email.
• No late work will be accepted. If you have a documented emergency that prevents you from completing a homework assignment, please contact your instructor and provide proof of the emergency.

The homework involves two parts:

• One ‘prescriptive’ part where you are asked to perform tasks similar to what we’ve done in class.
• One ‘open-ended’ part where you will discuss a dataset you’ve found and read it into R. This second part will be used on the final project for the course!

Prescriptive Part (32 pts)

Dataset

About the data (more info at http://archive.ics.uci.edu/ml/datasets/Dry+Bean+Dataset):

Seven different types of dry beans were used in this research, taking into account features such as form, shape, type, and structure, as determined by the market situation. A computer vision system was developed to distinguish seven different registered varieties of dry beans with similar features in order to obtain uniform seed classification. For the classification model, images of 13,611 grains of 7 different registered dry beans were taken with a high-resolution camera.
Bean images obtained by the computer vision system were subjected to segmentation and feature extraction stages, and a total of 16 features (12 dimensions and 4 shape forms) were obtained from the grains.

There are 17 different variables in the data:

1. Area (A): The area of a bean zone and the number of pixels within its boundaries.
2. Perimeter (P): Bean circumference, defined as the length of its border.
3. Major axis length (L): The distance between the ends of the longest line that can be drawn from a bean.
4. Minor axis length (l): The longest line that can be drawn from the bean while standing perpendicular to the main axis.
5. Aspect ratio (K): Defines the relationship between L and l.
6. Eccentricity (Ec): Eccentricity of the ellipse having the same moments as the region.
7. Convex area (C): Number of pixels in the smallest convex polygon that can contain the area of a bean seed.
8. Equivalent diameter (Ed): The diameter of a circle having the same area as a bean seed area.
9. Extent (Ex): The ratio of the pixels in the bounding box to the bean area.
10. Solidity (S): Also known as convexity. The ratio of the pixels in the convex shell to those found in beans.
11. Roundness (R): Calculated with the following formula: $(4\pi A)/P^2$
12. Compactness (CO): Measures the roundness of an object: $Ed/L$
13. ShapeFactor1 (SF1)
14. ShapeFactor2 (SF2)
15. ShapeFactor3 (SF3)
16. ShapeFactor4 (SF4)
17. Class (Seker, Barbunya, Bombay, Cali, Dermason, Horoz and Sira)
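As a sanity check on the derived features above, the roundness, equivalent diameter, and compactness definitions can be evaluated on a shape whose answers are known in advance. The following is a minimal base R sketch: the helper names (`equivalent_diameter`, `roundness`, `compactness`) and the test shape (a perfect circle of radius 50) are illustrative assumptions and are not part of the assignment or the dataset.

```r
# Hypothetical helpers that restate the feature definitions from the list above.

# Ed: diameter of the circle whose area equals the given area.
equivalent_diameter <- function(area) {
  sqrt(4 * area / pi)
}

# R: (4 * pi * A) / P^2
roundness <- function(area, perimeter) {
  (4 * pi * area) / perimeter^2
}

# CO: Ed / L
compactness <- function(area, major_axis) {
  equivalent_diameter(area) / major_axis
}

# Test shape (an assumption for illustration): a perfect circle of radius 50.
radius           <- 50
circle_area      <- pi * radius^2
circle_perimeter <- 2 * pi * radius
circle_major     <- 2 * radius   # the longest line through a circle is its diameter

# For a circle, Ed is the diameter itself, and both R and CO equal 1
# (up to floating-point error).
circle_ed <- equivalent_diameter(circle_area)
circle_r  <- roundness(circle_area, circle_perimeter)
circle_co <- compactness(circle_area, circle_major)

print(c(Ed = circle_ed, R = circle_r, CO = circle_co))
```

Shapes that are less circular have a longer border relative to their area, so their roundness falls below 1.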

Tasks

For this section, write a brief discussion about what you are going to do (don’t just copy and paste the question prompts) using markdown text followed by code chunks that display the code and results for each question below. Any answers/information that you are asked to provide should be given using markdown.

1. (2 pts) Use tidyverse type functions to read in the Dry_Bean_Dataset.xlsx data set. The data is available at https://www4.stat.ncsu.edu/~online/datasets/Dry_Bean_Dataset.xlsx.
• (2 pts) After the code chunk, write markdown text that includes inline R code. You should output the sentence: The data has been read into the object [name of your object here]. This object is a [describe the object class here] that has [use in-line R code to render this value] variables and [use in-line R code to render this value] observations.

2. Use tidyverse functions and chaining to make the following modifications to your data object (saving the result as a new object). Your chain should:
• Remove the Extent and AspectRatio variables. (2 pts)
• Rename the ShapeFactor1, ShapeFactor2, ShapeFactor3, and ShapeFactor4 variables to SF1, SF2, SF3, and SF4. (2 pts)
• Remove any observations in which the bean class is not one of DERMASON, SEKER, SIRA, BOMBAY, or CALI. (2 pts)
• Create a new variable that is the average of the newly renamed shape factor variables. (2 pts)
• Create a new categorical variable that bins the new average shape factor variable into three categories (you choose the category names) as follows (3 pts):
– If the value is larger than 0.6, give the new variable a character string indicating it is in the largest category.
– If the value is larger than 0.4 but less than or equal to 0.6, give the new variable a character string indicating it is in the middle category.
– Otherwise, give the new variable a character string indicating it is in the lowest category.
• Reorder the rows of the data set in descending order of the MajorAxisLength variable. (2 pts)
Finally, display the new data object so it prints out (just call the new object name). Remember, you should have a bit of markdown text prior to this code chunk that describes what you are going to do (and it shouldn’t just be a copy and paste of the prompt!). Write out what you want to do and which function you are choosing (from the tidyverse) to accomplish it. (2 pts)

3. Using your new data object, create a two-way contingency table for the Class variable and the binned variable you created in the above step. (2 pts)
• When creating the table, only include observations where the SF1 variable is less than 0.008. (1 pt)
• In a sentence below the code chunk, describe what the number in the top left cell of the table means. (1 pt)

4. Find the Mean, Median, and Standard Deviation summary statistics for the Area and Perimeter variables for each Class of bean. (4 pts)
• In a sentence below the code chunk, describe what all of the statistics found mean for one of the bean classes. (2 pts)

5. Create a correlation matrix between the Area, Perimeter, MajorAxisLength, and MinorAxisLength variables. Note: This can be easily done using the cor() function. (3 pts)
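The kinds of operations named in the tasks above can be illustrated on a small invented data set. The sketch below is hedged throughout: every value in `beans_toy` is made up, the object and column names `beans_toy`, `beans_new`, `SF_avg`, and `SF_bin`, the labels "high", "middle", and "low", and the helper `read_beans()` are all hypothetical, and the toy column names follow the spelling used in this handout (the names in the actual file may differ, so check `names()` after reading it). It assumes the dplyr package is installed, plus readxl if `read_beans()` is called.

```r
library(dplyr)

# Reading sketch (defined but not run here): read_excel() reads local files,
# so the workbook is first downloaded to a temporary file.
read_beans <- function(url) {
  tmp <- tempfile(fileext = ".xlsx")
  download.file(url, tmp, mode = "wb")
  readxl::read_excel(tmp)
}
# In .Rmd text, inline code such as `r ncol(beans)` and `r nrow(beans)`
# renders the number of variables and observations of an object `beans`.

# Toy stand-in for the bean data; all values are invented for illustration.
beans_toy <- tibble(
  Class           = c("SEKER", "SEKER", "BOMBAY", "BOMBAY", "CALI", "CALI", "HOROZ"),
  Area            = c(28000, 30000, 170000, 180000, 75000, 77000, 45000),
  Perimeter       = c(610, 630, 1600, 1650, 1050, 1070, 900),
  MajorAxisLength = c(208, 215, 590, 600, 400, 410, 370),
  MinorAxisLength = c(173, 178, 370, 385, 240, 242, 160),
  AspectRatio     = MajorAxisLength / MinorAxisLength,
  Extent          = c(0.76, 0.75, 0.78, 0.77, 0.75, 0.74, 0.70),
  ShapeFactor1    = c(0.007, 0.009, 0.003, 0.004, 0.005, 0.006, 0.008),
  ShapeFactor2    = c(0.6, 0.6, 0.1, 0.1, 0.2, 0.2, 0.1),
  ShapeFactor3    = c(0.9, 0.9, 0.3, 0.3, 0.6, 0.6, 0.5),
  ShapeFactor4    = c(1.0, 1.0, 0.9, 0.9, 1.0, 1.0, 0.9)
)

# One chain: drop columns, rename, keep five classes, average, bin, sort.
beans_new <- beans_toy %>%
  select(-Extent, -AspectRatio) %>%
  rename(SF1 = ShapeFactor1, SF2 = ShapeFactor2,
         SF3 = ShapeFactor3, SF4 = ShapeFactor4) %>%
  filter(Class %in% c("DERMASON", "SEKER", "SIRA", "BOMBAY", "CALI")) %>%
  mutate(
    SF_avg = (SF1 + SF2 + SF3 + SF4) / 4,
    SF_bin = case_when(
      SF_avg > 0.6 ~ "high",    # larger than 0.6
      SF_avg > 0.4 ~ "middle",  # larger than 0.4, at most 0.6
      TRUE         ~ "low"      # everything else
    )
  ) %>%
  arrange(desc(MajorAxisLength))
print(beans_new)

# Two-way table of Class by bin, restricted to rows with SF1 < 0.008.
beans_small  <- filter(beans_new, SF1 < 0.008)
class_by_bin <- table(Class = beans_small$Class, SF_bin = beans_small$SF_bin)
print(class_by_bin)

# Mean, median, and standard deviation of Area and Perimeter per class.
class_summary <- beans_new %>%
  group_by(Class) %>%
  summarise(across(c(Area, Perimeter),
                   list(mean = mean, median = median, sd = sd)))
print(class_summary)

# Correlation matrix of the four size variables.
cor_matrix <- beans_new %>%
  select(Area, Perimeter, MajorAxisLength, MinorAxisLength) %>%
  cor()
print(cor_matrix)
```

In `case_when()` the conditions are checked in order, so the second condition only sees values that are at most 0.6, which is why it does not need an explicit upper bound. `table()` orders its rows and columns alphabetically, so the top left cell here counts BOMBAY rows in the "high" bin.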
