AD699: Data Mining for Business Analytics
Individual Assignment #2, Spring 2023
Due by: Friday, 3 Mar @ 11:59 p.m.

Simple Linear Regression

Q1 Bring the dataset into the R environment
The dataset has been loaded into the R environment correctly.

Q2 Use the str() function and identify data types
Numerical: enrltot, teachers, calwpct, mealpct, computer, testscr, compstu, expnstu, str, avginc, elpct, readscr, mathscr
Categorical: distcod, county, district, grspan

Q3 Filter the dataset so that only the 16 most common counties remain
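A minimal sketch of this filtering step in base R, under stated assumptions: the data frame `toy`, its county labels, and the name `min_districts` are invented for illustration and stand in for Caschool, and the threshold is scaled down to 3 districts so the toy example stays small. The names `county_counts` and `common` follow the answer to this question.

```r
# Toy stand-in for Caschool (hypothetical data): one row per school district.
toy <- data.frame(
  district = paste0("D", 1:9),
  county   = c("A", "A", "A", "A", "B", "B", "B", "C", "D"),
  stringsAsFactors = FALSE
)

# Count the number of districts in each county.
county_counts <- as.data.frame(table(county = toy$county),
                               stringsAsFactors = FALSE)

# Keep the names of counties that meet the threshold
# (the answer uses 10 or more districts; 3 is used here for the toy data).
min_districts <- 3
common <- county_counts$county[county_counts$Freq >= min_districts]

# Keep only the rows whose county is in 'common'.
toy_common <- toy[toy$county %in% common, ]
print(table(toy_common$county))
```

Using `%in%` keeps the original row order and every column, so the filtered data frame can be passed straight to the partitioning step.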
I first created a new data frame called ‘county_counts’ that shows the number of school districts in each county. Next, I selected only the counties with 10 or more school districts and extracted their names into a vector called ‘common’. Finally, I kept only the rows in Caschool whose county appears in ‘common’.

Q4 Partition
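A minimal sketch of a random partition in base R. The 60/40 split, the seed value, and the stand-in data frame `df` are assumptions made for illustration; the proportions actually used for this assignment are not stated here.

```r
# Hypothetical stand-in for the filtered data: 100 rows with a row id.
set.seed(699)  # arbitrary seed so the split is reproducible
df <- data.frame(id = 1:100, x = rnorm(100))

# Sample 60% of the row indices for training (assumed proportion);
# the remaining rows form the validation set.
train_idx <- sample(seq_len(nrow(df)), size = floor(0.6 * nrow(df)))
train <- df[train_idx, ]
valid <- df[-train_idx, ]

# Sanity checks: the sets should not overlap and should cover every row.
overlap <- length(intersect(train$id, valid$id))
covered <- nrow(train) + nrow(valid) == nrow(df)
print(c(train = nrow(train), valid = nrow(valid), overlap = overlap))
```

Sampling row indices without replacement guarantees that no district ends up in both sets, which is what makes the validation set a fair check on the model.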
Training set:
Validation set:

Partitioning the data into a training set and a validation set helps us build predictive models that generalize well to new data. The advantages are preventing overfitting, evaluating model performance on data the model has not seen, and tuning model hyperparameters.

Q5 readscr vs mealpct
The percentage of students in the district who qualify for free or reduced-price lunches (mealpct) is negatively related to the average reading score (readscr): districts with a higher mealpct tend to have lower reading scores. This makes intuitive sense to me, and the scatter plot shows it clearly. There appears to be a strong linear relationship between readscr and mealpct in the training set.

Q6 Correlation between readscr and mealpct
The correlation coefficient between readscr and mealpct is -0.8925, which indicates a very strong negative correlation between the two variables. The p-value is less than 2.2e-16, so the correlation is statistically significant at the 5% level.

Q7 Simple linear regression
On average, a one-unit increase in mealpct is associated with a decrease of 0.6469 in readscr. The p-value for mealpct is <2e-16, so the slope is highly significant and there is clear evidence of a linear relationship between mealpct and readscr. The R-squared value of 0.7966 indicates that approximately 79.66% of the variation in readscr can be explained by mealpct.
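The Q6 and Q7 results are linked: in a simple linear regression, R-squared equals the squared correlation coefficient, and the reported values agree, since (-0.8925)^2 ≈ 0.7966. The sketch below illustrates this with `cor.test()` and `lm()` on simulated data. The simulated `mealpct` and `readscr` values, the seed, and the data frame name `train` are assumptions for illustration only, so the numbers it produces will not match those reported above.

```r
# Simulated stand-in data (hypothetical): a negative linear trend plus noise.
set.seed(1)
mealpct <- runif(200, min = 0, max = 100)
readscr <- 700 - 0.6 * mealpct + rnorm(200, sd = 10)
train <- data.frame(mealpct, readscr)

# Pearson correlation and its significance test (Q6).
ct <- cor.test(train$readscr, train$mealpct)
r <- unname(ct$estimate)

# Simple linear regression of readscr on mealpct (Q7).
fit <- lm(readscr ~ mealpct, data = train)
slope <- unname(coef(fit)["mealpct"])
r2 <- summary(fit)$r.squared

# With a single predictor, R-squared equals the squared correlation.
print(all.equal(r2, r^2))
```

The slope and the correlation coefficient always share the same sign in this setting, which is why a negative correlation in Q6 goes with a negative mealpct coefficient in Q7.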