AD699: Data Mining for Business Analytics
Individual Assignment #2
Spring 2023
Due by: Friday, 3 Mar @ 11:59 p.m.
Simple Linear Regression
Q1 Bring dataset into R environment.
The dataset was loaded into the R environment successfully.
Q2 Use the str() function and identify data types
Numerical:
enrltot, teachers, calwpct, mealpct, computer, testscr, compstu, expnstu, str, avginc,
elpct, readscr, mathscr
Categorical:
distcod, county, district, grspan
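
As a minimal sketch of how this classification can be read off in R, the snippet below runs str() on a small made-up data frame and then splits the column names by type. The data frame `toy` and its values are hypothetical; only the column names are borrowed from the lists above. Note that a column storing ID codes as numbers would be reported as numeric by this check even if it is categorical in meaning.

```r
# Hypothetical toy data frame standing in for the assignment's dataset;
# the column names come from the lists above, the values are made up.
toy <- data.frame(
  county  = factor(c("Alameda", "Butte", "Butte")),
  grspan  = factor(c("KK-08", "KK-06", "KK-08")),
  enrltot = c(195L, 240L, 1550L),
  testscr = c(690.8, 661.2, 643.6)
)

str(toy)  # prints each column's type and its first few values

# Split the column names by storage type
numeric_cols     <- names(toy)[sapply(toy, is.numeric)]
categorical_cols <- names(toy)[sapply(toy, is.factor)]
```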
Q3 Filter the dataset so that only the 16 most common counties remain
I first create a new data frame called ‘county_counts’ that shows the number of school districts
in each county. Next, I select only the counties with 10 or more school districts and extract the
county names into a vector called ‘common’. Finally, I select only the rows in Caschool whose
county appears in ‘common’.
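
The three steps described above can be sketched in base R roughly as follows. This is an illustration under stated assumptions, not the code actually submitted: `toy` is a hypothetical miniature stand-in for Caschool, and `min_districts` is an illustrative name for the cutoff (10 in the write-up, 2 here so the tiny example has something to filter).

```r
# Hypothetical miniature stand-in for Caschool: one row per school district.
toy <- data.frame(
  county   = c("A", "A", "A", "B", "B", "C"),
  district = paste0("d", 1:6),
  stringsAsFactors = TRUE
)

min_districts <- 2  # the write-up uses 10; 2 suits this tiny example

# Step 1: number of school districts in each county
county_counts <- as.data.frame(table(county = toy$county))

# Step 2: names of the counties that meet the cutoff
common <- as.character(county_counts$county[county_counts$Freq >= min_districts])

# Step 3: keep only rows from those counties, then drop unused factor levels
toy_common <- droplevels(toy[toy$county %in% common, ])
```

Calling droplevels() is optional, but it stops the removed counties from lingering as empty factor levels in later tables and plots.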
Q4 Partition
Training set:
Validation set:
Partitioning the data into a training set and a validation set helps us build predictive
models that generalize well to new data. Its main advantages are guarding against overfitting,
evaluating model performance on data the model has not seen, and tuning model hyperparameters.
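
A random partition of this kind might be sketched as below. The seed, the 60/40 proportion, and the data frame `toy` are all assumptions made for illustration; the write-up does not state which split was actually used.

```r
set.seed(699)  # arbitrary seed, chosen only to make the split reproducible

# Hypothetical stand-in data: 100 rows
toy <- data.frame(id = 1:100, x = rnorm(100))

train_frac <- 0.6  # assumed proportion; not stated in the write-up

# Draw row indices for the training set without replacement
train_idx <- sample(nrow(toy), size = round(train_frac * nrow(toy)))

train <- toy[train_idx, ]   # rows used to fit the model
valid <- toy[-train_idx, ]  # held-out rows used to check it
```

Because the validation rows are never used for fitting, error measured on `valid` gives a fairer picture of how the model would behave on new districts.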
Q5 readscr vs mealpct
The percentage of students in a district who qualify for free or reduced-price lunches (mealpct)
is negatively related to the average reading score (readscr): districts with a higher mealpct
tend to have lower reading scores. This makes intuitive sense to me, and the scatter plot supports
it, showing a strong relationship between readscr and mealpct in the training set.
Q6 Correlation between readscr and mealpct
The correlation coefficient between readscr and mealpct is -0.8925, which indicates a very strong
negative correlation between the two variables. The p-value is less than 2.2e-16, so the
correlation is statistically significant at the 5% level.
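
For reference, a coefficient and p-value like these are what cor() and cor.test() return. The sketch below uses made-up numbers purely to show the calls, so its output will not match the values reported above.

```r
# Made-up values, only to demonstrate the calls.
mealpct <- c(5, 12, 20, 33, 41, 55, 62, 70, 84, 95)
readscr <- c(695, 688, 684, 671, 668, 655, 652, 649, 636, 630)

r  <- cor(mealpct, readscr)       # Pearson correlation coefficient
ct <- cor.test(mealpct, readscr)  # tests H0: the true correlation is 0

ct$estimate  # same value as r
ct$p.value   # compare against the 0.05 significance level
```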
Q7 Simple linear regression
On average, a one-unit (one percentage point) increase in mealpct is associated with a
0.6469-point decrease in readscr. The p-value for mealpct is <2e-16, so the slope is highly
statistically significant. The R-squared value of 0.7966 indicates that approximately 79.66%
of the variation in readscr is explained by mealpct, which points to a strong linear relationship.
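
The slope, p-value, and R-squared quoted here are the quantities that lm() and summary() report. The following is a minimal sketch on made-up data (so the numbers differ from those above). It also checks a useful identity: in simple linear regression, R-squared equals the squared correlation coefficient. The reported figures are consistent with it, since (-0.8925)^2 is approximately 0.7966.

```r
# Made-up values, only to demonstrate the calls.
mealpct <- c(5, 12, 20, 33, 41, 55, 62, 70, 84, 95)
readscr <- c(695, 688, 684, 671, 668, 655, 652, 649, 636, 630)

fit <- lm(readscr ~ mealpct)  # simple linear regression
summary(fit)                  # coefficient table, p-values, R-squared

slope <- unname(coef(fit)["mealpct"])  # change in readscr per unit of mealpct
r2    <- summary(fit)$r.squared

# With a single predictor, R-squared is the squared Pearson correlation
all.equal(r2, cor(mealpct, readscr)^2)
```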