Assignment-3_F2023.Rmd

School

Toronto Metropolitan University

Course

830

Subject

Statistics

Date

Jan 9, 2024

Type

Rmd

Pages

7

Uploaded by yusrafq

---
title: "CIND 123: Data Analytics Basic Methods: Assignment-3"
output: html_document
---
<center> <h1> Assignment 3 (10%) </h1> </center>
<center> <h2> Total 100 Marks </h2> </center>
<center> <h3> [Insert your full name] </h3> </center>
<center> <h3> [Insert course section & student number] </h3> </center>
---

## Instructions

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>.

Use RStudio for this assignment. Complete the assignment by inserting your R code wherever you see the string "#INSERT YOUR ANSWER HERE".

When you click the **Knit** button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

Submit **both** the rmd and generated output files. Failing to submit both files will be subject to mark deduction.

## Sample Question and Solution

Use `seq()` to create the vector $(2,4,6,\ldots,20)$.

```{r}
#INSERT YOUR ANSWER HERE.
seq(2, 20, by = 2)
```

## Question 1 [15 Pts]

a) [5 Pts] First and second midterm grades of some students are given as c(85,76,78,88,90,95,42,31,66) and c(55,66,48,58,80,75,32,22,39). Set R variables `first` and `second` respectively. Then find the least-squares line relating the second midterm to the first midterm. Does the assumption of a linear relationship appear to be reasonable in this case? Give reasons for your answer as a comment.

```{r}
#INSERT YOUR ANSWER HERE.
first <- c(85, 76, 78, 88, 90, 95, 42, 31, 66)
second <- c(55, 66, 48, 58, 80, 75, 32, 22, 39)
least_squares <- lm(second ~ first)
summary(least_squares)
plot(first, second, main = "Midterm Grades", xlab = "First Midterm", ylab = "Second Midterm")
abline(least_squares, col = "blue")
# The linear relationship assumption can be examined using the scatterplot
# and the summary statistics provided by the linear regression model.
# The scatterplot shows a clear linear pattern, which suggests a linear
# relationship between the two variables.
# For the linear regression model, the R-squared value in the summary should
# also be checked. The "estimate" values are as follows:
# intercept = -4.1516, first = 0.7870.
# The least-squares line equation is:
# second midterm = 0.7870 * (first midterm) - 4.1516,
# which is a fairly good fit for the data.
```

b) [5 Pts] Plot the second midterm as a function of the first midterm using a scatterplot and graph the least-square line in red color on the same plot.

```{r}
#INSERT YOUR ANSWER HERE.
first <- c(85, 76, 78, 88, 90, 95, 42, 31, 66)
second <- c(55, 66, 48, 58, 80, 75, 32, 22, 39)
least_squares <- lm(second ~ first)
plot(first, second, main = "Midterm Grades", xlab = "First Midterm", ylab = "Second Midterm")
# The question asks for the least-squares line in red.
abline(least_squares, col = "red")
```

c) [5 Pts] Use the regression line to predict the second midterm grades when the first midterm grades are 81 and 23.

```{r}
#INSERT YOUR ANSWER HERE.
first <- c(85, 76, 78, 88, 90, 95, 42, 31, 66)
second <- c(55, 66, 48, 58, 80, 75, 32, 22, 39)
least_squares <- lm(second ~ first)
first_grade <- c(81, 23)
# The column name in newdata must match the predictor name used in lm().
prediction <- predict(least_squares, data.frame(first = first_grade))
prediction
```

## Question 2 [45 Pts]

This question makes use of package "plm". Please load Crime dataset as follows:

```{r load_packages}
#install.packages("plm")
library(plm)
data(Crime)
```

a) [5 Pts] Display the first 8 rows of 'crime' data and display the names of all the variables, the number of variables, then display a descriptive summary of each variable.

```{r}
#INSERT YOUR ANSWER HERE.
library(plm)
data(Crime)
head(Crime, 8)
names(Crime)
length(names(Crime))
summary(Crime)
```
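
Returning briefly to Question 1: the coefficients that `lm()` reports can be reproduced from the closed-form least-squares formulas, which also makes the predictions in part c) easy to verify by hand. The chunk below is an illustrative cross-check under that assumption, not part of the assignment solution; the names `sxy`, `sxx`, `slope_manual`, `intercept_manual`, `new_first`, `manual_prediction` and `lm_prediction` are introduced here for illustration only.

```{r}
# Illustrative cross-check for Question 1 (an addition, not the original answer).
first  <- c(85, 76, 78, 88, 90, 95, 42, 31, 66)
second <- c(55, 66, 48, 58, 80, 75, 32, 22, 39)

# Closed-form estimates for simple linear regression:
#   slope     = Sxy / Sxx
#   intercept = mean(y) - slope * mean(x)
sxy <- sum((first - mean(first)) * (second - mean(second)))
sxx <- sum((first - mean(first))^2)
slope_manual     <- sxy / sxx
intercept_manual <- mean(second) - slope_manual * mean(first)

# The same fit via lm(), as in the answers to Question 1.
least_squares <- lm(second ~ first)

# Predictions for first-midterm grades of 81 and 23, by hand and via predict().
new_first <- c(81, 23)
manual_prediction <- intercept_manual + slope_manual * new_first
lm_prediction <- unname(predict(least_squares, data.frame(first = new_first)))

# Print both versions side by side for comparison.
print(c(intercept = intercept_manual, slope = slope_manual))
print(coef(least_squares))
print(rbind(manual = manual_prediction, lm = lm_prediction))
```

Note that a first-midterm grade of 23 lies below the smallest observed value (31), so that prediction is an extrapolation and deserves more caution than the one at 81.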

b) [5 Pts] Calculate the mean, variance and standard deviation of probability of arrest (prbarr) by omitting the missing values, if any.

```{r}
#INSERT YOUR ANSWER HERE.
mean_arrest <- mean(Crime$prbarr, na.rm = TRUE)
variance_arrest <- var(Crime$prbarr, na.rm = TRUE)
std_arrest <- sd(Crime$prbarr, na.rm = TRUE)
cat("Mean:", mean_arrest, "\n")
cat("Variance:", variance_arrest, "\n")
cat("Standard Deviation:", std_arrest, "\n")
```

c) [5 Pts] Use `lpolpc` (log-police per capita) and `smsa` variables to build a linear regression model to predict probability of arrest (prbarr). And, compare with another linear regression model that uses `polpc` (police per capita) and `smsa`.

[5 Pts] How can you draw a conclusion from the results? (Note: Full marks requires comment on the predictors)

```{r}
#INSERT YOUR ANSWER HERE.
model_one <- lm(prbarr ~ lpolpc + smsa, data = Crime)
model_two <- lm(prbarr ~ polpc + smsa, data = Crime)
summary(model_one)
summary(model_two)
# Model One: (prbarr ~ lpolpc + smsa).
# The coefficient for lpolpc indicates how much the probability of arrest
# changes for a one-unit change in log-police per capita. The coefficient for
# smsa indicates how much the probability of arrest changes for areas within a
# standard metropolitan statistical area compared to non-metropolitan areas.
# Model Two: (prbarr ~ polpc + smsa).
# The coefficient for polpc indicates how much the probability of arrest
# changes for a one-unit change in police per capita. The coefficient for smsa
# indicates how much the probability of arrest changes for areas within a
# standard metropolitan statistical area compared to non-metropolitan areas.
# Model Two seems to perform better than Model One on several metrics:
# lower residual standard error (model one = 0.1623, model two = 0.161),
# higher multiple R-squared (model one = 0.104, model two = 0.1189),
# higher adjusted R-squared (model one = 0.1012, model two = 0.1161),
# and higher F-statistic (model one = 36.4, model two = 42.31).
# Comments on Predictors:
# In Model Two, the coefficient for polpc is 18.34603, suggesting that a
# one-unit increase in police per capita is associated with a substantial
# increase in the probability of arrest. The inclusion of polpc in Model Two
# appears to provide a better fit to the data.
# The coefficient for smsa in both models is negative, suggesting that being
# in a standard metropolitan statistical area is associated with a decrease in
# the probability of arrest compared to non-metropolitan areas.
```

d) [5 Pts] Based on the output of your model, write the equations using the intercept and factors of `smsa` when `polpc` is set to 0.0015, and compare the result with `predict()` function. Hint: Explore `predict()` function

```{r}
#INSERT YOUR ANSWER HERE.
polpc_val <- 0.0015
model_two <- lm(prbarr ~ polpc + smsa, data = Crime)