Stats 101A HW 6

.Rmd

School

University of California, Los Angeles

Course

101A

Subject

Statistics

Date

Jan 9, 2024

Type

Rmd

Pages

4

Uploaded by ucladsp

---
title: "Stats 101A HW 5"
author: 'Ian Zhang UID: 205702810'
date: "2023-05-12"
output: pdf_document
---

## Question 1 - Chapter 3 1B

At first glance, the ordinary straight-line regression of Fare on Distance seems to fit the data well: the scatter plot looks linear, so the model appears valid. However, the residual plot shows a clear upside-down U pattern, which means the straight-line model is not actually valid for this data. To improve the model, we can transform one of the variables. A log() transformation of Distance is a reasonable choice because Distance spans more than one order of magnitude (it ranges from roughly 0 to 2000). We also need to determine whether the outlier at around (2000, 500) is a bad leverage point, since it clearly does not follow the quadratic trend; if it is, we can remove it accordingly.

## Question 2

### Part A

```{r a}
ads <- read.csv("AdRevenue.csv")
ad.m1 <- lm(AdRevenue ~ Circulation, data = ads)
plot(ad.m1)

# Log-transform both variables and refit
log.data <- transform(ads, logCirc = log(Circulation), logAd = log(AdRevenue))
ad.log <- lm(logAd ~ logCirc, data = log.data)
plot(ad.log)
summary(ad.log)
```

This model predicts advertising revenue per page from circulation. The residual plot from the first linear regression model shows a clear pattern in the points, which makes the straight-line model invalid. However, after applying the log transformation to both advertising revenue and circulation, the diagnostic plots indicate that the model is now valid. The new residuals vs. fitted plot removes most of the pattern that was present in the original plot, and the QQ plot is linear, supporting the normality condition. The scale-location plot supports the constant variance condition, as the values are spread roughly equally around the line with no clear pattern.
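
The log-log fit described above can be sketched on simulated data. This is only an illustration under stated assumptions, not the homework's analysis: the data frame `sim`, the coefficients 4.5 and 0.5, and the noise level are made-up stand-ins for AdRevenue.csv. The sketch shows that writing `log()` inside the model formula lets `predict()` accept Circulation on its original scale, and that exponentiating the interval endpoints gives a prediction interval in revenue units.

```{r backtransform}
# Hypothetical stand-in for AdRevenue.csv: a power-law relationship with
# multiplicative noise (all numbers here are arbitrary assumptions)
set.seed(101)
sim <- data.frame(Circulation = exp(runif(70, log(0.3), log(30))))
sim$AdRevenue <- exp(4.5 + 0.5 * log(sim$Circulation) + rnorm(70, sd = 0.2))

# Transformations written inside the formula, so newdata only needs
# a Circulation column on the original scale
fit <- lm(log(AdRevenue) ~ log(Circulation), data = sim)

new.mags <- data.frame(Circulation = c(0.5, 20))

# 95% prediction intervals on the log(AdRevenue) scale
log.pi <- predict(fit, newdata = new.mags, interval = "prediction")

# exp() is monotone increasing, so exponentiating the endpoints gives
# 95% prediction intervals for AdRevenue itself
rev.pi <- exp(log.pi)
print(cbind(new.mags, rev.pi))
```

Note that the back-transformed `fit` column estimates the median rather than the mean of AdRevenue at each circulation, while the interval endpoints carry over directly.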

### b

```{r b}
# 95% prediction interval at Circulation = 0.5: the model is fit on the log
# scale, so predict at logCirc = log(0.5) and back-transform with exp()
exp(predict(ad.log, data.frame(logCirc = log(0.5)), interval = "prediction"))

# 95% prediction interval at Circulation = 20, same approach
exp(predict(ad.log, data.frame(logCirc = log(20)), interval = "prediction"))
```

### c

One weakness of the log model is that the residuals are not as randomly scattered as they could be; one could argue that there is a slight pattern in the residual plot. However, it is definitely a better fit than the original model, and at first glance the plot looks randomly scattered. The same could be said about the scale-location plot: the line is not perfectly horizontal, and one could argue that the points are not evenly spread, but again, it is a lot better than the original, which had all the points concentrated in one area. There are also some high leverage points, but since they do not fall outside the Cook's distance contours, they are not considered bad leverage points.

## Part B

### a

```{r B}
# Quadratic polynomial model
ads.second <- lm(AdRevenue ~ Circulation + I(Circulation^2), data = ads)
plot(ads.second)
summary(ads.second)

# Third-order (cubic) polynomial model
ads.third <- lm(AdRevenue ~ Circulation + I(Circulation^2) + I(Circulation^3), data = ads)
plot(ads.third)
summary(ads.third)
```

### b

```{r b2}
# Quadratic model, Circulation = 0.5: 95% prediction interval from 19.47858 to 186.1802
predict(ads.second, data.frame(Circulation = 0.5), interval = "prediction")

# Quadratic model, Circulation = 20: 95% prediction interval from 490.5858 to 674.188
predict(ads.second, data.frame(Circulation = 20), interval = "prediction")

# Third-order model, Circulation = 0.5: 95% prediction interval from 14.92314 to 153.4138
predict(ads.third, data.frame(Circulation = 0.5), interval = "prediction")

# Third-order model, Circulation = 20: 95% prediction interval from 418.179 to 580.8878
predict(ads.third, data.frame(Circulation = 20), interval = "prediction")
```

### c

Quadratic

There are some weaknesses in this model. For one, the residual plot does not look randomly scattered, as there is a cluster of points near the lower x values that clearly slopes upward. There is also a slight fan shape in the residuals, indicating non-constant variance. The QQ plot does not follow a linear trend either: it trails higher near the top of the data, which indicates non-normality. The scale-location plot also supports the conclusion that the model is weak, as the points are not distributed evenly around the line but instead cluster near the left of the graph, and the line is not horizontal at all but increasing.
There are also bad leverage points that lie outside the Cook's distance contours, which pull the fitted model improperly.

Third

This model also has weaknesses, but they are slightly less severe than the quadratic model's. First, the residual plot looks slightly better: there is less of an upward slope and seemingly more random scatter. However, the points still cluster towards the left of the graph. The plot also has a fan shape that widens from left to right, meaning that the constant variance condition is violated. The QQ plot does not follow the line either, as the tails of the plot stray away from it, which is another weakness of this model. The scale-location plot also shows