IA1_RegressionReview.docx

School

University of Maryland, University College

Course

733

Subject

Statistics

Date

Apr 3, 2024

Uploaded by sabcarr97

Week 1 Individual Assignment 1: Regression Review

The data in the accompanying file Airline Data.csv was assembled by Professor Robert Windle of the Smith School with assistance from Oliver Yao. You may be familiar with this data from earlier classes! The file contains information on 638 air routes in the United States. A route refers to a pair of airports. Note that some cities are served by more than one airport. In such cases, the airports are distinguished by their 3-letter code. The data was collected for the third quarter of 1996 (3Q96).

The variables in the data set are:

1. ObsNum: observation number
2. S_CODE: starting airport's code
3. S_CITY: starting city
4. E_CODE: ending airport's code
5. E_CITY: ending city
6. COUPON: average number of coupons for that route (a one-coupon flight is a non-stop flight, a two-coupon flight is a one-stop flight, etc.)
7. NEW: number of new carriers entering that route between Q3-96 and Q2-97
8. VACATION: whether the route is a vacation route (1) or not (0); Florida and Las Vegas routes are generally considered vacation routes
9. SW: whether Southwest Airlines serves that route (1) or not (0)
10. HI: Herfindahl Index, which airlines use as a measure of market concentration
11. S_INCOME: starting city's average personal income
12. E_INCOME: ending city's average personal income
13. S_POP: starting city's population
14. E_POP: ending city's population
15. SLOT: whether either endpoint airport is slot controlled or not; this is a measure of airport congestion
16. GATE: whether either endpoint airport has gate constraints or not; this is another measure of airport congestion
17. DISTANCE: distance between the two endpoint airports in miles
18. PAX: number of passengers on that route during the period of data collection
19. FARE: average fare on that route

The Assignment

The goal is to predict FARE as a function of the other variables. Please answer all questions.
Supply supporting documentation and show calculations as needed (for example, for the RMSE you may want to include a picture of the error measures from the R output). Please submit a single well-formatted PDF or Word file. The instructor should not need to go searching for your answers! Include your script as an appendix; it will not be graded, but it is often useful when providing feedback. Note that the detailed instructions refer to R; you are, however, free to use any other software.

1. Data Exploration & Visualization

a) Using the graphical capabilities of R (or the software of your choice), provide a single plot that captures some aspects of the data. Include the plot as a clearly marked Exhibit.
b) What do you observe from the plot? How could your observation influence your regression model (or why would it not)?

2. Fitting a linear regression model

a) Following the scripts available online (Data Mining with R), adjust them to:
   a. Randomly partition the data into 70% training and 30% validation, setting the seed to 1.
   b. Run a multivariable regression with all appropriate variables (note that the starting and ending airport indicators are probably not very useful variables). HINT: To remove variables from a data frame in R, one can use the minus sign (please refer to the "R Tips and Tricks from Week 1" script). Provide a summary of the model that includes the values of the regression coefficients, or otherwise include it as a clearly marked and well-formatted Exhibit (please refer to the "R Tips and Tricks from Week 1" script on how to export the needed information).
b) What is the resulting RMSE on the training data?
c) What is the resulting RMSE on the validation data?
d) From your model, how would you quantify the effect of GATE on the predicted FARE? Please be precise in your interpretation, thinking back to your earlier data analysis class.
e) What is the predicted fare of a leg that has COUPON = 1, NEW = 3, VACATION = No, SW = No, HI = 6000, S_INCOME = $25,000, E_INCOME = $30,000, S_POP = 4,000,000, E_POP = 7,150,000, SLOT = Free, GATE = Constrained, DISTANCE = 1000, and PAX = 6000?

3. Variable Selection

Experiment with variable selection methods (please refer to Step 8 in the Data Mining with R scripts). At a minimum, implement forward, backward, and stepwise regression models.

a) From your experiments, pick a model as your final regression model. Provide a summary of the model or otherwise include it as a clearly marked Exhibit.
b) Why did you select this particular model? Please provide quantitative reasoning.
c) What is the resulting RMSE on the training data?
d) What is the resulting RMSE on the validation data?
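
Since the assignment allows software other than R, the two mechanical steps it asks for repeatedly (a seeded 70/30 partition, and RMSE on each partition) can be sketched in plain Python. This is only an illustrative sketch under stated assumptions: the function names `partition_indices` and `rmse` are hypothetical, the rounding of the training-set size is one possible convention, the toy fares are made up and are not taken from Airline Data.csv, and a seed of 1 in Python will not reproduce the rows that `set.seed(1)` selects in R.

```python
import math
import random


def partition_indices(n_rows, train_fraction=0.7, seed=1):
    """Shuffle row indices with a fixed seed, then split into train/validation.

    Hypothetical helper: rounding the training size to the nearest integer
    is an assumption; other tools may truncate instead.
    """
    rng = random.Random(seed)          # private generator, so the split is reproducible
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_train = round(n_rows * train_fraction)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


def rmse(actual, predicted):
    """Root mean squared error: sqrt(mean((actual - predicted)^2))."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# 638 routes, as stated in the assignment.
train_idx, valid_idx = partition_indices(638)
print(len(train_idx), len(valid_idx))

# Made-up fares and predictions, purely to exercise rmse().
toy_actual = [100.0, 150.0, 200.0]
toy_predicted = [110.0, 140.0, 200.0]
print(rmse(toy_actual, toy_predicted))
```

In practice, the same `rmse` function would be applied twice per model: once to the fitted values on the training rows and once to the predictions on the validation rows. Comparing the two numbers across the forward, backward, and stepwise models is one way to supply the quantitative reasoning requested in question 3b.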