Assignment 9

School

University of Michigan, Dearborn

Course

DS 633

Subject

Industrial Engineering

Date

Feb 20, 2024

Uploaded by monikagautam93

1. Problem 9.3 Predicting Prices of Used Cars (Regression Trees). The file ToyotaCorolla.jmp contains the data on used cars (Toyota Corolla) on sale during late summer of 2004 in The Netherlands. It has 1436 records containing details on 38 attributes, including Price, Age, Kilometers, HP, and other specifications. The goal is to predict the price of a used Toyota Corolla based on its specifications. (The example in Section sec-regtrees is a subset of this dataset).

Data preprocessing. Split the data into training (50%), validation (30%), and test (20%) datasets.

Run a regression tree with the output variable Price and input variables Age_08_04, KM, Fuel_Type, HP, Automatic, Doors, Quarterly_Tax, Mfg_Guarantee, Guarantee_Period, Airco, Automatic_airco, CD_Player, Powered_Windows, Sport_Model, and Tow_Bar. Set the minimum split size to 1, and use the split button repeatedly to create a full tree (hint, use the red triangle options to hide the tree and the graph). As you split, keep an eye on RMSE and RSquare for the training, validation and test sets.

i. Describe what happens to the RSquare and RMSE for the training, validation and test sets as you continue to split the tree.

The JMP output labels the error measure RASE (root average squared error); it is used here as the RMSE the question asks about. AICc is reported for the training set.

Number of  AICc      Training (N = 718)    Validation (N = 431)  Test (N = 287)
Splits               RSquare  RASE         RSquare  RASE         RSquare  RASE
1          13076.7   0.639    2171.4819    0.606    2178.3953    0.721    2034.3537
5          12653.9   0.802    1608.639     0.781    1624.6516    0.850    1493.9806
10         12377.8   0.867    1317.707     0.824    1456.1092    0.900    1219.2657
15         12310.9   0.880    1248.6408    0.829    1437.3576    0.905    1186.1872
20         12274.1   0.888    1208.1286    0.834    1414.5993    0.910    1153.4376
25         12243.3   0.894    1173.7013    0.840    1390.5749    0.911    1147.7832
30         12217.5   0.900    1144.0907    0.841    1382.6499    0.919    1097.2567
40         12114.2   0.916    1048.2752    0.844    1369.8937    0.918    1104.0191
50         12099.3   0.920    1021.0167    0.841    1384.3096    0.915    1120.5823
60         12076.1   0.925    988.27733    0.833    1418.0472    0.915    1121.0064
80         12082.7   0.929    959.13913    0.835    1408.9757    0.911    1150.417
100        12101.2   0.933    936.55618    0.837    1400.0907    0.908    1168.0779

As the tree grows from 1 to 100 splits, the fit to the training data improves at every step: training RSquare rises from 0.639 to 0.933 and training RASE falls from about 2171 to about 937. Validation and test performance also improve quickly at first, but the gains slow after around 30 splits. Validation RSquare peaks at 0.844 at 40 splits and test RSquare peaks at 0.919 at 30 splits; beyond those points both drift slightly downward and the corresponding RASE values rise again. This is overfitting: the extra splits fit details of the training data that do not carry over to new cars. It's like memorizing a lesson without really understanding it. The model becomes very good at predicting prices for cars it has seen before, but it may struggle with new cars. It is important to find the tree size that works well not just on the training data but also on new, unseen data.

ii. How does the performance of the test set compare to the training and validation sets on these measures? Why does this occur?

The test set follows the same overall pattern as the training and validation sets. Performance improves rapidly over the first splits, levels off after around 30 splits, and then declines slightly as the tree keeps growing, just as it does for the validation set. On these measures the test set scores better than the validation set at every tree size (for example, RSquare of 0.919 versus 0.841 at 30 splits), and its RSquare is even higher than the training RSquare through 40 splits.

The later decline occurs because the model becomes too focused on the details of the training data, making it less effective at predicting prices for new cars. It's like studying a specific set of problems too much and struggling when faced with new ones. The test set's advantage over the validation set most likely reflects the random partition: with only 287 records, the test set may by chance contain cars that are easier to predict. Monitoring these patterns helps ensure the model works well not just on what it has seen before but also on new, unseen data.
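The split history recorded in part i can also be checked outside JMP. The following is a minimal Python sketch, not part of the original assignment: the RSquare values are copied from the output in part i, and the names `history`, `best_splits`, and `gaps` are hypothetical. It picks the recorded tree size with the highest validation RSquare and measures how far training RSquare runs ahead of validation RSquare.

```python
# Split history copied from the JMP output in part i:
# (number of splits, training RSquare, validation RSquare, test RSquare)
history = [
    (1, 0.639, 0.606, 0.721),
    (5, 0.802, 0.781, 0.850),
    (10, 0.867, 0.824, 0.900),
    (15, 0.880, 0.829, 0.905),
    (20, 0.888, 0.834, 0.910),
    (25, 0.894, 0.840, 0.911),
    (30, 0.900, 0.841, 0.919),
    (40, 0.916, 0.844, 0.918),
    (50, 0.920, 0.841, 0.915),
    (60, 0.925, 0.833, 0.915),
    (80, 0.929, 0.835, 0.911),
    (100, 0.933, 0.837, 0.908),
]

# Recorded tree size with the highest validation RSquare.
best_splits = max(history, key=lambda row: row[2])[0]

# Training-minus-validation RSquare at each recorded tree size.
# A widening gap is a sign of overfitting.
gaps = {splits: round(train - val, 3) for splits, train, val, _ in history}

print("Best recorded tree size:", best_splits)
print("Gap at 1 split:", gaps[1], "| gap at 100 splits:", gaps[100])
```

Because only selected tree sizes were recorded, this gives a coarse answer (40 splits); JMP evaluates every tree size, which is why its automatic choice in part v can fall between two recorded rows.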
iii. Based on this tree, which are the most important car specifications for predicting the car's price?

Column Contributions

Term              Number of Splits  SS            Portion
Age_08_04         16                7685656170    0.8798
HP                8                 459235381     0.0526
KM                20                262445186     0.0300
Powered_Windows   9                 82264183      0.0094
Quarterly_Tax     5                 60911462.2    0.0070
Doors             9                 48787063.1    0.0056
Airco             7                 33903425.1    0.0039
Automatic_airco   1                 31965718.4    0.0037
Mfg_Guarantee     9                 23867564.3    0.0027
Sport_Model       4                 14827478.5    0.0017
CD_Player         2                 11128462.8    0.0013
Tow_Bar           6                 10267467.2    0.0012
Guarantee_Period  2                 5813137.94    0.0007
Automatic         1                 3052618.98    0.0003
Fuel_Type         1                 1732503.71    0.0002

The most important specifications are those with the highest "SS Portion," the proportion of the sum of squares attributable to splits on each specification. The age of the car (Age_08_04) is by far the most important, accounting for about 88% of the total: older cars tend to cost less. Engine power (HP) is second at about 5%, followed by the number of kilometers driven (KM) at about 3%. KM is split on most often (20 splits), but those splits explain comparatively little variation. Every other specification contributes less than 1%.

iv. Refit this model, and use the Go button to automatically split and prune the tree based on the validation RSquare. Save the prediction formula for this model to the data table. Save the Refit model in jmp file.

v. How many splits are in the final tree?

36 splits.

            RSquare  RASE        N    Number of Splits  AICc
Training    0.911    1074.441    718  36                12140.6
Validation  0.845    1366.0136   431
Test        0.921    1083.7582   287
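The Portion column in the Column Contributions report of part iii appears to be each term's SS divided by the total SS across all terms. The following minimal Python sketch works under that assumption, with the SS values copied from the report; the names `ss`, `total_ss`, and `portion` are hypothetical. It reproduces the reported portions for the top three terms.

```python
# Sum of squares (SS) per term, copied from the Column Contributions report.
ss = {
    "Age_08_04": 7685656170,
    "HP": 459235381,
    "KM": 262445186,
    "Powered_Windows": 82264183,
    "Quarterly_Tax": 60911462.2,
    "Doors": 48787063.1,
    "Airco": 33903425.1,
    "Automatic_airco": 31965718.4,
    "Mfg_Guarantee": 23867564.3,
    "Sport_Model": 14827478.5,
    "CD_Player": 11128462.8,
    "Tow_Bar": 10267467.2,
    "Guarantee_Period": 5813137.94,
    "Automatic": 3052618.98,
    "Fuel_Type": 1732503.71,
}

# Assumption: Portion = term SS / total SS over all terms.
total_ss = sum(ss.values())
portion = {term: value / total_ss for term, value in ss.items()}

# Terms ranked from most to least important.
ranked = sorted(portion, key=portion.get, reverse=True)

for term in ranked[:3]:
    print(f"{term}: {portion[term]:.4f}")
```

Under this assumption the three leading terms together account for roughly 96% of the total sum of squares.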
vi. Compare RSquare and RMSE for the training, validation and test sets for the reduced model to the full model.

            Full Model            Reduced Model
            RSquare  RASE         RSquare  RASE
Training    0.933    936.55618    0.911    1074.441
Validation  0.837    1400.0907    0.845    1366.0136
Test        0.908    1168.0779    0.921    1083.7582

On the training data, the reduced model has a lower RSquare and a higher RASE than the full model, so the full model explains more variance and fits the known training data better. On the validation and test sets, however, the reduced model outperforms the full model, with higher RSquare and lower RASE. This suggests that the reduced model generalizes better to new, unseen data.

vii. Which model is better for making predictions? Why?

The reduced model is preferred for making predictions, for two reasons: it has a lower prediction error (lower RASE) on the validation and test sets, implying more accurate predictions on new data, and it is simpler (parsimony). The reduced model avoids overcomplicating the tree while still achieving good prediction accuracy, so for practical prediction purposes it is the better choice.
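The comparison between the full and reduced models can be made concrete by computing the change in RASE for each partition. The following is a minimal Python sketch using the values reported in part vi; the names `full`, `reduced`, and `rase_change` are hypothetical and not part of the original assignment.

```python
# (RSquare, RASE) per partition, copied from the comparison in part vi.
full = {
    "Training": (0.933, 936.55618),
    "Validation": (0.837, 1400.0907),
    "Test": (0.908, 1168.0779),
}
reduced = {
    "Training": (0.911, 1074.441),
    "Validation": (0.845, 1366.0136),
    "Test": (0.921, 1083.7582),
}

# Change in RASE when moving from the full to the reduced model.
# Negative values mean the reduced model has the smaller error.
rase_change = {part: reduced[part][1] - full[part][1] for part in full}

for part, change in rase_change.items():
    pct = 100 * change / full[part][1]
    print(f"{part}: {change:+.2f} ({pct:+.1f}%)")
```

The sign pattern is the point of interest: the error rises on the training set but falls on both held-out sets.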