Assignment 9
School: University of Michigan, Dearborn
Course: DS 633
Subject: Industrial Engineering
Date: Feb 20, 2024
Uploaded by monikagautam93
1. Problem 9.3
Predicting Prices of Used Cars (Regression Trees). The file ToyotaCorolla.jmp contains data on used Toyota Corollas offered for sale during late summer 2004 in the Netherlands. It has 1436 records with details on 38 attributes, including Price, Age, Kilometers, HP, and other specifications. The goal is to predict the price of a used Toyota Corolla from its specifications. (The example in the textbook's section on regression trees uses a subset of this dataset.)
Data preprocessing. Split the data into training (50%), validation (30%), and test (20%) datasets.
Run a regression tree with the output variable Price and input variables Age_08_04, KM, Fuel_Type, HP, Automatic, Doors, Quarterly_Tax, Mfg_Guarantee, Guarantee_Period, Airco, Automatic_airco, CD_Player, Powered_Windows, Sport_Model, and Tow_Bar. Set the minimum split size to 1, and use the Split button repeatedly to create a full tree (hint: use the red-triangle options to hide the tree and the graph). As you split, keep an eye on RMSE and RSquare for the training, validation, and test sets.
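As a hedged sketch of the same setup outside JMP (using scikit-learn, with synthetic stand-in data rather than the real ToyotaCorolla.jmp columns), the 50%/30%/20% partition and a fully grown tree look like this:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
# Stand-in data: 1436 rows like the ToyotaCorolla file; the real columns
# (Age_08_04, KM, HP, ...) are replaced by synthetic predictors here.
X = rng.normal(size=(1436, 5))
y = 15000 - 1200 * X[:, 0] + rng.normal(scale=1500, size=1436)

# 50% / 30% / 20% split: 718 training, 431 validation, 287 test records
X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=718, random_state=1)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=287, random_state=1)

# JMP's "minimum split size 1" amounts to growing the tree out fully;
# in scikit-learn that is the default min_samples_split=2.
full_tree = DecisionTreeRegressor(min_samples_split=2, random_state=1).fit(X_train, y_train)
```

Note that JMP assigns rows to the three sets via a validation column, while `train_test_split` shuffles randomly, so the exact partitions will differ.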
i.
Describe what happens to the RSquare and RMSE for the training, validation and test sets as you continue to split the tree.
Fit details as the tree is split (Training N = 718, Validation N = 431, Test N = 287 throughout; AICc is reported for the training fit; RASE is JMP's root average squared error, i.e., the RMSE):

Splits | Training RSquare / RASE | Validation RSquare / RASE | Test RSquare / RASE | AICc
     1 | 0.639 / 2171.4819       | 0.606 / 2178.3953         | 0.721 / 2034.3537   | 13076.7
     5 | 0.802 / 1608.639        | 0.781 / 1624.6516         | 0.850 / 1493.9806   | 12653.9
    10 | 0.867 / 1317.707        | 0.824 / 1456.1092         | 0.900 / 1219.2657   | 12377.8
    15 | 0.880 / 1248.6408       | 0.829 / 1437.3576         | 0.905 / 1186.1872   | 12310.9
    20 | 0.888 / 1208.1286       | 0.834 / 1414.5993         | 0.910 / 1153.4376   | 12274.1
    25 | 0.894 / 1173.7013       | 0.840 / 1390.5749         | 0.911 / 1147.7832   | 12243.3
    30 | 0.900 / 1144.0907       | 0.841 / 1382.6499         | 0.919 / 1097.2567   | 12217.5
    40 | 0.916 / 1048.2752       | 0.844 / 1369.8937         | 0.918 / 1104.0191   | 12114.2
    50 | 0.920 / 1021.0167       | 0.841 / 1384.3096         | 0.915 / 1120.5823   | 12099.3
    60 | 0.925 / 988.27733       | 0.833 / 1418.0472         | 0.915 / 1121.0064   | 12076.1
    80 | 0.929 / 959.13913       | 0.835 / 1408.9757         | 0.911 / 1150.417    | 12082.7
   100 | 0.933 / 936.55618       | 0.837 / 1400.0907         | 0.908 / 1168.0779   | 12101.2
As the tree grows from 1 to 100 splits, the training fit improves steadily: training RSquare rises from 0.639 to 0.933 and training RASE falls from about 2171 to 937. Each additional split lets the tree match the training data more closely, so these measures can only improve. The validation and test measures improve along with it at first, but the gains taper off: validation RSquare peaks around 40 splits (0.844) and then drifts back down, while validation RASE bottoms out and begins rising again. Past that point the extra splits are fitting noise in the training sample rather than real structure; the tree is overfitting, like memorizing a lesson without really understanding it. The model keeps getting better at predicting prices for cars it has already seen while getting no better, and eventually worse, at new ones. The right tree size balances fit on the training data against performance on data the model has not seen.
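The pattern above can be reproduced in miniature with scikit-learn (a sketch on synthetic stand-in data, not the JMP workflow itself): capping `max_leaf_nodes` at n+1 limits a binary tree to n splits, so sweeping the cap traces training and validation RMSE as the tree grows.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Synthetic stand-in data with two informative predictors plus noise
X = rng.normal(size=(1000, 4))
y = 100 * X[:, 0] + 50 * X[:, 1] + rng.normal(scale=40, size=1000)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, train_size=0.5, random_state=0)

train_rmse, val_rmse = [], []
for n_splits in (1, 5, 10, 25, 50, 100):
    # max_leaf_nodes = n_splits + 1 caps a binary tree at n_splits splits
    t = DecisionTreeRegressor(max_leaf_nodes=n_splits + 1, random_state=0).fit(X_tr, y_tr)
    train_rmse.append(mean_squared_error(y_tr, t.predict(X_tr)) ** 0.5)
    val_rmse.append(mean_squared_error(y_val, t.predict(X_val)) ** 0.5)
```

Printing the two lists side by side shows training RMSE falling at every step while validation RMSE levels off (and, with enough splits, turns back up), mirroring the JMP tables above.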
ii.
How does the performance of the test set compare to the training and validation sets on these measures? Why does this occur?
The test set follows the same broad pattern as the training and validation sets: performance improves rapidly over the first 20 to 30 splits, then levels off and eventually worsens (test RSquare peaks at 0.919 around 30 splits and falls back to 0.908 by 100 splits). The validation and test curves turn down while the training curve keeps improving because only the training data are used to choose the splits; once the genuine structure is captured, further splits chase noise peculiar to the training sample and do not transfer to new records. Notably, the test set here actually shows higher RSquare and lower RASE than the validation set (and, at small tree sizes, even the training set); with a random 50/30/20 partition this can happen by chance when the test partition contains records that are easier to predict. Watching all three sets together is what reveals where the useful splitting stops.
iii.
Based on this tree, which are the most important car specifications for predicting the car's price?
Column Contributions

Term             | Number of Splits | SS         | Portion
Age_08_04        | 16               | 7685656170 | 0.8798
HP               | 8                | 459235381  | 0.0526
KM               | 20               | 262445186  | 0.0300
Powered_Windows  | 9                | 82264183   | 0.0094
Quarterly_Tax    | 5                | 60911462.2 | 0.0070
Doors            | 9                | 48787063.1 | 0.0056
Airco            | 7                | 33903425.1 | 0.0039
Automatic_airco  | 1                | 31965718.4 | 0.0037
Mfg_Guarantee    | 9                | 23867564.3 | 0.0027
Sport_Model      | 4                | 14827478.5 | 0.0017
CD_Player        | 2                | 11128462.8 | 0.0013
Tow_Bar          | 6                | 10267467.2 | 0.0012
Guarantee_Period | 2                | 5813137.94 | 0.0007
Automatic        | 1                | 3052618.98 | 0.0003
Fuel_Type        | 1                | 1732503.71 | 0.0002
The importance of each specification is measured by its Portion: the share of the total split sum of squares (SS) attributable to that variable. By this measure, Age_08_04 is by far the most important predictor, accounting for about 88% of the explained variation — older cars sell for less. Horsepower (HP) comes next at about 5%, followed by the odometer reading (KM) at about 3%. These three variables together account for over 96% of the tree's explanatory power; each of the remaining specifications contributes under 1%.
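For a rough analogue outside JMP, scikit-learn's `feature_importances_` plays the same role as the Portion column: each value is the variable's share of the total impurity (SS) reduction across all splits, and the values sum to 1. A sketch on synthetic stand-in data (column names and coefficients are illustrative, not the real dataset):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
# Synthetic car-like data: age dominates the price, mileage and power less so
age = rng.uniform(1, 80, size=800)
km = rng.uniform(1, 250, size=800)
hp = rng.uniform(70, 120, size=800)
price = 20000 - 150 * age - 10 * km + 5 * hp + rng.normal(scale=300, size=800)

X = np.column_stack([age, km, hp])
tree = DecisionTreeRegressor(max_leaf_nodes=20, random_state=0).fit(X, price)

# Shares of total split-SS reduction, summing to 1 (like JMP's Portion)
portions = dict(zip(["Age", "KM", "HP"], tree.feature_importances_))
```

Because price varies far more with age than with the other predictors here, the Age entry dominates `portions`, mirroring the roughly 0.88 portion for Age_08_04 in the table above.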
iv.
Refit this model, and use the Go button to automatically split and prune the tree based on the validation RSquare. Save the prediction formula for this model to the data table.
The refit (pruned) model, with its prediction formula saved to the data table, is in the accompanying JMP file.
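JMP's Go button grows the tree and keeps the size that maximizes validation RSquare. A loose analogue in scikit-learn (a sketch on synthetic data, not JMP's algorithm) is cost-complexity pruning: compute the pruning path on the training data, then keep the candidate tree that scores best on the validation set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(7)
# Synthetic stand-in data with two informative predictors plus noise
X = rng.normal(size=(1200, 5))
y = 80 * X[:, 0] + 40 * X[:, 1] + rng.normal(scale=30, size=1200)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, train_size=0.5, random_state=7)

# Candidate pruning levels of the fully grown tree
path = DecisionTreeRegressor(random_state=7).cost_complexity_pruning_path(X_tr, y_tr)
trees = [
    DecisionTreeRegressor(ccp_alpha=max(a, 0.0), random_state=7).fit(X_tr, y_tr)
    for a in path.ccp_alphas
]
# Keep the pruned tree with the best validation RSquare
best = max(trees, key=lambda t: t.score(X_val, y_val))
n_splits = (best.tree_.node_count - 1) // 2  # internal (split) nodes of a binary tree
```

The chosen tree sits well short of the fully grown one, the same behavior the Go button shows when it stops splitting once validation RSquare no longer improves.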
v.
How many splits are in the final tree?
36 splits.
Fit details for the final (pruned) tree:

Set        | RSquare | RASE      | N   | Number of Splits | AICc
Training   | 0.911   | 1074.441  | 718 | 36               | 12140.6
Validation | 0.845   | 1366.0136 | 431 |                  |
Test       | 0.921   | 1083.7582 | 287 |                  |
vi.
Compare RSquare and RMSE for the training, validation and test sets for the reduced model to the full model.
Full model (100 splits) vs. reduced model (36 splits):

Set        | Full RSquare | Full RASE | Reduced RSquare | Reduced RASE
Training   | 0.933        | 936.55618 | 0.911           | 1074.441
Validation | 0.837        | 1400.0907 | 0.845           | 1366.0136
Test       | 0.908        | 1168.0779 | 0.921           | 1083.7582
On the training data, the full model fits better: higher RSquare (0.933 vs. 0.911) and lower RASE (937 vs. 1074). That is expected, since its extra splits let it track the training records closely. On the validation and test sets, however, the reduced model wins on both measures (validation RSquare 0.845 vs. 0.837 and RASE 1366 vs. 1400; test RSquare 0.921 vs. 0.908 and RASE 1084 vs. 1168). The splits that pruning removed were fitting noise, so discarding them improves generalization to new, unseen data.
vii.
Which model is better for making predictions? Why?
The reduced model is the better choice for prediction. It has lower error on the data that matters — the validation and test sets (lower RASE, higher RSquare) — and it is more parsimonious: 36 splits instead of 100, which makes it less prone to overfitting and easier to interpret. The full model's apparent advantage exists only on records it has already seen, so for predicting prices of new cars the reduced model stands out.