library(ISLR2)
set.seed(1)
train.set <- sample(nrow(Carseats), nrow(Carseats) / 2)
train <- 1:nrow(Carseats) %in% train.set
Lab 8: Tree-based Methods
Questions
Conceptual Questions
(ISLR2, Q8.3) \star Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of \hat{p}_{m1}. The x-axis should display \hat{p}_{m1}, ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy.
Hint: In a setting with two classes, \hat{p}_{m1} = 1 - \hat{p}_{m2}. You could make this plot by hand, but it will be much easier to make in R.
(ISLR2, Q8.4) \star This question relates to the plots in the textbook Figure 8.14, reproduced here as Figure 1:
Sketch the tree corresponding to the partition of the predictor space illustrated in the left-hand panel of Figure 8.14. The numbers inside the boxes indicate the mean of Y within each region.
Create a diagram similar to the left-hand panel of Figure 8.14, using the tree illustrated in the right-hand panel of the same figure. You should divide up the predictor space into the correct regions, and indicate the mean for each region.
(ISLR2, Q8.5) \star Suppose we produce ten bootstrapped samples from a data set containing red and green classes. We then apply a classification tree to each bootstrapped sample and, for a specific value of X, produce 10 estimates of \mathbb{P}(\text{Class is Red}|X): 0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, \text{ and } 0.75. There are two common ways to combine these results together into a single class prediction. One is the majority vote approach discussed in this chapter. The second approach is to classify based on the average probability. In this example, what is the final classification under each of these two approaches?
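The two combination rules in Q8.5 can be sketched in a few lines of base R. This is only an illustration: the 0.5 cut-off for calling an estimate "Red" and the variable names (p.red, votes.red, majority, average) are assumptions introduced here, not part of the question.

```r
# Ten bootstrap estimates of P(Class is Red | X) from Q8.5
p.red <- c(0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75)

# Majority vote: each tree votes Red if its own estimate exceeds 0.5 (assumed cut-off)
votes.red <- sum(p.red > 0.5)
majority <- ifelse(votes.red > length(p.red) / 2, "Red", "Green")

# Average probability: classify once, using the mean of the ten estimates
avg.p <- mean(p.red)
average <- ifelse(avg.p > 0.5, "Red", "Green")

c(votes.red = votes.red, avg.p = avg.p)
c(majority = majority, average = average)
```

Six of the ten estimates exceed 0.5, so the vote goes to Red, while the mean estimate is 0.45, so averaging gives Green.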
Applied Questions
(ISLR2, Q8.8) \star In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.
Split the data set into a training set and a test set.
Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?
Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?
Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important.
Use random forests to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.
Now analyze the data using BART, and report your results.
(ISLR2, Q8.9) This problem involves the OJ data set which is part of the ISLR2 package.
Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.
Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?
Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.
Create a plot of the tree, and interpret the results.
Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?
Apply the cv.tree() function to the training set in order to determine the optimal tree size.
Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.
Which tree size corresponds to the lowest cross-validated classification error rate?
Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.
Compare the training error rates between the pruned and unpruned trees. Which is higher?
Compare the test error rates between the pruned and unpruned trees. Which is higher?
(ISLR2, Q8.10) \star We now use boosting to predict Salary in the Hitters data set.
Remove the observations for whom the salary information is unknown, and then log-transform the salaries.
Create a training set consisting of the first 200 observations, and a test set consisting of the remaining observations.
Perform boosting on the training set with 1,000 trees for a range of values of the shrinkage parameter \lambda. Produce a plot with different shrinkage values on the x-axis and the corresponding training set MSE on the y-axis.
Produce a plot with different shrinkage values on the x-axis and the corresponding test set MSE on the y-axis.
Compare the test MSE of boosting to the test MSE that results from applying two of the regression approaches seen in Chapters 3 and 6.
Which variables appear to be the most important predictors in the boosted model?
Now apply bagging to the training set. What is the test set MSE for this approach?
Solutions
Conceptual Questions
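For Q8.3, one possible way to draw the three curves in base R is sketched below. It assumes the two-class forms of the measures (Gini index 2p(1 - p), classification error 1 - max(p, 1 - p), entropy with natural logarithms); the function names gini, class_err and entropy are introduced here for illustration.

```r
# Two-class impurity measures as functions of p = \hat{p}_{m1}
gini <- function(p) 2 * p * (1 - p)
class_err <- function(p) 1 - pmax(p, 1 - p)
entropy <- function(p) {
  # Natural log; treat 0 * log(0) as 0 at the endpoints
  term <- function(q) ifelse(q > 0, -q * log(q), 0)
  term(p) + term(1 - p)
}

p <- seq(0, 1, length.out = 201)

# Entropy has the largest peak (log(2) at p = 0.5), so draw it first
plot(p, entropy(p),
  type = "l", lty = 1, ylim = c(0, 0.75),
  xlab = expression(hat(p)[m1]), ylab = "Value"
)
lines(p, gini(p), lty = 2)
lines(p, class_err(p), lty = 3)
legend("bottom",
  legend = c("Entropy", "Gini index", "Classification error"),
  lty = 1:3, bty = "n"
)
```

All three measures are zero at p = 0 and p = 1 and peak at p = 0.5.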
Applied Questions
library(tree)
fit <- tree(Sales ~ ., data = Carseats, subset = train)
plot(fit)
text(fit, pretty = 0)
pred <- predict(fit, newdata = Carseats[!train, ])
mean((Carseats$Sales[!train] - pred)^2)
[1] 4.922039
Unfortunately, if the option pretty = 0 is used, the plot doesn't look too nice. However, we can decipher from the plot that ShelveLoc seems to be the most important predictor of sales, followed by Price. You can try the option pretty = NULL. The test MSE is about 4.92.
set.seed(1)
fit <- tree(Sales ~ ., data = Carseats)
fit.cv <- cv.tree(fit, FUN = prune.tree)
plot(fit.cv$size, fit.cv$dev, type = "l")
plot(fit.cv$k, fit.cv$dev, type = "l")
The model with the lowest CV error is the 14 leaf-node tree with a cost-complexity tuning parameter of 34.30.
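To see whether pruning changes the test MSE, one option is to cross-validate a tree fit on the training half only and then compare the pruned and unpruned trees on the test half. The following is a minimal sketch under that assumption; it reuses the seed and split from the setup code, the names fit.train, cv.train, best.size, fit.pruned and test.mse are introduced here, and no particular result is claimed.

```r
library(ISLR2)
library(tree)

# Same split as in the setup code
set.seed(1)
train.set <- sample(nrow(Carseats), nrow(Carseats) / 2)
train <- 1:nrow(Carseats) %in% train.set

# Grow and cross-validate the tree on the training half only
fit.train <- tree(Sales ~ ., data = Carseats, subset = train)
cv.train <- cv.tree(fit.train, FUN = prune.tree)

# Tree size with the lowest cross-validated deviance (at least 2 leaves)
best.size <- max(cv.train$size[which.min(cv.train$dev)], 2)
fit.pruned <- prune.tree(fit.train, best = best.size)

# Test MSE helper applied to both trees
mse <- function(model) {
  pred <- predict(model, newdata = Carseats[!train, ])
  mean((Carseats$Sales[!train] - pred)^2)
}
test.mse <- c(unpruned = mse(fit.train), pruned = mse(fit.pruned))
test.mse
```

If cross-validation selects the full tree, the two numbers coincide; otherwise the comparison shows directly whether the simpler tree predicts better on held-out data.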
library(randomForest)
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.
set.seed(1)
bag.sales <- randomForest(Sales ~ .,
  data = Carseats, subset = train,
  mtry = (ncol(Carseats) - 1), importance = TRUE
)
pred <- predict(bag.sales, newdata = Carseats[!train, ])
mean((Carseats$Sales[!train] - pred)^2)
[1] 2.634877
importance(bag.sales)
               %IncMSE IncNodePurity
CompPrice   24.2351022     170.07496
Income       4.3958014      95.51328
Advertising 13.2725833      99.45799
Population  -1.0856676      56.91945
Price       56.3728353     502.27782
ShelveLoc   48.1294202     371.79930
Age         18.3513474     162.04892
Education    0.9147364      42.98078
Urban        0.6861240       8.99512
US           5.8486748      15.92802
The test MSE is about 2.63, which is lower than that of the single (non-bagged) regression tree. Price and ShelveLoc are the most important predictors.
set.seed(1)
rf.sales <- randomForest(Sales ~ .,
  data = Carseats, subset = train,
  importance = TRUE
)
pred <- predict(rf.sales, newdata = Carseats[!train, ])
mean((Carseats$Sales[!train] - pred)^2)
[1] 2.956352
importance(rf.sales)
               %IncMSE IncNodePurity
CompPrice   14.4290662     150.86590
Income       4.8926264     129.04906
Advertising  9.8054622     112.00297
Population  -0.7055324      97.14674
Price       40.2730211     399.65115
ShelveLoc   33.8898265     298.27481
Age         12.7259159     173.48643
Education    1.3788577      72.55781
Urban       -0.5804948      15.72089
US           6.2361451      29.96115
The test MSE is about 2.96, which is lower than that of the single regression tree but higher than that of the bagged model. Price and ShelveLoc are again the most important predictors, although their importance scores are smaller than under bagging.
set.seed(1)
rfTestMSE <- rep(Inf, ncol(Carseats) - 1)
for (i in 1:(ncol(Carseats) - 1)) {
  rf.sales <- randomForest(Sales ~ .,
    data = Carseats, subset = train,
    mtry = i, importance = TRUE
  )
  pred <- predict(rf.sales, newdata = Carseats[!train, ])
  rfTestMSE[i] <- mean((Carseats$Sales[!train] - pred)^2)
}
plot(1:(ncol(Carseats) - 1), rfTestMSE, type = "l")
As expected, the test MSE shows a decreasing trend as mtry, the number of variables considered at each split, increases.
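The last part of Q8.8 asks for a BART fit. A minimal sketch follows, assuming the BART package is installed and using its gbart() function in the same way as the ISLR2 lab; the names x, bart.fit and bart.mse are introduced here for illustration, the split is the one from the setup code, and no result is claimed.

```r
library(ISLR2)
library(BART)

# Same split as in the setup code
set.seed(1)
train.set <- sample(nrow(Carseats), nrow(Carseats) / 2)
train <- 1:nrow(Carseats) %in% train.set

# Predictors (everything except the response) and the response itself
x <- Carseats[, names(Carseats) != "Sales"]
y <- Carseats$Sales

# Fit BART on the training half and predict the test half in one call
set.seed(1)
bart.fit <- gbart(x[train, ], y[train], x.test = x[!train, ])

# Posterior mean prediction for each test observation
bart.mse <- mean((y[!train] - bart.fit$yhat.test.mean)^2)
bart.mse
```

The resulting test MSE can then be compared with the tree, bagging and random forest values reported above.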
library(ISLR2)
set.seed(1)
train.set <- sample(1:nrow(OJ), 800)
train <- 1:nrow(OJ) %in% train.set
library(tree)
OJ.tree <- tree(Purchase ~ ., data = OJ, subset = train)
summary(OJ.tree)
Classification tree:
tree(formula = Purchase ~ ., data = OJ, subset = train)
Variables actually used in tree construction:
[1] "LoyalCH"       "PriceDiff"     "SpecialCH"     "ListPriceDiff"
[5] "PctDiscMM"
Number of terminal nodes:  9
Residual mean deviance:  0.7432 = 587.8 / 791
Misclassification error rate: 0.1588 = 127 / 800
The training error rate is 0.1588 (127 of 800 observations misclassified), and the tree has 9 terminal nodes.
# you can type in the tree name "OJ.tree" here,
# but it is easier to see in plot form
plot(OJ.tree)
text(OJ.tree)
Consider the 4th leaf-node from the left: if 0.264232 < LoyalCH < 0.508643 and PriceDiff < 0.195 and SpecialCH ??? 0.5 then classify as CH.
See (c).
pred <- predict(OJ.tree, newdata = OJ[!train, ], type = "class")
table(pred = pred, true = OJ$Purchase[!train])
    true
pred  CH  MM
  CH 160  38
  MM   8  64
The test error rate is (38 + 8) / 270, or about 0.170.
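The error rate implied by a confusion matrix is the share of observations off the diagonal. As a small self-contained check, the sketch below re-enters the counts printed in the table above by hand; the names tab and test.err are introduced here for illustration.

```r
# Confusion matrix from the output above (rows: predicted, columns: true)
tab <- matrix(c(160, 8, 38, 64),
  nrow = 2,
  dimnames = list(pred = c("CH", "MM"), true = c("CH", "MM"))
)

# Misclassified share = 1 - (correct predictions / all predictions)
test.err <- 1 - sum(diag(tab)) / sum(tab)
test.err # 46 / 270, about 0.170
```

With a table object from table(), the same expression works unchanged, so the counts need not be typed in by hand.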
OJ.cv <- cv.tree(OJ.tree, FUN = prune.misclass)
plot(OJ.cv$size, OJ.cv$dev, type = "l")
The optimal-sized tree has 5 leaf-nodes.
See (g)
OJ.prune <- prune.misclass(OJ.tree, best = 5)
table(fitted = predict(OJ.prune, type = "class"), true = OJ$Purchase[train])
      true
fitted  CH  MM
    CH 441  86
    MM  44 229
table(fitted = predict(OJ.tree, type = "class"), true = OJ$Purchase[train])
      true
fitted  CH  MM
    CH 450  92
    MM  35 223
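Based on the two tables above, the pruned tree misclassifies 130 of 800 training observations (0.1625), slightly more than the unpruned tree at 127 of 800 (about 0.159). (Could we have anticipated this? Look at the statistics in OJ.cv.)
pred.prune <- predict(OJ.prune, newdata = OJ[!train, ], type = "class")
table(pred = pred.prune, true = OJ$Purchase[!train])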
    true
pred  CH  MM
  CH 160  36
  MM   8  66
pred.tree <- predict(OJ.tree, newdata = OJ[!train, ], type = "class")
table(pred = pred.tree, true = OJ$Purchase[!train])
    true
pred  CH  MM
  CH 160  38
  MM   8  64
The pruned tree has a slightly lower test error here: 44 of 270 (about 0.163) versus 46 of 270 (about 0.170) for the unpruned tree.
library(ISLR2)
myHitters <- Hitters[!is.na(Hitters$Salary), ]
myHitters$Salary <- log(myHitters$Salary)
train <- c(rep(TRUE, 200), rep(FALSE, nrow(myHitters) - 200))
library(gbm)
Loaded gbm 2.2.2
This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
set.seed(1)
lambda <- 10^seq(-5, -0.2, by = 0.05)
trainMSE <- rep(Inf, length(lambda))
testMSE <- rep(Inf, length(lambda))
for (i in 1:length(lambda)) {
  fit <- gbm(Salary ~ .,
    distribution = "gaussian",
    data = myHitters[train, ],
    n.trees = 1000, shrinkage = lambda[i]
  )
  pred.train <- predict(fit, n.trees = 1000)
  pred.test <- predict(fit, newdata = myHitters[!train, ], n.trees = 1000)
  trainMSE[i] <- mean((myHitters$Salary[train] - pred.train)^2)
  testMSE[i] <- mean((myHitters$Salary[!train] - pred.test)^2)
}
plot(lambda, trainMSE, type = "l")
plot(lambda, testMSE, type = "l")
See (c)
library(MASS)
Attaching package: 'MASS'
The following object is masked from 'package:ISLR2': Boston
myHitters.lm <- stepAIC(lm(Salary ~ ., data = myHitters, subset = train),
  direction = "both"
)
Start:  AIC=-187.65
Salary ~ AtBat + Hits + HmRun + Runs + RBI + Walks + Years + CAtBat + CHits + CHmRun + CRuns + CRBI + CWalks + League + Division + PutOuts + Assists + Errors + NewLeague
Step  Change        AIC
   1  - CHmRun      -189.64
   2  - NewLeague   -191.60
   3  - Runs        -193.34
   4  - RBI         -195.14
   5  - HmRun       -196.81
   6  - CAtBat      -198.40
   7  - Errors      -199.35
   8  - League      -199.46
   9  - CRBI        -199.52
  10  - CHits       -199.98
  11  + CHmRun      -200.88
Final step:  AIC=-200.88 (RSS 65.621)
Salary ~ AtBat + Hits + Walks + Years + CRuns + CWalks + Division + PutOuts + Assists + CHmRun
set.seed(1)
require(glmnet)
Loading required package: glmnet
Loading required package: Matrix
Loaded glmnet 4.1-8
myHitters.cv <- cv.glmnet(model.matrix(Salary ~ ., data = myHitters[train, ]),
  myHitters$Salary[train],
  alpha = 1
)
myHitters.lasso <- glmnet(model.matrix(Salary ~ ., data = myHitters[train, ]),
  myHitters$Salary[train],
  alpha = 1
)
# No s is supplied, so predictions are returned for the whole lambda path
# and the lasso MSE below averages over all of those fits
myHitters.pred.lasso <- predict(myHitters.lasso,
  type = "response",
  newx = model.matrix(Salary ~ ., data = myHitters[!train, ])
)
myHitters.pred.lm <- predict(myHitters.lm, newdata = myHitters[!train, ])
mean((myHitters.pred.lasso - myHitters$Salary[!train])^2)
[1] 0.4755605
mean((myHitters.pred.lm - myHitters$Salary[!train])^2)
[1] 0.4931775
The test MSEs of the linear model and the lasso model appear to be higher than the test MSE of the boosted model, except at larger values of \lambda.
summary(fit)
CAtBat and PutOuts appear to be the most important predictors.
library(randomForest)
set.seed(1)
myHitters.bag <- randomForest(Salary ~ .,
  data = myHitters, subset = train,
  mtry = (ncol(myHitters) - 1), importance = TRUE
)
pred <- predict(myHitters.bag, newdata = myHitters[!train, ])
mean((myHitters$Salary[!train] - pred)^2)
[1] 0.2301184
The test MSE is about 0.230, which is lower than the minimum test MSE from the boosted model.