Stratify / segment the predictor space into a number of simple regions.
The set of splitting rules can be summarised in a tree.
Bagging, random forests, boosting
Examples of what we call “ensemble methods”.
Produce multiple trees.
Improve the prediction accuracy of tree-based methods.
Lose some interpretability.
Decision Trees
Lecture Outline
Decision Trees
Growing a Tree
National Flood Insurance Program Demo
Pruning a Tree
Bootstrap Aggregation
Random Forests
Boosting
Trees in a nutshell
Decision trees are a simple, easy to interpret, and popular method for both regression and classification tasks. They can be used to make predictions, but also simply as “data exploration” to understand a data set better.
They make predictions by partitioning the predictor space into a number of simple regions, and making a constant prediction within each region. The set of splitting rules can be summarised in a tree. “Stand alone” tree models are rarely particularly accurate, but they form the basis for more accurate (and complex) methods like random forests and boosting.
ggplot(Hitters, aes(x = Years, y = Hits, colour = log(Salary))) +
  geom_point() +
  geom_vline(xintercept = 4.5, colour = "black", linetype = "dashed") +
  annotate("segment", x = 4.5, xend = 24, y = 117.5, yend = 117.5,
           colour = "black", linetype = "dashed") +
  annotate("text", x = 2, y = 200, label = "R[1]", parse = TRUE, size = 10) +
  annotate("text", x = 20, y = 50, label = "R[2]", parse = TRUE, size = 10) +
  annotate("text", x = 20, y = 200, label = "R[3]", parse = TRUE, size = 10) +
  theme(text = element_text(size = 20))
Tree regions & predictions
A decision tree is made by:
Dividing the predictor space (i.e. the set of possible values for X_1, X_2, \dots, X_p) into J distinct and non-overlapping regions, R_1, R_2, \dots, R_J,
Making the same prediction for every observation that falls into the region R_j
the mean response for the training data in R_j (regression trees)
the mode response for the training data in R_j (classification trees)
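A minimal sketch of these two steps in R, assuming the Hitters data (used elsewhere in these slides) comes from the ISLR2 package, together with rpart and rpart.plot:

library(ISLR2)       # assumed source of the Hitters data
library(rpart)
library(rpart.plot)

hitters <- na.omit(Hitters)                        # drop players with missing Salary
fit <- rpart(log(Salary) ~ Years + Hits, data = hitters)
rpart.plot(fit)                                    # each leaf is one region R_j

# The prediction is the mean training response in the leaf the point falls into
predict(fit, newdata = data.frame(Years = 3, Hits = 100))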
Random Forests for Wildfire Insurance Applications: Mélina Mailhot, Concordia University
Homeowners’ insurance in wildfire-prone areas can be a very risky business that some insurers may not be willing to undertake. We create an actuarial spatial model for the likelihood of wildfire occurrence over a fine grid map of North America. Several models are used, such as generalized linear models and tree-based machine learning algorithms. A detailed analysis and comparison of the models show a best fit using random forests. Sensitivity tests help in assessing the effect of future changes in the covariates of the model. A downscaling exercise is performed, focusing on some high-risk states and provinces. The model provides the foundation for actuaries to price, reserve, and manage the financial risk from severe wildfires.
A Machine Learning Approach to Forecasting Italian Honey Production with Tree-Based Methods: Elia Smaniotto, University of Florence
The Italian apiculture sector, one of the largest honey producers in Europe, has suffered considerable damage in recent years. Adverse weather conditions, occurring more frequently as climate change progresses, can be high-impact and cause the environment to be unfavourable to the bees’ activity [1]. In this paper, we aim to study the effect of climatic and meteorological events on honey production. The database covers several hives, mainly located in northern Italy, and contains temperature, precipitations, geographical and meteorological measurements. We adopt random forest and gradient boosting algorithms, powerful and flexible tree-based methods to predict the honey production variation. Then, a feature importance analysis is performed to discover the main driver of honey production within the covered area. This study, which lies within the existing literature [2,3], seeks to establish the links between weather conditions and honey production, aiming to protect bees’ activity better and assess potential losses for beekeepers.
Improving Business Insurance Loss Models by Leveraging InsurTech Innovation, Emiliano Valdez, University of Connecticut
Recent transformative and disruptive developments in the insurance industry embrace various InsurTech innovations. In particular, with the rapid advances in data science and computational infrastructure, InsurTech is able to incorporate multiple emerging sources of data and reveal implications for value creation on business insurance by enhancing current insurance operations. In this paper, we unprecedentedly combine real-life proprietary insurance claims information and its InsurTech empowered risk factors describing insured businesses to create enhanced tree-based loss models. An empirical study in this paper shows that the supplemental data sources created by InsurTech innovation significantly help improve the underlying insurance company’s internal or in-house pricing models. The results of our work demonstrate how InsurTech proliferates firm-level value creation and how it can affect insurance product development, pricing, underwriting, claim management, and administration practice.
On the Pricing of Capped Volatility Swaps using Machine Learning Techniques, Eva Verschueren, KU Leuven
A capped volatility swap is a forward contract on an asset’s capped, annualized realized volatility, over a predetermined period of time. The volatility swap allows investors to get a pure exposure to the volatility of the underlying asset, making the product an interesting instrument for both hedging and speculative purposes. In this presentation, we develop data-driven machine learning techniques in the context of pricing capped volatility swaps. To this purpose, we construct unique data sets comprising both the delivery price of contracts at initiation and the daily observed prices of running contracts. In order to predict future realized volatility, we explore distributional information on the underlying asset, specifically by extracting information from the forward implied volatilities and market-implied moments of the asset. The pricing performance of tree-based machine learning models and a Gaussian process regression model is presented in a tailored validation setting.
Integrated Design for Index Insurance, Jinggong Zhang, Nanyang Technological University
Weather index insurance (WII) is a promising tool for agricultural risk mitigation, but its popularity is often hindered by challenges of product design, such as basis risk, weather index selection and product complexity issues. In this paper we develop machine learning methodologies to design the statistically optimal WII to address those critical concerns in the literature and practice. The idea from tree-based models is exploited to simultaneously achieve weather variable selection and payout function determination, leading to effective basis risk reduction. The proposed framework is applied to an empirical study where high-dimensional weather variables are adopted to hedge soybean production losses in Iowa. Our numerical results show that the designed insurance policies are potentially viable with much lower government subsidy, and therefore can enhance social welfare.
Bayesian CART for insurance pricing, Yaojun Zhang, University of Leeds
An insurance portfolio offers protection against a specified type of risk to a collection of policyholders with various risk profiles. Insurance companies use risk factors to group policyholders with similar risk profiles in tariff classes. Premiums are set to be equal for policyholders within the same tariff class which should reflect the inherent riskiness of each class. Tree-based methods, like the classification and regression tree (CART), have gained popularity as they can in some cases give good performance and be easily interpretable. In this talk, we discuss a Bayesian approach applied to CART models. The idea is to have the prior induce a posterior distribution that will guide the stochastic search using MCMC towards more promising trees. We shall introduce different Bayesian CART models for the insurance claims data, which include the frequency-severity model and the (zero-inflated) compound Poisson model. Some simulation and real data examples will be discussed.
Machine Learning in Long-term Mortality Forecasting, Wenjun Zhu, Nanyang Technological University
We propose a new machine learning-based framework for long-term mortality forecasting. Based on ideas of neighbouring prediction, model ensembling, and tree boosting, this framework can significantly improve the prediction accuracy of long-term mortality. In addition, the proposed framework addresses the challenge of a shrinking pattern in long-term forecasting with information from neighbouring ages and cohorts. An extensive empirical analysis is conducted using various countries and regions in the Human Mortality Database. Results show that this framework reduces the mean absolute percentage error (MAPE) of the 20-year forecasting by almost 50% compared to classic stochastic mortality models, and it also outperforms deep learning-based benchmarks. Moreover, including mortality data from multiple populations can further enhance the long-term prediction performance of this framework.
Growing a Tree
Lecture Outline
Decision Trees
Growing a Tree
National Flood Insurance Program Demo
Pruning a Tree
Bootstrap Aggregation
Random Forests
Boosting
Fitting a regression tree
Divide the predictor space into high-dimensional rectangles, or boxes.
The goal is to find boxes R_1, R_2, \ldots, R_J that minimise
\mathrm{RSS} = \sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2
where \hat{y}_{R_j} is the mean response for the training observations within the jth box.
It is computationally infeasible to consider every possible partition.
So, we take a stepwise, greedy approach: at each step, we pick the split that most reduces the error “right now”.
As with forward stepwise selection in linear regression, nothing guarantees this yields the overall “optimal” set of splits.
Illustration with a synthetic regression dataset
Growing a regression tree I
Growing a regression tree II
Growing a regression tree III
Growing a regression tree IV
Recursive binary splitting: Overview
Start with the root node, and make new splits “greedily”, one at a time.
When choosing what the new split should be:
consider all of the predictor variables
for each one, there is an optimal split point s (which, if chosen, maximises the reduction in error)
computationally, that s can be determined very quickly
pick the overall best split (i.e., pick the predictor j whose optimal split point s results in the overall largest reduction in error)
Two new “regions” (aka “leaves”) are hence created.
Repeat this splitting process (always turning one leaf into two), until a stopping criterion is reached (e.g., each leaf contains \leq 5 observations).
Finding the first “best split”
Consider a splitting variable j and split point s
R_1(j, s) = \{X | X_j \leq s\} \quad \text{and} \quad R_2(j, s) = \{X|X_j>s\}
Find the splitting variable j and split point s that solve
\min_{j,\ s}\Big[ \min_{c_1} \sum_{x_i \in R_1(j,\ s)}(y_i-c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,\ s)}(y_i-c_2)^2 \Big]
where the inner mins are solved by
\hat{c}_1 = \mathrm{Ave}(y_i | x_i \in R_1(j, s)) \quad \text{and} \quad \hat{c}_2 = \mathrm{Ave}(y_i | x_i \in R_2(j, s))
After this first split, “we repeat the process, looking for the best predictor and best cutpoint in order to split the data further so as to minimize the RSS within each of the resulting regions. However, this time, instead of splitting the entire predictor space, we split one of the two previously identified regions.” (James et al., 2021)
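A minimal sketch of this first split search in R, by brute force over all predictors and candidate split points (the function and object names here are illustrative):

# Find the split (j, s) minimising the two-region RSS;
# X is a data frame of predictors, y the numeric response.
best_split <- function(X, y) {
  best <- list(rss = Inf, j = NA, s = NA)
  for (j in seq_len(ncol(X))) {
    for (s in sort(unique(X[[j]]))) {
      left  <- y[X[[j]] <= s]
      right <- y[X[[j]] >  s]
      if (length(left) == 0 || length(right) == 0) next
      rss <- sum((left - mean(left))^2) +      # inner min: c1_hat = mean of left region
        sum((right - mean(right))^2)           # inner min: c2_hat = mean of right region
      if (rss < best$rss) best <- list(rss = rss, j = names(X)[j], s = s)
    }
  }
  best
}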
2023 exam question
What would be the tree’s predicted value for y at x = 0?
Classification trees
Very similar to a regression tree, except:
Predict that each observation belongs to the most commonly occurring class (mode) of training observations in the region to which it belongs.
RSS cannot be used as a criterion for making the binary splits; instead we use a measure of node purity (small values are “good”). For a given region R_m with class proportions \hat{p}_{mk}, compute either:
the Gini impurity G_m = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk}), or
the cross-entropy D_m = -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}.
So, should you use Gini impurity or entropy? The truth is, most of the time it does not make a big difference: they lead to similar trees. Gini impurity is slightly faster to compute, so it is a good default. However, when they differ, Gini impurity tends to isolate the most frequent class in its own branch of the tree, while entropy tends to produce slightly more balanced trees.
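As a small sketch, both measures can be computed from the class proportions in a node (the labels below are purely illustrative):

# Node purity measures for one region, given the class labels of its training observations
node_purity <- function(classes) {
  p <- prop.table(table(classes))     # class proportions p_mk (only observed classes, so p > 0)
  c(gini    = sum(p * (1 - p)),       # Gini impurity
    entropy = -sum(p * log(p)))       # cross-entropy
}
node_purity(c("claim", "claim", "claim", "no claim"))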
tree <- rpart(amountPaidOnBuildingClaim ~ ., data = claims[1:1000, ])
rpart.plot(tree)
Plot claims by year
Code
claims %>%
  group_by(lossYear) %>%
  summarise(n = n()) %>%
  ggplot(aes(x = lossYear, y = n)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(x = "Year", y = "Number of claims")
Plot average claim size by year
Code
claims %>%
  group_by(lossYear) %>%
  summarise(mean_claim = mean(amountPaidOnBuildingClaim)) %>%
  ggplot(aes(x = lossYear, y = mean_claim)) +
  geom_line() +
  theme_minimal() +
  labs(x = "Year", y = "Average claim size")
Number of claims by state
Prepare to make state-based maps for USA
claims$state_full <- state.name[match(claims$state, state.abb)]
state_claims <- claims %>%
  group_by(state_full) %>%
  summarise(
    num_claims = n(),
    max_claim_size = max(amountPaidOnBuildingClaim),
    common = num_claims >= nrow(claims) / 100
  )
claims$state_full <- NULL

# Merge with the map data
states_map <- map_data("state")
state_claims$region <- tolower(state_claims$state_full)
states_map <- left_join(states_map, state_claims, by = "region")
Code
ggplot(states_map, aes(long, lat, group = group, fill = num_claims)) +
  geom_polygon(color = "white") +
  scale_fill_viridis_c(option = "C") +
  labs(title = "Number of Claims by State", fill = "Number of Claims") +
  theme_minimal() +
  theme(axis.title = element_blank(), axis.text = element_blank(),
        axis.ticks = element_blank())
Max claim size by state
Code
# Plot maximum claim size by state
ggplot(states_map, aes(long, lat, group = group, fill = max_claim_size)) +
  geom_polygon(color = "white") +
  scale_fill_viridis_c(option = "C") +
  labs(title = "Maximum Claim Size by State", fill = "Max Claim Size") +
  theme_minimal() +
  theme(axis.title = element_blank(), axis.text = element_blank(),
        axis.ticks = element_blank())
Some states have very few claims
Code
# Plot states where floods are common
ggplot(states_map, aes(long, lat, group = group, fill = common)) +
  geom_polygon(color = "white") +
  scale_fill_viridis_d() +
  labs(title = "States where flood claims are frequent",
       fill = "Number of Claims >= 1%") +
  theme_minimal() +
  theme(axis.title = element_blank(), axis.text = element_blank(),
        axis.ticks = element_blank())
Geographical distribution of perils
Friedman Exhibit 1 (p. 10).
Hot spots
Friedman Exhibit 13 (p. 46).
Reduce the number of levels
table(claims$state)
AK AL AR AZ CA CO CT DC DE FL GA GU HI
27 2019 410 139 1408 191 819 10 303 11833 1035 6 152
IA ID IL IN KS KY LA MA MD ME MI MN MO
504 43 1605 662 241 1012 20785 885 696 115 402 368 1752
MS MT NC ND NE NH NJ NM NV NY OH OK OR
2753 53 5023 513 189 142 8520 48 81 5899 790 490 237
PA PR RI SC SD TN TX UN UT VA VI VT WA
2363 569 238 2023 136 809 17759 10 11 2006 83 108 522
WI WV WY
313 874 16
length(unique(claims$state))
[1] 55
# States with fewer than 1% of claims
rare_flood_states <- names(which(table(claims[["state"]]) < nrow(claims) / 100))
claims$state <- ifelse(claims$state %in% rare_flood_states, "Other", claims$state)
table(claims$state)
AL CA FL GA IL KY LA MO MS NC NJ NY Other
2019 1408 11833 1035 1605 1012 20785 1752 2753 5023 8520 5899 12205
PA SC TX VA
2363 2023 17759 2006
length(unique(claims$state))
[1] 17
New tree
tree <- rpart(amountPaidOnBuildingClaim ~ ., data = claims[1:5000, ])
rpart.plot(tree)
More data
tree <- rpart(amountPaidOnBuildingClaim ~ ., data = claims[1:50000, ])
rpart.plot(tree)
Pruning a Tree
Lecture Outline
Decision Trees
Growing a Tree
National Flood Insurance Program Demo
Pruning a Tree
Bootstrap Aggregation
Random Forests
Boosting
What’s the best size of tree?
The smallest tree is just a root node (no splits).
The upper limit is to grow the tree until each region contains a single observation.
“In order to reduce the size of the tree and hence to prevent overfitting, these stopping criteria that are inherent to the recursive partitioning procedure are complemented with several rules. Three stopping rules that are commonly used can be formulated as follows:
A node t is declared terminal when it contains less than a fixed number of observations.
A node t is declared terminal if at least one of its children nodes t_L and t_R that results from the optimal split s_t contains less than a fixed number of observations.
A node t is declared terminal when its depth is equal to a fixed maximal depth.”
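These stopping rules correspond to control parameters in rpart; a minimal sketch (the values shown are illustrative defaults, not the demo’s settings, and assume the Hitters data is loaded):

library(rpart)
ctrl <- rpart.control(
  minsplit  = 20,   # do not attempt a split in a node with fewer observations than this
  minbucket = 7,    # no split may create a leaf with fewer observations than this
  maxdepth  = 5     # no node may be deeper than this fixed maximal depth
)
tree <- rpart(log(Salary) ~ Years + Hits, data = Hitters, control = ctrl)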
Pruning motivation
“While the stopping rules presented above may give good results in practice, the strategy of stopping early the growing of the tree is in general unsatisfactory… That is why it is preferable to prune the tree instead of stopping the growing of the tree. Pruning a tree consists in fully developing the tree and then prune it upward until the optimal tree is found.”
A stopping rule based on the decrease in RSS at each split (compared with a threshold) is too short-sighted: a seemingly worthless split early on may be followed by a very good split later.
The alternative approach of growing a large tree and then pruning it back to obtain a subtree is a better strategy.
Cross-validating every possible subtree would, however, be very cumbersome.
An alternative approach is cost-complexity pruning (also known as weakest link pruning).
Cost-Complexity Pruning
Define a subtree T \subset T_0 to be any tree that can be obtained by pruning T_0 (a fully-grown tree)
The mth terminal node (aka “leaf”, aka “region”) is denoted R_m
|T|: number of terminal nodes in T
Define the cost complexity criterion
\text{Total cost} = \text{Measure of Fit} + \text{Measure of Complexity}
C_\alpha(T) = \sum_{m=1}^{|T|} \sum_{i \in R_m} (y_i - \hat{y}_m)^2 + \alpha|T|

where \hat{y}_m is the mean of the y_i in the mth leaf and \alpha controls the tradeoff between tree size and goodness of fit.
Cost-Complexity Pruning
For each specific \alpha, we must find the subtree T_\alpha \subset T_0 that minimises C_\alpha(T).
It turns out that it is not too computationally expensive to find T_{\alpha} for a sequence of increasing \alpha’s, because “branches get pruned from the tree in a nested and predictable fashion” (James et al., 2021).
This is called “cost-complexity pruning” or “weakest link pruning”.
But how do we choose \alpha (and hence the final “optimally pruned” tree)?
cross-validation!
Pruning a Tree: Algorithm Summary
Use recursive binary splitting to grow a large tree on the training data
stop only when each terminal node has fewer than some minimum number of observations
Apply cost complexity pruning to the large tree to obtain a sequence of best subtrees, as a function of \alpha
there is a unique subtree T_\alpha that minimises C_\alpha(T)
Use K-fold cross-validation to choose \alpha
Return the subtree from Step 2 that corresponds to the chosen value of \alpha
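In rpart this procedure is largely built in: growing a tree also records cross-validated errors over a nested sequence of cp values (cp playing the role of \alpha, rescaled), and prune() returns the corresponding subtree. A minimal sketch on the Hitters data:

set.seed(123)
big_tree <- rpart(log(Salary) ~ Years + Hits, data = Hitters,
                  control = rpart.control(cp = 0))   # Step 1: grow a large tree
printcp(big_tree)                                    # Steps 2 & 3: CV error for each cp / subtree
best_cp <- big_tree$cptable[which.min(big_tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(big_tree, cp = best_cp)             # Step 4: subtree with smallest CV error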
CV to prune Hitters tree (only 2 predictors)
For our Hitters example, a tree with 3 or 4 leaves (equivalently, 2 and 3 splits) is probably best.
set.seed(123)
tree <- rpart(log(Salary) ~ Years + Hits, data = Hitters)
tree$cptable
The unpruned tree that results from top-down greedy splitting on the training data.
CV to pick \alpha (equiv., |T|)
The training, cross-validation, and test MSE are shown as a function of the number of terminal nodes in the pruned tree. Standard error bands are displayed. The minimum cross-validation error occurs at a tree of size three.
CV to prune NFIP tree
Cross-validation to prune the large NFIP tree
# Perform cross-validation to prune the tree
set.seed(123)
cv_tree <- train(
  amountPaidOnBuildingClaim ~ ., data = train_set, method = "rpart",
  trControl = trainControl(method = "cv", number = 5),
  tuneGrid = data.frame(cp = seq(0, 0.01, 0.001))
)

# Get the optimal cp value
optimal_cp <- cv_tree$bestTune$cp
plot(cv_tree)
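The tree chosen by this cross-validation can then be evaluated directly (a sketch; cv_tree, train_set and val_set are the objects from this demo):

pruned_tree <- cv_tree$finalModel            # rpart tree refit by caret at the optimal cp
val_pred <- predict(cv_tree, newdata = val_set)
sqrt(mean((val_pred - val_set$amountPaidOnBuildingClaim)^2))   # validation RMSE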
68% of the rows in the original dataset appear in this bootstrap resample.
Bootstrap Aggregation (Bagging)
A general-purpose procedure to reduce the variance of predictions; but particularly useful (and frequently used) in the context of decision trees.
Bagging procedure for trees:
Bootstrap: re-sample (with replacement) the original data set repeatedly, hence obtain B different bootstrapped training data sets
Train: train a tree on the bth bootstrapped training set (no need to prune), hence obtain the model \hat{f}^{\ast b}(x)
Aggregate: for regression, take the average prediction, at any point x: \hat{f}_\text{bag}(x) = \dfrac{1}{B}\sum_{b=1}^{B} \hat{f}^{\ast b}(x) (same idea for classification, but take a “majority vote”, i.e., for any given x, choose the most common category predicted for that x by the B models)
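A minimal sketch of this procedure for regression trees (the function names are illustrative; the claims data frame is the one from the NFIP demo):

# Bootstrap + Train: B unpruned trees, each fitted to a bootstrap resample of the data
bagged_trees <- function(data, formula, B = 100) {
  lapply(seq_len(B), function(b) {
    boot <- data[sample(nrow(data), replace = TRUE), ]
    rpart(formula, data = boot, control = rpart.control(cp = 0))
  })
}

# Aggregate: average the B predictions at each new point x
predict_bagged <- function(trees, newdata) {
  Reduce(`+`, lapply(trees, predict, newdata = newdata)) / length(trees)
}

# e.g. fits <- bagged_trees(claims[1:5000, ], amountPaidOnBuildingClaim ~ ., B = 100)
#      yhat <- predict_bagged(fits, newdata = claims[5001:5100, ])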
Bagging: Illustration
Samples that are in the bag
Let’s say element i,j of the matrix is 1 if the ith observation is in the jth bootstrap sample (we say it is “in the bag”) and 0 otherwise.
Samples that are out of bag
Now consider the complement: element i,j of the matrix is 1 if the ith observation is not in the jth bootstrap sample, i.e. it is “out of the bag”.
This provides a simple way to estimate the test error of the bagged model: “out-of-bag error” (cheaper than cross-validation).
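A small sketch of these indicator matrices for n observations and B resamples; each column of in_bag has, on average, about 63% ones, matching the “two-thirds” rule of thumb on the next slide:

set.seed(1)
n <- 1000; B <- 50
in_bag <- sapply(seq_len(B), function(b) {
  drawn <- sample(n, replace = TRUE)     # row indices drawn for bootstrap sample b
  as.integer(seq_len(n) %in% drawn)      # 1 if observation i appears at least once
})
out_of_bag <- 1 - in_bag
mean(in_bag)                             # roughly 1 - (1 - 1/n)^n, i.e. about 0.63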
Out-of-bag error estimation
There is a very straightforward way to estimate the test error of a bagged model
On average, each bagged tree makes use of around two-thirds of the observations.
The remaining observations (one-third, on average) are referred to as the out-of-bag (OOB) observations.
Predict the response for the ith observation using each of the trees for which that observation was OOB
\sim B/3 predictions for the ith observation
Take the average (regression) or majority (classification) prediction to obtain the final OOB prediction for the ith observation.
“with B sufficiently large, OOB error is virtually equivalent to leave-one-out cross-validation error” (James et al., 2021).
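With the randomForest package this comes essentially for free: calling predict() on a fitted forest without newdata returns the OOB predictions. A sketch using the demo’s train_set (set mtry to the number of predictors if you want pure bagging rather than a random forest):

library(randomForest)
rf <- randomForest(amountPaidOnBuildingClaim ~ ., data = train_set, ntree = 50)
oob_pred <- predict(rf)                  # no newdata: OOB prediction for each training row
sqrt(mean((oob_pred - train_set$amountPaidOnBuildingClaim)^2))   # OOB RMSE
rf$mse[rf$ntree]                         # OOB MSE after all ntree trees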
Bagging: variable selection
Bagging can lead to difficult-to-interpret results, as we don’t get a single tree (but a large number of them, each of which may use different predictors and have different leaves).
Still, a predictor’s overall “importance” can be measured
Bagging regression trees: “we can record the total amount that the RSS (8.1) is decreased due to splits over a given predictor, averaged over all B trees” (James et al., 2021).
Bagging classification trees: “we can add up the total amount that the Gini index (8.6) is decreased by splits over a given predictor, averaged over all B trees” (James et al., 2021).
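With the randomForest package (which reproduces bagging when mtry equals the number of predictors), these importance measures are returned by importance() and plotted by varImpPlot(). A sketch, using the rf_model fitted with importance = TRUE later in these slides:

importance(rf_model)    # %IncMSE (permutation importance) and IncNodePurity (total RSS decrease)
varImpPlot(rf_model)    # dot charts of both importance measures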
Random Forests
Lecture Outline
Decision Trees
Growing a Tree
National Flood Insurance Program Demo
Pruning a Tree
Bootstrap Aggregation
Random Forests
Boosting
Random Forests
This is a forest (stock photo)
Random Forests in a nutshell
An issue with bagging is that it tends to produce very correlated trees.
Random forests are an alternative that produces “less correlated” trees.
In random forests, we build trees under an important restriction: whenever a new “split” is created, it is only allowed to use one out of m (randomly selected) predictors.
For each new split, a new random set of m predictors (out of the total p) is considered.
An effect this has is that “strong” predictors are used in (far) fewer models, and hence the other predictors have the opportunity to have their effects captured by more trees (the smaller guys also get their chance to shine!).
Random forests tend to reduce the variance of the final predictions (at the cost of introducing some bias).
Typically, we choose m \approx \sqrt{p}.
Bagging can be said to be a special case of random forests, with m=p.
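In randomForest this restriction is the mtry argument; a sketch on the demo’s train_set, contrasting m \approx \sqrt{p} with the bagging special case m = p:

library(randomForest)
p <- ncol(train_set) - 1                 # number of predictors (assuming one response column)
rf  <- randomForest(amountPaidOnBuildingClaim ~ ., data = train_set,
                    mtry = floor(sqrt(p)), ntree = 50)   # random forest: m ~ sqrt(p)
bag <- randomForest(amountPaidOnBuildingClaim ~ ., data = train_set,
                    mtry = p, ntree = 50)                # bagging: m = p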
Fitting with package randomForest
rf_model <- randomForest(amountPaidOnBuildingClaim ~ ., data = train_set,
                         ntree = 50, importance = TRUE)
Code
# Calculate validation set RMSE
val_pred <- predict(rf_model, newdata = val_set)
val_rmse_rf <- sqrt(mean((val_pred - val_set$amountPaidOnBuildingClaim)^2))