ggplot(Hitters, aes(x = Years, y = Hits, colour = log(Salary))) +
  geom_point() +
  geom_vline(xintercept = 4.5, colour = "black", linetype = "dashed") +
  annotate("segment", x = 4.5, xend = 24, y = 117.5, yend = 117.5,
           colour = "black", linetype = "dashed") +
  annotate("text", x = 2, y = 200, label = "R[1]", parse = TRUE, size = 10) +
  annotate("text", x = 20, y = 50, label = "R[2]", parse = TRUE, size = 10) +
  annotate("text", x = 20, y = 200, label = "R[3]", parse = TRUE, size = 10) +
  theme(text = element_text(size = 20))
Tree regions & predictions
A decision tree is made by:
Dividing the predictor space (i.e. the set of possible values for X_1, X_2, \dots, X_p) into J distinct and non-overlapping regions, R_1, R_2, \dots, R_J,
Making the same prediction for every observation that falls into the region R_j
the mean response for the training data in R_j (regression trees)
the mode response for the training data in R_j (classification trees)
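As a minimal illustration (the toy data and region labels below are hypothetical, not from the lecture), the predictions are simply per-region means or modes:

# Minimal sketch with a toy dataset and hypothetical region labels
library(dplyr)

train <- data.frame(
  region = c("R1", "R1", "R2", "R2", "R3", "R3"),
  y      = c(10, 14, 3, 5, 8, 8),           # numeric response (regression)
  class  = c("A", "A", "B", "A", "B", "B")  # categorical response (classification)
)

# Regression tree: predict the mean response within each region
train %>%
  group_by(region) %>%
  summarise(prediction = mean(y))

# Classification tree: predict the most common class within each region
train %>%
  group_by(region) %>%
  summarise(prediction = names(which.max(table(class))))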
Random Forests for Wildfire Insurance Applications: Mélina Mailhot, Concordia University
Homeowners’ insurance in wildfire-prone areas can be a very risky business that some insurers may not be willing to undertake. We create an actuarial spatial model for the likelihood of wildfire occurrence over a fine grid map of North America. Several models are used, such as generalized linear models and tree-based machine learning algorithms. A detailed analysis and comparison of the models show a best fit using random forests. Sensitivity tests help in assessing the effect of future changes in the covariates of the model. A downscaling exercise is performed, focusing on some high-risk states and provinces. The model provides the foundation for actuaries to price, reserve, and manage the financial risk from severe wildfires.
A Machine Learning Approach to Forecasting Italian Honey Production with Tree-Based Methods: Elia Smaniotto, University of Florence
The Italian apiculture sector, one of the largest honey producers in Europe, has suffered considerable damage in recent years. Adverse weather conditions, occurring more frequently as climate change progresses, can be high-impact and cause the environment to be unfavourable to the bees’ activity [1]. In this paper, we aim to study the effect of climatic and meteorological events on honey production. The database covers several hives, mainly located in northern Italy, and contains temperature, precipitation, geographical and meteorological measurements. We adopt random forest and gradient boosting algorithms, powerful and flexible tree-based methods, to predict the variation in honey production. Then, a feature importance analysis is performed to discover the main drivers of honey production within the covered area. This study, which lies within the existing literature [2,3], seeks to establish the links between weather conditions and honey production, aiming to better protect bees’ activity and assess potential losses for beekeepers.
Improving Business Insurance Loss Models by Leveraging InsurTech Innovation, Emiliano Valdez, University of Connecticut
Recent transformative and disruptive developments in the insurance industry embrace various InsurTech innovations. In particular, with the rapid advances in data science and computational infrastructure, InsurTech is able to incorporate multiple emerging sources of data and reveal implications for value creation on business insurance by enhancing current insurance operations. In this paper, we combine, for the first time, real-life proprietary insurance claims information with InsurTech-empowered risk factors describing the insured businesses to create enhanced tree-based loss models. An empirical study in this paper shows that the supplemental data sources created by InsurTech innovation significantly help improve the underlying insurance company’s internal or in-house pricing models. The results of our work demonstrate how InsurTech proliferates firm-level value creation and how it can affect insurance product development, pricing, underwriting, claim management, and administration practice.
On the Pricing of Capped Volatility Swaps using Machine Learning Techniques, Eva Verschueren, KU Leuven
A capped volatility swap is a forward contract on an asset’s capped, annualized realized volatility, over a predetermined period of time. The volatility swap allows investors to get a pure exposure to the volatility of the underlying asset, making the product an interesting instrument for both hedging and speculative purposes. In this presentation, we develop data-driven machine learning techniques in the context of pricing capped volatility swaps. To this purpose, we construct unique data sets comprising both the delivery price of contracts at initiation and the daily observed prices of running contracts. In order to predict future realized volatility, we explore distributional information on the underlying asset, specifically by extracting information from the forward implied volatilities and market-implied moments of the asset. The pricing performance of tree-based machine learning models and a Gaussian process regression model is presented in a tailored validation setting.
Integrated Design for Index Insurance, Jinggong Zhang, Nanyang Technological University
Weather index insurance (WII) is a promising tool for agricultural risk mitigation, but its popularity is often hindered by challenges of product design, such as basis risk, weather index selection and product complexity issues. In this paper we develop machine learning methodologies to design the statistically optimal WII to address those critical concerns in the literature and practice. The idea from tree-based models is exploited to simultaneously achieve weather variable selection and payout function determination, leading to effective basis risk reduction. The proposed framework is applied to an empirical study where high-dimensional weather variables are adopted to hedge soybean production losses in Iowa. Our numerical results show that the designed insurance policies are potentially viable with much lower government subsidy, and therefore can enhance social welfare.
Bayesian CART for insurance pricing, Yaojun Zhang, University of Leeds
An insurance portfolio offers protection against a specified type of risk to a collection of policyholders with various risk profiles. Insurance companies use risk factors to group policyholders with similar risk profiles in tariff classes. Premiums are set to be equal for policyholders within the same tariff class which should reflect the inherent riskiness of each class. Tree-based methods, like the classification and regression tree (CART), have gained popularity as they can in some cases give good performance and be easily interpretable. In this talk, we discuss a Bayesian approach applied to CART models. The idea is to have the prior induce a posterior distribution that will guide the stochastic search using MCMC towards more promising trees. We shall introduce different Bayesian CART models for the insurance claims data, which include the frequency-severity model and the (zero-inflated) compound Poisson model. Some simulation and real data examples will be discussed.
Machine Learning in Long-term Mortality Forecasting, Wenjun Zhu, Nanyang Technological University
We propose a new machine learning-based framework for long-term mortality forecasting. Based on ideas of neighbouring prediction, model ensembling, and tree boosting, this framework can significantly improve the prediction accuracy of long-term mortality. In addition, the proposed framework addresses the challenge of a shrinking pattern in long-term forecasting with information from neighbouring ages and cohorts. An extensive empirical analysis is conducted using various countries and regions in the Human Mortality Database. Results show that this framework reduces the mean absolute percentage error (MAPE) of the 20-year forecasting by almost 50% compared to classic stochastic mortality models, and it also outperforms deep learning-based benchmarks. Moreover, including mortality data from multiple populations can further enhance the long-term prediction performance of this framework.
Growing a Tree
Lecture Outline
Decision Trees
Growing a Tree
National Flood Insurance Program Demo
Pruning a Tree
Bootstrap Aggregation
Random Forests
Boosting
Fitting a regression tree
Divide the predictor space into high-dimensional rectangles, or boxes
The goal is to find boxes R_1, R_2, \ldots, R_J that minimise
\mathrm{RSS} = \sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2
where \hat{y}_{R_j} is the mean response for the training observations within the jth box
Computationally infeasible to consider every possible partition
take a top-down, greedy approach…
Synthetic regression dataset
Growing a regression tree I
Growing a regression tree II
Growing a regression tree III
Growing a regression tree IV
Recursive binary splitting
Start with the root node, and make new splits greedily one at a time
Scan through all of the inputs
for each splitting variable, the split point s can be determined very quickly
The best split for this branch (i.e. the choice of j and s) then follows by comparing across all splitting variables.
Partition the data into the two resulting regions
Repeat the splitting process on each of the two regions
Continue the process until a stopping criterion is reached
Recursive binary splitting details
Consider a splitting variable j and split point s
R_1(j, s) = \{X | X_j \leq s\} \quad \text{and} \quad R_2(j, s) = \{X|X_j>s\}
Find the splitting variable j and split point s that solve
\min_{j,\ s}\Big[ \min_{c_1} \sum_{x_i \in R_1(j,\ s)}(y_i-c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,\ s)}(y_i-c_2)^2 \Big]
where the inner minimisations are solved by
\hat{c}_1 = \mathrm{Ave}(y_i | x_i \in R_1(j, s)) \quad \text{and} \quad \hat{c}_2 = \mathrm{Ave}(y_i | x_i \in R_2(j, s))
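As a sketch of the inner search for a single splitting variable, the hypothetical helper below (best_split is my own naming, not lecture code) scans the candidate split points and evaluates the bracketed criterion with c_1 and c_2 set to the region means:

# Sketch: find the best split point s for one splitting variable x
# by minimising RSS(left) + RSS(right), with region means as predictions
best_split <- function(x, y) {
  candidates <- sort(unique(x))
  candidates <- head(candidates, -1)  # splitting above the largest value is pointless
  rss <- sapply(candidates, function(s) {
    left  <- y[x <= s]
    right <- y[x > s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(s = candidates[which.min(rss)], rss = min(rss))
}

# Toy example; in practice, repeat over all predictors and keep the best (j, s)
set.seed(1)
x <- runif(20)
y <- ifelse(x < 0.5, 1, 5) + rnorm(20, sd = 0.3)
best_split(x, y)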
2023 exam question
What would be the tree’s predicted value for y at x = 0?
Classification trees
Very similar to a regression tree, except:
Predict that each observation belongs to the most commonly occurring class of training observations in the region to which it belongs
RSS cannot be used as a criterion for making the binary splits; instead, use a measure of node purity:
So, should you use Gini impurity or entropy? The truth is, most of the time it does not make a big difference: they lead to similar trees. Gini impurity is slightly faster to compute, so it is a good default. However, when they differ, Gini impurity tends to isolate the most frequent class in its own branch of the tree, while entropy tends to produce slightly more balanced trees.
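For reference, a small sketch of the two impurity measures for a single node’s class labels (the function names are illustrative):

# Node impurity measures for a vector of class labels
gini_impurity <- function(classes) {
  p <- table(classes) / length(classes)  # class proportions in the node
  sum(p * (1 - p))
}

entropy <- function(classes) {
  p <- table(classes) / length(classes)
  -sum(p * log2(p))
}

node <- c("A", "A", "A", "B", "B")
gini_impurity(node)  # 0.48
entropy(node)        # approximately 0.971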
tree <- rpart(amountPaidOnBuildingClaim ~ ., data = claims[1:1000, ])
rpart.plot(tree)
Plot claims by year
Code
claims %>%
  group_by(lossYear) %>%
  summarise(n = n()) %>%
  ggplot(aes(x = lossYear, y = n)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(x = "Year", y = "Number of claims")
Plot average claim size by year
Code
claims %>%
  group_by(lossYear) %>%
  summarise(mean_claim = mean(amountPaidOnBuildingClaim)) %>%
  ggplot(aes(x = lossYear, y = mean_claim)) +
  geom_line() +
  theme_minimal() +
  labs(x = "Year", y = "Average claim size")
Number of claims by state
Prepare to make state-based maps for USA
claims$state_full <- state.name[match(claims$state, state.abb)]

state_claims <- claims %>%
  group_by(state_full) %>%
  summarise(
    num_claims     = n(),
    max_claim_size = max(amountPaidOnBuildingClaim),
    common         = num_claims >= nrow(claims) / 100
  )
claims$state_full <- NULL

# Merge with the map data
states_map <- map_data("state")
state_claims$region <- tolower(state_claims$state_full)
states_map <- left_join(states_map, state_claims, by = "region")
Code
ggplot(states_map, aes(long, lat, group = group, fill = num_claims)) +
  geom_polygon(color = "white") +
  scale_fill_viridis_c(option = "C") +
  labs(title = "Number of Claims by State", fill = "Number of Claims") +
  theme_minimal() +
  theme(axis.title = element_blank(), axis.text = element_blank(),
        axis.ticks = element_blank())
Max claim size by state
Code
# Plot maximum claim size by state
ggplot(states_map, aes(long, lat, group = group, fill = max_claim_size)) +
  geom_polygon(color = "white") +
  scale_fill_viridis_c(option = "C") +
  labs(title = "Maximum Claim Size by State", fill = "Max Claim Size") +
  theme_minimal() +
  theme(axis.title = element_blank(), axis.text = element_blank(),
        axis.ticks = element_blank())
Some states have very few claims
Code
# Plot states where floods are common
ggplot(states_map, aes(long, lat, group = group, fill = common)) +
  geom_polygon(color = "white") +
  scale_fill_viridis_d() +
  labs(title = "States where flood claims are frequent",
       fill = "Number of Claims >= 1%") +
  theme_minimal() +
  theme(axis.title = element_blank(), axis.text = element_blank(),
        axis.ticks = element_blank())
Geographical distribution of perils
Friedman Exhibit 1 (p. 10).
Hot spots
Friedman Exhibit 13 (p. 46).
Reduce the number of levels
table(claims$state)
AK AL AR AZ CA CO CT DC DE FL GA GU HI
27 2019 410 139 1408 191 819 10 303 11833 1035 6 152
IA ID IL IN KS KY LA MA MD ME MI MN MO
504 43 1605 662 241 1012 20785 885 696 115 402 368 1752
MS MT NC ND NE NH NJ NM NV NY OH OK OR
2753 53 5023 513 189 142 8520 48 81 5899 790 490 237
PA PR RI SC SD TN TX UN UT VA VI VT WA
2363 569 238 2023 136 809 17759 10 11 2006 83 108 522
WI WV WY
313 874 16
length(unique(claims$state))
[1] 55
# States with fewer than 1% of claims
rare_flood_states <- names(which(table(claims[["state"]]) < nrow(claims) / 100))
claims$state <- ifelse(claims$state %in% rare_flood_states, "Other", claims$state)
table(claims$state)
AL CA FL GA IL KY LA MO MS NC NJ NY Other
2019 1408 11833 1035 1605 1012 20785 1752 2753 5023 8520 5899 12205
PA SC TX VA
2363 2023 17759 2006
length(unique(claims$state))
[1] 17
New tree
tree <- rpart(amountPaidOnBuildingClaim ~ ., data = claims[1:5000, ])
rpart.plot(tree)
More data
tree <- rpart(amountPaidOnBuildingClaim ~ ., data = claims[1:50000, ])
rpart.plot(tree)
Pruning a Tree
Lecture Outline
Decision Trees
Growing a Tree
National Flood Insurance Program Demo
Pruning a Tree
Bootstrap Aggregation
Random Forests
Boosting
What’s the best size of tree?
The smallest tree is just a root node (no splits).
The upper limit is to grow the tree until there is only one observation in each region.
“In order to reduce the size of the tree and hence to prevent overfitting, these stopping criteria that are inherent to the recursive partitioning procedure are complemented with several rules. Three stopping rules that are commonly used can be formulated as follows:
A node t is declared terminal when it contains less than a fixed number of observations.
A node t is declared terminal if at least one of its children nodes t_L and t_R that results from the optimal split s_t contains less than a fixed number of observations.
A node t is declared terminal when its depth is equal to a fixed maximal depth.”
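In rpart, these three stopping rules map directly onto the minsplit, minbucket and maxdepth arguments of rpart.control; a sketch with illustrative values:

# Sketch: the three stopping rules expressed as rpart.control arguments
library(rpart)

control <- rpart.control(
  minsplit  = 20,  # rule 1: don't attempt a split in a node with fewer than 20 observations
  minbucket = 7,   # rule 2: don't allow a split that creates a child with fewer than 7 observations
  maxdepth  = 5,   # rule 3: don't grow the tree deeper than 5 levels
  cp        = 0    # disable the complexity-based stopping so only the rules above apply
)

# Hypothetical usage with the NFIP data from the demo:
# tree <- rpart(amountPaidOnBuildingClaim ~ ., data = claims, control = control)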
Pruning motivation
“While the stopping rules presented above may give good results in practice, the strategy of stopping early the growing of the tree is in general unsatisfactory… That is why it is preferable to prune the tree instead of stopping the growing of the tree. Pruning a tree consists in fully developing the tree and then prune it upward until the optimal tree is found.”
A decision rule that only makes a split when the resulting decrease in RSS exceeds some threshold is too short-sighted: a seemingly worthless split early on may enable a very good split later.
A better strategy is to grow a large tree and then prune it back to obtain a subtree.
However, cross-validating every possible subtree would be very cumbersome.
An alternative approach is cost-complexity pruning (also known as weakest link pruning)
Cost-Complexity Pruning
Define a subtree T \subset T_0 to be any tree that can be obtained by pruning T_0 (a fully-grown tree)
Terminal node m represents region R_m
|T|: number of terminal nodes in T
Define the cost complexity criterion
\text{Total cost} = \text{Measure of Fit} + \text{Measure of Complexity}
C_\alpha(T) = \sum_{m=1}^{|T|} \sum_{i \in R_m} (y_i - \hat{y}_m)^2 + \alpha|T|
where \hat{y}_m is the mean y_i in the mth leaf and \alpha controls the tradeoff between tree size and goodness of fit.
Cost-Complexity Pruning
For each \alpha, we want to find the subtree T_\alpha \subseteq T_0 that minimises C_\alpha(T)
How to find T_\alpha?
“weakest link pruning”
successively collapse the internal node that produces the smallest per-node increase in training RSS, continuing until only the root remains; this yields a nested sequence of subtrees that contains T_\alpha
How to choose \alpha?
cross-validation
Tree Algorithm Summary
Use recursive binary splitting to grow a large tree on the training data
stop only when each terminal node has fewer than some minimum number of observations
Apply cost complexity pruning to the large tree to obtain a sequence of best subtrees, as a function of \alpha
there is a unique smallest subtree T_\alpha that minimises C_\alpha(T)
Use K-fold cross-validation to choose \alpha
Return the subtree from Step 2 that corresponds to the chosen value of \alpha
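A sketch of this workflow with rpart, whose built-in xval cross-validation tabulates the CV error against the complexity parameter cp (rpart’s rescaled version of \alpha); the formula and data follow the NFIP demo:

# Sketch of the grow-then-prune workflow using rpart's built-in cross-validation
library(rpart)

set.seed(1)
# 1. Grow a large tree (cp = 0 lets it grow until minsplit/minbucket stop it)
big_tree <- rpart(amountPaidOnBuildingClaim ~ ., data = claims[1:50000, ],
                  control = rpart.control(cp = 0, xval = 10))

# 2.-3. The cp table records the 10-fold CV error for each value of cp
printcp(big_tree)

# 4. Prune back to the cp with the smallest cross-validated error
best_cp <- big_tree$cptable[which.min(big_tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(big_tree, cp = best_cp)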
Unpruned Hitters tree
The unpruned tree that results from top-down greedy splitting on the training data.
CV to pick \alpha (equiv., |T|)
The training, cross-validation, and test MSE are shown as a function of the number of terminal nodes in the pruned tree. Standard error bands are displayed. The minimum cross-validation error occurs at a tree of size three.
CV to prune NFIP tree
Cross-validation to prune the large NFIP tree
# Perform cross-validation to prune the tree
set.seed(123)
cv_tree <- train(
  amountPaidOnBuildingClaim ~ .,
  data = train_set,
  method = "rpart",
  trControl = trainControl(method = "cv", number = 5),
  tuneGrid = data.frame(cp = seq(0, 0.01, 0.001))
)

# Get the optimal cp value
optimal_cp <- cv_tree$bestTune$cp
plot(cv_tree)
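A possible follow-up (not part of the original demo): refit the pruned tree on the training set using the selected cp.

# Refit the final tree on the training set with the cp chosen by cross-validation
final_tree <- rpart(amountPaidOnBuildingClaim ~ ., data = train_set,
                    control = rpart.control(cp = optimal_cp))
rpart.plot(final_tree)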