ACTL3142 & ACTL5110 Statistical Machine Learning for Risk Applications
Introduction to GLMs
Lecture Outline
Introduction to GLMs
The components of a GLM
Fit a GLM
Generalised linear models
The linear, logistic and Poisson regression models have common properties and can be summarised in a unified framework
The framework consists of a systematic and a stochastic part:
Systematic component: describes the mean structure
Stochastic component: describes the individual variation of the response around the mean
This class of models is called Generalised Linear Models (GLM)
The class of GLMs has played a key role in the development of statistical modelling and of associated software
The class of GLMs has numerous applications in Actuarial Science
Generalised linear models
|  | Linear Regression | Logistic Regression | Poisson Regression | Generalised Linear Models |
|---|---|---|---|---|
| Type of Data | Continuous | Binary (Categorical) | Count | Flexible |
| Use | Prediction of continuous variables | Classification | Prediction of the number of events | Flexible |
| Distribution of Y | Normal | Bernoulli (Binomial for multiple trials) | Poisson | Exponential Family |
| \mathbb{E}[Y \mid X] | X\beta | \frac{e^{X\beta}}{1+e^{X\beta}} | e^{X\beta} | g^{-1}(X\beta) |
| Link Function Name | Identity | Logit | Log | Depends on the choice of distribution |
| Link Function Expression | \eta(\mu) = \mu | \eta(\mu) = \log\left(\frac{\mu}{1-\mu}\right) | \eta(\mu) = \log(\mu) | Depends on the choice of distribution |
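Each column of the table above corresponds to a particular family (and link) choice in R's glm() function. A minimal sketch, assuming a hypothetical data frame dat with response y and a single predictor x:
fit_lin <- glm(y ~ x, family = gaussian(link = "identity"), data = dat)  # linear regression
fit_log <- glm(y ~ x, family = binomial(link = "logit"),    data = dat)  # logistic regression
fit_poi <- glm(y ~ x, family = poisson(link = "log"),       data = dat)  # Poisson regression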
Insurance Applications
Applications are numerous
Mortality Modelling
Rate making (Modelling Claims Frequency and severity)
Loss reserving
Models used are often multiplicative, hence linear on the log-scale.
Claim numbers are generally Poisson, or Poisson with over-dispersion. These distributions are not symmetric and their variance is proportional to the mean.
Claim amounts have right-skewed densities, shaped, for example, like a Gamma.
When to use a GLM?
Use GLMs when
variance not constant and/or
when errors not normal.
Cases when we might use GLMs include: when response variable is
count data expressed as proportions (e.g. logistic regression)
count data that are not proportions (e.g. log-linear models of counts)
binary response variable (e.g. dead or alive)
data on time to death where the variance increases faster than linearly with the mean (e.g. time data with gamma errors).
Many basic statistical methods (regression, t-tests) assume constant variance, but this assumption is often untenable; hence the value of GLMs.
Error structure
Many kinds of data have non-normal errors
errors that are strongly skewed
errors that are kurtotic
errors that are strictly bounded (as in proportions)
errors that cannot lead to negative fitted values (as in counts)
Error structure
GLM allows specification of a variety of different error distributions (see the R sketch after this list):
Poisson errors, useful with count data
binomial errors, useful with data on proportions
gamma errors, useful with data showing a constant coefficient of variation
exponential errors, useful with data on time to death (survival analysis)
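In R, each of these error structures corresponds to a choice of the family argument of glm(). A few illustrative mappings (a sketch, not an exhaustive list):
poisson(link = "log")      # count data
binomial(link = "logit")   # binary data and proportions
Gamma(link = "log")        # positive, right-skewed data with constant coefficient of variation
# exponential errors arise as the gamma special case with shape parameter 1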
The components of a GLM
Lecture Outline
Introduction to GLMs
The components of a GLM
Fit a GLM
The components of a GLM
A Generalised Linear Model (GLM) has three components:
A systematic component allowing for inclusion of covariates or explanatory variables (captures location).
A stochastic component specifying the error distribution (captures spread).
A parametric link function linking the stochastic and systematic components by associating a function of the mean to the covariates.
The Systematic Component
The systematic component is a linear predictor \eta, that is, a function (with linear coefficients) of the covariates, sometimes called explanatory variables.
Consider the following linear (in its coefficients) model:
\eta_{i} = \mathrm{x}_i \beta = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip},
where \mathrm{x}_i is the i’th row of X and there are p predictor variables (or covariates) affecting the response.
The mean \mu_i (location) of the response will depend on \eta_i in that
\begin{aligned}
\eta_i &= g(\mu_i) \\
\mu_i &= g^{-1}\bigl(\mathrm{x}_i\beta\bigr) = g^{-1}(\eta_i),
\end{aligned}
where g is called the link function (unsurprisingly!).
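For example, with a log link g(\mu) = \log\mu, covariate effects are multiplicative on the mean:
\mu_i = g^{-1}(\eta_i) = \exp\left(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}\right) = e^{\beta_0}\, e^{\beta_1 x_{i1}} \cdots e^{\beta_p x_{ip}}.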
The Stochastic Component
We are now interested in building spread around our linear model for the mean
We will use the exponential dispersion family.
We say Y comes from an exponential dispersion family if its density has the form
f_{Y}(y) = \exp \left[ \frac{y\theta -b\left( \theta \right) }{\psi }+c\left( y;\psi \right) \right].
Here \theta and \psi are location and scale parameters, respectively. Note that the book uses a different representation, but the two are equivalent.
\theta known as canonical or natural parameter of the distribution.
b\left( \theta \right) and c\left( y;\psi \right) are known functions and specify the distribution
Examples of Exponential Dispersion Families
Normal N\left( \mu ,\sigma ^{2}\right) with \theta = \mu and \psi = \sigma^2.
Gamma(\alpha, \beta) with \theta = -\beta/\alpha = -\frac{1}{\mu} and \psi = 1/\alpha .
Inverse Gaussian(\alpha, \beta) with \theta = -\frac{1}{2} \beta^2/\alpha^2 = -\frac{1}{2\mu^2} and \psi = \beta /\alpha^2.
Poisson(\mu) with \theta = \log \mu and \psi = 1.
Binomial(m, p) with \theta = \log \left[p/\left( 1-p\right) \right] = \log\left[ \mu/(m-\mu)\right] and \psi = 1.
Negative Binomial(r, p) with \theta = \log (1-p) = \log (\mu p/r) and \psi = 1.
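As a check of the Poisson entry above, its probability mass function can be put in exponential dispersion form directly:
f(y) = \frac{e^{-\mu}\mu^{y}}{y!} = \exp\left( y\log\mu - \mu - \log y! \right),
so that \theta = \log\mu, b(\theta) = e^{\theta} = \mu, \psi = 1 and c(y;\psi) = -\log y!.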
Example - Gamma
The gamma(\alpha,\beta) distributions belong to the exponential dispersion families. Its density is
\begin{aligned}
f(y) &= \frac{\beta^\alpha y^{\alpha-1}\mathrm{e}^{-\beta y}}{\Gamma(\alpha)}\\
&= \exp\left( -\log\Gamma(\alpha)+\alpha\log\beta+(\alpha-1)\log y-\beta y\right)\\
&= \exp\left(\frac{ -\frac{\beta}{\alpha}y-\left(-\log(\frac\beta\alpha)\right) }{ \frac 1\alpha}+\frac{\log \frac1{1/\alpha}}{\frac1{\alpha}}-\log\Gamma\left(\frac1{1/\alpha}\right)+\left(\frac1{1/\alpha}-1\right)\log y\right)\\
&= \exp\left( \frac{y\theta-b(\theta)}{\psi}+c(y;\psi)\right)
\end{aligned}
where \theta = -\frac{\beta}{\alpha}, \psi = \frac1\alpha, b(\theta) = -\log(-\theta), and c(y;\psi) = \frac{\log\frac1\psi}{\psi}+\left(\frac1\psi-1\right)\log y-\log\Gamma\left(\frac1\psi\right).
The moment generating function can be expressed as
M_{Y}(t)
= \mathbb{E}( \mathrm{e}^{Yt})
= \exp \left[\frac{b\left( \theta +t\psi \right) -b\left( \theta \right) }{\psi }\right] .
The cumulant generating function immediately follows:
\kappa_Y(t) = \log M_Y(t) = \frac{b(\theta +t \psi) -b(\theta)}{\psi},
which can be used to determine the moments of Y:
\mathbb{E}[Y] = \kappa_Y^{\prime}(0) = b^{\prime}(\theta) \quad \text{and} \quad \text{Var}(Y) = \kappa_Y^{\prime\prime}(0) = \psi\, b^{\prime\prime}(\theta).
Notice \mu = b^{\prime}(\theta), so the mean \mu depends on the location parameter \theta (and vice versa). We therefore sometimes write \theta = \theta(\mu).
Examples
Normal N(\mu, \sigma^2): \theta = \mu and hence \theta(\mu) = \mu.
Poisson(\mu) with \theta = \log \mu and hence \theta(\mu) = \log \mu.
Binomial(m, p) with \theta = \log[p/(1-p)] = \log[\mu/(m-\mu)] and hence \theta(\mu) = \log[\mu/(m-\mu)].
Negative Binomial(r, p) with \theta = \log(1-p) = \log (\mu p/r) and hence \theta(\mu) = \log(1-p) = \log( \mu p/r).
Example
For the gamma(\alpha,\beta) random variable Y, find \theta(\mu).
Answer: The density can be written as
\begin{aligned}
f(y)&= \exp\left(\frac{y\theta-b(\theta)}{\psi}+c(y;\psi)\right)
\end{aligned}
where \theta = -\frac{\beta}{\alpha}, \psi = \frac1\alpha, b(\theta) = -\log(-\theta),
c(y;\psi) = \frac{\log\frac1\psi}{\psi}+\left(\frac1\psi-1\right)\log y-\log\Gamma\left(\frac1\psi\right). Then \mu = \mathbb{E}[Y] = b^{\prime}(\theta) = -\frac1\theta, so \theta(\mu) = -\frac1\mu.
The variance is sometimes expressed as \text{Var}(Y) = \psi V(\mu) where clearly
V(\mu) = b^{\prime \prime}\bigl( \theta(\mu) \bigr)
and is called the variance function.
Examples
Normal N(\mu, \sigma^{2}) with V(\mu) = 1
Gamma(\alpha, \beta) with V(\mu) = \mu^2
Inverse Gaussian(\alpha, \beta) with V(\mu) = \mu^3
Poisson(\mu) with V(\mu) = \mu
Binomial(m, p) with V(\mu) = \mu (1-\mu/m)
Negative Binomial(r, p) with V(\mu) = \mu (1+\mu/r)
Mean-variance relationship
With linear regression, central assumption is constant variance (top left-hand graph)
Count data: response is integer, lots of zeros—variance may increase linearly with mean (top right)
Proportion data: a count of successes (or failures) out of a fixed number of trials; the variance is an inverted U-shaped function of the mean (bottom left)
If response follows gamma distribution (e.g. time-to-death data), variance increases faster than linearly with mean (bottom right).
Example
Show that the variance function of Gamma(\alpha, \beta) GLM is V(\mu) = \mu^2.
Note that \mu = \mathbb{E}[Y] = b^\prime(\theta) = -\frac1{\theta}, and therefore \theta = -\frac1{\mu}. So V(\mu) = b^{\prime \prime}(\theta) = \frac1{\theta^2} = \mu^2.
The link between both components
The mean \mu_i is connected to the linear predictor \mathrm{x}_i\beta through \mu_i = g^{-1}\bigl(\mathrm{x}_i\beta\bigr) = g^{-1}(\eta_i) \quad \text{or} \quad \eta_i=g(\mu_i) where g is called the link function.
If g(\cdot) \equiv \theta(\cdot), that is, if \theta_i = \eta_i, we say we have a canonical link, or natural link function.
A GLM models an n-vector of independent response variables, \mathbf{Y} using
Random component: For each observation y_{i} we use an exponential dispersion model f(y_i;\theta_i) = \exp \left[ \frac{y_{i}\theta_{i}-b\left( \theta _{i}\right) }{\psi }+c\left( y_{i};\psi \right) \right], where \theta _{i} is the canonical parameter, \psi is a dispersion parameter and the functions b(\cdot) and c(\cdot, \cdot) are known.
Systematic component: \eta_{i} = \mathrm{x}_i \beta = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}, the linear predictor, with regression parameters \beta = (\beta_0, \beta_1,\ldots,\beta_p).
Parametric link function: the link function g(\mu_i) = \eta_i=\mathrm{x}_i\beta connects the linear predictor with the mean \mu_i=\mathbb{E}[Y_i]. The link is called canonical if \theta _{i}=\eta_i.
Fit a GLM
Lecture Outline
Introduction to GLMs
The components of a GLM
Fit a GLM
Procedure
Constructing a GLM consists of the following steps:
Choose a response distribution f(y_i) and hence choose b(\theta).
Choose a link g(\mu).
Choose explanatory variables x in terms of which g(\mu) is to be modeled.
Collect observations y_1,\cdots,y_n on the response y and corresponding values x_1,\cdots,x_n on the explanatory variables x.
Fit the model by estimating \beta and, if unknown, \psi.
Given the estimate of \beta, generate predictions (or fitted values) of y for different settings of X and examine how well the model fits. The estimated \beta is also used to assess whether or not given explanatory variables are important in determining \mu. (An end-to-end sketch of these steps is given below.)
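An end-to-end sketch of these steps in R, using simulated Poisson data (every name and number below is made up purely for illustration):
set.seed(1)
n  <- 200
x  <- runif(n)                      # steps 3-4: explanatory variable and observations
mu <- exp(0.5 + 1.2 * x)            # true mean on the log-link scale
y  <- rpois(n, mu)                  # step 1: Poisson response
fit <- glm(y ~ x, family = poisson(link = "log"))  # steps 1, 2 and 5: distribution, link, estimate beta
summary(fit)                        # step 6: estimates, standard errors and significance
predict(fit, newdata = data.frame(x = c(0.2, 0.8)), type = "response")  # step 6: fitted means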
Case Study: Motor claims illustration
Consider the data used in McCullagh and Nelder (1989), page 299, related to motor claims. So the responses are claims.
There are three factors used:
policyholder age (PA), with 8 levels, 17-20, 21-24, 25-29, etc.
car group (CG), with 4 levels, A, B, C, and D.
vehicle age (VA), with 4 levels, 0-3, 4-7, 8-9, 10+
There are a total of 123 different cells of data.
Case Study: Motor claims illustration
library(tidyverse)
PCarIns <- read_csv("PrivateCarIns1975-Data.csv")
PCarIns <- PCarIns %>%
  filter(Numb.Claims > 0)  # remove the 3 categories with no claims
str(PCarIns)

# convert to categorical
PCarIns <- PCarIns %>%
  mutate(Cpol.Age  = factor(Cpol.Age),
         Car.Group = factor(Car.Group),
         Cveh.Age  = factor(Cveh.Age))
Case Study: Motor claims illustration
par(mfrow = c(1, 2))  # set up the plotting area with 1 row and 2 columns
hist(PCarIns$Avg.Claims, main = "Histogram of Avg. Claims")
plot(x = PCarIns$Cpol.Age, y = PCarIns$Avg.Claims,
     xlab = "Policyholder Age", ylab = "Claim Amount",
     main = "Scatter Plot of Claims by Policyholder Age")
par(mfrow = c(1, 1))  # reset to default plotting layout
Case Study: Motor claims illustration
par(mfrow = c(1, 2))  # set up the plotting area with 1 row and 2 columns
hist(PCarIns$Avg.Claims, main = "Histogram of Avg. Claims")
plot(x = PCarIns$Car.Group, y = PCarIns$Avg.Claims,
     xlab = "Car group", ylab = "Claim Amount",
     main = "Scatter Plot of Claims by Car Group")
par(mfrow = c(1, 1))  # reset to default plotting layout
Case Study: Motor claims illustration
par(mfrow = c(1, 2))  # set up the plotting area with 1 row and 2 columns
hist(PCarIns$Avg.Claims, main = "Histogram of Avg. Claims")
plot(x = PCarIns$Cveh.Age, y = PCarIns$Avg.Claims,
     xlab = "Vehicle Age", ylab = "Claim Amount",
     main = "Scatter Plot of Claims by Vehicle Age")
par(mfrow = c(1, 1))  # reset to default plotting layout
How could we model the relationship between claim amounts and the covariates?
Maximum Likelihood Estimation
The parameters in a GLM are estimated using maximum likelihood.
For each observation y_{i} the contribution to the likelihood is
f(y_i; \theta_i) = \exp \left[ \frac{y_{i}\theta_{i}-b\left( \theta _{i}\right) }{\psi }+c\left( y_{i};\psi\right) \right].
Given vector \mathbf{y}, an observation of \mathbf{Y}, MLE of \mathbf{\beta} is possible. Since the y_i are mutually independent, the likelihood of \mathbf{\beta} is L(\mathbf{\beta}) = \prod_{i=1}^n f(y_i; \theta_i).
Maximum Likelihood Estimation
So for n independent observations y_{1}, y_{2}, \dots, y_{n}, we have
L(\mathbf{y}; \mathbf{\mu})
= \prod_{i=1}^{n} \exp \left[ \frac{y_{i}\theta _{i}-b(\theta_i) }{\psi} + c\left( y_{i};\psi \right) \right].
Take log to obtain the log-likelihood as
\ell(\mathbf{y}; \mathbf{\mu})
= \sum_{i=1}^n \left[ \frac{y_i \theta_i - b(\theta_i)}{\psi} + c(y_i; \psi) \right].
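For instance, for Poisson responses (\theta_i = \log\mu_i, b(\theta_i) = \mu_i, \psi = 1, c(y_i;\psi) = -\log y_i!) this reduces to
\ell(\mathbf{y}; \mathbf{\mu}) = \sum_{i=1}^n \left[ y_i\log\mu_i - \mu_i - \log y_i! \right].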
Example
Consider a GLM model with the canonical link and gamma distribution. The density of response variable is
\begin{aligned}
f(y)&= \exp\left( \frac{y\theta-b(\theta)}{\psi}+c(y;\psi)\right)
\end{aligned}
with b({\theta}) = -\log(-\theta).
Moreover, with canonical link, we have \theta_i = \theta(\mu_i) = g(\mu_i) = \mathrm{x}_i\beta.
The log-likelihood is
\ell(\mathbf{y}; \mathbf{\mu}) = \sum_{i=1}^n \left(\frac{y_i \theta_i - b(\theta_i)}{\psi} + c(y_i; \psi)\right) = \sum_{i=1}^n \left(\frac{y_i\, \mathrm{x}_i\beta + \log(-\mathrm{x}_i\beta)}{\psi} + c(y_i; \psi)\right),
since \theta_i = \mathrm{x}_i\beta and b(\theta_i) = -\log(-\theta_i); note the canonical link here requires \mathrm{x}_i\beta < 0 so that \mu_i = -1/(\mathrm{x}_i\beta) > 0.
For n data points we can estimate up to n parameters
Null model: the systematic component is a constant term only. \hat{\mu}_i=\bar{y},\quad \text{for all} \quad i = 1, 2, \dots, n
Only one parameter \rightarrow too simple
Full or saturated model: Each observation has its own parameter. \hat{\mu}_i = y_i,\quad \text{for all} \quad i = 1, 2, \dots, n
One parameter per observation reproduces the data exactly \rightarrow too complex; no simplification or explanation of the data
Deviance and Scaled Deviance
The log-likelihood in the full model gives
\ell(\mathbf{y}; \mathbf{y})
= \sum_{i=1}^n \left[ \frac{y_i \widetilde{\theta}_i - b\bigl(\widetilde{\theta}_i\bigr)}{\psi} + c(y_i; \psi) \right]
where \widetilde{\theta}_i are the canonical parameter values corresponding to \mu_i = y_i for all i = 1, 2, \dots, n.
Deviance and Scaled Deviance
Let \widehat{\mu} denote the M.L.E. of \mu under the chosen model.
One way of assessing the fit of a given model is to compare it to the model with the “closest” possible fit: the full model
The likelihood ratio criterion compares a model with its associated full model.
\begin{aligned}
-2\log \left[ \frac{L(\mathbf{y}; \widehat{\mu})}{L(\mathbf{y}; \mathbf{y})} \right] &= 2 [ \ell(\mathbf{y}; \mathbf{y}) - \ell(\mathbf{y}; \widehat{\mu}) ] = 2\sum_{i=1}^n \left[\frac{y_{i}( \widetilde{\theta}_i - \widehat{\theta}_i)}{\psi}-\frac{b(\widetilde{\theta}_i) -b(\widehat{\theta}_i)}{\psi}\right] \\
&= \frac{D(y, \widehat{\mu})}{\psi}
\end{aligned}
D(y, \widehat{\mu}) is called the deviance and D(y, \widehat{\mu})/\psi the scaled deviance.
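For instance, for the gamma family used in the motor claims case study below, \widetilde{\theta}_i = -1/y_i, \widehat{\theta}_i = -1/\widehat{\mu}_i and b(\theta) = -\log(-\theta), so the deviance becomes
D(y, \widehat{\mu}) = 2\sum_{i=1}^n \left[ \frac{y_i - \widehat{\mu}_i}{\widehat{\mu}_i} - \log\left(\frac{y_i}{\widehat{\mu}_i}\right) \right].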
Deviance plays much the same role for GLMs that RSS plays for ordinary linear models. (For ordinary linear models, deviance is RSS.)
The scaled deviance is actually a measure of the fit of the model. It has approximately (asymptotically true) a chi-squared distribution with degrees of freedom equal to the number of observations minus the number of estimated parameters.
Thus, the scaled deviance is usually used for comparing nested models (one model is a subset of the other), by looking at the difference in deviances and comparing it with the chi-squared table.
Reminder: a significant value (at the 5% level) for a \chi^{2} distribution with \nu degrees of freedom is approximately 2\nu.
Model selection
Nested models: Wald test, score test, likelihood ratio test (drop-in deviance test)
Non-nested models: Use AIC = -2\ell(\mathbf{y}; \widehat{\mu}) + 2d, where d is the number of estimated parameters (the smaller, the better)
Model selection (Nested models)
Model 1: \eta=\beta_0+\beta_1x_1+\ldots+\beta_qx_q,
Model 2: \eta=\beta_0+\beta_1x_1+\ldots+\beta_qx_q+\beta_{q+1}x_{q+1}+\ldots+\beta_px_p
Is Model 2 an improvement over Model 1?
H_0: \beta_{q+1} = \cdots = \beta_p = 0
H_a: \text{at least one of } \beta_{q+1}, \ldots, \beta_p \text{ is non-zero}
Model selection (Nested models)
Consider two models,
Model 1: q parameters, with scaled deviance D_1;
Model 2: p parameters (p>q), with scaled deviance D_2.
Model 2 is a significant improvement over Model 1 (the more parsimonious model) if D_1 - D_2 exceeds the critical value obtained from a \chi^2(p-q) distribution.
Since \mathbb{P} \left[ \chi^2(\nu)>2\nu \right] \approx 5\%, the following rule of thumb can be used as an approximation: \text{model 2 is preferred if } D_{1}-D_{2} > 2(p-q).
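A minimal R sketch of this comparison, assuming two nested fitted GLM objects fit1 (Model 1) and fit2 (Model 2) on the same data (hypothetical names):
anova(fit1, fit2, test = "Chisq")              # drop-in-deviance (likelihood ratio) test
D1 <- deviance(fit1); D2 <- deviance(fit2)     # residual deviances of the two fits
df <- df.residual(fit1) - df.residual(fit2)    # p - q, the number of extra parameters
pchisq(D1 - D2, df = df, lower.tail = FALSE)   # p-value, taking the dispersion as 1
AIC(fit1, fit2)                                # for non-nested comparisons: smaller is better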
Case Study: Motor claims illustration - Null Model
# Null model
pcarins.glm.NULL <- glm(Avg.Claims ~ 1, weights = Numb.Claims,
                        family = Gamma, data = PCarIns)
pcarins.glm.NULL
Call: glm(formula = Avg.Claims ~ 1, family = Gamma, data = PCarIns,
weights = Numb.Claims)
Coefficients:
(Intercept)
0.004141
Degrees of Freedom: 122 Total (i.e. Null); 122 Residual
Null Deviance: 649.9
Residual Deviance: 649.9 AIC: 99520
Case Study: Motor claims illustration - Deviance analysis
# analysis of deviance table
print(anova(pcarins.glm, test = "Chi"))
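The anova() call above assumes a fitted model object pcarins.glm, which is not shown on these slides. A hypothetical sketch of how such a model could be fitted, assuming the complete three-way gamma model implied by the deviance table below:
# complete model: all main effects and interactions of the three rating factors,
# weighted by the number of claims in each cell
pcarins.glm <- glm(Avg.Claims ~ Cpol.Age * Car.Group * Cveh.Age,
                   weights = Numb.Claims, family = Gamma, data = PCarIns)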
The scaled deviance statistics are provided below:
| Model        | Deviance | First Diff. | d.f. | Mean Deviance |
|--------------|----------|-------------|------|---------------|
| 1            | 649.9    |             |      |               |
| PA           | 567.7    | 82.2        | 7    | 11.7          |
| PA + CG      | 339.4    | 228.3       | 3    | 76.1          |
| PA + CG + VA | 124.8    | 214.7       | 3    | 71.6          |
| + PA⋅CG      | 90.7     | 34.0        | 21   | 1.62          |
| + PA⋅VA      | 71.0     | 19.7        | 21   | 0.94          |
| + CG⋅VA      | 65.6     | 5.4         | 9    | 0.60          |
| Complete     | 0.0      | 65.6        | 58   | 1.13          |
Residuals in GLMs
Residuals are a primary tool for assessing how well a model fits the data.
They can also help to detect the form of the variance function and to diagnose problem observations.
We consider three different kinds of residuals:
deviance residuals: r_i^D = \text{sign}(y_i - \widehat{\mu}_i) \sqrt{d_i}, where d_{i} is the contribution of the ith observation to the scaled deviance (drawing on the idea that the deviance is akin to the RSS).
Pearson residuals: r_i^P = \dfrac{y_i - \widehat{\mu}_i}{\sqrt{V(\widehat{\mu}_i)}}, the raw residual scaled by the square root of the variance function.
response residuals: simply y_i - \widehat{\mu}_i.
Residuals in GLMs (continued)
If the model is correct and the sample size n is large, then the (scaled) deviance is approximately \chi^2_{n-(p+1)}.
The expected value of the deviance is thus n-(p+1), and one expects each case to contribute approximately (n-(p+1))/n\approx 1 to the deviance. If |d_i| is much greater than 1, then case i is contributing too much to the deviance (contributing to lack of fit), indicating a departure from the model assumptions for that case.
Typically deviance residuals are examined by plotting them against fitted values or explanatory variables.
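A short R sketch of extracting and plotting these residuals, assuming a fitted object such as the pcarins.glm model sketched earlier:
r_dev  <- residuals(pcarins.glm, type = "deviance")  # deviance residuals
r_pear <- residuals(pcarins.glm, type = "pearson")   # Pearson residuals
r_resp <- residuals(pcarins.glm, type = "response")  # response residuals y - mu_hat
plot(fitted(pcarins.glm), r_dev,
     xlab = "Fitted values", ylab = "Deviance residuals")
abline(h = 0, lty = 2)  # look for systematic patterns or unusually large residuals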