ACTL3142 & ACTL5110 Statistical Machine Learning for Risk Applications
Some of the figures in this presentation are taken from "An Introduction to Statistical Learning, with applications in R" (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani
Reading
James et al. (2021): Chapters 4.1, 4.2, 4.3, 4.6, 4.7.1, 4.7.2, 4.7.6, 4.7.7
Lecture Outline
An overview of classification
Logistic regression
Poisson regression
Generalised linear models
Regression
Classification
Success/failure of a treatment, explained by dosage of medicine administered, patient’s age, sex, BMI, severity of condition, etc.
Vote for/against political party, explained by age, gender, education level, region, ethnicity, geographical location, etc.
Customer churns/stays depending on usage pattern, complaints, social demographics, discounts offered, etc.
The Default dataset from ISLR2:
default (Y) is a binary variable (yes/no as 1/0)
income (X_1) and credit card balance (X_2) are continuous predictors
student (X_3) is a categorical predictor
Default data: coding in the binary case is simple, Y \in \{0,1\} \Leftrightarrow Y\in\{{\color{#2171B5}\bullet},{\scriptsize{\color{#238B45}\blacksquare}}\}
Our objective is to find a good predictive model f that can assign a new observation to one of the classes and estimate the probability of each class, e.g. \mathbb{P}(Y=1|\mathrm{X}).
Lecture Outline
An overview of classification
Logistic regression
Poisson regression
Generalised linear models
The Odds of an event A measure the probability of A relative to its complement, i.e. \text{Odds}(A)=\frac{\mathbb{P}(A)}{\mathbb{P}(A^{\text{c}})} = \frac{\mathbb{P}(A)}{1-\mathbb{P}(A)}.
There is a “bijection” between probability and odds: if you know the probability you can find the odds, and if you know the odds you can recover the probability: \frac{1}{\text{Odds}(A)} = \frac{1}{\mathbb{P}(A)}-1, \quad\text{or equivalently}\quad \mathbb{P}(A)=\frac{\text{Odds}(A)}{1+\text{Odds}(A)}.
Odds take values between 0 and \infty.
\underbrace{\ln\left(\frac{\mathbb{P}(Y=1|\mathrm{X})}{1-\mathbb{P}(Y=1|\mathrm{X})}\right)}_{\text{log-odds}} = \underbrace{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}_{\text{linear model}}
probability | odds | log-odds |
---|---|---|
0.001 | 0.001 | -6.907 |
0.250 | 0.333 | -1.099 |
0.333 | 0.500 | -0.693 |
0.500 | 1.000 | 0.000 |
0.667 | 2.000 | 0.693 |
0.750 | 3.000 | 1.099 |
0.999 | 999.000 | 6.907 |
\ln\left(\frac{\mathbb{P}(Y=1|\mathrm{X})}{1-\mathbb{P}(Y=1|\mathrm{X})}\right) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p
Or, equivalently, \text{Odds}(Y=1|\mathrm{X}) = \mathrm{e}^{ \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}
Using (training) data and maximum-likelihood estimation, we produce estimates \hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p.
We can then estimate the probability \mathbb{P}(Y=1|\mathrm{X}) as
\widehat{\mathbb{P}}(Y=1|\mathrm{X}) = \frac{\mathrm{e}^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_p X_p}}{1+\mathrm{e}^{\hat{\beta}_0 + \hat{\beta}_1 X_1 + \cdots + \hat{\beta}_p X_p}}.
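As a small R sketch (the object name fitBalance and the single-predictor formula are just illustrative), the estimated log-odds and probabilities are obtained from a glm() fit via predict():
fitBalance <- glm(default ~ balance, family = binomial(), data = ISLR2::Default)
head(predict(fitBalance, type = "link"))      # estimated log-odds: beta_0 + beta_1 * balance
head(predict(fitBalance, type = "response"))  # estimated probabilities P(Y = 1 | X)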
In linear regression, Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon, hence if predictor X_{j} increases by 1 we would predict Y to increase by \beta_j, on average. In logistic regression, a one-unit increase in X_j instead changes the log-odds by \beta_j, i.e. it multiplies the odds of Y=1 by e^{\beta_j}.
Y = \begin{cases} 1 & \text{if } {\color{#2171B5}\bullet} \\ 0 & \text{if } {\scriptsize{\color{#238B45}\blacksquare}} \end{cases} \qquad \ln\left(\frac{\mathbb{P}(Y=1|X)}{1-\mathbb{P}(Y=1|X)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2
The parameter estimates are \hat{\beta}_0= 13.671, \hat{\beta}_1= -4.136, \hat{\beta}_2= 2.803
\hat{\beta}_1= -4.136 implies that the bigger X_1 the lower the chance it is a blue point
\hat{\beta}_2= 2.803 implies that the bigger X_2 the higher the chance it is a blue point
\ln\left(\frac{\mathbb{P}(Y=1|X)}{1-\mathbb{P}(Y=1|X)}\right) = 13.671 - 4.136 X_1 + 2.803 X_2
X_1 | X_2 | log-odds | \mathbb{P}(Y=1|\mathrm{X}) | prediction |
---|---|---|---|---|
7.0 | 8.0 | 7.14 | 0.9992 | blue |
8.0 | 7.5 | 1.61 | 0.8328 | blue |
8.0 | 7.0 | 0.20 | 0.5508 | blue |
8.5 | 7.5 | -0.46 | 0.3864 | green |
9.0 | 7.0 | -3.93 | 0.0192 | green |
Note: the prediction is 1 (“blue”) if \mathbb{P}[Y=1|\mathrm{X}]>0.5.
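The table above can be reproduced from the stated coefficients; a small R sketch (the (X_1, X_2) points are the hypothetical values listed in the table, and the object names are ours):
beta <- c(13.671, -4.136, 2.803)
newX <- cbind(1, X1 = c(7, 8, 8, 8.5, 9), X2 = c(8, 7.5, 7, 7.5, 7))
logodds <- drop(newX %*% beta)        # beta_0 + beta_1 * X1 + beta_2 * X2
prob <- plogis(logodds)               # exp(z) / (1 + exp(z))
data.frame(newX[, -1], logodds, prob, prediction = ifelse(prob > 0.5, "blue", "green"))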
glmStudent <- glm(default ~ student, family = binomial(), data = ISLR2::Default)
summary(glmStudent)
Call:
glm(formula = default ~ student, family = binomial(), data = ISLR2::Default)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.50413 0.07071 -49.55 < 2e-16 ***
studentYes 0.40489 0.11502 3.52 0.000431 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2920.6 on 9999 degrees of freedom
Residual deviance: 2908.7 on 9998 degrees of freedom
AIC: 2912.7
Number of Fisher Scoring iterations: 6
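For interpretation, the fitted coefficients imply the following default probabilities (a quick check using plogis(), the inverse logit, rather than extra model output):
plogis(-3.50413)              # non-students: about 0.029
plogis(-3.50413 + 0.40489)    # students:     about 0.043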
glmAll <- glm(default ~ balance + income + student, family = binomial(), data = ISLR2::Default)
summary(glmAll)
Call:
glm(formula = default ~ balance + income + student, family = binomial(),
data = ISLR2::Default)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.087e+01 4.923e-01 -22.080 < 2e-16 ***
balance 5.737e-03 2.319e-04 24.738 < 2e-16 ***
income 3.033e-06 8.203e-06 0.370 0.71152
studentYes -6.468e-01 2.363e-01 -2.738 0.00619 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2920.6 on 9999 degrees of freedom
Residual deviance: 1571.5 on 9996 degrees of freedom
AIC: 1579.5
Number of Fisher Scoring iterations: 8
Results of logistic regression: default against student

Predictor | Coefficient | Std error | Z-statistic | P-value |
---|---|---|---|---|
(Intercept) | -3.5041 | 0.0707 | -49.55 | <0.0001 |
student = Yes | 0.4049 | 0.1150 | 3.52 | 0.0004 |

Results of logistic regression: default against balance, income, and student

Predictor | Coefficient | Std error | Z-statistic | P-value |
---|---|---|---|---|
(Intercept) | -10.8690 | 0.4923 | -22.080 | <0.0001 |
balance | 0.0057 | 2.319e-04 | 24.738 | <0.0001 |
income | 3.033e-06 | 8.203e-06 | 0.370 | 0.71152 |
student = Yes | -0.6468 | 0.2363 | -2.738 | 0.00619 |
Confusion matrix
 | Y=0 | Y=1 | Total |
---|---|---|---|
\hat{Y}=0 | 10 | 2 | 12 |
\hat{Y}=1 | 4 | 14 | 18 |
Total | 14 | 16 | 30 |
\text{True-Positive Rate} = \frac{14}{16}=0.875
\text{False-Positive Rate} = \frac{4}{14}=0.286
Confusion matrix
 | Y=0 | Y=1 | Total |
---|---|---|---|
\hat{Y}=0 | 6 | 0 | 6 |
\hat{Y}=1 | 8 | 16 | 24 |
Total | 14 | 16 | 30 |
\text{True-Positive Rate} = \frac{16}{16}=1
\text{False-Positive Rate} = \frac{8}{14}=0.571
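As a sketch of how such a confusion matrix and the two rates can be computed in practice (here for the glmAll model on the full Default data, not the 30-observation example above, using a 0.5 threshold):
probs <- predict(glmAll, type = "response")
pred <- factor(ifelse(probs > 0.5, "Yes", "No"), levels = c("No", "Yes"))
tab <- table(Predicted = pred, Actual = ISLR2::Default$default)
tab
tab["Yes", "Yes"] / sum(tab[, "Yes"])   # true-positive rate (sensitivity)
tab["Yes", "No"] / sum(tab[, "No"])     # false-positive rate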
Lecture Outline
An overview of classification
Logistic regression
Poisson regression
Generalised linear models
In many applications we need to model count data, Y\in\{0,1,2,3,\ldots\}:
In mortality studies and/or health insurance, the aim is to explain the number of deaths and/or disabilities in terms of predictor variables such as age, gender, occupation and lifestyle.
In general insurance, the count of interest may be the number of claims made on vehicle insurance policies. This could be a function of the driver’s characteristics, the geographical location of driver, the age of the car, type of car, previous claims experience, and so on.
Bikeshare dataset from ISLR2: can we model bikers as a function of the other variables?
'data.frame': 8645 obs. of 15 variables:
$ season : num 1 1 1 1 1 1 1 1 1 1 ...
$ mnth : Factor w/ 12 levels "Jan","Feb","March",..: 1 1 1 1 1 1 1 1 1 1 ...
$ day : num 1 1 1 1 1 1 1 1 1 1 ...
$ hr : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
$ holiday : num 0 0 0 0 0 0 0 0 0 0 ...
$ weekday : num 6 6 6 6 6 6 6 6 6 6 ...
$ workingday: num 0 0 0 0 0 0 0 0 0 0 ...
$ weathersit: Factor w/ 4 levels "clear","cloudy/misty",..: 1 1 1 1 1 2 1 1 1 1 ...
$ temp : num 0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
$ atemp : num 0.288 0.273 0.273 0.288 0.288 ...
$ hum : num 0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
$ windspeed : num 0 0 0 0 0 0.0896 0 0 0 0 ...
$ casual : num 3 8 5 3 0 0 2 1 1 8 ...
$ registered: num 13 32 27 10 1 1 0 2 7 6 ...
$ bikers : num 16 40 32 13 1 1 2 3 8 14 ...
Bikeshare dataset - Discussion
One option is a linear model, Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon, but it can predict negative counts.
Another is to model the log of the response, \log(Y) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon, but this requires Y > 0 and models the log of the count rather than the count itself.
\mathbb{P}(Y=k) = \frac{e^{-\lambda}\lambda^k}{k!} \quad \text{for } k=0,1,2,\ldots \quad \text{with } \mathbb{E}[Y]= \text{Var}(Y)=\lambda.
\log(\lambda(X_1,\ldots,X_p)) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p.
Hence, \mathbb{E}[Y] = \lambda(X_1,X_2,\ldots, X_p) = \mathrm{e}^{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p}.
This is convenient because \lambda cannot be negative.
Interpretation: an increase in X_j by one unit is associated with a change in \mathbb{E}[Y] by a multiplicative factor e^{\beta_j}.
The maximum-likelihood estimates maximise L(\beta_0,\beta_1,\ldots,\beta_p)=\prod_{i=1}^n\frac{e^{-\lambda(\mathrm{x}_i)}\lambda(\mathrm{x}_i)^{y_i}}{y_i!} \quad \text{with} \quad \lambda(\mathrm{x}_i) = \mathrm{e}^{\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}}
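glm() maximises this likelihood for us; as a rough sketch (the two-predictor formula is chosen purely to keep it small), roughly the same estimates can be recovered by minimising the negative log-likelihood with optim() and compared against glm():
negloglik <- function(beta, X, y) {
  lambda <- exp(X %*% beta)             # lambda(x_i) = exp(beta_0 + beta_1 x_i1 + ...)
  -sum(dpois(y, lambda, log = TRUE))    # negative Poisson log-likelihood
}
X <- model.matrix(bikers ~ workingday + temp, data = ISLR2::Bikeshare)
y <- ISLR2::Bikeshare$bikers
fit <- optim(rep(0, ncol(X)), negloglik, X = X, y = y, method = "BFGS")
cbind(optim = fit$par,
      glm = coef(glm(bikers ~ workingday + temp, family = poisson(), data = ISLR2::Bikeshare)))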
Bikeshare dataset
glmBikeshare <- glm(bikers ~ workingday + temp + weathersit + mnth + hr, family = poisson(),
data = ISLR2::Bikeshare)
summary(glmBikeshare)
Call:
glm(formula = bikers ~ workingday + temp + weathersit + mnth +
hr, family = poisson(), data = ISLR2::Bikeshare)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.693688 0.009720 277.124 < 2e-16 ***
workingday 0.014665 0.001955 7.502 6.27e-14 ***
temp 0.785292 0.011475 68.434 < 2e-16 ***
weathersitcloudy/misty -0.075231 0.002179 -34.528 < 2e-16 ***
weathersitlight rain/snow -0.575800 0.004058 -141.905 < 2e-16 ***
weathersitheavy rain/snow -0.926287 0.166782 -5.554 2.79e-08 ***
mnthFeb 0.226046 0.006951 32.521 < 2e-16 ***
mnthMarch 0.376437 0.006691 56.263 < 2e-16 ***
mnthApril 0.691693 0.006987 98.996 < 2e-16 ***
mnthMay 0.910641 0.007436 122.469 < 2e-16 ***
mnthJune 0.893405 0.008242 108.402 < 2e-16 ***
mnthJuly 0.773787 0.008806 87.874 < 2e-16 ***
mnthAug 0.821341 0.008332 98.573 < 2e-16 ***
mnthSept 0.903663 0.007621 118.578 < 2e-16 ***
mnthOct 0.937743 0.006744 139.054 < 2e-16 ***
mnthNov 0.820433 0.006494 126.334 < 2e-16 ***
mnthDec 0.686850 0.006317 108.724 < 2e-16 ***
hr1 -0.471593 0.012999 -36.278 < 2e-16 ***
hr2 -0.808761 0.014646 -55.220 < 2e-16 ***
hr3 -1.443918 0.018843 -76.631 < 2e-16 ***
hr4 -2.076098 0.024796 -83.728 < 2e-16 ***
hr5 -1.060271 0.016075 -65.957 < 2e-16 ***
hr6 0.324498 0.010610 30.585 < 2e-16 ***
hr7 1.329567 0.009056 146.822 < 2e-16 ***
hr8 1.831313 0.008653 211.630 < 2e-16 ***
hr9 1.336155 0.009016 148.191 < 2e-16 ***
hr10 1.091238 0.009261 117.831 < 2e-16 ***
hr11 1.248507 0.009093 137.304 < 2e-16 ***
hr12 1.434028 0.008936 160.486 < 2e-16 ***
hr13 1.427951 0.008951 159.529 < 2e-16 ***
hr14 1.379296 0.008999 153.266 < 2e-16 ***
hr15 1.408149 0.008977 156.862 < 2e-16 ***
hr16 1.628688 0.008805 184.979 < 2e-16 ***
hr17 2.049021 0.008565 239.221 < 2e-16 ***
hr18 1.966668 0.008586 229.065 < 2e-16 ***
hr19 1.668409 0.008743 190.830 < 2e-16 ***
hr20 1.370588 0.008973 152.737 < 2e-16 ***
hr21 1.118568 0.009215 121.383 < 2e-16 ***
hr22 0.871879 0.009536 91.429 < 2e-16 ***
hr23 0.481387 0.010207 47.164 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 1052921 on 8644 degrees of freedom
Residual deviance: 228041 on 8605 degrees of freedom
AIC: 281159
Number of Fisher Scoring iterations: 5
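Recalling the multiplicative interpretation from earlier, exponentiating a coefficient gives the factor by which the expected number of bikers changes; for example:
exp(coef(glmBikeshare)[c("workingday", "weathersitlight rain/snow")])
# about 1.015 for working days (1.5% more bikers) and 0.56 for light rain/snow (44% fewer)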
Bikeshare dataset
plot(x = 1:12, y = c(0, glmBikeshare$coefficients[7:17]), type = 'o',
xlab = "month", ylab = "coefficient", xaxt = "n")
axis(1, at=1:12, labels=substr(month.name, 1, 1))
plot(x = 1:24, y = c(glmBikeshare$coefficients[18:40], 0), type = 'o',
xlab = "hour", ylab = "coefficient")
Lecture Outline
An overview of classification
Logistic regression
Poisson regression
Generalised linear models
 | Linear Regression | Logistic Regression | Poisson Regression | Generalised Linear Models |
---|---|---|---|---|
Type of Data | Continuous | Binary (Categorical) | Count | Flexible |
Use | Prediction of continuous variables | Classification | Prediction of an integer number | Flexible |
Distribution of Y|\mathrm{x} | Normal | Bernoulli (Binomial for multiple trials) | Poisson | Exponential Family |
\mathbb{E}[Y|\mathrm{X}] | \mathrm{X}\boldsymbol{\beta} | \frac{e^{\mathrm{X}\boldsymbol{\beta}}}{1+e^{\mathrm{X}\boldsymbol{\beta}}} | e^{\mathrm{X}\boldsymbol{\beta}} | g^{-1}(\mathrm{X}\boldsymbol{\beta}) |
Link Function Name | Identity | Logit | Log | Depends on the choice of distribution |
Link Function Expression | \eta(\mu) = \mu | \eta(\mu) = \log \left(\frac{\mu}{1-\mu}\right) | \eta(\mu) = \log(\mu) | Depends on the choice of distribution |
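All of these are fitted through the same glm() interface in R; only the family (and hence the default link) changes. A small sketch, with formulas chosen purely to illustrate the interface:
glm(balance ~ income, family = gaussian(), data = ISLR2::Default)     # linear regression (identity link)
glm(default ~ balance, family = binomial(), data = ISLR2::Default)    # logistic regression (logit link)
glm(bikers ~ temp, family = poisson(), data = ISLR2::Bikeshare)       # Poisson regression (log link)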
Comments about logistic regression