Lab 5: Generalised Linear Models

Questions

Conceptual Questions

Components of a GLM

  1. For the following members of the exponential dispersion family, give the density (including the domain), the mean and the variance:

    1. Normal(\mu,\sigma^2)

    2. Poisson(\mu)

    3. Binomial(m,p)

    4. Negbin(r,p)

    5. Gamma(\alpha,\beta)

    6. Inverse Gaussian(\alpha,\beta)

    Solution

  2. \star The density of the Binomial distribution is given by f(y;p)=\frac{n!}{(n-y)!y!}p^y(1-p)^{n-y} Show that the Binomial distribution is a member of the exponential dispersion family with density f(y;\theta,\psi)=\exp{\left[\frac{y\theta-b(\theta)}{\psi}+c(y;\psi)\right]}.

    1. Give expressions for b(\theta), c(y;\psi) and \psi.

    2. List the three constituent parts of a generalized linear model.

    3. Find the expression for the deviance of a binomial model.

    Solution

Deviance and Scaled Deviance

  1. \star Verify that \frac{D}{\psi}=-2\log{\frac{L(\textbf{y};\mu)}{L(\textbf{y};\textbf{y})}}=\frac{2}{\psi}\sum_{i}{\left(y_i\log{\frac{y_i}{\hat{\mu_i}}}-(y_i-\hat{\mu_i})\right)} is the scaled deviance for a Poisson distribution.

    Solution

  2. Verify that \frac{D}{\psi}=-2\log{\frac{L(\textbf{y};\mu)}{L(\textbf{y};\textbf{y})}}=\frac{2}{\psi}\sum_{i}{\left(-\log{\frac{y_i}{\hat{\mu_i}}}+\frac{(y_i-\hat{\mu_i})}{\hat{\mu_i}}\right)} is the scaled deviance for a gamma distribution.

    Solution

  3. Show that the deviance for an Inverse Gaussian distribution has the following form: D=\sum_{i=1}^{n}{\frac{1}{\hat{\mu}_i^2}\frac{(y_i-\hat{\mu}_i)^2}{y_i}}.

    Solution

Fit a GLM and Evaluate the quality of a model

  1. \star (Question #9, ACTL3003/5106 Final Exam 2005)

    A random variable Y is said to have an exponential dispersion model if its density can be expressed in the form f_Y(y; \theta, \psi) = \exp\left(\frac{y\theta-b(\theta)}{\psi}+c(y;\psi)\right) where \theta and \psi are parameters and b(\cdot) and c(\cdot;\cdot) are both functions.

    Automobile insurance claims experience data of a French insurance company for a two-year period, beginning January 2001 and ending December 2002, is being modeled using a Generalized Linear Model (GLM) framework.

    Assume that the number of claims for risk class i, Y_i, has a Poisson distribution with probability density function of the form f_Y(y;\mu) = \frac{\mathrm{e}^{-\mu}\mu^y}{y!}, \text{ for } y = 0,1,2, \dots where its mean \mu>0 is related to the variables SEX, VEH_AGE, AGE, and LOYALTY as \log{\left(\mu_i\right)}=\log{(\texttt{EXP}_i)}+\beta_0+\beta_1\texttt{SEX}_i+\beta_2\texttt{VEH}\_\texttt{AGE}_i(1)+\beta_3\texttt{VEH}\_\texttt{AGE}_i(2)+\beta_4\texttt{AGE}_i+\beta_5\texttt{LOYALTY}_i where a detailed description of the variables and their definitions is given below:

    Variable Description
    SEX 1 = male; 2 = female.
    VEH_AGE 1 = less than 1 year; 2 = between 1-2 years; 3 = 2 years and more.
    AGE 1 = 20 years and below; 2 = above 20 years.
    LOYALTY 1 = has been client for past 36 months; 0 = otherwise.
    Y total number of claims
    EXP total number of policies exposed to claims

    Note that the variables VEH_AGE(1) and VEH_AGE(2) in the regression equation are the respective indicator variables for VEH_AGE of types 1 and 2.

    SAS output for running PROC GENMOD on the data is provided below.

    1. Show that the Poisson distribution can be written in exponential dispersion form. Identify the dispersion and canonical parameters, \psi and \theta respectively, in terms of \mu, to the extent possible.

    2. Derive the expression for the deviance of the Poisson GLM model.

    3. Based on the deviances provided in the SAS output, analyze the adequacy of the model.

    4. Explain the meaning of overdispersion in the context of a Poisson GLM model.

    Solution

  2. An insurance company has a set of n risks (i=1,2,\dots,n) for which it has recorded the number of claims per month, Y_{ij}, for m months (j=1,2,\dots,m). It is assumed that the number of claims for each risk, for each month, are independent Poisson random variables with \mathbb{E}(Y_{ij}) = \mu_{ij}. These random variables are modelled using a Generalized Linear Model, with \log{\mu_{ij}} = \beta_i,\ \text{for} \ i = 1,2,\dots,n.

    1. Derive the maximum likelihood estimator of \beta_i

    2. Show that the deviance for this model is

      2\sum_{i=1}^{n}{\sum_{j=1}^{m}{\left(y_{ij}\log{\frac{y_{ij}}{\bar{y}_i}}-(y_{ij}-\bar{y}_i)\right)}} where \bar{y}_i=\frac{1}{m}\sum_{j=1}^{m}{y_{ij}}.

    3. The company has data for each month over a two-year period. For one risk, the average number of claims per month was 17.45. In the most recent month for this risk, there were 9 claims. Calculate the contribution that this observation makes to the deviance.

    Solution

  3. There are m male drivers in each of three age groups, and data on the number of claims made during the last year are available. Assume that the numbers of claims are independent Poisson random variables. If Y_{ij} is the number of claims for the jth male driver in group i (i = 1,2,3; j = 1,2,\dots,m), let \mathbb{E}(Y_{ij}) = \mu_{ij} and suppose \log{(\mu_{ij})} = \alpha_i.

    1. Show that this is a Generalized Linear Model, identifying the link function and the linear predictor.

    2. Determine the log-likelihood, and the maximum likelihood estimators of \alpha_i for i = 1,2,3.

    3. For a particular data set with 20 observations in each group, several models are fitted, with deviances as shown below:

              Linear predictor                                                                          Deviance

      Model 1 \log{(\mu_{ij})}=\alpha_i 60.40

      Model 2 \log{(\mu_{ij})}= \begin{cases} \alpha ,& \text{if} \ i=1,2 \\ \beta ,& \text{if} \ i=3 \end{cases} 61.64

      Model 3 \log{(\mu_{ij})}=\alpha 72.53

      1. Determine whether or not Model 2 is a significant improvement over Model 3, and whether or not Model 1 is a significant improvement over Model 2.

      2. Interpret these three models.

    Solution

  4. An insurance company analysed claim sizes under two factors: **CAR**, the insurance group into which the car was placed, and **AGE**, the age of the policyholder (i.e. a two-way contingency table). It was assumed that the claim size y_i follows a gamma distribution, i.e.

    f(y_i)=\frac{1}{\Gamma(\nu_i)\:y_i}\left(\frac{y_i\:\nu_i}{\mu_i}\right)^{\nu_i}\exp{\left(-\frac{y_i\:\nu_i}{\mu_i}\right)}\;\;\text{for}\;\;y_i\geq0,\;\mu_i>0,\;\nu_i=1

    with a log-link function. Analysis of a set of data for which n=8 provided the following SAS output:

    Observation Claim size CAR type Age group Pred Xbeta Resdev
    1 27 1 1 25.53 3.24 0.30
    2 16 1 2 24.78 3.21 -1.90
    3 36 1 1 ? 3.41 1.03
    4 45 1 2 38.09 3.64 1.11
    5 38 2 1 40.85 3.71 -0.46
    6 27 2 2 36.97 3.61 -1.73
    7 14 2 1 ? 2.45 0.69
    8 6 2 2 14.59 2.68 -2.55

    Calculate the fitted claim sizes missing in the table.

    Solution

Applied questions

  1. \star In this question, the vehicle insurance data set dataCar is used. This data set is based on one-year vehicle insurance policies taken out in 2004 or 2005. There are 67856 policies, of which 4624 had at least one claim.

    The data frame dataCar contains the claim occurrence indicator clm, which takes value 1 if there is a claim and 0 otherwise. The variable veh_value represents the vehicle value, which takes values in the range \$0-\$350,000. We will not be concerned with other variables at the moment.

    In this question, we will build a logistic regression model for the vehicle insurance data set. Previous studies have suggested that the relationship between the likelihood of a claim and vehicle value is possibly quadratic or cubic.

    1. Suppose the relationship between vehicle value and the probability of a claim is cubic. Formulate the model and test the significance of the coefficients.

    2. Which variable(s) is/are significant at the 1% level?

    3. Use AIC to determine which model is the best model: Linear, quadratic or cubic.

    Solution

  2. \star Third party insurance is a compulsory insurance for vehicle owners in Australia. It insures vehicle owners against injury caused to other drivers, passengers or pedestrians, as a result of an accident.

    In this question, the third party claims data set Third_party_claims.csv is used. This data set records the number of third party claims in a twelve-month period between 1984-1986 in each of 176 geographical areas (local government areas) in New South Wales, Australia.

    1. Now consider a model for the number of claims (claims) in an area as a function of the number of accidents (accidents). Produce a scatter plot of claims against accidents. Do you think a simple linear regression model is appropriate?

    2. Fit a simple linear regression to the model and use the plot command to produce residual and diagnostic plots for the fitted model. What do the plots tell you?

    3. Now fit a Poisson regression model with claims as the response and log(accidents) as the predictor (include offset=log(population) in your code). Check whether there is overdispersion in the model by computing the estimate of \psi.

    4. Now fit the regression model by specifying family=quasipoisson. Comment on the estimates of the parameters and their standard errors.

    Solution

Solutions

Conceptual Questions

Components of a GLM

  1. See the lecture notes: the density and the mean are given in the table there, and the variance can be derived easily from it via \sigma^2 = \psi \cdot V(\mu). Try to map some of the densities into the exponential dispersion family formulation.

  2. You ought to be able to verify that the Binomial belongs to the family of Exponential Dispersion with b(\theta) = n\log{(1+\mathrm{e}^\theta)}, \,\, c(y; \psi) = \log\left(\frac{n!}{(n-y)!y!}\right), \text{ and } \psi=1. You should also be able to show that \theta=\log{\frac{p}{1-p}}=\log{\frac{\mu}{n-\mu}}; a short verification is sketched at the end of this solution. The three components of a generalized linear model are:

    1. Stochastic Component: The observations Y_i are independent and each follows an Exponential Dispersion distribution.
    2. Systematic Component: Every observation has a linear predictor \eta_i = \sum_jx_{ij}\beta_j where x_{ij} denotes the jth explanatory variable, and
    3. Link Function: The expected value \mathbb{E}[Y_i] = \mu_i is linked to the linear predictor \eta_i by the link function \eta_i = g(\mu_i).

    Now, to find the deviance of the binomial (which equals the scaled deviance since \psi = 1), we have

    \begin{aligned} \frac{D}{\psi} &= -2 \log \left( \frac{ \prod\limits_{i} \frac{n!}{(n-y_i)! y_i!} \left( \frac{\widehat{\mu}_i}{n} \right)^{y_i} (1-\frac{\widehat{\mu}_i}{n})^{(n-y_{i})} }{ \prod\limits_{i}\frac{n!}{(n-y_i)! y_i!} \left( \frac{y_i}{n} \right)^{y_i} (1-\frac{y_i}{n})^{(n-y_i)} }\right) \\ &= -2 \log \left( \prod\limits_{i} \left( \frac{\widehat{\mu}_i}{y_i} \right)^{y_i} \left( \frac{n-\widehat{\mu}_i}{n-y_i} \right)^{n-y_i} \right) \\ &= -2 \sum\limits_{i} \left[ y_i \log \left( \frac{\widehat{\mu}_i}{y_i} \right) + (n-y_i) \log \left( \frac{n-\widehat{\mu}_i}{n-y_i} \right) \right] \\ &= 2 \sum_{i=1}^{n} \left[ y_i \log \left( \frac{y_i}{\widehat{\mu}_i} \right) + (n-y_i) \log \left( \frac{n-y_i}{n-\widehat{\mu}_i} \right) \right] . \end{aligned}
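
    For completeness, the verification asserted above is a one-line rearrangement of the Binomial density (with \mu=np):

    \begin{aligned} f(y;p) &= \frac{n!}{(n-y)!y!}\,p^y(1-p)^{n-y} \\ &= \exp\left[y\log{\frac{p}{1-p}}+n\log{(1-p)}+\log{\frac{n!}{(n-y)!y!}}\right], \end{aligned}

    so that \theta=\log{\frac{p}{1-p}}, b(\theta)=n\log{(1+\mathrm{e}^\theta)}=-n\log{(1-p)}, \psi=1, and c(y;\psi)=\log{\frac{n!}{(n-y)!y!}}, matching the expressions given above.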

Deviance and Scaled Deviance

  1. We know that if D denotes the deviance, the scaled deviance is \frac{D}{\psi}=-2\log{\left(\widehat{L}/\widetilde{L}\right)} by definition, where \widehat{L} is the likelihood computed with the \mu replaced by the MLEs \widehat{\mu} under the current model, while \widetilde{L} is the likelihood computed with the \mu replaced by the estimates under the “full model”, hence by the actual observations y, in view of the remarks just below (8.22). Showing that (8.23) results from this is basic algebra. To see this, note that

    \begin{aligned} \frac{D}{\psi} &= -2\log{\left(\frac{\prod_{i=1}^{n}\mathrm{e}^{-\widehat{\mu}_i}\widehat{\mu}_i^{y_i}/y_i!} {\prod_{i=1}^{n}\mathrm{e}^{-y_i}y_i^{y_i}/y_i!}\right)} \\ &= -2\log{\left(\prod_{i=1}^{n} \mathrm{e}^{-(\widehat{\mu}_i-y_i)}\left(\frac{\widehat{\mu}_i}{y_i}\right)^{y_i}\right)} \\ &= -2\sum_{i=1}^{n}\left[-(\widehat{\mu}_i-y_i)+y_i\log{\left(\frac{\widehat{\mu}_i}{y_i}\right)}\right] \\ &= 2\sum_{i=1}^{n}\left[y_i\log{\left(\frac{y_i}{\widehat{\mu}_i}\right)}-(y_i-\widehat{\mu}_i)\right], \end{aligned}

    which is exactly the stated form, since \psi=1 for the Poisson.

  2. To show that (8.26) results, following the discussion in the previous problem, we can verify that, for exponential dispersion models, the scaled deviance can be expressed as

    \begin{aligned} \frac{D}{\psi } &=-2\log \left( \widehat{L}/\widetilde{L}\right) \\ &=2\sum_{i=1}^{n}\left[ \frac{y_{i}\left( \widetilde{\theta }_{i}-\widehat{ \theta }_{i}\right) }{\psi }-\frac{b\left( \widetilde{\theta }_{i}\right) -b\left( \widehat{\theta }_{i}\right) }{\psi }\right] . \end{aligned}

    For the Gamma, we have \theta(\mu) = -1/\mu and b(\theta) = -\log(-\theta) = \log \mu; we then have

    \begin{aligned} \frac{D}{\psi } &=2\sum_{i=1}^{n}\left[ \frac{y_{i}\left( 1/\widehat{\mu } _{i}-1/y_{i}\right) }{\psi }-\frac{\log y_{i}-\log \widehat{\mu }_{i}}{\psi } \right] \\ &=\frac{2}{\psi }\sum_{i=1}^{n}\left[ \frac{y_{i}-\widehat{\mu }_{i}}{ \widehat{\mu }_{i}}-\log \left( y_{i}/\widehat{\mu }_{i}\right) \right] . \end{aligned}

    Now, if the scale parameter differed across observations according to some weight w_i (i.e. \psi_i=\psi/w_i), it is easy to verify that the same derivation carries through, with each term of the sum weighted by w_i.

  3. Recall that the scaled deviance for any member of the Exponential Dispersion family has the form \begin{aligned} \frac{D}{\psi} &=2\left[ \ell \left( \widetilde{\mathbf{\theta }};\mathbf{y }\right) -\ell \left( \widehat{\mathbf{\theta }};\mathbf{y}\right) \right] \\ &=\frac{2}{\psi }\sum_{i=1}^{n}\left[ \left( y_{i}\widetilde{\theta } _{i}-b\left( \widetilde{\theta }_{i}\right) \right) -\left( y_{i}\widehat{ \theta }_{i}-b\left( \widehat{\theta }_{i}\right) \right) \right] \end{aligned} where for the Inverse Gaussian, we have verified (in lecture) that \psi =\beta /\alpha ^{2},\text{ \ }\theta =-\frac{1}{2}\left(\frac{\beta}{\alpha}\right)^2 =-\frac{1}{2\mu^{2}},\text{ \ and \ }b\left( \theta \right) =-\sqrt{-2\theta }=-1/\mu Thus, the deviance can be expressed as

    \begin{aligned} D &=2\sum_{i=1}^{n}\left[ \left( y_{i}\left( -\frac{1}{2y_{i}^{2}}\right) + \frac{1}{y_{i}}\right) -\left( y_{i}\left( -\frac{1}{2\widehat{\mu }_{i}^{2}} \right) +\frac{1}{\widehat{\mu }_{i}}\right) \right] \\ &=2\sum_{i=1}^{n}\left[ \frac{1}{2y_{i}}\left( 1+\frac{y_{i}^{2}}{\widehat{ \mu }_{i}^{2}}-\frac{2y_{i}}{\widehat{\mu }_{i}}\right) \right] =\sum_{i=1}^{n}\left[ \frac{1}{y_{i}}\left( 1-\frac{y_{i}}{\widehat{\mu }_{i} }\right) ^{2}\right] \\ &=\sum_{i=1}^{n}\left[ \frac{1}{y_{i}}\left( \frac{\widehat{\mu }_{i}-y_{i} }{\widehat{\mu }_{i}}\right) ^{2}\right] =\sum_{i=1}^{n}\frac{1}{\widehat{ \mu }_{i}^2}\frac{1}{y_{i}}\left( \widehat{\mu }_{i}-y_{i}\right) ^{2}. \end{aligned}

    This gives the desired result.

Fit a GLM and Evaluate the quality of a model

    1. For a Poisson(\mu) distribution, its density can be expressed as \begin{aligned} f_Y(y)&=\frac{\mathrm{e}^{-\mu}\mu^y}{y!} \\ &=\frac{1}{y!}\exp{(-\mu+y\ln{\mu})} \\ &=\exp{(y\ln{\mu}-\mu-\ln{y!})} \end{aligned} Thus, we see it belongs to the exponential dispersion family with \theta=\ln{\mu},\;\psi=1,\;b(\theta)=\mathrm{e}^{\theta}=\mu,\;c(y;\psi)=-\ln{y!}

    2. Recall that the scaled deviance for any member of the Exponential Dispersion family has the form \begin{aligned} \frac{D}{\psi}&=2\left[\ell\left(\widetilde{\mathbf{\theta}};\mathbf{y}\right)-\ell \left(\widehat{\mathbf{\theta}};\mathbf{y}\right)\right] \\ &=\frac{2}{\psi}\sum_{i=1}^{n}\left[\left(y_{i}\widetilde{\theta}_{i}-b\left(\widetilde{\theta}_{i}\right)\right) -\left(y_{i}\widehat{\theta }_{i}-b\left(\widehat{\theta }_{i}\right)\right)\right] \end{aligned} where for the Poisson, we have, from part 1 above, that \theta=\ln{\mu},\;\psi=1,\;b(\theta)=\mathrm{e}^{\theta}=\mu,\;c(y;\psi)=-\ln{y!} Note that \widetilde{\theta}_i=\ln{y_i} under the full model. Thus the deviance can be expressed as \begin{aligned} D&=2\sum_{i=1}^{n}{[(y_i\ln{y_i}-y_i)-(y_i\ln{\widehat{\mu}_i}-\widehat{\mu}_i)]} \\ &=2\sum_{i=1}^{n}{[y_i(\ln{y_i}-\ln{\widehat{\mu}_i})-(y_i-\widehat{\mu}_i)]} \end{aligned} This gives us the desired form of the deviance of the Poisson.

    3. First, we summarize the deviance statistics below, where each row adds one term to the model of the row above:

      Model Deviance Difference df ROT: accept term if D_1-D_2>2(p-q) Significant?
      intercept 199.9268 - - - -
      + sex 194.5739 5.3529 1 Accept Yes
      + veh_age 194.5571 0.0168 2 Reject No
      + driver_age 188.9746 5.5825 1 Accept Yes
      + loyalty 188.3470 0.6276 1 Reject No

      [ROT means Rule-of-Thumb, so there is no need to look up a chi-square table.] Thus, the GLM model appears adequate, with SEX and DRIVER_AGE considered significant predictor variables for claims.
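
      The rule-of-thumb comparison can be reproduced in R from the deviances quoted above:

      dev <- c(intercept = 199.9268, sex = 194.5739, veh_age = 194.5571,
        driver_age = 188.9746, loyalty = 188.3470)
      df <- c(sex = 1, veh_age = 2, driver_age = 1, loyalty = 1)
      decrease <- -diff(dev)  # drop in deviance as each term is added
      decrease > 2 * df       # TRUE = accept the added term under the ROT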

    4. Overdispersion is a phenomenon that sometimes occurs in data modeled using the Poisson distribution. Because the mean and the variance of a Poisson are both equal to the Poisson parameter, an observed variance much larger than the observed mean signals a potential overdispersion problem. If the estimate of the dispersion after fitting the data, as measured by either the deviance or Pearson’s chi-square divided by the degrees of freedom, is not near 1, then the data may be overdispersed (dispersion estimate greater than 1) or underdispersed (dispersion estimate less than 1). A convenient and simple way to model this situation is to allow the variance function of the Poisson distribution to have a multiplicative overdispersion factor \phi, i.e. V(\mu)=\phi\mu, as in the example in class.
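
      As a minimal sketch on simulated data (not the exam's data set), counts with extra-Poisson variation fitted as a Poisson GLM give a dispersion estimate well above 1:

      set.seed(1)
      y <- rnbinom(500, mu = 5, size = 1)   # negative binomial: variance exceeds the mean
      fit <- glm(y ~ 1, family = poisson)
      # Pearson chi-square over residual df; near 1 for genuinely Poisson data
      sum(resid(fit, type = "pearson")^2) / fit$df.residual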

    1. The likelihood is \prod\nolimits_{i,j}\dfrac{\mu_{ij}^{y_{ij}}\mathrm{e}^{-\mu_{ij}}}{y_{ij}!} and the log-likelihood is therefore

      \begin{aligned} \ell \left( \boldsymbol{\beta}\right) &=\sum_{i=1}^{n}\sum_{j=1}^{m}\left( y_{ij}\log \mu _{ij}-\mu _{ij}-\log y_{ij}!\right) \\ &=\sum_{i=1}^{n}\sum_{j=1}^{m}\left( y_{ij}\beta _{i}-\mathrm{e}^{\beta _{i}}-\log y_{ij}!\right) \\ &=\sum_{i=1}^{n}\beta_i\sum_{j=1}^{m}y_{ij}-m\sum_{i=1}^{n}\mathrm{e}^{\beta_i}-\sum_{i=1}^{n}\sum_{j=1}^{m}\log (y_{ij}!). \end{aligned}

      Applying the first-order condition:

      \frac{\partial \ell \left( \boldsymbol{\beta}\right) }{\partial \beta _{i}}=\sum_{j=1}^{m}y_{ij}-m\mathrm{e}^{\beta _{i}}=0 so that

      \mathrm{e}^{\beta_{i}}=\frac{1}{m}\sum_{j=1}^{m}y_{ij} \triangleq \overline{y}_{i} and the MLE is

      \widehat{\beta }_{i}=\log \overline{y}_{i}.
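
      A minimal simulation sketch of this result (hypothetical data, purely for illustration): a Poisson GLM with log link and one coefficient per risk recovers \widehat{\beta}_i=\log{\overline{y}_i}.

      set.seed(2)
      g <- gl(3, 20)                             # 3 risks observed for 20 months each (hypothetical)
      y <- rpois(60, lambda = c(2, 4, 8)[g])     # simulated monthly claim counts
      fit <- glm(y ~ g - 1, family = poisson)    # log link, one coefficient per risk
      cbind(coef(fit), log(tapply(y, g, mean)))  # the two columns agree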

    2. The deviance is

      \begin{aligned} 2\left[ \ell \left( y;y \right) -\ell \left( y;\mu \right) \right] &=2\left[ \begin{array}{c} \sum_{i=1}^{n}\sum_{j=1}^{m}\left( y_{ij}\log y_{ij}-y_{ij}-\log y_{ij}!\right) \\ -\sum_{i=1}^{n}\sum_{j=1}^{m}\left( y_{ij}\log \overline{y}_{i}-\overline{y} _{i}-\log y_{ij}!\right) \end{array} \right] \\ &=2\sum_{i=1}^{n}\sum_{j=1}^{m}\left( y_{ij}\log \frac{y_{ij}}{\overline{y} _{i}}-\left( y_{ij}-\overline{y}_{i}\right) \right) . \end{aligned}

    3. The contribution to the deviance in this case is

      \begin{aligned} D_{ij} &=2\left(y_{ij}\log \frac{y_{ij}}{\overline{y}_{i}}-\left( y_{ij}- \overline{y}_{i}\right) \right) \\ &=2\cdot\left( 9\log \frac{9}{17.45}-\left( 9-17.45\right)\right) =4.98. \end{aligned}
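
      A quick arithmetic check in R:

      ybar <- 17.45
      y <- 9
      2 * (y * log(y / ybar) - (y - ybar))  # contribution to the deviance, approx. 4.98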

    1. If Y has a Poisson distribution with mean parameter \mu, then its density can be written as f\left( y;\mu \right) =\mathrm{e}^{-\mu }\mu ^{y}/y!=\exp \left( \frac{y\log \mu -\mu}{1}-\log y!\right) which is of the exponential dispersion family form. The link function is the log so that g(\mu)=\log{\mu} and the linear predictor is \eta=\log{\mu}=\alpha_i. So this is a Generalized Linear Model.

    2. The likelihood is given by \prod\nolimits_{i=1}^{3}\prod\nolimits_{j=1}^{m}\frac{\mu_{ij}^{y_{ij}}\mathrm{e}^{-\mu _{ij}}}{y_{ij}!} so that the log-likelihood is \sum_{i=1}^{3}\sum_{j=1}^{m}\left( y_{ij}\log \mu_{ij}-\mu _{ij}-\log{y_{ij}!}\right) . In terms of \alpha_i, we re-write this as \ell \left( \alpha _{1},\alpha_{2},\alpha _{3}\right)=-\sum_{i=1}^{3}m\mathrm{e}^{\alpha _{i}}+\sum_{i=1}^{3}y_{i+}\alpha_{i}+\text{constant} where y_{i+} refers to the sum of the observations in the ith group. Differentiating, we get \frac{\partial \ell \left( \alpha_{i}\right) }{\partial \alpha _{i}}=-m\mathrm{e}^{\alpha_{i}}+y_{i+}=0 so that the maximum likelihood estimator of \alpha_i is \widehat{\alpha }_{i}=\log \left( y_{i+}/m\right) .

    3. In comparing the models, notice the nesting: Model 3 is the smallest and is contained in Model 2, which is contained in Model 1. We may use our Rule of Thumb: a significant improvement if the decrease in deviance is larger than twice the number of additional parameters. Here we summarize in table form:

      Model Deviance Difference d.f. D_{1}-D_{2}>2\left(p-q\right)? Significant improvement?
      Model 3 72.53 - - - -
      Model 2 61.64 10.89 1 Yes Yes
      Model 1 60.40 1.24 1 No No

      So Model 2 is a significant improvement over Model 3, but Model 1 is not a significant improvement over Model 2. Now, regarding interpretation of the models: Model 3 says that there is no difference in the average number of claims for the three age groups. Model 2 says that there is no difference in the average number of claims between age groups 1 and 2, but that the third age group may be different. Model 1 allows for a different average number of claims for each age group.
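
      The same conclusions follow from the formal chi-square test (each comparison here has 1 degree of freedom):

      pchisq(72.53 - 61.64, df = 1, lower.tail = FALSE)  # Model 3 vs Model 2: p approx. 0.001
      pchisq(61.64 - 60.40, df = 1, lower.tail = FALSE)  # Model 2 vs Model 1: p approx. 0.27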

  1. We know that the linear predictor, for the ith observation, is \eta _{i}=\log \mu_{i}=\sum\limits_{j}x_{ij}\beta _{j}=x_{i}^{T}\beta\text{ (in vector form).} Thus, E\left( y_{i}\right) =\mu_{i}=\mathrm{e}^{x_{i}^{T}\beta }, and therefore the predicted values are E\left( y_{3}\right) =\mathrm{e}^{3.41}=30.27 and E\left( y_{7}\right) =\mathrm{e}^{2.45}=11.59.
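
    The fitted values can be checked directly in R:

    exp(c(3.41, 2.45))  # 30.27 and 11.59, the fitted claim sizes for observations 3 and 7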

Applied questions

    1. Firstly, call install.packages("insuranceData") if you have not already done so. Then:

      library(insuranceData)
      data("dataCar")
      
      # To get some impression on the structure of the data,
      # i.e., what variables are there
      names(dataCar)
       [1] "veh_value" "exposure"  "clm"       "numclaims" "claimcst0" "veh_body" 
       [7] "veh_age"   "gender"    "area"      "agecat"    "X_OBSTAT_"

      Note that the response (the Y variable) takes only the values 0 and 1. This means that (1) a binomial distribution with n=1 and success probability p is a reasonable model, and (2) the mean of the response, \mathbb{E}[Y|X], is precisely p. Therefore, a logistic regression means we choose the link function g such that

      \log\Big(\frac{p}{1-p}\Big)=g(p)=\eta=\alpha+\beta_1X+\beta_2X^2+\beta_3X^3, where X is the vehicle value and we are considering the cubic formulation.

      We can now fit a glm using the following command.

      car.glm1 <- glm(clm ~ veh_value + I(veh_value^2) + I(veh_value^3),
        family = binomial(link = "logit"), data = dataCar
      )
    2. summary(car.glm1)
      
      Call:
      glm(formula = clm ~ veh_value + I(veh_value^2) + I(veh_value^3), 
          family = binomial(link = "logit"), data = dataCar)
      
      Coefficients:
                       Estimate Std. Error z value Pr(>|z|)    
      (Intercept)    -2.9247606  0.0476282 -61.408  < 2e-16 ***
      veh_value       0.2605947  0.0420331   6.200 5.66e-10 ***
      I(veh_value^2) -0.0382409  0.0084167  -4.543 5.53e-06 ***
      I(veh_value^3)  0.0008803  0.0002752   3.199  0.00138 ** 
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
      
      (Dispersion parameter for binomial family taken to be 1)
      
          Null deviance: 33767  on 67855  degrees of freedom
      Residual deviance: 33711  on 67852  degrees of freedom
      AIC: 33719
      
      Number of Fisher Scoring iterations: 6

      The fit shows that all of the coefficients are significant at the 1% level, since every p-value is smaller than 0.01.

    3. # Linear
      car.glmLinear <- glm(clm ~ veh_value,
        family = binomial(link = "logit"),
        data = dataCar
      )
      # Quadratic
      car.glmQuadratic <- glm(clm ~ veh_value + I(veh_value^2),
        family = binomial(link = "logit"), data = dataCar
      )
      # Cubic
      car.glmCubic <- glm(clm ~ veh_value + I(veh_value^2) + I(veh_value^3),
        family = binomial(link = "logit"), data = dataCar
      )
      ## AIC
      print(car.glmLinear$aic)
      [1] 33749.12
      print(car.glmQuadratic$aic)
      [1] 33718.92
      print(car.glmCubic$aic)
      [1] 33718.72

      The difference between the AICs of the cubic and quadratic models is less than one: including the cubic term decreases the AIC by only 0.2. Therefore, by the principle of parsimony, the quadratic model is preferred. Further, the AIC of the quadratic model is considerably lower than that of the linear model (by about 30), suggesting that the linear model is inadequate.
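
      The same comparison can be produced in one call with the built-in AIC() function, assuming the three fits above are in the workspace:

      AIC(car.glmLinear, car.glmQuadratic, car.glmCubic)  # returns df and AIC for each model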

    1. claimsdata <- read.csv("Third_party_claims.csv")
      attach(claimsdata)
      plot(accidents, claims, xlab = "Accidents", ylab = "Claims")

      We can clearly see a concentration of points around the origin, which makes it difficult to discern the relationship between the predictor and the response. The data are also strongly heteroskedastic, i.e. more variable for higher values of the predictor, which violates the homoskedasticity assumption of the linear model.

    2. third.lm <- lm(claims ~ accidents, offset = log(population))
      plot(third.lm)

      The residuals-vs-fitted plot shows that the residuals clearly do not follow a standard normal distribution, and the variance appears to inflate as the fitted values increase. The diagnostic plots indicate a clear violation of the homoskedasticity assumption.

    3. third.poi <- glm(claims ~ log(accidents),
        family = poisson,
        offset = log(population)
      )
      summary(third.poi)
      
      Call:
      glm(formula = claims ~ log(accidents), family = poisson, offset = log(population))
      
      Coefficients:
                      Estimate Std. Error z value Pr(>|z|)    
      (Intercept)    -7.093809   0.026992 -262.81   <2e-16 ***
      log(accidents)  0.259103   0.003376   76.75   <2e-16 ***
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
      
      (Dispersion parameter for poisson family taken to be 1)
      
          Null deviance: 22393  on 175  degrees of freedom
      Residual deviance: 15837  on 174  degrees of freedom
      AIC: 17066
      
      Number of Fisher Scoring iterations: 4
      sum(resid(third.poi, type = "pearson")^2) / third.poi$df.residual
      [1] 101.7168

      The estimate of \psi is 101.7168. This severely inflated dispersion parameter suggests that there is overdispersion in the data.
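
      An alternative estimate of \psi uses the residual deviance instead of the Pearson statistic; from the output above it is roughly 15837/174 \approx 91, likewise far above 1:

      third.poi$deviance / third.poi$df.residual  # deviance-based estimate of psi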

    4. third.qpoi <- glm(claims ~ log(accidents),
        family = quasipoisson,
        offset = log(population)
      )
      summary(third.qpoi)
      
      Call:
      glm(formula = claims ~ log(accidents), family = quasipoisson, 
          offset = log(population))
      
      Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
      (Intercept)    -7.09381    0.27223 -26.058  < 2e-16 ***
      log(accidents)  0.25910    0.03405   7.609 1.66e-12 ***
      ---
      Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
      
      (Dispersion parameter for quasipoisson family taken to be 101.7172)
      
          Null deviance: 22393  on 175  degrees of freedom
      Residual deviance: 15837  on 174  degrees of freedom
      AIC: NA
      
      Number of Fisher Scoring iterations: 4

      The results are summarized below:

      Model Dispersion parameter \hat{\beta}_0 (s.e.) \hat{\beta}_1 (s.e.)
      Poisson \psi=1 -7.09381 (0.02699) 0.25910 (0.003376)
      Quasi-Poisson \hat{\psi}=101.7172 -7.09381 (0.27223) 0.25910 (0.03405)

      The quasi-Poisson estimates of \beta are identical to those of the Poisson model, but their standard errors are larger by a factor of \hat{\psi}^{1/2}=10.085.
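
      A quick check of this scaling, using the values from the two outputs above:

      sqrt(101.7172)            # 10.0855, the factor by which the s.e.'s are scaled
      0.02699 * sqrt(101.7172)  # approx. 0.2722, the quasi-Poisson s.e. of the intercept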