Linear Regression

Disclaimer

Some of the figures in this presentation are taken from “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani

Overview

  • Simple Linear Regression
  • Multiple Linear Regression
  • Linear model selection
  • Potential problems with Linear Regression
Reading

James et al (2021), Chapter 3, Chapter 6.1

Linear Regression

  • A classical and easily applicable approach for supervised learning
  • Useful tool for predicting a quantitative response
  • Model is easy to interpret
  • Many more advanced techniques can be seen as an extension of linear regression

Simple Linear Regression

Overview

Suppose we have pairs of data (y_1, x_1), (y_2, x_2), ..., (y_n, x_n) and we want to predict values of y_i based on x_i.

  • We could do a linear prediction: y_i = mx_i + b.
  • We could do a quadratic prediction: y_i = ax_i^2 + bx_i + c.
  • We could do a general non-linear function prediction: y_i = f(x_i).

All of these methods are examples of models we can specify. Let’s focus on the linear prediction. Some questions:

  • How do we choose m and b? There are infinitely many possibilities.
  • How do we know whether the line is a ‘good’ fit? And what do we mean by ‘good’?

Overview

Simple linear regression is a linear prediction.

  • Predict a quantitative response Y = (y_1, ..., y_n)^\top based on a single predictor variable X = (x_1,...,x_n)^{\top}
  • Assume the ‘true’ relationship between X and Y is linear: Y = \beta_0 + \beta_1 X + \epsilon,

where \epsilon = (\epsilon_1, ..., \epsilon_n)^{\top} is an error term with certain assumptions on it for identifiability reasons.

Advertising Example

\texttt{sales} \approx \beta_0 + \beta_1 \times \texttt{TV}

Assumptions on the errors

  • Weak assumptions: \mathbb{E}(\epsilon_i|X) = 0,\quad \mathbb{V}(\epsilon_i|X) = \sigma^2 \quad\text{and}\quad \mathrm{Cov}(\epsilon_i,\epsilon_j|X)=0, for i=1,2,\ldots,n and all i \neq j.

    In other words, errors have zero mean, common variance and are conditionally uncorrelated. Parameter estimation: Least Squares

  • Strong assumptions: \epsilon_i|X \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,\sigma^2) for i=1,2,\ldots,n. In other words, errors are i.i.d. Normal random variables with zero mean and constant variance. Parameter estimation: Maximum Likelihood or Least Squares

Model estimation

  • We have paired data (y_1, x_1), ..., (y_n, x_n).
  • We assume there is a ‘true’ relationship between the y_i and x_i described as

Y = \beta_0 + \beta_1 X + \epsilon,

  • And we assume \epsilon satisfies either the weak or strong assumptions.
  • How do we obtain estimates \hat{\beta}_0 and \hat{\beta}_1? If we have these estimates, we can make predictions of the mean response: \begin{aligned} \hat{y}_i &= \hat{\beta}_0 + \hat{\beta}_1 x_i, \end{aligned} which estimates \mathbb{E}[y_i |X] = \mathbb{E}[\beta_0 + \beta_1 x_i + \epsilon_i | X] = \beta_0 + \beta_1 x_i, where we used the fact that \mathbb{E}[\epsilon_i|X] = 0 and we estimate \beta_j by \hat{\beta}_j.

Least Squares Estimates (LSE)

  • Most common approach to estimating \hat{\beta}_0 and \hat{\beta}_1
  • Minimise the residual sum of squares (RSS) \mathrm{RSS} = \sum_{i = 1}^{n} e_i^2 = \sum_{i = 1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i = 1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2, where e_i = y_i - \hat{y}_i are the residuals
  • The least squares coefficient estimates are \begin{aligned} \hat{\beta}_1 &= \dfrac{\sum_{i = 1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}=\frac{S_{xy}}{S_{xx}} \\ \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} \end{aligned} where \bar{y} \equiv \frac{1}{n}\sum_{i=1}^{n}y_i and \bar{x} \equiv \frac{1}{n}\sum_{i=1}^{n}x_i. See slide on S_{xy}, S_{xx} and sample (co-)variances. Proof: See Lab questions. A short R sketch of these formulas follows.
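The following is a minimal R sketch of these formulas on simulated data (the seed and data-generating values are assumptions, chosen to mimic Example 1 later in the slides), checked against lm:

set.seed(1)                      # assumed seed, for reproducibility
n <- 30
x <- runif(n, 0, 10)
y <- 1 - 0.5 * x + rnorm(n)      # true beta0 = 1, beta1 = -0.5

Sxx <- sum((x - mean(x))^2)
Sxy <- sum((x - mean(x)) * (y - mean(y)))
beta1_hat <- Sxy / Sxx
beta0_hat <- mean(y) - beta1_hat * mean(x)
c(beta0_hat, beta1_hat)
coef(lm(y ~ x))                  # lm reproduces the manual estimates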

LS Demo

Least Squares Estimates (LSE) - Properties

Under the weak assumptions we have unbiased estimators:

  • \mathbb{E}\left[\hat{\beta }_{0}|X\right] =\beta _{0} \hskip5mm \text{and} \hskip5mm \mathbb{E}\left[\hat{\beta }_{1}|X\right] =\beta _{1}.

  • An (unbiased) estimator of \sigma ^{2} is given by: \begin{aligned} s^{2} &= \frac{\sum_{i=1}^{n}\left( y_{i}-\left(\hat{\beta }_{0}+\hat{\beta }_{1}x_{i}\right) \right)^{2}}{n-2} \end{aligned} Proof: See Lab questions.

  • What does this mean? On average, the LSE recovers the correct values of \beta_0 and \beta_1, provided the assumptions are satisfied.

  • How confident or certain are we in these estimates?

Least Squares Estimates (LSE) - Uncertainty

Under the weak assumptions, the (co-)variances of the parameter estimates are given by: \begin{aligned} \text{Var}\left(\hat{\beta }_{0}|X\right) &= \sigma^2\left(\frac{1}{n}+\frac{\overline{x}^2}{\sum_{i=1}^{n}(x_i-\overline{x})^2}\right)=\sigma^2\left(\frac{1}{n}+\frac{\overline{x}^2}{S_{xx}}\right) = SE(\hat{\beta}_0)^2\\ \text{Var}\left( \hat{\beta }_{1}|X\right) &= \frac{\sigma^2}{\sum_{i=1}^{n}(x_i-\overline{x})^2}=\frac{\sigma^2}{S_{xx}}=SE(\hat{\beta}_1)^2\\ \text{Cov}\left( \hat{\beta }_{0},\hat{\beta }_{1}|X\right) &= -\frac{ \overline{x} \sigma^2}{\sum_{i=1}^{n}(x_i-\overline{x})^2}=-\frac{ \overline{x} \sigma^2}{S_{xx}} \end{aligned}

Proof: See Lab questions. Verify yourself that all three quantities go to 0 as n gets larger. A short R sketch of the standard errors follows.
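As a check, s^2 and the standard errors can be computed directly and compared with lm's output; the chunk below re-simulates the same illustrative data so it runs on its own (seed and data-generating values are assumptions):

set.seed(1); n <- 30
x <- runif(n, 0, 10); y <- 1 - 0.5 * x + rnorm(n)
fit <- lm(y ~ x)
Sxx <- sum((x - mean(x))^2)

s2 <- sum(residuals(fit)^2) / (n - 2)           # unbiased estimate of sigma^2
se_beta1 <- sqrt(s2 / Sxx)
se_beta0 <- sqrt(s2 * (1 / n + mean(x)^2 / Sxx))
c(se_beta0, se_beta1)
summary(fit)$coefficients[, "Std. Error"]       # same values reported by lm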

Maximum Likelihood Estimates (MLE)

  • In the regression model there are three parameters to estimate: \beta_0, \beta_1, and \sigma^{2}.
  • Under the strong assumptions (i.i.d. Normal RVs), the joint density of Y_{1},Y_{2},\ldots,Y_{n} is the product of their marginals (independent by assumption), so the log-likelihood is: \begin{aligned} \ell\left({y};\beta_0 ,\beta_1 ,\sigma \right) =&-n\log\left( \sqrt{2\pi} \sigma\right) -\frac{1}{2\sigma^{2}}\sum_{i=1}^{n}\left( y_{i}-\left( \beta_0 +\beta_1 x_{i}\right) \right) ^{2}. \end{aligned}

Proof: Since Y = \beta_0 + \beta_1 X + \epsilon, where \epsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,\sigma^2), then y_i \overset{\text{i.i.d.}}{\sim} N\left( \beta_0 + \beta_1 x_i, \sigma^2 \right). The result follows.

Maximum Likelihood Estimates (MLE)

Partial derivatives set to zero give the following MLEs: \begin{aligned} \hat{\beta}_1=&\frac{\sum_{i=1}^{n}\left( x_{i}-\overline{x}\right) \left( y_{i}-\overline{y}\right) }{\sum_{i=1}^{n}\left( x_{i}-\overline{x}\right) ^{2}}=\frac{S_{xy}}{S_{xx}},\\ \hat{\beta}_0=&\overline{y}-\hat{\beta}_1\overline{x}, \end{aligned} and \hat{\sigma }_{\text{MLE}}^{2}=\frac{1}{n}\sum_{i=1}^{n}\left( y_{i}-\left( \hat{\beta}_0+\hat{\beta}_1x_{i}\right) \right) ^{2}.

  • Note that \beta_0 and \beta_1 have the same estimators as those produced by Least Squares.
  • However, the MLE \hat{\sigma}^{2} is a biased estimator of \sigma^{2}.
  • In practice, we use the unbiased variant s^2 (see slide).

Interpretation of parameters

How do we interpret a linear regression model such as \hat{\beta}_0 = 1 and \hat{\beta}_1 = -0.5?

  • The intercept parameter \hat{\beta}_0 is interpreted as the value we would predict if x_i = 0.
    • E.g., predict y_i = 1 if x_i = 0.
  • The slope parameter \hat{\beta}_1 is interpreted as the expected change in the mean response of y_i for a 1 unit increase in x_i.
    • E.g., we would expect y_i to decrease on average by 0.5 for every 1 unit increase in x_i.

Example 1

The below data was generated by Y = 1 - 0.5\times X + \epsilon where X \sim U[0,10] and \epsilon \sim N(0,1) with n = 30.

Estimates of Beta_0 and Beta_1:
 1.309629 -0.5713465 
Standard error of the estimates:
 0.346858 0.05956626 

Example 2

The below data was generated by Y = 1 - 0.5\times X + \epsilon where X \sim U[0,10] and \epsilon \sim N(0,1) with n = 5000.

Estimates of Beta_0 and Beta_1:
 1.028116 -0.5057372 
Standard error of the estimates:
 0.02812541 0.00487122 

Example 3

The below data was generated by Y = 1 - 0.5\times X + \epsilon where X \sim U[0,10] and \epsilon \sim N(0,100) with n = 30.

Estimates of Beta_0 and Beta_1:
 -2.19991 -0.4528679 
Standard error of the estimates:
 3.272989 0.5620736 

Example 4

The below data was generated by Y = 1 - 0.5\times X + \epsilon where X \sim U[0,10] and \epsilon \sim N(0,100) with n = 5000.

Estimates of Beta_0 and Beta_1:
 1.281162 -0.5573716 
Standard error of the estimates:
 0.2812541 0.0487122 

Example 5

The below data was generated by Y = 1 - 40\times X + \epsilon where X \sim U[0,10] and \epsilon \sim N(0,100) with n = 30.

Estimates of Beta_0 and Beta_1:
 4.096286 -40.71346 
Standard error of the estimates:
 3.46858 0.5956626 

Assessing the models

  • How do we know which model estimates are reasonable?
    • Estimates for examples 1, 2 and 4 seem very good (low bias and low standard error)
    • However we are less confident in example 3 (low bias but high standard error)
    • Pretty confident in example 5 despite a similar standard error to example 3.
    • Can we quantify this uncertainty in terms of confidence intervals / hypothesis testing?
  • Consider the next example, it has low variance but it doesn’t look ‘right’.

Example 6

The below data was generated by Y = 1 + 0.2 \times X^2 + \epsilon where X \sim U[0,10] and \epsilon \sim N(0,0.01) with n = 30.

Estimates of Beta_0 and Beta_1:
 -2.32809 2.000979 
Variances of the estimates:
 0.01808525 0.0005420144 

Assessing the Accuracy I

  • How to assess the accuracy of the coefficient estimates? In particular, consider the following questions:
    • What are the confidence intervals for {\beta}_0 and {\beta}_1?
    • How to test the null hypothesis that there is no relationship between X and Y?
    • How to test if the influence of the exogenous variable (X) on the endogenous variable (Y) is larger/smaller than some value?
Note

For inference (e.g. confidence intervals, hypothesis tests), we need the strong assumptions!

Assessing the Accuracy of the Coefficient Estimates - Confidence Intervals

Using the strong assumptions, a 100\left( 1-\alpha \right) \% confidence interval (CI) for \beta_1, and resp. for \beta_0, are given by:

  • for \beta_1:

\hat{\beta}_1\pm t_{1-\alpha /2,n-2}\cdot \underbrace{\dfrac{s}{\sqrt{S_{xx}}}}_{\hat{SE}(\hat{\beta}_1)}

  • for \beta_0:

\hat{\beta}_0\pm t_{1-\alpha /2,n-2}\cdot \underbrace{s\sqrt{\frac{1}{n}+\frac{\overline{x}^2}{S_{xx}}}}_{\hat{SE}(\hat{\beta}_0)}

See rationale slide.
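As a numerical check, these intervals can be computed by hand and compared with R's confint(); the simulated data below is an assumption for illustration:

set.seed(1); n <- 30                       # same illustrative data as the earlier sketches
x <- runif(n, 0, 10); y <- 1 - 0.5 * x + rnorm(n)
fit <- lm(y ~ x)
Sxx <- sum((x - mean(x))^2)
s   <- summary(fit)$sigma                  # unbiased s, i.e. sqrt(RSS / (n - 2))

tcrit <- qt(0.975, df = n - 2)             # 95% two-sided critical value
b <- coef(fit)
ci_beta1 <- b[2] + c(-1, 1) * tcrit * s / sqrt(Sxx)
ci_beta0 <- b[1] + c(-1, 1) * tcrit * s * sqrt(1 / n + mean(x)^2 / Sxx)
rbind(ci_beta0, ci_beta1)
confint(fit, level = 0.95)                 # should agree with the manual intervals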

Assessing the Accuracy of the Coefficient Estimates - Inference on the slope

  • When we want to test whether the exogenous variable has an influence on the endogenous variable or if the influence is larger/smaller than some value.

  • For testing the hypothesis H_{0}:\beta_1 =\widetilde{\beta}_1 \quad\text{vs}\quad H_{1}:\beta_1 \neq \widetilde{\beta}_1 for some constant \widetilde{\beta}_1, we use the test statistic: t(\hat{\beta}_1) = \dfrac{\hat{\beta}_1-\widetilde{\beta}_1}{\hat{SE}(\hat{\beta}_1)}=\dfrac{\hat{\beta}_1-\widetilde{\beta}_1}{\left( s\left/ \sqrt{S_{xx}}\right. \right)} which has a t_{n-2} distribution under H_0 (see rationale slide).

  • The construction of the hypothesis test is the same for \beta_0.

Assessing the Accuracy of the Coefficient Estimates - Inference on the slope

The decision rules under various alternative hypotheses are summarized below.

Decision Making Procedures for Testing H_{0}:\beta_1=\widetilde{\beta}_1
Alternative H_{1}                    Reject H_{0} in favor of H_{1} if
\beta_1 \neq \widetilde{\beta}_1     \left\vert t\left( \hat{\beta}_1\right) \right\vert > t_{1-\alpha /2,n-2}
\beta_1 > \widetilde{\beta}_1        t\left( \hat{\beta}_1\right) > t_{1-\alpha ,n-2}
\beta_1 < \widetilde{\beta}_1        t\left( \hat{\beta}_1\right) < -t_{1-\alpha ,n-2}
  • Typically only interested in testing H_0: \beta_1 = 0 vs. H_1: \beta_1 \neq 0, as this informs us whether our \beta_1 is significantly different from 0.
    • I.e., including the slope parameter is worth it!
  • Similar construction for \beta_0 test, and again typically only test against 0.
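A short R sketch of the default test H_0: \beta_1 = 0 (two-sided), computed by hand and read off summary(); the simulated data is again an assumption for illustration:

set.seed(1); n <- 30                            # same illustrative data as before
x <- runif(n, 0, 10); y <- 1 - 0.5 * x + rnorm(n)
fit <- lm(y ~ x)
Sxx <- sum((x - mean(x))^2)
s   <- summary(fit)$sigma

t_stat <- coef(fit)[2] / (s / sqrt(Sxx))        # test statistic for H0: beta1 = 0
p_val  <- 2 * pt(-abs(t_stat), df = n - 2)      # two-sided p-value
c(t = unname(t_stat), p = unname(p_val))
summary(fit)$coefficients["x", ]                # same t value and Pr(>|t|)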

Example 1 - Hypothesis testing

The below data was generated by Y = 1 - 0.5\times X + \epsilon where X \sim U[0,10] and \epsilon \sim N(0,1) with n = 30.


Call:
lm(formula = Y ~ X)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.8580 -0.7026 -0.1236  0.5634  1.8463 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.30963    0.34686   3.776 0.000764 ***
X           -0.57135    0.05957  -9.592  2.4e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9738 on 28 degrees of freedom
Multiple R-squared:  0.7667,    Adjusted R-squared:  0.7583 
F-statistic:    92 on 1 and 28 DF,  p-value: 2.396e-10

Example 2 - Hypothesis testing

The below data was generated by Y = 1 - 0.5\times X + \epsilon where X \sim U[0,10] and \epsilon \sim N(0,1) with n = 5000.


Call:
lm(formula = Y ~ X)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1179 -0.6551 -0.0087  0.6655  3.4684 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.028116   0.028125   36.55   <2e-16 ***
X           -0.505737   0.004871 -103.82   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9945 on 4998 degrees of freedom
Multiple R-squared:  0.6832,    Adjusted R-squared:  0.6831 
F-statistic: 1.078e+04 on 1 and 4998 DF,  p-value: < 2.2e-16

Example 3 - Hypothesis testing

The below data was generated by Y = 1 - 0.5\times X + \epsilon where X \sim U[0,10] and \epsilon \sim N(0,100) with n = 30.


Call:
lm(formula = Y ~ X)

Residuals:
    Min      1Q  Median      3Q     Max 
-20.306  -5.751  -2.109   5.522  27.049 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -2.1999     3.2730  -0.672    0.507
X            -0.4529     0.5621  -0.806    0.427

Residual standard error: 9.189 on 28 degrees of freedom
Multiple R-squared:  0.02266,   Adjusted R-squared:  -0.01225 
F-statistic: 0.6492 on 1 and 28 DF,  p-value: 0.4272

Example 4 - Hypothesis testing

The below data was generated by Y = 1 - 0.5\times X + \epsilon where X \sim U[0,10] and \epsilon \sim N(0,100) with n = 5000.


Call:
lm(formula = Y ~ X)

Residuals:
    Min      1Q  Median      3Q     Max 
-31.179  -6.551  -0.087   6.655  34.684 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.28116    0.28125   4.555 5.36e-06 ***
X           -0.55737    0.04871 -11.442  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.945 on 4998 degrees of freedom
Multiple R-squared:  0.02553,   Adjusted R-squared:  0.02533 
F-statistic: 130.9 on 1 and 4998 DF,  p-value: < 2.2e-16

Example 5 - Hypothesis testing

The below data was generated by Y = 1 - 40\times X + \epsilon where X \sim U[0,10] and \epsilon \sim N(0,100) with n = 30.


Call:
lm(formula = Y ~ X)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.580  -7.026  -1.236   5.634  18.463 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.0963     3.4686   1.181    0.248    
X           -40.7135     0.5957 -68.350   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.738 on 28 degrees of freedom
Multiple R-squared:  0.994, Adjusted R-squared:  0.9938 
F-statistic:  4672 on 1 and 28 DF,  p-value: < 2.2e-16

Example 6 - Hypothesis testing

The below data was generated by Y = 1 + 0.2 \times X^2 + \epsilon where X \sim U[0,10] and \epsilon \sim N(0,0.01) with n = 30.


Call:
lm(formula = Y ~ X)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.8282 -1.3467 -0.4217  1.1207  3.4041 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.32809    0.13448  -17.31   <2e-16 ***
X            2.00098    0.02328   85.95   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.506 on 498 degrees of freedom
Multiple R-squared:  0.9368,    Adjusted R-squared:  0.9367 
F-statistic:  7387 on 1 and 498 DF,  p-value: < 2.2e-16

Summary of hypothesis tests

Below is the summary of the hypothesis tests for whether \beta_j are statistically different from 0 for the six examples at the 5% level.

Example    1 2 3 4 5 6
\beta_0    Y Y N Y N Y
\beta_1    Y Y N Y Y Y

Does that mean the models that are significant at 5% for both \beta_0 and \beta_1 are equivalently ‘good’ models?

  • No! Example 6 is significant but clearly the underlying relationship is not linear.

Assessing the accuracy of the model

We have the following so far:

  • Data plotting with model predictions overlayed.
  • Estimates of a linear model coefficients \hat{\beta}_0 and \hat{\beta}_1.
  • Standard errors and hypothesis tests on the coefficients.

But how do we assess whether a model is ‘good’ or ‘accurate’? Example 5 looks arguably the best while clearly example 6 is by far the worst.

Assessing the Accuracy of the Model

Partitioning the variability is used to assess how well the linear model explains the trend in data:

\underset{\text{total deviation}}{\underbrace{y_{i}-\overline{y}}}=\underset{\text{unexplained deviation}}{\underbrace{\left( y_{i}-\hat{y}_{i}\right) }}+\underset{\text{explained deviation}}{\underbrace{\left( \hat{y}_{i}-\overline{y}\right). }}

We then obtain:

\underset{\text{TSS}}{\underbrace{\overset{n}{\underset{i=1}{\sum }}\left( y_{i}-\overline{y}\right) ^{2}}}=\underset{\text{RSS}}{\underbrace{\overset{n}{\underset{i=1}{\sum }}\left( y_{i}-\hat{y}_{i}\right) ^{2}}}+\underset{\text{SSM}}{\underbrace{\overset{n}{\underset{i=1}{\sum }}\left( \hat{y}_{i}-\overline{y}\right) ^{2}}},

where:

  • TSS: total sum of squares;
  • RSS: sum of squares error or residual sum of squares;
  • SSM: sum of squares model (sometimes called the regression sum of squares).

Proof: See Lab questions

Assessing the Accuracy of the Model

Interpret these sums of squares as follows:

  • TSS is the total variability in the absence of knowledge of the variable X. It is the total square deviation away from its average;
  • RSS is the total variability remaining after introducing the effect of X;
  • SSM is the total variability “explained” because of knowledge of X.

This partitioning of the variability is used in ANOVA tables:

Source      Sum of squares                                       DoF               Mean square                                  F
Regression  \text{SSM}=\sum_{i=1}^n (\hat{y}_i-\overline{y})^2   \text{DFM}=1      \text{MSM}=\frac{\text{SSM}}{\text{DFM}}     \frac{\text{MSM}}{\text{MSE}}
Error       \text{RSS}=\sum_{i=1}^n (y_{i}-\hat{y}_i)^2          \text{DFE}=n-2    \text{MSE}=\frac{\text{RSS}}{\text{DFE}}
Total       \text{TSS}=\sum_{i=1}^n (y_{i}-\overline{y})^2       \text{DFT}=n-1    \text{MST}=\frac{\text{TSS}}{\text{DFT}}

Assessing the Accuracy of the Model

Noting that:

\begin{aligned} \text{RSS}=\underset{=\text{TSS}}{\underbrace{S_{yy}}}-\underset{=\text{SSM}}{\underbrace{\hat{\beta}_1S_{xy}}}, \end{aligned} we can define the \text{R}^2 statistic as: \begin{aligned} R^2= \left(\frac{S_{xy}}{\sqrt{S_{xx}\cdot S_{yy}}} \right)^2= \hat{\beta}_1 \frac{S_{xy}}{S_{yy}} = \frac{\hat{\beta}_1 S_{xy}}{\text{TSS}} = \frac{\text{SSM}}{\text{TSS}} = 1-{\text{RSS}\over \text{TSS}}. \end{aligned}

  • R^2 is interpreted as the proportion of total variation in the y_i’s explained by the variable x in a linear regression model.
  • R^2 is the square of the sample correlation between Y and X in simple linear regression.
    • Hence takes a value between 0 and 1.

Proof: See Lab questions
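The decomposition and the R^2 identities can be verified numerically; the chunk below uses the same illustrative simulated data as the earlier sketches (an assumption, not the slides' exact numbers):

set.seed(1); n <- 30
x <- runif(n, 0, 10); y <- 1 - 0.5 * x + rnorm(n)
fit <- lm(y ~ x)

TSS <- sum((y - mean(y))^2)
RSS <- sum((y - fitted(fit))^2)
SSM <- sum((fitted(fit) - mean(y))^2)
c(TSS = TSS, RSS_plus_SSM = RSS + SSM)    # TSS = RSS + SSM
c(R2_manual = SSM / TSS,
  R2_lm     = summary(fit)$r.squared,
  corr_sq   = cor(x, y)^2)                # all three agree in simple linear regression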

Summary of R^2 from the six examples

Below is a table of the R^2 for all of the six examples:

1 2 3 4 5 6
R^2 0.76 0.68 0.02 0.03 0.99 0.89
  • The R^2 values for examples 1 and 2, and for examples 3 and 4, are more or less equivalent within each pair.
    • As expected since we only changed n.
  • Example 5 has the highest R^2 despite having an insignificant \beta_0.
  • Example 6 has a higher R^2 than examples 1-4, despite the relationship clearly not being linear.
  • Example 6 does not satisfy either the weak or strong assumptions, so the results cannot be trusted. (More on this later.)
  • There is more to modelling than looking at numbers!

Multiple Linear Regression

Overview

  • Extend the simple linear regression model to accommodate multiple predictors Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon
    • Recall Y = (y_1, ..., y_n)^{\top} and we denote X_j = (x_{1j}, x_{2j},...,x_{nj})^{\top}.
    • Data is now paired as (y_1, x_{11}, x_{12}, ..., x_{1p}), ..., (y_n, x_{n1},...,x_{np}).
  • \beta_j: the average effect on y_i of a one unit increase in x_{ij}, holding all other predictors x_{ik}, k \neq j, fixed.
  • Instead of fitting a line, we are now fitting a (hyper-)plane
  • Important note: If we denote \boldsymbol{x}_i to be the i’th row of X, you should observe that the response is still linear in the parameters, since y_i = \boldsymbol{x}_i\beta + \epsilon_i.

Advertising Example

\texttt{sales} \approx \beta_0 + \beta_1 \times \texttt{TV}+ \beta_2 \times \texttt{radio}

Linear Algebra and Matrix Approach

The model can be re-written as: Y = X\beta + \epsilon with \beta = (\beta_0, \beta_1, ... , \beta_p)^{\top}, and with Y and \epsilon defined as in simple linear regression. The matrix X is given by X=\left[ \begin{array}{ccccc} 1 & x_{11} & x_{12} & \ldots & x_{1p} \\ 1 & x_{21} & x_{22} & \ldots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \ldots & x_{np} \end{array} \right] Note that the matrix X is of size n \times (p+1) and \beta is a (p+1)-dimensional column vector.

  • Verify all the dimensions make sense, expand it! Also verify simple linear regression can be recovered from this notation.
  • Take careful note of the notation in different contexts. Here X is a matrix, while in simple linear regression it was a column vector. Depending on the context it should be obvious which is which.

Assumptions of the Model

Weak Assumptions:

The error terms \epsilon_{i} satisfy the following: \begin{aligned} \begin{array}{rll} \mathbb{E}[ \epsilon _{i}|X] =&0, &\text{ for }i=1,2,\ldots,n; \\ \text{Var}( \epsilon _{i}|X) =&\sigma ^{2}, &\text{ for }i=1,2,\ldots,n;\\ \text{Cov}( \epsilon _{i},\epsilon _{j}|X)=&0, &\text{ for all }i\neq j. \end{array} \end{aligned}

In words, the errors have zero means, common variance, and are uncorrelated. In matrix form, we have: \begin{aligned} \mathbb{E}\left[ {\epsilon }\right] = \underline{0}; \qquad \text{Cov}\left( {\epsilon }\right) =\sigma^{2} I_n, \end{aligned} where I_n is the n\times n identity matrix.

Strong Assumptions: \epsilon_{i}|X \stackrel{\text{i.i.d}}{\sim} \mathcal{N}(0, \sigma^2).

In words, errors are i.i.d. normal random variables with zero mean and constant variance.

Least Squares Estimates (LSE)

  • Same least squares approach as in Simple Linear Regression
  • Minimise the residual sum of squares (RSS) \begin{aligned} \text{RSS} &= \sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}=\sum_{i=1}^{n}\left( y_{i}-\hat{\beta} _{0}-\hat{\beta} _{1}x_{i1}- \ldots -\hat{\beta} _{p}x_{ip}\right)^{2}\\ &= \left( Y-X\hat{\beta}\right)^{\top}\left( Y-X\hat{\beta}\right)=\sum_{i=1}^{n} \hat{\epsilon}_{i}^{2}. \end{aligned}
  • If \left(X^{\top}X\right)^{-1} exists, it can be shown that the solution is given by: \hat{\beta}=\left(X^{\top} X \right)^{-1} X^{\top} Y.
  • The corresponding vector of fitted (or predicted) values is \hat{Y}=X\hat{\beta}.

Least Squares Estimates (LSE) - Properties

Under the weak assumptions we have unbiased estimators:

  1. The least squares estimators are unbiased: \mathbb{E}[ \hat{\beta}] = \beta.
  2. The variance-covariance matrix of the least squares estimators is: \text{Var}( {\hat{\beta}}) = \sigma^{2}\times \left(X^{\top}X \right)^{-1}
  3. An unbiased estimator of \sigma^{2} is: s^{2}=\dfrac{1}{n-p-1}\left( {Y}-\hat{{Y}}\right)^{\top}\left({Y}-\hat{{Y}}\right)=\dfrac{\text{RSS}}{n-p-1}, where p+1 is the total number of parameters estimated.
  4. Under the strong assumptions, each \hat{\beta}_k is normally distributed. See details in slide.
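A small R sketch of these matrix formulas on simulated two-predictor data (the data-generating choices are assumptions for illustration), checked against lm:

set.seed(2)                                    # assumed seed, illustrative data
n <- 100
x1 <- runif(n, 0, 10); x2 <- runif(n, 0, 10)
y  <- 1 - 0.7 * x1 + x2 + rnorm(n)
X  <- cbind(1, x1, x2)                         # design matrix with intercept column

XtX_inv  <- solve(t(X) %*% X)
beta_hat <- XtX_inv %*% t(X) %*% y             # least squares estimates
s2       <- sum((y - X %*% beta_hat)^2) / (n - ncol(X))   # RSS / (n - p - 1)
se       <- sqrt(diag(s2 * XtX_inv))           # standard errors of beta_hat
cbind(estimate = drop(beta_hat), std.error = se)
summary(lm(y ~ x1 + x2))$coefficients[, 1:2]   # same values from lm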

Categorical predictors

Qualitative predictors

Suppose a predictor is qualitative (e.g., 2 different levels) - how would you model/code this in a regression? What if there are more than 2 levels?

  • Consider for example the problem of predicting salary for a potential job applicant:
    • A quantitative variable could be years of relevant work experience.
    • A two-category variable could be is the applicant currently an employee of this company? (T/F)
    • A multiple-category variable could be highest level of education? (HS diploma, Bachelors, Masters, PhD)

How do we incorporate this qualitative data into our modelling?

Integer encoding

One solution - assign the values of the categories to a number.

  • E.g., (HS, B, M, P) = (1,2,3,4).

Problem? The numbers you use impose a relationship between the categories. For example, we are saying a Bachelors degree is above a HS diploma and, in particular, that its effect is exactly twice as large: \beta_{edu} (B) = 2 \times \beta_{edu} (HS).

  • (HS, B, M, P) = (4,7,2,3).

Now this gives an interpretation that a HS diploma is worth more than a PhD but less than a Bachelors?

  • What if the categories are completely unrelated like colours (green, blue, red, yellow)?

One-hot encoding

Another solution is to use a technique called one-hot encoding. Create a set of binary variables that take 0 or 1 depending if the variable belongs to a certain category.

  • Use one-hot encoding when the categories have no ordinal relationship between them.
  • E.g., if we have (red, green, green, blue) the one-hot encoded matrix could be:

\begin{pmatrix} R \\ G \\ G \\ B \\ \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ \end{pmatrix}, where the first column represents red, second green and third blue.

Dummy encoding

Technically, we cannot use one-hot encoding in linear regression, but instead use a technique called dummy encoding.

We pick a base case: that category’s column is dropped, so an observation in the base case has 0 in every remaining dummy column.

Using the same example as before, if we set ‘Red’ to be the base case we have:

\begin{pmatrix} R \\ G \\ G \\ B \\ \end{pmatrix} = \begin{pmatrix} 0 & 0 \\ 1 & 0 \\ 1 & 0 \\ 0 & 1 \\ \end{pmatrix}, where now the first column is green, second is blue. If both columns are 0, then it represents red (implicitly).

  • We need this to prevent a singularity in (X^{\top}X), since the first column of X is all 1’s (recall your definition of linear independence!)
  • Bonus question: What if we remove the intercept column in our design matrix X? Do we still need a base case?

R Demo

The matrix approach

TV radio sales
230.1 37.8 22.1
44.5 39.3 10.4
17.2 45.9 9.3
151.5 41.3 18.5
180.8 10.8 12.9
8.7 48.9 7.2
57.5 32.8 11.8
120.2 19.6 13.2
8.6 2.1 4.8
199.8 2.6 10.6
66.1 5.8 8.6

Y = X\beta + \epsilon X = \left[ \begin{array}{ccccc} 1 & x_{11} & x_{12} & \ldots & x_{1p} \\ 1 & x_{21} & x_{22} & \ldots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \ldots & x_{np} \end{array} \right] \beta = \left[ \begin{array}{c} \beta_{0} \\ % \beta_{1} \\ \vdots \\ \beta_{p} \end{array} \right] Y = \left[ \begin{array}{c} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{array} \right]

library(tidyverse) 
site <- url("https://www.statlearning.com/s/Advertising.csv")
df_adv <- read_csv(site, show_col_types = FALSE)
X <- model.matrix(~ TV + radio, data = df_adv);
y <- df_adv[, "sales"]
head(X)
  (Intercept)    TV radio
1           1 230.1  37.8
2           1  44.5  39.3
3           1  17.2  45.9
4           1 151.5  41.3
5           1 180.8  10.8
6           1   8.7  48.9
head(y)
# A tibble: 6 × 1
  sales
  <dbl>
1  22.1
2  10.4
3   9.3
4  18.5
5  12.9
6   7.2

Brief refresher

Fitting: Minimise the residual sum of squares \begin{aligned} \text{RSS} &= \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n \bigl( y_i - \hat{\beta}_{0} - \hat{\beta}_1 x_{i,1} - \ldots - \hat{\beta}_p x_{i,p} \bigr)^2 \\ &= \left(Y - X\hat{\beta} \right)^{\top}(Y - X\hat{\beta}) \end{aligned}

If \left( X^{\top} X \right)^{-1} exists, it can be shown that the solution is given by: \hat{\beta}=\left(X^{\top}X\right)^{-1}{X}^{\top}Y.

Predicting: The predicted values are given by \hat{Y} = {X} \hat{{\beta}}.

R’s lm and predict

\hat{{\beta}} = ({X}^\top{X})^{-1} {X}^\top Y

model <- lm(sales ~ TV + radio, data = df_adv)
coef(model)
(Intercept)          TV       radio 
 2.92109991  0.04575482  0.18799423 
X <- model.matrix(~ TV + radio, data = df_adv)
y <- df_adv$sales
beta <- solve(t(X) %*% X) %*% t(X) %*% y
beta
                  [,1]
(Intercept) 2.92109991
TV          0.04575482
radio       0.18799423

\hat{Y} = X \hat{\beta}.

budgets <- data.frame(TV = c(100, 200, 300), radio = c(20, 30, 40))
predict(model, newdata = budgets)
       1        2        3 
11.25647 17.71189 24.16731 
X_new <- model.matrix(~ TV + radio, data = budgets)
X_new %*% beta
      [,1]
1 11.25647
2 17.71189
3 24.16731

Dummy encoding

Design matrices are normally an ‘Excel’-style table of covariates/predictors plus a column of ones.

If categorical variables are present, they are added as dummy variables:

fake <- tibble(
  speed = c(100, 80, 60, 60, 120, 40),
  risk = c("Low", "Medium", "High",
           "Medium", "Low", "Low")
)
fake
# A tibble: 6 × 2
  speed risk  
  <dbl> <chr> 
1   100 Low   
2    80 Medium
3    60 High  
4    60 Medium
5   120 Low   
6    40 Low   
model.matrix(~ speed + risk, data = fake)
  (Intercept) speed riskLow riskMedium
1           1   100       1          0
2           1    80       0          1
3           1    60       0          0
4           1    60       0          1
5           1   120       1          0
6           1    40       1          0
attr(,"assign")
[1] 0 1 2 2
attr(,"contrasts")
attr(,"contrasts")$risk
[1] "contr.treatment"

Dummy encoding & collinearity

Why do dummy variables drop the last level?

X_dummy = model.matrix(~ risk, data = fake)
as.data.frame(X_dummy)
  (Intercept) riskLow riskMedium
1           1       1          0
2           1       0          1
3           1       0          0
4           1       0          1
5           1       1          0
6           1       1          0
solve(t(X_dummy) %*% X_dummy)
            (Intercept)   riskLow riskMedium
(Intercept)           1 -1.000000       -1.0
riskLow              -1  1.333333        1.0
riskMedium           -1  1.000000        1.5
X_oh <- cbind(X_dummy, riskHigh = (fake$risk == "High"))
as.data.frame(X_oh)
  (Intercept) riskLow riskMedium riskHigh
1           1       1          0        0
2           1       0          1        0
3           1       0          0        1
4           1       0          1        0
5           1       1          0        0
6           1       1          0        0
solve(t(X_oh) %*% X_oh)
Error in solve.default(t(X_oh) %*% X_oh): system is computationally singular: reciprocal condition number = 6.93889e-18

ANOVA

Test the Relationship Between the Response and Predictors

The below is a test of whether the multiple linear regression model is significantly better than just predicting the mean \bar{Y}.

H_0: \beta_1 = \cdots = \beta_p = 0 H_a: \text{at least one } \beta_j \text{ is non-zero}

  • \text{F-statistic} = \frac{(\text{TSS}-\text{RSS})/p}{\text{RSS}/(n-p-1)} \sim F_{p, n-p-1}
  • Verify the F-test gives the same conclusion as the t-test on \beta_1 \neq 0 for simple linear regression!
  • Question: Given the individual p-values for each variable, why do we need to look at the overall F-statistics?
    • Because a model with all insignificant p-values may jointly still be able to explain a significant proportion of the variance.
    • Conversely, a model with significant predictors may still fail to explain a significant proportion of the variance.
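A sketch of the F statistic computed directly from the sums of squares and compared with lm's reported value; the simulated two-predictor data is an assumption for illustration:

set.seed(3)                                    # assumed seed, illustrative data
n <- 100; p <- 2
x1 <- runif(n, 0, 10); x2 <- runif(n, 0, 10)
y  <- 1 - 0.7 * x1 + x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

TSS <- sum((y - mean(y))^2)
RSS <- sum(residuals(fit)^2)
F_stat <- ((TSS - RSS) / p) / (RSS / (n - p - 1))
c(manual = F_stat, from_lm = unname(summary(fit)$fstatistic["value"]))
pf(F_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)   # overall p-value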

Analysis of variance (ANOVA)

The sums of squares are interpreted as follows:

  • TSS is the total variability in the absence of knowledge of the variables X_1,\ldots,X_{p};
  • RSS is the total variability remaining after introducing the effect of X_1,\ldots,X_{p};
  • SSM is the total variability “explained” because of knowledge of X_1,\ldots,X_{p}.

ANOVA

This partitioning of the variability is used in ANOVA tables:


Source      Sum of squares                                      DoF                  Mean square                                  F                               p-value
Regression  \text{SSM}= \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2  \text{DFM} = p       \text{MSM} = \frac{\text{SSM}}{\text{DFM}}   \frac{\text{MSM}}{\text{MSE}}   1-F_{\text{DFM},\text{DFE}}(F)
Error       \text{RSS}= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2      \text{DFE} = n-p-1   \text{MSE} = \frac{\text{RSS}}{\text{DFE}}
Total       \text{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2       \text{DFT} = n-1     \text{MST} = \frac{\text{TSS}}{\text{DFT}}

Model Fit and Predictions

  • Measure model fit (similar to the simple linear regression)
    • Residual standard error (RSE)
    • R^2 = 1 - \frac{\text{RSS}}{\text{TSS}}
  • Uncertainties associated with the prediction
    • \hat{\beta}_0, \hat{\beta}_1, \cdots, \hat{\beta}_p are estimates. Still have the t-tests to test individual significance.
    • linear model is an approximation
    • random error \epsilon

Advertising Example (continued)

Linear regression fit using TV and Radio:

What do you observe?

Other Considerations in the Regression Model

  • Qualitative predictors
    • two or more levels, with no logical ordering
    • create binary (0/1) dummy variables
    • Need (#levels - 1) dummy variables to fully encode
  • Interaction terms (X_iX_j) (removing the additive assumption)
  • Quadratic terms (X_i^2) (non-linear relationship)

Example 7 - Data plot

The below data was generated by Y = 1 - 0.7\times X_1 + X_2 + \epsilon where X_1, X_2 \sim U[0,10] and \epsilon \sim N(0,1) with n = 30.

Example 7 - Model summary

The below data was generated by Y = 1 - 0.7\times X_1 + X_2 + \epsilon where X_1, X_2 \sim U[0,10] and \epsilon \sim N(0,1) with n = 30.


Call:
lm(formula = Y ~ X1 + X2)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.6923 -0.4883 -0.1590  0.5366  1.9996 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.22651    0.45843   2.675   0.0125 *  
X1          -0.71826    0.05562 -12.913 4.56e-13 ***
X2           1.01285    0.05589  18.121  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8625 on 27 degrees of freedom
Multiple R-squared:  0.9555,    Adjusted R-squared:  0.9522 
F-statistic: 290.1 on 2 and 27 DF,  p-value: < 2.2e-16

Example 8 - Data plot

The below data was generated by Y = 1 - 0.7\times X_1 + X_2 + \epsilon where X_1, X_2 \sim U[0,10] and \epsilon \sim N(0,100) with n = 30.


Example 8 - Model summary

The below data was generated by Y = 1 - 0.7\times X_1 + X_2 + \epsilon where X_1, X_2 \sim U[0,10] and \epsilon \sim N(0,100) with n = 30.


Call:
lm(formula = Y ~ X1 + X2)

Residuals:
    Min      1Q  Median      3Q     Max 
-16.923  -4.883  -1.591   5.366  19.996 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   3.2651     4.5843   0.712   0.4824  
X1           -0.8826     0.5562  -1.587   0.1242  
X2            1.1285     0.5589   2.019   0.0535 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.625 on 27 degrees of freedom
Multiple R-squared:  0.2231,    Adjusted R-squared:  0.1656 
F-statistic: 3.877 on 2 and 27 DF,  p-value: 0.03309

Linear model selection

The credit dataset

Qualitative covariates: own, student, status, region

Linear Model selection

  • Various approaches - we will focus on
    • Subset selection
    • Indirect methods
    • Shrinkage (also called Regularization) (Later in the course)
    • Dimension Reduction (Later in the course)

Subset selection

  • The classic approach is subset selection
  • Standard approaches include
    • Best subset
    • Forward stepwise
    • Backwards stepwise
    • Hybrid stepwise

Best subset selection

Consider a linear model with n observations and p potential predictors: Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p

Algorithm:

  • Consider the model with 0 predictors, and call this \mathcal{M}_0. This is the null model
  • Consider all models with 1 predictor, pick the best fit, and call this \mathcal{M}_1
  • \dots
  • Consider the model with all p predictors, and call this \mathcal{M}_p. This is the full model
  • Pick the best fit of \mathcal{M}_0, \mathcal{M}_1, \dots, \mathcal{M}_p

Best subset selection - behaviour

  • Considers all possible models, given the predictors
  • Optimal model \mathcal{M}_k sets p-k parameters to 0, the rest are found using the normal fitting technique
  • Picks the best of all possible models, given selection criteria
  • Very computationally expensive. Calculates: \sum_{k = 0}^p {p \choose k} = 2^p \text{ models}

Stepwise Example: Forward stepwise selection

Algorithm:

  • Start with the null model \mathcal{M}_0
  • Consider the p models with 1 predictor, pick the best, and call this \mathcal{M}_1
  • Extend \mathcal{M}_1 with one of the p-1 remaining predictors. Pick the best, and call this \mathcal{M}_2
  • \dots
  • End with the full model \mathcal{M}_p
  • Pick the best fit of \mathcal{M}_0, \mathcal{M}_1, \dots, \mathcal{M}_p

Stepwise subset selection - behaviour

  • Considers a much smaller set of models, but the models are generally good fits
  • Far less computationally expensive. Considers only: 1 + \sum_{k=0}^{p-1} (p-k) = 1 + \frac{p(p+1)}{2} \text{ models}
  • Like best-subset, sets excluded predictor’s parameters to 0
  • Backward and forward selection give similar, but possibly different models
  • Assumes each “best model” with k predictors is a proper subset of the best model with k+1 predictors
    • In other words, it only looks one step ahead at a time
  • Hybrid approaches exist, adding some variables, but also removing variables at each step
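As a sketch, both procedures can be run with the leaps package. The chunk below assumes the Credit data is available as a CSV from statlearning.com (the URL and column names are assumptions, mirroring how Advertising.csv is loaded elsewhere in these slides):

library(leaps)
library(readr)
credit <- read_csv("https://www.statlearning.com/s/Credit.csv",
                   show_col_types = FALSE)                  # assumed URL

best <- regsubsets(Balance ~ ., data = credit, nvmax = 10)  # best subset
fwd  <- regsubsets(Balance ~ ., data = credit, nvmax = 10,
                   method = "forward")                      # forward stepwise

summary(best)$which[1:4, ]   # predictors chosen by best subset for model sizes 1-4
summary(fwd)$which[1:4, ]    # predictors chosen by forward stepwise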

Example: Best subset and forward selection on Credit data

# Variables Best subset Forward stepwise
1 rating rating
2 rating, income rating, income
3 rating, income, student rating, income, student
4 cards, income, student, limit rating, income, student, limit

How to determine the “best” model

  • Need a metric to compare different models
  • R^2 can give misleading results as models with more parameters always have a higher R^2 on the training set:

RSS and R^2 for each possible model containing a subset of the ten predictors in the Credit data set.
  • Want low test error:
    • Indirect: estimate test error by adjusting the training error metric due to bias from overfitting
    • Direct: e.g., cross-validation, validation set - To be covered later

Indirect methods

  1. C_p with d predictors: \frac{1}{n}(\text{RSS} + 2d \hat{\sigma}^2)
    • Unbiased estimate of test MSE if \hat{\sigma}^2 is an unbiased estimate of \sigma^2
  2. Akaike information criteria (AIC) with d predictors: \frac{1}{n}(\text{RSS} + 2d \hat{\sigma}^2)
    • Proportional to C_p for least squares, so gives the same results

Indirect methods cont.

  3. Bayesian information criteria (BIC) with d predictors \frac{1}{n}(\text{RSS}+\log(n) \, d\hat{\sigma}^2)
    • \log(n) > 2 for n > 7, so this is a much heavier penalty
  4. Adjusted R^2 with d predictors 1-\frac{\text{RSS}/(n-d-1)}{\text{TSS}/(n-1)}
    • Decreases in \text{RSS} from adding parameters are offset by the increase in 1/(n-d-1)
    • Popular and intuitive, but theoretical backing not as strong as the other measures
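A sketch of these criteria computed directly from an lm fit; the simulated data-generating choices below are assumptions for illustration:

set.seed(4)                               # assumed seed, illustrative data
n <- 100
x1 <- runif(n, 0, 10); x2 <- runif(n, 0, 10)
y  <- 1 - 0.7 * x1 + x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

d      <- length(coef(fit)) - 1           # number of predictors
RSS    <- sum(residuals(fit)^2)
TSS    <- sum((y - mean(y))^2)
sigma2 <- summary(fit)$sigma^2            # estimate of sigma^2

Cp     <- (RSS + 2 * d * sigma2) / n
BICval <- (RSS + log(n) * d * sigma2) / n
adjR2  <- 1 - (RSS / (n - d - 1)) / (TSS / (n - 1))
c(Cp = Cp, BIC = BICval, adjR2 = adjR2)
c(AIC = AIC(fit), BIC = BIC(fit))         # R's log-likelihood-based versions (different scale)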

How to determine the “best” model - Credit dataset

Potential problems with Linear Regression

Potential Problems/Concerns

To apply linear regression properly:

  • The relationship between the predictors and the response is linear and additive (i.e. effects of the covariates must be additive);
  • Homoskedastic (constant) variance;
  • Errors must be independent of the explanatory variables with mean zero (weak assumptions);
  • Errors must be Normally distributed, and hence, symmetric (only in case of testing, i.e., strong assumptions).

Recall Example 6 - The problems

Recall the below data was generated by Y = 1 + 0.2 \times X^2 + \epsilon where X \sim U[0,10] and \epsilon \sim N(0,0.01) with n = 30.

Mean of the residuals: -1.431303e-16 

  • Residuals do not have constant variance.
  • Residuals indicate a linear model is not appropriate.

Potential Problems/Concerns

  1. Non-linearity of the response-predictor relationships
  2. Correlation of error terms
  3. Non-constant variance of error terms
  4. Outliers
  5. High-leverage points
  6. Collinearity
  7. Confounding effect (correlation does not imply causality!)

1. Non-linearities

Example: residuals vs fitted for MPG vs Horsepower:

LHS is a linear model. RHS is a quadratic model.

Quadratic model removes much of the pattern - we look at these in more detail later.

2. Correlations in the Error terms

  • The assumption in the regression model is that the error terms are uncorrelated with each other.
  • If the error terms are correlated, the estimated standard errors will be incorrect.

3. Non-constant error terms

The following are two regression outputs: one with response Y (LHS) and one with response \ln Y (RHS).

In this example the log transformation removed much of the heteroscedasticity.

4. Outliers

5. High-leverage points

The following compares the fitted line with (RED) and without (BLUE) observation 41 fitted.

High-leverage points

  • Have unusual predictor values, causing the regression line to be dragged towards them
  • A few points can significantly affect the estimated regression line
  • Compute the leverage using the hat matrix: H = X(X^{\top}X)^{-1}X^{\top}
  • Note that \hat{y_{i}}=\sum^{n}_{j=1}h_{ij}y_j=h_{ii}y_i+\sum^{n}_{j\ne i}h_{ij}y_{j} so each prediction is a linear function of all observations, and h_{ii} = [H]_{ii} is the weight of observation i on its own prediction
  • If h_{ii} > 2(p+1)/n, the predictor can be considered as having a high leverage
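A sketch in R: hatvalues() returns the diagonal of H for a fitted lm, which can be compared against the 2(p+1)/n rule of thumb. The simulated data and the added point below are assumptions that mimic the example on the next slide:

set.seed(5)                              # assumed seed, illustrative data
n <- 30
x <- runif(n, 0, 10)
y <- 1 - 0.5 * x + rnorm(n)
x[n] <- 20; y[n] <- -7                   # replace one observation with a high-leverage point
fit <- lm(y ~ x)

h <- hatvalues(fit)                      # diagonal of H = X (X'X)^{-1} X'
threshold <- 2 * (1 + 1) / n             # 2(p+1)/n with p = 1
which(h > threshold)                     # flags the high-leverage observation
round(h[n], 2)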

High-leverage points (Example 1)

The below data was generated by Y = 1 - 0.5\times X + \epsilon where X \sim U[0,10] and \epsilon \sim N(0,1) with n = 30. We have added one high leverage point (made a red ‘+’ on the scatterplot).

  • This point (x=20, y=-7) has a leverage value of 0.47 \gg 4/30, despite it not being an outlier.

6. Collinearity

  • Two or more predictor values are closely related to each other (linearly dependent)
  • If a column is linearly dependent on another, the matrix (X^{\top}X) is singular, hence non-invertible.
  • Reduces the accuracy of the regression by increasing the set of plausible coefficient values
  • In effect, this causes the SEs of the coefficient estimates to grow.
  • Correlation can indicate one-to-one (linear) collinearity

Collinearity makes optimisation harder

  • Contour plots of the RSS as a function of the coefficient values. Credit dataset used.
  • Left: balance regressed onto age and limit. Predictors have low collinearity
  • Right: balance regressed onto rating and limit. Predictors have high collinearity
  • Black: coefficient estimate

Multicollinearity

  • Use variance inflation factor \text{VIF}(\hat{\beta}_j) = \frac{1}{1-R^2_{X_j|X_{-j}}}
  • R^2_{X_j|X_{-j}} is the R^2 from X_j being regressed onto all other predictors
  • Minimum 1, higher is worse (>5 or 10 is considered high)
  • Recall R^2_{X_j|X_{-j}} measures the strength of the linear relationship between X_j (treated as the response) and the remaining explanatory variables X_{-j}.
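A sketch of the VIF computed from its definition and with car::vif, on simulated nearly collinear data (the data and the use of the car package here are assumptions; the slides' own example below also loads car):

library(car)                                   # provides vif()
set.seed(6)                                    # assumed seed, illustrative data
n  <- 30
x1 <- runif(n, 0, 10)
x2 <- 2 * x1 + rnorm(n, sd = 0.1)              # nearly collinear with x1
y  <- 1 - 0.7 * x1 + x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

r2_1 <- summary(lm(x1 ~ x2))$r.squared         # R^2 of x1 regressed on the other predictor(s)
c(manual = 1 / (1 - r2_1), car = unname(vif(fit)["x1"]))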

Multicollinearity example - Plot

The below data was generated by Y = 1 - 0.7\times X_1 + X_2 + \epsilon where X_1 \sim U[0,10], X_2 = 2X_1 and \epsilon \sim N(0,1) with n = 30.

Multicollinearity example - Summary and VIF

The below data was generated by Y = 1 - 0.7\times X_1 + X_2 + \epsilon where X_1 \sim U[0,10], X_2 = 2X_1 + \varepsilon, where \varepsilon \sim N(0,10^{-8}) is a small change (to make this work) and \epsilon \sim N(0,1) with n = 30.


Call:
lm(formula = Y ~ X1 + X2)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.32126 -0.46578  0.02207  0.54006  1.89817 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept)  5.192e-01  3.600e-01   1.442   0.1607  
X1           5.958e+04  3.268e+04   1.823   0.0793 .
X2          -2.979e+04  1.634e+04  -1.823   0.0793 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8538 on 27 degrees of freedom
Multiple R-squared:  0.9614,    Adjusted R-squared:  0.9585 
F-statistic: 335.9 on 2 and 27 DF,  p-value: < 2.2e-16
VIF for X1: 360619740351 
VIF for X2: 360619740351 
  • High SEs on the coefficient estimates make them unreliable.

7. Confounding effects

  • But what about confounding variables? Be careful, correlation does not imply causality! [1]
  • C is a confounder (confounding variable) of the relation between X and Y if:
    • C influences X and C influences Y,
    • but X does not influence Y (directly).

Confounding effects

  • The predictor variable X would have an indirect influence on the dependent variable Y.
    • Example: Age \Rightarrow Experience \Rightarrow Aptitude for mathematics. If experience cannot be measured, age can be a proxy for experience.
  • The predictor variable X would have no direct influence on dependent variable Y.
    • Example: Being old doesn’t necessarily mean you are good at maths!
  • Hence, such a variable works as a predictor, but action taken on the predictor itself will have no effect on the response.

Confounding effects

How to correctly use/don’t use confounding variables?

  • If a confounding variable is observable: add the confounding variable.
  • If a confounding variable is unobservable: be careful with interpretation!

So what’s next

Generalisations of the Linear Model

In much of the rest of this course, we discuss methods that expand the scope of linear models and how they are fit:

  • Classification problems: logistic regression
  • Non-normality: Generalised Linear Model
  • Non-linearity: splines and generalized additive models; KNN, tree-based methods
  • Regularised fitting: Ridge regression and lasso
  • Non-parametric: Tree-based methods, bagging, random forests and boosting, KNN (these also capture non-linearities)

Appendices

Appendix: Sum of squares

Recall from ACTL2131/ACTL5101, we have the following sum of squares:

\begin{aligned} S_{xx} & =\sum_{i=1}^{n}(x_i-\overline{x})^2 &\implies s_x^2=\frac{S_{xx}}{n-1} \\ S_{yy} & =\sum_{i=1}^{n}(y_i-\overline{y})^2 &\implies s_y^2=\frac{S_{yy}}{n-1}\\ S_{xy} & =\sum_{i=1}^{n}(x_i-\overline{x})(y_i-\overline{y}) &\implies s_{xy}=\frac{S_{xy}}{n-1}, \end{aligned} Here s_x^2, s_y^2 (and s_{xy}) denote sample (co-)variance.

Appendix: CI for \beta_1 and \beta_0

Rationale for \beta_1: Recall that \hat{\beta}_1 is unbiased and \text{Var}(\hat{\beta}_1)=\sigma^2/S_{xx}. However \sigma^2 is usually unknown, and estimated by s^2 so, under the strong assumptions, we have: \frac{\hat{\beta}_1-\beta_1 }{ s/ \sqrt{S_{xx}}} = \left.{\underset{\mathcal{N}(0,1)}{\underbrace{\frac{\hat{\beta}_1-\beta_1}{\sigma/ \sqrt{S_{xx}}}}}}\right/{\underset{\sqrt{\chi^2_{n-2}/(n-2)}}{\underbrace{\sqrt{\frac{\frac{(n-2)\cdot s^2}{\sigma^2}}{n-2}}}}} \sim t_{n-2} as \epsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,\sigma^2) then \frac{(n-2)\cdot s^2}{\sigma^2} = \frac{\sum_{i=1}^{n}(y_i-\hat{\beta}_0-\hat{\beta}_1\cdot x_i)^2}{\sigma^2} \sim\chi^2_{{n}-2}.

Note: Why do we lose two degrees of freedom? Because we estimated two parameters!

Similar rationale for \beta_0.

Appendix: Statistical Properties of the Least Squares Estimates

  1. Under the strong assumptions of normality each component \hat{\beta}_{k} is normally distributed with mean and variance \mathbb{E}[ \hat{\beta}_{k}] = \beta_k, \quad \text{Var}( \hat{\beta}_{k}) = \sigma^2 \cdot c_{kk}, and covariance between \hat{\beta}_{k} and \hat{\beta}_{l}: \text{Cov}( \hat{\beta}_k, \hat{\beta}_l ) = \sigma^{2}\cdot c_{kl}, where c_{kk} is the \left( k+1\right)^{\text{th}} diagonal entry of the matrix {\mathbf{C}=}\left( \mathbf{X}^{\top}\mathbf{X}\right) ^{-1}.
    The standard error of \hat{\beta}_{k} is estimated using \text{se}( \hat{\beta}_k ) = s\sqrt{c_{kk}}.

Simple linear regression: Assessing the Accuracy of the Predictions - Mean Response

Suppose x=x_{0} is a specified value of the out-of-sample regressor variable and we want to predict the corresponding Y value associated with it. The mean of Y is: \begin{aligned} \mathbb{E}[ Y \mid x_0 ] &= \mathbb{E}[ \beta_0 +\beta_1 x + \epsilon \mid x=x_0 ] \\ &= \beta_0 + \beta_1 x_0. \end{aligned}

Our (unbiased) estimator for this mean (also the fitted value of y_0) is: \hat{y}_{0}=\hat{\beta}_0+\hat{\beta}_1x_{0}. The variance of this estimator is: \text{Var}( \hat{y}_0 ) = \left( \frac{1}{n} + \frac{(\overline{x}-x_0)^2}{S_{xx}}\right) \sigma ^{2} = \text{SE}(\hat{y}_0)^2

Proof: See Lab questions.

Simple linear regression: Assessing the Accuracy of the Predictions - Mean Response

Using the strong assumptions, the 100\left( 1-\alpha\right) \% confidence interval for \beta_0 +\beta_1 x_{0} (mean of Y) is: \underbrace{\bigl(\hat{\beta}_0 + \hat{\beta}_1 x_0 \bigr)}_{\hat{y}_0} \pm t_{1-\alpha/2,n-2}\times {\underbrace{s\sqrt{ \dfrac{1}{n}+\dfrac{\left(\overline{x}-x_{0}\right) ^{2}}{S_{xx}}}}_{\hat{\text{SE}}(\hat{y}_0)}} ,

as we have \hat{y}_{0} \sim \mathcal{N}\left( \beta_0 +\beta_1 x_{0},\ \text{SE}(\hat{y}_0)^2 \right)

and \frac{\hat{y}_{0} - (\beta_0 +\beta_1 x_0)}{\hat{\text{SE}}(\hat{y}_0)} \sim t_{n-2}.

Similar rationale to slide.

Simple linear regression: Assessing the Accuracy of the Predictions - Individual response

A prediction interval is a confidence interval for the actual value of Y_i (not for its mean \beta_0 + \beta_1 x_i). We base our prediction of Y_i (given X=x_i) on: \hat{y}_i=\hat{\beta}_0+\hat{\beta}_1 x_i. The error in our prediction is: \begin{aligned} Y_i-\hat{y}_i=\beta_0+\beta_1 x_i+\epsilon_i-\hat{y}_i=\mathbb{E}[Y|X=x_i]-\hat{y}_i+\epsilon_i, \end{aligned} with \mathbb{E}\left[Y_i-\hat{y}_i|X=x_i\right]=0, \text{ and } \text{Var}(Y_i-\hat{y}_i|X=x_i)=\sigma^2\left(1+\frac{1}{n}+\frac{(\overline{x} - x_i)^2}{S_{xx}}\right).

Proof: See Lab questions.

Simple linear regression: Assessing the Accuracy of the Predictions - Individual response

A 100(1-\alpha)% prediction interval for {Y}_i, the value of {Y} at {X}=x_i, is given by:

\begin{aligned} \underbrace{\hat{\beta}_0+\hat{\beta}_1x_i}_{\hat{y}_i} \pm t_{1-\alpha/2,n-2}\cdot s\cdot\sqrt{1+\frac{1}{n}+\frac{(\overline{x} - x_i)^2}{S_{xx}}}, \end{aligned} as (Y_i-\hat{y}_i|X=x_i) \sim \mathcal{N}\Bigl(0,\ \sigma^2\bigl(1 + \tfrac{1}{n} + \tfrac{(\overline{x} - x_i)^2}{S_{xx}}\bigr)\Bigr), \text{ and } \frac{Y_i- \hat{y}_i}{s\sqrt{1 + \frac{1}{n} + \frac{(\overline{x} - x_i)^2}{S_{xx}}}} \sim t_{n-2}.
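Both intervals are available in R via predict.lm; a sketch on simulated data (the seed and the chosen value x_i = 5 are assumptions for illustration):

set.seed(7)                                             # assumed seed, illustrative data
n <- 30
x <- runif(n, 0, 10)
y <- 1 - 0.5 * x + rnorm(n)
fit <- lm(y ~ x)
new_x <- data.frame(x = 5)

predict(fit, newdata = new_x, interval = "confidence")  # CI for the mean response at x = 5
predict(fit, newdata = new_x, interval = "prediction")  # wider interval for an individual Y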

Footnotes

  1. Check this website on spurious correlations.