Lab 2: Simple Linear Regression
Questions
Conceptual Questions
Prove that, in simple linear regression, the least squares coefficient estimates (LSE) for $\beta_0$ and $\beta_1$ are:
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.$$
Forensic scientists use various methods for determining the likely time of death from post-mortem examination of human bodies. A recently suggested objective method uses the concentration of a compound (3-methoxytyramine or 3-MT) in a particular part of the brain. In a study of the relationship between post-mortem interval and the concentration of 3-MT, samples of the appropriate part of the brain were taken from coroners' cases for which the time of death had been determined from eye-witness accounts. The intervals ($x$; in hours) and concentrations ($y$; in parts per million) for 18 individuals who were found to have died from organic heart disease are given in the following table. For the last two individuals (numbered 17 and 18 in the table) there was no eye-witness testimony directly available, and the time of death was established on the basis of other evidence, including knowledge of the individuals' activities.
| Observation number | Interval ($x$) | Concentration ($y$) |
|---|---|---|
| 1 | 5.5 | 3.26 |
| 2 | 6.0 | 2.67 |
| 3 | 6.5 | 2.82 |
| 4 | 7.0 | 2.80 |
| 5 | 8.0 | 3.29 |
| 6 | 12.0 | 2.28 |
| 7 | 12.0 | 2.34 |
| 8 | 14.0 | 2.18 |
| 9 | 15.0 | 1.97 |
| 10 | 15.5 | 2.56 |
| 11 | 17.5 | 2.09 |
| 12 | 17.5 | 2.69 |
| 13 | 20.0 | 2.56 |
| 14 | 21.0 | 3.17 |
| 15 | 25.5 | 2.18 |
| 16 | 26.0 | 1.94 |
| 17 | 48.0 | 1.57 |
| 18 | 60.0 | 0.61 |

Data summaries (computed from the table): $\sum x = 337.0$, $\sum y = 42.98$, $\sum x^2 = 9854.5$, $\sum y^2 = 109.7936$, $\sum xy = 672.8$.
In this investigation you are required to explore the relationship between concentration (regarded as the response/dependent variable) and interval (regarded as the explanatory/independent variable).
Construct a scatterplot of the data. Comment on any interesting features of the data and discuss briefly whether linear regression is appropriate to model the relationship between concentration of 3-MT and the interval from death.
Calculate the correlation coefficient of the data, and use it to test the null hypothesis that the population correlation coefficient is equal to zero. For this task, you may use the fact that, under the strong assumptions, if $H_0\colon \rho = 0$ is true, then
$$\frac{r \sqrt{n-2}}{\sqrt{1 - r^2}} \sim t_{n-2}.$$
Calculate the equation of the least-squares fitted regression line and use it to estimate the concentration of 3-MT:
after 1 day and
after 2 days.
Comment briefly on the reliability of these estimates.
A shortcut formula for the unbiased estimate of $\sigma^2$ (the variance of the errors in linear regression) is given in the "Orange Formulae Book" (p.24) as
$$\hat{\sigma}^2 = \frac{1}{n-2} \left( S_{yy} - \frac{S_{xy}^2}{S_{xx}} \right).$$
Use this formula to compute $\hat{\sigma}^2$ for the data in this question.
Calculate a 99% confidence interval for the slope of the regression line. Using this confidence interval, test the hypothesis that the slope of the regression line is equal to zero. Comment on your answer in relation to the answer given in part (2) above.
A university wishes to analyse the performance of its students on a particular degree course. It records the scores obtained by a sample of 12 students at entry to the course, and the scores obtained in their final examinations by the same students. The results are as follows:
| Student | A | B | C | D | E | F | G | H | I | J | K | L |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Entrance exam score $x$ (%) | 86 | 53 | 71 | 60 | 62 | 79 | 66 | 84 | 90 | 55 | 58 | 72 |
| Final paper score $y$ (%) | 75 | 60 | 74 | 68 | 70 | 75 | 78 | 90 | 85 | 60 | 62 | 70 |

Data summaries (computed from the table): $\sum x = 836$, $\sum y = 867$, $\sum x^2 = 60016$, $\sum y^2 = 63603$, $\sum xy = 61523$.
Calculate the fitted linear regression equation of $y$ (final paper score) on $x$ (entrance exam score).
Under the strong assumptions, it can be shown that
$$\frac{(n-2)\,\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-2},$$
where $\hat{\sigma}^2$ is the unbiased estimate of $\sigma^2$ in simple linear regression. Knowing this, calculate an estimate of the variance $\sigma^2$ and obtain a 90% confidence interval for $\sigma^2$, for the above data.
By considering the slope parameter, formally test whether the data is positively correlated.
Calculate the proportion of variance explained by the model. Hence, comment on the fit of the model.
Complete the following ANOVA table for a simple linear regression with $n$ observations:
| Source | D.F. | Sum of Squares | Mean Squares | F-Ratio |
|---|---|---|---|---|
| Regression | ____ | ____ | ____ | ____ |
| Error | ____ | ____ | 8.2 | |
| Total | ____ | 639.5 | | |

Consider a fitted simple linear regression model (via least squares) with estimated parameters $\hat{\beta}_0$ and $\hat{\beta}_1$. Let $x^*$ be a new (but known) observation (i.e., not in the training set used to estimate the parameters). Your prediction of the response for $x^*$ would be $\hat{y}^* = \hat{\beta}_0 + \hat{\beta}_1 x^*$.
- Show that the variance of this prediction is
$$\mathrm{Var}(\hat{y}^*) = \sigma^2 \left( \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}} \right).$$
Note you may use as given the expressions for the (co)variances of $\hat{\beta}_0$ and $\hat{\beta}_1$ given in the lecture.
- Notice how $\hat{y}^*$ does not contain an "error term" (because the only reasonable prediction of the error term is $0$). But this means that the variance of $\hat{y}^*$ computed above should be viewed as the variance of the "predicted mean" of the new observation $y^*$. What is the variance of the predicted individual response $\hat{y}^* + \varepsilon^*$? Hint: $\varepsilon^*$ is independent of the training sample, with $\mathrm{Var}(\varepsilon^*) = \sigma^2$.
Suppose you are interested in relating the accounting variable EPS (earnings per share) to the market variable STKPRICE (stock price). A regression equation was fitted using STKPRICE as the response variable with EPS as the predictor variable. Following is the computer output from your fitted regression. You are also given the sample means and sample standard deviations of EPS and STKPRICE (note that $S_{xx} = (n-1)\,s_x^2$ and $S_{yy} = (n-1)\,s_y^2$).
Regression Analysis

The regression equation is
STKPRICE = 25.044 + 7.445 EPS

Predictor     Coef  SE Coef     T      p
Constant    25.044    3.326  7.53  0.000
EPS          7.445    1.144  6.51  0.000

Analysis of Variance

SOURCE      DF     SS     MS      F      p
Regression   1  10475  10475  42.35  0.000
Error       46  11377    247
Total       47  21851
Compute $s$ and $R^2$.
Calculate the correlation coefficient of EPS and STKPRICE.
Estimate the STKPRICE given an EPS of $2. Provide 95% confidence intervals on both the "predicted mean" and the "predicted individual response" for STKPRICE (these are defined in a previous question). You may assume that the predicted mean and predicted response are approximately Normal (this is not exactly true, even under the strong assumptions, but a reasonable assumption if the sample size is decently large). Comment on the difference between those two confidence intervals.
Provide a 95% confidence interval for the slope coefficient $\beta_1$.
Describe how you would check if the errors have constant variance.
Perform a test of the significance of EPS in predicting STKPRICE at a level of significance of 5%.
Test a hypothesis of the form $H_0\colon \beta_1 = b_0$ against $H_1\colon \beta_1 > b_0$, for a given value $b_0$, at a level of significance of 5%.
(Modified Institute Exam Question) As part of an investigation into health service funding, a working party was concerned with the issue of whether mortality could be used to predict sickness rates. Data on standardised mortality rates ($x$) and standardised sickness rates ($y$) were collected for a sample of 10 regions and are shown in the table below:
| Region | Mortality rate $x$ (per 100,000) | Sickness rate $y$ (per 100,000) |
|---|---|---|
| 1 | 125.2 | 206.8 |
| 2 | 119.3 | 213.8 |
| 3 | 125.3 | 197.2 |
| 4 | 111.7 | 200.6 |
| 5 | 117.3 | 189.1 |
| 6 | 100.7 | 183.6 |
| 7 | 108.8 | 181.2 |
| 8 | 102.0 | 168.2 |
| 9 | 104.7 | 165.2 |
| 10 | 121.1 | 228.5 |

Data summaries (computed from the table): $\sum x = 1136.1$, $\sum x^2 = 129853.03$, $\sum y = 1934.2$, $\sum y^2 = 377700.62$, and $\sum xy = 221022.58$.
Calculate the correlation coefficient between the mortality rates and the sickness rates. Conduct a statistical test of whether the underlying population correlation coefficient is zero against the alternative that it is positive. Hint: use the test statistic provided in a previous question.
Noting the issue under investigation, draw an appropriate scatterplot for these data and comment on the relationship between the two rates.
Determine the fitted linear regression of sickness rate on mortality rate. Then, test whether the underlying slope coefficient can be considered to be larger than 2.0.
For a region with mortality rate 115.0, estimate the expected sickness rate.
More Proofs
Applied Questions
(ISLR2, Q3.8) This question involves the use of simple linear regression on the `Auto` data set.
- Use the `lm()` function to perform a simple linear regression with `mpg` as the response and `horsepower` as the predictor. Use the `summary()` function to print the results. Comment on the output. For example:
  - Is there a relationship between the predictor and the response?
  - How strong is the relationship between the predictor and the response?
  - Is the relationship between the predictor and the response positive or negative?
- What is the predicted `mpg` associated with a `horsepower` of 98? What are the associated 95% confidence and prediction intervals?
- Plot the response and the predictor. Use the `abline()` function to display the least squares regression line.
- Use the `plot()` function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.
(ISLR2, Q3.11) In this problem we will investigate the $t$-statistic for the null hypothesis $H_0\colon \beta = 0$ in simple linear regression without an intercept. To begin, we generate a predictor `x` and a response `y` as follows.

set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
Perform a simple linear regression of `y` onto `x`, without an intercept. Report the coefficient estimate $\hat{\beta}$, the standard error of this coefficient estimate, and the $t$-statistic and $p$-value associated with the null hypothesis $H_0\colon \beta = 0$. Comment on these results. (You can perform regression without an intercept using the command
lm(y ~ x+0)
.) Now perform a simple linear regression of `x` onto `y` without an intercept, and report the coefficient estimate, its standard error, and the corresponding $t$-statistic and $p$-value associated with the null hypothesis $H_0\colon \beta = 0$. Comment on these results.
What is the relationship between the results obtained in (a) and (b)?
For the regression of $y$ onto $x$ without an intercept, the $t$-statistic for $H_0\colon \beta = 0$ takes the form $\hat{\beta} / \mathrm{SE}(\hat{\beta})$, where $\hat{\beta}$ is given by (3.38), and where
$$\mathrm{SE}(\hat{\beta}) = \sqrt{\frac{\sum_{i=1}^{n} (y_i - x_i \hat{\beta})^2}{(n-1) \sum_{i'=1}^{n} x_{i'}^2}}.$$
(These formulas are slightly different from those given in Sections 3.1.1 and 3.1.2, since here we are performing regression without an intercept.) Show algebraically, and confirm numerically in R, that the $t$-statistic can be written as
$$\frac{\sqrt{n-1}\, \sum_{i=1}^{n} x_i y_i}{\sqrt{\left( \sum_{i=1}^{n} x_i^2 \right) \left( \sum_{i'=1}^{n} y_{i'}^2 \right) - \left( \sum_{i'=1}^{n} x_{i'} y_{i'} \right)^2}}.$$
Using the results from (d), argue that the $t$-statistic for the regression of $y$ onto $x$ is the same as the $t$-statistic for the regression of $x$ onto $y$.
In R, show that when regression is performed with an intercept, the $t$-statistic for $H_0\colon \beta_1 = 0$ is the same for the regression of `y` onto `x` as it is for the regression of `x` onto `y`.
Solutions
Conceptual Questions
Question We determine $\hat{\beta}_0$ and $\hat{\beta}_1$ by minimising the squared error. Hence, we use the least squares estimates (LSE) for $\beta_0$ and $\beta_1$:
$$(\hat{\beta}_0, \hat{\beta}_1) = \underset{(\beta_0, \beta_1)}{\arg\min} \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2.$$
The minimum is obtained by setting the first order conditions (FOC) to zero:
$$\frac{\partial}{\partial \beta_0} \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2 = -2 \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right),$$
$$\frac{\partial}{\partial \beta_1} \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2 = -2 \sum_{i=1}^{n} x_i \left( y_i - \beta_0 - \beta_1 x_i \right).$$
The LSE $\hat{\beta}_0$ and $\hat{\beta}_1$ are given by setting the FOC equal to zero:
$$\sum_{i=1}^{n} y_i - n \hat{\beta}_0 - \hat{\beta}_1 \sum_{i=1}^{n} x_i = 0 \qquad \text{and} \qquad \sum_{i=1}^{n} x_i y_i - \hat{\beta}_0 \sum_{i=1}^{n} x_i - \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = 0.$$
So we have, from the first equation,
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.$$
Next step: rearranging so that $\hat{\beta}_1$ becomes a function of $\sum_i x_i$, $\sum_i y_i$, $\sum_i x_i y_i$, and $\sum_i x_i^2$. Substituting the expression for $\hat{\beta}_0$ into the second equation (at this point, $\hat{\beta}_0$ is done, so we continue with $\hat{\beta}_1$):
$$\sum_{i=1}^{n} x_i y_i - \left( \bar{y} - \hat{\beta}_1 \bar{x} \right) n \bar{x} - \hat{\beta}_1 \sum_{i=1}^{n} x_i^2 = 0.$$
From the previous steps we have $\hat{\beta}_1 \left( \sum_{i=1}^{n} x_i^2 - n \bar{x}^2 \right) = \sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}$, thus:
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2}.$$
Using the notation $S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}$ and $S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n \bar{x}^2$, we have an easier way to write $\hat{\beta}_1$:
$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}.$$
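As a quick numerical sanity check, here is a minimal sketch on simulated data (the seed, sample size, and coefficients are arbitrary choices, not part of the question): the closed-form estimates should match the coefficients returned by `lm()`.

set.seed(42)
x <- rnorm(50)
y <- 1 + 0.5 * x + rnorm(50)
# Closed-form least squares estimates
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # Sxy / Sxx
b0 <- mean(y) - b1 * mean(x)
c(b0, b1)
# Same values from R's built-in fitter
coef(lm(y ~ x))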
-
Scatterplot of concentration against interval. Interesting features are that, in general, the concentration of 3-MT in the brain seems to decrease as the post mortem interval increases. Another interesting feature is that we observe two observations with a much higher post mortem interval than the other observations.
The data seem to be appropriate for linear regression. The linear relationship seems to hold, especially for values of interval between 5 and 26 (we have enough observations there). Care should be taken when extrapolating to values of $x$ lower than 5 or larger than 26 (only two observations), because we do not know whether the linear relationship between $x$ and $y$ still holds then. We test:
$$H_0\colon \rho = 0 \quad \text{against} \quad H_1\colon \rho \neq 0.$$
The corresponding test statistic is given by:
$$T = \frac{r \sqrt{n-2}}{\sqrt{1 - r^2}} \sim t_{n-2} \text{ under } H_0.$$
We reject the null hypothesis for large and small values of the test statistic.
We have $n = 18$, $S_{xx} = 9854.5 - 337^2/18 = 3545.11$, $S_{yy} = 109.7936 - 42.98^2/18 = 7.167$, $S_{xy} = 672.8 - 337 \times 42.98 / 18 = -131.88$, and the correlation coefficient is given by:
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{-131.88}{\sqrt{3545.11 \times 7.167}} = -0.827.$$
Thus, the value of our test statistic is given by:
$$T = \frac{-0.827 \sqrt{16}}{\sqrt{1 - (-0.827)^2}} = -5.89.$$
From Formulae and Tables page 163 we observe $t_{16, 0.9995} = 4.015$, so $t_{16, 0.0005} = -4.015$, using the symmetry property of the Student-$t$ distribution. We observe that the value of our test statistic ($-5.89$) is smaller than $-4.015$, thus our $p$-value should be smaller than $2 \times 0.0005 = 0.001$. Thus, we can reject the null hypothesis even at a significance level of 0.1%, hence we can conclude that there is a linear dependency between interval and concentration. Note that the alternative hypothesis here is "a linear dependency", not "a negative linear dependency", so by rejecting the null you accept that (two-sided) alternative. Had you used negative dependency as the alternative hypothesis, you would have accepted that alternative too; but due to the construction of the test we have to use the phrase "a linear dependency" and not "a negative linear dependency". The linear regression model is given by:
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i.$$
The estimate of the slope is given by:
$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{-131.88}{3545.11} = -0.0372.$$
The estimate of the intercept is given by:
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 2.3878 - (-0.0372) \times 18.7222 = 3.084.$$
Thus, the estimate of $y$ given a value of $x$ is given by:
$$\hat{y} = 3.084 - 0.0372\, x.$$
One day equals 24 hours, i.e., $x = 24$, thus $\hat{y} = 3.084 - 0.0372 \times 24 = 2.19$.
Two days equal 48 hours, i.e., $x = 48$, thus $\hat{y} = 3.084 - 0.0372 \times 48 = 1.30$.
The data set contains accurate data up to 26 hours, as for observations 17 and 18 (at 48 hours and 60 hours respectively) there was no eye-witness testimony directly available. Predicting 3-MT concentration after 26 hours may therefore not be advisable, even though $x = 48$ is within the range of the $x$-values (5.5 hours to 60 hours).
We calculate
$$\hat{\sigma}^2 = \frac{1}{n-2} \left( S_{yy} - \frac{S_{xy}^2}{S_{xx}} \right) = \frac{1}{16} \left( 7.167 - \frac{(-131.88)^2}{3545.11} \right) = 0.1413.$$
The pivotal quantity is given by:
$$\frac{\hat{\beta}_1 - \beta_1}{\sqrt{\hat{\sigma}^2 / S_{xx}}} \sim t_{n-2}.$$
The standard error is $\sqrt{\hat{\sigma}^2 / S_{xx}} = \sqrt{0.1413 / 3545.11} = 0.00631$. From Formulae and Tables page 163 we have $t_{16, 0.995} = 2.921$. Using the pivotal quantity, the 99% confidence interval of the slope is given by:
$$\hat{\beta}_1 \pm t_{16, 0.995} \sqrt{\hat{\sigma}^2 / S_{xx}} = -0.0372 \pm 2.921 \times 0.00631.$$
Thus the 99% confidence interval of $\beta_1$ is given by $(-0.0556, -0.0188)$. Note that $0$ is not within the 99% confidence interval, therefore we would reject the null hypothesis that $\beta_1$ equals zero and accept the alternative that $\beta_1 \neq 0$ at a 1% level of significance. This confirms the result in (2), where the correlation coefficient was shown to not equal zero at the 1% significance level.
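These hand calculations can also be reproduced in R; the following is a sketch, with the data typed in from the table in the question.

interval <- c(5.5, 6, 6.5, 7, 8, 12, 12, 14, 15, 15.5, 17.5, 17.5, 20, 21, 25.5, 26, 48, 60)
conc <- c(3.26, 2.67, 2.82, 2.80, 3.29, 2.28, 2.34, 2.18, 1.97, 2.56, 2.09, 2.69, 2.56, 3.17, 2.18, 1.94, 1.57, 0.61)
cor.test(interval, conc)                        # r = -0.827, t = -5.89 on 16 df
fit <- lm(conc ~ interval)
summary(fit)$sigma^2                            # sigma^2-hat, approximately 0.141
confint(fit, "interval", level = 0.99)          # 99% CI for the slope
predict(fit, data.frame(interval = c(24, 48)))  # estimates after 1 and 2 days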
-
The linear regression model is given by:
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \text{where } \varepsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2) \text{ for } i = 1, \ldots, n.$$
The estimated coefficients of the linear regression model are given by (see Formulae and Tables p.24):
$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{61523 - 836 \times 867 / 12}{60016 - 836^2 / 12} = \frac{1122}{1774.67} = 0.6322,$$
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 72.25 - 0.6322 \times 69.667 = 28.20.$$
Thus, the fitted linear regression equation is given by:
$$\hat{y} = 28.20 + 0.6322\, x.$$
From the previous part, we already have $S_{xx} = 1774.67$, $S_{xy} = 1122$, and $S_{yy} = 63603 - 867^2/12 = 962.25$, so
$$\hat{\sigma}^2 = \frac{1}{n-2} \left( S_{yy} - \frac{S_{xy}^2}{S_{xx}} \right) = \frac{1}{10} \left( 962.25 - \frac{1122^2}{1774.67} \right) = 25.29.$$
Then, we know the pivotal quantity:
$$\frac{(n-2)\,\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-2}.$$
Note: we have $n-2$ degrees of freedom because we have to estimate two parameters from the data ($\beta_0$ and $\beta_1$). Calling $\chi^2_{n-2}(\alpha)$ the $\alpha$-quantile of a $\chi^2$ with $n-2$ degrees of freedom, we have
$$\Pr\left( \chi^2_{10}(0.05) \le \frac{10\,\hat{\sigma}^2}{\sigma^2} \le \chi^2_{10}(0.95) \right) = 0.90.$$
Thus, we have that the 90% confidence interval is given by:
$$\left( \frac{10\,\hat{\sigma}^2}{\chi^2_{10}(0.95)},\ \frac{10\,\hat{\sigma}^2}{\chi^2_{10}(0.05)} \right) = \left( \frac{252.89}{18.31},\ \frac{252.89}{3.94} \right).$$
Finally, for our data the 90% confidence interval of $\sigma^2$ is given by $(13.81, 64.18)$.
- We test the following: $H_0\colon \beta_1 = 0$ against $H_1\colon \beta_1 > 0$, with a level of significance $\alpha = 5\%$.
- The test statistic is:
$$T = \frac{\hat{\beta}_1 - 0}{\sqrt{\hat{\sigma}^2 / S_{xx}}} \sim t_{n-2} \text{ under } H_0.$$
- The rejection region of the test is given by: $T > t_{10, 0.95} = 1.812$.
- The value of the test statistic is given by:
$$T = \frac{0.6322}{\sqrt{25.29 / 1774.67}} = 5.30.$$
- The value of the test statistic is in the rejection region, hence we reject the null hypothesis of a zero slope: the data provide evidence of positive correlation.
The proportion of the variability explained by the model is given by:
$$R^2 = \frac{S_{xy}^2}{S_{xx} S_{yy}} = \frac{1122^2}{1774.67 \times 962.25} = 0.737.$$
Hence, a fairly large proportion (roughly 74%) of the variability of $y$ is explained by $x$.
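Again, the numbers above are easy to verify in R (a sketch; the scores are typed in from the table in the question).

entrance <- c(86, 53, 71, 60, 62, 79, 66, 84, 90, 55, 58, 72)
final <- c(75, 60, 74, 68, 70, 75, 78, 90, 85, 60, 62, 70)
fit <- lm(final ~ entrance)
coef(fit)                   # approximately 28.20 and 0.632
s2 <- summary(fit)$sigma^2  # sigma^2-hat, approximately 25.3
n <- length(final)
# 90% CI for sigma^2, using (n - 2) * s2 / sigma^2 ~ chi^2_{n-2}
(n - 2) * s2 / qchisq(c(0.95, 0.05), df = n - 2)  # approximately (13.8, 64.2)
summary(fit)$r.squared      # approximately 0.737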
Question The completed ANOVA table is given below, writing $n$ for the number of observations. Each row satisfies $\text{SS} = \text{MS} \times \text{D.F.}$; the degrees of freedom are $1$, $n-2$, and $n-1$; the Regression and Error sums of squares add up to the Total; and the F-ratio is $\text{MS}_{\text{Regression}} / \text{MS}_{\text{Error}}$:

| Source | D.F. | Sum of Squares | Mean Squares | F-Ratio |
|---|---|---|---|---|
| Regression | $1$ | $639.5 - 8.2(n-2)$ | $639.5 - 8.2(n-2)$ | $\left[ 639.5 - 8.2(n-2) \right] / 8.2$ |
| Error | $n-2$ | $8.2(n-2)$ | $8.2$ | |
| Total | $n-1$ | $639.5$ | | |
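To see how R lays out the same quantities, here is a sketch on simulated data (arbitrary seed and coefficients; the numbers produced will of course differ from the exercise).

set.seed(5)
x <- rnorm(20)
y <- 1 + 2 * x + rnorm(20)
anova(lm(y ~ x))
# The 'x' row is the Regression row (Df = 1) and 'Residuals' is the Error row
# (Df = n - 2); F value = Mean Sq (regression) / Mean Sq (error)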
-
We have that
$$\begin{aligned} \mathrm{Var}(\hat{y}^*) &= \mathrm{Var}(\hat{\beta}_0 + \hat{\beta}_1 x^*) \\ &= \mathrm{Var}(\hat{\beta}_0) + (x^*)^2\, \mathrm{Var}(\hat{\beta}_1) + 2 x^*\, \mathrm{Cov}(\hat{\beta}_0, \hat{\beta}_1) \\ &= \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right) + (x^*)^2 \frac{\sigma^2}{S_{xx}} - 2 x^* \frac{\sigma^2 \bar{x}}{S_{xx}} \\ &= \sigma^2 \left( \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}} \right). \end{aligned}$$
We have
$$\mathrm{Var}(\hat{y}^* + \varepsilon^*) \overset{**}{=} \mathrm{Var}(\hat{y}^*) + \mathrm{Var}(\varepsilon^*) = \sigma^2 \left( 1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}} \right).$$
** note that $\hat{y}^*$ and $\varepsilon^*$ are not correlated, because $\varepsilon^*$ is the error of a new observation (which is independent of the training observations, hence independent of $\hat{\beta}_0$ and $\hat{\beta}_1$).
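A small simulation sketch (all names and parameter values are arbitrary) confirming that the formula for the variance of the predicted mean matches the standard error reported by `predict()` once $\sigma^2$ is replaced by its estimate:

set.seed(123)
n <- 30
x <- runif(n, 0, 10)
y <- 1 + 0.5 * x + rnorm(n, sd = 2)
fit <- lm(y ~ x)
x_star <- 7
s2 <- summary(fit)$sigma^2
Sxx <- sum((x - mean(x))^2)
var_mean <- s2 * (1 / n + (x_star - mean(x))^2 / Sxx)  # hand formula, sigma^2 estimated
p <- predict(fit, data.frame(x = x_star), se.fit = TRUE)
c(sqrt(var_mean), p$se.fit)                            # identical
# For the individual response, add s2: var_indiv <- s2 + var_mean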
Question (EPS and STKPRICE):
We have $s^2 = \mathrm{RSS}/(n-2)$, hence $s^2 = 11377/46 = 247.3$ and $s = \sqrt{247.3} = 15.73$.
We know that $R^2 = \mathrm{MSS}/\mathrm{TSS} = 10475/21851 = 0.479$, and $R^2 = r^2$, where $r$ is the empirical correlation. Hence, $r = \sqrt{0.479} = 0.692$. We take the positive square root because of the positive sign of the coefficient of EPS.
Given $\mathrm{EPS} = 2$, we have:
$$\widehat{\mathrm{STKPRICE}} = 25.044 + 7.445 \times 2 = 39.93.$$
Note that $s_x^2$ is the sample variance of EPS, and we have $S_{xx} = (n-1)\,s_x^2$. (The needed ingredients can also be backed out of the output: $\mathrm{SE}(\hat{\beta}_1) = s / \sqrt{S_{xx}}$ gives $S_{xx} = (15.73/1.144)^2 \approx 189.1$, and $\mathrm{SE}(\hat{\beta}_0) = s \sqrt{1/n + \bar{x}^2 / S_{xx}}$ gives $\bar{x} \approx 2.12$.) The estimated standard deviation of the predicted mean is:
$$s \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}} = 15.73 \sqrt{\frac{1}{48} + \frac{(2 - 2.12)^2}{189.1}} \approx 2.27.$$
Because the 97.5%-quantile of the standard Normal is 1.96, we obtain the approximate confidence interval:
$$39.93 \pm 1.96 \times 2.27 = (35.48, 44.38).$$
For the confidence interval on the "individual predicted response", the only difference is that the estimated variance of the prediction is larger by "$+\, s^2$". Hence,
$$\sqrt{2.27^2 + 15.73^2} = 15.89,$$
and the confidence interval is
$$39.93 \pm 1.96 \times 15.89 = (8.79, 71.07).$$
This confidence interval is much broader, which should make sense: there is far less uncertainty about the overall trend (the predicted mean of $y^*$) than about a specific data point (the actual, individual, $y^*$).
A 95% confidence interval for $\beta_1$ is:
$$\hat{\beta}_1 \pm t_{46, 0.975}\, \mathrm{SE}(\hat{\beta}_1) = 7.445 \pm 2.013 \times 1.144 = (5.14, 9.75).$$
A scatter plot of the residuals (standardised) against either their fitted values ($\hat{y}_i$) or their predictors ($x_i$) provides a visual tool to assess the constancy of the variation in the errors: the vertical spread of the points should look roughly the same across the horizontal axis.
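For instance, a minimal sketch (with simulated data standing in for an arbitrary fitted model):

set.seed(11)
x <- runif(40)
y <- 2 + 3 * x + rnorm(40)
fit <- lm(y ~ x)
# Standardised residuals against fitted values; a funnel shape would
# suggest non-constant error variance
plot(fitted(fit), rstandard(fit), xlab = "Fitted values", ylab = "Standardised residuals")
abline(h = 0, lty = 2)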
Because the 95% confidence interval for $\beta_1$ computed earlier does not contain $0$, we already know that, at a level of significance of 5%, we do reject $H_0\colon \beta_1 = 0$. We can also double-check explicitly (it leads to the same conclusion) by computing the test statistic. For the significance of the variable EPS, we test $H_0\colon \beta_1 = 0$ against $H_1\colon \beta_1 \neq 0$. The test statistic is:
$$t = \frac{\hat{\beta}_1}{\mathrm{SE}(\hat{\beta}_1)} = \frac{7.445}{1.144} = 6.51.$$
This is larger than $t_{46, 0.975} \approx 2.013$ and therefore we reject the null. There is evidence to support the fact that the EPS variable is a significant predictor of stock price.
To test $H_0\colon \beta_1 = b_0$ against $H_1\colon \beta_1 > b_0$, the test statistic is given by:
$$t = \frac{\hat{\beta}_1 - b_0}{\mathrm{SE}(\hat{\beta}_1)} = \frac{7.445 - b_0}{1.144}.$$
Thus, since this test statistic is smaller than $t_{46, 0.95} \approx 1.679$, we do not reject the null hypothesis.
-
We have the estimated correlation coefficient:
$$r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}} = \frac{1278.12}{\sqrt{780.71 \times 3587.66}} = 0.764,$$
where $S_{xx} = 129853.03 - 1136.1^2/10 = 780.71$, $S_{yy} = 377700.62 - 1934.2^2/10 = 3587.66$, and $S_{xy} = 221022.58 - 1136.1 \times 1934.2 / 10 = 1278.12$.
- We have the hypothesis: $H_0\colon \rho = 0$ against $H_1\colon \rho > 0$.
- The test statistic is:
$$T = \frac{r \sqrt{n-2}}{\sqrt{1 - r^2}} \sim t_{n-2} \text{ under } H_0.$$
- The critical region is given by: $T > t_{8, 1-\alpha}$.
- The value of the test is:
$$T = \frac{0.764 \sqrt{8}}{\sqrt{1 - 0.764^2}} = 3.35.$$
- We have $t_{8, 0.995} = 3.355$. Thus the $p$-value is approximately 0.005, and we reject the null hypothesis of a zero correlation for any level of significance that is 0.005 (i.e., 0.5%) or more. This means we reject for the typical significance levels used (1%, 5%).
Given the issue of whether mortality can be used to predict sickness, we require a plot of sickness against mortality:
Scatterplot of sickness against mortality. There seems to be an increasing linear relationship, such that mortality could be used to predict sickness.
We have the estimates:
$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{1278.12}{780.71} = 1.6371, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = 193.42 - 1.6371 \times 113.61 = 7.43,$$
so the fitted regression line is $\hat{y} = 7.43 + 1.6371\, x$.
- Hypothesis: $H_0\colon \beta_1 = 2$ against $H_1\colon \beta_1 > 2$ (we set up the hypothesis this way because, if we then reject $H_0$, we have some evidence that $\beta_1 > 2$).
- Test statistic:
$$T = \frac{\hat{\beta}_1 - 2}{\sqrt{\hat{\sigma}^2 / S_{xx}}} \sim t_{n-2} \text{ under } H_0.$$
- Critical region: $T > t_{8, 0.95} = 1.860$.
- Value of statistic: with
$$\hat{\sigma}^2 = \frac{1}{8} \left( 3587.66 - \frac{1278.12^2}{780.71} \right) = 186.90,$$
we get
$$T = \frac{1.6371 - 2}{\sqrt{186.90 / 780.71}} = \frac{-0.363}{0.489} = -0.74.$$
- It is obvious we do not reject, because the test statistic is negative. Note that from Formulae and Tables p.163 you have $t_{8, 0.95} = 1.860$, hence a test statistic larger than that would have resulted in a rejection of $H_0$ in favour of $H_1$. Here, though, we do not have any evidence to support $\beta_1 > 2$.
For a region with $x = 115.0$ we have the estimated value:
$$\hat{y} = 7.43 + 1.6371 \times 115.0 = 195.7.$$
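As a check, in R (a sketch; the rates are typed in from the table in the question):

mortality <- c(125.2, 119.3, 125.3, 111.7, 117.3, 100.7, 108.8, 102.0, 104.7, 121.1)
sickness <- c(206.8, 213.8, 197.2, 200.6, 189.1, 183.6, 181.2, 168.2, 165.2, 228.5)
cor.test(mortality, sickness, alternative = "greater")  # r = 0.764, one-sided p approx 0.005
fit <- lm(sickness ~ mortality)
coef(fit)                                               # approximately 7.43 and 1.637
# Test H0: beta1 = 2 against H1: beta1 > 2
(coef(fit)[2] - 2) / summary(fit)$coefficients[2, 2]    # approximately -0.74: do not reject
predict(fit, data.frame(mortality = 115))               # approximately 195.7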
More Proofs
Question For $\hat{\beta}_1$, using the expression derived in Q1 (together with $\sum_{i=1}^{n} (x_i - \bar{x}) \bar{y} = 0$, which gives the linear-combination form $\hat{\beta}_1 = \sum_{i=1}^{n} \frac{x_i - \bar{x}}{S_{xx}} y_i$):
$$\mathbb{E}[\hat{\beta}_1] = \sum_{i=1}^{n} \frac{x_i - \bar{x}}{S_{xx}} \mathbb{E}[y_i] = \sum_{i=1}^{n} \frac{x_i - \bar{x}}{S_{xx}} \left( \beta_0 + \beta_1 x_i \right) = \beta_1 \frac{\sum_{i=1}^{n} (x_i - \bar{x}) x_i}{S_{xx}} = \beta_1,$$
since $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$ and $\sum_{i=1}^{n} (x_i - \bar{x}) x_i = S_{xx}$.
For $\hat{\beta}_0$,
$$\mathbb{E}[\hat{\beta}_0] = \mathbb{E}[\bar{y} - \hat{\beta}_1 \bar{x}] = \left( \beta_0 + \beta_1 \bar{x} \right) - \beta_1 \bar{x} = \beta_0.$$
Hence both estimators are unbiased.
Question In the regression model there are three parameters to estimate: $\beta_0$, $\beta_1$, and $\sigma^2$.
Under the (strong) normality assumptions, the joint density of $y_1, \ldots, y_n$ is the product of their marginals (independent by assumption), so that the likelihood is:
$$L(\beta_0, \beta_1, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - \beta_0 - \beta_1 x_i)^2}{2\sigma^2} \right).$$
Taking partial derivatives of the log-likelihood $\ell = \log L$ with respect to $\beta_0$:
$$\frac{\partial \ell}{\partial \beta_0} = \frac{1}{\sigma^2} \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right).$$
Equating the above to 0 and solving for $\beta_0$ gives
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.$$
Similarly, taking partial derivatives of the log-likelihood with respect to $\beta_1$:
$$\frac{\partial \ell}{\partial \beta_1} = \frac{1}{\sigma^2} \sum_{i=1}^{n} x_i \left( y_i - \beta_0 - \beta_1 x_i \right) = \frac{1}{\sigma^2} \left( \sum_{i=1}^{n} x_i y_i - n \bar{x} \beta_0 - \beta_1 \sum_{i=1}^{n} x_i^2 \right).$$
The last line was derived using the fact that $\sum_{i=1}^{n} x_i = n \bar{x}$.
Equating the above equation to 0, substituting $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$, and solving for $\beta_1$, we get:
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n \bar{x} \bar{y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2} = \frac{S_{xy}}{S_{xx}},$$
so the maximum likelihood estimates of $\beta_0$ and $\beta_1$ coincide with the least squares estimates.
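A numerical sketch on simulated data (arbitrary values throughout) showing that maximising this likelihood reproduces the least squares coefficients:

set.seed(7)
x <- rnorm(40)
y <- 3 + 1.5 * x + rnorm(40)
# Negative log-likelihood; par = (beta0, beta1, log sigma)
nll <- function(par) {
  -sum(dnorm(y, mean = par[1] + par[2] * x, sd = exp(par[3]), log = TRUE))
}
mle <- optim(c(0, 0, 0), nll)$par
rbind(mle = mle[1:2], lsq = coef(lm(y ~ x)))  # the two rows agree (up to optimiser tolerance)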
Question We have that
$$\mathrm{Var}(\hat{\beta}_1) = \mathrm{Var}\left( \sum_{i=1}^{n} \frac{x_i - \bar{x}}{S_{xx}} y_i \right) \overset{*}{=} \sum_{i=1}^{n} \frac{(x_i - \bar{x})^2}{S_{xx}^2} \mathrm{Var}(y_i) = \frac{\sigma^2}{S_{xx}},$$
* using the independence of the $y_i$'s. Continuing:
$$\mathrm{Var}(\hat{\beta}_0) = \mathrm{Var}(\bar{y} - \hat{\beta}_1 \bar{x}) \overset{**}{=} \mathrm{Var}(\bar{y}) + \bar{x}^2\, \mathrm{Var}(\hat{\beta}_1) = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right).$$
** uses $\mathrm{Var}(\bar{y}) = \sigma^2 / n$ (which is self-explanatory) and $\mathrm{Cov}(\bar{y}, \hat{\beta}_1) = 0$ (we'll prove this at the end of this question). With these, we have the following result:
$$\mathrm{Cov}(\hat{\beta}_0, \hat{\beta}_1) = \mathrm{Cov}(\bar{y} - \hat{\beta}_1 \bar{x},\ \hat{\beta}_1) = \mathrm{Cov}(\bar{y}, \hat{\beta}_1) - \bar{x}\, \mathrm{Var}(\hat{\beta}_1) = -\frac{\sigma^2 \bar{x}}{S_{xx}}.$$
*Proof of $\mathrm{Cov}(\bar{y}, \hat{\beta}_1) = 0$:*
Using the expressions for $\bar{y}$ and $\hat{\beta}_1$, we have:
$$\mathrm{Cov}(\bar{y}, \hat{\beta}_1) \overset{***}{=} \mathrm{Cov}\left( \frac{1}{n} \sum_{i=1}^{n} y_i,\ \sum_{j=1}^{n} \frac{x_j - \bar{x}}{S_{xx}} y_j \right) = \frac{1}{n S_{xx}} \sum_{i=1}^{n} (x_i - \bar{x})\, \mathrm{Var}(y_i) = \frac{\sigma^2}{n S_{xx}} \sum_{i=1}^{n} (x_i - \bar{x}) = 0.$$
*** uses the linear-combination form of $\hat{\beta}_1$ from the proof in Q1.
Question The MSS is
$$\mathrm{MSS} = \sum_{i=1}^{n} \left( \hat{y}_i - \bar{y} \right)^2 = \sum_{i=1}^{n} \left( \hat{\beta}_1 (x_i - \bar{x}) \right)^2 = \hat{\beta}_1^2\, S_{xx},$$
using $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i = \bar{y} + \hat{\beta}_1 (x_i - \bar{x})$.
Question We first consider $\hat{\beta}_1$.
Note that we have:
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{S_{xx}} \overset{*}{=} \sum_{i=1}^{n} \frac{x_i - \bar{x}}{S_{xx}} y_i,$$
*uses: $\sum_{i=1}^{n} (x_i - \bar{x}) \bar{y} = 0$.
Therefore
$$\hat{\beta}_1 \sim N\left( \beta_1,\ \frac{\sigma^2}{S_{xx}} \right).$$
This uses the fact that a linear combination of independent normal random variables is itself normal: $\hat{\beta}_1 = \sum_i c_i y_i$ with $c_i = (x_i - \bar{x}) / S_{xx}$, where the $c_i$'s are constant ($x$ is given), hence $\hat{\beta}_1$ is normally distributed, with the mean and variance derived earlier.
We next consider $\hat{\beta}_0$.
Using that:
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = \sum_{i=1}^{n} \left( \frac{1}{n} - \frac{(x_i - \bar{x}) \bar{x}}{S_{xx}} \right) y_i$$
is again a linear combination of the $y_i$'s, we get
$$\hat{\beta}_0 \sim N\left( \beta_0,\ \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right) \right).$$
Finally, we consider the predicted mean $\hat{y}^* = \hat{\beta}_0 + \hat{\beta}_1 x^*$ at a given $x^*$. Being a linear combination of $\hat{\beta}_0$ and $\hat{\beta}_1$ (each of which is a linear combination of the $y_i$'s),
we have:
$$\hat{y}^* \sim N\left( \beta_0 + \beta_1 x^*,\ \sigma^2 \left( \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}} \right) \right).$$
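A simulation sketch (the setup and all values here are arbitrary) of these sampling distributions: refit the model on repeated draws of the errors and compare the empirical variances of the estimates with the theoretical ones.

set.seed(99)
n <- 25
x <- runif(n)
beta0 <- 1; beta1 <- 2; sigma <- 0.5
Sxx <- sum((x - mean(x))^2)
# Redraw the errors 5,000 times and collect the fitted coefficients
est <- replicate(5000, coef(lm(beta0 + beta1 * x + rnorm(n, sd = sigma) ~ x)))
c(var(est[2, ]), sigma^2 / Sxx)                        # slope: empirical vs theoretical
c(var(est[1, ]), sigma^2 * (1 / n + mean(x)^2 / Sxx))  # intercept: empirical vs theoretical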
Applied Questions
-
Please install the `ISLR2` package and load the data with the following commands first.
install.packages("ISLR2")
library(ISLR2)
fit <- lm(mpg ~ horsepower, data = Auto)
summary(fit)
Call:
lm(formula = mpg ~ horsepower, data = Auto)

Residuals:
     Min       1Q   Median       3Q      Max
-13.5710  -3.2592  -0.3435   2.7630  16.9240

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.935861   0.717499   55.66   <2e-16 ***
horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.906 on 390 degrees of freedom
Multiple R-squared:  0.6059,    Adjusted R-squared:  0.6049
F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16
Yes
Very significant ($p$-value $< 2 \times 10^{-16}$)
Negative
predict(fit, newdata = data.frame(horsepower = c(98)), interval = "confidence")
fit lwr upr 1 24.46708 23.97308 24.96108
predict(fit, newdata = data.frame(horsepower = c(98)), interval = "prediction")
fit lwr upr 1 24.46708 14.8094 34.12476
plot(Auto$horsepower, Auto$mpg)
abline(a = fit$coefficients[1], b = fit$coefficients[2])
par(mfrow = c(2, 2))
plot(fit)
There appears to be some trend in the residuals, indicating a linear fit is not appropriate.
-
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
summary(lm(y ~ x + 0))
Call:
lm(formula = y ~ x + 0)

Residuals:
    Min      1Q  Median      3Q     Max
-1.9154 -0.6472 -0.1771  0.5056  2.3109

Coefficients:
  Estimate Std. Error t value Pr(>|t|)
x   1.9939     0.1065   18.73   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9586 on 99 degrees of freedom
Multiple R-squared:  0.7798,    Adjusted R-squared:  0.7776
F-statistic: 350.7 on 1 and 99 DF,  p-value: < 2.2e-16
The estimate (1.9939) is fairly close to what is expected (the true slope is 2).
summary(lm(x ~ y + 0))
Call:
lm(formula = x ~ y + 0)

Residuals:
    Min      1Q  Median      3Q     Max
-0.8699 -0.2368  0.1030  0.2858  0.8938

Coefficients:
  Estimate Std. Error t value Pr(>|t|)
y  0.39111    0.02089   18.73   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4246 on 99 degrees of freedom
Multiple R-squared:  0.7798,    Adjusted R-squared:  0.7776
F-statistic: 350.7 on 1 and 99 DF,  p-value: < 2.2e-16
The estimate (0.39111) is a bit far from what might naively be expected (0.5); indeed, 0.5 does not land in its 95% confidence interval.
The estimate in (a) is about 5 times the estimate in (b). The $t$-statistics, however, are identical.
See:
$$t = \frac{\hat{\beta}}{\mathrm{SE}(\hat{\beta})} = \frac{\sum_i x_i y_i / \sum_i x_i^2}{\sqrt{\dfrac{\sum_i (y_i - x_i \hat{\beta})^2}{(n-1) \sum_i x_i^2}}} = \frac{\sqrt{n-1}\, \sum_i x_i y_i}{\sqrt{\sum_i x_i^2\, \sum_i (y_i - x_i \hat{\beta})^2}} = \frac{\sqrt{n-1}\, \sum_i x_i y_i}{\sqrt{\sum_i x_i^2 \sum_i y_i^2 - \left( \sum_i x_i y_i \right)^2}},$$
where the last step uses
$$\sum_i x_i^2 \sum_i (y_i - x_i \hat{\beta})^2 = \sum_i x_i^2 \sum_i y_i^2 - 2 \hat{\beta} \sum_i x_i^2 \sum_i x_i y_i + \hat{\beta}^2 \left( \sum_i x_i^2 \right)^2 = \sum_i x_i^2 \sum_i y_i^2 - \left( \sum_i x_i y_i \right)^2,$$
with $\hat{\beta} = \sum_i x_i y_i / \sum_i x_i^2$.
In R, this is written as
(sqrt(100 - 1) * sum(x * y)) / sqrt(sum(x^2) * sum(y^2) - sum(x * y)^2)
[1] 18.72593
This returns the same value as the t-statistic.
Due to the symmetry of $x$ and $y$ in the formula above, exchanging the roles of $x$ and $y$ gives exactly the same expression. Hence the $t$-statistic is the same.
fit <- lm(y ~ x)
fit2 <- lm(x ~ y)
summary(fit)
Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-1.8768 -0.6138 -0.1395  0.5394  2.3462

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.03769    0.09699  -0.389    0.698
x            1.99894    0.10773  18.556   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9628 on 98 degrees of freedom
Multiple R-squared:  0.7784,    Adjusted R-squared:  0.7762
F-statistic: 344.3 on 1 and 98 DF,  p-value: < 2.2e-16
summary(fit2)
Call:
lm(formula = x ~ y)

Residuals:
     Min       1Q   Median       3Q      Max
-0.90848 -0.28101  0.06274  0.24570  0.85736

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.03880    0.04266    0.91    0.365
y            0.38942    0.02099   18.56   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4249 on 98 degrees of freedom
Multiple R-squared:  0.7784,    Adjusted R-squared:  0.7762
F-statistic: 344.3 on 1 and 98 DF,  p-value: < 2.2e-16