Overview
This session will cover linear regression. As in the ANOVA session, I will start by explaining the basic principles of linear regression, and then show you the function you can use to run a linear regression in R.
Linear regression allows us to test for an effect of a continuous explanatory variable on a measured response variable.
If you are not familiar with basic statistical concepts (hypotheses, null hypotheses, P values, degrees of freedom etc.), I recommend taking a look at these videos.
In this session, we will work with a new dataset that is built into R, the ‘swiss’ dataset, which details socio-economic and fertility indicators for Swiss cantons in 1888.
data(swiss)
head(swiss)
## Fertility Agriculture Examination Education Catholic
## Courtelary 80.2 17.0 15 12 9.96
## Delemont 83.1 45.1 6 9 84.84
## Franches-Mnt 92.5 39.7 5 5 93.40
## Moutier 85.8 36.5 12 7 33.77
## Neuveville 76.9 43.5 17 15 5.16
## Porrentruy 76.1 35.3 9 7 90.57
## Infant.Mortality
## Courtelary 22.2
## Delemont 22.2
## Franches-Mnt 20.2
## Moutier 20.3
## Neuveville 20.6
## Porrentruy 26.6
Specifically, we will look at whether education level had an effect on fertility. Let’s look at the relationship between the % of people that received education beyond primary level and average fertility.
library(ggplot2)
ggplot(data = swiss,mapping = aes(x=Education,y=Fertility)) + geom_point() +
theme_classic()
There appears to be a negative relationship between education and fertility, but we will use linear regression to test whether the relationship is statistically significant.
Running Linear Regression Manually
Basic Principles of Linear Regression
As with analysis of variance, running a linear regression involves calculating an F ratio, which we then compare to an F distribution to estimate a P value (i.e., the probability of obtaining a relationship as strong as that observed by chance given the null hypothesis that there is actually no relationship between the variables). Similar to an ANOVA, the F ratio for a linear regression is the ratio of the mean of squares regression (i.e., the variation in our response variable that is explained by our explanatory variable) and the mean of squares error (i.e. the variation in our response variable not captured by our fitted regression): \[F = \frac{MS_{regression}}{MS_{error}}\]
Again, exactly as in ANOVA, the mean of squares is the sum of squares divided by the degrees of freedom: \[MS_{regression} = \frac{SS_{regression}}{DF_{regression}}\] \[MS_{error} = \frac{SS_{error}}{DF_{error}}\]
The degrees of freedom, again exactly as in ANOVA, are given as: \[DF_{regression} = K - 1\] \[DF_{error} = N - K\] \[DF_{total} = N - 1\]
As in ANOVA, \(N\) is the total sample size. For linear regression, \(K\) is the number of parameters in the analysis, i.e. 2 (the slope and the intercept).
The sums of squares in linear regression are directly analogous to those in ANOVA.
The total sum of squares is the sum, across all data points, \(i\), of the squared differences between recorded values, \(y_i\), and the overall mean of all values, \(\overline{y}\): \[SS_{total} = \sum_{i=1}^{N}{(y_i - \overline{y})^2}\]
The regression sum of squares is the sum, across all data points, \(i\), of the squared differences between the values predicted by the regression equation, \(\hat{y_i}\), and the overall mean of all value in the dataset, \(\overline{y}\): \[SS_{regression} = \sum_{i=1}^{N}{(\hat{y_i} - \overline{y})^2}\]
The error sum of squares is the sum, across all data points, \(i\), of the squared differences between the observed data values, \(y_i\), and the values predicted by the regression equation, \(\hat{y_i}\): \[SS_{error} = \sum_{i=1}^{N}{(y_i - \hat{y_i})^2}\]
Just as in ANOVA, these differences between observed and predicted values are the model residuals.
Manual Calculation of Linear Regression
We will not actually run a manual linear regression here. There is an additional step in doing so compared to calculating an ANOVA, in that you first have to find the best-fitting regression line, from which to calculate the sums of squares. The usual method for finding the best-fitting regression line is to minimize the error sum of squares. This approach belongs to a class of models referred to as Ordinary Least Squares (OLS) regression, which we will come across again in later sessions.
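For the curious, though, here is a minimal sketch of what that manual calculation might look like in R, using the education and fertility variables from the swiss dataset (purely for illustration, and with variable names chosen just for this sketch). It should reproduce the F ratio and P value that R's built-in functions give us below.
x <- swiss$Education
y <- swiss$Fertility
N <- length(y)
K <- 2 # number of parameters: the slope and the intercept

# Ordinary least squares estimates of the slope and intercept
slope <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
intercept <- mean(y) - slope * mean(x)

# Values predicted by the fitted regression line
y.hat <- intercept + slope * x

# Sums of squares, mean squares and the F ratio
SS.regression <- sum((y.hat - mean(y))^2)
SS.error <- sum((y - y.hat)^2)
MS.regression <- SS.regression / (K - 1)
MS.error <- SS.error / (N - K)
F.ratio <- MS.regression / MS.error

# P value from the F distribution
pf(F.ratio, df1 = K - 1, df2 = N - K, lower.tail = FALSE)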
Running Linear Regression in R
Basic Operation
In R, we can run a linear regression very simply using the lm function. Here we will test whether the % of people with post-primary education in Swiss cantons in the late 19th Century had a significant effect on their fertility. First, we must ask R to fit a linear regression of fertility as a function of (~) education:
m1 <- lm(formula = Fertility~Education,data = swiss)
summary(m1)
##
## Call:
## lm(formula = Fertility ~ Education, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.036 -6.711 -1.011 9.526 19.689
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 79.6101 2.1041 37.836 < 2e-16 ***
## Education -0.8624 0.1448 -5.954 3.66e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.446 on 45 degrees of freedom
## Multiple R-squared: 0.4406, Adjusted R-squared: 0.4282
## F-statistic: 35.45 on 1 and 45 DF, p-value: 3.659e-07
The summary of the model gives us the slope and intercept of the best-fitting regression line (i.e., by minimizing the error sums of squares). The slope and intercept are shown under “Coefficients”, in the “Estimate” column (see below for more detail on the model output). The model summary also gives us a P value (i.e., the probability of observing this relationship given the null hypothesis that there is no relationship between the variables). The P value here is much smaller than the typically accepted threshold of 0.05, and so we can conclude that education had a significant effect on fertility in 19th-Century Switzerland.
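If you want to work with these estimates directly, rather than reading them off the printed summary, you can extract the coefficients (and, for example, their 95% confidence intervals) from the fitted model object:
# Extract the intercept and slope, and their 95% confidence intervals
coef(m1)
confint(m1, level = 0.95)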
We can also obtain our sums of squares by running the anova function. Note that, confusingly, this is not the same as the aov function, which we used to run an actual ANOVA in the previous session. The anova function calculates the sums of squares, F ratio and a P value from any basic statistical model:
anova(m1)
## Analysis of Variance Table
##
## Response: Fertility
## Df Sum Sq Mean Sq F value Pr(>F)
## Education 1 3162.7 3162.7 35.446 3.659e-07 ***
## Residuals 45 4015.2 89.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
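To connect this table back to the formulas above, the F ratio is simply the ratio of the two mean squares, each of which is a sum of squares divided by its degrees of freedom. As a quick check, using the rounded values printed above:
# Mean squares from the table: sum of squares divided by degrees of freedom
MS.regression <- 3162.7 / 1
MS.error <- 4015.2 / 45
# Their ratio should be approximately the F value of 35.45 reported above
MS.regression / MS.error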
Interpreting Model Coefficients
Let’s inspect the model coefficient table in a bit more detail. As I said earlier, this table gives us the slope and intercept of the best-fitting regression line. The intercept is the value of the response variable when our explanatory variable has a value of zero. The slope tells us how much (and in what direction) the response variable changes for each unit increase in our explanatory variable.
An important difference compared with the coefficient table for a simple ANOVA is that this table also gives us the uncertainty (specifically the standard error) associated with the coefficients. This is a measure of how confident we can be in the estimated coefficient values. If we were to repeat our experiment a sufficiently large number of times, collecting a new dataset each time and running a linear regression on it, we would expect to obtain normal distributions of intercept and slope estimates with standard deviations equal to the estimated standard errors of the coefficients. In other words, these uncertainty estimates don’t tell us how much variation in our response variable there is in the underlying population we sampled, but rather how accurately we are able to estimate the coefficient values themselves. Up to a point, if we increase our sample size, we expect to see a decrease in the standard error of our coefficient estimates. If you want a nice introduction to the distinction between the standard deviation describing variability in the underlying population, and the standard error of mean estimates, see this video.
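One way to build intuition for what these standard errors represent is a small simulation (not part of the analysis itself). If we treat the fitted model as though it were the true data-generating process, simulate many new datasets from it, and refit the regression each time, the spread of the re-estimated slopes should be close to the standard error reported in the coefficient table (about 0.14):
set.seed(123)

slopes <- replicate(1000, {
  # Simulate new fertility values from the fitted line plus normal errors
  new.y <- fitted(m1) + rnorm(n = nrow(swiss), mean = 0, sd = sigma(m1))
  # Refit the regression to the simulated data and keep the slope estimate
  coef(lm(new.y ~ swiss$Education))[2]
})

# The standard deviation of the simulated slope estimates
sd(slopes)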
Checking Model Assumptions
As with all parametric statistical tests, linear regressions are subject to four key assumptions:
- equality of variance
- a normal distribution of residuals
- a linear relationship between variables, and
- independence of individual data points
Since the data represent different cantons in Switzerland, it is probably OK to assume for now that the data points are independent. We will talk about potential non-independence in spatial datasets in a later session, but for now we will leave this assumption to one side.
The figure we plotted earlier of the relationship between education and fertility suggests that the assumption of a linear relationship is reasonable.
To test for equality of variance in a linear regression model, we can look at the spread of residuals across the range of predicted values in our regression; we would hope to see a similar spread across these predicted values. The most straightforward way to do this is to plot the relationship between the predicted (or fitted) values from the model and the model residuals:
plot(fitted(m1),residuals(m1))
# Add a horizontal red dashed line where the residuals = 0
abline(h=0,lwd=2,lty=2,col="red")
This plot clearly shows an increase in the spread of residuals at higher predicted/fitted values, and so our model definitely violates the assumption of equality of variance.
Before dealing with this issue, we will also check for a normal distribution of residuals. As when we ran ANOVA tests in the previous sessions, we can check for a normal distribution of residuals either using a Q-Q plot, or a simple histogram:
qqnorm(residuals(m1))
qqline(residuals(m1))
hist(residuals(m1))
Both these plots suggest a clear deviation from a normal distribution of residuals.
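As an aside, R can produce a standard set of diagnostic plots directly from a fitted model object, which show the same information as the plots we made by hand (which = 1 gives the residuals against the fitted values, and which = 2 gives the Q-Q plot):
# Built-in diagnostic plots for a fitted linear model
plot(m1, which = 1)
plot(m1, which = 2)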
As before, if the assumptions of our test are violated, a simple thing to check is whether the input variables conform to a normal distribution. Although it is not necessary that the input variables are normally distributed, it is more likely that the assumptions of equality of variance and a normal distribution of residuals will be violated if not:
hist(swiss$Education)
hist(swiss$Fertility)
The fertility values (response variable) seem reasonably close to a normal distribution, but the education values are very right-skewed. The obvious next step, therefore, is to try a model where we log-transform the values for education:
swiss$LogEducation <- log10(swiss$Education)
m2 <- lm(formula = Fertility~LogEducation,data = swiss)
summary(m2)
##
## Call:
## lm(formula = Fertility ~ LogEducation, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.697 -6.326 -2.859 8.697 22.094
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 87.472 4.479 19.53 < 2e-16 ***
## LogEducation -19.008 4.602 -4.13 0.000155 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.75 on 45 degrees of freedom
## Multiple R-squared: 0.2749, Adjusted R-squared: 0.2588
## F-statistic: 17.06 on 1 and 45 DF, p-value: 0.0001551
anova(m2)
## Analysis of Variance Table
##
## Response: Fertility
## Df Sum Sq Mean Sq F value Pr(>F)
## LogEducation 1 1973.1 1973.14 17.059 0.0001551 ***
## Residuals 45 5204.8 115.66
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We still have a highly significant negative relationship between education and fertility. Let’s check the assumptions of equality of variance and normal distribution of residuals again:
plot(fitted(m2),residuals(m2))
abline(h=0,lwd=2,lty=2,col="red")
qqnorm(residuals(m2))
qqline(residuals(m2))
hist(residuals(m2))
The spread of residuals certainly seems more even along the range of fitted values. The residuals also seem to conform more closely to a normal distribution, although still with a slight deviation. If we were conducting this analysis as part of a scientific study that we wanted to publish, we would probably dig into this issue more carefully.
The eagle-eyed among you may also have noticed that the residuals seem to be lower at either end of the range of fitted values. If we re-plot the original relationship between education and fertility, but this time with the log-transformed education values, we can see why:
ggplot(data = swiss,mapping = aes(x=LogEducation,y=Fertility)) + geom_point() + geom_smooth(method="lm") + theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
In fact, it looks as though the relationship between education and fertility is non-linear, violating our assumption of linearity. We won’t deal with the non-linearity issue any further in this session, but we will return to this in later sessions.
Assuming for now that we are happy with this latest model, we will now move onto how we make predictions from our model.
Making Predictions from a Linear Regression Model
There are many situations in which it is necessary to make predictions from a model, including to make graphs of the results. For very simple relationships, such as a simple linear regression, packages such as ggplot2 will add the fitted regression line for us, but for more complex models we need to make the predictions manually so that all of the variables accounted for in the model are captured.
The predict function in R can make predictions for most model types (we will encounter a few exceptions in later tutorials). The simplest method is to make predictions for the observations in the original dataset (with their associated explanatory variables).
preds <- predict(m2)
summary(preds)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 54.70 66.96 70.31 70.14 72.68 87.47
At the moment, this just gives us the point prediction (the fitted value) from our model, but we also need to know the uncertainty in our predictions. For this, we can use the interval argument. We now obtain a data frame with the fitted prediction (‘fit’) and the lower and upper bounds of the 95% confidence interval (‘lwr’ and ‘upr’). If we repeated our experiment many times from scratch, we would expect confidence intervals calculated in this way to contain the true average fertility for a given education level 95% of the time.
preds <- predict(m2,interval='confidence')
head(preds)
## fit lwr upr
## Courtelary 66.95886 63.43846 70.47925
## Delemont 69.33366 66.14956 72.51777
## Franches-Mnt 74.18582 70.46152 77.91012
## Moutier 71.40825 68.18896 74.62755
## Neuveville 65.11682 61.11818 69.11546
## Porrentruy 71.40825 68.18896 74.62755
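If you are curious how these intervals are constructed, for a linear regression they are simply the fitted value plus or minus a critical value from the t distribution (with the model's residual degrees of freedom) multiplied by the standard error of the fitted value. The following sketch should reproduce the intervals shown above:
# Obtain fitted values together with their standard errors
p <- predict(m2, se.fit = TRUE)
# Critical value of the t distribution with the residual degrees of freedom
t.crit <- qt(0.975, df = df.residual(m2))
# Construct the 95% confidence intervals by hand
head(cbind(fit = p$fit,
           lwr = p$fit - t.crit * p$se.fit,
           upr = p$fit + t.crit * p$se.fit))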
Let’s add these model predictions to the original data frame.
swiss <- cbind(swiss,preds)
We can now add these predictions to a graph to show both our model data and model fit.
ggplot(data = swiss,mapping = aes(x = Education)) +
geom_point(mapping = aes(y = Fertility)) +
geom_line(mapping = aes(y = fit),colour="#2E2585",linewidth = 0.8) +
geom_ribbon(mapping = aes(ymin = lwr,ymax = upr),fill = "#2E2585",alpha = 0.2) +
scale_x_continuous(name = "Post-primary Education (%)",transform = 'log10') +
scale_y_continuous(name = "Fertility (standardized)") +
theme_classic()
Finally, we might also be interested in making predictions for values of the explanatory variables not encountered in the original dataset. In this case, we can pass a new data frame of explanatory variable values to the predict function. Let’s say here that we are interested in making predictions of fertility for levels of post-primary education up to 100%. So, let’s predict fertility at levels of post-primary education of 1%, 25%, 50%, 75% and 100%.
nd <- data.frame(Education = c(1,25,50,75,100))
nd$LogEducation <- log10(nd$Education)
preds <- predict(object=m2,newdata=nd,interval='confidence')
preds <- cbind(nd,preds)
preds
## Education LogEducation fit lwr upr
## 1 1 0.000000 87.47167 78.44994 96.49340
## 2 25 1.397940 60.89997 55.39575 66.40419
## 3 50 1.698970 55.17807 47.22614 63.13000
## 4 75 1.875061 51.83097 42.35900 61.30294
## 5 100 2.000000 49.45617 38.88542 60.02692
Therefore, our model predicts that if everyone in a 19th-Century Swiss canton had received post-primary education, the standardized fertility level would have been 49.5 (95% confidence interval: 38.9 - 60.0).
So far, we have calculated the confidence interval, which gives an estimate of the range within which the average fertility of a canton with, say, 100% post-primary education is expected to fall. In other words, this confidence interval ignores variation among individual cantons, and represents our ability to estimate the average response.
We might also be interested in our ability to predict any given individual response, now accounting for individual variation. In this case, we need to calculate prediction intervals.
nd <- data.frame(Education = c(1,25,50,75,100))
nd$LogEducation <- log10(nd$Education)
preds <- predict(object=m2,newdata=nd,interval='prediction')
preds <- cbind(nd,preds)
preds
If you run this code, you will see that, inevitably, the 95% prediction intervals are wider than the confidence intervals, in this case considerably so. In other words, we can estimate the average fertility of cantons with a given level of education much more precisely than we can predict the fertility of any specific individual canton.
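To see where the extra width comes from, note that the prediction interval combines the uncertainty in the estimated average response (the standard error of the fitted value) with the residual variation among individual cantons (the residual standard error). The following sketch, which reuses the nd data frame created above, should reproduce the prediction intervals given by the predict function:
# Fitted values and their standard errors for the new education levels
p <- predict(m2, newdata = nd, se.fit = TRUE)
# Combine uncertainty in the mean with residual variation among cantons
se.pred <- sqrt(p$se.fit^2 + sigma(m2)^2)
t.crit <- qt(0.975, df = df.residual(m2))
# Construct the 95% prediction intervals by hand
cbind(nd,
      fit = p$fit,
      lwr = p$fit - t.crit * se.pred,
      upr = p$fit + t.crit * se.pred)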
Next Time
In the next session, we will look at analysis of covariance (ANCOVA), which allows us to fit models with a mixture of continuous and categorical (grouping) explanatory variables.