Overview
This session will cover analysis of covariance (ANCOVA).
ANCOVAs are used to test for effects of a mixture of continuous and categorical explanatory variables on a measured response variable.
If you are not familiar with basic statistical concepts (hypotheses, null hypotheses, P values, degrees of freedom etc.), I recommend taking a look at these videos.
We will work again with the mtcars dataset that we have used previously:
data(mtcars)
# Ensure that the number of cylinders is treated as a categorical (grouping) variable,
# by converting it to a factor
mtcars$cyl <- factor(mtcars$cyl)
Running ANCOVAs
Simple Additive Models
In this session, we will investigate the effect of car power on fuel efficiency. Let’s begin by plotting this relationship:
library(ggplot2)
ggplot(data = mtcars, mapping = aes(x = hp, y = mpg)) +
  geom_point() +
  scale_x_continuous(name = "Power (HP)") +
  scale_y_continuous(name = "Fuel Efficiency (MPG)") +
  geom_smooth(method = "lm") +
  theme_classic()
We will also consider the effect of the number of cylinders on this relationship, so let’s plot the relationship again, but separating car models based on their number of cylinders:
ggplot(data = mtcars, mapping = aes(x = hp, y = mpg, col = cyl, shape = cyl)) +
  geom_point() +
  scale_x_continuous(name = "Power (HP)") +
  scale_y_continuous(name = "Fuel Efficiency (MPG)") +
  geom_smooth(method = "lm") +
  theme_classic()
To begin with, we will fit a model to test whether both power and number of cylinders affect fuel economy. To do this, we use the lm function. This is the same function that we used when we fit a linear regression, but here, by also adding a categorical (grouping) variable to our model, we fit an analysis of covariance:
m1 <- lm(mtcars$mpg ~ mtcars$hp + mtcars$cyl)
summary(m1)
##
## Call:
## lm(formula = mtcars$mpg ~ mtcars$hp + mtcars$cyl)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.818 -1.959 0.080 1.627 6.812
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.65012 1.58779 18.044 < 2e-16 ***
## mtcars$hp -0.02404 0.01541 -1.560 0.12995
## mtcars$cyl6 -5.96766 1.63928 -3.640 0.00109 **
## mtcars$cyl8 -8.52085 2.32607 -3.663 0.00103 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.146 on 28 degrees of freedom
## Multiple R-squared: 0.7539, Adjusted R-squared: 0.7275
## F-statistic: 28.59 on 3 and 28 DF, p-value: 1.14e-08
anova(m1)
## Analysis of Variance Table
##
## Response: mtcars$mpg
## Df Sum Sq Mean Sq F value Pr(>F)
## mtcars$hp 1 678.37 678.37 68.5305 5.225e-09 ***
## mtcars$cyl 2 170.51 85.25 8.6124 0.001216 **
## Residuals 28 277.17 9.90
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If we look at the ANOVA table, both power and number of cylinders have F ratios with significant P values, suggesting that both might be important in explaining fuel economy (but more on statistical significance in models with multiple explanatory variables in the next session).
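As an aside, if you want to extract these P values programmatically rather than reading them off the printed output, the table returned by anova behaves like a data frame, so its columns can be accessed directly (a minimal sketch using the m1 model from above):
anova(m1)[["Pr(>F)"]] # P values for each term (NA for the residuals row)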
Checking Model Assumptions
For now though, we will check our model assumptions. As in previous sessions, we will assume that our data points are independent of one another, and that there is a linear relationship between our explanatory variables and our response variable. So, for now, we will check the assumptions of equality of variance and normal distribution of residuals:
plot(fitted(m1),residuals(m1)); abline(h=0,lty=2,lwd=2,col="red")
qqnorm(residuals(m1)); qqline(residuals(m1))
hist(residuals(m1))
It looks as though there is a higher spread of residuals for higher fitted values in the model. The residuals appear to conform fairly well to a normal distribution, although it seems there might be some slight skew.
As in the session on linear regression, the first thing we might try in order to diagnose these issues is to investigate the distributions of values in the two continuous variables in our model (the response variable and the continuous explanatory variable):
hist(mtcars$mpg)
hist(mtcars$hp)
Both of these variables are right skewed. Therefore, let’s try log transformations of both:
mtcars$LogMPG <- log(mtcars$mpg)
mtcars$LogHP <- log(mtcars$hp)
hist(mtcars$LogMPG)
hist(mtcars$LogHP)
This seems to have improved the distributions of both variables. So, let’s try refitting our model, but this time using the log-transformed values:
m2 <- lm(mtcars$LogMPG ~ mtcars$LogHP + mtcars$cyl)
summary(m2)
##
## Call:
## lm(formula = mtcars$LogMPG ~ mtcars$LogHP + mtcars$cyl)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.35708 -0.09661 0.01399 0.10347 0.23491
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.43083 0.53052 8.352 4.37e-09 ***
## mtcars$LogHP -0.26469 0.12057 -2.195 0.0366 *
## mtcars$cyl6 -0.18197 0.08774 -2.074 0.0474 *
## mtcars$cyl8 -0.32300 0.12788 -2.526 0.0175 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1502 on 28 degrees of freedom
## Multiple R-squared: 0.7701, Adjusted R-squared: 0.7455
## F-statistic: 31.27 on 3 and 28 DF, p-value: 4.422e-09
anova(m2)
## Analysis of Variance Table
##
## Response: mtcars$LogMPG
## Df Sum Sq Mean Sq F value Pr(>F)
## mtcars$LogHP 1 1.9673 1.96733 87.1742 4.282e-10 ***
## mtcars$cyl 2 0.1495 0.07475 3.3123 0.05115 .
## Residuals 28 0.6319 0.02257
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(fitted(m2),residuals(m2))
qqnorm(residuals(m2)); qqline(residuals(m2))
This model conforms to the assumptions of parametric statistical tests better than before, although there are still some signs of problems. We will return to this issue again shortly.
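If you would like a formal check to complement these visual diagnostics, one option (not part of the workflow above, but a standard base-R function) is a Shapiro-Wilk test of the residuals:
# A significant P value suggests the residuals deviate from a normal distribution
shapiro.test(residuals(m2))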
Fitting Interactions Between Variables
Let’s re-plot our original relationship between car power and fuel economy, separating cars with different numbers of cylinders:
ggplot(data = mtcars, mapping = aes(x = hp, y = mpg, col = cyl, shape = cyl)) +
  geom_point() +
  scale_x_continuous(name = "Power (HP)") +
  scale_y_continuous(name = "Fuel Efficiency (MPG)") +
  geom_smooth(method = "lm") +
  theme_classic()
It looks as though the effect of power on fuel economy differs depending on how many cylinders the car has (i.e., the slopes of the lines are different).
Accounting for variability in the effect of one explanatory variable depending on the value of another explanatory variable is achieved by fitting an interaction term in the model. This is done in one of two ways in R. Both approaches here yield exactly the same model:
# First, we can put an asterisk instead of a plus symbol between our variables:
m3 <- lm(mtcars$LogMPG ~ mtcars$LogHP * mtcars$cyl)
# Second, we can manually specify that we want to fit effects of power, number of
# cylinders, and their interaction, as follows:
m4 <- lm(mtcars$LogMPG ~ mtcars$LogHP + mtcars$cyl + mtcars$LogHP:mtcars$cyl)
summary(m3)
##
## Call:
## lm(formula = mtcars$LogMPG ~ mtcars$LogHP * mtcars$cyl)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.35724 -0.08548 -0.00269 0.08830 0.26543
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.8204 0.8156 5.910 3.11e-06 ***
## mtcars$LogHP -0.3535 0.1857 -1.903 0.0681 .
## mtcars$cyl6 -1.5756 1.9063 -0.827 0.4160
## mtcars$cyl8 -0.8895 1.2865 -0.691 0.4954
## mtcars$LogHP:mtcars$cyl6 0.2984 0.4045 0.738 0.4673
## mtcars$LogHP:mtcars$cyl8 0.1221 0.2635 0.463 0.6469
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1541 on 26 degrees of freedom
## Multiple R-squared: 0.7753, Adjusted R-squared: 0.7321
## F-statistic: 17.94 on 5 and 26 DF, p-value: 1.06e-07
anova(m3)
## Analysis of Variance Table
##
## Response: mtcars$LogMPG
## Df Sum Sq Mean Sq F value Pr(>F)
## mtcars$LogHP 1 1.96733 1.96733 82.8173 1.453e-09 ***
## mtcars$cyl 2 0.14950 0.07475 3.1467 0.05973 .
## mtcars$LogHP:mtcars$cyl 2 0.01427 0.00713 0.3003 0.74314
## Residuals 26 0.61763 0.02376
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The second of these two approaches is useful if you have more than two terms in the model. If you put asterisks between all of the variables in your model, as in the first approach, you will get a model with every possible combination of interactions (including higher-order interactions) between your explanatory variables. This quickly leads to very complex models as the number of explanatory variables increases.
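As a quick sanity check that the two approaches really do specify the same model, we can compare the fitted coefficients of m3 and m4 from above:
# The two formulations produce identical coefficients (should return TRUE)
all.equal(coef(m3), coef(m4))
# For reference, with three variables a*b*c would expand to
# a + b + c + a:b + a:c + b:c + a:b:c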
Looking at the model summary and P values, it doesn’t look as though the interaction between power and number of cylinders is very helpful in explaining the fuel efficiency of cars (we will return to this in the next session).
Checking Model Assumptions
Let’s check our model assumptions again:
plot(fitted(m3),residuals(m3))
qqnorm(residuals(m3)); qqline(residuals(m3))
It looks as though this interactive model conforms better to the assumptions of equal variance and normal distribution of residuals; missing an important variable from your model is another common cause of violations of model assumptions. The only apparent remaining issue is that two data points have large negative residuals. The next step in improving this model would probably be to investigate these data points.
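As a starting point for that investigation, the residuals returned by residuals are named after the rows of mtcars (the car models), so we can simply sort them to see which cars the model fits worst:
# The two most negative residuals, labelled with the car models they belong to
sort(residuals(m3))[1:2]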
Interpreting Model Coefficients
The coefficient tables produced by analysis of covariance models follow the same principles as those produced by simple ANOVAs or linear regressions. For additive models, interpreting the model coefficients is relatively straightforward: we simply have a mix of slope coefficients for continuous variables (as in linear regression) and difference coefficients for grouping variables (as in analysis of variance). The intercept now refers to the value of the response variable when the continuous variables are zero and the observation belongs to the reference group (in this case, cars with 4 cylinders).
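To make this concrete, here is how you might reconstruct a prediction by hand from the coefficients of the additive model m2 (the 110 HP value here is just an arbitrary example):
b <- coef(m2)
# Predicted log fuel efficiency of a 6-cylinder car with 110 HP:
# intercept + slope * log(power) + difference coefficient for the 6-cylinder group
b["(Intercept)"] + b["mtcars$LogHP"] * log(110) + b["mtcars$cyl6"]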
Things are a little more complicated for analyses of covariance where we have an interaction between variables. In the case of an interaction between a continuous variable and a grouping variable (as in the model above), we now have an individual slope for the continuous variable within each group of the grouping variable.
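For example, using the coefficients of m3, the slope of log power for each cylinder group can be recovered by adding the relevant interaction coefficient to the reference slope:
b <- coef(m3)
slope_4cyl <- b["mtcars$LogHP"]                                 # reference group (4 cylinders)
slope_6cyl <- b["mtcars$LogHP"] + b["mtcars$LogHP:mtcars$cyl6"] # 6-cylinder slope
slope_8cyl <- b["mtcars$LogHP"] + b["mtcars$LogHP:mtcars$cyl8"] # 8-cylinder slope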
Next Time
In the next session, we will look at how you can assess how well your model describes your study system.