Analysis of Covariance (ANCOVA)

Tim Newbold

2025-10-10

Overview

This session will cover Analysis of covariance (ANCOVA).

ANCOVAs are used to test for an effect of a mixture of both continuous and categorical variables on a measured response variable.

If you are not familiar with basic statistical concepts (hypotheses, null hypotheses, P values, degrees of freedom etc.), I recommend taking a look at these videos.

We will work again with the dataset that we have used previously, describing the total abundance of beetles in locations across different types of land use - natural forest (primary vegetation) and areas of livestock grazing (pasture) - in New Zealand.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
url("https://www.dropbox.com/scl/fi/4olcttgagt0sd92hg64vk/Ewers2007BeetleData.rds?rlkey=14c8qby2023lt9924bi3th0ee&dl=1") %>%
  readRDS() -> beetles

As before, because beetle abundance can only take positive values, and because the distribution of values is highly skewed, we will work with log-transformed total abundance.

beetles %>% mutate(LogAbundance = log10(Total_abundance)) -> beetles

Further Reading

The following books are good references for basic statistical tests, including ANOVA, linear regression and ANCOVA:

Crawley, M.J. (2014). Statistics: An Introduction Using R. 2nd Edition. John Wiley & Sons, Hoboken, USA

Barnard, C., Gilbert, F. & McGregor, P. (2007). Asking Questions in Biology: A Guide to Hypothesis Testing, Experimental Design and Presentation in Practical Work and Research Projects. 3rd Edition. Benjamin Cummings, San Francisco, USA.

Running ANCOVAs

Simple Additive Models

In this session, we will investigate the effects of land use and natural habitat coverage in the landscape on beetle total abundance. Let’s begin by plotting these relationships.

library(ggplot2)

ggplot(data = beetles,mapping = aes(
  x = NaturalHabitat5k,y = Total_abundance,col=LandUse,fill=LandUse)) + 
  geom_point() + 
  geom_smooth(method = 'lm') + 
  scale_x_continuous(name = "Surrounding Natural Habitat Proportion") + 
  scale_y_continuous(name = "Total beetle abundance",transform = 'log10') + 
  theme_classic()

To begin with, we will fit a model to test whether both land use and natural habitat coverage affect beetle abundance. To do this, we use the lm function. This is the same function that we used when we fit a linear regression, but here by adding also a categoraical (grouping) variable into our model we have fit an analysis of covariance:

m1 <- lm(formula = LogAbundance~LandUse+NaturalHabitat5k,data = beetles)

summary(m1)
## 
## Call:
## lm(formula = LogAbundance ~ LandUse + NaturalHabitat5k, data = beetles)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.90913 -0.15702  0.01417  0.19122  0.63517 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.25634    0.05158  43.743  < 2e-16 ***
## LandUsePasture   -0.29512    0.03583  -8.236 1.35e-14 ***
## NaturalHabitat5k  0.83483    0.17965   4.647 5.67e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2552 on 230 degrees of freedom
## Multiple R-squared:  0.3671, Adjusted R-squared:  0.3616 
## F-statistic:  66.7 on 2 and 230 DF,  p-value: < 2.2e-16
anova(m1)
## Analysis of Variance Table
## 
## Response: LogAbundance
##                   Df  Sum Sq Mean Sq F value    Pr(>F)    
## LandUse            1  7.2786  7.2786 111.796 < 2.2e-16 ***
## NaturalHabitat5k   1  1.4060  1.4060  21.595 5.671e-06 ***
## Residuals        230 14.9743  0.0651                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

If we look at the ANOVA table, both land use and natural habitat coverage have F ratios with a significant P value, suggesting that both might be important in explaining beetle abundance (but more on statistical significance in models with multiple explanatory variables in the next session).

Checking Model Assumptions

For now though, we will check our model assumptions. As in previous sessions, we will assume that our data points are independent of one another, and that there is a linear relationship between our explanatory variables and our response variable. So, for now, we will check the assumptions of equality of variance and normal distribution of residuals.

ggplot(mapping = aes(x = fitted(m1),y = residuals(m1))) + 
  geom_point() + 
  geom_smooth() + 
  geom_hline(yintercept = 0,linewidth = 1.2,linetype='dashed',col="#bf2c23") + 
  theme_classic()

ggplot(mapping = aes(sample = residuals(m1))) + 
  geom_qq() + geom_qq_line() + 
  theme_classic()

It looks as though there is a slightly lower spread of residuals for higher fitted values in the model. The residuals appear to conform fairly well to a normal distribution, although it seems there might be some slight skew.

As in the session on linear regression, the first thing we might try in order to diagnose these issues is to investigate the distributions of values of our continuous variables. Since we have already log-transformed total abundance, we will focus now on the natural habitat values.

ggplot(data = beetles,mapping = aes(x = NaturalHabitat5k)) + 
  geom_histogram() + theme_classic()

The distribution of values for landscape coverage of natural habitat is strongly right-skewed. Therefore, we will try log-transforming this variable also.

beetles %>% mutate(LogNaturalHabitat5k = log10(NaturalHabitat5k)) -> beetles

ggplot(data = beetles,mapping = aes(x = LogNaturalHabitat5k)) + 
  geom_histogram() + theme_classic()

This does seem to have improved the distribution at least a little bit. So, let’s try refitting our model, but this time using the log-transformed values for natural habitat coverage.

m2 <- lm(formula = LogAbundance~LandUse+LogNaturalHabitat5k, data = beetles)

summary(m2)
## 
## Call:
## lm(formula = LogAbundance ~ LandUse + LogNaturalHabitat5k, data = beetles)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.92708 -0.15711  0.01682  0.19630  0.64222 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          2.74171    0.06938  39.516  < 2e-16 ***
## LandUsePasture      -0.30376    0.03594  -8.452 3.29e-15 ***
## LogNaturalHabitat5k  0.42864    0.10362   4.137 4.94e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2575 on 230 degrees of freedom
## Multiple R-squared:  0.3556, Adjusted R-squared:   0.35 
## F-statistic: 63.46 on 2 and 230 DF,  p-value: < 2.2e-16
anova(m2)
## Analysis of Variance Table
## 
## Response: LogAbundance
##                      Df  Sum Sq Mean Sq F value    Pr(>F)    
## LandUse               1  7.2786  7.2786 109.805 < 2.2e-16 ***
## LogNaturalHabitat5k   1  1.1344  1.1344  17.113  4.94e-05 ***
## Residuals           230 15.2459  0.0663                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now we will check the model assumptions again.

ggplot(mapping = aes(x = fitted(m2),y = residuals(m2))) + 
  geom_point() + 
  geom_smooth() + 
  geom_hline(yintercept = 0,linewidth = 1.2,linetype='dashed',col="#bf2c23") + 
  theme_classic()

ggplot(mapping = aes(sample = residuals(m2))) + 
  geom_qq() + geom_qq_line() + 
  theme_classic()

There is maybe a slight improvement in conformation to model assumptions, although not by much.

Fitting Interactions Between Variables

Let’s re-plot the original relationship of beetle abundance as a function of land use and surrounding natural habitat in the landscape. This time, we will plot both abundance and natural habitat on log-transformed axes.

ggplot(data = beetles,mapping = aes(
  x = NaturalHabitat5k,y = Total_abundance,col=LandUse,fill=LandUse)) + 
  geom_point() + 
  geom_smooth(method = 'lm') + 
  scale_x_continuous(name = "Surrounding Natural Habitat Proportion",transform = 'log10') + 
  scale_y_continuous(name = "Total beetle abundance",transform = 'log10') + 
  theme_classic()

It looks as though the effect of natural habitat on beetle abundance differs among the two land-use types (i.e., the slopes of the lines are different).

Accounting for variability in the effect of one explanatory variable depending on the value of another explanatory variable is achieved by fitting an interaction term in the model. This is done in one of two ways in R. Both approaches here yield exactly the same model:

# First, we can put an asterisk instead of a plus symbol between our variables:
m3 <- lm(LogAbundance~LandUse*LogNaturalHabitat5k, data = beetles)

# Second, we can manually specify that we want to fit effects of land use,
# natural habitat, and their interaction, as follows:
m4 <- lm(LogAbundance~LandUse+LogNaturalHabitat5k+LandUse:NaturalHabitat5k,
         data = beetles)

summary(m3)
## 
## Call:
## lm(formula = LogAbundance ~ LandUse * LogNaturalHabitat5k, data = beetles)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.92189 -0.15328  0.02291  0.18623  0.62522 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         2.78061    0.08081  34.411  < 2e-16 ***
## LandUsePasture                     -0.45800    0.16802  -2.726  0.00691 ** 
## LogNaturalHabitat5k                 0.49025    0.12264   3.998 8.63e-05 ***
## LandUsePasture:LogNaturalHabitat5k -0.21559    0.22941  -0.940  0.34832    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2575 on 229 degrees of freedom
## Multiple R-squared:  0.3581, Adjusted R-squared:  0.3497 
## F-statistic: 42.58 on 3 and 229 DF,  p-value: < 2.2e-16
anova(m3)
## Analysis of Variance Table
## 
## Response: LogAbundance
##                              Df  Sum Sq Mean Sq  F value    Pr(>F)    
## LandUse                       1  7.2786  7.2786 109.7491 < 2.2e-16 ***
## LogNaturalHabitat5k           1  1.1344  1.1344  17.1046 4.968e-05 ***
## LandUse:LogNaturalHabitat5k   1  0.0586  0.0586   0.8832    0.3483    
## Residuals                   229 15.1874  0.0663                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The second of these two approaches is useful if you have more than two terms in the model. If you put asterisks between all variables in your model, as in the first approach, you will get a model with every possible combination of interactions (including higher-order interactions) between your explanatory variables. This quickly leads to very complex models with increasing numbers of explanatory variables.

Looking at the model summary and P values, it doesn’t look as though the interaction between land use and surrounding natural habitat has a significant effect on beetle abundance (but we will return to this in the next session).

Interpreting Model Coefficients

The principle is the same for the coefficient tables produced by analysis of covariance models as those produced by simple ANOVAs or linear regressions. For additive models, interpreting the model coefficients is relatively straightforward. We just have a mix of slope coefficients for continuous variables (as in linear regression), and difference coefficients for grouping variables (as in analysis of variance). The intercept now refers to the value of the response variable when continuous variables have a value of zero and for reference groupings (in this case, samples taken within locations where the land use is pasture).

Interpreting the cofficient table in an additive analysis of covariance
Interpreting the cofficient table in an additive analysis of covariance

Things are a little more complicated for analyses of covariance where we have an interaction between variables. In the case of the interaction between a continuous and grouping variable (as in the model above), we now have individual slopes for the continuous variable for each group within the grouping variable.

Interpreting the cofficient table in an interactive analysis of covariance
Interpreting the cofficient table in an interactive analysis of covariance

Making Predictions from an ANOVA Model

As we did with linear regression, it is useful to be able to make predictions from an analysis of covariance.

As before, we can simply make predictions for the values of the explanatory variables in our original dataset, which can be great for making graphs.

preds <- predict(object = m2,interval = 'confidence')

bind_cols(beetles,preds) -> beetles2

head(beetles2)
##         Source_ID Study_number Study_name Site_number  Site_name Block
## 1 CC1_2007__Ewers            1     Beetle           1 NewSite001     A
## 2 CC1_2007__Ewers            1     Beetle           2 NewSite002     A
## 3 CC1_2007__Ewers            1     Beetle           3 NewSite003     A
## 4 CC1_2007__Ewers            1     Beetle           4 NewSite004     A
## 5 CC1_2007__Ewers            1     Beetle           5 NewSite005     A
## 6 CC1_2007__Ewers            1     Beetle           6 NewSite006     A
##   Use_intensity Sample_start_earliest Sample_end_latest Diversity_metric_type
## 1   Minimal use            2000-10-30        2001-02-10             Abundance
## 2   Minimal use            2000-10-30        2001-02-10             Abundance
## 3   Minimal use            2000-10-30        2001-02-10             Abundance
## 4   Minimal use            2000-10-30        2001-02-10             Abundance
## 5   Minimal use            2000-10-30        2001-02-10             Abundance
## 6   Minimal use            2000-10-30        2001-02-10             Abundance
##   Predominant_land_use                 SSB                  SSBS
## 1   Primary vegetation CC1_2007__Ewers 1 A CC1_2007__Ewers 1 A 1
## 2   Primary vegetation CC1_2007__Ewers 1 A CC1_2007__Ewers 1 A 2
## 3   Primary vegetation CC1_2007__Ewers 1 A CC1_2007__Ewers 1 A 3
## 4   Primary vegetation CC1_2007__Ewers 1 A CC1_2007__Ewers 1 A 4
## 5   Primary vegetation CC1_2007__Ewers 1 A CC1_2007__Ewers 1 A 5
## 6   Primary vegetation CC1_2007__Ewers 1 A CC1_2007__Ewers 1 A 6
##   Study_common_taxon     Country                SS                 SSS
## 1         Coleoptera New Zealand CC1_2007__Ewers 1 CC1_2007__Ewers 1 1
## 2         Coleoptera New Zealand CC1_2007__Ewers 1 CC1_2007__Ewers 1 2
## 3         Coleoptera New Zealand CC1_2007__Ewers 1 CC1_2007__Ewers 1 3
## 4         Coleoptera New Zealand CC1_2007__Ewers 1 CC1_2007__Ewers 1 4
## 5         Coleoptera New Zealand CC1_2007__Ewers 1 CC1_2007__Ewers 1 5
## 6         Coleoptera New Zealand CC1_2007__Ewers 1 CC1_2007__Ewers 1 6
##   Total_abundance Species_richness Simpson_diversity ChaoR Richness_rarefied
## 1        256.2766               64         21.937953    NA                NA
## 2        150.6596               49         22.563549    NA                NA
## 3        114.9362               33         15.468927    NA                NA
## 4        254.7234               35          6.499758    NA                NA
## 5        142.8936               33         11.500000    NA                NA
## 6        147.7805               36         13.750499    NA                NA
##              LandUse NaturalHabitat ID NaturalHabitat5k RoadDistance
## 1 Primary vegetation      0.4099132  1         0.321439     2111.284
## 2 Primary vegetation      0.4099132  2         0.321439     2109.313
## 3 Primary vegetation      0.4099132  3         0.321439     2107.431
## 4 Primary vegetation      0.4099132  4         0.321439     2103.511
## 5 Primary vegetation      0.4099132  5         0.321439     2094.761
## 6 Primary vegetation      0.4099132  6         0.321439     2077.476
##   NaturalHabitatDistance Longitude Latitude LogAbundance LogNaturalHabitat5k
## 1               483.3234  16601951 -4972760     2.408709          -0.4929015
## 2               481.5192  16601949 -4972759     2.177997          -0.4929015
## 3               479.7806  16601947 -4972759     2.060457          -0.4929015
## 4               476.1814  16601942 -4972758     2.406069          -0.4929015
## 5               468.1528  16601933 -4972756     2.155013          -0.4929015
## 6               452.5469  16601915 -4972751     2.169617          -0.4929015
##        fit      lwr      upr
## 1 2.530432 2.476821 2.584044
## 2 2.530432 2.476821 2.584044
## 3 2.530432 2.476821 2.584044
## 4 2.530432 2.476821 2.584044
## 5 2.530432 2.476821 2.584044
## 6 2.530432 2.476821 2.584044

Now, we can add these predictions to a results graph. Note that we have to back-transform the predictions because our model fit log10 beetle abundance.

beetles2 %>% mutate(PredMedian = 10^fit,
                    PredLower = 10^upr,
                    PredUpper = 10^lwr) -> beetles2

ggplot(data = beetles2,mapping = aes(x = NaturalHabitat5k,
                                     fill = LandUse,
                                     colour = LandUse)) + 
  geom_point(mapping = aes(y = Total_abundance)) + 
  geom_line(mapping = aes(y = PredMedian)) + 
  geom_ribbon(mapping = aes(ymin = PredLower,ymax = PredUpper),
              alpha = 0.2, colour = NA) + 
  scale_x_continuous(name = "Natural Habitat Proportion",transform = 'log10') + 
  scale_y_continuous(name = "Total beetle abundance",transform = 'log10') + 
  theme_classic()

We can also make predictions for a new dataset, in exactly the same way as we did in the session on linear regression.

Next Time

In the next session, we will look at how you can assess how well your model describes your study system.