Overview
This session will cover Analysis of covariance (ANCOVA).
ANCOVAs are used to test for an effect of a mixture of both continuous and categorical variables on a measured response variable.
If you are not familiar with basic statistical concepts (hypotheses, null hypotheses, P values, degrees of freedom etc.), I recommend taking a look at these videos.
We will work again with the dataset that we have used previously, describing the total abundance of beetles in locations across different types of land use - natural forest (primary vegetation) and areas of livestock grazing (pasture) - in New Zealand.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
url("https://www.dropbox.com/scl/fi/4olcttgagt0sd92hg64vk/Ewers2007BeetleData.rds?rlkey=14c8qby2023lt9924bi3th0ee&dl=1") %>%
readRDS() -> beetles
As before, because beetle abundance can only take positive values, and because the distribution of values is highly skewed, we will work with log-transformed total abundance.
beetles %>% mutate(LogAbundance = log10(Total_abundance)) -> beetles
Further Reading
The following books are good references for basic statistical tests, including ANOVA, linear regression and ANCOVA:
Crawley, M.J. (2014). Statistics: An Introduction Using R. 2nd Edition. John Wiley & Sons, Hoboken, USA
Barnard, C., Gilbert, F. & McGregor, P. (2007). Asking Questions in Biology: A Guide to Hypothesis Testing, Experimental Design and Presentation in Practical Work and Research Projects. 3rd Edition. Benjamin Cummings, San Francisco, USA.
Running ANCOVAs
Simple Additive Models
In this session, we will investigate the effects of land use and natural habitat coverage in the landscape on beetle total abundance. Let’s begin by plotting these relationships.
library(ggplot2)
ggplot(data = beetles,mapping = aes(
x = NaturalHabitat5k,y = Total_abundance,col=LandUse,fill=LandUse)) +
geom_point() +
geom_smooth(method = 'lm') +
scale_x_continuous(name = "Surrounding Natural Habitat Proportion") +
scale_y_continuous(name = "Total beetle abundance",transform = 'log10') +
theme_classic()
To begin with, we will fit a model to test whether both land use and natural habitat coverage affect beetle abundance. To do this, we use the lm function. This is the same function that we used when we fit a linear regression, but here by adding also a categoraical (grouping) variable into our model we have fit an analysis of covariance:
m1 <- lm(formula = LogAbundance~LandUse+NaturalHabitat5k,data = beetles)
summary(m1)
##
## Call:
## lm(formula = LogAbundance ~ LandUse + NaturalHabitat5k, data = beetles)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.90913 -0.15702 0.01417 0.19122 0.63517
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.25634 0.05158 43.743 < 2e-16 ***
## LandUsePasture -0.29512 0.03583 -8.236 1.35e-14 ***
## NaturalHabitat5k 0.83483 0.17965 4.647 5.67e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2552 on 230 degrees of freedom
## Multiple R-squared: 0.3671, Adjusted R-squared: 0.3616
## F-statistic: 66.7 on 2 and 230 DF, p-value: < 2.2e-16
anova(m1)
## Analysis of Variance Table
##
## Response: LogAbundance
## Df Sum Sq Mean Sq F value Pr(>F)
## LandUse 1 7.2786 7.2786 111.796 < 2.2e-16 ***
## NaturalHabitat5k 1 1.4060 1.4060 21.595 5.671e-06 ***
## Residuals 230 14.9743 0.0651
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If we look at the ANOVA table, both land use and natural habitat coverage have F ratios with a significant P value, suggesting that both might be important in explaining beetle abundance (but more on statistical significance in models with multiple explanatory variables in the next session).
Checking Model Assumptions
For now though, we will check our model assumptions. As in previous sessions, we will assume that our data points are independent of one another, and that there is a linear relationship between our explanatory variables and our response variable. So, for now, we will check the assumptions of equality of variance and normal distribution of residuals.
ggplot(mapping = aes(x = fitted(m1),y = residuals(m1))) +
geom_point() +
geom_smooth() +
geom_hline(yintercept = 0,linewidth = 1.2,linetype='dashed',col="#bf2c23") +
theme_classic()
ggplot(mapping = aes(sample = residuals(m1))) +
geom_qq() + geom_qq_line() +
theme_classic()
It looks as though there is a slightly lower spread of residuals for higher fitted values in the model. The residuals appear to conform fairly well to a normal distribution, although it seems there might be some slight skew.
As in the session on linear regression, the first thing we might try in order to diagnose these issues is to investigate the distributions of values of our continuous variables. Since we have already log-transformed total abundance, we will focus now on the natural habitat values.
ggplot(data = beetles,mapping = aes(x = NaturalHabitat5k)) +
geom_histogram() + theme_classic()
The distribution of values for landscape coverage of natural habitat is strongly right-skewed. Therefore, we will try log-transforming this variable also.
beetles %>% mutate(LogNaturalHabitat5k = log10(NaturalHabitat5k)) -> beetles
ggplot(data = beetles,mapping = aes(x = LogNaturalHabitat5k)) +
geom_histogram() + theme_classic()
This does seem to have improved the distribution at least a little bit. So, let’s try refitting our model, but this time using the log-transformed values for natural habitat coverage.
m2 <- lm(formula = LogAbundance~LandUse+LogNaturalHabitat5k, data = beetles)
summary(m2)
##
## Call:
## lm(formula = LogAbundance ~ LandUse + LogNaturalHabitat5k, data = beetles)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.92708 -0.15711 0.01682 0.19630 0.64222
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.74171 0.06938 39.516 < 2e-16 ***
## LandUsePasture -0.30376 0.03594 -8.452 3.29e-15 ***
## LogNaturalHabitat5k 0.42864 0.10362 4.137 4.94e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2575 on 230 degrees of freedom
## Multiple R-squared: 0.3556, Adjusted R-squared: 0.35
## F-statistic: 63.46 on 2 and 230 DF, p-value: < 2.2e-16
anova(m2)
## Analysis of Variance Table
##
## Response: LogAbundance
## Df Sum Sq Mean Sq F value Pr(>F)
## LandUse 1 7.2786 7.2786 109.805 < 2.2e-16 ***
## LogNaturalHabitat5k 1 1.1344 1.1344 17.113 4.94e-05 ***
## Residuals 230 15.2459 0.0663
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now we will check the model assumptions again.
ggplot(mapping = aes(x = fitted(m2),y = residuals(m2))) +
geom_point() +
geom_smooth() +
geom_hline(yintercept = 0,linewidth = 1.2,linetype='dashed',col="#bf2c23") +
theme_classic()
ggplot(mapping = aes(sample = residuals(m2))) +
geom_qq() + geom_qq_line() +
theme_classic()
There is maybe a slight improvement in conformation to model assumptions, although not by much.
Fitting Interactions Between Variables
Let’s re-plot the original relationship of beetle abundance as a function of land use and surrounding natural habitat in the landscape. This time, we will plot both abundance and natural habitat on log-transformed axes.
ggplot(data = beetles,mapping = aes(
x = NaturalHabitat5k,y = Total_abundance,col=LandUse,fill=LandUse)) +
geom_point() +
geom_smooth(method = 'lm') +
scale_x_continuous(name = "Surrounding Natural Habitat Proportion",transform = 'log10') +
scale_y_continuous(name = "Total beetle abundance",transform = 'log10') +
theme_classic()
It looks as though the effect of natural habitat on beetle abundance differs among the two land-use types (i.e., the slopes of the lines are different).
Accounting for variability in the effect of one explanatory variable depending on the value of another explanatory variable is achieved by fitting an interaction term in the model. This is done in one of two ways in R. Both approaches here yield exactly the same model:
# First, we can put an asterisk instead of a plus symbol between our variables:
m3 <- lm(LogAbundance~LandUse*LogNaturalHabitat5k, data = beetles)
# Second, we can manually specify that we want to fit effects of land use,
# natural habitat, and their interaction, as follows:
m4 <- lm(LogAbundance~LandUse+LogNaturalHabitat5k+LandUse:NaturalHabitat5k,
data = beetles)
summary(m3)
##
## Call:
## lm(formula = LogAbundance ~ LandUse * LogNaturalHabitat5k, data = beetles)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.92189 -0.15328 0.02291 0.18623 0.62522
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.78061 0.08081 34.411 < 2e-16 ***
## LandUsePasture -0.45800 0.16802 -2.726 0.00691 **
## LogNaturalHabitat5k 0.49025 0.12264 3.998 8.63e-05 ***
## LandUsePasture:LogNaturalHabitat5k -0.21559 0.22941 -0.940 0.34832
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2575 on 229 degrees of freedom
## Multiple R-squared: 0.3581, Adjusted R-squared: 0.3497
## F-statistic: 42.58 on 3 and 229 DF, p-value: < 2.2e-16
anova(m3)
## Analysis of Variance Table
##
## Response: LogAbundance
## Df Sum Sq Mean Sq F value Pr(>F)
## LandUse 1 7.2786 7.2786 109.7491 < 2.2e-16 ***
## LogNaturalHabitat5k 1 1.1344 1.1344 17.1046 4.968e-05 ***
## LandUse:LogNaturalHabitat5k 1 0.0586 0.0586 0.8832 0.3483
## Residuals 229 15.1874 0.0663
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The second of these two approaches is useful if you have more than two terms in the model. If you put asterisks between all variables in your model, as in the first approach, you will get a model with every possible combination of interactions (including higher-order interactions) between your explanatory variables. This quickly leads to very complex models with increasing numbers of explanatory variables.
Looking at the model summary and P values, it doesn’t look as though the interaction between land use and surrounding natural habitat has a significant effect on beetle abundance (but we will return to this in the next session).
Interpreting Model Coefficients
The principle is the same for the coefficient tables produced by analysis of covariance models as those produced by simple ANOVAs or linear regressions. For additive models, interpreting the model coefficients is relatively straightforward. We just have a mix of slope coefficients for continuous variables (as in linear regression), and difference coefficients for grouping variables (as in analysis of variance). The intercept now refers to the value of the response variable when continuous variables have a value of zero and for reference groupings (in this case, samples taken within locations where the land use is pasture).
Things are a little more complicated for analyses of covariance where we have an interaction between variables. In the case of the interaction between a continuous and grouping variable (as in the model above), we now have individual slopes for the continuous variable for each group within the grouping variable.
Making Predictions from an ANOVA Model
As we did with linear regression, it is useful to be able to make predictions from an analysis of covariance.
As before, we can simply make predictions for the values of the explanatory variables in our original dataset, which can be great for making graphs.
preds <- predict(object = m2,interval = 'confidence')
bind_cols(beetles,preds) -> beetles2
head(beetles2)
## Source_ID Study_number Study_name Site_number Site_name Block
## 1 CC1_2007__Ewers 1 Beetle 1 NewSite001 A
## 2 CC1_2007__Ewers 1 Beetle 2 NewSite002 A
## 3 CC1_2007__Ewers 1 Beetle 3 NewSite003 A
## 4 CC1_2007__Ewers 1 Beetle 4 NewSite004 A
## 5 CC1_2007__Ewers 1 Beetle 5 NewSite005 A
## 6 CC1_2007__Ewers 1 Beetle 6 NewSite006 A
## Use_intensity Sample_start_earliest Sample_end_latest Diversity_metric_type
## 1 Minimal use 2000-10-30 2001-02-10 Abundance
## 2 Minimal use 2000-10-30 2001-02-10 Abundance
## 3 Minimal use 2000-10-30 2001-02-10 Abundance
## 4 Minimal use 2000-10-30 2001-02-10 Abundance
## 5 Minimal use 2000-10-30 2001-02-10 Abundance
## 6 Minimal use 2000-10-30 2001-02-10 Abundance
## Predominant_land_use SSB SSBS
## 1 Primary vegetation CC1_2007__Ewers 1 A CC1_2007__Ewers 1 A 1
## 2 Primary vegetation CC1_2007__Ewers 1 A CC1_2007__Ewers 1 A 2
## 3 Primary vegetation CC1_2007__Ewers 1 A CC1_2007__Ewers 1 A 3
## 4 Primary vegetation CC1_2007__Ewers 1 A CC1_2007__Ewers 1 A 4
## 5 Primary vegetation CC1_2007__Ewers 1 A CC1_2007__Ewers 1 A 5
## 6 Primary vegetation CC1_2007__Ewers 1 A CC1_2007__Ewers 1 A 6
## Study_common_taxon Country SS SSS
## 1 Coleoptera New Zealand CC1_2007__Ewers 1 CC1_2007__Ewers 1 1
## 2 Coleoptera New Zealand CC1_2007__Ewers 1 CC1_2007__Ewers 1 2
## 3 Coleoptera New Zealand CC1_2007__Ewers 1 CC1_2007__Ewers 1 3
## 4 Coleoptera New Zealand CC1_2007__Ewers 1 CC1_2007__Ewers 1 4
## 5 Coleoptera New Zealand CC1_2007__Ewers 1 CC1_2007__Ewers 1 5
## 6 Coleoptera New Zealand CC1_2007__Ewers 1 CC1_2007__Ewers 1 6
## Total_abundance Species_richness Simpson_diversity ChaoR Richness_rarefied
## 1 256.2766 64 21.937953 NA NA
## 2 150.6596 49 22.563549 NA NA
## 3 114.9362 33 15.468927 NA NA
## 4 254.7234 35 6.499758 NA NA
## 5 142.8936 33 11.500000 NA NA
## 6 147.7805 36 13.750499 NA NA
## LandUse NaturalHabitat ID NaturalHabitat5k RoadDistance
## 1 Primary vegetation 0.4099132 1 0.321439 2111.284
## 2 Primary vegetation 0.4099132 2 0.321439 2109.313
## 3 Primary vegetation 0.4099132 3 0.321439 2107.431
## 4 Primary vegetation 0.4099132 4 0.321439 2103.511
## 5 Primary vegetation 0.4099132 5 0.321439 2094.761
## 6 Primary vegetation 0.4099132 6 0.321439 2077.476
## NaturalHabitatDistance Longitude Latitude LogAbundance LogNaturalHabitat5k
## 1 483.3234 16601951 -4972760 2.408709 -0.4929015
## 2 481.5192 16601949 -4972759 2.177997 -0.4929015
## 3 479.7806 16601947 -4972759 2.060457 -0.4929015
## 4 476.1814 16601942 -4972758 2.406069 -0.4929015
## 5 468.1528 16601933 -4972756 2.155013 -0.4929015
## 6 452.5469 16601915 -4972751 2.169617 -0.4929015
## fit lwr upr
## 1 2.530432 2.476821 2.584044
## 2 2.530432 2.476821 2.584044
## 3 2.530432 2.476821 2.584044
## 4 2.530432 2.476821 2.584044
## 5 2.530432 2.476821 2.584044
## 6 2.530432 2.476821 2.584044
Now, we can add these predictions to a results graph. Note that we have to back-transform the predictions because our model fit log10 beetle abundance.
beetles2 %>% mutate(PredMedian = 10^fit,
PredLower = 10^upr,
PredUpper = 10^lwr) -> beetles2
ggplot(data = beetles2,mapping = aes(x = NaturalHabitat5k,
fill = LandUse,
colour = LandUse)) +
geom_point(mapping = aes(y = Total_abundance)) +
geom_line(mapping = aes(y = PredMedian)) +
geom_ribbon(mapping = aes(ymin = PredLower,ymax = PredUpper),
alpha = 0.2, colour = NA) +
scale_x_continuous(name = "Natural Habitat Proportion",transform = 'log10') +
scale_y_continuous(name = "Total beetle abundance",transform = 'log10') +
theme_classic()
We can also make predictions for a new dataset, in exactly the same way as we did in the session on linear regression.
Next Time
In the next session, we will look at how you can assess how well your model describes your study system.