Overview
Many ecological datasets, especially those collected in the field, have some sort of hierarchical structure. Often, there is some sort of spatial structuring in the locations sampled. Furthermore, an increasing number of ecological studies are conducted by collating data from many separate previous studies. In these cases, the assumption of independence of data points is not met. The most commonly used solution to this non-independence issue is to use mixed-effects models, so called because they consist of two classes of independent variables:
- Fixed effects, which are variables that we control in our experimental design.
- Random effects, which are variables that describe groups that are representative of the wider population.
Let’s say that we sample the species richness of lake-shore versus lake-centre habitats in a series of lakes. Clearly we can’t assume that the samples from different lakes are independent of one another, because the measured species richness may depend not only on which habitat we sampled, but also on properties of the lakes. In this case, we controlled the sampling of lake-shore versus lake-centre habitats in our experimental design, and so this would be a fixed effect. In contrast, we would probably assume that the lakes were a random sample of the wider population of lakes in the study system, and so we would treat lake idenity as a random variable.
For this session, you will need the lme4 R package, as well as my StatisticalModels package, and also the ggplot2 package, which you will have already installed if you completed the previous sessions:
install.packages("lme4")
::install_github("timnewbold/StatisticalModels") remotes
library(ggplot2)
library(lme4)
library(StatisticalModels)
Example Dataset
In this session, we will work with a dataset describing the growth of five individual orange trees:
data(Orange)
head(Orange)
## Grouped Data: circumference ~ age | Tree
## Tree age circumference
## 1 1 118 30
## 2 1 484 58
## 3 1 664 87
## 4 1 1004 115
## 5 1 1231 120
## 6 1 1372 142
If we plot the relationship between tree age and trunk circumference, there is (not surprisingly) a strong positive relationship:
ggplot(Orange,aes(x=age,y=circumference))+geom_point()+theme_classic()
So, let’s try fitting a simple linear model of circumference as a function of age:
<- lm(circumference~age,data=Orange)
m1
m1
##
## Call:
## lm(formula = circumference ~ age, data = Orange)
##
## Coefficients:
## (Intercept) age
## 17.3997 0.1068
If we check the assumptions of the model, we can see that there is a strong increase in variance at higher fitted values of the model.
qqnorm(residuals(m1)); qqline(residuals(m1))
plot(fitted(m1),residuals(m1))
Furthermore, because the data are collected from five individual trees, it is very unlikely that we can safely assume that the data points are independent.
If we re-plot the relationship between age and trunk circumference for individual trees, we can see that there is strong variation in the growth of individual trees:
ggplot(Orange,aes(x=age,y=circumference,col=Tree))+geom_point()+theme_classic()
Fitting Mixed-effects Models in R
One solution to the problems here is to try a mixed-effects model to account for non-independence among individual trees. Here, age is the variable controlled by our experimental design, and so will be our fixed effect. The sampled individual trees can be considered to be a random sample of the wider population of trees, and so tree identity will be our random effect. Random effects are designated as (1|RandomEffect) in the model code:
<- lmer(circumference~age+(1|Tree),data=Orange)
m2
summary(m2)
## Linear mixed model fit by REML ['lmerMod']
## Formula: circumference ~ age + (1 | Tree)
## Data: Orange
##
## REML criterion at convergence: 303.2
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.8781 -0.6743 0.2320 0.5053 1.5416
##
## Random effects:
## Groups Name Variance Std.Dev.
## Tree (Intercept) 389.6 19.74
## Residual 232.9 15.26
## Number of obs: 35, groups: Tree, 5
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 17.399650 10.423696 1.669
## age 0.106770 0.005321 20.066
##
## Correlation of Fixed Effects:
## (Intr)
## age -0.471
Effect Sizes
The model summary gives a coefficient table for the fixed effects very similar to that produced by standard linear or generalized linear models. Effect sizes can be calculated in exactly the same way as for linear models (see ANCOVA session). The slope of the relationship fitted here shows us that trunk circumference grows by approximately 0.1 mm each day, on average. We also get a table describing the variation in our response variable that is explained by variation among random-effect groups (in this case tree individuals, which unsurprisingly accounts for a large proportion of the variation in tree growth), as well as the residual, unexplained variation.
By inspecting the random-effect values, we can see the variation in trunk circumference of each individual tree (these random intercepts are centered around 0):
ranef(m2)
## $Tree
## (Intercept)
## 3 -20.137548
## 1 -15.004448
## 5 -4.343393
## 2 17.900043
## 4 21.585346
##
## with conditional variances for "Tree"
These random intercepts show, consistent with the graph we plotted earlier, that trees 2 and 4 grew particularly rapidly, while trees 1 and 3 grew slowly.
Model Selection
Model selection can be performed in exactly the same way as for linear or generalized linear models. Here we will use simple backward stepwise selection, but you could also use multi-model inference.
In this case, we will compare our model with age as a single explanatory variable with the simpler, intercept-only, null model:
<- lmer(circumference~1+(1|Tree),data=Orange)
m3
anova(m2,m3)
## Data: Orange
## Models:
## m3: circumference ~ 1 + (1 | Tree)
## m2: circumference ~ age + (1 | Tree)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## m3 3 387.92 392.59 -190.96 381.92
## m2 4 308.68 314.90 -150.34 300.68 81.246 1 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As with generalized linear models, the comparison of two mixed-effects models is made on the basis of the difference in deviance, with P values calculated by assessing this difference against a chi-squared distribution. Not surprisingly, we find that age has a highly significant effect on the trunk circumference of orange trees.
Model Explanatory Power
Estimating model explanatory power is more complicated than for linear or generalized linear models. Mixed-effects models do not use sums of squares, and so we cannot use classic R2 values, and the calculation of deviance explained is complicated by the presence of the random effects. Instead, a method to calculate pseudo-R2 values has been proposed (Nakagawa and Schielzeth 2013). This method calculates two R2 values. The conditional R2 describes the variation in the response variable explained by both the fixed and random effects. The marginal R2 describes the variation explained by the fixed effects alone. I have implemented this method as a function R2GLMER in my StatisticalModels package:
R2GLMER(m2)
## $conditional
## [1] 0.931107
##
## $marginal
## [1] 0.8158525
For our model of orange-tree growth, both marginal and conditional R2 values are very high, indicating that most of the variation in trunk circumference can be explained by tree age, but that much of the variation unexplained by age is captured by variation among individual trees.
Checking Model Assumptions
Finally, let’s check how well our mixed-effects model conforms to the assumptions of parametric statistical tests:
qqnorm(residuals(m2)); qqline(residuals(m2))
plot(fitted(m2),residuals(m2))
By accounting for variation in growth among individual trees, our mixed-effects model seems to be behaving much better in terms of equality of variance compared to the simple linear model we fitted earlier. This in addition to dealing with the issue of non-independence of data points caused by sampling repeated measures on the same individual trees.
Generalized Linear Mixed-effects Models
So far in this session, we have been using linear mixed-effects models. If you have data drawn from a Poisson or binomial distribution, you can use generalized linear mixed-effects models. Just as with generalized linear models, you simply specify family=“poisson”.
Further Information
If you are interested to do some further exploration of applying mixed-effects models (including generalized linear mixed-effects models), you might want to have a look at this tutorial on working with the PREDICTS database.