Mixed-effects Models

Tim Newbold

2021-10-19

Overview

Many ecological datasets, especially those collected in the field, have some sort of hierarchical structure. Often, there is some sort of spatial structuring in the locations sampled. Furthermore, an increasing number of ecological studies are conducted by collating data from many separate previous studies. In these cases, the assumption of independence of data points is not met. The most commonly used solution to this non-independence issue is to use mixed-effects models, so called because they consist of two classes of independent variables:

Fixed effects, which are variables that we control in our experimental design.
Random effects, which are variables that describe groups that are representative of the wider population.

Let’s say that we sample the species richness of lake-shore versus lake-centre habitats in a series of lakes. Clearly we can’t assume that the samples from different lakes are independent of one another, because the measured species richness may depend not only on which habitat we sampled, but also on properties of the lakes. In this case, we controlled the sampling of lake-shore versus lake-centre habitats in our experimental design, and so this would be a fixed effect. In contrast, we would probably assume that the lakes were a random sample of the wider population of lakes in the study system, and so we would treat lake idenity as a random variable.

For this session, you will need the lme4 R package, as well as my StatisticalModels package, and also the ggplot2 package, which you will have already installed if you completed the previous sessions:

install.packages("lme4")
remotes::install_github("timnewbold/StatisticalModels")

library(ggplot2)
library(lme4)
library(StatisticalModels)

Example Dataset

In this session, we will work with a dataset describing the growth of five individual orange trees:

data(Orange)

head(Orange)

## Grouped Data: circumference ~ age | Tree
##   Tree  age circumference
## 1    1  118            30
## 2    1  484            58
## 3    1  664            87
## 4    1 1004           115
## 5    1 1231           120
## 6    1 1372           142

If we plot the relationship between tree age and trunk circumference, there is (not surprisingly) a strong positive relationship:

ggplot(Orange,aes(x=age,y=circumference))+geom_point()+theme_classic()

So, let’s try fitting a simple linear model of circumference as a function of age:

m1 <- lm(circumference~age,data=Orange)

m1

## 
## Call:
## lm(formula = circumference ~ age, data = Orange)
## 
## Coefficients:
## (Intercept)          age  
##     17.3997       0.1068

If we check the assumptions of the model, we can see that there is a strong increase in variance at higher fitted values of the model.

qqnorm(residuals(m1)); qqline(residuals(m1))

plot(fitted(m1),residuals(m1))

Furthermore, because the data are collected from five individual trees, it is very unlikely that we can safely assume that the data points are independent.

If we re-plot the relationship between age and trunk circumference for individual trees, we can see that there is strong variation in the growth of individual trees:

ggplot(Orange,aes(x=age,y=circumference,col=Tree))+geom_point()+theme_classic()

Fitting Mixed-effects Models in R

One solution to the problems here is to try a mixed-effects model to account for non-independence among individual trees. Here, age is the variable controlled by our experimental design, and so will be our fixed effect. The sampled individual trees can be considered to be a random sample of the wider population of trees, and so tree identity will be our random effect. Random effects are designated as (1|RandomEffect) in the model code:

m2 <- lmer(circumference~age+(1|Tree),data=Orange)

summary(m2)

## Linear mixed model fit by REML ['lmerMod']
## Formula: circumference ~ age + (1 | Tree)
##    Data: Orange
## 
## REML criterion at convergence: 303.2
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.8781 -0.6743  0.2320  0.5053  1.5416 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Tree     (Intercept) 389.6    19.74   
##  Residual             232.9    15.26   
## Number of obs: 35, groups:  Tree, 5
## 
## Fixed effects:
##              Estimate Std. Error t value
## (Intercept) 17.399650  10.423696   1.669
## age          0.106770   0.005321  20.066
## 
## Correlation of Fixed Effects:
##     (Intr)
## age -0.471

Effect Sizes

The model summary gives a coefficient table for the fixed effects very similar to that produced by standard linear or generalized linear models. Effect sizes can be calculated in exactly the same way as for linear models (see ANCOVA session). The slope of the relationship fitted here shows us that trunk circumference grows by approximately 0.1 mm each day, on average. We also get a table describing the variation in our response variable that is explained by variation among random-effect groups (in this case tree individuals, which unsurprisingly accounts for a large proportion of the variation in tree growth), as well as the residual, unexplained variation.

By inspecting the random-effect values, we can see the variation in trunk circumference of each individual tree (these random intercepts are centered around 0):

ranef(m2)

## $Tree
##   (Intercept)
## 3  -20.137548
## 1  -15.004448
## 5   -4.343393
## 2   17.900043
## 4   21.585346
## 
## with conditional variances for "Tree"

These random intercepts show, consistent with the graph we plotted earlier, that trees 2 and 4 grew particularly rapidly, while trees 1 and 3 grew slowly.

Model Selection

Model selection can be performed in exactly the same way as for linear or generalized linear models. Here we will use simple backward stepwise selection, but you could also use multi-model inference.

In this case, we will compare our model with age as a single explanatory variable with the simpler, intercept-only, null model:

m3 <- lmer(circumference~1+(1|Tree),data=Orange)

anova(m2,m3)

## Data: Orange
## Models:
## m3: circumference ~ 1 + (1 | Tree)
## m2: circumference ~ age + (1 | Tree)
##    npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)    
## m3    3 387.92 392.59 -190.96   381.92                         
## m2    4 308.68 314.90 -150.34   300.68 81.246  1  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As with generalized linear models, the comparison of two mixed-effects models is made on the basis of the difference in deviance, with P values calculated by assessing this difference against a chi-squared distribution. Not surprisingly, we find that age has a highly significant effect on the trunk circumference of orange trees.

Model Explanatory Power

Estimating model explanatory power is more complicated than for linear or generalized linear models. Mixed-effects models do not use sums of squares, and so we cannot use classic R² values, and the calculation of deviance explained is complicated by the presence of the random effects. Instead, a method to calculate pseudo-R² values has been proposed (Nakagawa and Schielzeth 2013). This method calculates two R² values. The conditional R² describes the variation in the response variable explained by both the fixed and random effects. The marginal R² describes the variation explained by the fixed effects alone. I have implemented this method as a function R2GLMER in my StatisticalModels package:

R2GLMER(m2)

## $conditional
## [1] 0.931107
## 
## $marginal
## [1] 0.8158525

For our model of orange-tree growth, both marginal and conditional R² values are very high, indicating that most of the variation in trunk circumference can be explained by tree age, but that much of the variation unexplained by age is captured by variation among individual trees.

Checking Model Assumptions

Finally, let’s check how well our mixed-effects model conforms to the assumptions of parametric statistical tests:

qqnorm(residuals(m2)); qqline(residuals(m2))

plot(fitted(m2),residuals(m2))

By accounting for variation in growth among individual trees, our mixed-effects model seems to be behaving much better in terms of equality of variance compared to the simple linear model we fitted earlier. This in addition to dealing with the issue of non-independence of data points caused by sampling repeated measures on the same individual trees.

Generalized Linear Mixed-effects Models

So far in this session, we have been using linear mixed-effects models. If you have data drawn from a Poisson or binomial distribution, you can use generalized linear mixed-effects models. Just as with generalized linear models, you simply specify family=“poisson”.

Further Information

If you are interested to do some further exploration of applying mixed-effects models (including generalized linear mixed-effects models), you might want to have a look at this tutorial on working with the PREDICTS database.

References

Nakagawa, Shinichi, and Holger Schielzeth. 2013. “A General and Simple Method for Obtaining R² from Generalized Linear Mixed-Effects Models.” Methods in Ecology & Evolution 4: 133–42. https://doi.org/10.1111/j.2041-210x.2012.00261.x.