Multi-model Selection

Tim Newbold

2021-10-19

Overview

In this session, I will introduce multi-model selection as an alternative to backward stepwise model selection. Classical frequentist statistics is based on P values, which give the probability of obtaining an observed result given the null hypothesis of no relationship between explanatory and response variables. If the P value is below a certain threshold, \(\alpha\) (often 0.05), then an explanatory variable is deemed to have a significant effect on the response variable. Explanatory variables with a P value above this threshold are deemed not to have a significant effect, and are therefore discarded. The problem with this approach is that the threshold P value used to determine whether an effect is significant is rather arbitrary, and the quest to find significant results can lead to some rather dubious behaviours (collectively often referred to as ‘P hacking’).

An alternative approach, so-called ‘multi-model selection’ was proposed by (Burnham and Anderson 2002). The idea of this approach is to assess the relative support for each model in a candidate set of models. Models are compared using Information Criteria, the most widely used of which is the Akaike Information Criterion (AIC).

Akaike Information Criterion (AIC)

Information criteria measure how well a model fits the data (measured using the likelihood of the model), with some penalty applied for model complexity. The AIC is calculated as: \[AIC=-2\times LL_{model}+2\times DF\]

Where \(LL_{model}\) is the log-likelihood of the model, and \(DF\) is the model degrees of freedom. Lower AIC values indicate a better-fitting model.

As in the last session, you will need my MResModelling R package to run the code presented here:

remotes::install_github("timnewbold/MResEcologicalModelling",subdir="MResModelling")
library(MResModelling)

Example Dataset

We will work again with the dataset describing the abundance of the parasitoid wasp species Pimpla punicipes in sites of different land use in Hawai’i.

# Load species-level data (loads as object 'hh')
data(HawaiiHymenoptera)
# For simplicity, we will select only those columns that we will use later
# Site_number: the site at which species were sampled
# LandUse: the land use at the sampled site
# Taxon_name_entered: the species sampled
# Measurement: the recorded abundance of the species
hh <- hh[,c('Site_number','LandUse','Taxon_name_entered','Measurement')]
# Make the site number a grouping variable (factor)
hh$Site_number <- factor(hh$Site_number)

# Subset to the dataset for just Pimpla punicipes
hh.sp <- droplevels(hh[(hh$Taxon_name_entered=="Pimpla punicipes"),])

# Load site-level data (loads as object 'hhs2'), from which we will obtain
# forest-cover estimates
data(HawaiiHymenopteraSitesSubset)

# The estimates of forest cover were not included in the species-level dataset,
# so we will match these from the site-level dataset
hh.sp$ForestCover <- hhs2$ForestCover[match(hh.sp$Site_number,hhs2$Site_number)]

# Remove sites that have missing (NA) forest-cover estimates
hh.sp <- droplevels(hh.sp[!is.na(hh.sp$ForestCover),])

str(hh.sp)
## 'data.frame':    242 obs. of  5 variables:
##  $ Site_number       : Factor w/ 242 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ LandUse           : Factor w/ 2 levels "Primary","Pasture": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Taxon_name_entered: Factor w/ 1 level "Pimpla punicipes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Measurement       : num  0 3 1 1 1 0 0 0 0 0 ...
##  $ ForestCover       : int  41 41 41 41 41 0 0 0 0 0 ...

In the session on GLMs, we discovered that land use had a significant effect on the abundance of P. punicipes, but that the effect of forest cover was marginally non-significant. Here, we will investigate the same relationships, but using a multi-model rather than frequentist approach.

Candidate Models

The first stage in a multi-model approach is to select a suitable set of candidate models. Ideally, this candidate set would consist of a series of models each reflecting some explanation (or hypothesis) for the way the system works. In reality, we often lack the information required to make an such an informed list of candidate models. One commonly used approach is to run models representing every possible combination of explanatory variables. Here, we will compare four different models of the abundance of P. punicipes:

  1. A null (intercept-only) model
  2. A model with only land use as an explanatory variable
  3. A model with only forest cover
  4. A model with both land use and forest cover

We know from the session on GLMs that the data describing the abundance of P. punicipes are over-dispersed, so we will start straight away by fitting negative binomial GLMs:

mNull <- glm.nb(Measurement~1,data=hh.sp)
mLU <- glm.nb(Measurement~LandUse,data=hh.sp)
mFC <- glm.nb(Measurement~ForestCover,data=hh.sp)
mCombined <- glm.nb(Measurement~LandUse+ForestCover,data=hh.sp)

Comparing Candidate Models

We can now compare the AIC values of our candidate models:

aics <- AIC(mNull,mLU,mFC,mCombined)

aics
##           df      AIC
## mNull      2 436.8199
## mLU        3 401.8646
## mFC        3 420.6416
## mCombined  4 400.1322

These AIC values suggest that the combined model with both land use and forest cover is the best-fitting of the four models.

Another useful way to compare models is to calculate, for any model \(i\) in candidate model set \(I\), the difference, \(\Delta AIC_i\), in the model’s AIC value compared with the AIC of the best-fitting model. As a rule of thumb, models with an AIC difference less than 2 are considered to have ‘substantial’ support, models with an AIC difference between 4 and 7 to have ‘considerably less’ support, and models with an AIC difference greater than 10 to have ‘essentially no’ support (Burnham and Anderson 2004).

aics$AIC.Diff <- aics$AIC - min(aics$AIC)

aics
##           df      AIC  AIC.Diff
## mNull      2 436.8199 36.687739
## mLU        3 401.8646  1.732422
## mFC        3 420.6416 20.509448
## mCombined  4 400.1322  0.000000

Here we find that the model with only land use as an explanatory variable has substantial support, consistent with the findings of our backward stepwise selection in the session on GLMs.

AIC Weights

In multi-model selection, the AIC difference for each model \(i\) in candidate set \(I\) are converted into an AIC weight, \(W_i\) using the following formula: \[W_i=\frac{e^{-\frac{1}{2}\times\Delta AIC_i}}{\sum_{i=1}^I{e^{-\frac{1}{2}\times \Delta AIC_i}}}\]

These weights give the probability that the model is the most appropriate one from among those considered.

Let’s calculate the weights for the set of candidate models of P. punicipes abundance that we fitted earlier:

aics$AIC.Wt <- (exp(-0.5*aics$AIC.Diff))/(sum(exp(-0.5*aics$AIC.Diff)))

# This makes sure that numbers are not given in standard form, 
# and so are easier to interpret
options(scipen=5)

aics
##           df      AIC  AIC.Diff            AIC.Wt
## mNull      2 436.8199 36.687739 0.000000007601403
## mLU        3 401.8646  1.732422 0.296036036881013
## mFC        3 420.6416 20.509448 0.000024772234319
## mCombined  4 400.1322  0.000000 0.703939183283265

This shows that our model with both land use and forest cover has a 70% chance, while the model with just land use has a 30% chance of being the most appropriate model. The null model, and the model with just forest cover both have a negligible probability of being the most appropriate model.

Assessing Variable Importance

The other thing we can do with AIC weights is to assess the probability that a given explanatory variable appears in the most appropriate model. This is done by summing the AIC weights of all of the models that contain that variable:

LU.weight <- sum(aics[c('mLU','mCombined'),]$AIC.Wt)
FC.weight <- sum(aics[c('mFC','mCombined'),]$AIC.Wt)

LU.weight
## [1] 0.9999752
FC.weight
## [1] 0.703964

Explanatory variables that have a weight greater than 0.3 are generally considered to be of interest in explaining the measured response variable.

In the case of our models of P. punicipes abundance, land use almost certainly appears in the most appropriate model, while forest cover has a 70% chance of appearing in the most appropriate model. Therefore, we would conclude that forest cover is quite likely to have an important effect on the abundance of this species. This contrasts with the conclusion of our frequentist model selection in the previous session, where we discounted forest cover as an explanatory variable on the basis of a (marginally) non-significant P value. Hopefully you can see the advantages of this more nuanced approach for avoiding the somewhat arbitrary inclusion of exclusion of explanatory variables that occurs when using P values.

Next Time

In the next session, we will look at mixed-effects models for dealing with datasets that have a hierarchical structure, and thus where the assumption that data points are independent of one another is violated.

References

Burnham, K.P., and D.R. Anderson. 2002. Model Selection and Multimodel Inference: An Information Theoretic Approach. 2nd ed. Springer-Verlag, New York.
———. 2004. “Multimodel Inference: Understanding AIC and BIC in Model Selection.” Sociological Methods & Research 33: 261–304. https://doi.org/10.1177/0049124104268644.
Gould, Rachelle K., Liba Pejchar, Sara G. Bothwell, Berry Brosi, Stacie Wolny, Chase Mendenhall, and Gretchen Daily. 2013. “Forest Restoration and Parasitoid Wasp Communities in Montane Hawai’i.” PLoS ONE 8: e59356. https://doi.org/10.1371/journal.pone.0059356.
Hudson, Lawrence N., Tim Newbold, Sara Contu, Samantha L. L. Hill, Igor Lysenko, Adriana De Palma, Helen R. P. Phillips, et al. 2017. “The Database of the PREDICTS (Projecting Responses of Ecological Diversity In Changing Terrestrial Systems) Project.” Ecology & Evolution 7: 145–88. https://doi.org/10.1002/ece3.2579.