Assessing Statistical Models

Tim Newbold

2025-10-10

Overview

So far, we have considered whether our models conform to the assumptions of parametric statistical tests. However, we haven’t yet thought about how well they describe our study system. In this session, we will look at the ways in which you can assess the ability of a model to explain the measured response variable, and how we might go about selecting the “best” model to describe the study system.

There are three main ways that we might assess the ability of our model to explain our response variable.

  1. We can ask which variables have a significant effect on our response variable. This allows us to select the most parsimonious model for explaining our study system.
  2. We might be interested in how strong an effect an explanatory variable has on the response variable (i.e., its effect size).
  3. We can assess how much of the variability in our response variable is explained by the set of explanatory variables in our model. This is referred to as model explanatory power.

We will return again to the beetles dataset for this session, again log-transforming abundance and surrounding natural habitat values, to deal with the fact that the distributions of values are very right-skewed, and also that abundance values are bound at zero.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
url("https://www.dropbox.com/scl/fi/4olcttgagt0sd92hg64vk/Ewers2007BeetleData.rds?rlkey=14c8qby2023lt9924bi3th0ee&dl=1") %>%
  readRDS() %>% 
  mutate(LogAbundance = log10(Total_abundance),
         LogNaturalHabitat5k = log10(NaturalHabitat5k)) -> beetles

Further Reading

The following books are good references for basic statistical tests, including ANOVA, linear regression and ANCOVA. Crawley (2014) details the approach to model selection that I introduce here.

Crawley, M.J. (2014). Statistics: An Introduction Using R. 2nd Edition. John Wiley & Sons, Hoboken, USA

Barnard, C., Gilbert, F. & McGregor, P. (2007). Asking Questions in Biology: A Guide to Hypothesis Testing, Experimental Design and Presentation in Practical Work and Research Projects. 3rd Edition. Benjamin Cummings, San Francisco, USA.

Model Selection

In classical statistics, we want to discard any candidate explanatory variables that turn out not to have a significant effect on our response variable. This is referred to as a frequentist approach. In a later session, we will investigate alternative approaches to model selection that don’t involve a binary inclusion or exclusion of explanatory variables. A frequentist approach allows us to select the most parsimonious model for explaining our study system (as long as we considered all potentially important variables in the first place!).

There are a number of different approaches that are used to select the most parsimonious statistical model. Here, we will follow the most commonly used of these approaches, which is to start with the most complex model, and see which explanatory variables can be dropped from the model because they don’t have a significant effect. This approach tends to be referred to as backward stepwise model selection.

Following on from the session on analysis of covariance, we will start with the most complex interactive model of beetle abundance as a function of land use and coverage of natural habitat in the surrounding landscape.

# Build the most complex model with an interaction between land use and
# coverage of natural habitats in the surrounding landscape
mFull <- lm(formula = LogAbundance~LandUse*LogNaturalHabitat5k, data = beetles)

summary(mFull)
## 
## Call:
## lm(formula = LogAbundance ~ LandUse * LogNaturalHabitat5k, data = beetles)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.92189 -0.15328  0.02291  0.18623  0.62522 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         2.78061    0.08081  34.411  < 2e-16 ***
## LandUsePasture                     -0.45800    0.16802  -2.726  0.00691 ** 
## LogNaturalHabitat5k                 0.49025    0.12264   3.998 8.63e-05 ***
## LandUsePasture:LogNaturalHabitat5k -0.21559    0.22941  -0.940  0.34832    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2575 on 229 degrees of freedom
## Multiple R-squared:  0.3581, Adjusted R-squared:  0.3497 
## F-statistic: 42.58 on 3 and 229 DF,  p-value: < 2.2e-16
anova(mFull)
## Analysis of Variance Table
## 
## Response: LogAbundance
##                              Df  Sum Sq Mean Sq  F value    Pr(>F)    
## LandUse                       1  7.2786  7.2786 109.7491 < 2.2e-16 ***
## LogNaturalHabitat5k           1  1.1344  1.1344  17.1046 4.968e-05 ***
## LandUse:LogNaturalHabitat5k   1  0.0586  0.0586   0.8832    0.3483    
## Residuals                   229 15.1874  0.0663                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The analysis of variance table here already gives us P values, and so some indication of which variables have a significant effect on our response variable. The problem is that these F ratios and P values are influenced by the order in which the variables are entered into the model. We can demonstrate this very easily by creating a model with the same structure, but where the two variables are entered in the opposite order:

mFull2 <- lm(formula = LogAbundance~LogNaturalHabitat5k*LandUse,data = beetles)

anova(mFull)
## Analysis of Variance Table
## 
## Response: LogAbundance
##                              Df  Sum Sq Mean Sq  F value    Pr(>F)    
## LandUse                       1  7.2786  7.2786 109.7491 < 2.2e-16 ***
## LogNaturalHabitat5k           1  1.1344  1.1344  17.1046 4.968e-05 ***
## LandUse:LogNaturalHabitat5k   1  0.0586  0.0586   0.8832    0.3483    
## Residuals                   229 15.1874  0.0663                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(mFull2)
## Analysis of Variance Table
## 
## Response: LogAbundance
##                              Df  Sum Sq Mean Sq F value    Pr(>F)    
## LogNaturalHabitat5k           1  3.6773  3.6773 55.4472 1.937e-12 ***
## LandUse                       1  4.7357  4.7357 71.4065 3.397e-15 ***
## LogNaturalHabitat5k:LandUse   1  0.0586  0.0586  0.8832    0.3483    
## Residuals                   229 15.1874  0.0663                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here, the F ratios and P values for the effects of the first two variables entered into the model (the effects of land use and surrounding natural habitat alone) are slightly different in the two models. The values for the interaction term are identical, because this was the last variable entered into the model.

The best way to assess the significance of terms in a statistical model is to try dropping terms, and seeing whether this leads to a significant drop in the ability of the model to explain the response variable.

The first step in a backward stepwise model selection exercise is to try dropping the interaction term from the model, in this case comparing the full complex model with an additive model that doesn’t have the interaction term. As well as allowing us to compute the sums of squares, F ratios and P values from a single model, the anova function in R also allows us to compare two models:

mR1 <- lm(formula = LogAbundance~LandUse+LogNaturalHabitat5k,data = beetles)

anova(mFull,mR1)
## Analysis of Variance Table
## 
## Model 1: LogAbundance ~ LandUse * LogNaturalHabitat5k
## Model 2: LogAbundance ~ LandUse + LogNaturalHabitat5k
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    229 15.187                           
## 2    230 15.246 -1 -0.058574 0.8832 0.3483

The analysis of variance to compare two models is based on the reduction in the model sum of squares in the simpler model compared to the more complex model. A more complex model must explain the data better than the simpler model by virtue of the extra parameters in the model (as long as the structure of the simpler model is nested within the structure of the more complex model). What we need to determine is whether the reduction in fit of the simpler model is significant.

Just as with a classic analysis of variance test, we can use this sum of squares to calculate an F ratio and P value. This P value tells us the probability that the better fit of the complex model could have occurred by chance given the null hypothesis that the additional term(s) in the model are not actually important. When conducting backward stepwise model selection, it is best practice to compare models that differ by just one term.

For the comparison we are making here, the fact that the P value is greater than 0.05 tells us that the interaction term does not significantly improve the fit of the model to the data.

Once we have tested the significance of the interaction terms in our model, we then remove these and test the main effect (i.e., non-interaction) terms. Here we have two main effects, so we would simultaneously try dropping each of these, and assessing the effect on the fit of the model to the data (by comparison with the additive model that contains both terms):

mR2 <- lm(formula = LogAbundance~LogNaturalHabitat5k,data = beetles)
mR3 <- lm(formula = LogAbundance~LandUse,data = beetles)

anova(mR1,mR2)
## Analysis of Variance Table
## 
## Model 1: LogAbundance ~ LandUse + LogNaturalHabitat5k
## Model 2: LogAbundance ~ LogNaturalHabitat5k
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    230 15.246                                  
## 2    231 19.982 -1   -4.7357 71.443 3.293e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(mR1,mR3)
## Analysis of Variance Table
## 
## Model 1: LogAbundance ~ LandUse + LogNaturalHabitat5k
## Model 2: LogAbundance ~ LandUse
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)    
## 1    230 15.246                                 
## 2    231 16.380 -1   -1.1344 17.113 4.94e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The removal of both land use and surrounding natural habitat leads to a significant reduction in model explanatory power. If any of the variables had been non-significant, we would then remove the variable with the highest P value, and proceed to a second round of model selection.

Therefore, we have identified the most parsimonious, minimum adequate model. We now need to compile the final model with all significant terms (both main effects and interaction terms). Here, this is a model that contains just the main effects of land use and surrounding natural habitat, since the interaction was non-significant.

mFinal <- lm(formula = LogAbundance~LandUse+LogNaturalHabitat5k,data = beetles)

summary(mFinal)
## 
## Call:
## lm(formula = LogAbundance ~ LandUse + LogNaturalHabitat5k, data = beetles)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.92708 -0.15711  0.01682  0.19630  0.64222 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          2.74171    0.06938  39.516  < 2e-16 ***
## LandUsePasture      -0.30376    0.03594  -8.452 3.29e-15 ***
## LogNaturalHabitat5k  0.42864    0.10362   4.137 4.94e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2575 on 230 degrees of freedom
## Multiple R-squared:  0.3556, Adjusted R-squared:   0.35 
## F-statistic: 63.46 on 2 and 230 DF,  p-value: < 2.2e-16

Effect Sizes

P values tell us how likely it is that we would have observed a given effect by chance given the null hypothesis that there is no relationship between variables. However, this doesn’t on its own tell us how strong an effect is in real terms. A weak effect of an explanatory variable on a response variable could be identified as a significant effect given a large enough sample size. To understand how strong an effect an explanatory variable has on a response variable, we need to consider effect sizes. That is, how much a response variable changes for a given change in an explanatory variable.

Here, we will consider the difference in the beetle abundance we would expect to sample in a location in primary vegetation with 10% coverage of natural habitat in the surrounding landscape compared to a location in primary vegetation with 50% surrounding coverage of natural habitat.

To make these predictions, we can extract the necessary slope and intercept parameters from the coefficient table. Because we are making predictions for locations in primary vegetation, which is the reference level for our land use variable, we just need the overall model intercept and the main slope for the effect of natural habitat.

summary(mFinal)
## 
## Call:
## lm(formula = LogAbundance ~ LandUse + LogNaturalHabitat5k, data = beetles)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.92708 -0.15711  0.01682  0.19630  0.64222 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          2.74171    0.06938  39.516  < 2e-16 ***
## LandUsePasture      -0.30376    0.03594  -8.452 3.29e-15 ***
## LogNaturalHabitat5k  0.42864    0.10362   4.137 4.94e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2575 on 230 degrees of freedom
## Multiple R-squared:  0.3556, Adjusted R-squared:   0.35 
## F-statistic: 63.46 on 2 and 230 DF,  p-value: < 2.2e-16
intercept <- summary(mFinal)$coefficients['(Intercept)','Estimate']
slope <- summary(mFinal)$coefficients['LogNaturalHabitat5k','Estimate']

intercept
## [1] 2.741708
slope
## [1] 0.4286374

Now we can calculate the beetle abundance that we would expect to sample in primary vegetation if surrounding natural habitat is 10% compared to if it is 50%.

# First, let's set the log-transformed natural habitat values corresponding with 
# 10% and 50% coverage
log.nh.10 <- log10(0.1)
log.nh.50 <- log10(0.5)

# Now we will use the regression equation to calculate expected (log) 
# total beetle abundance
exp.log.abund.10 <- intercept + log.nh.10 * slope
exp.log.abund.50 <- intercept + log.nh.50 * slope

# We must remember that our response variable was also log-transformed. Therefore,
# we need to back-transform by raising 10 to the power of these predicted values
exp.abund.10 <- 10^exp.log.abund.10
exp.abund.50 <- 10^exp.log.abund.50

# Finally, let's the calculate the expected abundance where coverage of natural
# habitat is 10% as a percentage of expected abundance where coverage of natural
# habitat is 50%
exp.abund.10.percent <- (exp.abund.10/exp.abund.50)*100

round(exp.abund.10.percent,2)
## [1] 50.16

This calculation shows us that beetle abundance is reduced by nearly a half, on average, when landscapes have 10% coverage of natural habitat compared to 50%. This is a very large effect size.

Model Explanatory Power

The final piece of information we are likely to be interested in for assessing how well our model describes the study system is explanatory power. Model explanatory power tells us how much of the variation in our response variable is explained by the combination of explanatory variables in our model. The most commonly used measure of explanatory power for basic statistical approaches is the coefficient of determination, or R2. We can calculate R2 manually, based on the sums of squares, using the following equation: \[R^2=1-\frac{SS_{residual}}{SS_{total}}\].

See the session on linear regression for a refresher on the basis for the different sums of squares values.

From this equation, we can easily calculate the R2 value of our final model from earlier:

anova(mFinal)
## Analysis of Variance Table
## 
## Response: LogAbundance
##                      Df  Sum Sq Mean Sq F value    Pr(>F)    
## LandUse               1  7.2786  7.2786 109.805 < 2.2e-16 ***
## LogNaturalHabitat5k   1  1.1344  1.1344  17.113  4.94e-05 ***
## Residuals           230 15.2459  0.0663                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ss.resid <- anova(mFinal)['Residuals','Sum Sq']
# For the model sum of squares, we need to combine the sum of squares associated
# with both terms in our model
ss.model <- anova(mFinal)['LandUse','Sum Sq'] + 
  anova(mFinal)['LogNaturalHabitat5k','Sum Sq']
ss.total <- ss.resid + ss.model

r2 <- 1 - (ss.resid/ss.total)
r2
## [1] 0.3555946

However, we don’t need to worry about manually calculating R2 values for every model we run, because it is reported in the model summary.

summary(mFinal)
## 
## Call:
## lm(formula = LogAbundance ~ LandUse + LogNaturalHabitat5k, data = beetles)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.92708 -0.15711  0.01682  0.19630  0.64222 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          2.74171    0.06938  39.516  < 2e-16 ***
## LandUsePasture      -0.30376    0.03594  -8.452 3.29e-15 ***
## LogNaturalHabitat5k  0.42864    0.10362   4.137 4.94e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2575 on 230 degrees of freedom
## Multiple R-squared:  0.3556, Adjusted R-squared:   0.35 
## F-statistic: 63.46 on 2 and 230 DF,  p-value: < 2.2e-16

The R2 of our final model tells us that just over 35% of the variability in sampled beetle abundance is explained by land use and the amount of natural habitat in the surrounding landscape. This is pretty good for an ecological study, when you consider the complexity of natural systems, and the many variables that may determine species abundance within ecosystems.

We may also be interested in the adjusted R2 value, which is penalized according to the complexity (number of parameters) of the model, which here is 0.35.

Next Time

That’s it for this session. In the next session, we will look at how to deal with types of data for which we wouldn’t expect the assumptions of standard statistical tests to be met.