Assessing Statistical Models

Tim Newbold

2021-10-19

Overview

So far, we have considered whether our models conform to the assumptions of parametric statistical tests. However, we haven’t yet thought about how well they describe our study system. In this session, we will look at the ways in which you can assess the ability of a model to explain the measured response variable, and how we might go about selecting the “best” model to describe the study system.

There are three main ways that we might assess the ability of our model to explain our response variable.

  1. We can ask which variables have a significant effect on our response variable. This allows us to select the most parsimonious model for explaining our study system.
  2. We might be interested in how strong an effect an explanatory variable has on the response variable (i.e., its effect size).
  3. We can assess how much of the variability in our response variable is explained by the set of explanatory variables in our model. This is referred to as model explanatory power.

We will return yet again to the mtcars dataset for this session:

data(mtcars)

mtcars$cyl <- factor(mtcars$cyl)

Model Selection

In classical statistics, we want to discard any candidate explanatory variables that turn out not to have a significant effect on our response variable. This is referred to as a frequentist approach. In a later session, we will investigate alternative approaches to model selection that don’t involve a binary inclusion or exclusion of explanatory variables. A frequentist approach allows us to select the most parsimonious model for explaining our study system (as long as we considered all potentially important variables in the first place!).

There are a number of different approaches that are used to select the most parsimonious statistical model. Here, we will follow the most commonly used of these approaches, which is to start with the most complex model, and see which explanatory variables can be dropped from the model because they don’t have a significant effect. This approach tends to be referred to as backward stepwise model selection.

Following on from the session on analysis of covariance, we will start with the most complex interactive model of car fuel efficiency as a function of power and number of cylinders:

# First, we need to create log-transformed versions of our response and continuous
# explanatory variable, as we saw in the last session that these variables have
# a skewed distribution
mtcars$LogMPG <- log(mtcars$mpg)
mtcars$LogHP <- log(mtcars$hp)

# We also need to make sure that the number of cylinders is treated 
# as a grouping variable
mtcars$cyl <- factor(mtcars$cyl)

# Now we can build the model
mFull <- lm(mtcars$LogMPG~mtcars$LogHP*mtcars$cyl)

summary(mFull)
## 
## Call:
## lm(formula = mtcars$LogMPG ~ mtcars$LogHP * mtcars$cyl)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.35724 -0.08548 -0.00269  0.08830  0.26543 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                4.8204     0.8156   5.910 3.11e-06 ***
## mtcars$LogHP              -0.3535     0.1857  -1.903   0.0681 .  
## mtcars$cyl6               -1.5756     1.9063  -0.827   0.4160    
## mtcars$cyl8               -0.8895     1.2865  -0.691   0.4954    
## mtcars$LogHP:mtcars$cyl6   0.2984     0.4045   0.738   0.4673    
## mtcars$LogHP:mtcars$cyl8   0.1221     0.2635   0.463   0.6469    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1541 on 26 degrees of freedom
## Multiple R-squared:  0.7753, Adjusted R-squared:  0.7321 
## F-statistic: 17.94 on 5 and 26 DF,  p-value: 1.06e-07
anova(mFull)
## Analysis of Variance Table
## 
## Response: mtcars$LogMPG
##                         Df  Sum Sq Mean Sq F value    Pr(>F)    
## mtcars$LogHP             1 1.96733 1.96733 82.8173 1.453e-09 ***
## mtcars$cyl               2 0.14950 0.07475  3.1467   0.05973 .  
## mtcars$LogHP:mtcars$cyl  2 0.01427 0.00713  0.3003   0.74314    
## Residuals               26 0.61763 0.02376                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The analysis of variance table here already gives us P values, and so some indication of which variables have a significant effect on our response variable. The problem is that these F ratios and P values are based on sequential (so-called Type I) sums of squares, and so are influenced by the order in which the variables are entered into the model. We can demonstrate this very easily by creating a model with the same structure, but where the two variables are entered in the opposite order:

mFull2 <- lm(mtcars$LogMPG~mtcars$cyl*mtcars$LogHP)

anova(mFull)
## Analysis of Variance Table
## 
## Response: mtcars$LogMPG
##                         Df  Sum Sq Mean Sq F value    Pr(>F)    
## mtcars$LogHP             1 1.96733 1.96733 82.8173 1.453e-09 ***
## mtcars$cyl               2 0.14950 0.07475  3.1467   0.05973 .  
## mtcars$LogHP:mtcars$cyl  2 0.01427 0.00713  0.3003   0.74314    
## Residuals               26 0.61763 0.02376                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(mFull2)
## Analysis of Variance Table
## 
## Response: mtcars$LogMPG
##                         Df  Sum Sq Mean Sq F value   Pr(>F)    
## mtcars$cyl               2 2.00808 1.00404 42.2662 6.75e-09 ***
## mtcars$LogHP             1 0.10876 0.10876  4.5783  0.04193 *  
## mtcars$cyl:mtcars$LogHP  2 0.01427 0.00713  0.3003  0.74314    
## Residuals               26 0.61763 0.02376                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here, the F ratios and P values for the effects of the first two variables entered into the model (the effects of power and number of cylinders alone) are slightly different in the two models. The values for the interaction term are identical, because this was the last variable entered into the model.
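A convenient way around this order dependence is base R’s drop1 function, which tests each droppable term as though it were entered last into the model. A quick sketch (refitting the model with a data argument, which gives cleaner term names):

```r
data(mtcars)
mtcars$LogMPG <- log(mtcars$mpg)
mtcars$LogHP <- log(mtcars$hp)
mtcars$cyl <- factor(mtcars$cyl)

mFull <- lm(LogMPG ~ LogHP * cyl, data = mtcars)

# drop1() tests each term as if it were entered last, so its F ratios and
# P values don't depend on the order of terms in the formula. It respects
# marginality: while the interaction is in the model, only the interaction
# itself is eligible to be dropped
drop1(mFull, test = "F")
```

Because drop1 respects marginality, with the interaction present it only tests the interaction term itself; its F ratio and P value match the last row of the anova tables above.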

The best way to assess the significance of terms in a statistical model is to try dropping terms, and seeing whether this leads to a significant drop in the ability of the model to explain the response variable.

The first step in a backward stepwise model selection exercise is to try dropping the interaction term from the model, in this case comparing the full complex model with an additive model that doesn’t have the interaction term. As well as allowing us to compute the sums of squares, F ratios and P values from a single model, the anova function in R also allows us to compare two models:

mR1 <- lm(mtcars$LogMPG~mtcars$LogHP+mtcars$cyl)

anova(mFull,mR1)
## Analysis of Variance Table
## 
## Model 1: mtcars$LogMPG ~ mtcars$LogHP * mtcars$cyl
## Model 2: mtcars$LogMPG ~ mtcars$LogHP + mtcars$cyl
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)
## 1     26 0.61763                           
## 2     28 0.63190 -2 -0.014267 0.3003 0.7431

The analysis of variance comparing two models is based on the difference in residual sum of squares between the simpler model and the more complex model. In other words, this tells us how much less of the variation in the response variable the simpler model explains compared to the more complex model.

Just as with a classic analysis of variance test, we can use this sum of squares to calculate an F ratio and P value. This P value tells us the probability of seeing an improvement in fit as large as that of the more complex model over the simpler one, given the null hypothesis that the extra term in the more complex model is in fact unimportant. Here, a P value substantially greater than 0.05 tells us that the interaction term does not significantly improve the fit of the model to the data.
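This F ratio and P value can be reproduced by hand from the two models’ residual sums of squares. A sketch (refitting the models with a data argument):

```r
data(mtcars)
mtcars$LogMPG <- log(mtcars$mpg)
mtcars$LogHP <- log(mtcars$hp)
mtcars$cyl <- factor(mtcars$cyl)

mFull <- lm(LogMPG ~ LogHP * cyl, data = mtcars)
mR1 <- lm(LogMPG ~ LogHP + cyl, data = mtcars)

# The F ratio is the drop in residual sum of squares per dropped degree of
# freedom, divided by the residual mean square of the more complex model
rss.full <- deviance(mFull)                       # residual SS (26 df)
rss.red <- deviance(mR1)                          # residual SS (28 df)
df.diff <- df.residual(mR1) - df.residual(mFull)  # 2 df dropped

f.ratio <- ((rss.red - rss.full) / df.diff) / (rss.full / df.residual(mFull))
p.value <- pf(f.ratio, df1 = df.diff, df2 = df.residual(mFull),
              lower.tail = FALSE)
```

The results match the F ratio (0.3003) and P value (0.7431) in the table produced by anova above.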

Once we have tested the significance of the interaction terms in our model, we remove these and test the main-effect (i.e., non-interaction) terms. Here we have two main effects, so we try dropping each of these in turn, assessing the effect on the fit of the model to the data (by comparison with the additive model that contains both terms):

mR2 <- lm(mtcars$LogMPG~mtcars$LogHP)
mR3 <- lm(mtcars$LogMPG~mtcars$cyl)

anova(mR1,mR2)
## Analysis of Variance Table
## 
## Model 1: mtcars$LogMPG ~ mtcars$LogHP + mtcars$cyl
## Model 2: mtcars$LogMPG ~ mtcars$LogHP
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     28 0.6319                              
## 2     30 0.7814 -2   -0.1495 3.3123 0.05115 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(mR1,mR3)
## Analysis of Variance Table
## 
## Model 1: mtcars$LogMPG ~ mtcars$LogHP + mtcars$cyl
## Model 2: mtcars$LogMPG ~ mtcars$cyl
##   Res.Df     RSS Df Sum of Sq      F Pr(>F)  
## 1     28 0.63190                             
## 2     29 0.74066 -1  -0.10876 4.8192 0.0366 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In this case, dropping car power leads to a significant reduction in model explanatory power, but dropping the number of cylinders leads to a non-significant reduction. If any variables yield a non-significant P value, we drop the one with the highest P value, and then proceed to a second round of model selection. So, now we need to compare a model with just power to a model without power. This second model has no explanatory variables, and is referred to as an intercept-only model. To run such a model in R, we ask R to fit our response variable as a function of 1:

mR4 <- lm(mtcars$LogMPG~1)

# Remember that the model with just power in the last round of model selection was mR2
anova(mR4,mR2)
## Analysis of Variance Table
## 
## Model 1: mtcars$LogMPG ~ 1
## Model 2: mtcars$LogMPG ~ mtcars$LogHP
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)    
## 1     31 2.7487                                 
## 2     30 0.7814  1    1.9673 75.531 1.08e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Dropping car power leads to a highly significant reduction in our ability to explain fuel efficiency. So, that’s it, we have found our most parsimonious model (a model with just car power). Had there still been non-significant terms, we would have continued with successive rounds of model selection until every term remaining in the model was associated with a significant P value. If any interaction terms had been significant, we would have added these back into our final model at the end of model selection, along with the main effects they involve.
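For completeness, R can automate a backward search with the step function. Note, though, that step selects on AIC rather than on F tests, so it will not necessarily agree with the manual selection above. A sketch:

```r
data(mtcars)
mtcars$LogMPG <- log(mtcars$mpg)
mtcars$LogHP <- log(mtcars$hp)
mtcars$cyl <- factor(mtcars$cyl)

mFull <- lm(LogMPG ~ LogHP * cyl, data = mtcars)

# step() performs backward selection automatically, dropping terms while
# doing so improves (lowers) AIC; trace = 0 suppresses the per-step output
mStep <- step(mFull, direction = "backward", trace = 0)

formula(mStep)
```

Different selection criteria can and do disagree, so it is worth reporting which criterion you used when describing your model selection.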

We already had our final, most parsimonious, model stored in R, but we will run the final model again, to avoid any ambiguity:

mFinal <- lm(mtcars$LogMPG~mtcars$LogHP)

summary(mFinal)
## 
## Call:
## lm(formula = mtcars$LogMPG ~ mtcars$LogHP)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.38189 -0.05707 -0.00691  0.10815  0.37501 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.54538    0.29913  18.538  < 2e-16 ***
## mtcars$LogHP -0.53009    0.06099  -8.691 1.08e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1614 on 30 degrees of freedom
## Multiple R-squared:  0.7157, Adjusted R-squared:  0.7062 
## F-statistic: 75.53 on 1 and 30 DF,  p-value: 1.08e-09
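One stylistic aside before moving on: the models above were specified with mtcars$ prefixes inside the formula. An equivalent way, which many R users prefer, is to pass the data frame via lm’s data argument; the fitted coefficients are identical, the coefficient names are cleaner, and predict can then be used with new data:

```r
data(mtcars)
mtcars$LogMPG <- log(mtcars$mpg)
mtcars$LogHP <- log(mtcars$hp)

# The same final model as mFinal, but specified via the data argument;
# note the coefficient is now named simply "LogHP"
mFinal2 <- lm(LogMPG ~ LogHP, data = mtcars)

coef(mFinal2)
```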

Effect Sizes

P values tell us how likely it is that we would have observed a given effect by chance given the null hypothesis that there is no relationship between variables. However, this doesn’t on its own tell us how strong an effect is in real terms. A weak effect of an explanatory variable on a response variable could be identified as a significant effect given a large enough sample size. To understand how strong an effect an explanatory variable has on a response variable, we need to consider effect sizes. That is, how much a response variable changes for a given change in an explanatory variable.

Let’s consider here the difference in fuel economy between the car with the lowest power in the dataset (52 horsepower, or 3.95 in log-transformed terms) and the car with the highest power (335 horsepower, or 5.81 log-transformed).

We can use our model to make some predictions. From the coefficient table, we can extract the intercept and slope of the fitted relationship in our final model:

summary(mFinal)
## 
## Call:
## lm(formula = mtcars$LogMPG ~ mtcars$LogHP)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.38189 -0.05707 -0.00691  0.10815  0.37501 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.54538    0.29913  18.538  < 2e-16 ***
## mtcars$LogHP -0.53009    0.06099  -8.691 1.08e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1614 on 30 degrees of freedom
## Multiple R-squared:  0.7157, Adjusted R-squared:  0.7062 
## F-statistic: 75.53 on 1 and 30 DF,  p-value: 1.08e-09
intercept <- summary(mFinal)$coefficients['(Intercept)','Estimate']
slope <- summary(mFinal)$coefficients['mtcars$LogHP','Estimate']

intercept
## [1] 5.545381
slope
## [1] -0.5300919

Now we can calculate the average fuel efficiency of a car with a power of 52 HP and of a car with a power of 335 HP:

# First, let's set the log-transformed horsepowers for cars of 52 and 335 HP
log.hp.52 <- log(52)
log.hp.335 <- log(335)

# Now we will use the regression equation to calculate predicted fuel efficiency
av.fuelecon.52 <- intercept + log.hp.52 * slope
av.fuelecon.335 <- intercept + log.hp.335 * slope

# We must remember that our response variable was also log-transformed
# So we need to back-transform by taking the exponential
av.fuelecon.52 <- exp(av.fuelecon.52)
av.fuelecon.335 <- exp(av.fuelecon.335)

av.fuelecon.52
## [1] 31.52745
av.fuelecon.335
## [1] 11.74417

So, the fuel economy of the most powerful car in this dataset is more than 60% lower than that of the least powerful car. This is a very large effect size!
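The same predictions can be obtained with R’s predict function, without extracting the coefficients by hand, provided the model is fitted via the data argument so that predict can match a newdata data frame. A sketch (refitting the final model in that form):

```r
data(mtcars)
mtcars$LogMPG <- log(mtcars$mpg)
mtcars$LogHP <- log(mtcars$hp)

# Refit the final model with a data argument so that predict() can take a
# newdata data frame whose column name matches the explanatory variable
mFinal2 <- lm(LogMPG ~ LogHP, data = mtcars)

# Predict log fuel efficiency at 52 and 335 horsepower, then back-transform
# from the log scale by taking the exponential
preds <- exp(predict(mFinal2, newdata = data.frame(LogHP = log(c(52, 335)))))

preds
```

This reproduces the two values calculated manually above (roughly 31.5 and 11.7 miles per gallon).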

Model Explanatory Power

The final piece of information we are likely to be interested in for assessing the quality of our models is explanatory power. Model explanatory power tells us how much of the variation in our response variable is explained by the combination of explanatory variables in our model. The most commonly used measure of explanatory power for basic statistical approaches is the coefficient of determination, or R2. We can calculate R2 manually, based on the sums of squares, using the following equation: \[R^2=1-\frac{SS_{residual}}{SS_{total}}\].

See the session on linear regression for a refresher on the basis for the different sums of squares values.

From this equation, we can easily calculate the R2 value of our final model from earlier:

anova(mFinal)
## Analysis of Variance Table
## 
## Response: mtcars$LogMPG
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## mtcars$LogHP  1 1.9673 1.96733  75.531 1.08e-09 ***
## Residuals    30 0.7814 0.02605                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ss.resid <- anova(mFinal)['Residuals','Sum Sq']
ss.model <- anova(mFinal)['mtcars$LogHP','Sum Sq']
ss.total <- ss.resid + ss.model

r2 <- 1 - (ss.resid/ss.total)
r2
## [1] 0.7157233
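As a sanity check, for an ordinary least-squares model that includes an intercept, R2 also equals the squared correlation between the fitted and observed values:

```r
data(mtcars)
mtcars$LogMPG <- log(mtcars$mpg)
mtcars$LogHP <- log(mtcars$hp)

mFinal2 <- lm(LogMPG ~ LogHP, data = mtcars)

# For OLS with an intercept, R-squared equals the squared correlation
# between the fitted values and the observed response
r2 <- cor(fitted(mFinal2), mtcars$LogMPG)^2

r2
```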

However, we don’t need to worry about manually calculating R2 values for every model we run, because it is reported in the model summary:

summary(mFinal)
## 
## Call:
## lm(formula = mtcars$LogMPG ~ mtcars$LogHP)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.38189 -0.05707 -0.00691  0.10815  0.37501 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.54538    0.29913  18.538  < 2e-16 ***
## mtcars$LogHP -0.53009    0.06099  -8.691 1.08e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1614 on 30 degrees of freedom
## Multiple R-squared:  0.7157, Adjusted R-squared:  0.7062 
## F-statistic: 75.53 on 1 and 30 DF,  p-value: 1.08e-09

The R2 of our final model tells us that just over 70% of the variability in car fuel efficiency is explained by the power of the car. This is a very high value: you will be lucky to obtain such high R2 values with real biological datasets, especially ecological ones.

We may also be interested in the adjusted R2 value, which is penalized according to the complexity (number of parameters) of the model.
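The adjustment can be computed by hand as \[R^2_{adj}=1-(1-R^2)\frac{n-1}{n-p-1}\], where n is the number of observations and p the number of explanatory variables:

```r
data(mtcars)
mtcars$LogMPG <- log(mtcars$mpg)
mtcars$LogHP <- log(mtcars$hp)

mFinal2 <- lm(LogMPG ~ LogHP, data = mtcars)

r2 <- summary(mFinal2)$r.squared
n <- nrow(mtcars)  # 32 observations
p <- 1             # one explanatory variable

# Adjusted R-squared penalizes R-squared for model complexity: adding a
# useless explanatory variable will tend to decrease it
adj.r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

adj.r2
```

For our final model this gives approximately 0.7062, matching the adjusted R-squared reported by summary.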

Next Time

That’s it for this session. In the next session, we will look at how to deal with types of data for which we wouldn’t expect the assumptions of standard statistical tests to be met.