Generalized Linear Models (GLMs)

Tim Newbold

2024-12-06

Overview

In this session, we will cover Generalized Linear Models, which allow us to deal with types of data where we cannot expect a conformation to the standard assumptions of parametric statistical tests.

I will cover here only the practicalities of running Generalized Linear Models in R. If you are interested, this great article gives a very short introduction to the statistics behind GLMs, although it focuses on application in Python rather than R.

This session will require the use of my MResModelling R package, which you can install from Github:

remotes::install_github("timnewbold/StatisticalModels")
remotes::install_github("timnewbold/MResEcologicalModelling",subdir="MResModelling")

library(MResModelling)

Example Datasets

We will work with two datasets, both derived from the PREDICTS database (Hudson et al. 2017), and originally from a study of parasitoid wasp assemblages in Hawai’i (Gould et al. 2013).

The first dataset describes the abundance of 21 different species of parasitoid wasp at 754 different locations in Hawai’i in three different types of land use: primary vegetation (pristine forest), secondary vegetation (forest recovering after past destruction), and pasture (areas used for livestock grazing).

The second dataset gives site-level metrics of biodiversity (e.g., total species richness, total recorded abundance) at the same sites in Hawai’i. To keep things simple, I have focused this dataset just on the 242 sites in primary vegetation or pasture.

# Load species-level data (loads as object 'hh')
data(HawaiiHymenoptera)
# For simplicity, we will select only those columns that we will use later
# Site_number: the site at which species were sampled
# LandUse: the land use at the sampled site
# Taxon_name_entered: the species sampled
# Measurement: the recorded abundance of the species
hh <- hh[,c('Site_number','LandUse','Taxon_name_entered','Measurement')]
# Make the site number a grouping variable (factor)
hh$Site_number <- factor(hh$Site_number)
str(hh)

## 'data.frame':    15834 obs. of  4 variables:
##  $ Site_number       : Factor w/ 754 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LandUse           : Factor w/ 3 levels "Primary","Pasture",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Taxon_name_entered: Factor w/ 21 levels "Agasthenes swezeyi",..: 19 3 4 9 17 1 8 7 5 21 ...
##  $ Measurement       : num  0 0 0 0 0 0 0 0 0 0 ...

# Load site-level data (loads as object 'hhs2')
data(HawaiiHymenopteraSitesSubset)
# Again, we will select the relevant columns
# Site_number: the site at which species were sampled
# Predominant_land_use: the land use at the sampled site
# ForestCover: the percentage forest cover at the sampled site
# Species_richness: the recorded species richness at the site
hhs2 <- hhs2[,c('Site_number','Predominant_land_use','ForestCover','Species_richness')]
# Make the site number a grouping variable (factor)
hhs2$Site_number <- factor(hhs2$Site_number)
str(hhs2)

## 'data.frame':    242 obs. of  4 variables:
##  $ Site_number         : Factor w/ 242 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Predominant_land_use: Factor w/ 2 levels "Primary vegetation",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ForestCover         : int  41 41 41 41 41 0 0 0 0 0 ...
##  $ Species_richness    : int  1 1 2 2 2 0 1 1 2 2 ...

Poisson GLMs

Fitting a Poisson GLM in R

Count data often conform to a Poisson distribution, and so are commonly encountered in ecology. For example, in the site-level parasitoid wasp data, we have a count of the number of species recorded at each site.

Fitting a Poisson GLM in R is very similar to fitting an analysis of covariance (or linear model), except that now we need to use the glm function. To run a GLM, we need to provide one extra piece of information beyond that needed for a linear model: the family of model we want to use. In this case, we want a Poisson family, family=poisson.

srMod1 <- glm(Species_richness~Predominant_land_use+ForestCover,family=poisson,
              data=hhs2)

summary(srMod1)

## 
## Call:
## glm(formula = Species_richness ~ Predominant_land_use + ForestCover, 
##     family = poisson, data = hhs2)
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  0.967598   0.075202  12.867  < 2e-16 ***
## Predominant_land_usePasture -1.520983   0.165513  -9.190  < 2e-16 ***
## ForestCover                 -0.003842   0.001380  -2.785  0.00536 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 394.18  on 241  degrees of freedom
## Residual deviance: 280.76  on 239  degrees of freedom
## AIC: 743.46
## 
## Number of Fisher Scoring iterations: 6

Incidentally, if you run a GLM but specify a Gaussian family (family=gaussian), the result will be the same as for a linear model.

You will see that the coefficient table produced by a GLM is the same as for a linear model. The intercept tells us the estimated value of the response variable when the continuous explanatory variables (here just forest cover) have a value of 0, and for reference groups in our grouping (categorical) variables (here for primary vegetation). We then also have coefficients describing the slope of the relationship with our continuous explanatory variables, and coefficients giving the estimated difference in the response variable for non-reference groupings. We can see here that species richness appears to show a negative relationship with forest cover, and appears to be lower in pasture than in primary vegetation.

Model Selection

Model selection can be performed in almost exactly the same way for GLMs as for standard linear models. In the case of our species richness model, we have two potential explanatory variables in our model. If we use backward stepwise model selection, we would test the effect of removing each of these variables from the model:

srMod2 <- glm(Species_richness~Predominant_land_use,family=poisson,data=hhs2)
anova(srMod1,srMod2)

## Analysis of Deviance Table
## 
## Model 1: Species_richness ~ Predominant_land_use + ForestCover
## Model 2: Species_richness ~ Predominant_land_use
##   Resid. Df Resid. Dev Df Deviance
## 1       239     280.76            
## 2       240     288.62 -1  -7.8586

srMod3 <- glm(Species_richness~ForestCover,family=poisson,data=hhs2)
anova(srMod1,srMod3)

## Analysis of Deviance Table
## 
## Model 1: Species_richness ~ Predominant_land_use + ForestCover
## Model 2: Species_richness ~ ForestCover
##   Resid. Df Resid. Dev Df Deviance
## 1       239     280.76            
## 2       240     388.03 -1  -107.27

The key difference for GLMs is that we no longer obtain an F ratio when we compare two models using analysis of variance. This is because you cannot calculate sums of squares from a generalized linear model. Instead GLMs are assessed in terms of changes in the deviance explained. Deviance values are calculated based on the (log) likelihoods of three models: the proposed model, a null model (an intercept-only model, as we encountered in the last session), and a saturated model (which has one parameter per data point).

A generalized linear model can then be characterized in terms of two types of deviance:

The null deviance, which is a measure of the overall variability in the response variable: \[Dev_{null}=2(LL_{saturated} - LL_{null})\]
The residual deviance, which is a measure of the variability in the response variable that remains unexplained by the proposed model: \[Dev_{residual}=2(LL_{saturated}-LL_{proposed})\]

The comparison of two models using analysis of variance is based on the difference in the residual deviance of the models. This deviance difference can be compared to a chi-squared distribution to obtain a P value, which we can specify as a parameter to the anova function, test=“Chi”:

anova(srMod1,srMod2,test="Chi")

## Analysis of Deviance Table
## 
## Model 1: Species_richness ~ Predominant_land_use + ForestCover
## Model 2: Species_richness ~ Predominant_land_use
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)   
## 1       239     280.76                        
## 2       240     288.62 -1  -7.8586 0.005058 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(srMod1,srMod3,test="Chi")

## Analysis of Deviance Table
## 
## Model 1: Species_richness ~ Predominant_land_use + ForestCover
## Model 2: Species_richness ~ ForestCover
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       239     280.76                          
## 2       240     388.03 -1  -107.27 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This tells us that both land use and forest cover have a significant effect on the species richness of parasitoid wasps in this study system, and that our first (most complex) model is the most parsimonious model. Let’s re-run the final model, just for clarity:

srModFinal <- glm(Species_richness~Predominant_land_use+ForestCover,
                  family=poisson,data=hhs2)

Effect Sizes

Effect sizes can be calculated in exactly the same way for GLMs as for linear models. But it is important to remember that GLMs use a link function to relate the response variable to the linear predictor (the equation describing the effect of the explanatory variables). Therefore, we have to back-transform our predictions to make them match the original scale of the response variable.

Let’s say we are interested to know the percentage change in species richness in pasture compared to primary vegetation:

# First, we will calculate the species richness in primary vegetation (which as 
# the reference land-use grouping is represented by the model intercept), remembering
# to back-transform by taking the exponential
# We will disregard forest cover for now (i.e., assume a forest cover of 0 in 
# our predictions). Since we did not fit an interaction term in our model, this
# assumption will not affect our estimated percentage reduction in species richness
intercept <- srModFinal$coefficients['(Intercept)']
sp.rich.primary <- exp(intercept)
sp.rich.primary

## (Intercept) 
##    2.631615

# Now we will use the coefficient for the difference in (log) species richness in
# pasture compared to primary vegetation to calculate the species richness in pasture
pasture.coef <- srModFinal$coefficients['Predominant_land_usePasture']
sp.rich.pasture <- exp(pasture.coef+intercept)
sp.rich.pasture

## Predominant_land_usePasture 
##                       0.575

# Finally, we will calculate the percentage change in species richness in
# pasture compared to primary vegetation
pasture.percent.change <- ((sp.rich.pasture/sp.rich.primary)*100)-100
# Round our final number to 1 decimal place
pasture.percent.change <- round(pasture.percent.change,1)
pasture.percent.change

## Predominant_land_usePasture 
##                       -78.2

Therefore, our model shows that there is an average reduction in parasitoid species richness of 78.2%. This is a very large effect size.

Explanatory Power

When we ran linear models, we used the coefficient of determination, or R² to assess how much of the variability in our response variable is explained by a given model. R² is based on the sums of squares of our model, and so cannot be calculated for GLMs. Instead, we can calculate the analogous “deviance explained” by our model: \[Dev_{explained} = \frac{Dev_{null} - Dev_{residual}}{Dev_{null}}\]

For our model of species richness as a function of land use and forest cover:

# Extract the null and residual deviance from the model
dev.null <- srModFinal$null.deviance
dev.resid <- srModFinal$deviance

# Calculate the deviance explained by the model
dev.explained <- (dev.null - dev.resid)/dev.null

# Round to 3 decimal places
dev.explained <- round(dev.explained,3)

dev.explained

## [1] 0.288

Land use and forest cover explain nearly 30% of the variation in parasitoid species richness in this study system. That is a decent explanatory power for a relatively simple model of a complex ecological system (many factors determine the species richness that is sampled in a field survey).

Checking Model Assumptions

It is more tricky to check the assumptions of GLMs than those of linear models, because classical residuals are not expected to behave in the same way for GLMs. I like the DHARMa package in R for working with GLMs, which uses a simulation-based approach to compare the residuals from the actual model with the expectation if the model is behaving normally.

install.packages("DHARMa")

library(DHARMa)

# Simulate residuals 
simResids <- simulateResiduals(srModFinal)

# Generate plots to compare the model residuals to expectations
plot(simResids)

These plots show us that this model is behaving as we would expect in terms of homogeneity of variance and distribution of residuals (i.e., there are no significant deviations from expectations).

For Poisson GLMs, there is one further assumption that we have not encountered before. If data follow a Poisson distribution, then the mean of the distribution is equal to the variance. Accordingly, a Poisson distribution is represented by just one parameter λ, which describes both the mean and the variance of the distribution. We can demonstrate this very simply by drawing some random values from a Poisson distribution:

# Draw 100,000 values from a Poisson distribution with a mean of 5
x <- rpois(n = 100000,lambda = 5)
# Check the mean and variance of this distribution
mean(x); var(x)

## [1] 5.01121

## [1] 5.005534

# Now try a Poisson distribution with a mean of 20
y <- rpois(n = 100000,lambda = 20)
mean(y); var(y)

## [1] 19.98936

## [1] 19.96513

However, count data in ecology are often overdispersed, where the variance is greater than the mean. This violates the assumption of a Poisson GLM, and means that any statistics that we calculate from the model may be unreliable.

We can get an indication of whether a model is over-dispersed by inspecting the model summary. As a rule of thumb, if the response variable conforms to a true Poisson distribution, we expect the residual deviance to be approximately equal to the residual degrees of freedom. If the deviance is much greater than the degrees of freedom, this indicates over-dispersion.

Identifying over-dipersion by inspecting the model summary

There is also a function in the AER package to test for over-dispersion statistically:

install.packages("AER")

library(AER)

dispersiontest(srModFinal)

## 
##  Overdispersion test
## 
## data:  srModFinal
## z = 1.1264, p-value = 0.13
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion 
##   1.394782

In this case, the model is not significantly over-dispersed, so our Poisson model seems to be fine.

Now we will try to run a model of the abundance of one of the parasitoid wasp species. We will focus here on the species Pimpla punicipes:

hh.sp <- droplevels(hh[(hh$Taxon_name_entered=="Pimpla punicipes"),])
# Note - the droplevels function here removes groupings that are not retained
# in the data subset

# The estimates of forest cover were not included in the species-level dataset,
# so we will match these from the site-level dataset
hh.sp$ForestCover <- hhs2$ForestCover[match(hh.sp$Site_number,hhs2$Site_number)]

# Remove sites that have missing (NA) forest-cover estimates
hh.sp <- droplevels(hh.sp[!is.na(hh.sp$ForestCover),])

We will model the abundance of this species as a function of land use and forest cover again. Since the abundances are also counts, we will assume a Poisson distribution for now.

abundMod1 <- glm(Measurement~LandUse+ForestCover,family=poisson,data=hh.sp)

abundMod1

## 
## Call:  glm(formula = Measurement ~ LandUse + ForestCover, family = poisson, 
##     data = hh.sp)
## 
## Coefficients:
##    (Intercept)  LandUsePasture     ForestCover  
##      -0.682680       -2.600734        0.007296  
## 
## Degrees of Freedom: 241 Total (i.e. Null);  239 Residual
## Null Deviance:       407.3 
## Residual Deviance: 324.8     AIC: 473.6

Let’s check the assumptions of this model:

simResids2 <- simulateResiduals(abundMod1)

plot(simResids2)

dispersiontest(abundMod1)

## 
##  Overdispersion test
## 
## data:  abundMod1
## z = 2.26, p-value = 0.01191
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion 
##   2.811833

Here the model is violating the assumptions of residual distribution, equality of variance and dispersion.

GLMs for Overdispersed Poisson Data

There are two main approaches that we can take to deal with over-dispersed count data in GLMs:

Fit a quasi-Poisson GLM
Fit a negative binomial GLM

Quasi-Poisson GLMs

A quasi-Poisson GLM has an extra parameter that represents the dispersion

Fitting a quasi-Poisson GLM is very simple. We just have to specify family=quasipoisson in the glm function:

abundMod2 <- glm(Measurement~LandUse+ForestCover,family=quasipoisson,data=hh.sp)

summary(abundMod1)

## 
## Call:
## glm(formula = Measurement ~ LandUse + ForestCover, family = poisson, 
##     data = hh.sp)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -0.682680   0.163127  -4.185 2.85e-05 ***
## LandUsePasture -2.600734   0.599948  -4.335 1.46e-05 ***
## ForestCover     0.007296   0.002426   3.007  0.00264 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 407.27  on 241  degrees of freedom
## Residual deviance: 324.77  on 239  degrees of freedom
## AIC: 473.6
## 
## Number of Fisher Scoring iterations: 6

summary(abundMod2)

## 
## Call:
## glm(formula = Measurement ~ LandUse + ForestCover, family = quasipoisson, 
##     data = hh.sp)
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)  
## (Intercept)    -0.682680   0.275061  -2.482   0.0138 *
## LandUsePasture -2.600734   1.011622  -2.571   0.0108 *
## ForestCover     0.007296   0.004091   1.783   0.0758 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 2.843216)
## 
##     Null deviance: 407.27  on 241  degrees of freedom
## Residual deviance: 324.77  on 239  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 6

Comparing the summary tables for the two models, you will notice that some of the key statistical parameters are identical, such as the estimates of the coefficients, the residual deviance and the null deviance. In contrast, the estimates of uncertainty around the coefficients are larger, and the P values for the individual coefficients correspondingly higher. Furthermore, we can’t obtain an AIC value for quasi-Poisson models, because these models use quasi-likelihood rather than true likelihood.

If we want to perform model selection on quasi-Poisson GLMs, and we compare two models using analysis of variance, the comparison is based on a quasi F ratio.

# Let's compare our first model with a simpler model not containing forest cover
abundMod2a <- glm(Measurement~LandUse,family=quasipoisson,data=hh.sp)

anova(abundMod2,abundMod2a,test="F")

## Analysis of Deviance Table
## 
## Model 1: Measurement ~ LandUse + ForestCover
## Model 2: Measurement ~ LandUse
##   Resid. Df Resid. Dev Df Deviance      F  Pr(>F)  
## 1       239     324.77                             
## 2       240     334.02 -1  -9.2503 3.2534 0.07253 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

So in the case of the abundance models for Pimpla punicipes, if we had stuck with the Poisson model, we would wrongly have concluded that forest cover has a significant effect. In fact, when we account for over-dispersion, we find that the effect of forest cover is (at least marginally) non-significant.

For now, we will work with the model containing just land use, rather than completing the model-selection exercise.

Unfortunately it is difficult to check the assumptions of quasi-Poisson models, as the R packages available for model checking don’t handle these model types.

We can, though, perform the normal checks of effect size and explanatory power. In the case of our reduced model (with only land use as an explanatory variable), we can ask how strong a reduction there is in the abundance of Pimpla punicipes in pasture compared to primary vegetation (which is the same as for a comparable Poisson model):

intercept <- abundMod2a$coefficients['(Intercept)']

abund.primary <- exp(intercept)

coef.pasture <- abundMod2a$coefficients['LandUsePasture']

abund.pasture <- exp(intercept+coef.pasture)

percent.diff <- ((abund.pasture/abund.primary)*100)-100

percent.diff

## (Intercept) 
##   -94.85169

We can therefore conclude that this species is 95% less abundant in pasture compared to primary vegetation.

Let’s also assess how much of the variation in the species’ abundance is captured by our model of land use:

dev.null <- abundMod2a$null.deviance
dev.resid <- abundMod2a$deviance

dev.explained <- (dev.null-dev.resid)/dev.null

dev.explained

## [1] 0.1798554

Our model explains 18% of the variation in the abundance of Pimpla punicipes, not bad for such a simple model.

Negative Binomial GLMs

An alternative to dealing with over-dispersed count data is to use a negative binomial GLM. For this, you need to use a separate R package:

install.packages("MASS")

Now, instead of using the glm function, we use the glm.nb function from the MASS package:

abundMod3 <- glm.nb(Measurement~LandUse+ForestCover,data=hh.sp)

summary(abundMod3)

## 
## Call:
## glm.nb(formula = Measurement ~ LandUse + ForestCover, data = hh.sp, 
##     init.theta = 0.4320458077, link = log)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -0.720600   0.247348  -2.913  0.00358 ** 
## LandUsePasture -2.562815   0.650728  -3.938  8.2e-05 ***
## ForestCover     0.008049   0.003954   2.036  0.04179 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.432) family taken to be 1)
## 
##     Null deviance: 186.92  on 241  degrees of freedom
## Residual deviance: 140.76  on 239  degrees of freedom
## AIC: 400.13
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.432 
##           Std. Err.:  0.110 
## 
##  2 x log-likelihood:  -392.132

Let’s try removing land use and forest cover in turn from the model, to assess their statistical significance in explaining differences in abundance. In the case of negative binomial GLMs, the comparison of two models is done on the basis of the ratio of their likelihoods (which is the same as the difference in their log likelihoods). This likelihood ratio can be compared to a chi-squared distribution to obtain a P value.

abundMod3a <- glm.nb(Measurement~LandUse,data=hh.sp)
abundMod3b <- glm.nb(Measurement~ForestCover,data=hh.sp)

anova(abundMod3,abundMod3a)

## Likelihood ratio tests of Negative Binomial Models
## 
## Response: Measurement
##                   Model     theta Resid. df    2 x log-lik.   Test    df
## 1               LandUse 0.4026367       240       -395.8646             
## 2 LandUse + ForestCover 0.4320458       239       -392.1322 1 vs 2     1
##   LR stat.    Pr(Chi)
## 1                    
## 2 3.732422 0.05336597

anova(abundMod3,abundMod3b)

## Likelihood ratio tests of Negative Binomial Models
## 
## Response: Measurement
##                   Model     theta Resid. df    2 x log-lik.   Test    df
## 1           ForestCover 0.3353563       240       -414.6416             
## 2 LandUse + ForestCover 0.4320458       239       -392.1322 1 vs 2     1
##   LR stat.      Pr(Chi)
## 1                      
## 2 22.50945 2.091126e-06

Again, we find that forest cover has a (marginally) non-significant effect on the abundance of Pimpla punicipes. It appears that land use has a highly significant effect. If we doing a proper analysis, we would now move on to a second round of model selection to ensure that this is definitely the case, but in the interests of time here we will accept the model with just land use as our most parsimonious model.

abundModFinal <- glm.nb(Measurement~LandUse,data=hh.sp)

summary(abundModFinal)

## 
## Call:
## glm.nb(formula = Measurement ~ LandUse, data = hh.sp, init.theta = 0.4026367275, 
##     link = log)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -0.3169     0.1543  -2.054     0.04 *  
## LandUsePasture  -2.9665     0.6230  -4.761 1.92e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.4026) family taken to be 1)
## 
##     Null deviance: 181.29  on 241  degrees of freedom
## Residual deviance: 140.12  on 240  degrees of freedom
## AIC: 401.86
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.403 
##           Std. Err.:  0.100 
## 
##  2 x log-likelihood:  -395.865

Let’s check that our model is conforming to the assumptions of parametric tests:

simResids3 <- simulateResiduals(abundModFinal)

plot(simResids3)

The model is behaving very well. Let’s look at the effect size for land use again:

intercept <- abundModFinal$coefficients['(Intercept)']

abund.primary <- exp(intercept)

coef.pasture <- abundModFinal$coefficients['LandUsePasture']

abund.pasture <- exp(intercept+coef.pasture)

percent.diff <- ((abund.pasture/abund.primary)*100)-100

percent.diff

## (Intercept) 
##   -94.85169

Reassuringly, our effect size is very similar as for the quasi-Poisson GLM.

Finally, let’s look at the explanatory power of this model:

dev.null <- abundMod2a$null.deviance
dev.resid <- abundMod2a$deviance

dev.explained <- (dev.null-dev.resid)/dev.null

dev.explained

## [1] 0.1798554

Again, very similar to the quasi-Poisson model, which is good.

Binomial GLMs (Logistic Regression)

A binomial GLM is useful when we have a binary response variable. Such a situation is often encountered in ecology. For example, a lot of ecological data describe the presence or absence of species within communities. For this section of the session, we will create a version of the dataset for Pimpla puncipes that we have already been working with, adding a variable on species presence or absence.

# Where abundance is > 0 set species as present, otherwise absent
hh.sp$PresAbs <- ifelse(hh.sp$Measurement>0,1,0)

Fitting a Binomial GLM in R

We will fit a model of the presence/absence of P. punicipes as a function of land use and forest cover. The principle of fitting a binomial GLM is the same as for other GLMs. In this case, we have to specify family=binomial:

occMod1 <- glm(PresAbs~LandUse+ForestCover,data=hh.sp,family="binomial")

occMod1

## 
## Call:  glm(formula = PresAbs ~ LandUse + ForestCover, family = "binomial", 
##     data = hh.sp)
## 
## Coefficients:
##    (Intercept)  LandUsePasture     ForestCover  
##      -1.041891       -3.327557        0.009227  
## 
## Degrees of Freedom: 241 Total (i.e. Null);  239 Residual
## Null Deviance:       266.5 
## Residual Deviance: 216.3     AIC: 222.3

Model Selection

As before, we will test the effect of removing each explanatory variable in turn:

occMod1a <- glm(PresAbs~LandUse,data=hh.sp,family="binomial")
occMod1b <- glm(PresAbs~ForestCover,data=hh.sp,family="binomial")

anova(occMod1,occMod1a,test="Chi")

## Analysis of Deviance Table
## 
## Model 1: PresAbs ~ LandUse + ForestCover
## Model 2: PresAbs ~ LandUse
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
## 1       239     216.28                       
## 2       240     220.89 -1  -4.6127  0.03174 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(occMod1,occMod1b,test="Chi")

## Analysis of Deviance Table
## 
## Model 1: PresAbs ~ LandUse + ForestCover
## Model 2: PresAbs ~ ForestCover
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       239     216.28                          
## 2       240     240.54 -1  -24.264 8.402e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In this case, we find that both land use and forest cover have a significant effect on the occupancy of P. punicipes at the sampled sites in Hawai’i.

occModFinal <- glm(PresAbs~LandUse+ForestCover,data=hh.sp,
                   family="binomial")

Checking Model Assumptions

Let’s check that our final model is conforming to the assumptions of parametric statistical tests:

simResids4 <- simulateResiduals(occModFinal)

plot(simResids4)

Again, all looks good.

Effect Sizes

Let’s now investigate the effect sizes for land use and forest cover. A binomial GLM uses a logit link function: \[g(p)=ln(\frac{p}{1-p})\]

Where \(p\) is the probability that the binary response variable equals one. This means that we need to use an inverse-logit back-transformation to convert our model predictions into a probability that the species is present: \[p=\frac{1}{1+exp^{-\eta}}\]

Where \(\eta\) is the value predicted using the linear predictor.

# Get the model intercept
intercept <- occModFinal$coefficients['(Intercept)']

# First, we will calculate the effect size for land use
# Calculate probability of presence for primary vegetation (i.e., the intercept value)
prob.pres.primary <- 1/(1+exp(-intercept))

# Now calculate probability of presence for pasture

# Get the coefficient expressing the difference in probability of presence in pasture
# compared to primary vegetation
coef.pasture <- occModFinal$coefficients['LandUsePasture']

# Calculate the probability of presence in pasture
prob.pres.pasture <- 1/(1+exp(-(intercept+coef.pasture)))

# Express the probability of presence in pasture as a percentage change compared
# to primary vegetation
percent.diff.pasture <- ((prob.pres.pasture/prob.pres.primary)*100)-100

percent.diff.pasture

## (Intercept) 
##   -95.20679

# Second, the effect size for forest cover
# Calculate probability of presence where forest cover is zero 
# (i.e., the intercept value)
prob.pres.fc0 <- 1/(1+exp(-intercept))

# Get the coefficient describing the slope of the relationship with forest cover
fc.slope <- occModFinal$coefficients['ForestCover']

# Calculate the probability of presence where forest cover is 100%
prob.pres.fc100 <- 1/(1+exp(-(intercept+fc.slope*100)))

# Express the probability of presence where forest cover is 0 as a percent
# difference compared to where forest cover is 100%
percent.diff.fc <- ((prob.pres.fc0/prob.pres.fc100)*100)-100

percent.diff.fc

## (Intercept) 
##   -44.54037

The probability of presence of P. punicipes is therefore 95.2% lower in pasture than in primary vegetation, and 44.5% lower when there is no forest cover compared to when there is 100% forest cover. Both are large effect sizes.

Explanatory Power

Finally, let’s calculate the explanatory power of our model relating P. punicipes presence/absence to land use and forest cover.

dev.null <- occModFinal$null.deviance
dev.resid <- occModFinal$deviance

dev.explained <- (dev.null-dev.resid)/dev.null

dev.explained

## [1] 0.1885569

Our model thus explains 18.9`% of the variation in the presence/absence of P. punicipes, not bad for a very simple model.

Next Time

That’s it for this session on Generalized Linear Models. In the next session, we will look at an alternative to backward stepwise model selection, which avoids the need to somewhat arbitrarily designate terms as significant or not.

References

Gould, Rachelle K., Liba Pejchar, Sara G. Bothwell, Berry Brosi, Stacie Wolny, Chase Mendenhall, and Gretchen Daily. 2013. “Forest Restoration and Parasitoid Wasp Communities in Montane Hawai’i.” PLoS ONE 8: e59356. https://doi.org/10.1371/journal.pone.0059356.

Hudson, Lawrence N., Tim Newbold, Sara Contu, Samantha L. L. Hill, Igor Lysenko, Adriana De Palma, Helen R. P. Phillips, et al. 2017. “The Database of the PREDICTS (Projecting Responses of Ecological Diversity In Changing Terrestrial Systems) Project.” Ecology & Evolution 7: 145–88. https://doi.org/10.1002/ece3.2579.