Overview
In this session, we will cover Generalized Linear Models, which allow us to deal with types of data where we cannot expect a conformation to the standard assumptions of parametric statistical tests.
I will cover here only the practicalities of running Generalized Linear Models in R. If you are interested, this great article gives a very short introduction to the statistics behind GLMs, although it focuses on application in Python rather than R.
This session will require the use of my MResModelling R package, which you can install from Github:
::install_github("timnewbold/MResEcologicalModelling",subdir="MResModelling") remotes
library(MResModelling)
Example Datasets
We will work with two datasets, both derived from the PREDICTS database (Hudson et al. 2017), and originally from a study of parasitoid wasp assemblages in Hawai’i (Gould et al. 2013).
The first dataset describes the abundance of 21 different species of parasitoid wasp at 754 different locations in Hawai’i in three different types of land use: primary vegetation (pristine forest), secondary vegetation (forest recovering after past destruction), and pasture (areas used for livestock grazing).
The second dataset gives site-level metrics of biodiversity (e.g., total species richness, total recorded abundance) at the same sites in Hawai’i. To keep things simple, I have focused this dataset just on the 242 sites in primary vegetation or pasture.
# Load species-level data (loads as object 'hh')
data(HawaiiHymenoptera)
# For simplicity, we will select only those columns that we will use later
# Site_number: the site at which species were sampled
# LandUse: the land use at the sampled site
# Taxon_name_entered: the species sampled
# Measurement: the recorded abundance of the species
<- hh[,c('Site_number','LandUse','Taxon_name_entered','Measurement')]
hh # Make the site number a grouping variable (factor)
$Site_number <- factor(hh$Site_number)
hhstr(hh)
## 'data.frame': 15834 obs. of 4 variables:
## $ Site_number : Factor w/ 754 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LandUse : Factor w/ 3 levels "Primary","Pasture",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Taxon_name_entered: Factor w/ 21 levels "Agasthenes swezeyi",..: 19 3 4 9 17 1 8 7 5 21 ...
## $ Measurement : num 0 0 0 0 0 0 0 0 0 0 ...
# Load site-level data (loads as object 'hhs2')
data(HawaiiHymenopteraSitesSubset)
# Again, we will select the relevant columns
# Site_number: the site at which species were sampled
# Predominant_land_use: the land use at the sampled site
# ForestCover: the percentage forest cover at the sampled site
# Species_richness: the recorded species richness at the site
<- hhs2[,c('Site_number','Predominant_land_use','ForestCover','Species_richness')]
hhs2 # Make the site number a grouping variable (factor)
$Site_number <- factor(hhs2$Site_number)
hhs2str(hhs2)
## 'data.frame': 242 obs. of 4 variables:
## $ Site_number : Factor w/ 242 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Predominant_land_use: Factor w/ 2 levels "Primary vegetation",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ForestCover : int 41 41 41 41 41 0 0 0 0 0 ...
## $ Species_richness : int 1 1 2 2 2 0 1 1 2 2 ...
Poisson GLMs
Fitting a Poisson GLM in R
Count data often conform to a Poisson distribution, and so are commonly encountered in ecology. For example, in the site-level parasitoid wasp data, we have a count of the number of species recorded at each site.
Fitting a Poisson GLM in R is very similar to fitting an analysis of covariance (or linear model), except that now we need to use the glm function. To run a GLM, we need to provide one extra piece of information beyond that needed for a linear model: the family of model we want to use. In this case, we want a Poisson family, family=poisson.
<- glm(Species_richness~Predominant_land_use+ForestCover,family=poisson,
srMod1 data=hhs2)
summary(srMod1)
##
## Call:
## glm(formula = Species_richness ~ Predominant_land_use + ForestCover,
## family = poisson, data = hhs2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2942 -1.0724 -0.4067 0.5067 5.2226
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.967598 0.075202 12.867 < 2e-16 ***
## Predominant_land_usePasture -1.520983 0.165513 -9.190 < 2e-16 ***
## ForestCover -0.003842 0.001380 -2.785 0.00536 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 394.18 on 241 degrees of freedom
## Residual deviance: 280.76 on 239 degrees of freedom
## AIC: 743.46
##
## Number of Fisher Scoring iterations: 6
Incidentally, if you run a GLM but specify a Gaussian family (family=gaussian), the result will be the same as for a linear model.
You will see that the coefficient table produced by a GLM is the same as for a linear model. The intercept tells us the estimated value of the response variable when the continuous explanatory variables (here just forest cover) have a value of 0, and for reference groups in our grouping (categorical) variables (here for primary vegetation). We then also have coefficients describing the slope of the relationship with our continuous explanatory variables, and coefficients giving the estimated difference in the response variable for non-reference groupings. We can see here that species richness appears to show a negative relationship with forest cover, and appears to be lower in pasture than in primary vegetation.
Model Selection
Model selection can be performed in almost exactly the same way for GLMs as for standard linear models. In the case of our species richness model, we have two potential explanatory variables in our model. If we use backward stepwise model selection, we would test the effect of removing each of these variables from the model:
<- glm(Species_richness~Predominant_land_use,family=poisson,data=hhs2)
srMod2 anova(srMod1,srMod2)
## Analysis of Deviance Table
##
## Model 1: Species_richness ~ Predominant_land_use + ForestCover
## Model 2: Species_richness ~ Predominant_land_use
## Resid. Df Resid. Dev Df Deviance
## 1 239 280.76
## 2 240 288.62 -1 -7.8586
<- glm(Species_richness~ForestCover,family=poisson,data=hhs2)
srMod3 anova(srMod1,srMod3)
## Analysis of Deviance Table
##
## Model 1: Species_richness ~ Predominant_land_use + ForestCover
## Model 2: Species_richness ~ ForestCover
## Resid. Df Resid. Dev Df Deviance
## 1 239 280.76
## 2 240 388.03 -1 -107.27
The key difference for GLMs is that we no longer obtain an F ratio when we compare two models using analysis of variance. This is because you cannot calculate sums of squares from a generalized linear model. Instead GLMs are assessed in terms of changes in the deviance explained. Deviance values are calculated based on the (log) likelihoods of three models: the proposed model, a null model (an intercept-only model, as we encountered in the last session), and a saturated model (which has one parameter per data point).
A generalized linear model can then be characterized in terms of two types of deviance:
- The null deviance, which is a measure of the overall variability in the response variable: \[Dev_{null}=2(LL_{saturated} - LL_{null})\]
- The residual deviance, which is a measure of the variability in the response variable that remains unexplained by the proposed model: \[Dev_{residual}=2(LL_{saturated}-LL_{proposed})\]
The comparison of two models using analysis of variance is based on the difference in the residual deviance of the models. This deviance difference can be compared to a chi-squared distribution to obtain a P value, which we can specify as a parameter to the anova function, test=“Chi”:
anova(srMod1,srMod2,test="Chi")
## Analysis of Deviance Table
##
## Model 1: Species_richness ~ Predominant_land_use + ForestCover
## Model 2: Species_richness ~ Predominant_land_use
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 239 280.76
## 2 240 288.62 -1 -7.8586 0.005058 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(srMod1,srMod3,test="Chi")
## Analysis of Deviance Table
##
## Model 1: Species_richness ~ Predominant_land_use + ForestCover
## Model 2: Species_richness ~ ForestCover
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 239 280.76
## 2 240 388.03 -1 -107.27 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This tells us that both land use and forest cover have a significant effect on the species richness of parasitoid wasps in this study system, and that our first (most complex) model is the most parsimonious model. Let’s re-run the final model, just for clarity:
<- glm(Species_richness~Predominant_land_use+ForestCover,
srModFinal family=poisson,data=hhs2)
Effect Sizes
Effect sizes can be calculated in exactly the same way for GLMs as for linear models. But it is important to remember that GLMs use a link function to relate the response variable to the linear predictor (the equation describing the effect of the explanatory variables). Therefore, we have to back-transform our predictions to make them match the original scale of the response variable.
Let’s say we are interested to know the percentage change in species richness in pasture compared to primary vegetation:
# First, we will calculate the species richness in primary vegetation (which as
# the reference land-use grouping is represented by the model intercept), remembering
# to back-transform by taking the exponential
# We will disregard forest cover for now (i.e., assume a forest cover of 0 in
# our predictions). Since we did not fit an interaction term in our model, this
# assumption will not affect our estimated percentage reduction in species richness
<- srModFinal$coefficients['(Intercept)']
intercept <- exp(intercept)
sp.rich.primary sp.rich.primary
## (Intercept)
## 2.631615
# Now we will use the coefficient for the difference in (log) species richness in
# pasture compared to primary vegetation to calculate the species richness in pasture
<- srModFinal$coefficients['Predominant_land_usePasture']
pasture.coef <- exp(pasture.coef+intercept)
sp.rich.pasture sp.rich.pasture
## Predominant_land_usePasture
## 0.575
# Finally, we will calculate the percentage change in species richness in
# pasture compared to primary vegetation
<- ((sp.rich.pasture/sp.rich.primary)*100)-100
pasture.percent.change # Round our final number to 1 decimal place
<- round(pasture.percent.change,1)
pasture.percent.change pasture.percent.change
## Predominant_land_usePasture
## -78.2
Therefore, our model shows that there is an average reduction in parasitoid species richness of 78.2%. This is a very large effect size.
Explanatory Power
When we ran linear models, we used the coefficient of determination, or R2 to assess how much of the variability in our response variable is explained by a given model. R2 is based on the sums of squares of our model, and so cannot be calculated for GLMs. Instead, we can calculate the analogous “deviance explained” by our model: \[Dev_{explained} = \frac{Dev_{null} - Dev_{residual}}{Dev_{null}}\]
For our model of species richness as a function of land use and forest cover:
# Extract the null and residual deviance from the model
<- srModFinal$null.deviance
dev.null <- srModFinal$deviance
dev.resid
# Calculate the deviance explained by the model
<- (dev.null - dev.resid)/dev.null
dev.explained
# Round to 3 decimal places
<- round(dev.explained,3)
dev.explained
dev.explained
## [1] 0.288
Land use and forest cover explain nearly 30% of the variation in parasitoid species richness in this study system. That is a decent explanatory power for a relatively simple model of a complex ecological system (many factors determine the species richness that is sampled in a field survey).
Checking Model Assumptions
It is more tricky to check the assumptions of GLMs than those of linear models, because classical residuals are not expected to behave in the same way for GLMs. I like the DHARMa package in R for working with GLMs, which uses a simulation-based approach to compare the residuals from the actual model with the expectation if the model is behaving normally.
install.packages("DHARMa")
library(DHARMa)
# Simulate residuals
<- simulateResiduals(srModFinal)
simResids
# Generate plots to compare the model residuals to expectations
plot(simResids)
These plots show us that this model is behaving as we would expect in terms of homogeneity of variance and distribution of residuals (i.e., there are no significant deviations from expectations).
For Poisson GLMs, there is one further assumption that we have not encountered before. If data follow a Poisson distribution, then the mean of the distribution is equal to the variance. Accordingly, a Poisson distribution is represented by just one parameter λ, which describes both the mean and the variance of the distribution. We can demonstrate this very simply by drawing some random values from a Poisson distribution:
# Draw 100,000 values from a Poisson distribution with a mean of 5
<- rpois(n = 100000,lambda = 5)
x # Check the mean and variance of this distribution
mean(x); var(x)
## [1] 4.99534
## [1] 4.995748
# Now try a Poisson distribution with a mean of 20
<- rpois(n = 100000,lambda = 20)
y mean(y); var(y)
## [1] 20.01243
## [1] 19.97658
However, count data in ecology are often overdispersed, where the variance is greater than the mean. This violates the assumption of a Poisson GLM, and means that any statistics that we calculate from the model may be unreliable.
We can get an indication of whether a model is over-dispersed by inspecting the model summary. As a rule of thumb, if the response variable conforms to a true Poisson distribution, we expect the residual deviance to be approximately equal to the residual degrees of freedom. If the deviance is much greater than the degrees of freedom, this indicates over-dispersion.
There is also a function in the AER package to test for over-dispersion statistically:
install.packages("AER")
library(AER)
dispersiontest(srModFinal)
##
## Overdispersion test
##
## data: srModFinal
## z = 1.1264, p-value = 0.13
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion
## 1.394782
In this case, the model is not significantly over-dispersed, so our Poisson model seems to be fine.
Now we will try to run a model of the abundance of one of the parasitoid wasp species. We will focus here on the species Pimpla punicipes:
<- droplevels(hh[(hh$Taxon_name_entered=="Pimpla punicipes"),])
hh.sp # Note - the droplevels function here removes groupings that are not retained
# in the data subset
# The estimates of forest cover were not included in the species-level dataset,
# so we will match these from the site-level dataset
$ForestCover <- hhs2$ForestCover[match(hh.sp$Site_number,hhs2$Site_number)]
hh.sp
# Remove sites that have missing (NA) forest-cover estimates
<- droplevels(hh.sp[!is.na(hh.sp$ForestCover),]) hh.sp
We will model the abundance of this species as a function of land use and forest cover again. Since the abundances are also counts, we will assume a Poisson distribution for now.
<- glm(Measurement~LandUse+ForestCover,family=poisson,data=hh.sp)
abundMod1
abundMod1
##
## Call: glm(formula = Measurement ~ LandUse + ForestCover, family = poisson,
## data = hh.sp)
##
## Coefficients:
## (Intercept) LandUsePasture ForestCover
## -0.682680 -2.600734 0.007296
##
## Degrees of Freedom: 241 Total (i.e. Null); 239 Residual
## Null Deviance: 407.3
## Residual Deviance: 324.8 AIC: 473.6
Let’s check the assumptions of this model:
<- simulateResiduals(abundMod1)
simResids2
plot(simResids2)
dispersiontest(abundMod1)
##
## Overdispersion test
##
## data: abundMod1
## z = 2.26, p-value = 0.01191
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion
## 2.811833
Here the model is violating the assumptions of residual distribution, equality of variance and dispersion.
GLMs for Overdispersed Poisson Data
There are two main approaches that we can take to deal with over-dispersed count data in GLMs:
- Fit a quasi-Poisson GLM
- Fit a negative binomial GLM
Quasi-Poisson GLMs
A quasi-Poisson GLM has an extra parameter that represents the dispersion
Fitting a quasi-Poisson GLM is very simple. We just have to specify family=quasipoisson in the glm function:
<- glm(Measurement~LandUse+ForestCover,family=quasipoisson,data=hh.sp)
abundMod2
summary(abundMod1)
##
## Call:
## glm(formula = Measurement ~ LandUse + ForestCover, family = poisson,
## data = hh.sp)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4372 -1.0813 -0.2739 -0.2739 5.7961
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.682680 0.163127 -4.185 2.85e-05 ***
## LandUsePasture -2.600734 0.599948 -4.335 1.46e-05 ***
## ForestCover 0.007296 0.002426 3.007 0.00264 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 407.27 on 241 degrees of freedom
## Residual deviance: 324.77 on 239 degrees of freedom
## AIC: 473.6
##
## Number of Fisher Scoring iterations: 6
summary(abundMod2)
##
## Call:
## glm(formula = Measurement ~ LandUse + ForestCover, family = quasipoisson,
## data = hh.sp)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4372 -1.0813 -0.2739 -0.2739 5.7961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.682680 0.275061 -2.482 0.0138 *
## LandUsePasture -2.600734 1.011622 -2.571 0.0108 *
## ForestCover 0.007296 0.004091 1.783 0.0758 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 2.843216)
##
## Null deviance: 407.27 on 241 degrees of freedom
## Residual deviance: 324.77 on 239 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 6
Comparing the summary tables for the two models, you will notice that some of the key statistical parameters are identical, such as the estimates of the coefficients, the residual deviance and the null deviance. In contrast, the estimates of uncertainty around the coefficients are larger, and the P values for the individual coefficients correspondingly higher. Furthermore, we can’t obtain an AIC value for quasi-Poisson models, because these models use quasi-likelihood rather than true likelihood.
If we want to perform model selection on quasi-Poisson GLMs, and we compare two models using analysis of variance, the comparison is based on a quasi F ratio.
# Let's compare our first model with a simpler model not containing forest cover
<- glm(Measurement~LandUse,family=quasipoisson,data=hh.sp)
abundMod2a
anova(abundMod2,abundMod2a,test="F")
## Analysis of Deviance Table
##
## Model 1: Measurement ~ LandUse + ForestCover
## Model 2: Measurement ~ LandUse
## Resid. Df Resid. Dev Df Deviance F Pr(>F)
## 1 239 324.77
## 2 240 334.02 -1 -9.2503 3.2534 0.07253 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
So in the case of the abundance models for Pimpla punicipes, if we had stuck with the Poisson model, we would wrongly have concluded that forest cover has a significant effect. In fact, when we account for over-dispersion, we find that the effect of forest cover is (at least marginally) non-significant.
For now, we will work with the model containing just land use, rather than completing the model-selection exercise.
Unfortunately it is difficult to check the assumptions of quasi-Poisson models, as the R packages available for model checking don’t handle these model types.
We can, though, perform the normal checks of effect size and explanatory power. In the case of our reduced model (with only land use as an explanatory variable), we can ask how strong a reduction there is in the abundance of Pimpla punicipes in pasture compared to primary vegetation (which is the same as for a comparable Poisson model):
<- abundMod2a$coefficients['(Intercept)']
intercept
<- exp(intercept)
abund.primary
<- abundMod2a$coefficients['LandUsePasture']
coef.pasture
<- exp(intercept+coef.pasture)
abund.pasture
<- ((abund.pasture/abund.primary)*100)-100
percent.diff
percent.diff
## (Intercept)
## -94.85169
We can therefore conclude that this species is 95% less abundant in pasture compared to primary vegetation.
Let’s also assess how much of the variation in the species’ abundance is captured by our model of land use:
<- abundMod2a$null.deviance
dev.null <- abundMod2a$deviance
dev.resid
<- (dev.null-dev.resid)/dev.null
dev.explained
dev.explained
## [1] 0.1798554
Our model explains 18% of the variation in the abundance of Pimpla punicipes, not bad for such a simple model.
Negative Binomial GLMs
An alternative to dealing with over-dispersed count data is to use a negative binomial GLM. For this, you need to use a separate R package:
install.packages("MASS")
Now, instead of using the glm function, we use the glm.nb function from the MASS package:
<- glm.nb(Measurement~LandUse+ForestCover,data=hh.sp)
abundMod3
summary(abundMod3)
##
## Call:
## glm.nb(formula = Measurement ~ LandUse + ForestCover, data = hh.sp,
## init.theta = 0.4320458077, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0378 -0.8533 -0.2682 -0.2682 3.5551
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.720600 0.247348 -2.913 0.00358 **
## LandUsePasture -2.562815 0.650728 -3.938 8.2e-05 ***
## ForestCover 0.008049 0.003954 2.036 0.04179 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(0.432) family taken to be 1)
##
## Null deviance: 186.92 on 241 degrees of freedom
## Residual deviance: 140.76 on 239 degrees of freedom
## AIC: 400.13
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 0.432
## Std. Err.: 0.110
##
## 2 x log-likelihood: -392.132
Let’s try removing land use and forest cover in turn from the model, to assess their statistical significance in explaining differences in abundance. In the case of negative binomial GLMs, the comparison of two models is done on the basis of the ratio of their likelihoods (which is the same as the difference in their log likelihoods). This likelihood ratio can be compared to a chi-squared distribution to obtain a P value.
<- glm.nb(Measurement~LandUse,data=hh.sp)
abundMod3a <- glm.nb(Measurement~ForestCover,data=hh.sp)
abundMod3b
anova(abundMod3,abundMod3a)
## Likelihood ratio tests of Negative Binomial Models
##
## Response: Measurement
## Model theta Resid. df 2 x log-lik. Test df
## 1 LandUse 0.4026367 240 -395.8646
## 2 LandUse + ForestCover 0.4320458 239 -392.1322 1 vs 2 1
## LR stat. Pr(Chi)
## 1
## 2 3.732422 0.05336597
anova(abundMod3,abundMod3b)
## Likelihood ratio tests of Negative Binomial Models
##
## Response: Measurement
## Model theta Resid. df 2 x log-lik. Test df
## 1 ForestCover 0.3353563 240 -414.6416
## 2 LandUse + ForestCover 0.4320458 239 -392.1322 1 vs 2 1
## LR stat. Pr(Chi)
## 1
## 2 22.50945 2.091126e-06
Again, we find that forest cover has a (marginally) non-significant effect on the abundance of Pimpla punicipes. It appears that land use has a highly significant effect. If we doing a proper analysis, we would now move on to a second round of model selection to ensure that this is definitely the case, but in the interests of time here we will accept the model with just land use as our most parsimonious model.
<- glm.nb(Measurement~LandUse,data=hh.sp)
abundModFinal
summary(abundModFinal)
##
## Call:
## glm.nb(formula = Measurement ~ LandUse, data = hh.sp, init.theta = 0.4026367275,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9120 -0.9120 -0.2678 -0.2678 3.5177
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.3169 0.1543 -2.054 0.04 *
## LandUsePasture -2.9665 0.6230 -4.761 1.92e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(0.4026) family taken to be 1)
##
## Null deviance: 181.29 on 241 degrees of freedom
## Residual deviance: 140.12 on 240 degrees of freedom
## AIC: 401.86
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 0.403
## Std. Err.: 0.100
##
## 2 x log-likelihood: -395.865
Let’s check that our model is conforming to the assumptions of parametric tests:
<- simulateResiduals(abundModFinal)
simResids3
plot(simResids3)
The model is behaving very well. Let’s look at the effect size for land use again:
<- abundModFinal$coefficients['(Intercept)']
intercept
<- exp(intercept)
abund.primary
<- abundModFinal$coefficients['LandUsePasture']
coef.pasture
<- exp(intercept+coef.pasture)
abund.pasture
<- ((abund.pasture/abund.primary)*100)-100
percent.diff
percent.diff
## (Intercept)
## -94.85169
Reassuringly, our effect size is very similar as for the quasi-Poisson GLM.
Finally, let’s look at the explanatory power of this model:
<- abundMod2a$null.deviance
dev.null <- abundMod2a$deviance
dev.resid
<- (dev.null-dev.resid)/dev.null
dev.explained
dev.explained
## [1] 0.1798554
Again, very similar to the quasi-Poisson model, which is good.
Binomial GLMs (Logistic Regression)
A binomial GLM is useful when we have a binary response variable. Such a situation is often encountered in ecology. For example, a lot of ecological data describe the presence or absence of species within communities. For this section of the session, we will create a version of the dataset for Pimpla puncipes that we have already been working with, adding a variable on species presence or absence.
# Where abundance is > 0 set species as present, otherwise absent
$PresAbs <- ifelse(hh.sp$Measurement>0,1,0) hh.sp
Fitting a Binomial GLM in R
We will fit a model of the presence/absence of P. punicipes as a function of land use and forest cover. The principle of fitting a binomial GLM is the same as for other GLMs. In this case, we have to specify family=binomial:
<- glm(PresAbs~LandUse+ForestCover,data=hh.sp,family="binomial")
occMod1
occMod1
##
## Call: glm(formula = PresAbs ~ LandUse + ForestCover, family = "binomial",
## data = hh.sp)
##
## Coefficients:
## (Intercept) LandUsePasture ForestCover
## -1.041891 -3.327557 0.009227
##
## Degrees of Freedom: 241 Total (i.e. Null); 239 Residual
## Null Deviance: 266.5
## Residual Deviance: 216.3 AIC: 222.3
Model Selection
As before, we will test the effect of removing each explanatory variable in turn:
<- glm(PresAbs~LandUse,data=hh.sp,family="binomial")
occMod1a <- glm(PresAbs~ForestCover,data=hh.sp,family="binomial")
occMod1b
anova(occMod1,occMod1a,test="Chi")
## Analysis of Deviance Table
##
## Model 1: PresAbs ~ LandUse + ForestCover
## Model 2: PresAbs ~ LandUse
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 239 216.28
## 2 240 220.89 -1 -4.6127 0.03174 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(occMod1,occMod1b,test="Chi")
## Analysis of Deviance Table
##
## Model 1: PresAbs ~ LandUse + ForestCover
## Model 2: PresAbs ~ ForestCover
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 239 216.28
## 2 240 240.54 -1 -24.264 8.402e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this case, we find that both land use and forest cover have a significant effect on the occupancy of P. punicipes at the sampled sites in Hawai’i.
<- glm(PresAbs~LandUse+ForestCover,data=hh.sp,
occModFinal family="binomial")
Checking Model Assumptions
Let’s check that our final model is conforming to the assumptions of parametric statistical tests:
<- simulateResiduals(occModFinal)
simResids4
plot(simResids4)
Again, all looks good.
Effect Sizes
Let’s now investigate the effect sizes for land use and forest cover. A binomial GLM uses a logit link function: \[g(p)=ln(\frac{p}{1-p})\]
Where \(p\) is the probability that the binary response variable equals one. This means that we need to use an inverse-logit back-transformation to convert our model predictions into a probability that the species is present: \[p=\frac{1}{1+exp^{-\eta}}\]
Where \(\eta\) is the value predicted using the linear predictor.
# Get the model intercept
<- occModFinal$coefficients['(Intercept)']
intercept
# First, we will calculate the effect size for land use
# Calculate probability of presence for primary vegetation (i.e., the intercept value)
<- 1/(1+exp(-intercept))
prob.pres.primary
# Now calculate probability of presence for pasture
# Get the coefficient expressing the difference in probability of presence in pasture
# compared to primary vegetation
<- occModFinal$coefficients['LandUsePasture']
coef.pasture
# Calculate the probability of presence in pasture
<- 1/(1+exp(-(intercept+coef.pasture)))
prob.pres.pasture
# Express the probability of presence in pasture as a percentage change compared
# to primary vegetation
<- ((prob.pres.pasture/prob.pres.primary)*100)-100
percent.diff.pasture
percent.diff.pasture
## (Intercept)
## -95.20679
# Second, the effect size for forest cover
# Calculate probability of presence where forest cover is zero
# (i.e., the intercept value)
<- 1/(1+exp(-intercept))
prob.pres.fc0
# Get the coefficient describing the slope of the relationship with forest cover
<- occModFinal$coefficients['ForestCover']
fc.slope
# Calculate the probability of presence where forest cover is 100%
<- 1/(1+exp(-(intercept+fc.slope*100)))
prob.pres.fc100
# Express the probability of presence where forest cover is 0 as a percent
# difference compared to where forest cover is 100%
<- ((prob.pres.fc0/prob.pres.fc100)*100)-100
percent.diff.fc
percent.diff.fc
## (Intercept)
## -44.54037
The probability of presence of P. punicipes is therefore 95.2% lower in pasture than in primary vegetation, and 44.5% lower when there is no forest cover compared to when there is 100% forest cover. Both are large effect sizes.
Explanatory Power
Finally, let’s calculate the explanatory power of our model relating P. punicipes presence/absence to land use and forest cover.
<- occModFinal$null.deviance
dev.null <- occModFinal$deviance
dev.resid
<- (dev.null-dev.resid)/dev.null
dev.explained
dev.explained
## [1] 0.1885569
Our model thus explains 18.9`% of the variation in the presence/absence of P. punicipes, not bad for a very simple model.
Next Time
That’s it for this session on Generalized Linear Models. In the next session, we will look at an alternative to backward stepwise model selection, which avoids the need to somewhat arbitrarily designate terms as significant or not.