Analysis of Variance

Tim Newbold

2021-10-15

Overview

This session will cover Analysis of Variance (ANOVA). I will start by introducing the manual calculation of an ANOVA, and then show you the function you can use to calculate ANOVA automatically in R.

An ANOVA statistical test allows us to ask whether a measured response variable differs among several treatment groups in an experiment (or observational groups).

If you are not familiar with basic statistical concepts (hypotheses, null hypotheses, P values, degrees of freedom etc.), I recommend taking a look at these videos.

In this session, we will work with the cars dataset that we have used previously:

data(mtcars)

Specifically, we will use ANOVA to test whether the power of cars varies according to how many cylinders they have. Let’s start by plotting a boxplot of this relationship:

library(ggplot2)
ggplot(data = mtcars, mapping = aes(x = as.factor(cyl), y = hp)) +
  geom_boxplot() +
  theme_classic()

It certainly looks as though cars with more cylinders tend to be more powerful, but we will use an ANOVA to test whether this is a statistically significant relationship.

Manual Calculation of ANOVA

Basic principles of ANOVA

To run an ANOVA by hand, we first need to calculate the F ratio. We can then compare this value to the F distribution to obtain a P value (i.e., the probability of obtaining the observed differences among groups by chance, given the null hypothesis that there are actually no differences among the groups). The F ratio is the ratio of the mean of squares between groups to the mean of squares within groups: \[F = \frac{MS_{between}}{MS_{within}}\]

The mean of squares is the sum of squares divided by the degrees of freedom: \[MS_{between} = \frac{SS_{between}}{DF_{between}}\] \[MS_{within} = \frac{SS_{within}}{DF_{within}}\]

For a dataset with \(N\) total samples and \(K\) treatment groups, the degrees of freedom are as follows: \[DF_{between} = K - 1\] \[DF_{within} = N - K\] \[DF_{total} = N - 1\]

The sums of squares are sums of squared deviations.

  1. The total sum of squares is the sum (across each group \(k\), and each data value \(i\) within each group) of the squared differences between each recorded data value, \(Y_{ik}\), and the grand mean of data values in the whole dataset, \(\overline{Y}\): \[SS_{total} = \sum_{k=1}^{K}{\sum_{i=1}^{I}{(Y_{ik} - \overline{Y})^2}}\]

  2. The sum of squares between groups is the sum (again, across each group \(k\), and each data value \(i\) within each group) of squared differences between the mean of data values within each group \(k\), \(\overline{Y_k}\), and the grand mean of data values in the whole dataset, \(\overline{Y}\): \[SS_{between} = \sum_{k=1}^{K}\sum_{i=1}^{I}{(\overline{Y_k} - \overline{Y})^2}\]

  3. The sum of squares within groups is the sum (yet again, across each group \(k\), and each data value \(i\) within each group) of squared differences between each data value, \(Y_{ik}\), and the mean of data values across the group to which that data value belongs, \(\overline{Y_k}\): \[SS_{within} = \sum_{k=1}^{K}\sum_{i=1}^{I}{(Y_{ik} - \overline{Y_k})^2}\]

Manual ANOVA on Cars Dataset

Let’s put this into practice to ask whether there is a significant difference in the power of cars depending on how many cylinders they have.

First, we will calculate the grand mean of all values in the dataset:

grand.mean <- mean(mtcars$hp)
grand.mean
## [1] 146.6875

Next, we will use the tapply function to calculate the means of values within each group:

TIP: The tapply function allows us to perform some operation FUN on a variable X grouped by another variable INDEX.

group.means <- tapply(X = mtcars$hp,INDEX = mtcars$cyl,FUN = mean)
group.means
##         4         6         8 
##  82.63636 122.28571 209.21429

To help with our later calculations, we will now add these group means as a new column in the original data table. We can do this using the match function, matching based on the number of cylinders (don’t worry if it is not yet clear to you how this function works):

mtcars$group.means <- group.means[match(mtcars$cyl,names(group.means))]
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
##                   group.means
## Mazda RX4           122.28571
## Mazda RX4 Wag       122.28571
## Datsun 710           82.63636
## Hornet 4 Drive      122.28571
## Hornet Sportabout   209.21429
## Valiant             122.28571

TIP: If you want to read more about the match function use help(match)
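
To give a flavour of how it works, match returns, for each value in its first argument, the position of that value in its second argument. Here is a toy example (with made-up, cylinder-like numbers):

# 6 is the 2nd element of the lookup vector, 4 the 1st, and 8 the 3rd
match(c(6, 4, 8, 6), c(4, 6, 8))
## [1] 2 1 3 2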

Now, let’s calculate the total sum of squares (the sum of the squared differences between each data value and the grand mean value across the whole dataset). To do so, we will first create a column to hold the individual squared differences:

mtcars$SquDevTotal <- (mtcars$hp - grand.mean)^2
sst <- sum(mtcars$SquDevTotal)
sst
## [1] 145726.9

Next, we will calculate the sum of squares between groups (the sum of the squared differences between the group means and the grand mean):

mtcars$SquDevBetween <- (mtcars$group.means - grand.mean)^2
ssb <- sum(mtcars$SquDevBetween)
ssb
## [1] 104030.5

And then the sum of squares within groups (the sum of the squared differences between each data value and its respective group mean):

mtcars$SquDevWithin <- (mtcars$hp - mtcars$group.means)^2
ssw <- sum(mtcars$SquDevWithin)
ssw
## [1] 41696.33

We now need to calculate the degrees of freedom (total, between groups and within groups). First, we will derive the \(N\) and \(K\) parameters (the total size of the dataset and the number of groups, respectively):

N <- nrow(mtcars)
K <- length(unique(mtcars$cyl))
df.total <- N - 1
df.between <- K - 1
df.within <- N - K
df.total
## [1] 31
df.between
## [1] 2
df.within
## [1] 29

As a check that we did our calculations correctly, let’s make sure that \(SS_{total} = SS_{between}+SS_{within}\) and that \(DF_{total} = DF_{between} + DF_{within}\):

sst == ssb + ssw
## [1] TRUE
df.total == df.between + df.within
## [1] TRUE

We can then calculate the mean of squares between and within groups (i.e., the sum of squares divided by the degrees of freedom):

msb <- ssb/df.between
msw <- ssw/df.within
msb
## [1] 52015.27
msw
## [1] 1437.805

Now we can calculate our F ratio:

f <- msb/msw
f
## [1] 36.17687

Finally, we can use the F ratio and the degrees of freedom (between and within groups) to calculate the P value. We can do this using the pf function, which calculates the probability of obtaining a given F ratio if our null hypothesis is true (i.e., if the data are drawn from a single distribution, with no difference in values among groups). We have to specify lower.tail = FALSE to obtain the P value from the correct end of the distribution:

P <- pf(q = f,df1 = df.between,df2 = df.within,lower.tail = FALSE)
P
## [1] 1.318541e-08
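
Optionally, we can gather these pieces into a small summary table of our own. This is just a minimal sketch, but it mirrors the layout of the table that R's aov function will produce in the next section:

# Collect the manual results into an ANOVA-style table
# (the F ratio and P value only apply to the between-groups row)
manual.anova <- data.frame(Df = c(df.between, df.within),
                           Sum.Sq = c(ssb, ssw),
                           Mean.Sq = c(msb, msw),
                           F.value = c(f, NA),
                           P.value = c(P, NA),
                           row.names = c("cyl", "Residuals"))
manual.anova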

Running ANOVA in R

Basic Operation

As you have seen, manually calculating an ANOVA is rather laborious. Luckily, R has a function, aov, that can run an ANOVA very quickly.

First, we need to convert the variable giving the number of cylinders to a factor, so that R treats the groups correctly:

mtcars$cyl <- factor(mtcars$cyl)
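
We can quickly confirm the conversion by inspecting the factor levels (the first level will become the reference group when we interpret the model coefficients below):

# The levels are sorted, so four-cylinder cars form the first (reference) level
levels(mtcars$cyl)
## [1] "4" "6" "8"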

Now we can run the ANOVA, specifying that we want to model the horsepower of cars as a function of (~) the number of cylinders:

a1 <- aov(mtcars$hp~mtcars$cyl)
summary(a1)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## mtcars$cyl   2 104031   52015   36.18 1.32e-08 ***
## Residuals   29  41696    1438                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Reassuringly, this gives us the same answer as before, but with a lot less work!

Interpreting Model Coefficients

Once we have run an analysis of variance in R, we can also inspect the coefficients produced by the model. In the case of analysis of variance, this just tells us the group means, but understanding how coefficient tables are structured in models created in R will be very important as we move on to more complex models later.

a1$coefficients
## (Intercept) mtcars$cyl6 mtcars$cyl8 
##    82.63636    39.64935   126.57792

Basically, you get an ‘intercept’ coefficient, which for an ANOVA gives you the group mean for the reference group (in this case, cars with four cylinders), and you get coefficients describing how the average value differs for each of the other groups. To obtain the group mean for a non-reference group, you simply add the intercept to the coefficient for the group in question (you can check that this is the same value we calculated earlier):

Interpreting the coefficient table in an ANOVA
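
For example, adding the intercept to the coefficient for six-cylinder cars recovers the group mean that we calculated with tapply earlier:

# Group mean for 6-cylinder cars = intercept + cyl6 coefficient
unname(a1$coefficients["(Intercept)"] + a1$coefficients["mtcars$cyl6"])
## [1] 122.2857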

Checking Model Assumptions

ANOVAs are parametric statistical tests, and as such have to meet the four assumptions of this group of tests:

  1. equality of variance
  2. a normal distribution of residuals
  3. a linear relationship between variables, and
  4. independence of individual data points

We will explore the first two of these assumptions here.

First, the distribution of residuals. Model residuals are the differences between the observed and predicted values in a model. In an ANOVA, the residuals are the differences between each data value and the group mean value (i.e., the differences that form the basis of the sum of squares within groups). We can investigate the distribution of model residuals using either a Q-Q plot (generated using the qqnorm and qqline functions), or using a simple histogram. In the first plot, if the residuals are normally distributed, we expect to see the points lining up along the diagonal line:

# The qqnorm function creates the basic Q-Q plot, while qqline adds the diagonal line
# along which we expect points to fall if there is a normal distribution of residuals
qqnorm(residuals(a1))
qqline(residuals(a1))

hist(residuals(a1))

Both of these plots suggest that there is some skew in the model residuals.
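
If you would like a more formal check of normality (optional; we will rely on the plots in this session), base R also provides the Shapiro-Wilk test via the shapiro.test function:

# Tests the null hypothesis that the residuals are drawn from a normal
# distribution; a small P value indicates significant departure from normality
shapiro.test(residuals(a1))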

Next, we will test for equality of variance; in this case, we are interested in equality of variances among groups. Replotting the original boxplot of power against number of cylinders already suggests that variance in horsepower is higher for cars with 8 cylinders:

boxplot(mtcars$hp~mtcars$cyl)

Most important for the model assumptions, though, is whether the model residuals show equality of variance among groups. A boxplot of the residuals against the number of cylinders shows that this assumption is clearly violated, since there is a much greater spread of residuals for cars with 8 cylinders:

boxplot(residuals(a1)~mtcars$cyl)
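
To put numbers on this, we can reuse tapply, this time to calculate the variance of the residuals within each group; a large disparity among these values reflects what the boxplot shows:

# Variance of the model residuals within each cylinder group
tapply(X = residuals(a1), INDEX = mtcars$cyl, FUN = var)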

One simple thing to check, if the assumptions of parametric tests are violated, is whether our response variable appears to be drawn from a normal distribution:

hist(mtcars$hp)

The values are clearly right-skewed. To attempt to deal with this issue, let’s try log-transforming the values of this variable:

mtcars$LogHP <- log(mtcars$hp)
hist(mtcars$LogHP)

The distribution of values certainly now looks much better. If we replot our original boxplot, this time with log-transformed values, it looks as though variances are more equal among groups:

boxplot(mtcars$LogHP~mtcars$cyl)

So, let’s try re-running our Analysis of Variance, this time with the log-transformed values of horsepower:

a2 <- aov(mtcars$LogHP~mtcars$cyl)
summary(a2)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## mtcars$cyl   2  5.449  2.7245    50.9 3.27e-10 ***
## Residuals   29  1.552  0.0535                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We still have a highly significant difference in power among groups of cars with different numbers of cylinders. Let’s check the model assumptions again:

qqnorm(residuals(a2))
qqline(residuals(a2))

hist(residuals(a2))

boxplot(residuals(a2)~mtcars$cyl)

The model seems to be behaving much better now. We would probably be happy with this model.

Next Time

That’s it for this session. The next session covers linear regression, where we test for an effect of a continuous (rather than grouping) variable on our response variable.