Working with Spatial Data in R

Tim Newbold

2025-08-22

Overview

This session will cover the basics of working with spatial data in R. We will work mostly with the terra package, which is very versatile.

library(terra)

## terra 1.8.54

We will also use a number of other packages:

library(sf)
library(dplyr)
library(ggplot2)
library(tidyterra)
library(osmdata)
library(rnaturalearth)
library(geodata)

If you don’t have any of the aforementioned packages already, you can install them using the install.packages function.

Types of Spatial Data

There are two major classes of spatial data that you will encounter: vector data, which consist of points, lines or polygons, or raster data, which describe values across a grid (typically with square grid cells, but sometimes hexagonal).

Vector Data

There are lots of ways to load vector data in R. Ultimately, it is useful to convert vector data into the SpatVector data class used in the terra package, as this package provides lots of functionality for working with and combining different types of spatial data. You can create a SpatVector object, or convert another spatial data type to this format, using the vect function in terra

There are three main ways that we can load spatial data (and indeed all types of data) in R. First, there may be packages that allow us to read data automatically from a web location. Second, we can read data from a local file stored on your computer. Third, we can create a spatial data object from within R. We will cover all of these three methods, to some extent, during the course of this tutorial.

Lines

Lines are a type of vector data describing linear features. For example, in ecology we might consider data on roads or hedgerows.

We will use the osmdata package to query road data from Open Street Map. We will here obtain data on major roads (motorways, primary and secondary roads) in a region of New Zealand. This package reads in roads data in a bespoke format. We will first convert the data to a ‘simple features’ format using the osmdata_sf function. Then, we can convert to a terra package SpatVector format using the vect function.

# Define bounding box for roads query
opq(bbox = c(171.5,-43.6,172.9,-42.3)) %>%
  # Obtain data for major highways (motorways, primary and secondary roads)
  add_osm_feature(key = "highway", 
                  value = c("motorway", "primary", "secondary")) %>%
  # Convert to simple features class
  osmdata_sf() -> 
  nz_major

# Convert to terra SpatVector object (the osm_lines part of the original object
# contains the actual roads data we are interested in)
nz_major$osm_lines %>% vect() -> nz_major_sv

We will plot the roads map using ggplot and the geom_spatvector function from the tidyterra package. You can see from the code that we plot spatial data in a very similar way to how we would produce other graphs using ggplot. Here, the scale_x_continuous and scale_y_continuous commands, which would normally allow you to adjust the span of your x and y axes in a normal graph, can be used to set the spatial limits of the plotted data.

p <- ggplot() + 
  geom_spatvector(data = nz_major_sv) + 
  scale_x_continuous(name = "",limits = c(171.5,172.9)) + 
  scale_y_continuous(name = "",limits = c(-43.6,-42.3)) + 
  theme_classic()

p

Each line in a spatial vector lines dataset is described by a series of coordinates that form the vertices of each line. We can inspect these using the geom function in terra:

# Show the geometry for the first road in the dataset
head(geom(nz_major_sv[1,]))

##      geom part        x         y hole
## [1,]    1    1 172.5445 -43.48979    0
## [2,]    1    1 172.5445 -43.48978    0
## [3,]    1    1 172.5444 -43.48978    0
## [4,]    1    1 172.5444 -43.48977    0
## [5,]    1    1 172.5444 -43.48975    0
## [6,]    1    1 172.5443 -43.48973    0

The first six rows here all refer to the same geometry, (geom = 1), because the coordinates of these rows are all vertices of the first road in the dataset. In total, there are 3726 different roads (geometries) in this dataset.

Polygons

Polygon data are very similar to vector line data, but the lines connect at the ends to form polygons.

Polygons are often used to describe the boundaries of regions. In ecological studies, we might be interested in administrative regions, such as countries, or ecological regions, such as biogeographic realms, biomes or ecoregions, among many others.

Here, we will use the rnaturalearth package to read in country borders, here the border of New Zealand. This package has the option to return the data directly as a terra package ‘SpatVector’ object: returnClass=‘sv’.

# Obtain polygon data describing the border of New Zealand
ne_countries(scale = 50,returnclass = 'sv',country = "New Zealand") -> nz

We can plot this polygon map using the same approach we used for the roads data.

p <- ggplot() + 
  geom_spatvector(data = nz,fill="#5DA89955",col="#00000000") + 
  # Limit the span of the plot because one part of New Zealand (the Chatham
  # Islands) falls in the Western Hemisphere
  scale_x_continuous(name = "",limits = c(165,180)) + 
  scale_y_continuous(name = "",limits = c(-50,-30)) + 
  theme_classic()

p

We can then combine the plots for the polygons (New Zealand border) and lines (major roads). We will zoom in here to focus on the specific region for which we obtained the roads data (getting the data for all roads in New Zeleand would be rather time-consuming).

p <- ggplot() + 
  geom_spatvector(data = nz,fill="#5DA89955",col="#00000000") + 
  geom_spatvector(data = nz_major_sv,col="#2E2585") +
  # Limit the span of the plot because one part of New Zealand (the Chatham
  # Islands) falls in the Western Hemisphere
  scale_x_continuous(name = "",limits = c(171.5,172.9)) + 
  scale_y_continuous(name = "",limits = c(-43.6,-42.3)) + 
  theme_classic()

p

As with line data, polygons data are described a series of coordinates forming the vertices of each polygon, which we can see using the geom function again.

head(geom(nz))

##      geom part        x         y hole
## [1,]    1    1 173.1153 -41.27930    0
## [2,]    1    1 173.2309 -41.28418    0
## [3,]    1    1 173.3378 -41.21094    0
## [4,]    1    1 173.4473 -41.15137    0
## [5,]    1    1 173.5625 -41.10205    0
## [6,]    1    1 173.7379 -40.98896    0

Here, all rows belong to the same geometry, (geom = 1) (New Zealand), but the data are separated into 13 different polygons (parts). Here, the first six rows that we see are all vertices for the first polygon (‘part’). With polygon data, the final column, hole tells us whether a vertex belongs to an inner hole rather than the outer boundary (although there aren’t any holes in the New Zealand map).

Points

The final type of vector data you will encounter describe single points in space.

To illustrate point data, we will use a dataset from within the PREDICTS database, which we will explore in more detail in a later session.

For this, you will need my predictsFunctions package for working with the PREDICTS database. You can install this by running: remotes::install_github(“timnewbold/predicts-demo”,subdir=“predictsFunctions”)

library(predictsFunctions)

For now, don’t worry about the details of this code. It just obtains the PREDICTS database, selects data from a single study on beetles in New Zealand, and calculates species richness for the locations sampled.

url("https://www.dropbox.com/scl/fi/0mqoacviqiiurrtplby2m/predicts.rds?rlkey=m5ijge7w1dthkvm5gbp652uel&dl=1") %>% readRDS() %>% droplevels() -> predicts_2016
url("https://www.dropbox.com/scl/fi/3dqv4ptkqkuswxx350xbx/predicts_2022.rds?rlkey=20g6v43l0vjdh3k8e7p533hri&dl=1") %>% readRDS() %>% droplevels() -> predicts_2022
full_join(predicts_2016,predicts_2022) -> predicts

ewers2007 <- droplevels(predicts[(predicts$SS=="CC1_2007__Ewers 1"),])

ewers2007 <- ewers2007 %>% CorrectSamplingEffort()

## Correcting 0 missing sampling effort values
## Rescaling sampling effort
## Correcting 208069 values for sensitivity to sampling effort

ewers2007 <- ewers2007 %>% MergeSites(silent=TRUE)

ewers2007 %>%
  droplevels() %>%
  SiteMetrics(extra.cols = c("Longitude","Latitude",
                             "Predominant_land_use",
                             "SSB","SSBS","Study_common_taxon","Country"),
              srEstimators = NULL) %>%
  mutate(LandUse = Predominant_land_use) %>% droplevels() -> sites

## Computing site metrics for 208069 measurements
## The data contain 1 sources, 1 studies and 233 sites
## Computing site-level values
## Computing total abundance
## Computing species richness
## Computing Simpson's diversity
## Assembling site-level values

At the moment, we just have a data frame that contains the sampled locations, with their latitude and longitude coordinates. To work with these data as spatial data, we need to convert them into a proper spatial object. To do this, we can use the vect function to create a vector spatial object from our data, specifying the ‘Longitude’ and ‘Latitude’ columns as the geometry (the spatial coordinates).

Because we are creating these data from scratch, we need to specify the coordinate and projection system. Much more on this later, but for now we will simply specify the coordinate system using the crs function. We will use a simple geographical projection based on the World Geodetic System 1984 (WGS84) coordinate system (this is projection number 4326 in the EPSG list).

sitesMap <- vect(sites,geom=c('Longitude','Latitude'))
crs(sitesMap) <- "epsg:4326"

As before, we can plot the vector data using the geom_spatvector function with ggplot. Again, we will add the map of New Zealand as background.

p <- ggplot() + 
  geom_spatvector(data = nz,fill="#5DA89955",col="#00000000") + 
  geom_spatvector(data = sitesMap,size=1,col="#7E2954") + 
  scale_x_continuous(name = "",limits = c(165,180)) + 
  scale_y_continuous(name = "",limits = c(-50,-30)) + 
  theme_classic()

p

Zooming in to just the region where sampled locations lie, we can see the spatial arrangement of the points a bit better.

p <- ggplot() + 
  geom_spatvector(data = nz,fill="#5DA89955",col="#00000000") + 
  geom_spatvector(data = sitesMap,size=1,col="#7E2954") + 
  scale_x_continuous(name = "",limits = c(172.3,172.5)) + 
  scale_y_continuous(name = "",limits = c(-42.65,-42.45)) + 
  theme_classic()

p

We could also add the roads data. I chose the location of the roads data to download to match up with these sites with biodiversity data. We only see one road becuase we are looking at a remote region in New Zealand.

p <- ggplot() + 
  geom_spatvector(data = nz,fill="#5DA89955",col="#00000000") + 
  geom_spatvector(data = nz_major_sv,col="#2E2585") +
  geom_spatvector(data = sitesMap,size=1,col="#7E2954") + 
  scale_x_continuous(name = "",limits = c(172.3,172.5)) + 
  scale_y_continuous(name = "",limits = c(-42.65,-42.45)) + 
  theme_classic()

p

The full original dataset is still contained within the map object:

head(sitesMap)

##         Source_ID Study_number Study_name Site_number  Site_name Block
## 1 CC1_2007__Ewers            1     Beetle           1 NewSite001     A
## 2 CC1_2007__Ewers            1     Beetle           2 NewSite002     A
## 3 CC1_2007__Ewers            1     Beetle           3 NewSite003     A
## 4 CC1_2007__Ewers            1     Beetle           4 NewSite004     A
## 5 CC1_2007__Ewers            1     Beetle           5 NewSite005     A
## 6 CC1_2007__Ewers            1     Beetle           6 NewSite006     A
##   Use_intensity Sample_start_earliest Sample_end_latest Diversity_metric_type
## 1   Minimal use            2000-10-30        2001-02-10             Abundance
## 2   Minimal use            2000-10-30        2001-02-10             Abundance
## 3   Minimal use            2000-10-30        2001-02-10             Abundance
## 4   Minimal use            2000-10-30        2001-02-10             Abundance
## 5   Minimal use            2000-10-30        2001-02-10             Abundance
## 6   Minimal use            2000-10-30        2001-02-10             Abundance
##   Predominant_land_use                 SSB                  SSBS
## 1   Primary vegetation CC1_2007__Ewers 1 A CC1_2007__Ewers 1 A 1
## 2   Primary vegetation CC1_2007__Ewers 1 A CC1_2007__Ewers 1 A 2
## 3   Primary vegetation CC1_2007__Ewers 1 A CC1_2007__Ewers 1 A 3
## 4   Primary vegetation CC1_2007__Ewers 1 A CC1_2007__Ewers 1 A 4
## 5   Primary vegetation CC1_2007__Ewers 1 A CC1_2007__Ewers 1 A 5
## 6   Primary vegetation CC1_2007__Ewers 1 A CC1_2007__Ewers 1 A 6
##   Study_common_taxon     Country                SS                 SSS
## 1         Coleoptera New Zealand CC1_2007__Ewers 1 CC1_2007__Ewers 1 1
## 2         Coleoptera New Zealand CC1_2007__Ewers 1 CC1_2007__Ewers 1 2
## 3         Coleoptera New Zealand CC1_2007__Ewers 1 CC1_2007__Ewers 1 3
## 4         Coleoptera New Zealand CC1_2007__Ewers 1 CC1_2007__Ewers 1 4
## 5         Coleoptera New Zealand CC1_2007__Ewers 1 CC1_2007__Ewers 1 5
## 6         Coleoptera New Zealand CC1_2007__Ewers 1 CC1_2007__Ewers 1 6
##   Total_abundance Species_richness Simpson_diversity ChaoR Richness_rarefied
## 1        256.2766               64         21.937953    NA                NA
## 2        150.6596               49         22.563549    NA                NA
## 3        114.9362               33         15.468927    NA                NA
## 4        254.7234               35          6.499758    NA                NA
## 5        142.8936               33         11.500000    NA                NA
## 6        147.7805               36         13.750499    NA                NA
##              LandUse
## 1 Primary vegetation
## 2 Primary vegetation
## 3 Primary vegetation
## 4 Primary vegetation
## 5 Primary vegetation
## 6 Primary vegetation

Just as we did before with the other vector data types, we can inspect the geometry of the points map:

head(geom(sitesMap))

##      geom part        x         y hole
## [1,]    1    1 172.4025 -42.52868    0
## [2,]    2    1 172.4025 -42.52868    0
## [3,]    3    1 172.4024 -42.52867    0
## [4,]    4    1 172.4024 -42.52866    0
## [5,]    5    1 172.4023 -42.52864    0
## [6,]    6    1 172.4021 -42.52860    0

Here, each row belongs to a different geometry (point) with specific (point) coordinates. The ‘hole’ column is meaningless for point data, but is still given as this is a vector object.

Raster Data

We have now covered the three vector spatial data types. Now, we will move onto raster data, which have a different structure. In raster data, we have data values for each cell within a spatial grid. Generally this grid consists of square cells, but you may sometimes come across rasters with rectangular or even hexagonal grid cells.

Whereas before we read in vector spatial data using the vect function in the terra package, now we will use the rast function (also from the terra package) to read and create raster data.

We will first read in an example raster dataset from file. This dataset describes the proportion of natural habitat globally at a 30-arc-second resolution (see box, below), although for a lighter-weight download here we will work with a subset of these data just for New Zealand.

A Note on Spatial Resolution

For rasters that are in a geographic projection (see more on projection systems later), the spatial units of the raster are degrees of longitude/latitude.

Thus, the resolution of such a raster can be described in terms of degrees (1 degree, 0.5 degrees etc.). Or, it can be described in terms of arc-minutes. There are 60 arc-minutes in a degree, and so a 0.5 degree resolution is the same as 30 arc-minute resolution. Alternatively, the resolution can be given in arc-seconds. There are 60 arc-seconds in an arc-minute, and so 3600 arc-seconds in a degree.

For a raster with square cells, you obviously need just one value to describe resolution. In the rare case of a raster with rectangular cells, you need two values.

For a raster with hexagonal cells, resolution is described using a bespoke system. For example, in the H3 system (implemented in the h3 package, in case you are interested), resolutions range from 0, which is a very coarse resolution where each hexagon covers approximately 4 million km², to 15, which is an incredibly fine resolution, with each hexagon covering less than 1 m².

rast("https://www.dropbox.com/scl/fi/y8jr7uawyp1c4zcrscgiy/ProportionNaturalHabitatNewZealand.tif?rlkey=e4ecfbapcaf934et3l93if7sb&dl=1") ->
  propn_natural

We can plot rasters using the tidyterra package again, this time using the geom_spatraster function.

p <- ggplot() + 
  geom_spatraster(data = propn_natural) + 
  # This function allows the use of a range of nice colour palettes
  scale_fill_grass_c(palette = 'forest_cover') + 
  theme_classic()

## <SpatRaster> resampled to 500821 cells.

We can also combine our raster data with vector maps as an overlay, here showing the proportion of natural habitat, as well as roads and PREDICTS sites for the study region we looked at previously.

p <- ggplot() + 
  geom_spatraster(data = propn_natural) + 
  scale_fill_grass_c(palette = 'forest_cover') + 
  geom_spatvector(data = nz_major_sv,col="#2E2585") +
  geom_spatvector(data = sitesMap,size=1,col="#7E2954") + 
  scale_x_continuous(name = "",limits = c(172.3,172.5)) + 
  scale_y_continuous(name = "",limits = c(-42.66,-42.45)) + 
  theme_classic()

## <SpatRaster> resampled to 500821 cells.

Similar to a vector object, where each point or line/polygon vertex is described by a pair of coordinates (longitude and latitude), the raster object contains a pair of coordinates representing the centre of each cell. We can show the coordinates using the crds function.

# Here we set na.rm=FALSE so that we show the coordinates for all cells in
# the grid, not just those with non-NA values
head(crds(propn_natural,na.rm=FALSE))

##             x       y
## [1,] 165.0042 -30.007
## [2,] 165.0125 -30.007
## [3,] 165.0208 -30.007
## [4,] 165.0292 -30.007
## [5,] 165.0375 -30.007
## [6,] 165.0458 -30.007

You can see that this function (as do all functions that extract properties of individual raster grid cells) shows the top row of cells first, then the second row, the third, etc.

Besides the coordinates for each grid cell, the other main information in a raster object are the values of the variable shown in the raster (here the proportional coverage of natural habitats). We can show these values using the values function. Again, this function returns the values for the top row of cells first, then each following row of data.

summary(values(propn_natural))

##  PRI_1km_2005_0ice
##  Min.   :0.0      
##  1st Qu.:0.2      
##  Median :0.3      
##  Mean   :0.4      
##  3rd Qu.:0.5      
##  Max.   :1.0      
##  NA's   :3906266

ggplot() + 
  geom_histogram(mapping = aes(x = values(propn_natural))) + 
  theme_classic()

Calling the raster object itself in R gives us an overview of all the properties of the raster.

propn_natural

## class       : SpatRaster 
## size        : 2400, 1799, 1  (nrow, ncol, nlyr)
## resolution  : 0.008333333, 0.008333333  (x, y)
## extent      : 165, 179.9917, -50.00283, -30.00283  (xmin, xmax, ymin, ymax)
## coord. ref. : lon/lat WGS 84 (EPSG:4326) 
## source      : ProportionNaturalHabitatNewZealand.tif?rlkey=e4ecfbapcaf934et3l93if7sb&dl=1 
## varname     : ProportionNaturalHabitatNewZealand 
## name        : PRI_1km_2005_0ice 
## min value   :        0.01605449 
## max value   :        1.00000000

Here, we can see that this raster consists of 2,400 rows and 1,799 columns of data, has a spatial resolution of 30 arc-seconds (0.0083333 of a decimal degree), spans from 165 to 180 degrees of longitude (xmin and xmax) and from -50 to -30 degrees of latitude (ymin and ymax), uses the WGS84 coordinate reference system (more on this below), and has raster values (representing the proportion of natural habitat) ranging from 0.016 to 1.

It is very simple to perform arithmetic operations on raster objects in R. You simply pass the whole raster into a calculation just as you would any other object. For example, let’s convert our raster describing proportional coverage of natural habitats into percentage values by multiplying by 100:

perc_natural <- propn_natural * 100

# We have to use the basic colour scheme here, as the one we used before
# only works on 0 to 1 values
p <- ggplot() + 
  geom_spatraster(data = perc_natural) + 
  theme_classic()

## <SpatRaster> resampled to 500821 cells.

You can also combine two or more separate rasters into a calculation (as long as they have an identical spatial domain and resolution).

So far, we have been working with simple raster data, where each raster has a single set of values across the grid. However, rasters can also have multiple ‘layers’, where each layer has its own set of values.

To look at an example of such multi-layer data, we will download monthly average temperature values from the WorldClim dataset using the geodata package. These data describe the average temperature in each of the months of the year, at a resolution of 5 arc-minutes.

# Read average temperature at 5 arc-minutes resolution
MonthlyTemp <- worldclim_global(var = "tavg",res = 5,path = tempdir())

We can see how many layers a raster object contains using the nlyr function, here showing that we have a layer for each of the 12 months of the year.

nlyr(MonthlyTemp)

## [1] 12

We can plot these data, plotting each of the 12 layers as a different facet.

p <- ggplot() + 
  geom_spatraster(data = MonthlyTemp) + 
  scale_fill_grass_c(palette = "celsius") + 
  facet_wrap(~lyr) + 
  theme_classic()

p

If we want to combine the layers, one option is simply to apply a mathematical function to the raster layer. For example, if we want to calculate the average temperature across all months of the year, we can simply apply the mean function to our raster of monthly average temperatures.

AnnAvgTemp <- mean(MonthlyTemp)

Plotting the data again, we now just have a single layer for annual mean temperature.

p <- ggplot() + 
  geom_spatraster(data = AnnAvgTemp) + 
  scale_fill_grass_c(palette = 'celsius') + 
  theme_classic()

p

Resampling/Aggregating rasters

It is often the case that we need a raster to be in a different spatial resolution than the one provided, for example because we have other datasets with a different resolution and we want to ensure consistency.

There are two main ways that we can change the spatial resolution of a raster. The simplest is aggregation, where we change the resolution of a map by a specific factor. For this, we can use the aggregate function in the terra package. For this method, we specify a function to use when aggregating. Here, we will take a simple average, using the mean function, and ignoring NA values.

propn_natural_5a <- aggregate(x = propn_natural,fact = 10,
                              fun = function(x) mean(x, na.rm=TRUE))

p <- ggplot() + 
  geom_spatraster(data = propn_natural_5a) + 
  # This function allows the use of a range of nice colour palettes
  scale_fill_grass_c(palette = 'forest_cover') + 
  theme_classic()

p

If you look carefully, you will see that the resulting map has a more blocky appearance than the original, owing to its coarser spatial resolution.

Note, that there is also a disaggregate function, if you need to convert your original map to one with a finer spatial resolution.

If you need to change to a resolution that is not a simple factor of the original, or for other sorts of flexibility in changing resolution, you can use the resample function, also in the terra package. The default method to use for combining the values of cells is something called ‘bilinear interpolation’, which interpolates across several cells. Here, we will use a simple average (method = “average”), for comparability with the map we created using aggregation. There are also other methods you can use, including using the value for the nearest neighbour, which is useful if you have a map that describes categories rather than continuous values. We will use the coarser resolution map as a template for the resampling here (basically, the output of the resampling will match the grid resolution and spatial extent of the template map).

propn_natural_5b <- resample(x = propn_natural,y = propn_natural_5a,method="average")

p <- ggplot() + 
  geom_spatraster(data = propn_natural_5b) + 
  # This function allows the use of a range of nice colour palettes
  scale_fill_grass_c(palette = 'forest_cover') + 
  theme_classic()

p

You can see that this map is essentially identical to the one we created before.

Coordinate and Projection Systems

All maps captured as 2-dimensional objects are a representation of a 3-dimensional (spherical) world. The way that this conversion works depends on the coordinate and projection system used.

The simplest way of representing spatial data is using a geographical projection, where 3-dimensional space is flattened onto a 2-dimensional map, but spatial locations are still described in terms of degrees of longitude and latitude, and the spatial unit is degrees.

Because the Earth isn’t perfectly spherical, longitude and latitude coordinates are mapped using a theoretical representation of the Earth’s surface. This is called the ‘geodetic’ system. Different systems can produce coordinates that vary, sometimes by as much as several hundred metres. Therefore, you need to be careful to ensure that any maps you want to overlay or compare use the same coordinate systems (more on converting coordinate systems and projections later). A very commonly used coordinate system is the World Geodetic System 1984 (WGS84 for short).

For this part of the tutorial, we will return to the map of annual average temperature that we calculated earlier.

p <- ggplot() + 
  geom_spatraster(data = AnnAvgTemp) + 
  scale_fill_grass_c(palette = 'celsius') + 
  theme_classic()

p

This map is probably very similar to many depictions you have seen of the world map.

However, the area covered by each grid cell on a grid defined by degrees of longitude and latitude varies enormously from the Equator to the Poles.

Here we can use the cellSize function to extract the area (in m²) of each grid cell in our raster. This function actually returns a raster of cell sizes. We can first show a simple summary of the grid cell areas, showing variation from just over a half a km² (567,025 m²) to over 85 km² (85,477,893 m²).

summary(cellSize(AnnAvgTemp))

##       area         
##  Min.   :  567025  
##  1st Qu.:33348328  
##  Median :60848574  
##  Mean   :54662805  
##  3rd Qu.:79021509  
##  Max.   :85477893

By plotting the area of the grid cells against latitude, we can see (as expected) that cell areas are tiny at the poles, but enormous at the Equator.

# Take a sample of cells consisting of the first cell in each row in the grid
samp <- seq(from=1,to=9326881, by=4320)

p <- ggplot() + 
  # Note here that the second column returned by the crds function gives
  # the latitudes of the grid cells
  geom_point(mapping = aes(x = crds(AnnAvgTemp,na.rm=FALSE)[,2][samp],
                           # Devide cell areas by 1,000,000 to convert to km2
                           y = values(cellSize(AnnAvgTemp))[samp]/1e6)) + 
  scale_x_continuous(name = "Latitude (decimal degrees)",
                     breaks = seq(from=-90,to=90,by=10)) + 
  scale_y_continuous(name = "Cell area (square km)") + 
  geom_vline(mapping = aes(xintercept = 0),colour = "#BF2C23",
                           linewidth = 1.5,linetype = "dashed") + 
  theme_classic()

p

Having such widely varying cell areas can be a problem for scientific studies, particularly in ecology, where many of the masurements we make of ecological communities vary depending on the spatial scale at which we measure them.

A solution to this problem is to change to using a projection that represents the world in such a way that grid cells have equal areas. This is called an equal-area projection, of which there are many examples. Here, we will use the Behrmann cylindrical equal area projection, and refer to it using its EPSG code (53017). See https://epsg.io/ for more details on these codes. In the terra package, we can convert the projection system for both vector and raster maps using the project function. We just need to specify the output coordinate system (y=“epsg:53017”) and the output spatial resolution. Here, we will specify here that we want an output cell resolution of 10,000 m, which is close to the average resolution of the input map (remember, that the resolution describes the length and width of cells, not the cell area).

AnnAvgTemp_EA <- project(x = AnnAvgTemp,y = "epsg:53017", res = 10000)

AnnAvgTemp_EA

## class       : SpatRaster 
## size        : 1471, 3467, 1  (nrow, ncol, nlyr)
## resolution  : 10000, 10000  (x, y)
## extent      : -17333574, 17336426, -7356108, 7353892  (xmin, xmax, ymin, ymax)
## coord. ref. : Sphere_Behrmann (ESRI:53017) 
## source(s)   : memory
## name        :      mean 
## min value   : -54.70879 
## max value   :  30.85022

Inspecting the summary of this map, we can see a few key differences compared to the previous maps with the WGS84 geographic projection. Most importantly, all spatial units now are in m rather than degrees of latitude/longitude. So the spatial resolution here is 10,000 m (as we specified), and the values describing the extent of the map are also given in m. Similarly, if we inspect the coordinates of this map, these are also given in units of m rather than degrees of longitude/latitude.

head(crds(AnnAvgTemp_EA,na.rm=FALSE))

##              x       y
## [1,] -17328574 7348892
## [2,] -17318574 7348892
## [3,] -17308574 7348892
## [4,] -17298574 7348892
## [5,] -17288574 7348892
## [6,] -17278574 7348892

When projecting a raster map in this way, R has to combine the values of different cells to obtain values for the new grid. By default, the project function uses an averaging procedure called bilinear interpolation, but you can also specify simple averaging or the use of nearerst-neighbour values instead (the latter is useful if your data consist of categories rather than continuous values).

If we plot this map, we can see that the representation of the world correctly shows the larger area encompassed by tropical regions such as Africa.

p <- ggplot() + 
  geom_spatraster(data = AnnAvgTemp_EA) + 
  scale_fill_grass_c(palette = 'celsius') + 
  theme_classic()

p

The cell areas are now much more consistent, although there is still some variation, particularly at extreme polar latitudes (no projection is perfect).

summary(cellSize(AnnAvgTemp_EA))

##       area          
##  Min.   : 67201976  
##  1st Qu.: 99662424  
##  Median : 99890344  
##  Mean   : 99836937  
##  3rd Qu.:100269984  
##  Max.   :100805600

We can project spatial vector objects in exactly the same way as for rasters.

So, for example, let’s project the sites sampled in the New Zealand beetles study and the New Zealand roads map to the same equal-area projection we have been using so far. We will also project the natural habitat map while we are at it.

nz_major_sv_EA <- project(x = nz_major_sv,y = "epsg:53017")
sitesMap_EA <- project(x = sitesMap,y = "epsg:53017")
propn_natural_EA <- project(propn_natural,y = "epsg:53017",res = 1000)

p <- ggplot() + 
  geom_spatraster(data = propn_natural_EA) + 
  scale_fill_grass_c(palette = 'forest_cover') + 
  geom_spatvector(data = nz_major_sv_EA) + 
  geom_spatvector(data = sitesMap_EA) + 
  # Note that we are having to set the plotting limits in units of metres here
  scale_x_continuous(name = "",limits = c(1.659e7,1.661e7)) +
  scale_y_continuous(name = "",limits = c(-4.99e6,-4.96e6)) + 
  theme_classic()

p

Combining Spatial Data and Making Calculations

In the final section of this tutorial, we will look at how to combine different spatial datasets to make calculations for use in later analyses.

We will focus in this session on extracting values of a raster at specified points, and on calculating distances between objects. These are commonly used methods in GIS, and are often very useful in scientific studies.

Extracting Raster/Polygon Values at Specified Points

Extracting values from a raster is very simple using the extract function in the terra package. We simply specify the raster we want to obtain values from, and the points at which we want to extract values.

It is important to make sure, as we have already done in the previous section, that the raster map and the points map are in the same spatial projection and use the same coordinate system, and that the units of the coordinates match. Otherwise, the resulting extracted values may be inaccurate.

# Here the main data layer is labelled 'PRI_1km_2005_0ice' from the original data 
# source. We have to specify that we want these values from the output produced by 
# the extract function (otherwise we get cell IDs)
sitesMap_EA$NaturalHabitat <- extract(
  x = propn_natural_EA,
  y = sitesMap_EA)$PRI_1km_2005_0ice

As well as specifying a spatial object of the points at which we want to extract values, we can also just specify the coordinates directly (again, making sure that these coordinates have the same spatial projection and units as the raster map).

natHabExtract <- extract(x = propn_natural_EA,y = crds(sitesMap_EA))$PRI_1km_2005_0ice

natHab <- data.frame(NaturalHabitat=natHabExtract)

ggplot() + 
  geom_histogram(data = natHab,mapping = aes(x = NaturalHabitat)) + 
  theme_classic()

We can also extract values at specified points from a polygon object. So, for example, we could compare our points map to the polygon map of countries we worked with earlier. As before, we need to specify the label for the column of data we want to return. Here, we will ask for the ISO3 country code (column ‘iso_a3’).

# We didn't project the New Zealand map to an equal-area projection earlier,
# so we will do that now
nz_EA <- project(x = nz, y = "epsg:53017")

countryExtract <- extract(x = nz_EA, y = sitesMap_EA)$iso_a3

table(countryExtract)

## countryExtract
## NZL 
## 233

Not surprisingly, the only value we obtain is the country code for New Zealand.

Finally, we might be interested to know the average value of a raster in the area up to some specified distance around a set of points. For example, let’s say we want to know the average natural habitat cover up to a 2,500-m radius around the points where we have sampled biodiversity. Given that animals move, it may be more important to know the average natural habitat across a landscape than at the specific point where we sampled biodiversity (although, of course depending on the resolution of the raster map, we are already looking at some sort of average across a landscape, albeit quite a small landscape; we will also ignore for now the fact that the beetles that were sampled in the study we are looking at are not particulary mobile!).

This is a slightly more complex operation. We first need to define the landscapes around our spatial points. We can do this using the buffer function in terra.

This creates a polygon object describing circular buffers around each of our points. In this case, these buffers are strongly overlapping.

# Create a 2,500-m buffer around each of the biodiversity points
PtsBuff <- buffer(x = sitesMap_EA,width = 2500)

p <- ggplot() + 
  geom_spatraster(data = propn_natural_EA) + 
  scale_fill_grass_c(palette = 'forest_cover') + 
  geom_spatvector(data = PtsBuff,fill="#00000000") + 
  # geom_spatvector(data = nz_major_sv_EA) + 
  geom_spatvector(data = sitesMap_EA) + 
  # Note that we are having to set the plotting limits in units of metres here
  scale_x_continuous(name = "",limits = c(1.658e7,1.662e7)) +
  scale_y_continuous(name = "",limits = c(-5.00e6,-4.95e6)) + 
  theme_classic()

p

Now we have defined the landscapes over which we want to average our natural habitat values, we can perform an extract operation to obtain the raster values in our landscapes of interest.

When we extract values from a raster based on overlaying polygons, we obtain a data table of values, where the ID column corresponds to the ID in the original geometry.

NatHabValsLandscapes <- extract(x = propn_natural_EA,y = PtsBuff)

head(NatHabValsLandscapes)

##   ID PRI_1km_2005_0ice
## 1  1         0.5548934
## 2  1         0.5079334
## 3  1         0.3325407
## 4  1         0.5357677
## 5  1         0.6108298
## 6  1         0.4311446

Now we can summarize the average values of natural habitat for each buffer:

# Summarize average natural habitat values across our 2,500-m buffers
# Remember that, because of the way it was created, 
# our original layer labelled natural habitat as 'PRI_1km_2005_0ice'
# We need to specify na.rm=TRUE when averaging because there are some missing
# values in our natural habitat layer
NatHabValsLandscapes %>% group_by(ID) %>% 
  summarize(NaturalHabitat5k = mean(PRI_1km_2005_0ice,na.rm=TRUE)) -> 
  NatHabLandscapeAvgs

Finally, we can merge these final average values into the data frame for our sampled sites. We can do this using a left join, but first we need to add the ID for each of the sites to its data frame (at the moment, this is only stored in the geometry).

sitesMap_EA$ID <- data.frame(geom(sitesMap_EA))$geom

sitesMap_EA %>% left_join(NatHabLandscapeAvgs,by = "ID") -> sitesMap_EA

Plotting our original site-specific values of proportion of natural habitat with our new values averaged across landscapes, we can see that we have pretty similar values (because in this case our buffers aren’t much bigger than the 1,000-m resolution of the original raster). However, there is some variation in the estimated values.

p <- ggplot() + 
  geom_point(data = sitesMap_EA,
             mapping = aes(x = NaturalHabitat,y = NaturalHabitat5k)) +
  theme_classic()

p

Calculating Distances

To begin with, we will focus on a simple calculation of distances, namely getting the shortest distance from a set of points to a set of lines (here, the roads map we worked with earlier). Other distance calcaulations for vector maps, such as distances to the nearest edge of a polygon or distances between two sets of points, follow the same basic principles.

For this section of the tutorial, we will be working with the ‘distance’ function from the terra package. This is a very flexible function that allows the calculation of all sorts of different distance measurements.

The function itself calculates all pairwise differences, and so afterwards we need to calculate the minimum distances.

# Get all pairwise distances from the sites in the sites map 
# to each road in the roads map
AllDistances <- distance(x = sitesMap_EA,
                         y = nz_major_sv_EA)

# Calculate the minimum distance to any road for each site
MinDistances <- apply(AllDistances, 1, min)

# Add these distances to the sites map data frame
sitesMap_EA$RoadDistance <- MinDistances

We might also be interested in calculating the distance from a set of points to the nearest raster cell of a specific value.

Let’s say we want to know the distance from each site to the nearest cell with more than 30% coverage of natural habitat.

The easiest way to do this is first to get the coordinates of cells with more than 30% natural habitat, convert these cells into a vector map of points using the cell coordinates, and then calculate distances between the two sets of points (the locations with biodiversity data and the cells with high levels of natural habitat).

# Note that the outline of this code was supplied by Microsoft Copilot

# Identify cells with natural habitat higher than 30%
high_nathab_cells <- which(values(propn_natural_EA)>0.3)

# Get the coordinates of these cells
high_nathab_coords <- xyFromCell(propn_natural_EA,high_nathab_cells)

# Convert the coordinates into a vector points map
high_nathab_points <- vect(high_nathab_coords, type="points", 
                           crs=crs(propn_natural_EA))

# Calculate all pairwise distances between biodiversity sampling sites and
# cells with high natural habitat
AllDistances <- distance(sitesMap_EA,high_nathab_points)

# Calculate the minimum distances for all sites with biodiversity sampling
MinDistances <- apply(AllDistances, 1, min)

# Add the minimum distances to the original biodiversity sites map
sitesMap_EA$NaturalHabitatDistance <- MinDistances

There are loads of other types of calculation that you can do with spatial data, which we don’t have time to cover today, including variations on the extraction and distance operations that we have just investigated. If you are interested in further exploring the use and handling of spatial data, I encourage you to explore some of the many functions in the very versatile terra package.

Exercises

Try extracting the annual average temperature at the points where beetle biodiversity was sampled
Try to calculate the distance from each of the points where beetle biodiversity was sampled to the nearest bit of coastline (hint, the functions we used earlier to calculate distances between points and lines work also for polygons, but to use just the edges you have to apply the as.lines function to the polygon first)
CHALLENGE EXERCISE. Try to calculate the average of annual mean temperature across the whole of New Zealand