Point-referenced spatial data

Spatial Statistics - BIOSTAT 696/896

Michele Peruzzi

University of Michigan

Recap: point-referenced spatial data

We observe Y(s) at location s
What is s in this case?
What are some examples of point-referenced data?

Point-referenced spatial data

Observe Y(s) at s \in \cal L = \{s_1, \dots, s_n\}
The set \cal L is assumed fixed and known, i.e., the locations are not random

Goals

Learn/estimate how Y(\cdot) changes in space. Model the spatial dependence
Generalize linear regression: drop i.i.d. assumption in favor of spatial dependence
Predict Y(\cdot) at locations that were not observed

Example

How does air pollution spread geographically?
Considering the impact of air pollution on health, how should we model the spread of respiratory disease?
Using a fixed set of observations of air pollution, how can we estimate the level of air pollution everywhere else?

Exploratory analysis and simulation

Simulating a fixed set of spatial locations
This is just for simulation. Remember that with point-referenced data the coordinates are given and non-random
Remember to set the seed for reproducible analyses!!

set.seed(2024) # set a seed

n_obs <- 200
x_coord <- runif(n_obs, 0, 1)
y_coord <- runif(n_obs, 0, 1)
coords <- data.frame(x_coord, y_coord)

Plotting

library(tidyverse)

ggplot(coords, aes(x_coord, y_coord)) +
  geom_point(size=.7) +
  theme_minimal()

Exploratory analysis and simulation

So far we only have locations; there is no data. Let’s generate some data for those locations
No need to set the seed here. The script will run from the beginning and make all random draws in sequence
You can set the seed here if you expect to run chunks of code in different sequences

#set.seed(1234) 
y <- rnorm(n_obs, 0, 1)

How many random variables are we generating?
What do you expect to see when we plot y on the coordinates we had before? Why?

Exploratory analysis and simulation

df <- coords %>% mutate(y=y)

ggplot(df, aes(x_coord, y_coord, color=y)) +
  geom_point() +
  theme_minimal() +
  scale_color_viridis_c()

Exploratory analysis and simulation

Make it easier to see there is no spatial dependence by

increasing sample size

n_obs <- 20000
x_coord <- runif(n_obs, 0, 1)
y_coord <- runif(n_obs, 0, 1)
coords <- data.frame(x_coord, y_coord)
y <- rnorm(n_obs, 0, 1)
df <- coords %>% mutate(y=y)

ggplot(df, aes(x_coord, y_coord, color=y)) +
  geom_point(size=.5) +
  theme_minimal() +
  scale_color_viridis_c()

Exploratory analysis and simulation

Make it easier to see there is no spatial dependence by

plotting on a regular grid of spatial locations
regular grid = each data point can be plotted as a pixel on a raster image

library(scico) # provides scale_fill_scico

xlocs <- seq(0, 1, length.out=100)
coords <- expand.grid(xlocs, xlocs) # columns are named Var1, Var2
n_obs <- nrow(coords)
y <- rnorm(n_obs, 0, 1)
df <- coords %>% mutate(y=y)

ggplot(df, aes(Var1, Var2, fill=y)) +
  geom_raster() +
  theme_minimal() +
  scale_fill_scico(palette="batlowK")
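
A numerical companion to the picture, as a minimal sketch: on a regular grid we can correlate each pixel with the pixel next to it. The seed and the names grid_side, y_mat and neighbor_cor are assumptions made for this illustration. With independent draws the correlation should be close to 0.

```r
set.seed(1)  # hypothetical seed, only for this sketch

# Regular 100 x 100 grid with i.i.d. N(0, 1) values, stored as a matrix
grid_side <- 100
y_mat <- matrix(rnorm(grid_side^2), nrow = grid_side, ncol = grid_side)

# Pair each pixel with its neighbor in the next column and correlate the two sets
left  <- as.vector(y_mat[, -grid_side])
right <- as.vector(y_mat[, -1])
neighbor_cor <- cor(left, right)

neighbor_cor  # expected to be close to 0 for independent data
```

With spatial dependence we would instead expect neighboring pixels to be positively correlated.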

Exploratory analysis and simulation

We have seen simulated spatial data without spatial dependence
What would we expect to see if there was spatial dependence?
Can you think of ways we could simulate data with spatial dependence?

Exploratory analysis and simulation

What does spatial dependence “look like”? Example 1

Exploratory analysis and simulation

What does spatial dependence “look like”? Example 2

Exploratory analysis and simulation

What does spatial dependence “look like”? Example 3

Exploratory analysis and simulation

What does spatial dependence “look like”? Example 4

Exploratory analysis and simulation

What does spatial dependence “look like”? Example 5

Exploratory analysis and simulation

What does spatial dependence “look like”? Example 6

Exploratory analysis and simulation

What does spatial dependence “look like”? Example 7

Exploratory analysis and simulation

What does spatial dependence “look like”? Example 8

Exploratory analysis and simulation

What does spatial dependence “look like”? Example 9

Exploratory analysis and simulation

What does spatial dependence “look like”? Example 10

Exploratory analysis and simulation

What common feature can we recognize?

We only observe 1 sample

What if we could “re-run” the data-generation? Different data, same process

Exploratory analysis and simulation

How “similar” are the random variables Y(s_1) and Y(s_2)?
In other words, how large do we expect this to be: | p(Y(s_2) | Y(s_1)) - p(Y(s_2)) |?
Easier? How large do we expect this to be: Cov( Y(s_1), Y(s_2) )?

Exploratory analysis and simulation

How “similar” are the random variables Y(s_3) and Y(s_4)?
In other words, how large do we expect this to be: | p(Y(s_4) | Y(s_3)) - p(Y(s_4)) |?
Easier? How large do we expect this to be: Cov( Y(s_3), Y(s_4) )?

Exploratory analysis and simulation

Distance between locations (largely) drives how related the random variables are
If s and s' are close to each other in space, then Cov(Y(s), Y(s')) should be larger
The question of what model to use for Cov(Y(s), Y(s')) is crucial
If we let Cov(Y(s), Y(s')) = g(s, s'), how do we choose g(\cdot, \cdot)?
For example, we could model g(\cdot, \cdot) as a positive decreasing function of \| s-s' \|
We can think of many other alternative models. More on this later…
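
One possible choice, sketched here under stated assumptions: take g(s, s') = \sigma^2 \exp(-\phi \| s-s' \|), the exponential covariance, which is positive and decreasing in distance. The parameter values, the grid size, the seed, and the names sigma_sq, phi, cov_mat and y_dep are all hypothetical choices for illustration. The code draws one sample of spatially dependent Gaussian data through a Cholesky factor of the covariance matrix.

```r
set.seed(1)  # hypothetical seed, only for this sketch

# Hypothetical parameter values, chosen for illustration
sigma_sq <- 1    # variance of Y(s)
phi      <- 5    # decay rate: larger phi means dependence dies off faster

# Small regular grid on [0, 1]^2 so the covariance matrix stays manageable
xlocs  <- seq(0, 1, length.out = 30)
coords <- as.matrix(expand.grid(xlocs, xlocs))
n_obs  <- nrow(coords)

# g(s, s') = sigma_sq * exp(-phi * ||s - s'||): positive, decreasing in distance
dist_mat <- as.matrix(dist(coords))
cov_mat  <- sigma_sq * exp(-phi * dist_mat)

# If z ~ N(0, I) and cov_mat = L L^T, then y = L z has covariance cov_mat
chol_lower <- t(chol(cov_mat))
y_dep <- as.vector(chol_lower %*% rnorm(n_obs))
```

The vector y_dep can be attached to the grid coordinates and plotted with geom_raster in the same way as the independent data earlier. The Cholesky step costs on the order of n_obs^3 operations, which is why the grid here is kept small.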

Exploratory analysis: empirical semivariogram

How do we measure spatial association in data before we attempt any sort of modeling?
Call Y(s_i) and Y(s_j) the r.v.’s at s_i and s_j
Suppose the domain is D = [0,1]^2
Partition the interval (0, m] \subset \Re into K disjoint intervals
I_1 = (m_0 = 0, m_1], I_2 = (m_1, m_2], \dots, I_K = (m_{K-1}, m_K = m]
Define t_k = \frac{m_{k-1} + m_k}{2} (the midpoint of the kth interval)
Define N(t_k) = \{ (s_i, s_j) : \| s_i - s_j \| \in I_k \}
N(t_k) is the set of all pairs of locations whose pairwise distance is “approximately” t_k

Exploratory analysis: empirical semivariogram

Define N(t_k) = \{ (s_i, s_j) : \| s_i - s_j \| \in I_k \}
Define the empirical semivariogram as:

\gamma(t_k) = \frac{1}{2 |N(t_k)|} \sum_{(s_i, s_j) \in N(t_k)} (Y(s_i) - Y(s_j))^2

What should \gamma(t) look like for increasing t if there is spatial dependence?
If there is spatial dependence, Y(s_i) and Y(s_j) should be close to each other if s_i and s_j are not far apart
We then expect \gamma(t) to be increasing with t: as the locations are farther from each other, the observations become increasingly different
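
A short derivation makes the link with the covariance explicit. It is a sketch under two assumptions that are not stated above: Y(\cdot) has the same mean at every location, and the same variance \sigma^2 at every location.

```latex
\frac{1}{2} E\left[ (Y(s_i) - Y(s_j))^2 \right]
  = \frac{1}{2} \left[ Var(Y(s_i)) + Var(Y(s_j)) \right] - Cov(Y(s_i), Y(s_j))
  = \sigma^2 - Cov(Y(s_i), Y(s_j))
```

So if Cov(Y(s_i), Y(s_j)) decreases with \| s_i - s_j \|, the semivariogram increases with distance and levels off near \sigma^2 once the covariance is negligible. With no spatial dependence the covariance is 0 at every distance and the semivariogram is flat at \sigma^2.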

Exploratory analysis: empirical semivariogram

install.packages("geoR") (on macOS, you may also need to install XQuartz)
Plot the empirical semivariogram via geoR::variog
Let’s use Example 8 from earlier

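library(gridExtra) # provides grid.arrange

# empirical semivariogram of y at the grid locations (Var1, Var2)
sv <- simdf %>% with(geoR::variog(data = y, coords = cbind(Var1, Var2), messages=FALSE))

# distance at each bin (u), semivariogram value (v), number of pairs per bin (n)
sv_df <- data.frame(dists = sv$u, variogram = sv$v, npairs = sv$n, sd = sv$sd)
sv_plot <- ggplot(sv_df, aes(x=dists, y=variogram)) + geom_point(size=2, shape=8) +
  theme_minimal()

grid.arrange(data_plot, sv_plot, nrow=1)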
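
To see what a binned estimator of this kind does, here is a base-R sketch of the empirical semivariogram formula defined earlier. It is not geoR's implementation. The stand-in data (i.i.d. draws at random locations), the seed, the maximum distance max_dist, the number of bins n_bins, and all variable names are assumptions made for illustration.

```r
set.seed(1)  # hypothetical seed, only for this sketch

# Hypothetical stand-in data: i.i.d. N(0, 1) values at random locations in [0, 1]^2
n_obs  <- 300
coords <- cbind(runif(n_obs), runif(n_obs))
y      <- rnorm(n_obs)

# Pairwise distances and squared differences, one entry per unordered pair (i < j)
pair_dist    <- as.vector(dist(coords))
pair_sq_diff <- as.vector(dist(y))^2

# Partition (0, m] into K intervals I_k = (m_{k-1}, m_k]; m and K are illustrative
max_dist  <- 0.7
n_bins    <- 10
breaks    <- seq(0, max_dist, length.out = n_bins + 1)
bin_id    <- cut(pair_dist, breaks = breaks)  # right-closed; NA beyond max_dist
midpoints <- (head(breaks, -1) + tail(breaks, -1)) / 2

# gamma(t_k) = sum of squared differences in bin k, divided by 2 |N(t_k)|
n_pairs   <- as.vector(table(bin_id))
gamma_hat <- as.vector(tapply(pair_sq_diff, bin_id, sum)) / (2 * n_pairs)

data.frame(dists = midpoints, variogram = gamma_hat, npairs = n_pairs)
```

Because the stand-in data are independent with variance 1, the values in gamma_hat should hover around 1 at every distance. Data with spatial dependence would instead give small values at short distances that grow with distance.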

Exploratory analysis: empirical semivariogram

Let’s use Example 10 from earlier

Exploratory analysis: example

We want to investigate a variable representing the time it takes for vegetation to reach peak greenness
Changes in the greenup time may be explained by climate change

df <- read.csv("data/michigan_greenness.csv")

(data_plot <- ggplot(df, aes(Longitude, Latitude, fill=Greenness)) +
  geom_raster() +
  scale_fill_scico(palette="batlowK") + 
  theme_minimal())

Exploratory analysis: example

sv <- df %>% with(geoR::variog(data = Greenness, coords = cbind(Longitude, Latitude), messages=FALSE))
sv_df <- data.frame(dists = sv$u, variogram = sv$v, npairs = sv$n, sd = sv$sd)
sv_plot <- ggplot(sv_df, aes(x=dists, y=variogram)) + geom_point(size=2, shape=8) +
  theme_minimal()

grid.arrange(data_plot, sv_plot, nrow=1)