Gaussian Processes models for point-referenced spatial data, part 2

Spatial Statistics - BIOSTAT 696/896

Michele Peruzzi

University of Michigan

Likelihood inference: MLE estimation of covariance parameters

  • Let’s consider the Greenness data
  • Small preprocessing: remove sample mean

Likelihood inference: MLE estimation of covariance parameters

  • Call \cal T our set of observed locations. The data vector at \cal T is Y_{\cal T}, or simply Y
  • Let’s use an exponential covariance model for a mean-zero GP
  • Because we are directly assuming that the data are a realization of a GP, this is called a GP model of the response or response model Y(\cdot) \sim GP(0, C_{\theta}(\cdot)) For our n-dimensional vector of observations, we have Y \sim N(0, C_{\theta}) where the (i,j) element of the C_{\theta} matrix is C_{\theta}(i,j) = \sigma^2 \exp \{ -\phi \|s_i - s_j \| \} where s_i, s_j \in \cal T.

Likelihood inference: MLE estimation of covariance parameters

  • Recall from preliminaries that because of our MVN assumption, we can write

\log p(Y \mid \theta) = -\frac{n}{2} \log(2\pi) - \sum_{i=1}^n\log(L_{ii}) - \frac{1}{2}Y^T L^{-T} L^{-1} Y

  • The matrix L is… ?
  • Our goal is to find the MLE of \theta

\hat{\theta}_{MLE} = \arg\max_{\theta} \log p(Y \mid \theta)

Likelihood inference: MLE estimation of covariance parameters

  • Recall from preliminaries that because of our MVN assumption, we can write

\log p(Y \mid \theta) = -\frac{n}{2} \log(2\pi) - \sum_{i=1}^n\log(L_{ii}) - \frac{1}{2}Y^T L^{-T} L^{-1} Y

  • The matrix L is the lower Cholesky factor of C_{\theta}, i.e. C_{\theta} = L L^\top
  • L is where \theta appears in the likelihood function
  • Our goal is to find the MLE of \theta

\hat{\theta}_{MLE} = \arg\max_{\theta} \log p(Y \mid \theta)

Likelihood inference: MLE estimation of covariance parameters

  • Recall from preliminaries that because of our MVN assumption, we can write

\log p(Y \mid \theta) = -\frac{n}{2} \log(2\pi) - \sum_{i=1}^n\log(L_{ii}) - \frac{1}{2}Y^T L^{-T} L^{-1} Y

  • The matrix L is the lower Cholesky factor of C_{\theta}, i.e. C_{\theta} = L L^\top
  • L is where \theta appears in the likelihood function
  • Our goal is to find the MLE of \theta: dropping the constant -\frac{n}{2}\log(2\pi) and flipping the sign, maximizing the log-likelihood is equivalent to minimizing the negative log-likelihood

\hat{\theta}_{MLE} = \arg\min_{\theta} \left( \sum_{i=1}^n\log(L_{ii}) + \frac{1}{2}Y^T L^{-T} L^{-1} Y \right)

Likelihood inference: MLE estimation of covariance parameters

  • Let’s do this in R in a simple way
set.seed(696)

coords <- df %>% dplyr::select(Longitude, Latitude) %>% as.matrix()
Y <- df$Greenness %>% as.vector()

Covariance_make <- function(theta){
  # theta is on the log scale so that optim can search over all of R^2
  sigmasq <- exp(theta[1])
  phi <- exp(theta[2])
  # exponential covariance matrix at all pairs of observed locations
  Cmatrix <- sigmasq * exp(-phi * as.matrix(dist(coords)) )
  return(Cmatrix)
}

mvn_eval <- function(theta){
  Ctheta <- Covariance_make(theta)
  Ltheta <- t(chol(Ctheta)) # lower Cholesky factor, Ctheta = L L^T
  Linv_Y <- solve(Ltheta, Y) # = L^-1 Y
  
  # negative log-likelihood, dropping the constant (n/2) log(2*pi) which does not affect the argmin
  neg_logdens <- sum(log(diag(Ltheta))) + 0.5 * crossprod(Linv_Y)
  return(neg_logdens)
}

(save_time <- system.time({ results <- optim(c(0,0), mvn_eval) }))
   user  system elapsed 
 45.836   3.691  39.925 

Likelihood inference: MLE estimation of covariance parameters

$par
[1] 6.063779 2.904695

$value
[1] 10342.57

$counts
function gradient 
      77       NA 

$convergence
[1] 0

$message
NULL
  • convergence=0 means that optim reports successful convergence to a (local) minimum
  • Because we parametrized on the log scale, \hat{\sigma}^2 = \exp(6.064) \approx 430.00 and \hat{\phi} = \exp(2.905) \approx 18.26
  • Let’s try different starting values:
results2 <- optim(c(-1,1), mvn_eval)
$par
[1] 6.065505 2.885855

$value
[1] 10342.61

$counts
function gradient 
      71       NA 

$convergence
[1] 0

$message
NULL
  • \hat{\sigma}^2 = 430.74 and \hat{\phi} = 17.92, very close to the first run (the back-transformation is checked in the snippet below)
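Since \theta was parametrized on the log scale, the estimates above come from exponentiating optim's output; a quick check, assuming results and results2 are the two optim outputs shown above:

exp(results$par)   # (sigmasq_hat, phi_hat) from the first starting values
exp(results2$par)  # (sigmasq_hat, phi_hat) from the second starting values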

Key take-aways from the previous slides

Scalability of likelihood-based inference of GP models

  • With n = 3008 observations the optimization routine takes about 40 seconds for 77 evaluations of the likelihood function
  • That is \approx 0.5 s per likelihood evaluation: what if we had many more observations? (see the timing sketch below)
  • Presumably, optimizing a more complex covariance function will require even more likelihood evaluations
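A minimal timing sketch (the locations and parameter values below are illustrative, not the Greenness data): each likelihood evaluation is dominated by the Cholesky factorization, whose cost grows roughly as O(n^3), so doubling n multiplies the time by about 8.

set.seed(696)
for(n in c(500, 1000, 2000, 4000)){
  coords_n <- matrix(runif(2 * n), ncol = 2)     # hypothetical random locations
  C_n <- exp(-10 * as.matrix(dist(coords_n)))    # exponential covariance with sigmasq = 1, phi = 10
  cat("n =", n, ":", system.time(chol(C_n))["elapsed"], "seconds\n")
}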

Covariance estimation (aka kernel learning)

  • We changed the starting values and we got slightly different optima
  • The likelihood values are very similar
  • In general, GP models have relatively flat likelihoods
  • Many different values of \theta lead to very similar p(Y \mid \theta)
  • This will be one justification for using a Bayesian model in which we include prior information on the parameters

Making predictions based on a GP response model

  • We have a point-estimate \hat{\theta} for our covariance parameters
  • In other words, we have found what values of \theta in our covariance model best fit our data
  • We do not have any other unknowns in our model
  • How do we make predictions?

Making predictions based on a GP response model

Back to our theory of Gaussian Processes:

  • Our GP assumption says that any finite collection of spatially-indexed random variables is jointly Gaussian
  • We have \cal T as our observed locations, Y_{\cal T} the realization of a GP at those locations
  • Suppose we now have \cal U as the set of locations where we want to make predictions, i.e. Y_{\cal U} is unobserved
  • \cal T and \cal U are both finite sets of locations, hence

\begin{bmatrix}Y_{\cal T} \\ Y_{\cal U}\end{bmatrix} \sim \text{????}

Making predictions based on a GP response model

Back to our theory of Gaussian Processes:

  • Our GP assumption says that any finite collection of spatially-indexed random variables is jointly Gaussian

  • With locations \cal L, we are saying Y_{\cal L} \sim N(0, C_{\cal L})

  • Key: for any finite set \cal L

  • In our applied problem we have \cal T as our observed locations, Y_{\cal T} the realization of a GP at those locations

  • Suppose we now have \cal U as the set of locations where we want to make predictions, i.e. Y_{\cal U} is unobserved

  • \cal T and \cal U are both finite sets of locations, hence we can build \cal L = \cal T \cup \cal U, and then:

Y_{\cal L} = \begin{bmatrix}Y_{\cal T} \\ Y_{\cal U}\end{bmatrix} \sim N\left( 0, \begin{bmatrix} C_{\cal T} & C_{\cal T, \cal U} \\ C_{\cal T, \cal U}^ \top & C_{\cal U} \end{bmatrix} \right) = N(0, C_{\cal L})

  • We should now be able to make predictions. How do we continue?

Making predictions based on a GP response model

Y_{\cal L} = \begin{bmatrix}Y_{\cal T} \\ Y_{\cal U}\end{bmatrix} \sim N\left( 0, \begin{bmatrix} C_{\cal T} & C_{\cal T, \cal U} \\ C_{\cal T, \cal U}^ \top & C_{\cal U} \end{bmatrix} \right) = N(0, C_{\cal L})

We use the joint density above to find the conditional density below:

Y_{\cal U} \mid Y_{\cal T} \sim N\left( m_{\cal U|\cal T}, C_{\cal U|\cal T} \right), where

\begin{align*} m_{\cal U|\cal T} &= C_{\cal U, \cal T} C_{\cal T}^{-1} Y_{\cal T} \\ C_{\cal U|\cal T} &= C_{\cal U} - C_{\cal U, \cal T} C_{\cal T}^{-1} C_{\cal T, \cal U} \end{align*}

The conditional mean is our point prediction at the new locations; the conditional variance quantifies our uncertainty at the new locations. In "new vs observed" notation:

\begin{align*} m_{\text{new}|\text{obs}} &= C_{\text{new}, \text{obs}} C_{\text{obs}}^{-1} Y_{\text{obs}} \\ C_{\text{new}|\text{obs}} &= C_{\text{new}} - C_{\text{new}, \text{obs}} C_{\text{obs}}^{-1} C_{\text{obs}, \text{new}} \end{align*}
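A minimal sketch of these formulas in R (not from the slides): it assumes coords and Y are the Greenness objects defined earlier, sigmasq_hat and phi_hat are plug-in MLE values, and coords_new is a hypothetical matrix of prediction locations.

gp_predict <- function(coords_new, coords, Y, sigmasq_hat, phi_hat){
  n_new <- nrow(coords_new); n_obs <- nrow(coords)
  # pairwise distances among all locations, then split into blocks
  D_all   <- as.matrix(dist(rbind(coords_new, coords)))
  D_new   <- D_all[1:n_new, 1:n_new, drop = FALSE]
  D_cross <- D_all[1:n_new, n_new + 1:n_obs, drop = FALSE]
  D_obs   <- D_all[n_new + 1:n_obs, n_new + 1:n_obs]
  # exponential covariance blocks at the plug-in parameter values
  C_new   <- sigmasq_hat * exp(-phi_hat * D_new)
  C_cross <- sigmasq_hat * exp(-phi_hat * D_cross)
  C_obs   <- sigmasq_hat * exp(-phi_hat * D_obs)
  # kriging equations above: conditional mean and conditional covariance
  m_pred <- C_cross %*% solve(C_obs, Y)
  C_pred <- C_new - C_cross %*% solve(C_obs, t(C_cross))
  list(mean = drop(m_pred), cov = C_pred)
}

For example, pred <- gp_predict(coords_new, coords, Y, 430, 18.3) would return the point predictions and the predictive covariance matrix at the new locations.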

Making predictions based on a GP response model

  • Uncertainty of predictions: we can use 95% credible bands point-wise. At a single new location s^*, we have

y(s^*) \mid Y_{\cal T} \sim N\left( m_{\text{pred}}, C_{\text{pred}} \right), where

\begin{align*} m_{\text{pred}} &= C(s^*, {\cal T}) C_{\cal T}^{-1} Y_{\cal T} \\ C_{\text{pred}} &= \sigma^2 - C(s^*, {\cal T}) C_{\cal T}^{-1} C({\cal T}, s^*) \end{align*}

  • Therefore U_{95} = \Phi^{-1}(0.975; m_{\text{pred}}, \sqrt{C_{\text{pred}}}) - \Phi^{-1}(0.025; m_{\text{pred}}, \sqrt{C_{\text{pred}}})
  • \Phi^{-1}(p; \mu, \sigma) is the inverse CDF of the univariate Gaussian with mean \mu and standard deviation \sigma. In R, qnorm.
  • Because \cal U can be large, applying the formula above one location at a time saves memory compared to building the full C_{\cal U \mid \cal T} and extracting its diagonal, although the latter also works (see the sketch below)
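Continuing the hypothetical gp_predict() sketch from the previous slide, the point-wise 95% bands could be computed as follows (pred, coords_new, sigmasq_hat, phi_hat are the illustrative names used there).

pred    <- gp_predict(coords_new, coords, Y, sigmasq_hat, phi_hat)
sd_pred <- sqrt(diag(pred$cov))                          # point-wise predictive standard deviations
lower   <- qnorm(0.025, mean = pred$mean, sd = sd_pred)
upper   <- qnorm(0.975, mean = pred$mean, sd = sd_pred)
U95     <- upper - lower                                 # width of the 95% band at each new location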

Making predictions based on a GP response model

  • As we move away from the observed data
    • The predictive mean gets closer to 0 (which is the prior mean we have assumed for the GP)
    • Uncertainty bands become wider
  • This matches the intuition that spatial dependence is determined by distance

More on point predictions and uncertainty bands

  • We considered the Greenness dataset, a gridded dataset in 2 dimensions
  • Let’s simplify and consider a 1-dimensional spatial domain and a small dataset
  • Explain what the code below is doing
  • What covariance function are we using?
set.seed(696)
n <- 10
x <- runif(n)                # n locations on the unit interval
sigmasq <- 1
phi <- 12
tausq <- 0.0001
# covariance matrix at the observed locations, plus a nugget term tausq
C <- sigmasq * exp(-phi * as.matrix(dist(x))^2) + tausq * diag(n)

y <- t(chol(C)) %*% rnorm(n) # draw y ~ N(0, C) via the Cholesky factor

df <- data.frame(x=x, y=y)
df %>% ggplot(aes(x,y)) +
  geom_point() +
  theme_minimal()

More on point predictions

  • Let’s make predictions on a fine grid (a code sketch follows below)
  • The GP interpolates!
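A minimal sketch of these grid predictions (reusing x, y, n, sigmasq, phi, tausq from the simulation code above; the grid size and plotting choices are illustrative, and the nugget is treated as part of the predictive variance at new locations).

xgrid   <- seq(0, 1, length.out = 200)
C_obs   <- sigmasq * exp(-phi * as.matrix(dist(x))^2) + tausq * diag(n)
C_cross <- sigmasq * exp(-phi * outer(xgrid, x, function(a, b) (a - b)^2))
m_pred  <- drop(C_cross %*% solve(C_obs, y))                     # point predictions on the grid
v_pred  <- sigmasq + tausq -
  rowSums(C_cross * t(solve(C_obs, t(C_cross))))                 # point-wise predictive variances

data.frame(x = xgrid, m = m_pred,
           lo = qnorm(0.025, m_pred, sqrt(v_pred)),
           hi = qnorm(0.975, m_pred, sqrt(v_pred))) %>%
  ggplot(aes(x, m)) +
  geom_ribbon(aes(ymin = lo, ymax = hi), alpha = 0.2) +
  geom_line() +
  geom_point(data = df, aes(x, y)) +
  theme_minimal()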

More on point predictions

  • Far from data? Prediction = prior mean of the GP (zero, usually)

More on point predictions

  • Notice what happens when we increase tausq (now at 0.05)

More on uncertainty bands

  • Uncertainty increases as we move away from the observations

More on uncertainty bands

  • Far from data? Uncertainty band = prior variance

More on point predictions and uncertainty bands

  • Notice what happens when we increase tausq (now at 0.05)

Summary

  • The same things we see in 1 dimension occur in 2 or more dimensions!
  • As we move away from the observed data
    • The predictive mean gets closer to the prior mean we have assumed for the GP
    • Uncertainty bands become wider, up to being equal to the prior variance
  • This matches the intuition that spatial dependence is determined by distance
  • When we are far from data, we do not have information to make predictions
  • We thus revert to our prior information (i.e. our GP assumption)
  • Because the GP interpolates at the observed locations, it is a flexible model for any dataset
  • The Gaussian assumption is difficult to criticize when we only have 1 replicate of the data: with a single replicate, the GP will simply “connect all the points” via interpolation
  • In the presence of noise, other assumptions can be criticized, such as the choice of the covariance function and its parameters