Gaussian Processes models for point-referenced spatial data, part 2

Spatial Statistics - BIOSTAT 696/896

Michele Peruzzi

University of Michigan

Likelihood inference: MLE estimation of covariance parameters

  • Let’s consider the Greenness data
  • Small preprocessing: remove sample mean

Likelihood inference: MLE estimation of covariance parameters

  • Call \cal T our set of observed locations. The data vector at \cal T is Y_{\cal T}, or simply Y
  • Let’s use an exponential covariance model for a mean-zero GP
  • Because we are directly assuming that the data are a realization of a GP, this is called a GP model of the response or response model Y(\cdot) \sim GP(0, C_{\theta}(\cdot)) For our n-dimensional vector of observations, we have Y \sim N(0, C_{\theta}) where the (i,j) element of the C_{\theta} matrix is C_{\theta}(i,j) = \sigma^2 \exp \{ -\phi \|s_i - s_j \| \} where s_i, s_j \in \cal T.

Likelihood inference: MLE estimation of covariance parameters

  • Recall from preliminaries that because of our MVN assumption, we can write

\log p(Y \mid \theta) = -\frac{n}{2} \log(2\pi) - \sum_{i=1}^n\log(L_{ii}) - \frac{1}{2}Y^T L^{-T} L^{-1} Y

  • The matrix L is… ?
  • Our goal is to find the MLE of \theta

\hat{\theta}_{MLE} = \arg\max_{\theta} \log p(Y \mid \theta)

Likelihood inference: MLE estimation of covariance parameters

  • Recall from preliminaries that because of our MVN assumption, we can write

\log p(Y \mid \theta) = -\frac{n}{2} \log(2\pi) - \sum_{i=1}^n\log(L_{ii}) - \frac{1}{2}Y^T L^{-T} L^{-1} Y

  • The matrix L is the lower Cholesky factor of C_{\theta}, i.e. C_{\theta} = L L^\top
  • L is where \theta appears in the likelihood function
  • Our goal is to find the MLE of \theta

\hat{\theta}_{MLE} = \arg\max_{\theta} \log p(Y \mid \theta)

Likelihood inference: MLE estimation of covariance parameters

  • Recall from preliminaries that because of our MVN assumption, we can write

\log p(Y \mid \theta) = -\frac{n}{2} \log(2\pi) - \sum_{i=1}^n\log(L_{ii}) - \frac{1}{2}Y^T L^{-T} L^{-1} Y

  • The matrix L is the lower Cholesky factor of C_{\theta}, i.e. C_{\theta} = L L^\top
  • L is where \theta appears in the likelihood function
  • Our goal is to find the MLE of \theta: dropping the constant -\frac{n}{2}\log(2\pi) and flipping the sign, maximizing the log-likelihood is equivalent to minimizing the negative log-likelihood

\hat{\theta}_{MLE} = \arg\min_{\theta} \left( \sum_{i=1}^n\log(L_{ii}) + \frac{1}{2}Y^T L^{-T} L^{-1} Y \right)

Likelihood inference: MLE estimation of covariance parameters

  • Let’s do this in R in a simple way
set.seed(696)

coords <- df %>% dplyr::select(Longitude, Latitude) %>% as.matrix()
Y <- df$Greenness %>% as.vector()

Covariance_make <- function(theta){
  # theta is on the log scale so that optim can search over all of R^2
  sigmasq <- exp(theta[1])
  phi <- exp(theta[2])
  # exponential covariance matrix at all pairs of observed locations
  Cmatrix <- sigmasq * exp(-phi * as.matrix(dist(coords)) )
  return(Cmatrix)
}

mvn_eval <- function(theta){
  Ctheta <- Covariance_make(theta)
  Ltheta <- t(chol(Ctheta)) # lower Cholesky factor, Ctheta = L L^T
  Linv_Y <- solve(Ltheta, Y) # = L^-1 Y
  
  # negative log-likelihood, dropping the constant (n/2) log(2*pi) which does not affect the argmin
  neg_logdens <- sum(log(diag(Ltheta))) + 0.5 * crossprod(Linv_Y)
  return(neg_logdens)
}

(save_time <- system.time({ results <- optim(c(0,0), mvn_eval) }))
   user  system elapsed 
 45.836   3.691  39.925 

Likelihood inference: MLE estimation of covariance parameters

$par
[1] 6.063779 2.904695

$value
[1] 10342.57

$counts
function gradient 
      77       NA 

$convergence
[1] 0

$message
NULL
  • convergence=0 means that optim reports successful convergence to a (local) minimum
  • Because we parametrized on the log scale, \hat{\sigma}^2 = \exp(6.064) \approx 430.00 and \hat{\phi} = \exp(2.905) \approx 18.26
  • Let’s try different starting values:
results2 <- optim(c(-1,1), mvn_eval)
$par
[1] 6.065505 2.885855

$value
[1] 10342.61

$counts
function gradient 
      71       NA 

$convergence
[1] 0

$message
NULL
  • \hat{\sigma}^2 = 430.74 and \hat{\phi} = 17.92, very close to the first run (the back-transformation is checked in the snippet below)
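Since \theta was parametrized on the log scale, the estimates above come from exponentiating optim's output; a quick check, assuming results and results2 are the two optim outputs shown above:

exp(results$par)   # (sigmasq_hat, phi_hat) from the first starting values
exp(results2$par)  # (sigmasq_hat, phi_hat) from the second starting values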

Key take-aways from the previous slides

Scalability of likelihood-based inference of GP models

  • With n = 3008 observations the optimization routine takes about 40 seconds for 77 evaluations of the likelihood function
  • That is \approx 0.5 s per likelihood evaluation: what if we had many more observations? (see the timing sketch below)
  • Presumably, optimizing a more complex covariance function will require even more likelihood evaluations
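A minimal timing sketch (the locations and parameter values below are illustrative, not the Greenness data): each likelihood evaluation is dominated by the Cholesky factorization, whose cost grows roughly as O(n^3), so doubling n multiplies the time by about 8.

set.seed(696)
for(n in c(500, 1000, 2000, 4000)){
  coords_n <- matrix(runif(2 * n), ncol = 2)     # hypothetical random locations
  C_n <- exp(-10 * as.matrix(dist(coords_n)))    # exponential covariance with sigmasq = 1, phi = 10
  cat("n =", n, ":", system.time(chol(C_n))["elapsed"], "seconds\n")
}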

Covariance estimation (aka kernel learning)

  • We changed the starting values and we got slightly different optima
  • The likelihood values are very similar
  • In general, GP models have relatively flat likelihoods
  • Many different values of \theta lead to very similar p(Y \mid \theta)
  • This will be one justification for using a Bayesian model in which we include prior information on the parameters

Making predictions based on a GP response model

  • We have a point-estimate \hat{\theta} for our covariance parameters
  • In other words, we have found what values of \theta in our covariance model best fit our data
  • We do not have any other unknowns in our model
  • How do we make predictions?

Making predictions based on a GP response model

Back to our theory of Gaussian Processes:

  • Our GP assumption says that any finite collection of spatially-indexed random variables is jointly Gaussian
  • We have \cal T as our observed locations, Y_{\cal T} the realization of a GP at those locations
  • Suppose we now have \cal U as the set of locations where we want to make predictions, i.e. Y_{\cal U} is unobserved
  • \cal T and \cal U are both finite sets of locations, hence

\begin{bmatrix}Y_{\cal T} \\ Y_{\cal U}\end{bmatrix} \sim \text{????}

Making predictions based on a GP response model

Back to our theory of Gaussian Processes:

  • Our GP assumption says that any finite collection of spatially-indexed random variables is jointly Gaussian

  • With locations \cal L, we are saying Y_{\cal L} \sim N(0, C_{\cal L})

  • Key: for any finite set \cal L

  • In our applied problem we have \cal T as our observed locations, Y_{\cal T} the realization of a GP at those locations

  • Suppose we now have \cal U as the set of locations where we want to make predictions, i.e. Y_{\cal U} is unobserved

  • \cal T and \cal U are both finite sets of locations, hence we can build \cal L = \cal T \cup \cal U, and then:

Y_{\cal L} = \begin{bmatrix}Y_{\cal T} \\ Y_{\cal U}\end{bmatrix} \sim N\left( 0, \begin{bmatrix} C_{\cal T} & C_{\cal T, \cal U} \\ C_{\cal T, \cal U}^ \top & C_{\cal U} \end{bmatrix} \right) = N(0, C_{\cal L})

  • We should now be able to make predictions. How do we continue?

Making predictions based on a GP response model

Y_{\cal L} = \begin{bmatrix}Y_{\cal T} \\ Y_{\cal U}\end{bmatrix} \sim N\left( 0, \begin{bmatrix} C_{\cal T} & C_{\cal T, \cal U} \\ C_{\cal T, \cal U}^ \top & C_{\cal U} \end{bmatrix} \right) = N(0, C_{\cal L})

We use the joint density above to find the conditional density below:

Y_{\cal U} \mid Y_{\cal T} \sim N\left( m_{\cal U|\cal T}, C_{\cal U|\cal T} \right), where

\begin{align*} m_{\cal U|\cal T} &= C_{\cal U, \cal T} C_{\cal T}^{-1} Y_{\cal T} \\ C_{\cal U|\cal T} &= C_{\cal U} - C_{\cal U, \cal T} C_{\cal T}^{-1} C_{\cal T, \cal U} \end{align*}

The conditional mean is our point prediction at the new locations; the conditional variance quantifies our uncertainty at the new locations. In "new vs observed" notation:

\begin{align*} m_{\text{new}|\text{obs}} &= C_{\text{new}, \text{obs}} C_{\text{obs}}^{-1} Y_{\text{obs}} \\ C_{\text{new}|\text{obs}} &= C_{\text{new}} - C_{\text{new}, \text{obs}} C_{\text{obs}}^{-1} C_{\text{obs}, \text{new}} \end{align*}
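A minimal sketch of these formulas in R (not from the slides): it assumes coords and Y are the Greenness objects defined earlier, sigmasq_hat and phi_hat are plug-in MLE values, and coords_new is a hypothetical matrix of prediction locations.

gp_predict <- function(coords_new, coords, Y, sigmasq_hat, phi_hat){
  n_new <- nrow(coords_new); n_obs <- nrow(coords)
  # pairwise distances among all locations, then split into blocks
  D_all   <- as.matrix(dist(rbind(coords_new, coords)))
  D_new   <- D_all[1:n_new, 1:n_new, drop = FALSE]
  D_cross <- D_all[1:n_new, n_new + 1:n_obs, drop = FALSE]
  D_obs   <- D_all[n_new + 1:n_obs, n_new + 1:n_obs]
  # exponential covariance blocks at the plug-in parameter values
  C_new   <- sigmasq_hat * exp(-phi_hat * D_new)
  C_cross <- sigmasq_hat * exp(-phi_hat * D_cross)
  C_obs   <- sigmasq_hat * exp(-phi_hat * D_obs)
  # kriging equations above: conditional mean and conditional covariance
  m_pred <- C_cross %*% solve(C_obs, Y)
  C_pred <- C_new - C_cross %*% solve(C_obs, t(C_cross))
  list(mean = drop(m_pred), cov = C_pred)
}

For example, pred <- gp_predict(coords_new, coords, Y, 430, 18.3) would return the point predictions and the predictive covariance matrix at the new locations.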

Making predictions based on a GP response model

  • Uncertainty of predictions: we can use 95% credible bands point-wise. At a single new location s^*, we have

y(s^*) \mid Y_{\cal T} \sim N\left( m_{\text{pred}}, C_{\text{pred}} \right), where

\begin{align*} m_{\text{pred}} &= C(s^*, {\cal T}) C_{\cal T}^{-1} Y_{\cal T} \\ C_{\text{pred}} &= \sigma^2 - C(s^*, {\cal T}) C_{\cal T}^{-1} C({\cal T}, s^*) \end{align*}

  • Therefore U_{95} = \Phi^{-1}(0.975; m_{\text{pred}}, \sqrt{C_{\text{pred}}}) - \Phi^{-1}(0.025; m_{\text{pred}}, \sqrt{C_{\text{pred}}})
  • \Phi^{-1}(p; \mu, \sigma) is the inverse CDF of the univariate Gaussian with mean \mu and standard deviation \sigma. In R, qnorm.
  • Because \cal U can be large, applying the formula above one location at a time saves memory compared to building the full C_{\cal U \mid \cal T} and extracting its diagonal, although the latter also works (see the sketch below)
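Continuing the hypothetical gp_predict() sketch from the previous slide, the point-wise 95% bands could be computed as follows (pred, coords_new, sigmasq_hat, phi_hat are the illustrative names used there).

pred    <- gp_predict(coords_new, coords, Y, sigmasq_hat, phi_hat)
sd_pred <- sqrt(diag(pred$cov))                          # point-wise predictive standard deviations
lower   <- qnorm(0.025, mean = pred$mean, sd = sd_pred)
upper   <- qnorm(0.975, mean = pred$mean, sd = sd_pred)
U95     <- upper - lower                                 # width of the 95% band at each new location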

Making predictions based on a GP response model

  • As we move away from the observed data
    • The predictive mean gets closer to 0 (which is the prior mean we have assumed for the GP)
    • Uncertainty bands become wider
  • This matches the intuition that spatial dependence is determined by distance

More on point predictions and uncertainty bands

  • We considered the Greenness dataset, a gridded dataset in 2 dimensions
  • Let’s simplify and consider a 1-dimensional spatial domain and a small dataset
  • Explain what the code below is doing
  • What covariance function are we using?
set.seed(696)
n <- 10
x <- runif(n)                # n locations on the unit interval
sigmasq <- 1
phi <- 12
tausq <- 0.0001
# covariance matrix at the observed locations, plus a nugget term tausq
C <- sigmasq * exp(-phi * as.matrix(dist(x))^2) + tausq * diag(n)

y <- t(chol(C)) %*% rnorm(n) # draw y ~ N(0, C) via the Cholesky factor

df <- data.frame(x=x, y=y)
df %>% ggplot(aes(x,y)) +
  geom_point() +
  theme_minimal()

More on point predictions

  • Let’s make predictions on a fine grid (a code sketch follows below)
  • The GP interpolates!
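A minimal sketch of these grid predictions (reusing x, y, n, sigmasq, phi, tausq from the simulation code above; the grid size and plotting choices are illustrative, and the nugget is treated as part of the predictive variance at new locations).

xgrid   <- seq(0, 1, length.out = 200)
C_obs   <- sigmasq * exp(-phi * as.matrix(dist(x))^2) + tausq * diag(n)
C_cross <- sigmasq * exp(-phi * outer(xgrid, x, function(a, b) (a - b)^2))
m_pred  <- drop(C_cross %*% solve(C_obs, y))                     # point predictions on the grid
v_pred  <- sigmasq + tausq -
  rowSums(C_cross * t(solve(C_obs, t(C_cross))))                 # point-wise predictive variances

data.frame(x = xgrid, m = m_pred,
           lo = qnorm(0.025, m_pred, sqrt(v_pred)),
           hi = qnorm(0.975, m_pred, sqrt(v_pred))) %>%
  ggplot(aes(x, m)) +
  geom_ribbon(aes(ymin = lo, ymax = hi), alpha = 0.2) +
  geom_line() +
  geom_point(data = df, aes(x, y)) +
  theme_minimal()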

More on point predictions

  • Far from data? Prediction = prior mean of the GP (zero, usually)

More on point predictions

  • Notice what happens when we increase tausq (now at 0.05)

More on uncertainty bands

  • Uncertainty increases as we move away from the observations

More on uncertainty bands

  • Far from data? Uncertainty band = prior variance

More on point predictions and uncertainty bands

  • Notice what happens when we increase tausq (now at 0.05)

Summary

  • The same things we see in 1 dimension occur in 2 or more dimensions!
  • As we move away from the observed data
    • The predictive mean gets closer to the prior mean we have assumed for the GP
    • Uncertainty bands become wider, up to being equal to the prior variance
  • This matches the intuition that spatial dependence is determined by distance
  • When we are far from data, we do not have information to make predictions
  • We thus revert to our prior information (i.e. our GP assumption)
  • Because the GP interpolates at the observed locations, it is a flexible model for any dataset
  • The Gaussian assumption is difficult to criticize when we only have 1 replicate of the data: with a single replicate, the GP will simply “connect all the points” via interpolation
  • In the presence of noise, other assumptions can be criticized, such as the choice of the covariance function and its parameters