Low-rank GPs and the Predictive GP
Spatial Statistics - BIOSTAT 696/896
Michele Peruzzi
University of Michigan
Introduction
- We have introduced GP models
- Spatial GP models are priors over 2D surfaces (or 2D functions)
- Specifying the GP boils down to modeling covariance
- GPs can be used as part of a Bayesian hierarchical model
- We look for the posterior of a Bayesian model
- When the joint posterior is “difficult” we design MCMC algorithms to sample from it
- Models based on GPs scale poorly in terms of memory, flops, and computing time
Scalable GP models
- We will now consider building special GPs that can scale to larger datasets
- There is no free lunch: scalability comes at a price
- As usual, we will need to consider trade-offs and make decisions based on our modeling goals
- Typically, we will restrict dependence across space, simplifying its structure
- By restricting spatial dependence we will speed up computations
- We will also lose flexibility relative to an unrestricted GP
Graphical model interpretation of multivariate Gaussian random variables
Suppose we have a set of n random variables X1,…,Xn
Call X=(X1,…,Xn)⊤
Assume that X∼N(0,Σ)
p(X1,…,Xn)=p(X)=N(X;0,Σ)
Graphical model interpretation of multivariate Gaussian random variables
- Using the properties of probability we can write
$$p(X_1,\dots,X_n) = p(X_1)\,p(X_2 \mid X_1)\cdots p(X_n \mid X_1,\dots,X_{n-1}) = \prod_{i=1}^{n} p(X_i \mid X_{[1:i-1]})$$
where $X_{[1:i-1]} = (X_1,\dots,X_{i-1})^\top$
- What is p(Xi∣X[1:i−1])?
Graphical model interpretation of multivariate Gaussian random variables
- What is $p(X_i \mid X_{[1:i-1]})$?
- p(Xi∣X[1:i−1]) is a conditional Gaussian distribution
- This is the same distribution that we get when we predict $X_i$ using $X_{[1:i-1]}$:
$$N(X_i;\; H_i X_{[1:i-1]},\; R_i)$$
where $H_i = \Sigma_{(i,[1:i-1])}\,\Sigma_{([1:i-1],[1:i-1])}^{-1}$ and $R_i = \Sigma_{(i,i)} - H_i\,\Sigma_{([1:i-1],i)}$.
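As a concrete illustration (not part of the original slides), here is a minimal R sketch of these conditional moments; the exponential covariance and the random locations are assumptions made purely for the example:

```r
# Minimal sketch (illustrative setup): compute H_i and R_i for each
# conditional p(X_i | X_{1:i-1}) from a covariance matrix Sigma.
set.seed(1)
n <- 5
coords <- matrix(runif(2 * n), ncol = 2)     # n random locations in [0,1]^2
Sigma  <- exp(-as.matrix(dist(coords)))      # exponential covariance, range 1

for (i in 2:n) {
  prev <- 1:(i - 1)
  H_i  <- Sigma[i, prev, drop = FALSE] %*% solve(Sigma[prev, prev])
  R_i  <- Sigma[i, i] - H_i %*% Sigma[prev, i]
  # p(X_i | X_{1:i-1}) = N(X_i; H_i %*% X_{1:i-1}, R_i)
}
```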
Graphical model interpretation of multivariate Gaussian random variables
- Because $p(X_1,\dots,X_n) = \prod_{i=1}^{n} p(X_i \mid X_{[1:i-1]})$, we can draw a DAG according to which the joint density factorizes
- The number of edges cannot be any larger (for a DAG)
- If we added even just a single directed edge, the graphical model would form a cycle
- This DAG is dense
- We can attribute the computational complexity of GPs to the fact that their DAGs are dense
- Scaling GPs can be cast as a problem of sparsifying this DAG
- In other words, methods to remove edges from the DAG in a smart way
Graphical model interpretation of multivariate Gaussian random variables
- Draw a DAG for p(X1,…,X6)=p(X1)p(X2∣???)p(X3∣???)⋯p(X6∣ ???)
Graphical model interpretation of multivariate Gaussian random variables
- Draw a DAG for p(X1,…,X6)
- This DAG is dense
Graphical model interpretation of multivariate Gaussian random variables
- Draw a DAG for p(X1,…,Xn)
- This DAG is dense
Low-rank Gaussian Processes
- Dense dependence follows from assuming a GP
- X(⋅)∼GP implies XL∼N(0,Σ) implies dense DAG
- We want X(⋅)∼??? leading to a simpler DAG that is sparse
Then, remember how we build a stochastic process:
- Define the joint distribution for any finite set of random variables
- Check Kolmogorov’s conditions
- Result: there is a stochastic process whose finite dimensional distribution is the one we chose
Low-rank Gaussian Processes
- Assume dense dependence for only a small set of variables
- All the other variables are assumed conditionally independent given that small set
- We assume the resulting joint density factorizes according to this DAG (see prelim slides)
Low-rank Gaussian Processes
- Suppose from n variables we allow dense dependence for only k of them
- p(X1,…,Xn)=p(X1,…,Xk,Xk+1,…,Xn)
- The finite dimensional distribution we assume has the form
$$p^*(X_1,\dots,X_n) = p(X_1,\dots,X_k)\prod_{i=k+1}^{n} p(X_i \mid X_1,\dots,X_k)$$
- We then assume that all these densities are Gaussian
$$p^*(X_1,\dots,X_n) = N(X_{[1:k]};\, 0,\, \Sigma_{[1:k]})\prod_{i=k+1}^{n} N(X_i;\, H_i X_{[1:k]},\, R_i)$$
where, as usual,
$$H_i = \Sigma_{i,[1:k]}\,\Sigma_{[1:k]}^{-1} \qquad\text{and}\qquad R_i = \Sigma_{i,i} - H_i\,\Sigma_{i,[1:k]}^{\top}$$
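A minimal R sketch of this factorization, assuming (for illustration only) an exponential covariance and taking the first $k$ variables as the densely dependent set:

```r
# Sketch: dense dependence among the first k variables; each remaining X_i
# conditions only on X_{[1:k]}, so H_i is 1 x k and R_i is a scalar.
set.seed(2)
n <- 50; k <- 10
coords <- matrix(runif(2 * n), ncol = 2)
Sigma  <- exp(-as.matrix(dist(coords)))      # exponential covariance (assumed)

idx_k <- 1:k
H <- Sigma[-idx_k, idx_k] %*% solve(Sigma[idx_k, idx_k])      # rows are H_i, i = k+1, ..., n
R <- diag(Sigma)[-idx_k] - rowSums(H * Sigma[-idx_k, idx_k])  # the corresponding R_i
```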
Low-rank Gaussian Processes
- We allow dense dependence at a special set of locations which we call the knots
- We assume that the process at all other locations is conditionally independent given the value of the process at the knots
- We make a new joint density p∗ that (unlike the original p density) factorizes according to the DAG we saw earlier
- To define a stochastic process, we need to define the finite dimensional distributions for any set L of locations. Call K the knots, assume L∩K=∅, then
$$p^*(X_1,\dots,X_n) = p^*(X_L) = \int p^*(X_L, X_K)\, dX_K = \int p^*(X_L \mid X_K)\, p(X_K)\, dX_K = \int \prod_{i \in L} p(X_i \mid X_K)\, p(X_K)\, dX_K$$
Low-rank Gaussian Processes
- We allow dense dependence at a special set of locations which we call the knots
- We assume that the process at all other locations is conditionally independent given the value of the process at the knots
- To define a stochastic process, we need to define the finite dimensional distributions for any set L of locations. Call K the knots, assume K⊂L and call U=L∖K
$$p^*(X_1,\dots,X_n) = p^*(X_L) = p^*(X_K, X_U) = p^*(X_U \mid X_K)\, p(X_K) = \prod_{i \in U} p(X_i \mid X_K)\, p(X_K)$$
Low-rank Gaussian Processes
From $p^*(X_1,\dots,X_n) = \int \prod_{i \in L} p(X_i \mid X_K)\, p(X_K)\, dX_K$ we assume all densities are Gaussian
$$p^*(X_1,\dots,X_n) = \int \prod_{i \in L} N(X_i;\, H_i X_K,\, R_i)\; N(X_K;\, 0,\, C_K)\, dX_K = \int N(X_L;\, H_L X_K,\, D_L)\; N(X_K;\, 0,\, C_K)\, dX_K$$
- p∗(X1,…,Xn)= ????
- We define $D_L$ as the diagonal matrix collecting all the $R_i$, and $H_L$ as the matrix whose $i$-th row is $H_i$.
Low-rank Gaussian Processes
$$p^*(X_1,\dots,X_n) = \int N(X_L;\, H_L X_K,\, D_L)\; N(X_K;\, 0,\, C_K)\, dX_K$$
We can write this as
$$p(X_K):\quad X_K \sim N(0,\, C_K)$$
$$p(X_L \mid X_K):\quad X_L = H_L X_K + \eta_L, \qquad \eta_L \sim N(0,\, D_L)$$
Therefore
$$p^*(X_L):\quad X_L \sim N(0,\, H_L C_K H_L^\top + D_L)$$
And here we see that
$$H_L C_K H_L^\top = C_{L,K}\, C_K^{-1}\, C_K\, C_K^{-1}\, C_{K,L} = C_{L,K}\, C_K^{-1}\, C_{K,L}$$
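To check this numerically, here is a small self-contained R sketch (again assuming an exponential covariance and a regular knot grid, both illustrative) comparing the induced covariance $H_L C_K H_L^\top + D_L$ with the full covariance matrix $C_L$:

```r
# Sketch: induced covariance of the (modified) low-rank GP vs. the full GP covariance.
set.seed(3)
n <- 50
coords <- matrix(runif(2 * n), ncol = 2)
knots  <- as.matrix(expand.grid(seq(0, 1, length.out = 4),
                                seq(0, 1, length.out = 4)))

cross_cov <- function(A, B) {
  d2 <- outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B)
  exp(-sqrt(pmax(d2, 0)))                          # exponential covariance, range 1
}

C_L  <- cross_cov(coords, coords)                  # full n x n covariance
C_K  <- cross_cov(knots, knots)
C_LK <- cross_cov(coords, knots)

H_L <- C_LK %*% solve(C_K)
D_L <- diag(drop(1 - rowSums(H_L * C_LK)))         # diagonal matrix of the R_i
V   <- H_L %*% C_K %*% t(H_L) + D_L                # induced covariance

max(abs(diag(V) - diag(C_L)))   # ~ 0: variances match the original C
max(abs(V - C_L))               # > 0: off-diagonal covariances differ from C
```

The diagonal correction $D_L$ is what restores the marginal variances; dropping it corresponds to the unmodified predictive process discussed later.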
Low-rank Gaussian Processes
From the previous slide: if $X(s)$ and $X(s')$ are both realizations of a low-rank GP with knots $K$, then if $s \neq s'$
$$\mathrm{Cov}(X(s), X(s')) = C(s,K)\, C_K^{-1}\, C(K,s') \neq C(s,s')$$
where $C(s,K)$ is computed from our chosen covariance function $C(\cdot,\cdot)$. Whereas if $s = s'$ then
$$\mathrm{Cov}(X(s), X(s')) = C(s,K)\, C_K^{-1}\, C(K,s') + D_s = C(s,s)$$
Notice anything? What is Cov(X(s),X(s′)) in the unrestricted GP?
Low-rank Gaussian Processes: induced covariance function
If X(⋅) is a low-rank GP constructed using covariance function C(⋅,⋅), then
- $\mathrm{Cov}(X(s), X(s')) \neq C(s,s')$ in general
- Equivalently, X(⋅) is an unrestricted GP using the covariance function C~(⋅,⋅) where
$$\tilde C(s,s') = \mathrm{Cov}(X(s), X(s')) = C(s,K)\, C_K^{-1}\, C(K,s') \quad \text{if } s \neq s'$$
$$\tilde C(s,s') = C(s,s') \quad \text{if } s = s'$$
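A pointwise R sketch of this induced covariance function, with an exponential base covariance assumed for illustration:

```r
# Sketch: the induced covariance function C_tilde(s, s') of the low-rank GP,
# with the diagonal correction so that C_tilde(s, s) = C(s, s).
C_base <- function(a, b) exp(-sqrt(sum((a - b)^2)))   # exponential covariance

C_tilde <- function(s, sp, knots) {
  C_K   <- exp(-as.matrix(dist(knots)))               # C(K, K)
  c_sK  <- apply(knots, 1, C_base, a = s)             # C(s, K)
  c_Ksp <- apply(knots, 1, C_base, a = sp)            # C(K, s')
  if (isTRUE(all.equal(s, sp))) {
    C_base(s, sp)                                     # = C(s, s) on the diagonal
  } else {
    drop(t(c_sK) %*% solve(C_K, c_Ksp))               # low-rank off the diagonal
  }
}

knots <- as.matrix(expand.grid(seq(0, 1, length.out = 4),
                               seq(0, 1, length.out = 4)))
C_tilde(c(0.2, 0.3), c(0.25, 0.35), knots)   # generally differs from C_base(...)
C_base(c(0.2, 0.3), c(0.25, 0.35))
```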
In the literature
What we have seen in the previous slides corresponds to Finley, Sang, Banerjee and Gelfand (2009), doi:10.1016/j.csda.2008.09.008.
- Quiñonero-Candela and Rasmussen (2005), https://www.jmlr.org/papers/v6/quinonero-candela05a.html
- Banerjee, Gelfand, Finley and Sang (2008), doi:10.1111/j.1467-9868.2008.00663.x
- The ML literature typically refers to these models as sparse GPs
- “Knots” are sometimes called “sensors” or “inducing points”
In a regression model
- Assume observed locations are L
- Make all the usual assumptions for regression and prior
- The only change: w∼LRGP with covariance function C(⋅,⋅) and knots K:
$$Y = X\beta + w + \varepsilon$$
We can write the model as
$$Y = X\beta + H w_K + \eta + \varepsilon$$
where $H = C_{L,K}\, C_K^{-1}$ and $\eta \sim$ ???
- If we omit η from the model, then we are fitting a regression model based on the Predictive Gaussian Process as introduced in Banerjee, Gelfand, Finley and Sang (2008)
- If we include η then we are fitting a Modified Predictive Gaussian Process model as in Finley, Sang, Banerjee, Gelfand (2009). Equivalently,
$$Y = X\beta + H w_K + \varepsilon^*, \qquad \text{where } \varepsilon^* \sim N(0,\, \tau^2 I_n + D_L)$$
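A small R sketch simulating data from this modified predictive process regression; the exponential covariance, knot grid, and all parameter values are illustrative assumptions, not from the slides:

```r
# Sketch: simulate Y = X beta + H w_K + eps*, with eps* ~ N(0, tau^2 I_n + D_L).
set.seed(4)
n <- 200
coords <- matrix(runif(2 * n), ncol = 2)
knots  <- as.matrix(expand.grid(seq(0, 1, length.out = 5),
                                seq(0, 1, length.out = 5)))

cross_cov <- function(A, B) {
  d2 <- outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B)
  exp(-sqrt(pmax(d2, 0)))                            # exponential covariance, range 1
}

C_K  <- cross_cov(knots, knots)
C_LK <- cross_cov(coords, knots)
H    <- C_LK %*% solve(C_K)                          # H = C_{L,K} C_K^{-1}
R    <- pmax(drop(1 - rowSums(H * C_LK)), 0)         # diagonal of D_L

beta <- c(1, -2); tau2 <- 0.1
X    <- cbind(1, rnorm(n))
w_K  <- drop(t(chol(C_K)) %*% rnorm(nrow(knots)))    # w_K ~ N(0, C_K)
eps  <- rnorm(n, sd = sqrt(tau2 + R))                # eps* ~ N(0, tau^2 I_n + D_L)
y    <- drop(X %*% beta + H %*% w_K + eps)
```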
Low-rank Gaussian Processes in R
- In R, the spBayes package implements the (modified) predictive process as we have seen here
- Use argument knots=... in spLM
- Everything we said earlier about spLM remains true even when using the predictive process
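A minimal usage sketch on toy data; the knot grid, starting values, tuning, and priors below are placeholders rather than recommended settings (see ?spLM):

```r
library(spBayes)

set.seed(5)
n <- 200
coords <- cbind(runif(n), runif(n))
x <- rnorm(n)
y <- 1 - 2 * x + rnorm(n)                  # toy response (no real spatial signal)

fit <- spLM(y ~ x, coords = coords,
            knots = c(6, 6, 0.1),          # 6 x 6 grid of knots over the domain
            starting = list(phi = 3 / 0.5, sigma.sq = 1, tau.sq = 0.1),
            tuning   = list(phi = 0.1, sigma.sq = 0.1, tau.sq = 0.1),
            priors   = list(beta.Norm = list(rep(0, 2), diag(1000, 2)),
                            phi.Unif = c(3 / 1, 3 / 0.05),
                            sigma.sq.IG = c(2, 1),
                            tau.sq.IG = c(2, 0.1)),
            cov.model = "exponential",
            modified.pp = TRUE,            # FALSE gives the unmodified predictive process
            n.samples = 2000)

fit <- spRecover(fit, start = 1001)        # recover beta and spatial effect samples
```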