Low-rank GPs and the Predictive GP

Spatial Statistics - BIOSTAT 696/896

Michele Peruzzi

University of Michigan

Introduction

  • We have introduced GP models
  • Spatial GP models are priors over surfaces, i.e. functions defined on a 2D spatial domain
  • Specifying the GP boils down to modeling covariance
  • GPs can be used as part of a Bayesian hierarchical model
  • We look for the posterior of a Bayesian model
  • When the joint posterior is “difficult” we design MCMC algorithms to sample from it
  • Models based on GPs scale poorly in memory, flops, and runtime (dense covariance operations cost O(n^2) storage and O(n^3) flops)

Scalable GP models

  • We will now consider building special GPs that can scale to larger datasets
  • There is no free lunch: scalability comes at a price
  • As usual, we will need to consider trade-offs and make decisions based on our modeling goals
  • Typically, we will restrict dependence across space, simplifying its structure
  • By restricting spatial dependence we will speed up computations
  • We will also lose some flexibility relative to an unrestricted GP

Graphical model interpretation of multivariate Gaussian random variables

  • Suppose we have a set of n random variables X_1, \dots, X_n

  • Call X = (X_1, \dots, X_n)^\top

  • Assume that X \sim N(0, \Sigma)

  • p(X_1, \dots, X_n) = p(X) = N(X; 0, \Sigma)

Graphical model interpretation of multivariate Gaussian random variables

  • Using the chain rule of probability we can write \begin{align*} p(X_1, \dots, X_n) &= p(X_1) p(X_2 | X_1) \cdots p(X_n | X_1, \dots, X_{n-1})\\ &= \prod_{i=1}^n p(X_i | X_{[1:i-1]}) \end{align*} where X_{[1:i-1]} = (X_1, \dots, X_{i-1})^\top and, by convention, p(X_1 | X_{[1:0]}) = p(X_1)
  • What is p(X_i | X_{[1:i-1]})?

Graphical model interpretation of multivariate Gaussian random variables

  • What is p(X_i | X_{[1:i-1]})?
  • p(X_i | X_{[1:i-1]}) is a conditional Gaussian distribution
  • This is the same distribution we get when we predict X_i from X_{[1:i-1]}: N(X_i; H_i X_{[1:i-1]}, R_i), where H_i = \Sigma(i, [1:i-1]) \Sigma([1:i-1], [1:i-1])^{-1} and R_i = \Sigma(i, i) - H_i \Sigma([1:i-1], i).
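As a concrete illustration, here is a minimal R sketch (the function and variable names are ours, not from the slides) that computes H_i and R_i from a covariance matrix and returns the conditional distribution of X_i given X_{[1:i-1]}:

# conditional distribution of X_i given X_1, ..., X_{i-1} when X ~ N(0, Sigma)
# Sigma: n x n covariance matrix; i: index with i >= 2; x_prev: values of X_1, ..., X_{i-1}
conditional_gaussian <- function(Sigma, i, x_prev) {
  idx <- seq_len(i - 1)
  H_i <- Sigma[i, idx, drop = FALSE] %*% solve(Sigma[idx, idx, drop = FALSE])
  R_i <- Sigma[i, i] - H_i %*% Sigma[idx, i]
  list(mean = as.numeric(H_i %*% x_prev), var = as.numeric(R_i))
}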

Graphical model interpretation of multivariate Gaussian random variables

  • Because p(X_1, \dots, X_n) = \prod_{i=1}^n p(X_i | X_{[1:i-1]}), we can draw a DAG according to which the joint density factorizes
  • This DAG already has the maximum number of edges possible for a DAG on n nodes, namely n(n-1)/2
  • Adding even a single extra directed edge would create a cycle
  • This DAG is dense
  • We can attribute the computational complexity of GPs to the fact that their DAGs are dense
  • Scaling GPs can be cast as a problem of sparsifying this DAG
  • In other words, methods to remove edges from the DAG in a smart way

Graphical model interpretation of multivariate Gaussian random variables

  • Draw a DAG for p(X_1, \dots, X_6) = p(X_1) p(X_2 |???) p(X_3 |???) \cdots p(X_6 | ???)

Graphical model interpretation of multivariate Gaussian random variables

  • Draw a DAG for p(X_1, \dots, X_6)
  • This DAG is dense

Graphical model interpretation of multivariate Gaussian random variables

  • Draw a DAG for p(X_1, \dots, X_n)
  • This DAG is dense


Low-rank Gaussian Processes

  • Dense dependence follows from assuming a GP
  • X(\cdot) \sim GP implies X_{\cal L} \sim N(0, \Sigma) implies dense DAG
  • We want X(\cdot) \sim \text{???} leading to a simpler DAG that is sparse

Then, remember how we build a stochastic process:

  1. Define the joint distribution for any finite set of random variables
  2. Check Kolmogorov’s conditions
  3. Result: there is a stochastic process whose finite dimensional distribution is the one we chose

Low-rank Gaussian Processes

  • Assume dense dependence for only a small set of variables
  • All the other variables are assumed conditionally independent given that small set
  • We assume the resulting joint density factorizes according to this DAG (see prelim slides)

Low-rank Gaussian Processes

  • Suppose from n variables we allow dense dependence for only k of them
  • p(X_1, \dots, X_n) = p(X_1, \dots, X_k, X_{k+1}, \dots, X_n)
  • The finite dimensional distribution we assume has the form

\begin{align*} p^*(X_1, \dots, X_n) &= p(X_1, \dots, X_k) \prod_{i=k+1}^n p(X_i | X_1, \dots, X_k) \end{align*}

  • We then assume that all these densities are Gaussian

\begin{align*} p^*(X_1, \dots, X_n) &= N(X_{[1:k]}; 0, \Sigma_{[1:k]}) \prod_{i=k+1}^n N(X_i; H_i X_{[1:k]}, R_i) \end{align*}

where, as usual,

\begin{align*} H_i &= \Sigma_{i, [1:k]} \Sigma_{[1:k]}^{-1}\\ R_i &= \Sigma_i - H_i \Sigma_{i, [1:k]}^\top \end{align*}
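To make this concrete, here is a minimal R sketch (the names and the unit-variance exponential covariance are our own illustrative choices, with the knots kept in a separate vector) that builds all the H_i and R_i at once and simulates from p^*: draw the knot values first, then every other variable from its conditional given the knots only.

set.seed(1)
n <- 500; k <- 25
locs  <- cbind(runif(n), runif(n))    # all (non-knot) locations
knots <- cbind(runif(k), runif(k))    # knot locations

# cross-distances and exponential covariance (variance 1, decay phi)
crossdist <- function(A, B) sqrt(pmax(outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B), 0))
covfun <- function(A, B, phi = 5) exp(-phi * crossdist(A, B))

C_K   <- covfun(knots, knots)         # k x k knot covariance
C_LK  <- covfun(locs, knots)          # n x k cross-covariance
H     <- C_LK %*% solve(C_K)          # the i-th row of H is H_i
Rdiag <- 1 - rowSums(H * C_LK)        # R_i = C(s_i, s_i) - H_i C(K, s_i); here C(s, s) = 1

# simulate from p^*: knots first, then the rest conditionally independently
X_K <- as.numeric(t(chol(C_K)) %*% rnorm(k))
X_L <- as.numeric(H %*% X_K) + rnorm(n, sd = sqrt(pmax(Rdiag, 0)))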

Low-rank Gaussian Processes

  • We allow dense dependence at a special set of locations which we call the knots
  • We assume that the process values at all other locations are conditionally independent of one another, given the value of the process at the knots
  • We make a new joint density p^* that (unlike the original p density) factorizes according to the DAG we saw earlier
  • To define a stochastic process, we need to define the finite dimensional distributions for any set \cal L of locations. Call \cal K the knots, assume \cal L \cap \cal K = \emptyset, then

\begin{align*} p^*(X_1, \dots, X_n) &= p^*(X_{\cal L}) \\ &= \int p^*(X_{\cal L} , X_{\cal K}) dX_{\cal K} \\ &= \int p^*(X_{\cal L} | X_{\cal K}) p(X_{\cal K}) dX_{\cal K} \\ &= \int \prod_{i \in \cal L} p(X_i | X_{\cal K}) p(X_{\cal K}) dX_{\cal K} \end{align*}

Low-rank Gaussian Processes

  • We allow dense dependence at a special set of locations which we call the knots
  • We assume that the process values at all other locations are conditionally independent of one another, given the value of the process at the knots
  • To define a stochastic process, we need to define the finite dimensional distributions for any set \cal L of locations. Call \cal K the knots, assume \cal K \subset \cal L and call \cal U = \cal L \setminus \cal K

\begin{align*} p^*(X_1, \dots, X_n) &= p^*(X_{\cal L}) = p^*(X_{\cal K}, X_{\cal U}) \\ &= p(X_{\cal K}) \prod_{i \in \cal U} p(X_i | X_{\cal K}) \end{align*}

Low-rank Gaussian Processes

From \begin{align*} p^*(X_1, \dots, X_n) &= \int \prod_{i \in \cal L} p(X_i | X_{\cal K}) p(X_{\cal K}) dX_{\cal K} \end{align*} we assume all densities are Gaussian

\begin{align*} p^*(X_1, \dots, X_n) &= \int \prod_{i \in \cal L} N(X_i; H_i X_{\cal K}, R_i ) N(X_{\cal K}; 0, C_{\cal K}) dX_{\cal K} \\ &= \int N(X_{\cal L}; H_{\cal L} X_{\cal K}, D_{\cal L} ) N(X_{\cal K}; 0, C_{\cal K}) dX_{\cal K} \end{align*}

  • p^*(X_1, \dots, X_n) = ????
  • We define D_{\cal L} as the diagonal matrix collecting all R_i.

Low-rank Gaussian Processes

\begin{align*} p^*(X_1, \dots, X_n) &= \int N(X_{\cal L}; H_{\cal L} X_{\cal K}, D_{\cal L} ) N(X_{\cal K}; 0, C_{\cal K}) dX_{\cal K} \end{align*} We can write this as

\begin{align*} p(X_{\cal K}): \qquad X_{\cal K} &\sim N(0, C_{\cal K})\\ p(X_{\cal L} | X_{\cal K}): \qquad X_{\cal L} &= H_{\cal L} X_{\cal K} + \eta_{\cal L}, \qquad \eta_{\cal L} \sim N(0, D_{\cal L}) \end{align*} Therefore \begin{align*} p(X_{\cal L}): \qquad X_{\cal L} &\sim N(0, H_{\cal L} C_{\cal K} H_{\cal L}^\top + D_{\cal L}) \end{align*}

And here we see that

H_{\cal L} C_{\cal K} H_{\cal L}^\top = C_{\cal L, \cal K} C_{\cal K}^{-1} C_{\cal K} C_{\cal K}^{-1} C_{\cal K, \cal L} = C_{\cal L, \cal K} C_{\cal K}^{-1} C_{\cal K, \cal L}
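Continuing the hypothetical sketch from a few slides back (same covfun, H, C_K and Rdiag), we can verify numerically that H_{\cal L} C_{\cal K} H_{\cal L}^\top + D_{\cal L} reproduces the original variances on the diagonal but not the covariances off the diagonal:

Sigma_full <- covfun(locs, locs)                    # unrestricted GP covariance
Sigma_lr   <- H %*% C_K %*% t(H) + diag(Rdiag)      # low-rank covariance H C_K H' + D
max(abs(diag(Sigma_lr) - diag(Sigma_full)))         # essentially zero: variances are preserved
max(abs(Sigma_lr - Sigma_full))                     # positive: off-diagonal entries differ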

Low-rank Gaussian Processes

From the previous slide: if X(s) and X(s') are both realizations of a low-rank GP with knots \cal K, then for s \ne s', Cov(X(s), X(s')) = C(s, \cal K) C_{\cal K}^{-1} C(\cal K, s') \ne C(s, s') in general

Here C(s, \cal K) denotes the vector of covariances, under our chosen covariance function C(\cdot, \cdot), between s and the knots. If instead s = s', then Cov(X(s), X(s)) = C(s, \cal K) C_{\cal K}^{-1} C(\cal K, s) + D_{s} = C(s, s), because the diagonal entry D_s is defined precisely to restore the original variance.

Notice anything? What is Cov(X(s), X(s')) in the unrestricted GP?
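As a quick numerical answer to the question above, reusing the hypothetical covfun, knots and C_K from the earlier sketch (the two points are arbitrary):

s      <- matrix(c(0.30, 0.40), 1, 2)
sprime <- matrix(c(0.35, 0.45), 1, 2)
c_true <- covfun(s, sprime)                                          # C(s, s'), the unrestricted GP covariance
c_lr   <- covfun(s, knots) %*% solve(C_K, t(covfun(sprime, knots)))  # C(s, K) C_K^{-1} C(K, s')
c(c_true, c_lr)                                                      # the two values generally differ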

Low-rank Gaussian Processes: induced covariance function

If X(\cdot) is a low-rank GP constructed using covariance function C(\cdot, \cdot), then

  • Cov(X(s), X(s')) \neq C(s, s')
  • Equivalently, X(\cdot) is an unrestricted GP using the covariance function \tilde{C}(\cdot, \cdot) where

\tilde{C}(s, s') = Cov(X(s), X(s')) = \begin{cases} C(s, \cal K)\, C_{\cal K}^{-1}\, C(\cal K, s') & \text{if } s \neq s' \\ C(s, s') & \text{if } s = s' \end{cases}

In the literature

What we have seen in the previous slides corresponds to Finley, Sang, Banerjee, Gelfand (2009) doi:10.1016/j.csda.2008.09.008.

  • Quiñonero-Candela and Rasmussen (2005) https://www.jmlr.org/papers/v6/quinonero-candela05a.html
  • Banerjee, Gelfand, Finley and Sang (2008) doi:10.1111/j.1467-9868.2008.00663.x
  • The ML literature typically refers to these models as sparse GPs
  • “Knots” are sometimes called “sensors” or “inducing points”

In a regression model

  • Assume observed locations are \cal L
  • Make all the usual assumptions for regression and prior
  • The only change: w \sim LRGP with covariance function C(\cdot, \cdot) and knots \cal K:

Y = X\beta + w + \varepsilon

We can write the model as

Y = X\beta + H w_{\cal K} + \eta + \varepsilon

where H = C_{\cal L, \cal K} C_{\cal K}^{-1} and \eta \sim ???

  • If we omit \eta from the model, then we are fitting a regression model based on the Predictive Gaussian Process as introduced in Banerjee, Gelfand, Finley and Sang (2008)
  • If we include \eta then we are fitting a Modified Predictive Gaussian Process model as in Finley, Sang, Banerjee, Gelfand (2009). Equivalently,

Y = X\beta + H w_{\cal K} + \varepsilon^* \qquad \text{where} \qquad \varepsilon^* \sim N(0, \tau^2 I_n + D_{\cal L})
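The marginal covariance \Sigma^* = \tau^2 I_n + D_{\cal L} + H C_{\cal K} H^\top is where the computational savings show up: by the Sherman-Morrison-Woodbury identity, its inverse only requires solving k \times k systems. A minimal R sketch under the same hypothetical setup as before (in practice one applies \Sigma^{*-1} to vectors rather than forming the n \times n matrix):

tau2 <- 0.1
dvec <- tau2 + Rdiag                     # diagonal of tau^2 I_n + D_L
# Woodbury: (diag(dvec) + H C_K H')^{-1} = D^{-1} - D^{-1} H (C_K^{-1} + H' D^{-1} H)^{-1} H' D^{-1}
DinvH <- H / dvec                        # D^{-1} H (divides row i of H by dvec[i])
M     <- solve(C_K) + t(H) %*% DinvH     # k x k inner matrix: only small solves needed
Sigma_star_inv <- diag(1 / dvec) - DinvH %*% solve(M, t(DinvH))
# brute-force check (only feasible here because n is small)
max(abs(Sigma_star_inv - solve(diag(dvec) + H %*% C_K %*% t(H))))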

Low-rank Gaussian Processes in R

  • In R, the spBayes package implements the (modified) predictive process as we have seen here
  • Use argument knots=... in spLM
  • Everything we said earlier about spLM remains true even when using the predictive process
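A hedged usage sketch (the data objects y, x1, x2, coords are hypothetical, and the starting, tuning, and prior values are purely illustrative; see ?spLM for the exact interface and defaults):

library(spBayes)

# coords: n x 2 matrix of observed locations; knots = c(6, 6) requests a 6 x 6 knot grid
# (an m x 2 matrix of knot coordinates can be passed instead)
fit <- spLM(y ~ x1 + x2, coords = coords, knots = c(6, 6),
            starting = list("sigma.sq" = 1, "tau.sq" = 0.1, "phi" = 3),
            tuning   = list("sigma.sq" = 0.1, "tau.sq" = 0.1, "phi" = 0.1),
            priors   = list("beta.Norm" = list(rep(0, 3), diag(1000, 3)),
                            "sigma.sq.IG" = c(2, 1), "tau.sq.IG" = c(2, 0.1),
                            "phi.Unif" = c(3 / 1, 3 / 0.1)),
            cov.model = "exponential",
            modified.pp = TRUE,     # TRUE: modified predictive process (keeps the eta correction)
            n.samples = 5000)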