Low-rank GPs and the Predictive GP
Spatial Statistics - BIOSTAT 696/896
Michele Peruzzi
University of Michigan
Introduction
- We have introduced GP models
- Spatial GP models are priors over 2D surfaces (or 2D functions)
- Specifying the GP boils down to modeling covariance
- GPs can be used as part of a Bayesian hierarchical model
- We look for the posterior of a Bayesian model
- When the joint posterior is “difficult” we design MCMC algorithms to sample from it
- Models based on GPs scale poorly in terms of memory, flops, and runtime
Scalable GP models
- We will now consider building special GPs that can scale to larger datasets
- There is no free lunch: scalability comes at a price
- As usual, we will need to consider trade-offs and make decisions based on our modeling goals
- Typically, we will restrict dependence across space, simplifying its structure
- By restricting spatial dependence we will speed up computations
- We will also lose flexibility relative to an unrestricted GP
Graphical model interpretation of multivariate Gaussian random variables
Suppose we have a set of n random variables X_1, \dots, X_n
Call X = (X_1, \dots, X_n)^\top
Assume that X \sim N(0, \Sigma)
p(X_1, \dots, X_n) = p(X) = N(X; 0, \Sigma)
Graphical model interpretation of multivariate Gaussian random variables
- Using the properties of probability we can write \begin{align*}
p(X_1, \dots, X_n) &= p(X_1) p(X_2 | X_1) \cdots p(X_n | X_1, \dots, X_{n-1})\\
&= \prod_{i=1}^n p(X_i | X_{[1:i-1]})
\end{align*} where X_{[1:i-1]} = (X_1, \dots, X_{i-1})^\top
- What is p(X_i | X_{[1:i-1]})?
Graphical model interpretation of multivariate Gaussian random variables
- What is p(X_i | X_{[1:i-1]})?
- p(X_i | X_{[1:i-1]}) is a conditional Gaussian distribution
- This is the same distribution that we get when we predict X_i using X_{[1:i-1]}: N(X_i; H_i X_{[1:i-1]}, R_i), where H_i = \Sigma(i, [1:i-1]) \Sigma([1:i-1], [1:i-1])^{-1} and R_i = \Sigma(i, i) - H_i \Sigma([1:i-1], i).
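For concreteness, here is a minimal R sketch of this computation; the covariance matrix Sigma and the index i are made up purely for illustration.

```r
# Minimal R sketch (illustrative only): conditional distribution of X_i
# given X_1, ..., X_{i-1} under X ~ N(0, Sigma).
set.seed(696)
n <- 5
A <- matrix(rnorm(n * n), n, n)
Sigma <- crossprod(A) + diag(n)   # an arbitrary positive definite covariance

i    <- 4
prev <- 1:(i - 1)
H_i  <- Sigma[i, prev, drop = FALSE] %*% solve(Sigma[prev, prev])
R_i  <- Sigma[i, i] - H_i %*% Sigma[prev, i]

# p(X_i | X_{1:i-1}) = N(H_i %*% X_{1:i-1}, R_i)
```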
Graphical model interpretation of multivariate Gaussian random variables
- Because p(X_1, \dots, X_n) = \prod_{i=1}^n p(X_i | X_{[1:i-1]}), we can draw a DAG according to which the joint density factorizes
- The number of edges cannot be any larger (for a DAG)
- If we added even just a single directed edge, the graphical model would form a cycle
- This DAG is dense
- We can attribute the computational complexity of GPs to the fact that their DAGs are dense
- Scaling GPs can be cast as a problem of sparsifying this DAG
- In other words, methods to remove edges from the DAG in a smart way
Graphical model interpretation of multivariate Gaussian random variables
- Draw a DAG for p(X_1, \dots, X_6) = p(X_1) p(X_2 |???) p(X_3 |???) \cdots p(X_6 | ???)
Graphical model interpretation of multivariate Gaussian random variables
- Draw a DAG for p(X_1, \dots, X_6)
- This DAG is dense
Graphical model interpretation of multivariate Gaussian random variables
- Draw a DAG for p(X_1, \dots, X_n)
- This DAG is dense
Low-rank Gaussian Processes
- Dense dependence follows from assuming a GP
- X(\cdot) \sim GP implies X_{\cal L} \sim N(0, \Sigma) implies dense DAG
- We want X(\cdot) \sim \text{???} leading to a simpler DAG that is sparse
Then, remember how we build a stochastic process:
- Define the joint distribution for any finite set of random variables
- Check Kolmogorov’s conditions
- Result: there is a stochastic process whose finite dimensional distribution is the one we chose
Low-rank Gaussian Processes
- Assume dense dependence for only a small set of variables
- All the other variables are assumed conditionally independent given that small set
- We assume the resulting joint density factorizes according to this DAG (see prelim slides)
Low-rank Gaussian Processes
- Suppose from n variables we allow dense dependence for only k of them
- p(X_1, \dots, X_n) = p(X_1, \dots, X_k, X_{k+1}, \dots, X_n)
- The finite dimensional distribution we assume has the form
\begin{align*}
p^*(X_1, \dots, X_n) &= p(X_1, \dots, X_k) \prod_{i=k+1}^n p(X_i | X_1, \dots, X_k)
\end{align*}
- We then assume that all these densities are Gaussian
\begin{align*}
p^*(X_1, \dots, X_n) &= N(X_{[1:k]}; 0, \Sigma_{[1:k]}) \prod_{i=k+1}^n N(X_i; H_i X_{[1:k]}, R_i)
\end{align*}
where, as usual,
\begin{align*}
H_i &= \Sigma_{i, [1:k]} \Sigma_{[1:k], [1:k]}^{-1}\\
R_i &= \Sigma_{i, i} - H_i \Sigma_{[1:k], i}
\end{align*}
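A minimal R sketch of these quantities, again with a made-up covariance matrix, treating the first k variables as the densely dependent ones and computing every H_i and R_i at once:

```r
# Minimal R sketch (illustrative only): H_i and R_i when every X_i, i > k,
# is conditioned on the same first k variables.
set.seed(696)
n <- 8; k <- 3
A <- matrix(rnorm(n * n), n, n)
Sigma <- crossprod(A) + diag(n)   # arbitrary positive definite covariance

kk <- 1:k
H <- Sigma[(k + 1):n, kk] %*% solve(Sigma[kk, kk])   # rows are the H_i
R <- diag(Sigma)[(k + 1):n] -
     rowSums(H * Sigma[(k + 1):n, kk])               # the R_i, one per i
```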
Low-rank Gaussian Processes
- We allow dense dependence at a special set of locations which we call the knots
- We assume that the process at all other locations is conditionally independent given the value of the process at the knots
- We make a new joint density p^* that (unlike the original p density) factorizes according to the DAG we saw earlier
- To define a stochastic process, we need to define the finite dimensional distributions for any set \cal L of locations. Call \cal K the knots, assume \cal L \cap \cal K = \emptyset, then
\begin{align*}
p^*(X_1, \dots, X_n) &= p^*(X_{\cal L}) \\
&= \int p(X_{\cal L} , X_{\cal K}) dX_{\cal K} \\
&= \int p(X_{\cal L} | X_{\cal K}) p(X_{\cal K}) dX_{\cal K} \\
&= \int \prod_{i \in \cal L} p(X_i | X_{\cal K}) p(X_{\cal K}) dX_{\cal K}
\end{align*}
Low-rank Gaussian Processes
- We allow dense dependence at a special set of locations which we call the knots
- We assume that the process at all other locations is conditionally independent given the value of the process at the knots
- To define a stochastic process, we need to define the finite dimensional distributions for any set \cal L of locations. Call \cal K the knots, assume \cal K \subset \cal L and call \cal U = \cal L \setminus \cal K
\begin{align*}
p^*(X_1, \dots, X_n) &= p^*(X_{\cal L}) = p^*(X_{\cal K}, X_{\cal U}) \\
&= p^*(X_{\cal U} | X_{\cal K}) p(X_{\cal K}) = \prod_{i \in \cal U} p(X_i | X_{\cal K})\, p(X_{\cal K})
\end{align*}
Low-rank Gaussian Processes
From \begin{align*}
p^*(X_1, \dots, X_n) &= \int \prod_{i \in \cal L} p(X_i | X_{\cal K}) p(X_{\cal K}) dX_{\cal K}
\end{align*} we assume all densities are Gaussian
\begin{align*}
p^*(X_1, \dots, X_n) &= \int \prod_{i \in \cal L} N(X_i; H_i X_{\cal K}, R_i ) N(X_{\cal K}; 0, C_{\cal K}) dX_{\cal K} \\
&= \int N(X_{\cal L}; H_{\cal L} X_{\cal K}, D_{\cal L} ) N(X_{\cal K}; 0, C_{\cal K}) dX_{\cal K}
\end{align*}
- p^*(X_1, \dots, X_n) = ????
- We define D_{\cal L} as the diagonal matrix collecting all R_i.
Low-rank Gaussian Processes
\begin{align*}
p^*(X_1, \dots, X_n) &= \int N(X_{\cal L}; H_{\cal L} X_{\cal K}, D_{\cal L} ) N(X_{\cal K}; 0, C_{\cal K}) dX_{\cal K}
\end{align*} We can write this as
\begin{align*}
p(X_{\cal K}): \qquad X_{\cal K} &\sim N(0, C_{\cal K})\\
p(X_{\cal L} | X_{\cal K}): \qquad X_{\cal L} &= H_{\cal L} X_{\cal K} + \eta_{\cal L}, \qquad \eta_{\cal L} \sim N(0, D_{\cal L})
\end{align*} Therefore \begin{align*}
p(X_{\cal L}): \qquad X_{\cal L} &\sim N(0, H_{\cal L} C_{\cal K} H_{\cal L}^\top + D_{\cal L})
\end{align*}
And here we see that
H_{\cal L} C_{\cal K} H_{\cal L}^\top = C_{\cal L, \cal K} C_{\cal K}^{-1} C_{\cal K} C_{\cal K}^{-1} C_{\cal K, \cal L} = C_{\cal L, \cal K} C_{\cal K}^{-1} C_{\cal K, \cal L}
and from this, if X(s) and X(s') are both realizations of a low-rank GP with knots \cal K, then Cov(X(s), X(s')) = C(s, \cal K) C_{\cal K}^{-1} C(\cal K, s')
where C(s, \cal K) is our chosen covariance function evaluated between s and the knots \cal K.
Notice anything? What is Cov(X(s), X(s')) in the unrestricted GP?
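To make the comparison concrete, here is a minimal R sketch that builds both covariances side by side; the exponential covariance function, the locations, and the knots are all illustrative assumptions.

```r
# Minimal R sketch (illustrative only): predictive-process covariance
# vs. the full GP covariance, using an exponential covariance function.
set.seed(696)
expcov <- function(A, B, sigma2 = 1, phi = 2) {
  D <- as.matrix(dist(rbind(A, B)))[1:nrow(A), nrow(A) + 1:nrow(B), drop = FALSE]
  sigma2 * exp(-phi * D)
}

locs  <- matrix(runif(2 * 50), ncol = 2)   # 50 observed locations
knots <- matrix(runif(2 * 10), ncol = 2)   # 10 knots

C_L  <- expcov(locs, locs)                 # full GP covariance
C_K  <- expcov(knots, knots)
C_LK <- expcov(locs, knots)

H_L  <- C_LK %*% solve(C_K)
lowrank <- H_L %*% C_K %*% t(H_L)          # = C_LK C_K^{-1} C_KL
D_L  <- diag(diag(C_L - lowrank))          # residual variances R_i on the diagonal

# Predictive process covariance:          lowrank
# Modified predictive process covariance: lowrank + D_L
```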
In the literature
What we have seen in the previous slides corresponds to Finley, Sang, Banerjee, Gelfand (2009), doi:10.1016/j.csda.2008.09.008.
- Quiñonero-Candela and Rasmussen (2005)
https://www.jmlr.org/papers/v6/quinonero-candela05a.html
- Banerjee, Gelfand, Finley and Sang (2008)
doi:10.1111/j.1467-9868.2008.00663.x
- The ML literature typically refers to these models as sparse GPs
- “Knots” are sometimes called “sensors” or “inducing points”
In a regression model
- Assume observed locations are \cal L
- Make all the usual assumptions for regression and prior
- The only change: w \sim LRGP with covariance function C(\cdot, \cdot) and knots \cal K:
Y = X\beta + w + \varepsilon
We can write the model as
Y = X\beta + H w_{\cal K} + \eta + \varepsilon where H = C_{\cal L, \cal K} C_{\cal K}^{-1} and \eta \sim ???
- If we omit \eta from the model, then we are fitting a regression model based on the Predictive Gaussian Process as introduced in Banerjee, Gelfand, Finley and Sang (2008)
- If we include \eta then we are fitting a Modified Predictive Gaussian Process model as in Finley, Sang, Banerjee, Gelfand (2009). Equivalently,
Y = X\beta + H w_{\cal K} + \varepsilon^* \qquad \text{where} \qquad \varepsilon^* \sim N(0, \tau^2 I_n + D_{\cal L})
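A minimal simulation sketch of this modified predictive process regression; the covariance function, knots, covariates, and parameter values are all illustrative assumptions.

```r
# Minimal R sketch (illustrative only): simulate Y = X beta + H w_K + eps*,
# with eps* ~ N(0, tau^2 I_n + D_L). Same expcov helper as in the earlier sketch.
set.seed(696)
n <- 50; k <- 10
expcov <- function(A, B, sigma2 = 1, phi = 2) {
  D <- as.matrix(dist(rbind(A, B)))[1:nrow(A), nrow(A) + 1:nrow(B), drop = FALSE]
  sigma2 * exp(-phi * D)
}
locs  <- matrix(runif(2 * n), ncol = 2)
knots <- matrix(runif(2 * k), ncol = 2)

C_K  <- expcov(knots, knots)
C_LK <- expcov(locs, knots)
H    <- C_LK %*% solve(C_K)
D_L  <- diag(diag(expcov(locs, locs) - H %*% t(C_LK)))

beta <- c(1, -0.5); tau2 <- 0.1
X    <- cbind(1, rnorm(n))
w_K  <- drop(t(chol(C_K)) %*% rnorm(k))             # w_K ~ N(0, C_K)
eps_star <- rnorm(n, sd = sqrt(tau2 + diag(D_L)))   # eps* ~ N(0, tau^2 I + D_L)
Y    <- drop(X %*% beta + H %*% w_K + eps_star)
```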
Low-rank Gaussian Processes in R
- In R, the spBayes package implements the (modified) predictive process as we have seen here
- Use the argument knots=... in spLM
- Everything we said earlier about spLM remains true even when using the predictive process
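A sketch of such a fit is below; the priors, starting values, tuning, and knot grid are illustrative placeholders (the grid assumes coordinates in the unit square), so check ?spLM for the exact argument specifications.

```r
# Sketch of fitting the (modified) predictive process with spBayes::spLM.
library(spBayes)

# assume y (response), x (a covariate), and coords (n x 2 matrix) already exist
knots <- as.matrix(expand.grid(seq(0, 1, length.out = 6),
                               seq(0, 1, length.out = 6)))  # 36-knot grid

fit <- spLM(y ~ x, coords = coords, knots = knots,
            starting = list("sigma.sq" = 1, "tau.sq" = 0.1, "phi" = 3),
            tuning   = list("sigma.sq" = 0.1, "tau.sq" = 0.1, "phi" = 0.1),
            priors   = list("beta.Norm"   = list(rep(0, 2), diag(1000, 2)),
                            "sigma.sq.IG" = c(2, 1),
                            "tau.sq.IG"   = c(2, 0.1),
                            "phi.Unif"    = c(1, 30)),
            cov.model   = "exponential",
            modified.pp = TRUE,   # FALSE gives the unmodified predictive process
            n.samples   = 5000)
```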