Low-rank GPs and the Predictive GP
Spatial Statistics - BIOSTAT 696/896
Michele Peruzzi
University of Michigan
Introduction
- We have introduced GP models
- Spatial GP models are priors over 2D surfaces (or 2D functions)
- Specifying the GP boils down to modeling covariance
- GPs can be used as part of a Bayesian hierarchical model
- We look for the posterior of a Bayesian model
- When the joint posterior is “difficult” we design MCMC algorithms to sample from it
- Models based on GPs scale poorly in terms of memory, flops, and computing time
Scalable GP models
- We will now consider building special GPs that can scale to larger datasets
- There is no free lunch: scalability comes at a price
- As usual, we will need to consider trade-offs and make decisions based on our modeling goals
- Typically, we will restrict dependence across space, simplifying its structure
- By restricting spatial dependence we will speed up computations
- We will also lose flexibility relative to an unrestricted GP
Graphical model interpretation of multivariate Gaussian random variables
Suppose we have a set of n random variables X1,…,Xn
Call X=(X1,…,Xn)⊤
Assume that X∼N(0,Σ)
p(X1,…,Xn)=p(X)=N(X;0,Σ)
Graphical model interpretation of multivariate Gaussian random variables
- Using the properties of probability we can write
$$p(X_1,\dots,X_n) = p(X_1)\,p(X_2 \mid X_1)\cdots p(X_n \mid X_1,\dots,X_{n-1}) = \prod_{i=1}^{n} p(X_i \mid X_{[1:i-1]})$$
where $X_{[1:i-1]} = (X_1,\dots,X_{i-1})^\top$
- What is p(Xi∣X[1:i−1])?
Graphical model interpretation of multivariate Gaussian random variables
- What is $p(X_i \mid X_{[1:i-1]})$?
- p(Xi∣X[1:i−1]) is a conditional Gaussian distribution
- This is the same distribution that we get when we predict $X_i$ using $X_{[1:i-1]}$:
$$N(X_i;\; H_i X_{[1:i-1]},\; R_i)$$
where $H_i = \Sigma_{(i,[1:i-1])}\,\Sigma_{([1:i-1],[1:i-1])}^{-1}$ and $R_i = \Sigma_{(i,i)} - H_i\,\Sigma_{([1:i-1],i)}$.
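As a concrete illustration (not part of the original slides), here is a minimal R sketch of these conditional moments; the exponential covariance and the random locations are assumptions made purely for the example:

```r
# Minimal sketch (illustrative setup): compute H_i and R_i for each
# conditional p(X_i | X_{1:i-1}) from a covariance matrix Sigma.
set.seed(1)
n <- 5
coords <- matrix(runif(2 * n), ncol = 2)     # n random locations in [0,1]^2
Sigma  <- exp(-as.matrix(dist(coords)))      # exponential covariance, range 1

for (i in 2:n) {
  prev <- 1:(i - 1)
  H_i  <- Sigma[i, prev, drop = FALSE] %*% solve(Sigma[prev, prev])
  R_i  <- Sigma[i, i] - H_i %*% Sigma[prev, i]
  # p(X_i | X_{1:i-1}) = N(X_i; H_i %*% X_{1:i-1}, R_i)
}
```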
Graphical model interpretation of multivariate Gaussian random variables
- Because $p(X_1,\dots,X_n) = \prod_{i=1}^{n} p(X_i \mid X_{[1:i-1]})$, we can draw a DAG according to which the joint density factorizes
- The number of edges cannot be any larger (for a DAG)
- If we added even just a single directed edge, the graphical model would form a cycle
- This DAG is dense
- We can attribute the computational complexity of GPs to the fact that their DAGs are dense
- Scaling GPs can be cast as a problem of sparsifying this DAG
- In other words, methods to remove edges from the DAG in a smart way
Graphical model interpretation of multivariate Gaussian random variables
- Draw a DAG for p(X1,…,X6)=p(X1)p(X2∣???)p(X3∣???)⋯p(X6∣ ???)
Graphical model interpretation of multivariate Gaussian random variables
- Draw a DAG for p(X1,…,X6)
- This DAG is dense
Graphical model interpretation of multivariate Gaussian random variables
- Draw a DAG for p(X1,…,Xn)
- This DAG is dense
Low-rank Gaussian Processes
- Dense dependence follows from assuming a GP
- X(⋅)∼GP implies XL∼N(0,Σ) implies dense DAG
- We want X(⋅)∼??? leading to a simpler DAG that is sparse
Then, remember how we build a stochastic process:
- Define the joint distribution for any finite set of random variables
- Check Kolmogorov’s conditions
- Result: there is a stochastic process whose finite dimensional distribution is the one we chose
Low-rank Gaussian Processes
- Assume dense dependence for only a small set of variables
- All the other variables are assumed conditionally independent given that small set
- We assume the resulting joint density factorizes according to this DAG (see prelim slides)
Low-rank Gaussian Processes
- Suppose from n variables we allow dense dependence for only k of them
- p(X1,…,Xn)=p(X1,…,Xk,Xk+1,…,Xn)
- The finite dimensional distribution we assume has the form
$$p^*(X_1,\dots,X_n) = p(X_1,\dots,X_k)\prod_{i=k+1}^{n} p(X_i \mid X_1,\dots,X_k)$$
- We then assume that all these densities are Gaussian
$$p^*(X_1,\dots,X_n) = N(X_{[1:k]};\, 0,\, \Sigma_{[1:k]})\prod_{i=k+1}^{n} N(X_i;\, H_i X_{[1:k]},\, R_i)$$
where, as usual,
$$H_i = \Sigma_{i,[1:k]}\,\Sigma_{[1:k]}^{-1} \qquad\text{and}\qquad R_i = \Sigma_{i,i} - H_i\,\Sigma_{i,[1:k]}^{\top}$$
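A minimal R sketch of this factorization, assuming (for illustration only) an exponential covariance and taking the first $k$ variables as the densely dependent set:

```r
# Sketch: dense dependence among the first k variables; each remaining X_i
# conditions only on X_{[1:k]}, so H_i is 1 x k and R_i is a scalar.
set.seed(2)
n <- 50; k <- 10
coords <- matrix(runif(2 * n), ncol = 2)
Sigma  <- exp(-as.matrix(dist(coords)))      # exponential covariance (assumed)

idx_k <- 1:k
H <- Sigma[-idx_k, idx_k] %*% solve(Sigma[idx_k, idx_k])      # rows are H_i, i = k+1, ..., n
R <- diag(Sigma)[-idx_k] - rowSums(H * Sigma[-idx_k, idx_k])  # the corresponding R_i
```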
Low-rank Gaussian Processes
- We allow dense dependence at a special set of locations which we call the knots
- We assume that the process at all other locations is conditionally independent given the value of the process at the knots
- We make a new joint density p∗ that (unlike the original p density) factorizes according to the DAG we saw earlier
- To define a stochastic process, we need to define the finite dimensional distributions for any set L of locations. Call K the knots, assume L∩K=∅, then
$$p^*(X_1,\dots,X_n) = p^*(X_L) = \int p^*(X_L, X_K)\, dX_K = \int p^*(X_L \mid X_K)\, p(X_K)\, dX_K = \int \prod_{i \in L} p(X_i \mid X_K)\, p(X_K)\, dX_K$$
Low-rank Gaussian Processes
- We allow dense dependence at a special set of locations which we call the knots
- We assume that the process at all other locations is conditionally independent given the value of the process at the knots
- To define a stochastic process, we need to define the finite dimensional distributions for any set L of locations. Call K the knots, assume K⊂L and call U=L∖K
$$p^*(X_1,\dots,X_n) = p^*(X_L) = p^*(X_K, X_U) = p^*(X_U \mid X_K)\, p(X_K) = \prod_{i \in U} p(X_i \mid X_K)\, p(X_K)$$
Low-rank Gaussian Processes
From $p^*(X_1,\dots,X_n) = \int \prod_{i \in L} p(X_i \mid X_K)\, p(X_K)\, dX_K$ we assume all densities are Gaussian
$$p^*(X_1,\dots,X_n) = \int \prod_{i \in L} N(X_i;\, H_i X_K,\, R_i)\; N(X_K;\, 0,\, C_K)\, dX_K = \int N(X_L;\, H_L X_K,\, D_L)\; N(X_K;\, 0,\, C_K)\, dX_K$$
- p∗(X1,…,Xn)= ????
- We define $D_L$ as the diagonal matrix collecting all the $R_i$, and $H_L$ as the matrix whose $i$-th row is $H_i$.
Low-rank Gaussian Processes
$$p^*(X_1,\dots,X_n) = \int N(X_L;\, H_L X_K,\, D_L)\; N(X_K;\, 0,\, C_K)\, dX_K$$
We can write this as
$$p(X_K):\quad X_K \sim N(0,\, C_K)$$
$$p(X_L \mid X_K):\quad X_L = H_L X_K + \eta_L, \qquad \eta_L \sim N(0,\, D_L)$$
Therefore
$$p^*(X_L):\quad X_L \sim N(0,\, H_L C_K H_L^\top + D_L)$$
And here we see that
$$H_L C_K H_L^\top = C_{L,K}\, C_K^{-1}\, C_K\, C_K^{-1}\, C_{K,L} = C_{L,K}\, C_K^{-1}\, C_{K,L}$$
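To check this numerically, here is a small self-contained R sketch (again assuming an exponential covariance and a regular knot grid, both illustrative) comparing the induced covariance $H_L C_K H_L^\top + D_L$ with the full covariance matrix $C_L$:

```r
# Sketch: induced covariance of the (modified) low-rank GP vs. the full GP covariance.
set.seed(3)
n <- 50
coords <- matrix(runif(2 * n), ncol = 2)
knots  <- as.matrix(expand.grid(seq(0, 1, length.out = 4),
                                seq(0, 1, length.out = 4)))

cross_cov <- function(A, B) {
  d2 <- outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B)
  exp(-sqrt(pmax(d2, 0)))                          # exponential covariance, range 1
}

C_L  <- cross_cov(coords, coords)                  # full n x n covariance
C_K  <- cross_cov(knots, knots)
C_LK <- cross_cov(coords, knots)

H_L <- C_LK %*% solve(C_K)
D_L <- diag(drop(1 - rowSums(H_L * C_LK)))         # diagonal matrix of the R_i
V   <- H_L %*% C_K %*% t(H_L) + D_L                # induced covariance

max(abs(diag(V) - diag(C_L)))   # ~ 0: variances match the original C
max(abs(V - C_L))               # > 0: off-diagonal covariances differ from C
```

The diagonal correction $D_L$ is what restores the marginal variances; dropping it corresponds to the unmodified predictive process discussed later.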
Low-rank Gaussian Processes
From the previous slide: if $X(s)$ and $X(s')$ are both realizations of a low-rank GP with knots $K$, then if $s \neq s'$
$$\mathrm{Cov}(X(s), X(s')) = C(s,K)\, C_K^{-1}\, C(K,s') \neq C(s,s')$$
where $C(s,K)$ is computed from our chosen covariance function $C(\cdot,\cdot)$. Whereas if $s = s'$ then
$$\mathrm{Cov}(X(s), X(s')) = C(s,K)\, C_K^{-1}\, C(K,s') + D_s = C(s,s)$$
Notice anything? What is Cov(X(s),X(s′)) in the unrestricted GP?
Low-rank Gaussian Processes: induced covariance function
If X(⋅) is a low-rank GP constructed using covariance function C(⋅,⋅), then
- $\mathrm{Cov}(X(s), X(s')) \neq C(s,s')$ in general
- Equivalently, X(⋅) is an unrestricted GP using the covariance function C~(⋅,⋅) where
$$\tilde C(s,s') = \mathrm{Cov}(X(s), X(s')) = C(s,K)\, C_K^{-1}\, C(K,s') \quad \text{if } s \neq s'$$
$$\tilde C(s,s') = C(s,s') \quad \text{if } s = s'$$
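A pointwise R sketch of this induced covariance function, with an exponential base covariance assumed for illustration:

```r
# Sketch: the induced covariance function C_tilde(s, s') of the low-rank GP,
# with the diagonal correction so that C_tilde(s, s) = C(s, s).
C_base <- function(a, b) exp(-sqrt(sum((a - b)^2)))   # exponential covariance

C_tilde <- function(s, sp, knots) {
  C_K   <- exp(-as.matrix(dist(knots)))               # C(K, K)
  c_sK  <- apply(knots, 1, C_base, a = s)             # C(s, K)
  c_Ksp <- apply(knots, 1, C_base, a = sp)            # C(K, s')
  if (isTRUE(all.equal(s, sp))) {
    C_base(s, sp)                                     # = C(s, s) on the diagonal
  } else {
    drop(t(c_sK) %*% solve(C_K, c_Ksp))               # low-rank off the diagonal
  }
}

knots <- as.matrix(expand.grid(seq(0, 1, length.out = 4),
                               seq(0, 1, length.out = 4)))
C_tilde(c(0.2, 0.3), c(0.25, 0.35), knots)   # generally differs from C_base(...)
C_base(c(0.2, 0.3), c(0.25, 0.35))
```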
In the literature
What we have seen in the previous slides corresponds to Finley, Sang, Banerjee and Gelfand (2009), doi:10.1016/j.csda.2008.09.008.
- Quiñonero-Candela and Rasmussen (2005), https://www.jmlr.org/papers/v6/quinonero-candela05a.html
- Banerjee, Gelfand, Finley and Sang (2008), doi:10.1111/j.1467-9868.2008.00663.x
- The ML literature typically refers to these models as sparse GPs
- “Knots” are sometimes called “sensors” or “inducing points”
In a regression model
- Assume observed locations are L
- Make all the usual assumptions for regression and prior
- The only change: w∼LRGP with covariance function C(⋅,⋅) and knots K:
$$Y = X\beta + w + \varepsilon$$
We can write the model as
$$Y = X\beta + H w_K + \eta + \varepsilon$$
where $H = C_{L,K}\, C_K^{-1}$ and $\eta \sim$ ???
- If we omit η from the model, then we are fitting a regression model based on the Predictive Gaussian Process as introduced in Banerjee, Gelfand, Finley and Sang (2008)
- If we include η then we are fitting a Modified Predictive Gaussian Process model as in Finley, Sang, Banerjee, Gelfand (2009). Equivalently,
$$Y = X\beta + H w_K + \varepsilon^*, \qquad \text{where } \varepsilon^* \sim N(0,\, \tau^2 I_n + D_L)$$
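A small R sketch simulating data from this modified predictive process regression; the exponential covariance, knot grid, and all parameter values are illustrative assumptions, not from the slides:

```r
# Sketch: simulate Y = X beta + H w_K + eps*, with eps* ~ N(0, tau^2 I_n + D_L).
set.seed(4)
n <- 200
coords <- matrix(runif(2 * n), ncol = 2)
knots  <- as.matrix(expand.grid(seq(0, 1, length.out = 5),
                                seq(0, 1, length.out = 5)))

cross_cov <- function(A, B) {
  d2 <- outer(rowSums(A^2), rowSums(B^2), "+") - 2 * A %*% t(B)
  exp(-sqrt(pmax(d2, 0)))                            # exponential covariance, range 1
}

C_K  <- cross_cov(knots, knots)
C_LK <- cross_cov(coords, knots)
H    <- C_LK %*% solve(C_K)                          # H = C_{L,K} C_K^{-1}
R    <- pmax(drop(1 - rowSums(H * C_LK)), 0)         # diagonal of D_L

beta <- c(1, -2); tau2 <- 0.1
X    <- cbind(1, rnorm(n))
w_K  <- drop(t(chol(C_K)) %*% rnorm(nrow(knots)))    # w_K ~ N(0, C_K)
eps  <- rnorm(n, sd = sqrt(tau2 + R))                # eps* ~ N(0, tau^2 I_n + D_L)
y    <- drop(X %*% beta + H %*% w_K + eps)
```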
Low-rank Gaussian Processes in R
- In R, the spBayes package implements the (modified) predictive process as we have seen here
- Use argument knots=... in spLM
- Everything we said earlier about spLM remains true even when using the predictive process
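A minimal usage sketch on toy data; the knot grid, starting values, tuning, and priors below are placeholders rather than recommended settings (see ?spLM):

```r
library(spBayes)

set.seed(5)
n <- 200
coords <- cbind(runif(n), runif(n))
x <- rnorm(n)
y <- 1 - 2 * x + rnorm(n)                  # toy response (no real spatial signal)

fit <- spLM(y ~ x, coords = coords,
            knots = c(6, 6, 0.1),          # 6 x 6 grid of knots over the domain
            starting = list(phi = 3 / 0.5, sigma.sq = 1, tau.sq = 0.1),
            tuning   = list(phi = 0.1, sigma.sq = 0.1, tau.sq = 0.1),
            priors   = list(beta.Norm = list(rep(0, 2), diag(1000, 2)),
                            phi.Unif = c(3 / 1, 3 / 0.05),
                            sigma.sq.IG = c(2, 1),
                            tau.sq.IG = c(2, 0.1)),
            cov.model = "exponential",
            modified.pp = TRUE,            # FALSE gives the unmodified predictive process
            n.samples = 2000)

fit <- spRecover(fit, start = 1001)        # recover beta and spatial effect samples
```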