Bayesian Regression with Gaussian Processes
Spatial Statistics - BIOSTAT 696/896
Michele Peruzzi
University of Michigan
From response model to regression with latent GP
We have considered (ML) inference of the following model
Y(\cdot) \sim GP(0, C_{\theta}(\cdot, \cdot))
Optimizing becomes more difficult as we increase the dimension of \theta
We may want to model a non-zero mean based on covariate matrix X
We may want to make the model more complex or flexible
Bayesian linear regression with latent GP
Linear regression model with spatial random effects
Spatial random effects = unobserved error term with spatial correlation
Equivalent to having correlated measurement error
Separate purely spatial error from purely noise error terms
At location s , we model
Y(s) = X(s)^\top\beta + w(s) + \varepsilon(s)
X(s) is a p\times 1 vector of covariates at location s
\beta is a constant vector of regression coefficients
w(s) is the realization at s of a GP
\varepsilon(s) is independent noise
Bayesian linear regression with latent GP
At location s , we model
Y(s) = X(s)^\top\beta + w(s) + \varepsilon(s)
X(s) is a p\times 1 vector of covariates at location s
\beta is a constant vector of regression coefficients
w(s) is the realization at s of a GP: we assume w(\cdot) \sim GP(0, \tau^2 C_{\theta}(\cdot, \cdot) ) where we have included \tau^2 for conjugacy
Example: C_{\theta}(s, s') = s^2 \exp\{ -\phi \| s-s' \| \} and the process variance can be computed as
Var(w) = Cov(w(s), w(s)) = \tau^2 s^2
Notice: if we denote Var(w) = \sigma^2 then \sigma^2 = \tau^2 s^2 and we interpret s^2 as the signal-to-noise ratio
\text{SNR:} \qquad s^2 = \frac{\sigma^2}{\tau^2}
\varepsilon(s) is iid noise \varepsilon(s) \sim N(0, \tau^2)
We do not need a nugget term in C_{\theta}(\cdot, \cdot) as we are including iid noise in our model
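To make this concrete, here is a minimal R sketch that simulates one dataset from this model on the unit square, assuming the exponential correlation above and hypothetical values for \beta , \phi , s^2 and \tau^2 (none of these numbers come from the slides).
set.seed(696)
n <- 300
coords <- cbind(runif(n), runif(n))               # locations on the unit square
X <- cbind(1, rnorm(n))                           # intercept plus one covariate
beta <- c(1, -0.5)                                # assumed regression coefficients
phi <- 10; s2 <- 2; tau2 <- 0.5                   # assumed decay, SNR (s^2), noise variance
D <- as.matrix(dist(coords))                      # pairwise distances between locations
C_theta <- s2 * exp(-phi * D)                     # C_theta(s, s') = s^2 exp{-phi ||s - s'||}
w <- drop(t(chol(tau2 * C_theta)) %*% rnorm(n))   # w ~ N(0, tau^2 C_theta) at the observed locations
Y <- drop(X %*% beta) + w + rnorm(n, 0, sqrt(tau2))  # Y = X beta + w + eps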
Bayesian linear regression with latent GP
Take \cal T as the set of observed locations, then the model can be written as
Y = X\beta + w + \varepsilon where Y , X , w and \varepsilon are vectors and matrices with appropriate dimensions.
We call this a latent GP model because we are explicitly introducing latent (unobserved) variables (i.e., each element of w )
Let’s assign conjugate priors to most unknown parameters:
\begin{align*}
\tau^2 &\sim Inv.G(a, b) \\
\beta | \tau^2 &\sim N(0, \tau^2 V_{\beta})
\end{align*}
GP assumption implies w \sim N(0, \tau^2 C_{\theta})
The prior on \theta is not conjugate: write generically \theta \sim p(\theta) (we will specify later)
Bayesian linear regression with latent GP
Priors:
\begin{align*}
\theta &\sim p(\theta) \\
\tau^2 &\sim Inv.G(a, b) \\
\beta | \tau^2 &\sim N(0, \tau^2 V_{\beta}) \\
\text{at}\ \cal T \ \text{:}\qquad w &\sim N(0, \tau^2 C_{\theta}) \qquad \text{because} \qquad w(\cdot) \sim GP(0, \tau^2 C_{\theta}(\cdot, \cdot))
\end{align*}
Model/Likelihood:
Y | \beta, w, \tau^2 \sim N(X \beta + w, \tau^2 I_n)
Posterior we want to find:
p(\beta, \theta, \tau^2 | Y) = \text{???}
Bayesian linear regression with latent GP
Posterior we want to find:
p(\beta, \theta, \tau^2 | Y) = \text{???}
w are “nuisance” parameters. They are “in the way” when estimating \theta a posteriori
p(w|Y) is the posterior of the GP, which we will need to make predictions at new locations
p(w | Y) = \int p(w, \beta, \theta, \tau^2 | Y)\, d\beta\, d\theta\, d\tau^2 = \int p(w | Y,\beta, \theta, \tau^2) p(\beta, \theta, \tau^2 | Y)\, d\beta\, d\theta\, d\tau^2
Therefore, once again we first need p(\beta, \theta, \tau^2 | Y)
Notice: p(w | Y, \beta, \theta, \tau^2) is the conditional posterior for fixed values of \beta, \theta, \tau^2
We have a roadmap, now we will work towards finding p(\beta, \theta, \tau^2 | Y)
Finding p(\beta, \theta, \tau^2 | Y)
We marginalize over w (aka integrate out w ) and get the marginal likelihood
p(Y | \beta, \theta, \tau^2) =\int p(Y, w| \beta, \theta, \tau^2) dw = \int p(Y | \beta, w, \theta, \tau^2) p(w | \theta) dw
With Gaussian things, this is easy! If U \sim N(m_U, S_U) independent of V \sim N(m_V, S_V) and Z = U + V + \varepsilon where \varepsilon \sim N(0, a I_n) then Z|U,V \sim N(U + V, aI_n) , Z|U \sim N(U + m_V, S_V + aI_n) , Z \sim N(m_U + m_V, S_U + S_V + aI_n) .
In our case, the Gaussian things are \beta , w , and Y , therefore:
Y | \beta, \theta, \tau^2 \sim N(X \beta, \tau^2 C_{\theta} + \tau^2 I_n) = N(X \beta, \tau^2 (C_{\theta} + I_n))
Notice: the marginal model is just the likelihood of the usual linear regression model, with correlated errors! (See e.g. slides on preliminaries, page 13, set V_y = C_{\theta} + I_n )
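As an aside, this marginal likelihood is simple to evaluate numerically. Below is a sketch via a Cholesky factorization; the function name and arguments are hypothetical, and C_theta is the n\times n matrix C_{\theta} at the observed locations (no nugget).
log_marginal_lik <- function(Y, X, beta, C_theta, tau2) {
  # evaluates log N(Y; X beta, tau^2 (C_theta + I_n)) via a Cholesky factorization
  n <- length(Y)
  R <- chol(tau2 * (C_theta + diag(n)))                      # upper triangular, R'R = tau^2 V_y
  z <- backsolve(R, Y - drop(X %*% beta), transpose = TRUE)  # solves R' z = residual
  -0.5 * n * log(2 * pi) - sum(log(diag(R))) - 0.5 * sum(z^2)
}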
Finding p(\beta, \theta, \tau^2 | Y)
Having the Marginal Likelihood, our target posterior can be written as
p(\beta, \theta, \tau^2 | Y) \propto p(Y | \beta, \theta, \tau^2) p(\beta, \theta, \tau^2)
This is not a conjugate model because of \theta , however:
p(\beta, \theta, \tau^2 | Y) \propto p(\beta, \tau^2 | Y, \theta) p(\theta | Y)
p(\beta, \tau^2 | Y, \theta) is the posterior of the marginal model when we condition (or fix) a value of \theta
Because we chose a Normal-Inverse Gamma prior on (\tau^2, \beta) , we know that p(\beta, \tau^2 | Y, \theta) is also a Normal-Inverse Gamma, with updated parameters.
Finding p(\beta, \theta, \tau^2 | Y)
For a fixed and known \theta value, set V_y = C_{\theta} + I_n .
The Bayesian hierarchical model for our latent GP spatial regression is:
\begin{align*}
\tau^2 &\sim Inv.G(a, b) \\
\beta | \tau^2 &\sim N(0, \tau^2 V_{\beta}) \\
Y | \beta, \tau^2 &\sim N(X \beta, \tau^2 V_y)
\end{align*}
Then the posterior is (see prelim slides here) :
p(\tau^2, \beta | Y) = Inv.G(\tau^2; a^*,b^*) N(\beta; m_{\beta}^*, \tau^2 V_{\beta}^*)
where we just need to compute the updated parameters
\begin{align*}
V_{\beta}^* &= (V_{\beta}^{-1} + X^T V_y^{-1} X)^{-1} \\
m_{\beta}^* &= V_{\beta}^* X^T V_y^{-1} y \\
a^* = a + n/2 \quad\text{and}\quad b^* &= b + \frac{1}{2}(y^T V_y^{-1} y - m_{\beta}^{*T} V_{\beta}^{* -1} m_{\beta}^*)
\end{align*}
Finding p(\beta, \theta, \tau^2 | Y)
For a fixed and known \theta value,
the Bayesian hierarchical model for our latent GP spatial regression is:
\begin{align*}
\tau^2 &\sim Inv.G(a, b) \\
\beta | \tau^2 &\sim N(0, \tau^2 V_{\beta}) \\
Y | \beta, \tau^2 &\sim N(X \beta, \tau^2 \textcolor{teal}{(C_{\theta} + I_n)})
\end{align*}
Then the posterior is (see prelim slides here) :
p(\tau^2, \beta | Y) = Inv.G(\tau^2; a^*,b^*) N(\beta; m_{\beta}^*, \tau^2 V_{\beta}^*)
where we just need to compute the updated parameters
\begin{align*}
V_{\beta}^* &= (V_{\beta}^{-1} + X^T \textcolor{teal}{(C_{\theta} + I_n)}^{-1} X)^{-1} \\
m_{\beta}^* &= V_{\beta}^* X^T \textcolor{teal}{(C_{\theta} + I_n)}^{-1} y \\
a^* = a + n/2 \quad\text{and}\quad b^* &= b + \frac{1}{2}(y^T\textcolor{teal}{(C_{\theta} + I_n)}^{-1} y - m_{\beta}^{*T} V_{\beta}^{* -1} m_{\beta}^*)
\end{align*}
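A compact R sketch of this conditionally conjugate draw (function and argument names are hypothetical; Vy stands for C_{\theta} + I_n computed at the fixed \theta ):
draw_beta_tau2 <- function(Y, X, Vy, V_beta, a, b) {
  # one draw from p(beta, tau2 | Y, theta), with Vy = C_theta + I_n for the fixed theta
  n <- length(Y); p <- ncol(X)
  Vy_inv_X <- solve(Vy, X)                                    # V_y^{-1} X
  Vy_inv_Y <- solve(Vy, Y)                                    # V_y^{-1} Y
  V_star <- solve(solve(V_beta) + crossprod(X, Vy_inv_X))     # V_beta^*
  m_star <- drop(V_star %*% crossprod(X, Vy_inv_Y))           # m_beta^*
  a_star <- a + n / 2
  b_star <- b + 0.5 * (sum(Y * Vy_inv_Y) - sum(m_star * solve(V_star, m_star)))
  tau2 <- 1 / rgamma(1, shape = a_star, rate = b_star)        # tau2 ~ Inv.G(a*, b*)
  beta <- m_star + drop(t(chol(tau2 * V_star)) %*% rnorm(p))  # beta | tau2 ~ N(m*, tau2 V*)
  list(beta = beta, tau2 = tau2)
}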
Finding p(\beta, \theta, \tau^2 | Y)
Recap so far:
We want to compute p(\beta, \theta, \tau^2 | Y)
We have found that if we fix \theta , then the “remaining” posterior is conjugate
We know everything about p(\beta, \tau^2 | Y, \theta) !
For example, we can sample from it, i.e. we can build
\{ \tau^2_{(t)}, \beta_{(t)} \}_{t=1}^T \qquad \text{where} \qquad \tau^2_{(t)}, \beta_{(t)} \sim p(\beta, \tau^2 | Y, \theta)
Unfortunately for us, p(\beta, \theta, \tau^2 | Y) \neq p(\beta, \tau^2 | Y, \theta)
The posterior we really want is
p(\beta, \theta, \tau^2 | Y) \propto p(\beta, \tau^2 | Y, \theta) p(\theta | Y)
Finding p(\beta, \theta, \tau^2 | Y)
Our task is to find p(\beta, \theta, \tau^2 | Y)
Notice that we can write it “the other way”, i.e. p(\beta, \theta, \tau^2 | Y) \propto p(\theta | Y, \beta, \tau^2) p(\beta, \tau^2 | Y)
In the term p(\theta | Y, \beta, \tau^2) , we are looking at the posterior of \theta after fixing a value for the pair (\beta, \tau^2)
The term p(\beta, \tau^2 | Y) is not the N-Inv.G we had before. Notice this does not condition on \theta
Make a big leap of faith (for now):
i.e., suppose we have a procedure that allows us to build
\{ \theta_{(t)} \}_{t=1}^T \qquad \text{where} \qquad \theta_{(t)} \sim p(\theta | Y, \beta, \tau^2)
Finding p(\beta, \theta, \tau^2 | Y)
Putting things together:
We know how to get
\{ \tau^2_{(t)}, \beta_{(t)} \}_{t=1}^T \qquad \text{where} \qquad \tau^2_{(t)}, \beta_{(t)} \sim p(\beta, \tau^2 | Y, \theta)
We “know” how to get
\{ \theta_{(t)} \}_{t=1}^T \qquad \text{where} \qquad \theta_{(t)} \sim p(\theta | Y, \beta, \tau^2)
Therefore, we can build a Gibbs sampler!
Gibbs sampling
Suppose we want to extract samples from the joint distribution p(A, B)
We know how to extract samples from p(A | B) and p(B | A)
We then build a Markov chain \Theta_{t} = (A_t, B_t) on discrete time
Fix some initial value \Theta_0 by setting A_0 = a , B_0 = b
At time t we sample a new \Theta_{t} by updating its components: A_t \sim p(A | B = b_{t-1}) \qquad B_t \sim p(B | A = a_t)
The stationary distribution for this Markov chain is p(A, B)
In other words, we are alternating sampling from each conditional distribution to obtain (with large t ) a (correlated) sample from the joint distribution
Gibbs sampling: toy example
Suppose we want to extract samples from the joint distribution N\left(\begin{bmatrix} x_1 \\ x_2\end{bmatrix}; \begin{bmatrix} 0 \\ 0\end{bmatrix}, \begin{bmatrix}1 & \rho \\ \rho & 1 \end{bmatrix}\right)
We know how to extract samples from p(X_1 | X_2=x_2) and p(X_2 | X_1=x_1) :
\begin{align*}
p(X_1 | X_2=x_2) &= N(x_1; \rho x_2, 1-\rho^2) \\
p(X_2 | X_1=x_1) &= N(x_2; \rho x_1, 1-\rho^2)
\end{align*}
set.seed(696)
rho <- 0.6
gibbs_step <- function(x){
  x[1] <- rnorm(1, rho * x[2], sqrt(1 - rho^2))
  x[2] <- rnorm(1, rho * x[1], sqrt(1 - rho^2))
  return(x)
}
num_samples <- 10000
x_all_samples <- matrix(NA, ncol = 2, nrow = num_samples)
colnames(x_all_samples) <- c("X1", "X2")
x <- c(0, 0) # initial value
# run the Gibbs sampler
for (i in 1:num_samples){
  x <- gibbs_step(x)
  x_all_samples[i, ] <- x
}
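A quick numerical check of the output (assuming the code above has been run): the marginal means should be near 0, the variances near 1, and the sample correlation near \rho = 0.6 .
keep <- x_all_samples[-(1:500), ]   # discard a short burn-in
round(colMeans(keep), 2)            # marginal means, should be near 0
round(apply(keep, 2, var), 2)       # marginal variances, should be near 1
round(cor(keep), 2)                 # sample correlation, should be near rho = 0.6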
Gibbs sampling: toy example
Finding p(\beta, \tau^2, \theta | Y)
We know how to get
\{ \tau^2_{(t)}, \beta_{(t)} \}_{t=1}^T \qquad \text{where} \qquad \tau^2_{(t)}, \beta_{(t)} \sim p(\beta, \tau^2 | Y, \theta)
We “know” how to get
\{ \theta_{(t)} \}_{t=1}^T \qquad \text{where} \qquad \theta_{(t)} \sim p(\theta | Y, \beta, \tau^2)
Therefore, we build a Gibbs sampler and get
\{ \tau^2_{(t)}, \beta_{(t)}, \theta_{(t)} \}_{t=1}^T \qquad \text{where} \qquad \tau^2_{(t)}, \beta_{(t)}, \theta_{(t)} \sim p(\beta, \tau^2, \theta | Y)
We learn p(\beta, \tau^2, \theta | Y) by summarizing samples from it.
Marginal posteriors:
Need samples from p(\beta | Y) , p(\tau^2 | Y) , p(\theta | Y) ?
Just pick the element we need, e.g. extract \{ \beta_{(t)} \}_{t=1}^T from \{ \tau^2_{(t)}, \beta_{(t)}, \theta_{(t)} \}_{t=1}^T for a sample from p(\beta | Y)
With samples from p(\beta, \tau^2, \theta | Y)
One important thing we still need is p(w | Y) , the posterior for the spatial process
\begin{align*}
p(w | Y) &= \int p(\beta, \tau^2, \theta, w | Y) d\beta d\tau^2 d\theta \\
&= \int p(w | Y, \beta, \tau^2, \theta) p(\beta, \tau^2, \theta | Y) d\beta d\tau^2 d\theta
\end{align*}
This means that for each sample of \tau^2_{(t)}, \beta_{(t)}, \theta_{(t)} , all we need to do is independently sample from p(w | Y, \beta, \tau^2, \theta) to obtain a sample from the joint posterior.
p(w | Y, \beta, \tau^2, \theta) is our next target
It will be simple!
Finding p(w | Y, \beta, \tau^2, \theta)
Before we start, take a moment to notice what we are trying to do
We want a posterior (we condition on Y ) but we take everything else as known
Prior: w | \theta \sim N(0, \tau^2 C_{\theta})
Model: Y = X\beta + w + \varepsilon,\quad \varepsilon \sim N(0, \tau^2 I_n)
Rewrite the model: Y^* = I_n w + \varepsilon \qquad \text{where}\quad Y^* = Y-X\beta
(full conditional) Posterior
How do we continue?
Finding p(w | Y, \beta, \tau^2, \theta)
Before we start, take a moment to notice what we are trying to do
We want a posterior (we condition on Y ) but we take everything else as known
Prior: w | \theta \sim N(0, \tau^2 C_{\theta})
Model: Y = X\beta + w + \varepsilon,\quad \varepsilon \sim N(0, \tau^2 I_n)
Rewrite the model: Y^* = I_n w + \varepsilon \qquad \text{where}\quad Y^* = Y-X\beta
(full conditional) Posterior
This is just the usual linear regression. Here, w plays the role of \beta . It is a Gaussian model with Gaussian prior, so the posterior is Gaussian (conjugate):
\begin{align*}
&w | Y, \beta, \tau^2, \theta \sim N(m_w, \tau^2 \Sigma_w)\\
&\Sigma_w^{-1} = C_{\theta}^{-1} + I_n\\
&m_w = \Sigma_w (Y - X\beta)
\end{align*}
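A minimal R sketch of this draw, using the precision form above (the function name is hypothetical; C_theta is the matrix C_{\theta} at the observed locations):
draw_w <- function(Y, X, beta, C_theta, tau2) {
  # one draw from w | Y, beta, tau2, theta ~ N(m_w, tau^2 Sigma_w)
  n <- length(Y)
  P <- solve(C_theta) + diag(n)                    # Sigma_w^{-1} = C_theta^{-1} + I_n
  R <- chol(P)                                     # R'R = Sigma_w^{-1}
  b <- Y - drop(X %*% beta)
  m_w <- backsolve(R, forwardsolve(t(R), b))       # m_w = Sigma_w (Y - X beta)
  drop(m_w + sqrt(tau2) * backsolve(R, rnorm(n)))  # add N(0, tau^2 Sigma_w) noise
}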
Predictions
The last thing we are left to figure out is how to make predictions at new locations
We know how to do that for a GP response model (How?) , what about now?
We are looking for Y(s^*) where s^* was not originally in our dataset
Y(s^*) \sim N( X(s^*)^\top \beta + w(s^*), \tau^2)
First, note that p(w(s^*) | w, Y) = p(w(s^*) | w)
p(w(s^*) | w) is the GP prediction as we saw it here
With samples \tau^2_{(t)}, \beta_{(t)}, \theta_{(t)}, w_{(t)} , we just need to compute p(w(s^*) | w) and then extract from Y(s^*) \sim N( X(s^*)^\top \beta + w(s^*), \tau^2)
Why? Because:
\begin{align*}
p(Y^*, w^*, w, \beta, \tau^2, \theta | Y) &= p(Y^*, w^* | Y, w, \beta, \tau^2, \theta) p(w, \beta, \tau^2, \theta | Y) \\
&= p(Y^* | w^*, Y, w, \beta, \tau^2, \theta) p(w^* | Y, w, \beta, \tau^2, \theta) p(w, \beta, \tau^2, \theta | Y) \\
&= p(Y^* | w^*, \beta, \tau^2) p(w^* | w, \theta)p(w, \beta, \tau^2, \theta | Y)
\end{align*}
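For illustration, here is a sketch of one posterior predictive draw at a single new location, given one MCMC draw of (\beta, \tau^2, \theta, w) . Function and argument names are hypothetical, the exponential covariance with \theta = (s^2, \phi) and the prior w(\cdot) \sim GP(0, \tau^2 C_{\theta}(\cdot,\cdot)) are assumed, and in practice one would factorize C_{\theta} once and reuse it across prediction locations and MCMC draws.
predict_Ystar <- function(s_star, x_star, coords, w, beta, tau2, phi, s2) {
  # one draw of Y(s*) given one posterior draw of (beta, tau2, w) and theta = (s2, phi)
  d_star <- sqrt(colSums((t(coords) - s_star)^2))    # distances from s* to the observed locations
  c_star <- s2 * exp(-phi * d_star)                  # C_theta(s*, s_i)
  C_theta <- s2 * exp(-phi * as.matrix(dist(coords)))
  alpha <- solve(C_theta, c_star)                    # C_theta^{-1} c*
  m_star <- sum(alpha * w)                           # E[w(s*) | w]
  v_star <- tau2 * (s2 - sum(alpha * c_star))        # Var[w(s*) | w] under w ~ N(0, tau^2 C_theta)
  w_star <- rnorm(1, m_star, sqrt(max(v_star, 0)))   # draw w(s*) | w
  rnorm(1, sum(x_star * beta) + w_star, sqrt(tau2))  # Y(s*) ~ N(X(s*)' beta + w(s*), tau^2)
}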
We’re done!
We have outlined all steps for estimating parameters and making predictions
The basic univariate linear spatial model can be extended in many ways
Make it a latent-Gaussian model of non-Gaussian data (spatial GLMM)
The model we have considered is also referred to as a spatially-varying intercept model
Assume Z(s) \in \Re is a covariate not already in X(s) . Then,
Y(s) = X(s)^\top\beta + Z(s) w(s) + \varepsilon(s) is a spatially-varying coefficient (SVC) model. The regression coefficient that explains how Z(s) is associated with Y(s) is a spatial process (basically, we can draw a map of it) rather than a constant
But we are not done!!!
We have not explained how we are getting samples
\{ \theta_{(t)} \}_{t=1}^T \qquad \text{where} \qquad \theta_{(t)} \sim p(\theta | Y, \beta, \tau^2)
For that, we need to outline the Metropolis-Hastings algorithm.
Metropolis-Hastings algorithm
Goal: obtain samples from a target density p(x)
Idea: make a Markov chain that has p(x) as its stationary distribution
Iteration t of the MH algorithm
Given the current value x_{(t-1)} , propose a new value x^* by sampling it from density q(\cdot | x_{(t-1)})
Compute (by evaluating the densities) \alpha = \frac{p(x^*)}{p(x_{(t-1)})} \frac{q(x_{(t-1)}| x^*)}{q(x^* | x_{(t-1)})}
Accept and therefore set x_{(t)} = x^* with probability \min\{1, \alpha\}
For example, sample U \sim U(0,1) and accept if U < \alpha , reject otherwise
We have built a Markov chain, and it can be shown to converge to p(\cdot)
Output: (correlated) samples \{ x_{(t)} \}_{t=1}^T
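A minimal random-walk Metropolis sketch in R for a toy one-dimensional target, here an unnormalized Gamma(3, 2) density chosen only for illustration (all names and values are hypothetical):
# toy target known up to a constant: log p(x) = 2 log(x) - 2 x on x > 0
log_target <- function(x) if (x > 0) 2 * log(x) - 2 * x else -Inf
set.seed(696)
T_iter <- 5000
prop_sd <- 0.5                             # standard deviation of the random-walk proposal
x <- numeric(T_iter); x[1] <- 1
for (t in 2:T_iter) {
  x_prop <- rnorm(1, x[t - 1], prop_sd)    # symmetric proposal, so the q ratio cancels
  log_alpha <- log_target(x_prop) - log_target(x[t - 1])
  x[t] <- if (log(runif(1)) < log_alpha) x_prop else x[t - 1]   # accept with prob min(1, alpha)
}
mean(x[-(1:1000)])                         # should be close to the Gamma(3, 2) mean of 1.5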
In our case
Our target is p(\theta | Y, \beta, \tau^2) \propto p(Y | \theta, \beta, \tau^2) p(\theta)
Because we are taking ratios of this target, the normalizing constant cancels out
This is very convenient: MH works even if we can only evaluate the target density up to a proportionality constant
Metropolis-Hastings algorithm
Goal: obtain samples from p(\theta | Y, \beta, \tau^2) \propto p(Y | \theta, \beta, \tau^2) p(\theta)
Idea: make a Markov chain that has p(Y | \theta, \beta, \tau^2) p(\theta) as its stationary distribution
Iteration t of the MH algorithm
Given the current value \theta_{(t-1)} , propose a new value \theta^* by sampling it from density q(\cdot | \theta_{(t-1)})
Compute (by evaluating the densities) \alpha = \frac{p(Y | \theta^*, \beta, \tau^2) p(\theta^*)}{p(Y | \theta_{(t-1)}, \beta, \tau^2) p(\theta_{(t-1)})} \frac{q(\theta_{(t-1)}| \theta^*)}{q(\theta^* | \theta_{(t-1)})}
Accept and therefore set \theta_{(t)} = \theta^* with probability \min\{1, \alpha\}
For example, sample U \sim U(0,1) and accept if U < \alpha , reject otherwise
Output: (correlated) samples \{ \theta_{(t)} \}_{t=1}^T from p(\theta | Y, \beta, \tau^2)
Metropolis-Hastings algorithm & Gibbs samplers
A Gibbs sampler assumes we can sample from all the full conditional distributions
We would then alternate sampling from each of them to obtain a correlated sample from the joint posterior
If even one of the full conditionals is not available, we can replace that step with a Metropolis-Hastings step
The resulting Metropolis-within-Gibbs algorithm is one of the most important algorithms in Bayesian statistics because of its extremely wide applicability
The Gibbs sampler itself is a special case of Metropolis-Hastings
Many generalizations of Metropolis-Hastings (adaptive MH, Hamiltonian MC, …)
Back to finding p(\beta, \tau^2, \theta | Y)
We use a conditionally conjugate step to get
\{ \tau^2_{(t)}, \beta_{(t)} \}_{t=1}^T \qquad \text{where} \qquad \tau^2_{(t)}, \beta_{(t)} \sim p(\beta, \tau^2 | Y, \theta)
We use Metropolis-Hastings to get
\{ \theta_{(t)} \}_{t=1}^T \qquad \text{where} \qquad \theta_{(t)} \sim p(\theta | Y, \beta, \tau^2)
Therefore, we build a Metropolis-within-Gibbs sampler and get
\{ \tau^2_{(t)}, \beta_{(t)}, \theta_{(t)} \}_{t=1}^T \qquad \text{where} \qquad \tau^2_{(t)}, \beta_{(t)}, \theta_{(t)} \sim p(\beta, \tau^2, \theta | Y)
We learn p(\beta, \tau^2, \theta | Y) by summarizing samples from it.
Alternative sampling algorithms
In the previous discussion, we have marginalized out w
This is convenient because w is high dimensional
The “remaining” joint posterior is lower dimensional
Lower dimension = easier to explore and learn
We could also integrate out \beta (the R package spBayes does this)
Non-marginalized sampler
What if we do not integrate out w ?
We are then looking at the joint posterior p(w, \beta, \tau^2, \theta | Y)
Here, a Gibbs sampler (with a Metropolis step) works this way:
Sample from p(\tau^2 | Y, w, \beta, \theta)
Sample from p(w | Y, \beta, \tau^2, \theta)
Sample from p(\beta | Y, w, \tau^2, \theta)
Update \theta with a Metropolis step targeting p(\theta | Y, w, \tau^2, \beta)
Non-marginalized sampler
What if we do not integrate out w ?
We are then looking at the joint posterior p(w, \beta, \tau^2, \theta | Y)
Here, a Gibbs sampler (with a Metropolis step) works this way:
Sample from p(\tau^2 | Y, w, \beta, \theta) = p(\tau^2 | Y, w, \beta) why?
Sample from p(w | Y, \beta, \tau^2, \theta)
Sample from p(\beta | Y, w, \tau^2, \theta) = p(\beta | Y, w, \tau^2) why?
Update \theta with a Metropolis step targeting p(\theta | Y, w, \tau^2, \beta) = p(\theta | w) why?
Non-marginalized sampler
What if we do not integrate out w ?
We are then looking at the joint posterior p(w, \beta, \tau^2, \theta | Y)
Here, a Gibbs sampler (with a Metropolis step) works this way:
Sample from p(\tau^2 | Y, w, \beta)
Sample from p(w | Y, \beta, \tau^2, \theta)
Sample from p(\beta | Y, w, \tau^2)
Update \theta with a Metropolis step targeting p(\theta | w) \propto p(w | \theta) p(\theta) . What is p(w | \theta) ?
Non-marginalized sampler, step 1
Sample from p(\tau^2 | Y, w, \beta)
We condition on everything else, so Y^* = Y-X\beta - w can be computed
Treat Y^* as the dataset
Then the prior-model pair is
\begin{align*}
\tau^2 &\sim Inv.G(a_{\tau}, b_{\tau})\\
Y^* &\sim N(0, \tau^2 I_n)
\end{align*}
How do we find the (full conditional) posterior in this case?
Non-marginalized sampler, step 1
Sample from p(\tau^2 | Y, w, \beta)
We condition on everything else, so Y^* = Y-X\beta - w can be computed
Treat Y^* as the dataset
Then the prior-model pair is
\begin{align*}
\tau^2 &\sim Inv.G(a_{\tau}, b_{\tau})\\
Y^* &\sim N(0, \tau^2 I_n)
\end{align*}
In the usual linear regression model, drop everything related to \beta
Inverse Gamma full conditional
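In R, one draw from this Inverse-Gamma full conditional can be sketched as follows (the function and argument names are hypothetical):
draw_tau2 <- function(Y, X, beta, w, a_tau, b_tau) {
  # tau2 | Y, w, beta ~ Inv.G(a_tau + n/2, b_tau + sum(Ystar^2)/2)
  Ystar <- Y - drop(X %*% beta) - w   # residual after removing X beta and w
  1 / rgamma(1, shape = a_tau + length(Ystar) / 2, rate = b_tau + 0.5 * sum(Ystar^2))
}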
Non-marginalized sampler, step 2
Sample from p(w | Y, \beta, \tau^2, \theta)
Treat Y^* = Y - X\beta as the dataset
Then the prior-model pair is
\begin{align*}
w &\sim N(0, C_{\theta})\\
Y^* &\sim N(I_n w, \tau^2 I_n)
\end{align*}
How do we find the (full conditional) posterior in this case?
Non-marginalized sampler, step 2
Sample from p(w | Y, \beta, \tau^2, \theta)
Treat Y^* = Y - X\beta as the dataset
Then the prior-model pair is
\begin{align*}
w &\sim N(0, C_{\theta})\\
Y^* &\sim N(I_n w, \tau^2 I_n)
\end{align*}
This is a regression model with known error variance
Gaussian prior, Gaussian model, therefore Gaussian posterior
p(w | Y, \beta, \tau^2, \theta) = N(w; \mu_w, \Sigma_w) where
\begin{align*}
\Sigma_w^{-1} &= C_{\theta}^{-1} + I_n/\tau^2\\
\mu_w &= \Sigma_w (Y - X\beta)/\tau^2
\end{align*}
Non-marginalized sampler, step 3
Sample from p(\beta | Y, w, \tau^2)
Treat Y^* = Y - w as the dataset
Then the prior-model pair is
\begin{align*}
\beta &\sim N(0, V)\\
Y^* &\sim N(X\beta, \tau^2 I_n)
\end{align*}
How do we find the (full conditional) posterior in this case?
Non-marginalized sampler, step 3
Sample from p(\beta | Y, w, \tau^2)
Treat Y^* = Y - w as the dataset
Then the prior-model pair is
\begin{align*}
\beta &\sim N(0, V)\\
Y^* &\sim N(X\beta, \tau^2 I_n)
\end{align*}
This is a regression model with known error variance
Gaussian prior, Gaussian model, therefore Gaussian posterior
p(\beta | Y, w, \tau^2, \theta) = N(\beta; \mu_{\beta}, \Sigma_{\beta}) where
\begin{align*}
\Sigma_{\beta}^{-1} &= V^{-1} + X^\top X/\tau^2\\
\mu_{\beta} &= \Sigma_{\beta} X^\top (Y - w)/\tau^2
\end{align*}
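A small R sketch of this draw using a Cholesky factorization of the posterior precision (the function name is hypothetical):
draw_beta <- function(Y, X, w, tau2, V) {
  # beta | Y, w, tau2 ~ N(mu_beta, Sigma_beta) with the moments shown above
  p <- ncol(X)
  P <- solve(V) + crossprod(X) / tau2          # Sigma_beta^{-1} = V^{-1} + X'X / tau2
  R <- chol(P)                                 # R'R = Sigma_beta^{-1}
  b <- crossprod(X, Y - w) / tau2              # X'(Y - w) / tau2
  mu <- backsolve(R, forwardsolve(t(R), b))    # mu_beta = Sigma_beta b
  drop(mu + backsolve(R, rnorm(p)))            # add N(0, Sigma_beta) noise
}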
Non-marginalized sampler, step 4
Update \theta with a Metropolis step targeting p(\theta | w) \propto p(w | \theta) p(\theta)
p(w | \theta) = N(w; 0, C_{\theta})
Alternative methods
Because Y = X\beta + w + \varepsilon can be written as Y = \begin{bmatrix} X & I_n\end{bmatrix} \begin{bmatrix} \beta \\ w \end{bmatrix} + \varepsilon
We can call X^* = \begin{bmatrix} X & I_n\end{bmatrix} and \beta^* = \begin{bmatrix} \beta \\ w \end{bmatrix} and then update as a block via
p(\beta^* | Y, \tau^2, \theta) = N(\beta^*; \mu_{\beta^*}, \Sigma_{\beta^*})
\Sigma_{\beta^*}^{-1} = V_{\beta^*}^{-1} + X^{*\top} X^*/\tau^2 , \quad\text{where}\quad V_{\beta^*} = \text{blockdiag}(V, C_{\theta}) \ \text{is the prior covariance of}\ \beta^*
\mu_{\beta^*} = \Sigma_{\beta^*} X^{*\top} Y/\tau^2
Once again, this follows from the linear regression posterior we saw in the preliminaries
Using package spBayes for spatial linear regression with latent GPs
First, specify how we want to set things up.
The model we are implementing is
Y = X\beta + w + \varepsilon, \qquad \text{where}\quad \varepsilon \sim N(0, \tau^2 I_n)
Priors:
\beta \sim N(0, 10^3 I_p) where p is the number of columns of X
w(\cdot) \sim GP(0, C_{\sigma^2, \phi}(\cdot)) and we let C_{\sigma^2, \phi}(s, s') = \sigma^2 \exp\{ -\phi \| s-s' \| \}
\tau^2 \sim Inv.G(2, 0.1) and \sigma^2 \sim Inv.G(2,2)
\phi \sim U(3, 30) . It is common to let \phi have a uniform prior where the range of possible values is set to give “reasonable” spatial dependence
library(spBayes)
n.samples <- 2000
starting <- list("phi" = 3/0.5, "sigma.sq" = 50, "tau.sq" = 1)
tuning <- list("phi" = 0.1, "sigma.sq" = 0.1, "tau.sq" = 0.1)
priors <- list("beta.Norm" = list(rep(0, ncol(X)), diag(1000, ncol(X))),
               "phi.Unif" = c(3/1, 3/0.1), "sigma.sq.IG" = c(2, 2),
               "tau.sq.IG" = c(2, 0.1))
cov.model <- "exponential"
cov.model <- "exponential"
Using package spBayes for spatial linear regression with latent GPs
We fit the GP regression model in 2 steps
system.time({
  model_fit <- spLM(Y ~ X - 1, coords = coords, starting = starting,
                    tuning = tuning, priors = priors, cov.model = cov.model,
                    n.samples = n.samples, verbose = FALSE, n.report = 500)
})
user system elapsed
7.602 0.010 7.629
system.time({
  model_fit <- model_fit %>% spRecover(start = round(n.samples/3), verbose = FALSE)
})
user system elapsed
35.201 0.913 25.636
Using package spBayes for spatial linear regression with latent GPs
We chose to run MCMC for 2000 iterations
If we are satisfied with convergence, we drop some initial samples (burn-in period) because they are likely not sampled at convergence
Recall with MCMC we get correlated samples from the joint posterior after reaching convergence
Using package spBayes for spatial linear regression with latent GPs
spBayes calls theta the vector of (\sigma^2, \tau^2, \phi)
In this example, the MCMC chains look poor
All chains mix poorly, or in other words they show high autocorrelation
Inference based on these chains should not be trusted
We need to improve MCMC tuning and possibly run the Markov chain longer
Using package spBayes for spatial linear regression with latent GPs
At each step, a new proposed value for \phi is generated
tuning <- list("phi"=0.1, "sigma.sq"=0.1, "tau.sq"=0.1) sets the standard deviation of the proposal for phi to 0.1
In other words, this determines the jump size
If the jump is too big, the proposal will probably get rejected
In our case: 16.4% acceptance rate
We target an acceptance rate of about 24% so we should lower tuning$phi
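One quick empirical check is the fraction of iterations in which the chain for \phi actually moves; the snippet below assumes phi_chain holds the MCMC draws of \phi extracted from the fitted object (the object name is hypothetical).
# empirical acceptance rate = fraction of iterations where the phi chain moved
mean(diff(phi_chain) != 0)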
Using package spBayes for spatial linear regression with latent GPs
Let’s take the previous MCMC run (even though we did not like it too much)
We can now predict at a grid of locations
predictions <- spPredict(model_fit, pred.coords = coords_out, pred.covars = X_out)
----------------------------------------
General model description
----------------------------------------
Model fit with 1000 observations.
Prediction at 900 locations.
Number of covariates 3 (including intercept if specified).
Using the exponential spatial correlation model.
-------------------------------------------------
Sampling
-------------------------------------------------
Sampled: 100 of 2000, 4.95%
Sampled: 200 of 2000, 9.95%
Sampled: 300 of 2000, 14.95%
Sampled: 400 of 2000, 19.95%
Sampled: 500 of 2000, 24.95%
Sampled: 600 of 2000, 29.95%
Sampled: 700 of 2000, 34.95%
Sampled: 800 of 2000, 39.95%
Sampled: 900 of 2000, 44.95%
Sampled: 1000 of 2000, 49.95%
Sampled: 1100 of 2000, 54.95%
Sampled: 1200 of 2000, 59.95%
Sampled: 1300 of 2000, 64.95%
Sampled: 1400 of 2000, 69.95%
Sampled: 1500 of 2000, 74.95%
Sampled: 1600 of 2000, 79.95%
Sampled: 1700 of 2000, 84.95%
Sampled: 1800 of 2000, 89.95%
Sampled: 1900 of 2000, 94.95%
Sampled: 2000 of 2000, 99.95%
Using package spBayes for spatial linear regression with latent GPs
Calculate residuals from the model fit
w_post <- model_fit$p.w.recover.samples
b_post <- model_fit$p.beta.recover.samples
XB_post <- X %*% t(b_post)
XBW_mean <- (XB_post + w_post) %>% apply(1, mean)
model_residual <- Y - XBW_mean
df <- df %>% mutate(residual = model_residual)
residual_map <- ggplot(df, aes(xcoord, ycoord, color = residual)) +
  geom_point() +
  scale_color_viridis_c() +
  theme_minimal() +
  ggtitle("Residual")
sv <- df %>% with(geoR::variog(data = residual, coords = cbind(xcoord, ycoord), messages = FALSE))
sv_df <- data.frame(dists = sv$u, variogram = sv$v, npairs = sv$n, sd = sv$sd)
sv_plot <- ggplot(sv_df, aes(x = dists, y = variogram)) +
  geom_point(size = 2, shape = 8) +
  theme_minimal()
grid.arrange(residual_map, sv_plot, nrow = 1)
Why does spBayes need two steps?
Priors:
\tau^2 \sim Inv.G(a_{\tau}, b_{\tau})
\beta \sim N(0, \tau^2 V)
\sigma^2 \sim Inv.G(a_{\sigma}, b_{\sigma})
\phi \sim U(l_{\phi}, u_{\phi})
w(\cdot) \sim GP(0, C_{\sigma^2, \phi}(\cdot, \cdot)) where C_{\sigma^2, \phi}(s, s') = \sigma^2 \exp\{ -\phi \| s-s' \| \}
Model:
Y(s) = X(s)^\top \beta + w(s) + \varepsilon(s),\ \text{where}\ \varepsilon(s) \sim N(0, \tau^2) \quad \text{(iid)}
In vector form: Y = X\beta + w + \varepsilon , where \varepsilon \sim N(0, \tau^2 I_n) .
Marginalized model:
In addition to integrating out w as we have seen before, we can also do that with \beta
In spBayes , this is the marginal model that is fit via MCMC
Y | \tau^2, \sigma^2, \phi \sim N(0, \tau^2 (X V X^\top + I_n) + C_{\sigma^2, \phi})
Why does spBayes need two steps?
Marginalized model:
In addition to integrating out w as we have seen before, we can also do that with \beta
In spBayes , this is the marginal model that is fit via MCMC
Y | \tau^2, \sigma^2, \phi \sim N(0, \tau^2 (X V X^\top + I_n) + C_{\sigma^2, \phi})
It is possible to update \tau^2 via a Gibbs step or jointly with \phi and \sigma^2 as a Metropolis update
The Metropolis-Hastings algorithm targets the posterior p(\tau^2, \sigma^2, \phi | Y) \propto p(Y | \tau^2, \sigma^2, \phi) p(\tau^2, \sigma^2, \phi)
Once we have samples from p(\tau^2, \sigma^2, \phi | Y) we can sample from p(\beta | Y, \tau^2, \sigma^2, \phi)
Once we have samples from p(\tau^2, \sigma^2, \phi | Y) we can sample from p(w | Y, \tau^2, \sigma^2, \phi)
Both of these steps are performed in spRecover
Moving forward
spBayes::spLM
fits a GP regression model
Computations take a considerable amount of time for n=1000
Datasets with n=10000 or larger are quite common
Let’s see what happens with spLM at increasing sample sizes
Moving forward
Things do not look promising
Next, we will look into why this is the performance profile we get
We will then consider some solutions to the scalability problem