Quick review of preliminaries

Spatial Statistics - BIOSTAT 696/896

Michele Peruzzi

University of Michigan

Random variables and random vectors

  • Real-valued function defined on outcome of experiment Y : \Omega \to \Re
  • Example: Bernoulli r.v. represents coin toss with values \{0,1\} each corresponding to \Omega = \{ H, T \}
  • With spatial data, we have one random variable at each domain location
  • In other words, we have a spatial index
  • Example: Y(x) is the random variable at location x
  • When we speak of multivariate spatial data, we mean Y(x) is a random vector
  • Our aim is to model the dependence structure of any given set of such spatially-indexed random variables

Expectation, covariance, conditional expectation

  • E[Y] = \int y f(y)dy where f(y) is the probability density function of Y
  • Cov[X, Y] = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]
  • Cov is bilinear, i.e. a linear operator in each of its two arguments
  • Conditional expectation E[Y|X=x] = \int y f_{Y|X}(y|x) dy is the expectation of the random variable Y | X = x
  • Cov[X, Y] = E[Cov[X,Y | Z]] + Cov[E[X|Z], E[Y|Z]]
  • In particular Var[X] = E[Var[X|Y]] + Var[E[X|Y]]
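
A quick simulation check of the variance decomposition above. The distributional choices here (Y Gamma, X | Y Normal with mean and variance depending on Y) are purely hypothetical examples:

```python
# Minimal check of Var[X] = E[Var[X|Y]] + Var[E[X|Y]] by simulation.
# Y ~ Gamma(2, 1) and X | Y = y ~ N(y, y^2) are hypothetical choices.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

y = rng.gamma(shape=2.0, scale=1.0, size=n)      # Y ~ Gamma(2, 1)
x = rng.normal(loc=y, scale=y)                   # X | Y = y ~ N(y, y^2)

lhs = x.var()                                    # Var[X], estimated from the draws
# For this example E[X|Y] = Y and Var[X|Y] = Y^2, so:
rhs = np.mean(y**2) + y.var()                    # E[Var[X|Y]] + Var[E[X|Y]]

print(lhs, rhs)   # the two estimates should be close (both near 8)
```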

Joint density of a collection of random variables

  • Suppose we have two (continuous) random variables X and Y
  • Denote their joint probability density by f(X,Y)
  • We can write f(X,Y) = f(X) f(Y | X) = f(Y) f(X | Y) where f(X) is the marginal density of X
  • Why is this useful? Sometimes, specifying f(X,Y) directly is difficult
  • Easier to specify f(X) (or f(Y)) and then the conditional density f(Y | X) (or f(X|Y))
  • i.e. model X, then model Y for every given value of X
  • More generally, with X,Y,Z, we can write f(X,Y,Z) = f(X) f(Y|X) f(Z|X,Y)
  • For a set of n variables, f(X_1, \dots, X_n) = f(X_1) f(X_2 | X_1) \dots f(X_n | X_1, \dots X_{n-1})
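
This factorization also tells us how to simulate from a joint density: draw X from its marginal, then Y from its conditional given X, and so on. A minimal sketch, where the particular Gaussian choices are hypothetical:

```python
# Sampling from a joint via f(X, Y) = f(X) f(Y | X):
# draw X from its marginal, then Y from its conditional given X.
# X ~ N(0, 1) and Y | X = x ~ N(2x, 0.5^2) are hypothetical choices.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

x = rng.normal(0.0, 1.0, size=n)     # X ~ f(X)
y = rng.normal(2.0 * x, 0.5)         # Y | X = x ~ f(Y | X)

# The pairs (x[i], y[i]) are draws from the joint f(X, Y);
# e.g. Cov(X, Y) should be about 2 here.
print(np.cov(x, y)[0, 1])
```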

Graphical models

  • Suppose you are given a collection of random variables
  • Describing their joint distribution may be difficult
  • Problem may be simplified by looking at pairs of variables. Are they “related”?
  • So, suppose you are given pairwise “dependence” information
  • Draw one node for each random variable
  • Draw one edge between each pair of dependent (ie, related) random variables

Graphical models: undirected graphs

  • We can then write the resulting joint density as: f(X,Y,Z,V,W) = \frac{1}{c} f(X|W,Y) f(Y|X,V) f(Z|V) f(V|Y,Z) f(W|X) where c is an unknown normalizing constant
  • We are modeling conditional independence
  • The graphical model helps us conceptually in defining f(X,Y,Z,V,W) which is a complex object to deal with

Graphical models: directed acyclic graphs (DAG)

  • We can go one step further and consider a special case of graphical model
  • If random variable X “explains” Y (i.e. changes in Y can be attributed to changes in X) then we represent the dependence with a directed edge X \to Y
  • Remember this is a model, i.e. a representation of reality based on what we think is reasonable
  • Of particular interest is when all edges are directed, and no cycles can be found
  • Acyclicity: we can never start from a node, follow the “arrows”, and return to the same node
  • The graphical model we obtain is a directed acyclic graph, also called a Bayesian network

Graphical models: directed acyclic graphs (DAG)

  • Useful property of DAGs is that we get f(X,Y,Z,V,W) = f(X|W) f(Y|X,V) f(Z|V) f(V) f(W)
  • Notice there is no normalizing constant
  • We say that the joint density factorizes according to the given DAG
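
This factorization makes simulation straightforward (ancestral sampling): draw each node after its parents, in any order compatible with the arrows. A minimal sketch for the factorization above; the Gaussian conditionals are hypothetical choices, and only the parent structure comes from the DAG:

```python
# Ancestral sampling from the DAG factorization
#   f(X, Y, Z, V, W) = f(X | W) f(Y | X, V) f(Z | V) f(V) f(W).
# The Gaussian conditionals below are hypothetical; only the parent
# structure (who conditions on whom) comes from the factorization.
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

v = rng.normal(0.0, 1.0, size=n)     # V has no parents
w = rng.normal(0.0, 1.0, size=n)     # W has no parents
x = rng.normal(0.5 * w, 1.0)         # X | W
z = rng.normal(-1.0 * v, 1.0)        # Z | V
y = rng.normal(x + v, 1.0)           # Y | X, V

# Each row (v[i], w[i], x[i], z[i], y[i]) is a joint draw;
# no normalizing constant is needed because the joint factorizes over the DAG.
```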

Bayesian paradigm

  • Model everything as a random variable/vector
  • Make assumptions about conditional independence between all variables
  • A Bayesian (hierarchical) model can be represented as a DAG
  • (this is why DAGs are also called Bayesian networks)
  • This is a model for the joint probability of all variables, including the data
  • Interest is in finding the conditional probability of all unknowns given the data

Bayesian paradigm: linear regression example

\sigma^2 \sim Inv.G(a,b)

\beta \sim N(m_{\beta}, \sigma^2 V_{\beta})

Y = X \beta + \varepsilon, \text{ where } \varepsilon \sim N(0, \sigma^2 I_n)

Equivalently, Y \sim N(X \beta, \sigma^2 I_n).

Bayesian paradigm: linear regression example

  • We always condition on the covariates X, so we need not explicitly assign them a probability model
  • The joint density of all variables factorizes according to the DAG p(\sigma^2, \beta, Y) = p(\sigma^2) p(\beta | \sigma^2) p(Y | \sigma^2, \beta)

Prior, posterior, predictive distributions

  • Prior distribution: the assumed marginal density of the model parameters (eg, p(\sigma^2, \beta) = Inv.G(\sigma^2; a,b) N(\beta; m_{\beta}, \sigma^2 V_{\beta}))
  • Probability model/likelihood: the assumed cond. density of the data, given the parameters (eg, p(Y | \beta, \sigma^2) = N(y; X\beta, \sigma^2 V_y))
  • Typically, V_y = I_n the identity matrix, but the more general case is useful
  • Posterior distribution: the implied cond. density of the model parameters given the data (eg, p(\beta, \sigma^2 | Y))
  • Predictive distribution: the implied cond. density of new data, given old data: p(Y^{\text{new}}| Y) = \int p(Y^{\text{new}}| Y, \beta, \sigma^2) p(\beta, \sigma^2 | Y) d\beta d\sigma^2
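
In practice the predictive integral is usually approximated by Monte Carlo: given draws from p(\beta, \sigma^2 | Y), simulate Y^{\text{new}} from the likelihood for each draw. A minimal sketch, assuming posterior draws are already available (e.g. from the conjugate posterior discussed below, or from MCMC):

```python
# Monte Carlo approximation of p(Y_new | Y) = ∫ p(Y_new | beta, sigma^2) p(beta, sigma^2 | Y).
# Assumes posterior draws `beta_draws` (S x p) and `sig2_draws` (S,) are available,
# along with a design matrix X_new for the new observations (all hypothetical inputs).
import numpy as np

def predictive_draws(X_new, beta_draws, sig2_draws, rng):
    S = beta_draws.shape[0]
    out = np.empty((S, X_new.shape[0]))
    for s in range(S):
        mean_s = X_new @ beta_draws[s]                        # E[Y_new | beta_s]
        out[s] = rng.normal(mean_s, np.sqrt(sig2_draws[s]))   # Y_new | beta_s, sigma^2_s
    return out   # each row is one draw from the predictive distribution
```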

Bayes theorem:

P(A | B) = \frac{P(A) P(B|A)}{P(B)}

Bayesian linear regression:

p(\sigma^2, \beta | Y) = \frac{ p(\sigma^2, \beta) p(Y | \sigma^2, \beta) }{p(Y)} \propto p(\sigma^2, \beta) p(Y | \sigma^2, \beta)

Bayesian regression with conjugate priors

  • If p(\sigma^2, \beta) is Normal-Inverse Gamma and p(Y|\beta,\sigma^2) is Normal, then p(\sigma^2, \beta | Y) is again Normal-Inverse Gamma
  • Because prior and posterior belong to the same family, they are a conjugate pair
  • Let p(\sigma^2, \beta) = Inv.G(\sigma^2; a,b) N(\beta; m_{\beta}, \sigma^2 V_{\beta})
  • Let p(Y | \beta, \sigma^2) = N(y; X\beta, \sigma^2 V_y)
  • Then the posterior is p(\sigma^2, \beta | Y) = Inv.G(\sigma^2; a^*,b^*) N(\beta; m_{\beta}^*, \sigma^2 V_{\beta}^*), where we just need to compute the updated parameters (a code sketch of these updates follows below)

V_{\beta}^* = (V_{\beta}^{-1} + X^T V_y^{-1} X)^{-1}

m_{\beta}^* = V_{\beta}^*(V_{\beta}^{-1}m_{\beta} + X^T V_y^{-1} y)

a^* = a + n/2 \quad\text{and}\quad b^* = b + \frac{1}{2}(m_{\beta}^T V_{\beta}^{-1} m_{\beta} + y^TV_y^{-1} y - m_{\beta}^{*T} V_{\beta}^{* -1} m_{\beta}^*)
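
The sketch below writes out these updates with NumPy and checks them on simulated data; the prior hyperparameter values and the simulated design are hypothetical:

```python
# The conjugate updates above, written out with NumPy; variable names mirror the notation.
import numpy as np

def nig_posterior(y, X, m_beta, V_beta, a, b, V_y=None):
    n = y.shape[0]
    if V_y is None:
        V_y = np.eye(n)                            # default V_y = I_n
    Vy_inv = np.linalg.inv(V_y)
    Vb_inv = np.linalg.inv(V_beta)

    V_star = np.linalg.inv(Vb_inv + X.T @ Vy_inv @ X)
    m_star = V_star @ (Vb_inv @ m_beta + X.T @ Vy_inv @ y)
    a_star = a + n / 2.0
    b_star = b + 0.5 * (m_beta @ Vb_inv @ m_beta
                        + y @ Vy_inv @ y
                        - m_star @ np.linalg.inv(V_star) @ m_star)
    return m_star, V_star, a_star, b_star

# Example with simulated data (hypothetical hyperparameter values):
rng = np.random.default_rng(3)
n, p = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true, sig2_true = np.array([1.0, -0.5]), 0.5
y = X @ beta_true + rng.normal(scale=np.sqrt(sig2_true), size=n)

m_star, V_star, a_star, b_star = nig_posterior(
    y, X, m_beta=np.zeros(p), V_beta=100 * np.eye(p), a=2.0, b=1.0)
print(m_star)                  # posterior mean of beta, close to beta_true
print(b_star / (a_star - 1))   # posterior mean of sigma^2, close to sig2_true
```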

Bayesian regression with conjugate priors: special cases

  • X is called the design matrix and we consider it fixed, known, non-random. None of its elements are missing
  • Many problems can be cast in terms of linear regressions!

Special case (1)

  • X = 1_n, ie the vector of size n whose elements are all 1, and V_y=I_n
  • This is an intercept-only model with iid Gaussian errors
  • This becomes the classical problem of estimating mean and variance of a sample of observations

Special case (2)

  • X = I_n, the identity matrix of size n, and V_y=I_n
  • This is typical in signal processing
  • Observations typically indexed by (continuous) time
  • De-noising problem: y(t_i) = \beta(t_i) + \varepsilon(t_i), with t_i \in \{ t_1, \dots, t_n \}
  • \beta(\cdot) is an unknown function, whereas \{ y(t_1), \dots, y(t_n) \} is the data set of noisy realizations of that function

Bayesian regression with conjugate priors: special case (1)

  • Model: Y \sim N(\mu, \sigma^2), with priors \mu \sim N(m, v) and \sigma^2 \sim Inv.G(a,b)
  • Problem: estimate \mu and \sigma^2 a posteriori
  • Solution: cast problem into linear regression with conjugate priors Y = 1_n \mu + \varepsilon, \varepsilon \sim N(0, \sigma^2 I_n)
  • Posterior p(\sigma^2 | Y) p(\mu | \sigma^2, Y) = Inv.G(a^*, b^*) N(\mu; \mu^*, V^*) where

V^* = \left(\frac{1}{v} + \frac{1}{\sigma^2} 1_n^T 1_n \right)^{-1} = \left(\frac{1}{v} + \frac{n}{\sigma^2} \right)^{-1}

\mu^* = V^* \left(\frac{m}{v} + \frac{1}{\sigma^2} 1_n^T Y \right) = V^* \left(\frac{m}{v} + \frac{\sum_i y_i}{\sigma^2}\right)

  • Bayesian linear models share this common structure
  • This will be important to remember when working with linear models with latent Gaussian Processes
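
As a quick numeric illustration of these formulas, the sketch below computes V^* and \mu^* for a simulated sample, treating \sigma^2 as given; the prior values m, v and the fixed \sigma^2 are hypothetical:

```python
# Special case (1): posterior of mu given sigma^2, using the V* and mu* formulas above.
# The prior values m, v and the fixed sigma^2 are hypothetical choices for illustration.
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(loc=3.0, scale=1.0, size=50)   # data with true mean 3
m, v, sig2 = 0.0, 10.0, 1.0                   # prior N(m, v); sigma^2 treated as given

V_star = 1.0 / (1.0 / v + len(y) / sig2)
mu_star = V_star * (m / v + y.sum() / sig2)
print(mu_star, V_star)   # posterior mean pulled toward the sample mean, small variance
```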

Bayesian regression with conjugate priors

  • Conjugate priors are convenient because they allow us to immediately find the posterior
  • But we are restricted in how flexible we can make our model
  • They only work well in somewhat restrictive situations
  • What if we want to assume a non-conjugate prior-likelihood pair?
  • We can! But finding the posterior may not be very easy
  • Solution: posterior sampling
  • Posterior sampling methods = extract a sample from the (difficult) posterior, study the posterior by summarizing features of the extracted sample
  • Examples: Gibbs sampling, Metropolis-Hastings algorithm… more on these later
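
Just to fix ideas before the detailed treatment later, here is a minimal random-walk Metropolis sketch; the target density and step size below are hypothetical choices:

```python
# Minimal random-walk Metropolis sketch: given an (unnormalized) log target density,
# propose a local move and accept it with the Metropolis probability.
import numpy as np

def rw_metropolis(log_target, x0, n_iter=10_000, step=0.5, seed=0):
    rng = np.random.default_rng(seed)
    x = x0
    draws = np.empty(n_iter)
    for t in range(n_iter):
        prop = x + step * rng.normal()                       # symmetric proposal
        if np.log(rng.uniform()) < log_target(prop) - log_target(x):
            x = prop                                         # accept
        draws[t] = x                                         # otherwise keep current x
    return draws

# Hypothetical target proportional to exp(-|x|^3 / 3), a non-conjugate-looking density.
draws = rw_metropolis(lambda x: -abs(x) ** 3 / 3.0, x0=0.0)
print(draws.mean(), draws.std())   # summaries of the (correlated) posterior sample
```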

Multivariate Normal

  • Suppose X \sim MVN(x; \mu, \Sigma), where X is a column vector of dimension n\times 1
  • \mu also has dimension n\times 1 whereas \Sigma is an n\times n symmetric and positive definite matrix
  • E(X) = \mu and Var(X) = \Sigma
  • Taking the subvector V = [X_1, \dots, X_k]^T, we have V \sim MVN(v; \mu_V, \Sigma_V) = MVN(v; \mu_{[1:k]}, \Sigma_{[1:k, 1:k]})
  • Taking the subvector W = [X_{k+1}, \dots, X_n]^T, we have W \sim MVN(w; \mu_W, \Sigma_W) = MVN(w; \mu_{[k+1:n]}, \Sigma_{[k+1:n, k+1:n]})
  • We also have Cov(V, W) = \Sigma_{VW} = \Sigma_{[1:k, k+1:n]} = \Sigma_{WV}^T = \Sigma_{[k+1:n, 1:k]}^T

Multivariate Normal

  • Conditioning on jointly normal random vectors
  • We have V | W=w \sim MVN(v; \mu_{V|W}, \Sigma_{V|W}), where \mu_{V|W} = \mu_V + \Sigma_{VW} \Sigma_W^{-1} (w - \mu_W) and \Sigma_{V|W} = \Sigma_V - \Sigma_{VW} \Sigma_W^{-1} \Sigma_{VW}^T
  • In particular, if \mu = 0 then \mu_{V|W} = \Sigma_{VW} \Sigma_W^{-1} w
  • This is very important!
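
A minimal NumPy sketch of these conditioning formulas, with a hypothetical mean vector and covariance matrix:

```python
# The MVN conditioning formulas: partition (mu, Sigma) into the V block
# (first k entries) and the W block (the rest), then compute mu_{V|W} and Sigma_{V|W}.
# The example mean and covariance below are hypothetical.
import numpy as np

def mvn_condition(mu, Sigma, k, w):
    mu_V, mu_W = mu[:k], mu[k:]
    S_V, S_W = Sigma[:k, :k], Sigma[k:, k:]
    S_VW = Sigma[:k, k:]
    S_W_inv = np.linalg.inv(S_W)
    mu_cond = mu_V + S_VW @ S_W_inv @ (w - mu_W)           # conditional mean
    Sigma_cond = S_V - S_VW @ S_W_inv @ S_VW.T             # conditional covariance
    return mu_cond, Sigma_cond

mu = np.array([0.0, 0.0, 0.0])
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.0]])
print(mvn_condition(mu, Sigma, k=1, w=np.array([1.0, -1.0])))
```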

Multivariate Normal: linear transformations

  • If X \sim MVN(\mu, \Sigma) and a, B are an appropriately sized vector and matrix, respectively, then for Y = a + BX we have

E(Y) = E(a+BX) = a + B\ E(X)

Cov(Y)= Cov(a + BX) = B\ Cov(X)\ B^T

Y \sim MVN(a + B\mu, B\Sigma B^T)
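
A quick simulation check of these moment formulas, with hypothetical choices of \mu, \Sigma, a, and B:

```python
# Checking the moments of the linear transformation Y = a + B X by simulation.
# mu, Sigma, a, and B below are hypothetical choices.
import numpy as np

rng = np.random.default_rng(5)
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
a = np.array([0.5, 0.0, -0.5])
B = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])

X = rng.multivariate_normal(mu, Sigma, size=200_000)   # rows are draws of X
Y = a + X @ B.T                                        # Y = a + B X, one row per draw

print(Y.mean(axis=0), a + B @ mu)       # empirical vs. theoretical mean
print(np.cov(Y.T), B @ Sigma @ B.T)     # empirical vs. theoretical covariance
```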

Multivariate Normal: density evaluation via Cholesky decompositions

Density of X \sim MVN(\mu, \Sigma) where X \in \Re^n is

p(x; \mu, \Sigma) = |2\pi \Sigma |^{-\frac{1}{2}} \exp\left\{ -\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu) \right\}

  • |2\pi \Sigma |^{-\frac{1}{2}} = (2\pi)^{-\frac{n}{2}} | \Sigma |^{-\frac{1}{2}}, |\Sigma|^{-1} = \frac{1}{|\Sigma|}, and |\Sigma| = det\{\Sigma\}

  • Cholesky decomposition of \Sigma: it is the lower triangular matrix L such that L L^T = \Sigma

  • \Sigma^{-1} = L^{-T} L^{-1} where L^{-T} = (L^{-1})^T

  • det(\Sigma) = \prod_{i=1}^n L_{ii}^2

  • Therefore, if we have \Sigma, compute L then use it to evaluate density

  • Log-density most typically used in practice. This is then

\log p(x; \mu, \Sigma) = -\frac{n}{2} \log(2\pi) - \sum_{i=1}^n\log(L_{ii}) - \frac{1}{2}(x-\mu)^T L^{-T} L^{-1} (x-\mu)
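
A minimal sketch of this computation, compared against SciPy's reference implementation; the \mu and \Sigma used are hypothetical:

```python
# MVN log-density via the Cholesky factor L (Sigma = L L^T), checked against scipy.
import numpy as np
from scipy.linalg import cholesky, solve_triangular
from scipy.stats import multivariate_normal

def mvn_logpdf_chol(x, mu, Sigma):
    n = x.shape[0]
    L = cholesky(Sigma, lower=True)                 # Sigma = L L^T
    z = solve_triangular(L, x - mu, lower=True)     # z = L^{-1} (x - mu)
    # log det(Sigma) = 2 * sum(log L_ii); quadratic form = z^T z
    return -0.5 * n * np.log(2 * np.pi) - np.log(np.diag(L)).sum() - 0.5 * z @ z

rng = np.random.default_rng(6)
n = 5
A = rng.normal(size=(n, n))
Sigma = A @ A.T + n * np.eye(n)     # a hypothetical positive definite covariance
mu = rng.normal(size=n)
x = rng.normal(size=n)

print(mvn_logpdf_chol(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).logpdf(x))   # should match
```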

Stochastic process modeling

  • Goal is to model \{ Y(s), s \in D \}, an uncountable collection of random variables over D
  • We want to be able to describe any collection of random variables!
  • This is generally difficult
  • Easier: make distributional assumptions for any finite collection of random variables
  • In other words, define p(Y(s_1), \dots, Y(s_n)) for any possible finite set \cal L = \{ s_1, \dots, s_n \}
  • Kolmogorov’s consistency conditions guarantee the existence of a stochastic process \{ Y(s), s \in D \} based on our definition of p(Y(s_1), \dots, Y(s_n))

Make your own stochastic process

  1. Define p(Y(s_1), \dots, Y(s_n)) for any possible finite set \cal L = \{ s_1, \dots, s_n \}
  2. Verify that Kolmogorov’s conditions are satisfied

Example: Poisson process

  • On a continuous time domain D = \{t \in \Re: t \ge 0\}
  • N(t) is the integer number of events of interest occurring between times 0 and t; N(t) \ge 0, and if s < t, then N(s) \le N(t) and N(t) - N(s) is the number of events occurring in (s, t]
  • N(t) \sim Poisson(\lambda t) for some \lambda > 0
  • \{ N(t), t \in D \} is called a Poisson process with rate \lambda
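
A minimal simulation sketch: draw exponential inter-arrival times with rate \lambda and check that N(t) has mean (and variance) approximately \lambda t; the values of \lambda and t below are hypothetical:

```python
# Simulating a Poisson process with rate lam via exponential inter-arrival times,
# then checking that N(t) has mean and variance roughly lam * t.
import numpy as np

rng = np.random.default_rng(7)
lam, t, n_rep = 2.0, 5.0, 20_000     # hypothetical rate, horizon, and replications

counts = np.empty(n_rep, dtype=int)
for r in range(n_rep):
    # more inter-arrival times than we will plausibly need before time t
    arrivals = np.cumsum(rng.exponential(scale=1.0 / lam, size=int(3 * lam * t) + 50))
    counts[r] = np.searchsorted(arrivals, t)   # N(t) = number of arrivals up to time t

print(counts.mean(), lam * t)   # both about 10
print(counts.var(), lam * t)    # Poisson: variance equals the mean
```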

Example: Markov chains

  • Take D = \{..., -2, -1, 0, 1, 2, \dots \}
  • Consider a countable collection of random variables \{ X(t), t \in D \}
  • \{ X(t), t \in D \} is a Markov chain if

P( X(t+1) = j | X(t) = i_t, X(t-1) = i_{t-1}, \dots, X(0) = i_0 ) = P( X(t+1) = j | X(t) = i_t )

  • Notice above: we are defining a finite dimensional joint distribution by listing all the conditionals!
\begin{align*} P( X(t+1) = j,\ &X(t) = i_t,\ X(t-1) = i_{t-1},\ \dots,\ X(0) = i_0 ) =\\ & P(X(0) = i_0) \cdot P(X(1) = i_1 | X(0) = i_0) \cdots \\ & \cdots P( X(t+1) = j | X(t) = i_t,\ X(t-1) = i_{t-1},\ \dots,\ X(0) = i_0 ) \end{align*}
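
A minimal sketch of this construction: draw X(0) from an initial distribution, then draw each X(t+1) given only X(t); the two-state transition matrix below is hypothetical:

```python
# Simulating a discrete-space Markov chain by drawing X(0) from an initial
# distribution and then X(t+1) | X(t) from the transition matrix, following
# the factorization above. The two-state transition matrix P is hypothetical.
import numpy as np

rng = np.random.default_rng(8)
P = np.array([[0.9, 0.1],      # P[i, j] = P(X(t+1) = j | X(t) = i)
              [0.4, 0.6]])
init = np.array([0.5, 0.5])    # distribution of X(0)

T = 100_000
x = np.empty(T, dtype=int)
x[0] = rng.choice(2, p=init)
for t in range(T - 1):
    x[t + 1] = rng.choice(2, p=P[x[t]])   # only the current state matters

print(np.bincount(x) / T)      # long-run fraction of time in each state
```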

Markov chains: things to keep in mind

  • We do not need in-depth knowledge of Markov chains
  • We will use Markov-chain Monte Carlo (MCMC), which is based on the theory of discrete-time, continuous space Markov chains
  • For a discrete-space Markov chain with transition matrix P, and under some conditions, the t-step transition matrix P^{(t)} = P \cdots P (P multiplied by itself t times) satisfies \lim_{t \to \infty} P^{(t)}_{ij} = \pi_j
  • \pi is called the stationary distribution
  • General idea of MCMC: design and build a Markov chain whose stationary distribution is a target density \pi(\cdot); then, by running the Markov chain for many iterations, we eventually obtain correlated samples from \pi(\cdot) (see the sketch below)
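
A quick numerical check of the limit above, using the same hypothetical two-state transition matrix as before: the rows of P^{(t)} converge to \pi.

```python
# The rows of the t-step transition matrix converge to the stationary distribution pi.
# Same hypothetical two-state transition matrix P as in the previous sketch.
import numpy as np

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

Pt = np.linalg.matrix_power(P, 50)   # P multiplied by itself 50 times
print(Pt)                            # both rows are approximately pi

# pi solves pi = pi P with entries summing to one; for this P, pi = (0.8, 0.2)
eigval, eigvec = np.linalg.eig(P.T)
pi = np.real(eigvec[:, np.argmax(np.real(eigval))])
print(pi / pi.sum())
```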