ST308 Bayesian Inference

Class 7


Philipp Sterzinger

p.sterzinger@lse.ac.uk

12 March 2025

Recap

Markov Chains

  • Dependent sequence of random variables \(\{x_t\}_{t = 1, \ldots, T}\) taking values in \(\mathcal{X}\)
  • At each point \(t\), the sequence jumps between states according to transition probability \(\Pr(x_{t} \mid x_{t-1}, \ldots, x_1)\), \(\Pr(x_1 = x) = \pi(x)\)
  • Markov property

\[ \Pr(x_{t+1} \mid x_t, \ldots, x_1) = \Pr(x_{t + 1} \mid x_t) \]

Code
library(ggplot2)

T <- 10 
state_space <- 1:3
P <- c(0.5, 0.25, 0.25, 1 / 3, 0.125, 13 / 24, 0.1, 0.2, 0.8) 
P <- matrix(P, nrow = 3, byrow = TRUE) 

set.seed(123)
xs <- numeric(T) 
xs[1] <- sample(state_space, 1) 
for(i in 2:T){
    xs[i] <- sample(state_space, 1, prob = P[xs[i - 1],])
}

mc <- data.frame(T = 1:T, xs = xs) 

ggplot(mc, aes(x = T, y = xs)) + 
    geom_step(color = "grey", linetype = "dashed", linewidth = 1) +  
    geom_point(color = "steelblue", size = 3) +  
    theme_minimal() +  
    labs(title = "A Markov Chain", y = "States", x = "T") +  
    theme(
        plot.title = element_text(hjust = 0.5, face = "bold"),  
        axis.text.x = element_text(size = 12),
        axis.text.y = element_text(size = 12), 
        panel.border = element_blank(), 
        panel.grid.major.x = element_blank(),  
        panel.grid.minor = element_blank(),
        axis.line.x = element_line(color = "black"),  
        axis.line.y = element_line(color = "black") 
    ) + 
    scale_x_continuous(breaks = 1:T) + 
    scale_y_continuous(breaks = state_space) 

Markov Chains

  • Joint distribution

\[ \begin{aligned} \Pr(x_t, x_{t-1}, \ldots, x_1) &= \Pr(x_t \mid x_{t - 1}, \ldots, x_1) \Pr(x_{t-1}, \ldots, x_1) \\ &= \Pr(x_t \mid x_{t - 1}) \Pr(x_{t-1}, \ldots, x_1) \\ \vdots \\ &= \pi(x_1) \prod_{i = 1}^{t-1} \Pr(x_{i + 1} \mid x_i) \end{aligned} \]

  • Time homogenuous Markov chain

\[ \Pr(x_{t + 1} = x' \mid x_t = x) = T(x, x') \quad (\textrm{independent of } t) \]

  • Stationary distribution

\[ \pi^*(x') = \sum_{x \in \mathcal{X}} T(x, x') \pi^*(x) \]

Markov Chains

Stationary distribution

If we choose \(x_1\) according to \(\pi^*\), then

\[ \begin{aligned} \Pr(x_2 = x') &= \sum_{x \in \mathcal{X}} \Pr(x_2 = x' \mid x_1 = x) \Pr(x_1 = x) =\sum_{x \in \mathcal{X}} T(x, x') \pi^*(x) = \pi^*(x') \\ \Pr(x_3 = x') &= \sum_{x \in \mathcal{X}} \Pr(x_3 = x' \mid x_2 = x) \Pr(x_2 = x) =\sum_{x \in \mathcal{X}} T(x, x') \pi^*(x) = \pi^*(x') \\ \vdots \\ \Pr(x_t = x') &= \sum_{x \in \mathcal{X}} \Pr(x_t = x' \mid x_{t - 1} = x) \Pr(x_{t - 1} = x) =\sum_{x \in \mathcal{X}} T(x, x') \pi^*(x) = \pi^*(x') \,, \end{aligned} \]

\(\longrightarrow\) the marginal distribution of the Markov chain at time \(t\) is equal to \(\pi^{*}\) for \(t = 1,2,\ldots\)

Markov Chains

Theorem 1 Under some regularity conditions, regardless of the initial distribution,

\[ \lim\limits_{t \to \infty} \Pr(x_t = x) = \pi^{*}(x) \,, \] where \(\pi^{*}\) is the unique stationary distribution of the Markov chain.

Construct Markov chain such that \(\pi^*\) is posterior distribution of interest

  • Pick initial distribution \(\pi\), \(T(x,x')\)
  • Simulate a Markov chain
  • After long \(t\), \(\Pr(x_t = x) \approx \pi^*(x)\)
  • Use the realisations of the Markov chain as an approximation to \(\pi^*\)

\(\longrightarrow\) this is called Markov chain Monte Carlo

Metropolis Hastings

Our states are now values of a parameter vector \(\theta\) for which we wish to approximate the posterior given data \(y\)

Ingredients:

  • Posterior: \(\pi(\theta \mid y)\), sufficient to have \(f(y \mid \theta) \pi(\theta)\)
  • Proposal distribution \(q(\theta_t \mid \theta)\) from which we can sample

Algorithm:

  • Initialise \(\theta_0\) at time \(t = 0\)

  • For \(t = 0,\ldots, T-1\) do

    • \(\theta^* \sim q(\theta^*, \theta_t)\)
    • Set \(\theta_{t + 1} = \theta^*\) with probability \(\alpha(\theta_t, \theta^*)\) for \[\alpha(\theta_t, \theta^*) = \min \left\{\frac{\pi(\theta^* \mid y)}{\pi(\theta_t \mid y)} \frac{q(\theta_t \mid \theta^*)}{q(\theta^* \mid \theta_t)} , 1\right\}\]
    • Else, \(\theta_{t + 1} = \theta_t\)

Gibbs sampler

Ingredients:

  • \(\theta = (\theta_1,\ldots,\theta_p)\)
  • \(\pi(\theta_i \mid \theta_{-i}, y)\), from which we can sample

Algorithm:

  • Initialise \(\theta^{(0)}\) at time \(t = 0\)

  • For \(t = 1,\ldots, T\) do

    • \(\theta_1^{(t)} \sim \pi(\theta_1 \mid \theta_{2}^{(t-1)},\ldots, \theta_{p}^{(t-1)}, y)\)
    • \(\theta_2^{(t)} \sim \pi(\theta_2 \mid \theta_1^{(t)}, \theta_{3}^{(t-1)},\ldots, \theta_{p}^{(t-1)}, y)\)
    • \(\theta_3^{(t)} \sim \pi(\theta_3 \mid \theta_1^{(t)}, \theta_2^{(t)}, \theta_{4}^{(t-1)},\ldots, \theta_{p}^{(t-1)}, y)\)
    • etc

Exercises

Gibbs Sampler

We will illustrate a Gibbs sampler via the following toy example. Let \(y=(y_1,\dots,y_n)\) be a random sample from the N(\(\mu,\sigma^2\)). Leet’s use the improper prior \(\pi(\mu,\sigma^2)\propto (\sigma^2)^{-1}\) for simplicity.

As shown in class the full conditional distributions for \(\mu\) and \(\sigma^2\) are the N\((\bar{y},\frac{\sigma^2}{n})\) and the IGamma\(\left(n/2, \frac{1}{2}\sum_{i=1}^n(y_i-\mu)^2\right)\) respectively.

We will generate \(1000\) independent N(\(\mu,\sigma^2\)) random variables setting \(\mu\) and \(\sigma^2\) to some values.

Code
set.seed(10)
N=1000;
mu = 5;
sigma = 4;
y = rnorm(N,mu,sigma);
mean(y)
[1] 5.045499
Code
sd(y)
[1] 3.967368

Gibbs Sampler

Then we will run the Gibbs Sampler for the model above to sample from the posterior of \(\mu\) and \(\sigma\)

Code
Niter = 1000 # number of MCMC iterations
out_mu = rep(NA,Niter); #vector to store the mu draws
out_sigma2 = rep(NA,Niter); #vector to store the sigma2 draws
mu = 20; # initial value for mu
sigma2 = 2; # initial value for sigma2

xbar = mean(y) # sufficient stat for mu
alpha = N/2; # alpha parameter of the IGamma full conditional for sigma2

#Main loop of Gibbs Sampler
for (iter in 1:Niter){
  # Store mu and sigma2 values
  out_mu[iter] = mu 
  out_sigma2[iter] = sigma2
  
  #update mu given sigma2 (and y)
  mu = rnorm(1,xbar,sqrt(sigma2/N))
  
  #update sigma2 given mu (and y)
  beta = 0.5*sum((y-mu)^2);
  sigma2  = 1/rgamma(1,alpha,beta);
}

Gibbs Sampler

Finally we will summarise the posterior output to obtains Bayes estimators and \(95\%\) credible intervals.

Code
plot(out_mu,type='l')

Code
print('mu')
[1] "mu"
Code
print(c(mean(out_mu),median(out_mu),quantile(out_mu,probs=c(0.025,0.975))))
                      2.5%    97.5% 
5.055802 5.045980 4.788446 5.279399 
Code
plot(out_sigma2,type = 'l')

Code
print('sigma2')
[1] "sigma2"
Code
print(c(mean(out_sigma2),median(out_sigma2),quantile(out_sigma2,probs=c(0.025,0.975))))
                      2.5%    97.5% 
15.75077 15.75798 14.40502 17.07139 

Gibbs Sampler

Posterior distribution

Code
library(MASS) 
library(pracma)

X = cbind(out_mu, out_sigma2)[100:Niter, ]

density_est <- kde2d(X[,1], X[,2], n = 100)
f <- function(x0, y0) {
  interp2(density_est$x, density_est$y, density_est$z, x0, y0, method = "linear")
}


vx=seq(4.5, 5.5,length=201)
vy=seq(13.5, 18,length=201)
z=outer(vx,vy, f)

xhist <- hist(X[,1], plot=FALSE)
yhist <- hist(X[,2], plot=FALSE)

top <- max(c(xhist$density, yhist$density))
nf <- layout(matrix(c(2,0,1,3),2,2,byrow=TRUE), c(3,1), c(1,3), TRUE)
par(mar=c(3,3,1,1))
image(vx,vy,z,col=rev(heat.colors(101)))
contour(vx,vy,z,col="blue",add=TRUE)
points(X,cex=.2)
par(mar=c(0,3,1,1))
barplot(xhist$density, axes=FALSE, ylim=c(0, top), space=0,col="light green")
lines((density(X[,1])$x-xhist$breaks[1])/diff(xhist$breaks)[1],
density(X[,1])$y,col="red")
par(mar=c(3,0,1,1))
barplot(yhist$density, axes=FALSE, xlim=c(0, top), space=0, 
horiz=TRUE,col="light green")
lines(density(X[,2])$y,(density(X[,2])$x-yhist$breaks[1])/
diff(yhist$breaks)[1],col="red")

Gibbs Sampler

Activity

Repeat the above exercise for the following model \[ y_i\sim \text{Poisson}(\lambda), \;\;i=1,...,N, \] with priors \[ \lambda \sim \text{Gamma}(2,\beta), \] \[ \beta \sim \text{Exponential}(1) \] It can be shown (good exercise) that the full conditionals for \(\lambda,\beta\) are the Gamma\((2 + \sum_i y_i,n+\beta)\) and the Gamma\((3,1+\lambda)\).

Gibbs Sampler

Data can be simulated in the following way:

Code
set.seed(5)
N=1000;
beta = 1;
lambda = rgamma(1,2,beta);
y = rpois(N,lambda);
mean(y)
[1] 0.645
Code
lambda
[1] 0.646926

Run the Gibbs Sampler for the model above to sample from the posterior of \(\lambda\) and \(\beta\)

Gibbs Sampler

Code
Niter <- 1000
lambdas <- numeric(Niter)
betas <- numeric(Niter) 

lambda <- .5
beta <- 2

y_bar <- sum(y) 

for(i in 1:Niter){
    lambdas[i] <- lambda 
    betas[i] <- beta 

    lambda <- rgamma(1, 2 + y_bar, N + beta) 
    beta <- rgamma(1, 3, 1 + lambda) 
}
Code
plot(lambdas,type='l')

Code
print('lambda')
[1] "lambda"
Code
print(c(mean(lambdas),median(lambdas),quantile(lambdas,probs=c(0.025,0.975))))
                         2.5%     97.5% 
0.6461722 0.6460111 0.5998643 0.6948180 
Code
plot(betas,type = 'l')

Code
print('beta')
[1] "beta"
Code
print(c(mean(betas),median(betas),quantile(betas,probs=c(0.025,0.975))))
                         2.5%     97.5% 
1.8358719 1.6611382 0.4086278 4.4935497