Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Lab 2: Gibbs Sampling and Metropolis-Hastings


Name:

Collaborators:


Background

In this homework assignment, we will investigate a hierarchical Bayesian model of polling data. To perform inference, we will implement a hybrid Gibbs/Metropolis-Hastings sampler.

Polling Data

The 2004 election between George W. Bush and John Kerry was closely contested. Both campaigns focused heavily on the swing state of Ohio, worth 20 votes in the electoral college. CNN/USA Today/Gallup conducted several Ohio polls in the months leading up to the election on November 4th, and obtained the following raw results:

KerryBushTotal
Sept. 4-7284344628
Sept. 25-28312325637
Oct. 17-20346339685
Oct. 28-315565111067

The polling data can be found here (although there are some dead links on the page).

We will let {xi}i=14\{x_i\}_{i = 1}^4 denote the number of votes for Kerry in each poll, and let {ni}i=14\{n_i\}_{i = 1}^4 denote the total number surveyed in each poll. We will represent these quantities in code using integer PyTorch tensors.

import torch
from torch.distributions import Gamma, Beta, Binomial, Normal
torch.manual_seed(305)

import matplotlib.pyplot as plt
from matplotlib.cm import Blues
import seaborn as sns
sns.set_context("notebook")
xs = torch.tensor([284, 312, 346, 556], dtype=int)
ns = torch.tensor([628, 637, 685, 1067], dtype=int)

Hierarchical Model

It is reasonable to model each voter’s support as a draw from a Bernoulli distribution, with some poll-specific probability ρi\rho_i of supporting Kerry. We believe each poll has a different probability since the polls are conducted at different times and may reach different subpopulations of voters (e.g. if one is conducted over telephone and another is an internet survey). However, these ρi\rho_i are likely highly correlated due to a base level of support for Kerry in the general population. We will therefore use the following model:

ρiBeta(α,β)for i=1,,4xiρiBin(ρi,ni)for i=1,,4\begin{align*} \rho_i &\sim \mathrm{Beta}(\alpha, \beta) \quad \text{for } i =1, \ldots, 4 \\ x_i \mid \rho_i &\sim \mathrm{Bin}(\rho_i, n_i) \quad \text{for } i =1, \ldots, 4 \end{align*}

Note that the beta distribution is a conjugate prior for the binomial distribution. (See also, Bishop Ch. 2.1.)

Following a Bayesian approach, we introduce priors over the parameters α\alpha and β\beta. Unfortunately, the parameters of the beta distribution do not have a simple conjugate prior. Instead, we will assume α\alpha and β\beta follow gamma distributions, resulting in the following Bayesian hierarchical model.

αGa(cα,dα)βGa(cβ,dβ)ρiα,βBeta(α,β)for i=1,,4xiρiBin(ρi,ni)for i=1,,4\begin{align*} \alpha &\sim \mathrm{Ga}(c_\alpha, d_\alpha) \\ \beta &\sim \mathrm{Ga}(c_\beta, d_\beta) \\ \rho_i \mid \alpha, \beta &\sim \mathrm{Beta}(\alpha, \beta) \quad \text{for } i =1, \ldots, 4 \\ x_i \mid \rho_i &\sim \mathrm{Bin}(\rho_i, n_i) \quad \text{for } i =1, \ldots, 4 \end{align*}

Here, cα,dα,cβ,dβc_\alpha, d_\alpha, c_\beta, d_\beta are fixed hyperparameters for the “hyper-priors” on α\alpha and β\beta and we use the shape-rate parametrization of the gamma distribution.

How should we interpret these hyperparameters? Recall that in a beta-Bernoulli model we can interpret the parameters of the beta as pseudo-observations. That is, we can view α\alpha and β\beta as results from a prior poll that resulted in α\alpha voters for Kerry and β\beta voters for Bush. Thus, we can interpret the mean of α\alpha as previously observed supporters for Kerry and its standard deviation as a measure of uncertainty in this number of previous voters.

Since the mean of α\alpha is cαdα\frac{c_\alpha}{d_\alpha} and the standard deviation is cαdα2\sqrt{\frac{c_\alpha}{d_\alpha^2}}, and similarly for β\beta, we can adjust the hyperparameters to accord with our prior belief on Kerry/Bush’s level of support.

Problem 1 [Math]: Derive the complete conditional for ρ\rho

In order to implement a Gibbs sampler, we must be able to sample from the complete conditional of each of the unobserved variables in our model. We will sample the variables ρ=(ρ1,,ρ4)\rho = (\rho_1, \dots, \rho_4) simultaneously as part of a block Gibbs update.

Part (a): Demonstrate conditional independence of {ρi}\{\rho_i\}

First, show that we have:

p(ρα,β,{xi})=i=14p(ρiα,β,xi)p(\rho \mid \alpha, \beta, \{x_i\}) = \prod_{i=1}^4 p(\rho_i \mid \alpha, \beta, x_i)

Your answer here.


Part (b):

Fixing ii, derive the conditional density of ρi\rho_i given α,β,xi\alpha, \beta, x_i, that is, compute:

p(ρiα,β,xi)p(\rho_i \mid \alpha, \beta, x_i)

Hint: Remember that this conditional density is proportional to the joint density of all the variables. The density should simplify nicely to a known density.


Your answer here.


Problem 2: Write code to sample from the complete conditional for ρ\rho

Write a function which generates a sample from the conditional distribution for ρ\rho given α\alpha, β\beta, and {xi,ni}\{x_i, n_i\}.

from torch.distributions import Beta

def gibbs_sample_rhos(alpha, beta, xs, ns):
    """
    Returns a sample from the complete conditional of \{\rho_i\} given alpha, 
    beta, and \{x_i, n_i\}.

    Args:
        alpha: scalar value > 0
        beta: scalar value > 0
        xs, ns: the data
    Returns:
        rhos: torch.tensor of length 4 with values in [0,1]
    """
    ##
    # YOUR CODE HERE
    #
    ##
    return rhos

Problem 3: Derive and implement the complete conditional for α\alpha

Part (a): Derive the complete conditional

Next, we need to derive the complete conditional for the global variables α\alpha and β\beta. We will first find the complete conditional of α\alpha. As usual, you should use that the conditional density is proportional to the joint density to find this conditional. However, unlike earlier, the conditional density is not of a simple, known form. Instead, you should find an expression for the conditional density up to an unknown normalizing constant, i.e. an expression that the conditional density is proportional to, up to a constant.


Your answer here.


Part (b): Write code evaluating the unnormalized log probability

By taking the logarithm of your answer above, we have an expression for the log conditional density, up to an additive constant. Implement this expression below in code.

Hint: You may find the Pytorch function lgamma useful.

def alpha_log_cond(alpha, beta, rhos, c_alpha, d_alpha):
    """
    Returns log p(\alpha \mid \beta, \{\rho_i\}, \{x_i\})
    Args:
        alpha: scalar value > 0
        beta: scalar value > 0
        rhos: torch.tensor of length 4 with values in [0,1]
        c_alpha: scalar value > 0
        d_alpha: scalar value > 0
    Returns:
        log_cond: scalar, conditional log probability of alpha
    """
    ##
    # YOUR CODE HERE
    #
    ##
    return log_cond

Problem 4: Derive and implement the complete conditional for β\beta

Part (a): Derive the complete conditional

Write down an expression for the complete conditional density of β\beta, up to a normalizing constant. Your expression should be very similar to the expression for α\alpha in problem 3(a). You do not need to show your work.


Your answer here.


Part (b): Write code evaluating the unnormalized log probability

As in 3(b), write a function which evaluates the log complete conditional for β\beta, up to an additive constant. This should be very similar to your function from 3(b).

def beta_log_cond(alpha, beta, rhos, c_beta, d_beta):
    """
    Returns log p(\beta \mid \alpha, \{\rho_i\}, \{x_i\})
    Args:
        alpha: scalar value > 0
        beta: scalar value > 0
        rhos: torch.tensor of length 4 with values in [0,1]
        c_beta: scalar value > 0
        d_beta: scalar value > 0
    Returns:
        log_cond: scalar, conditional log probability of beta
    """
    ##
    # YOUR CODE HERE
    #
    ##
    return log_cond

Problem 5: Write code to sample from the complete conditionals for α\alpha and β\beta

Because we only know the complete conditionals for α\alpha and β\beta up to a normalizing constant, we will use the Metropolis-Hastings algorithm to sample from them. We will use a normal distribution centered at our current point as the proposal distribution. We will use the same standard deviation for the proposals of α\alpha and β\beta. Thus, the proposal distributions will be:

q(αα)=N(α;α,l2)q(ββ)=N(β;β,l2)\begin{align*} q(\alpha \mid \alpha') &= \mathcal{N}(\alpha ; \alpha', l^2) \\ q(\beta \mid \beta') &= \mathcal{N}(\beta ; \beta', l^2) \end{align*}

for some l>0l > 0.

Note that we require α>0\alpha > 0 and β>0\beta > 0, so we may make a proposal which lies outside the support of the complete conditional. You should convince yourself that there is zero probability of accepting such a proposal in the Metropolis algorithm.

Implement a step of the Metropolis-Hastings algorithm for both α\alpha and β\beta in the function below. You will have to use alpha_log_cond and beta_log_cond which you implemented above.

Hint: You may find torch.rand useful for simulating uniform random variables, or torch.distributions.Bernoulli for sampling binary random variables with specified probabilities.

def mh_step_alpha_beta(alpha, beta, rhos, c_alpha, d_alpha, c_beta, d_beta, l=20.):
    """
    Performs a MH step for both alpha and beta.
    Args:
        alpha, beta, rhos: current values of these random variables
        c_alpha, d_alpha, c_beta, d_beta: hyperparameters
        l: standard deviation of proposal distribution
    Returns:
        new_alpha, new_beta: new values of alpha and beta after MH step
    """
    new_alpha = Normal(alpha, l).sample()

    # If proposal is negative, we can reject immediately.
    if new_alpha < 0:
        new_alpha = alpha
    # Otherwise, we must check the accept-reject condition.
    else:
    ##
    # YOUR CODE HERE
    #
    ##
    
    new_beta = Normal(beta, l).sample()
    # If proposal is negative, we can reject immediately.
    if new_beta < 0:
        new_beta = beta
    # Otherwise, we must check the accept-reject condition.
    else:
    ##
    # YOUR CODE HERE
    #
    ##

    return new_alpha, new_beta

Problem 6: Implement a Gibbs sampler

Using mh_step_alpha_beta and sample_rhos, implement a Gibbs sampler for the proposed model.

Using the generated samples, we will estimate the posterior mean and standard deviation of ρ1,,ρ4\rho_1, \dots, \rho_4, and αα+β\frac{\alpha}{\alpha + \beta}. The latter quantity represents the base level of support for Kerry in the broader population, and is of particular interest for who will win the election.

We will also track the joint log-probability of our sample throughout the iterations of the sampler. This will be useful as a diagnostic tool to check whether our sampler is correctly implemented. To easily evaluate the joint probability in code, we recommend making use of the torch.distributions imported above.

rhos_0 = xs / ns
def gibbs(n_iter,
          xs,
          ns,
          alpha=torch.tensor(10.),
          beta=torch.tensor(10.),
          rhos=rhos_0,
          c_alpha=torch.tensor(6.25), 
          d_alpha=torch.tensor(0.025),
          c_beta=torch.tensor(6.25),
          d_beta=torch.tensor(0.025),
          burn_in=0.5):
    """
    Performs Gibbs sampling in the Bayesian hierarchical model of polling data.
    Args:
        n_iter: number of iterations of Gibbs sampling
        xs, ns: the data
        alpha, beta, rhos: initial values of these random variables
        c_alpha, d_alpha, c_beta, d_beta: hyperparameters
        burn_in: fraction of samples to discard before computing expectations
    Returns:
        rhos_mean, rhos_std: posterior mean and standard deviation of rhos
        prob_mean, prob_std: posterior mean and standard deviation of alpha/(alpha+beta)
        lps: torch tensor of size 'n_iter' containing log joint probabilities.
    """
    rhos_samples = []
    prob_samples = []
    lps = torch.zeros(n_iter)

    for it in range(n_iter):
        # Resample alpha, beta using a single Metropolis-Hastings step.
        ##
        # YOUR CODE HERE
        # alpha, beta = ...
        ##        

        # Resample rhos using the complete conditional.
        ##
        # YOUR CODE HERE
        # rhos = ...
        ##

        if it > n_iter * burn_in:
            rhos_samples += [rhos]
            prob_samples += [alpha / (alpha + beta)]

        # Evaluate log-joint probability at current sample
        ##
        # YOUR CODE HERE
        #
        ##

    rhos_samples = torch.stack(rhos_samples)
    prob_samples = torch.stack(prob_samples)
    rhos_mean, rhos_std = torch.mean(rhos_samples, axis=0), torch.std(rhos_samples, axis=0)
    prob_mean, prob_std = torch.mean(prob_samples), torch.std(prob_samples)

    return rhos_mean, rhos_std, prob_mean, prob_std, lps

Problem 7: Model Diagnostics

Run Gibbs sampling for 10000 iterations (this should take about 15 seconds to run). Use the default values provided in the function signature. Plot the evolution of the joint-log-probability over iterations.

# Run Gibbs sampling for 10000 iterations
##
# YOUR CODE HERE
#
##
# Plot the joint-log probability over iterations
##
# YOUR CODE HERE
#
##

Part (a):

What are the posterior means and standard deviations for ρ1,...,ρ4\rho_1, ..., \rho_4 and αα+β\frac{\alpha}{\alpha + \beta}? Do these results seem reasonable to you?

# Print the posterior means and standard deviations.
##
# YOUR CODE HERE
#
##

Your answer here.


Part (b):

Find a setting of the hyperparameters cα,dα,cβ,dβc_\alpha, d_\alpha, c_\beta, d_\beta which will result in a prior that heavily favors Kerry. Re-run the Gibbs sampler with these settings. What is the new mean and standard deviation for αα+β\frac{\alpha}{\alpha + \beta}?

# Run Gibbs sampling with new hyperparameters and print the posterior mean and 
# standard deviation.
##
# YOUR CODE HERE
# 
##

Your answer here.


Problem 8: Reflections

Part (a):

Describe a setting in which a Gibbs sampler will be slow to mix. Why might our parametrization of the beta distribution in terms of α\alpha, β\beta slow down the Gibbs sampler? Can you think of a different parametrization which would work better?


Your answer here.


Part (b):

Describe one way in which a Markov Chain using a Metropolis-Hastings adjusted transition may be slow to mix. Which variable in the code above could be tuned in order to speed up convergence?


Your answer here.


Submission Instructions

Formatting: check that your code does not exceed 80 characters in line width. You can set Tools → Settings → Editor → Vertical ruler column to 80 to see when you’ve exceeded the limit.

Download your notebook in .ipynb format. Then run the following command to convert to a PDF:

jupyter nbconvert --to pdf <yourname>_hw1.ipynb

Installing nbconvert:

If you’re using Anaconda for package management,

conda install -c anaconda nbconvert

Back up option

If you can’t get nbconvert to work, you can save it as a PDF from the browser. Just make sure that your code and figures are not cut off.

Upload your .pdf files to Gradescope.