HW2: Gibbs Sampling and Metropolis-Hastings#


Name:

Collaborators:


Background#

In this homework assignment, we will investigate a hierarchical Bayesian model of polling data. To perform inference, we will implement a hybrid Gibbs/Metropolis-Hastings sampler.

Polling Data#

The 2004 election between George W. Bush and John Kerry was closely contested. Both campaigns focused heavily on the swing state of Ohio, worth 20 votes in the electoral college. CNN/USA Today/Gallup conducted several Ohio polls in the months leading up to the election on November 4th, and obtained the following raw results:

Kerry

Bush

Total

Sept. 4-7

284

344

628

Sept. 25-28

312

325

637

Oct. 17-20

346

339

685

Oct. 28-31

556

511

1067

The polling data can be found here (although there are some dead links on the page).

We will let \(\{x_i\}_{i = 1}^4\) denote the number of votes for Kerry in each poll, and let \(\{n_i\}_{i = 1}^4\) denote the total number surveyed in each poll. We will represent these quantities in code using integer PyTorch tensors.

import torch
from torch.distributions import Gamma, Beta, Binomial, Normal
torch.manual_seed(305)

import matplotlib.pyplot as plt
from matplotlib.cm import Blues
import seaborn as sns
sns.set_context("notebook")
xs = torch.tensor([284, 312, 346, 556], dtype=int)
ns = torch.tensor([628, 637, 685, 1067], dtype=int)

Hierarchical Model#

It is reasonable to model each voter’s support as a draw from a Bernoulli distribution, with some poll-specific probability \(\rho_i\) of supporting Kerry. We believe each poll has a different probability since the polls are conducted at different times and may reach different subpopulations of voters (e.g. if one is conducted over telephone and another is an internet survey). However, these \(\rho_i\) are likely highly correlated due to a base level of support for Kerry in the general population. We will therefore use the following model:

\[\begin{split} \begin{align*} \rho_i &\sim \mathrm{Beta}(\alpha, \beta) \quad \text{for } i =1, \ldots, 4 \\ x_i \mid \rho_i &\sim \mathrm{Bin}(\rho_i, n_i) \quad \text{for } i =1, \ldots, 4 \end{align*} \end{split}\]

Note that the beta distribution is a conjugate prior for the binomial distribution. (See also, Bishop Ch. 2.1.)

Following a Bayesian approach, we introduce priors over the parameters \(\alpha\) and \(\beta\). Unfortunately, the parameters of the beta distribution do not have a simple conjugate prior. Instead, we will assume \(\alpha\) and \(\beta\) follow gamma distributions, resulting in the following Bayesian hierarchical model.

\[\begin{split} \begin{align*} \alpha &\sim \mathrm{Ga}(c_\alpha, d_\alpha) \\ \beta &\sim \mathrm{Ga}(c_\beta, d_\beta) \\ \rho_i \mid \alpha, \beta &\sim \mathrm{Beta}(\alpha, \beta) \quad \text{for } i =1, \ldots, 4 \\ x_i \mid \rho_i &\sim \mathrm{Bin}(\rho_i, n_i) \quad \text{for } i =1, \ldots, 4 \end{align*} \end{split}\]

Here, \(c_\alpha, d_\alpha, c_\beta, d_\beta\) are fixed hyperparameters for the “hyper-priors” on \(\alpha\) and \(\beta\) and we use the shape-rate parametrization of the gamma distribution.

How should we interpret these hyperparameters? Recall that in a beta-Bernoulli model we can interpret the parameters of the beta as pseudo-observations. That is, we can view \(\alpha\) and \(\beta\) as results from a prior poll that resulted in \(\alpha\) voters for Kerry and \(\beta\) voters for Bush. Thus, we can interpret the mean of \(\alpha\) as previously observed supporters for Kerry and its standard deviation as a measure of uncertainty in this number of previous voters.

Since the mean of \(\alpha\) is \(\frac{c_\alpha}{d_\alpha}\) and the standard deviation is \(\sqrt{\frac{c_\alpha}{d_\alpha^2}}\), and similarly for \(\beta\), we can adjust the hyperparameters to accord with our prior belief on Kerry/Bush’s level of support.

Problem 1 [Math]: Derive the complete conditional for \(\rho\)#

In order to implement a Gibbs sampler, we must be able to sample from the complete conditional of each of the unobserved variables in our model. We will sample the variables \(\rho = (\rho_1, \dots, \rho_4)\) simultaneously as part of a block Gibbs update.

Part (a): Demonstrate conditional independence of \(\{\rho_i\}\)#

First, show that we have:

\[ p(\rho \mid \alpha, \beta, \{x_i\}) = \prod_{i=1}^4 p(\rho_i \mid \alpha, \beta, x_i) \]

Your answer here.


Part (b):#

Fixing \(i\), derive the conditional density of \(\rho_i\) given \(\alpha, \beta, x_i\), that is, compute:

\[p(\rho_i \mid \alpha, \beta, x_i)\]

Hint: Remember that this conditional density is proportional to the joint density of all the variables. The density should simplify nicely to a known density.


Your answer here.


Problem 2: Write code to sample from the complete conditional for \(\rho\)#

Write a function which generates a sample from the conditional distribution for \(\rho\) given \(\alpha\), \(\beta\), and \(\{x_i, n_i\}\).

from torch.distributions import Beta

def gibbs_sample_rhos(alpha, beta, xs, ns):
    """
    Returns a sample from the complete conditional of \{\rho_i\} given alpha, 
    beta, and \{x_i, n_i\}.

    Args:
        alpha: scalar value > 0
        beta: scalar value > 0
        xs, ns: the data
    Returns:
        rhos: torch.tensor of length 4 with values in [0,1]
    """
    ##
    # YOUR CODE HERE
    #
    ##
    return rhos

Problem 3: Derive and implement the complete conditional for \(\alpha\)#

Part (a): Derive the complete conditional#

Next, we need to derive the complete conditional for the global variables \(\alpha\) and \(\beta\). We will first find the complete conditional of \(\alpha\). As usual, you should use that the conditional density is proportional to the joint density to find this conditional. However, unlike earlier, the conditional density is not of a simple, known form. Instead, you should find an expression for the conditional density up to an unknown normalizing constant, i.e. an expression that the conditional density is proportional to, up to a constant.


Your answer here.


Part (b): Write code evaluating the unnormalized log probability#

By taking the logarithm of your answer above, we have an expression for the log conditional density, up to an additive constant. Implement this expression below in code.

Hint: You may find the Pytorch function lgamma useful.

def alpha_log_cond(alpha, beta, rhos, c_alpha, d_alpha):
    """
    Returns log p(\alpha \mid \beta, \{\rho_i\}, \{x_i\})
    Args:
        alpha: scalar value > 0
        beta: scalar value > 0
        rhos: torch.tensor of length 4 with values in [0,1]
        c_alpha: scalar value > 0
        d_alpha: scalar value > 0
    Returns:
        log_cond: scalar, conditional log probability of alpha
    """
    ##
    # YOUR CODE HERE
    #
    ##
    return log_cond

Problem 4: Derive and implement the complete conditional for \(\beta\)#

Part (a): Derive the complete conditional#

Write down an expression for the complete conditional density of \(\beta\), up to a normalizing constant. Your expression should be very similar to the expression for \(\alpha\) in problem 3(a). You do not need to show your work.


Your answer here.


Part (b): Write code evaluating the unnormalized log probability#

As in 3(b), write a function which evaluates the log complete conditional for \(\beta\), up to an additive constant. This should be very similar to your function from 3(b).

def beta_log_cond(alpha, beta, rhos, c_beta, d_beta):
    """
    Returns log p(\beta \mid \alpha, \{\rho_i\}, \{x_i\})
    Args:
        alpha: scalar value > 0
        beta: scalar value > 0
        rhos: torch.tensor of length 4 with values in [0,1]
        c_beta: scalar value > 0
        d_beta: scalar value > 0
    Returns:
        log_cond: scalar, conditional log probability of beta
    """
    ##
    # YOUR CODE HERE
    #
    ##
    return log_cond

Problem 5: Write code to sample from the complete conditionals for \(\alpha\) and \(\beta\)#

Because we only know the complete conditionals for \(\alpha\) and \(\beta\) up to a normalizing constant, we will use the Metropolis-Hastings algorithm to sample from them. We will use a normal distribution centered at our current point as the proposal distribution. We will use the same standard deviation for the proposals of \(\alpha\) and \(\beta\). Thus, the proposal distributions will be:

\[\begin{split} \begin{align*} q(\alpha \mid \alpha') &= \mathcal{N}(\alpha ; \alpha', l^2) \\ q(\beta \mid \beta') &= \mathcal{N}(\beta ; \beta', l^2) \end{align*} \end{split}\]

for some \(l > 0\).

Note that we require \(\alpha > 0\) and \(\beta > 0\), so we may make a proposal which lies outside the support of the complete conditional. You should convince yourself that there is zero probability of accepting such a proposal in the Metropolis algorithm.

Implement a step of the Metropolis-Hastings algorithm for both \(\alpha\) and \(\beta\) in the function below. You will have to use alpha_log_cond and beta_log_cond which you implemented above.

Hint: You may find torch.rand useful for simulating uniform random variables, or torch.distributions.Bernoulli for sampling binary random variables with specified probabilities.

def mh_step_alpha_beta(alpha, beta, rhos, c_alpha, d_alpha, c_beta, d_beta, l=20.):
    """
    Performs a MH step for both alpha and beta.
    Args:
        alpha, beta, rhos: current values of these random variables
        c_alpha, d_alpha, c_beta, d_beta: hyperparameters
        l: standard deviation of proposal distribution
    Returns:
        new_alpha, new_beta: new values of alpha and beta after MH step
    """
    new_alpha = Normal(alpha, l).sample()

    # If proposal is negative, we can reject immediately.
    if new_alpha < 0:
        new_alpha = alpha
    # Otherwise, we must check the accept-reject condition.
    else:
    ##
    # YOUR CODE HERE
    #
    ##
    
    new_beta = Normal(beta, l).sample()
    # If proposal is negative, we can reject immediately.
    if new_beta < 0:
        new_beta = beta
    # Otherwise, we must check the accept-reject condition.
    else:
    ##
    # YOUR CODE HERE
    #
    ##

    return new_alpha, new_beta

Problem 6: Implement a Gibbs sampler#

Using mh_step_alpha_beta and sample_rhos, implement a Gibbs sampler for the proposed model.

Using the generated samples, we will estimate the posterior mean and standard deviation of \(\rho_1, \dots, \rho_4\), and \(\frac{\alpha}{\alpha + \beta}\). The latter quantity represents the base level of support for Kerry in the broader population, and is of particular interest for who will win the election.

We will also track the joint log-probability of our sample throughout the iterations of the sampler. This will be useful as a diagnostic tool to check whether our sampler is correctly implemented. To easily evaluate the joint probability in code, we recommend making use of the torch.distributions imported above.

rhos_0 = xs / ns
def gibbs(n_iter,
          xs,
          ns,
          alpha=torch.tensor(10.),
          beta=torch.tensor(10.),
          rhos=rhos_0,
          c_alpha=torch.tensor(6.25), 
          d_alpha=torch.tensor(0.025),
          c_beta=torch.tensor(6.25),
          d_beta=torch.tensor(0.025),
          burn_in=0.5):
    """
    Performs Gibbs sampling in the Bayesian hierarchical model of polling data.
    Args:
        n_iter: number of iterations of Gibbs sampling
        xs, ns: the data
        alpha, beta, rhos: initial values of these random variables
        c_alpha, d_alpha, c_beta, d_beta: hyperparameters
        burn_in: fraction of samples to discard before computing expectations
    Returns:
        rhos_mean, rhos_std: posterior mean and standard deviation of rhos
        prob_mean, prob_std: posterior mean and standard deviation of alpha/(alpha+beta)
        lps: torch tensor of size 'n_iter' containing log joint probabilities.
    """
    rhos_samples = []
    prob_samples = []
    lps = torch.zeros(n_iter)

    for it in range(n_iter):
        # Resample alpha, beta using a single Metropolis-Hastings step.
        ##
        # YOUR CODE HERE
        # alpha, beta = ...
        ##        

        # Resample rhos using the complete conditional.
        ##
        # YOUR CODE HERE
        # rhos = ...
        ##

        if it > n_iter * burn_in:
            rhos_samples += [rhos]
            prob_samples += [alpha / (alpha + beta)]

        # Evaluate log-joint probability at current sample
        ##
        # YOUR CODE HERE
        #
        ##

    rhos_samples = torch.stack(rhos_samples)
    prob_samples = torch.stack(prob_samples)
    rhos_mean, rhos_std = torch.mean(rhos_samples, axis=0), torch.std(rhos_samples, axis=0)
    prob_mean, prob_std = torch.mean(prob_samples), torch.std(prob_samples)

    return rhos_mean, rhos_std, prob_mean, prob_std, lps

Problem 7: Model Diagnostics#

Run Gibbs sampling for 10000 iterations (this should take about 15 seconds to run). Use the default values provided in the function signature. Plot the evolution of the joint-log-probability over iterations.

# Run Gibbs sampling for 10000 iterations
##
# YOUR CODE HERE
#
##
# Plot the joint-log probability over iterations
##
# YOUR CODE HERE
#
##

Part (a):#

What are the posterior means and standard deviations for \(\rho_1, ..., \rho_4\) and \(\frac{\alpha}{\alpha + \beta}\)? Do these results seem reasonable to you?

# Print the posterior means and standard deviations.
##
# YOUR CODE HERE
#
##

Your answer here.


Part (b):#

Find a setting of the hyperparameters \(c_\alpha, d_\alpha, c_\beta, d_\beta\) which will result in a prior that heavily favors Kerry. Re-run the Gibbs sampler with these settings. What is the new mean and standard deviation for \(\frac{\alpha}{\alpha + \beta}\)?

# Run Gibbs sampling with new hyperparameters and print the posterior mean and 
# standard deviation.
##
# YOUR CODE HERE
# 
##

Your answer here.


Problem 8: Reflections#

Part (a):#

Describe a setting in which a Gibbs sampler will be slow to mix. Why might our parametrization of the beta distribution in terms of \(\alpha\), \(\beta\) slow down the Gibbs sampler? Can you think of a different parametrization which would work better?


Your answer here.


Part (b):#

Describe one way in which a Markov Chain using a Metropolis-Hastings adjusted transition may be slow to mix. Which variable in the code above could be tuned in order to speed up convergence?


Your answer here.


Submission Instructions#

Formatting: check that your code does not exceed 80 characters in line width. You can set Tools → Settings → Editor → Vertical ruler column to 80 to see when you’ve exceeded the limit.

Download your notebook in .ipynb format. Then run the following command to convert to a PDF:

jupyter nbconvert --to pdf <yourname>_hw1.ipynb

Installing nbconvert:

If you’re using Anaconda for package management,

conda install -c anaconda nbconvert

Back up option

If you can’t get nbconvert to work, you can save it as a PDF from the browser. Just make sure that your code and figures are not cut off.

Upload your .pdf files to Gradescope.