HW1: Bayesian Linear Regression#
STATS305C, Stanford University, Spring 2023
Your name:
Collaborators:
Hours spent:
Please let us know how many hours in total you spent on this assignment so we can calibrate for future assignments. Your feedback is always welcome!
# Setup
import torch
from torch.distributions import Normal, Gamma, \
    TransformedDistribution, MultivariateNormal
from torch.distributions.transforms import PowerTransform
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context("notebook")
Bayesian Linear Regression#
STATS 305A was all about linear regression. In this assignment, you’ll revisit that classic model from a Bayesian perspective.
Let \(\{\mathbf{x}_n, y_n\}_{n=1}^N\) denote a dataset with covariates \(\mathbf{x}_n \in \mathbb{R}^D\) and scalar outcomes \(y_n \in \mathbb{R}\). Let \(\mathbf{X} \in \mathbb{R}^{N \times D}\) denote the design matrix where each row is a vector of covariates and \(\mathbf{y} \in \mathbb{R}^N\) denote the vector of outcomes.
We will model the outcomes as conditionally independent Gaussian random variables given the covariates and the parameters,
\begin{align}
y_n \mid \mathbf{x}_n, \mathbf{w}, \sigma^2 \sim \mathcal{N}(\mathbf{w}^\top \mathbf{x}_n, \, \sigma^2),
\end{align}
where \(\mathbf{w} \in \mathbb{R}^D\) are the weights and \(\sigma^2 \in \mathbb{R}_+\) is the conditional variance.
In a Bayesian linear regression, we place a prior on the parameters,
\begin{align}
p(\mathbf{w}, \sigma^2 \mid \boldsymbol{\eta}) = \chi^{-2}(\sigma^2 \mid \nu_0, \sigma_0^2) \, \mathcal{N}(\mathbf{w} \mid \boldsymbol{\mu}_0, \sigma^2 \boldsymbol{\Lambda}_0^{-1}),
\end{align}
where \(\nu_0 \in \mathbb{R}_+\) sets the degrees of freedom, \(\sigma_0^2\) is the prior mean of the variance parameter, \(\boldsymbol{\mu}_0\) is the prior mean of the weights, and \(\boldsymbol{\Lambda}_0 \in \mathcal{S}_D\) is a positive definite \(D \times D\) precision matrix. We collect these hyperparameters into the vector \(\boldsymbol{\eta} = (\nu_0, \sigma_0^2, \boldsymbol{\mu}_0, \boldsymbol{\Lambda}_0)\).
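To make the generative model concrete, here is a minimal sketch (not part of the assignment) of drawing a single sample of \((\sigma^2, \mathbf{w})\) from this prior in PyTorch, assuming the normal-scaled-inverse-\(\chi^2\) form above; the hyperparameter values below are arbitrary.

# Minimal sketch: draw one sample of (sigma^2, w) from the prior above.
# The hyperparameter values here are arbitrary.
nu_0 = torch.tensor(1.0)
sigmasq_0 = torch.tensor(1.0)
mu_0 = torch.zeros(2)
Lambda_0 = 0.1 * torch.eye(2)

# sigma^2 ~ ScaledInvChiSq(nu_0, sigmasq_0), i.e. the reciprocal of a
# Gamma(nu_0 / 2, rate = nu_0 * sigmasq_0 / 2) draw.
sigmasq = 1.0 / Gamma(nu_0 / 2, nu_0 * sigmasq_0 / 2).sample()

# w | sigma^2 ~ N(mu_0, sigma^2 * Lambda_0^{-1})
w = MultivariateNormal(mu_0, sigmasq * torch.inverse(Lambda_0)).sample()
print(sigmasq, w)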
PyTorch#
As in the notebooks from class, you will use PyTorch to complete the coding portions of this assignment. If you are unfamiliar with PyTorch, this webpage provides an introductory tutorial to PyTorch tensors. You should also make sure you can solve the problems in homework 0.
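For example, here are a few basic tensor operations of the kind used throughout this assignment.

# A few basic PyTorch tensor operations (for orientation only).
a = torch.arange(6.0).reshape(2, 3)  # [[0., 1., 2.], [3., 4., 5.]]
print(a.shape)                       # torch.Size([2, 3])
print(a[:, 1])                       # second column: tensor([1., 4.])
print(a.sum(dim=0))                  # column sums: tensor([3., 5., 7.])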
Problem 1: Derive the Posterior [Math]#
Derive the posterior distribution \(p(\mathbf{w}, \sigma^2 \mid \mathbf{y}, \mathbf{X}, \boldsymbol{\eta})\) where \(\boldsymbol{\eta} = (\nu_0, \sigma_0^2, \boldsymbol{\mu}_0, \boldsymbol{\Lambda}_0)\). It should be of the same form as the prior,
\begin{align}
p(\mathbf{w}, \sigma^2 \mid \mathbf{y}, \mathbf{X}, \boldsymbol{\eta}) = \chi^{-2}(\sigma^2 \mid \nu_N, \sigma_N^2) \, \mathcal{N}(\mathbf{w} \mid \boldsymbol{\mu}_N, \sigma^2 \boldsymbol{\Lambda}_N^{-1}),
\end{align}
for some \(\nu_N\), \(\sigma_N^2\), \(\boldsymbol{\mu}_N\), and \(\boldsymbol{\Lambda}_N\).
Hint: Remember that the “standard procedure” for deriving a posterior distribution is to write down the joint distribution (over both the parameters and the data) and then focus only on the terms involving the parameters you care about. In this setting, though, be careful to keep both \(\mathbf{w}\) and \(\sigma^2\), because we are asking for their joint posterior.
Your answer here.
Problem 2: The Posterior Mean [Math]#
a. What does the posterior mean \(\mathbb{E}[\mathbf{w} \mid \mathbf{y}, \mathbf{X}, \boldsymbol{\eta}]\) equal in the uninformative limit where \(\boldsymbol{\Lambda}_0 \to 0\) and \(\nu_0 \to 0\)?
b. What does the posterior mean \(\mathbb{E}[\sigma^2 \mid \mathbf{y}, \mathbf{X}, \boldsymbol{\eta}]\) equal in the uninformative limit where \(\boldsymbol{\Lambda}_0 \to 0\) and \(\nu_0 \to 0\)? Write your answer in terms of the hat matrix \(\mathbf{H} = \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top\).
Your answer here.
Synthetic Data#
We’ll do some simple analysis of a synthetic dataset with \(N = 20\) data points. Each data point has covariates \(\mathbf{x}_n = (1, x_n) \in \mathbb{R}^2\) and scalar outcomes \(y_n \in \mathbb{R}\). It looks like this:
!wget -nc https://raw.githubusercontent.com/slinderman/stats305c/main/assignments/hw1/hw1.pt
# Load the data.
# X = [[1, x_1]
# [1, x_2]
# ...
# [1, x_N]]
#
# y = [y_1, ..., y_N]
X, y = torch.load("hw1.pt")
plt.plot(X[:, 1], y, 'ko')
plt.xlabel("$x$")
plt.ylabel("$y$")
Here, the outcomes were simulated from a linear regression with Gaussian noise according to some true parameters (not given). You will compute and visualize the posterior distribution over the weights and variance given the data.
Problem 3: Compute the posterior [Code]#
Write a function to compute the posterior parameters given data and hyperparameters.
Hints: You may find the following commands in PyTorch useful:

- If `a` is a tensor, `a.shape` is a tuple containing the shape of `a`.
- If `a` is a tensor, `a.T` returns the transpose of `a`.
- `torch.linalg.solve(A, b)` solves the linear system \(\mathbf{A}\mathbf{x} = \mathbf{b}\) (and is preferable to explicitly computing an inverse).
- `*` denotes element-wise multiplication while `@` denotes standard matrix-matrix or matrix-vector multiplication.

These operations are illustrated in the short sketch below.
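Here is a small, self-contained sketch of these operations with arbitrary example values (not part of the assignment).

# Small illustration of the PyTorch operations mentioned above
# (arbitrary example values).
A = torch.tensor([[4.0, 1.0],
                  [1.0, 3.0]])
b = torch.tensor([1.0, 2.0])

print(A.shape)                   # torch.Size([2, 2])
print(A.T)                       # transpose of A
print(A * A)                     # element-wise product
print(A @ b)                     # matrix-vector product
print(torch.linalg.solve(A, b))  # solves A x = b without forming A^{-1}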
def compute_posterior(X, y, nu_0, sigmasq_0, mu_0, Lambda_0):
    """
    Compute the posterior parameters nu_N, sigmasq_N, mu_N, and Lambda_N
    given covariates X, outcomes y, and hyperparameters.

    Args:
        X: (N, D) tensor of covariates
        y: (N,) tensor of outcomes
        nu_0: prior degrees of freedom
        sigmasq_0: prior mean of the variance parameter
        mu_0: prior mean of the weights
        Lambda_0: prior precision of the weights

    Returns:
        nu_N: posterior degrees of freedom
        sigmasq_N: posterior scale of the variance parameter
        mu_N: posterior mean of the weights
        Lambda_N: posterior precision of the weights
    """
    ##
    # YOUR CODE HERE
    #
    ##
    return nu_N, sigmasq_N, mu_N, Lambda_N
Please run the following code to print your answers:
# Test:
hypers = dict(
    nu_0=torch.tensor(1.0),
    sigmasq_0=torch.tensor(1.0),
    mu_0=torch.zeros(2),
    Lambda_0=0.1 * torch.eye(2)
)
nu_N, sigmasq_N, mu_N, Lambda_N = compute_posterior(X, y, **hypers)
print("nu_N: \n", nu_N)
print("")
print("sigmasq_N: \n", sigmasq_N)
print("")
print("mu_N: \n", mu_N)
print("")
print("Lambda_N: \n", Lambda_N)
Problem 4: Plot the posterior density of the variance [Code]#
Plot \(p(\sigma^2 \mid X, y, \eta)\) vs \(\sigma^2\) over the interval \([10^{-3}, 2]\), where \(X\) and \(y\) continue to be the synthetic data we downloaded and used in Problem 3.
You may use the `ScaledInvChiSq` distribution object below, which we copied from the demo for Lecture 1.

Hint: In Python, you can use `dir(object)` to list the attributes and functions that an object supports.
Hint: To learn more about PyTorch distributions, see the docs.
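As a generic illustration of the mechanics (using a `Gamma` distribution with arbitrary parameters as a stand-in, not the answer), evaluating and plotting a PyTorch distribution's density on a grid might look like this:

# Generic illustration: evaluate and plot a PyTorch distribution's density
# on a grid. (Gamma with arbitrary parameters is a stand-in, not the answer.)
grid = torch.linspace(1e-3, 2, 500)
density = Gamma(torch.tensor(2.0), torch.tensor(3.0)).log_prob(grid).exp()
plt.plot(grid, density)
plt.xlabel(r"$z$")
plt.ylabel(r"$p(z)$")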
class ScaledInvChiSq(TransformedDistribution):
    def __init__(self, dof, scale):
        """
        Implementation of the scaled inverse \chi^2 distribution,

        ..math:
            \chi^{-2}(\nu_0, \sigma_0^2)

        It is equivalent to an inverse gamma distribution, which we implement
        as a transformation of a Gamma distribution. Thus, this class inherits
        functions like `log_prob` from its parent.

        Args:
            dof: degrees of freedom parameter
            scale: scale of the $\chi^{-2}$ distribution.
        """
        base = Gamma(dof / 2, dof * scale / 2)
        transforms = [PowerTransform(-1)]
        TransformedDistribution.__init__(self, base, transforms)
        self.dof = dof
        self.scale = scale
##
# YOUR CODE HERE
#
##
Problem 5: Plot posterior samples of the regression function. [Code]#
Draw 50 samples from the posterior marginal distribution over the weights \(\mathbf{w} \in \mathbb{R}^2\). For each sample, compute the expected value of \(y\) on a grid of points \(x\) evenly spaced between \([-3, 3]\). Remember that our covariates were defined as \(\mathbf{x} = (1, x)\), so that for each sample of the weights you get a line for \(\mathbb{E}[y \mid x, \mathbf{w}]\) as a function of \(x\). Plot these 50 lines on top of each other to get a sense of the posterior uncertainty in the regression function. (You may want to plot each line with some transparency, like `alpha=0.1`.) Overlay the observed data points.
Hint: You may find `torch.inverse` useful.
Hint: Remember that in the generative model we have posited, the distribution of \(\mathbf{w}\) depends on \(\sigma^2\).
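Purely as an illustration of the plotting mechanics (the lines below are made up, not posterior samples), overlaying many translucent lines might look like this:

# Plotting sketch only: overlay several made-up lines with transparency.
# (In the assignment, each line should come from a posterior sample of w.)
x_grid = torch.linspace(-3, 3, 100)
for _ in range(50):
    w_fake = torch.randn(2)  # stand-in for a sampled weight vector
    plt.plot(x_grid, w_fake[0] + w_fake[1] * x_grid, 'r', alpha=0.1)
plt.plot(X[:, 1], y, 'ko')   # overlay the observed data
plt.xlabel("$x$")
plt.ylabel("$y$")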
##
# YOUR CODE HERE
#
##
Problem 6: Posterior Predictive Distribution [Math]#
The subparts of this problem will walk you through deriving the posterior predictive distribution of the outcome at a new input \(\mathbf{x}_{N+1}\). That is, computing
\begin{align}
p(y_{N+1} \mid \mathbf{x}_{N+1}, \mathbf{y}, \mathbf{X}, \boldsymbol{\eta}),
\end{align}
integrating over the posterior distribution on the weights \(\mathbf{w}\) and variance \(\sigma^2\). Remember that you found this posterior distribution in Problem 1, but for the purpose of this question it’s enough to leave it in the form
\begin{align}
p(\mathbf{w}, \sigma^2 \mid \mathbf{y}, \mathbf{X}, \boldsymbol{\eta}) = \chi^{-2}(\sigma^2 \mid \nu_N, \sigma_N^2) \, \mathcal{N}(\mathbf{w} \mid \boldsymbol{\mu}_N, \sigma^2 \boldsymbol{\Lambda}_N^{-1}),
\end{align}
i.e., you don’t need to plug in the values of \(\nu_N\), \(\sigma_N^2\), \(\boldsymbol{\mu}_N\), and \(\boldsymbol{\Lambda}_N\) that you found in Problem 1.
Problem 6a#
Using the product rule of probability, write out the joint distribution of the posterior over the parameters and the observation of the next data point \(y_{N+1}\):
\begin{align}
p(y_{N+1}, \mathbf{w}, \sigma^2 \mid \mathbf{x}_{N+1}, \mathbf{y}, \mathbf{X}, \boldsymbol{\eta}).
\end{align}
You can (and please do) replace any densities from a known family with the notation \(\text{symbol for the family}(\text{variable name} \mid \text{parameters})\). (We follow this notation in how we write out the posterior above.)
Your answer here.
Problem 6b#
Now, using the sum rule of probability, compute the posterior predictive distribution
\begin{align}
p(y_{N+1} \mid \mathbf{x}_{N+1}, \mathbf{y}, \mathbf{X}, \boldsymbol{\eta}).
\end{align}
Hint: You can do this integral without taking any integrals! Think about conjugate families and how the Student’s t distribution came up in lectures.
Your answer here.
Submission Instructions#
Formatting: check that your code does not exceed 80 characters in line width. You can set Tools → Settings → Editor → Vertical ruler column to 80 to see when you’ve exceeded the limit.
Download your notebook in .ipynb format, then run the following command to convert it to a PDF:
jupyter nbconvert --to pdf <yourlastname>_hw1.ipynb
(Note that for the above command to work, you need to rename your file to `<yourlastname>_hw1.ipynb`.)
Possible causes for errors:

- The “Open in colab” button: just delete the cell at the top of the notebook that creates this button.
- LaTeX errors: many LaTeX errors aren’t visible in the notebook. Try binary search: comment out half of the LaTeX at a time until you find the bug.
Getting this HW into PDF form isn’t meant to be a burden. One quick and easy alternative is to open the notebook in Jupyter, print it, and save it as a PDF. Just make sure your LaTeX math answers aren’t cut off so I can grade them.
Please post on Ed or come to OH if there are any other problems submitting the HW.
Installing nbconvert:
If you’re using Anaconda for package management,
conda install -c anaconda nbconvert
Upload your .pdf file to Gradescope. Please tag your questions!