HW5: Poisson Matrix Factorization#


Name:

Collaborators:


Background#

Poisson matrix factorization (PMF) is a mixed membership model like LDA, and it has close ties to non-negative factorization of count matrices. Let \(\mathbf{X} \in \mathbb{N}^{N \times M}\) denote a count matrix with entries \(x_{n,m}\). We model each entry as a Poisson random variable,

\[ \begin{align*} x_{n,m} &\sim \mathrm{Po}\Big(\boldsymbol{\theta}_{n}^\top \boldsymbol{\eta}_{m} \Big) = \mathrm{Po}\Big(\sum_{k=1}^K \theta_{n,k} \eta_{m,k} \Big), \end{align*} \]

where \(\boldsymbol{\theta}_{n} \in \mathbb{R}_+^K\) and \(\boldsymbol{\eta}_{m} \in \mathbb{R}_+^K\) are non-negative feature vectors for row \(n\) and column \(m\), respectively.

PMF has been used for recommender systems, aka collaborative filtering. In a recommender system, the rows correspond to users, the columns to items, and the entries \(x_{n,m}\) to how much user \(n\) liked item \(m\) (on a scale of \(0,1,2,\ldots\) stars, for example). The \(K\) feature dimensions capture different aspects of items that users may weight in their ratings.

Note that the Poisson rate must be non-negative. It is sufficient to ensure \(\boldsymbol{\theta}_{n}\) and \(\boldsymbol{\eta}_{m}\) are non-negative. To that end, PMF uses gamma priors,

\[\begin{split} \begin{align*} \theta_{n,k} &\sim \mathrm{Ga}(\alpha_\theta, \beta_\theta) \\ \eta_{m,k} &\sim \mathrm{Ga}(\alpha_\eta, \beta_\eta), \end{align*} \end{split}\]

where \(\alpha_\star\) and \(\beta_\star\) are hyperparameters. When \(\alpha_\star < 1\), the gamma distribution has a sharp peak at zero and the prior induces sparsity in the feature vectors.

Latent variable formulation#

PMF can be rewritten in terms of a latent variable model. Note that,

\[\begin{split} \begin{align*} x_{n,m} \sim \mathrm{Po}\Big(\sum_{k=1}^K \theta_{n,k} \eta_{m,k} \Big) \iff x_{n,m} &= \sum_{k=1}^K z_{n,m,k} \\ z_{n,m,k} &\sim \mathrm{Po}(\theta_{nk} \eta_{mk}) \quad \text{independently}. \end{align*} \end{split}\]

From this perspective, a user’s rating of an item is a sum of ratings along each feature dimension, and each feature rating is an independent Poisson random variable.

The joint distribution is,

\[\begin{split} \begin{align*} p(\mathbf{X}, \mathbf{Z}, \boldsymbol{\Theta}, \mathbf{H}) &= \left[\prod_{n=1}^N \prod_{m=1}^M \mathbb{I}\Big[x_{n,m}=\sum_{k=1}^K z_{n,m,k} \Big] \prod_{k=1}^K \mathrm{Po}(z_{n,m,k} \mid \theta_{n,k} \eta_{m,k}) \right] \\ &\qquad \times \left[ \prod_{n=1}^N \prod_{k=1}^K \mathrm{Ga}(\theta_{n,k} \mid \alpha_\theta, \beta_\theta) \right] \times \left[ \prod_{m=1}^M \prod_{k=1}^K \mathrm{Ga}(\eta_{m,k} \mid \alpha_\eta, \beta_\eta) \right] \end{align*} \end{split}\]

where \(\mathbf{Z} \in \mathbb{N}^{N\times M \times K}\) denotes the tensor of feature ratings, \(\boldsymbol{\Theta} \in \mathbb{R}_+^{N \times K}\) is a matrix with rows \(\boldsymbol{\theta}_n\), and \(\mathbf{H} \in \mathbb{R}_+^{M \times K}\) is a matrix with rows \(\boldsymbol{\eta}_m\).

Setup#

import torch
from torch.distributions import Distribution, Gamma, Poisson, Multinomial
from torch.distributions.kl import kl_divergence

from tqdm.auto import trange

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context("notebook")

Problem 1: Conditional distributions [math]#

Since this model is constructed from conjugate exponential family distributions, the conditionals are available in closed form. We will let \(\mathbf{z}_{n,m} = (z_{n,m,1}, \ldots, z_{n,m,K})\).

Problem 1a: Derive the conditional for \(\mathbf{z}_{n, m}\)#

Find the conditional density \(p(\mathbf{z}_{n,m} \mid x_{n,m}, \boldsymbol{\theta}_{n}, \boldsymbol{\eta}_{m})\).


Your answer here.


Problem 1b: Derive the conditional for \(\theta_{n,k}\)#

Find the conditional density \(p(\theta_{n,k} \mid \mathbf{Z}, \mathbf{H})\).


Your answer here.


Problem 1c: Derive the conditional for \(\eta_{m, k}\)#

Find the conditional density \(p(\eta_{m, k} \mid \mathbf{Z}, \mathbf{\Theta})\).


Your answer here.


Problem 2: Coordinate ascent variational inference [math]#

We will perform inference in this model using a mean-field variational posterior which factorizes according to:

\[\begin{split} \begin{align*} q(\mathbf{Z}, \mathbf{H}, \boldsymbol{\Theta}) &= q(\mathbf{Z}) q(\mathbf{H}) q(\boldsymbol{\Theta}) \\ &= \left[\prod_{n = 1}^N \prod_{m = 1}^M q(\mathbf{z}_{n, m}) \right] \left[\prod_{n = 1}^N \prod_{k = 1}^K q(\theta_{n, k}) \right] \left[ \prod_{m = 1}^M \prod_{k = 1}^K q(\eta_{m, k}) \right] \end{align*} \end{split}\]

The optimal mean field factors will have the same forms as the conditional distributions above.

Problem 2a: Derive the CAVI update for \(q(\mathbf{z}_{n, m})\)#

Show that, fixing \(q(\mathbf{H})\) and \(q(\boldsymbol{\Theta})\), the optimal \(q(\mathbf{z}_{n, m})\) is given by:

\[\begin{split} \begin{align*} q(\mathbf{z}_{n,m}; \boldsymbol{\lambda}^{(z)}_{n,m}) &= \mathrm{Mult}(\mathbf{z}_{n,m}; x_{n,m}, \boldsymbol{\lambda}^{(z)}_{n,m}) \\ \log \lambda^{(z)}_{n,m,k} &= \mathbb{E}_q[\log \theta_{n,k} + \log \eta_{m,k}] + c \end{align*} \end{split}\]

Your answer here.


Problem 2b: Derive the CAVI update for \(q(\theta_{n, k})\)#

Show that, fixing \(q(\mathbf{Z})\) and \(q(\mathbf{H})\), the optimal \(q(\theta_{n, k})\) is given by:

\[\begin{split} \begin{align*} q(\theta_{n,k}; {\lambda}^{(\theta)}_{n,k,1}, {\lambda}^{(\theta)}_{n,k,2}) &= \mathrm{Ga}(\theta_{n,k}; {\lambda}^{(\theta)}_{n,k,1}, {\lambda}^{(\theta)}_{n,k,2}) \\ {\lambda}^{(\theta)}_{n,k,1} &= \alpha_\theta + \sum_{m=1}^M \mathbb{E}_q[z_{n,m,k}] \\ {\lambda}^{(\theta)}_{n,k,2} &= \beta_\theta + \sum_{m=1}^M \mathbb{E}_q[\eta_{m,k}] \end{align*} \end{split}\]

Your answer here.


Problem 2c: Derive the CAVI update for \(q(\eta_{m, k})\)#

Show that, fixing \(q(\mathbf{Z})\) and \(q(\boldsymbol{\Theta})\), the optimal \(q(\eta_{m, k})\) is given by:

\[\begin{split} \begin{align*} q(\eta_{m,k}; {\lambda}^{(\eta)}_{m,k,1}, {\lambda}^{(\eta)}_{m,k,2}) &= \mathrm{Ga}(\eta_{m,k}; {\lambda}^{(\eta)}_{m,k,1}, \lambda^{(\eta)}_{m,k,2}) \\ {\lambda}^{(\eta)}_{m,k,1} &= \alpha_\eta + \sum_{n=1}^N \mathbb{E}_q[z_{n,m,k}] \\ {\lambda}^{(\eta)}_{m,k,2} &= \beta_\eta + \sum_{n=1}^N \mathbb{E}_q[\theta_{n,k}] \end{align*} \end{split}\]

Your answer here.


Problem 2d: Find the expected sufficient statistics#

To update the variational factors, we need the expectations \(\mathbb{E}_q[z_{n, m, k}]\), \(\mathbb{E}_q[\log \theta_{n,k} + \log \eta_{m,k}]\), \(\mathbb{E}_q[\theta_{n, k}]\), and \(\mathbb{E}_q[\eta_{m, k}]\). Assume that each factor follows the forms derived above. That is, assume \(q(\mathbf{z}_{n, m})\) is multinomial with parameters \(\lambda_{n, m}^{(z)}\) while \(q(\theta_{n, k})\) and \(q(\eta_{m k})\) are gamma with parameters \(\left( \lambda_{n, k, 1}^{(\theta)}, \lambda_{n, k, 2}^{(\theta)} \right)\) and \(\left( \lambda_{m, k, 1}^{(\eta)}, \lambda_{m, k, 2}^{(\eta)} \right)\), respectively. Derive what each of these expectations are in closed form.


Your answer here.


Problem 3: Implement Coordinate Ascent Variational Inference [code]#

First we’ll give some helper functions and objects. Because PyTorch doesn’t offer support for batched multinomial distributions in which the total counts differ (e.g. each \(\mathbf{z}_{n, m}\) follows a multinomial distribution in which the total count is \(x_{n, m}\)), we have defined a BatchedMultinomial distribution for your convenience. This distribution doesn’t support sampling, but will return the mean of each Multinomial variable in its batch. This is exactly what is needed for the CAVI updates.

def gamma_expected_log(gamma_distbn):
    """Helper function to compute the expectation of log(X) where X follows a 
    gamma distribution.
    """
    return torch.digamma(gamma_distbn.concentration) - torch.log(gamma_distbn.rate)

class BatchedMultinomial(Multinomial):
    """ 
    Creates a Multinomial distribution parameterized by `total_count` and
    either `probs` or `logits` (but not both). The innermost dimension of
    `probs` indexes over categories. All other dimensions index over batches.

    The `probs` argument must be non-negative, finite and have a non-zero sum,
    and it will be normalized to sum to 1 along the last dimension. `probs` will 
    return this normalized value. The `logits` argument will be interpreted as 
    unnormalized log probabilities and can therefore be any real number. It will
    likewise be normalized so that the resulting probabilities sum to 1 along
    the last dimension. `logits` will return this normalized value.

    Args:
        total_count (Tensor): number of trials
        probs (Tensor): event probabilities
            Has shape total_count.shape + (num_categories,)
        logits (Tensor): event log probabilities (unnormalized)
            Has shape total_count.shape + (num_categories,)

    Note: this text is mostly from the PyTorch documentation for the 
        Multinomial distribution
    """
    def __init__(self, total_count, probs=None, logits=None, validate_args=None):
        super().__init__(probs=probs, logits=logits, validate_args=validate_args)
        self.total_count = total_count

    @property
    def mean(self):
        return self.total_count[..., None] * self.probs

Problem 3a: Implement a CAVI update step#

Using the update equations derived in Problem 2, complete the cavi_step function below.

Hint: Given a Distribution named d, d.mean returns the mean of that distribution.

def cavi_step(X, q_z, q_theta, q_eta, alpha_theta, beta_theta, alpha_eta, beta_eta):
    """One step of CAVI.

    Args:
        X: torch.tensor of shape (N, M)
        q_z: variational posterior over z, BatchedMultinomial distribution
        q_theta: variational posterior over theta, Gamma distribution
        q_eta: variational posterior over eta, Gamma distribution

    Returns:
        (q_z, q_theta, q_eta): Updated distributions after performing CAVI updates
    """
    ###
    # Your code here

    # Update q_z
    q_z = BatchedMultinomial(...)
    
    # Update the per-user posterior q_theta
    q_theta = Gamma(...)
    
    # Update the per-item posterior q_eta
    q_eta = Gamma(...)
    
    #
    ##

    return q_z, q_theta, q_eta

Problem 3b: ELBO Calculation [math]#

Recall that the evidence lower bound is defined as:

\[ \mathcal{L}(q) = \mathbb{E}_q \left[ \log p(\mathbf{X}, \mathbf{Z}, \mathbf{H}, \boldsymbol{\Theta}) - \log q(\mathbf{Z}, \mathbf{H}, \boldsymbol{\Theta}) \right] \]

Assume that \(q(\mathbf{Z})\) has support contained in \(\{\mathbf{Z}: \sum_{k=1}^K z_{n, m, k} = x_{n, m} \text{ for all } n, m\}\). Show that we can rewrite \(\mathcal{L}(q)\) as:

\[ \mathcal{L}(q) = \mathbb{E}_q [\log p(\mathbf{Z} \mid \boldsymbol{\Theta}, \mathbf{H}) - \log q(\mathbf{Z})] - \mathrm{KL}(q(\boldsymbol{\Theta}) || p(\boldsymbol{\Theta})) - \mathrm{KL}(q(\mathbf{H}) || p(\mathbf{H})) \]

Next, use that \(q(\mathbf{z}_{n,m}; \boldsymbol{\lambda}^{(z)}_{n,m}) = \mathrm{Mult}(\mathbf{z}_{n,m}; x_{n,m}, \boldsymbol{\lambda}^{(z)}_{n,m})\) and by plug in the densities of the Poisson and Multinomial distributions to show that we have:

\[\begin{split} \begin{align*} &\mathbb{E}_q [\log p(\mathbf{Z} \mid \boldsymbol{\Theta}, \mathbf{H}) - \log q(\mathbf{Z})] = \\ &\qquad \sum_{n = 1}^N \sum_{m = 1}^M \mathbb{E}_q \left[ \sum_{k =1}^K - \theta_{n, k} \eta_{m, k} + z_{n, m, k} \log( \theta_{n, k} \eta_{m, k} ) - z_{n,m, k} \log(\lambda_{n, m, k}^{(z)}) \right] - \log(x_{n, m}!) \end{align*} \end{split}\]

Explain why we have:

\[\begin{split} \begin{align*} &\mathbb{E}_q \left[ - \theta_{n, k} \eta_{m, k} + z_{n, m, k} \log( \theta_{n, k} \eta_{m, k} ) - z_{n,m, k} \log(\lambda_{n, m, k}^{(z)}) \right] = \\ &\qquad - \mathbb{E}_q \left[\theta_{n, k}\right] \mathbb{E}_q \left[\eta_{m, k}\right] + \mathbb{E}_q \left[z_{n, m, k} \right] \left( \mathbb{E}_q \left[\log( \theta_{n, k}) \right] + \mathbb{E}_q \left[\log (\eta_{m, k} )\right] - \log(\lambda_{n, m, k}^{(z)}) \right) \end{align*} \end{split}\]

Your answer here.


Problem 3c: Implement the ELBO [code]#

Using our expression above, write a function which evaluates the evidence lower bound.

Hints:

  • Use the kl_divergence function imported above to compute the KL divergence between two Distributions in the same family.

  • Recall that for integers \(n\), \(\Gamma(n + 1) = n!\) where \(\Gamma\) is the Gamma function. \(\log \Gamma\) is implemented in PyTorch as torch.lgamma.

def elbo(X, q_z, q_theta, q_eta, p_theta, p_eta):
    """Compute the evidence lower bound.
    
    Args:
        X: torch.tensor of shape (N, M)
        q_z: variational posterior over z, BatchedMultinomial distribution
        q_theta: variational posterior over theta, Gamma distribution
        q_eta: variational posterior over eta, Gamma distribution
        p_theta: prior over theta, Gamma distribution
        p_eta: prior over eta, Gamma distribution

    Returns:
        elbo: torch.tensor of shape [] 
    """
    ###
    # Your code below
    elbo = ...
    #
    ##
    return elbo / torch.sum(X)

Implement CAVI loop [given]#

Using your functions defined above, complete the function cavi below. cavi loops for some number of iterations, updating each of the variational factors in sequence and evaluating the ELBO at each step.

from torch.distributions import Uniform

def cavi(data, 
         num_factors=10, 
         num_iters=100, 
         tol=1e-5, 
         alpha_theta=0.1,
         beta_theta=1.0,
         alpha_eta=0.1,
         beta_eta=1.0,
         seed=0
        ):
    """Run coordinate ascent VI for Poisson matrix factorization.

    Args:

    Returns:
        elbos, (q_z, q_theta, q_eta):
    """
    data = data.float()
    N, M = data.shape
    K = num_factors      # short hand
    
    # Initialize the variational posteriors.
    q_eta = Gamma(Uniform(0.5 * alpha_eta, 1.5 * alpha_eta).sample((M, K)),
                  Uniform(0.5 * beta_eta, 1.5 * beta_eta).sample((M, K)))
    q_theta = Gamma(Uniform(0.5 * alpha_theta, 1.5 * alpha_theta).sample((N, K)),
                    Uniform(0.5 * beta_theta, 1.5 * beta_theta).sample((N, K)))
    q_z = BatchedMultinomial(data, logits=torch.zeros((N, M, K)))

    p_theta = Gamma(alpha_theta, beta_theta)
    p_eta = Gamma(alpha_eta, beta_eta)
    
    # Run CAVI
    elbos = [elbo(data, q_z, q_theta, q_eta, p_theta, p_eta)]
    for itr in trange(num_iters):
        q_z, q_theta, q_eta = cavi_step(data, q_z, q_theta, q_eta,
                                        alpha_theta, beta_theta,
                                        alpha_eta, beta_eta)
        
        elbos.append(elbo(data, q_z, q_theta, q_eta, p_theta, p_eta))
    return torch.tensor(elbos), (q_z, q_theta, q_eta)
        

Test your implementation on a toy dataset#

To check your implementation is working properly, we will fit a mean-field variational posterior using data sampled from the true model.

# Constants
N = 100   # num "users"
M = 1000 # num "items"
K = 5     # number of latent factors

# Hyperparameters
alpha = 0.1  # sparse gamma prior with mean alpha/beta 
beta = 1.0

# Sample data from the model
torch.manual_seed(305)
theta = Gamma(alpha, beta).sample(sample_shape=(N, K))
eta = Gamma(alpha, beta).sample(sample_shape=(M, K))
data = Poisson(theta @ eta.T).sample()

print(data.shape)
# Plot the data matrix
plt.imshow(data, aspect="auto", vmax=5, cmap="Greys")
plt.xlabel("items")
plt.ylabel("users")
plt.colorbar()

print("Max data:  ", data.max())
print("num zeros: ", torch.sum(data == 0))
elbos, (q_z, q_theta, q_eta) = cavi(data)
plt.plot(elbos[1:])
plt.xlabel("Iteration")
plt.ylabel("ELBO per entry")
true_rates = theta @ eta.T
inf_rates = q_theta.mean @ q_eta.mean.T

# Plot the data matrix
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.imshow(true_rates, aspect="auto", vmax=3, cmap="Greys")
plt.xlabel("items")
plt.ylabel("users")
plt.title("true rates")
plt.colorbar()

plt.subplot(1, 2, 2)
plt.imshow(inf_rates, aspect="auto", vmax=3, cmap="Greys")
plt.xlabel("items")
plt.ylabel("users")
plt.title("inferred rates")
plt.colorbar()

Problem 4: Run your code on a downsampled LastFM dataset#

Next, we will use data gathered from Last.FM users to fit a PMF model. We use a downsampled version of the Last.FM-360K users dataset. This dataset records how many times each user played an artist’s songs. We downsample the data to include only the 2000 most popular artists, as measured by how many users listened to the artist at least once, and the 1000 most prolific users, as measured by how many artists they have listened to.

In the code below , we use lfm to represent the data matrix \(X\) in the model. That is, lfm[n, d] denotes how many times the n-th user played a song by the d-th artist.

!wget -nc https://raw.githubusercontent.com/slinderman/stats305c/main/assignments/hw5/subsampled_last_fm.csv
import pandas as pd

lfm_df = pd.read_csv('subsampled_last_fm.csv')
lfm = lfm_df.pivot_table(index='UserID', columns='ItemID', aggfunc=sum)\
    .fillna(0).astype(int).to_numpy()
lfm = torch.tensor(lfm, dtype=torch.int)
print(lfm.shape)
plt.imshow(lfm, aspect="auto", vmax=100, cmap="Greys")
plt.xlabel("items")
plt.ylabel("users")
plt.colorbar()

Using the code below, run coordinate ascent variational inference on this dataset. Our implementation takes around 10-15 minutes to finish, and achieves a rescaled ELBO of around \(-2\).

elbos, (q_z, q_theta, q_eta) = cavi(lfm, 
     num_factors=40, 
     num_iters=200, 
     alpha_theta=1.,
     beta_theta=0.5,
     alpha_eta=1.,
     beta_eta=0.5)
print(elbos[-1])

Investigate “genres”#

The columns of \(\mathbf{H}\) correspond to weights on artists. Intuitively, each of the \(K\) columns should put weight on subsets of artists that are often played together. We might think of these columns as reflecting different “genres” of music. The code below the top 10 artists for a few of these columns.

# Find the 10 most used genres
genre_loading = q_theta.mean.sum(0)
genre_order = torch.argsort(genre_loading, descending=True)

# Print the top 10 artists for each of the top 10 genres
for genre in genre_order[:10]:
    print("genre ", genre)
    artist_idx = torch.argsort(q_eta.mean[:, genre], 
                               descending=True)[:10].numpy()
    subset = lfm_df[lfm_df['ItemID'].isin(artist_idx)]
    print(subset[['ItemID', 'Artist']].drop_duplicates())
    print("")
    

Problem 4a#

Inspect the data either using the csv file or the pandas dataframe and choose a user who has listened to artists you recognize. If you are not familiar with any of the artists, use the listener with UserID 349, who mostly listens to hip-hop artists. For the particular user \(n\) you choose, find the 10 artists who are predicted to have the most plays by sorting the vector of mean song counts predicted by the model, i.e. the \(n^{\text{th}}\) row of \(\mathbb{E}_q [\boldsymbol{\Theta} \mathbf{H}^\top ]\). Are these artists you would expect the user would enjoy? Are there any artists that the user has not listened to?

Hint: Use torch.argsort(..., descending=True) to return the indices of the largest elements of a vector in descending order.

###
# Your code here.
##

Problem 5: Reflections#

Problem 5a#

Discuss one advantage and one disadvantage of fitting a posterior using variational inference vs. sampling from the posterior using MCMC.


Your answer here.


Problem 5b#

First, explain why the assumption that \(\mathbf{Z}, \mathbf{H}\) and \(\boldsymbol{\Theta}\) are independent in the posterior will never hold.

Next, recall that maximizing the ELBO is equivalent to minimizing the KL divergence between the approximate posterior and the true posterior. In general, how will the approximate posterior differ from the true posterior, given that the variational family does not include the true posterior?


Your answer here.


Problem 5c#

Suppose we are using this model to recommend new items to users. Describe one improvement that could be made to the model which you think would lead to better recommendations.


Your answer here.


Submission Instructions#

Formatting: check that your code does not exceed 80 characters in line width. If you’re working in Colab, you can set Tools → Settings → Editor → Vertical ruler column to 80 to see when you’ve exceeded the limit.

Download your notebook in .ipynb format and use the following commands to convert it to PDF:

jupyter nbconvert --to pdf hw5_yourname.ipynb

Dependencies:

  • nbconvert: If you’re using Anaconda for package management,

conda install -c anaconda nbconvert

Upload your .pdf files to Gradescope.