HW4: Large Language Models#

The first half of this assignment (Parts 0 and 1) will review some key ingredients of sequence modeling. In the process, we will build a baseline transformer model for next token prediction in code. The deliverable will be your answers to the questions posed in Parts 0 and 1.

The second half of the assignment (Part 2) will be an open-ended mini-project where you have the freedom to delve more deeply into language modelling (where the language in question is Python code). Further instructions are in Part 2: Mini-project. But, in general, you should feel free to try other architectures (HMMs, RNNs, transformers, state space layers, diffusion models, etc.) or to invent new architectures. The goal will be to find some area of possible improvement (we interpret “improvement” quite loosely, but it is up to you to state precisely in what sense your proposed innovation might constitute an improvement and to show convincing evidence that your innovation does or does not constitute an improvement according to your definition); to formulate and state a precise hypothesis; and to falsify or support the hypothesis with rigorous empirical analyses. The deliverable will be a report of no more than 4 pages (references not included in the page limit).

For this final assignment you have the option to work in pairs.

This Assignment

Model: You will begin by implementing a baseline using attention and transformers. But, for the mini-project, you will be free to use any model (HMMs, RNNs, transformers, state space layers, diffusion models etc.) that you would like!

Algorithm: mini-batched stochastic gradient descent / whatever you’d like! We will be using deep learning models in at least the first half of the assignment. When you actually go into “production” you should be sure to use the GPU on Colab (make sure you switch to GPU in the “Runtime” tab above). However, you have limited free Colab GPU units, so when you are “developing,” you may wish to run smaller models with less data for fewer iterations just to get the bugs out / make sure your pipeline is working (see the “prototyping” exercise in HW 0 for more discussion).

Data: A large corpus of Python code from The Stack. We have taken a dataset of around 4 million tokens from The Stack and stored it in a CSV file for easy access.

Setup#

# can take around 30s
%%capture
! pip install datasets #huggingface datasets library
! pip install --upgrade pyarrow
# torch imports
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# hugging face imports
from datasets import load_dataset
from transformers import AutoTokenizer

import matplotlib.pyplot as plt
from tqdm import tqdm

import pandas as pd
import sys
import warnings

torch.manual_seed(305)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
assert device=='cuda', "you need to change runtime type to GPU"
# hyperparams and helper functions
SMALL_ITERS = 1000
LARGE_ITERS = 2000
context_window_size = 256
chunk_size = 512 # CodeBERT can only take 512 tokens of input, so we chunk the raw text into 512-character pieces before tokenizing

def chunk_string(string, size):
    """
    Splits a string into chunks of a specified size.

    :param string: The string to be chunked.
    :param size: The desired chunk size.
    :return: A list of string chunks.
    """
    return [string[i:i+size] for i in range(0, len(string), size)]
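As a quick sanity check (on a made-up snippet), note that chunk_string slices the raw text every size characters, so the splits land at arbitrary character positions rather than at token boundaries:

# quick illustration of chunk_string on a made-up snippet
example_code = "def add(a, b):\n    return a + b\n"
chunks = chunk_string(example_code, 10)
print(len(chunks))   # 4 chunks of at most 10 characters each
print(chunks)        # the splits ignore token boundaries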

Part 0: Preprocessing#

As in the previous problem sets, a certain amount of preprocessing for textual data is required.

0.1: Loading the dataset#

The first step is to actually download the dataset. We will be using a dataset hosted on Hugging Face. You can think of Hugging Face as the sklearn of deep learning.

The dominant mode for preprocessing textual data is to tokenize it, that is, to split the dataset into a finite vocabulary of tokens. Then, we can set up a dictionary mapping integer ids to tokens. Tokens can be characters, or words, or subwords; in fact, the “best” way to tokenize text is an active area of research. For our baseline, we will use a tokenizer that Microsoft created for code.

%%capture
tokenizer = AutoTokenizer.from_pretrained('microsoft/CodeBERT-base')
# Load the concatenated data
raw_data = pd.read_csv("https://raw.githubusercontent.com/slinderman/stats305b/winter2024/assignments/hw4/python_corpus_4M.csv", header=None)

Note: if you run the below cell, you might get something like the following warning

Token indices sequence length is longer than the specified maximum sequence length for this model. Running this sequence through the model will result in indexing errors

You can safely ignore this warning for the purpose of the assignment (we did in our solutions).

But, if you can figure out why this warning appears, or find a fix, let the teaching staff know, and we will give you extra credit.

# should take around 3 min to load in around 4M tokens
warnings.filterwarnings("ignore")

tokens = torch.tensor([], dtype=torch.long)
for index, row in raw_data.iterrows():
    text = row[0]
    chunks = chunk_string(text, chunk_size)
    n = len(chunks)
    for idx, chunk in enumerate(chunks):
        new_tokens = torch.tensor(tokenizer.encode(chunk, add_special_tokens=True))

        # logic to avoid incorrectly adding in start and end sequence tokens as an artifact of chunking
        if idx == 0:
            tokens = torch.cat((tokens, new_tokens[:-1]), dim=0)
        elif idx == n-1:
            tokens = torch.cat((tokens, new_tokens[1:]), dim=0)
        else:
            tokens = torch.cat((tokens, new_tokens[1:-1]), dim=0)

print(f"{len(tokens)} tokens have been loaded in")

Question 0.2: Examining the tokenizer#

Let’s see what the tokens look like! We will use these two prompts during the assignment.

prompt_1_text = \
"""def newton(eta, N, X, y, gamma, beta=None):
  \"""
  Performs Newton's method on the negative average log likelihood with an
  l2 regularization term

  beta: torch.Tensor, of shape (teams)
  X: torch.Tensor, the covariate matrix, of shape (-1, teams)
  y: torch.Tensor, the response vector, of shape (teams)
  gamma: float, the scale parameter for the regularization
  beta: torch.Tensor, the starting point for gradient descent, if specified
  \"""

  if beta is None:
    # Instantiate the beta vector at a random point
    beta = torch.randn(X.shape[1])
  else:
    beta = torch.clone(beta)

  loss = []

  # Instantiate a list to store the loss throughout the gradient descent
  # path
  for i in tqdm(range(N)):"""
prompt_2_text = \
"""import torch
import torch.nn.functional as F


def normalize(x, axis=-1):
    \"""Performs L2-Norm.\"""
    num = x
    denom = torch.norm(x, 2, axis, keepdim=True).expand_as(x) + 1e-12
    return num / denom

def euclidean_dist(x, y):
    \"""Computes Euclidean distance.\"""
    m, n = x.size(0), y.size(0)
    xx = torch.pow(x, 2).sum(1, keepdim=True).expand(m, n)
    yy = torch.pow(x, 2).sum(1, keepdim=True).expand(m, m).t()
    dist = xx + yy - 2 * torch.matmul(x, y.t())

    dist = dist.clamp(min=1e-12).sqrt()

    return dist


def cosine_dist(x, y):"""

Here is what the prompts look like after an encode/decode round trip through the tokenizer:

tokenizer.decode(tokenizer.encode(prompt_1_text))
tokenizer.decode(tokenizer.encode(prompt_2_text))

And here are what the first and last 10 tokens for prompt 1 look like:

for tok in tokenizer.encode(prompt_1_text, add_special_tokens=True)[:10]:
    print(f"{tok} : {tokenizer.decode([tok])}")
for tok in tokenizer.encode(prompt_1_text)[-10:]:
    print(f"{tok} : {tokenizer.decode([tok])}")

Question 0.2: What is the meaning of the <s> and the </s> tokens? Why is it useful to have them?

0.3: Building our dataloader#

There are around 50,000 tokens in the CodeBERT vocabulary, but we only use around 20,000 of them. To make our lives easier, we simply reindex the tokens so that the ids run from 0 to around 20,000.

# Get unique elements
extra_tokens = torch.cat((torch.tensor(tokenizer.encode(prompt_1_text, add_special_tokens=True)),
                          torch.tensor(tokenizer.encode(prompt_2_text, add_special_tokens=True))),
                         dim=0)

unique_tokens = torch.unique(torch.cat((tokens, extra_tokens), dim=0))

# Create a mapping from code bert to ids that increment by one
from_code_bert_dict = {element.item(): id for id, element in enumerate(unique_tokens)}

# Create a reverse mapping from ids to code bert token ids
to_code_bert_dict = {id: element for element, id in from_code_bert_dict.items()}

vocab_size = len(unique_tokens)
print(f"there are {vocab_size} distinct tokens in the vocabulary")

# helper functions to move between code bert and simple ids
def from_code_bert(tkn_lst):
    """
    Args:
    tkn_lst: a list of code bert tokens
    Returns:
    a list of simple ids
    """
    tkns = [int(from_code_bert_dict[token]) for token in tkn_lst]
    return tkns


def to_code_bert(tkn_lst):
    """
    Args:
    tkn_lst: a list of simple ids
    Returns:
    a list of code bert tokens
    """
    tkns = [int(to_code_bert_dict[token]) for token in tkn_lst]
    return tkns
# let's translate our dataset into our ids
tokens_simple_id = torch.tensor([from_code_bert_dict[token.item()] for token in tokens])

# split up the data into train and validation sets
n = int(0.9 * len(tokens_simple_id)) # first 90% will be train, rest val
train_data = tokens_simple_id.clone()[:n]
val_data = tokens_simple_id.clone()[n:]

print(f"there are {len(train_data)} tokens in the training set")
print(f"there are {len(val_data)} tokens in the validation set")
print(f"there are {vocab_size} distinct tokens in the vocabulary")

We also write helper functions to get batches of data and to evaluate the loss of various models on them.

# function for getting batches of data
def get_batch(split, context_window_size, device, batch_size=32):
    """
    generate a small batch of data of inputs x and targets y

    Args:
        split: 'train' or 'val'
        device: 'cpu' or 'cuda' (should be 'cuda' if available)
    """
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - context_window_size, (batch_size,))
    x = torch.stack([data[i:i+context_window_size] for i in ix])
    y = torch.stack([data[i+1:i+context_window_size+1] for i in ix])
    x = x.to(device)
    y = y.to(device)
    return x, y

# helper function for tracking loss during training
# given to you
@torch.no_grad()
def estimate_loss(model, eval_iters, context_window_size, device):
    """
    Args:
      model: model being evaluated
      eval_iters: number of batches to average over
      context_window_size: size of the context window
      device: 'cpu' or 'cuda' (should be 'cuda' if available)
    """
    out = {}
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split, context_window_size, device)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    return out

Part 1: Language Modeling#

In this first part of the assignment, we will implement a baseline for code modeling.

In the process of building this baseline, we will review 4 key ideas of sequence modeling that have become the backbone of modern language modeling such as ChatGPT:

  1. Framing language modeling as next token prediction, and next token prediction as multiclass logistic regression

  2. Embedding discrete tokens in continuous latent spaces (word embeddings)

  3. Using the attention mechanism to move beyond Markovian models for sequences (we of course pay for this greater expressivity with increased compute, which is made possible in part by using matrix multiplications on accelerated hardware like GPUs; reducing the compute burden while maintaining the expressivity needed for good sequence modeling is an active area of research).

  4. Combining attention with deep learning in the Transformer architecture.

For various architectures that you have to train, we provide approximate training times and training loss scores for your reference. They are really just for reference so don’t read too much into them, but we are providing them as a warning mechanism in case something is going seriously wrong.

1.1: Next token prediction as multiclass logistic regression#

Our first language model will simply be a lookup table. That is, given a token with value \(v\), we will simply “look up” the logits that correspond to our prediction for the next token. This model is often known as a “bigram model” because it can be derived from the relative proportions of different bigrams (ordered pairs of tokens) occurring in a large text corpus.

Let us be a bit more precise in our definition of the bigram model. Let’s say that the total size of our vocabulary (the number of tokens we are using) is \(V\). Let \(A\) be a matrix in \(\mathbb{R}^{V \times V}\), where each row \(A_v\) corresponds to the logits for the prediction of which token would follow a token that has value \(v\). Thus, we are modeling the distribution of the token following a token that has value \(v\) as

\[\begin{align*} y_{t+1} \mid (y_t = v) &\sim \mathrm{Cat}(\mathbf{\pi}) \\ \mathbf{\pi} &= \mathrm{softmax}(A_v) \end{align*}\]
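To make the definition concrete, here is a toy numeric sketch of a single generative step under this model, using a made-up vocabulary of size 4 (illustrative only; it is not part of the graded implementation below):

# toy sketch of one bigram generative step (made-up V = 4 vocabulary)
V_toy = 4
A_toy = torch.randn(V_toy, V_toy)                 # each row holds the logits for the next token
v = 2                                             # value of the current token
pi = F.softmax(A_toy[v], dim=-1)                  # next-token probabilities
next_token = torch.multinomial(pi, num_samples=1) # sample the next token id
print(pi, next_token)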

Question 1.1.1#

\(\mathbf{\pi} \in \Delta_{V-1}\) is the vector of probabilities used to parameterize the categorical distribution for the next token prediction. Explain why we parameterize

\[\begin{equation*} \mathbf{\pi} = \mathrm{softmax}(A_v), \end{equation*}\]

and could not just use

\[\begin{equation*} \mathbf{\pi} = A_v. \end{equation*}\]

your answer here


Question 1.1.2#

Discuss the relationship between the bigram model and contingency tables (discussed in Lecture 2).


your answer here


Question 1.1.3#

Say I have a string of three tokens with ids \((7, 3, 6)\). If I use the bigram model as a generative model for language, given this information, what is the distribution of the fourth token? Write your answer in terms of the matrix \(A\) defined in Section 1.1.


your answer here


Question 1.1.4#

Remember back in Section 0.3, “Building our dataloader,” when we gave you the helper function get_batch? Run get_batch and look at the inputs x and the targets y. Explain the relationship between them in the context of framing language modeling as next token prediction.

xb, yb = get_batch('train', 10, device, batch_size = 1)
print(f"the features have token ids {xb}")
print('\n')
print(f"the targets have token ids {yb}")

Question 1.1.5#

Discuss the strengths and weaknesses of the bigram model as a generative model for language.


your answer here


Question 1.1.6#

Say I have a string \(s\) of length \(T\). Derive the formula for the negative log likelihood of \(s\) under the bigram model in terms of the matrix of logits \(A\). What would your answer be if the matrix of logits \(A\) were all zeros? What would be the value of the negative log likelihood of \(s\) under a model that always perfectly predicted the next token?


your answer here


Question 1.1.7: Implement the BigramLanguageModel#

Implement the bigram language model below.

Your TODOs:

  • if the forward method is provided a target, the loss should be the negative log likelihood of the target (given the context)

  • generate should take in (batched) contexts and a number of new tokens to generate, and then generate text autoregressively from your model. Note that in autoregressive text generation, you iteratively append the tokens you generate to your context.

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        """
        Args:
          vocab_size: size of the vocabulary (the number of tokens)
        """
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.logits_table = nn.Embedding(vocab_size, vocab_size)
        self.vocab_size = vocab_size

    def forward(self, token_ids, targets=None):
        """
        Args:
          token_ids: Int(B, T), token ids that make up the context (batch has size B, each entry in the batch has length T)
          targets: Int(B, T), token ids corresponding to the target of each context in token_ids

        Returns:
          logits: (B, T, V), logits[b,t, :] gives the length V vector of logits for the next token prediction in string b up to t tokens
          loss: scalar, negative log likelihood of target given context
        """

        # idx and targets are both (B,T) tensor of integers
        logits = self.logits_table(token_ids) # (B,T,V)

        if targets is None:
            loss = None
        else:
            # TODO: what should the loss in this setting be?
            loss = ...

        return logits, loss

    @torch.no_grad()
    def generate(self, token_ids, max_new_tokens=context_window_size):
        """
        Args:
          token_ids: (B, T) tensor of token ids to provide as context
          max_new_tokens: int, maximum number of new tokens to generate

        Returns:
          (B, T+max_new_tokens) tensor of context with new tokens appended
        """
        # TODO: your code below
        pass

Question 1.1.8: Evaluating the initialization.#

Evaluate the loss of your untrained bigram model on a batch of data. Make sure the loss (negative log likelihood) is per-token (i.e. you may need to average over both sequence length and batch). Does this loss make sense in the context of your answer to Question 1.1.6? Discuss.

x,y = get_batch("train", context_window_size, device)
bigram_model = BigramLanguageModel(vocab_size)
bm = bigram_model.to(device)

# TODO: your code below

your answer here


Question 1.1.9: Training your bigram model#

Train your bigram model for SMALL_ITERS iterations. Plot and interpret the loss curve.

Our train loss gets down to around 4 in around 5 min of training.

# create a PyTorch optimizer
learning_rate = 1e-2
optimizer = torch.optim.AdamW(bigram_model.parameters(), lr=learning_rate)

eval_interval = 200
eval_iters = 200

loss_list = []

for it in tqdm(range(SMALL_ITERS)):

    # every once in a while evaluate the loss on train and val sets
    if it % eval_interval == 0 or it == SMALL_ITERS - 1:
        print(f"iteration {it}")
        losses = estimate_loss(bm, eval_iters, context_window_size, device)
        print(f"step {it}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # sample a batch of data
    xb, yb = get_batch('train', context_window_size, device)

    # evaluate the loss
    logits, loss = bm(xb, yb)
    loss_list.append(loss.detach().item())
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

your answer here


Note that these models can take up a lot of memory on the GPU. As you go through this assignment, you may want to free the models after you train them using code along the lines of

bm.to('cpu')
torch.cuda.empty_cache()
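If GPU memory is still tight, you may also want to drop the Python reference itself before clearing the cache (a suggestion, not a required step):

import gc

del bm                     # drop the reference so the parameters can be garbage collected
gc.collect()
torch.cuda.empty_cache()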

1.2: Token Embeddings: going from discrete tokens to continuous latent spaces#

In the lookup table formulation of the bigram model, we are modeling the logits of the next token distribution independently for each token, even if two tokens are extremely similar to each other. One way around this problem is to learn an embedding of the discrete tokens into \(\mathbb{R}^{D}\), and then to run multi-class logistic regression on top of this learned embedding.

More precisely, if we have a vocabulary of tokens of size \(V\) that we choose to embed in a Euclidean embedding space of dimension \(D\), we can parameterize the distribution of the next token if the current token is \(v\) according to

\[\begin{align*} \mathrm{Cat}\Big( \mathrm{softmax} (\beta X_v) \Big), \end{align*}\]

where \(X_v \in \mathbb{R}^{D}\) is the learned embedding of token \(v\) into \(\mathbb{R}^{D}\) and \(\beta \in \mathbb{R}^{V \times D}\). Notice that if \(X\) were a fixed design matrix this formulation would be equivalent to multi-class logistic regression. However, both \(X\) and \(\beta\) are learnable parameters.
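One practical payoff of this factorization is a much smaller parameter count. Here is a back-of-the-envelope comparison using illustrative sizes close to ours (V ≈ 20,000 and the default D = 32):

# rough parameter-count comparison (illustrative sizes)
V, D = 20_000, 32
bigram_params = V * V              # one row of V logits per token in the lookup-table model
factored_params = V * D + V * D    # embedding table X (V x D) plus readout beta (V x D)
print(f"{bigram_params:,} vs {factored_params:,}")   # 400,000,000 vs 1,280,000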

Question 1.2.1: Implement BigramWithWordEmbeddingsLM#

Implement a bigram language model that uses a linear readout from a low dimensional Euclidean embedding of each token to parameterize the logits of the next token distribution, instead of parameterizing the logits of the next token distribution directly. It should have almost the same implementation as BigramLanguageModel from Question 1.1.7, except that __init__ should also take in an embed_size, and the forward method will need to be modified.

class BigramWithWordEmbeddingsLM(nn.Module):

    def __init__(self, vocab_size, embed_size=32):
      """
      Args:
        vocab_size: int, size of the vocabulary
        embed_size: int, dimension of the word embedding (D)
      """
      #TODO, your code here
      pass

    def forward(self, token_ids, targets=None):
        """
        Args:
          token_ids: (B, T) token ids that make up the context (batch has size B, each entry in the batch has length T)
          targets: (B, T) token ids corresponding to the target of each context in token_ids

        Returns:
          logits: (B, T, V), logits[b,t, :] gives the length V vector of logits for the next token prediction in string b up to t tokens
          loss: scalar, negative log likelihood of target given context
        """
        # TODO, your code here

        logits = None
        loss = None

        return logits, loss

    @torch.no_grad()
    def generate(self, token_ids, max_new_tokens=context_window_size):
        """
        Args:
          token_ids: (B, T) tensor of token ids to provide as context
          max_new_tokens: int, maximum number of new tokens to generate

        Returns:
          (B, T+max_new_tokens) tensor of context with new tokens appended
        """
        #TODO
        # your code below
        pass

1.3: Attention: Relaxing Markovian assumptions to transmit information across the sequence length#

A major problem with the bigram models of Sections 1.1 and 1.2 was that they were Markovian: the distribution of the next token was determined entirely by the current token! The attention mechanism provides a way to pool information from the previous tokens in the context, yielding a better parameterization of the distribution of the next token.

Question 1.3.1: Averaging over word embeddings#

One simple way to pool information is to average the embeddings!

Your TODO: Add comments to the code snippet below. Write a description here explaining why the code is mathematically equivalent to averaging the embeddings of the previous tokens and the current token.


your answer here


# average word embedding via matrix multiply and softmax
small_batch_size = 4              # B
small_context_window_size = 8     # T
small_embed_size = 2              # D

# make "synthetic" word embeddings (for illustration purposes only)
X = torch.randn(small_batch_size, small_context_window_size, small_embed_size)

# TODO: comment the code below
print(X.shape)
tril = torch.tril(torch.ones(small_context_window_size, small_context_window_size))
attn_weights = torch.zeros((small_context_window_size, small_context_window_size))
attn_weights = attn_weights.masked_fill(tril == 0, float('-inf'))
attn_weights = F.softmax(attn_weights, dim=-1)
avg_embds = attn_weights @ X
print(X[0])
print("")
print(avg_embds[0])

1.3.2: Single-headed scaled \((Q,K,V)\)-attention#

A more sophisticated approach than simply averaging over previous word embeddings is single-headed (Query, Key, Value) scaled attention. That is, we now summarize the information contained in a length \(T\) sequence of tokens that have been embedded into \(X \in \mathbb{R}^{T \times D}\) according to

(10)#\[\begin{equation} \mathrm{SoftmaxAcrossRows} \Bigg( \frac{\mathrm{CausalMask}\Big(X U_q^\top U_k X^\top \Big)}{\sqrt{K}} \Bigg) \Big( X V^\top \Big), \end{equation}\]

where \(U_q, U_k \in \mathbb{R}^{K \times D}\), \(V \in \mathbb{R}^{D \times D}\), and \(K\) is the “head size”.

Question 1.3.2.1#

In the limiting case where \(U_q\) and \(U_k\) are all zeros, and \(V = I_{D}\), what does \((U_q, U_k, V)\) attention simplify to?


your answer here


Question 1.3.2.2#

Imagine we had two matrices \(U_q\) and \(U_k\), both in \(\mathbb{R}^{K \times D}\), where every entry was an independent standard normal.

What would be the distribution of an element of \(U_q^\top U_k\)? What about \(U_q^\top U_k / \sqrt{K}\)?


your answer here


Question 1.3.2.3: Implement single-headed scaled \((U_q,U_k,V)\)-attention.#

Complete the below code so the forward method returns single-headed scaled \((U_q,U_k,V)\)-attention.

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size, context_window_size, embed_size=384):
        """
        Args:
          head_size: int, size of the head embedding dimension (K)
          context_window_size: int, number of tokens considered in the past for attention (T)
          embed_size: int, size of the token embedding dimension (D)
        """
        super().__init__()
        self.head_size = head_size
        self.key = nn.Linear(embed_size, head_size, bias=False)
        self.query = nn.Linear(embed_size, head_size, bias=False)
        self.value = nn.Linear(embed_size, embed_size, bias=False)

        # not a param of the model, so registered as a buffer
        self.register_buffer('tril', torch.tril(
            torch.ones(context_window_size, context_window_size)))

    def forward(self, x):
        """
        Args:
          x: (B,T,D) tensor of token embeddings

        Returns:
          (B,T,D) tensor of attention-weighted token embeddings
        """
        # TODO: your code here
        pass

Question 1.3.2.4: Implement a single-headed attention language model#

Complete the code below. Note that because attention on its own has no notion of where tokens occur in the sequence, we have also added position embeddings.

class SingleHeadedAttentionLM(nn.Module):

    def __init__(self, vocab_size, context_window_size, head_size, embed_size=384):
      """
      Args:
        vocab_size: int, size of the vocabulary (V)
        context_window_size: int, number of tokens considered in the past for attention (T)
        head_size: int, size of the head embedding dimension (K)
        embed_size: int, size of the token embedding dimension (D)
      """
      super().__init__()
      self.token_embedding_table = nn.Embedding(vocab_size, embed_size)
      self.position_embedding_table = nn.Embedding(context_window_size, embed_size)
      self.context_window_size = context_window_size

      # TODO: your code below
      self.atten_head = Head(...)
      self.lm_head = nn.Linear(...)

    def forward(self, token_ids, targets=None):
        """
        Args:
          token_ids: (B, T) token ids that make up the context (batch has size B, each entry
                     in the batch has length T)
          targets: (B, T) token ids corresponding to the target of each context in token_ids

        Returns:
          logits: (B, T, V) logits[b,t] gives the length V vector of logits for the next token
                   prediction in string b up to t tokens
          loss: scalar, negative log likelihood of target given context
        """
        B, T = token_ids.shape # (batch size, length)
        tok_emb = self.token_embedding_table(token_ids) # (B,T,D)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,D)
        x = tok_emb + pos_emb # (B,T,D)
        x = self.atten_head(x) # (B,T,D)
        logits = self.lm_head(x) # (B,T,V)

        # TODO: your code here
        logits = ...
        loss = ...
        return logits, loss

    @torch.no_grad()
    def generate(self, token_ids, max_new_tokens):
        """
        Args:
          token_ids: (B, T) tensor of token ids to provide as context
          max_new_tokens: int, maximum number of new tokens to generate

        Returns:
          (B, T+max_new_tokens) tensor of context with new tokens appended
        """
        #TODO
        # your code below
        pass

Train your new SingleHeadedAttentionLM for SMALL_ITERS training iterations and plot the loss curve. The head_size shouldn’t matter too much; we simply set it equal to the embed_size. Do you see an improvement compared to your BigramLanguageModel? Discuss.

Note: you may want to modify the learning rate. Training for SMALL_ITERS with a learning rate of 6e-4, we can get to a train loss of around 3.4 in around 4 min of training.

embed_size = 384
sha_model = SingleHeadedAttentionLM(vocab_size, context_window_size, embed_size, embed_size)
sham = sha_model.to(device)
learning_rate = 6e-4
optimizer = torch.optim.AdamW(sha_model.parameters(), lr=learning_rate)

loss_list = []

for it in tqdm(range(SMALL_ITERS)):

    # every once in a while evaluate the loss on train and val sets
    if it % eval_interval == 0 or it == SMALL_ITERS - 1:
        print(f"iteration {it}")
        losses = estimate_loss(sham, eval_iters, context_window_size, device)
        print(
            f"step {it}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}"
        )

    # sample a batch of data
    xb, yb = get_batch("train", context_window_size, device)

    # evaluate the loss
    logits, loss = sham(xb, yb)
    loss_list.append(loss.detach().item())
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

your answer here


1.3.3: Multi-headed attention#

Question 1.3.3.1: Implement multi-headed attention#

class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, context_window_size, num_heads, head_size, embed_size=384):
        """
        Args:
            context_window_size: int, number of tokens considered in the past for attention (T)
            num_heads: int, number of heads (H)
            head_size: int, size of the head embedding dimension
            embed_size: int, size of the token embedding dimension
        """
        super().__init__()
        # TODO, your code below
        self.heads = nn.ModuleList(...)

    def forward(self, x):
        # TODO, your code below
        pass

Question 1.3.3.2: Implement a multi-headed attention LM#

Fill in the code below to create a language model that outputs its logits for next token prediction using multi-headed attention. Train your model for SMALL_ITERS training iterations. Compare the results with the single-headed attention model. Do you see an improvement?

We get to a train loss of around 2.75 in around 5 mins of training.

class MultiHeadedAttentionLM(nn.Module):

    def __init__(self, vocab_size, context_window_size, embed_size=384, num_heads=6):
      super().__init__()
      self.head_size = embed_size // num_heads
      self.context_window_size = context_window_size
      # TODO: your code below

    def forward(self, token_ids, targets=None):
        """
        Args:
          token_ids: (B, T) token ids that make up the context (batch has size B, each entry in the
                     batch has length T)
          targets: (B, T) token ids corresponding to the target of each context in token_ids

        Returns:
          logits: (B, T, V), logits[b,t] gives the length V vector of logits for the next token
                  prediction in string b up to t tokens
          loss: scalar, negative log likelihood of target given context
        """
        # TODO: your code below
        logits = ...
        loss = ...
        return logits, loss

    @torch.no_grad()
    def generate(self, token_ids, max_new_tokens):
        """
        Args:
          token_ids: (B, T) tensor of token ids to provide as context
          max_new_tokens: int, maximum number of new tokens to generate

        Returns:
          (B, T+max_new_tokens) tensor of context with new tokens appended
        """
        # TODO: your code below
        pass

your answer here


1.4: The Transformer Architecture: combining attention with deep learning#

# run this cell to initialize this deep learning module that you should use in the code you write later
# you don't need to edit this layer
class FeedForward(nn.Module):
    """ a simple linear layer followed by a non-linearity
        Given to you, you don't need to write any code here!
    """

    def __init__(self, embed_size):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_size, 4 * embed_size),
            nn.ReLU(),
            nn.Linear(4 * embed_size, embed_size),
        )

    def forward(self, x):
        return self.net(x)

Question 1.4.1: Implement a transformer block#

Complete the code below to implement a transformer block

To make your implementation easier to train, we have added two deep learning best practices:

  1. Residual connections. In the forward method of the TransformerBlock, we have included residual connections, which take the form

(11)#\[\begin{equation} x = (I + N)(x), \end{equation}\]

where \(I\) stands for the identity transformation and \(N\) stands for some non-linearity. The idea is that every layer is some adjustment of the identity function, which allows gradients to flow through a deep network during backpropagation, especially at initialization.

  2. Pre-norm via LayerNorm. Also in the forward method of the TransformerBlock, each nonlinearity first applies a LayerNorm to its argument. The LayerNorm standardizes the activations in that layer so that they have mean 0 and variance 1, which is very helpful for numerical stability, especially of the gradients (see the short illustration below).
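As a small illustration of what the pre-norm is doing (a standalone example with a made-up tensor, not part of the model), nn.LayerNorm over the last dimension standardizes each token's length-D activation vector:

# standalone illustration: LayerNorm standardizes each length-D activation vector
x_demo = 5 * torch.randn(2, 8, 384) + 3       # (B, T, D) activations with a large scale and offset
ln_demo = nn.LayerNorm(384)
out = ln_demo(x_demo)
print(out.mean(dim=-1)[0, :3])                # approximately 0 at every token position
print(out.std(dim=-1)[0, :3])                 # approximately 1 at every token position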

class TransformerBlock(nn.Module):
    """ Transformer block: communication across sequence length, followed by communication across embedding space
        Uses multi-headed attention
    """

    def __init__(self, vocab_size, context_window_size, embed_size=384, num_heads=6):
        super().__init__()
        self.ln1 = nn.LayerNorm(embed_size)
        self.ln2 = nn.LayerNorm(embed_size)

        # TODO: your code below
        self.feed_forward = FeedForward(...)
        self.mh_attention = ...

    def forward(self, x):
        x = x + self.mh_attention(self.ln1(x)) # communication over sequence length
        x = x + self.feed_forward(self.ln2(x)) # communication across embedding space
        return x

Question 1.4.2: Implement your baseline transformer model#

We now stack 6 TransformerBlocks (with a final layer norm applied after the blocks but before the logits) to create our baseline TransformerLM.

class TransformerLM(nn.Module):

    def __init__(self, vocab_size, context_window_size, embed_size=384, num_heads=6, n_layers=6):
        """
          Args:
              vocab_size: int, number of tokens in the vocabulary (V)
              context_window_size: int, size of the context window (T)
              embed_size: int, embedding size (D)
              num_heads: int, number of heads (H)
              n_layers: int, number of layers (M)
        """
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, embed_size)
        self.position_embedding_table = nn.Embedding(context_window_size, embed_size)
        self.blocks = nn.Sequential(*[
            TransformerBlock(vocab_size,
                             context_window_size,
                             embed_size=embed_size,
                             num_heads=num_heads)
            for _ in range(n_layers)])

        # final layer norm
        self.ln_f = nn.LayerNorm(embed_size)
        self.lm_head = nn.Linear(embed_size, vocab_size)

        # good initialization
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, token_ids, targets=None):
        """
        Args:
            token_ids: tensor of integers, provides the context, shape (B, T)
            targets: tensor of integers, provides the tokens we are predicting, shape (B, T)
        """
        B, T = token_ids.shape

        # token_ids and targets are both (B, T) tensor of integers
        tok_emb = self.token_embedding_table(token_ids) # (B, T, D)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T, D)
        x = tok_emb + pos_emb # (B, T, D)

        # TODO: your code below
        logits = ...
        loss = ...

        return logits, loss

    @torch.no_grad()
    def generate(self, token_ids, max_new_tokens):
        """
        Args:
            token_ids: tensor of integers forming the context, shape (B, T)
            max_new_tokens: int, max number of tokens to generate
        """
        # TODO, your code below
        return token_ids

Train your TransformerLM for LARGE_ITERS iterations and plot the loss curve. You may want to change the learning rate.

We used a learning rate of 1e-4 and got to a final train loss of around 2.4 in around 30 mins of training.

trans = TransformerLM(vocab_size, context_window_size)
tlm = trans.to(device)
learning_rate = 1e-4
# TODO, your code below

Generate an unconditional sample of length context_window_size from your trained TransformerLM, and also prompt it with the two prompts we gave you. How does the output look? Discuss.

# the contexts for the different prompts
start_context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(f"shape is {start_context.shape}")
context1 = torch.tensor(from_code_bert(tokenizer.encode(prompt_1_text)[:-1]), device=device).reshape(1, -1) # (1, T)
print(f"shape is {context1.shape}")
context2 = torch.tensor(from_code_bert(tokenizer.encode(prompt_2_text)[:-1])).to(device).reshape(1, -1)
print(f"shape is {context2.shape}")
# unconditional generate from the transformer model
uncond_gen = (tlm.generate(start_context, max_new_tokens=context_window_size)[0].tolist())
print(tokenizer.decode(to_code_bert(uncond_gen)))
# conditional generation of newton's method
# TODO, your code here
# conditional generation of cosine distance
# TODO, your code here

Question 1.4.3#

The negative log-likelihood (averaged per token) we have been using to train our models can be expressed as

\[\begin{equation*} L = -\frac{1}{T} \sum_{t = 1}^{T} \log p(s[t] | \text{context}) \end{equation*}\]

for some document \(s\), where \(s[t]\) is the \(t\)th token of the doc. The natural language processing (NLP) community often reports the quantity

\[\begin{equation*} \text{perplexity} = \exp(L). \end{equation*}\]

Give an intuitive interpretation of what perplexity is. Why might it be a more intuitive or natural measure to report than negative log-likelihood? Does the reported perplexity of your trained TransformerLM model make sense in terms of the samples it generates? (Be sure to distinguish between train and validation perplexity. Which of train and val perplexity is more helpful for understanding your generated samples? Why?) (Hint: your answer to Question 1.1.6 may be helpful.)
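For reference, converting the averaged per-token losses reported by estimate_loss into perplexities is a one-liner (this sketch assumes your trained tlm is still in memory; adjust the variable names to match your notebook):

# convert averaged per-token NLL into perplexity (assumes tlm from the training cell above)
losses = estimate_loss(tlm, eval_iters, context_window_size, device)
train_ppl = torch.exp(losses['train'])
val_ppl = torch.exp(losses['val'])
print(f"train perplexity: {train_ppl:.2f}, val perplexity: {val_ppl:.2f}")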

Part 2: Mini-Project#

Quick recap: So far we have

  1. Preprocessed the Python code dataset by encoding text into integer tokens.

  2. Implemented single-headed attention and then generalized it to multi-headed attention, which we combined with deep learning to create the transformer architecture.

  3. Trained our transformer and generated code output.

Our simple language model has clearly made a lot of progress: it has learned to generate text in the style of Python code, although there are many quirks suggesting it would not make a very practical code assistant in its current state.

Project Outline#

Find some area of possible improvement. We interpret “improvement” quite loosely, but it is up to you to state precisely in what sense your proposed innovation might constitute an improvement and to show convincing evidence that your innovation does or does not constitute an improvement according to your definition. For your idea, formulate a hypothesis for why this change should result in a better model. Implement your changes and report any findings.

Notes: As this assignment is being treated as a project, you should expect training to take longer than previous assignments. However, please use your judgement to decide what is reasonable. We will not expect you to run training procedures that take more than 2 hours on the free Google Colab computing resources and we certainly do not expect you to acquire additional compute. The proposed improvements should not solely rely on increased computing demands, but must be based on the goal of improving the model by more efficiently learning from our data.

Hints: There are many aspects to assessing our model. For example, not only is the quality of the generated text important, it is also of interest to reduce the costs associated with training.

Deliverables#

In addition to a PDF of your Python notebook, the submission for this project will be a written report no more than 4 pages in length using the NeurIPS LaTeX template. Your report should include a detailed analysis of the hypotheses you chose to test along with any conclusions.

The page limit for the report does not include the bibliography or appendices. Make sure to keep the “ready for submission” option to help us grade anonymously. One of your appendices should contain a link to any code used to generate the project so that we can grade it (a Google Drive folder with Colab notebooks or a GitHub repo are both fine). You should have at least one plot in your main text (which is capped at 4 pages).

Data augmentation#

We got the data for this project from The Stack. If you’d like, you can train on larger datasets by accessing their full dataset of Python code (we have just scratched the surface). You have to make an account on Hugging Face to get a Hugging Face access token, but the process is pretty quick.

Submission Instructions#

You will generate two PDFs: one from parts 0 and 1, which involves completing this colab to create a transformer baseline; and one from the mini-project in part 2, which will be your write-up of no longer than 4 pages.

Combine the two PDFs into a single PDF and submit it on Gradescope. Tag your PDF correctly.

If you work in a group of two, submit one assignment on Gradescope. If you complete the assignment individually, submit as usual.