{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# HW3: Hidden Markov Models\n",
    "\n",
    "This assignment considers the problem of classifying sleep states based on noisy heart rate sensor data, like what you might obtain from an Apple Watch. Sleep researchers use [polysomnography (PSG)](https://en.wikipedia.org/wiki/Polysomnography) to obtain ground truth sleep states. PSG combines measurements of your brain (EEG), muscles (EMG), and heart (EKG) to determine which of four sleep states you are in:\n",
    "1. _Awake_\n",
    "2. _Core Sleep_: Non-REM stages 1 and 2. Heart rate and body temperature drop. Sleep spindles, which are thought to be important for memory consolidation, seen in EEG.\n",
    "3. _Deep Sleep_: Non-REM stage 3. Hard to wake up, and you feel groggy if you do. Lowest heart rate.\n",
    "4. _Rapid Eye Movement (REM) Sleep_: This is when dreaming occurs. Heart rate increases, similar to in an awake state. Not considered restful sleep. \n",
    "\n",
    "Of course, measuring EEG, EMG, and EKG while you're sleeping is a big pain. It would be great if we could predict sleep states using less invasive measures, like what you might obtain with an Apple Watch. That's the goal of this assignment!\n",
    "\n",
    "> **<u>This Assignment</u>**\n",
    ">\n",
    ">**Model:** You will use a **Hidden Markov Model (HMM)** to classify sleep stages based on (synthetic) heart rate measurements. \n",
    ">\n",
    ">**Algorithm:** You will use **expectation-maximization (EM)** and the **forward-backward algorithm** to estimate the model parameters and infer the latent states.\n",
    ">\n",
    ">**Data**: We will work with a synthetic dataset modeled after a sleep study by Walch et al (2019). We found it too difficult to make accurate predictions on their real data, so we simulated synthetic data with a stronger relationship between sleep states and heart rate. This way, you should be able to predict sleep states with better accuracy than we were able to achieve with the actual data. Though the data is synthetic, it still serves as a good exercise for learning about these models and algorithms.\n",
    "\n",
    "\n",
    "**References**\n",
    "- Walch, O. (2019). Motion and heart rate from a wrist-worn wearable and labeled sleep from polysomnography (version 1.0.0). PhysioNet. https://doi.org/10.13026/hmhs-py35.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import pandas as pd\n",
    "import torch"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The data consists of 30 time series, each corresponding to one subject's sleep data over from a single night. The time series have been concatenated together in the data frame below. The fields are,\n",
    "- `id`: the index of the time series (0, ..., 29)\n",
    "- `state` : the \"true\" sleep state\n",
    "    - 0 = \"awake\"\n",
    "    - 1 = \"core sleep\"\n",
    "    - 2 = \"deep sleep\"\n",
    "    - 3 = \"REM\"\n",
    "- `hr`: the measured heart rate. Missing data is marked by `NaN`.\n",
    "\n",
    "Each row corresponds to a 30 second time bin. The time series are variable in length, ranging from just shy of 4 hours to over 8 hours. Again, these are simulated data, but we generated the data to look like heart rate traces measured by Walch et al. (2019)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>state</th>\n",
       "      <th>hr</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>97.378630</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>97.254448</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>118.884372</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>99.038121</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>98.649347</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26609</th>\n",
       "      <td>29</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26610</th>\n",
       "      <td>29</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26611</th>\n",
       "      <td>29</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26612</th>\n",
       "      <td>29</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26613</th>\n",
       "      <td>29</td>\n",
       "      <td>1</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>26614 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "       id  state          hr\n",
       "0       0      0   97.378630\n",
       "1       0      0   97.254448\n",
       "2       0      0  118.884372\n",
       "3       0      0   99.038121\n",
       "4       0      0   98.649347\n",
       "...    ..    ...         ...\n",
       "26609  29      1         NaN\n",
       "26610  29      1         NaN\n",
       "26611  29      1         NaN\n",
       "26612  29      1         NaN\n",
       "26613  29      1         NaN\n",
       "\n",
       "[26614 rows x 3 columns]"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Load the concatenated data\n",
    "data = pd.read_csv(\"https://raw.githubusercontent.com/slinderman/stats305b/winter2024/assignments/hw3/hw3_synth.csv\")\n",
    "data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 0: Preprocessing\n",
    "\n",
    "1. Split the data into separate data frames, one for each subject. \n",
    "2. Plot a couple of the heart rate time series along with the true underlying state. \n",
    "3. Plot histograms of heart rate for each state. \n",
    "\n",
    "Make your plots look good, like you would for a figure in a paper. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 1: State Estimation\n",
    "\n",
    "Now we will use an HMM to estimate the sleep states given the heart rate data. \n",
    "\n",
    "**Notation:**\n",
    "\n",
    "Let:\n",
    "- $N=30$ denote the number of subjects\n",
    "- $K=4$ denote the number of sleep states\n",
    "- $T_i$ denote the number of time bins for subject $i \\in \\{1, \\ldots, N\\}$\n",
    "- $z_{i,t} \\in \\{1, \\ldots K\\}$ denote the sleep state for subject $i$ and time $t$\n",
    "- $x_{i,t} \\in \\mathbb{R}$ denote the measured heart rate for subject $i$ at time $t$. \n",
    "- $\\boldsymbol{\\pi}_0 \\in \\Delta_{K-1}$ denote the initial distribution\n",
    "- $\\mathbf{P} \\in [0,1]^{K \\times K}$ denote the transition matrix\n",
    "- $\\mu_k \\in \\mathbb{R}$ denote the conditional mean of the heart rate in state $k$\n",
    "- $\\sigma^2_k \\in \\mathbb{R}_+$ denote the conditional variance of the heart rate in state $k$\n",
    "- $\\Theta = \\{\\boldsymbol{\\pi}_0, \\mathbf{P}, \\{\\mu_k, \\sigma_k^2\\}\\}$ denote the set of all model parameters\n",
    "\n",
    "**Model:**\n",
    "\n",
    "We will use a Gaussian HMM of the form,\n",
    "\\begin{align*}\n",
    "p(\\{\\{z_{i,t}, x_{i,t}\\}_{t=1}^{T_i}\\}_{i=1}^N; \\Theta) \n",
    "&=\n",
    "\\prod_{i=1}^N \\left[\n",
    "\\mathrm{Cat}(z_{i,1} \\mid \\boldsymbol{\\pi}_0) \n",
    "\\prod_{t=2}^{T_i} \\mathrm{Cat}(z_{i, t} \\mid \\mathbf{P}_{z_{i,t-1}}) \n",
    "\\prod_{t=1}^{T_i} \\mathrm{N}(x_{i, t} \\mid \\mu_{z_{i,t}}, \\sigma^2_{z_{i,t}}) \n",
    "\\right]\n",
    "\\end{align*}\n",
    "where $\\mathbf{P}_{k} \\in \\Delta_{K-1}$ is the $k$-th row of the transition matrix.\n",
    "\n",
    "Assume that the missing heart rate values are \"missing at random,\" so that the corresponding likelihood terms can simply be dropped from the joint probability above. That is, replace $\\mathrm{N}(x_{i, t} \\mid \\mu_{z_{i,t}}, \\sigma^2_{z_{i,t}})$ with $1$ for missing values.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Problem 1a: Estimate the Model Parameters\n",
    "Hold out the last 5 subjects for evaluation. Using the first 25 subjects as your training data, estimate the mean and variance of the heart rate in each sleep state. \n",
    "\n",
    "1. Estimate $\\mu_k = \\mathbb{E}[X_t \\mid Z_t=k]$ and $\\sigma_k^2 = \\mathrm{Var}[X_t \\mid Z_t=k]$. Assume the subjects are iid so you can pool all the data in your estimates.\n",
    "\n",
    "2. Estimate the transition matrix $\\mathbf{P} \\in [0,1]^{K \\times K}$.\n",
    "\n",
    "Assume the initial state distribution is $\\boldsymbol{\\pi}_0 = (1, 0, 0, 0)^\\top$. That is, assume the subjects always start in the awake state."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Problem 1b: Implement the Forward-Backward Algorithm\n",
    "\n",
    "Implement the forward-backward algorithm for inferring the posterior probabilities $\\Pr(z_{i,t}=k \\mid x_{i,1:T_i}; \\Theta)$ for all $k=1,\\ldots,K$ and $t=1,\\ldots,T_i$. Your function should also return the marginal log likelihood, $\\log p(x_{i,1:T_i}; \\Theta)$, which is a byproduct of the forward pass (see lecture notes). \n",
    "\n",
    "_Note: Make sure you use a numerically stable implementation!_\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Problem 1c: Infer the Sleep States \n",
    "\n",
    "Using your estimated parameters, $\\hat{\\Theta}$, from Problem 1a and your implementation from 1b, compute the posterior probabilities of each sleep state and time step on the held-out subjects' data. Make a nice plot of your results for one subject, including the computed probabilities as well as the true states."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Problem 1d: Evaluate your classifier\n",
    "\n",
    "Let \n",
    "\\begin{align*}\n",
    "\\hat{z}_{i,t} = \\arg \\max_{k \\in [K]} \\Pr(z_{i,t}=k \\mid x_{i,1:T_i}; \\hat{\\Theta})\n",
    "\\end{align*} \n",
    "denote the state with the highest posterior marginal probability.\n",
    "\n",
    "Make a confusion matrix to compare your state estimates with the true states of held-out subjects."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Problem 1e: Discussion\n",
    "\n",
    "Discuss your results from Problems 1c and 1d. How accurate is your classifer? Do the errors make sense?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "_Your answer here_\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 2: Parameter Estimation\n",
    "\n",
    "In Part 1, you fixed the parameters using estimates derived from training data. Often, we don't have ground truth state labels and we need to simultaneously estimate the states and the parameters. For this part of the assignment, imagine the true state labels are unknown and all you have access to are the heart rate time series."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Problem 2a: Implement the EM Algorithm\n",
    "\n",
    "Using your implementation of the forward-backward algorithm from above, implement the expectation-maximization (EM) algorithm to simultaneously infer the latent states $\\{\\{z_{i,t}\\}_{t=1}^{T_i}\\}_{i=1}^N$ and estimate the parameters $\\{\\mu_k, \\sigma_k^2\\}_{k=1}^K$. \n",
    "\n",
    "To keep things simple, you can assume that you have $\\hat{\\mathbf{P}}$ from above and that $\\boldsymbol{\\pi}_0 = (1,0,0,0)^\\top$ is known. (It's not hard to estimate these parameters too, but it's a little tedious so we'll spare you the trouble.)\n",
    "\n",
    "Your EM function should return: \n",
    "- the marginal log likelihoods after each iteration of EM\n",
    "- the final parameter estimates\n",
    "- the posterior state probabilities under the final parameter estimates "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Problem 2b: Run your Code\n",
    "\n",
    "Run your EM algorithm on the 25 training subjects. Using your estimated parameters, run the forward-backward algorithm to infer the latent states of the held-out 5 subjects.\n",
    "- Plot the marginal log likelihood as a function of EM iteration. \n",
    "- Plot a confusion matrix of true and inferred states on the held-out data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Problem 2c: Model Selection\n",
    "\n",
    "In the unsupervised setting, we often need to estimate the number of discrete states, $K$, as well. One way to do that is by comparing the marginal log likelihood of held-out data using parameters estimated on the training data, for a range of $K$. \n",
    "\n",
    "Sweep over a range of values of $K$ and for each value run EM on the training data to estimate your model parameters. Then using your forward-backward algorithm to compute the marginal log likelihood of the held-out data. Plot the held-out marginal log likelihood as a function of $K$.\n",
    "\n",
    "_Note: for this problem you can use a dummy transition matrix that simply imposes a \"sticky\" prior. Let $\\mathbf{P} = [[p_{i,j}]]$ where $p_{i,j} = 0.95$ if $i=j$ and $p_{i,j} = \\frac{0.05}{K-1}$ if $i \\neq j$._"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 59,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Problem 2d: Discussion\n",
    "\n",
    "Discuss your results from Part 2. How do your parameter estimates from EM compare to those estimated from the true state labels in Part 1? How many states would you select based on Problem 2c? Does your result make sense?\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "_Your answer here_\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Part 3: Model Criticism\n",
    "\n",
    "As we said in HW1, applied statistics is an iterative process. We don't just fit one model and call it a day &mdash; we revisit our modeling assumptions in light of our findings and look for ways to improve our fit.\n",
    "\n",
    "Let's keep $K=4$ fixed, since we know there are really four official sleep states. Let's also continue to only consider variations of HMMs. In this part, suggest and then implement _at least one_ improvement to your model, based on your findings above. Quantify your improvements using classification accuracy on held-out subjects.\n",
    "\n",
    "_Note: You may use the true state labels from the training subjects for this part of the assignment._"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Submission Instructions\n",
    "\n",
    "**Formatting:** check that your code does not exceed 80 characters in line width. You can set _Tools &rarr; Settings &rarr; Editor &rarr; Vertical ruler column_ to 80 to see when you've exceeded the limit.\n",
    "\n",
    "**Converting to PDF** The simplest way to convert to PDF is to use the \"Print to PDF\" option in your browser. Just make sure that your code and plots aren't cut off, as it may not wrap lines.\n",
    "\n",
    "**Alternatively** You can download your notebook in .ipynb format and use the following commands to convert it to PDF.  Then run the following command to convert to a PDF:\n",
    "```\n",
    "jupyter nbconvert --to pdf <yourlastname>_hw<number>.ipynb\n",
    "```\n",
    "(Note that for the above code to work, you need to rename your file `<yourlastname>_hw<number>.ipynb`)\n",
    "\n",
    "**Installing nbconvert:**\n",
    "\n",
    "If you're using Anaconda for package management,\n",
    "```\n",
    "conda install -c anaconda nbconvert\n",
    "```\n",
    "\n",
    "**Upload** your .pdf file to Gradescope. Please tag your questions correctly! I.e., for each question, all of and only the relevant sections are tagged.\n",
    "\n",
    "Please post on Ed or come to OH if there are any other problems submitting the HW."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "base",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}