{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "P_JIQE4w9xbB" }, "source": [ "# HW1: Logistic Regression\n", "\n", "This class is about models and algorithms for discrete data. This homework will have all three ingredients:\n", "* **Data**: The results from all college football games in the 2023 season.\n", "* **Model**: The *Bradley-Terry* model for predicting the winners of football games. The Bradley-Terry model is just logistic regression.\n", "* **Algorithm**: We will implement two ways of fitting logistic regression: gradient descent and Newton's method." ] }, { "cell_type": "markdown", "metadata": { "id": "oi2v2m5yCJE9" }, "source": [ "## The Bradley-Terry Model\n", "\n", "In the Bradley-Terry model, we give team $k$ a team effect $\\beta_k$. A higher $\\beta_k$, relative to the other teams' effects, means that team $k$ is a better team.\n", "The Bradley-Terry model formalizes this intuition by modeling the log odds of team $k$ beating team $k'$ as the difference in their team effects, $\\beta_k - \\beta_{k'}$.\n", "\n", "Let $i = 1,\\ldots, n$ index games, and let $h(i) \\in \\{1,\\ldots,K\\}$ and $a(i) \\in \\{1,\\ldots,K\\}$ denote the indices of the home and away teams, respectively.\n", "Let $Y_i \\in \\{0,1\\}$ denote whether the home team won.\n", "Under the Bradley-Terry model,\n", "\\begin{equation*}\n", " Y_i \\sim \\mathrm{Bern}\\big(\\sigma(\\beta_{h(i)} - \\beta_{a(i)}) \\big),\n", "\\end{equation*}\n", "where $\\sigma(\\cdot)$ is the sigmoid function. We can view this model as a logistic regression model with covariates $x_i \\in \\mathbb{R}^K$, where\n", "\\begin{align*}\n", "x_{i,k} &=\n", "\\begin{cases}\n", "+1 &\\text{if } h(i) = k \\\\\n", "-1 &\\text{if } a(i) = k \\\\\n", "0 &\\text{o.w.},\n", "\\end{cases}\n", "\\end{align*}\n", "and parameters $\\beta \\in \\mathbb{R}^K$." 
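, "\n", "\n", "As a small illustration (a hypothetical game, not taken from the data), suppose $K = 4$, $h(i) = 2$, and $a(i) = 3$. Then\n", "\\begin{equation*}\n", " x_i = (0, +1, -1, 0)^\\top, \\qquad x_i^\\top \\beta = \\beta_2 - \\beta_3,\n", "\\end{equation*}\n", "so $\\Pr(Y_i = 1) = \\sigma(x_i^\\top \\beta) = \\sigma(\\beta_2 - \\beta_3)$, which agrees with the Bernoulli model above."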
] }, { "cell_type": "markdown", "metadata": { "id": "toIIF0ej-a7I" }, "source": [ "## Data\n", "\n", "We use the results of college football games in the fall 2023 season, which are available from the course github page and loaded for you below.\n", "\n", "The data comes as a list of the outcomes of individual games. You'll need to wrangle the data to get it into a format that you can feed into the Bradley-Terry model." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "qvTw_232nr-v" }, "outputs": [], "source": [ "import torch\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "from tqdm import tqdm" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 638 }, "id": "WIYCdEBqnvJG", "outputId": "00e407b9-75af-46de-be25-bec38f06f02d" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n" ], "text/plain": [ " Id Season Week Season Type Start Date \\\n", "0 401550883 2023 1 regular 2023-08-26T17:00:00.000Z \n", "1 401525434 2023 1 regular 2023-08-26T18:30:00.000Z \n", "2 401540199 2023 1 regular 2023-08-26T19:30:00.000Z \n", "3 401520145 2023 1 regular 2023-08-26T21:30:00.000Z \n", "4 401525450 2023 1 regular 2023-08-26T23:00:00.000Z \n", "5 401532392 2023 1 regular 2023-08-26T23:00:00.000Z \n", "6 401540628 2023 1 regular 2023-08-26T23:00:00.000Z \n", "7 401520147 2023 1 regular 2023-08-26T23:30:00.000Z \n", "8 401539999 2023 1 regular 2023-08-26T23:30:00.000Z \n", "9 401523986 2023 1 regular 2023-08-27T00:00:00.000Z \n", "\n", " Start Time Tbd Completed Neutral Site Conference Game Attendance ... \\\n", "0 False True False False NaN ... \n", "1 False True True False 49000.0 ... \n", "2 False True True False NaN ... \n", "3 False True False True 17982.0 ... \n", "4 False True False False 15356.0 ... \n", "5 False True False False 23867.0 ... \n", "6 False True False False NaN ... \n", "7 False True False False 21407.0 ... \n", "8 False True True False NaN ... \n", "9 False True False False 63411.0 ... 
\n", "\n", " Away Conference Away Division Away Points Away Line Scores \\\n", "0 NaN NaN NaN NaN \n", "1 American Athletic fbs 3.0 NaN \n", "2 UAC fcs 7.0 NaN \n", "3 Conference USA fbs 14.0 NaN \n", "4 FBS Independents fbs 41.0 NaN \n", "5 Mid-American fbs 13.0 NaN \n", "6 Patriot fcs 13.0 NaN \n", "7 Mountain West fbs 28.0 NaN \n", "8 MEAC fcs 7.0 NaN \n", "9 Mountain West fbs 28.0 NaN \n", "\n", " Away Post Win Prob Away Pregame Elo Away Postgame Elo Excitement Index \\\n", "0 NaN NaN NaN NaN \n", "1 0.001042 1471.0 1385.0 1.346908 \n", "2 0.025849 NaN NaN 6.896909 \n", "3 0.591999 1369.0 1370.0 6.821333 \n", "4 0.760751 1074.0 1122.0 5.311493 \n", "5 0.045531 1482.0 1473.0 6.547378 \n", "6 0.077483 NaN NaN 5.608758 \n", "7 0.819154 1246.0 1241.0 5.282033 \n", "8 0.001097 NaN NaN 3.122344 \n", "9 0.001769 1462.0 1412.0 1.698730 \n", "\n", " Highlights Notes \n", "0 NaN NaN \n", "1 NaN NaN \n", "2 NaN NaN \n", "3 NaN NaN \n", "4 NaN NaN \n", "5 NaN NaN \n", "6 NaN NaN \n", "7 NaN NaN \n", "8 NaN NaN \n", "9 NaN NaN \n", "\n", "[10 rows x 33 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "allgames = pd.read_csv(\"https://raw.githubusercontent.com/slinderman/stats305b/winter2024/data/01_allgames.csv\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem 0: Preprocessing\n", "\n", "Preprocess the data to drop games with nan scores, construct the covariate matrix $X$, construct the response vector $y$, and do any other preprocessing you find useful." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# your code here" ] }, { "cell_type": "markdown", "metadata": { "id": "ZjUJOkAWHWD0" }, "source": [ "## Problem 1: Loss function\n", "\n", "Write a function to compute the loss $L(\\beta)$, defined by\n", "\n", "\\begin{equation*}\n", " L(\\beta) = -\\frac{1}{n} \\sum_{i=1}^n \\log p(y_i \\mid x_i; \\beta) + \\frac{\\gamma}{2} \\| \\beta \\|_2^2\n", "\\end{equation*}\n", "where $\\gamma$ is a hyperparameter that controls the strength of your $\\ell_2$ regularization.\n", "\n", "You may want to use the `torch.distributions.Bernoulli` class." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "WTaCXlvSHuxh" }, "outputs": [], "source": [ "# your code here" ] }, { "cell_type": "markdown", "metadata": { "id": "8Cx0wyYytSb7" }, "source": [ "## Problem 2: Gradient Descent" ] }, { "cell_type": "markdown", "metadata": { "id": "xuNBMXGsO-7q" }, "source": [ "### Problem 2.1: Implementing and checking your gradients\n", "\n", "\n", "Write a function to compute the gradient of the average negative log likelihood and check your output against the results obtained by PyTorch's automatic differentiation functionality." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ROj5lRuOsASh" }, "outputs": [], "source": [ "# your code here" ] }, { "cell_type": "markdown", "metadata": { "id": "Cl9CAUpTPtpw" }, "source": [ "### Problem 2.2: Implement Gradient Descent\n", "\n", "\n", "Now, use gradient descent to fit your Bradley-Terry model to the provided data.\n", "\n", "Deliverables for this question:\n", "1. Code that implements gradient descent to fit your Bradley-Terry model to the provided data.\n", "2. A plot of the loss curve of your algorithm and a brief discussion of whether it makes sense.\n", "3. A plot of the histogram of the fitted values of $\\beta$.\n", "4. 
The top 10 teams from your ranking, and a discussion of whether this ranking makes sense or not." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "kPSNWKE8sKIH" }, "outputs": [], "source": [ "# your code here (you can use multiple code and markdown cells to organize your answer)" ] }, { "cell_type": "markdown", "metadata": { "id": "lBPDg-5QtXQV" }, "source": [ "## Problem 3: Newton's Method\n", "\n", "Now, use Newton's method to fit your Bradley-Terry model to the provided data.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "Pi_R1fgkFbQ0" }, "source": [ "### Problem 3.1: The Hessian\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "RS0kTKtVLDlQ" }, "source": [ "#### Problem 3.1.1: Implement and check the Hessian\n", "Write a function to compute the Hessian of the average negative log likelihood and check your answer against the output of `torch.autograd.functional.hessian`." ] }, { "cell_type": "markdown", "metadata": { "id": "TtSlxUAkLE-y" }, "source": [ "#### Problem 3.1.2: Positive definiteness\n", "\n", "Compute the Hessian at the point $\\beta = 0$ without regularization (set $\\gamma = 0$). Unless you've done some sort of preprocessing, it's probably singular." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "6KQjQZtfsUZ6" }, "outputs": [], "source": [ "# your code here" ] }, { "cell_type": "markdown", "metadata": { "id": "GKVxT91XLbSL" }, "source": [ "#### Problem 3.1.3\n", "\n", "Describe intuitively and mathematically what it means for the Hessian of the negative log likelihood to be singular in the context of this data and model." ] }, { "cell_type": "markdown", "metadata": { "id": "yAsLFSGXsWXO" }, "source": [ "*your answer here*" ] }, { "cell_type": "markdown", "metadata": { "id": "TvClzEjJLk52" }, "source": [ "#### Problem 3.1.4\n", "\n", "Give a hypothesis for why the Hessian in this dataset and model is singular, and provide empirical evidence to support your hypothesis." 
] }, { "cell_type": "markdown", "metadata": { "id": "dFphHjnxsjE2" }, "source": [ "*your answer here*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "-wxWOBOQslRc" }, "outputs": [], "source": [ "# your code here" ] }, { "cell_type": "markdown", "metadata": { "id": "EtKlPKs9LyNw" }, "source": [ "#### Problem 3.1.5\n", "\n", "Explain why the Hessian is invertible when $\\gamma > 0$." ] }, { "cell_type": "markdown", "metadata": { "id": "CgvigoXaspaw" }, "source": [ "*your answer here*" ] }, { "cell_type": "markdown", "metadata": { "id": "szaThYwMMuf4" }, "source": [ "### Problem 3.2: Implement Newton's method\n", "\n", "Now, use Newton's method to fit your $\\ell_2$-regularized Bradley-Terry model to the provided data.\n", "\n", "Deliverables for this question:\n", "1. Code that implements Newton's method to fit your Bradley-Terry model to the provided data.\n", "2. A plot of the loss curves from Newton's method and from gradient descent, using the same regularization strength $\\gamma$ and initialization $\\beta_0$. Briefly discuss the results and compare their rates of convergence.\n", "3. A plot of the histogram of the fitted values of $\\beta$.\n", "4. The top 10 teams from your ranking, and a discussion of whether this ranking makes sense or not." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "FPYUCllcsri7" }, "outputs": [], "source": [ "# your code here (you can use multiple code and markdown cells to organize your answer)" ] }, { "cell_type": "markdown", "metadata": { "id": "J9R91iI5NCMs" }, "source": [ "## Problem 4: Model criticism and revision\n", "\n", "Let's take another look at the Bradley-Terry model from earlier and think about improvements we can make.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "yPSnL3odcj12" }, "source": [ "### Problem 4.1: Improvements to Bradley-Terry Model\n", "Choose one way to improve the Bradley-Terry model. 
Discuss *a priori* why you think this change will improve the model and implement your change." ] }, { "cell_type": "markdown", "metadata": { "id": "gngpLxYpczp0" }, "source": [ "*your answer here*" ] }, { "cell_type": "markdown", "metadata": { "id": "Xt9Yn0NPc3nS" }, "source": [ "### Problem 4.2: Evaluation\n", "Assess whether your change was an improvement. Provide empirical evidence by evaluating performance on a held-out test set and include at least one plot supporting your assessment." ] }, { "cell_type": "markdown", "metadata": { "id": "yQvtv-eHdBM5" }, "source": [ "*your answer here*" ] }, { "cell_type": "markdown", "metadata": { "id": "87F609vpdEq0" }, "source": [ "### Problem 4.3: Reflection\n", "Reflecting on the analysis we've conducted in this assignment, which conference is best? Is there a significant difference? Please justify your answer." ] }, { "cell_type": "markdown", "metadata": { "id": "-4YnbZWVdmWv" }, "source": [ "*your answer here*" ] }, { "cell_type": "markdown", "metadata": { "id": "9TL_LAYoyI2T" }, "source": [ "## Submission Instructions\n", "\n", "**Formatting:** Check that your code does not exceed 80 characters in line width. You can set _Tools → Settings → Editor → Vertical ruler column_ to 80 to see when you've exceeded the limit.\n", "\n", "**Converting to PDF:** The simplest way to convert to PDF is to use the \"Print to PDF\" option in your browser. Just make sure that your code and plots aren't cut off, as it may not wrap lines.\n", "\n", "**Alternatively:** You can download your notebook in .ipynb format. 
Then run the following command to convert to a PDF:\n", "```\n", "jupyter nbconvert --to pdf _hw.ipynb\n", "```\n", "(Note that for the above code to work, you need to rename your file `_hw.ipynb`)\n", "\n", "**Installing nbconvert:**\n", "\n", "If you're using Anaconda for package management,\n", "```\n", "conda install -c anaconda nbconvert\n", "```\n", "\n", "**Upload** your .pdf file to Gradescope. Please tag your questions correctly! I.e., for each question, all of and only the relevant sections are tagged.\n", "\n", "Please post on Ed or come to OH if there are any other problems submitting the HW." ] } ], "metadata": { "accelerator": "GPU", "colab": { "gpuType": "A100", "machine_shape": "hm", "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.9" } }, "nbformat": 4, "nbformat_minor": 4 }