Homework 2: Evaluating Binary Classifiers and Implementing Logistic Regression
Files to Turn In:
ZIP file of source code submitted to the autograder should contain:
- binary_metrics.py : will be autograded
- hw2_notebook.ipynb : will not be autograded, but may be manually assessed to verify authorship/correctness/partial credit
PDF report (manually graded):
- Export a PDF of your completed Jupyter notebook.
- You can export a notebook as a PDF in 2 steps (or 1 step if you want to install Pandoc and xelatex, but that's not required):
  - In Jupyter, go to File -> Save and Export Notebook As -> HTML. Note that without installing additional software, directly exporting as a PDF will not work.
  - Open the saved HTML file, then print the page to PDF using your web browser's built-in print functionality.
- While uploading to Gradescope, mark where each subproblem's answer appears in your report using the in-browser Gradescope annotation tool.
Evaluation Rubric:
- 80% will be the report
- 20% will be the autograder score of your code
See the PDF submission portal on Gradescope for the point values of each problem. Generally, tasks with more coding/effort will earn more potential points.
Background
In this HW, you’ll complete two problems related to binary classifiers.
In Problem 1, you’ll implement common metrics for evaluating binary classifiers.
In Problem 2, you'll learn how to decide whether a new feature can help classify cancer risk better than an existing model.
As much as possible, we have tried to decouple these parts, so you may successfully complete the report even if some of your code doesn’t work. Much of your analysis will use library code in sklearn with similar functionality as what you implement yourself.
This homework specifically deals with:
- Classifier Basics
- Evaluation of Binary Classifiers
Starter Code
See the hw2 folder of the public assignments repo for this class: https://github.com/tufts-ml-courses/cs135-24f-assignments/tree/main/hw2
This starter code includes several files.
For Problem 1:
- You need to edit code in binary_metrics.py.
For Problem 2:
- You need to edit hw2_notebook.ipynb, which will help you organize your analysis for the report.
- Helper functions in threshold_selection.py and confusion_matrix.py are implemented for you. You should understand the provided code, but you do NOT need to edit these files.
Problem 1: Implement performance metrics for binary predictions
Here, you’ll implement several metrics for comparing provided “true” binary labels with “predicted” binary decisions.
See the starter code file: binary_metrics.py
Task 1(a) : Implement calc_TP_TN_FP_FN
Task 1(b) : Implement calc_ACC
Task 1(c) : Implement calc_TPR
Task 1(d) : Implement calc_PPV
See the starter code for example inputs and the expected output.
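If you are unsure where to start, here is a minimal sketch of the counting logic for tasks 1(a) and 1(b) using numpy. The argument names (ytrue_N, yhat_N) are placeholders; your submitted binary_metrics.py must follow the exact signatures and docstrings in the starter code.

```python
import numpy as np

def calc_TP_TN_FP_FN(ytrue_N, yhat_N):
    ''' Count true positives, true negatives, false positives, false negatives.

    Assumes ytrue_N and yhat_N are 1D arrays of binary (0/1) values.
    '''
    ytrue_N = np.asarray(ytrue_N)
    yhat_N = np.asarray(yhat_N)
    TP = np.sum(np.logical_and(ytrue_N == 1, yhat_N == 1))
    TN = np.sum(np.logical_and(ytrue_N == 0, yhat_N == 0))
    FP = np.sum(np.logical_and(ytrue_N == 0, yhat_N == 1))
    FN = np.sum(np.logical_and(ytrue_N == 1, yhat_N == 0))
    return TP, TN, FP, FN

def calc_ACC(ytrue_N, yhat_N):
    ''' Accuracy: fraction of predictions that are correct. '''
    TP, TN, FP, FN = calc_TP_TN_FP_FN(ytrue_N, yhat_N)
    return (TP + TN) / float(TP + TN + FP + FN)
```

TPR and PPV follow the same pattern (TPR = TP / (TP + FN), PPV = TP / (TP + FP)); check the starter code's docstrings for how edge cases such as zero denominators should be handled.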
Problem 2: Binary Classifier for Cancer-Risk Screening
Dataset: Predicting Cancer Risk from Easy-to-measure Facts
We are building classifiers that decide if patients are low-risk or high-risk for some kind of cancer. If our classifier is reliable enough at identifying low-risk patients, we could use the classifier’s assigned label to perform screening:
- if the predicted label is 1 (predicted high-risk): do a follow-up biopsy
- if the predicted label is 0 (predicted low-risk): no follow-up necessary
Currently, all patients have the biopsy. Can we reduce the number of patients that need to go through the biopsy, while still catching almost all the cases that have cancer?
Setup: You have been given a dataset containing some medical history information for 750 patients that might be at risk of cancer. Dataset credit: A. Vickers, Memorial Sloan Kettering Cancer Center [original link].
Each patient in our dataset has had a biopsy, a short surgical procedure to extract a tumor sample that is a bit painful but with virtually no lasting harmful effects. After the biopsy, lab techs can test the tissue sample to obtain a direct “ground truth” label, so we know each patient's actual cancer status (binary variable, 1 means “has cancer”, 0 means does not; stored as the cancer column in the y_train.csv, y_valid.csv, and y_test.csv files).
We want to build classifiers to predict whether a patient likely has cancer from easier-to-get information, so we could avoid painful biopsies unless they are necessary. Of course, if we skip the biopsy, a patient with cancer would be left undiagnosed and therefore untreated. We’re told by the doctors this outcome would be life-threatening.
Easiest features: It is known that older patients with a family history of cancer have a higher probability of harboring cancer. So we can use the age and famhistory variables in the x_train.csv, x_valid.csv, and x_test.csv files.
Possible new feature: A clinical chemist has recently discovered a real-valued marker (called marker in the same x CSV files) that might help predict whether a patient has cancer.
To summarize, there are two versions of the feature set:
- 2-feature dataset: ‘age’ and ‘famhistory’
- 3-feature dataset: ‘marker’, ‘age’ and ‘famhistory’
In the starter code, we have provided an existing train/validation/test split of this dataset, stored on-disk in comma-separated-value (CSV) files: x_train.csv, y_train.csv, x_valid.csv, y_valid.csv, x_test.csv, and y_test.csv.
Data: https://github.com/tufts-ml-courses/cs135-24f-assignments/tree/main/hw2/data_cancer
Understanding the Dataset
Implementation Step 1A
Given the provided datasets (as CSV files), load them and compute the relevant counts needed for Table 1A below.
Table 1 in Report
Provide a table summarizing some basic properties of the provided training set, validation set, and test set:
- Row 1 ‘total count’: how many total examples are in each set?
- Row 2 ‘positive label count’: how many examples have a positive label (meaning the patient has cancer)?
- Row 3 ‘fraction positive’ : what fraction (between 0 and 1) of the examples have cancer?
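As a starting point, here is a minimal sketch of computing these three rows for the training set with pandas. The file path is a placeholder for wherever you placed the data_cancer folder; repeat the same pattern for the validation and test sets.

```python
import pandas as pd

# Load the training labels (the 'cancer' column holds the 0/1 ground truth).
y_train_df = pd.read_csv('data_cancer/y_train.csv')  # placeholder path

total_count = y_train_df.shape[0]              # Row 1: total examples
pos_count = int(y_train_df['cancer'].sum())    # Row 2: positive (cancer) label count
frac_pos = pos_count / total_count             # Row 3: fraction positive
print(total_count, pos_count, round(frac_pos, 3))
```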
Establishing Baseline Prediction Quality
Implementation Step 1B
Given a training set of label values $y_1, \ldots, y_N$, implement this simple baseline classifier:
- predict-0-always : predict $\hat{y}_n = 0$ for all examples $n$
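As a sketch (assuming y_test has been loaded as a 1D numpy array of 0/1 labels, e.g. the cancer column of y_test.csv), the baseline and its accuracy could look like:

```python
import numpy as np
from sklearn.metrics import accuracy_score

yhat_test = np.zeros_like(y_test)             # predict-0-always: predict 0 for every example
test_acc = accuracy_score(y_test, yhat_test)  # fraction of labels matched
print('predict-0-always test accuracy: %.3f' % test_acc)
```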
Short Answer 1a in Report
What accuracy does the “predict-0-always” classifier get on the test set (report to 3 decimal places)? (You should see a pretty high number). Why isn’t this classifier “good enough” to use in our screening task?
Trying Logistic Regression: Training and Hyperparameter Selection
Implementation Step 1C
Consider the 2-feature dataset. Fit a logistic regression model using sklearn's implementation, sklearn.linear_model.LogisticRegression (see the docs).
When you construct your LogisticRegression classifier, please be sure that you:
- Set solver='lbfgs' (ensures consistent performance, coherent penalty)
- Provide a positive value for hyperparameter C, an “inverse strength value” for the L2 penalty on coefficient weights
  - Small C means the weights should be near zero (equivalent to a large L2 penalty strength)
  - Large C means the weights should be essentially unpenalized (equivalent to a small L2 penalty strength)
To avoid overfitting, you should explore a range of C values, using a regularly-spaced grid: C_grid = np.logspace(-9, 6, 31). Among these possible values, select the value that minimizes the mean cross entropy loss on the validation set. The starter code contains a function from sklearn for computing this loss.
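A minimal sketch of this selection loop is below. It assumes x_train_2f, y_train, x_valid_2f, and y_valid are numpy arrays built from the 2-feature columns (age, famhistory) of the CSV files; these variable names are placeholders. log_loss is sklearn's mean cross entropy function, which is likely the loss function referenced above, though your starter code may import it under a different name.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

C_grid = np.logspace(-9, 6, 31)
best_C, best_loss, best_model = None, np.inf, None
for C in C_grid:
    model = LogisticRegression(solver='lbfgs', C=C)
    model.fit(x_train_2f, y_train)                         # fit on 2-feature training data
    yproba_valid = model.predict_proba(x_valid_2f)[:, 1]   # P(y=1) for each validation example
    loss = log_loss(y_valid, yproba_valid)                 # mean cross entropy on validation set
    if loss < best_loss:
        best_C, best_loss, best_model = C, loss, model
print('selected C = %.3g with validation loss %.4f' % (best_C, best_loss))
```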
Implementation Step 1D
Repeat 1C, for the 3-feature dataset.
Comparing Models with ROC Analysis
We have trained two possible LR models: one using the 2-feature dataset (from 1C), and one using the 3-feature dataset (from 1D).
Receiver Operating Characteristic curves (“ROC” curves) allow us to compare classifiers across many possible decision thresholds. Each curve shows the tradeoffs a classifier makes between true positive rate (TPR) and false positive rate (FPR) as you vary the decision threshold. Remember FPR = 1 - TNR.
Implementation Step 1E
Compare the 2-feature and 3-feature models from 1C and 1D by plotting their ROC curves on the validation set.
You can use sklearn.metrics.roc_curve to compute such curves. To understand how to use this function, consult the function's User Guide and documentation.
Create a single plot showing two lines:
- one line is the validation-set ROC curve for the 2-feature model from 1C (use color BLUE (‘b’) and style ‘.-’)
- one line is the validation-set ROC curve for the 3-feature model from 1D (use color RED (‘r’) and style ‘.-’)
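A minimal plotting sketch follows. It assumes model_2f and model_3f are the fitted classifiers selected in 1C and 1D, and x_valid_2f / x_valid_3f are the matching validation feature arrays; these variable names are placeholders.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

plt.figure()
for model, x_va, style, name in [(model_2f, x_valid_2f, 'b.-', '2-feature model'),
                                 (model_3f, x_valid_3f, 'r.-', '3-feature model')]:
    yproba_va = model.predict_proba(x_va)[:, 1]   # predicted P(y=1) on validation set
    fpr, tpr, _ = roc_curve(y_valid, yproba_va)   # FPR/TPR pairs across all thresholds
    plt.plot(fpr, tpr, style, label=name)
plt.xlabel('false positive rate')
plt.ylabel('true positive rate')
plt.legend(loc='lower right')
plt.show()
```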
Figure 1 in Report
In your report, show the plot you created in step 1E. No caption is necessary.
Short Answer 1b in Report
Compare the two models in terms of their ROC curves from Figure 1. Does one dominate the other in terms of overall performance across all thresholds, or are there some threshold regimes where the 2-feature model is preferred and other regimes where the 3-feature model is preferred? Which model do you recommend for the task at hand?
Selecting the Decision Threshold
Remember that even after we train an LR model to make probabilistic predictions, if we intend the classifier to ultimately make some yes/no binary decision (e.g. should we give a biopsy or not), we need to select the threshold we use to obtain a binary decision from probabilities.
Of course, we could just use a threshold of 0.5 (y_pred = 0 if y_proba < 0.5 else 1, which is what sklearn and most implementations will do by default). Below, we'll compare that approach against several potentially smarter strategies for selecting this threshold.
To get candidate threshold values, use the helper function compute_perf_metrics_across_thresholds in the starter code file threshold_selection.py.
Implementation Step 1F
For the classifier from 1D above (LR for 3 features), calculate performance metrics using the default threshold of 0.5 (y_pred = 0 if y_proba < 0.5 else 1). Produce the confusion matrix and calculate the TPR and PPV on the test set. Tip: Remember that we have implemented helper functions for you in confusion_matrix.py.
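As a sketch (assuming model_3f is the fitted 3-feature classifier and x_test, y_test are the test arrays), the default-threshold predictions and metrics could be computed as below. The provided confusion_matrix.py helpers are the intended route; sklearn.metrics.confusion_matrix is shown here only as a cross-check.

```python
import numpy as np
import sklearn.metrics

yproba_test = model_3f.predict_proba(x_test)[:, 1]          # predicted P(y=1) on the test set
yhat_test = np.asarray(yproba_test >= 0.5, dtype=np.int32)  # default 0.5 threshold

# Binary confusion matrix: rows are true class (0, 1), columns are predicted class (0, 1).
TN, FP, FN, TP = sklearn.metrics.confusion_matrix(y_test, yhat_test).ravel()
TPR = TP / float(TP + FN)   # true positive rate (recall)
PPV = TP / float(TP + FP)   # positive predictive value (precision)
print('TPR = %.3f, PPV = %.3f' % (TPR, PPV))
```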
Implementation Step 1G
For the classifier from 1D above (LR for 3 features), compute performance metrics across all candidate thresholds on the validation set (use compute_perf_metrics_across_thresholds). Then, pick the threshold that maximizes TPR while satisfying PPV >= 0.98 on the validation set. If there's a tie for the maximum TPR, choose the threshold corresponding to a higher PPV.
Remember, you pick this threshold based on the validation set, then later you’ll evaluate it on the test set.
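A minimal sketch of the selection logic is below. It assumes thresholds_grid, tpr_grid, and ppv_grid are aligned 1D arrays of candidate thresholds and their validation-set TPR/PPV values; check the docstring of compute_perf_metrics_across_thresholds for the actual names and return format.

```python
import numpy as np

# Keep only thresholds that satisfy the PPV constraint on the validation set.
# (If no threshold satisfies it, ok_ids will be empty and you'll need a fallback.)
ok_ids = np.flatnonzero(ppv_grid >= 0.98)

# Among those, take the maximum TPR; break ties in favor of higher PPV.
# np.lexsort sorts by the last key first (TPR), then by the earlier key (PPV).
order = np.lexsort((ppv_grid[ok_ids], tpr_grid[ok_ids]))
best_id = ok_ids[order[-1]]
best_thresh = thresholds_grid[best_id]
print('chosen threshold: %.4f (valid TPR=%.3f, PPV=%.3f)'
      % (best_thresh, tpr_grid[best_id], ppv_grid[best_id]))
```

Step 1H below follows the same pattern with the roles of TPR and PPV swapped.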
Implementation Step 1H
For the classifier from 1D above (LR for 3 features), compute performance metrics across all candidate thresholds on the validation set (use compute_perf_metrics_across_thresholds), and pick the threshold that maximizes PPV while satisfying TPR >= 0.98 on the validation set. If there's a tie for the maximum PPV, choose the threshold corresponding to a higher TPR.
Remember, you pick this threshold based on the validation set, then later you’ll evaluate it on the test set.
Short Answer 1c in Report
By carefully reading the confusion matrices, report for each of the 3 thresholding strategies in parts 1F - 1H how many subjects in the test set are saved from unnecessary biopsies that would be done in current practice.
Hint: You can assume that currently, the hospital would have done a biopsy on every patient in the test set. Your goal is to build a classifier that improves on this practice.
Short Answer 1d in Report
Among the 3 possible thresholding strategies, which strategy best meets the stated goals of stakeholders in this screening task: avoid life-threatening mistakes whenever possible, while also eliminating unnecessary biopsies? What fraction of current biopsies might be avoided if this strategy was adopted by the hospital?
Hint: You can also assume the test set is a reasonable representation of the true population of patients.