Homework 2: Evaluating Binary Classifiers and Implementing Logistic Regression
Files to Turn In:
ZIP file of source code submitted to the autograder should contain:
- binary_metrics.py : will be autograded
- hw2_notebook.ipynb : will not be autograded, but may be manually assessed to verify authorship/correctness/partial credit
PDF report (manually graded):
- Export a PDF of your completed Jupyter notebook.
- You can export a notebook as a PDF in 2 steps (or 1 step if you want to install Pandoc and xelatex, but that's not required):
  - In Jupyter, go to File -> Save and Export Notebook As -> HTML. Note that without installing additional software, directly exporting as a PDF will not work.
  - Open the saved HTML file, then print the page to PDF using your web browser's built-in print functionality.
- While uploading to Gradescope, mark where each subproblem's answer appears in your report using the in-browser Gradescope annotation tool.
Evaluation Rubric:
- 80% will be the report
- 20% will be the autograder score of your code
See the PDF submission portal on Gradescope for the point values of each problem. Generally, tasks with more coding/effort will earn more potential points.
Background
In this HW, you’ll complete two problems related to binary classifiers.
In Problem 1, you’ll implement common metrics for evaluating binary classifiers.
In Problem 2, you'll learn how to decide whether a new feature can help classify cancer risk better than an existing model.
As much as possible, we have tried to decouple these parts, so you may successfully complete the report even if some of your code doesn’t work. Much of your analysis will use library code in sklearn with similar functionality as what you implement yourself.
This homework specifically deals with:
- Classifier Basics
- Evaluation of Binary Classifiers
Starter Code
See the hw2 folder of the public assignments repo for this class: https://github.com/tufts-ml-courses/cs135-24f-assignments/tree/main/hw2
This starter code includes several files.
For Problem 1:
- You need to edit code in binary_metrics.py.
For Problem 2:
- You need to edit hw2_notebook.ipynb, which will help you organize your analysis for the report.
- Helper functions in threshold_selection.py and confusion_matrix.py are implemented for you. You should understand the provided code, but you do NOT need to edit these files.
Problem 1: Implement performance metrics for binary predictions
Here, you’ll implement several metrics for comparing provided “true” binary labels with “predicted” binary decisions.
See the starter code file: binary_metrics.py
Task 1(a) : Implement calc_TP_TN_FP_FN
Task 1(b) : Implement calc_ACC
Task 1(c) : Implement calc_TPR
Task 1(d) : Implement calc_PPV
See the starter code for example inputs and the expected output.
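If you are unsure where to start, here is a minimal sketch of the counting logic for tasks 1(a) and 1(b) using numpy. The argument names (ytrue_N, yhat_N) are placeholders; your submitted binary_metrics.py must follow the exact signatures and docstrings in the starter code.

```python
import numpy as np

def calc_TP_TN_FP_FN(ytrue_N, yhat_N):
    ''' Count true positives, true negatives, false positives, false negatives.

    Assumes ytrue_N and yhat_N are 1D arrays of binary (0/1) values.
    '''
    ytrue_N = np.asarray(ytrue_N)
    yhat_N = np.asarray(yhat_N)
    TP = np.sum(np.logical_and(ytrue_N == 1, yhat_N == 1))
    TN = np.sum(np.logical_and(ytrue_N == 0, yhat_N == 0))
    FP = np.sum(np.logical_and(ytrue_N == 0, yhat_N == 1))
    FN = np.sum(np.logical_and(ytrue_N == 1, yhat_N == 0))
    return TP, TN, FP, FN

def calc_ACC(ytrue_N, yhat_N):
    ''' Accuracy: fraction of predictions that are correct. '''
    TP, TN, FP, FN = calc_TP_TN_FP_FN(ytrue_N, yhat_N)
    return (TP + TN) / float(TP + TN + FP + FN)
```

TPR and PPV follow the same pattern (TPR = TP / (TP + FN), PPV = TP / (TP + FP)); check the starter code's docstrings for how edge cases such as zero denominators should be handled.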
Problem 2: Binary Classifier for Cancer-Risk Screening
Dataset: Predicting Cancer Risk from Easy-to-measure Facts
We are building classifiers that decide if patients are low-risk or high-risk for some kind of cancer. If our classifier is reliable enough at identifying low-risk patients, we could use the classifier’s assigned label to perform screening:
- if the predicted label is 1 (predicted high-risk): do a follow-up biopsy
- if the predicted label is 0 (predicted low-risk): no follow-up necessary
Currently, all patients have the biopsy. Can we reduce the number of patients that need to go through the biopsy, while still catching almost all the cases that have cancer?
Setup: You have been given a dataset containing some medical history information for 750 patients that might be at risk of cancer. Dataset credit: A. Vickers, Memorial Sloan Kettering Cancer Center [original link].
Each patient in our dataset has had a biopsy, a short surgical procedure to extract a tumor sample that is a bit painful but with virtually no lasting harmful effects. After the biopsy, lab techs can test the tissue sample to obtain a direct “ground truth” label, so we know each patient's actual cancer status (binary variable, 1 means “has cancer”, 0 means does not; stored as the cancer column in the y_train.csv, y_valid.csv, and y_test.csv files).
We want to build classifiers to predict whether a patient likely has cancer from easier-to-get information, so we could avoid painful biopsies unless they are necessary. Of course, if we skip the biopsy, a patient with cancer would be left undiagnosed and therefore untreated. We’re told by the doctors this outcome would be life-threatening.
Easiest features: It is known that older patients with a family history of cancer have a higher probability of harboring cancer. So we can use the age and famhistory variables in the x_train.csv, x_valid.csv, and x_test.csv files.
Possible new feature: A clinical chemist has recently discovered a real-valued marker (called marker in the same x CSV files) that might help predict whether a patient has cancer.
To summarize, there are two versions of the feature set:
- 2-feature dataset: ‘age’ and ‘famhistory’
- 3-feature dataset: ‘marker’, ‘age’ and ‘famhistory’
In the starter code, we have provided an existing train/validation/test split of this dataset, stored on-disk in comma-separated-value (CSV) files: x_train.csv, y_train.csv, x_valid.csv, y_valid.csv, x_test.csv, and y_test.csv.
Data: https://github.com/tufts-ml-courses/cs135-24f-assignments/tree/main/hw2/data_cancer
Understanding the Dataset
Implementation Step 1A
Given the provided datasets (as CSV files), load them and compute the relevant counts needed for Table 1A below.
Table 1 in Report
Provide a table summarizing some basic properties of the provided training set, validation set, and test set:
- Row 1 ‘total count’: how many total examples are in each set?
- Row 2 ‘positive label count’: how many examples have a positive label (meaning the patient has cancer)?
- Row 3 ‘fraction positive’ : what fraction (between 0 and 1) of the examples have cancer?
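As a starting point, here is a minimal sketch of computing these three rows for the training set with pandas. The file path is a placeholder for wherever you placed the data_cancer folder; repeat the same pattern for the validation and test sets.

```python
import pandas as pd

# Load the training labels (the 'cancer' column holds the 0/1 ground truth).
y_train_df = pd.read_csv('data_cancer/y_train.csv')  # placeholder path

total_count = y_train_df.shape[0]              # Row 1: total examples
pos_count = int(y_train_df['cancer'].sum())    # Row 2: positive (cancer) label count
frac_pos = pos_count / total_count             # Row 3: fraction positive
print(total_count, pos_count, round(frac_pos, 3))
```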
Establishing Baseline Prediction Quality
Implementation Step 1B
Given a training set of label values $y_1, \ldots, y_N$, implement this simple baseline classifier:
- predict-0-always : predict $\hat{y}_n = 0$ for all examples $n$
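As a sketch (assuming y_test has been loaded as a 1D numpy array of 0/1 labels, e.g. the cancer column of y_test.csv), the baseline and its accuracy could look like:

```python
import numpy as np
from sklearn.metrics import accuracy_score

yhat_test = np.zeros_like(y_test)             # predict-0-always: predict 0 for every example
test_acc = accuracy_score(y_test, yhat_test)  # fraction of labels matched
print('predict-0-always test accuracy: %.3f' % test_acc)
```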
Short Answer 1a in Report
What accuracy does the “predict-0-always” classifier get on the test set (report to 3 decimal places)? (You should see a pretty high number). Why isn’t this classifier “good enough” to use in our screening task?
Trying Logistic Regression: Training and Hyperparameter Selection
Implementation Step 1C
Consider the 2-feature dataset. Fit a logistic regression model using sklearn's implementation, sklearn.linear_model.LogisticRegression (see the docs).
When you construct your LogisticRegression classifier, please be sure that you:
- Set solver='lbfgs' (ensures consistent performance, coherent penalty)
- Provide a positive value for hyperparameter C, an “inverse strength value” for the L2 penalty on coefficient weights
  - Small C means the weights should be near zero (equivalent to a large L2 penalty strength)
  - Large C means the weights should be essentially unpenalized (equivalent to a small L2 penalty strength)
To avoid overfitting, you should explore a range of C values, using a regularly-spaced grid: C_grid = np.logspace(-9, 6, 31). Among these possible values, select the value that minimizes the mean cross entropy loss on the validation set. The starter code contains a function from sklearn for computing this loss.
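A minimal sketch of this selection loop is below. It assumes x_train_2f, y_train, x_valid_2f, and y_valid are numpy arrays built from the 2-feature columns (age, famhistory) of the CSV files; these variable names are placeholders. log_loss is sklearn's mean cross entropy function, which is likely the loss function referenced above, though your starter code may import it under a different name.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

C_grid = np.logspace(-9, 6, 31)
best_C, best_loss, best_model = None, np.inf, None
for C in C_grid:
    model = LogisticRegression(solver='lbfgs', C=C)
    model.fit(x_train_2f, y_train)                         # fit on 2-feature training data
    yproba_valid = model.predict_proba(x_valid_2f)[:, 1]   # P(y=1) for each validation example
    loss = log_loss(y_valid, yproba_valid)                 # mean cross entropy on validation set
    if loss < best_loss:
        best_C, best_loss, best_model = C, loss, model
print('selected C = %.3g with validation loss %.4f' % (best_C, best_loss))
```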
Implementation Step 1D
Repeat 1C, for the 3-feature dataset.
Comparing Models with ROC Analysis
We have trained two possible LR models: one using the 2-feature dataset (from 1C), and one using the 3-feature dataset (from 1D).
Receiver Operating Characteristic curves (“ROC” curves) allow us to compare classifiers across many possible decision thresholds. Each curve shows the tradeoffs a classifier makes between true positive rate (TPR) and false positive rate (FPR) as you vary the decision threshold. Remember FPR = 1 - TNR.
Implementation Step 1E
Compare the 2-feature and 3-feature models from 1C and 1D by plotting their ROC curves on the validation set.
You can use sklearn.metrics.roc_curve to compute such curves. To understand how to use this function, consult the function's User Guide and documentation.
Create a single plot showing two lines:
- one line is the validation-set ROC curve for the 2-feature model from 1C (use color BLUE (‘b’) and style ‘.-’)
- one line is the validation-set ROC curve for the 3-feature model from 1D (use color RED (‘r’) and style ‘.-’)
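A minimal plotting sketch follows. It assumes model_2f and model_3f are the fitted classifiers selected in 1C and 1D, and x_valid_2f / x_valid_3f are the matching validation feature arrays; these variable names are placeholders.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

plt.figure()
for model, x_va, style, name in [(model_2f, x_valid_2f, 'b.-', '2-feature model'),
                                 (model_3f, x_valid_3f, 'r.-', '3-feature model')]:
    yproba_va = model.predict_proba(x_va)[:, 1]   # predicted P(y=1) on validation set
    fpr, tpr, _ = roc_curve(y_valid, yproba_va)   # FPR/TPR pairs across all thresholds
    plt.plot(fpr, tpr, style, label=name)
plt.xlabel('false positive rate')
plt.ylabel('true positive rate')
plt.legend(loc='lower right')
plt.show()
```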
Figure 1 in Report
In your report, show the plot you created in step 1E. No caption is necessary.
Short Answer 1b in Report
Compare the two models in terms of their ROC curves from Figure 1. Does one dominate the other in terms of overall performance across all thresholds, or are there some threshold regimes where the 2-feature model is preferred and other regimes where the 3-feature model is preferred? Which model do you recommend for the task at hand?
Selecting the Decision Threshold
Remember that even after we train an LR model to make probabilistic predictions, if we intend the classifier to ultimately make some yes/no binary decision (e.g. should we give a biopsy or not), we need to select the threshold we use to obtain a binary decision from probabilities.
Of course, we could just use a threshold of 0.5 (y_pred = 0 if y_proba < 0.5 else 1, which is what sklearn and most implementations will do by default). Below, we'll compare that approach against several potentially smarter strategies for selecting this threshold.
To get candidate threshold values, use the helper function compute_perf_metrics_across_thresholds in the starter code file threshold_selection.py.
Implementation Step 1F
For the classifier from 1D above (LR for 3 features), calculate performance metrics using the default threshold of 0.5 (y_pred = 0 if y_proba < 0.5 else 1). Produce the confusion matrix and calculate the TPR and PPV on the test set. Tip: Remember that we have implemented helper functions for you in confusion_matrix.py.
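As a sketch (assuming model_3f is the fitted 3-feature classifier and x_test, y_test are the test arrays), the default-threshold predictions and metrics could be computed as below. The provided confusion_matrix.py helpers are the intended route; sklearn.metrics.confusion_matrix is shown here only as a cross-check.

```python
import numpy as np
import sklearn.metrics

yproba_test = model_3f.predict_proba(x_test)[:, 1]          # predicted P(y=1) on the test set
yhat_test = np.asarray(yproba_test >= 0.5, dtype=np.int32)  # default 0.5 threshold

# Binary confusion matrix: rows are true class (0, 1), columns are predicted class (0, 1).
TN, FP, FN, TP = sklearn.metrics.confusion_matrix(y_test, yhat_test).ravel()
TPR = TP / float(TP + FN)   # true positive rate (recall)
PPV = TP / float(TP + FP)   # positive predictive value (precision)
print('TPR = %.3f, PPV = %.3f' % (TPR, PPV))
```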
Implementation Step 1G
For the classifier from 1D above (LR for 3 features), compute performance metrics across all candidate thresholds on the validation set (use compute_perf_metrics_across_thresholds). Then, pick the threshold that maximizes TPR while satisfying PPV >= 0.98 on the validation set. If there's a tie for the maximum TPR, choose the threshold corresponding to a higher PPV.
Remember, you pick this threshold based on the validation set, then later you’ll evaluate it on the test set.
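A minimal sketch of the selection logic is below. It assumes thresholds_grid, tpr_grid, and ppv_grid are aligned 1D arrays of candidate thresholds and their validation-set TPR/PPV values; check the docstring of compute_perf_metrics_across_thresholds for the actual names and return format.

```python
import numpy as np

# Keep only thresholds that satisfy the PPV constraint on the validation set.
# (If no threshold satisfies it, ok_ids will be empty and you'll need a fallback.)
ok_ids = np.flatnonzero(ppv_grid >= 0.98)

# Among those, take the maximum TPR; break ties in favor of higher PPV.
# np.lexsort sorts by the last key first (TPR), then by the earlier key (PPV).
order = np.lexsort((ppv_grid[ok_ids], tpr_grid[ok_ids]))
best_id = ok_ids[order[-1]]
best_thresh = thresholds_grid[best_id]
print('chosen threshold: %.4f (valid TPR=%.3f, PPV=%.3f)'
      % (best_thresh, tpr_grid[best_id], ppv_grid[best_id]))
```

Step 1H below follows the same pattern with the roles of TPR and PPV swapped.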
Implementation Step 1H
For the classifier from 1D above (LR for 3 features), compute performance metrics across all candidate thresholds on the validation set (use compute_perf_metrics_across_thresholds), and pick the threshold that maximizes PPV while satisfying TPR >= 0.98 on the validation set. If there's a tie for the maximum PPV, choose the threshold corresponding to a higher TPR.
Remember, you pick this threshold based on the validation set, then later you’ll evaluate it on the test set.
Short Answer 1c in Report
By carefully reading the confusion matrices, report for each of the 3 thresholding strategies in parts 1F - 1H how many subjects in the test set are saved from unnecessary biopsies that would be done in current practice.
Hint: You can assume that currently, the hospital would have done a biopsy on every patient in the test set. Your goal is to build a classifier that improves on this practice.
Short Answer 1d in Report
Among the 3 possible thresholding strategies, which strategy best meets the stated goals of stakeholders in this screening task: avoid life-threatening mistakes whenever possible, while also eliminating unnecessary biopsies? What fraction of current biopsies might be avoided if this strategy was adopted by the hospital?
Hint: You can also assume the test set is a reasonable representation of the true population of patients.