CS 229, Summer 2019
Problem Set #2
Due Monday, July 29 at 11:59 pm on Gradescope.
Notes: (1) These questions require thought, but do not require long answers. Please be as
concise as possible. (2) If you have a question about this homework, we encourage you to post
your question on our Piazza forum, at http://piazza.com/stanford/summer2019/cs229. (3)
If you missed the first lecture or are unfamiliar with the collaboration or honor code policy,
please read the policy on the course website before starting work. (4) For the coding problems,
you may not use any libraries except those defined in the provided environment.yml file. In
particular, ML-specific libraries such as scikit-learn are not permitted. (5) To account for late
days, the due date is Monday, July 29 at 11:59 pm. If you submit after Monday, July 29 at
11:59 pm, you will begin consuming your late days. If you wish to submit on time, submit before
Monday, July 29 at 11:59 pm.
All students must submit an electronic PDF version of the written questions. We highly
recommend typesetting your solutions via LaTeX. All students must also submit a zip file of
their source code to Gradescope, which should be created using the make_zip.py script. You
should make sure to (1) restrict yourself to only using libraries included in the environment.yml
file, and (2) make sure your code runs without errors. Your submission may be evaluated by the
auto-grader using a private test set, or used for verifying the outputs reported in the writeup.
1. [15 points] Logistic Regression: Training stability
In this problem, we will be delving deeper into the workings of logistic regression. The goal of
this problem is to help you develop your skills debugging machine learning algorithms (which
can be very different from debugging software in general).
We have provided an implementation of logistic regression in src/stability/stability.py,
and two labeled datasets A and B in src/stability/ds1_a.csv and src/stability/ds1_b.csv.
Please do not modify the code for the logistic regression training algorithm for this problem.
First, run the given logistic regression code to train two different models on A and B. You can
run the code by simply executing python stability.py in the src/stability directory.
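For intuition, here is a minimal sketch of the kind of full-batch gradient descent loop such a training script typically contains (the provided stability.py may differ in details such as the learning rate, the convergence criterion, and the label convention; this sketch assumes labels in {-1, 1}):

    import numpy as np

    def fit_logistic_regression(X, y, learning_rate=0.1, eps=1e-15):
        # Full-batch gradient descent on the average logistic loss,
        # assuming X has shape (m, n) and y takes values in {-1, 1}.
        m, n = X.shape
        theta = np.zeros(n)
        prev_theta = None
        i = 0
        while prev_theta is None or np.linalg.norm(theta - prev_theta) >= eps:
            i += 1
            prev_theta = theta.copy()
            # Gradient of (1/m) * sum_i log(1 + exp(-y_i * theta^T x_i))
            margins = y * X.dot(theta)
            grad = -(1.0 / m) * X.T.dot(y / (1 + np.exp(margins)))
            theta = theta - learning_rate * grad
            if i % 10000 == 0:
                print('Iteration {}: ||theta|| = {:.3f}'.format(i, np.linalg.norm(theta)))
        return theta

Watching the printed parameter norm as training proceeds is a cheap first debugging signal.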
(a) [2 points] What is the most notable difference in training the logistic regression model on
datasets A and B?
(b) [5 points] Investigate why the training procedure behaves unexpectedly on dataset B, but
not on A. Provide hard evidence (in the form of math, code, plots, etc.) to corroborate
your hypothesis for the misbehavior. Remember, you should address why your explanation
does not apply to A.
Hint: The issue is not a numerical rounding or over/underflow error.
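One way to gather such evidence, as a sketch (this assumes matplotlib is available in the provided environment and that each dataset loads into a two-column feature array X and a label vector y in {-1, 1}):

    import matplotlib.pyplot as plt

    def plot_dataset(X, y, save_path):
        # Scatter-plot a two-feature dataset, coloring the two classes,
        # to check visually how the classes are arranged.
        plt.figure()
        plt.scatter(X[y == 1, 0], X[y == 1, 1], marker='x', label='y = 1')
        plt.scatter(X[y == -1, 0], X[y == -1, 1], marker='o', label='y = -1')
        plt.xlabel('x1')
        plt.ylabel('x2')
        plt.legend()
        plt.savefig(save_path)

Pairing such plots with the parameter-norm trace from the training loop is one way to build the requested evidence.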
(c) [5 points] For each of these possible modifications, state whether or not it would lead to
the provided training algorithm converging on datasets such as B. Justify your answers.
i. Using a different constant learning rate.
ii. Decreasing the learning rate over time (e.g. scaling the initial learning rate by 1/t^2,
where t is the number of gradient descent iterations thus far); see the sketch after this list.
iii. Linear scaling of the input features.
iv. Adding a regularization term ‖θ‖₂² to the loss function.
v. Adding zero-mean Gaussian noise to the training data or labels.
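To make modification ii concrete, a 1/t^2 decay schedule can be written as follows (a sketch with an illustrative grad_fn placeholder, not code from the provided stability.py):

    def gd_with_decaying_rate(grad_fn, theta0, initial_lr=0.1, max_iters=100000):
        # Gradient descent where the step size at iteration t is initial_lr / t**2.
        # grad_fn maps theta to the gradient of the loss at theta.
        theta = theta0.copy()
        for t in range(1, max_iters + 1):
            theta = theta - (initial_lr / t ** 2) * grad_fn(theta)
        return theta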
(d) [3 points] Are support vector machines vulnerable to datasets like B? Why or why not?
Give an informal justification.
2. [22 points] Spam classification
In this problem, we will use the naive Bayes algorithm and an SVM to build a spam classifier.
In recent years, spam on electronic media has been a growing concern. Here, we'll build a
classifier to distinguish between real messages and spam messages. For this class, we will be
building a classifier to detect SMS spam messages. We will be using an SMS spam dataset
developed by Tiago A. Almeida and José María Gómez Hidalgo, which is publicly available at
http://www.dt.fee.unicamp.br/~tiago/smsspamcollection
We have split this dataset into training, validation, and testing sets and have included them in this
assignment as src/spam/spam_train.tsv, src/spam/spam_val.tsv, and src/spam/spam_test.tsv.
See src/spam/spam_readme.txt
for more details about this dataset. Please refrain from redistributing these dataset files. The
goal of this assignment is to build a classifier from scratch that can tell the difference between
spam and non-spam messages using the text of the SMS message.
(a) [5 points] Implement code for processing the spam messages into numpy arrays that can
be fed into machine learning models. Do this by completing the get_words, create_dictionary,
and transform_text functions within the provided src/spam/spam.py. Note the corresponding
comments for each function for instructions on what specific processing is required.
The provided code will then run your functions and save the resulting dictionary into
spam_dictionary and a sample of the resulting training matrix into
spam_sample_train_matrix.
In your writeup, report the vocabulary size after the pre-processing step. You do not need
to include any other output for this subquestion.
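As an informal sketch of the kind of preprocessing involved (the exact normalization and thresholding rules are specified in the function comments in spam.py, which take precedence; the min_count cutoff below is illustrative):

    import collections
    import numpy as np

    def get_words(message):
        # Split a message into lowercased words (simplified sketch).
        return message.lower().split()

    def create_dictionary(messages, min_count=5):
        # Map each word that appears in at least min_count messages
        # to a column index.
        counts = collections.Counter()
        for message in messages:
            counts.update(set(get_words(message)))
        words = sorted(w for w, c in counts.items() if c >= min_count)
        return {word: i for i, word in enumerate(words)}

    def transform_text(messages, word_dict):
        # Build a count matrix: entry (i, j) is the number of times
        # word j occurs in message i; words not in the dictionary are dropped.
        matrix = np.zeros((len(messages), len(word_dict)))
        for i, message in enumerate(messages):
            for word in get_words(message):
                j = word_dict.get(word)
                if j is not None:
                    matrix[i, j] += 1
        return matrix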
(b) [10 points] In this question you are going to implement a naive Bayes classifier for spam
classification with a multinomial event model and Laplace smoothing (refer to the class notes
on naive Bayes for details on Laplace smoothing, in Section 2.3 of notes2.pdf).
Code your implementation by completing the fit_naive_bayes_model and
predict_from_naive_bayes_model functions in src/spam/spam.py.
Now src/spam/spam.py should be able to train a naive Bayes model, compute your predic-
tion accuracy and then save your resulting predictions to spam_naive_bayes_predictions.
In your writeup, report the accuracy of the trained model on the test set.
Remark. If you implement naive Bayes the straightforward way, you will find that the
computed p(x|y) = ∏_j p(x_j|y) often equals zero. This is because p(x|y), which is the
product of many numbers less than one, is a very small number. The standard computer
representation of real numbers cannot handle numbers that are too small, and instead
rounds them off to zero. (This is called "underflow.") You'll have to find a way to compute
naive Bayes' predicted class labels without explicitly representing very small numbers such
as p(x|y). [Hint: Think about using logarithms.]
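To make the hint concrete, here is a minimal sketch of log-space prediction for the multinomial event model (array names and shapes here are illustrative, not the ones required by the starter code):

    import numpy as np

    def predict_log_space(matrix, phi_y, phi_spam, phi_ham):
        # matrix:   (m, V) word-count matrix for the test messages.
        # phi_y:    scalar estimate of p(y = 1).
        # phi_spam: (V,) Laplace-smoothed estimates of p(word j | y = 1).
        # phi_ham:  (V,) Laplace-smoothed estimates of p(word j | y = 0).
        #
        # log[p(x|y)p(y)] = sum_j x_j log p(word j | y) + log p(y); sums of
        # logs stay in a safe numeric range even when the product underflows.
        log_spam = matrix.dot(np.log(phi_spam)) + np.log(phi_y)
        log_ham = matrix.dot(np.log(phi_ham)) + np.log(1 - phi_y)
        return (log_spam > log_ham).astype(int)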
(c) [5 points] Intuitively, some tokens may be particularly indicative of an SMS being in a
particular class. We can try to get an informal sense of how indicative token i is for the
SPAM class by looking at:

    log( p(x_j = i | y = 1) / p(x_j = i | y = 0) )
Complete the get_top_five_naive_bayes_words function within the provided code using
the above formula in order to obtain the 5 most indicative tokens.
Report the top five words in your writeup.
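A sketch of that computation, assuming the smoothed conditionals phi_spam and phi_ham estimated in part (b) and an index_to_word list mapping column indices back to tokens (both names illustrative):

    import numpy as np

    def top_five_indicative_words(phi_spam, phi_ham, index_to_word):
        # Score each token by log p(token | spam) - log p(token | ham)
        # and return the five highest-scoring tokens.
        scores = np.log(phi_spam) - np.log(phi_ham)
        top_indices = np.argsort(scores)[-5:][::-1]
        return [index_to_word[i] for i in top_indices]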
(d) [2 points] Support vector machines (SVMs) are an alternative machine learning model that
we discussed in class. We have provided you an SVM implementation (using a radial basis
function (RBF) kernel) within src/spam/svm.py (you should not need to modify that
code).
One important part of training an SVM parameterized by an RBF kernel (a.k.a. Gaussian
kernel) is choosing an appropriate kernel radius parameter.
Complete the compute_best_svm_radius function by writing code to compute the SVM
radius that maximizes accuracy on the validation dataset. Report the best kernel radius
you obtained in the writeup.
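A sketch of that search (this assumes svm.py exposes a helper of the form train_and_predict_svm(train_matrix, train_labels, eval_matrix, radius) returning predicted labels; confirm the actual name and signature in the provided file before relying on it):

    import numpy as np
    import svm  # the provided src/spam/svm.py

    def compute_best_svm_radius(train_matrix, train_labels,
                                val_matrix, val_labels, radius_to_consider):
        # Train one SVM per candidate radius and keep the radius whose
        # predictions score the highest accuracy on the validation set.
        best_radius, best_accuracy = None, -1.0
        for radius in radius_to_consider:
            predictions = svm.train_and_predict_svm(
                train_matrix, train_labels, val_matrix, radius)
            accuracy = np.mean(predictions == val_labels)
            if accuracy > best_accuracy:
                best_radius, best_accuracy = radius, accuracy
        return best_radius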