1. Homepage
  2. Programming
  3. [2019] CS229: Machine Learning Problem Set #2: Logistic Regression: Training stability

[2019] CS229: Machine Learning Problem Set #2: Logistic Regression: Training stability

Engage in a Conversation
CS229Machine LearningLogistic RegressionTraining stabilityStanford

CS229 Problem Set #2 1 CourseNana.COM

CS 229, Summer 2019 Problem Set #2 CourseNana.COM

Due Monday, July 29 at 11:59 pm on Gradescope. CourseNana.COM

Notes: (1) These questions require thought, but do not require long answers. Please be as concise as possible. (2) If you have a question about this homework, we encourage you to post your question on our Piazza forum, at http://piazza.com/stanford/summer2019/cs229. (3) If you missed the first lecture or are unfamiliar with the collaboration or honor code policy, please read the policy on the course website before starting work. (4) For the coding problems, you may not use any libraries except those defined in the provided environment.yml file. In particular, ML-specific libraries such as scikit-learn are not permitted. (5) To account for late days, the due date is Monday, July 29 at 11:59 pm. If you submit after Monday, July 29 at 11:59 pm, you will begin consuming your late days. If you wish to submit on time, submit before Monday, July 29 at 11:59 pm. CourseNana.COM

All students must submit an electronic PDF version of the written questions. We highly recommend typesetting your solutions via LATEX. All students must also submit a zip file of their source code to Gradescope, which should be created using the make zip.py script. You should make sure to (1) restrict yourself to only using libraries included in the environment.yml file, and (2) make sure your code runs without errors. Your submission may be evaluated by the auto-grader using a private test set, or used for verifying the outputs reported in the writeup. CourseNana.COM

In this problem, we will be delving deeper into the workings of logistic regression. The goal of this problem is to help you develop your skills debugging machine learning algorithms (which can be very different from debugging software in general). CourseNana.COM

We have provided an implementation of logistic regression in src/stability/stability.py, and two labeled datasets A and B in src/stability/ds1 a.csv and src/stability/ds1 b.csv. CourseNana.COM

Please do not modify the code for the logistic regression training algorithm for this problem. First, run the given logistic regression code to train two different models on A and B. You can run the code by simply executing python stability.py in the src/stability directory. CourseNana.COM

  1. (a)  [2 points] What is the most notable difference in training the logistic regression model on datasets A and B?
  2. (b)  [5 points] Investigate why the training procedure behaves unexpectedly on dataset B, but not on A. Provide hard evidence (in the form of math, code, plots, etc.) to corroborate your hypothesis for the misbehavior. Remember, you should address why your explanation does not apply to A.

Hint: The issue is not a numerical rounding or over/underflow error. CourseNana.COM

  1. (c)  [5 points] For each of these possible modifications, state whether or not it would lead to

the provided training algorithm converging on datasets such as B. Justify your answers. CourseNana.COM

    1. Using a different constant learning rate.
    2. Decreasing the learning rate over time (e.g. scaling the initial learning rate by 1/t2, where t is the number of gradient descent iterations thus far).
    3. Linear scaling of the input features.
    4. Adding a regularization term θ2 to the loss function.
    5. Adding zero-mean Gaussian noise to the training data or labels.
  1. (d)  [3 points] Are support vector machines, vulnerable to datasets like B? Why or why not? Give an informal justification.

2. [22 points] Spam classification CourseNana.COM

In this problem, we will use the naive Bayes algorithm and an SVM to build a spam classifier. CourseNana.COM

In recent years, spam on electronic media has been a growing concern. Here, we’ll build a classifier to distinguish between real messages, and spam messages. For this class, we will be building a classifier to detect SMS spam messages. We will be using an SMS spam dataset developed by Tiago A. Almedia and Jos ́e Mar ́ıa G ́omez Hidalgo which is publicly available on http://www.dt.fee.unicamp.br/~tiago/smsspamcollection 1 CourseNana.COM

We have split this dataset into training and testing sets and have included them in this assignment as src/spam/spam train.tsv and src/spam/spam test.tsv. See src/spam/spam readme.txt for more details about this dataset. Please refrain from redistributing these dataset files. The goal of this assignment is to build a classifier from scratch that can tell the difference the spam and non-spam messages using the text of the SMS message. CourseNana.COM

  1. (a)  [5 points] Implement code for processing the the spam messages into numpy arrays that can
    be fed into machine learning models. Do this by completing the
    get words, create dictionary, and transform text functions within our provided src/spam.py. Do note the correspond-
    ing comments for each function for instructions on what specific processing is required.
    The provided code will then run your functions and save the resulting dictionary into
    spam dictionary and a sample of the resulting training matrix into
    spam sample train matrix.
    In your writeup, report the vocabular size after the pre-processing step. You do not need
    to include any other output for this subquestion.
  2. (b)  [10 points] In this question you are going to implement a naive Bayes classifier for spam classification with multinomial event model and Laplace smoothing (refer to class notes on Naive Bayes for details on Laplace smoothing in Section 2.3 of notes2.pdf).
    Code your implementation by completing the
    fit naive bayes model and

predict from naive bayes model functions in src/spam/spam.py.
Now
src/spam/spam.py should be able to train a Naive Bayes model, compute your predic- tion accuracy and then save your resulting predictions to spam naive bayes predictions. In your writeup, report the accuracy of the trained model on the test set.
Remark. If you implement naive Bayes the straightforward way, you will find that the computed p(x|y) = i p(xi|y) often equals zero. This is because p(x|y), which is the product of many numbers less than one, is a very small number. The standard computer representation of real numbers cannot handle numbers that are too small, and instead rounds them off to zero. (This is called “underflow.”) You’ll have to find a way to compute Naive Bayes’ predicted class labels without explicitly representing very small numbers such as p(x|y). [Hint: Think about using logarithms.] CourseNana.COM

  1. (c)  [5 points] Intuitively, some tokens may be particularly indicative of an SMS being in a particular class. We can try to get an informal sense of how indicative token i is for the SPAM class by looking at:

Complete the get top five naive bayes words function within the provided code using the above formula in order to obtain the 5 most indicative tokens.
Report the top five words in your writeup.
CourseNana.COM

(d) [2 points] Support vector machines (SVMs) are an alternative machine learning model that we discussed in class. We have provided you an SVM implementation (using a radial basis function (RBF) kernel) within src/spam/svm.py (You should not need to modify that code). CourseNana.COM

One important part of training an SVM parameterized by an RBF kernel (a.k.a Gaussian kernel) is choosing an appropriate kernel radius parameter.
Complete the
compute best svm radius by writing code to compute the best SVM radius CourseNana.COM

which maximizes accuracy on the validation dataset. Report the best kernel radius you obtained in the writeup. CourseNana.COM

In class, we saw that by choosing a kernel K(x, z) = φ(x)T φ(z), we can implicitly map data to a high dimensional space, and have a learning algorithm (e.g SVM or logistic regression) work in that space. One way to generate kernels is to explicitly define the mapping φ to a higher dimensional space, and then work out the corresponding K. CourseNana.COM

However in this question we are interested in direct construction of kernels. I.e., suppose we have a function K(x,z) that we think gives an appropriate similarity measure for our learning problem, and we are considering plugging K into the SVM as the kernel function. However for K(x,z) to be a valid kernel, it must correspond to an inner product in some higher dimensional space resulting from some feature mapping φ. Mercer’s theorem tells us that K(x, z) is a (Mercer) kernel if and only if for any finite set {x(1) , . . . , x(n) }, the square matrix K Rn×n whose entries are given by Kij = K(x(i),x(j)) is symmetric and positive semidefinite. You can find more details about Mercer’s theorem in the notes, though the description above is sufficient for this problem. CourseNana.COM

Now here comes the question: Let K1, K2 be kernels over Rd × Rd, let a R+ be a positive real number, let f : Rd R be a real-valued function, let φ : Rd Rp be a function mapping from Rd to Rp, let K3 be a kernel over Rp × Rp, and let p(x) a polynomial over x with positive coefficients. CourseNana.COM

For each of the functions K below, state whether it is necessarily a kernel. If you think it is, prove it; if you think it isn’t, give a counter-example. CourseNana.COM

(a) [1 points] K(x, z) = K1(x, z) + K2(x, z) (b) [1 points] K(x, z) = K1(x, z) K2(x, z) CourseNana.COM

(c) [1 points] K(x, z) = aK1(x, z) (d) [1 points] K(x, z) = aK1(x, z) CourseNana.COM

(e) [5 points] K(x, z) = K1(x, z)K2(x, z) (f) [3 points] K(x,z) = f(x)f(z) CourseNana.COM

(g) [3 points] K(x, z) = K3(φ(x), φ(z)) (h) [3 points] K(x, z) = p(K1(x, z)) CourseNana.COM

Let there be a binary classification problem with y ∈ {0,1}. The perceptron uses hypotheses of the form hθ(x) = g(θT x), where g(z) = sign(z) = 1 if z 0, 0 otherwise. In this problem we will consider a stochastic gradient descent-like implementation of the perceptron algorithm where each update to the parameters θ is made using only one training example. However, unlike stochastic gradient descent, the perceptron algorithm will only make one pass through the entire training set. The update rule for this version of the perceptron algorithm is given by CourseNana.COM

θ(i+1) := θ(i) + α(y(i+1) hθ(i) (x(i+1)))x(i+1)
where θ(i) is the value of the parameters after the algorithm has seen the first i training examples. CourseNana.COM

Prior to seeing any training examples, θ(0) is initialized to 0. CourseNana.COM

(a) [3 points] Let K be a Mercer kernel corresponding to some very high-dimensional feature mapping φ. Suppose φ is so high-dimensional (say, -dimensional) that it’s infeasible to ever represent φ(x) explicitly. Describe how you would apply the “kernel trick” to the perceptron to make it work in the high-dimensional feature space φ, but without ever explicitly computing φ(x). CourseNana.COM

[Note: You don’t have to worry about the intercept term. If you like, think of φ as having the property that φ0(x) = 1 so that this is taken care of.] Your description should specify: CourseNana.COM

i. [1 points] How you will (implicitly) represent the high-dimensional parameter vector θ(i), including how the initial value θ(0) = 0 is represented (note that θ(i) is now a vector whose dimension is the same as the feature vectors φ(x)); CourseNana.COM

ii. [1 points] How you will efficiently make a prediction on a new input x CourseNana.COM

. I.e., how you will compute hθ(i) (x(i+1)) = g(θ(i)T φ(x(i+1))), using your representation of θ(i); CourseNana.COM

and
iii. [1 points] How you will modify the update rule given above to perform an update to
θ on a new training example (x(i+1),y(i+1)); i.e., using the update rule corresponding to the feature mapping φ: CourseNana.COM

θ(i+1) := θ(i) + α(y(i+1) hθ(i) (x(i+1)))φ(x(i+1)) CourseNana.COM

  1. (b)  [10 points] Implement your approach by completing the initial state, predict, and update state methods of src/perceptron/perceptron.py.
    We provide two kernels, a dot-product kernel and a radial basis function (RBF) kernel. Run
    src/perceptron/perceptron.py to train kernelized perceptrons on src/perceptron/train.csv. The code will then test the perceptron on src/perceptron/test.csv and save the resulting predictions in the src/perceptron/ folder. Plots will also be saved in src/perceptron/. Include the two plots (corresponding to each of the kernels) in your writeup, and indicate which plot belongs to which kernel.
  2. (c)  [2 points]

One of the provided kernels performs extremely poorly in classifying the points. Which kernel performs badly and why does it fail? CourseNana.COM

Get in Touch with Our Experts

WeChat WeChat
Whatsapp WhatsApp
CS229代写,Machine Learning代写,Logistic Regression代写,Training stability代写,Stanford代写,CS229代编,Machine Learning代编,Logistic Regression代编,Training stability代编,Stanford代编,CS229代考,Machine Learning代考,Logistic Regression代考,Training stability代考,Stanford代考,CS229help,Machine Learninghelp,Logistic Regressionhelp,Training stabilityhelp,Stanfordhelp,CS229作业代写,Machine Learning作业代写,Logistic Regression作业代写,Training stability作业代写,Stanford作业代写,CS229编程代写,Machine Learning编程代写,Logistic Regression编程代写,Training stability编程代写,Stanford编程代写,CS229programming help,Machine Learningprogramming help,Logistic Regressionprogramming help,Training stabilityprogramming help,Stanfordprogramming help,CS229assignment help,Machine Learningassignment help,Logistic Regressionassignment help,Training stabilityassignment help,Stanfordassignment help,CS229solution,Machine Learningsolution,Logistic Regressionsolution,Training stabilitysolution,Stanfordsolution,