Instruction
CMPSC 448: Machine Learning and AI
Homework 3 (Due Sunday, February 26th, 11:59 PM)
This HW only includes theory problems. Please note that you need to submit your solutions in a single PDF file via Gradescope.
Perceptron
Problem 1. (10 points) As we discussed in the lecture, the Perceptron algorithm will only converge if the data is linearly separable, i.e., there exists a linear classifier that can perfectly classify the training data.
If the training data S = {(x1,y1),(x2,y2),...,(xn,yn)} where xi ∈ Rd and yi ∈ {−1,+1} is not linearly separable, then there is a simple trick to force the data to be linearly sep- arable and then apply the Perceptron algorithm as follows. If you have n data points in d dimensions, map data point xi to the (d + n)-dimensional point [xi, ei]⊤ ∈ Rd+n, where ei ∈ {0, 1}n is a n-dimensional vector of zeros, except for the ith position, which is 1 (e.g., e4 = [0,0,0,1,0,...,0]⊤).
Show that if you apply this mapping, the data becomes linearly separable (you may wish to do so by providing a weight vector w in (d + n)-dimensional space that successfully separates the data).
Logistic Regression
Problem 2. (15 points) Recall that in lectures we showed that the Logistic Regression for binary classification boils down to solving the following optimization problem (training error) over n training samples:
n
f(w) = log 1 + e−yiw⊤xi
i=1
a) Compute the gradient of f(w).
b) Please write the pseudocode for using GD to optimize the f(w).
c) Argue that if data is linearly separable (i.e., can be perfectly classified by a linear model), in the solution obtained by GD, some of the coefficients converge to infinity.
Problem 3. (20 points) In this problem we aim at generalizing the Logistic Regression algorithm to solve multi-class classification problem, the setting where the label space in- cludes three or more classes, i.e., Y = {1,2,...,K} for K ≥ 3. To do so, consider a training data S = {(x1, y1), (x2, y2), . . . , (xn, yn)} where the feature vectors are xi ∈ Rd and
1
yi ∈ Y,i = 1,2,...,n. Here we assume that we have K different classes and each input xi is a d dimensional vector. We wish to fit a linear model in a similar spirit to logistic regression, and we will use the softmax function to link the linear inputs to the categorical output, instead of the logistic function.
One way to generalize binary model to multiclass case is to have K sets of parameters wk for k = 1,2,...,K, and compute the probability distribution of the output as follows (for simplicity we ignore the intercept parameters bk),
exp(w⊤k x)
P[y = k | x;w1,...,wK] = ?Ki=1 exp(w⊤i x) for k = 1,...,K
Note that we indeed have a probability distribution, as Kk=1 P [y = k | x, w1, . . . , wK ] = 1. To make the model identifiable, we will fix wK to 0, which means we have K − 1 sets of parameters {w1, . . . , wK−1} , where each one is a d-dimensional vector, to learn. To see this, note that parameter vector for class K can be inferred from rest as the sum of probabilitites for all labels sums up to one (similar to binary classification where we only had a single parameter vector). To make a prediction, we can extend the binary prediction rule to multiclass setting by letting the prediction function to be:
f(multiclass )(x) = argmax P[y = k|x] k∈{1,2,...,K}
As in logistic regression, we will assume that each yn is i.i.d., which leads to the following likelihood function,
P[y1,y2,...,yn | x1,x2,...,xn;w1,...,wK] = ?P[yi | xi,w1,...,wK]
i=1
a) Please write down explicitly the log likelihood function and simplify it as much as you can to derive the training loss needs to be optimized.
b) Compute the gradient of likelihood with respect to each wk and simply it
c) Derive the stochastic gradient descent (SGD) update for multiclass logistic regression.
Decision Trees
Problem 4. (10 points) In this problem, you will investigate building a decision tree for a binary classification problem. The training data is given in Table 1 with 16 instances that will be used to learn a decision tree for predicting whether a mushroom is edible or not based on its attributes (Color, Size and Shape). Please note the label set is a binary set {Yes, No}.
Which attribute would the algorithm choose to use for the root of the tree. Show the details of your calculations.
Recall from lectures that if we let S denote the data set at current node, A denote the feature with values v ∈ V, H denote the entropy function, and Sv denote the subset of S for which the feature A has the value v, the gain of a split along the feature A, denoted InfoGain(S, A) is computed as:
Instance Color Size Shape Edible? D1 Yellow Small Round Yes
D2 | Yellow | Small | Round | No |
D3 | Green | Small | Irregular | Yes |
D4 | Green | Large | Irregular | No |
D5 | Yellow | Large | Round | Yes |
D6 | Yellow | Small | Round | Yes |
D7 | Yellow | Small | Round | Yes |
D8 | Yellow | Small | Round | Yes |
D9 | Green | Small | Round | No |
D10 | Yellow | Large | Round | No |
D11 | Yellow | Large | Round | Yes |
D12 | Yellow | Large | Round | No |
D13 | Yellow | Large | Round | No |
D14 | Yellow | Large | Round | No |
D15 | Yellow | Small | Irregular | Yes |
D16 | Yellow | Large | Irregular | Yes |
Table 1: Mushroom data with 16 instances, three categorical features, and binary labels.
|Sv| InfoGain(S, A) = H(S) − H(Sv).
That is, we are taking the difference of the entropy before the split, and subtracting off the entropies of each new node after splitting, with an appropriate weight depending on the number of training examples at each node.
Problem 5. (5 points) Handling real valued (numerical) features is totally different from categorical features in splitting nodes. This problem intends to discuss a simple way to decide good thresholds for splitting based on numerical features. Specifically, when there is a numerical feature in data, an option would be treating all numeric values of feature as discrete, i.e., proceeding exactly as we do with categorical data. What problems may arise when we use a tree derived this way to classify an unseen example?
Hard Support Vector Machines
Problem 6. (10 points) Consider a data set with three data points in R2 and their corre-
sponding labels:
0 0 −1 X=0 −1, y=−1
−20 +1
Note that ith row of X corresponds to ith training sample with two features and its label is yi.
Manually solve the following optimization problem for hard-margin SVM stated as
v∈V |S| min 1 ∥w∥2 2 (1)
w,b s.t.
yi(w⊤xi +b)≥1, i=1,2,...,n to get the optimal hyperplane (w∗,b∗) and its margin.
Hint: first write down the constraints (correct classification conditions) for three training examples to get three equations in three parameters (w = [w1,w2]⊤ and b). Then derive the possible values for unknown parameters b, w1, w2 and see what choice of values gives the minimum norm for w.
Soft Support Vector Machines
Problem 7 (15 points) In this problem we would like to compare the solutions of hard and soft SVMs on a linearly seperable dataset.
Let n > 1 be a fixed number. Is it true that there exists C > 0 such that for every sample S of n training examples with binary label, which are linearly separable, the hard-SVM and the soft-SVM (with parameter C) solutions will return exactly the same weight vector. Justify your answer. (Hint: consider n = 2, d = 1 and S = {(x1, y1) , (x2, y2)}. Let a > 0 and consider x1 = a, y1 = 1, x2 = −a, y2 = −1. Derive the optimal solution for hard and soft SVM and compare the results.)