Assignment 1 of DDA4210
- The deadline is 23:59, October 31, 2023.
- The weight of this assignment in the final grade is 10%.
- Electronic submission: turn in your solutions electronically via Blackboard. Be sure to submit your homework as a single file. Please name your solution file as A1 studentID name.
- Note that late submissions will result in discounted scores: 0-24 hours → 80%, 24-72 hours → 50%, 72 hours or more → 0%.
- Answer the questions in English; otherwise, you will lose half of the points.
- Collaboration policy: you need to solve all questions independently, and collaboration between students is not allowed.
1. Derive the bias-variance decomposition for the squared error loss function. That is, show that for a model f_S trained on a dataset S to predict a target y(x) for each x,

E_S[E_out(f_S)] = E_x[Bias(x) + Var(x)],

given the following definitions:

F(x) = E_S[f_S(x)],
E_out(f_S) = E_x[(f_S(x) − y(x))^2],
Bias(x) = (F(x) − y(x))^2,
Var(x) = E_S[(f_S(x) − F(x))^2].

(10 points)
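The identity above can also be checked numerically. Below is a small, purely illustrative NumPy simulation (not part of the required derivation); the noiseless sine target, the degree-1 polynomial model class, and all sample sizes are made-up choices. Both sides of the claimed equality are estimated by averaging over sampled datasets S and a grid of test points x, and they should coincide.

```python
import numpy as np

# Illustrative setup (not part of the assignment): noiseless target y(x) = sin(pi * x),
# model class = degree-1 polynomials fit by least squares on datasets S of size 5.
rng = np.random.default_rng(0)
n_datasets, n_train, n_test = 2000, 5, 200

def y(x):
    return np.sin(np.pi * x)

x_test = np.linspace(-1, 1, n_test)

# Train one model per dataset S and record its predictions f_S(x) on the test grid.
preds = np.empty((n_datasets, n_test))
for s in range(n_datasets):
    x_tr = rng.uniform(-1, 1, n_train)
    coef = np.polyfit(x_tr, y(x_tr), deg=1)     # fit f_S on this dataset
    preds[s] = np.polyval(coef, x_test)         # f_S(x) on the test grid

F = preds.mean(axis=0)                          # F(x) = E_S[f_S(x)]
bias = (F - y(x_test)) ** 2                     # Bias(x)
var = preds.var(axis=0)                         # Var(x)

lhs = ((preds - y(x_test)) ** 2).mean()         # E_S[E_out(f_S)], with E_x as a grid average
rhs = (bias + var).mean()                       # E_x[Bias(x) + Var(x)]
print(lhs, rhs)                                 # the two values agree
```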
2. Find the closed-form solutions of the following optimization problems (W ∈ R^{K×D}, N ≫ D > K):

(a) minimize_{W,b} Σ_{i=1}^N ‖y_i − W x_i − b‖^2   (4 points)
(b) minimize_{W,b} (1/2) Σ_{i=1}^N ‖y_i − W x_i − b‖^2 + (λ/2) ‖W‖_F^2   (6 points)
3. Consider the following problem

minimize_W  (1/2) ‖WΦ − Y‖_F^2 + (λ/2) ‖W‖_F^2,

where ‖·‖_F denotes the Frobenius norm; Y ∈ R^{K×N}, Φ = [φ(x_1), φ(x_2), ..., φ(x_N)], x_i ∈ R^D, i = 1, 2, ..., N, and φ is the feature map induced by a kernel function k(·,·). Prove that for any x ∈ R^D, we can make the prediction as

y = Wφ(x) = Y (K + λI)^{−1} k(x),

where K = Φ^⊤Φ and k(x) = [k(x_1, x), k(x_2, x), ..., k(x_N, x)]^⊤.   (15 points)
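As a sanity check (not a proof), the claimed identity can be verified numerically with a kernel whose feature map is explicit, e.g. the linear kernel k(x, x′) = x^⊤x′ with φ(x) = x. The sketch below solves the primal problem numerically (via an augmented least-squares system, without using the kernel form) and compares Wφ(x) against Y(K + λI)^{−1}k(x); all dimensions and the value of λ are made up.

```python
import numpy as np

# Illustrative check of the claimed identity using the linear kernel k(x, x') = x.T @ x',
# for which the feature map is simply phi(x) = x. All sizes below are made up.
rng = np.random.default_rng(1)
D, K, N, lam = 4, 3, 30, 0.7

X = rng.normal(size=(D, N))          # Phi = [phi(x_1), ..., phi(x_N)], here phi(x) = x
Y = rng.normal(size=(K, N))
x_new = rng.normal(size=D)

# Primal: solve min_W (1/2)||W Phi - Y||_F^2 + (lam/2)||W||_F^2 numerically,
# row by row, via the augmented least-squares system [Phi^T; sqrt(lam) I] w = [y_k; 0].
A = np.vstack([X.T, np.sqrt(lam) * np.eye(D)])
W = np.vstack([
    np.linalg.lstsq(A, np.concatenate([Y[k], np.zeros(D)]), rcond=None)[0]
    for k in range(K)
])
primal_pred = W @ x_new              # W phi(x)

# Dual/kernel form from the problem statement: Y (K + lam I)^{-1} k(x).
Kmat = X.T @ X                       # K = Phi^T Phi  (N x N Gram matrix)
k_x = X.T @ x_new                    # k(x) = [k(x_i, x)]_i
kernel_pred = Y @ np.linalg.solve(Kmat + lam * np.eye(N), k_x)

print(np.allclose(primal_pred, kernel_pred))   # True (up to numerical error)
```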
4. Compute the space and time complexities (in the form of big O; consider only the training stage) of the following algorithms: (9 points)

(1) Ridge regression (Question 2(b)) with the closed-form solution
(2) PCA (N data points of dimension D, choosing d principal components)
(3) A neural network with architecture D − H1 − H2 − K on a mini-batch of size B (consider only the forward pass and neglect the computational costs of activation functions)

Hint: the time complexity of multiplying A ∈ R^{m×n} by B ∈ R^{n×l} is O(mnl); the time complexities of the eigenvalue decomposition and the inverse of an n × n matrix are both O(n^3). (A shape-only sketch of the forward pass in (3) is given below.)
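A purely illustrative, shape-only sketch of the forward pass in (3); all layer sizes below are made up. Applying the hint's O(mnl) rule to each matrix product gives the time complexity.

```python
import numpy as np

# Illustrative forward pass for a D - H1 - H2 - K network on a mini-batch of size B.
# Only the matrix products matter for the complexity count; biases and activations are ignored.
B, D, H1, H2, K = 32, 100, 64, 48, 10
rng = np.random.default_rng(2)

X  = rng.normal(size=(B, D))     # mini-batch of inputs
W1 = rng.normal(size=(D, H1))    # first-layer weights
W2 = rng.normal(size=(H1, H2))   # second-layer weights
W3 = rng.normal(size=(H2, K))    # output-layer weights

Z1 = X @ W1      # (B, D) times (D, H1)   -> shape (B, H1)
Z2 = Z1 @ W2     # (B, H1) times (H1, H2) -> shape (B, H2)
Z3 = Z2 @ W3     # (B, H2) times (H2, K)  -> shape (B, K)
print(Z3.shape)
```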
5. Prove the convergence of the generic gradient boosting algorithm (AnyBoost). Specifically, suppose that in the AnyBoost algorithm (page 14 of Lecture 02) the gradient of the objective function ℒ is Lipschitz continuous, i.e., there exists L > 0 such that

‖∇ℒ(H) − ∇ℒ(H′)‖ ≤ L‖H − H′‖

holds for any H and H′. Suppose that in the algorithm, α is computed as

α_{t+1} = −⟨∇ℒ(H_t), h_{t+1}⟩ / (L‖h_{t+1}‖^2).

The ensemble model is then updated as H_{t+1} = H_t + α_{t+1} h_{t+1}. Prove that the algorithm either terminates at some round T with ⟨∇ℒ(H_T), h_{T+1}⟩ ≥ 0, or ℒ(H_t) converges to a finite value, in which case

lim_{t→∞} ⟨∇ℒ(H_t), h_{t+1}⟩ = 0.

Hint: use the following result: suppose ℒ : H → R and ‖∇ℒ(F) − ∇ℒ(G)‖ ≤ L‖F − G‖ holds for any F and G in H; then

ℒ(F + wG) − ℒ(F) ≤ w⟨∇ℒ(F), G⟩ + (Lw^2/2)‖G‖^2

holds for any w > 0.   (14 points)
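For concreteness, here is a minimal sketch of a generic AnyBoost-style loop with the step size α_{t+1} = −⟨∇ℒ(H_t), h_{t+1}⟩ / (L‖h_{t+1}‖^2) from the problem statement. It follows the standard AnyBoost description rather than the exact slide from Lecture 02; the quadratic objective, the finite pool of base hypotheses, and L = 1 are all made-up choices for illustration.

```python
import numpy as np

# Minimal AnyBoost-style loop in function space: H and the base hypotheses h are
# represented by their values on n training points. The quadratic objective, the
# pool of base hypotheses, and L = 1 are illustrative, not from the lecture.
rng = np.random.default_rng(3)
n, T, L = 50, 200, 1.0

y = rng.normal(size=n)                                        # targets
pool = [rng.choice([-1.0, 1.0], size=n) for _ in range(20)]   # candidate base hypotheses h

def grad(H):
    # Gradient of the objective L(H) = (1/2)||H - y||^2; its gradient is 1-Lipschitz.
    return H - y

H = np.zeros(n)
for t in range(T):
    g = grad(H)
    h = min(pool, key=lambda v: np.dot(g, v))        # weak learner: most negative <grad, h>
    if np.dot(g, h) >= 0:                            # no descent direction left -> terminate
        break
    alpha = -np.dot(g, h) / (L * np.dot(h, h))       # step size from the problem statement
    H = H + alpha * h                                # H_{t+1} = H_t + alpha_{t+1} h_{t+1}

g = grad(H)
best = min(pool, key=lambda v: np.dot(g, v))
print(np.dot(g, best))   # close to 0 (or nonnegative) at termination
```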
6. True or False? If False, explain briefly. (10 points)

(1) The inequality G(F, n) ≤ n^2 holds for any model class F.
(2) The VC dimension of an axis-aligned rectangle in a 2D space is 4.
(3) The VC dimension of a circle in a 2D space is 4.
(4) The VC dimension of the 1-nearest-neighbor classifier in d-dimensional space is d + 1.
(5) Let d be the VC dimension of F. Then the inequality G(F, n) ≤ (en/d)^d always holds.
7. In LASSO, the model class is defined as F = {x ↦ ⟨w, x⟩ : ‖w‖_1 ≤ α}. Suppose x ∈ R^d, y ∈ {−1, +1}, the training data are S = {(x_i, y_i)}_{i=1}^n, and max_{1≤i≤n} ‖x_i‖_∞ ≤ β, where ‖·‖_∞ denotes the largest absolute element of a vector.

(1) Find an upper bound on the empirical Rademacher complexity

R_S(F) = E_σ[ sup_{f∈F} (1/n) Σ_{i=1}^n σ_i f(x_i) ],

where the σ_i are Rademacher variables.   (16 points)

(2) Suppose the loss function is the absolute loss. Use the inequality (highlighted in blue) on page 30 and Lemma 5 on page 35 (i.e., R(ℓ ∘ F) ≤ ηR(F)) of Lecture 03 to derive a generalization error bound for LASSO.   (6 points)

Hint: For question (1), please use the inequality ⟨a, b⟩ ≤ ‖a‖_1 ‖b‖_∞ and the following lemma (a numerical sanity check for (1) is sketched after the lemma):

Lemma 1. Let A ⊆ R^n be a finite set of points with r = max_{x∈A} ‖x‖_2 and denote x = (x_1, x_2, ..., x_n). Then

E_σ[ max_{x∈A} Σ_{i=1}^n x_i σ_i ] ≤ r √(2 log |A|),

where |A| denotes the cardinality of the set A and the σ_i are Rademacher variables.
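Not needed for the derivation, but a Monte Carlo estimate of R_S(F) can serve as a sanity check on the bound obtained in (1). The sketch below uses the dual-norm identity: for fixed σ, the supremum over ‖w‖_1 ≤ α of ⟨w, Σ_i σ_i x_i⟩ equals α‖Σ_i σ_i x_i‖_∞. The dimensions, α, and the data are made up.

```python
import numpy as np

# Monte Carlo estimate of the empirical Rademacher complexity of
# F = { x -> <w, x> : ||w||_1 <= alpha } on a fixed sample S.
# For fixed sigma, sup_{||w||_1 <= alpha} <w, sum_i sigma_i x_i> = alpha * ||sum_i sigma_i x_i||_inf.
rng = np.random.default_rng(4)
n, d, alpha = 100, 20, 2.0
X = rng.uniform(-1, 1, size=(n, d))                  # rows x_i, so ||x_i||_inf <= beta = 1

n_mc = 5000
sigma = rng.choice([-1.0, 1.0], size=(n_mc, n))      # Rademacher draws
sup_vals = alpha * np.abs(sigma @ X).max(axis=1)     # alpha * ||sum_i sigma_i x_i||_inf per draw
rad_hat = sup_vals.mean() / n                        # estimate of R_S(F)
print(rad_hat)                                       # compare with the bound from (1)
```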
8. Answer the following questions about recommendation systems.

(1) What are the differences between collaborative filtering and content-based methods?   (4 points)
(2) Suppose we have m items and n users. Let Ω be the set of indices of observed ratings given by the users on the items. Provide the objective function of content-based recommendation for users, where the model class is a neural network with two hidden layers. You need to define the notations clearly. In addition, explain how to make recommendations for a new user using your model.