Assignment 1 of DDA4210
- The deadline is 23:59, October 31, 2023.
- The weight of this assignment in the final grade is 10%.
- Electronic submission: turn in your solutions electronically via Blackboard. Be sure to submit your homework as a single file. Please name your solution file as A1 studentID name.
- Note that late submissions will result in discounted scores: 0-24 hours → 80%, 24-72 hours → 50%, 72 hours or more → 0%.
- Answer the questions in English; otherwise, you will lose half of the points.
- Collaboration policy: you need to solve all questions independently, and collaboration between students is not allowed.
1. Derive the bias-variance decomposition for the squared error loss function. That is, show that for a model f_S trained on a dataset S to predict a target y(x) for each x,

E_S[E_out(f_S)] = E_x[Bias(x) + Var(x)],

given the following definitions:

F(x) = E_S[f_S(x)],
E_out(f_S) = E_x[(f_S(x) − y(x))^2],
Bias(x) = (F(x) − y(x))^2,
Var(x) = E_S[(f_S(x) − F(x))^2].

(10 points)
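The identity above can also be checked numerically. Below is a small, purely illustrative NumPy simulation (not part of the required derivation); the noiseless sine target, the degree-1 polynomial model class, and all sample sizes are made-up choices. Both sides of the claimed equality are estimated by averaging over sampled datasets S and a grid of test points x, and they should coincide.

```python
import numpy as np

# Illustrative setup (not part of the assignment): noiseless target y(x) = sin(pi * x),
# model class = degree-1 polynomials fit by least squares on datasets S of size 5.
rng = np.random.default_rng(0)
n_datasets, n_train, n_test = 2000, 5, 200

def y(x):
    return np.sin(np.pi * x)

x_test = np.linspace(-1, 1, n_test)

# Train one model per dataset S and record its predictions f_S(x) on the test grid.
preds = np.empty((n_datasets, n_test))
for s in range(n_datasets):
    x_tr = rng.uniform(-1, 1, n_train)
    coef = np.polyfit(x_tr, y(x_tr), deg=1)     # fit f_S on this dataset
    preds[s] = np.polyval(coef, x_test)         # f_S(x) on the test grid

F = preds.mean(axis=0)                          # F(x) = E_S[f_S(x)]
bias = (F - y(x_test)) ** 2                     # Bias(x)
var = preds.var(axis=0)                         # Var(x)

lhs = ((preds - y(x_test)) ** 2).mean()         # E_S[E_out(f_S)], with E_x as a grid average
rhs = (bias + var).mean()                       # E_x[Bias(x) + Var(x)]
print(lhs, rhs)                                 # the two values agree
```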
2. Find the closed-form solutions of the following optimization problems (W ∈ R^{K×D}, N ≫ D > K):

(a) minimize_{W,b} Σ_{i=1}^N ‖y_i − W x_i − b‖^2   (4 points)
(b) minimize_{W,b} (1/2) Σ_{i=1}^N ‖y_i − W x_i − b‖^2 + (λ/2) ‖W‖_F^2   (6 points)
3. Consider the following problem

minimize_W  (1/2) ‖WΦ − Y‖_F^2 + (λ/2) ‖W‖_F^2,

where ‖·‖_F denotes the Frobenius norm; Y ∈ R^{K×N}, Φ = [φ(x_1), φ(x_2), ..., φ(x_N)], x_i ∈ R^D, i = 1, 2, ..., N, and φ is the feature map induced by a kernel function k(·,·). Prove that for any x ∈ R^D, we can make the prediction as

y = Wφ(x) = Y (K + λI)^{−1} k(x),

where K = Φ^⊤Φ and k(x) = [k(x_1, x), k(x_2, x), ..., k(x_N, x)]^⊤.   (15 points)
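As a sanity check (not a proof), the claimed identity can be verified numerically with a kernel whose feature map is explicit, e.g. the linear kernel k(x, x′) = x^⊤x′ with φ(x) = x. The sketch below solves the primal problem numerically (via an augmented least-squares system, without using the kernel form) and compares Wφ(x) against Y(K + λI)^{−1}k(x); all dimensions and the value of λ are made up.

```python
import numpy as np

# Illustrative check of the claimed identity using the linear kernel k(x, x') = x.T @ x',
# for which the feature map is simply phi(x) = x. All sizes below are made up.
rng = np.random.default_rng(1)
D, K, N, lam = 4, 3, 30, 0.7

X = rng.normal(size=(D, N))          # Phi = [phi(x_1), ..., phi(x_N)], here phi(x) = x
Y = rng.normal(size=(K, N))
x_new = rng.normal(size=D)

# Primal: solve min_W (1/2)||W Phi - Y||_F^2 + (lam/2)||W||_F^2 numerically,
# row by row, via the augmented least-squares system [Phi^T; sqrt(lam) I] w = [y_k; 0].
A = np.vstack([X.T, np.sqrt(lam) * np.eye(D)])
W = np.vstack([
    np.linalg.lstsq(A, np.concatenate([Y[k], np.zeros(D)]), rcond=None)[0]
    for k in range(K)
])
primal_pred = W @ x_new              # W phi(x)

# Dual/kernel form from the problem statement: Y (K + lam I)^{-1} k(x).
Kmat = X.T @ X                       # K = Phi^T Phi  (N x N Gram matrix)
k_x = X.T @ x_new                    # k(x) = [k(x_i, x)]_i
kernel_pred = Y @ np.linalg.solve(Kmat + lam * np.eye(N), k_x)

print(np.allclose(primal_pred, kernel_pred))   # True (up to numerical error)
```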
4. Compute the space and time complexities (in the form of big O; consider only the training stage) of the following algorithms: (9 points)

(1) Ridge regression (Question 2(b)) with the closed-form solution
(2) PCA (N data points of dimension D, choosing d principal components)
(3) A neural network with architecture D − H1 − H2 − K on a mini-batch of size B (consider only the forward pass and neglect the computational costs of activation functions)

Hint: the time complexity of multiplying A ∈ R^{m×n} by B ∈ R^{n×l} is O(mnl); the time complexities of the eigenvalue decomposition and the inverse of an n × n matrix are both O(n^3). (A shape-only sketch of the forward pass in (3) is given below.)
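A purely illustrative, shape-only sketch of the forward pass in (3); all layer sizes below are made up. Applying the hint's O(mnl) rule to each matrix product gives the time complexity.

```python
import numpy as np

# Illustrative forward pass for a D - H1 - H2 - K network on a mini-batch of size B.
# Only the matrix products matter for the complexity count; biases and activations are ignored.
B, D, H1, H2, K = 32, 100, 64, 48, 10
rng = np.random.default_rng(2)

X  = rng.normal(size=(B, D))     # mini-batch of inputs
W1 = rng.normal(size=(D, H1))    # first-layer weights
W2 = rng.normal(size=(H1, H2))   # second-layer weights
W3 = rng.normal(size=(H2, K))    # output-layer weights

Z1 = X @ W1      # (B, D) times (D, H1)   -> shape (B, H1)
Z2 = Z1 @ W2     # (B, H1) times (H1, H2) -> shape (B, H2)
Z3 = Z2 @ W3     # (B, H2) times (H2, K)  -> shape (B, K)
print(Z3.shape)
```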
5. Prove the convergence of the generic gradient boosting algorithm (AnyBoost). Specifically, suppose that in the AnyBoost algorithm (page 14 of Lecture 02) the gradient of the objective function ℒ is Lipschitz continuous, i.e., there exists L > 0 such that

‖∇ℒ(H) − ∇ℒ(H′)‖ ≤ L‖H − H′‖

holds for any H and H′. Suppose that in the algorithm, α is computed as

α_{t+1} = −⟨∇ℒ(H_t), h_{t+1}⟩ / (L‖h_{t+1}‖^2).

The ensemble model is then updated as H_{t+1} = H_t + α_{t+1} h_{t+1}. Prove that the algorithm either terminates at some round T with ⟨∇ℒ(H_T), h_{T+1}⟩ ≥ 0, or ℒ(H_t) converges to a finite value, in which case

lim_{t→∞} ⟨∇ℒ(H_t), h_{t+1}⟩ = 0.

Hint: use the following result: suppose ℒ : H → R and ‖∇ℒ(F) − ∇ℒ(G)‖ ≤ L‖F − G‖ holds for any F and G in H; then

ℒ(F + wG) − ℒ(F) ≤ w⟨∇ℒ(F), G⟩ + (Lw^2/2)‖G‖^2

holds for any w > 0.   (14 points)
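For concreteness, here is a minimal sketch of a generic AnyBoost-style loop with the step size α_{t+1} = −⟨∇ℒ(H_t), h_{t+1}⟩ / (L‖h_{t+1}‖^2) from the problem statement. It follows the standard AnyBoost description rather than the exact slide from Lecture 02; the quadratic objective, the finite pool of base hypotheses, and L = 1 are all made-up choices for illustration.

```python
import numpy as np

# Minimal AnyBoost-style loop in function space: H and the base hypotheses h are
# represented by their values on n training points. The quadratic objective, the
# pool of base hypotheses, and L = 1 are illustrative, not from the lecture.
rng = np.random.default_rng(3)
n, T, L = 50, 200, 1.0

y = rng.normal(size=n)                                        # targets
pool = [rng.choice([-1.0, 1.0], size=n) for _ in range(20)]   # candidate base hypotheses h

def grad(H):
    # Gradient of the objective L(H) = (1/2)||H - y||^2; its gradient is 1-Lipschitz.
    return H - y

H = np.zeros(n)
for t in range(T):
    g = grad(H)
    h = min(pool, key=lambda v: np.dot(g, v))        # weak learner: most negative <grad, h>
    if np.dot(g, h) >= 0:                            # no descent direction left -> terminate
        break
    alpha = -np.dot(g, h) / (L * np.dot(h, h))       # step size from the problem statement
    H = H + alpha * h                                # H_{t+1} = H_t + alpha_{t+1} h_{t+1}

g = grad(H)
best = min(pool, key=lambda v: np.dot(g, v))
print(np.dot(g, best))   # close to 0 (or nonnegative) at termination
```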
6. True or False? If False, explain briefly. (10 points)

(1) The inequality G(F, n) ≤ n^2 holds for any model class F.
(2) The VC dimension of an axis-aligned rectangle in a 2D space is 4.
(3) The VC dimension of a circle in a 2D space is 4.
(4) The VC dimension of the 1-nearest-neighbor classifier in d-dimensional space is d + 1.
(5) Let d be the VC dimension of F. Then the inequality G(F, n) ≤ (en/d)^d always holds.
7. In LASSO, the model class is defined as F = {x ↦ ⟨w, x⟩ : ‖w‖_1 ≤ α}. Suppose x ∈ R^d, y ∈ {−1, +1}, the training data are S = {(x_i, y_i)}_{i=1}^n, and max_{1≤i≤n} ‖x_i‖_∞ ≤ β, where ‖·‖_∞ denotes the largest absolute element of a vector.

(1) Find an upper bound on the empirical Rademacher complexity

R_S(F) = E_σ[ sup_{f∈F} (1/n) Σ_{i=1}^n σ_i f(x_i) ],

where the σ_i are Rademacher variables.   (16 points)

(2) Suppose the loss function is the absolute loss. Use the inequality (highlighted in blue) on page 30 and Lemma 5 on page 35 (i.e., R(ℓ ∘ F) ≤ ηR(F)) of Lecture 03 to derive a generalization error bound for LASSO.   (6 points)

Hint: For question (1), please use the inequality ⟨a, b⟩ ≤ ‖a‖_1 ‖b‖_∞ and the following lemma (a numerical sanity check for (1) is sketched after the lemma):

Lemma 1. Let A ⊆ R^n be a finite set of points with r = max_{x∈A} ‖x‖_2 and denote x = (x_1, x_2, ..., x_n). Then

E_σ[ max_{x∈A} Σ_{i=1}^n x_i σ_i ] ≤ r √(2 log |A|),

where |A| denotes the cardinality of the set A and the σ_i are Rademacher variables.
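Not needed for the derivation, but a Monte Carlo estimate of R_S(F) can serve as a sanity check on the bound obtained in (1). The sketch below uses the dual-norm identity: for fixed σ, the supremum over ‖w‖_1 ≤ α of ⟨w, Σ_i σ_i x_i⟩ equals α‖Σ_i σ_i x_i‖_∞. The dimensions, α, and the data are made up.

```python
import numpy as np

# Monte Carlo estimate of the empirical Rademacher complexity of
# F = { x -> <w, x> : ||w||_1 <= alpha } on a fixed sample S.
# For fixed sigma, sup_{||w||_1 <= alpha} <w, sum_i sigma_i x_i> = alpha * ||sum_i sigma_i x_i||_inf.
rng = np.random.default_rng(4)
n, d, alpha = 100, 20, 2.0
X = rng.uniform(-1, 1, size=(n, d))                  # rows x_i, so ||x_i||_inf <= beta = 1

n_mc = 5000
sigma = rng.choice([-1.0, 1.0], size=(n_mc, n))      # Rademacher draws
sup_vals = alpha * np.abs(sigma @ X).max(axis=1)     # alpha * ||sum_i sigma_i x_i||_inf per draw
rad_hat = sup_vals.mean() / n                        # estimate of R_S(F)
print(rad_hat)                                       # compare with the bound from (1)
```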
8. Answer the following questions about recommendation systems.

(1) What are the differences between collaborative filtering and content-based methods?   (4 points)
(2) Suppose we have m items and n users. Let Ω be the set of indices of observed ratings given by the users on the items. Provide the objective function of content-based recommendation for users, where the model class is a neural network with two hidden layers. You need to define the notations clearly. In addition, explain how to make recommendations for a new user using your model.