
Theory Assignment 3
Instructions
There are many free integrated LaTeX editors that are convenient to use (e.g., Overleaf, ShareLaTeX). Choose the one(s) you like the most. The tutorial Getting to Grips with LaTeX is a good start if you do not know how to use LaTeX yet.
• Unless stated otherwise, scalars are denoted by small letters in normal font, vectors are denoted by small letters in bold font, and matrices are denoted by capital letters in bold font.
• $\|\cdot\|$ denotes the L2-norm unless specified otherwise, i.e., $\|\cdot\| = \|\cdot\|_2$.
Problem 1 Principal Component Analysis (25 points)
In this problem, we use proof by induction to show that the M-th principal component corresponds to the M-th eigenvector of $X^T X$ sorted by eigenvalue from largest to smallest. Here $X$ is the centered data matrix and we denote the sorted eigenvalues as $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d$. In the lecture, the result was proven for $M = 1$. Now suppose the result holds for a value $M$, and you are going to show that it holds for $M + 1$. Note that the $(M+1)$-th principal component corresponds to the solution of the following optimization problem:

$$\max_{v} \; v^T X^T X v \quad \text{s.t. } \|v\| = 1, \; v^T v_i = 0 \text{ for } i = 1, \ldots, M \qquad (1)$$

where $v_i$ is the $i$-th principal component. Write down the Lagrangian of the optimization problem above, and show that the solution $v_{M+1}$ is an eigenvector of $X^T X$. Then show that the quantity in (1) is maximized when $v_{M+1}$ is the eigenvector with eigenvalue $\lambda_{M+1}$.
Reference: Lecture 8, Page 3/13, Slides 9-10.
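As a sanity check (not part of the required proof), the following Python sketch illustrates the claim numerically on a randomly generated dataset: the eigenvectors of $X^T X$, sorted by eigenvalue from largest to smallest, coincide up to sign with the principal directions obtained from an SVD of the centered data. All variable names here are illustrative.

```python
# Numerical illustration of Problem 1's claim: the M-th principal component of a
# centered data matrix X matches the M-th eigenvector of X^T X (sorted by eigenvalue).
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))
X = A - A.mean(axis=0)                      # centered data matrix

eigvals, eigvecs = np.linalg.eigh(X.T @ X)  # symmetric eigendecomposition
order = np.argsort(eigvals)[::-1]           # sort largest -> smallest
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# The right singular vectors of X are the principal components.
_, S, Vt = np.linalg.svd(X, full_matrices=False)
for m in range(5):
    # Eigenvectors are defined up to sign, so compare absolute inner products.
    assert np.isclose(abs(eigvecs[:, m] @ Vt[m]), 1.0)
    assert np.isclose(eigvals[m], S[m] ** 2)
print("M-th principal component matches M-th eigenvector of X^T X")
```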
Problem 2 Support Vector Regression (30 points)
In this problem, we derive an extension of the support vector machine to regression problems, called Support Vector Regression (SVR). Define the regressor $f(x) = w^T \phi(x) + b$, and suppose we are given a dataset $\{(x_n, y_n)\}_{n=1}^N$ with $y_n \in \mathbb{R}$. Intuitively, we want to find a regressor that has a small weight $w$ and also ensures a small approximation error on $\{(x_n, y_n)\}_{n=1}^N$. This intuition can be formulated as the following optimization problem:

$$\min_{w, b} \; \frac{1}{2} \|w\|_2^2 \quad \text{s.t. } |w^T \phi(x_n) + b - y_n| \leq \epsilon, \; \forall n \in [N]$$
For an arbitrary dataset, the $\epsilon$-close constraint may not be feasible. Therefore, we optimize the "soft" version of the loss above:

$$\min_{w, b} \; \frac{1}{2} \|w\|_2^2 + C \sum_{n=1}^{N} E_\epsilon(y_n - f(x_n)) \qquad (4)$$

$E_\epsilon$ is the $\epsilon$-insensitive error function, which gives zero error if the difference between prediction and ground truth is smaller than $\epsilon$ and incurs a linear penalty otherwise. It is defined as follows:

$$E_\epsilon(x) = \begin{cases} 0 & |x| \leq \epsilon \\ |x| - \epsilon & |x| > \epsilon \end{cases}$$
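As an aside, the objective in (4) is straightforward to evaluate directly. The sketch below assumes the feature vectors $\phi(x_n)$ are stacked as rows of a matrix `Phi`; the names `E_eps`, `svr_objective`, `eps`, and `C` are illustrative, not prescribed by the assignment.

```python
# Minimal sketch of the epsilon-insensitive error and the "soft" SVR objective (4).
import numpy as np

def E_eps(r, eps):
    """Zero inside the eps-tube, linear penalty |r| - eps outside."""
    return np.maximum(np.abs(r) - eps, 0.0)

def svr_objective(w, b, Phi, y, C, eps):
    """(1/2)||w||^2 + C * sum_n E_eps(y_n - f(x_n)) with f(x) = w^T phi(x) + b."""
    residuals = y - (Phi @ w + b)           # Phi has one row phi(x_n) per data point
    return 0.5 * np.dot(w, w) + C * np.sum(E_eps(residuals, eps))
```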
Question 1 Reformulate the unconstrained optimization problem in equation (4) as a constrained optimization problem by introducing slack variables for each data point. Hint: For each data point, introduce slack variables $\xi_n \geq 0$, $\xi_n' \geq 0$ such that $-\epsilon - \xi_n' \leq y_n - f(x_n) \leq \epsilon + \xi_n$. Then replace $E_\epsilon$ with $\xi_n, \xi_n'$. (12 points)
Question 2 Write down the Lagrangian of the constrained optimization problem derived in Question 1, then minimize the Lagrangian by taking derivatives w.r.t. $w, b, \xi_n, \xi_n'$, setting the gradients to 0, and simplifying the expressions. Hint: there should be no $b, \xi_n, \xi_n'$ in the final expressions. (18 points)
Reference: Lecture 7.

Problem 3 Support Vector Machine (25 points)
Consider the dataset consisting of points $(x, y)$, where $x$ is a real value and $y \in \{-1, 1\}$ is the class label. There are only three points: $(x_1, y_1) = (0, 1)$, $(x_2, y_2) = (\pi/2, 1)$, $(x_3, y_3) = (\pi, 1)$. Let the feature mapping be $\phi(x) = [\cos x, \sin x]^T$, corresponding to the kernel function $k(x, y) = \cos(x - y)$.
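As a quick numerical illustration (not required for the solution), the sketch below checks that this feature map indeed induces the stated kernel, i.e., $\phi(x)^T \phi(x') = \cos(x - x')$, at the three given x-values.

```python
# Verify phi(x)^T phi(x') = cos(x - x') for x in {0, pi/2, pi}.
import numpy as np

def phi(x):
    return np.array([np.cos(x), np.sin(x)])

xs = [0.0, np.pi / 2, np.pi]
for x in xs:
    for xp in xs:
        # cos(x)cos(x') + sin(x)sin(x') = cos(x - x')
        assert np.isclose(phi(x) @ phi(xp), np.cos(x - xp))
```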
Question 1 Write down the primal and dual formulations of SVM for this dataset in the transformed two-dimensional feature space based on $\phi(\cdot)$. Note that we assume the data points are separable and set the hyperparameter $C$ to $+\infty$, which forces all slack variables ($\xi$) in the primal formulation to be 0 (and thus they can be removed from the optimization). (12 points)
Question 2 Next, solve the dual formulation. Based on that, derive the primal solution. (13 points)
Reference: Lecture 7.

Problem 4 Boosting (20 points)
Recall the procedure of the AdaBoost algorithm described in class:

Algorithm 1: AdaBoost
1 Given: a training set $\{(x_n, y_n \in \{+1, -1\})\}_{n=1}^N$ and a set of classifiers $\mathcal{H}$, where each $h \in \mathcal{H}$ takes a feature vector as input and outputs $+1$ or $-1$.
2 Goal: learn $H(x) = \mathrm{sign}\left(\sum_{t=1}^T \beta_t h_t(x)\right)$.
3 Initialization: $D_1(n) = \frac{1}{N}, \; \forall n \in [N]$.
4 for $t = 1, 2, \cdots, T$ do
5     Find $h_t = \arg\min_{h \in \mathcal{H}} \sum_{n: y_n \neq h(x_n)} D_t(n)$.
6     Compute $\epsilon_t = \sum_{n: y_n \neq h_t(x_n)} D_t(n)$ and $\beta_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$.
7     Compute $D_{t+1}(n) = \frac{D_t(n)\, e^{-\beta_t y_n h_t(x_n)}}{\sum_{n'=1}^N D_t(n')\, e^{-\beta_t y_{n'} h_t(x_{n'})}}$ for each $n \in [N]$.
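For intuition (not part of the required answers), here is a minimal Python sketch of one round of the loop above (steps 4-7), under the simplifying assumption that each candidate classifier is represented by its vector of $\pm 1$ predictions on the training set.

```python
# One AdaBoost round: pick h_t, compute eps_t and beta_t, and reweight the data.
# Assumes 0 < eps_t < 1 so that beta_t is well defined.
import numpy as np

def adaboost_round(D_t, y, candidate_preds):
    """D_t: weights over N points; y: labels in {-1,+1};
    candidate_preds: list of {-1,+1} prediction vectors, one per classifier in H."""
    # Step 5: find h_t minimizing the D_t-weighted classification error.
    errors = [np.sum(D_t[preds != y]) for preds in candidate_preds]
    t_best = int(np.argmin(errors))
    preds_t = candidate_preds[t_best]

    # Step 6: weighted error and beta_t.
    eps_t = errors[t_best]
    beta_t = 0.5 * np.log((1 - eps_t) / eps_t)

    # Step 7: increase weight on mistakes, decrease on correct predictions, normalize.
    D_next = D_t * np.exp(-beta_t * y * preds_t)
    D_next /= D_next.sum()                  # the denominator in step 7
    return D_next, t_best, eps_t, beta_t
```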
Question 1 We discussed in class that AdaBoost minimizes the exponential loss greedily. In particular, AdaBoost seeks the optimal $\beta_t$ that minimizes

$$\epsilon_t \left(e^{\beta_t} - e^{-\beta_t}\right) + e^{-\beta_t}$$

where $\epsilon_t$ is the weighted classification error of $h_t$ and is fixed. Show that $\beta_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$ is the minimizer. (8 points)
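Before doing the derivation, it can be reassuring to check the claim numerically. The sketch below, for one fixed value of $\epsilon_t$ (0.3, chosen only for illustration), compares a grid-search minimizer of the loss with the closed-form $\beta_t$.

```python
# Numerical check (not a proof) of the closed-form minimizer of the exponential loss.
import numpy as np

eps_t = 0.3                                 # illustrative weighted error in (0, 1)
loss = lambda b: eps_t * (np.exp(b) - np.exp(-b)) + np.exp(-b)

betas = np.linspace(-5, 5, 100001)
beta_numeric = betas[np.argmin(loss(betas))]
beta_closed = 0.5 * np.log((1 - eps_t) / eps_t)
print(beta_numeric, beta_closed)            # the two values should agree closely
```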
Question 2 Recall that at round $t$ of AdaBoost, a classifier $h_t$ is obtained and the weighting over the training set is updated from $D_t$ to $D_{t+1}$. Prove that $h_t$ is only as good as random guessing in terms of the classification error weighted by $D_{t+1}$; that is, (12 points)

$$\sum_{n: h_t(x_n) \neq y_n} D_{t+1}(n) = \frac{1}{2}.$$

Hint: you can somehow ignore the denominator of $D_{t+1}(n)$ to simplify the calculation.
References: Lecture 8, Slides 39-44; Lecture 8, Slide 30-