# 机器学习辅导 | Theory Assignment 3 Instructions

Theory Assignment 3

Instructions

There are many free integrated LATEX editors that are convenient to use (e.g Overleaf, ShareLaTeX).

Choose the one(s) you like the most. This tutorial Getting to Grips with LaTeX is a good start if you do not

know how to use LATEX yet.

• Unless stated otherwise, scalars are denoted by small letter in normal font, vectors are denoted by

small letters in bold font and matrices are denoted by capital letters in bold font.

• k.k means L2-norm unless specified otherwise i.e. k.k = k.k2

1

Problem 1 Principle Component Analysis (25 points)

In this problem, we use proof by induction to show that the M-th principle component corresponds to the

M-th eigenvector of XTX sorted by the eigenvalue from largest to smallest. Here X is the centered data

matrix and we denote the sorted eigenvalues as l1 ≥ l2 ≥ … ≥ ld. In the lecture, the results was proven

for M = 1. Now suppose the result holds for a value M, and you are going to show that it holds for

M + 1. Note that the M + 1 principle component corresponds to the solution of the following optimization

problem:

max

v

vTXTXv (1)

where vi is the i-th principle component. Write down the Lagrangian of the optimization problem above,

and show that the solution vM+1 is an eigenvector of XTX. Then show that the quantity in (1) is maximized

when the vM+1 is the eigenvector with eigenvalue lM+1.

2

Lecture 8 PAGE 3/13 SLIDE 9-10

Problem 2 Support Vector Regression (30 points)

In this problem, we derive an extension of support vector machine to regression problem, called Support

Vector Regression (SVR). Define the regressor f (x) = wTf(x) + b, and given a dataset f(xn, yn)gnN=1, yn 2

R. Intuitively, we want to find a regressor that has small weight w and also ensure small approximation

error to f(xn, yn)gnN=1. The intuition can be formulated as the following optimization problem:

min

w,b

12

kwk2 2

s.t. jwTf(xn) + b – ynj ≤ e

For an arbitrary dataset, the e-close constraint may not be feasible, Therefore, we optimize the “soft” version

of the loss above:

min

w,b

12

kwk2 2 + C

N∑

n=1

Ee(yn – f (xn)) (4)

Ee is the e-insensitive error function which gives zero error if the difference between prediction and ground

truth is smaller than e and incurs linear penalty otherwise. It is defined as follow:

Ee(x) = (0jxj – e j jx xj ≤ j > e e

Question 1 Reformulate the unconstrained optimization problem in equation 4 as a constraint optimization problem by introducing slack variables for each data points. Hint: For each data point, introduce slack

variables xn ≥ 0, xn0 ≥ 0 such that –e – xn0 ≤ yn – f (xn) ≤ e + xn. Then replace Ee with xn, xn0 . (12 points)

Question 2 Write down the Lagrangian of the constrained optimization derived in Question 1, then minimize the Lagrangian by taking derivative w.r.t w, b, xn, xn0 and set the gradient to 0, and simplify expressions.

Hint: there are no b, xn, xn0 in the final expressions. (18 points)

3

Lecture 7

Lecture 7

Problem 3 Support Vector Machine (25 points)

Consider the dataset consisting of points (x, y), where x is a real value, and y 2 f-1, 1g is the class label.

There are only three points (x1, y1) = (0, 1), (x2, y2) = ( p2 , –1), (x3, y3) = (p, 1). Let the feature mapping

f(x) = [cos x, sin x]T, corresponding to the kernel function k(x, y) = cos(x – y).

Question 1 Write down the primal and dual formulations of SVM for this dataset in the transformed twodimensional feature space based on f(·). Note that we assume the data points are separable and set the

hyperparameter C to be +¥, which forces all slack variables (x) in the primal formulation to be 0 (and thus

can be removed from the optimization). (12 points)

Question 2 Next, solve the dual formulation. Based on that, derive the primal solution. (13 points)

4

Lecture 7

Lecture7

Problem 4 Boosting (20 points)

Recall the procedure of AdaBoost algorithm described in class:

Algorithm 1: Adaboost

1 Given: A training set f(xn, yn 2 f+1, –1g)gnN=1, and a set of classifier H, where each h 2 H takes a

feature vector as input and outputs +1 or –1.

2 Goal: Learn H(x) = sign ∑tT=1 btht(x)

3 Initialization: D1(n) = N1 , 8n 2 [N].

4 Find ht = arg minh2H ∑n:yn6=h(xn) Dt(n).

5 for t = 1, 2, · · · , T do

6 Compute

et = ∑

n:yn6=ht(xn)

Dt(n)

1 – et

7 Compute

Dt+1(n) = Dt(n)e–btynht(xn)

∑nN0=1 Dt(n0)e–btyn0 ht(xn0 )

for each n 2 [N]

Question 1 We discussed in class that AdaBoost minimizes the exponential loss greedily. In particular,

Adaboost seeks the optimal bt that minimizes

et(ebt – e–bt) + e–bt

where et is the weighted classification error of ht and is fixed. Show that b∗ = 12 ln 1–etet is the minimizer.

(8 points)

Question 2 Recall that at round t of AdaBoost, a classifier ht is obtained and the weighting over the

training set is updated from Dt to Dt+1. Prove that ht is only as good as random guessing in terms of

classification error weighted by Dt+1. That is (12 points)

∑

n:ht(xn)6=yn

Dt+1(n) = 1

2.

Hint: you can somehow ignore the denominator of Dt+1(n) to simplify calculation.

5

Lecture 8 Slide 39-44

Lecture 8 Slide 30-