# CS 189 Introduction to Machine Learning

1 Getting Started

Read through this page carefully. You may typeset your homework in LaTeX or submit neatly handwritten/scanned solutions. Please start each question on a new page. Deliverables:

1. Submit a PDF of your write-up, with an appendix for your code, to the Gradescope assignment “HW4 Write-Up”. If there are graphs, include those graphs in the correct sections. Do not simply reference your appendix.

2. If there is code, submit all code needed to reproduce your results to “HW4 Code”.

3. If there is a test set, submit your test set evaluation results to “HW4 Test Set”.

HW 04, ©UCB CS 189, Fall 2019. All Rights Reserved. This may not be publicly shared without explicit permission.

2 Gaussian Features

In this question we will explore augmenting our data using Gaussian RBFs. We might want to do something like this so that we can model larger classes of functions by considering linear combinations of basis functions $f(x) = \sum_{k=1}^{K} w_k \phi_k(x)$, and we may want to "handcraft" our features with functions $\phi(x)$ when we have intuition and insight about our data and its source.

(a) First we will attempt to augment our data using polynomial features, to see whether this is appropriate in the setting where our data comes from Gaussian distributions. Implement the function `polynomial_augmentation`, which takes as input a data matrix with only one feature and returns a data matrix augmented with polynomial features of degree up to $k$ (some values of $k$ worth trying are 2, 4, 6). Then use the function `linear_regression`, which performs linear regression on a data matrix and plots the resulting curve alongside the data points. What do you observe about the fitted curve? What happens as the degree of the polynomial features increases?
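A minimal sketch of what `polynomial_augmentation` might look like, assuming the data matrix is an `(n, 1)` NumPy array and that the bias column is handled by the regression step (both assumptions of mine, not the starter code's actual interface):

```python
import numpy as np

def polynomial_augmentation(X, k):
    """Augment a single-feature data matrix X (shape (n, 1)) with
    polynomial features x, x^2, ..., x^k."""
    x = X[:, 0]
    # Column j holds x^(j+1); a bias column can be added by the regression step.
    return np.stack([x ** p for p in range(1, k + 1)], axis=1)

# Tiny demo on synthetic data: each row becomes [x, x^2, x^3].
X = np.array([[1.0], [2.0], [3.0]])
print(polynomial_augmentation(X, 3))
```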

(b) Now, we will use the Gaussian function (https://en.wikipedia.org/wiki/Gaussian_function) to augment our data matrix. The idea is that our data comes from a linear combination of a set number of Gaussian functions, each with a center. Analyze the plot of the data to find 3 centers that seem like they would be good centers, and estimate three variances by comparing the plotted function against the data (hint: some of the Gaussians may be negatively weighted and be upside down). Then, implement the function `rbf_augmentation`, which takes as input a data matrix with only one feature and outputs a data matrix augmented with the values of the Gaussian functions centered at the chosen centers, evaluated at the data:

$$\phi_i(x) = \exp\!\left(-\frac{(x - c_i)^2}{2\sigma_i^2}\right), \quad i = 1, 2, 3,$$

where $c_1, c_2, c_3$ are your centers and $\sigma_1^2, \sigma_2^2, \sigma_3^2$ are the respective variances of the Gaussians.
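A sketch of `rbf_augmentation` under the Gaussian-feature formula above; the specific centers and variances below are placeholder values of mine, to be replaced by ones read off your plot:

```python
import numpy as np

# Hypothetical centers and variances, eyeballed from a plot of the data.
CENTERS = [-2.0, 0.0, 2.0]
VARIANCES = [0.5, 1.0, 0.5]

def rbf_augmentation(X, centers=CENTERS, variances=VARIANCES):
    """Augment a single-feature data matrix X (shape (n, 1)) with
    un-normalized Gaussian features exp(-(x - c)^2 / (2 * variance))."""
    x = X[:, 0]
    feats = [np.exp(-(x - c) ** 2 / (2 * v))
             for c, v in zip(centers, variances)]
    return np.stack(feats, axis=1)
```

Each Gaussian feature equals 1 exactly at its own center and decays away from it, so a linear model over these features can fit bumps (or dips, via negative weights) around the chosen centers.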

3 Probabilistic Principal Components Analysis (PPCA)

In this question, we explore a more general form of PCA, called PPCA. We start with a latent variable model


$$y = Wx + \mu + z$$

for $x \sim N(0, I)$, noise $z \sim N(0, \Sigma)$, and constants $\mu$, $W$. We will constrain ourselves to isotropic noise, i.e., $\Sigma = \sigma^2 I$.

(a) Characterize the distribution of y conditioned on x.

(b) Now consider i.i.d. $x \sim N(0, I)$, and characterize the marginal distribution of $y$.
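Before deriving the answers on paper, it can help to see the model empirically. A minimal simulation of the latent-variable model above (all dimensions and parameter values here are arbitrary choices of mine); compare the printed empirical mean and covariance of $y$ against what your derivation predicts:

```python
import numpy as np

rng = np.random.default_rng(0)
d_latent, d_obs, n = 2, 3, 200_000

W = rng.standard_normal((d_obs, d_latent))
mu = np.array([1.0, -1.0, 0.5])
sigma = 0.3

x = rng.standard_normal((n, d_latent))       # x ~ N(0, I)
z = sigma * rng.standard_normal((n, d_obs))  # isotropic noise, Sigma = sigma^2 I
y = x @ W.T + mu + z                         # y = W x + mu + z, row per sample

# Empirical mean and covariance of y.
print(y.mean(axis=0))
print(np.cov(y, rowvar=False))
```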

4 Canonical Correlation Analysis

Assume that you have a database of images of words typed in two different fonts. $X, Y \in \mathbb{R}^{n \times d}$ correspond to the datasets for font 1 and font 2, respectively. Think of the database $X$ as being composed of $n$ independent draws (samples) from a random variable $X \in \mathbb{R}^d$, and similarly $Y$ as $n$ draws from a random variable $Y \in \mathbb{R}^d$. Your goal is to use machine learning to build a text recognizer for word images.

(a) Explain why you would want to consider using CCA in this problem.

(b) Assume that the data matrices $X$ and $Y$ include zero-mean features of the word images. Given two unit-length vectors $u, v \in \mathbb{R}^d$, compute the correlation coefficient of the random variables $x$, $y$ projected onto $u$, $v$, i.e., compute the correlation coefficient between $u^\top x$ and $v^\top y$. The correlation coefficient between two scalar random variables $P$ and $Q$ is computed by:

$$\rho(P, Q) = \frac{\mathrm{cov}(P, Q)}{\sigma_P \sigma_Q}$$

(c) Assume that the features of matrix X are rescaled to have values between -1 and 1. How does

this change the correlation coefficient?
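A quick numerical illustration of the quantity in part (b): project synthetic zero-mean $X$ and $Y$ onto unit vectors $u$, $v$ and apply the correlation formula. The data and projection directions here are arbitrary stand-ins of mine, not the font datasets:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5000, 4

# Synthetic correlated X and Y (stand-ins for the two font datasets).
X = rng.standard_normal((n, d))
Y = 0.8 * X + 0.6 * rng.standard_normal((n, d))
X -= X.mean(axis=0)                 # zero-mean features, as assumed in (b)
Y -= Y.mean(axis=0)

u = np.ones(d) / np.sqrt(d)         # unit-length projection directions
v = np.ones(d) / np.sqrt(d)

p, q = X @ u, Y @ v                 # scalar projections u^T x, v^T y
rho = np.cov(p, q)[0, 1] / (p.std(ddof=1) * q.std(ddof=1))
print(rho)
```

By construction each component of $Y$ has unit variance ($0.8^2 + 0.6^2 = 1$), so the sample correlation should come out near $0.8$.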

5 Total Least Squares

In most of the models we have looked at so far, we've accounted for noise in the observed $y$ measurement and adjusted accordingly. However, in the real world it could easily be that our feature matrix $X$ of data is also corrupted or noisy. Total least squares is a way to account for this. Whereas previously we were minimizing the $y$ distance from the data point to our predicted line, because we had assumed the features were definitively accurate, now we are minimizing the entire distance from the data point to our predicted line. In this problem we will explore the mathematical intuition for the TLS formula. We will then apply the formula to adjusting the lighting of an image which contains noise in its feature matrix due to inaccurate assumptions we make about the image, such as the image being a perfect sphere.

Let $X \in \mathbb{R}^{n \times d}$ and $y \in \mathbb{R}^n$ be the measurements. Recall that in the least squares problem, we want to solve for $w$ in $\min_w \lVert Xw - y \rVert$. We measure the error as the difference between $Xw$ and $y$, which can be viewed as adding an error term $\epsilon_y$ such that the equation $Xw = y + \epsilon_y$ has a solution:

$$\min_{\epsilon_y, w} \lVert \epsilon_y \rVert_2, \quad \text{subject to } Xw = y + \epsilon_y \tag{1}$$

Although this optimization formulation allows for errors in the measurements of $y$, it does not allow for errors in the feature matrix $X$ that is measured from the data. In this problem, we will explore a method called total least squares that allows for error in both the matrix $X$ and the vector $y$, represented by $\epsilon_X$ and $\epsilon_y$, respectively. For convenience, we absorb the negative sign into $\epsilon_y$ and $\epsilon_X$ and define the true measurements $y_{\text{true}}$ and $X_{\text{true}}$ like so:

$$y_{\text{true}} = y + \epsilon_y \tag{2}$$
$$X_{\text{true}} = X + \epsilon_X \tag{3}$$

Specifically, the total least squares problem is to find the solution for $w$ in the following minimization problem:

$$\min_{\epsilon_y, \epsilon_X, w} \lVert [\epsilon_X, \epsilon_y] \rVert_F^2, \quad \text{subject to } (X + \epsilon_X)w = y + \epsilon_y \tag{4}$$

where the matrix $[\epsilon_X, \epsilon_y]$ is the concatenation of the columns of $\epsilon_X$ with the column vector $\epsilon_y$. Recall that the subscript $F$ from discussion denotes the Frobenius norm of a matrix. Notice that the minimization is over $w$ because it's a free parameter, and it does not necessarily have to be in the objective function. Intuitively, this equation is finding the smallest perturbation to the matrix of data points $X$ and the outputs $y$ such that the linear model can be solved exactly. The constraint in the minimization problem can be rewritten as

$$[X + \epsilon_X,\; y + \epsilon_y] \begin{bmatrix} w \\ -1 \end{bmatrix} = 0 \tag{5}$$

(a) Let the matrix $X \in \mathbb{R}^{n \times d}$ and $y \in \mathbb{R}^n$, and note that $\epsilon_X \in \mathbb{R}^{n \times d}$ and $\epsilon_y \in \mathbb{R}^n$. Assuming that $n > d$ and $\mathrm{rank}(X + \epsilon_X) = d$, explain why $\mathrm{rank}([X + \epsilon_X,\; y + \epsilon_y]) = d$.

(b) For the solution $w$ to be unique, the matrix $[X + \epsilon_X,\; y + \epsilon_y]$ must have exactly $d$ linearly independent columns. Since this matrix has $d+1$ columns in total, it must be rank-deficient by 1. Recall that we can rewrite $[X, y]$ via its SVD, $U \Sigma V^\top$, for orthonormal $U$, $V$ and diagonal $\Sigma$. The Eckart–Young–Mirsky theorem tells us that the closest lower-rank matrix in the Frobenius norm is obtained by discarding the smallest singular values. Therefore, the matrix $[X + \epsilon_X,\; y + \epsilon_y]$ that minimizes

$$\lVert [\epsilon_X, \epsilon_y] \rVert_F^2 = \lVert [X_{\text{true}}, y_{\text{true}}] - [X, y] \rVert_F^2$$

is given by

$$[X + \epsilon_X,\; y + \epsilon_y] = U \begin{bmatrix} \Sigma_d & 0 \\ 0 & 0 \end{bmatrix} V^\top$$

Suppose we express the SVD of $[X, y]$ in terms of submatrices and vectors:

$$[X, y] = \begin{bmatrix} U_{xx} & u_{xy} \\ u_{yx}^\top & u_{yy} \end{bmatrix} \begin{bmatrix} \Sigma_d & \\ & \sigma_{d+1} \end{bmatrix} \begin{bmatrix} V_{xx} & v_{xy} \\ v_{yx}^\top & v_{yy} \end{bmatrix}^\top$$

Here $u_{xy} \in \mathbb{R}^{n-1}$ is the first $(n-1)$ elements of the $(d+1)$-th column of $U$, $u_{yx}^\top \in \mathbb{R}^d$ is the first $d$ elements of the $n$-th row of $U$, $u_{yy}$ is the $n$-th element of the $(d+1)$-th column of $U$, and $U_{xx} \in \mathbb{R}^{(n-1) \times d}$ is the $(n-1) \times d$ top-left submatrix of $U$.

Similarly, $v_{xy} \in \mathbb{R}^d$ is the first $d$ elements of the $(d+1)$-th column of $V$, $v_{yx}^\top \in \mathbb{R}^d$ is the first $d$ elements of the $(d+1)$-th row of $V$, $v_{yy}$ is the $(d+1)$-th element of the $(d+1)$-th column of $V$, and $V_{xx} \in \mathbb{R}^{d \times d}$ is the $d \times d$ top-left submatrix of $V$. $\sigma_{d+1}$ is the $(d+1)$-th singular value of $[X, y]$, and $\Sigma_d$ is the diagonal matrix of the $d$ largest singular values of $[X, y]$.

Using this information, show that

$$[\epsilon_X, \epsilon_y] = -\begin{bmatrix} u_{xy} \\ u_{yy} \end{bmatrix} \sigma_{d+1} \begin{bmatrix} v_{xy} \\ v_{yy} \end{bmatrix}^\top$$

(c) Using the result from the previous part and the fact that $v_{yy}$ is not 0 (see the notes on Total Least Squares), find a nonzero solution to $[X + \epsilon_X,\; y + \epsilon_y] \begin{bmatrix} w \\ -1 \end{bmatrix} = 0$ and thus solve for $w$ in Equation (5).

HINT: Looking at the last column of the product $[X, y]V$ may or may not be useful for this problem, depending on how you solve it.

(d) From the previous part, you can see that $\begin{bmatrix} w \\ -1 \end{bmatrix}$ is a right-singular vector of $[X, y]$. Show that

$$(X^\top X - \sigma_{d+1}^2 I)\, w = X^\top y$$
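Parts (c) and (d) can be sanity-checked numerically. The sketch below (synthetic data; all names are mine, not the starter code's) recovers $w$ from the last right-singular vector of $[X, y]$ and then verifies the identity from part (d):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 3
X = rng.standard_normal((n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(n)

# Total least squares: take the SVD of the augmented matrix [X, y] and read
# w off the last right-singular vector (assuming its last entry v_yy != 0).
U, s, Vt = np.linalg.svd(np.column_stack([X, y]), full_matrices=False)
v = Vt[-1]                       # right-singular vector for sigma_{d+1}
w_tls = -v[:d] / v[d]

# The same w should satisfy (X^T X - sigma_{d+1}^2 I) w = X^T y.
lhs = (X.T @ X - s[-1] ** 2 * np.eye(d)) @ w_tls
print(np.allclose(lhs, X.T @ y))
```

Note that `np.linalg.svd` returns singular values in descending order, so `s[-1]` is $\sigma_{d+1}$ and `Vt[-1]` is its right-singular vector.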

(e) In this problem, we will use total least squares to approximately learn the lighting in a photograph, which we can then use to paste new objects into the image while still maintaining the realism of the image. You will be estimating the lighting coefficients for the interior of St. Peter's Basilica, and you will then use these coefficients to change the lighting of an image of a tennis ball so that it can be pasted into the image. In Figure 1, we show the result of pasting the tennis ball into the image without adjusting the lighting on the ball. The ball looks too bright for the scene and does not look like it would fit in with the other objects in the image.

Figure 1: Tennis ball pasted on top of image of St. Peter's Basilica without lighting adjustment (left) and with lighting adjustment (right)

Figure 2: Image of a spherical mirror inside of St. Peter's Basilica

To convincingly add a tennis ball to an image, we need to apply the appropriate lighting from the environment onto the added ball. To start, we will represent environment lighting as a spherical function $f(n)$, where $n$ is a 3-dimensional unit vector ($\lVert n \rVert_2 = 1$) and $f$ outputs a 3-dimensional color vector, one component each for red, green, and blue light intensities. Because $f(n)$ is a spherical function, the input $n$ must correspond to a point on a sphere. The function $f(n)$ represents the total incoming light from direction $n$ in the scene. The lighting function of a spherical object $f(n)$ can be approximated by the first 9 spherical harmonic basis functions.

The first 9 un-normalized spherical harmonic basis functions are given by:

$$\begin{aligned}
L_1 &= 1 & L_2 &= y & L_3 &= x \\
L_4 &= z & L_5 &= xy & L_6 &= yz \\
L_7 &= 3z^2 - 1 & L_8 &= xz & L_9 &= x^2 - y^2
\end{aligned}$$

where $n = [x, y, z]^\top$.

The lighting function can then be approximated as

$$f(n) \approx \sum_{i=1}^{9} \gamma_i L_i(n)$$

In matrix form, over the $n$ samples:

$$\underbrace{\begin{bmatrix} \text{—}\; f(n_1) \;\text{—} \\ \text{—}\; f(n_2) \;\text{—} \\ \vdots \\ \text{—}\; f(n_n) \;\text{—} \end{bmatrix}}_{n \times 3} = \underbrace{\begin{bmatrix} L_1(n_1) & L_2(n_1) & \dots & L_9(n_1) \\ L_1(n_2) & L_2(n_2) & \dots & L_9(n_2) \\ \vdots & & & \vdots \\ L_1(n_n) & L_2(n_n) & \dots & L_9(n_n) \end{bmatrix}}_{n \times 9} \underbrace{\begin{bmatrix} \text{—}\; \gamma_1 \;\text{—} \\ \text{—}\; \gamma_2 \;\text{—} \\ \vdots \\ \text{—}\; \gamma_9 \;\text{—} \end{bmatrix}}_{9 \times 3}$$

where $L_i(n)$ is the $i$-th basis function from the list above.

The function of incoming light $f(n)$ can be measured by photographing a spherical mirror placed in the scene of interest. In this case, we provide you with an image of the sphere, as seen in Figure 2. In the code provided, there is a function extractNormals(img) that will extract the training pairs $(n_i, f(n_i))$ from the image. An example using this function is in the code.

Use the spherical harmonic basis functions to create a 9-dimensional feature vector for each sample. Use this to formulate an ordinary least squares problem and solve for the unknown coefficients $\gamma_i$. Report the estimated values for $\gamma_i$ and include a visualization of the approximation using the provided code. The code provided will load the images, extract the training data, relight the tennis ball with incorrect coefficients, and save the results. Your task is to compute the basis functions and solve the least squares problem to provide the code with the correct coefficients. To run the starter code, you will need to use Python with numpy and scipy. Because the resulting data set is large, we reduce it in the code by taking every 50th entry in the data. This is done for you in the starter code, but you can try using the entire data set or reducing it by a different amount.
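A sketch of the least squares setup in part (e), using synthetic normals in place of the extractNormals(img) output. The function name `sh_basis` and all data below are stand-ins of mine, not the starter code:

```python
import numpy as np

def sh_basis(N):
    """First 9 un-normalized spherical harmonic basis functions, evaluated
    at unit normals N (shape (n, 3), columns x, y, z)."""
    x, y, z = N[:, 0], N[:, 1], N[:, 2]
    ones = np.ones_like(x)
    return np.stack([ones, y, x, z, x * y, y * z,
                     3 * z ** 2 - 1, x * z, x ** 2 - y ** 2], axis=1)

# Synthetic training pairs (n_i, f(n_i)): random unit normals, and colors
# generated from known coefficients so we can check the recovery.
rng = np.random.default_rng(3)
N = rng.standard_normal((500, 3))
N /= np.linalg.norm(N, axis=1, keepdims=True)    # unit normals
gamma_true = rng.standard_normal((9, 3))
F = sh_basis(N) @ gamma_true                     # noiseless f(n_i), one row per sample

# Ordinary least squares: (n, 9) feature matrix, one gamma column per channel.
B = sh_basis(N)
gamma, *_ = np.linalg.lstsq(B, F, rcond=None)
print(np.allclose(gamma, gamma_true))
```

Because the right-hand side here is noiseless, least squares recovers the generating coefficients exactly; on the real image data the solution will only be a best fit.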

(f) When we extract the direction $n$ from the data to compute $(n_i, f(n_i))$, we make some approximations about how the light is captured in the image. We also assume that the spherical mirror is a perfect sphere, but in reality there will always be small imperfections. Thus, our measurement of $n$ contains some error, which makes this an ideal problem for total least squares. Solve this problem with total least squares by allowing perturbations in the matrix of basis functions. Report the estimated values for $\gamma_i$ and include a visualization of the approximation. The output image will be visibly wrong, and we'll explore how to fix this problem in the next part. Your implementation may only use the SVD and the matrix inverse functions from the linear algebra library in numpy, as well as np.linalg.solve.


(g) In the previous part, you should have noticed that the visualization is drastically different from the one generated using least squares. Recall that in total least squares we are minimizing $\lVert [\epsilon_X, \epsilon_y] \rVert_F^2$. Intuitively, to minimize the Frobenius norm of components of both the inputs and outputs, the inputs and outputs should be on the same scale. However, this is not the case here. Color values in an image will typically be in [0, 255], but the original image had a much larger range. We compressed the range to a smaller scale using tone mapping, but the effect of the compression is that relatively bright areas of the image become less bright. As a compromise, we scaled the image colors down to a maximum color value of 384 instead of 255. Thus, the inputs here are all unit vectors, and the outputs are 3-dimensional vectors where each value is in [0, 384]. Propose a value by which to scale the outputs $f(n_i)$ such that the values of the inputs and outputs are roughly on the same scale. Solve this scaled total least squares problem, report the computed spherical harmonic coefficients, and provide a rendering of the relit sphere.
