# CS 189 Introduction to Machine Learning

1 Getting Started

handwritten/scanned solutions. Please start each question on a new page. Deliverables:

1. Submit a PDF of your write-up, with an appendix for your code, to the Gradescope assignment “HW4 Write-Up”. If there are graphs, include them in the correct sections. Do not simply reference your appendix.

2. If there is code, submit all code needed to reproduce your results to “HW4 Code”.

3. If there is a test set, submit your test set evaluation results to “HW4 Test Set”.

2 Gaussian Features

In this question we will explore augmenting our data using the Gaussian RBF. We might want to do something like this so that we can model larger classes of functions by considering linear combinations of basis functions

$$f(x) = \sum_{k=1}^{K} w_k \phi_k(x),$$

and we may want to "handcraft" our features with functions $\phi(x)$ when we have intuition and insight about our data and its source.

(a) First we will attempt to augment our data using polynomial features to see if this is appropriate in the setting where our data comes from Gaussian distributions. Implement the function `polynomial_augmentation`, which takes as input a data matrix with only one feature and returns a data matrix augmented with polynomial features of degree up to k (some values of k worth trying might be 2, 4, 6). Then use the function `linear_regression`, which performs linear regression on a data matrix and plots the resulting curve alongside the data points. What do you observe about the fitted curve? What happens as the degree of the polynomial features increases?
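A minimal sketch of what `polynomial_augmentation` might look like follows; the starter code's exact signature (and the `linear_regression` helper that consumes the result) are not shown in this write-up, so treat the interface as an assumption:

```python
import numpy as np

def polynomial_augmentation(X, k):
    """Augment a single-feature data matrix X (shape (n, 1)) with
    polynomial features x, x^2, ..., x^k.

    Illustrative sketch; the starter code's signature may differ."""
    x = X[:, 0]
    return np.column_stack([x ** d for d in range(1, k + 1)])

# Example: degree-4 features for three points.
X = np.array([[1.0], [2.0], [3.0]])
print(polynomial_augmentation(X, 4))
```

Each row of the output is then treated as a d-dimensional feature vector by the regression routine.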

(b) Now, we will use the Gaussian function (https://en.wikipedia.org/wiki/Gaussian_function) to augment our data matrix. The idea is that our data comes from a linear combination of a set number of Gaussian functions, each with a center. Analyze the plot of the data to find 3 centers that seem like they would be good centers, and estimate three variances by looking at the plotted function against the data (hint: some of the Gaussians may be negatively weighted and be upside down). Then implement the function `rbf_augmentation`, which takes as input a data matrix with only one feature and outputs a data matrix augmented with the values of the Gaussian functions, centered at the chosen centers, evaluated at the data. The augmented features are

$$\phi_i(x) = \exp\left(-\frac{(x - c_i)^2}{2\sigma_i^2}\right), \quad i = 1, 2, 3,$$

where $c_1, c_2, c_3$ are your centers and $\sigma_1, \sigma_2, \sigma_3$ are the respective variances of the Gaussians.
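A sketch of `rbf_augmentation` under the formula above; the parameter names `centers` and `variances` are assumptions, since the starter code's signature is not shown here:

```python
import numpy as np

def rbf_augmentation(X, centers, variances):
    """Augment a single-feature data matrix X (shape (n, 1)) with Gaussian
    RBF features exp(-(x - c_i)^2 / (2 * sigma_i^2)).

    Illustrative sketch; the starter code's signature may differ."""
    x = X[:, 0:1]                      # keep a column shape for broadcasting
    c = np.asarray(centers)[None, :]   # shape (1, 3)
    v = np.asarray(variances)[None, :]
    return np.exp(-(x - c) ** 2 / (2 * v))

# Example with made-up centers and variances.
X = np.array([[0.0], [1.0]])
feats = rbf_augmentation(X, centers=[0.0, 1.0, 2.0], variances=[1.0, 1.0, 1.0])
print(feats)
```

Each feature peaks at 1 when the input sits exactly on the corresponding center and decays with distance.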

3 Probabilistic Principal Components Analysis (PPCA)

In this question, we explore a more general form of PCA, called PPCA. We start with a latent variable model

$$y = Wx + \mu + z$$

for $x \sim N(0, I)$, noise $z \sim N(0, \Sigma)$, and constants $\mu$, $W$. We will constrain ourselves to isotropic noise, i.e., $\Sigma = \sigma^2 I$.

(a) Characterize the distribution of $y$ conditioned on $x$.

(b) Now consider i.i.d. $x \sim N(0, I)$. Now characterize the distribution of $y$.
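A quick way to sanity-check your characterizations is to simulate the generative model and inspect empirical moments. The specific $W$, $\mu$, and $\sigma$ below are made up for illustration:

```python
import numpy as np

# Sample from the PPCA latent-variable model
# y = W x + mu + z, with x ~ N(0, I) and z ~ N(0, sigma^2 I).
rng = np.random.default_rng(0)
d_latent, d_obs, n = 2, 3, 200_000
W = rng.standard_normal((d_obs, d_latent))
mu = np.array([1.0, -2.0, 0.5])
sigma = 0.3

x = rng.standard_normal((n, d_latent))
z = sigma * rng.standard_normal((n, d_obs))
y = x @ W.T + mu + z

# Compare these empirical moments against the closed-form mean and
# covariance you derive on paper.
print(y.mean(axis=0))
print(np.cov(y.T))
```

The empirical mean and covariance of the samples should match whatever distribution you derive in part (b).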
4 Canonical Correlation Analysis

Assume that you have a database of images of words typed in two different fonts. $X, Y \in \mathbb{R}^{n \times d}$ correspond to the datasets of font 1 and font 2, respectively. Think of the database $X$ as being composed of $n$ independent draws (samples) from a random variable $x \in \mathbb{R}^d$, and similarly $Y$ as $n$ draws from a random variable $y \in \mathbb{R}^d$. Your goal is to use machine learning to build a text recognizer for word images.
(a) Explain why you would want to consider using CCA in this problem.
(b) Assume that the data matrices $X$ and $Y$ include zero-mean features of the word images. Given two unit-length vectors $u, v \in \mathbb{R}^d$, compute the correlation coefficient of the random variables $x$, $y$ projected onto $u$, $v$, i.e., compute the correlation coefficient between $u^\top x$ and $v^\top y$. The correlation coefficient between two scalar random variables $P$ and $Q$ is computed by:

$$\rho(P, Q) = \frac{\operatorname{cov}(P, Q)}{\sigma_P \sigma_Q}$$
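The quantity asked for in part (b) can also be computed numerically. The sketch below projects synthetic zero-mean data onto unit directions and forms the sample correlation; the data, $u$, and $v$ are invented for illustration:

```python
import numpy as np

# Sample correlation between the projections u^T x and v^T y.
rng = np.random.default_rng(1)
n, d = 5000, 4
X = rng.standard_normal((n, d))
Y = X @ rng.standard_normal((d, d)) + 0.5 * rng.standard_normal((n, d))
X -= X.mean(axis=0)          # zero-mean features, as in the problem
Y -= Y.mean(axis=0)

u = np.ones(d) / np.sqrt(d)  # unit-length directions
v = np.ones(d) / np.sqrt(d)

p, q = X @ u, Y @ v
# For zero-mean p, q this equals cov(p, q) / (sigma_p * sigma_q).
rho = (p @ q) / (np.linalg.norm(p) * np.linalg.norm(q))
print(rho)
```

CCA then searches over the directions $u$, $v$ to make this correlation as large as possible.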
(c) Assume that the features of matrix X are rescaled to have values between -1 and 1. How does
this change the correlation coefficient?
5 Total Least Squares
In most of the models we have looked at so far, we’ve accounted for noise in the observed y
measurement and adjusted accordingly. However, in the real world it could easily be that our
feature matrix X of data is also corrupted or noisy. Total least squares is a way to account for this.
Whereas previously we were minimizing the y distance from the data point to our predicted line
because we had assumed the features were definitively accurate, now we are minimizing the entire
distance from the data point to our predicted line. In this problem we will explore the mathematical
intuition for the TLS formula. We will then apply the formula to adjusting the lighting of an image
which contains noise in its feature matrix due to inaccurate assumptions we make about the image,
such as the image being a perfect sphere.
Let $X \in \mathbb{R}^{n \times d}$ and $y \in \mathbb{R}^n$ be the measurements. Recall that in the least squares problem, we want to solve for $w$ in $\min_w \|Xw - y\|$. We measure the error as the difference between $Xw$ and $y$, which can be viewed as adding an error term $\epsilon_y$ such that the equation $Xw = y + \epsilon_y$ has a solution:

$$\min_{\epsilon_y, w} \|\epsilon_y\|_2^2, \quad \text{subject to } Xw = y + \epsilon_y \tag{1}$$

Although this optimization formulation allows for errors in the measurements of $y$, it does not allow for errors in the feature matrix $X$ that is measured from the data. In this problem, we will explore a method called total least squares that allows for error in both the matrix $X$ and the vector $y$, represented by $\epsilon_X$ and $\epsilon_y$, respectively. For convenience, we absorb the negative sign into $\epsilon_y$ and $\epsilon_X$ and define true measurements $y_{\text{true}}$ and $X_{\text{true}}$ like so:

$$y_{\text{true}} = y + \epsilon_y \tag{2}$$
$$X_{\text{true}} = X + \epsilon_X \tag{3}$$
Specifically, the total least squares problem is to find the solution for $w$ in the following minimization problem:

$$\min_{\epsilon_y, \epsilon_X, w} \|[\epsilon_X, \epsilon_y]\|_F^2, \quad \text{subject to } (X + \epsilon_X)w = y + \epsilon_y \tag{4}$$

where the matrix $[\epsilon_X, \epsilon_y]$ is the concatenation of the columns of $\epsilon_X$ with the column vector $\epsilon_y$. Recall that the subscript $F$ from discussion denotes the Frobenius norm of a matrix. Notice that the minimization is over $w$ because it is a free parameter, and it does not necessarily have to be in the objective function. Intuitively, this equation finds the smallest perturbation to the matrix of data points $X$ and the outputs $y$ such that the linear model can be solved exactly. The constraint in the minimization problem can be rewritten as

$$[X + \epsilon_X, y + \epsilon_y] \begin{bmatrix} w \\ -1 \end{bmatrix} = 0 \tag{5}$$
(a) Let the matrix $X \in \mathbb{R}^{n \times d}$ and $y \in \mathbb{R}^n$, and note that $\epsilon_X \in \mathbb{R}^{n \times d}$ and $\epsilon_y \in \mathbb{R}^n$. Assuming that $n > d$ and $\operatorname{rank}(X + \epsilon_X) = d$, explain why $\operatorname{rank}([X + \epsilon_X, y + \epsilon_y]) = d$.
(b) For the solution $w$ to be unique, the matrix $[X + \epsilon_X, y + \epsilon_y]$ must have exactly $d$ linearly independent columns. Since this matrix has $d + 1$ columns in total, it must be rank-deficient by 1. Recall that we can rewrite $[X, y]$ via its SVD as $U \Sigma V^\top$ for orthonormal $U$, $V$ and diagonal $\Sigma$. The Eckart-Young-Mirsky theorem tells us that the closest lower-rank matrix in the Frobenius norm is obtained by discarding the smallest singular values. Therefore, the matrix $[X + \epsilon_X, y + \epsilon_y]$ that minimizes

$$\|[\epsilon_X, \epsilon_y]\|_F^2 = \|[X_{\text{true}}, y_{\text{true}}] - [X, y]\|_F^2$$

is given by

$$[X + \epsilon_X, y + \epsilon_y] = U \begin{bmatrix} \Sigma_d & 0 \\ 0 & 0 \end{bmatrix} V^\top$$
Suppose we express the SVD of $[X, y]$ in terms of submatrices and vectors:

$$[X, y] = \begin{bmatrix} U_{xx} & u_{xy} \\ u_{yx}^\top & u_{yy} \end{bmatrix} \begin{bmatrix} \Sigma_d & \\ & \sigma_{d+1} \end{bmatrix} \begin{bmatrix} V_{xx} & v_{xy} \\ v_{yx}^\top & v_{yy} \end{bmatrix}^\top$$

Here $u_{xy} \in \mathbb{R}^{n-1}$ is the first $(n-1)$ elements of the $(d+1)$-th column of $U$; $u_{yx}^\top \in \mathbb{R}^d$ is the first $d$ elements of the $n$-th row of $U$; $u_{yy}$ is the $n$-th element of the $(d+1)$-th column of $U$; and $U_{xx} \in \mathbb{R}^{(n-1) \times d}$ is the $(n-1) \times d$ top-left submatrix of $U$.

Similarly, $v_{xy} \in \mathbb{R}^d$ is the first $d$ elements of the $(d+1)$-th column of $V$; $v_{yx}^\top \in \mathbb{R}^d$ is the first $d$ elements of the $(d+1)$-th row of $V$; $v_{yy}$ is the $(d+1)$-th element of the $(d+1)$-th column of $V$; and $V_{xx} \in \mathbb{R}^{d \times d}$ is the $d \times d$ top-left submatrix of $V$. $\sigma_{d+1}$ is the $(d+1)$-th singular value of $[X, y]$, and $\Sigma_d$ is the diagonal matrix of the $d$ largest singular values of $[X, y]$.
Using this information, show that

$$[\epsilon_X, \epsilon_y] = -\begin{bmatrix} u_{xy} \\ u_{yy} \end{bmatrix} \sigma_{d+1} \begin{bmatrix} v_{xy} \\ v_{yy} \end{bmatrix}^\top$$
(c) Using the result from the previous part and the fact that $v_{yy}$ is not 0 (see the notes on Total Least Squares), find a nonzero solution to

$$[X + \epsilon_X, y + \epsilon_y] \begin{bmatrix} w \\ -1 \end{bmatrix} = 0$$

and thus solve for $w$ in Equation (5).

HINT: Looking at the last column of the product $[X, y]V$ may or may not be useful for this problem, depending on how you solve it.
(d) From the previous part, you can see that $\begin{bmatrix} w \\ -1 \end{bmatrix}$ is a right-singular vector of $[X, y]$. Show that

$$(X^\top X - \sigma_{d+1}^2 I) w = X^\top y \tag{14}$$
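The identity in part (d) can be checked numerically once you have a candidate $w$. The sketch below builds $[X, y]$ on synthetic data, takes its SVD, reads a candidate $w$ off the last right-singular vector (scaled so the last entry is $-1$), and verifies the identity. It is a sanity check on made-up data, not a substitute for the derivation:

```python
import numpy as np

# Synthetic regression data with a bit of noise.
rng = np.random.default_rng(2)
n, d = 100, 3
X = rng.standard_normal((n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(n)

# SVD of the concatenated matrix [X, y].
Z = np.column_stack([X, y])
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
v = Vt[-1]                    # right-singular vector for sigma_{d+1}
w_tls = -v[:d] / v[d]         # scale so the last entry of [w; -1] is -1

# Check (X^T X - sigma_{d+1}^2 I) w = X^T y from part (d).
lhs = (X.T @ X - s[-1] ** 2 * np.eye(d)) @ w_tls
rhs = X.T @ y
print(np.allclose(lhs, rhs))
```

If the identity fails for your derived $w$, the scaling of the singular vector is the usual culprit.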
(e) In this problem, we will use total least squares to approximately learn the lighting in a photograph, which we can then use to paste new objects into the image while still maintaining the realism of the image. You will be estimating the lighting coefficients for the interior of St. Peter's Basilica, and you will then use these coefficients to change the lighting of an image of a tennis ball so that it can be pasted into the image. In Figure 1, we show the result of pasting the tennis ball into the image without adjusting the lighting on the ball. The ball looks too bright for the scene and does not look like it would fit in with other objects in the image.

Figure 1: Tennis ball pasted on top of image of St. Peter's Basilica without lighting adjustment (left) and with lighting adjustment (right)

Figure 2: Image of a spherical mirror inside of St. Peter's Basilica
To convincingly add a tennis ball to an image, we need to apply the appropriate lighting from the environment onto the added ball. To start, we will represent environment lighting as a spherical function $f(n)$, where $n$ is a 3-dimensional unit vector ($\|n\|_2 = 1$) and $f$ outputs a 3-dimensional color vector, one component each for red, green, and blue light intensities. Because $f(n)$ is a spherical function, the input $n$ must correspond to a point on a sphere. The function $f(n)$ represents the total incoming light from the direction $n$ in the scene. The lighting function of a spherical object $f(n)$ can be approximated by the first 9 spherical harmonic basis functions.
The first 9 un-normalized spherical harmonic basis functions are given by:

$$L_1 = 1, \quad L_2 = y, \quad L_3 = x, \quad L_4 = z, \quad L_5 = xy, \quad L_6 = yz, \quad L_7 = 3z^2 - 1, \quad L_8 = xz, \quad L_9 = x^2 - y^2$$

where $n = [x, y, z]^\top$. The lighting function can then be approximated as

$$f(n) \approx \sum_{i=1}^{9} \gamma_i L_i(n)$$

$$\underbrace{\begin{bmatrix} \text{---}\; f(n_1) \;\text{---} \\ \text{---}\; f(n_2) \;\text{---} \\ \vdots \\ \text{---}\; f(n_n) \;\text{---} \end{bmatrix}}_{n \times 3} = \underbrace{\begin{bmatrix} L_1(n_1) & L_2(n_1) & \cdots & L_9(n_1) \\ L_1(n_2) & L_2(n_2) & \cdots & L_9(n_2) \\ \vdots & & & \vdots \\ L_1(n_n) & L_2(n_n) & \cdots & L_9(n_n) \end{bmatrix}}_{n \times 9} \underbrace{\begin{bmatrix} \text{---}\; \gamma_1 \;\text{---} \\ \text{---}\; \gamma_2 \;\text{---} \\ \vdots \\ \text{---}\; \gamma_9 \;\text{---} \end{bmatrix}}_{9 \times 3}$$

where $L_i(n)$ is the $i$th basis function from the list above.
The function of incoming light $f(n)$ can be measured by photographing a spherical mirror placed in the scene of interest. In this case, we provide you with an image of the sphere, as seen in Figure 2. In the code provided, there is a function `extractNormals(img)` that will extract the training pairs $(n_i, f(n_i))$ from the image. An example using this function is in the code.

Use the spherical harmonic basis functions to create a 9-dimensional feature vector for each sample. Use this to formulate an ordinary least squares problem and solve for the unknown coefficients $\gamma_i$. Report the estimated values for $\gamma_i$ and include a visualization of the approximation using the provided code. The code provided will load the images, extract the training data, relight the tennis ball with incorrect coefficients, and save the results. Your task is to compute the basis functions and solve the least squares problem to provide the code with the correct coefficients. To run the starter code, you will need to use Python with numpy and scipy. Because the resulting data set is large, we reduce it in the code by taking every 50th entry in the data. This is done for you in the starter code, but you can try using the entire data set or reducing it by a different amount.
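The featurization step can be sketched as follows. The synthetic normals and colors below stand in for the real pairs returned by `extractNormals(img)`, whose exact output format is not shown in this write-up:

```python
import numpy as np

def sh_basis(N):
    """Evaluate the 9 un-normalized spherical harmonic basis functions at
    each unit normal in N (shape (n, 3), columns x, y, z)."""
    x, y, z = N[:, 0], N[:, 1], N[:, 2]
    return np.column_stack([
        np.ones_like(x), y, x, z, x * y, y * z,
        3 * z ** 2 - 1, x * z, x ** 2 - y ** 2,
    ])

# Stand-ins for the training pairs (n_i, f(n_i)): random unit normals,
# and colors generated to lie exactly in the span of the basis.
rng = np.random.default_rng(3)
N = rng.standard_normal((500, 3))
N /= np.linalg.norm(N, axis=1, keepdims=True)
F = sh_basis(N) @ rng.standard_normal((9, 3))

# Ordinary least squares for the 9 x 3 coefficient matrix gamma.
gamma, *_ = np.linalg.lstsq(sh_basis(N), F, rcond=None)
```

On the real data, `F` would be the n x 3 matrix of extracted colors, and `gamma` the coefficients handed back to the starter code.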
(f) When we extract the direction $n$ from the data to compute $(n_i, f(n_i))$, we make some approximations about how the light is captured in the image. We also assume that the spherical mirror is a perfect sphere, but in reality there will always be small imperfections. Thus, our measurement of $n$ contains some error, which makes this an ideal problem for total least squares. Solve this problem with total least squares by allowing perturbations in the matrix of basis functions. Report the estimated values for $\gamma_i$ and include a visualization of the approximation. The output image will be visibly wrong, and we'll explore how to fix this problem in the next part. Your implementation may only use the SVD and the matrix inverse functions from the linear algebra library in numpy, as well as `np.linalg.solve`.
(g) In the previous part, you should have noticed that the visualization is drastically different from the one generated using least squares. Recall that in total least squares we are minimizing $\|[\epsilon_X, \epsilon_y]\|_F^2$. Intuitively, to minimize the Frobenius norm of perturbations to both the inputs and outputs, the inputs and outputs should be on the same scale. However, this is not the case here. Color values in an image are typically in $[0, 255]$, but the original image had a much larger range. We compressed the range to a smaller scale using tone mapping, but the effect of the compression is that relatively bright areas of the image become less bright. As a compromise, we scaled the image colors down to a maximum color value of 384 instead of 255. Thus, the inputs here are all unit vectors, and the outputs are 3-dimensional vectors where each value is in $[0, 384]$. Propose a value by which to scale the outputs $f(n_i)$ such that the values of the inputs and outputs are roughly on the same scale. Solve this scaled total least squares problem, report the computed spherical harmonic coefficients, and provide a rendering of the relit sphere.