CSCI-GA.2565-001 Machine Learning: Homework 3

1 Variational Inference and Monte Carlo Gradients

In this question, we will review the details of variational inference (VI); in particular, we will implement the gradient estimators that make VI tractable.

We consider the latent variable model $p(z, x) = \prod_{i=1}^N p(x_i \mid z_i)\, p(z_i)$ where $x_i, z_i \in \mathbb{R}^D$. Recall that in VI, we find an approximation $q_\lambda(z)$ to $p(z \mid x)$.

(A) Let $V_1(\lambda)$ be the set of variational approximations $\{q_\lambda : q_\lambda(z) = \prod_{i=1}^N q(z_i ; \lambda_i)\}$ where $\lambda_i$ are parameters learned for each datapoint $x_i$. Now consider $f_\lambda(x)$ as a deep neural network with fixed architecture where $\lambda$ parametrizes the network. Let $V_2(\lambda) = \{q_\lambda : q_\lambda(z) = \prod_{i=1}^N q(z_i ; f_\lambda(x_i))\}$. Which of the two families ($V_1$ or $V_2$) is more expressive, i.e. approximates a larger set of distributions? Prove your answer.

Will your answer change if we let $f_\lambda$ have a variable architecture, e.g. if $\lambda$ parametrizes the set of multi-layer perceptrons of all sizes? Why or why not?

Solution. Any member of $V_2$ can be realized in $V_1$ by setting $\lambda_i = f_\lambda(x_i)$ for each datapoint, so $V_2 \subseteq V_1$; conversely, $V_1$ may assign each $\lambda_i$ freely, whereas a fixed-architecture $f_\lambda(x_i)$ can at most approximate the map $x_i \mapsto \lambda_i$. Hence $V_1$ is more expressive than (or at least as expressive as) $V_2$. The answer does not change if the architecture is variable: $f_\lambda$ is still a function of $x_i$, so at best it recovers the free per-datapoint parameters of $V_1$.
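To make the two families concrete, here is a minimal PyTorch sketch contrasting them; the mean-field Gaussian form of $q$, the encoder architecture, and the sizes are illustrative assumptions, not part of the problem.

```python
import torch
import torch.nn as nn

N, D = 100, 2  # illustrative sizes

# V1: free per-datapoint variational parameters lambda_i = (mu_i, log_sigma_i).
v1_mu = nn.Parameter(torch.zeros(N, D))
v1_log_sigma = nn.Parameter(torch.zeros(N, D))

# V2: amortized parameters lambda_i = f_lambda(x_i) for a fixed encoder f_lambda.
class Encoder(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 2 * dim))

    def forward(self, x):
        mu, log_sigma = self.net(x).chunk(2, dim=-1)
        return mu, log_sigma

# Any member of V2 lies in V1: copy the encoder's outputs into the free parameters.
encoder = Encoder(D)
x = torch.randn(N, D)  # placeholder data
with torch.no_grad():
    mu, log_sigma = encoder(x)
    v1_mu.copy_(mu)
    v1_log_sigma.copy_(log_sigma)
```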

(B) For variational inference to work, we need to compute unbiased estimates of the gradient of the ELBO.

In class, we learnt two such estimators: score function (REINFORCE) and pathwise (reparametrization) gradients. Let us see this in practice for a simpler inference problem.

Consider the dataset of $N = 100$ one-dimensional data points $\{x_i\}_{i=1}^N$ in data.csv. Suppose we want to minimize the following expectation with respect to a parameter $\mu$:

$$\min_\mu \; \mathbb{E}_{z \sim \mathcal{N}(\mu, 1)}\left[ \sum_{i=1}^N (x_i - z)^2 \right] \tag{1}$$

(i) Write down the score function gradient for this problem. Using a suitable reparameterization, write down the reparameterization gradient for this problem.

Solution. Score function gradient:

$$\nabla_\mu \, \mathbb{E}_{z \sim \mathcal{N}(\mu,1)}\left[\sum_{i=1}^N (x_i - z)^2\right] = \mathbb{E}_{z \sim \mathcal{N}(\mu,1)}\left[\sum_{i=1}^N (x_i - z)^2 \; \nabla_\mu \log\!\left( \frac{1}{\sqrt{2\pi}} \exp\!\left( -\frac{(z - \mu)^2}{2} \right) \right)\right] = \mathbb{E}_{z \sim \mathcal{N}(\mu,1)}\left[\sum_{i=1}^N (x_i - z)^2 \,(z - \mu)\right]$$

Reparameterization gradient: writing $z = \mu + \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$,

$$\nabla_\mu \, \mathbb{E}_{z \sim \mathcal{N}(\mu,1)}\left[\sum_{i=1}^N (x_i - z)^2\right] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0,1)}\left[\nabla_\mu \sum_{i=1}^N \big(x_i - (\mu + \epsilon)\big)^2\right] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0,1)}\left[-2 \sum_{i=1}^N \big(x_i - (\mu + \epsilon)\big)\right]$$
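Both estimators can be computed in a few lines of PyTorch. The sketch below averages the single-sample expressions above over $M$ samples; it assumes the data have already been loaded into a 1-D tensor x, and the function names are our own.

```python
import torch

def score_function_grad(mu, x, M=1000):
    # Monte Carlo estimate of E_z[ f(z) * d/dmu log N(z; mu, 1) ] = E_z[ f(z) * (z - mu) ]
    z = mu + torch.randn(M)                       # z ~ N(mu, 1)
    f = ((x[None, :] - z[:, None]) ** 2).sum(1)   # f(z) = sum_i (x_i - z)^2, one value per sample
    return (f * (z - mu)).mean()

def reparam_grad(mu, x, M=1000):
    # Monte Carlo estimate of E_eps[ -2 * sum_i (x_i - (mu + eps)) ]
    eps = torch.randn(M)
    return (-2.0 * (x[None, :] - (mu + eps)[:, None]).sum(1)).mean()
```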

(ii) Using PyTorch, for each of these two gradient estimators perform gradient descent using $M \in \{1, 10, 100, 1000\}$ gradient samples for $T = 10$ trials. Plot the mean and variance of the final estimate of $\mu$ for each value of $M$ across the trials.

You should have two graphs, one for each gradient estimator. Each graph should contain two plots, one for the means and one for the variances. The x-axis should be $M$, hence each of these plots will have four points.

Solution. Plots of the mean and variance of the final estimate of $\mu$ against $M$ for each estimator are omitted here; a sketch of the experiment is given below.
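A possible implementation of the experiment, reusing the estimator sketches above. The learning rate, number of steps, initialization, and CSV parsing are illustrative assumptions that the problem does not fix.

```python
import numpy as np
import torch
import matplotlib.pyplot as plt

# data.csv is assumed to hold one value per row.
x = torch.tensor(np.loadtxt("data.csv", delimiter=","), dtype=torch.float32).flatten()

def run(grad_fn, M, steps=500, lr=1e-4):
    mu = torch.tensor(0.0)
    for _ in range(steps):
        mu = mu - lr * grad_fn(mu, x, M)
    return mu.item()

Ms = [1, 10, 100, 1000]
T = 10
for name, grad_fn in [("score function", score_function_grad),
                      ("reparameterization", reparam_grad)]:
    finals = np.array([[run(grad_fn, M) for _ in range(T)] for M in Ms])  # shape (len(Ms), T)
    plt.figure()
    plt.plot(Ms, finals.mean(axis=1), marker="o", label="mean of final mu")
    plt.plot(Ms, finals.var(axis=1), marker="o", label="variance of final mu")
    plt.xscale("log")
    plt.xlabel("M")
    plt.legend()
    plt.title(name + " gradient")
plt.show()
```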

(C) What conditions do you require on $p(z)$ and $f(z)$ (here $f(z) = \sum_{i=1}^N (x_i - z)^2$) for each of the two gradient estimators to be valid? Do these apply to both continuous and discrete distributions $p(z)$?

Solution. For the score function gradient, $f(z)$ need not be differentiable (we only need to evaluate it), but $\log p(z)$ must be differentiable with respect to the parameters being optimized; this holds for both continuous and discrete $p(z)$. For the reparameterization gradient, $p(z)$ must be continuous and expressible as a differentiable transformation of a parameter-free base distribution, and $f(z)$ must be differentiable, so it does not apply to discrete $p(z)$.
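For instance, the score function estimator still applies when $p(z)$ is a discrete distribution, where no differentiable reparameterization of the samples exists. The sketch below uses a Bernoulli variational distribution with a made-up objective purely for illustration.

```python
import torch

def score_grad_bernoulli(theta, f, M=10000):
    """Score function gradient of E_{z ~ Bern(theta)}[f(z)] w.r.t. theta.

    d/dtheta log p(z) = z/theta - (1 - z)/(1 - theta); the samples themselves
    cannot be reparameterized differentiably, but this estimator is unbiased.
    """
    z = torch.bernoulli(torch.full((M,), theta))
    score = z / theta - (1 - z) / (1 - theta)
    return (f(z) * score).mean()

# Example: f(z) = (z - 0.3)^2, so E[f(z)] = 0.09 + 0.4*theta and the exact gradient is 0.4.
print(score_grad_bernoulli(0.5, lambda z: (z - 0.3) ** 2))
```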

2 Bayesian Parameters versus Latent Variables

(A) Consider the model $y_i \sim \mathcal{N}(w^\top x_i, \sigma^2)$ where the inverse variance is distributed $\lambda = \frac{1}{\sigma^2} \sim \mathrm{Gamma}(\alpha, \beta)$.

Show that the predictive distribution $y^\star \mid w, x^\star, \alpha, \beta$ for a datapoint $x^\star$ follows a generalized T distribution

$$T(t; \nu, \mu, \theta) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\Gamma(\nu/2)\, \theta \sqrt{\pi\nu}} \left( 1 + \frac{1}{\nu} \left( \frac{t - \mu}{\theta} \right)^2 \right)^{-\frac{\nu+1}{2}}$$

with degrees of freedom $\nu = 2\alpha$, mean $\mu = w^\top x^\star$, and scale $\theta = \sqrt{\beta/\alpha}$. You may use the property $\Gamma(k) = \int_0^\infty x^{k-1} e^{-x}\, dx$.

Solution. Combine the Gamma prior on the precision $\lambda = \frac{1}{\sigma^2}$ with the Gaussian likelihood and integrate $\lambda$ out:

$$p\!\left(\tfrac{1}{\sigma^2} = \lambda\right) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, \lambda^{\alpha-1} e^{-\beta\lambda}$$

$$p(y^\star \mid w, x^\star, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(y^\star - w^\top x^\star)^2}{2\sigma^2} \right)$$
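For completeness, a sketch of the marginalization; every step follows from the two densities above and the stated Gamma-function property.

$$p(y^\star \mid w, x^\star, \alpha, \beta) = \int_0^\infty \sqrt{\frac{\lambda}{2\pi}} \exp\!\left( -\frac{\lambda (y^\star - w^\top x^\star)^2}{2} \right) \frac{\beta^\alpha}{\Gamma(\alpha)}\, \lambda^{\alpha-1} e^{-\beta\lambda}\, d\lambda = \frac{\beta^\alpha}{\Gamma(\alpha)\sqrt{2\pi}} \int_0^\infty \lambda^{\alpha - \frac{1}{2}}\, e^{-\lambda\left( \beta + \frac{(y^\star - w^\top x^\star)^2}{2} \right)}\, d\lambda.$$

Substituting $u = \lambda\big(\beta + \tfrac{(y^\star - w^\top x^\star)^2}{2}\big)$ and applying the Gamma-function property gives

$$p(y^\star \mid w, x^\star, \alpha, \beta) = \frac{\Gamma\!\left(\alpha + \frac{1}{2}\right)}{\Gamma(\alpha)\sqrt{2\pi\beta}} \left( 1 + \frac{(y^\star - w^\top x^\star)^2}{2\beta} \right)^{-\left(\alpha + \frac{1}{2}\right)},$$

which equals $T(y^\star; \nu, \mu, \theta)$ with $\nu = 2\alpha$, $\mu = w^\top x^\star$, and $\theta = \sqrt{\beta/\alpha}$, since $2\beta = \nu\theta^2$ and $\sqrt{2\pi\beta} = \theta\sqrt{\pi\nu}$.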

(B) Using your expression in (A), write down the MLE objective for $w$ on $N$ arbitrary labelled datapoints $\{(x_i, y_i)\}_{i=1}^N$. Do not optimize this objective.

Solution. The MLE objective is to maximize the likelihood

$$\prod_{i=1}^N p(y_i \mid w, x_i, \alpha, \beta) = \prod_{i=1}^N \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\Gamma(\nu/2)\, \theta \sqrt{\pi\nu}} \left( 1 + \frac{1}{\nu} \left( \frac{y_i - w^\top x_i}{\theta} \right)^2 \right)^{-\frac{\nu+1}{2}}$$

over $w$, with $\nu = 2\alpha$ and $\theta = \sqrt{\beta/\alpha}$ as in (A).
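Equivalently, one can minimize the negative log-likelihood. A minimal PyTorch sketch of that loss is given below; the function name and the assumption that $\alpha$ and $\beta$ are fixed scalars are ours.

```python
import torch

def generalized_t_nll(w, X, y, alpha, beta):
    """Negative log-likelihood of the model in (A)/(B), up to terms constant in w.

    X: (N, D) design matrix, y: (N,) targets, w: (D,) weight vector.
    """
    nu = 2.0 * alpha
    theta = (beta / alpha) ** 0.5
    resid = (y - X @ w) / theta
    # -log prod_i T(y_i; nu, w^T x_i, theta), dropping the w-independent normalizer.
    return 0.5 * (nu + 1.0) * torch.log1p(resid ** 2 / nu).sum()
```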

(C) Now consider the model $y_i \sim \mathcal{N}\!\left( f(x_i, z_i, w), \sigma^2 \right)$ where $z_i \sim \mathcal{N}(0, I)$, $\sigma^2$ is known, and $f$ is a deep neural network parametrized by $w$.

(i) Write down an expression for the predictive distribution $y^\star \mid X, y, x^\star$, where $X, y$ denote the training datapoints. (You may leave your answer as an integral.)