Homework # 4

Learning Goal

The goal of this fourth exercise is to understand and implement one kind of generic variational inference algorithm.

Task

Using the graphical model output of the Daphne compiler, write a black-box variational inference (BBVI) evaluator following Section 4.4 of the book. Carefully note the difference between MCMC and VI evaluators: a VI evaluator must first solve an optimization problem, and only then can posterior samples of the return expression be produced. Also note that the persistent “state” of the posterior is larger and, in some sense, more tangible: the parameter vector of the variational distribution (as opposed to the “last sample” of a burned-in MCMC sampler).
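For orientation, the optimize-then-sample workflow might look roughly like the following sketch. Everything here is illustrative rather than the support code's API: the graphical model is collapsed to a single latent variable, Program 1's log-joint is hard-coded as `log_joint`, and the variational family is a single Normal with its scale optimized in log space.

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)

# Illustrative stand-in for the log-joint the compiled graph would give you
# (here: Program 1's prior N(1, sqrt(5)) and two observations under N(mu, sqrt(2)))
def log_joint(mu):
    lp = Normal(1.0, 5.0 ** 0.5).log_prob(mu)
    lik = Normal(mu, 2.0 ** 0.5)
    return lp + lik.log_prob(torch.tensor(8.0)) + lik.log_prob(torch.tensor(9.0))

# Step 1: solve the optimization problem (maximize the ELBO over lam)
lam = torch.tensor([0.0, 0.0], requires_grad=True)   # [loc, log scale]
opt = torch.optim.Adam([lam], lr=0.05)
for step in range(3000):
    opt.zero_grad()
    q = Normal(lam[0], lam[1].exp())
    z = q.sample((64,))                  # sample() does not track gradients
    log_q = q.log_prob(z)
    w = (log_joint(z) - log_q).detach()
    w = w - w.mean()                     # simple baseline to reduce variance
    loss = -(log_q * w).mean()           # score-function gradient of the ELBO
    loss.backward()
    opt.step()

# Step 2: only now draw posterior samples of the return expression
posterior_samples = Normal(lam[0], lam[1].exp()).sample((10000,))
print(posterior_samples.mean())          # should approach ~7.25 for this model
```

The baseline subtraction is one of the simplest variance-reduction tricks for the score-function estimator; without it, optimization on even this conjugate model can be noticeably noisier.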

Hints:

  • You will have to assign, access, and utilize a log-density gradient function for every variational distribution. Autodiff through PyTorch distribution object log_prob methods achieves this directly. The support code for this homework provides implementations of this for you.

  • You may not wish to use the same distribution type in the variational posterior as in the prior (in fact, in general you should not; mixtures or normalizing flows are generally far better options). Whichever choice you make, however, you must ensure it has the same support; otherwise the return expression may not be evaluable, or the state of the program may otherwise be invalid and therefore not correspond to a valid sample.
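To make the first hint concrete, here is a minimal sketch (not the support code itself) of getting log-density gradients with respect to variational parameters by autodiff through `log_prob`, with the scale parameter optimized in log space so it stays positive:

```python
import torch
from torch.distributions import Normal

# Variational parameters as leaf tensors
loc = torch.tensor(0.0, requires_grad=True)
log_scale = torch.tensor(0.0, requires_grad=True)  # optimize in log space

q = Normal(loc, log_scale.exp())
z = torch.tensor(1.5)
q.log_prob(z).backward()

# For N(0, 1): d/dloc log q(z) = (z - loc) / scale^2 = 1.5
print(loc.grad)        # tensor(1.5000)
# d/dlog_scale log q(z) = (z - loc)^2 / scale^2 - 1 = 1.25
print(log_scale.grad)  # tensor(1.2500)
```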

Rubric

Submit a .pdf file to gradescope.ca with clickable URLs to personalized wandb reports for each section. Use wandb code logging to submit your code.

Convergence of posterior expectations within wall-clock time constraints is to be peer assessed for the short FOPPL programs below. Posterior expectations, variances, or other quantities where indicated that are all within 5% of ground truth for a single program within 1 hour of wall-clock runtime will receive full marks. Each program is worth 12 points. The maximum number of points achievable on this assignment is 5 * 12 = 60.

So that we can evaluate your results, log and provide plots of the BBVI loss as a function of optimizer step for each program.

Program 1)

(let [mu (sample (normal 1 (sqrt 5)))
      sigma (sqrt 2)
      lik (normal mu sigma)]
  (observe lik 8)
  (observe lik 9)
  mu)

Report the posterior expected value of mu. Also report the variational distribution (parameters and a plot) for the random variable corresponding to the program variable mu.
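Program 1 is conjugate, so you can sanity-check your BBVI result against the closed-form normal-normal posterior. This is just a quick check, not part of the required deliverable:

```python
# Conjugate normal-normal posterior for Program 1:
# prior N(1, var 5), likelihood N(mu, var 2), observations 8 and 9
prior_mean, prior_var = 1.0, 5.0
lik_var = 2.0
obs = [8.0, 9.0]

post_prec = 1.0 / prior_var + len(obs) / lik_var
post_var = 1.0 / post_prec
post_mean = post_var * (prior_mean / prior_var + sum(obs) / lik_var)
print(post_mean, post_var)   # 7.25, 0.8333...
```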

Program 2)

(defn observe-data [_ data slope bias]
  (let [xn (first data)
        yn (second data)
        zn (+ (* slope xn) bias)]
    (observe (normal zn 1.0) yn)
    (rest (rest data))))

(let [slope (sample (normal 0.0 10.0))
      bias  (sample (normal 0.0 10.0))
      data (vector 1.0 2.1 2.0 3.9 3.0 5.3
                   4.0 7.7 5.0 10.2 6.0 12.9)]
  (loop 6 data observe-data slope bias)
  [(vector slope bias) (+ (* slope 10.0) bias)])

Report the posterior expected values of both slope and bias, and the posterior predictive distribution at 0.0.
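Program 2 is also conjugate: with the noise standard deviation fixed at 1.0 it is Bayesian linear regression, so the posterior over [slope, bias] has a closed form you can compare against. A sketch of that ground-truth computation (assuming, as in the program, independent N(0, 10) priors, i.e. prior variance 100):

```python
import torch

# Closed-form Gaussian posterior for Bayesian linear regression (known noise)
xs = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], dtype=torch.float64)
ys = torch.tensor([2.1, 3.9, 5.3, 7.7, 10.2, 12.9], dtype=torch.float64)
X = torch.stack([xs, torch.ones_like(xs)], dim=1)        # design matrix [x, 1]

prior_prec = torch.eye(2, dtype=torch.float64) / 100.0   # N(0, 10) => variance 100
noise_var = 1.0

post_prec = X.T @ X / noise_var + prior_prec
post_cov = torch.inverse(post_prec)
post_mean = post_cov @ (X.T @ ys) / noise_var
slope, bias = post_mean.tolist()
print(slope, bias)            # roughly 2.149 and -0.505

# Posterior predictive at x = 0.0: mean is the bias; variance adds the noise
pred_var = post_cov[1, 1] + noise_var
print(pred_var.item())        # roughly 1.86
```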

Program 3)

(let [data [1.1 2.1 2.0 1.9 0.0 -0.1 -0.05]
      likes (foreach 3 []
                     (let [mu (sample (normal 0.0 10.0))
                           sigma (sample (gamma 1.0 1.0))]
                       [(normal mu sigma) [mu sigma]]))
      pi (sample (dirichlet [1.0 1.0 1.0]))
      z-prior (discrete pi)
      z (foreach 7 [y data]
          (let [z (sample z-prior)]
            (observe (first (get likes z)) y)))]
  [(= (first z) (second z))
   pi
   z-prior
   (foreach 3 [x likes] (second (get likes x)))])

Report the posterior probability that the first and second datapoints are in the same cluster. Plot the posterior distributions of pi, z-prior, mu, and sigma, the latter two for all three components of the mixture. A nice way to plot pi and mu+sigma is to plot the mixture density and its individual components.

Also, in your wandb report, write a paragraph or two about the mode-seeking behavior of VI on models like this that have internal symmetries. Note what the symmetries are and why they might cause problems with optimization. You may wish, in fact, to plot the mixture components as above at multiple points in the optimization path.
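The symmetry in question can be demonstrated numerically on a stripped-down, hypothetical two-component version of the model: permuting the component labels (and relabeling the assignments accordingly) leaves the log-joint exactly unchanged, so the posterior has multiple equivalent modes that a mode-seeking VI objective must pick between.

```python
import torch
from torch.distributions import Normal, Categorical

# Toy two-component mixture-assignment log-joint (illustrative, not Program 3)
def log_joint(mus, zs, data):
    lp = Normal(0.0, 10.0).log_prob(mus).sum()            # component means' prior
    lp = lp + Categorical(torch.tensor([0.5, 0.5])).log_prob(zs).sum()
    lp = lp + Normal(mus[zs], 1.0).log_prob(data).sum()   # likelihood
    return lp

data = torch.tensor([1.1, 2.1, -0.1])
mus = torch.tensor([2.0, 0.0])
zs = torch.tensor([0, 0, 1])

# Swapping the labels (and relabeling assignments) leaves the joint unchanged
a = log_joint(mus, zs, data)
b = log_joint(mus.flip(0), 1 - zs, data)
print(a, b)   # identical values
```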

Program 4)

This program corresponds to Bayesian neural network learning.

(let [weight-prior (normal 0 1)
      W_0 (foreach 10 []
            (foreach 1 [] (sample weight-prior)))
      W_1 (foreach 10 []
            (foreach 10 [] (sample weight-prior)))
      W_2 (foreach 1 []
            (foreach 10 [] (sample weight-prior)))

      b_0 (foreach 10 []
            (foreach 1 [] (sample weight-prior)))
      b_1 (foreach 10 []
            (foreach 1 [] (sample weight-prior)))
      b_2 (foreach 1 []
            (foreach 1 [] (sample weight-prior)))

      x   (mat-transpose [[1] [2] [3] [4] [5]])
      y   [[1] [4] [9] [16] [25]]
      h_0 (mat-tanh (mat-add (mat-mul W_0 x)
                             (mat-repmat b_0 1 5)))
      h_1 (mat-tanh (mat-add (mat-mul W_1 h_0)
                             (mat-repmat b_1 1 5)))
      mu  (mat-transpose
            (mat-tanh (mat-add (mat-mul W_2 h_1)
                               (mat-repmat b_2 1 5))))]
(foreach 5 [y_r y
            mu_r mu]
   (foreach 1 [y_rc y_r
               mu_rc mu_r]
      (observe (normal mu_rc 1) y_rc)))
[W_0 b_0 W_1 b_1
 (let [h_0 (mat-tanh (mat-add (mat-mul W_0 [[6.0]]) b_0))
       h_1 (mat-tanh (mat-add (mat-mul W_1 h_0) b_1))
       mu (mat-transpose (mat-tanh (mat-add (mat-mul W_2 h_1) b_2)))]
       mu)])

Report the posterior distributions of all of the return tensors as heatmaps. Note that the last of these (the network output at x = 6.0) is the posterior predictive.
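One way to produce the requested heatmaps is to reduce posterior samples of each weight tensor to elementwise summary grids. A sketch, in which random placeholder draws stand in for samples of W_1 from your optimized variational posterior:

```python
import torch

# Hypothetical: summarize posterior samples of a 10x10 weight matrix.
# The random draws below are placeholders for samples from your guide.
torch.manual_seed(0)
samples = torch.randn(1000, 10, 10) * 0.3 + 0.5

heat_mean = samples.mean(dim=0)   # 10x10 grid of posterior means
heat_std = samples.std(dim=0)     # 10x10 grid of posterior standard deviations
print(heat_mean.shape, heat_std.shape)
```

The resulting grids can then be rendered with, e.g., matplotlib's imshow and logged to your wandb report.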

Write a paragraph or two comparing and contrasting mean-field black-box variational inference to parameter estimation via gradient descent. Commenting on the difference between the Stan result from HW1 and the posterior distribution learned here would be excellent.

Program 5)

(let [m (sample (normal 0 5))
      s (sample (uniform-continuous 0.01 (abs m)))]
  (observe (normal 0 s) 7)
  s)

Report the learned variational distribution for the variable address corresponding to the program variable s.

Note that this problem is a “trick” problem as well, in that it intentionally highlights a problem with BBVI-style interpreters relating to tracking the support of random variables. Here, clearly, the support of the variational approximation to the posterior on s should be the union of the supports of s under different values of m, which in this case is the positive real line. So, in general, we would like either a) a facility for determining the support of RVs via another type of non-standard interpretation (a possible project idea), or b) to be overly conservative, rely upon the computer-science “type” (here s is a float), and use variational distributions whose support covers that type in general.
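One way to realize option b) is a transformed variational family whose support is the positive reals regardless of what m does, for example pushing a Normal through exp (a log-normal guide). This is one possible choice, not the support code's mechanism:

```python
import torch
from torch.distributions import Normal, TransformedDistribution
from torch.distributions.transforms import ExpTransform

# A positive-support guide for s: a Normal pushed through exp (log-normal)
loc = torch.tensor(0.0, requires_grad=True)
log_scale = torch.tensor(0.0, requires_grad=True)

q = TransformedDistribution(Normal(loc, log_scale.exp()), [ExpTransform()])
s = q.sample((5,))
print((s > 0).all())              # support is (0, inf) by construction

q.log_prob(s).sum().backward()    # gradients still flow to loc and log_scale
print(loc.grad is not None, log_scale.grad is not None)
```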

Infrastructure

Support Code

As indicated above, there is support code available for this HW.

This source code provides features that should reduce your implementation effort tremendously. In particular, PyTorch distribution objects are provided that externalize parameters as leaf tensors and, automagically, transform the distribution parameterization into one that enables unconstrained optimization.