ICML 2020

So the International Conference on Machine Learning (ICML) just ended and it was overall a great experience for me.

My favorite talks

This blog post aims to present some of the talks I really enjoyed watching, roughly in my order of preference. Clicking on a title will take you to the ICML website, which requires registration; the link to the paper is accessible regardless.

$$\pi_\beta(z|x) \propto q_\phi(z|x)^{1-\beta}\,p_\theta(x, z)^\beta$$

By progressively going from $\beta=0$ to $\beta=1$ we go from the variational distribution to the true posterior. This gives us a much more accurate estimate of the log-evidence. For those interested in this approach there is this blog post, but I will probably try to write my own.
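To make this concrete, here is a small numerical sketch (my own toy example, not from the talk): a conjugate Gaussian model with a deliberately mismatched variational distribution, where every $\pi_\beta$ is itself Gaussian, so we can sample from it exactly and recover the log-evidence by integrating over $\beta$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy conjugate model (my own illustrative choice): prior z ~ N(0, 1),
# likelihood x|z ~ N(z, 1), observed x = 2. The marginal is x ~ N(0, 2),
# so the true log-evidence is known in closed form.
x = 2.0
true_log_evidence = -0.5 * np.log(2 * np.pi * 2.0) - x**2 / 4.0

# A deliberately mismatched variational distribution q(z|x) = N(mq, sq2).
mq, sq2 = 0.8, 0.6

def log_q(z):
    return -0.5 * np.log(2 * np.pi * sq2) - (z - mq) ** 2 / (2 * sq2)

def log_joint(z):
    return -np.log(2 * np.pi) - z**2 / 2 - (x - z) ** 2 / 2

# Everything is Gaussian, so pi_beta ∝ q^(1-beta) p(x,z)^beta is Gaussian
# too and we can sample from it exactly (the natural parameters interpolate).
def sample_pi_beta(beta, n):
    prec = (1 - beta) / sq2 + beta * 2.0              # joint has precision 2 in z
    mean = ((1 - beta) * mq / sq2 + beta * x) / prec  # joint has mean x/2 in z
    return rng.normal(mean, np.sqrt(1 / prec), size=n)

# log p(x) = ∫_0^1 E_{pi_beta}[log p(x,z)/q(z|x)] dbeta,
# approximated with the trapezoid rule on a grid of betas.
betas = np.linspace(0, 1, 21)
integrand = []
for b in betas:
    z = sample_pi_beta(b, 20_000)
    integrand.append(np.mean(log_joint(z) - log_q(z)))
db = betas[1] - betas[0]
estimate = db * (sum(integrand) - 0.5 * (integrand[0] + integrand[-1]))

print(estimate, true_log_evidence)  # the two should agree closely
```

At $\beta=0$ the integrand is just the ELBO, and it increases towards $\log p(x) + \mathrm{KL}$ at $\beta=1$, so the integral sits between the ELBO and an upper bound.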

Now one normally integrates over several values of $\beta$, but they show that by choosing the right $\beta$ one can already improve on the ELBO value. To do this they rewrite $\pi_\beta$ as an exponential family:

$$\pi_\beta(z|x) = q_\phi(z|x)\exp\left[\beta \log\frac{p_\theta(x,z)}{q_\phi(z|x)} - \log Z_\beta(x)\right]$$

To be honest, I am not sure how this works for finding $\beta$ 🙂

They also show that, when integrating from $\beta=0$ to $\beta=1$, one can automatically find the best step size.

From the discussions, one of the pitfalls of the method seems to be the dimensionality of the problem: the expectations are computed via importance sampling, which is known to behave poorly in high dimensions.
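This weakness is easy to demonstrate with a sketch (a toy Gaussian pair of my own choosing, not from the talk): even a mild per-dimension mismatch between proposal and target makes the effective sample size of the importance weights collapse as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
fractions = []

for d in (1, 10, 100):
    # proposal q = N(0, I_d), target p = N(0.5 * 1, I_d): mild per-dim mismatch
    z = rng.normal(size=(n, d))
    log_w = -0.5 * np.sum((z - 0.5) ** 2 - z**2, axis=1)  # log p(z) - log q(z)
    w = np.exp(log_w - log_w.max())
    ess = w.sum() ** 2 / (w**2).sum()   # effective sample size of the weights
    fractions.append(ess / n)

print(fractions)  # the usable fraction of samples collapses as d grows
```

At $d=100$ essentially a single sample carries all the weight, so any expectation estimated this way is useless.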

Samples are generated by a transition kernel:

$$x_{i+1} \sim t(x|x_i)$$

where the condition is that the kernel leaves the density invariant: $\int t(x'|x)\,p(x)\,dx = p(x')$
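As a quick numerical check of this property, here is a sketch with an invariant kernel of my own choosing (the classic Gaussian autoregressive kernel, not from the talk): pushing samples of $p$ through the kernel leaves the distribution unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho = 100_000, 0.7

# Target p = N(0, 1). The autoregressive kernel t(x'|x) = N(x'; rho*x, 1 - rho^2)
# leaves p invariant (my choice for illustration).
x = rng.normal(size=n)                                       # x ~ p
x_new = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=n)   # x' ~ t(.|x)

print(x_new.mean(), x_new.var())  # still approximately 0 and 1
```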

One such kernel is $t(x'|x) = \delta(x'-f(x))$, where $f(x)$ has to satisfy the condition

p(x)=p(f(x))fx=p(f1(x))f1x p(x) = p(f(x))\left|\frac{\partial f}{\partial x}\right| = p(f^{-1}(x))\left|\frac{\partial f^{-1}}{\partial x}\right|

When adding the acceptance step, one gets the following kernel:

$$t(x'|x) = \delta(x'-f(x))\min\left[1, \frac{p(f(x))}{p(x)}\left|\frac{\partial f}{\partial x}\right|\right] + \delta(x'-x)\left(1 - \min\left[1, \frac{p(f(x))}{p(x)}\left|\frac{\partial f}{\partial x}\right|\right]\right)$$

Now the problem with such an $f$ is that the chain ends up cycling between two locations, $x$ and $f(x)$. To solve this problem we need an additional auxiliary variable $v$.
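To see the cycling concretely, here is a toy sketch (my own example, not from the talk): take $p = \mathcal{N}(0,1)$ and the involution $f(x) = -x$, which satisfies the invariance condition since $p$ is symmetric and $|\partial f/\partial x| = 1$. Every proposal is accepted, yet the chain only ever visits two points:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p(x):          # standard normal target
    return -0.5 * x**2

def f(x):              # an involution: f(f(x)) = x, and |df/dx| = 1
    return -x

x = 1.7
chain = [x]
for _ in range(10):
    x_prop = f(x)
    # acceptance ratio p(f(x))/p(x) * |df/dx|; by symmetry it is always 1 here
    if np.log(rng.uniform()) < log_p(x_prop) - log_p(x):
        x = x_prop
    chain.append(x)

print(chain)  # alternates 1.7, -1.7, 1.7, ... : the chain never explores
```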

The involution restriction is now relaxed to $f(x, v) = f^{-1}(x, v)$. In the Metropolis-Hastings algorithm, the proposal corresponds to sampling $v \sim p(v|x)$; for the random-walk algorithm, this means sampling from a Gaussian centered at $x$. Then $f(x, v) = [v, x]$ (notice the permutation), and the acceptance probability is $P = \min \left\{ 1, \frac{p(v,x)}{p(x,v)}\right\}$
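Putting the pieces together, here is a minimal sketch (my own implementation of this recipe on a toy target, not the authors' code): the swap involution $f(x,v) = [v,x]$ with a Gaussian $p(v|x)$ recovers the familiar random-walk Metropolis algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0

def log_p(x):              # standard normal target (my toy choice)
    return -0.5 * x**2

def log_p_aux(v, x):       # log p(v|x): Gaussian centred at x
    return -0.5 * (v - x) ** 2 / sigma**2

x = 5.0                    # start far from the mode on purpose
chain = []
for _ in range(50_000):
    v = rng.normal(x, sigma)          # sample the auxiliary variable v ~ p(v|x)
    x_new, v_new = v, x               # the swap involution f(x, v) = [v, x]
    # accept with probability min{1, p(v, x) / p(x, v)}
    log_ratio = (log_p(x_new) + log_p_aux(v_new, x_new)
                 - log_p(x) - log_p_aux(v, x))
    if np.log(rng.uniform()) < log_ratio:
        x = x_new
    chain.append(x)

samples = np.array(chain[5_000:])     # drop burn-in
print(samples.mean(), samples.var())  # should be close to 0 and 1
```

Because the Gaussian proposal is symmetric, the auxiliary terms cancel in the ratio and the acceptance reduces to the usual $\min\{1, p(v)/p(x)\}$.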

Following this definition, they list a series of tricks, including additional auxiliary augmentations, additional involutions and deterministic maps.

This talk really fascinated me, as it opens the door to creating new and more efficient sampling algorithms!

Other talks

Here are other presentations that really caught my attention and that I will probably explore later.

The online format

Due to COVID the conference was naturally held online, which was a double-edged sword for me. On one hand, it is amazing to be able to go through the presentations one by one and take the time to understand each topic. On the other hand, the lack of physical presence made it nearly impossible to network with other people.

I went to a few poster sessions, a.k.a. local Zoom meetings, and it was definitely a slightly awkward experience: you suddenly end up in front of all the authors at once, which puts on a lot more pressure. If there is one point that needs to be improved, it is this one!

This blog post has probability $1-\varepsilon$ of containing mistakes, inaccuracies or plain wrong takes! Please let me know in the comments/PM etc.!