Generative models are one of the cooler branches of Deep Learning. Over the last few weeks, Generative Adversarial Networks (GANs) have shown up in a large number of posts (most of them related to Nvidia's latest work). Thanks to this I realized that, although I had studied generative models at university, I had never coded even one of them! So I decided to change that and spend a couple of hours (re)learning about Variational Autoencoders. In this series of posts, I will try to convey the intuition and also share the resources I have found useful! You can read the other posts in this series here:
- Post 2: VAEs! Generating images with Tensorflow
- Post 3: Generating FIFA 19 players with VAEs and Tensorflow
What are VAEs?
Variational Autoencoders are, after all, neural networks. They consist of two main pieces: an encoder and a decoder. The first is a neural network whose task is to convert an input datapoint x into a lower-dimensional latent representation z.
The decoder is also a neural network. Its input lives in the same lower-dimensional space as the encoder's output, and its job is to bring the data back to the original distribution, that is, to output an image like the ones we have in our dataset.
So during training, the encoder 'encodes' the images into the latent space (information is lost due to the lower dimensionality), and after this the decoder tries to recover the original input. The reconstruction error is, of course, backpropagated through the whole network, and this improves its ability to reconstruct the original inputs.
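As a rough sketch of what these two pieces might look like in code (using tf.keras; the 784-pixel flattened input, the layer sizes and the 2-dimensional latent space are purely illustrative choices, not something prescribed by the model):

```python
import tensorflow as tf

# Illustrative sizes: 28x28 grayscale images flattened to 784 pixels,
# compressed down to a 2-dimensional latent space.
original_dim = 784
latent_dim = 2

# Encoder: maps an image to the parameters of a distribution over the
# latent space (a mean and a log-variance per latent dimension).
encoder = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation="relu", input_shape=(original_dim,)),
    tf.keras.layers.Dense(2 * latent_dim),  # [mu, log_var]
])

# Decoder: maps a point of the latent space back to pixel space.
decoder = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation="relu", input_shape=(latent_dim,)),
    tf.keras.layers.Dense(original_dim, activation="sigmoid"),
])
```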
But wait… wasn't this supposed to be a generative model? Yes! The encoder is in fact fitting a probability distribution to our data! The lower-dimensional space is stochastic (usually modeled with a Gaussian probability density), so once our training has converged to a stable solution, we can sample from this distribution and create new, unseen samples!
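In code, and reusing the hypothetical decoder and latent_dim from the sketch above, generating a brand new image is roughly this simple:

```python
# Draw a latent vector from the prior (a standard Gaussian, see the
# 'Gaussian tricks' section below) and push it through the decoder.
z = tf.random.normal(shape=(1, latent_dim))
new_image = decoder(z)                       # shape (1, 784), values in [0, 1]
new_image = tf.reshape(new_image, (28, 28))  # back to image form
```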
If you are not impressed yet, think about this simplification of the problem. Imagine we collect all the articles published in The New York Times during the last year and force ourselves to summarize them, with the following restriction: we can only use one hundred words from the English vocabulary. For this task we will need to select that set of words carefully to minimize the loss of information. Once we have succeeded at this task, we might be able to reconstruct an original article from the words we kept. But we can also pick a random combination of words (from the 100-word set) and create a new 'fake' article!
We would then be acting as encoders, transforming the articles into a reduced 100-word space. The decoder's task would be to recover as much of the original article as possible!
The math
My idea here is to stick to the parts that were harder for me to understand and that might help someone else in the same situation! I will cover the intuition behind the algorithm and the most important pieces one needs to understand before implementing this network in TensorFlow.
There are two main resources I have used where you can find a complete explanation of the VAE algorithm: Doersch's tutorial and Jaan Altosaar's blog post.
The variational autoencoder can be represented as a graphical model, where the joint probability of the data x and the latent variables z factorizes as p(x, z) = p(x|z) p(z).
What we actually want is to find good values of the latent variables given our dataset, which is described by the posterior p(z|x).
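Using Bayes' rule, the posterior can be written in terms of the pieces we already have; the trouble is the denominator, the evidence p(x), which requires integrating over every possible configuration of the latent variables and is intractable for any interesting model:

```latex
p(z \mid x) = \frac{p(x \mid z)\, p(z)}{p(x)},
\qquad
p(x) = \int p(x \mid z)\, p(z)\, dz
```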
“The key is to notice that any distribution in d dimensions can be generated by taking a set of d variables that are normally distributed and mapping them through a sufficiently complicated function”
Which I would put in another way: do not worry, it is a simplification, but the neural network will take care of it!
Since the exact posterior cannot be computed directly, variational inference approximates it with a family of distributions q_λ(z|x), where λ indexes the members of the family (for Gaussians, λ = (μ, σ)).
So we use the Kullback–Leibler divergence, KL(q_λ(z|x) ‖ p(z|x)), to measure how well our approximation matches the true posterior.
We need to minimize this divergence with respect to λ but, once again, we run into the intractable term p(x) when we expand it. This is where the Evidence Lower BOund (ELBO) enters the scene: ELBO(λ) = E_q[log p(x, z)] − E_q[log q_λ(z|x)].
This function is a lower bound on the evidence, which means that if we maximize it, we increase the (log) probability of observing the data (more here). If we combine the two previous equations we get: log p(x) = ELBO(λ) + KL(q_λ(z|x) ‖ p(z|x)).
And it is known that the KL divergence is always greater than or equal to zero. So… maximizing the ELBO is all we need to do, and we can get rid of the intractable KL divergence term.
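Where does that last identity come from? Just from expanding the definition of the KL divergence and substituting p(z|x) = p(x, z)/p(x):

```latex
\mathrm{KL}\big(q_\lambda(z \mid x)\,\|\,p(z \mid x)\big)
  = \mathbb{E}_{q_\lambda}\!\big[\log q_\lambda(z \mid x) - \log p(x, z)\big] + \log p(x)
  = \log p(x) - \mathrm{ELBO}(\lambda)
```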
In our neural network, the encoder takes the input data x and outputs the parameters of q_θ(z|x), while the decoder takes a sample z and outputs the parameters of p_φ(x|z). Written for a single datapoint x_i, the ELBO becomes: ELBO_i(θ, φ) = E_{q_θ(z|x_i)}[log p_φ(x_i|z)] − KL(q_θ(z|x_i) ‖ p(z)).
This is our loss function! We must highlight two things here. First of all, we can apply backpropagation to this function (the previous equation is defined per single datapoint, so stochastic gradient descent applies). Second, a loss function in Deep Learning is always minimized, so we will have to work with the negative ELBO.
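As a sketch of what the negative ELBO looks like in code (assuming flattened inputs with pixel values in [0, 1], a Bernoulli decoder and a standard Gaussian prior; the function and argument names are my own, not part of any API):

```python
import tensorflow as tf

def vae_loss(x, x_reconstructed, z_mean, z_log_var):
    """Negative ELBO for a batch: reconstruction term plus KL term."""
    eps = 1e-7  # numerical stability inside the logarithms
    # Reconstruction term: negative expected log-likelihood of the data
    # under the decoder, i.e. a pixel-wise binary cross-entropy.
    reconstruction = -tf.reduce_sum(
        x * tf.math.log(x_reconstructed + eps)
        + (1.0 - x) * tf.math.log(1.0 - x_reconstructed + eps),
        axis=1,
    )
    # KL divergence between q(z|x) = N(mu, sigma^2) and the prior N(0, I),
    # which has a closed form for two Gaussians.
    kl = -0.5 * tf.reduce_sum(
        1.0 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=1
    )
    # Minimizing this quantity is the same as maximizing the ELBO.
    return tf.reduce_mean(reconstruction + kl)
```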
That's all! Although it might seem a little convoluted at the beginning, I have found the ELBO trick super interesting! Just think about it: we have found a tractable solution to an intractable problem by restricting our hypothesis set (our family of distributions will be Gaussian) and applying a small amount of math!
Gaussian tricks!
As we said before, the family of distributions we are going to use is Gaussian. Usually, the prior over the latent space is chosen to be a standard Gaussian, p(z) = N(0, I). The decoder receives a sample z drawn from the encoder's output distribution, q_θ(z|x) = N(μ(x), σ²(x)), where μ(x) and σ(x) are the two vectors produced by the encoder.
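One implementation detail worth knowing is how this sampling step is made compatible with backpropagation: sampling z directly is not differentiable, so the standard solution (the reparameterization trick, covered in detail in the Doersch tutorial) draws the noise separately and shifts and scales it with the encoder outputs. A minimal sketch, assuming z_mean and z_log_var are the two halves of the encoder output:

```python
import tensorflow as tf

def sample_latent(z_mean, z_log_var):
    """Reparameterization trick: z = mu + sigma * epsilon, epsilon ~ N(0, I).

    All the randomness lives in epsilon, so z stays differentiable with
    respect to the encoder outputs and gradients can flow through it.
    """
    epsilon = tf.random.normal(shape=tf.shape(z_mean))
    return z_mean + tf.exp(0.5 * z_log_var) * epsilon
```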
In this video you can find a good visual explanation of the whole network!
Conclusion and next steps
In this post I have covered the basic intuition needed in order to implement a Deep Variational Autoencoder. In the next posts we will go through the implementation of this network using TensorFlow, evaluate some of the obtained results by playing with different dimensions of our latent space, and observe how our data is distributed in it. I must once again thank Doersch for his amazing tutorial, which helped me to properly understand the theory behind VAEs.
Any ideas for future posts, or is there something you would like to comment on? Please feel free to reach out via Twitter or GitHub!