The original paper is *Auto-Encoding Variational Bayes* (Kingma & Welling, 2013).
The main purpose of the VAE is to learn the distribution of the given data, $p(x)$. It assumes a latent-variable model, i.e., $p_\theta(x) = \int p_\theta(x|z)\,p_\theta(z)\,dz$.
The distribution is parameterized by $\theta$, and our goal is to learn the optimal $\theta^*$ by maximizing the marginal log-likelihood (ML):
$$ \theta^* = \underset{\theta}{\operatorname{argmax}} \enspace \log \int p_\theta(x|z)\,p_\theta(z)\, dz $$
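To make the objective concrete, the marginal likelihood can be estimated by naive Monte Carlo sampling from the prior. The sketch below assumes a toy linear-Gaussian model ($z \sim \mathcal{N}(0,1)$, $x \mid z \sim \mathcal{N}(\theta z, 1)$); this toy model and all names in it are illustrative assumptions, not from the original text:

```python
import numpy as np

def log_marginal_likelihood_mc(x, theta, n_samples=10000, rng=None):
    """Naive Monte Carlo estimate of log p_theta(x) = log E_{z~p(z)}[p_theta(x|z)].

    Assumed toy model: z ~ N(0, 1), x | z ~ N(theta * z, 1).
    """
    rng = np.random.default_rng(rng)
    z = rng.standard_normal(n_samples)  # samples from the prior p(z)
    # log p_theta(x | z) for a unit-variance Gaussian likelihood
    log_lik = -0.5 * (x - theta * z) ** 2 - 0.5 * np.log(2 * np.pi)
    # log-mean-exp for numerical stability
    m = log_lik.max()
    return m + np.log(np.mean(np.exp(log_lik - m)))
```

For this toy model the marginal is available in closed form, $p(x) = \mathcal{N}(x; 0, 1 + \theta^2)$, so the estimate can be sanity-checked; in general such a closed form is exactly what is missing.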
An analytic solution usually does not exist, because the integral is intractable. For some simple cases, e.g., the Gaussian mixture model, where $z$ follows a multinomial (categorical) distribution, we can efficiently compute the posterior $p_{\theta}(z|x)=p_\theta(x|z)p_{\theta}(z)/p_{\theta}(x)$, and the EM algorithm can then be used.
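As a minimal sketch of that tractable case, here is EM for a 1-D Gaussian mixture: the E-step computes the exact posterior $p(z|x)$ (the responsibilities), and the M-step re-estimates the parameters in closed form. The function name, initialization scheme, and hyperparameters are illustrative assumptions:

```python
import numpy as np

def em_gmm_1d(x, n_components=2, n_iters=50):
    """Minimal EM for a 1-D Gaussian mixture (illustrative sketch).

    E-step: responsibilities r[i, k] = p(z = k | x_i), tractable here.
    M-step: closed-form updates for weights, means, and variances.
    """
    n = len(x)
    pi = np.full(n_components, 1.0 / n_components)  # mixing weights p(z)
    # simple deterministic init: spread means over the data quantiles
    mu = np.percentile(x, 100.0 * (np.arange(n_components) + 1) / (n_components + 1))
    var = np.full(n_components, x.var())
    for _ in range(n_iters):
        # E-step: log p(x_i | z=k) + log p(z=k), normalized over k
        log_p = (-0.5 * (x[:, None] - mu) ** 2 / var
                 - 0.5 * np.log(2 * np.pi * var) + np.log(pi))
        log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted maximum-likelihood estimates
        nk = r.sum(axis=0)
        pi = nk / n
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var
```

On well-separated bimodal data this recovers the two component means; the key point is that every quantity in the E-step is computable because $z$ takes only finitely many values.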
However, in more general cases, where $z$ is not restricted to be multinomial, the posterior is usually intractable as well.
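This intractability is what the VAE works around: it introduces an approximate posterior $q_\phi(z|x)$ and maximizes the evidence lower bound (ELBO) on $\log p_\theta(x)$, using the reparameterization trick to keep sampling differentiable. A one-sample ELBO estimate for the same assumed toy Gaussian model as above (all parameter names are illustrative, and $q$ is taken as a given Gaussian rather than a learned network):

```python
import numpy as np

def elbo_estimate(x, mu_q, log_var_q, theta, rng=None):
    """One-sample ELBO estimate for an assumed toy model:

    prior:       z ~ N(0, 1)
    likelihood:  x | z ~ N(theta * z, 1)
    approx. posterior: q(z|x) = N(mu_q, exp(log_var_q))

    ELBO = E_q[log p_theta(x|z)] - KL(q(z|x) || p(z)) <= log p_theta(x).
    """
    rng = np.random.default_rng(rng)
    # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, 1),
    # so z is a differentiable function of (mu_q, log_var_q)
    eps = rng.standard_normal()
    z = mu_q + np.exp(0.5 * log_var_q) * eps
    log_lik = -0.5 * (x - theta * z) ** 2 - 0.5 * np.log(2 * np.pi)
    # Closed-form KL between N(mu_q, var_q) and the standard normal prior
    kl = 0.5 * (np.exp(log_var_q) + mu_q ** 2 - 1.0 - log_var_q)
    return log_lik - kl
```

When $q$ equals the true posterior (for this toy model, mean $\theta x/(1+\theta^2)$ and variance $1/(1+\theta^2)$), the ELBO is tight and its average recovers $\log p_\theta(x)$ exactly.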
Further reading:
- Denoising Criterion for Variational Autoencoding Framework
- Some useful tricks in training variational autoencoders: https://github.com/loliverhennigh/Variational-autoencoder-tricks-and-tips/blob/master/README.md