Generative Models
=================

.. highlight:: python

Stochastic Differential Equation
--------------------------------

Since almost all generative models can be unified within an SDE (stochastic differential equation), we start with an introduction to SDEs.

Let :math:`p_0` be the data distribution in :math:`\mathbb R^d` and let :math:`p_T` be a prior distribution (usually but not necessarily a standard Gaussian) also over :math:`\mathbb R^d`. Our goal is to construct a continuous stochastic process :math:`\{x(t)\}` indexed by a continuous time variable :math:`t\in[0,T]`, such that :math:`x(0)\sim p_0` and :math:`x(T)\sim p_T`. Additionally, we assume :math:`x(t)` is characterized by the following Itô SDE:

.. math::

   {\rm d}x=f(x,t){\rm d}t+g(t){\rm d}w.

Here, :math:`w` is the standard Wiener process (Brownian motion), i.e., it adds Gaussian noise to :math:`x(t)` when integrated over time :math:`t`. The function :math:`f(\cdot,t):\mathbb R^d\to\mathbb R^d` maps the current :math:`x(t)` to a velocity vector and is usually called the *drift* coefficient of :math:`x(t)`. The scalar :math:`g(t)\in\mathbb R` determines the scale of the Brownian motion added at time :math:`t` and is called the *diffusion* coefficient of :math:`x(t)`. Intuitively, the SDE gradually adds Gaussian noise to :math:`x(t)`, so it is also called a forward diffusion process.

.. note::

   A standard Wiener process is defined as a stochastic process :math:`\{w_t\}_{t\geq0}` that satisfies:

   1. :math:`w_0=0`.
   2. :math:`w_t` is continuous in :math:`t` almost surely.
   3. The increment :math:`w_{t+s}-w_s` follows :math:`\mathcal N(0,tI)` for any :math:`s\geq0` and :math:`t>0`, and the increments are stationary and independent (see e.g. `these lecture notes <UchicagoBrownian_>`__).

   It follows from (3) that the increment :math:`w_{t+s}-w_s` can be written as :math:`\sqrt{t}z` with :math:`z\sim\mathcal N(0,I)`. Taking :math:`t={\rm d}t` gives :math:`{\rm d}w=\sqrt{{\rm d}t}z` where :math:`z\sim\mathcal N(0,I)`.

.. _UchicagoBrownian: https://galton.uchicago.edu/~lalley/Courses/313/BrownianMotionCurrent.pdf

According to [Anderson1982]_, the reverse process is given by

.. math::

   {\rm d}x=(f(x,t)-g(t)^2\nabla_x\log p_t(x)){\rm d}t+g(t){\rm d}\bar w,

where :math:`p_t(x)` is the probability density of :math:`x(t)`, and :math:`\bar w` is a Wiener process with time going backwards. Note that the reverse SDE should be integrated with :math:`t` going back from :math:`T` to :math:`0`, i.e., :math:`{\rm d}t<0`.

In both the forward and reverse processes, :math:`f(x,t)` and :math:`g(t)` are given by the specific model we choose, so they can be treated as known functions. In order to sample from the reverse process, we additionally need to know :math:`\nabla_x\log p_t(x)`, known as the *score function* of the distribution :math:`p_t(x)`.

Estimating Score Functions
--------------------------

Note that

.. math::

   \nabla_x\log p_t(x)=\nabla_{x(t)}\log(p_{0,t}(x(t)|x(0))p_0(x(0)))=\nabla_{x(t)}\log p_{0,t}(x(t)|x(0)),

where :math:`p_{s,t}(x(t)|x(s))` is called the *transition kernel* from :math:`s` to :math:`t` (:math:`s