Generative Models
Stochastic Differential Equation
Since almost all generative models can be unified within a SDE (Stochastic Differential Equation), we start with an introduction to SDE.
Let \(p_0\) be the data distribution in \(\mathbb R^d\) and let \(p_T\) be a prior distribution (usually but not necessarily a standard Gaussian) also over \(\mathbb R^d\). Our goal is to construct a continuous stochastic process \(\{x(t)\}\) indexed by a continous time variable \(t\in[0,T]\), such that \(x(0)\sim p_0\) and \(x(T)\sim p_T\). Additionally, we assume \(x(t)\) is characterized by the following Itô SDE:
Here, \(w\) is the standard Wiener process (Brownian motion), i.e., it adds Gaussian noises to \(x(t)\) when integrated over time \(t\). \(f(\cdot,t):\mathbb R^d\to\mathbb R^d\) maps the current \(x(t)\) to a velocity vector, usually called the drift coefficient of \(x(t)\). \(g(t)\in\mathbb R\) determines the scale of the Brownian motion added at time \(t\), called the diffusion coefficient of \(x(t)\). Intuitively, this gradually adds Gaussian noise to \(x(t)\), so it is also called a forward diffusion process.
Note
A standard Wiener process is defined as a stochastic process \(\{w_t\}_{t\geq0}\) that satisfies:
\(w_0=0\).
\(w_t\) is continous in \(t\) almost everywhere.
The increments \(w_{t+s}-w_s\) observes \(\mathcal N(0,tI)\) for any \(s,t>0\), and the increments are stationary and independent (see e.g. these lecture notes).
We can derive from (3) that if we let \(z=w_{t+s}-w_s\), then \(z\sim\mathcal N(0,tI)\). Thus, \({\rm d}w=\sqrt{{\rm d}t}z\) where \(z\sim\mathcal N(0,tI)\).
According to [Anderson1982], the reverse process is given by
where \(p_t(x)\) is the probability density of \(x(t)\), and \(\bar w\) is a Wiener process with time going backwards. Note that the reverse SDE should be integrated with \(t\) going back from \(T\) to \(0\), i.e., \({\rm d}t<0\).
In both forward and reverse processes, \(f(x,t)\) and \(g(t)\) are given by the specific models we choose, and they can be considered as known function. In order to sample from the reverse process, we additionally need to know \(\nabla_x\log p_t(x)\), known as the score function of the distribution \(p_t(x)\).
Estimating Score Functions
Note that
where \(p_{s,t}(x(t)|x(s))\) is called the transition kernel from \(s\) to \(t\) (\(s<t\)). To estimate the score function, we can start from sampling \(x(0)\sim p_0\) and some \(t\in[0,T]\), diffuse it to get \(x(t)\sim p_t\), and then compute \(\nabla_{x(t)}\log p_{t}(x(t))=\nabla_{x(t)}\log p_{0,t}(x(t)|x(0))\). Here, \(p_{0,t}(x(t)|x(0))\) is known and depends on \(f(x,t)\) and \(g(t)\) (but computing it is another matter). Finally, a neural network \(s_\theta(x(t),t)\) parameterized by \(\theta\) is trained to estimate the score function by minimizing:
Supposably, \(p_{0,t}(x(t)|x(0))\) is Gaussian when \(f(x,t)\) is affine in \(x\) (haven’t checked this myself).
Note
The sampling of \(t\) is equivalent to reweighting the loss according to different \(t\), and is usually also equivalent to weighting brought by differnet schedules (choices of \(f\) and \(g\)).
Forward Processes
Score Matching with Langevin Dynamics (SMLD)
For SMLD, the forward process is given by
Here, \(\sigma_t\) is monotonically increasing in \(t\). Consider discrete time steps \(0=t_0<t_1<t_2<\cdots<t_N\) (for simplicity we define \(\sigma_{t_0}=0\)). Then the discrete forward process is
The corresponding incremental version is
Todo
I have not idea how to convert between incremental update rules and the one-shot rule.
Then we have
Taking the limit \(t_i\to t_{i-1}\) and noting that \({\rm d}w=\sqrt{{\rm d}t}z\) we get
Denoising Diffusion Probabilistic Models (DDPM)
For DDPM, the forward process is defined as
Again, let \(0=t_0<t_1<t_2<\cdots<t_N\) and \(\alpha_0=1\). Then the discrete form is
The equivalent incremental form is
where \(\beta_{t_i}\in(0,1)\) are noise scales such that \(\alpha_{t_i}=\prod_{j=1}^i(1-\beta_{t_j})\). Similar to SMLD, we have
References
Note that as \(t_i\to t_{i-1}\), we also have \(\beta_{t_i}\to0\). So we need to define \(\beta(t_{i-1})=\lim_{t_i\to t_{i-1}}\beta_{t_i}/(t_i-t_{i-1})\). In this way, the continuous version of DDPM is
Flow Matching (FM)
FM itself is a continous model defined by
which gives
Brian D. O. Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313-326, May 1982.