Generative Models

Stochastic Differential Equation

Since almost all generative models can be unified within a SDE (Stochastic Differential Equation), we start with an introduction to SDE.

Let \(p_0\) be the data distribution in \(\mathbb R^d\) and let \(p_T\) be a prior distribution (usually but not necessarily a standard Gaussian) also over \(\mathbb R^d\). Our goal is to construct a continuous stochastic process \(\{x(t)\}\) indexed by a continous time variable \(t\in[0,T]\), such that \(x(0)\sim p_0\) and \(x(T)\sim p_T\). Additionally, we assume \(x(t)\) is characterized by the following Itô SDE:

\[{\rm d}x=f(x,t){\rm d}t+g(t){\rm d}w.\]

Here, \(w\) is the standard Wiener process (Brownian motion), i.e., it adds Gaussian noises to \(x(t)\) when integrated over time \(t\). \(f(\cdot,t):\mathbb R^d\to\mathbb R^d\) maps the current \(x(t)\) to a velocity vector, usually called the drift coefficient of \(x(t)\). \(g(t)\in\mathbb R\) determines the scale of the Brownian motion added at time \(t\), called the diffusion coefficient of \(x(t)\). Intuitively, this gradually adds Gaussian noise to \(x(t)\), so it is also called a forward diffusion process.

Note

A standard Wiener process is defined as a stochastic process \(\{w_t\}_{t\geq0}\) that satisfies:

\(w_0=0\).
\(w_t\) is continous in \(t\) almost everywhere.
The increments \(w_{t+s}-w_s\) observes \(\mathcal N(0,tI)\) for any \(s,t>0\), and the increments are stationary and independent (see e.g. these lecture notes).

We can derive from (3) that if we let \(z=w_{t+s}-w_s\), then \(z\sim\mathcal N(0,tI)\). Thus, \({\rm d}w=\sqrt{{\rm d}t}z\) where \(z\sim\mathcal N(0,tI)\).

According to [Anderson1982], the reverse process is given by

\[{\rm d}x=(f(x,t)-g(t)^2\nabla_x\log p_t(x)){\rm d}t+g(t){\rm d}\bar w,\]

where \(p_t(x)\) is the probability density of \(x(t)\), and \(\bar w\) is a Wiener process with time going backwards. Note that the reverse SDE should be integrated with \(t\) going back from \(T\) to \(0\), i.e., \({\rm d}t<0\).

In both forward and reverse processes, \(f(x,t)\) and \(g(t)\) are given by the specific models we choose, and they can be considered as known function. In order to sample from the reverse process, we additionally need to know \(\nabla_x\log p_t(x)\), known as the score function of the distribution \(p_t(x)\).

Estimating Score Functions

Note that

\[\nabla_x\log p_t(x)=\nabla_{x(t)}\log(p_{0,t}(x(t)|x(0))p_0(x(0)))=\nabla_{x(t)}\log p_{0,t}(x(t)|x(0)),\]

where \(p_{s,t}(x(t)|x(s))\) is called the transition kernel from \(s\) to \(t\) (\(s<t\)). To estimate the score function, we can start from sampling \(x(0)\sim p_0\) and some \(t\in[0,T]\), diffuse it to get \(x(t)\sim p_t\), and then compute \(\nabla_{x(t)}\log p_{t}(x(t))=\nabla_{x(t)}\log p_{0,t}(x(t)|x(0))\). Here, \(p_{0,t}(x(t)|x(0))\) is known and depends on \(f(x,t)\) and \(g(t)\) (but computing it is another matter). Finally, a neural network \(s_\theta(x(t),t)\) parameterized by \(\theta\) is trained to estimate the score function by minimizing:

\[\min_\theta\mathbb E_{t,x(0),x(t)|x(0)}\left[\|s_\theta(x(t),t)-\nabla_{x(t)}\log p_{0,t}(x(t)|x(0))\|^2\right].\]

Supposably, \(p_{0,t}(x(t)|x(0))\) is Gaussian when \(f(x,t)\) is affine in \(x\) (haven’t checked this myself).

Note

The sampling of \(t\) is equivalent to reweighting the loss according to different \(t\), and is usually also equivalent to weighting brought by differnet schedules (choices of \(f\) and \(g\)).

Forward Processes

Score Matching with Langevin Dynamics (SMLD)

For SMLD, the forward process is given by

\[ \begin{align}\begin{aligned}p_{\sigma_t}(x(t)|x(0))=\mathcal N(x(0),\sigma_t^2I)\\\iff x(t)=x(0)+\sigma_tz,\ z\sim\mathcal N(0,I).\end{aligned}\end{align} \]

Here, \(\sigma_t\) is monotonically increasing in \(t\). Consider discrete time steps \(0=t_0<t_1<t_2<\cdots<t_N\) (for simplicity we define \(\sigma_{t_0}=0\)). Then the discrete forward process is

\[x_{t_i}=x_{t_0}+\sigma_{t_i}z,\ z\sim\mathcal N(0,I).\]

The corresponding incremental version is

\[x_{t_i}=x_{t_{i-1}}+\sqrt{\sigma_{t_i}^2-\sigma_{t_{i-1}}^2}z,\ z\sim\mathcal N(0,I).\]

Todo

I have not idea how to convert between incremental update rules and the one-shot rule.

Then we have

\[\frac{x_{t_i}-x_{t_{i-1}}}{t_i-t_{i-1}}=\sqrt{\frac{\sigma_{t_i}^2-\sigma_{t_{i-1}}^2}{t_i-t_{i-1}}}\frac{z}{\sqrt{t_i-t_{i-1}}},\ z\sim\mathcal N(0,I).\]

Taking the limit \(t_i\to t_{i-1}\) and noting that \({\rm d}w=\sqrt{{\rm d}t}z\) we get

\[{\rm d}x=\sqrt{\frac{{\rm d}\sigma^2(t)}{{\rm d}t}}{\rm d}w.\]

Denoising Diffusion Probabilistic Models (DDPM)

For DDPM, the forward process is defined as

\[ \begin{align}\begin{aligned}p_{\alpha_t}(x(t)|x(0))=\mathcal N(\sqrt{\alpha_t}x(0),(1-\alpha_t)I)\\\iff x(t)=\sqrt{\alpha_t}x(0)+\sqrt{1-\alpha_t}z,\ z\sim\mathcal N(0,I).\end{aligned}\end{align} \]

Again, let \(0=t_0<t_1<t_2<\cdots<t_N\) and \(\alpha_0=1\). Then the discrete form is

\[x_{t_i}=\sqrt{\alpha_{t_i}}x_{t_0}+\sqrt{1-\alpha_{t_i}}z,\ z\sim\mathcal N(0,I).\]

The equivalent incremental form is

\[x_{t_i}=\sqrt{1-\beta_{t_i}}x_{t_{i-1}}+\sqrt{\beta_{t_i}}z,\ z\sim\mathcal N(0,I),\]

where \(\beta_{t_i}\in(0,1)\) are noise scales such that \(\alpha_{t_i}=\prod_{j=1}^i(1-\beta_{t_j})\). Similar to SMLD, we have

\[\frac{x_{t_i}-x_{t_{i-1}}}{t_i-t_{i-1}}\approx-\frac{\beta_{t_i}}{2(t_i-t_{i-1})}x_{t_{i-1}}+\sqrt{\frac{\beta_{t_i}}{t_i-t_{i_1}}}\frac{z}{\sqrt{t_i-t_{i_1}}},\ z\sim\mathcal N(0,I),\]

References

Note that as \(t_i\to t_{i-1}\), we also have \(\beta_{t_i}\to0\). So we need to define \(\beta(t_{i-1})=\lim_{t_i\to t_{i-1}}\beta_{t_i}/(t_i-t_{i-1})\). In this way, the continuous version of DDPM is

\[{\rm d}x=-\frac{\beta(t)}{2}x{\rm d}t+\sqrt{\beta(t)}{\rm d}w.\]

Flow Matching (FM)

FM itself is a continous model defined by

\[x(t)=(1-t)x(0)+tz,\ z\sim\mathcal N(0,I),\]

which gives

\[p_t(x(t)|x(0))=\mathcal N((1-t)x(0),t^2I).\]

[Anderson1982]

Brian D. O. Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313-326, May 1982.