Article

How Much Is Enough? A Study on Diffusion Times in Score-Based Generative Models

1 EURECOM Data Science Department, 06410 Biot, France
2 Huawei Technologies Paris, 92100 Boulogne-Billancourt, France
* Author to whom correspondence should be addressed.
Entropy 2023, 25(4), 633; https://doi.org/10.3390/e25040633
Submission received: 10 March 2023 / Revised: 28 March 2023 / Accepted: 29 March 2023 / Published: 7 April 2023
(This article belongs to the Special Issue Deep Generative Modeling: Theory and Applications)

Abstract:
Score-based diffusion models are a class of generative models whose dynamics is described by stochastic differential equations that map noise into data. While recent works have started to lay down a theoretical foundation for these models, a detailed understanding of the role of the diffusion time T is still lacking. Current best practice advocates for a large T to ensure that the forward dynamics brings the diffusion sufficiently close to a known and simple noise distribution; however, a smaller value of T should be preferred for a better approximation of the score-matching objective and higher computational efficiency. Starting from a variational interpretation of diffusion models, in this work we quantify this trade-off and suggest a new method to improve the quality and efficiency of both training and sampling, by adopting smaller diffusion times. Indeed, we show how an auxiliary model can be used to bridge the gap between the ideal and the simulated forward dynamics, followed by a standard reverse diffusion process. Empirical results support our analysis; for image data, our method is competitive with the state of the art according to standard sample quality metrics and log-likelihood.

1. Introduction

Diffusion-based generative models [1,2,3,4,5,6,7] have recently gained popularity due to their ability to synthesize high-quality audio [8,9], images [10,11] and other data modalities [12], outperforming known methods based on Generative Adversarial Networks (GANs) [13], normalizing flows (NFs) [14], Variational Autoencoders (VAEs) and Bayesian autoencoders (BAEs) [15,16].
Diffusion models learn to generate samples from an unknown density p d a t a by reversing a diffusion process which transforms the distribution of interest into noise. The forward dynamics injects noise into the data following a diffusion process that can be described by a Stochastic Differential Equation (SDE) of the form
$\mathrm{d}x_t = f(x_t, t)\,\mathrm{d}t + g(t)\,\mathrm{d}w_t, \quad \text{with } x_0 \sim p_{\mathrm{data}},$  (1)
where $x_t$ is a random variable at time $t$, $f(\cdot, t)$ is the drift term, $g(\cdot)$ is the diffusion term and $w_t$ is a Wiener process (or Brownian motion). We also consider a special class of linear SDEs, for which the drift term decomposes as $f(x_t, t) = \alpha(t)\,x_t$, where $\alpha(t) \leq 0$ for all $t$, and the diffusion term is independent of $x_t$. This class of SDE parameterizations is known as affine, and it admits analytic solutions. We denote the time-varying probability density by $p(x, t)$, where, by definition, $p(x, 0) = p_{\mathrm{data}}(x)$, and the density conditioned on the initial condition $x_0$ by $p(x, t \mid x_0)$. The forward SDE is usually considered for a “sufficiently long” diffusion time $T$, leading to the density $p(x, T)$. In principle, as $T \to \infty$, $p(x, T)$ converges to Gaussian noise, regardless of the initial conditions.
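As a purely illustrative sketch (not part of the original analysis), the affine forward dynamics can be simulated with an Euler-Maruyama scheme and checked against the closed-form Gaussian conditional of the Variance Preserving process. The schedule values β₀ = 0.1, β₁ = 20 and the degenerate data distribution are assumptions made for this example only.

```python
import numpy as np

def euler_maruyama_forward(x0, alpha, g, T=1.0, n_steps=1000, seed=0):
    """Simulate dx_t = alpha(t) x_t dt + g(t) dw_t forward in time (affine drift)."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = np.array(x0, dtype=float)
    for k in range(n_steps):
        t = k * dt
        x = x + alpha(t) * x * dt + g(t) * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x

# Variance Preserving SDE with a linear schedule beta(t) = beta0 + (beta1 - beta0) t.
beta0, beta1 = 0.1, 20.0
beta = lambda t: beta0 + (beta1 - beta0) * t
alpha = lambda t: -0.5 * beta(t)
g = lambda t: np.sqrt(beta(t))

x0 = np.full(10_000, 3.0)                 # degenerate "data" concentrated at x = 3
xT = euler_maruyama_forward(x0, alpha, g, T=1.0)

# Closed form: p(x, T | x_0) = N(m, s) with m = exp(-B/2) x_0, s = 1 - exp(-B),
# where B = int_0^T beta(t) dt.
B = beta0 + 0.5 * (beta1 - beta0)
print(xT.mean(), np.exp(-0.5 * B) * 3.0)  # empirical vs analytic mean
print(xT.var(), 1.0 - np.exp(-B))         # empirical vs analytic variance
```

The empirical moments of the simulated paths match the analytic conditional moments up to discretization and Monte Carlo error.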
For generative modeling purposes, we are interested in the inverse dynamics of such process, i.e., transforming samples of the noisy distribution p ( x , T ) into p d a t a ( x ) . Such dynamics can be obtained by considering the solutions of the inverse diffusion process [17],
$\mathrm{d}x_t = \left[-f(x_t, \tilde{t}) + g^2(\tilde{t})\,\nabla \log p(x_t, \tilde{t})\right]\mathrm{d}t + g(\tilde{t})\,\mathrm{d}\bar{w}_t,$  (2)
where $\tilde{t} \stackrel{\mathrm{def}}{=} T - t$, and the inverse dynamics involves a new Wiener process $\bar{w}_t$. Given $p(x, T)$ as the initial condition, the solution of Equation (2) after a reverse diffusion time $T$ will be distributed as $p_{\mathrm{data}}(x)$. We refer to the density associated with the backward process as $q(x, t)$. The simulation of the backward process is referred to as sampling; differently from the forward process, this process is not affine and a closed-form solution is out of reach.
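The reverse dynamics can be sketched numerically in a case where the exact score is available in closed form: under the VP SDE, Gaussian data $\mathcal{N}(\mu_0, \sigma_0^2)$ keeps a Gaussian marginal, so $\nabla \log p(x, t)$ is analytic. The data parameters and schedule below are assumptions chosen only so the example is checkable.

```python
import numpy as np

rng = np.random.default_rng(0)
beta0, beta1 = 0.1, 20.0
beta = lambda t: beta0 + (beta1 - beta0) * t
int_beta = lambda t: beta0 * t + 0.5 * (beta1 - beta0) * t**2   # int_0^t beta

mu0, var0 = 2.0, 0.25  # Gaussian "data": exact marginal score known in closed form
def exact_score(x, t):
    m = np.exp(-0.5 * int_beta(t))
    v = m**2 * var0 + 1.0 - np.exp(-int_beta(t))   # variance of p(x, t)
    return -(x - m * mu0) / v

T, n_steps, n = 1.0, 1000, 20_000
dt = T / n_steps
x = rng.standard_normal(n)                          # x_0 ~ p_noise = N(0, 1)
for k in range(n_steps):
    tt = T - k * dt                                 # reversed time, t~ = T - t
    drift = 0.5 * beta(tt) * x + beta(tt) * exact_score(x, tt)  # -f + g^2 * score
    x = x + drift * dt + np.sqrt(beta(tt) * dt) * rng.standard_normal(n)

print(x.mean(), x.std())   # should approach mu0 = 2.0 and sqrt(var0) = 0.5
```

Starting from $p_{\mathrm{noise}}$ rather than $p(x, T)$ introduces exactly the mismatch discussed below; here $T = 1$ makes it negligible.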
Practical considerations on diffusion times. In practice, diffusion models are challenging to work with [3]. Indeed, direct access to the true score function $\nabla \log p(x_t, t)$, required by the dynamics of the reverse diffusion, is unavailable. This can be addressed by approximating it with a parametric function $s_\theta(x_t, t)$, e.g., a neural network, which is trained using the following loss function:
$\mathcal{L}(\theta) = \int_0^T \mathbb{E}_{(1)}\left[\lambda(t)\,\left\|s_\theta(x_t, t) - \nabla \log p(x_t, t \mid x_0)\right\|^2\right]\mathrm{d}t,$  (3)
where $\lambda(t)$ is a positive weighting factor and the notation $\mathbb{E}_{(1)}$ means that the expectation is taken with respect to the random process $x_t$ in Equation (1): for a generic function $h$, $\mathbb{E}_{(1)}\left[h(x_t, x_0, t)\right] = \iint h(x, z, t)\, p(x, t \mid z)\, p_{\mathrm{data}}(z)\, \mathrm{d}x\, \mathrm{d}z$. The loss in Equation (3), usually referred to as the score-matching loss, is the cost function considered in [18] (Equation (4)). The choice $\lambda(t) = g^2(t)$, which we use in this work, is referred to as likelihood reweighting. Due to the affine property of the drift, the term $p(x_t, t \mid x_0)$ is analytically known and normally distributed for all $t$ (expressions available in Table 1, and in Särkkä and Solin [19]). Intuitively, the estimation of the score is akin to a denoising objective, which operates in a challenging regime. Later, we quantify the difficulty of learning the score as a function of $T$.
While the forward and reverse diffusion processes are valid for any finite $T$, the noise distribution $p(x, T)$ is analytically known only in the limit $T \to \infty$. The common solution is therefore to replace $p(x, T)$ with a simple (i.e., easy to sample) distribution $p_{\mathrm{noise}}(x)$, which, for the classes of SDEs considered in this work, is a Gaussian distribution.
In the literature, the discrepancy between $p(x, T)$ and $p_{\mathrm{noise}}(x)$ has been neglected, under the informal assumption of a sufficiently large diffusion time. Unfortunately, while this approximation seems a valid approach to simulate and generate samples, the reverse diffusion process starts from an initial condition $q(x, 0)$ which is different from $p(x, T)$ and, as a consequence, it converges to a solution $q(x, T)$ that is different from the true $p_{\mathrm{data}}(x)$. Later, we expand on the error introduced by this approximation; for illustration purposes, Figure 1 shows this behavior quantitatively for a simple 1D toy example where we set the data distribution to a mixture of normal ($\mathcal{N}$) distributions, $p_{\mathrm{data}}(x) = \pi\,\mathcal{N}(1, 0.1^2) + (1-\pi)\,\mathcal{N}(3, 0.5^2)$, with $\pi = 0.3$. When $T$ is small, the distribution $p_{\mathrm{noise}}(x)$ is very different from $p(x, T)$ and samples from $q(x, T)$ exhibit very low likelihood of being generated from $p_{\mathrm{data}}(x)$.
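The mismatch can be reproduced numerically with the toy mixture above; the VP schedule values and the Gaussian moment-matching "KL proxy" below are illustrative assumptions, not the exact KL divergence used in the analysis.

```python
import numpy as np

rng = np.random.default_rng(2)
beta0, beta1 = 0.1, 20.0
int_beta = lambda t: beta0 * t + 0.5 * (beta1 - beta0) * t**2

# The toy mixture from the text: p_data = 0.3 N(1, 0.1^2) + 0.7 N(3, 0.5^2).
n, pi = 100_000, 0.3
comp = rng.random(n) < pi
x0 = np.where(comp, rng.normal(1.0, 0.1, n), rng.normal(3.0, 0.5, n))

kls = []
for T in [0.25, 0.5, 1.0, 2.0]:
    m = np.exp(-0.5 * int_beta(T))
    s = 1.0 - np.exp(-int_beta(T))
    xT = m * x0 + np.sqrt(s) * rng.standard_normal(n)  # exact samples from p(x, T)
    mu, var = xT.mean(), xT.var()
    # KL between the Gaussian fitted to p(x, T) and p_noise = N(0, 1): a crude proxy
    kls.append(0.5 * (var + mu**2 - 1.0 - np.log(var)))
    print(f"T = {T:4.2f}  KL proxy to N(0,1): {kls[-1]:.5f}")
```

At small T the diffused distribution is still far from the standard normal, and the discrepancy shrinks quickly as T grows.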
Crucially, Figure 1 (zoomed region) illustrates an unknown behavior of diffusion models, which we unveil in our analysis. The right balance between efficient score estimation and sampling quality can be achieved by diffusion times that are smaller than common best practices. Moreover, even excessively large diffusion times can be detrimental. This is a key observation that we explore in our work.
Contributions. An appropriate choice of the diffusion time T is a key factor that impacts training convergence, sampling time and quality. On the one hand, the approximation error introduced by considering initial conditions for the reverse diffusion process drawn from a simple distribution p n o i s e ( x ) p ( x , T ) increases when T is small. This is why the current best practice is to choose a sufficiently long diffusion time. On the other hand, training convergence of the score model s θ ( x t , t ) becomes more challenging to achieve with a large T, which also imposes extremely high computational costs both for training and for sampling. This would suggest to choose a smaller diffusion time. Given the importance of this problem, in this work, we set off to study the existence of suitable operating regimes to strike the right balance between computational efficiency and model quality. The main contributions of this work are the following:
Contribution 1: 
We use an evidence lower bound (ELBO) decomposition which allows us to study the impact of the diffusion time T. This ELBO decomposition emphasizes the roles of (i) the discrepancy between the “ending” distribution of the diffusion and the “starting” distribution of the reverse diffusion processes, and (ii) of the score matching objective. Crucially, our analysis does not rely on assumptions on the quality of the score models. We explicitly study the existence of a trade-off and explore experimentally, for the first time, current approaches for selecting the diffusion time T.
Contribution 2: 
In Section 3, we propose a novel method to improve both the training and sampling efficiency of diffusion-based models, while maintaining high sample quality. Our method introduces an auxiliary distribution, allowing us to transform the simple “starting” distribution of the reverse process used in the literature so as to minimize the discrepancy to the “ending” distribution of the forward process. Then, a standard reverse diffusion can be used to closely match the data distribution. Intuitively, our method allows to build “bridges” across multiple distributions, and to set T toward the advantageous regime of small diffusion times.
In addition to our methodological contributions, in Section 4, we provide experimental evidence of the benefits of our method, in terms of sample quality and log likelihood. Finally, we conclude this work in Section 5.
Related Work. A concurrent work by Zheng et al. [20] presents an empirical study of a truncated diffusion process but lacks a rigorous analysis and a clear justification for the proposed approach. Recent attempts to optimize $p_{\mathrm{noise}}$ by Lee et al. [9], and the related proposal in [21], have been studied in different contexts. Related work focuses primarily on improving sampling efficiency (but not training efficiency), using a wide array of techniques. Sample generation times can be drastically reduced by considering adaptive step-size integrators [22]. Such methods are complementary to our approach and can be used in combination with the techniques we propose in this work. Other popular choices are based on merging multiple steps of a pretrained model through distillation techniques [23] or on taking larger sampling steps with GANs [24]. Approaches closer to ours modify the SDE, or the discrete-time processes, to obtain inference efficiency gains. In particular, Song et al. [7] consider implicit non-Markovian diffusion processes, Watson et al. [25] change the diffusion processes by optimal schedule selection, and Dockhorn et al. [26] consider overdamped SDEs. Finally, hybrid techniques combining VAEs and diffusion models [4], or simple autoencoders and diffusion models [27], have positive effects on training and sampling times.
Moreover, we remark that a simple modification of the noise schedule to steer the diffusion process toward a small diffusion time [5,28] is not a viable solution. As we discuss in Section 2.4, the optimal value of the ELBO, in the case of affine SDEs, is invariant to the choice of the noise schedule. Naively selecting a faster noise schedule does not provide any practical benefit in terms of computational complexity, as it requires smaller step sizes to keep the same accuracy as the original noise schedule simulation. However, the optimization of the noise schedule can have important practical effects on the stability of training and the variance of estimations [5]. Finally, a few other works in the literature attempt to study the convergence properties of diffusion models. For instance, De Bortoli et al. [29] obtain a total variation bound between the generated and data distributions under maximum-error assumptions on the discrepancy between the true and approximated score. De Bortoli [30] relaxes this requirement, obtaining a bound in terms of the Wasserstein distance. Lee et al. [31] show how the total variation bound can be expressed as a function of the maximum score error and find that the bound is optimized for a diffusion time that depends on this error. Our work, on the other hand, does not make any such assumption and aims at selecting the smallest possible diffusion time to maximize training and sampling efficiency.

2. A Tradeoff on Diffusion Time

The dynamics of a diffusion model can be studied through the lens of variational inference, which allows us to bound the (log-)likelihood using an evidence lower bound (ELBO) [32]. The interpretation we consider in this work (see also [18], Theorem 1) emphasizes the two main factors affecting the quality of sample generation: an imperfect score, and a mismatch measured by $\mathrm{KL}\left[p(x, T)\,\|\,p_{\mathrm{noise}}(x)\right]$, the Kullback-Leibler (KL) divergence between the noise distribution $p(x, T)$ of the forward process and the distribution $p_{\mathrm{noise}}$ used to initialize the backward process.

2.1. Preliminaries: The ELBO Decomposition

Our goal is to study the quality of the generated data distribution as a function of the diffusion time T. Instead of focusing on the log-likelihood bound for a single datapoint $\log q(x, T)$, we consider the average over the data distribution, i.e., the cross-entropy $\mathbb{E}_{p_{\mathrm{data}}(x)}\left[\log q(x, T)\right]$. By rewriting the $\mathcal{L}_{\mathrm{ELBO}}$ derived in Huang et al. [32] (Equation (25)) (details of the steps in Appendix B), we have that
$\mathbb{E}_{p_{\mathrm{data}}(x)}\left[\log q(x, T)\right] \geq \mathcal{L}_{\mathrm{ELBO}}(s_\theta, T) = \mathbb{E}_{(1)}\left[\log p_{\mathrm{noise}}(x_T)\right] - I(s_\theta, T) + R(T),$  (4)
where $R(T) = \frac{1}{2}\int_{t=0}^{T} \mathbb{E}_{(1)}\left[g^2(t)\,\left\|\nabla \log p(x_t, t \mid x_0)\right\|^2 - 2\, f(x_t, t)^\top \nabla \log p(x_t, t \mid x_0)\right]\mathrm{d}t$, and $I(s_\theta, T) = \frac{1}{2}\int_{t=0}^{T} g^2(t)\,\mathbb{E}_{(1)}\left[\left\|s_\theta(x_t, t) - \nabla \log p(x_t, t \mid x_0)\right\|^2\right]\mathrm{d}t$ is equal to the loss term in Equation (3) when $\lambda(t) = g^2(t)$.
Note that $R(T)$ depends neither on $s_\theta$ nor on $p_{\mathrm{noise}}$, while $I(s_\theta, T)$, or an equivalent reparameterization [18,32] (Equation (1)), is used to learn the approximated score by optimizing the parameters $\theta$. It is then possible to show that
$I(s_\theta, T) \geq I(\nabla \log p, T) \stackrel{\mathrm{def}}{=} K(T) = \frac{1}{2}\int_{t=0}^{T} g^2(t)\,\mathbb{E}_{(1)}\left[\left\|\nabla \log p(x_t, t) - \nabla \log p(x_t, t \mid x_0)\right\|^2\right]\mathrm{d}t.$  (5)
Note that the term $K(T) = I(\nabla \log p, T)$ does not depend on $\theta$. Consequently, we can define $G(s_\theta, T) = I(s_\theta, T) - K(T)$ (see Appendix C for details), where $G(s_\theta, T)$ is a positive term that we call the gap term, accounting for the practical case of an imperfect score, i.e., $s_\theta(x_t, t) \neq \nabla \log p(x_t, t)$. It also holds that
$\mathbb{E}_{(1)}\left[\log p_{\mathrm{noise}}(x_T)\right] = \int \left[\log p_{\mathrm{noise}}(x) - \log p(x, T) + \log p(x, T)\right] p(x, T)\,\mathrm{d}x = \mathbb{E}_{(1)}\left[\log p(x_T, T)\right] - \mathrm{KL}\left[p(x, T)\,\|\,p_{\mathrm{noise}}(x)\right].$  (6)
Therefore, we can substitute the cross-entropy term $\mathbb{E}_{(1)}\left[\log p_{\mathrm{noise}}(x_T)\right]$ of the ELBO in Equation (4) to obtain
$\mathbb{E}_{p_{\mathrm{data}}(x)}\left[\log q(x, T)\right] \geq -\mathrm{KL}\left[p(x, T)\,\|\,p_{\mathrm{noise}}(x)\right] + \mathbb{E}_{(1)}\left[\log p(x_T, T)\right] - K(T) + R(T) - G(s_\theta, T).$  (7)
Before concluding our derivation, we show how to combine different terms of Equation (7) into the negative entropy term $\mathbb{E}_{p_{\mathrm{data}}(x)}\left[\log p_{\mathrm{data}}(x)\right]$. Given the stochastic dynamics defined in Equation (1), it holds that (see derivation and details in Appendix D)
$\mathbb{E}_{(1)}\left[\log p(x_T, T)\right] - K(T) + R(T) = \mathbb{E}_{p_{\mathrm{data}}(x)}\left[\log p_{\mathrm{data}}(x)\right].$  (8)
Finally, we can now bound the value of $\mathbb{E}_{p_{\mathrm{data}}(x)}\left[\log q(x, T)\right]$ as
$\mathbb{E}_{p_{\mathrm{data}}(x)}\left[\log q(x, T)\right] \geq \mathbb{E}_{p_{\mathrm{data}}(x)}\left[\log p_{\mathrm{data}}(x)\right] - G(s_\theta, T) - \mathrm{KL}\left[p(x, T)\,\|\,p_{\mathrm{noise}}(x)\right] \stackrel{\mathrm{def}}{=} \mathcal{L}_{\mathrm{ELBO}}(s_\theta, T).$  (9)
Equation (9) clearly emphasizes the roles of an approximate score function, through the gap term $G(\cdot)$, and of the discrepancy between the noise distribution of the forward process and the initial distribution of the reverse process, through the KL term. The (negative) entropy term $\mathbb{E}_{p_{\mathrm{data}}(x)}\left[\log p_{\mathrm{data}}(x)\right]$, which is constant with regard to $T$ and $\theta$, is the best value achievable by the ELBO. Indeed, by rearranging Equation (9), $\mathrm{KL}\left[p_{\mathrm{data}}(x)\,\|\,q(x, T)\right] \leq G(s_\theta, T) + \mathrm{KL}\left[p(x, T)\,\|\,p_{\mathrm{noise}}(x)\right]$. Optimality is achieved when (i) we have perfect score matching and (ii) the initial conditions for the reverse process are ideal, i.e., $q(x, 0) = p(x, T)$.
Next, we show the existence of a tradeoff: the KL decreases with T, while the gap increases with T.

2.2. The Tradeoff on Diffusion Time

We begin by showing that the KL term in Equation (9) decreases with the diffusion time T, which suggests selecting a large T to maximize the ELBO.
We consider the two main classes of SDEs for the forward diffusion process defined in Equation (1): SDEs whose steady state distribution is the standard multivariate Gaussian, referred to as Variance Preserving (VP), and SDEs without a stationary distribution, referred to as Variance Exploding (VE), which we summarize in Table 1. The standard approach to generate new samples relies on the backward process defined in Equation (2), and consists in setting p n o i s e in agreement with the form of the forward process SDE. The following result bounds the discrepancy between the noise distribution p ( x , T ) and p n o i s e .
Table 1. Two main families of diffusion processes, where $\sigma^2(t) = \sigma^2_{\min}\left(\sigma^2_{\max}/\sigma^2_{\min}\right)^{t}$ and $\beta(t) = \beta_0 + (\beta_1 - \beta_0)\,t$.
Diffusion Process | Coefficients | $p(x_t, t \mid x_0) = \mathcal{N}(m, s\,I)$ | $p_{\mathrm{noise}}(x)$
Variance Exploding | $\alpha(t) = 0$, $g(t) = \sqrt{\mathrm{d}\sigma^2(t)/\mathrm{d}t}$ | $m = x_0$, $s = \sigma^2(t) - \sigma^2(0)$ | $\mathcal{N}\left(0, \left(\sigma^2(T) - \sigma^2(0)\right) I\right)$
Variance Preserving | $\alpha(t) = -\frac{1}{2}\beta(t)$, $g(t) = \sqrt{\beta(t)}$ | $m = e^{-\frac{1}{2}\int_0^t \beta(\tau)\,\mathrm{d}\tau}\, x_0$, $s = 1 - e^{-\int_0^t \beta(\tau)\,\mathrm{d}\tau}$ | $\mathcal{N}(0, I)$
Lemma 1. 
For the classes of SDEs considered (Table 1), the discrepancy between $p(x, T)$ and $p_{\mathrm{noise}}(x)$ can be bounded as follows.
For Variance Preserving SDEs, it holds that: $\mathrm{KL}\left[p(x, T)\,\|\,p_{\mathrm{noise}}(x)\right] \leq C_1 \exp\left(-\int_0^T \beta(t)\,\mathrm{d}t\right)$.
For Variance Exploding SDEs, it holds that: $\mathrm{KL}\left[p(x, T)\,\|\,p_{\mathrm{noise}}(x)\right] \leq C_2\,\dfrac{1}{\sigma^2(T) - \sigma^2(0)}$.
Our proof uses results from Villani [33], the logarithmic Sobolev inequality and Grönwall's inequality (see Appendix E for details). The consequence of Lemma 1 is that, to maximize the ELBO, the diffusion time T should be as large as possible (ideally, $T \to \infty$), such that the KL term vanishes. This result is in line with current practices for training score-based diffusion processes, which argue for sufficiently long diffusion times [29]. Our analysis, on the other hand, highlights how this term is only one of the two contributions to the ELBO.
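The VP decay rate in Lemma 1 can be sanity-checked in a case where the KL term is exact: for Gaussian data under the VP SDE, $p(x, T)$ is itself Gaussian. The data parameters and schedule values below are assumptions for illustration; the ratio to $\exp(-\int_0^T \beta)$ stabilizing to a constant reflects the lemma's rate.

```python
import numpy as np

beta0, beta1 = 0.1, 20.0
int_beta = lambda t: beta0 * t + 0.5 * (beta1 - beta0) * t**2

def kl_vp(T, mu0=2.0, var0=0.25):
    """Exact KL[p(x,T) || N(0,1)] when p_data = N(mu0, var0) under the VP SDE."""
    e = np.exp(-int_beta(T))                        # exp(- int_0^T beta)
    mu, var = np.sqrt(e) * mu0, e * var0 + 1.0 - e  # moments of p(x, T)
    return 0.5 * (var + mu**2 - 1.0 - np.log(var))

for T in [0.25, 0.5, 1.0]:
    print(T, kl_vp(T), kl_vp(T) / np.exp(-int_beta(T)))  # ratio stabilizes
```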
Now, we focus our attention on studying the behavior of the second component, G ( · ) . Before that, we define a few quantities that allow us to write the next important result.
Definition 1. 
We define the optimal score $\hat{s}_\theta$ for any diffusion time T as the score obtained using parameters that minimize $I(s_\theta, T)$. Similarly, we define the optimal score gap $G(\hat{s}_\theta, T)$ for any diffusion time T as the gap attained when using the optimal score.
The optimal score gap term $G(\hat{s}_\theta, T)$ is a non-decreasing function of T. That is, given $T_2 > T_1$, with $\theta_1 = \arg\min_\theta I(s_\theta, T_1)$ and $\theta_2 = \arg\min_\theta I(s_\theta, T_2)$, it holds that $G(s_{\theta_2}, T_2) \geq G(s_{\theta_1}, T_1)$. The proof (see Appendix F) is a direct consequence of the definition of G and the optimality of the score.
Note that the result above does not imply that $G(s_{\theta_a}, T_2) \geq G(s_{\theta_b}, T_1)$ holds for generic parameters $\theta_a, \theta_b$.

2.3. Is There an Optimal Diffusion Time?

While diffusion processes are generally studied for T , diffusion times in score-based models have been arbitrarily set to be “sufficiently large” in the literature. Here we formally argue about the existence of an optimal diffusion time, which strikes the right balance between the gap G ( · ) and the KL terms of the ELBO in Equation (9).
Before proceeding any further, we clarify that our final objective in this work is not to find and use an optimal diffusion time. Instead, our result on the existence of optimal diffusion times (which can be smaller than the ones set by popular heuristics) serves the purpose of motivating the choice of small diffusion times, which, however, calls for a method to overcome approximation errors. For completeness, in Appendix H, we show that optimizing the ELBO to obtain an optimal diffusion time $T^\star$ is technically feasible, without resorting to exhaustive grid search.
Consider the ELBO decomposition in Equation (9). We study it as a function of the diffusion time T and seek its optimal argument $T^\star = \arg\max_T \mathcal{L}_{\mathrm{ELBO}}(\hat{s}_\theta, T)$. Then, the optimal diffusion time satisfies $T^\star \in \mathbb{R}^+$, and thus not necessarily $T^\star = \infty$. Additional assumptions on the gap term $G(\cdot)$ can be used to guarantee strict finiteness of $T^\star$.
It is trivial to verify that, since the optimal gap term $G(\hat{s}_\theta, T)$ is a non-decreasing function of T (Section 2.2), we have $\frac{\partial G}{\partial T} \geq 0$. Then, we study the sign of the derivative of the KL term, which is always negative, as shown in Appendix G. Moreover, we know that $\lim_{T \to \infty} \frac{\partial \mathrm{KL}}{\partial T} = 0$. Consequently, the function $\frac{\partial \mathcal{L}_{\mathrm{ELBO}}}{\partial T} = -\frac{\partial G}{\partial T} - \frac{\partial \mathrm{KL}}{\partial T}$ has at least one zero in its domain $\mathbb{R}^+$. To guarantee a stricter bound on $T^\star$, we could study asymptotically the growth rates of the G and KL terms for large T. This investigation is technically involved and outside the scope of this paper. Nevertheless, as discussed hereafter, the numerical investigation carried out in this work suggests finiteness of $T^\star$.
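The balance between the two terms can be sketched with a surrogate: the exact Gaussian KL term from above, plus a purely hypothetical linear gap model G(T) = 0.05 T (an assumption standing in for the unknown non-decreasing optimal-score gap). Under these assumptions the surrogate ELBO peaks at a finite T.

```python
import numpy as np

beta0, beta1 = 0.1, 20.0
int_beta = lambda t: beta0 * t + 0.5 * (beta1 - beta0) * t**2

def kl_term(T, mu0=2.0, var0=0.25):
    """Exact KL[p(x,T) || N(0,1)] for Gaussian data N(mu0, var0) under the VP SDE."""
    e = np.exp(-int_beta(T))
    mu, var = np.sqrt(e) * mu0, e * var0 + 1.0 - e
    return 0.5 * (var + mu**2 - 1.0 - np.log(var))

gap = lambda T: 0.05 * T   # hypothetical non-decreasing optimal-score gap model

Ts = np.linspace(0.05, 3.0, 600)
elbo = np.array([-(gap(T) + kl_term(T)) for T in Ts])  # ELBO up to the entropy constant
T_star = Ts[np.argmax(elbo)]
print(T_star)              # a finite maximizer, here below the customary T = 1
```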
Empirically, we use Figure 2 to illustrate the tradeoff and the optimality arguments through the lens of the same toy example we use in Section 1. On the first and third columns, we show the ELBO decomposition. We can verify that G ( s θ , T ) is an increasing function of T, whereas the KL term is a decreasing function of T. Even in the simple case of a toy example, the tension between small and large values of T is clear. On the second and fourth columns, we show the values of the ELBO and of the likelihood as a function of T. We then verify the validity of our claims: the ELBO is neither maximized by an infinite diffusion time, nor by a “sufficiently large” value. Instead, there exists an optimal diffusion time which, for this example, is smaller than T = 1.0 , which is typically used in practice.
In Section 3, we present a new method that admits much smaller diffusion times and we show that the ELBO of our approach is at least as good as the one of a standard diffusion model, configured to use its optimal diffusion time T .

2.4. Relation with Diffusion Process Noise Schedule

We remark that a simple modification of the noise schedule to steer the diffusion process toward a small diffusion time [5,28] is not a viable solution. In Appendix J, we discuss how the optimal value of the ELBO, in the case of affine SDEs, is invariant to the choice of the noise schedule. Indeed, its value depends uniquely on the relative level of corruption of the initial data at the considered final diffusion time T, that is, the Signal-to-Noise Ratio. Naively, we could think that, by selecting a twice as fast noise schedule, we would be able to obtain the same ELBO of the original schedule by diffusing only for half the time. While true, this does not provide any practical benefit in terms of computational complexity. If the noise schedule is faster, the drift terms involved in the reverse process change more rapidly. Consequently, to simulate the reverse SDE with a numerical integration scheme, smaller step sizes are required to keep the same accuracy as the original noise schedule simulation. The effect is that, while the diffusion time for the continuous-time dynamics is smaller, the number of integration steps is larger, inducing no computational gains. The optimization of the noise schedule can, however, have important practical effects in terms of stability of the training and variance of the estimations, which we do not tackle in this work [5].
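A small numeric check of this invariance argument (with an assumed linear schedule, for illustration only): a "twice as fast" schedule reaches the same integrated noise level, and hence the same VP Signal-to-Noise Ratio, at half the diffusion time.

```python
import numpy as np

beta0, beta1 = 0.1, 20.0
int_beta = lambda t: beta0 * t + 0.5 * (beta1 - beta0) * t**2  # original schedule

# A "twice as fast" schedule beta_fast(t) = 2 beta(2t) satisfies
# int_0^t beta_fast = int_0^{2t} beta, so any SNR level is reached in half the time.
int_beta_fast = lambda t: int_beta(2.0 * t)

snr = lambda B: np.exp(-B) / (1.0 - np.exp(-B))  # SNR = m^2 / s for the VP kernel
print(snr(int_beta(1.0)), snr(int_beta_fast(0.5)))  # identical
```

The equality of the two SNR values is exactly why halving T this way buys nothing: the faster drift then forces proportionally smaller integration steps.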

2.5. Relation with Literature on Bounds and Goodness of Score Assumptions

A few other works in the literature attempt to study the convergence properties of diffusion models. In the work of De Bortoli et al. [29] (Theorem 1), a total variation (TV) bound between the generated and data distributions is obtained in the form $C_1 \exp(a_1 T) + C_2 \exp(-a_2 T)$, where the constant $C_1$ depends on the maximum error over $[0, T]$ between the true and approximated score, i.e., $\max_{t \in [0, T]} \left\|s_\theta(x, t) - \nabla \log p(x, t)\right\|$. In the work of De Bortoli [30], the requirement is relaxed to a bound on $\max_{t \in [0, T]} \sigma^2(t)\,(1 + \|x\|)^{-1}\left\|s_\theta(x, t) - \nabla \log p(x, t)\right\|$, and the 1-Wasserstein distance between generated and true data is bounded as $C_1 + C_2 \exp(-a_2 T) + C_3$ (Theorem 1). Other works consider the more realistic average square norm instead of the infinity norm, which is consistent with standard training of diffusion models. Moreover, Lee et al. [31] show how the TV bound can be expressed as a function of $\max_{t \in [0, T]} \mathbb{E}\left[\left\|s_\theta(x_t, t) - \nabla \log p(x_t, t)\right\|^2\right]$ (Theorems 2.2, 3.1 and 3.2). Related to our work, Lee et al. [31] find that the TV bound is optimized for a diffusion time that depends, among other factors, on the maximum score error. Finally, the work by Chen et al. [34] (Theorem 2), which is concurrent to ours, shows that if $\max_{t \in [0, T]} \mathbb{E}\left[\left\|s_\theta(x_t, t) - \nabla \log p(x_t, t)\right\|^2\right]$ is bounded, then the TV distance between true and generated data can be bounded as $C_1 \exp(-a_1 T) + \epsilon \sqrt{T}$, plus a discretization error.
All prior approaches require assumptions on the maximum score error, which implicitly depends on: (i) the maximum diffusion time T and (ii) the class of parametric score networks considered. Hence, such methods allow for the study of convergence properties, but with the following limitations. It is not clear how the score error behaves as the fitting domain $[0, T]$ is increased, for a generic class of parametric functions and a generic $p_{\mathrm{data}}$. Moreover, it is difficult to link the error assumptions to the actual training loss of diffusion models. In this work, instead, we follow a more agnostic path, as we make no assumptions about the error behavior. We notice that the optimal gap term is always a non-decreasing function of T. First, we question whether the current best practice for setting diffusion times is adequate: we find that, in realistic implementations, diffusion times are larger than necessary. Second, we introduce a new approach with provably the same performance as standard diffusion models but lower computational complexity, as highlighted in Section 3.

3. A New, Practical Method for Decreasing Diffusion Times

The ELBO decomposition in Equation (9) and the bounds in Lemma 1 and Section 2.2 highlight a dilemma. We thus propose a simple method that allows us to achieve both a small gap G ( s θ , T ) and a small discrepancy KL p ( x , T ) p n o i s e ( x ) . Before that, let us use Figure 3 to summarize all densities involved and the effects of the various approximations, which will be useful to visualize our proposal.
The data distribution $p_{\mathrm{data}}(x)$ is transformed into the noise distribution $p(x, T)$ through the forward diffusion process. Ideally, starting from $p(x, T)$, we can recover the data distribution by simulating the backward process using the exact score $\nabla \log p$. Using the approximated score $s_\theta$ and the same initial conditions, the backward process ends up in $q^{(1)}(x, T)$, whose discrepancy ① to $p_{\mathrm{data}}(x)$ is $G(s_\theta, T)$. However, the distribution $p(x, T)$ is unknown and is replaced with an easy distribution $p_{\mathrm{noise}}(x)$, accounting for an error ⓐ measured as $\mathrm{KL}\left[p(x, T)\,\|\,p_{\mathrm{noise}}(x)\right]$. With both the score and the initial distribution approximated, the backward process ends up in $q^{(3)}(x, T)$, where the discrepancy ③ from $p_{\mathrm{data}}$ is the sum of the terms $G(s_\theta, T) + \mathrm{KL}\left[p(x, T)\,\|\,p_{\mathrm{noise}}\right]$.
Multiple bridges across densities. In a nutshell, our method allows us to reduce the gap term by selecting smaller diffusion times and by using a learned auxiliary model to transform the initial density $p_{\mathrm{noise}}(x)$ into a density $\nu_\phi(x)$ that is as close as possible to $p(x, T)$, thus avoiding the penalty of a large KL term. To implement this, we first transform the simple distribution $p_{\mathrm{noise}}$ into the distribution $\nu_\phi(x)$, whose discrepancy ⓑ $\mathrm{KL}\left[p(x, T)\,\|\,\nu_\phi(x)\right]$ is smaller than ⓐ. Then, starting from the auxiliary model $\nu_\phi(x)$, we use the approximate score $s_\theta$ to simulate the backward process, reaching $q^{(2)}(x, T)$. This solution has a discrepancy ② from the data distribution of $G(s_\theta, T) + \mathrm{KL}\left[p(x, T)\,\|\,\nu_\phi(x)\right]$, which we quantify later in this section. Intuitively, we introduce two bridges. The first bridge connects the noise distribution $p_{\mathrm{noise}}$ to an auxiliary distribution $\nu_\phi(x)$ that is as close as possible to the one obtained by the forward diffusion process. The second bridge, a standard reverse diffusion process, connects the smooth distribution $\nu_\phi(x)$ to the data distribution. Notably, our approach has important guarantees, which we discuss next.

3.1. Auxiliary Model Fitting and Guarantees

We begin by stating the requirements we consider for the density $\nu_\phi(x)$. First, as is the case for $p_{\mathrm{noise}}$, it should be easy to generate samples from $\nu_\phi(x)$ in order to initialize the reverse diffusion process. Second, the auxiliary model should allow us to compute the likelihood of the samples generated through the overall generative process, which begins in $p_{\mathrm{noise}}$, passes through $\nu_\phi(x)$, and arrives in $q(x, T)$.
The fitting procedure of the auxiliary model is straightforward. First, we recognize that minimizing $\mathrm{KL}\left[p(x, T)\,\|\,\nu_\phi(x)\right]$ with respect to $\phi$ also minimizes the cross-entropy $\mathbb{E}_{p(x, T)}\left[-\log \nu_\phi(x)\right]$, which we can use as a loss function. To obtain the set of optimal parameters $\phi^\star$, we require samples from $p(x, T)$, which can be easily obtained even if the density $p(x, T)$ is not available. Indeed, by sampling from $p_{\mathrm{data}}$ and then from $p(x, T \mid x_0)$, we obtain an unbiased Monte Carlo estimate of $\mathbb{E}_{p(x, T)}\left[-\log \nu_\phi(x)\right]$, and optimization of the loss can be performed. Note that, due to the affine nature of the drift, the conditional distribution $p(x, T \mid x_0)$ is easy to sample from, as shown in Table 1. From a practical point of view, it is important to notice that the fitting of $\nu_\phi$ is independent of the training of the score-matching objective, i.e., the result of $I(s_\theta, T)$ does not depend on the shape of the auxiliary distribution $\nu_\phi$. This implies that the two training procedures can be run in parallel, thus enabling considerable time savings.
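The fitting procedure can be sketched in its simplest instance: a Gaussian auxiliary family, for which minimizing the Monte Carlo cross-entropy reduces to moment matching in closed form. The mixture data, the VP schedule, and the Gaussian family are assumptions made for this sketch; the auxiliary model used in practice can be far more flexible.

```python
import numpy as np

rng = np.random.default_rng(3)
beta0, beta1 = 0.1, 20.0
int_beta = lambda t: beta0 * t + 0.5 * (beta1 - beta0) * t**2

# Toy data (the bimodal mixture of Section 1) and a deliberately small diffusion time.
n, T, pi = 100_000, 0.25, 0.3
comp = rng.random(n) < pi
x0 = np.where(comp, rng.normal(1.0, 0.1, n), rng.normal(3.0, 0.5, n))
m, s = np.exp(-0.5 * int_beta(T)), 1.0 - np.exp(-int_beta(T))
xT = m * x0 + np.sqrt(s) * rng.standard_normal(n)  # unbiased samples from p(x, T)

# Fit a Gaussian auxiliary nu_phi by minimizing E_{p(x,T)}[-log nu_phi(x)];
# for the Gaussian family the minimizer is moment matching on the samples.
mu_phi, var_phi = xT.mean(), xT.var()

def avg_nll(x, mu, var):
    """Monte Carlo estimate of the cross-entropy E_{p(x,T)}[-log N(x; mu, var)]."""
    return 0.5 * np.mean((x - mu) ** 2 / var + np.log(2.0 * np.pi * var))

print(avg_nll(xT, mu_phi, var_phi), avg_nll(xT, 0.0, 1.0))
# the fitted auxiliary attains a lower cross-entropy than p_noise = N(0, 1)
```

Lowering this cross-entropy is exactly lowering the KL term of Equation (11) up to a constant, which is what the auxiliary bridge is for.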
Next, we show that the first bridge in our model reduces the KL term, even for small diffusion times.
Proposition 1. 
Let us assume that $p_{\mathrm{noise}}(x)$ is in the family spanned by $\nu_\phi$, i.e., there exists $\tilde{\phi}$ such that $\nu_{\tilde{\phi}} = p_{\mathrm{noise}}$. Then, we have that
$\mathrm{KL}\left[p(x, T)\,\|\,\nu_{\phi^\star}(x)\right] \leq \mathrm{KL}\left[p(x, T)\,\|\,\nu_{\tilde{\phi}}(x)\right] = \mathrm{KL}\left[p(x, T)\,\|\,p_{\mathrm{noise}}(x)\right].$
Since we introduce the auxiliary distribution ν , we shall define a new ELBO for our method:
$\mathcal{L}^{\phi}_{\mathrm{ELBO}}(s_\theta, T) = \mathbb{E}_{p_{\mathrm{data}}(x)}\left[\log p_{\mathrm{data}}(x)\right] - G(s_\theta, T) - \mathrm{KL}\left[p(x, T)\,\|\,\nu_\phi(x)\right].$
Recalling that $\hat{s}_\theta$ is the optimal score for a generic time T, Proposition 1 allows us to claim that $\mathcal{L}^{\phi}_{\mathrm{ELBO}}(\hat{s}_\theta, T) \geq \mathcal{L}_{\mathrm{ELBO}}(\hat{s}_\theta, T)$.
Then, we can state the following important result:
Proposition 2. 
Given the existence of T⋆, defined as the diffusion time such that the ELBO is maximized (Section 2.3), there exists at least one diffusion time τ ≤ T⋆ such that L_ELBO^φ( ŝ_θ, τ ) ≥ L_ELBO( ŝ_θ, T⋆ ).
Proposition 2, which we prove in Appendix I, has two interpretations. On the one hand, given two score models optimally trained for their respective diffusion times, our approach guarantees an ELBO that is at least as good as that of a standard diffusion model configured with its optimal time T . Our method achieves this with a smaller diffusion time τ , which offers sampling efficiency and generation quality. On the other hand, if we settle for an equivalent ELBO for the standard diffusion model and our approach, with our method we can afford a sub-optimal score model, which requires a smaller computational budget to be trained, while guaranteeing shorter sampling times. We elaborate on this interpretation in Section 4, where our approach obtains substantial savings in terms of training iterations.
A final note is in order. The choice of the auxiliary model depends on the selected diffusion time. The larger the T, the “simpler” the auxiliary model can be. Indeed, the diffused distribution p(x,T) approaches p_noise, so that a simple auxiliary model is sufficient to transform p_noise into a distribution ν_φ close to p(x,T). Instead, for a small T, the distribution p(x,T) is closer to the data distribution, and the auxiliary model requires high flexibility and capacity. In Section 4, we substantiate this discussion empirically on synthetic and real data.

3.2. Comparison with Schrödinger Bridges

In this section, we briefly compare our method with the Schrödinger bridges approach [29,35,36], which allows one to move from an arbitrary p n o i s e to p d a t a in any finite amount of time T. This is achieved by simulating the SDE
d x_t = [ f(x_t, t) + g²(t) ∇ log ψ̂(x_t, t) ] dt + g(t) dw_t,   x_0 ∼ p_noise,
where ψ ^ , ψ solve the Partial Differential Equation (PDE) system
∂ψ(x,t)/∂t = − f(x,t) · ∇ψ(x,t) − (g²(t)/2) Δψ(x,t),   ∂ψ̂(x,t)/∂t = − ∇·( ψ̂(x,t) f(x,t) ) + (g²(t)/2) Δψ̂(x,t),
with boundary conditions ψ(x,0) ψ̂(x,0) = p_data(x), ψ(x,T) ψ̂(x,T) = p_noise(x). In the above equations, ∇·f(x,t) = Σ_{I=1}^{N} ∂f_I(x,t)/∂x_I, where N is the dimension of the vectors x and f, and the notation f_I, x_I indicates their I-th components. This approach presents drawbacks compared to classical diffusion models. First, the functions ψ, ψ̂ are not known, and their parametric approximation is costly and complex. Second, it is much harder to obtain quantitative bounds between true and generated data as a function of the quality of such approximations.
The ψ ^ , ψ estimation procedure simplifies considerably in the particular case where p n o i s e ( x ) = p ( x , T ) , for arbitrary T. The solution of Equation (13) is indeed ψ ( x , t ) = 1 , ψ ^ ( x , t ) = p ( x , t ) . The first PDE of the system is satisfied when ψ is a constant. The second PDE is the Fokker–Planck equation, satisfied by ψ ^ ( x , t ) = p ( x , t ) . Boundary conditions are also satisfied. In this scenario, a sensible objective is the score-matching, as getting log ψ ^ equal to the true score log p allows perfect generation.
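As a concrete sanity check of this special case, the snippet below verifies by finite differences that ψ̂ = p satisfies the second PDE of the system (the Fokker–Planck equation) in a 1-D toy instance. All constants are illustrative assumptions: a VP SDE with f(x,t) = −x/2 and g(t) = 1, and Gaussian data, so that the forward marginal is available in closed form.

```python
import numpy as np

# 1-D toy instance (illustrative assumptions): VP SDE with f(x,t) = -x/2,
# g(t) = 1, and Gaussian data N(0, s0^2). The forward marginal is then known
# in closed form: p(x,t) = N(0, v(t)), v(t) = s0^2 e^{-t} + 1 - e^{-t}.
s0_sq = 4.0

def p(x, t):
    v = s0_sq * np.exp(-t) + 1.0 - np.exp(-t)
    return np.exp(-x**2 / (2 * v)) / np.sqrt(2 * np.pi * v)

def fp_residual(x, t, h=1e-4):
    # Residual of dp/dt = -d/dx(f p) + (g^2/2) d^2p/dx^2,
    # approximated with central finite differences.
    dp_dt = (p(x, t + h) - p(x, t - h)) / (2 * h)
    flux = lambda y: (-y / 2) * p(y, t)
    d_flux = (flux(x + h) - flux(x - h)) / (2 * h)
    d2p = (p(x + h, t) - 2 * p(x, t) + p(x - h, t)) / h**2
    return dp_dt - (-d_flux + 0.5 * d2p)

residual = max(abs(fp_residual(x, 0.3)) for x in np.linspace(-3.0, 3.0, 13))
```

The residual is at the level of the discretization error, consistent with ψ̂(x,t) = p(x,t) solving the equation (ψ = 1 trivially solves the first PDE, whose terms all involve derivatives of ψ).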
Unfortunately, it is difficult to generate samples from p(x,T), the starting condition of Equation (12). A trivial solution is to select a large T, so that p(x,T) approaches the simple and analytically known steady-state distribution p_noise of Equation (1). This corresponds to the classical diffusion models approach, which we discussed in the previous sections. An alternative solution is to keep T finite and cover the first part of the bridge from p_noise to p(x,T) with an auxiliary model. This provides a different interpretation of our method, which allows for smaller diffusion times while keeping good generative quality.

3.3. An Extension for Density Estimation

Diffusion models can also be used for density estimation by transforming the diffusion SDE into an equivalent Ordinary Differential Equation (ODE) whose marginal distribution p(x,t) at each time instant coincides with that of the corresponding SDE [3]. The exact equivalent ODE requires the score ∇ log p(x_t, t), which in practice is replaced by the score model s_θ, leading to the following ODE
d x_t = [ f(x_t, t) − (1/2) g²(t) s_θ(x_t, t) ] dt   with   x_0 ∼ p_data,
whose time-varying probability density is indicated with p̃(x,t). Note that the density p̃(x,t) is in general not equal to the density p(x,t) associated with Equation (1), with the exception of perfect score matching [18]. The reverse-time process is modeled as a Continuous Normalizing Flow (CNF) [37,38] initialized with distribution p_noise(x); then, the likelihood of a given value x_0 is
log p̃(x_0) = log p_noise(x_T) + ∫_0^T ∇·[ f(x_t, t) − (1/2) g²(t) s_θ(x_t, t) ] dt.
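The likelihood formula above can be checked numerically in a 1-D case where the score is exact, which is an illustrative assumption (with the exact score, the terminal density is p(x,T) = N(0, v(T)) and plays the role of p_noise). All constants below are illustrative choices: Gaussian data N(0, s0²) under a VP SDE with constant β = 1, so that p(x,t) = N(0, v(t)) with v(t) = s0² e^{−t} + 1 − e^{−t} and score −x / v(t).

```python
import numpy as np

# Sketch of the likelihood computation via the probability-flow ODE, in a 1-D
# case with exact score (an illustrative assumption): Gaussian data N(0, s0^2),
# VP SDE with constant beta = 1, so p(x,t) = N(0, v(t)) and s(x,t) = -x/v(t).
s0_sq, T, n_steps = 4.0, 1.0, 20_000
dt = T / n_steps

def v(t):
    return s0_sq * np.exp(-t) + 1.0 - np.exp(-t)

def log_p_via_ode(x0):
    # Euler integration of dx/dt = f - (1/2) g^2 s = -x/2 + x/(2 v(t)),
    # accumulating the divergence of the drift (x-independent here).
    x, div_int = x0, 0.0
    for i in range(n_steps):
        t = i * dt
        x += (-0.5 * x + 0.5 * x / v(t)) * dt
        div_int += (-0.5 + 0.5 / v(t)) * dt
    log_pT = -0.5 * np.log(2 * np.pi * v(T)) - x**2 / (2 * v(T))
    return log_pT + div_int

estimate = log_p_via_ode(1.5)
exact = -0.5 * np.log(2 * np.pi * s0_sq) - 1.5**2 / (2 * s0_sq)
```

With the exact score the two quantities agree up to the Euler discretization error, which illustrates why score quality directly controls the fidelity of BPD estimates.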
To use our proposed model for density estimation, we also need to take into account the ODE dynamics. We focus again on the term log p n o i s e ( x T ) to improve the expected log likelihood. For consistency, our auxiliary density ν ϕ should now maximize E ( 14 ) log ν ϕ ( x T ) instead of E ( 1 ) log ν ϕ ( x T ) . However, the simulation of Equation (14) requires access to s θ which, in the endeavor of density estimation, is available only once the score model has been trained. Consequently, optimization with respect to ϕ can only be performed sequentially, whereas, for generative purposes, it could be done concurrently. While the sequential version is expected to perform better, experimental evidence indicates that improvements are marginal, justifying the adoption of the more efficient concurrent version.

4. Experiments

We now present numerical results on the MNIST and CIFAR10 datasets, to support our claims in Section 2 and Section 3. We follow a standard experimental setup [5,7,18,32]: we use a standard U-Net architecture with time embeddings [6] and we report the log-likelihood in terms of bits per dimension (BPD) and the Fréchet Inception Distance (FID) scores (only for CIFAR10). Although the FID score is a standard metric for ranking generative models, caution should be used against over-interpreting FID improvements [39]. Similarly, while the theoretical properties of the models we consider are obtained through the lens of ELBO maximization, the log-likelihood measured in terms of BPD should be considered with care [40]. Finally, we also report the number of neural function evaluations (NFE) for computing the relevant metrics. We compare our method to the standard score-based model [3]. The full description of the experimental setup is presented in Appendix K.
On the existence of T⋆. We look for further empirical evidence of the existence of a T⋆ < ∞, as stated in Section 2.3. For the moment, we shall focus on the baseline model [3], where no auxiliary models are introduced. Results are reported in Table 2. For MNIST, we observe how times T = 0.6 and T = 1.0 have comparable performance in terms of BPD, implying that any T ≥ 1.0 is at best unnecessary and generally detrimental. Similarly, for CIFAR10, it is possible to notice that the best value of BPD is achieved for T = 0.6, outperforming all other values.
Our auxiliary models. In Section 3, we introduced an auxiliary model to minimize the mismatch between initial distributions of the backward process. We now specify the family of parametric distributions we have considered. Clearly, the choice of an auxiliary model also depends on the data distribution, in addition to the choice of diffusion time T.
For our experiments, we consider two auxiliary models: (i) a Dirichlet process Gaussian mixture model (DPGMM) [41,42] for MNIST and (ii) Glow [43], a flexible normalizing flow for CIFAR10. Both of them satisfy our requirements: they allow exact likelihood computation and they are equipped with a simple sampling procedure. As discussed in Section 3, auxiliary model complexity should be adjusted as a function of T.
This is confirmed experimentally in Figure 4, where we use the number of mixture components of the DPGMM as a proxy to measure the complexity of the auxiliary model.
Reducing T with auxiliary models. We now show how it is possible to obtain comparable (or better) performance than the baseline model for a wide range of diffusion times T. For MNIST, setting τ = 0.4 produces good performance both in terms of BPD (Table 3) and visual sample quality (Figure 5). We also consider the sequential extension (S) to compute the likelihood, but observe marginal improvements compared to a concurrent implementation. Similarly, for the CIFAR10 dataset, in Table 4 we observe how our method achieves better BPD than the baseline diffusion for T = 1. Moreover, our approach outperforms the baselines for the corresponding diffusion time in terms of FID score (Figure 6 and additional non-curated samples in Appendix K). In Figure A3 we provide a non-curated subset of qualitative results, showing that our method for a diffusion time equal to 0.4 still produces appealing images, while the vanilla approach fails. We finally notice how the proposed method has comparable performance with regard to several other competitors, while stressing that many solutions orthogonal to ours (such as diffusion in latent space [4], or the selection of higher-order schemes [22]) can be combined with our methodology.
Training and sampling efficiency. In Figure 7, the horizontal line corresponds to the best performance of a fully trained baseline model for T = 1.0 [3]. To achieve the same performance as the baseline, variants of our method require fewer iterations, which translates into training efficiency. For the sake of fairness, the total training cost of our method should account for the auxiliary model training, which, however, can be done concurrently with the diffusion process. As an illustration for CIFAR10, using four GPUs, the baseline model requires ∼6.4 days of training. With our method we trained the auxiliary and diffusion models for ∼2.3 and 2 days, respectively, leading to a total training time of max{2.3, 2} = 2.3 days. Similar training curves can be obtained for the MNIST dataset, where the training time for DPGMMs is negligible.
Sampling speed benefits are evident from Table 3 and Table 4. When considering the SDE version of the methods, the number of sampling steps can decrease linearly with T, in accordance with theory [45], while retaining good BPD and FID scores. Similarly, although not in a linear fashion, the number of steps of the ODE samplers can be reduced by using a smaller diffusion time T.
Finally, we test the proposed methodology on the more challenging CelebA 64×64 dataset. In this case, we use a variance-exploding diffusion and we consider again Glow as the auxiliary model. The results, presented in Table 5, report the log-likelihood performance of different methods (qualitative results are reported in Appendix K). At the two extremes of complexity we have the original diffusion (VE, T = 1.0), with the best BPD and the highest complexity, and Glow, which provides a much simpler scheme with worse performance. In the table we report the BPD and NFE metrics for smaller diffusion times, in three different configurations: naively neglecting the mismatch (ScoreSDE) or using the auxiliary model (Our). Interestingly, we found that the best results are obtained by using a combination of diffusion models pretrained for T = 1.0. In summary, by accepting a small degradation in terms of BPD, we can reduce the computational cost by almost one order of magnitude. We believe it would be interesting to study more powerful auxiliary models to improve the performance of our method on challenging datasets.

5. Conclusions

Diffusion-based generative models emerged as an extremely competitive approach for a wide range of application domains. In practice, however, these models are resource-hungry, both for their training and for sampling new data points. In this work, we have introduced the key idea of considering diffusion times T as a free variable which should be chosen appropriately. We have shown that the choice of T introduces a trade-off, for which a “sweet spot” exists. In standard diffusion-based models, smaller values of T are preferable for efficiency reasons, but sufficiently large T are required to reduce approximation errors of the forward dynamics. Thus, we devised a novel method that allows for an arbitrary selection of diffusion times, where even small values are allowed. Our method closes the gap between practical and ideal diffusion dynamics, using an auxiliary model. Our empirical validation indicated that the performance of our approach was comparable and often superior to standard diffusion models, while being efficient both in training and in sampling.

Author Contributions

Conceptualization, G.F., S.R., M.F. and P.M.; Methodology, G.F., S.R., M.F. and P.M.; Software, G.F., S.R., L.Y., A.F., D.R., M.F. and P.M.; Validation, G.F., S.R., M.F. and P.M.; Formal analysis, G.F., S.R., M.F. and P.M.; Investigation, G.F., S.R., M.F. and P.M.; Resources, L.Y., A.F. and D.R.; Writing—original draft, G.F., S.R., M.F. and P.M.; Writing—review & editing, G.F., S.R., L.Y., A.F., D.R., M.F. and P.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been partially supported by the Netception project, funded by Huawei Paris.

Data Availability Statement

All used datasets are publicly available.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Generic Definitions and Assumptions

Our work builds upon the work in [18], which should be considered as a basis for the developments hereafter. In this supplementary material, we use the following shortened notation for a generic ω > 0 :
N ω ( x ) = def N ( x ; 0 , ω I ) .
It is useful to notice that ∇ log N_ω(x) = − x / ω.
For an arbitrary probability density p ( x ) , we define the convolution (∗ operator) with N ω using notation
p ω ( x ) = p ( x ) N ω ( x ) .
Equivalently, p_ω(x) = exp( (ω/2) Δ ) p(x), and consequently d p_ω(x)/dω = (1/2) Δ p_ω(x), where Δ = ∇·∇. Notice that, by considering the Dirac delta function δ(x), we have the equality δ_ω(x) = N_ω(x).
In the following derivations, we make use of the Stam–Gross logarithmic Sobolev inequality result in [33] (p. 562 Example 21.3):
KL( p(x) ‖ N_ω(x) ) = ∫ p(x) log( p(x) / N_ω(x) ) dx ≤ (ω/2) ∫ ‖ ∇ log( p(x) / N_ω(x) ) ‖² p(x) dx.

Appendix B. Deriving Equation (4) from [32]

We start with Equation (25) of [32], which, in our notation, reads
log q(x,T) ≥ E[ log p_noise(x_T) | x_0 = x ] − ∫_0^T E[ (1/2) g²(t) ‖ s_θ(x_t) ‖² + ∇·( g²(t) s_θ(x_t) − f(x_t, t) ) | x_0 = x ] dt.
The first step is to take the expected value with respect to x 0 p d a t a on both sides of the above inequality
E_{p_data}[ log q(x,T) ] ≥ E[ log p_noise(x_T) ] − ∫_0^T E[ (1/2) g²(t) ‖ s_θ(x_t) ‖² + ∇·( g²(t) s_θ(x_t) − f(x_t, t) ) ] dt.
We focus on rewriting the term
0 T E g 2 ( t ) s θ ( x t ) f ( x t , t ) d t = 0 T p ( x , t | x 0 ) p d a t a ( x 0 ) g 2 ( t ) s θ ( x ) f ( x , t ) d x d x 0 d t = 0 T p ( x , t | x 0 ) p d a t a ( x 0 ) g 2 ( t ) s θ ( x ) f ( x , t ) d x d x 0 d t =
0 T p ( x , t | x 0 ) p d a t a ( x 0 ) 1 p ( x , t | x 0 ) p d a t a ( x 0 ) g 2 ( t ) s θ ( x ) f ( x , t ) p ( x , t | x 0 ) p d a t a ( x 0 ) d x d x 0 d t = 0 T log ( p ( x , t | x 0 ) ) + log ( p d a t a ( x 0 ) ) g 2 ( t ) s θ ( x ) f ( x , t ) p ( x , t | x 0 ) p d a t a ( x 0 ) d x d x 0 d t = 0 T log ( p ( x , t | x 0 ) ) g 2 ( t ) s θ ( x ) f ( x , t ) p ( x , t | x 0 ) p d a t a ( x 0 ) d x d x 0 d t = 0 T E log ( p ( x t , t | x 0 ) ) g 2 ( t ) s θ ( x t ) f ( x t , t ) d t .
Consequently, we can rewrite the r.h.s of Equation (A4) as
E log p n o i s e ( x T ) 0 T E 1 2 g 2 ( t ) s θ ( x t ) 2 g 2 ( t ) log ( p ( x t , t | x 0 ) ) s θ ( x ) + E + g 2 ( t ) log ( p ( x t , t | x 0 ) ) f ( x , t ) d t = E log p n o i s e ( x T ) 0 T E 1 2 g 2 ( t ) s θ ( x t ) log ( p ( x t , t | x 0 ) ) 2 d t 1 2 0 T E g 2 ( t ) log ( p ( x t , t | x 0 ) ) + 2 log ( p ( x t , t | x 0 ) ) f ( x , t ) d t ,
which is exactly Equation (4).

Appendix C. Proof of Equation (5)

We prove the following result
I( s_θ, T ) ≥ I( ∇ log p, T ) ≝ K(T) = (1/2) ∫_0^T g²(t) E_{(1)}[ ‖ ∇ log p(x_t, t) − ∇ log p(x_t, t | x_0) ‖² ] dt.
Proof. 
We prove that for generic positive λ ( · ) , and T 2 > T 1 the following holds:
∫_{T_1}^{T_2} λ(t) E_{(1)}[ ‖ s(x_t, t) − ∇ log p(x_t, t | x_0) ‖² ] dt ≥ ∫_{T_1}^{T_2} λ(t) E_{(1)}[ ‖ ∇ log p(x_t, t) − ∇ log p(x_t, t | x_0) ‖² ] dt.
First, we compute the functional derivative (with regard to s)
δ/δs ∫_{T_1}^{T_2} λ(t) E_{(1)}[ ‖ s(x_t, t) − ∇ log p(x_t, t | x_0) ‖² ] dt = 2 ∫_{T_1}^{T_2} λ(t) E_{(1)}[ s(x_t, t) − ∇ log p(x_t, t | x_0) ] dt = 2 ∫_{T_1}^{T_2} λ(t) E_{(1)}[ s(x_t, t) − ∇ log p(x_t, t) ] dt,
where we used
E ( 1 ) log p ( x t , t | x 0 ) = log p ( x , t | x 0 ) p ( x , t | x 0 ) p d a t a ( x 0 ) d x d x 0 = p ( x , t | x 0 ) p d a t a ( x 0 ) d x d x 0 = p ( x , t ) x = E ( 1 ) log p ( x t , t ) .
Consequently, we can obtain the optimal s through
δ/δs ∫_{T_1}^{T_2} λ(t) E_{(1)}[ ‖ s(x_t, t) − ∇ log p(x_t, t | x_0) ‖² ] dt = 0   ⟺   s(x, t) = ∇ log p(x, t).
Substitution of this result into Equation (A5) directly proves the desired inequality.
As a byproduct, we prove the correctness of Equation (5), since it is a particular case of Equation (A5), with λ = g², T_1 = 0, T_2 = T. Since K(T) is a minimum, the decomposition I( s_θ, T ) = K(T) + G( s_θ, T ) implies K(T) + G( s_θ, T ) ≥ K(T), i.e., G( s_θ, T ) ≥ 0. □

Appendix D. Proof of Equation (8)

Proof. 
We consider the pair of equations
d x t = f ( x t , t ) + g 2 ( t ) log q ( x t , t ) d t + g ( t ) d w ( t ) , d x t = f ( x t , t ) t + g ( t ) d w ( t ) ,
where t′ = T − t, q is the density of the backward process and p is the density of the forward process. These equations can be interpreted as a particular case of the following pair of SDEs (corresponding to [32] Equations (4) and (17); notice that our notation for the roles of p, q is swapped with respect to [32]).
d x t = f ( x t , t ) + g 2 ( t ) log q ( x t , t ) μ ( x t , t ) d t + g ( t ) σ ( t ) d w ( t ) , d x t = f ( x t , t ) g 2 ( t ) log q ( x t , t ) μ ( x t , t ) + g ( t ) σ ( t ) a ( x t , t ) d t + g ( t ) d w ( t ) ,
where Equation (A7) is recovered considering a ( x , t ) = σ ( t ) log q ( x , t ) = g ( t ) log q ( x , t ) . Equation (A8) is associated to an ELBO ([32], Theorem 3) that is attained with equality if and only if a ( x , t ) = σ ( t ) log q ( x , t ) . Consequently, we can write the following equality associated to the backward process of Equation (A7)
log q ( x , T ) = E 1 2 0 T | | a ( x t , t ) | | 2 + 2 μ ( x t , t ) d s + log q ( x T , 0 ) | x 0 = x ,
where the expected value is taken with respect to the dynamics of the associated forward process.
By careful inspection of the pair of equations, we notice that, in the process x_t, the drift includes the ∇ log q(x_t, t) term, while in our main Equation (1) we have ∇ log p(x_t, t). In general the two vector fields do not agree. However, if we select p(x,T) as the starting distribution of the generating process, i.e., q(x,0) = p(x,T), then, for all t, q(x,t) = p(x, T − t).
Given the initial conditions, the time evolution of the density p is fully described by the Fokker–Planck equation
d/dt p(x,t) = − ∇·( f(x,t) p(x,t) ) + (g²(t)/2) Δ p(x,t),   p(x,0) = p_data(x).
Similarly, for the density q,
d/dt q(x,t) = ∇·( [ f(x,t) − g²(t) ∇ log q(x,t) ] q(x,t) ) + (g²(t)/2) Δ q(x,t),
with q ( x , 0 ) = p ( x , T ) . By Taylor expansion we have
q ( x , δ t ) = q ( x , 0 ) + δ t d d t q ( x , t ) t = 0 + O ( δ t 2 ) = q ( x , 0 ) + δ t f ( x , T ) q ( x , 0 ) + g 2 ( T ) log q ( x , 0 ) q ( x , 0 ) + g 2 ( T ) 2 Δ ( q ( x , 0 ) ) + O ( δ t 2 ) = q ( x , 0 ) + δ t f ( x , T ) q ( x , 0 ) g 2 ( T ) 2 Δ ( q ( x , 0 ) ) + O ( δ t 2 ) ,
and
p ( x , T δ t ) = p ( x , T ) δ t d d t p ( x , t ) t = T + O ( δ t 2 ) = p ( x , T ) δ t f ( x , T ) p ( x , T ) + g 2 ( T ) 2 Δ ( p ( x , T ) ) + O ( δ t 2 ) = p ( x , T ) + δ t f ( x , T ) p ( x , T ) g 2 ( T ) 2 Δ ( p ( x , T ) ) + O ( δ t 2 )
Since q(x,0) = p(x,T), we finally have q(x, δt) − p(x, T − δt) = O(δt²). This holds for arbitrarily small δt. By induction, with similar reasoning, we claim that q(x,t) = p(x, T − t).
This last result allows us to rewrite Equation (A7) as the pair of SDEs
d x_t = [ f(x_t, t) + g²(t) ∇ log p(x_t, t) ] dt + g(t) dw(t),   d x_t = f(x_t, t) dt + g(t) dw(t).
Moreover, since q ( x , T ) = p ( x , 0 ) = p d a t a ( x ) , together with the result Equation (A9), we have the following equality
log p d a t a ( x ) = E 1 2 0 T | | a ( x t , t ) | | 2 + 2 μ ( x t , t ) t + log p ( x T , T ) | x 0 = x .
Consequently,
E x p d a t a log p d a t a ( x ) = E log p ( x T , T ) + E 1 2 0 T | | a ( x t , t ) | | 2 + 2 μ ( x t , t ) d t = E log p ( x T , T ) + E 1 2 0 T g ( t ) 2 | | log p ( x t , t ) | | 2 + 2 f ( x t , t ) + g 2 ( t ) log p ( x t , t ) d t = E log p ( x T , T ) + E 1 2 0 T g ( t ) 2 | | log p ( x t , t ) | | 2 2 g 2 ( t ) x log p ( x t , t ) log p ( x t , t | x 0 ) d t + E 1 2 0 T 2 f ( x t , t ) log p ( x t , t | x 0 ) d t =
E log p ( x T , T ) + E 1 2 0 T g ( t ) 2 | | log p ( x t , t ) log p ( x t , t | x 0 ) | | 2 d t + E 1 2 0 T g ( t ) 2 | | log p ( x t , t | x 0 ) | | 2 + 2 f ( x t , t ) log p ( x t , t | x 0 ) d t .
Remembering the definitions
K ( T ) = 1 2 t = 0 T g 2 ( t ) E ( 1 ) | | log p ( x t , t ) log p ( x t , t | x 0 ) | | 2 d t R ( T ) = 1 2 t = 0 T E ( 1 ) g 2 ( t ) log p ( x , t | x 0 ) 2 2 f ( x , t ) log p ( x , t | x 0 ) d t ,
we finally conclude the proof that
E ( 1 ) [ log p ( x T , T ) ] K ( T ) + R ( T ) = E x p d a t a [ log p d a t a ( x ) ] .
 □

Appendix E. Proof of Lemma 1

In this section, we prove the validity of Lemma 1 for the case of Variance Preserving (VP) and Variance Exploding (VE) SDEs. Remember that, as also reported in Table 1, the above-mentioned classes correspond to α(t) = −(1/2) β(t), g(t) = √β(t), with β(t) = β_0 + (β_1 − β_0) t, and to α(t) = 0, g(t) = √(dσ²(t)/dt), with σ(t) = σ_min (σ_max / σ_min)^t, respectively.
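The two classes can be summarized by their perturbation-kernel coefficients, sketched below with a numerical cross-check; the constants β_0, β_1, σ_min, σ_max are illustrative choices, not values used in the paper.

```python
import numpy as np

# Perturbation-kernel coefficients for the two SDE classes (the constants
# beta0, beta1, sigma_min, sigma_max are illustrative choices).
beta0, beta1 = 0.1, 20.0            # VP: beta(t) = beta0 + (beta1 - beta0) t
sigma_min, sigma_max = 0.01, 50.0   # VE: sigma(t) = sigma_min (sigma_max/sigma_min)^t

def vp_coeffs(t):
    # k(t) = exp(-(1/2) int_0^t beta(s) ds), sigma^2(t) = 1 - k(t)^2
    B = beta0 * t + 0.5 * (beta1 - beta0) * t**2
    k = np.exp(-0.5 * B)
    return k, 1.0 - k**2

def ve_coeffs(t):
    # k(t) = 1; kernel variance sigma^2(t) - sigma^2(0)
    return 1.0, sigma_min**2 * ((sigma_max / sigma_min)**(2 * t) - 1.0)

# Cross-check: Euler integration of the VP mean ODE dk/dt = -(1/2) beta(t) k
# should recover the closed-form k(1).
k_num, dt = 1.0, 1e-5
for i in range(100_000):
    t = i * dt
    k_num -= 0.5 * (beta0 + (beta1 - beta0) * t) * k_num * dt
```

The agreement between the integrated and closed-form k(t) is what makes the conditional p(x,t | x_0) in Table 1 directly sampleable without simulating the SDE.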

Appendix E.1. The Variance Preserving (VP) Convergence

We associate this class of sdes to the Fokker Planck operator
L(t)(·) = (1/2) β(t) ∇·( x (·) + ∇(·) ),
and consequently d p(x,t)/dt = L(t) p(x,t). Simple calculations show that lim_{T→∞} p(x,T) = N_1(x).
We bound the time derivative of the KL term as
d t KL p ( x , T ) N 1 ( x ) = d p ( x , t ) d t log ( p ( x , t ) N 1 ( x ) ) d x + p ( x , t ) p ( x , t ) d p ( x , t ) d t d x = 1 2 β ( t ) log ( N 1 ( x ) ) p ( x , t ) ) + p ( x , t ) ) log ( p ( x , t ) N 1 ( x ) ) d x = 1 2 β ( t ) p ( x , t ) log ( N 1 ( x ) ) + log p ( x , t ) ) ( log ( p ( x , t ) N 1 ( x ) ) ) d x = 1 2 β ( t ) p ( x , t ) ( log ( p ( x , t ) N 1 ( x ) ) ) ( log ( p ( x , t ) N 1 ( x ) ) ) d x = 1 2 β ( t ) p ( x , t ) | | ( log ( p ( x , t ) N 1 ( x ) ) ) | | 2 d x β ( t ) KL p ( x , T ) N 1 ( x ) .
We then apply Gronwall’s inequality [33] to d/dt KL( p(x,t) ‖ N_1(x) ) ≤ − β(t) KL( p(x,t) ‖ N_1(x) ) to claim
KL( p(x,T) ‖ N_1(x) ) ≤ KL( p(x,0) ‖ N_1(x) ) exp( − ∫_0^T β(s) ds ).
To claim validity of the result, we need to assume that p(x,t) has finite first- and second-order derivatives, and that KL( p(x,0) ‖ N_1(x) ) < ∞.

Appendix E.2. The Variance Exploding (VE) Convergence

The first step is to bound the derivative with respect to ω of the divergence KL( p_ω(x) ‖ N_ω(x) ), i.e.,
d d ω KL p ω ( x ) N ω ( x ) = d p ω ( x ) d ω log ( p ω ( x ) N ω ( x ) ) d x + p ω ( x ) p ω ( x ) d p ω ( x ) d ω d x p ω ( x ) N ω ( x ) d N ω ( x ) d ω d x = 1 2 Δ p ω ( x ) log ( p ω ( x ) N ω ( x ) ) Δ N ω ( x ) p ω ( x ) N ω ( x ) d x = 1 2 p ω ( x ) log p ω ( x ) log ( p ω ( x ) N ω ( x ) ) N ω ( x ) log N ω ( x ) p ω ( x ) N ω ( x ) d x = 1 2 p ω ( x ) log p ω ( x ) ( log ( p ω ( x ) N ω ( x ) ) ) N ω ( x ) log N ω ( x ) ( p ω ( x ) N ω ( x ) ) d x = 1 2 p ω ( x ) log p ω ( x ) ( log ( p ω ( x ) N ω ( x ) ) ) p ω ( x ) log N ω ( x ) ( log ( p ω ( x ) N ω ( x ) ) ) d x = 1 2 p ω ( x ) | | ( log ( p ω ( x ) N ω ( x ) ) ) | | 2 d x 1 ω KL p ω ( x ) N ω ( x ) .
Consequently, using Gronwall’s inequality again, for all ω_1 > ω_0 > 0 we have
KL( p_{ω_1}(x) ‖ N_{ω_1}(x) ) ≤ KL( p_{ω_0}(x) ‖ N_{ω_0}(x) ) exp( − ( log(ω_1) − log(ω_0) ) ) = KL( p_{ω_0}(x) ‖ N_{ω_0}(x) ) ω_0 / ω_1.
This can be directly applied to obtain the bound for the VE SDE. Consider ω_1 = σ²(T) − σ²(0) and ω_0 = σ²(τ) − σ²(0) for an arbitrarily small τ < T. Then, since for the considered class of variance exploding SDEs we have p(x,T) = p_{σ²(T)−σ²(0)}(x),
KL( p(x,T) ‖ N_{σ²(T)−σ²(0)}(x) ) ≤ C / ( σ²(T) − σ²(0) ),
where C = KL( p(x,τ) ‖ N_{σ²(τ)−σ²(0)}(x) ) ( σ²(τ) − σ²(0) ).
Similarly to the previous case, we assume that p ( x , t ) has finite first and second order derivatives, and that C < .

Appendix F. Proof for the Optimal Score Gap Term, Section 2.2

The optimal score gap term G( ŝ_θ, T ) is a non-decreasing function of T. That is, given T_2 > T_1, and θ_1 = arg min_θ I( s_θ, T_1 ), θ_2 = arg min_θ I( s_θ, T_2 ), then G( s_{θ_2}, T_2 ) ≥ G( s_{θ_1}, T_1 ).
Proof. 
For θ_1 defined as in the lemma, I( s_{θ_1}, T_1 ) = K(T_1) + G( s_{θ_1}, T_1 ). Next, select T_2 > T_1. Then, for a generic θ, including θ_2,
I ( s θ , T 2 ) = t = 0 T 1 g 2 ( t ) E ( 1 ) | | s θ ( x t , t ) log p ( x t , t | x 0 ) | | 2 d t = I ( s θ , T 1 ) K ( T 1 ) + G ( s θ 1 , T 1 ) = I ( s θ 1 , T 1 ) +
t = T 1 T 2 g 2 ( t ) E ( 1 ) | | s θ ( x t , t ) log p ( x t , t | x 0 ) | | 2 d t t = T 1 T 2 g 2 ( t ) E ( 1 ) | | log p ( x t , t ) log p ( x t , t | x 0 ) | | 2 d t = K ( T 2 ) K ( T 1 ) G ( s θ 1 , T 1 ) + K ( T 2 ) ,
from which G( s_θ, T_2 ) = I( s_θ, T_2 ) − K(T_2) ≥ G( s_{θ_1}, T_1 ). □

Appendix G. Proof of Section 2.3

Consider the ELBO decomposition in Equation (9). We study it as a function of the time T, and seek its optimal argument T⋆ = arg max_T L_ELBO( ŝ_θ, T ). Then, the optimal diffusion time satisfies T⋆ ∈ R⁺ ∪ {∞}, and thus not necessarily T⋆ = ∞. Additional assumptions on the gap term G(·) can be used to guarantee strict finiteness of T⋆.
Proof. 
It is trivial to verify that, since the optimal gap term G( ŝ_θ, T ) is an increasing function of T (Section 2.2), we have ∂G/∂T ≥ 0. Then, we study the sign of the derivative of the KL term, which is always negative, as shown by Equations (A16) and (A18) (where we also notice that d/dt = (dω/dt) d/dω keeps the sign). Moreover, we know that lim_{T→∞} ∂KL/∂T = 0. Then, the function ∂L_ELBO/∂T = − ∂G/∂T − ∂KL/∂T has at least one zero in [0, ∞].  □

Appendix H. Optimization of T

It is possible to treat the diffusion time T as a hyper-parameter and perform gradient-based optimization jointly with the score model parameters θ . Indeed, simple calculations show that
L KL ( s θ , T ) T = E f ( x T , T ) + g 2 ( T ) Δ log p n o i s e ( x T ) +
1 2 E s θ ( x T , T ) log p ( x T , T | x 0 ) 2 +
1 2 E g 2 ( T ) log p ( x T , T | x 0 ) 2 2 f ( x T , T ) log p ( x T , T | x 0 )

Appendix I. Proof of Proposition 2

Proof. 
Since for all T we have L_ELBO^φ( s_θ, T ) ≥ L_ELBO( s_θ, T ), there exists a countable set of intervals I contained in [0, T⋆], of variable supports, where L_ELBO^φ is greater than L_ELBO( s_θ, T⋆ ). Assuming continuity of L_ELBO^φ, in these intervals it is possible to find at least one τ ≤ T⋆ where L_ELBO^φ( ŝ_θ, τ ) ≥ L_ELBO( ŝ_θ, T⋆ ).  □
We notice that the degenerate case I = {T⋆} is obtained only when, for all T ≤ T⋆, KL( p(x,T) ‖ ν_φ(x) ) = KL( p(x,T) ‖ p_noise(x) ). We expect this condition to never occur in practice.

Appendix J. Invariance to Noise Schedule

Here we discuss the claims made in Section 2.4 about the invariance of the ELBO to the particular choice of noise schedule. First, in Appendix J.1, we explain how different SDEs corresponding to different noise schedules can be translated one into the other. We introduce the concept of the signal-to-noise ratio (SNR). We clarify the unified score parametrization used in practice in the literature [5,46]. Then, in Appendix J.2, we prove how the single elements of the ELBO depend only on the value of the SNR at the final diffusion time T, as claimed in the main paper.

Appendix J.1. Preliminaries

We consider as reference SDE a pure Wiener process diffusion,
d x t = d w t with x 0 p d a t a ,
It is easily seen that the solution of the random process admits representation
x_t = x_0 + √t ε,   ε ∼ N(0, I).
In this case, the time varying probability density, which we indicate with ψ , satisfies
ψ(x,t) = exp( (t/2) Δ ) p_data(x),   ψ(x,t | x_0) = exp( (t/2) Δ ) δ(x − x_0).
Simple calculations show that
∇ log ψ(x, σ²) = ( E[ x_0 | x_0 + σ ε = x ] − x ) / σ² ≝ ( d(x; σ²) − x ) / σ²,
where again x_0 ∼ p_data, and the function d can be interpreted as a denoiser.
Our goal is to show the relationship between equations like Equation (1) and Equation (A23). In particular, we focus on affine SDEs, as classically done with diffusion models. The class of considered affine SDEs is the following:
d x t = α ( t ) x t d t + g ( t ) d w t with x 0 p d a t a ,
In this simple linear case the process admits representation
x t = k ( t ) x 0 + σ ( t ) ϵ , ϵ N ( 0 , I )
where k(t) = exp( ∫_0^t α(s) ds ), σ²(t) = k²(t) ∫_0^t g²(s) / k²(s) ds. We can rewrite Equation (A28) as x_t = k(t) ( x_0 + σ̃(t) ε ), where σ̃(t) = σ(t) / k(t) is the inverse of the SNR. The density associated with Equation (A27) can be expressed as a function of ψ as follows
p(x,t) = k(t)^{−D} [ exp( (σ̃²(t)/2) Δ ) p_data ]( x / k(t) ) = k(t)^{−D} ψ( x / k(t), σ̃²(t) ).
The score function associated to Equation (A28) has consequently expression
∇_x log p(x,t) = ∇_x log ψ( x / k(t), σ̃²(t) ) = (1 / k(t)) [ ∇ log ψ ]( x / k(t), σ̃²(t) ) = ( k(t) d( x / k(t); σ̃²(t) ) − x ) / σ²(t).
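The closed-form representation x_t = k(t) x_0 + σ(t) ε can be cross-checked by simulating Equation (A27) directly. The snippet below is a sketch under illustrative assumptions (a VP instance with constant β = 2 and a deterministic x_0, so the moments are simple); it is not the paper's experimental setup.

```python
import numpy as np

# Monte Carlo cross-check of x_t = k(t) x0 + sigma(t) eps for a VP instance
# of the affine SDE (constant beta = 2, an illustrative choice):
# alpha(t) = -beta/2, g(t) = sqrt(beta), hence
# k(t) = exp(-beta t / 2) and sigma^2(t) = 1 - exp(-beta t).
rng = np.random.default_rng(1)
beta, T, n_steps, n_paths = 2.0, 1.0, 1000, 100_000
dt = T / n_steps

x = np.full(n_paths, 1.5)   # deterministic x0, so the moments are simple
for _ in range(n_steps):    # Euler-Maruyama simulation of the SDE
    x += -0.5 * beta * x * dt + np.sqrt(beta * dt) * rng.normal(size=n_paths)

k = np.exp(-beta * T / 2)            # closed-form k(T)
sigma2 = 1.0 - np.exp(-beta * T)     # closed-form sigma^2(T)
snr_inv = np.sqrt(sigma2) / k        # sigma_tilde(T) = sigma(T) / k(T)
```

The simulated mean and variance match k(T) x_0 and σ²(T) up to Monte Carlo and discretization error, confirming the representation that the score parametrization below relies on.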

Appendix J.2. Different Noise Schedules

Consider a diffusion of the form of Equation (A23) and a score network s̄_θ that approximates the true score. Inspecting Equation (A30), we parametrize the score network associated with a generic diffusion Equation (A27) as a function of the score of the reference diffusion. The score parametrization considered in [5] can be generalized to arbitrary SDEs [46]. In particular, as suggested by Equation (A26), we select
s̄_θ(x, t) = ( k(t) d_θ( x / k(t); σ̃²(t) ) − x ) / σ²(t).
We proceed by showing that the different components of the ELBO depend on the diffusion time T only through σ̃(T), and not on k(t), σ(t) individually for any time t < T.
Theorem A1. 
Consider a generic diffusion Equation (A27) and parametrize the score network as s̄_θ( x / k(t), σ̃(t) ). Then, the gap term G( s̄_θ, T ) associated with Equation (A27) for a diffusion time T depends only on σ̃(T), and not on k(t), σ(t) individually for any time t < T.
Proof. 
We first rearrange the gap term
2 G( s̄_θ, T ) = ∫_0^T g²(t) E_{(A27)}[ ‖ s̄_θ(x_t, t) − ∇ log p(x_t, t | x_0) ‖² ] dt − ∫_0^T g²(t) E_{(A27)}[ ‖ ∇ log p(x_t, t) − ∇ log p(x_t, t | x_0) ‖² ] dt = ∫_0^T g²(t) E_{(A27)}[ ‖ s̄_θ(x_t, t) − ∇ log p(x_t, t) ‖² ] dt.
Then
t = 0 T g 2 ( t ) s ¯ θ ( x , t ) log p ( x , t ) 2 p ( x , t | x 0 ) p d a t a ( d x 0 ) d x d x 0 t = t = 0 T g 2 ( t ) k ( t ) d θ ( x k ( t ) ; σ ˜ 2 ( t ) ) x σ 2 ( t ) k ( t ) d θ ( x k ( t ) ; σ ˜ 2 ( t ) ) x σ 2 ( t ) 2 p ( x , t | x 0 ) p d a t a ( d x 0 ) d x d x 0 t = t = 0 T g 2 ( t ) k ( t ) d θ ( x k ( t ) ; σ ˜ 2 ( t ) ) k ( t ) d ( x k ( t ) ; σ ˜ 2 ( t ) ) σ 2 ( t ) 2 p ( x , t | x 0 ) p d a t a ( x 0 ) d x d x 0 d t = t = 0 T g 2 ( t ) k 2 ( t ) d θ ( x k ( t ) ; σ ˜ 2 ( t ) ) d ( x k ( t ) ; σ ˜ 2 ( t ) ) σ ˜ 2 ( t ) 2 p ( x , t | x 0 ) p d a t a ( x 0 ) d x d x 0 d t = t = 0 T g 2 ( t ) k 2 ( t ) d θ ( x k ( t ) ; σ ˜ 2 ( t ) ) d ( x k ( t ) ; σ ˜ 2 ( t ) ) σ ˜ 2 ( t ) 2 ψ ( x k ( t ) , σ ˜ 2 ( t ) | x 0 ) p d a t a ( x 0 ) k ( t ) D d x d x 0 d t = subst . x ˜ = x k ( t ) , d x ˜ = d x k D ( t ) t = 0 T g 2 ( t ) k 2 ( t ) d θ ( x ˜ ; σ ˜ 2 ( t ) ) d ( x ˜ ; σ ˜ 2 ( t ) ) σ ˜ 2 ( t ) 2 ψ ( x ˜ , σ ˜ 2 ( t ) | x 0 ) p d a t a ( x 0 ) d x d x 0 d t = subst . r = σ ˜ 2 ( t ) , r = g 2 ( t ) k 2 ( t ) t t = 0 σ ˜ 2 ( T ) s ¯ θ ( x ˜ , r ) log ψ ( x ˜ , r | x 0 ) 2 ψ ( x ˜ , r ) p d a t a ( x 0 ) d x ˜ d x 0 d r
For any $k(t), \sigma(t)$ such that $\tilde{\sigma}(T)$ is the same, the score-matching loss is therefore the same.  □
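The crux of the proof is the change of variables $r = \tilde{\sigma}^2(t)$, with $\mathrm{d}r = (g^2(t)/k^2(t))\,\mathrm{d}t$. As a quick numerical illustration (the two schedules below are arbitrary choices, not those used in the paper), any two schedules reaching the same $\tilde{\sigma}^2(T)$ give the same weighted integral of any function of $\tilde{\sigma}^2(t)$:

```python
import numpy as np

def weighted_integral(F, k, g, T, n=100000):
    # Trapezoidal approximation of  int_0^T (g^2(t) / k^2(t)) * F(sigma_tilde^2(t)) dt,
    # where sigma_tilde^2(t) = int_0^t g^2(s) / k^2(s) ds.
    t = np.linspace(0.0, T, n)
    w = g(t) ** 2 / k(t) ** 2
    dt = np.diff(t)
    sig2 = np.concatenate([[0.0], np.cumsum((w[1:] + w[:-1]) / 2 * dt)])
    y = w * F(sig2)
    return np.sum((y[1:] + y[:-1]) / 2 * dt), sig2[-1]

F = lambda r: np.exp(-r)  # arbitrary stand-in for the inner score-matching integrand

# Schedule A: k = 1,   g = 1,            T = 1    -> sigma_tilde^2(T) = 1
# Schedule B: k = e^t, g = sqrt(2) e^t,  T = 0.5  -> sigma_tilde^2(T) = 1
I_a, s_a = weighted_integral(F, lambda t: np.ones_like(t), lambda t: np.ones_like(t), 1.0)
I_b, s_b = weighted_integral(F, lambda t: np.exp(t), lambda t: np.sqrt(2.0) * np.exp(t), 0.5)
```

Both integrals reduce, under the substitution, to $\int_0^1 e^{-r}\,\mathrm{d}r = 1 - e^{-1}$.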
Theorem A2. 
Suppose that for any parameter ϕ of the auxiliary model $\nu_\phi(x)$ there exists a $\phi'$ such that $\nu_{\phi'}(x) = k^D \nu_\phi(kx)$, for any $k > 0$. Notice that this condition is trivially satisfied if the considered parametric model has the expressiveness to rescale its input and multiply its output by a scalar. Then the minimum Kullback–Leibler divergence between the density $p(x, T)$ associated with a generic diffusion, Equation (A27), and the density of an auxiliary model $\nu_\phi(x)$ depends only on $\tilde{\sigma}(T)$, and not on $k(T), \sigma(T)$ individually.
Proof. 
We start with the equality
$$
\begin{aligned}
\mathrm{KL}\left[ p(x, T) \,\|\, \nu_\phi(x) \right]
&= \mathrm{KL}\left[ \frac{1}{k(T)^D}\, \psi\!\left(\frac{x}{k(T)}, \tilde{\sigma}(T)\right) \,\Big\|\, \nu_\phi(x) \right]
= \mathrm{KL}\left[ \frac{1}{k(T)^D}\, \psi\!\left(\frac{x}{k(T)}, \tilde{\sigma}(T)\right) \,\Big\|\, \frac{1}{k(T)^D}\, \nu_{\phi'}\!\left(\frac{x}{k(T)}\right) \right] \\
&= \int \frac{1}{k(T)^D}\, \psi\!\left(\frac{x}{k(T)}, \tilde{\sigma}(T)\right) \log \frac{\psi\!\left(\frac{x}{k(T)}, \tilde{\sigma}(T)\right)}{\nu_{\phi'}\!\left(\frac{x}{k(T)}\right)}\, \mathrm{d}x \\
&= \int \psi(\tilde{x}, \tilde{\sigma}(T)) \log \frac{\psi(\tilde{x}, \tilde{\sigma}(T))}{\nu_{\phi'}(\tilde{x})}\, \mathrm{d}\tilde{x}
= \mathrm{KL}\left[ \psi(x, \tilde{\sigma}(T)) \,\|\, \nu_{\phi'}(x) \right]
\end{aligned}
$$
Then the minimum depends only on $\tilde{\sigma}(T)$, as it is always possible to achieve the same value, independently of the SDE, by rescaling the auxiliary model output. □
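As a 1D sanity check of this rescaling argument (all numbers below are arbitrary), the KL divergence between the rescaled Gaussian pair matches the KL divergence between the unscaled pair exactly:

```python
import numpy as np

def kl_gauss(m1, v1, m2, v2):
    # Closed-form KL( N(m1, v1) || N(m2, v2) )
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

k = 3.0               # scale factor k(T)
m, sig_t2 = 0.0, 0.7  # psi(., sigma_tilde(T)) = N(m, sig_t2)
am, av = 0.2, 1.5     # auxiliary model nu = N(am, av)

# p(x, T) = psi(x / k) / k^D is N(k m, k^2 sig_t2); the rescaled auxiliary model
# nu'(x) = nu(x / k) / k^D is N(k am, k^2 av)  (D = 1 here).
kl_scaled = kl_gauss(k * m, k**2 * sig_t2, k * am, k**2 * av)  # KL( p(x, T) || nu' )
kl_plain = kl_gauss(m, sig_t2, am, av)                         # KL( psi || nu )
```

The equality holds because the KL divergence is invariant under any common invertible change of variables applied to both densities.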

Appendix K. Experimental Details

Here we give some additional details concerning the experimental settings of Section 4.

Appendix K.1. Toy Example Details

In the toy example, we use 8192 samples from a simple two-component Gaussian mixture as the target $p_{data}(x)$. In detail, we have $p_{data}(x) = \pi\, \mathcal{N}(1, 0.1^2) + (1 - \pi)\, \mathcal{N}(3, 0.5^2)$, with $\pi = 0.3$. The choice of a Gaussian mixture allows us to write down explicitly the time-varying density
$$
p(x_t, t) = \pi\, \mathcal{N}(1,\, s^2(t) + 0.1^2) + (1 - \pi)\, \mathcal{N}(3,\, s^2(t) + 0.5^2),
$$
where $s^2(t)$ is the marginal variance of the process at time $t$. We consider a variance-exploding SDE of the type $\mathrm{d}x_t = \sigma^t\, \mathrm{d}w_t$, which corresponds to $s^2(t) = \frac{\sigma^{2t} - 1}{2 \log \sigma}$.
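These closed forms can be cross-checked with an Euler–Maruyama simulation of the forward SDE (a sketch; the step count, path count, and σ = 25 are arbitrary choices, not the experimental settings):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, T, n_steps, n_paths, pi = 25.0, 0.5, 500, 50000, 0.3

def s2(t):
    # Marginal variance added by the VE SDE dx_t = sigma^t dw_t up to time t
    return (sigma ** (2 * t) - 1.0) / (2.0 * np.log(sigma))

# Draw x_0 from the mixture pi * N(1, 0.1^2) + (1 - pi) * N(3, 0.5^2)
comp = rng.random(n_paths) < pi
x = np.where(comp, rng.normal(1.0, 0.1, n_paths), rng.normal(3.0, 0.5, n_paths))

# Euler-Maruyama simulation of the forward SDE
dt = T / n_steps
for n in range(n_steps):
    x += sigma ** (n * dt) * np.sqrt(dt) * rng.standard_normal(n_paths)

# Mean and variance predicted by the mixture form of p(x_t, t)
mean = pi * 1.0 + (1 - pi) * 3.0
var_pred = (pi * (s2(T) + 0.1**2) + (1 - pi) * (s2(T) + 0.5**2)
            + pi * (1.0 - mean)**2 + (1 - pi) * (3.0 - mean)**2)
```

The simulated samples should match the predicted mean and variance up to Monte Carlo and discretization error.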
Figure A1. Visualization of a few samples at different diffusion times T.

Appendix K.2. Section 4 Details

We considered the Variance Preserving SDE with default $\beta_0, \beta_1$ parameter settings. When experimenting on CIFAR10, we considered the NCSN++ architecture as implemented in [3]. Training of the score-matching network was carried out with the default set of optimizers and schedulers of [3], independently of the selected T.
For the MNIST dataset, we reduced the architecture to 64 features, with ch_mult = (1, 2) and attention resolutions equal to 8. The optimizer was the same as in the CIFAR10 experiment, but the warmup was reduced to 1000 iterations and the total number of iterations to 65,000.
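For reference, the reduced MNIST settings described above can be summarized as a plain dictionary (a sketch of the hyperparameters only; the field names are ours and do not necessarily match the configuration files of [3]):

```python
# Hypothetical summary of the reduced MNIST setup described in the text;
# field names are illustrative, only the values come from the paper.
mnist_config = {
    "model": {
        "nf": 64,                  # number of base features (reduced)
        "ch_mult": (1, 2),         # channel multipliers per resolution
        "attn_resolutions": (8,),  # attention applied at resolution 8
    },
    "optim": {
        "warmup": 1000,            # reduced warmup iterations
        "n_iters": 65000,          # total training iterations
    },
}
```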

Appendix K.3. Varying T

We clarify the T truncation procedure used during both training and testing. The SDE parameters are kept unchanged irrespective of T. During training, as is evident from Equation (3), it is sufficient to sample the diffusion time randomly from the distribution $\mathcal{U}(0, T)$, where T can take any positive value. For testing (sampling), we simply modified the algorithmic routines to begin the reverse diffusion process from a generic T instead of the default 1.0.
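In pseudocode, the only changes relative to the standard pipeline are the sampling range of t at training time and the starting time of the reverse integration (function names are hypothetical; eps is the small cutoff near t = 0 commonly used in [3]):

```python
import numpy as np

rng = np.random.default_rng(0)
T, eps = 0.4, 1e-5  # truncated diffusion time; eps avoids the singularity at t = 0

def training_times(batch_size):
    # Training: diffusion times drawn uniformly in (eps, T) instead of (eps, 1.0)
    return rng.uniform(eps, T, size=batch_size)

def reverse_time_grid(n_steps):
    # Sampling: the reverse diffusion is integrated from T down to eps
    return np.linspace(T, eps, n_steps)

t = training_times(128)
grid = reverse_time_grid(100)
```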

Appendix L. Non-Curated Samples

For completeness, we provide a collection of non-curated samples for the CIFAR10 (Figure A2, Figure A3, Figure A4 and Figure A5), MNIST (Figure A6, Figure A7, Figure A8 and Figure A9) and CELEBA (Figure A10 and Table 5) datasets.
Figure A2. CIFAR10: Our method (left) and the Vanilla method (right) at T = 0.2 .
Figure A3. CIFAR10: Our method (left) and the Vanilla method (right) at T = 0.4 .
Figure A4. CIFAR10: Our method (left) and the Vanilla method (right) at T = 0.6 .
Figure A5. Vanilla method at T = 1.0 .
Figure A6. MNIST: Our method (left) and the Vanilla method (right) at T = 0.2 .
Figure A7. MNIST: Our method (left) and the Vanilla method (right) at T = 0.4 .
Figure A8. MNIST: Our method (left) and the Vanilla method (right) at T = 0.6 .
Figure A9. MNIST: Vanilla method at T = 1.0 .
Figure A10. CELEBA images. Top left: our method with pretrained score model and Glow ( T = 0.2 ); top right: our method with pretrained score model and Glow ( T = 0.5 ); bottom left: baseline diffusion ( T = 1.0 ). Bottom right: FID scores for our method and the baseline ( T = 1.0 ).

References

  1. Sohl-Dickstein, J.; Weiss, E.; Maheswaranathan, N.; Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015. [Google Scholar]
  2. Song, Y.; Ermon, S. Generative Modeling by Estimating Gradients of the Data Distribution. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
  3. Song, Y.; Sohl-Dickstein, J.; Kingma, D.P.; Kumar, A.; Ermon, S.; Poole, B. Score-Based Generative Modeling through Stochastic Differential Equations. In Proceedings of the International Conference on Learning Representations, Virtual, 30 April–3 May 2021. [Google Scholar]
  4. Vahdat, A.; Kreis, K.; Kautz, J. Score-based Generative Modeling in Latent Space. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–14 December 2021. [Google Scholar]
  5. Kingma, D.; Salimans, T.; Poole, B.; Ho, J. Variational Diffusion Models. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–14 December 2021. [Google Scholar]
  6. Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–12 December 2020. [Google Scholar]
  7. Song, J.; Meng, C.; Ermon, S. Denoising Diffusion Implicit Models. In Proceedings of the International Conference on Learning Representations, Virtual, 30 April–3 May 2021. [Google Scholar]
  8. Kong, Z.; Ping, W.; Huang, J.; Zhao, K.; Catanzaro, B. DiffWave: A Versatile Diffusion Model for Audio Synthesis. In Proceedings of the International Conference on Learning Representations, Virtual, 30 April–3 May 2021. [Google Scholar]
  9. Lee, S.G.; Kim, H.; Shin, C.; Tan, X.; Liu, C.; Meng, Q.; Qin, T.; Chen, W.; Yoon, S.; Liu, T.Y. PriorGrad: Improving Conditional Denoising Diffusion Models with Data-Dependent Adaptive Prior. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
  10. Dhariwal, P.; Nichol, A. Diffusion Models Beat GANs on Image Synthesis. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–14 December 2021. [Google Scholar]
  11. Nichol, A.Q.; Dhariwal, P. Improved Denoising Diffusion Probabilistic Models. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021. [Google Scholar]
  12. Tashiro, Y.; Song, J.; Song, Y.; Ermon, S. CSDI: Conditional Score-based Diffusion Models for Probabilistic Time Series Imputation. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–14 December 2021. [Google Scholar]
  13. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014. [Google Scholar]
  14. Kingma, D.P.; Salimans, T.; Jozefowicz, R.; Chen, X.; Sutskever, I.; Welling, M. Improved Variational Inference with Inverse Autoregressive Flow. In Proceedings of the Advances in Neural Information Processing Systems 29, Barcelona, Spain, 5–10 December 2016. [Google Scholar]
  15. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. In Proceedings of the International Conference on Learning Representations, Banff, AB, Canada, 14–16 April 2014. [Google Scholar]
  16. Tran, B.H.; Rossi, S.; Milios, D.; Michiardi, P.; Bonilla, E.V.; Filippone, M. Model Selection for Bayesian Autoencoders. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–14 December 2021. [Google Scholar]
  17. Anderson, B.D. Reverse-Time Diffusion Equation Models. Stoch. Process. Their Appl. 1982, 12, 313–326. [Google Scholar] [CrossRef] [Green Version]
  18. Song, Y.; Durkan, C.; Murray, I.; Ermon, S. Maximum Likelihood Training of Score-Based Diffusion Models. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–14 December 2021. [Google Scholar]
  19. Särkkä, S.; Solin, A. Applied Stochastic Differential Equations; Institute of Mathematical Statistics Textbooks, Cambridge University Press: Cambridge, UK, 2019. [Google Scholar] [CrossRef] [Green Version]
  20. Zheng, H.; He, P.; Chen, W.; Zhou, M. Truncated Diffusion Probabilistic Models. CoRR 2022. abs/2202.09671. Available online: http://xxx.lanl.gov/abs/2202.09671 (accessed on 28 March 2023).
  21. Austin, J.; Johnson, D.D.; Ho, J.; Tarlow, D.; van den Berg, R. Structured Denoising Diffusion Models in Discrete State-Spaces. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–14 December 2021. [Google Scholar]
  22. Jolicoeur-Martineau, A.; Li, K.; Piché-Taillefer, R.; Kachman, T.; Mitliagkas, I. Gotta Go Fast When Generating Data with Score-Based Models. CoRR 2021. abs/2105.14080. Available online: http://xxx.lanl.gov/abs/2105.14080 (accessed on 28 March 2023).
  23. Salimans, T.; Ho, J. Progressive Distillation for Fast Sampling of Diffusion Models. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
  24. Xiao, Z.; Kreis, K.; Vahdat, A. Tackling the Generative Learning Trilemma with Denoising Diffusion GANs. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
  25. Watson, D.; Ho, J.; Norouzi, M.; Chan, W. Learning to Efficiently Sample from Diffusion Probabilistic Models. CoRR 2021. abs/2106.03802. Available online: http://xxx.lanl.gov/abs/2106.03802 (accessed on 28 March 2023).
  26. Dockhorn, T.; Vahdat, A.; Kreis, K. Score-Based Generative Modeling with Critically-Damped Langevin Diffusion. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
  27. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022. [Google Scholar]
  28. Bao, F.; Li, C.; Zhu, J.; Zhang, B. Analytic-DPM: An Analytic Estimate of the Optimal Reverse Variance in Diffusion Probabilistic Models. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
  29. De Bortoli, V.; Thornton, J.; Heng, J.; Doucet, A. Diffusion Schrödinger Bridge with Applications to Score-Based Generative Modeling. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–14 December 2021. [Google Scholar]
  30. De Bortoli, V. Convergence of denoising diffusion models under the manifold hypothesis. arXiv 2022, arXiv:2208.05314. [Google Scholar]
  31. Lee, H.; Lu, J.; Tan, Y. Convergence for score-based generative modeling with polynomial complexity. arXiv 2022, arXiv:2206.06227. [Google Scholar]
  32. Huang, C.W.; Lim, J.H.; Courville, A.C. A Variational Perspective on Diffusion-Based Generative Models and Score Matching. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–14 December 2021. [Google Scholar]
  33. Villani, C. Optimal Transport: Old and New; Springer: Berlin/Heidelberg, Germany, 2009; Volume 338. [Google Scholar]
  34. Chen, S.; Chewi, S.; Li, J.; Li, Y.; Salim, A.; Zhang, A.R. Sampling is as easy as learning the score: Theory for diffusion models with minimal data assumptions. arXiv 2022, arXiv:2209.11215. [Google Scholar]
  35. Chen, Y.; Georgiou, T.T.; Pavon, M. Stochastic control liaisons: Richard sinkhorn meets gaspard monge on a schrodinger bridge. SIAM Rev. 2021, 63, 249–313. [Google Scholar] [CrossRef]
  36. Chen, T.; Liu, G.H.; Theodorou, E. Likelihood Training of Schrödinger Bridge using Forward-Backward SDEs Theory. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
  37. Chen, R.T.Q.; Rubanova, Y.; Bettencourt, J.; Duvenaud, D.K. Neural Ordinary Differential Equations. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018. [Google Scholar]
  38. Grathwohl, W.; Chen, R.T.Q.; Bettencourt, J.; Duvenaud, D. Scalable Reversible Generative Models with Free-form Continuous Dynamics. In Proceedings of the International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  39. Kynkäänniemi, T.; Karras, T.; Aittala, M.; Aila, T.; Lehtinen, J. The Role of ImageNet Classes in Fréchet Inception Distance. CoRR 2022. abs/2203.06026. Available online: http://xxx.lanl.gov/abs/2203.06026 (accessed on 28 March 2023).
  40. Theis, L.; van den Oord, A.; Bethge, M. A Note on the Evaluation of Generative Models. In Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
  41. Rasmussen, C. The Infinite Gaussian Mixture Model. In Proceedings of the Advances in Neural Information Processing Systems, Denver, CO, USA, 29 November–4 December 1999. [Google Scholar]
  42. Görür, D.; Edward Rasmussen, C. Dirichlet Process Gaussian Mixture Models: Choice of the Base Distribution. J. Comput. Sci. Technol. 2010, 25, 653–664. [Google Scholar] [CrossRef] [Green Version]
  43. Kingma, D.P.; Dhariwal, P. Glow: Generative Flow with Invertible 1 × 1 Convolutions. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018. [Google Scholar]
  44. Hoogeboom, E.; Gritsenko, A.A.; Bastings, J.; Poole, B.; van den Berg, R.; Salimans, T. Autoregressive Diffusion Models. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
  45. Kloeden, P.E.; Platen, E. Numerical Solution of Stochastic Differential Equations; Springer: Berlin/Heidelberg, Germany, 1995. [Google Scholar]
  46. Karras, T.; Aittala, M.; Aila, T.; Laine, S. Elucidating the Design Space of Diffusion-Based Generative Models. arXiv 2022, arXiv:2206.00364. [Google Scholar]
Figure 1. Effect of T on a toy model: low diffusion times are detrimental to sample quality (likelihood of 1024 samples, reported as median and 95% quantile over 8 random seeds).
Figure 2. ELBO decomposition, ELBO, and likelihood for a 1D toy model, as a function of diffusion time T. The numerical results confirm the trade-off and optimality predicted by our theory.
Figure 3. Intuitive illustration of the forward and backward diffusion processes. Discrepancies between distributions are illustrated as distances. Color coding is discussed in the text.
Figure 4. Complexity of the auxiliary model as a function of diffusion time (median and 95% quantiles over 4 random seeds).
Figure 5. Visualization of some samples. Top to Bottom: ScoreSDE [3] ( T = 1 , BPD = 1.16 ), ScoreSDE ( T = 0.4 , BPD = 1.25 ), Our ( T = 0.4 , BPD = 1.17 ).
Figure 6. Visualization of some samples on CIFAR10.
Figure 7. Training curves of score models for different diffusion times T, recorded over the span of 1.3 million iterations.
Table 2. Optimal T in [3].
Dataset    Time T    BPD (↓)
MNIST      1.0       1.16
           0.6       1.16
           0.4       1.25
           0.2       1.75
CIFAR10    1.0       3.09
           0.6       3.07
           0.4       3.09
           0.2       3.38
Table 3. Experiment results on MNIST. For our method, (S) is for the extension in Section 3.3.
Model                 NFE (ODE) (↓)    BPD (↓)
ScoreSDE              300              1.16
ScoreSDE (T = 0.6)    258              1.16
Our (T = 0.6)         258              1.16 / 1.14 (S)
ScoreSDE (T = 0.4)    235              1.25
Our (T = 0.4)         235              1.17 / 1.16 (S)
ScoreSDE (T = 0.2)    191              1.75
Our (T = 0.2)         191              1.33 / 1.31 (S)
Table 4. Experimental results on CIFAR10, including other relevant baselines and sampling efficiency enhancements from the literature.
Model                 FID (↓)    BPD (↓)        NFE (SDE) (↓)    NFE (ODE) (↓)
ScoreSDE [3]          3.64       3.09           1000             221
ScoreSDE (T = 0.6)    5.74       3.07           600              200
ScoreSDE (T = 0.4)    24.91      3.09           400              187
ScoreSDE (T = 0.2)    339.72     3.38           200              176
Our (T = 0.6)         3.72       3.07           600              200
Our (T = 0.4)         5.44       3.06           400              187
Our (T = 0.2)         14.38      3.06           200              176
ARDM [44]             –          2.69           –                3072
VDM [5]               4.0        2.49           1000             –
D3PMs [21]            7.34       3.43           1000             –
DDPM [6]              3.21       3.75           1000             –
Gotta Go Fast [22]    2.44       –              180              –
LSGM [4]              2.10       2.87           –                120 / 138
ARDM-P [44]           –          2.68 / 2.74    –                200 / 50
Table 5. Experimental results on CELEBA 64.
Model                                      BPD (↓)    NFE (ODE) (↓)
ScoreSDE [3]                               2.13       68
ScoreSDE (T = 0.5)                         8.06       15
ScoreSDE (T = 0.2)                         12.1       9
Our (T = 0.5)                              2.48       16
Our (T = 0.2)                              2.58       9
Our with pretrained diffusion (T = 0.5)    2.36       16
Our with pretrained diffusion (T = 0.2)    2.32       9
Glow [43]                                  3.74       1

Share and Cite

MDPI and ACS Style

Franzese, G.; Rossi, S.; Yang, L.; Finamore, A.; Rossi, D.; Filippone, M.; Michiardi, P. How Much Is Enough? A Study on Diffusion Times in Score-Based Generative Models. Entropy 2023, 25, 633. https://doi.org/10.3390/e25040633
