# F-Divergences and Cost Function Locality in Generative Modelling with Quantum Circuits


## Abstract


## 1. Introduction

## 2. Background

#### 2.1. Generative Modelling

#### 2.2. Born Machines as Implicit Generative Models

#### 2.3. Adversarial Generative Modelling with f-Divergences

## 3. Training Heuristics

#### 3.1. Switching f-Divergences

#### 3.2. Local Cost Functions

## 4. Numerical Results

We construct the circuits using `pytket` [89] and execute the simulations with Qiskit [90]. The parameters of the QCBM are updated using stochastic gradient descent with a constant learning rate, which is tuned for each of the simulations.

The classifiers are implemented using `scikit-learn` [92]. The particular hyper-parameters used in each simulation are specified below.
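As an illustration of the training loop described above, the following is a minimal classical stand-in: a softmax model over the $2^3$ bitstrings of a 3-qubit register, trained by stochastic gradient descent with a constant learning rate to minimise the forward KL to a fixed target. It is a sketch only; the actual experiments use a quantum circuit sampler and parameter-shift gradients, not this exact gradient.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def kl(p, q):
    # Forward KL divergence KL(p || q), assuming q_i > 0 wherever p_i > 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

random.seed(0)
weights = [1, 2, 3, 4, 4, 3, 2, 1]           # 2**3 outcomes for 3 "qubits"
target = [w / sum(weights) for w in weights]

theta = [random.uniform(-1, 1) for _ in range(len(target))]
lr = 0.5                                      # constant learning rate, as in the paper

history = []
for epoch in range(200):
    q = softmax(theta)
    history.append(kl(target, q))
    # Exact gradient of KL(p || softmax(theta)) with respect to theta is q - p.
    grad = [qi - pi for qi, pi in zip(q, target)]
    theta = [t - lr * g for t, g in zip(theta, grad)]

assert history[-1] < history[0]   # the cost decreases during training
```

The same loop structure carries over to the QCBM, with the exact gradient replaced by the parameter-shift estimates of Table 1.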

#### 4.1. Switching f-Divergences

#### 4.2. Local Cost Functions

## 5. Estimation of f-Divergences on Fault-Tolerant Quantum Computers

**Theorem 1.** Assume $p,q$ are two distributions on $\left[n\right]$. Then there is a quantum algorithm that approximates $\mathrm{TV}(p,q)$ up to an additive error $\epsilon >0$, with probability of success $1-\delta$, using $O\big(\sqrt{n}\,{\epsilon}^{-3/2}\log\left(1/\delta \right)\big)$ quantum queries.
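For context, the classical baseline that Theorem 1 speeds up is the plug-in estimator: draw samples from $p$ and $q$, form empirical histograms, and compute the TV of those. The following sketch (our own illustration, not part of the quantum algorithm) shows this baseline on a small alphabet.

```python
import random
from collections import Counter

def empirical_tv(samples_p, samples_q, n):
    """Plug-in estimate of TV(p, q) from i.i.d. samples over [n] = {0, ..., n-1}."""
    cp, cq = Counter(samples_p), Counter(samples_q)
    return 0.5 * sum(abs(cp[i] / len(samples_p) - cq[i] / len(samples_q))
                     for i in range(n))

random.seed(1)
n = 8
p = [0.5] + [0.5 / (n - 1)] * (n - 1)    # skewed distribution
q = [1.0 / n] * n                        # uniform distribution
exact_tv = 0.5 * sum(abs(a - b) for a, b in zip(p, q))   # = 0.375 here

sp = random.choices(range(n), weights=p, k=50000)
sq = random.choices(range(n), weights=q, k=50000)
est = empirical_tv(sp, sq, n)
assert abs(est - exact_tv) < 0.02
```

The quantum algorithm of Theorem 1 replaces these sample draws with coherent queries, improving the dependence on the alphabet size $n$.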

**Theorem 2.** Assume $p,q$ are two distributions on $\left[n\right]$ satisfying ${p}_{i}/{q}_{i}\le g\left(n\right)\ \forall i\in \left[n\right]$ for some $g:\mathbb{N}\to {\mathbb{R}}^{+}$. Then there is a quantum algorithm that approximates $\mathrm{KL}(p\parallel q)$ within an additive error $\epsilon >0$ with probability of success at least $2/3$ using $\tilde{O}(\sqrt{n}/{\epsilon}^{2})$ quantum queries to $p$ and $\tilde{O}(\sqrt{n}\,g\left(n\right)/{\epsilon}^{2})$ quantum queries to $q$. (The notation $\tilde{O}(\cdot)$ ignores factors that are polynomial in $\log n$ and $\log 1/\epsilon$.)

The algorithm of [40] relies on two amplitude estimation subroutines, `EstAmp` and `EstAmp'`. The only difference between these two algorithms is their behaviour when one of the probabilities, ${q}_{i}$, is sufficiently close to zero. This is problematic in the case of the KL estimation (and indeed entropy estimation) in [40], since the relevant quantities diverge as ${q}_{i}\to 0$. The same is true in our case, as ${q}_{i}^{-1}$ appears in many f-divergences.

**Theorem 3.** For any $k,M\in \mathbb{N}$, there is a quantum algorithm (named `EstAmp`) with $M$ queries to a Boolean function $\chi :\left[S\right]\to \{0,1\}$ that outputs $\tilde{a}={\sin}^{2}\left(\frac{l\pi}{M}\right)$ for some $l\in \{0,\dots ,M-1\}$ such that $|\tilde{a}-a|\le \frac{2\pi k\sqrt{a(1-a)}}{M}+\frac{{k}^{2}{\pi}^{2}}{{M}^{2}}$ with probability at least $8/{\pi}^{2}$ when $k=1$, where $a=|{\chi}^{-1}(1)|/S$.

The modified algorithm (named `EstAmp'`) outputs ${\sin}^{2}\left(\frac{\pi}{2M}\right)$ when `EstAmp` outputs 0, and outputs the same as `EstAmp` otherwise. Now that we have a mechanism for estimating the probabilities, we need a final ingredient, which is the generic speedup of Monte Carlo methods from [39].

**Theorem 4.** Let $\mathcal{A}$ be a quantum algorithm with output $X$ such that $\mathrm{Var}\left[X\right]\le {\sigma}^{2}$. Then for $\epsilon$ where $0<\epsilon <4\sigma$, by using $O\big((\sigma /\epsilon ){\log}^{3/2}(\sigma /\epsilon )\log\log(\sigma /\epsilon )\big)$ executions of $\mathcal{A}$ and ${\mathcal{A}}^{-1}$, Algorithm 3 in [39] outputs an estimate $\tilde{\mathbb{E}}\left[X\right]$ of $\mathbb{E}\left[X\right]$ such that $\mathrm{Pr}\big[\,|\tilde{\mathbb{E}}\left[X\right]-\mathbb{E}\left[X\right]|\ge \epsilon \,\big]\le 1/4$.
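To see the size of the speedup, compare the $O(\sigma/\epsilon)$ scaling of Theorem 4 against the $\Theta(\sigma^{2}/\epsilon^{2})$ samples a classical Monte Carlo mean estimator needs. The following arithmetic sketch ignores all constant factors and is purely illustrative:

```python
import math

def classical_samples(sigma, eps):
    # Classical Monte Carlo (Chebyshev-style) needs on the order of sigma^2/eps^2 samples.
    return (sigma / eps) ** 2

def quantum_executions(sigma, eps):
    # Theorem 4 (Montanaro [39]): O((sigma/eps) log^{3/2}(sigma/eps) loglog(sigma/eps)).
    r = sigma / eps
    return r * math.log(r) ** 1.5 * math.log(math.log(r))

sigma, eps = 1.0, 1e-3
# Up to constants, the quantum count is far smaller at this accuracy.
assert quantum_executions(sigma, eps) < classical_samples(sigma, eps)
```

At $\sigma/\epsilon = 10^{3}$ this gives roughly $3.5\times 10^{4}$ quantum executions versus $10^{6}$ classical samples, a near-quadratic saving.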

Algorithm 1: Estimate the forward Pearson divergence of $p={\left({p}_{i}\right)}_{i=1}^{n}$ and $q={\left({q}_{i}\right)}_{i=1}^{n}$ on $\left[n\right]$.
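The quantity that Algorithm 1 estimates is simple to state classically. A direct implementation from exact probability vectors (useful for checking the quantum estimator on small instances) is:

```python
def pearson_forward(p, q):
    """Forward Pearson divergence chi^2(p || q) = sum_i (p_i - q_i)^2 / p_i."""
    assert all(pi > 0 for pi in p), "forward Pearson requires p_i > 0"
    return sum((pi - qi) ** 2 / pi for pi, qi in zip(p, q))

p = [0.5, 0.25, 0.25]
q = [0.25, 0.5, 0.25]
assert pearson_forward(p, p) == 0.0
# (0.25^2)/0.5 + (0.25^2)/0.25 + 0 = 0.125 + 0.25 = 0.375
assert abs(pearson_forward(p, q) - 0.375) < 1e-12
```

Algorithm 1 produces an estimate of this sum using `EstAmp'` for the probabilities and Theorem 4 for the outer expectation, rather than reading off $p$ and $q$ exactly.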

**Theorem 5.**

## 6. Discussion

## Author Contributions

## Funding

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Appendix A. Proof of Theorem 5

**Theorem 5.**

**Proof.**

`EstAmp'` outputs ${\tilde{q}}_{i}$ such that ${\tilde{q}}_{i}\ge {\sin}^{2}\big(\pi /{2}^{\lceil {\log}_{2}(\sqrt{n}\,g\left(n\right)/\epsilon )\rceil +1}\big)\ge {\epsilon}^{2}/\big(4n\,g{\left(n\right)}^{2}\big)$ for any $i$. It follows that ${\tilde{q}}_{i}/{\tilde{p}}_{i}\ge {\tilde{q}}_{i}\ge {\epsilon}^{2}/\big(4n\,g{\left(n\right)}^{2}\big)$, and thus $\exp(-2{\tilde{q}}_{i}/{\tilde{p}}_{i})\le \exp\big(-{\epsilon}^{2}/\big(2n\,g{\left(n\right)}^{2}\big)\big)$. We thus have, using also the fact that ${(x-1)}^{2}\le \exp(-2x)$ for $x\le 1$, that
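The elementary bound used in the proof follows from $1-x\le e^{-x}$: for $x\le 1$ both sides of that inequality are non-negative, so squaring gives $(x-1)^{2}\le e^{-2x}$. A quick numerical sanity check of this bound, and of the fact that it fails beyond $x=1$:

```python
import math

# Verify (x - 1)^2 <= exp(-2x) on a grid of x <= 1.
xs = [-5 + 0.01 * k for k in range(601)]   # x in [-5, 1]
ok = all((x - 1) ** 2 <= math.exp(-2 * x) + 1e-12 for x in xs)
assert ok

# The bound does not extend past x = 1, e.g. at x = 2:
assert (2 - 1) ** 2 > math.exp(-4)
```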

## References

1. McClean, J.R.; Romero, J.; Babbush, R.; Aspuru-Guzik, A. The theory of variational hybrid quantum-classical algorithms. New J. Phys. **2016**, 18, 23023.
2. Benedetti, M.; Lloyd, E.; Sack, S.; Fiorentini, M. Parameterized quantum circuits as machine learning models. Quantum Sci. Technol. **2019**, 4, 043001.
3. Cerezo, M.; Arrasmith, A.; Babbush, R.; Benjamin, S.C.; Endo, S.; Fujii, K.; McClean, J.R.; Mitarai, K.; Yuan, X.; Cincio, L.; et al. Variational quantum algorithms. Nat. Rev. Phys. **2021**, 3, 1–20.
4. Bharti, K.; Cervera-Lierta, A.; Kyaw, T.H.; Haug, T.; Alperin-Lea, S.; Anand, A.; Degroote, M.; Heimonen, H.; Kottmann, J.S.; Menke, T.; et al. Noisy intermediate-scale quantum (NISQ) algorithms. arXiv **2021**, arXiv:2101.08448.
5. Li, W.; Deng, D.L. Recent advances for quantum classifiers. arXiv **2021**, arXiv:2108.13421.
6. Grant, E.; Benedetti, M.; Cao, S.; Hallam, A.; Lockhart, J.; Stojevic, V.; Green, A.G.; Severini, S. Hierarchical quantum classifiers. NPJ Quantum Inf. **2018**, 4, 65.
7. Cong, I.; Choi, S.; Lukin, M.D. Quantum convolutional neural networks. Nat. Phys. **2019**, 15, 1273–1278.
8. Schuld, M.; Killoran, N. Quantum Machine Learning in Feature Hilbert Spaces. Phys. Rev. Lett. **2019**, 122, 40504.
9. Havlíček, V.; Córcoles, A.D.; Temme, K.; Harrow, A.W.; Kandala, A.; Chow, J.M.; Gambetta, J.M. Supervised learning with quantum-enhanced feature spaces. Nature **2019**, 567, 209–212.
10. LaRose, R.; Coyle, B. Robust data encodings for quantum classifiers. Phys. Rev. A **2020**, 102, 032420.
11. Romero, J.; Olson, J.P.; Aspuru-Guzik, A. Quantum autoencoders for efficient compression of quantum data. Quantum Sci. Technol. **2017**, 2, 45001.
12. Pepper, A.; Tischler, N.; Pryde, G.J. Experimental Realization of a Quantum Autoencoder: The Compression of Qutrits via Machine Learning. Phys. Rev. Lett. **2019**, 122, 60501.
13. Ding, Y.; Lamata, L.; Sanz, M.; Chen, X.; Solano, E. Experimental Implementation of a Quantum Autoencoder via Quantum Adders. Adv. Quantum Technol. **2019**, 2, 1800065.
14. Otterbach, J.S.; Manenti, R.; Alidoust, N.; Bestwick, A.; Block, M.; Bloom, B.; Caldwell, S.; Didier, N.; Fried, E.S.; Hong, S.; et al. Unsupervised Machine Learning on a Hybrid Quantum Computer. arXiv **2017**, arXiv:1712.05771.
15. Liu, J.G.; Wang, L. Differentiable learning of quantum circuit Born machines. Phys. Rev. A **2018**, 98, 62324.
16. Benedetti, M.; Garcia-Pintos, D.; Perdomo, O.; Leyton-Ortega, V.; Nam, Y.; Perdomo-Ortiz, A. A generative modeling approach for benchmarking and training shallow quantum circuits. NPJ Quantum Inf. **2019**, 5, 45.
17. Hamilton, K.E.; Dumitrescu, E.F.; Pooser, R.C. Generative model benchmarks for superconducting qubits. Phys. Rev. A **2019**, 99, 62323.
18. Zhu, D.; Linke, N.M.; Benedetti, M.; Landsman, K.A.; Nguyen, N.H.; Alderete, C.H.; Perdomo-Ortiz, A.; Korda, N.; Garfoot, A.; Brecque, C.; et al. Training of quantum circuits on a hybrid quantum computer. Sci. Adv. **2019**, 5, eaaw9918.
19. Coyle, B.; Mills, D.; Danos, V.; Kashefi, E. The Born supremacy: Quantum advantage and training of an Ising Born machine. NPJ Quantum Inf. **2020**, 6, 60.
20. Du, Y.; Hsieh, M.H.; Liu, T.; Tao, D. Expressive power of parametrized quantum circuits. Phys. Rev. Res. **2020**, 2, 33125.
21. Anand, A.; Romero, J.; Degroote, M.; Aspuru-Guzik, A. Noise Robustness and Experimental Demonstration of a Quantum Generative Adversarial Network for Continuous Distributions. Adv. Quantum Technol. **2021**, 4, 2000069.
22. Leyton-Ortega, V.; Perdomo-Ortiz, A.; Perdomo, O. Robust implementation of generative modeling with parametrized quantum circuits. Quantum Mach. Intell. **2021**, 3, 17.
23. Dallaire-Demers, P.L.; Killoran, N. Quantum generative adversarial networks. Phys. Rev. A **2018**, 98, 12324.
24. Hu, L.; Wu, S.H.; Cai, W.; Ma, Y.; Mu, X.; Xu, Y.; Wang, H.; Song, Y.; Deng, D.L.; Zou, C.L.; et al. Quantum generative adversarial learning in a superconducting quantum circuit. Sci. Adv. **2019**, 5, eaav2761.
25. Zeng, J.; Wu, Y.; Liu, J.G.; Wang, L.; Hu, J. Learning and inference on generative adversarial quantum circuits. Phys. Rev. A **2019**, 99, 52306.
26. Zoufal, C.; Lucchi, A.; Woerner, S. Quantum Generative Adversarial Networks for learning and loading random distributions. NPJ Quantum Inf. **2019**, 5, 103.
27. Verdon, G.; Marks, J.; Nanda, S.; Leichenauer, S.; Hidary, J. Quantum Hamiltonian-Based Models and the Variational Quantum Thermalizer Algorithm. arXiv **2019**, arXiv:1910.02071.
28. Huang, H.L.; Du, Y.; Gong, M.; Zhao, Y.; Wu, Y.; Wang, C.; Li, S.; Liang, F.; Lin, J.; Xu, Y.; et al. Experimental Quantum Generative Adversarial Networks for Image Generation. arXiv **2020**, arXiv:2010.06201.
29. Situ, H.; He, Z.; Wang, Y.; Li, L.; Zheng, S. Quantum generative adversarial network for generating discrete distribution. Inf. Sci. **2020**, 538, 193–208.
30. Coyle, B.; Henderson, M.; Le, J.C.J.; Kumar, N.; Paini, M.; Kashefi, E. Quantum versus classical generative modelling in finance. Quantum Sci. Technol. **2021**, 6, 024013.
31. Liu, W.; Zhang, Y.; Deng, Z.; Zhao, J.; Tong, L. A hybrid quantum-classical conditional generative adversarial network algorithm for human-centered paradigm in cloud. EURASIP J. Wirel. Commun. Netw. **2021**, 2021, 37.
32. Rudolph, M.S.; Toussaint, N.B.; Katabarwa, A.; Johri, S.; Peropadre, B.; Perdomo-Ortiz, A. Generation of High-Resolution Handwritten Digits with an Ion-Trap Quantum Computer. arXiv **2020**, arXiv:2012.03924.
33. Benedetti, M.; Coyle, B.; Fiorentini, M.; Lubasch, M.; Rosenkranz, M. Variational inference with a quantum computer. arXiv **2021**, arXiv:2103.06720.
34. Cheng, S.; Chen, J.; Wang, L. Information Perspective to Probabilistic Modeling: Boltzmann Machines versus Born Machines. Entropy **2018**, 20, 583.
35. Sugiyama, M.; Suzuki, T.; Kanamori, T. Density Ratio Estimation in Machine Learning; Cambridge University Press: Cambridge, UK, 2012.
36. Cerezo, M.; Sone, A.; Volkoff, T.; Cincio, L.; Coles, P.J. Cost function dependent barren plateaus in shallow parametrized quantum circuits. Nat. Commun. **2021**, 12, 1791.
37. Uvarov, A.V.; Biamonte, J.D. On barren plateaus and cost function locality in variational quantum algorithms. J. Phys. A Math. Theor. **2021**, 54, 245301.
38. Bravyi, S.; Harrow, A.W.; Hassidim, A. Quantum Algorithms for Testing Properties of Distributions. IEEE Trans. Inf. Theory **2011**, 57, 3971–3981.
39. Montanaro, A. Quantum speedup of Monte Carlo methods. Proc. R. Soc. A Math. Phys. Eng. Sci. **2015**, 471, 20150301.
40. Li, T.; Wu, X. Quantum Query Complexity of Entropy Estimation. IEEE Trans. Inf. Theory **2019**, 65, 2899–2921.
41. Bowman, S.R.; Vilnis, L.; Vinyals, O.; Dai, A.M.; Jozefowicz, R.; Bengio, S. Generating Sentences from a Continuous Space. arXiv **2016**, arXiv:1511.06349.
42. Zhu, J.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2242–2251.
43. Simonovsky, M.; Komodakis, N. GraphVAE: Towards Generation of Small Graphs Using Variational Autoencoders. In Proceedings of the 27th International Conference on Artificial Neural Networks, Rhodes, Greece, 4–7 October 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 412–422.
44. Sinha, S.; Ebrahimi, S.; Darrell, T. Variational Adversarial Active Learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 5971–5980.
45. Ha, D.; Schmidhuber, J. World Models. arXiv **2018**, arXiv:1803.10122.
46. Ilse, M.; Tomczak, J.M.; Louizos, C.; Welling, M. DIVA: Domain Invariant Variational Autoencoders. arXiv **2019**, arXiv:1905.10427.
47. Brehmer, J.; Kling, F.; Espejo, I.; Cranmer, K. MadMiner: Machine Learning-Based Inference for Particle Physics. Comput. Softw. Big Sci. **2020**, 4, 3.
48. Oord, A.V.D.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K. WaveNet: A Generative Model for Raw Audio. arXiv **2016**, arXiv:1609.03499.
49. Diggle, P.J.; Gratton, R.J. Monte Carlo Methods of Inference for Implicit Statistical Models. J. R. Stat. Soc. Ser. B **1984**, 46, 193–212.
50. Mohamed, S.; Lakshminarayanan, B. Learning in Implicit Generative Models. arXiv **2017**, arXiv:1610.03483.
51. Frey, B.J. Graphical Models for Machine Learning and Digital Communication; MIT Press: Cambridge, MA, USA, 1998.
52. Uria, B.; Côté, M.A.; Gregor, K.; Murray, I.; Larochelle, H. Neural Autoregressive Distribution Estimation. arXiv **2016**, arXiv:1605.02226.
53. Rippel, O.; Adams, R.P. High-Dimensional Probability Estimation with Deep Density Models. arXiv **2013**, arXiv:1302.5125.
54. Rezende, D.J.; Mohamed, S. Variational Inference with Normalizing Flows. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 7–9 July 2015; pp. 1530–1538.
55. Dinh, L.; Sohl-Dickstein, J.; Bengio, S. Density Estimation Using Real NVP; ICLR, 2017; Available online: https://arxiv.org/abs/1605.08803 (accessed on 27 September 2021).
56. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv **2014**, arXiv:1312.6114.
57. Rezende, D.J.; Mohamed, S.; Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 1278–1286.
58. Ackley, D.H.; Hinton, G.E.; Sejnowski, T.J. A learning algorithm for Boltzmann machines. Cogn. Sci. **1985**, 9, 147–169.
59. Hinton, G.E.; Osindero, S.; Teh, Y. A fast learning algorithm for deep belief nets. Neural Comput. **2006**, 18, 1527–1554.
60. Salakhutdinov, R.; Hinton, G. Deep Boltzmann Machines. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Clearwater Beach, FL, USA, 16–18 April 2009; pp. 448–455.
61. Bengio, Y.; Thibodeau-Laufer, E.; Alain, G.; Yosinski, J. Deep Generative Stochastic Networks Trainable by Backprop. In Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 21–26 June 2014.
62. Dziugaite, G.K.; Roy, D.M.; Ghahramani, Z. Training generative neural networks via maximum mean discrepancy optimization. In Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence, Amsterdam, The Netherlands, 12–16 July 2015; pp. 258–267.
63. Li, Y.; Swersky, K.; Zemel, R. Generative Moment Matching Networks. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 7–9 July 2015; pp. 1718–1727.
64. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems—Volume 2, Montreal, QC, Canada, 8–13 December 2014; pp. 2672–2680.
65. Born, M. Zur Quantenmechanik der Stoßvorgänge. Z. Phys. **1926**, 37, 863–867.
66. Glasser, I.; Sweke, R.; Pancotti, N.; Eisert, J.; Cirac, J.I. Expressive power of tensor-network factorizations for probabilistic modeling. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019.
67. Bremner, M.J.; Montanaro, A.; Shepherd, D.J. Average-Case Complexity Versus Approximate Simulation of Commuting Quantum Computations. Phys. Rev. Lett. **2016**, 117, 80501.
68. Boixo, S.; Isakov, S.V.; Smelyanskiy, V.N.; Babbush, R.; Ding, N.; Jiang, Z.; Bremner, M.J.; Martinis, J.M.; Neven, H. Characterizing quantum supremacy in near-term devices. Nat. Phys. **2018**, 14, 595–600.
69. Bouland, A.; Fefferman, B.; Nirkhe, C.; Vazirani, U. On the complexity and verification of quantum random circuit sampling. Nat. Phys. **2019**, 15, 159–163.
70. Arute, F.; Arya, K.; Babbush, R.; Bacon, D.; Bardin, J.C.; Barends, R.; Biswas, R.; Boixo, S.; Brandao, F.G.S.L.; Buell, D.A.; et al. Quantum supremacy using a programmable superconducting processor. Nature **2019**, 574, 505–510.
71. Mitarai, K.; Negoro, M.; Kitagawa, M.; Fujii, K. Quantum circuit learning. Phys. Rev. A **2018**, 98, 032309.
72. Csiszár, I. Information-type measures of difference of probability distributions and indirect observation. Stud. Sci. Math. Hung. **1967**, 2, 229–318.
73. Ali, S.M.; Silvey, S. A General Class of Coefficients of Divergence of One Distribution from Another. J. R. Stat. Soc. Ser. B Methodol. **1966**, 28, 131–142.
74. Amari, S. α-Divergence Is Unique, Belonging to Both f-Divergence and Bregman Divergence Classes. IEEE Trans. Inf. Theory **2009**, 55, 4925–4931.
75. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. arXiv **2018**, arXiv:1706.08500.
76. Csiszár, I.; Shields, P. Information Theory and Statistics: A Tutorial; Foundations and Trends in Communications and Information Theory; Now Publishers Inc.: Boston, MA, USA, 2004; Volume 1, pp. 417–528.
77. Uehara, M.; Sato, I.; Suzuki, M.; Nakayama, K.; Matsuo, Y. Generative Adversarial Nets from a Density Ratio Estimation Perspective. arXiv **2016**, arXiv:1610.02920.
78. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein Generative Adversarial Networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Volume 70, pp. 214–223.
79. Nowozin, S.; Cseke, B.; Tomioka, R. f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization. In Proceedings of the 30th Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016.
80. McClean, J.R.; Boixo, S.; Smelyanskiy, V.N.; Babbush, R.; Neven, H. Barren plateaus in quantum neural network training landscapes. Nat. Commun. **2018**, 9, 4812.
81. Grant, E.; Wossnig, L.; Ostaszewski, M.; Benedetti, M. An initialization strategy for addressing barren plateaus in parametrized quantum circuits. Quantum **2019**, 3, 214.
82. Arrasmith, A.; Cerezo, M.; Czarnik, P.; Cincio, L.; Coles, P.J. Effect of barren plateaus on gradient-free optimization. arXiv **2020**, arXiv:2011.12245.
83. Marrero, C.O.; Kieferová, M.; Wiebe, N. Entanglement Induced Barren Plateaus. arXiv **2021**, arXiv:2010.15968.
84. Patti, T.L.; Najafi, K.; Gao, X.; Yelin, S.F. Entanglement devised barren plateau mitigation. Phys. Rev. Res. **2021**, 3, 033090.
85. Arrasmith, A.; Holmes, Z.; Cerezo, M.; Coles, P.J. Equivalence of quantum barren plateaus to cost concentration and narrow gorges. arXiv **2021**, arXiv:2104.05868.
86. Holmes, Z.; Sharma, K.; Cerezo, M.; Coles, P.J. Connecting ansatz expressibility to gradient magnitudes and barren plateaus. arXiv **2021**, arXiv:2101.02138.
87. Larocca, M.; Czarnik, P.; Sharma, K.; Muraleedharan, G.; Coles, P.J.; Cerezo, M. Diagnosing barren plateaus with tools from quantum optimal control. arXiv **2021**, arXiv:2105.14377.
88. Wang, S.; Fontana, E.; Cerezo, M.; Sharma, K.; Sone, A.; Cincio, L.; Coles, P.J. Noise-Induced Barren Plateaus in Variational Quantum Algorithms. arXiv **2021**, arXiv:2007.14384.
89. Sivarajah, S.; Dilkes, S.; Cowtan, A.; Simmons, W.; Edgington, A.; Duncan, R. t|ket>: A retargetable compiler for NISQ devices. Quantum Sci. Technol. **2020**, 6, 14003.
90. Aleksandrowicz, G.; Alexander, T.; Barkoutsos, P.; Bello, L.; Ben-Haim, Y.; Bucher, D.; Cabrera-Hernández, F.J.; Carballo-Franquis, J.; Chen, A.; Chen, C.-F.; et al. Qiskit: An Open-source Framework for Quantum Computing. 2021. Available online: https://zenodo.org/record/2562111#.YVUWKzURXIU (accessed on 27 September 2021).
91. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. arXiv **2019**, arXiv:1912.01703.
92. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. **2011**, 12, 2825–2830.
93. Han, Y.; Jiao, J.; Weissman, T. Minimax Estimation of Divergences Between Discrete Distributions. IEEE J. Sel. Areas Inf. Theory **2020**, 1, 814–823.
94. Chan, S.O.; Diakonikolas, I.; Valiant, G.; Valiant, P. Optimal Algorithms for Testing Closeness of Discrete Distributions. arXiv **2013**, arXiv:1308.3946.
95. Brassard, G.; Høyer, P.; Mosca, M.; Tapp, A. Quantum amplitude amplification and estimation. Quantum Comput. Inf. **2002**, 305, 53–74.
96. Sriperumbudur, B.K.; Fukumizu, K.; Gretton, A.; Schölkopf, B.; Lanckriet, G.R.G. On integral probability metrics, ϕ-divergences and binary classification. arXiv **2009**, arXiv:0901.2698.
97. Nielsen, F.; Nock, R. On the chi square and higher-order chi distances for approximating f-divergences. IEEE Signal Process. Lett. **2014**, 21, 10–13.

**Figure 1.** Convex conjugate ${f}^{*}$ (left panel) and derivative ${f}^{*\prime}$ (right panel) of the generator $f$ for several f-divergences. All generators have been standardised with ${f}^{\prime}\left(1\right)=0$ and normalised with ${f}^{\prime\prime}\left(1\right)=1$, except for the TV.

**Figure 2.** The ansatz employed in the numerical simulations (shown for three qubits). The ansatz consists of $D$ alternating layers of single-qubit gates and entangling gates. Each single-qubit layer consists of two single-qubit rotations, one around the z axis and one around the x axis. The entangling layer is composed of a ladder of CZ gates. There is an additional layer of Hadamard gates prior to the first layer, and an additional layer of single-qubit rotations after the final layer. The total number of parameters in a circuit of depth $D$ is given by ${n}_{p}=n(2D+2)$, where $n$ is the number of qubits.
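The gate sequence of Figure 2 and its parameter count $n_p = n(2D+2)$ can be made concrete with a small sketch. A real implementation would build this circuit with `pytket` or Qiskit; here we only represent it as a list of gate tuples so that the counting is easy to verify:

```python
def build_ansatz(n, D):
    """Return the Figure 2 ansatz as a list of (gate, qubits) tuples.

    Structure: a layer of Hadamards, then D blocks of [Rz + Rx on every
    qubit, CZ ladder], then a final layer of Rz + Rx rotations.
    """
    gates = [("H", (q,)) for q in range(n)]
    for _ in range(D):
        for q in range(n):
            gates += [("Rz", (q,)), ("Rx", (q,))]            # parameterised layer
        gates += [("CZ", (q, q + 1)) for q in range(n - 1)]  # entangling ladder
    for q in range(n):
        gates += [("Rz", (q,)), ("Rx", (q,))]                # final rotation layer
    return gates

n, D = 3, 4
circuit = build_ansatz(n, D)
n_params = sum(1 for g, _ in circuit if g in ("Rz", "Rx"))
assert n_params == n * (2 * D + 2)   # matches the count quoted in the caption
```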

**Figure 3.** Training performance of the QCBM in illustrative 3-qubit and 4-qubit experiments using four different classifiers. The classifiers are trained using 500 samples. We plot the bootstrapped median (solid line), as well as 90% confidence intervals (shaded).

**Figure 4.** Performance of the QCBM trained using the TV (green) and the f-divergence heuristic (red) for 3 qubits in the severely over-parameterised case OO(12,30), using an exact classifier. We show the bootstrapped median (solid line) and 90% confidence intervals (shaded) of the TV (**left**) and the KL (**right**).

**Figure 5.** Performance of the QCBM trained using the TV (green) and the f-divergence heuristic (red) for 3 qubits in the severely over-parameterised case OO(12,30), using a trained SVM classifier. We show the bootstrapped median (solid line) and 90% confidence intervals (shaded) of both the TV (**left**) and the KL (**right**).

**Figure 6.** Performance of the QCBM trained using the TV (green) and the f-divergence heuristic (red) for 3 qubits in the under-parameterised case U(30,18). We show the bootstrapped median (solid line) and 90% confidence intervals (shaded) of both the TV (**left**) and the KL (**right**).

**Figure 7.** Performance of the QCBM trained using several f-divergences for 3 qubits in the under-parameterised case U(30, 18). The parameters are initialised using the parameters which gave the lowest cost during training in Figure 6. We show the exact TV (**left**) and the exact KL (**right**).

**Figure 8.**f-divergences chosen throughout the training of the heuristic in Figure 7 in each of the 18 directions in parameter space.

**Figure 9.**Training performance of the QCBM using the global and local reverse KL for 4 qubits, 5 qubits, and 6 qubits, for a discretised Gaussian target distribution. For 4 qubits and 5 qubits, we show the bootstrapped median (solid line), as well as 90% confidence intervals (shaded). For 6 qubits, we plot an illustrative training example.

**Table 1.**A summary of well-known f-divergences, including the definition, the convex conjugate of the generator ${f}^{*}$, and the corresponding parameter-shift rule in terms of the ratio $r\left(\mathit{x}\right)=\frac{{q}_{\mathbf{\theta}}\left(\mathit{x}\right)}{p\left(\mathit{x}\right)}$. The $\parallel $ symbol indicates that the divergence is asymmetric, while a comma indicates that it is symmetric. Interestingly, one can construct symmetric f-divergences for every asymmetric one (see Table 2).

| f-Divergence | Definition | ${f}^{*}$ | Parameter-Shift |
|---|---|---|---|
| total variation | $\mathrm{TV}(p,{q}_{\theta})=\frac{1}{2}\sum_{x}|p(x)-{q}_{\theta}(x)|$ | $\frac{1}{2}|r-1|$ | $\frac{1}{2}{\mathbb{E}}_{{q}_{{\theta}^{+}}}\left[\mathrm{sgn}(r(x)-1)\right]-\frac{1}{2}{\mathbb{E}}_{{q}_{{\theta}^{-}}}\left[\mathrm{sgn}(r(x)-1)\right]$ |
| squared Hellinger | ${\mathrm{H}}^{2}(p,{q}_{\theta})=\sum_{x}{(\sqrt{p(x)}-\sqrt{{q}_{\theta}(x)})}^{2}$ | $2{(\sqrt{r}-1)}^{2}$ | $-2{\mathbb{E}}_{{q}_{{\theta}^{+}}}\big[\frac{1}{\sqrt{r(x)}}\big]+2{\mathbb{E}}_{{q}_{{\theta}^{-}}}\big[\frac{1}{\sqrt{r(x)}}\big]$ |
| Kullback–Leibler (type I, forward) | $\mathrm{KL}(p\parallel {q}_{\theta})={\mathbb{E}}_{p}\big[\log\frac{p(x)}{{q}_{\theta}(x)}\big]$ | $-\log r+r-1$ | $-{\mathbb{E}}_{{q}_{{\theta}^{+}}}\big[\frac{1}{r(x)}\big]+{\mathbb{E}}_{{q}_{{\theta}^{-}}}\big[\frac{1}{r(x)}\big]$ |
| Kullback–Leibler (type I, reverse) | $\mathrm{KL}({q}_{\theta}\parallel p)={\mathbb{E}}_{{q}_{\theta}}\big[\log\frac{{q}_{\theta}(x)}{p(x)}\big]$ | $r\log r-r+1$ | ${\mathbb{E}}_{{q}_{{\theta}^{+}}}\left[\log r(x)\right]-{\mathbb{E}}_{{q}_{{\theta}^{-}}}\left[\log r(x)\right]$ |
| Kullback–Leibler (type II, forward) | $\mathrm{KL}(p\parallel \frac{p+{q}_{\theta}}{2})={\mathbb{E}}_{p}\big[\log\frac{2p(x)}{p(x)+{q}_{\theta}(x)}\big]$ | $4\log\frac{2}{r+1}+2(r-1)$ | $-4{\mathbb{E}}_{{q}_{{\theta}^{+}}}\big[\frac{1}{r(x)+1}\big]+4{\mathbb{E}}_{{q}_{{\theta}^{-}}}\big[\frac{1}{r(x)+1}\big]$ |
| Kullback–Leibler (type II, reverse) | $\mathrm{KL}({q}_{\theta}\parallel \frac{p+{q}_{\theta}}{2})={\mathbb{E}}_{{q}_{\theta}}\big[\log\frac{2{q}_{\theta}(x)}{p(x)+{q}_{\theta}(x)}\big]$ | $4r\log\frac{2r}{r+1}+2(1-r)$ | $4{\mathbb{E}}_{{q}_{{\theta}^{+}}}\big[\log\frac{r(x)}{r(x)+1}+\frac{1}{r(x)+1}\big]-4{\mathbb{E}}_{{q}_{{\theta}^{-}}}\big[\log\frac{r(x)}{r(x)+1}+\frac{1}{r(x)+1}\big]$ |
| Pearson (forward) | ${\chi}^{2}(p\parallel {q}_{\theta})=\sum_{x}\frac{{(p(x)-{q}_{\theta}(x))}^{2}}{p(x)}$ | $\frac{{(r-1)}^{2}}{2}$ | ${\mathbb{E}}_{{q}_{{\theta}^{+}}}\left[r(x)\right]-{\mathbb{E}}_{{q}_{{\theta}^{-}}}\left[r(x)\right]$ |
| Pearson (reverse) | ${\chi}^{2}({q}_{\theta}\parallel p)=\sum_{x}\frac{{(p(x)-{q}_{\theta}(x))}^{2}}{{q}_{\theta}(x)}$ | $\frac{{(r-1)}^{2}}{2r}$ | $-\frac{1}{2}{\mathbb{E}}_{{q}_{{\theta}^{+}}}\big[\frac{1}{r{(x)}^{2}}\big]+\frac{1}{2}{\mathbb{E}}_{{q}_{{\theta}^{-}}}\big[\frac{1}{r{(x)}^{2}}\big]$ |
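The asymmetric/symmetric notational convention in Table 1 ($\parallel$ versus a comma) is easy to check numerically. The following sketch evaluates three of the table's divergences directly from probability vectors (an exact classical computation, used here only as a consistency check):

```python
import math

def tv(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kl(p, q):
    # Forward KL, assuming q_i > 0 wherever p_i > 0.
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def hellinger_sq(p, q):
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))

p = [0.4, 0.4, 0.2]
q = [0.2, 0.3, 0.5]

assert tv(p, q) == tv(q, p)              # TV is symmetric (comma notation)
assert abs(kl(p, q) - kl(q, p)) > 1e-6   # KL is asymmetric (parallel-bar notation)
assert hellinger_sq(p, p) == 0.0
assert 0 <= tv(p, q) <= 1
```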

**Table 2.**A summary of the symmetric f-divergences corresponding to some well-known asymmetric f-divergences, including the definition, and the parameter-shift rule.

f-Divergence | Definition | Parameter-Shift |
---|---|---|

symmetric Kullback–Leibler (type I, Jeffrey) | $\mathrm{J}(p,{q}_{\mathbf{\theta}})=\mathrm{KL}(p\parallel {q}_{\mathbf{\theta}})+\mathrm{KL}({q}_{\mathbf{\theta}}\parallel p)$ | $\frac{1}{2}}{\mathbb{E}}_{{q}_{{\mathbf{\theta}}^{+}}}\left[logr\left(\mathit{x}\right)-{\textstyle \frac{1}{r\left(\mathit{x}\right)}}\right]-{\textstyle \frac{1}{2}}{\mathbb{E}}_{{q}_{{\mathbf{\theta}}^{-}}}\left[logr\left(\mathit{x}\right)-{\textstyle \frac{1}{r\left(\mathit{x}\right)}}\right]$ |

symmetric Kullback–Leibler (type II, Jensen–Shannon) | $\mathrm{JS}(p,{q}_{\mathbf{\theta}})=\mathrm{KL}(p\parallel \frac{p+{q}_{\mathbf{\theta}}}{2})+\mathrm{KL}({q}_{\mathbf{\theta}}\parallel \frac{p+{q}_{\mathbf{\theta}}}{2})$ | $2{\mathbb{E}}_{{q}_{{\mathbf{\theta}}^{+}}}\left[log\frac{r\left(\mathit{x}\right)}{1+r\left(\mathit{x}\right)}\right]-2{\mathbb{E}}_{{q}_{{\mathbf{\theta}}^{-}}}\left[log\frac{r\left(\mathit{x}\right)}{1+r\left(\mathit{x}\right)}\right]$ |

symmetric Pearson | ${\overline{\chi}}^{2}(p,{q}_{\mathbf{\theta}})={\chi}^{2}(p\parallel {q}_{\mathbf{\theta}})+{\chi}^{2}({q}_{\mathbf{\theta}}\parallel p)$ | ${\textstyle \frac{1}{4}}{\mathbb{E}}_{{q}_{{\mathbf{\theta}}^{+}}}\left[2r\left(\mathit{x}\right)-{\textstyle \frac{1}{r{\left(\mathit{x}\right)}^{2}}}\right]-{\textstyle \frac{1}{4}}{\mathbb{E}}_{{q}_{{\mathbf{\theta}}^{-}}}\left[2r\left(\mathit{x}\right)-{\textstyle \frac{1}{r{\left(\mathit{x}\right)}^{2}}}\right]$ |
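Each symmetric divergence in Table 2 is, by definition, a sum of two asymmetric ones. A minimal NumPy sketch of the two symmetrised KL variants (natural logarithms; the example distributions are illustrative and assumed to have full support):

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions with full support (in nats)."""
    return np.sum(p * np.log(p / q))

def jeffrey(p, q):
    """Symmetric KL, type I: J(p, q) = KL(p || q) + KL(q || p)."""
    return kl(p, q) + kl(q, p)

def jensen_shannon(p, q):
    """Symmetric KL, type II: JS(p, q) = KL(p || m) + KL(q || m), m = (p + q)/2."""
    m = 0.5 * (p + q)
    return kl(p, m) + kl(q, m)

# illustrative 3-outcome distributions
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

print(jeffrey(p, q), jensen_shannon(p, q))
```

Unlike the Jeffrey divergence, the Jensen–Shannon divergence remains finite even when the supports of `p` and `q` differ, since the mixture `m` is nonzero wherever either distribution is.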

**Table 3.** The parameterisation regimes considered in the numerical experiments, defined by the number of parameters (layers) in the circuit generating the target p and in the model ${q}_{\mathbf{\theta}}$.

 | Severely Over-Parameterised (OO) | Over-Parameterised (O) | Exactly Parameterised (E) | Under-Parameterised (U) | Severely Under-Parameterised (UU) |
---|---|---|---|---|---|

Number of parameters (layers) used to generate the target p | 12 parameters (1 layer) | 12 parameters (1 layer) | 12 parameters (1 layer) | 30 parameters (4 layers) | 30 parameters (4 layers) |

Number of parameters (layers) used for the model ${q}_{\mathbf{\theta}}$ | 30 parameters (4 layers) | 24 parameters (3 layers) | 12 parameters (1 layer) | 18 parameters (2 layers) | 12 parameters (1 layer) |

**Table 4.** Performance of the QCBM trained using the TV and the f-divergence switching heuristic for 3 qubits in over-, under-, and exactly parameterised regimes. We show the bootstrapped median of the TV (top two rows) and the KL (bottom two rows) after 500 epochs. An asterisk (*) indicates that the cost is still converging. Bold indicates the regimes where f-switch significantly outperforms the other methods.

${\mathit{D}}_{\mathit{f}}$ Evaluated | ${\mathit{D}}_{\mathit{f}}$ Used in Training | OO (12, 30) | O (12, 24) | E (12, 12) | U (30, 18) | UU (30, 12) |
---|---|---|---|---|---|---|

TV | TV | $\left(1.12\begin{array}{c}+0.45\\ -0.28\end{array}\right)\times {10}^{-2}$ | $\left(8.4\begin{array}{c}+1.2\\ -1.0\end{array}\right)\times {10}^{-3}$ | $\left(1.00\begin{array}{c}+1.51\\ -0.12\end{array}\right)\times {10}^{-2}$ | $\left(1.06\begin{array}{c}+0.26\\ -0.23\end{array}\right)\times {10}^{-2}$ | $\left(1.4\begin{array}{c}+2.4\\ -0.7\end{array}\right)\times {10}^{-2}$ |

TV | f-switch | $\left(\mathbf{0.6}\begin{array}{c}+\mathbf{3.8}\\ -\mathbf{0.5}\end{array}\right)\times {\mathbf{10}}^{-\mathbf{5}}$ * | $\left(\mathbf{2.5}\begin{array}{c}+\mathbf{2.5}\\ -\mathbf{2.1}\end{array}\right)\times {\mathbf{10}}^{-\mathbf{3}}$ * | $\left(3.1\begin{array}{c}+1.8\\ -1.9\end{array}\right)\times {10}^{-2}$ | $\left(0.65\begin{array}{c}+0.27\\ -0.51\end{array}\right)\times {10}^{-2}$ | $\left(1.8\begin{array}{c}+2.9\\ -0.9\end{array}\right)\times {10}^{-2}$ |

KL | TV | $\left(3.5\begin{array}{c}+2.1\\ -1.3\end{array}\right)\times {10}^{-4}$ | $\left(2.0\begin{array}{c}+0.6\\ -0.4\end{array}\right)\times {10}^{-4}$ | $\left(2.6\begin{array}{c}+14.8\\ -2.3\end{array}\right)\times {10}^{-3}$ | $\left(3.7\begin{array}{c}+1.7\\ -92.6\end{array}\right)\times {10}^{-4}$ | $\left(0.6\begin{array}{c}+24.3\\ -0.4\end{array}\right)\times {10}^{-3}$ |

KL | f-switch | $\left(\mathbf{0.0182}\begin{array}{c}+\mathbf{1.383}\\ -\mathbf{0.012}\end{array}\right)\times {\mathbf{10}}^{-\mathbf{8}}$ | $\left(1.8\begin{array}{c}+20.9\\ -1.7\end{array}\right)\times {10}^{-5}$ * | $\left(3.5\begin{array}{c}+9.1\\ -2.0\end{array}\right)\times {10}^{-3}$ | $\left(2.4\begin{array}{c}+1.6\\ -2.4\end{array}\right)\times {10}^{-4}$ | $\left(1.8\begin{array}{c}+4.3\\ -1.5\end{array}\right)\times {10}^{-3}$ |
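The entries in Table 4 are bootstrapped medians with asymmetric error bars. One standard way to produce such intervals is the percentile bootstrap: resample the final costs across independent runs with replacement and take percentiles of the resampled medians. A hypothetical sketch (the data, resample count, and percentile choices are illustrative, not the paper's exact procedure):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_median_ci(samples, n_boot=10_000, lo=2.5, hi=97.5):
    """Median of `samples` with a percentile-bootstrap confidence interval."""
    samples = np.asarray(samples)
    medians = np.array([
        np.median(rng.choice(samples, size=samples.size, replace=True))
        for _ in range(n_boot)
    ])
    return np.median(samples), np.percentile(medians, lo), np.percentile(medians, hi)

# e.g. final TV values from independent training runs (illustrative numbers)
costs = [1.12e-2, 0.84e-2, 1.00e-2, 1.06e-2, 1.40e-2, 0.95e-2]
med, low, high = bootstrap_median_ci(costs)
# report as med (+ high - med, - med - low), matching the table's format
```

Because the bootstrap distribution of the median need not be symmetric about the sample median, the upper and lower error bars generally differ, as seen in the table.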

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Leadbeater, C.; Sharrock, L.; Coyle, B.; Benedetti, M.
F-Divergences and Cost Function Locality in Generative Modelling with Quantum Circuits. *Entropy* **2021**, *23*, 1281.
https://doi.org/10.3390/e23101281
