Is Quantum Tomography a Difficult Problem for Machine Learning? †

Philippe Jacquet

Inria Saclay Ile-de-France, 91120 Palaiseau, France

Open Access Proceeding Paper

† Presented at the 41st International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering, Paris, France, 18–22 July 2022.

Phys. Sci. Forum 2022, 5(1), 47; https://doi.org/10.3390/psf2022005047

Published: 7 February 2023

(This article belongs to the Proceedings of The 41st International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering)

Abstract

One of the key issues in machine learning is the characterization of the learnability of a problem. Regret is a way to quantify learnability. Quantum tomography is a special case of machine learning where the training set is a set of quantum measurements and the ground truth is the result of these measurements, but nothing is known about the hidden quantum system. We show that in some cases quantum tomography is a hard problem to learn. We consider a problem related to optical fiber communication where information is encoded in photon polarizations. We show that the learning regret cannot decay faster than $1/\sqrt{T}$, where $T$ is the size of the training dataset, and that incremental gradient descent may converge even more slowly.

Keywords: machine learning; artificial intelligence; photon polarization; quantum tomography

1. Introduction: Supervised Learning in General

With the invention of deep neural learning, the general public sees a glimpse of a universal machine learning technology capable of solving arbitrary problems without any specific preparation of the training data or the learning strategy. Everything would be solvable as long as there are enough layers, enough processing power and enough training data. We have reached the point where many people (among them the late Stephen Hawking) have started to think that machines may supersede human intelligence thanks to the greater performance of silicon neurons over biological neurons, and may be capable of cracking the last enigmas around the physical nature of the universe.

However, we should not forget that current Artificial Intelligence (AI) has many limitations, although, given the youth of the technology, many of the present limits may only be teething problems. To learn a language, present algorithms need to be trained on millions of texts, which would be equivalent to a training period of 80 years at the learning pace of a child! Presently, deep neural training is very demanding in processing, and it is the third major source of energy consumption among information technologies after Bitcoin and data centers. Deep learning is not yet as good a self-organizing learning process as some researchers would have thought [1]. There is also the obstacle that data sparsity poses to learning (the machine only recognizes the data on which it has been trained over and over, as if a reader could only understand the texts on which (s)he has been trained).

In short, the main limitations of machine learning technologies are: (i) data sparsity; (ii) the absence of a computable solution to learn (e.g., the program halting problem); (iii) the presence of hard-to-learn algorithms in the solution. This paper addresses the third limitation.

A supervised learning problem can be viewed as a set of training data and ground truths. The machine acts as an automaton whose aim is to predict the ground truth from the data. The loss measures the difference between the prediction and the ground truth and can be established under an arbitrary metric. The general objective of supervised machine learning is to minimize the average loss, but since the ground truth might contain some inherent stochastic variations (e.g., when predicting the result of a quantum measurement), it may be impossible to make the loss as small as we would like. Given an automaton architecture, there exists a setting that gives the optimal average loss. However, the optimal setting might be difficult to reach, and there is still the question of the size of the training set needed to converge to the optimal setting.

Not all problems are equal with respect to learnability [2]. Some seem to be a perfect match for AI; others are more difficult to adapt. In [3], the authors show that random parity functions are simply unlearnable. In fact, in a broader perspective, “learnability” itself may not be a learnable problem [4].

The first contribution of this paper is a new definition of learning regret with respect to a given single problem submitted to a given learning strategy. Most regret expressions are taken as an infimum of the regret over a large (if not universal) class of problems [5] and therefore lose the specificity of individual problems.

The second contribution is the application of this new regret definition to a quantum tomography problem. The specificity of the problem is that the hidden source probability distribution is indeed contained in the learning distribution class. The surprising result is that the regret is at least of the order of the square root of the number of runs, hinting at a poor convergence rate of the learned distribution toward the hidden distribution. We conclude with numerical experiments with gradient descent.

2. Expressing the Convergence Regret

Let $T$ be an integer and let $x^T = (x_1, \dots, x_T)$ be a sequence of features, which are vectors of a certain dimension and which define the problem (the superscript $T$ does not denote “transpose”, which would be written ${}^T x$, but a sequence with $T$ atoms). Each feature $x$ generates a discrete random label $y$. Let $P_S(y \mid x)$ ($S$ for “source”) denote the probability of obtaining label $y$ given the feature $x$. If $y^T$ is the sequence of random labels given the sequence of features $x^T$, then $P_S(y^T \mid x^T) = \prod_t P_S(y_t \mid x_t)$. The sequence of features and labels defines the problem for supervised learning.

The learning process will give as output an index $L(y^T)$ taken from a set $\mathcal{L}$, such that each $L \in \mathcal{L}$ defines a distribution $P_L(y^T \mid x^T)$ ($L$ for “learning”) over the label sequence given the feature sequence. In the absence of side information, the learning process leads to $L(y^T) = \arg\max_{L \in \mathcal{L}} P_L(y^T \mid x^T)$. Our aim is to find how close $P_{L(y^T)}(y^T \mid x^T)$ is to $P_S(y^T \mid x^T)$ when $y^T$ varies.

The distance between the two distributions can be expressed by the Kullback–Leibler divergence [6]

$$D(P_S \,\|\, P_L) = \sum_{y^T} P_S(y^T \mid x^T) \log \frac{P_S(y^T \mid x^T)}{P_{L(y^T)}(y^T \mid x^T)} \tag{1}$$

However, it should be stressed that the quantity $P_{L(y^T)}(y^T \mid x^T)$ does not necessarily define a probability distribution, since $L(y^T)$ may vary when $y^T$ varies, so that $\sum_{y^T} P_{L(y^T)}(y^T \mid x^T)$ is unlikely to be equal to 1. Thus, $D(P_S \,\|\, P_L)$ is not a distance, because it can be negative. One way around this is to introduce $P_L^*(y^T \mid x^T) = \frac{P_{L(y^T)}(y^T \mid x^T)}{S(x^T)}$ with $S(x^T) = \sum_{y^T} P_{L(y^T)}(y^T \mid x^T)$, which makes $P_L^*$ a probability distribution. Thus, we will use $D(P_S \,\|\, P_L^*)$, which satisfies

$$D(P_S \,\|\, P_L^*) = \sum_{y^T} P_S(y^T \mid x^T) \log \frac{P_S(y^T \mid x^T)}{P_L^*(y^T \mid x^T)} = D(P_S \,\|\, P_L) + \log S(x^T), \tag{2}$$

and is now a well-defined semi-distance, which we define as the learning regret $R(x^T) = D(P_S \,\|\, P_L^*)$ [5].
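For readers who want the intermediate step, identity (2) can be expanded as follows. This is only a worked substitution of the definitions of $P_L^*$ and $S(x^T)$ given above, using the fact that $P_S$ sums to one over $y^T$; it introduces no new assumption.

```latex
\begin{aligned}
D(P_S \,\|\, P_L^*)
  &= \sum_{y^T} P_S(y^T \mid x^T)\,
     \log \frac{P_S(y^T \mid x^T)\, S(x^T)}{P_{L(y^T)}(y^T \mid x^T)} \\
  &= \sum_{y^T} P_S(y^T \mid x^T)\,
     \log \frac{P_S(y^T \mid x^T)}{P_{L(y^T)}(y^T \mid x^T)}
     \;+\; \log S(x^T) \sum_{y^T} P_S(y^T \mid x^T) \\
  &= D(P_S \,\|\, P_L) + \log S(x^T).
\end{aligned}
```

Note that $S(x^T) \geq 1$: for any fixed $L_0 \in \mathcal{L}$, each maximized term satisfies $P_{L(y^T)}(y^T \mid x^T) \geq P_{L_0}(y^T \mid x^T)$, and the right-hand side sums to one. The correction term $\log S(x^T)$ is therefore non-negative.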

3. The Quantum Learning on Polarized Photons

We now include pure physical measurements in the learning process. Several applications involve physics; for instance, ref. [7] describes a process of deep learning over the physical layer of a wireless network. The issue with quantum physical effects is that they are neither reproducible nor deterministic. We consider a problem related to optical fiber communication where information is encoded in photon polarizations. The photon polarization is given by a quantum wave function of dimension 2. In the binary case, bit 0 is encoded by the polarization angle $\theta_Q$ and bit 1 by the angle $\theta_Q + \pi/2$. The quantity $\theta_Q$ is assumed to be unknown to the receiver, and its estimate $\theta_T$ is obtained after a training sequence via machine learning.

For this purpose, the sender sends a sequence of $T$ identically polarized photons, along angle $\theta_Q$, and the receiver measures these photons over a collection of $T$ measurement angles $x_1, x_2, \dots, x_T$, called the feature angles. They are pure scalars and not vectors ($d = 1$); therefore, we will not depict them in bold font as in the previous section. The labels, or ground truths, $y_1, \dots, y_T$ are the sequence of binary measurements obtained, $y_t \in \{0, 1\}$; there are $2^T$ possible label sequences.

This problem is the most simplified version of tomography on quantum telecommunication since it relies on a single parameter. More realistic and more complicated situations will occur when noisy circular polarization is introduced within a more complex combination of polarizations within groups of photons. This will considerably increase the dimension of the feature vectors and certainly will make our results on the training process more critical. However, in the situation analyzed in our paper, we show that this simple system is difficult to learn.

If we assume that the experiment results are delivered in batches to the training process, that is, the estimate $\theta_t = \theta$ does not vary for $0 < t < T$, the learning class of probability distributions is a function of $\theta$ with

$$P_L(y^T \mid x^T, \theta) = \prod_{y_t = 0} \cos^2(\theta - x_t) \prod_{y_t = 1} \sin^2(\theta - x_t).$$

The source distribution is indeed $P_S(y^T \mid x^T) = P_L(y^T \mid x^T, \theta_Q)$; thus the source distribution belongs to the class $\mathcal{L}$ of learning distributions. For a given pair of sequences $(y^T, x^T)$, let $\theta^*$ be the value of $\theta$ which maximizes $P_L(y^T \mid x^T, \theta)$. Since we will never touch the sequence $x^T$, which is the foundation of the experiments, we will sometimes drop the parameter $x^T$ and denote $\ell_{y^T}(\theta) = -\log P_L(y^T \mid x^T, \theta)$. The quantity $\theta^*$ which maximizes $P_L(y^T \mid x^T, \theta)$ satisfies $\ell'_{y^T}(\theta^*) = 0$. We have

$$\begin{cases}
\ell_{y^T}(\theta) = -2 \sum_t \log \left| \cos(\theta - x_t + y_t \pi/2) \right| \\
\ell'_{y^T}(\theta) = 2 \sum_t \tan(\theta - x_t + y_t \pi/2) \\
\ell''_{y^T}(\theta) = 2 \sum_t \dfrac{1}{\cos^2(\theta - x_t + y_t \pi/2)}
\end{cases}$$
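The closed forms above can be checked numerically. The following is a minimal sketch, not the author's code: it draws labels with $P(y_t = 1) = \sin^2(\theta_Q - x_t)$, evaluates $\ell$, $\ell'$ and $\ell''$, and compares them with the product form of $P_L$ and with finite differences. The function names, the value $\theta_Q = 0.25$, the evaluation point $\theta = 0.6$ and the evenly spaced angles are all assumptions chosen for illustration.

```python
import numpy as np

def sample_labels(theta_q, x, rng):
    """Draw y_t in {0, 1} with P(y_t = 1) = sin^2(theta_q - x_t)."""
    return (rng.random(x.shape) < np.sin(theta_q - x) ** 2).astype(int)

def ell(theta, x, y):
    """Negative log-likelihood: -2 * sum_t log|cos(theta - x_t + y_t*pi/2)|."""
    return -2.0 * np.sum(np.log(np.abs(np.cos(theta - x + y * np.pi / 2))))

def ell_prime(theta, x, y):
    """First derivative: 2 * sum_t tan(theta - x_t + y_t*pi/2)."""
    return 2.0 * np.sum(np.tan(theta - x + y * np.pi / 2))

def ell_second(theta, x, y):
    """Second derivative: 2 * sum_t 1 / cos^2(theta - x_t + y_t*pi/2)."""
    return 2.0 * np.sum(1.0 / np.cos(theta - x + y * np.pi / 2) ** 2)

rng = np.random.default_rng(0)
x = 0.4 * np.arange(8)        # illustrative measurement angles (assumption)
theta_q = 0.25                # hypothetical hidden polarization angle
y = sample_labels(theta_q, x, rng)

theta, h = 0.6, 1e-5          # evaluation point kept away from the singularities
# Product form: -log of prod_{y_t=0} cos^2 * prod_{y_t=1} sin^2.
product_form = -np.sum(np.where(y == 0,
                                np.log(np.cos(theta - x) ** 2),
                                np.log(np.sin(theta - x) ** 2)))
# Central finite differences of ell and ell_prime.
fd_first = (ell(theta + h, x, y) - ell(theta - h, x, y)) / (2 * h)
fd_second = (ell_prime(theta + h, x, y) - ell_prime(theta - h, x, y)) / (2 * h)

print(product_form, ell(theta, x, y))
print(fd_first, ell_prime(theta, x, y))
print(fd_second, ell_second(theta, x, y))
```

The evaluation point is deliberately chosen so that no $\theta - x_t$ lies close to a multiple of $\pi/2$, where the tangent and its derivative blow up and finite differences become unreliable.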

We notice that for all $\theta$, $\ell''_{y^T}(\theta)$ is always strictly positive (but $\ell''$ and $\ell'$ are not continuous, so $\ell$ is not convex). We now turn to stating and proving our main results (two theorems), whose proofs rely on the following two lemmas.

Lemma 1.

We have the expression

$$\ell_{y^T}(\theta^*) = \frac{1}{2\pi} \int_0^{2\pi} \ell_{y^T}(w)\, \ell''_{y^T}(w)\, dw \int_{\mathbb{R}} \exp\left(-i\, \ell'_{y^T}(w)\, z\right) dz. \tag{3}$$

Proof.

Let $g_{y^T}(\theta) = \ell'_{y^T}(\theta)$, which is homomorphic and is locally invertible (since $\ell''_{y^T}(\theta)$ is never zero). For $a \in \mathbb{R}$, we denote by $l_{y^T}$ the function $a \mapsto \ell_{y^T}(g_{y^T}^{-1}(a))$. We have $\ell_{y^T}(\theta^*) = l_{y^T}(0)$. For $z \in \mathbb{R}$, let $\tilde{l}_{y^T}(z)$ be the Fourier transform of the function $l_{y^T}(a)$. Formally we have

\begin{align}
\tilde{l}_{y^T}(z) &= \int_{\mathbb{R}} l_{y^T}(a)\, e^{-i a z}\, da \tag{4} \\
&= \int_0^{2\pi} \ell_{y^T}(w)\, \ell''_{y^T}(w)\, e^{-i \ell'_{y^T}(w) z}\, dw \tag{5}
\end{align}

and inversely

$$l_{y^T}(a) = \frac{1}{2\pi} \int_{\mathbb{R}} \tilde{l}_{y^T}(z)\, e^{i a z}\, dz. \tag{6}$$

Thus

\begin{align}
\ell_{y^T}(\theta^*) &= \frac{1}{2\pi} \int_{\mathbb{R}} \tilde{l}_{y^T}(z)\, dz \tag{7} \\
&= \frac{1}{2\pi} \int_0^{2\pi} \ell_{y^T}(w)\, \ell''_{y^T}(w)\, dw \tag{8} \\
&\quad \times \int_{\mathbb{R}} e^{-i \ell'_{y^T}(w) z}\, dz. \tag{9}
\end{align}

□

In fact, the function $\ell_{y^T}(\theta)$ may have several extrema, as we will see in the next section; thus $\ell'_{y^T}(\theta)$ may have several roots and $g_{y^T}^{-1}(a)$ is multivalued. In order to avoid the secondary roots, which correspond to the non-optimal extrema, we will concentrate on the main root in the vicinity of $\theta_Q$.

Let $p^T = (p_1, \dots, p_T)$ and $q^T = (q_1, \dots, q_T)$ be two sequences of real numbers. We denote $p(y^T) = \prod_t p_t^{1 - y_t} q_t^{y_t}$.

Lemma 2.

For any $1 \leq t_0 \leq T$, we have the identity

$$\sum_{y^T} y_{t_0}\, p(y^T) = q_{t_0} \prod_{t \neq t_0} (p_t + q_t). \tag{10}$$

For $t_1 \neq t_2$, we have

$$\sum_{y^T} y_{t_1} y_{t_2}\, p(y^T) = q_{t_1} q_{t_2} \prod_{t \neq t_1, t_2} (p_t + q_t). \tag{11}$$

Proof.

Both identities follow from expanding the finite sums: the sum over $y^T$ factorizes over $t$, each factor being $p_t + q_t$, except at the indices carrying a factor $y_t$, where only the term $y_t = 1$ survives and contributes $q_t$. □
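Identities (10) and (11) can be verified by brute force for a small $T$. The sketch below enumerates all $2^T$ label sequences and compares both sides; the complex test values for $p_t$ and $q_t$, the indices `t0`, `t1`, `t2` and the helper name `p_of_y` are illustrative assumptions. Complex values are used because the theorems below apply the lemma with complex $p_t$ and $q_t$, and the algebra does not depend on the values being real.

```python
import itertools
import numpy as np

def p_of_y(y, p, q):
    """p(y^T) = prod_t p_t^(1 - y_t) * q_t^(y_t)."""
    y = np.asarray(y)
    return np.prod(np.where(y == 0, p, q))

rng = np.random.default_rng(1)
T = 6
p = rng.normal(size=T) + 1j * rng.normal(size=T)
q = rng.normal(size=T) + 1j * rng.normal(size=T)
t0, t1, t2 = 2, 0, 4          # zero-based indices, arbitrary choice

all_y = list(itertools.product((0, 1), repeat=T))

# Identity (10): single index t0.
lhs10 = sum(y[t0] * p_of_y(y, p, q) for y in all_y)
rhs10 = q[t0] * np.prod(np.delete(p + q, t0))

# Identity (11): two distinct indices t1 and t2.
lhs11 = sum(y[t1] * y[t2] * p_of_y(y, p, q) for y in all_y)
rhs11 = q[t1] * q[t2] * np.prod(np.delete(p + q, [t1, t2]))

print(abs(lhs10 - rhs10), abs(lhs11 - rhs11))
```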

Theorem 1.

Under mild conditions, we have the estimate

$$\sum_{y^T} P_S(y^T \mid x^T) \log \frac{P_S(y^T \mid x^T)}{P_{L(y^T)}(y^T \mid x^T)} = O(\sqrt{T}). \tag{12}$$

Proof.

Let $C(x^T) = \sum_{y^T} P_S(y^T \mid x^T)\, \ell_{y^T}(\theta^*)$. Applying both lemmas with $p_t = \cos^2(\theta_Q - x_t)\, e^{-2 i \tan(\theta - x_t) z}$ and $q_t = \sin^2(\theta_Q - x_t)\, e^{-2 i \tan(\theta - x_t + \pi/2) z}$, so that $p(y^T) = P_S(y^T \mid x^T)\, e^{-i \ell'_{y^T}(\theta) z}$, we get

\begin{align*}
C(x^T) &= \sum_{y^T} P_S(y^T \mid x^T)\, \frac{1}{2\pi} \int_0^{2\pi} \ell_{y^T}(\theta)\, \ell''_{y^T}(\theta)\, d\theta \int_{\mathbb{R}} \exp\left(-i\, \ell'_{y^T}(\theta)\, z\right) dz \\
&= \frac{1}{2\pi} \int_0^{2\pi} d\theta \int_{\mathbb{R}} \left( \bar{\ell}(\theta, z)\, \bar{\ell}''(\theta, z) + \bar{\Delta}(\theta, z) \right) \prod_t (p_t + q_t)\, dz
\end{align*}

with

\begin{align*}
\bar{\ell}(\theta, z) &= -2 \sum_t \left( \frac{p_t}{p_t + q_t} \log \left| \cos(\theta - x_t) \right| + \frac{q_t}{p_t + q_t} \log \left| \sin(\theta - x_t) \right| \right) \\
\bar{\ell}''(\theta, z) &= 2 \sum_t \left( \frac{p_t}{p_t + q_t} \frac{1}{\cos^2(\theta - x_t)} + \frac{q_t}{p_t + q_t} \frac{1}{\sin^2(\theta - x_t)} \right) \\
\bar{\Delta}(\theta, z) &= -2 \sum_t \frac{p_t q_t}{(p_t + q_t)^2} \left( \frac{\log \left| \cos(\theta - x_t) \right|}{\cos^2(\theta - x_t)} + \frac{\log \left| \sin(\theta - x_t) \right|}{\sin^2(\theta - x_t)} \right)
\end{align*}

We notice that $\prod_t (p_t + q_t) = \exp\left(-2 i\, m(\theta)\, z - 2\, v(\theta)\, z^2 + O(T |z|^3)\right)$ with

\begin{align*}
m(\theta) &= \sum_t \left( \tan(\theta - x_t) \cos^2(\theta_Q - x_t) + \tan(\theta - x_t + \pi/2) \sin^2(\theta_Q - x_t) \right) \\
v(\theta) &= \sum_t \left( \tan^2(\theta - x_t) \cos^2(\theta_Q - x_t) + \tan^2(\theta - x_t + \pi/2) \sin^2(\theta_Q - x_t) \right) \\
&\quad - \sum_t \left( \tan(\theta - x_t) \cos^2(\theta_Q - x_t) + \tan(\theta - x_t + \pi/2) \sin^2(\theta_Q - x_t) \right)^2
\end{align*}

We notice that $m(\theta) \sim 2 (\theta - \theta_Q) T$ and $v(\theta) = T (1 + O(\theta - \theta_Q))$ when $\theta \to \theta_Q$. The following expression is obtained via the saddle point method, the mild conditions being that it can be applied as in the maximum likelihood problem [8] (the error term would be the smallest possible):

\begin{align}
\int_{\mathbb{R}} \left( \bar{\ell}(\theta, z)\, \bar{\ell}''(\theta, z) + \bar{\Delta}(\theta, z) \right) \prod_t (p_t + q_t)\, dz
&= \int_{\mathbb{R}} \left( \bar{\ell}(\theta, z)\, \bar{\ell}''(\theta, z) + \bar{\Delta}(\theta, z) \right) \exp\left(-2 i\, m(\theta)\, z - 2\, v(\theta)\, z^2 + O(T |z|^3)\right) dz \tag{13} \\
&= \left( \bar{\ell}(\theta)\, \bar{\ell}''(\theta) + \bar{\Delta}(\theta) \right) \frac{\sqrt{\pi}}{\sqrt{2 v(\theta)}} \exp\left(-\frac{m(\theta)^2}{2 v(\theta)}\right) \left(1 + O(1/\sqrt{T})\right) \tag{14}
\end{align}

with $\bar{\ell}(\theta) = \bar{\ell}(\theta, 0)$, $\bar{\ell}''(\theta) = \bar{\ell}''(\theta, 0)$ and $\bar{\Delta}(\theta) = \bar{\Delta}(\theta, 0)$. Since $\frac{m(\theta)^2}{2 v(\theta)} = 2 (\theta - \theta_Q)^2 T + O(|\theta - \theta_Q|^3 T)$, the contribution of the factor $\prod_t (p_t + q_t)$ behaves like a Gaussian function of $\theta$ centered on $\theta_Q$ with standard deviation of order $1/\sqrt{T}$. Thus, via the saddle point approximation again, and keeping only the main root in the vicinity of $\theta_Q$, we obtain:

\begin{align*}
C(x^T) &= \frac{1}{2\pi} \int_0^{2\pi} \left( \bar{\ell}(\theta)\, \bar{\ell}''(\theta) + \bar{\Delta}(\theta) \right) \frac{\sqrt{\pi}}{\sqrt{2 v(\theta)}} \exp\left(-\frac{m(\theta)^2}{2 v(\theta)}\right) \left(1 + O(1/\sqrt{T})\right) d\theta \\
&= \frac{1}{2\sqrt{2\pi}} \int_0^{2\pi} \frac{\bar{\ell}(\theta)\, \bar{\ell}''(\theta) + \bar{\Delta}(\theta)}{\sqrt{v(\theta)}} \exp\left(-2 (\theta - \theta_Q)^2 T + O(|\theta - \theta_Q|^3 T)\right) \left(1 + O(1/\sqrt{T})\right) d\theta \\
&= \frac{\bar{\ell}(\theta_Q)\, \bar{\ell}''(\theta_Q) + \bar{\Delta}(\theta_Q)}{4 \sqrt{T\, v(\theta_Q)}} \left(1 + O(1/\sqrt{T})\right) \\
&= h(\theta_Q) \left(1 + O(1/\sqrt{T})\right)
\end{align*}

with $h(\theta_Q) = \left( \bar{\ell}(\theta_Q)\, \bar{\ell}''(\theta_Q) + \bar{\Delta}(\theta_Q) \right) / (4T)$, where, to leading order, $h(\theta) = -\sum_t \left( \cos^2(\theta - x_t) \log \cos^2(\theta - x_t) + \sin^2(\theta - x_t) \log \sin^2(\theta - x_t) \right)$ is clearly $O(T)$.

Furthermore, $h(\theta_Q) = -\sum_{y^T} P_S(y^T \mid x^T) \log P_S(y^T \mid x^T)$. Since $\log P_{L(y^T)}(y^T \mid x^T) = -\ell_{y^T}(\theta^*)$, the left-hand side of (12) equals $C(x^T) - h(\theta_Q)$; thus we have

$$\sum_{y^T} P_S(y^T \mid x^T) \log \frac{P_S(y^T \mid x^T)}{P_{L(y^T)}(y^T \mid x^T)} = O\left(\frac{h(\theta_Q)}{\sqrt{T}}\right) = O(\sqrt{T}).$$

□

Theorem 2.

We have

$$\log S(x^T) = \log \left( \sum_{y^T} P_{L(y^T)}(y^T \mid x^T) \right) = \frac{1}{2} \log T + O(1). \tag{15}$$

Remark 1.

This order of magnitude is much smaller than the main order of magnitude provided in Theorem 1, confirming that the overall regret is indeed of order $\sqrt{T}$. The regret per measurement is $O(1/\sqrt{T})$; therefore the individual regrets nevertheless tend to zero when $T \to \infty$.

Proof.

It is formally a Shtarkov sum [5,9]. Using Lemma 1 and Lemma 2 gives

\begin{align}
S(x^T) = \sum_{y^T} P_{L(y^T)}(y^T \mid x^T) &= \sum_{y^T} \frac{1}{2\pi} \int_0^{2\pi} P_L(y^T \mid x^T, w)\, \ell''_{y^T}(w)\, dw \int_{\mathbb{R}} \exp\left(-i\, \ell'_{y^T}(w)\, z\right) dz \tag{16} \\
&= \frac{1}{2\pi} \int_0^{2\pi} d\theta \int_{\mathbb{R}} \tilde{\ell}''(\theta, z) \prod_t (p_t + q_t)\, dz \tag{17}
\end{align}

with $p_t = \cos^2(\theta - x_t)\, e^{-2 i \tan(\theta - x_t) z}$ and $q_t = \sin^2(\theta - x_t)\, e^{-2 i \tan(\theta - x_t + \pi/2) z}$, so that $p(y^T) = P_L(y^T \mid x^T, \theta)\, e^{-i \ell'_{y^T}(\theta) z}$; $\tilde{\ell}''(\theta, z)$ has the same expression as $\bar{\ell}''(\theta, z)$ but with the new expressions of $p_t$ and $q_t$:

$$\tilde{\ell}''(\theta, z) = 2 \sum_t \left( \frac{p_t}{p_t + q_t} \frac{1}{\cos^2(\theta - x_t)} + \frac{q_t}{p_t + q_t} \frac{1}{\sin^2(\theta - x_t)} \right)$$

Developing further:

$$S(x^T) = \frac{1}{2\pi} \int_0^{2\pi} d\theta \int_{\mathbb{R}} \tilde{\ell}''(\theta, z) \exp\left(-2 T z^2 + O(T |z|^3)\right) dz, \tag{18}$$

and via the saddle point estimate (which consists of the change of variable $z \to \frac{1}{\sqrt{T}} z'$), under the same conditions as in Theorem 1, we get

$$S(x^T) = \frac{1}{2\pi} \int_0^{2\pi} \tilde{\ell}''(\theta, 0)\, \frac{\sqrt{\pi}}{\sqrt{2T}} \left(1 + O(1/\sqrt{T})\right) d\theta. \tag{19}$$

We conclude with the evaluation $\tilde{\ell}''(\theta, 0) = 4T$; thus $S(x^T) = \sqrt{8 \pi T} \left(1 + O(1/\sqrt{T})\right)$, and taking the logarithm gives (15).

□
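For small $T$, the two terms of the regret can be evaluated exhaustively, which gives a sanity check on orders of magnitude that is independent of the saddle point argument. The sketch below enumerates all $2^T$ label sequences, approximates the maximum likelihood over $\theta$ by a grid search on one period $[0, \pi)$, and returns the Shtarkov sum $S(x^T)$, the term $D(P_S \,\|\, P_L)$ and the regret $R(x^T) = D(P_S \,\|\, P_L) + \log S(x^T)$. The function name, the grid resolution, the random angles and $\theta_Q = 0.7$ are assumptions; the grid maximum slightly underestimates the true maximum, so the values are approximate.

```python
import itertools
import numpy as np

def shtarkov_sum_and_regret(x, theta_q, grid_size=2000):
    """Brute-force S(x^T), D(P_S || P_L) and R = D + log S for small T."""
    T = len(x)
    # Grid over one period of theta; the half-step offset avoids landing
    # exactly on a singularity of the log-likelihood.
    thetas = (np.arange(grid_size) + 0.5) * np.pi / grid_size
    log_c = np.log(np.cos(thetas[:, None] - x[None, :]) ** 2)  # log P(y_t=0 | theta)
    log_s = np.log(np.sin(thetas[:, None] - x[None, :]) ** 2)  # log P(y_t=1 | theta)
    log_c_q = np.log(np.cos(theta_q - x) ** 2)
    log_s_q = np.log(np.sin(theta_q - x) ** 2)

    S, D = 0.0, 0.0
    for y in itertools.product((0, 1), repeat=T):
        y = np.array(y)
        log_ps = np.sum(np.where(y == 0, log_c_q, log_s_q))      # source log-prob
        log_lik = np.sum(np.where(y[None, :] == 0, log_c, log_s), axis=1)
        log_pl = np.max(log_lik)                                  # grid maximum likelihood
        S += np.exp(log_pl)
        D += np.exp(log_ps) * (log_ps - log_pl)
    return S, D, D + np.log(S)

rng = np.random.default_rng(0)
T = 8
x = rng.uniform(0.0, 2.0 * np.pi, T)   # random measurement angles (assumption)
S, D, R = shtarkov_sum_and_regret(x, theta_q=0.7)
print(f"S = {S:.4f}, D(P_S||P_L) = {D:.4f}, regret R = {R:.4f}")
```

Because the grid-maximized likelihoods are renormalized by their own sum, $R$ is a genuine Kullback–Leibler divergence and is non-negative up to rounding. The cost grows as $2^T$, so this check is only practical for small $T$.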

4. Incremental Learning and Gradient Descent

We investigate gradient descent methods to reach the value $\theta^*$. There are many gradient strategies. In the classic strategy, which we call the slow gradient descent, we define the loss by $\mathrm{loss}(y_t, \theta_t \mid x_t) = \left(y_t - \sin^2(\theta_t - x_t)\right)^2$. Since the average value of $y_t$ is $\sin^2(\theta_Q - x_t)$, the average loss is $\left(\sin^2(\theta_Q - x_t) - \sin^2(\theta_t - x_t)\right)^2 + \frac{\sin^2(2\theta_Q - 2x_t)}{4}$ (minimized at $\theta_t = \theta_Q$), and the gradient update of $\theta_t$ is

$$\theta_{t+1} = \theta_t - r \frac{\partial}{\partial \theta_t} \mathrm{loss}(y_t, \theta_t \mid x_t). \tag{20}$$
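The average loss and the explicit form of update (20) can be derived in a few lines. The expansion below is a worked step under the model of Section 3, where $y_t \in \{0, 1\}$ has mean $\sin^2(\theta_Q - x_t)$; the shorthand $s = \sin^2(\theta_t - x_t)$ and $s_Q = \sin^2(\theta_Q - x_t)$ is introduced here only for readability.

```latex
\begin{aligned}
E\left[(y_t - s)^2\right]
  &= (s_Q - s)^2 + E\left[(y_t - s_Q)^2\right]
   = (s_Q - s)^2 + s_Q (1 - s_Q) \\
  &= (s_Q - s)^2 + \sin^2(\theta_Q - x_t)\cos^2(\theta_Q - x_t)
   = (s_Q - s)^2 + \frac{\sin^2(2\theta_Q - 2x_t)}{4}, \\[4pt]
\frac{\partial}{\partial \theta_t}\,\mathrm{loss}(y_t, \theta_t \mid x_t)
  &= -2\left(y_t - \sin^2(\theta_t - x_t)\right)\sin(2\theta_t - 2x_t), \\[4pt]
\theta_{t+1}
  &= \theta_t + 2r\left(y_t - \sin^2(\theta_t - x_t)\right)\sin(2\theta_t - 2x_t).
\end{aligned}
```

The second term of the average loss does not depend on $\theta_t$: it is the irreducible variance of the quantum measurement, which is why the loss cannot be driven to zero even at $\theta_t = \theta_Q$.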

In Figure 1 we display our simulations as a sequence $\theta_t$ starting with a random initial $\theta_1$. We assume that for all $t$ the transmitted bit is always 0, i.e., the polarization angle is always $\theta_Q$. The learning rate is $r = 0.0002$. We simulate nine parallel gradient descents, randomly initialized and sharing the same random feature sequence $x^T$, with $T = 3{,}000{,}000$. In Figure 1 we plot the parallel evolutions of the quantity $\theta_t$. The initial points are green diamonds and the final points are red diamonds. Although we start from nine different positions, the trajectories converge toward $\theta_Q \pm \pi$. However, the convergence is slow, confirming the $1/\sqrt{T}$ rate or worse. In fact, some initial positions converge even more slowly and, even after 3,000,000 trials, are still very far from the target. The reason is that the target function $\log P(y^T \mid x^T, \theta)$ has several local maxima, as shown in Figure 2, where the $x_t$ belong to the set of values $2\pi k/10$ for $k = 1, \dots, 10$. It is very unlikely that a communication operator would tolerate so many runs (3,000,000) in order to obtain proper convergence. However, it would be possible to run the gradient descents in parallel and proceed as with particle systems in order to select the fastest to converge.
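The experiment described above can be sketched as follows. This is not the author's simulation code: the function name, the seed, $\theta_Q = 0.7$, the uniform distribution of the feature angles and the choice to share the same measured labels across the nine runs are all assumptions, and $T$ is reduced to 20,000 so that the loop finishes quickly. The update line is (20) with the gradient of the squared loss written out.

```python
import numpy as np

def slow_gradient_descent(theta_q, x, theta_init, r, rng):
    """Run update (20) for several initial angles sharing the same data."""
    thetas = np.array(theta_init, dtype=float)
    for x_t in x:
        # One photon measurement: y_t = 1 with probability sin^2(theta_q - x_t).
        y_t = float(rng.random() < np.sin(theta_q - x_t) ** 2)
        u = thetas - x_t
        # d/dtheta (y - sin^2 u)^2 = -2 (y - sin^2 u) sin(2u)
        grad = -2.0 * (y_t - np.sin(u) ** 2) * np.sin(2.0 * u)
        thetas = thetas - r * grad
    return thetas

rng = np.random.default_rng(0)
theta_q = 0.7                                    # hypothetical hidden angle
T = 20_000                                       # far fewer runs than in the text
x = rng.uniform(0.0, 2.0 * np.pi, T)             # shared random feature sequence
theta_init = rng.uniform(0.0, 2.0 * np.pi, 9)    # nine random starting points
theta_final = slow_gradient_descent(theta_q, x, theta_init, r=0.0002, rng=rng)

# Distance to theta_q modulo pi, since theta and theta + pi are equivalent.
err = np.abs((theta_final - theta_q + np.pi / 2) % np.pi - np.pi / 2)
print(np.round(err, 3))
```

The error is reported modulo $\pi$ because $\theta$ and $\theta \pm \pi$ define the same measurement probabilities, which matches the convergence toward $\theta_Q \pm \pi$ mentioned in the text.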

A supposedly faster gradient descent would be defined by the inverse derivative

$$\theta_{t+1} = \theta_t + r\, \frac{y_t - \sin^2(\theta_t - x_t)}{\frac{\partial}{\partial \theta_t} \sin^2(\theta_t - x_t)} \tag{21}$$

We notice that in a stationary situation (where we suppose that $\theta_t$ varies very little) we have $E(\theta_{t+1}) = \theta_t + r\, \frac{\sin^2(\theta_Q - x_t) - \sin^2(\theta_t - x_t)}{\frac{\partial}{\partial \theta_t} \sin^2(\theta_t - x_t)}$, which is equal to $\theta_t$ when $\theta_t = \theta_Q$. In Figure 3, we display our simulations as a sequence $\theta_t$ starting with a random initial $\theta_1$. The learning rate is $r = 0.0002$. We simulate nine parallel fast gradient descents, randomly initialized and sharing the same random feature sequence $x^T$, with $T = 3{,}000{,}000$. The gradient descent converges fast but does not converge to the correct value $\theta_Q \pm \pi$. Again, this is due to the fact that the target function $\log P(y^T \mid x^T, \theta)$ has several local maxima, which act as traps for the gradient descent.
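Update (21) can be sketched in the same way, using $\frac{\partial}{\partial \theta_t} \sin^2(\theta_t - x_t) = \sin(2\theta_t - 2x_t)$. The text does not say how the update is handled when this derivative is close to zero, so the sketch simply skips the step when its absolute value falls below a threshold `eps`; that guard, like the function name, the seed, $\theta_Q = 0.7$ and the reduced $T$, is an assumption made only to keep the illustration well defined.

```python
import numpy as np

def fast_gradient_descent(theta_q, x, theta_init, r, rng, eps=1e-3):
    """Run update (21); steps with a near-zero derivative are skipped."""
    thetas = np.array(theta_init, dtype=float)
    for x_t in x:
        y_t = float(rng.random() < np.sin(theta_q - x_t) ** 2)
        u = thetas - x_t
        denom = np.sin(2.0 * u)              # derivative of sin^2(theta - x_t)
        safe = np.abs(denom) > eps           # assumed guard against division by ~0
        step = np.zeros_like(thetas)
        step[safe] = (y_t - np.sin(u[safe]) ** 2) / denom[safe]
        thetas = thetas + r * step
    return thetas

rng = np.random.default_rng(0)
theta_q = 0.7                                    # hypothetical hidden angle
T = 20_000                                       # far fewer runs than in the text
x = rng.uniform(0.0, 2.0 * np.pi, T)
theta_init = rng.uniform(0.0, 2.0 * np.pi, 9)
theta_final = fast_gradient_descent(theta_q, x, theta_init, r=0.0002, rng=rng)

# Distance to theta_q modulo pi.
err = np.abs((theta_final - theta_q + np.pi / 2) % np.pi - np.pi / 2)
print(np.round(err, 3))
```

With the guard, a single step is bounded by `r / eps` in absolute value, so the trajectories stay finite; a different treatment of small denominators may change the behaviour noticeably.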

5. Conclusions

We have presented a simple quantum tomography problem, the unknown photon polarization problem, and have analyzed its learnability via AI over $T$ runs. We have shown that the learning regret cannot decay faster than $1/\sqrt{T}$ (i.e., a cumulative regret of $\sqrt{T}$). Furthermore, the classic gradient descent is hampered by local extrema, which may significantly degrade the theoretical convergence rate.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The author declares no conflict of interest.

References

1. Tishby, N.; Zaslavsky, N. Deep learning and the information bottleneck principle. In Proceedings of the 2015 IEEE Information Theory Workshop (ITW), Jerusalem, Israel, 26 April–1 May 2015.
2. Bouillard, A.; Jacquet, P. Quasi Black Hole Effect of Gradient Descent in Large Dimension: Consequence on Neural Network Learning. In Proceedings of the ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019.
3. Abbe, E.; Sandon, C. Provable limitations of deep learning. arXiv 2018, arXiv:1812.06369.
4. Ben-David, S.; Hrubeš, P.; Moran, S.; Shpilka, A.; Yehudayoff, A. Learnability can be undecidable. Nat. Mach. Intell. 2019, 1, 44–48.
5. Jacquet, P.; Shamir, G.; Szpankowski, W. Precise Minimax Regret for Logistic Regression with Categorical Feature Values. In Algorithmic Learning Theory; PMLR: New York City, NY, USA, 2021.
6. Van Erven, T.; Harremoës, P. Rényi divergence and Kullback–Leibler divergence. IEEE Trans. Inf. Theory 2014, 60, 3797–3820.
7. O’Shea, T.; Hoydis, J. An introduction to deep learning for the physical layer. IEEE Trans. Cogn. Commun. Netw. 2017, 3, 563–575.
8. Newey, W.K.; McFadden, D. Chapter 36: Large sample estimation and hypothesis testing. In Handbook of Econometrics; Elsevier: Amsterdam, The Netherlands, 1994.
9. Shtarkov, Y.M. Universal sequential coding of single messages. Probl. Inf. Transm. 1987, 23, 3–17.

Figure 1. Angle estimate $\theta_t$ versus time of nine slow gradient descents randomly initialized. Green diamonds are starting points, red diamonds are stopping points.

Figure 2. Target function $\sum_t \left( \cos^2(\theta_Q - x_t) \log \cos^2(\theta - x_t) + \sin^2(\theta_Q - x_t) \log \sin^2(\theta - x_t) \right)$ as a function of $\theta$.
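The shape of the target function of Figure 2 can be explored numerically. The sketch below evaluates it on a grid over one period $[0, \pi)$ for the measurement angles $x_t = 2\pi k/10$, $k = 1, \dots, 10$, and lists the grid points that are strict local maxima. The value $\theta_Q = 0.2$, the grid resolution and the function name are assumptions, since the value of $\theta_Q$ used for the figure is not stated. Each term of the sum is concave between its singularities, so one local maximum is expected between each pair of consecutive singularities.

```python
import numpy as np

def target(theta, theta_q, x):
    """Expected log-likelihood of Figure 2, evaluated for an array of theta."""
    u = theta[:, None] - x[None, :]
    w0 = np.cos(theta_q - x) ** 2      # probability of y_t = 0 under the source
    w1 = np.sin(theta_q - x) ** 2      # probability of y_t = 1 under the source
    return np.sum(w0 * np.log(np.cos(u) ** 2) + w1 * np.log(np.sin(u) ** 2), axis=1)

x = 2.0 * np.pi * np.arange(1, 11) / 10     # x_t = 2*pi*k/10, k = 1..10
theta_q = 0.2                               # hypothetical hidden angle
n = 2000
# Half-step offset keeps the grid away from the singularities at multiples of pi/10.
theta = (np.arange(n) + 0.5) * np.pi / n
vals = target(theta, theta_q, x)

# Strict local maxima among interior grid points.
is_max = (vals[1:-1] > vals[:-2]) & (vals[1:-1] > vals[2:])
local_maxima = theta[1:-1][is_max]
best = theta[np.argmax(vals)]

print("number of local maxima on [0, pi):", len(local_maxima))
print("global maximum near theta =", round(float(best), 4))
```

By Gibbs' inequality the global maximum sits at $\theta_Q$, while the other local maxima are the traps that the gradient descents of Section 4 can fall into.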

Figure 3. Angle estimate $\theta_t$ versus time of nine fast gradient descents randomly initialized. Green diamonds are starting points, red diamonds are stopping points.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
