Proceeding Paper

Is Quantum Tomography a Difficult Problem for Machine Learning? †

Inria Saclay Ile-de-France, 91120 Palaiseau, France
Presented at the 41st International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering, Paris, France, 18–22 July 2022.
Phys. Sci. Forum 2022, 5(1), 47; https://doi.org/10.3390/psf2022005047
Published: 7 February 2023

Abstract

One of the key issues in machine learning is the characterization of the learnability of a problem. Regret is a way to quantify learnability. Quantum tomography is a special case of machine learning where the training set is a set of quantum measurements and the ground truth is the result of these measurements, but nothing is known about the hidden quantum system. We show that in some cases quantum tomography is a hard problem to learn. We consider a problem related to optical fiber communication where information is encoded in photon polarizations. We show that the learning regret cannot decay faster than $1/\sqrt{T}$, where $T$ is the size of the training dataset, and that incremental gradient descent may converge even more slowly.

1. Introduction: Supervised Learning in General

With the invention of deep neural learning, the general public has caught a glimpse of what looks like a universal machine learning technology, capable of solving arbitrary problems without any specific preparation of the training data or the learning strategy. Everything is supposed to be solvable as long as there are enough layers, enough processing power and enough training data. We have arrived at the point where many people (among them the late Stephen Hawking) have started thinking that machines may supersede human intelligence thanks to the greater performance of silicon neurons over biological neurons, and may be capable of cracking the last enigmas around the physical nature of the universe.
However, we should not forget that current Artificial Intelligence (AI) has many limitations, although, given the youth of the technology, many of the present limits might be teething problems. To learn a language, present algorithms need to be trained over millions of texts, which would be equivalent to a training period of 80 years if it were done at the learning pace of a child! Presently, deep neural training is very demanding in processing, and it is the third major source of energy consumption among information technologies after Bitcoin and data centers. Deep learning is not yet as good a self-organizing learning process as some researchers would have thought [1]. There is also the obstacle that data sparsity poses to learning (the machine only recognizes the data on which it has been trained over and over, as if a reader could only understand the texts on which (s)he has been trained).
In short, the main limitations of machine learning technologies are: (i) the data sparsity; (ii) the absence of a computable solution to learn (e.g., the program halting problem); (iii) the presence of hard-to-learn algorithms in the solution. This paper addresses the third limitation.
A supervised learning problem can be viewed as a set of training data and ground truths. The machine acts as an automaton whose aim is to predict the ground truth from the data. The loss measures the difference between the prediction and the ground truth and can be established under an arbitrary metric. The general objective of supervised machine learning is to minimize the average loss, but since the ground truth might contain some inherent stochastic variations (e.g., when predicting the result of a quantum measurement) it may be impossible to make the loss as small as we would like. Given an automaton architecture, there exists a setting that gives the optimal average loss. However, the optimal setting might be difficult to reach, and there remains the question of the size of the training set needed to converge to it.
Not all problems are equal with respect to learnability [2]. Some seem to be a perfect match for AI; others are more difficult to adapt. In [3], the authors show that random parity functions are simply unlearnable. In fact, from a broader perspective, “learnability” itself may not be a learnable problem [4].
The first contribution of this paper is a new definition of learning regret with respect to a given single problem submitted to a given learning strategy. Most regret expressions are infima of the regret over a large (if not universal) class of problems [5] and therefore lose the specificity of individual problems.
The second contribution is the application of this new regret definition to a quantum tomography problem. The specificity of the problem is that the hidden source probability distribution is indeed contained in the learning distribution class. The surprising result is that the regret is at least of the order of the square root of the number of runs, hinting at a poor convergence rate of the learned distribution toward the hidden distribution. We conclude with numerical experiments with gradient descents.

2. Expressing the Convergence Regret

Let $T$ be an integer and let $\mathbf{x}^T = (\mathbf{x}_1, \ldots, \mathbf{x}_T)$ be a sequence of features, which are vectors of a certain dimension $d$ and which define the problem (the superscript $T$ does not stand for “transpose”, which would be written ${}^T\mathbf{x}$, but for a sequence with $T$ atoms). Each feature $\mathbf{x}$ generates a discrete random label $y$. Let $P_S(y|\mathbf{x})$ ($S$ for “source”) denote the probability of obtaining label $y$ given the feature $\mathbf{x}$. If $y^T$ is the sequence of random labels given the sequence of features $\mathbf{x}^T$, then $P_S(y^T|\mathbf{x}^T) = \prod_t P_S(y_t|\mathbf{x}_t)$. The sequence of features and labels defines the problem for supervised learning.
The learning process outputs an index $L(y^T)$ taken from a set $\mathcal{L}$, such that each $L \in \mathcal{L}$ defines a distribution $P_L(y^T|\mathbf{x}^T)$ ($L$ for “learning”) over the label sequence given the feature sequence. In the absence of side information, the learning process leads to $L(y^T) = \arg\max_{L \in \mathcal{L}} \{P_L(y^T|\mathbf{x}^T)\}$. Our aim is to find how close $P_{L(y^T)}(y^T|\mathbf{x}^T)$ is to $P_S(y^T|\mathbf{x}^T)$ when $y^T$ varies.
The distance between the two distributions can be expressed by the Kullback–Leibler divergence [6]
$$D(P_S \| P_L) = \sum_{y^T} P_S(y^T|\mathbf{x}^T) \log \frac{P_S(y^T|\mathbf{x}^T)}{P_{L(y^T)}(y^T|\mathbf{x}^T)}.$$
However, it should be stressed that the quantity $P_{L(y^T)}(y^T|\mathbf{x}^T)$ does not necessarily define a probability distribution, since $L(y^T)$ may vary when $y^T$ varies, which makes it unlikely that $\sum_{y^T} P_{L(y^T)}(y^T|\mathbf{x}^T)$ equals 1. Thus, $D(P_S \| P_L)$ is not a distance, because it can be negative. One way around this is to introduce $P_L^*(y^T|\mathbf{x}^T) = \frac{P_{L(y^T)}(y^T|\mathbf{x}^T)}{S(\mathbf{x}^T)}$ with $S(\mathbf{x}^T) = \sum_{y^T} P_{L(y^T)}(y^T|\mathbf{x}^T)$, which makes $P_L^*(\cdot)$ a probability distribution. Thus, we will use $D(P_S \| P_L^*)$, which satisfies:
$$D(P_S \| P_L^*) = \sum_{y^T} P_S(y^T|\mathbf{x}^T) \log \frac{P_S(y^T|\mathbf{x}^T)}{P_L^*(y^T|\mathbf{x}^T)} = D(P_S \| P_L) + \log S(\mathbf{x}^T),$$
and is now a well-defined semi-distance, which we define as the learning regret $R(\mathbf{x}^T) = D(P_S \| P_L^*)$ [5].
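As an illustration of the definitions above, here is a minimal Python sketch for a finite outcome space. The function name `learning_regret` and the two-model toy class are hypothetical and not taken from the paper; the sketch only shows that the unnormalized divergence can be negative while the regret, once log S is added, is not.

```python
import math

def learning_regret(p_source, p_models):
    """Return (D(P_S||P_L), log S, regret) for a finite outcome space.

    p_source: list of source probabilities, one per label sequence.
    p_models: list of candidate distributions (the learning class),
              each a list over the same label sequences.
    """
    n = len(p_source)
    # P_{L(y)}(y): likelihood of the best model for each outcome y
    p_ml = [max(model[j] for model in p_models) for j in range(n)]
    s = sum(p_ml)  # normalizer S, always at least 1
    d_l = sum(ps * math.log(ps / pm)
              for ps, pm in zip(p_source, p_ml) if ps > 0)
    log_s = math.log(s)
    return d_l, log_s, d_l + log_s  # regret = D(P_S || P_L^*)

# Toy class with two models; the source is the first model.
models = [[0.5, 0.5], [0.9, 0.1]]
d_l, log_s, regret = learning_regret(models[0], models)
print(d_l, log_s, regret)
```

In this toy case the unnormalized divergence is negative because the second model overfits the first outcome, and adding log S restores a non-negative quantity.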

3. The Quantum Learning on Polarized Photons

We now include pure physical measurements in the learning process. Several applications involve physics; for example, ref. [7] describes a process of deep learning over the physical layer of a wireless network. The issue with quantum physical effects is that they are neither reproducible nor deterministic. We consider a problem related to optical fiber communication where information is encoded in photon polarizations. The photon polarization is given by a quantum wave function of dimension 2. In the binary case, the bit 0 is given by the polarization angle $\theta_Q$ and the bit 1 is given by the angle $\theta_Q + \pi/2$. The quantity $\theta_Q$ is supposed to be unknown to the receiver, and its estimate $\theta_T$ is obtained after a training sequence via machine learning.
For this purpose, the sender sends a sequence of $T$ equally polarized photons, along angle $\theta_Q$, and the receiver measures these photons over a collection of $T$ measurement angles $x_1, x_2, \ldots, x_T$, called the featured angles. They are pure scalars and not vectors ($d = 1$); therefore we will not depict them in bold font as in the previous section. The labels, or ground truths, $y_1, \ldots, y_T$ are the sequence of binary measurements obtained, $y_t \in \{0, 1\}$; there are $2^T$ possible label sequences.
This problem is the most simplified version of tomography on quantum telecommunication since it relies on a single parameter. More realistic and more complicated situations will occur when noisy circular polarization is introduced within a more complex combination of polarizations within groups of photons. This will considerably increase the dimension of the feature vectors and will certainly make our results on the training process more critical. However, in the situation analyzed in our paper, we show that even this simple system is difficult to learn.
If we assume that the experiment results are delivered in batches to the training process, that is, the estimate $\theta_t = \theta$ does not vary for $0 < t < T$, the learning class of probability distributions is a function of $\theta$ with $P_L(y^T|x^T, \theta) = \prod_{y_t = 0} \cos(\theta - x_t)^2 \prod_{y_t = 1} \sin(\theta - x_t)^2$. The source distribution is indeed $P_S(y^T|x^T) = P_L(y^T|x^T, \theta_Q)$, thus the source distribution belongs to the class $\mathcal{L}$ of learning distributions. For a given pair of sequences $(y^T, x^T)$, let $\theta^*$ be the value of $\theta$ which maximizes $P_L(y^T|x^T, \theta)$. Since we will never touch the sequence $x^T$, which is the foundation of the experiments, we will sometimes drop the parameter $x^T$ and denote $\ell_{y^T}(\theta) = -\log P_L(y^T|x^T, \theta)$. The quantity $\theta^*$ which maximizes $P_L(y^T|x^T, \theta)$ will satisfy $\ell'_{y^T}(\theta^*) = 0$. We have
$$\ell_{y^T}(\theta) = -2 \sum_t \log |\cos(\theta - x_t + y_t \pi/2)|, \qquad \ell'_{y^T}(\theta) = 2 \sum_t \tan(\theta - x_t + y_t \pi/2), \qquad \ell''_{y^T}(\theta) = 2 \sum_t \frac{1}{\cos(\theta - x_t + y_t \pi/2)^2}.$$
We notice that for all $\theta$, $\ell''_{y^T}(\theta)$ is always strictly positive (but $\ell_{y^T}$ and $\ell'_{y^T}$ are not continuous, so $\ell_{y^T}$ is not convex). We now turn to stating and proving our main results (two theorems), whose proofs need the following two lemmas.
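The negative log-likelihood of the polarization model and its first two derivatives can be checked numerically. The following Python sketch (function names and the three-point example data are illustrative assumptions) compares the closed forms against central finite differences at a point away from the singularities.

```python
import math

def phase(theta, x_t, y_t):
    """Argument theta - x_t + y_t * pi/2 shared by all three formulas."""
    return theta - x_t + y_t * math.pi / 2

def neg_log_lik(theta, xs, ys):
    """-log P_L(y^T | x^T, theta) = -2 * sum log|cos(phase)|."""
    return -2.0 * sum(math.log(abs(math.cos(phase(theta, x, y))))
                      for x, y in zip(xs, ys))

def d1(theta, xs, ys):
    """First derivative in theta: 2 * sum tan(phase)."""
    return 2.0 * sum(math.tan(phase(theta, x, y)) for x, y in zip(xs, ys))

def d2(theta, xs, ys):
    """Second derivative in theta: 2 * sum 1 / cos(phase)^2, always positive."""
    return 2.0 * sum(1.0 / math.cos(phase(theta, x, y)) ** 2
                     for x, y in zip(xs, ys))

# Illustrative data: three measurement angles and their binary outcomes.
xs = [0.1, 0.7, 1.3]
ys = [0, 1, 0]
theta, h = 0.4, 1e-5
fd1 = (neg_log_lik(theta + h, xs, ys) - neg_log_lik(theta - h, xs, ys)) / (2 * h)
fd2 = (d1(theta + h, xs, ys) - d1(theta - h, xs, ys)) / (2 * h)
print(fd1, d1(theta, xs, ys))
print(fd2, d2(theta, xs, ys))
```

The positivity of the second derivative holds on every interval between singularities, which is what makes the first derivative locally invertible.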
Lemma 1. 
We have the expression
$$\ell_{y^T}(\theta^*) = \frac{1}{2\pi} \int_0^{2\pi} \ell_{y^T}(w)\, \ell''_{y^T}(w)\, dw \int_{\mathbb{R}} \exp\left(i\, \ell'_{y^T}(w)\, z\right) dz.$$
Proof. 
Let $g_{y^T}(\theta) = \ell'_{y^T}(\theta)$, which is homomorphic and locally invertible (since $\ell''_{y^T}(\theta)$ is never zero). For $a \in \mathbb{R}$, we denote by $l_{y^T}$ the function $a \mapsto \ell_{y^T}(g_{y^T}^{-1}(a))$. We have $\ell_{y^T}(\theta^*) = l_{y^T}(0)$. For $z \in \mathbb{R}$, let $\tilde{l}_{y^T}(z)$ be the Fourier transform of the function $l_{y^T}(a)$. Formally we have
$$\tilde{l}_{y^T}(z) = \int_{\mathbb{R}} l_{y^T}(a)\, e^{i a z}\, da = \int_0^{2\pi} \ell_{y^T}(w)\, \ell''_{y^T}(w)\, e^{i \ell'_{y^T}(w) z}\, dw$$
and inversely
$$l_{y^T}(a) = \frac{1}{2\pi} \int_{\mathbb{R}} \tilde{l}_{y^T}(z)\, e^{-i a z}\, dz.$$
Thus
$$\ell_{y^T}(\theta^*) = \frac{1}{2\pi} \int_{\mathbb{R}} \tilde{l}_{y^T}(z)\, dz = \frac{1}{2\pi} \int_0^{2\pi} \ell_{y^T}(w)\, \ell''_{y^T}(w)\, dw \times \int_{\mathbb{R}} e^{i \ell'_{y^T}(w) z}\, dz.$$
In fact, the function $\ell_{y^T}(\theta)$ may have several extrema, as we will see in the next section; thus $\ell'_{y^T}(\theta)$ may have several roots, and $g_{y^T}^{-1}(a)$ is polymorphic. In order to avoid the secondary roots, which contribute to the non-optimal extrema, we will concentrate on the main root in the vicinity of $\theta_Q$.
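The formal step behind Lemma 1 can be spelled out with the Dirac delta function. The following is a sketch of that reasoning, not part of the original proof; the roots $w_k$ of the first derivative in $[0, 2\pi)$ are notation introduced here.

```latex
\int_{\mathbb{R}} e^{i \ell'_{y^T}(w) z}\, dz
  = 2\pi\, \delta\!\left(\ell'_{y^T}(w)\right)
  = 2\pi \sum_k \frac{\delta(w - w_k)}{\left|\ell''_{y^T}(w_k)\right|},
\qquad
\frac{1}{2\pi} \int_0^{2\pi} \ell_{y^T}(w)\, \ell''_{y^T}(w)\, dw
  \int_{\mathbb{R}} e^{i \ell'_{y^T}(w) z}\, dz
  = \sum_k \ell_{y^T}(w_k).
```

Because the second derivative is strictly positive, the factor it contributes cancels the Jacobian of the delta function, so each root contributes the value of the negative log-likelihood at that root. Restricting the integral to a neighborhood of the main root leaves only the term at the maximum-likelihood angle.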
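Let $p^T = (p_1, \ldots, p_T)$ and $q^T = (q_1, \ldots, q_T)$ be two sequences of real numbers. We denote $p(y^T) = \prod_t p_t^{1 - y_t} q_t^{y_t}$.
Lemma 2. 
For any $1 \le t_0 \le T$ we have the identity
$$\sum_{y^T} y_{t_0}\, p(y^T) = q_{t_0} \prod_{t \neq t_0} (p_t + q_t).$$
For $t_1 \neq t_2$, we have
$$\sum_{y^T} y_{t_1} y_{t_2}\, p(y^T) = q_{t_1} q_{t_2} \prod_{t \neq t_1, t_2} (p_t + q_t).$$
Proof. 
This is a direct consequence of expanding the finite sums: since $p(y^T)$ factorizes over $t$, summing over $y_t \in \{0, 1\}$ gives a factor $p_t + q_t$ for each free index, while the weight $y_{t_0}$ keeps only the term $q_{t_0}$ at index $t_0$. □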
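Both identities of Lemma 2 can be verified by brute-force enumeration for a small $T$. The sketch below is only an illustration: the helper names `weight`, `lhs` and `rhs` and the numeric values of `p` and `q` are arbitrary choices, and indices are 0-based in the code.

```python
import itertools
import math

def weight(p, q, y):
    """p(y^T) = prod over t of p_t^(1 - y_t) * q_t^(y_t)."""
    return math.prod(q_t if y_t else p_t for p_t, q_t, y_t in zip(p, q, y))

def lhs(p, q, indices):
    """Sum over all binary sequences of (prod of y_t, t in indices) * p(y^T)."""
    total = 0.0
    for y in itertools.product((0, 1), repeat=len(p)):
        if all(y[t] for t in indices):  # the product of the y_t is 0 or 1
            total += weight(p, q, y)
    return total

def rhs(p, q, indices):
    """prod of q_t over indices times prod of (p_t + q_t) elsewhere."""
    out = 1.0
    for t, (p_t, q_t) in enumerate(zip(p, q)):
        out *= q_t if t in indices else p_t + q_t
    return out

# Arbitrary real sequences (0-based indices in the code).
p = [0.3, 1.2, -0.5, 2.0, 0.7]
q = [0.9, -0.4, 1.5, 0.1, 0.6]
print(lhs(p, q, (2,)), rhs(p, q, (2,)))        # first identity
print(lhs(p, q, (0, 3)), rhs(p, q, (0, 3)))    # second identity
```

The same functions accept complex entries, which is the case actually needed in the proofs, where the terms carry complex exponential factors.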
Theorem 1. 
Under mild conditions, we have the estimate
$$\sum_{y^T} P_S(y^T|x^T) \log \frac{P_S(y^T|x^T)}{P_{L(y^T)}(y^T|x^T)} = O(\sqrt{T}).$$
Proof. 
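Let $C(x^T) = \sum_{y^T} P_S(y^T|x^T)\, \ell_{y^T}(\theta^*)$. Applying both lemmas with $p_t = \cos(\theta_Q - x_t)^2\, e^{2i \tan(\theta - x_t) z}$ and $q_t = \sin(\theta_Q - x_t)^2\, e^{2i \tan(\theta - x_t + \pi/2) z}$, so that $p(y^T) = P_S(y^T|x^T)\, e^{i \ell'_{y^T}(\theta) z}$, we get
$$C(x^T) = \sum_{y^T} \frac{P_S(y^T|x^T)}{2\pi} \int_0^{2\pi} \ell_{y^T}(\theta)\, \ell''_{y^T}(\theta)\, d\theta \int_{\mathbb{R}} \exp\left(i \ell'_{y^T}(\theta) z\right) dz = \frac{1}{2\pi} \int_0^{2\pi} d\theta \int_{\mathbb{R}} \left(\bar{\ell}(\theta, z)\, \bar{\ell}''(\theta, z) + \bar{\Delta}(\theta, z)\right) \prod_t (p_t + q_t)\, dz$$
with
$$\bar{\ell}(\theta, z) = -2 \sum_t \left[\frac{p_t}{p_t + q_t} \log |\cos(\theta - x_t)| + \frac{q_t}{p_t + q_t} \log |\sin(\theta - x_t)|\right],$$
$$\bar{\ell}''(\theta, z) = 2 \sum_t \left[\frac{p_t}{p_t + q_t} \frac{1}{\cos(\theta - x_t)^2} + \frac{q_t}{p_t + q_t} \frac{1}{\sin(\theta - x_t)^2}\right],$$
$$\bar{\Delta}(\theta, z) = -2 \sum_t \frac{p_t q_t}{(p_t + q_t)^2} \left[\frac{\log |\cos(\theta - x_t)|}{\cos(\theta - x_t)^2} + \frac{\log |\sin(\theta - x_t)|}{\sin(\theta - x_t)^2}\right].$$
We notice that $\prod_t (p_t + q_t) = \exp\left(2i\, m(\theta) z - v(\theta) z^2 + O(z^3 T)\right)$ with
$$m(\theta) = \sum_t \left[\tan(\theta - x_t) \cos(\theta_Q - x_t)^2 + \tan(\theta - x_t + \pi/2) \sin(\theta_Q - x_t)^2\right],$$
$$v(\theta) = \sum_t \left[\tan(\theta - x_t)^2 \cos(\theta_Q - x_t)^2 + \tan(\theta - x_t + \pi/2)^2 \sin(\theta_Q - x_t)^2\right] - \sum_t \left[\tan(\theta - x_t) \cos(\theta_Q - x_t)^2 + \tan(\theta - x_t + \pi/2) \sin(\theta_Q - x_t)^2\right]^2.$$
We notice that $m(\theta) \sim 2(\theta - \theta_Q) T$ and $v(\theta) = T + O(\theta - \theta_Q)$ when $\theta \to \theta_Q$. The following expression is obtained via the saddle point approximation; the mild conditions are that it can be applied as in the maximum likelihood problem [8] (the error term would then be the smallest possible):
$$\int_{\mathbb{R}} \left(\bar{\ell}(\theta, z)\, \bar{\ell}''(\theta, z) + \bar{\Delta}(\theta, z)\right) \prod_t (p_t + q_t)\, dz = \int_{\mathbb{R}} \left(\bar{\ell}(\theta, z)\, \bar{\ell}''(\theta, z) + \bar{\Delta}(\theta, z)\right) \exp\left(2i\, m(\theta) z - v(\theta) z^2 + O(T |z|^3)\right) dz$$
$$= \left(\bar{\ell}(\theta)\, \bar{\ell}''(\theta) + \bar{\Delta}(\theta)\right) \sqrt{\frac{\pi}{v(\theta)}} \exp\left(-\frac{m(\theta)^2}{v(\theta)}\right) \left(1 + O(1/\sqrt{T})\right)$$
with $\bar{\ell}(\theta) = \bar{\ell}(\theta, 0)$, $\bar{\ell}''(\theta) = \bar{\ell}''(\theta, 0)$ and $\bar{\Delta}(\theta) = \bar{\Delta}(\theta, 0)$. Since $\frac{m(\theta)^2}{v(\theta)} = 4(\theta - \theta_Q)^2 T + O(|\theta - \theta_Q|^3 T)$, the factor $\prod_t (p_t + q_t)$ behaves, after integration in $z$, like a Gaussian function centered on $\theta_Q$ with standard deviation of order $1/\sqrt{T}$. Thus, via the saddle point approximation again, we obtain:
$$C(x^T) = \frac{1}{2\pi} \int_0^{2\pi} \left(\bar{\ell}(\theta)\, \bar{\ell}''(\theta) + \bar{\Delta}(\theta)\right) \sqrt{\frac{\pi}{v(\theta)}} \exp\left(-\frac{m(\theta)^2}{v(\theta)}\right) \left(1 + O(1/\sqrt{T})\right) d\theta$$
$$= \frac{1}{2\sqrt{\pi}} \int_0^{2\pi} \frac{\bar{\ell}(\theta)\, \bar{\ell}''(\theta) + \bar{\Delta}(\theta)}{\sqrt{v(\theta)}} \exp\left(-4(\theta - \theta_Q)^2 T + O(|\theta - \theta_Q|^3 T)\right) \left(1 + O(1/\sqrt{T})\right) d\theta$$
$$= \frac{\bar{\ell}(\theta_Q)\, \bar{\ell}''(\theta_Q) + \bar{\Delta}(\theta_Q)}{2 v(\theta_Q)} \left(1 + O(1/\sqrt{T})\right) = h(\theta_Q) \left(1 + O(1/\sqrt{T})\right)$$
with $h(\theta_Q) = \left(\bar{\ell}(\theta_Q)\, \bar{\ell}''(\theta_Q) + \bar{\Delta}(\theta_Q)\right) / 2T$, where $h(\theta) = -\sum_t \left[\cos(\theta - x_t)^2 \log \cos(\theta - x_t)^2 + \sin(\theta - x_t)^2 \log \sin(\theta - x_t)^2\right]$ is clearly $O(T)$.
Furthermore, $h(\theta_Q) = -\sum_{y^T} P_S(y^T|x^T) \log P_S(y^T|x^T)$. Since $-\log P_{L(y^T)}(y^T|x^T) = \ell_{y^T}(\theta^*)$, the quantity to estimate is $C(x^T) - h(\theta_Q)$, thus we have
$$\sum_{y^T} P_S(y^T|x^T) \log \frac{P_S(y^T|x^T)}{P_{L(y^T)}(y^T|x^T)} = O\left(\frac{h(\theta_Q)}{\sqrt{T}}\right) = O(\sqrt{T}).$$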
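Two facts used in the proof of Theorem 1, namely that the mean term vanishes at the true angle with a slope close to $2T$, and that the variance term equals $T$ there, can be checked numerically. The sketch below assumes the measurement angles $2\pi k/10$ mentioned later for Figure 2 and an arbitrary true angle of 0.4; the function name `m_and_v` is hypothetical.

```python
import math

def m_and_v(theta, theta_q, xs):
    """Mean term m(theta) and variance term v(theta) from the proof."""
    m = 0.0
    v = 0.0
    for x in xs:
        c2 = math.cos(theta_q - x) ** 2          # source probability of label 0
        s2 = math.sin(theta_q - x) ** 2          # source probability of label 1
        t0 = math.tan(theta - x)                 # score term for label 0
        t1 = math.tan(theta - x + math.pi / 2)   # score term for label 1
        mean = t0 * c2 + t1 * s2
        m += mean
        v += t0 ** 2 * c2 + t1 ** 2 * s2 - mean ** 2
    return m, v

# Assumed setup: angles 2*pi*k/10 repeated 20 times (T = 200), true angle 0.4.
xs = [2 * math.pi * k / 10 for k in range(1, 11)] * 20
theta_q = 0.4
T = len(xs)
m0, v0 = m_and_v(theta_q, theta_q, xs)
eps = 1e-4
m1, _ = m_and_v(theta_q + eps, theta_q, xs)
print(m0)                      # close to 0
print(v0 / T)                  # close to 1
print(m1 / (2 * eps * T))      # close to 1, i.e. m grows like 2 * (theta - theta_q) * T
```

This only probes the immediate neighborhood of the true angle; it says nothing about the secondary roots discussed after Lemma 1.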
Theorem 2. 
We have
$$\log S(x^T) = \log \sum_{y^T} P_{L(y^T)}(y^T|x^T) = \frac{1}{2} \log T + O(1).$$
Remark 1. 
This order of magnitude is much smaller than the main order of magnitude provided in Theorem 1, confirming that the overall regret is indeed of order $\sqrt{T}$. The regret per measurement is $O(1/\sqrt{T})$; therefore the individual regrets nevertheless tend to zero when $T \to \infty$.
Proof. 
It is formally a Shtarkov sum [5,9]. Using Lemma 1 and Lemma 2 gives
$$S(x^T) = \sum_{y^T} P_{L(y^T)}(y^T|x^T) = \sum_{y^T} \frac{1}{2\pi} \int_0^{2\pi} P_L(y^T|x^T, w)\, \ell''_{y^T}(w)\, dw \int_{\mathbb{R}} \exp\left(i \ell'_{y^T}(w) z\right) dz$$
$$= \frac{1}{2\pi} \int_0^{2\pi} d\theta \int_{\mathbb{R}} \tilde{\ell}''(\theta, z) \prod_t (p_t + q_t)\, dz$$
with $p_t = \cos(\theta - x_t)^2\, e^{2i \tan(\theta - x_t) z}$ and $q_t = \sin(\theta - x_t)^2\, e^{2i \tan(\theta - x_t + \pi/2) z}$, so that $p(y^T) = P_L(y^T|x^T, \theta)\, e^{i \ell'_{y^T}(\theta) z}$; $\tilde{\ell}''(\theta, z)$ has the same expression as $\bar{\ell}''(\theta, z)$ but with the new expressions of $p_t$ and $q_t$:
$$\tilde{\ell}''(\theta, z) = 2 \sum_t \left[\frac{p_t}{p_t + q_t} \frac{1}{\cos(\theta - x_t)^2} + \frac{q_t}{p_t + q_t} \frac{1}{\sin(\theta - x_t)^2}\right].$$
Developing further:
$$S(x^T) = \frac{1}{2\pi} \int_0^{2\pi} d\theta \int_{\mathbb{R}} \tilde{\ell}''(\theta, z) \exp\left(-2 T z^2 + O(T |z^3|)\right) dz,$$
and via the saddle point estimate (which consists of the change of variable $z \to \frac{1}{\sqrt{T}} z$), under the same conditions as in Theorem 1, we get
$$S(x^T) = \frac{1}{2\pi} \int_0^{2\pi} d\theta\, \tilde{\ell}''(\theta, 0) \sqrt{\frac{\pi}{2T}} \left(1 + O(1/\sqrt{T})\right).$$
We conclude with the evaluation $\tilde{\ell}''(\theta, 0) = 4T$, thus $S(x^T) = \sqrt{T \pi / 2}\, \left(1 + O(1/\sqrt{T})\right)$.
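For very small $T$, the Shtarkov sum of the polarization model can be computed by exhaustive enumeration. The sketch below is an illustration under stated assumptions: the maximization over the angle is done on a uniform grid (a choice made here, which slightly underestimates the sum), the measurement angles are drawn uniformly at random, and the function name `shtarkov_sum` is hypothetical. The cost grows like $2^T$, so it cannot probe the asymptotic regime.

```python
import itertools
import math
import random

def shtarkov_sum(xs, n_grid=360):
    """Brute-force S(x^T): sum over y^T of max over theta of P_L(y^T | x^T, theta).

    The maximization uses a uniform grid on [0, pi), which is enough because
    cos^2 and sin^2 are pi-periodic.
    """
    thetas = [math.pi * (k + 0.5) / n_grid for k in range(n_grid)]
    # prob[k][t] = (P(y_t = 0), P(y_t = 1)) for grid angle k and measurement t
    prob = [[(math.cos(th - x) ** 2, math.sin(th - x) ** 2) for x in xs]
            for th in thetas]
    total = 0.0
    for y in itertools.product((0, 1), repeat=len(xs)):
        total += max(math.prod(row[t][y_t] for t, y_t in enumerate(y))
                     for row in prob)
    return total

rng = random.Random(0)
for T in (2, 4, 8):
    xs = [rng.uniform(0.0, 2 * math.pi) for _ in range(T)]
    s = shtarkov_sum(xs)
    # Compare log S with (1/2) log T; they should differ by a bounded term.
    print(T, math.log(s), 0.5 * math.log(T))
```

Only the growth of log S with $T$ is meaningful in this comparison; the additive constant depends on the chosen angles and on the grid.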

4. Incremental Learning and Gradient Descent

We investigate gradient descent methods to reach the value $\theta^*$. There are many gradient strategies. In the classic strategy, which we call the slow gradient descent, we define the loss by $\mathrm{loss}(y_t, \theta_t | x_t) = \left(y_t - \sin(\theta_t - x_t)^2\right)^2$. Since the average value of $y_t$ is $\sin(\theta_Q - x_t)^2$, the average loss is $\left(\sin(\theta_Q - x_t)^2 - \sin(\theta_t - x_t)^2\right)^2 + \frac{\sin(2\theta_Q - 2x_t)^2}{4}$ (minimized at $\theta_t = \theta_Q$), and the gradient update of $\theta_t$ is
$$\theta_{t+1} = \theta_t - r \frac{\partial}{\partial \theta_t} \mathrm{loss}(y_t, \theta_t | x_t).$$
In Figure 1 we display our simulations as a sequence $\theta_t$ starting with a random initial $\theta_1$. We assume that for all $t$ the transmitted bit is always 0, i.e., the polarization angle is always $\theta_Q$. The learning rate is $r = 0.0002$. We simulate nine parallel gradient descents, randomly initialized and sharing the same random feature sequence $x^T$, with $T$ = 3,000,000. In Figure 1 we plot the parallel evolutions of the quantity $\theta_t$. The initial points are green diamonds and the final points are red diamonds. Although we start with nine different positions, the trajectories converge toward $\theta_Q \pm \pi$. However, the convergence is slow, confirming the $1/\sqrt{T}$ or worse rate. In fact, some initial positions converge even more slowly and, even after 3,000,000 trials, are still very far from the target. The reason is that the target function $\log P(y^T|x^T, \theta)$ has several local maxima, as shown in Figure 2, where the $x_t$ belong to the set of values $2\pi k/10$ for $k = 1, \ldots, 10$. It is very unlikely that a communication operator would tolerate so many runs (3,000,000) in order to have a proper convergence. However, it would be possible to run the gradient descents in parallel and treat them as a particle system in order to select the fastest to converge.
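The slow gradient descent can be sketched in a few lines of Python. This is not the author's simulation code: the uniform distribution of the measurement angles, the true angle of 1.0, the seed, and the much shorter run of 20,000 steps are all assumptions made here to keep the example fast, and the function names are hypothetical.

```python
import math
import random

def loss(y, theta, x):
    """Squared loss (y - sin(theta - x)^2)^2 for one measurement."""
    return (y - math.sin(theta - x) ** 2) ** 2

def loss_grad(y, theta, x):
    """Derivative of the loss in theta: -2 (y - sin(u)^2) sin(2u), u = theta - x."""
    u = theta - x
    return -2.0 * (y - math.sin(u) ** 2) * math.sin(2 * u)

def slow_step(theta, x, y, rate):
    """One update theta <- theta - r * d(loss)/d(theta)."""
    return theta - rate * loss_grad(y, theta, x)

def simulate(theta_q=1.0, n_runs=9, steps=20000, rate=0.0002, seed=0):
    """Run n_runs descents that share the same measurement stream (x_t, y_t)."""
    rng = random.Random(seed)
    thetas = [rng.uniform(0.0, 2 * math.pi) for _ in range(n_runs)]
    start = list(thetas)
    for _ in range(steps):
        x = rng.uniform(0.0, 2 * math.pi)   # assumed: uniform random angle
        # Bit 0 is always sent, so y = 1 with probability sin(theta_q - x)^2.
        y = 1 if rng.random() < math.sin(theta_q - x) ** 2 else 0
        thetas = [slow_step(th, x, y, rate) for th in thetas]
    return start, thetas

start, end = simulate()
for s0, s1 in zip(start, end):
    print(round(s0, 3), "->", round(s1, 3))
```

Because the likelihood is $\pi$-periodic in the angle, final values should be compared with the true angle modulo $\pi$.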
A supposedly faster gradient descent would be defined by the inverse derivative
$$\theta_{t+1} = \theta_t + r\, \frac{y_t - \sin(\theta_t - x_t)^2}{\frac{\partial}{\partial \theta_t} \sin(\theta_t - x_t)^2}.$$
We notice that in a stationary situation (where we suppose that $\theta_t$ varies very little) we have $E(\theta_{t+1}) = \theta_t + r\, \frac{\sin(\theta_Q - x_t)^2 - \sin(\theta_t - x_t)^2}{\frac{\partial}{\partial \theta_t} \sin(\theta_t - x_t)^2}$, which is equal to $\theta_t$ when $\theta_t = \theta_Q$. In Figure 3, we display our simulations as a sequence $\theta_t$ starting with a random initial $\theta_1$. The learning rate is $r = 0.0002$. We simulate nine parallel fast gradient descents, randomly initialized and sharing the same random feature sequence $x^T$, with $T$ = 3,000,000. The gradient descent converges fast but does not converge to the correct value $\theta_Q \pm \pi$. Again, this is due to the fact that the target function $\log P(y^T|x^T, \theta)$ has several local maxima, which act like traps for the gradient descent.
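The fast update and its stationarity property can be illustrated as follows. This is a sketch under assumptions: the guard that skips the update when the derivative is nearly zero is added here to avoid a division by zero and is not described in the paper, and the function names and numeric values are arbitrary.

```python
import math

def fast_step(theta, x, y, rate, eps=1e-6):
    """One update theta <- theta + r * (y - sin(u)^2) / (d/dtheta sin(u)^2)."""
    u = theta - x
    deriv = math.sin(2 * u)      # derivative of sin(u)^2 in theta
    if abs(deriv) < eps:         # assumed guard, not part of the paper
        return theta
    return theta + rate * (y - math.sin(u) ** 2) / deriv

def expected_step(theta, theta_q, x, rate):
    """Average displacement over the two outcomes of the measurement."""
    p1 = math.sin(theta_q - x) ** 2   # probability of measuring y = 1
    return (p1 * (fast_step(theta, x, 1, rate) - theta)
            + (1.0 - p1) * (fast_step(theta, x, 0, rate) - theta))

rate, theta_q, x = 0.0002, 1.0, 0.3
print(expected_step(theta_q, theta_q, x, rate))   # close to 0 at the true angle
print(expected_step(0.8, theta_q, x, rate))       # positive for this x
```

The division by the derivative makes individual steps very large when the derivative is small, which is one way a trajectory can be thrown toward a different local maximum.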

5. Conclusions

We have presented a simple quantum tomography problem, the unknown photon polarization problem, and have analyzed its learnability via AI over $T$ runs. We have shown that the learning regret cannot decay faster than $1/\sqrt{T}$ (i.e., a cumulative regret of order $\sqrt{T}$). Furthermore, the classic gradient descent is hampered by local extrema, which may significantly degrade the convergence rate relative to the theoretical one.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Tishby, N.; Zaslavsky, N. Deep learning and the information bottleneck principle. In Proceedings of the 2015 IEEE Information Theory Workshop (ITW), Jerusalem, Israel, 26 April–1 May 2015.
  2. Bouillard, A.; Jacquet, P. Quasi Black Hole Effect of Gradient Descent in Large Dimension: Consequence on Neural Network Learning. In Proceedings of the ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019.
  3. Abbe, E.; Sandon, C. Provable limitations of deep learning. arXiv 2018, arXiv:1812.06369.
  4. Ben-David, S.; Hrubeš, P.; Moran, S.; Shpilka, A.; Yehudayoff, A. Learnability can be undecidable. Nat. Mach. Intell. 2019, 1, 44–48.
  5. Jacquet, P.; Shamir, G.; Szpankowski, W. Precise Minimax Regret for Logistic Regression with Categorical Feature Values. In Algorithmic Learning Theory; PMLR: New York City, NY, USA, 2021.
  6. Van Erven, T.; Harremoës, P. Rényi divergence and Kullback–Leibler divergence. IEEE Trans. Inf. Theory 2014, 60, 3797–3820.
  7. O’Shea, T.; Hoydis, J. An introduction to deep learning for the physical layer. IEEE Trans. Cogn. Commun. Netw. 2017, 3, 563–575.
  8. Newey, W.K.; McFadden, D. Chapter 36: Large sample estimation and hypothesis testing. In Handbook of Econometrics; Elsevier: Amsterdam, The Netherlands, 1994.
  9. Shtarkov, Y.M. Universal sequential coding of single messages. Probl. Inf. Transm. 1987, 23, 3–17.
Figure 1. Angle estimate $\theta_t$ versus time for nine randomly initialized slow gradient descents. Green diamonds are starting points, red diamonds are stopping points.
Figure 2. Target function $\sum_t \left[\cos(\theta_Q - x_t)^2 \log \cos(\theta - x_t)^2 + \sin(\theta_Q - x_t)^2 \log \sin(\theta - x_t)^2\right]$ as a function of $\theta$.
Figure 3. Angle estimate $\theta_t$ versus time for nine randomly initialized fast gradient descents. Green diamonds are starting points, red diamonds are stopping points.

Cite as: Jacquet, P. Is Quantum Tomography a Difficult Problem for Machine Learning? Phys. Sci. Forum 2022, 5, 47. https://doi.org/10.3390/psf2022005047