# Fast Quantum State Reconstruction via Accelerated Non-Convex Programming


## Abstract


`MiFGD`), extends the applicability of quantum tomography for larger systems. Despite being a non-convex method, `MiFGD` converges provably close to the true density matrix at an accelerated linear rate asymptotically in the absence of experimental and statistical noise, under common assumptions. With this manuscript, we present the method, prove its convergence property and provide the Frobenius norm bound guarantees with respect to the true density matrix. From a practical point of view, we benchmark the algorithm performance with respect to other existing methods, in both synthetic and real (noisy) experiments, performed on IBM’s quantum processing unit. We find that the proposed algorithm performs orders of magnitude faster than the state-of-the-art approaches, with similar or better accuracy. In both synthetic and real experiments, we observed accurate and robust reconstruction, despite the presence of experimental and statistical noise in the tomographic data. Finally, we provide ready-to-use code for state tomography of multi-qubit systems.

## 1. Introduction

`MiFGD`). Our approach combines the ideas from compressed sensing, non-convex optimization, and acceleration/momentum techniques to scale QST beyond the current capabilities.

`MiFGD` includes acceleration motions per iteration, meaning that it uses two previous iterates to update the next estimate; see Section 2 for details. The intuition is that if the $k$-th and $(k-1)$-th estimates were pointing in the correct direction, then both pieces of information should be useful for determining the $(k+1)$-th estimate. Of course, such an approach requires an additional estimate to be stored; yet, we show both theoretically and experimentally that momentum results in faster estimation. We emphasize that the analysis becomes non-trivially challenging due to the inclusion of two previous iterates.

- (i) We prove that the non-convex `MiFGD` algorithm asymptotically enjoys an accelerated linear convergence rate in terms of the iterate distance, in the noiseless measurement data case and under common assumptions.
- (ii) We provide QST results using real measurement data from IBM’s quantum computers for up to 8 qubits, contributing to recent efforts on testing QST algorithms on real quantum data [22]. Our synthetic examples scale up to 12 qubits effortlessly, leaving the space for an efficient and hardware-aware implementation open for future work.
- (iii)
- (iv) We further increase the efficiency of `MiFGD` by extending its implementation to utilize parallel execution over shared and distributed memory systems. We experimentally showcase the scalability of our approach, which is particularly critical for the estimation of larger quantum systems.
- (v) We provide the implementation of our approach at https://github.com/gidiko/MiFGD (accessed on 18 January 2023), which is compatible with the open-source software Qiskit [28].

`MiFGD`. Then, we detail the experimental setup in Section 3, followed by the results in Section 4. Finally, we discuss related and future work, with concluding remarks, in Section 5.

## 2. Methods

#### 2.1. Problem Setup

**Definition 1.** A linear operator $\mathcal{A}: \mathbb{C}^{d\times d}\to \mathbb{R}^{m}$ satisfies the RIP on rank-$r$ matrices with the RIP constant $\delta_{r}\in (0,1)$, if the following holds with high probability for any rank-$r$ matrix $X\in \mathbb{C}^{d\times d}$:
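For reference, the restricted isometry property (RIP) is conventionally written as the two-sided bound below. This is the standard textbook form, given here as an assumption about the inequality intended in Definition 1, with $\|\cdot\|_{F}$ the Frobenius norm and $\|\cdot\|_{2}$ the Euclidean norm:

$$(1-\delta_{r})\,\|X\|_{F}^{2}\;\le\;\|\mathcal{A}(X)\|_{2}^{2}\;\le\;(1+\delta_{r})\,\|X\|_{F}^{2}.$$

Read this way, a small $\delta_{r}$ means that $\mathcal{A}$ approximately preserves the energy of every rank-$r$ matrix, even though $m$ may be much smaller than $d^{2}$.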

`MiFGD` algorithm: momentum-inspired factored gradient descent.

#### 2.2. The `MiFGD` Algorithm

`MiFGD` algorithm is a two-step variant of FGD, which iterates as follows:

`MiFGD` asymptotically converges at an accelerated linear rate around a neighborhood of the optimal value, akin to convex optimization results [40].

`MiFGD`. As Problem (3) is non-convex, the initialization plays an important role in achieving global convergence. The initial point $U_{0}$ is either randomly initialized [36,41,42], or set according to Lemma 4 in [26]:

**Algorithm 1.** Momentum-Inspired Factored Gradient Descent (`MiFGD`).

**Input:** $\mathcal{A}$ (sensing map), $y$ (measurement data), $r$ (rank), and $\mu$ (momentum parameter).

1. Set $Z_{0}=U_{0}$.
2. **for** $k=0,1,2,\dots$ **do**
   - $U_{k+1}=Z_{k}-\eta\,\mathcal{A}^{\dagger}\left(\mathcal{A}\left(Z_{k}Z_{k}^{\dagger}\right)-y\right)\cdot Z_{k}$
   - $Z_{k+1}=U_{k+1}+\mu\left(U_{k+1}-U_{k}\right)$
3. **end for**

**Output:** $\rho =U_{k+1}U_{k+1}^{\dagger}$.
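The two updates of Algorithm 1 can be sketched in a few lines of NumPy. This is a minimal illustration, not the released implementation: the helper names (`pauli_monomials`, `sense`, `adjoint`, `mifgd`, `demo`), the dense-matrix representation, the use of all $4^{n}$ Pauli monomials scaled by $1/\sqrt{d}$ (so that $\mathcal{A}^{\dagger}\mathcal{A}$ acts as the identity on Hermitian matrices), the random normalized initialization, and the values $\eta=0.2$, $\mu=0.25$ are all assumptions chosen so that a tiny noiseless example converges quickly.

```python
import itertools
import numpy as np

# Single-qubit Pauli matrices sigma_0 (identity), sigma_1, sigma_2, sigma_3.
PAULIS = [
    np.eye(2, dtype=complex),
    np.array([[0, 1], [1, 0]], dtype=complex),
    np.array([[0, -1j], [1j, 0]], dtype=complex),
    np.array([[1, 0], [0, -1]], dtype=complex),
]

def pauli_monomials(n):
    """Return all 4**n n-qubit Pauli monomials as dense matrices."""
    mats = []
    for idx in itertools.product(range(4), repeat=n):
        P = np.array([[1.0 + 0j]])
        for i in idx:
            P = np.kron(P, PAULIS[i])
        mats.append(P)
    return mats

def sense(As, rho):
    """Sensing map A(rho): one real number Tr(A_j rho) per sensing matrix."""
    return np.array([np.trace(A @ rho).real for A in As])

def adjoint(As, x):
    """Adjoint map A^dagger(x) = sum_j x_j A_j."""
    return sum(xj * A for xj, A in zip(x, As))

def mifgd(As, y, U0, eta, mu, iters):
    """Run the two updates of Algorithm 1 and return rho = U U^dagger."""
    U_prev, Z = U0.copy(), U0.copy()          # Z_0 = U_0
    for _ in range(iters):
        residual = sense(As, Z @ Z.conj().T) - y
        U = Z - eta * adjoint(As, residual) @ Z   # gradient-type step at Z_k
        Z = U + mu * (U - U_prev)                 # momentum step
        U_prev = U
    return U_prev @ U_prev.conj().T

def demo(n=2, eta=0.2, mu=0.25, iters=300, seed=0):
    """Noiseless rank-1 toy problem; returns (fidelity, Frobenius error)."""
    rng = np.random.default_rng(seed)
    d = 2 ** n
    u = rng.normal(size=(d, 1)) + 1j * rng.normal(size=(d, 1))
    u /= np.linalg.norm(u)
    rho_star = u @ u.conj().T                 # random pure target state
    As = [P / np.sqrt(d) for P in pauli_monomials(n)]  # assumed scaling
    y = sense(As, rho_star)
    U0 = rng.normal(size=(d, 1)) + 1j * rng.normal(size=(d, 1))
    U0 /= np.linalg.norm(U0)                  # assumed random initialization
    rho_hat = mifgd(As, y, U0, eta, mu, iters)
    fidelity = np.trace(rho_star @ rho_hat).real
    return fidelity, np.linalg.norm(rho_hat - rho_star)
```

Calling `demo(mu=0.0)` runs the same loop without momentum, which corresponds to plain FGD under the same assumptions. Note that the step size here differs from the $\eta=10^{-3}$ used in the experiments because the toy sensing map is normalized differently.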

#### 2.3. Theoretical Guarantees of the `MiFGD` Algorithm

`MiFGD` asymptotically achieves an accelerated linear rate.

**Theorem 1.** Assume that $\mathcal{A}(\cdot)$ satisfies the RIP in Definition 1 with the constant $\delta_{2r}\le \frac{1}{10}$. Initialize $U_{0}=U_{-1}$ such that

$\mathrm{rank}\left(\rho^{\star}\right)=r$, the output of the `MiFGD` in Algorithm 1 satisfies the following: for any $\epsilon>0$, there exist constants $C_{\epsilon}$ and $\tilde{C}_{\epsilon}$ such that, for all $k$,

`MiFGD` asymptotically enjoys an accelerated linear convergence rate in iterate distances up to a constant proportional to the momentum parameter $\mu$.

`MiFGD` has better dependency on the (inverse) condition number of $f$ compared to FGD. Such improvement of the dependency on the condition number is referred to as “acceleration” in the convex optimization literature [44,45]. Thus, assuming that the initial points $U_{0}$ and $U_{-1}$ are close enough to the optimum as stated in the theorem, `MiFGD` decreases its distance to $U^{\star}$ at an accelerated linear rate, up to an “error” level that depends on the momentum parameter $\mu$, which is bounded by $\frac{1}{2\cdot 10^{3}\, r\,\tau\left(\rho^{\star}\right)\sqrt{\kappa}}$.

## 3. Experimental Setup

#### 3.1. ${\rho}^{\star}$ Density Matrices and Quantum Circuits

`states.py` component of our complementary software package: https://github.com/gidiko/MiFGD (accessed on 18 January 2023)):

- The (generalized) GHZ state: $$|\mathtt{GHZ}(n)\rangle =\frac{|0\rangle^{\otimes n}+|1\rangle^{\otimes n}}{\sqrt{2}},\quad n>2.$$
- The (generalized) GHZ-minus state: $$|\mathtt{GHZ}_{-}(n)\rangle =\frac{|0\rangle^{\otimes n}-|1\rangle^{\otimes n}}{\sqrt{2}},\quad n>2.$$
- The Hadamard state: $$|\mathtt{Hadamard}(n)\rangle =\left(\frac{|0\rangle +|1\rangle}{\sqrt{2}}\right)^{\otimes n}.$$
- A random state $|\mathtt{Random}(n)\rangle$.
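As an illustration of the definitions above, the state vectors and their rank-1 density matrices can be built directly with NumPy. The function names below are hypothetical and are not taken from `states.py`; in particular, `random_state` draws a normalized complex Gaussian vector, which is only one possible way to realize $|\mathtt{Random}(n)\rangle$ and is an assumption of this sketch.

```python
import numpy as np

def ghz(n, sign=+1):
    """State vector (|0...0> + sign * |1...1>) / sqrt(2) on n qubits."""
    psi = np.zeros(2 ** n, dtype=complex)
    psi[0] = 1.0       # amplitude of |0...0>
    psi[-1] = sign     # amplitude of |1...1>; sign=-1 gives GHZ-minus
    return psi / np.sqrt(2)

def hadamard(n):
    """((|0> + |1>) / sqrt(2)) tensored n times: the uniform superposition."""
    return np.full(2 ** n, 1.0 / np.sqrt(2 ** n), dtype=complex)

def random_state(n, seed=0):
    """Normalized complex Gaussian vector (assumed construction)."""
    rng = np.random.default_rng(seed)
    psi = rng.normal(size=2 ** n) + 1j * rng.normal(size=2 ** n)
    return psi / np.linalg.norm(psi)

def density_matrix(psi):
    """Rank-1 density matrix |psi><psi|."""
    return np.outer(psi, psi.conj())
```

Each resulting $\rho^{\star}$ has unit trace and rank one, which matches the rank-$r$ setting with $r=1$ used for pure target states.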

#### 3.2. Measuring Quantum States

- (i) We sample $m=O\left(r\cdot d\cdot \mathrm{poly}\left(\log d\right)\right)$ or $m=\mathtt{measpc}\cdot d^{2}$ Pauli monomials uniformly over $\left\{\sigma_{i}\right\}^{\otimes n}$ with $i\in \{0,\dots ,3\}$, where $\mathtt{measpc}\in [0,1]$ represents the percentage of measurements out of full tomography.
- (ii) For every monomial $P_{i}$ in the generated set, we identify an experimental setting $\alpha(i)$ that corresponds to the monomial. There, qubits whose Pauli operator in $P_{i}$ is the identity operator are measured, without loss of generality, in the $\sigma_{3}$ basis. For example, for $n=3$ and $P_{i}=\sigma_{0}\otimes \sigma_{1}\otimes \sigma_{1}$, we identify the measurement setting $\alpha(i)=(z,x,x)$.
- (iii) We measure the quantum state in the Pauli basis that corresponds to $\alpha(i)$, and record the outcomes.
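The first two steps above can be sketched with the standard library alone. The helper names are hypothetical; the sketch assumes the usual identification $\sigma_{1}=X$, $\sigma_{2}=Y$, $\sigma_{3}=Z$ (consistent with the $(z,x,x)$ example) and samples monomials without replacement, which may differ from the sampling used in the released code.

```python
import random

# Measurement axis per single-qubit Pauli index; the identity (0) is
# measured in the sigma_3 (z) basis, as described in step (ii).
AXIS = {0: "z", 1: "x", 2: "y", 3: "z"}

def sample_monomials(n, measpc, seed=0):
    """Sample round(measpc * 4**n) distinct Pauli monomials as index tuples."""
    d2 = 4 ** n                       # d**2 with d = 2**n
    m = max(1, round(measpc * d2))
    rng = random.Random(seed)
    codes = rng.sample(range(d2), m)  # uniform, without replacement (assumption)
    # Decode each integer into its n base-4 digits, most significant first.
    return [tuple((c // 4 ** k) % 4 for k in reversed(range(n))) for c in codes]

def setting(monomial):
    """Map a Pauli monomial (tuple of indices) to its measurement setting."""
    return tuple(AXIS[i] for i in monomial)
```

For instance, `setting((0, 1, 1))` returns `("z", "x", "x")`, reproducing the example in step (ii).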

#### 3.3. Algorithmic Setup

`maxiters`, the step size $\eta$, the relative error from successive state iterates `reltol`, the momentum parameter $\mu$, the percentage of the complete set of measurements (i.e., over all possible Pauli monomials) `measpc`, and the seed. In the experiments that follow, we set $\mathtt{maxiters}=1000$, $\eta =10^{-3}$, and $\mathtt{reltol}=5\times 10^{-4}$, unless stated otherwise. Regarding acceleration, $\mu =0$ when acceleration is muted; we experiment over the range of values $\mu \in \{\frac{1}{8},\frac{1}{4},\frac{1}{3},\frac{3}{4}\}$ when investigating the acceleration effect, beyond the theoretically suggested $\mu^{\star}$. In order to explore the dependence of our approach on the number of measurements available, `measpc` varies over the set $\{5\%,10\%,15\%,20\%,40\%,60\%\}$; `seed` is used for differentiating repeated runs with all other parameters kept fixed (`maxiters` is `num_iterations` in the code; also, `reltol` is `relative_error_tolerance` and `measpc` is `complete_measurements_percentage`).

`MiFGD`, we report on outputs including:

- The evolution with respect to the distance between $\widehat{\rho}$ and ${\rho}^{\star}$: $\parallel \widehat{\rho}-{\rho}^{\star}{\parallel}_{F}$, for various $\mu $’s.
- The number of iterations to reach $\mathtt{reltol}$ to ${\rho}^{\star}$ for various $\mu $’s.
- The fidelity of $\widehat{\rho}$, defined as $\mathrm{Tr}\left({\rho}^{\star}\widehat{\rho}\right)$ (for rank-1 ${\rho}^{\star}$), as a function of the acceleration parameter $\mu $ in the default set.
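The reported quantities are simple functions of $\widehat{\rho}$ and ${\rho}^{\star}$. The sketch below uses hypothetical helper names; the form of `relative_change` (the quantity compared against `reltol`) is an assumption, since the exact stopping rule is not spelled out here.

```python
import numpy as np

def frobenius_error(rho_hat, rho_star):
    """Distance || rho_hat - rho_star ||_F to the target state."""
    return np.linalg.norm(rho_hat - rho_star, ord="fro")

def fidelity_pure_target(rho_hat, rho_star):
    """Fidelity Tr(rho_star rho_hat), valid for a rank-1 rho_star."""
    return np.trace(rho_star @ rho_hat).real

def relative_change(rho_new, rho_old):
    """Relative error between successive iterates (assumed form)."""
    return np.linalg.norm(rho_new - rho_old) / np.linalg.norm(rho_new)
```

A run would stop once `relative_change` drops below `reltol` or `maxiters` is reached, under the assumption stated above.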

`measpc` values, repeat 5 times for each individual setup, varying the supplied seed, and depict their 25th, 50th and 75th percentiles.

#### 3.4. Experimental Setup on Quantum Processing Unit (QPU)

`ibmq_boeblingen`. The layout/connectivity of the device is shown in Figure 1. The 6-qubit data was from qubits $[0,1,2,3,8,9]$, and the 8-qubit data was from $[0,1,2,3,8,9,6,4]$. The $T_{1}$ coherence times are $[39.1,75.7,66.7,100.0,120.3,39.2,70.7,132.3]$ $\mu$s, and the $T_{2}$ coherence times are $[86.8,94.8,106.8,63.6,156.5,66.7,104.5,134.8]$ $\mu$s. The circuits for generating the 6-qubit and 8-qubit GHZ states are shown in Figure 1. The typical two-qubit gate errors measured from randomized benchmarking (RB) for the relevant qubits are summarized in Table 1.

`qiskit-ignis` (https://github.com/Qiskit/qiskit-ignis (accessed on 18 January 2023)). For complete QST of an $n$-qubit state, $3^{n}$ circuits are needed. The result of each circuit is averaged over 8192, 4096 or 2048 shots, for different $n$-qubit scenarios. To mitigate readout errors, we prepare and measure all of the $2^{n}$ computational basis states in the computational basis to construct a calibration matrix $C$. $C$ has dimension $2^{n}$ by $2^{n}$, where each column vector corresponds to the measured outcome of a prepared basis state. In the ideal case of no readout error, $C$ is an identity matrix. We use $C$ to correct for the measured outcomes of the experiment by minimizing the function:

`cvxopt`[49].
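The calibration step can be illustrated with a small NumPy sketch. It assumes the objective is the squared residual $\|C\,v-v^{\mathrm{meas}}\|_{2}^{2}$ over probability vectors $v$, and it replaces the constrained solve with an unconstrained least-squares fit followed by clipping and renormalization; this is a simplification for illustration, not the `cvxopt`-based procedure. The function names are hypothetical.

```python
import numpy as np

def calibration_matrix(counts_by_prepared_state):
    """Build C: column j holds outcome frequencies after preparing basis state j.

    counts_by_prepared_state[j][k] is the number of times outcome k was
    observed when basis state j was prepared.
    """
    C = np.array(counts_by_prepared_state, dtype=float).T
    return C / C.sum(axis=0, keepdims=True)   # normalize each column

def mitigate(C, v_meas):
    """Approximately solve min_v ||C v - v_meas||^2 over probability vectors."""
    v, *_ = np.linalg.lstsq(C, v_meas, rcond=None)
    v = np.clip(v, 0.0, None)                 # crude stand-in for v >= 0
    return v / v.sum()                        # crude stand-in for sum(v) = 1
```

With no readout error, `C` is the identity and `mitigate` returns the measured distribution unchanged.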

## 4. Results

#### 4.1. `MiFGD` on 6- and 8-Qubit Real Quantum Data

`shots` for each setting. The (circuit, number of `shots`) measurement configurations from IBM Quantum devices are summarized in Table 2.

`qiskit-aer`. This is a parallel, high performance quantum circuit simulator written in C++ that can support a variety of realistic circuit level noise models.

`MiFGD`, we further plot its performance on the same settings but using measurements coming from an idealized quantum simulator. Figure 3 considers the exact same settings as in Figure 2. It is evident that `MiFGD` can achieve better reconstruction performance when the data are less erroneous. This also highlights that, in real noisy scenarios, the radius of the convergence region of `MiFGD` around ${\rho}^{\star}$ is controlled mostly by the noise level, rather than by the inclusion of momentum acceleration.

`MiFGD`, defined as $\mathrm{Tr}\left({\rho}^{\star}\widehat{\rho}\right)$, versus various $\mu $ values and for different circuits $\left({\rho}^{\star}\right)$. Shaded area denotes standard deviation around the mean over repeated runs in all cases. The plots show the significant gap in performance when using real quantum data versus using synthetic simulated data within a controlled environment.

#### 4.2. Performance Comparison with Full Tomography Methods in Qiskit

`MiFGD` with publicly available implementations for QST reconstruction. Two common techniques for QST, included in the `qiskit-ignis` distribution [28], are: (i) the `CVXPY` fitter method, which uses the `CVXPY` convex optimization package [50,51]; and (ii) the `lstsq` method, which uses least-squares fitting [52]. Both methods solve the full tomography problem (In [8], it was shown that the minimization program (13) yields a robust estimation of low-rank states in the compressed sensing setting. Thus, one can use the `CVXPY` fitter method to solve Equation (13) with $m\ll d^{2}$ Pauli expectation values to obtain a robust reconstruction of ${\rho}^{\star}$) according to the following expression:

`MiFGD` is not restricted to “tall” $U$ scenarios to encode PSD and rank constraints: even without rank constraints, one could still exploit the matrix decomposition $\rho =UU^{\dagger}$ to avoid the PSD projection, $\rho \succeq 0$, where $U\in \mathbb{C}^{d\times d}$. For the `lstsq` fitter method, the putative estimate $\widehat{\rho}$ is rescaled using the method proposed in [52]. For `CVXPY`, the convex constraint makes the optimization problem a semidefinite programming (SDP) instance. By default, `CVXPY` calls the `SCS` solver that can handle all problems (including SDPs) [53,54]. Further comparison results with matrix factorization techniques from the machine learning community are provided in the Appendix for $n=12$.

`MiFGD`, we set $\eta =0.001$, $\mu =\frac{3}{4}$, and stopping criterion/tolerance $\mathtt{reltol}=10^{-5}$. All experiments are run on a MacBook Pro with a 2.3 GHz Quad-Core Intel Core i7 CPU and 32 GB RAM.

`CVXPY` and `lstsq` attain almost perfect fidelity, while being comparable to or faster than `MiFGD`. (ii) The difference in performance becomes apparent from $n=6$ onward: while `MiFGD` attains 98% fidelity in <5 s, `CVXPY` and `lstsq` require up to hundreds of seconds to find a good solution. (iii) Finally, while `MiFGD` gets to high-fidelity solutions in seconds for $n=7,8$, the `CVXPY` and `lstsq` methods could not finish tomography as their memory usage exceeded the system’s available memory.

`MiFGD` are the fidelities at the last iteration, before the stopping criterion is activated, or the maximum number of iterations is exceeded. However, the reported fidelity is not necessarily the best one during the whole execution: for all cases, we observe that `MiFGD` finds intermediate solutions with fidelity >99%. Still, it is not realistic to assume that the iteration with the best fidelity is known a priori, and this is the reason we report only the final iteration fidelity.

#### 4.3. Performance Comparison of `MiFGD` with Neural-Network Quantum State Tomography

`MiFGD` with neural network approaches. Per [9,10,11,27], we model a quantum state with a two-layer Restricted Boltzmann Machine (RBM). RBMs are stochastic neural networks, where each layer contains a number of binary stochastic variables: the size of the visible layer corresponds to the number of input qubits, while the size of the hidden layer is a hyperparameter controlling the representation error. We experiment with three types of RBMs for reconstructing either the positive-real wave function, the complex wave function, or the density matrix of the quantum state. In the first two cases the state is assumed pure, while in the last, general mixed quantum states can be represented. We leverage the implementation in QuCumber [10], `PositiveRealWaveFunction (PRWF)`, `ComplexWaveFunction (CWF)`, and `DensityMatrix (DM)`, respectively.

`PRWF`, `CWF`, and `DM` neural networks (We utilize a GPU (NVidia GeForce GTX 1080 Ti, 11 GB RAM) for faster training of the neural networks) with measurements collected by the QASM Simulator.

`measpc` = 50% and `shots` = 2048. The set of measurements is presented to the RBM implementation, along with the target positive-real wave function (for `PRWF`), complex wave function (for `CWF`) or the target density matrix (for `DM`) in a suitable format for training. We train `Hadamard` and `Random` states with 20 epochs, and the `GHZ` state with 100 epochs (We experimented with higher numbers of epochs (up to 500) for all cases, but after the reported number of epochs, the QuCumber methods did not improve, and in some cases worsened). We set the number of hidden variables (and also of additional auxiliary variables for `DM`) to be equal to the number of input variables $n$, and we use 100 data points for both the positive and the negative phase of the gradient (as per the recommendation for the defaults). We choose $k=10$ contrastive divergence steps and fix the learning rate to 10 (per hyperparameter tuning). Lastly, we limit the fitting time of the QuCumber methods (excluding data preparation time) to three hours. To compare to the RBM results, we run `MiFGD` with $\eta =0.001$, $\mu =\frac{3}{4}$, $\mathtt{reltol}=10^{-5}$ and using `measpc` = 50%, keeping previously chosen values for all other hyperparameters.

`PRWF`, `CWF`, and `DM`. We observe that for all cases, the QuCumber methods are orders of magnitude slower than `MiFGD`. E.g., for $n=8$, for all three states, `CWF` and `DM` did not finish a single epoch in 3 h, while `MiFGD` achieves high fidelity in less than 30 s. For $\mathtt{Hadamard}\left(n\right)$ and $\mathtt{Random}\left(n\right)$, reaching reasonable fidelities is significantly slower for both `CWF` and `DM`, while `PRWF` hardly improves its performance throughout the training. For the `GHZ` case, `CWF` and `DM` also show non-monotonic behavior: even after a few thousand seconds, fidelities have not “stabilized”, while `PRWF` stabilizes at very low fidelities. In comparison, `MiFGD` is several orders of magnitude faster than both `CWF` and `DM`, and its fidelity smoothly increases to comparable or higher values. Further, in Table 4, we report final fidelities (within the 3 h time window), and reported times.

#### 4.4. The Effect of Parallelization

`MiFGD`. We parallelize the iteration step across a number of processes, which can be either distributed and network connected, or sharing memory in a multicore environment. Our approach is based on the Message Passing Interface (MPI) specification [55], which is the lingua franca for interprocess communication in high-performance parallel and supercomputing applications. An MPI implementation provides facilities for launching processes organized in a virtual topology and highly tuned primitives for point-to-point and collective communication between them.

`MPI_Allreduce` collective communication primitive with `MPI_SUM` as its reduction operator: the underlying implementation will ensure minimum communication complexity for the operation (e.g., $\log p$ steps for $p$ processes organized in a communication ring) and thus maximum performance (This communication pattern can alternatively be realized in two stages, as naturally suggested by its structure: (i) first invoke MPI’s `MPI_Reduce` primitive, with `MPI_SUM` as its reduction operator, which results in the element-wise accumulation of local corrections (vector sum) at a single, designated root process, and (ii) finally, send a “copy” of this sum from the root process to each process participating in the parallel computation (broadcasting); the `MPI_Bcast` primitive can be utilized for this latter stage. However, `MPI_Allreduce` is typically faster, since its actual implementation is not constrained by the requirement to have the sum available at a specific root process at an intermediate time point, as the two-stage approach implies). We leverage `mpi4py` [56] bindings to issue MPI calls in our parallel Python code.
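The reduction pattern described above can be simulated in a single process to see why the parallel step matches the serial one. The sketch below assumes the measurements are split evenly across $p$ processes, each of which computes a local correction; all function names are hypothetical, and `allreduce_sum` is a plain-Python stand-in for the collective (with `mpi4py`, the corresponding call would be `comm.Allreduce(sendbuf, recvbuf, op=MPI.SUM)`).

```python
import numpy as np

def local_correction(As, y, Z, idx):
    """Correction term restricted to the measurement indices one process owns."""
    rho = Z @ Z.conj().T
    out = np.zeros(Z.shape, dtype=complex)
    for j in idx:
        r_j = np.trace(As[j] @ rho).real - y[j]   # j-th residual entry
        out += r_j * (As[j] @ Z)                  # its share of A^dagger(.) Z
    return out

def allreduce_sum(parts):
    """Stand-in for MPI_Allreduce with MPI_SUM: everyone receives the total."""
    total = sum(parts)
    return [total.copy() for _ in parts]

def parallel_step(As, y, Z, eta, p):
    """One factored-gradient update computed by p simulated processes."""
    chunks = np.array_split(np.arange(len(As)), p)   # assumed partitioning
    parts = [local_correction(As, y, Z, idx) for idx in chunks]
    totals = allreduce_sum(parts)
    return [Z - eta * t for t in totals]             # identical on every process
```

Because the correction is a sum over measurements, every simulated process ends up with the same updated iterate as a serial run, regardless of $p$.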

`shots`; parallel `MiFGD` runs with default parameters and using all measurements (`measpc` = 100%). Reported times are wall-clock computation times. These exclude initialization time for all processes to load Pauli monomials and measurements: we here target parallelizing computation proper in `MiFGD`.

`MiFGD`in Figure 7 Left. We observe that the benefits of parallelization are pronounced for bigger problems (here: $n=8$ qubits) and maximum scalability results when we use all physical cores (48 in our platform).

`MiFGD` for $p=8,16,32,48,64$: we observe a smooth path to convergence for all $p$ counts, with compute time again minimized for $p=48$. Note that in this case we use `measpc` = 10% and $\mu =\frac{1}{4}$.

## 5. Conclusions and Discussions

`MiFGD` algorithm for the factorized form of the low-rank QST problems. We proved that, under certain assumptions on the problem parameters, `MiFGD` converges linearly to a neighborhood of the optimal solution, whose size depends on the momentum parameter $\mu$, while using acceleration motions in a non-convex setting. We demonstrate empirically, using both simulated and real data, that `MiFGD` outperforms non-accelerated methods on both the original problem domain and the factorized space, contributing to recent efforts on testing QST algorithms on real quantum data [22]. These results expand on existing work in the literature illustrating the promise of factorized methods for certain low-rank matrix problems. Finally, we provide a publicly available implementation of our approach, compatible with the open-source software Qiskit [28], where we further exploit parallel computations in `MiFGD` by extending its implementation to enable efficient, parallel execution over shared and distributed memory systems.

`MiFGD`. Preliminary results suggest that only $O(r\cdot \log d)$ random Pauli bases should be taken for a reconstruction, with the same level of accuracy as with $O(r\cdot d\cdot \log d)$ expectation values of random Pauli matrices. We leave the analysis of our algorithm in this case for future work, along with detailed experiments.

#### Related Work


## Appendix A. Additional Experiments

#### Appendix A.1. IBM Quantum System Experiments: ${\mathtt{GHZ}}_{-}\left(6\right)$ Circuit, 2048 `Shots`

**Figure A1.** Target error list plots for reconstructing the ${\mathtt{GHZ}}_{-}\left(6\right)$ circuit using real measurements from IBM Quantum system experiments.

**Figure A2.** Target error list plots for reconstructing the ${\mathtt{GHZ}}_{-}\left(6\right)$ circuit using synthetic measurements from IBM’s quantum simulator.

**Figure A3.** Convergence iteration plots for reconstructing the ${\mathtt{GHZ}}_{-}\left(6\right)$ circuit using real measurements from IBM Quantum system experiments and synthetic measurements from Qiskit simulation experiments.

**Figure A4.** Fidelity list plots for reconstructing the ${\mathtt{GHZ}}_{-}\left(6\right)$ circuit using real measurements from IBM Quantum system experiments and synthetic measurements from Qiskit simulation experiments.

#### Appendix A.2. IBM Quantum System Experiments: ${\mathtt{GHZ}}_{-}\left(8\right)$ Circuit, 2048 `Shots`

**Figure A5.** Target error list plots for reconstructing the ${\mathtt{GHZ}}_{-}\left(8\right)$ circuit using real measurements from IBM Quantum system experiments.

**Figure A6.** Target error list plots for reconstructing the ${\mathtt{GHZ}}_{-}\left(8\right)$ circuit using synthetic measurements from IBM’s quantum simulator.

**Figure A7.** Convergence iteration plots for reconstructing the ${\mathtt{GHZ}}_{-}\left(8\right)$ circuit using real measurements from IBM Quantum system experiments and synthetic measurements from Qiskit simulation experiments.

**Figure A8.** Fidelity list plots for reconstructing the ${\mathtt{GHZ}}_{-}\left(8\right)$ circuit using real measurements from IBM Quantum system experiments and synthetic measurements from Qiskit simulation experiments.

#### Appendix A.3. IBM Quantum System Experiments: ${\mathtt{GHZ}}_{-}\left(8\right)$ Circuit, 4096 `Shots`

**Figure A9.** Target error list plots for reconstructing the ${\mathtt{GHZ}}_{-}\left(8\right)$ circuit using real measurements from IBM Quantum system experiments.

**Figure A10.** Target error list plots for reconstructing the ${\mathtt{GHZ}}_{-}\left(8\right)$ circuit using synthetic measurements from IBM’s quantum simulator.

**Figure A11.** Convergence iteration plots for reconstructing the ${\mathtt{GHZ}}_{-}\left(8\right)$ circuit using real measurements from IBM Quantum system experiments and synthetic measurements from Qiskit simulation experiments.

**Figure A12.** Fidelity list plots for reconstructing the ${\mathtt{GHZ}}_{-}\left(8\right)$ circuit using real measurements from IBM Quantum system experiments and synthetic measurements from Qiskit simulation experiments.

#### Appendix A.4. IBM Quantum System Experiments: `Hadamard`(6) Circuit, 8192 `Shots`

**Figure A13.** Target error list plots for reconstructing the $\mathtt{Hadamard}\left(6\right)$ circuit using real measurements from IBM Quantum system experiments.

**Figure A14.** Target error list plots for reconstructing the $\mathtt{Hadamard}\left(6\right)$ circuit using synthetic measurements from IBM’s quantum simulator.

**Figure A15.** Convergence iteration plots for reconstructing the $\mathtt{Hadamard}\left(6\right)$ circuit using real measurements from IBM Quantum system experiments and synthetic measurements from Qiskit simulation experiments.

**Figure A16.** Fidelity list plots for reconstructing the $\mathtt{Hadamard}\left(6\right)$ circuit using real measurements from IBM Quantum system experiments and synthetic measurements from Qiskit simulation experiments.

#### Appendix A.5. IBM Quantum System Experiments: `Hadamard`(8) Circuit, 4096 `Shots`

**Figure A17.** Target error list plots for reconstructing the $\mathtt{Hadamard}\left(8\right)$ circuit using real measurements from IBM Quantum system experiments.

**Figure A18.** Target error list plots for reconstructing the $\mathtt{Hadamard}\left(8\right)$ circuit using synthetic measurements from IBM’s quantum simulator.

**Figure A19.** Convergence iteration plots for reconstructing the $\mathtt{Hadamard}\left(8\right)$ circuit using real measurements from IBM Quantum system experiments and synthetic measurements from Qiskit simulation experiments.

**Figure A20.** Fidelity list plots for reconstructing the $\mathtt{Hadamard}\left(8\right)$ circuit using real measurements from IBM Quantum system experiments and synthetic measurements from Qiskit simulation experiments.

#### Appendix A.6. Synthetic Experiments for n = 12

`MiFGD` with (i) the `Matrix ALPS` framework [61], a state-of-the-art projected gradient descent algorithm, and an optimized version of matrix iterative hard thresholding, operating on the full matrix variable $\rho$, with adaptive step size $\eta$ (we note that this algorithm has outperformed most of the schemes that work on the original space $\rho$; see [61]); (ii) the plain Procrustes Flow/`FGD` algorithm [25,26,32], where we use the step size as reported in [25], since the latter has reported better performance than vanilla Procrustes Flow. We note that the Procrustes Flow/`FGD` algorithm is similar to our algorithm without acceleration. Further, the original Procrustes Flow/`FGD` algorithm relies on performing many iterations in the original space $\rho$ as an initialization scheme, which is often prohibitive as the problem dimensions grow. Both for our algorithm and the plain Procrustes Flow/`FGD` scheme, we use random initialization.

**Figure A21.** Synthetic example results on low-rank matrix sensing in higher dimensions (equivalent to $n=12$ qubits). **Top row**: Convergence behavior vs. time elapsed. **Bottom row**: Convergence behavior vs. number of iterations. **Left panel**: $c=5$, noiseless case; **Center panel**: $c=3$, noiseless case; **Right panel**: $c=5$, noisy case, $\|w\|_{2}=0.01$.

#### Appendix A.7. Asymptotic Complexity Comparison of `lstsq`, `CVXPY`, and `MiFGD`

`lstsq` can only be applied when we have a full tomographic set of measurements; this makes the `lstsq` algorithm inapplicable in the compressed sensing scenario, where the number of measurements can be significantly reduced. Yet, we make the comparison by providing an information-theoretically complete set of measurements to `lstsq` and `CVXPY`, as well as to `MiFGD`, to highlight the efficiency of our proposed method, even in a scenario that is not exactly the one intended in our work. Given this, we compare in detail the asymptotic scaling of `MiFGD` with `lstsq` and `CVXPY` below:

- `lstsq` is based on the computation of eigenvalue/eigenvector pairs (among other steps) of a matrix of size equal to the density matrix we want to reconstruct. Based on our notation, the density matrices are denoted as $\rho$ with dimensions $2^{n}\times 2^{n}$. Here, $n$ is the number of qubits in the quantum system. Standard libraries for eigenvalue/eigenvector calculations, like LAPACK, reduce a Hermitian matrix to tridiagonal form using the Householder method, which takes overall $O\left(\left(2^{n}\right)^{3}\right)$ computational complexity. The other steps in the `lstsq` procedure either take constant time, or $O\left(2^{n}\right)$ complexity. Thus, the actual run-time of an implementation depends on the eigensystem solver that is being used.
- `CVXPY` is distributed with open-source solvers; for the case of SDP instances, `CVXPY` utilizes the Splitting Conic Solver (SCS) (https://github.com/cvxgrp/scs (accessed on 18 January 2023)), a general numerical optimization package for solving large-scale convex cone problems. SCS applies Douglas-Rachford splitting to a homogeneous embedding of the quadratic cone program. Because of the PSD constraint, this again involves the computation of eigenvalue/eigenvector pairs (among other steps) of a matrix of size equal to the density matrix we want to reconstruct. This takes overall $O\left(\left(2^{n}\right)^{3}\right)$ computational complexity, not including the other steps performed within the SCS solver. This is an iterative algorithm that requires such complexity per iteration. Douglas-Rachford splitting methods enjoy an $O\left(\frac{1}{\epsilon}\right)$ convergence rate in general [53,102,103]. This leads to a rough $O\left(\left(2^{n}\right)^{3}\cdot \frac{1}{\epsilon}\right)$ overall complexity (This is an optimistic complexity bound since we have skipped several details within the Douglas-Rachford implementation of `CVXPY`).
- For `MiFGD`, and for a sufficiently small momentum value, we require $O\left(\sqrt{\kappa}\cdot \log\left(\frac{1}{\epsilon}\right)\right)$ iterations to get close to the optimal value. Per iteration, `MiFGD` does not involve any expensive eigensystem solvers, but relies only on matrix-matrix and matrix-vector multiplications. In particular, the main computational complexity per iteration originates from the iteration: $$\begin{aligned} U_{k+1}&=Z_{k}-\eta\,\mathcal{A}^{\dagger}\left(\mathcal{A}\left(Z_{k}Z_{k}^{\dagger}\right)-y\right)\cdot Z_{k},\\ Z_{k+1}&=U_{k+1}+\mu\left(U_{k+1}-U_{k}\right).\end{aligned}$$ Here, $U_{k},Z_{k}\in \mathbb{C}^{2^{n}\times r}$ for all $k$. Observe that $\mathcal{A}\left(Z_{k}Z_{k}^{\dagger}\right)\in \mathbb{R}^{m}$, where each element is computed independently. For an index $j\in [m]$, $\left(\mathcal{A}\left(Z_{k}Z_{k}^{\dagger}\right)\right)_{j}=\mathrm{Tr}\left(A_{j}Z_{k}Z_{k}^{\dagger}\right)$ requires $O\left(\left(2^{n}\right)^{2}\cdot r\right)$ complexity, and thus computing $\mathcal{A}\left(Z_{k}Z_{k}^{\dagger}\right)-y$ requires $O\left(\left(2^{n}\right)^{2}\cdot r\right)$ complexity, overall. By definition, the adjoint operation $\mathcal{A}^{\dagger}:\mathbb{R}^{m}\to \mathbb{C}^{2^{n}\times 2^{n}}$ satisfies $\mathcal{A}^{\dagger}\left(x\right)=\sum_{i=1}^{m}x_{i}A_{i}$; thus, the operation $\mathcal{A}^{\dagger}\left(\mathcal{A}\left(Z_{k}Z_{k}^{\dagger}\right)-y\right)$ is still dominated by $O\left(\left(2^{n}\right)^{2}\cdot r\right)$ complexity. Finally, we perform one more matrix-matrix multiplication with $Z_{k}$, which results in an additional $O\left(\left(2^{n}\right)^{2}\cdot r\right)$ complexity. The rest of the operations involve adding $2^{n}\times r$ matrices, which does not dominate the overall complexity. Combining the iteration complexity with the per-iteration computational complexity, `MiFGD` has an $O\left(\left(2^{n}\right)^{2}\cdot r\cdot \sqrt{\kappa}\cdot \log\left(\frac{1}{\epsilon}\right)\right)$ overall complexity.
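As a rough worked instance of the per-iteration costs quoted above (constants and lower-order terms ignored, with $n=8$ and $r=1$ chosen purely for illustration):

$$\left(2^{8}\right)^{3}=2^{24}=16{,}777{,}216\qquad \text{versus}\qquad \left(2^{8}\right)^{2}\cdot 1=2^{16}=65{,}536.$$

Under these assumptions, the eigendecomposition-based step is larger by a factor of $2^{n}=256$, and the gap doubles with every additional qubit.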

`MiFGD` has the best dependence on the number of qubits and the ambient dimension of the problem, $2^{n}$; (ii) `MiFGD` applies to cases where `lstsq` is inapplicable; (iii) `MiFGD` has a better iteration complexity than other iterative algorithms, while also having a better polynomial dependency on $2^{n}$.

## Appendix B. Detailed Proof of Theorem 1

**Lemma A1.** For any $W,V\in \mathbb{C}^{d\times r}$, the following holds:

**Lemma A2.** Given a matrix $M$ and $\epsilon>0$, there exists a matrix norm $\|\cdot\|$ such that

**Lemma A3.** Given any matrix norm $\|\cdot\|$, the following holds:

#### Supporting Lemmata

**Lemma A4.**

**Proof.**

**Lemma A5.**

**Proof.**

**Corollary A1.**

**Proof.**

**Corollary A2.**

**Proof.**

**Corollary A3.**

**Proof.**

**Lemma A6.**

**Proof.**

**Lemma A7.**

**Proof.**

**Lemma A8.**

**Proof.**