Article

A More Efficient and Practical Modified Nyström Method

Wei Zhang, Zhe Sun, Jian Liu and Suisheng Chen
1 Fair Friend Institute of Intelligent Manufacturing, Hangzhou Vocational & Technical College, Hangzhou 310018, China
2 Post Industry Technology Research and Development Center of the State Posts Bureau (Internet of Things Technology), Nanjing University of Posts and Telecommunications, Nanjing 210023, China
3 Post Big Data Technology and Application Engineering Research Center of Jiangsu Province, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
4 College of Information Engineering, Nanjing University of Finance and Economics, Nanjing 210023, China
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(11), 2433; https://doi.org/10.3390/math11112433
Submission received: 29 March 2023 / Revised: 17 May 2023 / Accepted: 22 May 2023 / Published: 24 May 2023
(This article belongs to the Special Issue Fuzzy Modeling and Fuzzy Control Systems)

Abstract:
In this paper, we propose an efficient Nyström method with theoretical and empirical guarantees. In parallel computing environments and for sparse input kernel matrices, our algorithm can theoretically match the computational efficiency of the conventional Nyström method. Additionally, we derive an important theoretical result with a more compact sketching matrix and faster speed, at the cost of some accuracy loss, compared to the existing state-of-the-art results. A faster randomized SVD and a more efficient adaptive sampling method are also proposed, which have wide application in many machine-learning and data-mining tasks.

1. Introduction

The Nyström method is a widely used technique for speeding up kernel machines, and its computational efficiency has attracted much attention in the past few years [1,2,3,4,5,6,7,8]. Given a kernel matrix $K \in \mathbb{R}^{n \times n}$, the Nyström method approximates the kernel by random column sampling to save computation. In exchange for this efficiency, it suffers from a relatively large matrix approximation error in real applications [9,10]. Given a target rank k and a target precision parameter $0 < \epsilon \le 1$, Wang and Zhang [4] showed that the conventional Nyström method cannot attain a $1+\epsilon$ bound relative to $\|K - K_k\|_F^2$ unless the number of sampled columns satisfies $c \ge \Omega(\sqrt{nk/\epsilon})$. Here, $K_k$ denotes the best rank-k approximation to the kernel matrix $K$. Several modified Nyström methods have been proposed in recent years [3,4,11,12]. In the work of [11], a modified Nyström method needs only $O(k/\epsilon)$ columns of the kernel matrix to obtain a $1+\epsilon$ bound relative to $\|K - K_k\|_F^2$. To the best of our knowledge, it is the fastest such algorithm, costing $O(nk^2) + T_{\mathrm{Multiply}}(\mathrm{nnz}(K)\log n)$ to achieve a $1+\epsilon$ relative error with respect to $\|K - K_k\|_F^2$, where $\mathrm{nnz}(K)$ denotes the number of non-zero entries of $K$. Although these modified Nyström methods are superior in approximation accuracy, they incur a much higher computational burden than the conventional Nyström method.
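For concreteness, the conventional Nyström construction referred to above can be sketched in a few lines of NumPy: sample c columns uniformly, take the corresponding c × c intersection block W, and approximate K ≈ C W⁺ Cᵀ. This is a minimal illustration (uniform sampling, dense arithmetic, illustrative names), not the method proposed in this paper.

```python
import numpy as np

def conventional_nystrom(K, c, seed=0):
    """Conventional Nystrom: K ~= C @ U @ C.T from c uniformly sampled columns."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    idx = rng.choice(n, size=c, replace=False)   # uniform column sampling
    C = K[:, idx]                                # n x c block of sampled columns
    W = K[np.ix_(idx, idx)]                      # c x c intersection block
    return C, np.linalg.pinv(W)                  # intersection matrix U = W^+

# Toy usage: RBF kernel on random points, then the relative Frobenius error
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)
C, U = conventional_nystrom(K, c=50)
err = np.linalg.norm(K - C @ U @ C.T, "fro") / np.linalg.norm(K, "fro")
```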
In this paper, we propose a much faster modified Nyström method, which runs in $O(n^{1/2}k^3/\epsilon^{5/2}) + T_{\mathrm{Multiply}}(O(\mathrm{nnz}(K)\log n))$ time to achieve a $1+\epsilon$ bound relative to $\|K - K_k\|_F^2$. When $\epsilon > 1/2$, our algorithm is further accelerated to
$$O(k^3) + T_{\mathrm{Multiply}}(O(\mathrm{nnz}(K)\log n)),$$
which is guaranteed by Lemma 3. Our algorithm is given in Algorithm 3. It needs $T_{\mathrm{Multiply}}(O(\mathrm{nnz}(A)\log n))$ time for matrix multiplication, which is easily implemented in parallel; the computational complexity of the matrix multiplications in Algorithm 3 is near linear in the input sparsity. In addition, for the arithmetic operations that are hard to parallelize, such as SVD, pseudoinverse and QR decomposition, Algorithm 3 needs $O(n^{1/2}k^3/\epsilon^{5/2})$ time, which is sublinear in the input size n. At the cost of some accuracy, this can be reduced to $O(k^3)$, matching the conventional Nyström method, which also needs $O(k^3)$ arithmetic operations when sampling $O(k)$ columns. Our empirical studies further validate the efficiency of our algorithm.
In this paper, we improve several key algorithms which together constitute a faster modified Nyström method. We summarize our contributions as follows.
  • First and most importantly, we propose an efficient modified Nyström method with theoretical guarantees.
  • Second, a more computationally efficient adaptive sampling method is proposed in Lemma 2. Adaptive sampling is a cornerstone of column selection, CUR decomposition and the Nyström method [4,5,11,13], and it is also very popular in other matrix problems [14].
  • Finally, our proposed practical Nyström method can achieve computation efficiency in real applications, as shown by our experiments.
The rest of this paper is structured as follows. In Section 2, we introduce the notation used in this study. In Section 3, several key algorithms that constitute the modified Nyström method are improved. Section 4 presents our modified Nyström method. We conduct an empirical analysis and comparison in Section 5 and conclude our work in Section 6. All detailed proofs are omitted except for the computational complexity analysis.

2. Notation and Preliminaries [15]

First, we introduce the notation and concepts that will be used here and hereafter. $\mathbf{I}_m$ denotes the $m \times m$ identity matrix; sometimes we simply write $\mathbf{I}$. We also use $\mathbf{0}$ to denote a zero vector or zero matrix of the appropriate size. The number of non-zero entries of $A$ is denoted by $\mathrm{nnz}(A)$.
Let $k \le \rho$, where $\rho = \mathrm{rank}(A) \le \min\{m, n\}$. The singular value decomposition (SVD) of $A$ can be written as
$$A = \sum_{i=1}^{\rho} \sigma_i u_i v_i^T = \begin{bmatrix} U_k & U_{k\perp} \end{bmatrix} \begin{bmatrix} \Sigma_k & 0 \\ 0 & \Sigma_{k\perp} \end{bmatrix} \begin{bmatrix} V_k^T \\ V_{k\perp}^T \end{bmatrix},$$
where $U_k$ ($m \times k$), $V_k$ ($n \times k$) and $\Sigma_k$ ($k \times k$) correspond to the top k singular values. The best (or closest) rank-k approximation to $A$ is denoted by $A_k = U_k \Sigma_k V_k^T$, and $\sigma_i = \sigma_i(A)$ denotes the i-th largest singular value of $A$. When $A$ is symmetric positive semi-definite (SPSD), the SVD coincides with the eigenvalue decomposition, in which case $U_A = V_A$.
Furthermore, let $A^{\dagger} = V_\rho \Sigma_\rho^{-1} U_\rho^T$ denote the Moore–Penrose inverse of $A$. When $A$ is non-singular, the Moore–Penrose inverse coincides with the matrix inverse.
The matrix norms are defined as follows: the spectral norm is $\|A\|_2 = \max_{x \in \mathbb{R}^n, \|x\|_2 = 1} \|Ax\|_2 = \sigma_1$, and the Frobenius norm is $\|A\|_F = \big(\sum_{i,j} a_{ij}^2\big)^{1/2} = \big(\sum_i \sigma_i^2\big)^{1/2}$.
Given matrices $A \in \mathbb{R}^{m \times n}$ and $C \in \mathbb{R}^{m \times r}$ with $r > k$, we define $\Pi_{C,k}^{\zeta}(A)$ as the best approximation to $A$ within the column space of $C$ with rank at most k; that is, $\Pi_{C,k}^{\zeta}(A)$ minimizes the residual $\|A - \hat{A}\|_{\zeta}$ over all $\hat{A}$ in the column space of $C$ with rank at most k. Here, ζ denotes either the spectral norm or the Frobenius norm.
Given three matrices $A \in \mathbb{R}^{m \times n}$, $X \in \mathbb{R}^{m \times p}$ and $Y \in \mathbb{R}^{q \times n}$, the projection of $A$ onto the column space of $X$ is $XX^{\dagger}A = U_X U_X^T A \in \mathbb{R}^{m \times n}$, and the projection of $A$ onto the row space of $Y$ is $AY^{\dagger}Y = A V_Y V_Y^T \in \mathbb{R}^{m \times n}$.
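As a purely illustrative check of the projection notation above, the following NumPy lines verify that $XX^{\dagger}A = U_XU_X^TA$ and $AY^{\dagger}Y = AV_YV_Y^T$ on small random matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 5))
X = rng.normal(size=(6, 3))
Y = rng.normal(size=(2, 5))

# Projection of A onto the column space of X:  X X^+ A = U_X U_X^T A
U_X, _, _ = np.linalg.svd(X, full_matrices=False)
proj_cols = U_X @ (U_X.T @ A)
assert np.allclose(proj_cols, X @ np.linalg.pinv(X) @ A)

# Projection of A onto the row space of Y:  A Y^+ Y = A V_Y V_Y^T
_, _, VT_Y = np.linalg.svd(Y, full_matrices=False)
proj_rows = (A @ VT_Y.T) @ VT_Y
assert np.allclose(proj_rows, A @ np.linalg.pinv(Y) @ Y)
```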
We now give the definitions of leverage score sampling and subspace embedding, which are the key tools used to construct our Nyström algorithm.
Definition 1
(Leverage score sampling, [13,15]). Let $V \in \mathbb{R}^{n \times k}$ be column orthonormal with $n > k$, and let $v_{i,*}$ denote the i-th row of $V$. Let $\ell_i = \|v_{i,*}\|_2^2 / k$; the $\ell_i$ are the (normalized) leverage scores. Let r be an integer with $1 \le r \le n$. Construct the sampling matrix $\Omega \in \mathbb{R}^{n \times r}$ and the rescaling matrix $D \in \mathbb{R}^{r \times r}$ as follows: for each column $j = 1, \ldots, r$ of $\Omega$ and $D$, independently and with replacement, pick an index i from $\{1, 2, \ldots, n\}$ with probability $\ell_i$, and set $\Omega_{ij} = 1$ and $D_{jj} = 1/\sqrt{\ell_i r}$. The number of operations required by this procedure is $O(nk + n)$. This procedure is designated as
$$[\Omega, D] = \mathrm{LeverageScoreSampling}(V, r).$$
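A minimal sketch of the sampling procedure in Definition 1, assuming the rescaling $D_{jj} = 1/\sqrt{\ell_i r}$ given above; all names are illustrative.

```python
import numpy as np

def leverage_score_sampling(V, r, seed=0):
    """Sketch of Definition 1: sample r indices of a column-orthonormal n x k
    matrix V according to its leverage scores; return the sampling matrix Omega
    and the rescaling matrix D."""
    rng = np.random.default_rng(seed)
    n, k = V.shape
    ell = (V ** 2).sum(axis=1) / k            # normalized leverage scores, sum to 1
    picks = rng.choice(n, size=r, p=ell)      # r i.i.d. draws with replacement
    Omega = np.zeros((n, r))
    D = np.zeros((r, r))
    for j, i in enumerate(picks):
        Omega[i, j] = 1.0                     # select index i in trial j
        D[j, j] = 1.0 / np.sqrt(ell[i] * r)   # rescale so the sketch is unbiased
    return Omega, D
```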
Definition 2
([16]). Given $\varepsilon > 0$ and $\delta > 0$, let Π be a distribution over $\ell \times n$ matrices $S$, where ℓ depends on n, d, ε and δ. Suppose that, for any given $n \times d$ matrix $A$, a matrix $S$ drawn from Π is, with probability at least $1 - \delta$, a $(1+\varepsilon)$ $\ell_2$-subspace embedding for $A$; that is, for every $x \in \mathbb{R}^d$, $\|SAx\|_2^2 = (1 \pm \varepsilon)\|Ax\|_2^2$. We then call Π an $(\varepsilon, \delta)$-oblivious $\ell_2$-subspace embedding.
The sparse subspace embedding matrix $S$ and the subsampled Hadamard matrix $H$ are the two most popular subspace embedding matrices. For an $n \times k$ matrix $A$ spanning a k-dimensional subspace, a sparse subspace embedding matrix $S$ for $A$ can be constructed with $m = O(k^2/\epsilon^2)$ rows, and a subsampled Hadamard matrix $H$ with $m = O(k \log k)/\epsilon^2$ rows [16]. Composing $S$ with $H$ preserves the subspace embedding property.
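One standard realization of a sparse subspace embedding is the CountSketch construction, in which every coordinate is hashed to a single row with a random sign, so that applying it costs $O(\mathrm{nnz}(A))$. The snippet below is a generic sketch of this construction (not the exact matrices or constants used later in our algorithms).

```python
import numpy as np
import scipy.sparse as sp

def countsketch(n, m, seed=0):
    """Return an m x n CountSketch matrix S with one +-1 entry per column."""
    rng = np.random.default_rng(seed)
    rows = rng.integers(0, m, size=n)            # hash each coordinate to a row
    signs = rng.choice([-1.0, 1.0], size=n)      # random sign per coordinate
    return sp.csr_matrix((signs, (rows, np.arange(n))), shape=(m, n))

# Applying S costs O(nnz(A)) because S has a single nonzero per column.
A = np.random.default_rng(1).normal(size=(5000, 10))
S = countsketch(5000, m=400)                     # m ~ O(d^2) rows for a d-dimensional subspace
SA = S @ A                                       # sketched matrix
```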
Let us now discuss the computational costs of the matrix operations mentioned above. Matrix multiplication is an intrinsically parallel operation and can therefore be implemented efficiently in parallel, as many mathematical software packages do; in contrast, SVD and QR decompositions are much harder to parallelize. We denote the time complexity of a matrix multiplication by $T_{\mathrm{Multiply}}$. For a general $m \times n$ matrix $A$ with $m \ge n$, computing the full SVD requires $O(mn^2)$ flops, whereas computing the truncated SVD of rank k ($k < n$) requires $O(mnk)$ flops; computing $A^{\dagger}$ also requires $O(mn^2)$ flops. Given an $m \times m$ Hadamard–Walsh transform matrix $H$, the transform $HA$ costs $T_{\mathrm{Multiply}}(\tilde{O}(mn))$, which is substantially quicker than the $T_{\mathrm{Multiply}}(O(m^2 n))$ cost of ordinary matrix multiplication. For a sparse subspace embedding matrix $S$ and an $n \times d$ matrix $A$, computing $SA$ needs $T_{\mathrm{Multiply}}(O(\mathrm{nnz}(A)))$ arithmetic operations.

3. Main Lemmas and Theorems

In this part, we will outline our principal theorems and lemmas, which are the key tools to implement Algorithm 3. In addition, these lemmas and theorems are of independent interest and have wide application.
First, we present a fast randomized SVD method, depicted in Algorithm 1, which is, to the best of our knowledge, the fastest randomized SVD method available.
Lemma 1.
Given a matrix $A \in \mathbb{R}^{m \times n}$, a target rank k and an error parameter $0 < \epsilon \le 1$, let $Z$ be returned by Algorithm 1. Then the following bound holds with high probability:
$$\|A - ZZ^T A\|_F^2 \le (1+\epsilon)\,\|A - A_k\|_F^2.$$
In addition, $Z$ can be computed in $\tilde{O}(k^3/\epsilon^5) + T_{\mathrm{Multiply}}\big(O(\mathrm{nnz}(A)) + \tilde{O}(mk^2/\epsilon^4 + k^3/\epsilon^3)\big)$ time. We denote Algorithm 1 by
$$Z = \mathrm{SparseSVD}(A, k, \epsilon).$$
Algorithm 1 Sparse SVD
1: Input: a real matrix $A \in \mathbb{R}^{m \times n}$, error parameter ϵ and target rank k;
2: Compute $AR^T$, where $R = \Pi S \in \mathbb{R}^{c \times n}$ with $c = O(k \log k/\epsilon)$; here $S \in \mathbb{R}^{s \times n}$ is a sparse subspace embedding matrix with $s = O(k^2 + k/\epsilon)$ and $\Pi \in \mathbb{R}^{c \times s}$ is a subsampled randomized Hadamard matrix;
3: Compute an orthonormal basis $U$ for $AR^T$ by $U = AR^T C^{-1}$, where $C$ is the Cholesky factor of $RA^T A R^T$;
4: Compute $\Gamma = U^T A W^T \in \mathbb{R}^{c \times d}$, where $W = HF \in \mathbb{R}^{d \times n}$ with $d = O(k \log k/\epsilon^3)$; here $F \in \mathbb{R}^{t \times n}$ is a sparse subspace embedding matrix with $t = O(k^2 \log^2 k/\epsilon^3)$ and $H \in \mathbb{R}^{d \times t}$ is a subsampled randomized Hadamard matrix;
5: Compute the SVD of Γ and let $\Delta \in \mathbb{R}^{c \times k}$ contain the top k left singular vectors of Γ;
6: Output: $Z = U\Delta$.
Proof. 
Lemma A2 shows that $\|A - UU^T A\|_F^2 \le (1+\epsilon)\|A - A_k\|_F^2$, where $U$ has $O(k \log k/\epsilon)$ columns. Applying Lemma A1 with $V$ replaced by $U$, we obtain
$$\|A - ZZ^T A\|_F^2 \le (1+\epsilon)\,\|A - A_k\|_F^2.$$
For the running time, computing $AR^T$ takes $T_{\mathrm{Multiply}}(O(\mathrm{nnz}(A)) + \tilde{O}(mk(k + \epsilon^{-1})))$, and computing $U = (AR^T)C^{-1}$, where $C$ is the Cholesky factor of $(AR^T)^T(AR^T)$, takes $T_{\mathrm{Multiply}}(\tilde{O}(mk^2/\epsilon^2 + k^3/\epsilon^3))$. Computing $U^T(AW^T)$ requires $T_{\mathrm{Multiply}}(O(\mathrm{nnz}(A)) + \tilde{O}(mk^2/\epsilon^3 + mk^2/\epsilon^4))$. Computing the SVD of Γ requires $\tilde{O}(k^3/\epsilon^5)$, and computing $Z = U\Delta$ requires $T_{\mathrm{Multiply}}(\tilde{O}(mk^2/\epsilon^3))$. Hence, Algorithm 1 takes
$$\tilde{O}(k^3/\epsilon^5) + T_{\mathrm{Multiply}}\big(O(\mathrm{nnz}(A)) + \tilde{O}(mk^2/\epsilon^4 + k^3/\epsilon^3)\big)$$
time in total. □
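To make the structure of Algorithm 1 concrete, the following simplified sketch replaces the composed sparse/Hadamard sketches with plain Gaussian matrices (so the stated complexities and constants do not apply); it only mirrors the two-sided sketching pattern: sketch the range of A, orthonormalize, sketch the small problem, and keep the top-k directions.

```python
import numpy as np

def sketched_rank_k_basis(A, k, oversample=10, seed=0):
    """Simplified analogue of Algorithm 1: return an m x k matrix Z with
    orthonormal columns such that Z @ Z.T @ A approximates A_k. Gaussian
    sketches stand in for the sparse/Hadamard sketches of the paper."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    c = k + oversample
    R = rng.normal(size=(c, n))                   # right sketch (plays the role of Pi S)
    U, _ = np.linalg.qr(A @ R.T)                  # orthonormal basis of the sketched range
    W = rng.normal(size=(4 * c, n))               # second sketch for the small problem
    Gamma = (U.T @ A) @ W.T                       # small c x 4c core matrix
    Delta, _, _ = np.linalg.svd(Gamma, full_matrices=False)
    return U @ Delta[:, :k]                       # Z = U times top-k left singular vectors
```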
A faster adaptive sampling procedure, Algorithm 2, is developed based on the work of [13]. Boutsidis and Woodruff [13] compute the norms of the columns of $GB = GA - GC_1 C_1^{\dagger}A$. To further reduce the computation cost, we introduce the sketched $G\hat{B} = GA - GC_1(RC_1)^{\dagger}(RA)$ to approximate $GB$; with this sketching, $GC_1(RC_1)^{\dagger}(RA)$ can be computed more efficiently than $GC_1 C_1^{\dagger}A$.
Algorithm 2 Adaptive Sampling
1: Input: a real matrix $A \in \mathbb{R}^{m \times n}$, $C_1 \in \mathbb{R}^{m \times c_1}$ and the number of selected columns c;
2: Construct $\hat{B} = A - C_1(RC_1)^{\dagger}(RA)$, where $R = \Pi S \in \mathbb{R}^{t \times m}$ with $t = 2c_1 \log c_1$; here $S \in \mathbb{R}^{s \times m}$ is a sparse subspace embedding matrix with $s = c_1^2 + 2c_1$ and $\Pi \in \mathbb{R}^{t \times s}$ is a subsampled randomized Hadamard matrix;
3: Construct $\tilde{B} = G\hat{B}$, where $G \in \mathbb{R}^{g \times m}$ is a normalized Gaussian matrix with $g = 9 \log n$;
4: Compute the sampling probabilities $p_j = \|\tilde{b}_j\|_F^2 / \|\tilde{B}\|_F^2$ for $j = 1, \ldots, n$, where $\tilde{b}_j$ is the j-th column of $\tilde{B}$;
5: Output: obtain $C_2$ by selecting c columns of $A$ in c i.i.d. trials, where in each trial the index j is chosen with probability $p_j$.
Lemma 2.
Given $A \in \mathbb{R}^{m \times n}$, $C_1 \in \mathbb{R}^{m \times c_1}$ and $V \in \mathbb{R}^{r \times n}$ such that $\mathrm{rank}(V) = \mathrm{rank}(AV^{\dagger}V) = \rho$ with $\rho \le c \le n$, let $C_2 \in \mathbb{R}^{m \times c_2}$ be returned by Algorithm 2, containing $c_2$ columns of $A$. Then, for any integer $k > 0$, the matrix $C = [C_1, C_2] \in \mathbb{R}^{m \times (c_1 + c_2)}$ satisfies, with probability at least 0.9,
$$\|A - CC^{\dagger}AV^{\dagger}V\|_F^2 \le \|A - AV^{\dagger}V\|_F^2 + \frac{40\rho}{c_2}\,\|A - C_1C_1^{\dagger}A\|_F^2.$$
In addition, this randomized algorithm can be implemented in
$$\tilde{O}(c_1^3) + T_{\mathrm{Multiply}}\big(O(\mathrm{nnz}(A)\log n) + \tilde{O}(nc_1^2 + nc_1\log n + c_1^3)\big)$$
computation time. We denote this randomized algorithm by
$$C_2 = \mathrm{AdaptiveSampling}(A, V, C_1, c_2).$$
Proof. 
Let $B = A - C_1C_1^{\dagger}A$ be the residual matrix and let $b_i$ be the i-th column of $B$. By Theorem A4, with high probability it holds that
$$\|B\|_F^2 \le \|\hat{B}\|_F^2 \le (1+2\epsilon)\|B\|_F^2 = 2\|B\|_F^2, \qquad \|b_i\|_F^2 \le \|\hat{b}_i\|_F^2 \le (1+2\epsilon)\|b_i\|_F^2 = 2\|b_i\|_F^2.$$
Besides, by the JL property of $G$, we have $\sqrt{2/3}\,\|\hat{b}_i\|_F \le \|\tilde{b}_i\|_F \le \sqrt{4/3}\,\|\hat{b}_i\|_F$. Hence, the sampling distribution satisfies
$$p_i = \frac{\|\tilde{b}_i\|_F^2}{\|\tilde{B}\|_F^2} \ge \frac{2}{3}\cdot\frac{3}{4}\cdot\frac{\|\hat{b}_i\|_F^2}{\|\hat{B}\|_F^2} \ge \frac{1}{2}\cdot\frac{1}{2}\cdot\frac{\|b_i\|_F^2}{\|B\|_F^2} = \frac{1}{4}\,\frac{\|b_i\|_F^2}{\|B\|_F^2}.$$
Using Lemma A3, we obtain
$$\mathbb{E}\,\|A - CC^{\dagger}AV^{\dagger}V\|_F^2 \le \|A - AV^{\dagger}V\|_F^2 + \frac{4\rho}{c_2}\,\|A - C_1C_1^{\dagger}A\|_F^2.$$
Using the Markov inequality, we have that
$$\|A - CC^{\dagger}AV^{\dagger}V\|_F^2 \le \|A - AV^{\dagger}V\|_F^2 + \frac{40\rho}{c_2}\,\|A - C_1C_1^{\dagger}A\|_F^2$$
holds with probability at least 0.9.
As for the running time, it needs $T_{\mathrm{Multiply}}(O(\mathrm{nnz}(A)) + \tilde{O}(nc_1^2))$ arithmetic operations to compute $RA$, and computing $RC_1$ costs $T_{\mathrm{Multiply}}(O(\mathrm{nnz}(C_1)) + \tilde{O}(c_1^3))$. Computing $(RC_1)^{\dagger}$ requires $\tilde{O}(c_1^3)$. In addition, computing $GA$ and $GC_1$ requires $T_{\mathrm{Multiply}}(O(\mathrm{nnz}(A)\log n))$ and $T_{\mathrm{Multiply}}(O(mc_1\log n))$, respectively, and computing $(GC_1)(RC_1)^{\dagger}(RA)$ needs $T_{\mathrm{Multiply}}(\tilde{O}(nc_1\log n + c_1^2\log n))$ computation. Finally, forming $GA - GC_1(RC_1)^{\dagger}RA$ needs another $T_{\mathrm{Multiply}}(O(n\log n))$ arithmetic operations. Thus, the total cost is $\tilde{O}(c_1^3) + T_{\mathrm{Multiply}}\big(O(\mathrm{nnz}(A)\log n) + \tilde{O}(nc_1^2 + nc_1\log n + c_1^3)\big)$. □
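A minimal NumPy sketch of Algorithm 2, with the structured sketch R and the Gaussian matrix G replaced by plain Gaussian matrices and with illustrative constants; it reproduces the residual-norm sampling idea rather than the exact construction.

```python
import numpy as np

def adaptive_sampling(A, C1, c, seed=0):
    """Sketch of Algorithm 2: sample c further columns of A with probabilities
    proportional to the column norms of the sketched residual
    B_hat = A - C1 (R C1)^+ (R A). R and G are Gaussian here for brevity."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    c1 = C1.shape[1]
    t = max(int(2 * c1 * np.log(max(c1, 2))), c1 + 1)
    R = rng.normal(size=(t, m)) / np.sqrt(t)                 # row sketch
    B_hat = A - C1 @ (np.linalg.pinv(R @ C1) @ (R @ A))      # sketched residual
    g = int(9 * np.log(n)) + 1
    G = rng.normal(size=(g, m)) / np.sqrt(g)                 # JL sketch of the rows
    B_tilde = G @ B_hat
    p = (B_tilde ** 2).sum(axis=0)
    p = p / p.sum()                                          # sampling probabilities p_j
    idx = rng.choice(n, size=c, p=p)                         # c i.i.d. trials
    return A[:, idx]
```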
Lemma 3
([15,17]). Given matrices $C \in \mathbb{R}^{m \times c}$, $A \in \mathbb{R}^{m \times n}$ and $R \in \mathbb{R}^{r \times n}$, suppose that $S$ is the leverage-score sketching matrix of $C$ with $s = O(c/\epsilon + c\log c)$ rows, and $T$ is the leverage-score sketching matrix of $R$ with $t = O(r/\epsilon + r\log r)$ columns. Let
$$U^{\star} = C^{\dagger}AR^{\dagger} = \operatorname*{argmin}_U \|A - CUR\|_F$$
and
$$\hat{U} = (SC)^{\dagger}\,SAT\,(RT)^{\dagger};$$
then we obtain
$$\|A - C\hat{U}R\|_F \le (1+\epsilon)\,\|A - CU^{\star}R\|_F.$$
The number of sampled rows in Lemma 3 is independent of the input dimension of $A$ and is linear in c. By giving up some accuracy, a much faster algorithm can thus be implemented.
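To illustrate how Lemma 3 is used, the sketched intersection matrix Û can be formed from a much smaller subproblem. The sketch below uses plain uniform row/column subsampling in place of leverage-score sketching, so it only conveys the structure of the estimator.

```python
import numpy as np

def sketched_intersection(A, C, R, s, t, seed=0):
    """Sketch of Lemma 3: approximate U* = C^+ A R^+ by the small problem
    U_hat = (S C)^+ (S A T) (R T)^+. Here S and T are uniform row/column
    subsamples; the paper uses leverage-score sketching matrices."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    rows = rng.choice(m, size=s, replace=False)   # plays the role of S
    cols = rng.choice(n, size=t, replace=False)   # plays the role of T
    SC = C[rows, :]                               # S C
    SAT = A[np.ix_(rows, cols)]                   # S A T
    RT = R[:, cols]                               # R T
    return np.linalg.pinv(SC) @ SAT @ np.linalg.pinv(RT)
```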

4. Practical Modified Nyström Method

We use our new lemmas and theorems developed in Section 3 to implement an efficient modified Nyström algorithm.

4.1. Description of The Algorithm

Algorithm 3 takes as input an $n \times n$ real symmetric matrix $A$, an error parameter $0 < \epsilon < 1$ and a target rank k. Its outputs are a matrix $C \in \mathbb{R}^{n \times c}$ consisting of $c = O(k/\epsilon + k\log k)$ columns of $A$, and a matrix $U \in \mathbb{R}^{c \times c}$. There are primarily three steps in Algorithm 3: (i) it samples a number of columns of $A$ by leverage score sampling to obtain $C_1$, and uses the adaptive sampling method to obtain $C_2$ and $C_3$; (ii) it calculates the leverage scores of $C$ using the method in [18]; and (iii) it constructs the intersection matrix $U$. Note that $\hat{U}$ in Lemma 3 is asymmetric even when $A$ is positive semi-definite; thus, when applied to kernel approximation, we need to construct a positive semi-definite $U$, as shown in Algorithm 3.
Algorithm 3 Practical Nyström
1: Input: a real symmetric matrix $A \in \mathbb{R}^{n \times n}$, error parameter ϵ and target rank k;
2: $Z = \mathrm{SparseSVD}(A, k, 1)$;
3: $[\Omega, \Gamma] = \mathrm{LeverageScoreSampling}(Z, O(k\log k))$ and construct $C_1 = A\Omega$;
4: $C_2 = \mathrm{AdaptiveSampling}(A, V_k^T, C_1, O(k/\epsilon))$ and $C_3 = \mathrm{AdaptiveSampling}(A, V_k^T, C_1, O(k/\epsilon))$, constructing $C = [C_1, C_2, C_3] \in \mathbb{R}^{n \times O(k/\epsilon + k\log k)}$;
5: Compute approximate leverage scores of $C$ using the method of [18] and construct the leverage-score sketching matrices $S_1$ and $S_2$ of size $n \times s$, where $s = O(c/\epsilon + c\log c)$;
6: Compute $\hat{U} = (S_1^T C)^{\dagger}\,(S_1^T A S_2)\,(C^T S_2)^{\dagger}$;
7: Compute $U = \Pi_{\mathcal{H}^{+}_{s}}(\hat{U})$ by conducting the eigenvalue decomposition of $\tilde{U} = (\hat{U} + \hat{U}^T)/2$ and setting its negative eigenvalues to zero;
8: Output: $C$ and $U$.
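Putting the pieces together, a structural end-to-end sketch of Algorithm 3 might look as follows. All sketching matrices are simplified (a Gaussian range finder instead of SparseSVD, exact residual norms instead of Algorithm 2's sketched residual, uniform row subsampling for S1 and S2), so it mirrors the steps of the algorithm but not its complexity or error guarantees.

```python
import numpy as np

def practical_nystrom(A, k, eps, seed=0):
    """Structural sketch of Algorithm 3 with simplified sketches."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]

    # Steps 2-3: rough top-k basis Z, then leverage-score sampling of C1
    Z, _ = np.linalg.qr(A @ rng.normal(size=(n, k + 10)))
    lev = (Z ** 2).sum(axis=1) / Z.shape[1]                  # leverage scores of Z
    c1 = int(k * np.log(k + 1)) + k
    C1 = A[:, rng.choice(n, size=c1, p=lev)]

    # Step 4: adaptively sample two further batches of O(k/eps) columns
    B = A - C1 @ (np.linalg.pinv(C1) @ A)                    # residual (exact, for brevity)
    p = (B ** 2).sum(axis=0)
    p = p / p.sum()
    c2 = int(np.ceil(k / eps))
    C = np.hstack([C1,
                   A[:, rng.choice(n, size=c2, p=p)],
                   A[:, rng.choice(n, size=c2, p=p)]])

    # Steps 5-6: sketched intersection matrix on uniformly sampled rows
    c = C.shape[1]
    s = min(n, 4 * c)
    r1 = rng.choice(n, size=s, replace=False)
    r2 = rng.choice(n, size=s, replace=False)
    U_hat = np.linalg.pinv(C[r1]) @ A[np.ix_(r1, r2)] @ np.linalg.pinv(C[r2]).T

    # Step 7: symmetrize and project onto the PSD cone
    w, V = np.linalg.eigh((U_hat + U_hat.T) / 2.0)
    U = (V * np.clip(w, 0.0, None)) @ V.T
    return C, U
```

For a quick check, `C, U = practical_nystrom(K, k=10, eps=0.5)` can be run on any SPSD kernel matrix K of moderate size, and the relative error $\|K - CUC^T\|_F/\|K\|_F$ compared against the figures in Section 5.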

4.2. Analysis of Running-Time

Here, we provide a detailed analysis of the arithmetic operations of Algorithm 3.
1. Finding the $O(k/\epsilon + k\log k)$ columns of $A$ that form $C$ costs $\tilde{O}(k^3) + T_{\mathrm{Multiply}}\big(O(\mathrm{nnz}(A)\log n) + \tilde{O}(nk^2 + nk\log n + k^3)\big)$:
(a) obtaining $Z \in \mathbb{R}^{n \times k}$ via Lemma 1 takes $\tilde{O}(k^3) + T_{\mathrm{Multiply}}\big(O(\mathrm{nnz}(A)) + \tilde{O}(nk^2 + k^3)\big)$;
(b) obtaining the leverage scores and sampling $C_1$ and $C_2$ takes $T_{\mathrm{Multiply}}(O(nk))$;
(c) constructing $C_3$ via Lemma 2 takes $\tilde{O}(k^3) + T_{\mathrm{Multiply}}\big(O(\mathrm{nnz}(A)\log n) + \tilde{O}(nk^2 + nk\log n + k^3)\big)$.
2. Constructing $U$ costs $O(k^3/\epsilon^4) + T_{\mathrm{Multiply}}\big(O(nk^2/\epsilon^2) + \tilde{O}(nk^2 + k^3/\epsilon^5)\big)$ when $s = O(c/\epsilon + c\log c)$ is the sketch size of $S_1$ and $S_2$ in Algorithm 3:
(a) obtaining the leverage scores of $C$ takes $O(k^3/\epsilon^3) + T_{\mathrm{Multiply}}\big(O(n(k/\epsilon)^2) + \tilde{O}(nk^2)\big)$;
(b) computing $(S_1^T C)^{\dagger}$ and $(S_2^T C)^{\dagger}$ takes $\tilde{O}(k^3/\epsilon^4)$;
(c) the matrix multiplications take $T_{\mathrm{Multiply}}(O(k^3/\epsilon^5))$;
(d) computing the eigenvalue decomposition of $\tilde{U}$ takes $\tilde{O}(k^3/\epsilon^3)$.
The algorithm's overall asymptotic arithmetic cost is
$$T_{\mathrm{Multiply}}\big(O(\mathrm{nnz}(A)\log n + nk^2/\epsilon^2 + k^3/\epsilon^5) + \tilde{O}(nk^2 + nk\log n + k^3/\epsilon^4)\big).$$

4.3. Error Bound

The primary approximation result regarding Algorithm 3 is given in the following theorem.
Theorem 1.
Given an error parameter ϵ and a target rank k, run Algorithm 3. Then the following inequality holds with high probability:
$$\|A - CUC^T\|_F \le (1+\epsilon)\,\|A - A_k\|_F.$$

5. Empirical Study

In this section, we compare our Practical Nyström algorithm with the uniform+adaptive algorithm [11,19], the near-optimal+adaptive algorithm [4,11,13] and the conventional Nyström method using uniform sampling. All algorithms were implemented in Matlab, and the experiments were conducted on a workstation with 32 cores at 2 GHz and 24 GB of RAM.
On each data set, we report the approximation error and the execution time of each algorithm. The approximation error is defined as
$$\text{Approximation Error} = \frac{\|A - CUC^T\|_F}{\|A\|_F},$$
where $U$ is the intersection matrix defined in the Nyström method.
We test all the algorithms on three data sets, summarized in Table 1. For each data set we create an RBF kernel matrix $A$ with $a_{ij} = \exp\big(-\frac{\|x_i - x_j\|_2^2}{2\gamma^2}\big)$, where $x_i$ and $x_j$ are data instances and γ is the parameter of the RBF kernel function. By the definition of $A$, the size n of $A$ equals the number of instances in the data set; thus, the kernel matrices in our experiments are large. We set a different value of γ for each data set, as described in Table 1; however, the effectiveness of our algorithm does not depend on the choice of γ. For each data set, we set $k = 10$, 30 and 50, and sampled $c = ak$ columns from $A$ with a ranging from 8 to 26. We ran each algorithm 5 times and report the average approximation error and running time. All results are illustrated in Figure 1, Figure 2 and Figure 3.
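For reproducibility, the kernel construction and the reported error metric can be computed as in the sketch below (data loading omitted; the rows of X are assumed to be the data instances x_i).

```python
import numpy as np

def rbf_kernel(X, gamma):
    """a_ij = exp(-||x_i - x_j||_2^2 / (2 * gamma^2)) for rows x_i of X."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)    # pairwise squared distances
    return np.exp(-np.clip(d2, 0.0, None) / (2.0 * gamma ** 2))

def approximation_error(A, C, U):
    """||A - C U C^T||_F / ||A||_F, the metric reported in the figures."""
    return np.linalg.norm(A - C @ U @ C.T, "fro") / np.linalg.norm(A, "fro")
```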
As evidenced by the empirical results in the figures, our approach is efficient. In terms of accuracy, our approach is comparable to the state-of-the-art near-optimal+adaptive algorithm [4,11,13]. As to the running time, our approach is much faster than both the near-optimal+adaptive and the uniform+adaptive algorithms, and its running time also grows more slowly than theirs. The running-time advantage of our algorithm grows as the dimension of the kernel matrix $A$ increases: on the kernel matrix of size $7494 \times 7494$ computed from the 'PenDigits' data set, our algorithm is twice as fast as the near-optimal+adaptive algorithm, and on the 'a9a' data set with 32,561 instances it is four times faster. In addition, as c increases, the running-time superiority of our algorithm also increases. Our algorithm has a similar advantage over the uniform+adaptive algorithm. Hence, our algorithm is suitable for scaling to kernel matrices of high dimensions.

6. Conclusions

In this paper, we proposed an efficient modified Nyström method with theoretical and empirical guarantees. In a highly parallel computing environment with sparse input matrices, our Nyström method can theoretically achieve computational efficiency comparable to the conventional Nyström method; hence, it is suitable for machine-learning algorithms in big-data settings. In addition, we give a sketched generalized matrix approximation that extends the previous work [12]. A faster randomized SVD and a more efficient adaptive sampling method are proposed, which have wide application in many areas. Moreover, our modified Nyström algorithm can be easily extended to CUR decomposition, leading to a more efficient CUR decomposition.

Author Contributions

Conceptualization, W.Z. and Z.S.; methodology, W.Z. and J.L.; software, W.Z.; validation, S.C. and Z.S.; writing—original draft preparation, W.Z.; writing—review and editing, S.C.; visualization, J.L.; funding acquisition, Z.S. and S.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Hangzhou Key Scientific Research Program of China (No. 20212013B06), Zhejiang Provincial Natural Science Foundation of China (No. LGG21E050005), National Natural Science Foundation of China (No. 61972208, No. 62272239 and No. 62022044) and National Natural Science Foundation of Jiangsu Province (No. BK20201043).

Acknowledgments

The authors would like to thank Professor An Yang, from Hangzhou Vocational & Technical College, for her valuable suggestions and guidance throughout this research.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Key Theorems Used in Our Proofs

Theorem A1
([15,20]). There is $t = \Theta(\epsilon^{-2})$ such that, for a matrix $A \in \mathbb{R}^{m \times n}$ and a column-orthonormal $U \in \mathbb{R}^{m \times k}$, a $t \times m$ leverage-score sketching matrix $S$ for $U$ satisfies
$$\Pr\big[\|A^T S^T S U - A^T U\|_F^2 < \epsilon^2\,\|A\|_F^2\,\|U\|_F^2\big] \ge 1 - \delta,$$
for any fixed $\delta > 0$.
Theorem A2
([15,20]). There is $t = O(k\epsilon^{-2}\log k)$ such that, for any rank-k matrix $A \in \mathbb{R}^{m \times n}$ with row leverage scores, the leverage-score sketching matrix $S \in \mathbb{R}^{t \times m}$ is an ϵ-embedding matrix for $A$, i.e., for every x,
$$\|SAx\|_2^2 = (1 \pm \epsilon)\,\|Ax\|_2^2.$$
Theorem A3
([15,20]). Let $A$ be a matrix with m rows and let $C$ be a matrix with m rows and rank k. Suppose $S$ is a subspace embedding for $C$ with error parameter $\epsilon_0 \le 1/2$ and is also the $t \times m$ leverage-score sketching matrix of $C$ with $t = O(k/\epsilon)$ rows. If $\hat{Y}$ and $Y^{\star}$ are, respectively, the solutions to
$$\min_Y \|S(CY - A)\|_F^2$$
and
$$\min_Y \|CY - A\|_F^2,$$
then the following two bounds hold with probability at least 0.99:
$$\|C\hat{Y} - A\|_F \le (1+\epsilon)\,\|CY^{\star} - A\|_F,$$
$$\|C(\hat{Y} - Y^{\star})\|_F \le 2\sqrt{\epsilon}\,\|CY^{\star} - A\|_F.$$
Theorem A4
([15,20]). Let $A$ be a matrix with m rows and let $C$ be a matrix with m rows and rank k. Let $R = \Pi S \in \mathbb{R}^{t \times m}$ with $t = 2k\log k/\epsilon$, where $\Pi \in \mathbb{R}^{t \times s}$ is a subsampled randomized Hadamard matrix and $S \in \mathbb{R}^{s \times m}$ is a sparse subspace embedding matrix with $s = k^2 + 2k/\epsilon$. If $\hat{Y}$ and $Y^{\star}$ are, respectively, the solutions to
$$\min_Y \|R(CY - A)\|_F^2$$
and
$$\min_Y \|CY - A\|_F^2,$$
then the following two bounds hold with probability at least 0.99:
$$\|C\hat{Y} - A\|_F \le (1+\epsilon)\,\|CY^{\star} - A\|_F,$$
$$\|C(\hat{Y} - Y^{\star})\|_F \le 2\sqrt{\epsilon}\,\|CY^{\star} - A\|_F.$$
Lemma A1
([13,15]). Let $A \in \mathbb{R}^{m \times n}$ and $V \in \mathbb{R}^{m \times c}$. Assume that, for a given rank parameter k and an accuracy parameter $0 < \epsilon < 1$,
$$\|A - \Pi_{V,k}^{F}(A)\|_F^2 \le \|A - A_k\|_F^2.$$
Let $V = QY$ be a QR decomposition of $V$, where $Q \in \mathbb{R}^{m \times c}$ and $Y \in \mathbb{R}^{c \times c}$. Let $\Gamma = Q^T A W^T \in \mathbb{R}^{c \times \ell}$, where $W^T \in \mathbb{R}^{n \times \ell}$ is a sparse subspace embedding matrix with $\ell = O(c^2/\epsilon^2)$, and let $\Delta \in \mathbb{R}^{c \times k}$ contain the top k left singular vectors of Γ. Then, with high probability,
$$\|A - Q\Delta\Delta^T Q^T A\|_F^2 \le (1+\epsilon)\,\|A - A_k\|_F^2.$$
Lemma A2
([16]). Given a matrix $A \in \mathbb{R}^{m \times n}$, let $R = \Pi S \in \mathbb{R}^{c \times n}$ be a subspace embedding matrix with $c = O(k\log k/\epsilon)$, where $S \in \mathbb{R}^{s \times n}$ is a sparse subspace embedding matrix with $s = O(k^2 + k/\epsilon)$ and $\Pi \in \mathbb{R}^{c \times s}$ is a subsampled randomized Hadamard matrix. Let $U$ be an orthonormal basis of $AR^T$. Then, with high probability,
$$\|A - UU^T A\|_F^2 \le (1+\epsilon)\,\|A - A_k\|_F^2.$$
Lemma A3
([4,15,16]). Given $A \in \mathbb{R}^{m \times n}$, $R_1 \in \mathbb{R}^{r_1 \times n}$ and $C \in \mathbb{R}^{m \times c}$ such that
$$\mathrm{rank}(C) = \mathrm{rank}(CC^{\dagger}A) = \rho,$$
with $\rho \le c \le n$, define the residual
$$B = A - AR_1^{\dagger}R_1 \in \mathbb{R}^{m \times n}.$$
For $i = 1, \ldots, m$, let $p_i$ be a probability distribution such that, for each i,
$$p_i \ge \alpha\,\|b_i\|_F^2 / \|B\|_F^2,$$
where $b_i$ is the i-th row of $B$. Sample $r_2$ rows from $A$ in $r_2$ i.i.d. trials, where in each trial the i-th row is chosen with probability $p_i$. Let $R_2 \in \mathbb{R}^{r_2 \times n}$ contain the $r_2$ sampled rows and let $R = [R_1^T, R_2^T]^T$. Then
$$\mathbb{E}\,\|A - CC^{\dagger}AR^{\dagger}R\|_F^2 \le \|A - CC^{\dagger}A\|_F^2 + \frac{\rho}{\alpha r_2}\,\|A - AR_1^{\dagger}R_1\|_F^2.$$
Theorem A5
([13,15]). Given three matrices $C \in \mathbb{R}^{m \times c}$, $A \in \mathbb{R}^{m \times n}$ and $R \in \mathbb{R}^{r \times n}$, we have
$$C^{\dagger}AR^{\dagger} = \operatorname*{argmin}_U \|A - CUR\|_F.$$
Theorem A6
([15,21]). Given a matrix $A = AZZ^T + E \in \mathbb{R}^{m \times n}$, where $Z^T Z = I_k$ and $Z \in \mathbb{R}^{n \times k}$, let $S \in \mathbb{R}^{n \times t}$ be any matrix such that $\mathrm{rank}(Z^T S) = k$, and let $C = AS \in \mathbb{R}^{m \times t}$. Then
$$\|A - CC^{\dagger}A\|_{\zeta}^2 \le \|A - \Pi_{C,k}^{\zeta}(A)\|_{\zeta}^2 \le \|A - C(Z^T S)^{\dagger}Z^T\|_{\zeta}^2 \le \|E\|_{\zeta}^2 + \|ES(Z^T S)^{\dagger}\|_{\zeta}^2.$$

Appendix B. Theorem 1 Proof

We first provide an essential lemma before proving the theorem.
Lemma A4
([15]). Given any $Z \in \mathbb{R}^{m \times p}$, $C \in \mathbb{R}^{m \times q}$ and $A \in \mathbb{R}^{m \times n}$, assume $\mathcal{R}(Z) \subseteq \mathcal{R}(C) \subseteq \mathcal{R}(A)$. Let $X \in \mathbb{R}^{n \times n}$ be a projection matrix. Then
$$\|A - CC^{\dagger}AX\|_F \le \|A - ZZ^{\dagger}AX\|_F.$$
Now we start to prove Theorem 1.
Proof. 
According to Theorem A6, we have
$$\|A - C_1C_1^{\dagger}A\|_F^2 \le \|A - \Pi_{C_1,k}^{F}(A)\|_F^2 \le \|E\|_F^2 + \|ES(Z^T S)^{\dagger}\|_F^2.$$
Let $S = \Omega\Gamma$ and $E = A - AZZ^T$; then we have
$$\|E\|_F^2 \le 2\,\|A - A_k\|_F^2 \tag{A1}$$
because of Lemma A1 with error parameter $\epsilon = 1$. In addition, $S^T$ is a row leverage-score sketching matrix of $Z$, where Γ, Ω and $Z$ are computed in Algorithm 3; moreover, $S^T$ is also a subspace embedding matrix of $Z$ with error parameter $\epsilon_0 = 1/2$. Using the fact that $(Z^T S)^{\dagger} = (Z^T S)^T (Z^T S S^T Z)^{-1}$, we obtain
$$\|ES(Z^T S)^{\dagger}\|_F^2 = \|ESS^T Z (Z^T S S^T Z)^{-1}\|_F^2$$
$$\le \|ESS^T Z\|_F^2\,\|(Z^T S S^T Z)^{-1}\|_2^2 \tag{A2}$$
$$\le \frac{1}{4k\log k}\,\|E\|_F^2\,\|Z\|_F^2\,\|(Z^T S S^T Z)^{-1}\|_2^2 \tag{A3}$$
$$\le \frac{1}{\log k}\,\|E\|_F^2, \tag{A4}$$
where Equation (A2) follows from the fact that $\|AB\|_F \le \|A\|_F\|B\|_2$, and Equation (A3) follows from Theorem A1 with error parameter $\epsilon = 1/\sqrt{4k\log k}$ together with $EZ = A(I - ZZ^T)Z = 0$. Equation (A4) follows from Theorem A2 with error parameter $\epsilon_0 = 1/2$: because we have
$$\|S^T Z x\|_2^2 = (1 \pm \epsilon_0)\,\|Zx\|_2^2,$$
it follows that
$$\|(Z^T S S^T Z)^{-1}\|_2^2 \le (1 - \epsilon_0)^{-2} = 4.$$
By Theorem A2, $S$ needs $t = 4k\log k$ columns to be a subspace embedding matrix of $Z$ with error parameter $\epsilon_0 = 1/2$; the same sketch size yields the error parameter $\epsilon = 1/\sqrt{4k\log k}$ used in Equation (A3). Now we have
$$\|A - C_1C_1^{\dagger}A\|_F^2 \le \|E\|_F^2 + \frac{1}{\log k}\|E\|_F^2 \le 4\,\|A - A_k\|_F^2,$$
where the last inequality follows from Equation (A1) and $1/\log k \le 1$.
Using Lemma 2, we need to sample $O(k/\epsilon)$ columns from $A$ such that $\hat{C} = [C_1, C_2]$ has the property
$$\|A - \hat{C}\hat{C}^{\dagger}A\|_F^2 \le (1+\epsilon)\,\|A - A_k\|_F^2.$$
Lemma A1 shows that there exists an orthonormal matrix $Q_k$ of rank k in the range of $\hat{C}$ such that
$$\|A - Q_kQ_k^T A\|_F^2 \le (1+\epsilon)\,\|A - \hat{C}\hat{C}^{\dagger}A\|_F^2.$$
Let $C_3 = \mathrm{AdaptiveSampling}(A, Q_k^T, C_1, k/\epsilon)$ and define $\tilde{C} = [C_1, C_3]$; then, by Lemma A3, it holds that
$$\|A - \tilde{C}\tilde{C}^{\dagger}AQ_kQ_k^T\|_F^2 \le \|A - Q_kQ_k^T A\|_F^2 + \epsilon\,\|A - C_1C_1^{\dagger}A\|_F^2 \le (1+\epsilon)\,\|A - \hat{C}\hat{C}^{\dagger}A\|_F^2 + 4\epsilon\,\|A - A_k\|_F^2 \le (1+\epsilon)^2\,\|A - A_k\|_F^2 + 4\epsilon\,\|A - A_k\|_F^2 = (1 + 6\epsilon + \epsilon^2)\,\|A - A_k\|_F^2 \le (1+7\epsilon)\,\|A - A_k\|_F^2.$$
By rescaling ϵ, we can obtain a $(1+\epsilon)$ relative-error bound. Since $\mathcal{R}(Q_k) \subseteq \mathcal{R}(\hat{C}) \subseteq \mathcal{R}(A)$, Lemma A4 leads to
$$\|A - \tilde{C}\tilde{C}^{\dagger}A(\hat{C}^{\dagger})^T\hat{C}^T\|_F^2 \le \|A - \tilde{C}\tilde{C}^{\dagger}AQ_kQ_k^T\|_F^2 \le (1+\epsilon)\,\|A - A_k\|_F^2.$$
Since $\mathcal{R}(\hat{C}) \subseteq \mathcal{R}(C) \subseteq \mathcal{R}(A)$ and $\mathcal{R}(\tilde{C}) \subseteq \mathcal{R}(C) \subseteq \mathcal{R}(A)$, applying Lemma A4 twice, we reach the result that
$$\|A - CC^{\dagger}A(C^{\dagger})^T C^T\|_F^2 \le \|A - \tilde{C}\tilde{C}^{\dagger}A(\hat{C}^{\dagger})^T\hat{C}^T\|_F^2 \le (1+\epsilon)\,\|A - A_k\|_F^2.$$
Since $S_1$ and $S_2$ are leverage-score sketching matrices of $C$ with sketch size $s = O(c/\epsilon + c\log c)$, by Theorem 3 of [12] we have
$$\|A - CUC^T\|_F^2 \le \|A - C\hat{U}C^T\|_F^2 \le (1+\epsilon)\,\|A - CC^{\dagger}A(C^{\dagger})^T C^T\|_F^2.$$
By rescaling ϵ, we achieve the final result that
$$\|A - CUC^T\|_F \le (1+\epsilon)\,\|A - A_k\|_F.$$ □

References

  1. Kumar, S.; Mohri, M.; Talwalkar, A. Sampling methods for the Nyström method. J. Mach. Learn. Res. 2012, 13, 981–1006. [Google Scholar]
  2. Williams, C.; Seeger, M. Using the Nyström method to speed up kernel machines. In Proceedings of the 14th Annual Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 3–8 December 2001; Number EPFL-CONF-161322. pp. 682–688. [Google Scholar]
  3. Gittens, A.; Mahoney, M.W. Revisiting the Nyström Method for Improved Large-Scale Machine Learning. arXiv 2013, arXiv:1303.1849. [Google Scholar]
  4. Wang, S.; Zhang, Z. Improving CUR Matrix Decomposition and the Nyström Approximation via Adaptive Sampling. J. Mach. Learn. Res. 2013, 14, 2729–2769. [Google Scholar]
  5. Anderson, D.G.; Du, S.S.; Mahoney, M.W.; Melgaard, C.; Wu, K.; Gu, M. Spectral Gap Error Bounds for Improving CUR Matrix Decomposition and the Nyström Method. In Proceedings of the AISTATS, San Diego, CA, USA, 9–12 May 2015. [Google Scholar]
  6. Wang, S.; Gittens, A.; Mahoney, M.W. Scalable kernel K-means clustering with Nyström approximation: Relative-error bounds. J. Mach. Learn. Res. 2019, 20, 431–479. [Google Scholar]
  7. Gao, S.; Dou, S.; Zhang, Q.; Huang, X. Kernel-Whitening: Overcome Dataset Bias with Isotropic Sentence Embedding. arXiv 2022, arXiv:2210.07547. [Google Scholar]
  8. Hamm, K.; Lu, Z.; Ouyang, W.; Zhang, H.H. Boosting Nyström Method. arXiv 2023, arXiv:2302.11032. [Google Scholar]
  9. Hsieh, C.J.; Si, S.; Dhillon, I.S. Fast Prediction for Large-Scale Kernel Machines. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 3689–3697. [Google Scholar]
  10. Si, S.; Hsieh, C.J.; Dhillon, I. Memory efficient kernel approximation. In Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 701–709. [Google Scholar]
  11. Wang, S.; Luo, L.; Zhang, Z. SPSD Matrix Approximation vis Column Selection: Theories, Algorithms, and Extensions. J. Mach. Learn. Res. 2016, 17, 1697–1745. [Google Scholar]
  12. Wang, S.; Zhang, Z.; Zhang, T. Towards More Efficient SPSD Matrix Approximation and CUR Matrix Decomposition. J. Mach. Learn. Res. 2016, 17, 1–49. [Google Scholar]
  13. Boutsidis, C.; Woodruff, D.P. Optimal CUR Matrix Decompositions. SIAM J. Comput. 2017, 46, 543–589. [Google Scholar] [CrossRef]
  14. Deshpande, A.; Vempala, S. Adaptive sampling and fast low-rank matrix approximation. In Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques; Springer: Berlin/Heidelberg, Germany, 2006; pp. 292–303. [Google Scholar]
  15. Ye, H.; Li, Y.; Zhang, Z. A simple approach to optimal CUR decomposition. arXiv 2015, arXiv:1511.01598. [Google Scholar]
  16. Woodruff, D.P. Sketching as a tool for numerical linear algebra. arXiv 2014, arXiv:1411.4357. [Google Scholar]
  17. Ye, H.; Wang, S.; Zhang, Z.; Zhang, T. Fast Generalized Matrix Regression with Applications in Machine Learning. arXiv 2019, arXiv:1912.12008. [Google Scholar]
  18. Drineas, P.; Magdon-Ismail, M.; Mahoney, M.W.; Woodruff, D.P. Fast approximation of matrix coherence and statistical leverage. J. Mach. Learn. Res. 2012, 13, 3475–3506. [Google Scholar]
  19. Wang, S.; Zhang, Z. Efficient Algorithms and Error Analysis for the Modified Nystrom Method. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, Reykjavik, Iceland, 22–25 April 2014; pp. 996–1004. [Google Scholar]
  20. Clarkson, K.L.; Woodruff, D.P. Low rank approximation and regression in input sparsity time. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, Palo Alto, CA, USA, 2–4 June 2013; pp. 81–90. [Google Scholar]
  21. Boutsidis, C.; Drineas, P.; Magdon-Ismail, M. Near-optimal column-based matrix reconstruction. SIAM J. Comput. 2014, 43, 687–717. [Google Scholar] [CrossRef]
Figure 1. Results of the Nyström algorithms on the a9a dataset. In the first column, we set k = 10 and c = ak with a = 8, …, 26. In the middle column, we set k = 30 and c = ak. In the right column, we set k = 50 and c = ak.
Figure 2. Results of the Nyström algorithms on the pendigit dataset. In the first column, we set k = 10 and c = ak with a = 8, …, 26. In the middle column, we set k = 30 and c = ak. In the right column, we set k = 50 and c = ak.
Figure 3. Results of the Nyström algorithms on the usps dataset. In the first column, we set k = 10 and c = ak with a = 8, …, 26. In the middle column, we set k = 30 and c = ak. In the right column, we set k = 50 and c = ak.
Table 1. A summary of the datasets for kernel approximation.

Data Set     a9a       USPS      PenDigits
#instances   32,561    11,305    7494
γ            5         4         30
Source       UCI       TKH96a    UCI
