Article

Using Locality-Sensitive Hashing for SVM Classification of Large Data Sets

by Maria D. Gonzalez-Lima 1,*,† and Carenne C. Ludeña 2,†
1 Departamento de Matemáticas y Estadística, Universidad del Norte, Barranquilla 081007, Colombia
2 Matrix CPM Solutions, Crr 15 93A 84, Bogotá 110221, Colombia
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Mathematics 2022, 10(11), 1812; https://doi.org/10.3390/math10111812
Submission received: 30 April 2022 / Revised: 16 May 2022 / Accepted: 17 May 2022 / Published: 25 May 2022
(This article belongs to the Special Issue Machine Learning and Data Mining: Techniques and Tasks)

Abstract: We propose a novel method using Locality-Sensitive Hashing (LSH) for solving the optimization problem that arises in the training stage of support vector machines for large data sets, possibly in high dimensions. LSH was introduced as an efficient way to look for neighbors in high dimensional spaces. Random-projection-based LSH functions create bins so that, with high probability, points belonging to the same bin are close and points that are far apart do not fall in the same bin. Based on these bins, it is not necessary to consider the whole original set but only representatives in each of them, thus reducing the effective size of the data set. A key aspect of our proposal is that we work in the feature space and use only the projections to search for closeness in this space. Moreover, instead of choosing the projection directions at random, we sample a small subset and solve the associated SVM problem. Projections in this direction allow for a more precise sample in many cases, and an approximation of the solution of the large problem is found in a fraction of the running time with a small degradation of the classification error. We present two algorithms, theoretical support, and numerical experiments showing their performance on real-life problems taken from the LIBSVM database.

1. Introduction

In this work, we deal with the problem of binary classification of large-volume data sets of possibly high dimension (for instance, graphs and texts). There are many techniques for solving this problem, for example, logistic regression and other linear methods, random forests and ensemble methods in general, K-nearest neighbors, or neural network/deep learning models; see for example [1] for a general survey. However, one of the difficulties of the classification problem is to represent high dimensional points via an appropriate embedding in a feature space. Methods based on kernel functions, such as Support Vector Machines (SVM), are an interesting alternative for the classification problem because they aim exactly at this, supported by the statistical theory developed by [2]. They are based on finding a hyperplane of maximal margin that separates the data. A very appealing feature of SVMs is that the separation can be achieved in the original data space or in a higher dimensional one, via kernel functions, without explicitly forming the space transformation map [3], unlike what most algorithms do. This feature also makes it possible to apply the method in non-vectorial domains [4]. To construct the separating hyperplane, SVM only needs to identify some significant vectors, called support vectors, from the whole data set. Because of this, SVM may suffer less from the effect of outliers than other techniques. With well-tuned hyperparameters, SVMs have been shown to be a robust and well-behaved technique for classification in many real-world problems. We refer to the web page http://www.clopinet.com/isabelle/Projects/SVM/applist.html (accessed on 28 April 2022) for a comprehensive list of applications.
A drawback of SVM for the classification of very large data sets is the computational cost of the optimization problem to be solved in the training stage. There is a large number of contributions in the literature that deal with this problem, and several courses of action have been developed. Following the seminal work by Osuna, Girosi, and Freund [5,6], decomposition techniques based on the optimality conditions are used to find the solution of the optimization problem by solving a sequence of smaller subproblems with the same structure as the original one. Efficient methods, as well as different heuristics, have been proposed for stating and solving the optimization subproblems, such as SMO [7], which gave rise to the broadly used LIBSVM [8] (which decomposes the large problem into small problems of size two). Algorithms that decompose the large optimization problem into a sequence of medium-size problems include SVMlight [9,10], GPDT [11,12,13], and ASL [14]. There are other approaches that do not use decomposition procedures but rather an efficient use of the linear algebra for some kernels [15], or procedures that identify the positive components at the solution, such as the ones developed in [16]. All these methods strive to find the exact solution of the SVM optimization problem by considering, in different ways, subsets of the training data set.
Other methods, although also using subsets of the training data set, are of a different nature since they consider approximations of the optimization problem in an attempt to reduce the computational cost while still obtaining similar generalization errors. In this group, we can cite [17,18,19], which use low-rank approximations of the kernel matrix, and [20,21,22], which use random samples of the data set. The use of subsamples may be combined with techniques for reducing the feature space dimension as in [23], or combined with search ideas such as the one proposed in [24], followed by the extension presented in [25], where approximate solutions of the large SVM problem are found by looking at the smaller optimization subproblems resulting from random samples of the data, and enriching the subsamples with the k-nearest neighbors, in the complete data set, of the support vectors associated with them. For a review of subsampling methodologies we refer to Nalepa and Kawulok [26] and Birzandi et al. [27].
The results in [24] show that great computational advantages can be obtained by using randomness and proximity ideas in the context of SVM. This has to do with the nature of SVM, where the main objective is to find the separating hyperplane. It seems natural, then, to consider approaches that include randomness along with the idea that only points near the actual support vectors are necessary in order to obtain a good fit over a sample, thus reducing the required sample size. One way of addressing this problem is to use projections over random directions, instead of random sampling, and then choose a sample over the projections. This approach has the advantage that it is not necessary to consider all points for a given random direction, but rather only selected representatives. Motivated by this, in this paper we look for an adequate subset of representative points of the data via projections using Locality-Sensitive Hashing (LSH) [28]. LSH has the advantage of transforming the data points into a lower dimensional representation space; see for example [29,30]. One of its main applications is the efficient search of (approximate) nearest neighbors. Since only the support vectors are needed to obtain the optimal hyperplane classifier, using LSH to select a subset of points that may be support vectors, or close to them, is of great benefit for reducing the training computational time for SVM; however, LSH could be used jointly with other approaches as well.
Random projections create bins so that, with high probability, points belonging to the same bin are close and points that are far apart will not be in the same bin. Based on these bins, it is not necessary to consider the whole original set but only representatives in each of them, thus reducing the effective size of the data set. A key aspect of our proposal is that we work with the feature space and use the projections to search for closeness in this space. This is also another reason for using LSH in the context of SVM. Moreover, instead of only choosing the projection directions at random, we also choose them by solving very small SVM problems. We call these projection directions "directed" because they already contain useful information about the large problem to be solved.
Results relating kernel-based classifiers and LSH are rather recent. Research efforts are mostly related to provably approximating the similarity structure given by the kernel in the SVM in order to reduce the time and memory space required to train the SVM, as in [31], where the authors show accuracy improvements over a series of benchmark data sets. Another approach [32,33] is related to enhancing the prediction stage by using hash functions. Instead of using the actual solution, the authors consider hashing both the sample points and the normal to the obtained separating hyperplane in order to optimize time and memory space.
Our main objective in this paper is to propose subsampling algorithms, based on LSH, that produce a good selection of the data set for use in SVM. The goal is to improve the computational cost without degrading the prediction error significantly. Our approaches exploit the underlying idea of proximity but without looking explicitly for neighborhoods. The numerical experimentation seeks to show the effectiveness and efficiency of our methodologies compared with using the whole data set. We show improvements in time and we support our numerical findings with theoretical results. Our next work will focus on improving the implementations (by exploiting the embarrassingly parallel nature of our algorithms) and the computational environment so that much larger problems in very high dimensions can be solved.
The article is organized as follows. Section 2 presents the SVM problem. In Section 3 we introduce LSH and the general algorithmic frameworks that we propose for solving the complete SVM problem. The following section includes numerical experimentation and details on the implementations. Section 5 contains theoretical results giving bounds on the errors obtained by the algorithms.
We end in Section 6 with some concluding remarks and future work.

2. Preliminaries: The SVM Problem

SVM for binary classification (the one considered in this paper) is based on the following. Given points $\{X_i \in \mathbb{R}^d,\ i=1,\ldots,n\}$ belonging to two classes (identified with the corresponding tags $y_i = 1$ or $y_i = -1$), they are linearly separable if there exists a hyperplane that divides them into the two different classes. The dimension $d$ denotes the number of attributes of the data, and the input (or observation) space is the set formed by the data. Among the separating hyperplanes, SVM seeks to find the one that maximizes the separation margin between classes, constrained to respecting the classification of each point of the data. This problem can be modeled, after a normalization, as the optimization problem
$$\min_{w,b}\ \tfrac{1}{2}\|w\|_2^2 \quad \text{subject to} \quad y_i\,(w^t X_i + b) \ge 1,\quad i=1,\ldots,n. \tag{1}$$
Here, $\|\cdot\|_2$ denotes the Euclidean norm.
Because the data set is usually linearly nonseparable (that is, there does not exist a solution of problem (1)), two variants are introduced in the previous problem. On one hand, a perturbation variable $\xi$ is included in order to relax the constraints, so that a margin of error in the classification is accepted. Additionally, since the data might be separable by a nonlinear decision surface, such a surface is computed by mapping the input variables into a higher dimensional feature space and by working with linear classification in that space. In other words, let us denote by $\mathcal{H}$ the feature space, which is a reproducing kernel Hilbert space (RKHS). We denote the inner product in $\mathcal{H}$ by $\langle\cdot,\cdot\rangle$. Then, $x \in \mathbb{R}^d$ is mapped into $\phi(x) \in \mathcal{H}$, where $\phi(\cdot)$ is the transformation induced by the use of the kernel function $K$. That is, $K(z,a) = \langle\phi(z),\phi(a)\rangle$ for every $z, a \in \mathbb{R}^d$. Thus, the input vectors $X_i$ are substituted with the new "feature vectors" $\phi(X_i)$, belonging to the "feature space" $\mathcal{H}$. In this way, for the linearly nonseparable case, the optimization problem is written as
$$\min_{w,b,\xi}\ \tfrac{1}{2}\|w\|_{\mathcal{H}}^2 + C\sum_{i=1}^{n}\xi_i \quad \text{subject to} \quad y_i\,(\langle w,\phi(X_i)\rangle + b) \ge 1-\xi_i,\quad \xi_i \ge 0,\quad i=1,\ldots,n, \tag{2}$$
where $C$ is a positive constant that penalizes the constraint violations, and $\|w\|_{\mathcal{H}}^2 = \langle w,w\rangle$.
Let us denote by $(w^*, b^*, \xi^*)$ the solution to this problem. A new point $X \in \mathbb{R}^d$ is classified according to the side of the "hyperplane" where $\phi(X)$ falls. Hence, the function used to classify a new point $X$ can be written as
$$g(X) = \operatorname{sign}\big(\langle w^*,\phi(X)\rangle + b^*\big) \tag{3}$$
and is usually called the classification or generalization function.
In order to solve (2), standard duality theory may be used since the problem is convex and quadratic. For more details, we refer to the book by Cristianini and Shawe-Taylor [3]. Using this theory, (2) can be solved by solving its dual. One advantage of using the dual problem is that an explicit description of $\phi$ is not required, only a function that preserves in the higher dimensional space $\mathcal{H}$ the properties of the inner product, and this is satisfied by the kernel function $K$. Then, the dimension of the feature space used for the classification can be increased without increasing the dimension of the optimization problem to be solved, and this is particularly relevant when dealing with an infinite dimensional space $\mathcal{H}$.
Following [3], the dual problem corresponding to (2) is given, in terms of the kernel function, by
$$\min_{\lambda}\ -\sum_{i=1}^{n}\lambda_i + \tfrac{1}{2}\lambda^t Q\lambda \quad \text{subject to} \quad y^t\lambda = 0,\quad 0 \le \lambda_i \le C \ \text{ for } i=1,\ldots,n, \tag{4}$$
where $Q \in \mathbb{R}^{n\times n}$ is a symmetric positive semidefinite matrix with positive diagonal, defined as $Q_{ij} = y_i y_j K(X_i, X_j)$. The matrix $K$ with $ij$-component equal to $K(X_i,X_j)$ is called the kernel matrix. To avoid complicating notation, $K$ will typically stand for both the generic kernel and the matrix defined by the kernel restricted to the original data of size $n$.
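As a concrete illustration (ours, not taken from the paper), the following NumPy sketch assembles the matrix $Q$ of the dual problem (4) for a Gaussian kernel; the function name and the width parameter `sigma` are illustrative placeholders, and any quadratic programming solver handling the box and equality constraints could then be applied.

```python
import numpy as np

def dual_problem_data(X, y, sigma=1.0):
    """Build the data of the dual SVM problem (4):
    minimize  -sum(lambda) + 0.5 * lambda^T Q lambda
    s.t.      y^T lambda = 0,  0 <= lambda_i <= C,
    with Q_ij = y_i * y_j * K(X_i, X_j) and a Gaussian kernel K."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq_dists / (2.0 * sigma ** 2))   # kernel matrix
    Q = np.outer(y, y) * K                       # Q = (y y^T) entry-wise times K
    return Q, K
```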
Let $\lambda^*$ be a solution of (4). Using the Karush–Kuhn–Tucker (KKT) optimality conditions (see [3]), we can obtain an expression for $w^*$ as
$$w^* = \sum_{i=1}^{n}\lambda_i^*\, y_i\, \phi(X_i). \tag{5}$$
Observe that in this sum only the terms with $\lambda_i^* > 0$ are relevant. The corresponding $X_i$ are the so-called support vectors, and their importance lies in the fact that the remaining points are irrelevant for classification purposes.
Complementary slackness conditions [3] also imply that, for any $i$ with $0 < \lambda_i^* < C$, one has $y_i(\langle w^*,\phi(X_i)\rangle + b^*) = 1$, and therefore an expression for $b^*$ can also be obtained as
$$b^* = 1 - \max_{\{j:\ y_j=1,\ 0<\lambda_j^*<C\}} \langle w^*, \phi(X_j)\rangle. \tag{6}$$
Let us recall that our interest is to classify a new point by means of the generalization function $g$ using $\lambda^*$. This can be achieved by substituting $w^*$ in (3) and (6), so we obtain
$$g(X) = \operatorname{sign}\Big(\sum_{i=1}^{n}\lambda_i^*\, y_i\, K(X_i, X) + b^*\Big) \tag{7}$$
with $b^* = 1 - \max_{\{j:\ y_j=1,\ 0<\lambda_j^*<C\}} \sum_{i=1}^{n}\lambda_i^*\, y_i\, K(X_i, X_j)$.
Therefore, the classification of a new point can be made just by selecting a (hopefully small) group of points from the large original data: the support vectors. The procedure of finding these vectors (which we denote by SV) from the given data set is usually referred to as the training process, or training the machine. After training, it is customary to assess the result by using it to classify points that are known to belong to one class or the other. This set of points is called the testing set. The estimated classification or prediction error for a given data set is the percentage of points from the test set that are incorrectly predicted. This last part of the SVM procedure is known as the fitting process. Note that the number of support vectors impacts the fitting time: the fewer, the faster.
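To make (5)–(7) concrete, here is a small NumPy sketch (ours, not the authors' code) that evaluates the classifier $g$ for new points given the support vectors, their multipliers $\lambda_i^*$, labels $y_i$, and offset $b^*$; the Gaussian kernel width `sigma` is an illustrative placeholder.

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    # Gaussian kernel K(a, b) = exp(-||a - b||^2 / (2 sigma^2)), evaluated pairwise
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def classify(X_new, X_sv, y_sv, lam_sv, b):
    """g(X) = sign( sum_i lambda_i^* y_i K(X_i, X) + b^* ), as in (7).
    Only the support vectors (lambda_i^* > 0) enter the sum."""
    K = rbf(X_sv, X_new)                     # shape (n_sv, n_new)
    return np.sign((lam_sv * y_sv) @ K + b)
```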
In the following section, we introduce the algorithms proposed to solve an approximation of problem (4) by the use of locality-sensitive hashing and subsamples.

3. Using LSH for SVM

Locality-Sensitive Hashing (LSH) was introduced as an efficient way to look for nearest neighbors in high dimensional spaces [28]. The idea is to hash the vectors in the space using several hash functions so that, for each one, the probability of collision is much higher for points that are close to each other than for those that are far apart. Then, LSH can be used to search for approximate nearest neighbors of a given query point by retrieving the elements stored in the same bin containing this point. Formally, the definition follows.
Definition 1.
(LSH functions) For a given $R > 0$ and probabilities $p_1 > p_2$, a family of functions belonging to the set $H = \{h : D \to \mathbb{N}\}$, where $D$ is a metric space with metric $\tilde d$ and $\mathbb{N}$ is the set of integers, is LSH if for each $\tilde q, q \in D$ and each $h \in H$ the following are satisfied:
  • if $\tilde d(\tilde q, q) \le R$ then $Pr_{H}[h(q) = h(\tilde q)] \ge p_1$,
  • if $\tilde d(\tilde q, q) > R$ then $Pr_{H}[h(q) = h(\tilde q)] \le p_2$.
In this paper we are interested in the projection-based hash functions presented in [34]. For any $p$-dimensional vector $v$, define the maps $h_{a,\theta}(v) : \mathbb{R}^p \to \mathbb{N}$, indexed by a choice of an $\alpha$-stable random vector $a$ (see [34] for a definition) and a real number $\theta$ chosen uniformly from the range $[0, r]$, in the following way. For fixed $a, \theta$ the hash function $h_{a,\theta}$ is given by
$$h_{a,\theta}(v) = \left\lfloor \frac{a^t v + \theta}{r} \right\rfloor. \tag{8}$$
Here, $\lfloor\cdot\rfloor$ denotes the floor function.
In [34] it is shown that the projection-based functions $h$, as previously defined, are LSH.
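A minimal sketch of (8) in NumPy, assuming the Gaussian (2-stable) choice of $a$ used later for random projections; the variable names and the toy data are ours.

```python
import numpy as np

def lsh_hash(V, a, theta, r):
    """Projection-based LSH of (8): h_{a,theta}(v) = floor((a^t v + theta) / r)."""
    return np.floor((V @ a + theta) / r).astype(int)

rng = np.random.default_rng(0)
p, r = 20, 0.5
a = rng.normal(size=p)             # Gaussian entries: a 2-stable direction
theta = rng.uniform(0.0, r)
V = rng.normal(size=(100, p))      # toy data, one row per point
bins = lsh_hash(V, a, theta, r)    # nearby points collide with higher probability
```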
We use these hash functions in the feature space in order to find representatives of clusters for the data set used to train the SVM problem. We do this by projecting the data several times and choosing, as a representative, a random data point in each bin after each projection. Because the projections are computationally expensive, we project only subsamples of the whole data set. In order to choose the direction $a$ we follow two alternative procedures, random and directed projections, as described below.
One interesting feature of this procedure, although we do not include this in our numerical experiments, is that repetitions are independent, so the problem is embarrassingly parallel. Each parallel iteration constructs an independent subsample, which can then be joined to form the final subsample that is used to train the SVM.

3.1. Random Projections

Following [34], we choose the entries of a independently from a Gaussian distribution in the input space.
This leads to the following Algorithm 1; a minimal code sketch of the procedure is given after the listing.
Algorithm 1 LSH-SVM (random projections)
Given an initial kernel $K$, $B$ bins, $\eta_1 \in (0,1)$ the percentage of subsample data points, $N$ the number of projection repetitions, $\eta_3$ the cutoff percentage, and $S = \{X_1,\ldots,X_n\}$ the data set of problem (4):
  • Take a random subsample $S_1$ of $S$ with size $n_1 = \eta_1 n$.
  • Generate a vector $a$ with independent entries from an $\alpha$-stable distribution in the input space.
  • Find $K(a,v)$ for all $v \in S_1$. Calculate $R_{\max} = \max_{v\in S_1} K(a,v) - \min_{v\in S_1} K(a,v)$.
  • For the given number of bins $B$, calculate $r = R_{\max}/B$.
  • Generate $\theta \sim \mathrm{Unif}[0,r]$, and find $\tilde h_{a,\theta}(v) := h_{\phi(a),\theta}(\phi(v))$ as in (8) (in the feature space), that is, $\tilde h_{a,\theta}(v) = \left\lfloor \frac{K(a,v)+\theta}{r}\right\rfloor$, for all $v \in S_1$.
  • Eliminate the bins with the $\eta_3$ percent highest and lowest values.
  • For each one of the remaining bins, randomly select a representative.
  • Repeat $N$ times steps 2 to 5. Call $\hat S$ the set formed by all the representatives found.
  • Solve (4) using $\hat S$ and $\hat y$, their corresponding classes, instead of $S$ and $y$. Call $\hat\lambda^{(1)}$ the solution, and $\hat w^{(1)}, \hat b^{(1)}$ the corresponding hyperplane values as defined in (5) and (6).
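The following Python sketch mirrors the steps of Algorithm 1 under our own (hedged) reading of the listing: it repeats the projection, binning, and representative-selection cycle and then hands the representatives to an external SVM solver. The callables `kernel` and `solve_svm`, and all parameter defaults, are illustrative placeholders rather than the authors' implementation.

```python
import numpy as np

def lsh_svm_random(X, y, kernel, solve_svm, eta1=0.001, N=100, B=40, eta3=0.10, seed=0):
    """Sketch of Algorithm 1 (LSH-SVM with random projections).
    kernel(A, B) -> kernel matrix; solve_svm(X, y) -> solution of (4) on a subset."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # one random subsample S1 of size eta1 * n
    idx1 = rng.choice(n, size=max(2, int(eta1 * n)), replace=False)
    reps = []
    for _ in range(N):                       # N projection repetitions
        # random direction a with Gaussian (2-stable) entries in the input space
        a = rng.normal(size=X.shape[1])
        # feature-space projection: <phi(a), phi(v)> = K(a, v) for v in S1
        proj = kernel(a[None, :], X[idx1]).ravel()
        # bin width r = R_max / B and hashed bins
        r = (proj.max() - proj.min()) / B + 1e-12
        theta = rng.uniform(0.0, r)
        bins = np.floor((proj + theta) / r).astype(int)
        # drop the eta3 fraction of highest- and lowest-valued bins
        labels = np.unique(bins)
        cut = int(eta3 * len(labels))
        keep = labels[cut: len(labels) - cut] if cut > 0 else labels
        # one random representative per remaining bin
        for lab in keep:
            reps.append(rng.choice(idx1[bins == lab]))
    S_hat = np.unique(np.array(reps))
    # solve the reduced problem (4) on the representatives only
    return solve_svm(X[S_hat], y[S_hat]), S_hat
```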

3.2. Directed Projections

In this case, the direction $a$ is found by solving small random SVM subproblems. To motivate this choice, assume the solution $w^*$ of the complete SVM problem were known beforehand. By construction, projecting in the direction of $w^*$ immediately identifies the support vectors. By sampling a small subset and solving the associated SVM problem we can approximate $w^*$, and projections in this direction allow for a much more precise sample. In summary, we obtain Algorithm 2; a sketch of the directed-projection step is given after the listing.
Algorithm 2 LSH-SVM (directed projections)
Given an initial kernel $K$, $B$ the number of bins, $\eta_1, \eta_2 \in (0,1)$ the percentages of subsample data points, $N$ the number of projection repetitions, $\eta_3$ the cutoff percentage, and $S = \{X_1,\ldots,X_n\}$ the data set of problem (4):
  • Take a random subsample $S_1$ of $S$ with size $n_1 = \eta_1 n$.
  • Take a random subsample $S_2$ of $S$ with size $n_2 = \eta_2 n$.
  • Find $\lambda_{n_2}$, the solution to problem (4) corresponding to $S_2$, and normalize $\tilde\lambda_{n_2 i} = \lambda_{n_2 i}\big/\big(\sum_{i=1}^{n_2} \lambda_{n_2 i}^2\big)^{1/2}$.
  • Denote by $\tilde w_{n_2}$ the corresponding normal hyperplane direction as defined in (5) and (6).
  • Find $\langle \tilde w_{n_2}, \phi(v)\rangle = \sum_i \tilde\lambda_{n_2 i}\, y_i\, K(X_i, v)$ for all $v \in S_1$. Calculate $R_{\max} = \max_{v\in S_1}\langle \tilde w_{n_2},\phi(v)\rangle - \min_{v\in S_1}\langle \tilde w_{n_2},\phi(v)\rangle$.
  • For the given number of bins $B$, calculate $r = R_{\max}/B$.
  • Generate $\theta \sim \mathrm{Unif}[0,r]$ and find
    $$h_{\tilde w_{n_2},\theta}(\phi(v)) = \left\lfloor \frac{\langle \tilde w_{n_2},\phi(v)\rangle + \theta}{r}\right\rfloor = \left\lfloor \frac{\sum_i \tilde\lambda_{n_2 i}\, y_i\, K(X_i,v) + \theta}{r}\right\rfloor$$
    for all $v \in S_1$.
  • Eliminate the bins with the $\eta_3$ percent highest and lowest values.
  • For each one of the selected bins, randomly select a representative.
  • Repeat $N$ times steps 2 to 7. Call $\bar S$ the set formed by all the representatives found.
  • Solve (4) using $\bar S$ and $\bar y$, their corresponding classes, instead of $S$ and $y$. Call $\bar\lambda^{(2)}$ the solution, and $\bar w^{(2)}, \bar b^{(2)}$ the corresponding hyperplane values as defined in (5) and (6).
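The only step that distinguishes Algorithm 2 from Algorithm 1 is the computation of the "directed" projection $\langle \tilde w_{n_2}, \phi(v)\rangle$; a hedged NumPy sketch of that step follows, where `solve_dual` stands for any routine returning the dual multipliers of (4) on the small subsample $S_2$ (its name and signature are ours).

```python
import numpy as np

def directed_projection(X, y, idx2, idx1, kernel, solve_dual):
    """Project the points of S1 onto the direction given by the small SVM solved on S2:
    <w_tilde, phi(v)> = sum_i lambda_tilde_i * y_i * K(X_i, v)."""
    lam = solve_dual(X[idx2], y[idx2])       # dual multipliers of (4) on S2
    lam_tilde = lam / np.linalg.norm(lam)    # normalize as in Algorithm 2
    K = kernel(X[idx2], X[idx1])             # shape (|S2|, |S1|)
    return (lam_tilde * y[idx2]) @ K         # one projected value per point of S1
```

The projected values are then binned and representatives are selected exactly as in Algorithm 1.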

4. Numerical Experiments

In this section, we show the results of applying the LSH-SVM method to a set of real-life SVM problems taken from the LIBSVM webpage www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ (accessed on 27 April 2022). The method was implemented in the R environment and the SVM problems were all solved using the kernlab package. kernlab [19] uses a version of the algorithm implemented in LIBSVM, based on the SMO method proposed by Platt [7].
The problems tested and their sizes can be found in Table 1. We selected 80% of the data for the training stage of the algorithms, and the remaining 20% is used for estimating the classification error.
The objectives of the experiments were to study the performance of our approach and the effect of the parameters, as well as to analyze the impact of the directed directions in contrast with the use of random ones. All the experiments considered the Gaussian RBF kernel defined as $K(X_i,X_j) = \exp\!\left(-\frac{\|X_i - X_j\|_2^2}{2\,(0.05)^2}\right)$ for all $i,j$, and the value of the parameter $C$ in problem (4) was set to 5. We would like to highlight that the algorithms presented here are especially proposed as efficient techniques for solving SVM problems with nonlinear kernels.
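As a quick sanity check (ours, not from the paper), the kernel above corresponds to $\gamma = 1/(2\,(0.05)^2) = 200$ in the common $\exp(-\gamma\|x-y\|^2)$ parameterization used, e.g., by scikit-learn; the original R/kernlab implementation is not reproduced here.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# K(Xi, Xj) = exp(-||Xi - Xj||^2 / (2 * 0.05^2))  <=>  gamma = 1 / (2 * 0.05^2) = 200
gamma = 1.0 / (2.0 * 0.05 ** 2)

X = np.random.default_rng(0).normal(size=(5, 3))
K_manual = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / (2.0 * 0.05 ** 2))
assert np.allclose(K_manual, rbf_kernel(X, gamma=gamma))
# An analogous scikit-learn baseline would be sklearn.svm.SVC(C=5, kernel="rbf", gamma=gamma).
```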
In order to be more effective in the choice of representatives from each bin and to reduce running times, in practice we considered a slight change to Algorithms 1 and 2; a quantile-based sketch of the resulting equal-count binning is given after the modified steps.
Step 5 in Algorithm 1 was changed to:
  • Generate $\theta \sim \mathrm{Unif}[0,r]$ and calculate the $B$ intervals (bins) with an equal (or almost equal) number of values of the collection $K(a,v)+\theta$, $v \in S_1$.
  • For each $v \in S_1$ define $\tilde h_{a,\theta}(v)$ to be the bin containing the value $K(a,v)+\theta$.
Step 7 in Algorithm 2 was changed to:
  • Generate $\theta \sim \mathrm{Unif}[0,r]$ and calculate the $B$ intervals (bins) with an equal (or almost equal) number of values of the collection $\langle \tilde w_{n_2},\phi(v)\rangle+\theta$, $v \in S_1$.
  • For each $v \in S_1$ define $h_{\tilde w_{n_2},\theta}(\phi(v))$ to be the bin containing the value $\langle \tilde w_{n_2},\phi(v)\rangle+\theta$.
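A minimal sketch of this equal-count binning using empirical quantiles; this is our own illustrative code, not the authors' R implementation.

```python
import numpy as np

def equal_count_bins(proj, theta, B):
    """Assign each shifted value proj + theta to one of B bins holding an
    (almost) equal number of points, via the empirical quantiles of the values."""
    v = proj + theta
    edges = np.quantile(v, np.linspace(0.0, 1.0, B + 1)[1:-1])  # B-1 interior cut points
    return np.digitize(v, edges)   # bin labels in {0, ..., B-1}
```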
In Table 2 we include the CPU training time and classification errors obtained for the tested problems using the LSH-SVM Algorithm 2 (directed projections) and kernlab for the complete data set. After an extensive number of trials, we selected the parameters $\eta_1 = 0.001$, $\eta_2 = 0.005$, $N = 100$, $B = 40$ for the LSH-SVM algorithm. In each case, $N_1 = 20$ repetitions were considered and the minimum, maximum, mean, and standard deviation were calculated. For data set w8a, we set $\eta_2 = 0.008$ because the data are not balanced (approx. 3–97%), so that taking $\eta_2 = 0.001$ led to one-class samples. Steps 4 and 6 in Algorithms 1 and 2, respectively, were performed by discarding the bins corresponding to the $\eta_3 = 10\%$ largest and smallest values. We also include in Table 2 the number of support vectors found by each approach at the training stage. It can be observed that the running time obtained by our proposed algorithm is much lower than when using kernlab for the complete data set. In addition, there is no significant degradation of the classification error in many cases, even though the number of support vectors is much smaller. This seems to point out that many more support vectors than the ones really needed for a good classification were obtained by using the complete data set. An additional benefit of the reduced number of support vectors is the reduced fitting time, as shown in Table 3.
Finally, for completeness' sake, we have included a baseline comparison to simple random sampling with $N_1 = 20$ replicas over set a9a in order to highlight the performance of Algorithms 1 and 2. In Table 4 we show results for different sample sizes ranging from $l = 0.1$ to $l = 0.001$, the latter comparable to the size of the random samples used in Algorithms 1 and 2. Errors for $l = 0.1$ are essentially comparable to the results of Algorithm 2 and increase as the sample size decreases, as predicted. It is interesting to note, however, that the number of support vectors for a random sample with $l = 0.1$ is almost double that for Algorithm 2. The number of support vectors decreases with the sample size and is typically close to this value.
In order to study the benefit of our directed directions, we also solved the problems with the LSH-SVM Algorithm 1, which uses random directions for the projections. The results can be found in Table 5. Notice that, as should be expected, the running time is better when using random projections since no SVM problems need to be solved, although in some cases the difference is not very noticeable.
In addition, as expected, the directed directions separate the classes more accurately than random ones. This is illustrated in Figure 1, where histograms for the two kinds of projections are shown for problems covtype.binary and a9a from LIBSVM. Each figure represents the histogram of both types of projections over the same sample. Coloring is set by the most frequent class (−1 or 1) over each bin in the histogram.
However, although errors tend to be slightly smaller with Algorithm 2, the differences seem to be related to the complexity of the problem (dimension, nonlinearity).
The effect of changing the parameters is shown in the next series of figures for data set covtype.binary. Figure 2 shows the changes for a number of bins ranging from 10 to 250, for both Algorithms 1 and 2. The error decreases nonlinearly, as follows from Theorems 3 and 4, combining the increase in the number of bins with more separated points. Increasing the number of bins has a greater effect in Algorithm 2 than in Algorithm 1, probably because of the greater effect on separating points for the former. Time increases very slowly at first for Algorithm 2 and almost linearly for Algorithm 1. As the number of bins increases, time is almost equivalent for both algorithms. The number of support vectors increases almost linearly for both algorithms; however, Algorithm 1 has a smaller slope, indicating a greater efficiency in finding significant support vectors.
Figure 3 shows the effect of varying the number of projections. Error-wise, the effect on both algorithms is similar (the curves are parallel), with bigger errors for Algorithm 1. Time increases piecewise linearly with a slope change around 350 iterations. Before this change, Algorithm 1's slope is smaller than that of Algorithm 2; both curves appear to be parallel, however, after 350 projections. As with the number of bins, the number of support vectors increases almost linearly. Algorithm 2 is again more efficient in selecting significant support vectors.
Finally, Figure 4 shows the effect of varying the size of the second sample in Algorithm 2, given by $\eta_2$. The error decreases quite quickly, reaching a plateau. Time, however, increases nonlinearly, probably owing to the changing proportion of time employed in finding the first approximate solution. The number of support vectors appears to reach an optimal (lower) level for sample sizes around $\eta_2 = 0.005$.

5. Theoretical Results

The procedures described in Algorithm 1 (random projections) and Algorithm 2 (directed projections) produce subsamples of size $N \times B$ by appropriately selecting points from a series of $N$ samples, each distributed in $B$ bins. In each case an SVM model is adjusted based on the selected representatives. In this section, we give approximation results for both proposed methods and give some intuition as to when directed sampling is better than random sampling. The approximations are based on two main results: general risk minimization theory [35,36], used to bound, with high probability, the supremum of the differences between the original loss function (to be introduced in the preliminaries) and its approximation using a subsample of the data; the bounds are shown to depend on the trace, the spectral norm, and the Frobenius norm of the kernel $K$ over the subsample. In addition, deterministic approximation lemmas allow us to bound the difference between the solution $w^*$ and the one obtained by subsampling, as in Theorem 3. This then allows us to argue, in the case of directed projections, that the chosen points are, with high probability, more correlated with the normal vector of the optimal separating hyperplane. Finally, our results can be used to improve bounds over the unobserved theoretical misclassification error.
We begin by introducing some preliminaries.

5.1. Preliminaries

We assume that the $n$ observations $(X_i,y_i)$, $i=1,\ldots,n$, with $X_i \in \mathbb{R}^d$ and $y_i \in \{-1,1\}$, are independent with identical joint distribution $P$. In what follows, $E(h(X,y))$ stands for the expectation of any function $h$ of the random vector $(X,y)$ with respect to the probability $P$. Our aim is to construct the data-dependent function $g = g(w,b)$ with values in $\{-1,1\}$, as defined in (3), over an appropriate function space, such that $P(g(X)\neq y)$ is small. For this, we choose a vector $(w,b)$ minimizing an optimization problem equivalent to (2), defined over a class of data-defined functions over a subsample of the original observations $(X_i,y_i)$, $i=1,\ldots,n$.
Notice that (2) can be written as the unconstrained minimization problem
$$\min_{w,b}\ \tfrac{1}{2}\|w\|_{\mathcal H}^2 + C\sum_{i=1}^{n}\psi\big(-y_i(\langle w,\phi(X_i)\rangle + b)\big) \tag{9}$$
with $\psi(x) = \max(1+x, 0)$. Recall that we denote by $(w^*, b^*)$ the solution.
Define $A_n(w,b) := \frac{1}{n}\sum_{i=1}^{n}\psi\big(-y_i(f_w(X_i)+b)\big)$, with $f_w(x) = \langle w,\phi(x)\rangle$. This function is known as the hinge loss function. Divide the objective function of problem (9) by $nC$, and let $M = \frac{1}{2nC}$. Then, we can rewrite problem (9) as
$$\min_{w,b}\ M\|w\|_{\mathcal H}^2 + A_n(w,b). \tag{10}$$
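For reference, a two-line NumPy sketch (ours) of the empirical hinge loss $A_n(w,b)$: with $\psi(x)=\max(1+x,0)$ applied to $-y_i(f_w(X_i)+b)$, each term is the familiar $\max(1-y_i(f_w(X_i)+b),\,0)$.

```python
import numpy as np

def empirical_hinge_loss(margins, y):
    """A_n(w, b) = (1/n) * sum_i max(1 - y_i * (f_w(X_i) + b), 0),
    where `margins` holds the values f_w(X_i) + b."""
    return np.mean(np.maximum(1.0 - y * margins, 0.0))
```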
Clearly, using (5) and (6), the minimization problem (10) can in turn be restated as
$$\min_{w,b}\ A_n(w,b) \quad \text{over the set } F_n \tag{11}$$
with
$$F_n = \Big\{(w,b):\ \|w\|_{\mathcal H} \le \|w^*\|_{\mathcal H},\ |b| \le 1 + \|w^*\|_{\mathcal H}\max_j \sqrt{K(X_j,X_j)}\Big\}.$$
The next proposition gives bounds for the feasible points of problem (11) in terms of $C$, $n$, and the kernel $K$. For any $w = \sum_{i=1}^{n} w_i\phi(X_i)$ we use the following notation: $\|w\|_\infty = \max_{1\le i\le n}|w_i|$, $\|K\|_2 = \sup_{\|w\|_2=1}\|Kw\|_2$, $\|K\|_F^2 = \sum_{i,j=1}^{n} K_{i,j}^2$, and $K^{1/2} := [\,|K_{i,j}|^{1/2}\,]_{1\le i,j\le n}$ the entry-wise square root matrix. We also introduce the following assumptions on the kernel $K$.
K1
There exists a positive constant $C_K$ such that for all $x$, $K(x,x) \le C_K$.
K2
Given $d_1 > 0$, for all $x, y$ with $\|x - y\| > d_1$ there exists a positive constant $\varepsilon(d_1)$ such that $K(x,y) < \varepsilon(d_1)$.
Proposition 1.
Let $(w,b) \in F_n$. Then $\|w\|_{\mathcal H} \le R := C\min\big(\sqrt{n}\,\|K\|_2^{1/2},\ \|K^{1/2}\|_F\big)$ and $|b| \le 1 + R\sqrt{C_K}$.
Proof. 
The proof follows easily by bounding $w$ using (5). We have that $w = \sum_{i=1}^{n} w_i\phi(X_i)$ with $w_i = \lambda_i y_i$ and $0 \le \lambda_i \le C$. Then, $\|w\|_{\mathcal H}^2 = \sum_{i,j=1}^{n} w_i w_j K(X_i,X_j)$, satisfying $\|w\|_{\mathcal H}^2 \le \min\big(\|K\|_2\|w\|_2^2,\ \|w\|_\infty^2\|K^{1/2}\|_F^2\big) \le R^2$. On the other hand, $|b| \le 1 + \|w\|_{\mathcal H}\max_j \sqrt{K(X_j,X_j)}$. The result follows bounding $K(X_j,X_j) \le C_K$. □
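The two norm bounds used in the proof, $w^t K w \le \|K\|_2\|w\|_2^2$ and $w^t K w \le \|w\|_\infty^2\|K^{1/2}\|_F^2$ (the latter because $\|K^{1/2}\|_F^2 = \sum_{i,j}|K_{i,j}|$), can be sanity-checked numerically; a small sketch with synthetic data of our own:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(30, 30))
K = A @ A.T                                  # a symmetric PSD "kernel" matrix
w = rng.uniform(-1.0, 1.0, size=30)          # plays the role of the coefficients lambda_i * y_i

quad = w @ K @ w                                       # ||w||_H^2 in the proof
spec = np.linalg.norm(K, 2) * np.sum(w ** 2)           # ||K||_2 * ||w||_2^2
frob = np.max(np.abs(w)) ** 2 * np.sum(np.abs(K))      # ||w||_inf^2 * ||K^{1/2}||_F^2
assert quad <= spec + 1e-9 and quad <= frob + 1e-9
```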
Observe that if $K^S$ denotes the submatrix of $K$ formed by the values corresponding to the positions of the support vectors, zeroing the other components, that is, $K^S_{ij} = K_{ij}$ if $i \in SV$ or $j \in SV$ and $K^S_{ij} = 0$ otherwise, then the solution of (11) belongs to the set $F_S = \{(w,b):\ \|w\|_{\mathcal H} \le R_S,\ |b| \le R_S\sqrt{C_K} + 1\}$, where $R_S = C\min\big(\sqrt{|SV|}\,\|K^S\|_2^{1/2},\ \|(K^S)^{1/2}\|_F\big)$.
The next lemma from [35] gives a bound, in probability, between the hinge loss function and its expectation, i.e., between $A(w,b) := E\,\psi\big(-y(f_w(X)+b)\big)$ and $A_n(w,b)$. Clearly $E(A_n(w,b)) = A(w,b)$. This result will be at the heart of our theory.
Lemma 1.
Let us define $\Delta := \sup_{(w,b)\in F_n}\big[A(w,b) - A_n(w,b)\big]$. Then, with probability greater than $1-\delta$ we have that
$$\Delta < \frac{4}{n}\sqrt{\mathrm{Tr}(K)} + \Big(\sup_{(w,b)\in F_n}\big(\|f_w\|_\infty + |b|\big) + 1\Big)\sqrt{\frac{2\log(1/\delta)}{n}}$$
with $\mathrm{Tr}(K) = \sum_{i=1}^{n} K(X_i,X_i)$.
Proof. 
The proof follows [35], page 8, bounding $\sup_{x,y}\psi(-yf(x)) \le \|f\|_\infty + 1$ and $E\,\Delta$ by $\frac{2}{n}\sqrt{\mathrm{Tr}(K)}$ using Rademacher averages [35]. □
In the following subsections, we use the previous lemma to compare the values of the hinge loss function for the given data set and for subsamples of the data set generated randomly or by the LSH-SVM method. For some results, we require the following assumptions on the original sample:
S1
There exists a unique $(w,b)$ which minimizes $A_n(w,b)$ in problem (12) over $\widetilde F_n$.
S1’
With probability one, the collection of classes $\mathcal F = \cup_n F_n$ defined by an infinite sample $\{(X_i,y_i)\}_{i\ge 1}$ is such that $(w_{0,n},b_{0,n}) = \arg\min_{(w,b)\in F_n}A(w,b)$ converges to an overall solution $(w_0,b_0) \in \mathcal F$.
A sufficient condition for [S1] is that there is a different number of support vectors in each class (see [37]). This is usually the case for practical SVM problems. In addition, the overall minimum of $A(w,b)$ is the Bayes classifier (see [35]). If, with probability one, the class is rich enough, condition [S1'] then implies that $(w_{0,n},b_{0,n})$ converges to the Bayes classifier.
In Section 5.2, we consider the case when the subsamples are randomly chosen. Section 5.3 and Section 5.4 cover the cases when the subsamples are generated by the LSH-SVM Algorithms 1 and 2, respectively. In these latter subsections, we show that the approximation bounds improve with respect to the random case, and we also show how these bounds relate for the two algorithms.

5.2. Random Samples

Consider an index set $M \subset \{1,\ldots,n\}$ corresponding to a random sample of the original data set (without replacement) of size $l$. Recall, as shall be required below, that any function of the random sample may be thought of as a function of the sampled variables $(Z_1,t_1),\ldots,(Z_l,t_l)$, where $P((Z_k,t_k)=(X_i,y_i)) = 1/n$. Since the sample is without replacement, if $Z_i = 1$ when $(X_i,y_i)$ is selected among the $l$ trials, then $P(Z_{i_1}\cdots Z_{i_r}=1) = m_r/|SV|_r$. Set $\widetilde S$ to be the sample based on the index set $M$, and let $\widetilde K$ be the associated kernel matrix, that is, $\widetilde K_{st} = K(Z_s,Z_t)$.
Following the presentation above, define $A_l(w,b) := \frac{1}{l}\sum_{k=1}^{l}\psi\big(-t_k(f_w(Z_k)+b)\big)$.
As for problem (10), the unconstrained minimization problem with solution $(w_l^*,b_l^*)$ can be stated as minimizing $A_l$ over the set $F_{n,l} = \big\{(w,b):\ \|w\|_{\mathcal H} \le \|w_l^*\|_{\mathcal H},\ |b| \le 1 + \|w_l^*\|_{\mathcal H}\max_j\sqrt{\widetilde K(Z_j,Z_j)}\big\}$.
For our theory, we use a closely related minimization problem, where the feasible set $F_{n,l}$ is substituted by the set $\widetilde F_{n,l}$ defined as
$$\widetilde F_{n,l} = \begin{cases} F_{n,l} & \text{if } \|w_l^*\|_{\mathcal H} \le \|w^*\|_{\mathcal H},\\ F_n & \text{if } \|w_l^*\|_{\mathcal H} > \|w^*\|_{\mathcal H}.\end{cases}$$
That is, we consider the minimization problem
$$\min_{w,b}\ A_l(w,b) \quad \text{over the set } \widetilde F_{n,l}. \tag{12}$$
Observe that problem (12) is equivalent to an unconstrained problem of the type of (9) for some parameter $\widetilde C$. The following properties are satisfied:
(1) $\widetilde F_{n,l} \subset F_{n,l}$, $\widetilde F_{n,l} \subset F_n$.
(2) $\cup_l \widetilde F_{n,l} = F_n$.
(3) Following Proposition 1, for any $(w,b) \in \widetilde F_{n,l}$ we have that $\|w\|_{\mathcal H} \le R_l$ and $|b| \le C\sqrt{l\,\sup_M\|\widetilde K\|_2} + 1 \le 1 + R_l\sqrt{C_K}$, with
$$R_l := C\min\big(\sqrt{l}\,\sup_M\|\widetilde K\|_2^{1/2},\ \sup_M\|\widetilde K^{1/2}\|_F\big).$$
(4) $A_l(w_l^*,b_l^*)$ is a lower bound for $A_l(\widetilde w,\widetilde b)$, with $(\widetilde w,\widetilde b)$ the solution of problem (12).
Next, our objective is to bound the difference of the hinge loss functions corresponding to the original dataset and the random sample.
Set $\Delta_M := \sup_{(w,b)\in\widetilde F_{n,l}}\big[A_n(w,b)-A_l(w,b)\big]$. In order to bound $\Delta_M$, we use the following McDiarmid-type inequality for symmetric functions of samples without replacement, due to Cortes et al., 2008 (cited from Kumar et al. [36], Th. 2, pages 998–1000).
Theorem 1.
Let $Z_1,\ldots,Z_\ell$ be a sequence of random variables sampled uniformly without replacement from a fixed set of $\ell+u$ elements. Let $\Gamma : Z^\ell \to \mathbb{R}$ be a symmetric function such that for all $j \in \{1,\ldots,\ell\}$, $|\Gamma(Z_1,\ldots,Z_j,\ldots,Z_\ell) - \Gamma(Z_1,\ldots,Z_j',\ldots,Z_\ell)| \le c$. Then,
$$P\big(\Gamma - E(\Gamma) > \varepsilon\big) \le e^{-\frac{2\varepsilon^2}{\alpha(\ell,u)\,c^2}}$$
where $\alpha(\ell,u) = \dfrac{\ell u}{\ell+u-1}\cdot\dfrac{1}{1-1/(2\max(\ell,u))}$.
Based on Lemma 1 and Theorem 1, we obtain the following theorem.
Theorem 2.
Assume the kernel $K$ satisfies assumption [K1]. Consider a sample of size $l$ taken without replacement from an original sample of size $n$. Define $\Delta_M := \sup_{(w,b)\in\widetilde F_{n,l}}\big[A_n(w,b)-A_l(w,b)\big]$ and $T_l(K) := 2R_l\sqrt{C_K}+1$. Then, with probability greater than $1-\delta$,
$$\Delta_M \le \frac{4}{l}\sqrt{\mathrm{Tr}(\widetilde K)} + T_l(K)\sqrt{\frac{\tfrac12\log(1/\delta)\,(n-l)}{n\,l\,\big(1-\frac{1}{2\max(l,\,n-l)}\big)}}.$$
Proof. 
The proof follows directly from Lemma 1 and Theorem 1, bounding $\big|\Delta_M(Z)-\Delta_M(Z')\big| \le \frac{2\big(\sup_{w,b}\sup_j|f_w(X_j)+b|+1\big)}{l}$ and using that $E_M A_l(w,b) = A_n(w,b)$, where $E_M$ stands for the expectation with respect to the subsampling procedure, that is, conditional on the original sample $S$. Next we bound $\sup_{w,b}\sup_j|f_w(X_j)+b|$. By definition $f_w(x) = \langle w,\phi(x)\rangle$, so $\sup_{w,b}\sup_j|f_w(X_j)+b| \le \|w\|_{\mathcal H}\sup_j\sqrt{K(X_j,X_j)} + |b| \le 2R_l\sqrt{C_K}+1 = T_l(K)$ by our assumptions on $w$ (see (3)) and the kernel $K$. □
Using Theorem 2, we obtain the following bound:
$$A_n(w^*,b^*) - A_l(\widetilde w,\widetilde b) \le \min_{(w,b)\in\widetilde F_{n,l}}A_n(w,b) - A_l(\widetilde w,\widetilde b) \le \Delta_M.$$
Theorem 2 also allows us to give a bound between the minimizers $(w^*,b^*)$ and $(\widetilde w,\widetilde b)$. For this purpose, we assume [S1] applies to the minimizer $(\widetilde w,\widetilde b)$ of problem (12), and we present the next two lemmas.
Lemma 2.
Let $V : \mathbb{R}^n \to \mathbb{R}$ be a convex function, $F \subset \mathbb{R}^n$ a nonempty convex, closed, and bounded set, and $x^*$ the unique global minimizer of $V$ over $F$. Then, there exists $\varepsilon_o$ such that for all $0 < \epsilon < \varepsilon_o$ there exists $\delta = \delta(\epsilon)$ such that if $\|x - x^*\| > \delta$ we have that $V(x) - V(x^*) > \epsilon$ for any $x \in F$.
Proof. 
Let $\varepsilon_o := \max_{x\in F}V(x) - V(x^*)$, which is well defined because $V$ is continuous and $F$ compact. Moreover, since $V$ cannot be the constant function equal to zero, $\varepsilon_o$ is positive. Given any $\varepsilon_o > \epsilon > 0$, consider the level set $E = \{x \in F:\ V(x) = \epsilon + V(x^*)\}$. This set is not empty because the function $V$ is coercive, that is, $V(x)\to\infty$ as $\|x\|\to\infty$. Let us define $\delta = \delta(\epsilon)$ as the distance from $x^*$ to the set $E$. This distance exists because $E$ is a nonempty closed set inside a compact set, and is therefore compact.
Let $x \in F$ be such that $\|x - x^*\| > \delta$. Then, there exist $x_\epsilon \in E$ such that $\|x^* - x_\epsilon\| = \delta$ and $\lambda \in (0,1)$ with $x_\epsilon = \lambda x + (1-\lambda)x^*$; therefore, by convexity, $V(x_\epsilon) \le \lambda\big(V(x)-V(x^*)\big) + V(x^*) < V(x)$, and we obtain that $V(x) - V(x^*) > \epsilon$. □
Lemma 3.
Consider a function $V$ satisfying the assumptions of the previous lemma and $\widetilde V$ another function with unique minimizer $\widetilde x$ over $F$ and such that $\sup_{x\in F}|V(x)-\widetilde V(x)| \le \epsilon$, for some $\varepsilon_o/2 > \epsilon > 0$. Then $\|x^* - \widetilde x\| \le \delta(2\epsilon)$, with $\delta$ given as in the previous lemma.
Proof. 
Assume that $\|x^* - \widetilde x\| > \delta(2\epsilon)$. Then, by Lemma 2, we have that $2\epsilon < V(\widetilde x) - V(x^*)$; therefore, $2\epsilon + V(x^*) - \widetilde V(x^*) + \widetilde V(x^*) < V(\widetilde x) - \widetilde V(\widetilde x) + \widetilde V(\widetilde x)$. Using the assumption on each side of the inequality, we obtain $2\epsilon - \epsilon + \widetilde V(x^*) < \epsilon + \widetilde V(\widetilde x)$. This contradicts the fact that $\widetilde x$ is the minimizer of $\widetilde V$ over $F$. □
Observe that the previous lemmas apply whether the Hilbert space $\mathcal H$ is finite dimensional or infinite dimensional (in the latter case, by considering the restriction to the subspace of finite linear combinations of the data set, based on the dual representation [3]). Now the theorem can be stated.
Theorem 3.
Assume the kernel $K$ satisfies [K1]. Assume [S1] and [S1'] are satisfied. Then, there exist $n, l$ such that, with probability greater than $1-2\delta$, $\|(w^*,b^*) - (\widetilde w,\widetilde b)\| \le \delta(\varepsilon/2) + \delta(\varepsilon) + \varepsilon$ with
$$\varepsilon = \frac{4}{l}\sqrt{\mathrm{Tr}(\widetilde K)} + T_l(K)\sqrt{\frac{\tfrac12\log(1/\delta)\,(n-l)}{n\,l\,\big(1-\frac{1}{2\max(l,\,n-l)}\big)}},$$
$T_l(K) := 2R_l\sqrt{C_K}+1$ as in the previous theorem, and $\|\cdot\|$ denoting the induced norm in the Cartesian product space $\mathcal H\times\mathbb{R}$.
Proof. 
By [S1'] and [S1], $A(w,b)$ satisfies the assumptions of Lemma 2 over $\mathcal F$, so there exist $\varepsilon_0$ and $\delta(\epsilon)$ satisfying Lemma 2 for all $\epsilon < \varepsilon_0$ over $F_n$ for large enough $n$. In addition, [S1] is satisfied for $A_n(w,b)$ over $F_n$ and for $A_l(w,b)$ over $\widetilde F_{n,l}$. Then $\sup_{F_n}|A(w,b)-A_n(w,b)| \le \Delta < \varepsilon_0/2$, $\sup_{\widetilde F_{n,l}}|A_n(w,b)-A_l(w,b)| \le \Delta_M < \varepsilon_0/2$, and finally $\sup_{\widetilde F_{n,l}}|A(w,b)-A_l(w,b)| \le \Delta_M + \Delta < \varepsilon_0$, for $n, l$ large enough, with probability greater than $(1-\delta)^2 > 1-2\delta$. On the other hand, since $F_n = \cup_l F_{n,l}$ (property (2) above), setting $(w_{0,l},b_{0,l}) = \arg\min_{(w,b)\in F_{n,l}}A(w,b)$ we have that $\|(w_{0,l},b_{0,l}) - (w_{0,n},b_{0,n})\|$ converges to zero when $n, l$ go to infinity.
Then, with probability greater than $1-2\delta$, $\|(w_{0,n},b_{0,n}) - (w^*,b^*)\| \le \delta(\varepsilon/2)$ and $\|(w_{0,l},b_{0,l}) - (\widetilde w,\widetilde b)\| \le \delta(\varepsilon)$, for $\varepsilon < \varepsilon_0$, by applying the lemmas. Moreover, for any given $\epsilon$, $\|(w_{0,n},b_{0,n}) - (w_{0,l},b_{0,l})\| < \epsilon$ for $n, l$ large enough; therefore, using the triangle inequality we obtain $\|(w^*,b^*) - (\widetilde w,\widetilde b)\| \le \delta(\epsilon/2) + \delta(\epsilon) + \epsilon$. □
Finally, Lemma 1 and Theorem 2 can be used to improve bounds over the misclassification error, given by $L(w,b) = P\big(\operatorname{sign}(\langle w,\phi(X)\rangle+b) \neq y\big)$.
Indeed, following [35], we have that
$$L(\widetilde w,\widetilde b) \le A_l(\widetilde w,\widetilde b) + \sup_{(w,b)\in\widetilde F_{n,l}}\big[A_n(w,b)-A_l(w,b)\big] + \sup_{(w,b)\in\widetilde F_{n,l}}\big[A(w,b)-A_n(w,b)\big].$$
Then,
$$L(\widetilde w,\widetilde b) \le A_l(\widetilde w,\widetilde b) + \Delta + \Delta_M.$$
Therefore, the unobserved theoretical misclassification error is bounded by the observed hinge loss function plus two approximation errors. The bound denoted by $\Delta_M$ can be further improved if the subsample is generated by the LSH-SVM algorithm. The improved bound is established in the following subsections by Theorems 4 and 5, for the LSH-SVM algorithm with random projections and directed projections, respectively.

5.3. Random Projections

In Theorem 2, the quantity $T_l(K)$ can be improved if we are able to control the characteristics of the random sample. The main idea behind random projections (Algorithm 1) is to improve this bound by selecting, with high probability, a sample over a set of data points belonging to $B$ blocks satisfying $d(X_j,X_k) > d$ for $j, k$ in different blocks and some $d > 0$. In order to show this, we have the following lemma. We denote by $\hat S$ the sample created by Algorithm 1.
Lemma 4.
Assume the kernel $K$ satisfies [K1]. Let $0 < c, \delta$ be fixed constants. Set $p(c,\delta) := P\big(\max_{X,X'\in\hat S}(K(a,X)-K(a,X')) > Bc/\delta\big)$, and for each $X \in \hat S$ define $V_d(X) = \{X' \in \hat S:\ \|\phi(X)-\phi(X')\|_{\mathcal H} \le d\}$. Then
$$P\Big(\max_X\big|V_{c/(2\sqrt{C_K})}(X)\big| \le N-1\Big) \ge 1 - BN(1-\delta)^N p(c,\delta).$$
Proof. 
Recall that the LSH-SVM projecting Algorithm 1 creates the sample $\hat S$ of $BN$ points, with $N$ projection directions and $B$ bins for each direction, where $r = \max_{X,X'\in\hat S}\big(K(a,X)-K(a,X')\big)/B$. By construction, for any given direction, $r > Bc/(\delta B)$, and then $c/r < \delta$, with probability at least $p(c,\delta)$. For $a \sim N(0,I)$ consider the random variables $K(a,X)$ for any given $X$. Let $f_a(X,X')$ be the density of the random variable $K(a,X)-K(a,X')$. The probability $P\big(\tilde h_{a,\theta}(X) = \tilde h_{a,\theta}(X')\big) = \int_0^r f_{a}(X,X')(x)\,(1-x/r)\,dx$.
In particular, if $c_a = |K(a,X)-K(a,X')|$, then, conditional on $a$, $P\big(\tilde h_{a,\theta}(X)=\tilde h_{a,\theta}(X')\mid c_a\big) = 1-c_a/r$ whenever $c_a \le r$, and in that case $X, X'$ are in contiguous bins with probability $P = 1 - P\big(\tilde h_{a,\theta}(X)=\tilde h_{a,\theta}(X')\big) = \int_0^{c_a}\frac{d\theta}{r} = c_a/r$. Then, conditional on $a$, the event $\{|\tilde h_{a,\theta}(X)-\tilde h_{a,\theta}(X')| = 1\}$ equals $\{\theta < c_a\}$ and, for any $0 < c < r$, with probability greater than $c/r$, $c < c_a \le \|\phi(X)-\phi(X')\|_{\mathcal H}\,\|\phi(a)\|_{\mathcal H} \le \sqrt{C_K}\,\|\phi(X)-\phi(X')\|_{\mathcal H}$, the last inequality by [K1].
In addition, if $|\tilde h_{a,\theta}(X)-\tilde h_{a,\theta}(X')| \ge 2$ then $r < c_a \le \sqrt{C_K}\,\|\phi(X)-\phi(X')\|_{\mathcal H}$, using the same bounds as before on the kernel $K$.
It follows that, with probability at least $1 - BN(1-\delta)^N p(c,\delta)$, selected points from contiguous bins in any given projection will be at least $c/\sqrt{C_K}$ apart. Given any $X \in \hat S$ from projection $j$, for each projection $j' \neq j$, if there exists $X'$ from projection $j'$ with $\|\phi(X)-\phi(X')\|_{\mathcal H} < c/(2\sqrt{C_K})$, then $\|\phi(X)-\phi(X'')\|_{\mathcal H} > c/(2\sqrt{C_K})$ for all other $X''$ from projection $j'$. Thus $\|\phi(X)-\phi(X')\|_{\mathcal H} \le c/(2\sqrt{C_K})$ for at most $N-1$ points from $\hat S$. □
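The collision probability used in the proof, namely that two projected values at distance $c_a \le r$ fall in the same bin with probability $1-c_a/r$ over the uniform shift $\theta$, is easy to verify by simulation; a small sketch with our own illustrative numbers:

```python
import numpy as np

rng = np.random.default_rng(2)
r, c_a, trials = 1.0, 0.3, 200_000          # |K(a,X) - K(a,X')| = c_a <= r
x = rng.uniform(0.0, 10.0)                  # projected value K(a, X), arbitrary
theta = rng.uniform(0.0, r, size=trials)
same_bin = np.floor((x + theta) / r) == np.floor((x + c_a + theta) / r)
print(same_bin.mean(), 1.0 - c_a / r)       # empirical vs. theoretical 1 - c_a / r
```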
Set $\hat K$ to be the kernel matrix defined over the sample $\hat S$ and consider the corresponding minimization problem (12) with the feasible set denoted by $\hat F_{n,l}$. Define $\hat\Delta := \sup_{(w,b)\in\hat F_{n,l}}\big[A_n(w,b)-A_l(w,b)\big]$. We have the following result.
Theorem 4.
Let $\hat S$ be the sample obtained by Algorithm 1. Set $p_1 := 1 - \max\big(1 - BN(1-\delta)^N p(c,\delta),\ 0\big)$. Assume [K1] and [K2] are satisfied. Set
$$C_1 = \min\Big(C\big(N C_K + N(B-1)\,\varepsilon\big(c(1-\delta)/(2\sqrt{C_K})\big)\big),\ R_{\hat S}\sqrt{C_K}\Big),$$
where $R_{\hat S} = C\min\big(\sqrt{NB}\,\|\hat K\|_2^{1/2},\ \|\hat K^{1/2}\|_F\big)$, $T_1(K) = 2C_1+1$, and $\varepsilon(d)$ is defined in [K2]. Then, with probability greater than $1 - p_1 - \delta$,
$$\hat\Delta \le \frac{4}{NB}\sqrt{\mathrm{Tr}(\hat K)} + T_1(K)\sqrt{\frac{\tfrac12\log(1/\delta)\,(n-NB)}{n\,NB\,\big(1-\frac{1}{2\max(NB,\,n-NB)}\big)}}.$$
Proof. 
The sample $\hat S$ is a subset of $S$ satisfying the property that, for each $v \in \hat S$, there exist at most $N-1$ points at distance smaller than $r(1-\delta)$, over a set $S_1$ with $P(S_1) > 1-\delta$. We now bound $P(\hat\Delta \ge \varepsilon) \le P(\{\hat\Delta \ge \varepsilon\}\cap S_1) + P(S_1^c)$. The proof follows from bounding $P(\{\hat\Delta \ge \varepsilon\}\cap S_1)$. We use Theorem 1 and the bound $\sup_{w,b}\big(\|f_w\|_\infty + |b| + 1\big) \le 2C_1 + 1$. For this, use that over the set $S_1$, $\max_j\sum_i|\hat K(X_i,X_j)| \le N C_K + N(B-1)\,\varepsilon\big(c(1-\delta)/(2\sqrt{C_K})\big)$ by Lemma 4.
Then, we bound $E(\hat\Delta\,\mathbf 1_{S_1}) \le E(\hat\Delta)$ and use the bounds on the Rademacher averages following [35]. □
Remark 1.
By Lemma 4, the matrix $\hat K$ will satisfy, with high probability, a block-like property. This improves the bounds over $\|\hat K\|_2$ with respect to the case of a completely random sample.

5.4. Directed Projections

Random projections (Algorithm 1) are effective because they are able to assure a minimum distance among most elements of the sample; however, it may happen that certain randomly chosen directions are non-informative, i.e., nearly orthogonal to the normal $w^*$ of the optimal separating hyperplane. Algorithm 2 selects directions that are informative with high probability, i.e., close to the optimal solution $w^*$. This has a twofold benefit: on the one hand, it maintains a minimal distance among selected points; on the other, it concentrates the sample over a set of points that are close to the optimal separating hyperplane. Thus, by choosing a representative from each bin we are able to reduce the number of effective support vectors required to estimate the solution. Although our theoretical bounds are not as strong as we would like, they do offer a key to understanding why this method works better in certain cases.
Lemma 4 also holds for Algorithm 2, thus assuring the minimum distance property; however, Theorem 3 allows for a different approach. First we introduce some notation.
Let $Z_{i,j}(w) := \langle w, \phi(X_i)-\phi(X_j)\rangle$. Then, $|Z_{i,j}(w)| \le \|w\|_{\mathcal H}\,\|\phi(X_i)-\phi(X_j)\|_{\mathcal H}$. In addition, for any given direction $w$, the distances are ordered: that is, $|Z_{i,j}(w)| > r$ for bins that are 2 apart, $|Z_{i,j}(w)| > 2r$ for bins that are 3 apart, and so on. Theorem 3 then yields that, for the direction $\tilde w_{n_2}$ obtained by Algorithm 2, with probability greater than $1-\delta$,
$$\|\phi(X_i)-\phi(X_j)\|_{\mathcal H} \ge (p-1)\big(r - \eta(\varepsilon)\big)\big/\|w^*\|_{\mathcal H}$$
for points that are $p$ bins apart. Bins that are contiguous are harder to bound, so we just consider non-contiguous bins.
The rest of the approximation result is just as in the random projection case. This yields the following result, whose proof is omitted as it is exactly like the proof of Theorem 4. Here the subsample is denoted by $\bar S$, $\bar K$ is the kernel matrix defined over $\bar S$, and we consider the corresponding minimization problem (12) with feasible set $\bar F_{n,l}$. In addition, $\bar\Delta$ stands for the supremum of the difference between the original and approximated hinge loss functions, that is, $\bar\Delta := \sup_{(w,b)\in\bar F_{n,l}}\big[A_n(w,b)-A_l(w,b)\big]$.
Theorem 5.
Let $\bar S$ be the sample obtained by Algorithm 2. Assume [K1], [K2], [S1], and [S1'] are satisfied. Set
$$C_2 = \min\Big(C\Big(3N C_K + \sum_{p=2}^{B-1}\varepsilon\big((p-1)(r-\eta(\varepsilon))/\|w^*\|_{\mathcal H}\big)\Big),\ R_{\bar S}\sqrt{C_K}\Big),$$
with $\varepsilon(d)$ from [K2], $T_2(K) = 2C_2+1$, and $R_{\bar S} = C\min\big(\sqrt{NB}\,\|\bar K\|_2^{1/2},\ \|\bar K^{1/2}\|_F\big)$. Then, with probability greater than $1-\min((N+1)\delta,\,1)$,
$$\bar\Delta \le \frac{4}{NB}\sqrt{\mathrm{Tr}(\bar K)} + T_2(K)\sqrt{\frac{\tfrac12\log(1/\delta)\,(n-NB)}{n\,NB\,\big(1-\frac{1}{2\max(NB,\,n-NB)}\big)}}.$$
Moreover, directed projections satisfy an additional property stemming from the fact that, with high probability, the projections in the optimal direction $w^*$ of the actual support vectors fall in the same or in contiguous bins. We have the following result.
Lemma 5.
Assume $r > 2$. Let $w^*$ be the solution of problem (1). Then, $P\big(h_{w^*,\theta}(X_j) = h_{w^*,\theta}(X_i)\big) \ge \frac{r-2}{r}$ for any support vectors $X_j, X_i$. In any case, $P\big(|h_{w^*,\theta}(X_j) - h_{w^*,\theta}(X_i)| > 1\big) = 0$. This result also applies in the feature space $\mathcal H$; that is, $P\big(|\tilde h_{w^*,\theta}(\phi(X_j)) - \tilde h_{w^*,\theta}(\phi(X_i))| > 1\big) = 0$.
Proof. 
If $X_j$ is a support vector, $(w^*)^t X_j = y_j - b^*$ with $y_j = 1$ or $-1$, and $h_{w^*,\theta}(X_j) = \big\lfloor\frac{y_j - b^* + \theta}{r}\big\rfloor$. Thus, for two support vectors $X_i, X_j$ we have that $|(X_j-X_i)^t w^*| \le 2$. Then, as in the proof of Lemma 4, use that for a given direction $P\big(h_{w^*,\theta}(X_j) = h_{w^*,\theta}(X_i)\big) = \frac{r - |Z_{i,j}(w^*)|}{r} \ge \frac{r-2}{r}$. □
Even though $b^*$ is not known, Lemma 5 suggests looking at the central bins in order to select support vectors, or points such that $\phi(X_j)$ is close to the separating hyperplane. Thus, a reasonable guess based on projections is to eliminate the highest and lowest value bins. As we have seen, Lemma 3 assures that points in these center bins will concentrate support vectors and points that are close to the margins of the separating hyperplane.
Thus, directed projections improve over random projections. The extent of this improvement is hard to assess, though, and depends on the geometry of $S$. Experimental results show a greater improvement for high dimensional problems with many support vectors for the optimal solution.

6. Concluding Remarks

In this article, we propose a novel methodology for dealing with classification problems in very large data sets using support vector machines. Our findings are that using projections based on Locality-Sensitive Hashing (LSH) can lead to improved estimation and fitting times by selecting a smaller yet significant data set for model training. Moreover, we show that previous knowledge based on the problem at hand, such as solving the problem over a small sample and then projecting in the normal direction to the obtained hyperplane, can further improve error rates in some cases. One byproduct of the proposed algorithms is that reducing the number of support vectors of the solution yields an important reduction in fitting times without affecting the overall error.
Although we restricted our attention to SVM, the approach presented here can be readily applied to other classification methods. Theoretical results, albeit not as conclusive as the experimental results indicate, show that the improvement is related to better bounds over the target function class obtained by eliminating less informative points, such as points that are very close to each other or far from the actual support vectors. This suggests that our method acts as a dimension reduction technique, as measured by the decrease in the supremum of the infinity norm over the target class. An important related problem is considering sampling schemes not only for rows but also for columns, in order to address problems with a huge number of features as well. Further research is necessary to fully understand how efficient column sampling can be achieved with similar heuristics, for example by subsetting based on correlations with support vectors chosen over a smaller sample. Finally, since our method is embarrassingly parallel, future research will consider parallel implementations for further training time reductions for huge problems.

Author Contributions

These authors contributed equally to the theoretical, methodological, and computational aspects of this work. M.D.G.-L.'s work focuses on numerical optimization and support vector machines. C.C.L.'s work focuses on mathematical statistics and data mining. This paper is a result of complementing their areas of expertise. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ (accessed on 20 April 2022).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Bishop, C.M. Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2006.
  2. Vapnik, V. The Nature of Statistical Learning Theory; Springer: New York, NY, USA, 1995.
  3. Cristianini, N.; Shawe-Taylor, J. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods; Cambridge University Press: Cambridge, UK, 2000.
  4. Bunke, H.; Neuhaus, M. Bridging the Gap between Graph Edit Distance and Kernel Machines (Machine Perception and Artificial Intelligence); World Scientific Publishing Company: Hackensack, NJ, USA, 2007.
  5. Osuna, E.; Freund, R.; Girosi, F. Support Vector Machines: Training and Applications; Technical Report A.I. Memo No. 1602, C.B.C.L. Paper No. 144; Massachusetts Institute of Technology: Cambridge, MA, USA, 1997.
  6. Osuna, E.; Freund, R.; Girosi, F. Training support vector machines: An application to face detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Juan, PR, USA, 17–19 June 1997; pp. 130–136.
  7. Platt, J. Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods: Support Vector Learning; Scholkopf, B., Burges, C., Smola, A., Eds.; MIT Press: Cambridge, MA, USA, 1998; pp. 41–65.
  8. Fan, R.E.; Chen, P.H.; Lin, C.J. Working set selection using second order information for training support vector machines. J. Mach. Learn. Res. 2005, 6, 1889–1918.
  9. Joachims, T. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In Proceedings of the 10th European Conference on Machine Learning (ECML-98), Chemnitz, Germany, 21–23 April 1998; Nédellec, C., Rouveirol, C., Eds.; Springer: Berlin/Heidelberg, Germany, 1998; pp. 137–142.
  10. Lin, C.J. On the Convergence of the Decomposition Method for Support Vector Machines. IEEE Trans. Neural Netw. 2001, 12, 1288–1298.
  11. Serafini, T.; Zanghirati, G.; Zanni, L. Gradient projection methods for quadratic programs and applications in training support vector machines. Optim. Methods Softw. 2003, 20, 353–378.
  12. Serafini, T.; Zanni, L. On the working set selection in gradient-based decomposition techniques for support vector machines. Optim. Methods Softw. 2005, 20, 586–593.
  13. Zanni, L. An Improved Gradient Projection-based Decomposition Technique for Support Vector Machines. Comput. Manag. Sci. 2006, 3, 131–145.
  14. Gonzalez-Lima, M.D.; Hager, W.W.; Zhang, H. An Affine-Scaling Interior-Point Method for Continuous Knapsack Constraints with Application to Support Vector Machines. SIAM J. Optim. 2011, 21, 361–390.
  15. Woodsend, K.; Gondzio, J. Exploiting Separability in Large Scale Linear Support Vector Machine Training. Comput. Optim. Appl. 2011, 49, 241–269.
  16. Jung, J.H.; O'Leary, D.P.; Tits, A.L. Adaptive constraint reduction for training support vector machines. Electron. Trans. Numer. Anal. 2008, 31, 156–177.
  17. Ferris, M.C.; Munson, T. Interior point methods for massive support vector machines. SIAM J. Optim. 2002, 13, 783–804.
  18. Fine, S.; Scheinberg, K. Efficient SVM training using low-rank kernel representations. J. Mach. Learn. Res. 2002, 2, 243–264.
  19. Karatzoglou, A.; Smola, A.; Hornik, K.; Zeileis, A. kernlab—An S4 Package for Kernel Methods in R. J. Stat. Softw. 2004, 11, 1–20.
  20. Krishnan, S.; Bhattacharyya, C.; Hariharan, R. A randomized algorithm for large scale support vector learning. In Proceedings of the 20th Conference on Neural Information Processing Systems (NIPS 2007), Vancouver, BC, Canada, 3–6 December 2007; pp. 793–800.
  21. Balcazar, J.L.; Dai, Y.; Tanaka, J.; Watanabe, O. Fast training algorithms for Support Vector Machines. Theory Comput. Syst. 2008, 42, 568–595.
  22. Jethava, V.; Suresh, K.; Bhattacharyya, C.; Hariharan, R. Randomized algorithms for large scale SVMs. arXiv 2009, arXiv:0909.3609.
  23. Paul, S.; Boutsidis, C.; Magdon-Ismail, M.; Drineas, P. Random projections for Support Vector Machines. In Proceedings of the 16th International Conference on Artificial Intelligence and Statistics, Scottsdale, AZ, USA, 29 April–1 May 2013; pp. 498–506.
  24. Camelo, S.; Gonzalez-Lima, M.D.; Quiroz, A. Nearest neighbors methods for support vector machines. Ann. Oper. Res. 2015, 235, 85–101.
  25. Montañés, D.M.; Quiroz, A.; Dulce, M.; Riascos, A. Efficient nearest neighbors methods for support vector machines in high dimensional feature spaces. Optim. Lett. 2021, 15, 391–404.
  26. Nalepa, J.; Kawulok, M. Selecting training sets for support vector machines: A review. Artif. Intell. Rev. 2019, 52, 857–900.
  27. Birzandi, P.; Kim, K.T.; Youn, H.Y. Reduction of training data for support vector machine: A survey. Soft Comput. 2022, 26, 1–14.
  28. Indyk, P.; Motwani, R. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the 30th Annual ACM Symposium on Theory of Computing, Dallas, TX, USA, 24–26 May 1998; pp. 604–613.
  29. Yong, L.; Wenliang, H.; Yunliang, J.; Zhiyong, Z. Quick attribute reduct algorithm for neighborhood rough set model. Inf. Sci. 2014, 271, 65–81.
  30. Chen, Y.; Wang, P.; Yang, X.; Mi, J.; Liu, D. Granular ball guided selector for attribute reduction. Knowl.-Based Syst. 2021, 229, 107326.
  31. Mu, Y.; Hua, G.; Fan, W.; Chang, S.F. Hash-SVM: Scalable Kernel Machines for Large-Scale Visual Classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014.
  32. Litayem, S.; Joly, A.; Boujemaa, N. Hash-Based Support Vector Machines Approximation for Large Scale Prediction. In British Machine Vision Conference (BMVC); BMVA Press: Durham, UK, 2012; Volume 86, pp. 1–11.
  33. Ju, X.; Wang, T. A Hash Based Method for Large Scale Nonparallel Support Vector Machines Prediction. Procedia Comput. Sci. 2017, 108, 1281–1291.
  34. Datar, M.; Indyk, P.; Immorlica, N.; Mirrokni, V. Locality-Sensitive Hashing scheme based on p-stable distributions. In Proceedings of the 20th Annual Symposium on Computational Geometry, Brooklyn, NY, USA, 8–11 June 2004; pp. 253–262.
  35. Boucheron, S.; Bousquet, O.; Lugosi, G. Theory of Classification: A Survey of Some Recent Advances. ESAIM Probab. Stat. 2005, 9, 323–375.
  36. Kumar, S.; Mohri, M.; Talwalkar, A. Sampling Methods for the Nyström Method. J. Mach. Learn. Res. 2012, 13, 981–1006.
  37. Burges, C.; Crisp, D.J. Uniqueness of the SVM Solution. In Advances in Neural Information Processing Systems 12; Solla, S.A., Leen, T.K., Müller, K., Eds.; MIT Press: Cambridge, MA, USA, 2000; pp. 223–229.
Figure 1. Histograms of Directed projections and Random projections algorithms (Algorithms 1 and 2). Data sets covtype.binary (a) and a9a (b).

Figure 2. Plots of the effect of changing the number of bins over error (a), time (b), and number of SV (c) for data set covtype.binary.

Figure 3. Plots of the effect of changing the number of projections over error (a), time (b), and number of SV (c) for data set covtype.binary.

Figure 4. Plots of the effect of parameter η2 (size of the first small SVM problem in Algorithm 2 used to obtain projection directions) over error (a), time (b), and number of SV (c) for data set covtype.binary. Larger sample sizes were not considered because of the increase in training time.
Table 1. Problems tested.

Problem           Number of Data Points (n)    Number of Attributes (d)
a9a               48,844                       123
w8a               49,749                       300
covtype.binary    581,012                      54
cod-rna           59,535                       8
ijcnn1            49,990                       22
skin              245,057                      3
phishing          11,055                       68
Table 2. LSH-SVM Algorithm 2 vs. Kernlab. Each row reports the min, max, mean, and std over N1 = 20 runs of LSH-SVM Algorithm 2, followed by the Kernlab value.

a9a (Directed: η1 = 0.001, η2 = 0.005)
    CPU time (s):          min 12.28, max 15.5, mean 13.53, std 1.05; Kernlab 1972.58
    Classification error:  min 0.1993, max 0.2120, mean 0.2047, std 0.0035; Kernlab 0.1798
    # Support vectors:     min 1115, max 1223, mean 1168.31, std 29; Kernlab 23,004

w8a (Directed: η1 = 0.001, η2 = 0.008)
    CPU time (s):          min 21.94, max 31.20, mean 26.17, std 2.88; Kernlab 2313.6
    Classification error:  min 0.0288, max 0.0315, mean 0.0301, std 0.0006; Kernlab 0.0203
    # Support vectors:     min 24, max 486, mean 273.4, std 151.47; Kernlab 23,658

covtype (Directed: η1 = 0.001, η2 = 0.005; no Kernlab values reported)
    CPU time (s):          min 112.55, max 125.79, mean 118.72, std 3.44
    Classification error:  min 0.2062, max 0.2137, mean 0.2097, std 0.002
    # Support vectors:     min 1099, max 1199, mean 1148.65, std 26.54

cod-rna (Directed: η1 = 0.001, η2 = 0.005)
    CPU time (s):          min 3.65, max 4.54, mean 4.04, std 0.29; Kernlab 84.55
    Classification error:  min 0.0513, max 0.0561, mean 0.0536, std 0.0013; Kernlab 0.0460
    # Support vectors:     min 461, max 516, mean 484.1, std 15.76; Kernlab 6669

ijcnn1 (Directed: η1 = 0.001, η2 = 0.005)
    CPU time (s):          min 3.42, max 4.81, mean 3.73, std 0.30; Kernlab 49.64
    Classification error:  min 0.0377, max 0.0518, mean 0.0423, std 0.0034; Kernlab 0.0132
    # Support vectors:     min 297, max 349, mean 317.65, std 13.66; Kernlab 2900

skin (Directed: η1 = 0.001, η2 = 0.005)
    CPU time (s):          min 5.83, max 6.59, mean 6.04, std 0.21; Kernlab 224.45
    Classification error:  min 0.0103, max 0.0118, mean 0.0111, std 0.0004; Kernlab 0.0053
    # Support vectors:     min 487, max 522, mean 502.3, std 9.23; Kernlab 4156

phishing (Directed: η1 = 0.001, η2 = 0.005)
    CPU time (s):          min 2.98, max 3.16, mean 3.04, std 0.063; Kernlab 27.31
    Classification error:  min 0.0660, max 0.1239, mean 0.1007, std 0.0166; Kernlab 0.0275
    # Support vectors:     min 443, max 484, mean 464.2, std 10.84; Kernlab 3221
Table 3. CPU estimation times (s) for proposed algorithms and Kernlab over the test sample. For all cases, the test sample is randomly chosen and corresponds to 20% of the complete data set.

Problem           Fit Time Kernlab    Fit Time Algorithm 1    Fit Time Algorithm 2
a9a               27.54               1.53                    1.60
w8a               63.39               1.1                     0.26
covtype.binary    n/a                 19.08                   11.83
cod-rna           7.12                0.40                    0.54
ijcnn1            2.05                0.31                    0.35
skin              18.12               1.78                    2.67
phishing          0.83                0.16                    0.12
Table 4. Random-sample baselines for set a9a (N1 = 20 runs for each sample fraction l). As the sample size increases, the error decreases but the number of SV increases. The total error for l = 0.1 is essentially equal to the results for LSH-SVM Algorithm 2, but the number of SV is consistently larger.

l = 0.1
    CPU time (s):          min 6.75, max 9.54, mean 7.37, std 0.6
    Classification error:  min 0.18, max 0.20, mean 0.20, std 0.003
    # Support vectors:     min 2983, max 3096, mean 3031, std 26.9

l = 0.01
    CPU time (s):          min 0.08, max 0.13, mean 0.09, std 0.013
    Classification error:  min 0.2, max 0.23, mean 0.22, std 0.005
    # Support vectors:     min 361, max 384, mean 373, std 4.86

l = 0.005
    CPU time (s):          min 0.03, max 0.04, mean 0.04, std 0.003
    Classification error:  min 0.21, max 0.23, mean 0.23, std 0.004
    # Support vectors:     min 189, max 195, mean 193, std 1.56

l = 0.001
    CPU time (s):          min 0.016, max 0.028, mean 0.019, std 0.004
    Classification error:  min 0.22, max 0.24, mean 0.24, std 0.004
    # Support vectors:     min 39, max 39, mean 39, std 0
Table 5. LSH-SVM Algorithm 1 (random) vs. Algorithm 2 (directed). Each row reports the min, max, mean, and std over N1 = 20 runs of Algorithm 1, followed by the mean for Algorithm 2 (directed).

a9a (Random: η1 = 0.001)
    CPU time (s):          min 12.14, max 15.49, mean 12.78, std 1.04; directed mean 13.53
    Classification error:  min 0.2092, max 0.2175, mean 0.2128, std 0.0026; directed mean 0.2047
    # Support vectors:     min 1186, max 1262, mean 1235.9, std 22.08; directed mean 1168.31

w8a (Random: η1 = 0.001)
    CPU time (s):          min 13.25, max 44.45, mean 19.03, std 8.73; directed mean 26.17
    Classification error:  min 0.0293, max 0.0309, mean 0.0299, std 0.0004; directed mean 0.0301
    # Support vectors:     min 34, max 405, mean 189.8, std 125.02; directed mean 273.4

covtype (Random: η1 = 0.001)
    CPU time (s):          min 35.18, max 41.97, mean 37.6, std 2.025; directed mean 118.72
    Classification error:  min 0.2329, max 0.2511, mean 0.2395, std 0.0043; directed mean 0.2097
    # Support vectors:     min 1685, max 1816, mean 1749.85, std 30.98; directed mean 1148.65

cod-rna (Random: η1 = 0.001)
    CPU time (s):          min 2.81, max 3.62, mean 3.15, std 0.27; directed mean 4.04
    Classification error:  min 0.0525, max 0.0567, mean 0.0540, std 0.0011; directed mean 0.0536
    # Support vectors:     min 373, max 443, mean 399.5, std 18.25; directed mean 484.1

ijcnn1 (Random: η1 = 0.001)
    CPU time (s):          min 3.29, max 4.50, mean 3.83, std 0.33; directed mean 3.73
    Classification error:  min 0.0351, max 0.0488, mean 0.0432, std 0.0037; directed mean 0.0423
    # Support vectors:     min 323, max 407, mean 350.75, std 19.59; directed mean 317.65

skin (Random: η1 = 0.001)
    CPU time (s):          min 5.15, max 9.56, mean 5.91, std 0.99; directed mean 6.04
    Classification error:  min 0.01405, max 0.0198, mean 0.0166, std 0.0012; directed mean 0.0111
    # Support vectors:     min 362, max 420, mean 393.2, std 17.30; directed mean 502.3

phishing (Random: η1 = 0.001)
    CPU time (s):          min 2.16, max 3.07, mean 2.44, std 0.28; directed mean 3.04
    Classification error:  min 0.0809, max 0.1542, mean 0.1198, std 0.0222; directed mean 0.1007
    # Support vectors:     min 412, max 462, mean 440.3, std 12.18; directed mean 464.2
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
