Efficiency and Performance Analysis of a Sparse and Powerful Second Order SVM Based on LP and QP

This paper analyzes the efficiency and performance of a new algorithm, the "Second Order Support Vector Machine (SOSVM)", which can be viewed as an offshoot of the popular SVM and is based on both its conventional QP version and its LP version. Our main goal is to produce a machine that is: 1) sparse and efficient; 2) powerful (kernel based) but not overfitted; 3) easily realizable. Experiments on benchmark data show that, to classify a new pattern, the proposed SOSVM requires as few samples as 2.7% of the original data set, 4.8% of those used by the conventional QP SVM, or 48.3% of those used by Vapnik's LP SVM, which is already sparse. Despite this heavy reduction in test cost, its classification accuracy is very similar to that of the most powerful QP SVM, while the machine remains very simple to produce. Moreover, two new terms, "Generalization Failure Rate (GFR)" and "Machine-Accuracy-Cost (MAC)", are defined to measure the generalization deficiency and the accuracy cost of a detector, respectively, and are used to compare different machines. Results show that our machine attains a GFR as low as 1.4% of that of the QP SVM or 1.5% of that of Vapnik's LP SVM, and a MAC as low as 2.6% of that of the QP SVM or 35.9% of that of Vapnik's sparse LP SVM. Finally, having only two types of parameters to tune, this machine is straightforward and cheaper to produce than the most popular state-of-the-art machines in this direction. Together, these results fulfill the three key goals that the machine is built for.
Keywords: Generalization failure rate; Kernel machine; LP; QP; machine accuracy cost; Second Order Support Vector Machine; sparse


I. INTRODUCTION
Run time optimization of classifiers is a crucial issue for fast data classification. A prominent example is the work of Viola and Jones [1] on face detection based on a cascade of boosted weak classifiers. This framework is not efficiently applicable to kernel based classifiers such as support vector machines (SVMs) [2], because boosting with such strong classifiers as components is less effective. In many applications, the flexibility of such kernel machines is a real advantage. SVM based classifiers play a leading role in pattern classification with the highest accuracy, and one of their key properties is that the learned classifier can be expressed in terms of only a subset of the training patterns, known as support vectors (SVs). However, as the computational load of using such a classifier to classify a pattern is proportional to the number of SVs, SV sparsity is extremely important for large datasets. This is especially the case when training is done once on powerful computers that can handle large data but prediction must be done many times, possibly on small low-powered devices in real time. This motivates the design of kernel based classifiers that maintain the trade-off between accuracy and sparsity. Consequently, this problem has recently become a focus of research.
In this paper, we propose a new sparse algorithm, the "Second Order SVM (SOSVM)", and carry out experimental studies on it as well as on the standard QP SVM [2] and Vapnik's LP SVM [2] to analyze their performance and efficiency on the basis of computational cost and generalization ability. For simplicity of discussion, only the Gaussian kernel is applied throughout the whole work. Standard machine learning benchmark data is used for the experiments.
In Section II related works are discussed, in Section III we re-describe SVMs, in Section IV we explain our approach whereas Section V is for experiments and we use Section VI for conclusion and discussion including future work.

II. RELATED WORKS
Related work can be roughly, though not disjointly, classified into:
• approaches [3]-[11] to the design of Reduced SVMs (RSVMs) that demand less computational load than the standard SVM for classifying a pattern;
• approaches [7], [12]-[16] that exploit SVMs as components of a detector with a structured architecture for classification;
• approaches [17]-[20] that develop SVM related cheaper classifiers, which are different from usual RSVMs;
• approaches [21]-[31] that investigate ensemble detectors built by boosting weak classifiers;
• approaches [32]-[40] that improve one or more of the three variables cost, efficiency, and accuracy of a detector by applying different techniques to different hypothetical single classifiers, using one of them or combining several of them, considering un/balanced data.
Regarding the first class of approaches, RSVMs demand only a fraction of the kernel evaluations to classify a pattern. Wavelet approximations of these vectors have also been investigated in [6] for an efficient evaluation of the arguments to which the kernel function is applied. However, while [4], [9], [11] proposed some smart iterative algorithms for reduced SVMs with impressive results, [9] reported that the implementations of [4], [11] ran out of memory on large datasets, whereas the practical implementation of [9] deviates considerably from its defined approach and requires heavy parameter selection.
The second class of approaches, in contrast to the first one, focuses on structured SVM-based classification for pattern detection. Heisele et al. [13] studied a hierarchy of linear SVMs with a single nonlinear SVM at the end. Thresholds were tuned for optimizing classification performance and speed, followed by feature selection. Romdhani et al. [7] proposed a single chain of SVMs that is also optimized by threshold tuning and by approximating a fully nonlinear SVM that has to be computed beforehand, whereas a decision tree with linear SVMs is suggested in [12]. Sahbi and Geman [14] presented a tree-structured hierarchy of SVMs that is optimized by the reduced set technique of [7] and threshold selection, and operates on an application specific partitioning of the space of patterns following different poses. Huo Chen [15] discussed numerical strategies for an optimal cascade and checked three heuristics on synthetic data using a binary SVM at each stage of a cascade.
The third class of approaches is somewhat related to the second one, as both originate from the SVM principle, but differs from it considerably from the structural point of view. Maji et al. [17], [20], [36] showed that SVMs using histogram as well as additive kernels are faster and outperform the linear SVM. Ladicky and Torr [18] proposed a novel locally linear SVM classifier with a smooth decision boundary and bounded curvature, suggesting a trade-off between the number of anchor points and the expressivity of the classifier in order to avoid overfitting and speed problems. Xu et al. [19] introduced a post-processing algorithm that compresses the learned SVM by further training on the SVs while adding a few extra training parameters.
The fourth class of approaches bears some similarity to the second one in its construction principle, as both use a cascade-like approach. Xiao et al. [21] used an idea named "Dynamic Cascade" as a face detector that is trained on a large data set by dividing it into subsets and working on them, using "Bayesian Stump" as the weak learner for boosting. Luo H. [22] designed an optimization for a cascaded classifier that finds the optimum thresholds of the different stages for a fixed setup. Saberian et al. [23] introduced a mathematical model for a cascaded detector relating classification and complexity. Chen et al. [24] proposed an algorithm for a cascaded detector considering operational cost, accuracy, and feature extraction cost. Chen et al. [25] presented a general cascade framework that unifies detection learning and alignment for face detection. Li Zhang [26] offered a fast cascaded object detector having fewer stages and using logistic regression as the weak learner, which emphasizes training efficiency. Raykar et al. [27] proposed a soft cascade where classifiers accept or reject patterns following probability distributions induced by the classifiers of the earlier stages. Considering a fixed order of the different stages in the cascade, they tried to find a trade-off between accuracy and feature acquisition cost. Visentini et al. [28] devised an algorithm that dynamically builds a cascade of classifiers to speed up the Online Boosting technique. The cascade explicitly considers the computational cost of the involved features to maintain real-time performance, while its classifiers are automatically tuned to balance speed and accuracy. Saberian et al. [29] suggested a cascade boosting algorithm, fast cascade boosting (FCBoost), that minimizes a Lagrangian risk while considering speed and accuracy. They introduced the concept of "neutral predictors" that automatically determines the cascade configuration, such as the number of cascade stages and the number of weak learners in each stage. Xu et al. [30] offered a tree of classifiers to balance test cost and accuracy, while Xu et al. [31] analyzed the trade-off problem considering one more variable, feature orientation cost.
Finally, the fifth class of approaches is quite diverse. Fu et al. [32] discussed the problem of combining linear SVMs to classify a non-linear data set and reported experimental results showing that their method can achieve the efficiency of LSVMs in the prediction phase while providing a classification performance comparable to nonlinear SVMs. Cheng-Jhan [33] proposed a pedestrian detector that cascades AdaBoost and SVM classifiers in different stages. A classifier for digit recognition with reduced operational cost and improved features was proposed by Maji et al. [34]. It was also claimed to have the best result in all three aspects, namely accuracy, training cost, and test cost, while using histogram-gradient features and an intersection kernel SVM. Gu and Han [35] designed a Clustered Support Vector Machine (CSVM) as a weighted combination of linear SVMs (LSVMs) trained on clustered subsets of the training data to separate the data locally. These combined LSVMs are regularized globally to leverage the inter-cluster information and avoid overfitting in each cluster. They derive a data-dependent generalization error bound for CSVM, which explains the advantage of CSVM over the linear SVM. Sharma et al. [37] offered an approach for learning a non-linear SVM at reduced computational cost in the test phase and empirically analyzed the trade-off between encoder and classifier complexity and strength. Osadchy et al. [38] proposed a so-called hybrid classifier to tackle data sets with high asymmetry, where a large portion of the pattern space belongs to the negative class; their kernel hybrid classifier is more efficient than SVM while having similar accuracy [39]. Vedaldi et al. [40] offered a three-stage classifier combining linear, quasi-linear, and non-linear kernel SVMs. They showed that increasing the non-linearity of the kernels increases their discriminative power at the cost of an increased computational complexity. Nevertheless, their three-stage cascade to overcome the complexity cost has resulted in quite a 'heavy' algorithm in both training and testing.

III. SUPPORT VECTOR MACHINE (SVM)
The Support Vector Machine (SVM) is a popular, state-of-the-art machine learning technique that has been confirmed as a very powerful tool for supervised classification. In this part, we re-describe the SVM with its two main variants: one is the standard and most common method using quadratic programming (QP), which we call QPSVM, while the other is Vapnik's linear programming SVM, which we call VLPSVM. We also make a brief comparison between the two.

A. Quadratic Programming SVM (QPSVM)
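Here, we briefly review the basic learning algorithm of the QP based Support Vector Machine (SVM) using margin maximization between two classes, which consists in finding the separating hyperplane that is furthest from the closest object; a detailed introduction can be found in [2].
Consider a binary classification problem where a set of training patterns $\{(x_i, y_i)\}_{i=1}^{N}$ with labels $y_i \in \{-1, +1\}$ is given. As the objective of the SVM algorithm is to find the optimal separating hyperplane that separates these patterns into two classes, it offers a classifier using a decision function (for the input pattern $x$) of the form
\[ f(x) = \mathrm{sign}\left( w \cdot \phi(x) + b \right) = \mathrm{sign}\left( \sum_{i=1}^{N} \alpha_i y_i k(x_i, x) + b \right), \]
where $k(x_i, x) = \phi(x_i) \cdot \phi(x)$ is a kernel function and the parameters $w$ and $b$ are found from a series of calculations starting from the following QP problem:
\[ \min_{w, b, \zeta} \quad \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \zeta_i \tag{1} \]
\[ \text{s.t.} \quad y_i \left( w \cdot \phi(x_i) + b \right) \geq 1 - \zeta_i, \tag{2} \]
\[ \zeta_i \geq 0, \quad i = 1, 2, \ldots, N. \tag{3} \]
Here the set of constraints (2) implies that the decision function should classify all patterns from the given training set correctly up to some tolerable errors, the slack variables $\zeta_i > 0$ hold for margin-outward-deviated patterns (that is, patterns staying outwards from their class margins), and $C > 0$ is a parameter of the classifier that controls the trade-off between the two main goals of the objective function in (1): one is to maximize the margin between the two classes and the other is to minimize the number of misclassifications on the training patterns.
Common practice for solving this problem is to solve its dual, developed by introducing a Lagrangian. The Lagrangian of the problem (1)-(3) is
\[ L(w, b, \zeta, \alpha, \mu) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \zeta_i - \sum_{i=1}^{N} \alpha_i \left[ y_i \left( w \cdot \phi(x_i) + b \right) - 1 + \zeta_i \right] - \sum_{i=1}^{N} \mu_i \zeta_i, \tag{4} \]
where $\alpha_i \geq 0$ and $\mu_i \geq 0$ are Lagrange multipliers, and we get the corresponding dual problem as
\[ \max_{\alpha} \quad \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j k(x_i, x_j) \tag{5} \]
\[ \text{s.t.} \quad 0 \leq \alpha_i \leq C, \quad i = 1, 2, \ldots, N, \tag{6} \]
\[ \sum_{i=1}^{N} \alpha_i y_i = 0. \tag{7} \]
This is also a QP problem, and the optimum values of its variable $\alpha$ are used to find the primal variables $w$, $b$ as $w = \sum_{i=1}^{N} \alpha_i y_i \phi(x_i)$ and $b = y_s - w \cdot \phi(x_s)$, where $s$ is the index of any pattern for which $0 < \alpha_s < C$. One of the KKT conditions for the problem (1)-(3) is
\[ \alpha_i \left[ y_i \left( w \cdot \phi(x_i) + b \right) - 1 + \zeta_i \right] = 0, \tag{8} \]
so $\alpha_i$ can be non-zero only for patterns that satisfy constraint (2) with equality. These patterns having $\alpha_i > 0$ are the support vectors (SVs), which are usually far fewer (depending on the data set) than the total training set size, which shows QPSVM to be sparse. Among these SVs, the patterns with $\alpha_i < C$ (those having $\zeta_i = 0$) lie on the margin of their own class, whereas the patterns with $\alpha_i = C$ (those having $\zeta_i > 0$) stay outwards from their corresponding margins. Interestingly, the constraint (7), that is $\sum_{i=1}^{N} \alpha_i y_i = 0$, ensures that in this QPSVM the SV set must have members from both classes. In the QPSVM model, SVs are the only training patterns that contribute to the design of an optimal classifier.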
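As a rough illustration of the dual (5)-(7), the following sketch trains a Gaussian-kernel QPSVM with scikit-learn's `SVC` (an off-the-shelf solver swapped in for the example, not necessarily the one used in this work) and reads off the SVs and their multipliers. The synthetic two-class data, the helper name `fit_qpsvm`, and the kernel convention $k(x, z) = \exp(-\|x - z\|^2 / (2\sigma^2))$ are assumptions made only for this example.

```python
import numpy as np
from sklearn.svm import SVC


def fit_qpsvm(X, y, C, sigma):
    """Train a soft-margin SVM with a Gaussian kernel (hypothetical helper)."""
    # Assumed kernel form exp(-||x - z||^2 / (2 sigma^2)), i.e. gamma = 1 / (2 sigma^2)
    gamma = 1.0 / (2.0 * sigma ** 2)
    return SVC(C=C, kernel="rbf", gamma=gamma).fit(X, y)


# Synthetic, overlapping two-class data (illustrative only, not a benchmark set)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(40, 2)),
               rng.normal(1.0, 1.0, size=(40, 2))])
y = np.array([-1] * 40 + [1] * 40)

C = 4.0
clf = fit_qpsvm(X, y, C=C, sigma=1.0)

sv_idx = clf.support_                    # indices of patterns with alpha_i > 0
signed_alpha = clf.dual_coef_.ravel()    # scikit-learn stores alpha_i * y_i
alpha = np.abs(signed_alpha)

print("number of SVs:", len(sv_idx), "of", len(X))
print("sum of alpha_i * y_i:", signed_alpha.sum())   # constraint (7)
print("classes among SVs:", sorted(set(y[sv_idx].tolist())))
```

The printed quantities correspond to the properties discussed above: the multipliers respect the box constraint (6), their signed sum vanishes as required by (7), and the SV set therefore contains patterns from both classes.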

B. Vapnik's LP SVM (VLPSVM)
Here, we concisely go through the linear programming approach proposed by Vapnik to find a separating hyperplane that is very similar to that of the QPSVM but demands comparatively less computation to classify a pattern. More details can be found in [2].
Assuming that the classifier has the same form of kernel expansion using the SVs as in the QPSVM, Vapnik used a linear objective function to minimize the sum of all the coefficients used in the kernel expansion. Each coefficient is associated with its corresponding KCV (kernel computing vector) in the expansion. For clarity, we name these vectors of this machine "Expansion Vectors (EVs)", which are similar to the SVs in QPSVM.
Considering that the decision function preserves exactly the same form of kernel expansion as the QPSVM and that the error constraints of the QPSVM also remain almost the same, Vapnik proposed this VLPSVM with a focus on minimizing the number of kernel computations by reducing the EVs of the separating hyperplane that has the weight vector $w_V = \sum_{j=1}^{N} \lambda_j y_j \phi(x_j)$. For this purpose, he formed a linear objective function using the coefficients of the EVs directly and coupling the error penalty on top of the error constraints as below:
\[ \min_{\lambda, \xi, b_V} \quad \sum_{i=1}^{N} \lambda_i + C_V \sum_{i=1}^{N} \xi_i \tag{9} \]
\[ \text{s.t.} \quad y_i \left( \sum_{j=1}^{N} \lambda_j y_j k(x_j, x_i) + b_V \right) \geq 1 - \xi_i, \tag{10} \]
\[ \lambda_i \geq 0, \tag{11} \]
\[ \xi_i \geq 0, \quad i = 1, 2, \ldots, N. \tag{12} \]
As in the QPSVM, the set of constraints (10) implies that the decision function should classify all patterns from the given training set correctly up to some tolerable errors, the slack variables $\xi_i > 0$ hold for absolute-unity-outward-deviated patterns (that is, training patterns having $y_i \left( w_V \cdot \phi(x_i) + b_V \right) < 1$), and $C_V > 0$ is a parameter of the classifier that controls the trade-off between data learning and overfitting. For this machine, the bias term $b_V$ of the decision function is an optimization variable of the main problem (9)-(12) and is found along with the other optimization variables $\lambda$ and $\xi$. The optimum values of $\lambda$ are used to find the weight vector $w_V$ as $w_V = \sum_{j} \lambda_j y_j \phi(x_j)$. Training patterns $x_j$ with coefficients $\lambda_j > 0$ are the EVs, which are usually far fewer (depending on the data set) than the total training set size, showing VLPSVM to be sparser. Unlike QPSVM, here we have no constraint that forces the machine to have KCVs from both classes. Also unlike QPSVM, in the case of VLPSVM we directly optimize the primal and get the optimal value of the bias along with the optimal coefficient set of the expansion vectors. Interestingly, for this machine the values of the absolute-unity-outward-deviation parameters $\xi_i$ are also obtained as part of the optimum variables by solving the LP.
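The LP (9)-(12) can be handed to a generic LP solver almost verbatim. The sketch below is one possible realization using `scipy.optimize.linprog`; it is not taken from this work. The variable ordering $z = [\lambda, \xi, b_V]$, the helper names `gaussian_kernel`, `fit_vlpsvm` and `predict`, the tolerance used to decide $\lambda_j > 0$, the synthetic data, and the Gaussian kernel form $\exp(-\|x - z\|^2 / (2\sigma_V^2))$ are all assumptions for illustration.

```python
import numpy as np
from scipy.optimize import linprog


def gaussian_kernel(A, B, sigma):
    """Assumed Gaussian kernel: k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def fit_vlpsvm(X, y, C_V, sigma_V, tol=1e-8):
    """Solve the LP (9)-(12) with variable vector z = [lambda, xi, b_V]."""
    n = len(y)
    y = y.astype(float)
    K = gaussian_kernel(X, X, sigma_V)
    # Objective (9): sum(lambda) + C_V * sum(xi); b_V has zero cost
    cost = np.concatenate([np.ones(n), C_V * np.ones(n), [0.0]])
    # Constraint (10) rewritten as A_ub @ z <= b_ub:
    #   -y_i * sum_j lambda_j y_j K_ij - xi_i - y_i * b_V <= -1
    A_ub = np.hstack([-(np.outer(y, y) * K), -np.eye(n), -y[:, None]])
    b_ub = -np.ones(n)
    # Constraints (11), (12): lambda, xi >= 0; the bias b_V is free
    bounds = [(0, None)] * (2 * n) + [(None, None)]
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    lam, b_V = res.x[:n], res.x[-1]
    ev = lam > tol                      # EVs are the patterns with lambda_j > 0
    return X[ev], y[ev], lam[ev], b_V


def predict(X_new, X_ev, y_ev, lam, b_V, sigma_V):
    """Evaluate the kernel expansion over the EVs only."""
    score = gaussian_kernel(X_new, X_ev, sigma_V) @ (lam * y_ev) + b_V
    return np.where(score >= 0, 1, -1)


# Synthetic two-class data (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(40, 2)),
               rng.normal(2.0, 1.0, size=(40, 2))])
y = np.array([-1] * 40 + [1] * 40)

X_ev, y_ev, lam, b_V = fit_vlpsvm(X, y, C_V=4.0, sigma_V=1.0)
train_acc = np.mean(predict(X, X_ev, y_ev, lam, b_V, sigma_V=1.0) == y)
print("number of EVs:", len(lam), "of", len(X), "| training accuracy:", train_acc)
```

Because the bias is an ordinary LP variable here, it is returned by the solver together with $\lambda$ and $\xi$, which mirrors the remark above that no separate bias computation is needed for this machine.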
Although QPSVM and VLPSVM do not share exactly the same constraint set, some of the constraints they use are almost the same. For example, they use nearly the same error constraint and the same non-negative lower bounds on the variables (the KCV expansion coefficients $\alpha$, $\lambda$ and the margin/absolute-unity outward deviations $\zeta$, $\xi$).
Both machines are heavily influenced by the error penalty parameter (on top of the kernel parameter). In the QPSVM, this parameter, $C$, enters the optimized dual through the constraint, whereas the VLPSVM uses its parameter, $C_V$, by directly optimizing the primal cost function that includes it, which gives it a chance to have further significance.
For the QPSVM, maximizing the margin and minimizing the error are the two basic modules of its mathematical modelling, while finding the best trade-off between minimizing the error and minimizing the sum of the KCV coefficients is the main theme of VLPSVM. To do so, in its mathematical formulation, VLPSVM directly puts the sum of the non-negative coefficients of the KCVs in the objective function to be minimized, which gives a sparser solution. Remarkably, with this direct involvement of the non-negative coefficients of the KCVs (through a non-negative summation) in the cost function, VLPSVM very closely replicates the QPSVM. However, so far, compared to the QPSVM, VLPSVM misses many of the interesting properties that make QPSVM academically richer, such as a concrete and validated theoretical base with an efficient dual transformation that couples the kernel functions in the simplest and most productive fashion. Still, our experiments on benchmark data as well as other reports [41] show that, considering classification performance with generalization, VLPSVM is as competent as QPSVM while being more efficient, which shows the LP SVM to be empirically richer and more productive. It may appear paradoxical to the basic principle of SVM that a machine with fewer KCVs offers generalization performance similar to that of a machine with more KCVs, but the key point here is that the KCVs in VLPSVM do not have exactly the same topological and geometric interpretation as those in QPSVM, despite the fact that they are widely called by the same name (SVs) in the literature. Along the same line, unlike in QPSVM (due to its constraint $\sum_{i=1}^{N} \alpha_i y_i = 0$), there is no condition in VLPSVM that the KCV set contains training patterns from both classes, which may help it to reduce the number of KCVs in the decision function.
Furthermore, as the $L_1$ norm usually tends toward sparser solutions than the $L_2$ norm [42], the indirect $L_1$ norm in the VLPSVM cost function leads to further sparsity compared to QPSVM, which uses the $L_2$ norm. Another concern related to the larger number of KCVs of the QPSVM (compared to VLPSVM) is its KKT condition $\alpha_i \left[ y_i \left( w \cdot \phi(x_i) + b \right) - 1 + \zeta_i \right] = 0$, which forces all the training patterns staying on the margin of their own class or outwards to be KCVs. That means that certain kinds of training patterns must be included in the KCV set of the QPSVM, and this becomes especially serious for large and noisy datasets, as they contain a big portion of such overlapping and nonseparable examples. On the contrary, VLPSVM has no such condition, which gives it the flexibility to choose KCVs from a more scattered pattern space following the demands of the stochastic and topological properties of the data patterns, leading it to pick up few but crucial patterns that are well suited to be KCVs for a very sparse but powerful classifier with strong generalization capability.

IV. PROPOSED METHOD: SOSVM
While a powerful classifier is essential to handle the difficulties of large and noisy data, controlling the classifier complexity is also important to achieve better generalization. Additionally, considering both cost and accuracy, the best classifier is the sparsest one that has the highest generalization capability and the least test error. To serve this purpose, we try a novel algorithm that applies both QP and LP in a structured sequence.
As discussed earlier, although both QPSVM and VLPSVM are sparse, VLPSVM produces a sparser solution than QPSVM while offering very similar accuracy. Still, this sparsity from VLPSVM is not sufficient for large data sets. So, we look for a machine that is even sparser and faster but more generalized and powerful, aiming at real-time classification on very large and complex data.
We know that the sparseness of SVMs heavily depends on the noise and complexity of the data. When the data set is very noisy, a well generalized QPSVM may get more outliers, which will be included in the SV set in addition to the patterns that are just on the margins. So, the number of SVs will soar while the generalization capability of the machine will also rise, and this SV set is one of the best representative sets of the whole data. Moreover, as the SVs from QPSVM are sufficient to represent the discrimination between the classes, we consider only these SV patterns (which also mostly stay around the discrimination boundary, following the margin maximization concept) for the next manipulation in order to produce our efficient classifier by further sparsification without losing generalization ability. We then run LP (in VLPSVM fashion) on this SV set, as this LP will impose a coefficient vector carrying weights of these SV patterns to minimize the objective function while maintaining classification accuracy and generalization potential. Hence, this weight vector will have updated coefficient values (relative to the QPSVM SV coefficients), being sparse for a second time, and will promote an extensive computational reduction by enabling a much smaller number of KCVs (after throwing away a large part of the SVs) to be involved in the final decision function. This gives us a twofold benefit. First, the training set is condensed by a pure filtration that picks only the significant patterns, which are already the bases of a theoretically solid and powerful classifier and are sole representers of the data. By this, we also abstain from the computation of novel representatives of the SVs, as this relies upon complex optimization problems that are susceptible to initialization, step sizes, etc.
Second, we take advantage of the sparser and more flexible pattern selecting capability of VLPSVM, which picks KCVs from a scattered and random pattern space. Additionally, the sequential training of two inductive sub-machines, where the second one is denser, being truncated, and is a function of the first one, leads to a final machine that is a higher-order function. So, our discriminator virtually plays the functional role of a second-order decision function, which echoes the name "Second Order SVM (SOSVM)" of our algorithm, while this higher-order nature better handles the random and nonlinear behavior of the data.
The steps of the SOSVM algorithm are:
1: Input: A training set $\{(x_i, y_i)\}_{i=1}^{N}$
2: Output: A discriminator $f_{SOSVM}(\cdot)$
3: Select two of the best pairs of (Penalty parameter, Kernel parameter) $\equiv (C, \sigma), (C_V, \sigma_V)$ for the two stages
4: Run QPSVM on the training set, solving the following problem:
\[ \max_{\alpha} \quad \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j k_{\sigma}(x_i, x_j) \]
\[ \text{s.t.} \quad \sum_{i=1}^{N} \alpha_i y_i = 0, \qquad 0 \leq \alpha_i \leq C; \quad i = 1, 2, \ldots, N \]
5: Extract the SVs with their labels $\{(x_l, y_l)\}_{l=1}^{M}$ from the QPSVM using $\alpha_i > 0$
6: Run LP in VLPSVM fashion on these SV patterns, solving the following problem:
\[ \min_{\lambda, \xi, b_V} \quad \sum_{i=1}^{M} \lambda_i + C_V \sum_{i=1}^{M} \xi_i \]
\[ \text{s.t.} \quad y_i \left( \sum_{j=1}^{M} \lambda_j y_j k_{\sigma_V}(x_j, x_i) + b_V \right) \geq 1 - \xi_i, \qquad \lambda_i \geq 0, \qquad \xi_i \geq 0; \quad i = 1, 2, \ldots, M \]
7: Extract the EVs with their labels $\{(x_m, y_m)\}_{m=1}^{P}$ using $\lambda_m > 0$ and the bias $b_V$ from the LP
8: Construct the discriminator $f_{SOSVM}(\cdot)$ from these EVs, their coefficients, and the bias $b_V$
Patterns having non-zero coefficient values in the final output are the KCVs of our SOSVM, and we call them "Machine Vectors (MVs)", whereas the bias of this SOSVM comes from the solution of the final optimization problem; these components construct the SOSVM using the corresponding patterns and labels. It is worth mentioning here that while the optimizations in the two stages are done in sequence, their supporting parameters are found simultaneously by a joint search using a modified cross validation technique, which is in accordance with treating the overall system as a single machine. Fig. 1 shows the decision boundaries, with the number of kernel computing vectors (KCVs) and the training error rates, from QPSVM, VLPSVM, and SOSVM on the Banana data set from the machine learning benchmark. Fig. 1(a) shows the decision boundary for QPSVM with $C = 4096$ and $\sigma = 2$, Fig. 1(b) shows the decision boundary for VLPSVM with $C_V = 4$ and $\sigma_V = 1$, and the decision boundary for SOSVM with $C = 8$, $\sigma = 0.125$, $C_V = 4$ and $\sigma_V = 1$ is shown in Fig. 1(c). Banana is a two dimensional data set with 400 training patterns, from which QPSVM uses 94 KCVs, VLPSVM uses 14, and our SOSVM uses only 11, with training error rates of 6.75%, 8.75%, and 8.75%, respectively. Therefore, to classify a single pattern, SOSVM demands only 11.7% of the kernel evaluations of QPSVM (which is sparse) and 78.6% of those of VLPSVM (which is sparser) while offering very similar accuracy.
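The two-stage procedure described above (QP on the full training set, then LP on the resulting SVs) can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation used in this work: scikit-learn's `SVC` stands in for the QP stage, `scipy.optimize.linprog` for the LP stage, the Gaussian kernel is assumed to be $\exp(-\|x - z\|^2 / (2\sigma^2))$, and the names `fit_sosvm`, `predict_sosvm`, the tolerance `tol`, the synthetic data, and the parameter values are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog
from sklearn.svm import SVC


def gaussian_kernel(A, B, sigma):
    """Assumed Gaussian kernel: k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def fit_sosvm(X, y, C, sigma, C_V, sigma_V, tol=1e-8):
    # Stage 1: QPSVM on the whole training set, keep only its SVs
    qp = SVC(C=C, kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2)).fit(X, y)
    X_sv, y_sv = X[qp.support_], y[qp.support_].astype(float)
    m = len(y_sv)

    # Stage 2: LP in VLPSVM fashion on the SVs, variables z = [lambda, xi, b_V]
    K = gaussian_kernel(X_sv, X_sv, sigma_V)
    cost = np.concatenate([np.ones(m), C_V * np.ones(m), [0.0]])
    A_ub = np.hstack([-(np.outer(y_sv, y_sv) * K), -np.eye(m), -y_sv[:, None]])
    b_ub = -np.ones(m)
    bounds = [(0, None)] * (2 * m) + [(None, None)]
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    lam, b_V = res.x[:m], res.x[-1]

    # Machine Vectors: SVs that keep a non-zero coefficient after the LP
    mv = lam > tol
    return {"X_mv": X_sv[mv], "y_mv": y_sv[mv], "lam": lam[mv],
            "b_V": b_V, "sigma_V": sigma_V, "n_sv": m}


def predict_sosvm(model, X_new):
    K = gaussian_kernel(X_new, model["X_mv"], model["sigma_V"])
    score = K @ (model["lam"] * model["y_mv"]) + model["b_V"]
    return np.where(score >= 0, 1, -1)


# Synthetic two-class data and hypothetical parameter values (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.5, 1.0, size=(60, 2)),
               rng.normal(1.5, 1.0, size=(60, 2))])
y = np.array([-1] * 60 + [1] * 60)

model = fit_sosvm(X, y, C=8.0, sigma=1.0, C_V=4.0, sigma_V=1.0)
pred = predict_sosvm(model, X)
print("training patterns:", len(X))
print("stage-1 SVs      :", model["n_sv"])
print("stage-2 MVs      :", len(model["lam"]))
```

By construction, the MVs are a subset of the stage-1 SVs, so the number of kernel evaluations per test pattern can only stay equal or decrease relative to the QP stage. How much it decreases, and at what accuracy, depends on the data and on the four parameters.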

A. Key Terms to Analyze Machine's Perfection
So far, the convention has been to use the test error rate of a classifier to assess its generalization quality. This tells us about the classifier's performance on test data, which is a must to know, but it may not give complete information about the machine's capability to bridge between the training and test data. So, we define a novel term called "Generalization Failure Rate (GFR)" that includes the machine's performance on the training set, on the test set, and their difference. To evaluate the classifier's deficiency, GFR is based on two coupled pieces of information: 1) how much the classifier tends to overfit; and 2) how badly it performs on the test set.
Further, to resolve the confusion between the usefulness of an expensive machine with the highest accuracy and a cheaper machine with acceptable accuracy, another new term, "Machine-Accuracy-Cost (MAC)", expressing cost per accuracy, is defined.

1) Generalization Failure Rate (GFR):
The main goal of a classification algorithm is to discover a discriminating function, based on the training set (input patterns and the corresponding labels), that will generalize well by classifying novel patterns with the least possible errors. However, to make it applicable in real time, the secondary objective is that the classifier should be as sparse as possible; that is, in the case of a basis-vector based machine, it should have as few basis vectors as possible. But this basis set (and hence its size) radically influences the machine's generalization performance.
If this basis set leads towards a very simple model, it fails to learn the data complexity and thus performs poorly on both the training and the test set by underfitting.
On the contrary, if this basis set leads towards a very complex model, it learns the irrelevant detail and noise in the training dataset (weakening the general model), leading to a decrease of the training error by overfitting and an increase of the test error through generalization failure. Thus, measuring the generalization quality of a classifier is indispensable. However, although both overfitting and underfitting can lead to a failure of the model's performance, the most frequent problem in machine learning is overfitting. So, we start by focusing on it and define a term called "Overfitting Tendency (OT)" as the difference between the Test Error Rate and the Train Error Rate per Train Error Rate; mathematically,
\[ OT = \frac{TestErrorRate - TrainErrorRate}{TrainErrorRate}. \]
Hence, OT gets higher for a higher value of $\frac{TestErrorRate}{TrainErrorRate}$, which increases for a lower Train Error and a higher Test Error. Further, to include the loss caused by underfitting, we divide this OT by the Test Accuracy and define the result as the GFR, which measures the overall generalization deficiency of the model, that is
\[ GFR = \frac{OT}{TestAccuracy}. \]
It is clear that GFR gets higher by an increase in OT, by a decrease in the Test Accuracy, or by both. Note that the term "GFR" is defined here on the assumption that $TestErrorRate, TrainErrorRate \in (0, 100)\%$ and $TestErrorRate > TrainErrorRate$, which is the usual case.
2) Machine-Accuracy-Cost (MAC):
In almost all cases, it is desired to have an efficient machine, that is, a machine with low computational cost but high performance. Therefore, we need to build a machine that demands few kernel evaluations (to classify a test pattern), hence one with a small number of Kernel Computing Vectors and a high test accuracy. (These vectors are called Support Vectors, Basis Vectors, Expansion Vectors, or Machine Vectors by different authors for different machines; here we call all vectors that involve kernel evaluation "Kernel Computing Vectors (KCVs)".)
To measure the achievement of such a property by a machine, we define the term "Machine-Accuracy-Cost (MAC)" as
\[ MAC = \frac{\text{Number of KCVs } (\#KCVs)}{\text{Test Accuracy } (TeA)}. \]
So, a machine with the highest number of KCVs and the lowest test accuracy will have the maximum MAC (which is never desired), whereas a machine with the lowest number of KCVs giving the highest test accuracy will have the minimum MAC (which is always desired).
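The three quantities defined above are simple ratios and can be computed directly. The sketch below assumes that error rates are given in percent and that Test Accuracy is taken as 100 minus the Test Error Rate, also in percent; the text does not fix the unit of Test Accuracy, so this scaling is an assumption. The function names are hypothetical, and in the usage example only the training error rate (8.75%) and the number of KCVs (11) come from the Banana example above, while the test error rate of 10.5% is an invented value used purely for illustration.

```python
def overfitting_tendency(train_err, test_err):
    """OT = (TestErrorRate - TrainErrorRate) / TrainErrorRate, rates in percent."""
    # The definition assumes both rates lie in (0, 100) and test > train
    if not (0.0 < train_err < test_err < 100.0):
        raise ValueError("requires 0 < train_err < test_err < 100 (percent)")
    return (test_err - train_err) / train_err


def generalization_failure_rate(train_err, test_err):
    """GFR = OT / TestAccuracy, with TestAccuracy assumed to be 100 - test_err."""
    test_acc = 100.0 - test_err
    return overfitting_tendency(train_err, test_err) / test_acc


def machine_accuracy_cost(n_kcvs, test_err):
    """MAC = #KCVs / TestAccuracy, with TestAccuracy assumed to be 100 - test_err."""
    return n_kcvs / (100.0 - test_err)


# Training error and #KCVs from the Banana example; the test error is hypothetical
ot = overfitting_tendency(train_err=8.75, test_err=10.5)
gfr = generalization_failure_rate(train_err=8.75, test_err=10.5)
mac = machine_accuracy_cost(n_kcvs=11, test_err=10.5)
print(f"OT = {ot:.4f}, GFR = {gfr:.6f}, MAC = {mac:.4f}")
```

Only ratios of GFR or MAC between machines are compared in the experiments, so a different but consistent scaling of Test Accuracy (for example a fraction instead of a percentage) would change the absolute values but not those ratios.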

B. Experimental Set-up and Results
In this section, the experimental results obtained from the proposed method are presented to show the efficiency of the SOSVM and to compare it with QPSVM and VLPSVM. The experiments were performed on six benchmark machine learning datasets [43], namely Banana, Diabetics, Heart, Thyroid, Titanic and Twonorm, as listed in Table I. For all three machines, the Gaussian kernel is used throughout the experiments. For QPSVM and VLPSVM, the penalty parameters $C$, $C_V$ and the kernel parameters $\sigma$, $\sigma_V$ are chosen based on the lowest cross-validation error rate for each dataset using a five-fold cross-validation scheme. These two cases are run with $C \in \{2^{-2}, 2^{0}, 2^{2}, \ldots, 2^{12}\}$ and $\sigma \in \{2^{-2}, 2^{0}, \ldots, 2^{6}\}$. For SOSVM, there are $C$, $\sigma$ in the first stage and $C_V$, $\sigma_V$ in the second stage. To find the best $C$, $\sigma$, $C_V$ and $\sigma_V$, a modified five-fold cross-validation scheme is implemented. In this scheme, 4 folds of randomly chosen training data are fed to the first stage for a particular value of $C$ and $\sigma$. The KCVs from the first stage are fed as the training set to the second stage with a particular value of $C_V$ and $\sigma_V$. The KCVs returned by the second-stage VLPSVM are treated as the overall KCVs of SOSVM, and this classifier is tested on the remaining fold of the training data. The parameters $C$, $\sigma$, $C_V$, $\sigma_V$ with the lowest cross-validation error rate are chosen as the best ones. This approach is run with $C \in \{2^{-2}, 2^{0}, 2^{2}, \ldots, 2^{12}\}$, $\sigma \in \{2^{-2}, 2^{0}, \ldots$ [...]
To evaluate the quality of the results obtained by the proposed SOSVM, they are compared with the results obtained by QPSVM and VLPSVM in Table I. Table I [...] But in the case of our proposed SOSVM, the average number of KCVs is 18.29 (SD 8.34), which is 1/9 of that of QPSVM and 2/3 of that of VLPSVM. Therefore, it is clear that in most cases SOSVM results in a reduction in the number of KCVs, and in some cases in a substantial reduction.
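The modified five-fold cross-validation described above, in which all four parameters are searched jointly and each candidate is scored by the error of the complete two-stage machine on the held-out fold, might look roughly like the sketch below. It is an illustration under assumptions rather than the procedure used for the reported results: scikit-learn's `SVC` and `scipy.optimize.linprog` stand in for the QP and LP stages, the Gaussian kernel is assumed to be $\exp(-\|x - z\|^2 / (2\sigma^2))$, the helper names are hypothetical, the data is synthetic, and the tiny grids are chosen only to keep the example fast.

```python
import itertools
import numpy as np
from scipy.optimize import linprog
from sklearn.svm import SVC


def rbf(A, B, sigma):
    """Assumed Gaussian kernel: exp(-||a - b||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))


def sosvm_fit(X, y, C, sigma, C_V, sigma_V, tol=1e-8):
    # First stage: QPSVM, keep its KCVs (the SVs)
    qp = SVC(C=C, kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2)).fit(X, y)
    Xs, ys = X[qp.support_], y[qp.support_].astype(float)
    m = len(ys)
    # Second stage: LP in VLPSVM fashion on those KCVs, z = [lambda, xi, b_V]
    cost = np.concatenate([np.ones(m), C_V * np.ones(m), [0.0]])
    A_ub = np.hstack([-(np.outer(ys, ys) * rbf(Xs, Xs, sigma_V)),
                      -np.eye(m), -ys[:, None]])
    res = linprog(cost, A_ub=A_ub, b_ub=-np.ones(m),
                  bounds=[(0, None)] * (2 * m) + [(None, None)], method="highs")
    lam, b_V = res.x[:m], res.x[-1]
    keep = lam > tol
    return Xs[keep], ys[keep], lam[keep], b_V


def sosvm_predict(model, X_new, sigma_V):
    X_mv, y_mv, lam, b_V = model
    score = rbf(X_new, X_mv, sigma_V) @ (lam * y_mv) + b_V
    return np.where(score >= 0, 1, -1)


def joint_cv(X, y, grid_C, grid_sigma, grid_CV, grid_sigmaV, n_folds=5, seed=0):
    """Return (lowest mean CV error, best (C, sigma, C_V, sigma_V))."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    best_err, best_params = np.inf, None
    for params in itertools.product(grid_C, grid_sigma, grid_CV, grid_sigmaV):
        errs = []
        for k in range(n_folds):
            val = folds[k]
            trn = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            model = sosvm_fit(X[trn], y[trn], *params)   # train on 4 folds
            pred = sosvm_predict(model, X[val], params[3])
            errs.append(np.mean(pred != y[val]))         # test on the 5th
        if np.mean(errs) < best_err:
            best_err, best_params = float(np.mean(errs)), params
    return best_err, best_params


# Synthetic data and deliberately small, hypothetical grids (illustrative only)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.5, 1.0, size=(40, 2)),
               rng.normal(1.5, 1.0, size=(40, 2))])
y = np.array([-1] * 40 + [1] * 40)
grid = [2.0 ** 0, 2.0 ** 2]
best_err, best_params = joint_cv(X, y, grid, grid, grid, grid)
print("best CV error:", best_err, "with (C, sigma, C_V, sigma_V) =", best_params)
```

The design point worth noting is that the held-out fold is never scored with the first-stage machine alone: every candidate $(C, \sigma)$ is judged only through the final classifier it leads to, which is what makes the search joint rather than stage-wise.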
Moreover, in order to show how much better our machine generalizes than the traditional machines, the performance of the proposed method is compared with that of the traditional QPSVM and VLPSVM in terms of the Generalization Failure Rate (GFR). In Table II, the GFRs of the traditional QPSVM, VLPSVM and the proposed SOSVM are listed for the various datasets. Also, the ratios of the GFRs of the proposed SOSVM with respect to the traditional machines are listed to show how well SOSVM generalizes from the training data. It can be seen from the table that the average GFR value of our machine is as small as 30% of that of the most powerful classifier, the QP SVM, and 2% of that of the LP SVM.
Finally, in order to judge the proposed method in terms of the Machine Accuracy Cost (MAC), the MAC values of the proposed method along with those of the traditional QPSVM and VLPSVM are listed in Table III. Also, the ratios of the MACs of the proposed SOSVM with respect to the traditional machines are listed to show how well our machine minimizes the cost. It can be seen from the table that the average MAC value of our machine is as small as 16% of that of the most powerful classifier, the QPSVM, and 80% of that of the sparse LP SVM.

VI. CONCLUSION AND FUTURE WORK
We have developed a fast but powerful classifier by using sequential optimization that is supported by a simultaneous parameter search. We have also defined two new terms, "GFR" and "MAC", that can be directly and easily measured to verify a detector's quality. Our classifier is straightforward and requires little effort to train. Compared to the state-of-the-art sparse classifiers, it is more efficient, with an average MAC value as small as 16% of the standard QP [...]
Due to this exceptionally good performance, one question arises: how does it manage to perform better than Vapnik's LP SVM (VLPSVM) or the standard QP SVM (QPSVM)? We do not have a theoretical explanation for this yet (it is left for future work), but one plausible explanation could be the following. Through the second-layer training in the sequential combination of the two sub-machines generated by QP and LP, respectively (using the corresponding parameters found by a joint and simultaneous search), we get a set of machine vectors that is filtered and scaled (hence learned) to second order, and whose stochastic and topological properties are more complex and sophisticated than those of the Support Vectors (in the case of QPSVM) or the Expansion Vectors (in the case of VLPSVM), while the discriminator works in a similar way. Thus, our algorithm produces a hybrid and unconventional hyperplane, based on a compact second-order representer set coupled with the corresponding coefficient vector and bias, that collectively adopts the statistical and geometric properties of the training data very skillfully, so that generalization is boosted.
While our classifier has consistently outperformed state-of-the-art complex and sparse classifiers with respect to computational cost and accuracy, we are considering some further work on it, parts of which are given below:
1) An extension from two stages to further stages.
2) A deep theoretical analysis relating the data characteristics, the components of the sub-machines, and their sequential behavior and pattern-space sharing, including the low GFR and MAC, would be interesting.
Finally, an efficient and accurate classifier like our SOSVM is essential. For example, our classifier is indispensable in real-life settings where one may have more time and resources to train but very little to test.