Parameter Optimization for Nadaraya-Watson Kernel Regression Method with Small Samples

Many current regression algorithms have unsatisfactory prediction accuracy with small samples. To solve this problem, a regression algorithm based on Nadaraya-Watson kernel regression (NWKR) is proposed. The proposed method selects the parameter directly from the standard deviation of the training data, optimized with leave-one-out cross-validation (LOO-CV). Good generalization performance of the proposed parameter selection is demonstrated empirically on small sample regression problems with Gaussian noise. The results show that the proposed parameter optimization method is more robust and accurate than other methods for different noise levels and different sample sizes, and indicate the importance of Vapnik's ε-insensitive loss for regression problems with small samples.

Keywords: small samples regression; Nadaraya-Watson kernel regression; parameter optimization; loss function; cross validation


I. INTRODUCTION
Regression is one of the most fundamental and useful statistical techniques and is widely used to model practical problems arising from such fields as economics, psychology, management, signal processing, product design and medicine. It relates explanatory variable(s) to a response variable and builds predictive models. Given a set of independent observations $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ from a population $(X, Y)$, where X and Y are called the explanatory variable(s) and the response variable respectively, we want to find a function $f(x)$, assumed to be smooth, such that $y_i = f(x_i) + \delta_i$, where the $\delta_i$ are independent, identically distributed random noise terms with $E(\delta_i) = 0$.

At present, there are many available regression analysis models. In general, these regression models can be divided into parametric regression models and nonparametric regression models (Wand & Jones, 1995). Parametric regression models can be specified by a finite number of parameters, which implies that the regression function f(x) is known except for the values of the parameters. Linear regression models and polynomial regression models are typical of the parametric models usually applied. Parametric regression models give a distinct interpretation of the relationship between X and Y, but the choice of parametric model depends on the situation. Restricting f(x) to belong to a parametric family means that f(x) can sometimes be too rigid (Zhang, Huang, et al, 2007). Once a parametric family is chosen, the mathematical form is fixed regardless of whether it is appropriate in reality, which could result in incorrect conclusions in the regression analysis. Nonparametric regression was proposed to overcome the rigidity of parametric regression. It only assumes that the regression function belongs to a smooth family of functions, and offers a way of estimating the regression function without specifying a parametric model. When the regression function between X and Y is complex, it is hard to deal with the observations using a parametric model, while a nonparametric model can analyze such situations effectively.
In nonparametric regression, ANNs (artificial neural networks) and k-nearest neighbor are widely used, and have good performance in many applications (Maxwell & Stinchcombe, 1995; Su, Jing, et al, 2008; Cho, Ishida, et al, 2011; La, Guo, et al, 2012). However, these methods need sufficiently large samples. When the sample size is insufficient, the quality of the results can decrease. In real world applications, obtaining sufficient training samples is often too expensive when dangerous measurements or complex technical experiments have to be performed, such as fault diagnosis for expensive equipment (Huang & Moraga, 2004), semi-conductor manufacturing (Li, Wu, et al, 2006), engine control simulation (Andonie, 2009), and biological studies (Lee & Ong, 2010). Therefore, designing a regression approach that performs well with small samples is a significant problem.

Support vector regression (SVR) is motivated by the growing popularity of support vector machines (SVM) for regression with small samples (Smola & Scholkopf, 2004; Chu & Keerthi, 2007; Bloch, 2008; Huang, Zheng, et al, 2009). However, the quality of SVR models depends on proper settings of the SVR hyperparameters, and the main issue for practitioners trying to apply SVR is determining these parameter values for a given data set. Cherkassky and Ma have proposed an effective approach to selecting SVR parameters, based on noise variance estimation in the observed data (Cherkassky & Ma, 2004a). In practice, however, with small samples the noise variance cannot be precisely estimated by any well-known approach (such as polynomial or k-nearest-neighbor regression). Nadaraya-Watson kernel regression (NWKR) is a nonparametric technique in statistics for estimating the conditional expectation of a random variable, and allows interpolation and approximation a little beyond the samples (Shapiai, Ibrahim, et al, 2010). However, there is no appropriate approach for the selection of its parameter. This paper describes a practical analytical approach to selecting the parameter for NWKR directly from training data. The practical validity of the proposed approach is demonstrated using synthetic data sets.

This paper is organized as follows. Section 2 gives a brief introduction to NWKR regression. Section 3 describes the proposed approach to selecting the NWKR parameter using cross-validation (CV). Section 4 describes experimental tests for regression problems with Gaussian noise; these tests indicate that the proposed approach provides better generalization performance than other approaches. Finally, a conclusion is given in Section 5.

www.ijarai.thesai.org

II. NADARAYA-WATSON KERNEL REGRESSION
Nadaraya-Watson kernel regression (NWKR) estimates the regression function $f(x)$ at an arbitrary value of $x$ using Eq. (1):

$$\hat{f}(x, D) = \frac{\sum_{i=1}^{n} y_i \, K_h(x, x_i)}{\sum_{i=1}^{n} K_h(x, x_i)} \tag{1}$$

where $D$ denotes the training set, $K_h(x, x_i)$ denotes a kernel function which fulfills some properties, and $h$ is the bandwidth parameter of the kernel function. Several types of kernel functions are commonly used, such as the Gaussian, uniform, triangle and Epanechnikov functions. According to Eq. (1), NWKR is a weighted average technique: it averages the outputs of the given samples, using the kernel values as weights. This method allows accurate interpolation and approximation in the vicinity of the training samples. The kernel assigns a weight to each given sample based on its distance from the point at which the estimate is required.
In NWKR, a Gaussian kernel function is found to have better prediction accuracy than the other kernel functions (Shapiai, Sudin, et al, 2011); its expression is given in Eq. (2). However, several articles found that the performance of NWKR mainly depends on the choice of the bandwidth parameter h rather than on the kernel function. Figure 1 shows the relationship between the bandwidth h and the root mean squared error (RMSE, shown in Eq. (10)). In Fig. 1, the bandwidth h distinctly influences the RMSE. Choosing h based on experience may result in poor prediction accuracy, especially when knowledge about suitable values of h is insufficient. Thus, finding the optimal value of h is crucial for the prediction quality of NWKR.
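The weighted average described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, inputs are assumed to be one-dimensional, and the Gaussian kernel is assumed to have the common form $\exp(-(x - x_i)^2 / (2h^2))$, which may differ from the paper's Eq. (2) in how h is scaled. Any constant factor in front of the kernel cancels in the ratio.

```python
import numpy as np

def gaussian_kernel(x, xi, h):
    # Assumed kernel form: exp(-(x - xi)^2 / (2 h^2)).
    # The scaling of h here is an assumption, not taken from the paper.
    return np.exp(-((x - xi) ** 2) / (2.0 * h ** 2))

def nwkr_predict(x_query, x_train, y_train, h):
    """Nadaraya-Watson estimate at each query point (1-D inputs)."""
    x_query = np.atleast_1d(np.asarray(x_query, dtype=float))
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    # weights[j, i] = K_h(x_query[j], x_train[i])
    weights = gaussian_kernel(x_query[:, None], x_train[None, :], h)
    # Kernel-weighted average of the training outputs.
    # Note: for very small h and distant queries the weights can underflow
    # to zero, in which case the ratio is undefined (NaN).
    return weights @ y_train / weights.sum(axis=1)

# Usage: the midpoint between two samples receives equal weights.
print(nwkr_predict(0.5, [0.0, 1.0], [0.0, 2.0], h=0.3))
```

A small h makes the estimate follow the nearest samples closely, while a large h pulls it toward the global mean of the training outputs, which is why the bandwidth dominates the prediction quality.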

III. PARAMETER OPTIMIZATION WITH CROSS-VALIDATION
Because the number of samples is small, CV is used to optimize the value of h. CV is a standard resampling technique used in many applications, such as model selection and the selection of variables and of the number of components (Browne, 2000; Arlot & Celisse, et al, 2011). Under v-fold CV, the available data are divided into v disjoint groups, and the model is fitted v times, each time using (v-1) groups as the training set and the remaining group as the validation set, until each group has been left out once. Clearly, if v = n then v-fold CV is leave-one-out CV (LOO-CV), since exactly one object is left out at a time. The sample reuse of CV helps to optimize the parameter h when the amount of available data is small. Therefore, the CV error is taken as the objective function, and the optimal value of the bandwidth h is the one that minimizes the CV error. Because LOO-CV provides an almost unbiased estimate (Cawley & Talbot, et al, 2003), LOO-CV is chosen and the objective function is given by:

$$h_{opt} = \arg\min_{h} \frac{1}{n} \sum_{i=1}^{n} L\big(y_i, \hat{f}(x_i, D \setminus \{(x_i, y_i)\})\big)$$

where $\hat{f}(x_i, D \setminus \{(x_i, y_i)\})$ is the NWKR estimate at $x_i$ computed from the training set $D$ with the i-th sample removed, and the loss function $L(\cdot, \cdot)$ represents the quality of estimation. In practice, different optimization results are obtained with different loss functions, and the choice of loss function significantly influences the performance of the regression model. We therefore consider three representative loss functions, namely square loss, Huber's loss, and Vapnik's ε-insensitive loss function.
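The LOO-CV bandwidth selection described above can be illustrated with the following sketch. It is not the authors' code: the function names, the Gaussian kernel form $\exp(-(x - x_i)^2 / (2h^2))$, the sine target function, and the bandwidth grid are all assumptions made for the example. Leaving sample i out is done by zeroing its own kernel weight before averaging.

```python
import numpy as np

def loo_cv_error(h, x, y, loss):
    """Mean leave-one-out loss of the NWKR estimate for bandwidth h."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Assumed Gaussian kernel; w[i, j] is the weight of sample j at x[i].
    w = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2.0 * h ** 2))
    np.fill_diagonal(w, 0.0)  # sample i is excluded from its own prediction
    with np.errstate(divide="ignore", invalid="ignore"):
        pred = (w @ y) / w.sum(axis=1)
    err = np.mean(loss(y, pred))
    # If all remaining weights underflow to zero, treat h as infeasible.
    return float(err) if np.isfinite(err) else float("inf")

def select_bandwidth(x, y, loss, grid):
    """Return the grid value of h with the smallest LOO-CV error."""
    grid = np.asarray(grid, dtype=float)
    errors = np.array([loo_cv_error(h, x, y, loss) for h in grid])
    return float(grid[np.argmin(errors)])

# Usage with a hypothetical target function and the square loss.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 3.0, size=20)
y = np.sin(2.0 * x) + rng.normal(0.0, 0.1, size=20)
grid = np.arange(1, 101) * 0.01  # candidate bandwidths 0.01, ..., 1.00
square = lambda y_true, y_hat: (y_true - y_hat) ** 2
h_cv = select_bandwidth(x, y, square, grid)
print(h_cv)
```

A plain grid search is used here for transparency; with n = 20 samples each evaluation costs only an n-by-n weight matrix, so a finer grid or a scalar optimizer could be substituted without changing the idea.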
The square loss function is the following:

$$L(y, f(x)) = (y - f(x))^2 \tag{5}$$

Huber's loss function, which is also called the $L_1$-loss function, is:

$$L(y, f(x)) = |y - f(x)| \tag{6}$$

Vapnik's ε-insensitive loss function is defined as:

$$L_\varepsilon(y, f(x)) = \begin{cases} 0, & |y - f(x)| \le \varepsilon \\ |y - f(x)| - \varepsilon, & \text{otherwise} \end{cases} \tag{7}$$

In SVR, it has been demonstrated that for small sample regression problems Vapnik's ε-insensitive loss (with a properly chosen ε-parameter) yields better generalization than other loss functions (Cherkassky & Ma, 2004b). Cherkassky and Ma proposed a practical method for selecting the value of ε for SVR directly from the training data:

$$\varepsilon = 3 \sigma_{noise} \sqrt{\frac{\ln n}{n}} \tag{8}$$

where $\sigma_{noise}$ is the standard deviation of the additive noise and n is the number of training samples.
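As a small illustration, the three loss functions and the Cherkassky-Ma rule for ε can be written as follows. The function names are hypothetical; the formulas are the square loss, the absolute ($L_1$) loss, the ε-insensitive loss, and $\varepsilon = 3\sigma_{noise}\sqrt{\ln n / n}$.

```python
import numpy as np

def square_loss(y, f):
    # Squared residual.
    return (np.asarray(y, dtype=float) - np.asarray(f, dtype=float)) ** 2

def l1_loss(y, f):
    # Absolute residual (the L1 loss).
    return np.abs(np.asarray(y, dtype=float) - np.asarray(f, dtype=float))

def eps_insensitive_loss(y, f, eps):
    # Residuals inside the eps-tube cost nothing; outside it the cost
    # grows linearly with the distance from the tube.
    return np.maximum(l1_loss(y, f) - eps, 0.0)

def cherkassky_ma_epsilon(sigma_noise, n):
    # eps = 3 * sigma_noise * sqrt(ln(n) / n)
    return 3.0 * sigma_noise * np.sqrt(np.log(n) / n)

# Usage: with sigma_noise = 0.1 and n = 20 the rule gives eps of about 0.116.
eps = cherkassky_ma_epsilon(0.1, 20)
print(eps, eps_insensitive_loss([1.0, 1.0], [1.05, 1.5], eps))
```

Note that the rule requires the noise standard deviation, which is exactly the quantity that is hard to estimate from a small sample.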
With ε = 0, Vapnik's loss function coincides with Huber's ($L_1$) loss. From the viewpoint of traditional robust statistics, there is a well-known correspondence between the noise model and the optimal loss function. However, this connection between the noise model and the loss function is based on (asymptotic) maximum likelihood arguments, which do not apply to small samples. Therefore, in the next section we compare the generalization performance of Vapnik's ε-insensitive loss in NWKR (with different values of ε) with that of the other loss functions.

A. Comparison of Three Loss Functions
First we describe the experimental procedure used for comparisons, and then we present the experimental results.
Training data: The simulated training data are $(x_i, y_i)$, $i = 1, 2, \ldots, n$, where the x-values are sampled from a uniform distribution on the input space, and the y-values are generated according to $y = f(x) + \delta$. The target function f(x) is shown in Eq. (9), and $\delta$ is additive Gaussian noise. For each noise level, we generate five training data sets with a small sample size (n = 20); the noise levels are shown in Table 1.

Performance metric: Since the goal is optimal selection of the NWKR parameter in the sense of generalization, the main performance metric is Pred_accuracy (prediction accuracy), which is computed from RMSE(h) and $h_{best}$. Here RMSE(h) is the root mean squared error between the NWKR estimates and the true values of the target function at the test inputs, and $h_{best}$ is the approximately optimal (minimum-RMSE) bandwidth, obtained by calculating the RMSE over a grid of h values with step t = 0.01 in the domain [0, 1].

In Table 1, we present experimental comparisons for regression estimation using three representative loss functions: square loss, Huber's loss (ε = 0), and Vapnik's ε-insensitive loss with ε given by Eq. (8). The noise level (σ) column gives the standard deviation of the zero-mean Gaussian noise. In the column for the loss function (and ε-selection), Vapnik(c-m) denotes the value of ε from Eq. (8), and Vapnik(opt) denotes the optimal value of ε for Vapnik's ε-insensitive loss function, that is, the value whose corresponding h is closest to $h_{best}$. The Pred_accuracy columns report the minimum, maximum and average values of Pred_accuracy over the five training data sets for each parameter selection method.
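The search for $h_{best}$ described above can be sketched as follows. This is an illustrative reconstruction under several assumptions, not the authors' code: the target function is a hypothetical stand-in (the paper's Eq. (9) is not reproduced here), the input space is taken to be [0, 3] with 150 evenly spaced test points, the Gaussian kernel is assumed to be $\exp(-(x - x_i)^2 / (2h^2))$, and h = 0 is excluded from the grid because the kernel is undefined there.

```python
import numpy as np

def target(x):
    # Hypothetical stand-in for the paper's target function.
    return np.sin(2.0 * x)

def nwkr(x_query, x_train, y_train, h):
    # Nadaraya-Watson estimate with an assumed Gaussian kernel.
    w = np.exp(-((x_query[:, None] - x_train[None, :]) ** 2) / (2.0 * h ** 2))
    return (w @ y_train) / w.sum(axis=1)

def rmse_curve(x_train, y_train, x_test, grid):
    """RMSE against the noise-free target for every bandwidth in grid."""
    truth = target(x_test)
    out = np.full(len(grid), np.inf)
    for k, h in enumerate(grid):
        with np.errstate(divide="ignore", invalid="ignore"):
            pred = nwkr(x_test, x_train, y_train, h)
        # Bandwidths whose weights underflow somewhere are left at infinity.
        if np.all(np.isfinite(pred)):
            out[k] = np.sqrt(np.mean((pred - truth) ** 2))
    return out

rng = np.random.default_rng(0)
n, sigma = 20, 0.1
x_train = rng.uniform(0.0, 3.0, size=n)
y_train = target(x_train) + rng.normal(0.0, sigma, size=n)
x_test = np.linspace(0.0, 3.0, 150)
grid = np.arange(1, 101) * 0.01  # h = 0.01, 0.02, ..., 1.00
rmse = rmse_curve(x_train, y_train, x_test, grid)
h_best = grid[np.argmin(rmse)]
print(h_best, rmse.min())
```

Because $h_{best}$ uses the true target values, it is only available in simulation; it serves as the reference against which a bandwidth chosen from the training data alone is judged.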
It can be seen in Fig. 2 that: 1) the NWKR approach performs well on small sample regression problems (a prediction accuracy of approximately 90%) at different noise levels; 2) the prediction accuracy for the square loss function is better than for Huber's and Vapnik's (c-m) loss functions when the noise level is smaller than 0.1, but when the noise level is larger than 0.1 Vapnik's (c-m) loss function is the best of the three; 3) none of the three loss functions is very robust, and their robustness weakens as the noise level increases; 4) very good prediction accuracy can be obtained using Vapnik's loss function with an appropriate choice of ε (Vapnik(opt)).
Selecting an appropriate value for the parameter ε for better prediction accuracy is important, and is studied in the next section.

B. Parameter estimation with regression model
As mentioned before, the NWKR regression approach with Vapnik's loss function can provide excellent prediction accuracy when the parameter ε is set appropriately. However, selecting ε from Eq. (8) is not the best choice because of its unsatisfactory prediction accuracy and robustness. In practice, the noise level is not known in advance, and with small samples the error made by well-known approaches in estimating the noise level is unacceptably large. It may, however, be feasible to estimate the value of ε from the dispersion of the sample data. In this section, we attempt to estimate an appropriate value of ε from the standard deviation of the sample output data used in Section 4.1.
Fig. 3 shows a scatter chart of the standard deviation of Y against the optimal value of ε for Vapnik's ε-insensitive loss function. Theoretically, when the standard deviation of Y is 0, the parameter ε should also be 0, and ε should increase as the standard deviation of Y increases. However, when the standard deviation of Y is large enough, ε should level off, because otherwise some loss values of Vapnik's ε-insensitive loss function could be 0, and the parameter optimization would then fail to reach the optimum solution. Thus, a logistic regression model was used to establish the relationship between the standard deviation of Y and the value of ε. The fitted logistic model is shown together with the experimental results in Fig. 3.

C. Comparisons with other parameter selection methods
In this section, the performance of the proposed method (parameter selection with Eq. (12)) is demonstrated in two ways. First, the standard deviation of Y is changed while the sample size is unchanged; second, the sample size is changed while the standard deviation of Y is unchanged.

1) Standard deviation changed and sample size unchanged
The sample size n is set to 20, and the standard deviation of Y takes the values 0.01, 0.05, 0.1 and 0.5. For each value of the standard deviation, we generate ten training data sets, with the x-values sampled from a uniform distribution on the input space [0, 3], and the y-values generated from Eq. (9) and corrupted by additive Gaussian noise with zero mean and the specified standard deviation. The test data set, kernel function and performance metric are the same as in Section 4.1. The comparison results for the four parameter selection methods are shown in Table 2. The proposed method is more robust than the methods using square loss, Huber's loss and Vapnik's (c-m) loss. Fig. 4 shows the average prediction accuracy of the different methods over the ten data sets, where we can see that the proposed method performs better than the other three methods. An increase in the noise level has little effect on the proposed method, while the performance of the other three methods deteriorates. Meanwhile, the gap in average prediction accuracy between the proposed method and Vapnik(opt) is less than 5%.

2) Sample size changed and standard deviation unchanged
The standard deviation of Y is set to 0.1, and the sample size n takes the values 10, 15, 20, 25, 30 and 50. For each sample size, we generate ten training data sets, with the x-values sampled from a uniform distribution on the input space [0, 3], and the y-values generated from Eq. (9) and corrupted by additive Gaussian noise N(0, 0.01). The test data set, kernel function and performance metric are the same as in Section 4.1. The comparison results for the four parameter selection methods are shown in Table 3. Fig. 5 shows the average prediction accuracy of the different methods over the ten data sets, where we can see that the proposed method outperforms the other three methods (square loss, Huber's loss and Vapnik (c-m) loss), and that the number of samples has little effect on the proposed method. However, the computation time grows with the sample size, and the advantages of the proposed method are weakened because the standard deviation of Y is reduced when the sample size increases. It is suggested that the proposed method is suitable when the sample size is less than 30.

This paper describes practical recommendations for setting meta-parameters for NWKR regression with small samples. Namely, the value of the ε parameter is obtained directly from the training data without estimating the noise level. Empirical comparisons suggest that the proposed parameter selection method (Eq. (12)) yields good generalization performance for NWKR estimates under different noise levels and sample sizes. Hence, the proposed approach for NWKR parameter selection can be used by practitioners interested in applying NWKR to various application domains in which the sample size is small. In this paper, the proposed value of ε is derived for one target function, with Gaussian noise and an RBF kernel, and it is not clear whether this selection is appropriate for other target functions, noise distributions and kernel types. Future research may investigate the optimal selection of ε for different target functions, noise distributions and kernel types.

Fig. 1. Scatter diagram of bandwidth h and RMSE. (Note: The samples are generated from y = f(x) + δ, and the target function f(x) is shown in Eq. (9). The x-values for the training data are sampled from a uniform distribution in the input space [0, 3], and the y-values are corrupted using an additive Gaussian noise δ with zero mean, with the noise levels 0.1 and 0.5 denoting the standard deviation of the noise. The sample size is n = 20.)

Test data: 150 samples are used for the test data set, generated sequentially at a fixed step from a = 0 to the upper bound b = 3.

Kernel function: The Gaussian kernel function of Eq. (2) is used in all experiments.

Fig. 2. The Average Prediction Accuracy for Different Loss Functions

Fig. 3. Experimental results between the standard deviation of Y and Vapnik(opt), and the fitted logistic model

Fig. 5. Average prediction accuracy for several parameter methods with different sample sizes

TABLE III. COMPARISON RESULTS FOR SEVERAL PARAMETER METHODS WITH DIFFERENT SAMPLE SIZES