Graph-based Semi-Supervised Regression and Its Extensions

In this paper we present a graph-based semi-supervised method for solving regression problems. In our method, we first build an adjacency graph on all labeled and unlabeled data, and then incorporate the graph prior with the standard Gaussian process prior to infer the training model and prediction distribution for semi-supervised Gaussian process regression. Additionally, to further boost the learning performance, we employ a feedback algorithm that selects helpful predictions of unlabeled data for feeding back and re-training the model iteratively. Furthermore, we extend our semi-supervised method to a clustering regression framework to address the computational problem of Gaussian processes. Experimental results show that our work achieves encouraging results.

Keywords—Semi-supervised learning; Graph Laplacian; Regression; Gaussian Process; Feedback; Clustering


I. INTRODUCTION
Regression is a fundamental task in data mining and statistical analysis. A regression task aims to analyze and model the relationship between variables so that the value of a given variable can be predicted from one or more other variables. Given enough labeled training data, supervised regression algorithms can learn reasonably accurate models. However, in many machine learning domains, such as bioinformatics and text processing, labeled data is often difficult, expensive and time-consuming to obtain, while unlabeled data is relatively easy to collect in practice. For this reason, semi-supervised regression has received considerable attention in the machine learning literature in recent years due to its potential for utilizing unlabeled data to improve predictive accuracy [6] [28].
An early family of semi-supervised regression methods is iterative labeling [9], such as the co-training algorithm [4] [27], which employs supervised regressors as base learners and then labels and selects unlabeled data in an iterative process. Similarly, [5] presented another co-training style semi-supervised regression algorithm that employs multiple learners. Although these methods achieved considerable improvements, they did not take full advantage of the inherent structure between labeled and unlabeled data. Indeed, they keep the supervised learning algorithm and only change the labels of the data, i.e., they label and relabel the unlabeled data iteratively. Unfortunately, the iterative process causes computational problems on large datasets.
Besides co-training, regularization-based methods have also been widely employed in semi-supervised regression [11] [15] [23] [3]. These methods combine a regularization term over all labeled and unlabeled data with the predictive error on the labeled data into a single criterion. In such a criterion, the unlabeled data helps to reveal the parts of the input space in which the predictive function varies smoothly. A variety of approaches using regularization terms have been proposed; well-known examples are the graph Laplacian regularizer [29], the Hessian regularizer [10] and the parallel field regularizer [14]. These methods have enjoyed great success. However, they are transductive, which means they only work on the observed labeled and unlabeled training data and cannot handle unseen data.
In this paper, we propose an inductive semi-supervised regression model that incorporates graph prior information into standard Gaussian process (GP) regression. Our method first builds an adjacency graph over all labeled and unlabeled data. We then treat the graph as a prior and incorporate it with the standard GP prior to generate a new GP prior conditioned on the graph, together with a graph-based covariance function. From this new conditional prior and the graph-based covariance function, the marginal likelihood and the prediction distribution of semi-supervised GP regression are derived. Since the prediction from the GP model takes the form of a full predictive distribution, unseen data can also be predicted easily.
Additionally, to further boost learning performance, we extend our semi-supervised method with a feedback framework. Early semi-supervised learning methods, such as self-learning [19] and co-training [27], usually use a supervised learning algorithm to label and select unlabeled data in an iterative process, and these methods have proved effective in improving prediction accuracy. The predictions made during the learning process must therefore contain some valuable information, and under suitable metrics they can help to construct a more accurate model. In other words, when a learning process is performed repeatedly, we gain extra information from a new source: past unlabeled examples and their predictions, which can be viewed as a kind of experience. This experience serves as a new source of knowledge about the prediction model, and it opens the possibility of improving the performance of our semi-supervised GP regression. To take advantage of this extra information, we employ a feedback algorithm that selects helpful predictions of unlabeled data for feeding back and re-training the model iteratively.
Furthermore, we empirically demonstrate a further extension of semi-supervised GP regression. GP suffers from a computational problem due to an unfavorable cubic scaling ($O(N^3)$) during training, where $N$ is the number of training data. In recent years, many methods have been proposed to address this problem: sparse GP approximations [24] [20] [12] and localized regression [7] [17]. In our work, we describe a clustering regression framework that brings this scaling down. Specifically, a clustering algorithm is employed as the first step, to identify regions with similar characteristics. Then, for each cluster, a local semi-supervised regression model is built to describe the relationship between inputs and outputs. By partitioning the dataset and learning models locally, the computational cost of each local model is cubic only in the number of data points in its cluster, rather than in the entire dataset. As a result, even for large datasets this leads to a more favorable training scaling.

This paper is organized as follows. In Section 2, we discuss related work. In Section 3, we give some preliminaries and a brief overview of Gaussian process regression. The problem statement and our main theorem, as well as the key models, are detailed in Section 4. In Section 5, we lay out an extension algorithm that detects useful predictions and feeds them into the training set. In Section 6, we experimentally compare our method with state-of-the-art approaches and give a detailed discussion; based on the results, we identify a problem with our method and describe a clustering regression framework to fix it. Finally, Section 7 concludes our work.

II. RELATED WORK
Our work is closely related to several semi-supervised learning methods. One is the semi-supervised classification method proposed by [21]. Both works define a prior over the graph variable and incorporate it into the standard GP probabilistic framework to derive a posterior distribution of the latent variables conditioned on the graph. However, their derivations focus on the semi-supervised classification problem rather than regression, so we do not discuss them in more detail.
Our work is also similar to Zhang's method of semi-supervised multi-task regression (SSMTR) [26]. On the surface, both works construct an adjacency graph and incorporate the prior of this graph with the GP prior to generate a semi-supervised data-dependent kernel function defined over the entire data space, but there are several differences. In Zhang's paper, they proposed a new GP likelihood $\prod_{i=1}^{m} p(\mathbf{y}_i|X_i)\, p(\theta_i)$ for supervised multi-task regression (named SMTR), and then changed the kernel function of the model to a semi-supervised kernel function to extend the model to the semi-supervised setting; this semi-supervised kernel function had previously been used in classification tasks [22]. In fact, the prediction formulation of SMTR, $p(y_i^*|x_i^*, X_i, \mathbf{y}_i)$, is the same as standard GP but with a different kernel function. In our paper, we do not simply change the kernel function of a supervised GP; instead, we take advantage of the prior of the adjacency graph to derive a new likelihood conditioned on the graph, $p(\mathbf{y}|X, G)$, and a conditional prediction distribution, $p(y_*|x_*, X, \mathbf{y}, G)$, which are the training and prediction models for semi-supervised regression. In other words, the major difference between our method and SSMTR is that the training and prediction models are entirely different. Moreover, in Zhang's method it is difficult to estimate the optimal values of all parameters simultaneously because of their large number, so the parameters are optimized through an alternating optimization algorithm. In our work, the parameters are estimated by using the gradient descent method to minimize the negative log conditional likelihood $-\log p(\mathbf{y}|X, G)$, which means the training processes of the two methods also differ.

III. AN OVERVIEW OF GAUSSIAN PROCESS REGRESSION
GP has proven to be a powerful tool for regression. An important advantage of GP is its explicit probabilistic formulation, which not only provides probabilistic predictions but also gives the ability to infer model parameters. Here we offer a brief summary of GP for supervised regression; see [18] for more details. We assume that the input training data is given as $X_D = \{x_1, \ldots, x_N\}$, where $N$ is the total number of input data and $l$ is the number of labeled data. $X_L$ and $X_U$ denote the inputs of the labeled and unlabeled datasets respectively. We use $\mathbf{y} = \{y_1, \ldots, y_l\}$ to represent the corresponding outputs of the labeled data $X_L$.
In supervised GP regression, the output label $y$ is assumed to relate to a latent function $f(x)$ through a Gaussian noise model: $y = f(x) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$, where $\mathcal{N}(m, c)$ denotes a Gaussian distribution with mean $m$ and covariance $c$. The regression task is to learn the mapping function $f(x)$, which maps an input vector to a label value. Usually, a zero-mean multivariate Gaussian prior distribution is placed over $\mathbf{f}$, that is,

$p(\mathbf{f}|X_L) = \mathcal{N}(\mathbf{0}, K_L), \qquad (1)$

where $K_L$ is an $l \times l$ covariance matrix. The elements of $K_L$ are built by means of a covariance function (kernel) $k(x, x')$. A simple example is the standard Gaussian covariance, defined as

$k(x, x') = c \exp\Big(-\frac{1}{2} \sum_{j=1}^{d} \frac{(x_j - x'_j)^2}{b_j^2}\Big), \qquad (2)$

where $b = \{b_j\}_{j=1}^{d}$ plays the role of characteristic length-scales and $c$ is the overall kernel scale. The parameters $b$ and $c$ are initially unknown and are collected in a parameter set $\theta$ containing all such hyper-parameters.
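To make the covariance concrete, the following is a minimal NumPy sketch of such an ARD Gaussian kernel. Since the exact form of Eq.(2) was partly lost in extraction, the precise way $b$ and $c$ enter the exponent here is our assumption of the standard squared-exponential form.

```python
import numpy as np

def se_ard_kernel(X1, X2, b, c):
    """Squared-exponential covariance with per-dimension length-scales b
    (shape (d,)) and overall scale c; X1 is (n1, d), X2 is (n2, d)."""
    X1s = X1 / b                      # rescale each dimension by its length-scale
    X2s = X2 / b
    # squared Euclidean distances between all pairs of rescaled points
    d2 = ((X1s[:, None, :] - X2s[None, :, :]) ** 2).sum(axis=-1)
    return c * np.exp(-0.5 * d2)
```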
For a GP model, the marginal likelihood is the integral over the product of the likelihood $p(\mathbf{y}|\mathbf{f}) = \mathcal{N}(\mathbf{f}, \sigma^2 I)$ and the prior $p(\mathbf{f}|X_L)$:

$p(\mathbf{y}|X_L) = \int p(\mathbf{y}|\mathbf{f})\, p(\mathbf{f}|X_L)\, d\mathbf{f} = \mathcal{N}(\mathbf{0}, K_L + \sigma^2 I), \qquad (3)$

which is typically regarded as the training model of GP. Given some observations and a covariance function, we want to find the most appropriate $\theta$ and $\sigma$, and make predictions on the test data. There are various methods for determining the parameters. A common one is gradient ascent, which seeks the optimal parameters by maximizing the marginal likelihood.
Given the observations and the optimal $\theta$ and $\sigma$, the prediction distribution of the target value $f_*$ for a test input $x_*$ can be expressed as [18]

$p(f_*|x_*, X_L, \mathbf{y}) = \mathcal{N}(\mu_*, \sigma_*^2), \qquad (4)$

where the predictive mean and variance are

$\mu_* = k_*^\top (K_L + \sigma^2 I)^{-1} \mathbf{y}, \qquad \sigma_*^2 = k_{**} - k_*^\top (K_L + \sigma^2 I)^{-1} k_*, \qquad (5)$

where $k_*$ is the matrix of covariances between the training data and the test data, and $k_{**}$ consists of the covariances of the test data.
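As a sketch of Eqs.(4)-(5), the snippet below computes the predictive mean and variance using a Cholesky factorization in place of an explicit inverse; it assumes the `se_ard_kernel` defined above.

```python
def gp_predict(X_train, y_train, X_test, b, c, sigma):
    """Standard GP predictive mean and variance (Eqs. (4)-(5))."""
    K = se_ard_kernel(X_train, X_train, b, c) + sigma**2 * np.eye(len(X_train))
    k_star = se_ard_kernel(X_train, X_test, b, c)   # train/test covariances
    k_ss = se_ard_kernel(X_test, X_test, b, c)      # test/test covariances
    L = np.linalg.cholesky(K)                       # stable alternative to inv(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = k_star.T @ alpha                         # predictive mean of Eq. (5)
    v = np.linalg.solve(L, k_star)
    var = np.diag(k_ss) - (v**2).sum(axis=0)        # predictive variance of Eq. (5)
    return mean, var
```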

IV. SEMI-SUPERVISED GAUSSIAN PROCESS REGRESSION
As we can see in standard GPr, neither the prior of the latent function $\mathbf{f}$ (Eq.(1)) nor the predictive distribution (Eq.(4)) contains any information about the unlabeled data. Evidently, to train an accurate GP model, we need sufficient training (labeled) data. However, labeled data is often difficult and expensive to obtain, while unlabeled data is relatively easy to collect. It therefore appears necessary to modify the standard GP model so that it can learn from unlabeled data, and thereby improve the prediction performance. In this section we present how to effectively use the information in unlabeled data to extend the standard GP model into a semi-supervised framework.
According to the semi-supervised smoothness assumption, if two points are close, then so should be the corresponding outputs. Under this assumption, unlabeled data should be helpful in regression problems: it can help explore the nearness or similarity between inputs, and the output should vary smoothly with this distance. So, to utilize the unlabeled data, we build an adjacency graph to define the nearness between labeled and unlabeled data. We then incorporate the graph information into the standard GP probabilistic framework to generate a new probability model for semi-supervised GPr.

A. Prior Condition On Graph
In order to take advantage of the information in unlabeled data, we build an adjacency graph $G = (V, E)$ on all observed data points $X_D = \{X_L, X_U\}$ to capture the adjacency relationship between labeled and unlabeled data, where $V$ is the set of nodes composed of all data points and $E$ is the set of edges between nodes. The graph can be represented by a weight matrix $W$, where $w_{ij} = \exp\left(-\|x_i - x_j\|^2 / (2\eta^2)\right)$ is the edge weight between nodes $i$ and $j$, with $w_{ij} = 0$ if there is no edge. Here $\eta$ is the edge weight length-scale.
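A small sketch of this construction follows; the fully connected Gaussian-weighted graph is our reading of the garbled weight formula, and a k-nearest-neighbor sparsification, common in practice, is omitted. The regularizer $\Delta$ is the one defined in the next subsection.

```python
def graph_regularizer(X, eta, lam=1.0, upsilon=1):
    """Weight matrix W and regularizer Delta = lam * L^upsilon (Section IV-A)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-d2 / (2.0 * eta**2))          # Gaussian edge weights
    np.fill_diagonal(W, 0.0)                  # no self-edges
    L = np.diag(W.sum(axis=1)) - W            # combinatorial Laplacian D - W
    return lam * np.linalg.matrix_power(L, upsilon)
```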
From the previous section, we can see that regression by GP is a probabilistic approach. Probabilistic approaches to regression attempt to model $p(\mathbf{y}|X_D)$. In this setting, in order to let the unlabeled data affect our predictions, we must make some assumptions about the underlying distribution of the input data. In our work, we combine the graph information with the GP. Specifically, we incorporate a likelihood $p(G|\mathbf{f})$ with the prior $p(\mathbf{f}|X_D)$ to infer a posterior distribution of $\mathbf{f}$ conditioned on the graph $G$.
Here, we consider the graph $G$ itself as a random variable. There are many ways to define an appropriate likelihood for the variable $G$; [21] provides a simple likelihood of observing the graph:

$p(G|\mathbf{f}) \propto \exp\left(-\tfrac{1}{2}\, \mathbf{f}^\top \Delta\, \mathbf{f}\right), \qquad (6)$

where $\Delta$ is a graph regularization matrix, defined here through the graph Laplacian: we let $\Delta = \lambda L^{\upsilon}$, where $\lambda$ is a weighting factor, $\upsilon$ is an integer, and $L$ denotes the combinatorial Laplacian of the graph.

Combining the Gaussian process prior $p(\mathbf{f}|X_D)$ with the likelihood function Eq.(6), we obtain the posterior distribution of $\mathbf{f}$ given the graph $G$:

$p(\mathbf{f}|X_D, G) \propto p(G|\mathbf{f})\, p(\mathbf{f}|X_D), \qquad (7)$

which is a multivariate Gaussian:

$p(\mathbf{f}|X_D, G) = \mathcal{N}\big(\mathbf{0}, (K_D^{-1} + \Delta)^{-1}\big), \qquad (8)$

where $K_D$ is the covariance matrix on $X_D$. The posterior distribution Eq.(8) will be used as the prior distribution in the following derivation. To proceed further, we have to derive the posterior of $\mathbf{f}_X$ for data beyond the graph $G$. Here $X$ denotes the more general dataset, which contains the observed dataset $X_D$ and a set of unseen test data $X_T$, i.e., $X = \{X_D, X_T\}$. In standard GP, the joint Gaussian prior distribution of $\mathbf{f}_X$ is

$p(\mathbf{f}_X|X) = \mathcal{N}(\mathbf{0}, K_X). \qquad (9)$

Then, as above, the posterior distribution of $\mathbf{f}_X$ conditioned on $G$ is proportional to $p(G|\mathbf{f}_X)\, p(\mathbf{f}_X|X)$, and it is explicitly given by a modified covariance function:

$p(\mathbf{f}_X|X, G) = \mathcal{N}(\mathbf{0}, \tilde{K}_X). \qquad (10)$

Eq.(10) gives a general description that, for any finite collection of data $X$, the latent random variable $\mathbf{f}_X$ conditioned on the graph $G$ has a multivariate normal distribution $\mathcal{N}(\mathbf{0}, \tilde{K}_X)$, where $\tilde{K}_X$ is the covariance matrix whose elements are given by evaluating the kernel function

$\tilde{k}(x, z) = k(x, z) - k_x^\top (I + \Delta K_D)^{-1} \Delta\, k_z, \qquad (11)$

where $k_x$ and $k_z$ denote the column vectors

$k_x = \big(k(x_1, x), \ldots, k(x_N, x)\big)^\top, \quad k_z = \big(k(x_1, z), \ldots, k(x_N, z)\big)^\top, \quad x_i \in X_D. \qquad (12)$

We notice that by incorporating the graph information $\Delta$ with the standard GP prior $p(\mathbf{f}|X)$, we infer a new prior conditioned on the graph $G$ and a graph-based covariance function $\tilde{k}$. In fact this semi-supervised kernel (covariance function) was first proposed by [22] from the Reproducing Kernel Hilbert Space view, and was used for the semi-supervised classification task. In our work, we mainly focus on how to utilize the new prior and the graph-based covariance function to derive the training and prediction distributions for semi-supervised GPr.
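The deformed covariance of Eq.(11) can be evaluated for all pairs of points at once; the following sketch does so with a linear solve rather than an explicit inverse. `K_DD` is the base Gram matrix on $X_D$, and the columns of `K_DZ` are the vectors $k_x$ of Eq.(12) for the points of interest.

```python
def deformed_gram(K_DD, K_DZ, K_ZZ, Delta):
    """Graph-based covariance, Eq. (11):
    k~(x, z) = k(x, z) - k_x^T (I + Delta K_DD)^{-1} Delta k_z."""
    N = K_DD.shape[0]
    # M = (I + Delta K_DD)^{-1} Delta, computed via a solve for stability
    M = np.linalg.solve(np.eye(N) + Delta @ K_DD, Delta)
    return K_ZZ - K_DZ.T @ M @ K_DZ
```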

B. Objective Functions
Our training objective for semi-supervised GPr is the marginal likelihood $p(\mathbf{y}|X_D, G)$, which is the integral of the likelihood times the prior:

$p(\mathbf{y}|X_D, G) = \int p(\mathbf{y}|\mathbf{f})\, p(\mathbf{f}|X_D, G)\, d\mathbf{f}. \qquad (13)$

As in standard GP, the term marginal likelihood refers to the marginalization over the latent function values $\mathbf{f}$. The difference is that the prior of the semi-supervised GP is the posterior obtained by conditioning the original GP on the graph $G$.
According to Eq.(8) and the likelihood $p(\mathbf{y}|\mathbf{f}) = \mathcal{N}(\mathbf{f}, \sigma^2 I)$, the marginal likelihood of the observed target values $\mathbf{y}$ is

$p(\mathbf{y}|X_D, G) = \mathcal{N}(\mathbf{0}, \tilde{K}_L + \sigma^2 I), \qquad (14)$

where

$\tilde{K}_L = \big[\tilde{k}(x_i, x_j)\big]_{i,j=1,\ldots,l} \qquad (15)$

is the labeled block of the graph-based covariance matrix, with $\tilde{k}$ given by Eq.(11). This formula can be seen as the training model of our proposed method. We select appropriate values of the hyper-parameters $\Theta = \{\theta, \sigma\}$ by maximizing the log marginal likelihood $\log p(\mathbf{y}|X_D, G)$; the goal is to solve $\Theta = \arg\max \log p(\mathbf{y}|X_D, G)$. In the learning process we compute the partial derivatives of the marginal likelihood and use them in gradient ascent to maximize the marginal likelihood with respect to all hyper-parameters.
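A sketch of this training step is shown below. The log-parameterization and the use of a numerical-gradient optimizer are our simplifications; the paper uses analytic partial derivatives instead. We also assume the labeled points occupy the first $l$ rows of $X_D$, and reuse the `se_ard_kernel` and `deformed_gram` sketches from above.

```python
from scipy.optimize import minimize

def neg_log_marginal(params, X_D, y_L, Delta, l):
    """Negative log of Eq. (14) on the labeled block of the deformed Gram."""
    d = X_D.shape[1]
    b, c, sigma = np.exp(params[:d]), np.exp(params[d]), np.exp(params[d + 1])
    K_DD = se_ard_kernel(X_D, X_D, b, c)
    K_t = deformed_gram(K_DD, K_DD, K_DD, Delta)          # deformed Gram on X_D
    K_y = K_t[:l, :l] + (sigma**2 + 1e-8) * np.eye(l)     # noise plus jitter
    L = np.linalg.cholesky(K_y)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_L))
    return 0.5 * y_L @ alpha + np.log(np.diag(L)).sum() + 0.5 * l * np.log(2 * np.pi)

# e.g. res = minimize(neg_log_marginal, x0, args=(X_D, y_L, Delta, l), method="L-BFGS-B")
```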
After learning the model parameters, we turn to the prediction problem. In the prediction process, given a test data point $x_*$, we infer $f_*$ based on the observed vector $\mathbf{y}$. According to the prior Eq.(10) and Eq.(14), the joint distribution of the training outputs $\mathbf{y}$ and the test output $f_*$ is

$\begin{pmatrix} \mathbf{y} \\ f_* \end{pmatrix} \sim \mathcal{N}\left(\mathbf{0}, \begin{pmatrix} \tilde{K}_L + \sigma^2 I & \tilde{k}_* \\ \tilde{k}_*^\top & \tilde{k}_{**} \end{pmatrix}\right). \qquad (16)$

Using this joint probability and Eq.(14), together with the partitioned inverse equations, we derive the Gaussian conditional distribution of $f_*$ at $x_*$:

$p(f_*|x_*, X_D, \mathbf{y}, G) = \mathcal{N}(\bar{\mu}, C), \qquad (17)$

where

$\bar{\mu} = \tilde{k}_*^\top (\tilde{K}_L + \sigma^2 I)^{-1} \mathbf{y}, \qquad C = \tilde{k}_{**} - \tilde{k}_*^\top (\tilde{K}_L + \sigma^2 I)^{-1} \tilde{k}_*. \qquad (18)$

This is the key predictive distribution of our proposed semi-supervised GPr method: $\bar{\mu}$ is the mean prediction at the new point and $C$ is its predictive variance. For fixed data and fixed hyper-parameters of the covariance function, we can predict test data from the labeled data and a large amount of unlabeled data.
Note that the graph $G$ contains the adjacency information of labeled and unlabeled data, which is helpful for regression under the smoothness assumption of semi-supervised learning. The knowledge of $p(G|\mathbf{f})$ that we gain through the unlabeled data therefore carries information that is useful in the inference of $p(\mathbf{y}|X_D, G)$ and $p(f_*|x_*, X_D, \mathbf{y}, G)$, which are the training and predictive distributions for semi-supervised GP regression. Thus, our semi-supervised GPr method can be expected to yield an improvement over the supervised one.

V. REGRESSION WITH FEEDBACK
In semi-supervised regression, we learn a predictive model from labeled and unlabeled data, and the outputs of the unlabeled data can then be predicted by the model. In this process, the predictive outputs can be viewed as a kind of experience, and such experience provides the possibility of improving the performance of semi-supervised GPr. Therefore, in this paper we describe a feedback algorithm, which selects useful predictions of unlabeled data, feeds them back into the labeled dataset, and re-trains the model iteratively.
In a predictive system, we cannot be sure that all the unlabeled data are predicted correctly. For this reason, not all predictions are helpful for re-training, and we need to select the useful ones. We call a useful prediction a confident prediction, which raises the question of what a confident prediction is. Intuitively, if a newly labeled example helps to decrease the error of the regressor on the labeled dataset, it should be considered confident. Therefore, in each feedback iteration, the confidence of an unlabeled data point $x_u$ is evaluated using the criterion

$E_{x_u} = \frac{1}{l} \sum_{x_i \in X_L} \big(y_i - M(x_i)\big)^2 - \frac{1}{l} \sum_{x_i \in X_L} \big(y_i - M'(x_i)\big)^2, \qquad (19)$

where $M$ is the original semi-supervised regressor trained on the labeled dataset $(X_L, y_L)$ and the unlabeled dataset $X_U$, while $M'$ is the one re-trained on the new labeled dataset $\{(X_L, y_L) \cup (x_u, \hat{y}_u)\}$ and the unlabeled dataset $X_U \setminus \{x_u\}$. Here $x_u$ is an unlabeled data point and $\hat{y}_u$ is the real-valued output predicted by the original regressor $M$, i.e., $\hat{y}_u = M(x_u)$. The first term of Eq.(19) is the mean squared error (MSE) of the original semi-supervised regressor on the labeled dataset, and the second term is the MSE of the regressor that utilizes the information provided by $(x_u, \hat{y}_u)$. Thus, the pair $(x_u, \hat{y}_u)$ associated with the largest positive $E_{x_u}$ can be regarded as the most confident labeled data. In other words, if the value of $E_{x_u}$ is positive, utilizing $(x_u, \hat{y}_u)$ is beneficial, so this unlabeled point paired with its prediction is used as labeled data in the next round of model training. Otherwise, $(x_u, \hat{y}_u)$ is not helpful for training and is omitted, and $x_u$ remains in the unlabeled dataset $X_U$. The pseudo code of our feedback framework is shown in Table I, where the function Semitrain returns a semi-supervised GP regressor. The learning process stops when the maximum number of learning iterations, $T$, is reached, or when there is no unlabeled data left.
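The selection step can be sketched as follows, assuming hypothetical `train`/`predict` wrappers around the semi-supervised regressor; the candidate pool, the retraining of $M'$ per candidate, and the positivity test on $E_{x_u}$ follow Table I and Eq.(19).

```python
def feedback_select(X_L, y_L, X_U, pool_size, train, predict, rng):
    """Return the index of the most confident unlabeled point, or None."""
    M = train(X_L, y_L, X_U)
    mse_on_labeled = lambda model: np.mean((predict(model, X_L) - y_L) ** 2)
    base_err = mse_on_labeled(M)
    pool = rng.choice(len(X_U), size=min(pool_size, len(X_U)), replace=False)
    best_idx, best_gain = None, 0.0
    for u in pool:
        y_u = predict(M, X_U[u:u + 1])                  # pseudo-label y^_u = M(x_u)
        M2 = train(np.vstack([X_L, X_U[u:u + 1]]),      # retrain M' with (x_u, y^_u)
                   np.append(y_L, y_u),
                   np.delete(X_U, u, axis=0))
        gain = base_err - mse_on_labeled(M2)            # E_{x_u} of Eq. (19)
        if gain > best_gain:                            # keep the largest positive E
            best_idx, best_gain = u, gain
    return best_idx
```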

VI. EXPERIMENTS
In this section, we first evaluate the performance of the proposed semi-supervised GPr (SemiGPr) on several regression datasets and make a direct comparison with its standard version (GPr). Then we show the experimental results of SemiGPr with the feedback algorithm (named FdGPr). Finally, we introduce the clustering framework and empirically examine the execution time and accuracy of the local SemiGPr extension under this framework.
There are d + 4 hyper-parameters in SemiGPr: the kernel length-scales $b = \{b_i\}_{i=1}^{d}$, where $d$ is the dimension of the input $x$, the overall kernel scale $c$, the noise $\sigma$, and the edge weight length-scale $\eta$. In our experiments, we select appropriate values of $\{b, c, \sigma\}$ by maximizing the marginal likelihood. To reduce the computational complexity, we fix $\eta = 10$ for all datasets. 4-fold cross-validation is performed on each dataset and all results are averaged over 40 runs of the algorithm.
The datasets used to evaluate the performance of our method are summarized in Table II. The examples in the artificial dataset Friedman are generated from the function $y = \tan^{-1}\big((x_2 x_3 - 1/(x_2 x_4))/x_1\big)$, with the attributes constrained to fixed ranges (e.g., $x_4 \in [1, 11]$) and a Gaussian noise term added to the function. The real-world datasets are from the UCI machine learning repository and StatLib.
In our experiments, for each dataset we randomly choose 25% of the examples as test data, while the remainder is used as training data. We take 10% of the training data as labeled examples, and the rest serves as the set of unlabeled examples. Note that all datasets are normalized to the range [0, 1].

A. Algorithmic Convergency
In this paper, we estimate the hyper-parameters by using the gradient descent method to minimize the negative log marginal likelihood

$-\log p(\mathbf{y}|X_D, G) = \tfrac{1}{2}\, \mathbf{y}^\top (\tilde{K}_L + \sigma^2 I)^{-1} \mathbf{y} + \tfrac{1}{2} \log \big|\tilde{K}_L + \sigma^2 I\big| + \tfrac{l}{2} \log 2\pi. \qquad (20)$

Firstly, we discuss the convergence of this training objective function. In Figure 1, we show how the objective function value decreases with the number of iterations on the triazines (left) and no2 (right) datasets. The result on triazines shows a typical convergence process: as the number of iterations increases, the objective function value decreases smoothly. Meanwhile, the objective function value on no2 converges in two stages. From the results, we can see that the objective function value decreases as the number of iterations increases, and the iterative procedure guarantees a local optimum of the objective function in Eq.(20). According to our offline experiments, the objective function generally converges after about 30-40 iterations for the datasets in Table II.

B. Efficiency of Unlabeled Data
To verify that the SemiGPr model can take advantage of unlabeled data, we fix the number of labeled data, vary the number of unlabeled examples, and plot the mean squared error (MSE) for the triazines and no2 datasets. The corresponding curves are shown in Figure 2, where the dotted line and solid line indicate the predictive errors on the unlabeled dataset and test dataset respectively. Note that when the proportion of unlabeled data is 0%, the result denotes the MSE of standard GPr. The figure shows that the proposed SemiGPr algorithm has lower MSE than standard GPr on both the unlabeled and test datasets. Moreover, as the proportion of unlabeled examples increases, the advantage of SemiGPr grows further. From this result, we conclude that SemiGPr can gain an extra advantage by utilizing unlabeled data for model training. In other words, the unlabeled data provides useful information, and our semi-supervised algorithm can exploit this information to improve predictive accuracy.
While we observe a significant performance improvement from using unlabeled examples, the unlabeled examples are not always helpful. For example, on no2 (right panel of Figure 2), when the proportion of unlabeled data goes from 30% to 50%, the errors increase instead of decreasing. The same happens on triazines when the size of the unlabeled dataset goes from 90% to 100%. It will be of interest to investigate the cause of this negative effect of unlabeled data experimentally in the future.

C. Evaluation of Regression Accuracy
To further clarify the effect of the proposed method, we compare the MSE of SemiGPr and GPr. The comparative results are summarized in Table III; in each cell, the upper value is the performance on the unlabeled dataset and the lower value is on the test dataset. In this experiment, we consider GPr as the baseline and compare the performance of SemiGPr against it. The improvements are also listed in the table. In addition to the average MSE, we test the significance of the performance difference between SemiGPr and GPr using a paired t-test on the MSE values at the 0.05 level; results with significant improvement are bold-faced in the table. Table III shows that our method SemiGPr performs as well as or better than standard GPr in terms of regression accuracy. We observe that SemiGPr leads to improvements on most of the datasets, and the differences are significant on about half of them. From this comparison, we conclude that GP regression accuracy can be improved by using unlabeled training data within our semi-supervised regression framework. On some datasets, such as chscase, the precision of SemiGPr shows no significant improvement over the standard one. There are two possible reasons for this: one is poor hyper-parameter choices made in the optimization process; the other is the negative effect of unlabeled data shown in the previous experiment.

D. Comparison with other methods
To further evaluate the performance of SemiGPr, we compare our results with other semi-supervised regression methods. In the first experiment, the co-training method (COREG) presented in [?] is compared. The code and documentation of COREG are available at http://lamda.nju.edu.cn/code/COREG.ashx. All experimental settings of COREG are the same as those of SemiGPr, i.e., the same splitting of the training and testing sets, the same preprocessing, and 40 random runs of the algorithm for each dataset. The obtained results are summarized in Table IV. We perform a paired t-test at the 5% significance level, and results with a significant improvement are bold-faced in the table. In general, we observe that our method achieves a smaller error than COREG on all of the datasets. In particular, on the wine and kin8nm datasets, we observe a significant improvement in MSE over COREG. This confirms the conclusion that our semi-supervised method can take advantage of unlabeled data and is effective even when only a limited amount of labeled data is available.
In the second experiment, in order to illustrate the difference between our method and the one proposed in [26], we run experiments on exactly the same datasets as [26], following precisely their preprocessing and testing methods. In the Robot arm dataset, 2000 data points are selected independently for each task, with 1% as labeled data, 10% as unlabeled data and the remainder as test data. In the School dataset, for each task, 2% of the data is selected as labeled data, 20% as unlabeled data and the rest as test data.
Here, we use systematic sampling as the selection method. The normalized mean squared error, defined as the mean squared error divided by the variance of the test outputs, is used as the performance measure. The results are averaged over 10 runs of the algorithm. In Table V we report the test normalized mean squared error for the two multi-task regression datasets.
It turns out that Zhang's method, SSMTR, uses a very similar idea: constructing an adjacency graph and incorporating the prior of the graph with the GP prior to generate a semi-supervised data-dependent kernel function. However, we derived the marginal likelihood and predictive distribution of the GP by different routes. As discussed earlier, the major difference between the two methods is that the training and prediction models are entirely different.
From the results of this experiment in Table V we can see the merits and demerits of these two different models. The results show that SSMTR achieves a smaller error on the robot arm dataset than our method, because Zhang's method considers the relevance among tasks: their model consists of a GP and a common prior on the parameters of all tasks, and the common prior can model this relevance well. The robot arm dataset contains 7 tasks, namely the 7 joint torques of the robot arm. The 7 joint torques are strongly associated with each other, which means the 7 tasks are highly relevant. Therefore, for such a multi-task regression problem, it is better to learn a multi-task model than to build a single-task model for each task independently. However, on the second dataset (School score), although the tasks have some relevance to each other, our method still performs as well as the multi-task method SSMTR. In this dataset, the examination scores of students in different schools are related through the difficulty of the examination. In Zhang's paper, to model this latent relevance, they impose a common Gaussian prior $\theta_i \sim \mathcal{N}(m_\theta, \Sigma_\theta)$ on the kernel parameters of all tasks. Our method, on the other hand, is proposed for a single task and cannot model this relevance. However, the common prior has no effect on a single task, because when there is only one task in a dataset the common prior reduces to a fixed value. Therefore, it may be interesting in the future to compare which method performs better on single tasks.
Another major difference between the two methods lies in hyper-parameter optimization. In Zhang's method, the number of parameters to estimate is large, since all tasks are modeled in one formulation and the number of parameters grows with the number of tasks. Because of this large number of parameters, it is difficult to estimate the optimal values simultaneously, so the parameters are optimized through an alternating optimization algorithm, which can cause a computational problem for large multi-task datasets. In our work, by contrast, the parameters of each task are estimated separately by maximizing the log likelihood, so our approach can easily be parallelized across tasks. It will also be interesting in the future to compare which method performs better in hyper-parameter optimization and which saves training time.

E. Results of Feedback Algorithm
In this part, two of the datasets used for SemiGPr are presented to demonstrate the effectiveness of SemiGPr extended by the feedback algorithm, denoted FdGPr. The experimental setting is the same as in the previous subsection.
To verify that the unlabeled examples and their predictions really contain valuable information, and that our feedback algorithm can utilize this information to improve predictive accuracy, we plot the MSE of FdGPr for different iteration numbers. The results are shown in Figure 3: the dotted line denotes the MSE on the unlabeled dataset, and the solid line is the result on the test dataset. The left figure shows the result for the no2 dataset and the right one for chscase. Note that when the feedback iteration is 0, the result denotes the MSE of SemiGPr.
From the figures we can see that as the iteration number increases, the feedback algorithm cuts the error drastically compared with SemiGPr. The results clearly show that the unlabeled examples and their predictions have a beneficial effect on model learning. In our experiments, a larger iteration number almost always produces better results; considering the computational cost, we set the iteration number $T$ to 20. Although FdGPr achieves a considerable improvement on the unlabeled dataset, it does not improve significantly over the non-feedback baseline on the test dataset. We should therefore point out that the feedback algorithm makes our work transductive, and in future work we need to find a new metric for selecting predictions of unlabeled examples that also improves performance on the test dataset. From these results, we conclude that by utilizing feedback information, FdGPr achieves performance improvements over the other methods, especially on the unlabeled data.

F. Extension by Clustering Regression Framework
A significant problem with GP is that the necessary matrix computations are expensive, scaling as $O(N^3)$. To address this problem, we give a further extension of SemiGPr via a clustering regression framework and discuss the possibility of improving efficiency while retaining accuracy. Some regression methods with a similar idea have already been proposed. For example, [25] proposed a regression clustering algorithm for regression problems with complex distributions; the algorithm updates the data in each cluster using a regression error, so that the clusters and the corresponding regression functions are obtained simultaneously. This method is effective for datasets that contain multiple tasks. In addition, [13] and [8] obtained excellent results on some specific datasets by clustering the input data into several parts and learning a regression model inside each cluster. Those studies discussed the accuracy of combining clustering and regression on specific datasets, such as mixture-distribution and multiple spatial datasets, whereas we focus on empirically demonstrating the execution time and accuracy on general regression datasets. Besides, the above studies used supervised approaches for regression, while our work takes advantage of the proposed semi-supervised regression.
Our clustering regression framework consists of three stages: 1) data partition, 2) model training and 3) output prediction. The pseudo code for the different phases of this framework is shown in Table VII. The first step is to partition the input data into several clusters. General clustering methods, such as k-means, divide the data into distinct clusters, where each data point belongs to exactly one cluster. However, this constraint tends to cause unsuitable clustering results in the boundary areas between clusters. Consequently, in this paper data partition is performed using a soft clustering model, fuzzy c-means clustering (FCM) [2].
The FCM algorithm partitions a finite collection of $N$ elements $X = \{X_L, X_U\} = \{x_1, \ldots, x_N\}$ into $K$ fuzzy clusters with respect to a given criterion. Here $X_L$ and $X_U$ denote the labeled and unlabeled input sets respectively. Given a finite set of data $X$, the algorithm returns a list of $K$ cluster centers $C = \{c_1, \ldots, c_K\}$ and an $N \times K$ partition matrix $P = [p_{ij}]$, $p_{ij} \in [0, 1]$, $i = 1, \ldots, N$, $j = 1, \ldots, K$, where each element $p_{ij}$ can be interpreted as the probability that element $x_i$ belongs to cluster $j$.
Table VI shows the learning process of FCM. Given $X$ and an initial partition matrix $P$, the FCM algorithm first computes the cluster centroid $c_j$ for each cluster using formula (a) in Table VI; the centroid of a cluster is the mean of all points, weighted by their degree of belonging to the cluster. It then calculates the distances $D_{ij}$ from each data point $x_i$ to each cluster center $c_j$. Finally, for each data point, it updates the membership coefficients. This is repeated until the convergence condition is satisfied.
Through the algorithm in Table VI, we obtain the partition matrix $P$, where row $i$ holds the probabilities that data point $x_i$ belongs to each cluster, and column $j$ holds the probabilities that the data points are assigned to cluster $j$. The final goal of clustering, however, is to compute an indicator matrix $Z = [z_{ij}]$, $z_{ij} \in \{0, 1\}$, $i = 1, \ldots, N$, $j = 1, \ldots, K$, where $z_{ij}$ is one if $x_i$ is assigned to cluster $j$ and zero otherwise. The usual way of obtaining $Z$ from $P$ is to set the maximum probability of each row to 1 and the rest to 0, but this reduces to hard clustering to some extent. To obtain fuzzy clusters, we instead use a threshold $\delta$ to filter the partition matrix. For example, with $\delta = 0.4$, the corresponding $z_{ij}$ is one if $p_{ij} > 0.4$ and zero otherwise. Thus, a data point may be assigned not to a single cluster but to all clusters whose membership probability exceeds $\delta$. By adjusting the threshold $\delta$, we can control how much the clusters may overlap.
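The following sketch implements the FCM loop of Table VI together with the threshold filtering just described; the Euclidean distance stands in for the paper's $A$-norm, and the argmax fallback for points whose every membership falls below $\delta$ is our own safeguard.

```python
def fuzzy_c_means(X, K, m=2.0, eps=1e-6, max_iter=100, delta=0.4, seed=0):
    """Soft partition of X into K clusters, plus the thresholded indicator Z."""
    rng = np.random.default_rng(seed)
    P = rng.dirichlet(np.ones(K), size=len(X))          # random initial partition
    for _ in range(max_iter):
        Pm = P ** m
        C = (Pm.T @ X) / Pm.sum(axis=0)[:, None]        # membership-weighted centers
        D = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=-1) + 1e-12
        inv = D ** (-2.0 / (m - 1.0))
        P_new = inv / inv.sum(axis=1, keepdims=True)    # standard FCM update
        converged = np.abs(P_new - P).max() < eps       # ||P(t) - P(t-1)|| < eps
        P = P_new
        if converged:
            break
    Z = (P > delta).astype(int)                         # overlapping assignments
    empty = Z.sum(axis=1) == 0
    Z[empty, P[empty].argmax(axis=1)] = 1               # fallback (ours): argmax
    return C, P, Z
```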
The data partition step yields several clusters with overlaps. In the training step, a local semi-supervised GPr model $M$ (Eq.(14)) is trained for each cluster. Finally, in the prediction step, the prediction for a given data point $x_t$ is a weighted combination of the predictions of the individual local models,

$\hat{y}(x_t) = \sum_{j=1}^{K} P_{x_t}(j)\, M_j(x_t),$

where $P_{x_t}(j)$ denotes the weight of model $j$; it effectively corresponds to the partition matrix entry giving the probability that element $x_t$ belongs to cluster $j$. $M_j(x_t)$ represents the prediction for input $x_t$ by model $M_j$, computed by Eq.(18).

To evaluate the performance of the clustering regression framework (named FCMGPr), we experiment on the datasets described in Table II. The parameters of the FCM algorithm remain unchanged for each dataset: $m = 2$, $\varepsilon = 10^{-6}$, and $A$ is a $d \times d$ diagonal matrix, where $d$ is the dimension of the data $X$. We set the threshold $\delta$ to 0.4. If a dataset contains more than one thousand data points, the number of clusters $K$ is set to 4; otherwise $K = 2$. The experimental setting of the regression part is the same as in the evaluation of SemiGPr.
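Before turning to the results, here is a minimal sketch of the prediction step above. For an unseen point, the membership weights are recomputed from the learned centers with the FCM update formula, which is our assumption; for training data, the partition matrix itself can be used.

```python
def fcm_membership(x_t, C, m=2.0):
    """FCM membership of a point x_t with respect to centers C (shape (K, d))."""
    D = np.linalg.norm(C - x_t, axis=1) + 1e-12
    inv = D ** (-2.0 / (m - 1.0))
    return inv / inv.sum()

def fcmgpr_predict(x_t, C, local_models, m=2.0):
    """Weighted combination y^(x_t) = sum_j P_xt(j) * M_j(x_t)."""
    w = fcm_membership(x_t, C, m)
    preds = np.array([M_j(x_t) for M_j in local_models])   # local model outputs
    return float(w @ preds)
```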
Firstly, we compare the accuracy of SemiGPr and FCMGPr. The MSE results on the unlabeled and test data are shown in Table VIII. From the results we can see that FCMGPr performs better than the semi-supervised baseline on the no2 and chscase datasets. One reason is that local models are more flexible than a single global one: constructing models locally can capture the details of the data better than applying one global model across the entire dataset. In addition, making predictions by weighted combination helps to avoid inaccurate results caused by incorrect clustering. However, the clustering regression framework has a problem: if the amount of labeled data in a cluster is too small, particularly for high-dimensional data, it is easy to make poor hyper-parameter choices or to under-fit, which results in poor accuracy for FCMGPr. This explains the results on the wine and Friedman datasets, where the local models did not improve over the semi-supervised baseline.
Secondly, we tested the execution time of SemiGPr and FCMGPr. The results are shown in Figure 4, where the left two panels compare the training time and the right two panels show the prediction time. As we can see, for all four datasets both the training time and the prediction time of FCMGPr are reduced compared with SemiGPr. From these results we conclude that the clustering regression framework can greatly improve the computational efficiency of semi-supervised GPr, and it also has the potential to improve prediction accuracy.

VII. CONCLUSION
In this paper we presented and evaluated a semi-supervised GPr obtained by incorporating an adjacency graph into the standard GP probabilistic framework. By extending the standard GP to the semi-supervised setting, we can learn a regression model from only a small number of expensive labeled examples together with a large amount of easily obtained unlabeled data. Moreover, we presented a feedback algorithm, which selects confident predictions for feedback to further improve performance. Furthermore, to address the computational problem of GP, we gave a further extension of the semi-supervised GPr via a clustering regression framework. The experimental results indicate that our semi-supervised regression approach improves prediction accuracy. Besides, by selecting confident predictions for feedback, it brings a significant improvement in prediction accuracy over a non-feedback baseline. The extension by the clustering regression framework succeeds in reducing the execution time.
In the experiments, we compared SemiGPr with some state-of-the-art methods. There also exist other semi-supervised regression methods, such as the regularization regression method [14] and the propagable graph method [16]. However, because of different experimental settings, we could not compare the proposed method with them; future work should include implementing these methods and comparing against them empirically. We will also apply our scheme to harder regression tasks. Although the results of the feedback extension were encouraging, the algorithm has a high time complexity due to the re-training of SemiGPr. Therefore, in future work a new feedback criterion should be explored to obtain more accurate predictions at a lower computational cost.

Fig. 1. The objective function value (negative log marginal likelihood) decreases with the number of iterations for the triazines (left) and no2 (right) datasets.

Fig. 2. Performance of SemiGPr as a function of the number of unlabeled examples.

Fig. 3. The effect of different feedback iterations on the unlabeled and test datasets.

Fig. 4. Comparison of execution time.

TABLE I. ALGORITHM OF FEEDBACK

Input: labeled dataset $(X_L, y_L)$, unlabeled dataset $X_U$, learning iterations $T$, initial parameter set InitPara
Output: prediction model $M$

Step 1 (training): $M \leftarrow$ Semitrain$(X_L, y_L, X_U, \text{InitPara})$
Step 2 (choosing and feedback):
for $t = 1 : T$ do
    Create a pool $X'_U$ by randomly selecting data points from $X_U$
    for each $x_u \in X'_U$ do
        compute $\hat{y}_u = M(x_u)$, re-train $M'$, and evaluate $E_{x_u}$ by Eq.(19)
    end for
    move the pair $(x_u, \hat{y}_u)$ with the largest positive $E_{x_u}$ into the labeled set and re-train $M$
end for

TABLE II. DATASETS USED FOR SEMIGPR. d IS THE FEATURE DIMENSION; N DENOTES THE SIZE OF THE DATASET.

TABLE III. COMPARISON OF SEMIGPR WITH STANDARD GPR ON DIFFERENT DATASETS.

TABLE V. COMPARISON OF SEMIGPR WITH SSMTR (SSRT: SUPERVISED SINGLE-TASK REGRESSION, WHICH USES ONE GP FOR EACH TASK).

TABLE VII. PSEUDO CODE OF THE CLUSTERING FRAMEWORK

TABLE VIII. RESULTS OF THE CLUSTERING REGRESSION FRAMEWORK.