Using Unlabeled Data to Improve Inductive Models by Incorporating Transductive Models

Abstract—This paper shows how to use labeled and unlabeled data to improve inductive models with the help of transductive models. We propose a solution for the self-training scenario. Self-training is an effective semi-supervised wrapper method which can generalize any supervised inductive model to the semi-supervised setting; it iteratively refines an inductive model by bootstrapping from unlabeled data. Standard self-training uses the classifier (trained on labeled examples) to label and select candidates from the unlabeled training set, which may be problematic since the initial classifier may not be able to provide highly confident predictions when labeled training data are scarce. As a result, it tends to introduce too many wrongly labeled candidates into the labeled training set, which may severely degrade performance. To tackle this problem, we propose a novel self-training style algorithm which incorporates a graph-based transductive model into the self-labeling process. Unlike standard self-training, our algorithm utilizes labeled and unlabeled data as a whole to label and select unlabeled examples for training set augmentation. A robust transductive model based on graph Markov random walks is proposed, which exploits the manifold assumption to output reliable predictions on unlabeled data from noisy labeled examples. The proposed algorithm greatly reduces the risk of performance degradation due to accumulated noise in the training set. Experiments show that the proposed algorithm can effectively utilize unlabeled data to improve classification performance.


I. INTRODUCTION
Traditional inductive models such as Naive Bayes, CARTs [1], and Support Vector Machines operate in the supervised setting, which means they can only be trained on labeled data. Training a good inductive model requires enough labeled examples. Unfortunately, preparing labeled data is often expensive and time consuming, while unlabeled data are readily available. This was the major motivation that led to the rise of the semi-supervised paradigm, which utilizes a few labeled examples and vast amounts of cheap unlabeled examples to learn a model. Semi-supervised learning has achieved considerable success in a wide variety of domains. Existing semi-supervised learning methods can be roughly categorized into several paradigms [2], including generative models, semi-supervised support vector machines (S3VMs), graph-based methods, and bootstrapping wrapper methods.
Self-training [3] is a simple and effective semi-supervised algorithm which has been successfully applied to various real-world tasks. It is a wrapper method, which means it can generalize any supervised inductive model to the semi-supervised setting [4]. Self-training initially trains a classifier on labeled data and then iteratively augments its labeled training set by adding newly pseudo-labeled unlabeled examples on which its own predictions are most confident. Standard self-training uses the classifier (trained on labeled examples) to label and select candidates from the unlabeled training set, which may be problematic since the initial classifier may not be able to provide highly confident predictions when labeled training data are scarce. In addition, since self-training utilizes unlabeled data in an incremental manner, early noise introduced into the training set is reinforced round by round, resulting in severe performance degradation. Although some techniques, e.g., data editing [5], have been employed to alleviate this noise-related problem [6], the results are still unsatisfactory. Consequently, self-training tends to introduce too many wrongly labeled candidates into the labeled training set, which may severely degrade performance. Another drawback of self-training is that the newly added examples are not informative to the current classifier, since they can already be classified confidently [7]. As a result, they may only help increase the classification margin, without actually providing any novel information to the current classifier.
In this paper, we show how to use unlabeled data to improve inductive models with the help of transductive models. We propose a solution for the self-training scenario: a novel self-training style algorithm. Unlike traditional self-training, which uses only labeled data to label and select unlabeled examples for training set augmentation, our algorithm utilizes both labeled and unlabeled data to facilitate the self-labeling process. In detail, all the labeled and unlabeled examples are represented as a graph, on which a novel Markov random walk with constraints is proposed to label all examples in a transductive setting [8]. This graph-based method satisfies the manifold assumption: examples with high similarities in the input space should share similar labels. Typically, most graph-based methods assign labels to unlabeled data in a transductive setting, e.g., label propagation, Markov random walks, and low density separation [9]. These methods utilize unlabeled data by representing all the data as a graph, with examples as vertices and similarities between examples as edges. Existing transductive graph-based methods assume all labels on labeled data are correct and cannot work with training sets subject to noise. In contrast, our transductive model naturally deals with noisy labeled data, utilizing "label smoothing" to automatically adjust potentially wrong labels. By incorporating this transductive model into the self-training process, we expect any applied supervised inductive model to be greatly improved.
The main contribution of this paper can be summarized as follows:
• We show that incorporating transductive models into inductive models in semi-supervised settings can improve classification performance. The proposed algorithm learns an inductive model f from labeled and unlabeled data as follows (a code sketch is given after this list):
1) initialize the model f using the labeled set L;
2) use f to predict labels on the unlabeled set U;
3) select a subset S from U for which f has the most confident predictions;
4) construct a neighborhood graph G from L ∪ U under a certain similarity measure;
5) incorporate a transductive model into the self-labeling process: given the prior information about labels on L ∪ U, start a constrained random walk on graph G to label all the unlabeled examples in U;
6) choose the k most confident examples from U for labeled training set augmentation according to the output of the random walk;
7) refine f with the augmented labeled data.
The procedure repeats until no unlabeled examples are left.
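To make the procedure concrete, the following Python sketch shows one way the wrapper loop could be organized. It assumes a scikit-learn-style learner with fit/predict_proba; the function name transductive_label is a hypothetical stand-in for the constrained random walk of Section II-B, not part of the original paper.

```python
import numpy as np

def self_train(learner, X_lab, y_lab, X_unlab, transductive_label, k, max_iter=50):
    """Sketch of steps 1)-7): self-training with a transductive labeler."""
    X_l, y_l, X_u = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    f = learner.fit(X_l, y_l)                            # 1) initialize f on L
    for _ in range(max_iter):
        if len(X_u) == 0:                                # stop when U is exhausted
            break
        proba = f.predict_proba(X_u)                     # 2) predict labels on U
        s_idx = np.argsort(-proba.max(axis=1))[:len(y_lab)]  # 3) S: |L| most confident
        y_s = proba[s_idx].argmax(axis=1)
        # 4)-5) graph construction + constrained random walk over L ∪ U
        y_hat, cf = transductive_label(X_l, y_l, X_u, s_idx, y_s)
        top = np.argsort(-cf)[:k]                        # 6) k most confident in U
        X_l = np.vstack([X_l, X_u[top]])
        y_l = np.concatenate([y_l, y_hat[top]])
        X_u = np.delete(X_u, top, axis=0)
        f = learner.fit(X_l, y_l)                        # 7) refine f
    return f
```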
The key distinction between the proposed algorithm and standard self-training is the incorporated transductive model, which utilizes both labeled and unlabeled examples to make predictions on unlabeled data. Most graph-based semi-supervised methods are transductive; they are nonparametric and can handle multi-class classification problems. We propose a novel constrained Markov random walk for the transduction step. The most desirable property of the proposed transductive model is that it works well even if the training set contains label noise. It is therefore well suited to the self-training process, as the pseudo-labeled set S may contain some wrongly labeled examples. At this step, it is expected to yield more reliable predictions on unlabeled data than the classifier does when the training set is subject to label noise. Next, we present the details of the proposed transductive graph-based model.

II. THE PROPOSED ALGORITHM

A. Description and Notation

Let L denote the labeled training set with size |L| and U denote the unlabeled training set with size |U|. The goal of our algorithm is to learn a classifier from L ∪ U to classify unseen examples. Generally, the initial labeled examples are quite few, i.e., |L| ≪ |U|.

B. Markov Random Walk with Constraints
Markov random walk is a transductive graph-based approach which exploits the manifold assumption to label all the unlabeled examples. Typically, it is given an undirected graph G = (V, E, W), where a node v ∈ V corresponds to an example in L ∪ U, an edge e = (a, b) ∈ V × V indicates that the labels of the two vertices a, b should be similar, and the weight W_{ab} reflects the strength of this similarity. In this paper, the graph is constructed using the k nearest neighbor criterion. Let C = {1, ..., m} be the set of possible labels. For each example v ∈ L ∪ U, two row vectors Y_v and Ŷ_v are defined. The first vector Y_v is the input; its l-th element encodes the prior knowledge about label l for example v. For instance, a labeled example v with label c has Y_{vc} set to 1 and the remaining m − 1 elements of Y_v set to 0. Unlabeled examples have all their elements set to 0, i.e., Y_{vl} = 0 for l = 1, ..., m. The second vector Ŷ_v is the output of the algorithm, with semantics similar to Y_v: a high value of Ŷ_{vl} indicates that the algorithm believes that the vertex (example) v should have label l.
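As an illustration of this setup, the sketch below builds a k-nearest-neighbor graph W and the prior label matrix Y. The Gaussian conversion from distances to similarities and the bandwidth heuristic are assumptions made for the sketch; the paper only specifies the k nearest neighbor criterion.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def build_graph_and_labels(X, y_labeled, labeled_idx, n_classes, k=10):
    """Sketch: k-NN graph W and prior label matrix Y as defined above."""
    # k-NN graph with distances converted to similarities (assumed Gaussian kernel)
    D = kneighbors_graph(X, k, mode="distance")
    sigma = D.data.mean()                       # assumed bandwidth heuristic
    D.data = np.exp(-(D.data ** 2) / (2 * sigma ** 2))
    W = D.maximum(D.T)                          # symmetrize: undirected graph

    # row Y[v] is one-hot for a labeled example, all zeros otherwise
    Y = np.zeros((X.shape[0], n_classes))
    Y[labeled_idx, y_labeled] = 1.0
    return W, Y
```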
The constraints on the random walk are formalized via three possible actions: inject, continue, and abandon (denoted by inj, cont, and abnd), with pre-defined probabilities p^inj_v, p^cont_v, and p^abnd_v for each node v. Clearly, their sum is one:

$$p^{inj}_v + p^{cont}_v + p^{abnd}_v = 1.$$

To label any example v (either labeled or unlabeled), we initiate a random walk starting at v facing three options. First, with probability p^inj_v the random walk stops and returns (i.e., injects) the pre-defined vector Y_v; we constrain p^inj_v = 0 for unlabeled examples. Second, with probability p^abnd_v the random walk abandons the labeling process and returns the all-zeros vector 0_m. Third, with probability p^cont_v the random walk continues to one of v's neighbors v′, with probability proportional to W_{vv′}, (v, v′) ∈ E. We summarize the above process with the following set of equations. The transition probabilities are

$$\Pr[v' \mid v] = \frac{W_{vv'}}{\sum_{v'' : (v,v'') \in E} W_{vv''}}. \quad (1)$$

The expected score Ŷ_v for node v ∈ V is given by

$$\hat{Y}_v = p^{inj}_v\, Y_v + p^{cont}_v \sum_{v'} \Pr[v' \mid v]\, \hat{Y}_{v'} + p^{abnd}_v\, \mathbf{0}_m. \quad (2)$$

In this paper, the three probabilities p^inj_v, p^cont_v, p^abnd_v are set using heuristics adapted from [10]. Let H[v] denote the entropy of the transition probabilities Pr[v′ | v], and define

$$c_v = \frac{\log \beta}{\log(\beta + e^{H[v]})}, \qquad d_v = \begin{cases} (1 - c_v)\sqrt{H[v]} & v \text{ labeled} \\ 0 & \text{otherwise} \end{cases}, \qquad z_v = \max(c_v + d_v, 1), \quad (3)$$

with p^cont_v = c_v / z_v, p^inj_v = d_v / z_v, and p^abnd_v = 1 − p^cont_v − p^inj_v. Here c_v is monotonically decreasing with the number of neighbors of node v in graph G: intuitively, the higher the value of c_v, the lower the number of neighbors of vertex v and the higher the information they contain about the labeling of v. The other quantity d_v is monotonically increasing with the entropy (for labeled vertices). It is noteworthy that abandonment occurs only when the continuation and injection probabilities are low enough. This is most likely to happen at unlabeled nodes with high degree. In effect, a high p^abnd_v prevents the algorithm from propagating information through high-degree nodes.
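The per-node probabilities of Eq. (3) could be computed as in the sketch below, which follows the Adsorption-style heuristic of [10]; the constant β and the exact functional forms are assumptions carried over from that reference, and dense matrices are used for clarity.

```python
import numpy as np

def node_action_probs(W, labeled_mask, beta=2.0):
    """Sketch of Eq. (3): inject/continue/abandon probabilities per node.

    W is a dense (n, n) similarity matrix; labeled_mask marks labeled nodes.
    """
    P = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)  # transition probs, Eq. (1)
    H = -(P * np.log(np.clip(P, 1e-12, None))).sum(axis=1)   # entropy per node
    c = np.log(beta) / np.log(beta + np.exp(H))              # small for high-degree hubs
    d = np.where(labeled_mask, (1.0 - c) * np.sqrt(H), 0.0)  # labeled vertices only
    z = np.maximum(c + d, 1.0)
    p_cont, p_inj = c / z, d / z
    p_abnd = 1.0 - p_cont - p_inj                            # probabilities sum to one
    return p_inj, p_cont, p_abnd
```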
The final labeling for all v ∈ L ∪ U can be computed by iterating the random walk update of Eq. (2). The algorithm converges when the label distribution on each node ceases to change. Note that the initial labeled set L is assumed to be noise-free, while the pseudo-labeled set S may contain classification noise; hence, certain modifications to the transition probabilities need to be made:
• Labels on L, which are considered noise-free, should not change during the random walk. For every example v ∈ L, the transition probabilities are therefore fixed as

$$p^{inj}_v = 1, \qquad p^{cont}_v = p^{abnd}_v = 0.$$

• Since examples in S may be wrongly labeled by the classifier, labels on S are allowed to change.
For every v ∈ S, the transition probabilities are computed according to Eq. (3).
• For an unlabeled example u ∈ U − S, we only constrain p^inj_u = 0.
Note that the predicted label y_u and labeling confidence CF(u, y_u) of each example u ∈ U − S can be easily obtained from Ŷ_u:

$$y_u = \arg\max_{l \in C} \hat{Y}_{ul}, \quad (4)$$

$$CF(u, y_u) = \frac{\hat{Y}_{u y_u}}{\sum_{l \in C} \hat{Y}_{ul}}. \quad (5)$$

Our strategy is to incorporate this transductive model into standard self-training's labeling process; the concrete procedure of the proposed algorithm is outlined in Algorithm 1. It is noteworthy that the size of S has only a minor impact on the final performance. For convenience, |S| is empirically set equal to the number of initial labeled examples, i.e., |L|, and we also set k equal to |L|. The maximum iteration number M is set to 50.
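The fixed-point iteration below is a minimal sketch of how Eq. (2), the constraints above, and the label/confidence rule of Eq. (4)-(5) could be solved numerically; the convergence tolerance, iteration cap, and dense-matrix representation are implementation assumptions.

```python
import numpy as np

def constrained_random_walk(W, Y, p_inj, p_cont, clean_mask, tol=1e-6, max_sweeps=1000):
    """Sketch: iterate Eq. (2) to a fixed point under the constraints above.

    clean_mask marks the noise-free set L, whose action probabilities are
    overridden (p_inj = 1, p_cont = p_abnd = 0) so its labels never change.
    """
    p_inj, p_cont = p_inj.copy(), p_cont.copy()
    p_inj[clean_mask], p_cont[clean_mask] = 1.0, 0.0

    P = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)  # Eq. (1)
    Y_hat = Y.copy()
    for _ in range(max_sweeps):
        # abandon contributes the all-zeros vector, so it drops out of the update
        Y_new = p_inj[:, None] * Y + p_cont[:, None] * (P @ Y_hat)
        if np.abs(Y_new - Y_hat).max() < tol:                # distributions settle
            break
        Y_hat = Y_new
    return Y_hat

def predict_with_confidence(Y_hat):
    """Eq. (4)-(5) as reconstructed above: argmax label, normalized score."""
    y = Y_hat.argmax(axis=1)
    cf = Y_hat.max(axis=1) / np.maximum(Y_hat.sum(axis=1), 1e-12)
    return y, cf
```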

III. EXPERIMENTS AND DISCUSSION
In this section, we design experiments to verify the efficacy of our algorithm. We mainly focus on the self-training framework, aiming to find out how a transductive model can improve a semi-supervised inductive model. 12 UCI data sets are used in the experiments [11]. Information on these data sets is shown in Table I. For each data set, about 25% of the data are kept as test examples; 10% of the remaining data are used as the labeled set, with the rest serving as the unlabeled set.

Algorithm 1 summarizes the proposed procedure: in each round, compute y_u and CF(u, y_u) for all u ∈ U using Eq. (4) and Eq. (5); choose the k most confident examples from U based on CF(u, y_u); add the chosen pseudo-labeled examples to L′; and retrain f ← Learn(L ∪ L′). The loop ends when no unlabeled examples remain or M iterations are reached.

The proposed algorithm is compared with standard self-training and SETRED [6]. SETRED is an improved self-training algorithm that incorporates data editing techniques to help identify and remove wrongly labeled examples from the training set during the self-training process. For fair comparison, the termination criteria used by self-training and SETRED are similar to that used by our algorithm.
We use three supervised inductive models as base learners to perform classifier induction, aiming to investigate how each compared algorithm behaves with base learners of diverse characteristics; the base learners include Naive Bayes and LibSVM. Experiments are carried out on each data set for 100 runs under randomly partitioned labeled/unlabeled/test splits. Table II to Table IV present the classification errors of the compared algorithms under the different inductive models. The "initial" column denotes the average error rate of classification with labeled data only. The columns denoted "final" and "improve" report the average error rates and performance improvements of each algorithm, respectively. Table II to Table IV show that the proposed algorithm can effectively improve performance with all the underlying inductive models. In particular, compared to SETRED, which uses a specific data editing technique to actively identify wrongly labeled examples in the enlarged training set, the proposed algorithm achieves better results with no effort spent on cleaning the training set. This evidence supports our argument that the incorporated transductive model is robust to noise introduced by the self-labeling process, so it achieves stable performance. Moreover, the empirical results on the two multi-class data sets (vehicle, wine) suggest that our algorithm is superior to self-training and SETRED when dealing with multi-class classification problems. This is mainly because it can naturally handle multi-class classification by exploiting the manifold assumption to yield confident predictions for training set augmentation.
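The following sketch shows one plausible way to generate the random labeled/unlabeled/test partitions described above (about 25% test, 10% of the remainder labeled); stratification and rounding details are assumptions, as the paper does not specify them, and the data set size used in the usage line is hypothetical.

```python
import numpy as np

def random_split(n, rng, test_frac=0.25, labeled_frac=0.10):
    """One labeled/unlabeled/test partition of n examples."""
    idx = rng.permutation(n)
    n_test = int(round(test_frac * n))              # ~25% held out for testing
    test, rest = idx[:n_test], idx[n_test:]
    n_lab = max(1, int(round(labeled_frac * len(rest))))  # 10% of remainder labeled
    return rest[:n_lab], rest[n_lab:], test         # labeled, unlabeled, test

# 100 runs per data set, each with a fresh random partition
rng = np.random.default_rng(0)
splits = [random_split(1000, rng) for _ in range(100)]
```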

IV. CONCLUSIONS
This paper shows the benefits of incorporating transductive models into semi-supervised bootstrapped inductive models, such as self-training. This strategy utilizes both labeled and unlabeled data to yield more reliable predictions for unlabeled examples. We propose a robust self-training style algorithm which exploits the manifold assumption to facilitate the self-training process. We adopt a transductive model based on graph random walks to prevent performance degradation due to the accumulation of classification noise. Empirical results on 12 UCI data sets show that the proposed algorithm can effectively exploit unlabeled data to enhance performance.
Graph construction is vital to our algorithm. In this paper, we only use the common Euclidean distance as the distance measure, and there is no guarantee that this is the optimal choice. Generally, choosing the best distance measure for a specific learning task is very difficult, and some efforts have been made towards tackling this problem under the name of distance metric learning. How to identify or learn the optimal distance measure, and how it affects performance, are worth further investigation.


TABLE I: Data set summary

TABLE II: Classification error rates of the 3 compared algorithms on 12 data sets using Naive Bayes

TABLE IV: Classification error rates of the 3 compared algorithms on 12 data sets using LibSVM