PICT

Classifying text data has been an active area of research for a long time. A text document is a multifaceted object and is often inherently ambiguous. Multi-label learning deals with such ambiguous objects. This ambiguity makes it difficult for a classifier to assign the relevant classes to an input document. Traditional single-label and multi-class text classification paradigms cannot efficiently classify such a multifaceted text corpus. In this paper we propose a novel label propagation approach based on semi-supervised learning for multi-label text classification. The proposed approach models the relationship between class labels and also effectively represents the input text documents. We use semi-supervised learning to make effective use of both labeled and unlabeled data for classification. The proposed approach promises better classification accuracy and handling of complexity, and is evaluated on standard datasets such as Enron, Slashdot and Bibtex.


I. INTRODUCTION
Text classification is becoming increasingly popular among researchers. The major objective of a text classification system is to organize the available text documents systematically into their respective categories [7]. This categorization facilitates storage, searching and retrieval of relevant text documents or their contents for the applications that need them. Three paradigms exist under text classification: single-label (binary), multi-class and multi-label. Under single-label classification a new text document belongs to exactly one of two given classes; in the multi-class case a new text document belongs to just one class of a set of m classes; and under the multi-label scheme each document may belong to several classes simultaneously [3]. In practice, many approaches exist and have been proposed for the binary and multi-class cases, even though in many applications text documents are inherently multi-label in nature. For example, when classifying online news articles, news stories about the scams in the Commonwealth Games in India can belong to classes such as sports, politics, country-india etc.
The multi-label text classification problem refers to the scenario in which a text document can be assigned to more than one class simultaneously. It has attracted significant attention from many researchers because it plays a crucial role in applications such as web page classification, classification of news articles and information retrieval. Supervised machine learning methods are mainly used to realize multi-label text classification. But as they need labeled data for classification all the time, semi-supervised methods are nowadays used in multi-label text classifiers. Many approaches are used to implement a multi-label text classifier. In this paper we propose a label propagation approach for multi-label text classification, which uses existing label information to identify the labels of unlabeled documents. We represent the input text document corpus in the form of a graph to exploit the ambiguity among different text documents. The ambiguity is represented in the form of similarity measures as weighted edges between text documents. Within the setting of semi-supervised learning we have focused not only on graph construction but also on sparsification and weighting of the graph to improve the classifier's accuracy. We apply the proposed framework to standard datasets such as Enron, Bibtex and Slashdot.
The rest of the paper is organized as follows. Section 2 describes literature related to semi-supervised learning methods for multi-label text classification. Section 3 presents the mathematical modeling of our approach. Section 4 describes our proposed label propagation approach for building a multi-label text classifier, followed by experiments and results in Section 5 and a conclusion in the last section.

II. RELATED WORK
A multi-label text classifier can be realized by using supervised, unsupervised or semi-supervised methods of machine learning. Supervised methods need only labeled text data for training. Unsupervised methods rely only on unlabeled text documents, whereas semi-supervised methods can effectively use unlabeled data in addition to the labeled data [1][2]. The traditional approach to multi-label learning either decomposes the classification task into multiple independent binary classification tasks or uses ranking to find the relevant set of classes. But these methods do not exploit the relationship among class labels. A few popular existing methods are the binary relevance method, label power set method, pruned sets method, C4.5, Adaboost.MH & Adaboost.MR, ML-kNN and the classifier chains method [20]. But all of these lack the capability of handling unlabeled data, i.e. they are based on the principle of supervised learning.
While designing a multi-label text classifier, the major objective is not only to identify the set of classes to which a given new text document belongs but also to identify the most relevant of them, in order to improve the accuracy of the overall classification process. Graph-based approaches are known for their effective exploration of document representation, and semi-supervised methods exploit both labeled and unlabeled data for classification. The accuracy of a multi-label text classifier can therefore be improved by using a graph-based representation of the input documents in conjunction with the label propagation approach of semi-supervised learning [16][17].
www.ijacsa.thesai.org
Table 1 summarizes a few well-known representative methods for multi-label text classification based on semi-supervised learning; a few use only a graph-based framework and a few use both. In the preprocessing stage, graph-based approaches can effectively represent the relationship between labeled and unlabeled documents by identifying the structural and semantic relationships between them for more relevant classification. During the training phase, semi-supervised methods can propagate the labels of labeled documents to unlabeled documents based on some energy function or regularizer. Our proposed work is based on the same strategy.

III. MATHEMATICAL MODEL OF PROPOSED SYSTEM
In this section we introduce some notions related to text classification. We first represent the input document corpus in the form of a graph. Graph construction converts the input text document corpus $X$ to a graph $G$, i.e. $X \rightarrow G$, where $X$ represents the input text document corpus $x_1, x_2, \ldots, x_n$ and each text document instance $x_i$ is in turn represented as an m-dimensional feature vector. $G$ represents the overall graph structure as $G = (V, E)$, where $V$ is the set of vertices corresponding to the document instances $x_i$ and $E$ is the set of weighted edges between pairs of vertices, the weight of an edge corresponding to the similarity between the two documents it connects. Generally a weight matrix $W$ is computed to capture the similarity between pairs of text documents. Various similarity measures such as cosine or Jaccard, or kernel functions $K(\cdot)$ such as the RBF kernel or the Gaussian kernel, can be used for this purpose. With this graph-based setting, we use semi-supervised learning to propagate labels on the graph from labeled nodes to unlabeled nodes and compare the estimated labels $\hat{Y}$ with the true labels.
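The graph construction step can be illustrated with a minimal sketch, assuming cosine similarity as the edge weight and a zero diagonal (no self-loops). The function names and the toy feature vectors below are hypothetical and chosen only for illustration; they are not taken from the original implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_weight_matrix(docs):
    """Return the symmetric n x n edge-weight matrix W with W[i][i] = 0."""
    n = len(docs)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            W[i][j] = W[j][i] = cosine_similarity(docs[i], docs[j])
    return W


# Toy term-frequency vectors (hypothetical data, m = 3 features).
docs = [[1.0, 0.0, 2.0], [2.0, 0.0, 4.0], [0.0, 3.0, 0.0]]
W = build_weight_matrix(docs)
```

Here the first two documents point in the same direction in feature space and receive a weight close to 1, while the third shares no terms with them and receives a weight of 0.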

IV. PROPOSED APPROACH
We mainly use the smoothness assumption of semi-supervised learning to propagate the labels of labeled documents to unlabeled documents. The smoothness assumption states that "if two input points x1, x2 in a high-density region are close to each other then so should be the corresponding outputs y1, y2". Based on this, we emphasize exploiting the relationships between input text documents in the form of a graph and the relationships between the class labels in the form of a correlation matrix. The purpose is to reduce classification errors and to assign more relevant class labels to a new test document instance.
During the classifier's training phase we compute the similarity between input documents to identify whether they lie in a high-density or a low-density region. We evaluate the relationships between documents using the cosine similarity measure and represent them in the form of a weight matrix $W$ as $W = \cos(X_1, X_2) = \frac{X_1 \cdot X_2}{\|X_1\| \, \|X_2\|}$, where $X_1$ and $X_2$ are two text documents represented in the feature space. A large cosine value indicates that the documents are similar and a small value indicates that they are dissimilar. After that we perform graph sparsification by representing the graph in the form of a diagonal matrix $D$ in order to reduce the consideration of redundant data.
To identify relationships between class labels we compute a correlation matrix $C_{m \times m}$, where m is the number of class labels, using the RBF kernel. Each class is represented as a vector whose elements are 1 when the corresponding text document belongs to the class under consideration. Then, in the testing phase, in order to provide a relevant label set to an unlabeled document, we compute an energy function $E$ to measure the smoothness of label propagation. This energy function measures the difference between the weight matrix $W$ and the product of the inverse of the diagonal matrix $D$ with the correlation matrix $C$:

$$E = W_{ij} - D^{-1} C_{ij}$$
The labels are propagated based on the minimum value of the energy function. It indicates that if two text documents are similar to each other then the class labels assigned to them are also likely to be close to each other. In other words, two documents sharing a highly similar input pattern are likely to be in a high-density region, and thereby the classes assigned to them are likely to be related and are propagated to those documents which in turn reside in the same high-density region.
After this label propagation phase, we obtained the labels of all unlabeled document instances. We computed accuracy to verify the correct assignment of label sets. The corresponding results are given in Table III. We further verified the approach by applying the resulting documents and label sets to the existing classifier chains method, which is supervised in nature. We used decision tree (J48 in WEKA) and SVM (SMO and libSVM) separately as base classifiers and computed the results. The corresponding results are given in Table IV.
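The idea that labels flow along high-weight edges can be illustrated with a minimal sketch. Note the assumptions: the update below is the standard iterative label propagation rule (each unlabeled document takes the degree-normalized weighted average of its neighbours' label scores, while labeled documents stay clamped), not the exact energy-based update of this paper, and the label correlation matrix is left out. The function name, the 0.5 decision threshold, the tolerance and the toy data are all hypothetical.

```python
def propagate_labels(W, Y_labeled, max_iter=1000, tol=1e-6, threshold=0.5):
    """Standard label propagation sketch.

    W         : n x n symmetric weight matrix with zero diagonal.
    Y_labeled : binary label rows for the first l documents (l <= n).
    Returns an n x m matrix of 0/1 label predictions.
    """
    n, l, m = len(W), len(Y_labeled), len(Y_labeled[0])
    degree = [sum(row) for row in W]  # D_ii = sum_j W_ij

    # Initial scores: known labels for labeled rows, zeros for unlabeled rows.
    Y = [[float(v) for v in row] for row in Y_labeled]
    Y += [[0.0] * m for _ in range(n - l)]

    for _ in range(max_iter):
        Y_next = [row[:] for row in Y]  # labeled rows are copied unchanged (clamped)
        for i in range(l, n):
            if degree[i] > 0.0:
                Y_next[i] = [
                    sum(W[i][j] * Y[j][k] for j in range(n)) / degree[i]
                    for k in range(m)
                ]
        change = max(
            abs(a - b)
            for new_row, old_row in zip(Y_next, Y)
            for a, b in zip(new_row, old_row)
        )
        Y = Y_next
        if change < tol:  # stop once the scores no longer move
            break

    return [[1 if v >= threshold else 0 for v in row] for row in Y]


# Toy graph: documents 0 and 1 are labeled, documents 2 and 3 are unlabeled.
W = [
    [0.0, 0.0, 0.9, 0.1],
    [0.0, 0.0, 0.1, 0.9],
    [0.9, 0.1, 0.0, 0.0],
    [0.1, 0.9, 0.0, 0.0],
]
Y_labeled = [[1, 0], [0, 1]]
predicted = propagate_labels(W, Y_labeled)
# predicted == [[1, 0], [0, 1], [1, 0], [0, 1]]
```

In this toy graph, document 2 is most similar to document 0 and inherits its label, while document 3 inherits the label of document 1, which matches the smoothness assumption.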
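The label correlation matrix used in this section can be sketched as follows, assuming that each class is represented by its binary indicator vector over the labeled documents and that the RBF kernel is applied to pairs of these vectors. The kernel width `gamma`, the function name and the toy label matrix are hypothetical, since the kernel parameters are not stated here.

```python
import math


def label_correlation(Y, gamma=1.0):
    """Return the m x m matrix C with C[a][b] = exp(-gamma * ||y_a - y_b||^2).

    Y is an l x m binary matrix; column k is the indicator vector of class k
    (1 where the document belongs to the class, 0 otherwise).
    """
    m = len(Y[0])
    columns = [[row[k] for row in Y] for k in range(m)]
    C = [[0.0] * m for _ in range(m)]
    for a in range(m):
        for b in range(m):
            squared_distance = sum(
                (u - v) ** 2 for u, v in zip(columns[a], columns[b])
            )
            C[a][b] = math.exp(-gamma * squared_distance)
    return C


# Toy label matrix: 3 documents, 3 classes (hypothetical data).
Y = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
C = label_correlation(Y)
```

Classes 0 and 1 always occur together in the toy data, so their correlation is 1, whereas class 2 never co-occurs with them and receives a much smaller value.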
We now define our graph-based multi-label text classifier system $S$ as follows: $S = \{X, Y, T, \hat{Y}, h\}$, where $X$ represents the entire input text document corpus $\{x_1, x_2, \ldots, x_n\}$. Out of these, $|L|$ documents are labeled and the remaining are unlabeled. $Y$ represents the set of possible labels $\{Y_1, Y_2, \ldots, Y_n\}$. $T$ represents the multi-label training set of the classifier, of the form $\{(x_1, Y_1), (x_2, Y_2), \ldots, (x_n, Y_n)\}$, where $x_i \in X$ is a single document instance and $Y_i \subseteq Y$ is the label set associated with $x_i$. $\hat{Y}$ represents the set of estimated labels $\{\hat{Y}_l, \hat{Y}_u\}$. The goal of the system is to learn a function $h : X \rightarrow 2^{Y}$ from $T$ which predicts the set of labels for the unlabeled documents, i.e. $x_{l+1}, \ldots, x_n$.

The summary of our proposed label propagation approach is given as:
Input -
T: The multi-label training set $\{(x_1, Y_1), (x_2, Y_2), \ldots, (x_n, Y_n)\}$.
z: The test document instance such that $z \in X$.
Output - The predicted label set for z.
Process:
- Compute the edge weight matrix $W$ as $W_{ij} = \frac{x_i \cdot x_j}{|x_i| \, |x_j|}$ and assign $W_{ii} = 0$
- Sparsify the graph by computing the diagonal degree matrix $D$ as $D_{ii} = \sum_j W_{ij}$
- Compute the label correlation matrix $C_{m \times m}$ using the RBF kernel method
- Initialize $\hat{Y}^{(0)}$ to the set $(Y_1, Y_2, \ldots, Y_l, 0, 0, \ldots, 0)$
- Iterate till convergence to $\hat{Y}^{(\infty)}$:
1. $E = W_{ij} - D^{-1} C_{ij}$
2. $\hat{Y}^{(t+1)} = E$
3. $\hat{Y}^{(t+1)}_l = Y_l$
- Label point z by the sign of $\hat{Y}^{(\infty)}_i$

V. EXPERIMENTS AND RESULTS
In this section, in order to evaluate our approach, we conducted experiments on three text-based datasets, namely Enron, Slashdot and Bibtex, and measured the accuracy of the overall classification process. Table II summarizes the statistics of the datasets that we used in our experiments. The Enron dataset contains email messages; it is a subset of about 1700 labeled email messages [21]. The BibTeX dataset contains metadata for the bibtex items, such as the title of the paper, the authors, etc. The Slashdot dataset contains article titles and partial blurbs mined from Slashdot.org [22].

TABLE 1 :
STATISTICS OF POPULAR ALGORITHMS FOR MLTC BASED ON SEMI SUPERVISED LEARNING AND GRAPH BASED REPRESENTATION

TABLE II :
STATISTICS OF DATASETS

We evaluated our approach under a WEKA-based [23] framework running under Java JDK 1.6 with the libraries of MEKA and Mulan [21][22]. The Jblas library was used for performing matrix operations while computing weights on graph edges. Experiments ran on 64-bit machines with 2.6 GHz of clock speed, allowing up to 4 GB RAM per iteration. Ensemble iterations are set to 10 for EPS. Evaluation is done in the form of 5 × 2 fold cross validation on each dataset.
We used the accuracy measure proposed by Godbole and Sarawagi in [13]. It symmetrically measures how close $Y_i$ is to $Z_i$, i.e. the estimated labels to the true labels. It is the ratio of the size of the intersection to the size of the union of the predicted and actual label sets, taken for each example and averaged over the number of examples $N$: $\text{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \frac{|Y_i \cap Z_i|}{|Y_i \cup Z_i|}$. We also computed precision, recall and F-measure values.
We first measured accuracy, precision and recall after the label propagation phase was over. Table III lists the accuracy measured for each dataset. After the label propagation phase we obtained the labels of all unlabeled documents, which gives an entirely labeled dataset. We applied this labeled set to the Ensemble of Classifier Chains (ECC) method, which is supervised in nature [24], and measured accuracy, precision and recall with three different base classifiers: decision tree (J48 in WEKA) and two variations of support vector machine (SMO in WEKA, libSVM). We also measured the overall testing and building time required for this process. ECC is a proven and efficient supervised multi-label text classification technique; we verified our entire final labeled dataset by giving it as input to ECC. The results are listed in Table IV.
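The evaluation measures used in this section can be illustrated with a minimal sketch. The accuracy follows the intersection-over-union definition of Godbole and Sarawagi; the example-based forms of precision and recall, the F-measure as their harmonic mean, and the handling of empty label sets are assumptions made for this illustration, as are the function name and the toy label sets.

```python
def example_based_scores(true_sets, predicted_sets):
    """Return (accuracy, precision, recall, f_measure) averaged over examples."""
    n = len(true_sets)
    accuracy = precision = recall = 0.0
    for true_labels, predicted_labels in zip(true_sets, predicted_sets):
        z, y = set(true_labels), set(predicted_labels)
        intersection = len(z & y)
        union = len(z | y)
        # Assumed conventions: two empty sets count as a perfect match;
        # an empty predicted (or true) set contributes 0 to precision (or recall).
        accuracy += intersection / union if union else 1.0
        precision += intersection / len(y) if y else 0.0
        recall += intersection / len(z) if z else 0.0
    accuracy, precision, recall = accuracy / n, precision / n, recall / n
    if precision + recall == 0.0:
        f_measure = 0.0
    else:
        f_measure = 2.0 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure


# Toy example with two documents (hypothetical label sets).
true_sets = [{"sports", "politics"}, {"india"}]
predicted_sets = [{"sports"}, {"india", "politics"}]
accuracy, precision, recall, f_measure = example_based_scores(
    true_sets, predicted_sets
)
# accuracy = 0.5, precision = 0.75, recall = 0.75, f_measure = 0.75
```

For the first document one of two true labels is predicted, and for the second one extra label is predicted, so each document contributes an intersection-over-union of 1/2.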

TABLE III :
RESULTS AFTER LABEL PROPAGATION PHASE

TABLE IV :
RESULT AFTER USING SUPERVISED MULTI LABEL CLASSIFIER