Accuracy Based Feature Ranking Metric for Multi-Label Text Classification

In many application domains, such as machine learning, scene and video classification, data mining, medical diagnosis and machine vision, instances belong to more than one category. Feature selection in single-label text classification is used to reduce the dimensionality of datasets by filtering out irrelevant and redundant features. Dimensionality reduction in multi-label classification is a different scenario, because here features may be relevant to more than one class. The label and instance spaces are growing rapidly with the expansion of the Internet, which is challenging for Multi-Label Classification (MLC). Feature selection is crucial for data reduction in MLC. Method adaptation and dataset transformation are the two techniques used to select features in multi-label text classification. In this paper, we present a dataset transformation technique to reduce the dimensionality of multi-label text data. We use two problem transformation approaches, Binary Relevance and Label Powerset, to transform the data from multi-label to single label. Feature selection is done using the filter approach, which uses the data itself to decide the importance of features without applying a learning algorithm. We use a simple measure (ACC2) for feature selection in multi-label text data, applying single-label feature selection measures after problem transformation, and compare ACC2 with two other feature selection methods, information gain (IG) and the Relief measure. Experiments are carried out on three benchmark datasets and their empirical evaluation results are shown. ACC2 is found to perform better than IG and Relief in 80% of the cases in our experiments.

Keywords: Binary relevance (BR); label powerset (LP); ACC2; information gain (IG); Relief-F (RF)


I. INTRODUCTION
A feature is a measurable characteristic or property of the observed process. Text data is high dimensional in nature, and a moderately sized dataset may contain thousands of features. Multi-label membership is another important property of text data; i.e., a document can belong to none, one, or more than one class. In single-label classification, each document belongs to only one label (class), but in multi-label classification, which is the case in real-world scenarios such as web pages, newspapers, sports magazines and data mining, a document can belong to more than one class; this has become a recent research topic [1]. Feature selection (FS) is a data pre-processing step in many machine learning applications and plays an important role in dimensionality reduction [24]. It helps in reducing computational requirements and in understanding the data. FS reduces dimensionality by filtering out irrelevant features, thus improving the prediction capability of a classifier. Researchers evaluate the quality of feature selection in two ways: individual evaluation and subset evaluation [12], [5]. Individual evaluation is computationally efficient: it evaluates features (variables) and assigns them weights (ranks) according to their predictive ability in classification. It ignores the inter-dependency of features and is therefore incapable of removing redundant features [21]. Subset evaluation handles both redundancy and relevance of features, but it requires more computational power. The main objective of feature selection is to select a subset of features with stronger discrimination power [19]. It reduces the effects of redundant and noisy variables by keeping only the features that are useful for prediction [3].
If two features are so strongly correlated that they depend on each other, only one of them is sufficient for describing the data [17]. Dependent features give no extra information about the data. The goal of feature selection is to obtain the total information from fewer unique features that carry maximum discrimination between the classes. In some applications, due to lack of information about the observed process, features having no correlation with the class act as noise. Such features produce bias in the classification process. Classifier efficiency is enhanced by feature selection techniques, which also give some insight into the data and the process being observed.
From a machine learning perspective, removing irrelevant features requires a feature selection criterion that takes into account the relevance of each feature to the output class. Irrelevant features lead to poor generalization of the predictor. Feature selection is not a feature extraction technique like principal component analysis (PCA) [2], [20]. Since discriminative features may be independent of the rest of the data, a procedure called pruning is introduced after feature selection to find the subset of optimal features. Evaluating all $2^N$ subsets of N features is NP-hard and difficult to solve in polynomial time; therefore a sub-optimal solution is adopted that can eliminate redundant features with tractable computation. Subset feature selection deals with the scenario in which some subset of features is selected while all others are ignored.
We introduce ACC2, a well-known feature selection technique widely used in single-label text classification. In this paper we apply this single-label feature selection approach in conjunction with Binary Relevance and Label Powerset. The presented technique is fast and accurate compared with the two other feature selection methods (IG, RF). To change the multi-label data into single-label data we use the Binary Relevance (BR) and Label Powerset (LP) techniques.
BR transforms the original dataset into L datasets, where L is the number of labels associated with the dataset. Each new dataset contains all the instances of the original dataset, but with only one class associated with each instance, and each label value has only two states, positive or negative. BR normally does not take label correlation into account and fails to predict label rankings, but it is lightweight and reversible. Another advantage of BR is that independent labels can be added to or removed from the model without disturbing the rest of the model. In the LP approach, new classes are generated from the combinations of labels that occur, and the problem is then solved using a single-label multi-class approach.
The remainder of the paper is organized as follows. Related work is discussed in Section II. In Section II-B, we describe two label transformation methods and their basic theory. Basic concepts related to feature selection and its importance are discussed in Section III. Section VII introduces the benchmark multi-label datasets and their statistics, while Section VI presents the most frequently used evaluation measures for multi-label learning. Results of the feature selection algorithms on the benchmark datasets are discussed in Section VIII.

II. RELATED WORK
Feature selection is widely used to reduce the dimensionality of data. A number of comprehensive publications can be found on supervised, semi-supervised and unsupervised machine learning topics relating to the feature selection and classification domains [11], [12], [4], [18]. A multi-label feature selection approach using Relief and Information Gain (IG) is discussed in [13]. A novel approach that jointly performs feature selection and classification, called joint feature selection and classification for multi-label learning (JFSC), is proposed in [14]. The distribution-based feature selection measure Chi-square is used with label powerset as the problem transformation technique in [15]. Ensemble embedded feature selection (EEFS), a novel technique, is proposed in [16]; the authors developed this method for feature selection on multi-label clinical data.
A variety of classifiers exist for multi-label classification, such as AdaBoost [26], BP-MLL [27], SVM [5] and ML-KNN [25]. Each classifier has its own merits, but ML-KNN is preferred in most research work. In the ML-KNN method, the Euclidean distance is measured between the unlabeled test example and the instances of the training dataset; the label set for the test example is then selected using the maximum a posteriori (MAP) principle.

A. Multi-Label Learning
According to [5], multi-label learning has two categories: Multi-label Classification (MLC) and Label Ranking (LR). MLC is defined as a function $h_{MLC} : \chi \to 2^{L}$, where $\chi$ is an e-dimensional feature space and $L = \{\lambda_1, \lambda_2, \ldots, \lambda_r\}$ is an output space of $r > 1$ labels. Each subset of L is called a label-set. If an input instance is given to a classifier or predictor, it returns a set of relevant labels, $Y$, and a set of irrelevant labels, $\bar{Y}$. Hence a bipartition of the labels into relevant and irrelevant labels is obtained. Generally speaking, multi-class classification is a special case of MLC where $h_{MC} : \chi \to L$, while in binary classification $h_{B} : \chi \to \{0, 1\}$.
Label ranking uses a function $f : \chi \times L \to \mathbb{R}$ that returns an ordering of all possible labels according to their relevance to an input instance x. Thus a label $\lambda_1$ is ranked higher than another label $\lambda_2$ if it satisfies $f(x, \lambda_1) > f(x, \lambda_2)$. A rank function, $\tau_x$, maps the real-valued classifier outputs to the position of each label in the ranking, $\{1, 2, \ldots, r\}$. Hence, the lower the position, the better the label rank, i.e. $f(x, \lambda_1) > f(x, \lambda_2) \Rightarrow \tau_x(\lambda_1) < \tau_x(\lambda_2)$. Fig. 1 [6] describes the basic taxonomy for feature selection in multi-label classification.
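The relation between the scoring function f and the rank function τ_x can be illustrated with a minimal sketch. The function name `rank_labels` and the example scores are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def rank_labels(scores):
    """Map real-valued scores f(x, label) to rank positions; position 1 is best."""
    # Sort labels by decreasing score, then number the positions from 1.
    order = sorted(scores, key=lambda label: scores[label], reverse=True)
    return {label: position for position, label in enumerate(order, start=1)}


# Hypothetical classifier outputs for one instance x and three labels.
scores = {"lambda1": 0.9, "lambda2": 0.2, "lambda3": 0.6}
ranks = rank_labels(scores)
print(ranks)
```

A higher score always yields a smaller rank position, which is the implication stated above.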

B. Data Transformation Methods
Let X be an e-dimensional input space of numerical features and let $L = \{\lambda_1, \lambda_2, \ldots, \lambda_r\}$ be an output space of $r > 1$ labels. A relation between features and labels is given as $(x, Y)$, where $x = (x_1, x_2, \ldots, x_e)$ is an e-dimensional instance associated with a set of labels $Y \subseteq L$. Here $Y = (y_1, y_2, \ldots, y_r) \in \{0, 1\}^r$ is an r-dimensional binary vector in which each element is 1 if the corresponding label is relevant and 0 otherwise. Table I shows the comparison of single-label (binary, multi-class) data with multi-label data.
Multi-label learning methods are categorized into two groups. The first is method adaptation, in which existing single-label classifier models are extended to deal with multi-label data directly. The second is problem transformation, which transforms the multi-label problem into several binary classification problems (BR) or into a multi-class problem over the different combinations of labels that occur (LP).

C. Binary Relevance (BR)
BR is like the one-versus-all (OVA) approach: it generates one dataset for each label, and in each generated dataset the positive patterns represent the presence of a particular class label while all other patterns are set to negative. BR thus transforms the original dataset into L datasets. Each new dataset contains all the instances of the original dataset, but with only one class, and each label value has only two states, positive or negative. In the i-th dataset, an instance is labeled positive if its label set contains the i-th label, and negative otherwise. A new pattern is classified by all L binary models, and the union of the labels they predict is the predicted label set. BR scales linearly with the number of labels r, but it does not consider the correlation between labels.
Table III shows the binary relevance (BR) based transformation of data from multi-label to single label when applied to the dataset of Table II.

D. Label Powerset (LP)
In the label powerset (LP) approach, each distinct combination of labels present in the training set is treated as a different class, and single-label classification is then performed on the transformed data. Although this approach makes the task easy, the number of label-sets grows as the number of labels increases, which raises the computational cost and impedes learning. The number of training examples for each label set will also be very small. To address this problem, the initial set of labels is split into small random subsets of labels (label-sets), and LP is performed on these label-sets. This approach is called RAKEL (random k-labelsets), where the parameter k specifies the size of the label-sets. Unlike BR, LP considers the correlation between labels. Table IV shows the dataset formed after transformation using label powerset.
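The BR and LP transformations described in this section can be sketched as follows. This is a minimal illustration under stated assumptions: label sets are represented as Python sets, the helper names `binary_relevance` and `label_powerset` are hypothetical, and the toy data is not the dataset of Table II.

```python
def binary_relevance(label_sets, labels):
    """BR: one binary target vector per label (1 = positive, 0 = negative)."""
    return {lab: [1 if lab in y else 0 for y in label_sets] for lab in labels}


def label_powerset(label_sets):
    """LP: each distinct label combination becomes one class id."""
    class_of = {}   # label combination -> class id, in order of first appearance
    targets = []
    for y in label_sets:
        key = frozenset(y)
        if key not in class_of:
            class_of[key] = len(class_of)
        targets.append(class_of[key])
    return targets, class_of


# Toy multi-label targets for four instances (illustrative only).
label_sets = [{"a", "b"}, {"b"}, {"a", "b"}, {"c"}]
labels = ["a", "b", "c"]

br_targets = binary_relevance(label_sets, labels)
lp_targets, lp_classes = label_powerset(label_sets)
print(br_targets)
print(lp_targets)
```

BR yields as many binary problems as there are labels, while LP yields a single multi-class problem whose number of classes equals the number of distinct label combinations in the data.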

III. FEATURE SELECTION
In this section, basic concepts related to feature selection and its importance are discussed. In FS we pick the best features, which provide relatively more information about the instance category to the classifier. In other words, we find the most suitable subset of features $X' \subseteq X$ that may enhance the prediction capability of the classifier. There are basically three FS approaches: filter, wrapper and embedded. We discuss each one in detail.

A. Wrapper Approach
Wrapper methods find the most suitable subset of relevant features using the classification/learning algorithm itself. This has a high computation cost, as the classification task has to be run for each candidate subset of features. As the number of features increases, classification must be run more often to find a suitable subset, which makes the exhaustive problem NP-hard. To reduce the computational burden while still finding a suitable subset of features, search algorithms are incorporated.
There are different search algorithms for feature selection, each with its pros and cons. The branch and bound approach [10] uses a tree structure for the selection of features; its complexity increases exponentially with the number of features. For large datasets with a huge number of features, exhaustive search is not appropriate. There are feasible approaches that yield good results at lower computation cost, e.g. sequential search, particle swarm optimization, genetic algorithms and other heuristic search algorithms. Wrapper methods are further split into two categories: sequential search and heuristic search algorithms.
Sequential search algorithms continue to add or remove features until the objective function reaches its maximum. The criterion is to maximize the objective function with a minimal number of features. Sequential search algorithms are iterative in nature.
A sequential feature selection algorithm starts with an empty set and, at each step, adds the single feature that yields the maximal value of the objective function. The wrapper approach requires the learning algorithm to find a suitable set of features, and it is therefore inclined towards the set of features that suits that particular learning algorithm; considerable computation power is required for the wrapper approach.

B. Embedded Approach
The embedded approach integrates feature selection into the training algorithm as part of the learning process. Decision trees are an example: at each stage they select the best feature, i.e. the one with the highest discriminative power to differentiate among classes.

C. Filter Methods
Filter methods select sets of optimal features based on the intrinsic characteristics of the dataset; irrelevant features are filtered out, and this whole process is separate from the learning phase/algorithm. Variable ranking is the major technique used in filters to select features in ordered form. Ranking methods are versatile, which is why they contribute greatly to practical applications. A particular ranking measure is used to score the features, and features whose score falls below some threshold are discarded.
The basic trait of a relevant/distinctive feature is that it preserves the necessary information about the classes present in the dataset. This trait is the relevance of the feature, which is necessary for separating distinct classes. But how can feature relevance be described? Different researchers describe it differently. In [7] the author defines an irrelevant feature as follows: "an irrelevant feature is conditionally independent of class labels". This implies that a relevant feature cannot be independent of the class labels, although it can be independent of the input data. It also suggests that relevant features have a certain amount of influence on the classes; if they do not, they should be considered irrelevant. One of the most important parameters in determining feature relevance is the correlation between features and classes, which describes a feature's ability to discriminate between classes.
In this paper, we use the ACC2 feature selection measure on multi-label text data and compare it with two other well-known filter-based methods (Relief-F and Information Gain). In Sections III-C1, III-C2 and III-D, we discuss these techniques in detail.

1) Relief F measure:
Relief-F is a heuristic approach, developed in [8], that removes irrelevant features from datasets. It is an extension of the basic Relief algorithm [9]. Relief is capable of dealing with discrete as well as continuous attributes, but it cannot deal with multi-class problems. It estimates the quality of features on the basis of how well their values discriminate among neighbouring instances. Relief-F searches for the k nearest misses $M_j(C)$, $j = 1, \ldots, k$, for each class C, and calculates the weight estimate by averaging the contribution of each class.
In (1), R is a randomly selected instance, for which Relief searches for its two nearest neighbors: one from the same class, called the nearest hit H, and the other from a different class, called the nearest miss M. It updates the quality estimate W[A] for all attributes A depending on their values for the instances R, M and H. If instances R and H have different values of attribute A, then A separates two instances of the same class, which is not desirable, so we decrease the quality estimate W[A]. The diff function in (1) calculates the difference between the values of an attribute for two instances, and is applied to the nearest hit and the nearest miss.
The basic idea is that Relief-F separates each pair of classes on the basis of the features, regardless of which two classes are nearest to each other.
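The hit/miss weight update described above can be sketched for the basic two-class Relief case. This is a simplified illustration, not the paper's Relief-F implementation: it assumes numeric attributes, a single nearest hit and nearest miss (k = 1), and it visits every instance once instead of sampling R at random so that the result is deterministic. The function name `relief_weights` and the toy data are hypothetical.

```python
def relief_weights(X, y):
    """Basic two-class Relief sketch: reward attributes that differ on the
    nearest miss and penalise attributes that differ on the nearest hit."""
    n, d = len(X), len(X[0])
    # Value range of each attribute, used to normalise diff() into [0, 1].
    span = [(max(row[a] for row in X) - min(row[a] for row in X)) or 1.0
            for a in range(d)]

    def diff(a, p, q):
        return abs(p[a] - q[a]) / span[a]

    def distance(p, q):
        return sum(diff(a, p, q) for a in range(d))

    W = [0.0] * d
    for i in range(n):  # assumption: every instance is used once as R
        R = X[i]
        hits = [X[j] for j in range(n) if j != i and y[j] == y[i]]
        misses = [X[j] for j in range(n) if y[j] != y[i]]
        H = min(hits, key=lambda q: distance(R, q))    # nearest hit
        M = min(misses, key=lambda q: distance(R, q))  # nearest miss
        for a in range(d):
            W[a] += (diff(a, R, M) - diff(a, R, H)) / n
    return W


# Toy data: attribute 0 separates the classes, attribute 1 does not.
X = [[0.0, 0.3], [0.1, 0.9], [0.9, 0.2], [1.0, 0.8]]
y = [0, 0, 1, 1]
weights = relief_weights(X, y)
print(weights)
```

On this toy data the discriminative attribute receives a positive weight and the noisy attribute a negative one. Relief-F extends the same idea by averaging over k hits and over k misses from every other class.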
2) Information Gain (IG): Information gain represents the dependency between the input features and the class labels. It is defined through Shannon's well-known equation for entropy:
$$H(Y) = -\sum_{y} p(y) \log p(y) \tag{2}$$
Entropy is the uncertainty in the output label Y. The entropy of the output, given the input X, is:
$$H(Y \mid X) = -\sum_{x} \sum_{y} p(x, y) \log p(y \mid x) \tag{3}$$
By already knowing the input we can predict the output label Y with more accuracy. Hence IG relates the dependency of the input X to the output label Y as:
$$IG(Y, X) = H(Y) - H(Y \mid X) \tag{4}$$
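The entropy-based definition of information gain can be illustrated with a short sketch for a discrete feature and a discrete class label. Probabilities are estimated by relative frequencies and logarithms are taken to base 2; both are assumptions made for this illustration, as are the function names.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def information_gain(x, y):
    """Reduction in the entropy of y obtained by knowing the feature x."""
    n = len(y)
    conditional = 0.0
    for value in set(x):
        subset = [yi for xi, yi in zip(x, y) if xi == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(y) - conditional


y = [1, 1, 0, 0]                 # class labels
informative = [1, 1, 0, 0]       # term presence that mirrors the class
uninformative = [1, 0, 1, 0]     # term presence independent of the class

ig_informative = information_gain(informative, y)
ig_uninformative = information_gain(uninformative, y)
print(ig_informative, ig_uninformative)
```

A feature that fully determines the class removes all uncertainty, while a feature that is independent of the class leaves the entropy unchanged.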

D. ACC2 Feature Selection Measure
The accuracy measure (ACC) is a well-known feature selection technique widely used in single-label text classification. It is simply the difference between the true positives (tp) and false positives (fp) of a term. It works well on balanced datasets but performs poorly on unbalanced datasets, because it is biased toward tp.
The balanced accuracy measure (ACC2) is an enhanced version of the accuracy measure (ACC) [22]. ACC2 is the absolute difference between the true positive rate (tpr) and the false positive rate (fpr). Since tpr is normalized, i.e. obtained by dividing by the class size, it solves the problem of bias toward tp. To our knowledge, this is the first use of this simple technique for feature selection in multi-label text classification. The formulae of ACC and ACC2 are given in (5) and (6), respectively.
$$ACC = tp - fp \tag{5}$$
$$ACC2 = |tpr - fpr| \tag{6}$$
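The difference between ACC and ACC2 on unbalanced data can be made concrete with a small sketch. It assumes that tpr = tp / (number of positive documents) and fpr = fp / (number of negative documents), which follows the description above; the function names and the document counts are hypothetical.

```python
def acc(tp, fp):
    """ACC: difference between true positives and false positives of a term."""
    return tp - fp


def acc2(tp, fp, n_pos, n_neg):
    """ACC2: absolute difference between the true and false positive rates."""
    tpr = tp / n_pos   # share of positive documents containing the term
    fpr = fp / n_neg   # share of negative documents containing the term
    return abs(tpr - fpr)


# Hypothetical unbalanced class: 10 positive and 990 negative documents.
n_pos, n_neg = 10, 990

# Term A occurs in 8 positive and 8 negative documents.
score_a = (acc(8, 8), acc2(8, 8, n_pos, n_neg))
# Term B occurs in 2 positive documents and no negative ones.
score_b = (acc(2, 0), acc2(2, 0, n_pos, n_neg))
print(score_a, score_b)
```

With raw counts, term A scores lower than term B even though it covers most of the small positive class and under 1% of the negative class; after normalizing by class size, ACC2 ranks term A higher.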

IV. PROPOSED METHODOLOGY
For multi-label text classification, we present the well-known feature selection measure ACC2, which is widely used in single-label text classification. We compare the performance of ACC2 with two other feature selection measures (Information Gain and Relief-F). We first use Binary Relevance (BR) and Label Powerset (LP) for data transformation. We then perform feature selection to reduce the dimensionality of the data. Information gain uses the entropy measure between labels and features, showing the dependency between features and labels (classes). Features with greater values of IG are ranked higher. Entropy is the impurity present in the instances/examples, while information gain is the average reduction in entropy obtained from a given feature. The higher the value of IG, the stronger the dependence between features and classes.
The balanced accuracy measure is a widely used algorithm in single-label text classification. It takes the absolute difference between the true positive rate (tpr) and the false positive rate (fpr). Detailed expressions of the three feature selection measures are given in Section III. RF-BR, IG-BR and ACC2-BR first transform the multi-label dataset into single-label datasets using the binary relevance transformation; then the feature selection methods RF, IG and ACC2 are applied to select the features that discriminate most strongly among the classes. However, since BR does not consider the correlation between labels during transformation, the same limitation exists in these approaches.
In the RF-LP, IG-LP and ACC2-LP methods, feature selection is done after transforming the data from multi-label to single label using the label powerset technique. Data transformation techniques are described in Section II-B.
After feature selection, classification is done using the ML-KNN classifier. We use four well-known evaluation measures (Hamming loss, subset accuracy, and micro- and macro-averaged F-measure) to estimate the accuracy of the three feature selection algorithms.

V. MOTIVATION EXAMPLE
This section discusses the working of the six feature ranking metrics with the help of an example. Table V is a sampled dataset presented only for illustration and comparison of the different metrics based on problem transformation. We have 15 documents belonging to 3 classes and 10 terms/features. We show in practice that multi-label data becomes highly unbalanced after transformation to single label. This is not a problem in the single-label feature selection regime. In multi-label classification, problems do arise from the multi-label to single-label transformation, as binary relevance does not take label dependency into consideration. LP, on the other hand, only considers the distinct label-sets. It is therefore unable to predict new label-sets, which causes over-fitting to the training data. However, these techniques are lightweight and give results almost comparable to method adaptation techniques.
Table VI shows a comparison of the six ranking metrics and the scores they assign to the features. In multi-label datasets, features can be relevant to more than one class, so it is very difficult to judge the discrimination power of a particular feature with respect to the class labels. Many factors have to be considered for rank assignment in the multi-label domain. As can be seen, IG-BR and ACC2-BR assign the first rank to $f_{10}$, while RF-LP and RF-BR assign the first rank to $f_4$; the other metrics assign lower ranks to $f_4$. From Table V, one can estimate that $f_{10}$, $f_9$ and $f_8$ are more important, as they match closely with the three classes. In the multi-label domain, the correlation of features among themselves and with all the class labels should also be considered.

VI. EVALUATION MEASURES
Evaluation measures used for multi-label classification are different from those used for single-label classification. Evaluation measures fall into two categories: label based and example based. Label-based measures are an extended form of the evaluation measures used in the single-label classification domain. Example-based measures are built specifically for the multi-label domain [28]. Here we give the expressions of the evaluation measures used for multi-label classification. In all the evaluation measures below, x is the label set predicted by the ML-KNN classifier and y is the actual (true) label set.

$$\text{Hamming loss}(x_i, y_i) = \frac{1}{N} \sum_{i=1}^{N} \frac{\operatorname{xor}(x_i, y_i)}{L} \tag{9}$$
Hamming loss is the average difference between the actual and predicted values of the labels. A low value of Hamming loss indicates better classification performance.
www.ijacsa.thesai.org
Accuracy is the closeness of the measured value to the known standard value. It is the fraction of correctly classified instances out of the total number of instances to be classified. In multi-label classification, the accuracy of a metric is measured using the above equation.
Precision is the fraction of predicted labels that are correct, tp/(tp + fp).
Recall is the fraction of actual labels that are correctly predicted, tp/(tp + fn).

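$$\text{Subset accuracy} = \frac{1}{N} \sum_{i=1}^{N} I(x_i = y_i) \tag{10}$$
Subset accuracy, or classification accuracy, is defined by (10), where $I(\cdot)$ is 1 if its argument is true and 0 otherwise. It is a very strict requirement, as it is the fraction of instances whose set of predicted labels exactly matches the set of actual labels.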
The $F_1$ measure is a single measure obtained by combining two evaluation measures, precision and recall. It is used to trade off precision against recall.
In the macro $F_1$ measure, we calculate the precision and recall of each set and take their average.
In the micro $F_1$ measure, we find the $t_p$, $f_p$ and $f_n$ of all the available sets and then apply them in (16) to calculate the final score. In the equation, q represents the available sets. A high value of accuracy and of the other evaluation criteria indicates better classification performance, except for the Hamming loss metric.
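The measures in this section can be sketched on binary label matrices (one row per instance, one column per label). This is a minimal illustration, not the evaluation code used in the experiments. The macro average here is taken as the mean of the per-label F1 scores, which is one common convention and an assumption of this sketch; all function names and the toy data are hypothetical.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of label positions where prediction and truth disagree."""
    n, n_labels = len(y_true), len(y_true[0])
    wrong = sum(t != p for yt, yp in zip(y_true, y_pred) for t, p in zip(yt, yp))
    return wrong / (n * n_labels)


def subset_accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label set matches exactly."""
    return sum(yt == yp for yt, yp in zip(y_true, y_pred)) / len(y_true)


def label_counts(y_true, y_pred, j):
    """tp, fp and fn for label column j."""
    pairs = [(yt[j], yp[j]) for yt, yp in zip(y_true, y_pred)]
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, fp, fn


def f1(tp, fp, fn):
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(y_true, y_pred):
    """Assumption: average of the per-label F1 scores."""
    q = len(y_true[0])
    return sum(f1(*label_counts(y_true, y_pred, j)) for j in range(q)) / q


def micro_f1(y_true, y_pred):
    """Pool tp, fp and fn over all labels, then compute a single F1."""
    q = len(y_true[0])
    counts = [label_counts(y_true, y_pred, j) for j in range(q)]
    tp, fp, fn = (sum(c[i] for c in counts) for i in range(3))
    return f1(tp, fp, fn)


y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 1]]
print(hamming_loss(y_true, y_pred), subset_accuracy(y_true, y_pred))
print(macro_f1(y_true, y_pred), micro_f1(y_true, y_pred))
```

On this toy example only one of three instances is predicted exactly, so subset accuracy is much lower than one minus the Hamming loss, which shows how strict the exact-match requirement is.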

VII. EXPERIMENTAL SETUP AND DATASETS
We performed experiments on the three benchmark text datasets given in Table VII. Preprocessing, such as stemming and stop word removal, had already been done on these datasets, which are available at (mulan dataset). We used the Java platform for experimentation. Transformation of the data from multi-label to single label is done using the Binary Relevance (BR) and Label Powerset (LP) techniques. After data transformation, feature selection algorithms are applied to reduce the dimensionality of the data. Classification is done using the ML-KNN classifier. The performance of the feature selection algorithms is measured on percentages (10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%) of the top-ranked features selected by each algorithm. We used five evaluation measures (Hamming loss, ranking loss, subset accuracy, and micro- and macro-averaged measures) to test the performance of the six feature selection algorithms at the different test points. Table VII also lists the characteristics of the datasets: number of instances (N), number of features (F), number of class labels (L), label cardinality (LC), label density (LD), and distinct combinations of labels (DC).
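Selecting a percentage of the top-ranked features can be sketched as follows. The rounding rule (floor, with at least one feature kept) and the function name `top_percent` are assumptions made for this illustration; the scores stand for the output of any ranking measure such as ACC2, IG or Relief-F.

```python
def top_percent(scores, percent):
    """Return the indices of the top `percent`% features by score."""
    # Assumption: round down, but always keep at least one feature.
    k = max(1, percent * len(scores) // 100)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Hypothetical scores for ten features.
scores = [0.1, 0.9, 0.4, 0.7, 0.2, 0.8, 0.3, 0.6, 0.5, 0.0]
selected = {p: top_percent(scores, p) for p in (10, 20, 30, 40, 50, 60, 70, 80)}
print(selected[30])
```

Each test point then trains and evaluates the classifier using only the selected feature columns.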

Label Cardinality
Label cardinality (LC) is the average number of labels per example/instance. It is calculated using (17), where N is the number of instances and L represents the number of labels in a sample.

Label Density
Label density (LD) is the normalized form of LC, as shown in (18).
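The dataset statistics LC, LD and DC can be computed as in the sketch below. It assumes that each instance's labels are given as a Python set and that LD is LC divided by the total number of labels, which is the usual normalization; the function names and the toy data are hypothetical.

```python
def label_cardinality(label_sets):
    """LC: average number of labels per instance."""
    return sum(len(y) for y in label_sets) / len(label_sets)


def label_density(label_sets, n_labels):
    """LD: label cardinality normalised by the total number of labels."""
    return label_cardinality(label_sets) / n_labels


def distinct_combinations(label_sets):
    """DC: number of distinct label combinations in the dataset."""
    return len({frozenset(y) for y in label_sets})


# Toy dataset with four instances over three labels.
label_sets = [{"a", "b"}, {"b"}, {"a", "b"}, {"c"}]
lc = label_cardinality(label_sets)
ld = label_density(label_sets, n_labels=3)
dc = distinct_combinations(label_sets)
print(lc, ld, dc)
```

DC is also the number of classes that the LP transformation produces for the dataset.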
For each dataset D, the feature reduction measure for feature selection can be calculated from (19),
where X' is the feature subset obtained after feature selection from dataset D and M is the number of examples. Six feature selection techniques are applied to each dataset. The classifier response is evaluated for the features that are selected.

VIII. RESULTS
We applied six FS methods and five evaluation measures on three benchmark datasets. Tables VIII, XII and XVI show the Hamming loss for the described datasets. Hamming loss is the fraction of labels on which the predicted and actual labels disagree, as previously shown in (9). Subset accuracy (10) is another measure, which tells whether the predicted label set is exactly the actual label set. Micro-averaged precision results are shown in Tables X, XIV and XVIII. In micro-averaged precision, large classes dominate small classes, as it is the ratio of true positives to tp + fp pooled over all concerned classes. The F1 measure is the harmonic mean of precision and recall; it considers the true positives and ignores the true negatives, and it assigns equal weight to precision and recall. Precision is the number of actually correct results out of the results marked correct by the classifier, tp/(tp + fp), and recall is the fraction of correct results retrieved out of all the correct results, tp/(tp + fn) [23]. The macro-average measure is more biased towards average recall than average precision. The label-based micro-average criterion is biased towards the most populated labels, while the macro average is computed from the tp and fp of each class separately. Macro averaging is biased towards the least populated classes.

A. Enron Dataset
The Enron dataset is a benchmark dataset available at (mulan dataset), having 1702 instances and 53 labels with cardinality 3.78. Tables VIII to XI show the experiments done on the Enron dataset, and in the next subsections we discuss their results based on the different measures.
3) Micro and macro averaged F1-score: In Table X, we compare the different feature ranking criteria based on the BR and LP transformation approaches at different numbers of selected features. RF-BR performed better than IG-BR and ACC2-BR in the micro-averaged case in the BR domain. In the LP domain, ACC2-LP is leading while IG-BR performed worse. The RF-BR approach performed best in the macro-averaged case for BR, while ACC2-LP performed best in the LP case. IG-BR as well as IG-LP underperformed in both the micro and macro cases.
Table XI shows the ranking loss for the different feature selection criteria. ACC2-BR has the least ranking loss for 30, 50, 60 and 80 percent of selected features. For 20, 40 and 70 percent, RF-BR has the least ranking loss; IG-BR has the least ranking loss only at 10 percent of selected features. For the LP case, RF-LP and ACC2-LP each attain the least ranking loss three times, while IG-LP attains it twice. Hence, overall, the ACC2 method performed best for the LP and BR cases.

B. Medical Dataset
Results of the experiments with the different measures on the Medical dataset are presented in Tables XII to XV. The subsequent subsections discuss these measures.
1) Hamming loss for medical dataset: The Hamming loss of the different metrics on the medical dataset is given in Table XII. ACC2 has the least Hamming loss in 9 out of 16 cases at different numbers of selected features. RF has the least Hamming loss in 4 cases, and IG has the least Hamming loss in 3 out of 16 cases.
2) Subset accuracy measure for medical dataset: For the medical dataset, ACC2-BR has the maximum subset accuracy for 10% to 30% of the total number of features (see Table XIII), while for 40% to 80% of the total number of features, RF-BR has the maximum accuracy. In the LP case, ACC2-LP gives the maximum subset accuracy only for 40% and 70% of features. IG underperformed on the medical dataset, while the RF technique takes the maximum value in 10 out of 16 cases.
3) Micro and macro-average F1-score for medical dataset: In Table XIV, combined values of the micro- and macro-averaged F1-score are given for the medical dataset. Out of 16 calculations at different percentages, ACC2 takes the maximum micro-averaged value 8 times, while IG performed better than RF in both the BR and LP cases by taking the maximum micro-averaged score 5 times. ACC2-BR is highest at 20% to 50% of selected features. ACC2-LP is highest at 30%, 50%, 70% and 80% of selected features. IG-BR and IG-LP performed better than RF-BR and RF-LP in the macro-average measure on the medical dataset.

IX. DISCUSSION
We present a feature selection technique for multi-label text classification. We carried out a comparative study of six feature selection metrics, three for the BR case and three for the LP case, for multi-label text classification. The ACC2 measure is a very simple technique and requires less computation than the other metrics. Despite its simplicity, its performance is comparable to that of other, more complicated metrics, as shown in Tables XX to XXII. It can be seen from Table XX that the least Hamming loss and ranking loss for the Enron dataset are attained by the ACC2 measure. The values are the percentage of selected-feature test points won by each method, out of eight test points for each of the BR and LP cases. Across all three datasets, ACC2-BR attains the least Hamming loss in 70.8% of cases among the six metrics, and ACC2-LP attains the least Hamming loss in 33.33% of cases. Overall, ACC2-BR attains the least ranking loss on the three datasets in 58.33% of cases and ACC2-LP in 25% of cases. The subset accuracy and the micro- and macro-averaged measures are compared in terms of the maximum values among the six feature ranking metrics.

X. CONCLUSION
In this paper, we evaluate the performance of three feature ranking algorithms and two data transformation techniques by using five evaluation measures on three benchmark datasets.For data transformation techniques from multi-label to single label, we conclude that binary relevance doesn't take into consideration the label dependency.While on other hand LP only consider the distinct labelsets, hence unable to predict new labelsets causing over-fitting of training data.
Among the feature ranking algorithms, the Relief-F measure does not deal with redundant features. Rather than converting a multinomial classification problem into binomial classification problems, Relief-F searches for k near misses from each different class and averages their contributions when updating the weight vector W, weighted by the prior probability of each class. Information gain captures the amount of information present in a feature for the purpose of automatic text classification. ACC2 selects highly discriminative features, which occur many times in one class but few times in the other class. In future work, we will adapt the ACC2 measure to deal directly with multi-label data.
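The behaviour attributed to ACC2 can be sketched as follows. The sketch assumes the usual balanced-accuracy form from the feature selection literature, ACC2 = |tpr − fpr|, where tpr and fpr are the fractions of positive and negative documents that contain the term; the function name `acc2_scores` and the toy documents are hypothetical.

```python
from typing import Dict, List, Set


def acc2_scores(docs: List[Set[str]], y: List[int]) -> Dict[str, float]:
    """Score every term by |tpr - fpr| for one binary (single-label) problem."""
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    vocab = set().union(*docs)
    scores = {}
    for term in vocab:
        tp = sum(1 for d, c in zip(docs, y) if c == 1 and term in d)
        fp = sum(1 for d, c in zip(docs, y) if c == 0 and term in d)
        tpr = tp / n_pos if n_pos else 0.0   # share of positive docs with the term
        fpr = fp / n_neg if n_neg else 0.0   # share of negative docs with the term
        scores[term] = abs(tpr - fpr)
    return scores


docs = [{"goal", "match"}, {"goal", "vote"}, {"vote", "match"}, {"vote"}]
y = [1, 1, 0, 0]                  # e.g. one label column after a BR transformation
scores = acc2_scores(docs, y)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['goal', 'vote', 'match']
```

The term "goal" appears in every positive document and in no negative one, so it receives the top score, whereas "match" is equally frequent in both classes and scores zero.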
Description of feature selection methods with transformation techniques is given below:
1) ACC2-BR: ACC2 as feature selection measure based on BR
2) ACC2-LP: ACC2 as feature selection measure based on LP
3) RF-LP: RF as feature selection measure based on LP
4) RF-BR: RF as feature selection measure based on BR
5) IG-BR: IG as feature selection measure based on BR
6) IG-LP: IG as feature selection measure based on LP
Relief-F is a univariate feature selection measure; it evaluates the quality of the features of single-label datasets. Relief-F rewards features that take different values on different classes but penalizes features that take different values within the same class.

TABLE I :
Single Label vs. Multi-label Dataset

TABLE II :
Multi-label Dataset Example

TABLE IV :
Transformed Dataset using the Label Powerset Method

Let $C = \{\omega_i : i = 1, 2, \ldots, L\}$ be a finite set of classes, and let $x_i$ be an instance linked with a set of labels $Y_i$, where $Y_i \subseteq C$. A labelset $S$ such that $S \subseteq C$ and $k = |S|$ is called a $k$-labelset.

TABLE V :
Artificially Sampled Dataset for Multi-label

TABLE VI :
Comparison of Ranks Assigned by Metrics to Features on Sampled Dataset

TABLE VII :
Description of Datasets

Table VIII shows the Hamming loss on the Enron dataset for the six feature ranking measures based on the filter approach. Hamming loss is computed at different test points for the selected features. Least-BR indicates the BR-based feature ranking measure with the least Hamming loss; Least-LP indicates the feature ranking measure with the least Hamming loss among the measures for the LP transformation case. As can be seen from Table VIII, ACC2 produces the least Hamming loss in both the BR and the LP transformation cases. It is also the simplest technique among all the described approaches.
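For reference, Hamming loss can be sketched as below, assuming its standard definition as the fraction of (document, label) pairs that are predicted wrongly. The function name and the toy data are hypothetical, and every label is assumed to belong to the vocabulary `labels`.

```python
from typing import List, Set


def hamming_loss(y_true: List[Set[str]], y_pred: List[Set[str]],
                 labels: List[str]) -> float:
    """Fraction of (document, label) pairs on which prediction and truth disagree."""
    # The symmetric difference holds the labels that are missing or spurious.
    wrong = sum(len(t ^ p) for t, p in zip(y_true, y_pred))
    return wrong / (len(y_true) * len(labels))


y_true = [{"a", "b"}, {"c"}, {"a"}]
y_pred = [{"a"}, {"c"}, {"b"}]
hl = hamming_loss(y_true, y_pred, ["a", "b", "c"])
print(round(hl, 4))  # 0.3333
```

Three of the nine label decisions are wrong here, so a lower value indicates a better feature subset.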

TABLE VIII :
Feature Ranking Metrics Having Least Hamming Loss using KNN Classifier

Subset accuracy values of the six feature ranking metrics are given in Table IX. Max-BR indicates which of the three measures based on the BR problem transformation approach has the maximum subset accuracy value. In the same way, Max-LP indicates the measure with the maximum subset accuracy value among the techniques based on the LP transformation approach. Clearly, the ACC2 measure leads all other techniques in subset accuracy.
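Subset accuracy is the strictest of the five measures. A minimal sketch, assuming the standard exact-match definition and using hypothetical names and toy data:

```python
from typing import List, Set


def subset_accuracy(y_true: List[Set[str]], y_pred: List[Set[str]]) -> float:
    """Fraction of documents whose predicted labelset matches the true one exactly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return exact / len(y_true)


y_true = [{"a", "b"}, {"c"}, {"a"}]
y_pred = [{"a"}, {"c"}, {"b"}]
acc = subset_accuracy(y_true, y_pred)
print(round(acc, 4))  # 0.3333
```

The first document is counted as wrong even though one of its two labels is predicted correctly, so partially correct predictions earn no credit under this measure.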

TABLE IX :
Subset Accuracy Values for Feature Ranking Metrics using KNN Classifier on Enron Dataset

TABLE X :
Micro and Macro-averaged F1-Score Values for Feature Ranking Metrics using KNN Classifier on Enron Dataset

TABLE XI :
Ranking Loss Values for Feature Ranking Metrics using KNN Classifier on Enron Dataset

TABLE XII :
Hamming Loss Values for Feature Ranking Metrics using KNN Classifier on Medical Dataset

TABLE XIII :
Subset Accuracy Values for Feature Ranking Metrics using KNN Classifier on Medical Dataset

TABLE XIV :
Micro and Macro-averaged F1-Score Values for Feature Ranking Metrics using KNN Classifier on Medical Dataset

TABLE XV :
Ranking Loss Values for Feature Ranking Metrics using KNN Classifier on Medical Dataset

least Hamming loss value. On the other hand, RF-BR did not attain the least value of the Hamming loss measure in the BR domain. For the LP case, ACC2 and RF-BR generated the least Hamming loss values, as shown in Table XVI.

TABLE XVI :
Hamming Loss Values for Feature Ranking Metrics using KNN Classifier on Bibtex Dataset

The comparison of IG, RF and ACC2 for the BR and LP transformation cases at different percentages of the total number of features is shown in Table XVII. ACC2-BR took the lead in the BR case, attaining the maximum value in five cases, while in the LP case RF took the same lead over the other two techniques.

Table XVIII). In this case, for both the LP and BR transformations, RF and ACC2 each attained the maximum value in 5 out of 16 cases. IG attained the maximum value in 6 out of 16 cases across both transformation cases.

4) Ranking Loss for bibtex dataset: Table XIX shows the ranking loss of the six metrics and indicates which of them attained the least ranking loss. On the bibtex dataset, IG attained the least ranking loss in 7 cases across the BR and LP transformations, while RF led in 5 out of 16 cases. ACC2 attained the least ranking loss in 4 out of 16 cases.
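Ranking loss can be sketched as follows, assuming one common convention: for each document, count the (relevant, irrelevant) label pairs in which the irrelevant label receives a strictly higher score, divide by the number of such pairs, and average over documents. Documents with no pair to order contribute zero. The function name, the score dictionaries and the tie handling are assumptions made for illustration.

```python
from typing import Dict, List, Set


def ranking_loss(y_true: List[Set[str]], scores: List[Dict[str, float]]) -> float:
    """Average fraction of mis-ordered (relevant, irrelevant) label pairs."""
    total = 0.0
    for relevant, s in zip(y_true, scores):
        irrelevant = set(s) - relevant
        n_pairs = len(relevant) * len(irrelevant)
        if n_pairs == 0:
            continue  # nothing to order for this document
        bad = sum(1 for r in relevant for q in irrelevant if s[q] > s[r])
        total += bad / n_pairs
    return total / len(y_true)


y_true = [{"a"}, {"a", "b"}]
scores = [
    {"a": 0.9, "b": 0.2, "c": 0.1},   # relevant label ranked first: no error
    {"a": 0.8, "b": 0.3, "c": 0.5},   # "c" is ranked above the relevant "b"
]
rl = ranking_loss(y_true, scores)
print(rl)  # 0.25
```

A lower value means that relevant labels tend to be scored above irrelevant ones, which is why the tables report the metric with the least ranking loss.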

TABLE XVII :
Subset Accuracy Values for Feature Ranking Metrics using KNN Classifier on Bibtex Dataset

TABLE XVIII :
Micro and Macro-averaged F1-Score Values for Feature Ranking Metrics using KNN Classifier on Bibtex Dataset

TABLE XIX :
Ranking Loss Values for Feature Ranking Metrics using KNN Classifier on Bibtex Dataset

TABLE XX :
Percentage of Cases in which a Feature Ranking Metric Produces the Highest Subset Accuracy, Micro- and Macro-averaged F1 Measure and the Lowest Hamming and Ranking Loss on Enron Dataset

TABLE XXI :
Percentage of Cases in which a Feature Ranking Metric Produces the Highest Subset Accuracy, Micro- and Macro-averaged F1 Measure and the Lowest Hamming and Ranking Loss on Bibtex Dataset

TABLE XXII :
Percentage of Cases in which a Feature Ranking Metric Produces the Highest Subset Accuracy, Micro- and Macro-averaged F1 Measure and the Lowest Hamming and Ranking Loss on Medical Dataset