Trending Challenges in Multi Label Classification

Multi label classification has become an important paradigm in recent years because of the growing number of domains to which it can be applied. Researchers have developed many algorithms to solve the multi label classification problem. Nevertheless, several open problems still need to be investigated in depth. The aim of this paper is to give researchers a brief introduction to multi label classification and to present some of its most prominent current challenges.
Keywords: Challenges; Correlations among labels; Multi Label Classification


I. INTRODUCTION
Classification is an important data mining task that can be defined as predicting the class label of unseen instances as accurately as possible [1]. Most research has focused on single label classification, where the goal is to learn from a set of instances, each associated with a unique class label from a set of disjoint class labels. If the total number of disjoint classes is two, the problem is called binary classification; otherwise, it is a multi class classification problem. In contrast, Multi-Label Classification (MLC) allows an example (instance) to be associated with more than one class label at the same time. The goal of MLC is therefore to learn from a set of instances where each instance belongs to one or more class labels simultaneously [2].
MLC was first motivated by text categorization and medical diagnosis [3]. More recently, it has attracted growing attention because of its importance in real world problems [3]. MLC applies in many domains where single label classification cannot capture the problem. For example, a single label classifier may tag an email message as either work or research project but not both, whereas the message may in fact belong to both categories at the same time, which MLC can express.
MLC is by nature a challenging problem for several reasons, such as the number of label combinations, which grows exponentially with the number of labels, high dimensionality, and unbalanced data [9]. This paper aims to pinpoint the most prominent current challenges in MLC, based on an extensive study of recent research. These challenges include, but are not limited to: exploiting correlations among labels, covering both conditional and unconditional dependencies; feature selection methods designed specifically for multi label datasets; and new stratification methods suited to the nature of multi label datasets. This paper is organized as follows. The next section presents related work. Section III introduces the trending challenges in the field of MLC. Finally, we conclude and outline future work.

II. RELATED WORK
According to [1], two approaches are widely used to handle MLC: Problem Transformation Methods (PTM) and Algorithm Adaptation Methods (AAM). The former transform the multi label problem into one or more single label classification problems, which can then be solved using any single label classification algorithm [9]. The latter extend a single label algorithm so that it handles multi label data directly.

A. Problem Transformation Methods
These are algorithm-independent methods that handle a multi label dataset by transforming it into one or more single label datasets as a preprocessing step, and then applying any single label classification algorithm. The many transformation methods can be grouped into two groups:

1) Simple Problem Transformation Methods
The simplest method is the ignore method, which discards every multi label instance in the dataset [9]. This naïve approach is unacceptable because it loses a great deal of information. Other simple methods compute the frequency of each label and then keep, for each instance, either its most frequent label, its least frequent label, or a randomly selected label as the transformation criterion [10].
Transformation methods based on label frequency have no principled justification for MLC and may cause further problems: selecting the least frequent label increases the complexity of the learning process, while selecting the most frequent label leads to an imbalanced class distribution.
The last simple transformation method copies each multi label instance once for every label it is associated with, with or without a weight [11]. This method causes no information loss, but it neglects the correlations among labels and may increase the complexity of the learning process by increasing the number of single label instances in the dataset.
www.ijacsa.thesai.org
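The simple transformations described in this subsection can be illustrated with a minimal sketch. The toy dataset, the function names, and the 1/|labels| weighting are assumptions made for illustration only; they are not taken from [9], [10] or [11].

```python
from collections import Counter
import random

# Toy multi label dataset (hypothetical): (instance id, set of labels).
DATA = [
    ("x1", {"work"}),
    ("x2", {"work", "research"}),
    ("x3", {"research", "personal"}),
    ("x4", {"work", "research", "personal"}),
]

def ignore_transform(data):
    """Ignore method: keep only the instances that carry exactly one label."""
    return [(x, next(iter(ls))) for x, ls in data if len(ls) == 1]

def select_transform(data, how="max", seed=0):
    """Keep one label per instance: most frequent, least frequent, or random."""
    freq = Counter(label for _, ls in data for label in ls)
    rng = random.Random(seed)
    out = []
    for x, ls in data:
        ordered = sorted(ls)  # sorted so that ties are broken deterministically
        if how == "max":
            label = max(ordered, key=lambda l: freq[l])
        elif how == "min":
            label = min(ordered, key=lambda l: freq[l])
        else:
            label = rng.choice(ordered)
        out.append((x, label))
    return out

def copy_transform(data, weighted=False):
    """Copy method: one single label row per (instance, label) pair.

    With weighted=True each copy gets weight 1/|labels| (an assumed scheme).
    """
    out = []
    for x, ls in data:
        w = 1.0 / len(ls) if weighted else 1.0
        out.extend((x, label, w) for label in sorted(ls))
    return out
```

On this toy data the ignore method keeps a single instance out of four, while the copy method expands four instances into eight single label rows, which shows the information loss of the former and the dataset growth of the latter.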

2) Complex Problem Transformation Methods
Roughly speaking, most complex problem transformation methods are based on or inspired by two well known methods: Binary Relevance (BR) and Label Powerset (LP) [12]. Each represents a different approach to handling MLC.
BR divides the multi label dataset into q datasets, one per label, each containing all the positive and negative instances for that label [12]. It then trains one classifier per dataset, q classifiers in total, and merges their outputs to obtain the final prediction. BR is a simple method whose complexity is linear in the total number of labels, and its classifiers can be trained in parallel. Its main limitation is that it neglects any correlations among labels: each label is treated independently of the others, an assumption that rarely holds in MLC. Another limitation is that, despite the linear complexity, the cost of training one classifier per label becomes significant when the number of labels is very large [11].
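A minimal sketch of the BR decomposition follows. The base learner `train_1nn` is a hypothetical stand-in (1-nearest-neighbour on a single numeric feature); any single label binary classifier could take its place.

```python
def binary_relevance_split(X, Y, labels):
    """Build one binary dataset per label: {label: [(x, 0 or 1), ...]}."""
    return {l: [(x, int(l in y)) for x, y in zip(X, Y)] for l in labels}

def train_1nn(dataset):
    """Hypothetical base learner: 1-NN on one numeric feature."""
    def predict(x):
        return min(dataset, key=lambda pair: abs(pair[0] - x))[1]
    return predict

def br_fit(X, Y, labels, train=train_1nn):
    """Train q independent binary classifiers, one per label."""
    split = binary_relevance_split(X, Y, labels)
    return {l: train(ds) for l, ds in split.items()}

def br_predict(models, x):
    """Merge the q binary outputs: the labels whose classifier answers 1."""
    return {l for l, model in models.items() if model(x) == 1}

# Toy example (assumed data).
X = [0.0, 1.0, 2.0, 3.0]
Y = [{"a"}, {"a", "b"}, {"b"}, {"b", "c"}]
models = br_fit(X, Y, ["a", "b", "c"])
```

Each classifier in `br_fit` sees only its own label column, which is why the q models can be trained in parallel and also why no label correlation is captured.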
In contrast to BR, LP takes correlations among labels into account by treating every unique label combination in the dataset as a single class in a multi class classification problem. LP thus transforms the MLC problem into exactly one multi class problem, to which any single label classifier can be applied [12]. LP suffers from several drawbacks, such as imbalanced class distribution, especially when the number of distinct label sets is high compared to the number of instances in the dataset. In addition, LP can predict only those label combinations that appeared in the training data [12].
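The LP transformation amounts to an encoding of label sets as class identifiers. The sketch below uses hypothetical function names and toy data; it shows only the encoding step, not the multi class learner.

```python
def label_powerset_encode(Y):
    """Map each distinct label set to an integer class id.

    Returns the encoded targets and the mapping {frozenset: class id}.
    """
    class_of = {}
    encoded = []
    for y in Y:
        key = frozenset(y)
        # Assign the next free id the first time a label set is seen.
        encoded.append(class_of.setdefault(key, len(class_of)))
    return encoded, class_of

def label_powerset_decode(class_id, class_of):
    """Recover the label set that corresponds to a predicted class id."""
    for labelset, cid in class_of.items():
        if cid == class_id:
            return set(labelset)
    raise KeyError(class_id)

Y = [{"a"}, {"a", "b"}, {"b"}, {"a", "b"}]
encoded, class_of = label_powerset_encode(Y)
```

Because the multi class learner can only output ids present in `class_of`, a combination such as {"a", "c"} that never occurs in training cannot be predicted.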
Although BR and LP suffer from several limitations, they have inspired many algorithms that build on their concepts or try to overcome their limitations. For example, Classifier Chains (CC) enhances BR by taking label correlations into account: it trains q classifiers linked in a chain, such that the prediction of each classifier is added to the dataset as a new feature and used when predicting the subsequent labels [10]. The main drawback of CC relates to the order of the chain: different orders give different predictions, which may affect the accuracy of the classifier. This problem is addressed by a method called Ensemble of Classifier Chains (ECC), which trains several chains with random label orders [13].
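The chaining idea can be sketched as follows. This is a simplified illustration under stated assumptions: the base learner `train_1nn` is hypothetical, and the chain is trained on the true values of the earlier labels while prediction uses the earlier predictions.

```python
def train_1nn(rows):
    """Hypothetical base learner: 1-NN with squared Euclidean distance.

    rows is a list of (feature_tuple, 0 or 1).
    """
    def predict(x):
        def dist(a):
            return sum((p - q) ** 2 for p, q in zip(a, x))
        return min(rows, key=lambda r: dist(r[0]))[1]
    return predict

def cc_fit(X, Y, order, train=train_1nn):
    """Train one classifier per label, following the given chain order."""
    chain = []
    for i, label in enumerate(order):
        rows = []
        for x, y in zip(X, Y):
            # Earlier labels in the chain are appended as extra features.
            extra = tuple(int(l in y) for l in order[:i])
            rows.append((tuple(x) + extra, int(label in y)))
        chain.append((label, train(rows)))
    return chain

def cc_predict(chain, x):
    """Predict label by label, feeding each prediction to the next link."""
    feats = tuple(x)
    predicted = set()
    for label, model in chain:
        bit = model(feats)
        if bit:
            predicted.add(label)
        feats += (bit,)
    return predicted

X = [(0.0,), (1.0,), (2.0,), (3.0,)]
Y = [{"a"}, {"a", "b"}, {"b"}, set()]
chain = cc_fit(X, Y, order=["a", "b"])
```

Calling `cc_fit` with a different `order` yields a different chain, which is the order sensitivity noted above; an ensemble in the spirit of ECC would train several such chains on shuffled orders and combine their votes.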
LP itself has been studied intensively because of its simplicity and its ability to take label correlations into account. These studies have produced many algorithms that are based on or enhance LP, such as the RAndom k-labELsets method (RAkEL) [14], which addresses the imbalanced class distribution of LP, especially when the number of labels is large. RAkEL trains an ensemble of LP classifiers, where each classifier is assigned a small random subset of k labels. RAkEL can predict label combinations that do not exist in the training dataset. The bottleneck of RAkEL is determining the optimal value of the subset size k: if k is too large, RAkEL suffers from the same shortcomings as LP, and if k is too small, it loses information, especially about correlations among labels, in addition to having low accuracy and high complexity [12].
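The two ingredients of this ensemble, drawing random k-labelsets and combining the member outputs, can be sketched as below. The function names, the per-label averaging, and the 0.5 threshold are assumptions for illustration; the LP members themselves are not trained here and their outputs are supplied by hand.

```python
import random
from itertools import combinations

def sample_labelsets(labels, k, m, seed=0):
    """Draw up to m distinct label subsets of size k, without replacement."""
    rng = random.Random(seed)
    candidates = list(combinations(sorted(labels), k))
    return rng.sample(candidates, min(m, len(candidates)))

def combine_votes(labelsets, member_predictions, labels, threshold=0.5):
    """Average, per label, the votes of the members whose subset contains it.

    member_predictions[i] is the label set output by the LP member that was
    trained on labelsets[i].
    """
    result = set()
    for label in labels:
        votes = [label in pred
                 for ls, pred in zip(labelsets, member_predictions)
                 if label in ls]
        if votes and sum(votes) / len(votes) > threshold:
            result.add(label)
    return result

labels = ["a", "b", "c"]
labelsets = [("a", "b"), ("b", "c"), ("a", "c")]
member_predictions = [{"a", "b"}, {"c"}, {"a", "c"}]
final = combine_votes(labelsets, member_predictions, labels)
```

Here the combined output {"a", "c"} need not match any label set seen by a single member during training, which illustrates how the ensemble can produce combinations that plain LP cannot.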
Pruned Sets (PS) is another transformation method that addresses the imbalanced class distribution of LP by pruning instances whose label sets occur less often than a user-defined threshold [13]. This technique reduces the high complexity of LP by considering only the important, frequent label sets. The price of this solution is the loss of potentially important information and an increased risk of overfitting. Ensemble of Pruned Sets (EPS) [13] improves on the predictions of PS by combining multiple classifiers through voting, at the cost of increased complexity.
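The pruning step described above can be sketched in a few lines. The function name and the toy data are hypothetical, and only the pruning by label set frequency is shown.

```python
from collections import Counter

def prune_labelsets(X, Y, min_count):
    """Keep only instances whose label set occurs at least min_count times."""
    counts = Counter(frozenset(y) for y in Y)
    return [(x, y) for x, y in zip(X, Y) if counts[frozenset(y)] >= min_count]

X = ["x1", "x2", "x3", "x4", "x5"]
Y = [{"a"}, {"a"}, {"a", "b"}, {"a", "b"}, {"a", "b", "c"}]
kept = prune_labelsets(X, Y, min_count=2)
```

In this toy example the instance x5 is dropped because its label set {"a", "b", "c"} occurs only once, so label "c" disappears from the transformed dataset; this is the information loss mentioned above.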
A different approach to MLC is based on pairwise methods. The Ranking by Pairwise Comparison (RPC) transformation method divides a dataset with q labels into q(q-1)/2 datasets, one for each pair of labels [15]. A binary classifier is then trained on each dataset, and the final prediction is built by counting the votes for each label. RPC was extended by adding a virtual label that serves as a split point between relevant and irrelevant labels. This transformation method is called Calibrated Label Ranking (CLR) [16].
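The pairwise decomposition and the vote counting can be sketched as follows. The sketch assumes the usual convention that a pairwise dataset keeps only the instances for which exactly one of the two labels is relevant; the function names and toy data are hypothetical.

```python
from itertools import combinations

def pairwise_datasets(X, Y, labels):
    """Build q(q-1)/2 binary datasets, one per unordered pair of labels.

    For the pair (a, b), the target is 1 if a is relevant and b is not,
    and 0 if b is relevant and a is not. Other instances are skipped.
    """
    out = {}
    for a, b in combinations(labels, 2):
        out[(a, b)] = [(x, int(a in y)) for x, y in zip(X, Y)
                       if (a in y) != (b in y)]
    return out

def rank_by_votes(pair_predictions, labels):
    """pair_predictions maps (a, b) to 1 if a is preferred, else 0."""
    votes = {l: 0 for l in labels}
    for (a, b), pred in pair_predictions.items():
        votes[a if pred == 1 else b] += 1
    return sorted(labels, key=lambda l: -votes[l])

X = [1, 2, 3]
Y = [{"a"}, {"a", "b"}, {"c"}]
labels = ["a", "b", "c"]
datasets = pairwise_datasets(X, Y, labels)
ranking = rank_by_votes({("a", "b"): 1, ("a", "c"): 1, ("b", "c"): 0}, labels)
```

The result is a ranking of labels rather than a label set, which is why CLR introduces a virtual label to mark where the relevant labels end.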

B. Algorithm Adaptation Methods
The effectiveness of many algorithms on single label classification problems has inspired researchers to adapt and extend them to handle MLC. ML-C4.5 [17] adapts the popular C4.5 algorithm to multi label datasets through two modifications: the first allows leaves to hold multiple labels, and the second modifies the definition of entropy so that it carries enough information to determine the classes to which a given pattern belongs.

Multi class Multi label Associative Classification (MMAC) is an algorithm that follows the concepts of Associative Classification (AC) [18]. First, it transforms the multi label dataset into a single label dataset using copy as the problem transformation method. Then it trains a single label associative classifier that predicts a single label using if-then rules. Finally, it merges rules that have the same antecedent to form a rule with more than one label in its consequent. It is worth mentioning that all the datasets used to evaluate MMAC are single label datasets; the algorithm has never been tested on multi label datasets.
Rank-SVM is a multi label ranking algorithm based on SVM ranking [19]. It aims to minimize the ranking loss, but it does not take correlations among labels into account, and it has never been tested on datasets with a very large number of labels, where it is expected to perform poorly.
Several algorithms are based on the popular k-Nearest Neighbors (KNN) algorithm, which follows the lazy learning technique. ML-KNN [20] is one example. All of these algorithms share the same first step as KNN (retrieving the k nearest examples) and differ in how they aggregate the label sets of these examples.
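For reference, the ranking loss of a single instance can be computed as in the sketch below, which assumes the common definition: the fraction of (relevant, irrelevant) label pairs in which the relevant label does not receive the higher score. The function name and the scores are hypothetical.

```python
def ranking_loss(scores, relevant):
    """Fraction of (relevant, irrelevant) label pairs that are mis-ordered.

    scores maps each label to a real-valued score; relevant is the true
    label set. Ties are counted as errors.
    """
    rel = [l for l in scores if l in relevant]
    irr = [l for l in scores if l not in relevant]
    if not rel or not irr:
        return 0.0  # the loss is undefined here; 0.0 is an assumed convention
    wrong = sum(scores[r] <= scores[i] for r in rel for i in irr)
    return wrong / (len(rel) * len(irr))

scores = {"a": 0.9, "b": 0.2, "c": 0.5, "d": 0.1}
loss = ranking_loss(scores, relevant={"a", "b"})
```

In this example only the pair (b, c) is ordered wrongly out of four pairs, so the loss is 0.25.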
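The shared first step and one possible aggregation rule can be sketched as follows. The majority vote used here is only an illustrative aggregation and is not the rule used by ML-KNN [20]; the function name and data are hypothetical.

```python
def knn_multilabel(train, x, k=3):
    """Retrieve the k nearest examples and keep labels held by a majority.

    train is a list of (feature_tuple, label_set).
    """
    def dist(a):
        return sum((p - q) ** 2 for p, q in zip(a, x))

    # Step shared with plain KNN: find the k nearest training examples.
    neighbours = sorted(train, key=lambda row: dist(row[0]))[:k]

    # Aggregation step: count how many neighbours carry each label.
    counts = {}
    for _, labels in neighbours:
        for label in labels:
            counts[label] = counts.get(label, 0) + 1
    return {label for label, c in counts.items() if c > k / 2}

train = [((0, 0), {"a"}), ((0, 1), {"a", "b"}),
         ((1, 0), {"a"}), ((5, 5), {"c"})]
predicted = knn_multilabel(train, (0, 0), k=3)
```

Replacing the final line of `knn_multilabel` with a different aggregation rule is where the KNN-based multi label algorithms differ from one another.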

Back Propagation for Multi-Label Learning (BP-MLL) is an adaptation of traditional feed-forward neural networks. It optimizes an error function similar to the ranking loss [21]. Multilabel Multiclass Perceptron (MMP) is another algorithm that uses neural networks to handle MLC [22]. It uses one perceptron per label, as in BR, and computes the final prediction from the inner products. MMP is efficient, especially for large datasets with many labels [9]. Figure 1 depicts a brief taxonomy of MLL methods.
In addition to the categorization above, MLC algorithms can be categorized by the degree of label correlation they take into account. On that basis, we can distinguish three types of MLC algorithms, as shown in Table 1.

III. TRENDING CHALLENGES

A. Exploiting correlations among labels to facilitate multi label learning
Multi label datasets usually have characteristics that do not exist in single label datasets, such as high dimensionality, unbalanced data, and the exponential growth of label combinations. These characteristics, together with the core nature of multi label data, which rests on dependencies among labels, create an urgent need to exploit correlations among labels in order to gain additional knowledge that facilitates the learning process [9]. Many algorithms [1] [11] [13] [14] [25] have tried to exploit correlations among labels to improve the accuracy of the multi label classifier, but most of them suffer from high complexity in the learning process [10]. The real challenge is therefore to exploit high order label correlations locally while maintaining linear complexity [2].
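As a simple illustration of what a label correlation looks like in data, the sketch below estimates pairwise co-occurrence and the conditional frequency P(b | a) from label sets alone. It covers only second order, unconditional dependencies (the features are ignored), and the function names and data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(Y):
    """Count single labels and unordered label pairs over all instances."""
    single = Counter(label for y in Y for label in y)
    pair = Counter(frozenset(p) for y in Y
                   for p in combinations(sorted(y), 2))
    return single, pair

def conditional(single, pair, a, b):
    """Estimate P(b | a) as count(a and b) / count(a)."""
    if single[a] == 0:
        return 0.0
    return pair[frozenset((a, b))] / single[a]

Y = [{"a", "b"}, {"a", "b"}, {"a"}, {"c"}]
single, pair = cooccurrence(Y)
```

Even this pairwise count grows quadratically with the number of labels, which hints at why exploiting higher order correlations at linear cost remains difficult.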

B. Proposing new problem transformation methods based on correlations among labels
Transforming a multi label dataset into one or more single label datasets is a basic step for most multi label algorithms that follow the PTM approach. The transformation criterion is usually based on label frequency; examples include selecting the Most Frequent Label (MFL), the Least Frequent Label (LFL), or a random label [10] [11]. Since multi label datasets rest on the basic assumption that labels are not mutually exclusive and that there are correlations and dependencies among them [9], it would make more sense to base the transformation criterion on correlations among labels [1].

C. Proposing new feature selection methods that are suitable for the nature of multi label datasets
Feature selection is a basic step in many data mining tasks; it aims to identify the relevant features in the dataset and eliminate the irrelevant ones [23]. Labels in single label classification are mutually exclusive, which is not the case in MLC. There is therefore an urgent need for feature selection methods designed specifically for multi label data, ideally methods that take the correlations among labels into account [23].

D. Hierarchical Multi Label Classification (H-MLC)
In some datasets, labels can be organized hierarchically, as in the "Yeast" dataset, where labels are related to each other in a hierarchy. Two types of structures can represent the hierarchical nature of multi label datasets: a tree or a Directed Acyclic Graph (DAG). In a tree, a child has exactly one parent, while in a DAG a child may have more than one parent [24]. A promising direction is to design an algorithm that manages label correlations using a hierarchical structure with minimal complexity in the learning process. Interesting approaches can be found in [24][25].
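The difference between the two structures can be made concrete with a small sketch in which each label maps to the list of its parents. The label names and helper functions are hypothetical and are not taken from [24] or [25].

```python
def is_tree(parents):
    """True if every label has at most one parent (the root has none)."""
    return all(len(p) <= 1 for p in parents.values())

def ancestors(parents, label):
    """Collect every label reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(label, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen

# Hypothetical label hierarchies: label -> list of parent labels.
TREE = {"root": [], "a": ["root"], "b": ["root"], "a1": ["a"]}
DAG = {"root": [], "a": ["root"], "b": ["root"], "ab": ["a", "b"]}
```

In the DAG, the label "ab" inherits from both "a" and "b", so an algorithm that walks the hierarchy must handle several paths to the same label, which does not arise in the tree.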

E. Proposing new stratification methods that are suitable for the nature of the multi label datasets
Stratification is a sampling technique that takes into account all the disjoint groups in the target population, so that the chosen sample represents the whole population. In single label classification, stratification is easy because every instance is associated with exactly one label and labels are mutually exclusive. In MLC, the task becomes considerably more complicated, since instances are usually associated with more than one label and labels are not mutually exclusive. Two stratification methods were proposed for MLC in [26], but more work is needed to solve the stratification problem in this field.
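The difficulty can be made visible with two small diagnostics. The sketch below does not implement the methods of [26]; it only counts, under assumed names and toy data, the label sets that are too rare to appear in every fold and the per-label counts that a given fold assignment produces.

```python
from collections import Counter

def labelsets_too_rare(Y, n_folds):
    """Label sets with fewer examples than folds cannot appear in every fold."""
    counts = Counter(frozenset(y) for y in Y)
    return [set(ls) for ls, c in counts.items() if c < n_folds]

def label_counts_per_fold(Y, fold_of, n_folds):
    """Count how often each label occurs in each fold of a given split."""
    counts = [Counter() for _ in range(n_folds)]
    for y, fold in zip(Y, fold_of):
        counts[fold].update(y)
    return counts

Y = [{"a"}, {"a", "b"}, {"b"}, {"a", "b", "c"}]
fold_of = [0, 0, 1, 1]  # an arbitrary two-fold assignment
per_fold = label_counts_per_fold(Y, fold_of, n_folds=2)
rare = labelsets_too_rare(Y, n_folds=2)
```

In this toy split every label set is unique, so stratifying by label set is impossible, and label "c" ends up in only one fold; both effects become more pronounced as the number of labels grows.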

F. High dimensionality of label space in multi label datasets
High dimensionality of the label space is one of the most challenging issues in MLC, and perhaps the main one. In MLC, most labels are associated with only a few training instances relative to the total number of instances in the dataset. This situation is similar to the problem of imbalanced class distribution in single label classification, and it becomes worse when the number of labels in the dataset is very high (more than 100 labels). There is an urgent need for a simple yet fast algorithm that can handle a large number of labels, each associated with few instances, while maintaining linear complexity. An example of such an algorithm can be found in [27], where the authors propose HOMER, which constructs a hierarchy of multi label classifiers in which each classifier considers a small subset of the labels. The algorithm shows fair performance and good accuracy on only two datasets and is compared only against BR. HOMER needs to be investigated in more depth, using larger datasets and a fair evaluation against algorithms other than BR.
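The sparsity described above is commonly summarized with a few dataset statistics. The sketch below computes label cardinality (average number of labels per instance), label density (cardinality divided by the number of labels), and the relative frequency of each label; the function name and toy data are hypothetical.

```python
from collections import Counter

def label_stats(Y, labels):
    """Return (cardinality, density, {label: relative frequency})."""
    n = len(Y)
    freq = Counter(label for y in Y for label in y)
    cardinality = sum(len(y) for y in Y) / n
    density = cardinality / len(labels)
    relative = {label: freq[label] / n for label in labels}
    return cardinality, density, relative

Y = [{"a"}, {"a", "b"}, {"a"}, {"a", "c"}]
cardinality, density, relative = label_stats(Y, ["a", "b", "c"])
```

In this toy dataset label "a" occurs in every instance while "b" and "c" each occur in a quarter of them; with hundreds of labels, most relative frequencies are far smaller, which is the imbalance that makes large label spaces hard to learn.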

IV. CONCLUSION AND FUTURE WORK
In this paper, we have given a brief introduction to MLC and surveyed some of the best known algorithms in the field. The main contribution of this paper is to present some of the trending challenges in the domain of MLC. In the near future, we aim to investigate these challenges in depth and to propose new methods for exploiting correlations among labels. We are also currently evaluating new transformation methods based on correlations among labels.