A Strategy for Training Set Selection in Text Classification Problems

An issue in text classification problems involves the choice of good samples on which to train the classifier. Training sets that properly represent the characteristics of each class have a better chance of establishing a successful predictor. Moreover, sometimes data are redundant or take large amounts of computing time for the learning process. To overcome this issue, data selection techniques have been proposed, including instance selection. Some data mining techniques are based on nearest neighbors, ordered removals, random sampling, particle swarms or evolutionary methods. The weaknesses of these methods usually involve a lack of accuracy, lack of robustness when the amount of data increases, overfitting and a high complexity. This work proposes a new immune-inspired suppressive mechanism that involves selection. As a result, data that are not relevant for a classifier's final model are eliminated from the training process. Experiments show the effectiveness of this method, and the results are compared to other techniques; these results show that the proposed method has the advantage of being accurate and robust for large data sets, with less complexity in the algorithm.


I. INTRODUCTION
Nowadays, most information is stored electronically in the form of text databases. Text databases are growing rapidly due to the increasing amount of information available in electronic form, such as electronic publications, various kinds of electronic documents, e-mails, and the World Wide Web.
Text mining, also known as knowledge discovery from textual databases, is a semi-automated process of extracting knowledge from a large amount of unstructured data. Traditional information retrieval techniques become inadequate for the increasingly vast amounts of text data. Typically, only a small fraction of the many available documents will be relevant to a given individual user. Without knowing what could be in the documents, it is difficult to formulate effective queries for analyzing and extracting useful information from the data. Users need tools to compare different documents, rank the importance and relevance of these documents, or find patterns and trends across multiple documents. Thus, text mining has become an increasingly popular and essential theme in data mining (Feldman 1995).
There are many types of statistical and artificially intelligent classifiers, as can be seen in [1], [2]. One of the main issues in classification problems involves the choice of good samples to train a classifier. A training set that represents the characteristics of a class well has a better chance of establishing a successful predictor.

II. OBJECTIVES
This paper proposes a new approach to training data reduction in text mining classification problems. The new algorithm was inspired by suppression mechanisms found in biological immune systems [3]. The suppression concept is applied to the training process to eliminate very similar data instances and to keep only representative data. The proposal consists of a non-statistical method to select samples for training. The main objectives of this work are to find a subset of samples for training without spending excessive processing time and to simultaneously maintain good accuracy.
To this end, the paper is set out as follows. Section 2 presents a literature review of what has been done to solve the reduction problem, as well as the features and problems associated with each approach. Section 3 introduces a detailed description of the proposed algorithm and the suppression mechanism. Section 4 explains the methodology used in the experiments. Finally, Section 5 presents the conclusions and gives some directions for future work.

III. PREVIOUS WORK
An important contribution in the area of data reduction for structured data (data mining) can be found in (Cano et al. 2003). In this work, the authors present a review of the main instance selection algorithms. In addition, they perform an empirical performance study that compares the classical instance selection methods with four major evolutionary-based strategies. The authors divide the instance selection methods into four sets. The first set involves techniques based on nearest neighbor (NN) rules. These techniques are Cnn [4], Enn, Renn [5], Rnn [6], Vsm [7], Multedit [8], Mcs [9], Shrink, Icf [10], Ib2 [11], and Ib3 [12]. The second set involves methods based on ordered removal. These methods are Drop1, Drop2 and Drop3 [13]. Two methods based on random sampling were considered, i.e., Rmhc [14] and Ennrs [15]. The evolutionary-based methods are the generational genetic algorithm (GGA) [17] and [17], the steady-state genetic algorithm (SGA) [18], and the CHC adaptive search algorithm [19]. The authors in [19] claim that the execution time associated with evolutionary algorithms (EAs) represents a greater cost compared to the execution time of the classical algorithms. However, when compared to non-EAs that have a short execution time, EA-based algorithms offer more reduction without overfitting. The authors concluded that the best algorithm is CHC, whose execution time is lower than that of the rest of the EAs, the probabilistic algorithms and some of the classical instance selection algorithms. The classical and evolutionary algorithms are affected when the size of the data set increases, whereas CHC is more robust. In CHC, the chromosomes select a small number of instances from the beginning of the evolution, so that the fitness function based on 1-NN has to perform a smaller number of operations. There are many other strategies in the literature [20], [21], [22], [23], [24], [25], [26] and [27].

IV. THE SUPPRESSION MECHANISM
The suppression concept of the proposed algorithm, SeleSup (selection by suppressor), is applied to the training set to eliminate very similar data instances and to keep those instances that are truly representative of a certain class [28]. To perform this task, the mechanism divides the training database into two subsets. The first subset represents the white blood cells (WBCs) or antibodies in the organism, and serves as the training set. The second subset represents a set of pathogens or antigens, which select the WBCs with the highest affinity; in this way, the method performs suppression. The algorithm starts from the idea that the system's model must identify the best subset of WBCs, i.e., the training set, to recognize pathogens and to be able to identify new pathogens that are presented.
Both antibodies and antigens were represented as vectors containing the most relevant terms of the documents. Each vector was normalized so that all values share the same scale, mapped to the interval [0,1]. The affinity between antibodies and antigens was determined by the cosine similarity. This measure is commonly used to quantify the level of similarity between two documents.
Given two vectors representing documents, WBC and Pathogen, their cosine describes the similarity:
$$\cos(\theta) = \frac{WBC \cdot Pathogen}{\lVert WBC \rVert \, \lVert Pathogen \rVert}$$
As the angle between the vectors narrows, the cosine of the angle approaches 1, meaning that the two vectors are getting closer, or more similar.
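As an illustration of the affinity measure, the following minimal Python sketch computes the cosine between two term-weight vectors. The function name `cosine_similarity`, the example vectors, and the convention of returning 0 for an all-zero vector are assumptions made for this sketch and are not taken from the original implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a document with no weighted terms matches nothing.
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical vectors with weights already scaled to [0, 1].
wbc = [0.2, 0.0, 0.9, 0.4]
pathogen = [0.1, 0.0, 0.8, 0.5]
unrelated = [0.0, 1.0, 0.0, 0.0]

similar_score = cosine_similarity(wbc, pathogen)
unrelated_score = cosine_similarity(wbc, unrelated)
```

Because the term weights are non-negative, the value stays between 0 and 1: documents that share heavily weighted terms score close to 1, while documents with no terms in common score 0.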
According to [28], the algorithm aims to identify the best subset of antibodies to recognize the antigens, i.e., the new training set must be able to identify new antigens. Finally, the surviving antibodies are characterized by an evaluation measure (fitness value) and are selected to be part of the new reduced training set.
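The suppression step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `cosine` and `selesup`, the `seed` parameter, the use of the highest cosine as the "nearest" WBC, and rounding the WBC count down are all choices made for this sketch.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def selesup(samples, labels, f=0.9, seed=0):
    """Return the sorted indices of the samples kept as the reduced training set.

    samples: list of vectors scaled to [0, 1]; labels: class label per sample.
    f: fraction of samples assigned as WBCs (the rest act as pathogens).
    """
    rng = random.Random(seed)
    order = list(range(len(samples)))
    rng.shuffle(order)                      # initialisation phase
    n_wbc = int(f * len(order))             # assumption: round down
    wbcs, pathogens = order[:n_wbc], order[n_wbc:]
    if not wbcs:
        return []
    fitness = {i: 0 for i in wbcs}
    for p in pathogens:                     # suppression phase
        # Assumption: "nearest" means highest cosine similarity.
        nearest = max(wbcs, key=lambda w: cosine(samples[w], samples[p]))
        if labels[nearest] == labels[p]:
            fitness[nearest] += 1           # the WBC recognized the pathogen
    # Output phase: WBCs that never recognized a pathogen are eliminated.
    return sorted(i for i in wbcs if fitness[i] > 0)


# Hypothetical toy data: ten identical documents of one class.
toy_samples = [[1.0, 0.5]] * 10
toy_labels = ["earn"] * 10
kept = selesup(toy_samples, toy_labels, f=0.5)
```

In the toy run every pathogen is recognized by the same WBC, so only that single WBC survives, which shows how redundant instances are suppressed.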
In other words, those WBCs able to recognize pathogens from the suppression set remain, while the others are eliminated from the population. The signals for a WBC's survival are represented by a fitness variable. Each time the nearest WBC recognizes a pathogen with the same class label, the survival signal is sent and the fitness is incremented. Every WBC with a fitness greater than zero is selected to be part of the new suppressed repertoire. The pseudo-code for this technique can be seen in Algorithm 1.
The Reuters-21578 Text Collection contains documents collected from the Reuters newswire in 1987. It is a standard text categorization benchmark that contains 135 classes. The collection was divided into two subsets: one consisting of the four more balanced classes, identified as Reuters-4, and the other consisting of the ten most frequent classes, identified as Reuters-10. The third data set consists of sixty-two classes and was identified as Reuters-Original.
The last data set, the NewsGroup (20NG) data set, contains approximately 20000 articles evenly divided among 20 Usenet newsgroups. Over a period of time, 1000 articles were taken from each of the newsgroups, which gives an overall number of 20000 documents in this collection. Except for a small fraction of the articles, each document belongs to exactly one newsgroup (Joachims 1997).
The performance of the two classification algorithms, Naive Bayes and Support Vector Machine (SVM), over the reduced training and test subsets produced by SeleSup is compared to their performance over the subsets selected by the CHC algorithm, which is based on genetic algorithms [19], and by random sampling (RS), which uses the reduction percentages obtained in the experiments of each algorithm.
For each of these subsets, SeleSup and RS were each run ten times, and the reduced sets of training data were submitted to the classification algorithms (Naive Bayes and Support Vector Machine). Owing to its computational cost, CHC was executed only once, and the reduction percentage from that single run was adopted. The average over the runs was taken as the final result for each experiment.
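The random sampling baseline and the averaging over ten runs can be sketched as follows. This is only an illustration of the protocol described above: the function names `random_sample` and `average_over_runs`, the `evaluate` callback (standing in for training a classifier and returning its accuracy), and the 0.9 reduction value are hypothetical.

```python
import random
from statistics import mean


def random_sample(indices, reduction, rng):
    """Keep a random (1 - reduction) fraction of the training indices."""
    keep = max(1, round(len(indices) * (1.0 - reduction)))
    return rng.sample(indices, keep)


def average_over_runs(indices, reduction, evaluate, runs=10, seed=0):
    """Repeat the sampling `runs` times and average the evaluation score."""
    rng = random.Random(seed)
    scores = [evaluate(random_sample(indices, reduction, rng))
              for _ in range(runs)]
    return mean(scores)


# Hypothetical usage: 1000 training documents, 90% reduction,
# and a placeholder score that just reports the subset size.
training_indices = list(range(1000))
average_size = average_over_runs(training_indices, 0.9, evaluate=len)
```

Since RS cannot decide by itself how much data to discard, the `reduction` argument has to come from another method, which is why the experiments reuse the percentages obtained by the other algorithms.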

A. REUTERS
The first experiment performed in this paper makes use of the Reuters collection. The SGML files were transformed into XML format and were pre-processed in Microsoft Excel, joining all documents into one single file. The resulting file served as the input to the mining process and contains a collection of 8250 records sorted into 62 categories.
Then, the usual text mining data preparation techniques were performed. Two further subsets were partitioned from this subset: Reuters-4 and Reuters-10, as explained in the next section. The four more balanced classes and the ten most frequent classes are indicated in Tables 2 and 3.

C. Parameters
The parameter settings are given in Table 5 and remained constant throughout the experiments. Stopword removal and stemming were used in the document preparation stage. In addition, keywords were filtered to those with more than 50% significance, and keyword relevance was used to generate the vector space model.

D. Significance Test
Statistical evaluation of experimental results is considered an essential part of the validation of new machine learning methods [29], [30]. The objective of a statistical test is to reject a false null hypothesis [31].
This paper presents a comparison between two nonparametric tests, the Wilcoxon signed rank test [32] and the Mann-Whitney test [33], for comparing two classifiers, Naïve Bayes and SVM. The Wilcoxon signed rank test is mentioned in [29] as a safe and robust non-parametric test for statistical comparisons of classifiers.
The data sets used have a high-dimensional space, which demands a high processing time. Therefore, the training data set of each of the four data sets (see Table 1) was chosen and run with 10-fold cross-validation to obtain a random sample of 10 results. The test is two-tailed with a significance level of 0.05. The results were obtained with the KEEL software [34], [30] and [29].
Generally, when the p-value is greater than 0.05, the null hypothesis is not rejected, meaning there is no evidence that the samples are significantly different. However, if the null hypothesis is rejected (p < 0.05), the difference between the samples is statistically significant.
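To make the decision rule concrete, the sketch below implements an exact two-sided Wilcoxon signed rank test in plain Python by enumerating all sign assignments, which is feasible for the 10 paired results of a 10-fold cross-validation. It is not the KEEL implementation; the function names and the accuracy values are hypothetical, and zero differences are simply dropped, which is one common convention.

```python
from itertools import product


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1               # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_signed_rank(x, y):
    """Return (W, p) for paired samples, with W = min(W+, W-)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]   # drop zero differences
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w = min(w_plus, total - w_plus)
    # Under the null hypothesis every sign pattern is equally likely.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        s = sum(r for r, keep in zip(ranks, signs) if keep)
        if min(s, total - s) <= w:
            extreme += 1
    return w, extreme / 2 ** n


# Hypothetical per-fold accuracies (in %) of two classifiers.
bayes = [81, 82, 83, 84, 85, 86, 87, 88, 89, 90]
svm = [80, 80, 80, 80, 80, 80, 80, 80, 80, 80]
w_stat, p_value = wilcoxon_signed_rank(bayes, svm)
significant = p_value < 0.05
```

In this made-up example the first classifier wins on every fold, so only 2 of the 1024 sign patterns are as extreme as the observed one and the null hypothesis is rejected at the 0.05 level.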

VII. RESULTS AND ANALYSIS
The first experiment was carried out on the Reuters-4 data set. This data set is characterized by balanced classes (see Tables 6 and 7). The accuracy of SeleSup is as good as that of CHC-100 and of the same data set without reduction; the results are very similar. CHC-100 produces the best performance. However, CHC-100 does not have nearly as high a reduction rate as SeleSup.
CHC-1000 achieves a greater reduction, but its accuracy is not nearly as good as that of SeleSup. In the significance tests, there was only one case (CHC-1000) in which the performances were not significantly different.
The second experiment was carried out with the Reuters-10 data set. This data set is characterized by imbalanced classes.
Nevertheless, as can be seen in Table 8, all the classifiers produced satisfactory results when their learning process used the full training and test data sets. In addition, as expected, the same behavior occurs when the suppression mechanism is applied.
The accuracy of SeleSup is as good as the results obtained with the same data set without reduction, with Random Sampling and with CHC-100. The results are very similar between the classifiers. However, CHC-100 does not have nearly as high a reduction rate as SeleSup.
It can be noticed that as the number of evaluations increases, the test accuracy of CHC-1000 decreases and its execution time becomes high (more than 50 higher). Thus, CHC-1000 does not produce nearly as good results as SeleSup.
The results (Table 9) indicate that the Wilcoxon test is more powerful than the Mann-Whitney test, in line with [29].
The third experiment was carried out with the Reuters Original data set. This data set is characterized by strongly imbalanced classes and high dimensionality (Tables 10 and 11). SeleSup produced results almost as good as CHC-1000 in the training set, but the Reuters Original data set without suppression produces the best results in the test set.
It can be noticed once more that CHC-1000 produces the best data reduction percentages, but it is not nearly as fast as SeleSup. According to (Cano et al. 2003), the main limitation of CHC is its long processing time, which makes it difficult to apply this algorithm to very large data sets.
This experiment also exposed the limitations of the SVM on the larger data set (Original Reuters), for which the SVM results were omitted.
Finally, the last experiment was carried out using the Newsgroup data set. This data set is an example of a very large data set, with 18300 instances (see Table 12). This is the largest data set in our experiments.
The results obtained by SeleSup and CHC are very similar in accuracy. In addition, SeleSup was easily applied to this data set, and its results were just as good as those of CHC-1000. Its processing time is a considerable advantage when compared with CHC, which produces a very similar percentage of reduction (92.09% and 93.29%).
It can be observed that RS had, in general, results very similar to those of SeleSup and CHC, but it has the clear disadvantage of not reducing data by itself. Therefore, another algorithm has to be used to define the reduction percentage.
This paper presented a new method for instance selection (IS) by suppressing data in the original training set. IS can be very useful to reduce costs, improve computational performance and eliminate non-informative data. The proposed technique was designed to work together with different types of classifiers. The goal was to reduce the time spent on training without losing accuracy. This approach was inspired by the suppression mechanisms found in biological immune systems.
The experiments were conducted by testing the SeleSup algorithm on four data sets. The performance of three classification algorithms over the resulting training subsets of SeleSup was compared with the performance over the subsets selected by the CHC algorithm and random sampling (RS).
To test whether the algorithms' performances were significantly different, a comparison between the non-parametric Mann-Whitney U and Wilcoxon signed rank tests was adopted. In the tests, there was only one case in which the performances were not significantly different. Therefore, the statistical tests have provided strong evidence concerning the results obtained when comparing the evaluated algorithms.
The SeleSup algorithm significantly reduces the data set size. It is just as good as the CHC algorithm and offers the advantage of being faster, so it consumes less processing time. Although CHC has a higher reduction rate, it does not produce the best results with high-dimensionality data sets, and it showed a high execution time. Moreover, unlike CHC, the presented approach was applied to all the data sets on a less powerful computer, and overall, its results were better than those of RS.

IX. FUTURE WORK
An alternative method for performing a faster test would be to insert into the WBC population the pathogen-specific WBC whose distance is the minimum distance. This technique should provide the system with the capability of keeping rare cases or rare classes in the training set.
An additional improvement to the original algorithm could be to introduce some probabilistic information into the choice of the WBCs to be eliminated. The mechanism currently works deterministically with regard to data selection.

________________________________________________
Algorithm 1: The Suppressive Algorithm
________________________________________________
input: The normalised (in [0, 1]) full training data set T and the fraction f of WBCs (default f = 0.9)
output: A reduced training data set T
// Initialisation phase
Shuffle T and assign [f · |T|] samples as WBCs (training set);
the remaining samples are assigned as pathogens (suppression set);
for all the WBCs do fitness = 0;
// Suppression phase
for each pathogen p do
    NearestWBC ← Find the nearest WBC with regard to p;
    if NearestWBC's class = p's class then
        // NearestWBC was able to recognize the pathogen
        Increment the NearestWBC's fitness by one;
    endif;
end;
// Output phase
Eliminate those WBCs whose fitness value is 0;
Output the set of surviving WBCs as the reduced training set T
________________________________________________

V. EXPERIMENTAL STUDY
In this section, the experiments presented aim to evaluate the reduced training instances selected by the SeleSup algorithm in four data sets frequently used in information retrieval research.
Reuters collection (Zeidat et al. 2006; Yang et al. 1996; Schapire 1990; Schapire et al. 2000; Sebastiani 2002). The Reuters-21578 collection is a collection of documents from the Reuters news agency that was released in 1987. By 1990, the collection was given to the scientific community to perform research related to text categorisation. The rights of authorship belong to Reuters Ltd. and the Carnegie Group, which promoted its free distribution for research activities. The document basis consists of 21578 Reuters articles that consist of files in the SGML language. These documents are grouped into 22 separate files. Each document possesses several attributes that indicate different characteristics. The attributes used in this work are: Lewissplit (related to the information of the experiments done by Lewis, who defines the values Test, Training and Not-Used); Oldid, which represents the identification number of the collection (before the Reuters-21578); D, which represents the categories or classes; and Body, which presents the text content of major news. The number of documents per class varies from class "earnings" (3964 documents) to class "castor-oil" (which contains a single document). Furthermore, some documents are not associated with any of the classes, and others are associated with up to 12 of the classes.

TABLE II.
FOUR MORE BALANCED CLASSES OF REUTERS DATA SET.

TABLE VI.
RESULTS FOR REUTERS-4 DATA SET

TABLE VIII.
RESULTS FOR REUTERS-10 DATA SET

TABLE IX.
MANN-WHITNEY U AND WILCOXON TESTS COMPARING BAYES VS SVM FOR REUTERS-10 DATA SET

TABLE X.
RESULTS FOR ORIGINAL REUTERS DATA SET

TABLE XI.
MANN-WHITNEY U AND WILCOXON TESTS COMPARING BAYES VS SVM FOR ORIGINAL REUTERS DATA SET

TABLE XIII.
MANN-WHITNEY U AND WILCOXON TESTS COMPARING BAYES VS SVM FOR NEWSGROUP DATA SET

To train classifiers efficiently on large collections of text, the training set must be selected carefully. If an excessive number of documents is used, the computational effort can make the task impossible. Using a very small sample leads to an inaccurate classifier.