A New Feature Filtering Approach by Integrating IG and T-Test Evaluation Metrics for Text Classification

High dimensionality is one of the main issues in text classification, and selecting the most discriminative feature subset for a classifier to use effectively is a difficult task. This important preprocessing stage of selecting the relevant features is often called feature selection or feature filtering. Eliminating irrelevant and noisy features from the original feature set drastically reduces the size of the feature set and the time complexity of the classification models, while improving or maintaining their performance. Most existing filtering methods either produce a subset with a relatively high number of features without much impact on running time, or produce a subset with fewer features at the cost of performance degradation. In this paper, we propose a new bi-strategy filtering approach that integrates Information Gain (IG) with the t-test and selects a subset of informative features by considering both the score and the ranking of each feature. Our approach takes into account the disparity between the results produced by the two benchmark metrics in order to maximize their advantages and lessen their disadvantages. The approach sets a new threshold parameter by computing the V-scores of the minimum-scoring features present in both subsets, and uses it to further refine the selected features. Hence, it reduces the size of the feature subset without losing many informative features. Experimental results on three different text datasets show that the proposed method is able to select highly discriminative features while achieving a significant improvement in classification accuracy and F-score at a minimal running time.
Keywords: Dimensional reduction; feature filtering; feature selection; t-test; information gain; V-score


I. INTRODUCTION
In this emerging era of computing and internet technology, and especially with the emergence of social media, text analytics has become more cumbersome [1]. As a result, both the number of features and the number of instances in textual datasets have been increasing rapidly. The increasing size of text data poses diverse research problems for text analytics tools such as machine learning. Text classification is one of the prominent problems associated with text analytics [2] [3], and is currently becoming one of the most vital research directions in the field of machine learning.
Text classification, or document classification, is the problem of assigning unlabeled text instances to one or more predefined labelled classes or categories [4] [5] [6] [7]. Text classification has been utilized in various application domains [8], e.g. spam filtering [9], sentiment analysis [10], natural language processing [11] [12], information retrieval, text mining and so on.
One of the most crucial steps in preprocessing text data is the representation of text documents in a vector space via the Bag-of-Words (BOW) model [13][14] [15]. The final product of this task is associated with two main issues, a vast number of features and the presence of irrelevant and noisy features, which are generally termed high dimensionality [15]. These issues can cause many problems for the text classification task, which is known to be intrinsically high dimensional [4] [5]. Classification with a high number of features, or in a high-dimensional space, can become infeasible or very difficult because of its computational cost [16] [17]. Feature reduction is therefore treated as a dimensionality reduction problem. The huge number of features generated introduces the so-called "curse of dimensionality", with thousands of features that increase the computational complexity of a classifier [18] [19] [20][21] [22]. The curse of dimensionality is a well-known problem for machine learning models [23]. When it arises in text classification, it seriously worsens the performance of the classifier in terms of classification accuracy and running time [5] [24].
The main goal of dimensionality reduction is to reduce the number of features without worsening the performance of the classifier [14] [19]. As the key way to overcome this problem, feature selection (FS) techniques can be applied to filter out irrelevant, redundant and noisy features and select the most informative subset of features from the original feature set [1] [19]. This task aggressively reduces the original vector space representation of the features to a lower-dimensional vector representation [25][26] [27]. Moreover, the properties of the informative features in the original feature set remain unaltered in the process of dimensionality reduction. An FS approach ranks the original features according to some evaluation criterion (scores) and selects the top-ranked features to form an informative subset [27], which retains a good degree of discriminating capability in separating documents of various categories [28] [29]. In contrast to feature selection, a feature extraction approach transforms the text documents from their original high-dimensional feature space onto a new lower-dimensional space instead of selecting a feature subset from the original feature set [30][31] [15].
500 | P a g e www.ijacsa.thesai.org
Generally, feature selection methods [32] are broadly grouped into filter methods, wrapper methods [33], and embedded methods [34] [35] [27]. Filter methods [36] are independent in that they do not interact with a classifier when constructing an informative feature subset. They rely on metrics for evaluating and ranking the importance of a feature prior to classification. These methods can sort features quickly and so effectively filter out a high number of irrelevant or noisy features [27]. They select a feature subset by considering the usefulness of each feature according to evaluation metrics [35] [28][27] [37]. Filter methods usually have good computational efficiency but sacrifice classification accuracy to some extent.
Information Gain [38], Chi-Square [39], Fisher Score [40], ReliefF [41], and t-test [4] are among the filter-based methods. Wrapper methods depend on classifiers in that they frequently interact with the classification algorithm in order to construct a subset of informative features [13][35] [27]. They evaluate a particular feature subset by training and testing a given classifier, and are therefore tailored to a particular classifier [42]. These methods have poor computational efficiency but result in high classification accuracy, and they are not usually favoured in text classification tasks [43]. Heuristic Search Algorithms (HSA) and Sequential Selection Algorithms (SSA) [44][45] [46] are common examples of classical wrapper methods. Embedded methods integrate classifiers with the feature selection technique during the training phase and search for an optimal feature subset by designing an optimization function [35][44] [47]. Like wrapper methods, embedded methods frequently interact with the classifier and are tailored to a specific classifier, but they have better computational efficiency than wrapper methods [43]. Feature Selection-Perceptron (FS-P) [48], Support Vector Machine Recursive Feature Elimination (SVM-RFE) [49], and Lasso (L1) and Elastic Net (L1+L2) based models [50] [51] are some examples of embedded methods.
This paper is based on the filter FS approach. The goal of this research is to propose a new approach that selects more informative features from the original feature set, helping a classification model to achieve good performance with regard to both time complexity and classification accuracy. The main focus is on dimensional reduction, that is, reducing the number of features and the processing time without sacrificing classification accuracy.
The features are subjected to two filter-based evaluation metrics (IG and t-test), so that the final output contains only the discriminative features that contribute highly to the classification task, forming a lower-dimensional subset based on each feature's rank and score. The approach blends the concepts of intersection and vector magnitude to select a refined subset of informative features by considering both the score and the ranking of each feature. An experiment conducted on three distinct text datasets shows that the proposed approach produces acceptable results, achieving average classification accuracies of 67.65%, 54.74%, and 80.16% and average running times of 7464 ms, 4689 ms, and 29806 ms on 20NewsGroups, NewsCategory, and Reuters-21578, respectively. This shows that the method retains most of the informative features when compared with the other chosen methods.
The remainder of this paper is organized as follows. Section 2 presents related works. Section 3 discusses the proposed approach and the filter-based feature selection methods it employs, namely IG and t-test. Section 4 describes the properties of the datasets used and the experimental setup. Section 5 presents the experimental results and discussion. Finally, Section 6 concludes the study and highlights possible future work.

II. RELATED WORKS
There is a large number of research works on filter-based feature selection metrics for removing irrelevant and noisy features in text classification problems. The primary aim is often to reduce the feature dimensionality so as to minimize the processing time while maintaining or improving the classification accuracy. In an effort to reduce the computational complexity, numerous recent works have hybridized multiple scoring metrics to select the most informative features. Result discrepancy is among the top challenges in the hybridization approach, as different results are obtained when different evaluation metrics are applied to the same dataset [38], and this issue can result in selecting non-contributory features. In this section, we briefly review those works and then summarize the drawbacks of the existing methods.
Lewis [52] used mutual information (MI) to measure the importance of a feature, and thus proposed a new scoring metric known as Mutual Information Maximization (MIM) that computes the relevancy between n features and classes. Liu and Setiono [39] proposed an algorithm that computes the score of each feature and selects relevant features based on the chi-square score. The algorithm calculates the numeric attribute intervals and selects features according to the statistical characteristics of the data. Mladenic and Grobelnik [53] conducted a comparative study on a different dataset in which, for the Multinomial Naïve Bayes (NB) model only, Odds Ratio prevailed over a wide variety of the evaluation metrics compared. For feature filtering, Bi-Normal Separation (BNS) has previously been described as outstanding in ranking terms. Forman [54] improved an existing scoring metric for features by substituting IDF with BNS. The new method, TF-BNS, scales the magnitude values and ranks features by computing the BNS score of every feature. Empirical evaluation on text classification tasks using a Support Vector Machine (SVM) showed significantly better performance in terms of F-measure and accuracy. Uguz [55] applied IG to rank the terms in a given document according to their importance in the initial stage of his proposed framework. Vinh et al. [56] proposed a new approach for selecting features that normalizes the well-known MI measure and uses it to assess the potential of the features. Despite the competitive results achieved, the proposed approach could not suppress the influence of highly correlated features on the classification outcomes. Azhagusundari and Thanamani [57] developed a feature selection method based on IG for selecting the discriminant features from a given original set. The authors used IG to build a discernibility matrix that could be used to select the optimal subset of features from the set of original data.
Experimentally, they showed that their method obtained comparable classification accuracy in comparison with the original dimensionality. A greedy feature selection method using mutual information was introduced by Hoque et al. [58]. The method blends feature-feature and feature-class MI to select the optimal feature subset. Wang et al. [4] used the concept of term frequency to develop a new feature scoring metric based on the t-test; the method measures the diversity of the distributions of a feature between a particular category and the entire dataset. Experimental results indicate that the proposed method is marginally better than the IG and chi-square methods in terms of micro-F1 and macro-F1. Rehman et al. [5] proposed a novel metric for feature ranking named Normalized Difference Measure (NDM), which evaluates the rank of a term by considering the term's relative document frequencies in both the positive and negative classes. Zhou et al. [37] proposed a feature selection algorithm that uses segmented term frequency to compute the frequency of a document. Moreover, the impact of the same feature term on the classification under different term frequencies is carefully considered. The algorithm uses the resulting term frequencies to score each available feature and selects those features that are above a defined threshold. When compared with six different FS methods, the empirical results demonstrated that the proposed method was able to increase classification accuracy on a textual dataset.
All the works mentioned above are single FS methods that consider only a single strategy for selecting an informative subset of features. Considering multiple strategies together is impossible with a single feature selection method. In view of that, the hybridization approach has recently received significant attention in the field of dimensional reduction. Hybrid methods combine different FS methods, which consider various aspects of the features, into a single method. Tsai and Hsiao [59] combined multiple dimensional reduction methods to identify more informative features for a stock price prediction task. The method integrates decision tree, PCA, and genetic algorithm as search methods, and utilizes the intersection, union, and multi-intersection approaches to filter out irrelevant variables. An intermediary method between the union (OR) and intersection (AND) approaches, named modified union, was presented by Bharti and Singh [60]. The authors applied union (OR) on the top-k ranked features and intersection (AND) on the remaining unselected feature subsets, which merges the feature subsets into a single subset and further selects the most relevant features. The feature filtering methods used in the study are document frequency (DF) and term variance (TV). To exploit the advantages of two different FS methods, a hybridization of the cluster-based and frequency-based approaches was presented by Nguyen and Bao [13]. The proposed method, termed FCFS, achieved the best performance in terms of micro-F1 in comparison with its counterparts. To tackle the problem of result discrepancies, a new feature selection approach that combines the scores computed by multiple FS methods into one was proposed by Rajab [61]. The proposed method normalizes the scores produced by the IG and Chi-square metrics, computes the vector score (V-score) magnitude of each feature from them, and selects the top-ranked features.
Kamalov and Thabtah [62] proposed a method that selects optimal features from the feature rankings produced by three different ranking strategies. The authors used vector scores (V-scores) to stabilize the scores obtained from the three methods (IG, Chi-square, and inter-correlation) and assign a new rank to each feature. To further remove irrelevant and noisy features from the feature subsets produced by two different evaluation functions, Li et al. [27] applied the union approach to the lowest-ranked feature subsets produced by the Fisher score and IG methods.
Many studies in the literature have investigated the strength of several filtering methods and their combinations. Forman [32] empirically studied and compared twelve different evaluation metrics for feature selection on a text classification problem, and found that BNS with IG has the least correlated failures, marking it as the best backup choice. The impact of integrating five FS methods was investigated by Thubaity et al. [63]. The study employed the IG, Chi-square, NGL, GSS, and RS methods on an Arabic textual dataset. The union (OR) and intersection (AND) approaches were used to integrate the scores produced by the various FS methods into a single sorted feature set. Analysis of the results showed no improvement in classification accuracy when more than three FS metrics were integrated, while a small improvement was observed when integrating two or three FS metrics. Vora and Yang [64] presented a comparative study of ten different filtering methods, namely Fisher Score, Chi-square, Gini Index, Laplacian Score, IG, mRMR, CFS, FCBF, Kruskal-Wallis, and ReliefF. In experiments on five different text datasets, the authors found that the combination of Kruskal-Wallis and Gini Index with an SVM classifier led the race, achieving competitive classification performance at the cost of longer processing time, while IG and Chi-square were found to select a large number of the same features.

III. MATERIAL AND METHOD
This section presents a brief discussion of the information gain and t-test algorithms, since both are used by the proposed approach explained in sub-section C.

A. Information Gain Algorithm
Information Gain (IG) [38] [65] is an information-theoretic, entropy-based method that is widely used in the field of dimensional reduction [43] [37]. IG was originally used to determine which attribute to use when splitting instances in decision tree-based models [66], and is currently applied to select the informative feature subset from a given set of features. The method computes and assigns a score to each feature by considering the variation between the entropies obtained from the presence or absence of the term in a given category [37]. A high information gain (a high score) indicates the discriminating capability of a feature, which is then ranked at the top. The entropy of a discrete random variable X is formulated as:
$$H(X) = -\sum_{i} P(x_i) \log_2 P(x_i) \quad (1)$$
where x_i denotes a specific event of the variable X and P(x_i) denotes the probability of the event x_i. The general formula for computing the IG of a given feature t is given as:
$$IG(t, c) = \sum_{c' \in \{c, \bar{c}\}} \sum_{t' \in \{t, \bar{t}\}} P(t', c') \log_2 \frac{P(t', c')}{P(t')\, P(c')} \quad (2)$$
where P(t, c) is the joint probability of class c and the occurrence of the feature t, P(t) is the probability of a document containing feature t, and P(c) is the probability of class c. \bar{t} and \bar{c} denote feature not present and class not present, respectively. Let N represent the total number of documents in a given dataset, and let the Ns with the indicated subscripts represent counts of documents, where the first subscript indicates whether the document contains t (1) or not (0), the second indicates whether it belongs to c (1) or not (0), and a dot denotes a sum over that subscript. Using maximum likelihood estimates (MLEs) of the probabilities, equation (2) can be expressed as:
$$IG(t, c) = \frac{N_{11}}{N} \log_2 \frac{N N_{11}}{N_{1.} N_{.1}} + \frac{N_{01}}{N} \log_2 \frac{N N_{01}}{N_{0.} N_{.1}} + \frac{N_{10}}{N} \log_2 \frac{N N_{10}}{N_{1.} N_{.0}} + \frac{N_{00}}{N} \log_2 \frac{N N_{00}}{N_{0.} N_{.0}} \quad (3)$$
In information-theoretic terms, IG measures how much information a term/feature contains about the class. If the distribution of a term in the class is the same as in the whole collection, then IG(t, c) = 0. IG attains its maximum value if the term is a perfect discriminator for class membership, that is, if the term occurs in a document if and only if the document is in the class.
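As an illustration of how the IG score of a term for one class can be estimated from document counts, the following minimal Python sketch applies maximum likelihood estimates to a 2x2 table of counts. The function name `information_gain` and the argument names (`n11`, `n10`, `n01`, `n00`) are assumptions made for this example and are not taken from the original implementation; the use of base-2 logarithms is also an assumption.

```python
import math


def information_gain(n11, n10, n01, n00):
    """IG of a term for a class, estimated from document counts.

    n11 -- documents that contain the term and belong to the class
    n10 -- documents that contain the term but are outside the class
    n01 -- documents without the term that belong to the class
    n00 -- documents without the term that are outside the class
    """
    n = n11 + n10 + n01 + n00
    if n == 0:
        return 0.0
    term_yes, term_no = n11 + n10, n01 + n00      # marginals over the term
    class_yes, class_no = n11 + n01, n10 + n00    # marginals over the class

    def cell(joint, term_total, class_total):
        # Convention: a cell with zero joint count contributes 0 (0 * log 0 = 0).
        if joint == 0:
            return 0.0
        return (joint / n) * math.log2(n * joint / (term_total * class_total))

    return (cell(n11, term_yes, class_yes) + cell(n01, term_no, class_yes)
            + cell(n10, term_yes, class_no) + cell(n00, term_no, class_no))
```

With these definitions, a term that is distributed identically inside and outside the class scores 0, while a term that occurs in exactly the documents of the class reaches the entropy of the class variable.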

B. Student Statistical Test Algorithm
Student's statistical test (t-test) is a statistical method commonly used to evaluate whether the means of two groups are statistically different from each other by computing the ratio between the difference of the two group means and the variability of the two groups [4] [67]. Presently, the t-test is widely used as an evaluation function to select significant features that contribute to classifying instances. The method computes the score of a feature by measuring how the distribution of the term in the relevant category differs from its distribution in the whole document collection [68]. The t-test score of a term is first calculated for each category (4), and the class-specific scores obtained from (4) are then combined to find the final score. The quantities involved are the standard deviation within a category, the k-th category, the number of documents in the k-th category, the total number of categories, the average TF of the term in the k-th category, the average TF of the term in the corpus, and the total number of documents.
When the score is less than the defined threshold, the feature has low discrimination ability; otherwise, the feature contributes to classifying instances and is selected.
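To make the idea of a term-frequency-based t-test score concrete, the following Python sketch computes a score per category as the ratio between the difference of the category mean and the corpus mean and a variability term. It is only an illustration, not the formulation of [4] or of this paper: the pooled within-category standard deviation, the scale factor sqrt(1/N_k + 1/N), the maximum used to combine the class-specific scores, and the function names are all assumptions.

```python
import math


def t_test_scores(tf, labels):
    """Return a {category: score} dict for a single term.

    tf     -- term frequency of the term in each document
    labels -- category of each document (same length as tf)
    """
    n = len(tf)
    categories = sorted(set(labels))
    k = len(categories)
    corpus_mean = sum(tf) / n

    # Average TF of the term inside each category.
    groups = {c: [x for x, y in zip(tf, labels) if y == c] for c in categories}
    cat_mean = {c: sum(v) / len(v) for c, v in groups.items()}

    # ASSUMPTION: pooled within-category standard deviation with N - K degrees of freedom.
    squared_dev = sum((x - cat_mean[y]) ** 2 for x, y in zip(tf, labels))
    std = math.sqrt(squared_dev / (n - k)) if n > k else 0.0

    scores = {}
    for c in categories:
        diff = abs(cat_mean[c] - corpus_mean)
        # ASSUMPTION: variability term scaled by sqrt(1/N_k + 1/N).
        scale = math.sqrt(1.0 / len(groups[c]) + 1.0 / n) * std
        if scale > 0:
            scores[c] = diff / scale
        else:
            scores[c] = math.inf if diff > 0 else 0.0
    return scores


def combined_t_test(tf, labels):
    """ASSUMPTION: combine the class-specific scores by taking their maximum."""
    return max(t_test_scores(tf, labels).values())
```

A term whose average frequency is the same in every category receives a score of 0 under this sketch, while a term concentrated in one category receives a larger score.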

C. Proposed Approach
Considering the problem of result discrepancy that arises when two filtering methods are combined, and the risk of losing informative features, we propose a new bi-strategy feature filtering approach that hybridizes IG with the t-test to remove non-discriminative features by taking into consideration both the feature ranking and the vector score magnitude (V-score). The approach applies the IG and t-test metrics independently to compute scores for every feature in the original feature set, producing two scored feature sets, say D1 and D2. The top-ranked features that exceed a predefined threshold K1 are considered significant and are selected, so that new feature subsets S1 (based on IG) and S2 (based on the t-test) are generated independently. Next, among the features that are present in both S1 and S2, the feature with the minimum IG score and the feature with the minimum t-test score are selected and their V-scores are computed. The smaller of the two computed V-scores is set as the new threshold K2. The approach then refines the feature subsets by selecting a feature only if it is present in S1 or S2 and its V-score is greater than the new threshold K2; otherwise it is a non-discriminative feature and is discarded.
The V-score of a given feature is computed using the concept of vector magnitude proposed in [61], that is, by summing the squares of a vector's coordinates and taking the square root of the sum. For a feature t with normalized scores IG(t) and TS(t), it is formulated as:
$$V(t) = \sqrt{IG(t)^2 + TS(t)^2}$$
NB: The scores produced by IG are on a different scale from those produced by the t-test, so the scores must be normalized before computing the V-score in order to transform them onto an equivalent scale.
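The V-score computation can be sketched in a few lines of Python. The source does not state which normalization is used, so min-max scaling onto [0, 1] is an assumption here, as are the function names `min_max` and `v_scores` and the toy feature scores.

```python
import math


def min_max(scores):
    """ASSUMPTION: min-max normalization of a {feature: score} dict onto [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {f: ((s - lo) / span if span else 0.0) for f, s in scores.items()}


def v_scores(ig, ts):
    """Vector magnitude of the normalized (IG, t-test) score pair of each feature."""
    ig_n, ts_n = min_max(ig), min_max(ts)
    return {f: math.hypot(ig_n[f], ts_n[f]) for f in ig}


# Toy scores on deliberately different scales.
ig = {"a": 0.0, "b": 0.5, "c": 1.0}
ts = {"a": 10.0, "b": 0.0, "c": 5.0}
v = v_scores(ig, ts)
```

In this toy example feature "c" obtains the largest V-score because it scores reasonably well under both metrics, whereas "a" and "b" each score well under only one.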
Consider Table I, which contains a few sample features extracted from 20NewsGroups. It shows the ranking of each feature generated by the chosen filter methods. Discrepancies can be seen in the output. This issue arises because distinct filter methods use different theoretical strategies to compute the score of each feature in the given dataset. IG ranked "Thanks" the lowest while the t-test ranked it the highest, so the IG method is very likely to eliminate this particular feature when a threshold is defined, even though the t-test method selected it as the most informative feature. Therefore, both methods, like the existing hybrid filtering methods, run the risk of losing informative features. The proposed approach mitigates this issue by considering both the ranking and the score of each feature. It sets a new threshold based on the computed V-score and further refines the feature subset.

    ...
    APPEND(L_1 (A(t_i, c), t_i))
    ...
    APPEND[SL, (t_i)]
    End For
    Return SL
    End
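The selection procedure described in this sub-section can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names, the min-max normalization, the interpretation of K1 as a fraction of the ranked features, and the fallback used when S1 and S2 share no feature are all assumptions.

```python
import math


def top_ranked(scores, k1):
    """Keep the top k1 fraction of the features, ranked by score (descending)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:max(1, round(k1 * len(ranked)))])


def normalize(scores):
    """ASSUMPTION: min-max normalization onto [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    return {f: ((s - lo) / (hi - lo) if hi > lo else 0.0) for f, s in scores.items()}


def bi_strategy_select(ig, ts, k1=0.6):
    """Select features from {feature: score} dicts of IG and t-test scores."""
    s1, s2 = top_ranked(ig, k1), top_ranked(ts, k1)       # initial subsets S1, S2
    ig_n, ts_n = normalize(ig), normalize(ts)
    v = {f: math.hypot(ig_n[f], ts_n[f]) for f in ig}     # V-score of every feature

    common = s1 & s2
    if not common:
        # ASSUMPTION: without a shared feature K2 is undefined; fall back to the union.
        return s1 | s2
    f_ig = min(common, key=ig.get)   # shared feature with the minimum IG score
    f_ts = min(common, key=ts.get)   # shared feature with the minimum t-test score
    k2 = min(v[f_ig], v[f_ts])       # new threshold K2
    return {f for f in s1 | s2 if v[f] > k2}


ig = {"a": 1.0, "b": 0.9, "c": 0.6, "d": 0.1, "e": 0.0}
ts = {"a": 0.6, "b": 0.0, "c": 0.7, "d": 1.0, "e": 0.1}
selected = bi_strategy_select(ig, ts)
```

In this toy example S1 = {a, b, c} and S2 = {a, c, d}. The intersection would keep {a, c} and the union {a, b, c, d}, while the sketch keeps {a, d}: feature "d" is retained because of its high V-score even though only the t-test ranked it highly.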
A summarized flowchart of the proposed methodology is depicted in Fig. 1. The process begins with the raw datasets as input. After the unbalanced datasets are relatively balanced, the original feature set is constructed using TF-IDF. The next step is the formation of the initial feature subsets, in which the IG and t-test filter methods compute and assign a score to every feature and the high-ranked features are selected. Next, we employ the proposed bi-strategy filtering approach to further refine the initial feature subsets and generate the new informative feature subset. Lastly, we validate the new approach by recording the classification accuracy and F-score of the selected classifiers.

D. Experiment and Datasets
In this sub-section, the datasets used and the experimental process adopted are briefly explained. The classification algorithms employed are also presented. Lastly, the sub-section ends with a discussion of the classifiers and implementation requirements.

1) Dataset:
To evaluate the proposed approach, three well-known benchmark text datasets widely used for multi-class classification tasks were selected. Two of the datasets (Reuters-21578 and News Category) are unbalanced while the other one (20NewsGroups) is balanced. All the datasets are high-dimensional with a large number of samples, and they differ in the number of classes. The summary information of the datasets is displayed in Table II. 20NewsGroups comprises approximately 20,000 documents gathered from a collection of Usenet newsgroups [69], and it consists of 20 relatively balanced distinct categories, each containing around 1000 documents. Reuters-21578 comprises 21578 documents gathered from the Reuters newswire, and it consists of 135 unbalanced categories, with each document associated with at least one category (multi-label) [70] [60]. Before importing the dataset into our experiment, we assigned only one category label to each document by stripping out all country names on the list and selecting the first topic left. Moreover, any document not associated with any topic was eliminated from the dataset. This significantly reduced the number of categories and documents to 90 and 11367, respectively. News Category comprises around 150,000 samples gathered from Short News Category, and it consists of 41 unbalanced categories, with each document associated with at least one category (multi-label). We combined a few categories that can be naturally merged, such as 'CULTURE & ARTS', 'ARTS & CULTURE', and 'ART'. This reduced the number of categories to 36.

2) Experiment settings:
In the initial phase of the experiment, all English letters are converted to lowercase, stop words are removed, and words containing non-letter characters are filtered out. The roots of English words are then found by applying the Porter stemmer algorithm [71]. Lastly, feature extraction is performed using TF-IDF weighting [72]. NB: each of the three datasets is randomly divided into 60% training and 40% testing.
To validate the proposed filtering approach and its effectiveness on classification models, two existing benchmark feature filtering methods are selected for comparison, namely IG and t-test. The selection is based on the fact that the proposed method is a hybrid of these two methods. Besides the new approach, three other existing hybrid filtering approaches are also selected: the Union (OR) approach, the Intersection (AND) approach, and the Vector Magnitude (V-score) approach proposed in [61]. The initial threshold value K1 is based on the number of features ranked in the original set and was set to 60% for both IG and t-test; any feature below the predefined threshold is a low-scored feature and is disregarded, otherwise it qualifies for further selection evaluation. The Jaccard Similarity Coefficient (JCC) is used in this study to measure the similarity of the features selected by the different benchmark filtering methods.
Five well-known classification methods are used for validation in this study. The selection is based on the positive recommendation of these methods for multi-class text classification. The selected methods are Support Vector Machine (SVM) [18] [22], Naïve Bayes (NB) [73], Decision Tree (DT) [38], Random Forest (RF) [74] [18], and Ridge Regression (RR) [75] [76]. All these models are used to record the classification accuracy and performance of the stated filtering methods. The default values of most of the parameters associated with the classification methods are retained. For SVM and NB, multi-class SVC with a kernel function and MultinomialNB are adopted, while for RF the number of estimators was set to 100 on the 20NewsGroups and Reuters-21578 datasets and to 20 on the News Category dataset.
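For reference, the Union (OR) and Intersection (AND) baselines and the Jaccard similarity between two selected feature subsets can be expressed with plain Python sets. The function name `jaccard` and the toy subsets `s_ig` and `s_ts` are hypothetical and serve only as an illustration.

```python
def jaccard(a, b):
    """Jaccard similarity coefficient |A & B| / |A | B| of two feature subsets."""
    a, b = set(a), set(b)
    union = a | b
    # Convention: two empty subsets are treated as identical.
    return len(a & b) / len(union) if union else 1.0


# Toy subsets selected by two different filter metrics.
s_ig = {"a", "b", "c"}
s_ts = {"a", "c", "d"}

union_subset = s_ig | s_ts          # Union (OR) approach
intersection_subset = s_ig & s_ts   # Intersection (AND) approach
similarity = jaccard(s_ig, s_ts)
```

A low coefficient means that the two metrics select largely different features, which is the situation in which combining them can add information.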
Because of space limits, the performance of the classification methods is reported using two standard metrics widely used for text classification in the literature, namely Accuracy and F1-score. Accuracy is the percentage of documents in the entire dataset that are classified correctly. F1-score is the harmonic mean of precision and recall. Accuracy and F1-score were computed using the following equations:
$$Accuracy = \frac{\text{number of correctly classified documents}}{\text{total number of documents}} \times 100\%$$
$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
In this study, all implementations for the experiment were carried out in a Python (V3.8.2) environment installed on a computer running Windows 8. The other minimum hardware conditions for the experiment are Intel(R) Core(TM) i5-4300M processor @ 2.60 GHz / 8 GB RAM / 64 GB.
Table III gives a brief description of the chosen filtering methods based on the formulation and strategy adopted. All the selected hybrid filtering methods were formulated by integrating an information-theoretic benchmark method (IG) with a statistical benchmark method (t-test). These two benchmark methods were selected because the Jaccard Similarity Coefficient between them is very low compared with other methods.
The average percentages of features removed by the different methods on the three datasets are displayed in Fig. 2. The figure shows the differences in the percentage reduction of feature dimensions between the proposed method (PM) and the existing methods (IG, TS, UA, IA, VS). PM, UA, and IA reduced the number of features by 52.07%, 19.2% and 60% on the 20NewsGroups dataset, by 48.08%, 26.53%, and 53.48% on the NewsCategory dataset, and by 45.25%, 28.86% and 51.15% on the Reuters-21578 dataset, respectively. IG, TS and VS reduced the features by 40% on all three datasets, because a fixed threshold K1 was defined in all the experiments.
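As a supplement to the two evaluation metrics defined in this sub-section, the following Python sketch computes accuracy and F1-score from lists of true and predicted labels. The source does not state how the per-class F1-scores are averaged in the multi-class setting, so the macro average used here is an assumption, as are the function names.

```python
def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true, y_pred, cls):
    """Harmonic mean of precision and recall for one class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """ASSUMPTION: unweighted mean of the per-class F1-scores."""
    classes = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

Multiplying `acc` by 100 gives the accuracy as a percentage, the form in which it is reported in the result tables.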
The figure reveals that on all three datasets the proposed method reduces the feature dimensions comparatively well, with IA and UA achieving the highest and lowest percentages of features removed, respectively. The proposed approach could not beat the IA method in terms of feature reduction because we take seriously the risk of losing informative features, which would hurt the performance of the classifier, as observed with IA and related methods. In particular, our proposed approach serves as an intermediary between IA and the other methods.
Tables IV, V and VI show the performance of the classifiers (Ridge, MNB, SVC, DT and RF) in terms of classification accuracy and running time after applying the existing and proposed filtering methods to the text datasets. The best results are in boldface. The impact of the filtering methods on classifier performance is noticeable in the results of all five classifiers.
In Table IV, accuracy results and running times are summarized for the different filtering methods on the 20NewsGroups dataset. All the methods showed good performance with all the classifiers except DT. The average classification accuracies when the filtering methods (IG, TS, UA, IA, VS and PM) were applied are 67.54%, 68.09%, 67.20%, 65.51%, 66.90%, and 67.65%, respectively. From the average scores, we notice that the accuracy of the different filtering methods is basically at the same level across all the classifiers, with TS achieving the highest accuracy, followed by our method with no significant difference. In terms of running time, the proposed approach and IA marked the lowest, with average running times of 7464 ms and 7425 ms respectively, thus outperforming TS and the other methods. In particular, our method shows competitive performance on the 20NewsGroups dataset.

IV. DISCUSSION OF RESULTS
The classification performance on the NewsCategory dataset is shown in Table V. The average classification accuracies for the filter methods are 54.29%, 54.40%, 54.50%, 51.14%, 54.46%, and 54.75%. The average accuracy of the proposed approach is slightly higher than that of the other filter methods. The table also shows that, in most cases, our approach achieved a low running time (4689 ms on average), only slightly higher than that of IA (4635 ms on average). Overall, the comparison on the NewsCategory dataset shows that our method achieved the highest average accuracy at a running time close to the lowest. Table VI reports the classification performance on the Reuters-21578 dataset. It shows that the accuracy of the different filtering methods is roughly similar. The average accuracies for the filter methods are 79.78%, 79.87%, 75.72%, 80.00%, and 80.16%. Compared with the other filter methods, the average accuracy of the proposed method is higher. The two lowest average running times among all the filter methods are achieved by IA (29865 ms) and our method (29806 ms).
In general, the performance of the filter methods reported on each dataset is roughly at the same level across all five classifiers. On the 20NewsGroups dataset, we observed that TS recorded the highest classification accuracy, slightly above that of our method, but the running time of our method is significantly lower than that of TS. On the NewsCategory and Reuters-21578 datasets, our method recorded the highest classification accuracy with a lower running time. IA generally recorded the lowest running time across all the classifiers but sacrificed their performance. This indicates that IA filters out some informative features, thus reducing classification capability. On the other hand, IG, TS, UA, and VS achieved competitive performance, but their running time is significantly higher. This indicates the presence of noisy and irrelevant features in the final subset, which need to be filtered out to reduce the time complexity. We also observed that the best accuracy results were obtained with the Ridge and SVC classifiers, whereas the results obtained with DT are comparatively poor. Generally, the proposed method achieves acceptable performance on all datasets, which indicates that this kind of filter method can not only reduce the size of the feature set but also ensure that informative features are retained, so that the performance of a classifier is not sacrificed.
Despite the work done to balance the two unbalanced datasets used in this experiment, the datasets are still relatively unbalanced. Therefore, using accuracy alone to evaluate performance could be misleading. In order to further verify the validity of the proposed approach, Figs. 3, 4, and 5 show the F-scores of the chosen classifiers when the proposed and existing filter methods were applied to the three selected datasets. Examining all the figures, we can see that the results are in line with the accuracy results in Tables IV, V, and VI. Fig. 3 shows that the proposed approach recorded the second-highest F-score, after TS, with a relatively small difference. Moreover, in Figs. 4 and 5, the proposed approach attains the highest F-score by a small margin. Therefore, we conclude that the proposed method achieves the best classification performance in terms of F-score in most cases. Although the proposed approach does not give the highest result on every dataset, as with 20NewsGroups, its F-scores are still acceptable.
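To illustrate why accuracy can mislead on unbalanced data while F-score does not, the following minimal sketch computes both metrics on a toy example. It assumes macro-averaging of the per-class F1 values, which is an assumption of this sketch because the averaging scheme is not stated here; the function names and the labels are hypothetical and the data are not taken from the experiments.

```python
def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_for_class(y_true, y_pred, label):
    """F1 of one class: harmonic mean of that class's precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (averaging scheme assumed here)."""
    labels = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) for c in labels) / len(labels)


# Toy unbalanced data (illustrative only): nine "sport" documents and one
# "politics" document, with a classifier that always predicts "sport".
y_true = ["sport"] * 9 + ["politics"]
y_pred = ["sport"] * 10

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"accuracy = {acc:.3f}, macro F1 = {f1:.3f}")
```

In this toy case the classifier never recognizes the minority class, yet its accuracy is 0.9, whereas the macro F1 is below 0.5. This is the kind of gap that motivates reporting F-score alongside accuracy.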

V. CONCLUSION AND FUTURE DIRECTIONS
Filter-based feature selection is one of the dimensionality reduction techniques, and it is an important preprocessing step in any text classification problem. Selecting the most informative features is one of the main problems faced in building a robust classifier because of performance degradation and time complexity. Reducing the dimension of the feature set by removing irrelevant and noisy features while retaining the relevant ones significantly reduces the classifiers' computational complexity. Quite a number of works in the literature address this problem. The common filtering methods select features by considering a single theoretical approach. Recently, hybrid approaches that combine multiple filtering methods based on different theoretical approaches have received more attention. These methods produce discrepant results, which makes merging the resulting feature subsets into a single subset and selecting the significant features a difficult task. In this paper, we propose a novel bi-strategy filtering approach that uses the combined scores of IG and t-test to produce refined feature subsets by setting a new threshold. The method filters out common features with low V-scores from the considered feature subsets without sacrificing the classifier's performance. First, two subsets of highly ranked features based on IG and t-test are produced. This is done by defining the initial threshold K1. Then the method identifies the features with the minimum IG and t-test scores that are present in both subsets and computes their V-scores. The minimum V-score is set as the new threshold K2, and it is used to further filter out insignificant features from the IG and t-test subsets.
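The steps summarized above can be sketched roughly as follows. This is an illustrative reading, not the reference implementation: the V-score formula is not restated in this section, so `v` below is a hypothetical stand-in (the sum of min-max normalized IG and t-test scores), and applying K2 to the union of the two subsets, as well as the fallback used when the subsets share no feature, are assumptions of this sketch. All function names and the toy scores are invented for illustration.

```python
def top_k(scores, k):
    """Return the set of the k highest-scoring features."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def min_max(scores):
    """Rescale scores to [0, 1] so IG and t-test values are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
    return {f: (s - lo) / span for f, s in scores.items()}


def bi_strategy_filter(ig, tt, k1):
    """Sketch of the two-threshold filter; ig and tt map the same features to scores."""
    # Step 1: build the two top-K1 subsets.
    ig_subset, tt_subset = top_k(ig, k1), top_k(tt, k1)
    common = ig_subset & tt_subset
    if not common:
        # Fallback chosen for this sketch only: no common feature, no K2.
        return ig_subset | tt_subset, None

    # Hypothetical V-score stand-in: sum of the normalized scores.
    ig_n, tt_n = min_max(ig), min_max(tt)
    v = {f: ig_n[f] + tt_n[f] for f in ig}

    # Step 2: common features with the minimum IG and minimum t-test score.
    weakest_ig = min(common, key=ig.get)
    weakest_tt = min(common, key=tt.get)

    # Step 3: the smaller of their two V-scores becomes the threshold K2.
    k2 = min(v[weakest_ig], v[weakest_tt])

    # Step 4: keep features from either subset whose V-score reaches K2.
    selected = {f for f in ig_subset | tt_subset if v[f] >= k2}
    return selected, k2


# Toy scores (illustrative only, not from the experiments).
ig = {"goal": 0.9, "match": 0.8, "vote": 0.7, "team": 0.6, "league": 0.55, "the": 0.05}
tt = {"goal": 5.0, "match": 1.0, "vote": 4.0, "team": 0.5, "league": 4.5, "the": 0.2}

selected, k2 = bi_strategy_filter(ig, tt, k1=4)
print(sorted(selected), round(k2, 4))
```

On these toy scores the two top-4 subsets share three features and their union has five; the sketch keeps four of them, dropping one feature that is ranked highly by IG only. The result is larger than the intersection but smaller than the union, which mirrors the intermediate position reported for the proposed method in Fig. 2.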
In order to validate the performance of the proposed method, the study presents a comparison, based on accuracy and F-score, of the filtering approach with benchmark methods, including IG and t-test (TS), and with existing hybrid subset-merging approaches, including Union (UA), Intersection (IA), and V-score (VS), using five classification algorithms. The experiment is conducted using three different text datasets: 20NewsGroups, NewsCategory, and Reuters-21578. Results in Fig. 2 show that our filter method produces a subset that contains more features than that of IA but fewer than those of IG, TS, UA, and VS. This reflects the fact that our method discards irrelevant and noisy features while retaining many more informative features than IA. Further experimental results showed that, with the small feature subset produced, our approach achieved a significant improvement in the accuracy and F-score of the classifiers used at the cost of a minimal running time. Lastly, we conclude that the proposed approach achieves competitive performance: even though it does not always give the highest result, its results remain acceptable.
In future work, the following tasks need to be investigated: (1) developing and incorporating a feature hashing method, as the next step after our method, that considers the correlation between features; and (2) developing a method capable of automatically determining the optimal threshold parameter(s) separating significant from non-significant features without any involvement of a domain expert.