A New Discretization Approach of Bat and K-Means

The Bat algorithm is an optimization technique that mimics the echolocation behavior of bats, and it is a powerful algorithm for finding an optimum set of features. Classification is a data mining task that is useful in knowledge representation. However, high-dimensional data is an issue in classification because it degrades classification accuracy. The literature shows that feature selection and discretization are able to overcome this problem. Therefore, this study aims to show that the Bat algorithm has potential both as a discretization approach and as a feature selection method for improving classification accuracy. In this paper, a new hybrid Bat-K-Means algorithm, referred to as hBA, is proposed to convert continuous data into discrete data, called the optimized discrete dataset. Bat is then used as a feature selection method to select the optimum features from the optimized discrete dataset in order to reduce the dimension of the data. k-Nearest Neighbor is used to evaluate the effectiveness of discretization and feature selection in classification by comparing against the continuous dataset without feature selection, the discrete dataset without feature selection, and the continuous dataset without discretization and feature selection. The experiments were carried out using a number of benchmark datasets from the UCI machine learning repository. The results show that classification accuracy is improved with the Bat-K-Means optimized discretization and Bat optimized feature selection.
Keywords: Classification; discretization; feature selection; optimization algorithm; bat algorithm


I. INTRODUCTION
Bat Algorithm (BA) is a metaheuristic optimization algorithm based on the echolocation characteristics of microbats, in which the pulse emission rate and loudness vary [1]. BA has been employed to solve various optimization problems in many applications. BA can potentially perform better than Particle Swarm Optimization (PSO) [2]. Further, unlike PSO, the parameters of BA can be varied. Moreover, BA is capable of automatically zooming in on the solution space, which eventually allows local exploitation.
In [3], BA was used for classification of medical data. The objective of the research in [4] is to improve classification performance on large data; this objective is achieved by employing BA as a feature selection method to select important features in real-life datasets from the UCI repository. In [5], a hybridization of BA with the Weighted Extreme Learning Machine (WELM), known as WELM-BAT, was proposed to adjust the parameters of WELM. WELM-BAT was tested on real-world medical diagnosis datasets, and the results show that classification performance is improved.
Classification performance on large data is improved using BA [6]. In [7], a hybrid of the Cross-Entropy method with the Binary Bat Algorithm (BBACE) is used for feature selection in Big Data. To evaluate the effectiveness of the proposed method, classifiers such as Support Vector Machine and Naïve Bayes are used for comparison with respect to processing time and accuracy. The results show that BA produces better results than PSO.
Classification aims to assign data correctly to predefined classes. High dimensionality, data redundancy, and large amounts of data are factors that influence classification performance. In data mining, classification is one of the popular problems that receives attention from researchers.
Real-world datasets contain continuous data, discrete data, or a mix of both. However, some classification approaches require discrete data only. Thus, a process that converts continuous data into discrete data is required; this process is known as discretization. The discretization task is to partition the continuous values into a small number of ranges based on break points and then to map each continuous value to the range it falls in, where each range is assigned a discrete value. Discretization is needed before performing the feature selection task.
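The mapping from continuous values to ranges can be illustrated with a minimal sketch. The function name `discretize`, the readings, and the break points below are hypothetical and chosen only for illustration; they are not taken from the datasets used in this paper.

```python
from bisect import bisect_right

def discretize(values, break_points):
    """Map each continuous value to the index of the range it falls in.

    With break points b1 < b2 < ..., the ranges are
    (-inf, b1), [b1, b2), ..., [b_last, +inf).
    """
    cuts = sorted(break_points)
    return [bisect_right(cuts, v) for v in values]

# Hypothetical continuous readings and break points (illustration only).
readings = [48.2, 55.0, 61.7, 79.9, 93.4]
labels = discretize(readings, break_points=[55.0, 80.0])
print(labels)  # [0, 1, 1, 1, 2]
```

Two break points produce three discrete values; how the break points are chosen is what distinguishes one discretization approach from another.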
In classification problems, it is challenging to deal with the irrelevant features that come with very large data [8]. Feature selection aims to select the relevant features and remove the irrelevant ones. Discretization, on the other hand, aims to discretize continuous values. Data represented as discrete values is easier to understand. Consider this example: a person's weight can take many different readings, and it is difficult and ambiguous to interpret whether a given reading is light or heavy. Discretization places the weight reading within a certain interval, so the weight can easily be interpreted as light or heavy. This paper proposes the use of an optimization algorithm in the discretization and feature selection steps. To implement these steps, BA is hybridized with K-Means as a discretization approach. BA is then used as a feature selection method to determine the optimum features.
This paper is organized as follows: related works are presented in Section II. Section III discusses the proposed approach based on K-Means and the Bat algorithm, together with the description of the datasets. Section IV discusses the experimental results, and Section V presents the conclusion.
www.ijacsa.thesai.org

II. RELATED WORKS
This section briefly discusses the basic concept of the Bat algorithm and reviews previous work on classification, discretization, and feature selection.

A. Bat Algorithm
BA is an intelligent optimization algorithm for solving complex engineering problems. BA is motivated by the sound pulses emitted by bats, known as echolocation, which bats use to search for prey. Fig. 1 shows a bat emitting sound towards a prey or an obstacle, referred to as the object. The object then reflects the wave, returning an echo to the bat. From this echo, the bat can differentiate the type of object, prey or obstacle, even in darkness.
In this paper, the rules of the Bat algorithm are employed as follows:
1) Rule 1: Bats use echolocation to sense distance, and they can recognize an object and differentiate its kind: obstacle or prey.
2) Rule 2: Bats fly randomly with varying wavelength $\lambda$ and loudness $A_0$, and $\beta$ is a random vector drawn from a uniform distribution.
Fig. 2 shows the flowchart for the Bat algorithm. The algorithm starts by initializing the loudness, pulse rate, velocity, frequency, number of bats in the population, and the maximum number of iterations. While the maximum number of iterations is not exceeded, a random number is generated to produce a new solution. Producing this new solution involves adjusting the previous frequency and updating the previous velocity as well as the location. Later, if the value of the random number is higher than the loudness and the frequency in the previous location is lower than the frequency in the new location, the new solution is accepted; otherwise, the algorithm continues to rank the bats by their frequency to find the best bat. The process is repeated until the maximum number of iterations is reached, at which point the algorithm outputs the optimum result.
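The loop described above can be sketched in a few lines. This is a minimal sketch that follows the commonly published form of the Bat algorithm [1], with frequency $f_i = f_{min} + (f_{max} - f_{min})\beta$, a velocity update towards the current best solution, and a position update. It is not the exact procedure of Fig. 2. The function name, the parameters `f_min`, `f_max`, `alpha`, `gamma`, the search bounds, and the acceptance test (a random draw below the loudness, and a candidate no worse than the bat's current solution) are assumptions taken from that standard form.

```python
import math
import random

def bat_algorithm(objective, dim, n_bats=20, n_iter=200, f_min=0.0, f_max=2.0,
                  loudness=0.9, pulse_rate=0.5, alpha=0.9, gamma=0.9,
                  lower=-5.0, upper=5.0, seed=0):
    """Minimise `objective` with a basic Bat algorithm (assumed standard form)."""
    rng = random.Random(seed)

    def clip(z):
        return min(max(z, lower), upper)

    # Initialise positions, velocities, loudness A_i and pulse rate r_i.
    x = [[rng.uniform(lower, upper) for _ in range(dim)] for _ in range(n_bats)]
    v = [[0.0] * dim for _ in range(n_bats)]
    fit = [objective(p) for p in x]
    A = [loudness] * n_bats
    r = [0.0] * n_bats  # grows towards `pulse_rate` as solutions are accepted
    best_i = min(range(n_bats), key=fit.__getitem__)
    best, best_fit = x[best_i][:], fit[best_i]

    for t in range(1, n_iter + 1):
        mean_A = sum(A) / n_bats
        for i in range(n_bats):
            beta = rng.random()                      # uniform in [0, 1)
            freq = f_min + (f_max - f_min) * beta    # adjust frequency
            v[i] = [vj + (xj - bj) * freq            # update velocity
                    for vj, xj, bj in zip(v[i], x[i], best)]
            cand = [clip(xj + vj) for xj, vj in zip(x[i], v[i])]
            if rng.random() > r[i]:
                # Local random walk around the current best solution.
                cand = [clip(bj + rng.uniform(-1.0, 1.0) * mean_A) for bj in best]
            cand_fit = objective(cand)
            if cand_fit <= fit[i] and rng.random() < A[i]:
                # Accept: the bat gets quieter and pulses more often.
                x[i], fit[i] = cand, cand_fit
                A[i] *= alpha
                r[i] = pulse_rate * (1.0 - math.exp(-gamma * t))
            if cand_fit < best_fit:
                best, best_fit = cand[:], cand_fit
    return best, best_fit

def sphere(p):
    """Toy objective used only to exercise the sketch."""
    return sum(z * z for z in p)

best, best_fit = bat_algorithm(sphere, dim=2)
print(best, best_fit)
```

In a discretization or feature selection setting, the toy `sphere` objective would be replaced by a fitness function that scores a candidate set of centroids or a candidate feature subset.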

B. Classification
Various algorithms based on entirely different approaches have been designed for supervised data classification, for example, neural networks [9] and text clustering [10]. Cano in 2019 [11] and Shafiq in 2020 [12] presented good reviews of data classification methods, including their performance measures and comparisons.
Ayyagari [13] proposed a new classification algorithm that integrates CART, k-NN, and SVM. The proposed algorithm was used to handle the problems of imbalanced datasets and class overlap, which influence the performance of classification algorithms.
One approach that can be used for supervised data classification is swarm-based optimization. The most influential factors in classification performance are discretization and feature selection, which occur in the data preparation phase. For example, Zhou employed discretization in feature selection to find relevant features [14]. In another study, Ucar employed P-Score in feature selection algorithms to improve classification performance [15].

C. Discretization
Discretization can be defined as transforming continuous values into discrete values. In classification, the discretization process is employed in the data preparation step, before the feature selection process. Data preparation is one of the critical steps in classification problems [16]. The results of data preparation are fed directly into the model, from which the final results are obtained. This means that good data from the preparation step can increase classification accuracy.
A discretization approach was proposed to handle a mix of categorical and numeric safety data using the fuzzy c-means algorithm to discretize values [17]. The research in [18] employed fuzzy discretization with a Random Forest (RF) classifier to enhance labeling accuracy. In [19], a Boltzmann machine is used in the discretization process.
K-Means, described by J. MacQueen in 1967, is an iterative algorithm [20]. K-Means is used as the discretization algorithm [21]. The standard K-Means algorithm is shown in Fig. 3. In Fig. 3, the k centroids are determined randomly, where k is the number of classes. The distance between each data point and each of the k centroids is then calculated. Each data point is grouped into a pre-defined class according to the minimum distance between the data point and the centroids. The process is repeated until the cluster centroid values stop changing.
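The steps of the standard K-Means procedure can be sketched for a single continuous feature as follows. This is a minimal illustration, not the implementation used in this study: the function name `kmeans_discretize`, the sample values, and the final renumbering of clusters by centroid (so that the discrete codes are ordered) are assumptions made for the example.

```python
import random

def kmeans_discretize(values, k, n_iter=100, seed=0):
    """Discretize one continuous feature with 1-D K-Means.

    Returns (codes, centroids): one discrete code per value, and the
    sorted centroids.
    """
    rng = random.Random(seed)
    centroids = rng.sample(values, k)  # k centroids chosen at random

    def assign(cents):
        # Each value joins the cluster with the nearest centroid.
        return [min(range(k), key=lambda c: abs(v - cents[c])) for v in values]

    for _ in range(n_iter):
        labels = assign(centroids)
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:  # centroids stopped changing
            break
        centroids = updated

    # Renumber clusters by centroid so that the discrete codes are ordered.
    order = sorted(range(k), key=lambda c: centroids[c])
    rank = {c: i for i, c in enumerate(order)}
    codes = [rank[lab] for lab in assign(centroids)]
    return codes, sorted(centroids)

# Hypothetical feature values forming two well-separated groups.
values = [1.0, 1.2, 0.9, 9.9, 10.2, 10.0]
codes, centroids = kmeans_discretize(values, k=2)
print(codes, centroids)
```

Because the initial centroids are random, results on less clearly separated data can depend on the seed, which is the sensitivity that an optimizer-guided choice of centroids is meant to address.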
This study employs K-Means as a pre-treatment method, in which data is discretized by K-Means before the classification process. In [22], K-Means was employed to discretize mixed data, in which the datasets contain both numerical and categorical attributes. In [23], K-Means is combined with a discretization technique and Naïve Bayes to address anomaly-based intrusion detection in a network intrusion detection system.

D. Feature Selection
In classification problems, feature selection is an effective method for dimensionality reduction and helps to improve classification performance. Rachman and Rustam [24] used the Fisher's Ratio algorithm to select features in cancer data. Their experiment reveals that feature selection gives better classification accuracy.
Cheruku and Edla [25] highlight that the problems in medical data classification are redundancy and high dimensionality. These problems were solved with improvements in data preparation before the classification step. The research also proposed the Branch and Bound algorithm for the feature selection step.
Mithy and KrishnaPriya [26] employed various feature selection methods in the prediction of anemia in pregnant women. The feature selection methods employed a rough-set-based fuzzy threshold in prediction and gave promising results.
Reem Kadry and Osama Ismael [27] improved K-Nearest Neighbour performance in a multidimensional data classification problem. In their work, the Particle Swarm Optimization (PSO) algorithm is employed to find the optimum features in the Soybean datasets.

III. METHODOLOGIES
The experiments are carried out on nine benchmark datasets from the UCI machine learning repository, which are used as input data; the details of the datasets are shown in Table I. The nine datasets are referred to as the contS dataset and serve as the initial datasets, where contS = {DS1, DS2, DS3, DS4, DS5, DS6, DS7, DS8, DS9}. In this research, BA is hybridized with K-Means discretization, referred to as hBA, and hBA is enhanced by employing BA as feature selection, referred to as hfBA. Fig. 4 shows the framework of this research, which involves two main processes: 1) the discretization process and 2) the feature selection process. The implementation of hBA and hfBA is explained in the following sections.

A. hBA Approach
As shown in Fig. 4, the approach starts by converting the continuous datasets (contS) into discrete data using the Bat-K-Means algorithm (hBA). In standard K-Means (Fig. 3), the k centroids are defined randomly, whereas in hBA the k centroids are set according to the optimum result from BA (Fig. 2). This generates a new dataset, the optimized discrete dataset, referred to as the hbaS dataset. This approach does not use feature selection, so the algorithm continues to the next process, which is classification using k-Nearest Neighbours. Finally, classification accuracy is measured.
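The final evaluation step can be illustrated with a small k-Nearest Neighbours sketch. The function names, the value of k, the tiny discretized dataset, and the train/test split below are hypothetical; the paper does not specify these details here, so this only shows how accuracy could be measured on discrete codes.

```python
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote of its k nearest rows."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, query)), i)
        for i, row in enumerate(train_X)
    )
    votes = Counter(train_y[i] for _, i in dists[:k])
    return votes.most_common(1)[0][0]

def accuracy(train_X, train_y, test_X, test_y, k=3):
    """Fraction of test rows whose predicted label matches the true label."""
    hits = sum(knn_predict(train_X, train_y, q, k) == y
               for q, y in zip(test_X, test_y))
    return hits / len(test_y)

# Hypothetical discretized rows (codes 0..2) with two classes.
train_X = [[0, 0], [0, 1], [1, 0], [2, 2], [2, 1], [1, 2]]
train_y = ["a", "a", "a", "b", "b", "b"]
test_X = [[0, 0], [2, 2]]
test_y = ["a", "b"]
print(accuracy(train_X, train_y, test_X, test_y))  # 1.0
```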

B. hfBA Approach
As shown in Fig. 4, the hfBA approach starts with the feature selection process, in which the BA algorithm shown in Fig. 2 is applied to select the optimum features after discretization with the hBA approach. A new dataset, the sub discrete dataset, referred to as the hfbaS dataset, is then generated.
In the hfBA approach, each bat is represented by a vector of length $n$, where $n$ is the total number of features in a dataset. The instances are given by $DS = \{x_1, x_2, \ldots, x_m\}$, where $m$ is the number of instances. Each instance is $x_i = \{x_{i1}, x_{i2}, \ldots, x_{in}\}$, where $n$ is the number of attributes of the $i$-th instance in the DS dataset.
For example, assume that the DS dataset has 10 attributes and 25 instances, and that BA runs for 50 generations or repetitions. After 50 repetitions, the $i$-th bat, here instance number 10, is considered the best in the population. The optimum value of the 10th instance then gives the optimum value for each attribute. Based on BA, the minimum distance is taken as the optimum value, so in this study only the attributes whose values meet this criterion are selected. For example, attributes 1, 3, 5, 9, 10, 11, 12, and 14 are found to be irrelevant features and are removed from the dataset.
This research also applies K-Means discretization to the contS datasets, and the results are referred to as the discrete datasets (dkS). In addition, BA is applied to select the optimum features from dkS and from contS. The approach follows the algorithm shown in Fig. 4. The approaches and the generated datasets are summarized in Table II.
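Removing the attributes flagged as irrelevant amounts to dropping columns from every instance. The sketch below uses the attribute positions listed in the example above; the helper name `drop_features` and the 14-attribute placeholder row, whose j-th attribute simply holds the value j, are hypothetical.

```python
def drop_features(rows, irrelevant):
    """Remove the 1-based attribute positions in `irrelevant` from every row."""
    drop = set(irrelevant)
    return [[v for j, v in enumerate(row, start=1) if j not in drop]
            for row in rows]

irrelevant = [1, 3, 5, 9, 10, 11, 12, 14]
rows = [list(range(1, 15))]  # placeholder instance: attribute j holds value j
reduced = drop_features(rows, irrelevant)
print(reduced)  # [[2, 4, 6, 7, 8, 13]]
```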

IV. EXPERIMENTS AND RESULTS
In this section, experiments are set up to show that the Bat algorithm has potential as a discretization approach and as a feature selection method for improving classification accuracy. Four experiments were conducted to examine BA in the discretization process, and one experiment to examine BA in both the discretization and feature selection processes.
The hbaS dataset is compared with the contS, cfbaS, dkS, dfbaS, and hfbaS datasets. Table III shows the comparison results, analyzed with k-Nearest Neighbours, for all datasets, including the initial datasets and the new datasets. For easier understanding, the results are also presented graphically according to the following experiments, each of which compares two datasets only.
1) 1st experiment: compares the hbaS dataset with the contS dataset; the result is shown in Fig. 5.
2) 2nd experiment: compares the hbaS dataset with the cfbaS dataset; the result is shown in Fig. 6.
3) 3rd experiment: compares the hbaS dataset with the dkS dataset; the result is shown in Fig. 7.
4) 4th experiment: compares the hbaS dataset with the dfbaS dataset; the result is shown in Fig. 8.
5) 5th experiment: compares the hbaS dataset with the hfbaS dataset; the result is shown in Fig. 9.
Fig. 5 presents the comparison between the hbaS dataset and the contS dataset. The purpose of this experiment is to identify which datasets give better classification accuracy: the datasets with optimized discretization or the continuous (original) datasets. The results show that hBA improves the classification accuracy of three datasets: DS1, DS5, and DS7. In other words, applying the hBA approach to the contS dataset improves classification performance on 33% of the datasets.
Fig. 6 presents the comparison between the hbaS dataset and the cfbaS dataset. The purpose of this experiment is to identify which datasets give better classification accuracy: the datasets with optimized discretization or the continuous datasets with optimized feature selection. The results show that hBA achieves better performance on three datasets: DS3, DS5, and DS7. The hBA approach therefore performs better on only 33% of the datasets.
In the third experiment, the comparison between the hBA approach and the dK approach is shown in Fig. 7. The purpose of this experiment is to identify which datasets give better classification accuracy: the datasets discretized with the optimized discretization or the datasets discretized with the original K-Means. Both approaches discretize the contS dataset. The results show that hBA achieves better performance on three datasets: DS5, DS7, and DS9. This experiment shows that discretization with dK is better than discretization with hBA, since the hBA approach performs better on only 33% of the datasets.
In the fourth experiment, the comparison between the hBA approach and the dfBA approach is shown in Fig. 8. The purpose of this experiment is to identify which datasets give better classification accuracy: the datasets with optimized discretization or the datasets with K-Means discretization and optimized feature selection. The results show that hBA achieves better performance on seven datasets: DS1, DS3, DS4, DS5, DS6, DS7, and DS9. The hBA approach therefore performs better on 78% of the datasets.
Fig. 9 presents the comparison between the hBA approach and hfBA. The purpose of this experiment is to identify which is better at improving classification accuracy: optimized discretization alone or optimized discretization with optimized feature selection. The results show that hfBA is able to improve classification accuracy on all datasets.
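The percentages quoted for the first four experiments are the share of the nine datasets on which hBA gave the higher accuracy. The short sketch below reproduces that arithmetic from the dataset lists stated above; the helper name `win_share` is hypothetical.

```python
def win_share(winners, n_datasets=9):
    """Percentage of datasets (rounded) on which hBA performed better."""
    return round(100 * len(winners) / n_datasets)

# Datasets on which hBA performed better, as reported for each experiment.
hba_wins = {
    "vs contS (Fig. 5)": ["DS1", "DS5", "DS7"],
    "vs cfbaS (Fig. 6)": ["DS3", "DS5", "DS7"],
    "vs dkS (Fig. 7)": ["DS5", "DS7", "DS9"],
    "vs dfbaS (Fig. 8)": ["DS1", "DS3", "DS4", "DS5", "DS6", "DS7", "DS9"],
}
for name, winners in hba_wins.items():
    print(name, f"{win_share(winners)}%")
```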

V. CONCLUSION
The Bat algorithm, which mimics the behavior of bats, is a powerful algorithm for finding an optimum set of features. Meanwhile, high-dimensional data in classification degrades classification accuracy. According to the literature, feature selection and discretization are needed to solve this problem. For that reason, this study shows that the Bat algorithm has potential as a discretization approach and as a feature selection method for improving classification accuracy.
In this paper, a new hybrid Bat-K-Means algorithm, referred to as the hBA approach, is proposed to convert continuous data into discrete data, called the optimized discrete dataset (hbaS). The Bat algorithm is then used to select the optimum features in the hbaS dataset for data dimension reduction, creating a new dataset referred to as the sub optimized discrete dataset (hfbaS). The experiments were conducted to show the potential of BA as a discretization approach and as a feature selection approach for improving classification accuracy. k-Nearest Neighbor is used to evaluate the effectiveness of BA as discretization and feature selection in classification by comparing against the dataset without feature selection, the discrete dataset without feature selection, and the dataset without discretization and feature selection. The experiments were carried out using a number of benchmark datasets from the UCI machine learning repository. The results show that classification accuracy is improved with the Bat-K-Means optimized discretization and Bat optimized feature selection.
The experimental results are presented in Table III and Fig. 5 to Fig. 9. From the observations, conventional K-Means is not strong enough to improve classification performance (Fig. 7). Optimized discretization alone, applied to a continuous dataset, is not able to improve classification accuracy (Fig. 8). Optimized feature selection is required to select the optimum features in the discrete dataset (Fig. 9). The combination of optimized discretization and optimized feature selection achieves better classification performance than the continuous (original) data on all datasets.
In conclusion, BA is a powerful optimization technique with the potential to be used in discretization and feature selection. In the future, the research will be extended to discretize mixed-value datasets instead of continuous values only.