Modified K-nearest Neighbor Algorithm with Variant K Values

In Machine Learning, K-nearest Neighbor (KNN) is a renowned supervised learning method. Traditional KNN has the inconvenient requirement of specifying the value of 'K' in advance for all test samples. Earlier solutions for predicting 'K' focus mainly on finding a single optimal K value for all samples, and the time complexity of obtaining it is too high. In this paper, a Modified K-Nearest Neighbor algorithm with Variant K (MVKNN) is proposed. The algorithm is divided into a training phase and a testing phase to find a K value for every test sample. To get the optimal K value, the data is trained for various K values using a Min-Heap data structure of size 2*K. The K values are decided based on the percentage of training data contributed by every class. Indian Classical Music is considered as a case study, to be classified into different Ragas, with Pitch Class Distribution features as input to the proposed algorithm. It is observed that the use of the Min-Heap reduces the space complexity, while the Accuracy and F1-score of the proposed method are higher than those of the traditional KNN algorithm as well as the Support Vector Machine and Decision Tree classifiers, on both a self-generated dataset and the CompMusic dataset.

Keywords—Classification; K-nearest Neighbor (KNN) classification algorithm; Indian Classical Music; Performance measures; Heap data structure


I. INTRODUCTION
The K-nearest neighbors algorithm is a simple and effective classification algorithm. Its most important advantage is that its classification results can be easily interpreted. Despite these advantages, it has shortcomings such as high computational cost, large memory requirement, equal-weighted features and, lastly, the need to decide an appropriate value of the input parameter K [1]. Many variants of the KNN algorithm have been proposed to overcome these shortcomings. In [2,3] the authors proposed weighted KNN. The method in [2] first learns weights for the different attributes, and each attribute then affects the classification process only in proportion to its assigned weight. In [3] the inverse of the Euclidean distance is used as the weight for load forecasting. In [4] various distance functions are implemented with KNN on a medical dataset with different types of attributes. In [5] the authors used various pitch distributions as the feature set for KNN with different distance functions in Raga identification.
In [6] the authors pointed out that traditional KNN has limitations in solving problems such as imbalanced, noisy, and sparse datasets. They proposed Hybrid KNN (HBKNN) to address these problems.
In other KNN variations, researchers combined KNN with the K-means clustering algorithm to reduce the computational complexity. In [7] the authors applied this approach to improve accuracy in air quality assessment. The approach also worked well for Big Data in [8].
The basic assumption of standard KNN is a fixed K value for all data points to be classified. However, many datasets have uneven distributions of data points, and even experts are often unable to predict the optimal K value, so many researchers have proposed methods for predicting it. In [9] the authors proposed a local mean representation-based k-nearest neighbor classifier (LMRKNN). In this method, representation-based distances calculated from the categorical k-local mean vectors replace the simple majority vote in making the classification decision. LMRKNN outperformed traditional KNN on many real datasets downloaded from the University of California, Irvine (UCI) and Knowledge Extraction based on Evolutionary Learning (KEEL) repositories. In [10] the authors proposed the Adaptive K-nearest neighbor (Ada-KNN) algorithm, which uses the density and distribution of the neighborhood of a test point and learns a suitable K for it with the help of artificial neural networks. A related strategy for correctly classifying the test point was employed by Wettschereck and Dietterich in [11], in which the value of K is determined for different portions of the input space by applying cross-validation in its local neighborhood. Ada-KNN2 is proposed as an extension of the Ada-KNN algorithm in which the neural network is replaced with a heuristic learning method based on a local density indicator of the test point and information about its neighboring training points.
A large value of K increases the computational cost and time for large datasets. To solve this problem, a variant value of K is proposed in [12] so that an early break of the algorithm becomes possible, which ultimately saves computational time.
In [13] an Adaptive KNN algorithm is developed that chooses an optimal k for each item by maximizing its expected accuracy computed on similar points. The evaluation is done on three different geo-spatial datasets.
In [14] the authors employed a correlation matrix to reconstruct the test data and assign different K values to different test data points. The proposed algorithm achieved high accuracy and efficiency in classification, regression, and missing-data imputation applications. Predicting the K value with the cross-validation method is usually time-consuming. In [15] the authors introduced a training phase into the KNN classification algorithm and proposed a k*Tree method to learn different optimal k values for different test samples. The proposed k*Tree method reduced the running cost of the test phase; its efficient working was demonstrated on 20 different real datasets.
In ICM, a lot of work has been done on Raga recognition using the KNN algorithm [5,16,17,18,19]. The researchers focused either on features or on comparisons using different classifiers, with the classifiers used in their traditional form. In Data Mining, as the application changes, careful thinking about the parameters used in the classifiers is required, and the impact of these parameters on performance also needs to be observed.

II. PROPOSED MVKNN ALGORITHM
The traditional KNN does not have a training phase. It calculates the distance between every sample in the test data and every sample in the training data, identifies the 'K' nearest neighbors of every test sample based on distance, and assigns to the test sample the class having the maximum count among those 'K' nearest neighbors. In the proposed method, the algorithm is divided into two phases, training and testing. In step 4.1 the samples per class present in the training data are counted. Step 5.1 calculates a K value for each class label considering its percentage contribution to the training data. In steps 6 and 6.1 the Euclidean distance is calculated between every pair of samples in the training data and stored in a Min-Heap of size 2M_K, where M_K is the maximum of the per-class K values.
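As a concrete illustration of steps 4.1 and 5.1, a minimal Python sketch follows. The percentage-to-K mapping is our assumption, chosen to be consistent with the K values of 10-14 reported later for 8 classes; `per_class_k` is a hypothetical helper, not the authors' code.

```python
import numpy as np

def per_class_k(y_train):
    # Steps 4.1 and 5.1: count samples per class, then take each class's
    # (rounded) percentage share of the training data as its K value.
    # This mapping is an assumption consistent with the reported K = 10-14.
    classes, counts = np.unique(y_train, return_counts=True)
    shares = 100.0 * counts / counts.sum()
    return {c: max(1, int(round(s))) for c, s in zip(classes, shares)}
```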
Once the Min-Heap is ready, in step 7.1 the class labels of the nearest neighbors are identified for every training sample from the first K entries in the Min-Heap. The class label with the maximum count is assigned to the sample in step 7.2. Step 7.3 counts the correctly classified samples for every class and stores them in the TP array, where TP gives the True Positive count for each class. Steps 7.1 to 7.3 are executed for every distinct value of K calculated in step 5.1, with K varying from the minimum to the maximum of the per-class values. After calculating TP for all the different 'K' values, the optimal 'K' value for every class is obtained by finding the maximum true positive count of that class. This completes the training phase. In the best case the same 'K' is obtained for all classes; in the worst case, every class gets a different optimal 'K'.
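The training phase can be sketched as follows. This is a minimal sketch, assuming a leave-one-out style pass over the training samples for steps 7.1-7.3; `train_optimal_k` is a hypothetical helper and not the authors' exact implementation.

```python
import heapq
from collections import Counter
import numpy as np

def train_optimal_k(X, y, k_per_class):
    # Steps 6 and 6.1: for every training sample keep its 2*M_K nearest
    # neighbours as (distance, label) pairs, instead of sorting all T.
    m_k = max(k_per_class.values())
    neighbours = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                          # exclude the sample itself
        neighbours.append(heapq.nsmallest(2 * m_k, zip(d, y)))
    # Steps 7.1-7.3: for every distinct K, classify each training sample
    # by majority vote and count true positives (TP) per class.
    best_k, best_tp = {}, {}
    for k in sorted(set(k_per_class.values())):
        tp = Counter()
        for label, pairs in zip(y, neighbours):
            votes = Counter(l for _, l in pairs[:k])
            if votes.most_common(1)[0][0] == label:
                tp[label] += 1                 # correctly classified
        for c in set(y):                       # keep the K with maximum TP
            if c not in best_k or tp[c] > best_tp[c]:
                best_k[c], best_tp[c] = k, tp[c]
    return best_k
```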
In the testing phase, the distance between every test sample and every training sample is calculated, and a Min-Heap is constructed for the maximum optimal 'K' value obtained from the training phase. The nearest neighbors are identified from the first K entries in the Min-Heap, and the class label with the maximum count among the neighbors is assigned to the test sample.
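A corresponding testing-phase sketch is given below. The paper sizes the heap for the maximum optimal K; taking a simple majority vote over those neighbors is our simplification of how the per-class K values are applied.

```python
import heapq
from collections import Counter
import numpy as np

def predict(X_train, y_train, X_test, best_k):
    # One Min-Heap per test sample, sized for the largest optimal K
    # found in training, followed by a majority vote over the labels.
    k_max = max(best_k.values())
    predictions = []
    for x in X_test:
        d = np.linalg.norm(X_train - x, axis=1)
        nearest = heapq.nsmallest(k_max, zip(d, y_train))
        votes = Counter(label for _, label in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return predictions
```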
The computational complexity of KNN is one of its limitations. Traditional KNN has no training phase. The time complexity of traditional KNN is O(D*T) + O(D*T log2 T) + O(D*K). The complexity of calculating the distance between every testing sample and the training samples is O(D*T). After calculating the distances, a sorting algorithm with average-case complexity N log2 N is required to sort each distance array, so sorting for the D test samples costs O(D*T log2 T). Getting the 'K' nearest neighbors from the sorted data is O(K), which is O(D*K) over the D test samples. Even if, instead of sorting, a Heap data structure is used to get the 'K' nearest neighbors, the complexity reduces only to O(D*T log2 T) + O(D*K log2 T).
In MVKNN, training and testing phases are introduced. The complexity of the training phase is O(T*(T-1)/2) + O(T*T log2 K) + O(K log2 K). In the worst case, the number of distinct K values equals the number of distinct percentage shares of the classes present in the dataset, and in the best case a single K value is shared by all classes. The complexity of calculating the distance between every pair of training samples is O(T*(T-1)/2). To find the K nearest neighbors, 'T' Min-Heaps of size 2K are first created, so the complexity of creating the T Min-Heaps, each over T elements, is O(T*T log2 K). To get the K nearest neighbors, the Delete_min operation is performed K times, with complexity O(K log2 K).
The testing phase complexity is O(D*T) + O(D*T log2 K) + O(K log2 K). O(D*T) is the complexity of calculating the distance between every test sample and every training sample. O(D*T log2 K) is the complexity of creating a Min-Heap of size K over the T distance values; a Heap is generated for every test sample.
The computational complexity of traditional KNN is higher than that of the MVKNN testing phase. If the complexity of both the training and testing phases of MVKNN is considered, it is higher than that of traditional KNN; however, the training of a classifier is done only once and need not be repeated every time testing is executed. Under this assumption, the computational complexity of the MVKNN testing phase is lower than that of traditional KNN.
These computations can be understood more clearly through a small example.
Let us consider 1000 samples in total, split in a 70:30 ratio into training and testing, so T = 700 and D = 300; the number of classes present in the dataset is 8.
In traditional KNN, the distance calculations alone amount to D*T = 300*700 = 210,000. This case study shows that the computation for the training phase in the worst case is nearly one and a half times the computation of traditional KNN, while the testing computations are almost half of those of traditional KNN. So this work may conclude that MVKNN is computationally more efficient than traditional KNN, provided training is performed only occasionally.
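Plugging the complexity terms above into this example gives a rough back-of-the-envelope check; K = 14 is an assumption taken from the largest optimal K reported in Table II, and the exact training-to-traditional ratio depends on how many distinct K values the voting step is repeated for.

```python
from math import log2

T, D, K = 700, 300, 14          # training samples, test samples, largest K
# Traditional KNN: distance calculations plus sorting D distance arrays.
traditional = D * T + D * T * log2(T)                        # ~2.19e6 ops
# MVKNN training: pairwise distances, T heaps of size 2K, K delete-mins.
training = T * (T - 1) / 2 + T * T * log2(K) + K * log2(K)   # ~2.11e6 ops
# MVKNN testing: distances plus a size-K heap per test sample.
testing = D * T + D * T * log2(K) + K * log2(K)              # ~1.01e6 ops
print(f"testing/traditional = {testing / traditional:.2f}")  # ~0.46, near half
```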
The space complexity is also reduced. In traditional KNN, O(D*T) memory is required to store the distances in sorted-array or Heap form, whereas in MVKNN the space complexity is O(T*log2 K) for the training phase and O(T*log2 K) for the testing phase.

III. EXPERIMENTAL RESULTS
The proposed algorithm is presented as an extension of the traditional KNN algorithm. The performance of both algorithms is compared on our dataset and on the CompMusic dataset.
Our dataset contains 1450 samples of 8 different Ragas sung by different singers. The samples are stored in .wav format with a sampling frequency of 44100 Hz and 16 bps. The frame size is 20 ms with 25% overlap.
The CompMusic dataset [16,17] includes full-length audio recordings with Raga labels. It is a collection of vocal as well as instrumental performances by several artists; the clips were extracted from live performances and CD recordings of 13 artists. A total of 129 tunes for 8 Ragas is considered. The dataset is downloaded as per the instructions given in [20]. Each tune averages 5-6 minutes in duration. The tunes are converted to mono-channel, 44100 Hz sampling rate, 16-bit PCM.
The pitch values are calculated as described in [21]. They are divided into 36 bins to construct the Pitch Class Distribution (PCD) of every sample. Fig. 1 shows the PCD for one sample of Raag Asavari; the PCD gives the frequency count of every bin. This sample is sung in the second octave, so the notes fall between bin numbers 13 and 25.
The PCD is calculated for all the samples and assembled into a feature vector that is given as input to the traditional KNN and MVKNN algorithms.
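A minimal sketch of such a 36-bin PCD feature is shown below. Interpreting the 36 bins as 3 octaves of 12 semitones relative to the tonic is an assumption consistent with Fig. 1; the actual pitch extraction follows [21], and `pitch_class_distribution` is a hypothetical helper.

```python
import numpy as np

def pitch_class_distribution(pitch_hz, tonic_hz, n_bins=36):
    # 36 bins = 3 octaves x 12 semitones relative to the tonic, so a
    # sample sung in the second octave falls around bins 13-25 (Fig. 1).
    p = np.asarray(pitch_hz, dtype=float)
    p = p[p > 0]                                   # drop unvoiced frames
    semitones = 12.0 * np.log2(p / tonic_hz)       # distance from tonic
    bins = np.clip(np.round(semitones).astype(int), 0, n_bins - 1)
    hist = np.bincount(bins, minlength=n_bins)
    return hist / hist.sum()                       # frequency count per bin
```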
The experimentation with traditional KNN is done for K values varying from 1 to sqrt(T). The elbow method is applied, and it is observed that after K=11 the accuracy is nearly constant up to K=20. Decision Tree and SVM classifiers are also implemented with the same datasets. Accuracy and F1-score are calculated as per equations 1 and 2, respectively [22]. The results are documented in Table I.
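Equations 1 and 2 are the standard confusion-matrix definitions, restated here for completeness with TP, TN, FP, and FN denoting true/false positives/negatives:

```latex
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (1)
```

```latex
F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}},
\qquad \text{Precision} = \frac{TP}{TP + FP},
\quad \text{Recall} = \frac{TP}{TP + FN} \quad (2)
```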
The PCD input is given to the MVKNN algorithm. For one instance, the dataset is split into 70% training and 30% testing using train_test_split in Python. Training is performed for the distinct 'K' values 10, 11, 12, 13 and 14 using the Min-Heap. The confusion matrix containing True Positive, True Negative, False Positive and False Negative values is calculated for the given dataset, and the True Positive count of every class is observed for each 'K'. The 'K' with the maximum True Positives is taken as the optimal K value for that class during the testing phase. Table II shows the optimal K values for every class for one instance. Table III compares the Accuracy and F1-score of traditional KNN and MVKNN for the self-generated data and the CompMusic data; both measures improve for both datasets. The experimentation is repeated several times, both with an equal number of samples per class and with imbalanced classes, and the variant 'K' values always improved the results over a single fixed 'K'.
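An end-to-end run of the sketches above could look as follows. Random stand-in features replace the real 36-bin PCD vectors, and the `per_class_k`, `train_optimal_k`, and `predict` helpers are the hypothetical sketches defined earlier, not the authors' code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset: 1450 samples, 36-bin PCD features,
# 8 Raga class labels (the real features are produced as described above).
rng = np.random.default_rng(0)
X = rng.random((1450, 36))
y = np.repeat(np.arange(8), 182)[:1450]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)

k_per_class = per_class_k(y_train)                 # per-class K values
best_k = train_optimal_k(X_train, y_train, k_per_class)
y_pred = predict(X_train, y_train, X_test, best_k)
print("accuracy:", np.mean(np.array(y_pred) == y_test))
```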

IV. CONCLUSION
In this paper, a survey of modified KNN algorithms is presented, and a KNN algorithm with variant K values for every test sample is proposed. A training phase is introduced to identify the optimal K value, and the use of a Min-Heap data structure of size 'K' reduces the space complexity. The algorithm was implemented on Indian Classical Music to classify it by Raga, with the PCD features of two different datasets as the input vectors. Accuracy and F1-score are used to compare performance, and an improvement in both is observed with the proposed MVKNN algorithm in comparison with traditional KNN, Decision Tree and SVM. In Indian Classical Music, repeating patterns play a very important role in Raga identification. In the future, the plan is to apply the proposed algorithm to a high-dimensional feature vector of repeating patterns in a signal to improve the results of Raga identification.