Impact of Different Data Types on Classifier Performance of Random Forest, Naïve Bayes, and K-Nearest Neighbors Algorithms

Asmita Singh; Malka N. Halgamuge; Rajasekaran Lakshmiganthan

doi:10.14569/IJACSA.2017.081201

International Journal of Advanced Computer Science and Applications (IJACSA), Volume 8, Issue 12, 2017.

Abstract: This study evaluates the impact of three data types (text only, numeric only, and text + numeric) on the performance of three classifiers: Random Forest (RF), k-Nearest Neighbor (kNN), and Naïve Bayes (NB). Classification performance is examined in terms of mean accuracy and the effect of varying algorithm parameters across different types of datasets. The analysis uses eight datasets taken from the UCI repository to train models for all three algorithms. The results show that RF and kNN outperform NB. kNN and RF achieve roughly the same mean accuracy, but kNN takes less time to train a model. Changing the number of attributes in a dataset has no effect on RF, whereas NB mean accuracy fluctuates and ends lower, and kNN mean accuracy increases and ends higher. Changing the number of trees has no significant effect on the mean accuracy of RF, but the time to train the model increases greatly. RF and kNN proved to be the best classifiers for every type of dataset in the study. NB can nevertheless outperform the other two algorithms if the feature variables in the problem space are independent. RF takes the highest computational time and NB the lowest. kNN requires finding an optimal value of k for improved performance, at the cost of computation time. This study can be extended by researchers who use parametric methods to analyze the results.

Keywords: Big data; random forest; Naïve Bayes; k-nearest neighbors algorithm
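
The abstract's remark that kNN needs an optimal k, found at the cost of extra computation, can be illustrated with a minimal sketch. This is not the authors' code or experimental setup: the toy data, the leave-one-out scoring, and the function names (`knn_predict`, `loo_accuracy`, `make_toy_data`, `sweep_k`) are all assumptions made here for illustration, using only the Python standard library.

```python
import math
import random
from collections import Counter


def knn_predict(train_X, train_y, query, k):
    """Predict the majority label among the k nearest training points."""
    # Rank training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def loo_accuracy(X, y, k):
    """Leave-one-out accuracy: each point is classified using all the others."""
    correct = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        if knn_predict(rest_X, rest_y, X[i], k) == y[i]:
            correct += 1
    return correct / len(X)


def make_toy_data(n_per_class=40, seed=0):
    """Two overlapping Gaussian blobs (hypothetical data, not from the study)."""
    rng = random.Random(seed)
    X, y = [], []
    for label, center in ((0, (0.0, 0.0)), (1, (2.0, 2.0))):
        for _ in range(n_per_class):
            X.append((rng.gauss(center[0], 1.0), rng.gauss(center[1], 1.0)))
            y.append(label)
    return X, y


def sweep_k(X, y, ks=(1, 3, 5, 7, 9)):
    """Return {k: leave-one-out accuracy} for each candidate k."""
    # Each candidate k triggers a full evaluation pass, which is the
    # computational cost of tuning k.
    return {k: loo_accuracy(X, y, k) for k in ks}


if __name__ == "__main__":
    X, y = make_toy_data()
    for k, acc in sweep_k(X, y).items():
        print(f"k={k}: accuracy={acc:.3f}")
```

Odd values of k are used so that a two-class vote cannot tie. In practice a library implementation with cross-validation would likely replace this brute-force loop.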

Asmita Singh, Malka N. Halgamuge and Rajasekaran Lakshmiganthan, “Impact of Different Data Types on Classifier Performance of Random Forest, Naïve Bayes, and K-Nearest Neighbors Algorithms,” International Journal of Advanced Computer Science and Applications (IJACSA), 8(12), 2017. http://dx.doi.org/10.14569/IJACSA.2017.081201

@article{Singh2017,
title = {Impact of Different Data Types on Classifier Performance of Random Forest, Naïve Bayes, and K-Nearest Neighbors Algorithms},
journal = {International Journal of Advanced Computer Science and Applications},
doi = {10.14569/IJACSA.2017.081201},
url = {http://dx.doi.org/10.14569/IJACSA.2017.081201},
year = {2017},
publisher = {The Science and Information Organization},
volume = {8},
number = {12},
author = {Asmita Singh and Malka N. Halgamuge and Rajasekaran Lakshmiganthan}
}

Copyright Statement: This is an open access article licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, even commercially as long as the original work is properly cited.
