Processing Sampled Big Data

Waleed Albattah; Rehan Ullah Khan

doi:10.14569/IJACSA.2018.090846

International Journal of Advanced Computer Science and Applications (IJACSA), Volume 9, Issue 8, 2018.

Abstract: Big data processing requires an extremely powerful and large computing setup. This strains processing infrastructure and also denies many researchers the freedom to analyze large datasets. This paper therefore analyzes the processing of large amounts of data using machine learning models built on smaller samples of the data. The work analyzes more than 40 GB of data, testing different strategies for reducing the amount of processed data without compromising detection and model learning. Many alternatives are analyzed, and it is observed that a 50% reduction does not drastically harm model performance: on average, performance drops by only 3.6% for SVM and only 1.8% for Random Forest when only 50% of the data is used. A 50% reduction in instances means that in most cases the data will fit in RAM and processing times will be considerably reduced, benefiting execution time and/or resource use. The incremental training and testing experiments show that, in special cases, smaller sub-sampled data can be used for model generation in machine learning problems. This is useful when hardware is limited or when one has to select among many available machine learning algorithms.
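
The comparison described in the abstract (train on a reduced share of the instances, then check how much performance is lost) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code or data: it uses small synthetic two-class data, a nearest-centroid classifier as a stand-in for SVM and Random Forest so that only the Python standard library is needed, and stratified sampling, which the abstract does not specify. All function names are hypothetical.

```python
import random
from collections import defaultdict


def subsample(rows, labels, fraction, seed=0):
    """Keep roughly `fraction` of the instances of each class.

    Stratified sampling is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    keep = []
    for idxs in by_class.values():
        k = max(1, round(len(idxs) * fraction))
        keep.extend(rng.sample(idxs, k))
    keep.sort()
    return [rows[i] for i in keep], [labels[i] for i in keep]


def fit_centroids(rows, labels):
    """Stand-in model: one mean vector per class (not SVM or Random Forest)."""
    sums, counts = {}, defaultdict(int)
    for x, y in zip(rows, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        for j, v in enumerate(x):
            sums[y][j] += v
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(centroids, x):
    """Return the class whose centroid is closest to x (squared distance)."""
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
    )


def accuracy(centroids, rows, labels):
    hits = sum(predict(centroids, x) == y for x, y in zip(rows, labels))
    return hits / len(labels)


def make_synthetic(n_per_class, seed=0):
    """Two Gaussian blobs; purely illustrative data."""
    rng = random.Random(seed)
    rows, labels = [], []
    for label, center in ((0, (0.0, 0.0)), (1, (3.0, 3.0))):
        for _ in range(n_per_class):
            rows.append([rng.gauss(center[0], 1.0), rng.gauss(center[1], 1.0)])
            labels.append(label)
    return rows, labels


train_x, train_y = make_synthetic(500, seed=1)
test_x, test_y = make_synthetic(200, seed=2)

# Train on shrinking shares of the training set, always test on the same data.
results = {}
for fraction in (1.0, 0.5, 0.25):
    sub_x, sub_y = subsample(train_x, train_y, fraction, seed=42)
    model = fit_centroids(sub_x, sub_y)
    results[fraction] = (len(sub_x), accuracy(model, test_x, test_y))
    print(f"fraction={fraction:.2f} instances={results[fraction][0]} "
          f"accuracy={results[fraction][1]:.3f}")
```

Comparing `results[1.0]` with `results[0.5]` gives the kind of performance difference the abstract reports for SVM and Random Forest. The numbers from this toy setup say nothing about the paper's 40 GB dataset.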

Keywords: Deep learning; content analysis; machine learning; support vector machines; random forest

Waleed Albattah and Rehan Ullah Khan, “Processing Sampled Big Data,” International Journal of Advanced Computer Science and Applications (IJACSA), 9(8), 2018. http://dx.doi.org/10.14569/IJACSA.2018.090846

@article{Albattah2018,
title = {Processing Sampled Big Data},
journal = {International Journal of Advanced Computer Science and Applications},
doi = {10.14569/IJACSA.2018.090846},
url = {http://dx.doi.org/10.14569/IJACSA.2018.090846},
year = {2018},
publisher = {The Science and Information Organization},
volume = {9},
number = {8},
author = {Waleed Albattah and Rehan Ullah Khan}
}

Copyright Statement: This is an open access article licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, even commercially as long as the original work is properly cited.
