Automated Detection of Malevolent Domains in Cyberspace Using Natural Language Processing and Machine Learning

Saleem Raja Abdul Samad; Pradeepa Ganesan; Amna Salim Al-Kaabi; Justin Rajasekaran; Singaravelan M; Peerbasha Shebbeer Basha

doi:10.14569/IJACSA.2024.0151036

International Journal of Advanced Computer Science and Applications (IJACSA), Volume 15, Issue 10, 2024.

Abstract: Cyberattacks are intentional attacks on computer systems, networks, and devices. Malware, phishing, drive-by downloads, and injection are common cyberattacks that can harm individuals, businesses, and organizations. Most of these attacks trick internet users through malicious links or webpages, which can be used to distribute malware, steal personal information, conduct phishing attacks, or perform other malicious activities. Identifying such websites manually is tedious for internet users, so locating them in cyberspace requires an automated detection tool. Machine learning (ML) techniques are currently used for this purpose. Most recent studies derive a limited number of features from benign and malicious webpages and apply ML algorithms to detect fraudulent webpages. However, such limited feature sets may not exploit the full potential of the dataset. This study addresses the issue by identifying malicious websites using both URL features and webpage content features. To maximize detection accuracy, n-grams and vectorization methods from natural language processing are adopted with a minimal feature set. To exploit the full potential of the dataset, the proposed approach derives 22 common linguistic features from the URL and generates n-grams from the domain name of the URL. The textual content of the webpages is also used. The research employs seven machine learning algorithms with three vectorization methods. The results show that the proposed method outperforms the results of previous studies.

Keywords: Machine learning; N-gram; linguistic features; natural language processing (NLP); malicious webpage
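
The feature derivation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not list the 22 linguistic features, the n-gram size, or the vectorization methods, so the function names (`domain_of`, `char_ngrams`, `url_features`, `ngram_counts`), the five example features, and the default `n = 3` are all assumptions chosen for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Return the lowercase host of a URL, without port or a leading 'www.'."""
    parsed = urlparse(url)
    if not parsed.netloc:
        # No scheme given (e.g. "example.com/login"): retry as a network path.
        parsed = urlparse("//" + url)
    host = parsed.hostname or ""
    return host[4:] if host.startswith("www.") else host


def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Return all overlapping character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def url_features(url: str) -> dict[str, int]:
    """A few illustrative lexical features (hypothetical, not the paper's 22)."""
    host = domain_of(url)
    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_hyphens": url.count("-"),
    }


def ngram_counts(url: str, n: int = 3) -> Counter:
    """Count character n-grams of the URL's domain name (a bag-of-n-grams)."""
    return Counter(char_ngrams(domain_of(url), n))
```

Under these assumptions, the n-gram counts and the lexical features would be combined into one feature vector per URL and passed to a classifier. A library vectorizer such as scikit-learn's `CountVectorizer(analyzer="char", ngram_range=(3, 3))` or `TfidfVectorizer` could replace the hand-written counting; whether the study used these particular tools is not stated in the abstract.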

Saleem Raja Abdul Samad, Pradeepa Ganesan, Amna Salim Al-Kaabi, Justin Rajasekaran, Singaravelan M and Peerbasha Shebbeer Basha, “Automated Detection of Malevolent Domains in Cyberspace Using Natural Language Processing and Machine Learning,” International Journal of Advanced Computer Science and Applications (IJACSA), 15(10), 2024. http://dx.doi.org/10.14569/IJACSA.2024.0151036

@article{Samad2024,
title = {Automated Detection of Malevolent Domains in Cyberspace Using Natural Language Processing and Machine Learning},
journal = {International Journal of Advanced Computer Science and Applications},
doi = {10.14569/IJACSA.2024.0151036},
url = {http://dx.doi.org/10.14569/IJACSA.2024.0151036},
year = {2024},
publisher = {The Science and Information Organization},
volume = {15},
number = {10},
author = {Saleem Raja Abdul Samad and Pradeepa Ganesan and Amna Salim Al-Kaabi and Justin Rajasekaran and Singaravelan M and Peerbasha Shebbeer Basha}
}

Copyright Statement: This is an open access article licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, even commercially as long as the original work is properly cited.
