Phishing Image Spam Classification Research Trends: Survey and Open Issues

Phishing email is an attack that targets people directly in order to circumvent existing traditional security algorithms. Email appears to internet users to be a dependable, appropriate, and solid communication medium. At present, however, email is flooded with spam content, either as plain text or as unwanted text embedded inside images. This study reviews articles on phishing image spam classification published from 2006 to 2020 in terms of spam classification application domains, datasets, feature sets, spam classification methods, and the measurement metrics adopted in the existing studies. More than 50 articles from the Web of Science and Scopus databases were selected. To achieve the study's target, we carried out a broad survey and analysis to identify the domains where spam classification was applied. Furthermore, several public datasets, feature sets, classification methods, and measuring metrics were identified, and the popular ones were pinpointed. The study revealed that Personal Collection, Dredze, and Spam Archive are the most commonly used datasets in image spam classification research. Low-level and image metadata features are the most widely used feature sets. The methods of image spam classification identified in this study are supervised machine learning, unsupervised machine learning, semi-supervised machine learning, content-based learning, and statistical learning. Among these methods, the most commonly utilized is the Support Vector Machine (SVM), which falls under supervised machine learning. This is followed by Naïve Bayes and K-Nearest Neighbor. The metrics commonly adopted for the performance evaluation of the existing image spam classifiers are also identified and briefly discussed. We compare the performance of the state-of-the-art image spam models. Lastly, we point out promising directions for future research.
Keywords: Phishing; spam; image spam classification; machine learning; deep learning


I. INTRODUCTION
Phishing is a social engineering attack that manipulates people into giving their confidential information to fraudsters, called phishers. It is a criminal way of stealing internet users' private information using deceptive emails and counterfeit websites [1]. Phishing is also defined by [2] as a criminal mechanism that employs both social engineering and technical subterfuge to steal consumers' personal identity data and financial account credentials. The advent of the Internet and the increasing number of its users have made email an important medium of communication. Recently, the use of email has expanded, and this has led to problems caused by phishing emails and spam. A typical email user receives around 40-50 emails per day [3].
According to [4], the total number of phish detected in 1Q 2018 was 263,538. This was up more than 45% from the 180,577 observed in 4Q 2017, and it was also higher than the 190,942 recorded in 3Q 2017. Likewise, the total number of phish detected in 2Q 2018 was 233,040, compared with 263,538 in 1Q 2018. These totals exceed the 180,577 recorded in 4Q 2017 and the 190,942 observed in 3Q 2017. The phish detected in 2Q and 3Q of 2019 were 112,163 and 122,359, respectively. Although this is a significant decrease in phishing activity compared with the figures of the previous years (2018 and 2017), phishing detection is still necessary in contemporary society to protect end-users from malicious emails. Phishing attacks are growing rapidly in volume and evolving dynamically. This results in serious economic loss around the world [1]. Fig. 1 depicts the statistics of phishing attacks in 1Q of 2019, while Fig. 2 illustrates the most-targeted industry sectors in 2Q of 2019 [4].
The past decade has seen the internet and email flooded with spam content [5]. Despite constant awareness campaigns and the growing number of anti-spam algorithms, spam content is on the increase [6]. On the server side, sending a large volume of spam causes delays in service response, reduces the authenticity of mail, and consumes a large portion of storage space. On the user side, sorting mail into valid and invalid messages requires a substantial amount of time, considering the large number of emails that a user receives per day [7]. Spam messages are not restricted to email. Many people are exposed to spam content when they visit social networks such as Telegram, Facebook, Instagram, and Twitter. A study revealed that more than 70% of all internet users use these social networks and are exposed to spam content [8].
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol 11, No. 11, 2020
Various algorithms have been designed to solve the problem of text-based spam. At present, spammers send these messages in the form of images to confuse and possibly defeat these algorithms. Image spam is a concept that emerged in early 2005. By the end of 2006, more than 50% of spam was made up of images [9], [6]. Image spam is another modern challenge in phishing email. It is email spam in which the text content is embedded in images to defeat conventional text-based spam filters [10]. It is a complex type of spam that is enticing and difficult for the user to notice [5], [11]. Fig. 3 shows examples of spam images. The objective of image spam is clearly to bypass the analysis of the textual content of emails performed by existing spam algorithms. For this reason, spammers usually add some bogus text to the email together with the attached image, such as a number of words that are likely to appear in genuine emails and not in spam [10].
Machine Learning (ML) is a branch of artificial intelligence concerned with creating algorithms that can modify themselves using structured data, without human intervention, to yield expected results [12]. Examples are Linear Regression, Logistic Regression, Decision Tree, Support Vector Machine (SVM), Naïve Bayes, K-Means, and Random Forest. Deep Learning (DL) is a branch of machine learning in which the algorithms are developed and function similarly to those in machine learning, but they are arranged in multiple layers, each providing a different interpretation of the data it feeds on [12]. These algorithms include Artificial Neural Networks (ANN), Deep Neural Networks (DNN), and Convolutional Neural Networks (CNN) [13]. In summary, machine learning algorithms need structured data; that is, they are built to learn by understanding labeled data and then use what they have learned to produce further outputs on new sets of data. However, they need to be retrained through human intervention when the actual output is not the desired one. Deep learning algorithms, in contrast, depend on layers of artificial neural networks. They do not require human intervention, as the nested layers in the neural networks put data through hierarchies of different concepts and eventually learn through their own errors [12].
Different types of techniques are used in classifying image spam, as shown in Fig. 4 [3]. These are grouped into Supervised Machine Learning, Unsupervised Machine Learning, Semi-supervised Machine Learning, Content-based Learning, and Statistical Learning. Numerous researchers have utilized these approaches for phishing email classification and detection. Choosing a technique that suits the nature of the data to be classified is crucial. The supervised machine learning algorithms most often used in the surveyed literature are Decision Tree, Fuzzy Logic, Support Vector Machine, Neural Networks, Bayesian Network, and Genetic Algorithm. Some researchers compared two or more of these techniques to see which one produces better results [14], [15]. Deep learning approaches have not been well exploited in image spam classification since their advent [16]. They can handle large datasets and can extract image features more accurately than the existing image processing techniques [5].
Unlike other survey articles, we compare the performance of the existing state-of-the-art image spam models. This review can also help researchers working in the field of image spam classification by answering the following research questions: (a) What are the various areas of application where image spam classification has been utilized? (b) Which publicly available datasets can be accessed for the various areas of application of image spam classification? (c) What are the commonly used feature sets in the existing image spam classification models? (d) What performance evaluation parameters are applied to determine the effectiveness of an image spam classification algorithm? (e) What are the challenges and research directions for future researchers working in the field of image spam classification?
The organization of the paper is as follows. Section 2 reviews the existing literature and related works. Section 3 discusses the future research directions. Section 4 gives the summary of the paper.
Fig. 4. Types of Techniques in Spam Classification [3].

II. RELATED WORKS
The review of the related works is discussed under the following headings: Identification of spam classification application areas, spam classification dataset analysis and review, feature set analysis and review, and the analysis and review of spam classification techniques.

A. Identification of Spam Classification Application Areas
Spam is broadly categorized as text-based or image-based [5]. It is further divided into content-based spam and non-content-based spam. Content-based spam is the first generation of image spam [17]. This includes spam in emails in text-based form. In this category, the content extracted from the body, headers, and keywords of emails is used by the classification algorithms to classify the images [5]. A wide range of machine learning techniques can handle this type of spam classification [6]. Non-content-based spam includes complex kinds of email spam and falls into the second and third generations of image spam [17]. In this category, the undesired text is embedded in images. To classify image spam, one can rely on the attributes of the image, but the recent advent of deep learning techniques makes it possible to classify these images based on their raw byte form [5].
Images in the first generation are simple spam images, hence they can be easily recognized by optical character recognition (OCR) tools. In the second and third generations, the images contain noise and superimposed backgrounds to confuse OCR and make the text unrecognizable. OCR tools can partition the portions of an image that contain particular objects for the purpose of text detection and extraction [17], [5]. The background noise included with the text inside an image is challenging for OCR [17]. In this study, we look at the application areas of spam classification under two (2) domains: text-based and image-based spam, as shown in Table I.

B. Spam Classification Dataset Analysis and Review
This section presents the datasets that were used in spam classification, together with a detailed analysis. The researchers used public datasets and personal collections in their works. They used one or more of the following as their datasets: personal collections, Dredze, Spam Archive, Princeton spam corpus, Image Spam Hunter, and so on. For example, [33] used only the Dredze dataset. The detailed analysis of the datasets used in both text-based and image-based spam classification is shown in Table II, and their locations are given in Table III. Table II lists the name of each dataset, its sample size, the number of studies that used it, and their references. This review found that Dredze is the most commonly used public dataset in image spam classification. This dataset consists of a total of 5789 images (3239 spam and 2550 ham). Ten (10) studies adopted the Dredze dataset, followed by the Spam Archive dataset (seven studies), Image Spam Hunter (four studies), and Trec07, ICDAR2003, and Chars74K (two studies each), while the other datasets (Enron corpus, SMS spam, Princeton spam corpus, LingSpam, SpamAssassin, and Indian corpus) have one study each. Seventeen (17) studies used personal collection datasets from Twitter. The locations where the datasets can be downloaded are presented in Table III. Fig. 5 shows the names of the datasets with the corresponding number of articles that adopted them.

C. Feature Set Analysis and Review
This section discusses the feature sets used in all the studies under review. A feature describes a specific or distinctive attribute of image spam during processing. Feature extraction and selection is one of the essential steps in designing efficient and accurate spam classification algorithms [3]. A brief overview of these features is given below. Table V shows the features used in image spam classification, and Fig. 6 presents the graph of the number of articles versus the image features.
• Text area: This is the region that the text occupies in an image. It is also called the text boundary. It is a way of identifying the presence of text in an image.
• Low-level (Color): These attributes are entropy values of the image's RGB color, brightness, hue, and saturation. Other values include the variance, skewness, and mean. The mean value represents the average pixel value of the image and is used to characterize the background of an image. For these features, spam and ham images have distinct histogram attributes. Skewness is used in identifying the surfaces of an image. Spam images normally have higher kurtosis values than ham images.
• Image similarity (Texture): The local binary pattern (LBP) is useful for measuring the similarity of, and the information in, adjacent pixels in an image. LBP is a powerful tool for identifying image spam that is simply text placed on a white background.
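The low-level color attributes listed above can be illustrated with a short sketch. It is a minimal example, not the feature extractor of any surveyed study: the function names `channel_stats` and `low_level_features`, the use of population (not sample) moments, and the toy image are all assumptions made for illustration, and only the Python standard library is used.

```python
import math
from collections import Counter


def channel_stats(values):
    """Summarise one colour channel given as a flat list of 0..255 integers."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(variance)
    if std == 0:
        # A perfectly flat channel has no shape; report zeros by convention.
        skewness = 0.0
        kurtosis = 0.0
    else:
        skewness = sum((v - mean) ** 3 for v in values) / (n * std ** 3)
        kurtosis = sum((v - mean) ** 4 for v in values) / (n * std ** 4)
    # Shannon entropy (in bits) of the intensity histogram.
    counts = Counter(values)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {
        "mean": mean,
        "variance": variance,
        "skewness": skewness,
        "kurtosis": kurtosis,
        "entropy": entropy,
    }


def low_level_features(pixels):
    """Build a flat feature dict from a list of (r, g, b) pixel tuples."""
    features = {}
    for index, name in enumerate(("red", "green", "blue")):
        stats = channel_stats([p[index] for p in pixels])
        for key, value in stats.items():
            features[f"{name}_{key}"] = value
    return features


# Hypothetical toy image: dark text pixels on a mostly white background.
toy_image = [(255, 255, 255)] * 90 + [(0, 0, 0)] * 10
for key, value in low_level_features(toy_image).items():
    print(f"{key}: {value:.3f}")
```

A feature dictionary of this kind would typically be turned into a fixed-length vector before being passed to a classifier such as SVM or Naïve Bayes.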
Several researchers, as shown in Table IV, used image features to identify image spam [30]. For instance, [30] proposed an image spam classifier using Maximum Entropy, Decision Tree, and Naïve Bayes methods. They focused only on the low-level and image metadata features of the image for the classification and achieved an average accuracy of 95% with a computation time of 2.5-4.4 ms. They considered only a small feature set for training the algorithm. Feature reduction and elimination techniques such as principal component analysis (PCA), recursive feature elimination (RFE), and univariate feature selection (UFS) are vital for reducing the number of features of an image in order to achieve better classification accuracy. The authors in [35] used PCA and SVM to develop a classifier for image spam. They used a small sample from the Image Spam Hunter and personally collected datasets to train their classifier and claimed 70-97% accuracy. They did not take the processing time into consideration. The authors in [17] used the same feature reduction and elimination approach. They looked at 38 features of the image and used RFE and UFS to remove the undesirable features. They employed the SVM method to train their classifier using 920 spam and 810 ham images from the Image Spam Hunter dataset and 1089 spam and 1029 ham images from the Dredze and personally collected datasets. An accuracy of 54-98% and a false-positive rate of 0.01-0.79 were obtained. The time taken for the classification was not considered.
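To make the idea of univariate feature selection concrete, the sketch below scores each feature on its own and keeps the best k. The Fisher-style score (squared difference of class means divided by the sum of class variances) is a stand-in chosen for illustration; the surveyed studies do not state which scoring function they used, and the function names and toy feature rows are hypothetical.

```python
def fisher_score(spam_values, ham_values):
    """Score one feature: large when class means differ and variances are small."""
    mean_s = sum(spam_values) / len(spam_values)
    mean_h = sum(ham_values) / len(ham_values)
    var_s = sum((v - mean_s) ** 2 for v in spam_values) / len(spam_values)
    var_h = sum((v - mean_h) ** 2 for v in ham_values) / len(ham_values)
    spread = var_s + var_h
    if spread == 0:
        # Constant within both classes: perfect separator or useless feature.
        return float("inf") if mean_s != mean_h else 0.0
    return (mean_s - mean_h) ** 2 / spread


def select_top_k(spam_rows, ham_rows, k):
    """Return the indices of the k highest-scoring features (ties by index)."""
    n_features = len(spam_rows[0])
    scored = []
    for j in range(n_features):
        score = fisher_score([row[j] for row in spam_rows],
                             [row[j] for row in ham_rows])
        scored.append((score, j))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [j for _, j in scored[:k]]


# Hypothetical feature rows: only feature 0 separates the two classes.
spam_rows = [[0.9, 0.5, 10], [0.8, 0.4, 12]]
ham_rows = [[0.1, 0.5, 11], [0.2, 0.4, 11]]
print(select_top_k(spam_rows, ham_rows, k=1))
```

Unlike PCA, which builds new combined features, this kind of selection keeps a subset of the original features, so the retained columns stay interpretable.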

D. Spam Classification Techniques Analysis and Review
Spam email classification techniques, as depicted in Fig. 4, are categorized into five (5) groups: supervised machine learning, unsupervised machine learning, semi-supervised machine learning, content-based learning, and statistical learning [3], [42], [43]. In supervised machine learning, input instances are given to the learning procedure together with their output labels, and the procedure learns a function that approximates this behavior. Supervised machine learning techniques include Decision Tree, Naïve Bayes, Support Vector Machine, K-Nearest Neighbor, Bayesian Network, and Neural Network. In unsupervised machine learning, the learning procedure is supplied with input instances but without output labels. Here, the learning procedure tries to recognize related patterns in the input instances to determine the output. An example of unsupervised machine learning is k-means clustering [3]. Semi-supervised machine learning is a combination of supervised and unsupervised machine learning.
In semi-supervised machine learning, only some of the input data are labelled, so the learning procedure does not require a large labelled dataset. Active learning is one example of semi-supervised machine learning. Content-based techniques use keywords in classifying spam email [3]. Examples are optical character recognition (OCR) and Sobel filters. In statistical learning, each keyword is assigned a probability, and the overall probability is used to classify the image spam. Supervised machine learning is the most frequently used technique in spam classification, even though researchers have used all the other types of techniques. Table IX presents the distribution of spam classification techniques [3]. Thirty (30) studies adopted supervised machine learning techniques, four (4) used unsupervised techniques, and eight (8) and five (5) studies adopted content-based learning and statistical learning, respectively.
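The statistical-learning idea of assigning a probability to each keyword and combining them into an overall probability can be sketched as a simplified Naïve Bayes classifier. This is an illustration under stated assumptions, not the method of any cited study: only keywords present in the message contribute, add-one (Laplace) smoothing is assumed, both classes must appear in the training data, and the tokens are taken to be words already extracted from the email or image (for example by OCR).

```python
import math
from collections import Counter


def train(messages):
    """messages: list of (tokens, label) pairs with label 'spam' or 'ham'."""
    doc_counts = Counter()
    word_counts = {"spam": Counter(), "ham": Counter()}
    vocabulary = set()
    for tokens, label in messages:
        doc_counts[label] += 1
        for token in set(tokens):  # count each keyword once per message
            word_counts[label][token] += 1
            vocabulary.add(token)
    return doc_counts, word_counts, vocabulary


def spam_probability(tokens, model):
    """Combine per-keyword probabilities into an overall spam probability."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    log_score = {}
    for label in ("spam", "ham"):
        score = math.log(doc_counts[label] / total_docs)  # class prior
        for token in set(tokens) & vocabulary:
            # Add-one smoothing keeps unseen keywords from zeroing the product.
            p = (word_counts[label][token] + 1) / (doc_counts[label] + 2)
            score += math.log(p)
        log_score[label] = score
    # Normalise in a numerically stable way.
    top = max(log_score.values())
    weights = {label: math.exp(s - top) for label, s in log_score.items()}
    return weights["spam"] / (weights["spam"] + weights["ham"])


# Hypothetical training messages.
model = train([
    (["cheap", "pills"], "spam"),
    (["cheap", "offer"], "spam"),
    (["meeting", "agenda"], "ham"),
    (["project", "offer"], "ham"),
])
print(round(spam_probability(["cheap", "pills"], model), 3))
```

A message would be labelled spam when the returned probability exceeds a chosen threshold such as 0.5; the threshold trades false positives against false negatives.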
Fumera et al. [20] developed an algorithm for detecting and classifying text-based spam using an optical character recognition (OCR) tool. They used 445 spam and 4852 ham samples from the Spam Archive dataset and 5608 spam and 9526 ham samples from a personally collected dataset to train their model with a support vector machine (SVM). The authors focused only on the true-positive and false-positive rates, and the results obtained were 0.81 and 0.01, respectively. They did not consider the time taken for the classifier to detect and classify a spam email, and the method is inefficient since it cannot handle large datasets conveniently. The proposed classifier cannot detect image spam email. The same OCR tool was used by other researchers [21], [22], [24], [23], who examined and applied OCR software to filter image spam email. While [22] used KNN, Naïve Bayes, and Reverse DBSCAN, [24] used Sobel operators (filters) to process the image, as displayed in Table VI. Image spam classifiers have also been proposed using a near-duplicate detection approach but with different distance measures [39], [38], [37], [36]. They all considered low-level and image similarity features of the image spam in training their models. While [39] used Visual and Object Semantics as a distance measure to classify image spam and achieved an accuracy of 96%, [38] used Histogram and Euclidean distance measures to obtain a better result of 98% accuracy. The difference between the two results arose because the former used a larger dataset than the latter. The computation time was not considered except in the study of [36], where the time taken to detect an image and classify it as either spam or ham was 50 ms. This is displayed in Table VII.
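A near-duplicate check based on a histogram and the Euclidean distance can be sketched as follows. This is only an illustration of the general idea and not the implementation of [38]: the 16-bin grey-level histogram, the threshold of 0.1, and the function names are assumptions, and images are represented as flat lists of grey values to keep the example free of external libraries.

```python
import math


def gray_histogram(pixels, bins=16):
    """Normalised intensity histogram of a flat list of 0..255 grey values."""
    hist = [0] * bins
    width = 256 / bins
    for p in pixels:
        hist[min(int(p / width), bins - 1)] += 1
    total = len(pixels)
    return [count / total for count in hist]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def is_near_duplicate(pixels, known_spam_hists, threshold=0.1, bins=16):
    """Flag an image whose histogram lies close to any known spam histogram."""
    hist = gray_histogram(pixels, bins)
    return any(euclidean(hist, known) <= threshold for known in known_spam_hists)


# Hypothetical data: one known spam template and two incoming images.
known = [gray_histogram([255] * 90 + [0] * 10)]
slightly_altered = [255] * 88 + [0] * 12   # same layout with a little extra text
photo_like = list(range(256))              # intensities spread over the full range

print(is_near_duplicate(slightly_altered, known))
print(is_near_duplicate(photo_like, known))
```

Because a global histogram ignores where pixels are located, it tolerates small shifts and added noise, which is what makes it a candidate for catching randomized variants of the same spam template.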
Table VIII presents the keys of the abbreviations as used in Tables IV, V,
The Support Vector Machine (SVM) method is one of the most commonly used classification algorithms in image spam classification [3] and has been adopted by many researchers in their works [44], [35], [17], [33], [32], [31]. SVM is suitable for binary classification problems but has difficulty handling large datasets [27]. To identify an image as spam or ham, [31] considered three feature domains, namely text area, low-level features (image color), and text obfuscation (noise) of the image. They claimed to have obtained 94-98% accuracy with a computation time of 1200 ms.
Singh [5] proposed an image spam classifier using deep learning algorithms. The study did not consider the time taken to identify and classify image spam and used only a few datasets, concentrating on the low-level, image metadata, and image obfuscation (noise) features of the image. It obtained 95.63 to 98.95% accuracy. No object segmentation approach was used to detect the segmented spam area. Since its advent, deep learning has not been well exploited in classifying image spam. Deep learning can handle large datasets and can extract image features more accurately than existing image processing techniques [5].
Web content-based approaches can be combined with machine learning techniques to build a system for phishing website and email detection [45]. The authors in [45] used this approach to design a detection system with 92% accuracy, known as CANTINA+. A web structure-based method using Google PageRank has been used to achieve 98% accuracy in classification [46]. A Bayesian algorithm and the incremental forgetting weight algorithm were used to create a model that effectively tackled concept drift and data bias in the classification of spam emails [25]. It is possible to combine statistical analysis of website URLs with machine learning techniques to develop a classification algorithm with a better precision rate [47].
Many researchers have worked on detecting and classifying phishing emails but did not focus on spam emails. For example, [14] used a dataset gathered from Twitter and implemented an algorithm    [49].
One of the hybridized approaches used in email phishing detection is neuro-fuzzy, which combines fuzzy logic and neural networks. The authors of [1] used this approach to develop an anti-phishing model and obtained an improved detection accuracy of 98.36%. A better result of 99.29% accuracy was obtained using the same method [50]. While [50] did not focus on missed detection and false alarm rates, high missed detection and false alarm rates were reported by [1].
In the literature, decision tree data mining techniques such as association rule mining and classification have been widely used. Classification algorithms have been proposed using these methods to derive new rules from phishing datasets [51], [52]. The main challenge with this approach is that the set of rules is not objective and largely depends on the programmer [1]. The same method was used to create a classifier that categorizes emails written in Chinese as spam or ham based on a specific feature [26]. Data mining knowledge discovery procedures were used to develop an intelligent classification model that was tested using Random Forest, J48, SVM, MLP, and Bayes Net. Using the Random Forest and J48 algorithms, accuracies of 99.1% and 98.4% were achieved, respectively [53].
A Convolutional Neural Network has recently been used together with a long short-term memory neural network (LSTM NN) to create a text-based spam classifier, and an accuracy of 92-98% has been achieved [18]. The authors of [28], [44] used KNN and Naïve Bayes in their work with the Dredze image dataset. They used a distributed associative memory tree to extract features of the image. This feature extraction method performs best in comparison with other distributed approaches, using a relatively small amount of resources for spam detection. An accuracy of 98% was reached [28]. Random Forest had better accuracy, precision, recall, and F-measure than SVM and multilayer perceptron when PCA was used to construct an image spam model on a Twitter dataset. An accuracy of 96.3% was achieved in this study [27].
Naiemi et al. [9] proposed a new algorithm to recognize characters in image spam by improving the existing HOG feature extraction, using SVM as the classifier. The study improved on the scale and translation robust HOG (STRHOG), which was developed with the Chars74K dataset and had an accuracy of 72.2% [29]. In STRHOG, the matrices of oriented gradients for input images of different sizes have a high computational cost, and a large part of each matrix has no effect on recognizing the image. The authors of [9] were able to overcome these problems and obtained a detection accuracy of 84.91%. Some of this study's weaknesses are briefly discussed here. The Support Vector Machine (SVM) adopted in the work is well suited to binary classification problems [27]. SVM works well when dealing with two, three, or four classes, but the Chars74K dataset used in the work has 62 classes and is therefore a multiclass problem. Additionally, in machine and deep learning one tries as much as possible not to lose data. In fact, generating data for any missing attribute within a dataset is advisable. In HOG, the image passes through cropping, and data is lost in the process. Finally, the study did not consider the time taken to detect and classify the image spam. Because of its complex computation, the Canny algorithm used for edge detection consumes a lot of time, and it will be hard to achieve a real-time response with it.
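For readers unfamiliar with HOG, the sketch below computes the orientation histogram of a single cell, which is the basic building block of the descriptor. It is a deliberately simplified illustration and is neither STRHOG nor the improved method of [9]: it uses central differences, nine unsigned orientation bins, no bin interpolation, and a single L2 normalisation, all of which are assumptions made for this example.

```python
import math


def cell_hog(cell, bins=9):
    """Gradient-orientation histogram for one grey-level cell (a list of rows)."""
    height, width = len(cell), len(cell[0])
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    # Skip the one-pixel border so central differences stay inside the cell.
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue
            # Unsigned orientation in [0, 180) degrees.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            index = min(int(angle / bin_width), bins - 1)
            hist[index] += magnitude  # vote weighted by gradient strength
    norm = math.sqrt(sum(v * v for v in hist))
    return [v / norm for v in hist] if norm > 0 else hist


# Hypothetical 6x6 cells containing a vertical and a horizontal stroke edge.
vertical_edge = [[0, 0, 0, 255, 255, 255] for _ in range(6)]
horizontal_edge = [[0] * 6 for _ in range(3)] + [[255] * 6 for _ in range(3)]

print(cell_hog(vertical_edge))
print(cell_hog(horizontal_edge))
```

A full descriptor concatenates such histograms over many cells, which is why its size and cost grow with image size, the issue that the discussion of STRHOG above refers to.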
In most of the reviewed articles, the computation time was not considered. Table X lists the references of the articles that considered time in text-based and image-based spam classification.

E. Performance Metrics Review and Analysis
The confusion matrix (CM), as shown in Fig. 8, measures the performance of a classification algorithm in terms of accuracy, recall, precision, and F-measure. These are defined below. The CM is a matrix of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). TP is when the image is a spam image and the classifier labels it as spam. TN is when the image is a ham image and the classifier labels it as ham. FP is when the image is a ham image and the classifier labels it as spam. FN is when the image is a spam image and the classifier labels it as ham [5]. The frequently used performance metrics and their formulas, as highlighted in the works of [3], [55], are discussed below.
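The four cells of the confusion matrix can be counted directly from the true and predicted labels. The following is a minimal sketch with spam treated as the positive class; the function name and the label lists are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Count TP, TN, FP and FN, treating `positive` as the spam class."""
    tp = tn = fp = fn = 0
    for truth, predicted in zip(y_true, y_pred):
        if truth == positive and predicted == positive:
            tp += 1  # spam image labelled as spam
        elif truth != positive and predicted != positive:
            tn += 1  # ham image labelled as ham
        elif truth != positive and predicted == positive:
            fp += 1  # ham image labelled as spam
        else:
            fn += 1  # spam image labelled as ham
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


# Hypothetical ground truth and classifier output for eight images.
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "spam"]
y_pred = ["spam", "spam", "ham", "ham", "ham", "spam", "ham", "spam"]
print(confusion_counts(y_true, y_pred))
```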
(a) Accuracy: This is the percentage of predictions that are correct. It is used to determine how well a classifier works. It is defined mathematically as:
\[ \text{Accuracy} = \frac{TP + TN}{P + N} \]
where P = TP + FN and N = TN + FP.
(b) Precision: This is the percentage of images classified as spam that are actually spam. It is calculated as:
\[ \text{Precision} = \frac{TP}{TP + FP} \]
(c) Recall: This is the percentage of image spam classified correctly as spam. It is defined as:
\[ \text{Recall} = \frac{TP}{TP + FN} = \frac{TP}{P} \]
(d) F-Measure: This is how effectively a classifier identifies positive labels. It is the harmonic mean (a weighted average) of precision and recall. F-Measure is calculated as:
\[ \text{F-Measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]
(e) Specificity: This is how effectively a classifier identifies negative labels. It is defined as:
\[ \text{Specificity} = \frac{TN}{TN + FP} = \frac{TN}{N} \]
(f) Area Under Curve (AUC): This is the ability of a classifier to prevent incorrect classification. It is given as:
\[ \text{AUC} = \frac{1}{2}\left(\frac{TP}{P} + \frac{TN}{N}\right) \]
The performance of the existing state-of-the-art image spam models using the above metrics is shown in Table XI. The existing works considered one or more of the performance metrics. For instance, [36], [37], [31], [40], [5], [56] considered only the accuracy.
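As a worked illustration, the metrics in this section can be computed from the four confusion-matrix counts using their standard definitions. The sketch below is an assumption-laden example rather than code from any surveyed study: the function name and the counts are hypothetical, the negative-class measure is computed as TN / (TN + FP), and AUC is taken as the single-point estimate equal to the average of the true-positive and true-negative rates.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute common binary-classification metrics from confusion counts."""
    p = tp + fn  # all actual spam images
    n = tn + fp  # all actual ham images
    accuracy = (tp + tn) / (p + n)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / p if p else 0.0
    specificity = tn / n if n else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    # Single-point AUC estimate: mean of true-positive and true-negative rates.
    auc = 0.5 * (recall + specificity)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f_measure": f_measure,
        "auc": auc,
    }


# Hypothetical test set: 100 spam images (90 caught), 100 ham images (20 misflagged).
for name, value in classification_metrics(tp=90, tn=80, fp=20, fn=10).items():
    print(f"{name}: {value:.3f}")
```

Reporting accuracy alone, as several of the surveyed works do, can hide a high false-positive rate when the classes are imbalanced, which is why precision, recall, and the negative-class rate are worth reporting together.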

III. FUTURE RESEARCH DIRECTIONS
We discuss some of the challenges and open issues in the existing studies on image spam classification research in this section.

• Dataset: Image spam classification is a binary classification problem (ham or spam). Some of the datasets used in the reviewed articles have four or more classes; such datasets are suited to multiclass problems and not to binary problems. A more challenging dataset is required for future image spam classification research.
• Optical Character Recognition (OCR) Approach: Most of the existing works used the OCR technique. In the OCR method, data is lost when the image is cropped during the pre-processing stage. The images do not have the same dimensions and are forced to the same size, thereby losing some important data. Machine learning algorithms try as much as possible not to lose data; in fact, it is common to generate data for any missing attribute in a dataset. More suitable techniques are needed for extracting the features of an image in future research on image spam classification.
• Deep Learning Technique: The state-of-the-art image spam classifiers developed using machine learning techniques, which work with small datasets, have difficulty extracting the relevant features of the images, and this has a negative effect on the overall output of the classification. Deep learning models can handle large datasets and can extract image features more accurately than machine learning techniques [5]. This approach has not been well exploited in image spam classification since its advent [16]. With this in mind, future image spam classifiers can be implemented using deep learning techniques such as deep neural networks and convolutional neural networks to make the classifier more powerful and to improve the accuracy and precision of the classification algorithms.
• Fog Architecture: Fog computing, also known as fog networking or fogging, is a newly introduced concept. It is an Internet of Things (IoT) architecture that extends the cloud so that it is closer to end devices. It supplies information and computing assets such as storage and application services to the end devices. Moreover, at the edge of networks, fog supports high versatility because it provides services at locations close to the end-users [61]. Fig. 7 shows the architecture of fog computing, which clearly depicts the three (3) layers, namely the end devices (IoT) layer, the fog layer, and the cloud layer [48]. This concept, which was recently used by [1] to detect phishing websites, produced high detection accuracy. The authors also revealed that fog-based services are faster than cloud-based services and that it is more manageable and easier to implement a machine learning algorithm on fog nodes than on the cloud. In view of this, an algorithm can be implemented on a fog node to increase the detection speed of image spam classification.
• Computation Time: Image spam detection and classification should be a real-time process in order to minimize response delay. In the reviewed articles, the time taken to classify the image is mostly neglected. The Canny algorithm, mostly used for edge detection in the histogram of oriented gradients (HOG) method, consumes a lot of time due to its complex computation, which makes a real-time response difficult to achieve. Future research should consider reducing the processing and classification time using recent hardware technology.
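Since computation time is rarely reported, the following sketch shows one simple way a per-image classification latency could be measured and reported in milliseconds. The helper `time_classifier`, the choice of the median over three repeats, and the stand-in classifier are assumptions for illustration; any real classifier callable could be passed in its place.

```python
import statistics
import time


def time_classifier(classify, images, repeats=3):
    """Return the median latency in milliseconds for classifying each image."""
    latencies = []
    for image in images:
        runs = []
        for _ in range(repeats):
            start = time.perf_counter()
            classify(image)
            runs.append((time.perf_counter() - start) * 1000.0)
        # The median is less sensitive to one-off scheduling delays than the mean.
        latencies.append(statistics.median(runs))
    return latencies


def toy_classifier(pixels):
    """Hypothetical stand-in: call an image spam if it is mostly bright."""
    return "spam" if sum(pixels) / len(pixels) > 200 else "ham"


images = [[255] * 10_000, [30] * 10_000]
latencies = time_classifier(toy_classifier, images)
print(f"mean latency: {statistics.mean(latencies):.3f} ms")
```

Reporting the latency alongside accuracy, as [30], [31], and [36] do, makes it possible to judge whether a classifier is suitable for real-time filtering.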

IV. CONCLUSION
This study provides a thorough overview of image spam classification studies to help researchers in this field gain a good knowledge and understanding of current image spam classification solutions in the major areas. Journal articles published between 2006 and 2020 on image spam detection and classification were thoroughly studied and grouped into two application domains: text-based and image-based. The selected papers were analyzed along five dimensions: spam classification application domains, the datasets adopted and the feature sets utilized in the two application domains, the methods used, and the metrics considered for performance evaluation. More than 50 articles on spam classification were carefully selected and examined. A comprehensive analysis of the techniques, feature sets, datasets, and performance evaluation metrics used in spam detection and classification was summarized. The survey revealed that Personal Collection, Dredze, and Spam Archive are the most commonly adopted datasets. Similarly, low-level and image metadata features are the most widely used feature sets in spam classification research. The methods of image spam classification pinpointed in this study are supervised machine learning, unsupervised machine learning, semi-supervised machine learning, content-based learning, and statistical learning. Among these, the most commonly used is supervised machine learning. Support Vector Machine (SVM) provides the best performance and is often used in supervised learning. This is followed by Naïve Bayes and K-Nearest Neighbor techniques. The commonly investigated metrics for performance evaluation are accuracy, recall, precision, F-measure, specificity, and the confusion matrix, which depicts the relationship between TP, TN, FP, and FN. Finally, we present promising directions for future research.