Effect of Header-based Features on Accuracy of Classifiers for Spam Email Classification

—Emails are an integral part of communication in today's world, but spam emails are a hindrance, reducing efficiency, posing security threats, and wasting bandwidth. They therefore need to be filtered at the first filtering station, so that employees are spared the drudgery of handling them. Most earlier approaches focus on building content-based filters using the body of an email message. Using selected header features to filter spam is a better strategy, initiated by a few researchers. In this context, our research aims to find the minimum number of features required to classify spam and ham emails. A set of experiments was conducted with three datasets and five feature selection techniques, namely Chi-square, Correlation, Relief Feature Selection, Information Gain, and Wrapper. Five classification algorithms were used: Naïve Bayes, Decision Tree, NBTree, Random Forest, and Support Vector Machine. In most approaches, a trade-off exists between improper filtering and the number of features, so arriving at an optimum feature set is a challenge. Our results show that, to achieve satisfactory filtering, a minimum of 5 and a maximum of 14 features are required.


I. INTRODUCTION
Email communication has become an essential part of all spheres of personal and professional life. But not all emails are relevant for every user. Email traffic is increasing day by day, making it imperative to filter spam emails. According to a survey conducted by Radicati in 2017, the total number of emails sent and received per day would reach 319.6 billion by the end of 2021 [1]. As per the Infocomm survey 2016 on internet usage, "Sending and receiving emails" (94%) and "Information search" (92%) are the two main activities on the internet [2].
Spam finds its first mention as early as 1975, in RFC 706 by Jon Postel. According to RFC 2505, mass unsolicited emails, sent in large volumes to target consumers, are called spam emails. The Text Retrieval Conference (TREC) defines spam as "unsolicited, unwanted email sent indiscriminately, directly or indirectly, by a sender having no current relationship with the user" (spam track) [10], [11]. According to a survey conducted by GFI Software in 2014, spam emails consume bandwidth and distract the user from work. The purpose of sending spam differs from person to person and from organization to organization: it is used to send phishing or advertising emails, or to spread viruses and worms.
An email contains headers and a body. The email header field format is defined in RFC 822 and RFC 2822. One may classify an email by inspecting the body content and the headers. The email header contains useful information (metadata). The contents of the body can be text, pictorial data, or even sound; this is a purely unstructured part of the email. Our work intends to find out: a) The minimum number of header features necessary to identify a spam email.
b) The effect of the identified features on the accuracy of email classification.
c) The best combination of feature selection technique and classification algorithm.
This paper consists of three sections. In the first section, different approaches to spam classification, as found in the literature, are discussed. The next section presents details about the data collection and the experiments carried out for this research. The discussion of the experimental results follows, along with the conclusion.

II. LITERATURE REVIEW
There are four commonly used techniques for spam classification, namely: a) use of a blacklist [14]; b) a protocol-based approach; c) use of keywords or content filtering; d) header-based filtering [20], [28], [21], [5], [36], [13]. In the first case, the network administrator maintains a database of blacklisted email addresses or domain names. The classifier matches a new record against the blacklisted database and simply rejects certain mails, putting them into the spam folder. However, the blacklist requires continuous updating, and the approach may fail if the sender's address is fake [37]. The second approach is protocol-based, where traffic coming from a specific IP address can be blocked; but IP addresses can be easily forged [17], [6]. In the third method, keyword or content filtering [16], spammers bypass the filter by embedding text into images. Such models provide better filtering; however, they come with two disadvantages: a) they are time consuming, and b) the process is language dependent [9]. That is why this paper focuses on the fourth approach, header-based filtering. Spam classification helps us filter unwanted emails out of the inbox. There have been various attempts to classify spam email using the email header [20], [21], [5], [36], [37], [38], [13], [4], using the email body [3], [41], [35], [29], [27], [30], [7], [31], [32], [33], [34], using both body and header [18], [23], [21], [15], [42], and using statistical features [19], [25]. Email header classification is performed using techniques such as Naïve Bayes (NB), Decision Tree (DT) [40], [43], Support Vector Machine (SVM) [23], [24], [20], [13], [26], and Random Forest (RF) [4], [13]. When these techniques were adopted by researchers using various features and datasets, Random Forest showed better performance than the other techniques. Selecting an appropriate set of features is important because it influences the accuracy of the classifier [8].
The authors in [5] used a total of 26 features, derived from behaviour in the headers and syslogs of emails, with a backpropagation neural network (BPNN) and achieved an accuracy of 99.6%. But one drawback of using a BPNN is its unstable time to converge; the number of features and the training data affect its performance, so the results can fluctuate. In [13], the authors used the IP address and the subject along with four other features, which resulted in an accuracy of 96.7%. But an IP address may be forged, so we have not considered the IP address in this research. Our attempt is to suggest optimized features without the use of any text data from the subject or body of the email. Therefore, we have used a combination of different features from the literature and from a study of personal spam data.

III. RESEARCH METHODOLOGY
The experiments are conducted in two phases. First, feature selection techniques are applied on the datasets to generate subsets of features. In the second phase, the resultant feature subsets are used for classification to find their effect on the accuracy of the classifiers. The minimum number of features, together with the best-performing classifier, is selected as the result.
The steps are as follows:
1) Input: email datasets.
2) Extract email header features.
3) Apply feature selection techniques.
4) Select the subsets of features generated by the feature selection techniques.
5) Apply classification on the email datasets with the selected feature subsets.
6) Classify emails into spam and non-spam.
7) Note down the accuracy of the classifiers.
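The two-phase procedure above can be sketched in Python. This is an illustrative sketch only: the paper uses Weka, so scikit-learn is used here as a stand-in, and the data below is synthetic rather than drawn from the paper's datasets.

```python
# Hypothetical sketch of the two-phase pipeline: feature selection
# followed by classification. scikit-learn stands in for Weka; the
# random binary matrix stands in for the extracted header features.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 17))   # 200 emails, 17 binary header features
y = rng.integers(0, 2, size=200)         # 1 = spam, 0 = ham

# Phase 1: select a subset of features (here: top 5 by chi-squared score)
selector = SelectKBest(chi2, k=5).fit(X, y)
X_sel = selector.transform(X)

# Phase 2: classify with the reduced feature set, 10-fold cross-validation
clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X_sel, y, cv=10)
print(X_sel.shape, round(scores.mean(), 3))
```

With real data, phase 1 would be repeated for each of the five feature selection techniques and phase 2 for each of the five classifiers.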

IV. DATA COLLECTION AND PRE-PROCESSING
We collected emails as a reference database to carry out the necessary experiments. These emails were collected from a personal email account over a period of the last 7 years. Two publicly available benchmark corpora, namely the Spam Assassin corpus and the CSDMC2010 corpus, are also used in these experiments. These datasets contain spam and ham files. A description of the data collected for experimental purposes is given in Table I.

A. Use of Features in Spam Classification

RFC 822 and RFC 2822 are the standard formats that define the email structure and the various email header fields; therefore, the email header fields used as features are adopted from these two. The list of features was obtained by a study of the personal database and from the literature. Some of the earlier researchers have not addressed the following six header fields: In further discussions, the term S(U_Bf) (set of universal base features) is used to refer to the above ten features, for ease of explanation.

Header field: To
- To is Empty: check whether a value for the "To" header field exists, or whether it contains "Undisclosed Recipients" or the "<>" symbol [37], [44]
- To is Undisclosed [36], [9]
- To contains <> (proposed feature)
- To_number of address: number of addresses in the "To" field

Header fields: BCC, To
- BCC_notempty_To_empty: check whether "BCC" contains an email address while "To" does not contain any email address [36]

Header fields: Return-Path, From
- Return path_From Domain: the return path does not match the From address; check whether the domains in "Return-Path" and "From" are the same [45]
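The header checks above can be illustrated with Python's standard email library. This is a hedged sketch: the feature names follow the table, but the helper function and the exact matching rules are our own illustrative choices, not the paper's implementation.

```python
# Illustrative extraction of a few "To"-related header features,
# using Python's standard email parser. The matching rules below
# are simplified assumptions for demonstration purposes.
from email import message_from_string

def extract_to_features(raw: str) -> dict:
    msg = message_from_string(raw)
    to = msg.get("To") or ""
    bcc = msg.get("Bcc") or ""
    ret = (msg.get("Return-Path") or "").strip("<>")
    frm = msg.get("From") or ""
    domain = lambda addr: addr.rsplit("@", 1)[-1].lower() if "@" in addr else ""
    return {
        "To_is_empty": int(to.strip() in ("", "<>") or "undisclosed" in to.lower()),
        "To_contains_angle": int("<>" in to),
        "BCC_notempty_To_empty": int(bool(bcc.strip()) and not to.strip()),
        "To_number_of_address": to.count("@"),
        "ReturnPath_From_domain_mismatch": int(domain(ret) != domain(frm)),
    }

sample = ("From: a@example.com\n"
          "Return-Path: <b@other.org>\n"
          "To: undisclosed-recipients:;\n"
          "\nbody")
print(extract_to_features(sample))
```

Each returned value is the 0/1 score (or count) that would feed the classifier's feature vector.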

V. EXPERIMENT
As mentioned earlier, experiments were conducted on the three email datasets described in Table I. Code was developed in Python to extract the email header data according to Table II. Our proposed model evaluates an email using these 17 features. Each feature is assigned a score of 1 (one) if its condition is satisfied; otherwise it is marked as 0 (zero). The sum of the scores is calculated at the end. In this experiment, the Chi-squared [19], Correlation-based [39], Information Gain, Relief [22], and Wrapper feature selection techniques are applied to find the significant features of an email. The classifiers Naïve Bayes, Decision Tree, Random Forest, NBTree, and Support Vector Machine were used in the experiment.
The data mining tool Weka has been used to apply the machine learning techniques. All the feature selection methods and classifiers were adopted in Weka as selectable runtime parameters. The collected data were arranged in a CSV file in the following format: feature 1, feature 2, feature 3, ..., feature n, class label (the class label indicating the two classes, Spam and Ham). The 10-fold cross-validation technique is used for data validation; this method uses 90% of the data for training and 10% for testing.
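The 90%/10% split produced by 10-fold cross-validation can be seen directly in code. A minimal sketch, using scikit-learn's `KFold` on a dummy matrix purely to show the fold sizes:

```python
# Illustration of 10-fold cross-validation: each fold trains on 90%
# of the records and tests on the remaining 10%. Data is synthetic.
import numpy as np
from sklearn.model_selection import KFold

X = np.zeros((100, 17))   # 100 emails, 17 header features (dummy values)
kf = KFold(n_splits=10, shuffle=True, random_state=0)
sizes = [(len(train), len(test)) for train, test in kf.split(X)]
print(sizes[0])   # (90, 10): 90 training rows, 10 test rows per fold
```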
The average weight of each feature, as generated by all the feature selection techniques, is calculated and listed in Fig. 1. It can be clearly observed that our proposed features content-transfer-encoding and Authentication-result belong to the top five features by weight and make a significant contribution to spam classification. The next two features, Subject_symbol and From_symbol, are among the top ten. However, our proposed features BCC_notempty_To_empty and Message-ID_dollar do not make any significant contribution to spam classification.
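The per-feature averaging behind Fig. 1 is straightforward: the weight each feature selection technique assigns to a feature is averaged across techniques. A small sketch with made-up weights (the numbers below are illustrative, not the paper's values):

```python
# Averaging per-feature weights across feature selection techniques,
# as done for Fig. 1. All weights here are invented for illustration.
weights = {
    "chi_square": {"Authentication-result": 0.82, "Message-ID_dollar": 0.05},
    "info_gain":  {"Authentication-result": 0.78, "Message-ID_dollar": 0.02},
    "relief":     {"Authentication-result": 0.66, "Message-ID_dollar": 0.01},
}
features = {f for table in weights.values() for f in table}
avg = {f: sum(t[f] for t in weights.values()) / len(weights)
       for f in sorted(features)}
print(avg)
```

Ranking the features by this average weight yields the ordering plotted in Fig. 1.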

VI. RESULTS AND DISCUSSION
In this experiment, we have not considered any text feature value from either the body or the subject. The following conventions are used in Table III, Table IV, and Table V.

Table III indicates that, for dataset S1, the results showed an accuracy of 93.53% with all 17 header features. The maximum number of features, 14, is generated by the Relief technique (RT), which maintains the best balance between the false positive rate and the true positive rate. The accuracy of RF improves by 0.03% with 14 features. With an accuracy of 93.56%, Random Forest (RF) outperformed the other four classifiers: Naïve Bayes (NB), Decision Tree (DT), NBTree, and Support Vector Machine (SVM). Next to RF, the DT classifier also performs well. Naïve Bayes shows stable performance when the number of features is increased from 11 to 14. As the number of features is reduced, the performance of DT and RF decreases. When the number of features varies between 11 and 14, the Support Vector Machine performs well; however, when the features are reduced from 11 to five, its performance decreases by 0.9%.

In this paper, we evaluated the performance of five feature selection techniques and five classifiers on email headers. Our header-based approach to feature selection showed that the minimum of five features generated by the correlation-based feature selection technique performed well on all three datasets, with accuracy varying from 70.58% to 90.65%. The Relief feature selection technique generated the maximum of fourteen features, with accuracy varying from 91.06% to 94.78%. This implies that the features we proposed, namely Authentication-result and content-transfer-encoding, play a significant role in identifying spam emails. The results of our experiment show that Random Forest performs better than all the other classifiers in terms of both accuracy and the number of features.