Feature Weight Optimization Mechanism for Email Spam Detection based on Two-Step Clustering Algorithm and Logistic Regression Method

This research proposes an improved spam filtering technique for suspected email messages, based on feature weighting and a combination of the two-step clustering algorithm and logistic regression. Unique, important features are used as the optimum input to the proposed hybrid approach. The study adopts a spam detector model based on a distance measure and a threshold value. The aim of the model is to study and select distinct features for email filtering, using the feature weight method for dimension reduction. The two-step clustering algorithm is used to generate a new feature called “Label”, which clusters the diverse emails and groups them according to inter-sample similarity. The spam filtering process is thereby simplified, and the logistic regression classifier is used to distinguish the hidden patterns of spam and non-spam emails. The experiments were conducted on the UCI spam dataset. The findings show that the email filtering results are promising compared to other modern spam filtering methods.
Keywords: Two-step clustering; spam filtering; classification; detection; feature weight; logistic regression


I. INTRODUCTION
Nowadays, email is considered one of the most economical and essential means of communication in the world. It is efficient, simple, and accessible to all because of the availability of the internet. This availability also makes email susceptible to many attackers and threats [1]. Spam is considered a very important threat to email; practically all email users in the world have to tolerate spam. The term spam is used for undesirable messages and junk mail sent to web users' inboxes. It is very convenient for spammers to send large numbers of messages to millions of users simply and without cost [2]. As a result, it is common for web users to receive unsolicited email regularly.
The adaptive nature of unsolicited email, sent through bulk mailing tools, prompts the need for spam detection. Various spam detection strategies based on machine learning methods have been proposed to address the problem of email spam overwhelming the system. Previous algorithms used in email spam detection compare each email message with spam and non-spam data before generating detectors. The system proposed in this study, inspired by the two-step clustering algorithm together with the logistic regression method, uses feature weight as an optimization procedure to generate detectors that cover the spam space.
Diverse strategies have been adopted to stop the threat of spam or to drastically reduce its volume. An anti-spam law was enacted that imposes a penalty on spammers who distribute email spam [3]. In spite of the different approaches and strategies that have been adopted to fight the threat of email spam, the web today still carries a huge amount of spam [4]-[6]. Therefore, more attention is needed on how the threat can be drastically reduced, if not completely eliminated. The fight against email spam is a very difficult one; it therefore makes sense to fight an adaptive email spam generator with an adaptive system.
In this study, a new hybrid method inspired by descriptive and predictive models is introduced. It consists of the Logistic Regression Method (LRM) as the prediction method, integrated with the Two-Step Clustering Algorithm (TSCA) as the description technique. To produce more precise filtering results, the standard dimension of the spam dataset is reduced based on feature weight (FW). The engineering aims of the hybrid method are threefold: first, to generate a new dataset based on feature weight (FW) in order to reduce the dataset dimensionality; second, to limit the maximizing distance between spam detectors and the non-spam space by using the two-step clustering algorithm (TSCA); and third, to filter email into spam and non-spam using the logistic regression method (LRM) based on the output of FW and TSCA. The aim of this study is to find a possible increase in accuracy and a reduction in misfiltered emails. This article is structured into six sections: Section 1 presents the motivation and introduction; Section 2 covers related work; the improved method and its integral system are described in Section 3; the experimental design and the results and discussion are given in Section 4 and Section 5, respectively; and the conclusion is given in Section 6.
www.ijacsa.thesai.org
II. RELATED WORKS
Several attempts have been made to block spammers and reduce the number of undesirable emails across the internet and in users' inboxes. One of these attempts is the anti-spam law [3], which imposes a penalty on those who send spam emails to users' inboxes. Two other common families of methods have been proposed for email spam detection: machine learning (ML) methods, and data mining (DM) and knowledge discovery (KDD) methods [4]. In the DM method, researchers introduced an origin-based filter technique that uses the Internet Protocol (IP) address to differentiate spam from non-spam messages. In the KDD method, on the other hand, researchers categorized messages as spam or non-spam based on sets of rules generated by KDD algorithms as filter techniques. The authors claim promising spam filtering results. However, the rules need to be updated continuously, which is time-consuming and inadequate for many users. Spam detection based on ML does not need to generate and update rules as the DM and KDD based methods do; only training data for classifying an email message is required. Classification techniques based on the characteristics of email messages were applied to learn the filtering rules and to distinguish spam from non-spam email messages [5]. Although several approaches have been adopted to stop spam, the web still carries a large amount of it [6], [7]. Therefore, more attention is needed on improving spam detection algorithms so that the threat can be significantly decreased, if not completely eliminated. To this end, many spam filtering algorithms have been applied in machine learning [5]. Examples of these algorithms include neural networks (NN), Support Vector Machines (SVM), k-nearest neighbor (KNN), and Naïve Bayes (NB).
Several studies have applied machine learning approaches to email spam filtering (Table 1). Marsono et al. [8] implemented naïve Bayes email spam filtering based on layer processing, without any requirement for reassembly. They suggested a middlebox control step to filter the received email spam from the email servers [9]. W. El-Kharashi et al. proposed a spam control method using a hardware structure for a naïve Bayesian inference engine [10]. The method can categorize more than 117 million features per second based on probability inputs [10]. Y. Tang et al. introduced a model that applied the SVM to email filtering. This model extracts spammers' behavior using the distribution of the global senders and then investigates it by assigning a non-spam value to each sender IP address [11]. Their empirical results showed that the SVM technique is precise and faster than the Random Forests (RF) algorithm [11]. Yoo et al. presented an email classification method called the Priority Email Personalized technique (PEP) [12]. The PEP focuses on analyzing personal social networks to detect user groups and to capture the user's viewpoint based on the user's social roles, and then applies them to email message classification. Silva et al. [13], [14] assessed the neural network algorithm for internet spam. They also investigated how different groups of features influence the filtering accuracy rate. Largilliere and Peyronnet [15] developed a combination approach for internet email spamming based on the PageRank method. Liu et al. [16] introduced features of user behavior for distinguishing spam and non-spam pages. They also developed a hybrid machine learning system aided by user behavior to filter spam pages [16]. Content-based feature methods were proposed by Castilho et al. [17] and Rungsawang et al. [18]. These studies investigated and extracted both link and content features for filtering spam pages, with some improvement of email spam detection using the ant colony optimization method [18]. They also used the topology of the web graph by extracting the link dependencies between internet pages.
The logistic regression (LR) method has some benefits compared to other classification methods such as SVM and Naïve Bayes. The strong conditional independence assumption of Naïve Bayes means that if two variables are correlated, it multiplies their evidence together as if they were independent, overstating the evidence. LR, on the other hand, is much more robust to correlated variables; if two features (A) and (B) are perfectly correlated, LR will allocate only half of the weight to w(A) and half to w(B). Thus, when there are many correlated variables, LR will generally assign a more precise probability than naïve Bayes. LR has also been reported to perform better than many other data mining methods on both small and large datasets [19], [20]. These reasons prompted the investigation and examination of LR in spam email filtering.

TABLE I
Yoo et al. [12] | Priority Email Personalized technique (PEP) | Analyzing the personal social networks to detect user groups and to capture the user's viewpoint based on the user's social roles, then applying them to email message classification
Largilliere and Peyronnet [15] | Combination approach for internet email spamming | PageRank method
Liu et al. [16] | A hybrid machine learning system aided by user behavior to filter spam pages | Features of user behavior for distinguishing spam and non-spam pages
Castilho et al. [17] & Rungsawang et al. [18] | Content-based features method | Extracting both link and content features for filtering spam pages, with some improvement of email spam detection using the ant colony optimization method and the topology of the web graph

III. PROPOSED MODEL AND OPERATIONAL SYSTEM
Hybrid models and their constituent systems, used as optimization strategies, have recently had broad success in solving many complex real-world problems. The importance of a combined system is not debatable, because an individual system has its weaknesses, and an improved system is intended to compensate for the weaknesses of these individual intelligent systems. A smart combination of the two-step clustering algorithm and the logistic regression method is investigated in order to complement the parameters of every component of the system. This works by using the advantages of each individual system against its disadvantages, while lifting each weak component of both systems, to achieve a reliable, consistent, and accurate intelligent system that can be extended for use in classification. The proposed system is used to form a better, improved system with weighted features based on the feature weight process.
The proposed method combines different techniques, namely the two-step clustering algorithm and logistic regression. The integrated techniques are applied through several steps, such as pre-processing (dividing the dataset into training and testing data) and weighting each feature based on the average values computed for it. The proposed system model is shown in Fig. 1.

A. Data Pre-processing
Pre-processing is one of the important data mining steps; it prepares the dataset before the mining procedure. In this study, this stage comprised three initial phases: data preparation (in which the dataset was divided into training and testing parts), feature weighting, and feature reduction based on the feature weight step.
Several benchmark datasets are available for email spam classification and clustering [21]. One of these datasets is Spambase, which is provided by the UCI Machine Learning Repository and has been used in spam filtering research such as [22], [23]. The main function of this dataset is to test the classification of email messages into spam and non-spam. The Spambase data consists of 4,601 email messages, with 39.4% (1,813) marked as spam and 60.6% (2,788) as non-spam [24]. Fig. 2 shows the breakdown of the email messages (spam and non-spam). In the proposed method using two-step clustering and logistic regression, the dataset was divided into 10 parts for 10-fold validation, in order to examine the variation across the whole dataset. These parts were used as training and testing data. Each part consists of 460 instances except the last part, which consists of 461 instances. The proposed method was evaluated 10 times, each time with nine parts used as the training dataset and one part used for testing. In each round, the testing part was swapped with one of the nine training parts, and each round was run separately.
A combination of two-step clustering and logistic regression was used to train classifiers on the generated spam and non-spam features in order to filter the testing sample.
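The 10-part split described above can be sketched as follows. This is a minimal illustration, not the study's code: the function names `ten_fold_indices` and `train_test_rounds` are hypothetical, and the contiguous, unshuffled partition is an assumption, since the text does not say how instances were assigned to parts.

```python
def ten_fold_indices(n_instances=4601, n_folds=10):
    """Split indices 0..n-1 into contiguous folds; the last fold takes the remainder.

    Assumption: contiguous, unshuffled folds (the text does not specify the assignment).
    """
    fold_size = n_instances // n_folds
    folds = []
    for k in range(n_folds):
        start = k * fold_size
        stop = n_instances if k == n_folds - 1 else start + fold_size
        folds.append(list(range(start, stop)))
    return folds


def train_test_rounds(folds):
    """Yield (train_indices, test_indices), using each fold once as the test part."""
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test


folds = ten_fold_indices()
print([len(f) for f in folds])  # nine parts of 460 and a last part of 461
```

Each of the ten rounds then trains on nine parts (4,140 or 4,141 instances) and tests on the remaining one.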

B. Data Clustering using Two-Step Algorithm
The two-step clustering technique is an exploratory algorithm developed to reveal natural groups within a dataset that might otherwise not be apparent [25]. The algorithm employed by this procedure has several desirable features that distinguish it from traditional clustering approaches:
• Ability to produce clusters from both continuous and categorical data types.
• Automatic selection of the number of generated clusters.
• Ability to handle very large datasets.

C. Clustering Fundamental
The two-step technique uses a distance criterion to handle both continuous and categorical data. The likelihood measure assumes that the variables in the cluster model are independent. In addition, each categorical variable is assumed to have a multinomial distribution, and each continuous variable is assumed to have a Gaussian distribution. Empirical internal testing indicates that the procedure is fairly robust to violations of both the independence assumption and the distributional assumptions. Nevertheless, it is necessary to be aware of how well these assumptions are met. The two steps of the technique are summarized as follows:
• Step 1. Pre-cluster the instances (or cases) into many small sub-groups. The procedure begins with the construction of a Cluster Feature (CF) tree. The tree starts by placing the first instance at the root, in a leaf node that carries the variable information for that instance. Every subsequent instance is then either added to an existing node or forms a new node, according to its similarity to the existing nodes.
• Step 2. Cluster the sub-groups resulting from the pre-clustering step into the desired number of groups. The number of clusters can also be chosen automatically. The leaf nodes of the Cluster Feature tree are grouped using an agglomerative clustering (AC) approach.
The AC can be used to produce a range of solutions. The optimum number of clusters can be determined by comparing these solutions using the Akaike Information Criterion (AIC) or Schwarz's Bayesian Criterion (BIC). The similarity scores between items are calculated using the Euclidean distance measure described in (1).
$$d(X, Y) = \sqrt{\sum_{i=1}^{n} (X_i - Y_i)^2} \qquad (1)$$
A Euclidean vector is the position of a point in a Euclidean n-space. Therefore, X = (X1, X2, … , Xn) and Y = (Y1, Y2, … , Yn) are Euclidean vectors starting from the origin of the space, and their tips indicate two points [26]. The two-step algorithm process is as described above.
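The two steps described above can be illustrated with a deliberately simplified sketch. It is not the two-step procedure itself: a sequential threshold pass stands in for the CF tree, the Euclidean distance between centroids replaces the likelihood criterion, and the number of clusters is passed in rather than chosen by AIC or BIC. All function names and the `threshold` parameter are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pre_cluster(points, threshold):
    """Step 1 (simplified): absorb each point into the nearest sub-cluster if its
    centroid lies within `threshold`, otherwise open a new sub-cluster.
    Each sub-cluster stores CF-like summaries: a count and per-dimension sums."""
    subs = []
    for idx, p in enumerate(points):
        best, best_d = None, None
        for s in subs:
            centroid = [v / s["n"] for v in s["sum"]]
            d = euclidean(p, centroid)
            if best_d is None or d < best_d:
                best, best_d = s, d
        if best is not None and best_d <= threshold:
            best["n"] += 1
            best["sum"] = [a + b for a, b in zip(best["sum"], p)]
            best["members"].append(idx)
        else:
            subs.append({"n": 1, "sum": list(p), "members": [idx]})
    return subs


def agglomerate(subs, n_clusters):
    """Step 2: repeatedly merge the two sub-clusters whose centroids are closest."""
    clusters = [dict(s, sum=list(s["sum"]), members=list(s["members"])) for s in subs]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            ci = [v / clusters[i]["n"] for v in clusters[i]["sum"]]
            for j in range(i + 1, len(clusters)):
                cj = [v / clusters[j]["n"] for v in clusters[j]["sum"]]
                d = euclidean(ci, cj)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        a, b = clusters[i], clusters[j]
        merged = {"n": a["n"] + b["n"],
                  "sum": [x + y for x, y in zip(a["sum"], b["sum"])],
                  "members": a["members"] + b["members"]}
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters


def two_step_labels(points, threshold, n_clusters):
    """Return one cluster label (1..n_clusters) per input point."""
    clusters = agglomerate(pre_cluster(points, threshold), n_clusters)
    labels = [0] * len(points)
    for label, c in enumerate(clusters, start=1):
        for idx in c["members"]:
            labels[idx] = label
    return labels


toy = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0), (10.0, 0.2)]
print(two_step_labels(toy, threshold=1.0, n_clusters=3))
```

The returned labels play the role of the new cluster feature that is later passed to the classifier alongside the original features.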
The distribution of the email messages and the clustering representation produced by the two-step clustering algorithm are shown in Fig. 3. Fig. 3 represents the clustering output of the two-step clustering method across the spam dataset. It was observed that the number of extracted clusters is 3. One of the advantages of the two-step clustering algorithm is that it can determine the number of clusters automatically. The smallest cluster is cluster 3, with 253 email messages (5.5%). The largest cluster, on the other hand, is cluster 1, with 3,524 email messages (76.6%). The ratio of the size of cluster 1 to that of cluster 3 is 13.93. A new feature labeled as cluster represents the output of these clusters. Through this feature, the clustering algorithm can be integrated with another mining method for possible improvement.
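The reported cluster shares can be checked with a few lines of arithmetic. The size of cluster 2 is not stated in the text; the value below is inferred by subtraction, assuming every one of the 4,601 messages falls into exactly one of the three clusters.

```python
total = 4601
cluster_sizes = {1: 3524, 3: 253}                        # sizes reported in the text
cluster_sizes[2] = total - sum(cluster_sizes.values())   # inferred by subtraction

# Share of each cluster in percent, and size ratio of cluster 1 to cluster 3.
shares = {k: round(100 * v / total, 1) for k, v in cluster_sizes.items()}
ratio_1_to_3 = round(cluster_sizes[1] / cluster_sizes[3], 2)

print(shares)        # {1: 76.6, 3: 5.5, 2: 17.9}
print(ratio_1_to_3)  # 13.93
```

The value 13.93 is therefore a plain size ratio (cluster 1 is about 14 times larger than cluster 3), not a percentage.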

D. Data Classification using Logistic Regression
Logistic regression is considered one of the important statistical methods for analyzing data in which one or more independent features determine an outcome. The outcome is measured with a dichotomous feature, which means that there are only two possible outcomes. In logistic regression, the dependent variable is dichotomous or binary. For example, the data can only be coded as 1 (positive, spam, malware, detected, etc.) or 0 (negative, non-spam, non-malware, not detected, etc.). One of the main aims of logistic regression is to find the best-fitting model to represent the association between a set of predictor (independent) features and the dichotomous characteristic of interest. Logistic regression produces the coefficient values together with their standard errors and significance levels. The logit transformation of the probability $p$ of occurrence of the characteristic of interest is formulated as:
$$\operatorname{logit}(p) = \ln\frac{p}{1 - p}$$
In classification based on logistic regression, only two classes, y = 0 and y = 1, are formulated. A parametric form of P(y = 1 | x, w) is considered, where w is the parameter vector:
$$P(y = 1 \mid x, w) = \frac{1}{1 + e^{-w^{T} x}}$$
The log odds of class 1 are then a linear function of x:
$$\ln\frac{P(y = 1 \mid x, w)}{P(y = 0 \mid x, w)} = w^{T} x$$
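A minimal sketch of such a classifier is given below, to make the link between the probability model and the linear log odds concrete. It assumes plain batch gradient descent on the log loss and an explicit intercept `b` kept separate from `w`; the text does not state how the model was fitted, so the training loop, the learning rate, and the toy data are illustrative assumptions only.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(w, b, x):
    """P(y = 1 | x, w) with an explicit intercept b (an assumption of this sketch)."""
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights by batch gradient descent on the average log loss."""
    n_features = len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi   # derivative of log loss w.r.t. z
            grad_b += err
            for j, v in enumerate(xi):
                grad_w[j] += err * v
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


# Toy, made-up data: label 1 stands for spam, 0 for non-spam.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [2.0, 1.8], [1.7, 2.2], [2.4, 2.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(X, y)
preds = [1 if predict_proba(w, b, x) >= 0.5 else 0 for x in X]
print(preds)
```

Thresholding the probability at 0.5 is equivalent to checking the sign of the linear log odds, which is why the decision boundary of this classifier is linear in the features.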
The proposed method uses the logistic regression classifier discussed above to classify and filter email into spam and non-spam. The experimental design based on logistic regression is discussed in the next section.
IV. EXPERIMENTAL DESIGN
This experiment aimed to detect and filter spam and non-spam messages among the email messages. The experiments were run on 4,601 email messages, each labeled as spam or non-spam according to the Spambase dataset. The method was executed by searching for spam and non-spam email messages within the original dataset.
The spam dataset was broken down into 10 sets of increasing size. The first set contained 460 email messages; each subsequent weighting round added 460 more instances, so that the second, third, fourth, and later sets contained 2, 3, 4, … times that amount, up to the tenth set, which contained the whole dataset. The objective of this grouping procedure was to study the spammer's pattern for each message so that it could be brought into focus. As a first stage, the average value of each feature in the dataset was calculated. It was noted that, as the number of instances grew, some features had very small average values or an inverse proportion between the number of instances and the feature values, while others had a direct proportion. These indicators reflected the increasing and decreasing weighted score between the email features and the pattern of the spammer's writing style. This observation was used as a threshold for separating important features from unimportant ones. The significant features were then selected to enter the second training and testing experiment. Conversely, the features that had an inverse proportion were ignored.
Training and testing were carried out once again after feature selection. The accuracy differed from that of the first experiment, because the degree of learning depends on the number of significant features extracted from the email messages: reducing the number of insignificant features led to a rise in filtering accuracy, and vice versa. The accuracy score was computed, and the Spambase dataset was then employed for the training and testing process. The significant features selected by the weighting process are shown in Table 2.
Table 2 shows sample results across the groups of instances (messages). Each email message is represented by 57 features, and one further feature, named class, gives the type of the suspected message, either spam or non-spam. According to the average values of these features, it was observed that several features had a very small value or an inverse proportion. Such a score indicates that the feature is unimportant or has no effective influence on the filtering of spam and non-spam. The significant features, on the other hand, are reported in Table 2. This table lists the features that had a direct proportion and can clearly affect the classification result when filtering email messages into spam or non-spam. To improve on the results reported in Table 4, a weight was computed for each feature by a formula defined in terms of the following quantities: the weight of the feature at instance count i; F(i), the total number of values in feature i; and i = (460, 920, 1380, 1840, 2300, 2760, 3220, 3680, 4140, and 4601). After the improvement process using feature weight, the weights were observed to reinforce the inverse and direct proportions noted above.
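The weighting and selection idea can be sketched as follows, under explicit assumptions. The weighting formula itself is not reproduced here, so the sketch assumes that the weight of a feature at checkpoint i is its mean over the first i instances, as suggested by the description of the averages. The selection rule (keep a feature whose mean does not fall between the first and last checkpoint and exceeds a small floor) and the names `running_feature_means`, `select_features`, and `min_weight` are hypothetical.

```python
# Instance-count checkpoints as listed in the text: 460, 920, ..., 4140, 4601.
CHECKPOINTS = [460 * k for k in range(1, 10)] + [4601]


def running_feature_means(rows, checkpoints):
    """Mean of each feature over the first i rows, for every checkpoint i.

    Assumption: this running mean stands in for the feature weight."""
    n_features = len(rows[0])
    means = {}
    for i in checkpoints:
        subset = rows[:i]
        means[i] = [sum(r[f] for r in subset) / len(subset) for f in range(n_features)]
    return means


def select_features(means, checkpoints, min_weight=0.0):
    """Keep features whose mean does not decrease from the first to the last
    checkpoint (direct proportion) and is larger than min_weight (not very small)."""
    first, last = means[checkpoints[0]], means[checkpoints[-1]]
    return [f for f in range(len(first))
            if last[f] >= first[f] and last[f] > min_weight]


# Toy, made-up rows with three features and two checkpoints.
toy_rows = [[1.0, 5.0, 0.0], [1.0, 5.0, 0.0], [3.0, 1.0, 0.0], [3.0, 1.0, 0.0]]
toy_checkpoints = [2, 4]
toy_means = running_feature_means(toy_rows, toy_checkpoints)
print(select_features(toy_means, toy_checkpoints, min_weight=0.01))  # [0]
```

In the toy example, feature 0 grows with the instance count and is kept, feature 1 shrinks and is dropped, and feature 2 is dropped because its value stays negligible.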

V. RESULTS AND DISCUSSION
In this study, the experiments were built on two types of spam dataset, original and weighted. The original dataset is the common spam data normally used in spam filtering research, while the weighted dataset is generated from the original dataset (Spambase) by calculating the average of each feature in the original data. The purpose of the weighted data is to study the spammer's pattern for each feature and to mark the feature as significant or non-significant. Thus, only the features selected by the weighting process are used for spam filtering. Selecting the important features is expected to increase spam filtering performance because of the feature reduction achieved by the weighting process. To classify and filter the email messages, different empirical studies based on logistic regression and the two-step clustering algorithm were conducted. The results are presented in four phases: logistic regression with all features in the dataset; logistic regression based on important features only; hybrid two-step and logistic regression with all Spambase features; and combined two-step with logistic regression based on the important features extracted by the feature weight process. The filtering accuracy is computed with the equation:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where, taking spam as the positive class, True Positive (TP) is the number of spam emails correctly classified as spam; False Positive (FP) is the number of non-spam emails classified as spam; True Negative (TN) is the number of non-spam emails correctly classified as non-spam; and False Negative (FN) is the number of spam emails classified as non-spam.
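As a small worked example of the accuracy equation, the sketch below counts the four outcomes from true and predicted labels and derives the accuracy and its complement, the misfiltering rate. Spam is taken as the positive class (label 1); the function names and the toy labels are illustrative and not taken from the experiments.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN), treating `positive` as the spam label."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive and t != positive:
            fp += 1
        elif p != positive and t != positive:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Share of all messages that were filtered correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


# Toy, made-up labels: 1 = spam, 0 = non-spam.
y_true = [1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 1, 0, 1]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, fp, tn, fn)
print(tp, fp, tn, fn, acc, 1 - acc)  # 3 1 3 1 0.75 0.25
```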
The results of email filtering using the logistic regression method with all dataset features and with the important features only are given in Tables 3 and 4, respectively.
The tables show the results of 10-fold cross validation, which examines all parts of the dataset. Each part is used in one round, from round 1 to round 10. In each experimental round, nine parts form the training dataset while the remaining part forms the testing dataset. The testing part rotates, so that in the other rounds it serves as one of the training parts. The overall results are the equally weighted average over all ten parts. These results comprise the filtering accuracy on the training and testing data, the misfiltering ratio, the area under the curve, and the number of messages correctly filtered as spam and non-spam in the dataset. For the 30 important features listed in Table 2, excluding the target feature (Class), the achieved results are better than those obtained with all dataset features. This indicates that the selected features are more significant. The processing time is also reduced accordingly, because only the extracted important features are tested rather than all features. Another criterion used for evaluating the proposed method is the area under the curve (AUC). It is an assessment metric commonly used in binary classification problems. It is computed from the true positive and false positive rates as the threshold for classifying an element as 0 or 1 is varied: if the predictor is good, the true positive rate rises rapidly and the AUC is close to 1.
On the other hand, if the predictor is no better than random guessing, the true positive rate rises linearly with the false positive rate and the AUC is around 0.5 [27], [28]. The AUC metric is important because it can evaluate the predictor's performance on an unbalanced dataset. It is independent of the fraction of the test population that belongs to class 1 or class 0. This matters here because the spam and non-spam classes in the dataset used are not balanced. The AUC results in Table 4 reinforce the filtering accuracy results and show better results after the weighting process and feature selection. Fig. 4 and 5 represent the average training output of the spam email filtering before and after feature selection using the feature weight process. The dataset was examined with two techniques: logistic regression, and the combined technique of logistic regression and the two-step clustering algorithm. Table 3 presents the prediction of email filtering using the logistic regression method, which achieved average accuracy results of 90.8% for the training phase and 94.96% for the testing phase before the feature weighting process. Table 4, in contrast, shows that average accuracy results of 93.03% in the training phase and 96.48% in the testing phase were achieved after selecting significant features using the feature weight process. Tables 5 and 6 illustrate the prediction of email filtering using the hybrid of logistic regression and the two-step clustering algorithm, which obtained average accuracy results of 97.26% and 97.40% for the training and testing phases, respectively, before the feature weighting process. After selecting significant features using the feature weight process, the average accuracy results were 98.41% and 98.37% for the training and testing phases, respectively.
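The behavior of the AUC described above can be illustrated with its rank-based formulation: the AUC equals the fraction of (spam, non-spam) pairs in which the spam message receives the higher score, with ties counted as one half. This is a generic sketch and is not claimed to be the routine used to produce the tables; the function name and the toy scores are illustrative.

```python
def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print(auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))    # 1.0: every spam message outranks every non-spam message
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))   # 0.75: one of the four pairs is ranked wrongly
print(auc([0, 0, 1, 1], [0.5, 0.5, 0.5, 0.5]))    # 0.5: constant scores carry no information
```

Because the measure depends only on how positives are ranked against negatives, it does not change when the proportion of spam in the test set changes.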
To explore the differences between this study's spam filtering technique based on the logistic regression and two-step clustering algorithms before and after improvement using the weighting process and important features, an Independent Sample T-test was performed, as in [29]. A difference is considered significant if the resulting value is below 0.05. In Table 7 the significance values are 0.006 between this study's combined LR-Two-step and LR before feature weight, and 0.0007 between the combined LR-Two-step and LR after feature weight. This indicates that the combined method achieved a significant enhancement in the accuracy results. Thus, it was concluded that there is a significant difference before and after the feature weight and combination process. Table 7 shows the T-test statistical significance results.
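For reference, the test statistic of an independent-samples t-test can be computed as in the sketch below. It assumes the pooled-variance (equal variances) form, which the text does not specify, and it stops at the statistic and degrees of freedom: converting them to a significance value needs the t distribution, which is left to a statistics package. The sample values are toy numbers, not the per-fold accuracies of the study.

```python
import math
import statistics


def independent_t(a, b):
    """Pooled-variance t statistic and degrees of freedom for two independent samples."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    std_err = math.sqrt(pooled_var * (1.0 / na + 1.0 / nb))
    t_stat = (statistics.mean(a) - statistics.mean(b)) / std_err
    return t_stat, na + nb - 2


# Toy samples: means 3 and 4, both with sample variance 2.5.
sample_a = [1.0, 2.0, 3.0, 4.0, 5.0]
sample_b = [2.0, 3.0, 4.0, 5.0, 6.0]
t_stat, dof = independent_t(sample_a, sample_b)
print(t_stat, dof)
```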
Another comparison, between this study's integrated technique and current approaches, is shown in Table 8 and in Fig. 6 and 7. It was noted that the combination of logistic regression and the two-step clustering algorithm obtained the best accuracy results with both all features and the important features of the Spambase dataset.

Figure legend: AVG filtering accuracy; AVG misfiltering accuracy; AVG area under the curve.
Fig. 6 and 7 represent the comparison between the proposed method and current spam classification methods. In Fig. 6, the comparison is based on all features of the Spambase dataset, while Fig. 7 represents the comparison based on the significant features extracted using the feature weight process. It was observed that the proposed LR-two-step technique achieved the best result with both all dataset features and the significant features. The lowest result, on the other hand, was obtained by the naive Bayes method, as shown in Table 8.

VI. CONCLUSION
Spam is considered one of the main challenges facing email. Spammers can easily steal information by sending random spam emails over the internet. This research investigated email messages using the logistic regression method to classify them as spam or non-spam. A feature weight based on the amount of data is one of the contributions proposed in this study for selecting the significant features. Another contribution is a technique that integrates logistic regression with the two-step clustering method to differentiate spam from non-spam email messages. The benefit of the two-step clustering method is that it groups similar email features so that the spammers' pattern can be studied by focusing on their behavior in constructing email messages. The proposed method used the UCI Spambase dataset to build the spam filtering model. Based on the results obtained, it was concluded that not all features of email writing style are used by spammers, and that using only the important features selected by the feature weight process can improve the computational time of email spam filtering. The proposed method was tested with the T-test of statistical significance to demonstrate the improvement before and after the feature selection and combination process. It has been shown that LR-Two-Step can significantly enhance the filtering accuracy and decrease the misfiltering error on the spam dataset.

Fig. 4 and 5. Average training output of the spam email filtering before and after feature selection using the feature weight process, for logistic regression and for the combined logistic regression and two-step clustering technique.

Fig. 6. A comparison between this study's proposed methods and other spam filtering methods based on all features.

Fig. 7. A comparison of this study's proposed methods and other spam filtering methods based on important features.

TABLE III. RESULT OF LOGISTIC REGRESSION WITH ALL FEATURES IN THE DATASET

TABLE IV. RESULT OF LOGISTIC REGRESSION WITH IMPORTANT FEATURES

TABLE V. RESULT OF HYBRID TWO-STEP AND LOGISTIC REGRESSION WITH ALL SPAM BASE FEATURES DATASET
TABLE VI. RESULTS OF COMBINED TWO-STEP WITH LOGISTIC REGRESSION BASED ON IMPORTANT FEATURES