Improvement of Persian Spam Filtering by Game Theory

There are different methods for dealing with spams; however, since spammers continuously use tricks to defeat the proposed methods, hence, filters should be constantly updated. In this study, Stackelberg game was used to produce a dynamic filter and the relations between filter and adversary were modelled as a turn game in which there is a leader and a follower. Then, an attempt was made to solve the game as an optimization program via the evolutionary stable strategy (ESS). The dataset used in the study for evaluating and analyzing the proposed method was a real dataset including the emails of four users’ personal emails. The results of the conducted evaluations and investigations indicated that the proposed method had an 8% improvement over the three-class classification method and a 0.8% improvement over the ESS-based equilibrium point method. Keywords—Spam Filtering; Game theory; Stackelberg game; Evolutionary Stable Strategy; Email Classification; Stackelberg equilibria


INTRODUCTION
misuse of email is becoming an increasingly serious problem for both individuals and organizations, The occurrence of more and more spam has made up great threat to the security of the internet, it not only occupies numerous network bandwidth and causes great network resources, but also affects people's normal life.Since spammers know some ways not to be recognized, techniques are discussed which may detect and filter spams out of legal emails; indeed, such techniques and methods can be applied on words, phrases or specific fields.Nevertheless, even such good and acceptable filters sometimes fail to accurately classify unsolicited messages.The two chief categories in technique classification are list-based methods and content-based methods.Recently, content-based methods have attracted more attention and interest; The reason for the less popularity of list-based methods in comparison to content-based ones is that list-based methods have fixed rules and regulations which need to be updated frequently.Spammers usually make some changes on spams so that filter would identify them as regular and normal mails.
In this study, the behavior and relation between adversary (spammer) and spam filter (classifier) was investigated.In the method introduced in the present study, relation between spammer and filter modelled as a sequential Stackelberg game and filters detect spams by learning the adversary's behavior.
On the other hand, adversary try to deceive filters by learning their parameters.For finding the equilibrium point of the game in infinite case, evolutionary stable strategy (ESS) was used here.
The entire paper is divided into five sections.Section II provides the related works.In III we formulate Stackelberg games with infinite strategy space for spam detection and Then, explain the evolutionary stable strategy to find the needed equilibrium.in section IV we describe the experimental setup and presents and discuss the experimental results.And it followed by conclusion in section V.

II. RELATED WORKS
Numerous methods have been proposed in the literature for dealing with unsolicited emails and researchers continue introducing new methods for managing the different extensive dimensions of spams.As a case in point, list-based methods such as black list, white list and grey list can be mentioned.By classifying senders into the two category of spam and legal senders, these filters try to identify spams.Although these filters are simple, they are not adequately efficient because they possess fixed rules and regulations which can be identified and eluded by spammers.In [1], methods were discussed in which they change addresses to detect spams as emails and also to detect websites which spammers use to hide their webpages.In [2], DNSBL was used for analyzing the behavior and the efficiency of the black list and the elimination power and the conditions at which black list is not used have been discussed.The results showed that combining black listing with a spam refining program can have higher efficiency.
Due to the shortcomings of the list-based methods, contentbased methods which are categorized into word-based, rulebased and statistical filtering methods gained more popularity and attraction.Machine-learning methods are a kind of content-based methods.Each of the machine learning methods can have acceptable performance, the studies [3], [4], [5], [6], [7] demonstrated the positive effect of pre-processing and feature selection in machine learning methods such as Naïve Bayesian, Support Vector Machine, K-Nearest Neighbor, Artificial Neural Network and Decision Tree.Also, the combination of these methods leads to the improvement of the precision [8].Based on the results of the studies reported in [9], [3], [8], [10], [11] and [12], it can be maintained that NB is one of the strongest available algorithms for categorizing emails which uses probabilities for detecting the class of new data.www.ijacsa.thesai.orgAlso, filters, it is deceived less than list-based filters.In general, each of the machine-based learning methods has its own merits and demerits.According to the findings reported in [4] and [5], the benefit of SVM method is that it can solve the problem of finite samples.However, it should be noted that the Kernel function and c parameter should be determined correctly for it.A function was proposed for kernel in [13].The main challenge of the artificial neural networks is determining the number of nodes in the secret layer and the number of appropriate repetitions for finding weight.Nevertheless, despite having high precision, extensive calculations are one deficiency of the artificial neural network [6], [7], [14], [15] .The simplicity of implementation and the ease and speed of learning with regard to KNN have resulted in the popularity of KNN.At the same time, it should be noticed that the specification of similarity criterion and the selection of the appropriate number of neighbors (k) are the main problems of this algorithm [16].Decision tree is a common method in datamining which is easy to understand and evaluate.The exponential growth of the decision tree in line with problem growth is a challenge of this algorithm.Hence, NBTree and J48 were proposed to sort out this issue [17].
Inasmuch as spammers try to deceive filters by changing the content or components of emails, hence, investigating the behavior of the spammer and considering his actions will lead to the optimization of the email classifying methods.The issue of learning adversary's behavior was first proposed in [18] where a game between spam filter and adversary was proposed.However, since any changes in the players will make this method non-executable, hence, it is practically unusable.For solving this problem in [19] and [20], spam senders try to identify filter parameters by first learning the adversary's behavior and reverse engineering; next, they plan their attacks based on it.In this method, there is no equilibrium point for the adversary and spam filter [21].Also, in [22], adversary's learning model was designed based on game theory.The rationale behind this modelling is to achieve an equilibrium point for the adversary and spam filter and it is assumed that filter and adversary have full understandings of each other's behavior.The genetic heuristic algorithm was used for solving this game in the unlimited space and achieving an equilibrium point.In [21], the relations between the adversary and filter was modelled as a turn game where there is a leader and a follower.Then, an attempt was made to solve the optimization program through evolutionary strategies.The results indicated that filtering accuracy rate has been improved in this method.In [23], the relations between spammers and user was modelled as a competitive game.Then, the strategy between them is consequently predicted.Next, the prediction is used for adjusting spam filters.In this method, it was assumed that there is a filter on the system and the user can choose whether he wants to read the received email or the spam.This game has a Nash equilibrium and the objective is to achieve an equilibrium so that the adversary's strategies can be guessed by means of this equilibrium.

III. THE PROPOSED METHOD
The method introduced in this paper is based on game theory which was used to guess the adversary's action and, consequently, adopt the best strategy for the filter.As illustrated in figure 1 for the proposed method, after preprocessing and using a learning algorithm for filtering spams, the relations between adversary and filter was assumed to be of the Stackelberg game type.That is to say, the first player, namely the adversary plays the role of leader and it can impose its own strategy on the second player, i.e. filter.Hence, assuming that the follower selects the best possible decision based the leader's strategy, hence, the leader tries to adopt the most appropriate decision.Then, based on the leader's strategy, the follower reacts with respect to the payoff.In most cases, backward induction is used for finding Stackelberg equilibrium point.In the assumption of the problem, the strategy space for the two players was considered to be complex and unlimited.

A. DataSet
Inasmuch as there was no available Farsi dataset for filtering spam, the dataset used in the study included personal emails of 4 users for approximately 8 months which were collected in 2012-2013.The dataset which was collected by the researchers included 682 legal emails and 629 spams.Tagging was used for isolating email sections.Among the collected emails from the dataset, about 80% was used for learning system and 20% was used for the testing phase.www.ijacsa.thesai.org

B. Preprocessing
Email classification methods are similar to text classification methods.Figure 2 depicts the text classification.For doing the preprocessing stage for each user, at first, spam classes and legal classes should be read from the related folder and the class of each one should be specified.After reading the content of emails from each class, the available words in the email which were obtained from the body and title parts were separated token by token and the emphasis words and the stop words were eliminated.Next, the words were extracted from inside of the emails; then, the obtained words were illustrated in a vector based on their frequencies.Next, using CHI method [24] and [25], some words which could distinguish classes from one another were selected as the best words.Afterwards, by weighing words via TF-IDF method [26], the weight related to each feature was calculated; then, feature mean and variance were calculated for class which were used in the next stage.

C. Game modelling
It was assumed that data belong to a specific onedimensional space and they are distributed normally.Spam distribution is determined by and legal email distribution is determined by .İn the game, spammer plays by moving the boundary towards and also filter tries to maximize its payoff by changing the boundaryline.
The initial state of the game is illustrated in figure 3. TPs stand for spams which are truely spams but FPs are legal emails which were wrongly classified as spams.Moreover, TNs are the legal emails which are correctly classified but FNs denote wrongly classified spams.Adversary's purpose is to enhance the number of FNs.Hence, it will lead it message feature space towards the space of legal email feature (figure 4).In contrast, as the filter notices the adversary's strategy and system learning for a second time, it determines the best threshold (figure 5).The ultimate goal of the game is to achive the equilibrium point.Assuming that players are aware of one another's payoff function, the purpose function of the players will be used to examine it.Then, ESS is used to calculate the equilibrium point of the game.

D. Stackelberg equilibrium point
Each player consists of a set of strategies which are labelled as U and V for the leader and the follower, respectively; also, u and v denote the feature space of them, respectively.Furthermore,each player has one differentiable payoff function which is defined as .U and V are bounded and convex and regarding the adversary player's movement, the strategy is defined in the following way [22]:   Stackelberg equilibrium point is an equilibrium in which until the time the player sticks with its own selected strategy, none of the other two players have any motivations for changing their strategies [21].Refers to the cost function for i player.The best response for each i player is denoted by R i .One solution for the Stackelberg equilibrium is produced by the following method [21]:  By doing so, the follower responds whether the optimal value is:  One mathematical method for obtaining Stackelberg equilibrium in two-player games is to apply bi-level programming problem [27].In this study, equation 8 can be used as the bi-level programming problem where each player tries to maximize its own payoff [22].


 www.ijacsa.thesai.org In equation 9, h and g variables are the limitations in U and V action spaces.Since bi-level programming problem is regarded as an NP-Hard problem, ESS was used for finding the equilibrium point.

E. Formulizing payoffs
Using probability density function in the normal space √ , we can obtain cumulative density function ∫ .The adversary's payoff can be defined as the degree of spams which can pass through filter as legal email-the cost of transmission from one normal space to another normal space (equation 6).In case there is a dataset with several features, assuming that the features are independent, equation ( 7) is obtained [21].( 6) FNR (7) ∑ q denotes the number of features and KLD is a notion for estimating the impact of transmitting the space of spam the size u on the initial data; indeed, it is a criterion for evaluating the degree of closeness of probabilistic models to the accurate distribution of the population.For measuring the impacts of transmission from to equation ( 8) is used Filter rewared is interpreted as the enhancement in the rate of correct acceptance and correct rejeaction [22].( 9) TPR+TNR (10) ∑ α adjusts the distance criterion on the degree of spam space transmission.İts enhancement leads to FNR reduction.In contrast, by controlling the degree of movement of w, β leads to the FPR reduction.İn case zero is selected for both parameters, will be considered as the threshold algorithm.Moreover, since both parameters are zero, the adversary will not pay any cost for transmission.As a result, by selecting the highest transmission value, spam space will be completely consistent with the space of legal emails.Under this condition, the degree of system error will be 50%; in fact, α can not practically be zero.As more value is regarded for α, the adversary will have to pay more cost.Consequently, a smaller value is selected for u. in case β has the value of zero, the system will not distinguish between wrong and right pass rate.However, as β increases, filter will hardly change the threshold.Consequently, a value close to the previous threshold value will be selected.This parameter can be used to distinguish between wrong and right pass rate.In other words, it can be used for weighing.

F. Evolutionary stable strategy (ESS)
In case ESS strategy is selected by all the members of a population, no other mutation strategy can overcome it.If a strategy is evolutionary and stable, hence, there will be Nash equilibrium but the requirement for the evolutionary stability of a Nash equilibrium is that it will be a strict Nash equilibrium.In most studies and books, ESS method has been considered for the symmetrical games.Nevertheless, in this study, there are two non-symmetrical (different) players.Player I has the strategies of i=1,2,…, m and player J has the strategies of j=1,2,……...,n. at any moment, p i and q j indicate the selection of i and j by the players I and J, respectively.p = ( and q = ( are the probabilistic vectors which determine population status together.The fitness of i is denoted by F(i) and the fitness of j is denoted by G(j) which depend on (p,q) at the given moment.Hence, the fitness function is defined as F(i|p,q) and G(j|p,q).

Overall schematic view of the proposed algorithm: ESS(evolutionary stable strategy)
1. Initial population is randomly created 2. Players play together and the profitability of each strategy is measured.

3.
The strategies with more profitability values are reproduced in society.

4.
Steps two and three are repeated until the society achieves stability.5. Some members begin to change their own strategies (mutate).6.Until reaching the termination condition, algorithm is repeated from step 2.

A. Results of training phase
As mentioned before, game model was produced in the training phase and was executed on each component of dataset; then, using ESS which was proposed in [21] and the ESS of the equilibrium point, the results were measured.The higher the adversary payoff and the higher the filter payoff, the algorithm will be more precise and accurate.Variables of the problem were evolved in 100 generations and the best member of each generation was used.Similar to the study reported in [21], the value of α parameter in the Kullback-Leibler equation was considered to be 0.01.As illustrated in figures 7 and 8, it can be argued that ESS method is far better than ES (evolutionary Strategy).For a second time, algorithms were executed separately for 100 features and average payoffs were taken into consideration.90 samples were used for training.Payoff function is linear and the problem variables were evolved in 100 generations.ineach generation, average chromosomes were given (not the best chromosomes).The payoff which is obtained for ESS is the best chromosome of the generation because, at the end of each execution, strategies are stabilized and all of them become identical.Regarding these issues, as depicted in figures 9 and 10, it can be maintained that, in most cases, ESS has better performance than ES.Furthermore, with respect to the results, it can be observed that 100 generations are not sufficient for evolution.Hence, more repetitions are need for achieving evolution.However, since there is not one equilibrium point, it can be argued that, in 100 generations, a near-optimal response can be achieved.Based on the strategy of the opposite side, payoffs in each generation can be more or less because each player not only has unlimited strategy space and acts smartly but also it is aware of the strategy of the other side.Based on the other side's strategy, it selects the most appropriate condition for itself.

B. Results of the experimental phase
In this phase, the final position of the filter and adversary at the equilibrium point were obtained for each feature by the data of the training phase.Then, these positions were used to obtain the following evaluation criteria:  One sample is selected from each dataset of the experiment phase.It should be specified that each feature dedicates the sample to which classification.
 Class label for all the features of the sample is determined by the previous method.
 The sample class depends on the largest number of labels which is specified by the features.
The researcher and collector of the dataset has argued that 65 samples and 90 features are the best and most desirable number of samples and features for training and testing in his own research; she used the same number of samples and features in his evaluations.In a similar vein, in this study, 90 features and 65 samples were used to test the implementation methods.The average evaluation criteria were used and the equilibrium point was measured by the ES and ESS which are given in table 3.
The highest precision was desirable in the system used in this study because the higher is the precision, the lower will be    11.Comparison of average accuracy of the results of the proposed method with those given in [9] As showed in Figure 11 the methods of Stackelberg game and finding the equilibrium point by ES and ESS had 89.4% and 90.22% which achieved the highest precision among other methods.It should be noted that the method used in [9] which had used three-class classification on the dataset of Farsi emails had the precision value of 86.42%.

V. CONCLUSION AND DIRECTION FOR FURTHER RESEARCH
In the present study, game theory was used to investigate the relations between filter and adversary and the detection of spams by filter was considered.Regarding their payoff functions, the two players selected the best strategies.After repeating the game, they reached an equilibrium point.It should be pointed out that both players played smartly and they were interested in enhancing their own payoffs and, consequently, favored a reduction in the adversary player's payoff.The results of the experiments indicated that the evolutionary stable strategy (ESS) was able to find game equilibrium point with more precision than ES method.That is, ESS had 79.6% accuracy rate, 90.2% precision, 75.5% F criterion and 70.3% recall.In contrast, ES had 79.5% accuracy rate, 89.4% precision, 75.4% F criterion and 69.3% recall.Since speed and time are of high significance in modern life, as a direction for further research, we should focus on reducing and optimizing training time in future studies.

Fig. 6 .
Fig. 6. schematic view of the proposed algorithmIV.EXPERIMENS & RESULTSIn the experimental phase, after training the system by the training data (80% of samples), test data (20% of samples) is used for evaluating system.The results of experiments are given below in the form of tables and figures.Evaluation criteria: in text classification problems, the following criteria are usually used:(11)

Fig. 7 .Fig. 8 .
Fig. 7. average filter payoff with one feature on Evolutionary Strategy and Evolutionary Stable Strategy

Fig. 9 .
Fig. 9. Average filter profits for 100 features on Evolutionary Strategy and Evolutionary Stable Strategy in 100 generations (average chromosomes)

TABLE I
Indeed, it should be noted this issue is of high significance in filtering and investigating spams.The obtained results indicated that the precision of the filter for the modelling method and solving the game by ESS and ES was better than those of other methods.

TABLE II .
AVERAGE EVALUATION CRITERIA FOR FINDING GAME EQUILIBRIUM POINT BY ES AND ESS ON REAL DATASET

TABLE III .
RESULTS OF THREE-CLASS CLASSIFICATION FOR FARSI EMAILS BASED ON THE CRITERIA: PRECISION, RECALL AND F ON REAL DATASET