Sentiment Based Twitter Spam Detection

Spam is becoming a serious threat to users of online social networks, especially Twitter. Twitter's structural features make it more vulnerable to spam attacks. In this paper, we propose a spam detection approach for Twitter based on sentimental features. We perform our experiments on a data collection of 29K tweets, with 1K tweets for each of 29 trending topics of 2012 on Twitter. We evaluate the usefulness of our approach using five classifiers, i.e., BayesNet, Naive Bayes, Random Forest, Support Vector Machine (SVM), and J48. The spam detection performance of Naive Bayes, Random Forest, J48, and SVM improved with the combination of all our proposed features. The results demonstrate that the proposed features provide better classification accuracy when combined with content and user-oriented features.
Keywords: sentiment analysis; spam detection; Twitter


I. INTRODUCTION
Spam is a real threat to the usefulness of the web. Spammers mask their content as useful or relevant, and hence it is delivered to the user. Legitimate users consume this spam, considering it relevant to their information needs. Clay Shirky [2] remarked that a communication channel isn't worth its salt until the spammers descend.
Spam is not easy to stop. For several years, email services such as Gmail, Microsoft, and others have been successfully detecting spam emails, yet spam emails still circulate on the web. These services have reported that spam accounts for 90 to 95 percent of all email exchanges [3], [4], [5]. Even with successful detection, companies are unable to stop spammers, which points to the economic benefit spammers gain when they trap a user into clicking on a spam link. The severity of the threat posed by spamming has increased with the emergence of online social networks, and Twitter, one of the most popular of them, has been highly affected by spam. Twitter spamming is more threatening because it is targeted at Twitter's trending topics, which are somewhat easier to penetrate, especially because of the hashtag operator. Another fact that makes Twitter an easier and more fruitful target for spammers is the variety of its audience. Twitter users span all sectors of life: teachers and students, celebrities and politicians, marketers and customers, and the general public. They belong to all age groups, but the age group that uses Twitter most widely is 55 to 64 years. About 60% of users access Twitter from their cell phones¹. Twitter has 288 million monthly active members, which makes it a rapidly growing social networking site. Around 400 million tweets are posted daily, and the average number of tweets per user account is 208.
Due to this continuous flow of information, a user faces many problems with search results that contain recurring and irrelevant information. This can be very frustrating at times, since a user has to scroll through all of the information in order to get an overall view of a topic. Spam detection on Twitter is difficult due to the heavy use of URLs, abbreviations, informal language, and modern language concepts [6]. Traditional methods of detecting spam fall short here. To date, studies have been published on many techniques for detecting spam on Twitter and blogs using different features. Recognizing the significance of spam on Twitter, we take our motivation from this user need and set out to design and develop improved techniques to detect spam on Twitter.
In this paper, we propose an approach for detecting spam tweets. The approach is based on the sentimental features of a tweet. The idea is to exploit the strategy that spammers use to push a user to click on a particular link. They rely on motivational words (like 'the best web site', 'excellent service', etc.) to make people believe in a certain tweet (examples of some spam tweets given in the
In Section II, we highlight some of the previous work, while in Section III we discuss the proposed features. In Section IV-A, we describe the data collection used for experimentation. In Section IV, we describe our experimental results and compare different feature combinations, and the conclusion is given in Section V.

II. RELATED WORK
In this section, we describe several works related to spam detection on Twitter. As discussed above, spamming on Twitter differs in technique and in nature from other web spam such as email spam. Sarita Yardi et al. discussed this in detail in their work [8]. They describe the motivating questions for spammers on Twitter: how and when to target the user, which trending topics to target, and how long they can sustain their spamming activities. Taking a more practical view, Gianluca Stringhini et al. [6] explore how spam has entered social network sites. They use the Random Forest algorithm as a classifier in the Weka framework with features such as FF ratio (the first feature, which compares the number of friend requests that a user sent to the number of friends she has), URL ratio, Message Similarity, Friend Choice, and Friend Number. They study how spammers operate to target social network sites. M. Chuah and M. McCord [9] discuss some content and user based features, as these features differ between legitimate users and spammers.
Zi Chu et al. [10] observed that previous spam detection methods check only individual messages or accounts for the existence of spam. They focused on detecting spam campaigns that control multiple accounts to spread spam on Twitter. Alex Hai Wang [11] proposed a directed graph model to capture the friend and follower relationships on Twitter. Graph based and content based features are suggested for detecting spam tweets with a Naive Bayesian classifier. The graph based features are three: friends, followers, and the reputation of a user. The content based features are duplicate tweets, HTTP links, replies and mentions, and trending topics. In [13], Nikita Spirin studies the URLs shared by users on Twitter and estimates spam for the users who share these links, applying this information to web spam detection algorithms by proposing a new set of URL derived features for representing a Twitter user. The author also proposes a way to construct a dataset automatically for the web spam detection problem by analyzing URLs shared by non-spam users in social media.
Another approach to spam detection on Twitter is discussed in [14]. The authors study the propagation of spam in the network. They aim to find out whether there is a pattern that spammers use to spread spam through the network, and to determine whether accounts have been compromised or taken over by spammers or were created purely for spam activities. They examine the characteristics of the graph of spam tweets and run the TrustRank technique on the collected data. The authors of [15] introduce features for detecting spam tweets without prior information about the user, and use a statistical analysis of language to identify spam in Twitter topics.
Jonghyuk S. et al. [16] observed that previous spam detection schemes were based on account features such as the age of the account, the ratio of URLs in tweets, and the content similarity of tweets. These features can easily be manipulated by spammers. They introduced two relation features, connectivity and distance, measured between the sender and the receiver of a message, to detect spam messages on Twitter as they are delivered. Their proposed distance and connectivity features are difficult for spammers to manipulate, and these relation features can be computed quickly. Fabricio Benevenuto et al. [12] addressed the problem of detecting spammers on Twitter instead of spam tweets. The authors use social behavior and content based characteristics to detect spammers. In [17], a spam identification approach is proposed and evaluated for Twitter trending topics. The two components of this methodology are detecting the timestamp gap between two consecutive tweets of a user and recognizing the content resemblance among the tweets posted by the user.

III. SENTIMENTAL AND CONTENT-BASED FEATURES
We propose sentimental features (combined with content and user based features) as part of our spam detection approach for Twitter. All proposed features are described in detail in Table ??.
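To make the idea of a sentimental feature concrete, the following is a minimal sketch of how lexicon-based sentiment counts could be derived from a tweet. The word lists, the feature names, and the `sentiment_features` function are hypothetical and chosen only for illustration; they are not the feature set of this paper, whose details are given in the feature table.

```python
import re

# Hypothetical mini-lexicons for illustration only; the sentiment
# resources actually used in the paper are not specified here.
POSITIVE_WORDS = {"best", "excellent", "great", "free", "amazing"}
NEGATIVE_WORDS = {"worst", "bad", "terrible", "awful"}


def sentiment_features(tweet: str) -> dict:
    """Return simple lexicon-based sentiment features for one tweet."""
    # Lowercase the tweet and keep alphabetic tokens only.
    tokens = re.findall(r"[a-z']+", tweet.lower())
    positive = sum(token in POSITIVE_WORDS for token in tokens)
    negative = sum(token in NEGATIVE_WORDS for token in tokens)
    total = len(tokens)
    return {
        "positive_count": positive,
        "negative_count": negative,
        # Net polarity normalised by tweet length (0.0 for an empty tweet).
        "sentiment_score": (positive - negative) / total if total else 0.0,
        # A simple content feature: does the tweet carry a link?
        "has_url": bool(re.search(r"https?://\S+", tweet)),
    }
```

A feature vector of this kind would then be concatenated with the content and user based features before being passed to a classifier.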

A. Data Collection
We downloaded tweets for the 29 most trending topics on Twitter for the year 2012 using the APIs provided by Twitter. After basic pre-processing, we were left with 29K tweets (1K for each topic). The tweets were manually annotated with spam or not-spam labels by two annotators, A and B. The Kappa score [7] for this annotation (0.82) was satisfactory to proceed with the experiments. We use standard metrics to measure the usefulness of our approach: precision, recall, and F-measure.
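The agreement score and the evaluation metrics named above can be sketched as follows. This sketch assumes that the Kappa score is Cohen's kappa for two annotators and that F-measure is the balanced F1; the function names and the label strings are illustrative and not taken from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1.0 - expected)


def precision_recall_f1(gold, predicted, positive="spam"):
    """Precision, recall, and F1 for the positive (spam) class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1
```

Kappa corrects raw agreement for the agreement expected by chance, which matters here because spam and not-spam labels are unlikely to be evenly balanced.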

B. Features Performance Comparison
Here we discuss the spam detection performance of our proposed features using five selected classifiers (SVM, Random Forest, Naive Bayes, BayesNet, and J48). We compared the performance of different features by forming different combinations; here we discuss just one combination, "all proposed features with baseline features", whose performance is given in Table III. Table III shows the accuracy of all features combined with baseline features using 10-fold cross-validation, while Figure 1 shows a graphical representation of the information in Table III. As seen in Table III and Figure 1, the spam detection performance of Naive Bayes improved with our proposed features. Naive Bayes accuracy with baseline features is 14.13%; combining our proposed features with the baseline features raised it to 25.30% (i.e., an 11.18% improvement). We also obtained improvements with the Random Forest and J48 classifiers. Random Forest accuracy improved from 91.81% with baseline features to 92.29% with all proposed features, a 0.48% improvement, while J48 showed a 0.47% improvement. SVM also showed a small improvement in spam detection performance (0.14%).
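The improvement figures quoted above appear to be absolute differences in accuracy (percentage points) rather than relative gains. The short sketch below recomputes them from the four-decimal accuracies reported in the conclusion; the dictionary layout and the `absolute_gain` helper are illustrative only.

```python
# Accuracies in percent as reported in the conclusion:
# (baseline features, baseline plus all proposed features).
REPORTED_ACCURACY = {
    "Naive Bayes": (14.1313, 25.3084),
    "Random Forest": (91.8118, 92.2914),
    "J48": (91.8778, 92.3435),
    "SVM": (91.2765, 91.4156),
}


def absolute_gain(baseline: float, combined: float) -> float:
    """Difference in accuracy, in percentage points, rounded to 2 places."""
    return round(combined - baseline, 2)


gains = {
    name: absolute_gain(baseline, combined)
    for name, (baseline, combined) in REPORTED_ACCURACY.items()
}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} percentage points")
```

Under this reading, the recomputed gains are 11.18, 0.48, 0.47, and 0.14 percentage points, matching the figures stated in the text.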

C. All Features and Baseline Features Comparisons
We repeated the experiments using a 70% training dataset (containing 20141 spam and non-spam tweets), obtained with the Weka "Remove Percentage"² unsupervised filter by setting the percentage property to 70%, and a testing dataset (containing 6042 spam and non-spam tweets), obtained by setting the "invert selection" property to false. As seen in Figure 2, the spam detection performance of Naive Bayes and Random Forest improved with our proposed features on the 70% training and testing datasets. Naive Bayes accuracy improved further compared with the previous 10-fold cross-validation experiments (i.e., 25.30% vs. 26.68%). Random Forest also showed some improvement in spam detection performance (0.80%). As described in the table and figure, for the Naive Bayes classifier we obtained good improvement in all combinations, with "All Combined" the best, while Random Forest improved in the "POS sentimental features" combination. With J48 and SVM, we obtain good performance in all feature combinations.
As seen in Figure 4, Naive Bayes performs well compared with its accuracy on baseline features. It gained 11.18% over the baseline features with the combination of all features, i.e., the proposed (sentimental), content, and user based features. For Random Forest, the combination of sentimental score and POS features with baseline features performed well, improving spam detection performance by 0.59%, whereas the improvement with all features combined is just 0.47%. For J48, performance improves with the POS combination and with the POS and emotions combination with baseline features, both giving a 0.59% improvement, whereas the improvement with all features combined is just 0.47%.
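The percentage split used for this experiment can be sketched in plain Python. The sketch assumes that the filter drops the leading share of instances in their stored order and that inverting the selection keeps that share instead; the `remove_percentage` function is an illustrative stand-in, not the Weka API.

```python
import math


def remove_percentage(instances, percentage, invert_selection=False):
    """Mimic a 'remove the first P percent' filter on an ordered dataset.

    Assumption: with invert_selection=False the first P percent are
    removed and the rest is returned; with invert_selection=True the
    first P percent are returned instead.
    """
    # Round half up to get the cut-off index.
    cutoff = math.floor(len(instances) * percentage / 100 + 0.5)
    if invert_selection:
        return instances[:cutoff]
    return instances[cutoff:]


# Toy dataset standing in for the annotated tweets.
tweets = list(range(10))
train = remove_percentage(tweets, 70, invert_selection=True)   # 70% kept
test = remove_percentage(tweets, 70, invert_selection=False)   # remaining 30%
```

Because the two calls use the same cut-off, the training and testing portions are disjoint and together cover the whole dataset. If the instances are stored in topic order, shuffling them before the split would be advisable so that both portions cover all topics.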
SVM also shows a slight improvement in spam detection accuracy; its best improvement comes with the combination of all proposed features and baseline features, which is 0.14% better than the accuracy with baseline features alone. The combination of sentimental score and emotions features yields the same improvement of 0.14%. BayesNet lost spam detection performance in almost all combinations.

V. CONCLUSION
In this paper, we have suggested sentimental and POS based features that, combined with content and user based features, can be used to differentiate between spam tweets and legitimate tweets on Twitter, a popular online social networking site. Our suggested features are informed by Twitter's spam detection policies and our observations of spam behavior. Using the Twitter API, we collected a dataset covering the 29 most trending topics of 2012. We proposed sentimental and some content based features that help identify spam tweets and return a spam-filtered result set with good accuracy when a user visits Twitter. We evaluated the usefulness of our suggested features in spam detection using five traditional classifiers: BayesNet, Naive Bayes, Random Forest, Support Vector Machine (SVM), and J48. Our experimental results show that the Naive Bayes, J48, and Random Forest classifiers give better overall performance than the other classifiers, namely SVM (which shows some improvement in spam detection compared with the content and user based baseline features) and BayesNet. The spam detection performance of Naive Bayes, Random Forest, J48, and SVM improved with the combination of all our proposed features. Naive Bayes accuracy with baseline features is 14.1313%; combining our proposed features with the baseline features raised it to 25.3084%, an 11.18% improvement in spam detection. Random Forest accuracy improved from a baseline of 91.8118% to 92.2914%, a 0.48% improvement. J48 accuracy improved from a baseline of 91.8778% to 92.3435%, a 0.47% improvement. SVM accuracy improved from 91.2765% with baseline features to 91.4156% in combination with all our proposed features, a 0.14% improvement. Using the Naive Bayes, J48, and Random Forest classifiers, our suggested features can achieve 93% precision and 95% F-measure. As future work, we leave the evaluation of our spam detection scheme on a larger Twitter dataset as well as on other online social networking sites such as Facebook.