Machine Learning based Analysis on Human Aggressiveness and Reactions towards Uncertain Decisions

Tweet data can be processed into useful information. Social media sites such as Twitter, Facebook, and Google+ are rapidly growing in popularity. These sites provide a platform for people to share and express their views about daily life, discuss particular topics, hold discussions with different communities, or connect with the world by posting messages. Tweets posted on Twitter express opinions. These opinions can be used for different purposes, such as gauging public views on uncertain decisions, for example the Muslim ban in America, the war in Syria, and American soldiers in Afghanistan. Such decisions have a direct impact on users' lives, with violations and aggressiveness as common outcomes. For this purpose, we will collect opinions from Twitter on some popular decisions taken in the past decade. We will divide the sentiments into two classes, anger (hatred) and positive. We will propose a hypothesis model for such data that can be used in the future. We will use Support Vector Machine (SVM), Naive Bayes (NB), and Logistic Regression (LR) classifiers for the text classification task. Furthermore, we will compare the SVM results with those of NB and LR. This research will help to predict people's early behaviors and reactions before such decisions have major consequences.
Keywords: Opinion mining; Naïve Bayes; logistic regression; support vector machine


I. INTRODUCTION
The Internet provides nearly every service an ordinary user looks for. From health and education to government and business, all areas of modern life are now covered by the Internet. The Internet provides connectivity [1,2] between people and information shared publicly around the world. Similarly, social media such as Facebook, Twitter, and YouTube are platforms for staying up to date with current news and affairs. Through social media, people [3,4,5] can share news, share their opinions, and participate in activities held online. Social Networking Sites (SNS) such as Twitter and Facebook have a beneficial effect on our way of life. SNS have been used for expressing opinions on different issues. In this work, we propose a sentiment-based method for estimating aggressiveness.
In the age of technology [6,7], millions of people use social media sites such as Facebook, Twitter, and Google Plus to share and express their views, emotions, and opinions about their daily lives. Online communities provide an interactive medium in which consumers inform and influence others through forums. Social media are now rich in data in the form of tweets, status updates, posts, blogs, comments, reviews, etc. [8,9]. These sites are no longer used only for personal purposes; they have become one of the fastest ways to reach people. They give businesses an opportunity to connect with their customers for advertising. Most people rely to a great extent on user-generated content or reviews for decision making. The online content generated by users is too abundant for an ordinary user to analyze. The task is therefore to automate the process of extracting users' views as opinions. Online content is mainly considered in terms of opinions, sentiments, attitudes, and emotions [10].

II. LITERATURE REVIEW
Machine learning, data mining, and natural language processing are widely used together for the classification of text documents. These three techniques are also used to discover patterns in electronic documents. Text mining is used to discover hidden useful information in documents and deals with operations such as retrieval, classification (supervised, unsupervised, and semi-supervised), and summarization [11].
There have been many efforts regarding text classification in the past. Krishna and Gonghzu [12] analyzed large data sets from clinics to identify clinical disorders. Sonia and Shruti [15] used machine learning techniques to analyze social network E-Health data. Roshan and Rio D Souza [13] analyzed product value using sentiment analysis of opinions given publicly on Twitter. They [16] addressed the problem that a single user cannot read millions of reviews for a particular product, and developed a model that uses posted reviews to classify a product's reviews as positive, negative, or neutral. In the same context, Barnaghi et al. [14] used Twitter sentiments to predict the winner of an event. They used Bayesian Logistic Regression (BLR) and manually labelled tweets into two categories, positive and negative. Their proposed model [17,19] can be used to predict the winner of any event from sentiment. In our research we will propose a methodology to analyze the pattern of human behavior towards uncertain decisions. Our proposed methodology saves time and cost given the huge volume of public reviews posted daily on social networks. Nirbhay Kashyap et al. [20] worked on music lyrics to categorize the mood of individuals. They used different text mining and data mining approaches to deal with this problem. They considered music associations, melody choice, and music proposal as features to represent the data. This is beneficial for a more accurate understanding of music mood in the mood-mapping process. Similarly, many studies have investigated online business trends using social data. Online businesses and large companies worldwide use feedback given on social sites to improve their products and meet business needs over time.
The text shared on Twitter in the form of tweets contains valid information that can be used to track the progress of a product. In one such study, the data were grouped into categories such as against, positive, and negative using machine learning clustering algorithms. The authors found that data available online can be used for information extraction, which helps companies track the progress of their products and is useful for future considerations [21].
Santoshi et al. [22] used Twitter data differently. They tried to determine user behavior towards political parties. They captured Twitter data before an election and divided the raw data into five categories: positive, negative, happy, sad, and neutral. This type of information is very useful for political parties before an election. It can also help parties address people's real problems and thereby change users' opinions. They considered BJP and INC, the biggest political parties in India. Using text mining and an unsupervised lexical method, they classified tweets related to these parties to identify people's emotions towards them.
Xin Li [23] and colleagues adopted the same platform for their study. They used different natural language processing techniques [25] to raise awareness of the social issues people face. Social awareness information was analyzed by applying text mining and social network analysis. AK Rathore et al. [24] collected Twitter data to predict the success of a pizza after its launch, and their work yields useful information. Methodologies of this type can be used to predict the behavior of any user towards a particular product. Rathore and his colleagues used R and NodeXL to analyze tweets collected from Twitter. Furthermore, they used different text mining, natural language processing, and network analysis techniques to predict user behavior. Any company, including a food delivery company, can use this sort of information to assess the success or failure of a product [26,27]. To our knowledge, no prior work has analyzed reactions to such decisions and their impact on human life. In our research we will propose a methodology to analyze the pattern of human behavior towards uncertain decisions. Our proposed methodology saves time and cost given the huge volume of public reviews posted daily on social networks.

III. PROPOSED METHOD
The solution we suggest involves Twitter data. Tweets were collected with the Twitter Search API [18]. Our methodology consists of two phases: training and testing. The training phase comprises tweet collection, feature representation, and classifier training, while the testing phase has four steps: tweet collection for testing, feature representation, hypothesis prediction, and evaluation. The first two tasks (i.e., tweet collection and feature representation) are shared between the training and testing phases. Popular classifiers such as SVM, NB, and LR are used for training and hypothesis prediction. We used the WEKA tool for training and testing our proposed methodology. First, we divided the data set into two parts, training data and testing data.

A. Preprocessing
Preprocessing reshapes the data into the desired form. The collected data are not clean enough for classification, so we applied data processing methods to transform them into meaningful features. Fig. 1 shows the training of the dataset. Preprocessing mainly involves tokenization (or featuring), feature weighting, and data cleaning (removal of irrelevant features). Once the data were collected, URLs were removed from the tweets and replies. Tweets containing only an image or a link, with no textual information, were also removed. Stop words give no information about the topic and only create noise in the data, so they were removed using a stop-word list. Preprocessing is a key step in data classification tasks. It improves the effectiveness of the proposed classifier and saves time during classification. The collected tweets are further pre-processed with the following steps.

1) Tokenization:
Tokenization breaks long text strings into substrings, which may be phrases or words and are collectively known as tokens. Text can be tokenized in two ways, by words (often called bag of words) or by phrases; word-level tokenization is considered more effective due to its statistical significance. In this process, a sentence such as "Trump is mentally disturbed person" was broken into the tokens "Trump", "is", "mentally", "disturbed", and "person". Tokenization algorithms typically split a sentence on whitespace, and some rely on a built-in dictionary.
2) Feature Weighting: A standard function for computing the weights is TF-IDF. The TF-IDF scheme has two parts, TF and IDF. TF stands for term frequency, which counts the occurrences of a term/token in a document and so measures how often the term occurs. IDF stands for inverse document frequency, which is computed for a term over a collection of documents and down-weights terms that appear in many of them.
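The cleaning and word-level tokenization steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the WEKA pipeline used in this work: the function names, the lowercasing step, the regular expressions, and the tiny stop-word list are all assumptions made for the example.

```python
import re

# Assumption: a tiny illustrative stop-word list; the actual list used is not specified.
STOP_WORDS = {"is", "a", "an", "the", "and", "or", "of", "to", "in"}

# Assumption: URLs are recognised by a simple pattern (http://, https://, or www.).
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
# Word-level tokens: runs of letters, digits, or apostrophes.
TOKEN_PATTERN = re.compile(r"[a-z0-9']+")


def tokenize(tweet):
    """Strip URLs, lowercase the text, and split it into word tokens."""
    text = URL_PATTERN.sub(" ", tweet)
    return TOKEN_PATTERN.findall(text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]


def has_text(tweet):
    """Return False for tweets that contain only a link and no words."""
    return len(tokenize(tweet)) > 0


tokens = tokenize("Trump is mentally disturbed person http://t.co/abc")
print(tokens)
print(remove_stop_words(tokens))
```

Keeping tokenization and stop-word removal as separate functions makes it easy to inspect the raw tokens before any are discarded.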
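As a rough illustration of the weighting scheme, the sketch below computes TF-IDF for tokenized tweets. It assumes one common variant, raw term count multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term; the exact variant applied by the tool used in this work may differ, and the function name and sample tokens are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute TF-IDF weights for a list of tokenized documents.

    Assumed variant: weight(t, d) = count(t in d) * log(N / df(t)).
    Returns one {term: weight} dictionary per document.
    """
    n_docs = len(documents)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        term_freq = Counter(tokens)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights


# Hypothetical tokenized tweets, for illustration only.
docs = [["trump", "ban", "ban"], ["ban", "war"]]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

In this example the term "ban" occurs in every document, so its IDF is log(1) = 0 and it receives zero weight, which is the intended effect of down-weighting terms that do not discriminate between documents.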

B. Sentiment Classification
Once pre-processing was applied, the data were in a suitable format for a classification algorithm. We categorized the data into two classes: tweets containing negative words were labeled Negative, and tweets containing positive words were labeled Positive. We labeled the sampled tweet rows in this way. Different algorithms are available in this domain that can be used for the classification task, and several experimental studies have analyzed these methods for text categorization.

IV. CLASSIFICATION
Supervised classification is a machine learning approach in which training data are used to construct a model and test data are used to evaluate the constructed model on unseen data in order to measure the performance of the algorithm. A number of classifiers exist; Table I lists the classifiers explored in this work. SVM often provides better results than other machine learning algorithms because it seeks a decision boundary with a large margin between the classes. SVM also supports high-dimensional data and can handle a very large number of features at once. Training an SVM is formulated as an optimization problem. Software libraries available for implementing SVM include LIBLINEAR and LIBSVM. Logistic regression defines its hypothesis through the sigmoid activation function.
Naïve Bayes is a probabilistic classifier based on Bayes' theorem, with a strong independence assumption between features. It is also commonly known as Simple Bayes or Independence Bayes. It is mostly used for classifying text into categories. Typical examples include checking whether an email is spam or not, or whether an email is related to sports or not.
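For reference, a standard form of the logistic regression hypothesis and the sigmoid function can be written as follows. The symbols are assumed notation rather than the notation of this work: θ is the weight vector, x is the feature vector of a tweet (for example its TF-IDF weights), and the 0.5 decision threshold is the usual default, taken here as an assumption.

```latex
h_\theta(x) = \sigma\!\left(\theta^{\top} x\right), \qquad
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
\hat{y} =
\begin{cases}
\text{Positive} & \text{if } h_\theta(x) \ge 0.5,\\
\text{Negative} & \text{otherwise.}
\end{cases}
```

Because σ maps any real number into the interval (0, 1), h_θ(x) can be read as the estimated probability that the tweet belongs to the Positive class.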
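To make the idea concrete, the following is a minimal sketch of a multinomial Naïve Bayes text classifier with add-one (Laplace) smoothing. It is not the WEKA implementation used in this work; the function names, the smoothing choice, the handling of unseen words, and the toy training tweets are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Collect the counts needed by multinomial Naive Bayes.

    labeled_docs is a list of (tokens, label) pairs.
    """
    class_counts = Counter()            # number of documents per class
    word_counts = defaultdict(Counter)  # per-class token counts
    vocabulary = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def predict_naive_bayes(model, tokens):
    """Return the class with the highest log posterior score."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in class_counts.items():
        # Log prior P(class).
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocabulary:
                continue  # assumption: words never seen in training are ignored
            # Log likelihood P(token | class) with add-one smoothing.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labeled tweets, for illustration only.
training = [
    (["hate", "ban"], "Negative"),
    (["angry", "war"], "Negative"),
    (["good", "decision"], "Positive"),
    (["support", "peace"], "Positive"),
]
model = train_naive_bayes(training)
print(predict_naive_bayes(model, ["hate", "war"]))
print(predict_naive_bayes(model, ["good", "peace"]))
```

Working with log probabilities avoids numerical underflow when many small probabilities are multiplied, and the independence assumption is what allows the per-token terms to be summed.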

A. Evaluation Measures
We used various evaluation measures to assess the results; these measures are described in Table II. The results of sentiment classification using Logistic Regression (LR) are given in Table III. Precision, recall, and F-measure are approximately 83%, 84%, and 84%, respectively.
The results of sentiment classification using Support Vector Machine (SVM) are given in Table IV. Precision, recall, and F-measure are approximately 92%, 85%, and 88%, respectively.
The results of sentiment classification using Naïve Bayes (NB) are given in Table V. Precision, recall, and F-measure are approximately 85%, 86%, and 85%, respectively.
Precision (positive predictive value) is the fraction of retrieved instances that are relevant. The concept is used for binary classification. Recall, also known as sensitivity, is the fraction of all relevant instances that are retrieved. Precision and recall are often used together to judge the performance of a classifier [28]. The F-measure is the harmonic mean of precision and recall.
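The three measures can be computed directly from the predicted and true labels, as in the sketch below. The function names and the toy label lists are assumptions made for illustration; the final line only checks that the reported SVM precision and recall are consistent with the reported F-measure.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f(true_labels, predicted_labels, positive_label):
    """Compute precision, recall, and F-measure for one class of interest."""
    tp = fp = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if predicted == positive_label and truth == positive_label:
            tp += 1  # correctly retrieved
        elif predicted == positive_label:
            fp += 1  # retrieved but not relevant
        elif truth == positive_label:
            fn += 1  # relevant but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f_measure(precision, recall)


# Hypothetical labels, for illustration only.
truth = ["Positive", "Positive", "Negative", "Negative", "Positive"]
predicted = ["Positive", "Negative", "Negative", "Positive", "Positive"]
print(precision_recall_f(truth, predicted, "Positive"))

# Consistency check on the reported SVM figures (precision 92%, recall 85%).
print(round(f_measure(0.92, 0.85), 2))
```

The harmonic mean is dominated by the smaller of the two values, so a classifier cannot obtain a high F-measure by maximizing only precision or only recall.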

B. Tools for Evaluation
To perform the desired task, we used WEKA. WEKA is free, open-source software that has been used for a variety of machine learning problems. It contains tools for classification, pre-processing, clustering, visualization, association rules, etc. Machine learning is of little use unless it brings artificial intelligence to bear on your data. Machine learning methods are closely related to data mining algorithms. WEKA provides a collection of machine learning (ML) algorithms that are applied to data to extract the desired results.

C. Comparative Analysis
A comparative analysis of the classifiers for sentiment classification is given in Table VI. SVM provides the best results, with an F-measure of approximately 88%, compared with approximately 85% for NB and 84% for LR.

VI. DISCUSSIONS
In the previous sections we described the tools, data source, and technologies used in our approach. In this section we present the obtained results. Fig. 2 shows the workflow of our experimentation. Three classifiers, Support Vector Machine, Naive Bayes, and Logistic Regression, are used in our experiment. To measure the effectiveness of each classifier, we used three measures, i.e., recall, precision, and F-measure, obtained with standard 10-fold cross-validation.
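The 10-fold cross-validation procedure mentioned above partitions the data into ten folds and uses each fold once for testing while the remaining nine are used for training. The sketch below shows only the index bookkeeping; the function name is hypothetical, and it deliberately omits the shuffling and stratification that a real run would usually apply first.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Assumption: samples are taken in their given order, without shuffling
    or stratification.
    """
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Example with a hypothetical data set of 25 tweets.
for fold_number, (train, test) in enumerate(k_fold_indices(25, k=10), start=1):
    print(fold_number, len(train), test)
```

The reported precision, recall, and F-measure would then be aggregated over the ten test folds, so that every tweet contributes to the evaluation exactly once.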

VII. CONCLUSION
Twitter is one of the most important social platforms for sharing useful information. Tweets posted on Twitter express opinions. These opinions can be used for different purposes, such as gauging public views on uncertain decisions, for example the Muslim ban in America, the war in Syria, and American soldiers in Afghanistan. Such decisions have a direct impact on users' lives, with violations and aggressiveness as common outcomes. We collected tweets about such decisions and labeled them into two categories, anger (hatred) and positive. We used classification algorithms, namely Support Vector Machine (SVM), Naive Bayes (NB), and Logistic Regression (LR), to build models. We also compared the SVM results with those of NB and LR. This research is useful for predicting people's early behaviors and reactions before such decisions have major consequences.
In the future, we are interested in building a tool that can work as a recommender system to classify tweets automatically into two categories, Anger and Positive.