Detecting Health-Related Rumors on Twitter using Machine Learning Methods

The heavy use of the internet has led to tremendous growth in the information generated by daily activities across sources such as news articles, forums, websites, emails and social media. Social media is a rich source of information that deeply affects users through its content. However, these platforms also carry many rumors, which can have critical consequences for people's lives, especially when the rumors concern health. Several studies have focused on automatically detecting rumors on social media by applying machine learning and intelligent methods, but few have addressed health-related rumors in the Arabic language. Therefore, this paper deals with detecting health-related rumors, focusing on cancer treatment information spread over social media in Arabic. In addition, it presents the process of creating a dataset called the Health-Related Rumors Dataset (HRRD), which will be available and beneficial for further studies in health-related research. Furthermore, an experiment was conducted to investigate the performance of several machine learning methods in detecting health-related rumors on social media in Arabic. The experimental results showed that the rumors can be detected with an accuracy of 83.50%.
Keywords: Health-related misinformation; cancer disease; fake information; Twitter; classification


I. INTRODUCTION
A tremendous amount of information is generated as a result of our daily activities and from different sources such as news articles, forums, websites, emails and social media. This information spreads quickly, especially through social media such as Facebook, Twitter, Instagram and others. Hence, social media is a rich source of information that deeply affects users through its content. This content can serve the needs of many users in different areas such as education, politics, economics, advertisement, health care, shopping and others. At the same time, much of this information can be false (rumor) [1] [2] [3].
Social networks were established to connect people, enhance relationships and share useful information [4]. Recently, they have become a communication channel for education, advertisement and many other activities. They offer many benefits in marketing [5] [6] and for other professional purposes [7] [8]. Despite the advantages of social networks, the quality of their information is low, especially for news and health care information [1] [2]. Anyone using social media can write their own content as advice or a recommendation, even without prior knowledge, and spread it to many people in minutes [9]. This information could be related to medical treatment and health-related issues [2] [10] [11] [12]. In addition, social media users widely rely on themselves to obtain medical advice from social media. Therefore, the credibility of such information is very important.
Information on social media lacks quality, credibility and trustworthiness, as emphasized in studies of health-related misinformation [13] [14] [15]. Such misinformation or rumors could have critical consequences for people's lives, especially when they concern health information and can lead to health risks [16]. Among existing studies on detecting rumors in health-related information, little attention has been given to cancer-related information in the Arabic language. Therefore, the purpose of this paper is to apply several machine learning methods to detect health-related rumors about cancer treatment on social media in Arabic. In addition, a dataset of cancer treatment information called the Health-Related Rumors Dataset (HRRD) has been created. HRRD was collected from Twitter and classified by domain experts into true and false information. Then, different preprocessing methods were applied to the dataset, such as stemming, tokenization, feature extraction and oversampling. Furthermore, several machine learning methods were applied and evaluated using different metrics. This paper is organized as follows: Section II reviews related studies. The materials and methods are presented in Section III. Section IV explains the results and discussion. Finally, the conclusion of this paper is given in Section V.

II. RELATED WORKS
Few studies have focused on rumors on social media. One example is [17], which stressed the importance of having health professionals and physicians in the domain check and verify the credibility of online information, since unverified information can harm users' health. In contrast, the authors in [29] [30] [32] presented rumor detection methods that detect health-related misinformation by extracting and identifying the features of fake content. In [30] and [32], a Health-related Misinformation Detection framework was developed to distinguish unreliable from reliable health-related information.
Zhang et al. [33] applied a logistic regression model to distinguish between true and false health-related rumors. For this purpose, 453 health rumors from a Chinese website were collected and analyzed. The results showed that the lengths of rumor headlines and statements and the presence of pictures within the context are the most distinctive indicators of false rumors, whereas rumors that contain numbers, hyperlinks and source cues are more likely to be true.
On the other hand, the authors in [34] studied human behavior regarding travel to areas affected by the Zika virus. They combined content analysis with several machine learning techniques to identify first-person reactions and changes in travel-related decisions during the Zika outbreak. For this purpose, 29,386 English-language tweets were collected. Only 2,000 of these tweets were annotated and labeled by two annotators, and 400 of them were used to train a binary logistic regression classifier. The classifier's performance was evaluated using Precision, Recall and F1-score. The best F1-scores were 0.65 for travel change decisions, 0.63 for travel consideration and 0.92 for identifying first-person reactions. The study in [35] also dealt with the Zika virus outbreak and gathered around 30 million tweets posted around the world. The authors used health professionals and crowdsourcing to capture and annotate health-related rumors, and applied several machine learning techniques, including naïve Bayes, random forest and random decision tree, to classify the tweets. The data set consisted of 3,343 labeled tweets, of which 1,786 were rumors. The best results were achieved with the random tree classifier, with a precision of 0.946 and a recall of 0.944. The authors of [36] examined questionable health-related information posted on Twitter, in particular tweets related to cancer treatments. For this purpose, they studied 3,212 Twitter users who posted unverified information about cancer treatment. A total of 215,109 tweets about rumor topics were harvested. Then, rigorous filter criteria were applied to exclude irrelevant tweets and user accounts from the data set. In the end, only 4,000 tweets remained, of which 2,890 were labeled as information about cancer and 1,110 were labeled as unrelated to the cancer topic. Logistic regression using n-gram features was applied to this dataset and showed good results.
In addition, the authors in [37] examined 1.5 million tweets mentioning the obesity and diabetes epidemics. The main purpose of this study was to assess the quality of the information circulating in these conversations, as well as the behavior and information needs of the users engaged in them. The results showed that 41% of the circulated obesity-related tweets and 50% of the diabetes-related tweets were posted by non-governmental or academic institutions. Furthermore, other studies, such as [38], focused on automatically creating a health misinformation dataset harvested from an online health discussion forum. Also, [39] analyzed vaccine rumors in news and social media by developing a dashboard platform with two network visualizations: users-as-nodes and tweets-as-nodes. To demonstrate the robustness of the system, a total of 875,088 tweets and 4,020 news articles about vaccine-related topics were collected. It was found that this tool is useful only for tracking the most influential accounts that frequently post such news or tweets. Similarly, [40] modeled the trustworthiness and reliability of online information using a deep learning technique, in particular a convolutional neural network (CNN). The model was used to generate recommendations of trusted medical articles with an average veracity score of 78.32%.

III. MATERIALS AND METHODS
The main methodology of this study is illustrated in Fig. 1. It briefly shows the four main phases of the study: dataset generation, data preprocessing, applying machine learning methods and evaluating the model.

A. Phase 1: Dataset Generation
The dataset generated for this study is called the Health-Related Rumors Dataset (HRRD), which includes a collection of tweets related to rumors about cancer disease and treatment. To the best of our knowledge, there is no available dataset of health-related rumors on social media in the Arabic language. Generating this dataset involved five steps: identifying the keywords, extracting the tweets using Twitter search APIs, extracting the tweets manually, screening the tweets, and labeling the tweets. In this study, the collected tweets are related to cancer symptoms, causes, prevention, treatment and awareness on Twitter and were written in Arabic. These five steps are described in detail as follows:
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 8, 2020, www.ijacsa.thesai.org
Step 1: Identifying the keywords. Several keywords were used to extract tweets about cancer disease in Arabic, either automatically or manually. These include: cancer disease, cancer causes, cancer treatment, fighting cancer, awareness about cancer, campaign about cancer, warning about cancer, health and cancer, avoid cancer and information about cancer.
Step 2: Extracting the tweets using Twitter search APIs. Using the keywords mentioned in the previous step, 18,684 tweets were automatically extracted from Twitter using the Twitter search APIs. These APIs are based on the REST architecture and allow access to Twitter data such as tweets and user profile information. However, because of the limits on tweet extraction, a user can only perform a limited number of requests daily. Therefore, additional tweets were extracted manually, as described in the next step.
Step 3: Extracting the tweets manually. Because of the request limits of the Twitter search APIs, which also retrieve a large number of irrelevant tweets, additional tweets were extracted manually using the above keywords to add more relevant tweets to the dataset. A total of 180 tweets were extracted manually.
Step 4: Screening the tweets. The tweets extracted in Steps 2 and 3 were screened manually to exclude irrelevant tweets, such as those posted by product sellers, companies, fake or untrusted accounts and others. Screening reduced the total number of tweets to 175.
Step 5: Labeling the tweets. For labeling, the extracted tweets were divided into four groups and sent to domain experts (medical doctors), who labeled each tweet as a rumor with one of three options: yes, no or not sure. The first group was answered by nine experts, while the remaining groups were answered by seven experts each. Majority voting was used to find the final label for each tweet. The labeling results were: yes: 31, no: 41, not sure: 87, and 16 tweets for which no decision could be made because of equal voting. Then, the tweets labeled "not sure" and those with no label were combined (103 tweets) for relabeling. In addition, 33 further tweets were extracted manually and added to this group. These tweets were divided into three groups and sent to domain experts, including oncologists. The first group was answered by seven experts, the second group by four experts and the last group by two experts. Majority voting was again applied to combine the votes and label the data. In the case of not-sure answers or an equal number of votes, more weight was given to the oncologists' answers. The total number of labeled tweets was 208, comprising yes: 128 and no: 80. The distribution of the classes in the final dataset is shown in Fig. 2.
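The voting procedure in Step 5 can be illustrated with a minimal sketch. It is not the authors' code: the function names, the data layout and the oncologist weight of 2.0 are assumptions made for illustration, since the paper does not state how much extra weight the oncologists' answers received.

```python
from collections import Counter


def _unique_top(scores):
    """Return the key with the strictly highest score, or None on a tie or empty input."""
    if not scores:
        return None
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # equal voting: no decision can be made
    return ranked[0][0]


def majority_vote(votes):
    """First-round labeling: plain majority over 'yes', 'no' and 'not sure'."""
    return _unique_top(Counter(votes))


def weighted_vote(votes, oncologist_weight=2.0):
    """Relabeling round: ignore 'not sure' answers and weight oncologists more.

    votes is a list of (label, is_oncologist) pairs.
    oncologist_weight is a hypothetical value; the paper gives no number.
    """
    scores = {}
    for label, is_oncologist in votes:
        if label == "not sure":
            continue
        scores[label] = scores.get(label, 0.0) + (oncologist_weight if is_oncologist else 1.0)
    return _unique_top(scores)


# A tie between two experts is broken in favor of the oncologist's answer.
print(majority_vote(["yes", "no"]))                    # None (equal voting)
print(weighted_vote([("yes", False), ("no", True)]))   # no
```

Tweets for which `majority_vote` returns "not sure" or None correspond to the 103 tweets that were sent for relabeling.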

B. Phase 2: Data Preprocessing
Python 3.6 on the Windows 10 operating system was used to preprocess the dataset and conduct the experiments. Several libraries were installed, including NLTK for stemming Arabic text. In addition, the raw data were tokenized and represented using unigrams, bigrams, trigrams, 4-grams and 5-grams. Feature extraction was performed using TF-IDF. The impact of these different preprocessing methods on the detection of health-related rumors in Arabic was investigated. In addition, as shown in Fig. 2, the dataset is unbalanced. Therefore, an oversampling method was applied to the minority class to provide a balanced dataset. The impact of oversampling was also investigated.
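To make the n-gram tokenization and TF-IDF weighting concrete, the following is a minimal standard-library sketch. It rests on several assumptions: tokens are split on whitespace, an "n-gram" setting uses sequences of exactly n words, and the weight is the plain tf × log(N / df) form. The paper does not specify these details, and library implementations often use smoothed variants, so the numbers here are illustrative only. The function names `word_ngrams` and `tfidf` are hypothetical.

```python
import math
from collections import Counter


def word_ngrams(text, n):
    """Split on whitespace and return all contiguous sequences of n words."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(documents, n=1):
    """Return one {ngram: weight} dict per document using tf * log(N / df)."""
    grams = [Counter(word_ngrams(doc, n)) for doc in documents]
    # Document frequency: in how many documents each n-gram occurs.
    df = Counter()
    for counts in grams:
        df.update(counts.keys())
    total = len(documents)
    vectors = []
    for counts in grams:
        length = sum(counts.values())
        vectors.append({
            gram: (count / length) * math.log(total / df[gram])
            for gram, count in counts.items()
        })
    return vectors


# Placeholder tokens stand in for stemmed Arabic words.
docs = ["a b c", "a b d"]
print(word_ngrams(docs[0], 2))
print(tfidf(docs, n=2))
```

An n-gram that occurs in every document receives a weight of zero under this formula, which is why shared phrases contribute nothing to separating the two classes.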

C. Phase 3: Machine Learning Methods
To detect health-related rumors in Arabic on social media (Twitter), several machine learning methods were used, including Support Vector Machine (SVM), Logistic Regression (LR), Bernoulli Naive Bayes (BNB), SGD Classifier, K-Nearest Neighbor (K-NN) and Decision Tree (J48). In addition, three ensemble learning methods were used: Random Forest (RF), AdaBoost (Ada), and Bagging (Bag).
To apply the machine learning methods, the dataset was split into a training set (70%) and a testing set (30%). Then, several evaluation metrics were applied to measure the performance of detecting health-related rumors in Arabic on social media. The details of these measures are described in the next subsection.
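The 70/30 split and the oversampling step from Phase 2 can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the paper does not say whether the split was stratified, which oversampling method was used, or whether oversampling came before or after the split. The sketch assumes a stratified split, random duplication of minority-class examples, and oversampling of the training portion only, so that duplicated tweets cannot appear in the test set. The function names are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(samples, labels, test_ratio=0.3, seed=0):
    """Split (sample, label) pairs so each class keeps roughly the same ratio."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)
    train, test = [], []
    for label, items in by_label.items():
        items = items[:]          # copy before shuffling
        rng.shuffle(items)
        cut = round(len(items) * test_ratio)
        test += [(s, label) for s in items[:cut]]
        train += [(s, label) for s in items[cut:]]
    return train, test


def oversample(pairs, seed=0):
    """Randomly duplicate examples until every class matches the largest class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample, label in pairs:
        by_label[label].append((sample, label))
    target = max(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        balanced += items + extra
    return balanced


# Class sizes follow the final HRRD labels: 128 rumors (yes) and 80 non-rumors (no).
samples = list(range(208))
labels = ["yes"] * 128 + ["no"] * 80
train, test = stratified_split(samples, labels)
balanced_train = oversample(train)
print(len(train), len(test), len(balanced_train))
```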

D. Evaluation Metrics
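The evaluation metrics were used to evaluate the classification methods in combination with the different preprocessing methods. These include Precision, Recall, F1-score and Accuracy, which are defined as follows:

$$
\begin{aligned}
\text{Precision} &= \frac{TP}{TP + FP}, & \text{Recall} &= \frac{TP}{TP + FN}, \\
\text{F1-score} &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, & \text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN}
\end{aligned}
\tag{1}
$$

where $TP$ is true positive; $TN$ is true negative; $FP$ is false positive, and $FN$ is false negative.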
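As a worked illustration, the four metrics can be computed directly from the confusion-matrix counts using their standard definitions. The sketch below also shows a support-weighted average over the two classes, which is assumed here to be what the weighted values in the results refer to. The function names and the example counts are hypothetical and are not taken from the paper's experiments.

```python
def binary_metrics(tp, tn, fp, fn):
    """Precision, recall, F1-score and accuracy for the positive (rumor) class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


def weighted_metric(tp, tn, fp, fn, name):
    """Support-weighted average of one metric over both classes."""
    pos = binary_metrics(tp, tn, fp, fn)[name]
    # Swapping the roles of the two classes gives the negative-class metric.
    neg = binary_metrics(tn, tp, fn, fp)[name]
    pos_support, neg_support = tp + fn, tn + fp
    return (pos * pos_support + neg * neg_support) / (pos_support + neg_support)


# Illustrative counts only: 8 TP, 6 TN, 2 FP, 4 FN.
scores = binary_metrics(tp=8, tn=6, fp=2, fn=4)
print(scores)
print(weighted_metric(8, 6, 2, 4, "precision"))
```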

IV. EXPERIMENTAL RESULTS AND DISCUSSION
The experiments were conducted in two stages: without and with oversampling of the dataset. In each stage, five runs were performed using the different tokenization methods (1- to 5-gram). The accuracy of each classifier was reported, and the precision, recall and F-score values were presented both for the rumor class and for the two classes together (weighted value). The best value(s) of each evaluation criterion are highlighted in bold.
Tables I to V show the performance of the nine machine learning methods for the different tokenization methods (1- to 5-gram) before oversampling. The best accuracy was obtained by the SGD classifier (76.19%) with the bigram method. For detecting the rumor class, the best precision (0.80) was obtained by the AdaBoost classifier with the 5-gram method, while both BNB and RF obtained the best recall value (1.0) with all tokenization methods. Furthermore, the best F-score for this class (0.82) was obtained by the SGD classifier with the bigram method. On the other hand, the best weighted precision (for the two classes) was obtained by the K-NN classifier (0.80) with the trigram method, while the best weighted recall and F-score values (0.76 and 0.75, respectively) were obtained by the SGD classifier with the bigram method. These experiments showed the good performance of the SGD classifier in detecting health-related rumors on the unbalanced dataset (before oversampling).
In the second stage, oversampling was applied and the five tokenization methods (1- to 5-gram) were used. The performance of the machine learning methods in detecting health-related rumors improved consistently. Tables VI to X show the performance of the nine machine learning methods with oversampling. The best accuracy (83.50%) was obtained by the RF classifier with the 4-gram and 5-gram methods, and by the SGD classifier with the 4-gram method.
For detecting the rumor class, the best precision value (0.83) was obtained by Bag (with unigram) and LR (with 4-gram and 5-gram), while the best recall and F-score values (1.0 and 0.86, respectively) were obtained by the BNB classifier using the 3-, 4- and 5-gram methods.
For the weighted average values, the best precision and recall values (0.87 and 0.83, respectively) were obtained by BNB with the trigram method. In addition, other classifiers, such as the SGD and RF classifiers, also obtained superior recall. The best F-score value (0.83) was obtained by RF using the 4-gram and 5-gram methods.
To compare the accuracy of all machine learning methods across all tokenization methods with and without oversampling, the results are summarized in Figs. 3 to 7. The results showed consistent improvements for all machine learning methods when oversampling was used. The best accuracy was obtained by RF using 4-grams and 5-grams.

V. CONCLUSIONS AND FUTURE WORKS
This study investigated the performance of several machine learning methods in detecting health-related rumors on social media in the Arabic language. The dataset (HRRD) was generated by extracting Arabic tweets about cancer disease from Twitter. The experiments were conducted by applying several preprocessing methods such as stemming, tokenization and oversampling. Then, several machine learning methods were applied. The experimental results showed that when the data is balanced (using oversampling), the performance of the machine learning methods clearly improves. The best accuracy (83.50%) was obtained by the random forest classifier using 4-gram and 5-gram tokenization. Therefore, this study recommends using random forest to detect health-related rumors written in Arabic on social media. This study opens the door for other researchers to work on health-related rumors in Arabic and makes the HRRD dataset available, which can also benefit further studies in health-related research. In future work, other machine learning methods can be applied with different preprocessing methods. In addition, the dataset can be enriched by including more tweets about cancer disease from social media.