Fake News Detection in Arabic Tweets during the COVID-19 Pandemic

In March 2020, the World Health Organization declared the COVID-19 outbreak to be a pandemic. Soon afterwards, people began sharing millions of posts on social media without considering their reliability and truthfulness. While there has been extensive research on COVID-19 in the English language, there is a lack of research on the subject in Arabic. In this paper, we address the problem of detecting fake news surrounding COVID-19 in Arabic tweets. We collected more than seven million Arabic tweets related to the coronavirus pandemic from January 2020 to August 2020 using the trending hashtags of the period. We relied on two fact-checkers, Agence France-Presse and the Saudi Anti-Rumors Authority, to extract a list of keywords related to misinformation and fake news topics. A small corpus was extracted from the collected tweets and manually annotated into fake or genuine classes. We used a set of features extracted from tweet contents to train a set of machine learning classifiers. The manually annotated corpus was used as a baseline to build a system for automatically detecting fake news from Arabic text. Classification of the manually annotated dataset achieved an F1-score of 87.8% using Logistic Regression (LR) as a classifier with the n-gram-level Term Frequency-Inverse Document Frequency (TF-IDF) as a feature, and an F1-score of 93.3% on the automatically annotated dataset using the same classifier with the count vector feature. The introduced system and datasets could help governments, decision-makers, and the public judge the credibility of information published on social media during the COVID-19 pandemic.

Keywords—Fake news; Twitter; social media; Arabic corpus


I. INTRODUCTION
The rise of social networks such as Facebook, Twitter, and many others has enabled the rapid spread of information. Any user on social media can publish whatever they want without considering the truthfulness and reliability of the published information, which introduces challenges in information reliability assurance. Twitter is one of the most popular social media platforms. It is designed to allow users to send information as short texts, known as tweets, of no more than 280 characters, and each user on Twitter can follow as many accounts as they want. Nowadays, with the outbreak of the COVID-19 pandemic, millions of tweets are generated daily, which has caused some adverse effects on individuals and society. For example, the spread of misinformation about COVID-19 symptoms may harm people [1]. For instance, it could be anxiety-inducing for a person who experiences COVID-19-like symptoms even if they have not been infected with the virus. The terms fake news and misinformation are closely related and are often used interchangeably. The authors in [2] defined rumors as: "a hypothesis offered in the absence of verifiable information regarding uncertain circumstances that are important to those individuals who are subsequently anxious about their lack of control resulting from this uncertainty." Another definition presented in [3] is: "unverified and instrumentally relevant information statements in circulation that arise in contexts of ambiguity, danger or potential threat, and that function to help people make sense and manage risk." Detecting fake news in English tweets is an active research area, and many studies and datasets have been published during the COVID-19 pandemic [4]. In Arabic, fake news detection is fairly new and has a long way to go to reach the level achieved in other languages, especially English.
Therefore, the fight against fake news requires a system that automatically assists in verifying the truthfulness of information shared about the COVID-19 pandemic on social media. Fake news detection is a very challenging task, especially given the lack of available datasets related to the pandemic. Building an automated fake news detection system requires human annotation, machine/deep learning, and Natural Language Processing techniques [5]. These techniques help to determine whether a given text is fake news or not by comparing the text with known corpora that contain both fake and truthful information [6].
In this paper, we address the problem of fake news detection on Twitter during the COVID-19 pandemic period. Our focus is to build a manually annotated dataset for fake news detection from Twitter's social media platform. We rely on fact-checking sources to manually annotate a sample dataset. The consideration of these fact-checking sources could help in reducing the spread of misinformation [7], [8], [9], [10]. As manual annotation is expensive and time-consuming [11], we also developed a system to expand the manually annotated dataset by automatically annotating a large, unlabeled dataset. We use supervised learning classifiers, trained and tested on both the manually and automatically annotated datasets, to ensure the quality of our annotation. We use six different machine learning algorithms, four different features with each algorithm, and three pre-processing techniques. The rest of the paper is organized as follows. In Section 2, we cover related work. Section 3 presents our methodology for annotating and automatically detecting fake news related to COVID-19. In Section 4, we present the results and discussion. Finally, the conclusion and future work are presented in Section 5.

II. RELATED WORK
Recently, much work has been done to tackle the problem of detecting fake news, rumors, misinformation, or disinformation in social media networks. Most of these studies can be categorized into supervised and unsupervised learning approaches; fewer works have tackled the problem using semi-supervised techniques.
For the supervised approach, a system based on machine learning techniques for detecting fake news or rumors in the Arabic language from social media during the COVID-19 pandemic is presented in [12]. The authors collected one million Arabic tweets using Twitter's Streaming API. The collected tweets were analyzed by identifying the topics discussed during the pandemic, detecting rumors, and predicting the source of the tweets. A sample of 2,000 tweets was labeled manually into false information, correct information, and unrelated. Different machine learning classifiers were applied, including Support Vector Machine, Logistic Regression, and Naïve Bayes. They obtained 84% accuracy in identifying rumors. The limitations of this research include the unavailability of the dataset, and the fact that it relies on a single source of rumors: the Saudi Arabian Ministry of Health.
Identifying breaking news rumors on Twitter has been proposed in [13]. The authors built a word2vec model and an LSTM-RNN model to detect rumors from news published on social media. The proposed model is capable of detecting rumors based on a tweet's text, and experiments showed that it outperforms state-of-the-art classifiers. As rumors can later be deemed true or false, their model is unable to memorize facts across time; it only looks at the tweet at the current time. Detecting rumors from Arabic tweets using features extracted from the user and content has been proposed in [14]. The authors obtained rumor and non-rumor topics from the Anti-Rumors Authority and Ar-Riyadh websites. More than 270K tweets were collected, containing 89 rumor and 88 non-rumor events. A supervised Gaussian Naïve Bayes classification algorithm reported an F1-score of 78.6%. This research's limitation is that the proposed dataset is not verified against any benchmark dataset.
In [15], a supervised learning approach for Twitter credibility detection is proposed. A set of features including content-based and source-based features, were used to train five machine learning classifiers. The Random Forests classifier outperformed the other classifiers when used with a combined set of features. A total of 3,830 English tweets were manually annotated with credible or non-credible classes. The textual features were not studied to examine their impact on credibility detection. Another supervised machine learning approach was proposed in [16] to detect rumors from business reviews. A publicly available dataset was used to conduct rumour detection experiments. Different supervised learning classifiers were used to classify business reviews. The experimental results showed that the Naïve Bayes classifier achieved the highest accuracy and outperformed three classifiers, namely, the Support Vector Classifier, K-Nearest Neighbors, and Logistic Regression. This work's limitation is the small size of the dataset used to train machine learning classifiers.
Detection of fake news using n-gram analysis and machine learning techniques was proposed in [17]. Two different feature extraction techniques and six machine learning algorithms were investigated and compared based on a dataset from political articles that were collected from Reuters.com and kaggle.com for real and fake news. Another Arabic corpus for the task of detecting fake news on YouTube is presented in [18]. The authors introduced a corpus that covered topics most concerned by rumors. More than 4,000 comments were collected to build the corpus. Three different machine learning classifiers (Support Vector Machine, Decision Tree, and Multinomial Naïve Bayes) were used to differentiate between rumour and non-rumour comments with the n-gram TF-IDF feature. The SVM classifier achieved the highest results. Authors in [19] proposed identifying fake news on social media. They used several pre-processing steps on the textual data, and then used 23 supervised classifiers with the TF weighting feature. The combined text pre-processing and supervised classifiers were tested on three different real-world English datasets, including BuzzFeed Political News, Random Political News, and ISOT Fake News.
An automatic approach to detecting fake news from Arabic and English tweets using machine learning classifiers has been proposed in [4]. The authors developed a large and continuous dataset for Arabic and English fake news during the COVID-19 pandemic. Information shared on official websites and Twitter accounts were considered a source of real information. Along with the data collected from official websites and Twitter accounts, they also relied on various fact-checking websites to build the dataset. A set of 13 machine learning classifiers and seven other feature extraction techniques were used to build fake news models. These models were used to automatically annotate the dataset into real and fake information. The dataset was collected for 36 days, from the 4th of February to the 10th of March 2020.
A large corpus for fighting the COVID-19 infodemic on social media has been proposed in [11]. The authors developed an annotation schema that covers several categories, including advice, cure, call for action, and asking a question. They considered such categories to be useful for journalists, policymakers, or even the community as a whole. The collected dataset contains tweets in Arabic and English. Three classifiers were used to perform classification experiments using three input representations: word-based, FastText, and BERT. The authors only made 210 of the classified tweets public.
Two Arabic corpora have also been constructed, without manual annotation. In [20], more than 700,000 Arabic tweets were collected from Twitter during the COVID-19 period. The corpus covers prevalent topics discussed in that period and is publicly available to enable research under different domains such as NLP, information retrieval, and computational social media. They used the Twitter API to collect the tweets on a daily basis, covering the period from January 27, 2020, to March 31, 2020.
The second corpus is presented in [21]. The tweets were collected during the period of the COVID-19 pandemic to study the pandemic from a social perspective. The corpus was developed to identify information influencers during the month of March 2020, and contains nearly four million tweets. Different algorithms were used to analyze the influence of information spreading and compare the ranking of users.
For fake news detection in other languages, there are many corpora that are publicly available to tackle the spread of false information. A multilingual cross-domain fact-checking news dataset for COVID-19 has been introduced in [22]. The collected dataset covers 40 languages and relies on fact-checked articles from 92 different fact-checking websites to manually annotate the dataset; it is available on GitHub. Another publicly available dataset, called "TweetsCOV19", was introduced in [23]. This dataset contains more than eight million English tweets about the COVID-19 pandemic. It can be used for training and testing a wide range of NLP and machine learning methods and is available online. A novel Twitter dataset is presented in [24], which was developed to characterize COVID-19 misinformation communities. The authors categorized the tweets into 17 classes, including fake cure, fake treatment, and fake facts or prevention. They performed different tasks on the developed dataset, including identifying communities, network analysis, bot detection, sociolinguistic analysis, and vaccination stance. This study's limitations are that only one person performed the annotation, the analyses are correlational rather than causal, and the collected data covered a short period of only three weeks. MM-COVID is a multilingual and multidimensional fake news data repository [25]. The dataset contains 3,981 fake and 7,192 genuine news contents in English, Spanish, Portuguese, Hindi, French, and Italian. The authors explored the collected dataset from different perspectives, including social engagements and user profiles on social media. Sentiment analysis has also been used to facilitate fake news detection. In [26], the authors used sentiment analysis to eliminate neutral tweets. They claimed that tweets related to fake news are more negative and have stronger sentiment polarity in comparison with genuine news.
The main issue in using this approach to detect fake news from Arabic text is the lack of Arabic sentiment resources, including sentiment lexicons and corpora [27]. Testing whether emotions play a role in the formation of beliefs in online political misinformation is presented in [28]. The authors explore emotional responses as an under-explored mechanism of belief in political misinformation. Understanding emotions helps in different domains including capturing the public's sentiments about social events such as the spreading of misinformation on social media [29].
To summarize, most of the existing datasets target the English language, with only a few targeting Arabic. Furthermore, most of the Arabic datasets related to COVID-19 are published without annotation. Datasets that are annotated were annotated automatically and collected during a short period of time. Additionally, not all of these datasets are publicly available. In this research, we address these issues by employing three annotators to manually perform the annotation task.
III. METHODOLOGY
Fig. 1 presents the architecture of the proposed fake news detection system. In the first step of the framework, we collect data from Twitter using the Twitter Streaming API. In the second step, we extract tweets that discuss rumor or fake news topics during the pandemic, annotate a small sample of tweets manually, and develop a system to automatically annotate a large dataset of unlabeled tweets.
In the last step, we store the dataset in a database and use it to accomplish our experiments and analysis. This research intends to build an Arabic fake news corpus that can be used for analyzing the spread of fake news on social media during the COVID-19 pandemic. To address this need, we perform the following four steps: 1) data collection, 2) rumor/misinformation keyword extraction, 3) data preprocessing, and 4) fake news annotation.

A. Data Collection
In this section, we describe the process of data collection from Twitter. First, we prepared a list of hashtags that appeared during the COVID-19 outbreak, as shown in Table I. Using the Tweepy Python library and Twitter's API, we collected Arabic tweets related to COVID-19 from January 1, 2020, until May 31, 2020, searching for tweets containing one or more of the defined hashtags in the tweet's text. This step allowed us to collect more than seven million unique tweets. After applying some filters, such as removing short and repeated tweets, 5.5 million tweets remained. However, as some of the collected tweets were irrelevant, we decided to keep only those tweets relevant to the COVID-19 pandemic and containing fake news keywords.
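The hashtag matching and duplicate removal described above can be sketched as follows. This is a minimal illustration, assuming collected tweets are plain dicts with a "text" field; the three-item hashtag list is a hypothetical stand-in for the full list in Table I, not the authors' actual code.

```python
# Illustrative subset of the tracked hashtags (Table I lists the full set).
COVID_HASHTAGS = ["#كورونا", "#COVID19", "#corona"]

def matches_hashtags(text, hashtags=COVID_HASHTAGS):
    """Return True if the tweet text contains any tracked hashtag."""
    lowered = text.lower()
    return any(tag.lower() in lowered for tag in hashtags)

def filter_tweets(tweets, hashtags=COVID_HASHTAGS):
    """Keep only tweets whose text mentions at least one hashtag,
    dropping exact-duplicate texts (one of the cleaning filters)."""
    seen, kept = set(), []
    for tweet in tweets:
        text = tweet["text"]
        if matches_hashtags(text, hashtags) and text not in seen:
            seen.add(text)
            kept.append(tweet)
    return kept
```

In the actual pipeline, this filter would run over the stream delivered by Tweepy rather than an in-memory list.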

B. Fake News Keywords Extraction
To collect a list of keywords relevant to the rumours circulating during the pandemic, we used two sources:
• Agence France-Presse (AFP), with its newly formed health investigation team, which has the responsibility of dealing with large amounts of fake news in various languages and indicating their error or inaccuracy.
• The Anti-Rumours Authority (No Rumours), an independent project established in 2012 to address and contain rumours and sedition and prevent them from causing any harm to society.
After reading and analyzing the rumours and misinformation circulated on social media using the above-mentioned sources, a list of 40 keywords was extracted and used to prepare our dataset, as shown in Table II. These keywords cover a variety of topics associated with fake news, rumours, racism, unproven cure methods, and false information. For example, there was a rumour that herbal tea can be used to treat COVID-19.
Another topic circulated was that Cristiano Ronaldo offered to transform his hotels into hospitals and give free treatment to COVID-19 patients. Another alleged that the coronavirus targets only people with yellow skin and Asians in order to reduce population density. Other topics include the conversion of non-Muslims to Islam.
We extracted a corpus of more than 37,000 unique tweets related to rumor and misinformation topics during the COVID-19 pandemic. The tweets were written by 24,117 users, with an average of 1.5 tweets per user. Detailed statistical information about the corpus is presented in Table III.

C. Data Pre-processing
We performed several text pre-processing steps based on the procedure described in [35] in order to sanitize the collected tweets before annotation and classification. Our dataset, which is a mixture of Modern Standard Arabic and dialectal Arabic, requires further filtering, such as removing duplicated letters, strange words, and non-Arabic words. The following is a complete list of the steps performed:
• Removing mentions, hyperlinks, and hashtags.
• Removing non-Arabic and strange words.
• Removing punctuation marks and Arabic diacritics.
• Removing repeated characters, which add noise and influence the mining process.
Two libraries were used to perform further pre-processing on the corpus text, namely stemming and rooting. The first library is NLTK, which was used to perform stemming on the corpus text using ISRIStemmer. The second library is Tashaphyne, which was used to obtain the root of each word in the corpus.
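The cleaning steps listed above can be sketched with regular expressions alone, as in the minimal function below; the exact patterns are our assumptions, not the authors' implementation, and the separate stemming/rooting pass with NLTK's ISRIStemmer and Tashaphyne is omitted.

```python
import re

# Hypothetical patterns for the cleaning steps described in the paper.
ARABIC_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
MENTIONS_LINKS_TAGS = re.compile(r"(@\w+|https?://\S+|#\w+)")
NON_ARABIC = re.compile(r"[A-Za-z0-9]+")
REPEATS = re.compile(r"(.)\1{2,}")  # collapse 3+ repeated characters

def clean_tweet(text):
    text = MENTIONS_LINKS_TAGS.sub(" ", text)  # mentions, hyperlinks, hashtags
    text = NON_ARABIC.sub(" ", text)           # non-Arabic words and digits
    text = ARABIC_DIACRITICS.sub("", text)     # Arabic diacritics
    text = re.sub(r"[^\w\s]", " ", text)       # punctuation marks
    text = REPEATS.sub(r"\1", text)            # elongated (repeated) characters
    return re.sub(r"\s+", " ", text).strip()
```

For example, a tweet containing a mention, a link, a hashtag, and an elongated word reduces to its bare Arabic tokens after cleaning.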

D. Manual Annotation
A sample of 2,500 tweets was manually annotated into fake or genuine classes. We developed a small application to facilitate the annotation process, as shown in Fig. 2. We involved three annotators in the annotation of the sample dataset. Two of the annotators performed the annotation, while the third was tasked with evaluating their output and resolving conflicts. We requested the annotators to read and understand the list of guidelines, and instructed them to skip tweets containing a mixture of fake news and genuine topics and to only annotate tweets with a clear and distinct fake news topic. The guidelines are as follows:
• Tweets are generally considered fake if one fake news topic is discussed in the tweet.
• Tweets are considered not fake if one fake news topic is discussed in the tweet and the topic is negated.
• Tweets that contain a mixture of both fake and genuine news are skipped.
The annotation process resulted in a corpus containing 1,537 tweets (835 fake and 702 genuine), after excluding duplicated tweets, tweets that contain mixed fake and genuine news, and tweets where the fake news was meant as sarcasm. Statistical information about the manually annotated corpus is shown in Table III. We used Cohen's kappa coefficient to measure the inter-annotator agreement, obtaining a value of 0.91. Table IV shows examples of annotated tweets.
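As a worked illustration of the agreement measure, Cohen's kappa can be computed directly from the two annotators' label sequences; a minimal pure-Python sketch, using the paper's two classes as labels.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    classes = set(labels_a) | set(labels_b)
    # Observed agreement: fraction of tweets with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under chance, from each annotator's marginals.
    p_e = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in classes
    )
    return (p_o - p_e) / (1 - p_e)
```

A kappa of 0.91, as reported here, indicates almost perfect agreement on the usual interpretation scale.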

E. Automatic Annotation
We first trained different machine learning classifiers on the manually annotated corpus and used the best-performing classifier to automatically predict the fake news class of the remaining unlabeled tweets. The prediction process yielded 34,529 tweets (19,582 fake and 19,582 genuine), as shown in Table III. During the manual annotation process, the annotators found some tweets containing fake news keywords but carrying sarcasm; in this case, the annotators were requested to annotate them as genuine.
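The automatic annotation step can be sketched as follows, assuming scikit-learn: fit the best-performing configuration reported later (LR over n-gram TF-IDF) on the manually labelled tweets, then predict classes for the unlabelled pool. The tiny English toy corpus and the `auto_annotate` helper are illustrative assumptions, not the authors' code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def auto_annotate(labeled_texts, labels, unlabeled_texts):
    """Train on the manually annotated tweets, then label the rest."""
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 3)),   # unigrams to trigrams
        LogisticRegression(max_iter=1000),
    )
    model.fit(labeled_texts, labels)
    return list(model.predict(unlabeled_texts))
```

In the paper's setting, the labelled input is the 1,537-tweet manual corpus and the unlabelled input is the remaining keyword-filtered pool.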

IV. EXPERIMENTS
In this section, we present the results of the fake news classification after describing the employed feature extraction techniques, experimental setup, classifier model training, and evaluation measures.

A. Feature Extraction
The next step after text pre-processing is to prepare the features used to build the classification models. We used the following features:
• Count Vector: the text in our corpus is converted into a vector of term counts.
• Word-Level TF-IDF: each term in our corpus is represented in a TF-IDF matrix.
• N-gram-Level TF-IDF: we used unigram, bigram, and trigram models in our experiments, representing the terms in a matrix of TF-IDF scores.
• Character-Level TF-IDF: we represent character-level TF-IDF scores for each tweet in our corpus.
These features are used to train multiple classifiers in order to build machine learning models with the ability to decide the most probable category for new, unseen tweets.
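The four representations can be sketched with scikit-learn's vectorizers; the two-tweet toy corpus stands in for the pre-processed tweets, and the character n-gram range is an illustrative choice not specified in the paper.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["خبر عاجل عن كورونا", "علاج كورونا بالاعشاب"]

count_vec   = CountVectorizer().fit_transform(corpus)                    # Count Vector
word_tfidf  = TfidfVectorizer().fit_transform(corpus)                    # Word-level TF-IDF
ngram_tfidf = TfidfVectorizer(ngram_range=(1, 3)).fit_transform(corpus)  # N-gram-level TF-IDF
char_tfidf  = TfidfVectorizer(analyzer="char",
                              ngram_range=(2, 4)).fit_transform(corpus)  # Character-level TF-IDF
```

Each call yields a sparse document-term matrix with one row per tweet; the n-gram variant has a wider vocabulary than the word-level one because it adds bigram and trigram columns.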

B. Experimental Setup
This section describes the experimental configurations used to perform the text classification task. We designed a set of experiments aiming to validate and ensure the quality of the manually and automatically generated annotations. We explored fake news detection as a binary classification problem (fake and genuine). The manually and automatically annotated corpora contain 1,537 and 34,529 tweets, respectively. We divided both datasets into 80% for training and 20% for testing. Table VI shows the sizes of the resulting training and testing sets.
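The 80/20 split can be reproduced with scikit-learn's `train_test_split`; the placeholder texts and labels below are illustrative, and whether the authors stratified by class is not stated, so the `stratify` argument is an assumption.

```python
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the annotated tweets.
texts = ["tweet %d" % i for i in range(100)]
labels = ["fake" if i % 2 else "genuine" for i in range(100)]

# 80% training / 20% testing, preserving the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)
```

With a stratified split, the class proportions in the test set match those of the full corpus.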

C. Models Training
Once the textual tweets were converted into numerical form, the data frames containing the count vector, word-level TF-IDF, n-gram-level TF-IDF, and character-level TF-IDF representations for each tweet in our corpus were used to train six different classifiers. We used scikit-learn, a Python library, for classifier implementation and for predicting the classes of the unlabeled dataset. K-fold cross-validation was used to select the classifier that provides the highest results and shows the best ability to generalize. The collection was split into five folds, four of which were used for training on each iteration and the fifth for evaluation.
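The five-fold model-selection loop can be sketched as follows, assuming scikit-learn; the synthetic 100-tweet corpus is illustrative only, and LR over word-level TF-IDF stands in for any of the six classifier/feature combinations.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the annotated corpus.
texts = (["fake cure %d" % i for i in range(50)]
         + ["official news %d" % i for i in range(50)])
labels = ["fake"] * 50 + ["genuine"] * 50

# One candidate configuration; each classifier/feature pair would be
# scored the same way and the best mean score selected.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, texts, labels, cv=5, scoring="f1_macro")
```

`cross_val_score` handles the five-way split internally: each of the five iterations trains on four folds and evaluates on the held-out fifth.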

D. Evaluation Measures
The evaluation was carried out using three measures: Precision, Recall, and F1-score, defined as follows:

Precision = TruePositive / (TruePositive + FalsePositive) (1)

Recall = TruePositive / (TruePositive + FalseNegative) (2)

F1-score = 2 × (Precision × Recall) / (Precision + Recall) (3)

Where:
• True Positive: the number of fake tweets that are correctly predicted as fake tweets.
• True Negative: the number of genuine tweets that are correctly predicted as genuine tweets.
• False Positive: the actual class is genuine, but the predicted class is fake.
• False Negative: the actual class is fake, but the predicted class is genuine.
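The three measures can be written out directly from these counts, with fake as the positive class; a minimal pure-Python sketch.

```python
def evaluate(y_true, y_pred, positive="fake"):
    """Precision, Recall, and F1-score with `positive` as the target class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

This mirrors equations (1)-(3): precision penalizes genuine tweets flagged as fake, recall penalizes fake tweets missed, and F1 is their harmonic mean.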

E. Experimental Results
We present the experimental results on the Arabic fake news dataset. Six machine learning classifiers (NB, LR, SVM, MLP, RF, and XGB) were used to perform our experiments on the manually and automatically annotated datasets. We used count vector and TF-IDF vectorization (word-level, n-gram-level, and character-level) to train the classifiers. Precision, recall, and F1-score were used to evaluate each classifier under 5-fold cross-validation. Bold values indicate the setting that yielded the best classification performance for fake tweets.
The results showed that using the LR classifier with the n-gram TF-IDF feature, and without applying further pre-processing to the text (such as stemming or rooting), yielded the best classification performance on the manually annotated corpus, with an F1-score of 87.8%, as shown in Table VII. The same classifier, with the word count feature and without stemming or rooting, obtained the best classification performance on the automatically annotated corpus, achieving an F1-score of 93.3%, as shown in Table VIII.
As shown in Fig. 3, the highest precision was obtained using the n-gram TF-IDF feature with the LR classifier (87.8%) and the count vector feature with the LR classifier (93.4%) on the manually and automatically annotated corpora, respectively. The results obtained using the raw text are better than those obtained after applying stemming and rooting to the corpus text. We can conclude that this further pre-processing did not enhance the classification results for text from social media.
As shown in Fig. 4, the highest recall was obtained using the count vector feature with the LR classifier (87.7%) and the count vector feature with the LR classifier (93.3%) on the manually and automatically annotated corpora, respectively. The highest F1-score, as shown in Fig. 5, was obtained using the n-gram-level TF-IDF feature with the MLP classifier (87.8%) and the count vector feature with the LR classifier (93.3%) on the manually and automatically annotated corpora, respectively.

V. DISCUSSION
The primary objective of this research was to build a benchmark dataset for fake news in Arabic related to the COVID-19 pandemic. We introduce a new fake news corpus in the Arabic language, collected from Twitter. It is clear from the experimental results that the manually annotated corpus can be used as a baseline for further research in the domain of fake news and misinformation. As there remains no benchmark dataset for fake news detection in Arabic related to the COVID-19 pandemic, this corpus will help the research community once the dataset is publicly available. The proposed corpus was manually annotated by three annotators to ensure the quality and usefulness of the developed corpus. We used a set of machine learning classifiers to train different machine learning models on the manually annotated corpus. The best model was selected to predict fake news classes of unlabeled tweets (more than 35,000 tweets). The statistical analysis showed lower precision, recall, and F1-score values in the classification of the manually annotated corpus, while the automatically annotated corpus showed improved results. From the results presented in the previous section, we notice that increasing the size of the dataset leads to an improvement in the classification results using precision, recall, and F1-score measures.
Using machine learning methods to classify the fake news corpus with content-based features gives better results than user-based features. The corpus can be further expanded in two ways: 1) increasing the number of verified rumour or misinformation topics, or 2) performing classification on more unlabeled tweets related to the COVID-19 pandemic. Deep learning approaches could then be used to enhance fake news classification.

VI. CONCLUSION AND FUTURE WORK
In this paper, we introduced a new Arabic corpus of fake news that will be made publicly available for research purposes at this link: (https://github.com/yemen2016/FakeNewsDetection), after preparing the tweet IDs and their associated classes. We explained the collection process and gave details about how we selected rumor and misinformation topics during the COVID-19 pandemic. The classification task was performed using six classifiers (Naïve Bayes, Logistic Regression, Support Vector Machine, Multilayer Perceptron, Random Forest, and eXtreme Gradient Boosting) to test the possibility of recognizing fake and genuine tweets. We used four feature types: count vector, word-level TF-IDF, n-gram-level TF-IDF, and character-level TF-IDF. We noticed that the achieved performance varies depending on the features and classifiers used. Along with considering the raw text as input to the machine learning classifiers, we also used two pre-processing methods: stemming and rooting. Both techniques failed to improve the classification results, as the corpus text was collected from Twitter, which includes various dialects and language mistakes; therefore, the stemming and rooting procedures did not produce correct word forms. The study concluded that we can achieve higher performance with more annotated data.
In the future, we plan to expand our corpus with additional verified rumour and misinformation topics. We also look forward to investigating the performance of new classification methods such as deep learning. In this research, we only used content-based features to classify and analyze fake news, though user-based features may also be utilized.