Healthcare Misinformation Detection and Fact-Checking: A Novel Approach

Information spreads rapidly on the internet, which has become the first place many people look for advice on medication and health problems. This growing reliance has also accelerated the spread of misinformation. Healthcare misinformation can severely affect people's lives, so effort is needed both to detect it and to fact-check information before it is used. In this paper, the authors propose a model to detect and fact-check misinformation in the healthcare domain. The model extracts healthcare-related URLs from the web, preprocesses their content, computes Term-Frequency, extracts sentimental and grammatical features to detect misinformation, and computes the Euclidean, Jaccard, and Cosine similarity measures to fact-check the URLs as True or False against a manually generated dataset labeled with expert opinion. The model was evaluated using five state-of-the-art machine learning classifiers: Logistic Regression, Support Vector Machine, Naïve Bayes, Decision Tree, and Random Forest. The experimental results showed that sentimental features are crucial for detecting misinformation, as more negative words are found in URLs containing misinformation than in URLs with true information. Naïve Bayes outperformed all other models with 98.7% accuracy, whereas the Decision Tree classifier had the lowest accuracy at 92.88%. The Jaccard Distance measure was found to be the most accurate distance measure compared with Euclidean distance and Cosine similarity.

Keywords: Misinformation detection; sentiment analysis; document similarity; fact-check; healthcare


I. INTRODUCTION
Online social media and the web as a whole have become a primary source of information for users all around the world. Due to its convenience, feasibility, unrestricted access, and reasonable cost, the internet has become popular amongst the community [1], [2]. People read, share, write, and view articles, blogs, news, videos, audio, etc., all over the internet, and the rate at which such content is shared has accelerated dramatically. However, users share not only accurate information but also wrong or incorrect information, knowingly or unknowingly, in a moment. This widespread misinformation has serious consequences for individuals, commerce, health, government, and all other facets of society. The ramifications of misinformation can be catastrophic and may even cost lives. For example, the political disinformation spread during the 2016 USA presidential elections led to public shootings. These enduring consequences of misinformation contribute towards violent conflicts that are otherwise preventable [3], [4].
The internet has become the most popular and the first choice of the public for investigating health problems. However, people get misinformed by incorrect content. A well-known example is the public misconception that the measles, mumps, and rubella (MMR) vaccine causes autism. Health misinformation is defined as "A health-related claim of fact that is currently false due to a lack of scientific evidence" [5]. Personal experiences shared over the internet, or articles written about certain diseases without verifying the facts or with a lack of evidence, can damage the health of readers and may lead to severe harm [6], [7]. Misinformation related to health can have direct and hazardous effects on people's lives, so detecting misinformation in healthcare is a pressing need [8]-[10].
Misinformation detection has become a topic of interest amongst researchers in the literature. Researchers have studied different types of false information. The first category is termed misinformation, which is information that existing evidence shows to be inaccurate or incorrect [11], [16], [17]. The other categories include fake news [12], [13], rumor [6], satire news [14], hoaxes [15], disinformation [18], and opinion spam [19]. To detect each of these categories of false information, authors have used several features such as sentiment analysis, user-specific features, syntactical features, grammatical features, and image- or message-specific features. There are also readily available datasets for false information detection in various domains viz. politics, news, business, and healthcare. A few examples of these datasets are LIAR, FakeNewsNet, and BSDetector. With the help of features and datasets, machine learning and deep learning techniques are applied to detect false information [11]. However, detecting misinformation is a demanding task for two main reasons: first, the availability of datasets in a given domain, and second, the fact-checking of the data [11], [20], [21]. It is difficult to obtain benchmark and gold-standard datasets in a specific domain. Also, manual fact-checking of data is time-consuming, requires expert guidance, and involves laborious tasks. Thus, automatic fact-checking is needed to keep pace with newly arriving and changing data.

*Corresponding Author. www.ijacsa.thesai.org

Document Similarity (DS) is a measure of the distance between two documents. Several distance measures are available in the literature to compute the similarity between documents, such as Euclidean Distance, Cosine Similarity, and Jaccard Distance. The concept of document similarity can be used to fact-check information against existing verified documents and thus can help to detect misinformation. Sentiment Analysis (SA) techniques, which classify the polarity of data as positive, negative, or neutral, have been widely used in the literature to detect misinformation, fake news, rumors, etc. The opinions of people about products, services, movies, etc. can be easily captured using sentiment analysis [20], [22]-[26]. The literature on detecting misinformation or assessing the credibility of information using sentiment analysis has noted that articles or blogs containing more positive words tend to spread true information, while articles spreading misinformation contain more negative words [27], [28].
Thus, to detect misinformation and perform fact-checking automatically, the authors propose a hybrid approach of sentiment analysis and document similarity. In this research paper, the authors have created a sentiment-based Bag-of-Words (BoW) as a dataset related to the healthcare domain. Further, sentiment, grammatical, and lexical features are used to detect misinformation, and the document similarity measures viz. Euclidean distance, Cosine similarity, and Jaccard distance are used to perform fact-checking.
The remaining sections of the paper are structured as follows. Section II provides the literature survey, describing techniques that use sentiment-based features to detect misinformation in the healthcare domain as well as document similarity-based approaches used to fact-check documents. Section III describes the proposed model architecture, the dataset collection and cleaning process, and the methodology used in the proposed model. Section IV discusses the results obtained with the proposed hybrid approach of sentiment analysis and document similarity, and Section V presents the conclusion and future enhancements.

A. Sentiment Analysis in Healthcare
For web articles, sentiment analysis measures the attitude of the author towards the article topic as positive, negative, or neutral. In healthcare-related articles especially, people like to express and share their opinions about their experiences of the diseases they have suffered from. Readers can therefore become biased towards the opinion of the author and believe the article without verifying the facts or evidence. Due to the rich health content available online, the web has become the first choice of patients and other users seeking cures and remedies for diseases. Thus, understanding the sentiment of article contents is much needed for misinformation detection. Among state-of-the-art techniques, the moods of cancer patients have been analyzed from tweets, with Long Short Term Memory (LSTM) techniques used to find the sentiments [29]. In another study, authors collected 1,000 text comments by medical experts on various medical animation videos from the YouTube repository and applied sentiment analysis to these comments to enhance the reputation of telemedicine education across the globe [30]. To study the effectiveness or popularity of a medicine, authors have performed sentiment analysis on public reviews using weighted word representation techniques and added linguistic constraints to model contextually similar words [31]. Sentiment analysis techniques were also used to detect misinformation about herbal treatments of diabetes in Arabic comments on YouTube videos [32]. Sentiment analysis is widely used in the healthcare sector to understand the sentiment polarity of text and thus can act as a major feature for misinformation detection. Table I displays recent sentiment analysis techniques in the healthcare domain in comparison with the techniques of the proposed model.

B. Document Similarity in Healthcare
Document similarity expresses the distance between two documents as a numeric value. Document similarity measures are used to find the similarity between healthcare documents. For example, to detect the medical codes of documents, authors have used an attention mechanism that targets the most informative parts of the documents [33]. In another study, the Jaccard distance measure was used to compute the similarity between medical documents using a Non-negative matrix factorization algorithm [34]. In another study, the Term Frequency-Inverse Document Frequency (TF-IDF) of a document is computed and document similarity is measured using cosine similarity; k-means is then used to cluster documents of similar types. The authors have also used the Unified Medical Language System (UMLS) to extract domain-specific features and selected the required features using Principal Component Analysis (PCA). Further, the authors have used expectation maximization techniques to cluster similar documents together [35]. Document similarity is extensively applied in the healthcare domain to group similar documents together. This technique, along with sentimental features, will be useful for detecting misinformation in the healthcare domain.

C. Sentiment Analysis and Document Similarity Approaches
Document classification can be achieved effectively using document similarity measures. The literature finds the combination of sentiment analysis and document similarity effective for document classification. Deep learning techniques along with cosine similarity measures have been used to classify documents related to stock news based on their sentiments, resulting in the most relevant documents being merged together [36]. In another approach, One-Class Support Vector Machine (OCSVM) and Latent Semantic Indexing (LSI) were used to classify text documents as positive or negative [37]. In another approach, the NET-LDA model was proposed to find the semantic similarity between documents using sentiment polarity and cosine similarity approaches [38]. Three different measures are followed in the literature for document similarity measurement viz. Jaccard Distance, Cosine Similarity, and Euclidean Distance. However, the authors did not find any articles in which document similarity measures are used along with sentiment analysis to classify documents based on their similarity. Thus, the hybrid combination of document similarity and sentiment analysis is a novel approach and can be used to detect and fact-check healthcare-related misinformation. Table III displays recent techniques of document similarity and sentiment analysis together with the techniques of the proposed model.

D. Document Similarity and Fact-Checking
The major challenge faced in detecting misinformation is fact-checking the data, as there are few benchmark datasets available for a specific domain like healthcare. With the enormous amount of information generated online, it is highly challenging to manually fact-check individual articles or blogs. Recent tools and techniques are therefore automated using features from the text such as sentimental features, user-specific features, and grammatical features. In the literature, authors have used Term Frequency-Inverse Document Frequency (TF-IDF) and cosine similarity measures with k-means, Support Vector Machine, and Multilayer Perceptron to detect the credibility of Indonesian news. In another study, Latent Dirichlet Allocation (LDA) and Jaccard distance measures were used to detect fake news on the Buzzfeed dataset. In research on collecting evidence for fake news detection, word embeddings were used, followed by the Word Mover's distance measure to measure the similarity between documents. However, it was observed that Word Mover's distance is very expensive for a large amount of data [39]-[42]. Table IV displays recent techniques for detecting misinformation using document similarity and sentiment analysis. Though a few studies handle fact-checking using document similarity measures, no major work has been carried out in this field. Thus, in this paper, the authors propose a model with a hybrid combination of sentiment analysis and document similarity to detect and fact-check misinformation.

A. Model Architecture
The proposed model architecture for detecting misinformation in the healthcare domain and performing fact-checking automatically is shown in Fig. 1. Sections B, C, and D describe in detail how the architecture is built. Section B covers the data collection method, Section C describes the features extracted and used for model building, and Section D explains how the working model operates.

B. Dataset Creation
The authors crawled 60 healthcare-domain URLs from the web and classified them as True or False with the help of expert opinion. This dataset is used to verify and classify other URLs from the healthcare domain. The authors further crawled 898 web URLs related to the healthcare domain, of which 280 are used for training the model and 618 for testing. These URLs are a combination of true and false URLs in the healthcare domain and are classified with the help of document similarity measures, sentimental features, and grammatical features along with machine learning techniques.

C. Feature Extraction
Three main types of features are extracted from the URL datasets. First, the authors focus on sentimental features, which include the positive word count, the negative word count, the percentages of positive and negative words, and the total number of words. In research on the sentiments of people during the COVID-19 pandemic, authors created a large benchmark dataset based on tweets from Twitter [43]. Thus, sentimental features are crucial in the healthcare domain. As grammatical features, the authors have extracted nouns, pronouns, verbs, and adjectives from the URL text. The third type of feature is the document similarity measure. Three measures are used in this paper to fact-check the URLs against manually classified web URLs related to healthcare. The first is Euclidean Distance, which measures the straight-line distance between two points in Euclidean space. Equation 1 gives the Pythagorean formula for the Euclidean distance between two points x and y [44].

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{1}$$

In this paper, the authors have used the Euclidean Distance (ED) measure as a feature computed separately for true and false URLs. The second measure, Jaccard Distance (JD), measures the similarity between two documents as the ratio of the size of their intersection to the size of their union. Equation 2 shows the formula for two documents A and B [44].

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} \tag{2}$$

The third document similarity measure is cosine similarity, which computes the cosine of the angle between two vectors. It is the dot product of the vectors divided by the product of their magnitudes. Equation 3 shows the formula to compute the cosine similarity between two documents A and B [44].

$$\cos(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|} \tag{3}$$

In this paper, the authors have used Euclidean, Jaccard, and Cosine similarity measures as features to perform fact-checking of the URLs and thus detect misinformation in the healthcare domain. Table V lists the final set of features used in the proposed model.
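
As an illustration of the three measures named in Equations 1 to 3, the following minimal Python sketch computes them on term-frequency vectors. It is not the authors' implementation: the function names and the two toy documents are assumptions made for illustration, and the Jaccard function returns the intersection-over-union ratio described in the text (the Jaccard distance in the strict sense is commonly taken as one minus this ratio).

```python
import math
from collections import Counter


def euclidean_distance(tf_a: Counter, tf_b: Counter) -> float:
    """Straight-line distance between two term-frequency vectors."""
    terms = set(tf_a) | set(tf_b)
    # A Counter returns 0 for terms it has not seen.
    return math.sqrt(sum((tf_a[t] - tf_b[t]) ** 2 for t in terms))


def jaccard_similarity(tf_a: Counter, tf_b: Counter) -> float:
    """Size of the shared vocabulary divided by the size of the joint vocabulary."""
    set_a, set_b = set(tf_a), set(tf_b)
    union = set_a | set_b
    if not union:
        return 1.0  # convention assumed here: two empty documents are identical
    return len(set_a & set_b) / len(union)


def cosine_similarity(tf_a: Counter, tf_b: Counter) -> float:
    """Dot product of the vectors divided by the product of their norms."""
    dot = sum(tf_a[t] * tf_b[t] for t in set(tf_a) & set(tf_b))
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention assumed here for empty documents
    return dot / (norm_a * norm_b)


# Toy documents (hypothetical, not taken from the paper's dataset).
doc_a = Counter("vaccine prevents measles vaccine".split())
doc_b = Counter("vaccine causes autism".split())

print("Euclidean distance:", euclidean_distance(doc_a, doc_b))
print("Jaccard ratio:", jaccard_similarity(doc_a, doc_b))
print("Cosine similarity:", cosine_similarity(doc_a, doc_b))
```

Note that Euclidean distance grows as documents diverge, whereas the Jaccard ratio and cosine similarity grow as documents converge, so the direction of each measure has to be kept in mind when the values are compared.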

D. Working Model of Misinformation Detection and Fact-Checking
In the proposed model, the training dataset is first preprocessed to remove punctuation, stop-words, numeric data, duplicate data, etc., which yields cleaned data for the execution of the model. After pre-processing the URL contents, Term-Frequency (TF) is computed to count the terms in the textual contents of each URL. The term frequencies are stored in a CSV file for future use.

The next step is to generate features. The first type is sentimental features, which capture the polarity of the URL's textual contents in terms of positive and negative words and are computed from the TF generated in the previous step. Along with sentimental features, grammatical features such as nouns, pronouns, verbs, and adjectives are also extracted. Sentimental features play a significant role in misinformation detection: text containing misinformation was found to contain more negative words than positive words, and vice versa for true information. Thus, a higher share of negative sentiment can indicate misinformation [17]. Sentimental and grammatical features together therefore help to detect misinformation in the proposed model.

The next aim is to perform automatic fact-checking of newly arriving URLs from the test dataset. For this purpose, a fact-check URL dataset is generated, which contains manually fact-checked URLs from the healthcare domain classified as True or False. To fact-check URLs from the test dataset, the authors use the standard distance measures Euclidean Distance, Cosine Similarity, and Jaccard Distance as features. Every URL from the test dataset is first preprocessed to clean the data, its term frequency and sentimental features are generated, and finally the distance measure features are computed using the standard formulas explained in Section C. To compute a distance measure, a URL from the test dataset is matched against the URLs in the fact-checked dataset, which gives two numeric values: the distance to the True URLs and the distance to the False URLs of the fact-check dataset. These two values are compared and the minimum is taken as the final feature value. This process is repeated for every URL in the test dataset and for every distance measure.

When all the features are generated, machine learning classifiers are applied to test the accuracy of the model. The authors have used five state-of-the-art machine learning classifiers from the literature viz. Logistic Regression (LR), Support Vector Machine (SVM), Naïve Bayes (NB), Decision Tree (DT), and Random Forest (RF).
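
The steps of the working model can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the word lists, stop-word list, function names, and example sentences are hypothetical, only the Euclidean measure is shown, and the distance to each class is assumed to be the smallest distance to any document of that class, since the text does not specify how multiple fact-checked URLs are aggregated.

```python
import math
import re
from collections import Counter

# Hypothetical toy lexicons and stop-word list; the paper's sentiment-based
# Bag-of-Words is not reproduced here.
POSITIVE_WORDS = {"well", "safe", "effective", "healthy"}
NEGATIVE_WORDS = {"death", "false", "danger", "toxic"}
STOP_WORDS = {"the", "a", "an", "is", "are", "and", "of", "to", "in"}


def preprocess(text: str) -> list[str]:
    """Lower-case, keep alphabetic tokens only, then drop stop-words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def term_frequency(text: str) -> Counter:
    """Count how often each cleaned term occurs in the text."""
    return Counter(preprocess(text))


def sentiment_features(tf: Counter) -> dict[str, float]:
    """Positive/negative word counts and percentages derived from the TF."""
    total = sum(tf.values())
    pos = sum(c for t, c in tf.items() if t in POSITIVE_WORDS)
    neg = sum(c for t, c in tf.items() if t in NEGATIVE_WORDS)
    return {
        "positive_count": pos,
        "negative_count": neg,
        "total_words": total,
        "positive_pct": 100.0 * pos / total if total else 0.0,
        "negative_pct": 100.0 * neg / total if total else 0.0,
    }


def euclidean_distance(tf_a: Counter, tf_b: Counter) -> float:
    terms = set(tf_a) | set(tf_b)
    return math.sqrt(sum((tf_a[t] - tf_b[t]) ** 2 for t in terms))


def fact_check_feature(test_tf, true_tfs, false_tfs):
    """Return the smaller of the two class distances and the closer label.

    Assumption: the distance to a class is the minimum distance to any
    fact-checked document of that class; ties go to "True".
    """
    d_true = min(euclidean_distance(test_tf, tf) for tf in true_tfs)
    d_false = min(euclidean_distance(test_tf, tf) for tf in false_tfs)
    return (d_true, "True") if d_true <= d_false else (d_false, "False")


# Hypothetical fact-checked texts and one test text.
true_tfs = [term_frequency("The vaccine is safe and effective.")]
false_tfs = [term_frequency("The vaccine is toxic and causes death.")]
test_tf = term_frequency("Doctors say the vaccine is safe.")

print(sentiment_features(test_tf))
print(fact_check_feature(test_tf, true_tfs, false_tfs))
```

In a fuller pipeline, the sentiment features and the fact-check feature for each distance measure would be concatenated into one feature row per URL before being passed to the classifiers.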

IV. RESULTS AND DISCUSSION
This section presents the experiments carried out to evaluate the performance of the model. The proposed methodology is evaluated on five different state-of-the-art classifiers, namely LR, SVM, NB, DT, and RF. Section A presents the performance matrix of the model in terms of Accuracy, Precision, Recall, and F1-Score for the three parameters viz. Jaccard distance, Euclidean distance, and Cosine similarity, and contains the confusion matrix for the NB classifier. Section B explains the word clouds generated to show the words related to true and false information in the URLs, and Section C presents the analysis of misinformation detection.

A. Performance Matrix
Performance is measured in terms of accuracy, precision, recall, and F1-score. Fig. 2 shows the accuracy of the proposed model for the five classifiers. NB outperformed all other models with 98.7% accuracy, whereas the Decision Tree classifier had the lowest accuracy at 92.88%. Fig. 3, Fig. 4 and Fig. 5 display the precision, recall, and F1-score of the proposed model for the various classifiers using three parameters viz. Jaccard Distance, Euclidean Distance, and Cosine Similarity measures. See also Table VI, Table VII and Table VIII.

Fig. 7 and Fig. 8 display word clouds of URLs containing misinformation and legitimate information respectively. URLs containing misinformation include more negative words such as death, false, etc., whereas URLs with true information include more positive words such as well, symptom, increase, etc. This shows that sentiment analysis can play a vital role in detecting misinformation.

Fig. 9 displays the average percentage of misinformation and true information in the web URLs. It can be seen from Fig. 9 that for around 200 URLs the percentage of misinformation is high compared to true information, and that it peaks for URLs ranging from 200 to 300. Fig. 10 displays the average count of positive and negative words in the URLs classified as True. On average, positive words account for 71% and negative words for 29% in True URLs. Fig. 11 displays the average count of positive and negative words in the URLs classified as False. On average, negative words account for 62% and positive words for 38% in False URLs. Thus, the authors found that URLs with misinformation contain more negative words and fewer positive words on average. Therefore, sentiment analysis is an important feature for detecting misinformation in web URLs.
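
For reference, the four measures reported in Section A follow from the entries of a binary confusion matrix, as in the minimal Python sketch below. The function name and the counts in the example are hypothetical and are not the paper's results; they only illustrate the standard definitions.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Accuracy, precision, recall, and F1-score from confusion-matrix counts.

    tp/fp/fn/tn are the true-positive, false-positive, false-negative, and
    true-negative counts for the class treated as positive.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up counts for illustration only.
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Reporting precision and recall alongside accuracy matters when the True and False classes are unevenly represented, because accuracy alone can look high even if one class is rarely predicted correctly.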

V. CONCLUSION AND FUTURE WORK
In this research, the authors have proposed a model to detect and fact-check misinformation in the healthcare domain. Fact-checking URLs using distance measures improves the performance of the model compared with the standard technique of manually fact-checking data. Sentimental features were observed to be crucial for detecting misinformation, as more negative words are found in URLs containing misinformation than in URLs with true information. NB outperformed all other models with 98.7% accuracy, whereas the Decision Tree classifier had the lowest accuracy at 92.88%. The Jaccard Distance measure was found to be the most accurate compared with the Euclidean distance and Cosine similarity measures. In the future, the authors want to collect more URLs and observe the difference in the accuracy of the model. The authors also want to identify the spreaders of misinformation by keeping track of the percentage of misinformation contained in the text they publish.