A Novel Evolving Sentimental Bag-of-Words Approach for Feature Extraction to Detect Misinformation

Abstract: State-of-the-art misinformation detection techniques mainly focus on static datasets. However, a massive amount of information is generated online, and websites are flooded with both legitimate information and misinformation. It is difficult to keep track of this changing information and to provide an up-to-date, accurate status of whether a webpage gives legitimate information or misinformation. Therefore, to keep the features up to date, the authors propose an evolving sentimental Bag-of-Words approach. This involves updating the sentimental features every time new or changed web content is read. The process accumulates sentimental features at different time intervals, which can be utilized to detect misinformation in URLs and to update the status of the webpage with timely information. Apart from sentimental features, other state-of-the-art features, viz. syntactical, Part-Of-Speech Tagging (POST), and Term-Frequency (TF) features, are updated in a timely manner and utilized to detect misinformation. The model performed best with the support vector machine, which showed an accuracy of 80%, while the decision tree classifier showed a lower accuracy of 56.66%.


I. INTRODUCTION
In an era of information explosion, the volume of information available on the web has increased rapidly. Social media and the web as a whole have become an important source of information for people due to their easy access, low cost, availability, and popularity [1], [2]. At every moment, millions of people access the internet to share information and interact over social media. This information contains many hidden patterns. A 2012 survey in the USA found that 49% of people had connected to social media and the web to disseminate information, while in 2016 over 62% accessed social media regularly to browse news articles [3], [4]. Therefore, the data disseminated on the web and social media have become a topic of interest for many researchers.
This large volume of proliferated data can cause the spread of misinformation. Misinformation, or false information, is inaccurate or incorrect information, as confirmed against existing evidence [4]. False information may appear in the form of fake news, rumor, satire news, hoaxes, misinformation, disinformation, and opinion spam [3]. Thus, a massive amount of false information has been observed to spread in an uncontrolled fashion over the web.
Misinformation propagated via the web spreads quickly, resulting in a widespread impact on people, business, healthcare, politics, and all other aspects of life. For example, in the 2016 USA presidential election, the spread of misinformation resulted in public shootings. Also, rumors that the Zika virus outbreak was a bioweapon and that the Zika vaccine was developed to depopulate the earth generated considerable activity on Twitter and Facebook [5]. Hence, the persistent effects of misinformation are harmful and can contribute to violent conflicts, which may lead to the death of people, especially in the healthcare domain. A substantial quantity of fictitious and fabricated information passes through the web, creating a feeling of distrust among the public. Thus, to raise the reliability of the web and social networks and to reduce the detrimental effects of misinformation, detecting and combating false information has become the need of the hour and is highly recommended [6], [7].
Detecting misinformation is a challenging task for three key reasons: first, the need to verify information against factual data; second, the unavailability of structured data in a specific domain; and third, the need to adapt to changing web data [4], [8]. Researchers have been keen to address these challenges and to detect misinformation from raw text data. The detailed process of detecting misinformation is explained in the related work section of the paper.
Sentimental features are widely used in the literature to detect misinformation, fake news, rumors, etc. Studies have observed that the sentiment polarity features of misinformation and true information are markedly different. Misinformation contains more negative words, which may be due to guilt or tension while writing wrong information, leading to very strong negative sentiment [9]. In a study identifying prominent features for detecting rumor propagation, the authors found that rumors are dominated by sentiments like anxiety and uncertainty [10]. In [11], the authors considered sentiment analysis to be one of the important factors for assessing information credibility.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 13, No. 4, 2022, www.ijacsa.thesai.org
Although there are sufficient machine learning-based models in the literature to detect misinformation using sentiment features, the authors did not find any work dealing with newly emerging data in the healthcare domain. As the data change gradually over time, the number of sentimental words may also change, i.e., increase or decrease, and this may affect the percentage of misinformation. As information in the real world is time-critical and the model must adapt to newly emerging data, effort is required to accommodate the newly arriving data, which makes this a challenging task [4], [12]. Hence, a technique like incremental learning is most suitable for handling new chunks of data arriving at different time intervals, due to its low memory requirement and short training time [13].
Therefore, in this research the authors have used incremental learning to record the sentimental changes occurring in the data by generating a sentimental "Bag-of-Words" in the healthcare domain, and thus to detect the change in the percentage of misinformation at different time intervals. The aim is to find URLs that initially show legitimate content and later spread misinformation, and vice versa, and to detect the change in the percentage of misinformation across time intervals.
Hence, the objectives of this study are as follows:
1) Generate a sentimental "Bag-of-Words" for the healthcare domain.
2) Incrementally evolve the sentimental "Bag-of-Words" using a sentiment relevance score function.
3) Generate a fact-check dataset of 100 URLs related to the healthcare domain.
4) Extract features to detect misinformation.
5) Use state-of-the-art machine learning classifiers to classify the URLs as legitimate or non-legitimate.
6) Identify and study the change in URLs at times T1 and T2 in terms of misinformation and sentiments.
The remaining sections of the paper are structured as follows: Section 2 presents the literature survey in the field of misinformation detection using sentiment analysis and incremental learning in the healthcare domain, Section 3 explains the methodology, Section 4 presents the results and discussion, Section 5 gives the conclusion, and Section 6 elaborates on limitations and future enhancements.

II. LITERATURE SURVEY
This section outlines the related work, which focuses on: (i) misinformation detection techniques, (ii) incremental learning and sentiment analysis approaches and their applications, (iii) the application of incremental learning in the healthcare domain, and (iv) the application of sentiment analysis in the healthcare domain. Although topics (ii), (iii), and (iv) have each been examined independently, there is a shortage of research that merges them into a sentiment analysis-based approach for detecting misinformation that is capable of handling changes in newly arriving data. In other words, there is no implementation of a model that detects misinformation in the healthcare domain with incremental learning and sentiment analysis on a real-world dataset. A similar effort [14] incorporates behavioral features with linguistic features to detect misinformation. However, the proposed research specifically focuses on using incremental learning to accommodate newly arriving information in the model, along with the sentiment and linguistic features, so that real-time changes are incorporated in the model.

A. Misinformation Detection Techniques
Researchers are putting considerable effort into detecting misinformation. Several supervised, unsupervised, and semi-supervised machine learning and deep learning-based algorithms exist that can detect misinformation. Machine learning is the component of Artificial Intelligence that can acquire knowledge and improve with past experience [3], [4].
The basic process of detecting misinformation using machine learning techniques contains the following steps: data collection, data pre-processing, manual classification using expert knowledge (in the case of a gold standard dataset), feature extraction, and classification. The latest machine learning models in the literature have used features such as content/syntactical features, user-specific features, sentiment features, grammatical features, linguistic features, and image-specific features. Once the required feature set has been identified and extracted, well-known machine learning classifiers are used to perform the classification, viz. Support Vector Machine, Naïve Bayes, Random Forest, Decision Tree, Logistic Regression, and k-Nearest Neighbors [3]. Feature extraction is a significant aspect of misinformation detection, as the effectiveness of machine learning algorithms depends mainly on it. According to [11], sentimental features are considered crucial in assessing the credibility of content. Hence, the authors have considered sentimental features as the central features in this study.

B. Incremental Learning and Sentiment Analysis
Sentiment analysis is the process of detecting sentiment polarity in terms of positive and negative words in a text. However, the user perception of data in terms of sentiment polarity may change as time passes, and statically designed models require updates to adapt to these changes. Incremental learning makes it possible to adapt to these changed instances of data at a low computational cost. Hence, incremental learning and sentiment analysis together can tackle the problem of sentiment shifts and newly generated words [15], [16]. In the literature, researchers have used sentiment analysis with incremental learning to detect commuter emotions by computing emotion density at different time intervals [17]. Sentiment analysis with incremental learning helps to make a more granular analysis of the user's opinion. In another study, lexicons for Arabic text are built dynamically and evolved incrementally. Thus, incremental learning helps to build and expand lexicons dynamically [18].

C. Incremental Learning in Healthcare
Incremental learning has been used in the healthcare literature to enhance predictive accuracy and decrease the computation time of classification. Many disease diagnosis systems have used incremental learning techniques, including incremental SVM and incremental PCA. Incremental learning was applied to numeric datasets to classify diabetes using a combination of support vector machine, Expectation-Maximization, and Principal Component Analysis [19]. An incremental knowledge-based health management information system was designed to fetch and process electronic medical records and expert guidance to populate the knowledge system [20]. In another approach, for newly arriving chunks of data, the network was trained using incremental learning while remembering the earlier patterns of data. This approach helped the researchers to discover heart rate variability patterns in mobile healthcare services [21]. Hence, newly arriving chunks of electronic medical record data are captured using incremental learning techniques.

D. Sentiment Analysis in Healthcare Domain
Sentiment analysis determines the dominant sentiment expressed in a text in order to characterize the author's attitude as positive, negative, or neutral. Among the state-of-the-art techniques, the researchers in [22] collected medical terms from the Unified Medical Language System (UMLS) and clustered similar documents with medical terms using Latent Dirichlet Allocation (LDA) to form condition topics. The sentiment analysis helped users to identify the feelings of other users with similar conditions. Researchers have also been using sentiment analysis to discover sentiment trends about vaccination, such as the measles vaccine and the HPV vaccine, on social media platforms. Sentiment libraries like VADER and Google Cloud Sentiments are used for analysis. Machine learning techniques such as TF-IDF, K-means clustering, and topic modeling using LDA are applied for sentiment classification [23]-[25]. These investigations help health experts and researchers to identify the various reasons for vaccine scares or hesitancy and to take corrective actions to improve vaccine uptake. Sentiment analysis has proved to be efficient and consistent when dealing with huge amounts of data [11]. However, the existing sentiment polarity libraries are restricted to general-purpose sentiments and do not cover domain-based sentiments. To improve the accuracy of sentiment analysis, domain-based sentiment libraries are essential. Due to the scarcity of such domain-based sentiment libraries in the healthcare domain, it is essential to generate a sentimental Bag-of-Words for the healthcare domain [18].

E. Research Contributions
The research contributions of the proposed system are as follows.
1) Generation of sentimental Bag-of-Words (BoW) using incremental learning.
2) A novel sentiment-based incremental machine learning model to detect misinformation from web URLs.
3) Analyzing the performance of the proposed model.

III. METHODOLOGY
Fig. 1 shows the diagrammatic representation of the proposed model. The detailed techniques are discussed in the following subsections.

A. Data Collection and Pre-Processing
This section explains the methodology used to collect the dataset for misinformation detection. To fetch the web URLs, the Google search engine is used. Google is extensively used for collecting information online. According to the statistics, 60% of desktop users and 90% of mobile or tablet users use Google while searching [11]. Initially, the authors identified and shortlisted 25 keywords related to the healthcare domain by identifying the most frequently occurring words in that domain. Table I displays the 25 keywords used for data collection. Various combinations of these keywords are used in queries to the Google search engine, and the top 100 URLs are fetched every time, resulting in a group of 500 URLs. These 500 URLs were a mixture of healthcare URLs, non-healthcare URLs, and duplicate URLs. Thus, the authors manually segregated the URLs into two categories, healthcare and non-healthcare, and removed duplicate URLs; the final set of 100 URLs was used for analysis. Further, the authors, along with domain experts, performed fact-checking of the URLs and manually categorized the 100 URLs into legitimate and non-legitimate based on the contents and the contact information available on the website. The authors have also considered URLs from the CoAID [26] dataset after verifying them with domain experts. The authors observed that most legitimate URLs are from .gov and .edu domains, while non-legitimate URLs are from other domains. Further, the URLs are pre-processed to obtain cleaned data. The process begins with grouping the data from the web URLs and the generated words. Regular expressions are used to remove punctuation and numbers from the data. Sanitization is then used to transform the words into lower case, followed by stop-word removal, which reduces the corpus size by removing unnecessary words that do not provide any useful information.
Also, all single characters and duplicate data are removed to further minimize the corpus size and improve the quality of the information. In addition, two word lists are developed manually from the healthcare URLs, forming a Bag-of-Words (BoW) with sentiment polarity that contains 452 positive words and 341 negative words. Each word was labeled after carefully understanding its meaning, especially for the core medical terms, by referring to the MeSH (Medical Subject Headings) dictionary, which explains biomedical terminology [27], and by taking expert opinion about the usage of those words. Table II displays a sample list of words with manually labeled sentiments.
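The cleaning steps described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `preprocess` and the tiny stop-word set are assumptions, the regular expression assumes English text, and duplicate removal is left out because the text does not state at which level (tokens, pages, or records) it is applied.

```python
import re

# Illustrative stop-word set (an assumption; the paper does not list the one it used).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def preprocess(text, stop_words=STOP_WORDS):
    """Return lower-cased tokens with punctuation, digits, stop words,
    and single characters removed."""
    # Replace everything that is not an ASCII letter or whitespace with a space.
    letters_only = re.sub(r"[^A-Za-z\s]", " ", text)
    tokens = letters_only.lower().split()
    # Drop stop words and single-character tokens.
    return [tok for tok in tokens if tok not in stop_words and len(tok) > 1]


sample = "The COVID-19 vaccine is safe; the vaccine trials in 2021 are a success!"
print(preprocess(sample))
```

Repeated tokens are kept on purpose here so that a term-frequency count can still be computed from the result.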

B. Incremental Learning
Upon obtaining the cleaned data, the term frequency is generated and every term is matched against the sentimental "Bag-of-Words". Here, each term is referred to as a word. Newly arriving words are assigned a sentiment polarity using the sentiment relevance score function, thus updating the sentimental Bag-of-Words incrementally. This function generates synonyms and antonyms of the newly arriving word using the WordNet library of the NLTK tool. To find the closeness between a newly arriving word and its synonyms and antonyms, the authors have used the Jaccard similarity coefficient. The Jaccard similarity coefficient of sets A and B is given by |A ∩ B| / |A ∪ B|, i.e., the ratio of the number of elements in the intersection of A and B to the number of elements in their union. It is used in various applications, such as mail filtering [28], to find the co-occurrence of values of all pairs of words in an email. Further, the threshold value was chosen after carefully analyzing the synonyms and antonyms generated. For example, for the word "good", the relevant words scoring above 0.5 were honourable=0.9, beneficial=1.0, effective=1.0, etc., while those below 0.5 were satisfactory=0.4, fair=0.33, etc. Hence, a threshold of 0.5 was chosen manually to assign sentiment polarity to a new word. However, while for the first batch of 10 URLs the Bag-of-Words was updated with the most relevant words, during the second batch of 10 URLs a few irrelevant words were added. Hence, the authors again tried to identify the most suitable threshold value, one that would fetch only relevant synonyms or antonyms of the word. This was achieved by changing the threshold from 0.5 to 0.95. With the new threshold value of 0.95, the next batches of URLs yielded a good collection of relevant words. Therefore, choosing the right threshold value can improve the collection of the most relevant sentimental "Bag-of-Words" in the healthcare domain.
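The effect of the threshold can be illustrated with a short Python sketch. It is not the authors' relevance score function: `jaccard_similarity` and `relevant_words` are hypothetical helper names, the text does not specify which sets are compared for a pair of words, and the scores for "good" are simply the ones quoted above.

```python
def jaccard_similarity(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two iterables treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(set_a & set_b) / len(union)


def relevant_words(scores, threshold):
    """Keep candidate words whose score is strictly above the threshold."""
    return sorted(word for word, score in scores.items() if score > threshold)


# Scores quoted in the text for candidates related to the word "good".
good_scores = {
    "honourable": 0.9,
    "beneficial": 1.0,
    "effective": 1.0,
    "satisfactory": 0.4,
    "fair": 0.33,
}

print(jaccard_similarity({"a", "b", "c"}, {"b", "c", "d"}))  # 2 shared of 4 distinct
print(relevant_words(good_scores, 0.5))
print(relevant_words(good_scores, 0.95))
```

Raising the threshold from 0.5 to 0.95 drops "honourable" from the accepted candidates, which mirrors the stricter filtering described in the text.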
This process continues in an incremental fashion. The relevance score function is defined as shown in Fig. 2. Next, incremental learning is adopted to update the features. This is achieved by fetching the URLs again at time T2. Thus, the TF of each URL is computed again at time T2, and these terms are matched with the Bag-of-Words to get the final list of terms. The newly arriving terms are appended to the TF. Also, based on the changed Bag-of-Words, the sentimental, syntactical, and grammatical features are extracted and updated. The syntactical features consist of word count and sentence count. The grammatical features contain the number of verbs, nouns, adverbs, adjectives, and pronouns. The sentiment-based features include the number of positive and negative words and the percentage of positive and negative words. The final set of 11 features is listed in Table III.
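As a rough illustration of how the syntactical and sentiment-based features could be recomputed when the Bag-of-Words changes, consider the sketch below. The function name, the toy word lists, and the choice to express percentages relative to the total word count are assumptions; the five grammatical (POS) counts are omitted because they require a tagger such as the one in NLTK.

```python
import re


def syntactic_and_sentiment_features(text, positive_words, negative_words):
    """Compute 6 of the 11 features: word and sentence counts plus
    counts and percentages of positive and negative words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z]+", text.lower())
    n_pos = sum(w in positive_words for w in words)
    n_neg = sum(w in negative_words for w in words)
    total = len(words) or 1  # avoid division by zero on empty pages
    return {
        "word_count": len(words),
        "sentence_count": len(sentences),
        "positive_count": n_pos,
        "negative_count": n_neg,
        "positive_pct": 100.0 * n_pos / total,
        "negative_pct": 100.0 * n_neg / total,
    }


page = "The vaccine is effective. Smoking raises the risk of disease."
positive = {"effective"}          # toy polarity lists, for illustration only
negative = {"risk", "disease"}

features_t1 = syntactic_and_sentiment_features(page, positive, negative)
# After the Bag-of-Words evolves (here "smoking" is added as negative),
# the same page yields updated sentiment features.
features_t2 = syntactic_and_sentiment_features(page, positive, negative | {"smoking"})
print(features_t1)
print(features_t2)
```

The page text is unchanged between the two calls, so only the sentiment features move; this is the kind of update the evolving Bag-of-Words is meant to capture.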

C. Model Building and Performance Evaluation
To build the model, five different state-of-the-art supervised classification algorithms are used: Logistic Regression (LR), Support Vector Machine (SVM), Naïve Bayes (NB), Decision Trees (DT), and Random Forest (RF). Further, the performance of the model is evaluated based on the confusion matrix, accuracy, precision, recall, F1-score, ROC, and AUC.
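For reference, the confusion-matrix-based metrics named above can be computed as in the following sketch. The labels are made up for illustration and are not results from this study; treating the non-legitimate class as the positive class (label 1) is also an assumption.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, tn, fp, fn) for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1-score from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical labels: 1 = non-legitimate URL, 0 = legitimate URL.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(*counts)
print(counts)
print(metrics)
```

ROC and AUC additionally need classifier scores rather than hard labels, so they are not covered by this sketch.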
[Fig. 2: Sentiment relevance score function. Recoverable final steps: if jd <= 0.95, append w with "word" antonym; else if jd > 0.95, append w with "word" sentiment; End Procedure.]

IV. RESULTS AND DISCUSSION
The results are reported in Table V, along with the combination of feature sets. From Table IV, when the syntactical, grammatical, and sentiment features were applied together, the Support Vector Machine and Logistic Regression models showed accuracies of 80% and 76.6%, respectively. Fig. 3 shows the performance of all the models based on precision, recall, and F1-score. Fig. 4 represents the percentage of misinformation in terms of sentiment features for the 30 URLs of test data. It is observed that URL 2, URL 4, URL 8, URL 12, URL 18, URL 19, and URL 26 showed the maximum percentage of misinformation.
Fig. 5 displays the word cloud for objectionable words on a cancer web page. It is seen that objectionable words like symptoms, smoking, chemotherapy, disease, death, and risk factor play a vital role in detecting misinformation. Also, Fig. 6 represents the word cloud for legitimate words, which can be used to detect true URLs. Words like covid, vaccine, treatment, and vaccine trials sound positive and refer to legitimate information. Fig. 7 displays the Receiver Operating Characteristic (ROC) curve with 10-fold cross-validation and the Area Under the Curve (AUC). Algo_1 represents the first fold, Algo_2 represents the second fold, and so on.
The analysis of the percentage of URLs updated with positive or negative words, and of the URLs that remained consistent, shows that 23% of URLs changed in their textual contents after the second iteration, whereas 77% of URLs remained unchanged. Among the 23% of URLs that were updated in the second iteration, the percentage of misinformation changed in 43% of them, i.e., the misinformation either increased or decreased. In the remaining 57% of these URLs, although there was a change in textual contents, the percentage of misinformation remained consistent. Fig. 8 shows the statistics of the misinformation change among this 43% of URLs. It can be observed that four URLs (2, 4, 6, and 9) had a negative change, which means the percentage of misinformation has reduced. However, six URLs (1, 3, 5, 7, 8, and 10) had a positive change, depicting an increase in the percentage of misinformation. It is observed that 39% of the negatively changed URLs showed a change in misinformation. Fig. 8 shows this fluctuating change in misinformation.
Fig. 10 and Fig. 11 display the sentiment median values and the standard deviation of sentiment values for legitimate and non-legitimate URLs, respectively. It is observed that in legitimate URLs there are more positive sentiment values than negative sentiment values. In non-legitimate URLs, the count of negative sentiment values exceeds that of positive values. Fig. 12 depicts the increase in the Bag-of-Words for the first batch of 10 URLs. Fig. 13 shows the increase in the Bag-of-Words after 10 batches of 10 URLs each. Also, the same URLs were scraped again on 3rd April 2021 to observe the increase or decrease of words in the sentimental "Bag-of-Words". Thus, it can be seen from Fig. 14 that the Bag-of-Words evolved over time, reaching 1069 words at T1 and 1663 words at T2. Fig. 15 shows the total number of positive and negative words generated at times T1 and T2, respectively.
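The reported percentages can be cross-checked with a few lines of arithmetic. This is only a consistency sketch: it assumes that the 23% figure refers to the 100-URL dataset and that the ten URLs listed with a negative or positive change are the ones making up the 43% share; the variable names are illustrative.

```python
total_urls = 100                          # size of the fact-checked dataset
changed_urls = round(0.23 * total_urls)   # URLs whose text changed at T2

decreased = [2, 4, 6, 9]                  # URLs with reduced misinformation
increased = [1, 3, 5, 7, 8, 10]           # URLs with increased misinformation
shifted = len(decreased) + len(increased)

pct_shifted = round(100 * shifted / changed_urls)
pct_stable = round(100 * (changed_urls - shifted) / changed_urls)

initial_bow = 452 + 341                   # manually labeled positive + negative words
bow_t1, bow_t2 = 1069, 1663               # Bag-of-Words size at T1 and T2
growth = bow_t2 - bow_t1
growth_pct = 100 * growth / bow_t1

print(changed_urls, shifted, pct_shifted, pct_stable)
print(initial_bow, growth, round(growth_pct, 1))
```

Under these assumptions, 10 of the 23 changed URLs correspond to the reported 43%, the remaining 13 to the reported 57%, and the Bag-of-Words grows by 594 words between T1 and T2.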

A. Theoretical Implications
While there is a reasonable volume of earlier research on detecting misinformation on the web in the healthcare domain, these methods do not adapt to recurrently changing data and are thus unable to capture the sentimental changes occurring in the data. The theoretical implications of the study are as follows. First, the study describes how incremental learning and sentiment analysis can be combined to generate an evolving sentimental "Bag-of-Words" in the healthcare domain. This knowledge can be applied in other domains, such as politics and business, which are more susceptible to misinformation generation. Hence, this is an important theoretical contribution. Secondly, the study defines a new aspect by developing sentiment analysis and incremental learning models for false information detection. The study presents a collaborative process of incremental learning and sentiment analysis to classify healthcare-related web URLs as legitimate or non-legitimate. The authors have experimentally verified and confirmed the importance of sentimental features and incremental learning in false information detection. These results have exposed new factors in the change of false information and its detection, proposing a model based on incremental learning and sentiment features. These observations augment the knowledge base of misinformation detection in the healthcare domain.

B. Practical Implications
Practically, the results indicate that this evolving Bag-of-Words can be used to find the sentiment intensity of articles in the healthcare domain. The sentiment shifts of patients can be studied using the "Bag-of-Words" in the healthcare domain. Thus, these outcomes could help researchers, scientists, doctors, patients, and organizations to detect misinformation and mitigate it. Table IV shows the comparison of studies in the healthcare domain. It can be seen that the proposed model has outperformed the existing techniques.

V. CONCLUSION
In this research, we have proposed a method to detect healthcare-related misinformation using sentiment-based incremental learning approaches. This framework can help to analyze and identify misinformation using sentiments as the central features, together with syntactic and grammatical features. A relevance score function assigns the sentiment to newly arriving words using the Jaccard distance measure. The model was built with five machine learning classifiers, viz. LR, SVM, NB, DT, and RF. The results show that SVM outperformed the other classifiers with an accuracy of 80%. The decision tree classifier showed the lowest accuracy of 56.66%. Thus, it was observed that incremental learning plays a crucial role in determining the consistency of correct information by detecting changes in the percentage of misinformation at different time intervals.

VI. LIMITATIONS AND FUTURE ENHANCEMENTS
The research has several benefits in terms of finding misinformation in the healthcare domain, but it also has certain limitations. Firstly, although the authors have used the BoW approach, an n-gram approach could be used to capture the sequence in which words occur. The second limitation is the dataset size. The dataset needs to be increased to thousands of URLs to verify the change in accuracy.
In the future, the authors intend to assign the threshold for the sentiment relevance score dynamically, which may lead to a more accurate Bag-of-Words, and later to build the model using incremental clustering techniques to handle real-time data.