Drug Sentiment Analysis using Machine Learning Classifiers

Sentiment analysis, which infers opinion on a particular subject from plain text, is one of the fastest-growing subfields of natural language processing. Drug sentiment analysis has become significant because classifying medicines by their effectiveness from user reviews can help prospective consumers learn about a particular drug and make better decisions. The objective of this research is to measure the effectiveness level of a particular drug. Currently, most text mining research relies on unsupervised machine learning methods to cluster data. When supervised learning methods are used for text mining, the primary concern is usually to classify the data into two classes. The lack of technical terms in such datasets makes the categorization even more challenging. The proposed research focuses on identifying keywords through tokenization and lemmatization so that drugs can be categorized by effectiveness more accurately using different algorithms. Such categorization can be instrumental in treating illness and improving health and well-being. Four machine learning algorithms were applied for binary classification and one for multiclass classification on the drug review dataset acquired from the UCI machine learning repository. The algorithms used for binary classification are the naive Bayes classifier, random forest, support vector classifier (SVC), and multilayer perceptron; linear SVC was used for multiclass classification. The results of the four binary classifiers were analyzed to evaluate their performance, and random forest performed best among them. Multiclass classification was found to perform poorly overall on this natural language processing task; however, the linear SVC performed well for class 2, with an AUC of 0.82.

Keywords: machine learning algorithms; natural language processing; drug sentiment analysis; text mining


I. INTRODUCTION
Among the many research directions in Natural Language Processing (NLP), sentiment analysis has become one of the most promising in recent years [1] [2]. Sentiment analysis has been applied in a wide range of domains, e.g., economics, politics, and medicine. In the pharmaceutical industry, large volumes of online user reviews are evaluated automatically to obtain useful information about the efficacy and side effects of pharmaceuticals, which can be used to enhance pharmacovigilance systems. Over the last decade, sentiment analysis techniques have evolved from basic rules to advanced machine learning techniques such as deep learning, which has become a prominent technology in many NLP tasks. Sentiment analysis has shared in this success: several machine learning systems have recently been shown to outperform previous methods and have achieved strong results on standard sentiment analysis datasets [3] [4].
Medical sentiment analysis covers aspects such as 'medical condition' and 'treatment procedure' that directly concern users' health. Any progress or deterioration can be identified by analyzing patient status periodically. A medical condition can be expressed implicitly or explicitly. Mentioning symptoms is a form of implicit sentiment in the medical context. For example, consider the statement: 'I recently started Lexapro 3 days, I'm on extreme weight losses'. The term 'weight losses' does not carry a negative sense in itself; however, here it describes a negative side effect of the medication, so the sentiment of the statement is negative. Hence, additional information is required to make correct interpretations.
On the contrary, analyzing health conditions is much easier in the case of explicit sentiment. For instance, consider the statement, 'I recently started Lexapro 3 days, I'm absolutely lost I feel weak and shaky every day'. The words 'absolutely lost', 'weak', and 'shaky' describe symptoms in this statement. Patients' decisions about medical issues, e.g., choosing a hospital, clinic, or medication, are often informed by other patients' experiences [5]. Hospitals gain from this information since it allows them to better understand and address the interests and concerns of their patients. Patients share their experiences and feelings online, and sentiment analysis is well suited to such material since it identifies people's sentiment about a topic as well as about its characteristics. Much of this medical material is freely available on the internet, and manually analyzing it is impractical because of its volume. Reviews are, for the most part, labelled as positive or negative based on automated detection of their polarity. Traditional and online survey methods are now being supplanted by sentiment analysis, which organizations conduct to obtain a broad view of opinion about their products and services. As a result, their marketing, product awareness, and customer management improve. Because a tremendous amount of content is available online, it must be analyzed automatically, and such automated analysis requires a deep understanding of natural language. Thoughts and sentiments play an essential role in everyday life: they support decision making, learning, communication, and situational awareness.

*Corresponding Author. www.ijacsa.thesai.org
User-generated content in regional languages is becoming prevalent on social media; hence the importance of processing and understanding vernacular content is growing. In addition to existing material, local sayings, myths, and fables are being rediscovered and widely disseminated on the internet.
Compared to reviews of other products, drug reviews have been investigated less. When analyzed, drug reviews are primarily used to categorize a particular drug as positive or negative, since multiclass classification in text mining can be difficult. The proposed research categorizes drugs not into two categories but into five classes based on their effectiveness. Such an outcome can help both consumers and manufacturers understand the effectiveness of drugs and whether a particular drug has any significant side effects. This paper is organized as follows: Section II reviews the literature on sentiment analysis, Section III explains the algorithms that classify the sentiment of the drugs, Section IV presents the findings of the study, and Section V summarizes the contributions of the proposed research.

II. LITERATURE REVIEW
Balahur (2013) [6] analyzed supervised machine learning methods on Twitter datasets, such as support vector machines with unigram and bigram features. When applied to Twitter data, support vector machines with unigram and bigram features outperformed the other methods. Including emotive words, modifiers, and unique tags further improved the performance of sentiment classification. Jianqiang et al. (2018) [7] presented a word embedding approach based on unsupervised learning. The technique exploits hidden contextual semantic relationships and co-occurrence characteristics between words in tweets. The word embeddings are combined with n-gram and sentiment polarity score features to form a set of sentiment features, which is fed into a deep convolutional neural network.

Facebook, Instagram, and Twitter are a few examples of social media platforms that help generate data and circulate content quickly. The amount of hate speech has risen significantly as content on particular topics circulates. To filter such utterances, Schmidt and Wiegand [8] suggested a filtering approach based on natural language processing. According to the results, character-level strategies are superior to token-level ones. The authors also showed that lexical resources, such as word lists, can be beneficial when combined with other features. Based on K-means and cuckoo search, Pandey et al. [9] proposed a metaheuristic approach that finds the best feasible cluster heads from the sentiment content of a Twitter dataset. Wang and Li [10] examined modified text-based algorithms to predict emotions in image data for sentiment analysis. According to their study, textual and visual features on their own are unsuitable for predicting the emotion labels of an image. The authors conducted experiments on two datasets and found that the recommended technique outperforms current methods in terms of accuracy. Xu et al. [11] studied a Hierarchical Deep Fusion (HDF) methodology for sentiment analysis. The proposed model analyzes the relationship between text, images, and sentiment content. The authors combined visual content with textual content using a three-level Hierarchical Long Short-Term Memory (H-LSTM) network to investigate the inter-modal association of text and image at various levels.

Some of the most widely applied machine learning and deep learning algorithms are described in Table I. Most of the above-mentioned studies focus primarily on unsupervised learning methods. Compared to other product reviews, the number of studies conducted on drug reviews is significantly low. One of the key challenges with datasets similar to the one used in this research is the lack of technical terms. The few studies that use supervised learning methods perform binary classification, as multiclass classification with existing machine learning algorithms has proven to be challenging. In this research, tokenization and lemmatization are used to identify the keywords. In addition, multiclass classification has been performed, unlike in the studies mentioned in this section.

Neural Network
The neural network approach achieves very high performance. It is widely used for sentiment analysis and is capable of detecting all possible interactions between attributes. It is effective for modelling complex nonlinear relationships between variables. Its main disadvantage is that it takes longer to compute than other algorithms [12].

Naive Bayes
It is a probabilistic classifier based on Bayes' theorem. Researchers use this method less commonly for prediction. Its primary advantage is that it is scalable in comparison to other algorithms [13].

Support Vector Machine
It is a supervised machine learning method for classification and regression analysis. The support vector machine is very effective on small datasets. It is more efficient than other approaches to classification and regression [14].

Decision Tree
Decision trees are simple but extensively used prediction tools. A decision tree can be easily converted into IF-THEN rules. According to a previous study, a decision tree can be used for prediction and forecasting, but it predicts drug sentiment with low accuracy [15].

K-Nearest Neighbor
It is a popular non-parametric pattern recognition method. It can be used for both regression and classification. It is reported to offer good performance and precision, and it is among the most basic machine learning algorithms [16].

AdaBoost
AdaBoost combines a number of weak classifiers to create a strong one by iteratively retraining and weighting the classifiers according to their accuracy [17].

Logistic regression
This method models the log odds of a dichotomous outcome as a linear combination of the predictor variables [18].

Convolutional neural network
A feed-forward neural network that is trained to extract key features for the prediction task at hand. Features are extracted by convolution filters combined with nonlinear functions. The dimensionality can then be reduced via pooling [19].

Maximum entropy
A probabilistic classifier built on the principle of maximum entropy [20].

Conditional random fields
An approach for segmenting and labelling structured data that models a conditional probability distribution over label sequences given an observation sequence [21].

III. METHODOLOGY
The dataset used in this experiment was collected from the UCI repository [22]. After the dataset was collected, tokenization and lemmatization were performed on the data. Four machine learning algorithms were applied to the dataset for binary classification. The classes for binary classification are class 0 and class 1, where class 0 represents the effective drugs and class 1 represents the ineffective drugs. The algorithms used for binary classification are the naïve Bayes classifier (NB), random forest (RF), support vector classifier (SVC), and multilayer perceptron (MLP). Linear SVC was also applied to the dataset for multiclass classification. Table II lists the classes for multiclass classification along with the count for each class. Class 0 represents highly effective drugs with a count of 1741, and class 1 represents considerably effective drugs with a count of 1238. Class 2 represents moderately effective drugs with a count of 529, class 3 represents marginally effective drugs with a count of 329, and class 4 represents ineffective drugs with a count of 263.
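The class counts reported above are clearly imbalanced, which is worth keeping in mind when reading the multiclass results. The following short Python sketch simply computes each class's share of the data from those counts; the variable and function names are illustrative and are not part of the study.

```python
# Class counts for the multiclass setting, as reported in the text.
counts = {0: 1741, 1: 1238, 2: 529, 3: 329, 4: 263}


def class_shares(class_counts):
    """Return the fraction of all reviews that falls into each class."""
    total = sum(class_counts.values())
    return {label: n / total for label, n in class_counts.items()}


shares = class_shares(counts)
# Ratio between the largest and the smallest class.
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in shares.items():
    print(f"class {label}: {share:.1%}")
```

A ratio well above 1 between the largest and smallest class suggests that per-class metrics, rather than overall accuracy alone, are the more informative summary.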

A. Tokenization
In natural language processing, a series of well-defined processes needs to be carried out to analyze text. One of the primary processes is tokenization, which plays a crucial role in the efficiency and correctness of the entire analysis. Tokenization refers to splitting the text into smaller meaningful units known as tokens. In most cases, tokens are words or word sequences. Token boundaries are usually recognized when a whitespace character is encountered while scanning the text. Preprocessing steps such as punctuation removal and uppercase-to-lowercase conversion are often part of the tokenization process [23].
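The steps described above can be sketched in a few lines of Python. This is a minimal illustration that assumes English text and a simple regular expression; the function name `tokenize` and the decision to keep apostrophes inside tokens are choices made for this example, not details taken from the study.

```python
import re


def tokenize(text):
    """Lowercase the text, drop punctuation, and split on whitespace."""
    lowered = text.lower()
    # Replace every character that is not a lowercase letter, digit,
    # apostrophe, or whitespace with a space (assumption: English text).
    cleaned = re.sub(r"[^a-z0-9'\s]", " ", lowered)
    return cleaned.split()


tokens = tokenize("I recently started Lexapro 3 days, I'm absolutely lost.")
print(tokens)
```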

B. Lemmatization
Lemmatization is one of the most essential and elementary processes in natural language processing (NLP). The base or dictionary form of a word is called a lemma. Lemmatization refers to the morphological conversion of a word, as it appears in the text of the dataset, to its lemma. The basic idea of this conversion is to remove the inflectional ending of the word. For a verb, the lemma is the infinitive form; for a noun, the lemma is the singular form; and for an adjective or adverb, the lemma is the positive form. For example, the lemma of 'better' is 'good', and the lemma of 'brought' is 'bring'. Lemmatization can be viewed as a normalization method in which the morphological variants of a word are analyzed as a single item by mapping them to the same underlying lemma. As the total number of distinct terms is reduced, the complexity of analyzing the text decreases significantly, and thus overall time and resource utilization improve. Lemmatization is widely applied for preprocessing text in information retrieval, document clustering, sentiment analysis, etc. [24].
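To make the normalization effect concrete, the following toy Python sketch maps a few word forms to their lemmas. The lookup table `IRREGULAR_LEMMAS` and the plural-stripping fallback are hypothetical simplifications written for this example; a practical lemmatizer would rely on a full dictionary and part-of-speech information, and this is not the tool used in the study.

```python
# Hypothetical, hand-written lemma table for illustration only.
IRREGULAR_LEMMAS = {
    "better": "good",
    "brought": "bring",
    "losses": "loss",
    "felt": "feel",
}


def lemmatize(token):
    """Map a token to its lemma using the toy table, with a crude fallback."""
    token = token.lower()
    if token in IRREGULAR_LEMMAS:
        return IRREGULAR_LEMMAS[token]
    # Crude fallback: strip a plural "s" from longer words ("drugs" -> "drug"),
    # leaving words that end in "ss" (such as "weakness") untouched.
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


words = ["drug", "drugs", "better", "good", "brought", "bring"]
lemmas = {lemmatize(w) for w in words}
print(len(words), "word forms reduce to", len(lemmas), "lemmas")
```

The six word forms collapse to three lemmas, which is the reduction in distinct terms that the paragraph above describes.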

C. Chi-Squared for Feature Selection
Chi-square scores were calculated for the twenty most frequently used words or terms in the dataset. There are two possible classes (positive and negative). The chi-square test can be used to evaluate how significantly a given word discriminates between the classes [25]. Fig. 2 shows the chi-square scores for the most frequently used words. The word with the highest score is 'none', and the word with the second-highest score is 'benefit'.
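As a rough illustration of how such a score can be obtained, the sketch below computes the Pearson chi-square statistic for a single word from a 2x2 table of document counts (word present or absent, class positive or negative). The function name and the example counts are invented for this illustration, and the statistic may differ numerically from whichever library routine was used in the study.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 contingency table.

    a: positive documents containing the word
    b: negative documents containing the word
    c: positive documents without the word
    d: negative documents without the word
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # a row or column is empty, so the word carries no signal
    return n * (a * d - b * c) ** 2 / denominator


# A word spread evenly over both classes carries no discriminating power.
even_score = chi_square_2x2(10, 10, 10, 10)
# A word that appears mostly in one class receives a higher score.
skewed_score = chi_square_2x2(30, 10, 20, 40)
print(even_score, round(skewed_score, 2))
```

Words with higher scores are those whose presence is more strongly associated with one of the two classes.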

D. Sentiment Classification
After tokenization and lemmatization, we use RF, SVC, MLP, and NB to find the polarity of the drugs.

1) Naïve bayes classifier:
Naïve Bayes Classifier is a well-established machine learning algorithm used for the classification of data. Because it is time-efficient, robust to noise and irrelevant data, and simple to implement, the Naïve Bayes Classifier is a popular algorithm for text classification in fields such as spam or fake news detection and sentiment analysis. The algorithm is based on Bayes' theorem, first formulated by the British scientist Thomas Bayes [26]. The idea of Bayes' theorem is to calculate the probability of an event based on prior knowledge or conditions that have an impact on the event.
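The standard formula for the theorem is:

$$P(X \mid Y) = \frac{P(Y \mid X)\, P(X)}{P(Y)} \tag{1}$$

In equation 1, $P(X \mid Y)$ is the probability of $X$ given that $Y$ has occurred, $P(Y \mid X)$ is the probability of $Y$ given $X$, and $P(X)$ and $P(Y)$ are the probabilities of $X$ and $Y$ on their own. As there are multiple aspects to be considered to classify the data, the multiplication rule needs to be applied under the assumption that the features are conditionally independent given the class, and the equation becomes:

$$P(X \mid y_1, \dots, y_n) = \frac{P(y_1 \mid X)\, P(y_2 \mid X) \cdots P(y_n \mid X)\, P(X)}{P(y_1, \dots, y_n)} \tag{2}$$

In equation 2, $X$ is a class variable, and $Y$ is a dependent feature vector $(y_1, y_2, \dots, y_n)$. As the denominator of the right-hand side of the equation remains the same for all data in a particular dataset, it can be written as:

$$P(X \mid y_1, \dots, y_n) \propto P(y_1 \mid X)\, P(y_2 \mid X) \cdots P(y_n \mid X)\, P(X) \tag{3}$$

After simplifying equation 3, we get the following probabilistic model:

$$P(X \mid y_1, \dots, y_n) \propto P(X) \prod_{i=1}^{n} P(y_i \mid X) \tag{4}$$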
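As an illustration of this probabilistic model, the following minimal Python sketch trains a naive Bayes classifier on tokenized reviews and predicts the class with the highest score. The toy reviews, the function names, the use of log probabilities, and the add-one (Laplace) smoothing are assumptions made for this example and are not taken from the study.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, tokens):
    """Return the class X that maximizes log P(X) + sum_i log P(y_i | X)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # ignore words never seen during training
            # Add-one smoothing avoids zero probabilities (assumption).
            likelihood = (word_counts[label][tok] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data: class 0 = effective, class 1 = ineffective (labels as in Section III).
docs = [["great", "relief"], ["no", "side", "effect"],
        ["no", "benefit"], ["severe", "headache"]]
labels = [0, 0, 1, 1]
model = train_naive_bayes(docs, labels)
print(predict(model, ["great", "relief"]), predict(model, ["severe", "headache"]))
```

Working with sums of logarithms rather than products of probabilities is a common way to avoid numerical underflow when reviews contain many words.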
From this probabilistic model, a classifier is obtained by evaluating the model for every possible value of the class variable and selecting the value with the maximum probability. It can be expressed as follows:

$$\hat{X} = \arg\max_{X} \; P(X) \prod_{i=1}^{n} P(y_i \mid X) \tag{5}$$

2) Random forest:
The decision tree classifier suffers from high variance: a minor change in the training dataset can produce a very different tree, which makes the classifier unstable. To address this issue, the random forest, an ensemble of decision trees, was proposed. As an ensemble, it combines classifiers obtained either from various classification methods or from a single method with various parameters drawn from the dataset. Assume a learning dataset $D = ((x_1, y_1), (x_2, y_2), \dots, (x_n, y_n))$ that consists of n vectors, where $x \in X$ (X is a set of numerical observations) and $y \in Y$ (Y is a set of class labels). For a classification task, a classifier is a mapping $X \to Y$. Each tree of the forest classifies a new input vector, and the forest combines the votes of its trees. Random forest combines the idea of bootstrapping data from the learning dataset to form a training dataset with the random selection of parameters to construct decision trees. Bootstrapping refers to selecting three-fourths of the learning dataset (sometimes two-thirds) and filling the rest of the training set with repeats of some of the selected samples. While constructing a decision tree, the features and their positions as nodes in the tree are chosen randomly. Thus, the random forest classifier h can be defined as:

$$h = \{h_1(x), h_2(x), \dots, h_K(x)\}, \qquad h_k(x) = h(x \mid \Theta_k) \tag{6}$$

In equation 6, K is the number of trees and $h_k$ is a decision tree with parameters $\Theta_k$, a subset of features chosen randomly [27].
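The two ingredients described above, bootstrapped training sets and randomly chosen features, can be sketched as follows. To stay short, this example replaces full decision trees with one-level trees (decision stumps) and combines them by majority vote; the data, the function names, and the stump rule are hypothetical simplifications, not the configuration used in the study.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]


def train_stump(sample, feature):
    """One-level tree: split one feature at its sample mean."""
    threshold = sum(x[feature] for x, _ in sample) / len(sample)
    left = Counter(y for x, y in sample if x[feature] <= threshold)
    right = Counter(y for x, y in sample if x[feature] > threshold)
    overall = Counter(y for _, y in sample).most_common(1)[0][0]
    left_label = left.most_common(1)[0][0] if left else overall
    right_label = right.most_common(1)[0][0] if right else overall
    return feature, threshold, left_label, right_label


def predict_stump(stump, x):
    feature, threshold, left_label, right_label = stump
    return left_label if x[feature] <= threshold else right_label


def train_forest(data, n_trees, seed=0):
    """Train each stump on its own bootstrap sample and random feature."""
    rng = random.Random(seed)
    n_features = len(data[0][0])
    forest = []
    for _ in range(n_trees):
        sample = bootstrap_sample(data, rng)
        feature = rng.randrange(n_features)
        forest.append(train_stump(sample, feature))
    return forest


def predict_forest(forest, x):
    """Majority vote over all trees in the forest."""
    votes = Counter(predict_stump(stump, x) for stump in forest)
    return votes.most_common(1)[0][0]


# Hypothetical two-feature data: (feature vector, class label).
data = [((1.0, 5.0), 0), ((1.2, 4.8), 0), ((0.9, 5.2), 0),
        ((3.0, 1.0), 1), ((3.2, 0.8), 1), ((2.9, 1.1), 1)]
forest = train_forest(data, n_trees=25)
print(predict_forest(forest, (1.1, 5.0)), predict_forest(forest, (3.1, 0.9)))
```

Because each stump sees a different resampled dataset, an occasional poorly trained stump is outvoted by the rest, which is the variance reduction the ensemble is meant to provide.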

3) Support vector classifier:
Support Vector Classifier is a supervised learning algorithm that analyzes data for classification. One of the distinctive properties of this algorithm is its ability to minimize the empirical classification error and maximize the geometric margin simultaneously. SVC provides high accuracy in text categorization, image classification, handwritten digit recognition, data classification, sentiment analysis, etc. SVC constructs a maximal separating hyperplane. Two parallel hyperplanes, which represent two different classes, are built on either side of this separating hyperplane. Using SVC, an input vector is mapped to one of these two parallel hyperplanes, that is, to one of the two classes. The purpose of the separating hyperplane is to maximize the space between the two parallel hyperplanes. It is assumed that a lower classification error can be achieved with a larger distance, or margin, between the parallel hyperplanes.
Consider a training dataset $((x_1, y_1), (x_2, y_2), \dots, (x_n, y_n))$ where each $x_i$ is a p-dimensional vector that maps to a corresponding $y_i$, which specifies the class. The value of $y_i$ can be either -1 or +1. The equation of the separating hyperplane can be written as:

$$w \cdot x + b = 0 \tag{7}$$

In equation 7, w is a p-dimensional weight vector normal to the hyperplane and b is a scalar. The equations for the two parallel hyperplanes can be written in equation 8 and equation 9 as follows:

$$w \cdot x + b = 1 \tag{8}$$

$$w \cdot x + b = -1 \tag{9}$$

The distance between these two hyperplanes is $2 / |w|$, which implies that by minimizing $|w|$, we can maximize the distance between the two hyperplanes and thus achieve high performance.

In the case of hard margin, where no misclassification is allowed in the training dataset, the problem can be stated as:

$$\text{Minimize } |w| \quad \text{for } y_i (w \cdot x_i + b) \ge 1, \quad i = 1, \dots, n \tag{10}$$

In equation 10, the classifier is determined by solving for w and b.

In the case of soft margin, where a few misclassifications are allowed in the training dataset to achieve better accuracy on the testing dataset, the problem can be stated as:

$$\text{Minimize } \frac{1}{2} |w|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{for } y_i (w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0 \tag{11}$$

In equation 11, $\xi_i$ is the slack of $x_i$, and the trade-off between putting $x_i$ on the right side of its hyperplane and maximizing the distance between the parallel hyperplanes is determined by C [28].
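To make the margin and the soft-margin trade-off concrete, the following Python sketch evaluates a given hyperplane on a handful of points. It assumes the common formulation in which the hyperplane is written as w·x + b = 0 and misclassification is penalized by a parameter C times the total slack; the example points, the parameter values, and the function names are hypothetical, and no optimization is performed here.

```python
import math


def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def margin_width(w):
    """Distance 2 / |w| between the two parallel hyperplanes."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


def soft_margin_objective(w, b, data, c):
    """0.5 * |w|^2 + C * (sum of slacks), with slack_i = max(0, 1 - y_i (w.x_i + b))."""
    regularizer = 0.5 * sum(wi * wi for wi in w)
    total_slack = sum(max(0.0, 1.0 - y * decision_value(w, b, x)) for x, y in data)
    return regularizer + c * total_slack


# Hypothetical hyperplane and labelled points (labels are -1 or +1).
w, b = (1.0, 0.0), -2.0
data = [((3.0, 0.0), 1), ((1.0, 1.0), -1), ((2.5, 0.0), 1)]

print(margin_width(w))                           # width of the margin
print(soft_margin_objective(w, b, data, c=1.0))  # the third point lies inside the margin
```

A larger C penalizes points inside the margin more heavily, while a smaller C favors a wider margin at the cost of more violations.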

4) Multilayer perceptron:
Inspired by the functioning of the human nervous system, the concept of artificial neural networks has been developed and applied to design mathematical models that solve complex classification or regression problems. The building blocks of an artificial neural network are 'artificial neurons' or 'neurons', which are frequently referred to as nodes. In a multilayer perceptron, which is a feedforward artificial neural network, these neurons are organized in layers and fully interconnected via edges to form a directed graph. The term 'feedforward' means that this graph is acyclic. Each edge is associated with a real number called the weight of the edge. The layers of a multilayer perceptron are the input layer, a number of hidden layers, and the output layer. For each neuron, there is a summation function and an activation function. The summation function can be written as:

$$S_j = \sum_{i=1}^{n} w_{ij}\, x_i + b_j \tag{12}$$

In equation 12, n is the number of inputs of neuron j, $x_i$ is the i-th input, $w_{ij}$ is the weight of the edge from input i to neuron j, and $b_j$ is a bias term. The output of this summation function becomes the input of the activation function. There are various types of activation functions. One of the most widely applied is the nonlinear, 'S'-shaped sigmoid activation, which can be expressed as:

$$f(z) = \frac{1}{1 + e^{-z}} \tag{13}$$

Applying the activation function from equation 13, the output of the neuron can be expressed as equation 14:

$$y_j = f(S_j) = \frac{1}{1 + e^{-S_j}} \tag{14}$$

Once the neural network is constructed, the set of weights is tuned to approximate the required result [29].
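The computation performed by a single neuron, and by a layer of such neurons, can be sketched as follows. The weights, inputs, and function names are hypothetical values chosen for illustration, and the bias term is an assumption of this sketch; weight tuning (training) is not shown.

```python
import math


def sigmoid(z):
    """'S'-shaped activation that squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias=0.0):
    """Weighted sum of the inputs (summation function) followed by the sigmoid."""
    s = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(s)


def layer_output(inputs, weight_rows, biases):
    """Apply every neuron of a fully connected layer to the same inputs."""
    return [neuron_output(inputs, w, b) for w, b in zip(weight_rows, biases)]


# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output neuron.
x = [1.0, 2.0]
hidden = layer_output(x, [[0.5, -0.25], [0.1, 0.2]], [0.0, 0.0])
output = layer_output(hidden, [[1.0, -1.0]], [0.0])
print(hidden, output)
```

Feeding the outputs of one layer as the inputs of the next, as in the last two calls, is what makes the network feedforward.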

IV. RESULTS
A confusion matrix of the binary predictions was built for each algorithm to evaluate its performance. Binary prediction is one of the most widely used prediction settings, and its outcome counts are the building blocks of a ROC curve [30]. Each binary classification problem contains two classes, and every instance carries one of two class labels, positive (P) or negative (N). There are four possible outcomes when a classifier labels an instance. True positive (TP) refers to the number of positive instances that are classified correctly. Similarly, true negative (TN) refers to the number of negative instances that are classified correctly. In contrast, if a positive instance is classified as negative, it is counted as a false negative (FN). Likewise, if a negative instance is classified as positive, it is counted as a false positive (FP). When applying the algorithms to the dataset, 80 percent of the dataset was used as training data and the remaining 20 percent was used for testing in each case.
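The four outcome counts can be tallied directly from the true and predicted labels. The sketch below uses the usual convention that a false positive is a negative instance predicted as positive and a false negative is a positive instance predicted as negative; the function name, the `positive` parameter, and the example labels are illustrative choices, not taken from the study.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) for a binary classification task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # positive instance correctly identified
        elif pred == positive:
            fp += 1  # negative instance predicted as positive
        elif truth == positive:
            fn += 1  # positive instance predicted as negative
        else:
            tn += 1  # negative instance correctly identified
    return tp, fp, tn, fn


# Hypothetical labels for six test instances.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred))
```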
We computed accuracy, precision (P), recall (R), and F1 score using the following equations:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{15}$$

$$P = \frac{TP}{TP + FP} \tag{16}$$

$$R = \frac{TP}{TP + FN} \tag{17}$$

$$F1 = \frac{2 \cdot P \cdot R}{P + R} \tag{18}$$

In equation 15, accuracy is measured as the total number of correctly identified cases divided by the total number of test cases.
In equation 16, precision is measured as the number of true positive cases divided by the number of all predicted positive cases.
In equation 17, recall is measured as the number of true positive cases divided by the number of all actual positive cases.
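These definitions translate directly into code. The following Python sketch computes accuracy, precision, recall, and the F1 score (the harmonic mean of precision and recall) from the four confusion-matrix counts; the function name and the example counts are hypothetical and do not reproduce the results of the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1


# Hypothetical counts for a test set of 100 reviews.
accuracy, precision, recall, f1 = classification_metrics(tp=40, fp=10, tn=45, fn=5)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```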
In equation 18, the F1 score is computed from the calculated precision and recall values.

Table IV shows the confusion matrix for the proposed machine learning classifiers for all features. Random forest is the best classifier among the four algorithms, with 94% accuracy. The accuracies for MLP, SVC, and NB are almost the same. We then calculated the accuracy while varying the number of features from 5000 to 9000. Table V shows the accuracy of the proposed machine learning classifiers for feature counts ranging from 5000 to 9000. The proposed model provides consistent results across this range.

Fig. 3 shows ROC curves for individual classes and all classes for linear SVC. Fig. 3(a) shows that the area under the ROC curve is 0.56, which indicates that linear SVC is not very effective at identifying class 0 (highly effective drugs). Fig. 3(b) shows that the area under the ROC curve is 0.64, which indicates that linear SVC is slightly more effective at identifying class 1 (considerably effective drugs) than class 0. Fig. 3(c) shows that the area under the ROC curve is 0.82, which indicates that linear SVC is very effective at identifying class 2 (moderately effective drugs). Fig. 3(d) shows that the area under the ROC curve is 0.65, which indicates that linear SVC is approximately as effective at identifying class 3 (marginally effective drugs) as class 1. Fig. 3(e) shows that the area under the ROC curve is 0.59, which indicates that linear SVC is slightly less effective at identifying class 4 (ineffective drugs) than class 3. Fig. 3(f) plots precision against recall for all the classes. From the ROC curves, it is evident that linear SVC performs markedly better at identifying class 2 drugs than the other classes.
The evaluation metric used for the multiclass classification is the ROC curve, which plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at different threshold settings. Although the ROC curve is defined for binary classification, it can be extended to multiclass classification through the 'one vs rest' approach, in which each class in turn is treated as the positive class and all remaining classes as the negative class. In a ROC curve, a higher value on the X-axis indicates more false positives relative to true negatives, and a higher value on the Y-axis indicates more true positives relative to false negatives. An efficient way to discriminate between accurate and inaccurate classification is to measure the area under the ROC curve (AUC). This area takes values between 0 and 1, where 0 indicates completely inaccurate classification of the class, 0.5 corresponds to random guessing, and 1 indicates perfectly accurate classification. Although multiclass sentiment classification is extremely challenging for textual data, Fig. 3(c) shows a very promising result in this research.
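A minimal Python sketch of the 'one vs rest' AUC computation is given below. It relies on the standard equivalence between the AUC and the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one (ties count as one half). The scores, labels, and function names are hypothetical, and the per-class scores are assumed to come from some classifier's decision function.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative instance")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(score_rows, labels, classes):
    """Treat each class in turn as positive and all other classes as negative."""
    result = {}
    for k, cls in enumerate(classes):
        binary = [1 if y == cls else 0 for y in labels]
        result[cls] = auc([row[k] for row in score_rows], binary)
    return result


# Binary example: one positive instance is ranked below a negative one.
print(auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))

# Hypothetical per-class scores for four instances and three classes.
score_rows = [(0.7, 0.2, 0.1), (0.1, 0.8, 0.1), (0.2, 0.3, 0.5), (0.6, 0.3, 0.1)]
per_class = one_vs_rest_auc(score_rows, labels=[0, 1, 2, 0], classes=[0, 1, 2])
print(per_class)
```

Reporting one AUC per class, as in Fig. 3, makes it visible when a classifier separates some classes much better than others.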

V. CONCLUSION
From the experimental results, the average accuracy for Random Forest, Multilayer Perceptron, Support Vector Classifier, and Naïve Bayes Classifier is 94.06%, 86.82%, 88.63%, and 88.57%, respectively. The Random Forest algorithm achieved the best accuracy among the four algorithms. For Random Forest, higher precision, recall, and F1 score were achieved for effective drugs than for ineffective drugs. The F1 score was calculated to assess performance from a different perspective, as it balances precision and recall. Although multiclass classification is a challenging task for sentiment analysis, linear SVC shows a promising result for class 2 (moderately effective drugs). In this research, we applied five machine learning algorithms. Unlike most similar NLP research, in which text mining is used to cluster the data, this research implemented supervised learning methods to gain a better understanding of a drug by measuring its level of effectiveness, which can play an important role in treating diseases. In the future, we intend to apply deep learning algorithms such as Long Short-Term Memory networks (LSTMs), Generative Adversarial Networks (GANs), Recurrent Neural Networks (RNNs), and Convolutional Neural Networks (CNNs). In addition, we would like to implement multi-language sentiment analysis using data-driven approaches.