LSTM, VADER and TF-IDF based Hybrid Sentiment Analysis Model

Most sentiment analysis models that use supervised learning algorithms consume a lot of labeled data in the training phase in order to give satisfactory results. This is usually expensive and leads to high labor costs in real-world applications. This work proposes a hybrid sentiment analysis model based on a Long Short-Term Memory network, a rule-based sentiment analysis lexicon and the Term Frequency-Inverse Document Frequency weighting method. These three (input) models are combined in a binary classification model, in which each of the following algorithms has been implemented: Logistic Regression, k-Nearest Neighbors, Random Forest, Support Vector Machine and Naive Bayes. The model was then trained on a limited amount of data from the IMDB dataset. The results of the evaluation on the IMDB data show a significant improvement in Accuracy and F1 score compared to the best scores recorded by the three input models separately. In addition, the proposed model was able to transfer the knowledge gained on the IMDB dataset to better handle new data from the Twitter US Airlines Sentiments dataset.

Keywords—Sentiment analysis; hybrid model; long short-term memory (LSTM); Valence Aware Dictionary and sEntiment Reasoner (VADER); term frequency-inverse document frequency (TF-IDF); classification algorithm


I. INTRODUCTION
With the massive use of social networks such as Facebook, Twitter and Instagram, and of dedicated platforms for sharing reviews and comments such as IMDB and Airbnb, it has become extremely difficult to track published information, let alone extract relevant information such as reviews about a product or service. This is due, on the one hand, to the abundance and variety of published data [1] and, on the other hand, to the unstructured nature of the published texts, which makes it almost impossible to analyze them with classical computing methods [2].
The content produced by the social media community is one of the richest sources of data in terms of opinions and knowledge, and offers greater opportunities for businesses, governments, and society to extract valuable, expressive, and diverse knowledge, both in terms of the content itself and context-related knowledge [3]. Indeed, decision makers need to perceive how people feel about their services in order to improve the aspects that customers find unsatisfactory. Therefore, mining and analyzing the data left on these platforms with automated tools is crucial. Sentiment analysis is a field of analysis that aims to determine the opinion and subjectivity of people's criticisms and attitudes towards entities and their attributes from unstructured written text [4]. A multitude of sentiment vocabulary analysis methods have been proposed over the past decades. As an example, based on the emotional attributes of words, Turney [5] used a simple unsupervised classification learning algorithm to compute pointwise mutual information to measure sentence sentiment polarity.
Wang et al. [6] proposed a topic-specific sentiment analysis method based on LSTM with attention mechanism, which focused on the features of different parts of the sentence through the attention mechanism, and achieved good performance on the task of classifying topic-specific sentiments. This work was conducted to address the problem that sentiment vocabulary generally changes with context information [7]. In [8] Pang et al. advocated for the first time the supervised learning model in sentiment classification, which performed significantly better than the traditional sentiment vocabulary-based parsing algorithms [9]. In addition, this study also pointed out that sentiment classification is more challenging than general classification tasks.
Although the models analyzed in the existing literature, which are characterized by the diversity of different features, improve performance that can be evaluated by metrics such as accuracy, Recall and F1-score, these supervised models have been trained on a large volume of data and, therefore, require a lot of labeled data, which is usually costly and leads to high labor cost in real-world applications [10,11].
On the other hand, the use of an intuitive lexicon-based classification does not work well, unlike a simple text classification. The reason is that among the overwhelming number of reviews, there are reviews that do not contain any intuitively subjective words and yet express a strong opinion. Other reviews contain very pejorative words and express a positive opinion (and vice versa) [12].
The idea of our work is to propose a sentiment analysis model that uses a low volume of labeled training data while obtaining satisfactory results. Our approach is to combine three sentiment analysis models: the Long Short-Term Memory (LSTM) model; the Valence Aware Dictionary and sEntiment Reasoner (VADER), which is a rule-based sentiment analysis lexicon built on the wisdom of the crowd; and the Term Frequency-Inverse Document Frequency (TF-IDF) weighting based sentiment analysis model. Each of these three input models returns a sentiment positivity score for the text to be analyzed. Then we included a classification model where each of the following five algorithms has been implemented: Logistic Regression, k-Nearest Neighbors, Random Forest, Support Vector Machine and Naive Bayes. This classification model returns a binary result that indicates the sentiment expressed in the input text.
Our model improved the Accuracy, Recall, F-Score obtained by the three input models used individually (LSTM, VADER and TF-IDF). In addition, its evaluation on data from a different field than the one that provided the training data indicates that it was able to transfer the knowledge gained on an IMDB dataset to better handle a new Twitter US Airlines Sentiments dataset.

A. Recurrent Neural Network and Long Short-Term Memory
Recurrent neural networks (RNNs) are artificial neural networks that model the behaviors of dynamic systems using hidden states [13,14]. They have been the answer to most sequential data and natural language processing (NLP) problems for many years. This is because traditional neural networks take in a fixed amount of input data at a time and produce a fixed amount of output each time. In contrast, RNNs do not consume all the inputs at once. Instead, they take them one at a time and in a sequence. At each step, the RNN performs a series of calculations before producing an output. The output, called a hidden state, is then combined with the next input in the sequence to produce another output. This process continues until the model is scheduled to terminate or the input sequence ends.
However, a major shortcoming that affects the typical RNN is the vanishing/exploding gradient problem. This problem arises during backpropagation through the RNN during training, especially for networks with deeper layers. For this reason, the LSTM was proposed by Sepp Hochreiter and Jürgen Schmidhuber in 1997 [15].
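The step-by-step recurrence described above can be illustrated with a minimal sketch. It assumes a scalar Elman-style cell with a tanh activation; the function names and the weights `w_x`, `w_h` and `b` are hypothetical values chosen for illustration and are not taken from any model in this paper.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """One recurrence step: h_t = tanh(w_x * x_t + w_h * h_{t-1} + b)."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Consume the inputs one at a time and return every hidden state."""
    h = 0.0  # initial hidden state
    states = []
    for x in inputs:
        # The previous hidden state is combined with the next input.
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states

states = run_rnn([1.0, 0.0, -1.0])
```

Each element of `states` depends on all earlier inputs through the hidden state, which is what lets the network process sequences of arbitrary length.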
Long Short-Term Memory (LSTM) is a type of RNN architecture implementation that is faster and more accurate than standard RNN. Indeed, LSTM leads to many more successful executions and learns much faster. It also solves complex tasks that have never been solved by previous recurrent network algorithms and shows better performance for long range sequences than conventional RNN architectures [16,15].
LSTM has found its application in many fields that require sequential models, in this case NLP and especially sentiment analysis. For example, Thomas et al. [17] modeled the LSTM neural network to find the sentiment of transliterated text that has become the language of social media websites such as WhatsApp, Facebook and Twitter. A transliterated dataset was collected by scraping different websites. A sample of 10,000 datasets was prepared. Two of the layers were created for training and testing the data. Their model was trained with 65 units and a learning rate of 0.01. This work was able to achieve an average accuracy of 0.8151.
In order to solve sentiment analysis problems and improve the execution time, Zhixing et al. [9] proposed a fast sentiment analysis algorithm, called FAST-BiLSTM. The algorithm is realized by merging FastText and Bi-LSTM models. First, FastText has a fast speed for linear fitting and can generate pre-trained word vectors as a by-product. Second, Bi-LSTM uses the generated word vectors for training and then merges with FastText to perform full sentiment analysis. The results show that the temporal efficiency of the algorithm is improved by more than 30% and that FAST-BiLSTM can sufficiently extract contextual semantic information from texts.
In a similar context, a new architecture is proposed by Soubraylu et al. [18] by combining long short-term memory (LSTM) with word embedding to extract the semantic relationship between neighboring words, and weighted self-attention is also applied to extract key terms from reviews. Based on the experimental analysis of the IMDB dataset, the authors showed that the proposed word-embedded self-attention LSTM architecture achieved an F1 score of 88.67%, while the LSTM and word-embedding based LSTM models resulted in an F1 score of 84.42% and 85.69%, respectively. In [6], Wang et al. propose an LSTM that provides an attention mechanism to focus on different parts of the opinion sentence, given several aspects. Embedding of the aspect expression is taken into account with word sequence folding to assign attention weights with respect to a given aspect to each word.
In order to propose software that extracts Business Intelligence from sentiment analysis (SA) using an LSTM algorithm modified with a different activation function, Sreesurya et al. [20] analyzed the data using an LSTM machine learning approach, evaluating the sentiments on a scale from -100 to 100. The newly proposed activation function gave the best results compared to the existing artificial neural network (ANN) techniques. In [21], Dhanalakshmi et al. propose an analytics system that collects employee comments from open forums and performs sentiment analysis using the RNN-LSTM algorithm. In the sentiment analysis, the employee comments are classified as positive or negative so that the organization can identify the social sentiments of its brand and can take corrective actions to retain the employees. This paper also captures the performance of various models in training and predicting the employee feedback dataset; the models evaluated are logistic regression, support vector machine, random forest classifier, AdaBoost classifier, gradient boosting classifier, decision tree classifier and Gaussian Naive Bayes. The classification ratio and accuracy of each model are captured. When training the RNN-LSTM algorithm with a dataset of size 30k, the accuracy was 88%.

B. Valence Aware Dictionary and Sentiment Reasoner Lexicon and Rule-based Sentiment Analysis
The specific nature of social media content poses serious challenges to applications of sentiment analysis due to its huge bias and big data nature [33,34]. Indeed, traditional methods of textual sentiment analysis are mainly devoted to the study of extended texts, such as news stories and full documents. Microblogs are considered short texts that are often characterized by large noises, new words, and abbreviations. Previous emotion classification methods generally fail to extract meaningful features and produce a poor classification effect when applied to the processing of short texts or microtexts [35].
Valence Aware Dictionary and sEntiment Reasoner (VADER) is a rule-based lexicon and sentiment analysis tool that is specifically adapted to sentiments expressed in social media. VADER uses a sentiment lexicon which is a list of lexical features that are generally labeled based on their semantic orientation as positive or negative.
VADER is based on a wisdom of crowds (WotC) approach [36] to acquire a valid point estimate of the sentiment valence (intensity) of each lexical feature. The VADER evaluation was conducted by ten independent human raters (for a total of over 90,000 ratings), leading to the adoption of 7,500 lexical features with valence scores that indicate the polarity and intensity of sentiment on a scale of -4 (Extremely negative) to +4 (Extremely positive) [34]. This work has shown that VADER's performance exceeds even individual human raters.
VADER is sensitive to both the polarity and intensity (how positive or negative the sentiment is) of emotions, and it is adapted to the content of social networks that generally use informal writing (several punctuation marks, acronyms, emoticons, slang, etc.). Indeed, some of the heuristics used by VADER to incorporate the impact of each subtext on the perceived intensity of the sentiment in the text are part of the writing style on social networks, in this case punctuation (such as the exclamation mark that increases the magnitude of the perceived intensity) and capitalization that emphasizes an important word for the sentiment in the presence of other non-capitalized words [34].
The fact that VADER is a pre-trained model gives it an advantage with respect to users. For example, Borg et al. [37] examine sentiment analysis among customers of a large Swedish telecommunications company. The dataset consists of 168,010 emails with no sentiment information available. Therefore, the VADER model is used together with a Swedish sentiment lexicon to provide an initial labeling of the emails. After the labeling provided by VADER, the content is used to train two Support Vector Machine models for extracting and classifying the sentiment of the emails. In another work, Valdez et al. [38] analyzed the average daily sentiment of 86,581,237 U.S. time-series tweets with the VADER tool to understand what themes emerge from a corpus of U.S. tweets about COVID-19 and whether the sentiment changes in response to the pandemic. In [39], Al Mansoori et al. attempted to assess criminal behavior on Facebook and Twitter, and effectively classify the collected data as negative, positive, or neutral in order to identify a suspect by performing sentiment analysis using the VADER model. The VADER model was also used by Scholz et al. [40] to perform an integrated semantic analysis to provide the sentiment of tweets retrieved between 2008 and August 2018 for the purpose of detecting tourism flows in the province of Styria in Austria.

C. Term Frequency-Inverse Document Frequency
Statistical approaches such as machine learning and deep learning work well with numerical data. However, natural language consists of words and sentences. Therefore, before a sentiment analysis model can be created, text must often be converted into numbers. For this purpose, several approaches have been developed, such as Bag of Words, N-grams, Word2Vec and TF-IDF.
The Term Frequency-Inverse Document Frequency (TF-IDF) algorithm [41,42,43] is used to evaluate the importance of words in a textual corpus. The importance is proportional to the number of times the words appear in the document and inversely proportional to the frequency of words appearing in the corpus. Indeed, in a simple Bag of Words, each word has the same importance. The idea behind TF-IDF is that words that appear more frequently in one document and less frequently in other documents should have more importance because they are more useful for classification.
TF represents the frequency of a word, i.e. the number of times it appears in a document (Func 1). It is calculated as the number of occurrences of the word divided by the total number of words present in the document.
IDF is the measure of the importance of the term in the whole corpus. It consists in calculating the logarithm of the inverse of the proportion of documents in the corpus that contain the term (Func 2). This consists in calculating the total number of documents contained in the corpus over the number of documents where the word is present. It is the logarithm of this result that constitutes the value of the IDF.
The TF-IDF weight is calculated by multiplying the two measures (Func 3). Thus, the higher the weight, the more significant the word in question is within the corpus.
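Func 1 to Func 3 are not reproduced in this text. Under the standard definitions, which match the verbal description above, they can be written as follows. The symbols are assumptions introduced here: $n_{t,d}$ is the number of occurrences of term $t$ in document $d$, $D$ is the corpus and $N$ its number of documents.

```latex
\begin{align}
\mathrm{TF}(t, d) &= \frac{n_{t,d}}{\sum_{k} n_{k,d}} && \text{(Func 1)} \\
\mathrm{IDF}(t) &= \log \frac{N}{\lvert \{ d \in D : t \in d \} \rvert} && \text{(Func 2)} \\
\mathrm{TF\text{-}IDF}(t, d) &= \mathrm{TF}(t, d) \times \mathrm{IDF}(t) && \text{(Func 3)}
\end{align}
```

As a worked example, a term that occurs 3 times in a 100-word document has TF = 0.03. If it appears in 10 of 1,000 documents, then IDF = log(100), which is about 4.605 with the natural logarithm (the base is an assumption), giving a TF-IDF weight of about 0.138.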
The TF-IDF algorithm is often applied to texts for sentiment analysis. For example, Soumya et al. [44] performed sentiment analysis of Malayalam tweets using machine learning techniques. They used TF-IDF and Unigram with Sentiwordnet for training feature vectors of the input dataset, before classifying them using different techniques such as Naive Bayes (NB), Support Vector Machine (SVM) and Random Forest (RF). In [45], Ullah et al. proposed an algorithm and method for sentiment analysis using both text and emoticons. The two modes of data were analyzed in combination and separately with machine learning and deep learning algorithms to find sentiments from Twitter-based airline data using several features such as TF-IDF, N-gram and emoticon lexicons. On the other hand, Ayo et al. [46] adopted an approach that proposes an improved hybrid integration with a topic inference method and an improved neural network for hate speech detection in Twitter data. The proposed method uses a hybrid nesting technique that includes TF-IDF for word-level feature extraction and long short-term memory (LSTM) for sentence-level feature extraction.

III. ARCHITECTURE OF OUR HYBRID SENTIMENT ANALYSIS MODEL
The objective of our study is to build a hybrid sentiment analysis model (Fig. 1) that is based on three input models:
• A model based on the use of an LSTM layer, which was trained on a corpus of labeled IMDB reviews.
• The VADER lexicon, which is a pre-trained model based mainly on the wisdom of the crowd.
• A TF-IDF model that takes into account the importance of words in the text to estimate the sentiment. This model was also trained on the same dataset as the LSTM model.
The scores calculated by these three models are then combined in a classification model that returns whether the sentiment of the input occurrence is positive or negative.

A. LSTM Model
LSTM is a class of powerful neural networks for modeling sequence data such as time series or natural language. An optimal use of LSTM layer requires the preparation of the text to be analyzed. This preparation consists of cleaning and filtering, followed by tokenization, then word embedding. The vector representation of the words in the sentence is the input to our LSTM model which uses Softmax as an activation function to produce a multi-class categorical probability distribution and the Cross Entropy loss function.

1) Cleaning and filtering:
Once the sentence to be evaluated is available at the input of our model, it is first cleaned in order to eliminate all occurrences that may bias the subsequent processing, such as multiple spaces or spurious characters like excessive successions of punctuation marks.
The filtering operation was also carried out on the data used to train and test our model. This is an IMDB dataset containing 50,000 movie reviews for natural language processing, text analysis or binary sentiment classification [47].
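The cleaning step described above can be sketched as follows. This is a minimal illustration and not the authors' exact procedure: the function name `clean_text` is hypothetical, and collapsing a run of identical punctuation marks to a single mark is an assumption about what "excessive successions" means.

```python
import re

def clean_text(text):
    """Collapse repeated punctuation and redundant whitespace."""
    # Assumption: a run of the same punctuation mark is reduced to one mark.
    text = re.sub(r"([!?.,;:])\1+", r"\1", text)
    # Any run of whitespace (spaces, tabs, newlines) becomes one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()

cleaned = clean_text("Great   movie!!!  Really???")
```

Here `cleaned` is the string "Great movie! Really?".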
2) Tokenization: Tokenization is a process used to divide text into single words (unigram) or combinations of successive words (n-gram). This operation also creates an index mapping dictionary using the vocabulary of all the words in the model training text.
The N-gram model is widely used in computational linguistics to predict the next element in a contiguous sequence of n elements from a particular sample of text. However, in our case, and in order to use the GloVe model, the text has been divided into one-word tokens.
The resulting sequences have different lengths, and in order to handle both short and long reviews, it is preferable that all entries have the same length. This length has been defined as the sequence length. This sequence length is identical to the number of time steps for the LSTM layer and is the maximum length calculated for a comment in the training corpus (1744 tokens).
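The index mapping and fixed-length encoding described above can be sketched as follows. This is an illustration under stated assumptions: whitespace tokenization, index 0 reserved for padding, index 1 for out-of-vocabulary words, and padding appended at the end of the sequence. The paper does not specify these conventions, and the names `build_index` and `encode` are hypothetical.

```python
PAD, OOV = 0, 1  # assumed reserved indices

def build_index(texts):
    """Map each distinct unigram of the training texts to an integer >= 2."""
    index = {}
    for text in texts:
        for tok in text.lower().split():
            if tok not in index:
                index[tok] = len(index) + 2
    return index

def encode(text, index, seq_len):
    """Convert a text to exactly seq_len token ids (truncate or pad)."""
    ids = [index.get(tok, OOV) for tok in text.lower().split()]
    ids = ids[:seq_len]
    return ids + [PAD] * (seq_len - len(ids))

train = ["a great film", "not a great film at all"]
index = build_index(train)
# The sequence length is the longest training text, as in the paper.
seq_len = max(len(t.split()) for t in train)
short = encode("a great film", index, seq_len)
unseen = encode("a terrible film", index, seq_len)
```

With these toy texts, `seq_len` is 6, `short` is `[2, 3, 4, 0, 0, 0]` and `unseen` is `[2, 1, 4, 0, 0, 0]`, since "terrible" was never seen during training.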
3) Word Embedding with GloVe: Word embedding is a class of approaches for representing words using a dense vector representation. It is an improvement over traditional bag-of-words model coding schemes, which consist in marking each word in a vector that represents an entire vocabulary. Since the latter is vast, a given word will be represented by a large vector consisting mostly of null values.
Semantic vector space models of the language represent each word with a real-valued vector. Vectors can be used as features in various applications, such as document classification [48] or named entity recognition [49]. Indeed, word embedding improves text classification by solving the sparse matrix and word semantics problem. The two most common word embeddings are Word2Vec and GloVe. However, GloVe (Global Vectors for Word Representation), as its name suggests, is better at preserving global contexts because it creates a global co-occurrence matrix by estimating the probability that a given word co-occurs with other words.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 12, No. 7, 2021, www.ijacsa.thesai.org
GloVe is an unsupervised learning algorithm for obtaining vector representations of words. Training is performed on global word-word co-occurrence statistics aggregated from a corpus, and the resulting representations have linear substructures of the word vector space [50]. Pre-computed 100-dimensional GloVe embeddings of 400,000 words were used.

4) LSTM layer:
When defining the LSTM layer, 256 hidden units have been fixed. This layer is linked to a Softmax activation function. The Adam optimizer has been used (Table I); it is an adaptive learning-rate method that is known to work well in practice and compares favorably with other adaptive learning algorithms.

5) Softmax layer:
The softmax function is a function that transforms a vector of K real values into a vector of K real values that sum to 1. Whatever the values of the input, the softmax transforms them into values between 0 and 1, so that they can be interpreted as probabilities.
Softmax is a generalization of logistic regression that can be used for multi-class classification. Many multi-layer neural networks end with a penultimate layer that produces real-valued scores that are not properly scaled and can be difficult to work with. Here, the softmax is very useful because it converts the scores to a normalized probability distribution.
In our case, the softmax layer outputs two probability scores that correspond to the positivity and negativity of the input sequence.
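The softmax transformation can be illustrated with a short sketch. The two input scores below are arbitrary example values, not outputs of the model in this paper; subtracting the maximum before exponentiation is a common numerical-stability choice and does not change the result.

```python
import math

def softmax(scores):
    """Turn K real-valued scores into K probabilities that sum to 1."""
    m = max(scores)  # subtract the max to avoid overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Two raw scores, e.g. one for "positive" and one for "negative".
probs = softmax([2.0, -1.0])
```

The larger raw score receives the larger probability, and the two outputs always sum to 1 regardless of the scale of the inputs.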
Training and evaluation of the LSTM model
From the 50,000 reviews available in the dataset, 5,000 reviews were selected from the train set and 2,000 from the test set of our LSTM model. We checked that the number of positive reviews and the number of negative reviews in the dataset were balanced. Most of these reviews consist of several hundred words, and some reviews exceed a thousand words. The average number of words used in the reviews in the dataset is 1309.

B. VADER Lexicon
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a rule-based lexicon and sentiment analysis tool that is specifically adapted to sentiments expressed in social media. VADER uses a combination of a sentiment lexicon and a list of lexical features (e.g., words) that are generally labeled according to their semantic orientation as positive or negative.
VADER produces four measures of sentiment from these word ratings. The first three, positive, neutral, and negative, represent the proportion of text that falls into these categories. The final metric, the composite (compound) score, is the sum of all lexicon ratings, normalized to lie between -1 and 1.
It should be recalled that the VADER model is sensitive to punctuation and capitalization [34]. Therefore, special characters have not been filtered out and text has not been converted to lowercase in order to capture the full sentiment.
Evaluation of the VADER model
The VADER model was evaluated on the same test dataset used for the evaluation of the LSTM model (2,000 reviews). The composite score was retained after normalizing it between 0 and 1.
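The text does not state how the composite score was normalized. One simple possibility, shown here purely as an assumption, is a linear rescaling from [-1, 1] to [0, 1]; the function name `rescale_compound` is hypothetical.

```python
def rescale_compound(compound):
    """Linearly map a score from [-1, 1] to [0, 1] (assumed normalization)."""
    if not -1.0 <= compound <= 1.0:
        raise ValueError("compound score must lie in [-1, 1]")
    return (compound + 1.0) / 2.0

neutral = rescale_compound(0.0)
```

Under this mapping a neutral text (score 0) lands at 0.5, which puts the VADER output on the same [0, 1] scale as the positivity scores of the other two input models.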

C. TF-IDF Model
The TF-IDF approach is used to create numerical feature vectors from text. It is a method very often used in text classification that gives information about the occurrence of words.

1) Data pre-processing:
As for the LSTM model, the targeted sentence is first cleaned to eliminate all unnecessary occurrences such as multiple spaces, strings of numbers or URLs. Then all the text is converted to lowercase so as not to have two different dimensions for the same word at the time of vectorization. Stopwords have also been removed from the text. These are very common words in the studied language that do not bring any informative value for the understanding of the meaning of a document or corpus. In addition, they are very frequent and are part of the common vocabulary, so keeping them would significantly slow down the processing that follows.

2) Tokenization and lemmatization:
The same tokenization module used for the LSTM model is used for the TF-IDF model, i.e. the input texts have been segmented into tokens of one word each (unigram) before being lemmatized.
Lemmatization refers to a lexical treatment of a text in order to analyze it. Stemming and lemmatization refer to text normalization in the field of natural language processing and are widely used in text mining.
The difference between stemming and lemmatization is that stemming simply removes the last characters, which often leads to incorrect meanings and sometimes even misspellings, whereas lemmatization considers the context and converts the word into its canonical form recorded in the dictionaries of the relevant base language.
For our model, the WordNet lemmatizer, which uses the WordNet database to look up word lemmas, has been used. Indeed, WordNet [51] is a large lexical database, freely and publicly available for the English language, aiming at establishing structured semantic relations between words. Nouns, verbs, adjectives and adverbs are grouped into cognitive synonym sets (synsets), each expressing a distinct concept. The majority of WordNet's relationships connect words from the same Part Of Speech (POS). Among the features it offers is one of the oldest and most commonly used lemmatizers.
3) TF-IDF vectorization: A Bag of Words (BoW) converts text into a feature vector by counting the occurrences of words in a document without considering their importance. TF-IDF builds on the BoW model but additionally carries information about which words in a document are more important and which are less important.
In order to convert a collection of raw documents into a TF-IDF feature matrix, a vocabulary that only considers the top 500 terms ranked by term frequency in the corpus has been built, and terms that appear too frequently (in more than 50% of documents) or too infrequently (in fewer than 7 documents) have been removed (Table II). This allows us to ignore words that have too few occurrences to be considered significant or that are, conversely, too frequent in the corpus.
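The vocabulary selection described above can be sketched in plain Python. This is an illustration, not the authors' implementation: the function name and parameter names are hypothetical, and the document-frequency filters are assumed to be applied before the top-500 cut so that the final vocabulary can reach the stated 500 dimensions.

```python
from collections import Counter

def build_vocabulary(docs, max_features=500, max_df=0.5, min_df=7):
    """docs: list of token lists. Returns retained terms, most frequent first."""
    n_docs = len(docs)
    term_freq = Counter()  # total occurrences in the corpus
    doc_freq = Counter()   # number of documents containing the term
    for tokens in docs:
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    # Drop terms that are too rare or too common across documents.
    kept = [t for t in term_freq
            if doc_freq[t] >= min_df and doc_freq[t] / n_docs <= max_df]
    # Rank by corpus frequency (ties broken alphabetically), keep the top k.
    kept.sort(key=lambda t: (-term_freq[t], t))
    return kept[:max_features]

toy_docs = [["good", "film"], ["good", "plot"],
            ["bad", "film"], ["good", "bad", "film"]]
vocab = build_vocabulary(toy_docs, max_features=2, max_df=0.75, min_df=2)
```

On this toy corpus, "plot" is dropped because it appears in only one document, and the two most frequent remaining terms, "film" and "good", form the vocabulary.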

4) Linear regression:
Regression is a method of modeling a variable (called target) as a function of independent predictors (called features), where the algorithm involved tries to find the relationships between the variables [52].
Since the TF-IDF feature matrix contains 500 dimensions, and each of these dimensions represents the relevance score of a word ($tfidf_i$), our goal is to establish a regression model (Func 4) that will allow us to compute the weights ($\beta_i$) of the 500 most significant words in the corpus with respect to the sentiment score ($Score_i$).
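Func 4 is not reproduced in this text. Assuming an ordinary multiple linear regression over the 500 TF-IDF features, it would take the following form for a single review; the intercept $\beta_0$ and the error term $\varepsilon$ are assumptions that the text does not mention.

```latex
\mathrm{Score} = \beta_0 + \sum_{i=1}^{500} \beta_i \, \mathit{tfidf}_i + \varepsilon
```

Under this form, a positive $\beta_i$ means that a higher TF-IDF weight for word $i$ pushes the predicted sentiment score upward, and a negative $\beta_i$ pushes it downward.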
Sentiment analysis applications usually rely on classification models rather than regression models. Here, however, the objective is not to calculate a binary score but a continuous value (like the scores calculated with the LSTM and VADER models). These three scores will constitute the inputs of the final classification model (Fig. 1).
In order to train the regression model, the same dataset as the one used for training the LSTM model (5000 reviews) was used.

Evaluation of the TF-IDF model
The TF-IDF model was evaluated on the same test set used to evaluate the LSTM and VADER models (2000 reviews). The evaluation scores of the three models (LSTM, VADER and TF-IDF) are used as reference values to compare them to the scores of our proposed architecture in this study.

D. Classifier Model
We recall that our objective is to combine the three input sentiment analysis models with a classification model in order to improve the performance of predictions on the sentiment conveyed through the input text. Indeed, the LSTM model, which belongs to the RNN family, is distinguished by its ability to adapt to sequential data. The VADER model has proven its efficiency in the microblogging domain. Finally, the TF-IDF model is characterized by its ability to handle the most significant words in a document. Higher Accuracy and F1 scores than those obtained by the three models used separately on the same data are therefore expected. We also recall that the LSTM and TF-IDF models have been trained on IMDB review texts, while VADER is a pre-trained model.
Our classification model contains three inputs that are directly related to the outputs of the LSTM, VADER and TF-IDF models. The values of these inputs are continuous in a range of [0,1] and the output of the classification model returns a binary result (positive or negative) which is the prediction of the sentiment of the text of the full model input (Fig. 1). 5000 random reviews have been selected from the dataset that are different from the training set and test set data used for the LSTM and TF-IDF models. We ran them through the input of our global model to obtain the predictions computed by the LSTM, VADER and TF-IDF models. Then we divided these results into two batches (75% for the train set and 25% for the test set), in order to train and evaluate our binary classification model, implementing each of the following five classification algorithms: Logistic Regression (LR), k Nearest Neighbors (k-NN), Random Forest (RF), Support Vector Machine (SVM) and Naive Bayes (NB).
The hyperparameters of each of the classification algorithms used were tuned to obtain the best possible evaluations for our dataset. Table III gives an overview of the most important hyperparameters that were applied to our classification models.
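The combination step can be sketched as a small stacking example. Everything below is illustrative: the three input scores are synthetic stand-ins generated at random (not outputs of the actual LSTM, VADER and TF-IDF models), logistic regression is fitted with plain batch gradient descent, and the learning rate and epoch count are arbitrary assumptions. Only the three-input, binary-output structure and the 75%/25% split follow the text.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(x, w, b):
    """Return 1 (positive) or 0 (negative) for one triple of scores."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0

rng = random.Random(0)

def fake_scores(label):
    """Synthetic (LSTM, VADER, TF-IDF) scores in [0, 1] for one review."""
    center = 0.7 if label == 1 else 0.3
    return [min(1.0, max(0.0, rng.gauss(center, 0.15))) for _ in range(3)]

labels = [i % 2 for i in range(400)]
features = [fake_scores(label) for label in labels]
split = int(0.75 * len(labels))  # 75% train, 25% test
w, b = train_logistic(features[:split], labels[:split])
test_pairs = list(zip(features[split:], labels[split:]))
accuracy = sum(predict(x, w, b) == t for x, t in test_pairs) / len(test_pairs)
```

The classifier sees only the three continuous scores, never the raw text, so any of the five algorithms named above could be substituted for the logistic regression in this sketch.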

A. Evaluation of the Binary Classification Model
The binary classification model is the block that returns the final result of the sentiment experienced in the input text of the full model. Its three inputs come from the three input models (LSTM, VADER and TF-IDF). Table IV lists the Accuracy of each classification model following its evaluation on the test data.

B. Evaluation of our Model with IMDB Dataset Data
In the following, the results obtained using the proposed architecture will be exposed. In order to better identify the performance improvement that it has allowed, the complete model was evaluated on the same test set that evaluated the LSTM, VADER and TF-IDF models separately. Fig. 2 shows mean micro-averaged for the model by implementing the five different algorithms in the model Classifier. Table I shows that after training the LSTM model, its evaluation on the test set gave an accuracy of 0.829 and an F1 score of 0.835. As for the VADER model (which is a pretrained model), its evaluation on the same testset gave an accuracy of 0.723 and an F1 Score of 0.766. With the TF-IDF model, an accuracy of 0.789 and an F1 score of 0.792 have been obtained. Between these three basic models, it turns out that the LSTM model shows higher scores in terms of accuracy, Recall and F1 score.  After training and evaluating our model using the five proposed classification algorithms, the performance metrics shown in Table VI have been obtained. The evaluation scores obtained using our model is different depending on the classification algorithm used. However, whatever the algorithm, the scores are better than those obtained using the three models LSTM, VADER and TF-IDF separately, except for the F1 scores obtained using Random Forest (0.83) which is slightly lower than the F1 scores obtained using the LSTM model (0.835), but higher than the F1 scores obtained using VADER and TF-IDF (respectively 0.766 and 0.792).
The average Accuracy obtained with our model across the five classification algorithms (0.854) is 9.517% higher than the average Accuracy obtained using the three models LSTM, VADER and TF-IDF (0.780), and the average F1 score obtained using the full model (0.856) is 7.363% higher than that obtained using the three models separately (0.797) (Table VI).
It should be noted that the Logistic Regression model offers a better Accuracy (0.878) than the three models LSTM, VADER and TF-IDF (0.829, 0.723 and 0.789, respectively), i.e. 5.91% higher than the accuracy obtained with LSTM, which is the best score recorded among the three initial models. Logistic Regression also offers a better F1 score (0.881), which is 5.51% higher than the F1 score of LSTM (0.835) (Table VII). Overall, Fig. 3 shows that our model gave the best performance using the Logistic Regression, k-NN, SVM and Naive Bayes models. The Random Forest model, on the other hand, gave a slightly lower F1 score than the LSTM model, but still outperformed VADER and TF-IDF.
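The percentages quoted for Logistic Regression are relative improvements over the LSTM baseline, not differences in percentage points. A short sketch of the arithmetic, using only the scores reported in the text (the function and variable names are illustrative):

```python
def relative_improvement(new_score, baseline_score):
    """Percentage by which new_score exceeds baseline_score."""
    return (new_score - baseline_score) / baseline_score * 100.0

# Scores reported in the text for the IMDB test set.
lstm_accuracy, lstm_f1 = 0.829, 0.835
logreg_accuracy, logreg_f1 = 0.878, 0.881

accuracy_gain = relative_improvement(logreg_accuracy, lstm_accuracy)
f1_gain = relative_improvement(logreg_f1, lstm_f1)
print(f"accuracy: +{accuracy_gain:.2f}%  F1: +{f1_gain:.2f}%")
```

Rounded to two decimals, this gives 5.91% for the accuracy and 5.51% for the F1 score, matching the reported figures.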

C. Evaluation of our Model with Data from the Twitter Dataset
The proposed model has also been evaluated on the US Airlines Sentiments Twitter dataset available on Kaggle [53]. This is a set of labeled tweets that was treated as a binary classification problem. The dataset contains 14427 unique texts that were used as a test set for our models.
It should be noted that the structure of the data in this dataset differs from that of the IMDB movie review dataset. On the one hand, the tweets are much shorter (with an average of 104 words, compared to 1309 words for the IMDB reviews), and on the other hand, due to the nature of the topic being reviewed, the vocabulary used most likely contains words that our LSTM and TF-IDF models never saw during training.
Fig. 4 shows the mean micro-averaged results for the model when each of the five algorithms is implemented in the Classifier. As expected, the performance of the LSTM and TF-IDF models dropped considerably: the accuracy and F1 score of the LSTM model are 0.66 and 0.67, respectively, and those of the TF-IDF model are 0.667 and 0.637. The VADER model, on the other hand, showed almost the same scores as for the IMDB data (Table VIII).
However, the accuracy of our model remains higher than that of the LSTM, TF-IDF and VADER models with every classification algorithm (0.767, 0.8, 0.747, 0.754 and 0.773 for Logistic Regression, k-Nearest Neighbors, Random Forest, SVM and Naive Bayes, respectively). The same is true for the F1 scores obtained with Logistic Regression, k-NN and Naive Bayes, which are 0.733, 0.75 and 0.746, respectively (Table IX). The average accuracy obtained with our model across the five classification algorithms is 12.58% higher than the average accuracy of the three models LSTM, VADER and TF-IDF, and the average F1 score of the full model is 7.26% higher than that of the three models taken separately (Table X). If the VADER model, which displayed the best scores among the input models (Accuracy=0.72 and F1 score=0.723), is taken as a reference, the proposed model records an improvement of 11.11% in accuracy and 3.73% in F1 score using the k-NN algorithm.
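The vocabulary mismatch mentioned in this subsection can be quantified as an out-of-vocabulary rate: the share of tokens in the new texts that never occur in the training texts. The sketch below is illustrative only; the tokenizer is a deliberately simple assumption and the sentences are invented, not taken from either dataset.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def oov_rate(train_texts, new_texts):
    """Fraction of tokens in new_texts that are absent from the training vocabulary."""
    vocabulary = {token for text in train_texts for token in tokenize(text)}
    new_tokens = [token for text in new_texts for token in tokenize(text)]
    if not new_tokens:
        return 0.0
    unseen = sum(1 for token in new_tokens if token not in vocabulary)
    return unseen / len(new_tokens)

# Invented examples standing in for movie reviews and airline tweets.
train_texts = ["The movie was great", "The plot was boring"]
new_texts = ["The flight was delayed"]

rate = oov_rate(train_texts, new_texts)
print(f"out-of-vocabulary rate: {rate:.2f}")
```

A model whose features are tied to the training vocabulary, such as TF-IDF, has no information about the unseen tokens, whereas a fixed general-purpose lexicon such as VADER does not depend on the training corpus.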

V. DISCUSSION
According to the results obtained, the proposed model shows better performance in terms of accuracy and F1 score, exceeding the best of the three input models (LSTM, VADER and TF-IDF) by up to 5.91% for accuracy and 5.51% for F1 score. This peak was obtained by implementing the Logistic Regression algorithm in our Classifier model and by evaluating our model on the IMDB dataset, from which the training data was also drawn.
On the other hand, when the proposed model was evaluated on the Twitter US Airlines Sentiments dataset, its performance decreased, as expected, but remained better overall than that of the three input models. Indeed, using the k-NN algorithm, we recorded an accuracy 11.11% higher and an F1 score 3.73% higher than the highest accuracy and F1 score among the input models, both of which were displayed by the VADER model (0.72 and 0.723, respectively).
It would be useful to recall that the structure of the US Airlines Sentiments Twitter dataset is different from the IMDB movie review dataset in terms of text size and vocabulary used. However, our model managed to display better scores compared to the three input models (LSTM, VADER and TF-IDF).
This improvement could be explained by the combination of the different techniques used in the three input models. Indeed, LSTM is more adapted to sequential data such as time series, speech and text [16]. VADER is a pre-trained lexicon focused on the wisdom of the crowd and mainly adapted to microblog data [34]. TF-IDF, on the other hand, takes into account the presence of the most significant words in a textual corpus [41]. We can therefore conclude that the combination of these three basic models through a classification model has allowed this performance improvement by capturing each of the different features of the input text according to their operating mode.
Moreover, although most machine learning algorithms are based on the assumption that the training dataset and the test dataset belong to the same descriptor space and follow the same probability distribution [19], our model was able to transfer the knowledge gained on the IMDB dataset to better process the new US Airlines Sentiments Twitter dataset. Although the scores obtained are not very high, they are still better than those obtained by the LSTM, VADER and TF-IDF models separately.

VI. CONCLUSION
The content created by users of social media (such as Twitter, Facebook or Instagram) and dedicated platforms (such as IMDB or Airbnb) reflects one of the richest sources of data in terms of opinions and knowledge. The data they encompass offers great opportunities for companies to extract valuable and expressive knowledge. For this reason, a field like sentiment analysis, which seeks to determine the opinion and subjectivity of people's reviews from unstructured written text, is growing rapidly.
Although many sentiment analysis models have been proposed over more than a decade, they are generally data-intensive and computationally expensive. Indeed, most of these models require a huge amount of training data to achieve satisfactory performance metrics, namely accuracy and F1 score.
The objective of our study was to propose a hybrid sentiment analysis model based on three basic models, namely LSTM, VADER and TF-IDF. Each of these models captures different aspects of the same text. These models are then combined in a classification model where each of the following five algorithms has been implemented: Logistic Regression, k-Nearest Neighbors, Random Forest, Support Vector Machine and Naive Bayes. The output of our model delivers a binary score that reflects the sentiment of the input text. The proposed model was trained on 5000 IMDB movie reviews and evaluated on other reviews from the same dataset. It was then evaluated on the Twitter US Airlines Sentiments dataset, which has a different structure in terms of text size and vocabulary used.
The results suggest that, depending on the classification algorithm implemented, our model displays higher Accuracy and F1 scores than those achieved by the three basic models. Indeed, with Logistic Regression, an improvement of 5.91% for the Accuracy and 5.51% for the F1 score was noted on the evaluation data of the IMDB dataset. These improvements were calculated with respect to the best performances achieved by the three basic models. After evaluating our model on the US Airlines Sentiments Twitter data, an overall decrease in performance was noted. However, the performance of the model remains higher than that recorded by the three basic models. Indeed, using the k-NN algorithm, we recorded an accuracy 11.11% higher and an F1 score 3.73% higher, which indicates that our model was able to transfer the knowledge acquired on the IMDB dataset to better process the new US Airlines Sentiments Twitter dataset.
As future work, it would be interesting to improve the proposed model by implementing a BiLSTM model based on self-attention in order to capture the polarity of a whole sentence that may contain several aspect terms. Such an improvement could have a significant impact on the evaluation metrics, namely the accuracy and the F1 score.