An Efficient Aspect based Sentiment Analysis Model by the Hybrid Fusion of Speech and Text Aspects

Aspect-based Sentiment Analysis (ABSA) is considered a challenging task in the speech domain, as it requires the fusion of acoustic and linguistic features for information retrieval and decision making. Existing studies on speech are limited to speech and emotion recognition. The main objective of this work is to combine acoustic features in speech with linguistic features in text for ABSA. A deep learning model and a language model are implemented for acoustic feature extraction in speech. Several text feature extraction techniques are used for aspect extraction in text: trained lexicons, a Latent Dirichlet Allocation (LDA) model, a rule-based approach and an Efficient Named Entity Recognition (E-NER) guided dependency parsing approach. Sentiment with respect to the extracted aspect is analyzed using Natural Language Processing (NLP) techniques. The experimental results demonstrate the effectiveness of hybrid-level fusion: the proposed model yields improved results of 5.7% WER and 3% CER when compared with the traditional baseline models that use linguistic or acoustic features individually.
Keywords—Acoustic; aspect-based sentiment analysis; decision making; emotion; extraction; hybrid; lexicon; linguistic; natural language processing; speech


I. INTRODUCTION
Sentiment analysis, or opinion mining, is the area of NLP that analyzes polarity with respect to a given context. It captures a state of mind by automating the analysis of opinion, emotion, polarity, appraisal, interest, ideology, attitude and feelings towards an entity. Sentiment analysis plays an important role in everyday analysis and decision making. In most existing studies, sentiment analysis is carried out on text, and performance is differentiated by varying the type of linguistic features extracted from the text. Features of text are generally called linguistic features and play a crucial role in sentiment analysis. Owing to the tremendous growth of data on the World Wide Web (WWW), traditional and web-based surveys are nowadays being replaced by sentiment analysis [1]. As the WWW is a combination of text, audio and video, there is a need to analyze sentiment on multimodal data. Feature extraction for sentiment analysis differs across input types such as text, audio and video. Sentiment analysis in NLP first gained popularity through its application to text. With the emergence of massive data, research has expanded, and the field is no longer confined to text but has also gained popularity in other modalities. When sentiment analysis first emerged, it was carried out only on text using NLP and machine learning techniques, with the polarity of a given document or sentence classified as positive, negative or neutral [1]. The next era of sentiment analysis is aspect-based sentiment analysis (ABSA), which has gained popularity in recommender systems. Most recommender systems that use ABSA identify the sentiment with respect to the aspect in the given text. Parts-of-Speech (POS) tagging is one of the most widely used aspect identification techniques for ABSA [4].
In this paper, aspect-based sentiment analysis is carried out by combining both audio and text features.
Most of the research carried out so far on audio data is confined to speech analysis and emotion recognition. In the existing studies [6], various acoustic features are analyzed and classified for speech emotion recognition. Identifying sentiment in speech is a challenging task for the following reasons:
• Although the terms emotion and sentiment both express feelings with respect to the context, the way they are analyzed is different.
• Emotion can be analyzed in speech by means of various acoustic and prosodic features such as pitch, intensity, energy and loudness, whereas in text the sentiment is defined as an adjective that qualifies the respective noun.
• It is difficult to map emotion in speech to parts of speech in text for analyzing the sentiment. Although many existing studies perform sentiment analysis on speech, the work is limited to analyzing only the emotion in speech (happy, sad, angry, fear, etc.) and not the positivity or negativity in the given context.
As speech and text features differ, there is a need to bridge the gap between them to perform sentiment analysis. Speech in call centers and text in recommender systems have gained popularity in the field of sentiment analysis [15]. Fig. 1 depicts the sentiment analysis model that considers bimodal speech and text features. The main contributions of the proposed work are:
• The importance of linguistic and acoustic features for ABSA is analyzed.
• A hybrid-level fusion of acoustic and linguistic features for ABSA is evaluated using the Word Error Rate (WER) metric and machine learning algorithms.
• The results obtained from the proposed combined model are validated against the individual implementations of speech-based and text-based sentiment analysis.

II. MOTIVATION
The field of sentiment analysis is attracting attention in marketing, corporate settings and academia because it allows tasks to be executed easily and efficiently. However, most traditional frameworks work on only one of text, audio or video, and very few studies have been carried out on multimodal data. Nowadays, sentiment is analyzed at the aspect level, and major limitations have been identified in feature extraction and in identifying sentiment-related aspect categories. This motivated the present work towards implementing aspect-based sentiment analysis on multimodal speech and text data. Identifying sentiment with respect to the aspect helps to improve quality of service when compared with document-level and sentence-level sentiment classification. The tremendous growth of data available on social media and online commercial websites has led many people to provide online reviews in the form of YouTube videos. Previously, consumers used to make purchase decisions by analyzing the text reviews given by customers [16]. In some cases, for example when no customer has yet bought the product, no rating or review is available for that product, and the consumer is not in a position to decide whether to buy it. This motivated the development of an aspect-based sentiment analysis model on YouTube review data to improve the quality of service to consumers.

III. RELATED WORK
The main objective of the proposed model is to perform aspect-based sentiment analysis by combining linguistic and acoustic features. Acoustic and linguistic feature extraction techniques are applied for feature extraction [14] on a YouTube product review dataset. The baseline models are implemented by considering linguistic and acoustic features individually and are validated using machine learning algorithms. A hybrid-level fusion of acoustic and linguistic features for ABSA yields improved results when measured in terms of accuracy, precision, recall and F-score.
Existing methodologies in sentiment analysis have used bag of words and parts-of-speech tagging as feature extraction techniques on text [2]. That work is limited to classifying domain-specific sentiment and results in document-level sentiment classification with poor efficiency. With the evolution of topic modelling and the use of LDA, it became possible to classify sentiments by grouping them into topics. However, this approach cannot automate the assignment of labels to the grouped topics, so manual assignment is needed. The literature review in this paper analyzes the impact of aspect-based sentiment analysis by considering linguistic and acoustic features.

A. Linguistic Features: Aspect-based Sentiment Analysis on Text Data
Sentiment analysis has a wide variety of applications that have been tried on textual data. Its evolution has simplified many real-time applications in commercial markets, such as analyzing customer and employee feedback in an organization, recommending products in e-commerce, supporting purchase decisions, and analyzing political opinion and movie reviews. Many studies have been carried out to identify sentiment in text at various levels, such as document level, sentence level, aspect level and context level [1].
Md. E. Mowlaei et al. proposed an adaptive lexicon-based ABSA [2] that uses three different types of lexicons (opinion lexicon, Sent-WordNet and Subjective) to implement dynamic aspect-based sentiment analysis. The proposed methodology overcomes the limitations of existing domain-dependent static lexicon approaches. Although the model identifies context-dependent aspects dynamically, it fails to identify implicit aspects.
O. Alqaryouti et al. proposed an integrated lexicon and rule-based approach [3] for aspect-based sentiment analysis that identifies both implicit and explicit aspects in order to improve the efficiency of sentiment classification. However, the lexicons used for generating the aspects are assigned manually to achieve higher efficiency in identifying the implicit and explicit aspects. A rule-based approach is used to integrate the extracted aspects and sentiments for classification. The model is implemented on government review data, where the general public post their opinions, and the authors suggest that it can be useful in mobile apps for analyzing feedback from the public or from customers.
V.S. Anoop et al. proposed an aspect-based sentiment analysis model on text using the topic modelling technique LDA [4]. The LDA algorithm segments the input text into topics, which are then mapped manually to a relevant aspect. This manual mapping becomes very difficult when a large amount of data has to be processed for sentiment analysis.
M. Shams et al. proposed a language-independent aspect-based sentiment analysis model that goes through three phases of fine-grained operations [5]. The aspects are extracted using prior knowledge of the dataset, and aspect word sets are used for mapping the polarity to the aspect. Finally, an expectation-maximization algorithm is used to calculate the weight of each word with respect to its aspect and assigned sentiment.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 12, No. 9, 2021, www.ijacsa.thesai.org
M. Syamala et al. proposed a deep fusion mechanism [19] to overcome the limitation of manually assigning topic labels to the topics extracted by LDA. The topics extracted by LDA are converted into word embeddings and trained over a one-layer neural network to determine the topic label for each set of extracted topics. The proposed sentiment classification model was compared against models implemented with and without LDA.

B. Acoustic Features: Aspect-based Sentiment Analysis on Audio Data
Most of the research carried out on speech concerns either speech recognition or emotion recognition. Emotion recognition in speech differs from sentiment identification in text. Recognition of emotion in speech depends on various factors such as pitch, volume, frequency, time, intensity, jitter and noise, whereas identification of sentiment in text is independent of all such external environmental factors. There is therefore a need to understand the fusion mechanism between speech and text features for performing sentiment analysis. This section presents some of the existing work on sentiment analysis of speech data.
D. Griol et al. proposed a fusion mechanism between speech and text features [6]. The features extracted from both modalities are trained for emotion classification in speech and sentiment classification in text. Acoustic and contextual features are extracted from speech for emotion classification, and semantic features are extracted from transcriptions for sentiment identification. The proposed fusion model takes into account the context-related errors that are peculiar to transcriptions derived from speech.
Zhiyun Lu et al. proposed an end-to-end automatic speech emotion recognition model using pre-trained speech and text features from the IEMOCAP dataset [7]. They also built a speech sentiment database to enhance sentiment analysis in speech, which is considered one of the current challenges in this field of research. The trained features are classified using a self-attention Recurrent Neural Network (RNN) to differentiate sentiment with respect to the language model.
Bryan Li et al. combined acoustic and lexical features to develop a sentiment analysis model for analysing customer service calls [8]. Acoustic low-level descriptors such as MFCC, intensity, pitch and loudness are extracted using openSMILE. Lexical features are extracted by considering n-grams. The lexical classifier model was built on the IEMOCAP dataset to compare speech transcriptions and to choose the best-performing speech recognition model. A decision-level fusion mechanism, also known as late fusion, was implemented to train the inputs of the two modalities into a classifier for decision making or classification.
Dong Zhang et al. proposed a REINFORCED approach, which differs from the self-attention model by concentrating on word-level features in both speech and text and by avoiding low-weighted and noisy features [9]. Although the title of that paper refers to sentiment in speech and text, emotions are actually classified by training the extracted features in a deep learning model with a SoftMax layer.
Maghilnan S et al. proposed speech sentiment analysis on speaker-specific data [10]. In that model, a conversation between two entities is taken as input, but the model cannot handle cases where both entities speak simultaneously. Two independent tasks are carried out: speaker identification and speech transcript generation. The two outputs are then used to map the transcribed text to its speaker ID. Finally, the output text dialogue is classified into sentiment based on its polarity.

IV. DEEP FUSION OF LINGUISTIC AND ACOUSTIC FEATURES
As the proposed model in this paper performs ABSA by considering both speech and text data, it is important to know the different ways of fusing linguistic and acoustic features. In general, research in this area defines three basic variants of fusion mechanism: feature-level fusion, decision-level fusion and hybrid-level fusion.

A. Feature-Level Fusion
Feature-level fusion, also known as early fusion, extracts features from the various modalities separately and then performs a deep classification analysis on the fused features to enhance performance. The main advantage of this type of fusion is that modality-dependent features are derived at an early stage, which helps the models achieve greater improvement. The main drawback of feature-level fusion is that the aspects may differ from one modality to another, so an accurate analysis cannot be achieved when the combined analysis is performed. For example, the features in speech are acoustic, whereas the features in text are linguistic. Poria S et al., in their paper on multimodal emotion recognition and sentiment analysis [11], used feature-level fusion to fuse three modalities of YouTube data. A deep convolutional neural network was used to extract speech and visual features, and word embeddings and parts-of-speech tagging were used to extract textual features. A multiple kernel learning classifier was used to fuse the features and analyze the sentiment.

B. Decision-Level Fusion
In decision-level fusion, the features from different modalities are extracted separately and classified separately. The results obtained from each classification are merged into a feature vector for final decision making. The advantage of this approach is that the final feature vector obtained from the decision fusion of the individual modalities is in a single format, so no conversion is required. The drawback of this fusion is that classifying different modalities involves different types of classifiers.
Wöllmer M et al. used a decision-level fusion mechanism in their paper [12] to fuse audio, visual and text features of YouTube input data. The extracted acoustic and visual features are trained by an LSTM for sentiment score evaluation, and a Support Vector Machine (SVM) is used to train on the textual features and derive their sentiment score. Final decision-level late fusion was performed for the final sentiment prediction by calculating the weighted sum of the sentiment scores, assigning a weight of about 1.2 to the linguistic score and 0.8 to the audio and visual score.
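The weighted-sum late fusion described for [12] can be illustrated with a minimal sketch. The weights 1.2 and 0.8 come from the text above; the score range of [-1, 1], the zero decision threshold and the function names are assumptions made only for illustration and are not taken from [12].

```python
def late_fusion(linguistic_score, audio_visual_score,
                w_linguistic=1.2, w_audio_visual=0.8):
    """Weighted sum of per-modality sentiment scores (decision-level fusion).

    Each score is assumed to lie in [-1, 1] and to come from a separately
    trained classifier for that modality.
    """
    return w_linguistic * linguistic_score + w_audio_visual * audio_visual_score


def polarity(fused_score, threshold=0.0):
    """Hypothetical decision rule: the sign of the fused score gives the class."""
    return "positive" if fused_score > threshold else "negative"


# The text classifier is mildly positive, the audio-visual one mildly negative.
fused = late_fusion(0.5, -0.25)
print(round(fused, 3), polarity(fused))
```

Because the linguistic score carries the larger weight, the fused decision follows the text classifier whenever the two modalities disagree by similar margins.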

C. Hybrid-Level Fusion
In hybrid-level fusion, the model uses both feature-level and decision-level fusion mechanisms in order to overcome the drawbacks of the individual fusions.
Yue Gu et al. proposed an attention-based hybrid multimodal network for spoken language classification using a hybrid fusion approach [13]. Word2Vec features are extracted from text and Mel-frequency spectral coefficients (MFSCs) from audio. The extracted features are individually trained over an LSTM to obtain informative context-related words and frames, which constitutes a feature-level fusion. Finally, a modality-level fusion, i.e., a decision-level fusion, is performed by passing the extracted individual text and audio features through an attention layer to extract informative modality-level features.

V. PROPOSED MODEL
In this paper, a novel aspect-based sentiment analysis model is implemented on speech and text data. The dataset used for implementing the model is drawn from the YouTube social platform. In order to compare sentiment across modalities in the experimental evaluation, both the speech and text models are tested on the same dataset. The domain chosen for the experimental analysis is real-time product review data. In the initial phase, the raw audio of the YouTube product review video is passed to a speech analysis model. The speech analysis model maps the acoustic spectrogram features of the speech signal to the respective word utterances using a deep learning model and a language model. The word utterances from the speech analysis model are passed through different text feature extraction techniques to derive related and relevant aspects. The sentiment with respect to each derived aspect is then analyzed to perform aspect-based sentiment analysis. The components (features) in speech and text data are processed individually and then fused, so the whole process uses a hybrid fusion mechanism for mapping speech and text features for ABSA. Fig. 2 shows the workflow of the proposed speech and text analysis model for efficient aspect-based sentiment analysis.

A. YouTube Product Review Data Collection and Processing
In this phase, YouTube product reviews of the Samsung M31 mobile were downloaded as the dataset. On YouTube, a social platform where people share their live experience in the form of reviews, speakers have a natural, spontaneous speaking style. As the way the speaker speaks has a direct impact on the accuracy of the model, the dataset was downloaded from YouTube for performing aspect-based sentiment analysis on speech data. In total, 40 YouTube reviews (90 KB in size) of the Samsung M31, with a strong presence of subjectivity, positivity and negativity, were randomly collected, converted to .wav files and used for ABSA.

B. Speech Analysis Model
Emotion in speech is treated as a kind of sentiment that expresses an individual's feeling, such as happy, sad, fear, disgust or angry. Sentiment in text differs from emotion in speech, and speech analysis in the form of automatic speech recognition (ASR) is needed to extract sentiment from audio data. Many ASR models and online speech-to-text conversion APIs exist.
To enhance the performance of traditional ASR models and to overcome the limitations of online speech-to-text APIs, the proposed model uses a deep learning framework for analyzing the acoustic features and a bi-gram language model to map the word utterances. As sentiment analysis is independent of speech features such as pitch, intensity and volume, spectrogram features are first extracted from the input WAV audio file and passed through a Convolutional Neural Network (CNN) and a Bi-directional Recurrent Neural Network (Bi-RNN). Trained on these acoustic features, the deep neural networks produce a character sequence of the spoken utterances. Using the chain rule, a bi-gram language model retrieves the most frequently occurring character sequences, which are then mapped to word utterances. Fig. 3 shows the text transcripts extracted from the proposed speech analysis model.
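The quality of such transcripts is reported in the abstract in terms of WER and CER. As a minimal sketch of how these two metrics are conventionally computed, both can be expressed as the edit distance between the reference and the hypothesis divided by the reference length, over words for WER and over characters for CER. The function names and the example sentences are illustrative assumptions, not part of the proposed model.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One dropped word out of five reference words.
print(wer("the battery life is good", "the battery is good"))
```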

1) Creation of Spectrogram
Input: Audio signal
Output: One-time frame vector
Step 1: Divide the input audio signal into time frames of frame size 1024, with a sampling rate of 16 kHz.
Step 2: Split each frame signal into its frequency components, with a hop size of 512 samples between successive Fast Fourier Transform windows.
Step 3: Represent each time frame as a one-time frame vector, i.e., a vector of amplitudes at each frequency.
Step 4: Line up the one-time frame vectors in time-series order to obtain the visual representation of the input audio signal as a spectrogram.
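The four steps above can be sketched as follows. This is a minimal, standard-library illustration under stated assumptions: a naive discrete Fourier transform stands in for the FFT, no window function is applied because the steps do not mention one, and the function names are hypothetical. A practical implementation would normally use an FFT routine such as `numpy.fft.rfft`.

```python
import cmath
import math


def frame_signal(samples, frame_size=1024, hop_size=512):
    """Step 1: slice the signal into frames that start every hop_size samples."""
    return [samples[start:start + frame_size]
            for start in range(0, len(samples) - frame_size + 1, hop_size)]


def magnitude_spectrum(frame):
    """Steps 2-3: amplitude at each non-negative frequency bin (naive DFT)."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]


def spectrogram(samples, frame_size=1024, hop_size=512):
    """Step 4: one amplitude vector per frame, lined up in time order."""
    return [magnitude_spectrum(frame)
            for frame in frame_signal(samples, frame_size, hop_size)]


# Tiny example: a pure tone falling exactly on bin 2 of an 8-sample frame.
tone = [math.cos(2 * math.pi * 2 * t / 8) for t in range(16)]
spec = spectrogram(tone, frame_size=8, hop_size=4)
print(len(spec), len(spec[0]))
```

With the paper's settings (frame size 1024, hop size 512, 16 kHz sampling), each frame covers 64 ms of audio and consecutive frames overlap by half.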
2) Language model: The word sequences obtained from the acoustic model above need to be refined, because the acoustic model estimates the probability of character utterances based on sound and there are cases where two words sound the same. Applying language modelling after acoustic modelling helps to rectify this problem and increases the likelihood score of a particular sequence of word utterances. Equation (1) formulates the representation of a sequence of word utterances using an N-gram language model. The proposed speech analysis model uses a bi-gram language model with the chain rule to find the probability of the respective sequence.
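To illustrate how a bi-gram language model can choose between transcripts that sound alike, the sketch below scores each candidate with the chain rule under the bigram approximation, in which the probability of each word is conditioned only on the preceding word. The add-alpha smoothing, the tiny training corpus and all function names are assumptions introduced for the example; the paper does not specify these details.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count bigrams and their left contexts; '<s>' marks the sentence start."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split()
        vocab.update(tokens)
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, vocab


def log_prob(sentence, model, alpha=1.0):
    """Sum of log P(word | previous word) with add-alpha smoothing (assumed)."""
    contexts, bigrams, vocab = model
    tokens = ["<s>"] + sentence.lower().split()
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        numerator = bigrams[(prev, word)] + alpha
        denominator = contexts[prev] + alpha * len(vocab)
        total += math.log(numerator / denominator)
    return total


def rescore(candidates, model):
    """Return the candidate transcript with the highest bigram likelihood."""
    return max(candidates, key=lambda c: log_prob(c, model))


# Hypothetical training text and two homophone candidates from an acoustic model.
model = train_bigram(["the camera is great",
                      "the battery is great",
                      "the display is good"])
print(rescore(["the camera is grate", "the camera is great"], model))
```

The two candidates differ only in the last word; "is great" was seen in the training text while "is grate" was not, so the language model prefers the first spelling.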
Consider a sequence of word utterances (2). Its probability is represented by the bigram approximation and, more generally, by the N-gram approximation.
3) Text analysis model: To improve the efficiency of aspect-based sentiment analysis, three variants of text feature extraction techniques are applied in this paper for aspect-level feature extraction. Sentiment is analyzed with respect to the extracted aspect at decision level. The way the decision-level aspects are extracted and analyzed for sentiment is presented in detail below.
a) Lexicon based semi-supervised pattern generation technique (Model 1):
The opinion words representing the sentiment, called aspect terms, are extracted by generating patterns from bi-grams and tri-grams. As this is a lexicon-based approach, similar aspect words related to the input dataset are assigned statistically. In addition to the statistically assigned aspect words, the hypernyms of the aspect terms extracted from the patterns are generated using WordNet. The final aspect terms are obtained by considering the statistically assigned words and their similar words generated from the patterns. The final aspect terms are then mapped to the generated patterns to extract the sentiment terms. The sentiment score is computed on the trained aspect and sentiment terms by importing the "testimonial.sentiment.polarity" library.
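The pattern generation step of Model 1 can be sketched as follows. The example keeps every bi-gram and tri-gram that contains a word from a seed aspect lexicon and then pairs aspect words with opinion words that occur in the same pattern. Both lexicons, the example sentence and the function names are hypothetical; hypernym expansion with WordNet and the polarity scoring step are deliberately left out.

```python
def ngrams(tokens, n):
    """All contiguous n-token windows of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def aspect_patterns(sentence, aspect_lexicon):
    """Bi-gram and tri-gram patterns that contain at least one aspect word."""
    tokens = sentence.lower().split()
    patterns = ngrams(tokens, 2) + ngrams(tokens, 3)
    return [p for p in patterns if any(tok in aspect_lexicon for tok in p)]


def sentiment_terms(patterns, aspect_lexicon, opinion_lexicon):
    """Pair each aspect word with opinion words found in the same pattern."""
    pairs = set()
    for pattern in patterns:
        aspects = [tok for tok in pattern if tok in aspect_lexicon]
        opinions = [tok for tok in pattern if tok in opinion_lexicon]
        pairs.update((a, o) for a in aspects for o in opinions)
    return sorted(pairs)


# Hypothetical seed lexicons for a mobile phone review.
ASPECTS = {"battery", "camera", "display"}
OPINIONS = {"excellent", "average", "poor"}

review = "the battery is excellent but the camera feels average"
patterns = aspect_patterns(review, ASPECTS)
print(sentiment_terms(patterns, ASPECTS, OPINIONS))
```

Limiting the window to three tokens is what keeps "excellent" from being attached to "camera" in this sentence; longer patterns would need an additional rule to resolve such cases.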

b) Topic Modelling Technique LDA (Latent Dirichlet Allocation) (Model 2):
Text features are extracted from the pre-processed input using LDA. Python libraries such as gensim and ldamallet are used for extracting the dominant topics (aspect terms). Extraction is done by calculating the term-document frequency on pre-processed, lemmatized data, considering the NOUN, ADJ, ADV and VERB tags from the n-gram data. The dominant topic words that qualify the sentiment with respect to the input are extracted and treated as aspects. Examples of aspect terms in the context of electronic gadgets are battery, display and power.
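As a minimal sketch of the term-document frequency step, the example below keeps only lemmas tagged NOUN, ADJ, ADV or VERB and builds, for each document, a list of (term id, count) pairs, which is the usual bag-of-words shape handed to a topic model. The input is assumed to be already lemmatized and POS-tagged by an upstream tool; the function names and the sample data are illustrative only.

```python
from collections import Counter

ALLOWED_POS = {"NOUN", "ADJ", "ADV", "VERB"}


def filter_lemmas(tagged_doc):
    """Keep lemmas whose POS tag is one of the four content-word classes."""
    return [lemma for lemma, pos in tagged_doc if pos in ALLOWED_POS]


def term_document_frequency(tagged_docs):
    """Return (vocabulary, corpus); corpus holds sorted (term_id, count) pairs."""
    vocab = {}
    corpus = []
    for doc in tagged_docs:
        counts = Counter(filter_lemmas(doc))
        for lemma in counts:
            vocab.setdefault(lemma, len(vocab))  # assign ids in first-seen order
        corpus.append(sorted((vocab[lemma], n) for lemma, n in counts.items()))
    return vocab, corpus


# Hypothetical pre-tagged, lemmatized review sentences.
docs = [
    [("battery", "NOUN"), ("be", "AUX"), ("good", "ADJ"), ("battery", "NOUN")],
    [("display", "NOUN"), ("look", "VERB"), ("good", "ADJ")],
]
vocab, corpus = term_document_frequency(docs)
print(vocab)
print(corpus)
```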
Probability-based topic extraction using LDA is formulated in (5).
An aspect category groups a list of aspect terms under a relevant category. For example, aspect terms such as battery, display and power can be placed under the category mobiles, and similarly taste, flavor and ambience can be placed under the category restaurant. The aspect category is detected by feeding the extracted dominant (aspect) terms into a Convolutional Neural Network (CNN). Using polarity as a measure, the respective sentiment terms are extracted for the extracted aspect category terms.

c) Efficient Named Entity Recognition (E-NER) (Model 3):
The aspect terms in this approach are extracted by a dependency parsing mechanism that uses POS tagging, an NLP technique.
A convolutional neural network is used to map the extracted aspects to relevant aspect categories. A word embedding mechanism (6), (7) is used to present the input aspect terms as vectors to the CNN (8). Filtering of aspect-related sentiment words and aspect sentiment classification follow the same methodology as aspect category detection, both for aspect term polarity extraction and for sentiment classification.

4) Hybrid level fusion:
In the text analysis model, decision-level fusion is performed across the three variants of aspect extraction techniques to analyze the sentiment. The extracted aspects, together with their sentiment, then undergo a feature-level fusion to enhance performance. Decision-level fusion followed by feature-level fusion helps to overcome dimensionality problems and filters the weighted aspects, which improves performance. In the hybrid-level fusion phase, a Normalized Weighted Aspect Extraction (NWAE) mechanism (10) is employed, in which the aspects extracted by each technique are filtered based on their weights. A decision rule is applied to classify the polarity class of the derived weighted aspects (11).
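Since the exact forms of (10) and (11) are not reproduced here, the sketch below shows only one plausible reading of weight-based aspect filtering followed by a polarity decision rule. Everything in it is an assumption made for illustration: the per-model normalization so that weights sum to one, the averaging across models, the threshold `min_weight`, the sign-based rule and all names and numbers. It should not be read as the NWAE mechanism itself.

```python
def normalize(weights):
    """Scale one model's aspect weights so that they sum to 1."""
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()} if total else {}


def fuse_aspects(model_outputs, min_weight=0.2):
    """Fuse per-model outputs of the form {aspect: (raw_weight, polarity_score)}.

    Returns {aspect: (mean_normalized_weight, weighted_polarity)} for aspects
    whose mean normalized weight reaches min_weight (hypothetical threshold).
    """
    fused = {}
    for output in model_outputs:
        norm = normalize({a: w for a, (w, _) in output.items()})
        for aspect, (_, score) in output.items():
            entry = fused.setdefault(aspect, [0.0, 0.0])
            entry[0] += norm.get(aspect, 0.0) / len(model_outputs)
            entry[1] += norm.get(aspect, 0.0) * score
    return {a: (w, s) for a, (w, s) in fused.items() if w >= min_weight}


def decide(score):
    """Assumed decision rule: the sign of the weighted polarity gives the class."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Hypothetical outputs of the three text analysis models.
model_1 = {"battery": (3.0, 0.8), "camera": (1.0, -0.4)}
model_2 = {"battery": (1.0, 0.6), "camera": (1.0, -0.2), "box": (2.0, 0.0)}
model_3 = {"battery": (2.0, 0.5), "camera": (2.0, -0.6)}

fused = fuse_aspects([model_1, model_2, model_3])
print({aspect: decide(score) for aspect, (_, score) in fused.items()})
```

In this toy run the aspect "box", proposed by only one model, falls below the threshold and is filtered out, while "battery" and "camera" are retained with opposite polarities.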
The experimental results obtained from the three text analysis models, discussed in Section 5, are compared. Accuracy, precision, recall and F-score are the metrics used to measure the performance of the proposed model. Fig. 4 lists the aspects extracted from the patterns generated by means of bi-grams and tri-grams in text analysis model 1. For effective aspect extraction in model 1, hypernyms are generated for the statistically assigned aspects; the extracted hypernyms are shown in Fig. 6. Fig. 7 shows the performance comparison of model 1 when validated using machine learning algorithms in terms of accuracy, precision, recall and F1-score. The analysis shows that the decision tree algorithm achieved the best accuracy, 73%, among the compared machine learning algorithms.
Table II and Fig. 11 show the performance comparison of model 2 when validated using machine learning algorithms in terms of accuracy, precision, recall and F1-score. The analysis shows that the Random Forest algorithm achieved the best accuracy, 89%, among the compared machine learning algorithms.
Fig. 12 presents the way the aspects are extracted using POS tagging by the dependency parsing mechanism, and Fig. 13 presents the aspect-based sentiment analysis on the derived aspects using model 3. Table III and Fig. 14 show the performance comparison of model 3 when validated using machine learning algorithms in terms of accuracy, precision, recall and F1-score. The analysis shows that the Random Forest algorithm achieved the best accuracy, 95%, among the compared machine learning algorithms.
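For reference, the four metrics used in the comparisons above can be computed from predicted and true polarity labels as in the following minimal sketch. The labels in the example are invented for illustration and are not results from the paper; the function name and the choice of "positive" as the target class are assumptions.

```python
def classification_metrics(y_true, y_pred, positive="positive"):
    """Accuracy, precision, recall and F1-score for one target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Invented labels for five aspect-level predictions.
truth = ["positive", "positive", "negative", "negative", "positive"]
preds = ["positive", "negative", "negative", "positive", "positive"]
print(classification_metrics(truth, preds))
```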