Modeling a Functional Engine for the Opinion Mining as a Service using Compounded Score Computation and Machine Learning

Abstract: The ever-growing use of digital platforms across application domains, primarily collaborative platforms such as e-commerce, e-learning, social media, and blogging, produces a large corpus of unstructured text data. Many strategic solutions require accurate and fast classification of the hidden patterns in an opinion text corpus. On-premise applications face various real-time feasibility constraints; therefore, offering Opinion Mining as a Service on cloud platforms is a new research domain. This paper proposes a design framework for a classification engine for opinion mining. The first method is a score-based computation using a customized Vader algorithm. The second method, aimed at scalability, is a machine learning model that supports the classification of a large corpus of unstructured text data. The models are validated on various complex, unstructured text datasets using the performance metrics of cumulative score, learning rate, loss function, and specificity analysis. These metrics indicate the models' stability and scalability, as well as their accuracy and robustness across different datasets.


I. INTRODUCTION
The evolution of Web 2.0 and the Cloud has brought a complete change in the development and production of digital systems [1]. Global resource constraints and economic liberalization have led to the realization of a collaborative business model. A highly distributed market of production, distribution, and consumption requires a technology ecosystem with high availability and scalability. Cloud computing service offerings cater to these demands [2]. The competitive environment of cloud service providers (CSPs) and enterprises demands various services beyond the Cloud's traditional offerings. The evolution of word representation into vectors makes it easier to process a word corpus and has led to a technology called text analytics. Various open platforms let users express feedback or textual opinions in many contexts, such as brand building, marketing, or product campaigns. The text corpus contains the hidden treasure of opinion. It is not economically feasible for an individual organization to set up dynamically evolving opinion mining methods as on-premise computing infrastructure. Therefore, the CSPs are in the process of building an ecosystem to offer Opinion-Mining as a Service (OMaaS). This paper proposes an architectural model for the design of an OMaaS offering from the CSPs. The basic workflow diagram of the OMaaS is shown in Fig. 1.
The OMaaS framework provisions a system that acquires the text corpus (Tc) of the cloud users (CS) through a dedicated channel, via the dashboard of the virtual layer (VL), into the cloud data store. It handles the large corpus, which is then synchronized to the cloud text analytics engine (TAE), where the opinion mining algorithm is executed. Finally, each CSi ∈ {CS} receives a visual or statistical representation of the opinion mined from its respective Tc. The overall success of such a model largely depends on how effectively and scalably the opinion is mined in real time.
Many ubiquitous applications have been conceptualized in which text analytics plays a crucial role. Such applications include: i) dynamic info-systems on vehicle dashboards, ii) strategic business decision tools, iii) topic modeling, iv) summarization, v) patent data matching, vi) health care decision support systems, vii) forensic tools, viii) feedback-based decision making (sentiment analysis), ix) political campaigns, x) historical literature analysis, and xi) visual search. Section II reviews research in the field of text analytics in different contexts. Section III describes the diverse datasets considered for the model validation. Sections IV and V present the two respective models, cumulative score computation and machine learning-based classification, as the proposed opinion mining engine to be synchronized with the OMaaS. Finally, Section VI discusses the results and analysis, followed by a conclusion in Section VII.
150 | Page www.ijacsa.thesai.org

II. REVIEW OF LITERATURE
The use of text data from social media such as Facebook and other feeds, together with surveillance camera images, is reported by Neuhold et al. (2018), where text analytics is exploited to display road conditions on a dashboard [3]. Big data and business process management complement each other if the generated text is adequately analyzed [4]. A tree-based visual representation of text is in practice, but this method is not scalable [5]. A common challenge for researchers is handling a large query result. The bag of words, along with natural language processing and visual analytics, is studied in the work of Benito et al. (2019) [6]. Basole et al. (2019) reviewed topic modeling on extensive text descriptions used in a business domain based on text analytics [7]. Summarization, or building an abstraction of a large document, helps to grasp knowledge quickly; Han et al. (2020) use text analytics in their work [8]. Reghupathi et al. (2018) examined the use of text analytics on patent data corpora using concepts such as word count and co-occurrence together with a machine learning model [9]. Health care is another domain that produces a vast amount of unstructured text data; Kumar et al. (2019) performed text analytics for a decision support system [10].
The use of sequence-to-sequence learning in text analytics is becoming popular for building many text analytics-based applications with higher accuracy and lower training time (Keneshloo et al., 2019) [11]. Many giants like Facebook and Google use text analytics for their respective goals. Similar benefits can be achieved in the further education, banking, and marketing sectors [12]. The forensic sector benefits when complex text data from various communication sources are analyzed; Koven et al. (2019) devise a tool that uses text analytics on an email data corpus [13]. Topic modeling algorithms are gaining popularity; El-Assady et al. (2018) propose a decision-making technique based on relevance feedback using text analytics [14]. A study of sentiment analysis in crowdfunding is presented by Wang et al. (2017) [15]. Media is another domain where a large corpus of text data is generated. Text analytics helps with the topic description of an event, as in the work of Lu et al. (2018) [16]. Text analytics has also shown its benefits in political election campaigns; Gad et al. (2015) propose an analytics tool for the visual representation of social message trends [17].
The analysis of semantics together with content plays a vital role in content analysis [18]. Ojo et al. (2019) present patient sentiment analysis using textual data [19]. Karam et al. (2016) propose a new hardware design that supports an ecosystem of processor and memory for text analytics [20]. Vatrapu et al. (2016) explore set theory-based visualization to complement text analysis [21]. Sedimentation-based visualization and coordinated structure in text analysis have been studied by Liu et al. (2016) [22] and Sun et al. (2016) [23], respectively. Regional history can also be analyzed through text data; such a study for Roman history is carried out in the work of Cho et al. (2016) [24]. Various web-based visualization tools and the fundamentals of visual text analytics are described in the work of Liu et al. (2019), who analyze a large corpus of published papers using concurrence relationships [25]. Basic features such as parts of speech, text color, and font size make the corpus complex; Strobelt et al. (2016) conducted an extensive survey of different highlighting and visual search techniques [26]. In most text analytic methods, structuring each word with its meaning is crucial for arriving at an efficient qualitative and quantitative representation that achieves human-like accuracy [27].

III. DATASET DESCRIPTION
The OMaaS framework proposes two core classification models. The following datasets are used to evaluate the algorithms of the text analytics engine for opinion classification: i) partially complex text with emojis, ii) fastText from Facebook's AI Research (FAIR) lab [28], and iii) opinion data from the University of Illinois, Chicago [29].

IV. MODELLING A COMPLEX CONTENT: HYBRID OPINION USING TEXT AND SYMBOL USING CUSTOMIZED VADER ALGORITHM

A. Vector of Text Token (TTo)
The simulation environment is controlled by initializing a Mersenne Twister generator with seed 0 [30]. The system deals with complex heterogeneous content built from text tokens and symbols, Cf = {T ∪ S}, where T = text tokens and S = symbols, because people now commonly express their statements or opinions partly as text and complement them with symbols (shown in Table I). At start-up, the file Fn containing the complex inputs of T and S is assigned to Cf during initialization. Cf is then transformed into a tabular format, tCf, for ease of computation, with ∀ content ∈ Cf cast to the type 'character' (Ch). To generate the vector of text tokens (TTo), ∀ text data (Td) ∈ tCf passes through the document tokenization function f2(), as shown in Fig. 2.
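The tokenization step for mixed text and symbol content can be illustrated with a minimal Python sketch. It is not the authors' f2() function: the regular expression, the function names `tokenize` and `build_token_vectors`, and the sample statements are assumptions made for illustration only. Python's standard `random` module is based on the Mersenne Twister generator, so seeding it with 0 mirrors the controlled simulation setting.

```python
import random
import re

# Python's random module uses the Mersenne Twister generator;
# seeding it makes any later random step reproducible.
random.seed(0)

# Hypothetical token pattern (an assumption, not the paper's rule set):
#   1) words, optionally with an apostrophe (e.g. "don't")
#   2) simple ASCII emoticons such as :) ;-( =D
#   3) any other single non-space character (emoji, punctuation, digits)
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?|[:;=][-']?[)(DPp]|\S")


def tokenize(statement: str) -> list[str]:
    """Split one statement into text tokens (T) and symbol tokens (S)."""
    return TOKEN_PATTERN.findall(statement)


def build_token_vectors(statements: list[str]) -> list[list[str]]:
    """Return one token list per statement, i.e. a simple stand-in for TTo."""
    return [tokenize(s) for s in statements]


if __name__ == "__main__":
    for tokens in build_token_vectors(["Great phone :) 👍", "I don't like it :("]):
        print(tokens)
```

Keeping emoticons and emoji as separate tokens matters here because a lexicon-based scorer can only rate a symbol if it survives tokenization as a unit.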

B. Score Weightage
The score weightage (Sw) for ∀ tokenized document (tD) ∈ TTo is computed using the popular "valence aware dictionary" for sentiment reasoning, "Vader", in a customized form, the customized Vader (cV) [31]. Transforming the large corpus of unstructured textual data and obtaining a quantitative ratio are the engine's main goals. In future applications, artificial intelligence (AI) based services of this kind are envisioned as 'OaaS' models that form an intrinsic part of enterprise CRM applications. The cV is a rule-oriented lexicon model (ROLM) based on the set {sL, gR, sYC}, where sL = 'sentiment lexicon', gR = 'grammatical rules', and sYC = 'convention', such that sYC: {s.P, s.I}, where s.P = polarity and s.I = intensity. The cV constructs a 'wordlist' with a wide-ranging list of feature vectors (Fv) such that Fv = {Word (W), Phrases (P), Emo-icons (Ei), Acronyms (Ac)}, each rated for s.P and s.I on a scale from a negative score to a positive score, and the average is assigned as Sw. Vader's customization involves handlers for the other parts of speech, characterization, and punctuation. The cV takes the entire tCf and the associated Fv and operates on {s.P, s.I} as per the specific rule sets. Finally, the summation S of all the Fv scores is normalized by scaling it into the range [-1, 1] using equation (1) for the compounded score, Sc:

$$S_c = \frac{S}{\sqrt{S^{2} + \alpha}} \qquad (1)$$
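The normalization into the range [-1, 1] can be sketched in Python as follows. The sketch assumes the rule used by the reference VADER implementation, which divides the summed valence x by sqrt(x^2 + alpha) with alpha = 15; whether the customized Vader keeps the same constant is not stated in the text. The toy lexicon, its ratings, and the function names are purely illustrative.

```python
import math

# Illustrative valence ratings on a negative-to-positive scale.
# These entries are an assumption for the example, not the real lexicon.
TOY_LEXICON = {"good": 1.9, "great": 3.1, "bad": -2.5, ":)": 2.0, ":(": -2.2}


def summed_valence(tokens, lexicon=TOY_LEXICON):
    """Sum the ratings of all tokens found in the lexicon (unknown tokens score 0)."""
    return sum(lexicon.get(token.lower(), 0.0) for token in tokens)


def compound_score(total, alpha=15.0):
    """Scale a summed valence into the open interval (-1, 1).

    alpha = 15 is the default of the reference VADER implementation (assumed here).
    """
    return total / math.sqrt(total * total + alpha)


if __name__ == "__main__":
    tokens = ["Good", "phone", ":)"]
    print(compound_score(summed_valence(tokens)))
```

Because the denominator is always larger than the magnitude of the numerator, the result never reaches -1 or 1, and a larger alpha compresses scores toward zero.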

[Algorithm 2: compound score computation; only the closing steps of the listing survive: Update Sc; End.]
The value of the normalization constant α in equation (1) approximates the maximum expected value of the score S. The procedure is given in Algorithm 2 and is applied to two distinct datasets to measure the compound scores and the computation time. The results are described in Section VI, results and discussion.

A. Auto Label Annotation for Data Model
The Facebook AI Research (FAIR) group provides a library for creating vector representations of words. This library, popularly known as 'fastText', is used to learn text classification with different machine learning models (MLM). The system model takes the dataset provided by the University of Illinois, Chicago, namely the 'Opinion-Lexicon (OL)', which contains a list of 6789 words in the two classes {Positive (Pw), Negative (Nw)} as text tokens (Tt) [32], sorted alphabetically from a to z. Further, a pre-trained model, namely 'Word-Embedding', provides an object named Dictionary (Dc) containing 999,994 word tokens as strings [33]. The explicit function f1() takes OL as an input argument, checks the correctness of the files, converts ∀ tokens (Tt) ∈ {Pw, Nw} to strings, and concatenates Pw ∪ Nw to generate a list W of size m x n, where n = 1. Since W ∈ {String datatype}, its entries are characterized as 'Not a Number (NaN)'; a function f2() converts a list of NaN of size (m x 1) into a list of categorical variables to store the label annotation (La) for ∀ Tt ∈ W. The corresponding elements of W are then mapped to La as categorical label (CLa) annotations. The pair (W, CLa) provides the labelled annotated data (Dla) used by the learning models.
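The auto label annotation can be illustrated with a small Python sketch. It only mirrors the idea of concatenating the positive and negative word lists and attaching a categorical label to each word; the function name `annotate_labels` and the four sample words are hypothetical, and reading the actual lexicon files is left out.

```python
def annotate_labels(positive_words, negative_words):
    """Build labelled data as (word, label) pairs from two word lists.

    Every token is converted to a string, the two lists are concatenated
    into one word list W, and a categorical label is attached per word.
    """
    positive = [str(word) for word in positive_words]
    negative = [str(word) for word in negative_words]
    words = positive + negative
    labels = ["Positive"] * len(positive) + ["Negative"] * len(negative)
    return list(zip(words, labels))


if __name__ == "__main__":
    # Hypothetical sample entries; the real lexicon holds several thousand words.
    labelled = annotate_labels(["good", "great"], ["bad", "awful"])
    for word, label in labelled:
        print(word, label)
```

Storing the pairs together keeps each word aligned with its label when entries are later filtered or shuffled.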

B. Token-based Filtering
The token-based filtering takes the labelled annotated data table (Dla) from the previous procedure, Auto Label Annotation for Data Model. The explicit function f3() takes Dla as an input argument and returns the tokenized documents (Dtoc), a set of {T1, T2, Tk, Tn}, where, depending on the text dataset, T1 to Tn could be ∈ {#, , www.address.com,}. The function f3() removes the stop words (SW) and also performs stemming [34] or lemmatization [35]. An additional argument passed to f3() provides the bag of words (BoW), which can be extended to multi-lingual analysis. All Unicode punctuation and symbols are eliminated by passing Dtoc through a function designed to remove them. The English language has approximately 225 stop words; these are treated as noise and eliminated from the updated Dtoc by a dedicated function before further text analytics processing. Finally, the noise-processed Dtoc is transformed into lower case for further processing.
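A minimal Python sketch of this filtering stage is given below. It is an illustration under stated assumptions rather than the f3() function itself: the stop-word set is a tiny hypothetical subset, stemming and lemmatization are omitted, and lowercasing is applied before stop-word matching so that the match is case-insensitive.

```python
import unicodedata

# Tiny illustrative subset; a full English stop-word list is much longer.
STOP_WORDS = {"a", "an", "the", "is", "it", "this", "and", "of", "to"}


def strip_punctuation(token):
    """Drop every character in a Unicode punctuation (P*) or symbol (S*) category."""
    return "".join(
        ch for ch in token if unicodedata.category(ch)[0] not in ("P", "S")
    )


def filter_tokens(tokens, stop_words=STOP_WORDS):
    """Remove punctuation, lowercase, and discard stop words and empty tokens."""
    cleaned = []
    for token in tokens:
        token = strip_punctuation(token).lower()
        if token and token not in stop_words:
            cleaned.append(token)
    return cleaned


if __name__ == "__main__":
    print(filter_tokens(["This", "is", "a", "GREAT", "phone", "!", "#deal"]))
```

Note that this filter also removes emoji and emoticons, which suits the word-embedding model but would discard information the score-based model relies on.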

C. SMO based Support Vector Machine Classifier
The SMO-based support vector machine classifier (SVMC) creates a dictionary (D) object from the pretrained fastText [28] word model (fTW). The fTW is training model-1, which has already taken time T1. When a new dataset needs to be trained and a transfer learning model [36] is used, the new training model-2 takes time T2, which is less than the total time T = T1 + T2 (T2 < T), as in Fig. 3. Algorithm 4 describes the steps for building the SMO-based support vector machine classifier. Algorithm 3, Label Annotation Data for Learning Model, provides Dla = {W, CLa}; the indexes of all the words in Dla(W) that do not belong to D are recorded as Idx. The total data size Ds is the total word count numW. Ds is randomly partitioned for cross-validation to define the partitions for the statistical model, {D-Train, D-Test}. Mapping words to vectors is an essential technique in NLP: it uses an ANN to learn from a large corpus of text data, so that every word is represented as a list of numbers, and a simple mathematical function on these vectors reflects semantic similarity. The mapping is applied as D-train ← Word2vector[D-train], and the training input to the SVM classifier is obtained as [D-train ∪ CLa] → Td. With Td, the support vector machine (SVM) classifier for one-class and binary classification is trained to obtain the text classifier model as tModel ← F(Td). Fig. 4 and Fig. 5 show the status of the hyperparameter optimization results.
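The training path from labelled words to a classifier can be sketched in pure Python. The sketch is a simplified teaching variant of SMO with a linear kernel and a random choice of the second multiplier; it is not Algorithm 4 and not a production solver. The two-dimensional `EMBEDDING` table stands in for the pretrained fastText dictionary and is entirely hypothetical, as are the function names. The train/test partition is omitted for brevity.

```python
import random

# Hypothetical 2-D embedding standing in for the pretrained word dictionary.
EMBEDDING = {
    "good": (2.0, 2.0), "great": (3.0, 3.0), "happy": (2.0, 3.0),
    "bad": (-2.0, -2.0), "awful": (-3.0, -3.0), "sad": (-2.0, -3.0),
}


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def vectorize(labelled_words):
    """Drop out-of-vocabulary words; map words to vectors and labels to +1/-1."""
    kept = [(w, c) for w, c in labelled_words if w in EMBEDDING]
    X = [EMBEDDING[w] for w, _ in kept]
    y = [1 if c == "Positive" else -1 for _, c in kept]
    return X, y


def train_svm_smo(X, y, C=1.0, tol=1e-3, max_passes=20, max_sweeps=1000, seed=0):
    """Simplified SMO for a linear soft-margin SVM; returns weights w and bias b."""
    rng = random.Random(seed)
    n = len(X)
    K = [[dot(X[i], X[j]) for j in range(n)] for i in range(n)]
    alpha, b, passes = [0.0] * n, 0.0, 0

    def error(i):  # prediction error of the current model on sample i
        return sum(alpha[k] * y[k] * K[k][i] for k in range(n)) + b - y[i]

    for _ in range(max_sweeps):
        if passes >= max_passes:
            break
        changed = 0
        for i in range(n):
            e_i = error(i)
            # Skip samples that already satisfy the KKT conditions within tol.
            if not ((y[i] * e_i < -tol and alpha[i] < C)
                    or (y[i] * e_i > tol and alpha[i] > 0)):
                continue
            j = rng.choice([k for k in range(n) if k != i])
            e_j = error(j)
            a_i, a_j = alpha[i], alpha[j]
            if y[i] != y[j]:
                low, high = max(0.0, a_j - a_i), min(C, C + a_j - a_i)
            else:
                low, high = max(0.0, a_i + a_j - C), min(C, a_i + a_j)
            eta = 2 * K[i][j] - K[i][i] - K[j][j]
            if low == high or eta >= 0:
                continue
            new_j = min(high, max(low, a_j - y[j] * (e_i - e_j) / eta))
            if abs(new_j - a_j) < 1e-5:
                continue
            new_i = a_i + y[i] * y[j] * (a_j - new_j)
            b1 = b - e_i - y[i] * (new_i - a_i) * K[i][i] - y[j] * (new_j - a_j) * K[i][j]
            b2 = b - e_j - y[i] * (new_i - a_i) * K[i][j] - y[j] * (new_j - a_j) * K[j][j]
            alpha[i], alpha[j] = new_i, new_j
            b = b1 if 0 < new_i < C else b2 if 0 < new_j < C else (b1 + b2) / 2
            changed += 1
        passes = passes + 1 if changed == 0 else 0
    w = [sum(alpha[k] * y[k] * X[k][d] for k in range(n)) for d in range(len(X[0]))]
    return w, b


def predict(w, b, x):
    return 1 if dot(w, x) + b >= 0 else -1


if __name__ == "__main__":
    data = [("good", "Positive"), ("great", "Positive"), ("happy", "Positive"),
            ("bad", "Negative"), ("awful", "Negative"), ("sad", "Negative")]
    X, y = vectorize(data)
    w, b = train_svm_smo(X, y)
    print([predict(w, b, x) for x in X])
```

Each SMO step optimizes two multipliers at a time while keeping the box constraint 0 ≤ alpha ≤ C, which is what makes the soft margin of L1 type; a practical solver would replace the random partner choice with a heuristic and cache the errors.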
The model t is trained on the low-dimensional (Low-D) predictors by mapping the independent variables as predictors using a kernel() function. Training uses the sequential minimal optimizer (SMO) with an iterative single-data kernel function or L1 soft-margin minimization, whose adjustments at every cycle are shown in Fig. 4 and Fig. 5. The confusion matrix for the test performance on the different datasets is described in the results section.

Fig. 7 represents the normalized compound scores for TTo with the number of tokens N = 60 and 1,600,000, respectively.
The processing time, including tokenization, score computation, and visual presentation, is tabulated in Table II. For the dataset with 50 statements, the average time taken per statement is 3.8 seconds. In contrast, on the complex text corpus of 1,600,000 opinions, the average time taken is 991 sec. Consistency is therefore not maintained, because the method is entirely rule-based and the complexity of the text corpus also varies. For the scalability test, the same dataset of 50 statements is replicated 50 times; the processing time is shown in Table III and its variance in Fig. 8.

The SMO-based SVM model provides the following confusion matrix on the test data of the Opinion Data from the University of Illinois, Chicago [29]. The confusion matrix between the predicted class and the true class is given in Fig. 9.
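How a binary confusion matrix leads to metrics such as accuracy and specificity can be sketched in Python. The labels in the usage example are made up for illustration and are not the paper's results; the function names are hypothetical.

```python
def confusion_counts(true_labels, predicted_labels, positive="Positive"):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == positive:
            if guess == positive:
                tp += 1
            else:
                fn += 1
        else:
            if guess == positive:
                fp += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def ratio(numerator, denominator):
    """Safe division that returns 0.0 for an empty denominator."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Derive accuracy, sensitivity (recall), and specificity from the counts."""
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }


if __name__ == "__main__":
    # Made-up labels for illustration only.
    truth = ["Positive"] * 3 + ["Negative"] * 5
    guess = ["Positive", "Positive", "Negative",
             "Negative", "Negative", "Negative", "Positive", "Negative"]
    print(binary_metrics(*confusion_counts(truth, guess)))
```

Specificity is the share of truly negative opinions that the classifier labels as negative, so it complements sensitivity when the two classes are unbalanced.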

VII. CONCLUSION
The increasing corpus of text data brings challenges for data storage and for the computational effort of text analytics. This paper proposes a framework for offering the opinion mining design process as a cloud service. Subscribers can avail themselves of fast and cost-effective services for opinion analysis on their text corpus data. The paper proposes two distinct methods as the opinion analytics engine: a Vader-based score computation and an SVM-based learning model. The score-based algorithm performs well on the small dataset, whereas the learning-based model is computationally effective on the large corpus.
In future research, the proposed algorithms can be evaluated with different datasets. Further, they can be incorporated into other cloud services where opinion mining is necessary. Security parameters can also be incorporated in ongoing and future research in opinion mining.