A Novel Hybrid Sentiment Analysis Classification Approach for Mobile Applications Arabic Slang Reviews

The Arabic language suffers from a shortage of large, accessible datasets for Sentiment Analysis (SA), Machine Learning (ML), and Deep Learning (DL) applications. In this paper, we present MASR, a simple Mobile Applications Arabic Slang Reviews dataset for SA, ML, and DL applications. It comprises 2469 Egyptian mobile app reviews and aims to help app developers keep up with evolving user requirements. Our methodology consists of six phases: we collect a mobile app reviews dataset, apply preprocessing steps, and perform SA tasks. To evaluate the MASR dataset, we first apply the ML classification techniques K-Nearest Neighbors (K-NN), Support Vector Machine (SVM), Logistic Regression (LR), and Random Forest (RF), and the DL classification technique Multi-layer Perceptron Neural Network (MLP-NN). Based on the results of these classifiers, we adopted a hybrid classification approach that combines the two ML classifiers with the highest accuracy (LR, RF) and the DL classifier (MLP-NN). The findings show the adequacy of a hybrid supervised classification approach for the MASR dataset.


I. INTRODUCTION
Mobile app stores provide a remarkably rich source of information on app specifications, characteristics, and usage, and analyzing this information yields knowledge and a deeper understanding of the nature of apps. However, manually analyzing this huge amount of mobile app information is far from simple; it is expensive in terms of human effort and time [1]. There are several mobile app stores, for example Google Play, the Apple App Store, and others, that offer free and paid mobile apps [2].
Mobile apps are classified according to meaningful categories or classes. When users want to explore and find an app suited to their requirements, it is more helpful to have a predefined classification scheme by which all apps are classified [3].
Because user opinions are a significant source of data for organizations, producing accurate SA is an important problem. Most sentiments collected from Arabic resources such as social media are written in colloquial Arabic, since Modern Standard Arabic (MSA) is rarely used online [4].
Several studies have analyzed English mobile apps [5] [6] [7] [8] [9] [10]. In addition, according to the literature review, a few studies have analyzed Islamic Arabic mobile apps and Saudi governmental services mobile apps [1] [11] [12]. However, no previous study has constructed, classified, or analyzed a dataset of Egyptian Dialect Arabic (DA) mobile app reviews.
The contributions of this research can be summarized as follows: 1) Introduce MASR, a simple Mobile Applications Arabic Slang Reviews dataset of Egyptian reviews for SA, ML, and DL applications.
2) Investigate the structure and properties of the dataset, and perform tests on selected attributes for sentiment polarity classification.
3) Apply various supervised ML and DL classifiers to the MASR dataset that we gathered.
4) Adopt a hybrid supervised sentiment analysis classification approach that combines heterogeneous approaches, namely the Machine Learning (ML) classifiers Logistic Regression (LR) and Random Forest (RF) and the Deep Learning (DL) classifier Multi-layer Perceptron Neural Network (MLP-NN), to improve the performance and accuracy of models predicting MASR polarity.
5) Compare the performance of our proposed approach with various ML and DL models.
www.ijacsa.thesai.org
The rest of this paper is organized as follows. Section II reviews the literature. Section III describes the phases of our proposed hybrid classification approach methodology. Section IV presents experimental results and discussion. Section V presents the conclusion and Section VI presents future work.

II. LITERATURE REVIEW
Few efforts have been made to analyze mobile app reviews in order to handle the evolution of mobile app requirements and to inform software development. Related previous studies address many aspects of mining mobile app reviews for different sentiment analysis purposes, such as building lexicons, classifying non-functional requirements, classifying buggy apps, recognizing high-rated apps, and a hybrid system that finds the most similar lexicon word for Egyptian Arabic tweets.

1) Arabic sentiment analysis tasks:
El-Beltagy et al. [13] built a sentiment lexicon for the Egyptian dialect. Their experiments showed that the proposed method gave improved results on Twitter data even with limited resources.
Fu et al. [14] dealt with a huge user reviews dataset of about 13 million mobile app reviews from the Google Play Store. The authors proposed the WisCom framework to identify the reasons why users dislike specific mobile apps.
Gómez et al. [15] constructed a mobile app reviews dataset to develop a framework that identifies potentially buggy mobile apps by linking permission patterns with bug-related reviews.
Chen et al. [16] presented the SimApp framework for identifying similar apps using machine learning algorithms. SimApp exploits multi-modal, heterogeneous data in app stores, and the authors constructed multiple kernel functions to measure app similarity. The results show that SimApp is effective and promising for applications such as app categorization, search, and recommendation.
Tian et al. [5] investigated the main factors that distinguish high-rated apps using a random forest classifier. The results indicate that the most important factors are the number of promotional images on the app page, the app size, and the app version.
Lu et al. [17] proposed an approach to automatically classify mobile app reviews according to non-functional requirements. They collected 11,096 mobile app reviews from the Apple App Store and Google Play.
Hameed et al. [11] explored existing Islamic apps available on the Google Play Store. They addressed the lack of classification and the mis-categorization of Islamic apps, and therefore proposed a new categorization of Islamic apps based on common features such as number of downloads, app ratings, and languages. They proposed five distinct classes for Islamic apps: Zakat, Qibla/Prayer Time, Quran, Hadith, and Supplications.
Abuelenin et al. [18] proposed a hybrid system that finds the most similar word in a lexicon to increase accuracy on Egyptian Arabic text, using the cosine similarity algorithm and the Information Science Research Institute (ISRI) Arabic stemmer.
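The lexicon-lookup idea in [18] can be illustrated with a minimal Python sketch of cosine similarity. It is not the system of [18]: it represents words as character-bigram counts (an assumption), omits the ISRI stemmer, and uses a hypothetical four-entry lexicon.

```python
from collections import Counter
from math import sqrt


def char_bigrams(word: str) -> Counter:
    """Count overlapping character bigrams (an assumed word representation)."""
    return Counter(word[i:i + 2] for i in range(len(word) - 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def most_similar(word: str, lexicon: dict[str, int]) -> tuple[str, int]:
    """Return the lexicon entry closest to `word`, together with its polarity."""
    target = char_bigrams(word)
    best = max(lexicon, key=lambda entry: cosine(target, char_bigrams(entry)))
    return best, lexicon[best]


# Hypothetical mini-lexicon: entry -> polarity (+1 positive, -1 negative)
LEXICON = {"جميل": 1, "حلو": 1, "وحش": -1, "سيء": -1}

# "جميله" is an unseen colloquial spelling; it shares three bigrams with "جميل"
print(most_similar("جميله", LEXICON))
```

The unseen spelling is matched to the positive entry "جميل", so it inherits that entry's polarity.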
2) State-of-the-art hybrid models: Heikal et al. [19] proposed a hybrid model combining CNN and LSTM, applied to ASTD; its prediction performance reaches 65%.
Al-Twairesh et al. [20] proposed a hybrid model, SF + GE + ASEH, applied to SemEval; its prediction performance reaches 80.36%.
Mohammed et al. [21] proposed a hybrid LSTM + Augmented model applied to Arabic tweets; its prediction performance reaches 88.05%.
Furthermore, few previous works have proposed a hybrid SA classification model for Egyptian Dialect Arabic mobile app reviews.

III. A HYBRID SENTIMENT ANALYSIS CLASSIFICATION APPROACH FOR MOBILE APPS ARABIC SLANG REVIEWS (MASR) METHODOLOGY
The methodology of this paper builds on previous qualitative, quantitative, and systematic literature review (SLR) research [22]. It was built on previous observations after analyzing an Arabic sentiment analysis (ASA) survey, a comparative framework [23], a future relationship hypothesis, user satisfaction surveys, and case studies. The proposed methodology applies Natural Language Processing (NLP) and Data Mining (DM) tools, methods, and techniques, and it depends on the quality of the extracted features that express user opinions and their sentiment for Arabic mobile apps. Its main goal is to help developers improve new releases of mobile apps so that they meet rapidly changing requirements.
This research adopts a hybrid classification model consisting of six phases to collect, analyze, and classify sentiment in Arabic Dialect mobile app reviews on the Google Play Store, as shown in Fig. 1. Phase 1, the MASR collection phase, covers how the dataset was scraped from the Google Play Store with the Appbot scraper tool and describes the dataset characteristics. The second phase applies various preprocessing steps to the MASR dataset. The third phase performs feature extraction using Bag of Words (BOW) and TF-IDF. The fourth phase applies well-known supervised classification algorithms such as Support Vector Machine (SVM), Naïve Bayes (NB), Logistic Regression (LR), Neural Network (NN), and K-Nearest Neighbors (KNN). The fifth phase proposes hybrid classification techniques built from the classifiers that achieve the highest accuracy in the previous phase, in order to improve accuracy on MASR. The last phase evaluates and compares the classification results using recall, precision, and accuracy.
1) MASR Collection: In this research, the Egyptian DA mobile app reviews dataset was extracted with the Appbot scraper tool using the following steps:
• Choose the Google Play Store.
• Select nine categories of mobile apps, as shown in Table I.
• Focus on reviews of Egyptian mobile apps, as well as Egyptian reviews of non-Egyptian mobile apps such as Instagram (https://play.google.com/store/apps/details?id=com.instagram.android), as shown in Table I.
• Save the extracted attributes and reviews in a CSV file: app category, app name, review, rating, and review polarity, as shown in Table II.
2) MASR Properties: The MASR dataset comprises 2469 reviews: 653 positive, 756 neutral, and 1060 negative. A negative review is one with a rating of "1", "2", or "3"; a positive review is one with a rating of "3", "4", or "5"; and neutral reviews can have any rating from "1" to "5". Because these rating ranges overlap, polarity cannot be inferred from the rating alone. The MASR dataset was built from the gathered data and consists of the fundamental attributes shown in Table II.
3) MASR Distribution: The MASR dataset covers 2469 mobile app reviews contributed by various reviewers for 12 mobile apps in nine mobile app categories, such as social, lifestyle, education, maps and navigation, productivity, shopping, travel, and tools. Negative reviews make up 43% of all reviews, compared with 26% for positive reviews, and 31% of the reviews are neutral. As expected, negative reviews form the largest class. Fig. 2 presents the distribution of ratings in our extracted dataset.
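The reported percentages can be checked from the class counts. The sketch below also computes the majority-class baseline, i.e. the accuracy obtained on the full dataset by always predicting the largest class, which is a useful reference point when reading the accuracy figures in Section IV.

```python
# Class counts reported for MASR in Section III
COUNTS = {"positive": 653, "neutral": 756, "negative": 1060}

total = sum(COUNTS.values())
shares = {label: 100 * n / total for label, n in COUNTS.items()}
majority = max(shares, key=shares.get)

print(f"total reviews: {total}")
for label, pct in shares.items():
    print(f"{label}: {pct:.1f}%")
# Accuracy of always predicting the largest class on the full dataset
print(f"majority-class baseline ({majority}): {shares[majority]:.1f}%")
```

Running it prints 26.4%, 30.6%, and 42.9% for positive, neutral, and negative, which round to the 26%, 31%, and 43% stated above; always predicting "negative" would therefore reach about 42.9% accuracy on the full dataset.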

TABLE II. MASR ATTRIBUTES
Attribute | Description
Mobile App Name | Name of the selected mobile app.
Review | The reviewer's opinion, written in Egyptian Dialect (ED), which mixes MSA and DA.
Rating | A scale from 1 to 5 indicating the reviewer's level of satisfaction (used instead of the previous 1-to-10 scale).
Review Polarity | The sentiment of the review: "+1" for a positive review, "−1" for a negative review, and "0" for a neutral review.
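The attributes in Table II map naturally onto a typed record. The sketch below parses CSV rows into such records and validates the rating range and the polarity encoding. The column names (category, app_name, review, rating, polarity), the app name ExampleApp, and the two sample reviews are made up for illustration; the actual MASR file layout is not specified here.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    category: str
    app_name: str
    text: str
    rating: int    # 1 (least satisfied) .. 5 (most satisfied)
    polarity: int  # +1 positive, 0 neutral, -1 negative


def parse_reviews(csv_text: str) -> list[Review]:
    """Parse rows with hypothetical column names into validated records."""
    reviews = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        rating, polarity = int(row["rating"]), int(row["polarity"])
        if not 1 <= rating <= 5:
            raise ValueError(f"rating out of range: {rating}")
        if polarity not in (-1, 0, 1):
            raise ValueError(f"unknown polarity: {polarity}")
        reviews.append(Review(row["category"], row["app_name"], row["review"], rating, polarity))
    return reviews


# Made-up sample rows (not taken from MASR)
SAMPLE = """category,app_name,review,rating,polarity
Social,Instagram,البرنامج حلو جدا,5,+1
Shopping,ExampleApp,التطبيق بيقفل لوحده,2,-1
"""
rows = parse_reviews(SAMPLE)
print(rows[0].polarity, rows[1].polarity)
```

Keeping polarity as a separate, validated field matters because, as noted above, the rating ranges of the three classes overlap.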

4) Second phase:
Text Preprocessing phase: The first step is text preprocessing, which improves classifier performance by converting the text into the most suitable possible format. To achieve this, several stages are performed: normalization, tokenization, stop-word removal, and stemming.
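The normalization rules are not listed here. As an illustration only, the sketch below applies a few rules that are common in Arabic preprocessing: removing diacritics and the tatweel elongation character, unifying alef variants, mapping alef maqsura to ya, and collapsing letters repeated three or more times, which is frequent in slang. Treat the rule set as an assumption, not as MASR's pipeline.

```python
import re

DIACRITICS = re.compile(r"[\u064B-\u0652]")         # tanween, short vowels, shadda, sukun
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")  # آ, أ, إ
REPEATED = re.compile(r"(.)\1{2,}")                  # a character repeated 3+ times
TATWEEL = "\u0640"                                   # ـ (elongation stroke)


def normalize(text: str) -> str:
    """Apply a few common Arabic normalization rules (assumed, not MASR's exact set)."""
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    text = ALEF_VARIANTS.sub("\u0627", text)  # unify alef forms to bare alef ا
    text = text.replace("\u0649", "\u064A")   # alef maqsura ى -> ya ي
    text = REPEATED.sub(r"\1", text)          # slang elongation: حلووووو -> حلو
    return text


print(normalize("أحلى تطبيق، حلووووو جدا"))
```

Only runs of three or more identical characters are collapsed, so legitimate doubled letters survive.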

6) Tokenization:
Tokenization separates text into words or sentences. This makes it possible to work with smaller pieces of text that remain relatively meaningful even outside the context of the rest of the text. In this research, the NLTK Regexp tokenizer is applied to the MASR dataset. It splits a string into substrings using a regular expression, and the regular expression can match either the tokens themselves or the delimiters between them.
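NLTK's RegexpTokenizer takes a pattern and a gaps flag: with gaps=False the pattern matches tokens, and with gaps=True it matches the separators between them. The standard-library sketch below mirrors those two modes with re.findall and re.split; the sample review and both patterns are assumptions, not the ones used for MASR.

```python
import re


def regexp_tokenize(text: str, pattern: str, gaps: bool = False) -> list[str]:
    """Mirror the two modes of NLTK's RegexpTokenizer.

    gaps=False: the pattern matches the tokens themselves (re.findall).
    gaps=True:  the pattern matches the delimiters between tokens (re.split),
                and empty strings are discarded.
    """
    if gaps:
        return [tok for tok in re.split(pattern, text) if tok]
    return re.findall(pattern, text)


review = "التطبيق حلو، بس بيهنج كتير!"
print(regexp_tokenize(review, r"\w+"))                      # match word tokens
print(regexp_tokenize(review, r"[\s،!.؟]+", gaps=True))     # match separators
```

Both calls yield the same five tokens here; the gaps mode is handy when it is easier to describe separators (spaces and Arabic punctuation such as "،" and "؟") than the tokens.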

7) Stop word removal:
The second stage removes all stop-words from the reviews. Stop words are words that add no sentiment value to a review; they are typically the most common words in a language. Stop-word lists can either be custom-built or obtained from the web. Unfortunately, there is no standard list available, and only a few lists exist for the Arabic language. This research adapted an Arabic stop-word list from several resources, in addition to the Egyptian stop-word list from [24].
8) Stemming: Stemming is a text processing method that reduces a word to its root. It maps different forms of the same word to a common "stem"; for example, the Arabic stemmer maps طفل, اطفال, الاطفال, اطفالكم, فأطفالكم, اطفالهم, والاطفال, فاطفالهم, وطفل, الطفولة, والطفلتين, and طفلتان to طفل. In this research, the Snowball stemmer is applied to the MASR dataset.
a) Third phase: Splitting phase: The MASR dataset was split into two parts: a training set containing 70% of the data and a testing set containing 30%. The training set is used to train the models, while the testing set is used to evaluate them.
9) Bag of Words (BOW): BOW is a process of extracting features from text for use in modeling, such as with ML algorithms. The BOW model represents a corpus with word counts for every document.
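A minimal sketch of stop-word removal, the 70/30 split, and bag-of-words counting follows. The stop-word set is a tiny hypothetical stand-in for the combined MSA and Egyptian list, and split_train_test and bag_of_words are illustrative helpers, not the Orange components used in this research.

```python
import random
from collections import Counter

# Hypothetical stand-in for the combined MSA + Egyptian stop-word list
STOP_WORDS = {"في", "من", "على", "بس", "ده", "دي", "و"}


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop tokens that appear in the stop-word set."""
    return [t for t in tokens if t not in STOP_WORDS]


def split_train_test(items: list, train_ratio: float = 0.7, seed: int = 0) -> tuple[list, list]:
    """Shuffle a copy of `items` and cut it into training and testing parts."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def bag_of_words(docs: list[list[str]]) -> tuple[list[str], list[list[int]]]:
    """Build a sorted vocabulary and one word-count vector per document."""
    vocab = sorted({tok for doc in docs for tok in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


print(remove_stop_words(["التطبيق", "حلو", "بس", "بطيء"]))
train, test = split_train_test(list(range(2469)))
print(len(train), len(test))
vocab, vectors = bag_of_words([["حلو", "حلو", "سريع"], ["بطيء"]])
print(vocab, vectors)
```

With 2469 reviews, a 70/30 split yields 1728 training and 741 testing reviews. In a real pipeline the BOW vocabulary should be built from the training reviews only, so that no information from the test set leaks into the features.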
10) Term Frequency-Inverse Document Frequency (TF-IDF): The TF-IDF weight is a statistical measure used to estimate how important a word is to a document in a corpus. The importance increases proportionally to the number of times a word appears in the document and is offset by how common the word is across the corpus. It is formed by two parts: the term frequency (TF) of the word in the document and the inverse document frequency (IDF) of the word in the corpus.
This research proposes a novel Hybrid Supervised Classification Approach to automatically classify and predict the polarity of Arabic Slang mobile app user reviews. The model combines various supervised ML and DL approaches. For ML, we use two modeling approaches: a decision tree approach and a statistical approach. For DL, we use a linear and non-linear approach. In the decision tree approach, we apply the RF classifier; in the linear and non-linear approach, we apply the MLP-NN classifier; and in the statistical approach, we apply the LR classifier. These classifiers were selected after applying various ML classifiers in a previous phase: the results show that the classifiers with the best accuracy for classifying MASR are RF, LR, and MLP-NN. Finally, we apply a hybrid classification model that combines these three techniques to improve accuracy.
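The conclusion mentions ensemble classifier averaging. One common way to combine LR, RF, and MLP-NN in that spirit is soft voting: average the class probabilities each model assigns and pick the class with the highest mean. The sketch below assumes the three models are already trained and uses made-up probability vectors for one review; scikit-learn's VotingClassifier with voting="soft" implements the same idea, but the models here were built in Orange, so this is not presented as the exact combiner used.

```python
LABELS = ("negative", "neutral", "positive")


def average_probabilities(model_probs: list[list[float]]) -> list[float]:
    """Element-wise mean of several models' class-probability vectors."""
    n_models = len(model_probs)
    return [sum(column) / n_models for column in zip(*model_probs)]


def hybrid_predict(model_probs: list[list[float]]) -> str:
    """Soft voting: return the label with the highest averaged probability."""
    avg = average_probabilities(model_probs)
    return LABELS[max(range(len(avg)), key=avg.__getitem__)]


# Made-up outputs of the three base classifiers for one review
lr_probs = [0.40, 0.35, 0.25]   # LR alone would pick "negative"
rf_probs = [0.30, 0.45, 0.25]
mlp_probs = [0.20, 0.50, 0.30]

print(hybrid_predict([lr_probs, rf_probs, mlp_probs]))
```

Here LR alone would predict negative, but averaging gives neutral the highest mean probability (about 0.43), so the hybrid predicts neutral.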
IV. RESULTS AND DISCUSSION
For the empirical study, this research uses the Orange data mining tool, which provides a component-based, comprehensive environment for DM and ML users and developers; here it is used for the ML and DL models. It combines Python-based and NLTK library modules that perform functions such as data input, preprocessing, splitting, visualization, classification, prediction, and evaluation. The MASR dataset was classified using an ML approach (KNN, SVM, NB, LR, and RF) and a DL approach (MLP-NN) for Arabic sentiment analysis (ASA). In addition, this paper proposes a novel hybrid classification technique that combines the two top ML classifiers with the DL classifier (LR + RF + MLP-NN) to improve classification and prediction accuracy. 10-fold cross-validation (k = 10) was used. Accuracy, F1, precision, recall, and AUC were used to evaluate MASR sentiment polarity classification.
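10-fold cross-validation partitions the data into ten folds and uses each fold once for testing while training on the other nine. The sketch below generates contiguous, unshuffled folds for illustration; in practice the reviews would be shuffled (and usually stratified by polarity) first, for example with scikit-learn's KFold or StratifiedKFold. The exact fold assignment produced by Orange is not reproduced here.

```python
def k_fold_indices(n_samples: int, k: int = 10):
    """Yield (train_idx, test_idx) pairs for k contiguous folds.

    Fold sizes differ by at most one, and every sample lands in
    exactly one test fold.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(2469, k=10))
print([len(test) for _, test in folds])
```

For the 2469 MASR reviews, the first nine folds hold 247 reviews and the last holds 246; the reported score is then the mean over the ten test folds.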
The results are discussed separately for each evaluation criterion. Moreover, to verify the performance of the classifiers, this paper combined various domains to test the accuracy of various ML and DL classifiers and of our proposed hybrid approach using Arabic dialect features.
Positive sentiments: Among the ML classifiers, LR (87.5%) and RF (83%) show better accuracy than KNN (72.6%), SVM (45.5%), and NB (35.1%). The accuracy of the DL classifier MLP-NN (86%) is close to that of the ML classifiers RF and LR. Finally, our proposed approach ML+DL (LR+RF+MLP-NN) achieves higher accuracy (89.4%) than the top three classifiers: ML (LR, RF) and DL (MLP-NN).
Neutral sentiments: Among the ML classifiers, RF (46.8%), KNN (36.9%), and LR (36.7%) show better F1-measure results than SVM (6.7%) and NB (10%). The F1-measure of the DL classifier MLP-NN (53.1%) is higher than that of the top two ML classifiers, LR and RF. Finally, our proposed approach ML+DL (LR+RF+MLP-NN) achieves a higher F1-measure (61.8%) than the top three classifiers: ML (LR, RF) and DL (MLP-NN).
Accuracy: Among the ML classifiers, LR (72%) and RF (70%) show better accuracy than SVM (50%), KNN (49%), and NB (45%). The accuracy of the DL classifier MLP-NN (69%) is close to that of RF and LR. Finally, our proposed approach ML+DL (LR+RF+MLP-NN) achieves higher accuracy (74.2%) than the top three classifiers: ML (LR, RF) and DL (MLP-NN).
Table IV compares the accuracy of our model with state-of-the-art hybrid models on various Arabic datasets (SemEval, ASTD, COVID datasets, AJGT). The researchers observe the strength of our proposed hybrid approach compared with previous works.
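The results above mix per-class F1 (for neutral reviews) with overall accuracy. The sketch below shows how both are computed from true and predicted labels, using the Table II polarity encoding (+1, 0, -1) and made-up predictions. Per-class F1 can be very low even when overall accuracy is moderate, for example when a classifier rarely predicts the neutral class.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall, and F1 for one class, treated one-vs-rest."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of reviews whose predicted polarity matches the true one."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels: +1 positive, 0 neutral, -1 negative
y_true = [-1, -1, 0, 0, 1, 1, -1, 0]
y_pred = [-1, 0, 0, -1, 1, 1, -1, 0]

precision, recall, f1 = per_class_metrics(y_true, y_pred, label=0)
print(f"neutral: P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
print(f"accuracy: {accuracy(y_true, y_pred):.3f}")
```

For this sample, neutral precision, recall, and F1 are all 2/3, while overall accuracy is 0.75.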

V. CONCLUSION
This paper collects a simple dataset of Mobile Apps Arabic Slang Reviews (MASR) that focuses on Egyptian Arabic slang for sentiment analysis purposes. In addition, it proposes a hybrid supervised classification approach that combines ML and DL approaches to automatically predict user requirements evolution and help developers update new versions. For ML, it applies LR, a statistical method, and RF, a decision tree method. For DL, it applies MLP-NN, a linear and non-linear method. This paper used various evaluation metrics (accuracy, F-measure, recall, precision, and AUC) together with ensemble classifier averaging. Results show that our proposed hybrid supervised classification approach achieves good performance, as reported in Section IV. A limitation of this research is the size of the dataset, because it focuses only on Egyptian Arabic slang mobile app reviews. However, it is a contribution because, to date, no studies have concentrated on this area.

VI. FUTURE WORK
In future work, we intend to pursue several directions: 1) Apply our proposed hybrid supervised approach to automatically classify mobile app categories.
2) Apply our proposed hybrid supervised approach to other mobile app Arabic slang datasets and to datasets in other languages.
3) Add other feature extraction methods such as word embeddings, word enrichment, and n-grams, and apply different tokenization and stemming methods.
4) Propose different hybrid ML and DL modelling approaches and compare them with our proposed approach on different Arabic slang datasets.
5) Also apply a lexicon-based approach to the MASR dataset.
6) Extract functional, non-functional, and sentiment requirements from the MASR dataset using a topic modeling approach.