Aspect-Based Sentiment Analysis and Emotion Detection for Code-Mixed Review

Reviews can influence customer decisions because reading them reveals whether an opinion is positive or negative. However, a positive, negative, or neutral polarity alone is not enough, because the emotion expressed can strengthen the sentiment result. This study compares machine learning and deep learning for sentiment and emotion classification, framed as multi-label classification. For machine learning, the problem transformation methods we used are Binary Relevance (BR), Classifier Chain (CC), and Label Powerset (LP), with Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM), and Extra Tree Classifier (ET) as the algorithms. The features we compared are word n-grams (unigram, bigram, and unigram-bigram). For deep learning, we applied the Gated Recurrent Unit (GRU) and Bidirectional Long Short-Term Memory (BiLSTM), using a self-developed word embedding. The comparison shows that RF leads with F1-scores of 88.4% for the food aspect (with CC) and 89.54% for the price aspect (with LP). For the service and ambience aspects, ET leads with 92.65% and 87.1%, using LP and CC, respectively. In the deep learning comparison, GRU and BiLSTM obtained the same F1-score for the food aspect, 88.16%. For the price aspect, GRU leads with 83.01%. For service and ambience, however, BiLSTM achieved higher F1-scores, 89.03% and 84.78%.
Keywords: Sentiment analysis; emotion; multi-label classification; machine learning; deep learning


I. INTRODUCTION
A review is an evaluation of an entity such as a product, restaurant, or place, which customers can use when making choices and owners can use as input for their products. A review usually covers several aspects; for laptops [1], for example, the aspects that can be evaluated include hardware and price. These evaluations can affect customers' decisions. For instance, people planning a trip will read the reviews of several places and compare them. One domain that typically receives many reviews is restaurants. There are several restaurant review platforms on the internet, such as Zomato and Yelp. On these platforms, most people only look at a restaurant's rating. Reading the reviews is nevertheless important, because they give customers more specific information than the rating alone. In addition, people sometimes give ratings that differ considerably from the content of their review, so ratings do not always reflect the quality of a restaurant. Besides supporting customer decisions, reviews are also important for product owners. Pontiki et al. [2] stated that customer feedback helps companies measure customer satisfaction and develop the products and services they provide. Sentiment analysis can be used to identify the sentiment expressed toward each aspect. However, classifying sentiment is not enough without considering the customers' emotions, since knowing the emotion can strengthen the sentiment result of a review. Furthermore, many reviews contain two or more languages, known as code-mixed text. Such reviews are difficult for a computer to process because it cannot identify the languages as easily as a human can, which poses a major challenge for sentiment analysis and emotion detection. Several classification approaches can be used, such as machine learning and deep learning. Mohammad et al. [3] used a Support Vector Machine to classify sentiment in Twitter data. On the other hand, Stojanovski et al. [4] applied a deep learning algorithm to sentiment analysis and emotion detection on Twitter data.
This research conducts sentiment analysis and emotion detection for every aspect that appears in a restaurant review. The data were collected from an Indonesian restaurant review platform named PergiKuliner, and this study uses "food", "price", "service", and "ambience" as aspects. The sentiment polarities used are "positive", "negative", and "neutral", while the emotions are "happy", "sad", "surprised", and "neutral". We added "neutral" as an emotion because a review may carry a sentiment polarity while its emotion is difficult to detect. We applied multi-label classification, using algorithms from both machine learning and deep learning.
The rest of this paper is organized as follows. Section 2 reviews research related to our study. Section 3 describes the steps of our experiments. Section 4 presents and analyzes the classification results. The last section concludes the study and outlines future work.
www.ijacsa.thesai.org

II. RELATED WORK
There are many studies on sentiment analysis and emotion detection. Mohammad [5] surveyed research on valence, emotion, and other aspects that can affect a person's feelings. That study describes the challenges of sentiment and emotion detection, such as language complexity, non-standardized language, lack of labeled data, subjectivity, and cultural differences. Stojanovski et al. [4] performed sentiment analysis on SemEval 2015 data and emotion detection on Twitter data. The sentiment polarities they used are "positive", "negative", and "neutral", while the emotions are "love", "joy", "surprise", "anger", "sadness", "fear", and "thankfulness". They applied a Deep Convolutional Neural Network to both tasks; however, sentiment analysis and emotion detection were conducted on separate datasets. Another study on emotion was conducted by Hassan et al. [6], who classified emotions using Skip-thought Vectors. Khawaja et al. [7] also experimented with emotion by developing an automatic emotion lexicon.
In Indonesia, there are also a few studies on sentiment and emotion. Wikarsa and Tahir [8] studied emotion detection using Twitter data, but the data were in English. Savigny and Purwarianti [9] conducted emotion classification on YouTube comments. For sentiment analysis, [10] and [11] studied restaurant reviews in Indonesia.
Several studies have also addressed sentiment analysis and emotion detection on code-mixed data. Shalini et al. [12] studied sentiment analysis of Kannada-English Facebook comments, applying Facebook's fastText, Doc2Vec with SVM, Bidirectional LSTM, and CNN. Lee and Wang [13] experimented with Chinese-English data and proposed a multi-learning framework for emotion detection.

III. RESEARCH STEPS
This section explains the methodology applied in this research, as shown in Fig. 1.

A. Data Collection
The data were collected by scraping the PergiKuliner platform. They consist of reviews of restaurants in Jakarta, Bogor, Depok, Tangerang, and Bekasi, 20000 reviews in total. After filtering, which removed duplicates and spam reviews, 18908 reviews remained for annotation. The data include reviews written in Indonesian, in English, and in code-mixed Indonesian-English. Below are examples of the data:
1) Indonesian: Akhirnya cobain taichan sm martabak tipkernya Dann taichannya enak!! Hehehe Asik jg tmptnya rame. (Finally got to taste the taichan and the martabak tipker, and the taichan was delicious!! Hehehe, it was fun, and the place was also crowded.)
2) Mixed: Finally got to try this current happening Korean food! Gyeran Jim (22k) Ini kaya steamed egg, yang rada di bake. Telornya ga tawar, tasty dan pinggirannya agak kering gitu. Menurut gue worth sih 22k buat ini, hehe. Probably gonna try again :) (Finally got to try this current happening Korean food! This Gyeran Jim (22k) was like steamed egg. The egg wasn't bland, it was tasty, and the crust was a bit dry. In my opinion it was worth 22k, hehe. Probably gonna try again :))
3) English: Been here for several times I've been loving this place so much. The ambience is truly Japanese izakaya dining. If you eat with many people (sharing) the price would be reasonable, however if you only eat for two the price might get a little high for izakaya. Though the foods are mostly great. Cool place to hangout!

B. Building Annotation Guidelines
After collecting the data, the next step is building the annotation guidelines. Two guidelines were made: one for sentiment annotation and one for emotion annotation. The aspects used are 'food', 'price', 'service', and 'ambience'. The sentiment polarities follow [14]: 'positive', 'negative', and 'neutral'. For emotions, we followed [15], which divides emotions into 'happy', 'sad', 'surprised', 'angry', 'disgusted', and 'fear'. We also added 'neutral' to the emotion list for cases where the emotion is difficult to detect. Below are the definitions of the labels used.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 9, 2020
1) Sentiment labels:
a) Positive: The positive label is indicated by the appearance of positive terms, such as "delicious", "recommended", "cheap", "clean", and "friendly".
b) Negative: The negative label is given if negative terms occur, for instance "bad", "horrible", "not recommended", "pricey", "expensive", and "dirty".
c) Neutral: A review is classified as neutral if the terms that appear show neither positive nor negative values. It can also be recognized by the appearance of neutral terms, such as "standard", "so so", and "not bad but not good". In addition, the neutral label is given to an aspect that does not appear, because we assumed that if an aspect is not mentioned, its polarity is neither positive nor negative.
f) Fear: A review is classified as fear if terms such as "afraid" or "worried" appear.
g) Neutral: The neutral label is given if the emotion in a review is difficult to interpret. In addition, the neutral emotion is given to aspects that are not mentioned, as in the neutral definition for sentiment.

C. Annotation
The next step is annotating the data. Annotation consists of two stages: sentiment annotation and emotion annotation. Annotators were recruited by crowdsourcing, following Sabou et al. [16], and are not linguistic experts. Every review is annotated by three people in each stage, and the final label is determined by majority voting. After sentiment annotation, 562 reviews could not be used because all three annotators had assigned different labels, so no majority existed. This left 18346 reviews for the next annotation stage. However, because of limited time and a limited number of annotators, only 15046 reviews were annotated with emotion labels. After majority voting, 14188 reviews remained. The number of reviews labelled "angry", "fear", or "disgusted" was very small, so we removed them, leaving 14103 reviews for classification. The labels used are therefore "positive", "negative", and "neutral" for sentiment, and "happy", "sad", "surprised", and "neutral" for emotion.
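The majority-voting rule described above can be sketched as follows. This is only a minimal illustration, assuming three annotator labels per item and a strict majority (at least two identical labels); the function name `majority_vote` is hypothetical and does not come from the original study.

```python
from collections import Counter


def majority_vote(annotations):
    """Return the label chosen by more than half of the annotators.

    Returns None when no label has a strict majority, which corresponds
    to the items that were discarded after annotation.
    """
    label, count = Counter(annotations).most_common(1)[0]
    return label if count > len(annotations) / 2 else None


# Two of three annotators agree, so the label is kept.
print(majority_vote(["positive", "positive", "neutral"]))
# All three annotators disagree, so the item is discarded (None).
print(majority_vote(["positive", "negative", "neutral"]))
```

With three annotators and three sentiment labels, the only case without a majority is the one in which every annotator picks a different label, which matches the discarded reviews reported in this section.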

D. Data Preprocessing
After the annotation process, the next stage is data preprocessing. This stage is adapted from [17] and consists of the following steps:
1) Emoticon Processing: In this step, emoticon characters were converted into words; for example, :( was changed into "sad" and :) into "happy". This preserves the information carried by the emoticon, which would otherwise be lost when non-alphabetical characters are removed in a later step.
2) Case Folding: All strings were converted to lowercase so that the same word has a single form. For example, "Food" was converted into "food".
3) Abbreviation and Spelling Correction part 1: In this part, abbreviated and informal words were corrected into their formal form. For illustration, "I've visited the place, that wasn't too crowd" was corrected into "i have visited the place, that was not too crowd". We used the abbreviation dictionary self-developed by [17], which contains Indonesian and English abbreviations.

5) Abbreviation and Spelling Correction part 2:
In this step, the words are checked again to make sure all of them have been corrected. It catches words that may have been missed in the third step. For instance, the phrase "tmptnya ga bgs!!" was changed into "tempatnya tidak bgs!!" in the third step, but "bgs" was not changed into "bagus" (good) because the exclamation marks "!!" were attached to it. After the exclamation marks were removed in the fourth step, the phrase "tempatnya tidak bgs" was corrected again into "tempatnya tidak bagus" (the place was not good).
6) Removing Stopwords: In this stage, stopwords such as "i", "you", and "always" were removed. This step used the dictionary built by [17], which combines NLTK for English and Sastrawi for Indonesian.
7) Removing Repetitive Characters: People sometimes express their feelings by using unnecessary duplicated characters. These characters were removed; for example, "happppyyyy" was changed into "happy".
8) Stemming: In this last preprocessing step, we removed prefixes and suffixes to return words to their base form. Because the data are in Indonesian and English, we applied two stemmers: the Snowball Stemmer from NLTK for English and the Sastrawi Stemmer for Indonesian.
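The reason for running abbreviation correction twice can be illustrated with a short sketch. It is not the pipeline of [17]: the two dictionaries below are tiny hypothetical stand-ins, the function names are assumptions, and only emoticon processing, case folding, the two correction passes, and the removal of non-alphabetical characters are shown.

```python
import re

# Hypothetical stand-ins for the emoticon map and abbreviation dictionary.
EMOTICONS = {":(": "sad", ":)": "happy"}
ABBREVIATIONS = {"tmptnya": "tempatnya", "ga": "tidak", "bgs": "bagus"}


def replace_emoticons(text):
    """Turn emoticons into words so their information survives later steps."""
    for emoticon, word in EMOTICONS.items():
        text = text.replace(emoticon, f" {word} ")
    return text


def expand_abbreviations(text):
    """Replace whitespace-separated tokens found in the dictionary."""
    return " ".join(ABBREVIATIONS.get(token, token) for token in text.split())


def strip_non_alphabetical(text):
    """Replace every character that is not a lowercase letter or space."""
    return re.sub(r"[^a-z\s]", " ", text)


def preprocess(text):
    text = replace_emoticons(text)
    text = text.lower()                  # case folding
    text = expand_abbreviations(text)    # pass 1: "bgs!!" is not matched
    text = strip_non_alphabetical(text)  # "bgs!!" becomes "bgs"
    text = expand_abbreviations(text)    # pass 2: "bgs" becomes "bagus"
    return " ".join(text.split())


print(preprocess("tmptnya ga bgs!!"))
```

In the first pass the token "bgs!!" is not found in the dictionary because the punctuation is still attached; once the non-alphabetical characters are removed, the second pass can correct it.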

E. Feature Extraction
This part explains the feature extraction for machine learning and the development of the word embedding for deep learning.

1) N-gram:
The features used for machine learning classification are word-level n-grams. We extracted unigrams, bigrams, and the combination of unigrams and bigrams as features. We also applied the chi-square method for feature selection.
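As an illustration of these features, the sketch below extracts word-level n-grams and scores a single feature with a chi-square statistic. The chi-square form used here is the common 2x2 contingency formulation for a term and a class; the paper does not state which implementation was used, so this choice and all function names are assumptions.

```python
def ngrams(tokens, n):
    """Return the word-level n-grams of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def unigram_bigram(tokens):
    """Combine unigram and bigram features for one document."""
    return ngrams(tokens, 1) + ngrams(tokens, 2)


def chi_square(a, b, c, d):
    """Chi-square score of a term for a class from a 2x2 contingency table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class without the term
    d: documents outside the class without the term
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denominator


tokens = ["food", "delicious", "cheap"]
print(unigram_bigram(tokens))
print(chi_square(2, 0, 0, 2))  # term and class always co-occur
print(chi_square(1, 1, 1, 1))  # term is independent of the class
```

Features with higher scores are more strongly associated with a class and would be kept first during selection; how many features were kept is not stated in the paper.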
2) Word embedding: For deep learning, we built our own word embedding from all of the data scraped from PergiKuliner. The embedding was trained with the skip-gram method, with a dimension of 300.
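Skip-gram learns an embedding by predicting the context words around each center word. The sketch below only shows how such (center, context) training pairs are formed; it does not train an embedding. The window size is an assumption, since the paper reports only the dimension, and `skipgram_pairs` is a hypothetical helper name.

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) pairs within a symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Window size 1 is chosen only to keep the printed output short.
for pair in skipgram_pairs(["tempat", "tidak", "bagus"], window=1):
    print(pair)
```

A skip-gram model would then learn one 300-dimensional vector per word such that each center word is predictive of its context words.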

IV. RESULTS AND ANALYSIS
This part explains the experiments, results, and analysis of this research.

A. Experiments
In this study, we used the dataset we built and defined two scenarios for multi-label classification. We then compared several machine learning and deep learning algorithms and evaluated their performance by comparing their F1-scores.

1) Data:
This experiment uses all data obtained from the annotation step: 14103 reviews with three sentiment labels and four emotion labels. The distributions of the sentiment and emotion labels are shown in Fig. 2 and Fig. 3, respectively. Both figures show that the labels are imbalanced for both sentiment and emotion. To illustrate, the "food" aspect is dominated by the "positive" sentiment and the "happy" emotion, whereas all other aspects are dominated by "neutral" for both sentiment and emotion.
2) Scenarios:
a) First scenario: In the first scenario, we employed problem transformation methods for multi-label classification with machine learning. The transformation methods we implemented are Binary Relevance (BR), Label Powerset (LP), and Classifier Chain (CC). The machine learning algorithms we applied are Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM), and Extra Tree Classifier (ET). The features we used are unigram, bigram, and the combination of unigram and bigram.
b) Second scenario: In this scenario, the deep learning algorithms used are Bidirectional Long Short-Term Memory (BiLSTM) and Gated Recurrent Unit (GRU). We did not use problem transformation methods as in machine learning; instead, we used sigmoid as the activation function and binary cross-entropy as the loss function to obtain the labels. The word embedding developed earlier is employed in this scenario.
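The output layer of the second scenario can be illustrated as follows. This is a sketch under stated assumptions rather than the authors' network: the seven labels of one aspect are assumed to be encoded as a multi-hot vector, the 0.5 decision threshold is an assumption, and the label names with suffixes are hypothetical.

```python
import math

# Assumed label layout for a single aspect: three sentiments, four emotions.
LABELS = ["positive", "negative", "neutral_sentiment",
          "happy", "sad", "surprised", "neutral_emotion"]


def sigmoid(z):
    """Squash a logit into an independent probability for one label."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(targets, probabilities, eps=1e-12):
    """Mean binary cross-entropy over all label positions."""
    total = 0.0
    for target, p in zip(targets, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= target * math.log(p) + (1 - target) * math.log(1 - p)
    return total / len(targets)


def decode(probabilities, threshold=0.5):
    """Return every label whose probability reaches the threshold."""
    return [label for label, p in zip(LABELS, probabilities) if p >= threshold]


logits = [2.0, -2.0, -2.0, 2.0, -2.0, -2.0, -2.0]
probabilities = [sigmoid(z) for z in logits]
targets = [1, 0, 0, 1, 0, 0, 0]  # "positive" sentiment and "happy" emotion

print(decode(probabilities))
print(round(binary_cross_entropy(targets, probabilities), 6))
```

Because each sigmoid output is scored independently, several labels can be active at once, which is why no problem transformation is needed in this scenario.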
3) Evaluation: Both machine learning and deep learning were evaluated using k-fold cross-validation with k = 10. The evaluation metric is the F1-score.
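The evaluation protocol can be sketched as below. The fold construction is deterministic and unshuffled for clarity, and macro-averaging over labels is an assumption, because the paper does not state how F1-scores were averaged; all function names are hypothetical.

```python
def kfold_indices(n_samples, k=10):
    """Split sample indices into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        folds.append((train_idx, test_idx))
        start += size
    return folds


def f1_for_label(y_true, y_pred, label):
    """F1 = 2PR / (P + R) for one label, or 0.0 without true positives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1-scores (assumed averaging)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)


folds = kfold_indices(14103, k=10)
print([len(test_idx) for _, test_idx in folds])
print(macro_f1(["pos", "pos", "neg", "neg"], ["pos", "neg", "neg", "neg"]))
```

Each of the ten folds serves once as the test set while the remaining nine are used for training, and the reported score would be the mean over the folds.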

B. Results
This section shows the performance of machine learning in the first scenario and of deep learning in the second scenario for every aspect of the reviews. We then assessed the performance in both scenarios by comparing F1-scores.

1) First scenario:
a) Label powerset: This part presents the performance of the machine learning algorithms when Label Powerset (LP) is used as the transformation method.
Table I shows that, for the "food" aspect, ET achieved the highest score with the unigram feature, 88.17%. For bigram, the highest F1-score was obtained by RF with 87.3%, 0.61% higher than SVM in second place. RF and ET obtained the same F1-score for unigram-bigram, 88.16%. From these scores, the best feature for this aspect is unigram. Table II shows that RF led for every feature in the "price" aspect, although for the unigram-bigram feature ET obtained the same F1-score as RF, 89.54%. Unigram-bigram achieved the highest score of the three features for the "price" aspect. For the "service" aspect, Table III shows that ET led for both the unigram and unigram-bigram features, while for bigram the highest score was obtained by RF with 90.88%, 0.21% higher than ET. The best feature for the "service" aspect is unigram-bigram, with a score of 92.65% obtained by ET.
As in the previous table, Table IV shows that ET achieved the highest scores for both unigram and unigram-bigram when classifying the "ambience" aspect, while RF obtained the highest score for the bigram feature with 81.82%. As with the "service" aspect, the best feature is unigram-bigram, with a score of 86.98% achieved by ET.
Table V summarizes the highest scores for every aspect and every feature. With Label Powerset (LP), "food" was the only aspect that achieved its highest score with unigram (88.17%), while the other aspects performed best with unigram-bigram. ET obtained the highest score in every aspect except "price", where the highest score was achieved by RF. For the bigram feature, RF led in all aspects, but its scores are below those of the unigram-bigram feature. The aspect with the highest score overall is "service", at 92.65% with ET.
b) Binary relevance: This part presents the performance of the machine learning algorithms when Binary Relevance (BR) is used as the transformation method. Table VI shows that RF attained the highest scores for every feature in the "food" aspect, with ET close behind. The table also shows that the unigram-bigram feature gives a higher score than the other features, although it is only 0.01% higher than the score obtained with unigram alone.
In Table VII, DT attained the highest score among the algorithms for the first time, with the unigram feature. DT achieved 83.60%, followed by ET, which scored 1.27% lower. For the bigram feature, RF achieved the highest score when classifying the "price" aspect. Unigram-bigram was once again the best feature, with ET attaining the highest score for the "price" aspect, 87.56%. Similar to the "price" results, Table VIII shows that DT again achieved the highest score with the unigram feature, this time for the "service" aspect, followed by RF, which scored 0.39% lower. RF also led with the bigram feature, scoring 90.07%. The best feature for this aspect is again unigram-bigram, with ET as the classification algorithm: ET obtained 91.28%, 1.21% higher than the RF and bigram pair.
Table IX shows that ET leads for both the unigram and unigram-bigram features when classifying the "ambience" aspect, while RF achieved the best score with the bigram feature, 80.12%. As with the three previous aspects, the best classification score was obtained with the unigram-bigram feature, here by ET.
The comparison of all machine learning algorithms in Table X shows that every best performance was attained with the unigram-bigram feature. With the BR method and unigram-bigram, ET obtained the highest scores in three aspects: "price", "service", and "ambience". On the other hand, RF dominates the "food" aspect for all features, including unigram-bigram. Furthermore, as with LP, "service" is the aspect with the highest score in Table X, 91.28%, followed by "food", "price", and "ambience", respectively.
c) Classifier chain: This part shows the performance of the machine learning algorithms when Classifier Chain (CC) is used as the transformation method.
In Table XI, the classification results for the "food" aspect were dominated by RF for every feature, although for the unigram-bigram feature ET obtained the same score as RF, 88.40%. Moreover, as with the BR method, unigram-bigram is the best feature for the "food" aspect with CC, followed by unigram.
For the "price" aspect, Table XII shows that RF attained the best score for the unigram and bigram features, while for unigram-bigram ET obtained the highest score with 89.24%, which is 1.62% and 3.93% higher than the RF results with bigram and unigram, respectively. Once again, unigram-bigram is the best feature for classifying the "price" aspect, as for the previous aspect. Table XIII presents the performance of the algorithms on the "service" aspect. ET leads with the unigram and unigram-bigram features, while RF achieved the highest score for bigram. Unigram-bigram remains the best feature for this aspect when classified with ET, with an F1-score of 92.09%.
As with the previous aspect, Table XIV shows that ET obtained the highest score for the "ambience" aspect with both the unigram and unigram-bigram features. The best bigram score was obtained by RF with 81.84%, which is still 5.26% lower than the score attained by ET with the unigram-bigram feature. Again, unigram-bigram is the best feature for the "ambience" aspect.
Table XV shows that unigram-bigram is the best feature when the Classifier Chain (CC) transformation method is applied. As with Binary Relevance (BR), unigram-bigram dominates all aspects. ET also attained the highest scores in all aspects except "food", which was dominated by RF, again as with BR.
In addition, as in the LP and BR results, the best score across all aspects was obtained for the "service" aspect, classified by ET using unigram-bigram. ET achieved 92.09% for "service", 2.85% higher than the "price" aspect, which had the second-highest score.
Table XVI compares the best performances of Binary Relevance (BR), Label Powerset (LP), and Classifier Chain (CC) with unigram-bigram as the feature. The "food" aspect obtained its highest score when classified by RF with CC as the problem transformation method, followed by BR and LP, respectively. For the "price" and "service" aspects, LP is better than the other transformation methods, followed by CC and then BR. For the "price" aspect, the highest score, 89.54%, was obtained by RF, while for the "service" aspect the best score was achieved by ET with 92.65%. For the "ambience" aspect, however, ET attained the highest score with CC as the transformation method: 87.1%, 0.12% higher than the score it obtained with LP.
Furthermore, BR does not surpass LP or CC, except for the "food" aspect, where the BR score is 0.02% higher than that of LP. This may be because BR treats the labels independently before they are classified, so it does not consider the relationships between labels. For instance, the sentiment label "positive" is treated as unrelated to the emotion label "happy", because the two labels are classified separately. On the other hand, LP transforms each combination of labels into a new class and then treats the task as a multiclass problem, while CC uses the label obtained from the first classification as a feature for classifying the next label. From the way the three transformation methods work, we can conclude that LP and CC consider the relations between labels, while BR does not.
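The difference between the three transformations can be made concrete with a small sketch. It only shows how the targets and inputs are rearranged before any classifier is trained; it assumes each review carries one (sentiment, emotion) pair for a given aspect, and the function names and the integer encoding of the first label are hypothetical.

```python
def binary_relevance(labels):
    """Split (sentiment, emotion) pairs into two independent target lists."""
    sentiments = [sentiment for sentiment, _ in labels]
    emotions = [emotion for _, emotion in labels]
    return sentiments, emotions


def label_powerset(labels):
    """Map every distinct (sentiment, emotion) combination to one class id."""
    class_ids, targets = {}, []
    for pair in labels:
        if pair not in class_ids:
            class_ids[pair] = len(class_ids)
        targets.append(class_ids[pair])
    return targets, class_ids


def classifier_chain_inputs(features, first_labels):
    """Append the encoded first label to each feature vector.

    The second classifier in the chain is trained on these extended
    vectors; at prediction time the first classifier's output is used.
    """
    return [list(row) + [label] for row, label in zip(features, first_labels)]


labels = [("positive", "happy"), ("negative", "sad"),
          ("positive", "happy"), ("neutral", "neutral")]

print(binary_relevance(labels))
print(label_powerset(labels)[0])
print(classifier_chain_inputs([[1, 0], [0, 1]], [2, 0]))
```

In the BR output, the link between "positive" and "happy" is lost because the two lists are learned separately. LP keeps the link by turning the pair into a single class, and CC keeps it by exposing the first label to the second classifier.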
Moreover, in the best results for every aspect and every transformation method used in this research, ET and RF outperform DT and SVM. Both ET and RF are tree-based ensemble algorithms, so they work in a similar way, differing in how they split nodes and use the samples. Even so, Table XVI shows that ET dominates the "price", "service", and "ambience" aspects for all transformation methods, except for LP in the "price" aspect, where the best score was obtained by RF. For the "food" aspect, the highest scores for all transformation methods were attained by RF.
2) Second scenario: This part shows the performance of the deep learning algorithms, BiLSTM and GRU. Table XVII shows that GRU and BiLSTM attained the same score for the "food" aspect, while BiLSTM leads for the "service" and "ambience" aspects. GRU scored higher than BiLSTM in the "price" aspect, reaching 83.01%, 0.92% higher than BiLSTM. The GRU scores for "service" and "ambience" are nonetheless close to those of BiLSTM: 0.33% and 0.86% lower, respectively. From this experiment, it can be concluded that GRU is competitive with BiLSTM, even though BiLSTM also uses future context, which can help it solve more complex classification problems. In addition, as in machine learning, "service" is the aspect with the highest score for both BiLSTM and GRU, followed by "food", "ambience", and "price", respectively. Furthermore, the results suggest that the self-developed word embedding works well with deep learning, as the scores obtained by the deep learning algorithms are quite similar to those of machine learning.

C. Analysis
Overall, the results of both scenarios show that "service" is the aspect that is classified best. For machine learning, it is followed by "price", "food", and "ambience", respectively. For deep learning, the second-highest score was obtained for "food", followed by "ambience" and then "price". This may be affected by the way people express their comments about each aspect. When people comment on the "service" aspect, they tend to use words like "service" or "waitress" directly, and the same goes for the "price" aspect. This differs from the "food" and "ambience" aspects, which customers can describe more creatively. To illustrate, people often list all the food they ordered and explain each item in detail. This can lead to misclassification when conflicting opinions occur within a single aspect. For example, "the noodles were very good but too oily, I don't like it", or "the fried rice was delicious but the orange juice too bland". Such reviews create a conflict that affects the classification results. The same goes for the "ambience" aspect, which people can describe in various ways, for instance, "it has beautiful decoration, but the room was full of smoke".
For the second scenario, "price" is the aspect with the lowest score, while "food" and "ambience" take second and third place after "service". This may be caused by the label distribution in the dataset, in which the "positive" sentiment and "happy" emotion are most dominant in the "food" aspect, followed by the "ambience", "service", and "price" aspects. Thus, the deep learning models learned the "positive" and "happy" labels better than the other labels.
Furthermore, the features used also affect the classification results. In the first scenario, the unigram-bigram feature provided more information than unigram or bigram alone. Unigram works well because words are treated individually, and individual words appear often in the dataset. To illustrate, consider the sentence "I like the food but it was too pricey". Its unigrams are "I", "like", "the", "food", "but", "it", "was", "too", "pricey", and its bigrams are "I like", "like the", "the food", "food but", "but it", "it was", "was too", "too pricey". Models using bigram work well, but not always as well as with unigram, because word pairs appear less often in reviews than individual words. If unigram and bigram are combined, the models obtain information about words both when they appear individually and when they appear as pairs. For the second scenario, classification with the self-developed word embedding gives good results, helped especially by the information it carries about semantic relations between words. Adding other features, such as POS tags, should also be considered for both machine learning and deep learning to enhance their performance.
Label distribution also affects the classification results. The dataset in this research is imbalanced, so data augmentation or oversampling/undersampling methods could be used to balance the data.

V. CONCLUSIONS AND FUTURE WORK
In this research, we conducted experiments and evaluated the performance of machine learning algorithms, namely Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM), and Extra Tree Classifier (ET), as well as deep learning algorithms, namely Bidirectional LSTM (BiLSTM) and Gated Recurrent Unit (GRU). We defined two scenarios. In the first scenario, we applied the transformation methods Binary Relevance (BR), Label Powerset (LP), and Classifier Chain (CC) for multi-label classification with machine learning, using unigram, bigram, and the combination of unigram and bigram as features. In the second scenario, we used sigmoid as the activation function and binary cross-entropy as the loss function to obtain the labels in deep learning, together with a self-developed word embedding. The results show that RF leads with F1-scores of 88.4% for the food aspect (with CC) and 89.54% for the price aspect (with LP). For the service and ambience aspects, ET leads with 92.65% and 87.1%, using LP and CC, respectively. In the deep learning comparison, GRU and BiLSTM obtained the same F1-score for the food aspect, 88.16%. For the price aspect, GRU leads with 83.01%. For service and ambience, however, BiLSTM achieved higher F1-scores, 89.03% and 84.78%.
Since the label distribution in our data is imbalanced, future work should consider balancing methods such as oversampling or undersampling. Data augmentation could also be applied to obtain new data for labels with few examples. In addition, more features need to be added to enhance the performance of both machine learning and deep learning.