Machine Learning Techniques for Sentiment Analysis of Code-Mixed and Switched Indian Social Media Text Corpus: A Comprehensive Review

This paper presents a comprehensive review, based on recent research studies, of sentiment analysis for code-mixed and code-switched text corpora from Indian social media using machine learning (ML) approaches. Code-mixing and code-switching are linguistic behaviors shown by bilingual and multilingual populations, primarily in spoken but also in written communication, especially on social media. Code-mixing involves combining lower linguistic units, such as words and phrases of one language, into the sentences of another language (the base language), whereas code-switching involves switching to another language for the length of one sentence or more. In both cases, a bilingual person takes one or more words or phrases from one language and introduces them into another language while communicating in that language in spoken or written mode. People nowadays express their views and opinions on several issues on social media. In multilingual countries, people express their views using English as well as their native languages. Several reasons can be attributed to code-mixing: lack of knowledge in one language on a particular subject, empathy, interjection and clarification, to name a few. Sentiment analysis of monolingual social media content has been carried out for the last two decades. In recent years, however, Natural Language Processing (NLP) research has also shifted towards the exploration of code-mixed data, making code-mixed sentiment analysis an evolving field of research. Systems have been developed using ML techniques to predict the polarity of code-mixed text corpora and to fine-tune existing models to improve their performance.
Keywords—Sentiment analysis; code mixing; corpus; deep learning; machine learning; NLP; social media text

www.ijacsa.thesai.org
to analyze and decipher information from the text collected from well-known social networking platforms. However, the task is challenging for a number of reasons. The text on these platforms is characterized by spelling errors, meta tags (hashtags), creative spellings (f9 for fine), abbreviations (BTW for by the way), phonetic typing (becoz for because), word play (gooood for good) and so on [4]. All these characteristics make it challenging for an NLP researcher to deduce valuable information from the text. Moreover, a considerable percentage of the text available on these sites is in languages such as Spanish, Chinese, Arabic, Hindi, Urdu, etc. In the recent past, people, especially in bilingual countries like India, have not only used native scripts to write in their own languages but have also written in the Roman script to express their feelings.
As a result, people write in code-mixed or code-switched form. A code, in communication, is a rule for converting a piece of information into another form of representation. Code-mixing and code-switching are used in bilingual communities where people prefer their native language and a second language in different domains. Although the terms code-switching and code-mixing are often used interchangeably, there are a few differences between the two. Code-switching is the process of shifting from one language to another, whereas code-mixing is the mixing of linguistic units such as words, phrases, morphemes, clauses, affixes and modifiers of one language into the expressions of another language. Thus, code-switching is inter-sentential, whereas code-mixing is intra-sentential and is constrained by grammatical principles.
Translation: The principal will reject your application. Take it from me.

Example of Code-Switching
The principal will reject your application. Likh kay leylo.
Translation: The principal will reject your application. Take it from me.
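The inter-sentential versus intra-sentential distinction can be made concrete with a small sketch. It assumes that every token already carries a language tag; the tags ("en", "native"), the function names and the code-mixed sample sentence are hypothetical and hand-made for illustration, and real systems need an automatic language identifier first.

```python
from typing import List, Set, Tuple

# A sentence is a list of (token, language_tag) pairs.
# The tags are hand-assigned here; they are an assumption, not system output.
Sentence = List[Tuple[str, str]]


def sentence_languages(sentence: Sentence) -> Set[str]:
    """Return the set of language tags that occur in one sentence."""
    return {lang for _, lang in sentence}


def describe(utterance: List[Sentence]) -> str:
    """Label an utterance as code-mixing, code-switching or monolingual."""
    per_sentence = [sentence_languages(s) for s in utterance]
    # Two languages inside a single sentence: intra-sentential mixing.
    if any(len(langs) > 1 for langs in per_sentence):
        return "code-mixing"
    # Each sentence is monolingual but the language changes between sentences.
    if len(set().union(*per_sentence)) > 1:
        return "code-switching"
    return "monolingual"


# The code-switching example from the text, tagged by hand.
switched = [
    [(w, "en") for w in "The principal will reject your application".split()],
    [(w, "native") for w in "Likh kay leylo".split()],
]
# A hypothetical intra-sentential example.
mixed = [[("yeh", "native"), ("movie", "en"), ("acchi", "native"), ("hai", "native")]]
```

With these inputs, `describe(switched)` yields "code-switching" and `describe(mixed)` yields "code-mixing". The rule is deliberately coarse: an utterance that does both is reported as code-mixing.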
There are many reasons why people use a multilingual approach while expressing themselves on the web and social media sites. Code-mixing and code-switching occur in informal communication and are used by multilingual speakers. A list of reasons why code-mixing occurs is given in [5]: bilingualism, the speaker and the speaker's partner, the social community, the situation, vocabulary and prestige are the main reasons for code-mixing on social media platforms.
The main reason for code-mixing or switching can be the absence of a specific word or phrase in a language, which compels a person to use a word or phrase from his/her native language to help the receiver understand it better. The detailed motivation and reasons for code-mixing and code-switching are explained in [5]. On social media platforms in a multilingual society, people often mix multiple languages to express their feelings. However, they do not use native language scripts; rather, they prefer the Roman script to compose non-English words. Automatic language detection in such a scenario is a herculean task. Sentiment analysis, also referred to as opinion mining or emotion analysis, is the process of identifying, recognizing or categorizing people's views and reviews of a service, a product, a social issue, an event or a moment into "positive", "negative" and "neutral" classes [6]. Sentiment analysis of a dataset containing code-mixed text is a laborious process, ranging from data preprocessing and language identification to classification. The challenges that need to be addressed before assigning sentiments are posed mainly by unstructured sentences, mixed language constructs, spelling variants, grammatical mistakes, etc. [7]. Also, because of the noisy nature of code-mixed data and the non-availability of annotated resources, sentiment extraction from code-mixed text has become a challenging task [8]. Therefore, sentiment analysis of multilingual text has become an increasingly important research area [9]. The general workflow of the code-mixed text sentiment analysis process is shown in Fig. 1.
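The preprocessing stage of such a workflow can be sketched as a small normalizer for the kinds of noise described earlier (creative spellings, abbreviations, phonetic typing, elongated words). This is a minimal illustration only: the lookup table is hypothetical, built from the examples quoted in this paper, and the rule that squeezes repeated letters is a heuristic assumption rather than a method from any of the reviewed studies.

```python
import re

# Hypothetical lookup table built from the examples quoted in the text.
NORMALIZATION_TABLE = {
    "f9": "fine",
    "btw": "by the way",
    "becoz": "because",
}


def squeeze_elongation(token: str) -> str:
    """Collapse runs of three or more identical characters to two."""
    return re.sub(r"(.)\1{2,}", r"\1\1", token)


def normalize(text: str) -> str:
    """Lowercase, squeeze elongated words and expand known variants."""
    cleaned = []
    for token in text.lower().split():
        token = squeeze_elongation(token)
        cleaned.append(NORMALIZATION_TABLE.get(token, token))
    return " ".join(cleaned)
```

For example, `normalize("BTW the movie was gooood becoz acting was f9")` gives "by the way the movie was good because acting was fine". The heuristic is imperfect: an elongated "sooo" becomes "soo" rather than "so", which is why practical systems usually add a dictionary check.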
A comprehensive review of ML techniques for sentiment analysis of the code-mixed text is presented in this paper. Techniques and approaches of ML and Deep Learning (DL) for bilingual or multilingual text Sentiment Analysis are described along with their corresponding results in different scenarios and using different types of datasets.
The key research highlights of this study are:
• To explore and report the current state of research in code-mixed and switched languages using various machine learning and deep learning techniques.
• To present the results of various machine learning models in terms of the performance metrics used by recent studies on code-mixed and switched English with Indian languages.
The paper is organized as follows: Section II and its subsections describe the Machine Learning and Deep Learning methods used in sentiment analysis of code-mixed social networking data. Section III presents the results of sentiment analysis of code-mixed Indian languages. Section IV presents a discussion of the study. The conclusion is presented in Section V.
Machine Learning allows computers to learn new tasks without being explicitly programmed to perform them. In sentiment analysis, ML can be used to analyze text for polarity. Sentiment analysis models have been trained to analyze and understand complex natural language phenomena such as human patterns of speech, the context of a sentence, sarcasm, idioms, negation and metaphors with reasonable accuracy [10]. Researchers have successfully proposed various approaches for sentiment analysis of English language data using Machine Learning and Deep Learning models [11] [12].
Deep Structured Learning, commonly known as Deep Learning, has attracted considerable attention in Machine Learning research in the recent past [13]. Deep Learning uses multiple layers to extract higher-level features from the given input data. It is used for a number of applications, viz. text analysis, pattern analysis, classification and image processing, and uses non-linear processing for feature extraction and transformation in the supervised and unsupervised domains [14]. Deep Learning techniques allow computational models composed of several processing layers to learn representations of data with multiple levels of abstraction. The word "deep" in Deep Learning refers to the number of layers that form the neural network; traditional neural networks had three layers, viz. input, hidden and output. The larger the number of hidden layers, the deeper the neural network [15]. Sentiment analysis approaches using Machine Learning and Deep Learning are illustrated in Fig. 2.

A. Support Vector Machine
Support Vector Machine (SVM), designed by Vladimir Vapnik in 1995 [16], is a popular and robust classification and regression algorithm for data analysis and pattern recognition [17]; with kernel functions it can also act as a non-linear classifier. The goal of SVM is to find the optimal hyperplane that maximizes the margin between the data points of two classes. If the data is unlabeled, Support Vector Clustering is used [18]. The SVM classification concept is illustrated in the plot given in Fig. 3. The support vectors are the data points closest to the hyperplane, lying at a distance equal to the margin.
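The idea of a separating hyperplane with a margin can be sketched with a tiny linear SVM trained by sub-gradient descent on the regularized hinge loss. This is an illustrative toy under stated assumptions, not the solver used in any of the reviewed studies: the learning rate, regularization strength, number of epochs and the two-dimensional data points are all hypothetical.

```python
def train_linear_svm(points, labels, lr=0.1, reg=0.01, epochs=100):
    """Fit weights w and bias b so that sign(w.x + b) separates the labels (+1 / -1)."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Point is inside the margin or misclassified: hinge-loss step.
                w = [wi + lr * (y * xi - reg * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Point is safely classified: only the regularizer shrinks w,
                # which is what pushes the solution towards a wide margin.
                w = [wi - lr * reg * wi for wi in w]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Hypothetical, linearly separable toy data.
points = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0), (-2.0, -2.0), (-3.0, -1.0), (-1.0, -3.0)]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(points, labels)
```

After training, the points whose value of `y * (w.x + b)` is close to 1 play the role of the support vectors described above.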
A word-level classification of English-Nepali and English-Spanish code-mixed public network data was proposed in [19]. The authors performed experiments with linear kernel SVM classifier using word and character n-gram features. The model achieved an accuracy of 77.5% for Nepali-English and 80% accuracy for Spanish-English using basic features and applying a 6-way SVM classifier. The authors suggested that the features of Neural Network may improve the accuracy.
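Word and character n-gram features of the kind mentioned above can be extracted in a few lines. The sketch below is generic; the boundary markers "<" and ">" around each word are an assumption made here for illustration and are not taken from [19].

```python
def char_ngrams(word, n):
    """Return the character n-grams of a word, with assumed boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def word_ngrams(tokens, n):
    """Return the word n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

For instance, `char_ngrams("kay", 3)` gives `['<ka', 'kay', 'ay>']`, and `word_ngrams(["likh", "kay", "leylo"], 2)` gives `['likh kay', 'kay leylo']`. Character n-grams are attractive for Romanized text because spelling variants of the same word still share most of their n-grams.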
A code-mixed language identification system for Tamil-English and Malayalam-English social communication text was proposed in [20]. The system identifies the language at the word level. Using a character embedding approach, the system used trigram and n-gram features, and an SVM was used for training and testing the model. The proposed model achieved 93% and 95% accuracy for Malayalam-English and Tamil-English data, respectively. The authors suggested that the availability of more code-mixed data and the use of trigram features should be sufficient for the development of a language identification system.
A Hindi-English sentiment analysis system for Twitter data, designed to predict the sentiment present in the data, was proposed in [21]. The researchers used tf-idf and GloVe vector features along with the Support Vector Regression (SVR) model. The model achieved an f-score of 0.662.
Shared tasks on Sentiment Analysis of Indian Languages (SAIL) have been organized to identify sentiments in code-mixed datasets of Indian languages collected from Twitter, Facebook and other social media platforms, especially for the language pairs Hindi-English and Bengali-English [22]. Details of the shared task held during ICON-2017 (the International Conference on Natural Language Processing-2017) were presented by [23]. The goal of the shared task was to identify the sentence-level sentiment polarity of code-mixed datasets for the language pairs Hindi-English and Bengali-English. The authors presented a detailed overview of the problem definition, dataset collection, participant systems and the evaluation process of the shared task. The SVM classifier achieved the best results, using word and character n-gram features for sentiment identification. F-scores of 0.569 for the Hi-En dataset and 0.526 for the Bi-En dataset were achieved.
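Since most of the results in this review are reported as f-scores, a short sketch of how an f-score is computed for a multi-class sentiment task may help. Macro-averaging over the classes is assumed here; the reviewed studies and shared tasks may have used a different averaging scheme, and the label names are illustrative.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 = harmonic mean of precision and recall for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # no correct prediction for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging scheme)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_per_class(y_true, y_pred, lab) for lab in labels) / len(labels)
```

With gold labels `["pos", "neg", "neu", "pos", "neg"]` and predictions `["pos", "neg", "pos", "pos", "neu"]`, the per-class scores are 0.8 (pos), about 0.667 (neg) and 0.0 (neu), so the macro f-score is about 0.489 even though three of five predictions are correct.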

B. Naïve Bayes
Naïve Bayes (NB), a data mining algorithm [24], is a probabilistic ML classification approach derived from the application of Bayes' Theorem, with a vast scope in real-world applications [25]. The approach assigns a new object to a class on the basis of the assumption that all features are independent given the class [26]. For a class $c$ and a feature vector $x$, the theorem can be written as in equation (1) and is illustrated in Fig. 4.
$$P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)} \tag{1}$$
Using the probability concept given by Bayes' Theorem and the independence assumption for the features $x_1, \dots, x_n$, the equation can be represented as:
$$P(c \mid x_1, \dots, x_n) \propto P(c) \prod_{i=1}^{n} P(x_i \mid c) \tag{2}$$
The NB classifier is thus derived from Bayes' Theorem under the assumption of strong independence between the features.
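How the independence assumption turns into a working text classifier can be sketched with a small multinomial Naïve Bayes model. The sketch below assumes whitespace tokenization and add-one (Laplace) smoothing, and the four Romanized training sentences are hypothetical; none of this is taken from the reviewed systems.

```python
import math
from collections import Counter, defaultdict


def train_nb(examples):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        for token in text.lower().split():
            word_counts[label][token] += 1
            vocab.add(token)
    return class_counts, word_counts, vocab


def predict_nb(model, text):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)  # log P(c)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # unseen words carry no evidence in this sketch
            # Add-one smoothing keeps unseen (word, class) pairs non-zero.
            likelihood = (word_counts[label][token] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)  # log P(x_i | c)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical Romanized code-mixed training data.
train = [
    ("movie bahut acchi thi loved it", "pos"),
    ("kya bakwas movie thi waste of time", "neg"),
    ("acchi acting great story", "pos"),
    ("bakwas story boring acting", "neg"),
]
model = train_nb(train)
```

Here `predict_nb(model, "acchi movie")` returns "pos" and `predict_nb(model, "bakwas boring")` returns "neg". Working in log space avoids numerical underflow when many small probabilities are multiplied.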
A system to collect, filter and prepare Twitter data and to identify its sentiment was presented in [27]. The authors applied various supervised ML algorithms, viz. Gaussian NB, Bernoulli NB and Multinomial NB, for the annotation and classification of English-Bengali code-mixed data. The system also applied the Code-Mixed Index (CMI), the Code-Mixed Factor (CMF) and other language aspects to sentiment classification.
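A code-mixing index of this kind can be sketched as follows. The formula assumed here is the utterance-level definition commonly cited in the code-mixing literature, CMI = 100 × (1 − max(w_i) / (n − u)), where n is the number of tokens, u the number of language-independent tokens and max(w_i) the token count of the dominant language. The review does not state the exact formula used in [27], so this is an assumption, as are the tag names.

```python
from collections import Counter


def code_mixing_index(tags, neutral="univ"):
    """Return the assumed utterance-level CMI for a list of per-token language tags."""
    n = len(tags)
    lang_counts = Counter(t for t in tags if t != neutral)
    u = n - sum(lang_counts.values())  # language-independent tokens
    if n == u:
        return 0.0  # nothing but neutral tokens (or an empty utterance)
    dominant = max(lang_counts.values())
    return 100.0 * (1 - dominant / (n - u))
```

A monolingual utterance scores 0, and an utterance tagged `["en", "en", "hi", "hi", "hi", "univ"]` scores 40, because three of its five language-bearing tokens belong to the dominant language.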
A system to classify Hindi-English and Marathi-English tweets and comments on YouTube using a number of ML algorithms such as NB, SVM and KNN for performance evaluation of each algorithm was designed by [28]. The results reveal that NB and SVM performed better than KNN.
An automatic POS tagging system was proposed by [29]. The authors used coarse-grained and fine-grained social media text collected from Twitter and Facebook for experimentation. Machine learning algorithms such as NB, Conditional Random Field (CRF) and Random Forest, along with Sequential Minimal Optimization (SMO), were applied for performance comparison. Various features based on word context information were used in the process. The CRF-based model attained an f1-score of 0.716.
Authors in [30] carried out experiments to construct an English-Punjabi text sentiment classification system. The data was collected from Facebook posts in the agricultural domain. Two classifiers viz. SVM and NB were applied for sentiment identification. Features like unigram and n-gram were applied to the model. The model achieved best accuracy of 85.5% using Naïve Bayes classifier.
A binary sentiment classification model was proposed by [31]. The model used English-Bengali movie review data collected from social networking sites. Two supervised ML algorithms, NB and SVM, were used for the classification and identification of positive and negative sentiments. The experimental results reveal that if the test and training data are of a similar type, that is, the data of both languages is in Roman script, SVM gives better results. Overall, however, Naïve Bayes achieved the best accuracy.

C. Decision Tree
Decision Tree (DT) is a non-parametric ML technique of data mining. It is commonly used in regression and classification problems in areas such as marketing, sentiment analysis, scientific discovery and fraud detection [38], and is one of the best-known supervised ML classification algorithms. The decision tree splits the data into two or more sets; the algorithm computes which features create the best split and uses them, as illustrated in Fig. 5.
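The notion of a "best split" can be illustrated with a sketch that scores candidate thresholds on one numeric feature. Gini impurity is assumed as the split criterion here purely for illustration; the text does not say which criterion the reviewed systems used (J48, for example, is based on information gain ratio), and the feature values are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0.0 for a pure set, larger for mixed sets."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best binary split on one feature."""
    pairs = sorted(zip(values, labels))
    best_threshold, best_impurity = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if impurity < best_impurity:
            best_threshold, best_impurity = threshold, impurity
    return best_threshold, best_impurity
```

For feature values `[1, 2, 3, 10, 11, 12]` with labels `neg, neg, neg, pos, pos, pos`, the best threshold is 6.5 with a weighted impurity of 0.0. A full tree applies the same search recursively to each resulting subset.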
POS tagging is an essential part of NLP. Even for English language data, POS tagging is a complex task. For code-mixed text data it is more challenging still, and it remains a focused research area in which a significant amount of work is needed for Indian language code-mixed data. An approach to POS tagging of code-mixed Indian language text in three language pairs (Hindi-English, Hindi-Bengali and Hindi-Telugu) was presented by [32]. The authors used ICON-2015 code-mixed data and applied the Decision Tree ML algorithm for code-mixed text POS tagging.
A study on sentiment analysis of Hinglish (Hindi-English) code-mixed tweets was carried out in [33]. The datasets were provided in SemEval-2020 (International Workshop on Semantic Evaluation-2020). The system used the J48 Decision Tree as the training classifier and Weka as the classification tool. The performance of the model was evaluated and an f-score of 0.53 was achieved.

D. Random Forest
Random Forest (RF) is a supervised ML approach used for both classification and regression problems [34]. It is an ensemble learning algorithm developed by [35]. The algorithm combines DTs and aggregates their results using averaging. As a type of supervised learning algorithm, RF has been influenced by [36]. The algorithm works on the divide-and-conquer rule. Random Forest has generally shown excellent performance in scenarios in which the number of observations is less than the number of variables [37]. The general workflow of the technique is shown in Fig. 6.
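The way a forest combines its trees can be sketched independently of how each tree is built. The helper names below are hypothetical; the sketch shows bootstrap sampling of the training rows (the usual way each tree receives a different view of the data) and the aggregation step, which is averaging for regression and majority voting for classification.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as done once per tree."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Aggregate class predictions from the individual trees."""
    return Counter(predictions).most_common(1)[0][0]


def average(predictions):
    """Aggregate numeric predictions from the individual trees."""
    return sum(predictions) / len(predictions)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample = bootstrap_sample(list(range(10)), rng)
```

If three trees predict "pos", "neg" and "pos" for a tweet, `majority_vote` returns "pos". Because every tree is trained on a different bootstrap sample, their individual errors tend to differ, and the aggregate is usually more stable than any single tree.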
For the detection of sarcasm in a Hindi-English code-mixed dataset consisting of tweets, a baseline supervised classification approach was proposed by [38]. The authors performed 10-fold cross-validation using a Random Forest classifier. The proposed system also used a linear SVM classifier and an RBF kernel SVM on the same dataset. However, the RF classifier achieved the best f-score, 78.4.
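The 10-fold cross-validation protocol mentioned above can be sketched as an index generator. The interleaved assignment of samples to folds is an assumption made for brevity; it is not the partitioning used in [38], and real experiments usually shuffle or stratify the data first.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    # Interleaved assignment: sample i goes to fold i % k.
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(25, k=10))
```

Each sample is used for testing exactly once and for training in the remaining nine rounds; the reported score is then typically the mean over the ten rounds.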
In collaboration with the Forum for Information Retrieval Evaluation (FIRE), a shared task on Code-Mixed Entity Extraction in Indian Languages (CMEE-IL) was organized in Kolkata, India [39]. Datasets of Hi-En and Ta-En code-mixed social networking data were collected for the shared task and an entity recognition model was developed. A Random Forest Tree classifier was used for classification. Conditional Random Field entity recognition with hybrid features was evaluated on the collected corpus. The model achieved 95% accuracy on the training data and reasonable performance on the testing data.
The researchers in [40] proposed a POS tagger for three Indian code-mixed language pairs, viz. Hi-En, Bi-En and Telugu-English. An RF classifier along with a dictionary was applied to fine-grained and coarse-grained datasets consisting of tweets, Facebook comments and WhatsApp chats collected from ICON-2016 for the three language pairs. The proposed model achieved its best f-score of 78.744 with the fine-grained model on Hi-En tweets and 77.944 with the coarse-grained model on Bi-En Facebook posts.

E. Artificial Neural Network (ANN)
The concept of the Artificial Neural Network (ANN) is derived from the human brain, in which numerous neurons are interconnected to process data in parallel. ANNs are non-linear mathematical models that capture intricate relationships among information sources in order to discover new patterns. ANNs can be applied to a range of tasks, including text analysis, image processing, speech recognition, machine interpretation and clinical determination. An ANN has an input layer of neurons (or nodes), one or two hidden layers of neurons (or even three) and a final output layer of neurons. A typical architecture of an ANN is shown in Fig. 7. In a neural network, each line connecting two nodes (or neurons) is associated with a weight. In Fig. 8, a transfer function computes the weighted sum of the inputs, while the activation function produces the result.
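The split between the transfer function and the activation function can be sketched for a single neuron. The sigmoid is assumed here as the activation purely for illustration; the text does not name a specific activation function, and the weights below are arbitrary.

```python
import math


def transfer(inputs, weights, bias):
    """Transfer function: weighted sum of the inputs plus a bias term."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def sigmoid(z):
    """Assumed activation function, squashing any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias):
    """Output of one neuron: activation applied to the transfer result."""
    return sigmoid(transfer(inputs, weights, bias))
```

A layer is simply many such neurons sharing the same inputs, and a network stacks layers so that the outputs of one become the inputs of the next. With all inputs at zero and no bias, the weighted sum is 0 and the sigmoid output is exactly 0.5.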
The authors of [41] introduced a model for sentiment analysis of Hindi-English text using a sub-word level LSTM. The data was collected from Facebook posts and annotated on a 3-class scale of "positive", "negative" and "neutral". The proposed sub-word level LSTM model achieved higher accuracy than the character-level LSTM model, SVM (unigram) and Naïve Bayes techniques. The proposed system achieved an overall accuracy of 69.7%.
The authors in [42] proposed a model for humor detection in Hindi-English Twitter data. An approach for bilingual word embeddings, based on models such as Word2Vec and FastText, was applied to a BiLSTM system for the detection of humor in the text. The proposed approach achieved an accuracy of 73.6%.
Automatic extraction of sentiments from Hindi-English and Bengali-English Facebook posts was proposed by [43]. The corpus was manually created and annotated. Several preprocessing steps were employed to remove unwanted data from the corpus. A Multilayer Perceptron model was used for the detection of sentiment polarity. The proposed model achieved an accuracy of 68.5%.

F. Convolutional Neural Network
Convolutional Neural Networks (CNNs) have in recent years achieved ground-breaking results in a number of pattern recognition fields, ranging from image processing to voice recognition. The most advantageous feature of CNNs is that they reduce the number of parameters compared with fully connected ANNs. This has prompted researchers and developers to tackle larger models in order to solve more difficult problems. CNNs are similar to conventional Artificial Neural Networks (ANNs) in that they consist of neurons that learn to optimize themselves [44]. The neurons receive inputs and perform operations such as a scalar product followed by a non-linear function, which is the foundation of countless Artificial Neural Networks. The complete network expresses a single score function from the raw input vectors to the final classification output. The general architecture of a CNN for classification is illustrated in Fig. 9.
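The parameter saving comes from sliding one small kernel over the whole input instead of learning a separate weight for every input-output pair. A minimal one-dimensional sketch is given below; the kernel values and the input sequence are hypothetical, and the rectifier and max-over-time pooling are assumed as typical choices for text rather than taken from a specific reviewed model.

```python
def conv1d(sequence, kernel):
    """Slide the kernel over the sequence (no padding) and take scalar products."""
    k = len(kernel)
    return [
        sum(kernel[j] * sequence[i + j] for j in range(k))
        for i in range(len(sequence) - k + 1)
    ]


def relu(values):
    """Non-linear function applied element-wise after the convolution."""
    return [max(0.0, v) for v in values]


def max_pool(values):
    """Max-over-time pooling: keep the strongest response of the filter."""
    return max(values)


feature_map = conv1d([1.0, 3.0, 2.0, 5.0, 4.0], [1.0, -1.0])
pooled = max_pool(relu(feature_map))
```

Here the same two kernel weights produce all four outputs, `[-2.0, 1.0, -3.0, 1.0]`, whereas a fully connected mapping from five inputs to four outputs would need twenty weights. After the rectifier and pooling, the filter contributes the single value 1.0 to the next layer.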
A CNN-based system for sentiment identification in Hindi-English data was proposed by [45]. The sentiment analysis used three classes: "positive", "negative" and "neutral". The classification was done using word-level representations. Since tweets contain informal text, a CNN was used to capture aspects of word orthography in the word-level representation. The model achieved an f-score of 0.324 for Hindi-English data.
To compare ML and DL approaches, the researchers in [46] used three code-mixed datasets, viz. Hindi-English, Bengali-English and Kannada-English. The datasets used in the study were sourced from Facebook posts and SAIL-2017. A number of Machine Learning and Deep Learning techniques were applied to the code-mixed datasets for sentiment analysis, including Doc2Vec, SVM, CNN and BiLSTM. The experimental results showed that the CNN performed better for the Kannada-English dataset, achieving an accuracy of 71.5%. The BiLSTM performed better for the Hindi-English and Bengali-English datasets, with accuracies of 60.22% and 70.20% respectively.
The authors in [47] presented a hybrid model for sentiment analysis of English-Hindi code-mixed data. The method used a CNN architecture to generate sub-word level representations of the sentences. Two BiLSTMs, a collective encoder and a specific encoder, were fed with the sub-word level representation. Finally, a feature network consisting of orthographic features was combined with the BiLSTMs to achieve an accuracy of 83.5%. The hybrid approach therefore combines surface features with attention-based Recurrent Neural Networks to produce a single representation that can be trained for sentiment classification.
For the identification of emotions in Hindi-English Twitter and Facebook data, the authors in [48] proposed a Deep Learning-based system. Several Deep Learning techniques, such as 1D-CNN, LSTM, BiLSTM, CNN-LSTM and CNN-BiLSTM, were used to predict the polarity of a sentence. A pre-trained bilingual model was used to generate the feature vectors. The experimental results showed that the CNN-BiLSTM model achieved the best accuracy, 83.21%.
The authors of [49] used the Hindi-English code-mixed dataset of Facebook comments provided by Trolling, Aggression & Cyberbullying-I (TRAC-I) and applied machine learning and deep learning models to classify the text into three classes: "Covertly Aggressive", "Overtly Aggressive" and "Non-Aggressive". The experimental results showed that the CNN model worked best, with an f-score of 0.58 and an accuracy of 73.2%.
The study in [50] explores hate speech detection in tweets written in Hindi-English. The authors used the DL models CNN-1D, LSTM and BiLSTM to detect the semantics of hate speech along with its context. The embeddings were generated using Word2Vec. The experimental results were compared with contemporary approaches. The CNN-1D model outperformed the other two and achieved an overall accuracy of 82.62%.

G. Recurrent Neural Network (RNN)
RNNs have been used by researchers since the 1990s. A neural network with feedback connections is known as a recurrent net [51]. An RNN is a form of ANN that works with time series or sequential data. Techniques based on RNNs have been used to solve a broad range of problems; machine translation, speech recognition, video tagging, text analysis and image processing are some examples where RNN algorithms are used. The general architecture of an RNN is shown in Fig. 10. Each hidden state has hidden nodes, also called hidden units.
An automatic sentiment prediction system for a Hindi-English code-mixed dataset consisting of tweets was proposed by [52]. The model used a Recurrent Convolutional Neural Network approach to capture the semantics of the text and classify it on a three-class scale. The dataset was collected from the SemEval-2020 shared task. The proposed approach achieved an f1-score of 0.69.
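The feedback connection that defines a recurrent net can be sketched with a scalar hidden state, where each new state depends on the current input and on the previous state. The tanh activation and all weight values are assumptions chosen for illustration; practical RNNs use vectors and weight matrices in place of the scalars shown here.

```python
import math


def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Return the hidden state after each time step of a scalar RNN."""
    states = []
    h = h0
    for x in inputs:
        # Feedback connection: the previous state h re-enters the update.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states


states = rnn_forward([1.0, 0.0], w_x=0.5, w_h=1.0, b=0.0)
```

Although the second input is zero, the second hidden state is non-zero because it still carries information from the first step. This memory of earlier tokens is what makes recurrent models suited to sequential text.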
The authors in [53] [54] proposed a part-of-speech tagger for Hindi-English, Bengali-English and Telugu-English datasets. The datasets were collected from social networking platforms such as Facebook, Twitter and WhatsApp. The proposed model used Recurrent Neural Network (RNN) to predict word-level part-of-speech tags.
A sentiment analysis model for Hindi-English data, based on RNNs, was proposed by [55] [56]. Public Facebook pages of popular personalities in Indian politics and cinema were used to collect the data. The model combines two different BiLSTMs to identify sentiment at the sub-word level as well as at the sentence level. The proposed approach, which also used orthogonal features, achieved an accuracy of 83.5% and an f1-score of 0.827.

III. RESULTS OF CODE-MIXED TEXT SENTIMENT ANALYSIS FOR INDIAN LANGUAGES
On social media sites, netizens in India often use English and their native language, such as Hindi, in a mixed form to express their opinions on a wide range of topics. Over the years, researchers in the field of NLP have shown keen interest in this new form of text, which is often informal and challenging. With the advent of NLP tools and techniques, research on the analysis of code-mixed textual data has also gained momentum. The significant research studies, with their descriptions and the results reported for code-mixed text data analysis and sentiment analysis in Indian languages, are given in Table I.

IV. DISCUSSION
Social networking has emerged as an essential part of our lives. It has not only become a platform for individuals to communicate with each other; it also acts as a news medium and as a platform to connect with people and develop relationships. It gives individuals the opportunity to express their views on a particular product, service, social movement, government policy, etc. Social media thus helps in business and governance tasks. India, being the second most populous country in the world, has a wide range of linguistic diversity. People often express their views in English as well as in their native language, resulting in the proliferation of code-mixed data. The mixing of languages or language varieties, either in oral or in written form, is known as code-mixing.
Sentiment analysis of social media data plays a crucial role in modern commerce and governance. Classical sentiment analysis systems were developed for dealing with product reviews. With the advancement of NLP tools and technologies, sentiment analysis systems were developed for other tasks as well. Sentiment analysis of code-mixed text data is a challenging task at every stage, from data gathering to classification. Various studies have been carried out on "Cross-Lingual Information Retrieval" (CLIR), "Multilingual Information Retrieval" (MLIR) and "Mixed Script Information Retrieval" (MSIR) [84]. In CLIR, a user queries in one language and retrieves the desired information in more than one language. In MLIR, a person can query in one or more languages and retrieve information in more than one language. However, the task of retrieval becomes more difficult when dealing with MSIR, due to the Romanized text of non-English languages. Social media text also contains many non-standard forms, such as misspellings, improper use of grammar, letter substitutions, non-standard abbreviations and other ambiguities, which makes preprocessing a necessary step in the code-mixed scenario. Various tools for POS tagging, language identification and named entity recognition (NER) have been developed for the analysis of code-mixed data in recent years. However, the development of automatic text analysis tools remains challenging because datasets, particularly annotated datasets, are limited for some language pairs, and because these resources and linguistic catalogues for informal code-mixed text are unavailable for the majority of native Indian languages.
Code-mixed text data analysis in multilingual societies like India has become a vital linguistic research area, more specifically for social media content. However, processing such data for linguistic analysis is a challenging task due to its inherent linguistic complexity and the presence of spelling and grammar variations [85]. Therefore, to promote research in code-mixed text, MSIR workshops have been organized at FIRE since 2008 [86]. Various workshops have also been conducted on computational approaches to linguistic code-switching, covering language identification and NER for textual data in code-mixing scenarios [87]. SemEval workshops (International Workshop on Semantic Evaluation) have also been conducted; SemEval-2020 aimed to encourage research in code-mixed sentiment analysis of Twitter data.
This paper provides the results of a review study for the sentiment classification of code-mixed Indian languages. The adopted languages, ML/DL/ANN approaches, data sets and challenges in sentiment analysis of code-mixed text data have been highlighted. The results also show that various studies have been carried out in different application domains, thus each of the domains requires different analysis approaches to achieve better performance.
The results show that the most used ML classifier for the sentiment classification of code-mixed Indian language text is SVM, followed by NB and RF. Ensemble approaches are also used to classify code-mixed text. The study also showed that, in terms of accuracy and f1-measure, neural network approaches perform better than the traditional models. LSTM and BiLSTM algorithms are typically used by researchers for the classification of sentiment in code-mixed datasets. The study reveals that Twitter is the first choice for data collection, followed by Facebook and movie/product reviews. Also, appreciable research has been carried out on Hindi-English text from public networking sites, followed by Bengali-English. Research has also been carried out in other code-mixed Indian languages such as Punjabi-English, Marathi-English, Telugu-English and Malayalam-English. However, annotated datasets, text analysis tools and SentiWordNets are limited or unavailable for most code-mixed Indian language text.

V. CONCLUSION
A comprehensive study of Machine Learning techniques for code-mixed Indian language text collected from popular media platforms has been carried out in this paper. Among traditional Machine Learning approaches, SVM is the first choice of most researchers. In the case of Deep Learning approaches, BiLSTM dominates the research. Twitter data is used for most of the systems, and Hindi-English code-mixed social media text is the most researched. Annotated datasets, text and language analysis tools and other lexical resources are scarce for code-mixed datasets. In our future work, we are going to present a statistical review of Machine Learning approaches for sentiment analysis of code-mixed social media text.