Vietnamese Short Text Classification via Distributed Computation

Social networking has been growing rapidly in Vietnam. The information shared is diverse and circulates in many forms. This calls for user-friendly solutions such as topic sorting and perspective analysis for analyzing community trends and advertisements, or for anticipating and monitoring the spread of bad news. Unfortunately, Vietnamese differs considerably from other languages, and little research on message classification has been reported in the literature. The implementation of machine learning models for Vietnamese has not been thoroughly investigated, and the performance of these models when applied to a different language is unknown. Vietnamese text is a sequence of syllables; hence, word boundary identification is not trivial. This research presents our effort to construct an effective distributed framework for classifying short Vietnamese texts on social networks using probabilistic classification. We argue that addressing this task demonstrates a successful combination of machine learning, natural language processing, and ambient intelligence. The proposed framework is effective, enables fast computation, and is suitable for implementation in Apache Spark, meeting the demand for processing the large amounts of textual data on current social networks. Our data, collected from several online text sources, consists of 12412 short messages classified into five topics. The evaluation shows that our approach achieves an average classification accuracy of 82.73%. Having carefully surveyed the literature, we believe this is the first attempt to classify short Vietnamese messages under a distributed computation framework.

Keywords: Short text classification; Naïve Bayes; Apache Spark; Vietnamese; distributed computation


I. INTRODUCTION
Social networking, also called virtual social networking, is a service that connects members on the Internet for many different purposes regardless of space and time. The online social network is representative of Web 2.0: it simulates real social relationships, using Web technology to connect members and allow them to create and share information with each other through mechanisms such as making friends, chatting, liking, sharing, tagging photos, commenting, or subscribing to a blog or a channel. In Vietnam in particular and the world in general, Facebook is the most popular social network. A person who joins a social network can post or comment on another person's post. A post can be a thought of self-expression, a feeling about a subject, a comment on a problem, or simply an announcement of an event. If such messages interest many people, they are shared, commented on further, and become hot topics. Innovative social networks have completely changed how netizens connect and have become an inevitable part of everyday life for hundreds of millions of members around the world. These services provide a platform for participants to find friends and partners based on groups (such as school names or city names), personal information (such as e-mail addresses or phone numbers), personal interests (sports, movies, books, or music), or areas of interest (business, sales). According to [1], four Southeast Asian countries are among the top ten countries with the most Facebook users. In particular, Vietnam ranks 7th with 64 million users, accounting for 3% of all Facebook accounts globally. Thailand is right behind Vietnam, at No. 8 with 57 million users. Indonesia and the Philippines rank No. 4 and No. 6 with 126 million and 69 million accounts, respectively.
Recognizing the popularity of social networks in Vietnam, many businesses, including companies, small and medium enterprises, and especially individual online sellers, have found ways to use social networks for promotion to meet their business goals.
The success of social networks has drawn the attention of research on natural language processing. Among the significant applications, text classification has been described as one of the most important tasks, becoming a major discipline of information systems [2]. A message on a social network can be a message between two people, a status line, or a comment on a status line. The message needs to have certain content: a person who posts on a personal page usually knows the topic he or she is talking about. Social media posts are usually not long and often only express emotions about a topic. A user can publish one or more posts at the same time. These posts are generated continuously and without quantity limits, and they spread quickly and widely [3]. The topics of messages on social networks are almost the same as those of common document types, such as Science, Business, Law, Health, Sports, and Technology. Several difficulties arise. Collecting sample data takes a lot of time and effort, yet the sample dataset plays a very important role in the classification problem. In addition, the topics of social media posts are ambiguous, because messages do not always present a clear purpose; a post can belong to two or three different topics when placed in another context. The messages themselves are short documents that contain very little information. Most messages on social networks have images or videos attached, which limits content analysis: the accompanying text is sometimes just a vague introduction, while the core content is in the videos and images. The very large number of posts requires building on a big data processing platform, moving toward real-time processing, to meet the needs of large-scale data analysis. In Vietnam, there are not many research groups in the field of social network analysis.
There are also few published studies on message classification, especially for Vietnamese, which makes it difficult to compare results and evaluations.
Consequently, the need to classify electronically available text-based messages has grown significantly. One can argue that automated text classification is one of the most crucial tasks in social media analytics [4]. Many studies have addressed related problems such as topic modeling [5], [6], [7], [8], geolocation [9], [10], and document classification [11]. Although many Vietnamese documents are electronically available, no recent study has addressed the classification of short Vietnamese messages. This research presents our effort to construct an effective distributed framework for classifying short Vietnamese texts on social networks. We have collected a large Vietnamese text collection of 12412 messages and investigated the message representation, word tokenization, and learning method that suit the requirements of acceptable classification accuracy and distributed computation.
The rest of this paper is organized as follows. First, we discuss previous research on short Vietnamese text classification in Section II. Next, we summarize the fundamental materials and methodology in Section III. The proposed framework for the probabilistic classification of short Vietnamese messages is presented in Section IV. In Section V, the experiments are discussed in detail. Finally, the conclusion is stated in Section VI.

II. RELATED WORKS
Text classification is a classic problem in data mining and machine learning [12]. The main goal of the classification problem is to find the appropriate topic for a document from a set of predefined topics. Streams of text classification research have appeared in several top-tier conferences, journals, and workshops. The criterion for selecting the appropriate topic for a document is its semantic similarity to the texts in the training material. Automatically sorting texts into topics makes it easier to organize, store, and query documents later. Besides, text classification is also used to assist in searching and extracting information [13], [14].
Over the years, the research field of ambient intelligence has witnessed remarkable achievements. Scholars have raised the potential integration between natural language processing and ambient intelligence, which can generate cutting-edge research leading to substantial technological advances. A vital part of many information repositories is textual sources across various media, usually in unstructured form. One can conclude that the capacity to process and make use of unstructured information can raise ambient intelligence systems to a much higher level of quality. Applying machine learning to the task of text classification involves transforming a variety of textual sources into structured knowledge. The authors argue that the combination of machine learning, natural language processing, and ambient intelligence opens up exciting research challenges [15].
A text classification task spans four research areas: feature extraction, dimensionality reduction, classification algorithms, and evaluation. Texts are unstructured data and need to be transformed into a feature space. Common feature extraction techniques, from basic to more advanced, include TFIDF, Word2Vec [16], and GloVe [17]. The most crucial step in a text classification task is choosing the best classifier, ranging from non-parametric techniques, e.g., k-nearest neighbors, to simple logistic regression, to tree-based classifiers, to deep learning-based models. Recently, we have witnessed the success of deep learning approaches over previous classification models owing to their excellent performance and their capacity to address non-linear relationships within text data. However, regarding feasibility, the Naïve Bayes algorithm, which is computationally inexpensive and consumes little memory, remains one of the most widely used generative models. These implementations have one characteristic in common: the classifier is trained on a single machine, ranging from a regular personal computer to a dedicated server. When handling a large amount of data, the training phase can be extended across several machines in a cluster, which yields the idea of distributed computation [18], [19], [20], [21]. In our implementation, we deploy our distributed framework on Apache Spark.
An early effort to address Vietnamese text classification was made more than a decade ago [22]. In that paper, the authors solved the problem of automatically categorizing given textual sources into predefined categories. They compared statistical N-gram language modeling and bag-of-words approaches on their collected dataset. Although they achieved a good accuracy score, the implemented models were not efficient in terms of computation time, e.g., only three documents/second compared to 776 documents/second in our distributed framework. Automatic text categorization has also been studied by comparing the performance of several term weighting schemes rather than analyzing the actual classification task [23]. Nevertheless, these approaches investigated short messages in very different classification tasks. For the problem of classifying Vietnamese text, many research projects have been published, but their work was done in an isolated environment [24], [25], [26]. Having carefully surveyed the literature, we believe this is the first contribution to classify short Vietnamese messages under distributed computation.

III. MATERIALS AND METHODOLOGY

A. Problem Definition
Given a set of $n$ input texts denoted as $D = \{d_1, d_2, \ldots, d_n\}$, we apply processing techniques to classify them into a set of $m$ classes denoted as $C = \{c_1, c_2, \ldots, c_m\}$. An example of text classification is the arrangement of news in newspapers into corresponding categories such as Sports, Entertainment, and Society. This can be done manually by editors, but this method faces the following difficulties: (i) It takes a lot of time and effort. (ii) Manual classification is sometimes inaccurate because the decision depends on the understanding and motivation of the person performing it. (iii) For some professional fields (medical, legal, economic), experts are needed, and the decisions of several experts may contradict each other. (iv) When the number of documents is relatively high, an expert might find the task difficult to carry out.

Example word-segmented messages (message IDs in brackets), each followed by its English translation:

[00202] Trong hàng triệu triệu cổ_động_viên hướng về đội_tuyển quốc_gia Việt_Nam ngày hôm_nay những_ai có_mặt tại sân_vận_động chứng_kiến trực_tiếp trận đấu có_lẽ là những người may_mắn nhất nhưng có_lẽ cũng là những người vất_vả nhất Thời_tiết ở thành_phố Thường_Châu những ngày này rất khắc_nghiệt với nhiệt_độ âm và tuyết đã rơi trắng đường Nhưng chẳng điều gì ngăn được bước chân của các cđv Việt_Nam.
Translated: Among the millions of fans turning toward the Vietnam national team today, those present at the stadium to witness the match directly were probably the luckiest people, but perhaps also the ones who had it hardest. The weather in Changzhou City these days is very harsh: the temperature is below zero and snow has covered the roads in white. But nothing stops the footsteps of Vietnamese fans.

[03189] Hè sắp đến rồi CÙNG GIẢI NHIỆT MÙA HÈ THÔI Hay lên kế_hoạch cho những chuyến vi_vu xả_hơi để tận_hưởng thắng_cảnh tại nước_ngoài hay các biển đảo mới được biết đến như Đảo_Nam_Du hoặc tận_hưởng ngắn ngày cho một chuyến du_lịch về vùng sông_nước hoà_mình với thiên_nhiên chưa Hãy cùng Du_Lịch_Việt chuẩn_bị cho một chuyến du_lịch cùng mùa.
Translated: Summer is coming, so LET'S COOL OFF THIS SUMMER. Have you planned a relaxing trip to enjoy the sights abroad or newly known islands such as Nam Du Island, or a short trip to the river region to immerse yourself in nature? Let's prepare for a trip this season with Du Lich Viet (Vietnam Travel).

B. Text Pre-processing
Data pre-processing is the first important step of any data mining process. It makes data in its original form easier to observe and explore. For text classification, each language has its own specific characteristics. Pre-processing helps improve classification efficiency and reduce the complexity of the training algorithm. Depending on the purpose of the classifier, different pre-processing methods apply, such as:
• Convert text to lowercase and correct spelling errors.
• Remove punctuation marks (if no sentence separation is performed).
• Separate words by the single-word method (English) or the compound-word method (Vietnamese).
• Remove stopwords, i.e., words that appear most often in the text but carry no meaning for text classification.
• Normalize words by reducing them to their base form (usually applicable to English).
• Convert text into vectors as input for the classification learner.
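The first few steps in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `preprocess` and the five-word stopword set are hypothetical, and the input is assumed to be already word-segmented, with the syllables of a compound word joined by underscores.

```python
import string

# Hypothetical mini stopword list for illustration only; it is not the
# stopword list used in the paper.
STOPWORDS = {"là", "và", "của", "những", "các"}

def preprocess(text, stopwords=STOPWORDS):
    """Lowercase a message, strip punctuation, and drop stopwords.

    Tokens are assumed to be whitespace-separated, with compound words
    already joined by underscores (e.g. "sân_vận_động").
    """
    text = text.lower()
    # Remove ASCII punctuation but keep "_" because it joins compound words.
    punctuation = string.punctuation.replace("_", "")
    text = text.translate(str.maketrans("", "", punctuation))
    return [token for token in text.split() if token not in stopwords]

tokens = preprocess("Thời_tiết ở thành_phố Thường_Châu rất khắc_nghiệt, và tuyết đã rơi!")
print(tokens)
```

Spelling correction and vectorization are left out here because they depend on language resources and on the chosen representation.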

C. Text Transformation and Presentation
One of the first tasks in text classification is to choose an appropriate text representation model. A raw document (in string form) needs to be converted to another model to facilitate representation and calculation. Different classification algorithms call for different representation models. The vector space model is one of the simplest and most commonly used models in this task. A text source is represented as an n-dimensional vector whose components measure the value of the text elements. A document is expressed as a collection of tokens and/or words; each token is considered an attribute or characteristic, and the text corresponds to an attribute vector. After identifying the attributes, we need to calculate the attribute value (or keyword weight) for each text.
We discuss term frequency-inverse document frequency (TFIDF) [27], one of the most fundamental techniques for retrieving relevant documents from a text source or from a collection of text sources. Although TFIDF is fundamental, it has statistically proven its effectiveness in text mining [28], [29].
www.ijacsa.thesai.org
Having gathered all the tokens from the tokenization step, we convert all messages from bag-of-words representations of token counts into sparse vectors with TFIDF weights. This score is often used in text processing and information retrieval. The idea of the TFIDF weight is to calculate a score that expresses the relative importance of words in the documents. The score is statistically measured by evaluating the significance a token gains in a document and in a collection: the importance of a word increases proportionally to the number of times it occurs in a document, offset by its frequency in the corpus. In this way, we discard grammatical structure, word order, and part of speech. Intuitively, the frequency with which a token appears in a message indicates the extent to which the message pertains to that token. The TFIDF weight reflects how significant a token is to a message. The more messages a token appears in, the more it is penalized. The tokens that best characterize a message are those with the highest TFIDF scores.
The TFIDF weight of a word $w$ in a document $d$ of a corpus $C$ is expressed as

$$\mathrm{TFIDF}(w, d, C) = \mathrm{TF}(w, d) \times \mathrm{IDF}(w, C) \tag{1}$$

where TF measures how often a word appears in a document, and IDF is the logarithm of the number of documents in the whole corpus divided by the number of documents in which the word appears. More precisely, the TF is calculated as follows:

$$\mathrm{TF}(w, d) = \frac{n_d(w)}{|d|} \tag{2}$$

where $n_d(w)$ is the number of times the word $w$ appears in the document $d$ and $|d|$ is the total number of words in $d$.
While the TF is calculated on a per-document basis, the IDF is computed on the basis of the entire corpus. Thus, the IDF is calculated as follows:

$$\mathrm{IDF}(w, C) = \log \frac{|C|}{n_C(w)} \tag{3}$$

where $|C|$ represents the number of documents in the corpus and $n_C(w)$ represents the number of documents that contain the word $w$.
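To make the weighting concrete, the following Python sketch computes TF, IDF, and their product directly from the definitions above. The function names and the two-message toy corpus are assumptions made for illustration; library implementations often smooth the IDF term, so their values can differ from this unsmoothed version.

```python
import math
from collections import Counter

def tf(tokens):
    """Term frequency n_d(w) / |d| for every word w of one document."""
    counts = Counter(tokens)
    return {w: c / len(tokens) for w, c in counts.items()}

def idf(corpus):
    """Inverse document frequency log(|C| / n_C(w)) over the whole corpus."""
    doc_freq = Counter(w for doc in corpus for w in set(doc))
    return {w: math.log(len(corpus) / df) for w, df in doc_freq.items()}

def tfidf(corpus):
    """Return one sparse {word: weight} vector per document."""
    idf_scores = idf(corpus)
    return [{w: f * idf_scores[w] for w, f in tf(doc).items()} for doc in corpus]

# Toy corpus of two already-tokenized messages (illustrative only).
corpus = [
    ["đội_tuyển", "việt_nam", "thắng"],
    ["du_lịch", "việt_nam", "mùa", "hè"],
]
for vector in tfidf(corpus):
    print(vector)
```

A word that occurs in every message, such as "việt_nam" in this toy corpus, receives weight zero because its IDF is log(1), which is the penalty for common tokens described above.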

D. Naïve Bayes Classifier
Naïve Bayes is a popular machine learning model thanks to its strong performance [30]. Here it serves as the machine learning approach that we utilize in this work. Readers may refer to a mathematics or probabilistic machine learning textbook [31] for more advanced information.
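Given an observation, the model's parameters, and a label, represented by a vector $u$, a set of parameters $\omega$, and a target $t = c$ respectively, the generative model to classify $u$ is defined as follows:

$$P(t = c \mid u) = \frac{P(u \mid t = c, \omega)\, P(t = c)}{\sum_{c'} P(u \mid t = c', \omega)\, P(t = c')} \tag{4}$$

where $P(u \mid t = c, \omega)$, $P(t = c \mid u)$, and $P(t = c)$ are the class-conditional density, the class posterior, and the class prior respectively. Proportionally, Equation (4) can be computed as in the following equation:

$$P(t = c \mid u) \propto P(u \mid t = c, \omega)\, P(t = c) \tag{5}$$

Moreover, assuming that the features are conditionally independent given the class, the class-conditional density $P(u \mid t = c, \omega)$ in Equation (4) is calculated as follows:

$$P(u \mid t = c, \omega) = \prod_{j=1}^{J} P(u_j \mid t = c, \omega) \tag{6}$$

where $u_j$ is the $j$-th of the $J$ components of $u$, which yields a Naïve Bayes classifier.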
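As a rough illustration of the classifier defined above, the sketch below trains a multinomial Naïve Bayes model on raw word counts and predicts by maximizing the log of the prior times the class-conditional density. It is a simplification under stated assumptions, not the paper's Spark implementation: the function names, the Laplace smoothing parameter `alpha`, and the three toy messages are all hypothetical, and counts are used instead of TFIDF weights.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods per class."""
    vocab = {w for doc in docs for w in doc}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_like = {}
    for c in class_docs:
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_like

def predict(tokens, log_prior, log_like):
    """Pick the class c maximizing log P(t=c) + sum_j log P(u_j | t=c)."""
    scores = {
        # Words never seen in training are skipped.
        c: log_prior[c] + sum(log_like[c][w] for w in tokens if w in log_like[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Toy training data (illustrative only).
docs = [["bóng_đá", "trận", "đấu"], ["du_lịch", "biển", "đảo"], ["trận", "đấu", "thắng"]]
labels = ["Sports", "Traveling", "Sports"]
log_prior, log_like = train_naive_bayes(docs, labels)
print(predict(["trận", "đấu"], log_prior, log_like))
```

Working in log space replaces the product over features with a sum, which avoids numerical underflow when messages contain many tokens.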

IV. PROPOSED FRAMEWORK

A. Design Concept
In this research, we introduce a framework to explore and label topics for short Vietnamese messages following the traditional text classification procedure presented in Fig. (1). However, we carry out the message classification on the distributed framework Apache Spark [32], [33]. The complete design of our proposed framework is presented in Fig. (2). Consequently, the heavy load of a traditional machine learning task, e.g., data pre-processing, vectorized representation, and classification, is handled effectively in parallel.

B. Text Classification Pipeline
Depending on the specific case, the text classification problem involves different processes. The basic steps are as follows:
(1) Data pre-processing cleans the data before it is processed in the next steps. It draws on natural language processing concepts such as removing redundant characters, deleting stop words that carry little meaning, removing words that appear in most texts, and spell checking.
(2) Word segmentation is an extremely important step, especially for Vietnamese. There are many ways to segment words; we discuss them further in the next section.
(3) Document representation transforms the input text dataset into attributes compatible with the classification model in the next steps, which makes the problem easier to solve.
(4) Feature extraction finds the core characteristics of the original dataset, in other words, it chooses characteristics that are representative of the dataset as the basis for the algorithm.
(5) Model training uses machine learning algorithms to find the best model.
(6) Classification applies the model trained in the previous step to classify the dataset in practice.

C. Distributed Computing Framework
Apache Spark is widely regarded as one of the most successful cluster computing platforms owing to its fast computation and its general applicability across many research and business domains [33]. Supporting a broad variety of computation types efficiently, Apache Spark handles stream processing and queries by extending the well-known MapReduce model [34]. Moreover, its physical execution engine, built around a DAG scheduler, performs well in processing both batch and streaming data. By design, Apache Spark executes computation directly in the machines' memory, which in turn boosts computing speed significantly. Apache Spark provides multi-purpose APIs for many modern programming languages, e.g., Python, Scala, and Java. Spark Core is the main architecture of Spark, consisting of components for fault recovery, memory management, optimization, task scheduling, and storage interaction. Apache Spark's main programming abstraction is the resilient distributed dataset (RDD), a distributed collection of elements defined by Spark's main architecture. During computation, RDDs are distributed across a cluster of machines and can be processed in parallel effectively and transparently. A wide variety of machine learning functionalities are integrated into Spark's MLlib library [35]. Apache Spark can be deployed on a standalone machine or in association with Mesos [36]. The overall architecture of our distributed framework is illustrated in Fig. (2).
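The map-and-reduce idea behind this style of computation can be imitated on one machine in plain Python. The sketch below is not Spark code and uses no Spark API; the helper names and the partitioning scheme are hypothetical. It computes document frequencies by counting inside each partition independently (the step a cluster would run in parallel) and then merging the partial counts.

```python
from collections import Counter
from functools import reduce

def partition(items, num_partitions):
    """Split a list into roughly equal chunks, mimicking data partitions."""
    return [items[i::num_partitions] for i in range(num_partitions)]

def map_partition(docs):
    """Map step: within one partition, count the documents containing each word."""
    return Counter(w for doc in docs for w in set(doc))

def merge(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right

def document_frequencies(corpus, num_partitions=2):
    # On a cluster, each map_partition call would run on a different worker;
    # here the calls simply run one after another.
    partials = [map_partition(p) for p in partition(corpus, num_partitions)]
    return reduce(merge, partials, Counter())

corpus = [["trận", "đấu"], ["du_lịch", "biển"], ["trận", "thắng", "thắng"]]
print(document_frequencies(corpus))
```

Because the merge is associative and commutative, the result does not depend on how the corpus is partitioned, which is what allows the work to be spread over many machines.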

V. EXPERIMENTS

A. Data Collection
We utilize a commercial tool called Facebook Fplus [37], developed by the domestic company FPLUS24H. For each topic, we choose Facebook pages based on their numbers of likes and followers, in the belief that these pages focus on writing articles related to their main subject; we select the pages with the most likes and followers compared with other similar pages. Our data collection statistics are presented in Table (II). Messages are filtered by topic through the following steps. First, we filter out empty messages and messages holding too little content, which leads to unclear meaning and unknown topics. Second, we remove messages embedding videos or images whose accompanying texts do not reflect the content. Third, we filter out misspelled messages, e.g., those written without Vietnamese accents. Fourth, we remove messages whose topics do not match their contents. Finally, we split the messages into separate files for easy storage and processing in Apache Spark.
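Two of the filtering steps above lend themselves to a simple automatic check: dropping near-empty messages and dropping messages typed without Vietnamese accents. The sketch below is one possible way to do this and is not taken from the paper; the threshold `MIN_TOKENS` and both function names are assumptions.

```python
import unicodedata

MIN_TOKENS = 3  # hypothetical threshold for "too little content"

def has_vietnamese_diacritics(text):
    """Return True if the text contains an accented letter or the letter đ."""
    lowered = text.lower()
    if "đ" in lowered:
        return True
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", lowered)
    return any(unicodedata.combining(ch) for ch in decomposed)

def keep_message(text):
    """Keep messages that are long enough and written with accents."""
    if len(text.split()) < MIN_TOKENS:
        return False
    return has_vietnamese_diacritics(text)

messages = ["Hè sắp đến rồi", "He sap den roi", "ok"]
print([m for m in messages if keep_message(m)])
```

Checks that need human judgment, such as whether a topic matches the content, cannot be captured by a rule like this.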

B. Vietnamese Text Tokenization
For the tokenization task, we utilize vnTokenizer [38] in our research, see Table (I). Comparing the tokenization accuracy of different tools is beyond the scope of this research paper. We utilize a list of 1942 Vietnamese stopwords [39] in our data processing.

C. Evaluation Metrics
Suppose we are solving a binary classification task with a labeled dataset $D = \{(x_i, t_i)\}$. A threshold parameter $\varphi$ guides our decision rule $g(x)$, which labels each observation as positive or negative. Let TP, FP, TN, and FN denote the numbers of true positives, false positives, true negatives, and false negatives produced by $g(x)$. We also define $m_+$ the total of condition positives, $m_-$ the total of condition negatives, $\hat{m}_+$ the total of predicted condition positives, $\hat{m}_-$ the total of predicted condition negatives, and $m$ the total population.
We can compute the sensitivity, also known as true positive rate (TPR), probability of detection, or recall, by using:

$$\mathrm{TPR} = \frac{\mathrm{TP}}{m_+} \tag{7}$$

Similarly, we can compute the fall-out, also known as false positive rate (FPR) or probability of false alarm, by using:

$$\mathrm{FPR} = \frac{\mathrm{FP}}{m_-} \tag{8}$$

The true negative rate (TNR) or specificity is defined as follows:

$$\mathrm{TNR} = \frac{\mathrm{TN}}{m_-} \tag{9}$$

The false negative rate (FNR) or miss rate is calculated as follows:

$$\mathrm{FNR} = \frac{\mathrm{FN}}{m_+} \tag{10}$$

If we work with a dataset for binary text classification in which the number of negatives is very large, or a dataset for multiclass text prediction in which class imbalance exists, considering TPR, FPR, TNR, and FNR by themselves is not very informative. Before going further, we define the positive predictive value (PPV) or precision as follows:

$$\mathrm{PPV} = \frac{\mathrm{TP}}{\hat{m}_+} \tag{11}$$

By combining Equations (7) and (11), we can compute the F1-score as follows:

$$F_1 = \frac{2 \cdot \mathrm{PPV} \cdot \mathrm{TPR}}{\mathrm{PPV} + \mathrm{TPR}} \tag{12}$$

which is widely used in information retrieval systems.
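The rates defined in this section follow directly from the four confusion-matrix counts. The sketch below is a minimal illustration; the function name `binary_metrics`, the 0/1 label encoding, and the convention of returning 0.0 when a denominator is zero are assumptions.

```python
def binary_metrics(y_true, y_pred):
    """Compute TPR, FPR, TNR, FNR, PPV and F1 from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    m_pos, m_neg = tp + fn, fp + tn   # condition positives / negatives
    m_hat_pos = tp + fp               # predicted condition positives

    def ratio(num, den):
        # Assumed convention: an undefined rate is reported as 0.0.
        return num / den if den else 0.0

    tpr = ratio(tp, m_pos)
    ppv = ratio(tp, m_hat_pos)
    return {
        "TPR": tpr,
        "FPR": ratio(fp, m_neg),
        "TNR": ratio(tn, m_neg),
        "FNR": ratio(fn, m_pos),
        "PPV": ppv,
        "F1": ratio(2 * ppv * tpr, ppv + tpr),
    }

print(binary_metrics([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 1, 0, 0, 0, 0]))
```

For a multiclass task such as topic classification, the same function can be applied once per topic by treating that topic as the positive class and all others as negative.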

D. Implementation
The fundamental goal of machine learning models is to make accurate predictions on unseen observations. To estimate the strength of a particular learning model, practitioners usually split the data into several portions, each serving a specific purpose in the machine learning pipeline. More specifically, the data is split into a training set containing samples to train the model and a test set consisting of instances that provide an unbiased evaluation of the investigated learning model. We consider the following splitting schemes:
• Splitting scheme a. The training and test parts are 50% and 50%, respectively. We denote it as 50|50 or 50% hereafter.
• Splitting scheme b. The training and test parts are 60% and 40%, respectively. We denote it as 60|40 or 60% hereafter.
• Splitting scheme c. The training and test parts are 70% and 30%, respectively. We denote it as 70|30 or 70% hereafter.
• Splitting scheme d. The training and test parts are 80% and 20%, respectively. We denote it as 80|20 or 80% hereafter.
• Splitting scheme e. The training and test parts are 90% and 10%, respectively. We denote it as 90|10 or 90% hereafter.
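The five splitting schemes can be reproduced with a small helper. The sketch below is an assumption about how such a split might be done, not the paper's Spark code: the function name `split_dataset`, the fixed seed, and the exact-size cut are all illustrative choices.

```python
import random

def split_dataset(samples, train_percent, seed=0):
    """Shuffle the samples and cut them into train and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed for repeatable splits
    cut = len(shuffled) * train_percent // 100
    return shuffled[:cut], shuffled[cut:]

# 12412 is the number of messages reported for the dataset.
for train_percent in (50, 60, 70, 80, 90):
    train, test = split_dataset(range(12412), train_percent)
    print(f"{train_percent}|{100 - train_percent}: {len(train)} train, {len(test)} test")
```

Changing the seed between repetitions gives a different random split each time, which is one way to obtain repeated runs over which scores can be averaged.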
All experiments have been conducted on an ordinary laptop hosting the distributed computing infrastructure on virtual machines. The environment specifications are: Intel Core i7 MQ CPU, 8 GB of RAM, NVIDIA GT 740M graphics card, Apache Spark 2.2.0, IntelliJ IDEA 2017 ver 2.6 IDE, and the Scala programming language.

E. Experimental Results
All experiments have been conducted five times to ensure the performance stability of the system. We then report the average scores and their standard deviation, i.e., a measure of the amount of variation or dispersion across our five runs. A low standard deviation indicates that the scores tend to be close to the mean of the five runs, while a high standard deviation indicates that the scores are spread out over a broader range. The experimental process is split into two scenarios. First, we investigate the performance on each topic separately. We present the experimental results on five topics (see Table (III) for Sports, Table (IV) for News, Table (V) for Traveling, Table (VI) for Sales, and Table (VII) for Technology). Second, we investigate the performance on the complete dataset, see Table (VIII) and Fig. (3).

F. Remark and Discussion
The average execution time of the system is calculated over two phases. Phase I: filter characters, segment words, vectorize messages, and build the vector space model for the dataset. The average execution time is 20 minutes 55 seconds. Phase II: divide the data into two parts, train the machine learning model, predict the test set, calculate the accuracy for each topic, and analyze the system's efficiency. The average execution time is 16 seconds, which demonstrates the feasibility of the Naïve Bayes classifier. It works well with text data and is fast in comparison to other classification algorithms. A limitation of Naïve Bayes is its biased assumption about the shape of the data distribution. The model's predictive capacity is also limited by data scarcity and by the frequency of words in the whole text source.
With the highest accuracy of about 83.18% and a minimal standard deviation of 0.93%, the experimental results demonstrate the feasibility and computational stability of our proposed system. The topic with the highest predictive rate is Traveling, with 92.63%. The Sales topic has the lowest rate, with 73.80%. This can be explained by the fact that the Lazada fan page specializes in selling goods online, and the categories of goods are very diverse. The standard deviation of all experiments is quite low, which indicates the performance stability of the proposed system. Furthermore, the execution time of the system is relatively short, especially for the classification process. When the dataset is large enough, the learning process only occurs once, whereas the classification process must be repeated over time.
TFIDF is one of the most popular term-weighting schemes today, as 83% of text-based recommender systems in digital libraries use it [40]. It is widely supported in many machine learning libraries and can be applied effectively off the shelf. Therefore, traditional TFIDF is applied in this paper. The pitfall of this text representation is that it ignores the semantics and syntax of the text. The calculation time also depends on the number of unique words across the text corpora. Tuning might be considered to address the shortcomings of the TFIDF algorithm with respect to the classification accuracy of the machine learning models used, setting aside the computational efficiency of the classification process. How to improve accuracy together with efficiency is a direction for future research.
To confirm how our proposed solution performs, we have conducted several experiments in which the Naïve Bayes classifier is replaced by logistic regression, decision trees, and random forests. The operating configuration is the same, except for the models themselves. The experimental results reported in Tables IX, X, and XI confirm the advantage of our proposed solution.

VI. CONCLUSION
Discovering and identifying topics for social network messages is an urgent problem in the context of the current social network explosion. The topics explored from these messages, in combination with perspective analysis, will contribute to predicting the spread of the messages. This helps develop solutions to monitor and prevent bad information that spreads on social networks and causes serious impacts. The paper has proposed and built a distributed framework for probabilistic classification using Apache Spark to meet the need to handle large amounts of data. The Naïve Bayes classification method is suitable for building on a big data processing platform. Initial results show an accuracy score of about 83%, which can be further improved by collecting more data. We have also built a set of social network messages covering five topics for analyzing and researching social networks. The paper contributes to solving the problem of classifying the topics of short messages appearing on social networks in Vietnam.