SPAMID-PAIR: A Novel Indonesian Post–Comment Pairs Dataset Containing Emoji

Abstract: The detection of spam content is an important task, especially in social media. It has been studied continually in the Natural Language Processing (NLP) area over the last few years. However, few data sets are available for this research topic because most researchers collect the data themselves and keep it private. Moreover, most available data sets provide only the post content without the comment content. This is a limitation because the post-comment pair is needed to determine the context of a comment on a particular post, and that context may contribute to the decision of whether a comment is spam or not. The scarcity of non-English data sets, including Indonesian, is another issue. To address these problems, the authors introduce SPAMID-PAIR, a novel post-comment pair data set in Indonesian collected from Instagram (IG). It was collected from 13 selected Indonesian actress/actor accounts, each of which has more than 15 million followers, and contains 72874 pairs of data. The data set has been annotated with spam/non-spam labels and is stored in Unicode (UTF-8) text format. The data also includes many emojis/emoticons from IG. To establish baseline performance, the data is tested with several machine learning methods under several scenarios and achieves good performance. This dataset is intended for replicable experiments in spam content detection on social media and for other tasks in the NLP area.


I. INTRODUCTION
Research on text analysis, especially on context-based detection/classification problems, is increasingly important because of the growing need for system automation. Supervised text classification needs a labeled dataset as machine learning data. Unfortunately, the datasets for text classification are mainly in English. Datasets in other languages (Turkish [1], Bangla [2], Chinese [3], Arabic [4], and Moroccan [5]), including Indonesian, are rare [6], [7]. One of the challenges in the NLP area is how to understand the context to get the meaning. Context understanding can also be applied to spam comment detection by detecting a comment's relevance to its post. If the comment is not relevant to its post, it is likely to be categorized as spam. To detect spam comments, machine learning methods require training datasets that fit the context, such as the language, and these are still rare.
The motivation of this research is to overcome the scarcity of Indonesian datasets for text-pair classification, where the context between two paired texts must be captured correctly. The authors have collected a dataset for spam comment detection based on social media posts. This dataset is taken from Instagram (IG), from 13 selected public Indonesian artists/actors, inspired by [8], [9]. Each of these artists/actors has more than 15 million followers. Each row of this dataset, called SPAMID-PAIR, consists of a post and comment text pair. SPAMID-PAIR contains 72874 pairs of posts and comments, broken down into 53837 non-spam data and 19037 spam data. This article introduces the SPAMID-PAIR dataset, a novel dataset that was collected, labeled, validated, and used as training data for spam comment detection based on posts with several machine learning methods. This dataset is intended to contribute as one of the Indonesian datasets in NLP for context-based text-pair classification problems. The SPAMID-PAIR dataset has an advantage because it contains symbols, special characters, and emojis, which are common in social media post and comment texts. This is useful for NLP research because most researchers discard emojis in their classification techniques. Some examples are news article classification [10], Twitter without emoji [11], spam comments from blogs [12], Twitter (emoji removed) [13], SMS and Twitter without emoji [14], Twitter without emoji [15], YouTube comments without emoji [16], video spam comments without emoji [17], and YouTube comments without emoji [18]. Emojis are essential because most social media users use them to express their feelings, such as support or denial, sympathy, joy, sadness, and anger. The emojis in the dataset are needed for research in fields that learn through emoji expression. This dataset uses the UTF-8 format for post and comment data, so the emojis in both posts and comments can be used in research on emoji-pair expressions.

www.ijacsa.thesai.org

The contribution of this paper is two-fold: first, the novel SPAMID-PAIR dataset; second, a supervised text-pair classification baseline on this dataset, implemented with several machine learning algorithms and evaluated using the F1 score. This paper is organized as follows: first, the introduction of SPAMID-PAIR and its purpose; second, the related works on Indonesian NLP datasets; then the experiments and results using this dataset; and finally, the conclusion.

II. RELATED WORKS
Datasets are the primary data source in machine learning. Various machine learning and deep learning techniques depend heavily on data sources for system learning. In reality, however, public datasets are not always available, especially in Natural Language Processing (NLP). Learning datasets in NLP are quite widely available in English, such as IMDB Dataset [19]-[21], SMS Spam UCI [22], FLAIR [23], [24], Twitter Spam [25], [26], YouTube Comments [15], PeerRead [27], and Huggingface Community Datasets [28]. Still, there are few public datasets in other languages, especially Indonesian.
IndoNLU [6] is one of two dataset sources in the field of Natural Language Understanding (NLU); it covers 12 main tasks and was collected in collaboration between universities and industry. IndoLEM [29], the second source, is an NLP dataset source for seven main tasks (POS tagging, named entity recognition, parsing, sentiment analysis, summarization, and word prediction). IndoLEM also provides datasets, Indonesian fastText, and pre-trained BERT that can be used for other tasks. Unfortunately, to the best of our knowledge, no dataset exists for the semantic task of detecting spam comments in social media based on the context of the paired post. This article introduces SPAMID-PAIR to enrich the Indonesian NLP dataset collection for spam comment detection based on post context, which has not been done before.
Spam text detection on social media is mostly done on Twitter [11], [13], [14], [15]. Twitter does not have a post-and-comment pair structure. In contrast, YouTube, Facebook, and Instagram are examples of social media with a post-and-comment pair structure. However, the detection of spam comments in previous studies did not take the post into account. The previous research used some popular machine learning methods. Septiandri and Wibisono used Naïve Bayes, SVM, and XGBoost to detect spam comments from Instagram, and SVM outperformed the others [30]. Zhang used Random Forest to detect Instagram spam posts and achieved good results [31]. Research [32] investigated 11 state-of-the-art machine learning methods in text classification using 71 datasets and found that Stochastic Gradient Boosting, SVM, and Random Forest were the best methods. That research can be used as a reference for choosing machine learning methods in classification.

III. RESEARCH METHOD
This research uses the following steps: data acquisition, dataset construction, data profiling, annotation/labeling, preprocessing, feature extraction/generation, ML algorithm implementation, and evaluation. These stages can be seen in Fig. 1, and a more detailed explanation is given in the following subsections.

A. Data Acquisition
In the data acquisition stage, IG was chosen because 1) IG has a lot of spam comments, especially on Indonesian public figure accounts [33], [34]; 2) posts and comments on IG come in pairs, which suits a pair dataset; 3) IG has a lot of informal posts and comments, and it also contains a lot of emojis; 4) IG does not yet have a spam filtering feature for Indonesian. For comparison, on Twitter (TW), a tweet is a post, but replies from other users must always use a mention tag, so a reply does not take the form of a comment. The reply data is equivalent to the tweet, not a child node of it. On Facebook (FB), a user can create a status/post, and others can comment on it. But on FB, the tone tends to be more formal/serious, so it does not contain many spam comments or emojis. Nowadays, IG is a popular social media platform with many young users, and it is not as serious and formal as FB.
Comparing the three leading social media platforms today, i.e., IG, TW, and FB, IG is the best choice for collecting datasets for spam detection. IG is widely used by public figures such as politicians, artists/actors, and other well-known people. Very few datasets are available in languages other than English and Chinese, especially Indonesian [6], which makes collecting this dataset more important. The SPAMID-PAIR dataset from IG contains post-comment pairs from 13 Indonesian artists/actors with more than 15 million followers, without stating their account names. It is expected that researchers in the NLP field can use this dataset to replicate research and as a reference dataset for spam detection using various algorithms.
The SPAMID-PAIR dataset was retrieved using several tools, such as Instaloader and the Selenium Chrome driver for Python. The initial plan was to take the 50 most recent posts of each account and the 120 most recent comments of each post. Hence, it was estimated that 78000 data could be collected. In reality, however, the data fell short of the plan because some posts do not have as many comments as expected. The dataset was collected in September 2020, and after data retrieval was completed, 72874 pairs of post and comment data were obtained, ready for further processing.
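The planned size can be checked with a few lines of arithmetic. The sketch below assumes the plan was applied per account (13 accounts, 50 posts each, 120 comments per post), which is the reading that reproduces the estimate of 78000; the variable names are illustrative only.

```python
# Planned collection size, using the figures stated in the text.
# Assumption: the "50 posts / 120 comments" plan applies to each of the 13 accounts.
ACCOUNTS = 13            # selected artist/actor accounts
POSTS_PER_ACCOUNT = 50   # most recent posts per account
COMMENTS_PER_POST = 120  # most recent comments per post
COLLECTED_PAIRS = 72874  # pairs actually obtained

planned_pairs = ACCOUNTS * POSTS_PER_ACCOUNT * COMMENTS_PER_POST
shortfall = planned_pairs - COLLECTED_PAIRS
coverage = COLLECTED_PAIRS / planned_pairs

print(planned_pairs)      # 78000
print(shortfall)          # 5126
print(f"{coverage:.1%}")  # share of the plan that was actually collected
```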
Table I displays the usernames of all the artists/actors used in the SPAMID-PAIR dataset. SPAMID-PAIR contains 72874 pairs of posts and comments, broken down into 53837 non-spam data (73.87%) and 19037 spam data (26.13%). Details of the number of spam and non-spam labels per artist/actor, analyzed using Python Pandas, are given in Table II. Table II also shows that the IG ID 24239929 has only 103 data because the user had recently disabled comments, so the data could no longer be retrieved. Spam comments are found for all 13 chosen IG users, in varying percentages. The SPAMID-PAIR dataset consists of 11 fields and is available in Excel format (.xlsx) and comma-separated values (CSV) with UTF-8 encoding, as described per field in Table III. Table IV shows that the number of emojis in this dataset reaches 68%, and the number of emojis in the spam category is higher than in the non-spam category. Table V shows detailed emoji statistics for the dataset. Fig. 2 illustrates the distribution of emojis in the SPAMID-PAIR dataset per IG artist ID and shows how emojis relate to the spam or non-spam label. Fig. 3 shows correlations between some attributes of the SPAMID-PAIR dataset. First, it shows a correlation between the length of comments and spam labels. There is also a correlation between the length of comments and the number of emojis. Lastly, there is a correlation between the length of comments and the length of posts.
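A per-account breakdown like the one in Table II can be produced with a Pandas cross-tabulation. The following is a minimal sketch on invented toy rows; the column names `account_id` and `label` are hypothetical, since the actual field names are the ones described in Table III.

```python
import pandas as pd

# Toy rows standing in for the real file. The column names "account_id" and
# "label" are hypothetical placeholders, not the dataset's actual field names.
rows = [
    {"account_id": 1, "label": "spam"},
    {"account_id": 1, "label": "not spam"},
    {"account_id": 1, "label": "not spam"},
    {"account_id": 2, "label": "spam"},
    {"account_id": 2, "label": "spam"},
    {"account_id": 2, "label": "not spam"},
]
df = pd.DataFrame(rows)

# Count spam / not-spam pairs per account (rows: accounts, columns: labels).
counts = pd.crosstab(df["account_id"], df["label"])

# Share of spam per account, in percent.
spam_pct = (counts["spam"] / counts.sum(axis=1) * 100).round(2)

print(counts)
print(spam_pct)
```

With the real file, the toy `rows` would be replaced by a `pd.read_csv(...)` call using UTF-8 encoding, so that emojis in the text fields are preserved.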

B. Data Profiling and Labelling
After the dataset has been collected, the next step is data profiling, labeling, and validation. Labeling gives each data item a "spam" or "not spam" label. A pair is labeled "spam" if the comment, whether text or emojis, is irrelevant to the post. On the other hand, a pair is labeled "not spam" if the post and the comment are relevant to each other. Two native Indonesian labelers carried out the labeling process. Before labeling started, a joint briefing was held between the two labelers to create a common understanding of the meaning of the "spam" and "not spam" labels. After that, labeling was done using an Excel-formatted dataset given to each labeler, with one additional column, "label", which each labeler filled in manually with "spam" or "not spam". The final label was determined by the final agreement of the two labelers. The Kappa score is 0.95, which falls in the "almost perfect" category and indicates that the two annotators reached agreement relatively easily. The difficulty arises when the comment contains only emojis, whose meaning is difficult to determine. However, this can be overcome by looking at the consistency of the emoji types and at whether "positive" emojis are used. If the comment uses "positive" emojis, such as expressions of joy, enthusiasm, support, and love, the label is most likely "not spam". Otherwise, if the post content tends to be "positive" and the comment content tends to be "negative", it is labeled as "spam". Examples of labeling results for data labeled as "spam" and "not spam" can be seen in Table VII.
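The Kappa score mentioned above is Cohen's kappa, $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ the agreement expected by chance. The sketch below computes it from scratch on invented toy annotations; it is an illustration of the statistic, not the authors' script, and the function name is ours.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Invented toy annotations (not taken from SPAMID-PAIR); they differ on one item.
annotator_1 = ["spam", "spam", "not spam", "not spam", "not spam",
               "spam", "not spam", "not spam", "spam", "not spam"]
annotator_2 = ["spam", "spam", "not spam", "not spam", "not spam",
               "spam", "not spam", "spam", "spam", "not spam"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))  # 0.8 for these toy labels
```

Here the two toy annotators agree on 9 of 10 items ($p_o = 0.9$) and the chance agreement is $p_e = 0.5$, giving $\kappa = 0.8$. scikit-learn offers the same statistic as `sklearn.metrics.cohen_kappa_score`.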
Table VII excerpt: "Outfits custom @USER Styling @USER Makeup @USER Hair @USER Photographer @USER"; label: not spam; explanation: the comment replies to a post about how pretty the artist is, because the post shows how beautiful the artist looks in an outfit.

After the labeling had been completed and re-validated, a data profiling step was carried out using Python NLP Profiler to derive additional data from the dataset. It analyzes whether the posts or comments contain emojis. It computes statistics on the number of sentences, characters, whitespaces, words, words that are numbers, and punctuation signs. It also detects and counts dates. Data profiling is used to determine the characteristics of the data and to help choose the appropriate preprocessing steps later.
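The kind of count-based profile described above can be approximated with the standard library alone. The sketch below is not the NLP Profiler library used by the authors; the function name, the keys of the returned dictionary, and the sample sentence are all invented for illustration.

```python
import re
import string

def profile_text(text):
    """Return simple count-based statistics for one post or comment."""
    words = text.split()
    # Naive sentence split on runs of ., ! or ?; informal text will often break it.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "num_chars": len(text),
        "num_whitespaces": sum(ch.isspace() for ch in text),
        "num_words": len(words),
        "num_numeric_words": sum(w.isdigit() for w in words),
        "num_punctuation": sum(ch in string.punctuation for ch in text),
        "num_sentences": len(sentences),
    }

# Invented sample comment (not taken from the dataset).
stats = profile_text("Cantik banget! Promo 50 persen, cek bio.")
print(stats)
```

Applied to every post and comment, such a profile gives the length and count distributions that help decide which preprocessing steps are worthwhile.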

C. Pre-processing
The pre-processing process consists of the following steps:
1) Generating manual features such as the length, the number of emojis, the number of unique emojis, the number of digits, the number of hashtags, the number of mentions, the number of uppercase letters, the number of special chars, and the number of links.
3) Removing spaces and characters that appear excessively.
b) Slang word normalization using a dictionary.
c) Email, hashtag, number, and mention replacement with specific tags (USER, ANGKA, EMAIL, MENTION).
d) Abbreviation normalization using a dictionary.
e) Some minor spelling corrections using a dictionary.
6) Performing stopword removal (using combined stopwords from a standard list and stopwords generated from the dataset based on their frequency).
7) Performing stemming using the Sastrawi Python library.
8) Saving the final output and passing it to the model.
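Some of the steps above can be sketched with regular expressions. The code below is a hedged illustration, not the authors' pipeline: the regex patterns, the function names, the rough emoji character ranges, the choice to cap repeated characters at two, and the exact tag mapping (email to EMAIL, mention to USER, number to ANGKA) are all assumptions. Dictionary-based normalization, stopword removal, and Sastrawi stemming are left out.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
MENTION_RE = re.compile(r"(?<![\w.])@\w+")  # "@name" not preceded by a word char
HASHTAG_RE = re.compile(r"#\w+")
NUMBER_RE = re.compile(r"\d+")
# Rough emoji match over a few common Unicode blocks; not a complete definition.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def manual_features(text):
    """Count-based features computed on the raw text (step 1)."""
    emojis = EMOJI_RE.findall(text)
    return {
        "length": len(text),
        "num_emojis": len(emojis),
        "num_unique_emojis": len(set(emojis)),
        "num_digits": sum(ch.isdigit() for ch in text),
        "num_hashtags": len(HASHTAG_RE.findall(text)),
        "num_mentions": len(MENTION_RE.findall(text)),
        "num_uppercase": sum(ch.isupper() for ch in text),
        "num_links": len(URL_RE.findall(text)),
    }

def normalize(text):
    """Tag replacement and removal of excessive characters (assumed mapping)."""
    text = EMAIL_RE.sub(" EMAIL ", text)    # emails first, so "@" is not re-matched
    text = MENTION_RE.sub(" USER ", text)
    text = NUMBER_RE.sub(" ANGKA ", text)   # "angka" is Indonesian for "number"
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)  # cap any character run at two
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace

# Invented sample texts (not taken from the dataset).
features = manual_features("Cek PROMO di www.toko.com @admin_toko 😍😍🔥 #diskon 50%!!!")
cleaned = normalize("hub   admin@toko.com atau @admin_toko, diskon 50 baguuuus!!!!")
print(features)
print(cleaned)  # hub EMAIL atau USER , diskon ANGKA baguus!!
```

The features are computed before normalization on purpose: capping repeated characters would otherwise also shorten runs of identical emojis and change the emoji counts.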

IV. EXPERIMENTS AND DISCUSSION FOR BASELINE PERFORMANCE
The testing was carried out using ML methods (Naïve Bayes, Complement Naïve Bayes, Decision Tree, and Multi-Layer Perceptron) [32], which were partially or fully implemented using the Python scikit-learn library (sklearn). The test scenario was carried out in two forms: a dataset with emojis as symbols and a dataset with emojis converted to text. Pre-processing uses tokenization, Indonesian stopwords, and simple normalization, and the features are n-gram TF-IDF, i.e., 1-gram and 2-gram. Table VIII shows the experiment scenarios using the machine learning methods. The dataset is split into 80% training data and 20% testing data. The evaluation uses the F-measure (F1), with a score between 0 and 1.
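The two emoji scenarios and the n-gram TF-IDF features can be sketched with scikit-learn as follows. This is a minimal illustration on invented toy comments: the emoji-to-text mapping, the custom `token_pattern`, the stratified split, the `random_state`, and the reading of "2-gram" as bigrams only are all assumptions, since the paper does not state these details.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical emoji-to-text mapping for the "emoji in text" scenario.
EMOJI_TO_TEXT = {"😍": " emoji_heart_eyes ", "🔥": " emoji_fire ", "💰": " emoji_money_bag "}

def emoji_to_text(text):
    for symbol, name in EMOJI_TO_TEXT.items():
        text = text.replace(symbol, name)
    return text

# Invented toy comments; 1 = spam, 0 = not spam.
comments = [
    "cantik banget kak 😍", "semangat terus kak 😍", "keren banget fotonya 🔥",
    "suka banget sama outfitnya 😍", "sehat selalu kak 😍",
    "promo pinjaman cepat cek bio 💰", "jual followers murah cek bio 💰",
    "obat pelangsing ampuh cek bio 🔥", "pinjaman cepat cair hubungi admin 💰",
    "jual obat pelangsing murah 💰",
]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# 80% training / 20% testing; stratification is an assumption.
train_txt, test_txt, y_train, y_test = train_test_split(
    comments, labels, test_size=0.2, stratify=labels, random_state=0)

# Emoji-as-symbol scenario: keep single non-word symbols as tokens, because the
# default token pattern only keeps word tokens of two or more characters.
symbol_vec = TfidfVectorizer(ngram_range=(1, 1), token_pattern=r"\w+|[^\w\s]")
X_train_symbol = symbol_vec.fit_transform(train_txt)
X_test_symbol = symbol_vec.transform(test_txt)

# Emoji-as-text scenario with 1-gram and 2-gram features.
train_as_text = [emoji_to_text(t) for t in train_txt]
text_vec = TfidfVectorizer(ngram_range=(1, 1))
X_train_text = text_vec.fit_transform(train_as_text)
bigram_vec = TfidfVectorizer(ngram_range=(2, 2))
X_train_bigram = bigram_vec.fit_transform(train_as_text)

print(X_train_symbol.shape, X_train_text.shape, X_train_bigram.shape)
```

The custom token pattern matters for the symbol scenario: with scikit-learn's default tokenizer, emoji characters would be dropped silently before TF-IDF weighting.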
The authors use Naïve Bayes (NB) with an alpha value of 0.01 and the other parameters left at the sklearn defaults. Complement Naïve Bayes (CNB) was used in the second experiment and was expected to cope with datasets whose classes are not balanced. Both methods were used as representatives of probabilistic classifiers. The Decision Tree (DT) method was also used, representing tree classifiers, with the random_state parameter set and splits chosen using the Gini index. Finally, an artificial neural network-based classification method was used: a multi-layer perceptron (MLP) limited to 300 iterations. The F-measure (F1 score) was chosen for the performance evaluation because it combines recall and precision and can also be used on an unbalanced dataset.
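The four classifier configurations described above might look as follows in scikit-learn. This is a sketch under stated assumptions: the paper does not say which Naïve Bayes variant was used, so `MultinomialNB` is our guess for TF-IDF features; the `random_state` values are arbitrary; and the toy texts are invented, so the printed scores say nothing about SPAMID-PAIR.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import ComplementNB, MultinomialNB
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Invented toy data; 1 = spam, 0 = not spam.
train_texts = [
    "cantik banget kak", "semangat terus kak", "keren banget fotonya", "sehat selalu kak",
    "promo pinjaman cepat cek bio", "jual followers murah cek bio",
    "obat pelangsing ampuh cek bio", "pinjaman cepat cair hubungi admin",
]
train_labels = [0, 0, 0, 0, 1, 1, 1, 1]
test_texts = ["cantik banget fotonya kak", "jual obat murah cek bio",
              "semangat kak", "promo followers murah"]
test_labels = [0, 1, 0, 1]

vectorizer = TfidfVectorizer(ngram_range=(1, 1))
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

models = {
    "NB": MultinomialNB(alpha=0.01),         # NB variant is an assumption
    "CNB": ComplementNB(),                   # sklearn defaults
    "DT": DecisionTreeClassifier(criterion="gini", random_state=0),
    "MLP": MLPClassifier(max_iter=300, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, train_labels)
    predictions = model.predict(X_test)
    # F1 for the spam class (label 1 treated as the positive class).
    scores[name] = f1_score(test_labels, predictions, pos_label=1)

for name, score in scores.items():
    print(f"{name}: F1 = {score:.2f}")
```

Keeping all four models behind the same `fit`/`predict` loop makes it straightforward to rerun every scenario (symbol or text emojis, 1-gram or 2-gram) with identical splits.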
Moreover, accuracy alone is inappropriate for the SPAMID-PAIR dataset because its classes are unbalanced. Table X shows the results of the experimental scenarios using the methods. Fig. 4 to 10 display the confusion matrices of the models, while Fig. 11 and 12 show the ROC curves of the models on the testing data. In the confusion matrices in Fig. 4(a) and 4(b) (EmojiSymbol NB), the true positives are higher than the other cells (true negatives, false positives, and false negatives). The ability to detect spam comments is good enough, but the accuracy is better on not-spam comments than on spam comments. Fig. 5(a) and 5(b) (EmojiText NB) show that the F1 score is higher than for EmojiSymbol, although the true positives are lower. From this result, EmojiText performs better because it can detect spam comments properly in a balanced dataset. Fig. 6(a) and 6(b) (EmojiSymbol CNB) show that the F1 score (based on true positives, true negatives, false positives, and false negatives) is better than with the NB method. CNB works better because it can complement the weights of an unbalanced dataset [34]. Fig. 7(a) and 7(b) (EmojiText CNB) show that CNB in text format outperforms NB in both EmojiSymbol and EmojiText. Fig. 8(a), 8(b), 9(a), and 9(b) show the performance of the DT method, which also has a better F1 in EmojiText but not in EmojiSymbol. Decision Tree can handle the emoji symbol well. Lastly, the confusion matrices in Fig. 10(a) and 10(b) show that MLP (a traditional neural network) has an F1 score close to NB and CNB but trains more slowly. However, based on Fig. 11(a) and 11(b), EmojiText with MLP works best among the methods.
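The point about accuracy on unbalanced classes can be made concrete from the confusion-matrix cells. The sketch below treats spam as the positive class, which is an assumption. The first example uses the class sizes reported for the full dataset (19037 spam, 53837 non-spam) with a trivial classifier that labels everything "not spam"; the second example uses invented counts.

```python
def metrics_from_confusion(tp, fp, fn, tn):
    """Precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# A classifier that predicts "not spam" for every pair in the full dataset:
# no spam is ever caught, yet accuracy still looks respectable.
always_not_spam = metrics_from_confusion(tp=0, fp=0, fn=19037, tn=53837)

# Invented counts for a classifier that does catch most spam.
example = metrics_from_confusion(tp=80, fp=20, fn=20, tn=180)

print(always_not_spam)
print(example)
```

The trivial classifier reaches an accuracy of roughly 0.74 with an F1 score of 0 for the spam class, which is why F1 is the more informative measure here.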
The authors also extract a list of emojis categorized as "spam" and "not spam" based on the SPAMID-PAIR dataset, shown in Table IX. The list of spam emojis is longer than the list of not-spam emojis. The intersection between them is also quite large. The emojis used only in the "not spam" category are very reasonable ones (with a clear meaning), whereas the emojis used only in the "spam" category are numerous and very random (with no clear meaning). The results in Table X show that the SPAMID-PAIR dataset can be used in Indonesian text classification experiments on social media data. In Fig. 4 to Fig. 10, all the confusion matrices of the models use 14573 (20%) testing data. From Table X and Fig. 4 to Fig. 12, it can be seen that CNB and MLP are superior to NB and DT. Fig. 13 shows the ROC curves; the area under the ROC curve in 13(a) is higher than in 13(b). EmojiText1Gram is better than EmojiText2Gram because the TF-IDF vectors from 1-grams carry better weights for representing the text's characteristics. The traditional ML methods can only achieve an F1 score in the range of 0.72-0.78, but a multi-layer perceptron can achieve an F1 score of 0.8. These results can likely be improved, for example with the pair context classification approach [35]. Hopefully, this dataset can also be used in other related research and enrich the collection of Indonesian datasets, which are still rare. This dataset is also important because it contains related pairs of posts and comments that can be used for sentence-pair classification problems in Indonesian.
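The split of emojis into "spam only", "not spam only", and shared can be expressed with set operations. The sketch below uses invented toy comments and a rough emoji regex covering a few common Unicode blocks; both are assumptions and do not reproduce Table IX.

```python
import re

# Rough emoji match over a few common Unicode blocks; not a complete definition.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

# Invented (comment, label) examples, not taken from the dataset.
labeled_comments = [
    ("cantik banget 😍", "not spam"),
    ("semangat kak 🔥😍", "not spam"),
    ("cek bio 💰🔥", "spam"),
    ("promo 🎁💰", "spam"),
]

spam_emojis, not_spam_emojis = set(), set()
for comment, label in labeled_comments:
    target = spam_emojis if label == "spam" else not_spam_emojis
    target.update(EMOJI_RE.findall(comment))

shared = spam_emojis & not_spam_emojis         # used in both categories
only_spam = spam_emojis - not_spam_emojis      # used only in spam comments
only_not_spam = not_spam_emojis - spam_emojis  # used only in not-spam comments

print("shared:", shared)
print("only spam:", only_spam)
print("only not spam:", only_not_spam)
```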

V. CONCLUSION
This research collected post and comment pair data from 13 selected Indonesian public figure (artist) accounts with more than 15 million followers. Two annotators labeled all 72874 pairs. The dataset, called SPAMID-PAIR, contains post-comment pairs and labels in Unicode (UTF-8) text, including emojis. The dataset does not include any account information except the ID number. Unlike other existing sentence pair datasets, the SPAMID-PAIR dataset is specifically meant for determining the context between comments and posts, which had never been collected in a large enough dataset. The objective of this dataset is to serve as a primary data source in machine learning, especially in the NLP area, for spam comment detection based on the post context. This dataset is intended as one of the Indonesian language datasets that also contains many emoji symbols from social media, so that it can be used to understand human expressions using emojis. The experiments show that SPAMID-PAIR can be used as a training dataset to detect spam comments based on their posts. From the experiments with several ML methods, the traditional methods can only achieve an F1 score in the range of 0.72-0.78, but a multi-layer perceptron (MLP) can achieve an F1 score of 0.8. This suggests that these results can be improved in future work. A limitation of this dataset is the imbalance between the not-spam and spam categories. The dataset can also be enhanced in the future.
Datasets for text classification can be divided into two types: single-text classification and paired-text classification. Some examples of single-text classification datasets are news classification, sentiment classification, hoax classification, spam classification, topic classification, and emotion classification. Examples of paired-text classification are text entailment classification, duplicate question classification, and text pair similarity classification, as well as spam comment classification based on a particular post on social media.

TABLE I.
THE 13 PUBLIC FIGURES USED IN THE SPAMID-PAIR DATASET WITH MORE THAN 15 MILLION FOLLOWERS (PER DECEMBER 2021)

TABLE II.
DETAILED STATISTICS OF SPAM AND NON-SPAM DATA PER ACCOUNT ID IN THE SPAMID-PAIR DATASET

TABLE III.
DESCRIPTION OF ATTRIBUTES IN THE SPAMID-PAIR DATASET

TABLE IV.
NUMBER OF EMOJIS IN THE SPAMID-PAIR DATASET

The SPAMID-PAIR dataset profile generally has an average comment length of 34.23 characters and an average post length of 252.03 characters. The highest number of (non-unique) emojis in a comment is 359, and the highest number of unique emojis is 112. In the post data, the highest number of (non-unique) emojis is 32, and the highest number of unique emojis is 14. Complete statistical details of the comment and post data can be seen in Table VI. The maximum length of a comment is 386, and the maximum length of a post is 3938.

TABLE VI.
STATISTICAL INFORMATION ON COMMENTS AND POSTS DATA IN THE SPAMID-PAIR DATASET

TABLE VII.
EXAMPLE OF LABELING RESULTS