Automatic Extraction of Indonesian Stopwords

The rapid growth of Indonesian-language content on the Internet has drawn researchers’ attention. By using natural language processing, they can extract high-value information from such content and documents. However, processing large numbers of large documents is time-consuming and computationally expensive. Reducing these computational costs requires attribute reduction by removing common words, or stopwords. This research aims to extract stopwords automatically from a large Indonesian corpus of about seven million words downloaded from the web. The problem is that Indonesian is a low-resource language, which makes it challenging to develop an automatic stopword extractor. The method used is Term Frequency-Inverse Document Frequency (TF-IDF). We present a methodology for ranking stopwords using TFs and IDFs that is applicable even to a small corpus (as small as one document). It is an automatic method that can be applied to many different languages with no prior linguistic knowledge required. The method makes two contributions: it can reveal the words found in all documents, and it has an automatic cut-off technique for selecting the top-ranked stopword candidates in the Indonesian language, overcoming one of the most challenging aspects of stopwords extraction.


I. INTRODUCTION
Stopwords are common words that carry little information content [1]. Despite their high frequency of occurrence, they add little semantic information to a document [2]. They are also referred to as a negative dictionary or noise words. They lower retrieval effectiveness and degrade prediction outcomes. Since they make up a considerable portion of documents, text-mining tasks become very computationally intensive. This high computational cost is caused by the curse of dimensionality and demands more computer memory and computation time. Furthermore, information retrieval experiments have shown that removing stopwords improves precision significantly compared with leaving them in [3,4]. Stopwords also play a significant role in feature extraction [5,6], topic modeling [7], classification [8], ontology construction [9], and keyword extraction [10].
There are two categories of stopwords: domain-specific and general. Domain-specific stopwords are a set of words that have no discriminative value inside a specific context or domain. They differ from one domain to another. For example, the word "learning" could be a stopword in the education domain, and the word "machine" could be a stopword in the machinery domain, but neither of those words is a stopword in the computer science domain. General stopwords, on the other hand, form a list of stopwords (a stoplist) that is not specific to one domain and is usually available to download as a public-domain resource.
General stopwords are the most used in Natural Language Processing (NLP) because they are readily available, whereas developing a domain-specific stoplist takes considerable effort. It is easier to create a domain-specific stoplist from a general stoplist by adding and/or removing some terms. General stoplists, however, need to be updated frequently. In addition, the usage of some ordinary words changes over time with social developments such as industrial revolutions, new media, cultural shifts, and education. For these reasons, reviewing, updating, and adjusting existing stoplists is essential [5]. Updating a general stoplist can be done manually, but it takes time and may omit the latest stopwords. This problem can be solved by developing a general stoplist automatically.
Researchers have been developing methods for automatically creating stoplists for decades, especially for English. In contrast, there are relatively few studies on developing a stoplist for non-English languages like Indonesian. The problem is that Indonesian is a low-resource language, which makes it challenging to develop an automatic stopword extractor.
There were only two research documents about general stopwords extraction in Indonesian [11,12]. Those documents report 394 and 330 general stopwords extracted from the Kompas daily newspaper. Both stoplists were extracted using the Term Frequency (TF) method, which is rarely used for extracting stopwords. Most researchers use a combination of Term Frequency and Inverse Document Frequency (TF-IDF) in NLP. Unfortunately, TF cannot detect words that occur in all documents and cannot apply a threshold to limit the number of generated stopwords.
This research paper aims to solve the problem above and develop an up-to-date general stoplist in the Indonesian language. The method used is crawling recent news from an Indonesian online newspaper's website to gather data and make the required dataset. The stopwords extraction method uses TF-IDF.
This document is organized as follows. The next section briefly covers the existing literature on automatic stopwords extraction, followed by the methods used for stopwords extraction and the experiments. Then we describe the results of this research. We conclude by presenting the advancement of our method compared to previous works.

II. RELATED WORKS
Many methods have been used to develop stoplists. They include the frequency-based approach [13], Bidirectional Long Short-Term Memory (BiLSTM) [14], word embedding [15], finite automata [16], and characteristic and discriminant analysis [17]. The dataset or corpus used to extract or identify stopwords varies. Some researchers used a corpus from an online newspaper [18], a social network [19], or patents [20]. As the purpose of this research is to develop an Indonesian general stoplist, only relevant research papers are discussed.
There are three research papers discussing the development of the Indonesian stoplist. One of them only involves developing a cuisine-specific stoplist for the Indonesian language [12]. However, since this research aims to develop a general Indonesian stoplist, only the other two papers are reviewed further here based on their proposed method.
Fadillah Z Tala, in his master's thesis [11], proposed an Indonesian stoplist because there was no Indonesian stoplist that could be used in his information retrieval experiment. In his work, he created the dataset from articles in the "Kompas daily" newspaper. He downloaded the articles every day for one year, from the beginning of January 2001 until the end of December 2001. The total number of articles was 3160. The result of his experiment was 394 stopwords in Bahasa Indonesia.
Yudi Wibisono has created a stoplist in his coursework [12]. As the source of his dataset, he also used articles from the "Kompas daily" newspaper. He used several hundred articles to create 330 stopwords. The method used was Term Frequency, like Tala's work, but he removed some words manually.
Tala and Wibisono used the Term Frequency (TF) method to extract stopwords in their work. These days, Term Frequency-Inverse Document Frequency (TF-IDF) is another method that is generally used in information retrieval systems. TF-IDF is one of the traditional methods based on statistics [21]. It has been used in many different applications, such as document clustering [22], text classification [23], detection of domain name generation algorithms [24], and comparing research trends [25]. Term frequency or word frequency is a rarer method used in information retrieval systems compared to TF-IDF.

III. METHODS
Different methods were used in each stage of this research. As shown in Fig. 1, the study was divided into three stages: data gathering, pre-processing or data cleaning, and stopwords extraction. This work differs from the previous studies in two of these stages. First, the data in the data gathering stage of this research is crawled from the "Republika daily" newspaper, whereas the previous studies used data from the "Kompas daily" newspaper. Second, the previous studies used Term Frequency (TF), whereas in this work we used the TF-IDF method, a combination of the Term Frequency and Inverse Document Frequency methods.

A. Data Gathering
The dataset or corpus for this research was gathered from Republika, an Indonesian online newspaper. The method used to gather the data was "Focused Web Crawling" [26,27], a method for downloading or harvesting particular data from websites, commonly from a single website. The crawler was developed in the Python programming language to crawl web addresses from the Republika website. In total, 6,111 articles were downloaded, containing 6,947,178 words, 87,998 of which were unique.
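The crawling idea can be illustrated with a minimal sketch that follows only links staying on one host. This is not the crawler used in this research: the names `LinkCollector`, `focused_links`, and `crawl`, the `fetch` callback, and the `max_pages` limit are all hypothetical, and the same-host rule is only one possible way to keep a crawl focused.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def focused_links(page_url, html, allowed_host):
    """Return the absolute links on a page that stay on allowed_host."""
    collector = LinkCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parsed = urlparse(absolute)
        on_site = parsed.scheme in ("http", "https") and parsed.netloc == allowed_host
        if on_site and absolute not in links:
            links.append(absolute)
    return links


def crawl(start_url, fetch, allowed_host, max_pages=100):
    """Breadth-first crawl of one host; fetch(url) must return HTML text."""
    queue, seen, pages = [start_url], {start_url}, {}
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        html = fetch(url)
        pages[url] = html
        for link in focused_links(url, html, allowed_host):
            if link not in seen:  # never queue the same address twice
                seen.add(link)
                queue.append(link)
    return pages
```

Passing the downloader in as `fetch` keeps the sketch testable offline; in practice it could wrap `urllib.request.urlopen` and add politeness delays and error handling.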

B. Pre-Processing
Pre-processing is a required process to clean the dataset. Common pre-processing steps are case folding, HTML tags removal, special characters removal, tokenizing, dealing with missing data, dealing with data errors, and stemming. The pre-processing steps implemented for this stopwords extraction are as follows:
• Case folding: Converting characters from uppercase to lowercase. The fastest and simplest way is to change all words to lowercase, including words in a title or sub-title and words at the beginning of a sentence. Since some papers write "Term Frequency" in uppercase and others write "term frequency" in lowercase, we converted all words to lowercase, which means that we treat those two phrases as the same phrase.
• HTML tags removal: Removing all HTML tags, scripts, and other metadata from HTML documents is mandatory. It returns clean text from documents in HTML format.
• Tokenizing: Separating the words of each document into an array of items or a bag of words.
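The three steps above can be sketched with the Python standard library. This is an illustration under stated assumptions, not the application used in this research: `TextExtractor` and `preprocess` are hypothetical names, and deleting special characters without inserting a space is an assumption suggested by the token republikacoid (from republika.co.id) reported in Table I.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keep visible text; drop tags and the contents of <script>/<style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def preprocess(html):
    """Turn one HTML document into a list of lowercase tokens."""
    extractor = TextExtractor()
    extractor.feed(html)                       # HTML tags removal
    extractor.close()
    text = " ".join(extractor.parts).lower()   # case folding
    # Assumption: special characters are deleted, not replaced by spaces.
    # Non-ASCII letters are also dropped by this simple pattern.
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return text.split()                        # tokenizing
```

For example, `preprocess("<p>Term Frequency, term frequency.</p>")` yields the same two tokens twice, which is the effect of case folding described above.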

C. Stopwords Extraction
Extracting stopwords from Indonesian documents is the primary purpose of this study. After the pre-processing steps, stopwords were extracted from the dataset using the TF-IDF method, which is defined by Eqs. (1)-(5). In these equations, $f(\omega_i)$ is the frequency of occurrence of term or word $\omega_i$ in document $j$, and $N$ is the total number of documents in the document collection $\{d_j\}$. $df(\omega_i)$ indicates the number of documents in the collection that contain term $\omega_i$. $n_{ij}$ is the number of occurrences of the $i$th term in the $j$th document, and $n_{kj}$ is the number of occurrences of the $k$th term in the $j$th document. $|\{j : \omega_i \in d_j\}|$ is the number of documents containing the $i$th term. The value of each term in every document is obtained by examining every term in the document collection or corpus.
For the whole document collection (corpus or dataset), the TF of term $\omega_i$ is averaged by dividing by the number of documents that contain $\omega_i$. Thus, the TF-IDF of term $\omega_i$ for the whole document collection is given by Eq. (5).
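For orientation, the textbook forms of TF and IDF can be written with the symbols defined above. This is a sketch of the standard definitions and not necessarily the exact form of Eqs. (1)-(5); the logarithm base and any normalization are left open here.

```latex
\begin{align*}
\mathrm{TF}(\omega_i, d_j) &= \frac{n_{ij}}{\sum_{k} n_{kj}} \\
\mathrm{IDF}(\omega_i) &= \log \frac{N}{|\{j : \omega_i \in d_j\}|}
                        = \log \frac{N}{df(\omega_i)} \\
\mathrm{TF\text{-}IDF}(\omega_i, d_j) &= \mathrm{TF}(\omega_i, d_j) \cdot \mathrm{IDF}(\omega_i)
\end{align*}
```

With these forms, a term that appears in every document has $df(\omega_i) = N$, so its IDF, and therefore its TF-IDF, is zero.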

IV. EXPERIMENTS
Several experiments were carried out to settle on the methods. For example, the data gathering method had to be tried many times before we could harvest the data automatically. This is because the articles or documents are spread across tens of categories or sub-categories in the data source (https://www.republika.co.id/), such as News, Playing Games, Economics, Football, Islam Digest, or International. Since the structure of these web pages was not crawler-friendly, we used the Focused Web Crawling strategy to handle them. The data was then processed using the pre-processing methods discussed above.
The pre-processing methods used were standard methods for Natural Language Processing. Our pre-processing experiments proceeded smoothly. All pre-processing steps were done automatically by applications developed in Python. Python was chosen because of its many machine learning libraries, especially for NLP. The bag of words, or array of terms, resulting from pre-processing was then fed into the TF-IDF method.
We used the TF-IDF formula shown in Eq. (5) to extract stopwords. Since there is no need for a training dataset, this NLP approach is categorized as an unsupervised machine learning method. Eq. (5) contains a term that comes from Eq. (3). This term normalizes Eq. (5), limiting the resulting TF-IDF values to between zero and one.
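A corpus-level score with these properties can be sketched as follows. This is an illustration only, not the exact Eq. (5): `corpus_tfidf` is a hypothetical name, and dividing the IDF by log N is our assumption for a normalization that keeps scores between zero and one.

```python
import math
from collections import Counter


def corpus_tfidf(documents):
    """Score every term of a tokenized corpus; lower is more stopword-like."""
    n_docs = len(documents)
    tf_sum = Counter()  # sum over documents of n_ij / sum_k n_kj
    df = Counter()      # number of documents containing each term
    for tokens in documents:
        counts = Counter(tokens)
        total = sum(counts.values())
        for term, n in counts.items():
            tf_sum[term] += n / total
            df[term] += 1
    scores = {}
    for term in df:
        # Average TF over the documents that contain the term.
        mean_tf = tf_sum[term] / df[term]
        # Assumption: IDF is divided by log(N) so that it lies in [0, 1].
        if n_docs > 1:
            idf = math.log(n_docs / df[term]) / math.log(n_docs)
        else:
            idf = 0.0  # a single document: every term is in "all" documents
        scores[term] = mean_tf * idf
    return scores
```

Under this sketch a term present in every document scores exactly zero, and a term confined to one document keeps its full average TF.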
If the TF-IDF value of the $i$-th word $\omega_i$ is equal to 0, the word exists in all documents. Table I shows three words contained in all documents: republikacoid (republika.co.id), wib (Western Indonesian Time zone), and lainnya (others). A greater TF-IDF value denotes that the word is less likely to be a stopword. Fig. 2 shows the correlation between the maximum TF-IDF and the number of stopwords extracted, on a logarithmic scale. This figure shows that all of the words are treated as stopwords when the maximum TF-IDF is set to 1. The number of stopwords extracted by the proposed method depends on the defined TF-IDF threshold. For example, the system extracted only six stopwords for a maximum TF-IDF of $10^{-6}$, and 414 stopwords when the maximum TF-IDF increased to $10^{-3}$. However, for a maximum TF-IDF of 0.01, the number of stopwords jumps to 64,571. The number of stopwords can therefore be cut off by setting the value of the maximum TF-IDF. Since TF-IDF ranges from 0 to 1, the threshold can be kept constant. For example, if the threshold is set to 0.001 and the number of documents is doubled, the number of stopwords generated by TF-IDF does not change significantly. If we use only TF to extract stopwords, set the threshold to 8,000, and double the number of documents, the word frequencies might double, so the number of stopwords changes significantly, as shown in Table I under Tala's and Wibisono's method.
The extracted stoplist contains some words specific to the dataset. For example, since the source of the dataset or document collection is the Republika online daily newspaper, there are some words with TF-IDF equal to zero, which means that those words occur in all documents. Table I shows a sample of stopwords extracted from the same document collection using two different methods. Results in the left column are based on the previous researchers' method, and results in the right column are based on the method proposed in this work. As shown in this table, the other methods cannot reveal words that occur in all documents.
Analyzing the top 100 extracted stopwords shows that the method used in this research, TF-IDF, is better than the previous methods. First, this method can reveal the words that occur in all documents and place them in the top ranks, while the old method can reveal only two of those words and places them in ranks 17 and 26. Second, the TF-IDF method can expose all words in the sentence "copyright … all right reserved" that occur in most of the documents, whereas the old method cannot reveal any of those words.
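The cut-off itself is a simple filter on the scores. A minimal sketch follows; `select_stopwords` is a hypothetical name, and the scores in the usage example are invented toy values, not results from this research.

```python
def select_stopwords(scores, max_tfidf):
    """Return terms whose score is at most max_tfidf, lowest score first."""
    chosen = [(score, term) for term, score in scores.items() if score <= max_tfidf]
    return [term for score, term in sorted(chosen)]


# Toy scores for illustration only (not measured values).
toy_scores = {"wib": 0.0, "yang": 0.0004, "dan": 0.0008, "ekonomi": 0.02, "gempa": 0.3}
strict = select_stopwords(toy_scores, 1e-6)    # only words found in every document
moderate = select_stopwords(toy_scores, 1e-3)  # adds other very common words
everything = select_stopwords(toy_scores, 1.0) # threshold of 1 keeps every word
```

Because the scores are bounded, the same `max_tfidf` value can be reused when the corpus grows, which is the property contrasted with a raw TF threshold above.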

VII. CONCLUSIONS AND FUTURE WORKS
Stopwords extraction using TF-IDF has three advantages over TF. First, it can detect words that occur in all documents, which have a TF-IDF equal to zero. Second, it can apply a threshold to limit the number of generated stopwords. Third, it can expose all words that occur in most of the documents and place them in the top ranks.
This research can be improved in future work in two ways. First, the documents in the corpus should be classified by their domain, because the stopwords of one domain differ from those of other domains. Second, a recommender system could be developed as a web-based application for stopwords extraction that is accessible to the public.