A Novel Approach: Tokenization Framework based on Sentence Structure in Indonesian Language

Abstract: This study proposes a new approach to sentence tokenization. Conventional sentence tokenization splits a sentence on spaces, so it generates only single-word tokens: a sentence of five words yields five tokens, one per word. This process ignores the meaning that is lost when words that belong together are separated. Our proposed tokenization framework generates single-word tokens and multi-word tokens at the same time. It does so by extracting the sentence structure to obtain sentence elements, each of which is a token. There are five sentence elements: Subject, Predicate, Object, Complement, and Adverb. We extract sentence structures with a deep learning model trained on a dataset prepared for this purpose. The trained model reaches an F1 score of 0.7, which leaves room for improvement. Sentence similarity is used to compare the performance of single-word tokens and multi-word tokens; in this comparison, multi-word tokens give better accuracy. The framework was built for the Indonesian language but can be applied to other languages with dataset adjustments.


I. INTRODUCTION
In the current era, the amount of information is increasing very rapidly [1]. Much of it is available in text form from various types of documents such as magazines, e-books, research results, social media, emails, PDF files, video, audio, images, and large amounts of business content. Experts predict the volume of text documents will grow by 80% by 2025. To be useful, text data must be processed into information with text mining techniques [2].
To be processed, text data needs to be prepared at the text pre-processing stage. This stage is the first important step of any data mining process to achieve better accuracy [3]. It changes the data from its original form into a form that is easier to observe and explore [4]. Tokenization is one of the pre-processing activities, alongside case folding, filtering/stop-word removal, lemmatization, and stemming [5], [6], as well as normalization and removal of irrelevant words [7]. Stopwords are the least important words in a sentence, and ignoring them can help identify the most important words [8].
Tokenization is a fundamental process in almost all Natural Language Processing applications. The standard approach is single-word tokenization, in which the input string is split word by word using spaces as separators [9]. Most NLP research uses this kind of tokenization technique, such as by [10] in semantic similarity, [4] [9] in text classification, [11], [12] in information retrieval, [13], [14] in clustering, [15]- [17] in sentiment analysis, and much more.
Usually tokenization separates each word in a sentence as one token based on the spaces between words, but in fact, not all words in a sentence can be separated. There are words that must remain in pairs so that the meaning of the sentence remains correct. Separating a sentence into its constituent words can result in the meaning of a word deviating far from its actual context [18].
There are several publications that state that tokens are not just one word, but can be several words or even one sentence [10] [13][14] [19]. There is also research into finding multiword expressions (MWE) or combinations of words that must be paired to make sense, such as by [20]- [23]. Most of this research was conducted for documents in English and other languages, including languages that do not recognize spaces as separators between words, such as Mandarin, Japanese or Thai. Research on Indonesian language texts is still limited. All the research above is only for finding word pairs and not for tokenization.
Methods used in previous research include statistical, linguistic, dictionary-based, and machine learning methods. The statistical method calculates the frequency of co-occurrence of two words. Linguistic methods match grammatical patterns based on word class labels. The dictionary method looks up word pairs in a dictionary. Machine learning methods use datasets to train a model that predicts the output.
Tokens consisting of several words are referred to as multiword tokens. Multi-word tokens must be in the same sentence and same sentence element. In paragraphs that contain many sentences, it is necessary to segment the sentences so that each sentence is separated from each other. In order to segment a sentence, it is very important to know where the sentence boundaries are. It is not easy to find sentence boundaries because there is ambiguity from sentence boundary punctuation.
In Indonesian there are five sentence elements, namely Subject (Subjek), Predicate (Predikat), Object (Objek), Complement (Pelengkap) and Adverb (Keterangan), known as SPOK in Indonesia [24]. The Subject and Predicate elements must be present, while the others are optional. Each sentence element contains one or more words. Word pairs can only be formed within the same sentence element. Therefore, it is important to be able to extract the sentence structure, which previous studies have not taken into account. By extracting the sentence structure, each sentence element can be treated as a token, at least for Subject and Object. This paper proposes a new method for sentence tokenization based on sentence structure in Indonesian. The method generates single-word and multi-word tokens simultaneously, which is our contribution. To the best of our knowledge, there is no prior research on this. This research uses Indonesian, but the method can be adapted to other languages that use spaces by customizing the dataset and retraining the model.
To assess the effectiveness of single-word and multi-word tokens, a sentence similarity test was carried out on both types of tokens. The test results show that multi-word tokens determine similarity better than single-word tokens. This paper is divided into several sections. Section II reviews related work on multi-word tokenization, including multi-word expressions. Section III gives an overview of the proposed method, including sentence segmentation, sentence structure extraction, and dataset preparation. Section IV presents the results and discussion, and Section V concludes the paper.

II. LITERATURE REVIEW
This paper builds on previous studies, which are summarized in this section, especially those related to multi-word tokenization. Previous research has used several methods, such as statistics, linguistics, dictionaries, and machine learning. We found two studies on the Indonesian language. The study in [25] performs two-word extraction to obtain multi-word expression candidates by applying rules and filtering with a dictionary. The study in [26] also used rule-based methods and built two dictionaries (a closed-class tagging dictionary and a multi-word expression dictionary). The multi-word expression dictionary stores two or more words with POS tags of nouns, verbs and adjectives. The study in [27] examines the tokenization process using a phrase detection-based approach.
Research on Serbian in the agricultural engineering domain, conducted by [28], provides a hybrid approach that combines linguistic and statistical information. The candidate terms are obtained using the frequency of occurrence of text sequences in the corpus. To obtain multi-word expressions, the author in [20] examined an implementation in Turkish that used four methods: first, statistical methods to calculate high co-occurrence frequencies; second, linguistic methods through POS patterns; third, candidates from idiom dictionaries; and fourth, specialized domain resources such as term dictionaries. A method for identifying chemical terms as multi-words was presented by [23]. In that research, the Multiword Identifying and Representing (MIR) method was implemented to recognize multi-word phrases in chemical literature with an unsupervised data-driven model, and the identified phrases were added to the vocabulary. This research uses statistical and linguistic methods without expert annotations. The author in [29] created the MwTExt architecture for automatic extraction of multi-word terms from unannotated English documents in the computer science domain. This method uses statistical, linguistic, and logic-based methods and hybrid techniques, and focuses only on lexical patterns such as (N P N), (N P N + N), and (N P N P N).
The study [21] built a hybrid approach that combines Bi-LSTM, word correlation level, and K-Means clustering to detect MWEs for multiple languages without manual features. The author in [30] proposed a neural network model for learning fixed-size word representations from arbitrary chunks with word embedding. Implementation in French created MWE for Russian dictionary (RuThes). A multi-word expression recognition measure based on the similarity of phrase distribution and word components is used for statistical and linguistic methods as well as for word embedding. The author in [31] focuses on annotating different types of lexicalized and institutionalized phrases, with the main goal of identifying MWEs that are perceived as complex by readers and need to be simplified as a whole. A number of hand-crafted features form the basis for predicting MWE complexity.
From the previous research above, as far as we know, there is no research with a method based on sentence structure as proposed by this research.

III. PROPOSED METHOD
The general tokenization process is shown in Fig. 1. This process receives sentences as input and identifies each word as a token by using spaces as separators between words, resulting in single-word tokens. The number of tokens equals the number of words. This tokenization method is widely used, but it can also cause inaccuracies, such as:
1) The same word or token is considered to have the same meaning wherever it appears in the sentence, so only one occurrence is used and the others are ignored [32]. Example:
Tokens in English: "sakura", "dewi", "looks", "at", "sakura", "tree", "in", "Japan"
Tokens in Indonesian: "sakura", "dewi", "memandang", "pohon", "sakura", "di", "Jepang"
The first token and the fifth token are considered to have the same meaning even though they are semantically different, and one of them is ignored.

2) When two or more words are combined and form a whole, a new meaning is created that differs from the meaning of each constituent word. Example:
Token in English: "green table"
Token in Indonesian: "meja hijau"
In Indonesian, "meja hijau" means the court, a place to find the truth. If these two words are separated into "meja" and "hijau", the meaning changes: the first is a piece of furniture with a flat top and legs as support, and the second is one of the basic colors.
3) Beyond the word meaning problem, there is also a Part-of-Speech (POS) ambiguity problem. The POS of a single-word token can vary. For example, separating the two words 'memberi makan' (in English: feeding) yields the word 'memberi' with the POS verb and the word 'makan' as a noun (since it is something that is given), but in other contexts such as 'kuda makan rumput', the word 'makan' is a verb.
From the description above, it is clear that some words cannot be separated and must remain combined. Current tokenization methods do not accommodate this.
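The first two problems can be illustrated with a minimal sketch of space-based tokenization. The function name `whitespace_tokenize` is a hypothetical helper introduced here for illustration; the example sentences are the ones used above.

```python
def whitespace_tokenize(sentence: str) -> list[str]:
    """Split a sentence on whitespace, producing one token per word."""
    return sentence.split()


# Problem 1: both occurrences of "sakura" collapse into one vocabulary entry.
tokens = whitespace_tokenize("sakura dewi memandang pohon sakura di Jepang")
vocabulary = set(tokens)

# Problem 2: the expression "meja hijau" (the court) is split into two
# unrelated words, "meja" (table) and "hijau" (green).
idiom_tokens = whitespace_tokenize("meja hijau")

print(len(tokens), len(vocabulary))
print(idiom_tokens)
```

Seven tokens are produced but only six distinct entries remain, so the name "sakura" and the tree "sakura" can no longer be told apart at the token level.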
The main elements of the proposed tokenization framework are shown in Fig. 2. The framework has two stages, namely sentence segmentation and sentence structure extraction. The input can be in the form of paragraphs or sentences. If the input is a paragraph, it will go through the sentence segmentation stage. This stage will split the paragraph into separate sentences. These sentences, whether they are new input or output from the first stage will be processed in the sentence structure extraction stage.
The output is a sentence structure with its elements (SPOK). Each sentence element is a token. In other words, sentence structure extraction is a tokenization process. These tokens are then used in natural language processing applications.

A. Sentence Segmentation
The task of sentence segmentation can be performed by detecting sentence boundaries [33]. The general pattern of a sentence is that it begins with a capital letter and ends with a sentence-ending punctuation mark such as a period, question mark, or exclamation mark. The ability to recognize punctuation is a key requirement for finding sentence boundaries and dividing a paragraph into sentences. In this study, the sentence segmentation process is described in Fig. 3 and consists of the following steps:
1) Word Tokenization: the commonly used tokenization process, which breaks text data into words [8] [34]. Any punctuation marks remain attached to the token.
2) Punctuation Checking: checks whether the punctuation attached to the token is a period, question mark, or exclamation mark.
3) If the punctuation on the token is one of the three sentence-ending punctuation marks, the token is assigned EOS status. Otherwise, it is assigned NEOS status.
4) Combining NEOS Tokens: once an EOS token is found, all preceding NEOS tokens are combined with it into one sentence.
All tokens with NEOS status are combined into one new sentence and tokens with EOS status become the last word in the sentence. The next token will be the first word of the next sentence. This sentence will be used as input for the next process.
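The segmentation steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `segment_sentences` and the example paragraph are assumptions, and the sketch does not resolve ambiguous periods such as those in abbreviations.

```python
EOS_MARKS = (".", "?", "!")  # the three sentence-ending punctuation marks


def segment_sentences(paragraph: str) -> list[str]:
    """Split a paragraph into sentences using EOS/NEOS token status."""
    sentences = []
    current = []  # NEOS tokens collected for the sentence being built
    for token in paragraph.split():      # step 1: word tokenization
        current.append(token)
        if token.endswith(EOS_MARKS):    # steps 2-3: token has EOS status
            sentences.append(" ".join(current))  # step 4: combine tokens
            current = []
    if current:  # trailing tokens without closing punctuation
        sentences.append(" ".join(current))
    return sentences


# Illustrative paragraph (invented for this sketch).
paragraph = "Dewi memandang pohon sakura. Apakah kuda makan rumput? Ya!"
sentences = segment_sentences(paragraph)
print(sentences)
```

Because punctuation stays attached to the token, a single suffix check is enough to decide between EOS and NEOS in this simplified setting.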

B. Sentence Structure Extraction
There are five sentence elements in Indonesian, namely Subject (Subjek), Predicate (Predikat), Object (Objek), Complement (pElengkap) and Adverb (Keterangan). Every sentence contains at least a Subject and a Predicate, arranged in that order. The Object, Complement and Adverb elements are optional. Combinations of these sentence elements form sentence structure patterns such as SP, SPO, SPOK, SPOE, SPK, SPE, SPEK and SPOEK, where E stands for the Complement (pElengkap) and K for the Adverb (Keterangan). The word or words within each sentence element form a unit. Words or tokens in different sentence elements cannot be combined into one unit.
The sentence structure extraction process identifies the sentence elements and classifies each word into one of them. This facilitates the tokenization process, especially in determining multi-word tokens. The sentence structure extraction method in this study is shown in Fig. 4. The stages of the extraction process are as follows:
1) The process accepts input in the form of simple, active sentences.
2) A pre-trained deep learning model predicts the sentence structure of the input sentence. The model has been trained on a dataset containing a collection of simple, active sentences in Indonesian, complete with labels. Each label identifies the sentence structure in the BIO tagging format. The label B (for "beginning") marks the first word of a multi-word token. The label I (for "inside") marks each subsequent word of a multi-word token, and the label O (for "outside") marks a stand-alone, single-word token. The dataset is in CSV file format, with an example shown in Fig. 5.
This dataset contains 45,079 tokens from 4,740 sentences in Indonesian, with sentence lengths ranging from 2 to 17 words. The distribution of each sentence element in the dataset is shown graphically in Fig. 6. A pre-trained Bidirectional Encoder Representations from Transformers (BERT) model was trained on this dataset. With 80% of the data used for training, 20% for testing, and 10 epochs, an F1-score of 0.7 was obtained. These results show that the model and dataset are adequate but need to be improved in the future.
3) The output of this process is a sentence structure prediction with sentence elements, namely Subject (SUB), Predicate (PRE), Object (OBJ), Complement (PEL), and Adverb (KET). There are nine types of adverbs in the dataset so there are thirteen sentence elements as listed in Table I. Each token or word must be a member of one of the sentence elements. Each sentence element can consist of one or more than one word.
The predicted sentence element is written as a BIO tag followed by the abbreviation of the sentence element. For example, 'O-SUB' consists of the label O and the abbreviation SUB, which means the word stands alone and has the Subject role. For sentence elements with more than one word, the first word is labeled B ('beginning') and the remaining words are labeled I ('inside'). For example, 'B-SUB', 'I-SUB', 'I-SUB' means that three words have the role of Subject and form one unit, that is, one token. Such tokens are referred to as multi-word tokens. These tokens are then used in the NLP process.
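Grouping the predicted labels into tokens can be sketched as below. The function name `bio_to_tokens` is a hypothetical helper, and the label sequence is written by hand to follow the labeling scheme described above; it is not actual model output.

```python
def bio_to_tokens(words, labels):
    """Group words into (element, token) pairs from labels like 'B-SUB'."""
    groups = []        # list of (element, [words]) in sentence order
    span_open = False  # True while a multi-word token can still be extended
    for word, label in zip(words, labels):
        tag, element = label.split("-", 1)
        if tag == "I" and span_open and groups[-1][0] == element:
            groups[-1][1].append(word)       # continue the current token
        else:
            groups.append((element, [word]))  # start a new token
            span_open = tag in ("B", "I")     # an O token is never extended
    return [(element, " ".join(ws)) for element, ws in groups]


words = "Prajurit mulai memasuki area pertempuran dengan senjata lengkap".split()
labels = ["O-SUB", "B-PRE", "I-PRE", "B-OBJ", "I-OBJ", "B-PEL", "I-PEL", "I-PEL"]
sentence_tokens = bio_to_tokens(words, labels)
print(sentence_tokens)
```

Eight words are reduced to four tokens, one per sentence element, which is the behavior the framework relies on to produce single-word and multi-word tokens together.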

IV. RESULT AND DISCUSSION
The experimental results of the proposed tokenization framework are quite good. In this section, the output will be discussed and sentence similarity tests will be conducted based on single word tokens and multi-word tokens.

A. The Output
As mentioned earlier, the outputs of this tokenization framework are sentence structures and sentence elements. Each sentence element can consist of a single word, called a single-word token, or multiple words, called a multi-word token. One word is one token, and multiple words in the same element are also one token. The number of sentence elements indicates the minimum number of tokens. Table II shows an example.
The first sentence consists of two words, the prediction results show that the first word is the Subject (O-SUB) and the second word is the Predicate (O-PRE). Both are independent because they are labeled O. Then each word is a single word token.
The second sentence consists of seven words. The first word 'Tim' is labeled 'B-SUB' and the second word 'Argentina' is labeled 'I-SUB' which indicates that both are in the same group which is Subject (SUB). So both should remain as one with the meaning of a group of soccer players from Argentina. Separating the two words will lose the original meaning. That is, the Subject is a combination of the words 'Tim' and 'Argentina' to become 'Tim Argentina'. This is a multi-word token.
Likewise, the fourth to seventh words form an adverb group (KTM), so these four words are a single unit. The second sentence also contains a word labeled 'O-PRE', namely 'win'. This means that the word 'win' has the role of a Predicate that stands alone and is a single-word token. The second sentence therefore has only three tokens, for Subject, Predicate, and Adverb. More details are given in Table III.
In the third sentence, three sentence elements consist of more than one word, namely Predicate (PRE), Object (OBJ), and Complement (PEL). Only the Subject (SUB) stands alone because it is labeled O. The complete information can be seen in Table IV, which shows that the Subject is the single-word token 'Prajurit', the Predicate is the multi-word token 'mulai memasuki', the Object is the two-word multi-word token 'area pertempuran', and the Complement is the three-word multi-word token 'dengan senjata lengkap'.

B. Sentence Elements as Token
As explained earlier, a sentence element can be a token. A sentence extraction result that produces three sentence elements therefore has three tokens. A sentence has at least two tokens. Tokens can be single-word tokens or multi-word tokens. By contrast, space-based tokenization of the third sentence yields eight single-word tokens: ["Prajurit", "mulai", "memasuki", "area", "pertempuran", "dengan", "senjata", "lengkap"].
However, not all multi-word tokens derived from sentence elements can be used as final tokens as they are. A multi-word token may contain words that do not provide important information.
In the second sentence above, the multi-word token for the adverb of place contains the word 'di'. The multi-word tokens in the third sentence contain the word 'mulai' in the Predicate and the word 'dengan' in the Complement. These words can be ignored without affecting the token. Such words are known as stopwords.
From the example sentences above, stopwords can appear in Predicate, Complement, or Adverb. There are almost no stopwords in Subject and Object. Therefore, multi-word tokens in Predicate, Complement, and Adverb need to be filtered first. These unnecessary words will be removed before providing tokens. Filtering is done by comparing the contents of the multi-word tokens of the three sentence elements with a database containing words that fall into the category of stopwords.
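The filtering step can be sketched as follows. The stopword set here is a three-word illustration taken from the examples above, not the database used in the study, and the element codes in `FILTERED_ELEMENTS` are assumptions: `KET` stands in for the adverb labels, whose actual subtype abbreviations would need to be added.

```python
# Illustrative stopword list; a real system would load a full database.
STOPWORDS = {"di", "mulai", "dengan"}

# Elements whose multi-word tokens are filtered (Subject and Object are not).
# "KET" is a placeholder for the adverb subtype labels (assumption).
FILTERED_ELEMENTS = {"PRE", "PEL", "KET"}


def filter_stopwords(tokens):
    """Remove stopwords inside Predicate, Complement, and Adverb tokens."""
    result = []
    for element, text in tokens:
        if element in FILTERED_ELEMENTS:
            kept = [w for w in text.split() if w.lower() not in STOPWORDS]
            text = " ".join(kept)
        if text:  # drop a token that consisted only of stopwords
            result.append((element, text))
    return result


extracted = [
    ("SUB", "Prajurit"),
    ("PRE", "mulai memasuki"),
    ("OBJ", "area pertempuran"),
    ("PEL", "dengan senjata lengkap"),
]
final_tokens = filter_stopwords(extracted)
print(final_tokens)
```

Filtering inside the token, rather than dropping the whole token, keeps the informative words of each sentence element together.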

C. Evaluation
The outputs of this framework are single word tokens and multi-word tokens. To get an overview of the two types of tokens, the following is an evaluation of both in determining sentence similarity.
The evaluation uses token-level lexical similarity methods. The Overlap Coefficient, Jaccard Index, Jaccard Distance, Dice Coefficient, and Cosine Similarity are used for single-word tokens, while a Dice-like coefficient is used for multi-word tokens.
The stages of the evaluation are as follows:
1) Define the sets of single-word tokens and multi-word tokens in the sentences.
2) Perform statistical calculations:
a) For single-word tokens:
• Count the number of tokens in the sentence, symbolized as $|K_1|$.
• Count the number of tokens that appear in both sentences, symbolized as $|K_1 \cap K_2|$.
• Count the number of tokens in the union of the two sentences, symbolized as $|K_1 \cup K_2|$.
b) For multi-word tokens:
• Count the core (head) tokens of each token, symbolized as $|h_1|$ and $|h_2|$. The head is a word whose meaning is included in the meaning of another word.
• Form token combinations according to the token order.
• Count the number of core (head) tokens present in both multi-word tokens, symbolized as $|h_1 \cap h_2|$.
• Count the number of tokens present in both multi-word tokens, symbolized as $|M_1 \cap M_2|$.
• Count the total number of tokens in both multi-word tokens, symbolized as $|M_1| + |M_2|$.
c) Measure sentence similarity: the similarity between sentence 1 and sentence 2 is essentially the number of tokens shared by the two sentences divided by a normalization factor.
The sentence similarity functions used are as follows:
• Overlap Coefficient: the size of the intersection of the sets $K_1$ and $K_2$ divided by the size of the smaller of the two sets, $|K_1 \cap K_2| / \min(|K_1|, |K_2|)$.
3) $K_3$ = "pemerintah kota solo mendapat hibah dari pangeran arab saudi." (the Solo city government received a grant from the prince of Saudi Arabia.) Using the formulas described above, the calculation results are shown in Table V. From the table, it can be concluded that the first sentence is more similar to the second sentence than to the third.
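The single-word measures named above can be sketched with their standard set-based definitions. This is an illustration under assumptions: the function names are hypothetical, the cosine is the binary bag-of-words form, and the second sentence `k_other` is invented for the example; only `k3` comes from the text.

```python
import math


def token_set(sentence: str) -> set[str]:
    """Lowercase, drop the closing punctuation, and split on spaces."""
    return set(sentence.lower().strip(".?!").split())


def overlap(a, b):
    return len(a & b) / min(len(a), len(b)) if a and b else 0.0


def jaccard_index(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0


def jaccard_distance(a, b):
    return 1.0 - jaccard_index(a, b)


def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0


def cosine(a, b):
    # Cosine similarity of binary (presence/absence) token vectors.
    return len(a & b) / math.sqrt(len(a) * len(b)) if a and b else 0.0


k3 = token_set("pemerintah kota solo mendapat hibah dari pangeran arab saudi.")
# Invented comparison sentence, not one of the paper's examples.
k_other = token_set("pemerintah kota solo menerima bantuan dari pemerintah pusat")

for measure in (overlap, jaccard_index, jaccard_distance, dice, cosine):
    print(measure.__name__, round(measure(k3, k_other), 3))
```

All five measures share the same numerator, the number of common tokens, and differ only in the normalization factor.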
Multi-word token similarity measurement uses the concept of lexical similarity based on identifying the common sequences of each token. It is based on the hypothesis that the head is a hyponym of the same term, which is denoted as hn. The hyponyms of the multi-word tokens in the above three sentences are visualized in Fig. 7. The word sequences of a multi-word token t, denoted P(t), are the set of all sequences in t. The lexical similarity between multi-word tokens $t_1$ and $t_2$ is measured with a Dice-like coefficient, in which the numerator is the number of shared constituents (constituents present in both tokens) and the denominator is the total number of constituents.
The multi-word tokens obtained from sentence structure extraction are shown in Table VI. Using the Dice-like coefficient formula, the level of similarity of the multi-word tokens is obtained as shown in Table VII.
From the table above, the multi-word tokens in the first sentence are more similar to those in the third sentence than to those in the second sentence, and the multi-word tokens in the second sentence are very different from those in the third sentence.
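One way such a Dice-like coefficient could be computed is sketched below. The exact form is an assumption based on the counts listed in the evaluation stages: a head term $|h_1 \cap h_2| / (|h_1| + |h_2|)$ plus a whole-token term $|M_1 \cap M_2| / (|M_1| + |M_2|)$, with P(t) read as the set of contiguous word sequences. The function names, the choice of head word, and the example tokens are hypothetical.

```python
def word_sequences(words):
    """All contiguous word sequences of a token (assumed reading of P(t))."""
    n = len(words)
    return {tuple(words[i:j]) for i in range(n) for j in range(i + 1, n + 1)}


def dice_like(seqs_a, seqs_b):
    """Shared sequences divided by the total number of sequences."""
    total = len(seqs_a) + len(seqs_b)
    return len(seqs_a & seqs_b) / total if total else 0.0


def lexical_similarity(token_a, head_a, token_b, head_b):
    """Sum of a head term and a whole-token term (assumed combination)."""
    head_part = dice_like(word_sequences(head_a.split()),
                          word_sequences(head_b.split()))
    full_part = dice_like(word_sequences(token_a.split()),
                          word_sequences(token_b.split()))
    return head_part + full_part


# Illustrative tokens; "pemerintah" is assumed to be the head of both.
score = lexical_similarity("pemerintah kota solo", "pemerintah",
                           "pemerintah kota", "pemerintah")
print(round(score, 3))
```

Under this assumed form, two identical tokens score 1.0, because each of the two terms contributes at most 0.5.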
The similarity measurements above give different results for single-word tokens and multi-word tokens. The measurement with single-word tokens concludes that the first and second sentences are the most similar pair.
The measurement with multi-word tokens, by contrast, finds that the first and third sentences are more similar than the first and second sentences. Both measurements agree that the second and third sentences are the least similar.
In human judgment, the first and third sentences are similar, just like the measurement results of multi-word tokens. This shows that multi-word tokens also have advantages and can help NLP work.

D. Performance
To evaluate the quality of the proposed method, we conducted a manual evaluation of 100 sentences. The expected multi-word tokens were identified by hand and compared with the multi-word tokens extracted by the proposed method, with the results shown in Table VIII. From the table, Precision (P) is the number of correctly extracted multi-word tokens divided by the number of extracted multi-word tokens, and Recall (R) is the number of correctly extracted multi-word tokens divided by the number of multi-word tokens that should have been extracted. The results are P = 204/221 = 0.92 and R = 204/237 = 0.86: of the 221 multi-word tokens extracted, 204 are correct, and these 204 cover most of the 237 multi-word tokens that could be generated. These results indicate that the proposed method is able to extract sentence structure and at the same time produce multi-word tokens that are fairly accurate.
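The reported values can be checked with a short calculation using the standard definitions of precision and recall. The variable and function names are illustrative; the three counts are the ones stated above.

```python
def precision_recall(correct: int, extracted: int, expected: int):
    """Precision = correct / extracted, recall = correct / expected."""
    precision = correct / extracted if extracted else 0.0
    recall = correct / expected if expected else 0.0
    return precision, recall


# Counts from the manual evaluation of 100 sentences.
correct_tokens = 204    # correctly extracted multi-word tokens
extracted_tokens = 221  # all multi-word tokens extracted by the method
expected_tokens = 237   # multi-word tokens that should have been extracted

p, r = precision_recall(correct_tokens, extracted_tokens, expected_tokens)
print(f"P = {p:.2f}, R = {r:.2f}")
```

The gap between the two values reflects the 33 expected multi-word tokens that the method did not extract, compared with 17 extracted tokens that were incorrect.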
We also compared the proposed method with three other studies on multi-word tokens or similar units, [21], [27] and [29]. The method in [21] is a hybrid approach that trains a multi-word expression detector for multiple languages without any manually encoded features. The method in [27] is rule-based. The method in [29] combines statistical, linguistic, and logic-based methods with hybrid techniques for the automatic extraction of multi-word terms from unannotated English documents in the computer science domain.
A comparison between these four methods is shown in Table IX. Each method has advantages and disadvantages. However, by preparing and training on the sentence structure dataset, the proposed method is able to predict the sentence structure elements well. Each element is a token, either a single-word token or a multi-word token. Thus, this method does not rely on manually constructed lexical patterns. The method is highly adaptable and can evolve as new data becomes available.

V. CONCLUSION
A tokenization process that generates single-word tokens and multi-word tokens simultaneously is possible, as this research proposes. To our knowledge, we are the first to propose a tokenization method based on sentence structure, and we hope it will inspire new research. A complete dataset is a very important factor for successful sentence structure prediction. Each predicted sentence element (SPOK) can consist of one or more words and forms one token. Multi-word tokens are more accurate than single-word tokens in terms of sentence similarity.
Multi-word tokens are worthy of further research. In the future, we will enhance the dataset with passive sentences and also apply this approach for use in other types of cases such as NER.