International Journal of Advanced Computer Science and Applications (IJACSA), Volume 14 Issue 2, 2023.
Abstract: This study proposes a new approach to sentence tokenization. Conventional sentence tokenization splits a sentence on spaces, so it produces only single-word tokens: a five-word sentence yields five tokens, one per word. Splitting this way can lose the meaning that the separated words carried together. Our proposed tokenization framework generates one-word and multi-word tokens at the same time. It extracts the sentence structure to obtain the sentence elements, and each sentence element becomes a token. There are five sentence elements: Subject, Predicate, Object, Complement, and Adverb. We extract sentence structures using deep learning, with models trained on datasets prepared beforehand. The training results are reasonably good, with an F1 score of 0.7, and can still be improved. Sentence similarity is the task used to compare the performance of one-word tokens against multi-word tokens; in this comparison, multi-word tokens achieve better accuracy. The framework was built for the Indonesian language but can be applied to other languages with adjusted datasets.
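The contrast between space-based tokens and sentence-element tokens can be illustrated with a small Python sketch. This is not the authors' implementation: the example sentence, the hand-assigned element labels, and the function names `space_tokenize` and `element_tokenize` are assumptions made for illustration. In the paper the labels come from a trained deep learning model; here they are supplied manually, and consecutive words that share a label are simply merged into one token.

```python
from itertools import groupby


def space_tokenize(sentence):
    """Conventional tokenization: split on whitespace, one word per token."""
    return sentence.split()


def element_tokenize(words, labels):
    """Merge consecutive words that share a sentence-element label.

    `labels` stands in for the output of a sentence-structure model
    (hypothetical here; hand-assigned in the example below).
    Returns a list of (token_text, label) pairs.
    """
    if len(words) != len(labels):
        raise ValueError("words and labels must have the same length")
    tokens = []
    # groupby yields one run per stretch of identical adjacent labels
    for label, run in groupby(zip(words, labels), key=lambda pair: pair[1]):
        tokens.append((" ".join(word for word, _ in run), label))
    return tokens


# Illustrative Indonesian sentence: "Mother bought a new book yesterday."
sentence = "Ibu membeli buku baru kemarin"
words = space_tokenize(sentence)

# Hand-assigned labels, assumed for this example only.
labels = ["Subject", "Predicate", "Object", "Object", "Adverb"]

element_tokens = element_tokenize(words, labels)
```

With these assumed labels, the five-word sentence gives five space-based tokens but four element tokens, because "buku baru" (new book) stays together as a single Object token. A limitation of this simple merge is that two adjacent elements of the same type would be fused into one token; a real labelling scheme would need boundary markers to keep them apart.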
Johannes Petrus, Ermatita, Sukemi and Erwin, “A Novel Approach: Tokenization Framework based on Sentence Structure in Indonesian Language,” International Journal of Advanced Computer Science and Applications (IJACSA), 14(2), 2023. http://dx.doi.org/10.14569/IJACSA.2023.0140264
@article{Petrus2023,
title = {A Novel Approach: Tokenization Framework based on Sentence Structure in Indonesian Language},
journal = {International Journal of Advanced Computer Science and Applications},
doi = {10.14569/IJACSA.2023.0140264},
url = {http://dx.doi.org/10.14569/IJACSA.2023.0140264},
year = {2023},
publisher = {The Science and Information Organization},
volume = {14},
number = {2},
author = {Johannes Petrus and Ermatita and Sukemi and Erwin}
}
Copyright Statement: This is an open access article licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, even commercially, as long as the original work is properly cited.