The Effect of Natural Language Processing on the Analysis of Unstructured Text: A Systematic Review

Abstract: The analysis of unstructured text has become a challenge for the natural language processing (NLP) and machine learning (ML) community. This paper aims to describe the potential of the most used NLP techniques and ML algorithms to address various problems afflicting our society. Several original articles published in Scopus during 2021 were reviewed. The applied approach was retrospective, cross-sectional and descriptive. The data collected were entered into the SPSS statistical software v25. Among the findings, the most used NLP technique was term frequency-inverse document frequency (TF-IDF), while the most used supervised learning algorithm was the Support Vector Machine (SVM). Likewise, the predominant deep learning algorithm was Long Short-Term Memory (LSTM). This research aims to help experts and those starting out in research to identify the most used NLP and ML algorithms.


I. INTRODUCTION
The Internet has become an indispensable ally for any institution. According to [1], there were more than 5,168,000,000 users worldwide, of which Asia accounts for 53.4% and ranks first, while Latin America and the Caribbean rank fourth with 9.6%. According to [2], in Spain the number of users reached 91% of the population. The author in [3] points out that there are currently more than 1,900,000,000 web pages, that 167 million videos are generated in a minute on the TikTok platform and that, likewise, Amazon customers spend USD 283,000 on e-commerce. It follows that every second large amounts of data are produced in various formats such as images, audio, video and text.
The massive amount of data implies the need to automate human tasks through rapidly advancing technological innovation, so that the data can be used for efficient and effective decision making. According to [4], such innovation includes Artificial Intelligence (AI). The author in [5] points out that AI trains computers to learn from experience and to do the work of human beings. Growth in this field has intensified due to the COVID-19 pandemic. The research in [6] states that AI surpasses human cognitive abilities. AI is an interdisciplinary field of computing [7], capable of solving problems in medicine, psychology, education, health and information and communication technologies (ICT), among others.
On the Internet there is a lot of traffic and users produce a high volume of unstructured text, so it is difficult to determine which websites users visit. The author in [8] proposes a model based on Natural Language Processing (NLP) techniques and neural networks that identifies the websites visited by users by recasting the problem as a text classification task. This solution is advantageous for digital marketing because it helps to build user loyalty.
Social networks are platforms with a high proliferation of comments, posted with absolute freedom and with no restrictions on attacks, insults, discriminatory speech, hate speech and other offensive terms. The research work in [9] proposes a text classification model to detect cyberbullying, consisting of a neural network framework that examines the content of the text in order to analyze the effect of the extracted features. The usefulness of this study lies in identifying solid mechanisms for the detection of cyberbullying.
Regarding the scope of the research, systematic review articles published in the Scopus database in 2021 were reviewed. For example, [10] presented a systematic review to provide evidence on the properties of the text data used to train machine learning approaches and how they can be applied in clinical practice. In another review article, [11] highlighted the usefulness of NLP and Machine Learning (ML) for structuring the free-text comments made by patients of health organizations. This review revealed that no systematic study describes the frequency of the ML algorithms used in the various original articles. Filling this gap is the main motivation for this research.
The objective of this study is to systematically review the literature on the application of natural language processing to analyze, interpret and classify the large volume of unstructured text produced in digital format, and to describe the frequency of ML algorithms, grouped as supervised, unsupervised and deep learning. In this context, the aim is to answer the questions raised in Table I. Question one aims to identify NLP application fields, together with NLP features such as text preprocessing techniques (for example, tokenization). Question two describes the frequency of ML algorithms for data analysis. Finally, question three refers to the frequency of deep learning algorithms; this question has been given preference because these algorithms are very specialized.
Natural Language Processing (NLP) is a branch of artificial intelligence and a resource for carrying out qualitative tasks on unstructured information, based on mathematical and statistical algorithms applied to large amounts of data. In this regard, [11] pointed out that NLP is a computational analytical technique used to extract information from unstructured text into a structured form, for which syntactic processing of the text is done; it also captures meaning and identifies links based on semantic relationships. The author in [12] indicates that NLP is a technique for the automatic extraction of information from different electronically written resources at the level of documents, words, grammar, meaning, and context. Likewise, [13] stated that NLP is a key tool for information automation and extraction that can process large amounts of data, and that its application is useful for issuing reports from the radiology area of a hospital.
Machine Learning (ML) is used to create models that allow computers to learn without being explicitly programmed. In this regard, [11] affirms that ML is a set of statistical algorithms that can be trained and tested on a data set to detect patterns and predict sentiment within a text. The research in [14] points out that ML is the process that detects and exploits patterns and trends that are "hidden" in the production of unstructured texts. The author in [15] indicates that ML is a set of learning algorithms built into machines to provide knowledge about processes quickly and efficiently. ML is classified into three fields: supervised learning (S), unsupervised learning (US), and reinforcement learning. The present work covers algorithms from the first two fields.
Supervised learning uses algorithms that learn iteratively from data and find the hidden information from which computers learn. The author in [11] points out that these algorithms try to predict and classify texts. For example, in the electronic documentation of a health service, the algorithms are able to identify the most common issues expressed by patients. Likewise, [16] points out that supervised ML divides the input data set into training and testing sets. The training set has an output variable that must be predicted or classified.
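The division into training and testing sets can be illustrated with a minimal sketch. It assumes a toy set of labelled patient comments invented for illustration; the function name `train_test_split`, the 25% test ratio and the fixed seed are assumptions, not taken from the reviewed studies.

```python
import random

def train_test_split(samples, test_ratio=0.25, seed=0):
    """Shuffle labelled samples and split them into training and testing sets."""
    shuffled = samples[:]                      # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)      # fixed seed keeps the split reproducible
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical labelled comments: (text, output variable to be predicted)
data = [
    ("long waiting time at reception", "waiting"),
    ("waited two hours for the doctor", "waiting"),
    ("the nurse was very kind", "staff"),
    ("staff explained everything clearly", "staff"),
    ("the room was not clean", "facilities"),
    ("parking is too small", "facilities"),
    ("appointment was delayed again", "waiting"),
    ("doctor listened carefully", "staff"),
]

train, test = train_test_split(data)
print(len(train), "training samples,", len(test), "testing samples")
```

A classifier would then be fitted on `train` only, and its predictions compared with the known labels in `test`.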
Unsupervised learning uses algorithms to identify patterns and detect anomalies such as fraud and scams targeting potential users, among others. In this regard, [11] indicates that it is a technique that identifies models or patterns of behavior without the need to know the target attribute that could be present in a text. The research work in [16] points out that the algorithms learn some characteristics from the data. One of the best-known models is clustering.
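As an illustration of clustering, the following is a minimal K-Means sketch in plain Python. The two-dimensional points stand in for hypothetical document feature vectors, and the naive initialization (the first k points) is an assumption made to keep the example deterministic; practical implementations usually initialize more carefully.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=10):
    """Group points into k clusters without using any target attribute."""
    centroids = list(points[:k])               # naive init: first k points (assumption)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        for idx, p in enumerate(points):
            dists = [squared_distance(p, c) for c in centroids]
            labels[idx] = dists.index(min(dists))
        # Update step: each centroid moves to the mean of its members
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels

# Hypothetical 2-D feature vectors forming two visibly separate groups
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.8, 8.1)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

No label is supplied at any point; the grouping emerges only from the distances between the vectors.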
NLP is becoming highly relevant to sentiment analysis (SA), positive or negative, of unstructured text; one source of application is the comments made on social networks. In that respect, [17], making use of NLP tasks and ML methods, propose a text processing model for SA that uses comments made on Twitter. The first phase consists of collecting the text, cleaning it, preprocessing it, extracting features and then categorizing the data. The proposed corpus is multidisciplinary and can be used in market analysis, customer behavior, survey analysis, and brand monitoring, among others. This contribution is used as a basis for broadening the range of real applications.
NLP and ML are highly applicable in the medical field. They can be used to detect the misuse and abuse of prescription drugs in comments made on social networks. In this regard, [18] propose a model to detect self-reports of prescription drug abuse from Twitter. Using these public data, it develops a continuous monitoring system that classifies comments into the "abuse or misuse" class.

A. Introduction
The PRISMA method is a structured tool with a systematic approach that helps to present the results of research. According to [19], the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA 2020) statement is conceptualized as a series of recommendations for selecting, evaluating and synthesizing studies, which improves the clarity and transparency of research. In fact, [20], [21] point out that the PRISMA statement is an essential strategy for conducting good research and publishing the results. For the sake of objectivity, this research has been divided into four phases, according to the process proposed by:

1) Retrieval of publications.
2) Review of titles and abstracts.
3) Review of the full text.
4) System information collection.
With regard to the initial retrieval phase, a strategy was needed to allow efficient document searches. In this respect, [22] point out that the PICO strategy is relevant for framing research questions in order to optimize the location of articles. PICO is an acronym for a structure of components. According to [23], this format has four elements: problem, intervention, comparison and outcome. Table II shows the optimal document search; the strategy was adapted to the acronym PIO. In addition, the ACM Computing Classification System thesaurus was used to identify appropriate synonyms; the link is: https://bit.ly/3dphAJP. In phase two, the review of titles and abstracts, articles were located to be contrasted with the inclusion and exclusion criteria. Titles and abstracts were reviewed first, then the method and results, in order to establish the search formula. The database consulted was Scopus, for the year 2021. In the third phase, the combination of keywords and synonyms was used with emphasis on the variables Natural Language Processing and text analysis. The logical operators AND and OR were applied repeatedly until the appropriate formula was obtained. Table III shows the restricted query.
OA(all) AND (TITLE-ABS("Natural Language Processing") OR TITLE-ABS("Natural Language Process") OR TITLE-ABS("Natural Language Text") OR TITLE-ABS("Computational linguistics") OR TITLE-ABS("Word processing") OR TITLE-ABS("NLP")) AND (TITLE-ABS("text analysis") OR TITLE-ABS("text analytics") OR TITLE-ABS("text data") OR TITLE-ABS("text classification") OR TITLE-ABS("Data extraction")) AND PUBYEAR > 2020 AND DOCTYPE(AR)
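The way the AND and OR operators combine the keyword and synonym groups can be sketched as follows. This is only an illustration of how such a query string might be assembled; the helper `or_group` is hypothetical, and the snippet builds text only, without calling any Scopus service.

```python
def or_group(terms, field="TITLE-ABS"):
    """Join synonyms with OR inside one parenthesised group."""
    return "(" + " OR ".join(f'{field}("{term}")' for term in terms) + ")"

# Synonym groups taken from the restricted query
nlp_terms = [
    "Natural Language Processing", "Natural Language Process",
    "Natural Language Text", "Computational linguistics",
    "Word processing", "NLP",
]
text_terms = [
    "text analysis", "text analytics", "text data",
    "text classification", "Data extraction",
]

# Groups and filters are combined with AND
query = " AND ".join([
    "OA(all)",                 # open-access filter
    or_group(nlp_terms),       # NLP variable and its synonyms
    or_group(text_terms),      # text analysis variable and its synonyms
    "PUBYEAR > 2020",          # publication year filter
    "DOCTYPE(AR)",             # articles only
])
print(query)
```

Keeping the synonyms in lists makes each iteration of the search formula a one-line change.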

B. Selection of Criteria
Inclusion and exclusion criteria for the efficient search of research articles were identified using the PICO strategy. The query was run on December 19, 2021. The search was restricted to 2021, and 144 articles were located in the Scopus database. To ensure the rigor and credibility of the selected articles, they were evaluated against the criteria defined in Table IV.

| PICOS | Inclusion criteria | Exclusion criteria |
|---|---|---|
| Problem | Natural language processing (NLP) in text analysis. | Natural language processing in formats other than text (e.g., video, audio). |
| Intervention | NLP interventions in data extraction and summaries of text analysis with free software (R language, Python). | NLP interventions in which actual text is not processed using an NLP process. Data extraction with licensed software. Chatbot. |
| Comparison | Comparison with another type of intervention, such as the elaboration of the linguistic corpus. | Studies that have no other type of comparison. |
| Outcomes | Report on the impact of the intervention. | It does not contain a report on the impact of the intervention. |
| Study Type | Quantitative, qualitative, and mixed method studies of original articles. | Systematic review articles, meta-analyses, literature reviews, conferences, dissertations, protocol works, tutorials. Studies not conducted in English. Duplicate works and works not available in full text. |

A. Search Results
A total of 144 articles were collected during the search process. Eleven were removed after reviewing the title and abstract of each document, leaving 133 (n=133). Then, the method and conclusions were reviewed closely, and those that did not meet the inclusion criteria were discarded (n=87). Finally, 46 articles remained for the systematic review. Fig. 1 shows the flowchart of the search strategy.
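The counts at each stage of the selection flow can be checked with a short calculation. The variable names are illustrative; the figures are the ones reported in this section.

```python
identified = 144               # articles returned by the search
removed_title_abstract = 11    # removed after reviewing title and abstract
discarded_criteria = 87        # discarded for not meeting the inclusion criteria

after_screening = identified - removed_title_abstract
included = after_screening - discarded_criteria

print(after_screening)  # 133
print(included)         # 46
```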

B. Description of Included Studies
The fields or sectors that have benefited from the application of NLP and ML include aviation, medicine, cyberbullying, education, engineering and technology, among others. The medical sector accounts for 13 studies (28.26%), education for 10 studies (21.74%) and technology for seven studies (15.22%). Table V shows the details.
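The share of each field follows from dividing its number of studies by the 46 included articles. A minimal sketch of that calculation, using only the three fields named in this paragraph, is given below; the dictionary keys are illustrative labels.

```python
total_studies = 46
studies_by_field = {"medicine": 13, "education": 10, "technology": 7}

# Percentage of the included studies per field, rounded to two decimals
shares = {
    field: round(100 * count / total_studies, 2)
    for field, count in studies_by_field.items()
}
print(shares)  # {'medicine': 28.26, 'education': 21.74, 'technology': 15.22}
```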

D. Frequency of ML Algorithms
ML algorithms analyzed in this study are defined in Appendix D and grouped under supervised learning, unsupervised learning, and Deep Learning.

1) Frequency of supervised learning algorithms (S):
The Support Vector Machine (SVM) algorithm was used in 17 studies. The Naive Bayes (NB) algorithm was applied in 15 studies. The Random Forest (RF) algorithm has 10 studies. Regression (R) has 9 studies. K-NN has 8 studies. RF has 5 studies. The Passive Aggressive (PA) algorithm was used in two studies. The AdaBoost (ADA) algorithm has one study, as does the Singular Value Decomposition (SVD) algorithm. Fig. 2 shows the details.

2) Frequency of unsupervised learning algorithms (US):
The Latent Dirichlet Allocation (LDA) algorithm was used in three studies (6.5%), while the K-Means algorithm was used in only one study (2.2%).

3) Frequency of deep learning algorithms (DL):
The Long Short-Term Memory (LSTM) deep learning algorithm has 24 studies, followed by Convolutional Neural Networks (CNN) with 12 studies, Recurrent Neural Networks (RNN) with 9 studies and the Multilayer Perceptron (MLP) with 5 studies. The least used algorithms were the Gated Recurrent Unit (GRU) with four studies and the Artificial Neural Network (ANN) with two studies. Fig. 3 shows the details.

4) Studies with hybrid algorithms, Supervised (S), Unsupervised (US) and Deep Learning (DL):
Out of 46 articles, 10 (21.74%) use only S algorithms. In this regard, [24] and [52] use SVM, and the author in [35] uses the NB algorithm. On the other hand, the study in [51] uses the LDA algorithm.
Three studies use S and US algorithms at the same time. Among them, [17] use two S algorithms (SVM, NB) and one US algorithm (K-Means), while [67] use five S algorithms (SVM, NB, Regression (R), RF, K-NN) and one US algorithm (LDA).
The details of the ML algorithms used by the 46 studies can be found in Appendix C.

V. DISCUSSION
The popularity and proliferation of Internet platforms such as web portals, social networks and digital media in general have created massive social interaction between users, all the more so because the global COVID-19 pandemic led to an unprecedented increase in online learning and a consequent exponential production of unstructured text. This phenomenon, according to [32], is driving the increasing use of NLP and ML in text analysis to solve real problems efficiently.
What are NLP and ML application fields?
It was found that sectors such as health, education, technology, engineering, software development, aviation, natural disaster relief, cyberbullying, construction, finance, marketing, politics, business organization, information security, psychology, and urban transport benefit most. This suggests that NLP and ML can be applied to solve problems in virtually any sector.
www.ijacsa.thesai.org
What is the frequency of ML algorithms for text analysis?
The analysis of the articles indicates that NLP preprocessing techniques such as tokenization, normalization and removal of irrelevant words are necessary before applying ML algorithms, and that they have a positive impact on achieving the study objective, as in Garg & Sharma [17]. TF-IDF, word2Vec and GloVe are among the most used NLP techniques. The most used supervised learning algorithms were SVM, NB and RF; the least used were PA, ADA and SVD. Unsupervised learning algorithms were the least used overall: only three studies used the LDA algorithm.
What is the frequency of DL algorithms?
With regard to deep learning, the most used algorithm was LSTM with 22 articles, and the least used was ANN with only 2 articles. This approach is becoming a primary tool for NLP. However, it should be noted that ML algorithms can be biased, because they depend on the quality of the data with which the research is carried out and especially on access to data, which many institutions, such as hospitals, unfortunately restrict [69].
Considering the works obtained, the most used NLP technique was TF-IDF, the most used supervised learning algorithm was SVM and, among neural networks or deep learning, it was LSTM. On the other hand, according to [69], one of the main obstacles to applying NLP and ML algorithms is access to data, and reversing this situation represents a challenge for the AI community.

VI. CONCLUSION
The most commonly used NLP techniques for text analysis in the reviewed research are TF-IDF, word2Vec and GloVe, while the predominant deep learning algorithm is LSTM. In addition, this article complements existing systematic reviews on NLP and ML by describing the frequencies of influential algorithms, and it is expected that this work will lead to further research that increases the application of NLP and ML for the benefit of fields such as health, education, transport and technology, among others. Finally, it should be noted that improving the cognitive aspect of this science requires further research, taking into account that NLP and ML algorithms are universal and rooted in mathematics and statistics.

| Dimension | Definition |
|---|---|
| Word segmentation (Tokenize) | It is the process of converting paragraphs into inputs for the computer through a word list. |
| Data cleanup (Stop word) | It is the process of removing words that do not add exclusionary meaning to a sentence. |
| Lexicographic analysis with stemming | It is the process of converting each word of the sentence to its root form by removing or replacing suffixes. |
| Lexicographic analysis with lemmatization | It is a more accurate process than stemming and involves making an analysis of the vocabulary and its morphology to return to the basic form of the word. |
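The four preprocessing steps defined above can be sketched in plain Python. The stop-word list, the suffix list and the lemma dictionary are tiny hypothetical stand-ins chosen for illustration; real studies typically rely on the resources of an NLP library rather than hand-written lists.

```python
import re

# Tiny illustrative resources (assumptions, not taken from the reviewed studies)
STOP_WORDS = {"the", "is", "a", "of", "and", "to", "in", "are", "were"}
SUFFIXES = ("ing", "ed", "es", "s")            # checked in this order
LEMMAS = {"were": "be", "are": "be", "studies": "study"}

def tokenize(text):
    """Word segmentation: lowercase the text and return a list of word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens):
    """Data cleanup: drop words found in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Naive stemming: strip the first matching suffix, keeping at least 3 letters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def lemmatize(token):
    """Dictionary lookup standing in for morphological analysis."""
    return LEMMAS.get(token, token)

tokens = tokenize("The patients were describing the studies.")
content = remove_stop_words(tokens)
print(content)                          # ['patients', 'describing', 'studies']
print([stem(t) for t in content])       # ['patient', 'describ', 'studi']
print(lemmatize("studies"))             # study
```

The last two lines show the difference noted in the table: stemming cuts "studies" to the non-word "studi", while the dictionary-based lemmatization returns the base form "study".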

| Type | Algorithm | Definition |
|---|---|---|
| Basic mathematical functions | Part-of-speech (POS) tagging | It is the process of grammatical tagging or disambiguation of word categories. |
| | Named entity recognition (NER) | It is the process of "finding out" if a piece of data belongs to a person or business organization. |
| | N-gram | It is a sub-sequence of n items of a text data sequence. It is a probabilistic algorithm that allows making a statistical prediction of the next item of a sequence of a string of text data. |
| | Bag of words (BoW) | It is the process that allows feature extraction from the text; it determines the number of times that a word occurs in the sentence. |
| Basic statistical algorithms | Term frequency-inverse document frequency (TF-IDF) | It is a statistical model that allows scoring the data to reflect their relevance in a given document. |
| | Text vectorization | It is the process which transforms the input of language into something that the computer can understand. |
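Three of the techniques defined above (N-gram, BoW and TF-IDF) can be sketched in a few lines. The TF-IDF weighting used here, term frequency times log(N / document frequency), is one common variant and is an assumption; the reviewed studies may use other weightings. The N-gram function only extracts sub-sequences and does not implement the predictive use mentioned in the table. The toy corpus is invented for illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous sub-sequences of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_words(tokens):
    """Number of times each word occurs in the token list."""
    return Counter(tokens)

def tf_idf(corpus):
    """corpus: list of non-empty token lists. Returns one {term: score} dict per document."""
    n_docs = len(corpus)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical tokenized documents
corpus = [
    ["drug", "abuse", "report"],
    ["drug", "dose", "report"],
    ["exam", "report"],
]
print(ngrams(corpus[0], 2))
print(bag_of_words(["drug", "dose", "drug"]))
print(tf_idf(corpus)[0])
```

In this corpus "report" occurs in every document, so its score is zero, while "abuse", which occurs in a single document, receives the highest score in the first document.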

| ML Type | Algorithm | Overview |
|---|---|---|
| Supervised learning | K-NN | It is an algorithm that can be used to classify new samples or to predict values by looking for the "most similar" data points (by proximity). |
| | Regression (R) | It is the algorithm that determines the relationships between dependent and independent variables for prediction and prognosis. |
| | Decision tree (DT) | It is an algorithm that uses the fork for every possible outcome of a decision. |
| | Support Vector Machines (SVM) | It is an algorithm that seeks to find a hyperplane that best separates two different kinds of data points. |
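As an example of the supervised algorithms listed above, the following is a minimal K-NN classification sketch. The two-dimensional feature vectors, the sentiment labels and the choice of k=3 with squared Euclidean distance are assumptions made for illustration only.

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples.

    train: list of (feature_vector, label) pairs.
    """
    by_distance = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training samples: (feature vector, label)
train = [
    ((1.0, 1.0), "negative"), ((1.2, 0.9), "negative"), ((0.8, 1.1), "negative"),
    ((5.0, 5.0), "positive"), ((5.2, 4.8), "positive"), ((4.9, 5.1), "positive"),
]

print(knn_predict(train, (1.1, 1.0)))  # negative
print(knn_predict(train, (5.1, 5.0)))  # positive
```

In text analysis the feature vectors would typically come from a vectorization step such as BoW or TF-IDF rather than being written by hand.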