Data Mining to Determine Behavioral Patterns in Respiratory Disease in Pediatric Patients

Several respiratory diseases mainly affect children between 0 and 5 years of age, yet there is no complete report of how each of them behaves. This research studies behavioral patterns in respiratory diseases among children in Peru through data mining, using data generated by the health sector, organizations and research between 2015 and 2019. The analysis used the K-Means clustering algorithm to identify patterns in a total of 10,000 Peruvian clinical records from those years, which revealed different behaviors. The resulting clusters show that, across all the ages studied, most cases involved diseases with codes in the range of approximately 000 to 060. This research was carried out to help health centers in Peru with further study, documentation and decision-making, with a view to optimal prevention strategies for respiratory diseases.
Keywords—Respiratory diseases; data mining; cluster algorithms; K-Means algorithm


I. INTRODUCTION
The use of technologies has played an important role in the health sector, providing benefits that contribute to the fight against diseases. In this context, data mining makes the processing and distribution of data more feasible, optimizing time, labor and costs [1]. Gathering large amounts of data to enable decision-making and strategic guidance is crucial for the analysis of the health situation in Peru [2]. According to [3], the health care industry in Peru is moving towards a personalized care model supported by new technologies to improve management and quality of service. Some of the processes where ICT has been implemented are telemedicine, remote diagnosis, informative applications, databases, cloud computing, etc.; however, more work is still needed to close the existing gaps [3], [4], since the need for big data analysis and population health analysis is not yet fully met.
Respiratory diseases have been mutating considerably over the years, and as a result there is no complete report of their behavior to support proper prevention. According to MINSA and CDC [2], children under 5 years of age showed higher incidence rates of acute respiratory infections and pneumonia in 2015, 2016, 2017 and 2018.
Data mining is still little used as a predictive tool in the Peruvian health sector. Although there are technological initiatives, such as regulating electronic medical records or creating information systems that support the processes of health entities, much remains to be done.
In [5], academic performance was predicted through data mining using three techniques; the study found that data mining made it possible to predict the academic performance of entrants in a timely manner in 82.87% of cases.
Identifying that respiratory diseases are one of the main causes of mortality in infants in Peru, especially in children under five years of age [2], [6]- [8], the project aims to analyze population data to identify patterns of respiratory diseases in infant patients. The aim of this research is to generate results that will reveal the behaviors associated with respiratory diseases in children from 0 to 5 years of age, and may help health centers in Peru to create prevention strategies regarding respiratory diseases that are more frequent in infants and thus achieve a decrease in the number of deaths. 429 | P a g e www.ijacsa.thesai.org

A. Respiratory Diseases
The upper respiratory tract includes the nose, mouth, nasal passages, pharynx and larynx, and the lower respiratory tract includes the trachea, main bronchus and lung (Fig. 1).
This whole system directs outside air into the lungs so that breathing can take place. An acute respiratory tract infection, referred to in this research as a "respiratory disease" [9], [10], is an infectious process that occurs within the airways, either upper or lower.
Upper respiratory tract infections are among the most common reasons people fall ill and seek medical care. They produce variable symptoms, including runny nose, sore throat, cough, shortness of breath and lethargy [9], and are one of the main ills afflicting preschool children [11].

B. Classification of Respiratory Diseases
These are classified as upper respiratory tract infections (URTIs) and lower respiratory tract infections (LRTIs). URTIs are the most common and affect the airways from the nostrils to the larynx, including the sinuses and middle ear [9], [11], [12].

C. Pattern Recognition in Respiratory Diseases
Respiratory diseases generally present a pattern (standard or model) of behavior from which a physician can diagnose whether a patient is suffering from a lower or an upper respiratory tract infection.

D. Data Mining
According to [13], [14], data mining is the process of extracting useful information by analyzing big data: knowledge is discovered by analyzing the interesting patterns that large amounts of data have in common (Fig. 2).

E. Clustering Algorithms
These algorithms aim at clustering data according to their classification (see Fig. 3), with the objective of finding the set of patterns for an efficient representation that characterizes the presented population [14]- [16].
K-Means is the most widely used clustering algorithm because it scales well with the amount of data; it proceeds through initialization, assignment and update steps [15], [17].
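The three steps named above can be sketched in plain Python as follows. This is a minimal illustration and not the RapidMiner implementation used in this study; the function name `k_means`, the toy points and the fixed random seed are assumptions made only for the example.

```python
import random


def k_means(points, k, iterations=100, seed=0):
    """Minimal K-Means on a list of equal-length numeric tuples (illustrative only)."""
    rng = random.Random(seed)
    # Initialization: pick k distinct input points as starting centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(iterations):
        # Assignment: attach each point to its nearest centroid
        # (squared Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids are stable
        labels = new_labels
        # Update: move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Hypothetical toy data: two well-separated groups of three points each.
toy_points = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
toy_labels, toy_centroids = k_means(toy_points, k=2)
```

The loop stops as soon as an assignment pass leaves every label unchanged, which is the usual convergence criterion for this algorithm.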

III. METHODOLOGY
The research is of the applied correlational type since it attempts to establish a degree of association between the variables under study.
The method of analysis and synthesis is used in the development of the research. The methodology provides an orderly procedure for carrying out the analysis and obtaining the results sought in this work.

A. Population
The health sector generates large volumes of information and captures large amounts of data about its patients. The population considered for the cluster analysis was 10,000 Peruvian clinical records from 2015 to 2019, from which the variables extracted were type of disease, year, age, case number and gender.

B. Data Processing and Analysis Technique
After evaluating and critiquing the data to ensure accuracy and reliability, we removed unnecessary data using appropriate statistical tools and software such as MS Excel and Clementine v 11.1.
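A cleaning pass of this kind could look like the sketch below. The study performed this step in MS Excel and Clementine, so the code is only an analogue: the record type `ClinicalRecord`, its field names, the gender codes and the specific filtering rules (missing values, study years, age range, exact duplicates) are all assumptions derived from the stated scope of the study, not the authors' documented procedure.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class ClinicalRecord:
    """Hypothetical record holding the five variables named in the text."""
    disease_code: Optional[str]
    year: Optional[int]
    age: Optional[int]
    cases: Optional[int]
    gender: Optional[str]


def clean_records(records: Iterable[ClinicalRecord]) -> List[ClinicalRecord]:
    """Drop incomplete, out-of-scope and duplicated records (assumed rules)."""
    seen = set()
    cleaned = []
    for rec in records:
        # Rule 1 (assumed): every field must be present.
        if None in (rec.disease_code, rec.year, rec.age, rec.cases, rec.gender):
            continue
        # Rule 2: keep only the study period and the 0-5 age range.
        if not (2015 <= rec.year <= 2019 and 0 <= rec.age <= 5):
            continue
        # Rule 3 (assumed): skip exact duplicates.
        if rec in seen:
            continue
        seen.add(rec)
        cleaned.append(rec)
    return cleaned


# Hypothetical example rows, not taken from the study data.
raw = [
    ClinicalRecord("001", 2017, 4, 12, "M"),
    ClinicalRecord("001", 2017, 4, 12, "M"),    # exact duplicate
    ClinicalRecord("047", 2014, 2, 5, "F"),     # outside the study years
    ClinicalRecord("058", 2018, 7, 3, "F"),     # outside the age range
    ClinicalRecord("060", 2019, None, 8, "M"),  # missing age
    ClinicalRecord("060", 2019, 0, 8, "F"),
]
cleaned = clean_records(raw)
```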

1) Methodology processes
a) Obtaining databases: The sources searched were the Sergio E. Bernales Hospital, Daniel Alcides Carrion Hospital, WHO, research articles, MINSA, SISPRO, UNICEF and the Red Cross. The sources were rated according to selection criteria associated with data quality, in order to finally define the variables, taking into account altitude, city of origin, amount of CO (carbon monoxide), EPS, stratum, date, date of air quality measurement, inhabitants, mortality rate, amount of NO2 (nitrogen dioxide), maximum temperature, minimum temperature and average temperature.
b) Obtain the data analysis tool to be used: A data analysis tool was sought; in this case RapidMiner was chosen.
c) Apply the databases in the analysis tool: The data obtained from the extraction sources are loaded into the data analysis tool.
d) Apply the search algorithms: To identify behavioral patterns within the databases loaded into the analysis tool, a clustering algorithm is selected and run in RapidMiner.
e) Perform the search for data to identify patterns of behavior in these diseases: The algorithm is run on the data to be analyzed in the data analysis tool.
f) Obtain the search results: The results that the data analysis tool produces for the identified patterns are extracted.
g) Documentation of the results obtained: The results obtained from the data analysis tool are documented.

C. Design
This section shows how the data is handled from the information point of view, with its corresponding diagrams, according to the data model proposed by TOGAF.

1) Data architecture design:
To identify the aspects for the appropriate use of data, it is proposed to use the TOGAF framework which provides methods and tools that facilitate decision making in conjunction with data mining [19].
2) Selection of clustering algorithms: After evaluating two types of clustering, partition-based and hierarchical, the K-Means algorithm was chosen. This algorithm builds partitions at a single level and then groups the data according to their category. For this research, four clusters were used, following the distribution produced by the K-Means algorithm, which allowed the data to be analyzed and distributed according to the established parameters.
3) K-Means clustering algorithm: The algorithm was run in RapidMiner. The process was built from normalization, clustering, performance, de-normalization and model application components (see Fig. 4). First, a Ripley-Set containing the data to be analyzed is taken and sent to the normalization step, which brings the data values onto a uniform scale so that the algorithm executes correctly. The normalized table is then sent to the clustering step, which runs the K-Means algorithm; at the same time, the de-normalization step is executed so that the table can be returned to its original values and sent to the model application step.
After the clustering algorithm has run, its output is sent to the performance step, which measures the distances of the records to their centroids. This step is needed to assess how well each cluster groups its variables. Finally, the normalized table and the cluster model pass through the model application step so that the results are expressed in the original data values.
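The chain of steps described above (normalization, clustering, performance, de-normalization) can be approximated outside RapidMiner with the plain-Python sketch below. Several choices are assumptions, since the text does not specify them: z-score scaling as the normalization method, the first k rows as starting centroids, the average point-to-centroid distance as the performance measure, and the toy (age, cases) rows. All function names are hypothetical.

```python
import math


def fit_z_score(rows):
    """Per-column mean and standard deviation (assumed normalization method)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # A zero deviation is replaced by 1.0 so constant columns do not divide by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return means, stds


def normalize(rows, means, stds):
    return [tuple((v - m) / s for v, m, s in zip(r, means, stds)) for r in rows]


def denormalize(rows, means, stds):
    """Inverse of normalize: return values to their original scale."""
    return [tuple(v * s + m for v, m, s in zip(r, means, stds)) for r in rows]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(rows, k, iterations=50):
    centroids = list(rows[:k])  # assumption: first k rows as starting centroids
    labels = None
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda c: sq_dist(r, centroids[c])) for r in rows]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


def avg_within_centroid_distance(rows, labels, centroids):
    """Performance step: mean distance of each row to its assigned centroid."""
    total = sum(math.sqrt(sq_dist(r, centroids[lab])) for r, lab in zip(rows, labels))
    return total / len(rows)


# Hypothetical (age, cases) rows; the values are illustrative only.
raw_rows = [(0, 10), (1, 12), (0, 11), (5, 200), (4, 210), (5, 190)]
means, stds = fit_z_score(raw_rows)
norm_rows = normalize(raw_rows, means, stds)            # normalization
labels, centroids = k_means(norm_rows, k=2)             # clustering
score = avg_within_centroid_distance(norm_rows, labels, centroids)  # performance
centroids_original = denormalize(centroids, means, stds)  # de-normalization
```

Normalizing first keeps a wide-ranging variable such as the number of cases from dominating a narrow one such as age in the distance computation, and de-normalizing the centroids afterwards makes them readable in the original units.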

A. Cluster Graph
The graphs show how each variable is grouped in the clusters generated by the analysis tool.

1) Cluster vs disease graph:
The graph shows the distribution of disease codes across the clusters (see Annexure 1).

3) Cluster vs age graph:
The graph shows the distribution across the clusters of the age range analyzed, from 0 to 5 years.

5) Cluster vs. number of cases graph:
The graph shows the distribution of the analyzed cases across the clusters.

B. Cluster Analysis
1) Cluster 0:
This cluster has a total of 2,426 records associated with 115 types of diseases, with a total of 59,345 cases distributed among those types (see Fig. 9 and 10, section cluster_0). The ages range from 3 to 5 years (see Fig. 7, section cluster_0), the majority being 5-year-old children, and 51% are male and 49% female (see Fig. 8, section cluster_0). The disease type with the lowest incidence in this cluster was "CHRONIC RHINOPHARYNGITIS", at 0.09% with a total of 88 cases among 5-year-olds of both genders. In 2017, children aged 4 and 5 years show a greater number of cases of "ASTHMA, NOT SPECIFIED", which accounts for 19% compared with the other years analyzed and is mostly male. For 2018-2019 there was a significant decrease, with the number of cases falling by 39% for the female gender and 41% for the male gender.
"TRACHEAL AND BRONCHIAL DISEASES, NOT CLASSIFIED ELSEWHERE" represents 25.3% and is present throughout the period studied (2015-2019) at ages 3 to 5 years in both genders, showing exponential growth with an increase of 69.68%.
2) Cluster 1:
This cluster has a total of 2,520 records associated with 112 types of diseases, with a total of 59,345 cases distributed among those types (see Fig. 9 and Fig. 10, section cluster_1). The ages range from 0 to 5 years (see Fig. 7, section cluster_1) and the records correspond to the years 2017-2019 (see Fig. 6, section cluster_1).
In this cluster, the least present type of disease is code 126, with 0 cases reported in 2019. The type of disease with the greatest presence is "OTHER ALLERGIC RHINITISES", with a total of 8,112 cases, 46.72% female and 53.98% male; it is present at all the ages evaluated and peaked in 2018.
As in cluster 0, the gender variable is evenly split: the male gender represents 49.5% of the cases analyzed and the female gender 50.5%.

3) Cluster 2:
This cluster has a total of 2,509 records associated with 116 types of disease, with a total of 59,345 cases distributed among those types (see Fig. 9 and Fig. 10, section cluster_2). The ages range from 0 to 2 years (see Fig. 7, section cluster_2) and the records span the years 2015-2019 (see Fig. 6, section cluster_2). The types of disease grouped here correspond to codes 047 to 176 and 184 according to Annexure 1. Within this cluster, the least present diseases are codes 057 and 058 (see Fig. 5 and Fig. 9, section cluster_2).
As in cluster 0, the gender variable is evenly split: the male gender represents 50.29% of the cases analyzed and the female gender 49.71% (see Fig. 8, section cluster_2).

4) Cluster 3:
This cluster groups a total of 2,541 records associated with 115 types of diseases, with a total of 59,345 cases distributed among those types (see Fig. 9 and 10, section cluster_3). The ages range from 0 to 5 years (see Fig. 7, section cluster_3) and the records correspond to the years 2015-2017 (see Fig. 6, section cluster_3). The diseases grouped here correspond to codes 000 to 119 according to Annexure 1. Within this cluster, the least present diseases are those with codes 112, 114, 116, 117, 118 and 119; a smaller amount of data was found for 2018 (0.47%) and 2019 (0.07%).
As in cluster 0, the gender variable is evenly split: the male gender represents 50.13% of the cases analyzed and the female gender 49.86% (see Fig. 10, section cluster_3).
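Per-cluster figures of the kind reported above (number of records, distinct disease types, total cases, age range and gender split) can be derived from labeled records with a simple aggregation. The sketch below assumes each record is a dictionary with the hypothetical keys `cluster`, `disease_code`, `age`, `gender` and `cases`, that gender is coded "M" or "F", and that the gender share is computed over cases rather than over records; the example rows are invented and do not come from the study data.

```python
from collections import defaultdict


def summarize_clusters(rows):
    """Aggregate labeled records into one summary dictionary per cluster."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["cluster"]].append(row)

    summary = {}
    for cluster, members in sorted(grouped.items()):
        total_cases = sum(r["cases"] for r in members)
        male_cases = sum(r["cases"] for r in members if r["gender"] == "M")
        ages = [r["age"] for r in members]
        summary[cluster] = {
            "records": len(members),
            "disease_types": len({r["disease_code"] for r in members}),
            "total_cases": total_cases,
            "age_range": (min(ages), max(ages)),
            # Share of cases (not of records) attributed to male patients.
            "male_share_pct": round(100 * male_cases / total_cases, 2) if total_cases else None,
        }
    return summary


# Hypothetical labeled rows for illustration.
example_rows = [
    {"cluster": "cluster_0", "disease_code": "001", "age": 3, "gender": "M", "cases": 30},
    {"cluster": "cluster_0", "disease_code": "001", "age": 5, "gender": "F", "cases": 20},
    {"cluster": "cluster_0", "disease_code": "002", "age": 4, "gender": "M", "cases": 50},
    {"cluster": "cluster_1", "disease_code": "047", "age": 0, "gender": "F", "cases": 10},
    {"cluster": "cluster_1", "disease_code": "058", "age": 2, "gender": "M", "cases": 10},
]
cluster_summary = summarize_clusters(example_rows)
```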

V. DISCUSSION
The results obtained from the clusters after applying data mining to child patients aged 0 to 5 years, through the discovery of patterns in respiratory diseases in the period 2015-2019 (see Fig. 5, 6, 7, 8, 9 and 10), were relevant for decision making. The analysis of cluster 0 identified exponential growth in "Diseases of the trachea and bronchus, not elsewhere classified", from 2572 cases in 2017 to 3819 cases by 2019. If promotion and prevention strategies are implemented in the coming years, many future cases in children 0-5 years old could be avoided.
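The growth between the two case counts quoted for cluster 0 can be checked with a short calculation. The sketch below uses only the 2017 and 2019 figures given above; the function names are hypothetical, and the constant yearly rate is an illustrative assumption, since the 2018 count is not stated.

```python
def percent_change(start, end):
    """Overall change from start to end, as a percentage of start."""
    return 100 * (end - start) / start


def annual_growth_rate(start, end, years):
    """Constant yearly rate (in %) that would take start to end in the given years."""
    return 100 * ((end / start) ** (1 / years) - 1)


cases_2017, cases_2019 = 2572, 3819  # figures quoted in the discussion
overall = percent_change(cases_2017, cases_2019)
yearly = annual_growth_rate(cases_2017, cases_2019, years=2)
print(f"{overall:.2f}% overall, {yearly:.2f}% per year")
```

Under these assumptions the overall increase is a little over 48% across the two years, which corresponds to a compound rate of roughly 22% per year.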
After analyzing the results of cluster 1, it was possible to identify that, had prevention plans been applied during the period 2017-2018, the 334% increase in cases of "Other allergic rhinitis" among children between 0 and 5 years of age, in which the female population was the most affected, could have been avoided.
According to the results of cluster 2, if preventive measures are taken at ages 0 to 2 years for diseases with codes 047 to 176 and 184 (Annexure 1), the number of cases and of children affected would begin to decrease, because these diseases mainly affect this segment of the population.
After analyzing the results of cluster 3, it was possible to identify that diseases such as 112, 114, 116, 117, 118 and 119, and the other diseases with codes 000 to 119, had few cases in recent years. This suggests that preventive measures are being taken for these diseases; however, they are not sufficient, since these diseases remain frequent despite their decrease.
The clusters generated revealed different behaviors, with the result that most of the cases at all ages involved diseases with codes in the range of approximately 000 to 060.

VI. CONCLUSION
Data mining models based on algorithms are suitable for the prediction and description of the relationships that exist between indicators or variables, for the identification of patterns within the analysis of the results, optimizing the processing of large amounts of data. The management of these data allowed structuring them and subsequently converting them into information by means of RapidMiner.
The implementation of the K-Means algorithm made it possible to take different metrics for evaluating the clustering, and its graphical representations reveal the behavior of the data, supporting later decisions about future prevention programs in Peruvian health centers.
The aim of this research is to demonstrate the great help that can be provided by models that reveal patterns of patient behavior and serve as a basis for redirecting resources in the Peruvian health sector. This data extraction also enables future research.

ANNEXURE 1
Nomenclature of disease variable codes is shown in Table I.

TABLE I.  TABLE OF