Developing A Model to Predict the Occurrence of the Cardio-Cerebrovascular Disease for the Korean Elderly using the Random Forests Algorithm

This study aimed to develop a model for predicting cardio-cerebrovascular disease in the South Korean elderly using the random forests technique. It analyzed 2,111 respondents (879 males and 1,232 females) aged 60 or older, out of a total of 7,761 respondents who completed the Seoul Welfare Panel Study. The outcome variable was defined as cardio-cerebrovascular disease (e.g., hypertension, cerebral infarction, hyperlipidemia, cardiac infarction, and angina). The random forest-based model identified the following major determinants of cardio-cerebrovascular disease in the South Korean elderly: mean monthly household income, the highest level of education, subjective health condition, subjective friendship, subjective family relationship, smoking, regular exercise, age, marital status, gender, depression experience, economic activity, and high-risk drinking. Among them, mean monthly household income was the most important predictor. Based on the developed prediction model, a systematic program for preventing cardio-cerebrovascular disease in the Korean elderly needs to be developed.

Keywords: Prediction model; data mining; random forest; risk factors; cardio-cerebrovascular disease; stroke


INTRODUCTION
Cardio-cerebrovascular diseases include cerebrovascular diseases (e.g., cerebral hemorrhage and cerebral infarction), cardiac disorders (e.g., cardiac insufficiency, angina, and cardiac infarction), and vascular abnormalities (e.g., hypertension, diabetes, hyperlipidemia, and arteriosclerosis). As of 2013, cardio-cerebrovascular diseases accounted for more than 25% of national mortality. The annual death toll per 100,000 population is 50.3 for cardiovascular diseases and 50.2 for cerebrovascular diseases [1]. Cardio-cerebrovascular disease is the second leading cause of death in South Korea [1]. It has increased 1.35-fold over the past decade and has become a critical health problem in South Korea [1].
In particular, cardio-cerebrovascular disease is a representative chronic disease of the elderly. Its mortality rate is known to increase rapidly with age, and previous studies reported that it rose abruptly in the elderly over 70 years old [1]. Additionally, cardio-cerebrovascular disease in the elderly is often accompanied by severe disability even if surgical treatment is successful. Therefore, patients tend to have difficulty returning to society even after recovery [2]. Consequently, identifying and preventing the factors associated with cardio-cerebrovascular disease is essential for achieving successful aging.
As more people die from cardio-cerebrovascular diseases, there is growing interest in managing and preventing them. In the past 20 years, a number of studies have attempted to evaluate various risk factors for cardio-cerebrovascular diseases, such as sociodemographic factors, lifestyle, and family history [3][4][5][6]. These studies have identified risk factors that cannot be controlled (e.g., age and gender) as well as those that can be controlled (e.g., eating habits and physical activity) [3][4][5][6].
However, it has been pointed out that these individual risk factors have limitations in explaining the onset of cardio-cerebrovascular disease [7]. Moreover, different studies have indicated different factors as the most important risk factor. Additionally, cardio-cerebrovascular disease is known as a complex disease that arises from the interactions of multiple factors, including sociodemographic factors (e.g., age and gender), environmental factors (e.g., lifestyle), and causative disease factors (e.g., hypertension and hyperlipidemia) [8][9], and recent studies reported that psychological factors such as depression were major risk factors as well [10][11].
Moreover, the occurrence patterns and the risk factors of cardio-cerebrovascular disease vary greatly among ethnic groups. Therefore, it is difficult to establish a prevention and management strategy based on the results of previous studies conducted on other ethnic groups. Additionally, lifestyle, an important determinant of health, is shaped by cultural influences as well as personal characteristics. Therefore, it is necessary to use big data to develop a prediction model for cardio-cerebrovascular disease that reflects the characteristics of the elderly living in local communities in South Korea.
The random forests technique has been used increasingly as a data mining algorithm for identifying the risk factors of target variables such as a disease or a disability [12][13][14]. It is an ensemble method that combines multiple decision trees in order to minimize over-fitting, which is a shortcoming of the single decision tree, and it shows good prediction ability. This study aimed to develop a model for predicting cardio-cerebrovascular disease in the South Korean elderly using the random forests technique.
The remainder of this paper is organized as follows. Section II explains the data source and materials, and Section III defines random forests and explains the procedure of final model development. Section IV presents the results of the final prediction model. Lastly, Section V presents the discussion and directions for future studies.
www.ijacsa.thesai.org

A. Data Source
This study analyzed a portion of the raw data of the Seoul Welfare Panel Study, which was conducted by the Seoul Welfare Foundation to survey Seoul citizens from June 1 to August 31, 2010. The Seoul Welfare Panel Study was approved (#20113) by Statistics Korea in 2009, and it has been conducted to identify the welfare level of households residing in Seoul, understand the status of the welfare vulnerable class, and estimate the demand for welfare services [15]. The panel study targeted households in Seoul as of "the 2005 Population and Housing Census" and sampled them using the stratified cluster sampling method across the 25 districts of Seoul. The main survey items were income, economic level, health, living conditions, and the demand for welfare services. The survey was conducted using the computer-assisted personal interviewing method: the interviewer visited the surveyed households and entered responses to a structured questionnaire into a portable computer. This study analyzed 2,111 respondents (879 males and 1,232 females) aged 60 or older, out of a total of 7,761 respondents who completed the survey.

B. Measurements and Definitions of Variables
The outcome variable was defined as cardio-cerebrovascular disease (e.g., hypertension, cerebral infarction, hyperlipidemia, cardiac infarction, and angina). The explanatory variables included age (60 to 69 years or 70 years or older), gender (male or female), the highest level of education (elementary school or below, junior high school, high school, or college graduation and over), economic activity (yes or no), mean monthly household income (less than 2 million KRW, 2-4 million KRW, or more than 4 million KRW), marital status (living with a spouse, married but not living with a spouse, or single), high-risk drinking (yes or no), smoking (non-smoker, past smoker, or current smoker), subjective health condition (good, normal, or poor), subjective family relationship (good, average, or bad), subjective friendship (good, average, or poor), regular exercise (no or yes), and depression symptoms in the past month (no or yes).
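As an illustration only, the 13 explanatory variables and their response levels listed above can be written as a small data dictionary and encoded as integers before modeling. The key names and the `encode` helper are hypothetical and are not taken from the study's analysis files; only the level labels follow the definitions in the text.

```python
# Hypothetical key names; level labels follow the variable definitions in the text.
VARIABLE_LEVELS = {
    "age": ["60-69", "70+"],
    "gender": ["male", "female"],
    "education": ["elementary or below", "junior high", "high school", "college or over"],
    "economic_activity": ["yes", "no"],
    "household_income": ["<2M KRW", "2-4M KRW", ">4M KRW"],
    "marital_status": ["with spouse", "married, not with spouse", "single"],
    "high_risk_drinking": ["yes", "no"],
    "smoking": ["non-smoker", "past smoker", "current smoker"],
    "subjective_health": ["good", "normal", "poor"],
    "family_relationship": ["good", "average", "bad"],
    "friendship": ["good", "average", "poor"],
    "regular_exercise": ["no", "yes"],
    "depression_past_month": ["no", "yes"],
}


def encode(record):
    """Map one respondent's answers to integer codes, in VARIABLE_LEVELS order.

    Raises KeyError for a missing variable and ValueError for an unknown level.
    """
    return [levels.index(record[name]) for name, levels in VARIABLE_LEVELS.items()]


# Hypothetical respondent who gave the first listed answer to every item.
example = {name: levels[0] for name, levels in VARIABLE_LEVELS.items()}
codes = encode(example)
```

Keeping the levels in one place makes it harder for the coding of a category to drift between the chi-square screening and the model-fitting steps.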

A. Exploring Potential Factors of the Cardio-Cerebrovascular Disease in Old Age
Differences in the prevalence of cardio-cerebrovascular disease between groups were analyzed using the chi-square test. When the p-value of an explanatory variable was 0.1 or below, the variable was considered a potential factor of cardio-cerebrovascular disease and was included in the random forest model.
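The screening rule above can be sketched as follows for a two-level variable. This is a minimal sketch, assuming a Pearson chi-square test without continuity correction on a 2x2 table; the function names are hypothetical and the counts are invented for illustration, not taken from the Seoul Welfare Panel Study.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 table.

    table = [[a, b], [c, d]]: rows are groups, columns are disease yes/no.
    No continuity correction is applied (an assumption of this sketch).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def is_potential_factor(table, alpha=0.1):
    """Apply the screening threshold described in the text (p <= 0.1)."""
    _, p_value = chi_square_2x2(table)
    return p_value <= alpha


# Invented counts for illustration only.
associated = [[40, 60], [60, 40]]
unassociated = [[50, 50], [50, 50]]
stat, p_value = chi_square_2x2(associated)
```

Variables with more than two levels need the general chi-square distribution with (rows - 1) x (columns - 1) degrees of freedom, which this sketch does not cover.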

B. Random Forests Algorithm
The random forests technique [16] is an algorithm that creates multiple sample datasets using the bootstrap and grows a decision tree on each of them. Because it repeatedly selects variables at random, the method increases the diversity of the decision trees [17]. Unlike a single decision tree, which splits each node on the best partition found among all variables, the random forests technique randomly selects explanatory variables and uses the best partition among the selected variables (Figure 1). The process of random forests is shown in Eq. (1).

Fig. 1. Random forest classifier (source: Byeon [12]).

Another advantage of random forests is that they reduce variance compared to the bagging method because they decrease the correlation between trees. Moreover, the technique has been reported to present more accurate results than other algorithms, and it is useful for finding important variables in big data because it can use thousands of independent variables without eliminating any of them [18]. In particular, when there are many input variables, it often shows similar or better prediction power than bagging or boosting.

The input source of the R program for performing the random forests analysis is shown in Fig. 2. In this study, the number of trees in the model was set to 500. The analysis was conducted using R version 3.4.2 and Waikato Environment for Knowledge Analysis (WEKA) version 3.6.0 [19].
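The three ingredients described in this section (bootstrap sampling, random selection of candidate variables, and aggregation over trees) can be sketched in a few lines of Python. This is not the R source used in the study. It is a simplified illustration that assumes one-split trees (stumps), equality splits on integer-coded variables, Gini impurity, and majority voting. All function names, the `mtry` parameter name, and the toy data are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def majority(labels, default=0):
    """Most frequent label, or `default` if the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default


def fit_stump(X, y, feature_ids):
    """Best equality split (feature == value) among feature_ids by weighted Gini."""
    fallback = majority(y)
    best = None
    for f in feature_ids:
        for v in set(row[f] for row in X):
            left = [label for row, label in zip(X, y) if row[f] == v]
            right = [label for row, label in zip(X, y) if row[f] != v]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, v, majority(left, fallback), majority(right, fallback))
    return best[1:]  # (feature, value, label if equal, label otherwise)


def fit_forest(X, y, n_trees=500, mtry=2, seed=0):
    """Grow n_trees stumps, each on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]      # bootstrap sample
        feats = rng.sample(range(p), min(mtry, p))      # random candidate variables
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Majority vote over all trees."""
    votes = [left if row[f] == v else right for f, v, left, right in forest]
    return majority(votes)


# Toy data: the class equals feature 0; features 1 and 2 are noise.
X = [[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)] * 5
y = [row[0] for row in X]
forest = fit_forest(X, y, n_trees=500, mtry=2, seed=0)
```

A full implementation grows deeper trees and redraws the candidate variables at every node. With stumps there is only one node per tree, so drawing the subset once per tree is equivalent here.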

A. General Characteristics of Subjects
The characteristics of the subjects (n=2,111) were analyzed. The results showed that 53.5% of the subjects were between 60 and 69 years old and 58.4% were women. The majority of the subjects lived with their spouses (67.2%), had an elementary school education or below (43.3%), had a mean monthly income of less than 2 million KRW (64.7%), were not economically active (83.2%), did not exercise regularly (55.4%), had poor subjective health (39.4%), had a good subjective family relationship (59.1%), had average subjective friendship, and had not experienced a depression symptom in the past month (74.4%). The prevalence of cardio-cerebrovascular disease was 42.3%.

B. Potential Factors of Cardio-Cerebrovascular Disease in Old Age
Table 1 shows the general characteristics and potential factors of the subjects according to the prevalence of cardio-cerebrovascular diseases. The prevalence of cardio-cerebrovascular diseases, defined as the proportion of subjects suffering from hypertension, cerebral infarction, hyperlipidemia, cardiac infarction, or angina, was 42.3% (n=894). The results of the chi-square test showed that the elderly with and without cardio-cerebrovascular diseases differed significantly (p<0.05) in age, marital status, economic activity, smoking, depression symptoms in the past month, subjective health condition, and subjective family relationship. The prevalence of cardio-cerebrovascular diseases was significantly higher among the elderly who were 70 years or older (50.8%), were not living with a spouse (47.8%), were not economically active (44.1%), were former smokers (43.8%), had experienced a depression symptom in the past month (49.7%), had poor subjective health (51.1%), and had an average family relationship (46.4%).

C. Predicting the Occurrence of Cardio-Cerebrovascular Disease for the Korean Elderly
The importance of variables (the decrement of node impurity) based on random forests is shown in Table 2 and Figure 3. The results showed that the major determinants of cardio-cerebrovascular diseases in the South Korean elderly were mean monthly household income, the highest level of education, subjective health condition, subjective friendship, subjective family relationship, smoking, regular exercise, age, marital status, gender, depression experience, economic activity, and high-risk drinking. Among them, mean monthly household income was the most important predictor of cardio-cerebrovascular disease. Figure 4 shows the error rate of the prediction model across the 500 extracted bootstrap samples. The error rate of the developed random forests model was 0.24 and the prediction rate was 76.5%.

This study developed a model for predicting cardio-cerebrovascular disease in the elderly living in the community using the random forests technique, which is a data mining algorithm based on classification learning. The model considered multiple risk factors together. Its results showed that household income was the most important factor, followed by education level, which indicated that socio-economic factors were major risk factors.
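The importance measure behind this ranking, the decrement of node impurity, can be illustrated with a short calculation. This is a sketch that assumes Gini impurity as the impurity measure and treats the prediction rate as overall accuracy. The function names and the label counts are hypothetical and do not reproduce Table 2. Under that reading, the reported error rate of 0.24 agrees with 1 - 0.765 = 0.235 after rounding.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Gini decrease obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted_children = (len(left) * gini(left) + len(right) * gini(right)) / n
    return gini(parent) - weighted_children


def error_rate(y_true, y_pred):
    """Share of misclassified cases; accuracy is 1 minus this value."""
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)


# Invented node: 50 cases with the disease (1) and 50 without (0).
parent = [1] * 50 + [0] * 50

# An informative split separates the classes fairly well.
informative = impurity_decrease(parent, [1] * 40 + [0] * 10, [1] * 10 + [0] * 40)

# An uninformative split leaves both children as mixed as the parent.
uninformative = impurity_decrease(parent, [1] * 25 + [0] * 25, [1] * 25 + [0] * 25)

toy_error = error_rate([1, 0, 1, 1], [1, 0, 0, 1])
```

In a forest, decreases of this kind are accumulated over every split that uses a given variable and averaged over the trees, so variables that produce large decreases, as household income does in this model, rank highest.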
Many previous studies have reported that the prevalence of cardio-cerebrovascular disease is affected by socio-economic level [20][21][22]. The mechanism by which socio-economic factors act can be explained by changes in hemodynamics due to increased stress and a lack of healthy lifestyle practices, as well as by HDL-cholesterol, insulin resistance, and blood coagulation-related factors [20]. Therefore, low-income and poorly educated groups should be given sufficient consideration when establishing prevention programs.
The results of this study confirmed that depression was a major predictor of cardio-cerebrovascular disease. Previous studies reported that depression and cardio-cerebrovascular disease were highly associated. In particular, cardio-cerebrovascular disorder was identified as a risk factor for depression [6,23], and people with depression were at risk for cardiovascular disease, with a mortality risk twice that of people without depression [24]. Morris et al. (1993) also found that cerebrovascular disorders were frequently accompanied by depression and that people with depression had an 8-fold higher risk of death from cerebral infarction than those without it [25]. Byeon (2015) identified cardio-cerebrovascular disease risk groups using the QUEST algorithm and also predicted that the elderly who experienced depression would have a higher risk of cardio-cerebrovascular disease [6]. The results of this study indicated that depression in old age was a risk factor for cardio-cerebrovascular disease. Therefore, it will be necessary to develop a cardio-cerebrovascular disease prevention program targeting the elderly with depression in the local community.
This study developed a depression prediction model for children from multicultural families using the CHAID algorithm and found that the experience of social discrimination was the most critical factor affecting depression. Although it is hard to compare the results of this study directly, previous studies evaluating the relationship between social discrimination and mental health reported that economic discrimination and discrimination against a specific group (e.g., the elderly) were significant predictors that negatively influenced mental health [18]. Therefore, based on the results of this study, it is necessary to establish a legal system and to foster interest at the social level in order to overcome discrimination and prejudice against adolescents from multicultural families.
The results of this study showed that the cardio-cerebrovascular disease prediction model based on the random forests technique had stronger prediction power than the previously developed cardio-cerebrovascular disease prediction model based on the QUEST algorithm [6]. Random forests showed better prediction performance than the decision tree and produced more stable results because they make decisions by integrating the predictions of multiple decision trees grown on bootstrap samples [26]. Therefore, using the random forests model is believed to be more effective than using the decision tree model when estimating the importance of variables in the development of disease prediction models. In the future, it will be necessary to compare the predictive performance of the logistic regression model, the decision tree model, and the random forests model.

VI. CONCLUSION
Based on the developed prediction model, a systematic program for preventing cardio-cerebrovascular disease in the Korean elderly needs to be developed.

Fig. 2. Input source of the R program for performing random forests