Development of a Physical Impairment Prediction Model for Korean Elderly People using Synthetic Minority Over-Sampling Technique and XGBoost

Physical functioning in older adults is a key factor in active ageing and a major determinant of quality of life and of the need for long-term care in old age. Previous studies that identified factors related to activities of daily living (ADL) mostly used regression analysis to predict groups at high risk of physical impairment. Regression analysis is useful for confirming individual risk factors, but it is limited in its ability to capture multiple risk factors together. To address this limitation of regression models, machine learning ensemble models such as random forest and eXtreme Gradient Boosting (XGBoost) are widely used. Nonetheless, the predictive performance of XGBoost, in terms of measures such as accuracy and sensitivity, still needs to be verified by follow-up studies. This article proposes an effective method of dealing with imbalanced data when developing ensemble-based machine learning models, by comparing the performance of sampling methods on disease data. The study analyzed 3,351 people aged 65 or above who resided in local communities and completed the survey. To predict physical impairment in old age, the study compared logistic regression, XGBoost, and random forest with respect to accuracy, sensitivity, and specificity. The final model was the one with the highest accuracy among the models whose sensitivity and specificity were both 0.6 or above. As a result, XGBoost based on the synthetic minority over-sampling technique (SMOTE), with accuracy, sensitivity, and specificity of 0.67, 0.81, and 0.75, respectively, was found to have the best predictive performance. The results suggest that the SMOTE-based XGBoost model is an efficient choice when developing a predictive model from imbalanced data such as disease data.
Keywords—Random forest; XGBoost; GBM; gradient boosting machine; physical impairment prediction model


I. INTRODUCTION
As populations age across the world, new ageing-related concepts such as 'healthy ageing' and 'successful ageing' have emerged [1]. There are several standards for successful ageing, but in general it is defined as having high levels of physical, psychological, and social functioning and satisfaction with life, a step beyond physically healthy ageing [2]. Factors reported to affect successful ageing include age, abstaining from smoking, disability, arthritis, and diabetes; in particular, the better a person's subjective health, family support, and physical activity, the higher the level of successful ageing [1,3,4,5].
The World Health Organization (WHO) introduced the concept of active ageing to promote the development of policies that address the problem of ageing [6,7]. According to the WHO definition [6], active ageing is the process of optimizing opportunities for health, participation and security in order to enhance quality of life as people age. That is, active ageing supports people whose activities of daily living (ADL) functions are intact so that they can actively participate in social activities, and helps people with ADL dysfunction to carry out daily life actively by enhancing their ADL functions with appropriate support [8].
Physical functioning in older adults is a key factor in active ageing and a major determinant of quality of life and of the need for long-term care in old age [9,10]. It is mostly assessed in terms of activities of daily living (ADL), which indicate whether an older person can live independently [11,12]. The Katz Index, the Barthel Index, and the MBI are commonly used to assess ADL [13]; the Korea National Health and Nutrition Examination Survey used the Korean Activities of Daily Living scale (K-ADL), a standardized test of physical functioning developed to reflect the living environment and culture of older Koreans [14]. Old age, low educational level, receipt of medical benefits, lack of health insurance, stroke, urinary incontinence, diabetes, and lung cancer have been reported as risk factors affecting K-ADL [15][16][17][18][19][20][21].
Previous studies that identified factors related to ADL [15][16][17][18][19][20][21] mostly used regression analysis to predict groups at high risk of physical impairment. Regression analysis is useful for confirming individual risk factors, but it is limited in its ability to capture multiple risk factors together [22,23]. In addition, regression models assume the independence and normality of variables, so it is difficult to derive accurate results from data that violate the normality assumption, as disease data often do [24]. To address these limitations of regression models, machine learning ensemble models such as random forest, gradient boosting machine (GBM), and eXtreme Gradient Boosting (XGBoost) are widely used [25,26]. Ensemble learning derives a more accurate final prediction by generating several classifiers and combining their predictions, and some previous studies [27,28] have reported that the recently developed XGBoost outperforms the existing random forest and gradient boosting methods. Nonetheless, the predictive performance of XGBoost, in terms of measures such as accuracy and sensitivity, still needs to be verified by follow-up studies.
Meanwhile, the problem of imbalanced data is highly likely to occur when impairment is predicted from big data [29]. Disease data in particular tend to be unequally distributed because the number of patients is generally far smaller than the number of people without the disease. Such imbalanced data cause prediction errors during machine learning and degrade model performance, so techniques for dealing with imbalanced data are required to resolve this problem [30].
Hence, this article has two aims. First, it provides basic data for ageing-related policymaking by predicting and analyzing the risk of physical impairment among older Koreans living in local communities. Second, it proposes an effective method of dealing with imbalanced data when developing ensemble-based machine learning models, by comparing the performance of sampling methods on disease data.

A. Sources of Data
This study analyzed the raw data of the Seoul Panel Study Data (SEPANS), a survey of Seoul citizens carried out by the Seoul Welfare Foundation from June 1, 2016 to August 31, 2016. The SEPANS was conducted to assess the welfare levels of households residing in Seoul, to identify the actual conditions of vulnerable groups, and to estimate demand for welfare services. Its population consisted of households that were covered by the 2005 Population and Housing Census and were residing in Seoul during the survey period, excluding foreigners and people in nursing homes, the military, and prisons. Stratified cluster sampling was used as the sampling method. The survey was conducted by computer-aided personal interview, in which an interviewer visited each sampled household and entered its responses to a structured questionnaire into a portable computer. This study analyzed 3,351 people aged 65 or above who resided in local communities and completed the survey.

B. Measurement of Variables
The outcome variable was defined as physical impairment of the elderly, measured by means of K-ADL, a standardized test. According to Won (2002) [14], K-ADL showed high reliability and validity: its reliability coefficient was 0.7 or higher at the test development (standardization) stage, and the inter-item consistency of the questionnaire was 0.937. K-ADL consists of 7 items covering the most basic physical functions in daily life: dressing, washing the face, bathing, self-feeding, moving out of bed, using the toilet, and relieving oneself. A respondent who answered "partial help" or "complete dependence" to any K-ADL item was classified into the physical impairment group, and a respondent who answered "complete independence" to all items was classified into the non-impairment group. The explanatory variables, all of which have been reported to be associated with ADL in older Koreans, were sex, age, educational background (elementary school or below, middle school, high school, college graduate or above), health insurance status (insured or not), stroke, diabetes, arthritis, monthly total household income (below KRW 2 million, KRW 2-4 million, KRW 4 million or above), the presence of a spouse (cohabiting with a spouse, having but not cohabiting with a spouse, no spouse), smoking (nonsmoking, smoking in the past, smoking at present), and the presence of depressive symptoms (yes, no).
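The classification rule for the outcome variable can be illustrated with a minimal Python sketch. The item names and the numeric response codes below are hypothetical and chosen only for illustration; the actual coding of the SEPANS questionnaire may differ.

```python
# Hypothetical item names and response codes (assumptions, not the survey's coding).
KADL_ITEMS = (
    "dressing", "washing_face", "bathing", "self_feeding",
    "moving_out_of_bed", "using_toilet", "relieving_oneself",
)
INDEPENDENT, PARTIAL_HELP, DEPENDENT = 1, 2, 3


def has_physical_impairment(responses):
    """Return True if any K-ADL item is answered with partial help or dependence."""
    return any(responses[item] != INDEPENDENT for item in KADL_ITEMS)


# Two illustrative respondents.
fully_independent = {item: INDEPENDENT for item in KADL_ITEMS}
needs_help_bathing = dict(fully_independent, bathing=PARTIAL_HELP)

print(has_physical_impairment(fully_independent))   # False
print(has_physical_impairment(needs_help_bathing))  # True
```

A single item answered with anything other than complete independence is enough to place the respondent in the impairment group, which is why the rule is written with `any`.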

C. Predictive Model
To predict physical impairment in old age, this study compared logistic regression, XGBoost, and random forest with respect to accuracy, sensitivity, and specificity. To test predictive performance, the data were randomly divided into training data and test data in a 7:3 ratio; a predictive model was built from the training data, and its performance was tested on the test data. Because random forest and XGBoost include randomness, these models were developed with the seed fixed to 1234. The predictive performance of each model was assessed by means of the area under the ROC curve (AUC) (Fig. 1). The accuracy, sensitivity, and specificity of each model were obtained as performance assessment indices. Accuracy is the ratio of correct predictions to all predictions. Sensitivity is the proportion of older people who actually develop impairment whom the model correctly predicts as impaired. Specificity is the proportion of older people who do not actually develop impairment whom the model correctly predicts as unimpaired. The model with the best predictive performance was defined as the one with the highest accuracy among the models whose sensitivity and specificity were both 0.6 or above, and it was selected as the final model for predicting physical impairment in old age. All analyses were performed with R version 4.0.2 (R Foundation for Statistical Computing, Vienna, Austria).
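The three indices and the selection rule can be written out as a short Python sketch. The study itself used R; the labels, the predictions, and the names `model_a` and `model_b` below are made-up illustrations and are not results from this study.

```python
def evaluate(y_true, y_pred):
    """Compute accuracy, sensitivity, and specificity for binary labels (1 = impaired)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def select_final_model(results, floor=0.6):
    """Pick the most accurate model among those with sensitivity and specificity >= floor."""
    eligible = {
        name: m for name, m in results.items()
        if m["sensitivity"] >= floor and m["specificity"] >= floor
    }
    if not eligible:
        return None
    return max(eligible, key=lambda name: eligible[name]["accuracy"])


# Toy test set: 4 impaired (1) and 6 unimpaired (0) respondents.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
results = {
    "model_a": evaluate(y_true, [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]),
    "model_b": evaluate(y_true, [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]),
}
print(results)
print(select_final_model(results))  # model_a
```

In this toy example `model_b` has the higher accuracy (0.8 against 0.7) but a sensitivity of only 0.5, so the rule excludes it and selects `model_a`. This shows why the sensitivity and specificity floor matters with imbalanced data.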

D. Ensemble Model
The ensemble model combines the prediction or classification results of several models and uses them in the final decision, and a number of studies have shown that it has better predictive performance than single decision tree models [31]. Ensemble methods are classified into bagging and boosting. Bagging generates several models by resampling the source data and then makes a prediction by combining the outcomes of the models through voting or averaging; it reduces the variance of the predicted values [32]. Boosting builds models sequentially and classifies hard-to-classify observations better by giving more weight to observations that were previously misclassified; it reduces the bias of the predicted values [33]. The concepts of boosting and bagging are presented in Fig. 2.
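The core steps of the two approaches can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the boosting step uses a fixed multiplicative factor (a hypothetical parameter), whereas real boosting algorithms derive the update from the error rate or from the gradient of a loss function.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Bagging: draw as many rows as the original data, with replacement."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(votes):
    """Bagging: combine the class predictions of several models by voting."""
    return Counter(votes).most_common(1)[0][0]


def upweight_errors(weights, y_true, y_pred, factor=2.0):
    """Boosting (simplified): raise the weights of misclassified rows, then renormalize."""
    raw = [w * factor if t != p else w for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(raw)
    return [w / total for w in raw]


rng = random.Random(1234)
sample = bootstrap_sample(list(range(10)), rng)
print(sample)                    # some rows repeat, others are left out
print(majority_vote([1, 0, 1]))  # 1

# The third observation was misclassified, so its weight grows for the next model.
new_weights = upweight_errors([0.25] * 4, y_true=[1, 0, 1, 0], y_pred=[1, 0, 0, 0])
print(new_weights)
```

Bagging trains each model independently on its own resample, whereas boosting trains models one after another, so each model concentrates on the observations that the previous one handled poorly.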

E. Random Forest
The random forest is an ensemble of decision trees. A decision tree is a model that partitions the range of the variables at each branch, and it can be used with both continuous and categorical target variables. It has the advantage of being easy to interpret, but its performance drops when the data have a structure that is not easily separated by horizontal or vertical (axis-parallel) partitions [35]. The random forest was developed to remedy this shortcoming. It resamples the data, generates several tree models, and then combines the outcomes of the trees by voting or averaging. It is similar to bagging, but differs in that it mitigates the problem of multicollinearity by randomly selecting variables in addition to resampling the data [36]. The concept of random forest is presented in Fig. 3.
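The two sources of randomness in a random forest can be sketched as follows. The variable names are shorthand for the explanatory variables of this study, and the default subset size (the integer part of the square root of the number of variables) is a common convention for classification assumed here, not a setting reported in this article.

```python
import math
import random

# Shorthand names for the 11 explanatory variables (hypothetical identifiers).
FEATURES = [
    "sex", "age", "education", "insured", "stroke", "diabetes", "arthritis",
    "household_income", "spouse", "smoking", "depressive_symptoms",
]


def bootstrap_indices(n_rows, rng):
    """Row randomness: indices of a bootstrap resample used to grow one tree."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def candidate_features(features, rng, mtry=None):
    """Variable randomness: the random subset of variables tried at one split."""
    if mtry is None:
        mtry = max(1, int(math.sqrt(len(features))))  # assumed default
    return rng.sample(features, mtry)


rng = random.Random(1234)
rows_for_tree = bootstrap_indices(3351, rng)
split_candidates = candidate_features(FEATURES, rng)
print(len(rows_for_tree), split_candidates)
```

Because each split sees only a few randomly chosen variables, strongly correlated predictors cannot dominate every tree, which is one way to read the multicollinearity argument above.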

F. XGBoost
XGBoost is a boosting method: when generating each new tree, it gives more weight to the observations that the previous model misclassified [38]. That is, it is a boosting algorithm that trains by improving performance on misclassified observations. XGBoost has the advantage of fast computation owing to parallel processing that uses all CPU cores during learning, and it is very useful because it supports various programming languages, including Python and R [38]. The concept of XGBoost is presented in Fig. 4.
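The idea of fitting each new tree to what the previous trees got wrong can be illustrated with plain gradient boosting on a single variable. This sketch is not XGBoost itself: it uses one-split trees (stumps) and squared error, and it omits XGBoost's regularization, second-order approximation, and parallel computation. The data and parameter values are made up for illustration.

```python
def fit_stump(xs, residuals):
    """Find the single split on x that best fits the current residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # excluding the maximum keeps both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Add stumps one at a time, each fitted to the errors of the current ensemble."""
    preds = [sum(ys) / len(ys)] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        preds = [p + learning_rate * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return preds


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
initial_mse = mse(ys, [sum(ys) / len(ys)] * len(ys))
final_mse = mse(ys, boost(xs, ys))
print(initial_mse, final_mse)
```

Observations with large residuals, that is, those the current ensemble predicts poorly, have the most influence on where the next stump splits, which is the sense in which boosting uses poorly predicted observations more.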

G. Sampling
Disease data are generally imbalanced because the number of patients is smaller than the number of healthy people. In the data used in this study, the ADL assessment classified 92.0% of the older people as normal and only 8.0% as having physical impairment, which indicates an imbalance problem. To resolve it, this study used the under-sampling [40], over-sampling [41], and synthetic minority over-sampling technique (SMOTE) [30] algorithms.
Under-sampling resolves data imbalance by randomly removing observations of the majority class of the response variable. It can shorten model construction time because it reduces the amount of data, but it has the shortcoming of information loss.
Over-sampling resolves data imbalance by randomly copying observations of the minority class of the response variable. It lengthens model construction time because it increases the amount of data, and it may cause overfitting because it copies a small number of observations repeatedly.
SMOTE (Synthetic Minority Over-sampling Technique) is a method for mitigating overfitting, the shortcoming of over-sampling. One minority-class observation is randomly chosen, and its k nearest neighbors within the minority class are found. The difference between the chosen sample and one of these k neighbors is then computed, multiplied by a random value between 0 and 1, and added to the chosen sample; the resulting synthetic observation is added to the training data. This process is then repeated. The SMOTE algorithm is similar to over-sampling in that it increases the data of the minority class, but it is known to mitigate overfitting because it creates new samples by combining the existing data rather than copying the same data. The concepts of the sampling types are presented in Fig. 5.
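The three sampling methods can be sketched in Python as follows. This is a minimal illustration that assumes purely numeric variables and Euclidean distance. Several explanatory variables in this study are categorical and would require a suitable SMOTE variant, and the R implementation actually used is not reproduced here. The toy data mirror the 92:8 class ratio.

```python
import random


def random_under_sample(majority, minority, rng):
    """Keep all minority rows and an equally sized random subset of majority rows."""
    return rng.sample(majority, len(minority)) + list(minority)


def random_over_sample(majority, minority, rng):
    """Keep all rows and add random copies of minority rows until the classes match."""
    n_copies = len(majority) - len(minority)
    copies = [rng.choice(minority) for _ in range(n_copies)]
    return list(majority) + list(minority) + copies


def smote(minority, n_new, k, rng):
    """Create n_new synthetic minority rows by interpolating towards near neighbors."""
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        sample = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(sample, p)))
        neighbor = rng.choice(others[:k])  # one of the k nearest neighbors
        gap = rng.random()                 # random value between 0 and 1
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(sample, neighbor)))
    return synthetic


rng = random.Random(1234)
majority = [(float(i), 0.0) for i in range(46)]                  # 92% of 50 rows
minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]      # 8% of 50 rows

under = random_under_sample(majority, minority, rng)
over = random_over_sample(majority, minority, rng)
new_points = smote(minority, n_new=42, k=3, rng=rng)
print(len(under), len(over), len(minority) + len(new_points))  # 8 92 46
```

Every synthetic point lies on the line segment between two existing minority observations, so SMOTE fills in the minority region instead of stacking exact duplicates as random over-sampling does.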

A. Accuracy of Predictive Models
The accuracy, sensitivity, and specificity of the predictive models to which the sampling methods were applied are presented in Table I. The final model was the one with the highest accuracy among the models whose sensitivity and specificity were both 0.6 or above. As a result, SMOTE-based XGBoost, with accuracy, sensitivity, and specificity of 0.67, 0.81, and 0.75, respectively, was selected as the final predictive model.

B. Results of XGBoost Model Development
The model for predicting physical impairment was developed with XGBoost, and its predictive power was compared with that of random forest and logistic regression (Table II). XGBoost had higher classification accuracy than the other predictive models on both the training and test data. On the test data, classification accuracy was 67.2% for XGBoost, 65.0% for logistic regression, and 62.1% for random forest. Sensitivity and specificity are in a trade-off relationship, so the balance between them is chosen at the judgment of the researcher who uses the model. In this article, among random forest, logistic regression, and XGBoost, the SMOTE-based XGBoost model, which showed sensitivity and specificity of 0.6 or above and the highest accuracy, was selected as the final model with the best predictive performance.
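The trade-off between sensitivity and specificity can be seen by moving the classification threshold applied to predicted probabilities. The labels and scores in this Python sketch are made-up values for illustration and are not outputs of the models in this study.

```python
def sensitivity_specificity(y_true, scores, threshold):
    """Classify as impaired (1) when the score reaches the threshold, then score the result."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Toy data: 4 impaired and 6 unimpaired respondents with hypothetical predicted scores.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.55, 0.3, 0.6, 0.4, 0.35, 0.2, 0.1, 0.05]

for threshold in (0.25, 0.5, 0.65):
    sens, spec = sensitivity_specificity(y_true, scores, threshold)
    print(f"threshold={threshold:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

Raising the threshold from 0.25 to 0.65 lowers sensitivity from 1.00 to 0.50 while raising specificity from 0.50 to 1.00, so a criterion such as requiring both to be 0.6 or above amounts to choosing an acceptable point on this trade-off.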
Consistent with the results of this study, previous studies also reported that XGBoost is more accurate than other ensemble models such as GBM [27,28]. XGBoost presumably displayed excellent predictive performance in classification and regression because, although this tree-based ensemble learning algorithm is based on GBM, it is equipped with its own regularization against overfitting and its own early stopping function [27,28]. Further, previous studies [43,44] reported that XGBoost runs faster than GBM. Therefore, the results of this study suggest that the SMOTE-based XGBoost model is an efficient choice when developing a predictive model from imbalanced data such as disease data.