Predicting the Depression of the South Korean Elderly using SMOTE and an Imbalanced Binary Dataset

Because healthy people far outnumber ill people, class imbalance is highly likely when big data are used to predict depression among elderly people living in the community. When raw data with imbalanced class ratios are analyzed directly, without a supplementary technique such as a sampling algorithm, prediction errors can degrade machine learning performance. A data sampling technique is therefore needed to overcome the imbalance. This study sought an effective way to process imbalanced data for ensemble-based machine learning by comparing the performance of sampling methods on depression data from elderly people living in South Korean communities, which had a highly imbalanced class ratio. Models for predicting depression were developed using logistic regression, gradient boosting machine (GBM), and random forest, and their accuracy, sensitivity, and specificity were compared to evaluate prediction performance. The study analyzed 4,085 elderly people (≥60 years old) living in the community. The data were imbalanced: the depression screening test showed that 87.5% of subjects did not have depression, while 12.5% did. Oversampling, undersampling, and SMOTE were used to address the imbalance of the binary dataset, and the prediction performance (accuracy, sensitivity, and specificity) of each sampling method was compared. The results confirmed that the SMOTE-based random forest algorithm, which showed the highest accuracy among models with a sensitivity ≥ 0.6 and a specificity ≥ 0.6, had the best prediction performance among random forest, GBM, and logistic regression. Further studies are needed to compare the accuracy of SMOTE, undersampling, and oversampling for imbalanced data with high-dimensional y-variables.

Keywords: Random forests; gradient boosting machine; SMOTE; undersampling; imbalanced data; oversampling


I. INTRODUCTION
Depression is one of the most important mood disorders in old age. Diagnosing and treating depression at an early stage is very important because depression can be treated and cured with medication or psychosocial therapy even after its onset [1]. Depressive symptoms in old age differ from those in youth. First, it is difficult to clearly distinguish depressive symptoms from dementia symptoms [2]. Pseudodementia, which resembles dementia, shows a decline in cognitive ability on dementia screening tests, similar to the cognitive function test results seen in depression [3,4]. In particular, elderly people with depression often report a subjectively perceived decline in memory and cognitive function, which is uncommon in adolescents [5,6]. Moreover, elderly people with depression suffer a greater decline in memory and cognition than healthy elderly people [5,6].
Second, although young patients complain of various physical symptoms that are key to diagnosing depression, these physical symptoms are not very useful for diagnosing depression in elderly patients. For example, sleep disorder is a common symptom of adolescent depression, but elderly people frequently experience it regardless of depression [7,8]. Physical symptoms associated with aging, such as a normal decline in sexual function, constipation, and joint pain, are commonly found even in elderly people without depression [9]. Consequently, it is critical to determine accurately whether the depressive symptoms reported by the elderly are due to normal aging or to depressive disorder.
Nevertheless, most studies of depression among the South Korean elderly have been fact-finding surveys of a single city covering mental health, depression assessment, and the effectiveness of interventions for preventing and managing depression. Predictive model studies that identify the factors associated with depression among elderly people living in the community are far fewer than patient-control group comparison studies. Previous studies [1,10,11,12,13] that evaluated the factors related to geriatric depression in South Korean local communities reported that health, socioeconomic status, education level, age, spouse, and social activities affected geriatric depression. Because these studies mainly used regression analysis to predict depression, they were efficient in identifying individual risk factors [14,15]. However, they were limited in identifying compound (multivariate) risk factors such as sociodemographic variables and living habits [14,15]. Moreover, since regression analysis assumes independence, normality, and homoscedasticity, it may produce biased results when the model is developed using data that violate normality [16]. To overcome these limitations of the regression model, big data-based analysis, called machine learning or data mining, has been widely used in various fields. Machine learning can analyze data accurately even if the data somewhat violate the normality assumption, as nonlinear data do, in the estimation process [17]. In particular, ensemble learning models that generate many classifiers and combine their predictions to derive more accurate results, such as gradient boosting machine (GBM) and random forest, are known to have much higher sensitivity and accuracy than a single decision tree [18,19].
Nonetheless, since the predictive performance of ensemble learning models has mainly been tested using simulation data [20], additional validation is needed to confirm their predictive performance on disease data, which are mostly imbalanced [21].
Because healthy people far outnumber ill people, class imbalance is highly likely when big data are used to predict depression among elderly people living in the community [22]. When raw data with imbalanced class ratios are analyzed directly, without a supplementary technique such as a sampling algorithm, prediction errors can degrade machine learning performance [23]. A data sampling technique is therefore needed to overcome the imbalance [24]. Accordingly, this study sought an effective way to process imbalanced data for ensemble-based machine learning by comparing the performance of sampling methods on depression data from elderly people living in South Korean communities, which had a highly imbalanced class ratio.

A. Data Source
This study analyzed the raw data of the 2016 Seoul Panel Study (SEPANS). The SEPANS survey was conducted from June 1 to August 31, 2016, to estimate the welfare level of Seoul citizens and the actual conditions of the socially vulnerable class. The population was the households residing in Seoul at the time of the survey among the households subject to the 2005 Population and Housing Census. Stratified cluster sampling was used to sample households in 25 districts in Seoul. Foreigners and those admitted to retirement homes or nursing hospitals were excluded from the survey subjects. The survey used the computer-aided personal interview method, in which an interviewer visited the target households and entered the responses to a structured questionnaire into a portable computer. This study analyzed 4,085 elderly people (≥60 years old) living in the community.

B. Variable Measurement
Depression, the outcome variable, was defined according to the Korean version of the Center for Epidemiologic Studies Depression Scale-Revised (K-CES-D) [25]. K-CES-D is a self-administered depression scale composed of 20 items, developed by the National Institute of Mental Health. It is a primary screening tool for depression. The maximum score is 60, and a higher score indicates more severe depression.
The cut-off score of K-CES-D, the threshold of depression, was defined as 25 points.
Explanatory variables were age, gender, educational level (elementary school graduate and below, middle school graduate, high school graduate, or college graduate or above), smoking (smokers or non-smokers), drinking (less than once a week or twice or more per week), economic activity (yes or no), social activities for the past month (yes or no), mean monthly household income (less than KRW 1.5 million, KRW 1.5-3 million, or KRW 3 million or more), spouse living together (living together, bereavement/separated, or single), disease/accident/addiction in the last two weeks (yes or no), subjective health status (good, fair, or bad), subjective stress (yes or no), days of walking for 30 minutes or more per day (less than 1 day per week or 2 days or more per week), the frequency of meeting neighbors (less than once a month or twice or more per month), and the frequency of meeting relatives (less than once a month or twice or more per month).

A. Model Development and Evaluation
This study developed models for predicting depression among the elderly living in the community using logistic regression, GBM, and random forest, and compared their accuracy, sensitivity, and specificity to evaluate prediction performance. The data were randomly divided into a training dataset (70%) and a test dataset (30%). Prediction models were developed using the training dataset, and their accuracy, sensitivity, and specificity were calculated using the test dataset. Since GBM and random forest involve randomness, the seed was fixed at 123456 so that the results could be reproduced. The predictive performance of each model was evaluated by the area under the ROC curve (AUC), and accuracy, sensitivity, and specificity were calculated as evaluation indices of model performance. Accuracy is the proportion of correct predictions among all data. Sensitivity is the true positive rate: the proportion of seniors with depression whom the model predicts as having depression. Specificity is the true negative rate: the proportion of seniors without depression whom the model predicts as not having depression. This study defined the model with the best predictive performance as the one with the highest accuracy among models whose sensitivity and specificity were both 0.6 or higher, and selected it as the final model for predicting the depression of the elderly living in the community. All analyses were performed using R version 4.0.2 (R Foundation for Statistical Computing, Vienna, Austria) and Python version 3.8.0 (https://www.python.org).
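The evaluation procedure described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names (`split_indices`, `classification_metrics`, `select_final_model`) are hypothetical, the split is a plain random shuffle, and labels are assumed to be coded 1 for depression and 0 for no depression.

```python
import random

def split_indices(n, test_percent=30, seed=123456):
    """Randomly split row indices 0..n-1 into (train, test) index lists."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    indices = list(range(n))
    rng.shuffle(indices)
    n_test = n * test_percent // 100   # integer arithmetic avoids float rounding
    return indices[n_test:], indices[:n_test]

def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity for binary labels (1 = depression)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,  # true positive rate
        "specificity": tn / (tn + fp) if tn + fp else 0.0,  # true negative rate
    }

def select_final_model(results, floor=0.6):
    """Return the name of the most accurate model whose sensitivity and
    specificity are both at least `floor`, or None if no model qualifies."""
    eligible = {
        name: m for name, m in results.items()
        if m["sensitivity"] >= floor and m["specificity"] >= floor
    }
    if not eligible:
        return None
    return max(eligible, key=lambda name: eligible[name]["accuracy"])
```

Applying the floor before ranking by accuracy matters on imbalanced data: a model that labels every subject as not depressed would reach high accuracy on an 87.5%/12.5% split while having a sensitivity of zero, and this rule excludes it.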

B. Random Forest
Random forest is an ensemble model that generates a number of decision trees to calculate predictions. An ensemble model integrates the classification results of multiple decision trees and uses them to make the final decision. A number of studies [26,27,28] reported that ensemble models had higher predictive power than single decision tree models. Ensemble models can be divided into bagging and boosting. Bagging generates multiple decision tree models by sampling the raw data and then predicts by combining the results of each model through averaging or voting. It has the advantage of reducing the variance of predicted values [29]. Boosting is a machine learning method that improves the classification of hard-to-classify observations by giving more weight to misclassified observations. It has the advantage of reducing the bias of predicted values [30]. The concepts of bagging and boosting are presented in Fig. 1.
These ensemble models compensate for the poor performance of decision trees on data that cannot be separated well by horizontal or vertical splits [31]. Random forest samples data to create multiple tree models and then votes on or averages the results of each tree. It is widely used in various fields because it addresses the multicollinearity problem of trees by randomly selecting variables in addition to sampling data for each model [32,33]. The concept of random forest is presented in Fig. 2.
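The two sources of randomness and the voting step can be sketched in a few lines of Python. This is an illustrative sketch only, with hypothetical helper names; it omits the tree learner itself, and it draws one variable subset per call, whereas common random forest implementations draw a new subset at every split.

```python
import random
from collections import Counter

def bootstrap_rows(n_rows, rng):
    """Bagging step: draw n_rows row indices with replacement."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def random_feature_subset(n_features, k, rng):
    """Random forest step: pick k distinct variable indices at random."""
    return sorted(rng.sample(range(n_features), k))

def majority_vote(tree_predictions):
    """Combine the class predictions of all trees for one observation."""
    return Counter(tree_predictions).most_common(1)[0][0]
```

For example, with five trees predicting `[1, 0, 1, 1, 0]` for one subject, `majority_vote` returns 1. Because each tree sees different rows and different variables, the trees are less correlated with one another, which is what lets averaging or voting reduce variance.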

C. GBM
The GBM is a machine learning algorithm proposed by Friedman (2001) [36] that builds a prediction model by combining weak learners, typically decision trees, using ensemble techniques. Like other boosting methods, it builds the model in stages, and it generalizes them by allowing the optimization of an arbitrary differentiable loss function. In machine learning, boosting refers to a method of generating a strong learner by combining weak learners [36]. It first fits a model even if its accuracy is low, and the next model compensates for the errors of this model. Repeating this process yields an increasingly accurate model. Training a prediction model means finding the parameters that minimize the loss function, and one way to find them is gradient descent: the gradient is calculated by differentiating the loss function with respect to the parameters, and the parameters are moved in the direction that decreases the loss until the loss function reaches its minimum. In gradient boosting, this search is carried out in function space. Therefore, the loss function is differentiated with respect to the model function learned so far, instead of the parameters. GBM's algorithm is presented in Fig. 3.
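The contrast between gradient descent on parameters and gradient descent in function space can be written out as follows. The notation is an assumption introduced here for illustration and does not come from the paper's figures: L is the loss, F_m the model after stage m, h_m the weak learner fitted at stage m, η a learning rate, and ν a shrinkage factor.

```latex
% Ordinary gradient descent: update the parameter vector \theta
\theta_{t+1} = \theta_t - \eta \, \nabla_{\theta} L(\theta_t)

% Gradient boosting: differentiate the loss with respect to the current
% model's predictions to obtain pseudo-residuals for each observation i
r_{im} = -\left[ \frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)} \right]_{F = F_{m-1}},
\qquad i = 1, \dots, n

% Fit a weak learner h_m to the pseudo-residuals, then update the model
F_m(x) = F_{m-1}(x) + \nu \, \gamma_m \, h_m(x),
\qquad
\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L\bigl(y_i, F_{m-1}(x_i) + \gamma \, h_m(x_i)\bigr)
```

For the squared-error loss L = (y - F)^2 / 2, the pseudo-residual reduces to the ordinary residual y_i - F_{m-1}(x_i), which matches the description of each new model compensating for the errors of the previous one.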

D. Sampling Techniques for Resolving Imbalanced Data
Disease data generally pose the problem of imbalanced classes because people with a disease are fewer than those without it. The depression data of the elderly in the community used in this study were also imbalanced: the depression screening test showed that 87.5% of subjects did not have depression, while 12.5% did. This study used oversampling [37], undersampling [38], and SMOTE [24] to address the imbalance of the binary dataset, and the prediction performance (accuracy, sensitivity, and specificity) of each sampling method was compared.
The undersampling method randomly deletes data from the majority class to match the number of data in the minority class. It is fast because it deletes data without separate calculations, but its performance varies widely because the deletion is random [38]. When two data points belong to different classes and each is the other's nearest neighbor, so that no other data point is closer to either of them, the pair is called a Tomek link. The Tomek link technique removes the majority-class member of each such pair. It has the effect of pushing the boundary line toward the majority class. The edited nearest neighbors (ENN) technique deletes a majority-class data point unless all or most of its k nearest neighbors belong to the majority class. In other words, it deletes majority-class data that lie around minority-class data. Since these traditional undersampling techniques delete data, they incur a loss of data and weaken the representativeness of the data.
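Random undersampling and Tomek link detection can be sketched as follows. This is a minimal illustration with hypothetical function names, not the study's implementation; it assumes numeric feature vectors, Euclidean distance, at least two rows, and a majority class at least as large as the minority class.

```python
import math
import random

def random_undersample(X, y, majority_label=0, seed=123456):
    """Randomly drop majority-class rows until both classes have equal size."""
    rng = random.Random(seed)
    majority = [i for i, label in enumerate(y) if label == majority_label]
    minority = [i for i, label in enumerate(y) if label != majority_label]
    # Keep every minority row and an equally sized random sample of majority rows.
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [X[i] for i in kept], [y[i] for i in kept]

def tomek_links(X, y):
    """Return index pairs (i, j) with i < j that are mutual nearest neighbors
    and carry different class labels."""
    def nearest(i):
        others = (j for j in range(len(X)) if j != i)
        return min(others, key=lambda j: math.dist(X[i], X[j]))

    links = []
    for i in range(len(X)):
        j = nearest(i)
        if i < j and y[i] != y[j] and nearest(j) == i:
            links.append((i, j))
    return links
```

Removing the majority-class member of every pair returned by `tomek_links` gives the Tomek link cleaning step. The brute-force neighbor search is quadratic in the number of rows, which is acceptable for a sketch but slow for large datasets.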
The oversampling technique repeatedly and randomly reuses data from the minority class, which increases its weight. Like random undersampling, it is fast because it copies data without separate calculations, but its performance varies widely because the copying is random.
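A minimal sketch of random oversampling, under the assumption that the minority class is coded 1 and that at least one minority row exists; the function name is hypothetical.

```python
import random

def random_oversample(X, y, minority_label=1, seed=123456):
    """Append randomly chosen copies of minority rows until classes balance."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == minority_label]
    n_majority = len(y) - len(minority)
    # Number of extra copies needed; range() is empty if already balanced.
    extra = [rng.choice(minority) for _ in range(n_majority - len(minority))]
    indices = list(range(len(y))) + extra
    return [X[i] for i in indices], [y[i] for i in indices]
```

Every added row is an exact duplicate of an existing minority row, so no new information enters the dataset.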
For a given data point in the minority class, the SMOTE technique finds its n nearest neighbors within the same minority class, draws a straight line between the point and one of these neighbors, and generates a random point on that line; this is repeated until the class ratio is balanced. The concept of sampling types is presented in Fig. 4. Moreover, the algorithm of SMOTE is presented in Fig. 5. The Python code for executing SMOTE is presented in Fig. 6.
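The interpolation step can be illustrated with a simplified Python sketch. It is not a reproduction of the code in Fig. 6: the function name `smote`, the parameter names, and the choice to pick the base point at random on each iteration are assumptions made here, and the sketch requires at least two minority points with numeric features.

```python
import math
import random

def smote(minority, n_new, k=5, seed=123456):
    """Generate n_new synthetic minority points by linear interpolation
    between a minority point and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point, excluding itself
        candidates = (p for p in minority if p is not base)
        neighbors = sorted(candidates, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic
```

To balance a dataset, `n_new` would be set to the difference between the majority and minority class sizes. Because each synthetic point lies between two real minority points, SMOTE produces new values instead of exact duplicates.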

A. Comparing the Prediction Performance of the Model for Predicting Senile Depression
Table I shows the prediction performance (accuracy, sensitivity, and specificity) of oversampling, undersampling, and SMOTE. This study defined the final model with the best predictive performance as a model with the highest accuracy while sensitivity and specificity were 0.6 or higher. As a result, this study chose the SMOTE-based random forest algorithm, showing an accuracy of 0.68, a sensitivity of 0.83, and a specificity of 0.74, as the final model for predicting senile depression.

B. Major Predictors of Senile Depression
The model to predict depression was developed using GBM, and its predictive power was compared with that of random forest and logistic regression (Table II, Table III). Random forest had higher classification accuracy than the other predictive models in both training and test data. The analysis of the test data showed that the classification accuracy was 63.0% for logistic regression, 65.1% for GBM, and 68.3% for random forest. Table III shows the major predictors of senile depression according to the SMOTE algorithm.

Model                         Factors
Logistic regression-raw data  8
GBM-SMOTE                     10
Random forest-SMOTE           12

This study compared the performance of sampling methods for ensemble-based machine learning using the depression data of the elderly in the community, which had an imbalanced class ratio. The results confirmed that the SMOTE-based random forest algorithm, which showed the highest accuracy among models with a sensitivity ≥ 0.6 and a specificity ≥ 0.6, had the best prediction performance among random forest, GBM, and logistic regression. Since specificity and sensitivity have a trade-off relationship (when one value increases, the other decreases), the balance between specificity and sensitivity is chosen according to the judgment of the researcher using the model. This study proposes that future studies comparing prediction models evaluate predictive performance in light of the study objective by considering accuracy, specificity, and sensitivity together instead of accuracy alone.
This study compared the prediction performance of ensemble models built on imbalanced data by sampling method and found that SMOTE showed the best performance.
Previous studies also reported that SMOTE had better predictive performance than undersampling and oversampling when analyzing imbalanced data [40]. The SMOTE technique has performed successfully in various applied fields [41]. The ADASYN technique, a more recently developed improved version of SMOTE, generates more realistic points that deviate from the line by adding random noise to the generated points. There have been continuous attempts to develop advanced algorithms with better accuracy than SMOTE [42].
The results of this study suggest that using SMOTE as a sampling method to overcome the imbalance can be an efficient option when developing a prediction model using imbalanced binary data such as disease data. SMOTE can alleviate the overfitting problem caused by random oversampling and, unlike undersampling, does not discard useful data [40]. However, it has also been reported that SMOTE may cause class overlapping, introduce additional noise, and be ineffective for imbalanced data with a high-dimensional y variable [42]. Therefore, although this study confirmed the effectiveness of SMOTE using an imbalanced binary dataset, the results cannot be generalized to data of all dimensions and should be interpreted with caution. Further studies are needed to compare the accuracy of SMOTE, undersampling, and oversampling for imbalanced data with high-dimensional y-variables.