Predicting the Anxiety of Patients with Alzheimer’s Dementia using Boosting Algorithm and Data-Level Approach

Imbalanced data can cause overfitting and prediction errors during machine learning and degrade the prediction performance of a model (e.g., sensitivity). Therefore, in addition to selecting a machine learning algorithm suitable for the data, it is necessary to add a data sampling technique at the model development step. This study examined patients with Alzheimer's disease living in South Korea to understand the predictors of anxiety using boosting algorithms (i.e., AdaBoost and XGBoost) and data-level approaches (raw data, undersampling, oversampling, and SMOTE), and identified the model with the best prediction performance. We analyzed 253 elderly people (aged 60 to 74 years) who were diagnosed with Alzheimer's disease after visiting rehabilitation hospitals for early dementia screening. Models for predicting the anxiety of Alzheimer's dementia patients were developed using AdaBoost and XGBoost, and their prediction performance (i.e., accuracy, sensitivity, and specificity) was compared. XGBoost based on SMOTE (accuracy=0.84, sensitivity=0.85, and specificity=0.81) showed the best prediction performance. These results suggest that a SMOTE-XGBoost model may provide higher accuracy than a SMOTE-AdaBoost model when developing prediction models from data with an imbalanced outcome variable, such as disease data.

Keywords: Anxiety; AdaBoost; patients with Alzheimer's dementia; SMOTE; XGBoost


I. INTRODUCTION
The number of people with dementia is increasing worldwide, including in South Korea. It has been reported that the number of older adults with dementia in South Korea was 750,000 in 2019, and it is forecast to increase to 1.96 million by 2040 [1]. The increase in people with dementia indicates an increase in the number of elderly people who need long-term care [2]. Since dementia is accompanied by physical, cognitive, and behavioral issues, people with dementia require the help of a caregiver [3]. Therefore, managing the elderly with dementia is an important issue not only for the patient, but also for the family, society, and the country.
Dementia is an irreversible disease that mostly occurs in old age. It has been reported that dementia affects one in ten people over 65 years old and one in two people over 85 years old [4]. Dementia can be classified into Alzheimer's disease, frontotemporal dementia, and Parkinson's dementia. Among them, Alzheimer's disease accounts for 60% of dementia patients, and it is difficult to detect early because its symptoms progress slowly and gradually [5]. A characteristic feature of Alzheimer's disease is the loss of memory of recent events [6]. Moreover, as the disease progresses, patients cannot remember the names of familiar people, the names of objects, or places [7,8].
Additionally, apathy, depression, and anxiety have been reported as behavioral and psychological symptoms of dementia (BPSD), which are frequently observed in dementia alongside cognitive disorders [9,10]. BPSD, including anxiety, cause considerable pain to patients with Alzheimer's disease and decrease their quality of life [11]. Lyketsos et al. (2000) [12] reported that 70-95% of elderly people with dementia residing in care facilities and 60% of those treated at home experienced BPSD. BPSD can lead to death by decreasing cognitive functions and/or exacerbating physical dysfunction [13]. Furthermore, they not only negatively affect the lives of supporting family members but also cause them severe distress [14,15]. In particular, BPSD, including anxiety, considerably increase the medical expenses of dementia patients: it has been reported that 30% of dementia-related medical expenses were for managing BPSD, and treating dementia patients with anxiety was much more expensive than treating those without anxiety [16].
Since BPSD, including anxiety, are easier to treat than cognitive impairments, it is possible to improve the quality of life of dementia patients and their caregivers by detecting these symptoms early and treating them appropriately [15,17]. Consequently, detecting the anxiety of dementia patients as early as possible is an important topic in geriatrics, and it requires a prediction model that can explore the risk factors of anxiety symptoms while considering a range of factors such as demographic characteristics, cognitive function, and the ability to perform daily activities.
For the past 20 years, most studies on dementia have focused on cognitive dysfunction, and relatively few studies have aimed to identify the factors associated with BPSD [13]. Moreover, previous studies [18,19] mainly used regression analysis methods to identify risk factors for behavioral and psychological symptoms. Regression analysis methods are useful for identifying individual risk factors, but they are limited in identifying multiple risk factors [20]. In particular, only a few studies conducted in South Korea have evaluated BPSD, and previous studies [21,22] could only examine its relationship with individual factors such as demographic characteristics.
Boosting algorithms such as eXtreme Gradient Boosting (XGBoost) and AdaBoost are widely used to overcome the limitations of these regression models. Although numerous previous studies [5,23] have reported that machine learning is more accurate than traditional statistical techniques such as regression analysis, models built on disease data are highly likely to suffer from imbalanced data because the number of patients is much smaller than the number of people without the disease. Consequently, the likelihood of overfitting is high [24]. Since overfitting due to imbalanced data can cause prediction errors during learning and degrade the prediction performance of the model (e.g., sensitivity), it is necessary, in addition to selecting a machine learning algorithm suitable for the data, to add a data sampling technique at the model development step [25]. This study examined patients with Alzheimer's disease living in South Korea to understand the predictors of anxiety using boosting algorithms (i.e., AdaBoost and XGBoost) and data-level approaches (raw data, undersampling, oversampling, and SMOTE), and identified the model with the best prediction performance.

A. Subjects
This study analyzed 253 elderly people who were diagnosed with Alzheimer's disease among 1,553 elderly South Koreans (aged 60 to 74 years) who visited rehabilitation hospitals and nursing hospitals in Incheon from August 2, 2017, to June 30, 2018, for early dementia screening. The screening consisted of an in-depth dementia test for the diagnosis of Alzheimer's disease, which comprised sociodemographic information, previous medical history, cognitive function, mood, activities of daily living, interviews with subjects and their guardians regarding changes in personality and other matters, the Seoul Neuropsychological Screening Battery (SNSB) [26], and the Korean version of the Global Deterioration Scale (GDS) [27]. A neurologist diagnosed Alzheimer's dementia based on the diagnostic criteria of the "Diagnostic and Statistical Manual of Mental Disorders, 5th edition" and the "National Institute of Neurological and Communicative Diseases and Stroke/Alzheimer's Disease and Related Disorders Association (Probable Alzheimer's disease)". This study excluded those who had visual or hearing impairment severe enough to interfere with testing, a medical history of stroke, or profound dementia corresponding to CDR 3.
This study tested the adequacy of the sample size using the G*Power program 3.1.9 (Universität Mannheim, Mannheim, Germany). The results showed that the minimum sample size was 217 when power (1-β)=0.95, alpha=0.05, effect size (f²)=0.15, and 19 predictors were applied. Therefore, the 253 samples of this study satisfied the condition for testing statistical significance.

B. Measurements and Definitions of Variables
The outcome variable was defined as anxiety (yes, no). Explanatory variables were gender, age (65-75 for the young-old, and 75 and older for the old-old), education level (middle school graduation or below, or high school graduation or above), income level (total household income), marital status (married, divorce/separation, or bereavement), smoking (non-smoking, former smoker, or current smoker), drinking habits (non-drinking, former drinker, or current drinker), regular exercise at least once a week (yes or no), mean monthly social activity participation (less than 1 hour or 1 hour or more), subjective health (good, moderate, or poor), diabetes (yes or no), hypertension (yes or no), family history of dementia (yes or no), cognitive level (K-MMSE) [28], Clinical Dementia Rating (CDR) [29], depression, and activities of daily living (ADL). Anxiety was measured using the Korean neuropsychiatric inventory (K-NPI) [30]. K-NPI is a standardized test tool that measures the BPSD of patients. It divides the abnormal behaviors of dementia into twelve domains (i.e., delusion, hallucination, aggression, depression, anxiety, euphoria, apathy, disinhibition, irritability, aberrant motor behavior, sleep, and appetite), and evaluates each sub-item. When an abnormal behavior is found in a specific sub-item (e.g., anxiety), frequency (0-4 points) and severity (0-3 points) are measured, and they are multiplied to produce the final value (0-12 points). A higher score indicates a more anxious state. This study analyzed only the anxiety items in the K-NPI.
Cognitive function: The Korean version of the Mini-Mental Status Examination (K-MMSE) [28] was used to measure cognitive function. K-MMSE includes diverse subcategories: temporal orientation, spatial orientation, memory, attention and computation ability, language ability, and spatiotemporal composition ability. It consists of 30 items (one point per item), and a lower score means more severe cognitive impairment. At the time of developing MMSE, the Cronbach's α value was 0.82 [31].
Clinical Dementia Rating (CDR): CDR [29] is a tool designed to classify the severity of dementia into five levels from a clinical perspective based on the evaluation of six areas (i.e., memory, orientation, judgment and problem-solving ability, social activities, family life and hobby, and hygiene and dressing up). At the time of developing the CDR, the inter-rater reliability was Kappa=0.86-1.0 [29].
Depression: This study measured depression with the Short form of Geriatric Depression Scale Korea (SGDS-K) [33], which was standardized and developed according to the circumstances of the elderly in South Korea by extracting 15 items out of the 30 items of the Geriatric Depression Scale (GDS) [32]. SGDS-K is composed of binary items (yes/no), and the total score ranges from 0 to 15. A higher score means more severe depression. This study set the SGDS-K threshold for defining depression at 8 points. At the time of developing SGDS-K, the Cronbach's α value was 0.94 [32].
Activities of Daily Living (ADL): This study used the Korean version of the Barthel Activities of Daily Living Index (K-BADL) [34], a standardized test tool for measuring the activities of daily living. K-BADL consists of 10 sub-categories: bowels, bladder, washing face/hair combing/tooth brushing/shaving, toilet use, eating, transfer, mobility, dressing, going up and down stairs, and bathing. The score ranges from 0 to 20, and a higher score indicates that a person can perform activities more independently, without the help of other people (normal level).
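The K-NPI domain score described above is a simple product, which can be sketched as follows. This is a minimal illustration in Python (the study itself used R); the function name `npi_domain_score` is hypothetical, and the value ranges are taken from the description in this section. How the score was converted into the binary anxiety outcome is not stated here, so no cutoff is implemented.

```python
def npi_domain_score(frequency: int, severity: int) -> int:
    """Return the score of one K-NPI domain (e.g., anxiety).

    Ranges follow the text: frequency 0-4 points, severity 0-3 points,
    product 0-12 points. A higher score indicates a more anxious state.
    """
    if not 0 <= frequency <= 4:
        raise ValueError("frequency must be between 0 and 4")
    if not 0 <= severity <= 3:
        raise ValueError("severity must be between 0 and 3")
    return frequency * severity


if __name__ == "__main__":
    # Frequency 3 and severity 2 give a domain score of 6 out of 12.
    print(npi_domain_score(3, 2))
```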

C. Development of Prediction Models and Validation of Predictive Performance
This study developed models for predicting the anxiety of Alzheimer's dementia patients using AdaBoost and XGBoost and compared the prediction performance (i.e., accuracy, sensitivity, and specificity) of the models. The data were randomly divided into a training dataset and a test dataset at a ratio of 7:3. A 5-fold cross-validation (CV) was performed only on the training dataset, and the test dataset was used to evaluate the prediction performance. Because model fitting involves randomness, the models were developed with the seed fixed to "01234". The prediction performance of each model was evaluated by the area under the curve (AUC) of the receiver operating characteristic (ROC) curve (Fig. 1) [35]. The accuracy, sensitivity, and specificity of each model were calculated as evaluation indices for model performance. Accuracy is the proportion of correct predictions among all samples. Sensitivity is the true positive rate, i.e., the proportion of dementia patients with anxiety whom the model predicts as having anxiety. Specificity is the true negative rate, i.e., the proportion of dementia patients without anxiety whom the model predicts as not having anxiety. This study compared the prediction performance of the models and defined the best model as the one with the highest accuracy among models with sensitivity and specificity of 0.6 or higher. If models had the same accuracy, the model with the higher sensitivity was selected as the best prediction model. All analyses were carried out using R version 4.0.3 (Foundation for Statistical Computing, Vienna, Austria).
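The evaluation indices and the model selection rule described above can be sketched as follows. This is an illustrative Python sketch, not the R code used in the study; the names `Metrics`, `evaluate`, and `select_best` are hypothetical, labels are assumed to be coded 1 (anxiety) and 0 (no anxiety), and the 0.6 floor and the sensitivity tie-break come from the text.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass
class Metrics:
    accuracy: float
    sensitivity: float
    specificity: float


def evaluate(y_true: Sequence[int], y_pred: Sequence[int]) -> Metrics:
    """Compute accuracy, sensitivity (TPR) and specificity (TNR) for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return Metrics(
        accuracy=(tp + tn) / len(y_true),
        sensitivity=tp / (tp + fn) if tp + fn else 0.0,
        specificity=tn / (tn + fp) if tn + fp else 0.0,
    )


def select_best(results: Dict[str, Metrics], floor: float = 0.6) -> Optional[str]:
    """Pick the model with the highest accuracy among those whose sensitivity
    and specificity both reach the floor; ties are broken by sensitivity."""
    eligible = {
        name: m for name, m in results.items()
        if m.sensitivity >= floor and m.specificity >= floor
    }
    if not eligible:
        return None
    return max(eligible, key=lambda n: (eligible[n].accuracy, eligible[n].sensitivity))
```

Under this rule, a model with high accuracy but sensitivity below 0.6 (a common outcome on imbalanced data, where predicting the majority class is usually correct) is excluded before accuracy is compared.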

D. Boosting Algorithm
The boosting algorithm refers to the process of building a strong classifier by using a linear combination of given weak classifiers. Freund et al. (1996) [36] introduced an improved technique to apply the boosting idea to actual data analysis in 1995, and they proved that the error rate of the boosting algorithm approaches zero as the number of weak classifiers increases. The advantages of the boosting learning method are that (1) it has relatively fewer parameters to be estimated compared to other learning methods; (2) a cascade classification model can be easily constructed with respect to false positives; (3) the boosting algorithm reduces the bias of the predicted values; and (4) since it is possible to select one specific dimension through a weak classifier, it can be applied as a method of feature selection when using data with many variables. This study developed the model for predicting the anxiety of dementia patients using the AdaBoost and XGBoost methods among boosting algorithms.

E. AdaBoost
AdaBoost is a learning technique that creates a strong classifier by repeatedly training a very weak classifier using samples of two classes. It first trains the weak classifier with the same weight given to all samples, and then, as the steps progress, increases the weight of the samples misclassified by the previous classifier so that later classifiers concentrate on them. The concept of AdaBoost [37] is presented in Fig. 2.
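The reweighting idea described above can be sketched with threshold stumps as the weak classifiers. This is a minimal illustration of the standard discrete AdaBoost update in Python, not the R implementation used in the study; all function names are hypothetical, and labels are assumed to be coded +1/-1 rather than yes/no.

```python
import math


def train_stump(X, y, w):
    """Return the (feature, threshold, polarity, weighted_error) stump
    with the lowest weighted error over all features and thresholds."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                preds = [polarity if row[j] >= thr else -polarity for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[3]:
                    best = (j, thr, polarity, err)
    return best


def stump_predict(stump, row):
    j, thr, polarity, _ = stump
    return polarity if row[j] >= thr else -polarity


def adaboost_fit(X, y, n_rounds=10):
    """Train n_rounds stumps; y must contain +1/-1 labels."""
    n = len(X)
    w = [1.0 / n] * n  # every sample starts with the same weight
    ensemble = []
    for _ in range(n_rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[3], 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)     # vote of this stump
        ensemble.append((alpha, stump))
        # Misclassified samples are up-weighted, correct ones down-weighted.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def adaboost_predict(ensemble, row):
    """Sign of the weighted vote of all stumps."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1
```

On a one-dimensional toy set such as x = 0..9 with labels (+,+,+,-,-,-,+,+,+,-), no single stump separates the classes, but the weighted vote of three stumps does.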

F. XGBoost
XGBoost is one of the boosting methods. When generating the next tree, it gives more emphasis to the observations that were misclassified by the trees generated so far. In other words, it is a boosting algorithm that trains a classifier to perform better on misclassified observations. The advantages of the XGBoost model are that it can prevent overfitting by minimizing the training loss and that it has faster learning and classification speeds than existing gradient boosting models [38] because it is based on parallel and distributed processing. The concept of XGBoost [39] is presented in Fig. 3.
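To make the link between the loss and overfitting control more concrete, the sketch below computes the regularized leaf weight and split gain used in the XGBoost formulation, where G and H are the sums of first and second derivatives of the loss in a node, lambda is the L2 penalty on leaf weights, and gamma is the penalty per additional leaf. It is an illustration in Python, not the library's code; the function names are hypothetical, and the logistic loss is assumed for a binary outcome.

```python
import math


def logistic_grad_hess(y, margin):
    """First and second derivative of the logistic loss for one sample,
    where y is 0 or 1 and margin is the current raw (log-odds) prediction."""
    p = 1.0 / (1.0 + math.exp(-margin))
    return p - y, p * (1.0 - p)


def leaf_weight(G, H, lam):
    """Optimal leaf weight: w* = -G / (H + lambda)."""
    return -G / (H + lam)


def split_gain(GL, HL, GR, HR, lam, gamma):
    """Loss reduction from splitting a node into left and right children:
    0.5 * [GL^2/(HL+lam) + GR^2/(HR+lam) - (GL+GR)^2/(HL+HR+lam)] - gamma."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma


if __name__ == "__main__":
    # Four positives go left and four negatives go right, all starting at margin 0.
    left = [logistic_grad_hess(1, 0.0) for _ in range(4)]
    right = [logistic_grad_hess(0, 0.0) for _ in range(4)]
    GL, HL = sum(g for g, _ in left), sum(h for _, h in left)
    GR, HR = sum(g for g, _ in right), sum(h for _, h in right)
    print(split_gain(GL, HL, GR, HR, lam=1.0, gamma=0.0))
    print(leaf_weight(GL, HL, lam=1.0))
```

Larger values of lambda shrink both the leaf weights and the gain, and a split is only worthwhile when its gain stays positive after subtracting gamma, which is how the penalty terms discourage overly complex trees.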

G. Data-level Approach
Disease data generally have an imbalance issue because the number of patients is smaller than that of healthy people. The data of this study also had an imbalance issue because the results of the K-NPI test showed that 90.5% of subjects were Alzheimer's dementia patients without anxiety and 9.5% were Alzheimer's dementia patients with anxiety. To overcome the data imbalance problem, this study compared prediction performance (accuracy, sensitivity, and specificity) using oversampling [40], undersampling [41], and the SMOTE method [42] among various data-level approaches.
Oversampling is a data-level approach that solves the imbalance issue by duplicating the data of the class with a small number of observations [43]. For example, if there are ninety 0s and ten 1s, the 1s can be duplicated until there are ninety 1s. As a result, the total number of observations becomes 180, and the ratio of 0 to 1 becomes 1:1. A ratio other than 1:1 can also be chosen. A shortcoming of oversampling is that, when the original dataset is large, building a model may take longer because of the larger sample size. Moreover, it may cause an overfitting problem [44]. The concept of oversampling [45] is presented in Fig. 4.
Undersampling is a data-level approach that resolves the data imbalance problem by randomly removing observations of the class with a response variable of 0 [43]. In other words, it randomly removes 0s until the ratio of 0 to 1 is 1:1. A ratio other than 1:1 can also be chosen. Since undersampling removes data, it has the problem that it may cause information loss. The concept of undersampling [45] is presented in Fig. 5.
Synthetic minority over-sampling technique (SMOTE) is a method that combines oversampling and undersampling. It randomly selects one sample from the minority class of the response variable and finds the k nearest neighbors of this sample. Then, the difference between the selected sample and a neighbor chosen from these k neighbors is calculated, and this difference is multiplied by a random value between 0 and 1. The result is added to the selected sample to form a new synthetic sample, which is added to the training dataset. This process is repeated until the desired number of samples is reached. The SMOTE algorithm is similar to oversampling in that it increases the data of the minority class. However, it is known to mitigate the overfitting issue of oversampling because it creates new samples by combining existing data instead of duplicating the same data. The concept of the SMOTE algorithm [46] is presented in Fig. 6.
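The two resampling schemes above can be sketched as follows, using the example of ninety 0s and ten 1s. This is a minimal Python illustration that assumes a binary outcome and a 1:1 target ratio; the function names and the seed value are arbitrary choices for the sketch, not the procedure or settings used in the study.

```python
import random


def random_oversample(X, y, minority=1, seed=1234):
    """Duplicate randomly chosen minority rows until both classes are equal in size."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    majority_idx = [i for i, label in enumerate(y) if label != minority]
    n_extra = max(len(majority_idx) - len(minority_idx), 0)
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    keep = majority_idx + minority_idx + extra
    return [X[i] for i in keep], [y[i] for i in keep]


def random_undersample(X, y, minority=1, seed=1234):
    """Randomly drop majority rows until both classes are equal in size."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    majority_idx = [i for i, label in enumerate(y) if label != minority]
    n_keep = min(len(minority_idx), len(majority_idx))
    keep = sorted(rng.sample(majority_idx, n_keep) + minority_idx)
    return [X[i] for i in keep], [y[i] for i in keep]


if __name__ == "__main__":
    y = [0] * 90 + [1] * 10
    X = [[float(i)] for i in range(100)]
    _, y_over = random_oversample(X, y)
    _, y_under = random_undersample(X, y)
    print(len(y_over), y_over.count(1))    # 180 rows, 90 of them minority
    print(len(y_under), y_under.count(1))  # 20 rows, 10 of them minority
```

Resampling of this kind is normally applied to the training data only, so that the test data keep the original class ratio.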
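The interpolation step of SMOTE can be sketched as follows. This is a minimal Python illustration of the commonly described procedure (one neighbor is drawn from the k nearest minority neighbors for each synthetic sample), not the R routine used in the study; the function name `smote`, the Euclidean distance, and the seed are assumptions of the sketch, and any accompanying undersampling of the majority class is left out.

```python
import random


def smote(minority, n_new, k=5, seed=1234):
    """Create n_new synthetic rows from a list of numeric minority-class rows."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    if k < 1:
        raise ValueError("at least two minority samples are required")
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the selected sample (squared Euclidean).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # random value between 0 and 1
        # New point = selected sample + gap * (neighbour - selected sample).
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


if __name__ == "__main__":
    minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
    for row in smote(minority, n_new=3, k=2):
        print(row)
```

Because each synthetic row lies on the line segment between two existing minority rows, the new samples are not exact copies, which is the property the text credits for reducing the overfitting seen with plain duplication.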

A. Accuracy of Prediction Models
The accuracy, sensitivity, and specificity of the eight prediction models ((AdaBoost and XGBoost) x (raw data, undersampling, oversampling, and SMOTE)) are presented in Table I. XGBoost based on SMOTE (accuracy=0.84, sensitivity=0.85, and specificity=0.81) was identified as the model with the best prediction performance (Fig. 7). Anxiety predictors of Alzheimer's dementia patients using SMOTE-XGBoost are presented in Table II. When the normalized importance of variables was analyzed, age, gender, family history of dementia, depression, ADL, K-MMSE, and CDR were confirmed as the major factors for predicting the anxiety of Alzheimer's dementia patients. Among them, depression showed the highest importance.

IV. DISCUSSION
This study developed models for predicting the anxiety of Alzheimer's dementia patients using the boosting algorithm and data-level approach. The results of this study showed that age, gender, family history of dementia, depression, ADL, MMSE, and CDR were the major factors in predicting the anxiety of Alzheimer's dementia patients. Previous studies [47,48] also revealed that age was significantly related to the behavioral and psychological symptoms of dementia patients. The results of Mushtaq et al. (2016) [49] showed that age was considerably associated with mood and aggressive behavior in early-onset Alzheimer's disease and with psychosis in late-onset Alzheimer's disease. Gang et al. (2016) [50] reported that a lower cognitive function score indicated worse behavioral and/or psychological symptoms. Cho et al. (2006) [51] also reported that as cognitive function decreased, the frequency of anxiety increased. The progression (stage) of dementia was also reported as a predictor of behavioral and psychological symptoms, and the stage of dementia was positively correlated with the number of expressed behavioral and psychological symptoms [52]. In particular, consistent with this study, Hall et al. (2004) [53] reported that depression was the most powerful factor influencing the frequency of behavioral and psychological symptoms such as anxiety. According to the results of this study, if an elderly Alzheimer's disease patient with reduced cognitive function shows depressive symptoms, the patient has a higher risk of anxiety. Therefore, it is necessary to identify and treat anxiety symptoms as soon as possible to maintain the patient's mental health.
This study developed prediction models based on imbalanced data using a boosting algorithm and a data-level approach. The results showed that the SMOTE-XGBoost model had the best prediction performance. Similar to the results of this study, a previous study [24] also reported that an XGBoost model showed superior classification accuracy compared to other boosting algorithms. It is believed that XGBoost has good prediction performance in classification and regression domains because, although it is a tree-based boosting algorithm built on the gradient boosting machine (GBM), it has its own overfitting regularization and early-stopping functions, which GBM does not have.

V. CONCLUSION
The results of this study suggest that a SMOTE-XGBoost model may provide higher accuracy than a SMOTE-AdaBoost model when developing prediction models from data with an imbalanced outcome variable, such as disease data.