Applying Synthetic Minority Over-sampling Technique and Support Vector Machine to Develop a Classifier for Parkinson’s Disease

As the number of Parkinson’s disease patients increases in the elderly population, it has become a critical issue to understand the early characteristics of Parkinson’s disease and to detect Parkinson’s disease as soon as possible during normal aging. This study minimized the imbalance issue by employing Synthetic Minority Over-sampling Technique (SMOTE), developed eight Support Vector Machine (SVM) models for predicting Parkinson’s disease using different kernel types {(C-SVM or Nu-SVM)×(Gaussian kernel, linear, polynomial, or sigmoid algorithm)}, and compared the accuracy, sensitivity, and specificity of the developed models. This study evaluated 76 senior citizens with Parkinson’s disease (32 males and 44 females) and 285 healthy senior citizens without Parkinson’s disease (148 males and 137 females). The analysis results showed that the linear kernel-based Nu-SVM had the highest sensitivity (62.0%) and overall accuracy (71.3%), with a specificity of 81.6%. The major negative relationship factors of the Parkinson’s disease prediction model were MMSE-K, Stroop Test, Rey Complex Figure Test (RCFT), verbal memory test, ADL, IADL, 70 years old or older, middle school graduation or below, and women. When the influence of variables was compared using “functional weight”, RCFT was identified as the most influential variable in the model for distinguishing Parkinson’s disease from healthy elderly. The results of this study suggest that developing a prediction model by using linear kernel-based Nu-SVM would be more accurate than other kernel-based SVM models for handling imbalanced disease data.
Keywords—Kernel type; Rey complex figure test; support vector machine; SMOTE; Parkinson’s disease


I. INTRODUCTION
As the elderly population increases, the occurrence of senile diseases is also increasing. Among these diseases, the incidence of Parkinson's disease in particular continues to rise. Health Insurance Review and Assessment Service (2018) [1] reported that the rate of increase in Parkinson's disease incidence was the second-highest in South Korea, following that of dementia. The number of Parkinson's disease patients increased about 2.5-fold over 12 years, from 40,000 in 2004 to 96,000 in 2016, and it reached 100,716 in 2017. As the number of Parkinson's disease patients increases in the elderly population, it has become a critical issue to understand the early characteristics of Parkinson's disease and to detect Parkinson's disease as soon as possible during normal aging.
Motor symptoms (e.g., resting tremor, rigidity (slowing body movements down)) are commonly observed in the early stage of Parkinson's disease [2,3,4]. Over the past 20 years, many studies [5,6,7] have focused on nonmotor symptoms such as autonomic nervous system dysfunction, dysesthesia, and cognitive impairment, which are observed in the early stages of Parkinson's disease. Shulman et al. (2001) [8] reported that these nonmotor symptoms were found in 88% of Parkinson's disease patients. Patients with Parkinson's disease do not need any help in performing their daily activities in the early stages [9] because their symptoms can be well controlled with a small amount of medication. However, as Parkinson's disease progresses, their cognitive and motor functions decline substantially, so it becomes difficult for them to conduct their daily activities, and they eventually lose the ability to perform them independently [10]. As a result, they must rely on others [10]. In addition, diminished cognitive function, along with the gradual decline in physical function and uncertainty about the progression of the disease, has been reported as a factor that causes both the patient and the family to fall into despair and depression [11,12]. In particular, nonmotor symptoms of Parkinson's disease such as cognitive impairment are major predictors of the morbidity of Parkinson's disease dementia [7,13]. Therefore, it is necessary to detect them as soon as possible, which requires accurately distinguishing the cognitive decline of normal aging from that of Parkinson's disease.
Previous studies [14,15,16] that evaluated the difference in cognitive functions between the healthy elderly and Parkinson's disease patients without dementia reported that cognitive issues of Parkinson's disease patients were mainly associated with frontal lobe dysfunction. Cooper et al. (1991) [17] reported that Parkinson's disease patients had difficulty in processing information due to frontal lobe dysfunction and that they could show impaired performance or behaviors inappropriate for the situation due to declined concentration. These results imply that the function of the frontal lobe is the key cognitive ability for detecting and predicting Parkinson's disease [18,19]. Nevertheless, there are not enough large-scale studies on the nonmotor symptoms of Parkinson's disease in South Korea [13], and efforts to predict Parkinson's disease using machine learning are even scarcer.
In addition, it is difficult to detect Parkinson's disease in the early stage because abnormal symptoms progress slowly, as is typical of a degenerative disease, and the onset of a symptom is often hard to identify [7]. It is very common that even Parkinson's disease patients do not know exactly when the abnormal symptoms began to occur, and they often do not recognize a progressing mild cognitive problem [20]. Even if they recognize it, they often think that the symptom is due to aging [20]. Furthermore, it is hard to diagnose Parkinson's disease with only one neurological examination. The diagnosis of Parkinson's disease requires several consecutive measurements regarding the reaction to medications and the progression of the disease. Consequently, it is even harder to detect Parkinson's disease in the early stage.
Many recent studies [21,22,23] have used support vector machine (SVM), a supervised learning algorithm, to classify and predict complex risk factors of diseases. When developing a prediction model using binary outcome data such as disease status, an imbalance issue is highly likely because the number of patients is smaller than that of people without the disease [24]. The imbalance issue may cause prediction errors during machine learning and degrade the performance of the model. Consequently, an additional sampling-based technique for processing imbalanced data is needed to resolve the prediction error caused by the imbalance. Previous studies [25,26] have reported that synthetic minority over-sampling technique (SMOTE) is less prone to overfitting than simple oversampling or undersampling. This study minimized the imbalance issue by employing SMOTE, developed eight SVM models for predicting Parkinson's disease using different kernel types ((C-SVM or Nu-SVM)×(Gaussian kernel, linear, polynomial, or sigmoid algorithm)), and compared the accuracy, sensitivity, and specificity of the developed models.

A. Subjects
This study evaluated 76 senior citizens with Parkinson's disease (32 males and 44 females) and 285 healthy senior citizens without Parkinson's disease (148 males and 137 females) living in Seoul, Incheon, and Gwangju. A senior citizen was defined as a person aged 60 to 74 years. In this study, patients with Parkinson's disease were defined as those diagnosed with idiopathic Parkinson's disease according to the diagnostic criteria of the United Kingdom Parkinson's Disease Society Brain Bank [27]. The selection criteria for healthy seniors were (1) those who did not have a history of neurological diseases such as stroke and Parkinson's disease, (2) those who scored at least 24 points on the Korean version of the Mini-Mental State Examination (K-MMSE) and were judged as normal, and (3) those who did not have a visual or hearing impairment while taking the test.
The power of this study was examined using G*Power version 3.1.9.7 (Universität Mannheim, Mannheim, Germany). The results showed that, when the number of predictors was 19, alpha = 0.05, power (1-β) = 0.95, and the effect size (f²) was 0.15, the required number of samples was 217. Therefore, it was concluded that the sample size of this study (n=361) was sufficient to test statistical significance (Figs. 1 and 2).

B. Measurements and Definitions of Variables
This study measured the cognitive level in each cognitive domain using the Cognition Scale for Older Adults (CSOA) [28]. The CSOA is a standardized test that can comprehensively measure cognitive function while considering the age and education level of the elderly in South Korea. It is used to diagnose dementia or cognitive disorders by evaluating each cognitive domain (subtest) in elderly people suspected of having dementia or a cognitive disorder. Kim (2011) [29] reported that the reliability of CSOA (Cronbach's alpha) was 0.932. CSOA is composed of eight subtests: Mini-Mental Status Examination in the Korean Version (MMSE-K), Verbal Memory Test, Stroop Test, General Information, Digit Span Test, Rey Complex Figure Test (RCFT), Confrontation Naming Test, and Verbal Fluency Test. This study transformed the raw scores of the eight subtests into standardized scores with a mean of 100 and a standard deviation of 15, and used them to develop prediction models.
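The score transformation described above can be illustrated with a minimal sketch. It assumes the rescaling is done against the mean and sample standard deviation of the scores passed in; the paper does not state the reference distribution, and the CSOA may instead rely on age- and education-specific norms. The function name `standardize` and the raw scores are hypothetical.

```python
from statistics import mean, stdev

def standardize(raw_scores, target_mean=100.0, target_sd=15.0):
    """Rescale raw scores so they have the given mean and standard deviation."""
    m = mean(raw_scores)
    s = stdev(raw_scores)  # sample SD; an assumption, the paper does not specify
    if s == 0:
        raise ValueError("all raw scores are identical; cannot standardize")
    return [target_mean + target_sd * (x - m) / s for x in raw_scores]

# Hypothetical raw RCFT scores (possible range 0 to 36), not data from the study.
rcft_raw = [12, 18, 24, 30, 36]
rcft_std = standardize(rcft_raw)
```

Putting all eight subtests on the same scale also keeps any single subtest from dominating a distance-based method such as SVM merely because its raw range is wider.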
MMSE-K: MMSE-K is a test that examines the overall cognitive level while considering the age and education level of the subject. It is composed of seven subdomains: orientation of time, orientation of place, memory registration, attention and calculation, memory recall, language function, and composition (construction). Scores range from 0 to 30 points.
Verbal Memory Test: The Verbal Memory Test uses 10 picture cards. The test is performed in the order of immediate recall trial, delayed recall trial, and delayed recognition trial, and it evaluates the memory function index comprehensively. The delayed recall trial is conducted 15-20 minutes after the immediate recall trial, and the delayed recognition trial is conducted immediately after the delayed recall trial. Three types of raw scores are calculated: immediate recall, delayed recall, and delayed recognition. The immediate recall score counts the correct responses of each trial, and the total score ranges from 0 to 30 points. The raw score of the delayed recall trial is the number of correct responses, and it ranges from 0 to 10 points. The raw score of the delayed recognition trial is calculated by subtracting the number of "false positives" (answering "yes" to a picture that was not shown before) from the number of "true positives" (answering "yes" to a picture that was actually shown before). If the score is negative, it is treated as 0. The total score ranges from 0 to 10 points.
General Information: It consists of 20 common-sense questions, each worth 1 point. The total score ranges from 0 to 20 points.
Digit Span Test: In this test, the tester calls out a list of numbers, and the test subject listens to it and repeats it immediately. There are a forward digit span test and a backward digit span test. Each test starts with an item with a short list of numbers and gradually progresses to items with longer lists. The raw score of each test is the sum of items, and the total score ranges from 0 to 14 points.
RCFT: It is a test that asks a subject to copy the Rey complex figure (RCF), and the copied drawing is used as a measure of visuospatial ability. The recalled drawing is used as a measure of memory function. The RCF can be divided into 18 elements, and each element is scored. Each element is evaluated while considering shape and location, and the raw score ranges from 0 to 36 points.
Confrontation Naming Test: It is a question-and-answer type test that asks a subject to look at a picture and name it. It consists of 24 items. The raw score ranges from 0 to 24 points.
Verbal Fluency Test: It consists of two trials. In the first trial, the test subject names as many animals as possible. In the second trial, the test subject names as many crops as possible. The time limit for each trial is 1 minute, and the raw score is calculated by summing the number of correct responses in the first trial and that in the second trial.
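Two of the raw-score rules described for the subtests are simple arithmetic and can be sketched as follows. The function names and example counts are hypothetical; a "true positive" here means answering "yes" to a picture that was actually shown, and a "false positive" means answering "yes" to a picture that was not shown.

```python
def delayed_recognition_score(true_positives, false_positives):
    """True positives minus false positives; negative results are treated as 0."""
    return max(0, true_positives - false_positives)

def verbal_fluency_score(animal_correct, crop_correct):
    """Sum of correct responses across the animal trial and the crop trial."""
    return animal_correct + crop_correct

# Hypothetical response counts, not data from the study.
recognition = delayed_recognition_score(true_positives=8, false_positives=3)
floored = delayed_recognition_score(true_positives=2, false_positives=5)
fluency = verbal_fluency_score(animal_correct=14, crop_correct=9)
```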

C. Explanatory Variables
Explanatory variables were gender (male or female), age, education level (middle school graduation or below, or high school graduation or above), economic activity (yes or no), mean monthly household income (<1.5 million KRW, 1.5-3 million KRW, or ≥3 million KRW), living with a spouse (living together, bereavement/separation, or single), smoking (non-smoking or smoking), drinking (non-drinking or drinking), subjective stress (yes or no), activities of daily living (ADL; total score), instrumental activities of daily living (IADL; total score), MMSE-K, Verbal Memory Test, Stroop Test, General Information, Digit Span Test, RCFT, Confrontation Naming Test, and Verbal Fluency Test.

D. SMOTE
In the Parkinson's disease data used in this study, the proportion of healthy elderly people without Parkinson's disease was 78.9%, and that of those with Parkinson's disease was 21.1%. Consequently, the classes of the outcome (y) variable were imbalanced. Classifiers trained on such skewed data tend to be biased toward the majority class. Overall accuracy may appear high as a result, but the precision and recall for the low-frequency class are likely to decrease. This study used SMOTE to overcome the imbalance issue of this binary dataset. For each sample of the minority class, SMOTE finds its n nearest neighbors that belong to the same minority class, draws a straight line between the sample and one of those neighbors, and creates a synthetic sample at a random point on that line, repeating the process until the desired synthetic ratio is reached. SMOTE's algorithm is presented in Fig. 3.
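The interpolation step of SMOTE can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the implementation used in this study: base samples are drawn at random rather than visited in turn, distances are Euclidean, and the function name `smote`, the parameter `k`, and the example points are hypothetical. Only the class counts (285 and 76) come from the study.

```python
import random

def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        x = minority[i]
        # Rank the other minority samples by squared Euclidean distance to x.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment between x and the chosen neighbour.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic

# Hypothetical two-feature records for the minority (patient) class.
minority = [[85.0, 90.0], [88.0, 84.0], [92.0, 95.0],
            [80.0, 87.0], [95.0, 82.0], [90.0, 91.0]]
n_needed = 285 - 76  # synthetic samples required to match the majority class size
synthetic = smote(minority, n_needed, k=3)
```

Because every synthetic point lies on a segment between two real minority samples, the new records stay inside the region already occupied by the minority class instead of duplicating existing records exactly.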

E. Development of Prediction Model
Models were developed using SVM to predict Parkinson's disease. SVM is a machine learning algorithm that finds the optimal decision boundary, a hyperplane that optimally separates the training data [30]. The concept of hyperplane is presented in Fig. 4. Although SVM has been reported to achieve higher accuracy and to be less prone to overfitting than other models such as decision trees, its prediction performance varies by kernel type [31]. Therefore, this study developed eight SVM models by combining the SVM type and the kernel type, (C-SVM or Nu-SVM)×(Gaussian kernel, linear, polynomial, or sigmoid algorithm), and compared their prediction performance (accuracy, sensitivity, and specificity) to identify the SVM model with the best prediction performance. The concept of kernel function is presented in Fig. 5.
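The eight-model design can be made concrete with a short sketch that writes out the four kernel functions in their commonly used forms and enumerates the (SVM type)×(kernel) grid. The parameter names `gamma`, `coef0`, and `degree` and their default values are assumptions for illustration; the paper does not report the hyperparameters it used.

```python
import math
from itertools import product

def linear_kernel(x, z):
    """Inner product of the two feature vectors."""
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, gamma=1.0, coef0=1.0, degree=3):
    return (gamma * linear_kernel(x, z) + coef0) ** degree

def gaussian_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel: decays with the squared distance between x and z."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def sigmoid_kernel(x, z, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * linear_kernel(x, z) + coef0)

KERNELS = {
    "gaussian": gaussian_kernel,
    "linear": linear_kernel,
    "polynomial": polynomial_kernel,
    "sigmoid": sigmoid_kernel,
}
SVM_TYPES = ("C-SVM", "Nu-SVM")

# Every combination of SVM type and kernel: 2 x 4 = 8 candidate models.
model_grid = list(product(SVM_TYPES, KERNELS))
```

Each entry of `model_grid` names one candidate, for example `("Nu-SVM", "linear")`, that would then be trained and evaluated on the same data split.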
This study randomly divided the data into training data and test data at a ratio of 7:3 to examine the prediction performance of the eight developed SVM models, and calculated overall accuracy, sensitivity, and specificity using the test data. In this study, sensitivity refers to the proportion of subjects with Parkinson's disease who were correctly classified (true positive rate), while specificity refers to the proportion of healthy subjects who were correctly classified (true negative rate). The best model was defined as the model with the highest accuracy among those with sensitivity and specificity of 0.6 or higher, and it was selected as the final model for predicting Parkinson's disease. All analyses were performed using Python version 3.8.0 (https://www.python.org) and R version 4.0.2 (R Foundation for Statistical Computing, Vienna, Austria).
Table I shows the results of descriptive statistics on the cognitive function, ADL, and IADL of the healthy senior citizens and the senior citizens with Parkinson's disease.

B. Comparing the Accuracy of Parkinson's Disease Prediction Models according to SVM Classification Algorithm
This study compared the accuracy, sensitivity, and specificity of the eight SVM models to confirm the prediction performance by kernel type (Table II). The analysis results showed that the linear kernel-based Nu-SVM had the highest sensitivity (62.0%) and overall accuracy (71.3%), with a specificity of 81.6%. It was noteworthy that the polynomial kernel-based C-SVM showed the highest specificity (86.5%) among the eight SVM models but the lowest sensitivity (28.8%). The linear kernel-based C-SVM had the lowest overall accuracy (Fig. 6).
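The three metrics compared here, and the selection rule stated in the methods (highest accuracy among models whose sensitivity and specificity are both at least 0.6), can be sketched as follows. The confusion-matrix counts and model names are hypothetical and are not taken from Table II.

```python
def evaluate(tp, fn, tn, fp):
    """Compute sensitivity, specificity, and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # patients correctly classified
        "specificity": tn / (tn + fp),  # healthy subjects correctly classified
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }

def select_best(results, floor=0.6):
    """Return the most accurate model whose sensitivity and specificity reach the floor."""
    eligible = {
        name: m for name, m in results.items()
        if m["sensitivity"] >= floor and m["specificity"] >= floor
    }
    if not eligible:
        return None
    return max(eligible, key=lambda name: eligible[name]["accuracy"])

# Hypothetical test-set counts for three candidate models.
results = {
    "model_a": evaluate(tp=31, fn=19, tn=40, fp=10),
    "model_b": evaluate(tp=15, fn=35, tn=45, fp=5),
    "model_c": evaluate(tp=25, fn=25, tn=49, fp=1),
}
best = select_best(results)
```

In this made-up example `model_c` has the highest accuracy (0.74) but is excluded because its sensitivity (0.50) is below the floor, so `model_a` is selected. The floor guards against a model that looks accurate only because it rarely predicts the patient class.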

C. Key Variables for the Classification of Parkinson's Disease in the Final SVM Model
This study selected the linear kernel-based Nu-SVM, which had the highest sensitivity and overall accuracy, as the best model for predicting Parkinson's disease. This study also calculated the importance of variables in the linear kernel-based Nu-SVM model, which utilized 83 support vectors. Although it is impossible to simply compare the magnitude of the influence or importance between variables, it is possible to identify whether the relationship between a predictor and an outcome variable is positive or negative. The major negative relationship factors of the Parkinson's disease prediction model were MMSE-K, Stroop Test, RCFT, verbal memory test, ADL, IADL, 70 years old or older, middle school graduation or below, and women. When the influence of variables was compared using "functional weight", RCFT was identified as the most influential variable in the model for distinguishing Parkinson's disease from healthy elderly.
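For a linear kernel, the direction of each predictor's relationship can be read from the weight vector, which is the sum of the support vectors weighted by their signed dual coefficients (w = Σ αᵢyᵢxᵢ). The sketch below illustrates that general property only; it is not the "functional weight" procedure used in this study, whose definition is not given here. The support vectors, coefficients, and the choice of Parkinson's disease as the positive class are hypothetical.

```python
def linear_svm_weights(support_vectors, dual_coefs):
    """Weight vector of a linear SVM; dual_coefs[i] is alpha_i * y_i."""
    n_features = len(support_vectors[0])
    weights = [0.0] * n_features
    for sv, coef in zip(support_vectors, dual_coefs):
        for j, value in enumerate(sv):
            weights[j] += coef * value
    return weights

feature_names = ["RCFT", "MMSE-K"]
# Hypothetical standardized support vectors and signed dual coefficients.
support_vectors = [[-1.0, -0.5], [1.0, 0.5], [0.5, -1.0]]
dual_coefs = [0.8, -0.5, -0.3]

weights = linear_svm_weights(support_vectors, dual_coefs)
directions = {
    name: ("negative" if w < 0 else "positive")
    for name, w in zip(feature_names, weights)
}
# Comparing magnitudes is only meaningful if all features share the same scale.
most_influential = max(zip(feature_names, weights), key=lambda nw: abs(nw[1]))[0]
```

With the patient class coded as positive, a negative weight means that a higher score on that variable pushes the decision value toward the healthy class.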

IV. DISCUSSION
In this study, MMSE-K, Stroop Test, RCFT, verbal memory test, ADL, IADL, 70 years old or older, middle school graduation or below, and women were the main predictors of Parkinson's disease. Among them, RCFT was the most influential variable. It is believed that RCFT was identified as the most important predictor in discriminating Parkinson's disease patients from the healthy elderly [34] because, although this test reflects visuospatial construction ability, the task of reproducing a complex figure also requires the function of the frontal lobe [35].
Another important finding of this study was that, when the eight SVM classification algorithms were compared to evaluate SVM performance by kernel type, the linear kernel-based Nu-SVM algorithm had the highest prediction accuracy. The performance of nonlinear SVM is affected by the employed kernel function and the parameters constituting it [36].

V. CONCLUSION
The results of this study suggest that developing a prediction model by using linear kernel-based Nu-SVM would be more accurate than using other kernel-based SVM models for handling imbalanced disease data. Additional studies using data from various fields are needed to confirm the prediction performance of linear kernel-based Nu-SVM.