Machine Learning Model to Analyze Telemonitoring Dysphonia Factors of Parkinson’s Disease

Parkinson’s disease (PD) has affected many people worldwide for years, and several datasets recording important PD features have been produced to support reliable diagnostic decisions. However, a dataset can contain correlated features and outliers that degrade the results obtained from it. In this work, we propose a framework in which the performance on an original dataset is compared with the performance on its reduced version after correlated features and outliers are removed. The dataset is collected from the UCI Machine Learning Repository, and many machine learning (ML) classifiers are evaluated on it using several metrics. The same process is repeated on the reduced datasets, and some improvement in prediction accuracy is observed. Among the ANOVA F-test, RFE, MIFS, and CSFS methods, the Logistic Regression classifier combined with RFE-based feature selection outperforms all other classifiers. Our improved system achieves 82.94% accuracy, 82.74% ROC, and 82.9% F-measure, along with a 17.46% false positive rate and a 17.05% false negative rate, which are better than the corresponding values for the primary dataset. We therefore hope that this model can help physicians diagnose PD more accurately.
Keywords: Parkinson’s disease; correlation; outliers; machine learning; RFE-based analysis


I. INTRODUCTION
Parkinson's disease (PD) is a chronic neurodegenerative disease of the nervous system that affects body movement, including speech [1]. James Parkinson first described this disease in 1817 and characterized the condition as the Shaking Palsy [2]. The main cause of PD is unknown. It affects 1% of people older than 65 years, and no medical treatment can cure it completely [3]. Almost 90% of patients have trouble speaking normally and expressing facial emotion, which results in slow speech, slurred words, mumbling, etc. [4]. The average age of patients lies between 55 and 65 years [5]. Environmental factors such as rural living, water consumption, pesticide handling and exposure, and environmental toxins raise an individual's risk of developing PD. Among neurological disorders such as Alzheimer's disease, headache disorders, stroke, epilepsy, multiple sclerosis, and dementia, PD is considered the second most common neurodegenerative disorder [2]. The substantia nigra, a region of the brain, contains cells that produce dopamine. Dopamine is a chemical messenger that transmits signals within the brain and controls body movement. When 60-80% of the dopamine-producing cells are lost, too little dopamine is produced, and people develop the movement disorders that characterize PD [6].
To ensure proper treatment of PD, patients must be identified as early as possible. Many studies have identified PD patients based on different aspects and parameters. The symptoms of PD are divided into motor and non-motor groups. The motor symptoms, also called cardinal symptoms, include tremor, rigidity, postural instability, and slowness of movement. In contrast, the non-motor group covers impairment of speech, facial expression, and handwriting. These types of symptoms are called dopamine non-responsive symptoms. Speech properties are among the most informative non-motor indicators because 90% of PD patients exhibit vocal impairment [7]. However, non-motor symptoms such as speech are not decisive on their own, so these attributes are combined with cerebrospinal fluid (CSF) measurements and dopamine transporter imaging to predict PD [8]. Because of redundant data points and degraded speech quality, it is difficult for physicians to detect PD cases by assessing vocal recordings manually. Thus, an automatic model that extracts the speech patterns of subjects and detects PD more efficiently is useful.
Machine learning is the study of computer algorithms that analyze existing instances and predict expected outcomes [9], [10]. It is defined as a process of discovering useful, interesting, and complex patterns in large, high-dimensional data [11], [12]. This technique is therefore useful for predicting PD from practical datasets. In this work, we propose a machine learning-based framework to make PD detection convenient for clinicians. The model combines several state-of-the-art techniques: feature selection, outlier detection, and classification. Several evaluation metrics, namely accuracy, area under the curve (AUC), F-measure, G-mean, sensitivity, specificity, fall-out, and miss rate, are then used to assess the performance of the individual classifiers [13]. Classifier performance is used to identify the most significant feature subset, i.e., the subset on which the classifiers perform better than on the others. The main contributions of the proposed PD diagnosis model are as follows:
• Various feature subsets are generated, and the best one is identified by assessing the performance of individual classifiers.
• Anomalous/noisy elements are detected to obtain more suitable feature subsets.
• Numerous evaluation metrics are considered to justify the performance of the classifiers.
This paper is organized as follows. Section 2 reviews similar studies and their implications. Section 3 presents the methodology of the machine learning model for detecting PD at an early stage and describes the PD dataset, feature selection, classification, and evaluation metrics. Section 4 reports the experimental results of the classifiers on the individual feature sub-datasets and compares them to identify the best feature subset. Finally, Section 5 concludes by summarizing this work and outlining future research directions.

II. RELATED WORK
Numerous studies have aimed to predict PD at an early stage. Das [14] used different classifiers, namely Artificial Neural Network (ANN), DMneural, Regression, and Decision Tree (DT), to detect PD and compared their results. Tsana et al. [15] employed novel speech signal processing, feature selection, and statistical classifiers to investigate PD. Challa et al. [8] developed an automatic PD diagnosis model with feature extraction and various classifiers such as Multilayer Perceptron (MLP), Bayes Net (BN), Random Forest (RF), and boosted LR for early prediction of PD. Shamli et al. [16] proposed a multi-class classification model including C4.5, Support Vector Machine (SVM), and ANN to enhance prediction performance and reduce the cost of PD diagnosis. Tong et al. [17] proposed a machine learning framework that achieves 75% classification accuracy and 69% balanced accuracy for neurodegenerative disease diagnosis. Since PD is also a neurodegenerative disease, their system can improve the prediction rate for clinical use. Li et al. [18] proposed a PD-oriented classification algorithm for improved classification performance. It involves a Classification and Regression Tree (CART) approach for iteratively picking the optimal training samples and an ensemble-learning algorithm combining RF, SVM, and ELM. Mathur et al. [19] implemented various classifiers, namely SMO, KNN, RF, AdaBoost.MI, Bagging, MLP, and DT, to scrutinize PD. Nilashi et al. [5] proposed a hybrid intelligent system for PD prediction in which Incremental SVM is used to estimate Total-UPDRS and Motor-UPDRS. Almeida et al. [20] used 18 feature extraction methods and 4 machine learning methods to investigate sustained phonation and speech tasks, and found phonation analysis to be more efficient than the speech task. Lahmini and Shmuel [21] investigated PD-related voice patterns using various pattern ranking methods and an optimized SVM. Mostafa et al. [22] proposed a new multiple feature evaluation approach (MFEA) and reported that DT, NB, ANN, RF, and SVM achieved their best results with MFEA. Pham et al. [7] combined voice and image datasets, using pairwise correlation and k-means clustering to extract features from the vocal dataset, and then proposed an ensemble method to predict PD. Pahuja et al. [2] extracted various significant features and selected feature subsets from a PD voice dataset. Different classifiers, namely ANN, SVM, and KNN, were then implemented, and ANN with the Levenberg-Marquardt algorithm provided the best results. Senturk et al. [6] proposed a machine learning model in which feature importance and recursive feature elimination (RFE) methods were used for feature selection; CART, ANN, and SVM were then used to identify PD patients. Karabayir et al. [23] analyzed PD acoustic data using light and extreme gradient boosting, RF, SVM, KNN, LASSO, and LR, and then used a feature importance procedure to identify significant features for classifying PD. Lamba et al. [24] presented a speech-signal-based hybrid PD diagnosis system in which numerous feature selection methods (i.e., mutual information gain, extra tree, genetic algorithm) and classification methods (i.e., NB, KNN, RF) were employed. The SMOTE method was also used to balance the PD dataset. Paramanik et al. [25] used two recent decision forest algorithms, SysFor and ForestPA, together with RF to develop PD detection models with the optimization of DT.

III. MATERIALS AND METHODS
In this work, we propose a machine learning framework to improve the efficiency of a PD dataset, where data validity is judged by applying many classifiers. For each classifier, multiple performance measures are computed, and we observe that the results can be improved by removing insignificant features and outliers. For feature selection, we employ a total of four methods and examine their outcomes.

A. Parkinson's Disease Data
We collected the dataset from the University of California Irvine (UCI) Machine Learning Repository; it was approved by the Bioethical Committee of the University of Extremadura. The dataset was created by Naranjo et al. [26]. It contains 240 instances from only 80 people, all older than 50 years. The 40 controls comprise 22 men and 18 women, and the 40 PD patients comprise 27 men and 13 women. According to the mean of the Unified Parkinson's Disease Rating Scale (UPDRS), all subjects have a PD duration of 5 years or less. The dataset contains 44 acoustic features extracted from recordings of the sustained vowel /a/ held for 5 s, with three runs per subject. These features fall into five categories: pitch local features, amplitude local perturbation, spectral envelope, noise, and nonlinear measures. The individual features of these categories are as follows:
• Pitch Local Features: jitter relative, jitter absolute, jitter relative average perturbation (RAP), jitter pitch perturbation quotient (PPQ).

B. Methodology
The overall implementation is illustrated in Fig. 1.
1) Data Acquisition: After gathering the PD voice dataset from the UCI data repository, we clean it and check for missing, wrong, and incomplete information. The dataset is then prepared for further analysis.

C. Feature Selection Methods
Feature selection methods reduce the number of input variables and lessen the computational cost of predictive models. In this work, we apply different feature selection methods to the primary dataset and explore several feature subsets. Sub-datasets are then generated from these subsets.

1) Correlation based Feature Selection (CFS):
Correlated features are linearly dependent on each other. Highly correlated features carry largely redundant information, so some of them have no significant impact on the predicted responses while still adding to the dimensionality. A correlation matrix is created to measure the correlation among features, and features whose coefficients exceed a particular limit are removed [27]. It is a square matrix with one row and one column per feature, in which all pairwise correlations are displayed together. A threshold is chosen, and every column whose correlation with another feature exceeds it is eliminated. As a result, the number of columns in our dataset decreases, and it retains only features with a coefficient below 0.90.
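The thresholding step can be sketched as follows. This is a minimal illustration, not the authors' code: the helper name `drop_correlated`, the synthetic `demo` frame, and the choice of Pearson correlation are assumptions; only the 0.90 cutoff comes from the text.

```python
import numpy as np
import pandas as pd


def drop_correlated(features: pd.DataFrame, threshold: float = 0.90) -> pd.DataFrame:
    """Drop one feature from every pair whose |Pearson r| exceeds `threshold`."""
    corr = features.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop)


# Hypothetical data: "a_copy" is almost a duplicate of "a", "b" is unrelated.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=200),
    "b": rng.normal(size=200),
})
reduced = drop_correlated(demo)
print(list(reduced.columns))
```

Inspecting only the upper triangle keeps the first feature of each correlated pair and drops the later one; which member of a pair is kept is a design choice the text does not specify.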
2) Analysis of Variance (ANOVA) F-test: The ANOVA F-test [28] determines whether the means of two or more data samples come from the same distribution. An F-test, or F-statistic, refers to a class of statistical tests based on the ratio between variances. The ANOVA F-test can be applied to detect the most important features and thereby reduce high data dimensionality. It is a common feature selection strategy for numerical input values and a categorical target variable.
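In scikit-learn, this kind of selection is typically expressed with `SelectKBest` and `f_classif`. The sketch below runs on synthetic data shaped like the dataset described earlier (240 instances, 44 features); the value `k=10` is an arbitrary assumption, since the text does not state how many features were retained.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the voice features (assumption, not the real data).
X, y = make_classification(n_samples=240, n_features=44, n_informative=8,
                           random_state=0)

# Score every feature with the one-way ANOVA F-statistic and keep the top k.
selector = SelectKBest(score_func=f_classif, k=10)
X_anova = selector.fit_transform(X, y)

print(X_anova.shape)
print(selector.get_support(indices=True))  # column indices of kept features
```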
3) Chi-Square Feature Selection (CSFS): CSFS evaluates the discrepancy from the distribution expected when the occurrence of a feature is independent of the class value [29]. It tests whether a feature and the class are independent, and discarding independent features helps to avoid overfitting, reduce computational time, and boost the system's accuracy. It works with data values measured on a nominal scale, and the differences between participant groups can be estimated without any assumptions about the distribution.
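A chi-square selector can be sketched with scikit-learn's `chi2` scorer, which requires non-negative inputs. Rescaling the features to [0, 1] with `MinMaxScaler` is an assumption made here to satisfy that requirement; the text does not say how the acoustic features were prepared, and `k=10` is likewise illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the voice features (assumption, not the real data).
X, y = make_classification(n_samples=240, n_features=44, n_informative=8,
                           random_state=0)

# chi2 rejects negative values, so map every feature to the range [0, 1].
X_scaled = MinMaxScaler().fit_transform(X)

selector = SelectKBest(score_func=chi2, k=10)
X_chi2 = selector.fit_transform(X_scaled, y)
print(X_chi2.shape)
```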

4) Mutual Information based Feature Selection (MIFS):
MIFS is based on mutual information, a measure of statistical dependence between random variables [30]. In brief, it quantifies the amount of information one random variable contains about another. When used as a feature selection scheme, it lets the model evaluate the relevance of feature subsets with respect to the output vector. By quantifying this gain, the system can make effective feature selection decisions.
5) Recursive Feature Elimination (RFE): RFE [31], [32] is effective at picking the more relevant parameters in large training datasets. When using RFE, practitioners should pay close attention to the number of features to select and to the choice of the underlying algorithm. It operates by searching for a subset of features, starting from all columns of the training dataset and removing irrelevant features. First, the classifier is trained; then the features whose weights have the smallest absolute values are eliminated, and this is repeated until only the required number remain.
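Both ideas in this passage, ranking by mutual information and recursive elimination, have direct scikit-learn counterparts. The sketch below is illustrative only: the synthetic data, the target of 10 features, and the use of logistic regression as the RFE base estimator are assumptions rather than the paper's reported configuration.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the voice features (assumption, not the real data).
X, y = make_classification(n_samples=240, n_features=44, n_informative=8,
                           random_state=0)
X_std = StandardScaler().fit_transform(X)

# Mutual information: score each feature against the class label, keep top k.
mi_score = partial(mutual_info_classif, random_state=0)
mi_selector = SelectKBest(score_func=mi_score, k=10)
X_mi = mi_selector.fit_transform(X_std, y)

# RFE: fit the estimator, drop the weakest feature, refit, and repeat
# until only n_features_to_select features are left.
rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=10, step=1)
X_rfe = rfe.fit_transform(X_std, y)

print(X_mi.shape, X_rfe.shape)
print(rfe.ranking_)  # rank 1 marks the features that survived
```

Standardizing before RFE matters with a linear model, because coefficient magnitudes are only comparable across features when the features share a scale.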

D. Outlier Detection
Outliers are data points that differ significantly from common observations because of measurement variability, sampling issues, or experimental errors [33]. They pull outcomes away from expected values in further analysis and are also referred to as deviant examples, unusual data, or special samples. In many cases, models do not produce good enough outcomes in the presence of outliers, so these values must be handled to obtain improved results. Among various methods, the interquartile range (IQR) method is widely used to find different types of outliers. The instances of the dataset are arranged in ascending order and divided into four equal parts by the first (Q1), second (Q2), and third (Q3) quartiles. Since the IQR extends from the first to the third quartile, IQR = Q3 - Q1. All records below the lower limit (Q1 - 1.5 IQR) or above the upper limit (Q3 + 1.5 IQR) are called outliers. After detection, they can be dropped or replaced by other suitable values. Left in place, these instances affect the results of machine learning algorithms on the dataset.
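The IQR rule can be illustrated on a single feature as below. The function name `iqr_outlier_mask` and the sample values are hypothetical; NumPy's default linear interpolation is assumed for the quartiles.

```python
import numpy as np


def iqr_outlier_mask(values: np.ndarray, k: float = 1.5) -> np.ndarray:
    """Return True for entries below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)


# Hypothetical measurements: the last value lies far from the rest.
sample = np.array([10.0, 11.0, 12.0, 12.5, 13.0, 13.5, 14.0, 15.0, 40.0])
mask = iqr_outlier_mask(sample)

print(sample[mask])    # flagged values
print(sample[~mask])   # values kept after pruning
```

For this sample Q1 = 12 and Q3 = 14, so the fences are 9 and 17, and only the value 40 is flagged.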

F. Evaluation Metrics
Several performance metrics, namely accuracy, AUC, F-measure, geometric mean, sensitivity, specificity, false positive rate, and false negative rate, are used to evaluate the results of the individual classifiers. These metrics are expressed as functions of the True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) counts.
• Accuracy is one of the most common evaluation metrics for classification models. It is the fraction of all instances that are classified correctly:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$
• AUC characterizes how well the positive class is separated from the negative class. It can be expressed through the TP rate (TPR) and the TN rate (TNR):
$$\text{AUC} = \frac{TPR + TNR}{2}, \qquad TPR = \frac{TP}{TP + FN}, \qquad TNR = \frac{TN}{TN + FP} \tag{2}$$
• F-measure is the harmonic mean of precision and recall:
$$\text{F-measure} = \frac{TP}{TP + 0.5\,(FP + FN)} \tag{3}$$
• Geometric mean (G-mean) is a measure of central tendency computed as the square root of the product of sensitivity and specificity:
$$\text{G-mean} = \sqrt{\text{Sensitivity} \times \text{Specificity}} \tag{4}$$
• Sensitivity is the proportion of actual positive instances that are predicted as positive:
$$\text{Sensitivity} = \frac{TP}{TP + FN} \tag{5}$$
• Specificity is the proportion of actual negative instances that are predicted as negative:
$$\text{Specificity} = \frac{TN}{TN + FP} \tag{6}$$
• False positive rate (fall-out) is the proportion of negative samples that are falsely classified as positive:
$$\text{False positive rate} = \frac{FP}{FP + TN} \tag{7}$$
• False negative rate (miss rate) is the proportion of positive samples that are falsely classified as negative:
$$\text{False negative rate} = \frac{FN}{FN + TP} \tag{8}$$
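The eight metrics can be computed directly from the four confusion-matrix counts, as in the sketch below. Two points are assumptions: AUC is taken as the mean of TPR and TNR (a single-threshold form consistent with the text's description in terms of TPR and TNR), and the counts passed in are hypothetical, not results from the paper.

```python
import math


def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute the evaluation metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "auc": (sensitivity + specificity) / 2,  # assumed single-threshold form
        "f_measure": tp / (tp + 0.5 * (fp + fn)),
        "g_mean": math.sqrt(sensitivity * specificity),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "fall_out": fp / (fp + tn),    # false positive rate
        "miss_rate": fn / (fn + tp),   # false negative rate
    }


# Hypothetical counts for illustration only.
scores = binary_metrics(tp=90, tn=80, fp=20, fn=10)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Note that fall-out equals 1 - specificity and miss rate equals 1 - sensitivity, so the last two entries carry no information beyond the two rates they complement.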

IV. EXPERIMENT RESULT AND DISCUSSION
In this experiment, we implement the machine learning techniques, i.e., feature selection, outlier detection, and classification, using the scikit-learn library in Python. From the different feature subsets, we generate the CFS, ANOVA F-test, CSFS, MIFS, and RFE datasets and apply the IQR method to detect outliers. DT, KNN, GNB, SVM, LR, MLP, XGB, RF, ET, AdaBoost, GB, and SGD are used to investigate these sub-datasets along with the primary dataset. The experiment was conducted on Google Colaboratory.
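A comparison of several classifiers on one dataset might be organized as in the following sketch. It covers only three of the listed classifiers, runs on synthetic data, and uses 10-fold cross-validation; the validation protocol is an assumption, since the text does not describe how the data were split.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for one of the feature sub-datasets (assumption).
X, y = make_classification(n_samples=240, n_features=44, n_informative=8,
                           random_state=0)

classifiers = {
    "GNB": GaussianNB(),
    "LR": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "DT": DecisionTreeClassifier(random_state=0),
}

# Mean accuracy over 10 stratified folds for each classifier.
results = {
    name: cross_val_score(clf, X, y, cv=10, scoring="accuracy").mean()
    for name, clf in classifiers.items()
}
for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

Wrapping the scaler and the model in one pipeline keeps the scaling parameters inside each training fold, which avoids leaking information from the held-out fold.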

A. Performance Analysis of Classifiers for Primary Dataset
The outcomes of each classifier for the primary dataset are presented in Table I. Among all classifiers, GNB provides the best results, with 82.50% accuracy, 82.50% AUC, 82.49% F-measure, 82.50% G-mean, 82.50% sensitivity, and 82.50% specificity, and the lowest fall-out (17.50%) and miss rate (17.50%). LR shows the second-highest results in detecting PD patients. Other classifiers such as DT, KNN, SVM, XGB, RF, ET, AdaBoost, and GB also show good results. However, MLP and SGD do not produce comparable outcomes in identifying PD patients.
When we inspect the ROC curves of the classifiers, GNB attains a higher TPR than any other classifier (see Fig. 2). The other classifiers also display good TPR, except MLP and SGD.

B. Performance Analysis of Classifiers for CFS Dataset
According to the outcomes in Table II, LR obtains the best results, with 80.30% accuracy, 80.21% AUC, 80.30% F-measure, 80.21% G-mean, 80.30% sensitivity, and 80.13% specificity, along with a 19.87% fall-out and a 19.70% miss rate. However, it does not exceed the best result of GNB on the primary dataset. The results of several classifiers, namely DT, KNN, SVM, XGB, AdaBoost, and GB, improve on the CFS dataset relative to the primary dataset, whereas those of GNB, MLP, RF, and ET decrease slightly.
In the ROC curves of the classifiers, LR also shows a higher TPR than the other classifiers (see Fig. 3). However, MLP and SGD do not provide as good a TPR as most of the other classifiers.

C. Performance Analysis of Classifiers for ANOVA F-test Dataset
In the classification results of Table III, GNB obtains the best outcomes for the ANOVA F-test dataset (81.42% accuracy, 81.41% AUC, 81.42% F-measure, 81.41% G-mean, 81.42% sensitivity, 81.40% specificity, 18.60% fall-out, 18.58% miss rate), but these do not improve on its results for the primary dataset. Results also degrade for KNN, SVM, MLP, RF, ET, and SGD. However, we notice a performance boost for DT, LR, XGB, AdaBoost, and GB.
In the ROC curves of the classifiers in Fig. 4, GNB shows the highest TPR. LR, DT, KNN, SVM, XGB, RF, ET, AdaBoost, and GB also show good outcomes.

D. Performance Analysis of Classifiers for CSFS Dataset
For the CSFS dataset, GNB gives the best performance (80% accuracy, 79.74% AUC, 79.87% F-measure, 79.74% G-mean, 80% sensitivity, 79.48% specificity, 20.52% fall-out, 20% miss rate), although it does not exceed its outcomes for the primary dataset (see Table IV). Many classifiers, namely KNN, SVM, MLP, XGB, RF, ET, AdaBoost, and GB, do not produce good results, whereas DT, LR, and SGD show improved results relative to the primary dataset. In the ROC curves of the classifiers (see Fig. 5), the curves of GNB and LR are very close to each other, but GNB has the best curve. Again, MLP and SGD show low TPR on the CSFS dataset.

E. Performance Analysis of Classifiers for MIFS Dataset
In this case, the outcomes of GNB and LR are very close to each other (see Table V), but GNB shows a slightly better result than LR (79.29% accuracy, 79.28% AUC, 79.29% F-measure, 79.28% G-mean, 79.29% sensitivity, 79.28% specificity, 20.71% fall-out, and 20.7% miss rate). It does not exceed the GNB result for the primary dataset. Some classifiers, namely KNN, SVM, MLP, RF, ET, and SGD, give worse results on the MIFS dataset, whereas DT, LR, XGB, AdaBoost, and GB show slightly improved results relative to the primary dataset.
The ROC curves of GNB and LR are almost the same for the MIFS dataset (see Fig. 6). The other classifiers also display good ROC curves, except MLP and SGD. For the RFE dataset, LR shows the best ROC curve, with a higher TPR than any other classifier (see Fig. 7).
Based on the performance measures and ROC curves of the classifiers, LR yields the best outcomes on the RFE dataset. However, these results are not stable across the different cases. Across the primary dataset and its generated sub-datasets, different classifiers give the best outcomes, and the feature reduction methods prove effective for detecting PD patients. We also examine the average results of each classifier, presented in Table VII. Here, GNB displays the best average outcomes among all classifiers, and LR provides the second-highest. RF, XGB, ET, DT, KNN, GB, and AdaBoost give good average results, in line with the earlier observations on the primary dataset and its sub-datasets. MLP and SGD do not achieve good average outcomes in this work.
The proposed framework integrates more feature selection and classification methods than existing works [14], [17], [20], [8], [44]. To evaluate its results, we consider a wide range of evaluation metrics, whereas several previous works [21], [2], [23] did not report such an evaluation. Along with the best feature selection and classification methods, the framework also identifies the most stable classifier, one that provides good outcomes across transformations and experimental settings.

V. CONCLUSION AND FUTURE WORK
This research has identified a reliable technique for feature selection on a PD dataset that offers more simplicity, less running time, and cost-effectiveness. First, we identify insignificant features using different methods, remove them, and generate sub-datasets. The IQR method is then applied to detect outliers and prune them. Next, many classifiers are used to investigate the resulting PD datasets, and their results are compared with those on the primary dataset. LR shows the highest outcomes with the RFE-based method, while GNB is the most stable method for investigating Parkinson acoustic instances. This approach can potentially be applied to similar types of datasets to obtain better solutions, distinguish between healthy people and patients, and lessen diagnosis costs. Some feature selection and classification methods produced unstable outcomes because of infrastructural settings. In the future, we would like to address these limitations and incorporate more widely used technologies to provide more satisfactory outcomes for detecting PD.