A Machine Learning Approach towards Detecting Dementia based on its Modifiable Risk Factors

Dementia is considered one of the greatest global health and social care challenges of the 21st century. Fortunately, dementia can be delayed or possibly prevented by changes in lifestyle as dictated through known modifiable risk factors. These risk factors include low education, hypertension, obesity, hearing loss, depression, diabetes, physical inactivity, smoking, and social isolation. Other risk factors are non-modifiable and include aging and genetics. The main goal of this study is to demonstrate how machine learning methods can help predict dementia based on an individual’s modifiable risk factor profile. We use publicly available datasets to train algorithms that predict a participant’s cognitive state diagnosis as cognitively normal, mild cognitive impairment, or dementia. Several approaches were implemented using data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) longitudinal study. The best classification results were obtained using both the Lancet and the Libra risk factor lists via longitudinal datasets, which outperformed cross-sectional baseline datasets. Moreover, using only data from the most recent visits provided even better results than using the complete longitudinal set. A binary classification (dementia vs. non-dementia) yielded approximately 92% accuracy, while the full multi-class prediction yielded 77% accuracy using logistic regression, followed by random forest with 92% and 70% respectively. The results demonstrate the utility of machine learning in the prediction of cognitive impairment based on modifiable risk factors and may encourage interventions to reduce the prevalence or severity of the condition in large populations.
Keywords—Machine learning; classification; data mining; data preparation; dementia; modifiable risk factors


I. INTRODUCTION
Dementia presents enormous global health and social challenges. Currently, there are around 47 million people with dementia worldwide, and that number is expected to triple by 2050 [1]. The aging population worldwide is almost certainly part of the reason behind this increase, especially in low- and middle-income countries. Dementia is a collection of symptoms of cognitive deficits, which could be delayed or possibly prevented by eliminating certain risk factors associated with the condition.
This study aims to use a machine learning (ML) approach to classify the cognitive state and detect dementia based on these risk factors. The main research contribution is a demonstration of the utility of interpretable machine learning methods for predicting the future cognitive status of an individual based on modifiable risk factor variables that have already been defined by the Lancet commission and the Libra index. The analysis is applied to data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) longitudinal study. To the best of our knowledge, no previous work has explored the Lancet and Libra lists of modifiable risk factors on the ADNI dataset using a machine learning approach.
The remainder of this paper is structured as follows. Section II provides background on the domain and some related work. The methodology applied in this research is described in Section III. The experiments and results are provided in Section IV. Finally, the conclusion of the research and its future work are provided in Section V.

II. BACKGROUND
Dementia is described as a collection of symptoms related to cognitive deficits and is not considered one single disease. In the Diagnostic and Statistical Manual of Mental Disorders (DSM-5) [2], dementia is listed under Major Neurocognitive Disorder (NCD), and is defined by the following [2]:
• There is evidence of a substantial cognitive decline in one or more cognitive domains.
• The cognitive deficits interfere with independence in everyday activities, are not exclusively in the context of a delirium, and are not mainly attributable to another mental disorder.
Dementia occurs mainly in people older than 65 years. Fortunately, it could be delayed or possibly prevented by eliminating some modifiable risk factors associated with it [1].

A. Dementia Risk Factors
The Lancet Commission study found that around 35% of dementia risk factors are potentially modifiable [1]. These risk factors include less education, hypertension, obesity, hearing loss, depression, diabetes, physical inactivity, smoking, and social isolation. Although the impact of these factors varies at different life stages, eliminating them at any stage would be beneficial. Moreover, studies recommend active treatment and intervention of modifiable dementia risk factors, which would potentially delay or prevent 30% of dementia cases [1], [3].
On the other hand, completely eliminating the apolipoprotein E (APOE) ε4 allele, which is considered the major genetic risk factor of dementia, could reduce its incidence by 7% [1]. However, this and all other genetic factors are considered to be non-modifiable. Besides genetics, other non-modifiable risk factors include age and gender.
A common method to calculate dementia risk based on its risk factors is by using the Lifestyle for Brain Health (LIBRA) index [4], [5], [6], which is calculated by the Innovative Midlife Intervention for Dementia Deterrence (In-MINDD) project [7]. Alzheimer's disease (AD) is the most common type of dementia. The next most common type is vascular dementia (VaD), followed by dementia with Lewy bodies. Frontotemporal degeneration and dementia associated with brain injury, infections, and alcohol abuse are less common types of dementia [1].
Tariq and Barber [15] suggested preventing dementia by targeting modifiable vascular risk factors, as AD and VaD often co-exist in the brain and share some common modifiable risk factors.

B. Current Approaches Used in Detecting Dementia Risk Factors
Many studies have aimed to predict an early diagnosis of dementia through magnetic resonance imaging (MRI) and genetic variables. However, these measurements are expensive and not always available [16].
Most of the research that has used machine learning applied classification methods from MRI data to classify or predict a diagnosis of different cognitive diseases and states [17], [18], [19].
On the other hand, only a few studies have used machine learning techniques to determine risk factors associated with dementia or one of its major causes (i.e., Alzheimer's disease) [17]. Some of the studies combined modifiable and non-modifiable risk factors in order to reach a higher level of accuracy.
Most of the available research used large cohort studies and a population-based perspective to determine associated risk factors [20], while some used statistical analysis to provide a ranked risk-factor index [6], [5].
Two main studies used machine learning techniques to detect dementia's risk factors and predict dementia risk accordingly [21], [22]. Both studies applied their analysis to one longitudinal cohort study with a relatively small size (i.e., 840 and 746 subjects respectively).
O'Donoghue, et al. [21] applied a non-linear dementia survival prediction model with a multilayer perceptron (MLP), which is an artificial neural network (ANN), and used both modifiable and non-modifiable risk factors defined in the In-MINDD project [4]. They also examined the hidden layers to extract different clusters of risk factors and explore different interactions between them. Due to a class imbalance in the MAAS dataset, their models were better at predicting survival than at predicting dementia. Their models' overall accuracy ranged between 53.57% and 70.24%.
Joshi, et al. [22] tried different attribute-evaluation methods on the major risk factors of both Alzheimer's and Parkinson's diseases, which included both modifiable and non-modifiable risk factors. Their attribute-evaluation methods included Chi-Squared, Gain Ratio, Info Gain, Relief F, and Symmetrical Uncertainty. They then applied several machine learning models, including Decision Tree, Random Forest (RF), and MLP, to predict the patient's future status based on the defined risk factors. Their models did not detect dementia itself but instead classified subjects' diagnoses among three neurodegenerative diseases: AD, VaD, and Parkinson's. They used a relatively small dataset of fewer than 500 subjects from the ADRC and ISTAART studies [22].
Conversely, other studies aimed to predict dementia from neuroimaging data and in particular magnetic resonance imaging (MRI) or positron emission tomography (PET) scans of the brain. Ding, et al. [23] were able to predict Alzheimer's disease around six years before its diagnosis using fluorine 18 fluorodeoxyglucose PET images of the brain. They achieved 82% specificity at 100% sensitivity using a deep learning algorithm. In another study, Casanova, et al. [18] used both MRI images and cognitive tests to detect Alzheimer's risk using regularized logistic regression.
Although prediction using MRI or PET scans or even genetics data can be very accurate, it is not practical in many countries to scale such an approach for population screening, and it does not present direct links with potentially modifiable factors that could be taken into account by an individual patient to delay dementia.
There have been no published studies to date that investigate machine learning approaches with larger datasets to link modifiable risk factors to dementia and thereby provide suggestions for treatment and lifestyle change based on multiple population-based longitudinal studies. Modern machine learning methods beyond those used in the aforementioned studies, focusing on modifiable risk factors and larger datasets, should be explored to determine whether they can produce better predictions and insight.
Moreover, using interpretable models where possible in clinical research is essential for intervention development and for gaining an understanding of the relationships and interactions between symptoms or risk factors and diagnosis. Interpretability is difficult to achieve using black-box models such as neural networks, which contain hidden layers, although they might yield higher prediction accuracy. The easiest way to achieve interpretability is through interpretable models such as linear and logistic regressions, decision trees, and Naïve Bayes [24]. Consequently, this paper focuses on such methods with modifiable risk factors as input variables, trained and tested on datasets significantly larger than those reported upon to date.

C. The Alzheimer's Disease Neuroimaging Initiative (ADNI) Study
Early prediction of dementia requires tracking changes in cognitive ability over time. The ideal study type for supporting this tracking is one that yields longitudinal data points. In longitudinal studies, data are collected on one or more variables repeatedly over time, in contrast with cross-sectional studies, in which data are collected on one or more variables at a single time point [25] [26].
The Alzheimer's Disease Neuroimaging Initiative (ADNI) (http://adni.loni.usc.edu) is a longitudinal study that was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. Its primary goal has been to test whether MRI, PET, other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer's disease (AD) [27].
The dataset consists of three longitudinal studies on around 1900 participants in total. ADNI enrolls participants between the ages of 55 and 90 who are either cognitively normal older adults serving as controls (CN), people with early or late MCI, or people with AD. The cognitive-state diagnoses (as well as dementia status) rating assessment of the participants was also provided.
The ADNI data set has been widely used in many research studies [17], [18], [19]. However, none of the published research that has used the dataset to date has attempted a machine learning approach to predict dementia based on established modifiable risk factors or even to explore the dataset for other possible dementia risk factors. Most of the research has instead focused on using MRI and PET scans or genetic data to predict Alzheimer's disease.
Most previous studies using the ADNI dataset and other longitudinal studies in the dementia field have used a complete case analysis (CCA) [29], and thus they considered only the cases with complete data and removed the missing values [17] [21] [31] [32]. Moreover, as per [30], if there is an overall worsening trend in health over time, missing data can be imputed from the same subject using their other available data.
While research has shown the importance of preventing or delaying dementia, and that this might be achieved by targeting known modifiable risk factors, few studies have applied machine learning approaches to selecting dementia's risk factors and predicting dementia status. Moreover, some of these studies combined both modifiable and non-modifiable risk factors. More research and work in this area would improve the early prediction of dementia and support recommendations for actions that could prevent or at least delay its onset by targeting only the modifiable risk factors. Using an interpretable machine learning approach for attribute selection and prediction would help to predict dementia based on its modifiable risk factors.

III. METHODOLOGY
This research follows one of the most widely used process models for predictive data analytics, the Cross-Industry Standard Process for Data Mining (CRISP-DM) model, adapted from [28] (see Fig. 1). The project lifecycle phases, as illustrated in the diagram, are business understanding, data understanding, data preparation, modeling, evaluation, deployment, and monitoring. All phases are included in this project except deployment and monitoring, which are beyond the scope of this research. Domain understanding has already been established in the background (Section II).

A. Understanding The ADNI Dataset
The ADNI dataset is extensive, containing hundreds of tables in different categories, from basic patient demographics to highly complicated gene and imaging datasets; however, not all tables were useful for the scope of this research. Therefore, an initial investigation of the dataset and its categories, subcategories, tables, fields, and their descriptions was needed. Fortunately, ADNI provides a data dictionary and an inventory that describe each table and its fields. The risk factor features are the independent variables, while the cognitive-state diagnosis is the dependent variable, which takes one of three values: cognitively normal (CN), mild cognitive impairment (MCI), or Dementia.

1) The Modifiable Risk Factor Attributes in ADNI
As the ADNI study dataset is extensive and consists of hundreds of tables and features under multiple categories, many of which are not needed for the aim of this research, the data dictionary and the inventory were used to identify only the necessary tables and the features within them.
After reviewing the tables listed, the attributes related to dementia risk factors and diagnoses were selected. These attributes are listed in Table II. Attributes were selected from all ADNI cohorts except ADNI3 as the protocol of taking the medical history was different, and thus, not all features were available. A total of 1812 subjects were considered in the analysis.

2) Cross-Sectional vs. Longitudinal Data
As the dataset used in this research is longitudinal, another step was needed to understand the data across the study timeline. First, an understanding of how the data appear as cross-sectional, either at the baseline or at any single time point, was obtained. Then, a complete longitudinal view of the dataset was analyzed, including the differences between the main study parts (i.e., ADNI1, ADNIGo, ADNI2, and ADNI3) and the data collected at each visit.

B. Data Preparation
In this phase, the data were prepared for modeling by applying various data mining techniques to clean and preprocess the data. This includes handling missing values, feature extraction, feature transformation, and other tasks. Dealing with longitudinal data adds a level of complexity to the preparation process because values may be missing or may change over time for a variety of reasons. A summary of the data preparation steps is shown in Fig. 2.

1) Dealing with Missing Values in Longitudinal Data
Based on the ADNI study description, missing data were coded with -1 or -4. Typically, -4 is used for not applicable (i.e., data is not collected at a specific visit), and -1 is used for confirmed missing data. The detailed study schedule shows the data collected at each visit for each cohort group (i.e., CN, MCI, and AD).
To check the reason for missing data and to determine whether the data were missing completely at random (MCAR), missing at random (MAR), or not missing at random (NMAR) [29] [30], the visit schedule descriptions, the visit registry table, and the exclusion tables were checked. The exclusion tables helped determine the reason for dropout, which might not be related to dementia, such as study partner availability, moving to another city, or not being willing to undergo MRI scans. For the available records with missing data, several reasons were identified, and different actions were applied.

a) Missing Data Due to Scheduled Visit Design: In some cases, the data were missing because they were not collected during a visit (e.g., some visits were only for an MRI imaging session; some data were collected only at the baseline). The missing data in these cases were considered MCAR and were imputed using the same patient's previous data, following the last observation carried forward (LOCF) method [30].
• Missing Height: In the detailed ADNI visit schedules, participants' heights were taken only once, during a screening visit, unlike their weights, which were taken repeatedly at each visit. Therefore, missing height data for each visit were filled in using the participant's screening visit height.
• Missing Demographics and Medical History: These data were collected only during the screening visit (repeated at the screening visit for each cohort, i.e., if a participant was included in ADNI1, 2, 3, medical history was taken at the screening visit for each cohort). The missing data were filled in using the same data for all visits (this is not imputation in the strict sense, because these values are treated as fixed; although they might change over time, such changes are not recorded).
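The LOCF step described above can be sketched in pandas as follows. This is a minimal illustration under stated assumptions, not the study's actual code: the table layout and the column names (RID, VISIT, HEIGHT, WEIGHT) are hypothetical, and the values are invented. It recodes the -1 and -4 missing-data codes and then carries each participant's last observed value forward, never across participants.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format visit table: one row per participant per visit.
visits = pd.DataFrame({
    "RID":    [1, 1, 1, 2, 2],                  # participant identifier (assumed name)
    "VISIT":  [0, 6, 12, 0, 6],                 # months since screening (illustrative)
    "HEIGHT": [170.0, -4.0, -4.0, 160.0, -4.0], # measured only at screening
    "WEIGHT": [70.0, 71.0, -1.0, 55.0, 54.0],   # measured at every visit
})

cols = ["HEIGHT", "WEIGHT"]

# Recode the missing-data codes: -4 = not applicable at this visit, -1 = confirmed missing.
visits[cols] = visits[cols].replace([-1, -4], np.nan)

# Sort so that "previous" means the previous visit of the same participant.
visits = visits.sort_values(["RID", "VISIT"])

# Last observation carried forward, applied within each participant only.
visits[cols] = visits.groupby("RID")[cols].ffill()
```

Grouping by participant before the forward fill is the important design choice here: a plain forward fill over the whole table would leak one participant's measurements into the next participant's first rows.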

b) Missing Data Due to ADNI Study Stage Design:
If the data were not collected during a specific ADNI stage, this means that the data were missing for all patients enrolled only during this stage. Therefore, only available or complete cases were considered. Examples of this include detailed smoking history, alcohol use, and medical history, which are not available for the ADNI3 study design. This led to selecting only ADNI1, ADNIGo, and ADNI2 cohorts for the study. Also, cognitive activity data were collected only beginning from ADNIGo, and thus, participants who were enrolled only in ADNI1 were excluded.

2) Feature Transformation
Not all features have the desired format. Some new features must be calculated from existing ones, and some binary or categorical features must be factorized or encoded. Moreover, some features must be aggregated because they are repeated in multiple rows and could be defined as unique new features. The applied feature transformations included:
a) Unit Modifications: Height and weight units were not unified across entries; measurements were recorded as kg/cm, lbs/inch, lbs/cm, or kg/inch. All measurements were converted to the metric units kg/cm.
b) Calculation: Some features needed to be calculated from other existing features. This may cause multicollinearity, which was reduced by selecting the most representative features, that is, those yielding better model results [18]. The calculated features were as follows:
• BMI and Obesity: Body mass index (BMI) was calculated from the height and weight of the participants. Obesity was recorded when BMI > 30 [12].
• Physical Activity: Physical activity level was calculated by adding up the answers to the related functional and physical assessment questionnaire items, such as going shopping, playing games, and going out of the neighborhood.
c) Factorization: Visit codes were in a string format and were factorized into numerical values for simple computations and comparisons.
d) Aggregation: Structured, row-based medical description fields were converted to binary column-based features (i.e., one row for each condition per participant was converted to a single row with all conditions per participant).
e) Normalization: For modeling purposes, numerical data were normalized to range from 0 to 1 using the MinMaxScaler.
f) Encoding Categorical Features: Categorical features were encoded using dummy variables by converting a feature of k categories into k-1 different dummy variables [34]. This was applied to the marital status and gender variables.
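Several of the transformations listed above can be illustrated with a short sketch. The column names, unit labels, and values below are assumptions made for illustration only; the conversion constants are the standard ones (1 lb = 0.45359237 kg, 1 inch = 2.54 cm), and BMI is computed as weight in kilograms divided by the square of height in meters.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical records with mixed units (names and values are illustrative).
df = pd.DataFrame({
    "weight": [176.0, 70.0, 95.0],
    "weight_unit": ["lbs", "kg", "kg"],
    "height": [68.0, 175.0, 160.0],
    "height_unit": ["inch", "cm", "cm"],
    "marital_status": ["married", "widowed", "never_married"],
})

# a) Unit modification: convert every entry to kg and cm.
LB_TO_KG, INCH_TO_CM = 0.45359237, 2.54
df["weight_kg"] = df["weight"].where(df["weight_unit"] == "kg", df["weight"] * LB_TO_KG)
df["height_cm"] = df["height"].where(df["height_unit"] == "cm", df["height"] * INCH_TO_CM)

# b) Calculation: BMI = weight (kg) / height (m)^2; obesity flag when BMI > 30.
df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2
df["obese"] = (df["bmi"] > 30).astype(int)

# e) Normalization: rescale numerical columns to the range [0, 1].
num_cols = ["weight_kg", "height_cm", "bmi"]
df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])

# f) Encoding: a feature with k categories becomes k-1 dummy columns.
df = pd.get_dummies(df, columns=["marital_status"], drop_first=True)
```

Note that the obesity flag is derived before normalization, since the BMI > 30 threshold only makes sense on the original scale. In a modeling pipeline the scaler would normally be fitted on the training set only and then applied to the test set.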

3) Feature Extraction
Most risk factors available within the medical history description were text entries. These descriptions were entered as a free, unstructured text field with multiple variations of the same condition, which required some preprocessing to extract the features.
Some basic text mining techniques were applied to extract the previously defined risk factors and then to check for other possible factors. Using the NLTK package, stop words were removed, the text was converted to lower case, the most common terms and n-grams were selected, word clouds were plotted, and the known risk factor terms were searched and selected. Fig. 3 illustrates how medical history descriptions differ between those with dementia and others. After applying text mining, each unstructured medical history text field was converted to a structured field (categorized), which is illustrated by Fig. 4.
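The conversion from free text to structured risk-factor fields can be sketched as below. This is an illustrative stand-in rather than the study's pipeline: the study used the NLTK stop-word list, whereas this sketch uses a small inline stop-word set so that it runs without downloading any corpus, and the search patterns and example descriptions are hypothetical.

```python
import re
from collections import Counter

# Small illustrative stop-word set (the study used NLTK's list instead).
STOP_WORDS = {"of", "the", "and", "with", "for", "a", "in", "on", "since", "history"}

# Hypothetical mapping from a risk factor to the text variants that indicate it.
RISK_PATTERNS = {
    "hypertension": r"\b(hypertension|high blood pressure|htn)\b",
    "diabetes": r"\b(diabetes|diabetic)\b",
    "depression": r"\b(depression|depressed)\b",
    "hearing_loss": r"\b(hearing loss|hearing aids?)\b",
}

def tokenize(text):
    """Lower-case the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def bigrams(tokens):
    """Return adjacent token pairs (2-grams)."""
    return list(zip(tokens, tokens[1:]))

def extract_risk_factors(text):
    """Convert one free-text description into binary risk-factor flags."""
    lowered = text.lower()
    return {name: int(bool(re.search(pattern, lowered)))
            for name, pattern in RISK_PATTERNS.items()}

descriptions = [
    "History of High Blood Pressure since 1998",
    "Diabetes and hearing loss",
    "HTN, depression",
]

# Most common terms across all descriptions (used to spot candidate risk terms).
term_counts = Counter(t for d in descriptions for t in tokenize(d))

# One structured record of binary flags per description.
flags = [extract_risk_factors(d) for d in descriptions]
```

Matching several spelling variants per condition is what handles the "multiple variations of the same condition" mentioned above; the term and n-gram counts are the exploratory step that suggests which variants to add.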

4) Feature Selection
The previously defined features that have been clinically established as relevant were selected. Both the Lancet commission and the Libra index modifiable risk factor features were considered in order to determine which gives better results.

5) Data Integration
All selected tables were integrated and merged into a single table with all considered features.

C. Modeling
This research focused on interpretable modeling because of its importance for informing clinicians managing patients. Interpretable machine-learning classification models were used, namely Logistic Regression, Naïve Bayes, Decision Tree, and Random Forest. Both binary (dementia vs. non-dementia) and multi-class (CN vs. MCI vs. dementia) classifications were applied using these models.

1) Logistic Regression
Logistic regression (LR) is an extension of linear regression and is used to solve classification problems. It was originally designed for binary classification problems with only two outcomes, but it can be extended to support multi-class classification, which is referred to as multinomial logistic regression [28] [24]. A well-known method for achieving multi-class classification is to use a set of one-versus-all models. For example, if there are n target levels, n one-versus-all logistic regression models are created, and each model distinguishes between one target level and all the others [28] [33].
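The one-versus-all scheme can be illustrated with scikit-learn's OneVsRestClassifier wrapped around a binary logistic regression. The data below are synthetic stand-ins (three classes labeled 0, 1, 2 in place of CN, MCI, and dementia), and the settings are illustrative assumptions rather than the configuration used in this study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data: 3 target levels, 50 samples each, 4 features.
y = np.repeat([0, 1, 2], 50)
X = rng.normal(size=(150, 4)) + y[:, None]  # shift the class means apart

# One binary logistic regression is fitted per target level (that level vs. the rest).
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)

n_models = len(ovr.estimators_)       # number of one-versus-all models
proba = ovr.predict_proba(X[:5])      # per-class scores, normalized to sum to 1
pred = ovr.predict(X[:5])             # class whose model gives the highest score
```

With n target levels the wrapper holds n fitted binary models, and the predicted class is simply the one whose model is most confident.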

2) Naïve Bayes
In machine learning, the Naïve Bayes (NB) method serves as a probabilistic classifier that uses the Bayes' theorem of conditional probabilities [24] [28]. It assumes a strong (naïve) independence between features and calculates the class probabilities for each feature independently. The conditional probability of a class is the normalized class probability times the probability of each feature given by a class [24].

3) Decision Tree
A decision tree (DT) is a tree-based model that splits the data repeatedly according to specific cutoff values in the features [24] [28]. The splitting creates different subsets of the data, with each instance belonging to exactly one subset.

4) Random Forest (Ensemble Learning)
The random forest (RF) model is an ensemble learning model that combines bagging, subspace sampling, and decision trees to create a more powerful model [28] [33]. The random forest model mitigates the overfitting problem of a single decision tree, which is why it usually performs better. A random forest model is a collection of decision trees in which each tree is slightly different from the others. Once the individual decision tree models have been created (bagging), the ensemble makes predictions by returning the majority vote of the classifiers. This reduces the amount of overfitting by averaging the results while maintaining the predictive power of each tree [33].

D. Evaluation
After the models were developed, the results were evaluated using multiple metrics and techniques to identify possible problems with overfitting and parameter tuning.

Confusion Matrix-Based Performance Measures
A confusion matrix is a convenient method used to comprehensively describe the performance of a classification model, which can be either binary or multi-class [28] [33] [35]. Most other metrics are derived from the basic components of the confusion matrix, which are the True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) counts, and their percentage conversions. From these components, the main evaluation measures such as accuracy, precision, recall, and F-score were calculated [35]. In this study, recall (sensitivity) is defined as the proportion of subjects who have dementia that are correctly classified. Precision is defined as the proportion of subjects classified as having dementia who actually have it, whereas the proportion of subjects without dementia that are correctly classified is the specificity. Accuracy is defined as the proportion of all subjects that are correctly classified, while F1 is the harmonic mean of precision and recall.
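As a worked illustration, the measures can be derived from the four confusion-matrix counts using the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), specificity = TN / (TN + FP), accuracy = (TP + TN) / total, and F1 = 2 · precision · recall / (precision + recall). The function name and the toy labels below are hypothetical and chosen only for illustration.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute standard measures from the confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(num, den):
        # Return 0.0 when a measure is undefined (empty denominator).
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,                       # sensitivity
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }

# Toy example: 1 = dementia, 0 = non-dementia.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
m = binary_metrics(y_true, y_pred)
```

In this toy example there are 3 true positives, 5 true negatives, 1 false positive, and 1 false negative, so accuracy is 0.8 while precision, recall, and F1 are all 0.75.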

1) Sensitivity, Specificity, and AUROC
The receiver operating characteristic (ROC) curve evaluates a model's performance across all possible probability cutoffs (thresholds). The default threshold is 0.5; however, it can range from 0 to 1, and the classification results change accordingly. The area under the ROC curve (AUROC) summarizes how both the TPR (sensitivity) and the FPR (1 - specificity) change across thresholds. A perfect fit gives 1, the worst gives 0, and random prediction gives 0.5.
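The threshold sweep behind the ROC curve can be made concrete with a small sketch. The helper names and the toy scores are hypothetical; the sketch uses each distinct predicted score as a threshold, records the resulting (FPR, TPR) point, and integrates the curve with the trapezoidal rule. In practice a library routine such as scikit-learn's roc_auc_score would normally be used instead.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by sweeping the decision threshold."""
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for thr in sorted(set(scores), reverse=True):
        # Predict positive whenever the score reaches the threshold.
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auroc(y_true, scores):
    """Area under the ROC curve, computed with the trapezoidal rule."""
    pts = roc_points(y_true, scores)
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

# Toy example: one positive is scored below one negative.
auc_mixed = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
# All positives scored above all negatives (perfect ranking).
auc_perfect = auroc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
# All positives scored below all negatives (worst ranking).
auc_worst = auroc([1, 1, 0, 0], [0.1, 0.2, 0.8, 0.9])
```

The three toy cases give 0.75, 1.0, and 0.0 respectively, matching the interpretation of the extremes given above. The sketch assumes both classes are present in y_true.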

IV. EXPERIMENTS AND RESULTS
The experiments and analysis conducted for this research were carried out using the following environments, tools, and libraries:
1) Environments Used: Python (3.6.4) and R.
2) Tools Used: Jupyter Notebook version 5.4.0, Google Colab (for faster modeling), and SPSS version 24 (for analyzing the missing-data mechanism and for quick data exploration).

B. Model Validation
A balanced train-and-test split was applied using StratifiedKFold to split the data once into a 75% training set and a 25% testing set. This ensures the same percentage of each class in each set. Moreover, longitudinal data were grouped by participant using GroupKFold, a variant of cross-validation that takes into account the repeated measurements from the same subject and treats them as grouped data, so that a subject's visits never appear in both the training and the testing sets. Parameter tuning was achieved using nested cross-validation, in which GridSearchCV parameter tuning (inner loop) is applied within cross-validation (outer loop).
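A minimal sketch of grouped, nested cross-validation with scikit-learn is given below. The data are synthetic, and the number of folds, the parameter grid, and the choice of logistic regression are illustrative assumptions rather than the study's actual settings. With four outer folds, each outer split holds out roughly 25% of the participants.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, GroupKFold

rng = np.random.default_rng(0)

# Synthetic stand-in: 40 participants with 3 visits each and 5 features.
groups = np.repeat(np.arange(40), 3)             # participant id per row
y = np.repeat(rng.integers(0, 2, size=40), 3)    # one label per participant
X = rng.normal(size=(120, 5)) + y[:, None]

outer = GroupKFold(n_splits=4)
param_grid = {"C": [0.1, 1.0, 10.0]}             # illustrative grid only
outer_scores = []

for train_idx, test_idx in outer.split(X, y, groups):
    # No participant may appear on both sides of the split.
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])

    # Inner loop: tune hyperparameters using the training participants only.
    inner = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                         cv=GroupKFold(n_splits=3))
    inner.fit(X[train_idx], y[train_idx], groups=groups[train_idx])

    # Outer loop: score the tuned model on participants it has never seen.
    outer_scores.append(inner.score(X[test_idx], y[test_idx]))

mean_accuracy = float(np.mean(outer_scores))
```

Keeping the tuning inside the outer training folds is what prevents the reported accuracy from being optimistically biased by the parameter search, and grouping by participant prevents repeated visits of one subject from leaking between training and testing data.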

C. Feature Selection
Using only statistically significant features, either from the univariate analysis or per model, was not sufficient because it decreased the accuracy from an average of 70% to 50%. A possible explanation is that there are interactions between the features. The best feature selection was obtained using both the Lancet and the Libra index features. Using the Lancet features alone gave an average accuracy of only 59%.
On the other hand, using the Libra features alone gave an average accuracy of 68%, suggesting that this list is more comprehensive and more predictive for the machine learning models, although combining it with the Lancet list gave better results.

G. Evaluation Summary
The best classification results were obtained using both the Lancet and the Libra risk factor lists, considering the longitudinal data set which outperformed the cross-sectional baseline one. Moreover, using data of the most recent visits only provided even better results than using the whole longitudinal set.
The binary classification yielded about 92% accuracy, while the multi-class classification yielded 77% accuracy using logistic regression, followed by random forest with 92% and 70% respectively. The area under the ROC curve for the dementia class was high, at 96% for both models.
Furthermore, as this is an analysis of an observational study, the feature importance of each model does not imply any causal link to dementia or MCI. The importance derived from the available data may not be representative of a wider population.

A. Achievements of the Research Objectives
The research discussed and evaluated in the previous sections aims to use different interpretable machine-learning classification models to detect dementia based on its modifiable risk factors only.
The best classification results were obtained using both the Lancet and the Libra risk factor lists, considering the longitudinal data set which outperformed the cross-sectional baseline one.
Moreover, using data of the most recent visits only provided even better results than using the whole longitudinal set. The binary classification yielded about 92% accuracy, while the multi-class classification yielded 77% accuracy using logistic regression, followed by random forest with 92% and 70% respectively.

B. Limitations
This research involved an experimental analysis of an observational study based on the ADNI dataset, and there is no claim of causation. The ADNI study was not primarily designed to address the modifiable risk factors; thus, it may lack some useful features, especially for the early and middle life courses. Social isolation and physical activity are not explicitly addressed by the study, and the results might be more accurate if more detailed data for these factors were collected. Medical history and other useful demographic features, such as occupation, were collected as free text and were not categorized in a structured format during the data collection stage; a structured format would have made the analysis simpler and possibly more accurate.
ACKNOWLEDGMENT
Data used in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf
Data collection and sharing for this project was funded by the Alzheimer's Disease Neuroimaging Initiative (ADNI) (