AMIM: An Adaptive Weighted Multimodal Integration Model for Alzheimer’s Disease Classification

Abstract—Alzheimer’s disease (AD) is an irreversible neurological disorder, so early diagnosis is extremely important. Magnetic resonance imaging (MRI) is one of the main medical imaging modalities used clinically to detect and diagnose AD. However, most existing computer-aided diagnostic methods use only MRI slices for model design and ignore the informational differences between slices. In addition, physicians often combine multimodal data, such as medical images and clinical information, to reach more accurate judgments. We therefore propose an adaptive weighted multimodal integration model (AMIM) for AD classification. The model is the first to use global information images, maximum information slices and clinical information together as inputs, and it adopts an adaptive weight integration method for classification. Experimental results show that our model achieves an accuracy of 99.00% for AD versus normal controls (NC) and 82.86% for mild cognitive impairment (MCI) versus NC. The proposed model achieves the best classification accuracy compared with most state-of-the-art methods.


I. INTRODUCTION
Alzheimer's disease, a chronic neurodegenerative disease causing the death of nerve cells and tissue loss throughout the brain, usually starts slowly and worsens over time. AD is expected to affect 1 out of 85 people in the world by the year 2050 [1]. The progression of AD gradually leads to memory degradation and impairment of cognitive function, eventually leading to irreversible neuronal damage [2]. Although no treatment has been proven to be effective in preventing the progression of AD [3], the early diagnosis of AD remains important for subsequent treatment to delay the onset of cognitive symptoms [4].
Since 2013, deep learning has gained considerable attention in AD detection research, with the number of published papers in this area increasing drastically since 2017 [5]. Early unsupervised methods used autoencoders or restricted Boltzmann machines to extract features that were then used for AD classification [6]-[8]. Supervised learning applied to the diagnosis of AD has been particularly well studied compared with unsupervised methods. Convolutional neural networks (CNNs) are the most successful deep models for image analysis: they exploit spatial information by taking 2D/3D images as input and extracting features through stacked convolutional layers, producing a hierarchy of progressively abstracted features [9], [10]. Most studies on AD have adopted 2D CNNs or 3D CNNs as the deep model. A large number of studies have performed feature extraction on MRI slices with 2D CNNs for ADNI classification [11]-[16]. Since MRI provides 3D images, how to select slices is a question worth considering; moreover, 2D slices cannot contain all the information of a 3D MRI, so global information is missing. 3D CNNs are widely used for the diagnosis of 3D MRI: they require no slice selection and retain global information. However, AD detection must then take the whole image or some ROI as input [17], [18], which steeply increases the number of parameters and leads to heavy computation, long running times and overlapping data. Joint 2D-3D CNNs [19]-[21] first perform 3D feature extraction on multiple 3D inputs and then obtain the final classification result with a 2D CNN; they likewise suffer from high computational cost and long running times.
In addition, in the early detection of Alzheimer's disease, the degree of brain atrophy is less variable, and assessment by a single modality of MRI alone may have a certain bias; a combined assessment with multiple modalities will yield a relatively more accurate diagnosis.
Slice-level classification lacks three-dimensional spatial information and suffers from the subjective uncertainty of slice selection. We propose to effectively superimpose all slices to generate a dynamic 2D image that captures the changes across slices, i.e. a global information image. At the same time, the slice with the largest amount of information is selected by an image entropy method, and clinical information is used as the input of a multidimensional-feature auxiliary integrated approach. Our experiments show that the proposed AMIM model improves performance significantly. The main contributions of our study are threefold.
• We propose an adaptive weighted multimodal ensemble model. The model uses an adaptive weighting method to optimize the weights of the different branches, which greatly reduces computational cost and time compared with the grid search method.
• For the first time, we propose a new MRI image preprocessing method that uses dynamic images and maximum information entropy slices as the MRI inputs; at the same time, a clinical information modality is introduced to obtain better classification performance.
• A comprehensive evaluation is conducted on the ADNI dataset. Experiments show that our method achieves the best classification accuracy compared with most state-of-the-art methods.
The rest of the paper is structured in sections and represented as follows. In Section II, related work describes the research status of Alzheimer's disease classification in detail. Section III introduces the structure and algorithm of the AMIM model. Section IV introduces the classification performance of the model on the ADNI dataset. Section V discusses the performance analysis of different views. Section VI concludes the paper.

II. RELATED WORK
With the rapid development of deep learning since 2012, research on the diagnosis of AD has grown rapidly. Work on AD classification can be divided into the following branches according to input: ROI level, patch level, 3D subject level and 2D slice level. ROI models [22], [23] require manual selection of the region of interest from the original brain image as the input of the CNN, which is a time-consuming task. Patch models [19]-[21] obtain multiple patches from the entire 3D MRI, but suffer from data overlap. It is much more straightforward and desirable to use the entire image as input. At the 3D subject level, Korolev et al. [17] adopted 3D VGG and 3D ResNet as the backbone networks for feature extraction, but the classification accuracy was only slightly above 80%. Spasov et al. [18] proposed a method combining 3D MRI with clinical information, which obtained good classification results. However, whether single-modal or multimodal, 3D MRI entails a large amount of computation and long running times. 2D slice classification methods reduce the number of parameters to a certain extent. Because medical datasets have small sample sizes, Hon et al. [11] applied transfer learning to AD classification, taking 32 slices from each subject as the dataset; the model performed well, but the result was reported only at the image level, not the subject level. Islam et al. [12] proposed a deep convolutional neural network for AD diagnosis using brain MRI data analysis and obtained good classification results. Zhang and colleagues [13] performed a systematic evaluation of CNN models with different structures and capacities, showing that advanced structural models with medium capacity performed better than models with maximum capacity.
Good results have also been obtained in other slice classification studies [14]-[16]. However, all these methods are based on 2D images, cannot contain all the information of a brain scan, and ignore the 3D spatial information.

III. METHODOLOGY

According to the above analysis, we propose a new AD classification network architecture, AMIM, which combines 2D-3D MRI and clinical information to solve the problem of missing 3D spatial information in slice-based classification. Clinical information is introduced as the input of another modality. The model uses an adaptive weighting method to learn the weight shares of the different classifiers. Its architecture is shown in Fig. 1. Our proposed method is flexible and can in principle integrate other imaging modalities, such as positron emission tomography (PET), as well as other clinical datasets. Inspired by transfer learning, we use a classical neural network pre-trained on ImageNet, with the last classification layer removed, as the backbone network for feature extraction [24]. ResNet18 is used as the backbone network here.
In the following, we present our method in four parts. First, we introduce the dynamic image generation method and the maximum information entropy slicing method, respectively. Then, we present the adaptive weighted multimodal integration method. Finally, we introduce the training and optimization.

A. Dynamic Image
In the non-medical field, a popular way to represent a series of images is to apply a temporal pooling operator to the features extracted from individual images, for instance temporal templates [25], ranking functions [26] and other traditional pooling operators [27]. We use the Z-dimension of the 3D MRI as the temporal dimension of a video to extract a fixed slice representation of each subject. Since the extracted fixed representation retains all the dynamic characteristics of the slices (i.e. the changes from slice to slice), we call it a dynamic image. We calculate the coefficient θ_t of slice I_t and denote the feature vector of this slice by V_t. This coefficient is multiplied with the average of all feature vectors from V_1 to V_t to obtain a new feature vector, and the new feature vectors are finally accumulated to obtain the dynamic image d. See Fig. 2 for an example. The calculation is as follows:

d = Σ_{t=1}^{T} θ_t ψ_t,   with   ψ_t = (1/t) Σ_{τ=1}^{t} V_τ

where T is the total number of slices and ψ_t is the average of the feature vectors of the first t slices.
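The accumulation above can be sketched in NumPy as follows. The paper does not state the exact form of the coefficient θ_t, so this sketch borrows θ_t = 2t − T − 1 from approximate rank pooling as an assumption:

```python
import numpy as np

def dynamic_image(slices):
    """Collapse a stack of 2D slices into one dynamic image.

    Each slice t contributes the running mean psi_t of slices 1..t,
    weighted by a rank-pooling-style coefficient theta_t; the weighted
    terms are summed.  theta_t = 2t - T - 1 is an assumed coefficient.
    """
    T = len(slices)
    d = np.zeros_like(slices[0], dtype=np.float64)
    running_sum = np.zeros_like(d)
    for t, s in enumerate(slices, start=1):
        running_sum += s
        psi_t = running_sum / t      # average of V_1 .. V_t
        theta_t = 2 * t - T - 1      # assumed rank-pooling coefficient
        d += theta_t * psi_t
    return d

# usage: a 110x110x110 volume collapsed along Z into a 110x110 image
vol = np.random.rand(110, 110, 110)
dyn = dynamic_image([vol[z] for z in range(vol.shape[0])])
```

With this choice of θ_t, early slices receive negative weights and late slices positive ones, so the dynamic image encodes the direction of change across the slice stack.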
The algorithm for processing dynamic images is shown in Algorithm 1.

Algorithm 1 Algorithm for Obtaining Dynamic Images

B. Maximum Information Entropy Slice

Typically, there are a large number of slices to choose from in a 3D MRI scan. One slice-selection method is to manually select slices based on the highest similarity of anatomical features, without knowing the clinical diagnosis information [28]. However, this approach must be carried out by professionals, which costs a lot of labor and is subjective. Instead, we use image entropy to extract the most informative slice to train the network, and therefore calculate the image entropy of each slice. Generally speaking, for a set of M symbols with probabilities P_1, P_2, ..., P_M, the entropy can be calculated as [29]:

H = - Σ_{i=1}^{M} P_i log2(P_i)

where H is the one-dimensional gray-scale entropy and P_i is the proportion of grayscale value i.
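The entropy-based slice selection can be sketched as below. The 256-bin histogram granularity is our assumption; the paper only specifies the gray-scale entropy formula:

```python
import numpy as np

def slice_entropy(img, bins=256):
    """One-dimensional gray-scale entropy H = -sum_i P_i * log2(P_i),
    where P_i is the proportion of pixels falling in gray-level bin i."""
    hist, _ = np.histogram(img, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]                      # 0 * log(0) is taken as 0
    return float(-np.sum(p * np.log2(p)))

def most_informative_slice(volume):
    """Return the index of the axial slice with maximum image entropy."""
    entropies = [slice_entropy(volume[z]) for z in range(volume.shape[0])]
    return int(np.argmax(entropies))
```

A constant slice has entropy 0, while a slice spreading its values evenly over all bins attains the maximum log2(bins), so the argmax picks the most information-rich slice.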

C. Adaptive Weighted Multimodal Integration
In this section, we introduce the composition of the classifiers and their adaptive weighted integration, respectively.
All classifiers share the same structure. Specifically, the input feature map first passes through dropout and then through three mapping layers, with a ReLU activation function and dropout inserted between successive mapping layers to reduce the risk of overfitting; the dropout rate is set to 0.5. Finally, a softmax activation function produces the class probability value. The output of each classifier is:

O_i = φ_i(F_i, C_i)

where F_i stands for the input feature map, C_i is the weight parameter of the i-th classifier, and φ_i(F_i, C_i) represents the function learned to transform the input F_i into the probability value O_i.
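A minimal NumPy sketch of one classifier φ_i follows. The structure (dropout, three mapping layers with ReLU and dropout between them, softmax at the end) is from the paper; the hidden sizes 512 → 128 → 32 → 2 are our own illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dropout(x, p=0.5, train=True):
    if not train:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1 - p)         # inverted dropout

def classifier_head(F, Ws, train=False):
    """O_i = phi_i(F_i, C_i): dropout, then three mapping layers with
    ReLU + dropout between them, then softmax over the classes."""
    x = dropout(F, train=train)
    for k, W in enumerate(Ws):
        x = x @ W
        if k < len(Ws) - 1:
            x = dropout(relu(x), train=train)
    return softmax(x)

# hypothetical sizes: ResNet18 feature (512) -> 128 -> 32 -> 2 classes
Ws = [rng.standard_normal((512, 128)) * 0.05,
      rng.standard_normal((128, 32)) * 0.05,
      rng.standard_normal((32, 2)) * 0.05]
probs = classifier_head(rng.standard_normal(512), Ws)
```

At inference (`train=False`) dropout is a no-op, matching standard inverted-dropout semantics.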
For the clinical branch,

O_4 = φ_4(X_cli, C_4)

where O_4 is the probability value of the 4-th classifier, X_cli is the clinical information, C_4 is the weight parameter of the 4-th classifier, and φ_4 denotes the operation of the 4-th classifier. The clinical characteristics, i.e. demographic, neuropsychological and apolipoprotein E (APOE4) genotyping data, were normalized: all followed the same feature scaling procedure, with the values of each independent clinical factor scaled to [0, 1].
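The feature scaling described above amounts to independent min-max normalization of each clinical factor, e.g.:

```python
import numpy as np

def minmax_scale(column):
    """Scale one clinical factor (age, test score, ...) to [0, 1]."""
    column = np.asarray(column, dtype=float)
    lo, hi = column.min(), column.max()
    if hi == lo:                      # constant column: map to 0
        return np.zeros_like(column)
    return (column - lo) / (hi - lo)

# e.g. ages between 55 and 90
ages = minmax_scale([55, 70, 90])
```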
We propose an ensemble learning method with adaptive weights to improve the performance of the model and the confidence of its predictions. The probability values of the multiple classifiers are combined by soft voting for the final output. Suppose we have M classifiers; the soft voting can be computed as:

O_t = Σ_{i=1}^{M} α_i O_i    (5)

where O_i is the probability value of the i-th classifier, α_i is the weight given to the i-th classifier, and O_t is the total output after soft voting integration. We first initialize the weights. To compute the hyperparameter α_i automatically, we use a simple but effective approach: α_i is set as a trainable parameter so that the importance of each branch task is coordinated automatically and adaptively. When multiple branch tasks are learned simultaneously, the "important" branches should be given high weights (i.e. large α_i) to increase the loss contribution of the corresponding branch. We use a small learning rate to update the network parameters and automatically learn the voting weights of the different classifiers.
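A sketch of the soft voting step follows. In the paper the α_i are trainable network parameters; here they are plain numbers, and normalizing them to sum to 1 is our own assumption for the sketch:

```python
import numpy as np

def soft_vote(outputs, alphas):
    """O_t = sum_i alpha_i * O_i over the classifiers' probability
    vectors.  The alphas are normalized to sum to 1 (our assumption)."""
    alphas = np.asarray(alphas, dtype=float)
    alphas = alphas / alphas.sum()
    return sum(a * np.asarray(o) for a, o in zip(alphas, outputs))

# four classifier outputs (class-probability vectors) and their weights
outs = [np.array([0.9, 0.1]), np.array([0.8, 0.2]),
        np.array([0.6, 0.4]), np.array([0.7, 0.3])]
O_t = soft_vote(outs, [1.0, 1.0, 1.0, 1.0])   # equal weights -> average
```

Raising one α_i shifts O_t toward that classifier's prediction, which is exactly what the adaptive weighting exploits during training.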

D. Training and Optimization
We use cross-entropy as the loss function, constructed for the output of each individual classifier:

L_i = -(1/N) Σ_{j=1}^{N} [ y_j log(O_i^j) + (1 - y_j) log(1 - O_i^j) ]    (6)

where the label y_j = 0 indicates that sample j is a negative sample, y_j = 1 indicates that sample j is a positive sample, N is the total number of samples in the dataset, and O_i^j denotes the output probability value of sample j from the i-th classifier. In the training phase, with the loss function constructed from the output of each classifier, i.e. Equation 6, we can optimize the network parameters of each branch.
For the backbone networks, the weights are frozen in the first stage and not optimized. In the second stage the weights are unfrozen: we first optimize backbone network 1 and backbone network 2 under the loss functions constructed from the outputs of classifier 1 and classifier 2, respectively, and then the backbone networks are further optimized by loss function L_3.
After obtaining the output of each classifier, we obtain the final output, Equation 5. The loss function of the integrated output is:

L_t = -(1/N) Σ_{j=1}^{N} [ y_j log(O_t^j) + (1 - y_j) log(1 - O_t^j) ]

where α_i is the weight of the output probability value of the i-th classifier and O_t^j is the output probability value of sample j after soft voting integration of the multiple classifiers. The hyperparameter α_i is a key part of the network; obtaining it by grid search would be time-consuming, so we use the adaptive weighting method to update the network weights. The loss combines these weights into an integrated objective that supervises network training through the back-propagation algorithm.
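The per-classifier and integrated losses share the same binary cross-entropy form, which can be sketched as a minimal NumPy function (not the training code itself):

```python
import numpy as np

def binary_cross_entropy(O, y):
    """L = -(1/N) sum_j [ y_j log(O_j) + (1 - y_j) log(1 - O_j) ],
    applied either to one classifier's outputs O_i or to the
    soft-voted outputs O_t.  Clipping avoids log(0)."""
    O = np.clip(np.asarray(O, dtype=float), 1e-12, 1 - 1e-12)
    y = np.asarray(y, dtype=float)
    return float(-np.mean(y * np.log(O) + (1 - y) * np.log(1 - O)))

# a positive sample predicted at 0.5 contributes exactly log(2)
loss = binary_cross_entropy([0.9, 0.2, 0.8], [1, 0, 1])
```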

IV. EXPERIMENTS AND RESULTS

A. Dataset
We use the publicly available dataset from the Alzheimer's Disease Neuroimaging Initiative (ADNI). Specifically, we trained the CNNs with the data from the "spatially normalized, masked, and N3-corrected T1 images" category; the brain MRI image size is 110 × 110 × 110. Since a subject may have multiple MRI scans in the database, we use the most recent scan of each subject to avoid data leakage. All the data we used are summarized in Table I. There are 132 men and 92 women, aged between 55 and 90.3 years. Friedman's ANOVA was used to test the difference in median age between groups, and Fisher's exact test was used to test the gender distribution across groups; these differences are not statistically significant (p > .05). The clinical data comprise demographic data (age, gender, education level), neuropsychological cognitive assessments such as the Clinical Dementia Rating Sum of Boxes (CDRSB), the Alzheimer's Disease Assessment Scale (ADAS11, ADAS13) and the Rey Auditory Verbal Learning Test (RAVLT), as well as APOE4 genotyping. All data used in this study are from baseline assessments.

B. Evaluation Metrics
The proposed AMIM method is mainly validated on AD classification (AD vs. NC) and MCI classification (MCI vs. NC). The performance was evaluated using three metrics: accuracy, the percentage of correctly predicted samples; F1, the harmonic mean of precision (Eq. 9) and recall (Eq. 10); and the Area Under Curve (AUC, from the receiver operating characteristic curve determined by the true positive rate and false positive rate). Precision and recall are defined as:

Precision = TP / (TP + FP)    (9)

Recall = TP / (TP + FN)    (10)

where TP, TN, FP and FN stand for true positives, true negatives, false positives and false negatives, respectively. A higher value indicates better performance.
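As a quick check of these definitions, the four metrics can be computed directly from the confusion-matrix counts (the counts below are made-up numbers, not results from the paper):

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision (Eq. 9), recall (Eq. 10) and F1 from the
    confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = classification_metrics(90, 85, 10, 15)
```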
In the following, experiments are conducted to evaluate the performance of the proposed method. Specifically, Section IV-D tests the impact of unimodal and multimodal data on the experimental performance metrics, and Section IV-E1 analyzes the classification performance of different methods on the same dataset.

C. Implementation Details
We use five-fold cross-validation for all experiments. Since the proportions of the data samples are unbalanced, a weighted cross-entropy loss function is used to balance the samples. We use the Adam optimizer with a learning rate of 1 × 10^-5, except that the learning rate for L_5 is adjusted to 5 × 10^-6. The classifiers perform linear mappings with a dropout of 0.5 to prevent overfitting and ReLU activations; the last layer uses the softmax function to output the class probability value.
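The class weighting for the unbalanced samples can be sketched as follows. The paper does not specify how the weights are derived, so the inverse-frequency scheme here is our assumption:

```python
import numpy as np

def class_weights(labels):
    """Inverse-frequency class weights for a weighted cross-entropy
    loss: the minority class receives the larger weight."""
    labels = np.asarray(labels)
    counts = np.bincount(labels)
    return len(labels) / (len(counts) * counts)

w = class_weights([0, 0, 0, 1])   # 3:1 imbalance
```

These weights would then scale each sample's loss term according to its class.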

D. Single Modality vs. Multiple Modalities
In this section, the effects of unimodal and multimodal data on the model are presented separately. For MRI, ResNet18, pre-trained and with the last classification layer removed, is used as the backbone network to extract features. The output of the clinical information is obtained through a multi-layer perceptron. We performed single-modal experiments on dynamic images, slices and clinical information. As shown in Table II, in the AD versus NC column the evaluation indices of clinical information are very high. We therefore made a t-SNE visualization of the data, shown in Fig. 3. The distribution of the AD and NC groups, shown on the left, has an obvious dividing line between the two classes. The distributions of the MCI and NC groups, shown on the right, have no obvious dividing line, and accordingly the MCI/NC results for clinical information in Table II are not very satisfactory. This remains consistent with our results. We then take the dynamic images and slices of the medical imaging as input to obtain the integrated medical-imaging results; specifically, the decision fusion of Output 1, Output 2 and Output 3 of the model in Fig. 1 is carried out. The specific experiment is introduced in the results section. The table shows that integrating slices with images capturing the 3D spatial-information changes improves on single-modal medical imaging. Finally, with the full proposed method, AD/NC accuracy reached 99.00% and MCI/NC accuracy reached 81.63%.

1) Comparison With baseline methods:
We compared several other methods on the same dataset. Korolev et al. [17] proposed a deep three-dimensional convolutional neural network structure for brain MRI scan classification. Spasov et al. [18] used structural MRI, demographic, neuropsychological and APOE4 gene data as inputs, with 3D separable convolutional layers as the backbone network for classification. Xing et al. [30] converted the 3D full image into a 2D dynamic image and then used a classical neural network with an attention mechanism as the network model. For these baseline methods, we keep the parameter settings of the original papers. The evaluation results are shown in Table III; our proposed method is the best on most indicators.
2) Comparison with state-of-the-art methods: In this section, we compare the classification performance of other widely used methods. Studies [31]-[37] and other MRI single-modal classification results are listed in Table IV. Multimodal studies [32]-[34], [36], [38]-[41], covering MRI + PET, MRI + PET + biomarkers, MRI + DTI and MRI + cognitive scores, are listed in Table V. For AD/NC classification, the single-modal methods in Table IV report accuracies below 90.00%, while most multimodal methods exceed 90.00%. For MCI/NC classification, most single-modal methods are below 80.00%, while most multimodal methods are above 80.00%. Among the listed studies, Zhu et al. [32], Liu et al. [33], Aderghal et al. [34] and Shao et al. [36] tested their methods in both single-modal and multimodal settings; the results show that multimodal data yield higher classification accuracy than single-modal data. In addition, our method achieves the best performance of 99.00% for AD/NC classification and 81.63% for MCI/NC classification with ResNet18 as the backbone network. It is worth noting that, owing to potential differences in data selection, preprocessing and even dataset division, the results obtained by different methods are not directly comparable; the comparison is intended only to provide an overview of other results and to show the baseline of existing methods.

3) Different backbone neural networks:
We also trained other backbone neural networks. To evaluate their classification performance, we used a five-fold cross-validation strategy: the entire subject sample set was divided equally into five subsets, the subjects in one subset were selected as test samples each time, and all remaining subjects in the other four subsets were used to train the classifiers. This process was repeated five times independently to avoid any bias introduced by randomly dividing the dataset. We take the average of three repeated experiments as the final result. The results are shown in Table VI and Table VII.

V. DISCUSSION

A. Analysis of Different Views

Pan et al. [42] proposed a Multi-view Separable Pyramid Network (MiSePyNet), in which representations are learned from the axial, coronal and sagittal views of PET scans (Fig. 4) so as to offer complementary information, which is then combined to make a joint decision. Their experimental results show that the axial view performs best and that multi-view fusion is better than any single view. Next, we discuss and analyze the classification performance of different views through experiments.
We performed experiments for the different views using the same parameter settings. The overall experimental results for AD versus NC and MCI versus NC are shown in Table VIII and Table IX; the classification performance of the three views is similar. In the model with only slices as input, the overall evaluation metrics of the coronal view are the best, and those of the other two views differ only slightly. In the model with only dynamic images as input, the AUC and F1 metrics of the axial view are the highest. In the full hybrid model of dynamic images and slices, all evaluation metrics of the axial view are the highest.
For the overall results obtained from these three different views, we find that the classification performance of each view does not differ greatly and that the axial view is relatively better. In this paper, we only conducted experiments for one view; in the future, we will analyze the three views of the image together. Different views show different information, and obtaining more comprehensive information plays an important role in AD diagnosis.

(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 14, No. 1, 2023

B. Data Pairing
The dataset we used is from the "spatially normalized, masked, and N3-corrected T1 images" category of the ADNI public dataset, which contains paired MRI and clinical information. For other ADNI categories, however, there are cases where subjects did not undergo the MMSE or ADAS11 examinations. For patients with clinical information, we effectively combined that information, which helped to improve the classification performance of the model. In the future, we will try to improve diagnostic performance on ADNI using only some of the subjects' basic information (gender, age, etc.).

VI. CONCLUSION
In this paper, we proposed a multimodal adaptive weighted model that, for the first time, takes global information images, maximum information slices and clinical information as multimodal inputs. Our model effectively solves the problem of missing global information in slice classification, and selecting slices by image information entropy removes the subjective uncertainty of manual selection. The adaptive weighting method combines the weights of the different models more accurately than the grid search method. Our model achieves the best classification performance compared with the latest methods. The combination of medical images and clinical information for Alzheimer's disease classification is the future trend; next, we will investigate how to better combine clinical information with medical images.