Spatial Feature Fusion for Biomedical Image Classification based on Ensemble Deep CNN and Transfer Learning

Biomedical imaging is a rapidly evolving field covering imaging techniques used for diagnostic and therapeutic purposes. It plays a vital role in diagnosing and treating health conditions of the human body. Classifying images from different modalities helps provide better care and treatment options to patients. Advances in technology open new doors for medical professionals, including deep learning methods for automatic image classification. The convolutional neural network (CNN) is a class of deep learning models applied to visual imagery. In this paper, a novel spatial feature fusion based deep CNN is proposed for classification of microscopic peripheral blood cell images. Multiple transfer learning features are extracted through four pre-trained CNN architectures, namely VGG19, ResNet50, MobileNetV2 and DenseNet169. These features are fused into a generalized feature space that increases the classification accuracy. The dataset used for the experiment contains 17902 microscopic images categorized into 8 distinct classes. The results show that the proposed CNN model with fusion of multiple transfer learning features outperforms the individual pre-trained CNN models. The proposed model achieved 96.10% accuracy, 96.55% F1-score, 96.40% precision and 96.70% recall.


I. INTRODUCTION
Biomedical imaging refers to capturing images of an organ or tissue for diagnostic purposes. The field is broad and rapidly evolving, and it covers imaging modalities such as ultrasound, magnetic resonance imaging (MRI), computerized tomography (CT) and positron emission tomography (PET) [1]. Biomedical and medical imaging play a significant role in diagnosing and treating health conditions of the human body. They help identify problematic health conditions in their early stages, which leads to better treatment for patients [2]. Structural and functional changes in the biological tissues of the human body commonly cause health problems. Biomedical imaging provides a way to view inside the human body and reveal such changes [3].
Classification of different imaging modalities plays a vital role in providing better care and treatment options to patients. The traditional way to classify images from these modalities is visual inspection by a medical professional or subject expert. This method is cumbersome and often time-consuming.
Advances in technology open new doors for medical professionals through computer-aided diagnosis (CAD) methods for automatic image classification [4]. With the increasing advancement of medical imaging technology, medical research and diagnosis have become easier. Medical imaging technology includes various imaging modalities and procedures that help in the diagnosis and treatment of patients. Hence, it plays a dominant role in deciding the actions taken for the benefit of the patient's health.
In the past few years, artificial intelligence (AI) has brought a new way of analysing and interpreting data, also called predictive analysis, that helps identify the early signs of health conditions [5]. Deep learning, a branch of AI, has emerged as a top performer in interpreting and analysing image data. Significant advances have been made in medical image diagnosis that improve the disease diagnosis process. Deep learning uses artificial neural network architectures that mimic the working of the human brain. Deep learning algorithms effectively solve complex computer vision tasks such as image recognition, classification and segmentation. A special class of deep learning algorithms, the convolutional neural network (CNN), is widely used to solve image classification problems and has achieved significant performance on benchmark datasets. The popularity of CNNs stems from the wide availability of datasets and the support of powerful Graphics Processing Units (GPUs), which has made the integration of deep learning methods with computer vision popular [6].
Biomedical images are generated by several distinct imaging modalities, and they differ in shape and type. Because of these diverse data distribution patterns, the same CNN model may perform differently on different datasets [7]. CNN models are sensitive to the particulars of the training data, so each time they are trained they may find a different set of weights. The resulting differences in predictions produce high variance [8]. Moreover, these deep features face the problem of small intra-class variance and large inter-class variance [9].
To address the above issues, a novel spatial feature fusion based approach for biomedical image classification using ensemble deep CNN and transfer learning is proposed. We use stacked generalization, an ensemble learning method, with four benchmark pre-trained CNN models as deep feature extractors. Features extracted from the different networks are merged using a spatial feature fusion method. Finally, two fully connected (FC) layers with ReLU are used along with one FC layer with a softmax activation function.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 13, No. 5, 2022, www.ijacsa.thesai.org
The major contributions of the paper are: (a) to propose an ensemble learning framework for creating a deep feature extractor that combines transfer learning features from more than one pre-trained CNN model; (b) to apply a spatial feature fusion technique that creates a generalized feature space from the extracted features; and (c) to propose a deep CNN model that can be used for biomedical image classification with increased prediction accuracy.

II. RELATED WORK
Health care is one of the fastest growing sectors, and it is delivered by health experts. It is sometimes difficult to recognize disease patterns in a huge number of medical images. Machine learning and deep learning algorithms are subfields of artificial intelligence (AI) that give impressive results for the classification of biomedical images. Image classification is one of the core computer vision tasks. Many researchers have worked on biomedical image classification, resulting in several robust methods that fall into two categories: traditional digital image processing techniques and deep learning models. Traditional image processing methods rely on manual feature extraction, whereas deep learning models perform feature extraction without manual intervention.
N. Sharma et al [10] used texture primitive features to perform segmentation along with classification of medical images. They have used artificial neural network (ANN) for designing the algorithm that performs segmentation and classification. The algorithm has been used for CT and MRI images of distinctive body parts like brain and liver. Ahmed M. Sayed [11] applied machine learning techniques for diagnosis of breast cancer from MRI images. The highest classification accuracy is 94/6% given by KNN. M. I. Daoud et al [12] proposed a fusion approach based on multiple ROI for classification of breast ultrasound image. GLCM texture features are extracted in each ROI that are further classified using binary SVM classifier. The dataset considered for the experiment contains 64 benign and 46 malignant tumor images. The proposed approach provides very promising results with 98.2%, 98.4%, and 97.8% values for accuracy, specificity, and sensitivity respectively. P. Chak et al [13] proposed an Artificial Neural Network (ANN) and SVM based approach to classify the kidney stone images. To extract the features from the CT images, GLCM method was used. The ANN approach gives 95% accuracy, whereas SVM approach gives 99% accuracy. P. Nanglia et al [14] proposed a hybrid algorithm for classification of lung cancer images. The dataset considered for an experiment contains 500 images and the overall accuracy of the approach is 98.08%.
Deep learning has been effectively applied in various domains, including satellite imaging, observation frameworks, mechanical and medical procedures, and precision agriculture. Several researchers have applied deep convolutional neural networks to different medical applications. H. T. Nguyen et al [15] improved disease prediction using a shallow CNN. They applied data visualization techniques to metagenomic data and achieved promising results. S. M. Anwar et al [16] proposed an approach based on deep transfer learning and LDA (Linear Discriminant Analysis) for classification of medical image modality. They used a pre-trained ResNet-50 model to implement transfer learning together with LDA. For the experiment, the benchmark ImageCLEF-2012 dataset was considered. The classification accuracy obtained is 87.91%, which is significantly better than the state-of-the-art approaches. C.-H. Chiang et al [17] applied a CNN for automatic classification of medical image modality. They considered multiple image modalities, including CT images of the abdomen and brain and MRI images of the brain and spine. The accuracy achieved on the validation and test sets is greater than 99.5%. Moreover, the F1-score for each category is greater than 99%.
B. P. Battula and D. Balaganesh [18] proposed a hybrid model for medical image classification based on a CNN and an encoder. HIS2828 and ISIC2017 are the datasets considered for the experiment. The results show that the accuracy of the proposed model is better than that of the existing models. A. A. Gomaa et al [19] discussed how CNNs and GANs are used to improve early prediction of plant disease. They conducted an experiment on images of tomato plants infected with Tomato Mosaic Virus. The proposed CNN provides 97% accuracy. S. Patel [20] classified bacterial colony images using Atrous convolution with a transfer learning approach. The dataset considered for the experiment contains 660 bacterial colony images classified into 33 distinct classes. The proposed model replaces the standard convolution layers of the pre-trained VGG-16 model with Atrous convolution. The training and validation accuracy obtained from the experiment are 95.06% and 93.38% respectively.

III. PROPOSED METHOD
In this research, we have proposed a novel spatial feature fusion based Deep CNN model for biomedical image classification. We have developed a feature fusion network using Ensemble deep CNN that leverages the power of state-of-the-art pre-trained CNNs. Ensemble learning is the technique that combines different individual CNN models in order to increase prediction accuracy through generalization [21]. As shown in Fig. 1, the proposed model has three stages, (a) deep feature extraction and feature maps (b) spatial feature fusion of deep transfer learning features and (c) classification.

A. Deep Feature Extraction and Feature Maps
Feature extraction and feature map generation are crucial steps for classifying images with deep learning networks. A deep learning network normally requires a large amount of computing resources and data to be trained for precise feature extraction. The common practice is to use a pre-trained model instead of building and training a CNN model from scratch. Several pre-trained CNN models already exist that were developed to solve similar problems [22]. The widely used pre-trained CNN models have emerged from the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [23]. The challenge was organized yearly from 2010 to 2017 and provided a platform for researchers to present their algorithms for object localization and image classification tasks. The pre-trained models from ILSVRC are trained on the ImageNet dataset, which consists of about 14 million images categorized into several classes. The pre-trained models rely on the concept of transfer learning, i.e. knowledge gained by solving one problem can be applied to a related problem by using an already trained model to learn a different set of features. It reduces the need for large amounts of data while offering significantly better performance and reduced convergence time [24]. Commonly used pre-trained models include VGG16, VGG19, MobileNet, InceptionNet, ResNet and DenseNet.
We have followed the stacked generalization ensemble approach [25] to combine four pre-trained CNN models. As a preliminary step, four benchmark CNN models, i.e. VGG19, ResNet50, MobileNetV2 and DenseNet169, are taken as base models for the proposed model. We have reused these pre-trained networks with the parameters they learned on the ImageNet dataset. We have removed their final softmax classification layer, as it contains 1000 neurons for the ImageNet classes. We have frozen all the intermediate layers of the base models to keep their original trained weights. After that, global average pooling is applied for dimensionality reduction, and an output feature vector is created for each base model.
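The global average pooling step described above can be illustrated with a minimal pure-Python sketch. It assumes a feature map stored as nested lists of shape H x W x D; the function name `global_average_pool` and the toy values are illustrative only and are not taken from the authors' implementation. For reference, a standard ResNet50 backbone yields a 7 x 7 x 2048 map for a 224 x 224 input, which pooling reduces to a 2048-element vector.

```python
def global_average_pool(feature_map):
    """Average an H x W x D feature map over its spatial axes.

    Returns a list of length D (one mean value per channel).
    """
    height = len(feature_map)
    width = len(feature_map[0])
    depth = len(feature_map[0][0])
    totals = [0.0] * depth
    for row in feature_map:
        for cell in row:
            for d, value in enumerate(cell):
                totals[d] += value
    return [t / (height * width) for t in totals]


# Toy 2 x 2 feature map with 3 channels (hypothetical values).
toy_map = [
    [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]],
    [[5.0, 4.0, 2.0], [7.0, 0.0, 2.0]],
]
pooled = global_average_pool(toy_map)
print(pooled)  # one value per channel
```

Each base model contributes one such vector, whose length equals the channel count of its last convolutional block.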

B. Spatial Feature Fusion of Deep Transfer Learning Features
Fusing the features obtained from various pre-trained CNN models has helped many applications achieve better performance than the conventional approach of using a single deep CNN for classification [26]. The classification approach presented in this paper maps the feature spaces obtained by pre-trained CNN models such as VGG16, ResNet50, MobileNetV2 and DenseNet169 to a generalized feature space [26]. The feature space generated by each feature extractor with different dimension is represented as: Where i=1, 2, … N represents the number of images in the dataset and m=1, 2, …j represents the number of pre-trained CNN models used as feature extractors. The cumulative feature set for a particular image sample is defined as: Where, is representation of ith dimension feature in the given kth sample and represents the individual feature dimension. Thus, the features obtained by a particular CNN model form an individual feature space [27]. This individual feature space is defined as: Where is the feature set of individual sample and n represents the total number of samples in the dataset.
The feature fusion technique combines heterogeneous features obtained from various CNN models and uses the combined features for cumulative decision-making. It combines the feature spaces obtained from the individual CNNs and provides a feature subspace that is more general than the original feature spaces. The feature fusion technique to create a generalized feature space is represented as: Where, is feature fusion function and , , and represents the feature spaces obtained by individual CNN model. Data imbalance and noise are the major limitations of the features obtained using individual CNNs. However, obtaining a generalized feature vector by combining different feature spaces helps in selecting important features captured by different CNN models, which leads to more accurate classification.
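As a rough illustration of combining per-model feature vectors into one generalized vector, the sketch below fuses four pooled vectors by concatenation. Concatenation is an assumption made for this example, since the text does not state which fusion operator is used. The vector lengths are the final convolutional channel counts of the standard architectures (512, 2048, 1280 and 1664), and `fuse_features` is a hypothetical helper name.

```python
def fuse_features(*feature_vectors):
    """Fuse several 1-D feature vectors by concatenation (assumed operator)."""
    fused = []
    for vec in feature_vectors:
        fused.extend(vec)
    return fused


# Pooled-vector lengths per backbone (standard final channel counts).
dims = {"VGG19": 512, "ResNet50": 2048, "MobileNetV2": 1280, "DenseNet169": 1664}

# Placeholder vectors standing in for real pooled features of one image.
vectors = [[0.0] * d for d in dims.values()]

fused = fuse_features(*vectors)
fused_dim = len(fused)
print(fused_dim)
```

Under this assumption the fused vector keeps every feature from every backbone, leaving it to the fully connected layers to weight them.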

C. Spatial Feature Fusion
In this research, the spatial feature fusion algorithm fuses the feature maps obtained by four different deep pre-trained CNN models [28]. Hence, the four pre-trained CNN models are connected with each other using feature fusion techniques, and the point of connection between the four models is known as the fusion point. Training the softmax classifier is the next step after the fusion point in order to obtain the classification result, as represented in Fig. 1. The spatial feature fusion function is represented as: (5) Where, , , and represent the sets of features extracted by feature extractors P, Q, R and S respectively. Here, denotes the fusion space features' set and , z where L states the length, W states the width and D states the channels of the feature set, respectively [28].
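One way to write the fusion function described above is sketched below. Only the extractor labels P, Q, R, S and the dimensions L, W, D come from the text; the symbols f, x and z, and the primed output dimensions, are assumptions introduced here because the original notation is not available.

```latex
% Assumed notation: x^{P}, x^{Q}, x^{R}, x^{S} are the feature sets from
% extractors P, Q, R, S; z is the fused feature set; f is the fusion function.
f : \left( x^{P}, x^{Q}, x^{R}, x^{S} \right) \mapsto z,
\qquad
x^{P}, x^{Q}, x^{R}, x^{S} \in \mathbb{R}^{L \times W \times D},
\quad
z \in \mathbb{R}^{L' \times W' \times D'}
```

The output dimensions are primed because they depend on the fusion operator; for instance, concatenation along the channel axis would give D' larger than D.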

D. Classification
After feature extraction, it is required to classify the data into various classes. This can be achieved using fully connected layers. In the proposed model, the third stage is performing classification. We have obtained a final feature vector after performing spatial feature fusion. For classification, two FC layers with ReLU and one FC layer with softmax are added to the proposed network as depicted in Fig. 1. A Rectified Linear Unit (ReLU) is used as an activation function for the first two fully connected layers. As the network classifies the input image into eight distinct classes, the last fully connected layer is applied with the softmax activation function.
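The classification stage (two fully connected layers with ReLU followed by a softmax layer over eight classes) can be sketched as a plain forward pass. This is a toy illustration, not the authors' network: the layer widths (16, 12, 10), the random weights and the helper names are all assumptions, and only the eight-way softmax output reflects the text.

```python
import math
import random


def dense(inputs, weights, biases):
    """Fully connected layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def random_layer(n_in, n_out, rng):
    """Hypothetical untrained layer: small random weights, zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
fused = [rng.random() for _ in range(16)]   # toy stand-in for the fused vector
w1, b1 = random_layer(16, 12, rng)           # FC layer 1 (ReLU)
w2, b2 = random_layer(12, 10, rng)           # FC layer 2 (ReLU)
w3, b3 = random_layer(10, 8, rng)            # output layer, 8 blood cell classes

hidden1 = relu(dense(fused, w1, b1))
hidden2 = relu(dense(hidden1, w2, b2))
probs = softmax(dense(hidden2, w3, b3))
predicted_class = probs.index(max(probs))
```

The softmax output is a probability distribution over the eight classes, and the predicted class is the index with the highest probability.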

IV. EXPERIMENTAL RESULTS AND DISCUSSION
The dataset, experimental setup and results are discussed in this section. We have implemented the proposed ensemble CNN model using spatial feature fusion, as well as the individual models VGG16, ResNet50, MobileNetV2 and DenseNet169.

A. Dataset
The experiment was performed on a dataset of microscopic peripheral blood cell images [29]. The dataset contains 17902 microscopic JPG images with a size of 360 × 363 pixels. The images were captured in the Core Laboratory at the Hospital Clinic of Barcelona using the CellaVision DM96 analyzer. All images were annotated by expert clinical pathologists. The dataset contains high-quality labelled images that can be used to train machine learning and deep learning algorithms. The dataset is categorized into eight distinct groups: platelet, neutrophil, monocyte, lymphocyte, immature granulocytes, erythroblast, eosinophil and basophil. The dataset is specifically used for hematological diagnosis using computational tools. All the images of the dataset were acquired from 2015 to 2019 [30]. Fig. 2 shows sample images from the platelet, neutrophil and monocyte classes.
The proportion of images in each class is represented in Fig. 3. As mentioned earlier, there are 17902 microscopic images categorized into 8 distinct classes.
From the dataset, 80% data are used for training and 20% data are used for validation.
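A minimal sketch of an 80/20 split over the 17902 images is shown below. The shuffling, the fixed seed and the helper name `train_val_split` are assumptions for illustration; the text does not say whether the split was random, stratified by class, or seeded.

```python
import random


def train_val_split(items, train_fraction=0.8, seed=42):
    """Shuffle a copy of the items and cut it into training and validation parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


image_ids = range(17902)  # stand-in identifiers for the dataset images
train_ids, val_ids = train_val_split(image_ids)
print(len(train_ids), len(val_ids))
```

With these settings the split yields 14321 training and 3581 validation images; a stratified split would additionally preserve the class proportions shown in Fig. 3.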

B. Environment Setup
The experiment was carried out on a workstation configured with an Intel® Core™ i7 8th generation processor, 32GB RAM and an NVIDIA Titan XP GPU, running the 64-bit Windows 10 operating system. The TensorFlow and Keras [31] APIs are used to build the model for classifying microscopic peripheral blood cell images. TensorFlow is an open-source library that provides both high-level and low-level APIs. Keras is a high-level API built upon the TensorFlow library. It is a user-friendly and extensible library that makes building deep neural networks fast. Keras offers building blocks such as activation functions, layers, optimizers and batch normalization that are required to build a convolutional neural network. It also provides the functionality to run a CNN on a graphics processing unit (GPU) or tensor processing unit (TPU).

C. Network Training and Hyperparameter Tuning
Training is a prerequisite for deploying any CNN model. A good practice is to use a pre-trained model instead of building and training a CNN model from scratch. Transfer learning is about re-using a pre-trained CNN model to solve a new problem [32]. Benchmark pre-trained CNN models are trained on large datasets such as ImageNet, PASCAL and COCO. In this research, the weights learned by the pre-trained models are transferred into the proposed CNN model. In addition, fine-tuning is performed by unfreezing a few of the top layers and retraining them for the classification task [33]. The dataset considered in the experiment is comparatively small, as it contains 17902 images categorized into 8 distinct classes. Transfer learning and fine-tuning are performed by removing the last layer and introducing a fully connected softmax layer for classification.
The basic building block of a CNN model is the artificial neural network, so CNN models possess self-learning capability. During the training process, the weights of the network are adjusted in order to minimize the loss function [34]. A set of hyperparameters controls the entire training process [35] [36]. In this experiment, model-specific hyperparameters are used. Table I summarizes the hyperparameters along with their optimized values.

D. Evaluation Metrics
The benchmark evaluation metrics Precision, Recall and F1-score are used to evaluate the performance of the proposed model. The metrics are defined as follows.

$$\text{Precision} = \frac{TP}{TP + FP} \quad (6)$$

$$\text{Recall} = \frac{TP}{TP + FN} \quad (7)$$

$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \quad (8)$$

Here, TP and FP are the true positives and false positives, respectively. TN and FN are the true negatives and false negatives, respectively. To get the values of TP, TN, FP and FN, an N × N confusion matrix is used, where N represents the number of classes in the dataset [38]. The confusion matrix compares the actual values with the values predicted by the CNN model, and the values of TP, TN, FP and FN are read from it [37].
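To make the link between the confusion matrix and the metrics concrete, the following sketch derives per-class precision, recall and F1-score from a small matrix. The 3 × 3 matrix is made-up toy data, not a result from the paper, and the convention that rows are true classes and columns are predicted classes is an assumption of this example.

```python
def per_class_metrics(confusion):
    """Return (precision, recall, f1) per class.

    confusion[i][j] = number of samples of true class i predicted as class j.
    """
    n = len(confusion)
    results = []
    for c in range(n):
        tp = confusion[c][c]
        fp = sum(confusion[r][c] for r in range(n)) - tp  # column total minus TP
        fn = sum(confusion[c]) - tp                       # row total minus TP
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        results.append((precision, recall, f1))
    return results


# Toy confusion matrix for three classes (hypothetical counts).
toy_confusion = [
    [8, 2, 0],
    [1, 9, 0],
    [1, 0, 9],
]
metrics = per_class_metrics(toy_confusion)
macro_f1 = sum(f1 for _, _, f1 in metrics) / len(metrics)
```

Averaging the per-class values without weighting (macro averaging) is one common way to report a single figure for a multi-class problem; the text does not state which averaging was used.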

V. RESULT AND DISCUSSION
In this section, the performance of the proposed ensemble CNN model is discussed along with the performance of each individual CNN model, i.e. VGG16, ResNet50, MobileNetV2 and DenseNet169. To evaluate the performance of the models, two metrics, namely accuracy and loss, are considered. The average accuracy and average loss for training and validation are measured. Accuracy measures the proportion of predictions that are correct and is defined as:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (9)$$

where TP, TN, FP and FN are the true positives, true negatives, false positives and false negatives, respectively [38].
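For a multi-class problem, overall accuracy can be read directly from the confusion matrix as the sum of the diagonal divided by the total number of samples. The short sketch below shows this on toy data; the matrix and the function name are illustrative assumptions.

```python
def accuracy(confusion):
    """Overall accuracy: correctly classified samples divided by all samples."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Toy confusion matrix (rows: true class, columns: predicted class).
toy_confusion = [
    [8, 2, 0],
    [1, 9, 0],
    [1, 0, 9],
]
acc = accuracy(toy_confusion)
print(round(acc, 4))
```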
Loss is defined as the error made by a model in its predictions. It is calculated from the difference between the value predicted by the CNN model and the actual value present in the dataset. While training the CNN model, the aim is to decrease the loss by optimizing the weights. Two functions are commonly used to measure the loss, namely mean square error and cross-entropy. Cross-entropy is used to measure the loss here, and it is defined by the following equation [39].

$$\text{Loss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log\left(\hat{y}_{i,c}\right) \quad (10)$$

where N is the number of samples, C is the number of classes, $y_{i,c}$ is 1 if sample i belongs to class c and 0 otherwise, and $\hat{y}_{i,c}$ is the predicted probability that sample i belongs to class c.

The proposed CNN model also converges adequately, as shown in Fig. 4. The figure presents the accuracy and loss curves obtained during training and validation. Fig. 4 shows that the training and validation accuracy curves run parallel and close to each other. Moreover, the loss curves also run parallel and close to each other. This shows that the model is trained adequately without underfitting or overfitting. From the results, it can be concluded that the spatial feature fusion approach for classification of microscopic blood cell images provides better results with enhanced learning ability. The generalized transfer learning feature space developed by the deep feature extractor reduces the problem of bias and variance while increasing performance. Hence, the proposed CNN model performs better than the individual pre-trained CNN models.
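The cross-entropy loss used in this section can be computed by hand for one-hot labels, as in the sketch below. Averaging over the samples is an assumption of this example, as are the toy labels, the predictions and the clipping constant `eps`.

```python
import math


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean categorical cross-entropy over samples.

    y_true: one-hot label rows; y_pred: predicted probability rows.
    Probabilities are clipped at eps to avoid log(0).
    """
    total = 0.0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            total -= t * math.log(max(p, eps))
    return total / len(y_true)


# Two toy samples over three classes (hypothetical values).
y_true = [[1, 0, 0], [0, 1, 0]]
y_pred = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
loss = categorical_cross_entropy(y_true, y_pred)
```

Because the labels are one-hot, only the predicted probability of the correct class contributes for each sample, so the loss falls toward zero as those probabilities approach 1.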

VI. CONCLUSION
Deep learning models provide promising results in biomedical image analysis. They bring a new way of analysing and interpreting data that helps identify the early signs of health conditions. A spatial feature fusion approach for biomedical image classification based on ensemble deep CNN and transfer learning is proposed in this paper. A generalized transfer learning feature space is developed, and spatial fusion is applied to merge the learned features of different pre-trained CNNs. The paper covers the implementation and evaluation of the proposed CNN model for classifying microscopic peripheral blood cell images. The dataset contains 17902 blood cell images categorized into 8 class labels. The experiment shows that the proposed ensemble CNN model outperforms the individual pre-trained CNN models and provides better precision, recall and F1-score values.