Prediction of Diabetic Retinopathy using Convolutional Neural Networks

Abstract—Diabetic retinopathy (DR) is among the most dangerous diabetic complications and can lead to lifelong blindness if left untreated. One of the essential difficulties in DR is early discovery, which is crucial for therapy progress. Accurate diagnosis of the DR stage is famously complicated and demands skilled expert analysis of fundus images. This paper detects DR and classifies its stage from retina images by applying convolutional neural networks and transfer learning models. Three deep learning models were investigated: a CNN trained from scratch and the pre-trained InceptionV3 and EfficientNetB5. Experimental results show that the proposed CNN model outperformed the pre-trained models, with relative improvements in F1-score of 9% and 25% over the pre-trained InceptionV3 and EfficientNetB5, respectively.


I. INTRODUCTION
Diabetic retinopathy (DR) is one of the diseases associated with diabetes and causes blindness in 4.4 million Americans over age 40 [1]. DR is an eye condition that develops quickly in patients with type 1 or type 2 diabetes mellitus [2]. DR often has no obvious symptoms in the early stages, but it becomes more pronounced as the disease progresses to more severe stages. An experienced ophthalmologist schedules an examination plan, which may run from weeks to months, to determine each diabetic patient's stage based on the retina's lesions and their severity. Essentially, DR affects the blood vessels of the light-sensitive tissue (i.e., the retina) [3]. DR can be either nonproliferative or proliferative. In nonproliferative DR (NPDR), no abnormal blood vessel growth is found in the retina; still, small outpouchings exist, as the walls of the retinal capillaries are weakened by high blood glucose. These outpouchings are known as microaneurysms. NPDR can be mild, moderate, or severe based on the number of microaneurysms found and the distortion of the blood vessels in the retinal exam. As the disease progresses, blood vessels may grow abnormally, covering the retina; hence, DR becomes proliferative (PDR), leading to severe visual consequences.
In preventing blindness caused by DR, detection, diagnosis, and treatment in the earlier stages control the disease and reduce vision loss. Diagnosis of DR is complicated and requires considerable expertise [4]. One well-known obstacle is that even diabetic macular oedema may present no early warning signs; therefore, it is highly desirable to detect DR on time. Currently, DR diagnosis requires a well-trained doctor to manually evaluate digital fundus images of the retina. DR is recognised by identifying lesions connected with the vascular malformations resulting from diabetes. This process may require considerable time and effort depending on the experience and efficiency of the examining doctor.
With recent advancements in intelligent solutions, deep learning and transfer learning techniques have shown significant success in object recognition and detection tasks. This research aims to automate DR diagnosis by exploiting convolutional neural networks (CNN) and transfer learning to identify DR from retina images. The Asia Pacific Tele-Ophthalmology Society (APTOS) dataset was used for blindness diagnosis and detection in this research. In addition, different models are compared and evaluated to determine which detects the disease most effectively. This intelligent solution would help the health community diagnose the disease more efficiently, using less time and fewer resources.
The remainder of this paper is organised as follows. Section II reviews related studies on the topic. Section III presents the research models used in this study, followed by a description of the dataset used in diagnosing DR in Section IV. Section V lays out the evaluation metrics and experimental design. Section VI presents the results of the experiments, and finally, the study is concluded in Section VII.

II. RELATED WORK
Early DR detection is critical yet time-consuming, placing a heavy burden on ophthalmologists. This has attracted many researchers to develop early DR detectors and classifiers. Here, an overview of the deep learning techniques used in the previous literature is presented, and the DR datasets used in those studies are summarised. All of the reviewed literature detected DR from retinal fundus images; if detected, DR was classified into one of four severity levels: mild, moderate, or severe NPDR, or PDR.
In the deep learning approach, a CNN extracts features from input images and feeds them to the deeper layers of the model. Shan et al. [19] distinguished microaneurysms from fundus images using a stacked sparse autoencoder (SSAE). Their model reached an F1-score of 91.3% and an AUC of 96.2%. Singh et al. [20] employed a densely connected neural network architecture to detect DR severity efficiently; experimental findings showed that DR severity could be successfully identified by the model with an accuracy of 83.6%. Some researchers fine-tuned pre-trained models, a practice known as transfer learning (TL), instead of training their models from scratch. These pre-trained models were initially trained on a large amount of out-of-domain data for object recognition and detection; then, only the output layer is replaced according to the given task and number of classes. Table I lists some of these studies along with the pre-trained models used and their performance. Whenever a study investigated more than one pre-trained model, an ensemble was applied to combine the models and produce an optimal model. Traditionally, the output layer of a pre-trained model is replaced by a multi-layer neural network classifier and a softmax layer with a size equivalent to the number of classes to be recognised. Nevertheless, Taufiqurrahman et al. [11] suggested restructuring the MobileNetV2 model by replacing the fully connected layer with a Support Vector Machine (SVM) classifier. This modified version, MobileNetV2-SVM, obtained better performance than the original model, achieving an accuracy of 85% and an AUC of 92.8%. In a similar fashion, Khojasteh et al. [12] replaced the softmax layer with several classifiers: OPF, SVM, and KNN. Combining ResNet-50 and SVM outperformed the other models with an accuracy of 98.2%, a sensitivity of 99%, and a specificity of 96%.
Several models can be combined using ensemble learning to improve prediction performance or reduce bias in the learning process. Jiang et al. [17] introduced an image-based method to detect DR early using an interpretable ensemble deep learning model. The proposed model works in three main steps: first, the fundus images are preprocessed; second, three different deep learning models (InceptionV3, ResNet152, and Inception-ResNet-V2) are trained independently and sufficiently; finally, the AdaBoost algorithm combines the models' results to generate the final score. The integrated model achieved high performance in all evaluation metrics used: a sensitivity of 85.57%, a specificity of 90.85%, an accuracy of 88.21%, and an AUC of 0.946. Also, Tymchenko et al. [18] developed a DR detector using a three-head CNN that jointly trained classification, regression, and ordinal models. They used the output of these three heads for DR detection and achieved a sensitivity and specificity of 0.99.
Following the research in [17], [8], [6], [11], [7], this study uses a CNN, InceptionV3, and EfficientNetB5 to detect DR, given their efficiency in previous studies. However, these models are validated using the same dataset so that their results can be compared. In addition, this study handles the imbalanced class distribution of the APTOS 2019 dataset, an issue not highlighted in previous research.

III. METHODOLOGY
The purpose of this research is to classify whether a retinal fundus image shows DR and, if so, at which severity level. According to previous literature, deep learning and transfer learning models can solve this task. Transfer learning is a method that applies knowledge gained from other tasks to tackle new, similar problems quickly and effectively. Hence, pre-trained CNN models will be fine-tuned using an in-domain dataset. Two pre-trained models were selected for this study, InceptionV3 and EfficientNetB5, for their effectiveness in diagnosing DR in the work of [17], [8], [6], [11], [7]. The performance of the fine-tuned pre-trained models will be compared with that of a CNN trained without pre-training.
A common issue in medical imaging datasets is the disparity in the number of samples within classes due to the difficulty of obtaining such samples. This problem is known as the class imbalance, and it pushes classifiers to prefer classes with higher training samples, reducing classification performance.
This section describes the techniques mentioned above.

A. Convolutional Neural Network (CNN)
Deep neural networks are artificial neural networks with more hidden layers, allowing them to perform more complicated tasks and deal with massive amounts of data. The convolutional neural network (CNN) is a deep learning network with multiple layer types, such as convolution, pooling, fully connected, and non-linearity layers. CNNs have been used in many applications, especially those dealing with spatial information, such as document analysis, image and video recognition, and computer vision [21]. The main aim of a CNN is to transform the image dimensions into a more manageable form and extract the significant features, then process them to provide better predictions.
In this study, three convolution layers were employed, each with the same (3,3) kernel size. ReLU is used as the activation function in all layers, with each convolution followed by a max-pooling layer with a (2,2) pooling size to reduce the size of the large images. The output was flattened before the fully connected layer, with a dropout of 0.2 to avoid overfitting, and a softmax activation layer was used as the output layer. The architecture of the model is shown in Fig. 1. Some of the model's configurations were based on the work of [22], [23], [24].
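A minimal Keras sketch of the architecture described above. The filter counts (32, 64, 128) and the 224x224x3 input size are assumptions for illustration; the paper specifies only the (3,3) kernels, ReLU activations, (2,2) max-pooling, 0.2 dropout, and a five-class softmax output.

```python
# Sketch of the three-convolution-layer CNN; filter counts are assumed,
# not taken from the paper.
from tensorflow.keras import layers, models

def build_cnn(input_shape=(224, 224, 3), num_classes=5):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dropout(0.2),  # regularisation to avoid overfitting
        layers.Dense(num_classes, activation="softmax"),
    ])
    return model

model = build_cnn()
```

The model would then be compiled and trained like any Keras classifier; the dropout layer sits between the flattened features and the softmax head, as in Fig. 1.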

B. Pre-Trained CNN Models
It has become customary to utilise a pre-trained CNN model and fine-tune it with an in-domain dataset for the majority of computer vision applications. A pre-trained CNN model is one that has been trained on a large volume of data, such as ImageNet, for image classification [25]. Two pre-trained CNN models are investigated in this paper: InceptionV3 and EfficientNetB5.
Inception-v3 is a CNN architecture from the Inception family that contains 48 layers. Inception is characterised by implementing multiple kernels of different sizes in each layer (becoming wider) instead of increasing the number of layers and going deeper in the network [26]. Each unit consists of four parallel operations: 1×1, 3×3, and 5×5 convolution layers and 3×3 max-pooling. All feature maps coming from the different paths are concatenated together as the input of the next layer. This design reflects the fact that, in image classification, feature sizes can vary, making a fixed kernel size difficult to choose. Larger kernels are effective when features are distributed over a wide area of the image; conversely, smaller kernels are useful and give excellent results in detecting small areas distributed across the image frame. To effectively recognise features of variable size, kernels of different sizes are needed, which the Inception models provide [27], [28].
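The four parallel paths described above can be sketched as a minimal Inception-style unit in the Keras functional API. The filter counts are illustrative assumptions, not those of InceptionV3 itself.

```python
# Minimal Inception-style unit: 1x1, 3x3, 5x5 convolutions and 3x3
# max-pooling run in parallel and their feature maps are concatenated.
from tensorflow.keras import layers

def inception_unit(x, f1=16, f3=16, f5=16):
    p1 = layers.Conv2D(f1, (1, 1), padding="same", activation="relu")(x)
    p3 = layers.Conv2D(f3, (3, 3), padding="same", activation="relu")(x)
    p5 = layers.Conv2D(f5, (5, 5), padding="same", activation="relu")(x)
    pp = layers.MaxPooling2D((3, 3), strides=1, padding="same")(x)
    # Concatenate along the channel axis: kernels of several sizes
    # contribute features for both small and large structures.
    return layers.Concatenate()([p1, p3, p5, pp])

inp = layers.Input(shape=(32, 32, 8))
out = inception_unit(inp)
```

With the illustrative filter counts, the output carries 16 + 16 + 16 + 8 = 56 channels, the widened representation that the next layer consumes.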
The EfficientNets family achieves state-of-the-art performance on ImageNet, CIFAR-100, Flowers, and three other transfer learning datasets [29]. The EfficientNets architecture scales the network's width, depth, and resolution uniformly with a constant ratio to optimise accuracy while limiting model size. Accordingly, the EfficientNets family produces seven models with different image dimensions, with no change to the layer operators of the baseline network. This research applies the EfficientNetB5 version.
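The uniform scaling with a constant ratio mentioned above is the compound scaling rule of Tan and Le's EfficientNet paper: depth, width, and resolution grow together by a single factor phi. The base coefficients below are those reported for EfficientNet-B0; the mapping of phi=5 to the B5 variant is a rough illustration, not an exact correspondence.

```python
# Compound scaling (Tan & Le, 2019): alpha, beta, gamma scale depth,
# width and resolution; the constraint alpha * beta^2 * gamma^2 ~= 2
# means each increment of phi roughly doubles the FLOPS.
alpha, beta, gamma = 1.2, 1.1, 1.15

def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for a given phi."""
    return alpha ** phi, beta ** phi, gamma ** phi

d, w, r = compound_scale(phi=5)  # roughly the B5 regime
```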

C. Data Augmentation
Many approaches have been proposed to overcome the imbalanced dataset problem, and they can be classified into two categories: algorithms that resample the data and data preprocessing that generates new samples [30]. Resampling a dataset is a method used to balance its class distribution, achieved by either adding samples to the minority class (oversampling) or removing samples from the majority class (undersampling) [31]. Data augmentation, in contrast, is a common technique used to generate new samples that present the image in a different representation.
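The oversampling strategy described above can be sketched in a few lines of NumPy: minority-class samples are duplicated with replacement until every class matches the majority count. The five labels mirror the DR severity classes; the counts are illustrative.

```python
# Random oversampling: duplicate minority-class indices until every
# class reaches the majority-class count.
import numpy as np

def oversample(labels, seed=0):
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    counts = np.bincount(labels)
    target = counts.max()
    idx = []
    for c in range(len(counts)):
        members = np.flatnonzero(labels == c)
        idx.extend(members)                      # keep all originals
        extra = target - len(members)
        if extra > 0:                            # pad minority classes
            idx.extend(rng.choice(members, size=extra, replace=True))
    return np.array(idx)

labels = np.array([0] * 10 + [1] * 3 + [2] * 5)
balanced = labels[oversample(labels)]
```

Undersampling would instead drop majority-class indices down to the minority count; this study uses augmentation rather than plain duplication, as described next.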
Data augmentation techniques help improve a deep learning model's ability to generalise by generating new artificial images, increasing variation in the training dataset and avoiding overfitting. Many transform operations can be applied for data augmentation, such as random rotation, brightness, and zoom, as well as image preprocessing techniques such as Gaussian blur or CLAHE [32]. The data augmentation techniques included in this research are horizontal and vertical flip, zoom, and rotation. Fig. 2 shows examples of these operations.

IV. DATASET

The APTOS2019 dataset [33] was used for training, validation, and testing the deep learning models. The Asia Pacific Tele-Ophthalmology Society (APTOS) published this dataset in the second quarter of 2019. As shown in Table I, several studies used the APTOS2019 dataset for blindness detection; it contains a large set of retina images taken using fundus photography. Initially, two sets were published: labelled images, known as the train set, and unlabelled images, known as the test set. Only the labelled images were included in this study, consisting of 3662 fundus images. Each image was labelled with one of five classes representing the severity of DR. Table II shows samples of each class and the characteristics that differentiate it from the others. As with many medical datasets, APTOS2019 suffers from class imbalance, as shown in Fig. 3, with the majority of cases being healthy images without DR. However, there is a balance between healthy images and the sum of all DR images regardless of severity.
Image size is another critical factor that impacts the classification task. As shown in Fig. 4, the distributions of image height and width differ, indicating that not all images are perfectly square.

V. EXPERIMENTAL SETUP

A. Experimental Design
All experiments were implemented and evaluated using Python [34], leveraging the TensorFlow and Keras libraries [35] on Kaggle GPUs; Kaggle provides free access to NVidia K80 GPUs in its kernels, which can be used to train deep learning models [33]. For this study, the labelled set was split into three homogeneous sets: training, validation, and testing, with ratios of 68%, 20%, and 12%, respectively. The distribution of classes within each split is shown in Table III. Two sets of experiments were performed: fine-tuning and training using the imbalanced training set (2489 samples), and using a balanced training set after augmentation (6158 samples).
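The 68/20/12 split can be sketched as follows: indices are shuffled and cut at the stated proportions. With the 3662 labelled APTOS2019 images this yields roughly the 2489-sample training set reported here. A stratified split would additionally preserve per-class ratios; this sketch is a plain random split under that assumption.

```python
# Shuffle indices and cut at 68% / 20% / 12% of the dataset.
import numpy as np

def split_indices(n, train=0.68, val=0.20, seed=42):
    idx = np.random.default_rng(seed).permutation(n)
    n_train = int(n * train)
    n_val = int(n * val)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

tr, va, te = split_indices(3662)  # APTOS2019 labelled-set size
```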

B. Data Preprocessing
As most of the pre-trained models in this study were trained using images of size 224×224, images of APTOS2019 were rescaled accordingly. Moreover, images were converted into grayscale, which increases the visibility of some abnormalities. Following [18], [7], further image processing steps were applied: uninformative black areas were removed using a circular crop, and images were blended using Gaussian blur with alpha=4, beta=-4, and gamma=128. Consequently, the resulting images are not entirely greyscaled, as the modifications were applied separately to every pixel's colour channel. This helps improve the visibility of the blood vessels and their growth in the eye, as shown in Fig. 5. All image preprocessing techniques were applied using the Python OpenCV (cv2) library [36], [37].

Class descriptions from Table II:

Label 1: Mild nonproliferative retinopathy: In this early stage of the disease, small patches of balloon-like swelling, known as microaneurysms, form in the small blood vessels of the retina. Fluid leaks into the retina through these microaneurysms, as shown in the accompanying images.

Label 2: Moderate nonproliferative retinopathy: As the disease progresses, blood vessels feeding the retina may swell and distort and lose their blood transportation capacity. These conditions cause significant changes to the appearance of the retina and can contribute to diabetic macular edema (DME), as shown in the accompanying images.

Label 3: Severe nonproliferative retinopathy: Many further blood vessels are blocked, depriving regions of the retina of their blood supply. These regions secrete growth factors that signal the retina to form new blood vessels, as shown in the accompanying images.

C. Data Augmentation
Data augmentation was implemented using the 'ImageDataGenerator' class from the Keras library [35]. As shown in Fig. 3, the number of cases in each category varies significantly, with no DR as the majority class (49.3% of total images). The number of augmented images differs based on the number of original images, as shown in Fig. 6. The augmentation phase enriched the diversity of the classes to provide high-quality images to the learning models. This operation was performed only on the training dataset. Image augmentation for the minority classes was applied via zooming, flipping, and rotation, which produced a dataset three times larger than the original set.
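The flip, zoom, and rotation operations above map directly onto ImageDataGenerator parameters; the exact rotation and zoom ranges below are assumptions, as the paper does not state them.

```python
# Keras ImageDataGenerator configured with the augmentations used in
# this study; the numeric ranges are illustrative assumptions.
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    horizontal_flip=True,
    vertical_flip=True,
    zoom_range=0.2,       # assumed range
    rotation_range=30,    # assumed range, in degrees
)

images = np.random.default_rng(0).random((4, 224, 224, 3))
batch = next(datagen.flow(images, batch_size=2, seed=1))
```

In practice the generator is passed to `model.fit`, so augmented batches are produced on the fly for the training split only.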

D. Fine-Tuning the Pre-Trained CNN Models
For every pre-trained model included in this study, the input layer was set to 224×224 with three channels, while the output layer was modified to match the number of classes in this task, i.e., five classes. All layers were then frozen during the fine-tuning process except for the modified last layers, which were trained using the Adagrad optimiser with a learning rate of 0.01 for 30 epochs. Similar training configurations were employed when training the CNN model.
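A sketch of this fine-tuning setup in Keras: load a backbone, freeze its layers, and attach a new five-class softmax head trained with Adagrad at learning rate 0.01. `weights=None` is used here to avoid downloading ImageNet weights; the study would use `weights="imagenet"`. The global-average-pooling head is an assumption.

```python
# Freeze a pre-trained backbone and train only a new classification head.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import InceptionV3

base = InceptionV3(include_top=False, weights=None,
                   input_shape=(224, 224, 3))
base.trainable = False  # freeze all pre-trained layers

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),           # assumed pooling head
    layers.Dense(5, activation="softmax"),     # five DR severity classes
])
model.compile(
    optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.01),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
```

Training then runs for 30 epochs with `model.fit`, updating only the head's weights while the frozen backbone acts as a fixed feature extractor.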

E. Evaluation Metrics
A multi-class classification task necessitates considering factors such as class balance and the expected outcomes when picking the optimal metrics to evaluate the performance of a particular classifier against a given dataset. One performance metric may assess a classifier from a specific perspective while others cannot, and vice versa. Hence, there is no standardised (unified) metric for the generalised performance measurement of a classifier. In this paper, several metrics are chosen to measure the models' performance: Accuracy, Precision, Recall, and F1-score. Table IV summarises how each metric is calculated for a multi-class classifier with C classes, where TP_i and TN_i are the numbers of cases correctly diagnosed as belonging to class C_i or not, respectively, and FP_i and FN_i are the numbers of cases incorrectly diagnosed as belonging to class C_i or not, respectively.

Metric    | Description                                                              | Formula
Accuracy  | The average number of correct predictions.                               | (1/C) sum_i (TP_i + TN_i) / (TP_i + TN_i + FP_i + FN_i)
Precision | Capability of identifying the correct instances for each class.          | (1/C) sum_i TP_i / (TP_i + FP_i)
Recall    | Capability to recognise the true positives out of the total true positive cases. | (1/C) sum_i TP_i / (TP_i + FN_i)
F1-score  | The harmonic average of precision and recall.                            | 2 * Precision * Recall / (Precision + Recall)
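The macro-averaged metrics can be computed directly from a confusion matrix using the per-class TP_i, FP_i, and FN_i definitions above. The confusion matrix below is illustrative, not from the study.

```python
# Macro-averaged precision, recall and F1 from a confusion matrix,
# where cm[i, j] counts samples of true class i predicted as class j.
import numpy as np

def macro_metrics(cm):
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp   # predicted as class i but wrong
    fn = cm.sum(axis=1) - tp   # true class i but missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision.mean(), recall.mean(), f1.mean()

cm = [[50, 5], [10, 35]]       # illustrative 2-class example
p, r, f1 = macro_metrics(cm)
```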
As one of the experiments uses an imbalanced dataset, Cohen's kappa was used as an additional metric. It can be computed as follows:

K = (P_0 - P_e) / (1 - P_e)

where P_0 denotes the overall accuracy and P_e denotes the probability of agreement between the predicted and actual class values occurring by chance [38]. K = 1 if the classes are in complete agreement, while K = 0 indicates agreement no better than chance.
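Following this definition, Cohen's kappa can be computed from a confusion matrix: P_0 is the observed accuracy and P_e the chance agreement obtained from the marginal row and column totals.

```python
# Cohen's kappa: (P0 - Pe) / (1 - Pe), with Pe derived from marginals.
import numpy as np

def cohen_kappa(cm):
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    p0 = np.trace(cm) / n                                   # observed agreement
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n ** 2   # chance agreement
    return (p0 - pe) / (1 - pe)

k = cohen_kappa([[2, 1], [1, 6]])  # illustrative matrix
```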

VI. RESULTS

A. Training with Imbalanced Dataset
Each pre-trained model was fine-tuned using the imbalanced training set, with 2489 samples and no DR as the majority class. The same imbalanced set was used when training the CNN model from scratch. Table V lists the results obtained during model training. Since accuracy is unreliable when evaluating models trained on an imbalanced dataset, F1-score and kappa are the primary evaluation metrics. The CNN model achieves the highest F1-score with 67%, while the InceptionV3 model obtained 54%. The EfficientNetB5 model has the lowest performance.
To investigate the reasons for EfficientNetB5's performance, the learning curves for each model are depicted in Fig. 7. As shown, the learning curves of the CNN and InceptionV3 models improved smoothly in the training and validation phases, while the EfficientNetB5 model suffered from severe overfitting, which caused its low results. In the confusion matrices, the diagonal expresses the correctly diagnosed cases for each class, while the off-diagonal elements represent misclassified samples. In general, all models recognise Class 0 (no DR) and Class 2 (moderate NPDR) best, in line with the class distribution shown in Fig. 3, where Classes 0 and 2 have the largest numbers of samples, respectively. However, most confusion occurred between the different DR severity classes rather than between no DR and any DR; this observation held for all models. In other words, these models detect DR well but classify its severity level poorly. The detection rate can be calculated by mapping all DR severity levels 1-4 to 1. The obtained detection rates are 90%, 96%, and 83% for CNN, InceptionV3, and EfficientNetB5, respectively.
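The detection rate described above, which maps all severity levels 1-4 to a single "DR present" label, can be sketched as follows. The label arrays are illustrative, not the study's predictions.

```python
# Binary DR detection rate: severity levels 1-4 collapse to "present".
import numpy as np

def detection_rate(y_true, y_pred):
    present_true = np.asarray(y_true) > 0   # levels 1-4 -> DR present
    present_pred = np.asarray(y_pred) > 0
    return (present_true == present_pred).mean()

y_true = [0, 0, 1, 2, 3, 4, 0, 2]
y_pred = [0, 1, 2, 2, 1, 4, 0, 0]  # severity often confused, detection mostly right
rate = detection_rate(y_true, y_pred)
```

Note how rows 2 and 4 count as correct detections despite the wrong severity level, which is exactly the gap between detection and severity classification observed here.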

B. Training with Balanced Dataset
In this experiment, the pre-trained models were fine-tuned using the balanced training set obtained via augmentation, with 6158 samples. The same set was used when training the CNN model from scratch. Table VI lists the results obtained during model training. The CNN model achieves the highest F1-score with 64%, while the InceptionV3 model obtained 58%. The EfficientNetB5 model again has the lowest performance, with a 48% F1-score. Looking at the learning curves for these models in Fig. 9, the validation performance improved beyond the training performance for the CNN model, which indicates that some training samples had features that were difficult for the model to learn. This was not observed for the InceptionV3 and EfficientNetB5 models, whose training and validation performance were approximately similar. Fig. 10 visualises the confusion matrices of these models. In general, all models failed to recognise Class 4 (PDR) as successfully as the other classes. As in the previous experiment, most confusion occurred between the different DR severity classes rather than between no DR and any DR. The obtained detection rates here are 84%, 95%, and 90% for CNN, InceptionV3, and EfficientNetB5, respectively.
This study performed two experiments: the first used a dataset imbalanced between classes and processed only by scaling and resizing the images, while the second used a dataset balanced through data augmentation, with image preprocessing techniques applied. F1-score was used to measure and compare performance in both experiments because it is a standard measure for imbalanced data classification, in addition to the remaining metrics mentioned in Section V-E. Performance improved when using the balanced set, as the InceptionV3 and EfficientNetB5 models obtained higher results: the InceptionV3 model improved in Recall and F1-score, and the EfficientNetB5 model improved in all metrics. In contrast, the results of the CNN model decreased in all measures. Hence, fine-tuning pre-trained models benefited from the augmented samples and enhanced features, which was not the case for the CNN model.
Furthermore, the CNN model achieved the highest results in the two experiments, with F1-scores of 67% and 64% in the first and second experiments, respectively. Looking at the learning curves, overfitting was an issue for the pre-trained models, indicating the need for stronger regularisation for these advanced architectures. In other words, the more complex the architecture, the more prone it is to overfitting.
In general, the detection ability of these models was better than their classification between DR severity levels. For EfficientNetB5, DR detection improved by 7% absolute when using the balanced training set, while the opposite held for the CNN model, whose detection accuracy dropped by 6% absolute.

VII. CONCLUSION
DR is currently one of the dominant diseases significantly affecting people with diabetes. This paper covers the implementation and evaluation of several deep learning models (CNN, InceptionV3, and EfficientNetB5) for classifying DR using the APTOS2019 dataset. Two experiments were conducted: the first with the original images and the second after preprocessing the images and balancing the classes. The InceptionV3 model achieved the best accuracy on the dataset in both experiments, while the CNN model obtained the highest F1-score in both. Using these prediction results, effective DR detection systems can be implemented with deep learning models so that patients can be treated in the early stages. The results of this research may differ from previous research due to differences in the dataset used and the data processing methods. The main challenges and limitations of this research are that the image dataset was imbalanced and that the available computing resources were insufficient; even when using online GPUs such as Kaggle's, the allotted time was limited. In future work, this research can be extended to address these deficiencies by using other methods to balance the data and by applying other pre-trained models to diagnose DR.