Transfer Learning based Performance Comparison of the Pre-Trained Deep Neural Networks

Deep learning has grown tremendously in recent years, having a substantial impact on practically every discipline. Transfer learning allows us to transfer the knowledge of a model that has been formerly trained for a particular task to a new model that is attempting to solve a related but not identical problem. Specific layers of a pre-trained model must be retrained while the others must remain unmodified to adapt it to a new task effectively. There are typical issues in selecting the layers to be enabled for training and layers to be frozen, setting hyperparameter values, and all these concerns have a substantial effect on training capabilities as well as classification performance. The principal aim of this study is to compare the network performance of the selected pre-trained models based on transfer learning to help the selection of a suitable model for image classification. To accomplish the goal, we examined the performance of five pre-trained networks, such as SqueezeNet, GoogleNet, ShuffleNet, Darknet-53, and Inception-V3 with different Epochs, Learning Rates, and Mini-Batch Sizes to compare and evaluate the network’s performance using confusion matrix. Based on the experimental findings, Inception-V3 has achieved the highest accuracy of 96.98%, as well as other evaluation metrics, including precision, sensitivity, specificity, and f1-score of 92.63%, 92.46%, 98.12%, and 92.49%, respectively. Keywords—Transfer learning; deep neural networks; image classification; Convolutional Neural Network (CNN) models


I. INTRODUCTION
The primary evolution of neural networks was stimulated by the desire to design a process that could imitate the human brain. The ability of conventional machine-learning approaches to explore natural data in its natural form was limited. Deep learning enables computational models with several processing layers to learn and represent data at multiple levels of abstraction, simulating how the brain receives and analyses multi-modal information, and so implicitly capturing intricate data structures [1]. A convolutional neural network (CNN) is one of the most popular deep learning models. It uses deep convolutional networks and non-linearity to discover local and spatial features, and patterns directly from raw data. As a result, a CNN learns features from data automatically, eliminating the necessity to manually extract them [2]. Image classification is a vital phenomenon in computer vision and other computer vision approaches, such as localisation, detection, and segmentation are built on top of it [3]. Deep neural networks (DNN) have recently been popular in the deep learning community for solving real-world issues, but the deep networks may face obstacles and hurdles throughout the training process, such as exploding/vanishing gradients and degradation [4]. The deep architecture presents the dedicated concern of training a CNN from scratch, which needs massive computational power, a long training time, and a substantial amount of training data. The specificity of features rises as we progress from lower-level CNN layers to higher-level layers, until the last classification layer becomes profoundly task specific. The image features extracted by the lower-level CNN layers can be used to retrain the model for a completely different task, avoiding the need to start over [5]. In this case, all of the layers of a pre-trained CNN model can be employed as fixed feature extractors, with the exception of the final classification layer. Using the knowledge gained from earlier training, the final layer can be customised and retrained for a new task. When the depth of a network goes beyond the limit, it endures the degradation problem, which results in a decline in accuracy [6]. The internal covariate shift, which is the variation in the dissemination of the input data to a layer during training, is another matter of concern.
Transfer learning is a machine learning technique in which knowledge gained from one type of problem is applied to another similar task or domain [7]. CNN models are normally trained either from scratch or by applying transfer learning. Training from scratch involves a substantial amount of data to learn millions of parameters. Because a sufficiently labelled dataset is required for many applications, CNNs rarely train from scratch. Instead, a large-scale dataset is commonly used to pre-train a CNN, which is subsequently used as a fixed feature extractor or as an initialisation for other particular tasks [2]. The initial few layers of CNN models are trained to recognise task features. In the first layer, pre-trained models learn simple patterns like shapes and diagonals, then combine these components in successive layers to learn multipart features [8]. The models create meaningful constructs in the final layer by exploiting patterns learned from earlier layers. The final few layers of the trained network can be replaced and retrained with new layers for the target activity during transfer learning. Although fine tuned learning experimental studies need some learning, they are still much quicker than learning from the scratch [9], [6].

A. Pre-Trained Deep Learning Architectures
The promotion of artificial neural networks (ANNs) is the deep neural network (DNN), which comprises numerous hidden layers between the input and output layers. A DNN is capable of expressing an object well through its deep architectures and excels at modelling complex nonlinear relationships [10]. In recent times, CNNs have played a critical role in image classification and object detection. In 1998 LeCun et al. [11] proposed the first multilayer CNN, which is a convolutional network with seven levels that is simple to use, called LeNet-5. The layers of CNNs have become significantly deeper as GPU technology continues to advance. From 1998 to 2018, a number of CNN frameworks were developed, including LeNet [12], AlexNet [13], VGG 16 and VGG 19 [14], ResNet's Inception ResNet [15], ResNeXt, and other frameworks including PolyNet [16], DenseNet [17]. Transfer learning at a deep level instead of utilising traditional machine learning approaches that benefit from handcrafted features, CNN learns the most representative features from raw data automatically [18]. A variety of CNN architecture modifications with a rapid growth in the number of layers have recently been demonstrated. In this study, five pre-trained models, namely SqueezeNet, GoogleNet, ShuffleNet, Darknet-53, and Inception-v3 have been selected for the performance comparison and a brief summary is provided in Table I. A system designer must incorporate their judgment and substantial feature engineering to resolve the question of what needs to be transferred. The challenge on how knowledge should be conveyed through is model selection and how to supplement it to enhance prediction performance [11]. When selecting a network to apply to a problem, different aspects of pre-trained models are important to consider. Network accuracy, speed, and size are the most important considerations. Choosing a network is usually a compromise between these factors. The primary goal of this study is to compare the network performance of the selected pre-trained models based on accuracy, speed, and size to help the selection of a suitable model for image classification.
The rest of the paper is organised as follows. In Section II, we give a description of the related works. In Section III, the methodology is described in detail together with the transfer learning steps used in MATLAB. In Section IV, the experimental results are shown, followed by the performance evaluation and performance comparison of the pre-trained deep neural networks. Finally, in Section V, we provide the conclusions and future work.

II. RELATED WORK
In several disciplines, traditional machine learning algorithms have been widely accepted. Deep learning as well as image processing techniques have been used. With the introduction of transfer learning as a new learning framework [24], by fine-tuning pre-trained CNN models that have already been trained on ImageNet, similar results can now be obtained on deep learning applications. These models require a smaller number of training examples than developed models, which necessitate a significant amount of effort to acquire a big number of training instances [25]. Transfer learning has been used in a variety of fields, including agriculture, where it has been used to identify weeds, classify land cover, identify plants, count fruits, and classify crop types. Transfer learning has become increasingly important in medical image processing, while pretrained deep neural networks have made significant advances in the medical field, including the use of magnetic resonance imaging (MRI) scans, computerised tomography (CT) scans, and electrocardiograms (ECs) to detect life-threatening diseases, such as heart disease, cancer, and brain tumours. Shakil Ahmed et. al. [8] developed a transfer learning-based framework, which was tested against two well-known CNN models, Inception-V3 and VGG-16, using the Kimia Path24 dataset, which was created specifically for the classification and retrieval of histopathological images. Muhammed Talo [26] did the same kind of study with the same Kimia Path24 dataset. ResNet-50 and DenseNet-161, however, were used as well-known pre-trained CNN models. Rishav Singh et. al [11] presented a framework based on the concept of transfer learning to address and focus efforts on histopathology and unbalanced image classification, employing the widely used VGG-19 as a base model. Samuel Kumaresan et. al. [18], suggested employing transfer learning to overcome the issue of a small dataset of welding defect X-ray pictures. They used two large pre-trained convolutional neural networks, VGG16 and ResNet50, to extract features from weld defect radiograph images that can be used to classify 14 different types of weld defects. The goal for Edna Chebet Too et. al. [6] was to fine-tune and explore the deep convolutional neural network for image-based plant disease classification. The models VGG 16, Inception V4, ResNet with 50, 101, and 152 layers, and DenseNet with 121 layers were assessed in an empirical comparison of the deep learning architectures. Jianping Ju et. al. [27] in an effort to address the actual demand for jujube fault detection, introduced a jujube sorting model in small data sets based on convolutional neural networks and transfer learning using the SE-ResNet50-TL and SE-ResNet50-CL models. Triplet loss function and Center loss function were used to replace SoftMax loss function and embedded SE module for the dry red date defect detection. Alper et. al. [28] recommended a new CNN architecture for the hazelnut variety classification and the model was compared with four pre-trained models: VGG16, VGG19, InceptionV3, and ResNet50. In recent years, the difficulty of layer selection when using transfer learning with fine-tuning has received substantial attention. With the widespread adoption of deep learning techniques, transfer learning with fine-tuning appeared to be the ultimate approach for transferring knowledge, allowing scientists and professionals to apply such deep learning methods more quickly to a variety of domain problems [29].

III. METHODOLOGY
Classification has been an effective mission, and it is important in the subject of computer vision, which seeks to classify images into predefined classes automatically. Prior to the boom of deep learning approaches, a lot of effort was invested into constructing scale-invariant features, feature representations, and image classification classifiers [30]. These well-crafted qualities, on the other hand, work against objects in natural images with complex scenes, varying colour, texture, and illumination, as well as constantly changing positions and view parameters. Researchers have been working on sophisticated ways to increase image classification accuracy for decades. When the large-scale image dataset ImageNet was formed in 2009, Feifei Li [31] created the great-leap-forward advancement of image classification. The information about dataset, learning environment, gradient, learning rate, epoch, and mini-batch size employed in MATLAB, and the steps of transfer learning used in this study are explained under the methodology in the subsequent sections.

A. Dataset
The CIFAR-10 [32] dataset has 32 x 32 colour images that are divided into ten classes, each with 5,000 training images and 1,000 test images. Among the ten classes, five classes have been selected for the experimental process. From the training dataset, 3,000 training images are selected for each of the five classes, which includes the list as shown below: The most widely used split ratios are 70:30; 80:20; 65:35; 60:40 etc., in which the sample size suits the nature of the problem and the architecture implemented. There is no fixed law for dividing training and trial datasets when it comes to data splitting. Some scholars have traditionally used the 70:30 ratio to differentiate the datasets. As most widely used in MATLAB, the training set of images is split into training set and validation set by the 70:30 ratio.
[imdsTrain, imdsValidation] = splitEachLabel(imds, 0.7); Besides that, image augmentation could be used at random on the training datasets with distinct values to help expand the dataset, preventing the network from overfitting and capturing the exact features of the training images. The following settings for image augmentation have been chosen, as indicated in most of the MATLAB examples: horizontal reflection, horizontal and vertical translation in the range [- 30 30] pixels, and horizontal and vertical scaling in the range [0.9 1.1] with a random rate.

B. Learning of the Pre-Trained Networks
• Environment -The networks are implemented in MATLAB R2021a. The size of input images is adjusted to match the layers of various models.
• Stochastic Gradient Descent with Momentum (SGDM) -Gradient descent [10] is a popular neural network optimisation approach that can tackle a variety of trivial issues. When the training dataset is huge, however, the simple gradient descent method may use a lot of processing resources, making the convergence process slow. Simultaneously, because the gradient descent approach considers all of the training data for each calculation, it may result in overfitting. To resolve this challenging dispute, SGDM has been considered in this study. Momentum [33] is a commonly used acceleration technique in the gradient descent method whereby the convergence process can be accelerated.
• Learning Rate (LR) -When it comes to CNN training, LR is a crucial parameter. The LR is frequently decreased by a factor of 0.1 or 0.5. In this study, fixed learning rates of LR-0.001 and LR-0.0001 have been chosen instead of reducing the LR by each epoch.
• Epoch -The complete pass of the training algorithm across the entire training set is referred to as an epoch. In this study, the selected epoch values are 10, 20, 30.
• Mini Batch Size -A mini-batch is a subset of the training set that is utilised to calculate the loss function's gradient and update the weights. Two batch sizes are selected as part of the experimentation process: 32 and 64.

C. Transfer Learning Flow in MATLAB
CNN's unique qualities, such as incremental feature extraction in subsequent layers, make it possible to use parts of a pre-trained model for a completely new task without retraining the entire network [34], [35]. The fundamental idea is to use the initial layers from a pre-trained model and just retrain the last few layers on new images. Transfer learning can be implemented by replacing the output layer with a new classifier and then, selecting one of the two approaches: • Fixed feature extraction by freezing the initial layers or other layers of the convolutional base.
• Fine-tuning the weights and other parameters to retrain one or more convolution layers.
The entire experimentation process of image classification with the dataset CIFAR-10 has been done with MATLAB. The CIFAR-10 dataset is downloaded and provided as input data to the pre-trained model. Prior to loading the data, the entire dataset is divided into three main datasets comprising the training, validation, and testing datasets. To get good performance, deep neural networks require a vast amount of training data. Image augmentation, such as reflection, translation, and scaling, are used to increase the performance of deep networks in order to develop an effective image classifier with little training data. The pre-trained network is loaded, and the final layers are replaced with a new classification layer and a fully connected or convolutional layer. To fine-tune the model, the initial layers of each network are frozen and other parameters like the pool size, stride etc., are updated. The freezing layers are chosen according to the depth, size, and number of layers of the pre-trained network. Afterwards, the training process is initiated, followed by the classification of the validation, and test images. Finally, the classification accuracy is computed and performance of the networks are evaluated using confusion matrix. The transfer learning steps are illustrated in Fig. 1 which is then followed by the detailed steps performed in MATLAB to implement the entire transfer learning process.
First, training and validation sets are used to perform the training process to make sure that we have the best possible training model in the study. By checking the accuracy and loss of training and validation sets, we will be able to control the model's performance during training. Thus, the best possible model can be obtained by fine-tuning at the end of the training process. The results are evaluated by using the test set and these procedures were applied for each of the five models used in the study. All parameters are used in the same way for each model and the models used were evaluated using confusion Confusion Matrix 15) cm = confusionmat(trueLabels, predictedlLabels); 16) cm chart = confusionchart(trueLabels, predictedlLabels); Reset GPU matrix to find out the performance of the classifier. The performance evaluation used in the study and the comparison on the performance of different models are presented in Table  VIII and Table IX in the following section.

IV. RESULT AND DISCUSSION
The whole experimental process was carried out with a laptop and the experimental setup including the hardware, software, and its specifications are mentioned in Table II.

A. Experimental Results
In the training process, each iteration involves a gradient estimation and a network parameter update. Training can be tracked in MATLAB to determine how quickly the network's accuracy improves, as well as if the network attempts to overfit the training data. Following the completion of the training, the results can be inspected to see the finalised validation accuracy and to discover how the training was proceeded by plotting the key metrics, which include training accuracy, validation  Fig. 3, and they contain the validation accuracy, which represents the classification accuracy, and the validation loss, which represents the validation loss across the entire validation set for the five pre-trained networks.
The experimental results are presented in the tables    • Epoch-30 was chosen for further comparison based on the experimental findings, and the results were quite promising.
• When it comes to mini batch sizes, batch 32 has shown to be more promising than batch 64. Also, with Darknet-53, batch 64 displayed an error due to a lack of RAM (out of memory); hence batch 32 was chosen for further evaluation and comparison.
• With the exception of Darknet-53, LR-0.001 yielded favourable results when compared to LR-0.0001. For the subsequent studies, LR-0.001 findings were chosen for the other four networks, and LR-0.0001 results for Darknet-53.
• Out of the five pre-trained networks, Inception-V3 produced the best results, with the most layers, while SqueezeNet produced the unpleasant results, with the

B. Performance Evaluation using Confusion Matrix
The ratio between the number of right predictions made and the total number of predictions produced is known as classification accuracy [18]. The learning performance of the pre-trained deep neural networks is assessed using a standard confusion matrix method. A confusion matrix is a summary of classification problem prediction outcomes. It provides insight into correct and incorrect classifications, as well as the types of errors made, for each specific class. In image classification, the confusion matrix is primarily used to compare the classification to the actual measurement value in order to intuitively and accurately describe the accuracy of model classification [36]. The confusion matrix can be used to directly identify the performance of deep CNN models, and the evaluation metrics are listed below: where ACC stands for accuracy, which is defined as the percentage of correctly classified samples when a measured value is compared to a known value.
where PREC is the precision used to determine the model's ability to correctly classify positive values.
where SENS is the sensitivity, also known as recall, which is the frequency with which the model correctly predicts positive values. It's used to figure out how well the model can predict positive values.
where SPEC denotes the specificity with which the model's ability to predict negative values.
whereas the harmonic mean of the precision and sensitivity is the F1-score, also known as the balanced F-score or Fmeasure.
In MATLAB, the predicted class is represented by the rows, while the true class is represented by the columns. The diagonal cells relate to accurately classified observations. The off-diagonal cells correspond to observations that were inaccurately classified. The number of accurately and inaccurately classified observations for each predicted class are displayed as percentages of the total number of observations in the respective predicted class in a column-normalized column summary. The number of accurately and inaccurately classified observations for each true class are displayed as percentages of the total number of observations for that true class in a row-normalized row summary.  The results of the classification metrics evaluation of the five pre-trained networks for both the Learning Rates of 0.001 and 0.0001 are summarized in Table VIII. According to the evaluation results it is evident that the networks performed well on the LR-0.001 in compared to LR-0.0001 except for Darknet-53. Darknet-53, in compared to the other four networks, showed promising results with a LR-0.0001, whilst the other networks' performances were on the decline.

C. Pre-Trained Networks Performance Comparison
The performance comparison of the five pre-trained networks, encompassing both LRs, is shown in Table IX. Based on the performance comparison of the pre-trained networks, Inception-V3 has achieved the highest accuracy of 96.98%, as well as other metrics such as precision, sensitivity, specificity, and f1-score. The other networks produced somewhat lower results than Inception-V3, but altogether, all five pre-trained networks attained an accuracy of 90% or higher.
Transfer learning using a pre-trained CNN model is a better option for classification with the availability of only small datasets. In many of the previous studies, different pre-trained CNN models were compared using medical images and other relevant datasets. The results showed that the performance of the pre-trained models were mainly based on the dataset. With the limited computing resources, only five classes of the benchmark CIFAR-10 dataset are selected, but still managed to accomplish the aim of this study in the selection of a suitable model for image classification, Inception-V3. Even though one of the CNN model darknet-53 produced an error (out of memory) over batch-64, the current study shows that pretrained networks with the highest number of layers (Darknet-53 and Inception-V3) provided the maximum scores in the prediction of classes with best accuracy. The current study proves that transfer learning can be useful for various computer vision problems, especially for the ones with small datasets. With the availability of proper datasets, deep CNN models have the capabilities to take medical imaging technology further, providing a higher level of automation in medical imaging, including image processing and analysis.

V. CONCLUSIONS
In this study, we experimented with the performance of five pre-trained networks, such as SqueezeNet, GoogleNet, Shuf-fleNet, Darknet-53, and Inception-V3 with different epochs, learning rates, and mini-batch sizes. We performed the entire training process in MATLAB R2021a where we can view the complete network architecture of the CNN models, which helped us in the selection of freezing the initial layers. The final layers of the pre-trained CNN models are replaced either with a fully connected layer or convolutional layer and a new classifier replacing the classification layer. The initial layers are frozen to keep the weights intact and after the training, each model was evaluated using a confusion matrix. The experimental findings show that each pre-trained network produced different results with different hyper-parameters in the prediction of positive values. The results demonstrate that all the five pre-trained networks yielded promising results over the mini batch size-32, and epoch-30. In terms of LR, Darknet-53 delivered impressive results with LR-0.0001, achieving a maximum accuracy of 94.70%. Overall, the Inception-V3 model with LR-0.001 achieved the highest accuracy of 96.98%, as well as other evaluation metrics including precision, sensitivity, specificity, and f1-score of 92.63%, 92.46%, 98.12%, and 92.49%, respectively.

VI. FUTURE WORK
We presented a transfer learning-based performance comparison between the selected five pre-trained networks in this study. The freezing of network layers was selected based on the network depth, size, and number of layers. Only the initial layers were frozen; however, different sets of layers can be frozen. In the future, focus will be given to freeze multiple set of layers and to compare the results of frozen and non-frozen layers of the pre-trained networks. For further evaluation and comparison, different datasets and other pretrained deep neural networks can also be explored.