Investigating the Impact of Train / Test Split Ratio on the Performance of Pre-Trained Models with Custom Datasets

Abstract: The allocation of data between training and testing is a critical factor influencing the performance of deep learning models, especially those built upon pre-trained architectures. A suitable training set size is important for a classification model's generalization performance. The main goal of this study is to find the appropriate training set size for three pre-trained networks using different custom datasets. To this end, the paper explores the effect of varying the train / test split ratio on the performance of three popular pre-trained models, namely MobileNetV2, ResNet50V2 and VGG19, with a focus on the image classification task. Three balanced datasets never seen by the models were used, each containing 1000 images divided into two classes. The train / test split ratios studied are 60-40, 70-30, 80-20 and 90-10. Sensitivity, specificity and overall accuracy were used to evaluate the classifiers under the different ratios. Experimental results show that the performance of the classifiers is affected by the train / test split ratio for all three custom datasets. Moreover, with the three pre-trained models, using more than 70% of the dataset images for training gives better performance.


I. INTRODUCTION
In daily life, a huge amount of data is generated in different forms: text, images and video obtained from cameras and sensors. This data can be analyzed efficiently using advanced techniques such as deep learning. In image classification, deep learning models identify the features in an image that characterize a particular class and thereby help the model distinguish between classes. They can reach human-level performance in several fields, such as classifying animals, types of food and diseases.
When only a small dataset is available, an effective way to classify images is the transfer learning approach. It takes the knowledge a model has acquired on one machine learning task and reuses it as a starting point for a different but related task. The pre-trained neural network is fine-tuned to meet the user's needs rather than trained from scratch.
Transfer learning for image classification has been applied in the literature in several areas:
• The authors of [1] propose using the VGG19 architecture as the base model, complemented with different state-of-the-art techniques, to classify histopathological images into IDC and non-IDC classes. Invasive Ductal Carcinoma (IDC) is a type of breast cancer.
• To identify COVID-19, pneumonia and lung cancer from chest radiographs, the researchers in [2] suggest combining VGG-19 with Convolutional Neural Networks (CNN) to improve the performance of the multi-class classification task.
• The study in [3] fine-tunes different pre-trained models, such as VGG-19, ResNet50V2 and DenseNet-121, to predict sentiment from Twitter-based images.
• The researchers in [4] evaluated the performance of ResNet18, ResNet50, AlexNet, DenseNet121, DenseNet201 and VGG16 in rating gravel road images obtained from self-recorded videos and from Google Street View.
• The classification of forest imagery into fire and no-fire classes was addressed in [5] with a newly proposed approach based on the VGG19 model.
• To improve intention classification accuracy, the researchers in [6] used the knowledge of the ERNIE model (Enhanced Representation through Knowledge Integration) for both the student and the teacher models.
• The paper [7] presents image classification and image prediction on the ImageNet dataset using the pre-trained models MobileNet, MobileNetV2, VGG16, VGG19 and ResNet50.
• For land use and land cover classification, transfer learning was used in [8] to fine-tune the pre-trained models VGG16 and WRNs (Wide Residual Networks). Techniques such as gradient clipping, data augmentation and early stopping were employed to compare performance and computational time. The red-green-blue version of the EuroSAT dataset was used in that work.
www.ijacsa.thesai.org
• The paper [9] presents a systematic review of the early detection of Alzheimer's disease (AD) using transfer learning and neuroimaging biomarkers. Five datasets were used in this review. The researchers confirm that, for the early diagnosis of AD, transfer learning helps to develop a more accurate model.
• Transfer learning studies that use the non-medical ImageNet dataset for medical image analysis were systematically reviewed in [10]. To approach medical tasks with a non-medical dataset, the researchers suggest using transfer learning with the ImageNet dataset. They also confirm that CNN models combined with transfer learning gave reasonable performance.
• To classify mangrove communities, the study in [11] uses three transfer learning strategies and discusses the differences in the classification task. Different models were constructed with three deep learning algorithms: DeepLabV3+, HRNet and MCCUNet.
• For Arabic tweet classification, a transformer-based model was proposed in [12]. The model was built from a pre-trained BERT model provided by the Hugging Face Transformers library, with custom dense layers. To categorize the tweets, a multi-class classification layer was added on top of the BERT encoder. Five publicly available datasets were employed in this study.
• The paper [13] reviews diabetic retinopathy classification with deep learning models that use transfer learning. According to this work, transfer learning is useful for medical image classification because the number of available medical images is limited.
• To classify urban sounds, the researchers in [14] applied transfer learning to three datasets: UrbanSound8k, ESC-10 and Air Compressor. The pre-trained models used in this study are GoogLeNet, SqueezeNet, ShuffleNet, VGGish and YAMNet.
• The detection of COVID-19 was performed in [15] using transfer learning with a dataset of X-ray images. The dataset is divided into three folders: COVID-19, pneumonia and normal (healthy) cases. Although a small dataset was used, high accuracy was achieved with all the models. The proposed approach uses the VGG16, VGG19 and ResNet101 architectures.
• To categorize various food products using transfer learning, a recognition model was introduced in [16].
• The paper [18] evaluates pre-trained models for the detection of osteoporosis, a bone disease, in knee radiographs. VGG16 and VGG16 with fine-tuning were used in this study. The models were evaluated using the accuracy, sensitivity and specificity metrics. According to this study, fine-tuning improves the performance of VGG16 on the desired task.
• Different pre-trained models for bird image identification were studied and compared in [19]. The models employed are DenseNet201, InceptionV3, MobileNetV2 and ResNet52V2. The dataset contains 58388 bird images belonging to 400 species. All the implemented models give good accuracy, but DenseNet201 was the best network according to the authors.
• The researchers in [20] confirm that VGG16 performs well on all nine chest X-ray datasets used. The datasets have various sizes and different class labels.
When transfer learning is used, the resulting model's accuracy can exceed 90% even with datasets of fewer than 100 images per class [21], provided the pre-trained model is implemented correctly.
In a previous work [22], three pre-trained models were compared on a classification task using a custom dataset. The models were trained for 30 epochs, with and 20 as the values of the learning rate and batch size parameters, respectively. VGG19 achieved the highest accuracy, precision, recall and F1-score. This work builds on those studies by examining the impact of the train / test split ratio on the performance of three fine-tuned pre-trained models (MobileNetV2, ResNet50V2 and VGG19), using new datasets never seen by the models.
This paper is organized as follows: Section II contains a literature review on machine learning pre-trained models. Section III introduces the pre-trained models. Section IV describes the three datasets used in this study. The preprocessing phase and the evaluation metrics employed for the performance comparison are described in Section V. Results and discussion are given in Section VI. The conclusion is in Section VII.

II. LITERATURE REVIEW ON MACHINE LEARNING PRE-TRAINED MODELS
Many studies try to understand how to enhance the performance of pre-trained ML models. Some of them are summarized here:
• The work in [23] studies the effects of dataset size and training / testing split ratio on the performance of multiclass classifiers. The results demonstrate that XGBoost gives the best performance. The evaluation was done using 25 performance parameters.
• The paper [24] presents a CNN-based automatic model for identifying strawberry leaf conditions: powdery mildew leaf, healthy leaf and caterpillar pest leaf. MobileNetV3-Large and EfficientNet-B0 were implemented as architectures. The dataset contains 1336 images collected in the field, to which data augmentation was applied.
• The paper [25] proposes a deep learning method based on a CNN architecture to classify six types of strawberry plant diseases. This study uses 4663 strawberry leaf disease images.
• The work in [26] demonstrates that CNN models are more useful than non-deep-learning models for distinguishing between infected and healthy strawberry leaves. The dataset (1450 images) was collected from Balamore and Millen farms Ltd. under the supervision of disease specialists. AlexNet, SqueezeNet, GoogleNet, ResNet50, SqueezeNet-MOD1 and SqueezeNet-MOD2 were employed in this study.
• The paper [27] evaluates four ensemble learning algorithms (random forest, CatBoost, XGBoost and random forest) for the prediction of heart disease using different hyperparameter optimization techniques. Three Kaggle datasets were combined, together with their features, to enlarge the dataset. Using 80% of the dataset for training was useful: the proposed model gives better accuracy with the train / test split ratio of 80%-20%.
• The paper [28] studied the impact of different train / test split ratios on model performance.

III. PRE-TRAINED MODELS
A. MobileNetV2
MobileNetV2 [29] is a CNN model based on an inverted residual structure in which the residual connections are between the bottleneck layers. The architecture incorporates shortcut connections to aid in training deeper networks without vanishing gradients. To improve efficiency, the model uses depth-wise separable convolution, in which a depth-wise convolution is performed independently for each input channel and is followed by a 1x1 point-wise convolution. Depth-wise separable convolution reduces the computational cost and the size of the pre-trained model. As a result, MobileNetV2 has higher accuracy, needs fewer operations and is much faster than the MobileNetV1 model. The MobileNetV2 architecture (see Fig. 1) consists of 17 building blocks in a row, followed by a 1x1 convolutional layer, a global average pooling layer and a classification layer. Within a building block, the role of the expansion layer is to expand the number of channels in the data, while the projection layer reduces the high number of dimensions to a smaller one.
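The saving brought by depth-wise separable convolution can be illustrated by counting weights. The following is a minimal sketch, not taken from the paper: the layer dimensions (3x3 kernel, 32 input channels, 64 output channels) are hypothetical and bias terms are ignored.

```python
def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights of a standard k x k convolution (bias terms ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k: int, c_in: int, c_out: int) -> int:
    """Weights of a depth-wise k x k convolution followed by a 1x1 point-wise one."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1x1 convolution that mixes the channels
    return depthwise + pointwise


# Hypothetical layer, chosen only for illustration.
k, c_in, c_out = 3, 32, 64
standard = standard_conv_params(k, c_in, c_out)
separable = depthwise_separable_params(k, c_in, c_out)
print(f"standard: {standard}, separable: {separable}, "
      f"reduction: {standard / separable:.2f}x")
```

For this hypothetical layer the separable form needs roughly one eighth of the weights, which is the kind of reduction the paragraph above refers to.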

B. VGG19
VGG19 [30], part of the VGG family (Visual Geometry Group), is a CNN architecture with multiple layers. It was published in 2014 by Simonyan and Zisserman, researchers at the University of Oxford. It consists of 19 layers: 16 convolutional layers with a filter size of 3x3 and three fully connected layers (see Fig. 2). The small kernel size reduces the number of parameters, while stacking several such layers still lets the filters cover the entire image. VGG19 was trained on the 1000-category subset of the ImageNet database (which in total contains more than 14 million images); this helps the network capture a diverse set of features and makes it a powerful tool for transfer learning. For downsampling, VGG19 incorporates max-pooling layers, and it uses the fully connected layers for classification.
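The effect of the small kernel size on the parameter count can be checked with a short calculation. This sketch compares three stacked 3x3 layers with a single 7x7 layer, which see the same input region; the channel count of 64 is an assumption made for illustration and bias terms are ignored.

```python
def conv_weights(k: int, channels: int, layers: int = 1) -> int:
    """Weights of `layers` stacked k x k convolutions with equal in/out channels."""
    return layers * k * k * channels * channels


def receptive_field(k: int, layers: int) -> int:
    """Side length of the input region seen by stacked stride-1 k x k convolutions."""
    return 1 + layers * (k - 1)


channels = 64  # illustrative assumption
stacked_3x3 = conv_weights(3, channels, layers=3)
single_7x7 = conv_weights(7, channels)

print("receptive field:", receptive_field(3, 3), "vs", receptive_field(7, 1))
print("weights:", stacked_3x3, "vs", single_7x7)
```

Both configurations cover a 7x7 region, but the stacked 3x3 layers use 27 weights per channel pair instead of 49.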

C. ResNet50V2
ResNet50V2 [31] is a 50-layer convolutional neural network: 48 convolutional layers, one MaxPool layer and one average pool layer. It is known for its depth and its skip connections, which protect the model from the vanishing gradient problem in much deeper networks. ResNet50V2 uses residual blocks, which enhance training efficiency in achieving both depth and accuracy in different tasks. To reduce the computational complexity and to normalize layer inputs, which improves the performance of the network, ResNet50V2 uses bottleneck blocks and batch normalization. The ResNet50V2 architecture is shown in Fig. 3.

IV. DATASETS
The balance (or imbalance) of the classes, that is, the distribution of samples across the classes, is a significant factor that affects the performance of classification models. Providing imbalanced data to the classifier may bias it towards the majority class because it lacks enough data to learn about the minority class, which can cause false predictions. For this reason, the study presented in this work was carried out with three balanced datasets, each containing 1000 color images divided into two classes: class_0 (500 images) and class_1 (500 images). The three datasets are described in Table I. Sample images from the three datasets are shown in Fig. 4, Fig. 5 and Fig. 6.
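To make the four train / test ratios concrete, the sketch below splits a balanced 1000-image label list class by class, so that both parts stay balanced. It is an illustration only: the paper does not state how the split was implemented, and the function name `stratified_split` and the fixed seed are assumptions.

```python
import random


def stratified_split(labels, train_percent, seed=0):
    """Return (train_idx, test_idx), keeping the class proportions in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = len(indices) * train_percent // 100  # images of this class used for training
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return train_idx, test_idx


# 500 images of class_0 and 500 of class_1, as in the datasets described above.
labels = [0] * 500 + [1] * 500
split_sizes = {}
for train_percent in (60, 70, 80, 90):
    train_idx, test_idx = stratified_split(labels, train_percent)
    split_sizes[train_percent] = (len(train_idx), len(test_idx))
    print(f"{train_percent}%-{100 - train_percent}%: "
          f"{len(train_idx)} training images, {len(test_idx)} test images")
```

With 1000 images, the test set therefore holds 400, 300, 200 or 100 images depending on the ratio, which is why the confusion matrices reported later sum to different totals.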

V. PREPROCESSING AND EVALUATION METRICS
A. Preprocessing
One of the major challenges when training deep learning models is obtaining a large dataset, which is not always easy: a large number of images is needed in each class of the classification problem. To expand the three small datasets used in this work, data augmentation was applied with the Keras ImageDataGenerator class. The data augmentation settings are summarized in Table II. Because the images have different dimensions, they were resized to a resolution of 224x224 pixels to match the input size required by the pre-trained models.

B. Evaluation Metrics
In machine learning, classification models must be evaluated with suitable metrics before they can be relied on to solve real-world problems.
There are several measures for assessing classification results [32]; the following three were considered in this study: sensitivity, specificity and accuracy.
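The three metrics can be computed from the four confusion-matrix counts using their standard definitions. The sketch below is illustrative: the counts are hypothetical values for a 300-image test set, not results from the paper.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                # share of positives correctly found
    specificity = tn / (tn + fp)                # share of negatives correctly rejected
    accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all samples correctly classified
    return sensitivity, specificity, accuracy


# Hypothetical counts: 150 positive and 150 negative test images.
tp, tn, fp, fn = 146, 147, 3, 4
sensitivity, specificity, accuracy = classification_metrics(tp, tn, fp, fn)
print(f"sensitivity = {sensitivity:.2%}")
print(f"specificity = {specificity:.2%}")
print(f"accuracy    = {accuracy:.2%}")
```

Here TP, TN, FP and FN stand for true positives, true negatives, false positives and false negatives.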

VI. RESULTS AND DISCUSSION
The pre-trained models were trained in a Google Colab notebook with a learning rate of , a batch size of 32 and 60 epochs. Graphics Processing Units (GPUs) were used as the hardware platform.
Adam is a simple and time-efficient optimizer for deep neural networks; it was therefore used when compiling the models.
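For readers unfamiliar with Adam, its update rule can be sketched on a toy problem. This is not the training code of the study: the one-dimensional objective f(x) = x^2 and the learning rate of 0.1 are assumptions chosen for illustration, while the decay rates and epsilon are the commonly used defaults.

```python
import math


def adam_minimize(grad, x0, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Minimize a one-dimensional function with Adam, given its gradient."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g       # running mean of the gradients
        v = beta2 * v + (1 - beta2) * g * g   # running mean of the squared gradients
        m_hat = m / (1 - beta1 ** t)          # bias correction for the early steps
        v_hat = v / (1 - beta2 ** t)
        x -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return x


# Toy objective f(x) = x^2, whose gradient is 2x and whose minimum is at x = 0.
x_final = adam_minimize(lambda x: 2 * x, x0=5.0)
print(f"x after optimization: {x_final:.4f}")
```

The step size is scaled per parameter by the running gradient statistics, which is what makes the optimizer comparatively insensitive to the raw gradient magnitude.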
The results discussed in this work are the best ones obtained over several experiments carried out for each case.

A. Dataset1
From Table III, it can be observed that MobileNetV2 achieves its best sensitivity (100%) with the ratio 90%-10% and its best specificity (99%) with the ratio 80%-20%, but its best accuracy (97.67%) is obtained when 70% of the dataset is used for training and 30% for testing. The confusion matrix (see Fig. 7) shows that, with the train / test ratio 70%-30%, MobileNetV2 correctly predicted 293 of the 300 test samples.
ResNet50V2 performs well on Dataset1: its accuracy is greater than 97% with all train / test ratios (see Table IV). The best sensitivity (99.5%) is observed when 60% of the dataset is used for training, and the best specificity (100%) is obtained with the ratio 90%-10%. The best accuracy (98.5%), however, is achieved with the ratio 80%-20%: the classifier correctly predicted 197 of the 200 test samples (see Fig. 8).
Although VGG19 achieves its best sensitivity (97%) with the ratio 80%-20% and its best specificity (100%) with the ratio 90%-10%, its best overall performance is observed with the train / test ratio 70%-30%, where it reaches an accuracy of 97.33% (see Table V). The VGG19 confusion matrix, plotted in Fig. 9, shows that the model correctly classifies 292 of the 300 test samples with the ratio 70%-30%.
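The accuracy figures quoted for Dataset1 follow directly from the confusion-matrix counts. The short check below recomputes them from the counts given in the text; the helper name `accuracy_percent` is an assumption, and the last loop only illustrates how much a single misclassified image changes the accuracy for each test-set size.

```python
def accuracy_percent(correct: int, total: int) -> float:
    """Accuracy in percent, rounded to two decimals."""
    return round(100 * correct / total, 2)


# (correctly classified samples, test-set size) as reported for Dataset1.
reported = {
    "MobileNetV2, 70%-30%": (293, 300),
    "ResNet50V2, 80%-20%": (197, 200),
    "VGG19, 70%-30%": (292, 300),
}
accuracies = {name: accuracy_percent(c, n) for name, (c, n) in reported.items()}
for name, value in accuracies.items():
    print(f"{name}: {value}%")

# One misclassified image moves the accuracy by 100 / test-set size.
for test_size in (400, 300, 200, 100):
    print(f"{test_size} test images: one error = {100 / test_size:.2f} points")
```

Because the 90%-10% test set holds only 100 images, each error there costs a full percentage point, so scores obtained with different ratios are measured at different resolutions.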

B. Dataset2
ResNet50V2 gives its best sensitivity (100%) when 90% of the dataset is used for training. Its best specificity and accuracy are reached with the ratio 80%-20%, with scores of 100% and 99.5% respectively (see Table VII).
ResNet50V2 performs best with the ratio 80%-20%: it correctly predicted 199 samples, which represent 99.5% of the 200 test samples (see Fig. 11). With VGG19, the train / test ratio 90%-10% gives the best sensitivity, specificity and accuracy, with a score of 100% for each (see Table VIII). The pre-trained model correctly classifies all the samples when 90% of the dataset is used for training, which is a good result (see Fig. 12).

C. Dataset3
MobileNetV2 achieves its best sensitivity (100%) with the ratio 80%-20% and its best specificity (100%) with the ratio 60%-40%. Of all the train / test ratios, 80%-20% gives the best accuracy, with a score of 99.5% (see Table IX). According to the confusion matrix of MobileNetV2 with the ratio 80%-20%, the pre-trained model correctly predicts 199 of the 200 test samples, or 99.5% (see Fig. 13).
On Dataset3, ResNet50V2 achieves perfect sensitivity, specificity and accuracy (100% each) when 90% of the samples are used for training and 10% for testing. It also gives its best specificity with the ratio 80%-20% (see Table X). With the ratio 90%-10%, the classifier correctly predicted all the test samples (see Fig. 14).
With the train / test ratio 90%-10%, VGG19 gives its best performance in terms of sensitivity, specificity and accuracy, with a score of 100% for each metric (see Table XI). The VGG19 confusion matrix, plotted in Fig. 15, shows that the model correctly classifies all the test samples with the ratio 90%-10%.
The results show that the networks performed best with the train / test ratios 80%-20% and 90%-10%. With the ratio 60%-40%, the classifiers did not score as well on sensitivity, specificity and accuracy taken together as they did with the other ratios. The three networks need more than 70% of the dataset's samples in the training phase to give better results.

VII. CONCLUSION
This study was carried out with three datasets never seen by the pre-trained models MobileNetV2, ResNet50V2 and VGG19. Each dataset was divided into two classes (class_0 and class_1). For all experiments, the batch size was fixed at 32 and the learning rate at . All the experiments were run for 60 epochs. The results show that the train / test split ratio has a significant impact on the classification performance of the three pre-trained networks; the ratios 80%-20% and 90%-10% give better results in most cases.
All the pre-trained networks used in this study perform well on Dataset3; this is due to its simplicity in comparison with the other datasets (Dataset1 and Dataset2). It has fewer features to be learned by the models, which facilitates the learning process and enhances the classifiers' performance.
In conclusion, increasing the size of the training data enhanced the performance of the three classifiers; more than 70% of the dataset's samples are required in the training phase to achieve better performance.
For future work, other datasets of different sizes will be studied with the three pre-trained models in order to reach a more general conclusion about the impact of the train / test split ratio on the performance of the networks. Other architectures could also be added to the study. The impact of using several optimizers will be investigated for different pre-trained models and different datasets.
TP: True Positive, TN: True Negative, FP: False Positive and FN: False Negative.

TABLE I. DESCRIPTION OF THE DATASETS

TABLE III. MOBILENETV2 EXPERIMENTATION RESULTS ON DATASET1

TABLE IV. RESNET50V2 EXPERIMENTATION RESULTS ON DATASET1

TABLE V. VGG19 EXPERIMENTATION RESULTS ON DATASET1

TABLE VI. MOBILENETV2 EXPERIMENTATION RESULTS ON DATASET2

TABLE VII. RESNET50V2 EXPERIMENTATION RESULTS ON DATASET2

TABLE VIII. VGG19 EXPERIMENTATION RESULTS ON DATASET2

TABLE IX. MOBILENETV2 EXPERIMENTATION RESULTS ON DATASET3

TABLE X. RESNET50V2 EXPERIMENTATION RESULTS ON DATASET3