An Improved Deep Learning Model of Chili Disease Recognition with Small Dataset

Abstract: Chili is a high-demand crop farmed around the world for its tasty, spicy and nutritious fruit. Accurately determining the health status of chili plants is therefore essential for agricultural productivity. In recent years, deep learning approaches have achieved impressive results in recognition tasks. However, deep learning models need abundant data to perform well, and collecting such data is time-consuming and resource-intensive. A data augmentation method is proposed to overcome this problem. It was applied to a small dataset of healthy and diseased chili leaves using geometric transformations. Two deep learning models, a CNN and ResNet-18, were then evaluated on the original and augmented datasets. From a series of experiments, it can be concluded that the models trained on the combination of original and augmented images perform best, with an average accuracy of 97%.


I. INTRODUCTION
Chili (Capsicum sp.) is an important spice from the family Solanaceae that originates from South and Central America [1]. It is a high-demand crop with high genetic diversity, extensively cultivated in tropical Asia and equatorial America for its edible, pungent and nutritious fruit [2]. The nutrients contained in a chili fruit include vitamin C, potassium, phosphorus, fibre, antioxidants like vitamin A, and flavonoids like β-carotene, α-carotene, lutein, zeaxanthin, and cryptoxanthin, which are reported to suppress several human cancers [3]. However, owing to the impact of fungi, bacteria, viruses, pests, and climate on the cultivation process, chili is susceptible to a variety of diseases. These diseases make it difficult for chili to thrive, reducing the yield and quality of the fruit. An estimated 60-70 percent of diseases and early disease symptoms are detected on the leaves alone [4]. Hence, it is necessary to identify chili diseases precisely and implement early preventive and treatment measures.
Since their advent, deep learning models have made significant advances in disease recognition [5]. Good performance normally requires a model with a large number of parameters and a large amount of data to fit those parameters properly. Obtaining that data requires manual collection and labelling [6], which is resource-intensive and time-consuming. As a result, it can be hard to gather enough data to train the deep learning models, which significantly limits the accuracy of chili disease recognition.
Working with small datasets, several studies [7][8][9][10] in the chili agricultural field have used data augmentation to increase the volume of data. The method generates data artificially by adding augmented images to the existing dataset through either oversampling or warping [11]. With oversampling augmentation such as generative adversarial networks (GANs), augmented images of the less frequently occurring (abnormal) label are added to the original datasets, preventing a deep learning model from being biased toward the majority label during the recognition process. Even though GANs have intriguing promise, they need a substantial number of original images in order to train and create an augmented image [12]. As a result, depending on the initial size of the original dataset, GANs may not be a viable option.
In contrast, warping augmentations such as geometric transformation alter original images in such a way that their labels are maintained [13], and they do not require a minimum number of original images. Most studies [10][13][14][15] that employed geometric transformation to augment original images concentrated on a single transformation operation such as rotation, flipping or scaling. To the best of our knowledge, very little research has examined fusions of several geometric transformation operations to produce augmented images. Hence, this research examines the data augmentation method known as geometric transformation and several of its transformation fusions on a small chili dataset. The augmented and original images are then fed into two deep learning models, a Convolutional Neural Network (CNN) and a Residual Network (ResNet-18), developed from scratch for chili disease recognition.
The contributions of this research can be summarised as follows. The findings show that the accuracy of chili disease recognition depends on the category of dataset used and the size of the deep learning model. The best recognition accuracy came from small datasets containing both original and augmented images that were fed to the larger model, in contrast to datasets with only original data (original datasets) and datasets with only augmented data (augmented datasets). Across all experiments, the deep learning models developed from scratch reached a maximum accuracy of 99.7%.
The remainder of this paper is organised as follows. The original dataset of healthy and diseased chili leaves produced for this research is explained in Section II. Section III delves into the transformation process of the data augmentation method, focusing on geometric transformation and several of its transformation fusions. The architectures of the deep learning models for feature extraction and recognition are then discussed in Section IV. Section V describes the experimental procedures and tests used to obtain the accuracy results, and the conclusion of this research is presented in Section VI.
www.ijacsa.thesai.org

II. CHILI LEAF IMAGE DATASET
In this research, the camera of an Oppo Reno 2 smartphone was used to capture images of chili leaves in Batu Pahat, Johor, Malaysia. The leaves were either healthy or showed indications of bacterial spot disease. Only 1200 original chili leaf images could be acquired due to the low quantity of chili crops at the research site. Of those 1200 images, 600 showed healthy chili leaves, while the remaining 600 showed diseased chili leaves. The images were captured in auto-focus mode at a resolution of 3000 × 4000 pixels before being resized to 224 × 224 pixels.

III. GEOMETRIC TRANSFORMATION
Data augmentation can be described as any mapping that artificially enlarges the original dataset through label-preserving transformations [13]:
$$\varphi : Y \rightarrow Z \tag{1}$$
where $Y$ is the original dataset and $Z$ is the augmented dataset of $Y$. The artificially enlarged dataset is therefore expressed as:
$$Y' = Y \cup Z \tag{2}$$
where $Y'$ contains the original dataset as well as the transformed images produced by $\varphi$. Label preservation means that if an image $d$ is an element of class $f$, then $\varphi(d)$ is likewise an element of class $f$. Since infinitely many mappings $\varphi$ fulfil this label-preservation criterion, this research assesses one augmentation method, namely geometric transformation.
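The label-preserving mapping described above can be illustrated with a minimal Python sketch. The tiny 2 × 2 "images", the `flip_left_right` transformation and the helper name `augment` are hypothetical stand-ins chosen for illustration; they are not the data or code used in this research.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]      # toy grayscale image: rows of pixel values
Sample = Tuple[Image, str]   # (image d, class label f)


def flip_left_right(image: Image) -> Image:
    """One possible label-preserving transformation: mirror every row."""
    return [row[::-1] for row in image]


def augment(dataset: List[Sample], phi: Callable[[Image], Image]) -> List[Sample]:
    """Apply phi to every image and carry its label over unchanged."""
    return [(phi(image), label) for image, label in dataset]


# Toy stand-ins for leaf images (hypothetical data, not from the paper).
Y = [([[1, 2], [3, 4]], "healthy"), ([[5, 6], [7, 8]], "diseased")]
Z = augment(Y, flip_left_right)   # augmented dataset
Y_prime = Y + Z                   # original images plus their transformed copies
```

Every element of `Z` keeps the label of the image it was derived from, which is the property that makes the enlarged dataset usable for supervised training.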
Geometric transformation is a data augmentation method that alters the geometry of an image by relocating its pixel values [16]. The fundamental pattern of a class in the image is preserved, but it is shifted to a new position and alignment. This research explores reflection, translation, rotation, shearing and scaling, as well as several fusions between them.
Reflection [17] mirrors an image about the horizontal ($x$) or vertical ($y$) axis. It increases the number of images in an original dataset by reversing the order of the rows or columns of the original image matrix. In a horizontal reflection, the left and right sides of the image are exchanged. The components $f_x$ and $f_y$ indicate the pixel's location after reflection on the $x$-axis, while $x$ and $y$ denote the coordinates of the object's original position in the image:
$$A:\quad f_x = -x, \qquad f_y = y \tag{3}$$
where $A$ is the process equation of reflection on the $x$-axis. In a vertical reflection, the image is turned upside down, so that its top and bottom are exchanged. With the same notation, $B$ is the process equation of reflection on the $y$-axis:
$$B:\quad f_x = x, \qquad f_y = -y \tag{4}$$
Translation [17] shifts an object in an image from one location to another. It can be performed in four directions (down, up, right and left), which helps prevent positional bias in a set of translated images. With $t_x$ and $t_y$ denoting the shifts along the $x$-axis and the $y$-axis, $C$ is the process equation of translation:
$$C:\quad f_x = x + t_x, \qquad f_y = y + t_y \tag{5}$$
Rotation [18] turns the original image to the left or to the right by an angle $\theta$ ranging from 1° to 359°. $D$ is the process equation of rotation:
$$D:\quad f_x = x\cos\theta - y\sin\theta, \qquad f_y = x\sin\theta + y\cos\theta \tag{6}$$
Shearing [17] alters the shape of the original image in a single direction, either along the $x$-axis or along the $y$-axis. Equation (7) shows shearing in the $x$-axis direction, whereas (8) shows shearing in the $y$-axis direction. Here $sh_x$ and $sh_y$ are the shear factors, and $E$ and $F$ are the process equations of images sheared in the $x$-axis and $y$-axis directions, respectively:
$$E:\quad f_x = x + sh_x\,y, \qquad f_y = y \tag{7}$$
$$F:\quad f_x = x, \qquad f_y = y + sh_y\,x \tag{8}$$
Scaling [18], often known as zooming or cropping, enlarges or shrinks the original image in order to view more information. The image is enlarged or shrunk from a starting position to a destination position. With $s_x$ and $s_y$ denoting the scale factors, $G$ is the process equation of scaling:
$$G:\quad f_x = s_x\,x, \qquad f_y = s_y\,y \tag{9}$$
An object in an original image that has been reflected, whether on the $x$-axis or the $y$-axis, can be translated by shifting the reflected image to a new location, resulting in a fusion of reflection and translation. Given that $H$ and $I$ are the process equations of images reflected on the $x$-axis and the $y$-axis, respectively, and then translated, the equations are as follows:
$$H:\quad f_x = -x + t_x, \qquad f_y = y + t_y \tag{10}$$
$$I:\quad f_x = x + t_x, \qquad f_y = -y + t_y \tag{11}$$
This research also includes the fusion of scaling and reflection. Given that $J$ and $K$ are the process equations of images scaled and then reflected on the $x$-axis and the $y$-axis, respectively, the equations are as follows:
$$J:\quad f_x = -s_x\,x, \qquad f_y = s_y\,y \tag{12}$$
$$K:\quad f_x = s_x\,x, \qquad f_y = -s_y\,y \tag{13}$$
Additionally, this research uses a fusion of scaling and shearing. Given that $L$ and $M$ are the process equations of images scaled and then sheared in the direction of the $x$-axis and the $y$-axis, respectively, the equations are as follows:
$$L:\quad f_x = s_x\,x + sh_x\,s_y\,y, \qquad f_y = s_y\,y \tag{14}$$
$$M:\quad f_x = s_x\,x, \qquad f_y = sh_y\,s_x\,x + s_y\,y \tag{15}$$
Finally, more than two geometric transformations can be fused. Given that $N$ and $O$ are the process equations of images scaled, then reflected on the $x$-axis and the $y$-axis, respectively, and lastly translated, the equations are as follows:
$$N:\quad f_x = -s_x\,x + t_x, \qquad f_y = s_y\,y + t_y \tag{16}$$
$$O:\quad f_x = s_x\,x + t_x, \qquad f_y = -s_y\,y + t_y \tag{17}$$
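The fusion of geometric transformations can be sketched in Python by writing each operation as a 3 × 3 homogeneous matrix and multiplying the matrices in the order in which the operations are applied. This is an illustrative sketch only: the function names, the parameter values, and the convention that a horizontal reflection negates the $x$ coordinate while a vertical reflection negates the $y$ coordinate are assumptions, not the implementation used in this research.

```python
import math

# 3x3 homogeneous matrices, so that translation can be fused by multiplication.

def flip_horizontal():
    return [[-1, 0, 0], [0, 1, 0], [0, 0, 1]]

def flip_vertical():
    return [[1, 0, 0], [0, -1, 0], [0, 0, 1]]

def translate(tx, ty):
    return [[1, 0, tx], [0, 1, ty], [0, 0, 1]]

def rotate(degrees):
    c, s = math.cos(math.radians(degrees)), math.sin(math.radians(degrees))
    return [[c, -s, 0], [s, c, 0], [0, 0, 1]]

def shear_x(k):
    return [[1, k, 0], [0, 1, 0], [0, 0, 1]]

def shear_y(k):
    return [[1, 0, 0], [k, 1, 0], [0, 0, 1]]

def scale(sx, sy):
    return [[sx, 0, 0], [0, sy, 0], [0, 0, 1]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def fuse(*steps):
    """Combine transformations given in the order they are applied."""
    combined = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    for step in steps:
        combined = matmul(step, combined)  # later steps multiply from the left
    return combined

def apply(matrix, x, y):
    """Map the original position (x, y) to its new position (f_x, f_y)."""
    fx = matrix[0][0] * x + matrix[0][1] * y + matrix[0][2]
    fy = matrix[1][0] * x + matrix[1][1] * y + matrix[1][2]
    return fx, fy

# Scaling, then horizontal reflection, then translation (a three-step fusion).
scaled_reflected_translated = fuse(scale(2, 2), flip_horizontal(), translate(5, 1))
print(apply(scaled_reflected_translated, 3, 4))  # (-1, 9)
```

Because the order of multiplication matters, scaling followed by shearing generally gives a different result from shearing followed by scaling; `fuse` keeps the order explicit by taking the steps in application order.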

IV. DEVELOPMENT OF DEEP LEARNING MODELS
In the leaf disease recognition domain, researchers have developed enhanced deep learning network architectures through various models [9][10] and applied them to chili disease recognition. This research employs two types of deep learning models, CNN and ResNet-18, which are developed from scratch using the Deep Network Designer [19]. The accuracy measure in [20] is used as the metric to evaluate the performance of these models. The architecture of each model is described in further detail in the following subsections.

A. CNN Architecture
The input, hidden and output layers are the three primary layers of a CNN model [21]. The hidden layers commonly comprise convolutional layers with a rectified linear unit (ReLU) function, pooling layers and fully connected layers. A convolutional layer contains a collection of filters that identify features of varying sizes. Each filter convolves across the input image by moving horizontally in fixed steps and then vertically in fixed steps until the whole image has been covered. A nonlinear activation function, the ReLU, is then applied to the outputs of the convolution. The pooling layer combines the outputs of each cluster of neurons from the convolutional layer into a single neuron. The pooled output is then passed to a fully connected layer, which multiplies it by a weight matrix and adds a bias vector before feeding the result to a softmax layer, which performs the classification (output). The architecture of the CNN model is shown in Fig. 1.
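The sequence of operations in a CNN can be traced with a small, dependency-free Python sketch: convolution, ReLU, pooling, a fully connected layer and softmax. The 5 × 5 input, the single 2 × 2 filter and the hand-picked weights are hypothetical values chosen so the arithmetic is easy to follow; the models in this research were built with a different tool and with many more filters and layers.

```python
import math

def conv2d(image, kernel, stride=1):
    """Slide the kernel over the image horizontally, then vertically."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(0, len(image[0]) - kw + 1, stride)]
            for i in range(0, len(image) - kh + 1, stride)]

def relu(feature_map):
    return [[max(0.0, v) for v in row] for row in feature_map]

def max_pool(feature_map, size=2):
    """Reduce each size x size cluster of outputs to a single value."""
    return [[max(feature_map[i + a][j + b]
                 for a in range(size) for b in range(size))
             for j in range(0, len(feature_map[0]) - size + 1, size)]
            for i in range(0, len(feature_map) - size + 1, size)]

def fully_connected(vector, weights, bias):
    """Multiply by the weight matrix, then add the bias vector."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    peak = max(logits)                      # subtract the max for stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 5x5 image with a bright vertical stripe.
image = [[0, 0, 1, 1, 0] for _ in range(5)]
kernel = [[-1, 1], [-1, 1]]                 # responds to dark-to-bright edges

features = relu(conv2d(image, kernel))      # 4x4 feature map
pooled = max_pool(features)                 # 2x2 pooled map: [[2, 0], [2, 0]]
flat = [v for row in pooled for v in row]

weights = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]   # two output classes
probabilities = softmax(fully_connected(flat, weights, [0.0, 0.0]))
print([round(p, 4) for p in probabilities])  # [0.8808, 0.1192]
```

In a trained network the filter and the fully connected weights are learned from data rather than set by hand.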

B. ResNet-18 Architecture
ResNet was proposed in [22] as a solution to the performance degradation and vanishing gradients caused by increasing the depth of a CNN model. Convolutional layers, pooling layers, fully connected layers, a softmax layer and shortcut connections make up the architecture of the ResNet-18 model shown in Fig. 2. The shortcut connections link the input of a group of layers directly to its output, bypassing the layers in between. There are two main kinds of pooling layers in the ResNet-18 architecture used in this research. The first is the max-pooling layer, which selects the maximum element from the region of the feature map covered by the filter. The second is the average pooling layer, which instead calculates the average value of the elements in that region of the feature map.
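The difference between the two pooling layers can be seen on a small example. The following Python sketch applies both reductions to the same hypothetical 4 × 4 feature map using non-overlapping 2 × 2 regions; the values and the helper names `pool` and `mean` are illustrative assumptions.

```python
def pool(feature_map, size, reduce):
    """Apply `reduce` to each non-overlapping size x size region."""
    return [[reduce([feature_map[i + a][j + b]
                     for a in range(size) for b in range(size)])
             for j in range(0, len(feature_map[0]) - size + 1, size)]
            for i in range(0, len(feature_map) - size + 1, size)]

def mean(values):
    return sum(values) / len(values)

# Hypothetical 4x4 feature map.
feature_map = [[1, 3, 2, 0],
               [5, 7, 4, 6],
               [9, 8, 1, 1],
               [2, 0, 3, 5]]

max_pooled = pool(feature_map, 2, max)    # [[7, 6], [9, 5]]
avg_pooled = pool(feature_map, 2, mean)   # [[4.0, 3.0], [4.75, 2.5]]
```

Max pooling keeps only the strongest response in each region, whereas average pooling summarises the whole region.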
The building block of ResNet-18 is shown in Fig. 3, with an input $x$ and the desired output $H(x)$. The block uses a shortcut connection that enables it to learn the residual $\mathcal{F}(x) = H(x) - x$ directly and to generate the desired output as $\mathcal{F}(x) + x$, hence avoiding the performance degradation and vanishing gradients caused by an excessive number of convolutional layers. The ResNet building block in Fig. 3 employs the residual mapping function [23] described below:
$$\mathcal{F} = W_2\,\sigma(W_1 x) \tag{18}$$
where $\sigma$ denotes the ReLU activation function and $W_1$ and $W_2$ are the weights of the two layers in the block. The output $y$, to which a second ReLU activation is then applied, is obtained as:
$$y = \mathcal{F}(x, \{W_i\}) + x \tag{19}$$
A linear transformation of the shortcut can also be applied by multiplying $x$ by $W_s$ in (19), as shown below:
$$y = \mathcal{F}(x, \{W_i\}) + W_s x \tag{20}$$
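A residual building block can be sketched in a few lines of Python. The sketch assumes the standard residual formulation, in which the block computes a residual from two weighted layers with a ReLU in between, adds the shortcut (optionally passed through a linear projection) and applies a second ReLU. It uses plain vectors and matrices in place of convolutions, and the weight values and the names `W1`, `W2` and `Ws` are hypothetical.

```python
def relu(vector):
    return [max(0.0, v) for v in vector]

def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def residual_block(x, W1, W2, Ws=None):
    """Return relu(F(x) + shortcut), with F(x) = W2 * relu(W1 * x)."""
    residual = matvec(W2, relu(matvec(W1, x)))
    shortcut = x if Ws is None else matvec(Ws, x)   # identity or projection
    return relu([r + s for r, s in zip(residual, shortcut)])

identity = [[1.0, 0.0], [0.0, 1.0]]
half = [[0.5, 0.0], [0.0, 0.5]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
double = [[2.0, 0.0], [0.0, 2.0]]

x = [1.0, -2.0]
print(residual_block(x, identity, half))          # [1.5, 0.0]
print(residual_block(x, identity, zeros))         # [1.0, 0.0]
print(residual_block(x, identity, half, double))  # [2.5, 0.0]
```

The second call shows why the shortcut helps deep networks: when the weighted layers contribute nothing, the block simply passes its (rectified) input through instead of degrading it.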

V. RESULTS AND DISCUSSION
The accuracy results for both the CNN and ResNet-18 models on the original and augmented datasets of healthy and diseased chili leaves were obtained through the experimental setup and testing detailed in the following section.

A. Experimental Setup
All of the models in the experiments run on MATLAB® with an Intel® Core™ i3 processor operating at 3.4 GHz. The models' networks are fed data from three categories of datasets: datasets with only original data (original datasets), datasets with only augmented data (augmented datasets), and datasets with both original and augmented data (original + augmented datasets). Only 1200 images from each category of dataset are fed into the models in order to preserve the data balance. Therefore, 40 images from the original datasets are chosen to be augmented, consisting of 20 random images of healthy chili leaves and 20 random images of diseased chili leaves. During the augmentation process, which comprises 15 geometric transformations, 600 augmented images are produced and kept in the augmented datasets, while the original images are discarded. For the original + augmented datasets, 300 images of original and augmented healthy chili leaves, as well as 300 images of original and augmented diseased chili leaves, are used. Table I shows the specific information for all the datasets in this research.
When the CNN and ResNet-18 models are fed with a dataset, 70% of the data in the dataset is used to train the models, while the remaining 30% is used to test them. During training, the hyperparameter settings [24] of both models, such as batch size, learning rate, maximum epoch, testing frequency and optimizer, are kept identical so that the two models are compared on equal terms. Table II summarises the fixed hyperparameter settings for both models. In each experiment, the accuracy of a developed model given an input dataset is determined using the following formula:
$$\text{Accuracy} = \frac{\text{number of correctly recognised images}}{\text{total number of test images}} \times 100\% \tag{21}$$
The accuracy results obtained by the developed models on the three categories of datasets (original datasets, augmented datasets, and original + augmented datasets) are shown in Table III.
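The dataset bookkeeping and the evaluation procedure described above can be sketched in Python. The sketch assumes a random shuffle with a fixed seed before the 70/30 split and computes accuracy as the percentage of correctly recognised test images; the function names, the seed and the toy labels are illustrative assumptions rather than the settings used in this research.

```python
import random

def train_test_split(samples, train_fraction=0.7, seed=0):
    """Shuffle a copy of the samples and cut it into train and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # fixed seed: an assumption
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy(predicted, actual):
    """Percentage of test images whose predicted label matches the true label."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)

# 40 selected images, each passed through 15 geometric transformations.
augmented_images = 40 * 15                  # 600

# A dataset of 1200 images split 70% / 30%.
train, test = train_test_split(range(1200))
print(len(train), len(test))                # 840 360

# Toy predictions: three of four labels are correct.
predicted = ["healthy", "diseased", "healthy", "healthy"]
actual = ["healthy", "diseased", "diseased", "healthy"]
print(accuracy(predicted, actual))          # 75.0
```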
Although both models were trained on 600 images per class, the average accuracy attained on the original datasets was only 95.8%. The geometric transformation method improved performance: the original + augmented datasets yielded the best results, with an average recognition accuracy of 97% across the two models. These findings indicate that the geometric transformation method improves the ability of the models to generalise [25] by modifying the orientation of the original images while retaining their original information.
On the other hand, the results from the augmented datasets showed that when the models were trained only with augmented images, the accuracy dropped on average by 8.6 and 9.8 percentage points compared with the original datasets and the original + augmented datasets, respectively. This is due to the black pixel areas (background areas) in the augmented images created by geometric transformation and the absence of the attention mechanism [26] that is found in the original images. The deep learning models treat more of the background areas of the augmented images as distinct regions during training, leading to lower accuracy.

VI. CONCLUSION
This research applied the data augmentation method known as geometric transformation and several of its transformation fusions to a small chili dataset and tested it on two deep learning models, CNN and ResNet-18. A clear improvement in accuracy was seen for both models after augmented images were added to the original datasets. The accuracy rose to 94.2% for the CNN model and 99.7% for the ResNet-18 model. This suggests that a combination of original and augmented images can improve the accuracy of the models substantially. The experiments also showed that ResNet-18 achieved the higher accuracy of the two models when no data augmentation was used.