Learning Deep Transferability for Several Agricultural Classification Problems

This paper addresses several critical agricultural classification problems in Vietnam, namely the classification of grain discoloration and the identification and classification of medicinal plants, by combining knowledge transferability with state-of-the-art deep convolutional neural networks. Grain discoloration disease of rice is an emerging threat to rice harvests in Vietnam and worldwide, and it requires specific attention because it causes qualitative loss of the harvested crop. Medicinal plants are an important element of indigenous medical systems, and these resources are usually regarded as part of a culture's traditional knowledge. Accurate classification is a prerequisite for any kind of intervention and recommendation of services. Hence, leveraging technology for automatic classification in these problems has become essential. Unfortunately, building and training a machine learning model from scratch is often infeasible due to the lack of hardware infrastructure and financial support, which severely limits the ability to deliver rapid solutions that meet the demand. For this purpose, the authors exploit transfer learning, which is the improvement of learning in a new prediction task through the transfer of knowledge from a related prediction task that has already been learned. By utilizing state-of-the-art deep networks re-trained upon our collected data, our extensive experiments show that the proposed combination achieves classification accuracies of 98.7% and 98.5% on our collected grain discoloration and medicinal plants datasets, respectively, within an acceptable training time on a normal laptop. A mobile application is also deployed to facilitate further integrated recommendation and services.
Keywords: Medicinal Plant Classification; Grain Discoloration Classification; Transfer Learning; Deep Learning


I. INTRODUCTION
Rice is not only a major food crop in Vietnam but also an important export product. The economies of several countries, including Vietnam, depend heavily on exporting this commodity; Vietnam is the 5th leading exporter of rice in the world [1], [2]. Rice is a highly nutritive cereal and is consumed as a staple food in most Asian countries [3], [4]. In Vietnam, the rice crop is subject to various diseases that affect its quality and reduce overall production. In recent years, a new yield-reducing disease, grain discoloration, has become a serious problem for rice crops [5], [6], [7], [8]. Rice grain discoloration is therefore considered a potential risk to rice-producing countries, and reports about this disease from many parts of the rice industry strongly call for a solution. Accurate classification of grain discoloration is essential before proposing any practical control schemes. Nevertheless, neither effective intervention nor accurate prediction offering a complete solution to the disease has been explored so far.
According to the World Health Organization, about 70% of the world's population relies on plants for primary health care, and some 35,000 to 70,000 species have been used as medicaments [9], a figure corresponding to 14-28% of the 250,000 plant species estimated to occur around the world [10], [11], and equivalent to 35-70% of all species used worldwide [10]. Medicinal plants are of crucial importance to human health. The medicinal value of these plants, both wild and cultivated, lies in chemical substances that produce a physiological action on the human body. Many of these indigenous medicinal plants are used as food ingredients and for medical purposes [13], [15], [16]. Furthermore, the special significance of medicinal plants in conservation stems from the major cultural, livelihood, and economic roles that they play in many people's lives. Various sets of recommendations have been compiled relating to the conservation of medicinal plants [12]. Vietnam is home to an estimated 12,000 species of high-value plants, of which 10,500 have been identified, and approximately 3,780 species have medicinal properties. Vietnamese medicinal plants account for approximately 11% of the 35,000 species of medicinal plants known worldwide. The market for Vietnamese herbal products and medicinal dietary supplement products is estimated at US $100 million [14]. With its abundant indigenous plant varieties, medicinal plants, and associated traditional knowledge, Vietnam's biodiversity has undoubtedly played a crucial role in sustaining livelihoods over many generations through the provision of food security and health care [20], especially for local people in remote areas who depend directly on resource exploitation. People in many rural areas of Vietnam classify plants according to their medicinal values. Classification is considered an important activity in the preparation of herbal medicines [21].
Despite the importance of research on medicinal plants, only a few studies have been conducted in the literature. The most recent intensive work was done decades ago [17], [18]. It is necessary to make people realize the importance of medicinal plants before they become extinct. The knowledge of herbal medicines should be maintained and passed on to future generations. It is important for practitioners and botanists to be able to identify and classify medicinal plants using computers and mobile devices. Accurate classification of medicinal plants is essential before developing any recommendation and services.
From the machine learning perspective, the mentioned problems could be addressed by adopting a new rapid solution that brings experts, farmers, policymakers, and strategists together. Traditionally, a major assumption in many machine learning algorithms is that the training data and future data must lie in the same feature space. Such algorithms address isolated tasks. Any differences between the two are assumed to be removed before learning or to be negligible during model training. However, in many real-world applications, this assumption may not hold. The isolation demands an entire learning procedure for every task, from dataset collection to model training, model evaluation, and model tuning, which in turn demands computing infrastructure and financial support. Transfer learning attempts to change this by developing methods to transfer knowledge learned in one or more source tasks and use it to improve learning in a related target task [32], [33], [24]. Modern object classification models have millions of parameters and can take weeks to fully train. Transfer learning shortcuts much of this work by taking a model fully trained on a set of predefined categories, such as ImageNet [22], [23], and retraining from the existing weights for the new classes. The goal of transfer learning is to improve learning in the target task by leveraging knowledge from the source task. Some work in transfer learning is in the context of inductive learning [47] and involves extending well-known classification and inference algorithms such as Markov logic networks, Bayesian networks, and neural networks.
Transfer learning methods tend to be highly dependent on the machine learning algorithms being used to learn the prediction tasks, and can often be considered mere extensions of those algorithms [34]. Tremendous progress has been made in image classification and recognition, primarily thanks to the availability of large-scale annotated datasets. Since Krizhevsky et al. [25] won the ImageNet 2012 competition, there has been much interest and work toward the revival of deep convolutional neural networks [26], especially in the task of image classification [27], [28], [29]. However, in this research, we aim neither to maximize absolute performance nor to build a complete model from scratch, but rather to study the transfer results of several well-known convolutional architectures. We use the reference implementation provided by TensorFlow [30], [31] so that our experimental results are comparable, extensible, and usable for future research. TensorFlow is a modern machine learning framework that provides considerable power and opportunity to developers and data scientists. One of those opportunities is to use transfer learning to reduce training time and complexity by repurposing pre-trained models.
Building upon these key insights, we propose design recommendations for the classification of grain discoloration and medicinal plants. To the best of our knowledge, this paper makes several contributions:
• Firstly, we have collected grain discoloration samples and medicinal plant samples that can serve as benchmark datasets.
• Secondly, we have combined deep learning via convolutional neural networks and the idea of transfer learning to address several real-world agricultural classification problems, a combination previously unstudied in the literature.
• Thirdly, we have shown that knowledge from a very different source task can be very helpful to a target task even if the source task is not closely related.
• Lastly, a mobile application, which offers one of the most affordable ways for millions of people to access information, is deployed to facilitate further integrated recommendation and services.

II. PROPOSED METHODOLOGY
In recent years, the growth of classification datasets and the manifold directions of object classification research have created an unprecedented need and a great opportunity for a thorough evaluation of the current state of the field of categorical object detection [22], [48]. Taking the ImageNet [23] dataset as an example, it is a dataset of over 15 million labelled high-resolution images in around 22,000 categories. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) uses a subset of ImageNet with around 1000 images in each of 1000 categories. In all, there are roughly 1.2 million training images, 50,000 validation images and 100,000 testing images.
Convolutional neural networks (CNNs) have recently achieved outstanding image classification performance in large-scale challenges [25], [35], [22], [36], [38]. The success of CNNs stems from their ability to learn rich image representations, since in principle an arbitrary number of hidden layers can be stacked. However, learning CNNs requires a very large number of annotated image samples and the estimation of millions of model parameters. This property prevents the direct application of CNNs to problems with limited training data. Deep neural networks trained on images tend to learn first-layer features that resemble either Gabor filters or color blobs [37]. Such first-layer features appear not to be specific to a particular dataset, but are general in that they are applicable to different tasks. Features eventually transition from general to specific between the first layer and the last layer of the network [37]. Accordingly, large-scale datasets can be used to train the learning models, and the learned models are then applied to a particular target task where only the parameters of the last layer are re-weighted based on the target dataset. The idea of transferring knowledge along deep neural networks has been explored in many previous studies [37], [39], [40], [46]. Following that research direction, we explore the performance of several state-of-the-art convolutional neural networks on our collected data.
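The last-layer re-weighting described above can be illustrated with a minimal pure-Python sketch. It is not the authors' TensorFlow implementation: the two-dimensional toy vectors stand in for the activations of a frozen pre-trained network, and the function names, learning rate, and epoch count are hypothetical choices made only for this illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_last_layer(features, labels, num_classes, lr=0.1, epochs=200):
    """Fit only a softmax output layer on fixed (frozen) feature vectors.

    The feature extractor is never updated; only these weights and biases
    are learned, which is what makes the retraining cheap.
    """
    dim = len(features[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    for _ in range(epochs):
        for x, y in zip(features, labels):
            scores = [sum(w * v for w, v in zip(weights[c], x)) + biases[c]
                      for c in range(num_classes)]
            probs = softmax(scores)
            for c in range(num_classes):
                # Gradient of the cross-entropy loss w.r.t. the class score.
                grad = probs[c] - (1.0 if c == y else 0.0)
                biases[c] -= lr * grad
                for j in range(dim):
                    weights[c][j] -= lr * grad * x[j]
    return weights, biases


def predict(weights, biases, x):
    """Return the index of the highest-scoring class."""
    scores = [sum(w * v for w, v in zip(wc, x)) + b
              for wc, b in zip(weights, biases)]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy stand-in for frozen-network features: three well-separated clusters.
rng = random.Random(0)
centers = [(2.0, 0.0), (0.0, 2.0), (-2.0, -2.0)]
features, labels = [], []
for label, (cx, cy) in enumerate(centers):
    for _ in range(20):
        features.append([cx + rng.uniform(-0.5, 0.5),
                         cy + rng.uniform(-0.5, 0.5)])
        labels.append(label)

weights, biases = train_last_layer(features, labels, num_classes=3)
accuracy = sum(predict(weights, biases, x) == y
               for x, y in zip(features, labels)) / len(labels)
```

In a real setting the feature vectors would come from the penultimate layer of the pre-trained network, so only one small linear layer has to be optimized.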

A. Inception-Based Models
Google introduced the Inception deep convolutional architecture as GoogLeNet, or Inception-v1, in [42]. The Inception architecture was later refined in several ways, mainly by the introduction of batch normalization to reduce internal covariate shift [43]. The third iteration of the architecture [44] added further factorization ideas and is referred to as Inception-v3 in our implementation. The authors are aware of the latest version, Inception-v4 [45], but do not include it in this work. We reserve it for further investigation, where we will conduct a thorough performance comparison of various models on additional collected datasets. In this section, the authors summarize Inception-v3, which serves as the basis for their implementation.
The salient parts of an image can vary greatly in size. Thus, choosing the right kernel size for the convolution operation is very hard. A small kernel is favored for information that is locally distributed, while a large kernel is favored for information that is globally distributed. The problem is addressed by the idea of inception [42], where filters of multiple sizes operate on the same level. As a result, the network may get wider, but the computational expense is significantly reduced. Moreover, neural networks perform better when convolutions do not alter the dimensions of the input drastically. Reducing the dimensions too aggressively may cause loss of useful information, although it improves computational efficiency; this problem is known as a representational bottleneck [44]. Another improvement of Inception-v3 over the original Inception comes from factorizing convolutions. The aim of factorizing convolutions is to reduce the number of connections and/or learned parameters without diminishing the network's effectiveness. With an appropriate factorization, convolutions become more efficient in terms of computational expense and architectural complexity. For example, a 3 × 3 convolution is 2.78 times less computationally expensive than a 5 × 5 convolution, see Fig. 1. Thus, replacing a 5 × 5 convolution with two stacked 3 × 3 convolutions, which cover the same receptive field with 18 instead of 25 weights, reduces the computation. Similarly, n × n convolutions are factorized into 1 × n and n × 1 convolutions, see Fig. 2. Last but not least, the third inception module is used for promoting high-dimensional representations, see Fig. 3. To recapitulate, the outline of our adapted implementation architecture is described in Table V and Fig. 4.
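The arithmetic behind these factorizations can be checked with a short sketch. It counts weights per input/output channel pair only, which is a simplifying assumption; the helper names are hypothetical and not part of any library.

```python
def conv_weights(kernel_h, kernel_w):
    """Weights (and multiplications per output position) of one kernel
    for a single input/output channel pair."""
    return kernel_h * kernel_w


# A 5x5 kernel costs 25/9 times as much as a 3x3 kernel.
ratio_5x5_to_3x3 = conv_weights(5, 5) / conv_weights(3, 3)

# Two stacked 3x3 kernels cover the same 5x5 receptive field.
two_3x3 = 2 * conv_weights(3, 3)
saving_two_3x3 = 1 - two_3x3 / conv_weights(5, 5)


def asymmetric_saving(n):
    """Relative saving when an n x n kernel is replaced by 1 x n then n x 1."""
    full = conv_weights(n, n)
    factorized = conv_weights(1, n) + conv_weights(n, 1)
    return 1 - factorized / full


print(f"5x5 vs 3x3 cost ratio: {ratio_5x5_to_3x3:.2f}")
print(f"saving of two 3x3 over one 5x5: {saving_two_3x3:.0%}")
print(f"saving of 1x3 + 3x1 over 3x3: {asymmetric_saving(3):.0%}")
```

The saving from the asymmetric factorization grows with n, since the factorized cost is 2n while the full cost is n squared.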

B. Depthwise Separable Convolution based Model
A standard convolutional layer takes as input a $D_F \times D_F \times M$ feature map $F$ and produces a $D_G \times D_G \times N$ feature map $G$, where $D_F$ is the spatial width and height of a square input feature map, $M$ is the number of input channels, $N$ is the number of output channels, and $D_G$ is the spatial width and height of the output feature map. The layer is parameterized by a convolution kernel $K$ of size $D_K \times D_K \times M \times N$, where $D_K$ is the spatial dimension of the kernel. The mapping from $F$ to $G$ is done by applying this kernel of size $D_K \times D_K \times M \times N$ to the input.
MobileNets are a class of convolutional neural networks designed by researchers at Google [31]. They are described as "mobile-first" [50], [51] in that they are architected from the ground up to be resource-friendly and run quickly. They are designed to maximize accuracy while being mindful of the restricted resources of an on-device or embedded application. The models are small, low-latency, and low-power, and they are parameterized to meet the resource constraints of a variety of use cases. They can be built upon for classification, detection, embeddings, and segmentation, similar to how other popular large-scale models are used. The main difference between the MobileNet architecture and a traditional CNN is that, instead of a single 3 × 3 convolution layer followed by batch normalization [43] and ReLU [52], MobileNets split the convolution into a 3 × 3 depthwise convolution (see Fig. 6) and a 1 × 1 pointwise convolution (see Fig. 7)². The depthwise convolution applies a single filter to each input channel, while the pointwise convolution combines the outputs of the depthwise ones. In this section, we describe the model in more detail.
The standard convolution operation has the effect of filtering features based on the convolutional kernels and combining them to produce a new representation. In MobileNets, however, the filtering and combination processes are split into two separate stages by using the idea of depthwise separable convolution, where the depthwise convolution and the pointwise convolution perform the filtering stage and the combination stage respectively. The depthwise convolution can be written as:
$$\hat{G}_{k,l,m} = \sum_{i,j} \hat{K}_{i,j,m} \cdot F_{k+i-1,\,l+j-1,\,m}$$
where $\hat{K}$ is the depthwise convolutional kernel of size $D_K \times D_K \times M$. The $m$-th filter in $\hat{K}$ is applied to the $m$-th channel in $F$ to produce the $m$-th channel in $\hat{G}$. Hence, the computation cost of the depthwise convolution is the following:
$$D_K \cdot D_K \cdot M \cdot D_F \cdot D_F$$
Because the depthwise convolution only filters the input channels, we need to combine them to create new features by computing a linear combination of the output of the depthwise convolution via a 1 × 1 pointwise convolution. As a result, the computation cost of the depthwise separable convolution is the following:
$$D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F$$
which is the sum of the costs of the depthwise and the 1 × 1 pointwise convolutions. This factorization significantly reduces the computational cost [53]. More precisely, relative to the cost $D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F$ of a standard convolution, the reduction in computation is the following:
$$\frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{1}{N} + \frac{1}{D_K^2}$$
Although the base MobileNets architecture is already light and low-latency, on-device applications may require the model to be lighter and faster. To build such less computationally expensive architectures, the model exposes two hyperparameters, i.e. the width multiplier and the resolution multiplier, that we can tune to fit the resource and/or accuracy trade-off of our implemented model. The width multiplier allows us to thin the network, while the resolution multiplier changes the input dimensions of the image, reducing the internal representation at every layer. Let α and ρ be the width multiplier and the resolution multiplier respectively.
² Image courtesy of Matthijs Hollemans [54].
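The cost accounting for depthwise separable convolutions can be sketched as follows. The formulas are the multiply-add counts from the MobileNets paper [53]; the layer dimensions in the example are hypothetical and were chosen only to make the numbers concrete.

```python
def standard_conv_cost(d_k, m, n, d_f):
    """Multiply-adds of a standard convolution: D_K*D_K*M*N*D_F*D_F."""
    return d_k * d_k * m * n * d_f * d_f


def depthwise_cost(d_k, m, d_f):
    """Multiply-adds of the depthwise (filtering) stage: D_K*D_K*M*D_F*D_F."""
    return d_k * d_k * m * d_f * d_f


def pointwise_cost(m, n, d_f):
    """Multiply-adds of the 1x1 pointwise (combination) stage: M*N*D_F*D_F."""
    return m * n * d_f * d_f


def separable_conv_cost(d_k, m, n, d_f):
    """Total cost of a depthwise separable convolution."""
    return depthwise_cost(d_k, m, d_f) + pointwise_cost(m, n, d_f)


# Hypothetical layer: 3x3 kernel, 512 input and output channels, 14x14 map.
d_k, m, n, d_f = 3, 512, 512, 14
ratio = separable_conv_cost(d_k, m, n, d_f) / standard_conv_cost(d_k, m, n, d_f)
closed_form = 1 / n + 1 / d_k ** 2  # 1/N + 1/D_K^2

print(f"separable / standard cost: {ratio:.4f}")
print(f"closed form 1/N + 1/D_K^2: {closed_form:.4f}")
```

For 3 × 3 kernels the ratio is dominated by the 1/9 term, so the separable form needs roughly 8 to 9 times fewer multiply-adds than the standard convolution.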
While the role of the width multiplier α is to thin a network uniformly at each layer, the role of the resolution multiplier ρ is to reduce the resolution of the input image as well as the internal representation of every layer. With the width multiplier, the number of input channels $M$ becomes $\alpha M$ and the number of output channels $N$ becomes $\alpha N$, so the computational expense of a depthwise separable convolution with α is the following:
$$D_K \cdot D_K \cdot \alpha M \cdot D_F \cdot D_F + \alpha M \cdot \alpha N \cdot D_F \cdot D_F$$
where α ∈ (0, 1] with the typical settings of {1, 0.75, 0.5, 0.25}. In practice, α = 1 gives the baseline MobileNets and α < 1 gives reduced MobileNets. Similarly, the computational expense for the core layers of the network as depthwise separable convolutions with width multiplier α and resolution multiplier ρ is calculated as:
$$D_K \cdot D_K \cdot \alpha M \cdot \rho D_F \cdot \rho D_F + \alpha M \cdot \alpha N \cdot \rho D_F \cdot \rho D_F$$
where ρ ∈ (0, 1] is set implicitly so that the input resolution of the network is one of {224, 192, 160, 128}. In practice, ρ = 1 gives the baseline MobileNets and ρ < 1 gives reduced-computation MobileNets.
Thanks to the idea of depthwise separable convolution, the network architecture is lighter and, consequently, the computational expense is significantly reduced. The width multiplier reduces the computational cost and the number of parameters roughly quadratically, by about α², and the resolution multiplier reduces the computational cost by ρ². Given α ∈ {1, 0.75, 0.5, 0.25} and image resolution ∈ {224, 192, 160, 128}, Table I compares a full convolution model, an inception-based model, and the 16 combinations of α and ρ in terms of the number of fused multiplication and addition operations and the number of learned parameters. Readers should refer to the original papers for further details.
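To give a feel for how the 16 combinations of α and input resolution scale the cost, the following sketch evaluates one depthwise separable layer over the whole grid. The layer dimensions are hypothetical, and the sketch ignores the integer rounding of channel counts and feature-map sizes that a real network applies, so the values are approximations rather than the entries of Table I.

```python
def separable_cost(d_k, m, n, d_f, alpha=1.0, rho=1.0):
    """Cost of a depthwise separable layer with width multiplier alpha
    and resolution multiplier rho applied (no integer rounding)."""
    m_a, n_a, d_f_r = alpha * m, alpha * n, rho * d_f
    depthwise = d_k * d_k * m_a * d_f_r * d_f_r
    pointwise = m_a * n_a * d_f_r * d_f_r
    return depthwise + pointwise


alphas = [1.0, 0.75, 0.5, 0.25]
resolutions = [224, 192, 160, 128]

# Hypothetical baseline layer: 3x3 kernel, 512 -> 512 channels, 14x14 map.
base = separable_cost(3, 512, 512, 14)

relative = {}
for alpha in alphas:
    for res in resolutions:
        rho = res / 224  # rho is implied by the chosen input resolution
        cost = separable_cost(3, 512, 512, 14, alpha, rho)
        relative[(alpha, res)] = cost / base

for (alpha, res), value in sorted(relative.items(), reverse=True):
    print(f"alpha={alpha:<4} resolution={res}: {value:.3f} of baseline cost")
```

The resolution multiplier scales the cost by exactly ρ², while the width multiplier scales it by slightly more than α² because the depthwise term shrinks only linearly in α.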

A. Dataset Collection
In this work, we have evaluated the proposed models on different data collections. The first dataset is samples of grain discoloration which is one of the most common diseases on rice in the Mekong Delta. The second dataset is a collection of medicinal plants which is essential for indigenous medical systems in Vietnam. We describe them in more details within this section.
1) Grain discoloration: The Ministry of Agriculture and Rural Development of Vietnam has promulgated QCVN 01-166:2014/BNNPTNT, the national technical regulation on surveillance methods for rice pests [41]. The regulation describes how to label grain discoloration. We randomly select a rice plant in a rice-growing field where grain discoloration disease is observed. By counting the number of affected grains, called m, and the total number of grains, called n, on the plant, the percentage of grain discoloration, called p, is calculated by the following equation: $p = \frac{m}{n}$. Then, based on the value of p, rice experts classify the plant into one of four levels of contamination intensity [41]. Table II shows the ranges of p and the corresponding levels. Level 1 is the least intensively contaminated, whereas level 4 is the most intensively contaminated. More than 1000 samples were collected from different rice-growing areas of South Vietnam with the help of rice experts from the Cuu Long Delta Rice Research Institute as well as farm owners. Several photos taken by rice experts are presented in Fig. 8. We placed a white board under each sample while taking pictures in order to isolate the background. All the collected samples were assigned to three different rice experts, who annotated the labels separately. We keep only the samples to which all three experts assigned the same label. At the end of the collection procedure, 566 samples are retained. We show their distribution over the four intensity levels in Table III.
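The labelling steps above can be sketched in a few lines. The sketch computes the ratio p = m/n and keeps only samples on which all three experts agree; the sample identifiers and labels are made up for illustration, and the thresholds of Table II are deliberately not reproduced here.

```python
def discoloration_ratio(affected, total):
    """Return p = m / n for one plant (m affected grains out of n)."""
    if total <= 0:
        raise ValueError("total number of grains must be positive")
    if not 0 <= affected <= total:
        raise ValueError("affected grains must lie between 0 and total")
    return affected / total


def keep_unanimous(annotations):
    """Keep samples whose expert labels are all identical.

    annotations maps a sample id to the tuple of labels assigned
    independently by the experts; the result maps id to the agreed label.
    """
    return {sample_id: labels[0]
            for sample_id, labels in annotations.items()
            if len(set(labels)) == 1}


# Hypothetical annotations from three experts (levels 1 to 4).
annotations = {
    "sample_001": (1, 1, 1),
    "sample_002": (2, 3, 2),  # experts disagree, so the sample is dropped
    "sample_003": (4, 4, 4),
}
retained = keep_unanimous(annotations)
p = discoloration_ratio(affected=5, total=20)
```

Requiring unanimous agreement trades dataset size for label quality, which is consistent with more than 1000 collected samples being reduced to 566 retained ones.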

2) Medicinal plants: Approximately 5800 samples were collected from different growing areas of South Vietnam with the help of botanical experts from the botanic garden of Tay Do University as well as garden owners. Several photos taken by the authors and garden owners are presented in Fig. 9. As with the grain discoloration dataset, we placed a white board under each sample while taking pictures in order to isolate the background. At the end of the collection procedure, 5816 samples from 20 different classes are retained. We show their distribution over the classes in Table IV.

B. Implementation and Results
In our experiments, we set the required model hyperparameters as follows. The learning rate is 0.01 and the number of epochs is 2000. We also tried other learning rates, e.g. {0.1, 0.001}, and other numbers of epochs, e.g. {1500, 2500, 3000}; however, the results differ only slightly. The model converges at around the 1000th epoch on the training set and the 1500th epoch on the test set. We re-train the last layer of the models using our collected dataset. We randomly split each dataset into a training set, a validation set, and a test set using an 80/10/10 split.
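A random 80/10/10 split can be sketched as below. This is an illustration rather than the splitting code used in the experiments; the fixed seed and the function name are assumptions introduced here for reproducibility.

```python
import random


def split_dataset(samples, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle the samples and split them into train/validation/test parts.

    The test set receives whatever remains after the train and validation
    parts, so no sample is lost to rounding.
    """
    items = list(samples)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train_frac)
    n_val = int(len(items) * val_frac)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test_set = items[n_train + n_val:]
    return train, val, test_set


# Example with as many items as the retained grain discoloration samples.
train, val, test_set = split_dataset(range(566))
print(len(train), len(val), len(test_set))
```

With several classes of unequal size, a stratified split per class may be preferable so that every level or species appears in all three parts.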
The input size of the depthwise separable convolution based model is n × n × 3 for height, width, and channels respectively, with n ∈ {224, 192, 160, 128}. The input size of the inception-based model is 299 × 299 × 3 by default. We also tried the sizes {244, 192}; however, the results are similar. These resolutions are common settings for running deep convolutional networks. The classification decision is made at the softmax layer, whose output is the probability distribution over the investigated labels. For each combination, we re-ran the model several times, and the accuracy scores were unchanged. The inception-based model is described in Table V and Fig. 4, whereas the depthwise separable convolution based model is described in Table VI and Fig. 12.
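The decision step at the softmax layer can be sketched as follows: raw class scores are turned into a probability distribution, and the most probable labels are reported. The scores and label names below are invented for illustration and do not come from the trained models.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to one."""
    peak = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, labels, k=3):
    """Return the k most probable (label, probability) pairs."""
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical raw scores for the four grain discoloration levels.
labels = ["level 1", "level 2", "level 3", "level 4"]
probs = softmax([0.2, 2.5, 0.1, -1.0])
best_label, best_prob = top_k(probs, labels, k=1)[0]
print(best_label, round(best_prob, 3))
```

Reporting the top few labels with their probabilities, rather than a single label, lets a user judge how confident the model is.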
Our experiments were conducted on a normal laptop with a Core i7-6500U processor at 2.5GHz clock speed and 16GB of RAM. The laptop also has a low-end NVIDIA GeForce 940MX GPU. The peak RAM required by our models is 1.8GB and 1.4GB for the grain discoloration and the medicinal plants datasets respectively. Training takes around 10 minutes and 45 minutes to complete 2000 epochs with a learning rate of 0.01 for the grain discoloration and the medicinal plants datasets respectively. During the training procedure, the parameters of the last layer of the model were re-weighted. The hyperparameter space is as described previously, and only the overall best performance is reported. The implementation achieves a best classification accuracy of 98.7% on the grain discoloration dataset and 98.5% on the medicinal plants dataset. Even the lowest-scoring models achieve good accuracy. The complete performance of the 17 implementations in the experiment is presented in Tables VII and VIII.

C. Mobile App Deployment
Mobile communication technology has quickly become the world's most common way of sharing information and delivering services. A mobile application, which offers one of the most affordable ways for millions of people to access information, should be developed to facilitate further integrated recommendation and services. To that end, we have deployed an Android application, called medicinal plant recognizer, built on our experimental models for the task of medicinal plant classification. After training our models, we integrate them into the mobile app, which is used in two different scenarios, namely real-time and offline prediction. In the first scenario, the user points an Android smartphone at an arbitrary medicinal plant and the prediction is made instantly. In the second scenario, a saved photo is added to the application and the prediction is made afterwards. The output is the probability distribution of the plant over the labels. Fig. 11 shows a demonstration of our mobile application.

IV. REMARKS AND DISCUSSION
One of the biggest advantages of combining transfer learning with state-of-the-art deep convolutional neural networks is that it significantly reduces the demand for hardware infrastructure and the total training time. It helps developing countries, like Vietnam, come up with solutions in a timely and affordable manner. Instead of training a model on many high-end GPUs for weeks [42], [49] or even months [55], the pre-trained model is reused and only the parameters of the last layer are re-weighted, which takes at most 45 minutes in our experiments. The resulting classification accuracy is high, at 98.7% and 98.5% on the two datasets. Admittedly, the models were picked in a somewhat ad hoc manner, with the main constraint being that computation and rapid deployment remain feasible with limited resources.
One interesting phenomenon to note in Fig. 10 is that the model might be overfitting. More precisely, the model reaches 100% accuracy on the training set early, but its accuracy fluctuates on the test set. We tried several values of the learning rate and the number of epochs but observed similar behavior. This observation strongly suggests that the models would benefit from a larger volume of data.
The experimental results in this paper point to several further research directions. Firstly, we have collected benchmark datasets of grain discoloration and medicinal plants that serve as a preliminary step for further development. We aim to collect the 70 most commonly used herbal plants described in Decision No. 4664/QD-BYT [19] of Vietnam's Ministry of Health. Any recommendation system should be built upon accurate classification results. Secondly, we have shown that the combination of deep convolutional neural networks and knowledge transferability achieves notable results. Thirdly, we have addressed several agricultural classification problems that had been unstudied in the literature. Last but not least, we have deployed a mobile version of the model to reach more users and development practitioners.

V. CONCLUSION
In this paper, we have proposed using an adapted deep learning architecture and investigated the idea of transfer learning on real-world classification problems. Although the experimental categories are not originally included in the ImageNet dataset, the combination works well, which shows that this direction is worth investigating. The proposed transfer methodology performs well on unseen images of grain discoloration and medicinal plant samples. A mobile app based on the best version of the depthwise separable convolution based model is also deployed. This work assists people in real-world classification and identification problems, which are essential tasks in agricultural research.