Performance Analysis of Deep Neural Network based on Transfer Learning for Pet Classification

Deep learning frameworks have progressed beyond human recognition capabilities, and it is now the right time to optimize them for implementation on embedded platforms. Present deep learning architectures support learning capabilities, but they lack the flexibility to apply learned knowledge to tasks in other, unfamiliar domains. This work tries to fill this gap with a deep neural network-based solution for object detection in unrelated domains, with a focus on reducing the footprint of the developed model. Knowledge distillation provides efficient and effective teacher-student learning for a variety of visual recognition tasks. A lightweight student network can be easily trained under the guidance of high-capacity teacher networks. The teacher-student architecture implemented on binary classes shows a 20% improvement in accuracy within the same training iterations using the transfer learning approach. The scalability of the student model is tested with binary, ternary, and multiclass tasks, and the performance is compared on the basis of inference speed. The results show that the inference speed does not depend on the number of classes. For similar recognition accuracy, the inference speed is about 50 frames per second, or 20 ms per image. Thus, this approach can be generalized as per the application requirement with minimal changes, provided the dataset format is compatible.
Keywords: Machine learning; knowledge distillation; transfer learning; domain adaptation


I. INTRODUCTION
Deep neural networks are thriving due to vast data availability, newer complex models, and heterogeneous compute capacity. The ease of data accumulation and its open-source availability are opening new doors for the research community, and new models for solving real-world problems with that data appear almost every day. Crunching the data is also getting cheaper day by day, and one no longer requires a personal high-end, custom-configured system for this job; it is offloaded to cloud-based solutions provided by Amazon, Google, and Microsoft. Traditional machine learning and data mining algorithms make predictions using statistical models that are trained on labelled or unlabelled training datasets. As labelled data may be too scarce in practical applications to build a good classifier, semi-supervised classification is done using a large amount of unlabelled data and a small amount of labelled data [1], [2], [3]. In [4], the problem of how to deal with noisy class labels is explored. Similarly, in [5], cost-sensitive learning is considered. In [6], it is shown that a minimum network depth is vital for model performance. All these approaches assumed that the distributions of the labelled and unlabelled data were the same.
For implementation on edge devices, the model size can be cut down by compression techniques applied at various levels: the model, the data, and the computation. The classic AlexNet [7] was trained on the ImageNet dataset and performs 2.27 billion operations with 238 MB of memory for storing the model data itself. The compressed model SqueezeNet [8] performs 2.17 billion operations but with a smaller footprint of 4.8 MB, while Darknet [9], an open-source framework for YOLO [10], performs fewer than 1 billion operations with a footprint of 28 MB. Note that this comparison assumes a baseline accuracy of 80 per cent in recognizing the labelled visuals. It does not include the run-time memory required while performing the computation, which is not directly proportional to the number of operations performed, as neural networks are nonlinear models. Also, compression and decompression take additional computation power. MobileNet [11] tries to address this problem to an extent. Although these models are smaller from the storage perspective, they may fail to give a real-time response when performing inference on a lightweight platform due to resource constraints. To make them predict with a high confidence value [12], the semantic segmentation approach from [13], [14] is used by [15], [16], [17].
The practical deployment of deep neural networks in real-life scenarios is quite the opposite of the earlier description. IoT-based heterogeneous devices have resource constraints that limit the use of such networks on them. They mostly offload the computation to the cloud, but that solution is not always feasible due to the latency involved. This work explores the knowledge distillation approach in deep neural networks for IoT edge devices in real-time applications. The contribution of this work is to train a smaller model for a lightweight target platform with a negligible loss of accuracy. The proposed lightweight model can be easily customized to different domains and easily ported to IoT-based edge devices. Section II details why deep learning on heterogeneous edge devices is difficult to implement. In Section III, the state-of-the-art approaches for model reduction are presented. Section IV delves into the knowledge distillation approach and how it can be used on edge devices, followed by the experimental setup and performance analysis of the implementation.
80 | Page www.ijacsa.thesai.org

II. DEEP LEARNING ON EDGE DEVICES
Convolutional neural network models have evolved to surpass human capabilities in the image classification task, but when it comes to their deployment on edge devices, their use is limited by various resource constraints, as described below:

A. Limited Data
A large dataset is not available in all circumstances for training the network, and even when it is available, using it may be quite expensive in terms of time and feasibility. Data privacy is another factor that forces work with less data, locally on the device itself.

B. Limited Model Footprint
The models that drove the growth of DNNs, with accuracy surpassing that of humans, are quite bulky in terms of memory requirements. This memory may be needed either to store the model or to store the millions of weights calculated at runtime. For practical implementation, the memory footprint of the application should be small enough to fit into an embedded device.

C. Limited Computation
Even with less data and smaller models, the solution may not work out. Because of the millions of intermediate weight computations involved during the model run, it may require desktop- or server-class capability to finish the task in real time. Such computational latency is not tolerable in practical applications involving heterogeneous edge devices. As heterogeneous edge devices have limited capabilities, one needs to devise ways to eliminate or reduce the causes of these limitations. The next section describes this in detail.

III. DEEP NEURAL NETWORK REDUCTION
Though DNNs have a tremendous diversity of structures, the core computation of a network is still a variation of matrix multiplication or, more precisely, multiply-and-accumulate (MAC) operations. The factors that affect the number of MAC operations are batch size, image dimensions, filter type, number of channels, kernel size, and activation size. These, combined over every neuron-to-neuron connection, make up the millions of parameters of the DNN.
To reduce these transformation functions parameterized by learnable weights, researchers worldwide have developed various model compression techniques; only some of the well-researched approaches are covered here as a brief overview.
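As an illustration of how the listed factors combine, the following is a minimal sketch that counts the MAC operations of one standard 2-D convolution layer. The function names and the example layer (a 3-channel 224x224 input with 64 filters of size 3x3) are assumptions chosen for illustration and are not taken from the networks discussed in this paper.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_macs(batch, in_ch, out_ch, height, width, kernel, stride=1, padding=0):
    """MAC count of one standard 2-D convolution layer (bias ignored)."""
    out_h = conv_output_size(height, kernel, stride, padding)
    out_w = conv_output_size(width, kernel, stride, padding)
    # Every output value needs kernel * kernel * in_ch multiply-accumulates.
    macs_per_output = kernel * kernel * in_ch
    return batch * out_ch * out_h * out_w * macs_per_output


# Hypothetical layer: one 3-channel 224x224 image, 64 filters of 3x3, padding 1.
example_macs = conv_macs(batch=1, in_ch=3, out_ch=64,
                         height=224, width=224, kernel=3, padding=1)
print(example_macs)
```

Under these assumptions, the count grows linearly with batch size and channel counts and quadratically with kernel size, which is why the compression techniques below target these factors.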

A. Pruning
The parameter space of the DNN is reduced by physically trimming, or pruning, the network itself in various ways.
Unimportant weight connections can be pruned if they are below a predefined threshold or if they are redundant. About 50% of the weights can be pruned without fine-tuning, and with fine-tuning more than 80% of the weights can be pruned [18]. The pruning of the weights can be driven by the energy distribution of the network [19]. In the Efficient Inference Engine (EIE) [20], the sparse weights after pruning can also be compressed to reduce memory access bandwidth. Huffman coding is used to reduce storage and bandwidth requirements for weights by 20-30% [21].
Another approach is to trim individual neurons if they are redundant [22]. As neurons are the basic elements of the network, the connections associated with a pruned neuron are removed as well. Many ways of doing this type of pruning are researched in the literature; even entire layers that do not contribute much to the network update can be removed [23].
Convolutional filters are applied to the data, and they can be eliminated from the network according to their importance. A filter's importance can be judged by its influence on the weight calculation or by its L1/L2 norm [24]. Other methods researched in the literature are outside the scope of this work.
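Threshold-based weight pruning can be sketched as follows. This is a minimal illustration, assuming a flat list of weights and a hand-picked threshold of 0.1; the helper names and values are hypothetical and do not reproduce the procedure of [18].

```python
def prune_by_magnitude(weights, threshold):
    """Set every weight whose absolute value is below the threshold to zero."""
    return [0.0 if abs(w) < threshold else w for w in weights]


def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


# Hypothetical weights of one small layer.
weights = [0.80, -0.02, 0.15, -0.60, 0.01, 0.03, -0.40, 0.05]
pruned = prune_by_magnitude(weights, threshold=0.1)
print(pruned, sparsity(pruned))
```

In practice the pruned network is usually fine-tuned afterwards so that the remaining weights compensate for the removed connections.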
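Ranking filters by their L1 norm can be sketched as below. This is an illustrative sketch only: each filter is assumed to be flattened into a list of weights, and the function names and example values are hypothetical rather than the method of [24].

```python
def l1_norm(filter_weights):
    """Sum of absolute weight values of one flattened filter."""
    return sum(abs(w) for w in filter_weights)


def keep_strongest_filters(filters, keep):
    """Indices of the `keep` filters with the largest L1 norm, in original order."""
    ranked = sorted(range(len(filters)),
                    key=lambda i: l1_norm(filters[i]), reverse=True)
    return sorted(ranked[:keep])


# Four hypothetical filters, flattened to three weights each.
filters = [
    [0.5, -0.4, 0.3],
    [0.01, 0.02, -0.01],
    [-0.9, 0.4, 0.2],
    [0.05, -0.05, 0.0],
]
kept = keep_strongest_filters(filters, keep=2)
print(kept)
```

Removing a filter also removes the corresponding input channel of the next layer, so the saving applies to two layers at once.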

B. Quantization
The network architecture can be improved in many ways, e.g., by reducing the number of weights and the number of operations. A large convolution operation can be replaced with a number of smaller convolution operators having fewer weights in total while keeping the effective receptive field the same; i.e., large filters can be emulated with several smaller filters in cascade. For example, a convolution of size n by n can be built by combining a 1 by n convolution with an n by 1 convolution [21]. SqueezeNet [25] uses this approach to achieve an overall reduction in the number of weights of up to 50x compared to AlexNet while keeping the accuracy in a similar range.
Weights of fully connected layers can be quantized using a regularization technique [26], [27]. Clustering by k-means [28] achieved more than 20x compression with negligible loss of accuracy. HashedNets [29] use a low-cost hash function to group weights into hash buckets to share parameters.
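The saving from cascading a 1 by n and an n by 1 convolution can be checked with simple weight counting. The sketch below is an assumption-laden illustration: it ignores biases, assumes the intermediate channel count equals the output channel count unless stated, and uses a hypothetical 7x7 layer with 64 input and 64 output channels.

```python
def full_kernel_weights(n, in_ch, out_ch):
    """Weights of a single n x n convolution (bias ignored)."""
    return n * n * in_ch * out_ch


def cascaded_kernel_weights(n, in_ch, out_ch, mid_ch=None):
    """Weights of a 1 x n convolution followed by an n x 1 convolution."""
    mid_ch = out_ch if mid_ch is None else mid_ch
    return n * in_ch * mid_ch + n * mid_ch * out_ch


# Hypothetical layer: 7x7 kernel, 64 input channels, 64 output channels.
full = full_kernel_weights(7, 64, 64)
cascaded = cascaded_kernel_weights(7, 64, 64)
print(full, cascaded, full / cascaded)
```

The cascade covers the same n by n receptive field, but it cannot represent every n by n kernel exactly, so the accuracy impact has to be verified by retraining.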
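Weight sharing by clustering can be sketched with a plain one-dimensional k-means. This is a minimal sketch, assuming Lloyd's algorithm on scalar weights with hand-picked initial centroids; it is not necessarily the procedure used in [28], and all names and values are hypothetical.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Lloyd's algorithm on scalars; returns (centroids, assignments)."""
    centroids = list(centroids)
    assignments = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for each weight.
        assignments = [
            min(range(len(centroids)), key=lambda k: abs(v - centroids[k]))
            for v in values
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [v for v, a in zip(values, assignments) if a == k]
            if members:  # leave an empty cluster where it is
                centroids[k] = sum(members) / len(members)
    return centroids, assignments


# Nine hypothetical weights quantized to a 3-entry codebook.
weights = [-1.0, -0.9, -1.1, 0.0, 0.1, -0.1, 1.0, 0.9, 1.1]
codebook, indices = kmeans_1d(weights, centroids=[-0.5, 0.0, 0.5])
quantized = [codebook[i] for i in indices]
print(codebook, indices)
```

After clustering, only the small codebook and a short index per weight need to be stored, which is where the compression comes from.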

C. Knowledge Distillation
Knowledge distillation (KD) was introduced by [30] as:
• Train a large model that performs and generalizes very well. This is called the teacher model.
• Take all the data you have and compute the predictions of the teacher model. The total dataset with these predictions is called the knowledge, and the predictions themselves are often referred to as soft targets. This is the knowledge distillation step.
• Use the previously obtained knowledge to train the smaller network, called the student model.

IV. TEACHER-STUDENT LEARNING
Knowledge distillation starts with training a larger model, the teacher 'T'. As it is trained on a heavier platform (GPU), it achieves high performance. Then a lightweight model, known as the student 'S', is deployed to learn from 'T'. Now, 'S' is supposed to give performance comparable to 'T' but with less memory and more speed.
Various methods have been researched to improve knowledge transfer from teacher to student. Assuming that a trained 'T' has already eliminated some of the label errors contained in the ground-truth data, the authors in [29] treated the hard label predicted by 'T' as the underlying knowledge. In [30], by contrast, the focus is on the soft label produced by 'T', i.e., the classification probabilities, which provides more information to transfer. In general, knowledge is transferred from 'T' to 'S' by minimizing a loss function in which the target is the distribution of class probabilities predicted by 'T'. This probability distribution has the correct class at a very high probability (close to '1'), with all other class probabilities very close to '0'. As such, it does not provide much information beyond the ground-truth labels already provided in the dataset. To address this, Hinton [30] introduced the concept of "softmax temperature". As the temperature grows, the probability distribution generated by the softmax function becomes softer, providing more information about which classes 'T' found more similar to the predicted class. This is the "dark knowledge" embedded in 'T' and transferred to 'S' in the distillation process. The distillation-related work can be categorized as below:
• Feature map: The feature map can be averaged across the channel dimension to obtain a spatial attention map [31]. The inner product of two feature maps can be used for the inter-layer flow [32]. The author in [33] improved this idea with singular value decomposition (SVD). A recent work [34] demonstrated the effectiveness of mimicking the feature map directly in distillation.
• Transfer strategy: FitNets [35] selected a hidden layer from 'T' and from 'S' to be the hint layer and the guided layer, respectively. 'S' can get a better initialization by pre-training the guided layer with the hint layer as supervision. Net2Net [36] proposed a function-preserving transformation, which makes it possible to directly reuse the knowledge of 'T' to initialize the parameters of 'S'.
• Hybrid strategy: Adversarial learning is combined with distillation by using a comparator to check whether the outputs of 'S' and 'T' are close enough [37]. The author in [38] exploited reinforcement learning to search for the best network structure of 'S' under the influence of 'T'.
In [39] and [40], progressive or lifelong learning is used to make the knowledge transfer step by step.
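The effect of the softmax temperature described above, and a loss that transfers the soft labels of 'T' to 'S', can be sketched as follows. This is a minimal sketch under stated assumptions: the temperature of 4, the weighting factor `alpha`, and all function names are hypothetical choices for illustration, and the code is not the training setup used in this work. The factor of temperature squared on the soft term follows the scaling suggested in [30].

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-label term (assumed weighting)."""
    soft_targets = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    soft_loss = cross_entropy(soft_targets, soft_student) * temperature ** 2
    hard_loss = -math.log(softmax(student_logits)[hard_label])
    return alpha * soft_loss + (1 - alpha) * hard_loss


# Hypothetical teacher logits for three classes; class 0 is the true label.
teacher_logits = [5.0, 1.0, 0.5]
sharp = softmax(teacher_logits, temperature=1.0)
soft = softmax(teacher_logits, temperature=4.0)
print(sharp, soft)
print(distillation_loss([0.5, 1.0, 5.0], teacher_logits, hard_label=0))
```

At temperature 1 almost all probability mass sits on the predicted class, while at temperature 4 the relative ranking of the other classes becomes visible to the student.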
In view of this, a novel approach to developing deep learning models for various domains is proposed. The students in a class do not all have similar capability; their capability generally follows a Gaussian curve. To flatten the curve toward the higher side of learning capability, the model tries to imbibe the relevant knowledge, which can then be used by lightweight students to perform the object detection task with comparable performance. The aim is to provide a generic solution to the problem, under the assumption that the model can be transferred successfully using only a smaller dataset, avoiding the limitations of domain transfer.

V. EXPERIMENTAL EVALUATION AND DISCUSSION
The popular framework Caffe [41] is chosen for the binary image classification task. The Dogs vs. Cats Redux competition dataset [42] is used for training and testing on an NVIDIA GTX 770 with 1536 GPU cores [43]. The training data consists of 12500 images each of cats and dogs. For the testing phase, 12500 random images from the dataset are chosen.
The model calculates the probability of a pet and assigns a numeric value between 0 and 1 for the predicted class. Currently, the implementation involves binary classification (cat and dog), but it can be extended easily to include other types of pets. In that case, the model gives a probability value for each class, and the class with the highest value is taken as the prediction. The accuracy of the trained models depends on the training parameters and varies with changes in the hyperparameters, the tuning of which is itself a separate research area. The log-loss formula can be used to represent the accuracy of any model:
\[ \mathrm{LogLoss} = -\frac{1}{n} \sum_{i=1}^{n} \left[ \hat{y}_i \log(y_i) + (1 - \hat{y}_i) \log(1 - y_i) \right] \]
where n represents the number of images in the test set, y_i is the predicted probability that image i is a dog, and \hat{y}_i equals '1' if image i is a dog and '0' if it is a cat. The log loss is calculated for each run; note that a smaller value of log loss is desired.
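The binary log loss described above can be computed with a few lines of code. This is a minimal sketch: the function name and the example values are hypothetical, and the clipping of probabilities with `eps` is an added assumption to avoid taking the logarithm of zero, not something stated in this paper.

```python
import math


def log_loss(labels, probabilities, eps=1e-15):
    """Mean binary log loss; labels are 1 (dog) or 0 (cat)."""
    total = 0.0
    for label, p in zip(labels, probabilities):
        # Assumed safeguard: keep p strictly inside (0, 1).
        p = min(max(p, eps), 1 - eps)
        total += label * math.log(p) + (1 - label) * math.log(1 - p)
    return -total / len(labels)


# Hypothetical test set of four images: dog, cat, dog, cat.
labels = [1, 0, 1, 0]
confident = log_loss(labels, [0.9, 0.1, 0.8, 0.2])
uncertain = log_loss(labels, [0.6, 0.4, 0.5, 0.5])
print(confident, uncertain)
```

A model that assigns probability 0.5 to every image scores log(2), about 0.693, so any useful classifier should fall below that value.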
First, the complex teacher model is trained using labeled data, and it is then tested with unlabeled data to classify each pet as either a cat or a dog. Fig. 2(a) shows the learning curve of the teacher model, which achieves 75% validation accuracy in 1500 iterations, taking about 2.5 hours. Next, the weights from the teacher model are passed to the student model as per the knowledge distillation criteria, i.e., the student model is initialized with the pre-trained weights from the teacher model. Fig. 2(b) shows the learning curve of the student model, which achieves 95% validation accuracy in about 1000 iterations, taking less than 2 hours.
In Table I, with transfer learning, the accuracy jumps from 75% to 95%, and in less run time. The log-loss value also comes down close to unity, the ideal log-loss value for this problem. To validate the transfer learning results further, multiple runs are conducted on a higher-capacity single-GPU configuration (NVIDIA GTX 1080 Ti with 3584 cores) [44] and on a dual-GPU configuration. The results are shown in Fig. 3 and summarized in Table II, which gives the comparison matrix from all the test runs on the various GPU platforms. For model 1, using the GPU with a significantly greater number of cores reduces the number of iterations needed to achieve similar results, and the model converges faster. For model 2, on the other hand, the results are similar in terms of both accuracy and number of iterations irrespective of the platform. This indicates that the transfer learning-based approach comes out as a clear winner for limited-resource environments.
Next, the task is performed for binary, ternary, and multiclass identification, and the performance is compared on the basis of inference speed. The results in Table III show that the inference speed does not depend on the number of classes. Multiple domains are also considered to show that, for similar recognition accuracy, the inference speed achieved is about 50 frames per second, or 20 ms per image. It can be safely concluded that, using the transfer learning approach, a student model converges faster than the original complex teacher model. This directly translates into a saving in resources for each learning run, which is exactly what is required for implementing CNNs on heterogeneous embedded platforms with fewer resources, since a less powerful embedded GPU (compared to discrete ones) can now achieve similar accuracy. The results will make the deployment and inferencing of DNNs on heterogeneous devices easier and more device-friendly. General-purpose experimentation platforms like the Raspberry Pi can also be used for the same purpose.
For future work, the latest NVIDIA platforms like Jetson-TK [45] can be considered for real-time implementation of this approach. GPU acceleration and model compression are orthogonal to each other. How much a model can be compressed and accelerated subject to given resource constraints (storage, computational power, and energy) and user-specified performance goals (accuracy, latency) is an open research question. The development of a generalized model compression and acceleration framework would add further value. More research in this area can lead to a dynamic trade-off between model compression and acceleration.

ACKNOWLEDGMENTS
The first author would like to thank the faculty and staff of the Institute of Technology, Nirma University, for the laboratory work support.