Recognition of Local Birds of Bangladesh using MobileNet and Inception-v3

Recognizing bird species is a challenging task because of several complex factors. The purpose of this work is to distinguish local bird species of Bangladesh from image data. Two image classification models, MobileNet and Inception-v3, are used to accomplish this. We apply four approaches: Inception-v3 without transfer learning, Inception-v3 with transfer learning, MobileNet without transfer learning, and MobileNet with transfer learning. To evaluate the experimental results, we calculate the F1 Score in addition to each model's accuracy, and we present the ROC curve to assess the quality of each model's output. We then compare the four approaches. The experimental results show that all four approaches are capable of performing the task. Among them, MobileNet with transfer learning outperforms the others, obtaining a test accuracy of 91.00%, and it also obtains the highest F1 Score for every class.
Keywords: Recognition; MobileNet; Inception-v3; transfer learning; computer vision; Bangladeshi bird


I. INTRODUCTION
Birds are among the amazing creations of God. Besides adding charm and beauty to nature, they help sustain the balance of the environment. Bangladesh offers a friendly habitat for birds. Recognizing local birds is difficult, however, because of the variation in their nature, color, structure, and voice.
Images can be classified in several ways. Image classification aims to assign a given picture to one of a set of possible categories. From a deep learning perspective [1], the image classification problem can be solved by transfer learning. This technique is becoming increasingly popular in computer vision because it can produce accurate models in a very short time [2]. The main advantages of transfer learning [3] are that it needs neither a huge training dataset nor a large amount of computational power. In computer vision, transfer learning is normally applied through pre-trained models. A pre-trained model [1] is a model that was trained on a large dataset to solve a problem similar to the one at hand.
Here, we recognize seven local birds of Bangladesh, namely Bulbuli, Chorui (চড়ুই), Kaak, Doel, Machranga, Shalik, and Kokil, including the national bird; they are shown in Fig. 1. In this paper, we implement two models, with two different approaches for each, to recognize local birds of Bangladesh. First, we train Inception-v3 and MobileNet without transfer learning; then we train Inception-v3 and MobileNet again with transfer learning. We then compare our observations on the basis of two performance evaluation metrics, accuracy and F1 Score. The rest of this paper is organized as follows. Section II presents the Literature Review. Section III describes the Methodology. Section IV presents the Experimental Result and Analysis. Section V presents the Conclusion, and Section VI outlines Future Work.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 8, 2020, www.ijacsa.thesai.org

II. LITERATURE REVIEW
Many researchers have studied transfer learning, but few have applied it to the local birds of Bangladesh. Many have also worked on bird recognition using a variety of traditional solutions.
J. Bai et al. [4] presented a bird recognition method based on Inception-v3. To improve the robustness and generalization of the model, they applied several data augmentation techniques. They evaluated their system on the BirdCLEF test data and obtained a classification mean average precision of 0.055.
S. K. Pillai et al. [5] presented a deep neural network technique to identify bird species. They implemented it with the TensorFlow framework and tested it on twelve bird species taken from the Caltech dataset. In their results, the true prediction score on both training images and unknown images is always above 0.5, and in some cases both exceed 0.9, which they take as evidence of the algorithm's accuracy. On this dataset, training accuracy rises from 29% at step 0 to 100% at step 1500. They also observed an inverse relationship between validation accuracy and cross-entropy.
M. Lasseck [6] presented deep learning methods for acoustic bird detection. Deep convolutional neural networks, originally designed for image classification, are adapted and fine-tuned to recognize the presence of birds in audio recordings. Various data augmentation techniques were applied to improve model performance. N. R. Gavai et al. [7] reported the experimental performance of the MobileNets model retrained on flower datasets, which greatly reduces the time and space needed for flower classification at an insignificant cost in accuracy. J. Bankar and N. R. Gavai [8] used a machine learning technique to classify animals into specific classes. They used transfer learning to retrain the Inception-v3 model on animal datasets in the TensorFlow platform, which greatly improved classification accuracy in this setting.
S. Lee et al. [9] addressed the detection of pest birds in agriculture. They proposed applying deep learning to small cropped areas of a frame where a bird is highly likely to be present, as determined by prior image processing. Moving objects are extracted by background subtraction based on a Gaussian Mixture Model. Color extraction and a median filter then eliminate undesired objects in the agricultural environment. Finally, a neural network object classifier is used to classify the moving objects and reduce their number. The test results showed that applying neural networks to specific areas of an image yields higher accuracy than applying them to the entire original frame.
S. S. Londhe and S. S. Kanade [10] studied automatic identification of bird species from their vocalizations. Bird sounds are divided by function into songs and calls, which are further subdivided into hierarchical levels of phrases, syllables, and elements. The authors point out that the syllable is a suitable unit for recognizing bird species. The variety of syllable types that birds can produce is large. A Support Vector Machine is used to identify the species.
Moein Shakeri and Hong Zhang [11] described a real-time system for detecting birds in flight, based on background subtraction and tracking through point correspondence. Their system uses a single fixed camera and Zivkovic's background subtraction technique. To achieve reliable bird detection, they added a point-tracking correspondence component to the background subtraction technique. Experiments were carried out to analyze detection performance on objects of varied size, color, and velocity. The results demonstrated the accuracy and efficiency of their system. M. R. Islam et al. [12] developed a CNN-based MobileNet model for local bird classification. They used a small dataset of 500 images in 5 classes of local birds and reported an accuracy of about 100%.
X. Xia et al. [13] used the Inception-v3 model to classify flowers, retraining it on the datasets with transfer learning. Two datasets, Oxford-17 and Oxford-102 flowers, were employed, and the classification process was conducted in four steps. They obtained 95% accuracy on Oxford-17 and 94% on Oxford-102.
Howard et al. [14] presented MobileNet, a model for mobile vision applications. The MobileNet architecture is based on a streamlined design. To choose a model of the right size for a given problem, they introduced two global hyperparameters. They also demonstrated the effectiveness of MobileNet on various classification problems.

III. METHODOLOGY
This section describes how we implemented this work. It consists of five subsections: Dataset, Data Preprocessing, Model Description, Training, and Testing. The implementation procedure is presented in Fig. 2.
Each of the five subsections is described in detail below.

A. Dataset
In this work, we collected images of seven local birds. Table I lists the local name, English name, and scientific name of each bird. Initially, we collected 100 images for each bird species. Since we work with seven types of local birds, the collected dataset contains 700 images in total.

B. Data Preprocessing
Data preprocessing consists of resizing, labeling, and augmentation. Deep learning needs a sufficient amount of data, but in some cases collecting it is not feasible; data augmentation can help in such cases. To increase the size of the dataset, we therefore applied data augmentation to the original dataset. Data augmentation can be performed with several operations. The most commonly used are rotation, shearing, zooming, cropping, flipping, and contrast adjustment using histogram equalization or adaptive histogram equalization. After augmentation we had 500 images for each class, for a total of 3500 images. We resized this augmented, labeled dataset for training and testing. We then divided the data randomly into a training set and a test set with an 80% to 20% ratio.
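The random 80% to 20% split described above can be sketched as follows. This is a minimal illustration only: the file names, the fixed seed, and the helper names `build_index` and `random_split` are assumptions made for the example, not details given in the paper.

```python
import random

# Class names and counts are taken from the text; file names are hypothetical.
CLASSES = ["Bulbuli", "Chorui", "Kaak", "Doel", "Machranga", "Shalik", "Kokil"]
IMAGES_PER_CLASS = 500   # images per class after augmentation
TEST_FRACTION = 0.20     # 80% train, 20% test


def build_index():
    """Return (file_name, label) pairs for the augmented dataset."""
    return [
        (f"{name}_{i:03d}.jpg", label)
        for label, name in enumerate(CLASSES)
        for i in range(IMAGES_PER_CLASS)
    ]


def random_split(samples, test_fraction, seed=0):
    """Shuffle a copy of the samples and cut it into train and test parts."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


samples = build_index()
train_set, test_set = random_split(samples, TEST_FRACTION)
print(len(samples), len(train_set), len(test_set))
```

Note that a purely random split like this one does not guarantee the same number of test images per class; a stratified split would be needed for that.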

C. Model Description
We worked with the Inception-v3 and MobileNet architectures in two different ways. In the first, we used the architectures without transfer learning. In the second, we used transfer learning, for which we took the weights of these architectures pre-trained on the ImageNet dataset of 1000 classes. The details of the MobileNet and Inception-v3 architectures for the ImageNet dataset are presented in Table II. For training, we discarded the output layer of 1000 neurons and added an output layer of 7 neurons, since our dataset has only 7 classes. During transfer learning with Inception-v3, we froze the top 2 inception blocks and kept the remaining layers trainable. For transfer learning with MobileNet, we made all layers trainable. Because we observed heavy overfitting with transfer learning, we added a dropout layer with a rate of 0.5 after the final fully connected layer to reduce it.
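To make the effect of a dropout layer with a rate of 0.5 concrete, the sketch below implements the common "inverted dropout" formulation in plain Python. It is an illustration under assumptions: the paper does not say how its framework implements dropout, and the function name `dropout` and the toy input are hypothetical.

```python
import random


def dropout(activations, rate=0.5, training=True, rng=None):
    """Inverted dropout: zero each unit with probability `rate` during
    training and scale the surviving units by 1 / (1 - rate), so the
    expected activation is unchanged. At test time the input passes through."""
    if not training or rate == 0.0:
        return list(activations)
    rng = rng or random.Random()
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


# Toy input (assumption): 10000 units that all output 1.0.
features = [1.0] * 10000
dropped = dropout(features, rate=0.5, rng=random.Random(0))

kept_share = sum(1 for v in dropped if v != 0.0) / len(dropped)
mean_after = sum(dropped) / len(dropped)
print(kept_share, mean_after)
```

Roughly half of the units are zeroed on each training pass, which discourages the network from relying on any single unit and is why dropout is used against overfitting.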

D. Training
To train the models, we used an Nvidia GeForce 2080 GPU with 8 GB of memory. We trained for 100 epochs with a batch size of 8. For transfer learning, we used the weights pre-trained on the ImageNet dataset. During training, we used the test set as the validation set. The accuracy with respect to epochs for all the models is presented in Fig. 3. The blue line shows the training accuracy and the other line shows the validation accuracy.

E. Testing
The binary representation of the 7×7 confusion matrix shown in Table III is presented in Table IV. A binary matrix is easy to interpret and understand. Since 700 images are used for testing, the sum of TP, FP, FN, and TN equals 700 for each bird class.
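The reduction of a 7×7 confusion matrix to per-class TP, FP, FN, and TN counts can be sketched as below. The matrix values are hypothetical placeholders, not the values of Table III, and the precision, recall, F1, and accuracy expressions are the standard definitions, assumed here to match the ones used in this work.

```python
def binarize(confusion, k):
    """One-vs-rest counts for class k.
    Rows are true classes, columns are predicted classes."""
    total = sum(sum(row) for row in confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp                 # class k predicted as another class
    fp = sum(row[k] for row in confusion) - tp  # another class predicted as k
    tn = total - tp - fn - fp
    return tp, fp, fn, tn


def scores(tp, fp, fn, tn):
    """Standard precision, recall, F1 and accuracy from binary counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy


# Hypothetical 7x7 matrix: 100 test images per class, 88 correct,
# and 2 confused with each of the other six classes.
N = 7
confusion = [[88 if i == j else 2 for j in range(N)] for i in range(N)]

for k in range(N):
    counts = binarize(confusion, k)
    print(k, counts, sum(counts), scores(*counts))
```

For every class the four counts add up to the total number of test images, which is the property stated in the text.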

IV. EXPERIMENTAL RESULT AND ANALYSIS
Formulas (1) to (4) are used to calculate the performance evaluation metrics.
The Receiver Operating Characteristic (ROC) metric is usually used to evaluate the quality of a classifier's output. In a ROC curve, the true positive rate (TPR) is plotted on the Y-axis and the false positive rate (FPR) on the X-axis. The top left corner of the plot is therefore the ideal point, because there the FPR is lowest and the TPR is highest. This point is rarely attainable in practice, but it does indicate that a larger area under the curve (AUC) is usually better. From Fig. 4, we can see that the ROC curve of MobileNet with transfer learning has a larger AUC than the others, which indicates that this approach works better here.
ROC curves are usually used in binary classification to analyze the output of a classifier. To extend the ROC curve and ROC area to multi-label classification, it is necessary to binarize the output. The two evaluation measures for multi-label classification are micro-averaging and macro-averaging. It is possible to draw one ROC curve per label, but it is also possible to draw a ROC curve by treating each element of the label indicator matrix as a binary prediction (micro-averaging). Macro-averaging, the other evaluation measure for multi-label classification, gives equal weight to the classification of each label.
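The difference between micro-averaged and macro-averaged ROC AUC can be illustrated with a small pure-Python sketch. The labels and scores below are made-up values for three classes, not outputs of the models in this work, and the AUC is computed with the pairwise-ranking formulation (the share of positive/negative pairs that are ranked correctly), which equals the area under the ROC curve.

```python
def auc(labels, scores):
    """ROC AUC as the share of (positive, negative) pairs in which the
    positive has the higher score; ties count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_hot(y, n_classes):
    """Binarize class labels into a label indicator matrix."""
    return [[1 if c == label else 0 for c in range(n_classes)] for label in y]


def macro_auc(indicator, scores):
    """Average of the per-class AUCs: every class gets equal weight."""
    n_classes = len(indicator[0])
    per_class = [
        auc([row[c] for row in indicator], [row[c] for row in scores])
        for c in range(n_classes)
    ]
    return sum(per_class) / n_classes


def micro_auc(indicator, scores):
    """Treat every element of the indicator matrix as one binary prediction."""
    flat_labels = [v for row in indicator for v in row]
    flat_scores = [v for row in scores for v in row]
    return auc(flat_labels, flat_scores)


# Hypothetical true labels and per-class scores for six samples.
y_true = [0, 0, 1, 1, 2, 2]
y_score = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.6, 0.2],
    [0.3, 0.4, 0.3],
    [0.1, 0.2, 0.7],
    [0.2, 0.3, 0.5],
]

indicator = one_hot(y_true, 3)
macro = macro_auc(indicator, y_score)
micro = micro_auc(indicator, y_score)
print(macro, micro)
```

The two averages generally differ: macro-averaging weights each class equally, while micro-averaging weights each individual prediction equally.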

V. CONCLUSION
This research work covers seven local bird species of Bangladesh, namely Bulbuli, Chorui (চড়ুই), Kaak, Doel, Machranga, Shalik, and Kokil. We applied four approaches, using 2800 images as the training set and 700 images as the test set. Among the four approaches, MobileNet with transfer learning performed best in terms of the performance evaluation metrics, as shown in Table V.

VI. FUTURE WORK
There are many local bird species in Bangladesh. In the future, we will therefore work with more local bird species and with other transfer learning techniques. We will also work on recognizing local bird species from video data.