Automatic Recognition of Marine Creatures using Deep Learning

www.ijacsa.thesai.org


I. INTRODUCTION
Millions of marine creatures live in the ocean, making it the largest habitat on the planet [1]. The world's health is closely related to this marine biodiversity. These marine creatures are of the utmost importance to society as they are a source of food and a symbol of economic welfare. For example, fish are estimated to provide 20% of animal protein to about three billion people [2]. In addition, the ocean is home to a diverse range of creatures that can be utilised for the development of pharmaceutical products to treat various diseases [3]. The human race benefits from the numerous advantages that the marine ecosystem provides for its survival [4]. Therefore, the effective conservation of this biodiversity in a sustainable manner is crucial for the proper functioning of the marine ecosystem and the human race [5].
Traditionally, marine biologists identify aquatic species by visually inspecting their morphological traits [6][7]. Another popular method to correctly identify and group them is deoxyribonucleic acid (DNA) barcoding [8]. This method can be used to precisely, accurately, and quickly detect invasive alien species or marine bacteria that can cause viral outbreaks [9]. DNA barcoding has already proven its worth as a deterrent against various forms of economic fraud, such as seafood mislabelling [10]. Despite its advantages, such an identification method is labour-intensive and time-consuming. As a result, it is crucial to create an automatic marine creature recognition system to address these difficulties.
Automatic recognition of marine creatures is a topic of interest to many researchers around the world. Pudaruth et al. have developed such a system in the form of a mobile application to recognise some of the marine fish that are present in Mauritian waters [11]. However, no system has been developed to cater for other types of marine species that can be found in the Indian Ocean. People with no expertise in marine taxonomy have difficulty distinguishing between the different aquatic species. This poses a problem, especially when deadly ocean animals such as the stonefish, the blue-ringed octopus, or the lionfish, amongst many others, are encountered [12][13]. Furthermore, some endangered species require proper protection, such as conservation laws and regulations. These are only feasible once the species have been recognised.
The motivation for this study is to create an image recognition model capable of distinguishing between different marine creatures. It is worth mentioning that 80% of the ocean is still undiscovered. According to an interview given by Dr Gene Carl Feldman to Oceana (an organisation focused on protecting the ocean), space exploration is far simpler than ocean exploration [14]. As there is an abundance of marine life in the ocean and it is difficult to cater for all of it, the scope of this study is limited to some of the marine creatures that are available at Odysseo Oceanarium in Mauritius. The recognition model has been integrated into a web application, which serves several purposes. First, it will help in creating awareness about the creatures, especially dangerous species. Furthermore, it will help in spreading knowledge by providing some basic information about the creature after the recognition phase: its scientific name, common name, a short description, and whether the animal is deadly. This information can then be used to better understand the animal. The proposed system employs computer vision and deep learning techniques to identify the marine creatures accurately.
This paper is divided into different sections. In Section II, a background study and reviews of related work in this field are provided. Section III delves deeper into the methodology and Section IV assesses the model's performance and discusses the results obtained. The final section concludes this paper. Table XII in Appendix I lists all the marine creatures from the Odysseo Oceanarium which were used in this study.

II. LITERATURE REVIEW
Several studies have been carried out in the past to develop systems for the automated identification of marine life. This section contains summaries of several relevant works in chronological order.

A. Fish Recognition
Strachan et al. conducted one of the earliest studies in this field, attempting to evaluate three different image analysis methods to differentiate between images of fish from different species [15]. Methods such as invariant moments, mismatch optimisation, and geometric shape descriptors were used. Their strategy takes into account the fact that fish can be identified by their body shape. The dataset used in their experiment consisted of 60 different fish images. Their research found that the geometric shape descriptors method outperformed the other two approaches, yielding a 90% accuracy rate. However, their experiment was limited to a restricted number of species: only seven different species (two of which were identical gurnard fish, shot from different perspectives).
Fish image recognition has found its usefulness in systems such as automated fish counting. Fish counting is a challenging but critical task for the maintenance of a sustainable fishing level and the prevention of overfishing [16]. Luo et al. proposed a method for such a system by using video footage captured during fishing operations [17]. Their method involved the use of Statistical Shape Models (SSM) and Artificial Neural Network (ANN). To overcome the occlusion problem caused by people walking around the deck of the ship, the video footage was pre-processed. The colour of the images was used as a feature for the recognition process, and an Error Back Propagating ANN classifier was used to recognise the fish from the background. The next step was to use SSM to identify the fish. Lastly, a rule-based counting method was used to count the number of fish. Their method achieved an accuracy of 89.6% for a one-hour video.
The study conducted by Rathi et al. proposed a solution based on Convolutional Neural Network (CNN), deep learning, and image processing [18]. Their method involved preprocessing of the captured images with the aim of removing noise and then using CNN to classify the different fishes. Otsu's thresholding method was adopted to obtain a histogram representation of the input grayscale image. The next step was to perform morphological operations, such as dilation and erosion, to prepare the resulting image to be processed by the CNN algorithm. To put their method to the test, they used 27,142 images from the Fish4Knowledge dataset, representing 21 species, which resulted in an accuracy of 96.29%. However, due to background noise and a lack of image enhancement techniques to compensate for features lost during the preprocessing phase, some of the classifications were incorrect.
Faster R-CNN was used by Mandal et al. to create a system for the automatic detection and identification of fish species [19]. Their dataset consisted of 50 different fish species. Using a random sampling technique, their dataset was divided into training (70%), validation (10%), and testing (20%) sets. They were able to achieve a mean average precision (mAP) of 82.4%.
Deep and Dash employed CNN for feature extraction, followed by Support Vector Machine (SVM) and k-Nearest Neighbour (kNN) for classification [20]. They used the Fish4Knowledge dataset, which was divided into training (90%) and testing (10%) sets. The training set was further divided such that 10% of the training images were used for validation. Their research showed that using kNN for classification yields the best accuracy of 98.79%.
Rico-Díaz et al. proposed a non-invasive method for addressing the fish recognition problem by combining artificial vision techniques and ANN [21]. Their work relied on the fact that fish from different species can be distinguished based on the sclera and pupil of their eyes. The first step in the identification process was to employ image filtering techniques to reduce noise in the captured image. After that, background subtraction was done to segment the fish from the background. The next step was to identify the fish's eye, and for this the Hough algorithm was used. Additionally, a feed-forward ANN, which is more costly, was employed when the Hough algorithm failed. Using their approach, they were able to achieve an overall accuracy of 74% for eye detection. Also, two underwater cameras were used to estimate the size and weight of the fish while they were swimming. Their solution, however, is dependent on the image's quality and a good background subtraction. Furthermore, performance degrades when the ANN has to be used because the Hough algorithm fails to detect the fish.
Liang et al. combined CNN and migration learning to distinguish between three different kinds of Chinese ornamental fish [22]. They used TensorFlow, which is an open-source library for machine learning and artificial intelligence, to train their network model. A total of 14,000 (4 × 3,500) images were gathered, with each set of 3,500 divided into 3,000 and 500 images for training and testing, respectively. Their dataset, which consisted of 3,500 images for each of the three fish and one set for other types of fish, was gathered from the Internet by using web crawler technology. Following that, preprocessing was an important step in enhancing the recognition rate as real-time videos of fish were shot outside of the aquarium. The dark channel prior and gamma correction methods were used for this purpose. The latter was a significant step towards the removal of brightness from the pictures. To reduce processing power, all images were scaled to 250 × 250 pixels. Their experiment showed that an accuracy of 98.1% is achievable.
Cai et al. took a different approach to realising a system for detecting and counting fish [23]. They proposed to use the You Only Look Once Version 3 (YOLOv3) model with MobileNet as the backbone for feature extraction. Their proposed system was trained using different strategies. They found out that their system performs better than when using YOLOv3 alone. The average precision obtained was 79.61%.
Pudaruth et al. experimented with multiple machine learning classifiers to discover the most effective one for developing a smartphone application to recognise different fish species existing in the exclusive economic zone (EEZ) of Mauritius [11]. Their model was tested on 38 different fish species with a dataset consisting of 1,520 images. Using the kNN classifier, they were able to attain an accuracy of 96%. Also, their study has shown that the use of deep neural networks (DNN) with the TensorFlow framework can attain an impressive accuracy of 98%. However, pictures of the fish were taken in a controlled environment and not in their natural habitat. The fish were placed on a white background to enable easier segmentation.

Conrady et al. used the Mask Region-Based Convolutional Neural Network (R-CNN) object detection framework to perform classification of the Roman seabream fish, which is endemic to Southern Africa [24]. Their dataset consisted of 2,015 images of the fish. They were able to get a mAP of 81.45% on their test data.

B. Marine Creature Recognition
While most researchers have focused on fish recognition, Chen and Yu proposed a high-definition camera system capable of recognising marine creatures [25]. Their approach to the identification process was broken down into multiple steps. The first step was concerned with the extraction of the image frames from the original video. The following step involved pre-processing of the retrieved images. In addition, for the detection phase, the original image had to be transformed to its grayscale and binary representations. To classify seven creatures, two separate methods were used: the Back Propagation Neural Network and the SVM methods. Their study showed that the SVM approach had an accuracy of 92% in classifying the creatures, which was higher than that of the other method. However, their proposed method does not work well for creatures with similar shapes.
Pelletier et al. developed a system capable of classifying marine animals into eight categories [26]. Their imbalanced dataset contained 3,777 images. They used two models to conduct their tests, namely AlexNet and GoogLeNet. The best performing model was GoogLeNet. The models were tested with uncropped and cropped images. They found out that by using the cropped images and GoogLeNet, they got the best accuracy, which was 96.54%. Additional tests were done by also considering the top two results during the classification process. This further increased the accuracy of the GoogLeNet model with the cropped dataset to 98.94%.
Aside from image recognition, several other approaches for automatic identification of marine creatures have been used in the recent past. Demertzis et al. suggested a novel technique: the use of a Machine Hearing Framework (MHF) for the identification of marine animals through their underwater sounds [27]. They were able to recognise fish and marine animals with recognition accuracy of 96.08% and 92.18%, respectively.
Song et al. proposed a method for the identification of marine creatures from seafloor videos [28]. Their proposed methods are twofold: extraction of valid video clips followed by recognition of the creature. During the first phase, an image segmentation method was used to determine and extract all frames from the video containing the marine creature. The next phase was concerned with identifying and labelling the creatures in the valid video clips. This was accomplished with the help of public participation. Lastly was the recognition process, which was accomplished with the help of the information submitted by the public and the membership function. Their method had an accuracy of above 80% in extracting the valid video frames, and all the creatures were successfully recognised.
Liu et al. implemented an embedded system to classify marine animals into seven categories [29]. Their dataset includes 8,455 photos of marine animals, with 80% of the images used for training and 20% used for validation. The training images were augmented by applying some transformations (rotation, translation, and flipping), which increased the number of training images to 27,056 (6,764 × 4). The models that were deployed on the embedded device were tested with 350 new images. Three models were used: MobileNetV1, MobileNetV2, and InceptionV3. MobileNetV2 had the highest testing accuracy of 95.0% and validation accuracy of 92.89%. Their model took an average of 0.0578 seconds to classify one image.

C. Knowledge Gap
Even though multiple studies have been conducted, very few have tested the effect of using deep learning (DL) on a big dataset. According to the review, the largest dataset had 50 species and consisted of 4,909 images [19]. Furthermore, there is no web application available in Mauritius that can perform recognition of marine animals. Adding to that, no dataset consisting of more than 50 marine creature species is currently available. This work aims to provide some answers to these research gaps as well as to contribute a dataset and a web application to perform marine creature classification.

III. METHODOLOGY

A. Data Collection at Odysseo Oceanarium
As no relevant existing dataset of marine creatures from the Indian Ocean was found at the time of this study, a custom dataset was created. Data collection can be done from multiple sources. However, due to time constraints, this was not feasible. Nonetheless, the source should be trustworthy and provide the desired information. In this regard, the Odysseo Oceanarium was chosen as the primary source of data gathering for this study. In recent years, underwater video surveillance has grown increasingly common in marine environments to acquire data on marine creatures in their natural habitat. This is a non-invasive method and provides sufficient data for research. Chen and Yu adopted this approach by using a submerged underwater video system [25]. However, for this study, due to limited resources, videos of marine creatures were taken from outside the aquariums found at Odysseo using a smartphone. All the videos were taken with a Huawei Y9 Prime 2019 smartphone, which has a resolution of 16 megapixels.

B. Data Processing
Numerous videos of marine creatures were obtained at Odysseo. Each video was converted into frames. Pelletier et al. have already shown that cropped images result in better model performance [26]. Taking this into consideration, each extracted frame was carefully cropped. Not all images were included in the final dataset. The following conditions had to be met to use the image, or else it was discarded: the creature in the image should not be occluded by another creature or object; the creature should be recognisable; and it should not appear too far away in the image. Fig. 1 shows an example of a good image. It follows all the criteria described above.

C. Custom Dataset
The custom-built dataset consists of 51 classes of marine creatures as shown in Fig. 2. It has 5,709 images in total and is imbalanced. The class having the highest number of images is Dascyllus aruanus with 171 pictures, and the one with the lowest number of images is Chaetodon kleinii with 74 pictures. Marine animals are challenging to photograph since they conceal their presence in the aquarium by hiding beneath rocks and among plants and tank accessories. This makes it difficult to collect the same number of images for all of the creatures and is the primary reason why the dataset is initially imbalanced.

D. Oversampling
To achieve an equal distribution of images per class, the dataset had to be balanced, and for this oversampling was done as shown in Fig. 3. The following transformations were used for augmentation: flipping, change in brightness, shearing and rotation. Each image is subject to three possible modifications.
After oversampling, all classes got an equal distribution of images. Each marine creature now has 171 images. Fig. 4 shows the data distribution of the dataset after oversampling.
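The balancing step can be sketched as follows. This is a minimal illustration rather than the authors' code: it only plans which augmented copies to create, the function name `plan_oversampling` and the random choice of source image and transformation are assumptions, and the two class counts are the ones quoted in Section III.C.

```python
import random

# Transformations named in the text. How each one is parameterised is not
# specified there, so they are treated as opaque labels in this sketch.
TRANSFORMS = ["flip", "brightness", "shear", "rotation"]


def plan_oversampling(class_counts, seed=0):
    """Return, per class, a list of (source_index, transform) pairs.

    Every class is topped up to the size of the largest class by scheduling
    augmented copies of randomly chosen original images (an assumed policy).
    """
    rng = random.Random(seed)
    target = max(class_counts.values())
    plan = {}
    for name, count in class_counts.items():
        missing = target - count
        plan[name] = [
            (rng.randrange(count), rng.choice(TRANSFORMS))
            for _ in range(missing)
        ]
    return plan


# Largest and smallest classes reported for the custom dataset.
counts = {"Dascyllus aruanus": 171, "Chaetodon kleinii": 74}
plan = plan_oversampling(counts)
for name, jobs in plan.items():
    print(name, "needs", len(jobs), "augmented images")
```

Under this plan the smallest class receives 171 − 74 = 97 augmented images, and a balanced dataset of 51 classes holds 51 × 171 = 8,721 images.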

E. Splitting the Dataset
The oversampled dataset (171 images per class) was split using splitting ratios of 8:1:1, 7:2:1 and 6:3:1 for the training, validation and testing sets. As a result of multiple manipulations, different dataset versions were created. A naming convention was devised to properly organise the work being done. Table I lists the various datasets that were used throughout this paper.
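As a rough illustration of what the three ratios mean per class, the sketch below converts a ratio into image counts. The helper name `split_counts` and the rounding rule (train and validation rounded half up, test taking the remainder) are assumptions; the counts actually used are those of Table III.

```python
def split_counts(n_images, ratios):
    """Split n_images into train/validation/test counts for integer ratios.

    Train and validation counts are rounded half up; the test set takes the
    remainder so that the three counts always sum to n_images. This rounding
    rule is an assumption, not something stated in the paper.
    """
    total = sum(ratios)
    train = (2 * n_images * ratios[0] + total) // (2 * total)
    val = (2 * n_images * ratios[1] + total) // (2 * total)
    test = n_images - train - val
    return train, val, test


IMAGES_PER_CLASS = 171  # images per class after oversampling
for ratios in [(8, 1, 1), (7, 2, 1), (6, 3, 1)]:
    print(ratios, split_counts(IMAGES_PER_CLASS, ratios))
```

With this rounding, the test share is the same for all three ratios, which matches the statement that only the training and validation shares were varied.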

F. Feature Extraction and Classification
For this research, pre-trained CNN models were used both for feature extraction and classification. The pre-trained models that were employed are MobileNetV1, InceptionV3 and VGG16. Different image sizes were utilised depending on which model was imported. The image sizes used are shown in Table II [30][31]. After the features have been extracted, the next step is to use these features and a classifier to make a prediction. The architecture of the pre-trained models used is shown in Fig. 5.

G. Use of Callback Functions
Different callback functions, such as ModelCheckpoint and EarlyStopping, were employed during model training. The EarlyStopping callback function is viewed as a technique to combat model overfitting. For this work, the number of epochs was fixed at 100, and then the EarlyStopping function was used to halt training if the model became overfit [32]. The ModelCheckpoint callback function, on the other hand, was used to save the model during the training phase. Failure may occur occasionally, causing the training to be disrupted. It is preferable to resume training from the last saved epoch rather than starting it from scratch [33].
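The logic behind these two callbacks can be illustrated with a small framework-free simulation. It is a sketch under stated assumptions: the monitored quantity (validation loss), the patience of three epochs, the "keep the best epoch" checkpoint policy and the loss values are all hypothetical, since the paper does not report them, and the function `train_with_early_stopping` is not part of any library.

```python
def train_with_early_stopping(val_losses, patience=3):
    """Simulate patience-based early stopping over a sequence of epochs.

    val_losses: validation loss observed at the end of each epoch.
    Returns (stopped_epoch, best_epoch), both 1-indexed. best_epoch is the
    epoch whose weights a "save best" checkpoint would have kept.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: remember this epoch (a checkpoint would be saved).
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch  # stop before the epoch budget ends
    return len(val_losses), best_epoch


# Hypothetical validation losses: improvement stalls after epoch 4.
losses = [0.90, 0.55, 0.40, 0.35, 0.36, 0.37, 0.38, 0.39]
stopped, best = train_with_early_stopping(losses, patience=3)
print(f"stopped at epoch {stopped}, best epoch {best}")
```

In this toy run, training halts well before the epoch budget is exhausted, and the checkpoint from the best epoch is the one that would be restored or resumed from.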

H. Use of Optimiser
The Stochastic Gradient Descent (SGD) optimiser was employed during model training to adjust the weight and learning rate properties of the DL model to reduce losses during backpropagation. Additionally, to help the optimiser converge in the right direction and prevent overshooting, Nesterov momentum was employed. Due to its look-ahead property, the Nesterov method takes the appropriate precautions by making smaller updates to reach the minima [34][35]. Experiments were performed using the MobileNetV1 pre-trained model to find the optimal parameters to pass to the SGD optimiser function. We discovered that setting the learning rate to 0.001, decay to 1e-6, and momentum to 0.8 produces the best results. Thus, these parameter values were used throughout this work for model training.
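To make the role of the three hyperparameters concrete, the following sketch applies SGD with Nesterov momentum to a one-dimensional toy objective. The update rule shown is one common formulation, and the time-based decay `lr / (1 + decay * t)` is an assumption about how the decay value is applied; the function `sgd_nesterov` is illustrative and is not the optimiser implementation used in the study.

```python
def sgd_nesterov(grad_fn, w0, steps, lr=0.001, momentum=0.8, decay=1e-6):
    """Minimise a scalar function with SGD, Nesterov momentum and lr decay.

    Update rule (one common formulation of Nesterov momentum):
        lr_t = lr / (1 + decay * t)
        v    = momentum * v - lr_t * g
        w    = w + momentum * v - lr_t * g
    """
    w, v = w0, 0.0
    for t in range(steps):
        g = grad_fn(w)
        lr_t = lr / (1.0 + decay * t)      # time-based learning-rate decay
        v = momentum * v - lr_t * g        # velocity update
        w = w + momentum * v - lr_t * g    # look-ahead (Nesterov) step
    return w


# Toy objective f(w) = w**2 with gradient 2*w; its minimum is at w = 0.
w_final = sgd_nesterov(lambda w: 2.0 * w, w0=1.0, steps=2000)
print(w_final)
```

The default arguments are the values reported above (learning rate 0.001, decay 1e-6, momentum 0.8); on this toy problem they move the parameter steadily towards the minimum without overshooting.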

I. Training using Transfer Learning
Transfer learning is a concept whereby a previously trained (pre-trained) model is reused to tackle a new but comparable task. This is a popular deep learning technique as the neural network does not have to be trained from scratch with a huge volume of data. In transfer learning, the weights that the network has already learnt are simply transferred to the new task. This technique aids in reducing training time and may increase the performance of the neural network [36]. In this research, transfer learning is used to train the CNN models. Fig. 6 illustrates how transfer learning was applied for model training using the custom-built dataset. There are different types of fine-tuning that can be done on pre-trained CNN models, such as training the entire model, training some layers and leaving others frozen, and freezing the convolutional base by not training the feature extraction layers [37]. For this work, each test was done in two ways: by training the entire model and by freezing the convolutional base.
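The difference between the two training regimes can be sketched without any deep learning framework by tracking which layers are marked as trainable. Everything in this example is hypothetical: the `Layer` class, the layer names and the parameter counts are placeholders chosen only to show the effect of freezing, and they do not describe MobileNetV1, InceptionV3 or VGG16.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    n_params: int
    trainable: bool = True


def build_model(freeze_base):
    """Return the layers of a pre-trained base plus a new classification head.

    Parameter counts are made-up placeholders used only to show the effect
    of freezing; they do not describe any real network.
    """
    base = [Layer(f"conv_block_{i}", 100_000) for i in range(1, 4)]
    head = [
        Layer("dense_1", 50_000),
        Layer("dense_2", 10_000),
        Layer("softmax_51", 5_000),
    ]
    for layer in base:
        # Frozen layers keep their pre-trained weights during training.
        layer.trainable = not freeze_base
    return base + head


def trainable_params(layers):
    """Count the parameters that would be updated during training."""
    return sum(layer.n_params for layer in layers if layer.trainable)


full = trainable_params(build_model(freeze_base=False))
frozen = trainable_params(build_model(freeze_base=True))
print("train entire model:", full)
print("frozen convolutional base:", frozen)
```

Freezing the base leaves only the new classification head to be learnt, which is cheaper per epoch but gives the feature extractor no chance to adapt to the new images.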

J. The Web Application
A good and simple user interface is crucial for the user to efficiently use the application. Taking this into consideration, the user interface of the web application is divided into four areas: the image upload area, the prediction area, the creatures in dataset area and a modal displaying creature information.
1) Image upload area: Fig. 7 shows the image upload area when the application is accessed through a desktop and a mobile phone.
2) Prediction area: The prediction area is illustrated in Fig. 8. It shows the predicted creature, along with a table containing creatures with confidence scores greater than the threshold value. If the confidence scores computed for the input image are lower than the threshold, an appropriate message is displayed to the user as shown in Fig. 9. In this context, a threshold value of 0.5 is employed.
3) Creatures in dataset area: The creatures that the system can classify are shown in Fig. 10.
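The thresholding behaviour of the prediction area can be sketched as below. The example assumes that the confidence scores are the softmax outputs of the model; the `predict` helper, the raw scores, the third label and the fallback message are hypothetical and are not taken from the application.

```python
import math

THRESHOLD = 0.5  # threshold value stated in the text


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits, labels, threshold=THRESHOLD):
    """Return (label, score) pairs whose confidence exceeds the threshold.

    An empty list means that no class was confident enough, in which case
    the interface would show a message instead of a prediction.
    """
    scores = softmax(logits)
    ranked = sorted(zip(labels, scores), key=lambda p: p[1], reverse=True)
    return [(label, score) for label, score in ranked if score > threshold]


# Two species from the dataset plus a placeholder third class.
labels = ["Dascyllus aruanus", "Chaetodon kleinii", "class_3"]
confident = predict([4.0, 1.0, 0.5], labels)
uncertain = predict([1.0, 1.0, 1.0], labels)
print(confident or "No creature recognised")
print(uncertain or "No creature recognised")
```

Because softmax scores sum to one, at most one class can exceed a threshold of 0.5, so under this assumption the table either contains the single predicted creature or is replaced by the message.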

IV. RESULTS
This section provides an evaluation of the different models.

A. Testing Different Split Ratios
Table III shows the three splits for the oversampled dataset. The three pre-trained models employed were trained on the three versions of the oversampled dataset. Tables IV, V and VI show the results obtained. The best accuracy in all cases is obtained when the feature extraction layers are trained. Table VII shows the best performing model for each of the different splits.
Irrespective of the split ratios used, the best models were able to achieve very good accuracy of above 99%. The variations in the scores, as indicated in Table VII, are less than 1%. This is due to the fact that randomness is used in weight initialization when training of the model starts. The weights are adjusted at every epoch. This produces different outcomes for the same model each time it is trained on the same dataset [38][39]. This means that if the same experiments were repeated, different scores would have been obtained. To conclude, the differences obtained are considered insignificant.
Judging the models solely on accuracy is not enough to give a fair evaluation. The model's prediction time must also be considered. For each model, the prediction of all the images in its test directory was repeated five times. In Table VIII, the total time taken is displayed in seconds. From Table IX, it can be seen that the MobileNetV1 models achieved the lowest inference time. Table X shows the number of trainable parameters and the sizes of the three models. Among the three models, MobileNetV1 has the least number of trainable parameters. Furthermore, the MobileNetV1 model is smaller in terms of size. As a result, it has the shortest inference time. For deployment, the MobileNetV1 model trained on the DS_oversample_8_1_1 dataset was chosen since it had the lowest inference time of 0.10 seconds per image.
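The per-image time follows from the measured total by two divisions: first by the number of repetitions, then by the number of test images. The sketch below shows this arithmetic; the function name `seconds_per_image` and both input figures are hypothetical and were chosen only so that the result is easy to check.

```python
def seconds_per_image(total_seconds, n_images, n_runs=5):
    """Average prediction time for a single image.

    total_seconds: time accumulated over n_runs passes through the test set.
    """
    average_per_run = total_seconds / n_runs
    return average_per_run / n_images


# Hypothetical figures, chosen only to illustrate the arithmetic.
total = 433.5   # seconds for five passes over the test set
images = 867    # number of test images (assumed)
print(round(seconds_per_image(total, images), 2))
```

Dividing by the number of runs first gives the average time for one pass over the test set, which is the intermediate value from which the per-image figure is derived.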

B. Comparisons with Related Works
Even though the accuracies obtained in this study cannot be truly compared with those of other researchers because the same datasets were not employed, an attempt to compare our work with previous studies is made in this section.
Deep and Dash conducted several experiments using a dataset of 23 creatures [20]. They used CNN for both feature extraction and classification. Additionally, they used a hybrid strategy in which CNN was used to extract features and a classifier (kNN or SVM) was used to classify them. They got the best accuracy of 98.79% when they used their custom-made CNN for feature extraction and kNN as a classifier. However, in this study, we were able to achieve higher accuracy when the pre-trained CNN models were used for both extraction and classification.
The training methodology adopted by Liu et al. is the same as the one we have used [29]. They also used transfer learning to train their models. They used the MobileNetV1, MobileNetV2 and InceptionV3 models.

V. CONCLUSIONS
Automatic recognition of marine creatures is a topic of interest to many researchers around the world. People who are unfamiliar with marine taxonomy have trouble discriminating between different aquatic organisms. This poses a problem, especially when deadly ocean animals are encountered. Several studies have been undertaken over the last few decades, but not many have examined the effect of training deep learning pre-trained CNN models on a large dataset.
In this research, three deep learning models, namely, MobileNetV1, InceptionV3, and VGG16, were investigated and implemented for the task of marine creature classification. To achieve the objectives of this study, a customised dataset of 51 available creatures from the Indian Ocean was built and used for training and testing the effectiveness of models. Images of marine creatures were collected at Odysseo Oceanarium in Mauritius.
Several experiments with different split ratios were carried out. The splits for the training and validation sets were varied, and that for the testing set was fixed at 10%. Transfer learning was used, and the models were fine-tuned by replacing their classification layers with new ones. Adding to that, two experiments were performed for each model: training the feature extraction layers and not training them. All of these tests were carried out in order to determine the optimal split ratio, dataset, and model.
It has been concluded that the best suited model was MobileNetV1 trained with the oversampled dataset with a split ratio of 80% for training, 10% for validation and 10% for testing. The model attained a classification accuracy and an F1 score of 99.89%. The model had an inference time of 0.10 seconds per image. This model was then integrated into a web application.
Our research has thus demonstrated that deep learning models offer enormous potential for automating the process of marine creature recognition. The developed web application with the integrated MobileNetV1 model provides a reliable and fully automated tool for the classification of marine creatures without the need for expert assistance.

In the architecture shown in Fig. 5, the input image given to the VGG16 and MobileNetV1 models is of size 224 × 224, whereas the InceptionV3 model takes an input of size 299 × 299. The classification layers of the pre-trained model were replaced with one global average pooling layer and three dense layers. The softmax activation function was used in the model's final dense layer for classification. For the two other dense layers, the rectified linear unit (ReLU) activation function was used. The custom model predicts the input image as one of the 51 classes of marine creatures from the dataset.

Fig. 6. Block diagram of the proposed training strategy.

Fig. 7. Image upload area for desktop view (left) and mobile view (right).

4) Modal displaying creature information: The modal component is used to display information about a creature as shown in Fig. 11.
Liu et al. used the MobileNetV1, MobileNetV2 and InceptionV3 models to perform feature extraction and classification [29].

TABLE I. SUMMARY OF DATASET VERSIONS
# | Dataset Name | Description
1 | DS_oversample_8_1_1 | This is the oversampled dataset, which contains 171 images for each class. Three different splits are done.

TABLE II. IMAGE SIZES FOR DIFFERENT MODELS

TABLE III. OVERSAMPLED DATASET SPLIT SUMMARY
Table IV, Table V and Table VI show the results obtained.

TABLE IV. DS_OVERSAMPLE_8_1_1 RESULT

TABLE VII. BEST MODELS FOR OVERSAMPLED DATASET
From Table VIII, the average inference time can be calculated by dividing the time taken to predict all the images by five. The prediction time for one image can then be computed by dividing the resulting value by the number of test pictures, as shown in Table IX.

TABLE VIII. PREDICTION TIME

TABLE X
Table XI shows a comparison between the best performing model in their work and in ours.

TABLE XI. COMPARISON OF THE BEST PERFORMING MODEL
The MobileNetV2 model presented by Liu et al. is limited to predicting 7 species [29]. However, the model presented in this study can perform classification between 51 creatures. The larger the number of classes in a deep learning model, the more time the model generally takes to predict the image.