An Efficient Deep Learning based Hybrid Model for Image Caption Generation



I. INTRODUCTION
Every day we encounter a huge number of images, on the web and in the real world, which each person interprets using their own knowledge. Humans naturally translate a scene into text, but this remains a complex task for machines, which are far less capable than humans at it. Human-generated captions are still considered better, because machines need human intervention and must be programmed accordingly to achieve good results. Owing to recent developments in deep learning-based techniques, computers can now handle the challenges of image captioning, such as detecting objects, their attributes and their relationships, extracting image features, and generating syntactically and semantically correct captions [1].
With the advancement of AI, many new ideas have revolutionized image processing and transformed the world in surprising ways. Image captioning (Fig. 1) has wide application in the real world because it provides a better platform for human-computer interaction. Given these emerging applications, image captioning has become a topic of interest for academics and researchers.
Looking at the picture in Fig. 2, one person might say that two dogs are playing with a toy, another that two dogs are hauling in a floating toy from the ocean, and a third that two dogs run through the water with a rope in their mouths; all of these captions describe the picture appropriately. The human brain is so well trained that it can describe a picture almost accurately, but the same has not been true of machines.
Hence, the main aim of image captioning is first to identify the objects present in the image and their relationships using deep learning-based techniques, then to generate a textual description using natural language processing, and finally to evaluate that description using different performance metrics. Object detection and segmentation belong to computer vision and are performed with popular CNNs and DNNs, whereas generating the image description (Fig. 3) belongs to natural language processing and is performed by RNNs and LSTMs. The CNN interprets the objects in the image or scene and answers questions about them such as what, where and how. For example, in Fig. 3 the CNN identifies "dog", "toy" and "water" and their relationships in the scene. The RNN then turns the keywords produced by the CNN into text by treating them as a group of words. This arrangement is called the encoder-decoder architecture. Object detection, a part of computer vision, uses algorithms such as YOLO, R-CNN, Mask R-CNN, MobileNet and SqueezeDet to detect the different parts of the image efficiently.
The template-based approach uses predefined templates of objects, actions and attributes to describe the input image. In [18], the authors use visual elements such as object, action and scene to predict the caption of the image. In [19], the authors use a Conditional Random Field (CRF) based technique to extract the features of the image; the proposed model is evaluated using BLEU and ROUGE scores on the PASCAL dataset. Because this approach relies on predefined templates, it cannot generate captions of variable length.
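The fixed-length limitation of template-based captioning can be illustrated with a minimal sketch. The template string and the slot names below are hypothetical and are not taken from [18] or [19]; real systems use richer grammars.

```python
# Hypothetical three-slot template; the wording is an assumption for illustration.
TEMPLATE = "{obj} is {action} in {scene}"


def fill_template(obj: str, action: str, scene: str) -> str:
    """Fill the fixed slots with the detected visual elements."""
    return TEMPLATE.format(obj=obj, action=action, scene=scene)


caption = fill_template("a dog", "playing", "the water")
print(caption)  # → a dog is playing in the water
```

Whatever the detector finds, the sentence structure stays the same, which is why captions of variable length cannot be produced this way.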
The retrieval-based approach generates a caption by comparing the features of the image with those in the dataset: it finds a caption for the input image by discovering images with similar features. In [22], the authors propose a model that extracts the features of the query image and searches for them in the dataset; in [32], the authors generate the caption using a density estimation method. In [25], the authors use semantic and visual features for image caption generation.
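As a rough illustration of the retrieval idea, the sketch below returns the caption of the dataset image whose feature vector is most similar to the query. Cosine similarity, the three-dimensional toy vectors and the function names are assumptions made for this example; the cited works use their own features and similarity measures.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_caption(query, dataset):
    """Return the caption of the most similar dataset entry.

    dataset is a list of (feature_vector, caption) pairs.
    """
    best = max(dataset, key=lambda item: cosine(query, item[0]))
    return best[1]


# Toy data, invented for illustration only.
dataset = [
    ([0.9, 0.1, 0.0], "two dogs play in the water"),
    ([0.0, 0.8, 0.6], "a man rides a bike on a hill"),
]
print(retrieve_caption([0.8, 0.2, 0.1], dataset))  # → two dogs play in the water
```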
In the original dataset there are five captions for each image, and the goal is to train a model on this dataset. After the training phase, the model can extract the features of a given image. Various pretrained image classification models are available that use state-of-the-art algorithms to classify thousands of different objects efficiently. Such models, for example ResNet, achieve good image classification accuracy and are easy to use.
The encoder-decoder approach, which is based on deep neural networks, is the most widely used for machine translation and image caption generation. A dual graph convolution network based model is proposed in [33], and the NIC (Neural Image Caption) model based on an encoder-decoder architecture is presented in [27]. NIC is a simple model in which a CNN is used as the encoder and, at the decoder end, LSTM and RNN are used to generate the caption.

III. RESEARCH METHODOLOGY
To extract the visual features of the image, a CNN is used as the encoder; it consists of convolution layers, pooling layers and fully connected layers. Earlier, AlexNet was used for computer vision problems, but transfer learning is now the trend: several pretrained CNN-based models such as VGGNet, Inception V3, DenseNet and ResNet are available with different numbers of convolutional layers, and using them saves training time. The decoder takes its input from the encoder and generates the final caption. GRU, LSTM and RNN are the most commonly used decoders. RNNs are suitable for short word sequences, whereas LSTMs are better for long sequences.
This section describes the proposed hybrid methodology. The main objective of the proposed model is to achieve a higher METEOR value. The model follows an encoder-decoder approach and uses transfer learning. In the first phase, the features of the image are extracted separately using VGG16, ResNet50 and YOLO (You Only Look Once). YOLO is an efficient real-time object detection algorithm introduced in 2015 by Joseph Redmon et al. VGG16 (from the Visual Geometry Group) is a deep Convolutional Neural Network (CNN) for image classification, pretrained on the ImageNet dataset, with 16 weight layers (13 convolutional and 3 fully connected). ResNet50 is a deep CNN with 50 layers that can classify images into 1000 object categories.
In the second phase, the image features extracted by VGG16, ResNet50 and YOLO are concatenated and all duplicate words are eliminated.
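The second phase can be sketched as follows, assuming that each extractor yields a list of word labels; the paper does not specify the exact data structure, so the function name and the sample labels are hypothetical.

```python
def merge_labels(*label_lists):
    """Concatenate label lists and drop duplicates, keeping first-seen order.

    Labels are lower-cased so that "Dog" and "dog" count as the same word
    (an assumption of this sketch).
    """
    seen = set()
    merged = []
    for labels in label_lists:
        for label in labels:
            key = label.lower()
            if key not in seen:
                seen.add(key)
                merged.append(key)
    return merged


# Hypothetical outputs of the three extractors for the image in Fig. 3.
vgg16_labels = ["dog", "water"]
resnet50_labels = ["dog", "toy"]
yolo_labels = ["Dog", "toy", "water"]

print(merge_labels(vgg16_labels, resnet50_labels, yolo_labels))
# → ['dog', 'water', 'toy']
```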
In the third phase, captions are generated using BiGRU and LSTM. BiGRU (Bidirectional Gated Recurrent Unit) is a neural network architecture used in NLP (Natural Language Processing); it uses two GRUs that read the input in the forward and backward directions. LSTM (Long Short-Term Memory) is a type of recurrent neural network architecture that uses feedback connections and is capable of identifying the relations between objects. In the last phase, the two captions are compared using the METEOR evaluation metric, and the caption with the higher METEOR value is selected as the final caption.
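The selection step of the last phase might look like the sketch below. Note that `simple_meteor` is a simplified stand-in, not full METEOR: it uses exact unigram matches only and the recall-weighted harmonic mean 10PR/(R + 9P), and it omits stemming, synonym matching and the fragmentation penalty. The candidate captions and variable names are hypothetical.

```python
from collections import Counter


def simple_meteor(candidate: str, reference: str) -> float:
    """Simplified METEOR-style score (exact matches, no penalty term)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    # Number of unigram matches, respecting word multiplicity.
    matches = sum((Counter(cand) & Counter(ref)).values())
    if matches == 0:
        return 0.0
    precision = matches / len(cand)
    recall = matches / len(ref)
    return 10 * precision * recall / (recall + 9 * precision)


def pick_caption(candidates, references):
    """Return the candidate whose best score over the references is highest."""
    def best_score(caption):
        return max(simple_meteor(caption, ref) for ref in references)
    return max(candidates, key=best_score)


references = ["two dogs play with a toy in the water"]
bigru_caption = "two dogs playing with toy in water"   # hypothetical BiGRU output
lstm_caption = "a dog in water"                        # hypothetical LSTM output

print(pick_caption([bigru_caption, lstm_caption], references))
```

In this toy case the first caption shares more words with the reference and is therefore selected.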

IV. DATASETS
Data are the backbone of any AI-based system. Image captioning has recently benefited from rich datasets such as MSCOCO, Flickr8k, Flickr30k and PASCAL. In these datasets, every image is described by five reference sentences, each of which describes the scene with different wording and grammar. MSCOCO is a large dataset developed by Microsoft whose aim is to describe an image as a human being would. It targets scene understanding: image recognition, segmentation and generation of a suitable caption for the image. It contains 82,783 training images, 40,504 validation images and 40,775 test images. The Flickr30k dataset has 28,000 training images, 1000 testing images and 1000 validation images.
In this paper, the benchmark dataset Flickr8k is used to train the model. It contains 8000 images with 5 captions each, which provide clear descriptions of the salient objects. The captions for all images are manually labelled in English. The dataset is divided into two categories. The first is the image directory, which has the 8000 images with 5 captions each. Of the 8000 images, 6000 are used for training and the remaining 2000 for testing. The images in the Flickr8k dataset are in JPG format with resolutions from 256*500 to 500*500, and the average sentence length is 12 words.
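A 6000/2000 partition of the 8000 images can be reproduced with a sketch such as the one below. The seeded random shuffle and the file-name pattern are assumptions; the paper does not state how the images were assigned to each part.

```python
import random


def split_dataset(image_ids, n_train, seed=0):
    """Shuffle the ids reproducibly and cut them into two parts."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    return ids[:n_train], ids[n_train:]


# Hypothetical file names standing in for the 8000 Flickr8k images.
image_ids = [f"img_{i:04d}.jpg" for i in range(8000)]
train_ids, held_out_ids = split_dataset(image_ids, n_train=6000)

print(len(train_ids), len(held_out_ids))  # → 6000 2000
```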

V. RESULT AND ANALYSIS
The performance of the generated captions is evaluated using different metrics: BLEU, METEOR, ROUGE, CIDEr and SPICE. The BLEU score is used to analyze the proposed model by matching the predicted words against the original captions. Fig. 4 illustrates how the loss gradually decreases as the number of training epochs grows. The model could be trained for more epochs to obtain better descriptions; here it was trained for 100 epochs to enable a comparison study. The loss value lies between 0.5 and 0.1. The maximum and minimum values, observed for 10 epochs, are losses of more than 0.5 and less than 0.1, respectively.
Fig. 5 plots the BLEU score obtained by comparing the predicted caption with the five original captions. From 5 to 10 epochs, the BLEU score rises sharply from 0.50 to 0.56, after which the curve shows slight ups and downs until 50 epochs. Another score, called "match words", counts the words of the produced caption that match the reference text of a picture. As the graph shows, the match words score rises significantly, with fluctuations, as training proceeds: it is 0.40 at 5 epochs and 0.49 at 50 epochs. When the match words and BLEU scores are compared, both rise before reaching their peaks. The match words score increases from 0.500 to 0.555 between 5 and 10 epochs and then changes only slightly through 50 epochs, reaching 0.575. The BLEU score has two distinct peaks, 0.450 and 0.470, at 15 and 30 epochs; it declines slightly to 0.460 at 35 epochs and finally reaches 0.480 at 50 epochs.
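For reference, a sentence-level BLEU score against several reference captions can be computed as in the following sketch. It uses clipped n-gram precision, uniform weights and the standard brevity penalty, without smoothing; whether the paper's implementation matches these choices is not stated, so this should be read as an assumption. The example captions are illustrative.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=1):
    """Sentence-level BLEU up to max_n-grams (no smoothing)."""
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        total = sum(cand_counts.values())
        # Clip each n-gram count by its maximum count in any single reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty, using the reference length closest to the candidate.
    c = len(cand)
    r = min((abs(len(ref) - c), len(ref)) for ref in refs)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)


references = ["a black dog jumps over a log", "a dog leaps over a fallen log"]
score = bleu("a black dog jumping over a log", references, max_n=1)
print(round(score, 3))  # → 0.857
```

Here six of the seven candidate words occur in a reference, so BLEU-1 is 6/7.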
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 14, No. 3, 2023

Sample captions with their BLEU scores:
"a brown puppy is walking in snow", BLEU Score: 75
"A man flying with skateboard", BLEU Score: 72
"A girl is running on beach", BLEU Score: 73
"a player in white uniform is running with ball", BLEU Score: 73
"a white dog runs around in grass", BLEU Score: 75
"a man in black dress rides bike on hill", BLEU Score: 69
"a puppy is hopping in a grassy area", BLEU Score: 70
"three person standing under umbrella", BLEU Score: 72
"a spotted dog is running with a ball", BLEU Score: 73
"a black dog playing with a ball", BLEU Score: 75
"a person is climbing a snowy mountain", BLEU Score: 74
"two old woman in red dress smile", BLEU Score: 74
"a woman is smiling and swinging", BLEU Score: 72
"a small girl in pink is sitting with a dog", BLEU Score: 74
"a black dog jumping over a log", BLEU Score: 76

The graphical representation illustrates how the model's recall changes with the threshold value. For threshold values from 0.0 to 0.25, recall remains constant at 1. It then falls steadily between 0.25 and 0.75, approaching 0.0, after which a very small increase to a recall of about 0.1 is noted. The final recall value is 64.056. The accuracy curve (Fig. 6) has the shape of a sharp peak: accuracy stays constant at 0.500 for threshold values from 0.0 to 0.25, climbs straight to 0.675, and then falls by a similar amount up to the 0.75 threshold. The resulting accuracy is 67.052. The precision graph shows how precision varies with the threshold setting. The overall precision is 68.138; between threshold values of 0.2 and 0.75 the precision simply increases. The other starting and ending values are 1.0 from .075 to 0.25 and 0.5 from 0.0 to 0.25.
Further, in Fig. 7 the BLEU score and the match score are compared, and the two are compatible. The average of both scores is first 0.52 at 5 epochs and increases to 0.56 at 10 epochs. The model shows its best performance at 30 epochs, and the score decreases at 35 epochs due to overfitting. Figs. 8, 9 and 10 show precision, recall and accuracy.
Tables I and II give the results of an LSTM-based decoder model using a single encoder on the Flickr8k dataset. Five encoders (Inception V3, VGG16, ResNet50, VGG19 and the proposed hybrid approach) are compared, each with its own BLEU-1, BLEU-2, BLEU-3, BLEU-4, ROUGE-L and METEOR values. The maximum BLEU-1 value, 0.67, is obtained by the proposed hybrid approach encoder. For BLEU-2, the minimum value is held by ResNet50. For BLEU-3 and BLEU-4, the minimum is again seen for ResNet50, at 0.18 and 0.12, whereas the maximum is obtained by the proposed hybrid approach encoder. The ROUGE-L values are 0.21, 0.23, 0.27, 0.21 and 0.31 for Inception V3, VGG16, ResNet50, VGG19 and the proposed hybrid approach, respectively. For METEOR, VGG16 and VGG19 share the same value of 0.22.
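The way precision, recall and accuracy vary with a decision threshold, as discussed in this section, can be illustrated with the sketch below. It assumes a binary setting in which each prediction carries a confidence score and a ground-truth label; how the paper defines a correct prediction for captions is not stated, and the scores and labels here are invented toy data.

```python
def metrics_at_threshold(scores, labels, threshold):
    """Return (precision, recall, accuracy) when predicting positive
    for every score greater than or equal to the threshold."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(labels) if labels else 0.0
    return precision, recall, accuracy


# Toy confidence scores and ground-truth labels (1 = correct, 0 = incorrect).
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

for t in (0.0, 0.25, 0.5, 0.75):
    print(t, metrics_at_threshold(scores, labels, t))
```

At a threshold of 0.0 every prediction is accepted, so recall is 1 and precision equals the share of positives; raising the threshold trades recall for precision.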

VI. CONCLUSION
This paper presents a hybrid encoder-decoder based model that generates effective image captions, using the Flickr8k dataset. In the encoding phase, the proposed model uses transfer learning-based models, namely VGG16, ResNet50 and YOLO, to extract the image features. A concatenate function combines the features and removes the duplicates. For decoding, BiGRU and LSTM are used to produce the complete caption of the image. The BLEU value is then evaluated for both captions generated by BiGRU and LSTM, and the caption with the higher METEOR value is taken as the final caption. The proposed model is also evaluated with METEOR and ROUGE. It achieves BLEU-1: 0.67, METEOR: 0.54 and ROUGE: 0.31 on the Flickr8k dataset. The experimental results show better BLEU, METEOR and ROUGE scores than other state-of-the-art models. The model is also helpful for generating captions in real time.