Image Captioning using Deep Learning: A Systematic Literature Review

Automatic image captioning is the process of generating captions or textual descriptions for images based on their contents. It is a machine learning task that involves both natural language processing (for text generation) and computer vision (for understanding image contents). Automatic image captioning is a recent and growing research problem, and new methods are continually being introduced to achieve satisfactory results. However, considerable work is still required to achieve results as good as a human's. This study aims to find out, in a systematic way, which recent methods and models are used for image captioning with deep learning, which methods are used to implement those models, and which methods are more likely to give good results. To do so, we performed a systematic literature review of recent studies, from 2017 to 2019, from well-known databases (Scopus, Web of Science, IEEE Xplore). We found a total of 61 primary studies relevant to the objective of this research. We found that a CNN is used to understand image contents and find the objects in an image, while an RNN or LSTM is used for language generation. The most commonly used datasets are MS COCO, used in all studies, and Flickr8k and Flickr30k. The most commonly used evaluation metric is BLEU (1 to 4), used in all studies. We also found that LSTM with CNN outperforms RNN with CNN. The two most promising methods for implementing this model are the encoder-decoder framework and the attention mechanism, and a combination of them can improve results considerably. This research provides guidelines and recommendations to researchers who want to contribute to automatic image captioning.
Keywords: Image Captioning; Deep Learning; Neural Network; Recurrent Neural Network (RNN); Convolution Neural Network (CNN); Long Short Term Memory (LSTM)


I. INTRODUCTION
Automatic image captioning is the process of automatically generating human-like descriptions of images. It is a prominent task with considerable practical and industrial significance [62]. It has practical uses in industry, security, surveillance, medicine, agriculture and many other domains. It is not only a crucial but also a very challenging task in computer vision [1]. Traditional object detection and image classification tasks only need to identify the objects within an image, whereas automatic image captioning must also identify the relationships between them and understand the scene as a whole. After understanding the scene, it must also generate a human-like description of the image. With the rise of automation and artificial intelligence, much research is under way to give machines human-like capabilities and reduce manual work. For machines, achieving results and accuracy as good as a human's in image captioning has always been a very challenging task.
Automatic image captioning is performed through the following key tasks, in order. First, features are extracted. After proper feature extraction, the different objects in the image are detected. Then the relationships between the objects are identified (e.g., if the objects are a cat and grass, it must be determined whether the cat is on the grass). Once the objects are detected and their relationships identified, the text description is generated, i.e., a sequence of words ordered so that they form a good sentence according to the relationships between the image objects.
To perform the above key tasks with deep learning, different deep learning networks are used. For example, to obtain visual features and objects, a CNN with region-proposal models such as R-CNN and Faster R-CNN can be used, and to generate the text description as a sequence, an RNN or LSTM can be used. Using these networks, various methods have been developed to perform automatic image captioning in different domains. However, there is still room to make machines capable of generating descriptions like a human [61]. To evaluate the performance of a deep learning network trained for image captioning, various evaluation metrics exist, such as BLEU, CIDEr, and ROUGE-L.
The purpose of this systematic literature review is to study the most recent articles, from 2017 to 2019, in order to find the different methods used to achieve automatic image captioning in different domains, the datasets used for the task, the practical domains in which the task is applied, and the techniques that outperform others, and finally to describe the technicalities behind the different networks, methods and evaluation metrics. Our study will help new researchers who want to work in this domain attain better accuracy. We focused specifically on collecting quality articles published to date. We attempt to identify the different techniques presented in these articles, along with the strengths and weaknesses of their methods. Finally, we attempt to summarize them to explain which technique performs better in its particular domain. Our work mostly focuses on identifying the most popular techniques and the areas that still require attention; in the results section we also attempt to explain the technical concepts behind the approaches used.

II. METHODOLOGY
The planning, conducting and reporting of this systematic literature review were done step by step. First, in the planning stage, we identified the need for this research and its importance. Identifying the research questions, designing the search strategy, designing the quality assessment criteria and designing the data extraction strategy were also planned during this stage. After proper planning, we conducted the research. In alignment with our research problem, we formulated research questions that we try to answer in this review.

A. Research Questions
Before conducting this study, we defined the following research questions to guide our work and measure its quality. This study provides detailed knowledge related to these research questions, which are listed in Table I.

B. Search Results
Based on our research questions, we derived our search keywords and categorized them into two groups, shown in Table II. We followed a systematic approach to search different academic databases, composing the query string from the keywords listed in Table II.
Query String: ("Image Captioning") AND ("Deep Learning" OR "Neural Network" OR "RNN" OR "LSTM" OR "CNN")
We applied this query string to three well-known academic databases, namely IEEE Xplore, Web of Science and Scopus. We selected the most recent journal articles, published during 2017-2019; our initial search results are shown in Table III. Since an article can be indexed in several databases, we removed the duplicate articles. The total number of studies from all three databases after duplicate removal is shown in Table IV.
Abstract screening is also important for filtering the search results and keeping the studies most relevant to the work at hand. We performed abstract screening on the 577 articles that remained after duplicate removal to check their relevance to our work. We found that many studies were not relevant to our topic; for example, some were about audio captioning or video captioning. After the abstract screening, 308 of the 577 studies remained. Table V shows the number of studies from each database after the abstract screening.
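As an illustration only, the composition of the query string and the duplicate removal can be sketched in Python. The names `build_query` and `deduplicate` are hypothetical, the keyword groups are assumed to mirror Table II, and matching duplicates on a normalized title is an assumption; the review does not state how duplicates were identified.

```python
def build_query(group_a, group_b):
    """Join each keyword group with OR, then combine the two groups with AND."""
    def or_group(keywords):
        return "(" + " OR ".join(f'"{k}"' for k in keywords) + ")"
    return f"{or_group(group_a)} AND {or_group(group_b)}"


def deduplicate(records):
    """Keep the first record seen for each title (case and spacing ignored)."""
    seen, unique = set(), []
    for rec in records:
        key = " ".join(rec["title"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


# Keywords as they appear in the query string (assumed to mirror Table II).
TASK_KEYWORDS = ["Image Captioning"]
METHOD_KEYWORDS = ["Deep Learning", "Neural Network", "RNN", "LSTM", "CNN"]
QUERY = build_query(TASK_KEYWORDS, METHOD_KEYWORDS)

# Made-up records: the same article exported from two databases.
SAMPLE = [
    {"title": "An Example Captioning Model", "source": "Scopus"},
    {"title": "an example  captioning model", "source": "IEEE Xplore"},
    {"title": "Another Example Study", "source": "Web of Science"},
]
UNIQUE = deduplicate(SAMPLE)
```

In practice a DOI, where available, is usually a more reliable duplicate key than a title.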

C. Quality Assessment Criteria
The 308 selected studies were assessed against quality assessment criteria (QAC) to ensure the quality of our study. The assessment was carried out through full-text screening, which also resolved the studies whose relevance was ambiguous or unclear from abstract screening. All four authors agreed on a set of quality assessment questions (QAQ) to ensure the quality of our work.

QA Q1: The article must be published in a journal.
QA Q2: The article has proposed a proper method to implement image captioning using deep learning.
QA Q3: The article must have clear and unambiguous results.
QA Q4: The article must discuss the applications and challenges of image captioning.
QA Q5: The article must discuss the evaluation strategy of the built model.
We assessed the quality of the 308 studies on the basis of the quality assessment questions through full-text screening, which left a total of 61 studies from all three databases. The number of studies from each database is shown in Table VI.
The selection process described above is illustrated in the PRISMA diagram (see Fig. 1).

D. Data Extraction and Synthesis
After selecting the final 61 primary studies, we extracted data from them for the final synthesis. We defined our data extraction strategy based on our research questions. We extracted the following parameters from the primary studies for further synthesis: year of publication, title, models used for language generation and object detection, methods used to implement the models, datasets used, evaluation metrics used, and the accuracy of the proposed model.
The purpose of the synthesis is to summarize the facts gathered during data extraction, give a clear picture of the work done in the past, and offer directions to new researchers.

A. Datasets
There are many datasets available for image captioning. In the literature, the most commonly used datasets are MS COCO, Flickr8k and Flickr30k. Moreover, for specific tasks such as medical or traffic-movement description, dedicated datasets have been created. Fig. 2 shows the datasets along with their frequency in our selected studies.
1) MS COCO: MS COCO stands for Microsoft Common Objects in Context. It is a very large dataset that contains 330k images, 1.5 million object instances and 5 captions per image. MS COCO is very widely used in the literature. It is well suited to image captioning because, unlike other datasets, it contains non-iconic images. Iconic images contain only one object against a background, whereas non-iconic images contain various overlapping objects. Object layout plays an important role in understanding the context of a scene, and it was carefully taken into account when the images were labeled. Fig. 3 shows some images taken from the MS COCO dataset.
2) Deep learning networks: The deep learning network used for images is the convolutional neural network (CNN). CNNs have proved to be the best at mapping image data to an output variable. Various pre-built models take advantage of this property of CNNs, e.g., R-CNN and Faster R-CNN. These models are used for object detection and localization in images, which is a necessary task in image captioning, since captioning is not just a classification task and requires understanding the image contents. Once the image data is understood, the sequence of words must be predicted to generate the text for that image. For sequence prediction, the two best-known networks are the recurrent neural network (RNN) and long short-term memory (LSTM). For image caption generation, a CNN is used with either an RNN or an LSTM, where the CNN is used for understanding image contents and the RNN or LSTM for generating the text description. Fig. 4 and Table VII show the number of studies that used an RNN or LSTM with a CNN. In terms of performance, we compared the BLEU-1 scores of both text prediction networks and found that LSTM outperforms RNN in accuracy. Fig. 6 shows the results of the five highest-accuracy papers for each network.

3) Convolutional Neural Network (CNN): A convolutional neural network is a deep learning algorithm that is normally used to process images. A CNN is an evolution of the simple ANN that gives better results on images. A simple dense network works well for classification tasks in which a few features are used to classify the image, whereas a CNN performs best when an image has many features. It also processes local features, because images contain repeating patterns. It takes an image as input and interprets it to perform the assigned task. The two main operations of a CNN are convolution and pooling. Convolution is used to detect features such as the edges in an image, and pooling is used to reduce the size of the resulting feature maps. In convolution, we take a small matrix of numbers, called a kernel or filter, slide it over the image, and transform the image according to the filter values. The feature map is calculated with a formula in which f denotes the input image and h denotes the filter; the rows and columns of the resulting matrix are indexed by m and n, respectively.
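The feature-map formula is not reproduced in this copy. With the symbols defined above, the standard discrete 2-D convolution reads $G[m, n] = (f * h)[m, n] = \sum_j \sum_k h[j, k]\, f[m - j, n - k]$; this is given as the likely intended formula, not as a quotation. The sketch below is a minimal pure-Python illustration of convolution and pooling under stated assumptions: no padding, stride 1, and the unflipped (cross-correlation) form that deep learning libraries usually implement under the name convolution. The function names, the toy image and the kernel are hypothetical.

```python
def conv2d_valid(f, h):
    """Slide kernel h over image f (no padding, stride 1).

    Computes G[m][n] = sum_j sum_k h[j][k] * f[m + j][n + k].
    """
    kh, kw = len(h), len(h[0])
    out_h = len(f) - kh + 1
    out_w = len(f[0]) - kw + 1
    return [
        [
            sum(h[j][k] * f[m + j][n + k] for j in range(kh) for k in range(kw))
            for n in range(out_w)
        ]
        for m in range(out_h)
    ]


def max_pool2d(g, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    return [
        [
            max(g[m + j][n + k] for j in range(size) for k in range(size))
            for n in range(0, len(g[0]) - size + 1, size)
        ]
        for m in range(0, len(g) - size + 1, size)
    ]


# Toy 4x5 image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]
# Hypothetical 2x2 kernel that responds to a dark-to-bright step from left to right.
edge_kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = conv2d_valid(image, edge_kernel)  # non-zero only where the edge lies
pooled = max_pool2d(feature_map, size=2)        # smaller map that keeps the strongest response
```

The feature map is non-zero only in the column where the kernel straddles the edge, and pooling shrinks the map while keeping that response.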

4) Recurrent Neural Network (RNN):
CNNs generally do not perform well on sequential data, in which the inputs are interrelated. A CNN has no connection of any kind between the previous input and the next one, so each output depends only on its own input: given the trained model, the CNN takes an input and produces an output. For sequential tasks, an RNN is used instead. An RNN has its own memory, so it can recall what happened earlier in the data, that is, in the previous inputs. RNNs perform best on textual data because text is interrelated (sequential) data. The basic formula for an RNN is written below.
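The formula itself is not reproduced in this copy. A common textbook form of the simple (Elman) RNN is given here as an assumption about what was intended, not as the authors' exact notation:

```latex
\begin{aligned}
h_t &= \tanh\left(W_{hh}\, h_{t-1} + W_{xh}\, x_t + b_h\right) \\
y_t &= W_{hy}\, h_t + b_y
\end{aligned}
```

Here $x_t$ is the input at step $t$, $h_t$ is the hidden state (the memory carried over from earlier inputs), $y_t$ is the output, and $W_{hh}$, $W_{xh}$, $W_{hy}$, $b_h$ and $b_y$ are learned parameters. All symbol names are assumed for illustration. The dependence of $h_t$ on $h_{t-1}$ is what lets the network take previous inputs into account.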

5) Long Short Term Memory (LSTM):
LSTM is a variant of RNN. It is better than a simple RNN because it mitigates the issues that a simple RNN faces. The two major issues faced by a simple RNN are (i) exploding and vanishing gradients and (ii) long-term dependencies. LSTM uses gates to remember the past, and gates are the heart of the LSTM. The gates in an LSTM are (i) the input gate, (ii) the forget gate and (iii) the output gate. All of them use the sigmoid activation function, whose output lies between 0 and 1 and is often close to 0 or 1. An output of 0 means the gate blocks everything; an output of 1 means it passes everything. The equations for the gates defined above are given below.
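The gate equations are not reproduced in this copy. In the commonly used LSTM formulation, the three gates take the following form; the notation is assumed for illustration and may differ from the authors':

```latex
\begin{aligned}
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) && \text{(input gate)} \\
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) && \text{(forget gate)} \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) && \text{(output gate)}
\end{aligned}
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
```

Here $[h_{t-1}, x_t]$ is the previous hidden state concatenated with the current input, and $W_i$, $W_f$, $W_o$, $b_i$, $b_f$ and $b_o$ are the learned weights and biases of each gate. Because $\sigma$ maps every entry into the interval (0, 1), each gate acts as a soft switch on the information it controls.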

B. Evaluation Mechanism
Evaluating the trained model is quite a difficult task in image captioning, and various evaluation metrics have been created for this purpose. The most common evaluation mechanisms found in the literature are BLEU, ROUGE-L, CIDEr, METEOR, and SPICE. BLEU is the most popular evaluation method, used by almost all of the studies, as shown in Fig. 7 and Table VIII.
6) BLEU: BLEU stands for bilingual evaluation understudy. It is an evaluation mechanism widely used in text generation. It compares machine-generated text with one or more manually written reference texts, and so summarizes how close a generated text is to an expected text. The BLEU score is most prevalent in automated machine translation, but it can also be used in image captioning, text summarization, speech recognition, etc. In image captioning in particular, the BLEU score measures how close a generated caption is to a human-written caption of the same image. The score ranges from 0.0 to 1.0, where 1.0 is a perfect score and 0.0 is the worst score.
We found that almost all studies used BLEU as their evaluation metric and calculated BLEU-1 to BLEU-4, where BLEU-1 measures accuracy on 1-grams only, and BLEU-2, BLEU-3 and BLEU-4 extend the matching to 2-grams, 3-grams and 4-grams, respectively.
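The n-gram matching behind BLEU-1 to BLEU-4 can be sketched in Python as follows. This is a minimal illustration, not the evaluation code used by the reviewed studies. It assumes whitespace-tokenized captions, uniform weights over the n-gram orders, no smoothing, and the standard brevity penalty that lowers the score of candidates shorter than the reference. The function names and the example captions are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: a candidate n-gram is credited at most as
    many times as it occurs in any single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    """Cumulative BLEU-max_n: brevity penalty times the geometric mean of p_1..p_max_n."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, one zero precision zeroes the score
    c = len(candidate)
    # Reference length closest to the candidate length (ties go to the shorter one).
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)


# Made-up captions for illustration.
candidate = "a cat is on the grass".split()
references = ["a cat is sitting on the grass".split()]

p1 = modified_precision(candidate, references, 1)  # every candidate word occurs in the reference
p2 = modified_precision(candidate, references, 2)  # the bigram "is on" does not
bleu_1 = bleu(candidate, references, max_n=1)
```

For real evaluations, an established implementation is preferable to this sketch, since tokenization and smoothing choices change the reported numbers.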
The BLEU score can be calculated from the following formula:
$$\mathrm{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$
where $p_n$ is the modified n-gram precision, $w_n$ is the weight given to n-grams of length $n$, $N$ is the maximum n-gram length, and $BP$ is the brevity penalty.
7) METEOR: METEOR stands for Metric for Evaluation of Translation with Explicit ORdering. BLEU scores the generated text as a whole, which overshadows the score of each individual sentence; METEOR addresses this. To do so, METEOR builds on the precision and recall functions: instead of using precision and recall separately, it uses a weighted F-score over mapped unigrams and a penalty function for incorrect word order.
The weighted F-score is computed from P and R, which stand for precision and recall and are calculated as m/c and m/r, where c and r are the candidate and reference lengths and m is the number of unigrams mapped between the two texts.
The penalty function is computed from c, the number of matched chunks, and m, the total number of matches.
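The METEOR formulas are not reproduced in this copy. In the original METEOR formulation (Banerjee and Lavie, 2005), the weighted F-score is $F_{mean} = \frac{10PR}{R + 9P}$, the penalty is $0.5 \left(\frac{\text{chunks}}{m}\right)^3$, and the final score is $F_{mean}(1 - \text{Penalty})$. The sketch below applies these formulas to precomputed counts; it is an assumption that the reviewed studies use these parameter values, since later METEOR versions tune them. The function and parameter names are hypothetical, and the unigram alignment that produces the counts is not shown.

```python
def meteor_from_counts(matches, cand_len, ref_len, chunks):
    """METEOR score from alignment counts, with the 2005 parameter values.

    matches  -- m, the number of mapped unigrams
    cand_len -- candidate length (P = m / cand_len)
    ref_len  -- reference length (R = m / ref_len)
    chunks   -- number of contiguous matched chunks
    """
    if matches == 0:
        return 0.0
    precision = matches / cand_len
    recall = matches / ref_len
    # Weighted harmonic mean that weights recall nine times as much as precision.
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # Fewer, longer chunks mean better word order and a smaller penalty.
    penalty = 0.5 * (chunks / matches) ** 3
    return f_mean * (1 - penalty)


# A 6-word candidate identical to its reference: all words map, in one chunk.
perfect_order = meteor_from_counts(matches=6, cand_len=6, ref_len=6, chunks=1)
# Same words, but scattered into 6 single-word chunks: maximum penalty.
scrambled = meteor_from_counts(matches=6, cand_len=6, ref_len=6, chunks=6)
```

Two separate parameters are used for the candidate length and the number of chunks because the text denotes both by c.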
The overall METEOR score is obtained by reducing the weighted F-score by the penalty.
8) ROUGE-L: ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. As its name suggests, ROUGE is based on recall, but ROUGE-L is based on an F-score, which is the harmonic mean of its precision and recall values. The formulas for precision and recall are
$$P = \frac{LCS(A, B)}{m}, \qquad R = \frac{LCS(A, B)}{n}$$
Here A and B are the candidate and reference texts, m and n are their lengths, and LCS stands for longest common subsequence, on which ROUGE-L depends. F is then calculated as the harmonic mean of P and R:
$$F = \frac{2PR}{P + R}$$
Throughout our review, we observed that image captioning is mostly applied in a general-purpose setting. Nevertheless, various domains can take advantage of image captioning to automate their tasks.
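The ROUGE-L computation described above can be sketched in Python. This is a minimal illustration that assumes whitespace-tokenized texts and a single reference, and follows the harmonic-mean F described in the text; published ROUGE-L implementations often use a recall-weighted F instead. The function names and the example captions are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """Return (precision, recall, F), where F is the harmonic mean of precision and recall."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return precision, recall, 2 * precision * recall / (precision + recall)


# Made-up captions: the candidate is a subsequence of the reference.
candidate = "a cat is on the grass".split()
reference = "a cat is sitting on the grass".split()
precision, recall, f_score = rouge_l(candidate, reference)
```

Unlike n-gram matching, the subsequence does not need to be contiguous, so the missing word "sitting" lowers recall but not precision.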

1) A model can be trained on medical ultrasound or MRI images or angiographic videos to automatically generate a complete report for a patient from those images, without requiring a doctor's involvement.
2) Image captioning can also be used in industry to automate various tasks. A model can be trained on images of a company's manufacturing environment to automatically detect anomalies in the environment or the product. It can also be used to detect incidents such as fires or security issues.
3) Image captioning can also be used in agriculture to generate crop reports for owners from images of the crops.
4) Image captioning can also be used to generate traffic analysis reports from CCTV cameras installed on streets, and thus guide drivers to the most suitable route and to available parking.