A Survey on Attention-Based Models for Image Captioning

www.ijacsa.thesai.org


I. INTRODUCTION
Image captioning aims to describe an image with a sentence that is accurate and concise. The problem resembles machine translation, except that the machine translates an image, rather than a sentence, into a sentence. It is therefore necessary to understand the image visually before producing the caption. An expressive caption requires detecting the objects in the image and their attributes, and identifying the relationships between the detected objects and the place or activity in which they appear.
Image captioning is important because it can assist visually impaired people by providing a brief description of an image while they browse the internet. It can also support self-driving cars by helping the agent to drive in a safe, fast and accurate way. Generating captions for medical images can help automate disease diagnosis and treatment. In addition, it can be used to generate captions for the images included in news articles. Other applications include service robotics, the military, education and image indexing.
In order to generate a sentence with reasonable linguistics and true semantics, Computer Vision (CV) methods are used to visually understand the image. In addition, Natural Language Processing (NLP) models are employed to generate a correct sentence. The power of Deep Learning (DL) approaches in CV [1][2][3][4][5][6][7] and NLP [8][9][10][11][12] makes them the first choice for many approaches in image captioning. A Convolutional Neural Network (CNN) is most commonly used in the vision part to extract the image features. A Recurrent Neural Network (RNN) is then used as the language model [13][14][15][16].

To generate high-quality captions, it is helpful to apply more advanced visual processing that considers the most salient features of the image while generating the caption words. This is called the attention model. The attention mechanism takes inspiration from the human visual system, which does not focus on all parts of a scene but only on small parts of it. The salient features of the image therefore take precedence over the whole image when the image is encoded. Attention has been used in different tasks, such as machine translation and object identification. Moreover, many image captioning approaches have employed the attention model and achieved substantial improvements [33][34][35][36][37].
In this paper, a detailed survey of the attention-based approaches employed in image captioning is presented. In addition, a taxonomy of these attention-based models is provided, including two new categories for categorizing the attention-based approaches. Most of the state-of-the-art articles on image captioning using attention-based models are included and compared with respect to the benchmark datasets and metrics.
The organization of this survey paper is as follows: In Section II, the literature review is presented. The attention mechanism and its taxonomy are presented in Section III, including four main categories of the attention models and their subcategories. The benchmark datasets in addition to the [17,38,39,40,41,42]. Some of these surveys [17,38,40,41,42] considered few attention-based approaches in image captioning because most of the attention approaches were published after these deep learning surveys. A comparative study of attention-based techniques was published by Khaing and Phyu [43]. Their survey presented a good comparative study of the attention-based models but without any categorization. Moreover, the most recent reference in their survey dates from 2018, whereas the attention-based methods have progressed considerably since 2019, as shown in Fig. 1. The newest survey of attention-based models was presented by Zohourianshahzadi and Kalita [44]. In [44], they presented an evolution path of the attention models including hard and soft attention, semantic attention, spatial attention, adaptive attention, and bottom-up and top-down attention.
To the best of our knowledge, there is no detailed survey with a good taxonomy of the attention-based approaches employed in image captioning. Motivated by this gap in the existing image captioning survey papers, a detailed survey of the attention-based approaches employed in image captioning is presented in this paper, introducing new categories.

III. TAXONOMY OF ATTENTION-BASED MODELS
Employing the attention mechanism in image captioning was motivated by the successful work achieved in neural machine translation [45] and object recognition [46,47]. In the translation task, attention was employed in the decoder to relieve the encoder of the need to encode all the information of the input sentence [45]. Xu et al. [48] proposed a captioning approach that uses the attention technique to consider the significant regions in the process of caption generation.
According to [48], the attention was applied at the decoder so that at every time step ($t$), the LSTM produced a new word depending on the hidden state ($h_t$), the words produced at the previous steps and a vector called the context vector ($\hat{z}_t$). The context vector $\hat{z}_t$ represents the information of an appropriate location of the image at the specific time step $t$. The context vector $\hat{z}_t$ can be calculated using the annotation vectors $a_i$, which are the features related to the image regions, and their assigned weights $\alpha_i$. The weights $\alpha_i$ are assigned to every annotation vector $a_i$ using a Multilayer Perceptron depending on the previous step hidden state $h_{t-1}$. The attention model used for calculating the weights has two variants, soft or hard attention, depending on how the weights are interpreted.
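As a minimal sketch of the soft variant described above, the following Python code scores each annotation vector with a small Multilayer Perceptron conditioned on the previous hidden state, normalizes the scores with a softmax, and forms the context vector as the weighted sum of the annotation vectors. The function names, the single tanh layer, and the weight names `W_a`, `W_h` and `v` are illustrative assumptions and not necessarily the exact parameterization used in [48].

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_score(a_i, h_prev, W_a, W_h, v):
    """Assumed scoring function: e_i = v^T tanh(W_a a_i + W_h h_prev)."""
    hidden = []
    for row_a, row_h in zip(W_a, W_h):
        pre = sum(w * x for w, x in zip(row_a, a_i))
        pre += sum(w * x for w, x in zip(row_h, h_prev))
        hidden.append(math.tanh(pre))
    return sum(vk * hk for vk, hk in zip(v, hidden))


def soft_attention(annotations, h_prev, W_a, W_h, v):
    """Return (weights, context) where context = sum_i weight_i * a_i."""
    scores = [mlp_score(a, h_prev, W_a, W_h, v) for a in annotations]
    alphas = softmax(scores)
    dim = len(annotations[0])
    context = [
        sum(alpha * a[d] for alpha, a in zip(alphas, annotations))
        for d in range(dim)
    ]
    return alphas, context
```

With all MLP weights set to zero, every region receives the same weight, so the context vector reduces to the mean of the annotation vectors.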
Several variants of the attention model have been proposed in the image captioning research area. Some researchers enhanced the model by employing the attention in multiple stages or by inserting information to guide the attention. However, the most notable variant of the attention is the transformer-based models. As can be seen from Fig. 1, there is considerably more interest in applying the transformer-based models than in the other categories.
In image captioning, the attention mechanisms can be categorized into four categories, as demonstrated in Fig. 2. According to Chen et al. [49], the visual attention-based approaches may concentrate on the spatial features or the semantic features, so visual attention is added as a category for characterizing the attention-based models. In addition, according to He et al. [50], the attention-based methods can be categorized based on applying the attention as single-stage in the decoder, two-stage, two-stage with a scene graph, or based on the transformer. This classification is added as subcategories of the category named Attention Blocks. In addition to these two main categories, this survey paper adds two new categories for characterizing the attention-based models that were not included in other survey papers, which are Number of Attention Layers and Guided Attention.

A. Visual Attention
Visual Attention [51,52] is a significant technique in the human visual system. The brain targets a region or an object using computational capabilities with the guidance of low-level image features in a time step. The visual attention models can be divided into spatial and semantic attention.

1) Spatial attention:
For spatial attention, the attention is applied spatially at specific regions [48,49,53,54,55,56]. For each fixed location, the attention weights related to this location are calculated at each iteration. Several approaches apply soft attention, which weights the feature maps with the computed weights, while other approaches use hard attention by selecting a set of salient regions from the feature map and concealing the other regions. Through applying the weighted pooling, some of the important spatial data may be lost. In addition, the spatial attention is usually computed in the last convolutional layer, which leads to similar feature results for distinct regions because of the large size of the filter, resulting in ineffective spatial attention.
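The difference between the soft and hard variants can be illustrated with a short Python sketch. Here soft attention pools all region features with their weights, whereas hard attention is shown as sampling a single region from the distribution defined by the weights and discarding the rest. The function names are hypothetical, and treating hard attention as sampling exactly one region per step is a simplifying assumption.

```python
import random


def soft_pool(weights, regions):
    """Soft attention: weighted sum of all region feature vectors."""
    dim = len(regions[0])
    return [sum(w * r[d] for w, r in zip(weights, regions)) for d in range(dim)]


def hard_select(weights, regions, rng):
    """Hard attention (assumed form): sample one region index from the
    attention weights and return only that region's features."""
    idx = rng.choices(range(len(regions)), weights=weights, k=1)[0]
    return idx, list(regions[idx])
```

For example, `hard_select([0.0, 1.0, 0.0], regions, random.Random(0))` always returns the second region, while `soft_pool` with the same weights returns the same features as a weighted sum.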
2) Semantic attention:
Instead of attending to the fixed resolutions in spatial attention, other approaches proposed attending to the image's semantic concepts [57,58]. Semantic attention is more like the human description of the image because people describe the most important objects and do not talk about all regions in the image. Attributes can be utilized from any image location even if there is no actual existence of these attributes within the image. For the purpose of attending to semantically necessary attributes, You et al. [57] employed a semantic attention framework that used top-down and bottom-up models. Bottom-up was used to select the semantic attributes, and top-down was used to decide when and where to apply the attention. Another approach was proposed by Gan et al. [58] in which they recognized the semantic tags and computed the probability of the tags to be utilized in forming the LSTM parameters. LSTM weight matrices were expanded to a group of weight matrices that are tag-dependent.
Using semantic attention requires extra resources that are important for detecting the relationship among the semantic concepts and the image.

B. Attention Blocks
The attention-based models can also be categorized according to the block where the attention is applied. The attention can be applied as a single stage in the decoder block, in two stages by obtaining bottom-up and top-down attention, in two stages with an injected graph network, or in Transformer-based models.

1) Decoder-based attention (single-stage attention):
In decoder-based attention models, the attention is employed at the decoder. In the process of producing the caption words, the decoder attends to the informative regions [59]. Xu et al. [48] proposed to use the attention module in the decoder of the captioning approach while generating the sentence words, depending on the LSTM hidden states and the previously predicted caption words. A weighting matrix is introduced for each feature map receptive field. This weighted map and the last predicted word are then forwarded to the language model to predict the next word.

2) Two-stage attention:
Rather than attending to the salient regions as in decoder-based attention, Anderson et al. [53] presented a model that contains two-stage attention. Faster R-CNN [60] was employed in the bottom-up attention module. Then, the attention was distributed among the image regions using a top-down attention mechanism. They used two LSTM layers to apply attention to the selected spatial features: the first layer was for the top-down attention, and the other was the language layer. The drawback of their model is that it cannot handle object-object relationships.
An approach similar to [53] was introduced by Lu et al. [61]. Their proposed decoder determines whether the word is visual and predicted according to a certain image region, or whether it is predicted from the textual vocabulary. The essential advantage of their approach is its ability to incorporate additional object detectors, which can lead to producing different image captions. The main gap in two-stage attention models is that they do not capture the relationships between the image regions.

3) Two-stage attention with graph:
To enhance the two-stage attention models, graph networks can be employed to discover the relationships among the detected regions, which can result in enhanced features and accordingly improve the caption generation. Similar to [53,62], Yao et al. [55] employed the attention mechanism for attending to the informative image regions. The key novelty in GCN-LSTM [55] is that two graphs are used for detecting the relationships between the image regions. A semantic graph was employed with the nodes representing the image regions and the edges representing the relationships between these detected image regions, while the geometrical relations between the regions (the vertices) were represented by the spatial graph. Then, a Graph Convolutional Network (GCN) [63] was utilized to output relation-aware region representations.
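To make the idea of relation-aware region representations more concrete, the sketch below implements one generic graph-convolution step in Python: each region averages its own features with those of its neighbours in the graph, applies a shared linear projection, and passes the result through a ReLU. This mean-aggregation layer is a common textbook form chosen for illustration. It is an assumption and is not the exact formulation used in [55], which additionally distinguishes edge labels and directions.

```python
def gcn_layer(features, adjacency, weight):
    """One simplified graph-convolution step (assumed mean aggregation).

    features:  list of N region feature vectors (length D_in each)
    adjacency: N x N matrix, adjacency[i][j] truthy if regions i and j are related
    weight:    D_in x D_out projection matrix shared by all nodes
    """
    n = len(features)
    in_dim = len(features[0])
    out_dim = len(weight[0])
    updated = []
    for i in range(n):
        # Neighbourhood of node i, including a self-loop.
        neighbours = [j for j in range(n) if adjacency[i][j] or j == i]
        mean = [
            sum(features[j][d] for j in neighbours) / len(neighbours)
            for d in range(in_dim)
        ]
        projected = [
            sum(mean[d] * weight[d][k] for d in range(in_dim))
            for k in range(out_dim)
        ]
        updated.append([max(0.0, x) for x in projected])  # ReLU
    return updated
```

With an identity projection, two connected regions end up with the average of their features, while an isolated region keeps its own features.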
The approaches presented in [55,64] employed Faster R-CNN to identify the image objects and thus explore the relationships between regions of interest. Faster R-CNN was trained on the Visual Genome dataset [65]. In [66], by contrast, the visual relationships were modelled on Flickr30K [67] and MS COCO [68], so pre-established classes of relations are not required.
The authors in [55] extended the approach to (GCN-LSTM-HIP) [69] to include a hierarchical tree of three levels which have the image as the root, the detected regions as the first layer and the instances/foreground of the regions at the leaf layer. Then, a Tree-LSTM [70] was employed for modelling the dependency structure and improving the features.
Another model was presented by Guo et al. [71], which detected a set of visual semantic units (VSUs), where the units represent the objects, the attributes and the relationships between objects. Semantic and geometry graphs were employed in which the vertices represent the semantic units and the edges represent the connections between them, unlike [55], which represented the relationships as edges. A GCN was then introduced in [71] to output context-aware embeddings for the visual semantic units. Attention over the different kinds of units was applied via context gated attention (CGA). Another scene graph approach was presented by Yang et al. [72], which used the edges to represent the relationships in the graph. Language inductive bias was integrated into the captioning framework, and its features are represented via a scene graph auto-encoder (SGAE).
The main drawback of the scene graph-based models is that, although they improve the performance compared to the two-stage models, they still need additional models for scene graph construction. Also, having two graphs is inefficient with respect to the computational cost.

4) Transformer-based:
Unlike the graph-based models, the transformer models do not include any graphs and thus do not need additional models for the graph construction. The transformer was originally designed for text translation [73]. The transformer is able to avoid any duplication by employing the attention in a comprehensive way between the input and the output. Numerous approaches have been proposed to employ the transformer models in image captioning [74][75][76][77][78][79][80][81][82][83][84][85].
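The attention operation at the core of the transformer is scaled dot-product attention, in which each query is compared with all keys, the similarities are scaled by the square root of the key dimension and normalized with a softmax, and the values are combined with the resulting weights. The following pure-Python sketch shows a single attention head without learned projections or masking. The function names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of vectors."""
    d_k = len(keys[0])
    dim_v = len(values[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim_v)]
        )
    return outputs
```

In a captioning setting, the queries would typically come from the words being generated and the keys and values from the image region features, so that every word can attend to every region in a single step.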
Huang et al. [86] proposed Attention on Attention (AoA) approach, which adds attention over the traditional attention. "Information vector" and "attention gate" were produced by the query and the attended results, then second attention was produced by element-wise multiplication between them. AoA was applied in the encoder to detect the relations between the objects. While in the decoder, AoA was employed for holding the relevant attention output and ignoring the deceptive results.
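The gating step of AoA can be sketched in Python as follows. Given a query and the result of a conventional attention module, an information vector is produced by a linear map of their concatenation, an attention gate is produced by a second linear map followed by a sigmoid, and the two are multiplied element-wise. The weight matrices `W_info` and `W_gate`, their biases, and the function names are hypothetical placeholders for learned parameters.

```python
import math


def linear(W, b, x):
    """Affine map W x + b for a matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def attention_on_attention(query, attended, W_info, b_info, W_gate, b_gate):
    """Gate the attended result: sigmoid(gate) * information, element-wise."""
    joint = list(query) + list(attended)  # concatenation [query; attended]
    info = linear(W_info, b_info, joint)
    gate = [sigmoid(g) for g in linear(W_gate, b_gate, joint)]
    return [g * i for g, i in zip(gate, info)]
```

When the gate is close to zero for a component, the corresponding part of the attended result is suppressed, which is how irrelevant attention outputs can be filtered out.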
Captioning transformer with stacked attention module was proposed by Zhu et al. [76]. A multi-level observation was proposed in such a way that all transformer layers had the opportunity for generating the sentence word. Average pooling was then employed to find the probability of the word by merging all the contributions.
Cornia et al. [74] proposed a transformer approach to consider low and high-level relationships by modelling them as multi-level. They utilized persistent memory vectors while encoding the relationships with prior information. In addition, rather than applying the attention only to the last encoding layer, all the encoder layers contributed to the sentence generation process and connected to the decoder layers in mesh-like connectivity.
A multimodal transformer was proposed by Yu et al. [75], which is able to model three different relations: word-to-word, word-to-object and object-to-object. Self-attention within the same modality and co-attention across distinct modalities were acquired. In addition, multiple views were employed in two designs: aligned and unaligned multiple views.
The conventional transformer was expanded with the addition of EnTangled Attention (ETA) and Gated Bilateral Controller (GBC) [77]. ETA gave the transformer the ability to use semantic concepts and visual information. The interconnection between the multimodal information was controlled by the GBC. Object relation transformer [78] was proposed in which geometrical information for the relationship between each pair of objects was included within the transformer through spatial attention.
He et al. [50] proposed a model with the idea of changing the internal structure of the transformer that was originally proposed to handle text. They introduced an expanded transformer that includes three parallel sub-transformer layers to handle three different relationships: parent, child, and neighbor.

C. Number of Attention Layers
The attention models can be characterized according to the number of required attention steps either to attend once per word, attend with fixed steps or adaptively determine the number of required attention steps.

1) Single-layer attention:
The attention operation is connected to the word generation procedure in the traditional attention-based framework [48]. The framework attended once to the image prior to generating the following word. The model attended to selected image regions in each iteration, and the computed attention features were sent to the RNN as input. The problem of attending once per word is that some important information may be lost, especially if the model attended to an incorrect region.

2) Fixed multiple attention steps:
In order to enhance the single attention process by avoiding predicting incorrect words, several approaches attended multiple times to refine the attended region and recover the lost data [87,88]. Du et al. [87] proposed a model that attended more times to the image per word and showed that it could improve image captioning without adding extra parameters. A two-LSTM model was utilized, which has the ability to attend an arbitrary number of times and makes the attention operation flexible.
A triple attention approach was proposed by Zhu et al. [88]. The attention is applied in the input phase to the LSTM hidden states of the previous step. In addition, attention is also applied in the output phase to the present hidden states. Conditional embedding was used in addition to the word/image embedding at every input stage of the LSTM. This way, the prior text information was coupled with image information, and accordingly, text and image information appeared in the input of the word generation procedure.
To attend to different levels of semantic abstraction, Chen et al. [49] applied the attention in multiple layers, since the feature maps depend on those of the lower layers. The attention in their approach was applied to each entry of the multi-layer feature maps. They also proposed channel-wise attention for applying the reweighting process in every channel through the word generation process. The channel-wise attention could be viewed as the procedure of choosing the semantic concepts by paying more attention to the channels produced by filters indicated by the semantics.
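Channel-wise reweighting can be illustrated with the following Python sketch. Each channel of a feature map is average-pooled into a single descriptor, a score is computed from that descriptor and the previous hidden state, the scores are normalized with a softmax, and every activation in a channel is scaled by that channel's weight. The scalar parameter `w_channel`, the vector `w_hidden` and the single tanh scoring layer are simplifying assumptions and do not reproduce the exact parameterization of [49].

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def channel_wise_attention(feature_map, h_prev, w_channel, w_hidden):
    """Reweight channels of a feature map.

    feature_map: list of C channels, each a flat list of spatial activations
    h_prev:      previous hidden state of the language model
    w_channel:   assumed scalar weight applied to each pooled channel value
    w_hidden:    assumed weight vector applied to h_prev
    """
    pooled = [sum(channel) / len(channel) for channel in feature_map]
    hidden_term = sum(w * h for w, h in zip(w_hidden, h_prev))
    scores = [math.tanh(w_channel * p + hidden_term) for p in pooled]
    betas = softmax(scores)
    reweighted = [
        [beta * x for x in channel] for beta, channel in zip(betas, feature_map)
    ]
    return betas, reweighted
```

Because each channel is the response of one filter, a larger weight for a channel can be read as paying more attention to the semantic concept that the filter responds to.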
A hierarchical approach (CNN+CNN) [89] was proposed in which a CNN is employed as the decoder as well as the encoder. Their hierarchical attention model learns the relationship of the attributes for all image regions and all levels. The dot-product operation used in their framework reduces the number of parameters and can be faster than the Multilayer Perceptron attention used in [48,54]. The idea of hierarchical attention was also employed in [90][91][92]. Yan et al. [90] proposed a mechanism made up of global and local attention modules, which relate to the global CNN features, extracted by the CNN encoder, and the local object features, extracted by an object detector, respectively. Sequential attention was presented by Fang et al. [93] to take into consideration the sequential attention relationships across several time steps of word generation and correspondingly improve the visual data in caption generation. Another sequential attention was proposed by Liu et al. [94], in which the image was represented as a sequence of objects, and the attention was employed to consider the information of all objects during sentence word generation.

3) Adaptive attention:
In the previously presented approaches, an image region is attended to for every sentence word, although some words have no corresponding image region. So, an adaptive attention approach [54] was proposed that includes a sentinel gate and spatial attention to determine where and when to attend in the caption generation. They presented an extension to the LSTM that, in addition to the hidden state, produces a visual sentinel vector. In addition, a sentinel gate was proposed to determine whether the attention will be targeted to the visual sentinel or the image. Another adaptive attention approach was proposed by Deng et al. [95] that adaptively determines whether it is necessary to depend on the language model or the visual signals. Their proposed approach can make the image captioning task more flexible by enhancing the obligatory correlation between image regions and sentence words.
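The role of the sentinel gate can be sketched in Python as a mixture between the visual sentinel and the spatially attended image features. In this sketch the sentinel is treated as one extra candidate alongside the image regions: a softmax over the region scores and the sentinel score yields the region weights together with a gate value `beta`, and the final context is `beta` times the sentinel plus the weighted region features. The scores are taken as given inputs here, and the function names are illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_context(region_scores, sentinel_score, regions, sentinel):
    """Mix region features and the visual sentinel.

    Returns (context, beta), where beta is the share of attention placed on
    the sentinel (language information) rather than on the image regions.
    """
    probs = softmax(list(region_scores) + [sentinel_score])
    alphas, beta = probs[:-1], probs[-1]
    dim = len(sentinel)
    context = [
        beta * sentinel[d] + sum(a * r[d] for a, r in zip(alphas, regions))
        for d in range(dim)
    ]
    return context, beta
```

A `beta` close to one means the next word is predicted mostly from the language context, as for function words, while a `beta` close to zero means the word relies on the image regions.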
Adaptive semantic attention framework [96] was proposed to incorporate dual-LSTMs; the first LSTM works as a visual sentinel to acquire fine-grained representations. The second LSTM serves as a language model that produces the sentence words depending on the updated attended vector and first LSTM output.
Huang et al. [97] presented an adaptive attention time model (AAT). The model learns to determine the number of required attention steps in each step of the decoder in order to produce the next word. Using AAT, the mapping between the image regions and caption words can be applied arbitrarily such that a caption word may attend to multiple regions and vice versa. Their approach does not introduce noise into the parameter gradients.

D. Guided Attention
To enhance the performance of image captioning approaches and generate accurate captions, some approaches inserted additional guidance information [98,99], such as concept features that connect the input image and the caption. In [100], the model was guided through semantic information acquired from the images and sent as extra input to the LSTM units. In [101], the approach was guided through concept features, which are obtained by predicting the words that occur frequently in the captions. Another way of learning the features is by adding a network for guidance [102]. Similar to [102], Sow et al. [103] inserted a network for guidance, but rather than obtaining one vector for guidance, [103] obtained a sequential network for guidance, which was able to adjust the guidance vectors during the sentence generation process. They also utilized the Luong attention mechanism [104], which is an enhanced form of the attention technique.
A text-guided attention approach was presented by Mun et al. [105]. Related sample captions, namely guidance captions, were employed to obtain visual attention and produce appropriate captions. The related sample captions were obtained from similar training images that share related regions with the input image. Topic-guided attention was proposed by Zhu et al. [106], which selected the significant features by incorporating the topics within the image into the attention mechanism as guidance information.

A. Datasets
Different datasets have been presented in the research area of image captioning. The popular datasets, which are Flickr8K [107], Flickr30K [67], Microsoft COCO [68] and Visual Genome [65], are presented.

1) Flickr8K [107]: The dataset consists of about eight thousand images selected from six groups on Flickr.com. It does not tend towards famous locations or people; instead, various situations and locations are represented. The dataset includes five human-annotated captions for each image.
2) Flickr30K [67]: An extension of Flickr8K that consists of 31,783 images. Flickr30K contains 8.7 objects per image, 44,518 object categories, 6.2 objects per category, 5 sentences per image and 16.6 expressions per image.
3) Microsoft COCO dataset [68]: A large-scale dataset that is broadly used in the image captioning task. MS COCO includes 328,000 images, 7.7 objects per image, 91 object categories, 2.5 million labelled instances, 27,473 objects per category and five sentences per image.
4) Visual Genome dataset [65]: An image captioning dataset that considers the modelling of relationships between objects. It provides captions for different image regions, unlike the other datasets, which provide a caption for the entire scene. The dataset includes more than 100 thousand images, 18 attributes, 21 objects per image, and 18 object relationships.

B. Performance Metrics
To evaluate the image captioning techniques, different metrics were proposed to compare the generated caption with the original caption. In this section, the main performance metrics, which are BLEU [108], ROUGE [109], METEOR [110], CIDEr [111] and SPICE [112], are presented.

1) BLEU "Bilingual Evaluation Understudy" [108]: It was originally introduced by IBM for the evaluation of machine translation. This metric measures the quality of the generated sentence by calculating its similarity with the original reference translations. The n-grams of the machine-generated sentence are compared to those of the reference sentences and the matches are counted. The score is higher, indicating a better generated sentence, when there are more matches and more reference sentences. BLEU values range from zero to one, and a generated caption obtains a score of one only if it is identical to a ground truth caption.
2) ROUGE [109]: It was originally introduced for the evaluation of text summarization. The ROUGE metric calculates the quality of the generated summary by counting the n-grams, word sequences and word pairs that overlap with the reference summaries created by experts. ROUGE-N (N-grams), ROUGE-L, ROUGE-W, and ROUGE-S are the types of the ROUGE metric.
3) METEOR [110]: A metric utilized for evaluating machine-generated texts by matching the unigrams of the machine-generated sentence and the reference sentences. Once this matching is computed, the unigram recall and precision and a measure of fragmentation are used to compute the METEOR score.
4) CIDEr [111]: A metric utilized for evaluating image descriptions. The five available captions of the dataset used in the other metrics are not enough for finding the consensus between human judgment and the output captions. The consensus is measured by counting the n-grams shared between the ground truth and predicted captions while assigning low weights to the common n-grams.
5) SPICE [112]: The previously explained metrics depend on n-grams. The SPICE metric overcomes this restriction by employing a scene graph, in which the reference and generated captions are converted to a graph-based semantic representation. SPICE measures whether the objects, their attributes and their relationships are effectively represented in the generated caption.
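The n-gram matching behind BLEU can be sketched in Python as follows. The code computes the clipped (modified) n-gram precision of a candidate caption against several reference captions, applies a brevity penalty for candidates shorter than the closest reference, and combines the precisions for n = 1 to 4 with a geometric mean. It is a simplified sentence-level sketch without smoothing, so any candidate with no matching 4-gram scores zero. The function names are illustrative, and published BLEU scores are normally computed at corpus level with an established toolkit.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of the candidate against the references."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def brevity_penalty(candidate, references):
    """Penalize candidates shorter than the closest reference length."""
    c = len(candidate)
    if c == 0:
        return 0.0
    # Closest reference length; ties are resolved towards the shorter one.
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    return 1.0 if c > r else math.exp(1.0 - r / c)


def bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform n-gram weights."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_mean)
```

The clipping step is what prevents a degenerate caption such as "the the the the the the the" from scoring well: each n-gram is only credited as many times as it appears in a single reference.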

V. COMPARISON AND DISCUSSION
In this section, the performance of different state-of-the-art approaches is presented and discussed. In Table I From the first use of the attention mechanism in image captioning by Xu et al. [48], it has been shown that their approach obtained better performance on Flickr8k, Flickr30k and MS-COCO. The reason behind the better performance is that their approach considered the most relevant objects when generating the image caption. Moreover, they showed that the hard attention variant of their mechanism outperforms the soft attention on these benchmark datasets. After that, You et al. [57] showed that attending to the semantic attributes instead of using spatial attention [48] can improve the results by generating semantically rich captions.
Further improvement in the results was obtained by introducing multiple attention layers, which can be used in a hierarchical structure or with either a fixed or an adaptive number of attention layers. Du et al. [87] achieved 38.1, 28.3, 58.0, 126.1 and 22.0 on BLEU-4, METEOR, ROUGE, CIDEr and SPICE, respectively. These results are higher than the results of hierarchical attention [89]. The hierarchical structure in [89] used the CNN as the decoder, whereas Du et al. [87] used a two-LSTM model to enable attention an arbitrary number of times and make the attention operation more flexible.
The adaptive attention approach of Huang et al. [97] achieved 38.7, 28.6, 58.5, 128.6 and 22.2 on BLEU-4, METEOR, ROUGE, CIDEr and SPICE, respectively, which are higher than those of both Wang and Chan [89] and Du et al. [87]. The reason for their better performance is that their model learns to determine the number of required attention steps in each decoder step, and the mapping between the image regions and caption words can be applied arbitrarily.
Anderson et al. [53] achieved good performance by employing a two-stage decoder containing bottom-up attention and top-down attention. Yao et al. [55], Guo et al. [71] and Yang et al. [72] further enhanced the results of the two-stage decoder by introducing scene graphs for detecting the relationships between image regions. Yao et al. [69] achieved better results than [55,71,72] by introducing a hierarchical tree and using a Tree-LSTM to model the dependency structure.
The best performance in Table I was achieved by Yu et al. [75] and Pan et al. [114]. In [75], Yu et al. used a multimodal transformer that can model three different relations, and the model was designed in two views: aligned and unaligned multi-view visual representation. Pan et al. [114], in contrast, modelled second-order interactions by proposing an X-linear attention module plugged into the transformer. Both [75] and [114] are Transformer-based attention models, which suggests that Transformer-based models can achieve better results in comparison with other attention-based mechanisms. The strong interest in applying the transformer, as can be seen from Fig. 1, comes from its ability to weight the importance of every input region and its ability to avoid any duplication by employing the attention in a comprehensive way between the input and the output. In addition, it can be parallelized in an effective way.
The attention mechanism has been employed in image captioning since 2015 [48], and it has received growing interest since then: the number of research papers that employ attention increases every year, as shown in Fig. 1. In addition, authors tend to use the scene graph with attention models, and great attention is going towards applying the transformer in the image captioning task due to its parallelizable nature and better performance. Part of the recent research in the image captioning task is also moving towards applying the attention in multiple layers in order to enhance the predicted words, or towards adaptively determining the number of required attention steps.

VI. CONCLUSION AND FUTURE WORK
In this paper, a survey of the attention-based image captioning approaches was presented. Four main categories of the attention-based approaches and their subcategories were summarized. Furthermore, the attention-based approaches were compared on benchmark datasets and popular performance metrics. As discussed in the paper, the image captioning task has improved greatly due to the use of attention-based models, especially Transformer-based approaches. Although attention-based models have an impressive effect in image captioning, there is still room for improvement. Faster R-CNN is extensively employed as an encoder because of its ability to obtain effective detection results. However, training Faster R-CNN is not a simple task, and it gives unsatisfactory results in some cases, such as when the images have low resolution or when the objects are deformed or small. So, it may be better to use other image encoders or an enhanced version of Faster R-CNN. In addition, the transformer-based models offer further room for improvement through new transformer architectures, which may help in improving the quality of the resulting description.