The Impact of Text Generation Techniques on Neural Image Captioning: An Empirical Study



I. INTRODUCTION
Image captioning aims to generate accurate textual descriptions for a given image. It is a challenging task that integrates visual and textual understanding, and it involves technologies from both computer vision and natural language processing. Automatic image captioning has found practical applications in various domains, including social media [1], remote sensing [2], robotics [3], and medical image report generation [4].
Automatic image captioning has received much attention in recent years, and a variety of approaches and strategies have been proposed and studied [5]. Although deep learning models have made significant progress in image captioning, describing images correctly remains a challenge. Image captioning models need to understand image content, recognize objects and their relationships, and capture the interaction between images and language in order to generate natural language descriptions. Inspired by the advances in neural machine translation, most state-of-the-art image captioning models follow the encoder-decoder pipeline. Specifically, an encoder transforms an image into vector representations, and a decoder translates the information from the encoder into natural sentences, yielding a relevant caption. In the literature, different encoders, decoders, and varying strategies supporting the encoding or decoding process have been investigated [5], [6].
As one of the core elements of image captioning, the decoder, which acts as a text generator, has attracted considerable research attention. First, various language models have been employed as the decoder. Under the encoder-decoder framework, a mainstream image captioning model is CNN-RNN [7], where a convolutional neural network (CNN) is employed as the encoder for feature learning and a recurrent neural network (RNN) acts as the decoder for caption generation. Apart from that, other models have been proposed and developed, including CNN-Long Short-Term Memory (LSTM) [8], CNN-Gated Recurrent Unit (GRU) [9], and CNN-Transformer [10]. Second, different decoding strategies have been proposed and studied. Beam search has been widely adopted by RNN-based decoders for improving the quality of the output caption [11]. To enhance attention-based decoders, the Attention on Attention (AoA) strategy [12] has been proposed to extend the conventional attention mechanism. Last but not least, in order to enhance the decoder's capability of learning to predict the words appearing in the caption, various training strategies have been explored, including cross-entropy loss and reinforcement learning.
Naturally, image captioning approaches that employ different strategies or mechanisms may differ in their capability of generating captions. As can be seen from Fig. 1, for the same input image, four different image captioning approaches provide four different captions, which are of varying quality as revealed by the evaluation metrics. In other words, different image captioning approaches may exhibit different captioning performance. However, due to the complexity of image captioning models, the difference in performance may originate from the encoder, the decoder, or the relevant strategies. Recent studies have comprehensively analyzed the effect of different encoders on model performance from an empirical perspective [13]-[18]. For the decoder part, its impacts on image captioning performance have also been studied [19]-[22]. Nevertheless, there is still a lack of detailed and comprehensive investigations covering the various aspects of the decoder (including the language model, the decoding strategy, and the training strategy).
To gain additional insights into encoder-decoder based image captioning models, in this study we conduct an extensive empirical study with the goal of comprehensively investigating the impacts of decoding-related techniques on the performance of image captioning. We compared the impact of RNN-based, GRU-based, and LSTM-based decoders on image captioning models. In addition, we investigated the impacts of two types of decoding strategies: the search strategy (namely, greedy search and beam search) and the AoA mechanism. We also analyzed the impacts of training methods on the performance of image captioning models and compared two training methods, the cross-entropy loss and the reinforcement learning based method. Furthermore, we investigated the impact of the combinational usage of these strategies.
We conducted experiments on the MSCOCO dataset [23], which is a widely used dataset for the task of image captioning. We employed four state-of-the-art image captioning models as the basic models and further constructed a series of model variants from them by modifying the decoding parts of these basic models. To evaluate the performance of these image captioning models, we adopted six evaluation metrics: BLEU1 [24], BLEU4 [24], METEOR [25], ROUGE [26], CIDEr [27], and SPICE [28]. Overall, our experimental results confirm the impacts of the language models, the decoding strategies, and the training strategies on the performance of image captioning models. More specifically, the results reveal that different language models may benefit different image captioning models, and that beam search, the AoA mechanism, and the reinforcement learning based training method can generally improve the performance of image captioning models. In addition, the combinational usage of various strategies is found to positively affect the captioning performance.
The contributions of this study are summarized as follows.
• We conduct extensive experiments to empirically analyze the impacts of the decoder, which involves various text generation techniques, on the performance of image captioning models. Our study considers the impacts of language models, decoding strategies, training strategies, and the combinational usage of decoding and training strategies, and accordingly evaluates the performance of 68 image captioning models (including 4 basic models and 64 model variants with varying usage of the language model, decoding strategy, and training strategy).
• We highlight some practical findings. Our findings suggest that the performance of an image captioning model can be enhanced by configuring it with a suitable language model as well as appropriate decoding and training strategies. This also provides a reference for further improving the performance of image captioning models.
The rest of the paper is structured as follows. Section II discusses previous research work. Section III introduces some preliminary knowledge, including the commonly used language models and the decoding and training strategies for image captioning models. In Section IV, we present our experimental design, including the research questions, the basic image captioning models employed in the experiments, the datasets, and the evaluation metrics. Section V reports and discusses our experimental results to answer each of our research questions. Section VI concludes with a summary of this study and proposes directions for future research.

II. RELATED WORK
Image captioning: Image captioning [29], [30], [31], [12] has achieved significant improvements with the neural encoder-decoder framework [6]. The Show-Tell model [30] uses convolutional neural networks (CNNs) [32] to encode images into fixed-length vectors, and a long short-term memory (LSTM) network [33] as a decoder to sequentially generate words. To capture fine-grained visual details, attention-based image captioning models [29], [31], [12] have been proposed to dynamically associate words with relevant image regions during generation. To reduce exposure bias in sequence training, Rennie et al. [34] use reinforcement learning to directly optimize non-differentiable metrics. To further improve accuracy, Transformer models [10], [35] were proposed, allowing the model to effectively capture the relationships between different positions in the input sequence.
Empirical study: The factors that affect image captioning models can be roughly divided into two parts: the encoder and the decoder. To study the impact of encoders on image captioning models, researchers have conducted empirical studies with different CNN encoders, such as Inception-V3, VGG, ResNet, and DenseNet. Among them, [13], [14], [16] used a plain LSTM as the decoder, [17] additionally used RMSprop as the optimizer, and [18] changed the decoder from LSTM to GRU. For a more comprehensive comparison, [15], [36] used both an LSTM and a combination of LSTM and visual attention mechanisms as decoders.
Research on the decoder part mainly focuses on the decoder architecture, the search strategy, and the visual attention mechanism. Among them, [37] mainly focused on the impact of search strategies. [20] considered the influence of unidirectional and bidirectional LSTM decoders and of search strategies. [19] selected the injection model and conducted experiments using different search strategies. [21] and [22] mainly focused on the impact of the visual attention mechanism on the model. In addition, [22] takes the Transformer model into account.
While other papers have analyzed only one aspect of the attention mechanism, or two types of decoder architectures, we build on this foundation by experimenting with RNNs, GRUs, LSTMs, and Transformers under different strategies as well as combinations of strategies, in order to provide a more comprehensive analysis of decoder language models.

III. PRELIMINARIES
This section briefly introduces the commonly used language models, as well as the strategies that are applicable to text generators, in the context of encoder-decoder based image captioning.

A. Decoder Language Models
As shown in Fig. 2, the decoder of an image captioning model is responsible for translating the vector representation resulting from the encoder into a natural language caption. In the context of image captioning, the generation of captions can be formulated as a sequence-to-sequence learning task. Several language models have been employed to accomplish this task, including RNN, LSTM, GRU, and Transformer.
RNN [38] is used to process sequential data, but it does not handle long sequences and long-distance dependencies well due to vanishing or exploding gradient problems.
LSTM [33] is an improved RNN that mitigates the vanishing gradient problem by introducing a gating mechanism, which makes it good at capturing long-term dependencies; it has achieved good results in tasks such as text generation and machine translation.
GRU [39] is another improved variant of RNN. It also uses a gating mechanism but has a simpler structure and fewer parameters than LSTM, and it shows performance similar to LSTM in multiple sequence generation tasks.
Transformer [40] is a neural network based on a self-attention mechanism. It offers global context modeling and parallel computation. It can jointly consider image features and the caption sequence to generate accurate and coherent image captions.

B. Decoding Strategies
Apart from the language model, the decoder can be equipped with various strategies. In this study, we mainly focus on the search strategy and the strategies relating to the attention mechanism.
1) Search Strategy: Greedy search and beam search are two search strategies for generating sequences that are commonly used in the task of image captioning.
Greedy search is a sequence generation method that selects the most probable word at each step without considering the globally optimal sequence. It is usually computationally efficient but may sacrifice final performance.
Beam search is a sequence generation method that keeps multiple candidate outputs at each step: it retains the k partial sequences with the highest probability scores (where k is the beam size) and finally selects the complete sequence with the highest score. It improves the quality of the generated results and is often used in natural language processing tasks such as machine translation.
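The difference between the two search strategies can be illustrated with a minimal sketch. The toy next-word distribution `toy_next_token_probs` and the function `beam_search` below are hypothetical and written only for illustration; a real captioning decoder would also condition on the image features. Greedy search corresponds to a beam size of 1.

```python
import math

def toy_next_token_probs(prefix):
    """Hypothetical toy language model: next-word probabilities for a given prefix."""
    table = {
        (): {"a": 0.6, "the": 0.4},
        ("a",): {"dog": 0.5, "cat": 0.5},
        ("the",): {"dog": 0.9, "cat": 0.1},
    }
    # Any prefix not listed above ends the caption with certainty.
    return table.get(prefix, {"<eos>": 1.0})

def beam_search(next_probs, beam_size, max_len=10):
    """Return (caption, log-probability); beam_size=1 is equivalent to greedy search."""
    beams = [((), 0.0)]  # partial captions with their cumulative log-probabilities
    finished = []        # captions that have produced the end-of-sequence token
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, prob in next_probs(prefix).items():
                extended = (prefix + (token,), score + math.log(prob))
                if token == "<eos>":
                    finished.append(extended)
                else:
                    candidates.append(extended)
        if not candidates:
            break
        # Keep only the beam_size most probable partial captions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return max(finished or beams, key=lambda c: c[1])

greedy_caption, greedy_score = beam_search(toy_next_token_probs, beam_size=1)
beam_caption, beam_score = beam_search(toy_next_token_probs, beam_size=2)
print(greedy_caption)  # ('a', 'dog', '<eos>'), probability 0.6 * 0.5 = 0.30
print(beam_caption)    # ('the', 'dog', '<eos>'), probability 0.4 * 0.9 = 0.36
```

In this toy example, greedy search commits to the locally best first word and ends with a less probable caption, whereas a beam size of 2 keeps the alternative first word alive and recovers the more probable caption.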
2) Attention on Attention Mechanism: For encoder-decoder framework based image captioning, the attention mechanism is commonly applied for guiding the decoding process. The Attention on Attention (AoA) approach aims at extending the traditional attention mechanism applied to image captioning tasks.
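As formulated in [12], the AoA module consists of two main parts built on top of a conventional attention module. Given the attention result and the current context (the attention query), the first part generates an information vector through a linear transformation, and the second part generates an attention gate through a linear transformation followed by a sigmoid activation. AoA then applies a second attention by multiplying the gate and the information vector element-wise, which yields the attended information passed on to the decoder.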
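For reference, the AoA module as formulated in the original AoA paper [12] can be written as follows. The symbols are taken from that formulation and are not defined elsewhere in this paper: Q, K, and V denote the queries, keys, and values, f_att is any conventional attention function, the W terms and b terms are learned weights and biases, σ is the sigmoid function, and ⊙ is element-wise multiplication.

```latex
\begin{align}
\hat{V} &= f_{att}(Q, K, V) \\
I &= W_q^{i} Q + W_v^{i} \hat{V} + b^{i} \\
G &= \sigma\left(W_q^{g} Q + W_v^{g} \hat{V} + b^{g}\right) \\
\hat{I} &= G \odot I
\end{align}
```

Here I is the information vector, G is the attention gate, and the gated result \hat{I} is the attended information used in place of the plain attention output \hat{V}.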

C. Training Strategies
The training process aims to prepare the captioning model for learning to predict the probabilities of the words that will appear in the caption. Two commonly adopted training strategies are cross-entropy loss and reinforcement learning.
1) Cross-entropy loss: Traditional image captioning models are usually trained using maximum likelihood estimation (MLE), which optimizes the model parameters by minimizing the cross-entropy loss. However, this objective does not directly measure caption quality and can easily lead to inaccurate or repetitive captions.
2) Reinforcement learning: Training image captioning models using reinforcement learning has led to significant improvements. A typical method is self-critical sequence training [34], which treats the generation of a caption as a sequence of actions, evaluates the quality of the generated caption with the CIDEr-D metric, and maximizes this metric through reinforcement learning.
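The two training objectives can be contrasted with a minimal sketch. The function names and the numeric rewards below are hypothetical; in self-critical sequence training [34] the reward would be a caption-level score such as CIDEr-D, and the reward of the greedily decoded caption serves as the baseline. The sketch only computes scalar loss values and leaves out the model and the gradient step.

```python
import math

def cross_entropy_loss(gt_token_probs):
    """Negative log-likelihood of the ground-truth caption.

    gt_token_probs: probability the model assigns to each ground-truth word.
    """
    return -sum(math.log(p) for p in gt_token_probs)

def scst_loss(sampled_token_probs, sampled_reward, greedy_reward):
    """Self-critical surrogate loss for one sampled caption.

    The advantage (sampled reward minus greedy baseline reward) is treated as a
    constant; minimizing the loss raises the probability of sampled captions
    that beat the greedy caption and lowers it for those that do worse.
    """
    advantage = sampled_reward - greedy_reward
    log_prob = sum(math.log(p) for p in sampled_token_probs)
    return -advantage * log_prob

# Hypothetical numbers for illustration only.
print(cross_entropy_loss([0.5, 0.5]))
print(scst_loss([0.5, 0.25], sampled_reward=1.2, greedy_reward=1.0))
```

The cross-entropy loss depends only on the probabilities of the reference words, while the self-critical loss depends on a sequence-level reward, which is why the latter can optimize a non-differentiable evaluation metric directly.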

IV. EXPERIMENTAL DESIGN
This section presents our research questions, basic image captioning models, datasets, and evaluation metrics.

A. Research Questions
We plan to investigate the following four research questions.
• RQ1: What is the impact of the language models on the performance of image captioning models?
• RQ2: How do different decoding strategies affect the performance of image captioning?
  RQ2.1: How does beam search compare to greedy search for the task of image captioning?
  RQ2.2: What is the impact of using the AoA mechanism with the language model on the performance of image captioning?
• RQ3: How do different training strategies used for the decoder impact the performance of image captioning?
• RQ4: What is the impact of the combinational usage of various strategies of the decoder on the performance of image captioning?

B. Basic Image Captioning Models
In this study, we employed four state-of-the-art image captioning models as the basic models, based on which we constructed various model variants (the details are elaborated in Section V). The information of these models is summarized in Table I and further described below.
FC [34]: The FC model utilizes a deep CNN model, ResNet101, to encode the input image, and then a linear map is used for embedding. The model uses an LSTM-based decoder.
Att2in2 [34]: The Att2in2 model uses ResNet101 as the encoder and an LSTM as the decoder. In particular, it is an image captioning model involving the attention mechanism, and it is an improved version of the attention-based captioning model of [31].
As shown in Table I, the encoders of the above four models are either Faster R-CNN ResNet101 [41] or ResNet101 [32]. We further detail these two encoder models below.

C. Dataset
Our experiments are conducted on the MSCOCO dataset [23], which is a popular benchmark for image captioning tasks, containing 123,287 images, each with 5 captions, for a total of 615,935 captions. We use the "Karpathy" data split [42], with 5,000 images for validation, 5,000 for testing, and the rest for training.
To preprocess the captions, we generated a vocabulary of 10,369 unique words by converting sentences to lowercase and removing words that appeared fewer than five times.
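The preprocessing step can be sketched as follows. This is a minimal illustration rather than the exact pipeline used in the experiments: whitespace tokenization, the `<unk>` placeholder for removed words, and the function names are assumptions.

```python
from collections import Counter

def build_vocab(captions, min_count=5):
    """Lowercase the captions and keep words that occur at least min_count times."""
    counts = Counter(word for caption in captions for word in caption.lower().split())
    return sorted(word for word, n in counts.items() if n >= min_count)

def encode(caption, vocab, unk="<unk>"):
    """Replace out-of-vocabulary words with a placeholder token (an assumption)."""
    known = set(vocab)
    return [word if word in known else unk for word in caption.lower().split()]

# Toy corpus: "cat" and "sleeps" occur only four times and are therefore dropped.
captions = ["A dog runs"] * 5 + ["A cat sleeps"] * 4
vocab = build_vocab(captions)
print(vocab)                        # ['a', 'dog', 'runs']
print(encode("A cat runs", vocab))  # ['a', '<unk>', 'runs']
```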

V. EXPERIMENTAL RESULTS
This section analyzes and reports our experimental results. Specifically, for each RQ, we discuss the motivation, present the approach, and finally report the results.

A. RQ1: Impact of Language Models on the Task of Image Captioning
Motivation: For an encoder-decoder based image captioning model, the language model constitutes the key part of the decoder, and thus it is crucial to the overall performance of image captioning. Prior studies have proposed various language models for supporting the decoding stage of image captioning [5]. Yet, there is still a lack of empirical evidence revealing the extent of the impact, and it is also unclear how different language models affect the performance of image captioning models. Therefore, in this RQ, we investigated image captioning models with varying language models, in order to reveal the impact of language models on the performance of image captioning models.
Approach: First, we utilized the three baseline models employing an RNN-based language model (namely, FC, Att2in2, and Up-Down) by following their original configurations in the prior studies [12]. Secondly, we constructed six model variants from the three baseline models by modifying the decoder part. That is, for each model, two variants were constructed by replacing its default language model LSTM with an RNN and a GRU, respectively. For the sake of simplicity, we use M♢L to denote a model M supported with the specific language model L. For example, FC♢RNN represents the variant of model FC where the default language model LSTM is replaced with an RNN. Thirdly, for each of the models obtained in the previous steps, we further constructed a variant by modifying its encoder (i.e., from ResNet101 to Faster R-CNN ResNet101, or vice versa). Finally, we evaluated these 18 models (three baseline models and 15 model variants) and collected evaluation results on a series of evaluation metrics.
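The count of 18 models follows from crossing three base models, three language models, and two encoders. The short sketch below enumerates these configurations; the variable and function names are hypothetical and the snippet only reproduces the bookkeeping, not the models themselves.

```python
from itertools import product

BASE_MODELS = ["FC", "Att2in2", "Up-Down"]
LANGUAGE_MODELS = ["RNN", "GRU", "LSTM"]  # LSTM is the default language model
ENCODERS = ["ResNet101", "Faster R-CNN ResNet101"]

def enumerate_rq1_models():
    """Return every (base model, language model, encoder) configuration."""
    return list(product(BASE_MODELS, LANGUAGE_MODELS, ENCODERS))

def label(model, language_model, encoder):
    """Format a configuration with the M♢L notation plus the encoder name."""
    return f"{model}♢{language_model} ({encoder})"

models = enumerate_rq1_models()
print(len(models))  # 18
for config in models[:3]:
    print(label(*config))
```

Three of the 18 configurations coincide with the original baseline models (LSTM with the default encoder), which leaves the 15 model variants mentioned above.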
Results: Tables II and III each report the evaluation results of the nine models employing the same encoder.
Based on these results, we make the following observations.
1) The use of different language models leads to varying performance of the image captioning model. As shown in Table II, for each of the models, the use of RNN, GRU, or LSTM as the decoder model yields different values for each of the six evaluation metrics. For example, the BLEU1 values for the three models, that is, FC♢RNN, FC♢GRU, and the original FC model, are 73.70, 73.71, and 74.06, respectively. Table III consistently reveals this point.
2) The language model affects different image captioning models in different ways. First, it is observed that the language model may have opposite impacts on different image captioning models. Consider the FC and Att2in2 models as an example. According to Table II, compared to the use of RNN, the use of GRU positively contributes to the BLEU1 value of FC (the BLEU1 values of FC♢RNN and FC♢GRU are 73.70 and 73.71), while it negatively affects the BLEU1 value of Att2in2 (the BLEU1 values of Att2in2♢RNN and Att2in2♢GRU are 75.56 and 75.11). Second, the degree of the impact of the language models may also differ when they are applied to different image captioning models. As can be observed from Table III, FC♢GRU outperforms FC♢RNN in terms of the CIDEr metric by a margin of 1.65 (97.72 vs. 96.07). In contrast, although Att2in2♢GRU also outperforms Att2in2♢RNN in terms of the CIDEr metric, the margin is much smaller (0.14).
3) The best language model for different image captioning models may be different. The three language models under investigation (that is, RNN, GRU, and LSTM) benefit different image captioning models. For the models employing Faster R-CNN ResNet101 as the encoder, the best language model for the FC model is LSTM, while the Up-Down model exhibits the best performance with GRU as the language model (as observed from Table II). By contrast, for the models employing ResNet101 as the encoder, the FC model performs best with GRU, the Att2in2 model achieves the best performance with LSTM, and the Up-Down model performs best with RNN (as observed from Table III).
RQ1: For the encoder-decoder based image captioning models, employing different language models as the decoder leads to varying captioning performance. Nevertheless, the impact of the language models on different image captioning models may vary, and accordingly, the most suitable language model may also differ from one image captioning model to another.

B. RQ2: Impact of Different Decoding Strategies on Image Captioning Models
Motivation: At present, encoder-decoder based image captioning models have been extended and enhanced via a variety of decoding strategies [5]. Although these decoding strategies have been demonstrated to positively contribute to captioning performance, they have not been comprehensively investigated on the same set of image captioning models and datasets. To fill this gap, in this RQ, we empirically studied the impacts of two types of decoding strategies, the search strategy and the AoA mechanism.
Approach: We first focus on the search strategies adopted by the decoder of the image captioning model. To this end, we conducted experiments on 20 models, including the 18 models constructed for RQ1, the basic Transformer model, and its variant employing ResNet101 instead of Faster R-CNN ResNet101 as the encoder. Note that all of these 20 models adopt greedy search (as reported in Table I). Based on these, we further constructed 20 model variants by replacing greedy search with beam search. In particular, the latter set of models is configured with various beam sizes (in this study, we adopted four beam sizes: 2, 3, 4, and 5). As a result, there are 20 groups of models, each of which consists of two models sharing the same technical details except for the search strategy. We evaluated all of these models on the dataset and compared the performance of the models within individual groups.
To study the impacts of the AoA mechanism, the three base models Att2in2, Up-Down, and Transformer and their variants are utilized. The FC model and its variants are excluded because they do not employ the attention mechanism, and thus the AoA mechanism is not applicable. For each of the models, we constructed a variant by additionally applying the AoA mechanism, and then conducted a comparative analysis of their performance.
Results: Fig. 3 reports the performance comparison results of image captioning models with and without the beam search strategy. In particular, Fig. 3 (a)-(f) reports the results for the ten groups of models employing Faster R-CNN ResNet101 as the encoder, where each subfigure focuses on the comparison of performance with respect to one of the evaluation metrics. Accordingly, the comparison results for the other ten groups of models that use ResNet101 as the encoder are reported in Fig. 3 (g)-(l). Fig. 4 further reports the performance comparison results on seven groups of models with and without the AoA mechanism. Based on these results, we have the following observations:
1) The use of different decoding strategies affects the performance of image captioning models. It can be observed from Fig. 3 that using greedy search or beam search leads to varying captioning performance of the relevant models. Similarly, Fig. 4 also shows that every image captioning model under investigation exhibits different performance with and without the AoA mechanism.
2) Compared to greedy search, the use of beam search generally improves the captioning performance. Firstly, it can be observed from Fig. 3 that most of the models achieve better performance by using beam search. This indicates that the use of beam search is beneficial to image captioning models. On the other hand, it can also be found that the optimal beam size varies across models. Nevertheless, for the majority of models, the best performance is reached with a beam size of 2.
3) The application of the AoA mechanism benefits most of the image captioning models under investigation. Fig. 4 shows that after additionally applying the AoA mechanism to the target image captioning models, the captioning performance is improved in most cases (that is, for most of the models with respect to the majority of evaluation metrics). Although there are some models for which the application of the AoA mechanism leads to a decrease in captioning performance (e.g., the Up-Down model employing the Faster R-CNN ResNet101 encoder), the extent of the decrease is smaller than the extent of the increases resulting from using the AoA mechanism.
RQ2: For encoder-decoder based image captioning models, the application of decoding strategies affects the captioning performance. Specifically, the use of beam search generally outperforms the use of greedy search, and most models exhibit the best performance with the beam search configured with a beam size of 2. Moreover, the application of the AoA mechanism is beneficial to most of the image captioning models under investigation.

C. RQ3: Impact of the Training Strategies on Image Captioning Models
Motivation: Currently, encoder-decoder based image captioning models are trained with various training methods. However, no prior study has focused on revealing the effect of the training methods applied to the decoder part. Hence, in this RQ, we empirically studied two training approaches (cross-entropy loss and reinforcement learning) and their impacts on captioning performance.

Approach: In the experiments, we reused the 20 image captioning models, namely the FC, Att2in2, and Up-Down models employing an RNN, a GRU, or an LSTM in the decoder, the Transformer model, and their variants using a different CNN (Faster R-CNN ResNet101 or ResNet101) as the encoder. Note that all of these models are trained with their default method, namely, the cross-entropy loss method. Based on each of these models, we further constructed a model variant by training its decoder via a reinforcement learning based method, the self-critical sequence training method. This results in 20 groups of models, where each group consists of a model and its variant involving a decoder trained via reinforcement learning. We evaluated these newly constructed model variants and further conducted a comparative analysis within individual groups.
Results: Tables IV and V each report the evaluation results of the ten newly constructed models employing the same encoder. For each newly constructed model (where the decoder is trained via reinforcement learning), we further compared its performance with that of the corresponding model trained via the cross-entropy loss method (as reported in Tables II and III). Accordingly, Tables IV and V also report the improvements made by applying the reinforcement learning based training method (an improvement is indicated by ↑).
Based on these results, we make the following observations:
1) Reinforcement learning is an effective training method for supporting the task of image captioning. Both Table IV and Table V reveal that training the decoder with the self-critical sequence training method leads to a performance improvement for all of the target models. For example, after applying the self-critical sequence training method, FC♢RNN exhibits an improvement of 4.02 in terms of the BLEU1 metric, and it also improves with respect to the CIDEr metric (as shown in the first row of Table IV).
2) The performance improvements made by reinforcement learning are different for different image captioning models. According to Table IV, the performance improvements obtained via self-critical sequence training range from 0.59 to 14.16. Similarly, as shown in Table V, the highest increase in performance is 12.92, and the lowest increase is 0.46. Furthermore, it can be observed that the application of the reinforcement learning based training method leads to a different performance improvement for each of the target models.
RQ3: For encoder-decoder based image captioning models, training the decoder with reinforcement learning improves the model performance. Nevertheless, the degree of the improvement is different for different image captioning models.

D. RQ4: Impact of the Combinational Usage of Decoding Strategies on the Performance of Image Captioning Models
Motivation: We have previously investigated the effect of every single strategy or mechanism on the performance of image captioning models. With the observation that these strategies and mechanisms can be applied to an image captioning model together, in this RQ, we further studied the effect of applying various combinations of these strategies.
Approach: We utilized the four baseline models employing the Faster R-CNN ResNet101 encoder as the basic models. We further considered combinations of the three strategies or methods, that is, the search strategy, the AoA mechanism, and the training method. We use P_xyz to denote the application of a combination of strategies, where the binary digits x, y, and z indicate whether beam search, the AoA mechanism, and self-critical sequence training, respectively, are applied (1) or not (0). We further applied various combinations of different strategies to every basic model to construct model variants. For the FC model, since it does not support the attention mechanism, only P_101 (that is, beam search and self-critical sequence training) is applicable. Accordingly, one model variant was constructed from the FC model. For the other three basic models, four different combinations of these strategies are applicable (namely, P_101, P_110, P_011, and P_111), and thus four model variants were constructed from each of them. Finally, these model variants were evaluated on the dataset.
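The set of applicable combinations can be enumerated with a short sketch. It assumes that the three digits of the combination code stand for beam search, the AoA mechanism, and reinforcement learning based training, in that order; this reading is inferred from the description of P 101 as beam search plus self-critical sequence training. The function name is hypothetical.

```python
from itertools import product

def applicable_combinations(supports_attention):
    """List the strategy combinations that apply at least two strategies.

    Each code has three binary digits, assumed here to mean (beam search,
    AoA mechanism, reinforcement learning training). Combinations that use
    AoA are skipped for models without an attention mechanism.
    """
    combos = []
    for bits in product((0, 1), repeat=3):
        if sum(bits) < 2:
            continue  # single strategies are already covered by RQ2 and RQ3
        if bits[1] and not supports_attention:
            continue  # AoA needs an attention-based decoder
        combos.append("P" + "".join(str(b) for b in bits))
    return combos

print(applicable_combinations(supports_attention=True))   # ['P011', 'P101', 'P110', 'P111']
print(applicable_combinations(supports_attention=False))  # ['P101']

# One variant for FC plus four variants for each of the other three basic models.
total = len(applicable_combinations(False)) + 3 * len(applicable_combinations(True))
print(total)  # 13
```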
Results: Table VI reports the evaluation results of the 13 model variants that employ some combination of the strategies or methods applied to the decoder part. Note that for each model variant, the related models employing only one of these strategies have already been evaluated and investigated in the previous RQs. We thus compare each variant with the best-performing one among them in order to report the performance improvement achieved via the application of combined strategies (the performance improvement is shown in Table VI). Based on these results, we make the following observations:
1) The combination of various strategies helps to improve the performance of image captioning models in most cases. Table VI shows that the captioning performance is improved in most cases (i.e., on most of the evaluation metrics for most models) after applying a combination of strategies to the target image captioning models. Although the application of combined strategies to some models leads to a decrease in their captioning performance (e.g., the Att2in2 model with beam search and the self-critical sequence training method), the decrease is relatively small.
2) Different combinations of methods have different effects on the performance enhancement. As can be seen from Table VI, the performance improvement differs across models using the same combination of strategies, and it also differs for the same model using different combinations of strategies. Nevertheless, the three models to which various combinations of strategies have been applied all exhibit the best performance with P_111. That is, by applying beam search, the AoA mechanism, and the reinforcement learning based training method together, these models perform better than those equipped with only some of these strategies.
RQ4: For encoder-decoder based image captioning models, applying a combination of strategies to the decoder is helpful for improving the overall captioning performance. The image captioning models under investigation exhibit the best performance when beam search, the AoA mechanism, and the reinforcement learning based training method are used together.

VI. CONCLUSION
In this work, we focus on the impact of various aspects of the decoder on image captioning. To understand how the text generation technique employed by the decoder affects the results, we have conducted an extensive empirical analysis involving three different language models, two different decoding strategies, and two different training methods. The results show that different language models have different impacts on the quality of the generated captions. Meanwhile, the use of the two decoding strategies as well as the reinforcement learning based training method helps to improve model performance. In addition, we found that using a combination of these strategies is usually better than using a single strategy in image captioning tasks. Future work can extend our study to more complex datasets, in particular those covering cross-cultural settings. Investigating how to integrate other machine learning techniques, such as transfer learning, to further improve model performance is also an important direction. Pursuing these directions would broaden the scope and impact of this study.

Fig. 1. For a given image, different image captioning approaches may yield varying captions.
Up-Down [29]: The Up-Down model uses Faster R-CNN ResNet101, a classical object detection model, as the encoder to extract local features from an image. The decoder employs a two-layer LSTM architecture (a Top-Down Attention LSTM and a Language LSTM) that utilizes both bottom-up and top-down attention mechanisms to generate natural language descriptions matching the content of the image.
Transformer [40]: The Transformer model uses Faster R-CNN ResNet101 as the encoder and a Transformer as the decoder. The Transformer decoder converts the representation produced by the encoder into a target sequence. It generates the target sequence step by step through the self-attention mechanism and the encoder-decoder attention mechanism.
Faster R-CNN ResNet101 [41]: Faster R-CNN ResNet101 combines the Faster R-CNN object detection algorithm with the ResNet101 feature extractor. As a feature extractor, ResNet101 can efficiently extract features from images. The combination pairs the efficiency of the detection algorithm with the deep feature learning ability of ResNet101, so the model performs well in object detection tasks.

Fig. 3. Comparison of models employing greedy search with those employing beam search. For the latter, various beam sizes (2, 3, 4 and 5) have been investigated. A total of 20 groups of models are studied, including ten groups of models using Faster R-CNN ResNet101 ((a)-(f)) and another ten groups of models using ResNet101 ((g)-(l)).
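The difference between greedy search and beam search compared in Fig. 3 can be illustrated with a small sketch. It is not the decoder used in the study. The function `toy_next_probs` and its probability table are hypothetical values made up for illustration, and no length normalization is applied. Setting `beam_size=1` reduces the procedure to greedy search.

```python
import math


def toy_next_probs(prefix):
    """Hypothetical next-token distribution; the numbers are made up."""
    table = {
        (): {"a": 0.6, "the": 0.4},
        ("a",): {"dog": 0.5, "cat": 0.5},
        ("the",): {"dog": 0.9, "cat": 0.1},
    }
    # Any prefix not listed above can only be followed by end-of-sentence.
    return table.get(prefix, {"<eos>": 1.0})


def beam_search(next_probs, beam_size, max_len=10, eos="<eos>"):
    """Return the (tokens, log_prob) pair with the highest log-probability."""
    beams = [((), 0.0)]  # partial captions with their log-probabilities
    finished = []        # captions that have emitted the end token
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, p in next_probs(tokens).items():
                cand = (tokens + (tok,), score + math.log(p))
                (finished if tok == eos else candidates).append(cand)
        if not candidates:
            break
        # Keep only the beam_size most probable partial captions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return max(finished or beams, key=lambda c: c[1])


if __name__ == "__main__":
    for k in (1, 2):
        tokens, score = beam_search(toy_next_probs, beam_size=k)
        print(k, " ".join(tokens), round(math.exp(score), 2))
```

In this toy setting, greedy search commits to the locally most probable first word and ends with a less probable caption than a beam of size 2, which keeps the alternative first word alive for one more step.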

Fig. 4. Comparison of the 14 groups of models, including seven groups of models using Faster R-CNN ResNet101 ((a)-(f)) and another seven groups of models using ResNet101 ((g)-(l)). In each group, one model does not use AoA (denoted as base), while the other one applies AoA (denoted by AoA).
ResNet101: ResNet101 is a CNN model with 101 layers. It is a variant of the Residual Network and introduces residual connections to mitigate vanishing and exploding gradients when training deep networks. Compared with traditional shallow networks, it can learn image features at a deeper level, thereby extracting more complex and higher-level feature representations.
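The residual connection mentioned above adds a block's input back to its output, so the block only has to learn a correction to the identity mapping. The sketch below shows this on plain Python lists. The function `residual_block` and the toy transform are hypothetical stand-ins for the convolutional layers of an actual ResNet.

```python
def residual_block(x, transform):
    """Return transform(x) + x element-wise for a list of floats."""
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("transform must preserve the input length")
    return [f + xi for f, xi in zip(fx, x)]


if __name__ == "__main__":
    # A transform that outputs zeros leaves the input unchanged, which is
    # why stacking such blocks does not have to degrade the signal.
    print(residual_block([1.0, 2.0], lambda v: [0.0 for _ in v]))
    print(residual_block([1.0, 2.0], lambda v: [0.5 * e for e in v]))
```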

TABLE I. BASIC INFORMATION OF THE SELECTED MODELS

TABLE II. EVALUATION RESULTS OF THE NINE MODELS EMPLOYING FASTER R-CNN RESNET101 AS THE ENCODER. AMONG EACH BASIC MODEL AND ITS VARIANTS, THE BEST PERFORMANCE IN TERMS OF INDIVIDUAL METRICS IS HIGHLIGHTED WITH BOLD TYPE. FURTHERMORE, THE BEST PERFORMER IN TERMS OF INDIVIDUAL METRICS IS UNDERLINED

TABLE V. PERFORMANCE OF MODELS WITH RESNET101 AS THE ENCODER AND TRAINED WITH THE SELF-CRITICAL SEQUENCE TRAINING METHOD
AoA mechanism are applied together, while P(011) represents that the AoA mechanism and the self-critical sequence training method are applied together.

TABLE VI. EVALUATION RESULTS OF MODELS APPLYING MULTIPLE (COMBINED) STRATEGIES. BY COMPARING EACH MODEL (THAT APPLIES AT LEAST TWO TYPES OF STRATEGIES TOGETHER) WITH THE BEST ONE APPLYING ONLY ONE OF SUCH STRATEGIES, THE IMPROVEMENT ACHIEVED VIA THE COMBINED USE OF VARIOUS STRATEGIES IS REPORTED (↑ AND ↓ REPRESENT THE POSITIVE AND NEGATIVE IMPROVEMENTS, RESPECTIVELY)
Columns: FC: P(101) | Att2in2: P(101), P(110), P(011), P(111) | Up-Down: P(101), P(110), P(011), P(111) | Transformer: P(101), P(110), P(011), P(111)