Comparative evaluation of CNN architectures for Image Caption Generation

Aided by recent advances in Deep Learning, Image Caption Generation has seen tremendous progress over the last few years. Most methods use transfer learning to extract visual information, in the form of image features, with the help of pre-trained Convolutional Neural Network models, followed by transformation of the visual information using a Caption Generator module to generate the output sentences. Different methods have used different Convolutional Neural Network architectures and, to the best of our knowledge, there is no systematic study which compares their relative efficacy for extracting the visual information. In this work, we evaluate 17 different Convolutional Neural Networks on two popular Image Caption Generation frameworks: the first based on the Neural Image Caption (NIC) generation model and the second based on the Soft-Attention framework. We observe that the model complexity of a Convolutional Neural Network, as measured by the number of parameters, and its accuracy on the Object Recognition task do not necessarily correlate with its efficacy as a feature extractor for the Image Caption Generation task.


I. INTRODUCTION
Image Caption Generation involves training a Machine Learning model to automatically produce a single sentence description for an image. For human beings this is a trivial task. However, for a Machine Learning method to perform this task, it has to learn to extract all the relevant information contained in the image and then convert this visual information into a suitable representation which can be used to generate a natural language description of the image. The visual features extracted from the image should contain information about all the relevant objects present in the image, the relationships among the objects and the activity settings of the scene. This information then needs to be suitably encoded, generally in a vectorized form, so that the sentence generator module can convert it into a human readable sentence. Furthermore, some information may be implicit in the scene, such as a scene where a group of football players are running in a football field but the football itself is not present in the frame; thus the model may need to learn some level of knowledge about the world as well. The ability to automate the caption generation process has many benefits for society, as it can either replace or complement any method that seeks to extract information from images. It has applications in fields such as education, the military and medicine, as well as in specific problems such as helping visually impaired people navigate or generating news information from images.
During the last few years there has been tremendous progress in Image Caption Generation due to advances in the Computer Vision and Natural Language Processing domains. Progress on the Object Recognition task, driven by the availability of large annotated datasets such as ImageNet [1], has made available pre-trained Convolutional Neural Network (CNN) models which can extract useful information from an image in vectorized form; this information can then be used by the caption generation module (called the decoder) to generate caption sentences. Similarly, progress in machine translation with methods such as the encoder-decoder framework proposed in [2], [3] has led to the adoption of a similar format for Image Caption Generation, where the source sentence of the machine translation task is replaced by the image, and the process is approached as 'translation' of the image into a sentence, as has been done in works such as [4], [5], [6]. The attention based framework proposed in [7], where the decoder learns to focus on certain parts of the source sentence at certain time-steps, has been adapted to caption generation in such a way that the decoder focuses on portions of the image at certain time-steps [8]. Detailed surveys of Image Caption Generation are provided in [9] and [10].
Although there has been a lot of focus on the decoder, which 'interprets' the image features and 'translates' them into a caption, there has not been enough focus on the encoder, which 'encodes' the source image into a suitable visual representation (called image features). This is mainly because most methods use transfer learning to extract image features from pre-trained Convolutional Neural Networks (CNN) [11] which are trained on the Object Detection task of the ImageNet Large Scale Visual Recognition Challenge [12], where the goal is to predict the object category out of the 1000 categories annotated in the dataset. Since the last layer of the CNN produces a 1000-length vector containing the relative probabilities of all object categories, the last layer is dropped and the output(s) of intermediate layer(s) are used as image features. Hence, in this work we evaluate Image Caption Generation using popular CNN architectures which have been used for the Object Recognition task and analyse the correlation between model complexity, as measured by the total number of parameters, and the effectiveness of different CNN architectures at feature extraction for Image Caption Generation. We use two popular Image Caption Generation frameworks: (a) the Neural Image Caption (NIC) Generator proposed in [6] and (b) the Soft Attention based Image Caption Generation proposed in [8]. We observe that the performance of Image Caption Generation varies with the use of different CNN architectures and is not directly correlated with either the model complexity or the performance of the CNN on the object recognition task. To further validate our findings, we evaluate multiple versions of the ResNet CNN [13] with different depths (numbers of layers) and complexity: ResNet18, ResNet34, ResNet50, ResNet101 and ResNet152, where the numerical part of the name stands for the number of layers in the CNN (such as 18 layers in ResNet18, and so on).
We evaluate multiple versions of the VGG CNN [14] architecture: VGG-11, VGG-13, VGG-16 and VGG-19, and multiple versions of the DenseNet CNN [15] architecture: DenseNet121, DenseNet169, DenseNet201 and DenseNet161, each of which has a different number of parameters. We observe that performance does not improve with the increase in the number of layers and, consequently, the increase in model complexity. This further validates our observation that the effectiveness of a CNN architecture for Image Caption Generation depends on the model design, and that model complexity and performance on the Object Detection task are not good indicators of the effectiveness of a CNN for Image Caption Generation. To the best of our knowledge, this is the first such detailed analysis of the role of CNN architectures as image feature extractors for the Image Caption Generation task. In addition, to further future research in this area, we also make the implementation code available for reference (https://github.com/iamsulabh/cnn_variants). This paper is divided into the following sections: in Section II, we discuss the relevant methods proposed in the literature; in Section III, we discuss the methodology of our work; in Section IV, we present and discuss the experimental results; and in Section V, we discuss the implications of our work and possible future studies.

II. RELATED WORK
Some of the earliest works attempted to solve the problem of caption generation in constrained environments, such as the work proposed in [16] where the authors try to generate captions for objects present in an office setting. Such methods had limited scalability and applications. Some works addressed the task as a Retrieval problem, where a pool of sentences was constructed which could describe all (or most) images in a particular setting; for a target image, the sentence deemed most appropriate by the algorithm was selected as the caption. For example, in [17], the authors construct a 'meaning space' consisting of <object, action, scene> triplets, which is used as a common mapping space for images and sentences. A similarity measure is used to find the sentences with the highest similarity to the target image, and the most similar sentence is selected as the caption. In [18], a set of images similar to the target image is retrieved from the training data using a visual similarity measure. A word probability density conditioned on the target image is then calculated using the captions of the retrieved images. The captions in the dataset are scored using this word probability density, and the sentence with the highest score is selected as the caption for the target image. Retrieval based methods generally produce grammatically correct and fluent captions because they select a human generated sentence for the target image. However, this approach is not scalable, because a large number of sentences needs to be included in the pool for each kind of environment. Moreover, the selected sentence may not even be relevant, because the same kinds of objects may have different kinds of relationships among them which cannot be described by a fixed set of sentences.
Another class of approaches are the Template based methods, which construct a set of hand-coded sentence templates according to the rules of grammar and semantics, and then use optimization algorithms to plug different object components and their relationships into the templates to generate sentences for the target image. For example, in [19], Conditional Random Fields are used to recognize image contents. A graph is constructed with the image objects, their relationships and attributes as nodes. The reference captions available with the training images are used to calculate pairwise relationship functions using statistical inference, and the visual concepts are used to determine the unary operators on the nodes. In [20], visual models are used to extract information about objects, attributes and spatial relationships. The visual information is encoded in the form of [<adjective1, object1>, preposition, <adjective2, object2>] triplets. Then n-gram frequency counts are extracted from a web-scale training dataset using statistical inference, and dynamic programming is used to determine the optimal combination of phrases to fuse into sentences. Although Template based approaches are able to generate more varied captions, they are still handicapped by problems of scalability, because a large number of sentence templates must be hand-coded and even then many phrase combinations may be left out.
In recent years, most of the works proposed in the literature have employed Deep Learning to generate captions. Most works use CNNs, pre-trained on the ImageNet Object Recognition dataset [1], to extract a vectorized representation of the image. Words of a sentence are represented as word embedding vectors extracted from a look-up table, which is learned during training as the set of weights of the Embedding Layer. The image and word information is combined in different ways, and most methods use variants of the Recurrent Neural Network (RNN) [21] to model the temporal relationships between words in the sentence. In [5], the image features extracted from the CNN and the word embeddings are mapped to the same vector space and merged using element-wise addition at each time-step. The merged image features and word embeddings are then used as input to a Multimodal Recurrent Neural Network (m-RNN) which generates the output. The authors use AlexNet [22] and VGG-16 [14] as the CNNs to extract image features. In [4], a Bidirectional Recurrent Neural Network is used as the decoder, because it can model the relationships of a word with both the words that precede and the words that succeed it in the sentence. The word embeddings and image features are merged before being fed into the decoder. The authors use the AlexNet [22] CNN to extract image features. In [6], a Long Short-Term Memory (LSTM) network [23] is used as the decoder. The image features are mapped to the vector space spanned by the hidden state representations of the LSTM and are used as the initial hidden state of the LSTM; thus the image information is fed to the LSTM at the initial state only. The LSTM takes the previously generated words as input (with a special 'start' token as the first input) and generates the next word sequentially. The authors use the CNN of [24] for extracting image features. Using the attention approach, in [8] the authors train the model to focus on certain parts of the image at certain time-steps.
This attention mechanism takes as input the image features and the output generated up to the last time-step, and produces an image representation conditioned on the text input. This is merged with the word embeddings at the current time-step using vector concatenation and used as input to the LSTM generator. The authors used the VGGNet [14] CNN as the image feature extractor. Recently, methods using Convolutional Neural Networks as sequence generators have been proposed, such as [25] for text generation. Based on this approach, [26] propose a method which uses one CNN for encoding the image and another CNN for decoding, i.e., generating the caption. The CNN decoder is similar to the one used in [25] and uses a hierarchy of layers to model word relationships of increasing complexity. The authors use the ResNet152 [13] CNN to encode the image features. More recently, the Transformer network, which uses self-attention to model word relationships instead of recurrent or convolutional operations, has been introduced [27], and a Transformer based caption generation method is proposed in [28]. Since the various methods use different CNN architectures to extract image features, there is a need for a comparative analysis of their effectiveness in image feature extraction within the same overall caption generation framework.

III. PROPOSED METHOD
In image caption generation, given an image the task is to generate a sentence S = {w_1, w_2, w_3, ..., w_L}, w_i ∈ V, where L is the length of the sentence and V represents the vocabulary of the dataset. The words w_1 and w_L are usually special tokens marking the start and end of the sentence. Two more special tokens, 'unknown' and 'padding', are also used: the former represents unknown words (which may be stop words and rare words removed from the dataset to speed up training) and the latter pads the end of the sentence (to make all sentences of equal length, because RNNs cannot handle sentences of different lengths in the same batch). Given pairs of image and sentence, (I_N, S_i) for i ∈ {1, 2, 3, ..., j}, where j is the number of captions per image in the training set, during training we maximize the probability P(S_i | I_N, θ), where θ represents the set of parameters of the model. Hence, as mentioned in [6], during training the model learns to update the set of parameters θ so that the probability of generating the correct captions is maximized:

θ* = argmax_θ Σ_(I,S) log P(S | I; θ)    (1)

where θ is the set of all parameters of the model, I is the image and S is one of the reference captions provided with the image. We can use the chain rule because the generation of each word of a sentence depends on the previously generated words, and hence Equation 1 can be extended to the constituent words of the sentence:

log P(S | I; θ) = Σ_(t=1)^(L) log P(w_t | I, w_1, ..., w_(t-1); θ)    (2)

where w_1, w_2, ..., w_L are the words of the sentence S of length L. This equation can be modelled using a Recurrent Neural Network which generates the next word conditioned on the previous words of the sentence. We have used the LSTM as the RNN variant for our experiments.
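In implementation, maximizing Equation 2 amounts to minimizing the sum of per-word negative log-probabilities, i.e. the cross-entropy loss over the words of the reference caption. The following minimal sketch illustrates this; the probability values are hypothetical stand-ins for the decoder's softmax outputs, not numbers from our experiments:

```python
import math

def caption_nll(step_probs):
    """Negative log-likelihood of a caption given the probability the
    model assigned to each correct word w_t (Equation 2 with a minus
    sign). Minimizing this maximizes log P(S | I; theta)."""
    return -sum(math.log(p) for p in step_probs)

# Hypothetical per-step probabilities of the reference words
# "<start> a dog runs <end>":
probs = [0.9, 0.5, 0.25, 0.4, 0.8]
loss = caption_nll(probs)
```

Summed over a mini-batch of image-caption pairs, this is the quantity minimized by gradient descent during training.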
In this work, we evaluate caption generation performance on two popular encoder-decoder frameworks with certain modifications. For both methods, we experiment with different CNN architectures for image feature extraction and analyse the effects on performance. The first method is based on the Neural Image Caption generation method proposed in [6]. However, unlike the method proposed in [6], we have not used model ensembles to improve performance. In addition, we have extracted image features from a lower layer of the CNN, which produces a set of vectors each of which contains information about a region of the image. We have observed that this leads to better performance, as the decoder is able to use region specific information to generate captions. Throughout this paper, this will be referred to as the 'CNN+LSTM' approach, with the word 'CNN' replaced by the name of the CNN architecture used in the experiment; for example, 'ResNet18+LSTM' refers to caption generation with ResNet18 as the CNN. The second method is similar to the Soft Attention method proposed in [8]. We use an attention mechanism which learns to focus on certain portions of the image at certain time-steps while generating the captions. Similarly, this Soft Attention approach will be referred to as the 'CNN+LSTM+Attention' approach, with the word 'CNN' replaced by the name of the CNN architecture used. Figure 1 illustrates both methods.

A. Image Feature extraction
For extracting image features, we use CNNs pre-trained on the ImageNet dataset [1] for the ImageNet Large Scale Visual Recognition Challenge [12]. These models generate a single output vector containing the relative probabilities of the different object categories (1000 categories in total). We remove this last layer from the CNN, since we need more fine-grained information. We also remove all the layers at the top (with the input layer being called the bottom layer) which produce a single vector as output, because we need a set of output vectors, each containing information about a different region of the image. Hence, the image features are a set of vectors denoted as a = {a_1, a_2, a_3, ..., a_|a|}, a_i ∈ R^D, where |a| is the number of feature vectors in a, R represents the real numbers and D is the dimension of each vector. The set of image feature vectors thus generated is used in two ways in this work. In the 'CNN+LSTM' method, the image features are mapped to the vector space of the hidden state of the LSTM and used to initialize the hidden and cell states of the LSTM decoder. In the 'CNN+LSTM+Attention' method, in addition to hidden and cell state initialization, the set of image feature vectors is also used at each time-step to calculate attention weighted image features, which contain information from those regions of the image that are important at the current time-step. We explain this in detail in Sections III-B and III-C.

B. CNN + LSTM method
In this method, we use a CNN encoder to extract image information and use that information as the initial hidden state of the LSTM decoder. From the set of image feature vectors obtained as described in Section III-A, we obtain a single vector by averaging the values of all vectors in the set:

a_mean = (1 / |a|) Σ_(i=1)^(|a|) a_i    (3)

where |a| is the number of image feature vectors extracted from the CNN. This average is used to generate the initial hidden and cell states of the LSTM through an affine transformation followed by a non-linearity (the tanh function):

h_0 = tanh(W_h a_mean + b_h)    (4)
c_0 = tanh(W_c a_mean + b_c)    (5)

where W_h, W_c and b_h, b_c are the weights and biases of the Multi-Layer Perceptron (MLP) used to model the transformations.
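Equations 3-5 can be sketched numerically as follows (NumPy, with toy dimensions and random placeholders standing in for the learned MLP weights; not the training code of this work):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(49, 512))   # |a| = 49 feature vectors, D = 512
H = 256                          # toy LSTM hidden size

a_mean = a.mean(axis=0)          # Equation 3: average feature vector

# Learned parameters in the real model; random placeholders here.
W_h, b_h = rng.normal(size=(H, 512)), np.zeros(H)
W_c, b_c = rng.normal(size=(H, 512)), np.zeros(H)

h0 = np.tanh(W_h @ a_mean + b_h)  # Equation 4: initial hidden state
c0 = np.tanh(W_c @ a_mean + b_c)  # Equation 5: initial cell state
```

The tanh squashes both initial states into (-1, 1), matching the range of the LSTM's own hidden-state updates.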
The successive hidden and cell states are generated during training. Since the generation of a word depends on the previous words of the sentence, as depicted in Equation 2, this dependence can be modelled using the hidden state of the LSTM (which is also modulated by the cell state). Hence,

P(w_t | I, w_1, ..., w_(t-1)) = f_θ(h_t)    (6)

where f_θ is any differentiable function; since the dependence is recursive in nature, it can be modelled using an RNN. Since the hidden state also depends on the previous hidden states, it can be modelled as a function of the previous hidden state and the current input:

h_t = f_θ(h_(t-1), x_t)    (7)

where f_θ is the same differentiable function as in Equation 6, since the model is trained end-to-end with the same parameters. Words are represented as word embeddings, a learned mapping from one-hot word vectors to the embedding dimensions:

x_t = w_e(w_t)    (8)

where w_e(w_t) is the word embedding vector of word w_t, learned with the rest of the model. We use the LSTM as described in [23]. The LSTM has three control gates: input, forget and output gates. The equations for updating the gates are as follows:

i_t = σ(W_i x_t + R_i h_(t-1) + b_i)    (9)
f_t = σ(W_f x_t + R_f h_(t-1) + b_f)    (10)
o_t = σ(W_o x_t + R_o h_(t-1) + b_o)    (11)
z_t = tanh(W_z x_t + R_z h_(t-1) + b_z)    (12)
c_t = f_t ⊙ c_(t-1) + i_t ⊙ z_t    (13)
h_t = o_t ⊙ tanh(c_t)    (14)

where (W_i, R_i), (W_f, R_f), (W_o, R_o) and (W_z, R_z) are the (input, recurrent) weight matrix pairs for the input, forget, output and input modulator (tanh) gates, respectively, b denotes the bias vectors and ⊙ denotes element-wise multiplication. σ is the sigmoid function, σ(x) = 1/(1 + exp(-x)), which condenses its input to the range (0, 1); tanh is the hyperbolic tangent function, which condenses its input to the range (-1, 1). i_t, f_t and o_t are the input, forget and output gates, respectively. The input gate processes the input information. The output gate generates the output based on the input and the cell state. The cell state stores information about the context, and the forget gate decides what contextual information is to be dropped from the cell state.
The internal structure of the LSTM has been depicted in Figure 2.
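One step of the gate updates in Equations 9-14 can be written out directly; the NumPy sketch below uses toy dimensions and random placeholders for the learned weight matrices, purely for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, R, b):
    """One LSTM step (Equations 9-14). W, R, b hold the stacked
    input/recurrent weights and biases for the i, f, o, z gates."""
    gates = W @ x + R @ h_prev + b
    i, f, o, z = np.split(gates, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # gates in (0, 1)
    z = np.tanh(z)                                # modulator in (-1, 1)
    c = f * c_prev + i * z                        # Eq. 13: new cell state
    h = o * np.tanh(c)                            # Eq. 14: new hidden state
    return h, c

rng = np.random.default_rng(0)
E, H = 128, 256                       # toy embedding and hidden sizes
W = rng.normal(size=(4 * H, E)) * 0.1
R = rng.normal(size=(4 * H, H)) * 0.1
b = np.zeros(4 * H)
h, c = lstm_step(rng.normal(size=E), np.zeros(H), np.zeros(H), W, R, b)
```

Stacking the four gates into single W and R matrices, as done here, is the same trick most deep learning frameworks use internally for efficiency.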

C. CNN + LSTM + Attention method
In this method, in addition to the initial time-step, the image information is fed into the LSTM at each time-step. However, a separate attention mechanism generates information extracted only from those regions of the image which are relevant at the current time-step.
The attention mechanism produces a context vector which represents the relevant portion of the image at each time-step. First, a score is calculated for each image feature vector a_i ∈ a, i ∈ {1, 2, 3, ..., |a|}, as described in Section III-A, based on the previous hidden state of the decoder:

e_t,i = f_att(a_i, h_(t-1))    (15)

where i ∈ {1, 2, 3, ..., |a|}. Then the attention weights are calculated as,

α_t,i = exp(e_t,i) / Σ_(k=1)^(|a|) exp(e_t,k)    (16)

where α_t is the set of weights, one for each image feature vector a_i in a, such that Σ_(i=1)^(|a|) α_t,i = 1. Then the context vector is calculated by another function,

z_t = Φ({a_i}, {α_t,i})    (17)

We have used the functions f_att and Φ as described in [8].
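A small NumPy sketch of Equations 15-17 follows; the additive scoring MLP here is an illustrative stand-in for the f_att of [8], with Φ taken as the attention-weighted average, and all weights are random placeholders:

```python
import numpy as np

def soft_attention(a, h_prev, W_a, W_h, v):
    """Soft attention: scores (Eq. 15), softmax weights (Eq. 16) and
    the context vector (Eq. 17, with Phi as a weighted average)."""
    e = np.tanh(a @ W_a.T + h_prev @ W_h.T) @ v  # one score per region
    e = e - e.max()                               # numerical stability
    alpha = np.exp(e) / np.exp(e).sum()           # weights sum to 1
    z = alpha @ a                                 # context vector in R^D
    return z, alpha

rng = np.random.default_rng(0)
a = rng.normal(size=(49, 512))      # |a| = 49 image feature vectors
h_prev = rng.normal(size=256)       # previous decoder hidden state
W_a = rng.normal(size=(64, 512))
W_h = rng.normal(size=(64, 256))
v = rng.normal(size=64)
z, alpha = soft_attention(a, h_prev, W_a, W_h, v)
```

Because the weights α_t,i form a probability distribution over regions, the context vector stays in the same space R^D as the individual feature vectors.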
With the context vector thus obtained, the equations for the gates of the LSTM decoder become:

i_t = σ(W_i x_t + R_i h_(t-1) + Z_i z_t + b_i)    (18)
f_t = σ(W_f x_t + R_f h_(t-1) + Z_f z_t + b_f)    (19)
o_t = σ(W_o x_t + R_o h_(t-1) + Z_o z_t + b_o)    (20)
g_t = tanh(W_c x_t + R_c h_(t-1) + Z_c z_t + b_c)    (21)
c_t = f_t ⊙ c_(t-1) + i_t ⊙ g_t    (22)
h_t = o_t ⊙ tanh(c_t)    (23)

where W_i and R_i, W_f and R_f, W_o and R_o, and W_c and R_c are the (input, recurrent) weight matrix pairs for the input, forget, output and input modulator (tanh) gates, respectively, the Z matrices project the context vector z_t into each gate, b denotes the bias vectors and σ is the sigmoid function.
IV. RESULTS AND DISCUSSION
We have evaluated the performance using the BLEU, METEOR, CIDEr, ROUGE-L and SPICE metrics recommended in the MSCOCO Image Caption Evaluation task [42]. The evaluation results are provided in Tables I and II for the 'CNN+LSTM' and 'CNN+LSTM+Attention' methods, respectively. In addition, we have provided some examples of generated captions in Tables III and IV for both methods. We have used the Flickr8k [30] dataset, which contains around 8000 images with 5 reference captions each. Of these, around 1000 images are earmarked for the validation set, around 1000 for the test set and the remaining for the training set.
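To make the metrics concrete, the sketch below implements an illustrative BLEU-1 (clipped unigram precision with a brevity penalty) in pure Python. This is only the simplest member of the family; the scores reported in this work come from the MSCOCO evaluation toolkit, which also computes BLEU-2 to BLEU-4, METEOR, CIDEr, ROUGE-L and SPICE:

```python
import math
from collections import Counter

def bleu1(candidate, references):
    """Illustrative BLEU-1: clipped unigram precision times the
    brevity penalty, computed against multiple references."""
    cand = candidate.split()
    # For each word, the maximum count over all references (clipping).
    max_counts = Counter()
    for ref in references:
        for w, n in Counter(ref.split()).items():
            max_counts[w] = max(max_counts[w], n)
    clipped = sum(min(n, max_counts[w]) for w, n in Counter(cand).items())
    precision = clipped / len(cand)
    # Brevity penalty against the closest reference length.
    ref_len = min((len(r.split()) for r in references),
                  key=lambda l: (abs(l - len(cand)), l))
    bp = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return bp * precision

score = bleu1("a dog runs on the grass",
              ["a dog is running on the grass", "the dog runs on grass"])
```

Higher-order BLEU applies the same clipped precision to bigrams through 4-grams and combines them geometrically, which is why short, generic captions cannot score well on all orders at once.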
We can make the following observations from the results: • The caption generation performance varies considerably with the choice of CNN encoder; for example, there is a variation of around 4 to 5 points in the evaluation metrics between the best and worst performing models in both Tables I and II.
• In addition, the performance of a decoder framework which employs additional methods of guidance (such as attention) but uses a lower performing encoder can be worse than that of simpler methods which use a better performing CNN encoder. For example, the best performing models using the CNN+LSTM method (Table I) outperform the lower performing models using the CNN+LSTM+Attention method (Table II).
• Although different variants of the same model family (ResNet, DenseNet and VGG) differ greatly in their number of parameters, they produce image captioning performances which differ by only around 1 point on most evaluation metrics. ResNet18, the smallest model in terms of the number of parameters among the ResNet based CNNs, performs competitively compared to the larger ResNet variants, which have many times more parameters. We also observe that DenseNet121 and VGG-11, the smallest models among the DenseNet and VGG models respectively, outperform the other DenseNet and VGG based CNNs on certain metrics.
• The different variants of the ResNet [13], VGG [14] and DenseNet [15] architectures also differ greatly in terms of Top-5 error on the Object Detection task when evaluated on the ImageNet dataset. However, that difference does not translate into a similar difference in performance on the image captioning task.
• For each image, most models generate reasonable captions, but there is great variation in the caption sentences generated by different models. In some cases, captions generated by different models describe different portions of the image, and sometimes a model focuses on a certain object in the image instead of providing a general overview of the scene.
• In some cases, models fail to recognize certain objects in the image. In particular, we have observed many cases of incorrect gender identification, which points to a possible statistical bias in the dataset towards a particular gender in certain contexts.

Thus we can conclude that the choice of CNN for the encoder significantly influences the performance of the model. In addition to the general observations, we can deduce the following specific observation about the choice of CNN: • The ResNet [13] and DenseNet [15] CNN architectures are well suited to image caption generation and produce better results while having lower model complexity than the other architectures.

V. CONCLUSION
In this work, we have evaluated encoder-decoder and attention based caption generation frameworks with different choices of CNN encoders and observed that there is wide variation both in the scores, as evaluated with commonly used metrics (BLEU, METEOR, CIDEr, SPICE, ROUGE-L), and in the generated captions when using different CNN encoders. On most metrics, there is a difference in performance of around 4-5 points between the worst and best performing models. Hence, the choice of CNN architecture plays a big role in the image caption generation process. In particular, ResNet and DenseNet based CNN architectures lead to better overall performance while at the same time using fewer parameters than other models. Also, since there is great variation in the captions generated for each image, it may be possible to use an ensemble of models, each of which uses a different CNN as the encoder, to increase the diversity of the generated captions; such ensembling would likely also improve performance. Model ensembling has been used in the literature, such as in [6], but those ensembles combine similar models trained with different hyperparameters. Using ensembles of models with different CNN encoders is an area which could be explored in future work.
Furthermore, we hope that this analysis of the effect of the choice of CNN for image captioning will aid researchers in better selecting CNN architectures to be used as encoders for image feature extraction in Image Caption Generation.