A Hybridized Deep Learning Method for Bengali Image Captioning

—An omnipresent and challenging research topic in computer vision is the generation of captions from an input image. Numerous experiments have previously been conducted on image captioning in English, but caption generation from images in Bengali is still sparse and in need of further refinement; only a few papers to date have addressed image captioning in Bengali. Hence, we proffer a standard strategy for Bengali image caption generation on two different sizes of the Flickr8k dataset and on the BanglaLekha dataset, the only publicly available Bengali dataset for image captioning. The Bengali captions produced by our model were compared with Bengali captions generated by other researchers using different architectures. Additionally, we employed a hybrid approach based on InceptionResnetV2 or Xception as the Convolutional Neural Network and a Bidirectional Long Short-Term Memory or Bidirectional Gated Recurrent Unit on the two Bengali datasets. Furthermore, different combinations of word embeddings were also adopted. Lastly, performance was evaluated using the Bilingual Evaluation Understudy (BLEU) score, which showed that the proposed model indeed performs better on the Bengali dataset consisting of 4000 images and on the BanglaLekha dataset.


I. INTRODUCTION
An image is worth a thousand stories. It is effortless for humans to describe these stories, but it is troublesome for a machine to portray them. Obtaining captions from images requires combining computer vision and natural language processing. Previously, a great deal of research has been done on image captioning, but most of it was conducted in English. Research on image captioning in other languages [13], [15], [16] is still limited, and only a few works until now have been conducted on image captioning in Bengali [5], [23], [37], so we aim to explore image captioning in the Bengali language further.
About 215 million people worldwide speak Bengali, among whom 196 million are natives of India and Bangladesh, making Bengali the 7th most spoken language worldwide (https://www.vistawide.com/languages/top_30_languages.htm). As a result, it is momentous to generate image captions in Bengali alongside English, especially since most natives have no knowledge of English. Additionally, image captioning can aid blind people: the generated text can be converted into speech so that they can understand the image. Also, surveillance footage can be captioned in real time so that theft, crime, or accidents can be detected faster.
The main issue with image captioning in the Bengali language is the availability of a dataset, as most available datasets are in English. English datasets can be translated using manual labor or machine translation. Although manual translations have higher accuracy, producing them is extremely monotonous and troublesome; machine translation provides a more practical solution. In our experiment, we used a machine translator, Google Translator, to translate English captions to Bengali and manually corrected those sentences that were syntactically incorrect. Furthermore, we also utilized the BanglaLekha dataset, the only publicly available Bengali dataset for image captioning to date; all its captions are human annotated in Bengali. We employed two approaches to captioning images in Bengali. Firstly, a hybrid model was used, as demonstrated in Fig. 1, in which two embedding layers were concatenated: one was GloVe [22], which utilized a file pre-trained in Bengali, and the other was fastText [7], which was trained on the available vocabulary. Secondly, two different models were trained with a single embedding each: one with only a trainable fastText embedding and the other with a GloVe embedding pre-trained in Bengali. In all three cases, InceptionResnetV2 [28] or Xception [38] was used as the Convolutional Neural Network (CNN) to detect objects in the images.
In this work, we propose a hybridized Deep Learning method for image captioning, achieved by concatenating two word embeddings. The contributions of this paper are as follows:
• We introduced a hybridized method of image captioning in which two word embeddings, pre-trained GloVe and fastText, were concatenated.
• Experiments were carried out on both of our models using Bidirectional Long Short-Term Memory (BiLSTM) and Bidirectional Gated Recurrent Unit (BiGRU). To our knowledge, BiGRU has not previously been used for image captioning in languages other than English.
• Moreover, these two models were tested on two Flickr8k-derived datasets of different sizes, one containing 4000 images and the other 8000 images. To the best of our knowledge, no prior paper has used the full Flickr8k dataset translated into Bengali for image captioning.
• Additionally, our model was also tested on the BanglaLekha dataset which contains 9154 images.
• Lastly, it was shown that our proposed hybrid model achieves higher BLEU scores on both the Flickr4k-BN and BanglaLekha datasets.

II. RELATED WORK
This section depicts the progress in image captioning. Hitherto, many kinds of research have been conducted and many models have been developed to generate captions that are syntactically correct. The authors in [2] presented a model that estimates the probability distribution of the next word using the previous word and image features. On the other hand, H. Dong et al. [6] proposed a new training method, Image-Text-Image, which amalgamates text-to-image and image-to-text synthesis to revamp the performance of text-to-image synthesis. Furthermore, J. Aneja [21] and S. J. Rennie [25] adapted the attention mechanism to generate captions. For the vision part of image captioning, VGG-16 was used as the CNN by most of the papers [2], [11], [24], [25], [27], [30], but some also used YOLO [9], InceptionV3 [6], [31], AlexNet [24], [30], ResNet [11], [18], [24] or U-Net [4] for feature extraction. Concurrently, LSTM [6], [9], [11], [17], [31] was used by most of the papers for generating the next word in the sequence, although some researchers also utilized RNN [19] or BiLSTM [4], [30]. Moreover, P. Blandfort et al. [32] systematically characterized diverse image captions that appear "in the wild" in order to understand how people caption images naturally. Alongside English, researchers have also generated captions in Chinese [15], [16], Japanese [1], Arabic [12], Bahasa Indonesia [13], Hindi [26], German [29] and Bengali [5], [23]. M. Rahman et al. [23] generated image captions in Bengali for the first time, followed by T. Deb et al. [5]. The researchers of [23] used VGG-16 to extract image features together with stacked LSTMs, whereas the researchers of [5] generated captions using InceptionResnetV2 or VGG-16 and LSTM, utilizing 4000 images of the Flickr8k dataset. We modified the merge model adapted by [5] to obtain better and more fluent captions in Bengali.
Only three works on image captioning in Bengali exist to date: [23] was the first, followed by [5] and [37]. Rahman et al. [23] aimed to outline an automatic image captioning system in Bengali called 'Chittron'. Their model was trained to predict a Bengali caption from an input image one word at a time, and the training process was carried out on 15700 images of their own dataset, BanglaLekha. In their model, the image feature vector and the word vectors produced by the embedding layer were fed to a stacked LSTM layer. One drawback of their work was that they utilized the sentence BLEU score instead of the corpus BLEU score. On the other hand, Deb et al. [5] generated captions using InceptionResnetV2 or VGG-16 combined with LSTM on 4000 images of the Flickr8k dataset. To overcome the above-mentioned drawbacks and obtain fluent captions, we conducted our experiment using a hybridized approach. Moreover, we used the full 8000 images of the Flickr8k dataset alongside the Flickr4k dataset. We further validated the performance of our model using the human-annotated BanglaLekha dataset.

III. OUR APPROACH
We employed an Encoder-Decoder approach in which InceptionResnetV2 and Xception were used separately in different experimental setups to encode images into feature vectors, and different word embeddings were used to convert the vocabulary into word vectors. Image feature vectors and word vectors, after passing through a special kind of RNN, were merged and passed to a decoder to predict captions word by word; this process is illustrated in Fig. 2. We propose a hybrid model that consists of two embedding layers, unlike the merge model [5]. We also conducted experiments on the merge model with either a pre-trained GloVe [22] or a trainable fastText [7] embedding. To be more precise, we trained the merge model using three settings, as shown in Fig. 1. Our proposed hybrid model is shown in Fig. 3; it consists of two parts, an encoder and a decoder.

• Encoder
The encoder comprises two parts: one for handling image features and another for handling word sequence pairs. Firstly, image features were extracted using InceptionResnetV2 [28] or Xception [38]. These image features were passed through a dropout layer, followed by a fully connected layer and another dropout layer. The fully connected layer reduces the dimension of the image feature vector from 1536 or 2048 to 256 to match the dimension of the word prediction output. Secondly, input word sequence pairs were fed to two embedding layers: a pre-trained GloVe embedding and a fastText embedding that was not pre-trained. Both embeddings convert words to vectors of dimension 100. The vectors from the two embeddings were then passed through separate dropout layers, each followed by either a BiLSTM or a BiGRU of dimension 128. To match the dimension of the visual feature vector output, these vectors were passed through an additional fully connected layer of dimension 256. The two outputs were then concatenated, and this concatenated output was in turn concatenated with the visual part of the encoder and forwarded to the decoder.
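As a concrete illustration, the encoder described above can be sketched in Keras. This is a minimal sketch under stated assumptions: `vocab_size`, `max_len` and the 1536-d feature size are illustrative placeholders, the rate of the second image-branch dropout is not specified in the text so 0.0 is reused, and BiLSTM stands in for the BiGRU variant.

```python
from tensorflow.keras.layers import (Input, Dense, Dropout, Embedding,
                                     Bidirectional, LSTM, Concatenate)
from tensorflow.keras.models import Model

vocab_size, max_len, feat_dim = 5000, 23, 1536  # placeholders; 1536 = InceptionResnetV2

# Visual branch: dropout -> fully connected (256, ELU) -> dropout
img_in = Input(shape=(feat_dim,))
x = Dropout(0.0)(img_in)                 # 0.0 for the feature extractor (Sec. IV-E)
x = Dense(256, activation='elu')(x)
img_out = Dropout(0.0)(x)                # second dropout rate assumed

# Word branch: two embeddings (pre-trained GloVe + trainable fastText), 100-d each
seq_in = Input(shape=(max_len,))
glove = Embedding(vocab_size, 100, trainable=False)(seq_in)  # weights loaded in practice
fast = Embedding(vocab_size, 100, trainable=True)(seq_in)
g = Dense(256, activation='relu')(Bidirectional(LSTM(128))(Dropout(0.3)(glove)))
f = Dense(256, activation='relu')(Bidirectional(LSTM(128))(Dropout(0.3)(fast)))
seq_out = Concatenate()([g, f])

# Concatenate the visual and textual parts for the decoder
encoder_out = Concatenate()([img_out, seq_out])
encoder = Model([img_in, seq_in], encoder_out)
```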

• Decoder
The decoder is a feed-forward network ending in a Softmax layer. It takes the concatenated output of the encoder as input, which is first passed through a fully connected layer of 256 dimensions followed by a dropout layer. Finally, the probabilistic Softmax function outputs the next word in the sequence by greedily selecting the word with maximum probability.
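A minimal Keras sketch of this decoder follows, assuming the encoder's concatenated output is 768-dimensional (a placeholder; the true size depends on the encoder configuration) and a hypothetical vocabulary of 5000 words.

```python
import numpy as np
from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras.models import Model

vocab_size = 5000   # illustrative placeholder
enc_dim = 768       # assumed size of the encoder's concatenated output

dec_in = Input(shape=(enc_dim,))
x = Dense(256, activation='elu')(dec_in)   # ELU per the hyperparameter section
x = Dropout(0.5)(x)                        # decoder dropout of 0.5
dec_out = Dense(vocab_size, activation='softmax')(x)  # next-word distribution
decoder = Model(dec_in, dec_out)

probs = decoder.predict(np.random.rand(2, enc_dim).astype('float32'))
```

Each row of `probs` is a distribution over the vocabulary; greedy decoding picks `argmax` of that row as the next word.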

IV. EXPERIMENTAL SETUP
This section narrates the total strategy adapted to obtain captions from images. Also, different tuning techniques availed are described here.

A. Dataset Processing
The Flickr8k dataset has 8091 images, of which 6000 (75%) are employed for training, 1000 (12.5%) for validation and 1000 (12.5%) for testing. Moreover, each image of the Flickr8k dataset is paired with five ground-truth captions describing it, which adds up to a total of 40455 captions for the 8091 images. For image captioning in Bengali, those 40455 captions were converted to the Bengali language using Google Translator. Unfortunately, some of the translated captions were syntactically incorrect, so we manually checked all 40455 translated captions and corrected them. We utilized these 8000 images as well as a selection of 4000 images, as done by Deb et al. [5], in Bengali (Flickr8k-BN and Flickr4k-BN). These 4000 images were selected based on the frequency of words in the 40455 captions: using POS taggers, the most frequent Bengali nouns were identified from the ground-truth captions. The most frequent words in the Bengali Flickr8k dataset are shown in Fig. 4 for Bengali and English respectively. The 4000 images corresponding to these words were selected to form the smaller Flickr4k-BN dataset.
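The frequency-based selection can be sketched in plain Python as follows. This simplified version counts raw tokens rather than POS-tagged nouns; `select_images`, `top_words` and the scoring rule are illustrative assumptions, not the authors' exact procedure.

```python
from collections import Counter

def select_images(captions_by_image, k=4000, top_words=50):
    """Pick the k images whose captions best cover the most frequent words.
    captions_by_image: {image_id: [caption strings]}."""
    # Count every token across all captions (a POS tagger would filter nouns here)
    counts = Counter(w for caps in captions_by_image.values()
                     for c in caps for w in c.split())
    frequent = {w for w, _ in counts.most_common(top_words)}
    # Score each image by how many frequent-word occurrences its captions contain
    scored = sorted(captions_by_image,
                    key=lambda img: -sum(w in frequent
                                         for c in captions_by_image[img]
                                         for w in c.split()))
    return scored[:k]
```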
We also utilized the BanglaLekha dataset, which consists of 9154 images; it is the only publicly available Bengali dataset till now, and all its captions are human annotated. One problem with this dataset is that it has only two captions associated with each image, resulting in 18308 captions for those 9154 images; hence its vocabulary size is lower than that of Flickr4k-BN and Flickr8k-BN. Flickr8k-BN consists of 12953 unique Bengali words, Flickr4k-BN of 6420, and BanglaLekha of 5270, so the BanglaLekha dataset has a vocabulary size even lower than Flickr4k-BN. Hence, we employed the Flickr8k-BN dataset alongside the Flickr4k-BN and BanglaLekha datasets. The split ratios of all three datasets for training, testing and validation are shown in Table I.

B. Image Feature Extraction
One essential part of image captioning is extracting features from the given images. This task is achieved using Convolutional Neural Network architectures, which detect objects in images and can be trained on a large number of images to extract image features. Such training requires an enormous number of images and a great deal of time. Due to the shortage of images, we utilized Convolutional Neural Network architectures pre-trained on more than a million images from the ImageNet [33] dataset, namely InceptionResnetV2 [28] and Xception [38]. These two pre-trained architectures were used separately in different experimental setups; the reason for choosing them is that they can achieve higher accuracy at lower epochs. The last layer, used for prediction, of the pre-trained InceptionResnetV2 model was pulled out, and the last two layers of the pre-trained Xception model were pulled out. The average pooling layer was then used to extract image features as a feature vector of 1536 dimensions for InceptionResnetV2 and 2048 dimensions for Xception. All images were given an input shape of 299x299x3 before entering the models, where 3 represents the three color channels R, G and B.
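Feature extraction with InceptionResnetV2 can be sketched in Keras as below; `weights=None` is used here only so the sketch runs without downloading the ImageNet weights, which would be `weights='imagenet'` in practice, and the input image is a random stand-in.

```python
import numpy as np
from tensorflow.keras.applications.inception_resnet_v2 import (
    InceptionResNetV2, preprocess_input)

# include_top=False drops the 1000-way prediction layer; pooling='avg'
# applies the average pooling that yields a 1536-d feature vector.
extractor = InceptionResNetV2(weights=None, include_top=False, pooling='avg')

img = np.random.rand(1, 299, 299, 3).astype('float32') * 255.0  # stand-in image
feats = extractor.predict(preprocess_input(img))
print(feats.shape)  # (1, 1536)
```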

C. Embeddings
Handling word sequences requires word embedding that can convert words to vectors before passing them to special recurrent neural networks (RNN). In our model GloVe [22] and fastText [7] have been used as an embedding.
• GloVe is a model for distributed word representation.
The model employs an unsupervised learning algorithm for acquiring vector representations for words. This is achieved by mapping words into a meaningful space where the distance between words is related to semantic similarity.
• fastText is a library for learning word embeddings and text classification created by Facebook's AI Research (FAIR) lab. The model employs unsupervised or supervised learning algorithms to obtain vector representations for words. fastText offers two models for computing word representations, namely skip-gram and cbow. The skip-gram model learns to forecast a target word from nearby words; conversely, the cbow model forecasts the target word from its context, where the context is a bag of the words contained in a fixed-size window around the target word.
Both GloVe and fastText have pre-trained word vectors that are trained over a large vocabulary, and both embeddings can also be trained from scratch. In the hybrid model shown in Fig. 3, two embeddings have been used: GloVe and fastText. There, GloVe was pre-trained while fastText was trained on the vocabulary available in the dataset. A trainable fastText instead of a pre-trained one was used to enrich the vocabulary with words from the Flickr8k and BanglaLekha datasets; moreover, results for pre-trained fastText have already been demonstrated by Deb et al. [5]. Combining two embeddings leads to redundancy of words, but it yields fluent captions in Bengali as the vocabulary size increases. In contrast, using pre-trained files for both GloVe and fastText in the hybrid model would produce much greater redundancy, and the effective vocabulary would shrink because it would not contain the unique words of the dataset.
Two other models were trained alongside the hybrid model. Unlike the hybrid model, these two models had a single embedding, either a trainable fastText embedding or a pre-trained GloVe embedding. The GloVe file "bn_glove.39M.100d", pre-trained in the Bengali language, was used for the Bengali datasets.

D. Word Sequence Generation
Flickr8k dataset has five captions associated with each image and BanglaLekha has two captions associated with each image. One of the difficult tasks of image captioning is to make the model learn how to generate these sentences. Two different types of special Recurrent Neural Network (RNN) were used to train the model to generate the next word in the sequence of a caption. The input and output sizes were fixed to the maximum length of the sentence present in the dataset.
In the case of Flickr4k-BN and Flickr8k-BN, the maximum length was 23. On the other hand, two different maximum sequence lengths, 40 and 26, were used for the BanglaLekha dataset; reducing the maximum sequence length significantly increased the evaluation scores. During training, if any sentence was shorter than the maximum length, zero-padding was applied to bring it up to the fixed length. Additionally, extra start and end tokens were added to each sequence pair for identification during training. During training, the image feature vector and the previous words, converted to vectors by the embedding layer, were used to generate the next word in the sequence through the probabilistic Softmax with the help of different types of RNNs. Fig. 5 illustrates the input and output pair. Due to the limitation of the basic Recurrent Neural Network (RNN) [34] in retaining long-term memory, a better approach was taken by Deb et al. [5], which uses Long Short-Term Memory (LSTM). However, LSTM [10] only preserves preceding words, whereas proper sentence generation also requires succeeding words. As a result, our model uses Bidirectional Long Short-Term Memory (BiLSTM) and Bidirectional Gated Recurrent Unit (BiGRU), which are illustrated in Fig. 6. Their output is

ŷ^{<t>} = g(W_y [a⃗^{<t>}; a⃖^{<t>}] + b_y),

where ŷ^{<t>} is the output at time t when the activation function g is applied to the recurrent component's weight W_y and bias b_y together with the forward activation a⃗^{<t>} and backward activation a⃖^{<t>} at time t.
• GRU is a special type of RNN. The reset and update gates of a GRU help to solve the vanishing gradient problem of the basic RNN. The update gate decides how much information from the previous units must be forwarded and is computed as

z_t = σ(W_z · [h_{t-1}, x_t]),

where z_t is the update gate output at the current timestamp, W_z is the weight matrix of the update gate, h_{t-1} is the information from previous units, and x_t is the input at the current unit.
The reset gate determines how much information from the previous units to forget and is computed as

r_t = σ(W_r · [h_{t-1}, x_t]),

where r_t is the reset gate output at the current timestamp, W_r is the weight matrix of the reset gate, h_{t-1} is the information from previous units, and x_t is the input at the current unit. The current memory content stores the relevant information from the previous units and is calculated as

h̃_t = tanh(W · [r_t ⊙ h_{t-1}, x_t]),

where h̃_t is the current memory content, W is the weight of the current unit, r_t is the reset gate output at the current timestamp, h_{t-1} is the information from previous units, and x_t is the input at the current unit.
The final memory at the current unit is a vector that stores the final information for the current unit and passes it to the next layer:

h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t,

where h_t is the final memory at the current unit, z_t is the update gate output at the current timestamp, h_{t-1} is the information from previous units, and h̃_t is the current memory content.
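The GRU gate computations above can be sketched as a single step in NumPy; biases are omitted for brevity, and the weight shapes are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, x_t, Wz, Wr, W):
    """One GRU step: update gate, reset gate, candidate memory, final memory.
    Each weight matrix multiplies the concatenation [h_prev, x_t]."""
    hx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(Wz @ hx)                                       # update gate
    r_t = sigmoid(Wr @ hx)                                       # reset gate
    h_tilde = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))   # candidate memory
    return (1.0 - z_t) * h_prev + z_t * h_tilde                  # final memory
```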
• LSTM is another special type of RNN. Unlike the GRU, the LSTM has three gates, namely the forget gate, the input (update) gate and the output gate:

i_t = σ(W_i · [h_{t-1}, x_t] + b_i),
f_t = σ(W_f · [h_{t-1}, x_t] + b_f),
o_t = σ(W_o · [h_{t-1}, x_t] + b_o),

where i_t represents the input gate, f_t the forget gate, o_t the output gate, σ the sigmoid function, W_x the weights of the respective gate (x), h_{t-1} the output of the previous LSTM block at timestamp t−1, x_t the input at the current timestamp and b_x the biases of the respective gates (x).
The input gate decides what new information is going to be stored in the cell state, the forget gate determines what information to throw away from the cell state, and the output gate provides the output at timestamp t. The cell state, candidate cell state and final output are computed as

c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c),
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t,
h_t = o_t ⊙ tanh(c_t),

where c_t represents the cell state at timestamp t and c̃_t the candidate cell state at timestamp t. The candidate must be generated first to obtain the memory vector c_t for the current timestamp; the cell state is then passed through the tanh activation and gated by o_t to produce h_t. Finally, h_t is passed through a Softmax layer to obtain the output y_t.
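The zero-padding and start/end tokens described in this subsection can be sketched as a helper that expands one caption into (input, next-word) training pairs; `make_pairs` is an illustrative helper, and post-padding is used here for readability (Keras `pad_sequences` pads at the front by default).

```python
def make_pairs(caption_ids, max_len, start_id, end_id, pad_id=0):
    """Turn one caption (a list of word ids) into (input, target) pairs:
    for each position, the zero-padded prefix predicts the next word."""
    seq = [start_id] + caption_ids + [end_id]   # add start and end tokens
    pairs = []
    for i in range(1, len(seq)):
        prefix = seq[:i]
        padded = prefix + [pad_id] * (max_len - len(prefix))  # zero-padding
        pairs.append((padded, seq[i]))
    return pairs
```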

E. Hyperparameter Selection
One major problem in machine learning is overfitting: overfit models have high variance and cannot generalize well, which is a serious issue for image captioning. We observed the performance of our model and noticed that it suffered from overfitting rather than underfitting. To minimize this problem, some hyperparameter tuning was adopted in our model. Firstly, different dropout [35] values were used for the sequence model, the image features and the decoder, since dropout helps prevent overfitting: a dropout value of 0.0 was used for the feature extractor, 0.3 for the sequence model, and 0.5 for the decoder. Secondly, different activation functions were employed for the different fully connected layers: the ELU [3] activation function for the feature extractor model and the decoder, and the ReLU [36] activation function for the sequence model. Thirdly, we employed external validation to provide an unbiased evaluation, and ModelCheckpoint was used to save the models with minimum validation loss; ReduceLROnPlateau was additionally used for the models that had Xception as the CNN. Moreover, the Adam optimizer [14] was utilized and the models were trained for 50 and 100 epochs with learning rates of 0.0001 and 0.00001. A short summary of the hyperparameters adopted in the different models is shown in Table II, and the loss plots for the BanglaLekha and Flickr8k-BN datasets are shown in Fig. 7 and Fig. 8, respectively; these plots show that the models converge towards epoch 100. Another important factor that improved the results was the maximum sentence length: in BanglaLekha only a few sentences had lengths greater than 26, so we set the maximum sentence length for this dataset to 26, which greatly enhanced the evaluation scores.
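The training configuration described above can be sketched in Keras as follows; the ReduceLROnPlateau `factor` and `patience` values are assumptions, as the text does not specify them, and the filename is a placeholder.

```python
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau
from tensorflow.keras.optimizers import Adam

optimizer = Adam(learning_rate=1e-4)  # 1e-5 was used in some runs

callbacks = [
    # Keep only the weights with the lowest validation loss.
    ModelCheckpoint('best_model.h5', monitor='val_loss', save_best_only=True),
    # Added only for the Xception-based models; factor/patience are assumed values.
    ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3),
]

# model.compile(optimizer=optimizer, loss='categorical_crossentropy')
# model.fit(train_data, epochs=100, validation_data=val_data, callbacks=callbacks)
```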

V. ANALYSIS
We implemented the algorithm using Keras 2.3.1 and Python 3.8.1 and ran our experiments on an RTX 2060 GPU. Our code and the Bengali Flickr8k dataset are available on GitHub. We translated the Flickr8k dataset to Bengali using Google Translator, as done by [16]. The Bilingual Evaluation Understudy (BLEU) [20] score was used to evaluate the performance of our models, as it is the most widely used metric for evaluating the quality of generated text: it depicts how close generated sentences are to human-generated sentences and is broadly utilized to evaluate machine translation. Sentences are compared based on modified n-gram precision:

P(i) = Matched(i) / H(i),

where P(i) is the precision for each i-gram, i = 1, 2, ..., N, i.e. the percentage of i-gram tuples in the hypothesis that also occur in the references, and H(i) is the number of i-gram tuples in the hypothesis. Matched(i) is computed as

Matched(i) = Σ_{t_i} min(C_h(t_i), max_j C_{hj}(t_i)),

where t_i is an i-gram tuple in hypothesis h, C_h(t_i) is the number of times t_i occurs in the hypothesis, and C_{hj}(t_i) is the number of times t_i occurs in reference j of this hypothesis.
The brevity penalty, which penalizes short translations, is

ρ = exp(min(0, (n − L)/n)),

where n is the length of the hypothesis and L is the length of the reference. Finally, the BLEU score is computed as

BLEU = ρ · (Π_{i=1}^{N} P(i))^{1/N}.

Two different search types, greedy search and beam search, were used to compute these BLEU scores. In greedy search, the word with maximum probability is chosen as the next word in the sequence. Beam search, on the other hand, considers n candidate words for the next word in the sequence, where n is the beam width; for our experiments we considered beam widths of 3 and 5. We computed 1-gram BLEU (BLEU-1), 2-gram BLEU (BLEU-2), 3-gram BLEU (BLEU-3) and 4-gram BLEU (BLEU-4) for the various architectures, as illustrated in Table III, Table IV and Table V. The performance of the proposed hybrid architecture and of the single-embedding GloVe or fastText models on the Flickr4k-BN dataset, consisting of 4000 images, is demonstrated in Table III. From Table III it can be stated that the hybrid model performed better with both BiLSTM and BiGRU on the Bengali dataset than with only the GloVe or only the fastText word embedding. Moreover, we obtained better BLEU scores than paper [5]. Greedy search was employed to compute these BLEU scores.
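The BLEU computation can be sketched in plain Python as below; this is a single-sentence variant with uniform n-gram weights, a simplification of the corpus-level score actually reported (in practice a library such as NLTK's `corpus_bleu` would be used).

```python
import math
from collections import Counter

def ngrams(tokens, i):
    return [tuple(tokens[j:j + i]) for j in range(len(tokens) - i + 1)]

def bleu(hypothesis, references, N=4):
    """Single-sentence BLEU: clipped (modified) i-gram precisions combined
    by a geometric mean, times a brevity penalty for short hypotheses."""
    precisions = []
    for i in range(1, N + 1):
        hyp_counts = Counter(ngrams(hypothesis, i))
        # Clip each i-gram count by its maximum count over all references
        matched = sum(min(c, max(Counter(ngrams(r, i))[t] for r in references))
                      for t, c in hyp_counts.items())
        precisions.append(matched / max(sum(hyp_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0
    n, L = len(hypothesis), min(len(r) for r in references)
    rho = 1.0 if n > L else math.exp(1.0 - L / max(n, 1))  # brevity penalty
    return rho * math.exp(sum(math.log(p) for p in precisions) / N)
```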
Consequently, the performance of the single-embedding GloVe or fastText models and of the hybrid architecture on the Flickr8k-BN dataset, consisting of 8000 images, and on the BanglaLekha dataset is displayed in Table IV and Table V, respectively. There, too, it can be observed that the proposed hybrid model performed better with both BiGRU and BiLSTM than the other models. The highest BLEU scores were obtained using BiLSTM on Flickr4k-BN and Flickr8k-BN; the captions generated by the hybrid model for both datasets are illustrated in Fig. 9. Furthermore, our proposed hybrid model also gave the highest BLEU scores on the BanglaLekha dataset for both BiLSTM and BiGRU, as shown in Table V, where it can be observed that Xception and the learning rate played a vital role in increasing the BLEU scores. These scores were even better than the BLEU scores obtained by paper [37].
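The beam-search decoding used for these scores can be sketched in plain Python; `step_fn` is a hypothetical callback standing in for the trained model's next-word distribution, and the width and maximum length defaults mirror the values used in the experiments.

```python
import heapq
import math

def beam_search(step_fn, start_token, end_token, beam_width=3, max_len=23):
    """Beam-search decoding sketch: step_fn(prefix) returns a list of
    (word, probability) candidates for the next word. At every step the
    beam_width highest log-probability partial captions are kept."""
    beams = [(0.0, [start_token])]  # (cumulative log-probability, tokens)
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            if seq[-1] == end_token:        # finished caption, keep as-is
                candidates.append((logp, seq))
                continue
            for word, p in step_fn(seq):
                candidates.append((logp + math.log(p), seq + [word]))
        beams = heapq.nlargest(beam_width, candidates, key=lambda b: b[0])
        if all(seq[-1] == end_token for _, seq in beams):
            break
    return max(beams, key=lambda b: b[0])[1]
```

With `beam_width=1` this reduces to the greedy search described above.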

VI. CONCLUSION
In this work, we exhibited an approach for automatically generating captions from an input image in Bengali. Firstly, a detailed description of how the Flickr8k dataset was translated into Bengali and split into datasets of two sizes was presented. Secondly, how image features were extracted and the different combinations of word embeddings utilized were also conferred. Moreover, the reasons for using a special kind of word sequence generator were elucidated, and the different parts of the proposed architecture were described. Finally, using the BLEU score it was confirmed that the proposed architecture performs better for both the Flickr4k-BN and BanglaLekha datasets. This validates the fact that image captioning in the Bengali language can be refined further. In the near future we will try to adopt visual attention and the transformer model for better feature extraction and more precise captions. Additionally, we aim to build our own dataset with five captions associated with each image, unlike the BanglaLekha dataset's two, to enrich the vocabulary of our dataset.