Content-based Image Retrieval using Encoder based RGB and Texture Feature Fusion

Abstract: The recent development of digital photography and the use of social media on smartphones have boosted the demand for querying images by their visual semantics. Content-Based Image Retrieval (CBIR) is a well-established research area in the domain of image and video data analysis. The major challenges of a CBIR system are (a) to derive the visual semantics of the query image and (b) to find all the similar images in the repository. The objective of this paper is to precisely define the visual semantics using hybrid feature vectors. In this paper, a CBIR system using encoder-based feature fusion is proposed. The CNN-encoded features of the RGB channels are fused with the encoded texture features of LBP, CSLBP, and LDP separately. The retrieval performance of the different fused features is tested using three public datasets, i.e., Corel-1K, Caltech, and 102flower. The results show that the class properties are better retained using the LDP with RGB encoded features, which helps to enhance the classification and retrieval performance for all three datasets. The average precision is 94.5% for Corel-1K, 89.7% for Caltech, and 88.7% for 102flower. The average f1-score is 89.5% for Caltech and 88.5% for 102flower. The improvement in the f1-score implies that the proposed fused feature is more stable in dealing with the class imbalance problem.


I. INTRODUCTION
Content-based image retrieval (CBIR) is the technique of retrieving similar images from a large image database using visual characteristics such as color, shape, structure, Zernike values, and histogram of the images [1,2]. Nowadays it is an essential requirement in various application areas such as video surveillance, medical image retrieval, crime detection, military surveillance, remote sensing applications, and the textile industry [3][4][5][6]. The efficiency of a CBIR system greatly depends upon the visual feature selection. The high-level semantic features [3,7] of an image, such as its color, shape, structure, Zernike values, and histogram, are used for manual image annotation and are less affected by noise [8,9]. Features represented using the spatial layout of the pixels within an image patch are referred to as low-level features or local descriptors [10][11][12]. Some of the popular low-level image descriptors used for image retrieval are Local Binary Patterns (LBP) [13][14][15], Orthogonal-Combination of Local Binary Patterns (OC-LBP), Center-Symmetric Local Binary Patterns (CS-LBP), Local Ternary Patterns (LTP), Local Directional Patterns (LDP) [15], and Scale-Invariant Feature Transform (SIFT) [16]. The performance of a single texture feature varies across datasets. The major limitation of the texture feature is that the texture image is mapped directly to its histogram [1,9,17,18,19], which is represented on a scale of 0 to 255, so the information learned from the image patches is not well preserved. The introduction of deep learning features has been a breakthrough in the field of computer vision and its applications. Deep learning uses convolutional neural network (CNN) features [14,20,21,22,23] as the image descriptor and requires an adequate number of images for training. The several layers of the CNN encoder represent the image features at different levels [11].
The lower layers contain the detailed image features, whereas the higher layers present the semantic information of the image [10,11]. The fully connected layer extracts discriminative image features using an order-less quantization approach. Finally, these features are mapped to the class label using the dimension reduction technique and soft-max pooling [10,24].
An effective feature extraction technique precisely describes the image contents. It also helps to maintain a distinctive signature for the images of different classes. In recent years, many researchers [3,8] have emphasized image retrieval using feature fusion to build a more powerful image descriptor [7,10,23,25,26,27]. However, these fused descriptors are more sensitive to noise and image resolution. Moreover, mapping the low-level image features to the high-level visual semantics is challenging [7,8,28,29]. Thus, there is a need to design an enhanced CBIR system.
In this work, a deep-learning feature fusion framework is proposed, where the auto-encoding features of the RGB channels are fused with the auto-encoding features of the texture image. Two different CNN models are trained independently. The first model uses the RGB channel data and learns the spatial image information using automatic encoding. The second model uses the texture image data for training to learn the auto-encoder-based texture features. The spatial and texture features extracted by the CNN encoders are fused to provide a more precise feature descriptor for the image. The texture feature of an image, i.e., the histogram of the texture image, is biased by the background image textures, which impedes the learning ability of the classifier [9,17]. Textures of similar images are expected to be alike. More effective learning is possible from the texture image set, as the CNN uses batch mode for training. For an extensive analysis of the proposed fusion framework, a CBIR system is developed. The encoding features of three different textures, namely LBP, CSLBP, and LDP, are fused individually with the RGB channel encoding feature. The classification and retrieval performance of the different fused features is presented and compared with the encoding features of the RGB channels alone. The model is tested on three different datasets: Corel-1K, Caltech, and 102flower. It is observed that the classification result of the LDP_RGB fusion outperforms the results of LBP_RGB, CSLBP_RGB, and RGB. Moreover, the proposed fusion features preserve more class-oriented properties, so the retrieval rate is enhanced.
The performance analysis for top-80 image retrieval using the proposed auto-encoder-based feature fusion and the auto-encoder-based RGB channel feature is shown in the results section. The retrieval rate using LDP_RGB fusion also surpasses all the other methods discussed.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 14, No. 3, 2023, www.ijacsa.thesai.org
The major contributions of this work are as follows:
• A deep-learning feature fusion framework is proposed, where the texture image and RGB channel image features are fused.
• The auto-encoding features of the texture and RGB channels are extracted by two different CNN models to preserve the low-variance pixel information of the texture image.
• The CBIR model is tested with three different textures, i.e., LBP, CSLBP, and LDP, each fused with the RGB channel encoding feature.
• The model is tested with three different datasets: Corel-1K, Caltech, and 102flower.
• The improvement in the f1-score implies that the proposed feature descriptor handles the class imbalance issue more precisely.
• The retrieval result is enhanced with the fusion of LDP and RGB encoded features.
The rest of the paper is organized as follows: Section II presents a review of feature fusion and CBIR systems. Section III describes the proposed feature fusion model, the CNN encoding architecture, and the performance evaluation metrics. Section IV presents the detailed results and discussion. Section V concludes the work.

II. RELATED WORK
Kayhan, N., et al. [1] built a weighted feature-based CBIR system using modified local binary patterns (MLBP), local neighbourhood difference patterns (LNDP), a filtered gray-level co-occurrence matrix (GLCM), and quantized color histogram features. Khan, U. A., et al. [2] used a hybrid classification model with three color moments and Haar, Daubechies, and bi-orthogonal wavelet features. They used a genetic algorithm (GA) with SVM classification, and the L2 norm as the similarity measure. Kashif, M., et al. [3] proposed a hybrid image descriptor using local ternary patterns, local phase quantization, and the discrete wavelet transform. They used joint mutual information (JMI) based feature selection to derive the optimal features for effective image retrieval. Carvalho, E. D., et al. [4] proposed a histopathological breast image classification model using phylogenetic diversity indexes, which they also used to rank the images. The authors claim that it outperforms XGBoost, random forest, and support vector machine. Choe, J., et al. [5] proposed a medical image retrieval model for interstitial lung disease diagnosis using the deep learning features of CT images. Pradhan, J., et al. [7] proposed a regions-of-attention-based feature fusion technique for image retrieval, in which multi-directional texture features are combined with spatial correlation-based color features to derive the image semantics. Pathak, D., et al. [9] proposed a retrieval system that concatenates deep learning GoogleNet features with the hue, saturation, and intensity features of the HSI image and the histogram of oriented gradients (HOG) feature of the RGB image. The authors claim that this technique reduces the information loss due to image resizing. A. Latif, et al. [10] presented a comprehensive review of recent developments and state-of-the-art CBIR systems.
The study explored the major concepts of CBIR, such as image representation, image retrieval, low-level feature extraction, and recently used semantic deep-learning approaches; it also includes future research directions in CBIR. M. Sotoodeh, et al. [17] presented a local texture descriptor referred to as Color Radial Mean Local Binary Pattern (CRMLBP).
The CRMLBP is computed for the sign-difference, magnitude-difference, and central gray value patterns in the RGB color space, and their histograms are concatenated. The feature weights are optimized using the Particle Swarm Optimization (PSO) technique. The performance of this feature vector is tested with various datasets such as Wang, Holidays, and Corel. Sampathila, N., et al. [18] presented a brain MRI image retrieval method using grey-level co-occurrence-based Haralick's features and a histogram-based cumulative distribution function (CDF). The KNN approach is used to find the distance between the query image and the other images. Khan, M. A., et al. [19] proposed an intelligent human action recognition system using the fusion of handcrafted and deep convolutional neural network features, in which the histogram of oriented gradients (HoG) and deep features are fused. A multi-class support vector machine (M-SVM) is used for the classification. Ma, W., et al. [22] suggested a cloud-based privacy-preserving image retrieval service using deep convolutional features extracted from the encrypted image. A hybrid encryption method is adopted for image encryption. Wang, S. H., et al. [23] suggested a deep feature fusion technique using graph convolutional network and convolutional neural network features for Covid-19 classification, and used CT images to test the model performance. L. T. Alemu, et al. [25] proposed a multi-feature fusion-based CBIR system, where various handcrafted features are combined with deep NN features and a membership score is applied based on their probabilistic distribution. An incremental nearest neighbour (NN) selection is then used to implement k-NN for dynamic query selection. Wang, W., et al. [26] presented a two-stage CBIR model using the fusion of global and local features.
The authors use sparse coding for the sparse representation of the local features, followed by feature pooling, and the Euclidean distance measure is used to find the similarity between the sparse feature vectors. Bella, M. I. T., et al. [28] proposed an image retrieval system using an information fusion technique, where the GLCM and HSV color moment features are fused; the model is tested with the Corel-1K, Corel-5K, and Corel-10K datasets.
Table I presents a survey of the different feature fusion techniques used for image classification and retrieval. However, there is scope to define a better image descriptor by combining the strength of the texture feature with the deep CNN feature. In this work, we intend to define a more precise feature vector by combining the CNN-encoded texture feature with the encoded RGB channel feature.

III. PROPOSED MODEL
In deep learning, the image features are extracted automatically using a CNN encoder. The features extracted from the RGB channels carry more information than those of the gray-scale image, as the encoder learns from the three channels R, G, and B coherently. At the same time, the computational complexity increases. Moreover, the information stored in the three channels is highly correlated, which impedes learning. In this work, a feature fusion technique is proposed where the CNN-encoded feature of the texture image is fused with the encoded feature of the RGB channels. Two different CNN encoders are used to derive the texture and RGB features from an image. The motivation for using two different CNN encoders, instead of adding the texture image as a 4th channel alongside R, G, and B, is that the range of pixel values of the texture image is comparatively smaller than that of the R, G, and B channels. With separate encoders, the texture information is not suppressed during the repeated MAX pooling and ReLU operations.

A. LBP Texture Image
The LBP texture of a 3 x 3 pixel block is obtained by thresholding the values of the neighbouring pixels against the center pixel: a neighbour is assigned 1 if its value is greater than or equal to the value of the center pixel, and 0 otherwise. The values of all 8 neighbours are stored as an unsigned byte, so the range is 0 to 255. Eq. (1) shows the calculation of the LBP texture image, where R is the radius of the circle [14].
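The thresholding rule described above can be sketched in a few lines of plain Python. This is only an illustrative sketch for the 3 x 3 case (radius 1, 8 neighbours); the function name `lbp_image`, the clockwise bit order, and the choice to skip border pixels are assumptions, not details taken from the paper.

```python
def lbp_image(gray):
    """Return the 3 x 3 LBP code image for a 2-D list of gray values.

    Border pixels are skipped, so the output is (H-2) x (W-2).
    """
    # Clockwise neighbour offsets starting at the top-left pixel.
    # The bit order is an assumption; any fixed order gives a valid LBP variant.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    h, w = len(gray), len(gray[0])
    out = []
    for y in range(1, h - 1):
        row = []
        for x in range(1, w - 1):
            center = gray[y][x]
            code = 0
            for bit, (dy, dx) in enumerate(offsets):
                # A neighbour >= center contributes a 1 at this bit position.
                if gray[y + dy][x + dx] >= center:
                    code |= 1 << bit
            row.append(code)
        out.append(row)
    return out


# Hypothetical 3 x 3 patch with center value 6.
patch = [[5, 9, 1],
         [4, 6, 7],
         [2, 6, 3]]
lbp_patch = lbp_image(patch)
```

Because each pixel yields one 8-bit code, the resulting texture image has the same 0 to 255 range as a gray-scale image and can be fed to a CNN encoder like any single-channel input.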

B. CSLBP Texture Image
Center-Symmetric Local Binary Patterns are produced by thresholding the difference between the values of pixels that are symmetrically opposite with respect to the center of a pixel block. The threshold is a small integer value T. The CSLBP labels generate shorter histograms, which are a more stable feature for flat image regions. Eq. (2) represents the calculation of the CSLBP texture image [15].
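As a rough illustration of the center-symmetric comparison, the sketch below computes a CSLBP code for each interior pixel of a gray image. The function name `cslbp_image`, the default threshold of 3, the strict `>` comparison, and the pair ordering are assumptions made for the example; the paper only states that T is a small integer.

```python
def cslbp_image(gray, threshold=3):
    """Return the 3 x 3 CSLBP code image for a 2-D list of gray values."""
    # Four neighbours; each is compared with the pixel diametrically
    # opposite to it across the center, giving 4 bits (codes 0..15).
    half = [(-1, -1), (-1, 0), (-1, 1), (0, 1)]
    h, w = len(gray), len(gray[0])
    out = []
    for y in range(1, h - 1):
        row = []
        for x in range(1, w - 1):
            code = 0
            for bit, (dy, dx) in enumerate(half):
                diff = gray[y + dy][x + dx] - gray[y - dy][x - dx]
                # The bit is set only when the difference exceeds T,
                # so small fluctuations in flat regions map to 0.
                if diff > threshold:
                    code |= 1 << bit
            row.append(code)
        out.append(row)
    return out


# Hypothetical 3 x 3 patch used for illustration.
patch = [[5, 9, 1],
         [4, 6, 7],
         [2, 6, 3]]
codes_t0 = cslbp_image(patch, threshold=0)
codes_t3 = cslbp_image(patch, threshold=3)
```

With four pixel pairs there are only 16 possible codes, which is why the CSLBP histogram is shorter than the 256-bin LBP histogram.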

D. Auto-encoder-based CNN Feature
The deep CNN feature of an image is generated using an automatic encoding technique. The image feature is learned through batch-mode training, hence the feature is expected to preserve the class information. As the layers of the CNN architecture are densely connected, learning becomes faster with automatic weight adjustment for a particular class using supervised learning. Fig. 2 shows the proposed model; the size of the input image is 512 x 512 for both CNNs, i.e., the RGB and the texture input. The CNN architecture consists of seven layers, and each layer contains a convolution operation followed by the ReLU and MAX pooling operations. The ReLU activation function introduces non-linearity to the convolution output, whereas MAX pooling reduces the image size after each convolution operation. The flatten layer reduces the image to a single-dimension feature vector of size 1 x 1024. Further dimension reduction is done with four fully connected layers. The softmax operation is used to calculate the class label from the feature map using the energy function.
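The size reduction through the seven layers can be checked with a short calculation. The sketch below assumes size-preserving convolutions and 2 x 2 MAX pooling with stride 2, neither of which is stated in the text; under those assumptions a 512 x 512 input shrinks to 4 x 4, and a 1 x 1024 flattened vector would correspond to 64 channels in the last layer. The function name `encoder_side_lengths` is hypothetical.

```python
def encoder_side_lengths(size=512, layers=7, pool=2):
    """Spatial side length after each conv + ReLU + MAX pooling layer."""
    sizes = [size]
    for _ in range(layers):
        # Assumption: the convolution keeps the spatial size and the
        # pooling window equals its stride, so each layer divides by `pool`.
        sizes.append(sizes[-1] // pool)
    return sizes


sides = encoder_side_lengths()
final_side = sides[-1]
# Channel count implied by a flattened vector of length 1024 (an inference).
implied_channels = 1024 // (final_side * final_side)
```

If the actual architecture uses different padding or pooling settings, the intermediate sizes change accordingly; only the 512 x 512 input, the seven layers, and the 1 x 1024 flattened size come from the description above.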

E. Feature Fusion Model
The cross-entropy loss is determined as the penalty value in each iteration using that energy function. Eq. (5) represents the convolution operation at a point a(x, y) of an image I using the filter f, where H and W represent the height and width of the image. The ReLU operation is defined using Eq. (6). Eq. (7) represents the regularized training error of an instance, expressed in terms of the model loss parameter and the regularization factor used to deal with the model complexity. Eq. (8) represents the sigmoid function Si used to map the output value within (0, 1). The cross-entropy loss for each iteration is defined by Eq. (9), where N is the number of samples and M is the number of labels; the indicator term records whether a given label is the correct classification for a given instance, and the probability term is the probability with which the model assigns that label to that instance.
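The ReLU, sigmoid, and cross-entropy operations named above have standard textbook forms, sketched here in plain Python. The function names, the averaging over the N samples, and the small `eps` guard against log(0) are assumptions for illustration and may differ from the exact formulation used in the paper.

```python
import math


def relu(x):
    """ReLU: pass positive values through, clamp negatives to zero."""
    return max(0.0, x)


def sigmoid(x):
    """Sigmoid: map any real value into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean cross-entropy over N samples and M labels.

    y_true[i][j] is 1 if label j is correct for instance i, else 0.
    y_prob[i][j] is the probability the model assigns label j to instance i.
    """
    n = len(y_true)
    total = 0.0
    for labels, probs in zip(y_true, y_prob):
        for y, p in zip(labels, probs):
            if y:
                # Only the correct label contributes; eps avoids log(0).
                total -= math.log(max(p, eps))
    return total / n


# Two hypothetical instances with two labels each.
loss = cross_entropy([[1, 0], [0, 1]], [[0.5, 0.5], [0.25, 0.75]])
```

The loss shrinks towards zero as the probability assigned to the correct label approaches 1, which is what makes it usable as a per-iteration penalty.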
The proposed feature fusion model, which combines the CNN feature of the RGB image with the CNN feature of the texture image, uses the standard learning rate with an early-stop parameter value of 0.99. The model is trained using 80:20 holdout validation. A GTX 1650 graphics system with 16 GB RAM is used for the training and testing of the proposed model.
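A minimal sketch of the two steps mentioned here, fusing the two encoder outputs and splitting the data 80:20, is given below. The fusion rule is assumed to be plain concatenation, which the text does not confirm; the function names `fuse_features` and `holdout_split` and the fixed random seed are likewise hypothetical.

```python
import random


def fuse_features(rgb_feature, texture_feature):
    """Fuse two encoder outputs into one descriptor.

    Assumption: fusion is simple concatenation of the two vectors.
    """
    return list(rgb_feature) + list(texture_feature)


def holdout_split(items, train_fraction=0.8, seed=0):
    """Shuffle the items and split them into train and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical short feature vectors standing in for the encoder outputs.
fused = fuse_features([0.1, 0.7], [0.3])
train_ids, test_ids = holdout_split(range(100))
```

Under the concatenation assumption, the fused descriptor length is the sum of the two encoder output lengths, and the same split of image indices would be applied to both the RGB and the texture inputs so that the paired features stay aligned.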

F. Performance Measures
The performance of the proposed feature fusion is evaluated using parametric quantifiers such as precision, recall, and f1-score [13], which are defined below using Eq. (10), Eq. (11), and Eq. (12) respectively.
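For reference, the three measures have the following standard definitions, written with the counts True+ (correctly identified images), False+ (falsely recognized images), and False- (falsely rejected images). This is the usual textbook form, which Eq. (10) to Eq. (12) are assumed to follow.

```latex
\begin{align*}
\text{Precision} &= \frac{\text{True}^{+}}{\text{True}^{+} + \text{False}^{+}} \\
\text{Recall} &= \frac{\text{True}^{+}}{\text{True}^{+} + \text{False}^{-}} \\
\text{f1-score} &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align*}
```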
Here the true positive (True+) value is the number of images correctly assigned to their class by the system. The false positive (False+) value is the number of images falsely recognized by the system, and the false negative (False-) value is the number of images falsely rejected by the system. Precision is the number of images correctly assigned to a class relative to the total number of images assigned to that class by the system, whereas recall is the number of images correctly identified for a class relative to all the images belonging to that class. Hence the average recall value is a significant performance measure for a retrieval system. The Caltech and 102flower datasets have different numbers of images in different classes, so the f1-score, i.e., the harmonic mean of precision and recall, is also presented in addition to the precision [13,29]. The receiver operating characteristic (ROC) curve, which plots the true-positive rate against the false-positive rate, graphically illustrates the classifier's performance. The City-block distance measure shown in Eq. (13) is used to measure the similarity between the images.
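The distance-based ranking step of a retrieval system can be sketched as follows. The function names, the `(image_id, feature_vector)` database layout, and the default of 80 returned images (matching the top-80 evaluation) are assumptions made for this example.

```python
def city_block(query, target):
    """City-block (L1) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(query, target))


def retrieve_top_k(query, database, k=80):
    """Return the ids of the k database images closest to the query.

    `database` is a list of (image_id, feature_vector) pairs.
    """
    ranked = sorted(database, key=lambda item: city_block(query, item[1]))
    return [image_id for image_id, _ in ranked[:k]]


# Hypothetical 2-D feature vectors for three database images.
database = [("a", [0, 0]), ("b", [1, 1]), ("c", [5, 5])]
top_two = retrieve_top_k([1, 1], database, k=2)
```

A smaller distance means a more similar image, so the ranked list is sorted in ascending order of distance.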
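City-block distance measure:

$$d(Q, T) = \sum_{i=1}^{n} |Q_i - T_i| \qquad (13)$$

where $Q$ is the feature vector of the query image, $T$ is the feature vector of the database image, and $n$ is the length of the feature vectors.

IV. RESULTS AND DISCUSSION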
The results of the proposed CBIR model using encoded texture feature fusion are discussed in this section. The CNN-based auto-encoding features of the RGB channels are fused with the auto-encoding features of three different textures, i.e., LBP, CSLBP, and LDP. The image retrieval model is tested with three different datasets: Corel-1K, Caltech, and 102Flower. To avoid extensive processing time, 15 selected classes of the 102Flower dataset are considered. The precision, recall, and f1-score of each class are presented for all three datasets. The ROC curve shows the overall classification performance using the CNN encoding features of RGB, RGB_LBP, RGB_CSLBP, and RGB_LDP. The average retrieval performance for top-80 image retrieval is shown separately for all three datasets using the four encoding features discussed above. The class-wise retrieval performance is illustrated with bar graphs for all four encoding features. The detailed results for the individual classes are discussed in the sub-sections below.

A. Results Analysis of Corel-1K
Table II shows the performance analysis for the Corel-1K dataset. The maximum average precision is 94.5%, obtained using the LDP_RGB encoder feature. It is 94.2% using LBP with RGB, 94.3% using CSLBP with RGB, and 94.2% using the RGB encoder feature. The classification performance is presented using the ROC curve in Fig. 3(a). The recall and f1-score are 94.4% and 94.5% respectively using the LDP_RGB feature, which are the highest among the compared features. Though the difference in the classification rate is small, the average retrieval rate for the top 80 images, shown in Fig. 3(b), is significantly enhanced using the LDP_RGB feature in comparison to the other features. The class-wise retrieval analysis for the top 10 images is shown in Fig. 4. In this dataset, each class consists of 100 images, so there is no major difference between the precision and f1-score values.

B. Results Analysis of Caltech
The result analysis of the Caltech dataset is shown in Table III. In this case, the maximum average precision is 89.7%, obtained using the LDP_RGB encoder feature, whereas it is 88.9% using LBP with RGB, 88.3% using CSLBP with RGB, and 88.9% using the RGB encoder feature. The classification performance is presented using the ROC curve in Fig. 5(a). The recall and f1-score are 89.6% and 89.5% respectively with the LDP_RGB feature, which are higher than those of the other encoding features.
Along with the visible difference in the classification rate, the average retrieval rate for the top 80 images, shown in Fig. 5(b), is also enhanced using the LDP_RGB feature in comparison to the other features. Fig. 6 shows the class-wise retrieval analysis for the top 10 images. As the number of images varies across the classes in this dataset, the precision and f1-score values also differ; these values are more stable using the LDP_RGB encoder feature.

C. Results Analysis of 102Flower
The classification performance for the 102Flower dataset is shown in Table IV. Here 15 classes are selected; the name of the flower and the number of images available for each class are given in the first and last columns of the table respectively. The maximum average precision is 88.7%, obtained using the LDP_RGB encoder feature, whereas it is 87.9% using LBP with RGB, 85.6% using CSLBP with RGB, and 84.5% using the RGB encoder feature. The classification performance of the 102Flower dataset is presented using the ROC curve in Fig. 7(a).
There is a significant difference in the classification rate. The result in Fig. 7(b) shows that the average retrieval rate using the LDP_RGB feature is better than that of the other fusion techniques for retrieving the top 80 images. Fig. 8 shows the class-wise retrieval of the top 10 images. The f1-score is also enhanced using the LDP_RGB feature. Table V presents a state-of-the-art comparison, where the performance of five other works available in the literature that use the same datasets is compared with the proposed feature fusion model. It shows that the accuracy is 94.5% for Corel-1K and 89.7% for Caltech256. A significant improvement is achieved for both datasets using the proposed feature fusion model.

V. CONCLUSION
This paper proposed a CBIR model using a feature fusion technique, in which the CNN-encoded features of the texture image are fused with the encoded features of the RGB image. As the range of pixel values in the texture image is comparatively smaller than that of the RGB image, two different encoders are employed to extract the CNN features separately. These two features are fused to define a more significant image descriptor.
The proposed model is tested with three public datasets, i.e., Corel-1K, Caltech, and 102flower. It is observed that the classification performance is improved by the proposed feature fusion model as compared to the RGB channel encoding feature. The results show that the LDP with RGB feature fusion performs better than the LBP with RGB and CSLBP with RGB features. There is a significant improvement in the retrieval system for top-10 as well as top-80 image retrieval. Moreover, the enhancement of the f1-score using the proposed feature fusion technique illustrates that the class property is better retained using the fused features. The f1-score improved significantly using the encoder-based LDP with RGB feature fusion for the datasets with class imbalance issues, such as Caltech and 102flower. In the future, the model can be tested using the fusion of other textures such as LTP and GLCM.