Image Retrieval using Visual Phrases

Keypoint-based descriptors are widely used in computer vision applications. Keypoints are first detected in the given images and then represented by robust and distinctive descriptors such as the scale-invariant feature transform (SIFT). Keypoint-based image-to-image matching has achieved significant accuracy in image retrieval applications such as image copy detection, similar image retrieval, and near-duplicate detection. Local keypoint descriptors are quantized into visual words to reduce the feature space, which makes image-to-image matching feasible for large-scale applications. Bag of visual words quantization gains efficiency at the cost of accuracy. In this paper, the bag of visual words model is extended to detect frequently co-occurring visual words, known as frequent item-sets in text processing and called visual phrases here. Visual phrases increase the accuracy of image retrieval without increasing the vocabulary size. Experiments carried out on benchmark datasets demonstrate the effectiveness of the proposed scheme.
Keywords: Image processing; image retrieval; visual phrases; Apriori algorithm; SIFT


I. INTRODUCTION
Extracting information from images is a very important process in image processing and computer vision. It is used to interpret and understand image contents for image processing applications. Image feature extraction is one of the driving factors in interpreting and processing images across various areas of computer vision.
Content Based Image Retrieval (CBIR) [1], also known as Query by Image Content (QBIC), is an image processing technique for efficiently retrieving images from a large database given a query object. One of the key issues is searching visual information with computer vision techniques in order to retrieve images from a huge database. Query-based search is one such application of image processing in computer vision. Applications include medical image databases such as Computerized Tomography (CT), Magnetic Resonance Imaging (MRI), and ultrasound; the World Wide Web (WWW); scientific databases; and consumer electronics such as digital cameras and games.
Visual information is common in media channels and social media. Such applications have drawn researchers' attention to developing efficient and robust image retrieval systems. One of the most fundamental issues in image retrieval is the memory occupied by the feature descriptors of the images, so low-level features are required that keep descriptor memory small [2].
One of the most commonly used feature techniques for image databases is the Scale Invariant Feature Transform (SIFT) [3]. SIFT performs well in various computer vision tasks and is intrinsically robust to geometric transformations [4]. Conventionally, to match one object to another in image retrieval tasks, a distance is computed for any given keypoint against the keypoints of all images. In SIFT, all keypoints in a given image are first detected and described. A keypoint in one image is then matched to its nearest keypoint in the other image. Local keypoint descriptors mainly face two computational issues: (1) the space needed to store the features and (2) the cost of finding similar images in the database.
To overcome the above-mentioned issues of SIFT local keypoint features, local keypoint descriptors are quantized using the Bag of Visual Words (BoVW) technique. Various quantization techniques are used for image processing and retrieval, such as Fisher Vector [6], VLAD [7][8][9], binary quantizers, and the BoVW model [10].
The BoVW model is commonly employed in the literature for image processing and computer vision applications, including image retrieval [10,11] and image classification [8]. The BoVW concept originated in document and text retrieval, where a document is represented by the frequencies of the words it contains, and was later carried over to image retrieval. To normalize the vocabulary size for a document, stop words and the most frequently occurring words are deleted, and stemming or lemmatization is then applied to the remaining words. The same idea applies to clustered descriptors in the visual domain. The center of each descriptor cluster is considered a visual word. The vocabulary is learned by clustering descriptors from a large database, which is an off-line procedure. An image can then be represented by a histogram of its visual words. The quantization process and descriptor representation are explained in Section III-A. One limitation of the BoVW model is that it treats each visual word as a single, independent entity [12]. In text processing applications, words are grouped for training based on how frequently they occur together in the documents. These groups are the frequent item-sets of text processing. ... with those limitations, and finally the evaluation of the proposed model is presented, followed by a short conclusion in the last section.

II. LITERATURE REVIEW
There are two main categories of image retrieval techniques: (1) text based search, where images are annotated manually and retrieval is performed in a text based database management system, and (2) content based search, where images are retrieved automatically using visual content such as colors, shapes, textures, or any other information that can be extracted from the images [13,14]. The images are indexed using indexing techniques for large-scale retrieval.
Recently, the convolutional neural network (CNN) has emerged as one of the state-of-the-art classifiers, obtaining better performance in various computer vision applications. In most of the reported frameworks, a CNN is used both as a feature extractor and as a classifier for image classification [15]. Object search, scene retrieval, video retrieval, and Video Google are some of the active research areas based on this technique, which is also known as text based search.
In SIFT, keypoints are first located and each keypoint is then described by a vector of length 128 [3]. SIFT typically yields between 2.5K and 3.0K keypoints for an individual image. Each local keypoint descriptor is then quantized to a visual word, and the visual words are aggregated into a single image feature. The BoVW model gives successful and promising results for image retrieval in large databases, where the low recall rate is addressed using a standard query expansion method from text retrieval.
SIFT descriptors are used with a variety of techniques for the same type of problem in order to obtain more robust and distinctive results. For computationally efficient object search over a large corpus of images, the feature descriptors are clustered or quantized to Hamming space [16] or to a single image feature [17].
In image retrieval over a large corpus of images, all leading methods rely on variants of the same technique [11]. Each image is processed to extract features in a high-dimensional feature space. The feature descriptors are then quantized so that each feature is represented by a visual word from a smaller, discrete vocabulary.
Another approach to searching is the use of phrases, which are built from visual words. This technique has two major drawbacks. First, the phrases defined so far only capture the co-occurrence of visual words in the whole image or within a local neighborhood [18]. They do not encode the spatial relationship between the words; they only provide neighborhood information and never capture long-range interactions. They do not define the spatial layout of the visual words, so the spatial verification is weak. Second, the total number of phrases increases exponentially with the number of words. A subset of the phrase set can be selected by some algorithm; however, this might remove a large portion of the phrases, including some that could prove important for image representation.
Geometry-preserving Visual Phrases (GVP) [19] incorporate spatial information into the search step by encoding visual words in a specific spatial arrangement. This algorithm is inspired by [20], which is used for object categorization. The latter defines the co-occurrences of GVP within the whole image by building a support vector machine kernel for object categorization, and it is not used for large databases. The authors extend their algorithm to large image databases. For this purpose, they slightly increase the memory usage of the BoVW search method, which provides more spatial information. To improve search efficiency, they integrate GVP into the min-hash function [21]. This approach increases search and retrieval accuracy by adding spatial information, at some additional computational cost.
Demand for mobile phones is increasing, users frequently ask for added features on their devices, and many companies respond by adding more and more features to their products. Landmark identification is one of the most prominent applications: people obtain information about different places by taking pictures of those locations, which is very useful for visitors [22]. In the next section, we present the proposed model, which is based on BoVW.

III. PROPOSED APPROACH
In this section, a BoVW based model is proposed. The discriminative power of visual words can be increased by using visual phrases [22]. The idea is inspired by text-based searching, where two words are concatenated into one phrase based on how frequently they occur together in a large corpus.
To model the same idea in visual search, words and transactions need to be defined in visual space. Images are represented by a set of local keypoint descriptors such as SIFT [3]. Searching images based on raw SIFT descriptors is computationally expensive [10]. BoVW is widely used to make image search feasible for large databases. Visual words are treated as words in the proposed framework, analogous to text based searching [10, 23, 24, 25]. In the remainder of this section, BoVW is explained first, followed by the frequent item-set algorithm (Apriori), and finally the proposed BoVW based framework.

A. Bag of Visual Words
The bag of visual words model is widely used for feature quantization. Every keypoint descriptor $x_j \in \mathbb{R}^d$ is quantized to one of a finite number of centroids indexed from 1 to $k$, where $k$ denotes the total number of centroids, also known as visual words, which are denoted by $V = \{v_1, v_2, \ldots, v_k\}$ with each $v_i \in \mathbb{R}^d$. Consider a frame $f$ that is represented by local keypoint descriptors $f_X = \{x_1, x_2, \ldots, x_m\}$, where $x_i \in \mathbb{R}^d$. In the BoVW model, a function $G$ is defined as:
$$G : \mathbb{R}^d \rightarrow \{1, 2, \ldots, k\} \qquad (1)$$
where $G$ maps a descriptor $x_i \in \mathbb{R}^d$ to an integer index. Mostly, the Euclidean distance is used to decide the index returned by $G$. For a given point $x_i$, the Euclidean distance is computed to all the centroids (visual words), and the index of the centroid with the minimum distance to $x_i$ is selected, i.e., $G(x_i) = \arg\min_{1 \le j \le k} \lVert x_i - v_j \rVert_2$. For a given frame $f$ and bag of visual words $V$, $I_f = \{\mu_1, \mu_2, \ldots, \mu_k\}$ is computed, where $\mu_i$ indicates the number of times $v_i$ has appeared in frame $f$, and $I_f$ is normalized to unit length at the end. Mostly, k-means or hierarchical k-means clustering is applied to obtain the centroids (visual words) $V$. The value of $k$ is kept very large for image matching or retrieval applications; the suggested value of $k$ in this proposed approach is 1 million. The accuracy of quantization mainly depends on the value of $k$: if the value is small, two different keypoint descriptors will be quantized to the same visual word, which decreases distinctiveness, and if the value is very large, two similar keypoint descriptors that are slightly distorted can be assigned different visual words, which decreases robustness [10] [26].
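The quantization step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `nearest_word` and `bovw_histogram`, the toy two-dimensional vocabulary, and the choice of L2 normalization for "unit normalized" are assumptions made for illustration, and indices are 0-based here rather than 1-based.

```python
import math
from collections import Counter

def nearest_word(x, vocab):
    """Return the 0-based index of the centroid closest to x (Euclidean distance)."""
    return min(range(len(vocab)), key=lambda j: math.dist(x, vocab[j]))

def bovw_histogram(descriptors, vocab):
    """Count visual-word occurrences and normalize the histogram to unit L2 length.

    L2 normalization is an assumption; the text only says "unit normalized".
    """
    counts = Counter(nearest_word(x, vocab) for x in descriptors)
    hist = [float(counts.get(j, 0)) for j in range(len(vocab))]
    norm = math.sqrt(sum(h * h for h in hist))
    return [h / norm for h in hist] if norm > 0 else hist

# Toy data: three 2-D "visual words" and four descriptors (real SIFT is 128-D).
vocab = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
descriptors = [(1.0, 1.0), (9.0, 0.5), (0.5, 0.2), (8.0, 1.0)]
words = [nearest_word(x, vocab) for x in descriptors]
hist = bovw_histogram(descriptors, vocab)
```

A brute-force scan over all centroids is only practical for small vocabularies; for vocabularies as large as those discussed above, an approximate nearest-neighbor structure or hierarchical k-means would normally replace the linear search.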

B. Frequent Item-set Detection
Apriori is a well-known data mining algorithm used for finding frequent item-sets in transactions [27]. Let the items be denoted by $I = \{i_1, i_2, \ldots, i_k\}$ and the transactions by $T = \{t_1, t_2, \ldots, t_m\}$, where each $t_i$ contains a combination of more than one item; e.g., $t_i = \{i_1, i_4, i_7\}$ contains three items, $i_1, i_4, i_7 \in I$. As stated above, the experiments in this paper cover only frequent item-sets of up to three items, obtained as follows. The value of minimum support is set to 0.75, which implies that an item-set is considered frequent if it appears in at least 75% of the transactions. All item-sets in the candidate set $C_2$ that are present in at least 75% of the transactions are treated as frequent and denoted by $L_2$. Similarly, $L_3$ is calculated.
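As an illustration of the level-wise search, the following Python sketch computes $L_1$, $L_2$, and $L_3$ for a minimum support of 0.75. It is a simplified rendering of the general Apriori idea under stated assumptions, not the R package used in the experiments; the function names and the toy transactions are hypothetical.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def apriori_up_to_3(transactions, min_support=0.75):
    """Return [L1, L2, L3]: sets of frozensets whose support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    levels = [{frozenset([i]) for i in items
               if support(frozenset([i]), transactions) >= min_support}]
    for size in (2, 3):
        prev = levels[-1]
        # Join step: unions of frequent (size-1)-sets that contain exactly `size` items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == size}
        # Prune step: every (size-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, size - 1))}
        levels.append({c for c in candidates
                       if support(c, transactions) >= min_support})
    return levels

# Toy transactions; items are integer visual-word indices.
transactions = [{1, 4, 7}, {1, 4, 7, 9}, {1, 4, 7}, {1, 4, 8}]
L1, L2, L3 = apriori_up_to_3(transactions)
```

In this toy example, items 8 and 9 each occur in only one of four transactions and are dropped at the first level, so they never enter any candidate pair or triple.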

C. Frequent Visual Word Detection
The Apriori approach is now extended to visual phrases. To detect frequent item-sets, called visual phrases in this paper, each keypoint descriptor is mapped to a visual word, which is treated as an item. Every image is represented by a set of visual words, as shown in Fig. 1 (a-e).
To create transactions out of visual words, a circle of radius r is drawn around each keypoint and all the visual words within that radius are treated as one transaction, as shown in Fig. 1 (e-f). If r is increased to a large number of pixels, the transactions become very long. In this paper, we experimented with r = 100.
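The transaction construction described above can be sketched in Python as follows. This is a minimal illustration under assumptions: the function name `build_transactions` is hypothetical, keypoint positions are given as pixel coordinates, and the visual word of the center keypoint itself is included in its transaction, which the text does not state explicitly.

```python
import math

def build_transactions(points, words, r=100.0):
    """One transaction per keypoint: the set of visual words whose keypoints
    lie within distance r (in pixels) of that keypoint, itself included."""
    transactions = []
    for p in points:
        t = {w for q, w in zip(points, words) if math.dist(p, q) <= r}
        transactions.append(t)
    return transactions

# Toy example: four keypoint locations and their visual-word indices.
points = [(0, 0), (30, 40), (300, 300), (320, 310)]
words = [5, 8, 2, 5]
transactions = build_transactions(points, words, r=100.0)
```

The double loop is quadratic in the number of keypoints; with several thousand keypoints per image, a spatial index such as a grid or k-d tree would be a natural replacement.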
The Oxford 5K [28] dataset is used for training the visual words and detecting frequent visual words. The Oxford 5K dataset contains 5062 images of 11 different landmarks. On average, there are 3.5K keypoints per image using the Hessian Affine detector.

D. Dataset
To evaluate the proposed framework, the PCA-SIFT dataset is used, which is a challenging dataset used in several works and can be downloaded online. The dataset is shown in Fig. 2. There are 10 different scenes, each with three severe transformations. The transformations include changes in scale, rotation, zooming, viewpoint change, and different intensities of illumination.

E. Experimental Setup
During the experiments, 10000 visual words are learned, which are treated as items. These visual words are the centroids obtained by k-means clustering.
In the training phase, the Oxford 5K dataset is used for feature extraction and clustering: SIFT descriptors are extracted from all the images and pooled into one feature set. Then, k-means clustering is applied with k = 10000. The VLFeat library (footnote 3) is used for k-means clustering.
Once the visual words are learned and the images are represented by visual words, transactions are generated as explained in the previous section and in Figure 1. Frequent visual words are identified with the Apriori algorithm using an R package.
The baseline is the same as explained in Equation 1: the image $f$ is represented by $I_f = \{\mu_1, \mu_2, \ldots, \mu_k\}$. Let the visual phrases be denoted by $F = \{\phi_1, \phi_2, \ldots, \phi_k\}$, where $\phi_i$ is an unordered set of three frequent visual words identified by the Apriori algorithm. For every $\phi_i$, the frequency is also stored in a separate file; the frequency is taken into account if more than one frequent item-set falls within the radius of a given keypoint $x_i$. The given image is quantized in the same way as in Equation 1, the only difference being that $V$ is replaced with $F$: the function $G_F : \mathbb{R}^d \rightarrow \{1, 2, \ldots, k\}$ maps the given keypoint descriptor $x_i$ to an index from the frequent visual words $F$. $G_F$ is computed as follows:
• For the given image, repeat the steps explained in Figure 1 (a-e).
• Draw a circle of radius r around every keypoint and record the other keypoints within that circle, denoted by $t$, as illustrated in Figure 1 (f).
• Find all 3-combinations of the elements in $t_i$ for the given keypoint $x_i$, and check each of those combinations against $F$.
• Assign the index from $F$ to the keypoint $x_i$ if any 3-combination of the transaction $t_i$ is present in $F$. Most of the time, more than one combination of $t_i$ is present in $F$, so the index of the most frequent $\phi$ is assigned to $x_i$.
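The last two steps above can be sketched in Python as follows. This is an illustrative sketch rather than the authors' code: the function name `assign_phrase`, the in-memory list `phrase_freq` (standing in for the frequencies stored in a separate file), the 0-based indices, and returning `None` when no phrase matches are all assumptions.

```python
from itertools import combinations

def assign_phrase(transaction, phrases, phrase_freq):
    """Return the index in `phrases` of the most frequent 3-combination of
    `transaction` that is a known visual phrase, or None if none is present."""
    index_of = {p: i for i, p in enumerate(phrases)}
    hits = [index_of[frozenset(c)]
            for c in combinations(sorted(transaction), 3)
            if frozenset(c) in index_of]
    if not hits:
        return None  # assumption: the text does not say how unmatched keypoints are handled
    return max(hits, key=lambda i: phrase_freq[i])

# Toy phrase set F and the frequencies recorded for each phrase.
phrases = [frozenset({1, 4, 7}), frozenset({2, 4, 7}), frozenset({3, 5, 9})]
phrase_freq = [120, 300, 80]
idx = assign_phrase({1, 2, 4, 7}, phrases, phrase_freq)
```

Here the transaction contains two known phrases, {1, 4, 7} and {2, 4, 7}, and the second is chosen because its recorded frequency is higher.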
Finally, the Video Google [29] approach is used for matching the visual words between a pair of images.
The mean average precision (mAP) is used to evaluate the proposed framework. Precision $P$ is obtained as follows:
$$P = \frac{E}{O}$$
where $E$ denotes the number of correctly retrieved images and $O$ denotes the total number of retrieved images. Precision is calculated at different values of recall $R$, which can be computed as follows:
$$R = \frac{E}{W}$$
where $W$ denotes the total number of images to be retrieved, i.e., the total true positives for a given query. For each query, an average precision is computed, and finally the mean of all average precisions (mAP) is computed, as reported in Table I.
3 http://www.vlfeat.org/
Table I shows the average precision for each scene and the final mAP for both the proposed framework and the BoVW model. It can be seen that the proposed framework achieves perfect precision for some of the scenes.
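For readers who want to reproduce the evaluation measure, the following Python sketch computes average precision per query and mAP over queries. It assumes the common non-interpolated definition of average precision (precision averaged over the ranks at which relevant images are retrieved); the paper does not state which variant was used, and the function names and toy rankings are hypothetical.

```python
def average_precision(ranked_relevance, total_relevant):
    """Non-interpolated AP for one query.

    ranked_relevance: booleans in ranked order, True if the retrieved image is relevant.
    total_relevant:   W, the number of relevant images for the query.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1                      # E: correctly retrieved so far
            precision_sum += hits / rank   # P = E / O at this recall level
    return precision_sum / total_relevant if total_relevant else 0.0

def mean_average_precision(queries):
    """Mean of the per-query average precisions."""
    return sum(average_precision(r, w) for r, w in queries) / len(queries)

# Two toy queries: (ranked relevance flags, number of relevant images W).
queries = [([True, False, True], 2), ([False, True], 1)]
ap_values = [average_precision(r, w) for r, w in queries]
map_value = mean_average_precision(queries)
```

For the first toy query, relevant images appear at ranks 1 and 3, so the precisions 1/1 and 2/3 are averaged; for the second, the single relevant image at rank 2 gives a precision of 1/2.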

IV. CONCLUSION
This paper presents an extension of the BoVW model. Images are represented by local keypoint descriptors, which are then quantized into visual words (BoVW). Instead of representing every keypoint with a single visual word, the model is extended to group visual words into what are known as visual phrases. This idea is inspired by text based search engines, where a text document is represented by a set of frequent item-sets. In this paper, frequent item-sets of up to three items are discovered and each image is represented by $L_3$ frequent item-sets. Experiments on a benchmark dataset show that the mean average precision (mAP) increases from 0.7364 to 0.8289. The same framework can be extended to $L_n$ frequent item-sets for very large databases, which is left as future work.

Fig. 1. Abstract flow diagram of the proposed approach, which models the Apriori algorithm [5] to find the frequent visual words. Panels (a-c) show the original image, which is converted to gray scale and then represented by SIFT keypoint descriptors; (d) maps each keypoint to its nearest visual word; (e) shows a few keypoints with radius r; and (f) shows one keypoint that is converted into a transaction.

TABLE I. RETRIEVAL ACCURACY OF PROPOSED FRAMEWORK COMPARED WITH BOVW MODEL.