Experimental Evaluation of Basic Similarity Measures and their Application in Visual Information Retrieval

—Searching for similar images is an important feature of image databases and decision support systems in various subject domains. It is essential, however, that search results are sorted in descending order of similarity. This paper presents a comparative analysis of four existing similarity measures and experimentally tests whether they can be used to calculate similarity between images. Metrics could be evaluated by comparing their results to the cumulative human perception of similarity between the same images, obtained from real people. However, this introduces a lot of subjectivity due to non-uniform judgment and evaluation scales. The paper presents a more objective approach: it checks which measure performs best at retrieving more images containing objects of the same type. Results show that all four measures can be used to calculate similarity between images, but Jaccard's index performs best in most cases, because it compares feature vectors positionally and thus indirectly considers shape, position, orientation and other features.


I. INTRODUCTION AND RELATED WORK
With the development of the Internet and information technology, it has become possible to store and process larger volumes of data, with more and more data in the form of images. This is the basis of the great interest in approaches and algorithms for image organization, search and retrieval. Naturally, storing large volumes of images requires new and efficient image retrieval approaches. There are two main approaches to image organization and storage: text descriptions, keywords or labels (known as text-based image retrieval) [1], and content-based image retrieval [2], [3]. The use of text descriptions is a slow and time-consuming process (because a person must describe each image with text, not from a computational point of view), so algorithms for content-based image retrieval are of greater scientific interest. The main characteristics used in these algorithms are color [4], [5], shape [6], texture [7], spatial features and their combinations [8]. Color is one of the most basic and at the same time most distinctive features, and it hardly changes when an image is rotated, resized or reoriented. Therefore, the use of color or the color distribution in images is the most popular CBIR approach among researchers, and yet it is not exhausted and is still a subject of interest.
The typical architecture of a CBIR system consists of two main elements. The first is related to the extraction of image features and their storage, organization and indexing. The second concerns the assessment of the similarity between the query image and the images in the database. Which similarity measures should be used, and how can their suitability for a specific application be assessed?
One of the major problems in assessing the visual similarity of images is that there is no classification to use as a criterion. Therefore, it is not possible to make an accurate assessment of the results produced by the various methods for assessing image similarity. Nor is it possible to use user evaluation (through surveys or other methods), as the subjective factor in the evaluation is too significant and there are undoubtedly huge differences between similarity ratings made by different people, even on a small sample of images. All this calls for automatic criteria for assessing similarity that require no human intervention.

II. GOAL AND MOTIVATION
The aim of this paper is to test whether four popular similarity measures (not specially designed for image comparison) can be used to calculate similarity between images. We tried to do this in our previous paper [9] by comparing the results of the similarity measures to the cumulative human perception of similarity between the same images, obtained from an online survey. However, we encountered an enormous problem then: non-uniform judgment and evaluation scales used by the individual respondents.
The survey was designed so that a query image was shown next to a set of sample images, and users were required to specify the exact value of similarity (in their own opinion), as a percentage, between the query and each image in the set. Since we used a nominal rather than an ordinal scale, we obtained quite high non-uniformity between individual answers. For example, one respondent specified that the similarity between the query and image X was 80%. Another respondent specified 95% for the same pair of images, while a third specified 40%. Averaging answers with such high discrepancies could not guarantee the reliability and accuracy of the obtained "human perception of similarity", so the latter could not be used as a reliable reference.
To test the four similarity measures (Jaccard's index, Euclidean distance, City block distance and Chi-squared dissimilarity) and evaluate how good they are, we decided to use an alternative, more objective approach, inspired by Top-N accuracy. We defined a set of 200 images: 50 red roses, 50 tomatoes, 50 red apples and 50 red peppers. The colors of all images are similar: red (roses, fruits, vegetables) and green (leaves). The idea is to check which measure performs best at correctly classifying retrieved items (precision) for a specific level of recall. Suppose we are looking for a rose and the system is set up to return 10 results. Then the best similarity measure is the one that returns the most roses out of these 10 results and only a few (or preferably no) tomatoes, apples and peppers. Since there are 50 images of roses, we run 50 queries (every image is used as a query) and average their respective precision for the specified level of recall (top-N results).
This study is important in order to determine which of these four basic similarity measures performs best when searching for images. The results will make it possible to design and develop an improved universal image retrieval system that can correctly find similar images in various subject domains, or even a system that can automatically select the best similarity measure for a given subject domain by itself.

III. EXPERIMENTAL ENVIRONMENT AND EVALUATION
The experimental CBIR system used is described in detail in [9] and [10]. Briefly, the feature vector for each image is formed by the following sequence of actions:
• Each image is divided into 32 by 32 blocks in both the width and height dimensions (Fig. 1 to 4).
• All pixels in all blocks are converted from RGB to one of our 64 primary colors. The selection of these 64 colors and the color transformation process are described in our previous research [9], [10].
• The dominant color (out of the 64 colors in our proposed color scheme) in each block is determined based on the number of pixels of each color. This dominant color is associated with the block, and the other colors in it are ignored.
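The three steps above can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the actual 64-color palette is defined in [9], [10], so a hypothetical evenly spaced 4×4×4 RGB cube stands in for it here.

```python
import numpy as np

# Hypothetical 64-color palette: an evenly spaced 4x4x4 RGB cube.
# The paper's actual palette is defined in refs [9], [10].
PALETTE = np.array([[r, g, b]
                    for r in (0, 85, 170, 255)
                    for g in (0, 85, 170, 255)
                    for b in (0, 85, 170, 255)], dtype=float)  # shape (64, 3)

def nearest_palette_color(pixels):
    """Map each RGB pixel to the index of the nearest palette color."""
    # pixels: (n, 3) array; pairwise distances: (n, 64)
    d = np.linalg.norm(pixels[:, None, :] - PALETTE[None, :, :], axis=2)
    return d.argmin(axis=1)

def feature_vector(image, grid=32):
    """Reduce an (H, W, 3) RGB image to grid*grid dominant-color codes."""
    h, w, _ = image.shape
    bh, bw = h // grid, w // grid
    codes = np.empty(grid * grid, dtype=int)
    for i in range(grid):
        for j in range(grid):
            block = image[i*bh:(i+1)*bh, j*bw:(j+1)*bw].reshape(-1, 3)
            idx = nearest_palette_color(block.astype(float))
            # Dominant color = the palette code occurring most often in the block
            codes[i*grid + j] = np.bincount(idx, minlength=64).argmax()
    return codes

# Example: a solid red 64x64 image yields 1024 identical color codes
img = np.zeros((64, 64, 3), dtype=np.uint8)
img[..., 0] = 255
fv = feature_vector(img)
print(len(fv), len(set(fv.tolist())))
```

Any nearest-color quantization scheme would fit this skeleton; only the palette and the block grid size are fixed by the paper's description.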
In other words, we use just a single color code to substitute multiple pixels per block. In this way, the enormous image color content is reduced to a feature vector of 1024 (32 by 32) color codes. The results of this quantization and color substitution are shown in Fig. 1 to 4. This allows fast image processing and similarity searching, and it improves recall as well. The system performs the same color analysis for both the query image and the image set and computes such a feature vector for each graphic file. Based on set-theoretic or algebraic methods and the similarity measures Jaccard index, Euclidean distance, City block distance and Chi-squared dissimilarity, described in [9], we calculate the similarity factor between the query and each result. Finally, the system returns a sorted list of similar images (Fig. 9).
A set of 200 images (as described earlier) is used in our study. They all have common visual and color characteristics, but are divided into four separate groups: roses, tomatoes, apples and peppers. The feature vectors are stored in the database, and each of these 200 images is used as a query in the experiment. It is known which group each image belongs to, and the system keeps track of how many of the first n returned results come from the same group. The first 3, 5, 10, 15, 20, 25, 30, 35, 40, 45 and 50 results are examined, and the average number of images returned from the same group is calculated for every query. This is repeated for each of the similarity measures in order to check the degree of accuracy/adequacy of each measure.
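The exact formulations of the four measures are given in [9]. As an illustrative sketch (an assumption, not the authors' exact definitions), the snippet below treats Jaccard's index as a position-by-position comparison of two code vectors, while the other three measures operate on global 64-bin color histograms. Note that Jaccard's index is a similarity (higher means more similar), whereas the other three are dissimilarities (lower means more similar).

```python
import numpy as np

def jaccard_index(fv_a, fv_b):
    """Positional Jaccard: fraction of block positions whose dominant
    color codes coincide (one plausible reading of ref [9])."""
    fv_a, fv_b = np.asarray(fv_a), np.asarray(fv_b)
    return float(np.mean(fv_a == fv_b))

def histogram(fv, bins=64):
    """Global color histogram of a feature vector of color codes."""
    return np.bincount(np.asarray(fv), minlength=bins).astype(float)

def euclidean(fv_a, fv_b):
    return float(np.linalg.norm(histogram(fv_a) - histogram(fv_b)))

def city_block(fv_a, fv_b):
    return float(np.abs(histogram(fv_a) - histogram(fv_b)).sum())

def chi_squared(fv_a, fv_b):
    h_a, h_b = histogram(fv_a), histogram(fv_b)
    denom = h_a + h_b
    mask = denom > 0                      # avoid 0/0 for unused colors
    return float(((h_a - h_b)[mask] ** 2 / denom[mask]).sum())

a = [0] * 512 + [1] * 512                # half color 0, half color 1
b = [1] * 512 + [0] * 512                # same histogram, shuffled positions
print(jaccard_index(a, b))               # 0.0 (no positions agree)
print(euclidean(a, b), city_block(a, b), chi_squared(a, b))  # all 0.0 (identical global colors)
```

The example illustrates the key distinction drawn in the text: two vectors with identical global color content are indistinguishable to the histogram-based measures, while the positional Jaccard comparison still tells them apart.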
Let us take the group of rose images as an example. For each rose image run as a query, it is checked how many of the top-N returned results are also roses. The test is repeated for all 50 rose images, and the average precision is then determined as the proportion of returned rose images among all returned images (e.g. top-3 results). So, if for "rose image 1" as a query we have 3/3 returned rose images (i.e. all returned images are roses), for "rose image 2" we have 2/3 rose images (i.e. 2 images are indeed roses and the third is another object), and for "rose image 3" we have 1/3 rose images returned (i.e. 1 rose image and two other objects), then the average precision for the top-3 results is (3 + 2 + 1) / (3 + 3 + 3) = 6 / 9 = 2 / 3.
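The averaging procedure above can be sketched as a short snippet; the group labels are illustrative placeholders matching the worked example, not data from the experiment.

```python
def precision_at_n(result_groups, n, target_group):
    """Fraction of the top-n results that belong to the target group."""
    top = result_groups[:n]
    return sum(1 for g in top if g == target_group) / n

# Worked example from the text: three rose queries whose top-3 results
# contain 3/3, 2/3 and 1/3 roses respectively
queries = [
    ["rose", "rose", "rose"],
    ["rose", "rose", "apple"],
    ["rose", "pepper", "tomato"],
]
avg = sum(precision_at_n(q, 3, "rose") for q in queries) / len(queries)
print(avg)  # (3 + 2 + 1) / 9 = 2/3
```

In the full experiment the same computation is simply repeated over all 50 queries per group and for each value of N (3, 5, 10, ..., 50).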
In the experiment, this is done with 50 image queries from each of the four groups, and the top-3, top-5, etc. results are examined. In this way, we can track how the mean precision changes for each similarity measure, based not only on individual queries but on all 50 queries from each image group. The results are shown in Table I (for the set of roses run as queries), Table II (for the set of apples), Table III (for the set of peppers) and Table IV (for the set of tomatoes). The results are also presented graphically in Fig. 5 (for the set of rose query images), Fig. 6 (apple query images), Fig. 7 (pepper query images) and Fig. 8 (tomato query images). It is clearly seen that Jaccard's index significantly outperforms (retrieves more images containing an object of the same type) all other similarity measures for the set of roses (see Fig. 5). However, this is not the case for tomatoes (Fig. 8), for example, although they have the very same colors. An in-depth analysis of the similarity measures themselves reveals the reason: Jaccard's index calculates the similarity between two images by positionally comparing the dominant colors block by block, while all the other described measures use colors globally. Taking the local color distribution into account allows considering not just colors, but shapes and local details as well.
Jaccard's index performs better with roses than with tomatoes because a rose flower consists of multiple individual petals that reflect light differently and create dark shadows between them (Fig. 1), while tomatoes are singular convex rounded objects (Fig. 2). By accounting for the position of the shadows, Jaccard's index can more easily and reliably determine whether the red object in the center of the image is a rose or something else. Distinguishing a convex red tomato from a convex red apple, however, is much more difficult. That is why Jaccard's index outperforms all other similarity measures for the set of roses (due to additional surface features: shadows between petals) and the set of peppers (due to the oblong shape), but achieves a smaller (though still positive) improvement for apples and no improvement for the tomato image set.
IV. CONCLUSIONS
• When searching by color content with colors considered globally, Euclidean distance, City block distance and Chi-squared dissimilarity produce commensurate results. That is clearly noticeable in Fig. 5 to 9. The main difference between these metrics is in the magnitude of the calculated value; the relationships between the calculated similarity factors remain the same regardless of which of these three similarity measures is used. It should be noted that it is precisely these relationships between similarity factors, rather than the absolute values themselves, that determine the order of the search results.
• In contrast to the other similarity measures, Jaccard's index compares feature vectors positionally, so it takes into account not just colors, but also their spatial distribution. As a result, it indirectly considers shape, position, orientation and other features.
• When objects have specific features on their surfaces or an irregular (e.g. oblong) shape, Jaccard's index significantly outperforms the other similarity measures. That is easily noticeable in Fig. 5 for the set of roses and Fig. 7 for the set of peppers.
• In general, when there is no a priori information about the image database, Jaccard's index appears to be the best single similarity measure between images. This statement is supported by the data in all tables and figures.
ACKNOWLEDGMENT
This paper is supported by project 2022-EEA-01 "Analysis of big data processing algorithms and their application in multiple subject domains", funded by the Research Fund of the "Angel Kanchev" University of Ruse.