A Perceptual Matching based Deduplication Scheme using Gabor-ORB Filters for Medical Images
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 12, 2020

In the ever-widening field of telemedicine, there is a growing need for intelligent methods that selectively choose the data relevant enough to be transmitted over a network and examined remotely. By the very nature of medical imaging, a large amount of data is generated per imaging or scanning session. For instance, a Magnetic Resonance Imaging (MRI) scan consists of hundreds to thousands of images corresponding to slices of the organ being scanned. Often, not all of these slices are of interest to the medical practitioner during diagnosis. This not only results in the remote access of unwanted data, but also puts greater strain on the available network bandwidth. If the relevant images can be selected automatically, without human intervention and with high sensitivity, these issues can be alleviated. This paper proposes a novel method for the perceptual matching and selection of relevant MRI images: a deduplication technique that combines a Gabor filter with the Oriented FAST and Rotated BRIEF (ORB) feature extraction technique, applied to a large set of MRI scan images. The outcome of this method is a deduplicated set of relevant MRI scan images, which saves bandwidth and is easy for the medical practitioner to verify remotely.

Keywords—Perceptual matching; ORB feature extraction; Gabor filters; MRI scan; deduplication


I. INTRODUCTION
Advanced data frameworks have been progressively deployed in recent medical care scenarios. Indeed, numerous medical clinics and hospitals depend on hospital information systems (HIS), radiology information systems (RIS), and picture archiving and communication systems (PACS) for the storage of MRI scan medical images. These frameworks facilitate the sharing of images among practitioners [1][2][3]. Data deduplication is a method of eliminating repeated copies of data in order to conserve storage capacity. It results in a decreased cost per gigabyte and more space to store data. Radiology centers store petabytes of medical images every year, and this may include redundant data. This growth of repeated data cannot be handled by existing IT technologies [23]. In an MRI scan collection for a single subject and organ, views are taken from three directions. Each direction captures images of slices of the organ at small spacings such as 1 mm, 4 mm, or 6 mm. For a non-suspicious subject, there may be many irrelevant, repeated images which, when stored, waste a great deal of storage space. By the nature of MRI scans of a single subject, relevant details are not distributed evenly across equidistant slices. Hence there is a need for perceptual matching and deduplication when transmitting medical images among practitioners for expert opinions [1]. By combining a Gabor filter, ORB key-point detection, and a Brute-Force matcher, we achieve this perceptual deduplication, extracting relevant data without any specific regard for the distances between the slices. A finely tuned Gabor filter provides the ORB algorithm with just the right amount of detail, leading to the most discriminative matching differences.
This method, being based on ORB, has excellent performance in identifying and matching near-similar images even when a one-to-one positional correspondence between pixels is absent. Therefore, variations such as missing regions, appreciable differences in orientation, differences in size, etc. can be accounted for.
In this paper, a method for fine-tuning the parameters of the Gabor filter is implemented. The paper also describes feature extraction using ORB and the brute-force matching technique, which together ensure the perceptual-matching performance needed to identify repeated data and discard it. The rest of the paper is organized as follows. Section 2 gives an overview of related work. Section 3 explains the algorithms used to find the deduplicated images and the background in which this method is relevant. In Section 4, the results are discussed. The paper is concluded in Section 5.

II. LITERATURE REVIEW
Storage used for medical imaging must remain expandable in the coming years; it is normal for hospital radiology departments to receive terabytes of medical imaging data per year [4]. Deduplication is a very important aspect of the reliability of storage facilities, as the expense of storage management can be minimized by removing duplicated files [5]. A huge amount of memory is wasted in storing the redundant data generated for a single patient in a single session [24].
However, deciding whether two original files are the same by viewing two processed images is not easy. Message-locked encryption was suggested to facilitate the deduplication of processed images. This approach lets users produce the same ciphertext for the same file and thereby reap the benefits of ciphertext deduplication [6]. Techniques have been proposed for the deduplication of encrypted data at different levels. To support deduplication of partially duplicated files, block-level deduplication methods were studied [7]. Cloud media centers were implemented as dedicated deduplication systems where the older versions were no longer functional [8]. Near-duplicate data scanning for encrypted data has also been studied; the wastage such duplicates cause in the storage system is very costly [9].
Deduplication techniques are implemented based on image characteristics. Characteristics such as extracted features, hash-based indexing of extracted image features, and image similarity detection based on a distance threshold were used to find nearly identical images. To analyze exact or near-exact duplicates, different feature extraction algorithms such as Scale Invariant Feature Transform (SIFT), Speeded Up Robust Features (SURF), Principal Component Analysis (PCA), and Binary Robust Independent Elementary Features (BRIEF) were taken into account [10].
The authors in [11] used Difference of Gaussians and Principal Component Analysis with Scale Invariant Feature Transform for feature extraction, and space-efficient Bloom filters were hashed with these features. Locality-sensitive hashing used correlated attributes to find similar images. This method was bandwidth-efficient and saved storage resources. Later, in [12], the authors used perceptual hash algorithms to create image signatures that identified similar duplicated images, resulting in deduplicated storage and bandwidth savings during transmission. Another approach used the MapReduce technique, which manages data in huge quantities in a distributed manner; the authors in [13] used it to build a faster image-duplication identifier. SIFT methods were strengthened by incorporating the k-means algorithm and groups of multiple image clusters; this helped detect near-duplicate images, which were then hashed based on histogram distance [14]. Another line of research used a local binary representation, employing binary vectors and histograms to find duplicate images [15]. Later, a real-time novel method that added Bloom filters to the existing techniques was implemented; it used the correlation property and resulted in reduced processing latency [16].
A feature point detector faster and more efficient than SIFT and SURF, named Oriented FAST (Features from Accelerated Segment Test) and Rotated BRIEF (ORB), is nowadays widely used [17]. It has the advantages of low computation cost and better performance [18]. Medical images can be exchanged very efficiently using the cloud [25], though it has the limitation of numerous applications pointing to the same data at the same time. Due to the increasing need for storage capacity, PACS also has its own limitations [26].

III. BACKGROUND
A complete MRI scan image set is retrieved from the MRI scanner and used by a medical practitioner for diagnosis. When an expert opinion is required from another medical practitioner, the files are first deduplicated to select only the relevant slices. This subset of the scan image set is then encrypted using a pixel-scrambling algorithm based on chaotic maps and intensity variations [27]. On the receiving end, the files are decrypted to retrieve the deduplicated set of scan images. Because of the deduplication applied, both the bandwidth for transferring medical images among practitioners and the storage needed for the relevant data are conserved. Fig. 1 gives a graphical representation of this scenario.

A. Algorithms
Input images are chosen from a complete MRI scan image set with a very small slice width. Images 1 mm apart with no perceptible differences are chosen as similar images, and images 5 mm apart with some valid perceptible differences are selected as dissimilar images for the execution of this method. The following algorithms are used to find the most optimal Gabor filter parameter set; the Gabor filter has parameters which need to be fine-tuned individually for this deduplication method. The result of these algorithms is a set of nearly matched MRI images which can be regarded as duplicates and discarded before transmission.

B. Working
Building Gabor filters and its application, ORB feature extraction, parameter set cleaning, base parameter set selection and iteration, deduplication are the key phases in the proposed deduplication scheme. This scheme is graphically represented in the Fig. 2.
The following subsections deal with each step of the whole deduplication procedure.

1) Build Gabor Filters:
Gabor filters are a special class of band-pass filters which allow certain bands of frequency and reject the rest. Gabor filters are well known for their time- and frequency-transform characteristics. Filters with distinct scales and directions can be constructed from Gabor filters by varying their parameters [19]. In image processing, they are used for feature extraction, edge detection, texture analysis, etc. The Gabor function has the capability to capture localized information with respect to spatial frequency, location, and direction selectivity [20]. Mathematically, Gabor filters [21][22] are expressed as the function:

g(x, y; λ, θ, φ, σ, γ) = exp(−(x′² + γ²y′²) / (2σ²)) · exp(i(2πx′/λ + φ))

where x′ = x cos θ + y sin θ and y′ = −x sin θ + y cos θ.

The build-filter function iterates through all Gabor filter parameters and creates a kernel for each set of parameters. The range of values for each parameter, namely Ksize (x, y), Sigma (σ), Theta (θ), Lambda (λ), and Gamma (γ), is predefined during this phase. These ranges are as follows: Ksize from the set {3, 31}; Sigma from 1 to 10 with a step size of 1; Theta from 0 to π with a step size of π/16; Lambda from 10 to 100 with a step size of 10; and Gamma from 0.1 to 1.2 with a step size of 0.2.
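As a concrete illustration, the kernel construction and the parameter grid described above can be sketched in Python with NumPy. This is a minimal sketch of the formulation given in the text (with φ written as psi); the paper's actual implementation presumably relies on OpenCV's cv2.getGaborKernel, which produces an equivalent real-valued kernel.

```python
import numpy as np

def gabor_kernel(ksize, sigma, theta, lambd, gamma, psi=0.0):
    """Complex Gabor kernel following the formulation in the text:
    g = exp(-(x'^2 + gamma^2 y'^2) / (2 sigma^2)) * exp(i (2 pi x'/lambd + psi))."""
    half = ksize // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float64)
    x_r = x * np.cos(theta) + y * np.sin(theta)   # x' (rotated coordinate)
    y_r = -x * np.sin(theta) + y * np.cos(theta)  # y'
    envelope = np.exp(-(x_r ** 2 + gamma ** 2 * y_r ** 2) / (2.0 * sigma ** 2))
    carrier = np.exp(1j * (2.0 * np.pi * x_r / lambd + psi))
    return envelope * carrier

# Parameter grid mirroring the ranges given in the text.
param_grid = [
    (ksize, sigma, theta, lambd, gamma)
    for ksize in (3, 31)
    for sigma in range(1, 11)
    for theta in [k * np.pi / 16 for k in range(16)]
    for lambd in range(10, 101, 10)
    for gamma in np.arange(0.1, 1.2, 0.2)
]
```

The grid enumerates 2 × 10 × 16 × 10 × 6 parameter combinations; one kernel is built per combination during the search phase.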
2) Apply Gabor Filters:
Three images are selected to form two pairs of images, where one image is common to both. The first pair is formed from MRI images taken 1 mm apart and represents similar images, whereas the second pair represents dissimilar images and is taken 5 mm apart. The pairs of similar and dissimilar images are read into the program and passed to the apply-filter function. Both images are filtered using the kernels, producing the filtered Gabor images.
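The filtering step can be sketched as a plain correlation of each image with the (real part of the) Gabor kernel. The small_gabor helper, the synthetic slice pair, and the edge-padding choice below are illustrative assumptions rather than the paper's exact code; in an OpenCV pipeline this step would normally be done with cv2.filter2D.

```python
import numpy as np

def small_gabor(ksize=9, sigma=2.0, theta=0.0, lambd=8.0, gamma=0.5):
    """Tiny real-valued Gabor kernel (cosine carrier) for demonstration."""
    half = ksize // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float64)
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    return np.exp(-(xr ** 2 + gamma ** 2 * yr ** 2) / (2 * sigma ** 2)) \
        * np.cos(2 * np.pi * xr / lambd)

def apply_gabor(image, kernel):
    """Correlate the image with the real part of a Gabor kernel
    (edge padding keeps the output the same size as the input)."""
    k = np.real(kernel)
    kh, kw = k.shape
    padded = np.pad(image.astype(np.float64),
                    ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="edge")
    out = np.zeros_like(image, dtype=np.float64)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * k)
    return np.clip(out, 0, 255).astype(np.uint8)

# Synthetic stand-ins for two adjacent MRI slices (1 mm apart).
rng = np.random.default_rng(0)
slice_a = rng.integers(0, 256, (64, 64)).astype(np.uint8)
slice_b = np.roll(slice_a, 1, axis=0)
kernel = small_gabor()
filtered_a = apply_gabor(slice_a, kernel)
filtered_b = apply_gabor(slice_b, kernel)
```

Both filtered outputs are then handed to the key-point extraction stage described next.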
3) ORB Key-point Extraction:
These images are compared using ORB key-point extraction and a brute-force matcher (BFMatcher). The ORB function returns the number of key-points found in each image and the matching percentage between them. The structural similarity index between the images is also computed. These values, along with the Gabor filter parameters, are saved into a CSV file for further processing.

4) Parameter Set Cleaning:
The CSV file is cleaned so as to extract the valid sets of parameters. The cleaning process is as follows: any parameter set which gives a very low number of key-points is discarded. All sets which generate very low or very high matching percentages are also discarded. Any set that produces an abnormally high structural similarity index is discarded, since a high structural similarity index between the filtered images indicates a heavy loss of detail. Conversely, a high minimum threshold on the number of key-points in either image ensures that an appreciable level of detail is present.
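The cleaning rules above amount to simple row filters over the saved metrics. All threshold constants below are hypothetical placeholders, since the paper does not state the exact cut-off values.

```python
# Hypothetical thresholds; the exact cut-offs are not given in the text.
MIN_KEYPOINTS = 50
MIN_MATCH_PCT, MAX_MATCH_PCT = 10.0, 95.0
MAX_SSIM = 0.98

def clean_parameter_sets(rows):
    """Keep only parameter sets that produced enough key-points, a usable
    matching percentage, and a structural similarity index that is not
    suspiciously high (which would mean the filter destroyed the detail)."""
    kept = []
    for row in rows:
        if min(row["kp1"], row["kp2"]) < MIN_KEYPOINTS:
            continue  # too few key-points: not enough detail survived
        if not (MIN_MATCH_PCT <= row["match_pct"] <= MAX_MATCH_PCT):
            continue  # degenerate matching percentage
        if row["ssim"] > MAX_SSIM:
            continue  # both outputs near-identical: detail lost
        kept.append(row)
    return kept

rows = [
    {"kp1": 200, "kp2": 180, "match_pct": 60.0, "ssim": 0.70},   # kept
    {"kp1": 5,   "kp2": 300, "match_pct": 60.0, "ssim": 0.70},   # too few key-points
    {"kp1": 200, "kp2": 180, "match_pct": 99.0, "ssim": 0.70},   # match % too high
    {"kp1": 200, "kp2": 180, "match_pct": 60.0, "ssim": 0.995},  # SSIM too high
]
```

Only rows passing all three filters survive into the base-parameter selection stage.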

5) Base Parameter Select:
Parameter sets are sorted in descending order of their matching percentages. The first few parameter sets with the highest matching percentages are retained and the rest are discarded. These parameter sets are then used to find the matching percentages for the dissimilar pair of images.
The difference between the matching percentages of the similar and dissimilar pairs is found for each parameter set. These parameters, along with the key-point counts, matching percentages, and matching-percentage differences, are stored in a CSV file. They are then sorted in descending order of their matching-percentage differences, and the parameter set with the highest difference is selected as the base or default parameter set for the subsequent steps.
This parameter set represents the maximum gap between the matching percentages of similar and dissimilar images. Therefore, it is the best provisional choice for deciding whether two images are similar or not.
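The selection procedure above amounts to a sort, a truncation, and an argmax over the matching-percentage gap. The field names and the top_n value in this sketch are illustrative assumptions.

```python
def select_base_parameters(rows, top_n=10):
    """Retain the top_n sets by similar-pair matching percentage, then pick
    the one with the largest (similar - dissimilar) matching gap."""
    top = sorted(rows, key=lambda r: r["match_pct_similar"], reverse=True)[:top_n]
    for r in top:
        r["match_diff"] = r["match_pct_similar"] - r["match_pct_dissimilar"]
    # The widest gap best separates similar from dissimilar slices.
    return max(top, key=lambda r: r["match_diff"])

rows = [
    {"params": "A", "match_pct_similar": 90.0, "match_pct_dissimilar": 70.0},
    {"params": "B", "match_pct_similar": 85.0, "match_pct_dissimilar": 40.0},
    {"params": "C", "match_pct_similar": 95.0, "match_pct_dissimilar": 88.0},
]
base = select_base_parameters(rows, top_n=3)
```

Here set "B" wins: it does not have the highest similar-pair match, but it has the largest gap between similar and dissimilar pairs, which is the property the method optimizes.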

6) Base Parameter Iteration:
Keeping the base parameter set as the default value for each parameter, one parameter at a time is iterated over a range of values characteristically valid for that particular parameter, except for Theta. For every iteration, 16 Gabor filter kernels with Theta ranging from 0 to π in steps of π/16 are formed, and each set of 16 filtered images is combined to form the new pair. ORB key-point extraction and Brute-Force matching are performed on this new pair of images. Metrics such as key-point counts, structural similarity, matching percentages, and matching-percentage differences are found and stored for further analysis.
The value of each parameter that gave the highest matching difference in its respective iteration was selected and used to form a final parameter set. This set was then used to find the matching differences between similar and dissimilar images, in order to check whether parameter values that individually produced the best matching difference (with the other parameters at their default values) could produce better results when collected together into a single parameter set. Depending on the result, the combined set was either retained or discarded.
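The one-parameter-at-a-time search can be sketched as a coordinate-wise loop. Here match_difference is a toy stand-in for the full Gabor/ORB/BFMatcher pipeline, with an arbitrary optimum chosen purely for illustration; the final check mirrors the retain-or-discard decision described above.

```python
# Hypothetical surrogate for the full Gabor -> ORB -> BFMatcher pipeline;
# it scores one parameter set by its matching-percentage difference.
def match_difference(params):
    # Toy function with a known optimum (sigma=3, lambd=60), for illustration.
    return -(params["sigma"] - 3) ** 2 - (params["lambd"] - 60) ** 2 / 100.0

base = {"ksize": 13, "sigma": 5, "lambd": 30, "gamma": 0.5}
search_ranges = {
    "sigma": range(1, 11),
    "lambd": range(10, 101, 10),
}  # Theta is excluded: all 16 orientations are always combined per filter bank.

final = dict(base)
for name, values in search_ranges.items():
    trial = dict(base)  # other parameters stay at their base defaults
    best_val = max(values, key=lambda v: match_difference({**trial, name: v}))
    final[name] = best_val

# Retain the combined set only if it beats the base set.
if match_difference(final) < match_difference(base):
    final = base
```

With the toy surrogate, the coordinate-wise winners (sigma = 3, lambd = 60) also win when combined, so the final set is retained; with the real pipeline this is exactly the check the paper performs.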

7) Deduplication:
After finding the best-case parameter set for the Gabor filter, it can be used for deduplication on the entire MRI scan image set. With the filter now fine-tuned for MRI scans, well-defined thresholds can be set for deciding whether two images are similar or dissimilar. By running this in an iterative manner, the entire set is deduplicated and only the relevant slices or images are retained.
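A minimal sketch of this deduplication pass, assuming each incoming slice is compared against the last retained slice and dropped when the matching percentage exceeds a tuned threshold; match_fn stands in for the tuned Gabor + ORB + Brute-Force comparison, and the toy values below are purely illustrative.

```python
def deduplicate(slices, match_fn, match_threshold=80.0):
    """Scan an ordered MRI slice sequence and drop any slice whose matching
    percentage against the last retained slice reaches the threshold."""
    kept = []
    for img in slices:
        if kept and match_fn(kept[-1], img) >= match_threshold:
            continue  # near-duplicate of the previous retained slice
        kept.append(img)
    return kept

# Toy stand-in: slices are scalars, and closer positions match more strongly.
toy_match = lambda a, b: 100.0 - 30.0 * abs(a - b)
slices = [0.0, 0.1, 0.2, 1.0, 1.1, 2.5]
kept = deduplicate(slices, toy_match, match_threshold=80.0)
```

In the toy run, the 0.1/0.2 and 1.1 slices match their predecessors above the threshold and are discarded, leaving only the perceptually distinct slices to transmit.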
The result of finding feature matching points using ORB feature extraction and the Brute-Force matcher is shown in Fig. 3.

IV. RESULT AND DISCUSSION
All the above-mentioned steps were tried using SIFT, SURF, and ORB key-point extraction with their preferred matchers. It was found that ORB provided the best efficiency in both matching performance and execution time. Hence ORB, along with the Brute-Force matcher, was selected as the technique for further analysis in this paper. The results of iterating the base parameter set to fine-tune each parameter of the Gabor filter are as follows.

A. Ksize
Ksize is the size of the Gabor filter kernel; a Ksize of 31 represents a kernel of 31x31 pixels. To improve the time-efficiency of the initial iterations, just a single pair of Ksize values was selected. This pair, (3, 31), represents the two extremes of values that Ksize could take. In later iterations with a variable Ksize and default values for the other parameters, it was noted that the key-point count and matching difference stabilized beyond a Ksize of 13. Fig. 4 shows these iterations graphically. This validated the assumption that time-efficiency could be improved by choosing two extreme values rather than an extensive range. It also shows that the matching difference is individually sensitive to Ksize only over a limited range of values, and that the default Ksize giving the best overall performance could well lie outside this high-sensitivity range. Table I summarizes the similar key-point count, dissimilar key-point count, total matched key-points, and matching differences with varying Ksize in the Gabor filter. The iteration was done in steps of 2, starting from 3 up to 31; above this, the results remained constant.

B. Sigma
Sigma represents the bandwidth of the Gabor envelope; higher sigma values increase the overall size of the envelope. It was noted that the matching difference for medical scan images is highest for low values of sigma, despite a sudden increase in the key-point count, as can be clearly seen in Fig. 5. The key-point count stabilizes beyond a sigma of 3, but the matching difference shows an overall downward trend and keeps fluctuating at higher values of sigma. This is due to the larger amount of detail passed through the Gabor filter at higher sigma values, which adversely affects matching performance even though there are more features from which to extract key-points. Hence matching performance is highly sensitive to sigma. Sigma variation also has the greatest effect on the perceptual state of the output image. Table II shows the iterated values used to fix the sigma value.

C. Lambda
Lambda, or wavelength, governs the width of the Gabor function; higher lambda values produce thicker Gabor strips. As can be noted from Fig. 6, there is an appreciable improvement in the matching difference as lambda increases up to a specific level, meaning that the increasing amount of detail in the image due to the wavelength change is constructive towards a better matching difference. Beyond the point of optimality, however, the extra detail works against the match and no longer produces the best possible matching difference, although the overall stability of the matching difference remains quite high for lambda values above this point. Table III shows the iterated values for fixing the lambda value.

D. Gamma
Gamma is the spatial aspect ratio of the Gabor function. A higher gamma results in a lower height of the filter, while a lower value increases the height. For a series of increasing gamma values, a seesaw behavior in the matching difference was observed: for lower key-point counts, the matching difference improved with increasing gamma, while at higher ranges, with more detail in the images, it decreased and then stabilized. Table IV shows the iterations for finding the gamma value for the Gabor filter, shown graphically in Fig. 7.

E. Discussion
MRI scan images of a subject were taken at two different periods, live from a radiology center. The subject under experiment had been diagnosed with a brain tumor. This paper deals with the case where the subject, or patient, has undergone treatment and is on regular follow-ups. The above observations justify that MRI scan images of this particular type work very well within a narrow set of filter parameter values, which validates the need for an algorithm that fine-tunes the filter parameters to the application, as proposed in this paper. When each parameter value from their individual point of