Clustering: Applied to Data Structuring and Retrieval

Abstract: Clustering is a very useful scheme for data structuring and retrieval because it can handle large volumes of multi-dimensional data and employs a very fast algorithm. Other data structuring techniques include hashing and binary tree structures. However, clustering has the advantage of low computational storage requirements and a fast algorithm. In this paper, clustering, k-means clustering and the approaches to effective clustering are extensively discussed. Clustering was employed as a data grouping and retrieval strategy in the filtering of fingerprints in the Fingerprint Verification Competition 2000 database 4(a). An average penetration of 7.41% obtained from the experiment shows clearly that the clustering scheme is an effective retrieval strategy for the filtering of fingerprints.


I. INTRODUCTION
A collection of datasets may be too large to handle and work on, and hence may be better grouped according to some data structure. Large datasets are encountered in filing systems in digital libraries, in access to and caching of data in databases, and in search engines. Given the high volume of data, there is a need for fast access to and retrieval of required or relevant data. Several of the existing data structures are hashing [1,2,3,4,5,6], search trees [7,8], and clustering [9]. Hashing is a technique that utilizes a hash function to convert large values into hash values and maps similar large values to the same hash values or keys in a hash table. Clustering is, however, a useful and efficient data structuring technique because it can handle datasets that are very large and at the same time n-dimensional (more than 2 dimensions), and similar datasets are assigned to the same clusters [9]. A 2D or 3D point can be imagined and illustrated; however, it is difficult to imagine or illustrate 9-dimensional data. When datasets are clustered, the clusters can be used rather than the individual datasets.
Clustering is a process of organizing a collection of data into groups whose members are similar in some way [9,10,11,12]. According to Jain et al. [13], "Cluster analysis is the organization of a collection of patterns (usually represented as a vector of measurements, or a point in a multidimensional space) into clusters based on similarity". Similarity is determined using a distance measure, and objects are assigned to the same cluster if they are similar according to some defined distance measure. Cluster analysis differs from classification because in clustering the data are not labeled and hence are naturally partitioned by the clustering algorithm, whereas in classification the data are labeled and partitioned according to their labels. The former is hence an unsupervised mode of data structuring while the latter is supervised [13]. Jain [14] identifies three main reasons why data clustering is used: to understand the underlying structure of the data; to determine the degree of similarity amongst the data in their natural groupings; and to compress data by summarizing the data by cluster groups.
Clustering has wide application in the life sciences, physical and social sciences, and especially in the disciplines of Engineering and Computer Science. Clustering is used for pattern analysis, recognition and classification, data mining and decision making in areas such as document retrieval, image processing and statistical analysis and modeling [13]. Documents may be clustered for fast information access [15] or retrieval [16]. Clustering is used in image processing to segment images [17] as well as in marketing, biology, psychiatry, geology, geography and archeology [13]. Figure 1 shows a general data clustering illustration. The data are grouped in clusters. Each cluster has a collection of data that are similar. A cluster is a group of similar datasets represented by an n-dimensional value given by the cluster centroid. Clusters may also be defined as "high density regions separated by low density regions in the feature space" [13].
Every cluster is assumed to have a centroid, which is the arithmetic mean of all data in that cluster. The mean is what is common to data assigned to a cluster, and the creation of clusters builds from the arithmetic mean. A similarity measure is used for the assignment of patterns or features to clusters.
[Figure 1 labels: Data, Clustered data, Similar data, Dissimilar data]
www.ijacsa.thesai.org

II. CLUSTER SIMILARITY MEASURES
Similarity is fundamental to the definition of a cluster; hence a measure of similarity, otherwise known as the distance measure, is essential. The dissimilarity or similarity between points in the feature space is commonly calculated in cluster analysis [13]. Some of the distance measures used are the Euclidean, Manhattan, Chebyshev and Hamming distances. The distance metric is used for computing the distance between two points and cluster centers. For the distance measures explained in the following sections, two points, a and b, are defined in an n-dimensional space by their coordinates as:
$$a = (w_0, x_0, y_0, \ldots, z_0) \qquad (1)$$
$$b = (w_1, x_1, y_1, \ldots, z_1) \qquad (2)$$

A. Euclidean distance
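Euclidean distance is the straight-line distance between two points, a and b, in an n-dimensional space (the distance "as the crow flies"):
$$D(a,b) = \sqrt{(w_0 - w_1)^2 + (x_0 - x_1)^2 + \cdots + (z_0 - z_1)^2}$$
where n is the number of dimensions, that is, the number of squared terms under the root. The Euclidean distance is the most commonly used metric because it is appealing to use in an n-dimensional space and it works well with isolated clusters [13].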
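As a minimal sketch of the Euclidean measure, the function below computes the straight-line distance between two points given as equal-length sequences of coordinates. The function name `euclidean_distance` and the sample points are illustrative assumptions, not taken from the paper.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points of equal dimension."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    # Sum the squared coordinate differences, then take the square root.
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

# Illustrative 2-D points (assumed values): a 3-4-5 right triangle.
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```

The same function works unchanged for points of any dimension, which is the property the text highlights for n-dimensional data.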

B. Manhattan distance
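In the Manhattan distance, the distance between two points is the sum of the absolute differences of their coordinates:
$$D(a,b) = |w_0 - w_1| + |x_0 - x_1| + \cdots + |z_0 - z_1|$$
The difference between the Euclidean distance and the Manhattan distance is that the Euclidean distance sums the squared coordinate differences and takes the square root of the sum, while the Manhattan distance sums the absolute differences without squaring.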
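A short sketch of the Manhattan measure, under the same assumptions as before (points as equal-length coordinate sequences; the name `manhattan_distance` and the sample points are illustrative):

```python
def manhattan_distance(a, b):
    """Sum of absolute coordinate differences between two points."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    return sum(abs(p - q) for p, q in zip(a, b))

# Illustrative 2-D points (assumed values).
print(manhattan_distance((0, 0), (3, 4)))  # 7
```

For these two points the Manhattan distance is 7, whereas the straight-line Euclidean distance between them is 5, which shows how the two measures can rank separations differently.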

C. Chebyshev distance
In the Chebyshev distance metric, the distance between two points is the greatest of their differences along any coordinate dimension [18]:
$$D(a,b) = \max\left(|w_0 - w_1|, |x_0 - x_1|, \ldots, |z_0 - z_1|\right)$$
This distance is named after Pafnuty Chebyshev. It is also known as the chessboard distance. On a chessboard, the length of the side of a square may be taken as one unit. In this case, the minimum number of moves needed by a king to go from one square to another equals the Chebyshev distance between the centers of the squares.
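The chessboard interpretation can be checked with a small sketch. Squares are represented here as (file, rank) integer pairs, which is an assumption made only for illustration, as is the function name.

```python
def chebyshev_distance(a, b):
    """Largest absolute coordinate difference between two points."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    return max(abs(p - q) for p, q in zip(a, b))

# A king on square (0, 0) moving to square (3, 4): it can cover one file
# and one rank per diagonal move, so the larger difference decides.
print(chebyshev_distance((0, 0), (3, 4)))  # 4
```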

D. Hamming distance
The Hamming distance is a way of determining the similarity of two strings of digits of equal length by measuring the number of substitutions required to change one string into the other. It is the number of positions at which corresponding digits in the two strings are different [19].
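A minimal sketch of the Hamming measure, using the paper's own example strings a = 0110110 and b = 1110011; the function name is an illustrative assumption.

```python
def hamming_distance(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("strings must have equal length")
    return sum(c1 != c2 for c1, c2 in zip(s, t))

# The two example strings from the paper differ in three positions.
print(hamming_distance("0110110", "1110011"))  # 3
```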

A. Exclusive clustering
In exclusive clustering, data that belong to a particular cluster cannot belong to another cluster. An example is K-means clustering.

B. Overlapping clustering
In overlapping clustering, data may belong to two or more clusters. An example is fuzzy c-means clustering.

C. Hierarchical clustering
In this case, clusters are represented in tree form. Two close clusters are derived from the top-level cluster. The hierarchy is built by individual elements progressively merging into bigger clusters.
Figure 2 shows the types of data clustering algorithms.
Jain et al. [13] classify clustering algorithms as hierarchical and partitional. In hierarchical clustering, each cluster arises from and depends on the parent cluster. A typical partitional clustering algorithm is the K-means algorithm.

IV. K-MEANS CLUSTERING
The K-means clustering algorithm was first proposed over 50 years ago [14] and is commonly preferred to other clustering algorithms because of its ease of implementation and efficiency in cluster analysis. K-means clustering is a type of cluster analysis that partitions n observations into k disjoint clusters, k << n, such that the number of clusters is much smaller than the number of observations [18,20]. The k-means algorithm partitions n observations, $\{O_i \mid i = 1, 2, \ldots, n\}$, into k clusters, $\{C_j \mid j = 1, 2, \ldots, k\}$. This is illustrated in Figure 3. Any of the distance metrics, which include the Euclidean, Manhattan, Chebyshev and Hamming distances, may be used as the distance measure for determining the similarity of the datasets, though the Euclidean distance is the most preferred and widely used [14].
The K-means algorithm basically follows these steps.
• A similarity or distance measure is chosen and used throughout.
• K centroids are chosen.
• The distance of each dataset from each of the k centroids is determined.
• Each dataset is then assigned to the centroid from which it has the minimum distance.
• All datasets are hence assigned to a particular centroid. Figure 4 shows a very simple illustration using 5 datasets and 2 clusters.
• The arithmetic mean is recalculated for each of the k centroids, and the distance of each dataset from each of the k new means is recalculated. This is the second iteration.
• The datasets are reassigned again to the new k centroids. In other words, a dataset assigned, for instance, to centroid 2 in the first iteration may be reassigned to centroid 1 in the second iteration.

V. APPROACHES TO EFFECTIVE CLUSTER ANALYSIS
Several challenges to cluster analysis include the difficulties in choosing an appropriate clustering algorithm; representing the data to be clustered; choosing a suitable similarity measure; determining which data should be used; and choosing a suitable number of clusters that would yield maximum success. A user is not faced with these problems in hierarchical clustering analysis since all the datasets are related. These problems arise when a dataset is to be classified into unique clusters.
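Returning to the K-means steps listed in Section IV, the loop of assigning, recomputing means and reassigning can be sketched as below. This is an illustrative sketch rather than the implementation used in the paper: taking the first k points as the initial centroids, the `max_iterations` cap, and the five sample points are all assumptions.

```python
import math

def euclidean(a, b):
    """Euclidean distance, the measure chosen and used throughout."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def mean_point(points):
    """Arithmetic mean of a non-empty list of points, per coordinate."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, distance=euclidean, max_iterations=100):
    # Assumption: the first k points serve as the initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    assignments = None
    for _ in range(max_iterations):
        # Assign every point to the centroid at minimum distance.
        new_assignments = [
            min(range(k), key=lambda j: distance(p, centroids[j]))
            for p in points
        ]
        # Stop when assignments match those of the previous iteration.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Recompute each centroid as the mean of its assigned points.
        for j in range(k):
            members = [p for p, c in zip(points, assignments) if c == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = mean_point(members)
    return centroids, assignments

# Five illustrative points and two clusters (assumed values).
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.2, 0.8)]
centroids, assignments = k_means(points, k=2)
print(assignments)  # [0, 0, 1, 1, 0]
```

In this run the point (1.5, 2.0) starts in cluster 1 and moves to cluster 0 in the second iteration, which is the kind of reassignment the steps describe. The `distance` parameter lets another measure be substituted for the Euclidean one.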
There are many partitional clustering algorithms, and the user may be faced with the dilemma of choosing an appropriate one. The choice should be guided by the purpose and goal of the clustering exercise, which in turn guide how the data to be clustered are represented. It is also necessary to know if the dataset has a clustering tendency [18] and if it should be normalized. A dataset that does not have a clustering tendency should not be clustered, as it would yield invalid clusters. For example, if a dataset in which all the data are similar, and which consequently has no variance, is clustered, it would result in invalid clusters. On the other hand, a dataset with high variance has a clustering tendency.
The choice of the number of clusters may pose a problem because the performance of the clustering algorithm is affected by the number of clusters. It is usually difficult to determine the number of partitions that will give the best and valid clustered groups. Some factors that may be considered while choosing the number of clusters are the size of the dataset and the variance of the data in the dataset. If the dataset is so widely varied that it needs to be classified into many groups, then it may make sense to use more clusters.
In feature classification, the success of the cluster analysis is largely dependent on the feature set. The clustering algorithm would perform well and give compact, isolated and valid clusters if the choice of features is good [18]. If, for instance, a database of face images of a multi-racial group comprising African, Chinese, Latin American, Indian and European faces needs to be clustered into five different groups, the features used should be such that the faces can be effectively separated into five valid clusters. The success of this task is clearly dependent on the features used for the separation. The features making up the dataset play a vital role in clustering analysis.
A similarity measure is required for separating data into clusters. The choice of the similarity measure is a challenging problem because the valid clustering of the data also depends on the similarity metric. The performance of the cluster analysis varies according to the similarity metric used, and hence it may be difficult to determine the similarity metric that would give the best performance. This problem can be overcome by having a good understanding of the data to be clustered.

VI. CLUSTERING USED AS A FINGERPRINT INDEXING RETRIEVAL STRATEGY
An indexing technique must include a retrieval strategy. A retrieval strategy defines the method by which data within the same class as the query or input data are retrieved. In fingerprint indexing, the retrieval strategy ensures that fingerprints with index codes [21] similar to that of the query fingerprint are retrieved from the database of enrolled fingerprints.
In this work, a modified version of Ross's partitional clustering scheme [22] is used as a fingerprint retrieval strategy by compressing the numerous fingerprint features into similar groups of data, and hence limiting the search for similar fingerprints to only the few clusters that match the cluster of the query fingerprint. This requires the following:
• First, the creation of an index space of k clusters for the indexing, using the k-means algorithm and the Euclidean distance similarity measure.
• Secondly, the assignment of the features of the fingerprints in the enrolled database to the k clusters.
• Thirdly, the determination of the clusters, c << k, that have the features of the fingerprints similar to a query fingerprint.
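The three requirements above can be illustrated with a simplified sketch. It is not Ross's scheme [22] or the modified scheme used in this work: the centroids are hard-coded rather than produced by k-means, each fingerprint is reduced to a few 2-D feature vectors, and the names `nearest_clusters`, `build_index` and `candidate_list` are hypothetical.

```python
import math
from collections import defaultdict

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def nearest_clusters(feature, centroids, c=1):
    """Indices of the c centroids closest to a feature vector."""
    order = sorted(range(len(centroids)),
                   key=lambda j: euclidean(feature, centroids[j]))
    return order[:c]

def build_index(enrolled, centroids):
    """Map each cluster to the IDs of fingerprints with a feature in it."""
    index = defaultdict(set)
    for fingerprint_id, features in enrolled.items():
        for feature in features:
            cluster = nearest_clusters(feature, centroids)[0]
            index[cluster].add(fingerprint_id)
    return index

def candidate_list(query_features, index, centroids, c=1):
    """Union of fingerprints stored in the c clusters nearest each query feature."""
    candidates = set()
    for feature in query_features:
        for j in nearest_clusters(feature, centroids, c):
            candidates |= index.get(j, set())
    return candidates

# Hypothetical index space of k = 3 clusters and three enrolled fingerprints.
centroids = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
enrolled = {
    "fp_A": [(0.2, 0.1), (0.5, -0.3)],
    "fp_B": [(5.1, 4.8)],
    "fp_C": [(9.7, 0.4), (4.9, 5.3)],
}
index = build_index(enrolled, centroids)
print(sorted(candidate_list([(5.0, 5.2)], index, centroids)))  # ['fp_B', 'fp_C']
```

Only the clusters nearest the query features are searched, so the candidate list here holds two of the three enrolled fingerprints instead of the whole database.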
A query fingerprint should have a matching identity in the list of fingerprints output by the indexing algorithm. This list is otherwise known as the candidate list. The ratio of the number of fingerprints in the candidate list to the database size gives the penetration rate of a query fingerprint. The penetration rate is the fraction of fingerprint identities, including the genuine fingerprint, retrieved from the database upon presentation of an input fingerprint. The penetration rate determined for a number of tests, T, in a database of size, N, is given in [23].
The fingerprint features were extracted using the minutiae quadruplets technique [24].
• 30 clusters were created in the index space using fingerprints from FVC 2002 database 4(a).
• The FVC 2000 database 4(a) was divided into two equal groups, Group A and Group B.
• Group A had 400 fingerprints, made up of the first four impressions of each subject.
• Group B had 400 fingerprints, made up of the last four impressions of each subject.
• The fingerprint features of Group A were assigned to the 30 clusters in the index space.
• The fingerprints of Group B were used to query the index space to find a matching identity, determined by the penetration of the database. Every query resulted in a penetration rate. The majority of the queries had low penetration rates while some had high penetration rates.
The penetration rates of the 400 query fingerprints used in the experiments are shown in Table I. The average penetration for the 400 query fingerprints is obtained using Equation (8) and can also be determined from Table I as:
$$\text{Average penetration} = \frac{\sum f x}{T}$$
where f x is the product of the first and third columns in Table I and T is the number of queries, corresponding to the number of tests in the experiment. There were 400 queries.
The retrieval of a candidate list for a query fingerprint takes 0.592 ms.
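The penetration calculations can be sketched as below. The sketch assumes that the penetration rate over T tests is the mean of the per-query ratios of candidate-list size to database size, which is how the surrounding text describes it. The function names and all numbers are hypothetical and are not the values in Table I.

```python
def penetration_rate(candidate_count, database_size):
    """Fraction of the database retrieved for one query."""
    return candidate_count / database_size

def average_penetration(candidate_counts, database_size):
    """Mean penetration rate over T queries (assumed form of the formula)."""
    rates = [penetration_rate(c, database_size) for c in candidate_counts]
    return sum(rates) / len(rates)

def average_from_frequency_table(rows):
    """Mean of penetration values x weighted by their frequencies f.

    rows is a list of (x, f) pairs; T is the total of the frequencies.
    """
    total_queries = sum(f for _, f in rows)
    return sum(x * f for x, f in rows) / total_queries

# Hypothetical candidate-list sizes for T = 4 queries against N = 400.
candidate_counts = [20, 40, 20, 80]
N = 400
print(f"{average_penetration(candidate_counts, N):.2%}")  # 10.00%

# The same four queries summarised as a frequency table of (x, f) pairs.
table = [(0.05, 2), (0.10, 1), (0.20, 1)]
print(f"{average_from_frequency_table(table):.2%}")  # 10.00%
```

Both routes give the same average, which is why the text notes that the value can be obtained either from the per-query formula or from the tabulated frequencies.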
VIII. COMPARISON WITH OTHER DATA STRUCTURING TECHNIQUES
In [25], a binary tree based approach was used for matching fingerprints, whereas the work done in this paper is indexing. Nevertheless, the computational time for a fingerprint match using the binary tree technique in [25] is compared in Table II with the computational time for indexing a query fingerprint using the clustering technique described in this paper.

IX. CONCLUSION
In this paper, clustering was discussed extensively. Experiments were conducted by employing a modified clustering scheme as a retrieval strategy for filtering fingerprints. The average penetration, 7.41%, is very small, showing clearly that the clustering algorithm employed is an effective scheme for the filtering and retrieval of the candidate fingerprints for a given query fingerprint.

Figure 1. Data Clustering

Given two strings a and b, where a = 0110110 and b = 1110011, the Hamming distance between the two strings is D(a,b) = 3, as the corresponding digits differ in three places.

III. CLASSIFICATION OF CLUSTERING ALGORITHMS
Clustering algorithms may be classified as:
• Exclusive clustering
• Overlapping clustering
• Hierarchical clustering

Figure 2. Types of Data Clustering Algorithms

Figure 3. K-means Clustering

• The arithmetic mean is recalculated and the datasets are reassigned again. This continues for i iterations, and the iteration stops when there is no change in the assignments between the ith iteration and the (i-1)th iteration. The last k centroids define the k clusters.

Figure 4. K-means clustering algorithm

where $\{C_j \mid j = 1, 2, \ldots, T\}$ is the size of the candidate list of the fingerprints. The lower the penetration rate, the better the performance of the algorithm.

VII. EXPERIMENTS AND RESULTS
The Fingerprint Verification Competition (FVC) 2000 database 4(a) and FVC 2002 database 4(a) were used for this experiment. Each database has 800 fingerprints from 100 subjects at 8 impressions per subject.

TABLE II. COMPARISON OF COMPUTATIONAL TIMES OF THE BINARY TREE AND CLUSTERING TECHNIQUES ON FVC 2002 DB1 SET A