An Approach to Finding Similarity Between Two Community Graphs Using Graph Mining Techniques

Graph similarity has been studied in the fields of shape retrieval, object recognition, face recognition, and many other areas. It is sometimes important to compare two community graphs for similarity, which makes it easier to mine reliable knowledge from a large community graph. Once similarity has been established, the necessary knowledge can be mined from only one community graph rather than both, which saves time. This paper proposes an algorithm for checking the similarity of two community graphs using graph mining techniques. Since a large community graph is difficult to visualize, compression is essential. The proposed method appears to be easier and faster for checking the similarity between two community graphs, since the comparison is made between the two compressed community graphs rather than the actual large community graphs.


I. INTRODUCTION
Graphs arise in many situations, such as a web graph of documents, a social network graph of friends, or a road-map graph of cities. Graph mining has grown rapidly over the last two decades because the number and size of graphs have been growing exponentially (with billions of nodes and edges), and because much more complicated information is to be extracted from them. Graph similarity has numerous applications in social networks, image processing, biological networks, chemical compounds, and computer vision, and many algorithms and similarity measures have therefore been suggested for it. One notion of graph similarity is that "a node in one graph is similar to a node in another graph if their neighborhoods are similar" [1].

II. LITERATURE SURVEY
Graphs are a general object model, and graph similarity has been studied in many fields. Similarity measures for graphs have been used in systems for shape retrieval [2], object recognition [3], and face recognition [4]. For all of those measures, graph features specific to the graphs in the application are exploited to define graph similarity. Examples of such features are a given one-to-one mapping between the vertices of different graphs, or the requirement that all graphs be of the same order.
A very common similarity measure for graphs is the edit distance. It uses the same principle as the well-known edit distance for strings [5,6]. The idea is to determine the minimal number of insertions and deletions of vertices and edges needed to make the compared graphs isomorphic. In [7], Sanfeliu and Fu extended this principle to attributed graphs by introducing vertex relabeling as a third basic operation besides insertion and deletion. In [8], the measure is used for data mining in graphs.
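To make the underlying principle concrete, the following is a minimal sketch of the classic string edit distance (Levenshtein distance) on which the graph edit distance is modeled. It is not taken from the cited works: the function name editDistance is hypothetical, and the sketch assumes unit cost for insertion, deletion, and substitution.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Classic string edit distance (Levenshtein), assuming unit costs:
// the minimal number of single-character insertions, deletions, and
// substitutions needed to turn string a into string b.
// The name editDistance is illustrative, not from the paper.
std::size_t editDistance(const std::string& a, const std::string& b) {
    // prev holds the previous row of the dynamic-programming table, cur the current one.
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;  // empty prefix of a -> j insertions
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;  // prefix of a -> empty prefix of b: i deletions
        for (std::size_t j = 1; j <= b.size(); ++j) {
            std::size_t subst = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({prev[j] + 1,      // delete a[i-1]
                               cur[j - 1] + 1,   // insert b[j-1]
                               subst});          // match or substitute
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}
```

For graphs, the same idea applies with vertices and edges in place of characters, although computing the exact graph edit distance is considerably more expensive than this string version.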
The main idea behind the feature extraction method is that similar graphs probably share certain properties, such as degree distribution, diameter, and eigenvalues [9]. After these features are extracted, a similarity measure [10] is applied to assess the similarity between the aggregated statistics and, equivalently, the similarity between the graphs. The iterative method, by contrast, rests on the idea that "two nodes are similar if their neighborhoods are also similar".
In each iteration, the nodes exchange similarity scores, and the process ends when convergence is achieved. A successful algorithm in this category is the similarity flooding algorithm by Melnik et al. [11], which is applied to database schema matching. It solves the "matching" problem by attempting to find the correspondence between the nodes of two given graphs. Another successful algorithm is SimRank [12], which measures the self-similarity of a graph, i.e., it assesses the similarities between all pairs of nodes in one graph. A further successful recursive method related to graph similarity and matching is the algorithm proposed by Zager and Verghese [13], which introduces the idea of coupling the similarity scores of nodes and edges to compute the similarity between two graphs.
A new method for measuring the similarity of attributed graphs was proposed in [14]. This technique solves the problems mentioned for similarity measures on attributed graphs and is useful in the context of large databases of structured objects. The first belief propagation (BP) based algorithm implemented for graph similarity [1] uses the original BP algorithm as proposed by Yedidia [15]. This algorithm is naive and runs in O(n^2) time.

III. PROPOSED METHOD
In the literature survey, the authors have thoroughly studied the existing methods that check the similarity of two graphs. In this paper, the authors propose graph mining techniques for checking the similarity between two community graphs. The authors also propose a community graph, which is depicted in "Fig. 1". To measure the similarity of two community graphs, the authors first compress both community graphs. The compressed community graphs are then compared for similarity. The technique for compressing a large community graph into a smaller one is adopted from [16].
www.ijacsa.thesai.org
The authors propose a village community graph having ten communities, namely C1 to C10, with a total of 118 community members. A black edge represents an edge between community members of the same community, whereas a blue edge represents an edge between community members of different communities. To compress the community graph depicted in "Fig. 1" into a smaller one, the authors have adopted the logic from [16]. The compressed community graph is depicted in "Fig. 2". Its corresponding adjacency matrix is then represented in memory and is depicted in "Fig. 3". In this weighted adjacency matrix, the self-loop of a community carries a weight equal to the total number of edges among the community members of that particular community. Similarly, the weight of the edge between a pair of communities is the total number of edges between the community members of those two communities.
For the proposed approach, the authors consider the community graph of "Fig. 1" as the principal community graph CG1, to be compared with six more community graphs, namely CG2 to CG7. Before comparison, the adjacency matrices of these six community graphs are compressed and represented in memory. Finally, the compressed adjacency matrix of the principal community graph CG1 is compared with the compressed adjacency matrices of all six community graphs CG2 to CG7 for the similarity check. The details of all seven community graphs CG1 to CG7, which serve as datasets for the proposed algorithm, are listed in "Table I".
Finally, Phase-3 compares both compressed community matrices by calling procedure CS( ). It returns a numerical value from 0 to 3, and the similarity of the two community graphs is judged on the basis of this value. The value 1 indicates similarity, whereas the values 0, 2, and 3 indicate no similarity between the two community graphs being compared.
//flag: To assign the similarity check value from 0 to 3.
{
    n1:=RCD (NCM1, "commun1.txt"); // CG1 details
    n2:=RCD (NCM2, "commun2.txt"); // CG2 details
    tcm1:=ACMC (NCM1, n1, CMM1, CCM1);
    tcm2:=ACMC (NCM2, n2, CMM2, CCM2);
    CMMatrix (CMM1, tcm1, "data1.txt");
    CMMatrix (CMM2, tcm2, "data2.txt");
    SCED (NCM1, n1, CMM1, CCM1);
    SCED (NCM2, n2, CMM2, CCM2);
    DCED (NCM1, n1, CMM1, tcm1, CCM1);
    DCED (NCM2, n2, CMM2, tcm2, CCM2);
    flag:=CS (CCM1, n1, CCM2, n2);
    if(flag=0) then write("Both the Community Graphs are not Similar");
    if(flag=1) then write("Both the Community Graphs are Similar");
    if(flag=2) then write("Both the Community Graphs are Similar on Similar Edges");
    if(flag=3) then write("Both the Community Graphs are Similar on Dissimilar Edges");
}
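The body of procedure CS( ) is not given in the text, so the following is only a hypothetical C++ stand-in that is consistent with the four messages printed above. It assumes that both compressed matrices are square, that their rows and columns are ordered by the same community codes, and that flag 2 means only the self-loop (same-community) weights agree while flag 3 means only the inter-community weights agree. The name compareCompressed and the Matrix alias are illustrative.

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Hypothetical stand-in for CS(): compares two compressed community matrices.
// Assumption: rows/columns of both matrices follow the same community codes.
// Returns 1 if all weights match,
//         2 if only the self-loop (diagonal) weights match,
//         3 if only the inter-community (off-diagonal) weights match,
//         0 otherwise, including when the number of communities differs.
int compareCompressed(const Matrix& ccm1, const Matrix& ccm2) {
    if (ccm1.size() != ccm2.size()) return 0;  // different number of communities
    bool sameIntra = true;  // diagonal entries agree so far
    bool sameInter = true;  // off-diagonal entries agree so far
    for (std::size_t i = 0; i < ccm1.size(); ++i) {
        if (ccm1[i].size() != ccm2[i].size()) return 0;
        for (std::size_t j = 0; j < ccm1[i].size(); ++j) {
            if (ccm1[i][j] == ccm2[i][j]) continue;
            if (i == j) sameIntra = false;
            else sameInter = false;
        }
    }
    if (sameIntra && sameInter) return 1;
    if (sameIntra) return 2;
    if (sameInter) return 3;
    return 0;
}
```

Because the comparison touches only the compressed matrices, its cost in this sketch depends on the number of communities rather than on the number of community members.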

V. EVALUATION OF ALGORITHM AND RESULTS
To evaluate the performance of the proposed algorithm, the authors have considered seven community graphs, namely CG1 to CG7, where the first community graph CG1 is taken as the principal community graph and compared with the remaining six community graphs to find similarities.
For each of the seven example community graphs, two dataset files were created. The first dataset file contains community graph details such as the number of communities, the community numbers, and the number of community members. For the seven community graphs, these dataset files are datacom1.txt to datacom7.txt. Similarly, the second dataset file contains the edge details of the community graph, i.e., the edges between community members, and consists only of 1s and 0s. For the seven community graphs, these dataset files are dataedg1.txt to dataedg7.txt. The details of these fourteen dataset files are given in "Table I".
The algorithm was written in C++, compiled with Turbo C++, and run on a laptop with an Intel Core i5-3230M CPU @ 2.60 GHz and 4 GB of memory running MS-Windows 7. The results of comparing CG1 with CG2 to CG7 are depicted in "Fig. 6" to "Fig. 17".
The datasets for community graphs CG1 to CG7 are in the text files datacom1.txt to datacom7.txt; datacom1.txt is depicted in "Fig. 4" and contains the total number of communities, the community numbers, and the total number of community members. Similarly, the edge datasets for community graphs CG1 to CG7 are in the text files dataedg1.txt to dataedg7.txt; dataedg1.txt is depicted in "Fig. 5" and contains the edge details, i.e., 0s (no edge) and 1s (edge) between community members of the same community as well as of different communities.
The authors have studied the existing graph similarity methods of Danai Koutra et al. [1], Sergey Melnik et al. [11], Glen Jeh et al. [12], L. Zager et al. [13], and Hans-Peter Kriegel et al. [14].
The method of Danai Koutra et al. [1] takes two graphs G1(N1, E1) and G2(N2, E2), with possibly different numbers of nodes and edges, and adopts belief propagation (BP) to find the similarity between them, finally returning a similarity value that is a real number between 0 and 1.
The method of Sergey Melnik et al. [11] matches two graphs based on a fixed-point computation. It takes as input two graphs, preferably schemas, catalogs, or other data structures, for the similarity check. It produces as its result a mapping between the corresponding nodes of the graphs. Depending on the matching goal, a subset of the mapping is chosen using filtering methods. Moreover, it allows the user to adjust the results if necessary.
The method of Glen Jeh et al. [12] finds the similarity between two objects based on their relationships. Two objects are said to be similar if they are related to similar objects. This similarity measure is called SimRank. The method is based on a simple graph-theoretic model.
The method of L. Zager et al. [13] uses node-edge coupling, i.e., two graph elements are similar if their neighborhoods are similar. An edge score is therefore constructed on the principle that "an edge in G1 is like an edge in G2 if their respective source and terminal nodes are similar". This is called edge similarity.
In the method of Hans-Peter Kriegel et al. [14], attributed graphs are considered a natural model for structured data. The authors proposed a new similarity measure between two attributed graphs, called the "matching distance". The matching distance is calculated as the sum of the cost of each edge matching.
The method proposed in this paper differs from the existing methods above. The proposed method takes two community graphs, with possibly equal numbers of nodes (communities) and different numbers of edges, for the similarity check. Each node (community) is labeled with a unique community number. Based on the community numbers of the nodes, the similarity is measured by considering the weight of each community's self-loop as well as the weight of each edge between communities. After comparing two community graphs, the method returns a similarity value, i.e., a number from 0 to 3, from which the similarity of the two community graphs can be judged. The proposed algorithm is capable of showing similarity and five different ways of dissimilarity. The five dissimilarities are "similar on dissimilar edges", "similar on similar edges", "communities same but different edges", "communities not same", and "number of communities are different". Moreover, the proposed method is based entirely on labeled community graphs and a simple graph-theoretic model. The authors therefore conclude that the proposed community graph similarity method is distinct from the existing methods above and is fast, since its time complexity is O(n^3).
For CG1 and CG4, the comparison is made on the number of edges among members with the same community code and the number of edges among members with different community codes. Since these are not the same for the two graphs, the algorithm finally reports "Both the Community Graphs are not Similar".
For CG1 and CG7, the comparison is first made on the community codes. Since the community codes of CG1 and CG7 are the same, the comparison then proceeds to the number of edges among members with the same community code and the number of edges among members with different community codes. The algorithm finally reports "Both the Community Graphs are Similar".

VI. CONCLUSIONS
Graph similarity techniques are helpful in the fields of shape retrieval, object recognition, face recognition, and many other areas. This paper starts with a literature survey of the various techniques implemented for graph similarity. It is important to compare two community graphs for similarity in order to extract reliable knowledge from a large community graph. This paper proposes an algorithm for checking the similarity of two community graphs using graph mining techniques. The authors have implemented the proposed algorithm in the C++ programming language and obtained satisfactory results.

Fig. 3. Adjacency matrix of Fig. 2

IV. PROPOSED ALGORITHM
The proposed algorithm has three phases. Phase-1 opens four dataset files for reading. The dataset files commun1.txt and commun2.txt are read for the number of communities, the community codes, and the total number of community members of the two community graphs CG1 and CG2, which are assigned to the matrices NCM1[][] and NCM2[][] respectively. Similarly, two more dataset files, data1.txt and data2.txt, are read for the edge details of the two community graphs CG1 and CG2, which are assigned to the matrices CMM1[][] and CMM2[][] respectively. Phase-1 thus covers reading the data, creating the community member matrices, and creating the initial form of the compressed community matrices. Phase-2 counts the edges between community members of the same community by calling procedure SCED( ), and counts the edges between community members of different communities by calling procedure DCED( ). Using procedures SCED( ) and DCED( ), the compressed community adjacency matrices CCM1[][] and CCM2[][] are assigned the edge values and self-loop values.
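The edge counting performed in Phase-2 can be sketched in C++ as follows. This is a minimal illustration, not the authors' SCED( ) and DCED( ) procedures: it folds both counts into a single hypothetical function compress, and it assumes a symmetric 0/1 member adjacency matrix together with a vector communityOf that gives a zero-based community index for each member. The Matrix alias is also illustrative.

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Sketch of compressing a member-level graph into a community-level matrix.
// members:     symmetric 0/1 adjacency matrix over community members (assumed layout).
// communityOf: zero-based community index of each member (assumed layout).
// k:           number of communities.
// Result: entry [c][c] is the self-loop weight, i.e. the number of edges among
// members of community c; entry [c][d] is the number of edges between members
// of communities c and d.
Matrix compress(const Matrix& members, const std::vector<int>& communityOf, int k) {
    Matrix ccm(k, std::vector<int>(k, 0));
    for (std::size_t u = 0; u < members.size(); ++u) {
        for (std::size_t v = u + 1; v < members.size(); ++v) {  // visit each undirected edge once
            if (members[u][v] == 0) continue;
            int cu = communityOf[u];
            int cv = communityOf[v];
            ccm[cu][cv] += 1;                 // same community: adds to the self-loop weight
            if (cu != cv) ccm[cv][cu] += 1;   // different communities: keep the matrix symmetric
        }
    }
    return ccm;
}
```

For four members on a cycle 0-1-2-3-0 with members 0 and 1 in one community and members 2 and 3 in another, each community gets a self-loop weight of 1 and the edge between the two communities gets a weight of 2.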

Fig. 5. Dataset of CG1 contains edge details of community members C1 to C10

TABLE I. DATASET TABLE

Fig. 4. Dataset file of CG1