Graph Mining Sub Domains and a Framework for Indexing – A Graphical Approach

Graphs are a popular model for representing large, complex structured data, and similarity search over graphs has become a fundamental research problem in Graph Mining. This paper first reviews basic graph theorems and relates them to research sub domains such as Graph Classification, Graph Searching, Graph Indexing, and Graph Clustering. Each sub domain is discussed through a few of its most prominent algorithms. Finally, a model for indexing is proposed, together with the algorithms it can draw on and their future prospects.


I. INTRODUCTION
The primary goal of data mining is to extract statistically significant and useful knowledge from data [1][2][3], which may take forms such as images, text, links, vectors, and tables. Various representations are available for both structured and semi-structured data, and both kinds of data can be represented by a graph. This has naturally given rise to the broad research area known as Graph Mining.
Raymond Kosala and Hendrik Blockeel, in "Web Mining Research: A Survey", explore the connection between the web mining categories and related agents. Notably, graph structure occurs throughout web mining research, which is still at a budding stage [25].
As Table I shows, the web graph is a form of representation used in web structure mining and web usage mining research. In this paper, we showcase the various sub domains in the field of graph mining and propose a model to index, update, and upgrade without performance degradation.

II. RELATING GRAPH SUBSTRUCTURES WITH MATHEMATICS THEOREMS
A graph is defined as a set of vertices (nodes) that are interconnected by a set of edges (links) [23]. The sum of the degrees of all vertices in a graph is equal to twice the number of edges. From the number of vertices and their degrees, the number of connections that may be present among the vertices can be predicted, which is useful for indexing and searching.
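The degree-sum property stated above can be checked on a small example. The following is a minimal sketch; the edge list and the helper name `degrees` are illustrative assumptions and do not come from the paper.

```python
# Hypothetical example graph given as an edge list (not taken from the paper).
edges = [("u", "v"), ("v", "w"), ("w", "x"), ("x", "v")]


def degrees(edge_list):
    """Return a dict mapping each vertex to its degree."""
    deg = {}
    for a, b in edge_list:
        deg[a] = deg.get(a, 0) + 1
        deg[b] = deg.get(b, 0) + 1
    return deg


deg = degrees(edges)
# Degree-sum property: the degrees add up to twice the number of edges.
assert sum(deg.values()) == 2 * len(edges)
```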

Theorem: 2
The vertex v is a cut vertex of the connected graph G if and only if there exist two vertices u and w in the graph G such that (i) u ≠ v, v ≠ w and u ≠ w, but (ii) v is on every u-w path.
(Figure: a graph on the vertices u, v and w.)
In this graph, u is connected to v and v is connected to w. If v is removed, u and w are no longer connected. Hence, v is called a cut vertex.
Theorem 2 plays a key role in graph classification once the data have been categorized according to the various conditions. The association among the contents of the graph can be effectively refined by this theorem.
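The cut-vertex condition can be tested directly by removing each vertex in turn and checking whether the rest of the graph stays connected. The sketch below is one simple way to do this, assuming an undirected graph stored as an adjacency dictionary; the function names are illustrative, and faster linear-time methods exist.

```python
from collections import deque


def is_connected(adj, removed=None):
    """BFS connectivity check on an adjacency dict, ignoring `removed`."""
    nodes = [n for n in adj if n != removed]
    if not nodes:
        return True
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        cur = queue.popleft()
        for nxt in adj[cur]:
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) == len(nodes)


def cut_vertices(adj):
    """Return the vertices whose removal disconnects a connected graph."""
    return sorted(v for v in adj if not is_connected(adj, removed=v))


# The path u - v - w from the text: every u-w path passes through v.
path = {"u": ["v"], "v": ["u", "w"], "w": ["v"]}
assert cut_vertices(path) == ["v"]
```

A triangle, in contrast, has no cut vertex, because any two remaining vertices are still joined by an edge.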
Theorem: 3
Every vertex of a graph G belongs to exactly one component of G. Similarly, every edge of G belongs to exactly one component of G.
Theorem 3 comes into play in a graph database: when updates have to be inserted into an index, data features should be abstracted and categorized so that they can be inserted at the right position in the index. Here, the updates correspond to the vertices and their relationships correspond to the edges.
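Theorem 3 says that the components partition both the vertices and the edges, so an incoming update can be routed to a single component. The following sketch labels components with a depth-first search; the function names and the small example graph are assumptions made for illustration.

```python
def components(vertices, edges):
    """Label every vertex with the id of its connected component (DFS)."""
    adj = {v: set() for v in vertices}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    label = {}
    next_id = 0
    for start in vertices:
        if start in label:
            continue
        label[start] = next_id
        stack = [start]
        while stack:
            cur = stack.pop()
            for nxt in adj[cur]:
                if nxt not in label:
                    label[nxt] = next_id
                    stack.append(nxt)
        next_id += 1
    return label


def component_of_edge(label, edge):
    """An edge lies in the one component shared by both of its endpoints."""
    a, b = edge
    assert label[a] == label[b]
    return label[a]


label = components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)])
assert label == {1: 0, 2: 0, 3: 0, 4: 1, 5: 1}
```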

III. GLIMPSES OF RESEARCH SUB DOMAINS IN GRAPH MINING:
Because graphs are a powerful way to model complex datasets, they have been adopted by researchers in domains such as chemistry [23,24,25], computer vision [5,6], image and object retrieval [6,9], and machine learning [7,8,9]. With enormous amounts of graph data available, many data mining processes can be applied, but for graph databases they take on a different dimension. Graph classification [12], graph indexing [10][11], graph clustering [13][18], and the use of subgraph patterns as features are some of the major areas of research in Graph Mining.
For example, biological structures can be stored as graphs, and in order to classify these structural graphs as active or inactive, a number of subgraph patterns are needed to build a classification model [14], [15], [16].
Subgraph Isomorphism, Video Indexing, Correlated Graph Pattern Mining, Optimal Graph Pattern Mining, Approximate Graph Pattern Mining, Graph Pattern Summarization, Graph Classification, Graph Clustering, Graph Indexing, Graph Searching, Graph Kernels, Link Mining, Web Structure Mining, Work-Flow Mining, Biological Network Mining, Improving Storage Efficiency of Semi-Structured Databases, and Efficient Indexing and Web Information Management are also sub domains [23] in the field of graph mining, a few of which are discussed below.

A. Graph Classification:
Xifeng Yan and Jiawei Han proposed gSpan [29] (graph-based Substructure pattern mining), which finds frequent substructures without candidate generation. Subgraph mining is called recursively to grow the graphs and to find all their frequent descendants. The search terminates when the support of a graph is less than the minimum support. gSpan builds a new lexicographic order and maps each graph to a unique minimum Depth First Search code as its canonical label. Through this lexicographic order, it adopts the depth-first search strategy to mine frequent connected subgraphs and uses a sparse adjacency list representation to store graphs.
Let {A, B, C, …} be the vertices and {a, b, c, …} be the connecting edges. The algorithm discovers A-a-A and then A-a-B, and continues until all frequent subgraphs are discovered.
Michihiro Kuramochi and George Karypis proposed Frequent SubGraph (FSG) [12] to find all connected subgraphs that appear frequently in a large graph database. It finds frequent subgraphs using the same level-by-level expansion adopted in Apriori [17][24].

Key features of FSG are:
(1) it uses a sparse graph representation, minimizing both storage and computation;
(2) it increases the size of frequent subgraphs by adding one edge at a time, allowing candidates to be generated efficiently;
(3) it uses simple algorithms for canonical labeling and graph isomorphism which work efficiently for small graphs;
(4) it incorporates various optimizations for candidate generation and counting which allow it to scale to large graph databases.
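The level-by-level, one-edge-at-a-time growth described above can be illustrated with a heavily simplified sketch. It is not the FSG implementation: it assumes that vertex labels are unique inside each graph, so a subgraph is just a set of labeled edges and containment becomes a subset test, which sidesteps canonical labeling and isomorphism. The function name and data are hypothetical.

```python
def frequent_subgraphs(database, min_support):
    """Level-wise (Apriori-style) search for frequent connected edge sets.

    Assumption: vertex labels are unique inside each graph, so every graph
    and every pattern is a frozenset of (label, label) edges.
    """
    def support(pattern):
        return sum(1 for g in database if pattern <= g)

    def touches(pattern, edge):
        # The new edge must share a vertex so the pattern stays connected.
        return any(set(edge) & set(e) for e in pattern)

    all_edges = set().union(*database)
    level = {frozenset([e]) for e in all_edges
             if support(frozenset([e])) >= min_support}
    frequent_edges = {e for p in level for e in p}
    result = {}
    while level:
        for p in level:
            result[p] = support(p)
        next_level = set()
        for p in level:
            for e in frequent_edges - p:
                if touches(p, e):
                    cand = p | {e}  # grow by exactly one edge
                    if support(cand) >= min_support:
                        next_level.add(cand)
        level = next_level
    return result


AB, BC, CD = ("A", "B"), ("B", "C"), ("C", "D")
database = [frozenset({AB, BC, CD}), frozenset({AB, BC}), frozenset({AB, CD})]
found = frequent_subgraphs(database, min_support=2)
assert found[frozenset({AB, BC})] == 2
```

With a minimum support of 2, the three single edges and the two-edge pattern A-B-C are frequent, while B-C-D is pruned because it occurs in only one graph.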

B. Graph Clustering:
Brian Kulis et al. proposed a kernel approach [13] that unifies vector-based and graph-based approaches. The objective function for semi-supervised clustering based on Hidden Markov Random Fields (HMRF), with squared Euclidean distance and a certain class of constraint penalty functions, is expressed as a special case of the weighted kernel k-means objective. The work extends the HMRF-based probabilistic framework for semi-supervised clustering with pairwise constraints [18]. Within this framework, the semi-supervised clustering algorithm SS-Kernel-KMeans handles both vector-based and graph-based inputs through a kernel matrix K.
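To make the weighted kernel k-means objective more concrete, the sketch below implements the plain (unsupervised) weighted kernel k-means iteration in pure Python. It is an illustration under stated assumptions, not the SS-Kernel-KMeans algorithm of [13]: the constraint penalties are omitted, and the function names, the linear kernel, and the example points are all hypothetical.

```python
def linear_kernel(points):
    """Kernel matrix of 1-D points under the linear kernel (illustrative)."""
    return [[a * b for b in points] for a in points]


def diagonal_shift(K, sigma):
    """Return K + sigma * I, i.e. sigma added to every diagonal entry."""
    n = len(K)
    return [[K[i][j] + (sigma if i == j else 0.0) for j in range(n)]
            for i in range(n)]


def kernel_kmeans(K, weights, assign, n_clusters, max_iter=20):
    """Weighted kernel k-means; `assign` holds the initial cluster ids."""
    n = len(K)
    for _ in range(max_iter):
        members = [[j for j in range(n) if assign[j] == c]
                   for c in range(n_clusters)]
        new_assign = []
        for i in range(n):
            best_c, best_d = assign[i], float("inf")
            for c, idx in enumerate(members):
                if not idx:
                    continue  # skip empty clusters
                s = sum(weights[j] for j in idx)
                cross = sum(weights[j] * K[i][j] for j in idx) / s
                within = sum(weights[j] * weights[l] * K[j][l]
                             for j in idx for l in idx) / (s * s)
                # Squared feature-space distance to the weighted cluster mean.
                d = K[i][i] - 2.0 * cross + within
                if d < best_d:
                    best_c, best_d = c, d
            new_assign.append(best_c)
        if new_assign == assign:
            break
        assign = new_assign
    return assign


points = [0.0, 0.2, 5.0, 5.2]
K = linear_kernel(points)
labels = kernel_kmeans(K, [1.0] * 4, [0, 1, 0, 1], n_clusters=2)
assert labels == [0, 0, 1, 1]
```

The distance is computed from kernel entries only, which is what allows the same loop to run on a graph-derived kernel as well as on vector data. The `diagonal_shift` helper is shown separately and is not applied in the example, since the linear kernel used here is already positive semidefinite.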
(2) Diagonal-shift K by adding σI to guarantee positive definiteness of K. (3) …

where 1 is the vector of all ones.

C. Graph Searching:
Rosalba Giugno and Dennis Shasha proposed GraphGrep [20], an application-independent method for querying graphs, i.e., for finding all the occurrences of a subgraph in a graph database. Its interface is Glide (graph linear query language), a regular expression graph query language that combines features from XPath and Smart. Glide incorporates both single-node and variable-length wildcards. The steps of GraphGrep are given in the numbered list (1) to (3) at the end of this paper.
Haoliang Jiang et al. [21] describe the transformation of a graph into a string representation that captures the semantics in graph data. The meaningful components in graph structures are found and used as the most basic units in sequencing. The algorithm first extracts all Cycle structures in a graph g, then extracts all Star structures, and finally identifies the remaining structures either as Line structures or as attachments to the extracted basic structures.
(Figure: Basic Structure [20])
This not only reduces the size of the resulting sequences but also enables semantic-based searching. The approach is demonstrated on chemical compounds and can be tested on protein structures as well.

D. Graph Indexing:
There are many research efforts that address the subgraph isomorphism problem for large graph databases by utilizing graph indexes, a few of which are listed below.
Peixiang Zhao et al. [28] proposed a new cost-effective graph indexing method based on frequent tree-features of the graph database. Effectiveness and efficiency are analyzed in three critical aspects: feature size, feature selection cost, and pruning power. To achieve better pruning, in addition to frequent tree-features (Tree), a small number of discriminative graphs (Δ) are selected on demand. This has two implications: (1) index construction by (Tree+Δ) is efficient, and (2) graph containment query processing by (Tree+Δ) is efficient.
Wook-Shin Han et al. proposed iGraph [19], a framework comprising binary executables, heap files, B+-trees, inverted indexes, disk-based prefix trees, binary large object (BLOB) files, an LRU buffer manager, m-way posting list intersection, and external sorting.
Xifeng Yan et al. proposed the gIndex algorithm [10], which uses frequent substructures as the basic indexing feature. Frequent substructures are ideal candidates because they explore the intrinsic characteristics of the data. Two techniques, the size-increasing support constraint and discriminative fragments, are introduced to reduce the size of the index structure.
The design and implementation of the gIndex algorithm is divided into five parts: (1) discriminative fragment selection, (2) index construction, (3) search, (4) verification, and (5) incremental maintenance.
James Cheng et al. proposed FG-index [11], a novel indexing technique that constructs a nested inverted index based on the set of Frequent subGraphs (FGs). For a graph query that is a frequent subgraph, FG-index returns the exact set of query answers without performing candidate verification. If the query is an infrequent graph, the algorithm returns a candidate answer set that is close to the exact answer set.
The algorithm is divided into three parts: (1) computation of T (where T is a sub graph), (2) construction of the core FG-index, and (3) creation of the Edge-index.
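The index construction, search, and verification steps shared by feature-based indexes can be sketched as a small inverted index. This is a toy illustration, not gIndex or FG-index: as a stated simplification the features are single labeled edges rather than mined frequent substructures, verification is a brute-force containment test suitable only for tiny graphs, and all names and data are hypothetical.

```python
from itertools import permutations


def edge_features(labels, edges):
    """Features of a graph: the sorted label pair of every edge."""
    return {tuple(sorted((labels[a], labels[b]))) for a, b in edges}


def build_index(database):
    """Inverted index: feature -> set of ids of graphs containing it."""
    index = {}
    for gid, (labels, edges) in database.items():
        for feat in edge_features(labels, edges):
            index.setdefault(feat, set()).add(gid)
    return index


def candidates(index, database, query):
    """Filtering: intersect the posting sets of all query features."""
    result = set(database)
    for feat in edge_features(*query):
        result &= index.get(feat, set())
    return result


def contains(graph, query):
    """Verification: brute-force subgraph containment (tiny graphs only)."""
    g_labels, g_edges = graph
    q_labels, q_edges = query
    g_edge_set = {frozenset(e) for e in g_edges}
    q_nodes = list(q_labels)
    for image in permutations(g_labels, len(q_nodes)):
        m = dict(zip(q_nodes, image))
        if (all(q_labels[q] == g_labels[m[q]] for q in q_nodes)
                and all(frozenset((m[a], m[b])) in g_edge_set
                        for a, b in q_edges)):
            return True
    return False


def search(index, database, query):
    """Filter with the index, then verify each surviving candidate."""
    return sorted(g for g in candidates(index, database, query)
                  if contains(database[g], query))


database = {
    "g1": ({1: "C", 2: "C", 3: "O"}, [(1, 2), (2, 3)]),
    "g2": ({1: "C", 2: "O", 3: "C", 4: "C"}, [(1, 2), (3, 4)]),
    "g3": ({1: "N", 2: "C"}, [(1, 2)]),
}
query = ({"a": "C", "b": "C", "c": "O"}, [("a", "b"), ("b", "c")])
index = build_index(database)
assert search(index, database, query) == ["g1"]
```

In this example the index prunes g3 outright, while g2 survives filtering because it contains both edge features and is rejected only at verification. Larger, more discriminative features are what shrink such candidate sets in practice.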

IV. A FRAMEWORK FOR INDEXING:
Irrespective of the type of graph data, there are various mine-at-once algorithms for building an index over any large database. After indexing, as updates arrive, the index has to be restructured so that retrieval efficiency (performance) does not degrade. If the changes cause major performance issues, the complete index has to be rebuilt from scratch, which is expensive and tedious.
Therefore, we propose a framework that can build an index from graph features and apply each update to the right feature at the right place by running search algorithms over the index. A mine-at-once indexing algorithm can index any type of data. Most such algorithms are extensions or improved versions of a few basic techniques, so a hybrid model for indexing can be built to make indexing more effective.
To upgrade the index with updates, feature mining is one applicable technique; in particular, the iterative subgraph feature mining algorithm [22] is effective in finding the changed parts of a graph.
Once the changes in the graph have been extracted by a feature mining technique, the right place in the index has to be found where the feature is to be pushed in or popped off. Basic searching techniques such as BFS, DFS, or G-string can be used to find that exact location.
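The push and pop operations on the index can be sketched as follows. The paper does not fix a concrete index layout, so this example assumes a simple tree of feature nodes, each holding the ids of the graphs that contain the feature, with BFS used to locate the node. The class, function, and feature names are hypothetical.

```python
from collections import deque


class IndexNode:
    """One feature in a tree-shaped index (assumed layout)."""

    def __init__(self, feature):
        self.feature = feature
        self.graph_ids = set()
        self.children = []


def bfs_find(root, feature):
    """Locate the node for `feature` by breadth-first search, or None."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node.feature == feature:
            return node
        queue.extend(node.children)
    return None


def push_feature(root, feature, graph_id, parent_feature=None):
    """Record that `graph_id` contains `feature`, creating the node if needed."""
    node = bfs_find(root, feature)
    if node is None:
        parent = bfs_find(root, parent_feature) or root
        node = IndexNode(feature)
        parent.children.append(node)
    node.graph_ids.add(graph_id)
    return node


def pop_feature(root, feature, graph_id):
    """Remove `graph_id` from the node for `feature`; report success."""
    node = bfs_find(root, feature)
    if node is None or graph_id not in node.graph_ids:
        return False
    node.graph_ids.discard(graph_id)
    return True


root = IndexNode("ROOT")
push_feature(root, "C-C", "g1")
push_feature(root, "C-C-O", "g1", parent_feature="C-C")
push_feature(root, "C-C", "g2")
assert bfs_find(root, "C-C").graph_ids == {"g1", "g2"}
```

Only the nodes touched by an update change, so the rest of the index is left as it was and no full rebuild is needed for a small change.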

V. CONCLUSION
This paper surveys various research areas in graph mining and presents a model, or architectural framework, that combines graph searching, indexing, and feature mining techniques. Since there are plenty of mine-at-once algorithms, effective indexing can be achieved by applying the algorithm best suited to the type of data at hand. Irrespective of the application field, this model can act as a core algorithmic structure for effective indexing and for upgrading the index.

The steps of GraphGrep [20] are: (1) build the database to represent the graphs as sets of paths; (2) filter the database based on the submitted query to reduce the search space; (3) perform exact matching.
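Steps (1) and (2) above can be sketched with path counts as a filter. This is an illustration under assumptions and not the GraphGrep implementation: each graph is summarized by the label sequences of its simple paths up to a chosen length, and a graph is kept only if it has at least as many occurrences of every query path. Exact matching, step (3), is omitted, and all names and data are hypothetical.

```python
from collections import Counter


def label_paths(labels, adj, max_len):
    """Count label sequences of all simple paths with up to `max_len` edges."""
    counts = Counter()

    def extend(path):
        counts[tuple(labels[v] for v in path)] += 1
        if len(path) - 1 < max_len:
            for nxt in adj[path[-1]]:
                if nxt not in path:
                    extend(path + [nxt])

    for start in adj:
        extend([start])
    return counts


def build_database(graphs, max_len=2):
    """Step (1): represent every graph as a multiset of label paths."""
    return {gid: label_paths(labels, adj, max_len)
            for gid, (labels, adj) in graphs.items()}


def filter_candidates(path_db, query_paths):
    """Step (2): keep graphs that have every query path often enough."""
    return sorted(gid for gid, paths in path_db.items()
                  if all(paths[p] >= n for p, n in query_paths.items()))


graphs = {
    "g1": ({1: "A", 2: "B", 3: "C"}, {1: [2], 2: [1, 3], 3: [2]}),
    "g2": ({1: "A", 2: "B", 3: "C"}, {1: [2, 3], 2: [1], 3: [1]}),
}
query = ({"x": "A", "y": "B", "z": "C"}, {"x": ["y"], "y": ["x", "z"], "z": ["y"]})
path_db = build_database(graphs)
survivors = filter_candidates(path_db, label_paths(*query, max_len=2))
assert survivors == ["g1"]
```

Here g2 is discarded without any matching work because it has no B-C path, which is the point of the filtering step: only the survivors need to go through exact matching.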