Partition based Graph Compression

Graphs are used in a diverse set of disciplines, ranging from computer networks to biological networks, social networks, and the World Wide Web. With advances in technology and the discovery of new knowledge, the size of graphs is growing rapidly. A graph containing millions of nodes and billions of edges can occupy terabytes of storage. The sheer size of such graphs is a major obstacle to understanding the essential information they contain, and with current main memory sizes it is often infeasible to load the whole graph into main memory. Hence the need for graph compression techniques arises. In this paper, we present a graph compression technique that partitions a graph into subgraphs, each of which can then be compressed individually. For partitioning, the proposed approach identifies weak links present in the graph and partitions the graph at those weak links. During query processing, only the partitions that are required need to be decompressed, which avoids decompressing the whole graph.


I. INTRODUCTION
Today, numerous large-scale systems and applications need to analyze and store massive amounts of data that involve interactions between various entities; this data is best represented as a graph. For instance, the link structure of the World Wide Web, groups of friends in social networks, data exchange between IP addresses, market basket data, etc., can all be represented as massive graph structures. As witnessed in the core tasks of these applications, graph patterns can help build powerful yet intuitive models for better managing and understanding complex structures. Some of these application domains are [19]:
• World Wide Web. The Web has a natural graph structure with a node for each page and a directed edge for each hyperlink. This link structure of the Web has been exploited very successfully by search engines like Google [18] to improve search quality. Other contemporary research works mine the Web graph to find dense bipartite cliques, and through them Web communities [16] and link spam [5]. Recent estimates from search engines put the size of the Web graph at around 3 billion nodes and more than 50 billion arcs [14]. (Note that these are clearly lower bounds, since the Web graph has been growing rapidly over the years as more of the Web gets discovered and indexed.) Thus, the Web graph can easily occupy many terabytes of storage.
• Social Networking. Popular social networking websites like Facebook, MySpace and LinkedIn cater to millions of users at a time, and maintain information about each user (nodes) and their friend lists (edges). Mining the social network graph can provide valuable information on social relationships between users, the music, movies, etc. that they like, and user communities with common interests.
• IP Network Monitoring. IP routers export records containing source and destination IP addresses, number of bytes transmitted, duration, etc. for each IP communication flow. Recently, Iliofotou et al. [12] proposed the idea of extracting Traffic Dispersion Graphs (TDGs) from network traces, where each node corresponds to an IP address and there is an edge between any two IP addresses that sent traffic to each other. Such graphs can be used to detect interesting or unusual communication patterns, security vulnerabilities, hosts that are infected by a virus or a worm, and malicious attacks against machines.
• Market Basket Data. Market basket data contains information about products bought by millions of customers. This is essentially a bipartite graph with an edge between a customer and every product that he or she purchases. Mining this graph to find groups of customers with similar buying patterns can help with customer segmentation and targeted advertising.
Several approaches have been proposed for the analysis and discovery of concepts in graphs in the context where graphs are used to model datasets. Modeling objects using graphs allows us to represent arbitrary relations among entities and capture the structural information. The utilization of richer and more elaborate data representations for improved discovery leads to larger graphs. The graphs are often so large that they cannot fit into the dynamic memory of conventional computer systems. Even if the data fits into dynamic memory, the amount of memory left for use during execution of the discovery algorithm may be insufficient, resulting in an increased number of page swaps and ultimately performance degradation. One of the main challenges for knowledge discovery and data mining systems is to scale up their data interpretation abilities to discover interesting patterns in large datasets. This paper addresses the scalability of graph-based discovery to monolithic datasets, which are prevalent in many real-world domains where vast amounts of data must be examined to find meaningful structures.
As discussed in [23], graph mining algorithms face many challenges due to the huge size of graphs. One issue is that a huge graph may severely restrict the application of existing pattern mining technologies. Additionally, directly visualizing such a large graph is beyond our capability. In computer science, it is more important to understand the information embodied in the abstract structures that are of particular interest to us. For instance, how can we quantify the amount of information in the structure of graphs such as the Internet, social networks, and biological networks? How can we understand and utilize the "structure" of nonconventional data structures such as biological data, topographical maps, medical data, and volumetric data? Imagine a compressed graph that conserves the characteristics of the original graph; it can be visualized easily. The goal of compressing a graph is to make the high-level structure of the graph easily understood. Therefore, informative graph compression techniques are required and have wide application domains. Many graph compression techniques have been developed for compressing Web graphs [7,14,25,10,4,9]. In this paper we propose a partition based compression approach that allows the compressed subgraphs to be stored on systems that are located geographically apart. This reduces network traffic in distributed computing [6], since the data is available on the local system itself. The aim of the proposed technique is to represent the data in compressed form while retaining the ability to answer the same queries as the uncompressed counterpart. We aim at representing graphs in highly compressed form, so as to manage huge instances in main memory.
The remainder of this paper is organized as follows. Section II reviews the background information as well as related work on graph compression. Section III presents the details of the proposed partition based approach. Section IV presents the results of the performance evaluation. Section V summarizes and concludes the paper.

II. BACKGROUND
The biggest challenge in graph compression is the ever increasing demand for a high compression ratio, which reduces the memory requirement of a graph. A graph containing billions of nodes and trillions of edges cannot be stored in memory without compression, and if it is stored on disk, the operations performed on it require many disk I/O and disk seek operations, which reduces algorithm performance drastically. Hence a graph needs to be divided so that each partition is small enough to fit in main memory, which reduces I/O operations significantly.

A. Problem definition
Let $G = (V, E)$ be an undirected graph, where $V$ is the set of vertices and $E$ is the set of edges in $G$. We need to represent the graph such that the compression ratio is maximized and the number of bits per edge is minimized. Compression ratio and bits per edge are given by the following formulae:
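A common way to define these two quantities, assumed here rather than recovered from the original formulae, is the following, where $B_{\text{orig}}$ and $B_{\text{comp}}$ (our notation) denote the sizes in bits of the uncompressed and compressed representations of the graph:

```latex
\[
\text{compression ratio} = \frac{B_{\text{orig}}}{B_{\text{comp}}},
\qquad
\text{bits per edge} = \frac{B_{\text{comp}}}{|E|}
\]
```

Under this convention a larger compression ratio and a smaller number of bits per edge both indicate better compression, which agrees with the optimization goal stated above.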

B. Related work
In recent years many compression algorithms have been proposed. In [14], gap encoding makes use of the locality [8] property of Web graphs. Locality suggests that each list of successors should be represented as a list of gaps. More precisely, if $S(x) = (s_1, s_2, \ldots, s_k)$, then it can be represented as $(s_1 - x,\ s_2 - s_1 - 1,\ s_3 - s_2 - 1,\ \ldots,\ s_k - s_{k-1} - 1)$. The reference compression technique [14], in contrast, exploits the similarity property of Web graphs. In this method, the adjacency list $S(x)$ is represented as a "modified" version of some list $S(y)$, called the reference list. The difference $x - y$ is called the reference number. This results in reference compression, in which a sequence of bits, one bit for each successor in the reference list, tells whether the corresponding successor of $y$ is also a successor of $x$. Nodes which are not covered by the reference list are called extra nodes.
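The gap representation can be sketched in a few lines of Python. This is a minimal illustration only; the function names `gap_encode` and `gap_decode` are our own, and the successor list is assumed to be sorted in increasing order.

```python
def gap_encode(x, successors):
    """Encode the sorted successor list of node x as a list of gaps."""
    if not successors:
        return []
    # The first gap is taken relative to x itself and may be negative.
    gaps = [successors[0] - x]
    # Each later gap is the distance to the previous successor, minus one.
    for prev, cur in zip(successors, successors[1:]):
        gaps.append(cur - prev - 1)
    return gaps


def gap_decode(x, gaps):
    """Invert gap_encode and recover the original successor list."""
    if not gaps:
        return []
    successors = [x + gaps[0]]
    for g in gaps[1:]:
        successors.append(successors[-1] + g + 1)
    return successors
```

Because neighboring identifiers tend to be close under locality, most gaps are small integers, which a variable-length integer code can then store in few bits.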
In differential compression, the differences with $S(y)$ are represented as a sequence of copy blocks. The copy list is represented as an alternating sequence of 1-blocks and 0-blocks, and only the length of each block is stored. This sequence of integers is preceded by a block count giving the number of blocks that follow [14]. Consecutivity among extra nodes is frequent; to exploit it, subsequences corresponding to integer intervals are isolated, and the number of integers in each interval is called its length [14].
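Both ideas can be sketched as follows. The names `copy_blocks` and `split_intervals` are hypothetical, the convention that the first block is a 1-block (possibly of length zero) is an assumption, and so is the `min_len` threshold below which a run is not treated as an interval.

```python
def copy_blocks(copy_list):
    """Run-length encode a copy list as [block_count, len_1, len_2, ...].

    Blocks alternate between 1-blocks and 0-blocks. The first block is
    assumed to be a 1-block, so a copy list that starts with 0 yields a
    leading block of length 0.
    """
    if not copy_list:
        return [0]
    blocks = []
    expected, run = 1, 0
    for bit in copy_list:
        if bit == expected:
            run += 1
        else:
            blocks.append(run)
            expected, run = bit, 1
    blocks.append(run)
    return [len(blocks)] + blocks


def split_intervals(extra_nodes, min_len=2):
    """Split sorted extra nodes into (start, length) intervals and residuals."""
    intervals, residuals = [], []
    i = 0
    while i < len(extra_nodes):
        j = i
        # Extend the run while the next value is exactly one larger.
        while j + 1 < len(extra_nodes) and extra_nodes[j + 1] == extra_nodes[j] + 1:
            j += 1
        length = j - i + 1
        if length >= min_len:
            intervals.append((extra_nodes[i], length))
        else:
            residuals.extend(extra_nodes[i:j + 1])
        i = j + 1
    return intervals, residuals
```

Storing only block lengths pays off when the copy list contains long runs of equal bits, and storing an interval as a start and a length pays off when extra nodes are consecutive.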
In [8], an unweighted graph $G = (V_G, E_G)$ is represented as $R = (S, C)$, where $S = (V_S, E_S)$ is the graph summary and $C$ is the set of edge corrections. Every node $v$ in $V_G$ belongs to a super node $V$ in $V_S$, which represents a set of nodes in $G$. A super edge $E = (V_i, V_j)$ in $E_S$ represents the set of all edges connecting all pairs of nodes in $V_i$ and $V_j$; that is, it simply collapses one bipartite graph into two super nodes $V_i$ and $V_j$ and replaces all the edges by a single super edge between the super nodes. The edge corrections $C$ have two parts, + (edges to be added) and - (edges to be removed), which are applied when the original graph is recreated.
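Recreating the original edge set from a summary and its corrections can be sketched as below. This is an illustrative sketch, not the implementation of [8]; the function name `expand_summary` and the chosen data layout (a dict of super nodes, undirected edges stored as `frozenset` pairs) are assumptions.

```python
from itertools import product


def expand_summary(supernodes, superedges, plus, minus):
    """Rebuild the undirected edge set from a summary and its corrections.

    supernodes: dict mapping a super node name to the set of nodes it contains
    superedges: iterable of (name_a, name_b) pairs
    plus, minus: iterables of (u, v) edges to add and to remove
    """
    edges = set()
    # A super edge stands for every pair of nodes across its two super nodes.
    for a, b in superedges:
        for u, v in product(supernodes[a], supernodes[b]):
            if u != v:
                edges.add(frozenset((u, v)))
    # Apply the corrections: add the '+' edges, then remove the '-' edges.
    edges |= {frozenset(e) for e in plus}
    edges -= {frozenset(e) for e in minus}
    return edges
```

For example, with super nodes A = {1, 2} and B = {3, 4}, one super edge (A, B), the correction +(1, 2) and the correction -(2, 4), the recreated graph has the four edges {1,3}, {1,4}, {2,3} and {1,2}.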
In [7], Re-Pair recursively finds pairs of repeated symbols across all the lists and replaces them by a new "non-terminal" symbol, which has to be expanded later when extracting the lists. In [3], a directed bipartite clique $(L, R)$ is transformed into a directed star. A directed bipartite clique $(L, R)$ is a pair of two disjoint sets $L$ and $R$ such that for every $u \in L$ and $v \in R$ there is a directed link from $u$ to $v$ in $G$. For a biclique $(L, R)$, a new compressed graph $G' = (V', E')$ is formed by adding a new vertex $w$ to the graph, removing all the edges in $(L, R)$, and adding a new edge $(u, w) \in E'$ for each $u \in L$ and a new edge $(w, v) \in E'$ for each $v \in R$.
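The biclique-to-star transformation can be illustrated with the short sketch below. The function name `compress_biclique` and the argument names `left`, `right` and `virtual` are our own labels for the two sides of the biclique and the newly added vertex; the code is not taken from [3].

```python
def compress_biclique(edges, left, right, virtual):
    """Replace a directed biclique (left -> right) by a star through `virtual`.

    edges: set of directed (u, v) pairs
    left, right: disjoint sets of nodes with an edge from every node in
                 left to every node in right
    virtual: identifier of the new vertex, assumed not to occur in the graph
    """
    biclique_edges = {(u, v) for u in left for v in right}
    if not biclique_edges <= edges:
        raise ValueError("left and right do not form a biclique in edges")
    new_edges = edges - biclique_edges
    new_edges |= {(u, virtual) for u in left}
    new_edges |= {(virtual, v) for v in right}
    return new_edges
```

A biclique with $|L| \cdot |R|$ edges is replaced by $|L| + |R|$ edges, so the transformation saves space whenever both sides contain more than two nodes.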
In [13], an undirected graph $G = (V, E)$, where $V$ is a set of nodes and $E$ is a set of edges, is represented using the adjacency list method, and thus $n + m$ space is required, where $n = |V|$ and $m = |E|$. For simple undirected graphs $G$, however, it should be noted that the complement graph $\bar{G}$ of $G$ is sufficient for representing $G$. For a very dense graph $G$, the size of the edge set of the complement graph may be much less than $m$. Therefore, the original graph is stored in the data structure if its edge set is no larger than that of the complement graph.

III. PROPOSED PARTITION BASED APPROACH
Compression allows more efficient storage and transfer of graph data, and may improve the performance of various algorithms by allowing computation to be performed in faster levels of the computer memory hierarchy. Good compression requires using the structural properties of the graph, and hence the first important step is to understand this structure. For example, in Web graphs, there appear to be natural clusters of related pages with similar connections.
In this paper we restrict our discussion to undirected graphs, but the approach can easily be extended to directed graphs. We use an undirected graph to model complicated structures that contain dense clusters connected by weak links called bridges. The proposed partition based compression algorithm exploits the locality property of graphs. Link locality has been independently observed and reported by several authors. For instance, Suel and Yuan [16] observe that on average around 75% of the links from a page point to other pages on the same domain/host. Given this observation, we attempt to partition the graph into dense clusters, and these dense clusters are then further compressed using the reference compression technique.
We employ the breadth-first search algorithm starting from a randomly chosen node, say $v \in V$, which returns a connected component $C$. Now take a node, say $u$, from $C$ and make two sets $S_1$ and $S_2$. Set $S_1$ contains all neighbors of node $u$, and set $S_2$ contains all the nodes in $C$ except the nodes in $S_1$. A node $x$ in $S_1$ with degree $d$ is a bridge node of width $w$ if the following conditions are satisfied: 1) $d - w$ neighbors of node $x$ are in set $S_1$ and exactly $w$ neighbors are in set $S_2$.
If both conditions are not satisfied, or if more than half of the neighbors of node $x$ are in set $S_2$, the node may be shifted to set $S_2$. We repeat this process for all the nodes in sets $S_1$ and $S_2$ until we find a bridge between the two sets or there is no further change in $S_1$ and $S_2$. In this way we find a bridge between $S_1$ and $S_2$, which results in two subgraphs. The same procedure is repeated by choosing another random node from $C$ until we get sufficiently small subgraphs. These subgraphs are compressed sequentially using the reference compression technique [14].
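The building blocks of this procedure can be sketched as follows. This is a partial, hedged sketch: only the first bridge condition is stated in the text, so only that one is implemented, and the function names `connected_component`, `initial_sets` and `satisfies_condition_1` as well as the adjacency representation (a dict from node to list of neighbors) are assumptions of ours.

```python
from collections import deque


def connected_component(adj, start):
    """Return the set of nodes reachable from start using breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return seen


def initial_sets(adj, component, u):
    """S1 holds all neighbors of u, S2 holds the rest of the component."""
    s1 = set(adj[u])
    s2 = component - s1
    return s1, s2


def satisfies_condition_1(adj, x, s1, s2, w):
    """Check that exactly w neighbors of x lie in S2 and the other d - w in S1."""
    in_s1 = sum(1 for nb in adj[x] if nb in s1)
    in_s2 = sum(1 for nb in adj[x] if nb in s2)
    return in_s2 == w and in_s1 == len(adj[x]) - w
```

For two triangles {0, 1, 2} and {3, 4, 5} joined by the single edge (2, 3), node 2 satisfies the condition with width 1 once the sets have settled to the two triangles, while nodes 0 and 1 do not.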
Each subgraph obtained after partitioning is compressed by applying the reference compression algorithm. In this method, instead of representing the adjacency list $S(x)$ for node $x$ directly, it is represented as a "modified" version of some other list $S(y)$, called the reference list. The difference $x - y$ is called the reference number. Thus reference compression results in a sequence of bits, one for each successor in the reference list, which tells whether the corresponding successor of node $y$ is also a successor of node $x$. The representation of $S(x)$ with respect to $S(y)$ is made of two parts: a sequence of $|S(y)|$ bits, called the copy list, and the list of integers $S(x) \setminus S(y)$, called the list of extra nodes. The copy list specifies which of the links contained in the reference list should be copied: it contains 1 at the $i$-th position iff the $i$-th entry of $S(y)$ also appears in $S(x)$ [14].
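The copy list and the list of extra nodes can be computed as in the sketch below, where `s_x` and `s_y` stand for the sorted adjacency lists $S(x)$ and $S(y)$. The function names `reference_encode` and `reference_decode` are hypothetical and chosen for illustration.

```python
def reference_encode(s_x, s_y):
    """Represent s_x relative to the reference list s_y.

    Returns (copy_list, extra_nodes): copy_list has one bit per entry of
    s_y, and extra_nodes holds the entries of s_x that s_y does not contain.
    """
    in_x = set(s_x)
    in_y = set(s_y)
    copy_list = [1 if s in in_x else 0 for s in s_y]
    extra_nodes = [s for s in s_x if s not in in_y]
    return copy_list, extra_nodes


def reference_decode(s_y, copy_list, extra_nodes):
    """Recover the sorted adjacency list from the reference representation."""
    copied = [s for s, bit in zip(s_y, copy_list) if bit]
    return sorted(copied + extra_nodes)
```

The more successors two lists share, the fewer extra nodes remain to be stored explicitly, which is why the choice of reference list matters for the compression ratio.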
For each node i in a subgraph s, we find a reference node j, that is, the node which has the maximum number of common successors with node i. The search is bounded by reference_width, the number of candidate nodes considered; reference_width can be fixed or can be equal to the size of the subgraph.
For the reference node, we calculate the reference_number, the copylist and the extranodes. The copylist is further compressed as a sequence of copyblocks, which store the lengths of the alternating runs of 1's and 0's in the copylist. The extranodes are compressed further as well, since there is consecutivity among them. Once all the nodes in a subgraph are covered, we take the next subgraph for compression.
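The search for a reference node can be sketched as follows. This sketch makes an assumption that the text does not state: candidates are restricted to the nodes that precede node i in the subgraph ordering, so that a reference list is always available before it is needed during decompression. The function name `choose_reference` and the use of `None` to mean a width equal to the subgraph size are likewise our own choices.

```python
def choose_reference(successors, order, pos, reference_width=None):
    """Pick the reference node for the node at position pos in order.

    successors: dict mapping each node to its list of successors
    order: list of the nodes of one subgraph in processing order
    reference_width: number of preceding candidates to inspect, or None
                     to inspect all preceding nodes of the subgraph
    Returns the candidate sharing the most successors, or None if no
    candidate shares any successor.
    """
    x = order[pos]
    start = 0 if reference_width is None else max(0, pos - reference_width)
    s_x = set(successors[x])
    best, best_common = None, 0
    for y in order[start:pos]:
        common = len(s_x & set(successors[y]))
        if common > best_common:
            best, best_common = y, common
    return best
```

A larger window inspects more candidates and can only find an equal or better match, at the cost of more comparisons per node.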

A. Algorithm
In this section, pseudo-code for partitioning the large graph is given. Functions check_condition_1 and check_condition_2 return "1" if conditions 1 and 2 mentioned in Section III, respectively, are true for node k.
Arrange all nodes of G in decreasing order of degree.

IV. PERFORMANCE EVALUATION
In this section, we present experimental results. The experiments were performed on a 2.10 GHz Intel Core i3 processor with 4 GB of main memory, running 32-bit Windows 7. The proposed algorithm is implemented in Java. We performed the experiments on synthetic datasets generated using a graph generator. Details of the graph datasets used for the experiments are given in Table II.

A. Graph Partitioning
A graph having 9985 nodes and 123416 edges is partitioned into 354 subgraphs; the size of each subgraph varies from 26 to 30 nodes, both inclusive, with a maximum bridge width of 3. The number of bridges is 510, of which 358 are of width one, 98 are of width two and 54 are of width three. Similarly, a graph of 1979 nodes and 24340 edges is partitioned into 71 subgraphs; the size of each subgraph again varies from 26 to 30 nodes, both inclusive, with a maximum bridge width of 3. The number of bridges is 98, of which 66 are of width one, 20 are of width two and 10 are of width three.

B. Effect of different parameters on compression ratio
In Fig. 1, the y-axis represents the compression ratio and the x-axis represents the reference width w [9]. The reference widths considered are 3, 5, 7 and the subgraph size. A reference width equal to the subgraph size means that all the nodes in the subgraph are considered in the search for the reference node [10]. Fig. 1 shows the compression ratio without copy blocks for different graph sizes. The compression ratio increases slowly with the increase in reference width. Fig. 2 shows the compression ratio with copy blocks for different graph sizes. Fig. 3 shows the compression ratio with copy blocks and extra nodes for different graph sizes. From Fig. 2 and Fig. 3 we can observe that the compression ratio increases rapidly with the increase in reference width. The compression ratio is maximum when the reference width is equal to the subgraph size. For all reference widths, the compression ratio of the graph with 9985 nodes and 123416 edges is the highest among all graphs. Hence the ratio increases with the number of nodes and edges, i.e., we get a better compression ratio for dense graphs.
Boldi and Vigna [14] report one of the best known results, requiring 2 to 3 bits per edge for a graph of 18.5 million nodes and 300 million edges. S. Raghavan [21] has shown that the super node and super edge representation takes 5.07 bits per edge on average over graphs of 25 million, 50 million and 100 million nodes. Broder [1] showed that a graph of 200M nodes and 1.5G edges requires 37.87 bits per edge.
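To put these figures in perspective, the bits-per-edge values can be converted into total storage for the edges. The following is a back-of-the-envelope calculation that counts only the edge bits, takes 3 bits per edge as the upper end of the range reported in [14], reads 1.5G edges as $1.5 \times 10^{9}$, and uses decimal units (1 MB = $10^{6}$ bytes):

```latex
\[
3 \times (3 \times 10^{8}) = 9 \times 10^{8}\ \text{bits}
= 1.125 \times 10^{8}\ \text{bytes} \approx 112.5\ \text{MB}
\]
\[
37.87 \times (1.5 \times 10^{9}) \approx 5.68 \times 10^{10}\ \text{bits}
\approx 7.1 \times 10^{9}\ \text{bytes} \approx 7.1\ \text{GB}
\]
```

The first graph therefore fits comfortably in main memory, while the second does not fit in the 4 GB of memory used in our experiments.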

V. CONCLUSION & FUTURE WORK
In this paper we proposed a partitioning approach as a solution to one of the main challenges for graph-based knowledge discovery and data mining systems, namely scaling up their data interpretation abilities to discover interesting patterns in large graph datasets. We observed that, for the partition based reference compression approach, the compression ratio increases with the reference width and is maximum when the reference width is equal to the size of the subgraph. Moreover, the approach helps in distributed computing by reducing network traffic and the storage burden on a single system.
A possible future enhancement to the proposed approach is to reduce the partitioning time, which increases sharply with the graph size. Currently the BFS algorithm is run for each node in the graph to obtain its connected component; nodes that already belong to a partition that cannot be partitioned further could be ignored.
Our algorithm is sequential, i.e., graph partitioning is done first and the reference compression algorithm is applied afterwards. This causes each partition to be reloaded for compression. It can be improved by compressing a partition as soon as it cannot be partitioned further.

Fig. 3. Compression ratio (with copy block and extra nodes) v/s reference width w.

TABLE II. DETAILS OF GRAPH DATASET