Designing Graphical Data Storage Model for Gene-Protein and Gene-Gene Interaction Networks

A graph is an expressive way to represent dynamic and complex relationships in highly connected data. In today's highly connected world, general-purpose graph databases make it possible to benefit from semantically significant networks without investing in graph infrastructure. Prominent graph databases include Neo4j, Titan and OrientDB. In the biological OMICS landscape, Interactomics is one of the new disciplines; it focuses mainly on the data modeling, storage and retrieval of biological interaction data. Biological experiments generate prodigious amounts of data in various formats (semi-structured or unstructured). The large volume of such data poses challenges for data acquisition, data integration, multiple data modalities (either data model or storage model), storage, processing and visualization. This paper aims at designing a well-suited graphical data storage model, using a graph database, for biological information collected from major heterogeneous biological data repositories.
Keywords: Big Data; Graph Theory; Graph Database; Gene-Gene Interaction; Protein-Protein Interaction; Large Scale Biological Graphs; Storage Model; Neo4j


I. INTRODUCTION
Big Data is defined as data characterized by variety, volume, velocity, veracity, valence and value. The key term in Big Data is data, not big. Data speed, frequency, volume and connectedness are driven by the source that transmits the data. The data gathered from different sources comes in different forms: structured, semi-structured and unstructured. Some major repositories of biological data include: Molecular Interaction Database (MINT) [1], Database of Interacting Proteins (DIP) [2], Biomolecular Interaction Network Database (BIND) [3], which is a component of the Biomolecular Object Network Databank, Reactome [4], Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) [5], Unified Human Interactome (UniHI) [6], Online Mendelian Inheritance in Man (OMIM) [7], Kyoto Encyclopedia of Genes and Genomes (KEGG) [8], Human Protein Reference Database (HPRD) [9], Biological General Repository for Interaction Datasets (BioGrid) [10], National Center for Biotechnology Information (NCBI) [11], and Universal Protein Resource Knowledgebase (UniProtKB) [12].
Graph databases are trending in today's highly connected world, where the flood of data carries dynamic and complex relationships. In the coming decades, gaining insight into vast graphs and highly connected data will be required to achieve competitive advantage. Formally, a graph consists of nodes (vertices), which represent entities, and edges (relationships), which represent connections between nodes. From a real-world perspective, everything is connected and can be represented as a graph.
With the emergence of recent tools and technologies, it is challenging to keep track of all the storage, analytics and management frameworks. In this study, the scope of the graph landscape is discussed in order to understand the presented graphical data storage model for biological interaction data. There are two broad views of the graph landscape: one is Graph Models and the other is Graph Processing.

Graph Model Perspective:
The prominent graph models used by graph technologies are the Labeled Property Graph Model [13], RDF (Resource Description Framework) [14] and Hypergraphs [15]. The Property Graph model contains nodes, which represent entities, and edges, which represent relationships. Both nodes and relationships can contain properties in the form of key-value pairs. Relationships must have a start node and an end node, and are directed and named. The Hypergraph model is a generalized graph data model in which a relationship (called a hyperedge) can connect any number of nodes. It can be used to model many-to-many relationship scenarios. Hyperedges can be multi-dimensional. The concept of triple stores originated from the Semantic Web movement. A triple is a data model with a subject-predicate-object structure. It is suitable for capturing semantically rich information and logically connected data. Among graph databases, OrientDB [16] provides the Property Graph Model, Neo4j [17] provides the Labeled Property Graph Model (labels can be assigned to nodes) and HyperGraphDB [18] provides Hypergraphs.
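As a rough illustration of the three models described above, the following Python sketch encodes a small fact in each form. All class names, labels, relationship types and property keys are assumptions chosen for illustration; they are not taken from any particular graph database.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Labeled property graph node: labels plus key-value properties."""
    node_id: int
    labels: set
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """Directed, named edge with exactly one start node and one end node."""
    rel_type: str
    start: int
    end: int
    properties: dict = field(default_factory=dict)


@dataclass
class HyperEdge:
    """Hyperedge: one relationship connecting any number of nodes."""
    rel_type: str
    members: frozenset


# Labeled property graph: two nodes joined by one directed relationship.
gene = Node(1, {"Gene"}, {"symbol": "RXRA"})
pathway = Node(2, {"Pathway"}, {"name": "signal transduction"})
takes_part = Relationship("PARTICIPATES_IN", gene.node_id, pathway.node_id)

# Hypergraph: a single hyperedge can relate more than two nodes at once.
complex_edge = HyperEdge("FORMS_COMPLEX", frozenset({1, 3, 4}))

# Triple store: the same participation fact as subject-predicate-object.
triple = ("RXRA", "participatesIn", "signal transduction")
```

The property graph keeps attributes on the nodes and edges themselves, whereas a triple store would need one extra triple per attribute; that difference is the main practical trade-off between the two.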
Graph Processing Perspective:
Technologies that exploit concepts similar to OLTP (Online Transactional Processing) [19] in the traditional relational space are termed Graph Databases. Graph databases offer online transactional processing and are accessed in real time, either by a user or by an application. Technologies that exploit concepts similar to OLAP (Online Analytical Processing) [20] or data mining, by contrast, are categorized as Graph Processing Engines (GPE) [21] [22]. These are typically designed to perform analytics on bulk data in batch steps.
Graph Databases (Graph Database Management Systems) [23] are online transactional systems that expose a graph data model through CRUD (Create, Read, Update, Delete) [24] operations, and are designed for transactional performance, integrity and availability. The distinguishing properties of graph databases are graph storage and graph processing. Some graph databases offer native graph storage, while others serialize graph data into a general-purpose database such as a relational database [25], an object-oriented database [26] or a NoSQL store [27] (other than a graph store). The approach in which adjacent nodes directly point to each other is termed index-free adjacency. In other words, a graph database qualifies as a graph database when it behaves like a real graph from the user's perspective. Graph databases that use native graph processing are those that provide index-free adjacency [28].
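The difference between index-free adjacency and a lookup in a global edge table can be sketched in a few lines of Python. This is a conceptual model under simplifying assumptions (in-memory objects, hypothetical names), not a description of how any particular database is implemented.

```python
class Node:
    """Node that holds direct references to its neighbours."""

    def __init__(self, name):
        self.name = name
        self.neighbours = []  # direct pointers to adjacent Node objects


def connect(a, b):
    """Add an undirected edge by storing a pointer on each endpoint."""
    a.neighbours.append(b)
    b.neighbours.append(a)


def neighbours_via_pointers(node):
    """Index-free adjacency: follow the stored pointers, no global lookup."""
    return sorted(n.name for n in node.neighbours)


def neighbours_via_table(name, edge_table):
    """Non-native alternative: scan a global edge table for matches."""
    found = set()
    for left, right in edge_table:
        if left == name:
            found.add(right)
        elif right == name:
            found.add(left)
    return sorted(found)


a, b, c = Node("A"), Node("B"), Node("C")
connect(a, b)
connect(a, c)
edge_table = [("A", "B"), ("A", "C")]

print(neighbours_via_pointers(a))            # ['B', 'C']
print(neighbours_via_table("A", edge_table))  # ['B', 'C']
```

Both functions return the same neighbours, but the pointer-based version does work proportional to the degree of the node, while the table scan does work proportional to the total number of edges.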
Relational databases store data in tabular, structured form, and they do it exceedingly well. But today's technologies face the challenge of storing data that is highly connected and semi-structured, and that should be well modeled and suitable for ad-hoc queries. Almost everything in this world is connected, and understanding the influence of connections is needed in order to thrive and progress. In the biological domain, data is even more connected and has complex relationships. This research aims at designing a storage model for connected data collected from major biological data repositories, using a graph database, Neo4j [17].
Biological interaction networks are typically dense, semi-structured, unpredictable and highly connected. For example, in a protein-protein interaction network [29], a gene may interact with other proteins, may participate in biological pathways [30], or may be involved in a disease-relevant network. This type of connected biological information leads to highly connected networks. Traditional database storage models are therefore not suitable for such datasets: classical storage models are designed for datasets that are less connected (few relationships among data entities) and whose entities represent a limited set of data types, and querying connected data requires joins that make it computationally expensive. Graph storage models provide an easy way of modeling, understanding and visualizing the data of a domain. In the biological domain, the problem is to acquire data from heterogeneous biological data sources, integrate the collected datasets, and design a storage model based on an information-rich graph model that helps in understanding the connectedness of data along several other aspects. Being less familiar with graph databases, biologists (people from other domains) find it difficult to design graph storage models.
The objectives of this research include:
• Biological data acquisition from heterogeneous data sources such as NCBI [11], RefSeq [31], EntrezGene [32], BioGrid [10], OMIM [7], HGNC [33], HPRD [9] and STRING [5] (selection of datasets of Gene-Gene and Gene-Protein interactions)
• Transformation, cleaning and integration of datasets
• Data modeling of Gene-Gene and Gene-Protein interaction data using the Labeled Property Graph Model
• Designing a data storage model for a graph database (using Neo4j)
• Evaluation of the implemented storage model
The remainder of this paper is organized as follows. In Section 2, the graphical data storage model for interaction networks is presented using a graph database. Section 3 discusses how a data model (the Labeled Property Graph Model) can be represented as a graph storage model, specifically for biological interaction graphs. In Section 4, the storage model is evaluated using the Cypher query language in Neo4j [17]. Related work is presented in Section 5, followed by the conclusion in Section 6.

II. GRAPHICAL DATA STORAGE MODEL
This paper aims at offering a unifying, gene-centric view over the data made available by the heterogeneous data sources and at designing a graphical data storage model for the integrated data. To achieve this objective, the available typologies of biological information are formulated as:
• Gene, i.e., identification of a gene of a dataset through its data source identifier. For example, a gene symbolically represented as RXRA is identified by its data source identifier. In this data model, diverse datasets are integrated from heterogeneous data sources including HGNC [33], HPRD [9], UniProt [12], Ensembl [34], EntrezGene [32], BioGrid [35], NCBI [11], STRING [5] and RefSeq [31]. Properties of a gene include Gene-Family Identifier, Gene-Symbol, Gene-Aliases, Gene-Description, Genomic-Coordinates and Cytogenetic-Location.
• Locus, i.e., Information about Locus Type and Locus Family.
• External Links, i.e., identification of a gene or a protein of a dataset through its data source identifier. For example, a gene symbolically represented as RXRA is identified in HGNC as 10477, in UniProt as Q6P3U7, in Ensembl as ENSG00000168824, in HPRD as 1577, and so on. In this data model, diverse datasets are integrated from heterogeneous data sources including HGNC [33], HPRD [9], UniProt [12], Ensembl [34], EntrezGene [32], BioGrid [35], NCBI [11], STRING [5] and RefSeq [31].
• Molecular Information, i.e., the Molecular Weight (unit: Dalton) of a gene, the Molecular Class to which a gene belongs, and the Molecular Function a gene may perform.
• Disease, i.e., information about the participation of a gene in Disease-Association [36] networks; for example, a gene can be associated with a certain kind of tumor or another kind of disease.
• Publication, i.e., Reference of existing biological literature [37] for Gene that includes information about Author, Publication Year and Publication Identifier.
• Sequences, i.e., biological sequences include DNA Sequence and Protein Sequence.
• Pathways, i.e., information about the participation of a gene in biological processes; for example, a gene can take part in cell communication or in signal transduction.
• Gene-Gene Interaction Information, i.e., Interaction of Gene with other Genes carries information about the experiment method through which the G-G interaction is detected and recorded (by the data sources).
In Table II, entities are represented as nodes, and edges are represented as relationships between biological entities. Nodes have properties and can have one or more labels. Relationships are directed and can have properties as well. Fig. 1 presents the graphical data storage model, which is based on the Labeled Property Graph Model. Nodes represent the aforementioned entities of the biological domain, along with the label and properties of each node.
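One way to picture the gene-centric model is as a set of parameterized Cypher statements generated from an integrated record. The sketch below is hypothetical: the label names (Gene, ExternalLink), the relationship type HAS_LINK, the property keys and the placeholder description are assumptions, not the schema of Table II or Fig. 1. The RXRA identifiers are the ones quoted in the External Links item of the list above.

```python
def gene_statements(symbol, properties, external_ids):
    """Build (cypher, params) pairs for one gene and its external links."""
    statements = [(
        "MERGE (g:Gene {symbol: $symbol}) SET g += $props",
        {"symbol": symbol, "props": properties},
    )]
    # One ExternalLink node per source, attached to the gene node.
    for source, identifier in sorted(external_ids.items()):
        statements.append((
            "MATCH (g:Gene {symbol: $symbol}) "
            "MERGE (x:ExternalLink {source: $source, identifier: $identifier}) "
            "MERGE (g)-[:HAS_LINK]->(x)",
            {"symbol": symbol, "source": source, "identifier": identifier},
        ))
    return statements


rxra_links = {
    "HGNC": "10477",
    "UniProt": "Q6P3U7",
    "Ensembl": "ENSG00000168824",
    "HPRD": "1577",
}
# The description value is a placeholder, not curated data.
statements = gene_statements(
    "RXRA", {"description": "placeholder description"}, rxra_links
)
for cypher, params in statements:
    print(cypher, params)
```

Using MERGE rather than CREATE keeps repeated loads from duplicating the gene node, and passing values as parameters avoids building query text by string concatenation.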

III. PHYSICAL DATA STORAGE IN GRAPH DATABASE
The way in which graphs are stored is one of the key aspects of designing a graph database. Neo4j is one of the prominent graph databases; it provides index-free adjacency, native storage, native processing and a native query language (Cypher). The storage model for graph databases was designed in Section 2. This section illustrates how (binary) biological interactions are physically stored in a graph database (Neo4j). Neo4j stores graph data in different store files, i.e., nodes, relationships, properties and labels have different physical stores on disk. There is a structural dissimilarity between the graphical view of a graph and the view of the stored records on disk.
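The idea of separate stores can be sketched with fixed-size records, where the position of a record follows directly from its identifier. The layout below is a deliberately simplified assumption for illustration (the field widths, the -1 sentinel and all names are hypothetical) and does not reproduce Neo4j's actual on-disk format.

```python
import struct

# Hypothetical layouts: every record in a given store has the same size.
NODE_FMT = "<qq"     # first relationship id, first property id
REL_FMT = "<qqqq"    # start node, end node, next rel of start, next rel of end
NODE_SIZE = struct.calcsize(NODE_FMT)
REL_SIZE = struct.calcsize(REL_FMT)
NONE = -1            # sentinel meaning "no further record"


def read_record(store, fmt, size, record_id):
    """Locate a record by arithmetic: offset = id * record size."""
    offset = record_id * size
    return struct.unpack(fmt, store[offset:offset + size])


# Node store: nodes 0 and 1 both start their chain at relationship 0.
node_store = struct.pack(NODE_FMT, 0, NONE) + struct.pack(NODE_FMT, 0, NONE)
# Relationship store: one binary interaction 0 -> 1 with no further links.
rel_store = struct.pack(REL_FMT, 0, 1, NONE, NONE)

first_rel, _ = read_record(node_store, NODE_FMT, NODE_SIZE, 1)
start, end, _, _ = read_record(rel_store, REL_FMT, REL_SIZE, first_rel)
print(NODE_SIZE, REL_SIZE, start, end)  # 16 32 0 1
```

Because each store is addressed by record id alone, following an interaction from a node to its relationship record needs no search, which is the on-disk counterpart of index-free adjacency. It also shows why the stored records look unlike the drawn graph: the edge exists only as identifiers inside two separate byte arrays.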
Protein-protein interaction networks are usually very diverse and have various properties, because the data is generated from heterogeneous sources, both experimentally and computationally. Protein interaction networks mostly follow the characteristics of scale-free networks. In such networks, a higher degree of connectivity of a protein indicates a higher biological significance of that protein. Gene-gene interaction networks, also known as gene regulatory networks, are usually sparse and highly connected. Fig. 3 presents how gene-gene interactions are physically stored in Neo4j. Biological networks are naturally complex, and the complexity increases with the accumulation of data. The variability of biological information is one of the major causes of data inaccuracy. In this research, data is integrated from different major data repositories, and a storage model is presented for querying and visualization in Neo4j. The results are evaluated by verifying the queried information against the major sources of biological information.
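The link between connectivity and significance in scale-free networks can be made concrete by counting node degrees in an interaction list and picking out the hubs. The edge list and the hub threshold below are invented for illustration and are not taken from the datasets used in this paper.

```python
from collections import Counter


def degrees(interactions):
    """Count how many interactions each entity takes part in."""
    counts = Counter()
    for a, b in interactions:
        counts[a] += 1
        counts[b] += 1
    return counts


def hubs(interactions, min_degree):
    """Return entities whose degree reaches the chosen threshold."""
    counts = degrees(interactions)
    return sorted(name for name, d in counts.items() if d >= min_degree)


# Toy binary interactions between placeholder proteins P1..P5.
toy = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3"), ("P4", "P5")]
print(degrees(toy)["P1"])  # 3
print(hubs(toy, 3))        # ['P1']
```

In a scale-free network most entities have a low degree and a few hubs have a very high one, so a degree ranking of this kind is a common first step in prioritizing proteins for closer study.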
To demonstrate how biological data can be accommodated in Neo4j, some query results are presented. The diverse datasets are pooled in Neo4j, specifically for the biological domain and the Gene-Gene and Gene-Protein interaction scenario, and are queried using the Cypher query language. Query results are evaluated on the basis of the designed storage model and its potential to capture all the information a biological network has about its entities and relationships. Additionally, query results are verified against the heterogeneous data sources from which the data was collected. Fig. 5 depicts how G-P and G-G interaction networks are represented in Neo4j.
The study of protein-protein, protein-gene and gene-gene interactions is becoming increasingly important for understanding human diseases at a system-wide level. Protein-protein interactions provide significant information that can yield new insights and impact biomedical research in different ways. Protein functionality is often modulated by other interactors, which can be proteins, genes or other molecules. Biochemical interaction detection methods are used to detect interactions among biological entities; such methods include protein affinity chromatography, affinity blotting, co-immunoprecipitation and cross-linking. Other prominent experimental methods for interaction detection in molecular biology are protein probing and the two-hybrid system. Examples of genetic interaction detection methods include suppressors [41], synthetic mutants [42] and non-complementing mutants [43].
In [44], practical guidance for the analysis of interactions using genetic, biochemical and molecular biological methods is presented. In [45], protein interaction fundamentals and publicly available protein interaction databases, together with the information they provide to facilitate genome or genetic studies, are briefly discussed. A systematic method for predicting the type of protein-protein interaction, based solely on the techniques used to detect interactions, is proposed in [29]. An investigation of the effect of lactose on structural variation of aging induced by changing lactose content is presented in [46].
In the biological literature [37], systematic views of the human genome are presented, from ancient evolution to precision medicine against diseases. For research purposes, biological databases are growing in importance with the rapid growth of data. In [47], a review of biological databases is presented, followed by challenges such as data volume, processing, data exchange and curation from a big data perspective.
Human (Homo sapiens) databases are categorized by the information they provide, such as DNA [34], RNA [48], protein [2] [12], expression [49], pathway [4], disease [50] and literature [37]. Ancestral network mechanisms of the human and mouse genomes, characterized by new gene integration and the evolutionary significance of genes, are discussed in [51]. That work also explores the generation frequencies and patterns of new-gene-driven evolution of gene-gene interaction networks.
In [52], interaction pattern discovery, with a characterization of different types of interactions, is discussed along with its use in protein-protein interaction. Graph databases enable efficient storage and processing of the encoded biological relationships. STON [54] (SBGN to Neo4j) is a framework that exploits the Neo4j graph database to store biological pathways represented in the Systems Biology Graphical Notation (SBGN) [53]. In [30], a novel algorithm for the identification of spurious curves is presented, where curves are used for different unfolding pathways. An evaluation of different resulting graphs generated from statistical analysis is presented in [55]. [56] gives a detailed description of protein domains, functional sites and families, as well as associated patterns and methods for identifying their profiles. A brief description of major biological interaction databases such as BIND [3], DIP [2], HPRD [9], IntAct [57], MINT [1], MIPS [58], PDZBase [59] and Reactome [4] is given in [60]. BioGrid [10] is an open-access database that houses protein and genetic interaction data curated from the primary biomedical literature for all major model organisms [35]. Currently, BioGRID [35] contains 749912 interactions drawn from 43149 publications, representing 30 model organisms.

VI. CONCLUSION
We are living in the age of Big Data, and graphs are the most suitable choice for representing large-scale multi-model biological data, as they can effectively represent the relationships in data collected from heterogeneous data sources. Large-scale biological graphs have been used for the analysis of complex datasets from the biological domain, such as interaction networks, bioinformatics, health informatics, molecular networks, gene-disease and gene-phenotype association networks, and applications that produce large amounts of biological data. To fully utilize the information represented by graphs, an efficient storage model and graph database are required. In this paper, a storage model has been presented for diverse datasets collected from major biological data repositories, using one of the prominent graph databases, Neo4j. The storage model is described according to various types of biological information. Moreover, the potential of graph theory in biology and the tools and techniques used in biological research activities have been presented. This article will help researchers gain first-hand knowledge of existing graph databases and techniques when planning future research.

Fig. 1. Graphical Data Storage Model
Fig. 2 presents how a protein-protein interaction is physically stored in Neo4j.

Fig. 5. Neo4j results based on the presented data storage model
Neo4j [17] provides native graph storage and native graph processing. Other prominent graph databases are discussed in Table I.

TABLE I. EXISTING GRAPH DATABASES WHICH PROVIDE NATIVE/NON-NATIVE STORAGE AND PROCESSING