A Novel Big Data Storage Model for Protein-Protein Interaction and Gene-Protein Associations

NGS (Next Generation Sequencing) technology has produced a huge amount of proteomics data in the form of interactions (protein-protein, gene-protein, and gene-disease). ETL (Extraction, Transformation, and Loading) techniques are very useful for loading such data into databases. Existing relational databases are not unified and are tied to SQL (Structured Query Language), so proteomics data requires better integration of different data sources. NoSQL (Not only SQL) storage can improve efficiency and performance. For this purpose, a novel unified model has been designed for interaction data (P-P, G-P, and G-D) using Apache HBase. Different case studies have been used to evaluate the model.
Keywords—Hadoop; HBase; Big Data; Apache Drill; Protein-Protein Interaction; Gene-Protein Association; Gene-Disease


I. INTRODUCTION
Biological data, which comprises DNA, RNA, proteins, and genes (microarray), plays an essential role in the bioinformatics domain. Over time, these data have been growing very quickly in the form of interactions/associations such as protein-protein and protein-gene [1][2][3]. These interactions provide valuable information about the structure of the cell and its controlling mechanisms. Many approaches have been designed for detecting protein and disease interactions [4,5], and they improve the accuracy of biological interaction data.
As the volume of biological data has increased, it has become very important to identify specific genomic diseases [6,7] with the help of proteomics interactions. Many researchers are trying to find protein and disease interactions that give important information about their functions and behaviours. Prediction of biological processes is very informative for molecular interactions [8]. Protein pathways and complexes are determined by molecular interactions.
In recent years, interaction data have grown in both variety and volume. Such data is referred to as Big Data and needs to be stored in a database very effectively. Existing PPI (Protein-Protein Interaction) relational databases are DIP (Database of Interacting Proteins), MIPS (The Munich Information Centre for Protein Sequences), HPRD (Human Protein Reference Database), MINT (The Molecular Interaction Database) [9], BOND (Biomolecular Object Network Databank), IntAct, and Reactome. However, these databases do not store large interaction data sets in a structured and efficient way. DIP [10] is specially designed to determine protein interactions by combining multiple sources into a unique and consistent set of PPIs. MIPS [11] is a research centre that systematically manages methods and data for microarray gene expression and proteins. HPRD [12] is an OO (Object Oriented) database developed for specific protein-disease associations. It provides query optimisation and displays data dynamically. MINT is based on verified protein interactions that are presented graphically. BOND [13] is a powerful databank designed for a combination of interactions and multiple sequences. It includes GenBank and a stack of tools. IntAct [14] is a valuable open-source database that provides tools for interactions. Reactome [15] is a project that provides cross-references to many sequence databases. The above-mentioned databases cannot find some specific associations on their own; hence, an integration of these databases is required.
To remove these bottlenecks, the open-source Apache Hadoop [16] platform has been developed for parallel execution of tasks in a distributed manner across thousands of nodes. Its main tools are HBase [17] and Hive [18]. The HBase framework is used for random, real-time access to data. It is a NoSQL (Not only Structured Query Language) technology, adopted because an RDBMS (Relational Database Management System) performs poorly when scaled to large data. NoSQL databases are usually characterized by the CAP (Consistency, Availability, and Partition Tolerance) trade-off rather than by full ACID (Atomicity, Consistency, Isolation, and Durability) guarantees; HBase offers atomicity and strong consistency at the level of a single row. HBase shards tables automatically and handles sparse data well. Its logical view contains a row key, column family, column key, timestamp, and cell value. Its main parts are the Region, Master, Region Server, HDFS (Hadoop Distributed File System), and API (Application Programming Interface). Its basic operations are create, read, update, and delete, which are carried out with the put, get, and delete commands. Hive is a DWH (Data Warehouse) framework designed for ad-hoc queries and report writing; it provides HQL (Hive Query Language) for large data analysis. Its components are a web interface, a driver, a Thrift server, and clients that interact with Hadoop. Its metastore can run in embedded, local, or remote mode. Its data units are tables, partitions, and buckets. It supports primitive and complex data types such as integers, strings, binary, arrays, maps, unions, and structs. It provides a shell interface, built-in functions, and relational and arithmetic operators.
www.ijacsa.thesai.org
In this paper, a model is designed for large protein-gene interaction data by integrating existing relational databases. It provides meaningful information for specific interactions.
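The HBase logical view described above (row key, column family, column key, timestamp, cell value) can be pictured with a small in-memory stand-in. The following is only a sketch for illustration: the class `TinyTable`, the row key `row-1`, and the cell values are hypothetical, and real HBase adds regions, persistence, and configurable version limits that are not modelled here.

```python
import time
from collections import defaultdict


class TinyTable:
    """Toy in-memory stand-in for one HBase table (illustration only)."""

    def __init__(self, families):
        # Column families are fixed when the table is created.
        self.families = set(families)
        # row key -> "family:qualifier" -> [(timestamp, value), ...] newest first
        self.rows = defaultdict(dict)

    def put(self, row_key, column, value, ts=None):
        """Create or update a cell; every put adds a new timestamped version."""
        family = column.partition(":")[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        ts = time.time_ns() if ts is None else ts
        versions = self.rows[row_key].setdefault(column, [])
        versions.append((ts, value))
        versions.sort(key=lambda v: v[0], reverse=True)

    def get(self, row_key):
        """Read the newest value of every column stored for row_key."""
        stored = self.rows.get(row_key, {})
        return {column: versions[0][1] for column, versions in stored.items()}

    def delete(self, row_key, column):
        """Remove one column (all of its versions) from a row."""
        self.rows.get(row_key, {}).pop(column, None)


# Hypothetical row: two versions of one cell plus a cell in another family.
table = TinyTable(["G", "DSI"])
table.put("row-1", "G:gene_name", "name-v1", ts=1)
table.put("row-1", "G:gene_name", "name-v2", ts=2)
table.put("row-1", "DSI:HPRD_id", "id-1", ts=1)
latest = table.get("row-1")
```

In this sketch an update is simply a second `put` with a newer timestamp, which mirrors how the put command serves both the create and the update operation.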

The objectives of this paper are:
• A unified model for integration of different data sources
• NoSQL storage model
• Empirical study using HBase
The rest of this paper is structured as follows: Section II highlights the related work. Section III describes the data set. Section IV explains the proposed model. Section V presents the evaluation and case studies of that model. Section VI concludes the whole work and mentions future research directions in this field.

II. RELATED WORK
Zanzoni et al. [9] worked on protein interaction databases, which are distinctive tools for storing information that is scattered across the scientific literature in a computer-readable form. A systematic and easily accessible database permits the examination of wide interaction data sets and enables easy retrieval. MINT is a database that stores data on functional interactions among proteins. It was also designed to keep further types of functional interactions, including enzymatic modifications of one of the partners. In addition, it catalogues binary complexes. Chaurasia et al. [19] worked on the systematic mapping of protein interactions. This mapping is regarded as a dominant task in functional genomics, and numerous strategies have been followed to map human PPIs. The authors produced a different kind of data set that is of high value for medical experts and biomolecular data researchers. They introduced an open data management system named UniHI to store and query information on more than 17000 human protein interactions.
Apweiler et al. worked on the Universal Protein Resource (UniProt) [20], which is considered a vital source of protein sequence data in bioinformatics and is built on three data storage components. The first is the UniProt Knowledgebase (UniProtKB), which holds protein annotations; the second is UniProtKB/TrEMBL, which stores automatically annotated entries; and the third is UniProtKB/Swiss-Prot, which stores manually annotated entries. The database not only stores protein annotations but also helps researchers to query annotations and cross-references by linking them to previous work. It is an open project that can be freely downloaded and used to obtain complete proteomes.
Chen et al. worked on visualizing human protein-protein interactions (PPIs) and the functional role of the data. Although numerous human PPI databases existed at that time, none of them defined all features of the data well. The authors named their data management system the Human Annotated and Predicted Protein Interaction (HAPPI) [21] database. It extracts and integrates protein interaction databases, namely BIND, STRING, HPRD, MINT, and OPHID, by means of database integration procedures. HAPPI is an open project that provides annotated information to help discover new horizons in biomolecular networks.
Aryamontri et al. worked on the description and study of genetic, protein, and chemical interactions for all species and introduced the Biological General Repository for Interaction Datasets (BioGRID) [22]. BioGRID is an open hub that provides biological processes related to human diseases and suggests treatments for them. This data store includes 27501 chemical-protein interactions that help to discover drugs to cure diseases. BioGRID is a dynamic interaction network that relates genetic and protein interactions, including bioactive compounds. The system presents results as visualisations that can be adjusted according to the user's requirements.

Saeed et al. worked on proteomics and genomics. Proteogenomics is an evolving field [23]. The authors used mass spectrometry for proteomics and next-generation sequencing for genomics, and combined the two to mine proteogenomics data sets. Sequencing and high-performance computing solutions for such big and complex data are also discussed. The authors describe possible storage formats and analysis problems for such multidimensional, large, and unstructured proteogenomics data sets. The study helps the research community to recognize the challenges and to work on the future guidelines that it discusses.
Lehne et al. surveyed protein interaction databases [24]. As the number of known protein-protein interactions grows over time, several easily accessible databases are available to store the information related to these interactions. The authors collected useful information from six major databases: the Biological General Repository for Interaction Datasets [BioGRID], the Molecular INTeraction database [MINT], the Biomolecular Interaction Network Database [BIND], the Database of Interacting Proteins [DIP], the IntAct molecular interaction database [IntAct], and the Human Protein Reference Database [HPRD]. All these databases show different information on PPIs and annotations.
Zhang et al. used model-driven architecture [25] software that can store DNA and protein sequences efficiently. The authors stored overlapping and non-overlapping DNA sequences on the Apache Hadoop platform for space efficiency.
Xu worked on the vast body of available protein data, including protein functions, sequences, annotations, and structures. The author opened a new area of research by studying relationships between proteins of one family, between different protein families of one genome, and between the proteins of different species. This study helps researchers to mine related data and to carry out predictive analysis based on PPIs. The work was done in Hadoop, and its MapReduce functionality was used to explore insights into protein data storage.
Taylor has worked extensively on the Hadoop platform using the MapReduce framework. Because bioscientists have started dealing with ultra-large-scale data set analytics [27], the author used Hadoop as open software for petabyte-scale data in distributed environments. Hadoop provides an efficient and cheap solution for NGS analysis of ultra-large data sets distributed across the cloud. The implementation includes HBase data storage along with Hadoop's MapReduce function for data analytics.
Sarwar et al. studied bioinformatics tools for sequencing [28], which are helpful for storing a large amount of genomics data within a short time. Their analysis showed that conventional bioinformatics tools cannot cope with the rate at which such large amounts of genomics data are produced. There is therefore a need to update existing tools or to develop new ones that define proper storage structures for genetic data and so open new research directions.
Ali et al. [29] discussed microarray data analysis, giving the details of many gene selection/extraction and classification tests and algorithms. They also discussed the performance of different algorithms and machine learning techniques. Ahmed et al. [30] discussed modern data formats (models) for Spark implementations, techniques in Hadoop MapReduce, and machine learning algorithms. They also compared the performance of different data formats. R. Rehman et al. [31] explained the importance of the Scala language for bioinformatics tools and algorithms. They listed the languages supported by motif finding tools, multiple sequence alignment tools, and pairwise alignment tools.

III. DATA STATISTICS
The data set consists of protein, gene, and disease columns that have different types of interaction among them. It contains different column families, each of which can have one or more columns. These columns hold values according to their families. The proposed data set contains 7 column families and defines a different number of columns in each family. The protein, gene, and disease interaction values are taken from different protein-interaction databases such as BioGRID, HPRD, EntrezGene, and Ensembl. The data set covers the Homo sapiens organism. The data sets available on these platforms are in the form of CSV files. The HBase column-oriented database is used for the storage of the data.
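The step from CSV files to column-family cells can be sketched as follows. This is a minimal illustration under stated assumptions, not the loading procedure used in the paper: the header convention `family:qualifier`, the key column `gene_id`, the function name `csv_to_cells`, and all sample values are hypothetical.

```python
import csv
import io


def csv_to_cells(csv_text, key_column):
    """Yield (row_key, 'family:qualifier', value) for every non-empty CSV field.

    Assumes each non-key header is already written as 'family:qualifier'.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    for record in reader:
        row_key = record[key_column]
        for column, value in record.items():
            if column == key_column or not value:
                # Sparse storage: an empty field produces no cell at all.
                continue
            yield row_key, column, value


# Hypothetical two-row extract; the second row has no DSI or disease entry.
sample = (
    "gene_id,DSI:BioGrid_id,G:gene_name,disease:OMIM_id\n"
    "g1,b1,GENE_A,o1\n"
    "g2,,GENE_B,\n"
)
cells = list(csv_to_cells(sample, key_column="gene_id"))
```

Each tuple corresponds to one put against the table, so a record with missing values simply results in fewer cells rather than in empty placeholders.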

IV. PROPOSED MODEL
A model is an object or a procedure that explains some particular phenomenon. Many models exist for PPI data. These models are used to store, analyse, and search information related to protein interactions, and they also specify the characteristics of PPI data. Different models serve different purposes and are used in various fields. These models include DIP, OMIM, BioGRID, STRING, UniProt, HPRD, IntAct, and so on. The Database of Interacting Proteins (DIP) records experimentally determined interactions for various organisms. DIP contains 20728 proteins, 57683 interactions, and eight species, including Escherichia coli, Rattus norvegicus, Homo sapiens, Mus musculus, Helicobacter pylori, and Drosophila melanogaster. Its query format works like that of relational databases, and the user can submit a text query via a web browser that displays the results in visual form. It is organized into five key tables that hold proteins, experiments, and related data.
MINT is designed to store information on functional interactions among proteins. It contains both physical interactions and other types of molecules. It delivers an integrated data model of experimentally confirmed protein interactions taken from the scientific literature by proficient curators.
The main function of BioGRID is to store protein and genetic data for various organisms. BioGRID is mainly focused on investigating interaction networks related to human health.
HPRD is a unified platform that integrates human proteome information and relates interaction networks between proteomes and diseases. It represents the relationships between them visually. All information in this database is manually mined and explored from the available literature by analysts, using an object-oriented database in Zope.
STRING is a database of predicted interactions for more than 8000 organisms. It is used to organize a massive class of biochemical relationships between proteins and between DNA. STRING covers two kinds of interaction: direct (physical) interactions and indirect (functional) ones, e.g. two proteins that participate in the same pathway.
MIPS is a research centre located in Neuherberg, Germany, with an emphasis on genomes of interest to bioinformatics. Its purpose is to support and preserve fungal and plant genome features in a regular, generic database.
All of these models store, analyse, and search information about protein interactions and some other features of PPI data. These databases use a relational schema to store data in a structured format. They offer a simple mechanism for the storage of data, but they cannot store unstructured or semi-structured PPI data sets.
In contrast to these researchers, we have designed a new data model for protein-protein, gene-protein, and gene-disease interactions. This model has two features that distinguish it from existing interaction models. First, we integrated all existing protein-protein interaction data models and protein-gene interactions. We provide the facility to query all information for a gene or protein, such as its protein interactions, gene interactions, and disease-related information, in one storage system. The second prominent feature of this model is that it follows a schema-less structure to store PPI data. Our data model uses NoSQL storage and can keep structured, semi-structured, and unstructured data on protein-protein and protein-gene interactions in specified formats.
Many technologies are available among NoSQL databases, but this model is developed using HBase, which is built on top of Apache Hadoop. HBase is a column-oriented, distributed database modelled on Google's Bigtable. It manages structured, semi-structured, and unstructured data. HBase is non-relational and open source, and it offers versioning, compression, scalability, and garbage collection. The data stored in HBase can be manipulated using Hadoop programming structures such as MapReduce. The storage format of HBase tables is given in Figure 1. We implemented our data model of protein-protein and protein-gene interactions in Apache HBase using column families for different purposes: data source integration, protein details, gene details, RefSeq, sequences in different formats, protein molecular information, and biological information on the protein/gene. These column families have different numbers of columns. The details of the data model are shown in Figure 2.
The first column family is named "Data-Integration-Source" and has a defined number of columns. Its columns contain IDs from the different data sources such as BioGRID and HPRD. The IDs (Entrez-gene, UniProt, STRING, IntAct, OMIM, Ensembl, Swiss-Prot, HGNC, MINT, and DIP) differ for the same protein from one source to another. Since this model integrates all existing models in a single column family, the interaction types, interaction methods, confidence scores, and all the features of proteins/genes can be viewed. The column family "Protein" has a column named "protein-name" that gives the protein name. The "Gene" column family has four columns: Gene-name, official symbol, official name according to the NCBI taxonomy, and information about the gene.
The "Ref-seq" column family has three columns: RefSeq-No, locus, and Accession. This family gives the RefSeq number, locus, and accession of the protein from the NCBI database. The "Sequence" column family gives the FASTA, DNA, and protein/gene sequences in three columns.
Two more column families are "Protein-Molecular-info" and "Biological-Process". The "Protein-Molecular-info" column family has three columns that provide information on the protein/gene: molecular-weight, molecular-class, and molecular-function. The "Biological-Process" column family gives information about the biological processes of the protein/gene. This NoSQL data model is a protein/gene interaction model that stores a huge amount of data in a de-normalized form. It is designed for efficient storage, fast searching, deep analysis, and integration of all models. It provides low-latency operations on protein interaction data, including access to the interaction data of a single protein or gene among billions of interaction records.
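The seven column families described above can be written down as a small schema, from which an HBase shell `create` statement follows directly. This is a sketch only: the table name `protein_data`, the helper `create_statement`, and the exact spelling of the column qualifiers are assumptions made for illustration, and the columns of "Biological-Process" are left empty because the text does not enumerate them.

```python
# Column families and columns as described in the text; the qualifier
# spellings are illustrative assumptions, not the authors' exact names.
SCHEMA = {
    "Data-Integration-Source": [
        "Entrez-gene", "UniProt", "STRING", "IntAct", "OMIM",
        "Ensembl", "Swiss-Prot", "HGNC", "MINT", "DIP",
    ],
    "Protein": ["protein-name"],
    "Gene": ["Gene-name", "official-symbol", "official-name", "gene-info"],
    "Ref-seq": ["RefSeq-No", "locus", "Accession"],
    "Sequence": ["FASTA", "DNA", "protein-gene"],
    "Protein-Molecular-info": [
        "molecular-weight", "molecular-class", "molecular-function",
    ],
    "Biological-Process": [],  # columns not enumerated in the text
}


def create_statement(table, schema):
    """Build an HBase shell 'create' command for the given column families.

    Only column families are declared at creation time; individual
    columns appear when the first cell is written to them.
    """
    families = ", ".join(f"'{name}'" for name in schema)
    return f"create '{table}', {families}"


statement = create_statement("protein_data", SCHEMA)
```

Because only families are fixed up front, a new identifier source could later be added as one more column under "Data-Integration-Source" without altering the table definition, which is the schema-less property the model relies on.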

V. EVALUATION OF MODEL
Our NoSQL data model integrates different protein-protein interaction databases such as OMIM, BioGRID, UniProt, HPRD, Ensembl, UniHI, HAPPI, APID, and MiMI. Setting up the data model starts with the installation of Apache Hadoop.
Hadoop is an open-source, fast, reliable, low-cost, distributed platform that scales from an individual server to thousands of machines. It provides local storage and computation on each machine and detects and handles failures at the application layer. Hadoop uses HDFS (Hadoop Distributed File System) by default, but our proposed data model stores its data in HBase on top of Hadoop.
We wrote simple queries that fetch related records from the model in order to identify different relationships among proteins, genes, and diseases. These queries can easily fetch data from the relevant columns of the column families according to user requirements. After entering the HBase shell, all operations can be applied to the created table named 'protein data'. We write the keyword scan followed by the table name in single quotation marks to get all data entries in that table along with the column names of every column family. The output of the scan query on the HBase table is shown in Figure 3. The data can also be extracted from an entire column family. The scan command extracts all cell entries along with the column names and timestamps. For example, scanning the column family named 'DSI' (Data-Source-Integration) returns all column names and data values in it. The names of the columns in this column family are IDs from all specified databases, written as BioGrid_id, EDS_id, EnGene_id, Ensembl_id, HPRD_id, IntAct_id, OMIM_id, String_id, and Uniprot_id, as given in Figure 4. Similarly, for the 'disease' column family, the query is written as the keyword scan followed by the table name and then the keyword COLUMNS along with the column family name, according to the syntax, to get all column entries. The query returns all columns covering the details of the disease for particular genes and proteins, as shown in Figure 5.
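The two scan forms described above (whole table, and a single column family selected with COLUMNS) can be sketched as follows. The helper names `scan_command` and `scan`, the table name `protein_data`, and the sample rows are hypothetical; the in-memory `scan` only imitates the ordering and filtering of a shell scan and is not the HBase implementation.

```python
def scan_command(table, family=None):
    """Build the HBase shell text for a full-table or single-family scan."""
    if family is None:
        return f"scan '{table}'"
    return f"scan '{table}', {{COLUMNS => '{family}'}}"


def scan(rows, family=None):
    """Yield (row_key, column, value) in row-key order, optionally one family.

    rows maps row key -> {"family:qualifier": value}.
    """
    for row_key in sorted(rows):
        for column, value in sorted(rows[row_key].items()):
            if family is None or column.split(":", 1)[0] == family:
                yield row_key, column, value


# Hypothetical rows with placeholder values.
rows = {
    "r2": {"DSI:HPRD_id": "h2", "G:gene_name": "B"},
    "r1": {"DSI:HPRD_id": "h1", "disease:OMIM_id": "o1"},
}
dsi_cells = list(scan(rows, family="DSI"))
full_scan = scan_command("protein_data")
dsi_scan = scan_command("protein_data", "DSI")
```

Restricting a scan to one column family keeps the output to the columns of interest, which is what makes the per-family queries for 'DSI' and 'disease' practical on a wide table.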
Similarly, to scan the gene ('G') column family, the query is written as the keyword scan followed by the table name and then the keyword COLUMNS along with the column family name, according to the syntax, to get all column entries. The query returns all columns covering the details of genes, such as gene name, gene symbol, gene location, gene coordinates, and gene information. The "G" stands for gene in the query. Figure 6 shows the different gene attributes.
To get the details of all columns in all column families for a particular entity, we have to specify the index (row key) of that row. For example, 'Entrez gene/locuslink: 8797' is used as the index to get all entries for this record. The result is a separate list of all column families, each followed by a colon (:) and the names of its columns that have data entries. The query format and its results are shown in Figure 7.
Apache Drill is an open-source platform that runs SQL queries on NoSQL databases that store big data. The main purpose of this framework is to provide a standard language like SQL for querying the data sets of big data applications (which can be semi-structured and/or unstructured) stored in NoSQL storage formats. Apache Hive and Apache HBase are not enabled in Drill by default; we have to enable these storage formats and configure the data ports on which our local host is working. Drill can query multiple data storage systems in one single query. For example, a user can query account information from HBase and event logs from local HDFS in Hadoop. Drill's datastore-aware optimizer can automatically restructure queries to leverage a datastore's internal processing capabilities. Apache Drill also exploits data locality, so keeping Drill and the datastore on the same nodes can save time and provide faster results.
In this model, we use Apache Drill together with Apache HBase to get results from the protein and gene interaction data sets. The query format of Apache Drill is different from that of HBase. For our proposed data model, a Drill query that gets all entries of columns from the same column family is easy to define. For example, if we want the gene IDs of all databases stored under the 'DSI' column family, we have to mention the table name, column family name, and column name from the HBase table. The query format and its results are shown in Figure 8.
A Drill query can also retrieve data from different column families at the same time to predict different relations in our proposed model. First, we mention 'Gene_id' as the row_key for indexing; the required column names are then referenced with the dot operator through their column families and the table name. The query that gets the disease ID named 'OMIM_id' from the 'disease' column family and the associated gene name from the 'G' column family is shown in Figure 9. This NoSQL data model also makes it possible to search for data against a particular value in any column of one or more column families. For example, if we want the gene ID for a specific BioGrid_id='121229' from the full table, we use a 'WHERE' clause followed by the value to be matched. The query and the retrieved information for this case are shown in Figure 10.
Just as we extracted specific IDs from the data-source column family and the "G" gene column family in the HBase shell, we can use a Drill query to get sequences of different types, such as FASTA, DNA, and protein sequences, from the 'seq' column family. For our data model, the Drill query that gets all this information along with the proteins and their related genes is given in Figure 11. We have written another Drill query to show some important relations between gene names and protein names for a particular Gene_id as defined in NCBI's Entrez database. This query looks up the gene name for a particular gene_id and shows the name of the protein with which it interacts. The query that extracts gene and protein names from the "DSI" (data source integration) and "G" gene column families, respectively, is shown in Figure 12.
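The Drill queries described above (columns addressed through table, column family, and column name, with an optional WHERE filter) can be assembled as in the following sketch. It is not the query text shown in the figures: the helper `drill_select`, the table name `protein_data`, the storage plugin name `hbase`, and the qualifier `gene_name` are assumptions. Drill returns HBase values as bytes, so each column is wrapped in `CONVERT_FROM(..., 'UTF8')` to obtain readable text.

```python
def _utf8(expr):
    """Wrap a Drill expression so HBase bytes are decoded as UTF-8 text."""
    return f"CONVERT_FROM({expr}, 'UTF8')"


def _cell(alias, family, qualifier):
    """Address one HBase column as alias.`family`.`qualifier`."""
    return f"{alias}.`{family}`.`{qualifier}`"


def drill_select(table, columns, where=None, alias="t"):
    """Build a Drill SQL string over an HBase table.

    columns: list of (family, qualifier) pairs to return.
    where:   optional ((family, qualifier), value) equality filter.
    """
    parts = [_utf8(f"{alias}.row_key") + " AS row_key"]
    for family, qualifier in columns:
        parts.append(_utf8(_cell(alias, family, qualifier)) + f" AS `{qualifier}`")
    sql = "SELECT " + ", ".join(parts) + f" FROM hbase.`{table}` {alias}"
    if where is not None:
        (family, qualifier), value = where
        literal = str(value).replace("'", "''")  # escape single quotes
        sql += " WHERE " + _utf8(_cell(alias, family, qualifier)) + f" = '{literal}'"
    return sql


# One family only, then two families filtered by a BioGrid_id value.
dsi_query = drill_select("protein_data", [("DSI", "BioGrid_id")])
relation_query = drill_select(
    "protein_data",
    [("disease", "OMIM_id"), ("G", "gene_name")],
    where=(("DSI", "BioGrid_id"), "121229"),
)
```

The row key plays the role of 'Gene_id' here, so every result row carries the gene identifier alongside the requested columns, whichever column families they come from.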

VI. CONCLUSION
It is concluded from the above discussion that an integrated NoSQL data model for protein-protein, protein-gene, and gene-disease interactions can help researchers to gain insights into biomolecular networks. The data model can return, all in one place, the important factors that take part in interactions, such as gene ID, gene name, gene location, gene code, protein name, protein structure, disease ID, and disease name. The proposed data model provides a storage format well suited to this type of data set (huge, complex, and unstructured) and overcomes the limitations of relational databases. The model has been implemented for 8000 different entries covering all the defined interactions, and the search results obtained are faster and more effective than with existing data models. This data model is an organized compilation of genes, proteins, and diseases from all known available resources that relates the different factors among them. The Apache Drill queries written for the proposed data model are easy to apply to any biomolecular data set of this type. Drill gives users and researchers the opportunity to query column-wise, so that values are returned from the required columns only and unrelated entries for the queried value are not displayed. Future work may involve unifying all gene-phenotype associations for the diseases or other important features such as the treatment of diseases or the environmental risk factors that cause gene mutations.
[BioGRID], the Molecular INTeraction database [MINT], the Biomolecular Interaction Network Database [BIND], the Database of Interacting Proteins [DIP], the IntAct molecular interaction database [IntAct] and the Human Protein Reference Database [HPRD]).All these databases show different information on PPI and annotations.

Fig. 2. Unified Data Model for P-P, G-P, and G-D Interaction

Fig. 8. Extracting Column-ID of DSI Column Family using Apache Drill

Fig. 9. Disease ID against Gene/Protein in Model using Apache Drill

Fig. 12. Extraction of Gene_Name and Protein_Name Using Apache Drill
contains both physical interactions and other types of molecules. It delivers an integrated data model that experimentally confirms proteins interactions given in scientific literature by proficient curators.