A Simple Approach for Representation of Gene Regulatory Networks (GRN)

Gene expression is controlled by a series of processes known as gene regulation, and the abstract mapping of these processes is represented by a Gene Regulatory Network (GRN), a descriptive model of gene interactions. Reverse engineering GRNs can reveal the complexity of gene interactions, and understanding these interactions can expose further details of the underlying biology. RNA-seq data provides a better measurement of gene expression; however, it is difficult to infer GRNs from it because of its discreteness. Several methods have already been proposed to infer GRNs from RNA-seq data, but these methodologies are difficult to grasp. In this paper, a simple model is presented to infer GRNs using the RNA-seq-based co-expression map provided by the GeneFriends database, and a graph database tool is used to create the regulatory network. The results show that it is convenient to use graph database tools to work with regulatory networks instead of developing a new model from scratch.
Keywords: Graph theory; graph database; gene regulatory networks; RNA-seq; gene co-expression; Neo4j


I. INTRODUCTION
The information required to make proteins and other molecules is stored in DNA. The functional circuitry of all living organisms is formed by genes [1], and synergistic actions between interrelated genes underlie all biological reactions inside a cell. Gene regulation is the mechanism that increases or decreases the production of gene products. In this process, genes are regulated by regulators to produce proteins or RNA, which gives rise to a complex network of regulatory relationships. To understand cellular processes, it is critical to understand the regulatory relationships between genes. These relationships are expressed by a gene regulatory network and are used to understand the functions of genes. A gene regulatory network consists of nodes, which represent genes, and edges between nodes, which represent relationships between genes. GRNs play an important role in cell transduction, metabolism, cell differentiation, the cell cycle and every other biological mechanism. Gene regulatory networks can provide an in-depth and comprehensive understanding of complex biological processes. Their study not only unveils the dynamics of organisms but also reveals their behavior in different scenarios and shows how their fate is controlled. The interaction of genes can be understood by reconstructing gene regulatory networks.
Representing genetic data, reverse engineering GRNs and performing analytics on regulatory networks to retrieve information are challenging tasks. Such analysis can not only help to diagnose diseases but can also shed light on the changes that caused a disease and the changes that occurred because of it. Different people can be at different stages of a disease and have different medical histories, which can lead to different responses to treatment; GRNs can be helpful in individualized treatment [2]. Determining drug sensitivity on a target is an important aspect of individualized treatment and is a research area in which GRNs can be used to detect drug sensitivity. Using GRNs, we can ask whether the function of a mutated gene can be restored. The main genes responsible for the differentiation of a cell into an organ, or responsible for a disease, can be identified using GRNs. Circadian rhythm plays a fundamental role in regulating a wide range of cellular, physiological and behavioral activities. Researchers know a small number of genes that play a key role in circadian rhythm; starting from these genes, the existence of other key genes and how they work can be unveiled through GRNs.
Two main types of data are used to infer GRNs: microarray and RNA-seq. Microarrays measure continuous probe intensities, whereas RNA-seq quantifies discrete, digital sequencing read counts aligned to a reference sequence. Differential gene expression is measured more accurately with transcriptome sequencing (RNA-seq) than with microarrays. Different splice variants and non-coding RNA (ncRNA) can play an important role in the regulation of gene expression. The measurement of transcript levels provided by RNA-seq is far more precise than that of other methods [3]. Beyond gene expression analysis, a single RNA-seq experiment can identify novel isoforms, novel transcripts, allele-specific expression, alternative splice sites and rare transcripts. It enables types of experiments that traditional microarray-based methods cannot support. The large size of RNA-seq data makes results harder to interpret, so the processing and interpretation of RNA-seq experiments has often been impeded. Before GeneFriends [4], no database was available to the bioscience community that provided RNA-seq data in a form from which useful information could be extracted without the time-consuming process of analyzing raw RNA-seq data. GeneFriends allows researchers to identify genes that are poorly annotated and associated with the genes under study.
Genes that are responsible for lung cancer are used in this research work to infer a regulatory network. The GeneFriends database is used to obtain the gene co-expression network, and an existing graph database tool is used to infer the GRN. The deployment shows that this approach is much simpler than other existing methods.
www.ijacsa.thesai.org

II. METHODOLOGY

A. Data Set Selection
The genes involved in commonly identified genomic/genetic alterations in the lungs, including chromosomal fusion rearrangements, nonsense or missense mutations, alternative splicing and small deletions or insertions, are EGFR, KRAS, MET, LKB1, BRAF, PIK3CA, ALK, RET and ROS1; in other words, these genes are the most relevant to lung cancer [5]. These genes are selected in this research work to infer the regulatory network.

B. Data Extraction and Transformation
Ensembl is a project that produces genomic databases for vertebrates. It is used to extract the Ensembl transcript IDs of the selected lung cancer genes (Table I).

Table I (columns): Gene, Ensembl ID

These IDs are used in the GeneFriends database to generate the co-expression network of all the genes. The connection strength of direct partners within the seed list is 1; genes with a connection strength of 0.75 are strongly co-expressed; 0.5 is the connection strength of other direct partners; and indirect partners have a strength of 0.25. More than 1700 gene entries were generated as a co-expression network in CSV format.
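The four connection-strength tiers described above can be attached to the rows of an exported co-expression file. The following is a minimal sketch only: the column names `source`, `target` and `strength` and the gene names are assumptions made for illustration, not the actual layout or content of a GeneFriends export.

```python
import csv
import io

# Tier labels paraphrase the description in the text; the CSV column
# names used below are assumptions, not the real GeneFriends format.
STRENGTH_TIERS = {
    1.0: "direct partner within the seed list",
    0.75: "strongly co-expressed",
    0.5: "other direct partner",
    0.25: "indirect partner",
}


def read_coexpression(csv_text):
    """Parse (source, target, strength) rows and attach a tier label."""
    rows = []
    for rec in csv.DictReader(io.StringIO(csv_text)):
        strength = float(rec["strength"])
        rows.append({
            "source": rec["source"],
            "target": rec["target"],
            "strength": strength,
            "tier": STRENGTH_TIERS.get(strength, "unknown"),
        })
    return rows


# Toy data with placeholder gene names, not actual GeneFriends output.
SAMPLE = """source,target,strength
GENE_A,GENE_B,1
GENE_C,GENE_A,0.75
GENE_A,GENE_D,0.5
GENE_A,GENE_E,0.25
"""

edges = read_coexpression(SAMPLE)
for edge in edges:
    print(edge["source"], "->", edge["target"], edge["strength"], edge["tier"])
```

Keeping the tier as a label next to the numeric strength makes later filtering by relationship category straightforward.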

C. Tool
In our research scenario, graph databases are appropriate tools to infer the network. These are database engines that model nodes and edges as first-class entities, so complex interactions between nodes can be represented in a natural form. Multiple graph databases are available, including ArangoDB, Neo4j, Oracle Spatial and Graph, IBM System G Native Store and OrientDB. The Neo4j [6] graph database is used in this research work because it protects data integrity while providing fast reads and writes, and it is easy to learn.

D. GRN Construction
The Cypher query language is used to query the dataset. The network data, available in CSV form, was loaded into the graph database. This data file contains genes and their connection strengths. The first gene was treated as the source node, the second as the destination node, and the connection strength between the two genes was used to create an edge between them (Figure 1). While creating nodes, the MERGE clause was used instead of CREATE because values in the data file repeat many times, and CREATE always creates a new node with the same label and property value for each repetition, which is not wanted in this research work. Such repetition could not be afforded because multiple nodes with the same properties add extra work when retrieving information about a single gene and require more computing resources. MERGE, in contrast, creates only one node for all repetitions of a value. This does not hold in every case: if a pattern containing unbound elements is merged against an existing graph, nodes can be repeated, so bound elements should always be used when an existing graph is involved. As there was no existing graph in this research work, this issue was not considered. A value from the data file was required to create each relationship, and the default mechanism provided by Neo4j could not be used because it does not provide this functionality. The apoc.create.relationship procedure provided by the APOC plugin was used for this purpose.
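The difference between CREATE and MERGE described above can be illustrated without a running database. The sketch below is not the query used in this work: `LOAD_QUERY` is an illustrative Cypher statement whose file name, `Gene` label, `ensembl_id` property and column names are all hypothetical, and using the connection strength as the relationship type is an assumption. `ToyGraph` is a small in-memory stand-in that only mimics how the two clauses treat repeated values.

```python
# Illustrative Cypher only. The file name, label, property and column names
# are hypothetical, and the statement is not executed in this sketch.
LOAD_QUERY = """
LOAD CSV WITH HEADERS FROM 'file:///coexpression.csv' AS row
MERGE (src:Gene {ensembl_id: row.source})
MERGE (dst:Gene {ensembl_id: row.target})
WITH src, dst, row
CALL apoc.create.relationship(src, row.strength, {}, dst) YIELD rel
RETURN count(rel)
"""


class ToyGraph:
    """In-memory stand-in contrasting CREATE-like and MERGE-like node handling."""

    def __init__(self):
        self.nodes = []  # one property dict per node
        self.edges = []  # (source_index, rel_type, target_index)

    def create_node(self, ensembl_id):
        # CREATE-like: always appends a new node, even for a repeated ID.
        self.nodes.append({"ensembl_id": ensembl_id})
        return len(self.nodes) - 1

    def merge_node(self, ensembl_id):
        # MERGE-like: reuse the matching node if present, otherwise create it.
        for index, node in enumerate(self.nodes):
            if node["ensembl_id"] == ensembl_id:
                return index
        return self.create_node(ensembl_id)

    def add_edge(self, source_index, rel_type, target_index):
        self.edges.append((source_index, rel_type, target_index))


def load(rows, use_merge):
    """Build a ToyGraph from (source, strength, target) rows."""
    graph = ToyGraph()
    make_node = graph.merge_node if use_merge else graph.create_node
    for source, strength, target in rows:
        source_index = make_node(source)
        target_index = make_node(target)
        graph.add_edge(source_index, strength, target_index)
    return graph


# Placeholder gene names; each gene appears in two rows.
ROWS = [
    ("GENE_A", "0.75", "GENE_B"),
    ("GENE_A", "0.5", "GENE_C"),
    ("GENE_B", "0.25", "GENE_C"),
]

merged = load(ROWS, use_merge=True)
created = load(ROWS, use_merge=False)
print(len(merged.nodes), len(created.nodes))  # 3 6
```

With three distinct genes each appearing twice, the MERGE-like loader keeps three nodes while the CREATE-like loader produces six, which is the duplication the text warns against. Both loaders create the same three relationships.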

III. RESULTS AND DISCUSSION
Using Neo4j, a network of more than 1700 genes was created in which nodes indicate genes, relationships among nodes indicate the connection strength between genes, and Ensembl IDs were assigned as property values of nodes. In this network, the first value is treated as the source node and is indicated by blue nodes, the second value is treated as the relationship type, and the third value is treated as the destination node and is indicated by green nodes. The whole network is shown in Figure 2. There are 322 source nodes and 852 destination nodes with 1791 relationships, of which three source nodes are connected to three destination nodes with connection strength 1. There are 65 source nodes connected to only 8 destination nodes with a connection strength of 0.75. On the other hand, 9 source nodes are connected to 313 destination nodes with a connection strength of 0.5, and 0.25 is the connection strength of 1599 relationships among 1068 nodes, of which 322 are source nodes and 746 are destination nodes.
One node in the network has more relationships with other nodes than any other. This means that ALK, the gene associated with this node, is the most important gene in the entire network. RET is a direct partner of ALK within the seed list (Figure 3). The RET gene provides instructions for producing a protein involved in signaling within cells. This protein is needed for the development of several kinds of nerve cells. Mutations in this gene are a cause of Hirschsprung disease, pheochromocytoma and, most importantly, lung cancer.
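Summary figures of the kind reported above (relationships per connection strength, distinct source and destination nodes, and the most connected node) can be derived from an edge list in a few lines. This is a hedged sketch on toy data: the gene names are placeholders, the edge tuple layout is an assumption, and "most connected" is taken here to mean the highest total number of incoming and outgoing relationships.

```python
from collections import Counter


def summarize(edges):
    """Summarize an iterable of (source, strength, target) tuples."""
    per_strength = Counter()
    degree = Counter()
    sources, targets = set(), set()
    for source, strength, target in edges:
        per_strength[strength] += 1
        degree[source] += 1  # outgoing relationship
        degree[target] += 1  # incoming relationship
        sources.add(source)
        targets.add(target)
    hub, hub_degree = degree.most_common(1)[0]
    return {
        "relationships": sum(per_strength.values()),
        "per_strength": dict(per_strength),
        "source_nodes": len(sources),
        "destination_nodes": len(targets),
        "hub": hub,
        "hub_degree": hub_degree,
    }


# Toy edge list with placeholder names, not the network from this work.
EDGES = [
    ("GENE_A", 1.0, "GENE_B"),
    ("GENE_C", 0.75, "GENE_A"),
    ("GENE_A", 0.5, "GENE_D"),
    ("GENE_A", 0.25, "GENE_E"),
    ("GENE_C", 0.25, "GENE_E"),
]

summary = summarize(EDGES)
print(summary)
```

A node can appear both as a source and as a destination, so the source and destination counts need not add up to the number of distinct nodes.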
There are 32 genes associated with ALK with a connection strength of 0.75 (Figure 4). These 32 genes act as sources for ALK, which means ALK is co-expressed with them. On the other hand, 39 genes are connected to ALK as destination nodes with a connection strength of 0.5 (Figure 5), which means ALK is a direct partner of these 39 genes. ALK also has 8 relationships of strength 0.25 as a source node with 4 destination nodes (Figure 6).
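The neighbourhood of a single gene, split by direction and connection strength as described above, can be sketched as follows. The function name, the edge tuple layout and the gene names are assumptions for illustration; the toy data does not reproduce the ALK counts.

```python
from collections import defaultdict


def neighbours_by_strength(edges, gene):
    """Group the neighbours of `gene` by direction and connection strength."""
    incoming = defaultdict(set)  # strength -> genes acting as sources of `gene`
    outgoing = defaultdict(set)  # strength -> genes acting as destinations of `gene`
    for source, strength, target in edges:
        if target == gene:
            incoming[strength].add(source)
        if source == gene:
            outgoing[strength].add(target)
    return incoming, outgoing


# Toy edge list with placeholder names.
EGO_EDGES = [
    ("G1", 0.75, "HUB"),
    ("G2", 0.75, "HUB"),
    ("HUB", 0.5, "G1"),
    ("HUB", 0.5, "G3"),
    ("HUB", 0.25, "G4"),
]

incoming, outgoing = neighbours_by_strength(EGO_EDGES, "HUB")

# Genes that are sources at 0.75 and also destinations at 0.5.
common = incoming[0.75] & outgoing[0.5]
print(sorted(incoming[0.75]), sorted(outgoing[0.5]), sorted(common))
```

The set intersection gives the genes linked to the chosen gene in both roles, which is the kind of overlap that can then be tabulated.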
Since ALK is present at every connection strength, it is the most important node in the entire network. There is strong evidence that it is a driving force of different types of cancer, including non-small cell lung cancer and neuroblastoma [7], and the network inferred in this research supports this. Table II lists 31 genes that are connected to ALK with both connection strengths, 0.75 and 0.5. With a strength of 0.75 they are source genes of ALK, and with a strength of 0.5 they are destination genes of ALK. This means that if ALK is directly co-expressed with these genes, it is also their direct partner at the same time. These genes are connected not only to the genes mentioned; multiple other genes interact with them throughout the network.
Multiple techniques are already available to infer regulatory networks (Section IV). These methods, like most others proposed over the last two decades to infer GRNs, have one thing in common: implementing them requires considerable expertise in graph theory or in mathematics in general. The method and existing tool presented in this research work do not require such expertise. Although graph theory works at the back end of the tool, researchers do not need to know how it works or how it infers networks. Instead of spending time comprehending and implementing those methods, researchers can easily infer a GRN and spend most of their time analyzing the inferred regulatory network, which is the actual process of unveiling the underlying mechanisms of gene interactions.

IV. RELATED WORK
To integrate a graph database with the analytical process of transcriptome data, a platform is presented in [8] through which data from Affymetrix platforms on rhesus, rat, mouse and human can be analyzed. bLARS, a regression-based algorithm, is used to construct GRNs from steady-state gene expression data [4]; it allows different genes to have different regulatory mechanisms and uses bootstrapping for scoring. Based on the swarm intelligence techniques FA, PSO and BA-PSO, the RNN formalism is used to investigate reverse engineering of GRNs from time-series microarray datasets [1]. A GRN post-processing tool for refining classical network thresholding is presented in [9]; edges linking nodes that belong to the same cluster with nodes that have higher weight are favored by this method. It uses a random walker to jointly compute an optimal gene and select an optimal edge. GRN inference improves when a clustering process is introduced into edge selection. A novel technique for discovering gene regulatory networks is proposed in [10], in which the discovery process is integrated into heuristic information. To construct large-scale gene regulatory networks, a dynamic multi-agent genetic algorithm based on FCM is presented in [11]. The method proposed in [12] uses the low-rank property to construct a common GRN structure from other inferred GRNs; the drug effect is also inferred and estimated by this method. The benefits of targeting tumor cell metabolism with the anti-diabetic drug Metformin are investigated using simulations [13]. The use of the S-System modelling formulation is proposed in [14]: by combining standard system identification procedures with this modelling formalism, the type of regulation between genes is established, and a model suitable for designing a synthetic genetic feedback controller is then derived. To predict transcription factors and gene interactions, a method that uses iterative SVM and clustering is implemented in [15]. To discover relationships between genes, the method in [16] combines the PCA-CMI and GA algorithms: GA was performed to obtain the best predictor for a target gene, and the PCA-CMI method was used to create the initial population to reduce the search space. For encoding the dynamics of multi-valued networks, the use of FASP, an extension of ASP, as the language in the continuous domain is proposed in [17]. A methodology in which observed count data is modelled as negative binomially distributed is proposed in [18] to infer gene regulatory networks from RNA-seq time-series data. To identify multivariate gene interactions in RNA-seq data, an application of BEE and OBC is demonstrated in [19] to differentiate biological phenotypes. To infer gene regulatory networks, a Legendre neural network (LNN) is proposed in [20], and the Firefly algorithm is used to optimize its parameters.

V. CONCLUSION
Complex biological mechanisms are widely elucidated by gene expression information. Genes are expressed through interaction with one or more other genes, and these interactions form a regulatory network. Reverse engineering this regulatory network is an important task for gaining insight into biological mechanisms. RNA sequencing is a revolutionary technique for studying the cell transcriptome at the system level. In this research work, the graph database Neo4j is used as a tool to construct the GRN, and RNA-seq data for lung cancer genes provided by the GeneFriends database is used as the dataset. Many techniques are available to infer a GRN, but most of them implement complex mathematical models. In this research work, an existing tool is used; although graph theory works behind the scenes to infer the network in this tool as well, researchers do not have to pay attention to the background process and can instead focus on network analysis, so that the complex underlying mechanisms can be understood.