Identifying Cancer Biomarkers via Node Classification within a MapReduce Framework

Big data pose new research challenges in the life sciences because of their variety, volume, veracity, velocity, and value. Predicting gene biomarkers is one of the vital research issues in bioinformatics, where microarray gene expression and network-based methods can be used. These datasets are very large, which causes main memory problems. In this paper, a Random Committee Node Classifier (RCNC) algorithm is proposed for identifying cancer biomarkers, based on microarray gene expression data and Protein-Protein Interaction (PPI) data. Data are enriched from other public databases, such as IntAct, UniProt, and Gene Ontology (GO). When applied to different datasets, RCNC identifies cancer biomarkers with an accuracy of 99.16%, a precision of 99.96%, a recall of 99.24%, an F1-measure of 99.16%, and an ROC of 99.6. To speed up performance, the algorithm is run within a MapReduce framework, where the RCNC MapReduce algorithm is much faster than the sequential RCNC algorithm on large datasets.

Keywords: Big data; cancer biomarkers; MapReduce; node classification


I. INTRODUCTION
Bioinformatics is one of the main application areas that adopt big data, through microarray gene expression analysis, next generation sequencing, text mining of literature publications, and large graph analysis of biological networks, such as metabolic networks, signal pathways, and protein-protein interaction networks. Bioinformatics researchers have an excellent opportunity to achieve scalable, efficient, and reliable computing performance on Linux clusters and within cloud computing environments [1]. However, scalable and efficient data mining algorithms are needed to perform different tasks in bioinformatics.

Biomarkers play an important role in diagnosing cancer, assessing prognosis, and directing treatment. A cancer biomarker refers to a substance or process that is indicative of the presence of cancer in the body. A biomarker may be a molecule secreted by a tumor or a specific response of the body to the presence of cancer. Genetic, epigenetic, proteomic, glycomic, and imaging biomarkers can be used for cancer diagnosis, prognosis, and epidemiology. Biologists can now quickly identify hundreds, and even thousands, of candidate genes associated with a target disease or functionality. One of the main traditional techniques to find interactions and similar structures is applying text mining to literature abstracts, e.g., through PubMed [2,3]. However, this is very time consuming because of the tremendous volume of current literature.
Other techniques fall into two main categories: microarray gene expression analysis and biological networks. Microarray gene expression analysis can measure the expression of thousands of genes, which makes microarray technology a good way to identify biomarkers [4][5][6]. However, better prediction accuracy is required, since the accuracy of applying network techniques is relatively low. Several methods have been proposed to identify significant gene sets or pathways involved in diseases or biological processes by incorporating prior biological knowledge, such as gene set enrichment analysis or pathway enrichment analysis [7][8][9]. In addition, algorithms based on PPIs, protein-DNA interactions, or regulatory pathways have been developed. For instance, Chuang et al. [10] identified biomarkers of metastasis using breast cancer gene expression data, based on protein-protein interaction networks. Li et al. [11] introduced a network-constrained term based on the L1-norm of regression coefficients of microarray data. Jahid and Ruan [12] identified a small number of intermediate genes containing important information about the pathways involved in metastasis genes, using a randomized Steiner tree. Zhu et al. [13] built binary classifiers as prediction models, using support vector machines. In addition, Wei and Li [14] developed a Markov random field model for network-based analysis. Furthermore, Chen et al. [15] developed a network-constrained Support Vector Machine (netSVM) for cancer biomarker identification with improved prediction performance. Hwang et al. [16] applied a network propagation algorithm to study three large-scale breast cancer datasets, achieving competitive classification performance. Xia et al. [17] developed NetworkAnalyst, which enables high performance network analysis with a rich user experience in order to identify genes and proteins of interest in biological networks.
One of the main computational challenges that has become increasingly important is using High Performance Computing (HPC) in bioinformatics data analysis [18]. Another computer architecture and service model is cloud computing [19][20][21], which is used to scale up the performance of the required service. Recently, biomarker prediction based on large-scale feature selection and MapReduce has been discussed in [22], where K-means clustering and Signal-to-Noise Ratio have been combined with an optimization technique, Binary Particle Swarm Optimization. A key problem that arises when using hybrid approaches of microarray gene expression and network-based methods is handling very large networks, which require considerable processing time.
In this paper, a node classification algorithm is proposed in order to identify biomarkers, which is considered one of the main problems in the bioinformatics domain. This algorithm is applied and compared to other machine learning algorithms, such as naïve Bayes and random forest. In addition, the RCNC algorithm is applied within the MapReduce framework, which is part of the open source Apache Hadoop project. Node classification has been previously introduced in dynamic content-based networks [23]. The main contributions of this paper are:
1) A hybrid approach of microarray gene expression and PPI networks is proposed to predict protein biomarkers via the Random Committee Node Classifier (RCNC) algorithm.
2) Speeding up the performance of the algorithm via MapReduce.
3) Developing a topological PPI information network.
This paper is organized as follows: section two explains the materials and methods, and section three presents the results and discussion. Finally, section four concludes the work and gives insights into future work.

II. MATERIALS AND METHODS
In this section, a framework for identifying biomarkers based on node classification within MapReduce is proposed, as illustrated in Fig. 1. This framework depends on a hybrid approach of microarray gene expression data and a PPI network. The framework consists of two main phases, data preprocessing and biomarker identification, which are discussed in detail in the following subsections. The data preprocessing phase has two main goals: 1) computing Differentially Expressed Genes (DEGs) and 2) integrating data. The goal of the biomarker identification phase is to identify biomarkers for different types of cancer (breast, colon, ovarian, and hepatocellular carcinoma) using the proposed RCNC algorithm.

A. Phase I: Data Preprocessing
The objectives of this phase are to a) Compute Differentially Expressed Genes (DEGs) and b) Integrate Data.

1) Computing DEGs
Microarray technologies now enable the simultaneous interrogation of the expression level of thousands of genes to obtain a quantitative assessment of their differential activity in a given tissue or cell. Microarray analysis has enabled the identification of gene signatures for diagnosis, molecular characterization, prognosis, and treatment prediction. Microarray gene expression data are obtained from the GEO database for breast, colon, liver (hepatocellular carcinoma), and ovarian cancer. For each type of cancer, five series are used, as listed in Table I, where both healthy and unhealthy microarray gene expression series are downloaded (Affymetrix experiments). Differentially Expressed Genes (DEGs) are computed for all downloaded samples using the R statistical language: a t-test [23] is applied, and a p-value < 0.05 is set as the threshold for DEGs.
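The DEG selection rule above (t-test, p-value < 0.05) can be illustrated with a minimal Python sketch. It is not the R pipeline used in this work: the choice of Welch's unequal-variance variant, the function names, and the dictionary layout (gene identifier mapped to a list of expression values) are assumptions made for illustration. The p-value is obtained by numerically integrating the Student's t density so that the sketch needs only the standard library.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    norm = math.exp(log_norm) / math.sqrt(df * math.pi)
    return norm * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value, integrating the density with Simpson's rule."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3  # P(0 <= T <= |t|)
    return max(0.0, 1.0 - 2.0 * central)


def welch_t_test(a, b):
    """Return (t, df, p) for Welch's unequal-variance t-test (assumed variant)."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df, two_sided_p(t, df)


def select_degs(tumor, normal, alpha=0.05):
    """Keep genes whose tumor and normal expression differ with p < alpha."""
    return [g for g in tumor if welch_t_test(tumor[g], normal[g])[2] < alpha]
```

Calling `select_degs` with two dictionaries that share the same gene keys returns the genes passing the threshold. With thousands of genes tested at once, a multiple-testing correction would normally be considered as well; the text does not mention one, so none is applied here.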

2) Integrating data
Data integration is one of the vital tasks in bioinformatics, where many diverse public database formats exist, such as XML, CSV, and RDF. PPI data alone are sometimes not enough to identify biomarkers. As a result, in this approach data are integrated from heterogeneous resources: IntAct (release 2.5) and UniProt (August 2015), in addition to the DEG results of the microarray gene expression data computed in the previous step.
In this work, cancer interaction datasets are downloaded from IntAct; they cover the target types of cancer discussed here: breast cancer, ovarian cancer, hepatocellular carcinoma, and colon cancer. The following preprocessing steps are carried out for the IntAct and UniProt data:

1) Removing missing values
2) Deleting irrelevant attributes
3) Extracting data
4) Mapping attributes
To illustrate the idea, the downloaded cancer interaction data contain the UniProtKB identifiers of the interacting proteins, alternative identifiers for each protein in the IntAct database (European Bioinformatics Institute identifier), aliases, the interaction detection method (two hybrid, pull down, etc.), the publication date, the taxonomy identifier, the interaction type (physical association, colocalization, direct interaction, and association), the database source, the interaction identifier, and the confidence. Some of the GO annotations are missing, so the corresponding values are deleted. In addition, irrelevant attributes (attributes not used as parameters for determining biomarkers) are deleted: the publication date, taxonomy identifier, interaction detection method, interaction identifier, and source database. For each protein, the UniProtKB identifier is mapped to its corresponding entry in the UniProtKB database. Other information included from UniProtKB is the protein function and the Gene Ontology (GO) molecular function, biological process, and cellular component. In addition, the DisGeNet database has been used to validate the biomarker prediction results.
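The four preprocessing steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the field names (`protein_a`, `go_function`, and so on) are hypothetical, records are plain dictionaries rather than the actual IntAct export format, and a value of `None`, an empty string, or `"-"` is assumed to mark a missing entry.

```python
# Attributes the text lists as irrelevant for biomarker prediction
# (hypothetical field names).
IRRELEVANT = {"publication_date", "taxonomy_id", "detection_method",
              "interaction_id", "source_db"}


def preprocess(interactions, uniprot):
    """Clean interaction records and attach UniProtKB annotations."""
    cleaned = []
    for rec in interactions:
        # 1) Remove records with missing values.
        if any(v in (None, "", "-") for v in rec.values()):
            continue
        # 2) Delete irrelevant attributes.
        row = {k: v for k, v in rec.items() if k not in IRRELEVANT}
        # 3) Extract the annotations of both interactors.
        a = uniprot.get(row["protein_a"])
        b = uniprot.get(row["protein_b"])
        if a is None or b is None:
            continue  # identifier cannot be mapped, so the record is dropped
        # 4) Map the GO molecular functions onto the interaction record.
        row["go_a"], row["go_b"] = a["go_function"], b["go_function"]
        cleaned.append(row)
    return cleaned
```

Here `uniprot` is assumed to be a dictionary keyed by UniProtKB identifier. Dropping unmappable records is one possible policy; keeping them with empty annotations would be an equally valid design choice.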

B. Phase II: Biomarker Identification
To identify biomarkers, the RCNC algorithm is proposed, which relies on a topological node classification algorithm used in an ensemble learning manner. The problem of node classification has been addressed in a number of applications, such as social network analysis [25]. In this section, the RCNC algorithm for biomarker identification is explained in detail. RCNC uses a random committee technique, which is based on an ensemble of tree classifiers. Ensemble methods, which combine the decisions of multiple hypotheses, are some of the strongest existing machine learning methods [26][27][28]. Ensemble classifiers gather randomizable base classifiers, where each base classifier is built using a different random number seed. A random committee algorithm is an ensemble of random tree classifiers that predicts a class label by averaging probability estimates over these classification trees. This algorithm produces better overall accuracy for all testing cases than any individual committee member. In this paper, a random committee technique is used to handle: 1) very large data volumes, 2) inadequate data, and 3) the complexity of the decision boundary. The learning procedure for ensemble algorithms can be divided into the following two parts:
1) Constructing base classifiers/base models: In this part, data preprocessing is performed first, where noisy data are removed, and then the base classifiers are constructed. The data preprocessing step has already been performed in the data integration phase, as previously explained.
2) Voting: The main objective of this part is to combine the base classifier models built in the previous step into the final ensemble model. There are several kinds of voting, but the most used ones are weighted and unweighted voting. Voting uses the weighted average of the base classifiers' outputs for regression problems and majority voting for classification. The weighted-majority output, which is used in this paper, is given by:

$$\hat{y}(x) = \arg\max_{c} \sum_{i=1}^{k} w_i \, I\big(p_i(x), c\big)$$

where $p_i(x)$ is the prediction of the $i$-th prediction model, $w_i$ is its weight, $k$ is the number of base classifiers, and $I\big(p_i(x), c\big)$ is the indicator function, equal to 1 when the $i$-th model predicts class $c$ and 0 otherwise.

Problem Definition: Given a graph G = {V, E, W}, where V is the set of nodes, E is the set of edges, and W = [wij] is the n x n edge weight matrix with n = |V|, L = {l1, l2, …, lq} is the set of labels for the q attributes associated with each node in V.
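The weighted-majority rule from the voting step can be sketched in a few lines of Python. The function name and the example labels are hypothetical; the sketch assumes each base classifier returns a single class label and has one non-negative weight.

```python
from collections import defaultdict


def weighted_majority_vote(predictions, weights):
    """Return the label with the largest summed weight of votes."""
    if len(predictions) != len(weights):
        raise ValueError("need one weight per base classifier")
    score = defaultdict(float)
    for label, w in zip(predictions, weights):
        # Adds w_i exactly when classifier i predicted this label.
        score[label] += w
    return max(score, key=score.get)
```

For instance, with predictions `["biomarker", "non-biomarker", "non-biomarker"]`, equal weights select the majority label, whereas weights of 0.6, 0.25, and 0.25 let the first classifier outvote the other two. Unweighted voting is the special case where all weights are equal.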
Homophily: a term used in social networks, defined as a link between individuals (i.e., friendship or another social connection) that arises when they are similar in nature. When applying homophily to a PPI information network, two protein nodes are connected based on the homophily property if they interact with each other and have similar characteristics. These characteristics include:
• Sequence similarity scores.
• GO relations, where two nodes are GO related if a semantic relation holds between those proteins. This semantic relation between two proteins is divided into the following cases:
  • If functions are connected through the ontology.
  • If cellular component relations exist.
  • If biological process relations exist.

For example, protein P35125, which is a biomarker for ovarian cancer, interacts with protein Q8N8A2. P35125 has the molecular functions calmodulin binding, cysteine-type endopeptidase activity, nucleic acid binding, and ubiquitin-specific protease activity. Q8N8A2 has a protein binding molecular function, and calmodulin binding is a type of protein binding. The P35125 and Q8N8A2 proteins have 84.3% sequence similarity. Sequence similarity scores are taken into consideration when they exceed 70%, as shown in Fig. 2. Table II explains the steps of the graph construction algorithm.

Many machine learning algorithms have been transformed into the MapReduce paradigm in order to make use of the Hadoop Distributed File System (HDFS). In the current work, RCNC is run under the MapReduce framework and is evaluated on four datasets in order to compare the scalability of running RCNC sequentially and running it under the MapReduce environment (RCNC MapReduce). The proposed MapReduce architecture used for this classifier is shown in Fig. 3. Through this architecture, the number of occurrences of an attribute with a specific value given a certain class is obtained. Hadoop uses the input data format to divide the large file into small input files, which record keys and values. In this case, the key is a feature of the data (e.g., the interaction type). Then, the Map process defines the (key, value) data structure for the Map operation. The Map process is applied to each input dataset in parallel. With the result from the MapReduce task, one can assign an instance to a class after training each segment via a random committee algorithm. Finally, the ensemble of classifiers is computed via equation (2).

For i = 1 to Max do
Begin
  f(i) <- ( (V.id, W) )
End
V.label <- (f(1) … f(T))
Emit(Vertexid V.id, Vertex V)
End
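The counting step described above, in which the number of occurrences of an attribute value given a class is obtained, can be illustrated with a small in-process simulation of the map, shuffle, and reduce phases. This sketch does not use the Hadoop API; the record layout (a dictionary with a `label` field) and all function names are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(record):
    """Emit ((attribute, value, class), 1) for every attribute of a record."""
    label = record["label"]
    for attr, value in record.items():
        if attr != "label":
            yield (attr, value, label), 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts for one (attribute, value, class) key."""
    return key, sum(values)


def run_job(splits):
    """Map every record of every input split, shuffle, then reduce."""
    pairs = [kv for split in splits for rec in split for kv in map_phase(rec)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Each element of `splits` stands for one input split that a separate mapper would process in parallel on a cluster; here the splits are simply processed one after another, which yields the same counts.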

III. RESULTS
In this paper, interaction datasets for four kinds of cancer are used: breast, colon, liver (hepatocellular carcinoma), and ovarian. Data are split into 66% for training and the rest for testing, with 10-fold cross-validation on the training dataset to select the optimal parameter values. Experiments have been performed using Java JDK version 1.7 and, for the MapReduce implementation, Hadoop version 2.4.1.
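The evaluation protocol (a 66% training split with 10-fold cross-validation on the training part) can be sketched as below. The function names and the fixed random seed are illustrative assumptions; the experiments themselves were run in Java, so this is not the code that produced the reported results.

```python
import random


def train_test_split(items, train_fraction=0.66, seed=0):
    """Shuffle the items and split them into training and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def k_fold(items, k=10):
    """Yield (train, validation) pairs; each item is validated exactly once."""
    folds = [items[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, validation
```

Parameter selection would then train on each `train` part, score on the matching `validation` part, and keep the parameter values with the best average score before the final evaluation on the held-out test part.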
The MapReduce implementation is tested on a cluster of 4 data nodes running Linux. Each node has an Intel® Core™ i7-3770 CPU @ 3.4 GHz and 32 GB of RAM. Several comparisons are performed: 1) the proposed RCNC algorithm for node classification, run sequentially, versus the naïve Bayes and random forest classifiers and the methods proposed in [22] and [29], as shown in Table IV. In [29], an approach based on a Neighborhood Rough Set and a Probabilistic Neural Networks Ensemble is proposed for the classification of gene expression profiles. The comparison covers precision, recall, F1-measure, and ROC.
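For reference, the comparison measures can be computed from true and predicted labels as in the following sketch. The function name and the binary (one positive class) setting are assumptions; the ROC measure is omitted because it requires ranked probability scores rather than hard labels.

```python
def binary_metrics(y_true, y_pred, positive):
    """Compute accuracy, precision, recall, F1, and false positive rate."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0  # also the true positive rate
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }
```

The F1-measure is the harmonic mean of precision and recall, so it is high only when both are high.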
As summarized in Table IV, RCNC consistently scores higher than the Random Forest and naïve Bayes classifiers on all datasets. For example, for the breast cancer dataset, RCNC shows an accuracy of 99.72%, a recall of 99.7%, an F1-measure of 99.7%, and an ROC of 100%, where the true positive rate is 99.7% and the false positive rate is 0.05%. For the ovarian datasets, both the 15,154 and 54,675 datasets are tested with all algorithms: RCNC, Random Forest, naïve Bayes, BSMO, and [34]. In the first case, RCNC is higher than BSPO and [34], whereas in the second case RCNC and BSMO give the same accuracy rate. However, RCNC gives more information regarding related biomarkers from the PPI information network. Furthermore, when each dataset is synthetically enlarged to 4 GB, the accuracy stays the same and the runtime remains short.
The second test compares the runtime of RCNC MapReduce with that of RCNC sequential, as illustrated in Fig. 6, where RCNC MapReduce is faster than RCNC sequential. Finally, Fig. 7 shows the runtime of RCNC MapReduce with one, two, and four nodes for each dataset. Experiments with different data chunk sizes and different numbers of maps are performed to evaluate the impact of MapReduce parallelism. One can notice that with two nodes, the runtime is reduced to nearly half of the time required with one node only. With four nodes, the runtime of the algorithm is reduced further. The accuracy of RCNC sequential versus RCNC MapReduce is also tested with four nodes, and the accuracy remains the same. Identified genes are evaluated against the DisGeNet database, from which the relations between genes as biomarkers can be downloaded for cancer datasets. Examples of detected cancer biomarkers are: HSP60 (ovaries), HSPD1 (ovaries), FANCD2 (breast), FANCD3 (breast), FANCD4 (breast), MYL2 (breast), FANCD1 (ovaries), FACD (ovaries), XRCC9 (breast), DGKI (breast), APCS (colon), STK11 (colon), PTEN (colon), MLH1 (colon), MLH6 (colon), POLE (colon), EPCAM (colon), and MYH (colon).

IV. CONCLUSIONS
In this paper, a Random Committee Node Classifier (RCNC) algorithm was proposed to predict cancer biomarkers, using microarray gene expression and network-based methods. These datasets had a very large volume, which caused main memory problems.
Compared with other classifiers, RCNC achieved high accuracy. When applied to different datasets, it identified biomarker genes with an accuracy of 99.16%, a precision of 99.96%, a recall of 99.24%, an F1-measure of 99.16%, and an ROC of 99.6. To speed up performance, it was run within a MapReduce framework, where RCNC MapReduce was much faster than RCNC sequential on large datasets. Future work includes taking RNA-seq data into consideration and extending the datasets to more types of cancer. In addition, more ontologies will be added, such as ChEBI and disease ontologies. Furthermore, RCNC can be enhanced to cover multi-dimensional graphs.

Fig. 2. An example of Breast Cancer PPI Information Network

Fig. 5. Time Comparisons of RCNC MapReduce for Different Number of Nodes

TABLE I. GEO CANCER SERIES
Cancer Type | Series (Samples) | # of Gene Instances

Table III illustrates the steps of the RCNC MapReduce algorithm. RCNC sequential follows the same idea but without dividing the algorithm into Map and Reduce functions.

TABLE IV. COMPARISONS OF RCNC WITH OTHER CLASSIFIERS