Need and Role of Scala Implementations in Bioinformatics

Next Generation Sequencing has led to the generation of large volumes of omics data at a speed that was not possible before. This data is only useful if it can be stored and analyzed at the same speed. Big Data platforms and tools such as Apache Hadoop and Spark have addressed this problem. However, most of the algorithms used in bioinformatics for pairwise alignment, multiple alignment and motif finding are not implemented for Hadoop or Spark. Scala is a powerful language supported by Spark. It provides constructs such as traits, closures, functions, pattern matching and extractors that make it suitable for bioinformatics applications. This article explores the bioinformatics areas where Scala can be used efficiently for data analysis. It also highlights the need for Scala implementations of the algorithms used in bioinformatics.

Keywords: Scala; Big Data; Hadoop; Spark; Next Generation Sequencing; Genomics; RNA; DNA; Bioinformatics


I. INTRODUCTION
Today, we are living in the world of Big Data. A huge amount of data is produced every day. Major sources of data include social media, enterprise systems, sensor-based applications, bioinformatics sequencing machines, smartphones, digital videos and pictures, and the World Wide Web. The characteristics of Big Data are Veracity, Velocity, Variety, Volume and potential Value (known as the 5 V's). To make this data useful, it needs to be stored and analyzed with accuracy and speed. Traditional techniques are unable to store and analyze such large amounts of data. They suit only limited amounts of data, because the cost of analysis grows with data volume.
To deal with this hurdle, Big Data platforms and tools have been introduced that can analyze large amounts of data with accuracy, speed and scalability. With Big Data platforms like Hadoop, the cost of analysis is also reduced because they run on commodity hardware. The major challenges for Big Data are speed, performance, efficiency, scalability and accuracy. Big Data platforms and tools like Hadoop (for distributed data management) and Apache Spark (for Big Data analysis) address these issues. NGS (Next Generation Sequencing) machines have brought a major change in the generation of sequence data. They generate a huge amount of sequence data per day that needs to be stored, analyzed and managed well to gain the maximum benefit from it. Existing bioinformatics techniques, tools and software are not keeping pace with the speed of data generation. Older bioinformatics tools offer poor performance, accuracy and scalability when analyzing large amounts of data. When storing, managing and analyzing the volumes of data generated nowadays, these tools require more time and cost while delivering less accuracy.

Apache Hadoop is a widely used platform for Big Data processing.
Hadoop is an open-source, Java-based platform that runs on clusters of up to thousands of nodes and is used for parallel processing of Big Data. The main components of its ecosystem are Pig, HBase, Hive, HDFS (Hadoop Distributed File System), MapReduce and the Apache Spark framework. Pig is a high-level language used for scripting; it includes load and store operators and is extensible, allowing users to create their own functions. HBase provides automatic sharding and handles sparse data, serving as an alternative to an RDBMS (Relational Database Management System) for such workloads. Hive is not intended for real-time processing; it is used for large-scale analytics and efficient query processing with the help of its metastore. HDFS is a file system designed for storing and processing the large files handled by Hadoop components; its two units are the DataNode and the NameNode. MapReduce is designed for parallel processing of large datasets on the Hadoop platform. Apache Spark is a framework designed for analytics, with programs written in Java, Python, R or Scala. Its core concepts are transformations, actions and caching.
Many bioinformatics algorithms are implemented in Scala for the Apache Spark framework. Scala is a statically typed language that combines functional and object-oriented programming. It is well suited to concurrent processing. Its main features are traits, closures and functions, which are used in implementing genome sequencing algorithms. Scala runs on the Java Virtual Machine and interoperates with Java.
www.ijacsa.thesai.org
Scala offers arrays, loops, strings, classes, objects, collections, pattern matching and extractors. All of these structures and statements can be used to implement and compare bioinformatics algorithms in Scala on the Spark framework. Scala also provides many built-in methods, libraries and functions that are useful for designing bioinformatics algorithms. The Scala language therefore plays an important role in bioinformatics applications.
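To make the language features named above concrete, the following minimal sketch shows a trait, case classes, pattern matching, a custom extractor and a closure applied to sequence data. All names in it (ScalaFeaturesSketch, BioSequence, Dna, FastaHeader, longerThan) are hypothetical and chosen only for illustration; they do not come from any existing bioinformatics library.

```scala
// Illustrative sketch only: every name here is hypothetical.
object ScalaFeaturesSketch {
  // Trait: behaviour shared by all biological sequences.
  trait BioSequence {
    def residues: String
    def length: Int = residues.length
  }

  final case class Dna(residues: String) extends BioSequence
  final case class Rna(residues: String) extends BioSequence

  // Pattern matching on a single base.
  def complement(base: Char): Char = base match {
    case 'A' => 'T'
    case 'T' => 'A'
    case 'C' => 'G'
    case 'G' => 'C'
    case other => other // leave ambiguity codes such as N unchanged
  }

  def reverseComplement(dna: Dna): Dna =
    Dna(dna.residues.reverse.map(complement))

  def transcribe(dna: Dna): Rna = Rna(dna.residues.replace('T', 'U'))

  // Extractor: splits a FASTA header line into (id, description).
  object FastaHeader {
    def unapply(line: String): Option[(String, String)] =
      if (line.startsWith(">")) {
        val parts = line.drop(1).split("\\s+", 2)
        Some((parts(0), if (parts.length > 1) parts(1) else ""))
      } else None
  }

  def describe(line: String): String = line match {
    case FastaHeader(id, desc) => s"header id=$id desc=$desc"
    case _                     => "sequence line"
  }

  // Closure: the returned function captures minLength.
  def longerThan(minLength: Int): BioSequence => Boolean =
    seq => seq.length >= minLength

  def main(args: Array[String]): Unit = {
    println(reverseComplement(Dna("ATGC")))
    println(describe(">seq1 Homo sapiens"))
    println(List(Dna("ACGT"), Dna("AC")).filter(longerThan(3)))
  }
}
```

The extractor lets header parsing appear directly in a match expression, which is one reason pattern matching is often cited as convenient for record-oriented formats such as FASTA.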
Genome sequencing, motif finding, pairwise alignment and multiple alignment are the main tasks in bioinformatics, and Scala is well placed to support the algorithms behind them. Genome and multiple sequence analysis rely on many algorithms for handling biological sequences, several of which have been implemented in Scala. On Apache Spark, motif finding algorithms are implemented in Scala. In pairwise alignment, Scala is particularly useful for pattern matching.
Spark provides a Scala shell for implementing these bioinformatics algorithms. Primitive types and anonymous functions in Scala work well for managing arrangements of multiple sequences. Anonymous functions are used in transformations, actions and file loading when analyzing bioinformatics datasets in the Apache Spark framework. Shared variables and key-value pairs are also used from Scala on Hadoop for bioinformatics algorithms.

SeqPig is a library and tool for scalable analysis and querying of sequencing data [3]. It uses Apache Pig on the Hadoop engine, which automatically parallelizes and distributes tasks by translating them into a sequence of MapReduce jobs. It provides an extension mechanism for library functions written in supported languages (Python, Java and JavaScript) and import and export functions for file formats such as Fastq, Qseq, FASTA, SAM and BAM, so that users can load and export sequencing data. SeqPig provides five read statistics, including: (a) average base quality of reads; (b) length of reads; (c) base by position inside the read; (d) GC content of reads. Combined in a single script, it can also be used for ad-hoc analysis, although SparkSeq is a better option for that purpose.

Wiewiorka et al. [4] introduced a bioinformatics tool, written in Scala, for building genomic pipelines and for RNA and DNA sequence analysis. The aim of that work was to demonstrate scalability and fast performance on large datasets such as protein, genome and DNA data. A new MapReduce model has been developed for parallel and distributed execution in Spark. Data cannot be stored in HDFS without the BAM library, which provides direct data access and format support. After the data are stored in Hadoop, Spark queries are applied to the sequencing datasets and the data are analyzed.

Nordberg et al. [5] proposed BioPig, used for the analysis of large sequencing datasets with respect to scalability (scaling with data size), programmability (reduced development time) and portability (running on Hadoop without modification). To evaluate these three aspects, a Kmer application was implemented and its performance compared with other methods. BioPig provides methods such as pigKmer, pigDuster and pigDereplicator. Dataset sizes for BioPig range from 100 MB to 500 GB. BioPig resembles SeqPig in that both use the Hadoop and Pig environment, offer the same import and export functions and show similar run-time performance. The only difference is that BioPig includes several Kmer applications and a wrapper for BLAST, which SeqPig does not have. A limitation of BioPig is the startup latency of Hadoop, a problem that Spark addresses.

Sun et al. [6] presented the mapping of long sequences with the Bwasw-cloud algorithm, implemented on Hadoop MapReduce. Many single-processor algorithms such as BLAST, SOAP and MAQ struggle to process reads quickly. Multiprocessor approaches such as BlastReduce perform much better on short reads, but they raise problems of performance and equipment cost. The Bwasw-cloud algorithm reduces these problems. It consists of three phases (Map, Shuffle and Reduce) and uses the seed-and-extend technique; the sequence alignment functions are mostly implemented in the Map phase. Scaling is measured over read length, number of mismatches and number of reference chunks, whereas performance is measured as the speedup achieved by the algorithm.
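Two of the building blocks discussed in this section, a per-read statistic of the kind SeqPig reports (GC content) and the k-mer counting used to evaluate BioPig, can be sketched with plain Scala collections. This is a single-machine illustration under stated assumptions, not the SeqPig or BioPig implementation; the names KmerCountSketch, gcContent, kmers and countKmers are hypothetical. The map, group and sum steps mirror the Map, Shuffle and Reduce phases that a Hadoop or Spark job would distribute.

```scala
// Illustrative sketch only: a local stand-in for a distributed k-mer count.
object KmerCountSketch {
  // Fraction of G and C bases in a read (0.0 for an empty read).
  def gcContent(read: String): Double =
    if (read.isEmpty) 0.0
    else read.count(c => c == 'G' || c == 'C').toDouble / read.length

  // All k-mers of a read via a sliding window; empty if the read is too short.
  def kmers(read: String, k: Int): Iterator[String] =
    if (k <= 0 || read.length < k) Iterator.empty
    else read.sliding(k)

  def countKmers(reads: Seq[String], k: Int): Map[String, Int] =
    reads
      .flatMap(r => kmers(r.toUpperCase, k).map(kmer => (kmer, 1))) // "map": emit (k-mer, 1)
      .groupBy(_._1)                                                 // "shuffle": group by k-mer
      .map { case (kmer, pairs) => kmer -> pairs.map(_._2).sum }     // "reduce": sum the counts

  def main(args: Array[String]): Unit = {
    val reads = Seq("ACGTAC", "acg")
    println(countKmers(reads, 3))
    println(reads.map(r => gcContent(r.toUpperCase)))
  }
}
```

On Spark, the same three steps would typically be expressed over a distributed collection rather than a local Seq; that variant is omitted here because it requires a Spark dependency.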
Taylor et al. [7] focused on next-generation sequencing data and its use in bioinformatics. Hadoop and MapReduce play an important role in NGS. The work discusses technologies such as Hadoop, MapReduce, HBase, Hive, Pig and Mahout, and then their role in bioinformatics through tools such as CloudBurst, which is similar to BlastReduce (for mapping NGS short reads to a reference genome), Bowtie/Crossbow (for genome re-sequencing analysis), Contrail (for assembling short DNA reads without a reference genome) and R/Bioconductor (for calculating differential gene expression in large RNA-seq datasets). Hadoop and HBase are also used by the Biodoop tool, which consists of three algorithms (BLAST, GSEA and GRAMMAR). Hadoop is also used for multiple sequence alignment. Srinivasa et al. [8]  After these three phases, hierarchical clustering is performed by UPGMA (to produce rooted trees). Owing to the scalability of the Hadoop framework, the proposed phylogenetic method is suited to large-scale problems.

III. TOOLS FOR BIOINFORMATICS
Several bioinformatics tools are used for the analysis of small and large datasets. Each tool performs a specific function: different tools are used for sequence analysis, motif finding, database search and genome analysis. These tools require the data to be stored in a specific format before any analysis. They are built using different programming languages, so it is important to know the relevant language in order to customize a tool. Skill in that language is even more helpful when extending these tools for the Hadoop MapReduce or Apache Spark frameworks.

A. Motif Finding Tools
Sequence motifs are repeated patterns of biological significance. Many tools are available for finding motifs in nucleotide or protein sequences. These tools are implemented in different programming languages such as C, C++, Java, Perl, FORTRAN, Python and R. A list of motif finding tools is presented in TABLE I.
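As a small illustration of what a motif search involves, the sketch below reports every position at which a given motif occurs in a sequence, optionally allowing a fixed number of substitutions. It is a brute-force scan written for clarity, not the algorithm of any tool listed in TABLE I, and the names MotifSearchSketch, hamming and occurrences are hypothetical.

```scala
// Illustrative sketch only: brute-force motif occurrence search.
object MotifSearchSketch {
  // Number of mismatching positions between two strings of equal length.
  def hamming(a: String, b: String): Int = {
    require(a.length == b.length, "strings must have equal length")
    a.indices.count(i => a(i) != b(i))
  }

  // 0-based start positions where `motif` occurs in `sequence`
  // with at most `maxMismatches` substitutions (overlaps included).
  def occurrences(sequence: String, motif: String, maxMismatches: Int = 0): Seq[Int] =
    if (motif.isEmpty || sequence.length < motif.length) Seq.empty
    else (0 to sequence.length - motif.length).filter { i =>
      hamming(sequence.substring(i, i + motif.length), motif) <= maxMismatches
    }

  def main(args: Array[String]): Unit = {
    println(occurrences("ACGTACGTTACG", "ACG"))
    println(occurrences("ACGTTCG", "ACG", maxMismatches = 1))
  }
}
```

Because each window is checked independently, the scan parallelizes naturally over sequences or over chunks of one long sequence, which is the property that makes such searches candidates for MapReduce or Spark.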
Like the alignment viewers and genomic analysis tools, some motif finding tools have also been implemented on the Apache Spark and Hadoop MapReduce frameworks for experiments in Big Data analysis. PMS and BLOCKS are implemented on the Hadoop MapReduce framework for Big Data analysis. The algorithms in these tools are not implemented on Spark using Scala. Scala can be used to implement these algorithms, as well as the multiple sequence alignment and pairwise alignment algorithms, to attain better outcomes.
Many bioinformatics algorithms are based on the greedy and dynamic programming paradigms. Sequences are mapped or aligned using local, global, multiple and pairwise methods. The Nussinov and Viterbi algorithms can also be implemented in Scala. SCABIO is a framework for bioinformatics algorithms written in Scala. It includes many built-in methods and libraries that are helpful for Scala implementations, and it provides greedy and dynamic programming approaches for biological sequences. SCABIO can be used for global, local, multiple and pairwise alignment. Pattern matching is well supported in SCABIO because it builds on the corresponding Scala language concepts.
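The dynamic programming paradigm mentioned above can be illustrated with the score computation of global (Needleman-Wunsch) and local (Smith-Waterman) pairwise alignment. The sketch below is a minimal version with a linear gap penalty; it does not use SCABIO, and the names AlignmentScoreSketch, Scoring, globalScore and localScore as well as the default scores (+1, -1, -2) are assumptions made for illustration.

```scala
// Illustrative sketch only: alignment scores with a linear gap penalty.
object AlignmentScoreSketch {
  // Assumed scoring scheme; the values are arbitrary defaults.
  final case class Scoring(matchScore: Int = 1, mismatch: Int = -1, gap: Int = -2)

  private def sub(x: Char, y: Char, s: Scoring): Int =
    if (x == y) s.matchScore else s.mismatch

  // Needleman-Wunsch: best score aligning all of `a` against all of `b`.
  def globalScore(a: String, b: String, s: Scoring = Scoring()): Int = {
    val dp = Array.ofDim[Int](a.length + 1, b.length + 1)
    for (i <- 0 to a.length) dp(i)(0) = i * s.gap // prefix of a against gaps
    for (j <- 0 to b.length) dp(0)(j) = j * s.gap // prefix of b against gaps
    for (i <- 1 to a.length; j <- 1 to b.length) {
      dp(i)(j) = List(
        dp(i - 1)(j - 1) + sub(a(i - 1), b(j - 1), s), // match or mismatch
        dp(i - 1)(j) + s.gap,                          // gap in b
        dp(i)(j - 1) + s.gap                           // gap in a
      ).max
    }
    dp(a.length)(b.length)
  }

  // Smith-Waterman: best score over all pairs of substrings (never below 0).
  def localScore(a: String, b: String, s: Scoring = Scoring()): Int = {
    val dp = Array.ofDim[Int](a.length + 1, b.length + 1)
    var best = 0
    for (i <- 1 to a.length; j <- 1 to b.length) {
      dp(i)(j) = List(
        0, // start a new local alignment here
        dp(i - 1)(j - 1) + sub(a(i - 1), b(j - 1), s),
        dp(i - 1)(j) + s.gap,
        dp(i)(j - 1) + s.gap
      ).max
      best = math.max(best, dp(i)(j))
    }
    best
  }

  def main(args: Array[String]): Unit = {
    println(globalScore("ACGT", "AGT"))
    println(localScore("TTTACGTTT", "GGACGGG"))
  }
}
```

Both functions fill an (m+1) by (n+1) table, so time and memory grow with the product of the sequence lengths; this quadratic cost is what motivates distributing many pairwise comparisons across a cluster.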

V. CONCLUSION
Given the data analysis demands in bioinformatics, Big Data platforms and tools are an obvious choice. Among these platforms, Spark is the most efficient for rapid analysis of large datasets. Spark itself is implemented in Scala and supports programs in Java, Scala and Python. The majority of bioinformatics tools are not designed for Big Data platforms. As discussed in the previous sections, most of the multiple alignment, pairwise alignment and motif finding tools still need to be enhanced for use on Big Data platforms like Hadoop and Spark. There is therefore a pressing need to implement bioinformatics tools on Big Data platforms. Several languages are available for implementing bioinformatics tools, such as Java, C, Perl, Python and Scala. Among these, Scala is a good choice, especially for Spark implementations. It provides structures and constructs that are suitable for bioinformatics applications, supports dynamic programming and pattern matching, and can provide efficient implementations of machine learning algorithms. We recommend that Scala be used for future implementations of bioinformatics tools on Big Data platforms.

For implementing bioinformatics algorithms in Scala on the Hadoop platform, datasets are stored in specific formats. Different storage formats are used for different algorithms on the Hadoop and Spark platforms, for example Fasta, Fastq, CSV, ADAM and BAM (Binary Alignment Map)/SAM (Sequence Alignment Map). The objectives of this study are:
• To explore the supported languages and supported platforms for genome sequencing, motif finding, pairwise alignment and multiple alignment algorithms
• To analyze the need for the Scala language for the implementation of bioinformatics algorithms on the Hadoop platform
• To explore the use of the Scala language in existing bioinformatics tools
The rest of the paper is organized as follows: Section II explains the related work in this field. Section III describes the tools for bioinformatics. Section IV presents the role of Scala implementations in bioinformatics.

II. RELATED WORK
Ali et al. [1] presented a study in which many machine learning classification and clustering algorithms are implemented on Hadoop MapReduce and Apache Spark using Scala. They also compare the performance of different machine learning techniques and algorithms on Hadoop and Spark, and outline further research ideas for implementing machine learning techniques and algorithms on the Hadoop and Spark frameworks.

Sarwar et al. [2] presented a review study of bioinformatics tools. They describe implementations of tools for alignment viewers, database search and genomic analysis on the Hadoop and Apache Spark frameworks using Scala. They also describe further research directions for the implementation of bioinformatics tools on Hadoop and Apache Spark using languages such as Java, Scala and Python.
Srinivasa et al. [8] have proposed a technique to classify sequences with the help of an m × m distance matrix and to understand the relationships among different species during evolution, using the MapReduce model and dividing the sequences into blocks. The dynamic programming algorithms Needleman-Wunsch and Smith-Waterman are limited in the number and size of sequences they can handle, so a new MapReduce model was developed to reduce these limitations. The input is in FASTA format and the output is in a custom type. The method includes three MapReduce jobs: (a) data preprocessing; (b) Cartesian product; (c) sequence alignment.
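The three jobs listed above can be mimicked on a single machine to show how an m × m distance matrix arises from FASTA input. The sketch below is only an analogy under stated assumptions: it uses unit-cost edit distance as a stand-in for the alignment-based distance, which the text does not specify, and every name in it (DistanceMatrixSketch, parseFasta, pairs, editDistance, distanceMatrix) is hypothetical.

```scala
// Illustrative sketch only: a local analogue of the three-job pipeline.
object DistanceMatrixSketch {
  // Job (a), data preprocessing: parse FASTA text into (id, sequence) records.
  def parseFasta(text: String): Vector[(String, String)] =
    text.split(">").toVector.map(_.trim).filter(_.nonEmpty).map { rec =>
      val lines = rec.split("\n").map(_.trim).filter(_.nonEmpty)
      val id = lines.head.split("\\s+").head
      (id, lines.tail.mkString.toUpperCase)
    }

  // Job (b), Cartesian product: all unordered index pairs (i < j).
  def pairs(m: Int): Seq[(Int, Int)] =
    for (i <- 0 until m; j <- i + 1 until m) yield (i, j)

  // Job (c), sequence alignment: unit-cost edit distance by dynamic
  // programming (an assumed stand-in for the real alignment distance).
  def editDistance(a: String, b: String): Int = {
    var prev = Array.range(0, b.length + 1)
    for (i <- 1 to a.length) {
      val cur = new Array[Int](b.length + 1)
      cur(0) = i
      for (j <- 1 to b.length) {
        val cost = if (a(i - 1) == b(j - 1)) 0 else 1
        cur(j) = math.min(math.min(prev(j) + 1, cur(j - 1) + 1), prev(j - 1) + cost)
      }
      prev = cur
    }
    prev(b.length)
  }

  // Symmetric m x m matrix of pairwise distances.
  def distanceMatrix(seqs: Vector[(String, String)]): Array[Array[Int]] = {
    val d = Array.ofDim[Int](seqs.length, seqs.length)
    for ((i, j) <- pairs(seqs.length)) {
      val dist = editDistance(seqs(i)._2, seqs(j)._2)
      d(i)(j) = dist
      d(j)(i) = dist
    }
    d
  }

  def main(args: Array[String]): Unit = {
    val records = parseFasta(">s1 demo\nACGT\n>s2\nACGA\n>s3\nTTTT\n")
    distanceMatrix(records).foreach(row => println(row.mkString(" ")))
  }
}
```

Each pair is scored independently of the others, so the pair list is the natural unit to distribute across workers; a matrix like this one is also the usual input to a hierarchical clustering step such as UPGMA.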

TABLE I
These tools are used for the alignment of more than two nucleotide or protein sequences. They are helpful in finding the homology and evolutionary relationships between the studied sequences. A number of multiple sequence alignment tools have been developed using Ruby, C, C++ and Python; for example, ABA, ALE, AMAP, anon and BAli-Phy are implemented in Ruby, C, Python and C++. Multiple sequence alignment tools support different data formats for the storage and alignment of protein and nucleotide sequences. ABA, ALE, AMAP, anon and BAli-Phy use different data formats such as Fasta, GenBank, EMBL, GDBM, PHYLIP and MFA. With the growth of technology in bioinformatics, multiple sequence alignment tools are also being implemented on modern platforms like Hadoop MapReduce and Apache Spark. MSA, SAGA and MSAProbs are multiple sequence alignment tools that are implemented on Hadoop MapReduce and Apache Spark. TABLE II presents the available multiple sequence alignment tools.