Discovering the Relationship between Heat-Stress Gene Expression and Gene SNPs Features using Rough Set Theory

Over the years of applying machine learning in bioinformatics, we have learned that scientists, working in many areas of life sciences, call for deeper knowledge of the modeled phenomenon than just the information used to classify the objects with a certain quality. As dynamic molecules of gene activities, transcriptome profiling by RNA sequencing (RNA-seq) is becoming increasingly popular, which not only measures gene expression but also structural variations such as mutations and fusion transcripts. Moreover, Single nucleotide polymorphisms (SNPs) are of great potential in genetics, breeding, ecological and evolutionary studies. Rough sets could be successfully employed to tackle various problems such as gene expression clustering and classification. This study provides general guidelines for accurate SNP discovery from RNA-seq data. Those SNPs annotations are used to find relation between their biological features and the differential expression of the genes to which those SNPs belong. Rough sets are utilized to define this kind of relationship into a finite set of rules. Set of (32) generated rules proved good results with strength, certainty and coverage evaluation terms. This strategy is applied to the analysis of SNPs in A. thaliana plant under heat-stress. Keywords—RNA sequencing (RNA-seq); variant calling; Single Nucleotide Polymorphisms (SNPs) analysis; rough sets; gene expression


I. INTRODUCTION
RNA sequencing (RNA-seq) technology has resulted in exceptionally fast and wide scale analysis of the genetic information exists in all organisms. This mainly includes the concurrent study of alternative splicing, Single nucleotide polymorphisms (SNPs) and differential expression. The approach of genome-guided transcriptome has been the standard RNA-seq analysis method for model organisms like A. thaliana. Some existing software packages are available to perform this task [1]. New tools are continuously developed to be used for RNA-seq analysis task starting from reads alignment ending with the pathway analysis mission. Unfortunately, some non-expert users for those tools cannot get the full power and capabilities of them on wide scale [2].
SNPs are single nucleotide base variations, caused by transitions or transversions, in the same position between individual genomic sequences. Genetics and breeding are the most two important studies using SNPs as significant molecular markers. In genetic studies, SNPs are ideal genomic resources used for functional gene identification for traits and characterization of genetic resources because of their extensive genome distribution, wide density and, high scalability [3]. Fortunately, SNPs discovery can be accomplished on both approaches of genome-guided and de novo on variety of organisms [4]. This is applied on many plants, including those with little or no available genetic information.
Among the various benefits of performing SNPs analysis using RNA-seq data, there are two important ones [5]. First the reasonable cost for simultaneous discovery of thousands SNPs together with expression levels of functional genes at the same time. Second is involving phenotypes which can be predicted according to genotypes and, the location of SNPs in coding regions related to the possibly identified plant biological and agronomical traits.
RNA-seq is considered the ideal method for gene expression profiling [6] and, it is commonly used for precision medicine due to its high capability of measuring dynamic gene activity in the genome for a specific tissue type. Moreover, when applying RNA-seq on some disease tissue samples [7], it detects most of mutations exist in expressed genes that are related to disease biology.
Machine learning techniques can support very interesting and critical analysis applications dedicated for the fields of molecular biology and bioinformatics. Particularly, rough set method is considered very commonly used for this task of data analysis due to its flexibility in handling qualitative data. Rough Set theory was proposed in 1982 by Z. Pawlak [8] and, has been used as a methodology of database mining or knowledge discovery. It can contribute in many processes like attribute selection, attribute extraction, data reduction, decision rule generation and pattern extraction.
Rough Set uses information system or information table to represent data. This table consists of objects (rows) and attributes (columns) [9]. There are two types of attributes named as the condition attribute and decision attribute. Each row of an information table defines a decision rule, which specifies the decision attribute values when conditions are indicated by condition attributes are satisfied. Additionally, a set of objects is classified using rough set theory by finding dependencies and relations between attributes [10]; reduction of unnecessary attributes; discovering the most important attributes; or by decision rule generation. www.ijacsa.thesai.org Rough set-based rule generation provides easier explanations and descriptions for complex biological systems [8]. The challenges of those complex systems can be summarized into determining the features or attributes that can demonstrate the biological phenomenon, and what combinations of features' values can define that phenomenon and make a significant added value to the system study [11]. The possible decision values can be the participation of some particular genes in a biological process. Furthermore, learning the set of minimum features that can determine the gene involvement in this process may be interesting issue for some biologists. High throughput problems have a great care about discovering which features, in which order and, in which combinations define decisions.
The main motivation for this research is to provide aid to find a reasonable answer about the relationship between expressed genes and SNPs under the effect of heat-stress phenomenon. This relationship is achieved through determining the set of features that best describe this biological process. Rough-set based rule induction method is applied on RNA-seq data for the A. thaliana plant.
The rest of the paper is organized as follows. Related work is reviewed in Section II. Section III describes the data used in the research and tools adopted to perform SNPs detection and analysis. Methodology and techniques utilized for SNPs Detection phases as well as SNPs analysis are discussed in Section IV. The results of the experiments in the form of evaluation terms, tables and charts are discussed in Section V. Section VI provides our derivations, outcome on the study and, recommendations for the future work.
II. RELATED WORK Various research efforts in the literature have been targeted to the two main focal topics of this research; SNPs identification and Rough set theory in bioinformatics. This section lists a summarization of these efforts as follows.
The authors in [7] investigated the most suitable method that can provide the greatest number of SNP calls with high specificity and sensitivity. Following the steps of alignment sequence reads to the genome, removing duplicates, and using SAMtools to call SNPs had achieved the required purpose. SAMtools proved higher consistency than GATK with 8-10% more variants identification.
Plant functions, related to climate adaptation, have leading genes involved in transcriptional mechanisms. In the study [12], they realized that neat and strong peaks of association were identified in expected functional variants in the extreme tail of genetic differentiation. Those results proved that climate adaptation can mainly cause the genomic variation when applied on A. thaliana at a small scale. SNP-ML (SNP machine learning) suggested in [13], a novel tool, predicted true SNPs from sequence data using machine learning. It was designed for calling more trusted SNPs from polyploids. Moreover, it provided SNP machine learner (SNP-MLer), a functionality to train new models for customized use. Tetraploid peanut SNPs were identified using SNP-ML, and the validated true-and false-positive SNP mapping data improved the discovery process.
Another research [2] suggested Visualization Pipeline for RNA-seq analysis (VIPER) that combined stages of an RNAseq analysis workflow. This workflow graded from raw RNAseq data, then quality control and genome alignment, reaching to the differential expression and pathway analysis. VIPER listed the most popular tools used in the workflow like, RSEM for quantification, and SNPeff for annotating identified SNPs.
A reasonable amount of work has been performed on the usage of rough set methodology in solving bioinformatics issues and challenges [14]. These studies have focused on problems of classification and reduction of bioinformatics data. Some other literatures have dealt with topics related to selection of genes, classification of protein sequence and, prediction of protein structure.
A novel approach for tumor classification was proposed by [15]. This approach was based on Wavelet Packet Transforms (WPT) and Neighborhood Rough Sets (NRS) as tools for effective features extraction and selection. WPT performed features extraction, and then decision tables are formed. High classification with few attributes was reduced by NRS. The proposed method was applied on three gene expression datasets and experimental results showed feasibility and effectiveness.
A feature selection algorithm based on rough set theory had been suggested in [16]. It depended on selecting reduced set of genes from microarray data based on relevance and significance criteria of the selected genes. The importance of rough set theory here was computing both criteria to produce theoretical analysis justification. The proposed algorithm performance, along with a comparison with other related methods had obtained 100% predictive accuracy for three cancer and two arthritis data sets.
Suitable solutions were provided in [17] to solve two important issues exist in the data represented in information table. Those two solutions were applied based on concepts exist in Rough Set Methodology. The first issue was the indiscernible objects that were represented several times and solved by data reduction. This reduction included eliminating the unnecessary attributes and deletion of identical rows. The second issue was the existence of many redundant attributes and solved by dimensionality reduction. This solution used simplifying discernibility function to get reducts which used for generating if-then rules for classification.
A Promising framework was introduced in [18] to handle the complexities of protein structure prediction. Rough set improved harmony search quick reduct algorithm to be used for selecting the optimum number of features. More compact rules were generated via Rough set classification which showed a higher overall accuracy rates compared with classification algorithms in Weka.

III. DATASETS AND MATERIALS
This section lists the datasets with their types and full description of the experiment conditions. Moreover, the tools, and computational power needed for accomplishing this study are presented. It involves Data Sources, Software Packages and Tools and, Computational Requirements. www.ijacsa.thesai.org

A. Data Sets
The A. thaliana reference genome FASTA sequences have been downloaded from the Ensemble FTP (https://plants.ensembl.org/info/website/ftp/index.html). The RNA-seq FASTQ data files for A. thaliana under heat-stress were downloaded from the NCBI website. These data files represent an experiment that is performed on A. thaliana plants in Moscow, Russia. A Third leaf was collected from 15 plants of age 21 days after heat treatment at 42°C for 1, 3, 6, 12, and 24 hours. The experiment was accomplished with 2 replicates for each of the mentioned 5 different time points to resolve false-positive calls at the low end of signal detection [5]. This experiment was SINGLE stranded -Illumina Hi-Seq 2000 -RNA-seq libraries from TRANSCRIPTOMIC PolyA RNA. Ten files were downloaded and their attributes and description are listed in Table I.

B. Software Packages and Tools
The following list contains software tools and packages that were integrated with custom code to carry out the execution of the various processes along the presented workflow.

C. Computational Requirements
The experiments conducted in this research are based on the Unix-type operating systems (primarily Linux); it provides a command-line interface and is best run on a high-memory, multicore computer or in a high-performance computing environment. In general, having ~1 GB of RAM per 1 million paired-end reads is recommended. A typical configuration is a multicore server with 256 GB to 1 TB of RAM. For the research problem presented in this article, the lack of the required computing resources to accomplish the required work could be a challenge. In this study, the used computational resources were provided by The Bibliotheca Alexandrina (bibalex) 1 . The super computer BA-HPC capabilities are used to achieve this work.

IV. METHODOLOGY
This study proposes a promising framework to illustrate how SNPs can be discovered, annotated and, analyzed from RNA-seq data in order to be used to describe genes expression. Methodologies are divided mainly into two phases: (A) SNPs Detection, and (B) SNPs Analysis. Details of both phases and needed resources are being described below.

A. SNPs Detection
This phase shows an overview of the steps and methods that are employed to identify the most suitable performing of RNA-seq SNPs detection pipeline in Fig. 1. The employed steps start from creating genome indices, and go along till finding out annotated SNPs.

1) Creating genome indices:
Using the A. thaliana reference genome (*.fna) file from (Data Sets) section, Indices are created using STAR tool. 2) Mapping raw reads to reference genome: This step aims to find matches between the reference genome and the sequences of the sampled RNA-seq short reads. During mapping, using existing gene models obtains the maximum advantage to some read mapper in order to map the coordinates accurately.
Using indices files generated by STAR, each individual read with the reference genome is mapped. Convert mapped reads from SAM to BAM, sort, and index. BAM files are generated sorted by coordinates, so they can be loaded much more quickly.
3) Variant calling: Find deviation from reference genome, the output of both previous steps (RefGenome indices and mapped reads) are used together to perform the variant calling task. This step is done using SAMtools which uses the mpileup command to compile information about the bases mapped to each reference position. It collects summary information in the input BAMs, computes the likelihood of data given each possible genotype and stores the likelihoods in the BCF format Output BCF file is a binary form of the text Variant Call Format (VCF). 4) Obtaining raw variants: BCF file came from the previous step is converted into VCF file using BCFtools. It is a collection of utilities to call SNPs and manipulate VCF files. Those utilities are calling SNPs and small indels, annotating and sub-selecting entries from VCF files, querying, filtering, merging VCF files, and converting BCF to human-readable VCF. VCF file has a nice header explaining what the columns mean. Below that header, there are rows of data describing potential genetic variants. Fig. 2 shows a sample for one of the produced (*.vcf) files header and content.
Header contains mandatory lines like the first line (1) and the line containing columns' headers (30). Lines (4) and (5) in the shown sample include data about the reference genome and bam files used to get that vcf variant calling. While lines from (6-12) have the contigs or chromosomes of the reference genome and length of each of them. Moreover, there are optional lines that describe some meta-data about the information in the VCF body shown in Table II

5) Variant filtering:
This step applies the prior and does the actual calling. It performs filtering short variants using vcfutils.pl varFilter. This filtration contains; delete duplication, remove low-quality reads (defined by sequencing device), filter unmapped reads and, filter low quality reads/mappings.

6) Finding out annotated SNPs:
The last step in the phase of SNPs Detection is to discover the categorization of the variants effects in genome sequences. SnpEff (an abbreviation of -SNP effect‖) tool is able to analyze and annotate thousands of variants per second and predict their possible genetic effects. Since many databases containing genomic annotations are available with SnpEff distribution, the SNPs annotation is called through SnpEff DB. The output information provided using SnpEff (*.ann.vcf) includes some additional lines at the end of the header which are concerned with the annotation part, as presented in Fig. 3. In this research, ‗ANN' field is the main target which includes the information needed to determine the value of the variant as shown in Fig. 4.  (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 2, 2020 94 | P a g e www.ijacsa.thesai.org

B. SNPs Analysis
The input of this phase is the 10 (*.ann.vcf) files created in the first phase, which include analysis ready variants. This phase includes set of well-ordered processes which are applied to determine the relationship between SNPs biological features and gene expression.

1) Adjustment:
This process handles the 10 (*.ann.vcf) files and put them into another flexible form enabling the separation of some specific information to be analyzed (INFO, ANN) tags.
2) Separation of variants from indels: The main goal is analyzing SNPs and their effect on the genome sequence. So, indels are removed to focus on SNPs only.
3) SNPs selection: Choose only SNPs located in the common heat-stress genes of A. thaliana, published in reference databases; DRASTIC 2 and TAIR10 3 .

4) Detection of SNPs biological features:
Some biological features of SNPs mainly describe the biological value of the detected SNP. They are isolated and been prepared for analysis. Table III lists some of those features, description and, their possible values. 5) Discovery of relationship between SNPs' features and genes differential expression using Rough Set: The most suitable technique to represent this kind of relation is Rough set. Rough set theory has been a methodology of database mining or knowledge discovery in relational databases [9]. The target is to find the set of rules that translate the relationship between the values of the biological features of detected SNPs for some gene and the differential expression of the same gene.
Rough Set Analysis approach has many important advantages like; Discovery of hidden patterns in data, Data reduction (finds minimal sets of data), Evaluating the importance of data, Representing data as sets of decision rules and, Providing the interpretation of obtained result [19].
Rosetta is a general-purpose tool that is not geared towards any particular application domain. The name ROSETTA can be construed as an acronym, for a Rough Set Toolkit for Analysis of Data. It has been put to use by a large number of researchers world-wide, and has resulted in scientific publications in a wide variety of areas. Moreover, it implements features relevant to build and evaluate rough set models in different domains, and offers a highly user friendly environment in which to conduct experiments. In this study, Rosetta is used to generate rough set rules for the predicted SNPs. This will be discussed obviously in the (Generation of Rough set rules) section [20].

6) Measuring the generated rules:
To quantify the generated rules, several numerical measures for the rules are illustrated in Definition 1, 2, and 3 and described in

A. Identification of SNPs in Heat-Stress Genes
Continuous decision values may cause a challenge. In most practical approaches, there are about two to five decision classes. So, if the problem has continuous decision values, they can be split into 2 or 3 intervals [11]. In another example of exon expression values, the decision experimentally was split into three classes by taking 20%: 60%: 20% corresponding to highly expressed, medium expressed and low expressed exons [23]. Similarly, in this study the decision is divided only into two classes, by taking the highest expressed (Yes): the lowest expressed (No) genes.
DRASTIC and TAIR10 reference databases are used as trusted sources for the highest expressed heat-stress genes for A. thaliana. The union of heat-stress genes in those two databases are (225) unique genes. Next, all SNPs that are located into those set of genes exist in the resulting 10 (*.ann.vcf) files are being selected. The number of heat-stress genes detected in each replicate and also numbers of their identified SNPs are listed in details in Table V. For balance, an equalized set of the lowest expressed heatstress genes are selected from work presented in [24], to analyze their SNPs features too. About (200) genes are selected and their SNPs are got from the 10 (*.ann.vcf) files. The number of the lowest expressed heat-stress genes detected in each replicate and, number of their identified SNPs are listed in details in Table VI.

B. Capturing Biological Features of Identified SNPs
After that, the biological features of SNPs for both groups of genes, for the highest and lowest expressed heat-stress genes, are picked up from the 10 (*.ann.vcf) files. The most effective biological features exist in these annotation files due to the rough set are (Gene Name, Gene ID, Annotation, Annotation Impact, Feature Type and, Transcript BioType).
The top-ranked attributes (biological features) are used to build a rule-based classifier using the Rosetta system.

C. Generation of Rough Set Rules
An information system or information table can be viewed as a table, consisting of objects (rows) and attributes (columns). The captured set of SNPs are used to discover finite set of rules that can describe whether the genes, those SNPs belong to, are heat-stress or not. Rules were generated in Rosetta [20] with the manual reducer which determines decision rules (Heat Expressed: Yes; No) based on characterization of a set of objects in terms of attribute values (SNPs biological features). Table VII explores the given replicates and their total number of objects, number of (Yes) decision, number of (No) decision, and the number of resulting rules. Total number of Rules (251) represents the sum of all rules generated over the 10 replicates. However, the set of non-repeatable rules shared between the 10 replicates is (32) rules.

D. Rules Evaluation
To quantify the generated rules, three main numerical parameters for the rules are defined: Rule Strength, Rule Certainty (accuracy) and, Decision Coverage.
Those parameters are calculated for the generated set of rules by applying Definition 1, 2, 3 and, Table IV mentioned in section (Measuring the generated rules). The pie chart shown in Fig. 5 displays the Rule Strength of all rules, showing rules that have the highest strength percentages. Moreover, Fig. 6 shows the different Rule Certainty in a bar chart. Rules that have the same condition values but different decision value (Yes, No) are represented in adjacent bars. Finally, Decision Coverage of rules is represented in Fig. 7 for (Yes) decision rules and Fig. 8 for (No) decision rules. www.ijacsa.thesai.org

VI. CONCLUSION
The ultimate goal of this research is to find the relationship between the set of heat-stress expressed genes and their detected SNPs biological features in A. thaliana RNA-seq raw reads. Utilizing rough set-based rule induction resulted in set of descriptive rules which can draw the correlation between those two significant concepts; genes and SNPs. A promising analysis framework was presented to detect SNPs in RNA-seq raw reads then using annotations of those SNPs to figure out their biological features. Additionally, about (225) unique genes got from DRASTIC and TAIR10 databases were used to represent the highly expressed heat-stress genes for A. thaliana. However, (200) genes were selected to represent the lowly expressed heat-stress genes for the same plant. The topranked biological features of SNPs for both groups of genes with decision rules (Heat Expressed: Yes; No) were utilized to build a rule-based classifier using the Rosetta system.
The system stated set of (32) non-repeatable rules. Results showed acceptable outcomes and, evaluation had been applied to check the suitability of the generated rules using Rule Strength, Rule Certainty and Decision Coverage. In conclusion, relation between SNP calls and expressed genes in RNA-seq data can be a very useful by-product and increases the amount of knowledge for SNPs discovery and analysis in functional genomics research. With this important result in mind, this method can be verified using in vivo tests to improve the work results. Moreover, generating rules for more species of the same plant may improve complete and welldefined base for machine learning approach to researchers of all expertise levels.