Applying Machine Learning Techniques for Classifying Cyclin-Dependent Kinase Inhibitors

Protein kinases play an essential role in cell-cycle progression and many other biological processes, which has made them a target of many drug design studies. Kinases are divided into subfamilies according to the type and mode of their enzymatic activity. Computational identification of kinase inhibitors is widely used to model kinase-inhibitor interactions. This modelling is expected to help solve the selectivity problem that arises from the high similarity between kinases and their binding profiles. In this study, we explore the ability of two machine-learning techniques to classify compounds as inhibitors or non-inhibitors of two members of the cyclin-dependent kinase (CDK) subfamily of protein kinases. Random forest (RF) and genetic programming (GP) were used to classify CDK5 and CDK2 inhibitors. The classification is based on calculated values of chemical descriptors. In addition, the response of the classifiers to adding prior information about compound promiscuity was investigated. The results from each classifier on each dataset were analyzed using several accuracy measures: confusion matrices, accuracy, ROC curves, AUC values, F1 scores, and the Matthews correlation coefficient. The analysis showed better performance for the RF classifier in most cases. The results also show that promiscuity information improves classification accuracy, and that the effect is most pronounced for the GP classifiers.

Keywords: CDK inhibitors; random forest classification; genetic programming classification


I. INTRODUCTION
Many important biological processes in the human body are related to phosphorylation, in which a phosphate group is added to a protein to activate its function. Protein kinases are the enzymes that catalyze this process by adding the phosphate group to other proteins.
Because phosphorylation is central to cellular processes and metabolism, protein kinases are the subject of many studies, including drug design studies. When inappropriately regulated, protein kinases are linked to various diseases, including several cancer types [1].
In humans, there are more than 500 kinases. They are divided into three types according to the three types of phosphorylation they catalyze [2].
Although kinases have been targeted by several drug discovery studies that led to many inhibitors, only a few of these inhibitors have been approved. The reason is the undesired side effects caused by inhibitor activity against unintended targets. This off-target activity stems from the high degree of similarity between kinases, which show few structural differences, especially in their highly conserved binding domains. This similarity in binding domains leads to the selectivity problem seen with many kinase inhibitors [2]. In most cases, the problem is caused by the high conservation of the ATP binding site, which is the target of most inhibitors developed for kinases [3].
Cyclin-dependent kinases (CDKs) are protein kinases with essential roles in cell division and transcription. CDKs are characterized by their dependence on a protein subunit called cyclin to activate their enzymatic function. They belong to the serine/threonine kinase family [4].
Blocking the cell cycle by targeting kinases has been proposed as a way to kill cancer cells, as in [5]. CDKs related to the cell cycle are divided into three subfamilies: Cdk1, Cdk4, and Cdk5. The Cdk1 subfamily consists of CDK1, CDK2, and CDK3. Although CDK1 is the most important kinase in this subfamily because of its major role in the cell cycle, CDK2 is also essential, as it participates in the cell division cycle. In addition, CDK2 has been linked to cancer and is targeted for cancer treatment, as in [6].
CDK5 is an important enzyme with functions related to the cell cycle, gene expression, and other processes. It belongs to the Cdk5 subfamily. In addition to its role in cell-cycle progression, it is known for regulating neuronal proteins [4]. When deregulated, CDK5 is associated with neurodegenerative diseases [7]. It is also linked to cancer and other diseases [8].
Computer-based approaches are being used to profile the activity of inhibitors against kinases and to explore and tackle the selectivity problem. Among these approaches is machine learning, which is widely used in biological and medical problems. Different machine learning techniques have been used in interaction modelling studies to predict protein-inhibitor interactions.
In [9], random forest was used to classify kinase variants in order to understand their relation to different diseases. The classification was based on protein kinase sequence features. The resulting accuracy was 88%.
In the area of kinase inhibitors, machine learning was used in [10] to study the kinase inhibitory data of [11] in order to model the prediction of interactions between kinases and their inhibitors. The study aimed to build a computational-experimental framework by applying kernel-based regression methods to molecular descriptors and fingerprints of kinase inhibitors. The predicted results showed a correlation of 0.77 with experimental kinase assay results.
In [12], machine learning was used to predict the binding of kinases to inhibitors by modelling different sets of features. The kinase features were based on sequences, together with phylogenetic features and amino acid positions in the active site. For inhibitors, 2D structural features and chemical features were used. The experiments showed the importance of the different feature sets based on decision tree and SVM modelling results. The highest prediction accuracy achieved was 86.1%.
Another application of machine learning, predicting the active or inactive conformations of kinases, was reported in [13]. The study showed that classification based on the orientation of the activation segment performs better than other methods.
Genetic Programming (GP) [14] is a machine learning technique that simulates biological evolution and is used for modelling by regression or classification. It starts from a random population and then produces successive generations of individuals by applying evolutionary operations such as mutation, crossover, and selection, with the aim of improving a fitness function. The individuals in GP are trees that represent mathematical models relating the modelled features to a target variable [15].
Random Forest (RF) [16] is a machine learning technique based on a large number of decision trees. For each tree, a bootstrap sample is drawn, and a set of variables is selected randomly to decide the split at each node. The tree grows and splits using the variables at each node until a specified criterion is met [17].
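The two sources of randomness in random forest described above can be illustrated with a short Python sketch. It is only an illustration under our own assumptions: the function names and the values of 10 rows and 40 candidate variables are hypothetical, and no tree is actually grown.

```python
import random


def bootstrap_sample(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def candidate_variables(n_variables, n_candidates, rng):
    """Pick n_candidates distinct variable indices to consider at one node split."""
    return rng.sample(range(n_variables), n_candidates)


rng = random.Random(42)  # fixed seed so the sketch is repeatable
rows_for_tree = bootstrap_sample(10, rng)            # rows used to grow one tree
split_candidates = candidate_variables(1666, 40, rng)  # variables tried at one node

print(rows_for_tree)
print(split_candidates[:5])
```

Because sampling is done with replacement, some rows usually appear more than once in a bootstrap sample and others are left out, which is what makes the trees differ from one another.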
In this study, we use genetic programming and random forest classification to distinguish inhibitors from non-inhibitors of two cyclin-dependent kinases, CDK5 and CDK2. Both techniques were used to model chemical descriptor information. In addition, we investigated the response of the classifiers to adding information about the kinase binding promiscuity of the compounds.
The outputs of the classifiers were analyzed using different accuracy measures. Because there is no single standard evaluator of classification accuracy, we calculated a group of measures for a broad evaluation of the results. These measures are confusion matrices, accuracy, ROC curves, AUC values, F1 score, and the Matthews correlation coefficient.
Additionally, the analysis shows how well the classifiers reflect compound promiscuity information. The promiscuity of a compound against kinases is the ratio of kinases that the compound can inhibit at a specific concentration [11].
This document is structured as follows. Section 2 describes the dataset we used and the data processing and workflow steps. Section 3 presents and discusses the results. Section 4 presents our conclusions and expectations for future improvements.

II. DATA AND METHODS
We describe in this section the data sources, tools, and the methodology we used in order to achieve our objectives in building and evaluating the classifiers.

A. Data Sources
The dataset we used was extracted from the data of [11]. The original dataset contains measured interaction values for more than 3000 compounds against 172 kinases. The values are pKi values, that is, the negative base-10 logarithm of the Ki interaction value. We extracted the values for the first 1497 compounds against two protein kinases belonging to the cyclin-dependent kinase subfamily, CDK2 and CDK5.
The original dataset contains five cyclin-dependent kinases. We decided to study the data for CDK2 and CDK5 only, as they have a higher number of measured inhibitor activities, with 868 and 1038 values respectively.
A pKi threshold of 5.9 (pKi > 5.9) was used to classify compounds as inhibitors or non-inhibitors. This threshold was chosen based on what the original study [11] reports about compound activity against kinases. We used the molecular descriptor values for the 1497 compounds. These values had been obtained previously using the e-dragon online tool [18]. A total of 1666 descriptor values were extracted for each compound.
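The relation between Ki, pKi, and the 5.9 threshold can be made concrete with a small Python sketch. It assumes that Ki is expressed in mol/L, which the dataset description above does not state; the function names are ours.

```python
import math

PKI_THRESHOLD = 5.9  # inhibition threshold used in this study


def pki_from_ki(ki_molar):
    """Return pKi = -log10(Ki), with Ki assumed to be in mol/L."""
    if ki_molar <= 0:
        raise ValueError("Ki must be positive")
    return -math.log10(ki_molar)


def is_inhibitor(pki, threshold=PKI_THRESHOLD):
    """A compound counts as an inhibitor when its pKi exceeds the threshold."""
    return pki > threshold


# Under the mol/L assumption, a Ki of 1 uM (1e-6 mol/L) lies just above the threshold.
example_pki = pki_from_ki(1e-6)
print(round(example_pki, 2), is_inhibitor(example_pki))
```

Under this assumption, a pKi of 5.9 corresponds to a Ki of roughly 1.26 uM, so a higher pKi means a more potent compound.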
A promiscuity value for each compound is provided in the original dataset of [11]. Promiscuity_1uM of a compound represents the proportion of tested kinases against which the compound achieves a potency of 1 uM.
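In this study the promiscuity values were taken directly from [11]. Purely for illustration, the following Python sketch shows how such a ratio could be computed from a compound's kinase profile. The cutoff of pKi >= 6 for a potency of 1 uM, the treatment of untested kinases, and the placeholder kinase names are our assumptions and not the procedure of [11].

```python
def promiscuity(pki_by_kinase, pki_cutoff=6.0):
    """Fraction of tested kinases that a compound hits at the given cutoff.

    pki_by_kinase maps a kinase name to a measured pKi, or to None when the
    compound was not tested against that kinase (assumed convention).
    """
    tested = [v for v in pki_by_kinase.values() if v is not None]
    if not tested:
        raise ValueError("compound has no measured kinases")
    hits = sum(1 for v in tested if v >= pki_cutoff)
    return hits / len(tested)


# Hypothetical profile: three tested kinases, two of them hit at the cutoff.
profile = {"CDK2": 6.4, "CDK5": 5.2, "KINASE_A": 7.1, "KINASE_B": None}
print(promiscuity(profile))
```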

B. Data Preparation
For each of the two proteins, two data files were created with all the information needed for modelling. The first file contained the descriptor values for the compounds with measured interactions with that protein, after removing columns that contained only zeros for all compounds. The interaction values were added as the last column to serve as the target value. The second file was a version of the first that included the promiscuity_1uM value of each compound as an additional feature. A classifier was therefore built twice for each protein: once with descriptor values only, and once with promiscuity and descriptor values.
The interaction values were converted to classes using the threshold derived from [11], with a pKi of 5.9 as the inhibition threshold. The data file for each protein was modified by replacing the interaction value with the class number. Class 1 indicates that the compound is a potential inhibitor (pKi > 5.9), while class 2 indicates a non-inhibitor (pKi <= 5.9). Table I shows the counts of inhibitors and non-inhibitors in both protein datasets according to this threshold. Data separation, processing, and cleaning were all done using Python scripts that we wrote to read, manipulate, and write CSV files with the desired structure.
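The two preparation steps described above, removing all-zero descriptor columns and replacing the interaction value with a class number, can be sketched in Python as follows. This is a simplified illustration and not the original scripts: rows are held in memory as dictionaries, and the column names (desc_a, desc_b, pKi, class) are hypothetical.

```python
PKI_THRESHOLD = 5.9  # inhibition threshold used in this study


def drop_all_zero_columns(rows, keep="pKi"):
    """Remove descriptor columns that are zero for every compound."""
    if not rows:
        return []
    informative = [
        name for name in rows[0]
        if name == keep or any(row[name] != 0 for row in rows)
    ]
    return [{name: row[name] for name in informative} for row in rows]


def add_class_labels(rows, target="pKi"):
    """Replace the target value with class 1 (inhibitor) or 2 (non-inhibitor)."""
    labelled = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != target}
        new_row["class"] = 1 if row[target] > PKI_THRESHOLD else 2
        labelled.append(new_row)
    return labelled


rows = [
    {"desc_a": 0.3, "desc_b": 0.0, "pKi": 6.5},
    {"desc_a": 1.2, "desc_b": 0.0, "pKi": 5.9},
]
prepared = add_class_labels(drop_all_zero_columns(rows))
print(prepared)
```

In this toy input, desc_b is zero for both compounds and is dropped, and the second compound falls in class 2 because its pKi does not exceed 5.9.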

C. Methodology
In this study, we followed a multi-step methodology to explore the ability of two machine-learning techniques to predict the inhibition activity of compounds against two cyclin-dependent kinases. Random forest and genetic programming were applied to the CDK5 and CDK2 datasets. The effect of adding promiscuity to the modelling features was also investigated, as it provides information about the ability of a compound to inhibit kinases.
First, we obtained, separated, and preprocessed the datasets for CDK5 and CDK2. Next, we obtained and prepared the tools required for random forest and genetic programming classification. Then, we performed four experiments for each protein and collected the outputs. Finally, we evaluated the performance of the classifiers with different measures and compared the results. We concentrated more on the outputs of the RF classifier. The complete workflow is shown in Fig. 1, which shows the steps followed to build both classifiers for each protein dataset.
In all experiments, the descriptor values were used as the variables (features), and the class number was the target to be predicted. Each dataset was divided into a 70% training set and a 30% test set.
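A random 70/30 split of this kind can be sketched in a few lines of Python. The sketch is not the code used in the experiments; the function name, the fixed seed, and the rounding of the training-set size are our assumptions.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle row indices and split them into training and test subsets."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    cut = int(round(train_fraction * len(rows)))
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test


# Stand-in for the 1038 CDK5 compounds: each "row" is just an integer id.
compounds = list(range(1038))
train, test = train_test_split(compounds)
print(len(train), len(test))
```

Because the split is random, the number of inhibitors and non-inhibitors in the test set changes from one run to the next unless the seed is fixed.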

D. Genetic Programming Classification
To perform GP classification, we used a free desktop tool, HeuristicLab Optimizer 3.3.15 [19], in symbolic classification mode.
The input to each GP experiment was one of the files we created previously, together with the GP parameter settings. We tried different combinations of parameters to achieve higher accuracy. The parameter values used in the GP experiments are shown in Table II.

Model Depth 10
Model length 100

E. Random Forest Classification
Data files for each protein were loaded into RStudio. Each dataset was divided by random sampling into a 70% training set and a 30% test set for validation.
The R package randomForest was used for the modelling. We set two basic RF parameters: the number of trees constructed (ntree) and the number of variables randomly selected as candidates at each split (mtry). We tried different values for these two parameters until we achieved a relatively low error value. The parameter values used for the RF experiments on the different datasets, including those with promiscuity, are shown in Table III. The clear difference when promiscuity was included in the dataset was that higher accuracies were achieved with a lower number of trees.
Finally, the results of the experiments were collected, and several accuracy metrics were calculated for further analysis.

III. RESULTS AND DISCUSSION
Both machine-learning classification techniques, genetic programming and random forest, were tested on two datasets for two cyclin-dependent kinases and their inhibitors. Results varied between datasets and techniques. In this section, we present the results using the different accuracy measures, along with a discussion of the variations in these accuracies.

A. Accuracy
The RF classifier assigned a class to every item in the test data. Table IV shows the overall accuracy of the RF classifier, measured as the proportion of correctly classified items, in the training and test sets for both proteins. The accuracy is also shown for the datasets that include promiscuity.

B. Confusion Matrix
A confusion matrix is a table with a specific layout that is commonly used to describe and visualize the performance of a classification algorithm. We show here the confusion matrices for each experiment.
The confusion matrices resulting from the RF experiments are shown in Tables V to VIII. Note that the test sets were selected by random sampling, so the number of items in each class differs between experiments.
The confusion matrices in all experiments show a high ability of the RF classifier to identify non-inhibitors. Its ability to identify inhibitors, however, is not at the same level. The reason could be the imbalance in the data provided to the classifier, as most of the compounds in the datasets are non-inhibitors, as shown in Table I.
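To make the layout and the effect of class imbalance concrete, the following Python sketch builds a confusion matrix with actual classes as rows and predicted classes as columns, and derives the overall accuracy from it. The ten labels are invented for illustration and are not taken from our results.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels=(1, 2)):
    """Rows are actual classes, columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def accuracy(matrix):
    """Share of items on the diagonal, i.e. correctly classified items."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


# Class 1 = inhibitor, class 2 = non-inhibitor (made-up, imbalanced example).
actual = [1, 1, 1, 2, 2, 2, 2, 2, 2, 2]
predicted = [1, 2, 2, 2, 2, 2, 2, 2, 2, 1]
matrix = confusion_matrix(actual, predicted)
print(matrix)
print(accuracy(matrix))
```

In this toy example the overall accuracy is 0.7 even though only one of the three inhibitors is recognized, which is the kind of pattern an imbalanced dataset can produce.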

C. ROC Curves
The receiver operating characteristic (ROC) curve was plotted for all outputs to assess the ability of each classifier to discriminate between inhibitors and non-inhibitors. RF ROC curves were plotted using the ROCR package in R [20] and are shown in Fig. 2, while ROC curves for the GP experiments were obtained from HeuristicLab and are shown in Fig. 3.
The curves show a fairly high ability of the RF classifier to determine the class of the test data. To aid the interpretation of the ROC curves, Table IX lists the corresponding area under the ROC curve (AUC) values.
From the AUC values and the ROC curves, we can see that RF outperforms GP on both protein datasets, especially when promiscuity information is included. The AUC values also show a marked improvement when promiscuity information is included, for both proteins and both techniques. However, the relative improvement from promiscuity information is notably higher for the GP classifier.
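The AUC can be read as the probability that a randomly chosen inhibitor receives a higher score than a randomly chosen non-inhibitor. The Python sketch below computes it directly from that definition. It is an illustration only, since the AUC values in this study came from ROCR and HeuristicLab; the scores and labels are invented.

```python
def auc(scores, labels, positive=1):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == positive]
    negatives = [s for s, y in zip(scores, labels) if y != positive]
    if not positives or not negatives:
        raise ValueError("both classes are required to compute AUC")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up classifier scores for three inhibitors (1) and three non-inhibitors (2).
scores = [0.9, 0.8, 0.35, 0.6, 0.4, 0.2]
labels = [1, 1, 1, 2, 2, 2]
print(auc(scores, labels))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes.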

D. F1 Score
The F1 score is calculated from the precision and recall measures as their harmonic mean. It summarizes how well a classification model identifies the positive class, based on the number of correctly identified positives relative to the number of predicted positives (precision) and to the total number of actual positives (recall). Tables X and XI show the F1 scores for the RF and GP classifiers on the CDK5 and CDK2 datasets respectively.
For CDK2, the F1 scores fall within a narrow range, except for the GP classifier without promiscuity. This measure also shows that GP reflects the promiscuity information more strongly, increasing its F1 score by a larger ratio than RF, although the RF values were better to begin with. Note also that the CDK5 dataset without promiscuity did not yield high-accuracy predictions of positives, even after many experiments.
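For reference, the F1 score can be computed from confusion-matrix counts as in the following Python sketch, where inhibitors are treated as the positive class. The counts in the example are invented and do not correspond to Tables X and XI.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for the positive class.

    tp: positives predicted as positive
    fp: negatives predicted as positive
    fn: positives predicted as negative
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts: 30 inhibitors found, 10 false alarms, 20 inhibitors missed.
precision, recall, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(precision, recall, round(f1, 3))
```

True negatives do not enter the formula, so F1 is not inflated by a large non-inhibitor class in the way overall accuracy can be.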

E. Matthews Correlation Coefficient
The Matthews correlation coefficient (MCC) is a quality measure used to evaluate binary classifications, so it is applicable in our case. It takes into account true and false positives and negatives, and is therefore considered a balanced measure. Tables XII and XIII show the MCC values for the RF and GP classifiers on the CDK5 and CDK2 datasets respectively. In general, MCC ranges between -1 (total disagreement between predictions and actual classes) and 1 (perfect prediction), with 0 corresponding to a prediction no better than random. In this case, the MCC values for both datasets are near 0.5 or higher, except where the GP classifier predicts the test sets of both proteins. As a general note, GP performs better than RF on the training data, but it cannot predict the test sets accurately. RF, on the other hand, is more accurate in predicting the classes of the test sets.
It is also clear from the tables that the MCC value increases when promiscuity information is included in the datasets. On the training sets, promiscuity information improved the accuracy of the GP classifier more than that of the RF classifier. The improvement is also clearly noticeable in the GP results for the test sets, although GP accuracy on the test sets remains low compared to RF.
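The MCC can be computed from the four confusion-matrix counts as in the Python sketch below. Returning 0 when the denominator is zero is a common convention that we assume here; the example counts are invented.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention when a row or column is empty
    return (tp * tn - fp * fn) / denominator


print(mcc(tp=5, tn=5, fp=0, fn=0))  # every item classified correctly
print(mcc(tp=0, tn=0, fp=5, fn=5))  # every item assigned to the wrong class
print(mcc(tp=5, tn=5, fp=5, fn=5))  # predictions unrelated to the true classes
```

The three calls cover the two ends of the scale and its midpoint: a perfect classifier, a fully inverted one, and one whose predictions carry no information about the true class.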

F. Important Variables
RF can rank the variables considered during training, so each experiment can produce a list of the most important variables affecting the prediction results. Table XIV shows a portion of the top important variables in the two experiments for each protein. We selected variables that ranked highly in the RF ranking for both mean decrease in accuracy and mean decrease in Gini, and that appeared for each protein in both of its experiments. Variable names refer to chemical descriptors produced by e-dragon.
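The selection rule described above amounts to intersecting ranked lists. The Python sketch below shows one way to express it, assuming that "high rank" means being among the top k variables of each measure; the value of k, the function names, and the variable names a, b, and c are hypothetical placeholders rather than actual descriptors.

```python
def top_k(importance, k):
    """Names of the k variables with the largest importance values."""
    return set(sorted(importance, key=importance.get, reverse=True)[:k])


def shared_top_variables(experiments, k):
    """Variables in the top k of both measures in every experiment.

    experiments is a list of (mean_decrease_accuracy, mean_decrease_gini)
    pairs, each a dict mapping a variable name to its importance value.
    """
    selected = None
    for accuracy_importance, gini_importance in experiments:
        both = top_k(accuracy_importance, k) & top_k(gini_importance, k)
        selected = both if selected is None else selected & both
    return selected or set()


# Two made-up experiments for one protein (without and with promiscuity).
without_promiscuity = ({"a": 0.9, "b": 0.8, "c": 0.1}, {"a": 5.0, "b": 1.0, "c": 4.0})
with_promiscuity = ({"a": 0.7, "b": 0.2, "c": 0.6}, {"a": 3.0, "b": 2.5, "c": 1.0})
chosen = shared_top_variables([without_promiscuity, with_promiscuity], k=2)
print(chosen)
```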

IV. CONCLUSION
Machine learning techniques provide a useful means to model and understand kinase-inhibitor interaction data. Although the results were not consistently high across the different accuracy measures, many measures show promising values and indicate good predictions. The classifiers produced good predictions for the class with more data in the dataset, the non-inhibitor class. We suppose that this could be a result of the imbalanced data distribution.
Another important conclusion is the ability of the classifiers to respond effectively to a single feature, reflecting its importance. Kinase inhibitors are likely to bind to more than one kinase. The improvement in predictions when compound promiscuity is added to the features means that it was modelled efficiently. This suggests that adding more features, such as protein binding site properties, could substantially improve the prediction accuracy.
Compared to the previous work using different techniques mentioned in Section 1, our results achieved promising values in terms of overall accuracy. The average overall accuracy of the RF experiments was about 85%, which is comparable to the 88% in [9] and the 86% in [12]. Most previous work tried to predict kinase inhibitors for the whole family, while in this work we concentrated on the CDK subfamily in order to be more specific and more responsive to any special binding features of CDKs. We expect that extending the data by adding more features and considering protein-related properties at different levels would improve the classification accuracy. Finally, a well-performing technique is not necessarily the most sensitive one to new features. Different approaches should be tried with different datasets and features, with a comprehensive and accurate evaluation of the results.

TABLE I.
NUMBER OF COMPOUNDS IN EACH CLASS

TABLE III.
RF CLASSIFICATION PARAMETER VALUES

TABLE IV.
RF RESULT ACCURACIES FOR BOTH PROTEINS WITH AND WITHOUT PROMISCUITY

In all confusion tables, the columns show the number of predicted items in each class (Inhibitor, Non-Inhibitor), and the rows show the actual items in each class. The results are shown for the training and test sets. Table V shows the results for the CDK5 dataset without promiscuity, while Table VI shows the results for the CDK5 dataset including promiscuity information. Similarly, for the CDK2 dataset, Table VII shows the results without promiscuity information, while Table VIII shows the confusion matrix for the CDK2 dataset that includes promiscuity values.

TABLE VIII.
RF CLASSIFIER CONFUSION MATRIX FOR CDK2 INHIBITORS (DESCRIPTORS AND PROMISCUITY)

TABLE IX.
AREA UNDER ROC CURVE FOR RF AND GP CLASSIFIERS

TABLE XIV.
IMPORTANT VARIABLES AS SELECTED BY RF