An Hybrid Approach for Cost Effective Prediction of Software Defects

Identifying software defects during the early stages of the software development life cycle reduces project effort and cost. Hence, much research has been devoted to predicting the defect proneness of software modules using machine learning approaches. The main problems with software defect data are the cost effective problem and class imbalance. The cost effective problem refers to the fact that predicting a defective module as non-defective incurs a higher penalty than predicting a non-defective module as defective. In this work, we propose a hybrid approach to address the cost effective problem in software defect data. To address it, we use the bagging technique with the Adaptive Neuro Fuzzy Inference System (ANFIS) as the base classifier. In addition, we address the class imbalance and high dimensionality problems using the Adaptive Neuro Fuzzy Inference System and principal component analysis, respectively. We conducted experiments with the proposed approach on software defect datasets downloaded from the NASA dataset repository and compared the results with the approaches mentioned in the literature survey. We observed that the area under the ROC curve (AUC) of the proposed approach improved by approximately 15% compared with a highly efficient approach mentioned in the literature survey.
Keywords: Cost effective problem; principal component analysis; adaptive neuro fuzzy inference system; area under ROC curve


I. INTRODUCTION
The software development process involves requirement specification, design, implementation, and testing. During each phase, reviews are conducted to assess the progress and quality of the software. The quality of software depends on the defects found in it. A defect is a condition that does not meet a user requirement stated in the requirement specification. If a defect is found during the late stages of software development, i.e., during software maintenance, the penalty is very high. To reduce this penalty, defect proneness must be identified in advance [27].
According to Boehm, the cost of fixing errors increases as software development progresses. If the cost of fixing an error during the requirement phase is taken as 1 unit, then the cost is 3 to 8 units in the design phase, 7 to 16 units in the implementation phase, 21 to 78 units in the integration and testing phase, and 29 to more than 1500 units in the maintenance phase. This motivates the application of machine learning techniques to the early identification of software defects [28]. Fig. 1 shows the cost escalation of resolving defects across the phases of the software development life cycle.

A. Machine Learning Techniques
Various machine learning techniques, such as k-nearest neighbours, support vector machines, decision trees, and Bayesian networks, are used to identify software defects.

B. Approaches for Software Defect Prediction (SDP)
1) Decision trees: Decision trees were among the earliest classifiers applied to software defects. In a decision tree, the attribute whose split gives the lowest impurity is selected as the root node. There are three measures of impurity: 1) entropy, 2) Gini index, and 3) misclassification error. A decision tree outputs whether a module is defect prone or not, based on input attributes such as IO Comments and cyclomatic complexity. Fig. 2 shows the decision tree constructed on the cm1 dataset.
2) Bayesian classifiers: There are two types of Bayesian classifiers: 1) the naive Bayes classifier and 2) Bayesian belief networks.
Naive Bayes classifier: In the naive Bayes classifier, the given unknown sample is denoted X. The classifier finds the posterior probabilities P(Defect = Yes | X) and P(Defect = No | X) for the sample X. If P(Defect = Yes | X) > P(Defect = No | X), the classifier outputs the sample X as defective. Otherwise, it outputs the sample X as non-defective. The drawback of the naive Bayes classifier is that it assumes the input variables are independent of one another given the target variable (Defect).
Bayesian belief networks: A Bayesian belief network has two components: 1) a directed acyclic graph (DAG) and 2) a probability table. The DAG encodes the relationships between attributes as a graph. The probability table holds, for each attribute, the probabilities conditioned on its parent attributes. Fig. 3 shows the DAG constructed on the cm1 dataset.
3) Support vector machines: Support vector machines are among the most popular techniques for regression and classification problems. Binary support vector machines solve classification tasks, while support vector regression solves regression tasks. Software defect prediction is a classification problem, and hence binary support vector machines are used to classify a module as defective or non-defective. In a support vector machine, a boundary function separates the samples of the two classes. The boundary can be built from various kernel functions, such as linear, polynomial, radial basis, and ANOVA.
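As an illustration of the kernel functions named above, the following is a minimal sketch of the linear, polynomial, and radial basis kernels in plain Python. The parameter names and default values (`degree`, `coef0`, `gamma`) are assumptions chosen for the example and are not taken from the experiments in this paper.

```python
import math


def linear_kernel(x, y):
    """Dot product of two feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """Polynomial kernel: (x . y + coef0) ** degree."""
    return (linear_kernel(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=0.5):
    """Radial basis kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


# Two hypothetical module feature vectors (illustrative values only).
module_a = [1.0, 2.0]
module_b = [3.0, 4.0]
similarities = {
    "linear": linear_kernel(module_a, module_b),
    "polynomial": polynomial_kernel(module_a, module_b),
    "rbf": rbf_kernel(module_a, module_b),
}
```

Each kernel measures the similarity of two modules; the choice of kernel determines the shape of the boundary that the support vector machine can learn.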
Multi Layer Perceptrons: Multilayer perceptrons are neural networks comprising processing units, called neurons, organized in multiple layers. Each neuron computes on the inputs it receives from the previous layer and propagates its output to the next layer. The neurons are connected by weighted edges. Each neuron applies an activation function to its weighted inputs together with a threshold and produces an output signal. There are various activation functions, such as threshold, sigmoid, and hyperbolic tangent. Table I illustrates the architecture of a multilayer perceptron along with its nodes, connections, and weights.
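The computation performed by a single neuron and by a full layer can be sketched as follows. This is a minimal illustration, assuming the threshold is subtracted from the weighted sum before the activation is applied; the weights and inputs are hypothetical and do not correspond to Table I.

```python
import math


def sigmoid(z):
    """Logistic activation, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def step(z):
    """Threshold activation: fires (1) when the net input is non-negative."""
    return 1.0 if z >= 0.0 else 0.0


def neuron_output(inputs, weights, threshold, activation=sigmoid):
    """Weighted sum of the inputs minus the threshold, passed through the activation."""
    net = sum(w * x for w, x in zip(weights, inputs)) - threshold
    return activation(net)


def layer_output(inputs, weight_rows, thresholds, activation=sigmoid):
    """Outputs of all neurons in one layer; each row of weights belongs to one neuron."""
    return [
        neuron_output(inputs, weights, threshold, activation)
        for weights, threshold in zip(weight_rows, thresholds)
    ]


def mlp_forward(inputs, layers, activation=sigmoid):
    """Propagate inputs through a list of (weight_rows, thresholds) layers."""
    signal = inputs
    for weight_rows, thresholds in layers:
        signal = layer_output(signal, weight_rows, thresholds, activation)
    return signal


# Hypothetical 2-2-1 network (all numbers are illustrative only).
hidden = ([[0.5, -0.25], [0.1, 0.3]], [0.0, 0.1])
output = ([[1.0, -1.0]], [0.0])
prediction = mlp_forward([1.0, 2.0], [hidden, output])
```

Passing `math.tanh` or `step` as the `activation` argument switches the network to the hyperbolic tangent or threshold activation.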
Adaptive Neuro Fuzzy Inference System: ANFIS is a five-layered architecture used for classification tasks. Satya Srinivas et al. [26] proposed the Adaptive Neuro Fuzzy Inference System for software defect prediction.
ANFIS generates a Sugeno fuzzy inference system as its output for a classification task. In ANFIS, the input attributes are fuzzified and the target attribute is defuzzified. Initially, the subtractive clustering method is used to generate the Sugeno fuzzy inference system. The premise and consequent parameters of the Sugeno fuzzy inference system are then trained using the training data. Here, the training rate parameter must be set to an appropriate value. Setting the training rate too high drives the ANFIS model into an unstable state. Setting the training rate too low creates a highly complex model.
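To make the Sugeno inference step concrete, the following is a minimal sketch of how a first-order Sugeno system maps an input to an output: each rule fires with a strength given by its membership functions, and the output is the firing-strength-weighted average of the rule consequents. Gaussian membership functions, the rule layout, and all parameter values are assumptions made for this example; the training of premise and consequent parameters is not shown.

```python
import math


def gaussian_mf(x, center, sigma):
    """Gaussian membership degree of x for a fuzzy set (assumed shape)."""
    return math.exp(-((x - center) ** 2) / (2.0 * sigma ** 2))


def sugeno_predict(x, rules):
    """Weighted average of linear rule consequents.

    Each rule is a dict with premise parameters 'centers' and 'sigmas'
    (one per input) and consequent parameters 'coeffs' and 'bias'.
    """
    weighted_sum = 0.0
    total_strength = 0.0
    for rule in rules:
        # Firing strength: product of the membership degrees of all inputs.
        strength = 1.0
        for xi, center, sigma in zip(x, rule["centers"], rule["sigmas"]):
            strength *= gaussian_mf(xi, center, sigma)
        # First-order consequent: linear function of the inputs.
        consequent = sum(c * xi for c, xi in zip(rule["coeffs"], x)) + rule["bias"]
        weighted_sum += strength * consequent
        total_strength += strength
    if total_strength == 0.0:
        return 0.0  # no rule fires; fall back to a neutral output
    return weighted_sum / total_strength


# Two hypothetical rules on a single complexity attribute:
# "low complexity -> 0 (non-defective)" and "high complexity -> 1 (defective)".
rules = [
    {"centers": [0.0], "sigmas": [2.0], "coeffs": [0.0], "bias": 0.0},
    {"centers": [10.0], "sigmas": [2.0], "coeffs": [0.0], "bias": 1.0},
]
score = sugeno_predict([10.0], rules)
```

A module can then be labelled defective when the score exceeds a chosen cut-off such as 0.5, which is again an assumption of this sketch.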

4) Cost effective learning:
Misclassifying samples of some classes incurs a higher penalty than misclassifying samples of other classes. For example, in software defect prediction, misclassifying a defective module as non-defective incurs a higher penalty than misclassifying a non-defective module as defective. A defect found during the later stages of software development is costly, and hence a defect-prone module should not be misclassified as non-defective, even at the expense of misclassifying some non-defective modules as defective. This escalation of error cost is shown in Fig. 1.
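The asymmetry described above can be expressed as a minimum expected cost decision rule. The sketch below is illustrative only: the cost values (10 for a missed defect, 1 for a false alarm) are hypothetical and are not the costs used in our experiments.

```python
def predict_defective(p_defect, cost_fn=10.0, cost_fp=1.0):
    """Label a module defective when that choice has the lower expected cost.

    p_defect: estimated probability that the module is defective.
    cost_fn:  penalty for labelling a defective module non-defective (assumed).
    cost_fp:  penalty for labelling a non-defective module defective (assumed).
    """
    expected_cost_if_non_defective = p_defect * cost_fn
    expected_cost_if_defective = (1.0 - p_defect) * cost_fp
    return expected_cost_if_non_defective > expected_cost_if_defective


def total_cost(y_true, y_pred, cost_fn=10.0, cost_fp=1.0):
    """Total misclassification cost over a set of modules (1 = defective)."""
    cost = 0.0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 0:
            cost += cost_fn  # missed defect
        elif actual == 0 and predicted == 1:
            cost += cost_fp  # false alarm
    return cost


# With a 10:1 cost ratio, a module with only a 20% defect probability
# is still flagged, whereas equal costs would leave it unflagged.
flag_with_asymmetric_costs = predict_defective(0.2)
flag_with_equal_costs = predict_defective(0.2, cost_fn=1.0, cost_fp=1.0)
```

Under this rule the decision threshold on the defect probability moves from 0.5 to cost_fp / (cost_fp + cost_fn), so a larger penalty for missed defects makes the classifier flag modules more readily.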

5) Ensemble learning:
Ensemble learning is the process of constructing multiple classifiers and combining them to improve accuracy on classification problems. Some ensembling techniques are simple voting, average voting, bagging, and boosting. In simple voting, each classifier votes for an output value, and the value with the highest number of votes is taken as the actual output. In average voting, the average of the outputs of the classifiers is taken as the actual output. This technique is suitable for regression tasks. In bagging, equal-size subsets are sampled from the dataset and a classifier is constructed on each subset. Finally, each classifier votes for the output value. Bagging and boosting improve classification performance by constructing multiple classifiers.
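Simple voting, average voting, and bagging as described above can be sketched as follows. Sampling with replacement (bootstrap sampling) is assumed here as the way of drawing the equal-size subsets, and `train_fn` is a hypothetical placeholder for any procedure that returns a trained classifier as a callable.

```python
import random
from collections import Counter


def majority_vote(predictions):
    """Simple voting: the most frequent predicted label wins."""
    return Counter(predictions).most_common(1)[0][0]


def average_vote(values):
    """Average voting: mean of the numeric outputs (suited to regression)."""
    return sum(values) / len(values)


def bootstrap_sample(data, rng):
    """Draw a sample of the same size as the data, with replacement (assumed)."""
    return [rng.choice(data) for _ in data]


def bagging_fit(data, train_fn, n_models=5, seed=0):
    """Train one classifier per bootstrap sample."""
    rng = random.Random(seed)
    return [train_fn(bootstrap_sample(data, rng)) for _ in range(n_models)]


def bagging_predict(models, x):
    """Each trained classifier votes; the majority label is the output."""
    return majority_vote([model(x) for model in models])


def train_majority_label(sample):
    """Toy stand-in learner: always predicts the majority label of its sample."""
    label = majority_vote([y for _, y in sample])
    return lambda x: label


# Hypothetical (feature, label) pairs; 1 = defective.
toy_data = [(0.1, 0), (0.2, 0), (0.9, 1), (0.8, 1), (0.7, 1)]
models = bagging_fit(toy_data, train_majority_label, n_models=3)
ensemble_label = bagging_predict(models, 0.5)
```

Using an odd number of classifiers avoids tied votes in a two-class problem such as defect prediction.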
In boosting, classifiers are constructed in sequence. The samples that are incorrectly classified are given a higher weight in the construction of the next classifier. This procedure is repeated until the required accuracy is obtained or the maximum number of classifiers has been constructed.
In this paper, we apply a hybrid approach to overcome the cost effective problem in SDP. Section II presents a literature survey on SDP. Section III describes the methodology of the hybrid approach for SDP. Section IV presents the results of applying the proposed methodology to SDP.

II. LITERATURE SURVEY
Yan Naung Soe et al. applied the random forest algorithm to software defect prediction and compared its performance with that of other machine learning techniques. They reported a maximum accuracy of 99.59 and a minimum accuracy of 85.96 [1]. Taek Lee et al. proposed micro interaction metrics, such as browsing events and file editing, for the prediction of software defects and observed high accuracy when combining these metrics with existing metrics in a cost effective manner [2]. Fei Wu et al. proposed a cost-sensitive local collaborative representation (CLCR) approach for software defect prediction and concluded that accuracy increased with the proposed approach [3]. Jinsheng Ren et al. proposed asymmetric kernel principal component analysis for solving the class imbalance problem in software defect prediction. They evaluated the validity of their proposed model using the F-measure, Friedman's test, and Tukey's test [4].
Ayse Tosun et al. proposed decision threshold optimization on the naive Bayes classifier to find the best threshold separating defective and non-defective samples in software defect data [5]. Ming Cheng et al. proposed a semi-supervised approach for the identification of software defects. Their model evaluates the confidence probability of an unlabelled sample to predict its class label. They considered different misclassification costs to improve classifier performance [6]. Igor Ibarguren et al. proposed consolidated tree construction, which incorporates misclassification weights into the training of the classifier. They showed that consolidated tree construction performs better than other rule based classifiers [7]. Yuanxun Shao et al. proposed weighted associative classification for addressing the imbalance problem in software defect prediction. They determined the weights of features using correlation analysis and showed that the G-mean measure increased with their approach [8]. Shuo Feng et al. proposed a complexity based oversampling technique to address the data imbalance problem in the identification of software defects [9]. Rakesh Rana et al. proposed a Bayesian inference method for software defect prediction to analyse the inflow distribution of defects. This technique has been used for the early detection of software defects in large software projects [10]. [19]. Jaroslaw Hryszko et al. investigated the effect of software defects in modules on the quality assurance of software. Their investigation showed that quality assurance cost can be reduced by 30% with their proposed approach [20]. Kazuya Tanaka et al. focused on the use of the auto-sklearn tool, which automatically selects an appropriate prediction model for data preprocessing and classification in software defect prediction. The tool indicated that random forest is the best model among various machine learning techniques [21].
Pradeep Singh proposed a stacking based framework that combines the class balancing technique SMOTE with ensemble classifiers to predict software defects. He concluded that the accuracy of the stacking based model increased compared with the traditional approaches in his literature survey [22]. Haitao He et al. proposed an ensemble RIPPER classifier for software defect prediction. In their research, they applied principal component analysis for dimensionality reduction, adaptive synthetic sampling for balancing the dataset, and the RIPPER model for classification. They concluded that the classification error was reduced with their proposed model [23]. Zhiqiang Li et al. proposed ensemble multiple kernel correlation alignment for heterogeneous defect prediction and concluded that the ensemble approach outperforms the competing methods [24]. Xin Xia et al. proposed the hybrid model reconstruction (HYDRA) approach for software defect prediction. It consists of two phases: a genetic algorithm phase followed by an ensemble learning phase. They concluded that HYDRA improves on the F1 score of the Zero-R base classifier [25].
In the prediction of software defects, some researchers have addressed the class imbalance problem and others have addressed the high dimensionality problem. In this research work, we address the cost effective problem in SDP.

III. METHODOLOGY
In this paper, we propose an ensemble approach based on the Adaptive Neuro Fuzzy Inference System for cost effective learning in the prediction of software defects. In step 1, we apply the Synthetic Minority Oversampling Technique (SMOTE) to balance the dataset. In step 2, dimensionality reduction is performed to reduce the dataset; here, we use principal component analysis (PCA). In step 3, multiple ANFIS classifiers are constructed for the ensemble. In step 4, the votes given by the multiple ANFIS classifiers are aggregated to produce the actual output.
In our research work, we considered data from the NASA dataset repository. The dataset is neither noisy nor incomplete, but it is imbalanced. To remove the imbalance, we apply the SMOTE technique, and to overcome the high dimensionality problem, we apply PCA. Fig. 4 represents the proposed methodology for SDP.
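The idea behind SMOTE can be sketched in a few lines: each synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified illustration under that description, not the implementation used in our experiments; the neighbour count `k`, the seed, and the sample values are assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote(minority, n_synthetic, k=3, seed=0):
    """Generate synthetic minority samples by interpolating between neighbours.

    minority:    list of feature vectors of the minority (defective) class;
                 at least two samples are required.
    n_synthetic: number of synthetic samples to create.
    k:           number of nearest minority neighbours considered (assumed value).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Nearest minority neighbours of the chosen sample, excluding itself.
        others = [m for m in minority if m is not base]
        neighbours = sorted(others, key=lambda m: squared_distance(base, m))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical defective modules described by two metrics.
defective = [[0.0, 0.0], [1.0, 1.0], [1.0, 0.0]]
new_samples = smote(defective, n_synthetic=5, k=2)
```

The synthetic samples are added to the minority class until the two classes are of comparable size.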

Algorithm:
Step 1: Apply the Synthetic Minority Oversampling Technique for class balance.
Step 2: Apply dimensionality reduction using PCA.
2.1 Perform Z-score normalization on the data: Z-score = (x_i - μ) / σ, where μ and σ are the mean and standard deviation of the attribute.
2.2 Create a covariance matrix for eigen decomposition.
2.3 Select the principal components with high relevance.
Step 3: Construct a classifier using the Adaptive Neuro Fuzzy Inference System.
Step 4: Repeat Step 1 to Step 3 multiple times (preferably an odd number of times).
Step 5: Find the number of votes for each class from the multiple classifiers.
Step 6: Output the class with the highest number of votes.
www.ijacsa.thesai.org
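Step 2 of the algorithm can be sketched in plain Python as follows. This is a minimal illustration that normalizes the data, builds the covariance matrix, and extracts only the leading principal component; power iteration is used here as a simple stand-in for a full eigen decomposition, and the use of the population standard deviation, the starting vector, and the sample values are assumptions of this sketch.

```python
import math


def column_means(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def zscore_columns(rows):
    """Step 2.1: (x_i - mean) / std for every attribute (population std assumed)."""
    n, means = len(rows), column_means(rows)
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
        for j in range(len(means))
    ]
    return [
        [(r[j] - means[j]) / stds[j] if stds[j] > 0 else 0.0 for j in range(len(means))]
        for r in rows
    ]


def covariance_matrix(rows):
    """Step 2.2: sample covariance between every pair of attributes."""
    n, means = len(rows), column_means(rows)
    d = len(means)
    return [
        [sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def top_component(cov, iterations=200):
    """Step 2.3 (leading component only): power iteration on the covariance matrix."""
    d = len(cov)
    v = [1.0 / (j + 1) for j in range(d)]  # arbitrary non-zero starting vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # degenerate covariance matrix; keep the current vector
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return eigenvalue, v


def project(rows, component):
    """Coordinates of every sample along the chosen principal component."""
    return [sum(x * c for x, c in zip(r, component)) for r in rows]


# Hypothetical data: two strongly correlated software metrics.
data = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]
normalized = zscore_columns(data)
variance, direction = top_component(covariance_matrix(normalized))
reduced = project(normalized, direction)
```

Because the two hypothetical metrics are almost perfectly correlated, a single component carries nearly all of the variance, which is the situation in which PCA reduces dimensionality with little loss of information.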

IV. EXPERIMENTATION AND RESULTS
In this work, we addressed the class imbalance problem using the Synthetic Minority Oversampling Technique. After balancing the data, we applied principal component analysis for dimensionality reduction. Table II lists a few principal component values computed on the cm1 dataset, downloaded from the NASA dataset repository.
The reduced dataset is used to construct a classifier using the Adaptive Neuro Fuzzy Inference System. In this work, we applied the AdaBoost ensemble learning technique with ANFIS as the base classifier. The performance of a classifier constructed from imbalanced data can be measured using the AUC (area under the ROC curve). Receiver operating characteristic (ROC) curves are constructed by plotting the true positive rate against the false positive rate. The area under this ROC curve is used as the performance metric in our research work.
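The construction of the ROC curve and the computation of its area can be sketched as follows. The labels and scores are hypothetical and serve only to illustrate the metric; they are not results from our experiments. The sketch assumes that both classes are present in the labels.

```python
def roc_points(y_true, scores):
    """ROC curve as (false positive rate, true positive rate) points.

    y_true: 1 for defective, 0 for non-defective.
    scores: classifier outputs, higher meaning more likely defective.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with equal scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical labels and classifier scores for four modules.
labels = [1, 0, 1, 0]
outputs = [0.9, 0.8, 0.7, 0.1]
area = auc(roc_points(labels, outputs))
```

An AUC of 1.0 corresponds to a classifier that ranks every defective module above every non-defective one, while 0.5 corresponds to random ranking.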
We applied a cost sensitive approach to our classifier, in which the cost values are derived from the imbalanced nature of the data. We found that the cost sensitive approach improves the performance of the classifier. We applied our proposed model to the software defect datasets cm1, pc1, kc1, and jm1. In the Table III

V. CONCLUSION AND FUTURE WORK
In this work, we proposed a hybrid approach to the cost effective problem in software defect prediction. To reduce the number of dimensions, we applied principal component analysis. An ensemble of ANFIS classifiers was constructed for cost effective learning of software defects. We compared the performance of the proposed model with the algorithms mentioned in the literature survey using AUC values. Our proposed model achieved approximately 15% higher AUC values across all datasets. As future work, the AUC values can be improved further by jointly addressing the high dimensionality, class imbalance, and cost effective problems in SDP.