An Empirical Investigation of Predicting Fault Count, Fix Cost and Effort Using Software Metrics

Software fault prediction is important in the software engineering field. Fault prediction helps engineers manage their efforts by identifying the most complex parts of the software, where errors concentrate. Researchers usually study fault-proneness in modules because most modules have zero faults and a minority of modules contain most of the faults in a system. In this study, we present methods and models for predicting fault count, fault-fix cost, and fault-fix effort, and we compare the effectiveness of different prediction models. This research proposes using a set of procedural metrics to predict the three fault measures. Five regression models are used to predict each measure. The study reports on three data sets published by NASA. The models for each fault measure are evaluated using the Root Mean Square Error, and a comparison among fault measures is conducted using the Relative Absolute Error. The models show promising results and provide a practical guide to help software engineers allocate resources during software testing and maintenance. The fix cost models show equal or better performance than the fault count and fix effort models.

Keywords: Software metrics; fault prediction; fix cost; fix effort; regression analysis


I. INTRODUCTION
Predicting faults in modules is important to assess software quality and to direct software engineers to spend more time on the more trouble-prone modules. Software metrics are surrogates for fault measures such as fault-proneness, fault count, fault-fix cost, and effort. Software metrics measure the complexity of software and can be used to identify faulty modules using statistical and machine-learning techniques. These techniques can be used to build prediction models for fault count, fix cost, and fix effort that indicate which modules are likely to have these problems. Software systems are becoming larger and contain thousands of modules that are investigated in the testing and maintenance phases. However, the cost of testing and maintenance grows with the size of systems. This trend leads to either very costly systems or compromised quality. Software engineers can use prediction models to prioritize modules and focus testing and maintenance activities on the modules that have more faults, are more costly to fix, or demand more effort to fix. Hence, detecting and ranking faulty modules is an important engineering task for improving system quality and reducing cost.

There are usually two measures of module quality: fault count and fault-proneness. In most systems, a small number of modules have faults and the majority of modules have zero faults. Researchers model fault-proneness using a binary coding of modules (zero for no faults and one if there are faults in a module) to build prediction models that are usually easy to interpret [1][2][3][4][5][6][7]. However, the binary coding does not exploit all the information available about faults. Fault count is an indicator of quality in a module but may not provide enough information about the fix cost or effort. Therefore, regression and machine-learning models are used to identify complex modules by considering fault count, fix cost, and fix effort.

In this paper, five regression and machine-learning techniques are used to predict the three fault measures. Twenty procedural metrics are used as independent variables in the prediction models. The models are trained and tested on three data sets provided by NASA. Overall, fifteen models were built for each data set using 10-fold cross-validation. The three fault measures yield similar results, but the fix cost models are slightly better. These models can help in allocating resources for software testing and maintenance. The model outputs are used to rank the modules based on the fault measures, and the results are promising and commensurate with previous work [8][9]. The performance of the three fault measures is compared to find the best ranking. The three measures perform similarly, with some advantage for fault count and fix cost over fix effort.
The rest of the paper is organized as follows. Related work on the three fault measures is discussed in Section 2. Section 3 discusses the study design, which includes a description of the dependent variables, the independent variables, and the regression models used in this paper. The data analysis is presented in Section 4, which also evaluates the predictions of the fault measures. Validity threats to the study are discussed in Section 5. The study is concluded in Section 6.

II. RELATED WORK
Fault prediction has been explored in many previous studies under two major themes: fault-proneness and fault count. Studies on fault-proneness categorize software classes into groups. Usually, classes are divided into two groups: faulty classes, which had one or more faults in the current release, and non-faulty classes. Software metrics have shown significant relations with fault-proneness using many machine learning and statistical techniques [1][10]. Many research studies used the NASA fault data to build fault-proneness models. For example, Pai and Dugan [11] conducted a Bayesian analysis of fault count and fault-proneness. The study produced statistically significant results using linear, Poisson, and binomial logistic regression. The modeling of the results showed a 20:60 relationship when classes were ranked using module-fault order. Catal and Diri [12] used the NASA data sets to predict fault-prone modules and proposed an artificial immune system (a semi-supervised approach) that uses a recent algorithm called YATSI. Gondra [13] also used NASA's Metrics Data Program data to build prediction models of the fault-proneness of modules using two machine learning techniques: Artificial Neural Networks (ANN) and Support Vector Machines (SVM). Zheng [14] used four data sets from NASA projects to compare the effect of cost-sensitive boosting algorithms on the performance of neural networks for predicting fault-prone parts.

In other studies on fault measures, Ohlsson and Alberg [15] noted that in commercial products, the average cost of fixing an operational fault was $7000. Biyani and Santhanam [16] found a correlation between the number of faults found in development and the number of faults remaining in operation. Ostrand et al. [17] developed a negative binomial regression model to predict the number of faults in each file for many consecutive releases of a software system. Khoshgoftaar and Gao [9] used two statistical models, the Poisson regression model and the zero-inflated Poisson model, to predict fault count in two industrial case studies. The zero-inflated model showed better performance than the Poisson regression model. Other researchers focused on other fault measures such as fix cost and effort. For instance, [18] used the KC1 data to build fault-fix cost models using Neural Networks. Panjer [19] proposed building machine-learning models to predict fault-fix time. Khoshgoftaar and Gao [9] proposed module-order models to explore the relationship between %modules and %faults as a more practical model that is based on the predictions resulting from machine learning models. Khoshgoftaar et al. found that 80% of faults are found in the top 20% of files when ordered by the faults predicted by models [9]. In a recent study, Hamill and Goseva-Popstojanova [20] studied the relationship between faults and failures of 21 large-scale software components extracted from a safety-critical NASA mission. However, the study focused more on fault types.
Fault prediction models are reported frequently in previous work, as summarized in surveys on software fault prediction [21][22]. This study explores an added dimension of the relationships between software metrics and fault measures, namely fix cost and fix effort. In addition, the module-order models proposed by Briand et al. [8] and Khoshgoftaar and Gao [9] are used to prioritize modules according to models predicting fix cost and fix effort.

III. STUDY DESIGN
Fault data are becoming more available in many repositories such as PROMISE [23], Eclipse Bug Data [24], and NASA fault data [25]. The NASA data provide more details on the cost and effort of fixing software faults, which are the focus of this research. Three data sets, KC1, KC3, and KC4, report the cost of fault fixes in terms of person hours and the effort measured in Source Lines of Code (SLOC) modified to accomplish the fix. Table 1 shows a summary of the three data sets. All these projects were built in similar software development environments and analyzed by the same set of software product metrics. These data sets are publicly available, so other researchers can repeat and verify this study's results.
The Metrics Data Program (MDP) is funded by NASA's Software Independent Verification & Validation (IV&V) facility. These systems met the requirements to support NASA missions [26].

A. Research Questions
Given the information available on fault count, fix cost, and fix effort, this research aims to answer the following research questions.

RQ1: Can software metrics predict fault count?
Fault count is defined as the number of faults fixed in a module. This question has already been answered in previous research, as explained in more detail in the related work section. However, this study adds an evaluation of fault prediction using other machine learning techniques. Fault prediction is important to assess the complexity of software modules. Five prediction models are built to answer this question. The results of the prediction models are used to rank the modules by sorting them according to the predicted fault count. The models can be used to allocate resources efficiently, for instance by identifying the 20% of modules that have the most faults.

RQ2: Can software metrics predict fix cost as measured in man-hours?
Fix cost is defined as the total number of hours the developers spent to fix all faults in a module. For each module, the costs of fault fixes are aggregated. The fix cost in hours is an indicator of the complexity of code. A positive relationship is expected between the studied metrics and fix cost, i.e., more complex modules cost more to fix than less complex modules. To put the cost prediction models to practical use, the results of the prediction models are used to sort the modules by the predicted fix cost. The models can be used to allocate resources efficiently, for instance by identifying the 20% of modules that have the highest fix cost.

RQ3: Can software metrics predict fix effort as measured in SLOC modified?
Fix effort is defined as the actual number of SLOC added or modified to fix all faults in a module. In this study, the aggregation of all modified SLOC for a particular module is used to investigate the relationship between the fix effort and the complexity of modules. To put the effort prediction to practical use, the results of the prediction models are used to sort the modules by the predicted fix effort. The models can be used to allocate resources efficiently, for instance by identifying the 20% of modules that need the most fix effort.
The results of the three quality predictions are compared using the relative absolute error to find which models are better.

B. Dependent variables
NASA MDP has many projects, but only three of these projects have details on fault fixes, cost, and effort. For each module, the number of faults (fault content), the total fix hours, and the total SLOC changed or added are aggregated. Table 2 provides a summary of the fault measures used in this study. The scales of fix cost and fix effort are larger than the scale of fault count. Scale affects the performance measures used in evaluating the prediction models, so the comparison should be based on unbiased performance measures. Therefore, the relative absolute error is used to evaluate models in addition to the root mean square error.

C. Independent variables - software metrics
The software metrics under investigation are procedural metrics for three systems collected by NASA MDP. The metrics collection was applied to the lowest-level functional unit, the procedure. The data were stored in a structured format. For example, a file named KC1_static_defect_data.csv keeps all information related to faults, including severity, priority, fix hours, and the actual number of SLOC changed or added. Another file includes all the static metrics for each module, identified by a unique variable, MODULE_ID. These files are then combined into one file using MODULE_ID, which identifies module records in all files.
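The combination step described above can be sketched as follows. This is only an illustration: apart from MODULE_ID, the column names (FIX_HOURS, SLOC_CHANGED, LOC_TOTAL, FAULT_COUNT) are hypothetical, and small in-memory rows stand in for the MDP files.

```python
from collections import defaultdict

def aggregate_faults(defect_rows):
    """Sum fault count, fix hours and changed SLOC per MODULE_ID."""
    totals = defaultdict(lambda: {"FAULT_COUNT": 0, "FIX_HOURS": 0.0, "SLOC_CHANGED": 0})
    for row in defect_rows:
        entry = totals[row["MODULE_ID"]]
        entry["FAULT_COUNT"] += 1                       # one row per fixed fault
        entry["FIX_HOURS"] += float(row["FIX_HOURS"])
        entry["SLOC_CHANGED"] += int(row["SLOC_CHANGED"])
    return totals

def merge_metrics(metric_rows, totals):
    """Attach the aggregated fault measures to every module's metric record."""
    no_faults = {"FAULT_COUNT": 0, "FIX_HOURS": 0.0, "SLOC_CHANGED": 0}
    merged = []
    for row in metric_rows:
        record = dict(row)
        # Modules without any defect record keep zero for all three measures.
        record.update(totals.get(row["MODULE_ID"], no_faults))
        merged.append(record)
    return merged

# Toy rows (invented values), shaped like csv.DictReader output.
defects = [
    {"MODULE_ID": "10", "FIX_HOURS": "2.5", "SLOC_CHANGED": "12"},
    {"MODULE_ID": "10", "FIX_HOURS": "1.5", "SLOC_CHANGED": "3"},
]
metrics = [
    {"MODULE_ID": "10", "LOC_TOTAL": "40"},
    {"MODULE_ID": "11", "LOC_TOTAL": "8"},
]
merged = merge_metrics(metrics, aggregate_faults(defects))
```

Keeping modules that have no defect records, with zeros for all three measures, matches the observation that most modules have zero faults.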
The NASA MDP data need preprocessing, as reported in [27]. Therefore, we use only the metrics reported in [27], which comprise the 21 metrics listed in Table 3. The LOC_BLANK metric is deleted because its interpretation is not clear, which leaves twenty metrics. These metrics were originally proposed in [28].

D. Regression Models
We propose to use a set of data mining techniques to predict the value of a numerical variable (e.g., fix cost) by building a model based on many software metrics.This research uses the following regression techniques to predict fault count, fix cost and fix effort.
Regression Decision Trees (M5P): A decision tree is used to build regression models in the form of a tree structure using the M5 algorithm [30]. Unlike a classification tree, which selects splits by Information Gain, the algorithm selects splits by Standard Deviation Reduction. A data set is recursively partitioned into smaller subsets as long as the standard deviation of a subset is larger than zero.
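The split criterion can be illustrated with a short sketch. It is not the M5P implementation used in the study: it evaluates binary splits on a single metric, uses the population standard deviation (an assumption), and runs on invented toy values.

```python
from statistics import pstdev

def sd_reduction(targets, left, right):
    """Standard Deviation Reduction obtained by splitting `targets` into two subsets."""
    n = len(targets)
    weighted = sum(len(part) / n * pstdev(part) for part in (left, right) if part)
    return pstdev(targets) - weighted

def best_split(values, targets):
    """Return (threshold, SDR) of the best binary split `value <= threshold` on one metric."""
    best = (None, 0.0)
    # The largest value is skipped because it would leave the right subset empty.
    for threshold in sorted(set(values))[:-1]:
        left = [t for v, t in zip(values, targets) if v <= threshold]
        right = [t for v, t in zip(values, targets) if v > threshold]
        sdr = sd_reduction(targets, left, right)
        if sdr > best[1]:
            best = (threshold, sdr)
    return best

# Toy data: one metric value and one fault count per module (invented values).
metric_values = [1, 2, 10, 12]
fault_counts = [0, 0, 5, 7]
threshold, sdr = best_split(metric_values, fault_counts)
```

On this toy data the chosen threshold separates the two fault-free modules from the two faulty ones, because that split removes most of the spread in the target.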

Multiple Linear Regression (MLR): Multiple linear regression is a well-known statistical technique used to model the linear relationship between a dependent variable and many independent variables. MLR is fit by ordinary least squares (OLS), i.e., the coefficients are chosen such that the sum of squared differences between actual and predicted values is minimized.
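As a rough illustration of the OLS fit, the sketch below solves the normal equations with plain Gauss-Jordan elimination. The function names and the toy data are assumptions made for the example, and the code assumes the metric columns are not collinear; a production tool would use a numerically more robust solver.

```python
def fit_ols(rows, targets):
    """Fit y = b0 + b1*x1 + ... + bm*xm by solving the normal equations (X'X) b = X'y."""
    X = [[1.0] + [float(v) for v in row] for row in rows]   # prepend intercept column
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * y for x, y in zip(X, targets))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

def predict(coefs, row):
    """Predicted value for one module given its metric values."""
    return coefs[0] + sum(c * float(v) for c, v in zip(coefs[1:], row))

# Toy data generated from y = 1 + 2*x1 + 3*x2 (invented values).
rows = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
targets = [9, 8, 19, 18, 26]
coefs = fit_ols(rows, targets)
```

Because the toy targets are an exact linear function of the two metrics, the fitted coefficients recover the intercept and both slopes up to rounding error.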
k Nearest Neighbors (kNN): The kNN algorithm is an instance-based method that does not build a model from training data; rather, it keeps the training instances for use when analyzing future instances. For a new unknown instance, the kNN algorithm searches the training instances to find the closest ones. The search starts by computing the Euclidean distance to all training instances. The kNN algorithm then predicts the average of the target values of the k closest instances in the training set [31].

The number of operators contained in a module
NUM_OPERANDS (N2): The number of operands contained in a module
NUM_UNIQUE_OPERATORS (µ1): The number of unique operators contained in a module
NUM_UNIQUE_OPERANDS (µ2): The number of unique operands contained in a module

Multi-layer Perceptron - Backpropagation algorithm: The multi-layer perceptron (MLPRegressor) is loosely modeled on the organization of neurons in the brain. Artificial neurons are arranged in layers (i.e., an input layer, hidden layers, and an output layer). Connections between the neurons give the network the ability to learn patterns. In an MLP, each neuron in a hidden layer computes a weighted combination of the outputs of the neurons in the previous layer. The outputs of the final hidden layer are combined to produce an output, which is compared to the correct output, and the difference between the two values (the error) is fed back to update the network weights [13].
Support Vector Machine (SMOreg): SMOreg implements the support vector machine for regression. The regression version is more involved than the classification version. However, both aim to minimize error by finding the hyperplane that maximizes the margin while tolerating some error [32].
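Of the techniques listed in this subsection, kNN is the simplest to illustrate. The sketch below assumes k = 3 and unscaled metric values, neither of which is stated in the text, and uses invented toy data.

```python
import math

def knn_predict(train_x, train_y, query, k=3):
    """Predict a numeric target as the mean target of the k nearest training instances."""
    # Euclidean distance from the query to every stored training instance.
    distances = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))
    nearest = distances[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Toy data: two metric values per module and a fault count (invented values).
train_x = [(1, 1), (2, 2), (3, 3), (10, 10)]
train_y = [0, 1, 2, 9]
prediction = knn_predict(train_x, train_y, (2, 2), k=3)
```

The distant module at (10, 10) does not influence the prediction because it is not among the three nearest neighbors of the query.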

E. Regression performance evaluation
The models are trained and tested using 10-fold cross-validation, in which the data are partitioned into ten samples of equal size. Nine partitions are used for training while the remaining partition is used for testing. This process is repeated ten times so that every partition is used once for testing. The performance of regression models is usually evaluated using the Root Mean Squared Error (RMSE), which measures the difference between predicted and actual values, as defined in Eq. (1).

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(p_i - a_i\right)^2} \qquad (1)$$

In this research the dependent variables have different units. To compare models across different units, the Relative Absolute Error (RAE) is used, as defined in Eq. (2) [32].

$$\mathrm{RAE} = \frac{\sum_{i=1}^{n}\left|p_i - a_i\right|}{\sum_{i=1}^{n}\left|a_i - \bar{a}\right|} \qquad (2)$$

In both measures, $a_i$ is the actual value and $p_i$ is the predicted value for instance $i$, $n$ is the number of instances, and $\bar{a}$ is the mean of the actual values.
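The evaluation procedure can be sketched as follows. The fold assignment (interleaved, without shuffling) and the pooling of predictions from all folds before computing the two measures are assumptions, since the text does not specify them; `fit` and `predict` are placeholder callables for any of the regression techniques.

```python
import math

def rmse(actual, predicted):
    """Root Mean Squared Error over paired actual and predicted values."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / n)

def rae(actual, predicted):
    """Relative Absolute Error: total absolute error relative to always predicting the mean."""
    mean_a = sum(actual) / len(actual)
    numerator = sum(abs(p - a) for a, p in zip(actual, predicted))
    denominator = sum(abs(a - mean_a) for a in actual)
    return numerator / denominator

def k_fold_indices(n, k=10):
    """Yield (train, test) index lists; every index is used for testing exactly once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

def cross_validate(xs, ys, fit, predict, k=10):
    """Pool the held-out predictions of all folds, then compute RMSE and RAE."""
    actual, predicted = [], []
    for train, test in k_fold_indices(len(xs), k):
        model = fit([xs[j] for j in train], [ys[j] for j in train])
        for j in test:
            actual.append(ys[j])
            predicted.append(predict(model, xs[j]))
    return rmse(actual, predicted), rae(actual, predicted)
```

Because RAE divides by the error of a mean-only predictor, it is unit-free: a value of 1 means the model is no better than predicting the mean, which is what makes it usable for comparing fault count, fix cost, and fix effort models.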

IV. DATA ANALYSIS
In the following, the prediction performance for each fault measure is evaluated using RMSE, and the fault measures are then compared using RAE.

A. Evaluation of fault count prediction
Five prediction models are built for fault count using the twenty metrics under investigation. The performance of fault count prediction is summarized in Table 4. The results of the five models differ little from each other within any data set. The LR models look better in two data sets, and the kNN models are also better in two data sets, as marked in bold. However, the differences in performance among the models are not large enough to rank the machine learning techniques. MLP can be considered the worst in performance among all. To put the models into practice, their results are depicted using Alberg diagrams as proposed in [15]. In Figure 1, modules are sorted in decreasing order by predicted fault count. The plot shows the cumulative percentage of modules (x-axis) against the cumulative percentage of actual faults (y-axis) after sorting the instances. Figure 1 shows the results of fault count prediction in KC1. These results are taken from running the models in 10-fold cross-validation. The figure can be read as follows: at X=20 the value of the curve is 60, which means that the 20% of modules (369 modules) with the highest predicted fault count contain 60% of the faults. Similarly, the top 30% of modules contain 70% of the actual faults. This behavior is similar in all models.
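The construction of such a curve can be sketched as follows. The function names and the ten-module toy data are invented for illustration and do not reproduce the KC1 results.

```python
def alberg_curve(predicted, actual):
    """Cumulative % of actual faults against % of modules, sorted by predicted value (descending)."""
    order = sorted(range(len(predicted)), key=lambda i: predicted[i], reverse=True)
    total = sum(actual)
    curve, running = [], 0
    for rank, i in enumerate(order, start=1):
        running += actual[i]
        curve.append((100.0 * rank / len(order), 100.0 * running / total))
    return curve

def share_at(curve, percent_modules):
    """% of actual faults found in the top `percent_modules` % of ranked modules."""
    captured = 0.0
    for x, y in curve:
        if x <= percent_modules:
            captured = y
    return captured

# Toy example with ten modules (invented values).
predicted = [5, 0.2, 3, 0.1, 1, 0, 0.3, 0, 0.4, 0.1]
actual = [4, 0, 2, 0, 1, 0, 1, 0, 2, 0]
curve = alberg_curve(predicted, actual)
top20 = share_at(curve, 20)
```

In this toy example the two modules with the highest predictions hold 6 of the 10 actual faults, so the curve passes through the point (20, 60). The same functions apply to fix cost or fix effort by passing person hours or modified SLOC as `actual`.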
We also plot the same graph for the KC3 and KC4 prediction models in Figures 2 and 3. In Figure 2, we observe similar results for the top 20% of modules, i.e., about 60% of faults are found in the top 20% of modules in all prediction models. In Figure 3, we observe similar results for the KC4 data in the kNN model. The other models show a 20:50 relationship, i.e., 50% of faults are found in the top 20% of modules. We can conclude that software metrics can be used to predict fault count and that the models can be used in practice to rank modules based on predicted fault count. Therefore, RQ1 is answered in this research. When planning for quality inspection during the software development process, we can make a trade-off between the resources spent on inspection and the effectiveness of inspections [8]. The prediction models can be used to put the modules in a priority list for further investigation such as testing and maintenance. We can use the graph in Figure 1 to determine the percentage of faults expected to be found by inspecting a certain percentage of the system modules. For example, the top 20% of modules can be investigated first if the allocated resources suffice only for investigating that number of modules.
The graphs in Figures 1-3 show behavior similar to the results in [33][8][11]. For instance, Briand et al. [8] found that the first 20% of classes have 52% of the faults in the system. They also suggested that such curves can be used in practice if they appear to be constant across projects. Software managers can use fault prediction models to allocate more resources to the parts of the code that were predicted to be more fault-prone [5][34].

B. Evaluation of fix-cost prediction
We repeated the same experiment to predict fix cost using all metrics, and the results are shown in Table 5. We notice no significant differences among the models except for MLP, which is again the worst modeling technique. M5P regression trees can be considered the best among all models, while the others have almost equal performance. The fix cost predictions can be used in practice to order modules. We plot the percentage of modules (x-axis) against the percentage of actual cost after sorting the instances in decreasing order by predicted fix cost. Figure 4 shows the results of the five prediction models for fix cost prediction in KC1. At X=20 the value of the curve is 60% in three models, whereas in two models (LR and MLP) it is about 50%. This result means that the 20% of modules (369 modules) with the highest predicted fix cost incurred 60% of the person hours spent on fixing faults. The top 30% of modules ordered by the prediction models account for 60-70% of the actual fix cost.
We plot the module-cost graphs for KC3 and KC4 in Figures 5 and 6. The graphs show a 10:60 relationship, i.e., 10% of modules incurred 60% of the fix cost in both KC3 and KC4, in four models; the exception is the MLP prediction model, which shows a 20:40 relationship. These results are better than those of the models in KC1. In addition, the use of fix cost seems more efficient than the use of fault count in two data sets, KC3 and KC4, where the fault count models show a 20:60 relationship. Therefore, RQ2 is answered in this research as well. Fix cost can be predicted using software metrics, and the models can be used in practice to rank modules based on predicted fix cost. Fix cost can be used to allocate resources in software testing and maintenance activities.

C. Evaluation of fix effort prediction
The SLOC modified in a module is also studied as a fault measure, and the results are presented in Table 6. The results are not conclusive in identifying the best model. The MLP models are again the weakest in performance among all. We plot the percentage of modules (x-axis) against the percentage of actual SLOC modified to fix faults in each module after sorting modules in decreasing order by predicted effort, as shown in Figures 7, 8, and 9. The five prediction models do not show consistent results across data sets. Almost all models show a 20:60 relationship in KC1, but the results differ among models in KC3 and KC4. The results of the models on KC4 are nevertheless similar to those in KC1, while the models obtained from KC3 do not show promising results. These results show that RQ3 is answered. Fix effort as measured in SLOC can be used in practice to order the modules. However, fault count and fix cost in person hours can be more beneficial to software managers.

Fig. 5. Alberg diagram for five prediction models of KC3
Fig. 6. Alberg diagram for five prediction models of KC4

D. Comparison of models performance
The RMSE results cannot be used to compare results across the three fault measures because of the differences in measurement units. Therefore, we use another measure, the Relative Absolute Error (RAE), to compare the fault measures. The model performance in RAE is reported in Table 7, from which we make the following observations. In KC1, the fault count models are the best in all models except one. In KC3, the fix cost models are the best except for two models. In KC4, the fault count models are again the best. Therefore, for the systems under investigation, prediction models based on fault count are slightly better in performance than the other studied models. However, we do not observe large differences among the three fault measures under investigation. These results help software engineers to consider other quality factors related to the fault discovery and fix processes. The regression models for fix cost and fix effort can be used similarly to fault count models. The three quality factors can be used in practice to allocate resources, but it is important to know which models are consistent and always useful. The results of the practical application of the models for the three factors when 20% of modules are selected for further investigation are summarized in Table 8. The results show that both fault count and fix cost are more consistent than fix effort. In some cases, the cost models show better results. Furthermore, the use of cost in allocating resources provides more insight into the person hours spent to fix faults and can be considered a stronger indicator of where difficulties in code may appear.

V. VALIDITY THREATS
In the following, we address three kinds of possible threats that may affect the conducted research.
Construct validity threats: Construct validity refers to the degree to which the dependent and independent variables in this research measure the intended targets. Fix cost as measured in person hours is estimated by the developers, and there is no detailed information about how developers estimate the fix cost. However, the data come from a well-reputed organization, NASA, whose work is focused on quality of data and quality of work. The metrics in this study are well-studied metrics and are recommended by many researchers to measure modules at the procedural level.
Internal validity threats: Internal validity is the degree to which conclusions can be drawn from the proposed data sets. This study depends on data from another organization, and there is not enough information available about the development process followed in developing the three applications under study. However, the studied systems were considered in many other research papers and are recommended for use by NASA.

External validity threats: External validity is concerned with the degree to which the results can be generalized to other research settings. The results of this study are based on only three data sets published by NASA MDP. More data sets are needed to generalize the results of this study to other systems. In addition, the systems are measured at the procedural level, and the conclusions may not be applicable to other paradigms such as the object-oriented paradigm.

VI. CONCLUSIONS AND FUTURE WORK
Fault prediction models are surrogates for software quality. The assessment of faults in modules can be used to direct the efforts of software engineers in assuring software quality. Five well-known regression models were used to predict fault count, fix cost, and fix effort. The results of the regression models for three data sets were reported. The results were not conclusive in identifying the best model in each data set, and all regression models had similar performance. The prediction of fault count had better performance in most models in the KC1 and KC4 data sets. We found that the prediction of fix cost is the best in KC3 only. Engineers may not have enough time to explore the quality of all modules in large software systems. It is vital to show the value of using these models in cost-effective quality assurance, e.g., prioritizing modules for further investigation. We have modeled the results of the prediction models by plotting the relationship between %modules and %faults after sorting the modules by predicted faults.

TABLE I
[26]. The McCabe and Halstead measures are module-based, where a module is the smallest unit of functionality. McCabe argued that code with complicated pathways is more error prone. Halstead considered code readability an indicator of fault-proneness. Halstead metrics measure software complexity by counting the number of concepts in a module [26].

TABLE IV. FAULT COUNT REGRESSION MODELS

TABLE VI. FIX EFFORT REGRESSION MODELS