A New Learning to Rank Approach for Software Defect Prediction

Abstract—Software defect prediction is one of the most active research fields in software development. The outcome of defect prediction models is a list of the most likely defect-prone modules, which require considerable effort from quality assurance teams. It can also help project managers effectively allocate limited resources to validating software products and invest more effort in defect-prone modules. As the size of software projects grows, defect prediction models can play an important role in assisting developers and shortening the time it takes to create more reliable software products by ranking software modules based on their defects. There is therefore a need for a learning-to-rank approach that can prioritize and rank defective modules to reduce testing effort, cost, and time. In this paper, a new learning to rank approach was developed to help the QA team rank the most defect-prone modules using different regression models. The proposed approach was evaluated on a set of standardized datasets using well-known evaluation measures such as Fault-Percentile-Average (FPA), Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and the Cumulative Lift Chart (CLC). Our proposed approach was also compared with other regression models used for software defect prediction, such as Random Forest (RF), Logistic Regression (LR), Support Vector Regression (SVR), Zero Inflated Regression (ZIR), Zero Inflated Poisson (ZIP), and Negative Polynomial Regression (NPR). Based on the results, the evaluation measures differed from each other: there was a gap in the accuracy obtained for defect prediction due to the random nature of the data, with accuracy being higher for RF and SVR, and FPA achieved better results than MAE and RMSE in this research paper.


I. INTRODUCTION
Predicting the exact and precise number of defects is the best and most accurate option for software engineers, but because this task is difficult to achieve in real scenarios, it is not enough to rely on classifying modules as defective or not. There is therefore a need for another solution that can improve defect prediction performance and increase quality assurance teams' confidence in defect prediction [1]. This solution can be achieved by using a learning to rank approach that supports defect prediction models in ranking and prioritizing modules based on certain factors [1].
The importance of SDP models for predicting software defects has been discussed using the LTR approach to rank program modules according to their number of defects. The new model is intended to improve the performance of existing defect prediction models, which predict the number of defects in software modules using machine learning regression models. This paper therefore proposes a new learning to rank approach that supports defect prediction models for ranking and prioritization.
Most of the research in the past decade has focused on proposing new indicators for constructing predictive models [1]. The most studied indicators are the source code and process metrics [2]. Source code metrics measure the complexity of the source code [2]. Process metrics are derived from software documents, such as version control systems and issue trackers, which regulate the entire development process. Process metrics quantify many aspects of the software development process, such as source code changes, ownership of source code files, and developer interaction. The process metrics used to predict errors have been validated in many studies [2].
Defect prediction research is generally based on machine learning [3]. Predictive models built using machine learning approaches can predict either the probability of errors in the source code (classification) or the number of errors in the source code (regression). Some studies have proposed recent machine learning techniques, such as active learning, to improve prediction. Researchers have also focused on the granularity of predictions: defect prediction models attempt to identify faults at the system, component, package, or file/class level, and recent research shows that errors can also be identified at the module or method level. Better accuracy can help developers by limiting the scope of the source code reviews needed to ensure quality. Proposing preprocessing methods for predictive models is another important direction in defect prediction research; before building a model, feature selection, normalization, and noise reduction can be applied [3]. Such preprocessing methods have been shown to improve predictive performance in related studies [3]. Researchers have also proposed methods for predicting defects across software projects [3]. The majority of the representative studies above were performed and verified within a within-project framework, in which the predictive model is developed and tested on the same project [4]. However, building a predictive model is difficult for new projects that lack development history information. Typical methods for cross-project defect prediction include metric compensation [4], nearest neighbor (NN) filters, transfer naive Bayes (TNB), and TCA+ (a state-of-the-art transfer learning approach), which adjust the predictive model by selecting similar instances, transforming data values, or developing new models [4].
Another important topic for cross-project defect prediction is studying the feasibility of cross-prediction. Several studies have confirmed that cross-prediction is difficult to achieve; only a few cross-prediction combinations are effective [5]. Determining cross-prediction capability will play a major role in predicting errors between projects. There are studies on the feasibility of cross-prediction based on decision trees [5]; however, these decision trees have only been tested on certain software datasets and have not been studied more broadly.
The purpose of SDP as a classification task is to predict which modules are likely to contain the most defects, in order to allocate software quality improvement efforts accordingly. Extracting the exact number of defects, by contrast, requires many conditions and accurate data, which becomes difficult when the data is very large. The Learning to Rank (LTR) method provides a linear model by directly improving ranking performance, and it has been verified that directly optimizing the ranking performance metrics of SDP models is useful for this task [6].
In this paper, a new learning to rank approach was developed to help the QA team rank the most defect-prone modules. The proposed approach was evaluated on a set of benchmark datasets using known measures such as fault-percentile-average (FPA), cumulative lift chart (CLC), mean absolute error (MAE), and root mean square error (RMSE). Our proposed approach is compared with the current learning to rank approaches used in defect prediction.
The paper is organized as follows: Section II presents work related to software defect prediction as well as learning to rank methods. Section III presents the proposed model, including the datasets and the evaluation metrics used. Section IV presents the implementation, and finally Section V presents the findings and a discussion of them before the paper's conclusion.

II. RELATED WORK

A. Software Defect Prediction
There are many studies that address the issue of predicting software defects. Among them, X. Huo and M. Li in [4] proposed a new perspective for software defect prediction that explicitly articulates the "pair-wise" relationship between defective and clean modules to better prioritize failure-prone modules, using benchmark datasets to ensure software reliability. X. Jing et al. in [8] attempted to systematically summarize the typical work on predicting software failures in recent years; their paper helps software researchers and professionals better understand previous failure prediction studies from the perspectives of datasets, software metrics, scores, and modeling techniques in a simple and effective way. A. Okutan and O. Yıldız in [9] used Bayesian networks to study the relationship between software metrics and error proneness. They used nine datasets from the PROMISE data repository and showed that RFC, LOC, and LOCQ are the most effective indicators of error proneness, while the effect of NOC and DIT on defects is limited and unreliable. Y. Ma et al. [10] looked at a cross-company defect prediction scenario in which the source and target data come from different companies. They presented a novel technique called Transfer Naive Bayes (TNB), which uses the information of all the proper features in the training data to select training data similar to the test data. J. Zheng in [11] studied three cost-sensitive boosting approaches for training neural networks to predict software failures using four datasets from the NASA project. Experimental results show that, of the three approaches studied, threshold shifting is the best choice for cost-effective prediction of software failures using neural network models, especially for project datasets developed in object-oriented languages. X. Jing et al. in [12] used dictionary learning methods to predict software errors.
They used open-source software measurement data to learn various dictionaries (including dictionaries for defect-free modules, defective modules, and general sub-dictionaries) and sparse representation coefficients. Datasets from the NASA project were used as a benchmark for evaluating the performance of all compared methods. Experimental results show that CDDL is superior to several typical existing error prediction methods. G. Czibula et al. in [13] proposed a classification model based on mining relational association rules; the discovered rules can be used to predict whether a software module is defective or not. An experimental evaluation of the proposed model was conducted on the open-source NASA datasets. The results reveal that the classifier outperforms existing machine learning-based defect prediction approaches for the majority of the assessment measures studied. I. Laradji et al. in [14] introduced a two-variant (with and without feature selection) ensemble learning technique that is robust to both data imbalance and feature redundancy. Poor features do not affect ensemble learners such as random forests and the proposed technique, average probability ensemble (APE), as much as they affect weighted support vector machines (W-SVMs). Furthermore, for the NASA datasets PC2, PC4, and MC1, the APE model paired with greedy forward selection (enhanced APE) attained AUC values of roughly 1.0. S. Liu et al. [15] employed the FECAR feature selection framework, combining Feature Clustering and Feature Ranking, to forecast software defects. Using the FF-Correlation metric, this framework divides the original features into k clusters; then, using the FC-Relevance measure, it selects relevant features from each cluster. The data come from real-world projects such as Eclipse and NASA. P. Krause and N. Fenton in [16] focused on a model developed for the Philips Software Center (PSC) using the expertise of the Philips Research Laboratory, specifically designed to predict the number of errors in various testing and operational phases. Comprehensive data (completed questionnaires, additional project data, and additional error data) could be obtained for seven of the 28 projects. The study was not as successful as expected, and the authors confirmed that more investigations will be conducted as the research continues.
Li, M. Shepperd, and Y. Guo in [17] investigated the use and performance of unsupervised learning techniques in software defect prediction by conducting a systematic literature review that identified 49 studies, with 2456 individual experimental results, that met the inclusion criteria and were published between January 2000 and March 2018. In this study, unsupervised classifiers did not appear to perform worse than supervised classifiers.
T. M. Khoshgoftaar and colleagues in [18] proposed a methodology that incorporates a feature selection approach for picking relevant attributes and a data sampling approach for resolving class imbalance. They used nine software measurement datasets from the PROMISE software project repository. Experimental results show that feature selection based on sampled data performs significantly better than feature selection based on raw data, and that the fault prediction model achieves the same effect whether the training data is sampled or raw.
L. Son et al. in [19] conducted a systematic mapping in which they addressed nine research questions corresponding to the distinct stages of developing a defect prediction (DeP) model. They explored every issue related to the method, from collecting and preprocessing data, to the strategies used to build DeP models, the metrics used to evaluate model performance, and the statistical evaluation plans used to mathematically validate the results of the DeP model. Out of 156 studies, they selected 98 that addressed the nine research questions formed for this systematic mapping. M. Sohan et al. in [20] used extensive project data to prepare balanced and unbalanced datasets for building software defect prediction models. Experimental results show no significant differences between balanced and unbalanced learning models; for a balanced learning model with an unbalanced test dataset, only the AUC (area under the curve) value increases markedly. X. Cai et al. in [21] proposed a hybrid multi-objective cuckoo search with dynamic local search (HMOCS) to simultaneously handle the class imbalance problem in the dataset and the selection of SVM (support vector machine) parameters, both of which are critical to software defect prediction. Eight datasets were selected from the PROMISE repository to verify the proposed model for predicting software failures. Compared with the results of eight prediction models, this method effectively addresses the software defect prediction problem. W. Li et al. in [22] proposed a two-stage classification method based on three-way decision-making to predict cost-sensitive software failures using NASA data. In the same direction, Abu-Alhija et al. [23] studied the impact of kernels on the performance of SVM-based defect prediction; they found that the RBF kernel is superior to the other kernels.

B. Learning to Rank Approaches in Software Defect Prediction

X. Yang et al. in [1] applied the LTR methodology to a wide range of real-world datasets and provided a full evaluation and comparison of SDP for the ranking task, covering ten construction approaches compared against one another on eleven real-world datasets. The relationship between CLC and FPA was also explored, as well as the need for metric selection over two sets of data for SDP for the ranking task. For the ranking task, the LTR technique for building SDP models yielded good accuracy and interpretability. Xiaoxing Yang et al. in [2] also used learning-to-rank approaches to predict software defects. They presented experimental results comparing their approach to three other approaches from the literature on five publicly available datasets. They employed an evolutionary optimization method to directly optimize the model performance measure, fault-percentile-average, which differs from conventional loss functions. For most datasets, the proposed learning-to-rank approach outperformed linear regression and logistic regression in terms of fault-percentile-average. Z. Cao et al. in [3] employed a learning to rank based approach to address the lack of legacy specifications, quantifying the possibility of a candidate rule becoming a specification using 38 interestingness measures. The benchmark dataset contains 28 classes from the Java 6 SDK that have been manually identified as having specification rules; these rules were derived from the completion of 14 projects. Experimental results using classes from the Java 6 SDK show that the learning to rank based technique can enhance the best ranking performance using a single measure by up to 66 percent. X. Yu et al. in [5] investigated the effect of 23 learning to rank approaches for effort-aware defect prediction (EADP) using 41 releases of 11 open-source software projects taken from the PROMISE data repository. When the 23 approaches are trained on the original features, the experimental findings demonstrate that BRR performs best in terms of FPA, while BRR and LTR perform best in terms of Norm(Popt) when trained on the selected feature subset.
M. Buchari et al. in [6] used two public benchmark datasets to implement and assess Chaotic Gaussian Particle Swarm Optimization for training model parameters in the learning-to-rank software defect prediction methodology. They conclude that using Chaotic Gaussian Particle Swarm Optimization in a learning-to-rank strategy can increase defect module ranking accuracy on datasets with high-dimensional characteristics. Y. Ma et al. in [7] used a top-k learning to rank (LTR) approach in the cross-project defect prediction (CPDP) scenario. Results on the PROMISE datasets show that SMOTE-PENN outperforms six other competitive resampling approaches and that RankNet performs best for the proposed approach.

A. Dataset
In this paper, a benchmark collection of datasets covering several project versions was used; the datasets were collected from GitHub repositories and from previous research. The datasets applied to the developed regression models total 28, with different features, as shown in Table I below. The methodology applied to each dataset was to read the required data, ensure that the data contained no null values, and then divide the data into x (features) and y (total defects). Finally, a feature scaling technique was applied to bring the features to the same standard scale, as mentioned in the implementation section.
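The preparation steps above (read the data, verify there are no null values, split into x and y, scale the features) can be sketched as follows. This is an illustrative sketch, not the paper's code: the CSV layout and the "bug" column name for the defect count are assumptions.

```python
# Sketch of the dataset preparation steps: load, drop nulls, split into
# features (x) and defect counts (y), then scale the features.
# The "bug" column name is an assumption for illustration.
import pandas as pd
from sklearn.preprocessing import StandardScaler


def prepare_dataset(path, target_col="bug"):
    df = pd.read_csv(path)
    # Ensure the data contains no null values before modeling.
    df = df.dropna()
    # Split into features (x) and the total defect count (y).
    X = df.drop(columns=[target_col])
    y = df[target_col]
    # Feature scaling so all metrics share the same standard scale.
    X_scaled = StandardScaler().fit_transform(X)
    return X_scaled, y.to_numpy()
```

Each of the 28 datasets can then be passed through the same function so that all of them are handled uniformly, as the paper requires.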

B. Developing Regression Model
In this paper, a model was proposed to predict the defects of software modules and then rank the most defect-prone modules using six regression models: Random Forest, Logistic Regression, Support Vector Regression, Negative Binomial Regression, Zero Inflated Regression, and Zero Inflated Poisson. After preparing the dataset, we applied the data to our models. They fall into two categories: variations of the Poisson regression model and regression trees. Support vector regression maps the data into a higher-dimensional space in which the data points become separable by a hyperplane. Logistic regression is a data analysis technique used to define and explain the connection between one dependent binary variable and one or more nominal, ordinal, interval, or ratio-level independent variables.
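All six models fit the same fit/predict/rank pattern: train a regressor on defect counts and sort modules by predicted counts. The sketch below shows two of them (Random Forest and SVR) with scikit-learn; the `rank_modules` helper and model settings are illustrative, not taken from the paper, and the count-based models (negative binomial, zero-inflated variants) are typically fitted with statsmodels following the same pattern.

```python
# Illustrative sketch: fit a defect-count regressor and rank the test
# modules by predicted defect count, most defect-prone first.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR


def rank_modules(model, X_train, y_train, X_test):
    """Fit the regressor and return test-module indices,
    predicted most defect-prone first."""
    model.fit(X_train, y_train)
    scores = model.predict(X_test)
    return np.argsort(scores)[::-1]


# Two of the six models; the remaining ones follow the same pattern.
models = {
    "RF": RandomForestRegressor(n_estimators=100, random_state=0),
    "SVR": SVR(kernel="rbf"),
}
```

The returned ordering is exactly the ranked list a QA team would inspect from the top down.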

C. Evaluation Measures
In this paper, different measures were used to evaluate the accuracy of our models; the goal of the evaluation is to make it easier to determine which of the evaluated models is good. The percentage of defects found in the top-ranked modules is commonly applied to evaluate SDP models for the ranking task. The following are the evaluation measures we used:
• Fault-Percentile-Average: FPA is an evaluation measure that reflects the effectiveness of different prediction models across all cut-off values, as shown in equation 1. FPA is the average of the proportions of actual defects in the top m (m = 1, 2, ..., k) modules to the whole set of defects, which is a more comprehensive performance measure than the percentage of defects in the top 20% of modules. A higher FPA means a better ranking, where the modules with the most defects come first [1]. With the modules ordered so that the predicted most defect-prone module comes first, FPA = (1/k) Σ_{m=1}^{k} (D_m / n), where D_m is the number of actual defects in the top m modules, and:
• k is the number of software modules.
• n is the total number of defects in all modules.
• m indexes the top-ranked modules, m = 1, 2, ..., k.
• Root Mean Square Error: RMSE stands for Root Mean Squared Error, the standard deviation of the errors made when predicting on a dataset. It is the same as MSE (Mean Squared Error) except that the square root is taken when calculating the model's accuracy: RMSE = sqrt((1/n) Σ_{i=1}^{n} (P_i - O_i)^2), as in equation 2, where P_i is the predicted value for the i-th observation, O_i is the observed value for the i-th observation, and n is the sample size. Because the errors are squared before being averaged, RMSE gives larger mistakes a higher weight, which makes it especially useful when substantial errors exist and have a significant impact on the model's performance; squaring also avoids taking the absolute value of the error. For this metric as well, the lower the value, the better the model's performance.
• Mean Absolute Error: MAE is the average of the absolute differences between predicted and observed values, MAE = (1/n) Σ_{i=1}^{n} |P_i - O_i|. Like RMSE, lower values indicate better performance.
• Cumulative Lift Chart: A lift chart graphically represents the improvement the mining model provides over random estimation, measuring the change as a lift score over the k software modules. By comparing the lift scores of different models, you can determine which model is better.
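The numeric measures can be computed directly from their definitions. The sketch below is an illustrative implementation, not the paper's code; it assumes modules are ranked by predicted defect count, most defect-prone first, as in the FPA definition above.

```python
# Illustrative implementations of FPA, RMSE, and MAE.
import numpy as np


def fpa(y_true, y_pred):
    """Fault-Percentile-Average: average over m = 1..k of the fraction
    of all defects found in the top m predicted modules."""
    order = np.argsort(y_pred)[::-1]          # most defect-prone first
    n = y_true.sum()                          # total defects
    top_m_defects = np.cumsum(y_true[order])  # defects in top m modules
    return (top_m_defects / n).mean()


def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_pred - y_true) ** 2))


def mae(y_true, y_pred):
    return np.mean(np.abs(y_pred - y_true))
```

A perfect ranking maximizes FPA, while RMSE and MAE are minimized when predicted counts match observed counts.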

D. Research Methodology
In this paper, a new learning to rank (LTR) approach was developed to help the QA team rank the most defect-prone modules in the software and thus reduce testing effort using various regression models. The datasets used were taken from standard benchmark collections and were divided into training and test data. In software defect prediction (SDP), training and test data can be selected in two separate ways. In the first, the training and test data are selected randomly (or sequentially) from the same dataset. In the second, the training data is taken from one dataset as the previous version, and the test data is taken from another dataset as the next version. The first approach was adopted and used here. We then evaluated the models using known evaluation measures such as Fault-Percentile-Average (FPA), Mean Absolute Error (MAE), Root-Mean-Square Error (RMSE), and Cumulative Lift Chart (CLC). Our proposed LTR approach was compared with the current LTR approaches used in software defect prediction. Fig. 1 illustrates the research methodology used in this paper.

E. Implementation
To prove the success of our proposed LTR approach, it is necessary to apply our work and present and compare the results. Various regression models were used in a way distinct from previous studies, by applying the LTR approaches in the programming language discussed below. Python 3.6 and Spyder 3.2.6 were used to evaluate the accuracy of the ML regression models (SVR, RF, LR, ZIP, ZIR, and NPR). In addition, Google Colab was used to run the existing LTR approaches for comparison with the proposed LTR approach in this paper. Each model was developed separately from the others, but in this section we present the models together to show the methodology clearly. Each model was applied to the twenty-eight datasets, and the datasets were configured prior to use so that they were all applied in a uniform manner. We will walk through the methodology by explaining the steps of the code.

IV. RESULTS
In this paper, four evaluation measures were used to calculate the accuracy of our regression models, and we present the findings in tables organized by evaluation measure. Twenty-eight datasets were used over six regression models. The goal of this paper is to present the accuracy of our models by applying the 5-fold cross-validation technique; 5-fold cross-validation was used to obtain reliable results across the 28 datasets with different features.
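The 5-fold protocol described above can be sketched as follows, using RMSE as the example measure. The helper name and model settings are illustrative assumptions, not the paper's code: each model is trained on four folds, scored on the held-out fold, and the five scores are averaged.

```python
# Illustrative 5-fold cross-validation loop for one model and one measure.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold


def cv_rmse(model, X, y, n_splits=5, seed=0):
    """Average RMSE over the held-out folds of a k-fold split."""
    scores = []
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in kf.split(X):
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        scores.append(np.sqrt(mean_squared_error(y[test_idx], pred)))
    return float(np.mean(scores))
```

Running this per model and per dataset yields one averaged score per table cell, which is how the result tables below can be read.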

A. Fault Percentile Average
By calculating the average accuracy of Fault-Percentile-Average using 5-fold cross validation, we show the values in Table II.
The goal of building these regression models was achieved by finding the model that identifies the modules containing the largest number of errors and comparing the machine-trained model with the outputs in the data. Accuracy was measured to discover the model that best predicts the number of errors; accuracy expresses how close the machine's prediction is to the original result. For the error measures, the closer the result is to zero, the better; for FPA, higher values indicate a better ranking. From our experience, it is difficult to determine which model is better overall because of the amount of disparate data, but we can determine the best model by comparing the models on a single dataset.
The Fault-Percentile-Average (FPA) evaluation measure was used based on previous studies, which applied it to classification models with satisfactory results. In this paper, we applied it to six regression models on a larger scale, using all the databases available to us in the field of learning to rank, and obtained satisfactory results, demonstrating the usefulness of the fault-percentile-average measure. None of the results showed over-fitting.
As seen in Table II, the rows represent the datasets we used and the columns represent the regression models that we created. For example, row ant-1.7 represents the first dataset to which Fault-Percentile-Average was applied, showing the accuracy results for the regression models built to determine the model that identifies the largest number of program errors. This accuracy represents the proximity of the learned data to the test data; here we find that the best reading is 0.81054, which belongs to the negative binomial regression model. This result does not mean it is the best model overall, because the outcome may depend on the nature of the data and the evaluation measure.

B. Mean Absolute Error
By calculating the average accuracy of Mean-Absolute-Error using 5-fold cross validation, the values are shown in Table III.
Because we are using regression models in this paper, it is necessary to use regression measurement criteria such as Mean Absolute Error. We used the same methodology of building defect models that identify the largest number of errors and measured the average accuracy of the models using 5-fold cross-validation on the twenty-eight datasets with all features. Table III shows the accuracy results from using the Mean Absolute Error evaluation for regression. Some results exceeded the expected range because most of the datasets are intended for classification; however, satisfactory results were obtained on some datasets. This does not mean that the other models failed to show accuracy in every way, but rather that they showed satisfactory results according to the nature of the data.

C. Root Mean Square Error

Table IV shows the RMSE results based on different numbers of metrics. We applied 5-fold cross-validation ten times over the 28 datasets with all metrics and report the average Root-Mean-Square-Error accuracy in Table IV. As shown in Table IV, over-fitting occurred in some of the data because the nature of the data is suited to classification rather than regression. However, we achieved satisfactory results, mostly concentrated on two models (linear regression and support vector regression).

D. Cumulative Lift Chart
This is a way to evaluate our models by showing the relationship between two evaluation measures (FPA and MAE) in an easy and effortless way, presenting the chart for all six models used before, as shown in Fig. 2. The charts show the performance of our regression models against other well-known LTR methods [1]-[10].
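One way to compute the points of a cumulative lift curve like those in Fig. 2 is sketched below. This is an illustrative helper, not the paper's plotting code: for each fraction of modules inspected in predicted-rank order, it records the fraction of all defects found; plotting these points against the diagonal (random estimation) gives the lift chart.

```python
# Illustrative cumulative lift curve: fraction of defects found as a
# function of the fraction of modules inspected, in predicted order.
import numpy as np


def lift_curve(y_true, y_pred):
    order = np.argsort(y_pred)[::-1]  # most defect-prone first
    y_true = np.asarray(y_true, dtype=float)
    found = np.cumsum(y_true[order]) / y_true.sum()
    inspected = np.arange(1, len(y_true) + 1) / len(y_true)
    return inspected, found
```

A model whose curve rises faster toward 1.0 concentrates more defects in its top-ranked modules, which is exactly what the chart comparison rewards.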
V. CONCLUSION

SDP models for the ranking task manage testing resources more effectively by predicting which modules are likely to have more errors in the software program. SDP data is gathered by a variety of IT organizations and individuals, and it is noisy. As a result, estimating the number of errors per software module is difficult, if not impossible, due to a lack of precise historical data. Some academics propose utilizing ranking-based performance metrics such as CLC and FPA to assess SDP models. Contemporary SDP models, however, have been tuned to predict a specific number of errors, and a decent model based on individual loss functions may still fail to provide a satisfactory ranking. As a result, in this paper we proposed a unique approach, distinct from earlier studies, for developing models by directly improving the ranking performance measure. We applied the LTR approach to a wide range of real-world datasets and presented a complete assessment and comparison of RF, SVR, and LR with other approaches. We also estimated the error using FPA and MAE and then used CLC to explain the disparity between their results.
The following are the key findings from our research paper: 1) Employing the regression approach rather than the classification approach, in contrast to prior studies in the literature where the classification technique is employed. This highlights the contrast between the classification and regression models: in classification, the data is divided into 0 and 1, with 0 meaning a module contains no errors and 1 meaning it contains errors. The regression models that we work with instead estimate the number of mistakes in each module, meaning that each module's features (x values) are associated with a number of errors; the regression models are trained so that, for new data, the features predict an error count equal or close to the number of genuine errors. This is the point of using regression models. 2) Proposing a new LTR approach with scipy.stats and applying it to multiple models to compare and calculate accuracy. We discovered that the models produced using regression accomplished what was expected of them in terms of identifying the modules with the highest number of errors, and the percentage of accuracy varied according to the type of data. According to the comparison with the standard measures, we found that the RF and SVR models are better.
Based on the results and their comparison, we found that the measurement criteria differ from each other: there was a gap in the accuracy calculated by some measures due to the random nature of the data, and FPA achieved better results than MAE and RMSE in this research paper.
ACKNOWLEDGMENT Yousef Elsheikh and Sara Al-omari are grateful to the Applied Science Private University in Amman, Jordan, for the financial support granted to cover the publication fee of this research article.