Systematic Review Study of Decision Trees based Software Development Effort Estimation

The role of decision trees in software development effort estimation (SDEE) has received increased attention across several disciplines in recent years thanks to their predictive power, ease of use, and understandability. Furthermore, a large number of published studies have investigated the use of decision tree (DT) techniques in SDEE. Nevertheless, a systematic literature review (SLR) that assesses the evidence reported on DT techniques is still lacking. This paper addresses five issues: prediction accuracy, performance comparison, suitable conditions of prediction, the effect of the methods employed in association with DT techniques, and DT tools. To carry out this SLR, we performed an automatic search over five digital libraries for studies published between 1985 and 2019. In general, the results of this SLR revealed that most DT methods outperform many techniques and show an improvement in accuracy when combined with association rules (AR), fuzzy logic (FL), and bagging. Additionally, a limited use of DT tools was observed; researchers are therefore encouraged to develop more DT tools to promote the industrial utilization of DT amongst professionals.
Keywords: Systematic literature review; decision tree; regression tree; software development effort estimation


I. INTRODUCTION
Much of the literature on software project management pays particular attention to SDEE. According to [1], SDEE refers to the process of estimating the effort needed to develop software in terms of money, schedule, and staffing. Effort is generally expressed in man-days, man-months, or man-hours [2]. Precise and accurate software cost prediction can result in successful control of budget and time and in appropriate resource allocation. Unfortunately, overestimating is almost as strong a risk factor for software project failure as underestimating. Similarly, [3] found that inaccurate estimates of required resources are one of the most common reasons why software projects fail. Making a correct estimate, therefore, helps in analyzing the feasibility of a project with regard to its cost-effectiveness [4], which contributes to its success.
To date, a notable number of studies have investigated new models to perform accurate SDEE. In the SLR conducted by [5] over 304 candidate journal studies, the authors outlined 11 prediction techniques grouped into two main categories: 1) algorithmic effort modeling, which predicts costs using a mathematical formula of the project's attributes, and 2) machine learning (MaL) techniques such as decision trees (DT), artificial neural networks (ANN), genetic programming (GP), and case-based reasoning (CBR). Generally, MaL techniques have received considerable attention thanks to their power of modeling complex relations between software attributes and the target value (software cost), especially where the form of the relationship cannot be straightforwardly determined. In the same vein, [6] also conducted an SLR that listed eight types of machine learning models. Overall, the results indicate that ANN, analogy-based estimation (ABE), and DT are the most commonly employed SDEE techniques (37%, 26%, and 17% respectively). A similar decreasing order is reported in [5]. Furthermore, DTs were adopted for SDEE mostly for their capability of predicting and interpreting results, unlike other MaL techniques, as claimed by [7] in their systematic mapping study of decision tree-based SDEE, in which they identified 46 relevant papers. However, there exist some strong conditions and limitations that affect the ease of use of DT techniques in a specific context (see Section III.C).
Also, results from earlier studies show that the accuracy of DT, although often strong, is inconsistent when compared with MaL and Non-MaL cost estimation techniques. According to some papers [8][9][10], DT outperforms regression models. This outcome is contrary to that of [11][12][13], who highlighted the relevance of regression models in providing more accurate estimates than DT models. Moreover, the superior accuracy of DT over radial basis function network (RBFN) models reported in [14][15][16][17] differs from the findings presented in other published studies [17][18]. These inconsistent results have heightened the need to review the evidence on DT models, to better understand and enhance their application. Furthermore, in reviewing the literature, we found no SLR of DTs for software effort estimation. We therefore follow the methodology presented by [19] to select, examine, and synthesize the findings of all DT studies published from 1985 to 2019. This study examines the evidence on DT models from the following five perspectives: (1) the prediction accuracy of DT methods; (2) the comparison of the prediction accuracy of DT techniques and other methods; (3) the suitable estimation contexts for employing DT techniques; (4) the effect of combining other methods with DT models; and (5) tools that implement DT methods.
The organization of this study is as follows. Section II outlines the research methodology used to perform this SLR. Section III describes and analyzes the review results. Section IV summarizes the main findings and suggests some recommendations for research and practice. Section V reports the limitations of this review. Finally, Section VI presents conclusions and outlines directions for future work.

II. METHODOLOGY
The main steps of this SLR are: determining the review questions, defining the search strategy, selecting studies, performing a quality assessment, and extracting and synthesizing data. These steps are detailed in the following subsections.

A. Review Questions (ReQs)
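This SLR attempts to assess the evidence on DT methods and to provide recommendations based on the certainty of the results. The five review questions are as follows:
ReQ1: What is the overall performance of DT techniques?
ReQ2: What is the performance of DT techniques in comparison with other methods (MaL or Non-MaL)?
ReQ3: What are the suitable conditions for an accurate estimation of DT techniques?
ReQ4: How does the combination of other techniques with DT techniques affect estimation accuracy?
ReQ5: What are the most commonly used DT tools?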

B. Search Strategy
The search strategy comprises three phases that help answer the ReQs; they are outlined below.

1) Search string: We construct the search string from terms derived from the ReQs and from their synonyms, and combine them with the AND, OR, and NEAR operators to restrict the search results. We use the same search string devised by Najm et al. [7].
2) Literature resources: To find relevant studies, we use the following five electronic databases, since they are widely employed in review studies: IEEE Xplore, Science Direct, ACM Digital Library, Springer, and DBLP.
3) Search process: The search process is carried out in two stages. In the first stage, we query the digital databases with the search string to select relevant studies; the inspection takes into account the title, abstract, keywords/index terms, and the full text so as not to miss any suitable paper. In the second stage, we look for additional papers by examining the references of the articles selected in the first stage.

C. Study Selection
The study selection aims at identifying appropriate articles that address the ReQs. To this end, we use inclusion/exclusion criteria to retain or discard papers.
Note that we employ the same inclusion/exclusion criteria used by Najm et al. [7]. Fig. 1 shows the number of papers selected or remaining after each phase; the phases are marked by letters from a to f.

D. Quality Assessment
Quality assessment (QuA) was conducted in this review to prevent any biased information from affecting the findings. For this purpose, the quality of the 50 extracted papers was evaluated using the following six questions:
QuA1: Does the paper explicitly define the intended goals of the study?
QuA2: Does the study properly present the proposed solution?
QuA3: Is there a clear explanation of the estimation context?
QuA4: Are supporting studies reported in the paper?
QuA5: Does the paper make a significant contribution to academia/industry?
QuA6: What is the quality of the publication channel where the article was published?
Conferences/workshops/symposiums: (+1) for those ranked CORE A, (+0.5) for those ranked CORE B, and (0) for those ranked CORE C.
Although the QuA criteria and their scores might be subjective, they help us to compare the chosen studies. We note that the same criteria were employed in [6][21]. The quality assessment was conducted separately by two researchers, who answered the questions carefully; any disagreement was discussed and resolved by mutual agreement between the two researchers. We retained only papers whose score exceeded 3 (50% of the maximum score of 6). All 50 relevant papers were retained, since each had a quality score of more than 3.

E. Data Extraction and Synthesis
Data extraction is used to gather all relevant data from the selected papers to answer the ReQs. Table I shows the data extraction form.
To address the review questions posed in this study, two researchers separately read and carefully synthesized the selected papers. There were some disagreements concerning some review questions; however, each was discussed and resolved by mutual agreement between the two researchers. It is worth noting that for some review questions, such as ReQ1, ReQ2, and ReQ4, the data could not be obtained directly. We followed the same solution reported in [6]: for studies using multiple configurations, only the value corresponding to the best performance was extracted, while for studies using different dataset samplings, we used the mean of the accuracy values.
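The extraction rule described above can be illustrated with a short sketch. The function names and the sample values are hypothetical and serve only to illustrate the rule; they are not taken from the reviewed studies. The sketch assumes that "best performance" means the lowest value for error measures such as MMRE and MdMRE and the highest value for Pred(25).

```python
from statistics import mean


def best_configuration(values, higher_is_better=False):
    """Keep only the best value when a study reports several configurations.

    Assumption: lower is better for MMRE/MdMRE, higher is better for Pred(25).
    """
    return max(values) if higher_is_better else min(values)


def mean_over_samplings(values):
    """Average the accuracy values when a study uses several dataset samplings."""
    return mean(values)


# Hypothetical values reported by one study (not real data).
mmre_by_configuration = [0.62, 0.48, 0.55]
pred25_by_configuration = [0.40, 0.52, 0.47]
mmre_by_sampling = [0.50, 0.40, 0.60]

print(best_configuration(mmre_by_configuration))
print(best_configuration(pred25_by_configuration, higher_is_better=True))
print(mean_over_samplings(mmre_by_sampling))
```

Keeping the two cases in separate functions mirrors the text, which treats multiple configurations and multiple samplings as distinct situations.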
After data extraction, the next step is data synthesis, which aims to promote and enhance the generalization of the results. Various methods were adopted:
• Narrative synthesis: It consists of enumerating the data and summarizing the findings of the studies. We use tables, bar charts, and boxplots to strengthen the visualization of results.
• Vote counting: It sums up the number of cases where a model outperforms or underperforms other models. It was used to address ReQ2.
• Reciprocal translation: It consists of translating the notions listed in the selected studies to determine similarities and recognize differences between them. It was used to address ReQ3.

In Section III, we report and analyze the findings for all ReQs. A detailed discussion and interpretation of the findings is given in the following subsections.

A. Estimation Accuracy of DT Techniques (ReQ1)
The majority of studies are history-based, which means that the evaluation of DT techniques relies on historical software project datasets. Consequently, the accuracy of these DT estimation techniques may depend on parameters that fall into three groups: the first concerns the dataset's characteristics (dimension, outliers, missing data, etc.); the second concerns the DT's structure (split rule, number of cases per node, depth of the tree, stopping criteria, effort calculation method, etc.); the third concerns the evaluation and validation techniques employed (assessment measures, k-fold cross-validation, the leave-one-out method, etc.).
Additionally, the 50 selected studies used several datasets to build and assess DT models (see Fig. 2). Table II shows the most commonly used datasets, namely those employed in more than four studies, along with the proportion and number of papers that employ each dataset and the total number of projects per number of studies. Table II shows a high rate of usage of the ISBSG dataset (20%), then COCOMO (13%), followed by Desharnais and NASA (11% and 8% respectively).
Besides, several evaluation techniques were used to assess the prediction accuracy of DT models. The three most used techniques were holdout, leave-one-out cross-validation (LOOCV), and k-fold cross-validation (k>1). Holdout was the most widely used (72%, or 28 papers), followed by k-fold cross-validation (36%, or 14 studies) and LOOCV (21%, or 8 studies). Note that the percentages sum to more than 100% since some papers use more than one evaluation method.
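The three evaluation schemes mentioned above differ only in how project indices are divided into training and test sets. The following is a minimal sketch under stated assumptions: the function names, the 30% test fraction, and the fixed seed are illustrative choices, not settings taken from the reviewed studies.

```python
import random


def holdout_split(n, test_fraction=0.3, seed=0):
    """Single random split of n project indices into (train, test)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(n * test_fraction))
    return indices[n_test:], indices[:n_test]


def k_fold_splits(n, k):
    """Yield (train, test) index lists; each project is tested exactly once."""
    indices = list(range(n))
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]


def leave_one_out_splits(n):
    """LOOCV is k-fold cross-validation with k equal to the number of projects."""
    return k_fold_splits(n, n)


train, test = holdout_split(10)
print(len(train), len(test))
print(sum(1 for _ in k_fold_splits(10, 5)))
print(sum(1 for _ in leave_one_out_splits(10)))
```

Expressing LOOCV as a special case of k-fold makes explicit why it is the most expensive of the three: a model is trained once per project.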
Regarding the accuracy criteria, the selected studies use several measures: MMRE is employed in 31 papers (63%), Pred(25) in 29 papers (59%), and MdMRE in 15 papers (31%). Therefore, these three measures were chosen to address ReQ1.
Typically, all datasets apart from Tukutuku and COCOMO have a mean MMRE ranging between 17% and 68%, a mean MdMRE between 11% and 44%, and a mean Pred(25) between 36% and 89%. Nevertheless, it is difficult to draw any conclusion because of the modest number of studies and experiments.

B. Accuracy Comparison between DT Models and MaL/Non
To compare DT models with other techniques, we counted the number of evaluations in which DT models perform better (or worse) than each of the eleven methods in terms of a particular estimation accuracy measure. Fig. 4 to 6 provide the results of this comparison: the "+" sign in front of MMRE/MdMRE/Pred signifies that DT methods perform better, while the "-" sign signifies that DT methods perform worse than the other models. The evaluations where DT methods perform better are shown in blue, and those where DT methods perform worse are shown in red. Concerning Non-MaL methods, the majority of papers compare DT models with Reg models (87 evaluations). From the data in Fig. 4 to 6, it is apparent that DT models perform better than Reg according to the MMRE measure. Regarding the MaL methods, most DT papers make a comparison with MLP (41 evaluations) and SVM (35 evaluations), then RBFN and CBR (21 and 20 evaluations respectively). According to the MMRE, MdMRE, and Pred(25) values, we found that DT models perform better than MLP, RBFN, and CBR. In contrast, SVM outperforms DT methods in terms of the three aforementioned accuracy measures. For the remaining techniques, it is hard to draw any conclusion because of the small number of evaluations (fewer than 10).
Additionally, all previously mentioned results are gathered from DT studies, so they might be subjective.
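The vote counting behind this comparison can be sketched as follows. The function name, the tuple layout, and the sample values are hypothetical; the treatment of ties (counted for neither side) is an assumption, since the text does not say how ties were handled.

```python
def vote_count(evaluations, higher_is_better=False):
    """Count, per competing technique, how often DT does better ("+") or worse ("-").

    evaluations: iterable of (technique, dt_value, other_value) for one accuracy
    measure. Use higher_is_better=True for Pred(25), False for MMRE and MdMRE.
    """
    tally = {}
    for technique, dt_value, other_value in evaluations:
        counts = tally.setdefault(technique, {"+": 0, "-": 0})
        if dt_value == other_value:
            continue  # assumption: ties are counted for neither side
        if higher_is_better:
            dt_better = dt_value > other_value
        else:
            dt_better = dt_value < other_value
        counts["+" if dt_better else "-"] += 1
    return tally


# Hypothetical MMRE values (DT vs. the other technique), not real data.
mmre_evaluations = [
    ("Reg", 0.45, 0.60),
    ("Reg", 0.70, 0.50),
    ("Reg", 0.30, 0.55),
    ("SVM", 0.50, 0.40),
]
print(vote_count(mmre_evaluations))
```

Vote counting ignores the size of each difference, which is one reason the results above are reported only for techniques with enough evaluations.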

C. Prediction Context of DT Methods (ReQ3)
Given that the investigated software effort estimation techniques provide varying results, it is more important to pay close attention to the favorable prediction context than to look for the perfect estimation model.
Mendes [22] investigated numerous effort estimation techniques and asked: what technique should be employed? The answer is "it depends". The main explanation is that the estimation depends on the prediction context, which is related to dataset characteristics (dimension, outliers, attribute types, missing data, and amount of collinearity) and to different model designs.
Our review study intends to investigate these issues; therefore, we have retrieved and listed the advantages and limitations of DT techniques reported in the selected papers, see [27]. However, opposing results were found; for instance, [25] found that DT techniques can perform well when large datasets are employed. It is therefore difficult to confirm that DT techniques should be favored for small datasets, given that satisfactory results were also achieved on large datasets.
Besides, DT techniques have great difficulty extrapolating beyond the data on which they were trained; for example, [8][28][27][29][30] confirm that DT techniques are typically unable to give accurate estimates for a project that is not similar to those in the training set. Furthermore, classical DT methods cannot deal with imprecision and uncertainty [23][31][27]. As a result, numerous techniques have been suggested to handle imprecise data and thereby obtain more accurate estimates. In particular, [32][33][34] have suggested improved techniques that incorporate concepts from fuzzy logic (FL) theory to improve on the performance of traditional DT methods.
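The extrapolation limitation noted above follows from how a regression tree predicts: each leaf returns the mean effort of the training projects that fall into it, so no prediction can exceed the largest leaf mean. The sketch below is a deliberately simplified, CART-style tree on a single feature, written for illustration only; it is not one of the DT variants evaluated in the reviewed studies, and the project sizes and efforts are hypothetical.

```python
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean (the split criterion used here)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def build_tree(sizes, efforts, min_leaf=2):
    """Fit a one-feature regression tree; each leaf stores the mean effort."""
    pairs = sorted(zip(sizes, efforts))
    best = None
    for i in range(min_leaf, len(pairs) - min_leaf + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical feature values
        cost = sse([e for _, e in pairs[:i]]) + sse([e for _, e in pairs[i:]])
        if best is None or cost < best[0]:
            best = (cost, (pairs[i - 1][0] + pairs[i][0]) / 2, i)
    if best is None:
        return {"leaf": mean(efforts)}
    _, threshold, i = best
    left_sizes, left_efforts = zip(*pairs[:i])
    right_sizes, right_efforts = zip(*pairs[i:])
    return {
        "threshold": threshold,
        "left": build_tree(left_sizes, left_efforts, min_leaf),
        "right": build_tree(right_sizes, right_efforts, min_leaf),
    }


def predict(tree, size):
    """Walk down the tree until a leaf is reached and return its mean effort."""
    while "leaf" not in tree:
        tree = tree["left"] if size <= tree["threshold"] else tree["right"]
    return tree["leaf"]


# Hypothetical training projects where effort grows linearly with size.
sizes = [10, 20, 30, 40, 50, 60, 70, 80]
efforts = [100, 200, 300, 400, 500, 600, 700, 800]
tree = build_tree(sizes, efforts)

print(predict(tree, 35))   # inside the training range
print(predict(tree, 160))  # twice the largest training size
```

In this toy setting the linear trend would suggest an effort of about 1600 for a project of size 160, yet the tree returns the mean of its last leaf, which cannot exceed the largest effort seen in training.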
One of the great advantages of DT techniques is their resistance to outliers, as reported in [14][25][35][27][10], and their robustness to multi-collinearity problems, as reported in [36][37][38][27]. This is because DT methods perform automatic feature selection, as argued in [14][39][36][40]; that is, they select only the relevant features that have an important impact on the effort. Moreover, [14][15][35] have shown that DT methods provide accurate effort estimates without performing variable selection, which strengthens the idea of resistance to multi-collinearity problems.
Other influencing factors must be taken into consideration along with dataset characteristics. They are all listed in Table III. In particular, DT methods can describe the complex relationships that exist among project attributes and effort. This is because DT techniques show practitioners which effort factors have a potential effect on the prediction and how the model derives its results, which is why DT methods produce more interpretable and comprehensible results. They can also perform well at an early stage of project development with only the attributes available early, which helps practitioners make good decisions. However, they require a historical dataset to generate an estimate.

Advantages / Supporting studies
A DT method can handle categorical variables. [8], [27]-[30]
The sensitivity of DT approaches to the nature and quality of historical data. [23], [28], [39], [51]
Not provide any meta information to guide the project manager in the budgeting process. [50]
Accuracy: Not significantly sensitive to the company-specific data or multi-organizational data. [10]
Need completed and historic databases. [22]

In sum, DT techniques have many advantages; however, they suffer from some limitations, which can be bypassed by ensemble methods as in [52]. In the next section, we discuss concisely the impact of combining other techniques on the performance of DT models.

D. Effect of Combining a DT with Other Method (ReQ4)
The present subsection investigates the effect of combining a DT with another technique, in particular the effect of each technique on the estimation accuracy. Table IV gives the MMRE, MdMRE, and Pred(25) improvements for each method employed in association with DT approaches. Note that the accuracy improvement was computed only for studies that report the accuracy of the DT combination alongside the accuracy of DT alone. Table V provides more details about each associated method: the total number of articles dealing with each method, the number of articles comparing the accuracy, and the total number of evaluations performed in these papers. For instance, of the 8 papers that combine fuzzy logic with DT (FL-DT), just 2 compared the estimation accuracy with that of a DT method alone, and only 3 evaluations were made to assess the accuracy of the estimation. Meanwhile, for some methods associated with DT models, the number of evaluations was considerably greater than the number of papers covering those techniques. For example, for grid search combined with generic backward input selection (GS+GBIS), only 1 study compared DT techniques with the GS+GBIS+DT technique, yet 9 evaluations were performed.
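The text does not spell out how the improvement percentages are computed. One plausible reading, assumed here purely for illustration, is the relative change with respect to DT alone, with the sign chosen so that a positive value always means the combination is more accurate. The function name and the sample values are hypothetical.

```python
def relative_improvement(dt_alone, combined, higher_is_better=False):
    """Percentage improvement of a combined model over DT alone.

    Assumption (not stated in the source): improvement is the relative change
    with respect to the DT-alone value. For MMRE and MdMRE a lower value is
    better; for Pred(25) a higher value is better.
    """
    if higher_is_better:
        return (combined - dt_alone) / dt_alone * 100
    return (dt_alone - combined) / dt_alone * 100


# Hypothetical values: MMRE drops from 0.50 to 0.40, Pred(25) rises from 0.40 to 0.50.
print(relative_improvement(0.50, 0.40))
print(relative_improvement(0.40, 0.50, higher_is_better=True))
```

Under this reading, the same absolute change of 0.10 yields different percentages for the two measures because the baselines differ, which is worth keeping in mind when comparing improvements across MMRE, MdMRE, and Pred(25).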
Note that one or more methods may be combined with DT: for example, the ABE line in Table V shows the accuracy values when combining ABE alone with DT techniques, while the (Boost+PCA+Poisson) line presents the accuracy values when combining bootstrapping, principal component analysis (PCA), and Poisson regression. Note also that bagging, regression, fixed size window policy (FSWP), and (Boost+PCA+Poisson) were the least often incorporated with DT techniques (one evaluation in one paper each). Closer inspection of Table V, considering the number of evaluations and the MMRE's median, shows that FL is the method that most strengthens the accuracy of DT techniques (92.56% improvement), followed by Boost+PCA+Poisson (88.33%) and AR (78.22%). On the basis of the MdMRE's median, Boost+PCA combined with Poisson regression yields the greatest improvement (71.42%), followed by GS combined with GBIS (5.99%). According to the median of Pred(25), AR has the greatest effect (84.99% improvement), followed by FL (18.45%).
To prevent the bias coming from evaluations made within the same study, we investigate the impact of combining other techniques with DT methods by counting articles instead of evaluations. Table V shows that Reg, FL, and ABE are the three methods most frequently combined with DT techniques. According to the Reg, FL, and ABE lines of Table IV, the FL technique yields the greatest improvement on the MMRE and Pred(25) accuracy measures.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 11, No. 7, 2020 (www.ijacsa.thesai.org)
To summarize the findings, not all methods presented in Table IV necessarily contribute to improving the accuracy of DT techniques: FL, bagging, and AR are the only ones that improve both the MMRE and Pred(25) criteria, supported by 2, 1, and 1 studies respectively. We found that bagging contributes only a small improvement in accuracy when combined with DT techniques. This is because bagging gives good results with good base learners; if the base learner is poor, bagging may degrade the accuracy of the estimates.
Moreover, AR appears to be a more promising technique than FL when combined with DT techniques, since it significantly improves both the MMRE and Pred(25) accuracy measures.
Nevertheless, all these results require more evaluations in more research studies, given the limited number of papers that analyzed the effect of incorporating other techniques with DT.

E. DT Tools (ReQ5)
In this SLR we identify seven tools, which are listed in Table VI. Weka is the most frequently employed tool, followed by MATLAB, SPSS AnswerTree, and FisPro.
Weka is an open-source, Java-based application developed by researchers. It contains a set of machine learning algorithms, covering in particular data preprocessing, clustering, classification, and AR extraction.
MATLAB is a numeric computing environment originally created by Cleve Moler in the 1970s and developed by MathWorks. There also exist statistical tools built on MATLAB, which offer a set of unsupervised and supervised MaL algorithms, including decision trees with boosting and bagging techniques. Across all the selected papers, only a few studies employed DT tools to generate software effort estimates. Moreover, the majority of the existing tools implement traditional DT methods, which do not integrate other techniques, such as FL or grid search (GS), to enhance the estimates.

IV. SUMMARY AND IMPLICATIONS FOR SEARCH AND USE
Our suggestions concerning the use of DT models in SDEE are listed below:
The estimates' accuracy of DT methods: Due to the modest number of studies, we were unable to draw any conclusion. Furthermore, the majority of studies use historical datasets, so we suggest carrying out more work based on concrete, practical experience in industrial settings.
The accuracy of DT compared to that of MaL and Non-MaL methods: DT techniques outperform some MaL and Non-MaL models, notably RBFN, for which there were enough evaluations. Nevertheless, reporting a definitive result is difficult because of the insufficient number of studies investigating accuracy comparison. Researchers are therefore encouraged to conduct more experiments on this issue.
The suitable conditions for an accurate estimation of DT techniques: It is difficult to draw a conclusion concerning the use of DT techniques. Consequently, practitioners have to figure out which techniques should be combined with DT methods to overcome limitations (related to missing values, categorical data, feature selection, etc.) and to adapt DT to their context.
Effect of combining other methods with DT methods: The accuracy of DT estimates was not always enhanced. The results show that bagging does not improve the accuracy of DT techniques as much as AR and FL do. This indicates that MaL techniques are more desirable to incorporate with DT methods than Non-MaL techniques.
DT tools: In this review study, we identified seven tools for estimating software effort using DT methods. WEKA and MATLAB are the most often used. Moreover, the majority of tools implement classical DT methods. Researchers are therefore encouraged to investigate the implementation of other techniques that significantly enhance the estimates' performance, such as AR, FL, and bagging, alongside DT models, thereby encouraging industrial utilization of DT amongst professionals.

V. LIMITATIONS OF THIS REVIEW
The three accuracy metrics used in this review are MMRE, MdMRE, and Pred(25). However, these indicators do not take into account the quality of the datasets, so they implicitly assume that an estimation method can achieve 100% accuracy on a particular dataset [65]. Additionally, the MMRE has been criticized for being unbalanced in several evaluation contexts and for penalizing overestimates more than underestimates [66], [67]. Nevertheless, this review relies on these three criteria, since they were widely employed in the relevant articles.
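For reference, the three criteria can be sketched from their usual definitions: the magnitude of relative error (MRE) of a project is |actual - estimate| / actual, MMRE and MdMRE are its mean and median over the projects, and Pred(25) is the share of projects whose MRE does not exceed 0.25. The function names and the sample efforts below are hypothetical. The last two lines illustrate the asymmetry criticized above: with non-negative estimates, an underestimate can never have an MRE above 1, whereas an overestimate is unbounded.

```python
from statistics import mean, median


def mre(actual, estimate):
    """Magnitude of relative error for one project (actual effort must be > 0)."""
    return abs(actual - estimate) / actual


def mmre(actuals, estimates):
    """Mean MRE over all projects."""
    return mean([mre(a, e) for a, e in zip(actuals, estimates)])


def mdmre(actuals, estimates):
    """Median MRE over all projects."""
    return median([mre(a, e) for a, e in zip(actuals, estimates)])


def pred(actuals, estimates, level=0.25):
    """Share of projects whose MRE is at most `level` (0.25 gives Pred(25))."""
    hits = sum(1 for a, e in zip(actuals, estimates) if mre(a, e) <= level)
    return hits / len(actuals)


# Hypothetical actual and estimated efforts (not real data).
actuals = [100, 200, 400, 800]
estimates = [110, 150, 400, 1200]
print(mmre(actuals, estimates), mdmre(actuals, estimates), pred(actuals, estimates))

# Asymmetry: the worst possible underestimate vs. a threefold overestimate.
print(mre(100, 0))    # 1.0
print(mre(100, 300))  # 2.0
```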
In addition, it is challenging to define the circumstances of all estimates because they were obtained from selected studies based on various DT techniques and several experimental designs, which include design decisions (feature selection, project selection, split rule, stopping criteria, pruning, etc.) and validation methods (holdout, LOOCV, k-fold cross-validation, etc.).
Moreover, this review considers only studies about DT techniques. For that reason, the reported performance of DT techniques may be overestimated, and the advantages and limitations of each DT technique may be subjective. Therefore, the reader must also take into consideration the potential effect of the authors' interests and viewpoints on these results.

VI. CONCLUSION AND FUTURE WORK
This systematic review synthesizes the results of DT studies on software effort estimation. The selected papers were examined according to five perspectives: prediction accuracy, the performance of DT techniques in comparison with other methods, contexts of the estimates, the effect of combination on DT's performance, and DT tools.
In sum, we identified 50 relevant papers published between 1985 and 2019. The important results found in this review study are as follows:
What is the overall performance of DT techniques? The overall picture suggests that no conclusive affirmation can be made, since the mean accuracy values are around 52.5% for MMRE, 26.1% for MdMRE, and 56.1% for Pred(25).
What is the performance of DT techniques in comparison with other methods (MaL or Non-MaL)? In general, DT techniques outperform RBFN, MLP, and CBR techniques. They also outperform regression models according to MMRE.
What are the suitable conditions for an accurate estimation of DT techniques? Many studies confirm that DT methods can describe the complex relationships that exist among project attributes and effort, and can produce more interpretable and comprehensible results, in addition to being resistant to outliers and robust to multi-collinearity problems. However, classical DT methods cannot deal with imprecision and uncertainty. Furthermore, several papers propose the use of hybrid models to overcome the existing DT limitations.
How does the combination of other techniques with DT techniques affect estimation accuracy? The techniques most commonly used in combination with DT are fuzzy logic followed by regression. However, not all combined techniques improve the estimation accuracy of DT techniques. Association rules, fuzzy logic, and bagging are the techniques that improve the prediction accuracy of DT based on the MMRE and Pred(25) measures.
What are the most commonly used DT tools? WEKA, created by researchers at the University of Waikato, is the most widely used tool to estimate effort using DT techniques.
In terms of future work, it would be interesting to perform a comparative study and repeat the experiment using unbiased evaluation criteria, such as the standard accuracy (SA) and the effect size, rather than the biased MMRE.
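As an illustration of the criteria suggested for future work, the sketch below follows one definition commonly used in the effort-estimation literature, which is assumed here because the text does not define the measures: SA = 1 - MAR / MAR_P0, where MAR is the mean absolute residual of the model and MAR_P0 is the mean MAR of many runs of random guessing (each project is "predicted" with the actual effort of another, randomly chosen project). The effect size is then (MAR - MAR_P0) / s_P0, with s_P0 the standard deviation of the random-guessing runs. The function names, the number of runs, the seed, and the sample efforts are all hypothetical.

```python
import random
from statistics import mean, stdev


def mar(actuals, estimates):
    """Mean absolute residual."""
    return mean([abs(a - e) for a, e in zip(actuals, estimates)])


def random_guessing_mars(actuals, runs=1000, seed=0):
    """MAR of each random-guessing run (needs at least two projects)."""
    rng = random.Random(seed)
    n = len(actuals)
    results = []
    for _ in range(runs):
        guesses = []
        for i in range(n):
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1  # skip the project itself, keep the choice uniform
            guesses.append(actuals[j])
        results.append(mar(actuals, guesses))
    return results


def standardized_accuracy(actuals, estimates, baseline_mars):
    """SA = 1 - MAR / MAR_P0; values near 0 mean no better than guessing."""
    return 1 - mar(actuals, estimates) / mean(baseline_mars)


def effect_size(actuals, estimates, baseline_mars):
    """(MAR - MAR_P0) / s_P0; negative when the model beats random guessing."""
    return (mar(actuals, estimates) - mean(baseline_mars)) / stdev(baseline_mars)


# Hypothetical actual and estimated efforts (not real data).
actuals = [100, 200, 400, 800]
estimates = [110, 150, 400, 1200]
baseline = random_guessing_mars(actuals)
print(standardized_accuracy(actuals, estimates, baseline))
print(effect_size(actuals, estimates, baseline))
```

Unlike MMRE, both measures are anchored to a baseline, so they indicate whether a model is actually better than guessing on the dataset at hand.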