Evolving Software Effort Estimation Models Using Multigene Symbolic Regression Genetic Programming

Software plays an essential role in engineering,
economic development, stock market growth and military
applications. A mature software industry counts on highly predictive
software effort estimation models. Correct estimation of software
effort leads to correct estimation of budget and development time.
It also allows companies to develop an appropriate time plan for
marketing campaigns. Nowadays, obtaining these estimates has become
a great challenge due to the increasing number of attributes which
affect the software development life cycle. Software cost estimation
models should provide sufficient confidence in their
prediction capabilities. Recently, Computational Intelligence (CI)
paradigms have been explored to handle the software effort estimation
problem, with promising results. In this paper we evolve two new
models for software effort estimation using Multigene Symbolic
Regression Genetic Programming (GP). One model utilizes the
Source Line Of Code (SLOC) as the input variable to estimate the
Effort (E), while the second model utilizes the Inputs, Outputs,
Files, and User Inquiries to estimate the Function Point (FP).
The proposed GP models show better estimation capabilities compared
to other reported models in the literature. The models were
validated on the Albrecht data set with acceptable results.


I. INTRODUCTION
Estimating software effort at an early stage of development might produce uncertainty of up to 400%, as mentioned in [1]. In 2001, the Standish Group also reported that 53% of U.S. software projects ran over 189% of the original estimate [2]. In the 21st century, software technology has provided a variety of software tools, techniques and software estimation models with many features which can help software project developers, managers, analysts and testers do their jobs in a better way. The question that arises is: which tool and which model can really help in providing an accurate estimate? In most cases, the models adopted were based on expert judgment, including the Delphi technique [3] and work breakdown structure based methods. Models based on mathematical equations came later and were named Algorithmic Methods. Examples include the Constructive Cost Model (COCOMO) [1], [4], Software Life Cycle Management (SLIM) [5], [6], Software Evaluation and Estimation of Resources-Software Estimating Model (SEER-SEM) [7], Function Point models [8] and many others.
Practitioners have found that correctly estimating software development costs is a challenging problem. Solving this problem puts pressure on IT companies, since the costs associated with development have become higher than before due to software complexity. As a result, more research has focused on gaining a better understanding of the software development life cycle, as well as on intelligent techniques which can help in developing accurate and efficient software cost estimation models.
In this paper, we continue exploring the idea of developing evolutionary software effort estimation models based on the COCOMO and FP models [9]. Multigene Symbolic Regression GP is used to derive a mathematical model in both cases. The models should take into consideration the most important attributes which affect the effort modeling process for both the COCOMO and FP models.

II. LITERATURE REVIEW
Early investigations on using Machine Learning techniques as a tool for software development effort estimation were presented in [10]-[12]. Recently, Machine Learning techniques were also explored to solve the effort and cost estimation problem for software systems. In [10], the author explored the use of Neural Networks (NNs), Genetic Algorithms (GAs) and Genetic Programming (GP) to provide a methodology for software cost estimation. A novel soft computing model to increase the accuracy of software development cost estimation was presented in [13]. The authors claim that their proposed NN model can be interpreted and validated by experts, and has good generalization capability. CI techniques for software cost estimation were presented and analyzed, along with emerging trends, in [14]. A new approach to finding architectural design models with optimal performance, reliability, and cost properties, based on a multi-criteria genetic algorithm, was presented in [15]. In [16], the author provided a state-of-the-art article on the use of search-based approaches for software development effort estimation. The capabilities of these approaches were fully explored and an empirical analysis was carried out. A comparison between a Neuro-fuzzy model and the most common software models, such as the Halstead, Walston-Felix, Bailey-Basili and Doty models, was presented in [17].
In [18], the author provided an innovative set of models modified from the famous COCOMO model, with interesting results. Later on, many authors explored the same idea with some modifications [19]-[22] and provided a comparison to the work presented in [18]. The advantages of Fuzzy Logic, using the Takagi-Sugeno (TS) technique to build a set of linear models over the domain of possible software Kilo Line Of Code (KLOC), were investigated in [23]. The authors in [24], [25] presented extended work on the use of Particle Swarm Optimization (PSO) and Differential Evolution (DE) to build a suitable model structure that improves estimates of software effort for NASA software projects. The developed PSO model provided promising results. Many model structures were explored, including COCOMO-PSO, Fuzzy Logic (FL), Halstead, Walston-Felix, Bailey-Basili and Doty models. The potential of the developed COCOMO-PSO and FL models was high compared to other models from the literature.

A. COCOMO Model
COCOMO is one of the most famous software effort estimation models used in the literature. This model was originally developed by Barry Boehm [1], [4] and was extensively revised in [26]. The model is given by Equation 1. Recently, tuning the parameters of the COCOMO model using differential evolution to provide a better effort estimate was presented in [25].
$$E = A \times (\mathrm{Size})^{B} \times EAF \qquad (1)$$
Given that:
• E is the effort in person-months
• Size is measured in Kilo Source Lines of Code
• EAF is an Effort Adjustment Factor derived from cost factor multipliers
Software size may not be the most significant attribute in effort estimation, but it does have a major influence on the effort and time computation. If the project size cannot be estimated accurately, it is hard to plan for the project budget and duration. The values of the parameters A and B can be found in Table I. Three types of COCOMO models are presented: Organic, Semidetached and Embedded [27].
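As a rough illustration of the COCOMO form E = A × Size^B × EAF, the following Python sketch computes effort for the three project types. The (A, B) pairs are the commonly quoted basic COCOMO coefficients and are an assumption here, standing in for Table I; the function name `cocomo_effort` and the default EAF of 1.0 are likewise illustrative choices, not taken from the paper.

```python
# Commonly quoted basic COCOMO coefficients (A, B) per project type.
# Assumption: these stand in for Table I, whose values are not shown here.
COCOMO_PARAMS = {
    "organic": (2.4, 1.05),
    "semidetached": (3.0, 1.12),
    "embedded": (3.6, 1.20),
}


def cocomo_effort(ksloc, mode="organic", eaf=1.0):
    """Return effort in person-months using E = A * Size**B * EAF.

    ksloc: project size in thousands of source lines of code.
    eaf:   effort adjustment factor (1.0 means no adjustment).
    """
    if ksloc <= 0:
        raise ValueError("size must be positive")
    a, b = COCOMO_PARAMS[mode]
    return a * ksloc ** b * eaf


if __name__ == "__main__":
    # Effort for a hypothetical 10 KSLOC project under each project type.
    for mode in COCOMO_PARAMS:
        print(mode, round(cocomo_effort(10.0, mode), 2))
```

Because B is greater than 1 for every project type, the estimated effort grows faster than linearly with size, which is why an inaccurate size estimate has such a strong effect on budget and schedule planning.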

B. Function Point Model
Function points are a well-known concept, although only recently have they gained wider acceptance as a software size measure [28], [29]. Function points measure software size based on the functionality requested by and provided to the end user. Albrecht's function point gained acceptance during the 1980s and 1990s because of its benefits compared to models based on SLOC [30], [31]. Albrecht proposed computing the software size based on the system functionality [32], [33]. He originally proposed four function types [32] (files, inputs, outputs and inquiries) with one set of associated weights and ten General System Characteristics (GSCs). In 1983, the work in [33] expanded the function types and proposed a set of three weighting values (i.e. simple, average, complex) and fourteen GSCs, as given in Table II. Because FP is independent of language type and platform, it can be used to identify many productivity benefits. FP is designed to estimate the time required for software project development, and thereby the cost of developing the project and of maintaining existing software systems.
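The function point count can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the weights are the standard IFPUG (simple, average, complex) values, used in place of Table II; the adjustment step uses AF = 0.65 + 0.01 Σ f_i over the fourteen GSC ratings, the form referred to as Equations 2 and 3; and all function and key names are hypothetical.

```python
# Standard IFPUG weights per component: (simple, average, complex).
# Assumption: used in place of Table II, whose values are not shown here.
WEIGHTS = {
    "inputs": (3, 4, 6),
    "outputs": (4, 5, 7),
    "inquiries": (3, 4, 6),
    "internal_files": (7, 10, 15),
    "external_interfaces": (5, 7, 10),
}
LEVEL_INDEX = {"simple": 0, "average": 1, "complex": 2}


def unadjusted_fp(counts):
    """Sum weighted counts; counts maps (component, level) -> item count."""
    return sum(
        n * WEIGHTS[component][LEVEL_INDEX[level]]
        for (component, level), n in counts.items()
    )


def adjustment_factor(gsc_ratings):
    """AF = 0.65 + 0.01 * sum of the 14 GSC ratings, each rated 0..5."""
    if len(gsc_ratings) != 14 or any(not 0 <= f <= 5 for f in gsc_ratings):
        raise ValueError("expected 14 ratings, each between 0 and 5")
    return 0.65 + 0.01 * sum(gsc_ratings)


def adjusted_fp(counts, gsc_ratings):
    """AFP = UFP * AF."""
    return unadjusted_fp(counts) * adjustment_factor(gsc_ratings)


if __name__ == "__main__":
    # Hypothetical project, for illustration only.
    counts = {
        ("inputs", "average"): 10,
        ("outputs", "average"): 5,
        ("inquiries", "simple"): 4,
        ("internal_files", "average"): 3,
        ("external_interfaces", "average"): 2,
    }
    print(unadjusted_fp(counts), adjusted_fp(counts, [3] * 14))
```

Since each rating lies between 0 and 5, the factor returned by `adjustment_factor` always falls between 0.65 and 1.35, so the adjusted count never moves more than 35% away from the unadjusted one.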
The Albrecht FP model consists of two parts: 1) Unadjusted Function Point (UFP) and 2) Adjusted Function Point (AFP). The UFP consists of five components, given in Table II. There are also 14 GSC factors that affect the size of the project effort, each ranked from 0 (no influence) to 5 (essential). These factors, known as $f_1, f_2, \ldots, f_{14}$, are listed in Table III. The sum of all factors is scaled as given in Equation 2 to form the Adjustment Factor (AF), which lies in the range [0.65, 1.35]. The UFP is then multiplied by the AF to create the Adjusted Function Point (AFP) count, as given in Equation 3. The AFP value is therefore within ±35% of the original UFP figure.
$$AF = 0.65 + 0.01 \sum_{i=1}^{14} f_i \qquad (2)$$
$$AFP = UFP \times AF \qquad (3)$$

IV. GENETIC PROGRAMMING
GP is an evolutionary computation technique which allows computer programs to evolve and produce a solution to a problem. GP is a biologically inspired machine learning method which randomly generates a population of computer programs (i.e. solutions), each represented by a tree structure of LISP expressions [34], [35]. Using mutation and crossover, GP produces a new population of solutions which is more likely to contain better solutions than the parent population. This process is repeated until a given number of generations has elapsed or the best solution is reached. GP is often used for symbolic regression, building a mathematical model or expression from a given data set [36], [37]. The foremost advantage of GP is that it evolves both the model structure (i.e. function) and the model parameters. This makes GP well suited to the modeling and identification of nonlinear dynamic systems [38]-[42].

A. Multigene Symbolic Regression
Assume we are using GP to develop a model for a system with input x and output y. GP can produce a tree structure which expresses the mathematical relationship $y = f(x_1, x_2, \ldots, x_n)$, where n is the number of input variables. In multigene symbolic regression, each prediction of the output variable ŷ is formed by a weighted sum of the outputs of a number of trees (genes) in the multigene individual, plus a bias term. Each tree represents a model of zero or more of the n given inputs.
Mathematically, a multigene regression model can be written as:
$$\hat{y} = a_0 + a_1 \times Tree_1 + \cdots + a_M \times Tree_M \qquad (4)$$
where $a_0$ represents the bias or offset term, while $a_1, \ldots, a_M$ are the gene weights and M is the number of genes (i.e. trees) which constitute the individual. The weights (i.e. regression coefficients) are usually computed using least squares estimation, with one weight per tree. A multigene symbolic model usually consists of one or more genes (i.e. GP trees) weighted by linear combination parameters. An example of a multigene model is shown in Figure 1. The presented model can be expressed mathematically as given in Equation 5.
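The weighting step can be illustrated with a short Python sketch: for a fixed set of genes, the bias and gene weights are obtained by ordinary least squares over the gene outputs. This is not GPTIPS code; the two genes are hypothetical examples rather than evolved trees, and a small pure-Python solver for the normal equations is assumed in place of a numerical library.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: gene outputs are dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_gene_weights(genes, inputs, targets):
    """Least-squares estimate of [a0, a1, ..., aM] for a fixed set of genes."""
    # Each row holds a constant 1 (bias column) followed by the gene outputs.
    rows = [[1.0] + [g(x) for g in genes] for x in inputs]
    k = len(genes) + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    return solve_linear(xtx, xty)  # normal equations: (X^T X) a = X^T y


def predict(genes, weights, x):
    """Evaluate y_hat = a0 + a1 * gene_1(x) + ... + aM * gene_M(x)."""
    return weights[0] + sum(w * g(x) for w, g in zip(weights[1:], genes))


# Hypothetical genes over x = (x1, x2, x3); not the paper's evolved trees.
genes = [
    lambda x: x[0] * x[1],
    lambda x: math.sqrt(abs(x[2])),
]
```

Only the linear weights are fitted here. In GP the tree structures themselves are changed by crossover and mutation, and the weights are re-estimated for each candidate individual.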

B. Performance Criterion
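The Root Mean Square (RMS) error was used as the fitness function for genetic programming. RMS is given by Equation 6.
$$RMS = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2} \qquad (6)$$
Other performance criteria were used to evaluate the goodness of the developed GP models. They are given in the following equations:
1) Variance-Accounted-For (VAF):
$$VAF = \left[1 - \frac{var(y - \hat{y})}{var(y)}\right] \times 100\% \qquad (7)$$
2) Euclidean distance (ED):
$$ED = \sqrt{\sum_{i=1}^{N}(y_i - \hat{y}_i)^2} \qquad (8)$$
3) Manhattan distance (MD):
$$MD = \sum_{i=1}^{N}|y_i - \hat{y}_i| \qquad (9)$$
4) Mean Magnitude of Relative Error (MMRE):
$$MMRE = \frac{1}{N}\sum_{i=1}^{N}\frac{|y_i - \hat{y}_i|}{y_i} \qquad (10)$$
where y and ŷ are the actual and the estimated effort based on the developed GP model, and N is the number of projects.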
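The criteria named in this subsection can be computed as in the following Python sketch. It assumes the usual definitions: RMS as the square root of the mean squared error, VAF as one minus the ratio of residual variance to target variance (in percent), ED and MD as the Euclidean and Manhattan norms of the error vector, and MMRE as the mean of the absolute errors relative to the actual values. The function names are illustrative.

```python
import math


def _variance(values):
    """Population variance of a sequence of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def rms(y, y_hat):
    """Root mean square error between actual and estimated values."""
    return math.sqrt(sum((a - e) ** 2 for a, e in zip(y, y_hat)) / len(y))


def vaf(y, y_hat):
    """Variance-accounted-for, in percent (100 means a perfect fit)."""
    residuals = [a - e for a, e in zip(y, y_hat)]
    return (1.0 - _variance(residuals) / _variance(y)) * 100.0


def euclidean_distance(y, y_hat):
    """Euclidean norm of the error vector."""
    return math.sqrt(sum((a - e) ** 2 for a, e in zip(y, y_hat)))


def manhattan_distance(y, y_hat):
    """Sum of absolute errors."""
    return sum(abs(a - e) for a, e in zip(y, y_hat))


def mmre(y, y_hat):
    """Mean magnitude of relative error; actual values must be non-zero."""
    return sum(abs(a - e) / a for a, e in zip(y, y_hat)) / len(y)


if __name__ == "__main__":
    # Hypothetical actual and estimated efforts, for illustration only.
    actual = [10.0, 20.0, 30.0, 40.0]
    estimated = [12.0, 18.0, 33.0, 37.0]
    for metric in (rms, vaf, euclidean_distance, manhattan_distance, mmre):
        print(metric.__name__, metric(actual, estimated))
```

RMS, ED and MD are expressed in the units of the output, so they are not comparable between the effort model and the FP model, whereas VAF and MMRE are relative measures.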

A. Experimental Setup
To develop the proposed GP effort estimation model, we used the GPTIPS [43] toolbox with the parameter settings given in Table IV. The default GPTIPS multigene symbolic regression function was used in order to minimize the root mean squared (RMS) prediction error on the training data. The default operator probabilities were used: 0.85 for crossover, 0.1 for mutation and 0.05 for direct reproduction.
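To make the role of the three operator probabilities concrete, the sketch below draws, for each offspring slot in a generation, which operator would create it. This is an illustrative Python sketch only, not GPTIPS code (GPTIPS is a MATLAB toolbox), and the names `choose_operator` and `plan_generation` are hypothetical.

```python
import random

# Operator probabilities quoted in the text; they sum to 1.
OPERATOR_PROBS = {"crossover": 0.85, "mutation": 0.10, "reproduction": 0.05}


def choose_operator(rng):
    """Pick one operator at random according to OPERATOR_PROBS."""
    names = list(OPERATOR_PROBS)
    weights = [OPERATOR_PROBS[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def plan_generation(pop_size, rng):
    """Count how many offspring each operator would create in one generation."""
    counts = dict.fromkeys(OPERATOR_PROBS, 0)
    for _ in range(pop_size):
        counts[choose_operator(rng)] += 1
    return counts


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same plan
    print(plan_generation(100, rng))
```

With these settings most offspring come from recombining two parents, while the small reproduction share copies individuals unchanged into the next generation.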

B. GP effort model as a function of SLOC
The well-known COCOMO model uses the SLOC as the main input to the effort equation, as given in Equation 1. In our case, we adopted the COCOMO model as a basis for our development. Thus, we used the SLOC as the input and the effort as the output. The developed model is given in Equation 11, where $x_1$ stands for the SLOC.
In Figure 2, we show the GP convergence process, where the best RMS of 6.3475 was reached at generation 292. Figure 3 shows the actual and estimated effort using GP over the sorted list of projects. The two curves look very similar, which is consistent with a high VAF value. In Table V, we show the values of each evaluation criterion adopted in this study.
The second model uses the Inputs ($x_1$), Outputs ($x_2$), Files ($x_3$) and User Inquiries ($x_4$) to estimate the Function Point (FP). Thus, we considered these attributes as inputs to our model and the number of FP as the output. We ran the GPTIPS [43] toolbox with the parameter settings given in Table IV. The developed GP model for the FP is given in Equation 12.
In Figure 4, we show the GP convergence process, where the best RMS of 30.8868 was reached at generation 297. Figure 5 shows the actual and estimated effort using GP on the Albrecht data set adopted in this study. The developed model's performance was computed using a number of criteria, reported in Table VIII.

VI. CONCLUSIONS AND FUTURE WORK
In this paper, evolutionary software effort estimation models based on Multigene Symbolic Regression Genetic Programming were developed. Two GP-based models were developed: one considers the Source Line Of Code (SLOC) as the input variable to estimate the Effort (E), while the second considers the Inputs, Outputs, Files, and User Inquiries to estimate the Function Point (FP). The proposed GP models show better performance compared to other reported models in the literature. They were tested using the Albrecht data set reported in [12]. The mathematical equations which represent both models are adequately simple and can easily be used to predict the effort of further projects. These types of models can significantly help project managers to estimate time and cost for future developments.

Fig. 1. A pseudo-linear multigene model of output ŷ with $x_1$, $x_2$ and $x_3$ as inputs

TABLE IV. TUNING PARAMETERS OF THE GPTIPS TOOLBOX

TABLE VII. ACTUAL AND ESTIMATED GP NUMBER OF FP