Predictive Approach towards Software Effort Estimation using Evolutionary Support Vector Machine

Abstract—Project effort measurement is one of the most important estimates made in the project management domain. This measure is made in advance using traditional methods such as Function Point analysis, use case analysis, PERT analysis, analogous estimation, Planning Poker, etc. Classical models have the limitation of being burdensome to implement, especially when counts of LOC (lines of code) or objects are required for the measurement. Sometimes historical information about a project is also considered when estimating project effort, but these estimates then need to be adjusted. The idea proposed in this research is to determine which factors of a project are directly related to effort estimation. In addition, a model is proposed to predict effort using a minimum number of parameters in software project development.

Keywords—Correlation coefficient; Decision tree; Effort estimation; Evolutionary support vector machine; Software project management


I. INTRODUCTION
We need software project cost estimation and project effort estimation to get an idea of the amount of work required and the related amount to be spent on that work during the course of developing a software product [3]. Effort estimation means calculating or forecasting the exertion required of the manpower involved in the project, in person-hours or person-months. There are various techniques to estimate effort, such as Function Point (FP) analysis, use case analysis, PERT analysis, analogous estimation, Planning Poker, etc. [17], [18]. These classical ways of calculating effort are still in practice and providing service in software project management, although they have limitations: FP, for example, is tied to the size of the project, involves lengthy calculations, and requires specifying an adjustment factor which, if wrong, leads to a false FP analysis. Similarly, all the other techniques have their own merits and demerits.
Machine learning is making advancements because of its algorithms for making predictions; a number of algorithms are handy for prediction and forecasting. Calculating effort by any of the above-mentioned methods requires large datasets with maximal information about a project. What we need now is fast calculation with accurate results, even when less information is available. This research paper makes use of machine learning to predict effort when certain parameters are known; which parameters can help in prediction is determined by linear correlation and a decision tree.

II. LITERATURE REVIEW
In a company, data is the most important entity: it is a source of information and, through its firm connection with management and expertise, a source of new products. A company's competitive position is furnished by the launch of new products, yet most companies fail to develop successful new products because they are unable to produce sound schedules and development plans in New Product Development (NPD). CPM (Critical Path Method) and PERT (Program Evaluation and Review Technique) are outmoded approaches to project scheduling; for non-linear data, intelligent systems combining neural networks and fuzzy neural networks should be used instead [1].

Cost estimation of a project is done so that the project is completed within, or under, the specified budget. There are two types of cost estimation models: algorithmic and non-algorithmic. Algorithmic models have defined formulae for calculating the cost estimate, while non-algorithmic models do not. Evaluation techniques such as root mean square error (RMSE) then determine the difference between estimated and actual cost [2].

We need software project cost estimation and project effort estimation to get an idea of the amount of work required and the related amount to be spent on that work during the course of a software product's development. Schedule, effort and quality are the three corners of a "magic triangle", and maintaining a balance between these three aspects is a tough yet essential job in software projects. Time management has its drawbacks in all cases, whether the schedule is accurately estimated, underestimated or overestimated [3].

Human resource is an asset of a software development firm. Due to the scarcity of skilful developers, software managers find it challenging to schedule the time and cost of a limited developer resource to fit the total time and cost of the whole project. "Who must do what, and when?" is the question to be answered while scheduling tasks for developers within a specified time. Scheduling is hard to do for software projects because: 1) software is not material, hence its progress is difficult to monitor; 2) there are only rough estimates of each development life cycle activity; and 3) for activities that run in parallel, when they interface with each other, their other components also change to support the sudden interface change, taking more time than estimated [4].

The development of techniques to schedule software projects is a challenging task: the challenge is not only to schedule the project but also to schedule the human resource according to the project. Existing models are designed such that, to make project scheduling more accurate, human resource allocation is made to suffer. The combination of Event Based Scheduling (EBS) and the Ant Colony Optimisation (ACO) algorithm is an innovative and flexible approach for scheduling software projects; in EBS, events are the times when employees start or leave the project, or when resources are released [5].

There are several evolutionary algorithms for solving project scheduling problems, but each algorithm functions in its own way, and how they perform depends on how they are designed. The design of the proposed evolutionary algorithm is improved by combining it with normalisation techniques, which improves its practical effectiveness. The project scheduling problem becomes problematic when the project is large scale, employees can dedicate themselves to a task only up to a specific limit, and there is a large space of possible task allocations for employees. So it is a need of the time, and of management, to automate the process of project scheduling so that each employee gets an optimised workload [6].

The human dimension of a software project is more crucial than its technical dimensions. Each individual plays a key role in delivering the software development life cycle deliverables; a project's performance, success and failure depend on the personalities of the individuals involved. The "Belbin Team Roles Assessment Tool" assesses the personality type of individuals, and in that work it is used to support the argument that human personality impacts information technology projects in one way or another [7].

In-house software, outsourced software and off-the-shelf software are the various types of software. Whenever outsourced software projects are not monitored and managed appropriately, there is a risk that the threats and suspicions associated with the project will be handled on the basis of individual knowledge of the work and personal ideas. Risk management in software engineering is the domain for handling risks effectively and in a timely fashion; the project manager and other team members must train themselves to handle the risks associated with each stage of the software development life cycle [8].

Performance evaluation of component-based software is actually the performance evaluation of the individual components. There are specific models of the development process of component-based software, specialised to best utilise the advantages of component-based systems, i.e. reuse of components and division of labour. The performance of component-based systems is hard to determine because, in a running environment, components work the way they are deployed in the system, and the developers of components have no idea of the usage profile of a particular component or how it will be used [9].

Decision support involves multiple activities and variables to analyse the performance of a project. A Project Performance Measurement System (PPMS) involves variables and managers tackling a volume of data to evaluate project performance. Decision support categorises the performance of the project by taking its current state and the decision maker's view; if a deviation is detected, the management team analyses the reason according to the project manager's performance interest. The performance of the project is the basic support for the decision maker. The MACBETH tool is used to analyse the performance of subsequent PPMS data with respect to the project manager [10].

The selection of a project manager is a critical task involving multiple criteria that must be considered, such as past project performance, suitability to the nature of the project, and the qualifications of the candidate. Project performance is often measured by cost, time and technical description. A Decision Maker Support System (DMSS) bases the selection of a candidate as project manager on previous performance, using a ranking method over past projects. Different ranking methods are presented for ranking previous projects, which should fulfil specific as well as multiple requirements. The DMSS analysis tool enables ranking candidates on the basis of previous project success, and also considers the candidate's history, record of managing projects, and the outcome performance of those projects [11].
In IT-based business projects, knowledge management involves understanding three areas: technical knowledge, organisational knowledge and business value solutions. An empirical study shows that business value is better achieved in IT-enabled project change with active knowledge management in all three domain areas; the study also shows that knowledge management explains 38% of project value performance. It has been proposed that knowledge management has a significant impact on project performance; performance against other essential targets, e.g. budget and schedule, is positively affected by knowledge management [12].

A Hyper-Cube framework for Ant Colony Optimisation (ACO) has been presented to solve the Software Project Scheduling Problem using a max-min algorithm. The Hyper-Cube framework improves the performance of the algorithm. The max-min algorithm's results were compared with genetic and ACO algorithms, and it was observed that the max-min algorithm gives better results for small instances, attaining low project cost and duration [13].

Scheduling of software projects is a critical task in the software development process. Software Project Scheduling (SPS) involves human resources and is a people-intensive activity; two goals arise in SPS, the reduction of duration and of cost. One paper analyses the scalability of eight multi-objective algorithms for large projects in the competitive software industry, and finds that the PAES algorithm shows the best scalability among the eight [14].

Schedules in critical chain project management do not always meet the dates for completion of tasks. One paper proposes an optimisation approach for critical chain schedules. The Critical Chain Project Scheduling Problem (CCPSP), which involves duration uncertainty and resource constraints, is formulated, and a differential evolution (DE) algorithm is applied, tailored to the characteristics of the CCPSP, to find a solution; new DE strategies are followed to obtain fast convergence and global exploration ability. Implementing the critical chain method for large-scale projects is a difficult and complex task, and it is observed that the modified DE algorithm achieves global convergence effectively for the CCPSP [15].

Considering project complexity, another work throws light on how project success is related to risk management while correlating soft and hard skills. For this purpose a hypothesis was tested on 263 projects, with interviews conducted with project managers and risk managers and then analysed. A structural model was proposed that correlates the soft and hard skills of risk management with project success. It was concluded that the soft side has a 10.7% effect on project success and hence cannot be neglected, while it also supports the hard side, which has a 25.3% effect [16].

For the estimation of effort, schedule, cost and size of software projects, the use case method is used, which depends on use case diagrams for estimation. Estimation helps to determine the cost, effort, size and schedule of software projects, and for this purpose the use case method is widely used; however, it has limitations that may affect estimation accuracy. Some techniques have been proposed to address these estimation problems: techniques such as neural networks and fuzzy logic can be used to improve the accuracy of use case points estimates [17].
The Use Case Point (UCP) technique is used for effort estimation of software projects. The UCP technique gives a better and more accurate estimation of effort but still has certain limitations, due to which UCP is not widely accepted by the software industry. To enhance the prediction accuracy of effort estimation, the Random Forest (RF) technique is used to overcome the constraints of the UCP technique. RF is a machine learning technique that utilises the approach of the use case point method; it combines the results obtained by different models, which gives a more accurate effort estimate. Results show that the RF technique gives more accurate predictions compared to other techniques such as SGB (Stochastic Gradient Boosting), LLR (Log-Linear Regression), MLP (Multi-Layer Perceptron) and RBFN (Radial Basis Function Network) [18].

Cost and performance are important factors in reducing the design-space exploration time of a target system at a high level of abstraction in software/hardware co-design. Estimation methods such as S-Graphs are used in software/hardware co-design with POLIS at different abstraction levels. The S-Graph-level method provides better accuracy in software synthesis for optimisation, while the CFSM level provides better accuracy for schedule generation and automatic partitioning into software and hardware. Experimental results show that the S-Graph method, compared to assembly level, has accuracy within -20% to +20%, and for maximum time estimation the CFSM-level accuracy range is -10% to +25% [19].

Cost estimation for software projects is a difficult task early in the software development life cycle. Deterministic methods are used for software project evaluation. A comparative analysis has been done between COCOMO and fuzzy logic for software cost estimation. Fuzzy logic is a newer approach used to investigate Software Cost Estimation (SCE); multiple membership functions, such as Trapezoidal, Triangular, Generalised Bell, Sigmoidal and Gaussian, have been used to analyse Fuzzy Logic (FL) for SCE. FL shows better performance than the COCOMO model when tested on a dataset of software projects [20].
III. EXPERIMENTAL SETUP

Sample: A sample of 81 projects is taken from the NASA Repository [21]. The actual dataset contains 12 parameters of varying types, listed in Table 1. Not all of them are required in our work, as the actual dataset was used for the analogous technique of effort estimation [22]. The parameters "TeamExp" and "ManagerExp" are numerical but take nominal values from 0 to 4 and 0 to 7 respectively. The dataset contains 81 instances, but 4 of them have incomplete information, so 77 instances are in workable form.
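As a hedged illustration of this preprocessing step, the incomplete instances could be filtered as follows; the file name, CSV format and column labels are assumptions, since the repository's exact export layout is not specified in the paper.

```python
import pandas as pd

# Hypothetical file name and format; the repository's actual export
# layout is an assumption, not given in the paper.
df = pd.read_csv("nasa_projects.csv")

print(df.shape)   # expected (81, 12): 81 instances, 12 parameters
df = df.dropna()  # drop the 4 instances with incomplete information
print(df.shape)   # expected (77, 12) workable instances
```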
In our research, "PointsAdjust" and "PointsNonAdjust" are not used, as they are specific to the analogous method of effort calculation.
The parameters involved in calculating effort are given in Table 2. The dataset has a pre-calculated effort value for each project. The following figures (Figures 1 to 8) show effort plotted against each parameter, with effort as the dependent variable (y-axis) and the other parameters as independent variables (x-axis). Scatter plots were chosen for this demonstration because they give a rough idea of correlations and dependencies.
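A minimal sketch of how such a grid of scatter plots could be produced with matplotlib, continuing from the loading sketch above; the column names are assumptions.

```python
import matplotlib.pyplot as plt

# Assumed column names; "Effort" is the dependent variable. The eight
# plotted parameters correspond to those listed in Table 2.
params = [c for c in df.columns if c != "Effort"]

fig, axes = plt.subplots(2, 4, figsize=(16, 8))
for ax, p in zip(axes.ravel(), params):  # zip stops at the 8 panels
    ax.scatter(df[p], df["Effort"], s=12)
    ax.set_xlabel(p)
    ax.set_ylabel("Effort")
fig.tight_layout()
plt.show()
```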

IV. METHODOLOGY
The methodology followed is that the data is first tested for correlation. A correlation matrix is created to get an idea of the relation of each attribute with the others. The threshold value is set to 0.5; all correlation values less than 0.5 are discarded, i.e. those attributes are considered uncorrelated. The correlation matrix can be viewed in Figure 9. A decision tree is then built using Effort as the label and is used to shortlist attributes; the attributes matching the correlation results are finally shortlisted. The tree can be viewed in Figure 10 and the shortlisted attributes in Table 6. From the tree we can see that the attributes playing a role in decision making are the same as the output of the correlation matrix, i.e. Entities, Transactions and Length. Predictions will be made using these three parameters, as they are correlated with Effort.
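A sketch of this two-step attribute selection, assuming the pandas/scikit-learn stand-ins from the sketches above rather than the authors' exact tooling:

```python
from sklearn.tree import DecisionTreeRegressor

# Correlation matrix; keep attributes whose correlation with Effort
# meets the 0.5 threshold used in the paper.
corr = df.corr()
selected = corr.index[corr["Effort"].abs() >= 0.5].drop("Effort")
print(list(selected))  # expected: Entities, Transactions, Length

# Decision tree with Effort as the label; attributes with non-zero
# importance are cross-checked against the correlation shortlist.
X, y = df.drop(columns="Effort"), df["Effort"]
tree = DecisionTreeRegressor(random_state=0).fit(X, y)
for name, imp in zip(X.columns, tree.feature_importances_):
    if imp > 0:
        print(name, round(imp, 3))
```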
After parameter selection, several prediction models were developed and their performance was measured by calculating the correlation between predicted effort and actual effort. These models were preferred over other machine learning models because they can handle numerical data, and as per the earlier tests to determine the prediction parameters, all of them are numerical in nature (Table 7). The parameter "Effort" itself is a numerical attribute in this dataset.

1) Gradient Boosted:
Boosting adds prediction models to an ensemble serially. The gradient boosting technique adds new models to the current ensemble to increase the accuracy of the target prediction variable; new base learners are constructed to be highly correlated with the negative gradient of the loss function [23].

2) Deep Learning:
Deep learning is a type of neural network [24]. Depending on the need to make the weights more accurate, a neural network might require long chains of computational stages, each of which may transform the cumulative stimulation of the network. This is where deep learning comes into action, giving due credit to a large number of stages as required [24].

3) Support Vector Machine:
A support vector machine is a linear classifier that divides data into two classes by a hyperplane [25]. The training data in an SVM are vectors in space; the training points closest to the hyperplane are the support vectors [26].
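As a rough sketch of how the first three candidate models could be fitted to the shortlisted parameters, using common scikit-learn stand-ins (an assumption; the paper does not name its implementations):

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

# Shortlisted parameters from the selection step above.
Xs, y = df[["Entities", "Transactions", "Length"]], df["Effort"]

models = {
    "Gradient Boosted": GradientBoostingRegressor(random_state=0),
    "Deep Learning": MLPRegressor(hidden_layer_sizes=(32, 32),
                                  max_iter=5000, random_state=0),
    "SVM": SVR(),
}
for name, model in models.items():
    model.fit(Xs, y)
    print(name, model.score(Xs, y))  # R^2 on the training data
```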

4) Linear Regression:
Regression is a statistical method for analysing the relationship between variables. Linear regression is so called because it yields a straight line relating the variables. In linear regression the predictor x is kept fixed and the regressor y is then analysed; any variation in y caused by the external factors that determine its behaviour is represented as ε [27].
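For reference, a minimal statement of this model in our notation (the paper itself does not print the equation):

```latex
% Simple linear regression: y is the regressor (response), x the fixed
% predictor, and \varepsilon the error term for external factors.
y = \beta_0 + \beta_1 x + \varepsilon
```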

5) Vector Linear Regression:
Support vector regression is an efficient regression model. It includes the concept of support vectors, i.e. training points near the separating boundary. Support vector regression adjusts the ε of the loss function of linear regression, given a training set and a target variable to predict [28].

6) Generalised Linear Model:
A general assumption in the linear model is that the data being modelled are on a continuous measure; hence linear models cannot handle binary data or data that represent counts. To overcome that limitation the generalised linear model is used [29].

7) Neural Networks:
Neural networks have connected processors called neurons. Input neurons are activated when they receive stimulation from the environment, while the remaining neurons need weights from already activated neurons for activation. In the learning process, the main task for a neural network is to find the weights that give the best desired results [24].

8) Polynomial Regression:
Polynomial regression is a type of linear regression used when either the predictors are non-linear or the relationship between the predictors and the regressor is curvilinear. These are still linear models, just with higher powers of the polynomial involved [30].
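Since polynomial regression is just a linear model over higher powers of the predictors, a compact sketch is possible with the scikit-learn stand-ins assumed above; degree 2 is an illustrative choice, not taken from the paper:

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Linear model fitted on degree-2 polynomial expansions of the
# shortlisted parameters (Entities, Transactions, Length).
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly.fit(Xs, y)
print(poly.score(Xs, y))  # R^2 on the training data
```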

9) Evolutionary Support Vector Machine:
Using the above three parameters, an evolutionary support vector machine model is developed to predict the value of the "Effort" parameter. Effort is the dependent variable, whereas Entities, Transactions and Length are the independent variables. The developed model can be viewed in Figure 11. A hyperplane separates the classes in the dataset such that they have maximum distance between them. SVM is used not only for classification but also for regression, where regression separates the data based on correlation. The input here consists of elements of the set X, where X is a training set with $x_1, x_2, \ldots, x_n$ being the input sample parameters, which in this case are the shortlisted parameters.
The output of the SVM is a set of weights for the input parameters needed for prediction. These weights adjust the values of the elements of X and make the predictions more accurate.
The equation of the hyperplane is

$$w^T x_i + b \geq 0 \quad \text{for } d_i = +1 \qquad (1)$$

where w is the weight vector, x is the input, b is the bias and d is the distance between the plane and the vector point (data point).

In the evolutionary support vector machine there are evolution strategies for fast optimisation. A classical SVM is not able to optimise its objective function if it encounters a negative or non-positive-definite kernel function; hence a better approach is to introduce evolution strategies (ES), with which the weights can be optimised directly. Evolutionary support vector machines simply use ES: real-valued vectors are perturbed by Gaussian-distributed random variables with a given standard deviation, and this random variation drives the transformation. The initial vectors are random. The training set comprises 81 observations from the dataset, on which the proposed model is trained; the test set checked against the proposed model comprises 20 observations.
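To make the idea concrete, here is a minimal (1+1)-ES sketch that directly optimises the weights of a linear model under an ε-insensitive loss. This is an illustration under stated assumptions, not the authors' implementation: the loss, σ, ε, the regularisation factor and the generation count are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def eps_insensitive_loss(w, b, X, y, eps=100.0, lam=1e-3):
    # epsilon-insensitive loss of a linear model plus an L2 penalty on w;
    # eps and lam are illustrative values, not taken from the paper.
    resid = np.abs(X @ w + b - y)
    return np.maximum(resid - eps, 0.0).mean() + lam * (w @ w)

def evolve(X, y, generations=5000, sigma=0.5):
    # (1+1)-ES: perturb the real-valued vector (w, b) with Gaussian noise
    # of standard deviation sigma; keep the child only if it improves.
    w, b = rng.normal(size=X.shape[1]), 0.0  # random initial vector
    best = eps_insensitive_loss(w, b, X, y)
    for _ in range(generations):
        dw = rng.normal(0.0, sigma, size=X.shape[1])
        db = rng.normal(0.0, sigma)
        cand = eps_insensitive_loss(w + dw, b + db, X, y)
        if cand < best:
            w, b, best = w + dw, b + db, cand
    return w, b

X = df[["Entities", "Transactions", "Length"]].to_numpy(dtype=float)
y_true = df["Effort"].to_numpy(dtype=float)
w, b = evolve(X, y_true)
predicted = X @ w + b
```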

V. OBSERVATIONS
The formula for the Kendall coefficient is

$$\tau = \frac{N_c - N_d}{\tfrac{1}{2}\,n(n-1)} \qquad (2)$$

where $N_c$ is the number of pairs of elements in the same order (concordant pairs), $N_d$ is the number of pairs in different order (discordant pairs) and n is the number of elements.

If the value of the Kendall coefficient is greater than 0.5, it implies a high correlation between the variables. These correlation values are calculated for the actual effort and the predicted effort.
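These rank correlations can be computed directly, e.g. with scipy (an assumed tool; `predicted` is the output of the ES sketch above):

```python
from scipy.stats import kendalltau, spearmanr

# Rank correlations between actual and predicted effort.
tau, _ = kendalltau(y_true, predicted)
rho, _ = spearmanr(y_true, predicted)
print(f"Kendall tau = {tau:.3f}, Spearman rho = {rho:.3f}")
```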
From the results of the prediction we can deduce that, to predict the effort required for a project in person-hours, the crucial parameters are the transactions performed within a module of the developed software, the length in months of each project, and the entities in the development model.

VI. CONCLUSION AND FUTURE WORK
This paper set out to find a solution for estimating software project effort using minimum information from the project history of the same organisation. To find the best parameters for prediction, two methods were used: (1) the correlation matrix and (2) the decision tree. Two methodologies were used simply to cross-verify the results of each test. Both tests produced the same result, and three parameters were selected for prediction. Several prediction models were built and trained on the selected parameters. The evaluation results revealed that the evolutionary support vector machine gives the best prediction results. The final result is that if we know only the number of entities in a project, the transactions of that project and the length of the project in months, we can predict the effort required for that project using the evolutionary support vector machine.
In the future, other algorithms, such as bagging and ensemble algorithms, can be implemented for predicting project effort. Results would be observed for each in order to identify better performance in terms of accuracy of the prediction estimates.
If the value of the correlation is greater than +0.70, it indicates a high positive correlation; in our case it is 0.965.
The formula for Spearman's rho is

$$\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} \qquad (3)$$

where ρ is the Spearman rho coefficient, $d_i$ is the difference between the ranks of the x and y values, and n is the number of elements in the dataset.

Fig. 11. Evolutionary Support Vector Machine model for prediction.

TABLE VI. SHORTLISTED ATTRIBUTES

TABLE VII. OTHER PREDICTION MODELS CORRELATIONS