Software Project Estimation with Machine Learning

This project presents research on software effort estimation using machine learning algorithms. Software cost and effort estimation is a crucial part of software project development: it determines the budget, time, and resources needed to develop a software project. One of the well-established software project estimation models is the Constructive Cost Model (COCOMO), which was developed in the 1980s. Even though this model is widely used, COCOMO has some weaknesses, and software developers still face inaccurate effort and cost estimates. Inaccuracy in the estimated effort affects the schedule and cost of the whole project as well. The objective of this research is to use several machine learning algorithms to estimate the effort of software project development. The best machine learning model is then chosen to compare against COCOMO.

Keywords—Software effort estimation; project estimation; constructive cost model; COCOMO; machine learning


I. INTRODUCTION
Impractical project plans and budget overruns create problems for software professionals, their clients, and stakeholders. Despite many studies and numerous attempts to learn from experience, inaccurate estimates remain common and the problem has not been solved yet [1]. Software cost, effort, and resources are estimated at the beginning of development. Developers and clients use that information to estimate the budget and time needed to finish developing an application or a system. Techniques and models have been invented to assist developers in estimating budget and effort. However, inaccurate estimation remains one of the problems facing developers and stakeholders. Even the emergence of a well-established project estimation model in the 1980s, the COCOMO model, did not solve the problem of inaccuracy in software project estimation. Therefore, in this research, machine learning algorithms are used to estimate the effort of a software project more accurately than the COCOMO model. COCOMO model datasets are used to build the machine learning models.
Although the COCOMO model has many advantages, it has some weaknesses too. One of them is that its estimates vary as time progresses [2]. Furthermore, the COCOMO model depends on historical project data, which are not available at all times [3], and it cannot be used to estimate in all Software Development Life Cycle phases [3]. A large amount of data is required for the COCOMO model to work [4]. A user has to supply 15 effort multipliers as input in order to get output from the COCOMO model, which consumes a lot of time for an industry that has to estimate a large number of projects. COCOMO also has difficulty learning and identifying data patterns [4], which is an important element in a regression model such as COCOMO.
This research has three main objectives: first, to pre-process and analyze the COCOMO dataset; second, to apply several machine learning algorithms to predict effort from the COCOMO dataset; and third, to evaluate the performance of the selected algorithms against the COCOMO method.
An application called SOFREST Estimator is developed to demonstrate how the estimation works and which inputs are needed to produce the outcome. The application requires the user to enter five inputs about a project: number of Lines of Code (LOC), Database Size (DATA), Required Software Reliability (RELY), Execution Time Constraint (TIME), and Main Storage Constraint (STOR). The output of the application is the estimated effort needed for that particular project, in person-months.
This project is significant because it concerns the accuracy of predicting the budget and time needed to develop a whole project. By estimating project cost and time with machine learning, higher accuracy can be achieved, since a machine learning prediction model is built by training and testing on the dataset. With machine learning as the project cost and effort estimator, money and time can be saved because less human effort is needed. This project will be useful for any software development company that needs to estimate the cost and effort of a project. The best algorithm for project cost and time estimation can be determined based on the highest accuracy among the machine learning models.

A. Software Project Estimation
Project estimation is an essential part of completing a project. Projects are planned in terms of cost, effort, and budget at the beginning phase of development. Precise effort estimation plays a key role in predicting how large a workforce should be prepared during a software project so that it can be completed on time and within the planned budget without sacrificing software quality [5]. The accuracy of development cost estimation is a key factor in the success of a project; it influences decision-making by the stakeholders of a software project [6] and the bidding of contracts with them [7].
The capacity of a budget estimating model is determined by calculating its bias, stability, and precision. These measures are concerned with the average difference between actual and estimated costs, the degree of variation around that average, and the combination of bias and consistency. By far, the most popular evaluation criteria involve statistics such as the mean, standard deviation, and coefficient of variation [6].
Identifying and calculating software metrics is important for various reasons, including estimating programming execution, measuring the effectiveness of software processes, estimating the effort required for processes, reducing defects during software development, and monitoring and controlling software project execution [8]. A recent example of wrong cost estimation is the budget for the international arrivals facility being built at Seattle-Tacoma International Airport in Seattle, Washington, USA. In 2013 the budget was initially estimated at US$300 million, but by September 2018 it had increased to US$968 million [9]. Research shows that projects usually seem unclear at the beginning and become less vague as they progress [10].
One of the software metrics used to estimate cost and effort is the lines of code (LOC) metric, considered a basic software metric [11], as it is used in most software project estimation techniques.

B. Project Estimation Techniques
It is hard to quickly and accurately predict the development budget at the planning stage because the documentation is generally incomplete. For this reason, various procedures have been created to accurately predict costs with the limited project data available in the early phase [6]. Three well-known models have been used to estimate project effort, cost, and resources: the Constructive Cost Model (COCOMO), the Analogy-based model, and the Use Case Points model.

1) COCOMO model: COCOMO (Constructive Cost Model) is a screen-oriented, interactive software package that assists in budgetary planning and schedule estimation of a software project [12]. The intermediate COCOMO model uses 15 drivers to estimate the cost of a project. The drivers are classified into four attributes: product attributes, hardware attributes, personnel attributes, and project attributes [7]. Table I shows the intermediate drivers of the COCOMO model. Each driver has its own multipliers (refer to Table II), divided into six categories: Very Low (VL), Low (L), Neutral (N), High (H), Very High (VH), and Extra High (XH) [7].
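As a concrete sketch of how the intermediate COCOMO estimate is computed, one possible implementation follows. The function name is ours; the mode constants are Boehm's published (a, b) pairs, and EAF denotes the product of the 15 effort-multiplier values.

```python
def intermediate_cocomo_effort(kloc, eaf, mode="organic"):
    """Effort in person-months: E = a * KLOC**b * EAF."""
    # Mode constants (a, b) from Boehm's intermediate COCOMO.
    coeffs = {
        "organic": (3.2, 1.05),
        "semi-detached": (3.0, 1.12),
        "embedded": (2.8, 1.20),
    }
    a, b = coeffs[mode]
    # EAF is the product of the 15 effort-multiplier ratings.
    return a * (kloc ** b) * eaf
```

For example, a 10 KLOC organic-mode project with a neutral EAF of 1.0 yields roughly 35.9 person-months, and any multiplier above 1.0 raises the estimate proportionally.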
2) Analogy-based model: The core of the Analogy-based model is to compare the project to be estimated with historical data from former software projects. The dataset may come from the company itself or from publicly available sources. The comparison identifies which former projects are similar to the current project whose cost and effort are being estimated; the most similar projects are then reused so that the effort of the new project can be derived. Similarity, that is, how near two projects are to each other, is measured on each type of attribute using three distance measures: Euclidean, Manhattan, and Minkowski distance [5].
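The three distance measures above are special cases of a single formula: Manhattan distance is Minkowski with p = 1 and Euclidean is Minkowski with p = 2. A minimal sketch, assuming projects are represented as numeric attribute vectors (the function name is ours):

```python
def minkowski(x, y, p):
    # Distance between two projects described by numeric attribute vectors.
    # p = 1 gives Manhattan distance, p = 2 gives Euclidean distance.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)
```

In analogy-based estimation, the former projects with the smallest distance to the new project would be selected as the analogies.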
3) Use-case points model: Use Case Points (UCP) is a notable size estimation method designed mainly for object-oriented projects; it uses the use case diagram to estimate project size at the beginning of the development phase. UCP was inspired by Function Points, an earlier sizing method that depends on functional requirements [13].

C. Machine Learning
Artificial Intelligence (AI) makes a computer much more intelligent by giving it the ability to reason. One of the subfields of AI is Machine Learning (ML), in which computer intelligence is developed through various methods of learning. There are many types of machine learning: supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning, evolutionary learning, and deep learning [14]. Machine learning models are built by learning from a dataset using algorithms such as Regression Tree, Linear Regression, Neural Network, Naïve Bayes, Support Vector Machine, K-Nearest Neighbor, and Random Forest. Training and testing are carried out on the dataset to build the ML models. ML has already delivered self-driving cars, speech recognition, more effective web search, and an improved understanding of the human genome. Today machine learning is so widespread that one may use it many times a day without knowing it. Many researchers consider it an excellent path towards human-level intelligence, as machine learning has advanced to the point that it can recognize speech like a human [15].
Machine learning is a subfield of computer science. It allows machines such as computers to build analytical models of data and find hidden insights by learning from the data itself. It has been applied to a variety of aspects of modern society, ranging from DNA sequence classification, credit card fraud detection, and robot locomotion to natural language processing. It can be used to solve many types of tasks, such as classification and prediction. Software project estimation is one of the tasks machine learning is capable of [16].
Machine learning is important because it stays up to date with the current environment, and a model keeps improving its performance by learning from data or experience. Human effort and mistakes can be reduced by using machine learning.
The traditional software effort performance criteria are not accurate or satisfying. Many metrics and techniques for cost estimation have been proposed. Unfortunately, most of them lack one or both of two characteristics: a sound conceptual and theoretical basis, and statistically significant experimental validation. Most performance criterion metrics have been defined by an individual and then tested in a very limited environment [17]. There is therefore a need to design an optimization algorithm for correct, precise, and reliable effort estimation [18].
Data mining is about gaining insight into data in order to detect useful patterns that carry information. It has a record of success in business and, more recently, in scientific applications. Data mining is usually carried out using process models and employs dozens of techniques spanning a wide spectrum of interdisciplinary fields, including statistics, machine learning, and pattern recognition. Its use in software project prediction has recently gained remarkable popularity, motivated by the large errors of traditional estimation methods and the continuous improvement of machine learning algorithms, which can help provide more accurate predictions [7].
Machine learning has been used to predict software project cost and effort since the late 1980s [6]. The estimation performance of machine learning models is measured by calculating metrics including the Sum of Squared Errors (SSE), Root Mean Square Error (RMSE), Mean Magnitude of Relative Error (MMRE), and Mean Absolute Error (MAE).
These are the well-known parameters used for the performance evaluation of methods [19]. The evaluation consists of comparing the accuracy of the estimated effort with the actual effort. There are many evaluation criteria for software effort estimation; among them, the most frequent is the Magnitude of Relative Error (MRE) [20]. Linear regression and the multilayer perceptron are the most popular machine learning algorithms for software development effort estimation [21].
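The evaluation metrics above can be sketched in a few lines (function names are ours): MRE divides the absolute prediction error by the actual effort, and MMRE averages it over all projects.

```python
def mre(actual, predicted):
    # Magnitude of Relative Error for a single project.
    return abs(actual - predicted) / actual

def mmre(actuals, predictions):
    # Mean Magnitude of Relative Error over all projects.
    return sum(mre(a, p) for a, p in zip(actuals, predictions)) / len(actuals)

def mae(actuals, predictions):
    # Mean Absolute Error.
    return sum(abs(a - p) for a, p in zip(actuals, predictions)) / len(actuals)

def rmse(actuals, predictions):
    # Root Mean Square Error.
    return (sum((a - p) ** 2 for a, p in zip(actuals, predictions)) / len(actuals)) ** 0.5
```

For actual efforts of 100 and 200 person-months predicted as 110 and 180, each MRE is 0.1, so the MMRE is 0.1 (a 10% average relative error).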
In this research, four machine learning algorithms are used: Random Forest (RF), Linear Regression (LR), Regression Tree (RT), and Support Vector Machine (SVM). Each model's performance is measured by the Mean Magnitude of Relative Error (MMRE). Among the four algorithms, the best one is chosen to build a machine learning model that can predict the cost and time of a software project.

D. Literature Analysis
This section contributes to the knowledge of previous studies on software project estimation using state-of-the-art machine learning techniques. In addition to the limitations mentioned in Subsection C, most previous studies measured project and cost estimation using a single machine learning algorithm, which exposes the limitations of that particular algorithm. Moreover, most previous related work presents an inadequate comparison and evaluation of its proposed solution. Therefore, this paper extends the evaluation to several similar machine learning algorithms on four different datasets to further investigate the performance of the most sophisticated algorithm compared to the COCOMO model.

III. METHODOLOGY
This research uses an experiment as its methodology to develop a prediction model for software project cost and effort estimation using selected machine learning algorithms. The experimental procedure is illustrated in Fig. 1. The four selected machine learning algorithms are Support Vector Machine, Linear Regression, Regression Tree, and Random Forest.

A. Data Collection
In this experiment, software project measurement datasets are used to develop the prediction model with machine learning. All datasets can be accessed publicly from http://promise.site.uottawa.ca/serepository/datasets-page.html and from a study conducted by Kaushik et al. (Table III).

B. Data Pre-processing
The data is pre-processed before calculating the effort estimation. In this experiment, the data is imported into RStudio. The mice package is used to check for missing values; the datasets contain none. The driver values, given as categorical rating weights, are converted to numerical values to avoid bias while constructing the machine learning model. The mode constant is assigned based on the COCOMO predefined values.
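A minimal sketch of this rating-to-number conversion follows. The table shows illustrative multipliers for one driver (RELY) taken from Boehm's intermediate COCOMO table; the function and variable names are ours, not the paper's.

```python
# Illustrative rating-to-multiplier table for the RELY driver,
# with multiplier values from Boehm's intermediate COCOMO table.
RELY_MULTIPLIERS = {"VL": 0.75, "L": 0.88, "N": 1.00, "H": 1.15, "VH": 1.40}

def encode_drivers(record, tables):
    # Replace each categorical rating (e.g. "H") with its numeric multiplier
    # so the regression models see numbers rather than category labels.
    return {driver: tables[driver][rating] for driver, rating in record.items()}
```

A project rated "H" on RELY would thus contribute the numeric value 1.15 to the model's input vector.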

C. Prediction Model Development
Four regression machine learning algorithms are used for this experiment.
1) Linear regression model: The linear regression model summarizes the relationship between two variables, an independent and a dependent variable. The practical use of linear regression in this experiment is to find an approximate prediction with a predictive model. The relationship between the predictions and the actual data is then observed from the best-fit line, which is the line for which the total prediction error is as small as possible.
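For a single predictor, the best-fit line can be obtained in closed form by ordinary least squares; a minimal sketch (the function name is ours):

```python
def fit_simple_lr(xs, ys):
    # Ordinary least squares for y = slope * x + intercept.
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # Covariance of x and y over variance of x gives the slope.
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept
```

With x = LOC and y = actual effort, the fitted slope and intercept define the best-fit line used for prediction.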
2) Support vector machine: The support vector machine is a linear model for classification and regression problems; both linear and non-linear problems can be solved with it. The aim of this model is to create a hyperplane that separates the data into classes. The margin between the data points and the hyperplane is maximized to reduce misclassification. The model can also be used to address unbalanced data problems.
3) Regression tree algorithm: A regression tree is a type of decision tree; it is a method that creates and visualizes prediction models from data. The model produces a numeric output, and an average value is assigned to each leaf of the tree. Decision-making with a regression tree is easier than with other methods because undesired data is filtered out, reducing the work on the data as it goes deeper into the tree. The regression tree is used for its ability to reduce ambiguity in decision-making.

4) Random forest algorithm: The random forest is a model made up of many decision trees, where each tree depends on the values of a random vector. The model is called random because it uses random sampling of the training data points when building trees and random subsets of the features when splitting nodes. Each tree in the forest learns from a random sample of the data points. The random forest is used because it can produce a highly accurate classifier.

Min-Max Accuracy is a useful metric of how close the predictions are to the actuals; it averages, for each prediction-actual pair, the ratio of the smaller value to the larger one. The higher the Min-Max Accuracy, the better the accuracy.
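One common formulation of Min-Max Accuracy averages, over all projects, the ratio of the smaller to the larger of each actual/predicted pair; a sketch under that assumption (the function name is ours):

```python
def min_max_accuracy(actuals, predictions):
    # For each pair, min/max is 1.0 for a perfect prediction and
    # shrinks towards 0 as the prediction diverges from the actual.
    return sum(min(a, p) / max(a, p)
               for a, p in zip(actuals, predictions)) / len(actuals)
```

A prediction of 50 against an actual of 100 scores 0.5 for that pair, so the metric directly reflects relative closeness regardless of which side the error falls on.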
Correlation Accuracy, the correlation between predicted and actual values, is used as an accuracy measure. The Pearson product-moment correlation coefficient measures the strength of the relationship between the predicted and actual values of the experiment. When the correlation accuracy is high, the predicted and actual values move in similar directions. The p-value, also known as the calculated probability, determines the significance of the experimental results. A p-value lower than 0.05 is strong evidence against the null hypothesis, and thus the null hypothesis is rejected. The smaller the p-value, the stronger the evidence for rejecting the null hypothesis.
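The Pearson product-moment coefficient used for correlation accuracy can be computed directly from its definition; a minimal sketch (the function name is ours):

```python
def pearson(xs, ys):
    # Pearson r: covariance of x and y divided by the product
    # of their standard deviations; ranges from -1 to 1.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs)
           * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den
```

A value near 1 means predicted and actual efforts rise and fall together, which is what a high correlation accuracy indicates in the experiments.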
The null hypothesis of this project is that the population correlation coefficient is not significantly different from zero: there is no significant linear correlation between the control and experimental values in the population.
The alternative hypothesis of this project is that the population correlation coefficient is significantly different from zero: there is a significant linear relationship between the control and experimental values in the population [22].
The Vargha and Delaney A (VDA) measure is one example of an effect-size measure; it differentiates between two samples of observations, a control sample and an experimental sample. The range of VDA is from 0 to 1, and a value of 0.50 indicates no effect [23]. Interpretation of the A measure:
• A value around 0.56 indicates a small effect;
• A value around 0.64 indicates a medium effect;
• A value around 0.71 indicates a big effect.
The Wilcoxon rank sum test is a nonparametric test used to compare two samples to see whether their population ranks differ. The null hypothesis is that the two samples have equal medians; the alternative hypothesis is that the medians differ. If the p-value is larger than 0.05, we fail to reject the null hypothesis, as there is not enough evidence to conclude a difference. If the null hypothesis is rejected, there is sufficient evidence to conclude that the samples do not have identical distributions [24].
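The Vargha and Delaney A measure counts, over all pairs drawn from the two samples, how often the first sample's value exceeds the second's, with ties counting half; a minimal sketch (the function name is ours):

```python
def vda(xs, ys):
    # A = P(X > Y) + 0.5 * P(X == Y) over all cross-sample pairs.
    # 0.5 means no effect; values far from 0.5 mean a larger effect.
    m, n = len(xs), len(ys)
    greater = sum(1 for x in xs for y in ys if x > y)
    equal = sum(1 for x in xs for y in ys if x == y)
    return (greater + 0.5 * equal) / (m * n)
```

Two identical samples score exactly 0.5 (no effect), while a sample that dominates the other entirely scores 1.0.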

D. Data Training and Testing
The training set is 80% and the test set 20% of the data for COCOMO81 and COCOMO NASA 1. The training set is 70% and the test set 30% for COCOMO NASA 2 and the Kaushik et al. dataset.
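A hedged sketch of such a random split follows; the paper does not state how rows were shuffled, so the seed and function name here are our assumptions.

```python
import random

def train_test_split(rows, train_frac, seed=42):
    # Shuffle a copy of the rows reproducibly, then cut at the chosen fraction.
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]
```

For COCOMO81 and COCOMO NASA 1 this would be called with train_frac=0.8, and with train_frac=0.7 for COCOMO NASA 2 and the Kaushik et al. dataset.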

IV. DATA ANALYSIS
In this project, a correlation matrix is used to evaluate the correlation between pairs of variables. The dependent variable is the actual effort attribute, while the 15 cost drivers and the lines of code are the independent variables. Fig. 2, 3, and 4 show the correlation matrices of the datasets.

A further experiment in this research analyzes the regression models on the COCOMO dataset with selected attributes using the A measure and the Wilcoxon rank sum test. From Table VIII, the project calculates the accuracy metrics of COCOMO NASA 1 and trains and tests Support Vector Machine and Random Forest with all attributes of COCOMO NASA 1, with 5 selected attributes, and with one attribute only, LOC. From the results, COCOMO NASA 1 has the highest correlation accuracy compared to the experiments with the machine learning models. There is sufficient evidence to reject the null hypothesis, as the p-value of COCOMO NASA 1 is smaller than 5 percent; thus COCOMO NASA 1 is statistically significant. The MMRE of COCOMO NASA 1 is smaller than the machine learning models' MMRE; a smaller MMRE indicates a more accurate estimation [25].
As for the Vargha and Delaney A measure, there is no effect of difference between the actual effort and the predicted values of the COCOMO NASA 1 model; to support this statement, the rank sum p-value is used to measure the distribution between the control and experimental samples. The p-value of the Wilcoxon rank sum test is higher than 5 percent, so we fail to reject the null hypothesis: the two samples have identical distributions.
Comparing the machine learning models trained with all attributes to those trained with only the 5 selected attributes, the five selected attributes perform better, producing more accurate results. The correlation accuracy of the five selected attributes shows a stronger relationship between the actual and predicted effort values than that of all attributes. The MMRE of the five selected attributes is lower than that of all attributes, showing that the five selected attributes yield more accurate estimations between the actual and predicted effort values. For the Vargha and Delaney A measure, there are only slight differences between all attributes and the five selected attributes; Support Vector Machine with all attributes shows no effect difference compared to the Random Forest model. The rank sum p-values for all attributes and the five selected attributes indicate enough evidence to support the null hypothesis and reject the alternative hypothesis.
The next experiment uses only one attribute, LOC. The correlation accuracy of the machine learning models increases compared to five attributes and all attributes. The p-value of Pearson's correlation also shows that the models are statistically significant. The MMRE of Random Forest (67.6706) is lower than that of Support Vector Machine. The Vargha and Delaney A measure shows no effect difference between Support Vector Machine and Random Forest, meaning the distributions of actual and predicted effort values are identical. For the Wilcoxon rank sum test of Support Vector Machine and Random Forest, the p-value of both machine learning models is higher than 5 percent, so there is enough evidence to support the null hypothesis and reject the alternative hypothesis.
To conclude, the machine learning models show that not all attributes are needed for training. In this experiment, using one attribute, LOC, yields an MMRE closer to the COCOMO prediction model, a higher correlation accuracy, and identical distributions of actual and predicted effort values.
From Table IX, the project calculates the accuracy metrics of COCOMO NASA 2 and trains and tests Support Vector Machine and Random Forest with all attributes of COCOMO NASA 2, with 5 selected attributes, and with one attribute only, LOC. COCOMO NASA 2 has 93 projects, the highest number compared to the other COCOMO datasets. From the results, COCOMO NASA 2 has a lower correlation accuracy than the experiments with machine learning models that used all attributes. There is not sufficient evidence to reject the null hypothesis, as the p-value of COCOMO NASA 2 is larger than 5 percent; thus COCOMO NASA 2 is not statistically significant. The MMRE of COCOMO NASA 2 is smaller than the machine learning models' MMRE.
A smaller MMRE indicates a more accurate estimation [25]. As for the Vargha and Delaney A measure, there is a medium difference between the actual effort and the predicted values of the COCOMO NASA 2 model; however, the Wilcoxon rank sum p-value is used to measure the distribution between the control and experimental samples. The p-value of the Wilcoxon rank sum test is higher than 5 percent, so we fail to reject the null hypothesis: the two samples have identical distributions. COCOMO NASA 2 is still statistically significant according to the rank sum test.

In the experiment where the machine learning models use all attributes, there are large differences between the results of the Support Vector Machine and Random Forest models. The correlation accuracy of Random Forest is lower than that of Support Vector Machine, and Random Forest also has a higher MMRE value than Support Vector Machine. Moreover, for the Vargha and Delaney A measure, Support Vector Machine shows a large difference between the predicted and actual effort values, while Random Forest shows closer accuracy between the control and experimental samples. For the Wilcoxon rank sum test of the Support Vector Machine and Random Forest models, the p-value of both machine learning models is higher than 5 percent, so there is enough evidence to support the null hypothesis and reject the alternative hypothesis.
Next, the experiments compare all attributes with the five selected attributes. The correlation accuracy of the five selected attributes shows a stronger relationship between the actual and predicted effort values than that of all attributes. However, the MMRE of Random Forest with the five selected attributes is higher than that of Random Forest with all attributes, while for Support Vector Machine it is the reverse.
For the Vargha and Delaney A measure, the five selected attributes show medium differences between the actual and predicted effort values for both machine learning models. The rank sum p-values for all attributes and the five selected attributes indicate enough evidence to support the null hypothesis and reject the alternative hypothesis.
The next experiment uses only one attribute, LOC. The correlation accuracy of the machine learning models drops compared to five attributes and all attributes. The p-value of Pearson's correlation shows that only the Random Forest model is statistically significant. The MMRE of Random Forest is lower than that of Support Vector Machine, compared to all attributes and the COCOMO NASA 2 model. The Vargha and Delaney A measures of Support Vector Machine and Random Forest show a small difference, meaning the distributions of actual and predicted effort values are nearly identical.
For the Wilcoxon rank sum test of Support Vector Machine and Random Forest, the p-value of both machine learning models is higher than 5 percent, so there is enough evidence to support the null hypothesis and reject the alternative hypothesis.
As with COCOMO NASA 1, this experiment shows that not all attributes are needed for training: using one attribute, LOC, yields an MMRE closer to the COCOMO prediction model and nearly identical distributions of actual and predicted effort values.
From Fig. 5, the Support Vector Machine and Random Forest models show patterns similar to the actual effort, while the COCOMO calculation deviates at 100, 14, 302, 113, 350, and 339 lines of code. Support Vector Machine and Random Forest produce predicted effort values closer to the actual effort than the predictions of the COCOMO model. Refer to the appendices for the development of the machine learning models.

V. DISCUSSION
The results of the experiments clearly show that the Support Vector Machine and Random Forest algorithms give impressively consistent results on the COCOMO datasets regardless of the number of effort attributes used. However, the planned comparison in this paper reveals the good performance and increased accuracy of the Support Vector Machine for estimating software project effort. The Support Vector Machine also delivers significantly better results with the five important attributes than with all attributes, as some of the attributes are irrelevant to the estimation.

VI. CONCLUSION AND FUTURE WORK
To conclude, many existing machine learning algorithms can train predictive models; however, the right and suitable machine learning model is needed to give an accurate estimation. In this research, the five selected attributes with a high positive correlation to the actual effort attribute, obtained from the correlation matrix, are DATA, STOR, LOC, TIME, and TOOL. These five important attributes give better results than using all the attributes in the COCOMO dataset; hence, not all attributes in the dataset are relevant for measuring project estimation. As a further improvement, during training and testing it is advisable to split the data into three parts: 70 percent training data, 20 percent testing data, and 10 percent validation data.
A further improvement of this research is to perform ensemble stacking, also known as blending, to combine the four machine learning models and optimize the predictive model. In the development of the machine learning model, a percentage accuracy should be included to show the percentage difference between the predicted value and the actual effort value.