Software Effort Estimation through Ensembling of Base Models in Machine Learning using a Voting Estimator

—For a long time, researchers have been working to predict software development effort with the help of various machine learning algorithms. These algorithms are known for capturing the underlying patterns in data and improving prediction accuracy over conventional approaches such as lines-of-code and function-point methods. According to the no-free-lunch theorem, no single algorithm gives better predictions on all datasets. To overcome this bias, our work aims to provide a better model for software effort estimation and thereby reduce the gap between the actual and predicted effort for future projects. The authors propose an ensemble of regression models using a voting estimator, reducing the error rate and overcoming the bias introduced by any single machine learning algorithm. The results obtained show that the ensemble models performed better than the single models on the different datasets.


I. INTRODUCTION
For a given project, software effort estimation is always a burdensome task. For an extended period, team and finance managers have striven to precisely calculate the effort, cost, and time while helping evaluate the project's schedule and budget parameters [1]. It is very tough to predict these specifications during the early stages of a project, when the scope of every module has yet to be marked out and there is still no conclusive evidence of the product's ultimate functional requirements [2]. Most frequently, insufficient knowledge of the affecting factors and possible risks, fear of work deadlines, and conventional effort estimation methods [3], which rest on the opinions of various software domain experts, lead to erroneous estimates. Because of these, the software product may not be delivered on time with the expected non-functional requirements. Despite continual improvements in software development standards, surveys [4] show that only about a quarter of all started projects succeed. These issues, which result in exceeding the budget or schedule, may lead to a project's termination [5]. Though the use of agile methodology [6] has reduced some of these concerns, project failures still occur, even where only a limited number of projects are undertaken, and achieving project success remains a problem, whatever project managers may claim. The causes include poor skill sets within project teams and weak bonds with stakeholders. The chances of project success are much higher if the effort is well predicted from the beginning. But once a project's parameters are set, it is not a good idea to increase either the budget or the schedule, because that could introduce risks that are hard to predict.
The client's approval in that scenario must be independent of the process model chosen for project development and of the time and cost determination. For this case, simple and easy conventional techniques accepted by experts, such as PERT and CPM, are primarily deployed [7]. Because those techniques are vulnerable, researchers started working on methods based on software lines of code and function points [8]. In various ways, these software parameters try to keep up with improving technologies. However, such techniques struggle to keep pace in a fast-growing world, specifically with reusable components and software dependencies that have already advanced so far.
After considering all these drawbacks, researchers dug deep for efficient predictive techniques for effort, especially in the data science area [9]. This area is highly trusted, having proven its potential with uncertain and unstructured data, and hence it is believed that such methods can estimate effort and duration far better than existing models. Unlike other methods, they learn patterns from previous data and do not rely on human influence, which makes them unique. Behind this lies systematic research that builds the best prediction model to reduce the error rate on the data. For some time, however, biased models have been generated: their applicability is limited to a particular dataset, repeatedly underfitting or overfitting despite varied ensemble techniques. Data preparation plays a critical role and has a crucial impact on the model, but divergent methods must be used. Even individual algorithms may not give an improved score, as is evident from various journals. The same holds for effort and cost prediction when working with such data and building models with those algorithms [10].
Though using all project parameters produced better results in the literature, some works included proper feature selection strategies, eliminating irrelevant features for less computational cost and achieving a distinctive effect with fewer features than the others. Some of these are the Genetic Algorithm [11], PSO [12], and WCO [13].
This work aims to improve upon existing models built on reliable machine learning algorithms for the best effort prediction. To this end, a hybrid model based on an averaging ensemble of various regressors is proposed. The proposed work also conducted experiments on various datasets, such as COCOMO81, China, Desharnais, Maxwell, Kemerer, and Albrecht, to evaluate the behavior of the model across datasets.

II. RELATED WORK
Amini et al. [14] combined two techniques, namely embedded and wrapper methods. The main aim of their work is to integrate GA into regularized learning to improve prediction accuracy in regression problems. Their study reduced the dimension of the feature space by over 80% without affecting accuracy. De Carvalho et al. [15] proposed an Extreme Learning Machine (ELM) for forecasting software effort. The Pearson correlation coefficient is used for selecting the best features. ELMs with different numbers of hidden neurons are used in their work. The ELM results are compared with models from the literature, namely LR, SVM, KNN, and MLP, with MAE as the comparison metric.
Ghosh et al. [16] proposed a binary variant of SFO for feature selection. This work compared ten state-of-the-art techniques and found that BSFO based on adaptive hill climbing showed better reliability. Carbonera et al. [17] surveyed over 120 studies, encouraging researchers to minimize the feature space in effort estimation. Chhabra et al. [18] worked on a soft-computing fuzzy model along with PSO; this fuzzy logic improved the existing COCOMO technique, with MRE as the metric for result comparison. Ghatasheh et al. [19] proposed a firefly algorithm to optimize software effort estimation, with results better than the conventional models used earlier.
Wani et al. [20] worked on ANN and PSO. The limitation of their work is that the combination gives better results only on the COCOMO81 dataset. The ANN showed faster training than MLP, and the method achieved lower MdMRE and MRE than other models. Ali et al. [21], [24] optimized the parameters of the COCOMO II model using particle swarm optimization. PSO is a valuable strategy for resolving dataset uncertainty and optimizing the values. The authors worked on a Turkish software industry dataset.
Hosni et al. [25] concentrated on parameter tuning of ensembles using grid search optimization. The authors evaluated results over seven datasets, comparing statistical measures, namely mean, median, and inverse-ranked weighted mean. They used three algorithms, GS, PSO, and UC-Weka, and concluded that PSO gained over the other approaches. Goyal et al. [26] proposed an SG5 neural network model trained on the COCOMO dataset and tested on the Kemerer dataset; it excelled over the traditional models. Padhy et al. [27] developed an aging- and survivability-related reusability optimization model, where software metric estimation is done with the help of UML or class diagrams. To overcome the limitations of ANNs, different Evolutionary Computing (EC) algorithms such as Genetic Algorithms, Differential Evolution, and Particle Swarm Optimization (PSO) have been proposed. By implementing these algorithms, the regression outputs are improved so that the results are significantly more accurate and effective.
Pospieszny et al. [28] proposed ensemble averaging of SVM, MLP, and GLM with 3-fold validation to predict both effort and duration, using the standard ISBSG dataset and the MMRE and PRED metrics. Shekhar et al. [29] discussed various software cost estimation techniques and models, classifying them into algorithmic and non-algorithmic approaches, which helps a software team rule out weaker methods and provides specific criteria for choosing an approach.
Venkatesh et al. [30] estimated the workforce to determine the cost and effort of a project, outperforming other models such as regression models and neural networks. This work was applied to several PROMISE datasets, with RMSE as the root metric. Nassif et al. [31] worked on four different neural networks, with the oldest projects used for training and the newest for testing, under ten-fold cross-validation. The authors concluded that on 60% of the datasets CCNN performed better than the other models, while on 40% of the datasets RBFNN performed best. Miandoab et al. [32] proposed a hybrid algorithm combining particle swarm optimization and fuzzy logic.
Dizaji et al. [33] combined Ant Colony Optimization (ACO) with the Lorentz transformation as a Chaos Optimization Algorithm (COA). These meta-heuristic algorithms are used to estimate the cost of the software, with Mean Absolute Relative Error (MARE) taken into consideration. The dataset is classified and distributed between the plain ACO and the hybrid ACO-COA algorithms according to their functionalities. The results show that performance is improved when the ACO algorithm is combined with COA.

III. METHOD
The proposed approach introduces a novel method of using ensemble techniques with voting for software development effort estimation. This approach combines the strengths of multiple models and leverages the diversity of their predictions to improve accuracy. By investigating the impact of different factors on the accuracy of the ensemble with voting, this approach can provide insights into how to optimize the performance of the ensemble for different datasets and problems. The proposed approach can also have practical applications for software development organizations, as it can help them to make more accurate and informed decisions about project planning and resource allocation. The proposed architecture is illustrated in Fig. 1.
• Collect historical project data: Gather historical project data including information on the size of the project, the number of developers involved, the complexity of the software, and the amount of time and resources required to complete the project.
• Preprocess the data: Preprocess the data to remove any outliers or errors, and to convert the data into a format that can be used by the ensemble models.
• Train multiple estimation models: Train multiple estimation models on the preprocessed data, such as linear regression, decision trees, neural networks, and support vector machines.
• Implement the voting algorithm: Implement the voting algorithm to combine the predictions from the multiple models. There are different types of voting algorithms such as majority voting, weighted voting, and threshold voting.
• Evaluate the ensemble with voting: Evaluate the accuracy of the ensemble with voting using a validation set of historical data that was not used during training. Compare the performance of the ensemble with voting against individual models and other ensemble techniques.
• Investigate the impact of different factors: Investigate the impact of different factors on the accuracy of the ensemble with voting, such as the number of models, the type of models, the voting algorithm, and the size and quality of the historical data.
• Apply the ensemble with voting to new data: Apply the ensemble with voting to new software development projects to assess its accuracy and reliability in real-world scenarios.
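The steps above can be condensed into a minimal end-to-end sketch. The toy project records, the two stand-in base learners (a mean predictor and a 1-nearest-neighbour regressor), and the train/test split are illustrative assumptions, not the datasets or models used in this study:

```python
# Minimal end-to-end sketch: normalise features, train two simple regressors,
# average their predictions (voting), and report the absolute error.
# Toy data and model choices are illustrative only.

def minmax(col):
    lo, hi = min(col), max(col)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in col]

# Toy project records: (size_kloc, team_size) -> effort (person-months)
X_raw = [(10, 3), (25, 5), (40, 8), (60, 10), (80, 12)]
y = [24.0, 60.0, 98.0, 150.0, 200.0]

# Normalise each feature column to [0, 1]
cols = list(zip(*X_raw))
X = list(zip(*[minmax(list(c)) for c in cols]))

def mean_model(X_train, y_train):
    m = sum(y_train) / len(y_train)
    return lambda x: m                      # always predicts the training mean

def nearest_model(X_train, y_train):
    def predict(x):
        d = [sum((a - b) ** 2 for a, b in zip(x, xt)) for xt in X_train]
        return y_train[d.index(min(d))]     # 1-nearest-neighbour regressor
    return predict

models = [mean_model(X[:4], y[:4]), nearest_model(X[:4], y[:4])]
x_test, y_test = X[4], y[4]

# Voting by averaging the individual predictions
preds = [m(x_test) for m in models]
vote = sum(preds) / len(preds)
print(abs(vote - y_test))                   # absolute error of the ensemble
```

The same skeleton applies unchanged when the stand-in learners are replaced by the trained regressors described in the following subsections.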

A. Data Preprocessing
Preprocessing of data involves data cleansing approaches. It has a clear positive impact on training machine learning models. It reduces the dataset's noise by filling in missing values, removing duplicate records, dropping unnecessary columns, etc. Finally, it produces the data in its best representation for model building. Without preprocessing, models might learn the noise as an underlying pattern, leading to overfitting or underfitting. In our work, we dropped attributes such as project IDs, project dates, and other categorical details, and ignored records with missing data.
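A minimal sketch of this cleaning step follows; the column names and values are hypothetical, not those of the paper's datasets:

```python
# Drop identifier/date columns and ignore records with missing values.
# Column names and values are hypothetical examples.
records = [
    {"id": 1, "date": "1981-01", "size": 10.0, "effort": 24.0},
    {"id": 2, "date": "1981-03", "size": None, "effort": 60.0},  # missing size
    {"id": 3, "date": "1982-07", "size": 40.0, "effort": 98.0},
]
DROP = {"id", "date"}  # non-predictive attributes

cleaned = [
    {k: v for k, v in r.items() if k not in DROP}
    for r in records
    if all(v is not None for v in r.values())  # skip incomplete records
]
print(cleaned)  # two records remain, each with only 'size' and 'effort'
```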

B. Normalization
Normalization is done as a second step; scaling the features within a range is essential for the model's performance. This normalization sets the feature scale from 0 to 1 and is implemented using the MinMaxScaler in Python. In our datasets, we normalized all the input and output features.
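Min-max scaling maps each value as x' = (x - min) / (max - min). The pure-Python sketch below shows the same computation that scikit-learn's MinMaxScaler performs; the effort values are illustrative:

```python
# Min-max scaling: x' = (x - min) / (max - min), mapping a feature to [0, 1].
def minmax_scale(values):
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]

effort = [24.0, 60.0, 98.0, 150.0, 200.0]   # illustrative effort column
scaled = minmax_scale(effort)
print(scaled[0], scaled[-1])  # 0.0 1.0
```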

C. 5-Fold Cross-Validation
Cross-validation is an interesting technique that makes our model more reliable. Instead of fixing one subset for training and the remaining part for testing, it uses the entire dataset for both training and testing. Five-fold validation splits the entire dataset into five equal sets, or folds; in each iteration, four folds are used for training and one for testing. This process is repeated five times, so that every fold serves as the test set exactly once, covering all possible combinations, and the average score over all iterations is reported.
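The fold construction can be sketched as follows; the dataset size of 22 records is illustrative:

```python
# 5-fold cross-validation: split n records into 5 folds; each fold serves
# once as the test set while the remaining four folds train the model.
def k_fold_indices(n, k=5):
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = k_fold_indices(22, k=5)              # 22 records -> folds of 5,5,4,4,4
for test_fold in folds:
    train_idx = [i for f in folds if f is not test_fold for i in f]
    # fit on train_idx, evaluate on test_fold, collect the score ...
print([len(f) for f in folds])  # [5, 5, 4, 4, 4]
```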

D. Algorithms
In our work, we built a hybrid model using five machine learning regression algorithms. Each algorithm has a different structure in its implementation.

1) Linear regression:
Linear Regression frames an equation over the given attributes to fetch the target variable. It assumes a linear relationship between the attributes of the dataset and the target. The equation is y = f(x), where y represents the output variable and x is the set of input attributes. This algorithm performs better than complex models when the dataset is linear.
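For a single attribute, y = f(x) reduces to the least-squares line y = a·x + b with a = cov(x, y) / var(x). A small sketch on illustrative numbers:

```python
# Ordinary least squares for one feature: y = a*x + b,
# where a = cov(x, y) / var(x) and b = mean(y) - a * mean(x).
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # perfectly linear toy data: y = 2x
a, b = fit_line(xs, ys)
print(a, b)  # 2.0 0.0
```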
2) Random forest: Random Forest is a bagging model. It constructs several trees for prediction, each built from a subset of the training data. Every tree gives an effort estimate for a test record, and all predictions are averaged to get the final effort estimate, lowering the error rate of the result.
3) Boosting techniques: Every boosting algorithm has a base model. After each iteration, a new weak learner is added to the sequence of learners, and each iteration's model reduces the residual error. We implemented three boosting models in our work: AdaBoost (ADB), Gradient Boosting (GB), and Extreme Gradient Boosting (XGB). AdaBoost handles missing data well, is relatively resistant to overfitting, and has few parameters to tune, but it is sensitive to outliers. Gradient Boosting is a sequence of tree learners fit to the residuals and is robust to outliers. XGB has shown better results than GB, as it includes the calculation of similarity weights.
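The residual-fitting idea behind gradient boosting can be sketched with one-split "stump" learners; the data, the 20-stage loop, and the 0.5 learning rate are illustrative assumptions, not the paper's configuration:

```python
# Gradient boosting sketch: each stage fits a weak learner (a one-split
# "stump") to the current residuals and adds a damped copy to the model.
def fit_stump(xs, res):
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, res) if x <= t]
        right = [r for x, r in zip(xs, res) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - (lm if x <= t else rm)) ** 2 for x, r in zip(xs, res))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]          # toy feature
ys = [10.0, 12.0, 14.0, 30.0, 33.0, 35.0]    # toy effort target
pred = [sum(ys) / len(ys)] * len(xs)         # stage 0: predict the mean
stumps, lr = [], 0.5                         # lr damps each stage's update
for _ in range(20):
    res = [y - p for y, p in zip(ys, pred)]  # current residuals
    s = fit_stump(xs, res)
    stumps.append(s)
    pred = [p + lr * s(x) for p, x in zip(pred, xs)]

mae = sum(abs(y - p) for y, p in zip(ys, pred)) / len(ys)
print(mae)  # residual error shrinks as stages accumulate
```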

E. Voting
Every algorithm is unique in how it processes data, so each algorithm finds its own patterns in the data. Here comes the idea of ensembling [34]. An ensemble is obtained by combining various models; bagging, boosting, and voting are some of the ensemble approaches [35]. Here we aggregated the predictions of the various models, i.e., averaged the output predictions of all models, producing one model closer to the actual effort than any individual model. We took the Linear Regression, Decision Tree Regression, Random Forest Regression, Support Vector Regression, and Neural Network Regression outputs, calculated the average of all the values, and compared it with the actual effort in the test dataset. The results are reported for the evaluation metrics.

Fig. 2 shows the deviation between actual and predicted effort values on all datasets, where the X-axis represents the record number and the Y-axis the effort of the record. Fig. 2(a), a line graph of actual versus predicted effort for the COCOMO81 dataset, shows a noticeable difference at the peak points. In Fig. 2(b), for the China dataset, the predicted line comes close to the actual line in many places. In Fig. 2(c), for the Desharnais dataset, records 10 and 11 show a significant deviation, whereas records 12 to 14 show the least. Fig. 2(d), for the Maxwell dataset, again shows a noticeable difference at the peak points. In Fig. 2(e), for the Kemerer dataset, the test records are few and show a significant deviation, but the deviation range is only 0.05. Fig. 2(f) shows the corresponding graph for the Albrecht dataset, where a considerable difference is visible.

Fig. 3 shows the residual graphs between actual and predicted effort values on all datasets. Fig. 3(a) and 3(d), on COCOMO81 and Maxwell, show noticeable differences at peak effort values; Fig. 3(b), for the China dataset, shows values close to zero; and Fig. 3(c) and 3(f) show a constant gap between actual and predicted values. Fig. 3(a) plots the residuals of the COCOMO81 dataset, ranging from -0.4 to +0.4, with most data points in the range -0.2 to +0.2.

F. Pseudo Code
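The voting procedure can be sketched as follows; the lambda models below are illustrative placeholders standing in for the trained LR, DT, RF, SVR, and NN regressors, not the paper's actual implementations:

```python
# Voting by averaging: the ensemble prediction is the mean of the base
# models' predictions for a given input.
def voting_predict(models, x):
    preds = [m(x) for m in models]
    return sum(preds) / len(preds)

# Placeholder base models (stand-ins for the five trained regressors).
models = [
    lambda x: 0.50 * x,   # stand-in for linear regression
    lambda x: 1.50 * x,   # stand-in for decision tree regression
    lambda x: 1.00 * x,   # stand-in for random forest regression
    lambda x: 0.75 * x,   # stand-in for support vector regression
    lambda x: 1.25 * x,   # stand-in for neural network regression
]
print(voting_predict(models, 10.0))  # 10.0, the mean of the five outputs
```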
Fig. 3(b) shows the residuals between actual and predicted effort for the China dataset, ranging from -0.10 to +0.25. Most of the data points lie near 0, indicating that the proposed model works very efficiently on the China dataset. Fig. 3(c) shows the residuals for the Desharnais dataset, ranging from -0.10 to +0.20; most of the data points lie below 0, meaning the model's predicted values are less than the actual values. Fig. 3(d) shows the residuals for the Maxwell dataset, ranging from -0.2 to +0.8; according to this graph, the proposed model's predictions are close to the actual values on this dataset. Fig. 3(e) shows the residuals for the Kemerer dataset, ranging from 0.05 to +0.16. Fig. 3(f) shows the residuals for the Albrecht dataset, ranging from -0.15 to +0.20.
Fig. 4 shows bar plots of the mean absolute error of all the implemented models on the six datasets. In Fig. 4(a), the voting model outperformed GB, XGB, RF, and LR, being beaten only by ADB. Fig. 4(b) shows that, except for RF, voting showed a lower error than all the others. In Fig. 4(c), (d), and (f), the voting model is reliable. From all these comparisons, we conclude that voting is a consistent performer: across datasets the individual models behave erratically, whereas voting is consistently strong. Fig. 4(a) shows the Mean Absolute Error of the various models on the COCOMO81 dataset, with the ADB model giving the smallest error, followed by voting, and Random Forest giving the highest error among the models presented. Fig. 4(b) shows the Mean Absolute Error of the various models on the China dataset; the RF model produces the lowest error and ADB the highest. Fig. 4(c) shows the Mean Absolute Error on the Desharnais dataset, with the Voting and XGB models producing the lowest error and ADB the highest.
Fig. 4(d) shows the Mean Absolute Error of the various models on the Maxwell dataset, with the XGB model giving the smallest error and the LR model the highest. Fig. 4(e) shows the Mean Absolute Error on the Kemerer dataset; the XGB model produces the lowest error and the LR model the highest. Fig. 4(f) shows the Mean Absolute Error on the Albrecht dataset; the RF model produces the lowest error and the LR model the highest.
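The metric plotted in Fig. 4 is MAE = (1/n) Σ |actual_i − predicted_i|. A short sketch on illustrative normalised values:

```python
# Mean Absolute Error: the average absolute difference between actual and
# predicted efforts. The values below are illustrative, not the paper's.
def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [0.20, 0.55, 0.10, 0.80]       # illustrative normalised efforts
predicted = [0.25, 0.50, 0.20, 0.70]
print(mae(actual, predicted))  # approximately 0.075
```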
We normalized the literature results and compared them with our model's results (see Tables I to VI). Our work tested the voting regressor on six datasets. From the tables, voting showed excellent performance in minimizing the error between actual and predicted effort on all datasets. On the COCOMO81 dataset, the absolute error is lowest for voting, while the squared errors are lowest for linear regression. Voting also performed excellently on the China, Desharnais, Kemerer, Maxwell, and Albrecht datasets. Finally, we conclude that the implementations on all datasets support voting, which makes voting more reliable and robust. That voting is followed by linear regression suggests that the datasets have a largely linear relationship between the project attributes.

VI. CONCLUSION
In this work we studied various existing research papers on software effort estimation. In the early days, many conventional approaches were relied upon, considering lines of code, function points, CPM, PERT, etc., or merely the judgment of people with ample experience in determining software project effort. Because extensive modern project development considers multiple parameters in every project, these techniques may no longer be feasible for delivering rapid results on software projects. At the same time, machine learning has gained momentum in recent decades across various domains, and some of that work is taking place in software engineering. Therefore, our work aims to provide a robust machine learning model for effort calculation. We successfully used the machine learning ensembling concept to predict software development effort, considering every parameter for the estimation. Based on our experiments, the ensemble of models outperformed the single models, and we recorded a comparatively lower error rate from the ensemble model. Averaging different predictors positively impacted the output, which shows the vital role ensembling plays in optimizing software effort estimation with machine learning. The input dataset greatly affects how well a machine learning algorithm works, and in our work the models performed very well on our datasets.