Fuzzy C-mean Missing Data Imputation for Analogy-based Effort Estimation

The accuracy of effort estimation is one of the major factors in the success or failure of software projects. Analogy-Based Estimation (ABE) is a widely accepted estimation model because it follows the human tendency to select analogies that are similar in nature to the target project. The prediction accuracy of the ABE model is strongly associated with the quality of the dataset, since it depends on previously completed projects for estimation. Missing Data (MD) is one of the major challenges in software engineering datasets. Researchers have investigated several missing data imputation techniques for the ABE model. Identifying the most similar donor values from the completed software projects dataset is a challenging issue in the existing missing data techniques adopted for the ABE model. In this study, Fuzzy C-Mean Imputation (FCMI), Mean Imputation (MI), and K-Nearest Neighbor Imputation (KNNI) are investigated for imputing missing values in the Desharnais dataset under different missing data percentages (Desh-Miss1, Desh-Miss2) for the ABE model. The FCMI-ABE technique is proposed in this study. MI, KNNI, and (ABE-FCMI) are compared to identify the most suitable MD imputation method for the ABE model. The results suggest that (ABE-FCMI), rather than MI or KNNI, imputes more reliable values for incomplete software projects in the missing datasets. The proposed imputation method was also found to significantly improve the software development effort prediction of the ABE model.

Keywords: Analogy-based effort estimation; imputation; missing data; fuzzy c-mean


I. INTRODUCTION
Software development effort is considered one of the most significant metrics estimated in software projects, because planning, development, management, and all other important aspects of a project depend heavily on an accurate estimate of development effort [1]. Researchers in the software engineering domain have introduced many effort estimation models, which can be classified into two major categories. The first is parametric models, which depend on statistical analysis of software project data and assume a linear relationship between effort and other project attributes. The second is Machine Learning (ML) models, which depend on soft computing and artificial intelligence methods and assume a non-linear relationship between effort and other project attributes [2,3]. Among the many ML models, Analogy-Based Estimation (ABE) is widely accepted because it follows the human tendency to select analogies that are similar in nature to the target project [4].
Missing data (MD) in software engineering datasets is a major problem that affects the performance of effort prediction models [5,6]. Many techniques have been proposed to solve this problem, including deletion, toleration, and imputation of missing data [7]. Missing data imputation is the most investigated technique in software effort estimation, and KNN imputation is the most widely adopted method [8].
Almutlaq and Jawawi [9] classified missing data imputation challenges for software effort estimation into two major categories: performance-oriented challenges and dataset challenges. Performance-oriented challenges are issues within the techniques themselves at the performance level (missing data accuracy, model performance accuracy, and time efficiency). Dataset challenges revolve around the role of the dataset and its effect on the missing data imputation techniques (numerical data imputation, categorical data imputation, variety in dataset characteristics and size, and variety in MD mechanisms).
MI and KNNI are the most prominent missing data imputation techniques that have been used for the ABE model [8]. MI is considered a static imputation method because it does not analyze the dynamic nature of each missing case in the feature concerned [10,11]. KNNI depends on neighboring cases, which may or may not be related to the project with missing values, and derives a dynamic imputation value for each missing case of the feature concerned in the incomplete dataset [12].
Identifying the most similar donor values from the completed software projects dataset is a challenging issue in the existing missing data techniques adopted for the ABE model. Clustering the completed software projects into homogeneous clusters based on the selected dataset attributes, and then identifying more reliable donor cases for the incomplete project from the clustered data, has not yet been investigated in the ABE domain.
This study aims to improve the performance of the ABE model by adopting a new imputation method based on the FCM technique, and to compare the results empirically with KNN imputation and Mean Imputation (MI) for the ABE model using different missing ratios under the MNAR missingness mechanism.
The rest of the paper is organized as follows. Section II presents the concepts of the ABE model, missing data, and techniques for handling missing data in software engineering datasets. Section III presents the concept of Fuzzy C-Mean clustering. Section IV presents related research studies on missing data techniques in the software engineering domain and the ABE model. Section V presents the proposed (ABE-FCMI) imputation technique. Section VI presents the empirical evaluation design employed in this study. Section VII presents and discusses the reported results. Section VIII discusses internal and external threats to the validity of this research study. Section IX concludes the research findings and gives directions for future work.
www.ijacsa.thesai.org

II. BACKGROUND
This section presents the concepts of analogy-based effort estimation, missing data, and fuzzy c-mean (FCM) clustering.

A. Analogy-Based Estimation (ABE)
Analogy-based estimation was proposed by Shepperd and Schofield as one of the most prominent non-algorithmic effort estimation models [13]. In ABE, the development effort of the target project is derived by comparing it with similar completed projects. Similarity measures are used to determine which projects are similar. Its simplicity and estimation capability make it a widely accepted model in the software effort estimation field. ABE consists of four parts:
• A historical dataset of completed software engineering projects.
• A similarity function that determines the level of similarity.
• A solution function that estimates the software development effort from the similar projects found by the similarity function.
• The associated retrieval rules.
The estimation process of ABE is accomplished in the following stages:
• A historical dataset is constructed from the information collected on previous projects.
• Attributes are selected for comparison purposes.
• Projects similar to the target project are retrieved based on the selected similarity function.
• The effort of the target project is estimated based on the selected solution function.
Similarity Function: The level of similarity between two projects is determined through a similarity function that compares the attributes of both projects. Euclidean Similarity (ES) and Manhattan Similarity (MS) are two common similarity functions. The ES function is represented in Equation 1.
$$ Sim(p, p') = \frac{1}{\sqrt{\sum_{i=1}^{n} w_i \,(f_i - f_i')^2} + \delta} \tag{1} $$
where $p$ and $p'$ are the projects being compared and $w_i$ is the weight given to each attribute, ranging between 0 and 1. The ith attribute of each project is represented by $f_i$ and $f_i'$, and $n$ is the number of attributes. A small constant $\delta$ is used to obtain a non-zero denominator.
Solution Function: A solution function is applied to derive the software effort estimate from the most similar projects identified by the similarity function. The most commonly used solution functions are the inverse distance weighted mean [14], the closest analogy (the most similar project) [15], the average of the most similar projects [13], and the median of the most similar projects [16]. Median denotes the median effort of the K most similar projects, where K > 2. Average denotes the average effort of the K most similar projects, where K > 1.
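The retrieval and solution steps described above can be illustrated with a minimal Python sketch. It assumes numerical, already normalized features; the function names, the value of `delta`, and the toy data are hypothetical choices made for illustration and are not taken from the study.

```python
import numpy as np

def euclidean_similarity(p, q, w=None, delta=1e-4):
    """Weighted Euclidean similarity; delta (assumed value) avoids division by zero."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    w = np.ones_like(p) if w is None else np.asarray(w, dtype=float)
    return 1.0 / (np.sqrt(np.sum(w * (p - q) ** 2)) + delta)

def abe_estimate(target, projects, efforts, k=3, solution="mean", w=None):
    """Estimate the effort of `target` from its k most similar historical projects."""
    efforts = np.asarray(efforts, dtype=float)
    sims = np.array([euclidean_similarity(target, p, w) for p in projects])
    top = np.argsort(sims)[::-1][:k]          # indices of the k most similar projects
    if solution == "median":
        return float(np.median(efforts[top]))
    if solution == "idw":                     # similarity-weighted (inverse distance) mean
        return float(np.sum(sims[top] * efforts[top]) / np.sum(sims[top]))
    return float(np.mean(efforts[top]))       # default: average of the analogies

# Toy history (hypothetical): two normalized features per project and known efforts.
history = [[0.1, 0.1], [0.2, 0.2], [0.9, 0.9]]
effort = [100.0, 200.0, 1000.0]
estimate = abe_estimate([0.15, 0.15], history, effort, k=2, solution="mean")
```

With k = 2 the two nearby projects are retrieved and the distant one is ignored, so the choice of solution function only changes how their efforts are combined.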

B. Missing Data Concept
The missing data (MD) problem is a major challenge in software engineering datasets. Accurate software effort estimation depends strongly on the quality of the datasets used in the estimation process. In this subsection, MD mechanisms and MD techniques (treatments) are elaborated.

C. Mechanisms of Missing Data
Missing data mechanisms are assumptions about the type and distribution of missing values [17]. Identifying the missing mechanism determines the missing data treatment to be applied [7]. Three types of missing data mechanism are identified.
First, Missing Completely At Random (MCAR): the MD are independent of any variable observed in the dataset. Second, Missing At Random (MAR): the MD may depend on variables observed in the dataset, but not on the missing values themselves. Third, Missing Not At Random (MNAR): the MD depend on the missing values themselves and not on any other observed variable.
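The difference between the three mechanisms can be made concrete with a small simulation. The sketch below is only one simple way to produce such masks, under the assumption that MAR and MNAR blank out the rows with the largest values of a driving column; `make_masks` and its parameters are hypothetical names, and this is not the generation procedure used in this study.

```python
import numpy as np

def make_masks(X, target, driver, rate, seed=0):
    """Boolean masks (True = missing) for column `target` under each mechanism."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    k = int(round(rate * n))                      # number of values to blank out

    mcar = np.zeros(n, dtype=bool)
    mcar[rng.choice(n, size=k, replace=False)] = True    # MCAR: rows chosen uniformly at random

    mar = np.zeros(n, dtype=bool)
    mar[np.argsort(X[:, driver])[n - k:]] = True  # MAR: depends on another, observed column

    mnar = np.zeros(n, dtype=bool)
    mnar[np.argsort(X[:, target])[n - k:]] = True # MNAR: depends on the target values themselves

    return {"MCAR": mcar, "MAR": mar, "MNAR": mnar}

# Toy data (hypothetical): column 0 is the feature to blank out, column 1 is observed.
X = np.array([[1.0, 10.0], [2.0, 40.0], [3.0, 20.0], [4.0, 30.0]])
masks = make_masks(X, target=0, driver=1, rate=0.5)
```

Under MNAR the missing cells are exactly the largest values of the target column, which is why the observed part of that column no longer represents its true distribution.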

D. Techniques for Missing Data
Missing data treatments can be grouped into three methods: MD deletion, MD toleration, and MD imputation.
MD deletion (ignoring): this technique handles missing values simply by deleting them. MD deletion is suitable when the percentage of missing data is low. It is not used when consecutive data are missing, as in the NIM (MNAR) mechanism [7,18]. MD toleration: in this method, the missing value is assigned a NULL value and is not deleted from the dataset, and the analysis is performed on the same data [18]. MD imputation: this method fills in the missing values to obtain a complete dataset, which can later be used to enhance the estimation of software development effort. KNN imputation is the most prominent imputation method in software effort estimation [8,19,20]. KNN has provided good results so far because it does not follow explicit mechanisms. Euclidean distance and Manhattan distance are used as similarity measures to find the nearest neighbors in KNN imputation methods.
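As an illustration of the two baseline imputation methods discussed in this paper, the following sketch implements mean imputation and a basic KNN imputation with NumPy. It assumes missing values are encoded as NaN, that donors are the fully complete rows, and that distances are computed over the features observed in the incomplete row; these are assumptions of the sketch and the function names are hypothetical.

```python
import numpy as np

def mean_impute(X):
    """MI: replace each NaN with the mean of the observed values in its column."""
    X = np.array(X, dtype=float)
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

def knn_impute(X, k=2):
    """KNNI: fill each incomplete row from its k nearest complete rows (Euclidean)."""
    X = np.array(X, dtype=float)
    complete = X[~np.isnan(X).any(axis=1)]       # donor pool: rows without missing values
    for row in X:                                # rows are views, so edits update X
        miss = np.isnan(row)
        if not miss.any():
            continue
        # distance uses only the features observed in the incomplete row
        d = np.sqrt(((complete[:, ~miss] - row[~miss]) ** 2).sum(axis=1))
        donors = complete[np.argsort(d)[:k]]
        row[miss] = donors[:, miss].mean(axis=0)
    return X

# Toy data (hypothetical): the last project is missing its second feature.
data = [[1.0, 10.0], [2.0, 20.0], [10.0, 100.0], [2.2, np.nan]]
mi_value = mean_impute(data)[3, 1]     # same value for any row missing this feature
knn_value = knn_impute(data, k=2)[3, 1]  # depends on the two closest donors
```

The example shows the contrast drawn in the text: MI is pulled toward the large third project, while KNNI uses only the two projects closest to the incomplete one.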

III. FUZZY C-MEAN (FCM) CLUSTERING
KNNI uses the whole completed dataset to identify similar neighboring donor cases based on some distance measure. In the ABE context, it is important that the donor cases used to impute the missing values of an incomplete project come from projects that are similar to it in characteristics and nature.
Clustering, as a data mining technique, has recently been used to impute missing values. The idea behind using clustering in MD imputation is to impute the missing values of an incomplete record from the cluster in which that record is located. Imputation accuracy is improved by clustering the data into groups with similar features, so that the values used to substitute missing values come from within the cluster [21].
Clustering techniques can be divided into two major categories: hard clustering and soft (fuzzy) clustering. In hard clustering techniques, a data object belongs to only one cluster, which is the most similar cluster. In fuzzy clustering, however, a data object belongs to every cluster with a certain degree of similarity given by a membership function [22].
Hard clustering imputation techniques such as k-means have been employed by many researchers [23][24][25]; the missing values of an incomplete data object are imputed based on information from the cluster it belongs to. However, in a dataset with missing values, it is uncertain whether an incomplete data object definitely belongs to a certain cluster, so fuzzy clustering imputation methods such as FCMI have been introduced [26][27][28]. FCM decreases the intra-cluster variance compared to the k-means algorithm [29]; moreover, FCM is less prone to getting stuck in a local minimum because of its continuous membership values [30]. Experimental results indicate that fuzzy imputation achieves higher performance than hard clustering imputation [31].
Zadeh introduced the concept of fuzzy logic [23,32]. Fuzzy logic is a computational approach based on degrees of truth that represents uncertainty in information. Fuzzy theory and fuzzy sets were introduced to address imprecise information and uncertainty in missing data. Fuzzy capabilities are used to find plausible imputation values [31,33,34].
In fuzzy clustering, unlike crisp clustering, one dataset element can belong to two or more subsets. In FCM, one dataset element can belong to all clusters, with a different membership value associated with each cluster [35,36].
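The contrast between crisp and fuzzy assignment can be seen in a few lines of Python. The sketch below uses the standard FCM membership formula for fixed cluster centers; the function name, the toy centers, and the point are hypothetical.

```python
import numpy as np

def fuzzy_memberships(x, centers, m=2.0):
    """Membership of one point in every cluster (standard FCM formula, fixed centers)."""
    d = np.linalg.norm(centers - x, axis=1)
    if np.any(d == 0):                       # point coincides with one or more centers
        return (d == 0) / np.sum(d == 0)
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum()                   # memberships sum to 1

centers = np.array([[0.0, 0.0], [4.0, 0.0]])
point = np.array([1.0, 0.0])
u = fuzzy_memberships(point, centers)        # fuzzy: a degree for every cluster
hard = int(np.argmin(np.linalg.norm(centers - point, axis=1)))  # crisp: one cluster only
```

A crisp assignment keeps only the index of the closest center, whereas the fuzzy memberships retain how strongly the point is associated with each cluster.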
Fuzzy C-Means (FCM) has recently been adopted to solve the missing data problem [27,28,37]. A missing value can be derived from the distances to the clusters of the complete dataset, based on the obtained membership values.
This study focuses on missing data imputation by clustering the completed projects into several clusters whose members have similar feature subsets. To the best of our knowledge, no research study has adopted FCM for the ABE model.
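FCM is an iterative algorithm. The goal of FCM is to find the cluster centers (centroids) that minimize an objective (dissimilarity) function. The dissimilarity function J used in FCM is given in Equation 2.
$$ J = \sum_{i=1}^{n} \sum_{j=1}^{c} \mu_{ij}^{m} \, d_{ij}^{2} \tag{2} $$
where $n$ is the number of observations, $c$ is the number of clusters, $\mu_{ij}$ is the membership of the ith observation in the jth cluster, $d_{ij} = \lVert x_i - r_j \rVert$ is the Euclidean distance between the ith observation $x_i$ and the jth centroid $r_j$, and $m$ is the fuzzy degree; $m = 2$ is the commonly used value.
The cluster center (centroid) $r_j$ of the jth cluster is given by Equation 3.
$$ r_j = \frac{\sum_{i=1}^{n} \mu_{ij}^{m} \, x_i}{\sum_{i=1}^{n} \mu_{ij}^{m}} \tag{3} $$
The membership function is updated using Equation 4.
$$ \mu_{ij} = \frac{1}{\sum_{k=1}^{c} \left( d_{ij} / d_{ik} \right)^{2/(m-1)}} \tag{4} $$
The FCM algorithm can be elaborated as follows:
Algorithm 1: FCM Algorithm
REQUIRE:
1. Input data to be clustered (X1, X2, ..., Xn).
2. Number of clusters (c), fuzzy degree value (m), maximum number of iterations allowed (I), the smallest desired error (ε), initial objective function (J0 = 0).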

Step 1: Begin
Step 2: Randomly initialize the membership function of each observation (µij)
Step 3: Calculate the centroid (cluster center) (rj) using Equation 3
Step 4: Calculate the Euclidean distance and update the membership function (µij) using Equation 4
Step 5: Calculate the objective function using Equation 2
Step 6: Check the convergence criterion: IF (∥Ji − J(i−1)∥ < ε) OR (i > I), then stop the process; ELSE repeat Steps 3 to 6 until the maximum number of iterations is reached.
Step 7: END

IV. RELATED WORK
The quality of past software project datasets plays a major role in the performance of the ABE model, since it depends on historical projects to predict the effort of the target project. Researchers have widely investigated missing data treatment techniques in the software engineering field, but few have concentrated on the ABE model. Idri, et al. [8] conducted a systematic mapping study in the software engineering domain that reviewed existing techniques for treating missing data; it was found that missing data imputation is the most used approach and that KNN imputation is the most adopted method. Huang, et al. [6] empirically evaluated data preprocessing techniques used for machine learning effort estimation models; the study validated the effectiveness of missing data treatment techniques in improving the accuracy of effort prediction. Almutlaq and Jawawi [9] reviewed recent missing data techniques in the software effort estimation field; the study elaborated two major challenges, one oriented to imputation technique performance and the other to incomplete datasets. Strike, et al. [5] investigated three missing data techniques (deletion, mean imputation, and hot-deck imputation) with three missing mechanisms (MCAR, MAR, and NIM) on a regression effort estimation model; it was found that hot-deck imputation outperformed the other methods. Cartwright, et al. [19] found that KNN imputation gives better results than mean imputation and missing data toleration in a regression effort estimation model under the MCAR missing data mechanism. Twala and Cartwright [20] combined KNN imputation with a multiple imputation approach for a decision tree effort estimation model; the experimental results showed that the proposed ensemble method improved the predictive accuracy of effort estimation.
Sentas and Angelis [38] investigated multinomial logistic regression (MLR) imputation for categorical missing data in the ISBSG dataset; the accuracy of the regression estimation model improved, especially with a high percentage of missing values. Li, et al. [18] studied the relation between the percentage of missing data (MCAR missing mechanism) and the accuracy of the AQUA model (a form of ABE); the results confirmed a positive quadratic relation between the percentage of missing data and the accuracy of effort prediction. Song, et al. [7] analyzed the impact of the missing percentage and missing mechanisms on the accuracy of the C4.5 effort estimation model using toleration and KNN imputation methods; prediction accuracy was severely affected when the missing percentage was above 40%. Idri, Abnane et al. [39] evaluated the prediction accuracy of ABE using different missing data techniques (toleration, deletion, and KNN imputation) with all missing mechanisms; KNN imputation gave the greatest improvement in ABE performance.
Abnane and Idri [40] investigated MD techniques (toleration, deletion, and KNN imputation) under different missing ratios and MD mechanisms for the Fuzzy-ABE model using PRED(0.25) and SA as accuracy measures; they found that SA and PRED(0.25) measure different characteristics of technique performance. Huang, Li et al. [41] investigated data preprocessing techniques (MD, normalization, feature selection) for the ABE model on the ISBSG dataset; KNNI improved ABE performance significantly compared to MI. Idri, Abnane et al. [42] proposed SVR (Support Vector Regression) imputation; empirical results indicated that SVRI outperformed KNNI under different missing ratios and MD mechanisms for the ABE model. Abnane and Idri [43] investigated mixed (numerical and categorical) MD imputation techniques for the ABE model; imputation techniques achieved better accuracy results, and there was no significant difference between SVR and KNNI for mixed MD imputation. Muhammad Arif Shah [44] proposed Median Imputation of the Nearest Neighbor (MINN) for the ABE model; on the Desharnais dataset, the proposed method outperformed both MI and KNN under the MNAR mechanism.
Abnane, Hosni et al. [45] optimized the parameters of KNN imputation using grid search; the optimized KNN imputation improved ABE significantly compared with regular KNN imputation. Abnane, Idri et al. [46] proposed 2FA-KP-I (Fuzzy Analogy k-Prototypes Imputation) to impute mixed MD in the ABE model; 2FA-KP-I outperformed KNNI under different missing ratios and MD mechanisms for ABE in the studied datasets. Table I summarizes the literature on MD techniques used in the ABE model, including the type of MD, the imputation methods used, the MD mechanism, and the findings of each study. As can be seen from Table I, KNNI and MI are the most used techniques. The literature reviewed in Table I indicates that an increased MD ratio negatively affects ABE performance and that MNAR MD mechanisms significantly decrease ABE performance.
The MI method imputes a fixed value for all missing data in the same column (feature) by replacing every missing value with the average value of the feature concerned. MI is considered a static imputation method because it does not analyze the dynamic nature of each missing case in the feature concerned. MI can alter the variance of the data, and relationships between variables, such as correlation, are not preserved [10,47,48].
KNNI depends on the neighboring cases of the missing value and derives a dynamic imputation value for each missing case of the feature concerned. KNN imputation has the following limitations: first, it is not efficient for large datasets; second, it imputes values based on neighbors that may or may not be projects related to the one needing donor values; third, it depends on the parameter settings of the KNN algorithm; and fourth, KNNI performance decreases under the MNAR missingness mechanism [12,39,49,50].
As can be seen from the literature, identifying the most similar donor values from the completed software projects dataset is a challenging issue in the existing missing data techniques adopted for the ABE model. Clustering the completed software projects into homogeneous clusters based on the selected dataset attributes, and then identifying more reliable donor cases for the incomplete project from the clustered data, has not yet been investigated by most researchers in the ABE domain.

V. PROPOSED (ABE-FCMI) IMPUTATION TECHNIQUE
This section discusses the proposed (ABE-FCMI) imputation technique for imputing software engineering datasets. (ABE-FCMI) employs fuzzy clustering to divide the completed software projects into homogeneous clusters based on their features. Its main operation is to group the completed data by similar features using FCM, obtain the centroid value for each feature, and thereby obtain the cluster centers.
The proposed (ABE-FCMI) method tries to close two gaps: first, selecting proper adjacent cases from which to derive the final estimate of the missing value, and second, improving ABE performance through MD imputation under the MNAR missingness mechanism. The basic idea behind using the (ABE-FCMI) technique in the ABE context is to impute the missing values of incomplete software projects based on homogeneous clusters of completed software projects, which have high similarity within each cluster and are dissimilar to the software projects in other clusters. Similar donor cases for imputation are then identified based on the membership values of the incomplete project in each cluster.
In this study, the idea of FCMI is borrowed from the literature [27,33] and applied to the MD problem in the ABE model to improve the prediction accuracy of software effort estimation.
The algorithm of the proposed (ABE-FCMI) method is as follows:
Algorithm 2: ABE-FCMI Algorithm
REQUIRE: Normalize the software projects dataset (D) using min-max normalization. Separate the dataset (D) into two subsets: the complete software projects dataset (DC) and the incomplete software projects dataset (DM).
Step 1: Begin
Step 2: For all projects in the complete software projects dataset (DC):
i. Calculate the cluster center (centroid) using Equation 3.
ii. Compute the Euclidean distance.
iii. Update the membership function using Equation 1, 2, and 3.
Step 3: For all projects in the incomplete software projects dataset (DM):
i. Calculate the membership function with respect to the cluster centers calculated in Step 2.
Step 4: For each incomplete software project, calculate the imputation value using the membership values calculated in Step 3 and the cluster centers calculated in Step 2.
Step 5: End
The proposed (ABE-FCMI) algorithm imputes each incomplete project using the membership function and the calculated cluster centers of the completed projects. Missing values are generated using a particular missingness mechanism, and the dataset is normalized, before the imputation process starts. The method is shown in Fig. 1 and mainly includes: calculating the cluster centers of the complete software projects, calculating the membership values for each incomplete software project, and estimating the imputed missing values. In the first step, the whole dataset is separated into complete and incomplete datasets, and the cluster centers of the complete software projects are calculated using the FCM algorithm. In the second step, the membership values of each incomplete software project with respect to the given cluster centers are calculated. In the third step, the imputation value is estimated from the membership values of the incomplete software project calculated in the second step and the cluster centers of the complete software projects calculated in the first step. The imputed dataset is then used to evaluate the prediction accuracy of the ABE model, as elaborated in Fig. 1.
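The three steps described above can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in the study: it assumes NaN-encoded missing values, the standard FCM update rules, memberships computed from the observed features only, an imputed value equal to the membership-weighted combination of the cluster centers, and a final conversion back to the original scale. The function names `fcm` and `fcm_impute`, the default parameters, and the toy data are hypothetical.

```python
import numpy as np

def fcm(X, c=2, m=2.0, max_iter=100, eps=1e-6, seed=0):
    """Fuzzy C-Means on complete rows; returns (centers, memberships)."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)             # random memberships, rows sum to 1
    J_prev = 0.0
    for _ in range(max_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]            # centroid update
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)                     # guard against zero distances
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)                # membership update
        J = float(((U ** m) * d ** 2).sum())      # objective function
        if abs(J - J_prev) < eps:
            break
        J_prev = J
    return centers, U

def fcm_impute(D, c=2, m=2.0):
    """Impute NaNs in D from FCM centers fitted on the complete rows."""
    D = np.asarray(D, dtype=float)
    lo, hi = np.nanmin(D, axis=0), np.nanmax(D, axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    Z = (D - lo) / span                           # min-max normalization
    complete = ~np.isnan(Z).any(axis=1)
    centers, _ = fcm(Z[complete], c=c, m=m)       # step 1: centers from complete projects
    for i in np.where(~complete)[0]:
        miss = np.isnan(Z[i])
        # step 2: memberships of the incomplete project, using observed features only
        d = np.fmax(np.linalg.norm(centers[:, ~miss] - Z[i, ~miss], axis=1), 1e-12)
        inv = d ** (-2.0 / (m - 1.0))
        u = inv / inv.sum()
        # step 3: imputed value as membership-weighted combination of the centers
        Z[i, miss] = u @ centers[:, miss]
    return Z * span + lo                          # back to the original scale

# Toy data (hypothetical): two groups of projects, two projects missing feature 2.
D = np.array([[1.0, 10.0], [1.2, 12.0], [0.8, 8.0],
              [9.0, 90.0], [9.2, 92.0], [8.8, 88.0],
              [1.0, np.nan], [9.0, np.nan]])
imputed = fcm_impute(D, c=2)
```

In this toy example each incomplete project has a membership close to 1 in the cluster it resembles, so its imputed value is driven almost entirely by that cluster's center rather than by the global mean of the feature.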

VI. EMPIRICAL EVALUATION DESIGN
This section describes the empirical evaluation design: first, the datasets used in this study; second, the performance accuracy measures used to assess the ABE prediction results; and third, the empirical process employed in this study.

A. Data Sets Description
Desharnais is one of the most common datasets in the field of software effort estimation [51]. Recent research studies have investigated Desharnais dataset imputation for ABE performance evaluation [39,42,44]. The data contain 81 software projects from a Canadian software company; 77 projects are complete with no missing values, and four projects are considered incomplete with some missing values. The data have nine features; all features are numerical except one, language, which is categorical. Effort is considered the dependent feature and the other features are considered independent features. The statistical details of the Desharnais dataset are also reported. The percentage of missing values in the Desharnais dataset is relatively very low. In this study, two Desharnais datasets with different missing ratios are artificially created with the MNAR missing mechanism to validate the proposed missing data imputation methods for the ABE model. The Desh-Miss1 dataset has a 28.395% missing row ratio (23 out of 81 projects have missing values) and a 3.33% missing cell ratio (24 missing cells out of 720 cells) with the MNAR missingness mechanism, and the Desh-Miss2 dataset has a 69.135% missing row ratio (56 out of 81 projects have missing values) and a 7.916% missing cell ratio (57 missing cells out of 720 cells) with the MNAR missingness mechanism. Artificial missing data generation in software effort estimation has been performed in studies such as [18,39]. The histogram and pattern of missing data for the Desh-Miss1 and Desh-Miss2 datasets can be seen in Fig. 3 and Fig. 4, respectively.

B. Performance Accuracy Metrics
Several metrics have been used to evaluate the performance of estimation models, including the Mean Magnitude of Relative Error (MMRE), which is based on the Relative Error (RE) and the Magnitude of Relative Error (MRE) [13]. MMRE, the most used evaluation metric, is defined as:
$$ MRE_i = \frac{\lvert actual_i - estimated_i \rvert}{actual_i}, \qquad MMRE = \frac{1}{N} \sum_{i=1}^{N} MRE_i $$
The percentage of prediction (PRED) is defined as:
$$ PRED(X) = \frac{A}{N} $$
where A is the number of projects with MRE less than or equal to X and N is the total number of test set projects. Most effort estimation models are compared with X = 0.25 as the acceptable value [52]. Shepperd and MacDonell [53] proposed the SA measure, which is based on the mean absolute error (MAE). SA is considered an unbiased and standardized accuracy measure and gives an idea of the effectiveness of an estimation model compared to random guessing:
$$ SA = \left(1 - \frac{MAR_{p_i}}{\overline{MAR}_{p_0}}\right) \times 100 $$
where $MAR_{p_i}$ is the mean absolute error of estimation technique $p_i$, and $\overline{MAR}_{p_0}$ is the mean over a large number of random guesses (in our case 1000). The goal of an estimation model is to minimize MMRE and maximize the PRED and SA prediction results for software effort estimation.
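The three accuracy measures can be computed as in the following sketch. It assumes that random guessing for SA means predicting each project with the actual effort of another, randomly chosen project, and that SA is reported as a percentage; the function names, the `seed` parameter, and the toy values are hypothetical.

```python
import numpy as np

def mmre(actual, predicted):
    """Mean Magnitude of Relative Error."""
    a, p = np.asarray(actual, dtype=float), np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(a - p) / a))

def pred(actual, predicted, x=0.25):
    """Fraction of projects whose MRE is at most x."""
    a, p = np.asarray(actual, dtype=float), np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(a - p) / a <= x))

def sa(actual, predicted, runs=1000, seed=0):
    """Standardized accuracy (percent) relative to random guessing."""
    a, p = np.asarray(actual, dtype=float), np.asarray(predicted, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(a)
    mar = np.mean(np.abs(a - p))
    mar0 = []
    for _ in range(runs):
        # random guess: effort of another project picked at random (assumption)
        guess = np.array([a[rng.choice([j for j in range(n) if j != i])]
                          for i in range(n)])
        mar0.append(np.mean(np.abs(a - guess)))
    return float(100.0 * (1.0 - mar / np.mean(mar0)))

# Toy values (hypothetical efforts and predictions).
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 140.0, 400.0]
scores = (mmre(actual, predicted), pred(actual, predicted), sa(actual, predicted))
```

Lower MMRE and higher PRED and SA indicate better predictions; an SA near zero would mean the model does no better than random guessing.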
Cross validation: Cross-validation is introduced to give a more realistic accuracy evaluation of the estimation model by dividing the historical dataset into multiple training and testing sets. The groups have almost equal size; one group is selected as the test group and the remaining groups form the training set. The estimate is then computed for the test set, and the process continues iteratively until every group has been used as the test set; the number of iterations depends on the number of groups. This ensures that all projects are verified: each project is used as a test case exactly once across all iterations. The final performance is the mean of the performance metrics over all iterations, so the mean values of MMRE, PRED, and SA across iterations are taken as the final MMRE, PRED, and SA values.
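The fold construction described above can be sketched as follows. The generator name `k_fold_indices` and the use of a random permutation with a fixed seed are assumptions of this sketch; the study does not state how projects were assigned to folds.

```python
import numpy as np

def k_fold_indices(n, k=3, seed=0):
    """Yield (train, test) index arrays so that each item is tested exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)   # k groups of almost equal size
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

# Three folds over 81 projects, the size of the Desharnais dataset.
splits = list(k_fold_indices(81, k=3))
```

The per-fold metric values would then be averaged to obtain the final MMRE, PRED, and SA.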

C. Empirical Process
The empirical process adopted for this study is presented in Fig. 5. As can be seen from Fig. 5, it consists of four main steps: generating missing values, missing data imputation, ABE effort estimation, and accuracy evaluation. The design of the empirical process follows an approach similar to that used in [18,39,44] for evaluating the impact of MD imputation on ABE prediction performance.
Step 1: Generate missing values: in this study, two Desharnais datasets with different missing ratios are artificially created with the MNAR missing mechanism to validate the proposed missing data imputation methods for the ABE model. The Desh-Miss1 dataset has a 28.395% missing row ratio (23 out of 81 projects have missing values) and a 3.33% missing cell ratio (24 missing cells out of 720 cells) with the MNAR missingness mechanism, and the Desh-Miss2 dataset has a 69.135% missing row ratio (56 out of 81 projects have missing values) and a 7.916% missing cell ratio (57 missing cells out of 720 cells) with the MNAR missingness mechanism. Artificial missing data generation in software effort estimation has been performed in studies such as [18,39]. The histogram and pattern of missing data for the Desh-Miss1 and Desh-Miss2 datasets can be seen in Fig. 3 and Fig. 4, respectively. Table IV of Appendix presents a sample of the outcome (Desh-Miss2) of this step using the MNAR mechanism with 69.135% of MD on the Desharnais dataset.
Step 2: Missing data imputation: three imputation techniques (MI, KNNI, and (ABE-FCMI)) are used to impute missing values. The performances of these techniques are compared later to identify the best imputation technique for ABE prediction. Table XV of Appendix presents the outcome of Step 2 using (ABE-FCMI) imputation under the MNAR mechanism at 69.135% of MD on the sample data of Table XV.
Step 3: Effort estimation using ABE: software development effort is predicted by the ABE model from the imputed (complete) dataset. Euclidean distance is used as the similarity function and the mean is used as the solution function in the ABE procedure.
Step 4: Accuracy evaluation: the performance of ABE is evaluated after each imputation technique to discover which imputation method outperforms the others. MMRE, PRED(0.25), and SA are used as accuracy measures. Three-fold cross-validation is used as the evaluation method for the ABE prediction model.

VII. RESULT AND DISCUSSION
This section presents the experimental results for evaluating ABE performance using three imputation methods (MI, KNNI, and (ABE-FCMI)) on the Desharnais dataset with the MNAR missingness mechanism and different missing ratios (Desh-Miss1, Desh-Miss2). First, the experimental results for each incomplete dataset are evaluated individually; second, the imputation methods are compared across all the given incomplete datasets.

A. Effects of MI, KNNI and ABE-FCMI on Desharnais Dataset
As discussed before, the Desharnais dataset contains missing values. In projects number 38 and 44, the TeamExp feature values are missing. In projects number 38, 66, and 75, the ManagerExp feature values are missing. The Desharnais dataset therefore has a relatively low number of missing values compared to the other incomplete datasets in this study. In step 1, the Desharnais dataset is taken as the incomplete dataset. In step 2, missing data imputation is performed using MI, KNNI, and (ABE-FCMI). In step 3, the accuracy of ABE is measured for each imputation technique. Three-fold cross-validation has been used to generate the results. The overall empirical process can be seen in Fig. 5. Table II shows the MMRE results of the imputation methods on ABE, Table III shows the PRED(0.25) results of the imputation methods on ABE, and Table IV shows the SA results of the imputation methods.
As seen in Table II, MI and (ABE-FCMI) achieved the lowest MMRE values, 0.02622 and 0.02631 respectively, averaged over the three folds. They are followed by KNNI, with an MMRE of 0.02651. The lowest MMRE is achieved by MI, which is attributed to the low number of missing values in the Desharnais dataset. Table III shows the PRED(0.25) results obtained from applying the imputation methods to the Desharnais dataset based on three-fold cross-validation. As can be seen, the PRED values are the same for all imputation methods. The SA results for the imputation methods are given in Table IV. MI and (ABE-FCMI) achieved the best SA results, with values of 56.66670 and 56.49223 respectively, while KNNI achieved an SA of 56.38617. The best SA is achieved by MI, which is again attributed to the low number of missing values in the Desharnais dataset.

B. Effects of MI, KNNI and ABE-FCMI on Desh-Miss1 Dataset
As discussed before, the Desh-Miss1 dataset has a 28.395% missing row ratio (23 out of 81 projects have missing values) and a 3.33% missing cell ratio (24 missing cells out of 720 cells) with the MNAR missingness mechanism. Desh-Miss1 is an incomplete dataset generated from the Desharnais dataset.
As can be seen from

VIII. THREATS TO VALIDITY
In this empirical study, an evaluation of three imputation techniques using the MNAR missingness mechanism and different MD percentages has been reported. It is difficult to cover all possible scenarios, so some limitations may exist in this study.
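The two ratios quoted for Desh-Miss1 can be computed from a data matrix as in the sketch below. It is only an illustration: missing values are assumed to be encoded as NaN, the helper name `missing_ratios` is hypothetical, and the toy matrix is not taken from the Desharnais dataset.

```python
import numpy as np

def missing_ratios(data):
    """Return (missing row ratio, missing cell ratio) for a numeric matrix.

    Assumption: missing values are encoded as NaN.
    """
    mask = np.isnan(np.asarray(data, dtype=float))
    row_ratio = mask.any(axis=1).mean()   # rows with at least one missing cell
    cell_ratio = mask.mean()              # missing cells over all cells
    return float(row_ratio), float(cell_ratio)

# Toy matrix: 2 of 4 rows and 3 of 12 cells are missing.
toy = np.array([
    [1.0, 2.0, 3.0],
    [np.nan, 5.0, 6.0],
    [7.0, np.nan, np.nan],
    [10.0, 11.0, 12.0],
])
row_ratio, cell_ratio = missing_ratios(toy)
print(row_ratio, cell_ratio)  # 0.5 0.25

# The percentages reported for Desh-Miss1 follow from the stated counts.
print(round(100 * 23 / 81, 3), round(100 * 24 / 720, 2))  # 28.395 3.33
```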

A. Internal Validity
Internal validity is concerned with threats related to the scope of the study. In this study, we attempted to simulate scenarios with the MNAR missingness mechanism as well as different MD percentages. The MD generation process for the MNAR mechanism might be considered an internal threat. Attributes for MD generation in the studied dataset are selected at random. We simulate two incomplete datasets with different MD percentages; a threat might arise from the chosen MD percentages and from the fact that only the MNAR mechanism is investigated.

B. External Validity
External validity is related to threats that are concerned with empirical design and result generalization. In this experimental study, we investigate the Desharnais dataset as one of the most common datasets in the field of software effort estimation. Recent research studies have investigated Desharnais dataset imputation for ABE performance evaluation [39,42,44]. The Desharnais dataset is relatively small, with only 81 software projects, and contains only numerical attributes; these might be considered external threats (Table XIII).

IX. CONCLUSION AND FUTURE WORK
The quality of the dataset plays a vital role in accurate software effort estimation. Handling missing data is a major challenge in increasing the quality of the dataset used for effort prediction. ABE, as a widely accepted effort estimation model, depends mainly on the historical dataset of completed projects for effort prediction; therefore, addressing missing values in previously completed projects will improve the accuracy of ABE prediction. Different missing data imputation techniques have been used for the ABE model, including MI and KNNI. MI is a static imputation method that does not analyze the dynamic nature of each missing case in the feature concerned in the incomplete software project. KNNI applies the Euclidean similarity measure to the whole completed dataset to identify similar donor cases, which may or may not be related to the incomplete software project. In this study, an imputation technique based on FCM clustering has been proposed for the ABE model. The proposed (ABE-FCMI) technique is investigated on the Desharnais dataset with different missing ratios and the MNAR missingness mechanism. Experimental results suggest that the ABE model using FCM imputation provides a significant improvement over the ABE model using either MI or KNNI. The performance improvement of ABE with the proposed imputation method stems from the fact that the FCM algorithm clusters software projects into homogeneous clusters based on the selected dataset attributes. Based on the completed dataset, the FCM algorithm identifies cluster centers. Imputation values for each incomplete project are calculated based on its distance and membership to the cluster centers identified before. (ABE-FCMI) identifies more reliable donor cases for the incomplete software project to impute missing values compared to KNNI and MI.
The performance of the ABE model has been positively affected by the MD imputation techniques used in this study for incomplete datasets, as seen in the accuracy results. In comparison, (ABE-FCMI) significantly outperforms MI and KNNI in missing data imputation for the ABE model on the Desh-Miss1 and Desh-Miss2 incomplete datasets. For the Desharnais dataset, due to the low number of missing values, there is no significant difference between the three imputation techniques. The fuzzy clustering nature of (ABE-FCMI), which identifies groups of the most similar projects, indicates that it imputes more reliable values compared to MI and slightly better values than KNNI on small datasets.
The study results show that as the percentage of MNAR missing data increased from the Desh-Miss1 to the Desh-Miss2 incomplete dataset, the accuracy of the ABE model decreased with the MI and KNNI imputation methods; however, (ABE-FCMI) improved ABE accuracy despite the increased percentage of missing data.
The software engineering dataset investigated in this study is relatively small, with only 81 software projects. We suggest investigating (ABE-FCMI) on large software engineering datasets to generalize our results. Numerical missing value imputation is the focus of this study; mixed (numerical and categorical) missing data imputation needs to be investigated to verify the performance of the (ABE-FCMI) method for the ABE model.
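The imputation idea summarized above (cluster the complete projects with FCM, then fill each missing value from the cluster centers according to distance and membership) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: it assumes standard fuzzy c-means with fuzzifier m = 2, distances from an incomplete project to each center measured on its observed features only, and each missing value imputed as the membership-weighted average of the corresponding center coordinates. The function names, the number of clusters, and the toy data are hypothetical, and features are assumed to be on comparable scales.

```python
import numpy as np

def memberships(dist, m=2.0):
    """Fuzzy memberships from a (n, c) matrix of distances to the centers."""
    dist = np.fmax(dist, 1e-12)                 # avoid division by zero
    inv = dist ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def fcm(X, c=2, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Plain fuzzy c-means on complete rows; returns centers and memberships."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)           # each row of U sums to 1
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_U = memberships(dist, m)
        converged = np.abs(new_U - U).max() < tol
        U = new_U
        if converged:
            break
    return centers, U

def fcm_impute(data, c=2, m=2.0):
    """Impute NaN cells using centers learned from the complete rows."""
    data = np.asarray(data, dtype=float)
    missing = np.isnan(data)
    complete = data[~missing.any(axis=1)]
    centers, _ = fcm(complete, c=c, m=m)
    imputed = data.copy()
    for i in np.where(missing.any(axis=1))[0]:
        obs = ~missing[i]
        # Distance to each center, using only the observed features.
        dist = np.linalg.norm(centers[:, obs] - data[i, obs], axis=1)
        u = memberships(dist[None, :], m)[0]
        # Membership-weighted average of the center values.
        imputed[i, missing[i]] = u @ centers[:, missing[i]]
    return imputed

# Toy data with two well-separated groups and two incomplete rows.
toy = np.array([
    [1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
    [8.0, 8.0], [8.2, 7.9], [7.9, 8.1],
    [1.1, np.nan],
    [np.nan, 8.0],
])
filled = fcm_impute(toy, c=2)
print(filled[6:])
```

In this sketch an incomplete project that lies close to one cluster center on its observed features receives an imputed value close to that center, whereas mean imputation would assign the same global mean to every incomplete project.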