A Multi-purpose Data Pre-processing Framework using Machine Learning for Enterprise Data Models

Growth in the data processing industry has automated decision making across domains such as engineering, education and many fields of research. This growth has also increased the dependency of data-driven business decisions on enterprise-scale data models. The accuracy of such decisions depends directly on the correctness of the underlying data. In the recent past, a good number of data cleaning methods have been proposed by various research attempts. Nonetheless, most of these outcomes are criticized for being either too generic or too specific. Thus, a multi-purpose yet domain-specific framework for enterprise-scale data pre-processing is in demand in recent times. Hence, this work proposes a novel data cleaning framework comprising missing value identification using the standard domain length with significantly reduced time complexity, domain-specific outlier identification using a customizable rule engine, generic outlier reduction using double differential clustering, and finally dimensionality reduction using change-percentage dependency mapping. The outcome of this framework is significant, as the outlier and missing-value treatment showcases nearly 99% accuracy over a benchmarked dataset.

Keywords—Standard domain length; domain-specific rule engine; double differential clustering; change percentage; dependency map


I. INTRODUCTION
Many enterprises use a business data architecture based on an aggregation model covering all of their operational details. Most business data models can be conceptual as well as physical. In certain instances, it is self-evident when to create such a blueprint. Formal models, often enterprise data models, fare differently: where what was requested has not been completed or put to business use, enterprise data models have been abandoned or remain unfinished. The root cause of these failures is typically a fundamental error at the outset: initially, it was not obvious what issues the data model was meant to address, and it was not yet clear what lay behind these requirements. Setting out the questions to be asked and the business data model's intent makes it obvious when the data modeling is finished. There is also the option of building business data models unnecessarily, which increases both cost and time. When problems emerge that need more explanation, one should go back to the business data model. The use of an enterprise data model is particularly appropriate in the following two cases. First, the enterprise procedures are being changed due to an extensive reengineering program; developing an organizational data model in tandem with the enterprise method delivers tremendous benefit to the process reengineering effort. The second case arises in business design derived from a bottom-up method, where integration necessitates the use of a logical data model to display the overlaps between different structures.
Pre-processing of the dataset is one of the primary tasks in any data analytics or data-dependent research or project. The primary components of pre-processing ensure the removal and replacement of outliers, the removal and replacement of missing values and, sometimes, attribute reduction. Also, in some non-trivial situations, the removal of critical and sensitive information is part of the pre-processing method. The work by H. F. Ladd et al. [1] has clearly suggested many case studies where information hiding is highly important without losing any other crucial information. Nonetheless, generic datasets, unless related to personalized recommendation systems, come without personal identification information. Thus, the primary task for any data analyst or strongly data-dependent machine learning engineer is to identify and remove or replace the outliers or missing values [14].
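The three basic pre-processing components named above can be illustrated with a minimal sketch. This is a generic example, not the paper's framework; the column names, the median imputation rule, and the 3-sigma clipping rule are illustrative assumptions.

```python
# A minimal illustration (not the paper's framework) of the three basic
# pre-processing components: missing-value handling, simple outlier
# treatment, and attribute (column) reduction.
import numpy as np
import pandas as pd

def basic_preprocess(df: pd.DataFrame, drop_cols=None) -> pd.DataFrame:
    """Impute numeric missing values, clip z-score outliers, drop attributes."""
    out = df.copy()
    num_cols = out.select_dtypes(include=[np.number]).columns
    # 1) Missing-value replacement: impute with the column median.
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())
    # 2) Outlier treatment: clip values more than 3 standard deviations
    #    from the column mean (a generic, domain-agnostic rule).
    for c in num_cols:
        mu, sd = out[c].mean(), out[c].std()
        if sd > 0:
            out[c] = out[c].clip(mu - 3 * sd, mu + 3 * sd)
    # 3) Attribute reduction: drop columns declared irrelevant.
    return out.drop(columns=drop_cols or [])

# Hypothetical toy data: "age" has a missing value, "id" is uninformative.
df = pd.DataFrame({"age": [25, 30, np.nan, 1000], "id": [1, 2, 3, 4]})
clean = basic_preprocess(df, drop_cols=["id"])
```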
The reduction of outliers and missing values improves accuracy, as proven by many research attempts such as the work by T. Calders et al. [2]. Nonetheless, many parallel research works have also suggested that removing or replacing outliers or missing values directly from the dataset without much customization can lead to loss of data and result in incorrect classification or clustering. Thus, it is highly important to design the data pre-processing method to suit the domain from which the data is originally generated. This belief was initially projected by D. Pedreschi et al. [3] in the year 2008, though many researchers, such as S. Hajian et al. [4], have always emphasised data security.
Realizing the need for domain-specific data pre-processing and for enterprise-scale pre-processing with domain-specific outlier and missing value imputation methods, this work formulates a novel multi-purpose framework for data pre-processing [15].
The rest of the paper is organized as follows: in Section II, the parallel research outcomes are critically analysed; in Section III, the dataset used for this research is described; in Section IV, the proposed solutions are formulated using mathematical models; in Section V, the proposed algorithms based on the mathematical models are discussed; in Section VI, the complete framework is elaborated; in Section VII, the obtained results are discussed; in Section VIII, the comparative analysis is furnished; and finally, in Section IX, the research conclusion is formulated.

II. PARALLEL RESEARCH OUTCOMES
The final goal of any analytical project is to generate results in terms of predictions, projections, classifications or clustering. Nonetheless, all these outcomes depend solely on the cleanness of the data, which primarily refers to the reduction of missing values, outliers and, in the case of spatial datasets, noise. Hence, a good number of research attempts can be seen that propose frameworks specifically designed to reduce the anomalies in datasets.
The work by B. Fish et al. [5] has proposed a method to reduce the anomalies from datasets using confidence factors and a confidence metric. This method identifies the outliers and missing values in each domain of the dataset, and in case any attribute domain has more than half of its values as anomalies, the confidence matrix decides whether that specific attribute contributes to the final classification of the data. In case that attribute showcases lower dependency, it can be completely discarded from the dataset. Needless to mention, this method is criticized for lower accuracy due to information loss, in spite of its better time complexity.
Yet another research attempt by M. B. Zafar et al. [6] has tried to showcase the effect of anomalies on the final prediction from the dataset, arguing that, up to a certain extent, the effects can be ignored and the pre-processing stages skipped entirely. Nonetheless, this work is also highly criticised, as it does not suggest any specific boundaries for domain-specific dataset treatments.
In the other direction, the work by T. Kamishima et al. [7] has showcased that missing value imputation can be completely automated using various machine learning methods, and the accuracy of this method is also remarkable. Nevertheless, this work does not recommend any specific method to handle the domain-specific anomalies explained in Section IV of this literature. In the same direction, the work by M. Hardt et al. [8] has justified the process of weighting parameters for the reduction of anomalies using an equality principle. However, during a domain-specific pre-processing task, it is nearly impossible to identify the weights as equal across the dataset. Thus, this work also cannot satisfy the need addressed in this literature.
Yet another approach, by M. Feldman et al. [9], recommends that, during a pre-processing task, the knowledge from previous attempts can be utilized to reduce the time complexity: the recommendations from the anomaly reduction process on similar datasets can be applied to newer datasets, significantly reducing the time complexity. Needless to mention, generating the similarity characteristics of two different datasets is a challenge in itself, and the added time complexity must also be considered. This thought is confirmed by the work of C. Dwork et al. [10].
The two recent research outcomes by Z. Zhang et al. [11] and by J. Kleinberg et al. [12] have recommended using backtracking methods, which are also adopted in this literature and extended in Section V.
Further, with the detailed understanding of the parallel research attempts, in the next section of this work, the considered dataset for this research is analysed.

III. DATASET DESCRIPTION
Master and reference data are necessary to ensure continuity across implementations, but their scope must also be considered to prevent data processing inconsistency. Since most transaction data is almost invariably moved to data centers and monitoring structures, this is expected to include most organizations' data. To carry forward the research proposed in this work, 'The Public 2020 Stack Overflow Developer Survey Results' [13] is utilized. The description of this dataset is furnished here [Table I].
Further, based on this domain specific dataset, the formulation of the problems is carried out in the next section of this work.

IV. PROPOSED SOLUTIONS: MATHEMATICAL MODELS
After the critical analysis of the parallel research works and identification of the research problems in the previous section of this work, in this section of the work, the proposed solutions are presented using mathematical models.
This section primarily focuses on four different pre-processing methods: identification of the missing values, conditional outliers, generic outliers and, finally, reduction of the attributes.

Lemma 1: The missing values of any dataset can be identified using the standard domain length, realized as the maximum count of data points without missing or null values across its domains.

Proof: The domain count of any dataset shall be realized as the maximum number of elements without missing or null values. Hence, the maximum count ensures that the maximum number of elements is considered without the missing values, and in case all values are missing, the complete tuple is ignored.
Assuming that the total dataset, DS[], is a collection of multiple domains, D[], and each domain is again a collection of multiple data points, D_i, then for n domains or attributes the initial relation can be formulated as

DS[] = { D_1[], D_2[], …, D_n[] }

Also, assuming that each domain consists of m data points, this relation can be formulated as

D_j[] = { D_1, D_2, …, D_m }, for j = 1, 2, …, n

Further, assuming that the method Φ is responsible for counting the data points without missing or null values, and Y denotes that count, this proposed function can be formulated as

Y = Φ(D_j[]) = |{ D_i ∈ D_j[] : D_i is not null }|

Further, for each domain, the count of the data points, Y, must be compared with the maximum data point count (the standard domain length), X, using the divide and conquer method as follows:

X = max_j Φ(D_j[])
Henceforth, if the count of data points is less than the expected count of the data points in first or second half of the domain, then the process must be repeated to identify the missing values only in that half of the domain and the process shall be repeated iteratively to identify all missing values.
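The iterative halving described above can be sketched as follows. This is an assumed interpretation of the procedure, not the exact DMV-SDL algorithm: a segment whose observed count Y falls short of its expected count X is split in half, and each half is searched recursively until individual missing indices are located.

```python
# Sketch of the divide-and-conquer missing-value search described above.
# `None` stands for a missing/null data point.
def find_missing(domain, lo=0, hi=None):
    """Return the indices of missing values in `domain` via halving."""
    if hi is None:
        hi = len(domain)
    y = sum(1 for v in domain[lo:hi] if v is not None)  # observed count Y
    x = hi - lo                                         # expected count X
    if y == x:          # no missing values in this segment
        return []
    if x == 1:          # narrowed down to a single missing data point
        return [lo]
    mid = (lo + hi) // 2
    return find_missing(domain, lo, mid) + find_missing(domain, mid, hi)

idx = find_missing([4, None, 7, 9, None, 2])
```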
Further, the time complexity of this proposed method is analysed against the generic method.
Assuming that a total of k iterations must be performed over n domains, the time complexity of the proposed method, T_1, can be formulated as

T_1 = O(k · n), with k = O(log m) due to the iterative halving, thus T_1 = O(n · log m)

On the other hand, for the same identification using the generic full-scan method, the time complexity, T_2, can be formulated as

T_2 = O(n · m)

Since log m ≪ m, it is natural to realize that

T_1 ≪ T_2

Hence, the proposed method for missing value detection significantly reduces the time complexity with higher accuracy.
Further, the conditional outliers are addressed and resolved.
Lemma 2: The outliers within the valid range of the data can be removed using the domain-specific rule sets.
Proof: The dataset contains multiple outliers, which can reside within the valid range of the data. Thus, the domain-specific outliers must be addressed with a valid set of domain-specific rules in a rule engine.
Assuming that the domain-specific ruleset or rule engine, R[], is a collection of individual rules, R_i, then for a total of n rules this relation can be formulated as

R[] = { R_1, R_2, …, R_n }

Further, the dataset defined earlier is fetched and validated against the ruleset for the removal of the outliers as

DS'[] = { D_i ∈ DS[] : R_j(D_i) holds for all R_j ∈ R[] }

Here, each attribute is considered to have its own domain with m records each, and the data elements are denoted as D_i. Further, the Euclidean distance between the data points can be considered as the similarity measure, and the total distance set is represented as λ[], then

λ(D_i, D_j) = √( Σ_k (D_{i,k} − D_{j,k})² )

Furthermore, repetitive iteration of this distance measure can capture the similarities in a deeper and more contextual aspect. Further, accuracy must be verified against time complexity to realize the best possible reduced set.
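The rule engine and the distance measure described above can be sketched as follows. The ruleset R is modeled as a list of per-record predicates R_i, and a record is flagged as a domain-specific outlier if any rule rejects it. The two rules shown are hypothetical examples for a developer-survey-style record, not the paper's actual ruleset.

```python
import math

def build_rule_engine(rules):
    """Return a checker listing the indices of rules a record violates."""
    def violates(record):
        return [i for i, rule in enumerate(rules) if not rule(record)]
    return violates

def euclidean(p, q):
    """Euclidean distance between two data points (the similarity measure)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical domain rules for a record with fields age and years_coding.
rules = [
    lambda r: 15 <= r["age"] <= 90,           # R_1: plausible age range
    lambda r: r["years_coding"] <= r["age"],  # R_2: cannot code longer than lived
]
check = build_rule_engine(rules)
flags = check({"age": 20, "years_coding": 25})  # violates R_2 only
```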
In the results section of this work, the time complexity and accuracy are analysed for building the final reduced dataset.
Henceforth, in the next section of this work, the proposed algorithms are furnished based on the proposed mathematical models of the solutions.

V. PROPOSED SOLUTIONS: ALGORITHMS
After the detailed analysis of the problems and the formulation of the proposed solutions using mathematical models, the proposed algorithms are furnished in this section of the work.
Firstly, the iterative missing value replacement algorithm is furnished here.

Algorithm:
Step -1. Import the dataset, DS.

The proposed algorithm is primarily based on the divide and conquer method and thus demonstrates a significant improvement in time complexity. Also, the proposed algorithm is capable of removing entire rows if all the fields are missing. In statistics, imputation is the process of replacing missing data with substituted values. There are three fundamental issues that missing data causes: missing data can introduce a substantial amount of bias, make the handling and analysis of the data more challenging, and reduce efficiency. In other words, when at least one value is absent for a case, most statistical packages default to discarding any case that has a missing value, which may introduce bias or affect the representativeness of the results. Imputation preserves all cases by replacing missing data with an estimated value based on other available information.
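A sketch of the two behaviours just described, under assumed semantics: tuples whose fields are all missing are discarded, and the remaining missing values are imputed from other available data (the column median here; the paper's standard-domain-length-based estimate may differ).

```python
import numpy as np
import pandas as pd

def impute_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Drop all-missing rows, then impute remaining NaNs with column medians."""
    out = df.dropna(how="all")                      # discard fully-missing tuples
    return out.fillna(out.median(numeric_only=True))

# Hypothetical columns; the last row is entirely missing and is dropped.
raw = pd.DataFrame({
    "salary": [50.0, np.nan, 70.0, np.nan],
    "years":  [2.0,   4.0,   6.0,  np.nan],
})
clean = impute_missing(raw)
```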
Secondly, the domain specific outlier removal algorithm is furnished here.

Algorithm:
Step -1. Building the rule engine, RS.
a. Outliers, or anomalies, can be a difficult issue when training machine learning algorithms or applying statistical techniques. They are often the result of errors in measurement or exceptional system conditions and therefore do not describe the normal functioning of the underlying system. Indeed, the best practice is to implement an outlier removal stage before proceeding with further analysis.
Sometimes, outliers can provide information about localized peculiarities in the whole system; hence, the detection of outliers is an important process because of the additional information they can provide about the dataset.
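The "removal stage before further analysis" can be illustrated with a generic statistical sketch. This uses the common interquartile-range (IQR) rule on a single numeric column and is not the paper's OR-DSRE engine.

```python
# Detect anomalies as points outside [Q1 - k*IQR, Q3 + k*IQR].
def iqr_outliers(values, k=1.5):
    """Return the indices of statistical outliers under the IQR rule."""
    xs = sorted(values)
    n = len(xs)
    q1 = xs[n // 4]            # approximate first quartile
    q3 = xs[(3 * n) // 4]      # approximate third quartile
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < lo or v > hi]

data = [10, 12, 11, 13, 12, 11, 300, 10, 13, 12]
anoms = iqr_outliers(data)   # flags the lone extreme value
```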
Thirdly, the generic outlier removal algorithm is furnished here.

Algorithm:
Clustering, or grouping, is the task of grouping a set of objects such that objects in the same group are more similar to one another than to those in other clusters. It is a fundamental task of exploratory data mining and a common technique for statistical data analysis.
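A simplified sketch of clustering-based generic outlier removal: points far from their cluster centroid are dropped, and the step is repeated until no point is flagged. This illustrates the idea only; the paper's double differential clustering (DDOD-R) uses a different, iterative distance criterion.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-dimension points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def remove_cluster_outliers(points, threshold):
    """Iteratively drop points farther than `threshold` from the centroid."""
    pts = list(points)
    while True:
        c = centroid(pts)
        kept = [p for p in pts if math.dist(p, c) <= threshold]
        if len(kept) == len(pts):   # converged: nothing removed this pass
            return pts
        pts = kept

# A tight cluster plus one far-away point (hypothetical data).
pts = [(1, 1), (2, 1), (1, 2), (2, 2), (40, 40)]
core = remove_cluster_outliers(pts, threshold=15.0)
```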
Fourthly, the attribute reduction algorithm is furnished here.

VI. PROPOSED FRAMEWORK
After the detailed analysis of the proposed algorithms in the previous section of the work, in this section the proposed framework is furnished and discussed [Fig. 1].
The dataset for this research is adopted from the stack overflow developer survey and identified as one of the prominent datasets for enterprise scale research for preprocessing.
The dataset is divided into two parts: the employee dataset, as already described, and the project dataset, as described in the previous section of the work.
The proposed framework functions in four phases. In the initial phase, the missing values in the employee collection are reduced, generating the missing-value-reduced employee dataset using the DMV-SDL algorithm. The second phase of the proposed framework performs two different tasks: reduction of the domain-specific outliers from the employee and project datasets, and merging of the datasets based on the employees' assigned projects using the OR-DSRE algorithm.
In the third phase of the proposed framework, the generic outliers are removed from the merged dataset, with its employee- and project-specific outliers, using the DDOD-R algorithm.
In the final phase of the proposed framework, the reduction of the attributes is taken care of using the CPODM-AR algorithm, where the validation of the reduction process is done using a classification method with accuracy and time complexity as the measuring parameters.
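The attribute-ranking step can be sketched under an assumed interpretation of the change percentage (not the exact CPODM-AR algorithm): an attribute whose per-class mean barely changes between the target classes contributes little to classification and is a candidate for reduction.

```python
# Rank attributes by the percent change of their mean between two classes.
def change_percentage(rows, labels, col):
    """Percent change of attribute `col`'s mean between class 0 and class 1."""
    a = [r[col] for r, y in zip(rows, labels) if y == 0]
    b = [r[col] for r, y in zip(rows, labels) if y == 1]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return abs(mb - ma) / abs(ma) * 100 if ma else float("inf")

def rank_attributes(rows, labels):
    """Attribute indices ordered from highest to lowest change percentage."""
    cols = range(len(rows[0]))
    return sorted(cols, key=lambda c: change_percentage(rows, labels, c),
                  reverse=True)

# Hypothetical data: attribute 0 separates the classes, attribute 1 does not.
rows = [(1.0, 5.0), (1.1, 5.1), (9.0, 5.0), (9.2, 4.9)]
labels = [0, 0, 1, 1]
order = rank_attributes(rows, labels)
```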
Further, the obtained results from this proposed framework are discussed in the next section of this work.

VII. RESULTS AND DISCUSSIONS
The obtained results from the proposed framework and the algorithms are highly satisfactory. In this section of the work, the obtained results are furnished and discussed in five segments.
Firstly, the missing value detection and replacement results are observed from the employee dataset [ Table IV].
The results are visualized graphically here [ Fig. 2]. The missing value analysis from the initial employee dataset by the proposed DMV-SDL is highly accurate and demonstrates 100% accuracy.
Secondly, the domain-specific outlier and missing value analysis of the merged dataset is furnished in Table V. The results are visualized graphically here [Fig. 3]. The proposed OR-DSRE algorithm has demonstrated 100% accuracy during the missing value analysis and nearly 90% accuracy during the domain-specific outlier detection process.
Thirdly, the generic outlier removal outcomes are furnished here [ Table VI].
The results are visualized graphically here [Fig. 4]. The iterative outlier identification and removal algorithm has also demonstrated nearly 100% accuracy, and the DDOD-R algorithm identifies all the outliers within 5 iterations.
Finally, the attribute reduction results are furnished here [Table VII]. Henceforth, based on the change percentage, the attributes, ordered from highest to lowest importance, are furnished here [Table VIII].
Further, based on the given rank, the attribute reduction process is carried out. The validation of the removal process is based on accuracy of classification and time complexity of processing [ Table IX].
It is natural to realize that after the 5th iteration the time complexity is greatly reduced, but the accuracy has also declined. Thus, the attributes identified up to the 5th iteration shall be marked as optimal.
The result is visualized graphically here [ Fig. 5].
Thus, based on the final analysis the reduced set attributes are furnished here [Table X].
Further, in the next section of this work, the comparative analysis is carried out. This research operates on the benchmarked dataset by Stack Overflow and a synthetic dataset. The proposed DMV-SDL algorithm first processes the employee-related dataset and, due to the nature of the divide and conquer method, the reduction in time complexity is significant. Further, the Stack Overflow dataset and the synthetic project-specific dataset are analyzed under the OR-DSRE algorithm for domain-specific outlier imputation, which also provides a strategic merging of the datasets. Further, the DDOD-R algorithm is applied on the merged dataset for generic outlier imputation. The proposed framework demonstrates nearly 99% accuracy and, in some cases, up to 100% accuracy. The pre-processed dataset is analyzed under the CPODM-AR algorithm for dimensionality reduction and demonstrates nearly 99% accuracy with reduced time complexity for generic benchmarked classification algorithms. The work finally results in a multi-purpose, domain-specific data pre-processing framework for enterprise-scale data, making data-driven business decisions more reliable.
Future Enhancements: Each pre-processed dataset attribute may be linked to as many timelines as required. This holds for both the dependency properties and the dependency forms (start- and end-attributes). In terms of accuracy, validating the pre-processed datasets against closely related reference dataset libraries, matched with the original datasets, is strongly recommended.