Machine Learning for Predicting Employee Attrition

Employee attrition has become a focus of researchers and human resources departments because of the effects of poor performance on organizations regardless of geography, industry, or size. In this context, using machine learning classification models to predict whether an employee is likely to quit could greatly increase the human resources department's ability to intervene in time and possibly remedy the situation to prevent attrition. The objective of this study is to compare the performance of three machine learning techniques, namely the Decision Tree (DT), Support Vector Machines (SVM), and Artificial Neural Networks (ANN) classifiers, and to select the best model. These techniques are compared using the IBM Human Resource Analytic Employee Attrition and Performance dataset. The preprocessing steps for the dataset include data exploration, data visualization, data cleaning and reduction, data transformation, discretization, and feature selection. Parameter tuning and regularization techniques are applied to optimize the models and overcome overfitting. The comparative study found that the optimized SVM model was the best model for predicting employee attrition, with the highest accuracy of 88.87%, followed by ANN and DT.
Keywords: Artificial neural networks; decision tree; employee attrition; machine learning; support vector machines


I. INTRODUCTION
Machine learning is one of the artificial intelligence technologies that give systems the ability to learn and improve automatically from experience, or to gain human-like intelligence, without explicit programming. In other words, machine learning focuses on developing computer programs that can access data and use it to learn for themselves [1]-[4]. Machine learning (ML) is one of the fastest-growing fields of research and has been developed and applied successfully to a wide range of real-world domains [5]-[9]. This study presents a comparative analysis of three machine learning algorithms, i.e., Decision Tree (DT), Support Vector Machines (SVM), and Artificial Neural Networks (ANN), to predict employee attrition.
Employee attrition in an organization can mean the reduction of employees through normal means, such as retirement and resignation, the loss of clients due to old age, or retrenchment due to a change in the target demographics of the organization. A high rate of employee attrition is a major issue because it greatly impacts the organization. When employees leave an organization, they carry with them invaluable tacit knowledge, which is often the source of competitive advantage for the organization [10]. Employee attrition causes the organization to bear the cost of business disruption and of hiring and training new staff. On the other hand, higher retention means lower hiring and training costs and a more experienced workforce over time. Organizations nowadays have a strong business interest in understanding the drivers of staff attrition in order to reduce it. As a result, predicting employee attrition and identifying the major factors that contribute to it have become important objectives for an organization seeking to enhance its human resource strategy [11].
The IBM Human Resource Analytic Employee Attrition and Performance dataset used in this paper is a publicly available dataset from the Kaggle Dataset Repository. It is a fictional dataset created by IBM data scientists. The dataset includes four (4) major components: employee satisfaction, income, seniority, and demographics data. It consists of 1,470 instances and 35 attributes, several of which influence the predicted variable named "Attrition," which signifies whether an employee left the company or not. The class "Attrition" has 237 instances of "Yes" and 1233 instances of "No," giving an imbalanced class ratio of approximately 1:5.
The purpose of this study is to develop machine learning models, i.e., DT, SVM, and ANN, for predicting probable employee attrition and to compare the algorithms in terms of their accuracy and efficiency.

II. RELATED WORK
Human resources are considered an important aspect of an organization, and voluntary employee attrition has been identified as a key issue. The study in [10] focused on identifying employee-related attributes to predict employee attrition using decision tree algorithms.
Classification has been identified as an important issue in the emerging field of data mining. Over the years, there have been several studies on classification algorithms. Data mining algorithms must be efficient and scalable for the effective extraction of information from huge amounts of data in many data repositories or dynamic data streams. The key criteria that drive the development of many new data mining algorithms are efficiency, scalability, performance, optimization, and the ability to execute in real time [12]. Two (2) important performance indicators for data mining algorithms are the accuracy of a classification and the time taken for training. These indicators are mainly useful for selecting the best algorithms for classification or prediction tasks in data mining [13].
A study conducted by [14] using the IBM HR Employee Attrition & Performance dataset indicated an imbalance in the retrieved data. A correlation plot and histogram visualization were used during the data exploration stage to show the correlation between the continuous variables in the model. Subsequently, SMOTE (Synthetic Minority Oversampling Technique) was employed to balance the Attrition class.
The performance measurements observed in the literature mainly relate to finding the best accuracy and speed for building a machine learning model. Table I briefly documents the literature review findings related to comparative studies on employee attrition using machine learning classification algorithms:
Alduailij [17]: To predict when workers will leave. It proposed a combination of five ML algorithms with three techniques for feature selection.
Mohbey [14]: To predict which customer or employee will leave their current company or organization.
[20]: To predict employee turnover using machine learning techniques. SVM; SVM.
Ozdemir, Coskun, Gezer and Gungor [21]: To automatize the prediction of employee attrition utilizing data mining methods. To predict an employee's intention to leave the organization in the immediate future and identify the key features that influence the employee's intention to leave the organization. Logistic Regression and XGBoost; XGBoost.

III. METHODOLOGY
A. Data Preprocessing
1) Data description: The initial step in carrying out this study is a data preprocessing task. This study produces a data quality report to detect outliers and any unusual patterns in the dataset using statistical methods. Tables II and III show the data quality report of the dataset.
2) Detecting outliers: In addition to the above data quality report, forty-five (45) outliers were detected in the initial raw dataset using the Interquartile Range filter, and the outliers were then checked. These findings require further preprocessing, namely data cleaning, data reduction, and data transformation. No missing values exist, and the given data is complete.
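The interquartile range rule behind such a filter can be illustrated with a minimal sketch. The multiplier of 1.5 is the common Tukey convention and is an assumption here, since the factor used by the filter in this study is not stated; the function name and sample values are hypothetical.

```python
from statistics import quantiles

def iqr_outliers(values, factor=1.5):
    """Return the values lying outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical numeric attribute values, for illustration only.
sample = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
print(iqr_outliers(sample))
```

A larger factor flags fewer points, so the number of outliers reported depends on this choice.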
3) Data visualization: The pattern of each attribute should be examined through data visualization. The visualization shows that a few attributes need to be examined to ensure accuracy during the model classification process. Fig. 1 shows the data visualization of each attribute in the dataset.

4) Data cleaning and reduction:
The dataset is considered high dimensional as it consists of 35 attributes. Any irrelevant attributes that do not contribute to the objectives of this study should be removed. Based on the data quality report in Table III and the data visualization in Fig. 1, the "EmployeeCount," "StandardHours" and "Over18" features can be removed because their cardinality (number of distinct values) is "1", which means each holds the same value throughout the data. In addition, "EmployeeNumber" is not useful for the modeling and prediction process and can be removed from the dataset. No spelling inconsistencies, which may cause problems in later merges or transformations, were detected. Further description of data cleaning and reduction is given in Table IV.
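The cardinality check described above can be sketched as follows. This is an illustrative helper rather than the tool used in the study; the function name is hypothetical and the three records are invented values, not rows from the dataset.

```python
def drop_constant_columns(rows):
    """Remove every column whose cardinality (number of distinct values) is 1."""
    if not rows:
        return []
    keep = [col for col in rows[0] if len({row[col] for row in rows}) > 1]
    return [{col: row[col] for col in keep} for row in rows]

# Tiny hypothetical extract; values are illustrative only.
rows = [
    {"Age": 41, "EmployeeCount": 1, "Over18": "Y", "StandardHours": 80, "Attrition": "Yes"},
    {"Age": 49, "EmployeeCount": 1, "Over18": "Y", "StandardHours": 80, "Attrition": "No"},
    {"Age": 37, "EmployeeCount": 1, "Over18": "Y", "StandardHours": 80, "Attrition": "Yes"},
]
cleaned = drop_constant_columns(rows)
print(list(cleaned[0]))
```

An identifier such as "EmployeeNumber" would not be caught by this check, because it is unique per row rather than constant, and has to be removed by name.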

5) Normalization and discretization:
During data transformation in the preprocessing stage, feature scaling (normalization) is applied. Normalization is a method used to standardize the range of independent variables or features of data [23]. Applying it avoids dependency on the choice of measurement units for the attributes. In this study, the process rescales each feature to the range between 0 and 1. Data cleaning and reduction were also performed, including the discretization process and the change of attribute type from numerical to nominal. Four (4) attributes were removed based on the findings above, leaving 30 attributes. No outliers were detected when the interquartile range filter was rerun.
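A rescaling to the range between 0 and 1 is commonly done with min-max normalization, x' = (x - min) / (max - min). The sketch below assumes this is the scaling meant above; the function name is hypothetical.

```python
def min_max_scale(values):
    """Rescale a numeric attribute linearly so that min -> 0 and max -> 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no information; map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_scale([2, 4, 6]))
```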
6) Feature selection: The next preprocessing step is feature selection, which involves selecting features in the data and removing irrelevant and redundant information as much as possible to reduce the dimensionality of the dataset. Feature selection is a data reduction process that helps to improve accuracy, reduce overfitting, reduce training time, and identify the fields that are most important and predictive for a given analysis. For this study, the top fifteen (15) out of 30 attributes were selected based on several attribute selection methods, namely Correlation Attribute, Gain Ratio Attribute, and Symmetrical Uncertainty Attribute, as depicted in Table V.
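Two of the ranking criteria named above have standard information-theoretic definitions, sketched here for nominal (discretized) attributes: gain ratio is IG(Y; X) / H(X), and symmetrical uncertainty is 2 IG(Y; X) / (H(X) + H(Y)). The function names are hypothetical, and the exact evaluator settings used in the study are not stated.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of a list of nominal values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def information_gain(feature, target):
    """H(target) minus the entropy of target conditioned on feature."""
    n = len(target)
    groups = {}
    for f, t in zip(feature, target):
        groups.setdefault(f, []).append(t)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(target) - conditional

def gain_ratio(feature, target):
    split_info = entropy(feature)
    return information_gain(feature, target) / split_info if split_info else 0.0

def symmetrical_uncertainty(feature, target):
    denom = entropy(feature) + entropy(target)
    return 2 * information_gain(feature, target) / denom if denom else 0.0
```

Both scores lie between 0 (the attribute tells nothing about the class) and 1 (the attribute determines the class), so attributes can be ranked and the top-scoring ones retained.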

8) Model validation technique:
The k-fold cross-validation is applied to the training set in view of its simplicity. Generally, it results in a less biased or less optimistic estimate of model performance than other methods, such as a simple train/test split. This method is also chosen because of the limited data sample in this study. The cross-validation technique splits the data into k groups, which enables the model to be trained and validated on different sets iteratively. Overfitting refers to a situation where a machine learning model cannot generalize to, or fit, unseen data well. A strong indication of overfitting is a testing or validation error that is greater than the training error. There are different ways to resolve overfitting; cross-validation is an effective preventive measure against it [24].
9) Imbalanced data: The data quality report indicated an imbalance in the class distribution, with 237 tuples labeled "Yes" and 1233 tuples labeled "No." Data imbalance is a well-known issue in classification problems, where one class is frequently far more prevalent than the others. Class imbalance usually degrades the real performance of a classification algorithm by causing poor prediction of the minority class, which is often the center of attention in a classification problem. Imbalanced data requires techniques that can deal with unequal misclassification costs [25]. Hence, the SMOTE technique is applied to the training dataset to overcome the class imbalance, at a 200% oversampling degree with five nearest neighbors. Using SMOTE, the minority class is over-sampled from 194 to 582 "Yes" instances by creating "synthetic" examples rather than by over-sampling with replacement, as shown in Table VII.
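The SMOTE step can be illustrated with a simplified sketch for numeric features. At 200% oversampling, each minority instance yields two synthetic instances, so 194 original instances become 194 + 2 x 194 = 582. The function name, the seed, and the Euclidean distance are assumptions made for illustration; this is not the implementation used in the study.

```python
import math
import random

def smote(minority, percent=200, k=5, seed=0):
    """Return synthetic minority samples (percent/100 per original sample).

    Assumes numeric feature vectors and at least two minority samples.
    """
    rng = random.Random(seed)
    per_sample = percent // 100
    synthetic = []
    for i, x in enumerate(minority):
        # k nearest minority-class neighbours of x, excluding x itself
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        for _ in range(per_sample):
            nb = rng.choice(neighbours)
            gap = rng.random()  # position along the segment from x to nb
            synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic
```

Because each synthetic point lies on the segment between a minority instance and one of its neighbours, the new examples are interpolations rather than duplicates.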

B. Machine Learning Classification Algorithms
This section explains the three (3) algorithms that are used in this study:

1) Decision Tree (DT):
DT is defined as a tree that classifies instances by sorting them based on feature values. The tree is made up of three fundamental segments: the root node, internal nodes, and leaf nodes, as shown in Fig. 2, which illustrates a sample decision tree with its flowchart-like structure. In a DT, each internal node represents a feature or attribute of the instance to be classified, each branch represents a test result, and leaf nodes represent class labels or class distributions. Classification of an instance starts from the root node, and the instance is sorted down the tree based on its feature values.
The basic algorithm for decision tree induction is a greedy algorithm that constructs decision trees in a top-down recursive divide-and-conquer manner [18]. C4.5 is an algorithm used to generate a decision tree based on information theory, and its Java implementation is known as J48. The classifiers, like filters, are organized in a hierarchy. A decision tree can be induced by various algorithms. However, as the tree grows deeper, it sometimes generates unwanted and meaningless branches, which is called overfitting. Therefore, pruning is needed to reduce the size of a tree that is too large and deep. Noise and overfitting reduce the efficiency and accuracy of the classifier [18]. There are various decision tree induction algorithms and various pruning parameters. In this study, pruning parameters such as the confidence factor and the minimum number of objects (at the leaf node) were tuned to improve the DT classifier's performance.
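The way a decision tree sorts an instance from the root node down to a leaf can be sketched as below. The tree, the attribute names, and the split values are hypothetical and chosen only for illustration; they are not the tree learned in this study.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Node:
    attribute: Optional[str] = None   # attribute tested at a root/internal node
    children: Dict[str, "Node"] = field(default_factory=dict)  # one branch per test result
    label: Optional[str] = None       # class label, set on leaf nodes only

def classify(node, instance):
    """Follow the branch matching the instance's value until a leaf is reached."""
    while node.label is None:
        node = node.children[instance[node.attribute]]
    return node.label

# Hypothetical two-level tree.
tree = Node(attribute="OverTime", children={
    "No": Node(label="No"),
    "Yes": Node(attribute="JobLevel", children={
        "Low": Node(label="Yes"),
        "High": Node(label="No"),
    }),
})
print(classify(tree, {"OverTime": "Yes", "JobLevel": "Low"}))
```

Pruning corresponds to replacing a subtree such as the "JobLevel" node with a single leaf when the extra split is not supported by enough evidence.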

2) Support Vector Machines (SVM):
SVM is a popular supervised machine learning algorithm and, based on the literature, is commonly used for employee attrition datasets. SVM acts as a classifier that categorizes the data into different "classes" or as a regression function that estimates the numerical value of the desired output based on a linear combination of features, for both linear and non-linear data [27]. SVM trained with the sequential minimal optimization algorithm is known as SMO.
In relation to this study, the SVM model, which is built on the training dataset, tries to generalize from the features of the input data and make a prediction. The SVM then produces a model that predicts the target values of the test data [27]. The basic idea of SVM is to separate classes with the maximum margin created by hyperplanes.
The tuning parameters in SVM include the kernel, the regularization parameter (C), and gamma. Polynomial and exponential kernels compute the separating boundary in a higher-dimensional space, a technique called the kernel trick [27].

3) Artificial Neural Networks (ANN):
ANN is a machine learning technique that acquires knowledge through learning and is used to solve classification problems. An ANN can be organized in different topologies or architectures, such as feedforward and recurrent neural networks. The most common neural network model is the Multilayer Perceptron (MLP), a non-linear predictive model that learns through training and is a feedforward network.
The objective of a generic MLP is to find an unknown function f that relates the input vectors in X to the output vectors in Y. During training, the function f is optimized so that the network output for the input vectors in X is as close as possible to the target values in Y. Matrices X and Y represent the training data. For a given ANN architecture, the function f is determined by the adjustable network weights. In ANN, the learning rate can be configured with a small positive value, often in the range between 0 and 1 [28].
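The role of the weights and the learning rate can be illustrated with a minimal sketch of a one-hidden-layer MLP forward pass and a single gradient-descent update. The sigmoid activation, the single hidden layer, the function names, and the default learning rate of 0.3 are assumptions made for illustration, not the configuration used in this study.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """One fully connected layer: sigmoid(W x + b), with W given as a list of rows."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """f(x) for a one-hidden-layer MLP; f is fully determined by the weights."""
    return layer(layer(x, w_hidden, b_hidden), w_out, b_out)

def sgd_step(weight, gradient, learning_rate=0.3):
    """Gradient-descent update: move against the gradient, scaled by the learning rate."""
    return weight - learning_rate * gradient
```

A larger learning rate takes bigger steps per update and may train faster but less stably, which is why it is treated as a tuning parameter.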

C. Machine Learning Tasks Result
For this study, five (5) measures are used to compare the performance of the three (3) classifiers being studied, i.e., J48, SVM, and ANN: the accuracy rate, the error rate, the root mean square error (RMSE), the receiver operating characteristic (ROC), and the time taken (speed) to build a model. The prediction accuracy is defined as the number of correct predictions divided by the total number of predictions, expressed as a percentage. The RMSE is an absolute measure of how well the model fits the data; a lower value of RMSE indicates a better fit. ROC tells how capable the model is of distinguishing between classes. The time taken to build a model is another important consideration in choosing the best classifier model [4].
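The measures above can be made concrete with a small sketch. It assumes binary labels coded as 1 ("Yes") and 0 ("No"), RMSE computed on the predicted probability of the positive class, and ROC summarized as the area under the curve; these conventions and the function names are assumptions, as the study does not spell them out.

```python
import math

def accuracy(y_true, y_pred):
    """Percentage of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)

def error_rate(y_true, y_pred):
    return 100.0 - accuracy(y_true, y_pred)

def rmse(y_true, prob_yes):
    """RMSE between 0/1 labels and the predicted probability of the positive class."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, prob_yes)) / len(y_true))

def roc_auc(y_true, prob_yes):
    """Area under the ROC curve: chance that a positive is ranked above a negative."""
    pos = [p for t, p in zip(y_true, prob_yes) if t == 1]
    neg = [p for t, p in zip(y_true, prob_yes) if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predictions for four employees.
y_true = [1, 0, 1, 0]
prob_yes = [0.9, 0.2, 0.4, 0.6]
y_pred = [1 if p >= 0.5 else 0 for p in prob_yes]
```

On imbalanced data, accuracy alone can look high even when the minority class is predicted poorly, which is one reason to report it alongside the ROC measure.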
At the initial stage, the modeling task was carried out on the training dataset using the default parameters of each classifier, with the SMOTE resampling technique applied and 10-fold cross-validation used for evaluation. A comparison of classifier performance is given in Table VIII. As seen from the table, the following findings were identified in the initial modeling process:
1) ANN had the highest accuracy at 86.76%, while SVM showed the lowest at 81.97%.
2) ANN showed the best RMSE with the lowest value of 0.3359.
3) ANN showed the best ROC with the highest value of 0.922.
4) J48 achieved the best time to build a model at 0.02 sec.
Machine learning algorithms can be optimized or configured to elicit different modeling behavior. Hence, in the next part, parameter tuning is conducted to optimize each model's current performance. The model is then tested on unseen data after the parameter tuning is done.

D. Parameter Tuning
Parameter tuning is the process of optimizing the performance of a model, that is, obtaining the best result for each measurement. Parameter tuning is an important step in modeling, although it is by no means the only way to improve performance.
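Tuning one parameter at a time amounts to a simple sweep: evaluate the model at each candidate value and keep the value with the best score. The sketch below uses a hypothetical stand-in for "train with cross-validation and return the accuracy"; the toy scoring function and its peak are invented for illustration and do not reproduce any reported result.

```python
def sweep(evaluate, candidates):
    """Evaluate each candidate value and return (best_value, best_score, all_scores)."""
    scores = {value: evaluate(value) for value in candidates}
    best = max(scores, key=scores.get)
    return best, scores[best], scores

# Hypothetical stand-in for a cross-validated accuracy; peaks at 0.4 by construction.
def toy_accuracy(confidence_factor):
    return 80.0 - 10.0 * abs(confidence_factor - 0.4)

grid = [round(0.1 * i, 1) for i in range(1, 11)]   # 0.1, 0.2, ..., 1.0
best, score, _ = sweep(toy_accuracy, grid)
print(best, score)
```

Sweeping one parameter while holding the others fixed is cheaper than a full grid search but can miss interactions between parameters.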

1) J48:
For the Decision Tree (J48) classifier, the value of the confidence factor and Minimum Number of Objects are tuned to achieve the best model and to avoid overfitting.
a) Confidence factor: The default-parameter results above were obtained with a confidence factor of 0.25. Table IX shows the results of tuning the confidence factor from 0.1 to 1.0 on the J48 model.
The confidence factor is tuned in DT to test the effectiveness of post-pruning. Post-pruning is the process of evaluating the decision error, that is, the estimated percentage of misclassifications, at each decision junction and propagating this error up the tree. Fig. 3 shows that the highest accuracy, 83.57%, is reached at a confidence factor of 0.4, and that the accuracy remains constant at 82.61% from a confidence factor of 0.6 onward. Hence, 0.4 is the optimal confidence factor for the J48 classifier, since increasing it further leads to lower accuracy.
b) Minimum number of objects: Parameter tuning is also conducted to find the optimal value for the minimum number of objects. For this study, the minimum number of objects is tuned over the range 0 to 30 at the confidence factor of 0.4. Table X shows the results for this pruning parameter. The minimum number of objects specifies the number of instances at a leaf node as a threshold value, which means it specifies the minimum number of data separations per branch [26]. Fig. 4 shows that beyond a value of 1, the accuracy decreases as the minimum number of objects increases. The highest accuracy, 84.40%, is obtained at a value of 1 (the minimum is 0, and the value cannot be negative). Hence, a minimum number of objects of 1 is optimal for the model.

2) SVM:
The performance of the SVM classifier depends on the choice of kernel, since an appropriate kernel provides learning capability to the SVM. For this experiment, as proposed in the literature, three (3) kernel functions were compared in parameter tuning: the polynomial kernel, the radial basis function (RBF) kernel, and the Pearson VII universal kernel (PUK) function [29]-[31]. The regularization parameter (C) is tuned for each kernel to improve the SVM model performance. C determines how much penalty is given for misclassification.
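The three kernel functions can be written compactly as below. The polynomial and RBF forms are the standard ones; the PUK form follows the commonly cited Pearson VII universal kernel with width sigma and tailing factor omega. The function names and all default parameter values are assumptions for illustration and are not the settings used in this study.

```python
import math

def sq_dist(x, y):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def polynomial_kernel(x, y, degree=2, coef0=1.0):
    return (sum(a * b for a, b in zip(x, y)) + coef0) ** degree

def rbf_kernel(x, y, gamma=0.5):
    return math.exp(-gamma * sq_dist(x, y))

def puk_kernel(x, y, sigma=1.0, omega=1.0):
    # Pearson VII universal kernel: 1 / (1 + (2*||x-y||*sqrt(2^(1/omega) - 1) / sigma)^2)^omega
    scaled = 2.0 * math.sqrt(sq_dist(x, y)) * math.sqrt(2.0 ** (1.0 / omega) - 1.0) / sigma
    return 1.0 / (1.0 + scaled ** 2) ** omega
```

Each function returns a similarity between two instances; the SVM only ever uses these values, which is what lets it separate classes in a higher-dimensional space without computing that space explicitly.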
The results of kernel and C tuning are given in Table XI.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 12, No. 11, 2021, www.ijacsa.thesai.org
The tuning results showed that the SVM model with the PUK kernel produced the best fit, with the highest accuracy of 88.87% and the lowest RMSE of 0.3335 compared to the other kernels, when C is set to 10. There is no change in accuracy beyond a C value of 10; hence, the value is already optimized. This experiment also showed that the choice of kernel function had a significant effect on the performance of the SVM model for the employee attrition dataset after parameter tuning.

3) ANN:
In ANN, parameter tuning is performed by adjusting the learning rate. Table XII and Fig. 5 show the performance results of tuning the learning rate. The tuning results showed that ANN performed best at a learning rate of 0.4, the optimal value, with an accuracy of 87.98% and a model-building time of 86.41 sec. This algorithm was initially chosen in view of its capacity to detect all possible interactions between variables. However, even though this study used a small dataset with only 15 attributes after feature selection, ANN requires more time to create the model and more machine resources than the other machine learning algorithms. Moreover, the accuracy of 87.98% is still lower than that of the SVM. Hence, it is a less favorable option for this type of dataset.

E. Regularization
Regularization is a technique used to overcome the overfitting problem of a model. Overfitting refers to an occurrence where the model learns both the target function and the noise during training, which affects the performance of that model on the test or unseen data.
Regularization reduces the variance of the model without a substantial increase in its bias. In this study, a few regularization techniques were used to limit overfitting. As explained above, a tuning parameter is applied in each of the classifiers and is used as part of the regularization techniques to control the impact on bias and variance. As the value of the tuning parameter rises, it reduces the value of the coefficients, thus reducing the variance to avoid overfitting without losing any important properties in the data. However, underfitting occurs when the model starts to lose important properties beyond a certain value, and this leads to rising bias in the model. Therefore, the value chosen during parameter tuning must be carefully selected [32].
Moreover, this study uses pruning to reduce the size of the decision tree to overcome overfitting. The SMOTE oversampling technique was applied to treat the imbalanced minority class in the dataset. Also, the 10-fold cross-validation method, which is a resampling procedure, has given coherent results and is used to overcome the overfitting issue. Generally, regularization refers to a broad range of techniques for artificially forcing the machine learning model to be simpler and increasing its chances of generalizing.
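The 10-fold cross-validation procedure referred to above can be sketched as an index split: the data is shuffled once, divided into k near-equal groups, and each group serves once as the validation set while the remaining groups form the training set. The function name and the seed are hypothetical, and stratification by class is omitted for brevity.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k near-equal groups
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

for train, validation in k_fold_indices(25, k=10):
    pass  # a model would be fitted on `train` and scored on `validation` here
```

Averaging the score over the k validation sets gives the cross-validated estimate, and every instance is used for validation exactly once.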

A. The Effect of Feature Selection on Classification Accuracies
The 10-fold cross-validation test option is used to compare the accuracy obtained with the top 15 attributes against that obtained with all 30 attributes. The result is depicted in Table XIII. The results indicate that using the top 15 attributes from feature selection greatly reduced the time taken to build the model, from 330.23 sec to 28.01 sec, without affecting the accuracy much: there is only a slight change, from 85.96% to 85.13%.
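The trade-off reported above can be quantified with a short calculation on the stated figures. The variable names are hypothetical; the numbers are those given in the text.

```python
time_all, time_top15 = 330.23, 28.01   # seconds to build the model (30 vs. 15 attributes)
acc_all, acc_top15 = 85.96, 85.13      # accuracy in percent

time_saved_pct = 100.0 * (time_all - time_top15) / time_all
speedup = time_all / time_top15
accuracy_drop = acc_all - acc_top15    # in percentage points

print(f"{time_saved_pct:.1f}% less build time, {speedup:.1f}x faster, "
      f"{accuracy_drop:.2f} points lower accuracy")
```

In other words, the build time falls by roughly nine-tenths while the accuracy changes by less than one percentage point.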

B. Comparative Result between Classifiers after Parameter Tuning and Regularization using 10-Fold Cross-Validation
Table XIV shows the results obtained after parameter tuning and regularization were applied to each classifier. The training dataset results represent the best result for each classifier after applying parameter tuning and regularization. The results were then compared with those on the unseen test data. From the results in Table XIV, SVM is revealed to be the best model for separating the classes and can later be used to decide the class of a new set of data in predicting attrition. SVM ranks first with an accuracy rate of 88.87% (with parameter tuning at C=10 under the PUK kernel), closely followed by ANN at 87.38%. DT showed the lowest accuracy rate of 84.40%. The performance measures on the test dataset were close to those on the training data and did not exceed them. This indicates that the model is not overfitted and is useful for predicting attrition on new, unseen data.

V. CONCLUSION
The comparative study on the IBM Human Resource Analytic Employee Attrition and Performance dataset was conducted to evaluate three classification models, i.e., J48, SVM, and ANN. The SVM model achieved the best accuracy, RMSE, and speed values after parameter tuning and regularization. Each of the three (3) classifiers used in this study has advantages and limitations; thus, evaluation is required to determine its suitability for solving the problem in relation to the dataset being studied.
As data preprocessing may affect how the outcomes of the final model are interpreted, considerable effort was placed on the preprocessing stage in this study, and it took a considerable amount of processing time. Several challenges and critical constraints faced in this study include the limited size of the dataset, the imbalanced class, and the high dimensionality of the dataset. Hence, data preprocessing is an important stage to ensure that only relevant features are selected for the training set.
The crucial part of the modeling stage is the parameter tuning conducted for each algorithm, as different algorithms require different parameter settings. In this study, this is demonstrated by the fact that the initial accuracy of SVM was the lowest when no parameter tuning was applied. However, SVM showed the highest accuracy after parameter tuning due to its capacity to handle high-dimensional data through the use of different kernel functions. Also, regularization techniques were applied throughout the experiment to overcome the issue of overfitting during the modeling phase.
This paper focuses mainly on a comparative study of machine learning models for predicting whether an employee would leave the company or not, given an employee attrition dataset. Hence, future work may look into identifying the key features that lead to employee attrition. Apart from that, hyperparameter tuning approaches such as grid search or random search can be explored further to find the best combination of parameters and to ensure the efficiency and scalability of the model.