Machine-Learning Techniques for Customer Retention: A Comparative Study

Customers are increasingly interested in the quality of service (QoS) that organizations can provide. Services offered by different vendors are not highly differentiated, which increases competition between organizations to maintain and improve their QoS. Customer Relationship Management (CRM) systems enable organizations to acquire new customers, establish a continuous relationship with them, and increase customer retention for greater profitability. CRM systems use machine-learning models to analyze customers' personal and behavioral data, giving the organization a competitive advantage by increasing the customer retention rate. These models can predict which customers are expected to churn and the reasons for churn. The predictions are used to design targeted marketing plans and service offers. This paper compares and analyzes the performance of different machine-learning techniques used for the churn prediction problem. Ten analytical techniques that belong to different categories of learning are chosen for this study. The chosen techniques include Discriminant Analysis, Decision Trees (CART), instance-based learning (k-nearest neighbors), Support Vector Machines, Logistic Regression, ensemble-based learning techniques (Random Forest, AdaBoost trees and Stochastic Gradient Boosting), Naïve Bayesian, and Multi-layer Perceptron. The models were applied to a telecommunication dataset that contains 3333 records. Results show that Random Forest and AdaBoost outperform all other techniques, with almost the same accuracy of 96%. Multi-layer Perceptron and Support Vector Machine can be recommended as well, with 94% accuracy. Decision Tree achieved 90%, Naïve Bayesian 88%, and finally Logistic Regression and Linear Discriminant Analysis (LDA) 86.7%.

Keywords: Customer relationship management (CRM); customer retention; analytical CRM; business intelligence; machine-learning; predictive analytics; data mining; customer churn


I. INTRODUCTION
For any business, customers are the basis of success and revenue, which is why companies have become more aware of the importance of gaining customers' satisfaction. Customer relationship management (CRM) supports marketing by selecting target consumers and creating cost-effective relationships with them. CRM is the process of understanding customer behavior in order to help the organization improve customer acquisition, retention, and profitability. Thus, CRM systems utilize business intelligence and analytical models to identify the most profitable group of consumers and target them to achieve higher customer retention rates. These models can predict customers with a high probability of churning by analyzing customers' personal, demographic and behavioral data, in order to provide personalized and customer-oriented marketing campaigns that gain customer satisfaction. The lifecycle of the business-customer relationship includes four main stages: 1) identification; 2) attraction; 3) retention; and 4) development.
1) Customer identification/acquisition: This stage aims to identify profitable customers and those who are highly likely to join the organization. Segmentation and clustering techniques can explore customers' personal and historical data to create segments/sub-groups of similar customers [1], [2].
2) Customer attraction: The identified customer segments/sub-groups are analyzed to identify the common features that distinguish customers within a segment. Different marketing techniques, such as targeted advertising and/or direct marketing, can be used to target different customer segments [3].
3) Customer retention: This is the main objective of CRM, as retaining existing customers is 5 to 20 times more cost-effective than acquiring new ones, depending on the business domain [4], [5]. Customer retention includes all actions taken by the organization to guarantee customer loyalty and reduce customer churn. Customer churn refers to customers moving to a competing organization or service provider. Customers may churn for better quality of service, offers and/or benefits. Churn rate is an important indicator that all organizations aim to minimize. For this reason, churn prediction is an integral part of a proactive customer retention plan [6]. Churn prediction uses data mining and predictive analytical models to predict the customers with a high likelihood of churning/defecting. These models analyze personal and behavioral customer data to support tailored and customer-centric retention marketing campaigns [7].

4) Customer development: The main objective of this phase is to increase the amount of customer transactions for greater profitability. For this purpose, market basket analysis, customer lifetime value, up-selling, and cross-selling techniques are used. Market basket analysis analyzes customers' behavior patterns to maximize the intensity of transactions [8], [9]. Analyzing customer lifetime value (CLTV) can help identify the total net income expected from a customer [10]-[12]. Up-selling and/or cross-selling include activities that increase the transactions of the associated services/products [13], [14].
Customer retention and churn prediction have been increasingly investigated in many business domains, including, but not limited to, telecommunication [15]-[18], banking [19]-[21], retail [22] and cloud services subscriptions [23], [24]. Different statistical and machine-learning techniques are used to address this problem. Many attempts have been made to compare and benchmark the techniques used for churn prediction. In [28], [66] a comparison between Decision Tree, Logistic Regression and Neural Network models was performed. The study found that the neural network performs slightly better than the other two techniques. Another comparison between a set of models and their boosted versions is discussed in [67]. This study included a two-layer Back-Propagation neural network (BPN), Decision Trees, SVM and Logistic Regression. The study showed that both decision trees and BPN achieved an accuracy of 94%, SVM came next with 93%, while Logistic Regression performed worst with an accuracy of 86%. Additionally, the study showed a 1-4% performance improvement in the boosted versions. In [68] the study investigated the accuracy of two models, Multi-layer perceptron (MLP) and Decision Tree (C5). The study showed that MLP achieves an accuracy of 95.51%, which outperforms the C5 decision tree at 89.63%.
Most comparisons in the literature do not cover the various categories of learning techniques. The bulk of the models applied for churn prediction fall into one of the following categories: 1) Regression analysis, 2) Decision tree-based, 3) Support Vector Machine, 4) Bayesian algorithm, 5) Instance-based learning, 6) Ensemble learning, 7) Artificial neural network, and 8) Linear Discriminant Analysis.
This study presents a comparative study of the most used algorithms for predicting customer churn. The comparison is held between algorithms from different categories. The main goal is to analyze and benchmark the performance of the models in the literature. The selected models are: 1) Regression analysis: logistic regression.

A. Contribution
The key contribution of this paper is the analysis of the most common learning techniques in the state of the art and the evaluation of their accuracy. This paper is organized as follows: Section II presents the state of the art of data mining techniques for churn prediction and briefly discusses the evaluated techniques. Section III discusses the methodology of the study, Section IV gives the results and discussion, and finally Section V concludes this work.

II. MACHINE-LEARNING FOR CHURN PREDICTION
Machine-learning techniques have been widely used for evaluating the probability that a customer will churn [25]. Based on a survey of the literature on churn prediction, the techniques used in the bulk of the literature fall into one of the following categories: 1) Regression analysis; 2) Tree-based; 3) Support Vector Machine; 4) Bayesian algorithm; 5) Ensemble learning; 6) Instance-based learning; 7) Artificial neural network; and 8) Linear Discriminant Analysis. A brief introduction to the chosen algorithms is presented in this section.
1) Regression analysis: Regression analysis techniques aim mainly to investigate and estimate the relationships among a set of features. Regression includes many models for analyzing the relation between one target/response variable and a set of independent variables. Logistic Regression (LR) is the appropriate regression analysis model to use when the dependent variable is binary. LR is a predictive analysis used to explain the relationship between a dependent binary variable and a set of independent variables. For customer churn, LR has been widely used to evaluate the churn probability as a function of a set of variables or customers' features [26]-[33].
2) Decision Tree: Decision Tree (DT) is a model that generates a tree-like structure representing a set of decisions. DT returns the probability scores of class membership. A DT is composed of: a) internal nodes, each of which refers to a single variable/feature and represents a test point at the feature level; b) branches, which represent the outcome of the test and are drawn as lines that finally lead to c) leaf nodes, which represent the class labels. This is how decision rules are established and used to classify new instances. DT is a flexible model that supports both categorical and continuous data. Due to this flexibility, decision trees gained popularity and became one of the most commonly used models for churn prediction [27]-[29], [33]-[36].
3) Support Vector Machine: Support Vector Machine (SVM) is a supervised learning technique that analyzes data in order to identify patterns. Given a set of labeled training data, SVM represents observations as points in a high-dimensional space and tries to identify the best separating hyperplanes between instances of different classes. New instances are represented in the same space and are assigned to a class based on their proximity to the separating gap. For churn prediction, SVM techniques have been widely investigated and found to have high predictive performance [37]-[41].
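As an illustration of how LR turns customer features into a churn probability, the following minimal sketch applies the logistic (sigmoid) function to a weighted sum of features. The coefficients, the bias and the two features (number of customer service calls and an international-plan flag) are hypothetical values chosen for illustration; they are not fitted to the dataset used in this study.

```python
import math

def churn_probability(features, weights, bias):
    """Logistic regression score: sigmoid of a weighted sum of the features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients (assumption, not taken from the study):
# feature 0 = customer service calls, feature 1 = international plan (1/0).
weights = [0.6, 1.5]
bias = -3.0

p = churn_probability([4, 1], weights, bias)
label = "churn" if p >= 0.5 else "no churn"
print(f"churn probability = {p:.3f} -> {label}")
```

The 0.5 cut-off used to turn the probability into a label is a common default and is also an assumption here.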
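The node, branch and leaf structure described above can be sketched as a nested dictionary that is traversed from the root until a leaf is reached. The tree below, including its features and thresholds, is a hand-written hypothetical example and not the CART model trained in this study.

```python
def classify(node, instance):
    """Follow the test at each internal node until a leaf (class label) is reached."""
    while "label" not in node:
        # Internal node: compare one feature against a threshold and pick a branch.
        branch = "left" if instance[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]

# Hypothetical tree (assumption): feature names and thresholds are illustrative only.
tree = {
    "feature": "service_calls", "threshold": 3,
    "left": {
        "feature": "intl_plan", "threshold": 0,
        "left": {"label": "no churn"},
        "right": {"label": "churn"},
    },
    "right": {"label": "churn"},
}

print(classify(tree, {"service_calls": 1, "intl_plan": 0}))
```

Each root-to-leaf path corresponds to one decision rule, for example "service_calls <= 3 and intl_plan <= 0 implies no churn".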
4) Bayes Algorithm: The Bayes algorithm estimates the probability that an event will happen based on previous knowledge of variables associated with it. Naïve Bayesian (NB) is a classification technique based on Bayes' theorem. It assumes complete independence between variables: the presence/absence of one feature is unrelated to the presence/absence of any other feature. Accordingly, all variables contribute independently to the probability that an instance belongs to a certain class. NB is a supervised learning technique that bases its predictions for new instances on the analysis of previously observed instances. The NB model usually outputs a probability score and class membership. For the churn problem, NB predicts the probability that a customer will stay with the service provider or switch to another one [42]-[46].
5) Instance-based learning: Also known as memory-based learning; new instances are labeled based on previous instances stored in memory. The most widely used instance-based learning technique for classification is K-nearest neighbor (KNN). KNN does not construct an internal model, and computations are not performed until classification time. KNN only stores the instances of the training data in the feature space, and the class of an instance is determined by a majority vote of its neighbors: the instance is labeled with the class most common among them. KNN determines neighbors based on distance, using Euclidean, Manhattan or Minkowski distance measures for continuous variables and Hamming distance for categorical variables. The calculated distances are used to identify the set of k training instances that are closest to the new point, and the label is assigned from these. Despite its simplicity, KNN has been applied to various types of applications. For churn, KNN is used to predict whether a customer churns based on the proximity of the customer's features to those of the customers in each class [17], [51].
6) Ensemble-based Learning: Ensemble-based learning techniques produce their predictions based on a combination of the outputs of multiple classifiers. Ensemble learners include bagging methods (e.g., Random Forest) and boosting methods (e.g., AdaBoost and stochastic gradient boosting).
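The KNN procedure described above (compute distances, keep the k closest training instances, take a majority vote) can be sketched in a few lines. The toy training points and labels below are hypothetical; the Minkowski distance reduces to Manhattan distance for p = 1 and Euclidean distance for p = 2.

```python
from collections import Counter

def minkowski(a, b, p=2):
    """Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def knn_predict(train_X, train_y, query, k=3, p=2):
    """Label the query with the most common class among its k nearest neighbors."""
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: minkowski(pair[0], query, p))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-feature customers (assumption, for illustration only).
train_X = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
train_y = ["no", "no", "no", "yes", "yes", "yes"]

print(knn_predict(train_X, train_y, (2, 2), k=3))
```

No model is built ahead of time: all work happens inside `knn_predict`, which matches the lazy, memory-based behavior described in the text.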

a) Random Forest
Random forests (RF) are an ensemble learning technique that supports both classification and regression. RF extends the basic idea of a single classification tree by growing many classification trees in the training phase. To classify an instance, each tree in the forest generates its response (a vote for a class), and the model chooses the class that has received the most votes over all the trees in the forest. One major advantage of RF over traditional decision trees is its protection against overfitting, which enables the model to deliver high performance [47]-[50].
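The voting step of a random forest can be sketched as follows. The three "trees" are hypothetical single-rule stand-ins written by hand; a real forest would grow each tree from a bootstrap sample of the training data, which this sketch does not attempt.

```python
from collections import Counter

def forest_predict(trees, instance):
    """Each tree votes for a class; the forest returns the majority class."""
    votes = Counter(tree(instance) for tree in trees)
    return votes.most_common(1)[0][0]

# Hypothetical stand-in trees (assumption): each is a single hand-written rule.
trees = [
    lambda c: "churn" if c["service_calls"] > 3 else "stay",
    lambda c: "churn" if c["intl_plan"] == 1 else "stay",
    lambda c: "churn" if c["day_mins"] > 250 else "stay",
]

customer = {"service_calls": 5, "intl_plan": 0, "day_mins": 300}
prediction = forest_predict(trees, customer)
print(prediction)
```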

b) Boosting-based techniques (AdaBoost and Stochastic Gradient Boosting)
Both AdaBoost (Adaptive Boosting) and Stochastic Gradient Boosting are ensemble-based algorithms built on the idea of boosting. They try to convert a set of weak learners into a stronger learner, where a weak learner is any algorithm that performs at least a little better than random guessing. The two algorithms differ in the iterative process during which weak learners are created. AdaBoost reweights observations, giving more weight to problematic ones (those that the weak learner could not handle) and decreasing the weight of correctly predicted ones. The main focus is to develop new weak learners that handle those misclassified observations. After training, weak learners are added to the stronger learner based on their alpha weight (accuracy): the higher the alpha weight, the more the weak learner contributes to the final learner. The weak learners in AdaBoost are decision trees with a single split, and the label assigned to an instance is based on the combination of the outputs of all weak learners weighted by their accuracy [56].
On the other hand, gradient boosting gives importance to misclassified/difficult instances using the remaining errors (pseudo-residuals) of the strong learner. At each iteration, the errors are computed and a weak learner is fitted to them. The contribution of the weak learner to the strong one is then chosen to minimize the overall error of the strong learner [57]. Both AdaBoost [58]-[60] and stochastic gradient boosting [61], [62] have been used for churn prediction.
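The AdaBoost reweighting step described above can be made concrete with one boosting round. The sketch assumes the standard discrete AdaBoost formulation with labels in {-1, +1} and alpha = 0.5 * ln((1 - err) / err); the source does not state which variant or implementation was used in the study, so this is an illustration rather than the authors' procedure.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost round: compute the learner's alpha and reweight observations.

    Assumes labels in {-1, +1} and a weighted error strictly between 0 and 1.
    """
    # Weighted error of the weak learner on the current observation weights.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Misclassified observations gain weight, correct ones lose weight.
    updated = [w * math.exp(alpha if t != p else -alpha)
               for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

weights = [0.25, 0.25, 0.25, 0.25]
y_true = [1, 1, -1, -1]
y_pred = [1, 1, -1, 1]          # the last observation is misclassified

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(alpha, new_weights)
```

After this round the single misclassified observation carries half of the total weight, so the next weak learner is pushed to handle it.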

7) Artificial neural network: Artificial Neural Networks (ANNs) are machine-learning techniques inspired by the biological neural network in the human brain. ANNs are adaptive, can learn by example, and are fault tolerant. An ANN is composed of a set of connected nodes (neurons) organized in layers. The input layer communicates with one or more hidden layers, which in turn communicate with the output layer. Layers are connected by weighted links. These links carry signals between neurons, usually in the form of a real number. The output of each neuron is a function of the weighted sum of all its inputs. The weights on the connections are adjusted during the learning phase to represent the strengths of the connections between nodes. ANNs can address complex problems, such as the churn prediction problem. Multilayer perceptron (MLP) is an ANN that consists of at least three layers. The network is trained using supervised learning techniques [52], [53]. For the customer churn problem, MLP has shown better performance than LR [21], [27], [28], [54], [55].
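The statement that each neuron outputs a function of the weighted sum of its inputs can be sketched as a forward pass through one hidden layer and one output layer. The layer sizes, weights and the choice of a logistic activation are assumptions made for illustration; they are not the weights reported in Table VIII.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weights, biases):
    """Each neuron applies the activation to the weighted sum of its inputs."""
    return [sigmoid(b + sum(w * x for w, x in zip(row, inputs)))
            for row, b in zip(weights, biases)]

def mlp_forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Input layer -> hidden layer -> output layer."""
    hidden = layer_forward(inputs, hidden_w, hidden_b)
    return layer_forward(hidden, output_w, output_b)

# Hypothetical weights (assumption): 2 inputs, 2 hidden neurons, 1 output neuron.
hidden_w = [[0.5, -0.2], [0.1, 0.8]]
hidden_b = [0.0, -0.1]
output_w = [[1.2, -0.7]]
output_b = [0.05]

outputs = mlp_forward([1.0, 2.0], hidden_w, hidden_b, output_w, output_b)
print(outputs)
```

Training (for example by backpropagation) would adjust `hidden_w`, `hidden_b`, `output_w` and `output_b`; only the forward computation is shown here.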
8) Linear Discriminant Analysis: Linear Discriminant Analysis (LDA) is a mathematical classification technique that searches for a combination of predictors that can differentiate two targets. LDA is related to regression analysis: both attempt to express the relationship between one dependent variable and a set of independent variables. However, unlike regression analysis, LDA uses continuous independent variables and a categorical dependent variable (target). The output label for an instance is estimated from the probability that the inputs belong to each class, and the instance is assigned the class with the highest probability. Probability in this model is calculated based on Bayes' theorem. LDA can also be used for dimensionality reduction by determining the set of features that are the most informative. LDA has been used for different classification tasks, including customer churn [63]-[65].

III. METHODOLOGY
As a first step, before applying the selected analytical models to the dataset, exploratory data analysis was performed to gain more insight into the dataset. Based on the observations, the data was preprocessed to be more suitable for analysis.
1) Data: The dataset used for the experiments of this study is a database of customer data of a telecommunication company. The dataset contains customers' statistical data, including 17 explanatory features related to customers' service usage during the day, international calls, and customer service calls. 14% of the observations have the target value "Yes" and 86% have the value "No". The dataset variables of customer transactions and their descriptions are presented in Table I, and Fig. 1 shows the distribution of each feature. Two of the explanatory variables (Int'l Plan and VMail Plan) were transformed from binominal form (yes/no) into binary form (1/0) to be more suitable for the selected models.
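The yes/no to 1/0 transformation applied to the two plan variables can be sketched as a simple mapping. The record layout and the key names below are hypothetical; only the encoding rule itself comes from the text.

```python
def encode_yes_no(value):
    """Map a yes/no string to 1/0, ignoring case and surrounding whitespace."""
    mapping = {"yes": 1, "no": 0}
    return mapping[value.strip().lower()]

# Hypothetical records (assumption): key names are illustrative only.
records = [
    {"intl_plan": "yes", "vmail_plan": "no"},
    {"intl_plan": "No", "vmail_plan": "Yes"},
]

encoded = [{name: encode_yes_no(value) for name, value in record.items()}
           for record in records]
print(encoded)
```

A value other than yes/no raises a `KeyError`, which makes unexpected categories visible instead of silently encoding them.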

b) Data cleaning
This stage includes missing data handling/imputation. Some of the selected algorithms, such as SVM, cannot handle missing data. Missing values can be replaced by the mean, the median or zero; however, replacing them with a statistically computed value (imputation) is a better option. The dataset included missing values in some of the numerical variables (Day Charge, Eve Mins, Intl Calls, Intl Charge and Night Charge) and in two categorical variables (VMail Plan, Int'l Plan). Numerical values were imputed using the random forest imputation technique [69], and binary values were imputed using the techniques in [70].

c) Feature selection
Feature selection before model training is one of the most important factors that can affect the performance of models. In this study, the importance of the variables was measured to identify and rank the influence of the explanatory variables on the target/response. This allows dimensionality reduction by removing variables/predictors with low influence on the target. The random forest technique can be used for feature selection using the mean decrease in accuracy, which measures the impact of each feature on model accuracy: the model permutes the values of each feature and evaluates the resulting change in accuracy. Only features with a higher impact on accuracy are considered important [71]. Another well-known feature selection technique, Boruta [72], was also used. It is an improvement on RF that follows an all-relevant approach, considering every feature that is relevant to the target variable, whereas most techniques follow a minimal-optimal approach. Additionally, it can handle interactions between features [72]. Both techniques were applied to rank the predictors, based on the mean importance from Boruta and the mean decrease in accuracy calculated by random forest. Both models agree on the top three variables with the same rank (custServ.Calls, Int.l.Plan, Day.Mins). Both models agree on the next six features with different ranks (Day.Charge, VMail.Message, Intl.Calls, Eve.Charge, Intl.Mins and Eve.Mins). Both models give a very low rank to the same four variables (Day.Calls, Night.Calls, Eve.Calls and Account.Length). Results are shown in Table II and Fig. 2.
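The permutation idea behind the mean decrease in accuracy can be sketched as follows: shuffle one feature column, re-score the model, and average the drop in accuracy. This is a simplified, model-agnostic version; a random forest computes the measure per tree on out-of-bag samples, which the sketch does not reproduce. The toy model and data are hypothetical.

```python
import random

def accuracy(model, X, y):
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(model, X, y, feature, repeats=20, seed=0):
    """Mean drop in accuracy after shuffling the values of one feature column."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    drops = []
    for _ in range(repeats):
        column = [row[feature] for row in X]
        rng.shuffle(column)
        permuted = [row[:feature] + [value] + row[feature + 1:]
                    for row, value in zip(X, column)]
        drops.append(baseline - accuracy(model, permuted, y))
    return sum(drops) / repeats

# Hypothetical model (assumption): predicts churn from feature 0 only.
model = lambda row: 1 if row[0] > 3 else 0
X = [[0, 7], [1, 2], [2, 9], [3, 4], [1, 5], [4, 1], [5, 8], [6, 3], [7, 6], [5, 0]]
y = [model(row) for row in X]

print(permutation_importance(model, X, y, feature=0))
print(permutation_importance(model, X, y, feature=1))
```

Feature 1 is never used by the toy model, so shuffling it cannot change any prediction and its importance is exactly zero, which mirrors how low-influence variables are identified for removal.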
3) Simulation Setup: For this study, the selected models are used to generate predictions using the dataset containing 3333 samples with 13 predictors and one response variable. 10-fold cross-validation was used for model training and testing. Training and testing datasets are randomly chosen with cross-validation, 60% for training and 40% for testing. Each model requires initial parameters, which are set as follows:

a) Decision Tree (CART)
One parameter is used for the decision tree: cp, a complexity parameter used to control the tree size. Accuracy is used to choose the optimal model. The final cp value used for the model was 0.07867495, as shown in Table III.

b) Support Vector Machine
To train the SVM, two main parameters are required: C and sigma. The C parameter is the penalty cost and affects the prediction: a large value of C means high accuracy in training and low accuracy in testing, while a small value of C yields unsatisfactory accuracy. The sigma parameter has more influence on classification than C, as it affects the hyperplane partitioning. Too large a value of sigma leads to over-fitting, while small values lead to under-fitting [73]. Cross-validation was performed to select and tune the parameters. The values that gave the highest accuracy were sigma = 0.06295758 and C = 1, as shown in Table IV.
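The 10-fold cross-validation used in the simulation setup, and again for parameter tuning, can be sketched as an index splitter: the samples are shuffled once, divided into k folds, and each fold serves as the test set exactly once. The function name and the fixed seed are assumptions; the study's own tooling is not specified in the text.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(k_fold_indices(3333, k=10))
for train, test in splits[:2]:
    print(len(train), len(test))
```

To tune a parameter such as C, sigma or k, one would train on each `train` list, score on the matching `test` list, average the ten accuracies per candidate value, and keep the value with the highest mean.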

c) K-nearest Neighbor
In KNN, one parameter needs to be tuned: k, the number of instances/neighbors that are considered when assigning an instance to a class. Cross-validation was performed using different k values. The results in Table V show that the highest accuracy is obtained using k = 7.

d) AdaBoost
For the AdaBoost model, nIter represents the number of weak learners to be used. Grid search was used to determine the value with the best accuracy. Results show that the highest accuracy is at nIter = 100, as shown in Table VI.

e) Random Forest
A forest of 500 decision trees was built using the Random Forest algorithm. Error rate results indicate that beyond 100 trees there is no significant error reduction. Another parameter is mtry, which indicates the number of predictors sampled for splitting at each node. Results in Table VII show that the optimal performance is at mtry = 7.

f) Stochastic gradient boost
The model was tuned to find the number of trees that achieves the best accuracy. The parameter was varied from 5000 to 1000000. Results show that beyond 60000 trees (ntrees) there is no significant change in accuracy, as shown in Fig. 3.

g) MLP ANN
The multi-layer perceptron neural network was built using 13 inputs, 2 outputs and one hidden layer with 5 neurons. The initial weight matrix was randomly generated. The learning function is "Std_Backpropagation" and the learning rate is 0.1.
The weight matrix resulting from network training is shown in Table VIII.

IV. RESULTS AND DISCUSSION
Accuracy is used to evaluate model performance. Accuracy indicates the ability to differentiate churn and no-churn cases correctly. It is the proportion of true positives (TP) and true negatives (TN) among all evaluated customers:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (1)$$

where
TP is the total number of customers correctly identified as churn.
FP is the total number of customers incorrectly identified as churn.
TN is the total number of customers correctly identified as no-churn.
FN is the total number of customers incorrectly identified as no-churn.
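The accuracy measure defined in terms of TP, TN, FP and FN can be computed directly from the four counts. The example counts are hypothetical: they describe 100 customers with the 14%/86% class split reported for the dataset and a classifier that always predicts no-churn, which gives a reference point for reading the accuracies that follow.

```python
def accuracy(tp, tn, fp, fn):
    """Proportion of correctly classified customers among all evaluated customers."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical counts (assumption): 100 customers, 14 churners, and a classifier
# that labels everyone as no-churn, so every churner becomes a false negative.
majority_baseline = accuracy(tp=0, tn=86, fp=0, fn=14)
print(majority_baseline)
```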
The results of applying cross-validation to all models are shown in Table IX and Fig. 4. Minimum and maximum accuracies for all of the selected models are summarized in Table X and Fig. 5. The results of the study show that the ensemble-based learning techniques (RF and AdaBoost) achieved the highest performance, with approximately 96%. Both MLP and SVM can be recommended as well, with 94% accuracy. DT achieved 90%, NB 88%, and finally LR and LDA 86.7%, as shown in Fig. 5.

V. CONCLUSION AND FUTURE WORK
This study presents a benchmark of the most widely used state-of-the-art techniques for churn classification. The accuracy of the selected models was evaluated on a public dataset of customers of a telecom company. Based on the findings of this study, ensemble-based learning techniques are recommended, as both the Random Forest and AdaBoost models gave the best accuracy. However, the study can be extended by including hybrid models and deep learning models. Other performance metrics can be used for performance evaluation. Timing measures of the models can also be a major indicator of performance. The models can also be evaluated against different datasets from different domains.

TABLE III. CART COMPLEXITY VARIABLE AND ACCURACY

TABLE IV. ACCURACY USING DIFFERENT C VALUES

TABLE X. ACCURACY OF THE SELECTED MODELS