Customers Churn Prediction using Artificial Neural Networks (ANN) in Telecom Industry

Abstract: To survive the fierce competition in the telecommunication industry and to retain existing loyal customers, predicting potential churners through predictive modeling has become a crucial task for practitioners and academics. Efficient predictive models can identify loyal customers. Allocating dedicated resources to the retention of these customers would control the flow of dissatisfied consumers who are thinking of leaving the company. This paper proposes an artificial neural network approach for predicting customers who intend to switch to other operators. The model works on multiple attributes, such as demographic data, billing information and usage patterns, taken from a data set of telecom companies. In contrast with other prediction techniques, the Artificial Neural Network (ANN) based approach predicts telecom churn in Pakistan with an accuracy of 79%. The results from the artificial neural network clearly indicate the churn factors, so the necessary steps can be taken to eliminate the reasons for churn.

Keywords: Neural Network; ANN; prediction; churn management


I. INTRODUCTION
To survive in the dynamic, service-demanding telecommunication sector and to achieve operational excellence, it is indispensable to maintain an up-to-date customer relationship management (CRM) system. This system plays a pivotal role in developing customer satisfaction and loyalty and is the main interface for interacting with clients [1]. A CRM system can transform the entire business by providing information in real time, improving the sales process, enhancing customer loyalty and advertising, and increasing the effectiveness of marketing tools [2]. The term emerged from relationship marketing, defined as the type of marketing that emphasizes improving the qualitative, strategic and supportive relationship with existing customers and devising strategies to attract new ones [3]. To survive in the extremely competitive telecom market, firms must adapt to external changes, so CRM has proved to be an important monitoring tool for detecting the effect of environmental changes on business operations and for taking tough business decisions. A CRM system provides both tangible and intangible benefits: the former include increased productivity, higher retention rates and profitability, lower marketing costs and faster turnaround times, while the latter include customer loyalty and satisfaction, the benefits of segmentation, positive word-of-mouth effects and an excellent customer experience [4].
Income can be increased by improving service quality and customer satisfaction, redressing customer problems and handling registered complaints in a timely manner [5]. The process can be made more effective by introducing automation through a CRM system. Business capabilities can be improved by the better, integrated, real-time knowledge that a CRM system provides, which significantly helps decision-making. Detailed tables and reports can be extracted without spending much time, which improves working productivity [6]. Comprehensive analysis can be carried out on customer service data, so individual planning and decision-making can be improved through better learning opportunities. Customer needs, expectations and behavior can be predicted from the information derived from CRM [7].
The prime factors behind customer retention can be extracted in order to develop profitable, loyal and long-lasting relationships with clients. Without retention campaigns, telecom operators consistently lose a significant share of their customers, 20% to 40% each year [8]. By improving retention so that churn falls from 20% to 10%, Orange, a well-known British telecom operator, could save about £128 million annually [9]. Many statistical and data mining techniques have been introduced for customer churn prediction. In contrast to market surveys, data mining techniques analyze both historical and current data in order to find patterns in the historical data and predict future customer behavior [10].
The most common techniques used for prediction are Decision Tree (DT), Logistic Regression (LR), Support Vector Machine (SVM) and Neural Network (NN). The decision tree is used for classification problems, dividing the instances into two or more classes. Similarly, logistic regression gives the probability of churn from the input/output fields, a set of equations and the factors causing customer churn [11].
Most companies today suffer badly from the switching of dissatisfied customers, known as customer churn, and the departing customers mostly move to a competitor. Acquiring a new customer costs a company 6 to 7 times more than retaining an existing one, and hence causes a considerable loss of profit [12]. The most probable reasons for a customer's departure are a cheaper offer from another company, dissatisfaction with the existing operator or the successful marketing strategy of a new company [13].
Predictive analytics can be divided into classification and prediction, which can be further subdivided into decision tree, logistic regression, neural network and support vector machine, as shown in Fig. 1. To protect the profit, brand name and assets of the existing operator, it is necessary to retain customers. Collecting customer data and making the necessary predictions helps to identify dissatisfied customers and to direct resources towards these targeted customers. Section I gives a brief introduction to customer churn. Section II presents the literature review and related work on churn and predictive analytics techniques. Section III explains and illustrates deep learning, followed by the methodology in Section IV. Section V gives the results of the study in detail.

II. LITERATURE REVIEW
Organizations have recently moved significantly from a product- and service-centric approach to a customer-centric one. Meeting the needs of the customer remains the core function of the organization, but telecom operators are moving fast towards service improvement by enhancing customer loyalty and satisfaction. Much emphasis is laid on developing strategies to retain customers and to categorize them according to the value they provide to the company [14].
Acquiring a new customer is many times costlier than retaining an existing one, so the strategic goal is to work on dissatisfied customers who intend to stop the service and leave the company; this is termed customer churn management. It involves predicting in advance those profitable customers who are unlikely to continue with the existing firm because of marketing strategy failures, company tariffs, poor customer care or frequent technical issues [15].
In this scenario, data mining techniques are used to process and analyze demographic details, customer care service data, billing and accounts receivable, customer credit scoring, customer usage patterns, value-added services data and customer relationship data. A model is then developed and evaluated for potential churners, and an effective business solution is devised to beat competitors in the face of extreme competition. This prompts the CRM department to win back churning customers by redressing their issues and offering extra services and discounts on monthly bills so that the loss can be prevented. The six steps, from data collection to devising the churn policy [16], are depicted in Fig. 2.
Churn prediction and analysis are performed through different techniques, mostly data mining tools. Researchers and telecom professionals have conducted many studies to construct churn prediction models with varying accuracy and precision on different data sets. Support Vector Machine (SVM), Logistic Regression (LR) and Multilayer Perceptron were used to build models from the data set of a Taiwanese logistics company [17]. Similarly, a decision tree algorithm (DT-ID3) was used to construct a prediction model for a Malaysian telecommunication company [18]. Another study used Decision Tree (DT) and Logistic Regression (LR) with survival analysis on data from a renowned telecommunication company [19]. A churn prediction model based on the naïve Bayes technique was built using a data set from the Oracle company [20]. For a Chinese telecommunication company, data was analyzed for churn prediction using Decision Tree (DT), the C5.0 algorithm, BPN and a neural network [21].

III. DEEP LEARNING MODEL
Deep learning covers the whole process from feature selection to the final classification or prediction. There are three main requirements for an effective deep learning model: the availability of abundant data to train the model, adequate computational power, which led to the introduction of GPUs, and innovative algorithms to make the process fast [22].

A. Artificial Neural Network
In the human brain, all decisions are taken through natural neural networks, which are composed of a basic building block, the neuron.
The biological neuron shown in Fig. 3 is composed of dendrites, which receive data from other neurons, a cell body, which sums all the inputs received, and an axon, which carries the output out of the neuron [23]. All communication and processing is performed with electrical signals through synapses, the connection points between the dendrites and the axon of the preceding neuron. Similarly, in an artificial neuron, the inputs X1, X2, …, Xn are taken by each neuron and passed to a summation and an activation (transfer) function for decision-making. The output is based on the joint decision taken by the entire neural network, as shown in Fig. 4.
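The network is composed of three main layers (input, hidden and output layers), as shown in Fig. 5. When input $X_i$ is applied to a neuron, it is multiplied by its weight, the bias is added and the result is $O_k = f\left(\sum_{i=1}^{n} W_i X_i + b_j\right)$, where $W_i$ is the weight for each input $X_i$, $b_j$ is the bias of each perceptron, $f$ is the activation function and $O_k$ is the output.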
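The weighted-sum-and-activation step described above can be illustrated with a minimal Python sketch. The function name `neuron_output`, the example inputs, weights and bias are all hypothetical, and tanh is chosen only because the paper later uses it for the hidden layer.

```python
import math

def neuron_output(inputs, weights, bias, activation=math.tanh):
    """Return activation(sum_i W_i * X_i + b) for a single artificial neuron."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    # Weighted sum of the inputs plus the bias term.
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(net)

# Hypothetical example values (not taken from the paper's data set).
x = [0.5, -1.0, 2.0]
w = [0.4, 0.3, -0.2]
b = 0.1
print(neuron_output(x, w, b))  # tanh of a net input of about -0.4
```

Passing a different callable as `activation` swaps the transfer function without changing the summation.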
The model comprises a network of neurons interconnected through weights, and the output of a particular perceptron is obtained from experience once the training process of the model is complete [24].
A multilayer perceptron consists of several layers, including a number of hidden layers between the input and output layers. During training, the error calculated by comparing the actual and desired outputs is fed back towards the input side, the weight of each input value is adjusted regularly and the error is steadily reduced.
Here $E_i$ is the error between the actual and predicted outputs, $A$ is the actual output and $\hat{A}$ is the predicted (modeled) output.
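As a rough illustration of how the error drives the weight adjustment, the sketch below trains a single tanh neuron by gradient descent. It assumes the error takes the form E = A - Â and that half the squared error is minimized; neither assumption is confirmed by the text, and the names `train_step` and `lr` and all numbers are hypothetical.

```python
import math

def train_step(weights, bias, inputs, actual, lr=0.1):
    """One gradient-descent update for a single tanh neuron (assumed squared-error loss)."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    predicted = math.tanh(net)
    error = actual - predicted  # assumed form: E = A - A_hat
    # Gradient of 0.5 * E^2 with respect to net is -E * (1 - tanh(net)^2).
    delta = error * (1.0 - predicted ** 2)
    new_weights = [w + lr * delta * x for w, x in zip(weights, inputs)]
    new_bias = bias + lr * delta
    return new_weights, new_bias, error

# Hypothetical training example: two inputs, target output 0.8.
weights, bias = [0.0, 0.0], 0.0
errors = []
for _ in range(50):
    weights, bias, err = train_step(weights, bias, [1.0, 0.5], actual=0.8)
    errors.append(abs(err))
print(errors[0], errors[-1])  # the absolute error shrinks over the iterations
```

A full multilayer perceptron propagates the same kind of delta backwards through every layer; this sketch covers only the single-neuron case.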

B. Activation Function
In an artificial neural network, all inputs are multiplied by their weights and summed, and an activation function is applied. ANNs are expected to perform non-linear, complicated, high-dimensional mappings between inputs and response outputs, as they are considered universal function approximators.
The sigmoid activation function is an S-shaped curve that ranges from 0 to 1. It suffers from the vanishing gradient problem, and its output is not zero-centered, which pushes the gradient updates in very different directions and therefore slows convergence.
On the other hand, the hyperbolic tangent function (tanh) is zero-centered and easier to optimize, which makes it superior to the sigmoid function. The third and most popular function is the Rectified Linear Unit (ReLU), which is reported to improve performance sixfold over tanh and avoids the vanishing gradient problem for positive inputs. The function is frequently used, but only in the hidden layers of a neural network.
In addition, Leaky ReLU is a modified version of ReLU that covers a shortcoming of ReLU: during training the gradient of a ReLU unit can become zero, so that the unit stops learning and "dies".
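The four activation functions discussed in this subsection can be written in a few lines of Python. This is a minimal sketch for comparison only; the leak slope `alpha=0.01` is a commonly used value assumed here, not one given in the text.

```python
import math

def sigmoid(z):
    """S-shaped curve with outputs in (0, 1); not zero-centered."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Zero-centered curve with outputs in (-1, 1)."""
    return math.tanh(z)

def relu(z):
    """Passes positive inputs unchanged and clips negative inputs to 0."""
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    """Like ReLU, but keeps a small slope (assumed alpha) for negative inputs."""
    return z if z > 0 else alpha * z

for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh(z), relu(z), leaky_relu(z))
```

For negative inputs `relu` returns exactly 0, which is what allows a unit to "die", whereas `leaky_relu` still returns a small non-zero value.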

C. Convolution Neural Network
The use of convolutional neural networks has been largely extended to understanding image content, image recognition, detection and segmentation.
The main reason is the capability of the ConvNet, depicted in Fig. 6, to scale the network to millions of parameters, while large labeled data sets help greatly in the learning process. The technique has also attracted attention for large-scale video classification, where the network performs well not only on static images but also on complicated temporal evolution.

IV. METHODOLOGY

A. Collection of Data
The data were collected from all cellular companies in the Pakistani telecom market in 2018-19. There are 20468 instances, comprising 26 attributes, selected during the period October to December 2018. The demographic information of the survey is illustrated in Fig. 7, which shows gender, education, marital status and age. Table I gives the descriptive statistics, while the correlations of the data are given in Table II.

C. Design and Setup of Artificial Neural Network
The artificial neural network model was built as a multilayer perceptron (MLP) in IBM SPSS Statistics 21. The ANN was trained with the back-propagation learning algorithm, and the synaptic weights were adjusted through gradient descent to lower the error through the transformation function. The data were assigned to training and testing in proportions of 80% and 20% respectively. Special care was taken not to over-train the model during training while reducing the error to the minimum possible level. Before training started, all covariates were normalized to values between 0 and 1.
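The two preparation steps mentioned above, rescaling covariates to the range 0 to 1 and an 80%/20% split, might look as follows outside SPSS. This is a sketch under assumptions: the helper names, the random seed and the sample tenure values are hypothetical, and SPSS performs its own case assignment internally.

```python
import random

def min_max_normalize(values):
    """Rescale a covariate linearly so that its minimum maps to 0 and its maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]

def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle the rows reproducibly and cut them into training and testing parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical tenure values in months (not from the paper's data set).
tenure_months = [1, 12, 24, 36, 60]
scaled = min_max_normalize(tenure_months)

train, test = train_test_split(list(range(100)))
print(scaled, len(train), len(test))
```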
The transformation function used for this model is the hyperbolic tangent, which returns values between -1 and +1. The output can be interpreted as a probability distribution, and the output O_j is called the network's pseudo-probability, or estimated probability, of the input classification. There are two modes of optimization with the gradient descent algorithm. The most commonly used is batch mode, which updates the synaptic weights after passing all records in the training data set. The other is online or incremental mode, which does the same work but with lower efficiency than the batch method. The technique used here within batch mode is the scaled conjugate gradient, an optimized method for the iterative solution of linear equations. With each iteration the synaptic weights are updated and the algorithm moves down the error surface to a lower value than in the previous iteration.
The scaled conjugate gradient algorithm has four parameters for model development: initial lambda, initial sigma, interval center and interval offset. The lambda parameter controls whether the Hessian matrix is negative definite. The sigma parameter controls the size of the weight change used to estimate the Hessian through the first-order derivatives of the error function. Similarly, the annealing algorithm uses the interval center a_0 and the interval offset a, which define the interval from a_0 - a to a_0 + a. The initial settings of all parameters are:

Initial lambda = 0.0000005, initial sigma = 0.00005, interval center = 0 and interval offset = 0.5
SPSS uses the cross-entropy error function rather than the squared error function whenever the softmax activation function is applied to the output layer. The mathematical form of the cross-entropy error function is $E = -\sum_{j=1}^{n} T_j \ln O_j$, where $n$ is the total number of output nodes, $T_j$ is the target value of output node $j$ and $O_j$ is the actual output value of output node $j$.
The main purpose of this research is to generate a multilayer perceptron neural network (MLP-NN) model that can predict telecommunication churn by processing data obtained from the telecom industry according to the relevant factors; see Table III. Moreover, according to the network information table, the hidden layer has 3 nodes and the output layer has 2, the activation function is the hyperbolic tangent (tanh) for the hidden layer and softmax for the output layer, and the error function is cross-entropy.
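The architecture described above (a tanh hidden layer with 3 nodes, a softmax output layer with 2 nodes and a cross-entropy error) can be sketched as a plain forward pass. All weights and inputs below are made-up illustrative numbers rather than the estimates reported in the paper's tables, and the use of only 2 inputs is an assumption made to keep the example short.

```python
import math

def softmax(logits):
    """Convert raw output-node sums into pseudo-probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(targets, outputs):
    """E = -sum_j T_j * ln(O_j); terms with T_j = 0 contribute nothing."""
    return -sum(t * math.log(o) for t, o in zip(targets, outputs) if t > 0)

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass: tanh hidden layer followed by a softmax output layer."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(w_out, b_out)]
    return softmax(logits)

# Hypothetical weights for 2 inputs, 3 hidden nodes and 2 output nodes.
W_HIDDEN = [[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2]]
B_HIDDEN = [0.0, 0.1, -0.1]
W_OUT = [[0.7, -0.4, 0.2], [-0.7, 0.4, -0.2]]
B_OUT = [0.05, -0.05]

outputs = forward([0.2, 0.9], W_HIDDEN, B_HIDDEN, W_OUT, B_OUT)  # [P(No), P(Yes)]
loss = cross_entropy([0.0, 1.0], outputs)  # target: churn = Yes
print(outputs, loss)
```

Because the softmax outputs sum to 1, they can be read directly as the pseudo-probabilities of churn=No and churn=Yes.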
In the network diagram extracted from the SPSS results, telecom customer churn (Churn=No, Churn=Yes) is predicted from the input variables, including tenure and total charge. Similarly, there are three nodes in the hidden layer and two output nodes, churn and no-churn, depending on the results.
The network diagram of the artificial neural network is given in Fig. 8; it comprises the input layer, hidden layer and output layer produced by the software. The parameter estimates are given in Table V, while the model summary in Table VI provides detailed information about the training and testing data sets. The cross-entropy error was brought down to the minimum possible level during training. The predictive power of the model is indicated by the small cross-entropy error of 1623.861. The percentage of incorrect predictions is 22.7%. Training stopped after 10 consecutive steps with no further reduction in the error function. For the categorical dependent variable, the classification (confusion matrix) is given in Table VII. When the predicted pseudo-probability is greater than 0.5, the outcome is classified as Yes (churn=Yes). Overall, 77.3% of the training cases were classified correctly.
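The 0.5 cutoff rule and the resulting classification table can be illustrated as follows. The labels and pseudo-probabilities are invented toy values (1 stands for churn=Yes, 0 for churn=No), so the printed percentage is unrelated to the 77.3% reported above; the function names are hypothetical.

```python
def confusion_matrix(actual, churn_probability, cutoff=0.5):
    """Count true/false positives and negatives, predicting Yes when p > cutoff."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, churn_probability):
        predicted = 1 if p > cutoff else 0
        if predicted == 1 and a == 1:
            counts["TP"] += 1
        elif predicted == 0 and a == 0:
            counts["TN"] += 1
        elif predicted == 1 and a == 0:
            counts["FP"] += 1
        else:
            counts["FN"] += 1
    return counts

def percent_correct(counts):
    """Overall percentage of correctly classified cases."""
    total = sum(counts.values())
    return 100.0 * (counts["TP"] + counts["TN"]) / total

# Toy data: 1 = churn Yes, 0 = churn No (hypothetical values).
actual = [1, 0, 1, 0, 1, 0, 0, 1]
probs = [0.9, 0.2, 0.4, 0.6, 0.7, 0.1, 0.3, 0.8]

table = confusion_matrix(actual, probs)
print(table, percent_correct(table))
```

The percentage of incorrect predictions is simply 100 minus the value returned by `percent_correct`.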
The predicted pseudo-probabilities are shown as box plots in Fig. 9. For the dependent variable (Churn=Yes or No), the pseudo-probabilities obtained from the whole data set are displayed, and values greater than 0.5 in the first and fourth plots indicate correct predictions. From left to right, the first box plot shows the predicted pseudo-probability of churn=No for customers who are actually in the No category. The second plot shows, for customers who are actually churn=No, the predicted pseudo-probability of the Yes category. Similarly, the third box plot shows customers who actually belong to the Yes category, with the pseudo-probability predicted for category churn=No, and the fourth plot shows customers whose predicted and actual category are both churn=Yes.
To determine a possible cutoff point, sensitivity is plotted against specificity in the receiver operating characteristic (ROC) curve illustrated in Fig. 10.
The curve combines sensitivity and specificity (1 - false positive rate). The random-chance diagonal is drawn from the lower left to the upper right at 45 degrees, and the further the curve lies from this 45-degree baseline, the more accurate the classification.
The area under the curve is given in Table VIII. The value of 79% means that if one customer with churn=Yes and one with churn=No are selected at random, the probability that the model assigns the higher pseudo-probability of churn=Yes to the actual churner is 0.79. The cumulative gain chart displays the classification of telecom customer churn computed by the artificial neural network against classification by chance. For example, the fifth point on the curve for the category churn=No lies at (40%, 50%): if the data set is ranked by the predicted pseudo-probability of churn=No, the top 40% of cases contain 50% of all cases that actually have churn=No.
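The pairwise reading of the area under the ROC curve given above can be computed directly: it is the share of (churner, non-churner) pairs in which the churner receives the higher score. The sketch below uses this rank-based definition on invented toy scores; the function name `auc` and the data are hypothetical, and SPSS may compute the area differently.

```python
def auc(actual, scores):
    """Area under the ROC curve as the fraction of correctly ordered pairs."""
    positives = [s for a, s in zip(actual, scores) if a == 1]
    negatives = [s for a, s in zip(actual, scores) if a == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # churner ranked above non-churner
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))

# Toy data: 1 = churn Yes, 0 = churn No (hypothetical values).
actual = [1, 0, 1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.4, 0.6, 0.7, 0.1, 0.3, 0.8]
print(auc(actual, scores))  # 15 of 16 pairs are ordered correctly: 0.9375
```

A value of 0.5 corresponds to the 45-degree chance line, and 1.0 to a perfect ranking.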
Simply put, the gain shown in Fig. 11 measures the effectiveness of the classification model: the proportion of correct predictions obtained with the model against those obtained without a model. The greater the distance of the curve from the diagonal baseline, the greater the gain of the model and the better its performance.
On the other hand, the lift curve in Fig. 12 also evaluates the performance of the model over portions of the population and gives a clear view of the benefit of using the model compared with having no predictive model. Comparing the gain curve with the lift curve shows that a gain value of 93% corresponds to a lift factor of 2.5 on the lift curve.
The importance analysis indicates the sensitivity of the model to a change in each input variable. The independent variable with the greatest impact on classifying customers as churn=Yes or churn=No is the customer's tenure (subscription period) with the company. In second position, churn is most affected by total charges. Table IX gives the importance of each variable used in the main data set. The normalized importance is given as a percentage for each variable: age has the highest value at 100%, followed by call duration (total) at 63%, complaints lodged at 33.3%, call duration (average) at 25%, education at 24%, income per month at 23.30%, drop call rate at 21.9%, monthly income at 20.1%, unpaid numbers at 19.1% and occupation at 11.4%, the lowest. These percentages are illustrated in the bar chart in Fig. 13, arranged in descending order for easier understanding.
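The relationship between the gain and lift curves discussed in this section can be made concrete with a short sketch: the gain at a given portion of the population is the share of all target cases captured in that top portion, and the lift is the gain divided by the portion. The function name and the toy data are hypothetical and do not reproduce the values read from Figs. 11 and 12.

```python
def cumulative_gain_and_lift(actual, scores, fraction):
    """Gain and lift for the top `fraction` of cases ranked by descending score."""
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, int(round(fraction * len(ranked))))
    total_positives = sum(actual)
    if total_positives == 0:
        raise ValueError("no positive cases to capture")
    captured = sum(a for _, a in ranked[:top_n])
    gain = captured / total_positives      # share of target cases captured
    lift = gain / (top_n / len(ranked))    # improvement over random selection
    return gain, lift

# Toy data: 1 = target category, 0 = other category (hypothetical values).
actual = [1, 0, 1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.4, 0.6, 0.7, 0.1, 0.3, 0.8]

gain, lift = cumulative_gain_and_lift(actual, scores, 0.5)
print(gain, lift)  # top half captures 3 of 4 target cases: gain 0.75, lift 1.5
```

A model with no predictive value gives a gain equal to the portion examined and therefore a lift of 1 at every point.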

VI. CONCLUSION
The prediction and management of customer churn has become a more important task because of the liberalization of the cellular market. Timely prediction of customers who intend to leave the company helps to identify them and to take proactive action to retain them. Building an accurate and precise churn model is therefore necessary for telecom company owners and practitioners alike. In the search for a promising way to maintain a strong customer base, telecom churn prediction has become a modern research problem aimed at providing an early warning system for subscribers who are about to switch. The main theme of this paper is the application of artificial neural network modeling to customer churn prediction in the highly volatile Pakistani telecommunication market, using data obtained from all cellular and fixed operators of Pakistan. The research is in line with the literature and finds that the Artificial Neural Network (ANN) performs better than other classification techniques. The well-established back-propagation algorithm is used to train the model for predicting telecom customer churn. In the last part of the research, the impact of each variable on telecom customer churn is given in terms of normalized importance.