Customer Segmentation and Profiling for Life Insurance using K-Modes Clustering and Decision Tree Classifier

Customer segmentation and profiling have become an important marketing strategy in most businesses, both as preparation for better customer service and as a way to enhance customer relationship management. This study presents a segmentation and classification technique for the insurance industry based on two data mining approaches: K-Modes Clustering and Decision Tree Classifier. Data were gathered from an insurance company. K-Modes Clustering segmented the customers into three prominent groups: "Potential High-Value Customers", "Low-Value Customers" and "Disinterested Customers". The Decision Tree algorithm was applied for customer profile classification, comparing two splitting criteria, Entropy and Gini. The Decision Tree with the Gini criterion and 10-fold cross validation was found to be the best-fit model, with an average accuracy of 81.30%. This segmentation would help the marketing team of an insurance company to tailor its marketing plans to different groups of customers and so maximize customer value. Customers, in turn, can receive insurance plans customized to their needs as well as better assistance and services from insurance companies.
Keywords: Customer segmentation; customer profiling; decision tree; insurance domain; k-modes clustering


I. INTRODUCTION
The insurance industry has been part of the global market for decades and is a critical contributor to a country's long-term economic growth. Life insurers improve their policyholders' quality of life by pooling the risk of mortality, morbidity, and longevity among a large number of people and returning the benefits of this pooling in the form of guaranteed payments [1]. In the insurance industry, maintaining current customers is a challenge, and customer retention is more important than the acquisition of new customers. According to the Pareto principle, 20% of the customers contribute more to the revenue of the company than the rest [2]. Although clients are important to insurance organizations for generating income and enhancing profitability, acquiring and retaining clients are serious issues faced by insurance firms [3]. It is not easy to win and influence new clients: compared to current clients, new clients generally purchase 10% less, are less involved in the purchasing procedure and are less associated with the seller [4]. Additionally, acquiring new clients is more expensive than maintaining the existing clients of the company [5]-[7]. Moreover, the likelihood of successfully selling a good or service to an existing active client is approximately 60-70 percent, whereas it is just 5-20 percent for a potential client [8]. It is also worth noting that different clients contribute different amounts of revenue to insurance companies, so it is vital to handle clients according to their profitability [9].
Insurance companies are growing in number and in the diversity of services offered, and clients have full control of their decisions [7]. This allows clients to choose whichever services match their needs from any service provider. It is thus important to have good customer relationship management to retain existing customers. To achieve that, insurance companies need to identify their target markets by segmenting the customers into groups. Customer segmentation helps businesses to customize marketing plans, identify trends, plan product development and advertising campaigns, deliver relevant products, and personalize messages for better communication with the intended groups [10]. Customer segmentation is an effective instrument for separating customers into groups and analyzing their traits [3]; organizations can thereby focus on clients with distinct features and determine the most valuable clients [9]. Clustering methods have been employed in many studies to segment customers [3], [9], [11]-[14], while classification via Decision Tree has also been widely used in past studies [15]-[17].
The following are the contributions of this paper:

• This research uses K-Modes Clustering and Decision Tree Classifier for customer segmentation and profiling in the insurance domain.
• The marketing team of an insurance company will be able to plan its marketing around different groups of customers by formulating different strategies to maximize customer value.
• Customers can receive insurance plans customized to their needs as well as better assistance and services from insurance companies.
www.ijacsa.thesai.org
The remainder of this paper is structured as follows: Section II discusses the related works on clustering and classification methods, while Section III describes the study's methodology. Section IV highlights the results, Section V provides the discussion, and Section VI concludes the paper with future works.

II. RELATED WORKS

A. Data Mining and Machine Learning
Data mining methods can be used to explore unseen data and to recognize patterns and associations that have valuable uses [18]. By implementing data mining methods, organizations are able to extract useful information from their data and gain an understanding of their clients and their needs [7]. Data mining, which is part of knowledge discovery in databases (KDD), involves the following process [19]: data selection, pre-processing, transformation, application of a data mining algorithm, and interpretation and evaluation of the results. Data mining techniques such as regression, classification, clustering, forecasting, association and visualization are also part of the classification framework in customer relationship management (CRM) [20].
While data mining extracts information from vast amounts of data, machine learning develops algorithms that allow machines to learn by themselves without human intervention. Some examples of machine learning algorithms are Neural Networks, Decision Trees, Naïve Bayes, and Logistic Regression. K-Means, introduced by MacQueen in 1967, is the most popular and relevant clustering model [21]. Predictive models have been used to study customer purchasing behavior in past research [13], [22], [23], in which models such as K-Means (for clustering) and Decision Tree (for classification) were commonly employed. This study explores these two types of models for segmenting and classifying customers.

B. Customer Segmentation via Clustering Methods
Past research shows that the K-Means Clustering method has been widely used. K-Means Clustering was used to segment a bank's customers into five categories: potential growth customers, general customers, intermediate customers, senior customers and VIP customers [24]. The K-Means Clustering algorithm was also applied to segment private banking customers, and the results showed three clusters named "Core Value Customers", "Financial Products Oriented Customers" and "Deposit Oriented Customers" [25].
Khalili-Damghani et al. [9] employed K-Means Clustering for insurance customer segmentation. The results were three clusters labelled "profitable customer", "potential profitable customers" and "disinterested customers". Fuzzy C-Means clustering was used to cluster life insurance customers [26]; the results showed that two was the optimal number of clusters for that study, denoted "investment" and "life security". In [3], k-prototypes, which combines K-Means (for numerical attributes) and K-Modes (for categorical attributes), was compared with improved k-prototypes and Similarity-Based Agglomerative Clustering (SBAC) for customer segmentation in an auto insurance case study. The results showed that the SBAC algorithm was more effective in clustering auto insurance customers, with a higher silhouette index value.
In another study by Qadadeh and Abdallah [27], K-Means Clustering and Self-Organizing Map (SOM) techniques were used to cluster insurance customers. K-Means Clustering was compared with the combination of SOM and K-Means Clustering, and the combined method gave a better overall performance with six clusters of customers. In further studies, SOM was applied to imbalanced datasets for clustering categorical data, in which the Kohonen SOM (KSOM) algorithm was improved by focusing on the distance calculation among objects [35], [36]. K-Means was also used in [28] to analyze the network traffic trend and the type of traffic in a campus network. The results showed that it was beneficial for managing or shaping bandwidth usage and for strengthening the security policy of the network.
K-Modes is an extension of the K-Means algorithm. The dissimilarity measure applied in K-Means is the reason that K-Means is unable to cluster categorical variables [29]. The K-Modes clustering algorithm was introduced by Huang [30], who presented a new dissimilarity measure for clustering categorical attributes [31]. K-Modes removes the numeric-data restriction of K-Means while maintaining its efficiency, through two adjustments: the use of the simple matching dissimilarity measure (Hamming distance) for categorical attributes, and the replacement of cluster means with cluster modes. The model uses a frequency-based approach to update the modes during the clustering procedure so as to decrease the cost function, which is calculated as the sum of the within-cluster dissimilarities between each object and the mode of its cluster.
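The building blocks described above can be sketched in a few lines of Python. This is a minimal illustration only, not the implementation used in the study: the toy records, the attribute choice and the function names are hypothetical, and ties between equally frequent categories are broken by first occurrence.

```python
from collections import Counter

def matching_dissimilarity(a, b):
    """Simple matching (Hamming) dissimilarity: count of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))

def cluster_mode(records):
    """Attribute-wise most frequent category among the records of one cluster."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))

def clustering_cost(records, modes, labels):
    """Sum of dissimilarities between each record and the mode of its cluster."""
    return sum(matching_dissimilarity(r, modes[l]) for r, l in zip(records, labels))

# Hypothetical toy records: (gender, marital status, payment method)
records = [
    ("F", "Married", "Auto Debit"),
    ("F", "Married", "Credit Card"),
    ("M", "Single", "Auto Debit"),
    ("M", "Single", "Auto Debit"),
]
labels = [0, 0, 1, 1]  # assumed current cluster assignment

# Frequency-based mode update: recompute the mode of every cluster
modes = {
    k: cluster_mode([r for r, l in zip(records, labels) if l == k])
    for k in set(labels)
}
cost = clustering_cost(records, modes, labels)
print(modes, cost)
```

A full K-Modes run would alternate between reassigning each record to its nearest mode and recomputing the modes until the cost stops decreasing.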

C. Rules Extraction using Decision Tree Classifier
Clustering and classification techniques complement each other and have been shown to perform well in segmenting customers. Clustering methods are good at handling unlabelled data but have the drawback of not being able to predict new, unseen data. Classification methods, on the other hand, can make predictions for unseen data but need to be trained on a set of labelled data. A Decision Tree works by splitting data points into two or more sub-groups in a way that keeps each sub-group as homogeneous as possible [17]. Each node of the tree represents a feature, and each edge leaving a node represents a value or a range of values of that feature [15]. The final output of the classification, known as the class label, is stored in a leaf node. The comparatively straightforward process of a Decision Tree makes it easy to understand and interpret, and its ability to handle many of the data intricacies usually present in real data makes the method popular [32]. The decision rules uncovered along the paths of the tree help to produce hypotheses about each feature's own influence on the classification [15].
Several applications implement Decision Tree as a classifier. Clustering analysis using K-Medoid Clustering was performed on family farmers in Brazil, and Decision Tree, alongside Support Vector Machine and Neural Network (Multilayer Perceptron), was used to identify the characteristics that distinguish the identified clusters [15]. Classification using Support Vector Machine, Random Forest, Decision Tree, K-Nearest Neighbour, and Naïve Bayes was performed to predict the churn of credit card holders [16]. In a study by Ganjali and Teimourpour [33] on life insurance customers, K-Means Clustering was used to group the customers based on their lifetime value. The researchers also applied association rules to the most valuable customer group, as well as classification to predict the position of new customers in each cluster. In [34], Decision Tree was used in job profiling analytics to intelligently select the most significant skillsets for each job position. It produced an accuracy of 63.5% when used together with Capacity Utilization Rate. The Decision Tree approach was used for the classification process, and the results showed that the model achieved 61.3% accuracy, 38.97% classification error, 0.012% Kappa and 0.024% Correlation criteria.
Based on the previous research, K-Means Clustering and Decision Tree Classifier have been shown to be the most popular techniques for grouping and classifying customers across industries. Nevertheless, very few studies model categorical data, which is what this study proposes. K-Means is only suitable for numerical data, whereas K-Modes is an extension of the K-Means algorithm that can handle categorical data.

III. METHODOLOGY
This section presents the methodology of the study. Fig. 1 illustrates the four main steps, which are detailed in the following subsections: 1) Data Preparation; 2) Variable Selection; 3) Model Development; and 4) Model Evaluation.

A. Data Preparation
The data used in this study were obtained from one of the life insurance companies in Malaysia. The dataset contained 37,181 records of daily new-business customer information, including demographic details and policy information, from January 2018 until December 2019. Prior to analysis, the data underwent a pre-processing phase that included handling missing values, treating outliers and transforming the variables. Missing values were imputed accordingly with blanks, while detected outliers were transformed. The data transformation methods included discretizing numerical variables via a quantile-based approach, re-grouping the data in certain variables, and changing data types. Discretization resulted in some variables being converted into categorical data, re-labelled to avoid redundancy, or merged accordingly. Table A1 in the Appendix shows the description of the pre-processed variables.
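As an illustration of quantile-based discretization, the following sketch bins a numerical variable into groups of roughly equal size using only the Python standard library. The income values, the choice of four bins and the function names are assumptions made for this example; the paper does not state how many bins were used for each variable.

```python
import bisect
import statistics

def quantile_cut_points(values, n_bins=4):
    """Cut points that split the values into n_bins quantile-based groups."""
    return statistics.quantiles(values, n=n_bins, method="inclusive")

def discretize(value, cut_points):
    """Index of the bin a value falls into (0 is the lowest group)."""
    return bisect.bisect_left(cut_points, value)

# Hypothetical annual incomes in MYR
incomes = [18000, 24000, 27000, 30000, 42000, 55000, 67000, 120000]

cuts = quantile_cut_points(incomes)          # three cut points for four bins
income_groups = [discretize(v, cuts) for v in incomes]
print(cuts)
print(income_groups)
```

Each bin index can then be re-labelled with its value range so that the variable is treated as categorical by K-Modes.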

B. Variable Selection
Variable selection involves selecting attributes that provide meaningful insights for targeting the right customers. Several variables were removed from the data because they had no impact on the analysis: "Occupation Group", "Distribution Channel", "Insured (Self/Others)", "Payment Frequency", "Premium Status", "Race PO" and "Sum Assured". Additionally, a business expert suggested including some information about the policy purchased by each customer, namely duration of policy issuance, annual net premium, premium payment method, product type and policy status. Table I shows the final attribute selection.

C. Model Development
This study developed a customer segmentation model using K-Modes Clustering and a Decision Tree Classifier, both implemented in Python. The first model, K-Modes, was implemented with a cost function that minimizes the intra-cluster distance. The number of clusters was set to k = 2, 3, 4 and 5, and the clustering outputs were compared and evaluated to determine the best number of clusters. The optimal k for K-Modes was determined using the Elbow Method. Fig. 2 shows that the elbow in the cost function occurs at 3 clusters; the precise values of the cost function are given in Table II. Therefore, k = 3 was chosen as the best number of clusters. Because the experiments compared different numbers of clusters through the cost function, the development time of the clusters was not a significant factor, with differences of only 0.001-0.005.
The second model is a Decision Tree Classifier implemented with a built-in function of Python's Scikit-Learn package. This function applies an optimized version of CART (Classification and Regression Trees), which performs well for binary as well as multi-class classification. The classifier was tuned through its criterion parameter. This study experimented with two Decision Tree criteria, "Gini" and "Entropy", with a k-fold cross validation approach to find the best-fit model. The number of class labels was determined by the best number of clusters found by K-Modes; in this study, 3 labels were defined for the classification task with rule extraction.
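To make the two criteria concrete, the sketch below computes the Gini impurity and the entropy of a set of class labels and the impurity reduction obtained by one candidate split. The label lists are hypothetical cluster labels chosen for illustration, and the functions are a plain-Python restatement of the standard definitions rather than the Scikit-Learn internals.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy (base 2) of the class proportions."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def weighted_impurity(groups, impurity):
    """Impurity of a split: size-weighted average over the child groups."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * impurity(g) for g in groups)

# Hypothetical cluster labels at a parent node and after one candidate split
parent = [0, 0, 0, 1, 1, 2]
left, right = [0, 0, 0], [1, 1, 2]

gini_gain = gini(parent) - weighted_impurity([left, right], gini)
info_gain = entropy(parent) - weighted_impurity([left, right], entropy)
print(round(gini_gain, 4), round(info_gain, 4))
```

Both criteria reward splits that produce purer child nodes; they often select similar splits, which is why the two are usually compared empirically, as done in this study.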

D. Model Evaluation
Clustering was validated using the Silhouette Index score. A good clustering is expected to have short distances between the points within a cluster, long distances between clusters, and a balanced proportion of data points among clusters. The Silhouette Index of object i is given in (1) [9]:
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \tag{1}$$
where a(i) is the dissimilarity between object i and the other objects in the same cluster, and b(i) is the dissimilarity between object i and the objects in the closest other cluster. The Calinski-Harabasz Index, CH, was also computed to assess the performance of the clusters, based on the formula in (2) [9]:
$$CH(q) = \frac{\operatorname{trace}(B_q)/(q-1)}{\operatorname{trace}(W_q)/(n-q)} \tag{2}$$
where n is the number of records, q is the number of clusters, W_q is the intra-cluster scatter matrix and B_q is the inter-cluster scatter matrix. The highest score indicates the best number of clusters for the dataset.
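The Silhouette Index can be sketched for categorical records as follows. The paper does not state which dissimilarity was used for this score, so the Hamming distance here is an assumption, as are the toy records and the convention of scoring a single-member cluster as 0.

```python
def hamming(a, b):
    """Number of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))

def mean_distance(record, others):
    return sum(hamming(record, o) for o in others) / len(others)

def silhouette_index(records, labels):
    """Average silhouette over all records; needs at least two clusters."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    scores = []
    for i, label in enumerate(labels):
        own = [records[j] for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # assumed convention for single-member clusters
            continue
        a = mean_distance(records[i], own)  # cohesion: mean distance in own cluster
        b = min(                            # separation: closest other cluster
            mean_distance(records[i], [records[j] for j in members])
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical records: (gender, marital status)
records = [("F", "Married"), ("F", "Married"), ("M", "Single"), ("M", "Single")]
good = silhouette_index(records, [0, 0, 1, 1])
poor = silhouette_index(records, [0, 1, 0, 1])
print(good, poor)
```

Scores close to 1 indicate compact, well separated clusters, while negative scores indicate records that sit closer to another cluster than to their own.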
For the classification, the evaluation was performed using K-fold cross validation, whereby the dataset was randomly divided into K equally sized subsets. The models were trained and tested K times, with the results recorded in each iteration. The final accuracy of each model was measured as the average of the accuracies obtained over all iterations. The formula for accuracy is shown in (3) [15]:
$$\text{Accuracy} = \frac{c}{n} \tag{3}$$
where c is the number of test samples classified correctly and n is the total number of test samples.
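The procedure can be sketched in plain Python as below. The majority-class "model" is only a placeholder so that the example is self-contained; the data, the seed and the function names are hypothetical and stand in for the Decision Tree used in the study.

```python
import random
from collections import Counter

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and deal them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validated_accuracy(X, y, k, fit, predict, seed=0):
    """Train on k-1 folds, test on the held-out fold, and average accuracy = c / n."""
    accuracies = []
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(model, X[i]) == y[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)

# Placeholder classifier (assumption): always predict the majority training label
def fit_majority(X_train, y_train):
    return Counter(y_train).most_common(1)[0][0]

def predict_majority(model, x):
    return model

X = [(i,) for i in range(10)]   # hypothetical feature rows
y = [0] * 8 + [1] * 2           # hypothetical class labels
mean_accuracy = cross_validated_accuracy(X, y, 5, fit_majority, predict_majority)
print(mean_accuracy)
```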

IV. RESULTS

A. Clustering Analysis
The distribution of each attribute in each cluster with K=3 is shown in Table A2 in the Appendix. The cluster analysis shows that 51% (19,047) of the total observations fall under Cluster 0. This group has the highest percentage of young working customers with a relatively low annual income: they are aged 26 to 34 years old (37.7%) and earn MYR27,000.01 to MYR42,000.00 yearly (35.1%). The distribution of gender shows that more than half of the customers are female and are married. The top residential location for customers in this group is Central Malaysia with 25.8%, and slightly more than half of the customers in this group opt to pay a low annual net premium in the range of MYR0 - MYR1,800. The majority of the customers are new customers (88.9%) who purchased policies for the first time between 2018 and 2019. In addition, more than half of the customers belong to Class 1 of occupational risk, which means that their occupations carry the least risk of exposure to hazardous elements. For the payment method, most of the customers use Auto Debit (80.5%) to pay their premium. The highest percentage of the policies' issuance days falls in the "0 - 186 days" category (34.4%). Lastly, more than half of the customers purchase Ordinary Life (Endowment) products (68.7%), and their policies remain active (72.0%) at the end of year 2019. Hence, Cluster 0 is named Low-Value Customers.
Cluster 1 makes up 27% (10,081) of the whole dataset. Its customers are young, aged between 10 and 25 years old (62%), and almost half of them have a low annual income of MYR0.00 - MYR27,000.00. Male customers (68.8%) outnumber female customers. The majority of the customers in this cluster are single (81.0%), and most of them reside in Northern Malaysia. In terms of the occupational risk class, more than half of the customers belong to Class 1. Almost all customers in this group are new customers (90.6%). The annual net premium paid by the customers in this group is mostly between MYR1,800.01 and MYR2,400.00 (40.1%), and they also prefer Auto Debit (75.7%) as the method to pay their premium. The policy issuance duration with the highest percentage is "187 - 334 days" (33.1%), and more than half of the customers purchase Investment-Linked (Whole Life) products. Lastly, the distribution of policy status is almost balanced for this cluster, with a slightly higher percentage of inactive policies (55.3%). This means that this group of customers has a higher chance of letting their policies become inactive. Cluster 1 is labelled Disinterested Customers.
On the other hand, Cluster 2 makes up 22% (8,053) of the total observations. The majority of the customers in this cluster are older customers with more stable earnings, since they have a high annual income: 49.3% of them are aged 42 - 76 years old, and 61.8% of them have an annual income in the range of MYR67,000.01 - MYR1,400,000.00. Moreover, more than half of the customers are male, and 80% of the customers are married. Central Malaysia has the highest percentage for this cluster with 49.2%. This cluster also has the highest percentage of customers who belong to Class 1; hence, their occupations are not very risky. Besides that, this cluster has a slightly higher percentage of existing customers (58.6%) compared to new customers.
In line with the range of annual income, this group of customers has the highest percentage (51.5%) of annual net premium in the range of MYR3,380.01 - MYR369,200.00, which is the highest category of annual net premium in this dataset. This cluster also has the highest percentage of policies with an issuance duration of 335 to 543 days (35.5%). This group of customers mostly prefers Credit Card (58.4%) as the medium to pay their premium to the insurer. For product type, the customers mainly purchase Investment-Linked (Whole Life) products (79.5%). Finally, the majority of the policies purchased (84.3%) were still active as of December 31st, 2019. Cluster 2 is named Potential High-Value Customers.
The cluster performance can be measured by evaluating intra-cluster performance; hence, we implemented the Silhouette Index and the Calinski-Harabasz Index. The results are shown in Table II, which also includes the cost function values for analysis purposes. For both the Silhouette and Calinski-Harabasz Indexes, a higher score indicates better cluster performance. In this study, both scores are highest when the number of clusters is 3, as shown in Fig. 3. The cost function value decreases as more clusters are added. However, the largest drop in the cost value occurs when the number of clusters changes from 2 to 3, compared with the changes from 3 to 4 and from 4 to 5 clusters, which produces the elbow shape. Therefore, K-Modes with 3 clusters is justified as having the best performance for this study.
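The elbow reasoning above can be expressed as a small calculation: for each candidate number of clusters, compare the drop in cost on arriving at that value with the drop on leaving it, and pick the value where the difference is largest. The cost values below are purely illustrative placeholders and are not the figures reported in Table II; the function name is also hypothetical.

```python
def elbow_k(costs):
    """Return the k where the decrease in cost slows down the most.

    costs maps each number of clusters k to its clustering cost and must
    contain at least three consecutive candidates.
    """
    ks = sorted(costs)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        drop_before = costs[prev_k] - costs[k]   # gain from adding this cluster
        drop_after = costs[k] - costs[next_k]    # gain from adding one more
        bend = drop_before - drop_after
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k

# Illustrative cost values only (assumption, not taken from the paper)
illustrative_costs = {2: 200000, 3: 170000, 4: 162000, 5: 157000}
print(elbow_k(illustrative_costs))
```

In practice the elbow is usually confirmed visually and cross-checked against validity indices, as done here with the Silhouette and Calinski-Harabasz scores.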

B. Classification Analysis
The purpose of performing classification is to characterize each class label by extracting the rules developed by the Decision Tree. There was a total of 43 attributes including the target variable. We implemented K-fold cross validation, which divides the dataset into K folds of roughly the same size. In this study, we performed experiments with both the "Gini" and "Entropy" criteria for the Decision Tree Classifier, with the number of folds set to 2, 5 and 10. We also set the maximum number of leaf nodes to 50 to ease the validation of the decision rules, as this parameter makes the model grow the tree in a best-first fashion. Table III shows outputs selected at random from all the experiments.
The performance of the Decision Tree classification is evaluated by comparing the accuracy of the models in each experiment. Since the experiments are based on the k-fold cross validation method, the average accuracy of each model is compared. Table IV and Fig. 4 show that the accuracy of the models increases as a larger value of K is used. It can be concluded that the Decision Tree classifier with the Gini criterion and 10-fold cross validation is the best-fit model for this dataset, as it has the highest average accuracy among the models at 81.30%.

V. DISCUSSION
Customer segmentation analysis is crucial for insurance companies to identify who the profitable customers are, what percentage of the total population they represent, how to find more clients with similar profiles, and how to manage less profitable clients [37]. Based on the results of this study, it can be concluded that customer segmentation can be achieved by using data mining techniques. Clustering and classification methods complement each other in producing the customer segmentation.
Once the customer segments are identified, several strategies may be adopted to cater to each group of customers. For customers who fall under "Potential High-Value Customers", the insurance company may want to focus on keeping them loyal to the company over the long term by providing good services. Since this group of customers contributes a lot to the company, the insurer may want to find more opportunities to sell more products to them based on their needs [9]. The insurer also needs to keep in touch with these customers from time to time to stay updated on their condition and well-being so that they feel comfortable with the company [9], [38]. Insurance companies could build a relationship of trust with the customers, which would increase cross-selling and up-selling [38]. Being aware of customers' triggering events, such as having a baby or buying a house, could increase product sales even more significantly, according to the research in [38]. Furthermore, the insurer may provide a better customer experience through a strategic and tactical focus on the five key organizational processes, i.e., making strategic choices, creating value for customers, customer acquisition, customer retention, service quality and loyalty or rewards programs, which can be achieved with a good CRM tool [39].
Nevertheless, the company must also have strategies for the "Low-Value Customers" and "Disinterested Customers" groups. Although "Low-Value Customers" are not the main contributors to the company, they are still customers who are willing to take a chance in trusting the insurance company. Insurance companies may want to adopt a customer-centric approach to provide a superior customer experience [40]. This includes providing better customer services with more customization and personalization, and providing appropriate channels of communication to keep the customers informed, demanding and connected.
On the other hand, upon detecting customers who fall into the "Disinterested Customers" group, insurance companies can talk to these customers and advise them to keep their policies when they intend to cancel. The study in [39] shows that customers place greater demands on service quality, interaction management, contact programs, retention management, service strategy, customer satisfaction and customer loyalty, which could also be addressed with a systematic CRM tool in place.

VI. CONCLUSION AND FUTURE WORK
This study has presented work on customer segmentation and profiling for the insurance industry using K-Modes Clustering and Decision Tree Classifiers. Customers are grouped by analyzing the similarity of their characteristics, which makes it possible to determine the target customers. It is highly recommended that life insurance companies segment their customers so that they can offer suitable products or services in accordance with the needs of the customers.
Future researchers may consider using a larger dataset covering longer time periods to obtain more accurate customer segmentation results. If the data are too large, they may consider applying a dimensionality reduction technique such as Principal Component Analysis (PCA) to transform the data into useful components. It is also suggested that future studies use the transactional details of the customers to monitor their behaviors and include more product categories, such as Credit Life Insurance products, as well as all rider products purchased by customers.
Future work could also compare the results with other classification models such as Random Forest, Naïve Bayes or Artificial Neural Network (ANN). In addition, computational complexity could be analyzed to assess the learning efficiency and performance of the customer segmentation. The approach proposed in this study can also be applied in other industries such as retail, hospitals, food chains, bookstores and so forth.