Visualization and Analysis in Bank Direct Marketing Prediction

Abstract—Gaining the most benefit from a data set is a difficult task because it requires an in-depth investigation of its different features and their corresponding values. This task is usually achieved by presenting the data in a visual format to reveal hidden patterns. In this study, several visualization techniques are applied to a bank's direct marketing data set. The data set, obtained from the UCI machine learning repository, is imbalanced. Thus, several oversampling methods are used to enhance the accuracy of predicting a client's subscription to a term deposit. Visualization efficiency is tested through the oversampling techniques' influence on the performance of multiple classifiers. Results show that the agglomerative hierarchical clustering technique outperforms the other oversampling techniques and that the Naive Bayes classifier gives the best prediction results.


I. INTRODUCTION
Bank direct marketing is an interactive process of building beneficial relationships among stakeholders. Effective multichannel communication involves the study of customer characteristics and behavior. Apart from profit growth, which may raise customer loyalty and positive responses [1], the goal of bank direct marketing is to increase the response rates of direct promotion campaigns.
Available bank direct marketing datasets have been actively investigated. The purpose of the analysis is to specify target groups of customers who are interested in specific products. A small direct marketing campaign dataset from a Portuguese banking institution [2], for example, has been subjected to experiments in the literature. Handling imbalanced datasets requires the use of resampling approaches. Undersampling and oversampling techniques reverse the negative effects of imbalance [3]; these techniques also increase the prediction accuracy of some well-known machine learning classification algorithms.
Data visualization is involved in financial data analysis, data mining, and market analysis. It refers to the use of computer-supported, interactive visual representation to amplify cognition and convey complicated ideas underlying data. This approach is efficiently implemented through charts, graphs, and design elements. Executives and knowledge workers often use these tools to extract information hidden in voluminous data [4] and thereby derive the most appropriate decisions. The use of data visualization by decision makers and their organizations offers many benefits [2], including absorbing information in new and constructive ways. Visualizing relationships and patterns between operational and business activities can help identify and act on emerging trends. Visualization also enables users to manipulate and interact with data directly and fosters a new business language to tell the most relevant story.
The choice of a proper visualization technique depends on many factors, such as the type of data (numerical or categorical), the nature of the domain of interest, and the final visualization purpose [5], which may involve plotting the distribution of data points or comparing different attributes of the same data point. Many other factors play a remarkable role in determining the best visualization technique that can detect hidden correlations in text-based data and facilitate recognition by domain experts.
The current research is an attempt to demonstrate the capabilities of different visualization techniques while performing different classification tasks on a direct marketing campaign. The data set, which contains 4521 instances and 17 features, including an output class, originates from a Portuguese banking institution. The goal is to predict whether a client will subscribe to a term deposit. The data set is highly imbalanced, so some oversampling methods are applied as a preprocessing step to enhance prediction accuracy. Random forest, support vector machine (SVM), neural network (NN), Naive Bayes, and k-nearest neighbor (KNN) classifiers are then applied. A comparison is conducted to identify the best results under the Gmean and accuracy evaluation metrics.
The rest of the paper is organized as follows. The second section presents a review of the literature on the bank direct marketing dataset. The third section briefly describes the oversampling techniques used in this research, and the fourth section covers the classification techniques. The fifth section presents the visualization methods. The sixth section introduces the data set and the methodology, and the final section discusses the results obtained from running five different classifiers and their implications for the final prediction.

II. RELATED WORK
From a broad perspective, the authors of [6] surveyed the theoretical foundations of marketing analytics, a diverse field emerging from operations research, marketing, statistics, and computer science. They stated that predicting customer behavior is one of the challenges in direct marketing analysis. They also discussed big data visualization methods for the marketing industry, such as multidimensional scaling, correspondence analysis, latent Dirichlet allocation, and customer relationship management (CRM). They discussed geographic visualization as it relates to retail location analysis and tackled the general trade-off between its common practices and art. Additionally, they elaborated on discriminant analysis as a technique for marketing prediction. Discriminant analysis includes methods such as ensemble learning, feature reduction, and extraction. These techniques address problems such as purchase behavior, review ratings, customer loyalty, customer lifetime value, sales, profit, and brand visibility.
The authors of [7] analyzed customer behavior patterns through CRM. They applied Naive Bayes, J48, and multi-layer perceptron NN classifiers to the same data set used in the current work and assessed the performance of their models using sensitivity, accuracy, and specificity measures. Their methodology involved understanding the domain and the data, building the model for evaluation, and finally visualizing the outputs. The visualization of their results showed that the J48 classifier outperformed the others with an accuracy of 89.40%.
Moreover, [8] employed the same data set for other customer profiling purposes. Naive Bayes, random forests, and decision trees were applied to the extended version of the data set examined in the current work. Preprocessing and normalization were conducted before evaluating the classifiers, and the RapidMiner tool was used for the experiments and evaluation. They illustrated the parameter adjustments of each classifier under the previously applied normalization technique and showed the impact of these parameter values on accuracy, precision, and recall. Their results showed that decision trees are the best classifier for customer profiling and behavior prediction. By contrast, [9] used the extended data set to create a logistic regression model for customer behavior prediction. This model is built on top of specific feature selection algorithms: mutual information (MI) and data-based sensitivity analysis (DSA) are used to improve the performance in terms of false-positive hits. They reduced the number of feature sets influencing the success of this marketing sector. They found that DSA is superior in the case of a low false-positive ratio, with nine selected features, whereas MI is slightly better when false-positive values are marginally high, with 13 features selected from a wide range of different features.
Additionally, a framework of three feature selection strategies was introduced by [10] to reveal novel features that directly affect data quality, which, in turn, exerts a significant impact on decision making. The strategies include the identification of contextual features and the evaluation of historical features. The problem is divided, in a divide-and-conquer manner, into sub-problems to reduce the complexity of the feature selection search space. Their framework was tested on the extended version of the data set used in the current work, with the goal of targeting the best customers in marketing campaigns. The candidacy of the most highly correlated hidden features was determined using DSA. The process involved designing new features from past occurrences, aided by a domain expert. The last strategy split the original data according to the most relevant set of features. The experiments confirmed the enrichment of the data for better decision-making processes.
From the visualization perspective, [11] explained several types of visualization techniques, such as radial, hierarchical, graph, and bar chart visualization, and presented the impact of human-computer interaction knowledge on opinion visualization systems. Prior domain knowledge yielded high understandability, user-friendliness, usefulness, and informativeness. The age factor affected the usability metrics of other systems, such as visual appeal, comprehensiveness, and intuitiveness. These findings were projected onto visualization in the direct marketing industry because it is mainly aided by end users and customers.

III. OVERSAMPLING TECHNIQUES
Oversampling is a concept related to the handling of imbalanced datasets. It is performed by replicating or synthesizing minority class instances. Common approaches include randomly choosing instances (ROS) or choosing special instances on the basis of predefined conditions. Although oversampling methods are information-sensitive [12], they often lead to overfitting, which may cause misleading classifications. This problem can be overcome by combining oversampling techniques with an ensemble of classifiers at the algorithmic level to attain the best performance.
An overview of the oversampling techniques used in the preprocessing phase of the current research, along with a brief description of the exploited classification algorithm, is introduced.

A. Synthetic Minority Oversampling Technique
The synthetic minority oversampling technique (SMOTE) generates synthetic instances on the basis of existing minority observations by calculating the k-nearest neighbors of each one [13]. The amount of oversampling needed determines the number of synthetic instances created randomly on the line linking an observation to one of its k-nearest neighbors.
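The interpolation step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference SMOTE implementation (libraries such as imbalanced-learn provide a tuned version); the toy minority points are invented for demonstration:

```python
import numpy as np

def smote(minority, n_synthetic, k=5, rng=None):
    """Generate synthetic minority samples by interpolating between
    each chosen point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    minority = np.asarray(minority, dtype=float)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(minority))
        x = minority[i]
        # Euclidean distances to all minority points; skip the point itself
        d = np.linalg.norm(minority - x, axis=1)
        neighbours = np.argsort(d)[1:k + 1]
        nn = minority[rng.choice(neighbours)]
        gap = rng.random()                     # random position on the link line
        synthetic.append(x + gap * (nn - x))
    return np.array(synthetic)

# Toy 2-D minority class (hypothetical values)
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
                     [2.5, 2.5], [1.2, 1.8], [2.2, 1.2]])
new_points = smote(minority, n_synthetic=4, k=3, rng=0)
```

Because each synthetic point is a convex combination of two minority points, every generated sample stays inside the bounding box of the minority class.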

B. Adaptive Synthetic Sampling Technique
Adaptive synthetic sampling (ADASYN) is an improved version of SMOTE. After creating the random samples along the link line, ADASYN adds small values to produce scattered, more realistic data points [14] with reasonable variance built upon a weighted distribution. This distribution is set according to the level of difficulty in learning, emphasizing the minority class examples that are difficult to learn.

C. Random Over Sampling Technique
ROS is a non-heuristic technique. It is less computationally expensive than other oversampling methods yet competitive with more complex ones [15]. A large number of positive minority instances is likely to produce meaningful results under this technique.
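Since ROS simply duplicates randomly chosen minority instances, it fits in a few lines of plain Python. The class sizes below mirror the paper's data set; the implementation itself is a hedged sketch, not a library routine:

```python
import random

def random_oversample(majority, minority, seed=42):
    """Duplicate randomly chosen minority instances until the
    minority class matches the majority class in size (non-heuristic ROS)."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority, minority + extra

no_class = ["no"] * 4000    # majority-class placeholders
yes_class = ["yes"] * 521   # minority-class placeholders
maj, balanced_min = random_oversample(no_class, yes_class)
```

After resampling, both classes contain 4000 records, so the training distribution is balanced at the cost of repeated minority instances.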

D. Adjusting the Direction of the Synthetic Minority Class Examples Technique
Adjusting the direction of the synthetic minority class examples (ADOMS) is another common oversampling technique [15]. It relies on principal component analysis of the local data distribution in the feature space, using the Euclidean distance between each minority class example and a random number of its k-nearest neighbors, aided by projection and scaling parameters.

E. Selective Preprocessing of Imbalanced Data Technique
The selective preprocessing of imbalanced data (SPIDER) technique introduces a new concept of oversampling [16] and comprises two phases. The first phase identifies the type of each minority class example by flagging it as safe or noisy using the nearest neighbor rule. The second phase processes each example according to one of three strategies: weak amplification, weak amplification with relabeling, and strong amplification.

F. Agglomerative Hierarchical Clustering Technique
Agglomerative hierarchical clustering (AHC) [16] starts with a singleton cluster for every minority class example. In each iteration, AHC merges the closest pair of clusters satisfying some similarity criteria, keeping the synthetic instances within the boundaries of the class. The process is repeated until all the data are in one cluster. Clusters of different sizes in the resulting tree can be valuable for discovery.

IV. CLASSIFICATION TECHNIQUES
In this section, a concise description of the classification algorithms used in this research is presented.

A. Random Forests
Random forest [17] is a supervised learning algorithm used for classification and regression tasks. It is distinguished from a single decision tree by the randomized process of selecting the features on which nodes split. Random forest is efficient in handling missing values. Overfitting is a possible drawback of this algorithm unless a sufficient number of trees is generated to enhance prediction accuracy.

B. Support Vector Machines
SVM is a learning algorithm that can be used for regression tasks; however, SVM [18] is preferable for classification tasks. The algorithm is based on the following idea: if a classifier is effective in separating convergent, non-linearly separable data points, then it should perform well on dispersed ones. SVM finds the separating hyperplane that maximizes the margin between the decision boundaries of the classes.

C. Artificial Neural Networks
An ANN approximates some unknown function by having layers of "neurons" operate on one another's outputs. Neurons in a layer close to the output use the sum of the answers from the neurons of the previous layers. Neurons are usually functions whose outputs do not depend linearly on their inputs. An ANN [19] uses initial random weights to determine the attention given to specific neurons. These weights are iteratively adjusted by the back-propagation algorithm to reach a good approximation of the desired output.

D. Naive Bayes
Naive Bayes [20] is a direct and powerful classifier based on Bayes' theorem. It predicts the probability that a given record or data point belongs to a particular class; the class with the highest probability is considered the most likely one. The algorithm assumes that all features are independent and unrelated. The Naive Bayes model is simple and easy to build and is particularly useful for large data sets. This model can outperform even highly sophisticated classification methods.
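To make the "highest posterior probability wins" idea concrete, here is a minimal categorical Naive Bayes in plain Python with Laplace smoothing. The two-feature toy records (loosely echoing the data set's job and housing attributes) are invented for illustration; this is a sketch of the principle, not the classifier implementation used in the experiments:

```python
from collections import Counter, defaultdict
import math

def train_nb(rows, labels):
    """Fit class priors and per-class, per-feature value counts."""
    priors = Counter(labels)
    counts = defaultdict(Counter)   # (class, feature_index) -> value counts
    for row, y in zip(rows, labels):
        for j, v in enumerate(row):
            counts[(y, j)][v] += 1
    return priors, counts

def predict_nb(model, row):
    """Return the class with the highest log-posterior for `row`."""
    priors, counts = model
    total = sum(priors.values())
    best, best_lp = None, -math.inf
    for y, ny in priors.items():
        lp = math.log(ny / total)   # log prior
        for j, v in enumerate(row):
            c = counts[(y, j)]
            # Laplace smoothing so unseen values never give zero probability
            lp += math.log((c[v] + 1) / (ny + len(c) + 1))
        if lp > best_lp:
            best, best_lp = y, lp
    return best

# Hypothetical toy records: (job, housing-loan) -> subscribed?
rows = [("blue-collar", "yes"), ("management", "no"),
        ("blue-collar", "yes"), ("management", "no")]
labels = ["no", "yes", "no", "yes"]
model = train_nb(rows, labels)
```

Calling `predict_nb(model, ("management", "no"))` returns the class whose prior times the per-feature likelihoods is largest, exactly the argmax rule described above.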

E. K-Nearest Neighbor
The KNN learning algorithm is a simple classification algorithm that works on the basis of the smallest distances from the query instance to the training samples [21], taking the simple majority vote of the k nearest neighbors as the prediction for the query. KNN is used for its predictive power and low computation time, and it usually produces highly competitive results.
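The distance-then-majority-vote procedure can be sketched directly from the description above. The training points and labels below are invented toy values; a production system would use an optimized library implementation:

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by the majority label among the k training
    samples closest to it in Euclidean distance."""
    dists = sorted((math.dist(x, query), y) for x, y in zip(train_X, train_y))
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

# Two hypothetical clusters: "no" near the origin, "yes" around (5, 5)
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["no", "no", "no", "yes", "yes", "yes"]
```

A query near the origin is outvoted 3-0 by "no" neighbors, and one near (5, 5) by "yes" neighbors, illustrating the simple-majority rule.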

V. VISUALIZATION METHODS
Data visualization exerts a considerable impact on the user's software experience. The decision-making process benefits from the details obtained from large data volumes [22], which are usually presented in a coherent and compact manner.
The purpose of the established model is to emphasize the importance of data visualization methods, which help in conducting a perceptual analysis of a given situation. Visualization methods are used to illustrate hidden patterns inside data sets. This section introduces the characteristics of the visualization methods used in this research.

A. Scatter Plot
A scatter plot is a graphical display of data dispersion in Cartesian coordinates [22] that shows the strength of the relationships between variables and identifies their outliers. Variations include scatter plots with a trend line, which are used to reveal patterns in the distribution of the data points.

B. Bar Charts
Bar charts are used to represent discrete single data series [23]. The bar length usually represents the corresponding value. Variations include multi-bar charts, floating bar charts, and candlestick charts.

C. Pie Charts
A pie chart, a synonym of a circle graph [23], is divided into a number of sectors, each describing the size of a data wedge. Sectors are compared using labeled percentages. Variations include doughnut, exploding, and multi-level pie charts for hierarchical data.

D. Line Charts
A line chart [23] is used to display the trend of data as points connected by straight lines over a specific interval of time. Variations include step and symbolic line charts, vertical-horizontal segments, and curve line charts.

VI. EXPERIMENTS

A. DataSet
The data set is obtained from the UCI machine learning repository. It is related to the direct marketing campaigns of a Portuguese banking institution. The current research uses the small version of the raw data set, which contains 4521 instances and 17 features (16 features and the output). The classification goal is to predict whether the client will subscribe (yes/no) to a term deposit (variable y). Table I shows the dataset description, and Fig. 1 illustrates the distribution of each feature. Fig. 2 shows the relationship between each attribute and the output; attributes with fewer than six unique values are selected for this plot. According to the distribution of the target class shown in Figs. 3 and 4, the data set is imbalanced: the percentage of class yes is 11.5% (521 records out of 4521), whereas that of class no is 88.5% (4000 records out of 4521).

B. Methodology
A preprocessing phase is first implemented to balance the data distribution by applying oversampling techniques, such as SMOTE with varying percentages. The next step is to determine which of ADASYN, ROS, SPIDER, ADOMS, and AHC is superior; the selection is aided by proper visualization methods. Then, the random forest, SVM, ANN, Naive Bayes, and KNN classification algorithms are applied, and they are assessed using Gmean as the evaluation metric. Other essential measurements, such as accuracy and recall, are also used.
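The Gmean metric used for assessment is the geometric mean of sensitivity (recall on the minority "yes" class) and specificity (recall on the majority "no" class), which rewards classifiers that do well on both classes of an imbalanced set. A small sketch, assuming yes/no labels as in this data set:

```python
import math

def gmean(y_true, y_pred, positive="yes"):
    """Geometric mean of sensitivity and specificity."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    pos = sum(t == positive for t in y_true)
    neg = len(y_true) - pos
    sensitivity = tp / pos    # recall on the minority class
    specificity = tn / neg    # recall on the majority class
    return math.sqrt(sensitivity * specificity)

# Hypothetical predictions for six clients
y_true = ["yes", "yes", "no", "no", "no", "no"]
y_pred = ["yes", "no",  "no", "no", "no", "yes"]
g = gmean(y_true, y_pred)
```

Here sensitivity is 1/2 and specificity is 3/4, so the Gmean is sqrt(0.375); a classifier that ignored the minority class entirely would score 0 despite high accuracy, which is why Gmean is preferred over accuracy for this imbalanced task.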

VII. RESULTS AND DISCUSSION
In the preprocessing step, the SMOTE percentage is set to 100. Gmean and accuracy are calculated for the different classifiers. Table II shows the results.
Accordingly, the Naive Bayes technique has the highest Gmean among all the techniques. The accuracy and Gmean of all five classifiers on the original dataset are shown in Table III.
Then, Naive Bayes is applied with different SMOTE percentages to select the one that yields the best results on this data set. Table IV shows the results of Naive Bayes with SMOTE percentages ranging from 100 to 800. The best results with Naive Bayes are obtained with the SMOTE percentage set to 400, as shown in Fig. 5, which plots the Gmean of the Naive Bayes classifier at the different SMOTE percentages; the results in Table IV are shown as a chart to give a clear reading of the best Gmean. Another experiment is conducted to determine the second best classifier under SMOTE set to 400; Table V shows the results. As shown in Fig. 6, the best classifiers are those with the highest Gmean, that is, Naive Bayes and SVM with a Gmean value of 0.74. SVM and Naive Bayes are thus both good candidates for further analysis; Naive Bayes is preferred over SVM because it is fast and easy to build. Fig. 6 also shows the Gmean values before and after applying SMOTE; SMOTE enhances the Gmean of all the applied classification techniques. The final step is to compare the different oversampling techniques under their best Gmean percentages; Table VII shows this comparison.

TABLE I .
ATTRIBUTE INFORMATION

TABLE II .
GMEAN AND ACCURACY OF DIFFERENT CLASSIFICATION TECHNIQUES AFTER USING THE SMOTE OVERSAMPLING TECHNIQUE WITH PERCENTAGE 100

TABLE III .
RESULTS OF DIFFERENT CLASSIFICATION TECHNIQUES ON THE ORIGINAL DATASET (GMEAN, ACCURACY)

TABLE IV .
GMEAN AND ACCURACY OF NAIVE BAYES WITH DIFFERENT SMOTE PERCENTAGES RANGING FROM 100 TO 800

TABLE VII .
COMPARISON BETWEEN DIFFERENT OVERSAMPLING TECHNIQUES WITH BEST GMEAN PERCENTAGE