Clustering Methods for Credit Card using Bayesian rules based on K-means classification
S. Jessica Saritha

The K-means clustering algorithm is a method of cluster analysis that aims to partition n observations into k clusters, in which each observation belongs to the cluster with the nearest mean. It is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. It is similar to the expectation-maximization algorithm for mixtures of Gaussians in that both attempt to find the centers of natural clusters in the data. Bayes' rule is a theorem in probability theory named for Thomas Bayes. It is used to update probabilities by computing conditional probabilities given new data. In this paper, the K-means clustering algorithm and Bayesian classification are combined to analyze credit card data. The results of the analysis can be used to improve classification accuracy.


I. INTRODUCTION
Data mining is an interdisciplinary product that draws on the strengths of many fields, including database technology, artificial neural networks, statistical methods, pattern recognition, information retrieval, and data visualization. A commonly accepted definition is that it is a complex process of extracting implicit, previously unknown, and potentially useful models, rules, or practical knowledge from a database. It is in effect a method of deep data analysis. To improve the accuracy of basic Bayesian classification, much of the literature has focused on relaxing the conditional independence assumption. In fact, accuracy is affected not only by the correlation among the attributes but also by the completeness of the data. For this reason, the K-means clustering algorithm is introduced into the Bayesian classification method with the aim of improving accuracy. K-means clustering has been used in many fields [1][2]. Bayes' rule is a theorem in probability theory named for Thomas Bayes. It is used to update probabilities by computing conditional probabilities given new data, and it has also been used in many fields [3][4]. Credit cards are the fastest-growing business in the banking industry. To guard against risk, many methods have been developed to mine or analyze customer data [5]. In this paper, the K-means clustering algorithm and Bayesian classification are combined to analyze credit card data. The results can be used to improve accuracy and show that the method is feasible.
This paper is supported by the Leading Academic Discipline Program, 211 Project for Shanghai University of Finance and Economics (the 3rd phase).

II. K-MEANS RULES
The key idea of the K-means clustering rule is to divide the data into clusters by an iterative method. The ultimate aim is to minimize the objective function so that the resulting clusters are as compact and as well separated as possible.
Input: the expected number of clusters k and a database of n objects.
Output: k clusters that minimize the squared-error criterion function.
Steps:
(1) Select k objects as the initial cluster centroids.
(2) Calculate the distance between each object and every cluster centroid, then assign each object to the closest cluster.
(3) Recalculate the mean of each new cluster.
(4) Repeat steps (2) and (3) until the centroids no longer change.
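The four steps can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the paper: taking the first k objects as initial centroids, using Euclidean distance, and the sample points themselves are all assumptions made for the example.

```python
import math

def kmeans(points, k, max_iter=100):
    """Plain K-means on a list of numeric tuples (illustrative sketch)."""
    # Step 1: take the first k objects as initial centroids (assumed choice).
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each object to its closest centroid (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 4: unchanged assignments mean the centroids no longer move.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, labels

# Hypothetical two-dimensional data with two visible groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
```

Because the result depends on the initial centroids, a different choice in step 1 may lead to a different partition on less clearly separated data.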

Features:
(1) The number of clusters k is fixed in advance.
(2) An initial partition is created, and then an iterative relocation technique is applied.
(3) The distance matrix need not be determined in advance.
(4) It requires less computation than hierarchical clustering methods and is suitable for large sample databases.
(5) It is suitable for discovering spherical clusters.
The K-means algorithm has the advantages of fast clustering and easy implementation. However, the number of clusters k must be fixed in advance, which affects the result, and the initial cluster centers are chosen at random, which may make the result unstable. Hence it is of high value to improve the quality and stability of the cluster analysis.

III. BAYESIAN RULE
Bayesian classification is a statistical method. It can predict the probability that a given data object belongs to a particular category. It assumes that, given the category, the values of all attributes are independent of each other. This assumption, also called class conditional independence, effectively reduces the computation needed to build the Bayesian classifier. The basic Bayesian classifier is described as follows. Suppose a set of variables {A1, A2, ..., An, C}, where A1, A2, ..., An are the attribute variables of the problem and C is the class variable. The Bayesian classifier has the following features:
(1) It does not assign an object to a category unconditionally. Instead, it computes the probability of each category, and the object is assigned to the category with the largest probability.
(2) Generally, all attributes play a latent role in the classification: the classification is determined not by one or a few attributes but by all of them.
(3) The attributes of the objects can be discrete, continuous, or mixed. In theory, the Bayesian classifier has the lowest error rate compared with other classification methods.

A. Bayesian Theories
Let X be a data sample whose category is unknown, and let H be a hypothesis, for example that sample X belongs to a specific category C. For classification, the goal is to determine P(H|X), the probability that H holds given the observed sample X. P(H|X) is the posterior probability of H conditioned on X. For example, suppose the samples are fruit described by the attributes color and shape, X denotes red and round, and H is the hypothesis that X is an apple. Then P(H|X) is the probability that X is an apple given that X is known to be red and round. In contrast, P(H) is the prior probability; in the example above, P(H) is the probability that any given sample is an apple regardless of its color and shape. P(H|X) is based on more information, while P(H) is independent of X. Similarly, P(X|H) is the posterior probability of X conditioned on H: if it is already known that the sample is an apple, P(X|H) is the probability that it is red and round. P(X) is the prior probability of X, that is, the probability that a sample picked from the collection is red and round. Bayes' rule describes how to compute P(H|X) from P(X), P(H), and P(X|H), all of which can be estimated from the data collection.
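Written out, the relation among these four quantities is Bayes' theorem. The numbers in the second line are hypothetical values for the fruit example, chosen only to show the arithmetic; they do not come from the paper.

```latex
P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)}

% Hypothetical values: P(H) = 0.3, P(X \mid H) = 0.8, P(X) = 0.4
P(H \mid X) = \frac{0.8 \times 0.3}{0.4} = 0.6
```

With these assumed values, learning that a fruit is red and round raises the probability that it is an apple from the prior 0.3 to the posterior 0.6.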

B. The basic Bayesian classification procedure
(1) Every data sample is represented by an n-dimensional vector X = (x1, x2, ..., xn) giving the values of its n attributes A1, A2, ..., An.
(2) Suppose there are m different categories C1, C2, ..., Cm, and an unknown data sample X is given. The classifier predicts the category to which X most likely belongs, given X. That is, the basic Bayesian classifier assigns the unknown sample X to category Ci if and only if (1) holds:
$$P(C_i \mid X) > P(C_j \mid X), \quad 1 \le j \le m,\ j \ne i \tag{1}$$
In other words, P(Ci|X) is the largest. The category Ci is called the maximum a posteriori hypothesis. By Bayes' rule,
$$P(C_i \mid X) = \frac{P(X \mid C_i)\, P(C_i)}{P(X)} \tag{2}$$
(3) Since P(X) is the same for all categories, it suffices to maximize P(X|Ci)P(Ci). If the prior probabilities of the categories are unknown, the categories are usually assumed to be equally likely, that is, P(C1) = P(C2) = ... = P(Cm). In that case, maximizing formula (2) reduces to maximizing P(X|Ci); otherwise, P(X|Ci)P(Ci) is maximized. The prior probability of each category can be estimated by P(Ci) = si/s, where si is the number of training samples of category Ci and s is the size of the training set.
(4) For a data set with many attributes, computing P(X|Ci) directly would require a very large amount of computation. To estimate P(X|Ci) efficiently, the Bayesian classifier usually makes the class conditional independence assumption: for a given category, the attribute values are independent of each other. Then
$$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$$
The values P(x1|Ci), P(x2|Ci), ..., P(xn|Ci) can be estimated from the training samples as follows. If Ak is a categorical attribute, P(xk|Ci) = sik/si, where sik is the number of training samples of category Ci for which Ak has the value xk, and si is the number of training samples in category Ci. If Ak is a continuous attribute whose values are assumed to follow a Gaussian distribution, then
$$P(x_k \mid C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i}) = \frac{1}{\sqrt{2\pi}\,\sigma_{C_i}} \exp\!\left(-\frac{(x_k - \mu_{C_i})^2}{2\sigma_{C_i}^2}\right)$$
where \mu_{C_i} and \sigma_{C_i} are the mean and standard deviation of attribute Ak over the training samples of category Ci.
(5) To predict the category of an unknown sample X, P(X|Ci)P(Ci) is evaluated for each category Ci. Sample X is assigned to category Ci if and only if
$$P(X \mid C_i)\, P(C_i) > P(X \mid C_j)\, P(C_j), \quad 1 \le j \le m,\ j \ne i$$
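The procedure can be illustrated with a short Python sketch for categorical attributes, with a helper for the Gaussian case. The function names (`train`, `classify`, `gaussian`) and the fruit data are hypothetical and serve only to make the estimates P(Ci) = si/s and P(xk|Ci) = sik/si concrete; no smoothing is applied, so an unseen attribute value yields a probability of zero.

```python
import math
from collections import Counter, defaultdict

def train(samples, labels):
    """Estimate P(Ci) = si/s and the counts needed for P(xk|Ci) = sik/si."""
    s = len(samples)
    class_counts = Counter(labels)        # si for each category
    value_counts = defaultdict(Counter)   # (category, k) -> counts of values of Ak
    for x, c in zip(samples, labels):
        for k, v in enumerate(x):
            value_counts[(c, k)][v] += 1
    priors = {c: n / s for c, n in class_counts.items()}
    return priors, class_counts, value_counts

def classify(x, priors, class_counts, value_counts):
    """Return the category Ci that maximizes P(X|Ci) P(Ci)."""
    def score(c):
        p = priors[c]
        for k, v in enumerate(x):
            p *= value_counts[(c, k)][v] / class_counts[c]  # sik / si
        return p
    return max(priors, key=score)

def gaussian(x, mu, sigma):
    """Gaussian density, used for P(xk|Ci) when Ak is continuous."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Hypothetical training data: (color, shape) -> fruit.
samples = [("red", "round"), ("red", "round"), ("green", "round"),
           ("yellow", "long"), ("yellow", "long"), ("green", "long")]
labels = ["apple", "apple", "apple", "banana", "banana", "banana"]
model = train(samples, labels)
prediction = classify(("red", "round"), *model)
```

For a continuous attribute, the ratio sik/si in `score` would be replaced by `gaussian(v, mu, sigma)` with the mean and standard deviation estimated per category.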

IV. CLUSTERING METHODS
In this paper, the naive Bayes concept is used for clustering. Assuming K clusters, the objects are grouped according to maximum posterior probability. The clustering process starts with K clusters, each with one object as a member. Treating this as prior information, the posterior probability is computed for each of the other objects, and the object is placed in the cluster with the maximum posterior probability. The objects are read one by one and placed in their respective clusters. The proposed method is based on the concept of K-modes. The number of clusters and the initial set of modes are given as input: K distinct records are selected as the initial values for the K clusters.
http://ijacsa.thesai.org/

Algorithm:
Input: data set T; number of clusters K.
i. Select K distinct records as the initial objects of the clusters.
ii. Read the tuple X.
iii. Compute P(Ci | X), 1 ≤ i ≤ K.
iv. Place the object in the cluster that gives the maximum posterior probability.
v. Repeat (ii) to (iv) until all the objects in the data set have been placed.
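One possible reading of this algorithm for categorical records is sketched below. Several details are assumptions, since the text does not specify them: the first K distinct records are used as seeds, the cluster prior is the current cluster size divided by the number of objects placed so far, and Laplace (add-one) smoothing keeps a one-member cluster from assigning zero probability to every unseen value. The function name `bayes_cluster` and the records are hypothetical.

```python
from collections import Counter

def bayes_cluster(records, k):
    """Place categorical records (tuples) into k clusters by maximum posterior."""
    n_attrs = len(records[0])
    # Number of distinct values per attribute, used for add-one smoothing (assumed).
    n_values = [len({r[a] for r in records}) for a in range(n_attrs)]
    labels = [None] * len(records)
    sizes, counts, seen = [], [], set()
    # Step i: the first k distinct records become one-member clusters.
    for idx, r in enumerate(records):
        if len(sizes) < k and r not in seen:
            seen.add(r)
            labels[idx] = len(sizes)
            sizes.append(1)
            counts.append([Counter([v]) for v in r])
    # Steps ii to v: read the remaining tuples one by one.
    for idx, r in enumerate(records):
        if labels[idx] is not None:
            continue
        placed = sum(sizes)

        def posterior(c):
            # Proportional to P(Ci|X): cluster prior times smoothed likelihoods.
            p = sizes[c] / placed
            for a, v in enumerate(r):
                p *= (counts[c][a][v] + 1) / (sizes[c] + n_values[a])
            return p

        best = max(range(len(sizes)), key=posterior)
        labels[idx] = best
        sizes[best] += 1
        for a, v in enumerate(r):
            counts[best][a][v] += 1
    return labels

# Hypothetical customer records: (income level, housing status).
records = [("low", "renter"), ("high", "owner"), ("low", "renter"),
           ("high", "owner"), ("low", "renter")]
cluster_labels = bayes_cluster(records, k=2)
```

Since each object is placed once and the cluster statistics are updated immediately, the outcome depends on the order in which the tuples are read.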

V. SIMULATION
Integration of the K-means algorithm and Bayesian classification
Banks are a pillar of a nation's finance industry, and a large share of their interest income comes from the loan business. Because of borrowers' credit risk, many loans cannot be repaid; such loans are called "bad debts" and cause great losses to the bank. To prevent these losses and lower banks' potential debt collection risk, data mining can be used to analyze past debt collection cases and so help banks perform credit rating. In this way, bad debts can be prevented in advance. This paper is set against the background of banks' credit problems. A practical example is used to show how the new method, which combines the K-means algorithm with Bayesian classification, handles such data. The data to be processed are presented in tabular form in Table 1. To improve the accuracy of the Bayesian method, the data are first classified by the K-means method. The conditions for the classification are shown in Table 2 (data classification).
Of the more than 600 records available, 500 are used to build the model, and the rest are used only for validation and adjustment. Using the first two records as the initial centers, the K-means analysis is applied to the records one by one. After several iterations, the data are classified into two categories. For convenience, this paper introduces the notion of ability to pay, defined as the sum of income and assets minus debts, and uses it as the criterion. The results are given in Table 3.
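The clustering step described here can be sketched as follows: ability to pay is computed as income plus assets minus debts, and a two-cluster K-means is run on that single value with the first two records as initial centers. The customer figures are invented for illustration and are not the paper's data; the function names are likewise hypothetical.

```python
def ability_to_pay(income, assets, debts):
    """Ability to pay as defined in the text: income + assets - debts."""
    return income + assets - debts

def two_means_1d(values, max_iter=100):
    """Two-cluster K-means on scalars, seeded with the first two values."""
    centers = [float(values[0]), float(values[1])]
    labels = []
    for _ in range(max_iter):
        # Assign each value to the nearer of the two centers.
        new_labels = [
            0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            for v in values
        ]
        if new_labels == labels:  # no change, so the centers are stable
            break
        labels = new_labels
        # Recompute each center as the mean of its members.
        for c in (0, 1):
            members = [v for v, l in zip(values, labels) if l == c]
            if members:
                centers[c] = sum(members) / len(members)
    return centers, labels

# Hypothetical customers: (income, assets, debts), in arbitrary units.
customers = [(50, 120, 30), (20, 10, 40), (60, 150, 20), (15, 5, 30), (45, 100, 25)]
abilities = [ability_to_pay(*c) for c in customers]
centers, groups = two_means_1d(abilities)
```

The two resulting groups could then serve as the category labels on which the Bayesian classifier is trained, which is the combination the paper proposes.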
The comparison of classification methods in this study shows that the Bayesian classifier performs comparably to neural networks. In analyzing large databases, the Bayesian classifier has also shown high accuracy and computational speed. The Bayesian classifier has the following features: