Knowledge Discovery from Database Using an Integration of Clustering and Classification

Abstract: Clustering and classification are two important techniques of data mining. Classification is a supervised learning problem of assigning an object to one of several pre-defined categories based upon the attributes of the object. Clustering, by contrast, is an unsupervised learning problem that groups objects based upon distance or similarity; each group is known as a cluster. In this paper we use Fisher's Iris dataset, containing 5 attributes and 150 instances, to perform an integration of the clustering and classification techniques of data mining. We compared the results of a simple classification technique (using the J48 classifier) with the results of the integration of clustering and classification, based upon various parameters, using WEKA (Waikato Environment for Knowledge Analysis), a data mining tool. The results of the experiment show that the integration of clustering and classification gives promising results, with the highest accuracy rate and robustness, even when the data set contains missing values.


I. INTRODUCTION
Data mining is the process of automatic classification of cases based on data patterns obtained from a dataset. A number of algorithms have been developed and implemented to extract information and discover knowledge patterns that may be useful for decision support [2]. Data mining, also popularly known as Knowledge Discovery in Databases (KDD), refers to the nontrivial extraction of implicit, previously unknown and potentially useful information from data in databases [1]. Data mining techniques include pattern recognition, association, classification and clustering [7]. The proposed work focuses on challenges related to the integration of clustering and classification techniques. Classification has been identified as an important problem in the emerging field of data mining [5]. Given our goal of classifying large data sets, we focus mainly on decision tree classifiers [8] [9]. Decision tree classifiers are relatively fast compared to other classification methods. A decision tree can be converted into simple and easy-to-understand classification rules [10].
Finally, tree classifiers obtain similar and sometimes better accuracy when compared with other classification methods [11]. Clustering is the unsupervised classification of patterns into clusters [6]. The community of users has placed a lot of emphasis on developing fast algorithms for clustering large datasets [14]. Clustering groups similar objects together in a cluster (or clusters) and dissimilar objects in other clusters [12]. In this paper the WEKA (Waikato Environment for Knowledge Analysis) machine learning tool [13][18] is used to run the clustering and classification algorithms. The dataset used in this paper is Fisher's Iris dataset, which consists of 50 samples from each of three species of Iris flowers (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepal and of the petal, in centimeters. Based on the combination of the four features, Fisher developed a linear discriminant model to distinguish the species from each other.

A. Organisation of the paper
The paper is organized as follows: Section 2 defines the problem statement. Section 3 describes the proposed classification method to identify the class of an Iris flower as Iris-setosa, Iris-versicolor or Iris-virginica using a data mining classification algorithm and an integration of the clustering and classification techniques of data mining. Experimental results and performance evaluation are presented in Section 4. Finally, Section 5 concludes the paper and points out some potential future work.

II. PROBLEM STATEMENT
The problem in particular is a comparative study of the classification algorithm J48 with an integration of the SimpleKMeans clusterer and the J48 classifier on various parameters, using Fisher's Iris dataset containing 5 attributes and 150 instances.

III. PROPOSED METHOD
Classification is the process of finding a set of models that describe and distinguish data classes and concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown. Clustering is different from classification, as it builds the classes (which are not known in advance) based upon similarity between object features. Fig. 1 shows a general framework of the integration of clustering and classification. The integration of clustering and classification is useful even when the dataset contains missing values. Fig. 2 shows the block diagram of the steps of evaluation and comparison. In this experiment, an object corresponds to an Iris flower, and the object class label corresponds to the species of the Iris flower. Every Iris flower is described by the length and width of its petal and sepal, which are used to predict the species of the flower. The classification technique (J48 classifier) is applied using the WEKA tool. Classification is a two-step process. First, a classification model is built using training data; every object of the training dataset must be pre-classified, i.e. its class label must be known. Second, the model generated in the preceding step is tested by assigning class labels to data objects in a test dataset.
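The two-step process described above can be sketched in a few lines of Python. This is only an illustrative stand-in, not the J48 algorithm used in the experiment: the model is a single-split rule on one feature, the function names `train_stump` and `evaluate` are hypothetical, and the sample values are made up for illustration.

```python
from collections import Counter

def train_stump(train, feature):
    """Step 1: build a one-split model from pre-classified training samples.

    train is a list of (features, label) pairs. The threshold on the chosen
    feature that misclassifies the fewest training samples is kept.
    Assumes the feature takes at least two distinct values in train.
    """
    values = sorted({x[feature] for x, _ in train})
    best = None
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = Counter(label for x, label in train if x[feature] <= threshold)
        right = Counter(label for x, label in train if x[feature] > threshold)
        left_label = left.most_common(1)[0][0]
        right_label = right.most_common(1)[0][0]
        errors = len(train) - left[left_label] - right[right_label]
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x[feature] <= threshold else right_label

def evaluate(model, test):
    """Step 2: assign labels to test objects and return the fraction correct."""
    correct = sum(1 for x, label in test if model(x) == label)
    return correct / len(test)

# Hypothetical samples: [petal length, petal width] with a species label.
train = [([1.4, 0.2], "Iris-setosa"), ([1.3, 0.2], "Iris-setosa"),
         ([4.7, 1.4], "Iris-versicolor"), ([4.5, 1.5], "Iris-versicolor")]
test = [([1.5, 0.2], "Iris-setosa"), ([4.9, 1.5], "Iris-versicolor")]

model = train_stump(train, feature=0)
test_accuracy = evaluate(model, test)
```

Keeping the training and test samples in separate lists mirrors the requirement that the model is built on labeled data and then judged on objects it has not seen.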

A. Iris Dataset Preprocessing
We make use of Fisher's Iris dataset, containing 5 attributes and 150 instances, to perform a comparative study of the data mining classification algorithm J48 (C4.5) and an integration of the SimpleKMeans clustering algorithm and the J48 classification algorithm. Prior to indexing and classification, a preprocessing step was performed. Fisher's Iris database is available on the UCI Machine Learning Repository website http://archive.ics.uci.edu:80/ml/datasets.html in Excel format, i.e. as an .xls file. In order to perform the experiment using WEKA [20], the file format of the Iris database was changed to an .arff or .csv file.
The complete description of the attribute values is presented in Table 1. A sample training data set is also given in Table 2. During clustering we add an attribute, "cluster", to the data set and use a filtered clusterer with the SimpleKMeans algorithm, which ignores attributes 5 and 6 during clustering and adds the resulting cluster to which each instance belongs, along with the classes, to the dataset.
2) KMeans clusterer:
Simple KMeans is one of the simplest clustering algorithms [4]. The KMeans algorithm is a classical clustering method that groups large datasets into clusters [15] [16]. The procedure follows a simple way to classify a given data set through a certain number of clusters. It selects k points as initial centroids and finds k clusters by assigning data instances to the nearest centroids. The distance measure used to find the nearest centroid is the Euclidean distance.
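As a rough illustration of the format change, the following Python sketch converts CSV text with a header row into ARFF text. The helper name `csv_to_arff`, the attribute names in the header and the three sample rows are assumptions made for this example; the paper does not state how the conversion was actually carried out.

```python
import csv
import io

def csv_to_arff(csv_text, relation, nominal_columns=()):
    """Convert CSV text with a header row into ARFF text.

    Columns named in nominal_columns become nominal attributes whose
    values are collected from the data; all others are declared numeric.
    """
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    header, data = rows[0], rows[1:]
    lines = [f"@relation {relation}", ""]
    for i, name in enumerate(header):
        if name in nominal_columns:
            values = sorted({row[i] for row in data})
            lines.append(f"@attribute {name} {{{','.join(values)}}}")
        else:
            lines.append(f"@attribute {name} numeric")
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"

# Illustrative three-row excerpt; the column names are assumed.
sample_csv = """sepallength,sepalwidth,petallength,petalwidth,class
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""
arff_text = csv_to_arff(sample_csv, "iris", nominal_columns=("class",))
```

The class column is declared nominal so that a classifier can treat the species as a label, while the four measurements stay numeric.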
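The clustering step can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not WEKA's SimpleKMeans: the first k points are taken as initial centroids (an assumption made here so the run is deterministic), the names `k_means` and `augmented` are hypothetical, and the six rows are illustrative values only. The class label is left out of the distance computation and the resulting cluster is appended to each row as a new attribute.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(points, k, max_iter=100):
    """Plain k-means; the first k points serve as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the result is stable
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Illustrative rows: four measurements plus the class label.
rows = [
    ([5.1, 3.5, 1.4, 0.2], "Iris-setosa"),
    ([7.0, 3.2, 4.7, 1.4], "Iris-versicolor"),
    ([6.3, 3.3, 6.0, 2.5], "Iris-virginica"),
    ([4.9, 3.0, 1.4, 0.2], "Iris-setosa"),
    ([6.4, 3.2, 4.5, 1.5], "Iris-versicolor"),
    ([5.8, 2.7, 5.1, 1.9], "Iris-virginica"),
]
features = [f for f, _ in rows]  # the class label is not used for clustering
assignment, centroids = k_means(features, k=3)

# Append the cluster as a new attribute, keeping the class alongside it.
augmented = [f + [f"cluster{a + 1}", label] for (f, label), a in zip(rows, assignment)]
```

The augmented rows carry both the cluster and the class, which is the shape of dataset that the classifier receives in the integrated technique.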

3) Measures for performance evaluation:
To measure the performance, two concepts, sensitivity and specificity, are often used; these concepts are readily usable for the evaluation of any binary classifier. TP is true positive, FP is false positive, TN is true negative and FN is false negative. TPR is the true positive rate; it is equivalent to recall. Fig. 3 shows the confusion matrix of the three-class problem. If we evaluate a set of objects, we can count the outcomes and prepare a confusion matrix (also known as a contingency table), a three-by-three table (as the Iris dataset contains three classes) that shows the classifier's correct decisions on the major diagonal and the errors off this diagonal. The columns represent the predictions and the rows represent the actual class [3]. An edge is denoted as a true positive (TP) if it is a positive or negative link and is also predicted as a positive or negative link, respectively. False positives (FP) are all predicted positive or negative links that are not correctly predicted, i.e., either they are non-existent or they have another sign in the reference network. As true negatives (TN) we denote correctly predicted non-existent edges, and as false negatives (FN) we denote falsely predicted non-existent edges, i.e., an edge is predicted to be non-existent but it is a positive or a negative link in the reference network.
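To make the counting concrete, the sketch below derives TP, FP, FN and TN for one class of a three-by-three confusion matrix by treating that class as positive and the other two as negative, and then computes the standard precision, recall, false positive rate, F-measure and accuracy. The function names are hypothetical and the matrix values are invented for illustration; they are not results from this paper.

```python
def one_vs_rest_counts(matrix, k):
    """Return (TP, FP, FN, TN) for class k; rows are actual, columns predicted."""
    n = len(matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                        # actual k, predicted otherwise
    fp = sum(matrix[r][k] for r in range(n)) - tp   # predicted k, actually otherwise
    total = sum(sum(row) for row in matrix)
    tn = total - tp - fn - fp
    return tp, fp, fn, tn

def class_metrics(matrix, k):
    """Per-class precision, recall (TPR), FPR and F-measure."""
    tp, fp, fn, tn = one_vs_rest_counts(matrix, k)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {"precision": precision, "recall": recall,
            "fpr": fpr, "f_measure": f_measure}

def accuracy(matrix):
    """Percentage of instances on the major diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return 100.0 * correct / total

# Hypothetical confusion matrix (not taken from the paper's tables).
matrix = [
    [50, 0, 0],
    [0, 45, 5],
    [0, 5, 45],
]
metrics_class_1 = class_metrics(matrix, 1)
overall_accuracy = accuracy(matrix)
```

Reading the matrix row by row gives the actual classes, so the off-diagonal entries in a row are the false negatives for that class and the off-diagonal entries in its column are the false positives.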
b) Precision: In information retrieval, the positive predictive value is called precision. It is calculated as the number of correctly classified instances belonging to class X divided by the number of instances classified as belonging to class X; that is, it is the proportion of true positives out of all positive results. It can be defined as:

$$\text{Precision} = \frac{TP}{TP + FP}$$

The results of the experiment show that the integration of clustering and classification gives a promising result, with the highest accuracy rate and robustness among the classification and clustering algorithms (Table 3). The accuracy measured in the experiment, based on true positives, false positives, false negatives, and true negatives (as per Equation 4), together with the decision trees and decision tree rules, is shown in Table 3 and Figs. 4 and 5. In an ideal world we want the precision value to be 1. The precision value of the integration of classification and clustering is higher than that of simple classification with the J48 classifier (Tables 4, 5 and 6).
IRIS SETOSA CLASS AND CLUSTER 1
According to the experiments and result analysis presented in this paper, it is observed that the integration of classification and clustering classifies datasets with better accuracy than simple classification.

V. CONCLUSION AND FUTURE WORK
A comparative study of a data mining classification technique and an integration of clustering and classification techniques helps in classifying large data sets. The presented experiments show that the integration of clustering and classification gives more accurate results than simple classification when classifying data sets whose attributes and classes are given to us. It can also be useful in developing rules when the data set contains missing values. As clustering is an unsupervised learning technique, it builds the classes by forming a number of clusters to which the instances belong; by then applying a classification technique to these clusters, we get decision rules that are very useful in classifying unknown datasets. We can then assign class names to the clusters to which the instances belong. This integrated technique of clustering and classification gives promising classification results, with the highest accuracy rate and robustness. In future we will perform experiments with other binary classifiers and try to obtain results from the integration of the classification, clustering and association techniques of data mining.

Figure 1. Proposed Classification Model
J48 (C4.5): J48 is an implementation of C4.5 [17] that builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy. The training data is a set S = {s1, s2, ...} of already classified samples. Each sample si = (x1, x2, ...) is a vector where x1, x2, ... represent attributes or features of the sample. Decision trees are efficient to use and display good accuracy for large amounts of data. At each node of the tree, C4.5 chooses the one attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other.
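The entropy-based choice of a split can be illustrated with a short Python sketch. It computes the entropy of a label set, the information gain of a threshold split on one numeric attribute, and the gain ratio, which is the normalized criterion that C4.5 uses to rank candidate splits. The function names and the six sample values are assumptions made for this example, and the sketch omits everything else a full C4.5 implementation does (recursion, pruning, missing-value handling).

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(values, labels, threshold):
    """Entropy reduction from splitting on values <= threshold vs. > threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # a split that leaves one side empty tells us nothing
    weighted = (len(left) / len(labels)) * entropy(left) \
             + (len(right) / len(labels)) * entropy(right)
    return entropy(labels) - weighted

def gain_ratio(values, labels, threshold):
    """Information gain divided by the entropy of the split sizes."""
    sides = ["left" if v <= threshold else "right" for v in values]
    split_info = entropy(sides)
    if split_info == 0:
        return 0.0
    return information_gain(values, labels, threshold) / split_info

# Hypothetical petal lengths with their species labels.
petal_length = [1.4, 1.3, 4.7, 4.5, 6.0, 5.1]
species = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]

gain = information_gain(petal_length, species, threshold=2.0)
ratio = gain_ratio(petal_length, species, threshold=2.0)
```

In this toy example the threshold isolates one species completely, so the left subset has zero entropy and the split is "enriched in one class" in the sense described above.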

Figure 3. Confusion Matrix
Accuracy: It is simply the ratio ((no. of correctly classified instances) / (total no. of instances)) * 100. Technically it can be defined as:

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \times 100$$

False Positive Rate: It is simply the ratio of false positives to false positives plus true negatives. In an ideal world we want the FPR to be zero. It can be defined as:

$$FPR = \frac{FP}{FP + TN}$$

e) F-Measure: F-measure is a way of combining recall and precision scores into a single measure of performance. The formula for it is:

$$F = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

IV. RESULTS AND PERFORMANCE EVALUATION
In this experiment we present a comparative study of the classification technique of data mining with an integration of the clustering and classification techniques of data mining on various parameters, using Fisher's Iris dataset containing 150 instances and 5 attributes. During simple classification, the training dataset is given as input to the WEKA tool and the classification algorithm C4.5 (implemented in WEKA as J48) is applied. During the integration of clustering and classification, the SimpleKMeans clustering algorithm is first applied to the training data set with the class attribute removed, since clustering is unsupervised learning, and then the J48 classification algorithm is applied to the resulting dataset.

Figure 4. Decision tree and Rules during classification of Iris data

Figure 7. Decision tree and rules of the integration of clustering and classification technique

TABLE III. PERFORMANCE EVALUATION
In an ideal world we want the FPR to be zero. Considering the results presented in Tables 4, 5 and 6, the FPR is lowest for the integration of clustering and classification; in other words, it is closest to zero when compared with simple classification using the J48 classifier.