Weighted Clustering for Deep Learning Approach in Heart Disease Diagnosis

This paper presents an approach to heart disease diagnosis based on weighted clustering. Existing diagnosis approaches derive a decision by correlating the feature vector of a query sample with the knowledge available to the system. As the learning data grows, the search overhead increases, which delays decision making. Clustering the large database improves on this linear mapping. However, clustering accuracy is observed to degrade as the training information grows and the characteristics of the learning features vary. To address the issue of accurate clustering, a weighted clustering approach based on a gain factor is proposed. The approach updates the cluster information by monitoring two factors, the distance and the gain parameter. The presented approach shows an improvement in mining performance in terms of accuracy, sensitivity and recall rate.

Keywords: Learning approach; weighted clustering; heart disease diagnosis; gain factor


I. INTRODUCTION
Heart disease has increased rapidly in the recent past due to irregular lifestyles and highly variable environmental conditions. Automating heart diagnosis for early warning is hence a primary need in today's living. Recent advances in data mining have opened new possibilities for early diagnosis and warning of heart disorders based on the analysis of vital parameters. Existing systems use past learning parameters from a large database to make decisions on monitored values. However, the rapid growth of data sets and the availability of new data exchange and storage facilities constrain the mining performance because of the large search overhead. Hence, an enhancement to the existing data mining approach is required for decision making in heart diagnosis. Various methods have been presented in the past for automating heart diagnosis and early warning. In the context of learning approaches for heart disease, a new classification model based on feature fusion is presented in [1]. The features of vital parameters in heart diagnosis are fused with medically monitored parameters to improve accuracy. In [2] a semantic co-ranking process for heart disease diagnosis is presented. This approach is, however, limited by the diversity of database entries. In [3] a fast-convergence learning approach based on a distance parameter is presented. In [4] a decision system based on the Bregman divergence condition in training and testing for heart disease diagnosis is presented. The distance parameters in the feature vectors are used to develop a decision.
In [5,6] a matrix learning approach for co-ranking of database features is presented. The approach outlines a random walk method for deriving the best possible decision for given test parameters. A modified divergence approach is proposed in [7]. This approach develops a hypergraph model for classification by learning a feature vector matrix. The diversity of feature vectors in the database, however, limits this classification approach. In [8] a relevance coding for heart disease diagnosis is presented. This approach develops a classification model based on the distribution of feature vectors in a wider domain. In [9, 10] a decision system based on personalized information in cardiovascular disorders is presented. This approach addresses the variation of the learning features in decision making. In [11] a diagnosis approach for heart disease using semantic features is proposed. It defines a new relevance coding based on feature correlation and a re-ranking operation over a large dataset for heart disease diagnosis. A fusion approach for processed features based on the variation of the feature vector is outlined in [12,13]. The approach defines a correlative function model based on the magnitude of the given vitals. The issue of scaling and varying diversity has not been addressed in this approach. In [14] an approach of random update and mapping of the feature vector in heart disease diagnosis is presented. The approach limits misclassification through a robust online diagnosis of the feature vector. In [15] a naive Bayes method for heart disease diagnosis is presented. It develops a classification approach based on hidden knowledge in the continuous data of the database. In [16] a content-based representation of vital parameters for heart disease diagnosis is presented. This approach monitors the variation among the feature parameters to develop a rank value for decision making.
Increasing the number of database features is observed to improve learning performance; however, the constraints of search overhead and misclassification under semantic features limit the current mining approaches in early diagnosis of heart disease. To improve performance in terms of search overhead, delay and accuracy, a new weighted clustering approach based on a cluster gain parameter is proposed. In [19] a new MAE approach is developed to estimate heart disease using machine learning algorithms. Different machine learning approaches for health system and risk prediction are outlined in [20]. Existing machine learning approaches are based on the feature values used for training, and the training overhead of the classification system depends on the number of learning features used. More detailed feature values result in higher accuracy, but at the cost of processing overhead. To minimize the feature overhead, a recent approach [18] develops a feature fusion model based on data and feature values. The fusion model is built on the magnitude of each observed parameter. It minimizes the processing overhead by reducing the number of training features; however, it is developed on discrete monitored values of the vitals. While the fusion model minimizes the processing overhead, fusing features by magnitude is vulnerable to external distortion: a deviation in magnitude due to a spike or jitter results in misclassification. The effect of distortion therefore needs to be minimized to improve estimation accuracy. With the objective of improving fusion accuracy and noise suppression, a new time-variant feature value with a medical significance characteristic is proposed.
The proposed approach presents a new means of feature fusion that monitors feature variation over a time line and observes the classification improvement due to the fusion. The rest of this paper is organized in six sections. The existing feature clustering and heart diagnosis approaches are presented in Section II. Section III outlines the proposed heart diagnosis approach using weighted clustering. Section IV presents the results obtained for the developed system. A discussion of the observations is presented in Section V, and Section VI concludes the paper.

II. HEART DISEASE DIAGNOSIS
In the early detection of heart disease, the learning features of the database play a vital role. In this work the Cleveland dataset [17] is used for learning and analysis of the developed approaches.

A. Dataset Description
As observed, multiple parameters are monitored in the diagnosis of heart disease. In [17] multiple physiological parameters are presented for early diagnosis of heart disease. The Cleveland dataset is used for the implementation of the proposed work, as most researchers use it as the benchmark dataset. The dataset consists of 76 attributes, of which the majority of computational techniques use only 14. The 14 attributes considered, along with their details, are as follows:
1) Age: A slowly varying parameter specifying the patient's age.
2) Sex: A non-varying parameter specifying a male or female patient.
12) Ca: Indicates the number of major vessels identified by fluoroscopy.
The Cleveland heart disease dataset has five class labels, indicating either a healthy subject or one of four sick types.
In the existing approach the parameters are fused to reduce the processing overhead, with fusion performed at the data level and the feature level of the observation. Because the fusion relies on the numerical and normalized values of the parameters, it is magnitude based. This restricts the observation to a discrete point in time, which carries a probability of error in real-time diagnosis.
In this work a new feature fusion approach is proposed, based on the time line characteristic of the measured parameter together with the medical significance of the feature. The approach fuses the features according to their medical significance and the time line variation of the monitored parameter, which results in a more accurate decision in less processing time than the existing approach.
A single feature taken in isolation cannot determine an individual's risk of heart disease; many features are required for diagnosis. Among the observed parameters, features such as cholesterol, heart rate, hypertension (blood pressure), resting ECG, diabetes, blood sugar, stress, exercise induced angina and old age are significant in predicting heart disease. Of these factors, eight are medically significant in the Cleveland heart disease dataset [18]: age, chest pain type, resting blood pressure, fasting blood sugar, cholesterol, maximum heart rate, resting heart rate and exercise induced angina. Neglecting medically significant features risks an incorrect diagnosis. The Cleveland heart disease data contains both continuous and discrete types of data. In the classification stage a deep learning approach is used to cluster the observed parameters and perform classification on the modified feature set. For this purpose, the testing accuracy of medical feature fusion with weighted clustering is compared with the existing approach. Accuracy, sensitivity, specificity, recall, precision and F-measure are used to evaluate the classification performance.

B. Diagnosis Process
The heart disease diagnosis process is built on the training features and the classification method applied. In the presented diagnosis system a set of features from the Cleveland dataset is used for diagnosis. The decision-making process is based on the normalized distance metric outlined in [1]. For a set of feature vectors $F_v$ given as,

$$F_v = \{F_1, F_2, \ldots, F_N\} \tag{1}$$

the classification is obtained using a distance metric. The decision $D$ is derived by minimizing the distance vector,

$$D = \arg\min_{j} \; d(F_t, F_j) \tag{2}$$

where $d(\cdot,\cdot)$ defines the Euclidean distance factor, given as,

$$d(F_t, F_j) = \sqrt{\sum_{i=1}^{n} \left(F_{t,i} - F_{j,i}\right)^2} \tag{3}$$

where $F_t$ is the query sample feature and $F_j$ is the trained database feature.
A normalized distance factor for the decision, using the Max-Min criterion, is given as,

$$d_{norm} = \frac{dist - d_{min}}{d_{max} - d_{min}} \tag{4}$$

Here, "dist" represents the Euclidean distance between the features of the test sample and the trained database features. The distance is computed on the testing feature ($F_t$), the selected feature set ($F_s$) and the database feature ($F$). $d_{max}$ and $d_{min}$ indicate the maximum and minimum Euclidean distance of the testing signal from the dataset features. The decision is developed as a normalized ratio of the maximum and minimum distance parameters of the testing and training feature vectors.
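As an illustration of the distance-based decision described above, the following is a minimal Python sketch. It assumes the Max-Min normalization takes the common form (dist - d_min) / (d_max - d_min); the function names, the three-value feature vectors and the labels are hypothetical and are not taken from the Cleveland dataset.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify_min_max(query, database):
    """Return (label, normalized_distances) for a query feature vector.

    database is a list of (feature_vector, label) pairs. Distances are
    rescaled to [0, 1] with the assumed Max-Min form before the minimum
    is selected.
    """
    dists = [euclidean(query, vec) for vec, _ in database]
    d_min, d_max = min(dists), max(dists)
    span = d_max - d_min
    # Guard against a zero span (all entries equally distant).
    norm = [(d - d_min) / span if span > 0 else 0.0 for d in dists]
    best = min(range(len(norm)), key=norm.__getitem__)
    return database[best][1], norm

# Hypothetical trained features: (age, resting blood pressure, cholesterol).
database = [
    ([60.0, 150.0, 240.0], "sick"),
    ([38.0, 128.0, 245.0], "healthy"),
    ([42.0, 132.0, 205.0], "healthy"),
]
label, norm = classify_min_max([40.0, 130.0, 210.0], database)
print(label, [round(v, 3) for v in norm])
```

For this illustrative query the third database entry is the closest, so its label is returned and its normalized distance is 0, while the farthest entry has a normalized distance of 1. Every query is still compared against every database entry, which is the linear search overhead that clustering aims to reduce.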

III. WEIGHTED CLUSTERING WITH TIME LINE MONITORING
Distance-based classification is limited by arbitrary variation in the magnitude of the feature vector and by a large search overhead caused by the linear search over the database. To overcome these issues, a weighted clustering approach based on a gain parameter is presented. To represent the database information in the classification process, the raw database is mapped to a normalized feature set, which normalizes the effect of magnitude variation. The process of feature monitoring is presented in Fig. 1. The updated normalized feature is derived as the difference between the feature vector and the mean.
For a set of randomly distributed feature vectors, the normalization is given by,

$$\hat{F}_i = F_i - \bar{F} \tag{5}$$

Here, $F_i$ is the feature vector in the database and $\bar{F}$ defines the mean of the dataset, given by,

$$\bar{F} = \frac{1}{N} \sum_{i=1}^{N} F_i \tag{6}$$

In the retrieval process the updated features are correlated with the set of test sample features, and the decision follows the minimum distance criterion. The search process is illustrated in Fig. 2.
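The mean-difference normalization can be sketched in a few lines of Python. This is a minimal sketch that assumes the normalized feature is the database feature vector minus the component-wise mean of the dataset; the function names and the sample values are hypothetical.

```python
def mean_vector(features):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]

def normalize(features):
    """Subtract the dataset mean from every feature vector."""
    mean = mean_vector(features)
    return [[x - m for x, m in zip(row, mean)] for row in features]

# Two hypothetical feature vectors with very different magnitudes per column.
features = [[1.0, 10.0], [3.0, 30.0]]
mean = mean_vector(features)
normalized = normalize(features)
print(mean, normalized)
```

After normalization each column is centered on zero, so a constant offset in a raw measurement no longer shifts the distance computation.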
The distance vector is computed as a correlation metric $C$, given by,

$$C_i = \sqrt{\sum_{j} \left(F_{t_i,j} - F_{s,j}\right)^2} \tag{7}$$

where the classification is based on the correlation of the feature vector ($F_{t_i}$) and the selected feature cluster ($F_s$) for the $i$-th test feature vector. However, the search overhead is considerably high for a large database. To minimize this effect a clustering approach is proposed. The presented approach clusters the available dataset into k clusters based on the distance metric and the gain parameter. The cluster gain (CGn) attained by updating a cluster is defined in terms of the redundancy R of a feature (8). The redundancy of a feature in a class is given by the redundancy factor R (9), which is computed from the probability of the $i$-th item of information in a cluster.
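The reduction in search overhead from clustering can be illustrated with a two-stage search: the query is first compared with one representative per cluster, and only the members of the closest cluster are then searched. The sketch below is a simplification that uses distance alone and cluster centroids as representatives (both assumptions); it does not include the gain parameter, and all names and values are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(vectors):
    """Component-wise mean of a list of feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cluster_search(query, clusters):
    """Return (label, comparisons) for a two-stage nearest-neighbour search.

    clusters maps a cluster name to a list of (feature_vector, label) pairs.
    """
    comparisons = 0
    # Stage 1: pick the cluster whose centroid is closest to the query.
    # Centroids are recomputed here for brevity; they could be cached.
    best_cluster, best_dist = None, math.inf
    for name, members in clusters.items():
        dist = euclidean(query, centroid([vec for vec, _ in members]))
        comparisons += 1
        if dist < best_dist:
            best_cluster, best_dist = name, dist
    # Stage 2: nearest member inside the chosen cluster only.
    best_label, best_dist = None, math.inf
    for vec, label in clusters[best_cluster]:
        dist = euclidean(query, vec)
        comparisons += 1
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label, comparisons

clusters = {
    "c0": [([0.0, 0.0], "healthy"), ([1.0, 0.0], "healthy"), ([0.0, 1.0], "healthy")],
    "c1": [([10.0, 10.0], "sick"), ([11.0, 10.0], "sick"), ([10.0, 11.0], "sick")],
}
label, comparisons = cluster_search([9.5, 10.2], clusters)
total = sum(len(members) for members in clusters.values())
print(label, comparisons, total)
```

With k clusters of roughly equal size over N entries, the query needs about k + N/k distance computations instead of N, which is where the saving over a linear search comes from on a large database.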
Magnitude variation has a direct impact on the clustering performance; hence a weighted update of the feature vector is used (10). The weight parameter (11) tunes the computed gain for information under magnitude variation, and is itself updated by the distance parameter.
The convergence of the update is given as a maximization of the cluster gain,

$$\max_{k} \left( CGn_k \right) \tag{12}$$

The cluster values are updated based on the convergence of the feature value monitored over a period of time. For each time period the aggregated gain is computed, and the cluster is updated based on the maximum cluster gain of a class. Classification in the developed system is performed using a multi-class support vector machine (SVM). The decision is made for a set of training features passed to the SVM architecture.
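The idea of updating the cluster with the maximum gain can be sketched as follows. The forms used here are assumptions made only for illustration: the redundancy R is taken to be the Shannon entropy of the class labels in a cluster, and the gain of an update is taken to be the reduction in that redundancy when a candidate sample is added. The distance-dependent weight is omitted, and all names and data are hypothetical.

```python
import math
from collections import Counter

def redundancy(labels):
    """Assumed redundancy factor: Shannon entropy of the labels in a cluster."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def cluster_gain(labels, candidate_label):
    """Assumed gain: drop in redundancy when the candidate joins the cluster."""
    return redundancy(labels) - redundancy(labels + [candidate_label])

def best_cluster(clusters, candidate_label):
    """Return the cluster name with maximum gain, plus all computed gains."""
    gains = {name: cluster_gain(labels, candidate_label)
             for name, labels in clusters.items()}
    return max(gains, key=gains.get), gains

# Hypothetical clusters described only by the class labels of their members.
clusters = {
    "a": ["sick", "sick", "sick"],
    "b": ["healthy", "healthy", "sick"],
}
best, gains = best_cluster(clusters, "healthy")
print(best, {name: round(g, 3) for name, g in gains.items()})
```

In this toy case adding a healthy sample makes the pure cluster "a" more mixed (negative gain) and cluster "b" more homogeneous (positive gain), so the update goes to "b". Aggregating such gains over a monitoring period before committing the update would follow the time line monitoring described above.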

IV. RESULT OBSERVATION
The developed approach is tested on the Cleveland dataset, where one third of the database is used for training and the remainder is used for testing. The approach is implemented in MATLAB and validated for retrieval accuracy (Acc), sensitivity, specificity, recall rate, F-measure and computation time. With TP, TN, FP and FN denoting the numbers of true positives, true negatives, false positives and false negatives, the accuracy is computed as,

$$Acc = \frac{TP + TN}{TP + TN + FP + FN} \tag{13}$$

The sensitivity is given as,

$$Sensitivity = \frac{TP}{TP + FN} \tag{14}$$

The specificity is given as,

$$Specificity = \frac{TN}{TN + FP} \tag{15}$$

The recall rate is given by,

$$Recall = \frac{TP}{TP + FN} \tag{16}$$

The precision is given as,

$$Precision = \frac{TP}{TP + FP} \tag{17}$$

and the F-measure is computed as,

$$F\text{-}Measure = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} \tag{18}$$

The parameters are computed from the observing factors listed in Table I. The confusion matrix for the heart diagnosis is illustrated in Table II. The developed system is evaluated for different test cases by measuring these parameters. The observations for the developed approach under the test cases of healthy, type-1, type-2, type-3 and type-4 attributes are listed in Table III.
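The evaluation parameters can be computed directly from the confusion counts. The sketch below assumes the standard confusion-matrix definitions, under which sensitivity and recall coincide; the function name and the counts are hypothetical and are not results from this work.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard binary metrics from confusion counts (assumed definitions)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)          # identical to recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "recall": sensitivity,
        "precision": precision,
        "f_measure": f_measure,
    }

# Hypothetical counts for one class treated as "positive".
m = classification_metrics(tp=40, tn=30, fp=10, fn=20)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

For a multi-class problem such as the five Cleveland classes, these counts would be taken per class in a one-versus-rest fashion from the confusion matrix.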
The developed weighted-cluster classification is compared with the existing approaches of random clustering and linear search. Observation plots for the healthy case samples are shown in Fig. 3-10.
Fig. 3 shows the accuracy obtained for the developed approach in comparison with the existing classification approaches of linear search and random clustering. The linear search method correlates the test feature vector with the database features in a linear manner and develops a decision based on the maximum correlation value. The average distance of the mapping is considered in making the decision. Random clustering is based on the manual selection of features for classification: the set of features is manually picked from the test sample and passed to the classifier model for a decision. The observed accuracy shows an improvement of 8% compared with random clustering and 18% compared with the linear search method.
The sensitivity of the developed method is shown in Fig. 4. The proposed method shows a sensitivity of 91%, whereas the existing approaches of linear search and random clustering show a sensitivity of 87% and 77%, respectively. The specificity of the developed method is shown in Fig. 5. The proposed method shows a specificity of 51%, whereas linear search and random clustering show a specificity of 46% and 44%, respectively. The recall rate is plotted in Fig. 6. The proposed method obtains a recall rate of 91%, whereas linear search and random clustering show a recall rate of 89% and 77%, respectively.
The precision of the developed method is shown in Fig. 7. The proposed method shows a precision of 93%, whereas the existing approaches of linear search and random clustering show a precision of 1% and 79%, respectively. An analysis of the developed approach for different test types is summarized in Table IV. The processing time, defined as the computation time for classification, is shown in Fig. 8. The proposed method takes a computation time of 0.26 sec in classification, whereas linear search and random clustering take 0.43 and 0.65 sec, respectively. The four sick types referred to in the Cleveland dataset are also compared for accuracy. The accuracy plot for the three classification approaches is shown in Fig. 9. The observations show an accuracy of 89% for type-1, 90% for type-2, 91% for type-3 and 92% for type-4 cases.
The sensitivity plot for the three classification approaches is shown in Fig. 10. The observations show a sensitivity of 93% for type-1, 94% for type-2, 95% for type-3 and 95% for type-4 cases.
The specificity for the three classification approaches is shown in Fig. 11. The observations show a specificity of 95% for type-1, 95% for type-2, 91% for type-3 and 92% for type-4 cases.
The recall for the three classification approaches is shown in Fig. 12. The observations show a recall of 98% for type-1, 91% for type-2, 93% for type-3 and 91% for type-4 cases.
The precision for the three classification approaches is shown in Fig. 13. The observations show a precision of 74% for type-1, 85% for type-2, 89% for type-3 and 94% for type-4 cases.
The processing time for the three classification approaches is shown in Fig. 14. The observations show a processing time of 0.27 sec for type-1, 0.20 sec for type-2, 0.31 sec for type-3 and 0.33 sec for type-4 cases.

V. DISCUSSION
Diagnosis of heart disease from measured vitals is a primary need for automated diagnosis and classification systems. In the presented work a weighted clustering approach for feature fusion is proposed. The analysis of the proposed approach for classification accuracy, sensitivity, specificity, recall, precision and computation time yields more efficient observations than the existing approaches of linear search and random clustering. The process of time line monitoring and denoising results in faster processing, with a reduction of 0.4 sec in processing time compared with the linear search method. The classification accuracy is observed to improve by 20% compared with the linear search method. Across the test types 1 to 4, the highest accuracy is observed for the type-4 case, and the test time is lower for the type-1 case. While classification by linear search has a high search time, the clustering-based approach is suitable for minimizing the processing time with higher accuracy. Random clustering selects features from the list of observations based on their magnitude values, so distortion impacts the retrieval accuracy. The proposed approach, however, shows higher classification accuracy with a lower processing time. Future work will focus on testing the proposed method on real-time vital parameters.

VI. CONCLUSION
This paper presented the diagnosis of heart disease using weighted clustering. An approach of weight update and clustering of large database features based on a cluster gain factor is proposed. The approach clusters the distributed feature set of the Cleveland dataset into specified clusters to improve the classification performance. The search overhead in mining features for classification is minimized by the formation of sub-clusters. The decision to update a feature into a cluster is made more accurate by the dual factor of class attribute and cluster gain. The observations for the developed system show an improvement in accuracy for the proposed system compared with the existing approaches, with a reduced computation time due to the cluster search approach.