An Adaptive Parameter-Free Data Mining Approach for Healthcare Application

In today's world, healthcare is a key factor affecting human life. Because of heavy workloads, many people cannot attend to their own health. The proposed system acts as a preventive measure: it determines whether a person is fit or unfit from the person's historical and real-time data by applying the clustering algorithms K-means and D-Stream. D-Stream, a density-based clustering algorithm, overcomes drawbacks of the K-means algorithm. Both clustering algorithms are applied to a historical database of patients' biomedical data, and their performance measures are calculated to compare the effectiveness and efficiency of the two algorithms. To check the correctness of both algorithms, we apply them to patients' current biomedical data.


I. INTRODUCTION
Today the health care industry is one of the largest industries in the world. It includes thousands of hospitals, clinics and other facilities that provide primary, secondary and tertiary levels of care. The delivery of health care services is the most visible part of any health care system, both to users and to the general public [2]. A health care provider is an institution or person that provides preventive, curative, promotional or rehabilitative health care services in a systematic way to individuals, families or communities. Physiological signals such as SpO2, ABPsys, ABPdias and HR affect a person's health. Data mining is increasingly popular and essential in healthcare applications. The healthcare industry holds large amounts of data, but this data has not been used properly. Data mining techniques can convert this health care data into useful knowledge [1].
Data mining is the process of extracting knowledge from large amounts of data held in databases or other data repositories. Its main purpose is to find hidden knowledge in the data. In the health care industry, the data contains unwanted entries, missing values and noise. Preprocessing, the process of removing noise, redundant data and irrelevant data, eliminates this unwanted data so that the remaining data can be put to use. In recent years, different approaches have been proposed to overcome the challenges of storing and processing fast, continuous streams of data.
A data stream can be conceived as a continuous and changing sequence of data that continuously arrives at a system to be stored or processed. Traditional OLAP and data mining methods typically require multiple scans of the data and are therefore infeasible for stream data applications. Because data streams are produced in many fields, it is crucial to adapt mining techniques to fit data streams. Data stream mining has many applications and is an active research area [3]. It is the extraction of knowledge structures, represented as models and patterns, from unbounded streams of information. Data stream mining can be used to form clusters of medical health data. This paper considers two clustering algorithms, namely the K-means algorithm and density-based clustering.
The K-means clustering algorithm cannot find clusters of arbitrary shapes and cannot handle outliers. Further, it requires knowledge of k and a user-specified time window. To address these issues, D-Stream provides a framework for clustering stream data using a density-based approach. The algorithm uses an online component, which maps each input data record into a grid, and an offline component, which computes the grid density and clusters the grids based on the density. The algorithm adopts a density decaying technique to capture the dynamic changes of a data stream. By exploiting the intricate relationships between the decay factor, data density and cluster structure, the algorithm can efficiently and effectively generate and adjust the clusters in real time. Further, a theoretically sound technique is developed to detect and remove sporadic grids mapped to by outliers, in order to dramatically improve the space and time efficiency of the system. The technique makes high-speed data stream clustering feasible without degrading the clustering quality. The experimental results in [4] show that the algorithm has superior quality and efficiency, can find clusters of arbitrary shapes, and can accurately recognize the evolving behaviors of real-time data streams [4].

II. RELATED WORK
Several health care projects are in full swing in different universities and institutions, with the objective of providing more assistance to the elderly. The application of data clustering techniques for fast retrieval of relevant information from medical databases lends itself to many different perspectives.
Health Gear [5] is a real-time wearable system for monitoring, visualizing and analyzing physiological signals. The work focuses on an implementation of Health Gear that uses a blood oximeter to monitor the user's blood oxygen level and pulse while sleeping. It also describes different algorithms for automatically detecting sleep apnea events, and illustrates the performance of the overall system in a sleep study with 20 participants.
A Guided Clustering Technique for Knowledge Discovery: A Case Study of Liver Disorder Dataset [6] presents an experiment based on a clustering data mining technique to discover hidden patterns in a dataset of liver disorder patients. The system uses the SOM network's internal parameters and the k-means algorithm to find patterns in the dataset. The research shows that meaningful results can be discovered from clustering techniques by letting a domain expert specify the input constraints to the algorithm.
Intelligent Mobile Health Monitoring System (IMHMS) [7] proposes a system that can provide medical feedback to patients through mobile devices, based on the biomedical and environmental data collected by deployed sensors. The system uses a Wearable Wireless Body/Personal Area Network to collect data from patients, mines the data, intelligently predicts the patient's health status and provides feedback to patients through their mobile devices.
Patients participate in the health care process through their mobile devices and can thus access their health information from anywhere at any time. However, the data mining framework for the decision support system has not actually been implemented.
Real-Time Analysis of Physiological Data to Support Medical Applications [8] proposes a flexible framework to perform real-time analysis of physiological data and to evaluate people's health conditions. Patient-specific or disease-specific models are built by means of data mining techniques. The models are exploited to perform real-time classification of physiological signals and to continuously assess a person's health conditions. The proposed framework allows both instantaneous evaluation and stream analysis over a sliding time window for physiological data. However, the dynamic behavior of the physiological signals is not analyzed, and the framework is not suitable for ECG-type signals.

Performance of Clustering Algorithms in Healthcare Database [1] proposes a framework that uses heart attack prediction data to evaluate the performance of clustering algorithms. The final result reports the performance of each algorithm in terms of prediction accuracy, and the visualization of cluster assignments shows the relation between the error and the attributes. The comparison shows that the make density based clusters algorithm has the highest prediction accuracy.

III. METHOD
We present a framework that performs clustering of a dataset available from a medical database. The flow of the system is depicted in Fig. 1. The target is to cluster the patients' records into different groups with respect to the test report attributes, which may help clinicians to diagnose the patient's disease in an efficient and effective manner. The evaluation steps are the following:

1) Data set collection
The data set contains 7 attributes: SpO2, ABPsys, ABPdias, HR, heredity, obesity and cigarette smoking. These attributes are risk factors that can help in predicting the patient's health status. Attributes such as SpO2, ABPsys, ABPdias and HR can be collected from the MIMIC database [9], and the other attributes are influenced by the person's behavior. All attribute values are discrete in nature. The dataset will be in preprocessed format.

2) Model Building
In the model building phase, features of the available data are extracted, and a clustering algorithm is then applied to the extracted features.

A. Feature Extraction
For each physiological signal x among the X monitored vital signs, we extract the following features [8].

1) Offset
The offset feature measures the difference between the current value x(t) and the moving average (i.e., the mean value over the time window). It aims at evaluating the difference between the current value and the average conditions in the recent past.

2) Slope
The slope function evaluates the rate of the signal change. Hence, it assesses short-term trends, where abrupt variations may affect the patient's health.

3) Dist
The dist feature measures the drift of the current signal measurement from a given normality range. It is zero when the measurement is inside the normality range.
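The three features above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation of [8]: the function name `extract_features`, the choice to include the current sample in the moving average, the one-step difference used as the slope, and the sample heart-rate values are all assumptions made for the example.

```python
def extract_features(window, normal_range):
    """Return (offset, slope, dist) for the newest sample in `window`.

    window       -- recent samples of one signal, oldest first (assumed layout)
    normal_range -- (low, high) bounds of the normality range (assumed input)
    """
    current = window[-1]

    # Offset: current value minus the mean over the time window.
    # Assumption: the window includes the current sample.
    moving_average = sum(window) / len(window)
    offset = current - moving_average

    # Slope: rate of change, approximated here by the difference
    # between the last two samples (one time step apart).
    slope = current - window[-2] if len(window) > 1 else 0.0

    # Dist: how far the current value lies outside the normality range,
    # zero when it is inside the range.
    low, high = normal_range
    if current < low:
        dist = low - current
    elif current > high:
        dist = current - high
    else:
        dist = 0.0

    return offset, slope, dist


# Hypothetical heart-rate samples (beats per minute) and normality range.
hr_window = [72, 74, 73, 75, 110]
print(extract_features(hr_window, (60, 100)))
```

In this example the last sample jumps well above the recent average and leaves the assumed normality range, so all three features are non-zero.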

B. Risk Components
The signal features contribute to the computation of the following risk components.

1) Sharp changes
The z1 component aims at measuring the health risk deriving from sharp changes in the signal (e.g., quick changes in the blood pressure may cause fainting).

2) Long-term trends
The z2 component measures the risk deriving from the weighted offset over the time window. While z1 focuses on quick changes, z2 evaluates long-term trends, as it is offset-based.

3) Distance from normal behavior
The z3 component assesses the risk level given by the distance of the signal from the normality range. A patient with an instantaneous measurement outside the range may not be critical, but her/his persistence in such conditions contributes to the risk level.
From the above risk components, risk functions and global risk components will be calculated. These values will then be used as input to the clustering algorithms for cluster formation.
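The text does not give closed-form expressions for z1, z2 and z3, so the following Python sketch only illustrates one plausible way to derive them from the offset, slope and dist features. Every formula in it is an assumption chosen for illustration (latest slope magnitude for z1, a linearly weighted mean of offsets for z2, the mean dist over the window for z3), as are the function name `risk_components` and the default weights.

```python
def risk_components(offsets, slopes, dists, weights=None):
    """Illustrative (z1, z2, z3) from per-sample features over one time window.

    offsets, slopes, dists -- feature values per sample, oldest first
    weights                -- hypothetical per-sample weights for z2
    """
    n = len(offsets)
    if weights is None:
        # Assumption: more recent samples weigh more (linear ramp).
        weights = [(i + 1) / n for i in range(n)]

    # z1 (sharp changes): magnitude of the most recent slope.
    z1 = abs(slopes[-1])

    # z2 (long-term trends): magnitude of the weighted mean offset.
    z2 = abs(sum(w * o for w, o in zip(weights, offsets)) / sum(weights))

    # z3 (distance from normal behavior): mean dist over the window,
    # so that persistence outside the range raises the value.
    z3 = sum(dists) / n

    return z1, z2, z3


# Hypothetical feature values for a three-sample window.
print(risk_components(offsets=[0, 0, 2], slopes=[0, 0, 4], dists=[0, 0, 3]))
```

Averaging dist over the window is one simple way to reflect the remark that a single out-of-range measurement matters less than staying out of range.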

C. Cluster formation
The proposed flow of the system uses two algorithms, K-means and D-Stream. The comparison between the two clustering algorithms will be performed using the attributes described above.

K-MEANS ALGORITHM
The algorithm is composed of the following steps [10]:
1) Place K points into the space represented by the objects that are being clustered. These points represent initial group centroids.
2) Assign each object to the group that has the closest centroid.
3) When all objects have been assigned, recalculate the positions of the K centroids.
4) Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.

D-STREAM ALGORITHM
For a data stream, at each time step, the online component of D-Stream continuously reads a new data record, places the multi-dimensional data into a corresponding discretized density grid in the multi-dimensional space, and updates the characteristic vector of the density grid (Lines 5-8 of Figure 3). The density grid and characteristic vector are described in detail later. The offline component dynamically adjusts the clusters every gap time steps, where gap is an integer parameter. After the first gap, the algorithm generates the initial clusters (Lines 9-11). Then, the algorithm periodically removes sporadic grids and regulates the clusters (Lines 12-15).

D-Stream partitions the multi-dimensional data space into many density grids and forms clusters of these grids. This concept is schematically illustrated in Figure 4. Following [4], each dimension i of the data space is divided into partitions $S_{i,1}, S_{i,2}, \ldots$, and a data record $x = (x_1, \ldots, x_d)$ is mapped to the density grid $g(x) = (j_1, \ldots, j_d)$, where $x_i \in S_{i,j_i}$.

For each data record x, we assign it a density coefficient which decreases as x ages. In fact, if x arrives at time $t_c$, we define its time stamp $T(x) = t_c$, and its density coefficient $D(x, t)$ at time t is

$$D(x, t) = \lambda^{t - T(x)} = \lambda^{t - t_c} \qquad (4)$$

where $\lambda \in (0, 1)$ is a constant called the decay factor.

Definition (Grid Density). For a grid g, at a given time t, let $E(g, t)$ be the set of data records that are mapped to g at or before time t. Its density $D(g, t)$ is defined as the sum of the density coefficients of all data records that are mapped to g. Namely, the density of g at t is:

$$D(g, t) = \sum_{x \in E(g, t)} D(x, t)$$

The procedure for initial clustering begins as follows:
1) procedure initial clustering (grid list)
2) update the density of all grids in grid list;
3) assign each dense grid to a distinct cluster;
4) label all other grids as NO CLASS;

The calculated values of the z1, z2 and z3 components will be applied as input to both clustering algorithms to form the clusters based on their risk level.
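The density decay used by D-Stream can be illustrated with a short Python sketch. It assumes the exponential density coefficient D(x, t) = λ^(t − T(x)) from the D-Stream formulation in [4]; the function names and the sample arrival times are hypothetical. The incremental update shown last is a direct consequence of the definition: when a new record arrives, the stored density is decayed for the elapsed time and 1 is added for the new record, so the grid never needs to keep its individual records.

```python
def density_coefficient(decay, t, arrival_time):
    """Density coefficient of one record at time t (decay factor in (0, 1))."""
    return decay ** (t - arrival_time)


def grid_density(decay, t, arrival_times):
    """Grid density at time t: sum of coefficients of all records in the grid."""
    return sum(density_coefficient(decay, t, tc) for tc in arrival_times)


def update_on_arrival(decay, previous_density, t_last, t_now):
    """Incremental update when a new record reaches the grid at t_now.

    previous_density is the density stored at the last update time t_last.
    """
    return decay ** (t_now - t_last) * previous_density + 1.0


# Hypothetical grid that receives records at time steps 0, 2 and 3.
decay = 0.5
arrivals = [0, 2, 3]

direct = grid_density(decay, 3, arrivals)

incremental = 1.0   # density right after the first record (t = 0)
last = 0
for tc in arrivals[1:]:
    incremental = update_on_arrival(decay, incremental, last, tc)
    last = tc

print(direct, incremental)
```

Both computations give the same value, which is why the online component only has to store the last update time and the current density for each grid.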

3) Patient's Health status
Using a clustering algorithm, we form clusters for the attributes stated above. Then, for the patient's current input, we predict the patient's health status, i.e., whether the patient is fit or unfit.
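As a minimal sketch of this step, the Python code below runs the basic K-means procedure on a handful of risk vectors (z1, z2, z3) and then assigns a new patient record to the nearest centroid. The data values are invented for illustration, the initial centroids are simply the first K points so that the run is repeatable, and labeling the lower-risk cluster as "fit" is an assumption, not a rule stated in the paper. `math.dist` requires Python 3.8 or later.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, k, max_iter=100):
    """Basic K-means: assign points, recompute centroids, repeat until stable."""
    # Assumption: use the first k points as initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        updated = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        if updated == centroids:   # centroids no longer move
            break
        centroids = updated
    return centroids


# Hypothetical historical risk vectors (z1, z2, z3).
history = [
    (0.1, 0.2, 0.0), (4.0, 3.0, 5.0),
    (0.2, 0.1, 0.0), (5.0, 4.0, 6.0),
    (0.0, 0.1, 0.1), (4.5, 3.5, 5.5),
]
centroids = kmeans(history, k=2)

# Assumption: the cluster whose centroid has the lowest total risk is "fit".
fit_index = min(range(len(centroids)), key=lambda i: sum(centroids[i]))


def predict_status(risk_vector):
    """Label a current record by the cluster of its nearest centroid."""
    return "fit" if nearest(risk_vector, centroids) == fit_index else "unfit"


print(predict_status((0.1, 0.1, 0.0)))
print(predict_status((4.8, 3.9, 5.7)))
```

A D-Stream-based variant would replace `kmeans` with grid-density clustering, but the final step of mapping the current record to a cluster would follow the same pattern.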

IV. EXPERIMENTAL RESULT
The algorithms described above are used to form clusters on a medical database. The data is collected from the Switzerland data set, which contains 107 instances and 14 attributes. The attributes include age, sex, blood pressure, cholesterol and chest pain. The performance of the algorithms is computed from the number of correctly predicted instances [1]:
Performance Accuracy = Correctly Predicted Instances / Total Number of Instances
From Table I we observe that the performance of the density-based algorithm is better than that of simple K-means: the accuracy of the D-Stream algorithm is higher than that of K-means.
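The accuracy measure can be computed as in the short Python sketch below. The function name `accuracy` and the label lists are hypothetical and are used only to show the arithmetic; they are not results from the experiment.

```python
def accuracy(predicted, actual):
    """Fraction of instances whose predicted label equals the actual label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one instance is required")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical labels for four instances: three of four match.
predicted = ["fit", "unfit", "fit", "fit"]
actual = ["fit", "unfit", "unfit", "fit"]
print(accuracy(predicted, actual))
```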

V. FUTURE SCOPE AND CONCLUSION
K-means is unable to handle arbitrary cluster formation because the number of classes to be formed is not known in advance. The D-Stream algorithm has superior quality and efficiency, can find clusters of arbitrary shapes, and can accurately recognize the evolving behaviors of real-time data streams. Therefore, D-Stream is expected to perform better in biomedical applications. This system can be further developed for real-time analysis of biomedical data to predict a patient's current health status.
The proposed system can be used for monitoring elderly people and Intensive Care Unit (ICU) patients. Because the system gives the health status of a patient, it can also be used by clinicians to keep patient records.
The proposed system is adaptive since it can handle more than one physiological signal. It uses historical biomedical data, which is very useful for predicting the current health status of a patient with clustering algorithms such as K-means and D-Stream. Prediction of health status is a very sensitive job, and D-Stream performs better here because it supports arbitrary cluster formation, which K-means does not. D-Stream is also particularly suitable for users with little domain knowledge of the application data, since it does not require the value of K. In this sense D-Stream is parameter free, and it proves to give more accurate results than K-means when used for cluster formation on historical biomedical data.

Figure 1. Flow of the system

Figure 3: The overall process of D-Stream. D-Stream assumes a discrete time step model, where the time stamp is labeled by integers 0, 1, 2, ..., n, .... D-Stream has an online component and an offline component. The overall algorithm is outlined in this figure.

Figure 4. Illustration of the use of density grid. The input data has d dimensions, and each input data record is defined within the space

Figure 3: The procedure for initial clustering.

Figure 4: The procedure for dynamically adjusting clusters.

TABLE I. PERFORMANCE OF CLUSTERING ALGORITHM