Estimation of Water Quality Parameters Using the Regression Model with Fuzzy K-means Clustering

Abstract: Traditional remote sensing methods for monitoring and estimating pollutants generally rely on the spectral response or scattering reflected from water. In this work, a new method is proposed to detect contaminants and determine Water Quality Parameters (WQPs) based on texture analysis. Empirical statistical models have been developed to estimate and classify contaminants in the water. The Gray Level Co-occurrence Matrix (GLCM) is used to compute six texture parameters: contrast, correlation, energy, homogeneity, entropy and variance. These parameters are used to estimate regression models for three WQPs. Finally, fuzzy K-means clustering is used to generalize the water quality estimation to the entire segmented image. Using in situ measurements and IKONOS data, the results show that texture parameters and high-resolution remote sensing are able to monitor and predict the distribution of WQPs in large rivers.


INTRODUCTION
The use of remote sensing techniques to monitor, manage and predict water quality parameters (WQPs) and contaminants has become important in recent years [1]. This is a result of the growing population and of increasing industry, agriculture, and urbanization [2]. These techniques have helped to find and estimate pollutants in water at lower cost and with greater potential [3]. It is known that contaminated water directly or indirectly affects people's health, especially when it is a source, or the only source, of drinking water. Water quality monitoring therefore helps to assess the quality of water bodies and identify contaminated areas [4]. In situ water quality measurement requires sampling and laboratory analysis, which are expensive and time consuming.
For this reason, remote sensing techniques can overcome these limitations by providing an alternative means of water quality monitoring over a larger range of temporal and spatial scales [5]. Monitoring of water quality using remote sensing began in the early 1970s, based on measuring the spectral and thermal response of radiation emitted from water surfaces. In general, empirical relationships between spectral properties and WQPs have been established since 1974, and an empirical approach has been developed to estimate them [6]. The general forms of these empirical equations are:

$$Y = A + BX \quad \text{or} \quad Y = AB^{X} \qquad (1)$$

where Y is the remote sensing measurement vector (radiance, reflectance, energy …), X is the water quality parameter vector of interest (suspended sediment, chlorophyll …), and A and B are empirically derived factors [6].
Traditional methods for monitoring and estimating pollutants or WQPs [7] from optical satellite data have relied on the spectral response [8][9], on thermal emission or scattering reflected from water [10][11], or on the fusion of spectral and microwave techniques [12]. Accordingly, many references recommend using certain bands to estimate a number of variables in water. The resulting classification thus reflects how separable the classes are, regardless of the classification accuracy. In this study, we therefore propose a different method that estimates WQPs using regression models on texture parameters. The proposed method uses the GLCM to compute six texture parameters: contrast, correlation, energy, homogeneity, entropy and variance. The texture parameters are extracted at the ground-truth locations. Multiple regression models are used to generate predictive models between the texture parameters and the WQPs. The predictive model with the best correlation is then used in the classification and identification of WQPs. Our work aims to study and estimate three WQPs: pH (a measure of the acidity or basicity of an aqueous solution), phosphate (PO4), and nitrate (NO3), using an empirical equation expressed as a function of the extracted texture parameters. The empirical model with the highest accuracy is retained. Finally, to estimate the WQPs over the entire segmented water region in the image, we apply the Fuzzy K-means classifier (FKM).
In the next sections, we present the study area and the data used in this work. Section II gives a brief overview of the study area. Section III describes the methods used to achieve the aims of this work. The results and discussion follow in Section IV.

A. Study Area
The Tigris River is the eastern member of the two great rivers that define Mesopotamia, the other being the Euphrates. The river flows south from the mountains of southeastern Turkey through Iraq. The Tigris is 1850 km long and rises in the Taurus Mountains of eastern Turkey. The total length of the river in Iraq is 1418 km [13]. It is considered a main source of water for human use, especially drinking water [14]. The study area is the stretch of the Tigris within Baghdad city (the capital of Iraq), which extends 49 km from the Al-Muthana Bridge in north Baghdad to the confluence with the Diyala River in south Baghdad [15]. Fig. 1 shows a map of the study area.

B. In Situ Data
In situ measurements were collected from eight stations, which represent the main stations of Baghdad city and are distributed along the Tigris River. The samples were collected from these stations in October 2012 and analysed in the laboratories of the Ministry of Environment in Baghdad to extract the water quality parameters, including pH, PO4, NO3 and other related variables. All parameters were measured according to the standard specifications of the American Public Health Association [16].

C. IKONOS Data
Many types of satellite have the capability and potential to estimate WQPs. A higher-resolution satellite is better in most cases, but the signal-to-noise requirements of sensor technology limit the combined spectral, spatial and temporal resolutions; for this reason, no sensor can have high spectral, high spatial and high temporal resolution at once. This means that if the pixel size of a sensor is small (high spatial resolution), the spectral bandwidth has to be large (low spectral resolution) to capture sufficient light energy for an acceptable signal-to-noise ratio. There is a trade-off between spectral, spatial and temporal resolution, and the best combination depends on the intended use of the sensor [17]. The IKONOS satellite was launched on September 24th, 1999 to provide global, accurate, high-resolution imagery with a resolution of up to 1 m [18]. In this study, one scene of IKONOS data was acquired on October 16th, 2012. The image was georeferenced to UTM, WGS84 and radiometrically corrected to minimize atmospheric effects. Fig. 2 shows the IKONOS image used as input data in this work.

D. The Delineation and Extraction of River Water Image
The delineation and extraction of water bodies from remote sensing images is an important task, useful for applications such as GIS database updating, flood prediction, and the evaluation of water resources [19]. Several techniques for extracting linear features from remotely sensed data have been introduced for high spatial resolution imagery [20]. The methods used to extract water areas from satellite images fall into three principal families: feature extraction methods; supervised and unsupervised classification methods; and feature-based classifiers and data fusion. The authors in [21] provided a comprehensive overview of methods for water extraction (water segmentation) from high-resolution satellite images. Fu June in [22] developed an automatic extraction of water bodies from TM images using a decision tree algorithm that exploits the difference in spectral response between water and land. Beyond this difference in spectral response, the extraction and segmentation of the water region in satellite images is necessary for several reasons: it removes the effect of the terrestrial area by separating it from the original image, it makes the water easier to identify, and it makes further processing easier and faster [23]. Using the ENvironment for Visualizing Images (ENVI) and a Geographical Information System (GIS), the river was extracted from the image after a segmentation step was applied to the satellite image to extract three classes (land, vegetation and water), as shown in

A. Texture Feature Extraction
There are many approaches to texture analysis. In this work, we have chosen parameters computed from the Gray-Level Co-occurrence Matrix (GLCM). The GLCM is a statistical approach to examining texture that considers the spatial relationship of pixels. It characterizes the texture of an image by calculating how often pairs of pixels with specific values and in a specified spatial relationship occur in the image [24]. It provides a second-order method for generating texture features, based on the conditional joint probabilities of all pairwise combinations of grey levels in the image, given parameters such as the displacement d and the orientation θ [25]. It can be calculated as a symmetric or non-symmetric matrix. In the symmetric GLCM, a pair of grey levels (i, j) oriented at θ = 0° is also considered as being oriented at θ = 180° [26]. Various texture features can be generated by applying GLCM statistics, as in reference [24]. In our study, however, six features (parameters) were chosen and computed from the GLCM. These extracted parameters are used to estimate the regression models that predict the WQPs. In the definitions of these features, P is the matrix element, (i, j) are the intensities, G is the number of gray levels used, μ is the mean value of P, and $\mu_x$, $\mu_y$, $\sigma_x$ and $\sigma_y$ are the means and standard deviations of the marginal-probability matrix [25].
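As a minimal sketch of the feature extraction described above, the six parameters can be computed from a symmetric, normalized GLCM as follows. The formulas used are the commonly cited Haralick-style definitions, which is an assumption: the exact normalization and the homogeneity and variance variants used in this study may differ. The function names `glcm` and `texture_features` and the small test patch are hypothetical and serve only as illustration.

```python
import math


def glcm(image, levels, dx=1, dy=0, symmetric=True):
    """Return the normalized co-occurrence matrix P for pixel offset (dx, dy).

    `image` is a list of rows of integer gray levels in [0, levels).
    """
    counts = [[0.0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                counts[i][j] += 1.0
                if symmetric:
                    # Count the same pair in the opposite direction as well.
                    counts[j][i] += 1.0
    total = sum(sum(row) for row in counts)
    return [[v / total for v in row] for row in counts]


def texture_features(p):
    """Six texture features of a normalized GLCM (assumed Haralick-style forms)."""
    g = range(len(p))
    cells = [(i, j, p[i][j]) for i in g for j in g]
    # Means and standard deviations of the row and column marginals.
    mu_x = sum(i * v for i, _, v in cells)
    mu_y = sum(j * v for _, j, v in cells)
    sd_x = math.sqrt(sum((i - mu_x) ** 2 * v for i, _, v in cells))
    sd_y = math.sqrt(sum((j - mu_y) ** 2 * v for _, j, v in cells))
    cov = sum((i - mu_x) * (j - mu_y) * v for i, j, v in cells)
    return {
        "contrast": sum((i - j) ** 2 * v for i, j, v in cells),
        # Correlation is undefined for a perfectly uniform patch.
        "correlation": cov / (sd_x * sd_y) if sd_x * sd_y > 0 else float("nan"),
        "energy": sum(v * v for _, _, v in cells),
        "homogeneity": sum(v / (1.0 + (i - j) ** 2) for i, j, v in cells),
        "entropy": -sum(v * math.log(v) for _, _, v in cells if v > 0.0),
        "variance": sum((i - mu_x) ** 2 * v for i, _, v in cells),
    }


# Hypothetical 4-level patch (not data from the study), horizontal offset.
patch = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
features = texture_features(glcm(patch, levels=4))
```

In practice the patch would be a window of the segmented water image centred on a station, quantized to G gray levels before the matrix is built.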
To validate our model, Table I presents the texture parameter values computed from eight water regions (stations). Stations 2, 4, 6 and 8 form the training set used to estimate the models, while stations 1, 3, 5 and 7 form the testing set used to validate them. For each of these regions, the water was analysed and the corresponding WQPs (pH, PO4 and NO3) were extracted, as shown in Table II.

B. Development of Multivariate Retrieval Algorithm
Most remote sensing studies of water quality parameters are based on empirical models, as mentioned in the introduction. In this study, multivariate algorithms using the extracted texture parameters and satellite data were developed based on equation (1), in the form of a multiple regression model. The statistical analysis depends on the WQPs and their corresponding texture parameters. In the multiple regression, the independent variables are the six texture parameters, while the dependent variable is the water quality parameter to be calculated. Every empirical model formulated is essentially based on the correlation coefficient between the measured water quality data and the extracted texture parameters, regardless of whether the correlation is direct or inverse. When the correlation between the independent variables and the WQPs is high, the predictive regression model is strong and the trend in the values is pronounced, as in Table II.
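The multiple regression step can be sketched as an ordinary least-squares fit of a WQP on a few texture parameters. The code below is an illustrative sketch, not the procedure used in this study: it solves the normal equations directly, the helper names `solve`, `fit_linear` and `predict` are hypothetical, and the sample values are synthetic rather than measurements.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: too few or collinear samples")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_linear(features, target):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    rows = [[1.0] + list(f) for f in features]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, target)) for i in range(k)]
    return solve(xtx, xty)


def predict(coeffs, f):
    """Evaluate the fitted model for one feature tuple."""
    return coeffs[0] + sum(c * v for c, v in zip(coeffs[1:], f))


# Synthetic example: two texture parameters per sample (hypothetical values).
samples = [(0.1, 0.9), (0.4, 0.5), (0.2, 0.3), (0.8, 0.7), (0.5, 0.1)]
wqp = [2.0 + 3.0 * a - 1.0 * b for a, b in samples]
coeffs = fit_linear(samples, wqp)
```

With only four training stations, a model can carry at most three predictors plus the intercept before the system becomes singular, which is one reason to fit several small models rather than one model with all six texture parameters.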

C. Validation of Multivariate Predictive Algorithms
1) Validation by comparison between measured and calculated WQPs
Measured WQPs refer to the observations taken at the stations, and calculated WQPs refer to the parameters calculated from the satellite data. To obtain a robust validation, the validation was applied to four different stations in the first stage, and in the second stage all stations were taken into account to find the difference between measured and calculated values.

2) Validation by fitting and confidence bounds models
Data fitting is the process of fitting models to data and analyzing the accuracy of the fit. Engineers and scientists use data fitting techniques, including mathematical equations and nonparametric methods, to model acquired data. The polynomial model was selected here to analyze and quantify the errors. A polynomial is a function that can be written in the form:

$$f(x) = a_0 + a_1 x + a_2 x^2 + \dots + a_n x^n$$

If $a_n \neq 0$, the polynomial is said to be of order n. A first-order (linear) polynomial is just the equation of a straight line, while a second-order (quadratic) polynomial describes a parabola [26]. Confidence and prediction bounds define the lower and upper values of the associated interval, and hence its width. The width of the interval indicates how uncertain one is about the fitted coefficients, the predicted observation, or the predicted fit. The confidence bounds for the fitted coefficients are given by:

$$C = b \pm t\sqrt{S}$$

where b are the coefficients produced by the fit, t depends on the confidence level and is computed using the inverse of Student's t cumulative distribution function, and S is a vector of the diagonal elements of the estimated covariance matrix. The simultaneous prediction bounds for the function and for all predictor values are computed with a factor f that depends on the confidence level and is obtained from the inverse of the F cumulative distribution function. The Goodness of Fit (GOF) of a statistical model describes how well it fits a set of observations. GOF indices summarize the discrepancy between the observed values and the values expected under a statistical model. To evaluate the goodness of fit, it is required to calculate the Sum of Squares Error (SSE), R-square, adjusted R-square, and Root Mean Squared Error (RMSE).
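The four goodness-of-fit indicators named above can be computed from the measured and fitted values as in the following sketch. It assumes the common definitions in which the residual degrees of freedom equal the number of observations minus the number of fitted coefficients; the function name `goodness_of_fit` and the sample values are hypothetical.

```python
import math


def goodness_of_fit(measured, fitted, n_coeffs):
    """Return SSE, R-square, adjusted R-square and RMSE for a fitted model.

    n_coeffs is the number of fitted coefficients (3 for a quadratic polynomial).
    """
    n = len(measured)
    dof = n - n_coeffs  # residual degrees of freedom
    if dof <= 0:
        raise ValueError("need more observations than fitted coefficients")
    mean = sum(measured) / n
    sse = sum((m - f) ** 2 for m, f in zip(measured, fitted))
    sst = sum((m - mean) ** 2 for m in measured)
    return {
        "sse": sse,
        "r_square": 1.0 - sse / sst,
        "adj_r_square": 1.0 - (sse / dof) / (sst / (n - 1)),
        "rmse": math.sqrt(sse / dof),
    }


# Synthetic example with a two-coefficient (linear) model.
gof = goodness_of_fit([1.0, 2.0, 3.0, 4.0, 5.0], [1.1, 1.9, 3.2, 3.8, 5.0], n_coeffs=2)
```

Because the adjusted R-square and the RMSE both divide by the residual degrees of freedom, they penalize models that use many coefficients relative to the number of stations.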

D. Validation by Fuzzy K-Means Clustering
The Fuzzy K-means Clustering (FKM) algorithm iteratively performs a partition step and a step that generates new cluster representatives until convergence. Applications of FKM can be found in the literature, which includes an excellent review of FKM. An iterative process with extensive computations is usually required to generate a set of cluster representatives [27]. Clustering a data set $X \subset \mathbb{R}^{N}$ implies that the data set is partitioned into K clusters such that each cluster is compact and far from the other clusters. One way to achieve this goal is to minimize the distances between the cluster center and the patterns that belong to the cluster. Using this principle, the hard k-means algorithm minimizes the following objective function [28]:

$$J = \sum_{k=1}^{K} \sum_{x_i \in F_k} d(x_i, m_k) \qquad (2)$$

where $d(x_i, m_k)$ is a distance measure between the center $m_k$ of the cluster $F_k$ and the pattern $x_i \in X$. Eq. (2) can be rewritten as

$$J = \sum_{k=1}^{K} \sum_{i=1}^{n} \mu_k(x_i)\, d(x_i, m_k) \qquad (3)$$

where n is the number of patterns and $\mu_k(x_i)$ is the characteristic function of cluster $F_k$, equal to 1 if $x_i \in F_k$ and 0 otherwise. When the clusters overlap, each pattern may belong to more than one cluster, i.e., $\mu_k(x_i) \in [0, 1]$ should be interpreted as a membership function rather than the characteristic function. Therefore, the objective function (3) can be modified to the following:

$$J_q = \sum_{k=1}^{K} \sum_{i=1}^{n} \left[\mu_k(x_i)\right]^{q} d(x_i, m_k) \qquad (4)$$

where $\mu_k(x_i)$ is a fuzzy membership function and q is a constant known as the index of fuzziness that controls the amount of fuzziness.
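A minimal sketch of the alternating iteration behind the fuzzy objective is given below. It assumes the squared Euclidean distance and the standard update rules that follow from it (memberships proportional to the inverse distance raised to 1/(q - 1), centers as membership-weighted means); the distance measure and stopping rule used in this study are not stated, and the function names and sample points are hypothetical.

```python
def sq_dist(x, m):
    """Squared Euclidean distance between a pattern and a cluster center."""
    return sum((a - b) ** 2 for a, b in zip(x, m))


def memberships(points, centers, q):
    """Fuzzy membership of every point in every cluster (each row sums to 1)."""
    table = []
    for x in points:
        # A small floor avoids division by zero when a point sits on a center.
        inv = [(1.0 / max(sq_dist(x, m), 1e-12)) ** (1.0 / (q - 1.0)) for m in centers]
        total = sum(inv)
        table.append([v / total for v in inv])
    return table


def fuzzy_kmeans(points, centers, q=2.0, max_iter=100, tol=1e-9):
    """Alternate membership and center updates until the centers stop moving (q > 1)."""
    centers = [tuple(m) for m in centers]
    dims = range(len(points[0]))
    for _ in range(max_iter):
        u = memberships(points, centers, q)
        updated = []
        for k in range(len(centers)):
            w = [row[k] ** q for row in u]
            w_sum = sum(w)
            updated.append(tuple(
                sum(wi * x[d] for wi, x in zip(w, points)) / w_sum for d in dims
            ))
        shift = max(sq_dist(a, b) for a, b in zip(centers, updated))
        centers = updated
        if shift < tol:
            break
    return centers, memberships(points, centers, q)


# Hypothetical one-dimensional feature values forming two groups.
data = [(1.0,), (1.2,), (0.8,), (5.0,), (5.2,), (4.8,)]
final_centers, final_u = fuzzy_kmeans(data, centers=[(0.0,), (6.0,)])
```

A hard class map can then be obtained by assigning each pixel to the cluster with the largest membership, while the membership values themselves indicate how mixed a pixel is.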

A. Analysis of Correlation
Scatter plots were studied in the early stages of the work to find the correlation between the WQPs and the extracted texture parameters. The form of the scatter indicates whether the relationship between the parameters is direct or inverse. The kind of relationship is not important here, because the absolute value of the correlation is used to determine the strength of the relationship between parameters. The correlations between the parameters extracted by texture analysis and those measured at the stations are shown in Table II. There is a strong direct correlation between pH and two texture parameters, contrast and correlation (0.845 and 0.876, respectively). This means that these two parameters are the most directly related to pH: if pH increases, the two texture parameters increase, and vice versa. At the same time, there is an inverse correlation between pH and energy and homogeneity (-0.838 and -0.846, respectively), meaning that one increases as the other decreases. In general, the correlation is strong in both cases. A weak correlation was found between pH and both entropy and variance. The high correlation between the extracted parameters probably means that these texture parameters are measuring similar aquatic properties. The second parameter, PO4, which is one of the important pollutants in water, was found to be highly correlated with variance; it showed no correlation with the other texture parameters studied. NO3, which represents the purity of the water, showed a high correlation with entropy.
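The correlation screening described above can be sketched with the Pearson correlation coefficient, ranking the texture parameters by the absolute value of their correlation with a WQP. The use of the Pearson coefficient is an assumption, and the function names and numbers below are hypothetical, not the values of Table II.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rank_by_strength(textures, wqp):
    """Sort texture parameters by |r| with the WQP, strongest first."""
    scored = [(name, pearson(values, wqp)) for name, values in textures.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


# Synthetic station values (hypothetical, for illustration only).
textures = {
    "contrast": [1.0, 2.0, 3.0, 4.0],
    "energy": [4.0, 3.0, 2.0, 1.0],
    "entropy": [1.0, 3.0, 2.0, 4.0],
}
ph = [2.0, 4.0, 6.0, 8.5]
ranking = rank_by_strength(textures, ph)
```

Ranking by the absolute value treats a strong inverse relationship, such as the one reported for energy and homogeneity, as just as useful for prediction as a strong direct one.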

B. Generation of Multivariate Predictive Algorithms
Using the multiple regression model, it was possible to derive eight equations for estimating pH, depending on the texture parameters used, with an average accuracy of 95%.

In these equations, C denotes contrast, Co correlation, E energy and H homogeneity. Each predicted equation has an accuracy given by its $R^2$ and a probability value from the regression analysis. Equations 1-8 have accuracies of 0.8450, 0.8760, 0.895, 0.9830, 0.9998, 0.9970, 0.9990 and 0.9997, respectively. Equations (17) and (19) were excluded because their probability values were higher than 0.05. Equation (20) was chosen for the classification because of its high accuracy. Using equation (20) does not preclude using the other equations, depending on the accuracy and the type of texture available. For

C. Validation of Predictive Algorithms
As mentioned above, three types of validation were carried out to measure the strength of the equations. Confidence bounds models were analysed to measure the accuracy of all predictive algorithms, as shown in Figs.

D. Image Segmentation and Validation
The image segmentation results obtained using fuzzy K-means are shown in Figs. 6 and 7. Fuzzy logic assigns each mixed pixel to a specific category based on the descriptions of the input and output variables. Fuzzy logic rules were applied to incorporate expert knowledge. A set of rules was fixed to classify the pH image, and three classes were selected to represent pH.

V. CONCLUSION
In this study, the potential of texture parameters derived from the GLCM for assessing and monitoring water quality has been demonstrated. Texture parameters were extracted for each sample at the ground-truth locations. Homogeneity, entropy and variance were found to be the most suitable texture parameters for predicting pH, PO4 and NO3 concentrations using empirical models with high correlation. The method made it possible to calculate pH from several equations, depending on the texture parameters used, with different accuracies. The confidence bounds models indicated substantial agreement between measured and calculated variables. Some points had very high pollution values, which caused a large gap in the homogeneity values of the points. This gap, or disparity, in the measured values did not affect the accuracy of the model. The analysis of remotely sensed and ground data indicated that two of the WQPs can be mapped; the exception is the third parameter, PO4, which had zero values in the image texture and was therefore excluded from the results. The fuzzy K-means method, supported by rules on the texture inputs and the class descriptions, helped to obtain a good classification of the studied parameters.
As future work, more sampling stations should be selected at suitable locations so that more accurate results can be obtained. The method could also be applied to different types of satellite images and compared with other methods, especially those concerned with surface roughness and backscattering models.

PO4 and NO3 showed a strong relationship with variance (V) and entropy (En), respectively, expressed by:

2 and 3. A quadratic polynomial model was fitted to the measured and calculated pH, PO4 and NO3. The results show that all points fall within the 95% confidence bounds. The goodness-of-fit (GOF) indicators were also calculated: SSE = 0.02149, R-square = 0.9853, adjusted R-square = 0.9486 and RMSE = 0.1037, all of which indicate a high-quality fit.

Figs. 4 and 5 show the results of segmentation and the distribution of pH and NO3. All of these parameters fell within safe limits and corresponded to the ground measurements.