Recognizing Safe Drinking Water and Predicting Water Quality Index using Machine Learning Framework



I. INTRODUCTION
In the new green economy, monitoring and evaluating water quality is a central issue for the life of all organisms. Classical monitoring approaches that rely on chemical measurements alone are not sufficient to evaluate the consequences of some influences and stresses, because the interactive effects of different chemical variables on aquatic microorganisms are very difficult to predict [1]. Rapid industrial development has deteriorated water quality at an alarming rate. In addition, poor infrastructure, together with a lack of public awareness and low standards of hygiene, greatly affects the quality of drinking water [2].
Polluted drinking water is a serious problem: it adversely affects the health of organisms and has many environmental and infrastructural impacts. According to a United Nations (UN) report, more than 1.5 million people die every year from diseases caused by polluted water. In developing countries, 80% of health issues have been attributed to polluted water. Moreover, 2.5 billion illnesses and five million deaths are reported annually [3]; these are alarming numbers.
Because robust water monitoring techniques are lacking, many countries are unable to improve their water systems or to build effective water recovery systems. These shortcomings may lead to greater uncertainty when developing water resource management policies [4].
Recently, there has been a marked increase in the development of biological monitoring and assessment tools for water resources that are reliable enough to manage many degraded water bodies in the USA, Europe, South Africa, and Australia [5]. However, with the huge increase in data generated by monitoring devices and the impracticality of manual coding, shortcomings began to appear in those systems because they lacked an effective mechanism for processing such large volumes of data. The growth of artificial intelligence (AI) based on machine learning and deep learning techniques offers a promising solution to that problem, as AI provides many prediction, clustering, and classification techniques that can produce effective solutions to water quality problems [6]. Research in the past decades has focused largely on analyzing the water quality of rivers using AI techniques [7]. With AI models, water quality forecasting, classification, and risk assessment become far more tractable. Moreover, advanced early warning systems and effective management policies can be designed to add more control and monitoring services to rivers and water bodies [8,9].
In this paper, a machine learning framework is proposed for analyzing water quality. It consists of two subsystems. The first subsystem classifies water quality: nine AI models have been applied, tested, and compared to classify samples of drinking water as safe or unsafe to drink. The nine AI models are: Extreme Gradient Boosting (XGBoost) [10], Light Gradient Boosting Machine (LightGBM) [11], Decision Tree (DT) [12], Extra Tree (ET) [13], Multi-layer Perceptron (MLP) [14], Gradient Boosting (GB) [15], Support Vector Machine (SVM) [16], Artificial Neural Network (ANN) classification [17], and Random Forest (RF) Classifier [18]. The second subsystem predicts the water quality index (WQI) using six regression models: LGBM regression, XGB regression, Extra Trees regression, DT regression, RF regression, and linear regression. These models have been applied to a dataset called Water quality, which was downloaded from [19]. The experimental results showed that the LightGBM model outperformed the other eight AI models, with an accuracy of 97% in recognizing safe drinking water samples. Moreover, the predictive analysis showed that the LGBM regression and Extra Trees regression models outperformed the other four regression models in predicting the water quality index according to training accuracy, testing accuracy, and mean absolute error (MAE).
www.ijacsa.thesai.org
The rest of this article is organized as follows: Section II reviews the related work. Section III explains the proposed machine learning framework for analyzing water quality. Section IV presents and discusses the implementation results. Section V concludes this work.

II. LITERATURE REVIEW
A growing body of literature has investigated the efficiency of machine learning and deep learning models for monitoring, analyzing, and predicting the water quality index. Several reviews discuss various AI models for solving water quality prediction problems [9,20,21]. In addition, several large cross-sectional studies have applied multiple machine learning and deep learning models to predict the water quality index.
Ali Najah et al. [22] applied four machine learning models, namely an enhanced Wavelet De-noising Technique (WDT)-based Adaptive Neuro-Fuzzy Inference System (WDT-ANFIS), Radial Basis Function Neural Networks (RBF-ANN), an Adaptive Neuro-Fuzzy Inference System (ANFIS), and Multi-Layer Perceptron Neural Networks (MLP-ANN), to predict water quality parameters (i.e., pH, ammonia nitrogen (AN), and suspended solids (SS)) of the Johor River in Malaysia. The experimental results showed that the WDT-ANFIS model outperformed the other three models in prediction accuracy for all the water quality parameters.
Amir Hamzeh et al. [23] used the support vector machine (SVM), Artificial Neural Network (ANN), and group method of data handling (GMDH) models to predict the water quality of the Tireh River in Iran. Different kernel and transfer functions were tested and validated, and the results showed that both ANN and SVM are better than GMDH at predicting the water quality of the Tireh River.
Umair Ahmed et al. [24] introduced supervised learning models for WQI prediction based on four water features, namely turbidity, temperature, pH, and total dissolved solids. The proposed models achieved acceptable accuracy and low error rates using a minimal number of features when predicting the WQI in real time.
Abubakr Saeed et al. [25] proposed an efficient machine learning algorithm based on the SVM model to forecast the WQI of the Langat River Basin from six variables (Dissolved Oxygen (DO), pH, Chemical Oxygen Demand (COD), Suspended Solids (SS), Ammonia Nitrogen (AN), and Biochemical Oxygen Demand (BOD)) of dual reservoirs located in the catchment. The experimental results showed that this model could accurately predict the WQI value with a small mean absolute error.
Mourad Azrour et al. [26] investigated the efficiency of machine learning algorithms for predicting the WQI based on four water features: pH, temperature, turbidity, and coliforms. The experimental results demonstrated the efficiency of the regression algorithms used in predicting the WQI. Moreover, the artificial neural network proved to be the most efficient model for classifying water quality compared to other models in the literature.
They H et al. [27] utilized advanced AI models to predict the WQI and classify water quality. The authors applied a nonlinear autoregressive neural network (NARNET) and long short-term memory (LSTM) as deep learning algorithms for predicting the WQI. Moreover, three learning techniques, namely K-nearest neighbor (K-NN), Naive Bayes, and SVM, were applied to the water quality classification task. The prediction results showed that the NARNET algorithm performed slightly better than the LSTM for predicting WQI values. On the other hand, the SVM model achieved the highest accuracy (97.01%) for water quality classification compared to the other classification models.
Siti Nur Mahfuzah et al. [28] investigated the efficiency of two machine learning algorithms, Random Forest and Random Tree, for classifying river water quality. The results showed that Random Forest gives a higher classification accuracy than the Random Tree algorithm.
Junhao Wu et al. [29] proposed a hybrid model based on the discrete wavelet transform (DWT), an ANN model, and an LSTM model to predict the water quality of the Jinjiang River. The prediction results demonstrated the efficiency of the proposed hybrid model in predicting the water quality index compared to other models such as the ARIMA model, the LSTM model, the nonlinear autoregression (NAR) model, the ANN-LSTM model, the multi-layer perceptron model, and the CNN-LSTM model.
NguyenHien Than et al. [30] investigated water quality monitoring for the Dong Nai River based on a novel neural network architecture (FFNN) and an LSTM-MA hybrid model applied to different time series. The validation results showed that the LSTM-MA model provided more reliable predictions and trained faster than the NAR, NAR-MA, ARIMA, and LSTM models. Moreover, the proposed hybrid model produced classification results for water quality in close agreement with the actual monitoring data.
Other hybrid machine learning and deep learning models have been developed for investigating the water quality index. For example, a one-dimensional residual CNN (1-DRCNN) and bi-directional gated recurrent units (BiGRU) have been used to predict water quality in the Luan River [31]. Moreover, a hybrid deep learning model based on CNN and LSTM has been applied, tested, and compared for predicting water quality based on real-time monitoring of water quality variables [32].

III. WATER QUALITY ANALYSIS FRAMEWORK
To analyze drinking water quality automatically from a given dataset, a framework consisting of two phases is proposed. The first phase classifies water samples from a given dataset into two classes, safe or unsafe for drinking, using nine classification algorithms, whereas the second phase predicts the water quality index (WQI) using six regression algorithms. The two phases are discussed in more detail below.

A. Phase 1: Water Samples Classifications
To recognize safe drinking water samples, nine machine learning techniques have been used, tested, and compared. Fig. 1 depicts how these models are used to classify water samples from a given dataset. The classification phase starts with a preprocessing step that cleans, splits, and resamples the dataset. In the second step, the dataset is divided into training (70%) and testing (30%) parts. The third step extracts the water features that may impact water quality through feature selection. In the final step, the nine classification algorithms (i.e., learning models) are called one after another to perform the classification task. The classification models can be briefly described as follows:

1) Extreme Gradient Boosting (XGBoost):
It builds on supervised machine learning, decision trees, ensemble learning, and gradient boosting. It is one of the most powerful techniques for building models for regression, classification, and ranking problems [33]. It provides a parallel tree boosting approach in which each new tree corrects errors made by the previously boosted trees [34].
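The idea of each new tree correcting the errors left by the previous ones can be illustrated with a deliberately small sketch. It is not XGBoost itself: it assumes one-split regression stumps, a squared-error loss, and a fixed learning rate, and the helper names `fit_stump` and `boost` as well as the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # candidate split points
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Additive model: each stump is trained on the current residuals."""
    base = sum(ys) / len(ys)  # start from the mean prediction
    stumps = []

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    for _ in range(n_rounds):
        # Residuals are the errors left by all stumps fitted so far.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


# Toy one-feature data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.0, 1.1, 3.9, 4.1, 4.0]
model = boost(xs, ys)
mse = sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
```

Because every round fits only what the current ensemble still gets wrong, the training error cannot increase from one round to the next in this squared-error setting. Production libraries add regularization, second-order gradient information, and parallel tree construction on top of this basic loop.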
2) Light Gradient Boosting Machine (LightGBM):
It was developed by Microsoft and is a popular algorithm for ranking and classification problems. Its structure is also based on decision tree models. LightGBM is distinguished by its training speed and accurate predictions, which result from an automatic feature selection procedure and from focusing the boosting on instances with larger gradients [35].

3) Decision Tree (DT):
It is a common supervised learning algorithm used for regression and classification problems [12]. The idea is to learn decision rules inferred from the data features and use them to perform classification or prediction tasks. Several properties make DT an effective classification model: 1) a DT model requires little data preparation; 2) the cost of predicting with a trained DT is logarithmic in the number of data points used to train it; 3) a DT model can be validated with statistical tests; 4) it performs well even when its assumptions are somewhat violated by the true model from which the data were generated; and 5) DT models can be visualized easily and are simple to interpret [36].
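A minimal sketch of how a single decision rule can be deduced from one feature is given below. It assumes the Gini impurity as the split criterion, which the text does not specify, and the function names and toy values are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_threshold(values, labels):
    """Return (threshold, impurity) minimizing the weighted Gini impurity."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_score, best_t = float("inf"), None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no split is possible between identical values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2.0  # midpoint candidate
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_score, best_t = score, t
    return best_t, best_score


# Toy data: one feature, label 1 = safe and 0 = unsafe (illustrative only).
values = [0.2, 0.4, 0.5, 1.1, 1.3, 1.6]
labels = [1, 1, 1, 0, 0, 0]
threshold, impurity = best_threshold(values, labels)
```

A full decision tree applies this search recursively over all features, which is also why the learned rules can be drawn and read directly as a tree.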

4) Extra Tree Classifier (ETC):
It is an ensemble learning approach in which the classification results are aggregated from a forest of several de-correlated DT models [37]. It differs from the Random Forest classifier in the way the DT models in the forest are constructed. To build each de-correlated tree, every node is given a random sample of features, and the feature that leads to the best split is selected according to a mathematical criterion.

5) Multi-layer Perceptron Classifier (MLP Classifier):
It is a class of feed-forward neural network models [38]. One or more nonlinear hidden layers may lie between the input and output layers to map input data to output data. This classifier relies on the sigmoid activation function to perform the classification task.
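The forward pass of such a network can be sketched in a few lines. The 2-3-1 layer sizes, the hand-picked (untrained) weights, and the 0.5 decision threshold are assumptions made for illustration only.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases):
    """One fully connected layer followed by a sigmoid activation."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def mlp_forward(inputs, layers):
    """Propagate inputs through a list of (weights, biases) layers."""
    activations = inputs
    for weights, biases in layers:
        activations = dense(activations, weights, biases)
    return activations


# Hypothetical 2-3-1 network with hand-picked (untrained) parameters.
layers = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),
    ([[1.0, -1.0, 0.5]], [0.2]),
]
probability = mlp_forward([1.0, 2.0], layers)[0]
label = 1 if probability >= 0.5 else 0  # threshold the sigmoid output
```

Training would adjust the weights and biases by backpropagation; only the inference step is shown here.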

6) Gradient Boosting Classifier (GBC):
It is a common boosting classifier [39]. Gradient boosting trains N trees sequentially, with each tree correcting the errors of its predecessors, and combines them into an ensemble. Each predictor is trained on the residual errors produced by its predecessor, and the contribution of each tree to the final prediction is scaled by a shrinkage factor (i.e., a learning rate).

7) Support Vector Machine (SVM) Classifier:
It is a supervised learning model used for both regression and classification problems [40]. The main goal of the SVM model is to identify a hyperplane in an N-dimensional space that separates the data items into classes. The SVM kernel is a function that takes a low-dimensional input space and maps it into a higher-dimensional space, which makes SVM suitable for nonlinear classification problems. SVM has several advantages that make it an efficient classifier, such as memory efficiency, effectiveness in high-dimensional cases, and the possibility of customizing kernel functions.
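The mapping from a low-dimensional input space to a higher-dimensional one can be made concrete with the homogeneous polynomial kernel of degree 2, chosen here purely for illustration (the kernel actually used is not stated). For two-dimensional inputs, evaluating the kernel in the original space gives the same value as a dot product in an explicit three-dimensional feature space.

```python
import math


def poly2_kernel(x, y):
    """Degree-2 polynomial kernel k(x, y) = (x . y)^2 for 2-D inputs."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2


def feature_map(x):
    """Explicit 3-D feature map whose dot product equals poly2_kernel."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


x, y = (1.0, 2.0), (3.0, -1.0)
implicit = poly2_kernel(x, y)                   # computed in 2-D
explicit = dot(feature_map(x), feature_map(y))  # computed in 3-D
```

The two values agree up to floating-point rounding, which is the point of the kernel trick: the classifier can work in the higher-dimensional space without ever constructing the mapped vectors.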

8) Artificial Neural Network Classification (ANN):
This class of ANN is one of the simplest types of neural networks [17]. It is also a feed-forward algorithm, as it passes information in one direction from the input neurons through one or more hidden layers to the output neurons. The main advantages of an ANN classifier are its ability to work with incomplete knowledge, its storage of information across the entire network, its distributed memory, and its fault tolerance.

9) Random Forest Classifier (RF):
It is a non-linear classification technique that consists of a group of decision trees [18]. It integrates multiple decision trees to obtain more accurate predictions than any single decision tree would give when employed on its own. The algorithm is called random because it chooses predictors randomly at training time, and it is called a forest because it combines the results of multiple trees to make a decision. The main advantage of random forests over decision trees is that a large number of relatively uncorrelated tree models working as a single unit tends to outperform the individual tree models.
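The two ingredients named above, randomness at training time and combining the outputs of several trees, can be sketched as follows. The sketch assumes bootstrap resampling of the training rows and a simple majority vote; the helper names are hypothetical, and the "trees" are stand-in threshold rules rather than trained models.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one sample per tree)."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(votes):
    """Return the most common class label among the tree votes."""
    return Counter(votes).most_common(1)[0][0]


def forest_predict(trees, sample):
    """Ask every tree for a label and combine them by majority vote."""
    return majority_vote([tree(sample) for tree in trees])


# Stand-in "trees": simple threshold rules on different features.
trees = [
    lambda s: 1 if s["feature_a"] < 0.5 else 0,
    lambda s: 1 if s["feature_b"] < 2.0 else 0,
    lambda s: 1 if s["feature_c"] > 6.5 else 0,
]
sample = {"feature_a": 0.3, "feature_b": 2.4, "feature_c": 7.1}
prediction = forest_predict(trees, sample)  # two of three trees vote 1

rng = random.Random(0)  # fixed seed so the resampling is reproducible
resampled = bootstrap_sample([10, 20, 30, 40], rng)
```

Here the second rule disagrees with the other two, yet the ensemble still returns the majority label, which illustrates how errors of individual trees can be outvoted when the trees are not strongly correlated.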

B. Phase 2: Water Quality Index Prediction
The second phase of the proposed framework performs the predictive analysis of the water quality index. In this phase, we examined the use of the water quality index (WQI) for predicting water quality with six regression models. The analysis starts by calculating the WQI for the dataset using the mathematical model specified in equations 1, 2, 3, and 4 [41]. After that, six regression models are applied to predict water quality: LGBM regression, XGB regression, Extra Trees regression, Decision Tree regression, Random Forest regression, and linear regression [42]. Fig. 2 shows how the six regression models are applied to predict the water quality index.
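For orientation, a widely used weighted arithmetic formulation of the WQI involves the same quantities that this subsection names: a standard value per variable, a proportionality constant, a weight per variable, and a quality rating per variable. The sketch below is that common formulation, given under the assumption that it matches the model of [41]; the exact form used there may differ. The symbols are illustrative choices: $S_i$ is the standard (permissible) value of variable $i$, $K$ the constant, $W_i$ the weight, $V_i$ the measured value, $V_{\mathrm{ideal},i}$ the ideal value, and $q_i$ the quality rating.

```latex
\begin{align*}
K &= \frac{1}{\sum_{i=1}^{n} \dfrac{1}{S_i}} \\
W_i &= \frac{K}{S_i} \\
q_i &= 100 \times \frac{V_i - V_{\mathrm{ideal},i}}{S_i - V_{\mathrm{ideal},i}} \\
\mathrm{WQI} &= \frac{\sum_{i=1}^{n} W_i \, q_i}{\sum_{i=1}^{n} W_i}
\end{align*}
```

With $K$ defined this way the weights sum to one, so the denominator of the last expression equals 1 and the WQI reduces to a weighted average of the quality ratings.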
Here, equation 1 is expressed in terms of the standard value of each water variable and a constant.
Then, the weight of each element is calculated as in equation 2, and the quality impact value of each element in the water dataset is calculated as in equation 3.

IV. IMPLEMENTATION RESULTS AND DISCUSSION
In this section, we present two types of analysis for investigating the efficiency of the proposed machine learning approach in predicting water quality. Subsection A discusses the classification analysis of water samples using nine classifiers, while subsection B discusses the predictive analysis using six regression models:

A. Classification Analysis
The first set of analyses examined the efficiency and accuracy of the nine machine learning models used in the proposed framework (as explained in Section III-A) for classifying water samples and recognizing the samples that are suitable for human drinking. These models have been applied to a dataset called Water quality, which was downloaded from [19]. The dataset consists of 7996 water samples and 19 features (i.e., variables) that impact water quality. The data has been split into training data (6396 samples, 19 features) and testing data (1600 samples, 19 features). The main objective was to classify water samples as suitable or not suitable for human drinking. The performance of the nine models has been tested and evaluated using twelve measures, as detailed in Table I, where the best performance according to each measure is highlighted. The results show that although the random forest algorithm achieved the best training accuracy, LightGBM outperformed the other classifiers in recognizing good water samples with respect to testing accuracy, sensitivity, AUC, F1-score, recall, precision, and mean square error. Fig. 3 and Fig. 4 compare the nine classifiers in terms of the classification metrics and the mean square error (MSE), respectively. In addition, Fig. 5 to 13 depict the confusion matrices and the corresponding receiver operating characteristic (ROC) curves of the nine machine learning models.
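Several of the measures named in this subsection follow directly from the four counts of a binary confusion matrix. The sketch below shows those standard definitions; the function name and the example counts are hypothetical and are not taken from Table I.

```python
def classification_metrics(tp, fp, tn, fn):
    """Common binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0  # also called sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts for 100 test samples (not the paper's results).
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

Note that sensitivity and recall are two names for the same quantity, so a classifier that leads on one necessarily leads on the other.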

B. Predictive Analysis
The second set of analyses examined the efficiency and accuracy of the six regression models used in the proposed framework (as explained in Section III-B) for predicting the WQI. Table II summarizes the predictive analysis results of the six regression models after applying the mathematical model of the WQI to the dataset. The results have been evaluated using common regression metrics: training accuracy, testing accuracy, R2, adjusted R2, and mean absolute error (MAE).
The regression analysis results show that the LGBM regression and Extra Trees regression models outperform the other regression models in predicting the water quality index according to training and testing accuracy as well as the mean absolute error (MAE). Fig. 14 to 16 visualize the prediction results of the six regression models. Fig. 17 compares the regression results of the six models.
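The regression metrics used in this subsection have standard definitions, sketched below in plain Python. The function name and the example values are hypothetical and do not reproduce Table II; `n_features` is the number of predictors, which the adjusted R2 needs in addition to the predictions.

```python
def regression_metrics(y_true, y_pred, n_features):
    """MAE, R^2 and adjusted R^2 for a set of predictions."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R^2 penalizes R^2 for the number of predictors used.
    adjusted_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return {"mae": mae, "r2": r2, "adjusted_r2": adjusted_r2}


# Hypothetical WQI values and predictions, for illustration only.
y_true = [62.0, 70.0, 55.0, 81.0, 90.0, 47.0]
y_pred = [60.0, 72.0, 57.0, 80.0, 88.0, 50.0]
scores = regression_metrics(y_true, y_pred, n_features=2)
```

For an imperfect fit, the adjusted R2 is always below the plain R2 when at least one predictor is used, which makes it the safer figure for comparing models that use different numbers of features.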

V. CONCLUSION
The present article was designed to investigate the efficiency of a proposed machine learning framework for classifying drinking water samples and predicting the water quality index. The classification tier of the proposed framework consists of nine classification models: Extreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), Decision Tree (DT), Extra Tree (ET) classifier, Multi-layer Perceptron (MLP) classifier, Gradient Boosting (GB) classifier, Support Vector Machine (SVM), Artificial Neural Network (ANN), and Random Forest (RF) classifier. The performance of those models has been validated on a benchmark dataset consisting of 7996 water samples and 19 features. The nine models achieved good classification results, with an average accuracy of 94.7%. However, although the Random Forest (RF) algorithm achieved the best training accuracy (100%), LightGBM outperformed the other classifiers in recognizing good water samples, with a testing accuracy of 97%. The second goal of this study was to investigate the efficiency of the regression tier by applying six regression models to predict the water quality index. The regression analysis showed that the LGBM regression and Extra Trees regression models outperformed the other regression models in predicting the water quality index according to training and testing accuracy as well as the mean absolute error (MAE). Taken together, these findings suggest a role for machine learning models in promoting the analysis and prediction of water quality. Moreover, these results have significant implications for understanding how novel deep learning models can be developed to predict whether water is suitable for human drinking, irrigation of plants and crops, and other industrial or environmental purposes.