Rainfall Prediction in Lahore City using Data Mining Techniques

Rainfall prediction is significant in many domains. It can help reduce the effects of sudden and extreme rainfall by allowing effective safety measures to be taken in advance. Due to climate variations, accurate rainfall prediction has become more complex than before. Data mining techniques can predict rainfall by extracting the hidden patterns among the weather attributes of past data. This research contributes by exploring the use of various data mining techniques for rainfall prediction in Lahore city. The techniques include Support Vector Machine (SVM), Naïve Bayes (NB), k Nearest Neighbor (kNN), Decision Tree (J48) and Multilayer Perceptron (MLP). The dataset is obtained from a weather forecasting website and consists of several atmospheric attributes. For effective prediction, a pre-processing technique is used which consists of cleaning and normalization. The performance of the data mining techniques is analyzed in terms of precision, recall and f-measure with various ratios of training and test data.
Keywords: Rainfall prediction; data mining; classification techniques


I. INTRODUCTION
Time series data mining is an active research topic in the domain of knowledge discovery [19]. Time series data is collected over a specific period at regular intervals, such as daily, weekly, monthly, quarterly or yearly [13]. Such data can be used for prediction in domains such as finance, the stock market and climate change. Data mining techniques are used to extract hidden knowledge from time series data for future use [13], [17], [25], [28]. Weather prediction with time series data is beneficial but quite challenging [16], [27], [29]. It comes with an array of complexities that need to be tackled for optimal results [18]. Statistical weather data has a wide variety of fields, called features, such as humidity, pressure, wind speed and pollutant concentrations. Data mining techniques can predict the weather on the basis of hidden patterns among these features [27], [29]. Rainfall prediction is an important aspect of climate forecasting. Accurate and timely rainfall prediction is crucial for the planning and management of water resources, flood warnings, construction activities and flight operations [14], [15].
This study used five data mining techniques for rainfall prediction in Lahore, the capital of Punjab province, Pakistan. In Lahore, development and construction activities are increasing exponentially, so timely rainfall prediction is crucial for better assessment of future requirements and planning. The techniques used are Support Vector Machine, Naïve Bayes, k Nearest Neighbor, Decision Tree and Multilayer Perceptron. These algorithms belong to the supervised class of data mining techniques, which require pre-classified data for training. During training, the algorithms build classification rules that are then applied to the input dataset (test data) [20]-[25], [30]. In this research, the dataset is obtained from a weather forecasting website [10] and covers December 1, 2005 to November 31, 2017 (12 years); it contains several weather-related attributes such as temperature, atmospheric pressure and relative humidity. For rainfall prediction, a classification framework is used in which the dataset goes through cleaning and normalization before classification. Cleaning deals with missing values, and normalization keeps the attribute values within certain limits. These pre-processing activities are crucial for a smooth classification process as well as for good results [9], [12]. The prediction performance of the techniques is evaluated in terms of precision, recall and f-measure, which are important metrics in information retrieval. Finally, the results are shown in tables and graphs.
The rest of this paper is organized as follows. Section II describes the related work. Section III discusses the materials and methods used in this research. Section IV presents the results and discussion. Section V concludes the paper.

II. RELATED WORK
Many researchers have worked to achieve high accuracy in rainfall prediction using data mining techniques; some selected studies are discussed here. Researchers in [1] performed a comparative analysis of multiple classifiers for rainfall prediction in Malaysia. The classifiers included Naïve Bayes, Support Vector Machine, Decision Tree, Neural Network and Random Forest. The dataset was obtained from multiple stations in Selangor, Malaysia. Pre-processing tasks were applied before classification to deal with noise and missing values. According to the results, Random Forest performed best, as it correctly classified a large number of instances with a small amount of training data.
In [2] Researchers in [5] proposed an algorithm which combined data mining and statistical techniques. The likely predictors with the highest confidence level, based on association rules, were selected. Those predictors were derived from local and global conditions. From local conditions, sea level pressure, wind speed, and maximum and minimum temperatures were recorded. From global conditions, the southern oscillation and Indian Ocean dipole conditions were taken. The algorithm predicted rainfall in five categories: Flood, Excess, Normal, Deficit and Drought.
Researchers in [6] presented a Wavelet Neural Network (WNN) model for rainfall prediction, which combines the wavelet technique with an Artificial Neural Network (ANN). Both the proposed WNN and the ANN model were applied to monthly rainfall data from the Darjeeling rain gauge station, West Bengal, India. Statistical methods were used to analyze the performance of both techniques, and it was observed that WNN performed much better than the ANN model.
In [7], researchers implemented a rainfall forecasting model using Focused Time-Delay Neural Networks (FTDNN). The parameters for the neural networks were chosen from several experiments to perform one-step-ahead prediction. For prediction, daily rainfall data was obtained from the Malaysia Meteorological Department (MMD) and then converted to monthly, quarterly, biannual and yearly datasets. Models were trained and tested on each dataset, and the corresponding accuracies were evaluated using Mean Absolute Percent Error. According to the results, the most accurate forecasts were made with the yearly rainfall dataset. The authors pointed out that more parameters, such as temperature, humidity and sunshine data, should be incorporated into the neural networks to make the predictions more accurate.
Researchers in [8] presented a methodology to predict the maximum temperature of the day, which followed the Support Vector Regression approach. The proposed technique performed prediction on the basis of several features obtained from different measuring stations in Europe. Weather-related features included temperature, precipitation, relative humidity, air pressure, the synoptic situation of the day and the monthly cycle. The proposed technique performed well when compared with other neural networks, a multi-layer perceptron and an extreme learning machine.

III. MATERIALS AND METHODS
This research aims to analyze the performance of data mining techniques on rainfall prediction in Lahore city using a classification framework (Fig. 1). The dataset used in this research consists of several attributes along with the known output class. The output class is the one to be predicted on the basis of the other available attributes. The output class is included in the dataset alongside the other features so that the performance and accuracy of the data mining techniques can be analyzed [20], [24]. The output after processing is compared with the known class, and performance is measured in terms of precision, recall and f-measure [1], [20], [21], [24], [26]. Weka [22], [23] is used in this study for classification and performance analysis. It is one of the most extensively used data mining software packages. Weka is developed in Java at the University of Waikato, New Zealand. It is widely accepted among students and researchers due to its easy-to-use GUI, portability and General Public License.
The classification framework used in this research consists of four stages: selection of an appropriate dataset, pre-processing, prediction and simulation of results. The input dataset for rainfall prediction is obtained from a weather forecasting website [10] and consists of several atmospheric attributes. The name, type and measurement unit of the selected attributes are given in Table I. The dataset contained missing values, as shown in Table II. Incomplete data can affect the accuracy of the results, as an attribute with a missing value cannot fully participate in the prediction process. Besides the missing values, the dataset also contained noise, where values fall below or exceed certain limits. For effective data mining results it is recommended to keep the values within certain limits [1], [11].
Pre-processing of the input data is a crucial stage in the classification framework, which helps ensure high accuracy of the mining results. This stage consists of two activities: cleaning and normalization. The cleaning process deals with the missing values by using an average mechanism, in which the sum of all the instances of the selected attribute is divided by the number of samples. The normalization process deals with the noise by limiting the values to a specific interval. Such an interval facilitates the prediction process, as the values are mapped onto a particular range. In this research the normalization process is performed in Weka.
Prediction is the final stage of the classification framework, where the data mining algorithms perform classification by exploring the hidden patterns. The performance of any supervised machine learning technique can be analyzed by comparing the output with the known class (pre-classified data). Performance evaluation of the data mining techniques is performed with 10 proportions (10:90-90:10) of training data and test data. For comparative analysis, three evaluation parameters of information retrieval are used: Precision, Recall and F-Measure.
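The two pre-processing activities described above can be sketched in a few lines. This is a minimal illustration and not the Weka procedure used in the study: the function names, the sample humidity readings and the target interval [0, 1] are assumptions made for the example, and the mean is assumed to be taken over the observed (non-missing) values only.

```python
from statistics import mean


def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)  # sum of observed values / number of observed values
    return [fill if v is None else v for v in column]


def min_max_normalize(column, low=0.0, high=1.0):
    """Linearly map the values of one attribute onto [low, high]."""
    smallest, largest = min(column), max(column)
    if largest == smallest:
        # A constant attribute has no spread; map every value to `low`.
        return [low for _ in column]
    return [low + (v - smallest) / (largest - smallest) * (high - low)
            for v in column]


# Hypothetical relative-humidity readings (%), one of them missing.
humidity = [40.0, None, 60.0, 80.0]
cleaned = impute_mean(humidity)          # missing value filled with 60.0
normalized = min_max_normalize(cleaned)  # values mapped onto [0, 1]
print(cleaned)
print(normalized)
```

Because the filled-in value is the attribute mean, it always lies inside the observed range, so imputation does not change the minimum and maximum used by the normalization step.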
The aim of precision is to evaluate the True Positive (TP) entities with respect to the False Positive (FP) entities. It can be calculated as follows:
$$\text{Precision} = \frac{TP}{TP + FP}$$
TP is used for the entities which are correctly classified, and FP is for those entities which are wrongly classified.
The aim of recall is to evaluate the True Positive (TP) entities with respect to the False Negative (FN) entities, which are not classified at all. It can be calculated as follows:
$$\text{Recall} = \frac{TP}{TP + FN}$$
There may be cases where performance cannot be judged from precision and recall alone. For example, if one mining algorithm has higher precision but lower recall than another, the question arises of which algorithm is better. The solution to this issue is to use the F-measure, which provides the average (harmonic mean) of precision and recall. F-measure can be computed as below:
$$\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
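The three measures can be computed directly from the TP, FP and FN counts of a class. The sketch below is illustrative only: it assumes the F-measure is the harmonic mean of precision and recall, and it adopts the convention (an assumption, not taken from the study) of returning 0 when a denominator is zero. The scores for classifiers A and B are hypothetical.

```python
def precision(tp, fp):
    """TP / (TP + FP); returns 0.0 when nothing was assigned to the class."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def recall(tp, fn):
    """TP / (TP + FN); returns 0.0 when the class has no instances."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


def f_measure(p, r):
    """Harmonic mean of precision p and recall r; 0.0 when both are zero."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0


# Hypothetical case from the text: A has higher precision, B has higher recall.
f_a = f_measure(0.9, 0.5)  # classifier A: precision 0.9, recall 0.5
f_b = f_measure(0.7, 0.7)  # classifier B: precision 0.7, recall 0.7
print(f_a, f_b)
print("B ranks higher" if f_b > f_a else "A ranks higher")
```

The harmonic mean penalizes a large gap between precision and recall, so the balanced classifier B ranks above A even though A has the higher precision.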

IV. RESULTS AND DISCUSSION
With SVM the results are almost the same (Table III, Fig. 2) in all three accuracy parameters (precision, recall and f-measure). For the no-rain class, the results with the first seven proportions, from 10:90 to 70:30, are 0.941, 1 and 0.955 in precision, recall and f-measure respectively; minor improvements were seen when proportions 80:20 and 90:10 were used. Notably, the result for the rain class is 0 with all proportions in all accuracy parameters, which means that this technique could not correctly classify a single instance of the rain class, even with the 90:10 ratio.
The results with KNN are shown in Table V and Fig. 4. With the no-rain class, 80:20 and 90:10 performed better in precision, 10:90 in recall and 90:10 in f-measure. With the rain class, 10:90 performed better in precision, 80:20 in recall and 90:10 in f-measure.

A. Critical Analysis
The data mining techniques used in this study showed good results for the no-rain class in all accuracy measures (precision, recall and f-measure); however, for the rain class these techniques did not perform well and the results are not accurate enough. F-measure is a high-quality accuracy measure as it provides the average of precision and recall. Table VIII is arranged according to the highest f-measure score of each mining technique, along with its class and proportion. There could be several reasons for the lower results with the rain class, such as missing values (as the mean value cannot reflect the actual one), the absence of one or more important climatic attributes and, most importantly, the lower rainfall rate in the city. Due to climate variations, the rainfall rate in most locations is not what it used to be. Moreover, the dataset does not include the rainfall quantity; it only includes the rainfall polarity (yes/no). So the data reflects the number of times it rained but not how much. There might be only one rainy day in a week, but that day might have been catastrophic with extreme rainfall. With an overall lower rainfall rate (number of times it rained), fewer patterns were available to the classification algorithms, which resulted in poor performance with the rain class, whereas for the no-rain class more patterns were available for training, which resulted in high accuracy.
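The effect of a low rainfall rate on per-class scores can be illustrated with a toy example. The sketch below is not a reconstruction of any classifier from this study: the class counts (94 no-rain days, 6 rain days) are hypothetical, and the classifier is a stand-in that always predicts the majority class. Returning 0 for an undefined ratio is likewise an assumed convention.

```python
def per_class_scores(actual, predicted, positive):
    """Return (precision, recall, f-measure) treating `positive` as the class of interest."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    prec = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    rec = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f = 2 * prec * rec / (prec + rec) if (prec + rec) > 0 else 0.0
    return prec, rec, f


# Hypothetical, heavily imbalanced test set: 94 no-rain days and 6 rain days.
actual = ["no-rain"] * 94 + ["rain"] * 6
# Stand-in classifier that always predicts the majority class.
predicted = ["no-rain"] * len(actual)

print(per_class_scores(actual, predicted, "no-rain"))  # high scores
print(per_class_scores(actual, predicted, "rain"))     # (0.0, 0.0, 0.0)
```

Even this trivial predictor scores highly on the no-rain class while scoring zero on the rain class, which shows how scarce rain instances can produce the kind of gap between the two classes discussed above.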

TABLE III. SVM RESULTS

The results with Naive Bayes are shown in Table IV and Fig. 3. It can be seen that with the no-rain class, 10:90 performed better in precision, 30:70 and 40:60 in recall and 90:10 in f-measure. With the rain class, 40:60 performed better in precision, 10:90 in recall and 50:50 in f-measure.

TABLE IV. NB RESULTS

TABLE VI. DECISION TREE RESULTS

The results with MLP are shown in Table VII and Fig. 6. It can be seen that with the no-rain class, 80:20 performed better in precision, 90:10 in recall and 90:10 in f-measure. With the rain class, 90:10 performed better in precision, 60:40 in recall and 80:20 in f-measure.

TABLE VII. MLP RESULTS

TABLE VIII. DM TECHNIQUES WITH HIGHEST F-MEASURE

V. CONCLUSION AND FUTURE WORK
This research performed rainfall prediction in Lahore city using five data mining techniques: Support Vector Machine, Naïve Bayes, k Nearest Neighbor, Decision Tree and Multilayer Perceptron. Twelve years of past weather data, from December 1, 2005 to November 31, 2017, is used for prediction in this research. Performance analysis of the data mining techniques is performed using three accuracy measures (precision, recall and f-measure), and the results are presented in tables and graphs. For effective prediction, a classification framework is used in which the input data goes through a pre-processing stage and is cleaned and normalized before classification. To analyze how the performance of the classification techniques depends on the training data, ten ratios of training and test data (training data: test data) are used, from 10:90 to 90:10. According to the results, the classification techniques performed well for the no-rain class; however, for the rain class they did not perform well. The reasons behind the lower accuracy for the rain class may include missing values, the absence of important climatic attributes in the dataset and the overall lower rate of rainfall in the city. It is suggested for future work that further predictions should be performed by exploring more classification techniques and climatic attributes on different weather data.