Performance of Data Reduction Algorithms for Wireless Sensor Network (WSN) using Different Real-Time Datasets: Analysis Study

This paper investigates the effect of data reduction methods on the performance of a Wireless Sensor Network (WSN) using a variety of real-time datasets. Simulation tests are carried out in MATLAB for several methods of reducing the quantity of transmitted data. These approaches are data reduction based on Neural Network Fitting (NNF), Neural Network Time Series (NNTS), Linear Regression with Multiple Variables (LRMV), "An Efficient Data Collection and Dissemination (EDCD2)", and Fast Independent Component Analysis (FICA). The selected algorithms NNF, NNTS, EDCD2, LRMV, and FICA are evaluated using real-time datasets. The performance indicators are energy consumption, data accuracy, and data reduction percentage. The results show that the selected algorithms help to reduce the amount of data transferred and the energy consumed, but each algorithm performs differently depending on the dataset used.
Keywords: Data reduction algorithms; WSN; energy consumption; accuracy; neural network; independent component analysis


I. INTRODUCTION
A Wireless Sensor Network (WSN) is a network that collects data from spatially isolated sensors. Sensor nodes are used to monitor and record environmental variables, such as sound, pollution level, humidity, temperature, and wind, and then send the sensed data to the base station [1] [2]. A sensor node in a WSN application is powered by a battery with a limited service life [3]. Furthermore, sensor nodes with multivariable sensors can shorten battery life because the node must support additional data transmission, causing the battery to drain faster than in a sensor node with a single sensor [4]. Therefore, many researchers have proposed approaches to reduce the amount of data transmitted at the sensor node level, which helps to prolong battery lifetime. For example, in a WSN, the spatial and temporal correlation in the generated traffic can be used to reduce the energy consumption of continuous sensor data acquisition. Spatial-temporal correlation is used in dual prediction (DP) and data compression (DC) techniques to reduce the number of transmissions and so save energy and bandwidth. In [5], the author used these two techniques as part of a two-stage data reduction scheme. The DP technique reduces traffic between cluster nodes and cluster heads, while the DC scheme reduces traffic between cluster heads and sink nodes.
In [6], the author proposed a data-aware energy-saving technique. The inherent correlation between consecutive measurements of a sensor node and the similarity of data trends between adjacent sensor nodes are used to reduce data transmissions. The forecast-based data collection framework reduces temporal data redundancy. An "Autoregressive Integrated Moving Average (ARIMA)" model was used to predict data, and the proposed model was implemented in the cluster head (CH) node.
In [7], the author proposed a novel technique for secure data prediction in WSN by using a Time Series Trust Model (TSTM) based on the Toeplitz matrix and a trust-based autoregressive process (TAR). The author proposed an adaptive data reduction method (AM-DR) in [8]. AM-DR is based on a convex combination of two decoupled Least-Mean-Square (LMS) window filters of different widths for predicting the next readings at both the source and sink nodes.
In [9], the authors evaluated the performance of several methods based on computational intelligence to decrease the payload of every packet sent from the sensor node to the base station. These approaches are data reduction based on "artificial neural networks (DR-ANN)", independent component analysis (DR-ICA), and a deep learning regression method called "DR-GDMLR".
In [10], two multivariate data reduction methods with adaptive thresholds were proposed: a Principal Component Analysis Based (PCA-B) method and a Multiple Linear Regression Based (MLR-B) method. PCA-B uses "Candid Covariance-free Incremental PCA (CCIPCA)" with an adaptive threshold, and to reach a high reduction ratio the number of Principal Components (PCs) is set to 1. MLR-B decreases the amount of sensed payload data by using a multiple linear regression (MLR) model. The authors used an adaptive threshold to retrain the model. According to [10], after updating the reference parameters of the model, the size of the transmitted data is greater than or equal to the size of the unreduced payload data. This means that the sensor node needs more power during the update phase than during the reduction phase. The article recommends a new indicator for evaluating the performance of data reduction models, which considers how often the reference parameters of the model are updated.
649 | Page www.ijacsa.thesai.org
A novel simple scheme called "Adaptive Real-Time Payload Data Reduction Scheme (APRS)" is proposed in [4]. The purpose of APRS is to decrease the size of the payload transferred by the sensor nodes. Further details on data reduction approaches for sensor nodes are given in [11].
In this study, the effect of data reduction approaches on WSN performance is investigated using a set of real-time datasets. Simulation tests are performed in MATLAB for different approaches to decreasing the amount of transferred payload data. The selected algorithms NNF, NNTS, EDCD2, LRMV, and FICA are evaluated using real-time datasets. The performance indicators are energy consumption, data accuracy, and data reduction percentage.
The organization of the article is as follows: Section I presents the introduction and related work of this study and the main contributions. The selected data reduction algorithms are described in Section II. Section III explores the real-time datasets used in this study. Section IV describes the performance metrics used in this study to evaluate the algorithms. Section V presents the study simulation and results. Lastly, Section VI concludes the outcome of the study.

A. Data Reduction Based on the Neural Network Fitting (NNF) Algorithm
The NNF model provided by MathWorks [12] solves a data fitting problem using a two-layer feed-forward network. It helps in selecting the data, partitioning it into training, validation, and testing sets, defining the network architecture, and training the network.
This section describes in detail how the NNF model is applied to reduce the size of the transferred data. Fig. 1 shows the general block diagram of WSN data reduction based on the NNF algorithm. In the training phase, the sensor S1(t) with the highest correlation attribute is first selected as the input of the NNF model, and the other sensor features S2(t) and S3(t) are used as its output targets. The main objective of training NNF is to predict the values PS2(t) and PS3(t) from the single input sensor S1(t) during the reduction phase, so that the sensor node needs to transmit less data. The detailed description of the NNF-based data reduction algorithm is given in the following pseudocode.
Input: S1(t), S2(t), S3(t) // sensor values, real-time data
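Independently of the authors' pseudocode, the reduction phase described above can be illustrated with a minimal Python sketch. It assumes a network with one input, a small tanh hidden layer, and two linear outputs. The layer size, the weight values, and the names nnf_predict, node_reduce, and bs_reconstruct are hypothetical and chosen only for illustration; real weights would come from the training phase, and any input or output scaling is omitted.

```python
import math

# Hypothetical weights for illustration only; real values come from training.
W_HIDDEN = [0.5, -0.3]             # input -> hidden weights (2 hidden units)
B_HIDDEN = [0.0, 0.1]              # hidden-layer biases
W_OUT = [[1.0, 0.0], [0.0, 2.0]]   # hidden -> output weights (rows: PS2, PS3)
B_OUT = [20.0, 40.0]               # output-layer biases


def nnf_predict(s1):
    """Forward pass of a two-layer feed-forward network: S1 -> [PS2, PS3]."""
    hidden = [math.tanh(w * s1 + b) for w, b in zip(W_HIDDEN, B_HIDDEN)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(W_OUT, B_OUT)]


def node_reduce(s1, s2, s3):
    """Reduction phase at the sensor node: only S1(t) is transmitted."""
    return s1


def bs_reconstruct(s1):
    """Base-station side: keep S1(t) and predict PS2(t) and PS3(t) from it."""
    ps2, ps3 = nnf_predict(s1)
    return s1, ps2, ps3
```

With three equally sized sensor values, sending only S1(t) means that one value out of three is transmitted per sample.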

B. Data Reduction Based on the Neural Network Time Series (NNTS) Algorithm
The NNTS prediction model is provided by MathWorks [12]. NNTS is a type of dynamic filtering that uses past values of one or more time series to predict future values. Dynamic neural networks containing tapped delay lines are used for nonlinear filtering and prediction.
This section describes in detail the NNTS algorithm used to reduce the amount of data transmitted. Fig. 2 shows the general block diagram of WSN data reduction based on the NNTS algorithm. In the training phase, the sensor S1(t) with a high correlation attribute is first selected as the input of the NNTS model, and the other sensor features S2(t) and S3(t) are used as its output targets. The main objective of training NNTS is to predict the values PS2(t) and PS3(t) from the single input sensor S1(t) during the reduction phase, where S1(t-1) and S1(t-2) are the last two received values of sensor S1(t). In this way, the sensor node needs to transmit less data. The detailed description of the NNTS-based data reduction algorithm is given in the following pseudocode.
Input: S1(t), S2(t), S3(t) // sensor values, real-time data
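Independently of the authors' pseudocode, the role of the delayed inputs S1(t-1) and S1(t-2) can be illustrated with a minimal Python sketch of a tapped delay line at the base station. The class TappedDelayLine, the function bs_reconstruct, and the stand-in model toy_predict are hypothetical names; toy_predict is only a placeholder for a trained NNTS model and does not represent its behavior.

```python
from collections import deque


class TappedDelayLine:
    """Keeps the current and the last two received values of S1."""

    def __init__(self, delays=2, initial=0.0):
        # Assumption: the history is pre-filled with a constant before data arrives.
        self._buf = deque([initial] * (delays + 1), maxlen=delays + 1)

    def push(self, s1):
        """Insert S1(t) and return [S1(t), S1(t-1), S1(t-2)]."""
        self._buf.appendleft(s1)  # the oldest value drops off the right end
        return list(self._buf)


def toy_predict(features):
    """Hypothetical stand-in for the trained model: maps the delay vector to (PS2, PS3)."""
    avg = sum(features) / len(features)
    return avg + 1.0, avg * 2.0


def bs_reconstruct(tdl, s1, predict):
    """Base-station side: update the delay line, then predict PS2(t) and PS3(t)."""
    ps2, ps3 = predict(tdl.push(s1))
    return s1, ps2, ps3
```

The only difference from a static fitting model in this sketch is the input vector: the predictor sees the current value together with the two most recent received values rather than S1(t) alone.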

C. Data Reduction Based on the Linear Regression with Multiple Variables (LRMV) Algorithm
In statistics, linear regression is a linear approach to modeling the relationship between a scalar response and one or more explanatory variables (also referred to as dependent and independent variables). The theoretical concept of linear regression with multiple variables is explained in detail by Andrew Ng in [13].
This section describes in detail how the LRMV algorithm is applied to reduce the amount of data transferred. Fig. 3 shows the general block diagram of WSN data reduction based on the LRMV algorithm, where the sensor S1(t) is assigned as the dependent variable of the LRMV model, and the other sensor features S2(t) and S3(t) are assigned as its predictor (independent) variables during the training phase. The aim of training LRMV is to predict the value PS1(t) from the sensors S2(t) and S3(t) during the reduction phase. The LRMV parameters are theta (θ), mean (mu), and standard deviation (SSDV). As mentioned earlier, LRMV is used to decrease the size of the data transmitted by the sensor node. The detailed description of the LRMV-based data reduction algorithm is given in the following pseudocode.
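Independently of the authors' pseudocode, the way the three parameter sets (theta, mu, and the standard deviation) could be used at the base station is illustrated by the following minimal Python sketch. It assumes the predictors are standardized with the training mean and sample standard deviation before the linear model is applied, as is common for multivariate linear regression. The function names and the theta values are hypothetical; a real theta would be obtained in the training phase.

```python
import statistics


def fit_scaling(columns):
    """Compute mu and the sample standard deviation for each predictor column."""
    mu = [statistics.mean(col) for col in columns]
    sd = [statistics.stdev(col) for col in columns]
    return mu, sd


def lrmv_predict(s2, s3, theta, mu, sd):
    """Predict PS1(t) from S2(t) and S3(t) using standardized features."""
    x2 = (s2 - mu[0]) / sd[0]
    x3 = (s3 - mu[1]) / sd[1]
    # theta[0] is the intercept; theta[1] and theta[2] weight the two predictors.
    return theta[0] + theta[1] * x2 + theta[2] * x3


# Example with made-up training columns and hypothetical (untrained) theta.
mu, sd = fit_scaling([[10.0, 20.0, 30.0], [1.0, 2.0, 3.0]])
ps1 = lrmv_predict(30.0, 1.0, [5.0, 2.0, 3.0], mu, sd)
```

In this arrangement the node would transmit S2(t) and S3(t) only, and the base station would estimate PS1(t) from them.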

D. Data Reduction Based on the EDCD2 Algorithm
EDCD2 is a scheme for bringing up-to-date measured data to the base station (BS) [14]. It is used to decrease the number of packets transferred from nodes with multiple sensors. It should be noted that there are two versions of EDCD: EDCD1 and EDCD2, for sensor nodes with one sensor and with multiple sensors, respectively. This section describes in detail how EDCD2 is applied to reduce the size of the transferred data. Fig. 4 shows the general block diagram of WSN data reduction based on the EDCD2 algorithm. The basic idea of EDCD2 is to avoid transmitting the sensed data if the relative difference between the currently sensed data S(t) and the last transmitted data S(t-1) is smaller than the threshold value β for all sensors of the same node; otherwise, the sensed data S(t) is transmitted to the BS. The detailed description of the EDCD2-based data reduction algorithm is given in the pseudocode below.
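As a rough illustration of the threshold rule just described, and independently of the authors' pseudocode, the following minimal Python sketch decides which samples a multi-sensor node would transmit. The function names, the choice to always send the first sample, and the handling of a zero reference value are assumptions made for this sketch.

```python
def relative_difference(s_now, s_prev):
    """Relative change of a reading with respect to the last transmitted one."""
    if s_prev == 0:
        # Assumption: any change from a zero reference counts as unbounded.
        return 0.0 if s_now == 0 else float("inf")
    return abs(s_now - s_prev) / abs(s_prev)


def edcd2_filter(samples, beta):
    """Return the (t, sample) pairs that would be transmitted to the BS.

    samples: sequence of tuples, one reading per sensor of the node.
    beta:    one relative threshold per sensor.
    """
    last_sent = None
    sent = []
    for t, sample in enumerate(samples):
        # Suppress only if every sensor changed by less than its threshold.
        changed = last_sent is None or any(
            relative_difference(s, p) >= b
            for s, p, b in zip(sample, last_sent, beta)
        )
        if changed:
            sent.append((t, sample))
            last_sent = sample
    return sent


readings = [(20.0, 50.0), (20.1, 50.2), (22.5, 50.0), (22.6, 50.1)]
transmitted = edcd2_filter(readings, beta=(0.05, 0.05))
```

In this example two of the four samples are transmitted. The comparison is always made against the last transmitted sample, not the last sensed one, so slow drifts eventually trigger a transmission.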
Input: [...] for each sensor and [...]
Output: [...]
Begin:
Set S(t-1) ← last measured value transmitted by the sensor
Read: the sensor value S(t) at time t
Set S(t) [...]

E. Data Reduction Based on the Fast Independent Component Analysis (FICA) Algorithm
Like most ICA algorithms, FICA [16] searches for an orthogonal rotation of the previously whitened data through a fixed-point iteration scheme that maximizes a measure of the non-Gaussianity of the rotated components. This section provides a detailed description of how the FICA algorithm is applied to reduce the amount of data transferred. Fig. 5 shows the general block diagram of WSN data reduction based on the FICA algorithm, which consists of two phases: the reduction phase at the sensor node level and the approximation phase at the BS level. The main objective of training the FICA model is to determine the reference parameters R(1×r), which are stored on the sensor node, with an identical copy stored at the BS. At the node level, the new data S1(t), S2(t), and S3(t) acquired in real time are reduced by applying the FICA algorithm before transmission, and the reduced data D(t) is then sent to the BS. After that, approximations PS1(t), PS2(t), and PS3(t) of the originally acquired data are estimated in the approximation phase at the BS by applying FICA with the same reference parameters R(1×r) used to reduce the node-level data. As mentioned earlier, FICA is used to reduce the packet size of the sensor node. The detailed description of the FICA-based data reduction algorithm is given in the following pseudocode.

III. REAL-TIME DATASETS
The considered algorithms are evaluated on different benchmark real-time datasets, as described in the following subsections. It is important to note that usually only part of the data from specific nodes of these datasets is used to assess the performance of current data reduction methods in WSN [17] [23]. The reason is that most data reduction methods focus on reducing the amount of transferred data without considering how this data is forwarded to the CH/BS. In other words, they assume that the sensor nodes can transmit the sensed data directly to the CH/BS. The selected algorithms NNF, NNTS, EDCD2, LRMV, and FICA are evaluated on the real-time datasets described below:

C. Data 3-GSB
Data3-"Grand St. Bernard (GSB)" is a WSN dataset that was gathered by deploying 23 sensors to observe the environmental characteristics at the "Grand Saint Bernard Pass" between Switzerland and Italy. The sensors measure relative humidity, surface temperature, and ambient temperature [26].

D. Data 4-Intel
Data4-"Intel Berkeley Research Lab (IBRL)" is a WSN dataset that was gathered by deploying 54 Mica2Dot sensor nodes at "Intel's research Lab", University of Berkeley. The WSN includes various sensors: voltage, temperature, light, and humidity [27]. Fig. 9 shows the structure of Data 4-Intel, and Tables VI to XX give some examples of sensor values for all the nodes used.

A. Accuracy
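Accuracy is the overall average of the absolute error for all selected nodes from the same dataset, as defined below:
$$\mathrm{Accuracy} = \frac{1}{L}\sum_{k=1}^{L}\mathrm{AEPN}(k), \qquad \mathrm{AEPN}(k) = \frac{1}{N}\sum_{i=1}^{N}\mathrm{AEPS}(i), \qquad \mathrm{AEPS}(i) = \frac{1}{M}\sum_{j=1}^{M}\left| SV_i(j) - RV_i(j) \right| \qquad (1)$$
where k = {1, 2, ..., L}, L is the number of nodes, AEPN(k) is the average absolute error for all samples transmitted by node k, SV is the sensor value at the sensor node, RV is the received value at the BS, AEPS(i) is the mean absolute error for sensor i, i = {1, 2, ..., N}, N is the number of sensors, M is the number of samples transmitted by node k, and j indexes those samples.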
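The definitions above suggest a three-level average: over samples for each sensor, over sensors for each node, and over nodes for the dataset. A minimal Python sketch under that assumption follows; the function names mirror the symbols AEPS and AEPN, and the nested-list data layout is a hypothetical choice.

```python
def aeps(sensor_values, received_values):
    """Mean absolute error for one sensor over the M samples of a node."""
    errors = [abs(sv - rv) for sv, rv in zip(sensor_values, received_values)]
    return sum(errors) / len(errors)


def aepn(node):
    """Average of AEPS over the N sensors of one node.

    node: list of (SV series, RV series) pairs, one pair per sensor.
    """
    return sum(aeps(sv, rv) for sv, rv in node) / len(node)


def accuracy(nodes):
    """Overall average of AEPN over the L selected nodes."""
    return sum(aepn(node) for node in nodes) / len(nodes)


# Made-up values for two nodes with two sensors and two samples each.
node1 = [([1.0, 2.0], [1.0, 3.0]), ([10.0, 10.0], [12.0, 10.0])]
node2 = [([5.0, 5.0], [5.0, 5.0]), ([0.0, 0.0], [1.0, 1.0])]
overall = accuracy([node1, node2])
```

Because the metric is an error, a lower value means better accuracy, which is how the results in Section V are read.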

B. Data Reduction Ratio
where DR is the data reduction ratio, computed from the size of the transferred data after reduction and the size of the unreduced transmission samples.
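A minimal Python sketch of this ratio follows, assuming it is the percentage of the unreduced size that no longer has to be transmitted. This is a common definition, stated here as an assumption; the function and parameter names are illustrative only.

```python
def data_reduction_ratio(size_after, size_unreduced):
    """Percentage of the unreduced data that is not transmitted."""
    if size_unreduced <= 0:
        raise ValueError("size_unreduced must be positive")
    return (1.0 - size_after / size_unreduced) * 100.0


# Transmitting one or two of three equally sized values per sample.
one_of_three = data_reduction_ratio(1, 3)
two_of_three = data_reduction_ratio(2, 3)
```

Under this definition, sending one value out of three gives a ratio of about 67%, and sending two out of three gives about 33%.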

C. Total Energy Consumption
where TE_Directly is the total energy consumed in the case of direct transmission, D_s is the mean data size, N_ofS is the mean number of samples, CEP_Byte is the mean energy cost per byte, and DR is the mean data reduction.
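One plausible reading of the symbols listed above, stated here purely as an assumption, is that the direct-transmission energy is the product of data size, number of samples, and energy cost per byte, and that the energy with reduction scales this by the fraction of data still transmitted. The minimal Python sketch below follows that reading; the function names, units, and example numbers are hypothetical.

```python
def total_energy_direct(data_size_bytes, num_samples, cost_per_byte):
    """Assumed form: TE_Directly = D_s * N_ofS * CEP_Byte."""
    return data_size_bytes * num_samples * cost_per_byte


def total_energy_reduced(te_direct, dr_percent):
    """Assumed form: energy scales with the share of data still transmitted."""
    return te_direct * (1.0 - dr_percent / 100.0)


# Made-up example: 12-byte samples, 1000 samples, 0.5 energy units per byte.
te_direct = total_energy_direct(12, 1000, 0.5)
te_reduced = total_energy_reduced(te_direct, 50.0)
```

This sketch ignores any energy spent on computing the reduction or on updating model parameters, which the related work in Section I identifies as a real cost.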

V. SIMULATION AND RESULTS
Fig. 10 shows the accuracy of the applied algorithms EDCD2, FICA, NNF, NNTS, and LRMV for all selected nodes N1, N2, N5 from the DATA1-AIRQ dataset. The EDCD2 algorithm has the best accuracy compared to the other algorithms FICA, NNF, NNTS, and LRMV, because its average total absolute error over all nodes is the lowest, at 5.48. The FICA and LRMV algorithms have the worst accuracy, with average absolute errors of 62.30 and 20.13, respectively. Tables A1 to A5 (see Appendix) show the average absolute error for all samples transmitted by the nodes (N1-N5) for the applied algorithms EDCD2, FICA, NNF, NNTS, and LRMV, respectively.
Fig. 10. Accuracy of Applying Various Algorithms for all Selected Nodes from DATA1-AIRQ.
Fig. 11 shows the accuracy of the applied algorithms EDCD2, FICA, NNF, NNTS, and LRMV for all selected nodes N1, N2, N5 from the DATA2-ARHO dataset. The EDCD2 algorithm again has the best accuracy, with the lowest average total absolute error of 0.199 over all nodes. The NNTS and NNF algorithms have the worst accuracy, with average absolute errors of 5.38 and 5.62, respectively. In summary, EDCD2 is a threshold-based data reduction algorithm: it transmits measurement data only when the relative difference between the current measurement data and the last transmitted data is larger than the threshold value.
Fig. 12 shows the accuracy of the applied EDCD2, FICA, NNF, NNTS, and LRMV algorithms for all selected nodes N1, N2, N5 from the DATA3-GSB dataset. The EDCD2 algorithm has the best accuracy compared with the other algorithms FICA, NNF, NNTS, and LRMV, because its overall average absolute error over all nodes is the lowest, at 0.30. The FICA and NNF algorithms have the worst accuracy, with average absolute errors of 3.84 and 1.34, respectively.
Fig. 14 shows the average data reduction percentage obtained by applying the algorithms to the different datasets. The studied algorithms are EDCD2, FICA, NNF, NNTS, and LRMV. The selected datasets are Data1-AirQ, Data2-ARHO, Data3-GSB, and Data4_Intel. For Data1-AirQ, the average data reduction percentage of the EDCD2, FICA, NNF, NNTS, and LRMV algorithms is 33%, 33%, 67%, 67%, and 33%, respectively. The NNF and NNTS algorithms have the highest data reduction. Referring to Fig. 10, both NNF and NNTS have acceptable accuracy, while EDCD2 shows the lowest error.
For Data2-ARHO, the average data reduction percentage of the NNF, NNTS, EDCD2, LRMV, and FICA algorithms is 67%, 67%, 56%, 33%, and 67%, respectively. Although the NNF and NNTS algorithms achieve the highest data reduction ratio, they also have the highest error and the worst accuracy, as shown in Fig. 11.
For Data3-GSB, the average data reduction percentage of the NNF, NNTS, EDCD2, LRMV, and FICA algorithms is 67%, 67%, 67%, 33%, and 33%, respectively. The NNF, NNTS, and EDCD2 algorithms have the highest data reduction, and Fig. 12 shows that EDCD2 has the lowest error. FICA has the worst accuracy, with the highest errors.
For Data4_Intel, the average data reduction percentage of the NNF, NNTS, EDCD2, LRMV, and FICA algorithms is 50%, 50%, 83%, 25%, and 75%, respectively. It is worth noting that the EDCD2 algorithm achieves the highest data reduction. Referring to Fig. 14 and Tables XXI and XXII, the NNF, NNTS, EDCD2, and LRMV algorithms have acceptable accuracy, and FICA shows the highest error.

VI. CONCLUSION
The impact of data reduction methods on WSN performance is investigated in this paper using a set of real-time datasets. Simulation tests are performed in MATLAB for different methods of reducing the amount of data sent. The selected algorithms NNF, NNTS, EDCD2, LRMV, and FICA are evaluated using real-time datasets. The performance metrics measured are energy consumption, data accuracy, and percentage of data reduction. The results of the study show that the selected algorithms help to reduce the amount of transmitted data and the energy consumption, and that each algorithm performs differently depending on the dataset used.