Dynamic Data Aggregation Approach for Sensor-Based Big Data

Sensors are used in thousands of applications such as agriculture, health monitoring, air and water pollution monitoring, and traffic monitoring and control. As these applications collect zettabytes of data every day, sensors play an integral role in big data. However, most of these data are redundant and useless. Thus, efficient data aggregation and processing are important for reducing redundant and useless data in sensor-based big data frameworks. Current studies on big data analytics do not focus on aggregating and filtering data at multiple layers of big data frameworks, especially at the lower level, at the data collecting nodes (sensors), which would reduce the processing overhead at the upper layer, i.e., the big data server. Thus, this paper introduces a multi-tier data aggregation technique for sensor-based big data frameworks. While this work focuses more on data aggregation at sensor networks to achieve energy efficiency, it also demonstrates that efficient data processing at the lower layer (sensors) significantly reduces the overall energy consumption of the network and the data transmission latency.

Keywords: Data aggregation; big data; sensor networks; energy efficiency; clustering


I. INTRODUCTION
The time of the spreadsheet is over. A Google search, a barcode scan, a voice message, a picture of a car and a tweet, among others, all contain data that can be collected, analyzed and monetized. Indeed, today we manage and store our lives online. Data are gathered from smartphones, laptops and tablets that collect and transfer information on what people do. However, this is just the beginning. Most devices, including our TVs, watches and even washing machines, will collect and transmit messages. With the growing amount of information that exceeds quintillions of bytes, new machines and techniques more powerful than the normal computer had to be created to allow us to make sense of the zeros and ones. Supercomputers and various algorithms have so far helped in the real-time analysis of those increasingly large amounts of information. Nevertheless, for more efficient data mining, one always has to be on the lookout for new methods.
The term big data refers to large volumes of data sets. In the last few years, with the increase in the amount of digital information around us, the term has gained in popularity. Many professionals in the field are working on finding better data mining methods to cope with future demand. Sensors, mobile phones and other devices all generate big data. One may question what the advantage of collecting so much information is and how it can be useful for a company. The simplest example to answer such a question is that of grocery stores and supermarkets. These stores offer various promotions and discounts to customers who use their cards, such as Air Miles and the Optimum card. These cards generate big data in the form of collected information on demand and supply, among various parameters stated in the contract signed by the customer. All this information is gathered and, once processed, helps companies improve their businesses in various ways. Indeed, the primary goal of collecting these huge datasets is to look for meaningful patterns through optimal processing. The emergence of sensor networks also plays a major role in the rise of big data, as thousands of sensor network applications collect huge amounts of data that require processing. Hence, sensor data processing can be considered a part of big data processing. As sensors produce redundant data, we can aggregate the data to reduce them and represent them in a meaningful way in a big data framework. However, the works on big data presented in [9]-[13] do not address sensor-based big data aggregation; they mostly discuss the architecture and network theory of big data, data mining, and applications of big data.
As sensor-based big data aggregation is an important area of research for reducing computational cost as well as energy consumption, this paper introduces a sensor data aggregation approach for a multi-tier big data framework. The proposed aggregation approach is designed in three layers to ensure that sensor data aggregation is facilitated at the lowest layer. As the proposed communication framework consists of only three layers of communication and processing devices (i.e., sensors, a gateway node that connects to the Internet, and the big data server), this data aggregation approach has three layers.
The proposed data aggregation approach allows both cluster-based and tree-based network topologies and is thus considered a hybrid data aggregation approach. Clustering is used in most sensor network applications; it is especially required for emergency or real-time applications such as rescue operations, health monitoring and traffic monitoring to reduce data transmission latency (which results in reduced data processing delay and overhead at the big data server). On the other hand, the tree-based approach achieves efficiency in non-real-time applications, where achieving energy efficiency is more important than data transmission delay. The proposed approach works by selecting a few nodes that work as active nodes [19] to collect and aggregate data for a certain period of time, until the residual energy of these nodes becomes critical. While most clustering algorithms [1], [4]-[8], [18]-[20] allow all member nodes of a cluster to work actively at any time instant, the proposed approach selects only a few nodes that cover the whole network area to be active at any time instant. The proposed approach allows the other nodes to work as alternative nodes that take over the responsibility of an active node only when that active node fails. This results in fault tolerance and energy efficiency. The rest of the paper is organized as follows.
Section II briefly presents the literature on sensor data aggregation approaches. Section III briefly presents the working principle of the proposed data aggregation approach. Section IV analyzes the performance of the proposed data aggregation approach and compares it with tree and cluster-based approaches in terms of energy consumption and data transmission latency. The experimental (simulation) setup and results are presented in Section V. Finally, the summary of the paper and future work are presented in Section VI.

II. RELATED WORK
Current research on big data analytics includes distributed algorithms to process big data, network architectures and applications of big data, and the MapReduce paradigm that works on big data [9]-[15]. The existing distributed algorithms to process and aggregate big data mostly run on high performance big data servers. These studies [9]-[15] do not consider data aggregation at multiple layers, especially sensor data aggregation at the data collecting side, as a way to reduce computational cost. Hence, we studied the literature on sensor data aggregation and present some of it below, with a view to integrating an improved sensor data aggregation approach into our proposed sensor-based big data framework.
Directed diffusion (DD) is a flat data aggregation approach where a node A broadcasts its interest and a node B that senses data related to the interest message transmits them to A through multiple paths. Later, node A selects the shortest path for further data transmission through a reinforcement packet. However, DD requires a large number of data transmissions. Hence, Cluster diffusion with Dynamic Data Aggregation Approach (CLUDDA) [3], [16] was introduced to propagate events of interest and interest events only between the cluster head and the cluster members. If the cluster head resides far from the cluster members, it consumes a large amount of energy.
Tree-based approaches are good for small networks with fewer nodes. However, these algorithms suffer from a single point of failure, where the failure of a single node disconnects the data transmission path from a leaf node to the root. Among the many tree-based approaches, the energy aware distributed heuristic (EADAT) [17], the power efficient data gathering and aggregation protocol (PEDAP) [18], which is based on a spanning tree to maximize the lifetime of the network, and Power-Aware PEDAP (PEDAP-PA) [18] are the more popular. Chain-based data aggregation techniques, such as the power efficient data gathering protocol for sensor information systems (PEGASIS), have been proposed [20], where each sensor transmits only to its closest neighbor. As this approach does not guarantee the shortest data transmission path from the furthest nodes of the chain to the sink, a multiple-chain scheme is introduced in [20]. Again, this approach does not provide the shortest data transmission distance. Hence, the greedy chain construction algorithm, which constructs the chain by starting at the furthest node from the sink and considering it as the chain head, was proposed in [5]. Every time a non-chain node is added to the chain, this new node is considered as the new chain head, until all nodes are added to the chain.
A multiple chain scheme has also been proposed in [22]. In this approach, the network is divided into four zones and each zone is centered at the node that is closest to the center of the sensing region. A linear chain that ends at the centre node is created for each zone. The multiple chain scheme aims to decrease the total distance over which data are transmitted as nodes broadcast. In the greedy chain construction scheme proposed in [12], the process starts by selecting the chain head. The farthest node from the sink is selected as the chain head. At each step, a non-chain node A is added to the chain head if A is closest to the chain head. The procedure stops when all nodes have been added to the chain. This approach is further improved by adding to the chain, as chain leader, the non-chain node that provides the shortest distance compared to the other nodes if included in the chain as a leader.
In the grid-based data aggregation method [18], each grid has a data aggregator and all sensors in a grid transmit data to the grid aggregator, while in in-network data aggregation, data are aggregated at parent nodes as they are transmitted towards the sink at the root of the tree. The work in [5] presents a hybrid data aggregation scheme that combines the best features of the grid-based and in-network aggregation schemes. The network topology is initially constructed based on the in-network data aggregation approach. Once an event is detected by a sensor, the sensor follows the in-network data aggregation scheme if the data are received from a static sensor application. If the data are from a mobile sensor application, the grid-based approach is used for data aggregation. Among other approaches, the work done in [26] introduces a cluster-based data aggregation approach where the cluster head uses three different approaches to reduce redundant data collected from neighboring nodes (i.e., a huge processing burden on the cluster head), and [27] introduces an identity-based aggregate signature (IBAS) scheme for sensor-based secure data aggregation that provides data integrity and reduces bandwidth usage.
In sensor networks, nodes receive data only when they are in the active state, which introduces the idea of properly utilizing the limited number of active time slots of sensor nodes with the goal of reducing data aggregation latency. The minimum latency aggregation schedule (MLAS) in most duty-cycle WSNs allows a low latency and collision free aggregation schedule. However, this approach uses fixed structure aggregation methods and requires all sensor nodes to be always awake. The work done in [28] introduces a distributed aggregation algorithm for duty-cycle WSNs, in which the aggregation tree and a conflict free schedule are generated simultaneously without using any fixed aggregation structure. The work done in [29] introduces an approximation algorithm to construct a maximum lifetime data aggregation tree that uses an adjustable transmission power level to achieve a higher network lifetime, while most works consider a fixed transmission power. In [30], the authors introduce a cluster-based approach for in-network aggregation. This approach uses an energy efficient routing strategy that uses a multi-path routing tree and performs data fusion and data aggregation at intermediate nodes. While most data aggregation approaches do not consider data security and privacy issues, Vakilinia et al. [31] present a privacy preserving data aggregation/fusion approach for crowdsensing that uses a linear transformation and a homomorphic encryption scheme to obtain secured aggregated data. However, these approaches are complex and computationally expensive.
The work done in [32] presents several data fusion techniques, such as approaches based on neural networks, genetic algorithms, fuzzy logic and particle swarm optimization, a Steiner tree-based approach and data selection-based summation fusion. In [33], Yan M. introduces the Forecast Algorithm of Data Aggregation (FTDA), a data fusion algorithm based on a time prediction model, which predicts the time at which data may differ from the data at the current time. This model has the ability to proactively identify data redundancy and reduce energy consumption. However, the approaches presented in [32], [33] work for small scale sensor networks and require more computational power, and hence leave room for improvement in energy efficiency.
Most approaches presented in this section do not consider selecting a small number of nodes as active nodes and allowing all other nodes to remain in the sleep state (or idle), which would reduce the network energy consumption. They also do not consider the type and priority of data packets for data aggregation. Hence, we introduce a multi-tier data aggregation approach that (1) uses both cluster and tree-based approaches, (2) selects only a few nodes as active nodes while keeping all other nodes in the sleep state, and (3) assigns a type and a priority to each data packet.

III. PROPOSED ARCHITECTURE AND APPROACH
This section presents the high level architecture of the proposed data aggregation framework of big data along with the low level data aggregation and filtering scheme at sensor networks.

A. High Level Architecture
The proposed big data aggregation and filtering framework works in three layers: (1) the lower layer, which aggregates data at the sensors; (2) the middle layer, which aggregates data at the base station; and (3) the upper layer, which aggregates data at the big data server in a distributed manner.
Fig. 1 illustrates such a big data framework, which has only three data communication layers. Sensors at the lower layer sense data and transmit those data to the sink node or base station (BS). Then, the BS processes or aggregates the data and transmits the aggregated data to the central big data server through the Internet. Finally, the big data server aggregates the data by distributing them to commodity computers. Hence, the proposed hybrid data aggregation scheme has three data aggregation layers. The computational efficiency of the big data server at the upper layer depends on data aggregation at the middle and lower layers, as the low power nodes at these layers can aggregate and filter data to some extent, even though the nodes at the upper layer have higher computational power. However, the existing big data aggregation approaches in the literature are mostly designed only for the upper layer, at the big data server. Hence, the computational cost or time at the server is not reduced, as these approaches do not consider any preprocessing of data at the lower layers (such as at the sensors).
By designing an efficient data aggregation approach at the lower level sensor nodes, the overall computational cost at the upper layer big data server can be reduced, which is the objective of this paper, as the data aggregation scheme reduces the volume of sensor data that will be transmitted to the upper layer. Thus, this approach reduces the data aggregation and processing overhead at the upper layer in NoSQL or other non-relational database systems for big data. The upper layer also includes an emergency response centre. The sink or base station at the middle layer transmits emergency or time critical data to the emergency response centre before sending them to the NoSQL database servers for processing/filtering and future storage.
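The middle-layer behaviour described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the packet layout (a dict with `data_type`, `emergency` and `value` keys), the function name `base_station_round` and the choice of count and mean as the summary are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from statistics import mean


def base_station_round(packets):
    """Process one batch of sensor packets at the base station (assumed layout).

    Returns the emergency packets, which would be forwarded to the emergency
    response centre first, and a per-data-type summary for the big data server.
    """
    # Emergency packets are passed on unchanged so they can be sent first.
    emergency = [p for p in packets if p["emergency"]]

    # All packets are grouped by data type and reduced to a small summary.
    grouped = defaultdict(list)
    for p in packets:
        grouped[p["data_type"]].append(p["value"])
    summary = {t: {"count": len(v), "mean": mean(v)} for t, v in grouped.items()}
    return emergency, summary
```

In this sketch the server receives one summary record per data type instead of every raw reading, which is the kind of volume reduction the middle layer is meant to provide.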
Sensor networks are used for many applications. These applications can be classified as (1) real-time and (2) non-real-time. Real-time applications such as health monitoring have a higher priority than non-real-time applications (i.e., real-time emergency data should have a higher priority than non-real-time data). Hence, data aggregation approaches should be designed considering the priority of sensor applications or data types. Most existing approaches [1], [4]-[8], [18], [23] do not consider this criterion when designing a data aggregation approach.
Moreover, data processing at the upper layer (i.e., at the big data servers) should also consider the type of data so that data can be stored based on their categories for future use. Data aggregation at the lower and middle layers based on data types and sensor applications will ease the data processing at the upper layer. Thus, this paper introduces an energy efficient, application dependent data aggregation approach for sensor-based big data frameworks. Sensors are programmed to have a data type field in their packets so that other sensors or devices that receive a data packet can identify the type of application and perform data aggregation based on the data type [21]. This field also helps to store data at the appropriate locations in the big data server for further processing and use.
Routing protocols can be proactive (periodic) or reactive (event-based). In periodic routing protocols, data are sensed and transmitted periodically at a certain time interval. In reactive routing protocols, data are transmitted only when a certain event is triggered. Sensors will also be programmed to contain a field (i.e., routing type) in their data packets that indicates the data transmission mode. For instance, if the routing field is set to 1, it represents the periodic data transmission of emergency/real-time applications. Otherwise, data transmission will be event-based. Data aggregation at sensors also depends on this field. In the proposed approach, emergency real-time data will only be aggregated or filtered at sensors to avoid transmitting redundant data (i.e., data with the same information that has already been transmitted), which will reduce network energy consumption and also allow the sink to transmit data faster to the emergency response centre. Moreover, more data aggregation and processing takes place at the middle layer (at the base station or sink node) than at the lower layer (i.e., at the sensors), since sensors have limited power and processing capabilities. Thus, big data servers at the upper layer are expected to receive partially structured data, which reduces the overall processing overhead of the big data framework.
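The sensor-side filtering of redundant data described above can be sketched in Python as follows. This is a minimal sketch, not the paper's scheme: the field names, the value 0 for event-based routing, and the `tolerance` threshold used to decide that a reading repeats the last transmitted one are assumptions made for illustration; only the routing value 1 for periodic emergency traffic comes from the text.

```python
from dataclasses import dataclass

PERIODIC = 1      # routing field value named in the text
EVENT_BASED = 0   # assumed value for event-based transmission


@dataclass
class Reading:
    sensor_id: int
    data_type: str      # e.g. "temperature" (assumed label)
    routing_type: int   # PERIODIC or EVENT_BASED
    value: float


class RedundancyFilter:
    """Suppresses a reading that repeats what the sensor already transmitted."""

    def __init__(self, tolerance=0.0):
        self.tolerance = tolerance
        self._last_sent = {}  # (sensor_id, data_type) -> last transmitted value

    def should_transmit(self, reading):
        key = (reading.sensor_id, reading.data_type)
        last = self._last_sent.get(key)
        if last is not None and abs(reading.value - last) <= self.tolerance:
            return False  # same information was already transmitted
        self._last_sent[key] = reading.value
        return True
```

A sensor would call `should_transmit` before each send and skip the transmission when it returns `False`, which saves the transmission energy of that packet.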

B. Proposed Hybrid Sensor Data Aggregation Scheme
The proposed hybrid data aggregation scheme classifies sensor-based applications into the following categories.
1) Real-time, emergency, time critical applications such as traffic monitoring, battlefield surveillance and health monitoring.
2) Non-real-time applications such as agriculture, farming and pollution monitoring.
The lower layer sensors transmit data to the upper layers through gateway nodes. Fig. 2 illustrates such a scenario. However, data aggregation approaches may achieve energy and computation efficiency by using dynamic network topologies based on the requirements of sensor applications. For example, sensors are programmed to form a cluster-based topology for emergency real-time applications and a tree topology for non-real-time applications (details of cluster formation, tree formation and CH selection are presented in [24]). In the cluster-based topology, sensors collect data and transmit them to the cluster head (CH) in their allocated timeslots. Then the CH transmits to the gateway and the end station. As this type of topology ensures the minimum number of hops to the end node, data aggregation using the cluster-based approach is expected to achieve computational efficiency, low data latency and energy efficiency. In cluster-based data aggregation, once a cluster is formed and a CH is selected, the CH selects a minimum number of nodes as active nodes for any time instant, while the other nodes remain in the sleep state (or idle). We use the work done in [19] to select active nodes. Active nodes of a cluster sense data and transmit them to the CH while aggregating and filtering the data to discard redundant data. On the other hand, idle nodes (in the sleep state) do not perform data sensing, transmission or aggregation. By discarding a large amount of redundant and useless data in emergency applications, this approach ensures faster transmission of data to the central server [1].
On the other hand, achieving energy efficiency is more important than achieving a reduced end-to-end delay in non-real-time applications such as agriculture, farming and pollution monitoring. A tree-based hierarchical topology may create a shorter path that uses more hops than the cluster-based topology. As the distance is smaller, the energy consumption will be lower (energy consumption is directly proportional to the distance between two nodes [2], [3]). Thus, in tree-based data aggregation approaches, a sensor transmits data through the shortest path from itself to the sensor gateway.
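The idea of keeping only a few active nodes that cover the area, with the remaining nodes acting as alternates, can be sketched in Python as follows. The paper selects active nodes with the method of [19], which is not reproduced here; the greedy coverage rule, the `sensing_range` parameter, the list of target points and all names below are assumptions used only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Node:
    node_id: int
    x: float
    y: float
    alive: bool = True


def select_active(nodes, targets, sensing_range):
    """Greedy coverage selection (an assumed stand-in for the method of [19]).

    Repeatedly picks the live node that covers the most still-uncovered target
    points. Returns the chosen active nodes and the set of uncovered targets.
    """
    uncovered = set(range(len(targets)))
    candidates = [n for n in nodes if n.alive]
    active = []
    while uncovered:
        best, best_cov = None, set()
        for n in candidates:
            cov = {i for i in uncovered
                   if math.dist((n.x, n.y), targets[i]) <= sensing_range}
            if len(cov) > len(best_cov):
                best, best_cov = n, cov
        if best is None:
            break  # no live node can cover the remaining targets
        active.append(best)
        candidates.remove(best)
        uncovered -= best_cov
    return active, uncovered
```

Nodes that are not returned stay asleep. When an active node fails, marking it as not alive and calling `select_active` again lets an alternate node take over its part of the area.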
In the tree-based approach, nodes are assigned to different levels of the hierarchy, with the gateway node as the root. Nodes residing one hop away from the gateway are considered to be at level 1, and so on. Then, the shortest path from the sensor gateway node to the active leaf nodes is created using the method presented in [19]. Data transmission starts at the sensors of the lowest level. For instance, active sensor nodes (or leaf nodes) sense the event of interest and transmit it to the active nodes at the upper level. Parent nodes in this tree structure always perform data aggregation using different aggregation functions such as MAX, MIN, MEAN, MEDIAN and SUM and transmit again to the active nodes at the upper level, until the data reach the sensor gateway at the root. Thus, the energy consumption of the active nodes in this approach is well distributed and the total network energy consumption is expected to be lower, even though the number of hops from the sensor to the gateway is larger than in the cluster-based counterpart of the proposed approach. However, tree-based data aggregation may result in an increased end-to-end data transmission delay, as data from a node pass through several hops and are processed at each node for a certain time period. Thus, the proposed hybrid, dynamic and application-based data aggregation scheme offers a trade-off between energy efficiency and data transmission delay.
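Aggregation at parent nodes on the way to the gateway can be sketched in Python as follows. This is a minimal sketch under assumptions: the tree is a hypothetical dict mapping each parent to its children, and each parent forwards a `(min, max, sum, count)` tuple so that MIN, MAX, SUM and MEAN at the root stay exact. MEDIAN is left out because it cannot be computed exactly from such partial summaries.

```python
def summarize(tree, readings, node):
    """Return (min, max, sum, count) over all readings in the subtree of node.

    tree:     dict mapping a parent node to the list of its children
    readings: dict mapping a sensing node to its own reading
    """
    parts = [summarize(tree, readings, child) for child in tree.get(node, [])]
    own = readings.get(node)
    if own is not None:
        parts.append((own, own, own, 1))
    parts = [p for p in parts if p[3] > 0]  # ignore empty subtrees
    if not parts:
        return (None, None, 0.0, 0)
    return (
        min(p[0] for p in parts),
        max(p[1] for p in parts),
        sum(p[2] for p in parts),
        sum(p[3] for p in parts),
    )
```

Each parent sends one fixed-size tuple upwards regardless of how many descendants it has, which is how in-network aggregation reduces the volume of data reaching the gateway. The mean at the root is the forwarded sum divided by the forwarded count.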

Normally, sensor networks are used for a specific application by forming a specific network topology. Using the proposed data aggregation scheme, the sensors in a network can be reused for other applications and are able to change their topology if the application changes. Data packets have a number of fields, and one field is used to set the application type. Once sensors receive a data packet from the gateway with a changed application field, they reconstruct the topology. Fig. 3 illustrates a sensor data packet that contains fields to identify the data type and application type for the proposed data aggregation framework.
Sensor networks are mostly designed for a specific application and hence, a data aggregation scheme (cluster or tree-based) can be pre-established. However, the data aggregation scheme can also be constructed on demand based on the types of packets that sensors transmit. This dynamism allows sensor networks to be used or re-used in multiple applications. Algorithm 1 presents the pseudo-code for the proposed sensor data aggregation approach.
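The on-demand switch between topologies based on the application field of a packet can be sketched in Python as follows. This is not Algorithm 1 of the paper; the packet fields, the enum values and the class names are hypothetical and only illustrate the mapping stated in the text (cluster-based topology for real-time applications, tree topology for non-real-time applications).

```python
from dataclasses import dataclass
from enum import Enum


class AppType(Enum):
    REAL_TIME = 1
    NON_REAL_TIME = 2


# Mapping stated in the text: clusters for real-time, tree for non-real-time.
TOPOLOGY = {AppType.REAL_TIME: "cluster", AppType.NON_REAL_TIME: "tree"}


@dataclass
class Packet:
    source_id: int
    app_type: AppType
    data_type: str
    payload: float


class SensorNode:
    def __init__(self, node_id, app_type):
        self.node_id = node_id
        self.app_type = app_type
        self.topology = TOPOLOGY[app_type]
        self.rebuilds = 0  # number of topology reconstructions so far

    def on_gateway_packet(self, pkt):
        """Reconstruct the topology only when the application field changed."""
        if pkt.app_type is not self.app_type:
            self.app_type = pkt.app_type
            self.topology = TOPOLOGY[pkt.app_type]
            self.rebuilds += 1
        return self.topology
```

A node deployed for a non-real-time application starts in the tree topology and switches to the cluster topology the first time the gateway announces a real-time application; repeated packets with the same application field cause no further reconstruction.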

IV. PERFORMANCE ANALYSIS
In this section, the performance of the proposed data aggregation scheme is analyzed in terms of network energy consumption and data transmission delay. Then we set up the network simulator based on some assumptions and measure the performance of the proposed hybrid data aggregation scheme compared to the tree and cluster-based approaches.

A. Energy Model
The energy model in [2], [3] is used to evaluate the performance of the proposed data aggregation approach, as we only consider the energy consumed by data transmission and reception in this evaluation. This model considers that energy consumption is proportional to the data transmission distance. The energy consumption of a node for transmitting data of $n_{data}$ bytes to another node at a distance $d$ therefore depends on $d$. However, the energy consumption of a node for receiving a data packet is independent of distance.
In this model, the term subscripted "data" is the energy consumption of a sensor node in its electronic circuitry and the term subscripted "air" represents the energy consumption in the RF amplifiers for propagation loss.
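A radio energy model of this kind can be sketched in Python as follows. The exact expressions of [2], [3] are not reproduced here, so the form below is an assumption: transmission cost grows with distance through an amplifier term and reception cost does not. The parameter names `e_data` (per-byte electronics energy), `e_air` (per-byte amplifier energy) and `exponent` are hypothetical; the default exponent of 1 follows the statement that energy is proportional to distance, whereas many radio models use 2 or more.

```python
def tx_energy(n_bytes, d, e_data, e_air, exponent=1.0):
    """Energy to transmit n_bytes over distance d (assumed model form)."""
    return n_bytes * (e_data + e_air * d ** exponent)


def rx_energy(n_bytes, e_data):
    """Energy to receive n_bytes; independent of distance."""
    return n_bytes * e_data
```

With these two functions, doubling the distance increases only the amplifier part of the transmission cost, while the reception cost stays the same.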

B. Estimation of Energy Dissipation
Let us assume that the number of sensor applications =

1) Existing cluster-based method
Let us assume that the number of clusters in each network is $n_{cl}$, which determines the number of nodes in each cluster.
Let us assume that each network has 2 levels of hierarchy. We denote these levels as $L_1$ and $L_2$, and we consider the level that is closer to the gateway to be $L_2$. The number of clusters in each level follows from this division.

Let us assume that the average distance between an active member node and the CH is $d_{avg}$.
The average size of a data packet that is transmitted from a member node to the CH is $n_{data}$. The total network energy consumption for transmitting a data packet to the CH of a cluster is given in (5), and the energy consumption of a CH for receiving data from an active member node is given in (6). Similarly, the energy consumption of a CH for transmitting a data packet to the sensor gateway is given in (7), where the aggregated data size at the CH is $n_{agdatacl}$ and the average distance between the CH and the sensor gateway is $d_{CH}$. Thus, the total transmission energy consumption in a cluster-based data aggregation scheme is given in (8).

2) Proposed hybrid approach
This section presents the proposed data aggregation scheme for the cases where (1) modifications are made to the cluster-based topology for real-time applications, and (2) modifications are made to the tree-based topology for non-real-time applications.

Proposed approach is based on cluster-based topology for real-time applications
Let us assume that the number of nodes that reside in sleep mode is $n_{idle}$.
Therefore, the number of active nodes in a cluster, including the CH, is given in (9). If we substitute (9) into (5), we find the energy consumption of the active nodes in a cluster for transmitting data to the CH. Similarly, if we substitute (9) into (6), we obtain the energy consumption of a CH for receiving data from an active member node of a cluster. The energy consumption of a CH for transmitting a data packet to the sensor gateway is given in (12).
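The effect of putting $n_{idle}$ nodes to sleep on the per-round energy of a cluster can be sketched in Python as follows. The expressions (5) to (12) are not reproduced here, so this is an assumed accounting: each active member sends one packet of `n_data` bytes to the CH, the CH receives each of them and then sends one aggregated packet of `n_agg` bytes to the gateway. The per-byte parameters `e_data` and `e_air`, the `exponent`, and the convention that `n_members` excludes the CH are all hypothetical.

```python
def cluster_round_energy(n_members, n_idle, n_data, n_agg, d_avg, d_ch,
                         e_data, e_air, exponent=1.0):
    """Energy spent by one cluster in one round (assumed accounting)."""
    active = n_members - n_idle  # members that sense and transmit this round
    tx_members = active * n_data * (e_data + e_air * d_avg ** exponent)
    rx_ch = active * n_data * e_data
    tx_ch = n_agg * (e_data + e_air * d_ch ** exponent)
    return tx_members + rx_ch + tx_ch
```

Setting `n_idle=0` corresponds to a scheme in which every member is active. Any positive `n_idle` removes the transmission and reception cost of the sleeping members, so the total is lower as long as the aggregated packet size is unchanged.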

Proposed approach is based on tree-based topology for non-real-time applications
Again, let us assume that the number of levels from the leaf nodes to the sensor gateway is 2.
The number of active nodes in each level is $n_{activeprtr}$. The proposed data aggregation approach that uses the tree topology creates the shortest path from a leaf node to the sensor gateway. We assume that the size of a data packet sensed at a leaf node is $n_{dataprtr}$. The energy consumption of the active nodes at $L_1$ for transmitting data to the nodes at $L_2$ is given in (15), and the energy consumption of the active nodes at $L_2$ for receiving data from the nodes at $L_1$ is given in (16). Similarly, the energy consumption of all active nodes at $L_2$ for transmitting data packets to the sensor gateway is given in (17). Thus, the total energy consumption for transmitting data in the proposed tree-based data aggregation approach is given in (18).

3) Existing tree-based method
Let us assume that the number of nodes at each level of the tree is $n_{tr}$ and that the number of hops needed to transmit data from a leaf node to the sensor gateway is 2. Let us also assume an average distance from the $L_1$ nodes to the $L_2$ nodes. In this approach, all nodes are kept in the active mode. The transmission energy consumption of the $L_1$ nodes is given in (20).
Similarly, the reception energy consumption of the nodes at $L_2$ is given in (21), and the energy consumption for transmitting data from the nodes at $L_2$ to the sensor gateway is deduced using (22).

4) Comparison of energy consumption among cluster-based, tree-based and hybrid approaches
Case 1: Non-real-time sensor applications using tree-based topology.
Since it is known that $n_{activeprtr} < n_{tr}$, we can conclude from (18) and (23) that the proposed tree-based approach consumes less transmission energy than the existing tree-based method. Similarly, we can conclude from (16) and (21) that it consumes less reception energy.

Case 2: Real-time sensor applications that use cluster-based topology.
As the proposed approach keeps $n_{idle}$ nodes of each cluster in sleep mode, fewer nodes are active than in the existing cluster-based method, so we can conclude from (8) and (12) that the proposed cluster-based approach consumes less energy. A similar conclusion holds for the remaining energy terms.

Case 3: Both real-time and non-real-time applications.
Let us assume that the numbers of non-real-time and real-time applications are $n_1$ and $n_2$, respectively. The transmission energy consumption of the proposed data aggregation approach is given in (28), that of the cluster-based approach in (29), and that of the tree-based approach in (30). From (28) and (29), we find that the transmission energy consumption of the proposed approach is less than that of the cluster-based approach. Similarly, from (28) and (30), we find that it is less than that of the tree-based approach.
We find a similar result for the data reception energy consumption (i.e., the reception energy consumption of the proposed approach is less than that of the cluster and tree-based approaches).

C. Analysis on Data Transmission Latency
In the cluster-based method, the active member nodes of a cluster transmit data packets to the CH. Then the CH aggregates the data and transmits the processed data to the sensor gateway. If the times allocated to an active member node and to the CH are $T_c$ and $T_{ch}$, respectively, then $T_{ch} > T_c$, as the CH performs data sensing, transmission, reception and aggregation.
The data transmission latency for the cluster-based method will be as presented in (31).

1) Proposed hybrid approach
The data transmission latency for the proposed cluster-based approach is presented in (32), and the data transmission latency for the proposed tree-based method is presented in (33).
The number of active nodes in each level of the proposed tree-based method is presented in (34).

2) Existing tree-based method
The number of nodes in each level is assumed to be the same, $n_{nodetr}$, and the duration of the timeslot allocated to each node at the lowest level is $T_{L_1 tr}$.
The duration of the timeslot allocated to each node at the upper level is longer, because the upper level nodes perform data aggregation and transmit the aggregated data to the sensor gateway.
Thus, the data transmission latency for the tree-based approach is determined by these timeslot durations and the number of nodes in each level.
From the above analysis, we conclude that the energy consumption and data transmission delay of the proposed sensor-based data aggregation approach at layer 1 are less than those of the traditional cluster and tree-based schemes.

D. Computational Complexity
Let the number of active nodes at each level $l$ in the proposed tree-based approach be as presented in (34). If we define the complexity of the algorithm based on the number of message transmissions, which is a function of the number of nodes from each level at the predefined timeslot, then the processing complexity of the proposed approach based on the tree topology is $O(n)$, where $n$ is the number of nodes transmitting data packets.
Similarly, we can show that the processing complexity of the proposed approach based on clustering is also $O(n)$.

V. VALIDATION OF THE PROPOSED APPROACH
To validate our proposed hybrid data aggregation and filtering technique for sensor-based big data frameworks, we considered the scenarios presented in this section.

A. Simulation Setup
We designed and implemented a simulator for the proposed data aggregation approach in the C programming language rather than using existing simulators such as NS-2, NS-3 and OPNET, because many sensor network and big data functionalities are not available in these simulators. Moreover, a custom simulator gives us more control over implementing the new concept of sensor-based big data.
Real experiments on a testbed generally give more accurate results than simulation. However, real experiments are not always possible due to the unavailability of sensors and other components, so simulation is widely used in place of experimental work in sensor networks and other fields. Hence, we used simulation to evaluate the performance of the proposed data aggregation scheme, which works at layer 1 of the big data architecture, and compared it with the traditional cluster and tree-based approaches presented earlier. We use network energy consumption, network lifetime and data transmission latency as the performance metrics. In each experiment the simulator was run for a certain number of rounds, and each experiment was repeated a certain number of times. The reported outputs are the averages of these results. We define the performance metrics and related terms as follows:
Round is a period of time that comprises a number of network setup and operation phases.
Data transmission latency is the end-to-end data transmission delay, i.e., the time required to transmit data from an active node to the sensor gateway or base station.
Energy consumption is the total energy consumed by a sensor to transmit, receive and aggregate data.
We simulated a network area of 100 meters x 100 meters. As this area is small, the network is divided into only 4 clusters, and on average 20-30 nodes are randomly deployed in each cluster (100 nodes in total). For such a small area, 100 sensors can be considered a large number of sensors that collect a huge amount of data, i.e., big data. The proposed data aggregation approach still works if the size of the network and the number of sensors are increased in the same ratio (large scale). The simulation parameters and their respective values from our paper [25] are also used in this paper.
The simulator was run for between 5000 and 30,000 rounds in different experiments to compare energy consumption between a low (5000 rounds) and a high (30,000 rounds) number of network setup phases. The sensor gateway is placed outside the network area at the coordinate (55, 101). During the network operation phase, the cluster head allocates a number of timeslots to each node. However, each node receives a different number of timeslots based on its distance or level from the sensor gateway. For instance, nodes that are closer to the sensor gateway sense data, receive data from nodes at lower levels, aggregate and transmit data. Hence, these nodes require more time (i.e., timeslots) than the nodes that reside far from the sensor gateway at the lower levels. Table I lists the parameters and their values that are used in the simulation.
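The deployment described above can be sketched as follows. This is not the authors' C simulator: it is a minimal illustration that assumes the four clusters are the four 50 m x 50 m quadrants of the field and that nodes within a hypothetical threshold distance of the sensor gateway receive one extra timeslot. The field size, node count and gateway coordinate come from the text; every function name and the threshold are assumptions.

```python
import math
import random

AREA = 100.0              # side of the square field in meters
GATEWAY = (55.0, 101.0)   # sensor gateway, just outside the field
NUM_NODES = 100


def deploy(num_nodes=NUM_NODES, seed=0):
    """Place nodes uniformly at random inside the field."""
    rng = random.Random(seed)
    return [(rng.uniform(0, AREA), rng.uniform(0, AREA))
            for _ in range(num_nodes)]


def cluster_of(node):
    """Assumed partition: quadrant index 0..3 of the field."""
    x, y = node
    return (1 if x >= AREA / 2 else 0) + (2 if y >= AREA / 2 else 0)


def distance_to_gateway(node):
    """Euclidean distance from a node to the sensor gateway."""
    return math.dist(node, GATEWAY)


def timeslots_for(node, base_slots=1, near_threshold=50.0):
    """Assumed rule: nodes close to the gateway act as upper-level
    nodes and receive one extra timeslot."""
    near = distance_to_gateway(node) <= near_threshold
    return base_slots + (1 if near else 0)


nodes = deploy()
sizes = [0, 0, 0, 0]
for node in nodes:
    sizes[cluster_of(node)] += 1
print(sizes)
```

With uniform placement each quadrant holds about a quarter of the nodes, which is in line with the 20-30 nodes per cluster stated above, although the exact counts depend on the random seed.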

B. Simulation Results
Fig. 4 shows that the energy consumption of the proposed data aggregation approach is much lower than that of the traditional tree and cluster-based approaches, because the proposed approach selects only a few active nodes and most other nodes remain in the idle state, whereas the traditional approaches consider all nodes as active. Moreover, the proposed approach uses either the cluster or the tree-based approach depending on the type of data it senses, which balances the energy consumption.
Fig. 5 demonstrates that the data transmission latency of the proposed data aggregation scheme is lower than that of the tree and cluster-based data aggregation approaches. In the cluster-based mode the CH receives data from only a few active cluster member nodes, and in the tree-based mode a parent node receives data from only a few active child nodes, so the CH or parent node needs less time to process the data and transmit it to the next level.
From the result presented in Fig. 4 on network energy consumption, we can deduce that the network lifetime of the proposed scheme is expected to be longer than those of the cluster and tree-based approaches. Fig. 6 supports this claim: the network lifetime of the proposed hybrid data aggregation approach is much longer than that of the traditional tree-based and cluster-based data aggregation approaches. We can further justify the presented results as follows.
In tree-based data aggregation schemes, upper-level nodes wait until nodes at the lower levels transmit data to the upper levels. This results in higher data transmission latency. Moreover, a large number of active nodes at each level results in data redundancy and data processing overhead. The cluster-based approach allows all cluster members to transmit data to the cluster head (CH). Thus, the CH requires much energy to process the received data. As some of the CHs might be far away from the sensor gateway, such a CH also consumes much energy to transmit the large aggregated data. In contrast, the proposed data aggregation scheme selects only a few active nodes that cover the whole network, which lowers the processing overhead and reduces the total network energy consumption (i.e., increases the network lifetime). Processing and transmitting data from fewer active nodes also results in lower data transmission latency. In summary, Table II compares the existing tree and cluster-based data aggregation approaches with the proposed hybrid approach based on different features.

VI. CONCLUSION AND FUTURE WORKS
We introduced a sensor-based big data aggregation approach in this paper. This approach works in multiple layers. However, we focus on aggregating redundant and unstructured sensor data at the lowest level of this framework, i.e., at the sensor nodes. The proposed hybrid data aggregation scheme uses either an efficient cluster-based data aggregation approach when data are transmitted from real-time or emergency sensor applications, or a tree-based approach for non-real-time sensor applications. Simulation results demonstrate that the proposed hybrid and dynamic data aggregation scheme is better than traditional cluster and tree-based schemes in terms of network energy consumption, network lifetime and data transmission latency. This leaves less unprocessed data for the big data server at the upper layers, which in turn enables faster data aggregation and filtering there. In future, we plan to design and implement a computationally efficient data aggregation scheme for the upper layers at the big data server. We also plan to implement the proposed approach in a testbed (real experiments) and compare it with more existing approaches to justify its effectiveness. Securing sensor data aggregation approaches against attacks, e.g., Sybil, wormhole, blackhole, bogus information and modification of sequence numbers, through the use of public and private key cryptography and encryption mechanisms is significantly important even though those approaches require more computation. Hence, we plan to implement computation-efficient secure data aggregation approaches as part of our future research in this direction.
The number of non-real-time applications that use tree topology = $n_{nr}$. The number of real-time applications that use cluster-based topology = $n_r$.

... and the size of aggregated data packets at the upper level nodes is $n_{agdataprtr}$. The average distance between the nodes at level 1 and level 2 is $d_{L1prtr}$ and between the nodes at level 2 and the sensor gateway is $d_{L2prtr}$. The average (shortest) distance between the leaf node and the sensor gateway node is given in terms of $d_{L1prtr}$ and $d_{L2prtr}$.

The size of data sensed at the lowest-level leaf nodes = $n_{datatr}$. The size of aggregated data at L2 nodes is expressed in terms of $n_{agdataprtr}$ and $n_{agdatatr}$. ... number of packets transmitted by each active node of the network in its predefined timeslot is ...

TABLE I. REPRESENTING TERMINOLOGIES BY SYMBOLS
Table I lists the symbols used for different terms in Algorithm 1.
the aggregated data size at a CH is .

TABLE II. COMPARISON OF DIFFERENT DATA AGGREGATION METHODS
Feature                                                                          Tree-based   Cluster-based   Proposed hybrid
All nodes in the network are active (i.e., they sense, send and transmit data)   √            √               X
A few active nodes cover the whole network area                                  X            X               √
Form clusters and send event of                                                  X            √               √