Influence of Nitrogen-diOxide, Temperature and Relative Humidity on Surface Ozone Modeling Process Using Multigene Symbolic Regression Genetic Programming

Automatic monitoring, data collection, analysis and prediction of environmental changes are essential for all living things. Understanding future climate changes not only helps in measuring their influence on people's lives, habits, agriculture and health but also helps in avoiding disasters. Given the high emission of chemicals into the air, scientists have discovered a growing depletion of the ozone layer, which poses a serious environmental problem. Modeling and observing changes in the ozone layer have been studied in the past. This article explores the dynamics of the pollutant features that influence Ozone. A short-term prediction model for surface Ozone is developed using Multigene Symbolic Regression Genetic Programming (GP). The proposed model uses Nitrogen-dioxide, Temperature and Relative Humidity as the main features to predict the Ozone level. Moreover, a comparison between GP and an Artificial Neural Network (ANN) in modeling Ozone is presented. The results show that GP outperforms the ANN.


I. INTRODUCTION
Tropospheric ozone is an air pollutant which causes serious human health problems. Insufficient adherence to international air quality standards, the growth of industrial activities, and the emission of various gases such as carbon monoxide (CO), nitrogen oxides (NOx), sulphur dioxide (SO2), and particle pollution (PM10 and PM2.5) into the air without any concern for the impact on human health have become a common problem worldwide. These behaviors cause a rise in the Earth's temperature and affect many meteorological variables [1], [2].
The role of stratospheric ozone is to filter out the greatest portion of the sun's potentially harmful shortwave ultraviolet (UV) radiation. This means that the depletion of ozone allows more UV radiation to reach the Earth's surface. Many studies have shown that this UV radiation could have severe impacts on human beings, animals and plants [3]. In [4], the authors explored the dramatic effects of UV radiation on the eye and the skin. Higher temperatures associated with climate change may lead, among numerous other effects, to an increasing rate of skin cancer. A study of the influence of ambient ozone on human health in fifty US cities over five summers was presented in [5]. Countries such as New Zealand have conducted many studies on air quality to estimate the likely health problems which may be encountered and to decide where emissions should be reduced to improve air quality. In [6], a published report studied the influence of CO, nitrogen dioxide (NO2), SO2, O3, benzene and benzo(a)pyrene (BaP) in air.
In the past, researchers proposed different types of models to forecast the concentrations of pollutants. Some of these models are statistical, such as autoregressive moving-average (ARMA) models and linear regression models [7]-[10]. Recently, more attention has been given to models based on machine learning techniques, such as Artificial Neural Networks (ANN) and Support Vector Machines (SVM) [2], [11]-[15], for developing forecasting models.
In this work, Multigene Symbolic Regression GP is used to develop a short-term prediction model of surface Ozone. The proposed model can predict the mean surface Ozone based on a limited number of attributes: Nitrogen-dioxide, temperature and relative humidity. Multigene GP has some advantages over other techniques such as ANNs: it produces compact mathematical models that have explanatory power and are easy to evaluate. A complete comparison between both techniques on the modeling problem is presented.
This paper is organized as follows. An overview of the ANN technique is presented in Section II. GP as an evolutionary computation technique is presented in Section III. The evaluation criteria adopted to check the performance of the developed models are presented in Section IV. The area of study, with detailed information about data collection, is discussed in Section V. Section VI provides the experimental setup and results of the two developed Ozone models, based on ANN and Multigene Symbolic Regression GP.

II. MULTILAYER PERCEPTRON ANN
ANN was first defined as an information-processing system. This system has a large number of simple processing units called "neurons". These neurons interconnect by sending and receiving signals which activate the neurons connected to them. A huge number of these neurons constitute a neural network. An ANN is distinguished by certain characteristics such as its architecture, its training algorithm and its activation function. In this work, we investigate the application of the multilayer feedforward neural network, which is one of the most common types of neural networks applied for function approximation and prediction [13], [16], [17]. In an MLP-ANN, neurons are arranged in layers (input, hidden and output layers). The information in a feedforward MLP-ANN flows in only one direction, from the input layer, through the hidden layers, to the output layer [18]. Figure 1 depicts an example of a feedforward MLP designed for Ozone prediction. In this example, the MLP has four neurons in a single hidden layer.

A. Learning algorithm
A learning algorithm is needed to adjust the ANN weights so that the network models the relationship between the inputs and the output. One of the best-known learning algorithms is backpropagation (BP). BP works by adjusting the weights to minimize a cost function that measures the error between the actual output and the ANN output. This function could simply be the sum of the squared errors. The learning process can be split into a number of phases as below:
1) Hidden layer:
Assume we have a set of input-output measurements in the form of $(x_i, y_i)$. The inputs $x_i$ are presented to the input layer, then passed to the hidden layer weighted by the weights $w_{ij}$. The neurons of the hidden layer have a nonlinear activation function known as the sigmoid function (see Equation 1). The output of each neuron in the hidden layer is computed by the summation function presented in Equation 2, where $i = 1, 2, \dots, n$ and $j = 1, 2, \dots, m$. $\psi$ and $y_j$ are the activation function and the output of the $j$-th node in the hidden layer, respectively.

2) Output layer:
After the output of each neuron in the hidden layer has been computed, the information is passed to the output layer. The output layer also has a number of neurons, most likely fewer than the number of neurons in the hidden layer. Neurons in this layer most likely have a linear activation function. The computed output for neurons in the output layer is presented in Equation 3.
$k$ is the number of neurons in the output layer. $\phi$ is the linear activation function. $Y$ is the neural network output from the single neuron in the output layer, as in our case study. The learning process continues until the cost function is minimized. In our case, the cost function measures the difference between the actual output and the output of the network. It is defined as the Root Mean Square Error (RMSE), described in Equation 4.
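The two phases above can be illustrated with a short Python sketch: a forward pass through one sigmoid hidden layer and a single linear output neuron, followed by the RMSE over a set of samples. The weight values, the bias terms and the function names are illustrative assumptions; they are not the trained network of this study, and the backpropagation weight update itself is not shown.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, assumed here as the hidden-layer activation."""
    return 1.0 / (1.0 + math.exp(-z))

def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass: n inputs -> m sigmoid hidden neurons -> 1 linear output."""
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(w_j, x)) + b_j)
        for w_j, b_j in zip(w_hidden, b_hidden)
    ]
    # Linear output activation: a weighted sum of the hidden outputs.
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out

def rmse(actual, predicted):
    """Root mean square error between measured and predicted values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

# Hypothetical weights: 3 inputs (NO2, temperature, relative humidity),
# 4 hidden neurons, 1 output neuron.
W_HIDDEN = [[0.2, -0.1, 0.05], [-0.3, 0.4, 0.1], [0.1, 0.1, -0.2], [0.05, -0.25, 0.3]]
B_HIDDEN = [0.0, 0.1, -0.1, 0.2]
W_OUT = [0.5, -0.4, 0.3, 0.2]
B_OUT = 0.1

sample = [0.3, 0.6, 0.8]  # made-up, normalized input values
prediction = mlp_forward(sample, W_HIDDEN, B_HIDDEN, W_OUT, B_OUT)
```

During training, a value such as `rmse(measured, predictions)` would be recomputed after each weight update until it stops decreasing.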

III. GENETIC PROGRAMMING
Genetic Programming is an evolutionary process which has been successfully used to solve a diversity of problems in system identification and control [19], [20]. GP was inspired by the idea of natural selection and evolution introduced by Darwin. GP uses the concept of survival of the fittest to develop solutions that are more likely to fit a problem. It is a population-based approach. In GP, each member of the population takes the form of a tree structure rather than a chromosome, as in the case of Genetic Algorithms (GAs) [21]-[24]. A block diagram which shows the GP evolutionary process is presented in Figure 2.

A. Population Initialization and Tree Representation
The initial population for any evolutionary process is most likely produced randomly. In GP, a random population $P_0$ of trees is generated. Each tree represents a solution to a given problem. GP evolves tree structures which are composed from the function and terminal sets provided by the user.
(IJACSA) International Journal of Advanced Computer Science and Applications, www.ijacsa.thesai.org

Fig. 2. Flow chart of the GP technique
The fitness of the initial population is computed according to a given fitness function.

B. Function and Terminal Sets
To develop a mathematical model which represents a relationship between input and output variables, we have to define both the function and terminal sets. For a set of inputs $x_1$, $x_2$, $x_3$ and $x_4$ that produce an output $y$, we may have a tree structure which produces Equation 5. The function set $\vartheta$ and the terminal set $\chi$ are given in Equation 6.
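To make the tree representation concrete, the following Python sketch evaluates a small expression tree built from a function set and a terminal set. The nested-tuple encoding, the function set {+, -, *} and the example expression are assumptions chosen for illustration; they are not the sets or the tree of Equations 5 and 6.

```python
import operator

# Hypothetical function set: symbol -> binary operation.
FUNCTIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def evaluate(tree, inputs):
    """Recursively evaluate a GP tree.

    A tree is either a terminal (a variable name such as "x1", or a numeric
    constant) or a tuple (function_symbol, left_subtree, right_subtree).
    """
    if isinstance(tree, tuple):
        symbol, left, right = tree
        return FUNCTIONS[symbol](evaluate(left, inputs), evaluate(right, inputs))
    if isinstance(tree, str):
        return inputs[tree]  # variable terminal
    return tree              # constant terminal

# Illustrative tree encoding y = (x1 + x2) * (x3 - 0.5 * x4).
tree = ("*", ("+", "x1", "x2"), ("-", "x3", ("*", 0.5, "x4")))
y = evaluate(tree, {"x1": 1.0, "x2": 2.0, "x3": 4.0, "x4": 2.0})
```

Because every internal node holds a member of the function set and every leaf a member of the terminal set, any tree assembled from the two sets evaluates to a valid model output.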

C. Selection Mechanism
As the population evolves, the selection of individuals for both crossover and mutation depends on what is called the selection mechanism. This is an essential process in the generation of a new population. Many selection mechanisms have been presented [25]. They include the roulette wheel technique, stochastic universal sampling, tournament selection and many others [26]. A number of selection mechanisms used in GP were presented in [27].
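As one example of the mechanisms listed above, tournament selection can be sketched in a few lines of Python. The tournament size `k`, the toy population of candidate constants and the error function are hypothetical and serve only to show the idea; the text does not state which mechanism or settings were used in this study.

```python
import random

def tournament_select(population, error, k=3, rng=random):
    """Draw k distinct individuals at random and return the one with the lowest error."""
    contestants = rng.sample(population, k)
    return min(contestants, key=error)

# Toy population: candidate constants scored by distance from a made-up target of 10.
population = [2.0, 7.5, 9.0, 11.5, 15.0, 30.0]
error = lambda c: abs(c - 10.0)

rng = random.Random(42)  # seeded only to make the sketch repeatable
parents = [tournament_select(population, error, k=3, rng=rng) for _ in range(4)]
```

A larger `k` increases selection pressure: with `k` equal to the population size the best individual always wins, while `k = 1` reduces to uniform random selection.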

D. Multigene Symbolic Regression GP
The symbolic regression method was presented by J. Koza [19]. The objective of this method is to search the space of possible mathematical expressions (i.e., equations) while minimizing some error criterion. Developing a mathematical function between input variables $x_i$ and an output $y$ is a challenge. It is important to find the function $\zeta$ which relates the inputs and the output. Symbolic regression explores the space of models and the space of all possible parameters simultaneously, so that it can find the best model, i.e., the one which minimizes the error criterion.

E. Crossover
Crossover is the main operator in any evolutionary process. Crossover is performed between two individuals (i.e., trees) [28]. A study of crossover operators in GP was presented in [29]. Assume we have two parents with genes $T_1, \dots, T_5$ and $R_1, \dots, R_3$. Table I shows the crossover operation in multigene GP.
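A gene-level crossover between two such parents can be sketched in Python as follows. This is a minimal sketch of a two-point exchange of whole genes, under the assumption that a contiguous run of genes is swapped between the parents; the randomly drawn cut points, the optional `max_genes` cap and its simple truncation rule are illustrative and are not taken from Table I.

```python
import random

def high_level_crossover(parent_a, parent_b, rng, max_genes=None):
    """Swap one contiguous, non-empty run of genes between two multigene individuals."""
    a_lo, a_hi = sorted(rng.sample(range(len(parent_a) + 1), 2))
    b_lo, b_hi = sorted(rng.sample(range(len(parent_b) + 1), 2))
    child_a = parent_a[:a_lo] + parent_b[b_lo:b_hi] + parent_a[a_hi:]
    child_b = parent_b[:b_lo] + parent_a[a_lo:a_hi] + parent_b[b_hi:]
    if max_genes is not None:
        # Simplification: truncate children that exceed the gene limit.
        child_a, child_b = child_a[:max_genes], child_b[:max_genes]
    return child_a, child_b

# Genes are represented by their labels only; each would be a full tree in practice.
parent_t = ["T1", "T2", "T3", "T4", "T5"]
parent_r = ["R1", "R2", "R3"]
child_1, child_2 = high_level_crossover(parent_t, parent_r, random.Random(1))
```

Without a gene cap, the two children together contain exactly the genes of the two parents, so this operator recombines existing genes without altering any individual tree.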

TABLE I CROSSOVER IN MULTIGENE GP
In Multigene symbolic regression, the model output $\hat{y}$ is formed by a weighted sum of the outputs of the trees/genes in the multigene individual plus a bias term. Each tree is a function of zero or more of the $n$ input variables $x_1, \dots, x_n$. Mathematically, a Multigene regression model can be written as:
$$\hat{y} = \delta_0 + \delta_1 \times \mathrm{Tree}_1 + \dots + \delta_m \times \mathrm{Tree}_m \qquad (7)$$
where $\delta_0$ represents the bias or offset term, $\delta_1, \dots, \delta_m$ are the gene weights, and $m$ is the number of genes (i.e., trees) which constitute the individual. The values of the $\delta$ coefficients can be estimated using the least squares estimation technique. A simple example of a multigene model with two genes is presented in Figure 3 and Equation 8.
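The least squares step can be sketched in pure Python by solving the normal equations for the bias and gene weights. The gene outputs and target values below are made up so that the true coefficients are known, and the small Gaussian elimination helper is an assumption of this sketch; a numerical library would normally be used instead.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x

def fit_gene_weights(gene_outputs, y):
    """Return [bias, weight_1, ..., weight_m] minimizing the squared error.

    gene_outputs[i][k] is the output of gene i on sample k.
    """
    columns = [[1.0] * len(y)] + gene_outputs  # bias column plus one column per gene
    gram = [[sum(p * q for p, q in zip(ci, cj)) for cj in columns] for ci in columns]
    rhs = [sum(p * t for p, t in zip(ci, y)) for ci in columns]
    return solve_linear_system(gram, rhs)

# Made-up outputs of two genes on five samples, with y = 2 + 3*gene_1 - 0.5*gene_2.
gene_1 = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_2 = [1.0, 4.0, 9.0, 16.0, 25.0]
y = [2.0 + 3.0 * g1 - 0.5 * g2 for g1, g2 in zip(gene_1, gene_2)]
weights = fit_gene_weights([gene_1, gene_2], y)
```

Because the weights are obtained in closed form, the evolutionary search only has to discover the structure of each gene, not its scaling.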

F. Mutation
Mutation is a relatively important operator: it helps to keep diversity in the population, especially when most individuals have the same fitness. Mutation helps to maintain exploration in the population. Mutation in multigene GP operates in almost the same way as in standard GP.
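Subtree mutation, a common form of mutation in standard GP, can be sketched in Python as below: a randomly chosen node is replaced by a newly generated random subtree. The nested-tuple tree encoding, the function and terminal sets, the 0.3 terminal probability and the depth limit are all assumptions of this sketch; the text does not specify which mutation variant was used.

```python
import random

FUNCTIONS = ["+", "-", "*"]      # hypothetical function set
TERMINALS = ["x1", "x2", "x3"]   # hypothetical terminal set

def random_tree(rng, depth):
    """Grow a random tree of at most the given depth."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    return (rng.choice(FUNCTIONS), random_tree(rng, depth - 1), random_tree(rng, depth - 1))

def node_paths(tree, path=()):
    """List the path (sequence of child indices) to every node in the tree."""
    paths = [path]
    if isinstance(tree, tuple):
        for i in (1, 2):
            paths += node_paths(tree[i], path + (i,))
    return paths

def replace_at(tree, path, new_subtree):
    """Return a copy of the tree with the node at `path` replaced."""
    if not path:
        return new_subtree
    children = list(tree)
    children[path[0]] = replace_at(tree[path[0]], path[1:], new_subtree)
    return tuple(children)

def subtree_mutation(tree, rng, max_depth=2):
    """Replace one randomly chosen node with a fresh random subtree."""
    target = rng.choice(node_paths(tree))
    return replace_at(tree, target, random_tree(rng, max_depth))

parent = ("+", "x1", ("*", "x2", "x3"))
mutant = subtree_mutation(parent, random.Random(7))
```

In a multigene individual, the same operator would be applied to one gene at a time, leaving the remaining genes unchanged.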

IV. PERFORMANCE CRITERION
A number of performance criteria were used to evaluate the performance of the developed ANN and Multigene GP models. These evaluation criteria are presented in the following equations.
1) Euclidean distance (ED):
2) Mean Absolute Error (MAE):
3) Mean Magnitude of Relative Error (MMRE):
where $y$ and $\hat{y}$ are the actual measured Ozone level and the Ozone level predicted by the ANN and GP models, respectively, given $n$ measurements.
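The equations for these criteria are not reproduced in this text. The Python sketch below uses their commonly used definitions, which is an assumption about the exact forms intended here: ED as the square root of the summed squared errors, MAE as the mean of the absolute errors, and MMRE as the mean of the absolute errors divided by the actual values. The sample values are made up.

```python
import math

def euclidean_distance(y, y_hat):
    """ED: square root of the sum of squared errors."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(y, y_hat)))

def mean_absolute_error(y, y_hat):
    """MAE: average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(y, y_hat)) / len(y)

def mean_magnitude_relative_error(y, y_hat):
    """MMRE: average absolute error relative to the actual value (actual must be non-zero)."""
    return sum(abs(a - p) / abs(a) for a, p in zip(y, y_hat)) / len(y)

# Made-up actual and predicted values, for illustration only.
actual = [10.0, 20.0, 40.0]
predicted = [12.0, 18.0, 40.0]

ed = euclidean_distance(actual, predicted)
mae = mean_absolute_error(actual, predicted)
mmre = mean_magnitude_relative_error(actual, predicted)
```

Lower values indicate a better fit for all three criteria; MMRE is scale-free, which makes it easier to compare across data partitions with different Ozone ranges.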

V. SITE CHARACTERIZATION AND DATA
The area under study is Chenbagaramanputhur, a rural place in Kanyakumari district about 12 km from Nagercoil town. The Tirunelveli district lies to the north and north-east, Kerala State to the north-west, and the sea to the west and south of Chenbagaramanputhur (see Figure 4). The data used in this study were reported in [13]. The authors of [13] mentioned that the measurements were collected using a portable Aeroqual series S200. The Aeroqual series 200 can measure various ozone levels. Measurements were taken at 3-hour intervals for a period of 3 months, from May 2009 to July 2009. Figure 5 shows the inputs and output of the proposed models. The variables used as inputs and output are presented in Table II.

A. Developed MLP-ANN Model
We developed an MLP-ANN model of surface Ozone using the input-output data presented in Table II and the parameters given in Table III. Various numbers of neurons in the hidden layer were explored during the learning process. The best number of neurons found was four. Figure 6 shows that the MLP training process converged quickly to the minimum training error, after only nine cycles (epochs). Figure 7 shows the actual and estimated surface Ozone values based on the final developed MLP model.

B. Developed Multigene GP Model
To develop the genetic programming model, the GPTIPS MATLAB toolbox developed in [28] is used. GPTIPS is a powerful genetic programming software tool which can be used for modeling dynamical nonlinear systems. The tool can be configured to evolve multigene tree structures. The Multigene approach often develops simpler models than evolving models consisting of one monolithic GP tree.
The data set described earlier was loaded, then the Multigene GP was applied using the GPTIPS tool. The parameters of the GP run are given in Table IV. In Figure 8, the convergence of GP over 100 generations is shown. The best generated surface Ozone Multigene GP model is given in Equation 12. It can be clearly seen that the final model is a simple and compact mathematical model which is easy to evaluate. Figure 9 shows the actual and estimated surface Ozone values based on the developed GP model.

C. Comments on Results
In order to compare the performance of GP and MLP for predicting Ozone concentrations, the evaluation criteria discussed in Section IV are used to assess both developed models. The criteria for the models are computed and summarized in Table V. It can be noticed that the Multigene GP model shows better prediction results than the MLP model for the training, testing and validation partitions according to all evaluation criteria. Moreover, the final developed GP model shown in Equation 12 is much simpler than the complex model of the ANN approach.

VII. CONCLUSIONS AND FUTURE WORK
A comparison between genetic programming and multilayer perceptron neural networks was presented for short-term prediction of surface Ozone based on a limited number of measured pollutant and meteorological variables. The GP approach adopted is based on Multigene symbolic regression, which generates mathematical models as linear combinations of low-order non-linear transformations of the input variables. Based on this comparison, it can be concluded that the evolutionary models of the Multigene GP have promising potential for predicting surface ozone concentrations when

$\theta$ is defined as a random floating-point number such that $\theta \in [-1, 1]$. Thus, $a$ and $b$ are also defined in the domain $a, b \in [-1, 1]$.

Fig. 4. Location of the area of study at Chenbagaramanputhur