A Genetic Programming based Algorithm for Predicting Exchanges in Electronic Trade using Social Networks' Data

The purpose of this paper is to use a Facebook dataset to predict exchanges in electronic business. To this end, a dataset is collected from Facebook users and divided into training and test datasets. First, an advertisement post is sent to the users in the training data and the feedback from each user is recorded. Then, a learning machine is designed and trained on this feedback and the users' profiles. Genetic programming is used to design this learning machine. Next, the test dataset is used to test the learning machine. The efficiency of the proposed method is evaluated in terms of Precision, Accuracy, Recall and F-Measure. Experimental results show that the proposed method outperforms a baseline algorithm (based on J48) and the random selection method in selecting target users for sending advertisements. The proposed method obtains an Accuracy of 74% and an earning ratio of 73% in classifying users.
Keywords: Electronic business; Social networks; prediction; machine learning; genetic programming; Facebook network


I. INTRODUCTION
Electronic business [1] means the production, marketing, sale and delivery of goods using electronic tools. Although electronic business is still in its infancy, it already plays such an important role in daily life that it cannot easily be avoided. Since success in electronic business requires having information and analysing the marketing environment correctly, a major part of this information must be obtained from cyberspace. Correct analysis of this information in turn requires knowledge of cyberspace. Since a major part of communication in electronic business takes place through available tools such as social networks, analysing the information obtained from these tools can contribute significantly to the success of an electronic business. In the social sciences, a social network describes relations among humans, human groups and organizations. These networks consist of organizational groups that are connected through one or more dependencies [2][3][4][5].
One of the novel methods for predicting the behaviour of a statistical population is to use social networks [6,7]. The main question of this paper is: how can information from social networks be used for prediction and for improving electronic businesses? To answer this question, a learning machine based on genetic programming is proposed, which uses social network data to predict exchanges in electronic trade.
The approach proposed in this paper might be an important step towards improving electronic business trade.

II. RELATED WORK
Studies conducted in the context of electronic business can be categorized into four general classes:

A. Approaches based on Brand
In general, these approaches focus on consumer contribution, sales objectives and brand loyalty. Pentina et al. [8] studied the impact of relations between consumers and brands on Facebook and Twitter. De Vries et al. [9] investigated the impact of messages posted on brand pages, considering their clarity, interactivity, information content, entertainment value and position. Labrecque [10] studied whether a brand's interactions and social relations make users willing to offer information and remain loyal to that brand.

B. Approaches based on Modeling Information Broadcasting
Research in this group mostly models information broadcasting (including financial information) at the social network level. Tsur and Rappoport [11] predicted activities and performances in social networks using contents and topologies. Bonchi et al. [12] extracted and searched business plans by learning the population structure and the network dynamics. Saito et al. [13] proposed a probabilistic model based on information broadcasting for prediction.

C. Generic Approaches
Studies in this group do not assume that the network structure is known, which makes finding the behavioural patterns of users more difficult. Rodriguez et al. [14] designed a generic model for tracking users' paths. These researchers improved their model through concave optimization [15]. Duong et al. [16] addressed this problem using two approaches. The first learns graphical model potentials for a given network structure, compensating for missing edges through induced correlations among node states. The second learns the missing connections directly.

D. Approaches based on Users' Behavior
In the analysis of social networks, studying users' behaviour under various hypotheses about the user has yielded a great deal of information. Zhang et al. [17] identified strong users using their friends' comments. Anagnostopoulos et al. [18] also identified highly influential users from the information that users broadcast through social networks.

III. GENETIC PROGRAMMING
Genetic programming (GP) is an evolutionary computation (EC) technique that automatically solves problems without the user having to tell the computer explicitly how to do it. At the most abstract level, GP is a systematic, domain-independent method for getting computers to automatically solve problems starting from a high-level statement of what needs to be done. Algorithmically, GP comprises the steps shown in Algorithm 1. The main genetic operations involved in GP (line 5 of Algorithm 1) are the following [19,20]:
• Crossover: the creation of one or two offspring programs by recombining randomly chosen parts from two selected programs.
• Mutation: the creation of one new offspring program by randomly altering a randomly chosen part of one selected program.
6: until an acceptable solution is found or some other stopping condition is met (for example, reaching a maximum number of generations).
7: return the best-so-far individual.
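The loop described above can be sketched in a few lines. The following is a minimal illustration in Python, not the implementation used in this paper: the tuple-based program encoding, tournament selection, the depth limit and all function names (`evolve`, `crossover`, `mutate`, and so on) are assumptions chosen to keep the example short. Programs are Boolean expressions over binary features, fitness is training accuracy, and the best-so-far individual is returned.

```python
import random

# Hypothetical encoding: ("x", i) reads feature i; ("and", a, b),
# ("or", a, b) and ("not", a) combine sub-programs.

def random_program(n_features, depth_left, rng):
    """Grow a random Boolean expression tree."""
    if depth_left == 0 or rng.random() < 0.3:
        return ("x", rng.randrange(n_features))
    op = rng.choice(["and", "or", "not"])
    if op == "not":
        return ("not", random_program(n_features, depth_left - 1, rng))
    return (op, random_program(n_features, depth_left - 1, rng),
            random_program(n_features, depth_left - 1, rng))

def run(program, x):
    """Execute a program on one binary feature vector."""
    op = program[0]
    if op == "x":
        return bool(x[program[1]])
    if op == "not":
        return not run(program[1], x)
    left, right = run(program[1], x), run(program[2], x)
    return (left and right) if op == "and" else (left or right)

def fitness(program, X, y):
    """Fraction of examples classified correctly."""
    return sum(run(program, x) == bool(t) for x, t in zip(X, y)) / len(y)

def depth(program):
    return 1 if program[0] == "x" else 1 + max(depth(c) for c in program[1:])

def subtrees(program, path=()):
    """Yield (path, subtree) for every node of the tree."""
    yield path, program
    if program[0] != "x":
        for i, child in enumerate(program[1:], start=1):
            yield from subtrees(child, path + (i,))

def replace(program, path, new):
    """Copy of `program` with the node at `path` replaced by `new`."""
    if not path:
        return new
    i = path[0]
    return program[:i] + (replace(program[i], path[1:], new),) + program[i + 1:]

def crossover(a, b, rng):
    """Insert a randomly chosen part of b at a random point of a."""
    path, _ = rng.choice(list(subtrees(a)))
    _, donor = rng.choice(list(subtrees(b)))
    return replace(a, path, donor)

def mutate(a, n_features, rng):
    """Replace a randomly chosen part of a with a new random subtree."""
    path, _ = rng.choice(list(subtrees(a)))
    return replace(a, path, random_program(n_features, 2, rng))

def tournament(pop, scores, rng, k=3):
    """Fitness-based selection: best of k randomly drawn individuals."""
    picks = rng.sample(range(len(pop)), k)
    return pop[max(picks, key=lambda i: scores[i])]

def evolve(X, y, pop_size=50, generations=30, p_cross=0.9, max_depth=8, seed=0):
    rng = random.Random(seed)
    n = len(X[0])
    pop = [random_program(n, 3, rng) for _ in range(pop_size)]
    best, best_score = None, -1.0
    for _ in range(generations):           # stop at a maximum number of generations
        scores = [fitness(p, X, y) for p in pop]
        top = max(range(pop_size), key=lambda i: scores[i])
        if scores[top] > best_score:
            best, best_score = pop[top], scores[top]
        if best_score == 1.0:              # acceptable solution found
            break
        nxt = []
        for _ in range(pop_size):
            parent = tournament(pop, scores, rng)
            if rng.random() < p_cross:
                child = crossover(parent, tournament(pop, scores, rng), rng)
            else:
                child = mutate(parent, n, rng)
            # Assumed bloat control: keep the parent if the child is too deep.
            nxt.append(child if depth(child) <= max_depth else parent)
        pop = nxt
    return best, best_score                # best-so-far individual
```

Calling `evolve(X, y)` with a list of binary feature vectors `X` and 0/1 labels `y` returns a program and its training accuracy; the fixed `seed` only makes the run repeatable.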

IV. THE PROPOSED METHOD
The main purpose of the proposed method is to design a learning machine that can predict which users of a social network are potential targets for business objectives. For this purpose, a dataset including profile information and its links in social networks is first collected. This dataset is then used to send advertisement links to users, and their feedback is investigated. Users are labelled according to whether or not they opened the link. The resulting dataset is used to train a predictive learning machine based on genetic programming. The target variable in genetic programming training is the output label "yes" or "no", which indicates whether or not the link was opened. Additionally, the data in this dataset are mapped to a numerical space to build the feature vector. After training, the learning machine is used to select target users for commercial operations in social networks.
The flowchart of the proposed method is shown in Figure 1. The general framework of the proposed method comprises two main steps:
1) Designing the learning machine for the prediction process
2) Evaluating the performance of the designed machine in selecting potential target users in electronic business
In the first step, a learning machine is trained whose input is the dataset collected from social networks and whose output is a "yes" or "no" label. The aim of this machine is to create a regression function that maps the input data to the output labels well.
In the second step, the test data are used to test the designed machine. To test and validate its performance, target users for the advertisement are selected both randomly and using the designed machine. Finally, the quality of the users selected by these two methods is compared. Note that whether or not a user opens the advertisement link is the measure of quality. The steps of the proposed method are described in the following.

A. Collecting Dataset
The dataset of this research is a Facebook dataset collected by Stanford University [21]. The focus of this network is on users' data. This social network provides a programming API through which users' information can be retrieved as a web service. The main indices extracted from this dataset are shown in Table 1.
In this network, the main elements are users. Besides features such as name and age, users are characterized by their activities in groups and in different events of the social network. Figure 2 shows a general scheme of users' activities on this network. The Stanford University Facebook dataset contains not only users' profile information but also network-level information. For instance, the family relations of a person with other people are also shown. Figure 3 shows the friendship circles and the relation of a specific user (node v_i) with another user of the network (node u_i).
As can be seen in Figure 3, a user belongs to many different circles, some of which may overlap. This is an important characteristic of the dataset, because it helps the learning machine use these friendship circles to find users with common relations more accurately. Each user's data can be considered a single record, which can be represented by a feature vector and is treated as statistically independent. The dataset has various variables, both numerical and categorical. Each record is in fact a node in the graph of the social network, which identifies a specific user by a unique index. The dataset has four files, described as follows:
• Edges file: the edges of each node in the network. In Facebook, edges are undirected.
• Circles file: the circles, each given as a series of nodes.
• Feat file: the features of each node.
• Feat Names file: the name of each feature.
Features that are set for a user are encoded as 1, and features that are not set for that user are encoded as 0.
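As an illustration of how the four files could be read into memory, the following Python sketch parses each file from a text string. The exact line layouts are assumptions, not taken from this paper: edges as two node ids per line, circles as a circle name followed by node ids, features as a node id followed by 0/1 values, and feature names as an index followed by the name. The function names are hypothetical.

```python
def parse_edges(text):
    """Build an undirected adjacency map from lines of the form 'u v'."""
    adjacency = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        u, v = map(int, line.split())
        # Edges are undirected, so record both directions.
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency

def parse_circles(text):
    """Map each circle name to the set of node ids it contains."""
    circles = {}
    for line in text.splitlines():
        parts = line.split()
        if parts:
            circles[parts[0]] = {int(p) for p in parts[1:]}
    return circles

def parse_feat(text):
    """Map each node id to its binary feature vector (1 = set, 0 = not set)."""
    features = {}
    for line in text.splitlines():
        parts = line.split()
        if parts:
            features[int(parts[0])] = [int(p) for p in parts[1:]]
    return features

def parse_featnames(text):
    """Return the feature names in index order, dropping the leading index."""
    names = []
    for line in text.splitlines():
        if line.strip():
            _, name = line.split(" ", 1)
            names.append(name.strip())
    return names
```

For example, `parse_feat("7 0 1 1\n")` yields `{7: [0, 1, 1]}`, a record whose second and third features are set.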

C. Recording Users' Feedback
In this step, a message containing an advertisement link is sent to the users of the training dataset in order to record their feedback on the advertisement. In total, 990 users gave a positive answer, indicating that they had read the advertisement link, while the others did not respond. Thus, the training dataset used to build the learning machine in the next step includes 990 positive answers and 2100 negative answers.

D. Designing The Learning Machine
In the proposed method, genetic programming is employed to create the learning machine for selecting potential users. A modified version of the standard genetic programming algorithm described in Section III is employed. Since the output of the learning machine is either 0 or 1, a binary version of genetic programming is used. The features considered for each user in the social network form an n-dimensional feature vector $X = (x_1, x_2, \ldots, x_n)$, as in Equation (1). Each vector belongs to class 0 or class 1, which indicates whether or not the user has opened the link. Therefore, there are two output classes, c1 and c2, and the dataset consists of the labelled feature vectors of userno users, where userno is the number of people who participated in the test. The purpose of the distance learning algorithm is to find a set of metric functions F. According to the definition of the characteristic function of each class, the functions f are binary, and the metric function f maps each n-member input vector X to a unit vector in 2D space. Thus, the cosine between these unit vectors is zero. In other words, a zero cosine between the images of two feature vectors under the metric function means that the characteristic functions f can be good classifiers for the problem. However, finding these functions is the main challenge of the problem.
Since the function f is binary, the general framework of the functions in Equation (5) is defined as follows. The function f is a disjunction of polymers, in which each polymer is obtained by connecting the monomers operand1 and operand2 with the logical AND operator. In other words, $$f = \bigvee_{k} \left(\mathrm{operand1}_k \wedge \mathrm{operand2}_k\right) \tag{8}$$ in which $\wedge$ denotes the AND operation and $\vee$ denotes the OR operation. Thus, the function f can be converted into a tree recursively. If there is no repeated monomer in the function, or if a monomer l is common among several polymers, these polymers are combined to create a tree as in Figure 4.
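The conversion of such a function into a tree can be illustrated with a short Python sketch. It assumes, for illustration only, that each polymer is a pair of feature indices and that trees are nested tuples; `dnf_to_tree`, `factor_common` and `evaluate` are hypothetical names. The second function shows one way polymers sharing a monomer l might be combined, using the identity (l AND a) OR (l AND b) = l AND (a OR b).

```python
def or_chain(nodes):
    """Join a non-empty list of subtrees with binary OR nodes."""
    tree = nodes[0]
    for node in nodes[1:]:
        tree = ("or", tree, node)
    return tree

def dnf_to_tree(terms):
    """Build a binary tree for the OR of (operand1 AND operand2) polymers."""
    return or_chain([("and", ("x", a), ("x", b)) for a, b in terms])

def factor_common(terms, l):
    """Combine the polymers that share monomer l into l AND (a OR b OR ...)."""
    shared = [t for t in terms if l in t]
    rest = [t for t in terms if l not in t]
    if not shared:
        return dnf_to_tree(terms)
    partners = [("x", b if a == l else a) for a, b in shared]
    combined = ("and", ("x", l), or_chain(partners))
    return combined if not rest else ("or", combined, dnf_to_tree(rest))

def evaluate(tree, x):
    """Evaluate a tree recursively on a binary feature vector x."""
    op = tree[0]
    if op == "x":
        return bool(x[tree[1]])
    left, right = evaluate(tree[1], x), evaluate(tree[2], x)
    return (left and right) if op == "and" else (left or right)
```

With `terms = [(0, 1), (0, 2), (3, 4)]`, the trees returned by `dnf_to_tree(terms)` and `factor_common(terms, 0)` differ in shape but compute the same Boolean function, so the factored form is only a more compact representation.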

E. Evaluating Performance of the Learning Machine
Finally, the output function obtained from the learning machine is used to identify potential target users who respond positively to advertisement links. The random selection algorithm is also used as a baseline for evaluating the performance of the learning machine. In this step, the whole test dataset of 3037 users is classified by the learning machine. Detailed results of the test are described in the next section.

V. EVALUATION RESULTS
In this section, the proposed method is evaluated and its results are compared with those obtained from random selection and from the algorithm based on J48 [22]. All implementations and evaluations are performed in MATLAB on a PC with an Intel i5 processor and 4 GB of RAM.
In addition, the K-fold method (K=10) is employed to train and test the proposed method. In this type of test, the data are partitioned into K subsets. One of these K subsets is used for testing and the remaining K-1 subsets are used for training. This procedure is repeated K times, so that every record is used exactly once for testing and K-1 times for training. Finally, the average of these K tests is taken as the final estimate. In the K-fold method used here, the ratio of each class in each subset is the same as in the main set.
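A stratified K-fold split of this kind can be sketched as follows. This is an illustrative Python sketch rather than the MATLAB code used for the experiments; the function name `stratified_k_fold` and the round-robin assignment of shuffled indices to folds are assumptions.

```python
import random
from collections import defaultdict

def stratified_k_fold(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) for k folds, preserving class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the indices of each class out to the folds in turn,
        # so every fold receives (almost) the same share of that class.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, test
```

For a label list with 990 positive and 2100 negative entries and K=10, each test fold produced by this sketch contains 99 positives and 210 negatives, and every record appears in exactly one test fold.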

A. Evaluation Measures
One of the common tools used for evaluating classification algorithms is the confusion matrix (Table 2). Considering the confusion matrix, the following measures can be defined and evaluated:
• True Positives are the recipients who were correctly identified by the algorithm as interested in our product. The cost of advertising is fully covered by the income.
• True Negatives are the recipients who were correctly classified as not interested in our product. There is neither cost nor income.
• False Positives are the recipients who were identified as interested in our product while, in reality, they were not. This group creates a cost, because the money invested in sending them an advertisement is lost. However, as we will see, it is not the worst misclassification.
• False Negatives are the most expensive misclassification, as we lose those who would have bought our product. Although we save the money not invested in the campaign, those savings are very small compared to the loss.
• Precision is the fraction of retrieved instances that are relevant: $$\mathrm{Precision} = \frac{TP}{TP + FP}$$
• Accuracy is the proportion of true results (both true positives and true negatives) among the total number of cases examined: $$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
• Recall is the fraction of relevant instances that are retrieved: $$\mathrm{Recall} = \frac{TP}{TP + FN}$$
• F-Measure combines precision and recall (harmonic mean): $$F\text{-}\mathrm{Measure} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
We assume that each positive response gives us z units of revenue, while the cost of sending one advertisement is estimated as 0.01z. With these assumptions and referring to the classification categories, we can evaluate the revenues, costs and profits of classification for each of the four previously defined groups of customers. A summary is presented in Table 3.
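The measures and the cost model above can be computed directly from the four confusion-matrix counts. The Python sketch below is illustrative only: the function names are hypothetical, the counts in the usage note are made up, and the profit function counts only realized revenue and sending costs (revenue z per true positive, cost 0.01z per advertisement sent to a predicted-positive user). How the lost revenue of false negatives is booked in Table 3 is not assumed here.

```python
def classification_metrics(tp, fp, tn, fn):
    """Precision, recall, accuracy and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f_measure": f_measure,
    }

def campaign_profit(tp, fp, z=1.0, cost_rate=0.01):
    """Realized profit: revenue from true positives minus the cost of all ads sent.

    Advertisements are sent only to users predicted positive (TP + FP).
    """
    revenue = tp * z
    cost = (tp + fp) * cost_rate * z
    return revenue - cost
```

As a usage example with hypothetical counts, `classification_metrics(tp=80, fp=20, tn=70, fn=30)` gives a precision of 0.8 and an accuracy of 0.75, and `campaign_profit(tp=80, fp=20)` gives 79 units of profit for z = 1.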

B. First Experiment: Effect of features selection on convergence speed of the proposed method
First, the learning machine was trained using all available features. Then, the features that play a more important role in training the learning machine were extracted through several tests and used to train the machine. Choosing fewer but more important features increases the training speed of the machine under genetic programming. From the feature set, 30 important features were selected, some of which are listed in Table 4. In addition, Figure 5 shows the convergence of genetic programming over the iterations, with and without the selection of important features. As the results show, when the features are reduced, the genetic programming algorithm converges faster in reducing the classification error than when all features are included.
For clarity, the evaluation refers to the cost and revenue measures (Table 3). The profit gained in testing and training the learning machine is given in Table 6. As mentioned before, the maximum loss in classification is related to the false negative class, that is, the cases in which the learning machine has not selected a user who is in fact a potential customer. Thus, false negatives should be decreased and true positives should be increased.

D. Third Experiment: Evaluating Performance of positive feedbacks using Lift diagram
After extracting the important features and creating the proposed learning machine, the Lift diagram in Figure 6 is used to evaluate the output of the learning machine in terms of positive feedback from the users of the test dataset. In this diagram, the rate of positive answers received from users is compared for two cases: random selection of users and selection of users by the proposed learning machine. The rate of positive answers for random selection is taken as the baseline. Experimental results show that as the number of selected users in the test dataset increases, the Lift rate decreases. The maximum Lift rate is obtained when 30% of users are selected to receive the advertisement. In this case, the proposed learning machine performs 2 times better than random selection. That is, the proposed learning machine was able to select appropriate users to send the advertisement to.
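The Lift rate at a given selection fraction can be computed as the positive-answer rate among the selected users divided by the baseline rate over all users. The Python sketch below assumes that the learning machine provides a score per user by which users can be ranked; the function name `lift_at` and the data in the usage note are hypothetical.

```python
def lift_at(scores, labels, fraction):
    """Lift of selecting the top `fraction` of users ranked by score.

    scores: one model score per user (higher means more likely to respond).
    labels: 1 if the user responded positively, 0 otherwise.
    """
    n_selected = max(1, round(fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    selected = ranked[:n_selected]
    rate_selected = sum(labels[i] for i in selected) / n_selected
    base_rate = sum(labels) / len(labels)   # expected rate under random selection
    return rate_selected / base_rate
```

For ten hypothetical users with scores from 0.9 down to 0.0 and labels `[1, 1, 1, 0, 0, 1, 0, 0, 0, 0]`, selecting the top 30% gives a Lift of 2.5, while selecting all users gives a Lift of 1, the same as random selection.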

E. Fourth Experiment: Evaluating Common Classification Measures
In this experiment, the performance of the proposed method in the test stage is evaluated in terms of Precision, Accuracy, F-measure and Recall. Figure 7 shows the results obtained from this test. The evaluation results show that Precision, Accuracy, Recall and F-measure are 0.66, 0.72, 0.80 and 0.74, respectively.
Moreover, the performance of the proposed method in the test and training stages is compared with that of [22] in terms of accuracy and profit; the results are given in Table 7 and Table 8, respectively. The comparison shows that the proposed method, with 80% accuracy in the training stage and 74% accuracy in the test stage, performs better than the method proposed in [22]. This shows the acceptable performance of the proposed method in selecting potential users to receive the advertisement. Table 8 compares the profit-to-revenue ratio of the proposed method with that of [22]. The comparison shows that the proposed method outperforms the other method.

VI. CONCLUSION
In this paper, a learning method based on genetic programming is proposed for business predictions in social networks. The main purpose of this method is to select the users of a social network who respond appropriately to the advertisements they receive. For this purpose, a dataset of users' information from Facebook was collected and studied. This dataset was divided into test and training datasets. First, a learning machine was trained by sending an advertisement to existing users and recording their feedback. The main purpose of the proposed learning machine is to learn how to select the users who are more likely to give positive answers.
After designing and training the proposed learning machine, its performance was evaluated using the test dataset. Experimental results showed that the proposed method classifies users with 74% accuracy. Additionally, the LIFT test was used to compare the performance of the proposed learning machine with the random selection method, and the results showed that the proposed learning machine outperforms random selection in selecting potential target users. Moreover, for selecting , both methods perform the same. In future work, we will try to improve the performance of the proposed method by combining meta-learning algorithms such as DECORATE, bagging and boosting with the genetic algorithm.

Fig. 4. Converting an algebraic term into a binary tree in genetic programming

Fig. 5. Comparing the convergence speed of genetic programming with and without feature selection

C. Second Experiment: Calculating Profit
After training and testing the proposed learning machine, the results obtained in terms of TP, FP, TN and FN are presented for the test and training datasets in Table 5.

Fig. 6. LIFT diagram of the proposed learning machine compared to random selection method
Fig. 7.

TABLE I. THE MAIN INDICES IN THE NETWORK GRAPH

Table 2. The confusion matrix contains the prediction results of the classifier algorithm in 4 different classes: True Positive, False Negative, False Positive and True Negative.

TABLE IV. SOME IMPORTANT FEATURES OF SOCIAL NETWORKS USERS

TABLE VI. PROFIT OBTAINED FROM TEST AND TRAINING DATASETS BY USING THE PROPOSED METHOD

TABLE VII. COMPARING PERFORMANCE OF THE PROPOSED METHOD WITH J48 [22] IN TERMS OF ACCURACY

TABLE VIII. COMPARING PROFIT TO REVENUE RATIO OF THE PROPOSED METHOD WITH J48 [22]