D-MFCLMin: A New Algorithm for Extracting Frequent Conceptual Links from Social Networks

The massive amount of data in social networks has led researchers to look for ways to summarize the available information and extract knowledge from it. One recent approach describes the knowledge in a social network through a concise structure called the conceptual view. To build this view, conceptual links must first be extracted from the network, but extracting these links from large-scale networks is very time consuming. This paper presents a new algorithm for extracting frequent conceptual links from social networks that uses the concept of dependency to accelerate the extraction. The proposed algorithm can accelerate the process only when there are dependencies in the data. Tests on the Pokec social network, whose data contain no dependencies, showed that in the absence of dependencies the execution time of extracting conceptual links increases by at most 15 percent.
Keywords: Social network analysis; frequent conceptual link; data mining; graph mining


I. INTRODUCTION
A social network is a social structure composed of agents (generally individuals or organizations) that are connected by one or more kinds of dependency, such as ideas, financial transactions, friendship, kinship, web links, or the spread of diseases (epidemiology). Social networks fall into different categories, some of which are described in [1]. Studies indicate that social networks can be used at many individual and social levels to identify problems and determine solutions, establish social relationships, support organizational governance and policy making, and advise people on how to achieve their objectives.
Social network analysis is a powerful tool for analyzing the nature and pattern of communication among the members of a particular group. It helps to visualize and analyze complex sets of relationships between relevant factors as maps (graphs) of connected symbols, to find patterns within them, to compute the size, shape and density of the network as a whole, and to determine the position of each element within it. For example, in epidemiology, social network analysis is used to understand how patterns of human contact help or prevent the spread of diseases such as HIV in a population.
Among the various kinds of social networks, online social networks have received particular attention from researchers. A key aspect of many online social networks is that they are data-rich and therefore provide unprecedented challenges and opportunities for knowledge discovery and data mining. One of the most important topics in traditional data mining is frequent pattern mining. For complex data structures such as networks, the search for frequent items takes the form of finding subsets of nodes (sub-graphs) that occur frequently in a network, which is known as graph mining. Although early methods in this field used measures derived from graph theory [2], newer approaches, known as social network mining or simply link mining, examine node attributes in addition to the network structure in order to extract new kinds of patterns [3]-[5].
The authors of [6] described a new approach, named conceptual links, for describing social networks. Conceptual links provide knowledge about groups of nodes that are densely connected to each other in a social network and, through a reduced structure called the conceptual view, lead to a semantic view of the network. However, the problem of extracting conceptual links is similar to extracting frequent itemsets [7], which has NP-hard complexity [8]. This paper presents the D-MFCLMin algorithm, which uses the concept of dependency to prune the search space and thereby reduce the time required to extract frequent conceptual links. The paper is structured as follows. Section 2 presents the notion of conceptual links, and Section 3 reviews existing algorithms for extracting frequent conceptual links. Our proposed algorithm is presented in Section 4. Finally, Section 5 presents the test results.

II. PROBLEM DESCRIPTION AND DEFINITIONS
In the search for frequent conceptual links (FCL), a pattern is defined as "a set of links between two groups of nodes, where the nodes in each group share common characteristics". When such patterns occur in the network often enough, they are regarded as frequent patterns and called FCLs [6].
More formally, assume that $G = (V, E)$ is a network where $V$ is the set of nodes and $E$ is the set of edges, with $E \subseteq V \times V$. $V$ is defined as a relation $R(A_1, \ldots, A_p)$ where each $A_i$ is an attribute. Thus, every node $v \in V$ is defined by a tuple $(a_1, \ldots, a_p)$ where, for every $q \in [1..p]$, $v[A_q] = a_q$ is the value of attribute $A_q$ in $v$.
An item is a logical expression $A = x$ where $A$ is an attribute and $x$ is a value. The empty item is denoted $\emptyset$. An itemset is a conjunction of items, for example $A_1 = x$ and $A_2 = y$ and $A_3 = z$. An itemset $m$ that is a conjunction of $k$ non-empty items is called a $k$-itemset. Suppose that $m$ and $sm$ are two itemsets. If $sm \subseteq m$, we say that $sm$ is a sub-itemset of $m$ and $m$ is a super-itemset of $sm$. For example, $sm = xy$ is a sub-itemset of $m = xyz$ [6]. The set of all $t$-itemsets built from $V$ is denoted $I_V^t$. Moreover, $I_V^{\le t}$, the set of all itemsets of size at most $t$, is defined as follows:

$$I_V^{\le t} = \bigcup_{k=1}^{t} I_V^k$$

Suppose that $G$ is a directed graph. For any itemset $m$ built from $V$, $V_m$ denotes the set of nodes in $V$ that match the pattern $m$, and the following sets are defined [6]:
- Left-hand link set of $m$ ($LE_m$): the set of links of $E$ that start from nodes satisfying $m$, that is $LE_m = \{e \in E : e = (a, b),\ a \in V_m\}$.
- Right-hand link set of $m$ ($RE_m$): the set of links of $E$ that enter nodes satisfying $m$, that is $RE_m = \{e \in E : e = (a, b),\ b \in V_m\}$.
Definition 1. Conceptual link [6]: Suppose that $m_1$ and $m_2$ are two itemsets and $V_{m_1}$ and $V_{m_2}$ are the sets of nodes in $V$ that satisfy $m_1$ and $m_2$ respectively. The conceptual link $(m_1, m_2)$ is the set of links connecting the nodes in $V_{m_1}$ to the nodes in $V_{m_2}$:

$$(m_1, m_2) = LE_{m_1} \cap RE_{m_2} = \{e \in E : e = (a, b),\ a \in V_{m_1},\ b \in V_{m_2}\}$$

Definition 2. Support [6]: The support of a conceptual link $l = (m_1, m_2)$ is the ratio of links in $E$ that belong to $l$:

$$supp(l) = \frac{|(m_1, m_2)|}{|E|}$$

If a conceptual link $l$ is frequent, all its sub-links are frequent too. Thus, if a link is not frequent, none of its super-links is frequent.
Definition 6. Maximum FCL [6]: Assume that $\beta$ is a given support threshold. A maximum frequent conceptual link (MFCL) is any FCL $l$ such that no frequent super-link $l'$ of $l$ exists. More formally, $l$ is an MFCL if $l$ is frequent and $\nexists\, l'$ frequent such that $l \subset l'$.

III. RELATED WORK
Popular approaches to mining social networks have been proposed to extract different forms of knowledge from these networks. Similar to traditional data mining, social network mining addresses a wide range of tasks such as classification, clustering, search for frequent patterns, and link prediction. These methods can be divided into two groups [8]:
- Approaches based on predictive modeling, which include techniques that analyze current and past facts to make predictive assumptions about future or unknown events.
- Approaches based on descriptive modeling, which cover a set of techniques whose aim is to summarize data by identifying related features that describe how things are organized and actually work [8].
This study focuses on descriptive approaches to social networks. These approaches can be divided into the following four categories [5].

1) Link-based clustering (also known as community detection) searches for dense groups of nodes. Its aim is to decompose the network into several linked components (communities) such that nodes within a component are densely connected while nodes in different components are only sparsely connected. Methods in this category include SLPA [9], TopGC [10], SVINET [11], MCD [12], CGGC [13], CONCLUDE [14], DSE [15] and SPICi [16].
2) Hybrid clustering simultaneously considers the attributes and the structure of the nodes to identify clusters. The aim of this newer type of approach is to partition the network by balancing structural and attribute similarities, so that nodes with common attributes are grouped in one partition and the nodes inside a partition are densely linked. These approaches provide a more conceptual partition of the network that is not necessarily proportional to context. Examples are SA-Cluster [4] and CESNA [17].
3) Frequent sub-graph mining. The most widely used definition of a pattern is a connected sub-graph [18]. Techniques that search for frequent patterns in social networks therefore aim to identify sub-graphs that occur frequently, with respect to a minimum threshold, in a database of networks or in a single very large network. Prominent methods in this category include Apriori-based algorithms [19] and pattern-growth algorithms [20].
4) FCL mining combines network structure information and node attributes to provide knowledge about the groups of nodes that are most connected in a social network. Extracting MFCLs has a complexity similar to that of frequent itemset mining, which has been proven to be NP-hard. Extracting all MFCLs from a social network can therefore be a challenging and computationally intensive problem.
Following the definitions of conceptual links given above, we now review the methods proposed for extracting these links. If the search space is very large, discovering all the frequent links in a network is very costly. A naive approach has to generate all possible itemsets and then examine the frequency of each pair of them. To reduce this time, the FLMIN algorithm [21] was proposed first. This algorithm uses a bottom-up approach that applies Property 2 to gradually reduce the search space to a superset of the itemsets that can potentially appear in FCLs. Fig. 1 shows a sample of conceptual links extracted by the FLMIN algorithm. The MAX-FLMin algorithm, whose aim is to find MFCLs, was presented in [22]. Compared with the previous algorithm, it only uses itemsets that satisfy Property 1 to create links, which are then checked for being frequent. In addition, before adding a created link to the frequent links, this algorithm checks that no maximum frequent link already covers the current link.
In the H-MFCLMin algorithm [5], some itemsets are filtered in order to accelerate the extraction of MFCLs. The filtered itemsets are those whose number of matching nodes in the network is less than the threshold α, where α is an input parameter of the algorithm. The argument for this filtering is that an itemset with low frequency is unlikely to attract a high proportion of the links in the network, so filtering such itemsets reduces the search space without losing significant information from the final conceptual view.
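To make the quantities used by these algorithms concrete, the following minimal Python sketch computes the node set of an itemset, the support of a conceptual link, and an α filter in the spirit of H-MFCLMin on a toy directed network. The data layout (a dictionary of node attributes and a list of edges), the function names, and the reading of α as a raw node count are illustrative assumptions, not the data structures of [5] or [21].

```python
from typing import Dict, List, Set, Tuple

Node = int
Itemset = Dict[str, str]  # e.g. {"gender": "f", "region": "A"}


def matching_nodes(nodes: Dict[Node, Dict[str, str]], m: Itemset) -> Set[Node]:
    """V_m: the nodes whose attributes satisfy every item of m."""
    return {v for v, attrs in nodes.items()
            if all(attrs.get(a) == x for a, x in m.items())}


def link_support(nodes, edges: List[Tuple[Node, Node]],
                 m1: Itemset, m2: Itemset) -> float:
    """supp((m1, m2)): fraction of edges that go from V_m1 to V_m2."""
    v1, v2 = matching_nodes(nodes, m1), matching_nodes(nodes, m2)
    hits = sum(1 for a, b in edges if a in v1 and b in v2)
    return hits / len(edges)


def alpha_filter(nodes, itemsets: List[Itemset], alpha: int) -> List[Itemset]:
    """Drop itemsets matched by fewer than alpha nodes (assumed count-based threshold)."""
    return [m for m in itemsets if len(matching_nodes(nodes, m)) >= alpha]


# Toy network: four users, four directed friendship links.
nodes = {1: {"gender": "f", "region": "A"}, 2: {"gender": "m", "region": "A"},
         3: {"gender": "f", "region": "B"}, 4: {"gender": "m", "region": "B"}}
edges = [(1, 2), (1, 4), (3, 2), (2, 1)]

support_f_to_m = link_support(nodes, edges, {"gender": "f"}, {"gender": "m"})
kept = alpha_filter(nodes, [{"gender": "f"}, {"gender": "f", "region": "A"}], alpha=2)
```

On this toy network three of the four edges go from a node with gender = f to a node with gender = m, and the 2-itemset (gender = f, region = A) is filtered out because only one node matches it.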

IV. THE PROPOSED ALGORITHM
This paper proposes the D-MFCLMin algorithm for extracting conceptual links. By using the concept of dependency to prune the search space, the algorithm accelerates the extraction of conceptual links. We first introduce the concept of dependency and then present the pseudo code of the proposed algorithm.

A. Definitions
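Definition 7. Dependency: Suppose $m_1$ and $m_2$ are two itemsets. We say that $m_2$ is dependent on $m_1$, written $m_1 \rightarrow m_2$, if for every $v \in V_{m_2}$ we have $v \in V_{m_1}$. The set of all dependencies of an itemset $m$ is denoted $D(m)$.
Definition 8. Set of selected itemsets: Assume that $FL^t$ is the set of FCLs extracted from itemsets with at most $t$ items. $S(FL^t)$ is the set of itemsets used to create these links:

$$S(FL^t) = \{\, m : \exists\, m',\ (m, m') \in FL^t \ \text{or}\ (m', m) \in FL^t \,\} \quad (8)$$

Property 3: If an itemset $m$ is not in any of the extracted FCLs, that is $m \notin S(FL^t)$, then none of the itemsets that depend on it will be in $FL^t$.
Proof: Assume that $m'$ is an itemset that depends on the itemset $m$, and suppose that $m$ is not involved in any FCL ($m \notin S(FL^t)$). Then, according to the definition of an FCL, for every itemset $m''$ neither $(m, m'')$ nor $(m'', m)$ is frequent with respect to $\beta$. Moreover, according to the definition of dependency, we know that $V_{m'} \subseteq V_m$. So we have:

$$(m', m'') \subseteq (m, m'') \quad \text{and} \quad (m'', m') \subseteq (m'', m)$$

As a result:

$$supp((m', m'')) \le supp((m, m'')) \quad \text{and} \quad supp((m'', m')) \le supp((m'', m))$$

so no link involving $m'$ is frequent, and therefore the property is proven.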
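Property 3 rests on a monotonicity argument: if every node matching the dependent itemset also matches the itemset it depends on, every link counted for the former is also counted for the latter. The short Python check below illustrates this on a toy edge list. The subset-based reading of dependency, the helper names, and the node sets are assumptions made for illustration only.

```python
def depends_on(v_dep: set, v_base: set) -> bool:
    """Assumed reading of Definition 7: all nodes of the dependent itemset match the base itemset."""
    return v_dep <= v_base


def link_count(edges, left: set, right: set) -> int:
    """Number of edges that start in `left` and end in `right`."""
    return sum(1 for a, b in edges if a in left and b in right)


edges = [(1, 2), (1, 3), (2, 3), (4, 1), (3, 4)]
v_base = {1, 2}    # nodes matching an itemset m
v_dep = {1}        # nodes matching an itemset m' that depends on m
v_other = {3, 4}   # nodes matching some other itemset m''

# Link counts with m' and m as the source side, then as the target side.
as_source = (link_count(edges, v_dep, v_other), link_count(edges, v_base, v_other))
as_target = (link_count(edges, v_other, v_dep), link_count(edges, v_other, v_base))
```

In both directions the count obtained for the dependent itemset cannot exceed the count obtained for the base itemset, so a link that is not frequent for the base itemset cannot become frequent for the dependent one.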

Definition 9. Parents of an itemset:
For each itemset $m$, two parent itemsets are defined.
Definition 10. Dependency level: For each itemset $m$, a dependence level is defined. The itemsets are sorted by this level before the search for FCLs.

B. Pseudo Code of D-MFCLMin Algorithm
The pseudo code of the proposed algorithm is given below. As in H-MFCLMin, the input parameters are α and β, the thresholds on itemset support and link support respectively.
As in H-MFCLMin [5], in the first iteration ($t = 1$) the lists of candidate 1-itemsets are created according to Properties 1 and 2 (lines 6 and 7). After these lists are created, their itemsets are sorted in ascending order of support. Unlike H-MFCLMin, in iteration $t$ the dependencies between the itemsets in these lists are obtained before the search for FCLs.
For this purpose, the $t$-itemsets of each list are joined pairwise and then, based on the support of the resulting itemsets, the existence of a dependency between the two joined itemsets is checked. In the absence of a dependency, the resulting itemset is inserted into the list for the next iteration as a candidate itemset (lines 11-25). The insertion keeps the list in ascending order of support.
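The join-and-compare step can be sketched as follows. Two itemsets are joined, and a dependency is reported when the joined itemset matches exactly as many nodes as one of the two joined itemsets, which means that all nodes of that itemset also satisfy the other one. This criterion is an assumed reading of "based on the support of the resulting itemsets"; the function name, the string labels, and the precomputed node sets are hypothetical.

```python
from itertools import combinations


def join_candidates(node_sets: dict):
    """node_sets maps an itemset label to the set of nodes that match it.

    Returns (dependencies, candidates):
      dependencies: (dependent, base) label pairs, under the assumed subset reading;
      candidates:   (joined label, node set) pairs sorted by ascending support count.
    """
    dependencies, candidates = [], []
    for (m1, v1), (m2, v2) in combinations(node_sets.items(), 2):
        joined = v1 & v2                       # nodes matching both itemsets
        if not joined:
            continue                           # empty join: nothing to keep
        found = False
        if len(joined) == len(v1):             # every node of m1 also matches m2
            dependencies.append((m1, m2))
            found = True
        if len(joined) == len(v2):             # every node of m2 also matches m1
            dependencies.append((m2, m1))
            found = True
        if not found:                          # no dependency: keep the join as a candidate
            candidates.append((f"{m1} & {m2}", joined))
    candidates.sort(key=lambda c: len(c[1]))   # ascending order of support
    return dependencies, candidates


node_sets = {"age=20": {1, 2}, "student=yes": {1, 2, 3}, "gender=f": {2, 3, 4}}
deps, cands = join_candidates(node_sets)
```

Here every node with age = 20 is also a student, so that pair is recorded as a dependency and produces no new candidate, while the two remaining joins become candidates for the next iteration.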
After the dependencies among the itemsets of iteration $t$ are determined, their dependence levels are calculated and the itemsets are sorted in increasing order of dependence level (line 26). After sorting, the search for FCLs is carried out. The FCLs found are added to the list of frequent links and then, after the sub-links of these FCLs that are already in the MFCL list have been removed, are added to that list as MFCLs (lines 27-44).
More precisely, for every pair of itemsets $m_1$ and $m_2$ taken from the two lists, an $i$-itemset and a $j$-itemset with $i = t$ or $j = t$, the algorithm checks whether the link $(m_1, m_2)$ is frequent. Before this check, the sets of dependencies of $m_1$ and $m_2$ are examined. If none of the itemsets in these sets has been added to the set of selected itemsets, the frequency check for this pair is skipped (line 33). Recall that the itemsets in the two lists are arranged in ascending order of dependence level, so when an itemset is checked, all the itemsets it depends on have already been investigated in this iteration. After this step, the frequency of the link is checked as in the H-MFCLMin algorithm (line 34).
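The pruning idea can be illustrated with a simplified Python sketch. It is not the pseudo code of D-MFCLMin: it prunes only on the source side, assumes that the left list is ordered so that an itemset appears after every itemset it depends on, and assumes that a link is frequent when its support is strictly greater than β. All names and the toy data are hypothetical.

```python
def count_links(edges, src: set, dst: set) -> int:
    """Number of edges going from a node in src to a node in dst."""
    return sum(1 for a, b in edges if a in src and b in dst)


def frequent_links(left: dict, right: dict, edges, beta: float, depends_on: dict):
    """left, right: label -> node set; left is ordered by dependence level.
    depends_on: label -> labels of left itemsets whose node sets contain it.
    Returns (frequent (m1, m2) pairs, number of pairs whose edges were counted).
    """
    frequent, scans = set(), 0
    productive = set()   # left itemsets that already appear in a frequent link
    for m1, v1 in left.items():
        # Source-side use of Property 3: if an itemset that m1 depends on
        # produced no frequent link, no link starting from m1 can be frequent.
        if any(base not in productive for base in depends_on.get(m1, ())):
            continue
        for m2, v2 in right.items():
            scans += 1
            if count_links(edges, v1, v2) / len(edges) > beta:  # assumed strict threshold
                frequent.add((m1, m2))
                productive.add(m1)
    return frequent, scans


edges = [(1, 2), (2, 3), (3, 1), (1, 3), (4, 1)]
left = {"A": {1, 2, 3}, "B": {4, 5}, "B1": {4}}   # B1 depends on B
right = {"X": {1, 2, 3, 4, 5}}

pruned = frequent_links(left, right, edges, beta=0.3, depends_on={"B1": {"B"}})
unpruned = frequent_links(left, right, edges, beta=0.3, depends_on={})
```

Both runs return the same frequent link, but the run that knows about the dependency counts edges for two pairs instead of three, which is the kind of saving the dependency check aims for.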

C. Analysis of the Proposed Algorithm
First, the cost of the H-MFCLMin algorithm is discussed. Suppose that we want to check the existence of a conceptual link between an $i$-itemset $m_1$ and a $j$-itemset $m_2$ ($i = t$ or $j = t$) at iteration $t$. To this end, the edges of the network whose source node belongs to $V_{m_1}$ and whose destination node belongs to $V_{m_2}$ are counted. The cost of this check is:

$$C_H = 2N\,|E|$$

In the above equation, $N$ is the number of features of each itemset. To decide whether a node belongs to an itemset, it is enough to compare the attribute values of the node with the itemset, which costs $N$, and because this has to be done for both the source and the destination of each edge, twice this cost is incurred per edge.
In the D-MFCLMin algorithm, taking the dependencies into account changes the above cost as follows:

$$C_D = C_{dep} + (1 - p)\,2N\,|E|$$

In the above relation, $C_{dep}$ is the cost of calculating the dependencies of the two itemsets $m_1$ and $m_2$, and $p$ is the probability that the dependencies of these two itemsets make it unnecessary to count the edges of the social network to check for a conceptual link between them. The value of $C_{dep}$ depends on the number of dependencies of the itemsets being checked and on the number of conceptual links found in the current iteration: in D-MFCLMin, for every pair of itemsets being checked, the participation of their dependencies in the conceptual links found so far in the current iteration is evaluated. These two factors, the number of itemset dependencies and the number of conceptual links, are examined next.
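The trade-off between the two costs can be explored numerically. The sketch below assumes the forms $C_H = 2N|E|$ for H-MFCLMin and $C_D = C_{dep} + (1 - p)\,2N|E|$ for D-MFCLMin, which is our reading of the description above; the function names and the sample numbers (other than the eight mandatory profile fields) are illustrative.

```python
def cost_h(n_features: int, n_edges: int) -> int:
    """Assumed cost of one link check: N comparisons for each end of every edge."""
    return 2 * n_features * n_edges


def cost_d(n_features: int, n_edges: int, dep_cost: float, p_prune: float) -> float:
    """Assumed cost with dependencies: the dependency check is always paid,
    the edge scan only when the pair is not pruned (probability 1 - p_prune)."""
    return dep_cost + (1.0 - p_prune) * cost_h(n_features, n_edges)


def break_even_p(n_features: int, n_edges: int, dep_cost: float) -> float:
    """Pruning probability at which the dependency check pays for itself."""
    return dep_cost / cost_h(n_features, n_edges)


# Illustrative numbers: 8 features, 1000 edges, a dependency check costing 2400 units.
baseline = cost_h(8, 1000)
no_pruning = cost_d(8, 1000, dep_cost=2400, p_prune=0.0)
threshold = break_even_p(8, 1000, dep_cost=2400)
```

With no pruning at all the dependency check is pure overhead on top of the baseline cost, and the dependency-aware variant only breaks even once the pruning probability reaches the ratio of the dependency cost to the baseline cost.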

1) The number of dependencies of an itemset:
The exact number of dependencies of an itemset cannot be determined, so we consider their maximum number. For simplicity, we assume that the two lists of itemsets in iteration $t$ contain the same number of itemsets. Under this assumption, in the rest of the paper we make no distinction between the two lists and, to be concise, denote either of them by $I^t$. As already mentioned, the itemsets in each iteration are sorted in ascending order of support. Assuming the maximum possible number of dependencies in $I^t$, the first itemset is not dependent on any itemset, the second itemset may only be dependent on the first, the third itemset is dependent on at most the two previous itemsets, and so on, so the maximum number of dependencies between all itemsets in $I^t$ is equal to:

$$\sum_{k=0}^{|I^t| - 1} k = \frac{|I^t|\,(|I^t| - 1)}{2}$$

Assuming that these dependencies are uniformly distributed among the itemsets of this set, the maximum number of dependencies of each itemset is obtained as:

$$\frac{|I^t| - 1}{2} \quad (14)$$

It should be noted that the maximum number of itemsets in iteration $t$ can be obtained from a recursive relation on the number of possible values of each feature. For example, for the feature gender, the number of possible values is equal to 2.
2) The number of conceptual links: The second factor affecting the cost of checking dependencies is the number of conceptual links found in a step, $|FL^t|$. Given the steady growth of the number of conceptual links during the step, the maximum number of conceptual links assessed per pair of itemsets is equal to $|FL^t|$, and the number of conceptual links checked for every pair of itemsets is on average equal to:

$$\frac{|FL^t|}{2} \quad (17)$$

According to relations (14) and (17), the overall value of the dependency cost can be obtained. Having determined this cost, we now analyze the behavior of the proposed algorithm.
The worst situation for the proposed algorithm occurs when, despite a large number of dependencies, no pruning takes place. The amount of pruning depends on the number of conceptual links found: the fewer conceptual links are found, the more likely the dependencies are to prune itemsets. The number of FCLs in turn depends on the value of β: the smaller this parameter, the more FCLs are found. Therefore, we expect the proposed algorithm to perform worse when β is small.
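The counting arguments used in this analysis can be checked with a few lines of Python. The first function sums 0 + 1 + ... + (n - 1) for n itemsets. The second counts the t-itemsets that can be formed when the i-th feature has a given number of possible values, using the standard recurrence for elementary symmetric polynomials; that recurrence is our own choice for illustration and is not claimed to be the recursive relation used in the paper. The feature value counts in the example are hypothetical.

```python
def max_dependencies(n: int) -> int:
    """Upper bound on dependencies among n ordered itemsets: 0 + 1 + ... + (n - 1)."""
    return n * (n - 1) // 2


def max_itemsets(values_per_feature, t: int) -> int:
    """Number of t-itemsets that pick t distinct features and one value for each."""
    counts = [1] + [0] * t   # counts[k]: k-itemsets over the features seen so far
    for c in values_per_feature:
        # Descending k so that each feature contributes at most one item.
        for k in range(t, 0, -1):
            counts[k] += c * counts[k - 1]
    return counts[t]


total = max_dependencies(5)       # bound for five itemsets
per_itemset = total / 5           # uniform share per itemset, (5 - 1) / 2
pairs = max_itemsets([2, 3, 4], t=2)
```

For three features with 2, 3 and 4 possible values, the 2-itemsets number 2·3 + 2·4 + 3·4, and the per-itemset share of the dependency bound matches (n - 1) / 2.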

V. EXPERIMENTS AND RESULTS
This section presents the evaluation of the proposed method (D-MFCLMin). The H-MFCLMin method is used for comparison. The data set is introduced first, and the results are then examined.

A. Dataset
This study uses the data set of a social network called Pokec [23]. Pokec is the most popular online social network in Slovakia. The data set contains anonymized user profiles of this social network together with the friendship links between them. It should be noted that friendship relations in Pokec are directed. A user profile contains 59 fields, of which only eight are mandatory. Table I shows the features of these eight fields.

B. Results
As mentioned earlier, to evaluate the performance of the proposed algorithm, its results were compared with those of the H-MFCLMin algorithm. The output of both methods is the same: there are no differences between the FCLs extracted by the two algorithms. Fig. 2 shows the conceptual view derived from Pokec with β = 0.3. An interesting feature of this figure is the two-way links between itemsets: whenever there is a conceptual link from itemset A to itemset B, there is also a conceptual link from B to A. As already mentioned, this social network is directed, which means that a friendship can be one-sided. However, the resulting output reveals that the users of this social network have bilateral friendship relations. Although the proposed algorithm (D-MFCLMin) and the H-MFCLMin algorithm extract the same conceptual view from the social network, the time they take to do so differs slightly. Fig. 3 shows the run time of each algorithm for extracting MFCLs from the Pokec social network at different values of the parameter β. The parameter α was set to zero. Both algorithms were run 10 times, and the average execution time is reported as their run time. As can be seen, at high values of β both algorithms have almost the same performance, but at lower values of this parameter the difference between the run times of the two algorithms becomes greater. This difference is the time the proposed algorithm takes to determine the dependencies between itemsets. It is noteworthy that there are no dependencies between the itemsets of the data set used, so no dependency-based pruning takes place in this experiment. However, as Fig. 3 shows, despite the lack of dependencies in Pokec, in the worst case (small values of β) the run time of the proposed algorithm is at most 15 percent higher than that of H-MFCLMin. If there are dependencies between itemsets, the search space can be pruned and the extraction of FCLs accelerated, so the difference in performance between the two algorithms is expected to change in favor of the proposed algorithm.

VI. CONCLUSION
The widespread use of social networks has produced a very high volume of information, so knowledge extraction from them has become an area of interest for researchers. FCLs are one approach to extracting knowledge from these networks that, in addition to the data on connections, also takes into account the data on the nodes of these networks. In this paper, a new algorithm was presented that introduces and uses the concept of dependency to accelerate the extraction of FCLs. Dependencies between data allow a portion of the search space to be pruned and thus accelerate the extraction of conceptual links. Because the data set used contains no dependencies, this acceleration was not observed, but the test results showed that even without dependencies the proposed algorithm performs almost the same as the H-MFCLMin algorithm.

Fig. 3. The run time of the two algorithms, D-MFCLMin and H-MFCLMin, for different values of β.

TABLE I. FEATURES OF THE MANDATORY FIELDS IN THE SOCIAL NETWORK
* Mostly areas in Slovakia, but some areas in the Czech Republic and Germany are included as well