Use of Non-topological Node Attribute Values for Probabilistic Determination of Link Formation

Here we propose a probabilistic model for determining link formation in a social network, using a Naïve Bayes classifier on the non-topological attribute values of nodes. The proposed model gives a score that indicates the relationship strength of a link that has not yet formed. In addition to the Naïve Bayes classifier, a weighted average of attribute value matches helps determine the friendship score of such a link. With the growth of online social networks and their influence on people, more and more individuals are gaining wider and richer social connections. Everyone tries to connect more in order to explore more, and so an individual needs better and more definitive tools to grow their network: the wider the network, the greater the possibility to explore. We present a novel approach for predicting a link (friendship) between two individuals (nodes) in a social network. The approach uses the non-topological attribute values of both nodes and predicts the possibility of a link by applying a Naïve Bayes classifier to the non-topological attribute values of nodes in existing links. The possibility of a link is expressed by a single quantitative measure, the friendship score (FSCORE) between two unconnected individuals. A higher FSCORE means a higher possibility of a link between the two nodes.


I. INTRODUCTION
Online Social Networks (OSNs) have become an integral part of today's life. An OSN is where people maintain their social connections. Social networks still serve the same purposes as before: exchanging information, furthering a cause, keeping up communication, and guiding and developing a society, to name a few. The more connected an individual is, the more he or she can achieve through those connections.
In the early days, greeting and meeting people at social gatherings was the only way to increase one's social network and influence. Today, with the acceptance and spread of online social networks, the ways to connect individuals have improved significantly. Different online social networks address different interests of an individual. Facebook and Google+ mainly exist to share information, initiate conversations and discuss particular topics. Twitter, a micro-blogging site, lets users comment on any issue at hand or in mind and let the world know about it. Flickr is a photo-sharing social network. LinkedIn is an online network of professionals.
With every online social network site, an individual creates a new and different social network of people in a given context. With the growing number of options and the different focus areas of these networks, the data they generate is diverse and huge. This data gives analysts a great opportunity to mine it and to interpret the future course of the network.
While there are inherent risks in the use and distribution of OSN data, the data also has many potential benefits. Interpreting social networking data, together with tools built for that purpose, can help strengthen existing relationships and provide opportunities for creating new ones. With better means and tools, a stronger and more connected network can be planned and created. Today, networks deploy different techniques to help users grow their social circle, connect with new individuals and find better content of interest.
Every individual is on the lookout to extend his or her network with people of interest. Each individual faces two impediments in connecting with individuals of interest:

1) Who is the one of interest? 2) How high is the possibility of connecting with the one of interest?
Different social networks employ different means to help users grow their own networks. The most common ways of predicting a higher probability of connecting in a network are:
a) Individuals with the most mutual friends are suggested a connection
b) Individuals are asked to suggest a connection between their unconnected friends
c) Unconnected individuals with multiple short paths between them in the graph are suggested a connection
d) Unconnected individuals who comment multiple times on the same conversation are suggested a connection
New and better tools are continually evolving to provide users with better services and to enhance their experience of social connection. A wide range of research is under way in the area of suggesting connections. In research terminology, this is called link prediction in graphs. In a friend suggestion mechanism, link prediction can be used to identify hidden links that have not yet formed in an online social network.
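As an illustration of the first heuristic listed above (suggesting the individuals with the most mutual friends), the following minimal Python sketch ranks unconnected pairs by their number of mutual friends. The function name `mutual_friend_suggestions` and the toy friendship data are hypothetical and are not taken from any OSN's actual implementation.

```python
from itertools import combinations


def mutual_friend_suggestions(friends):
    """Rank unconnected pairs of users by their number of mutual friends.

    `friends` maps each user to the set of users they are linked to
    (undirected, so every link appears in both sets).
    """
    scores = {}
    for a, b in combinations(sorted(friends), 2):
        if b in friends[a]:
            continue  # already connected, nothing to suggest
        common = friends[a] & friends[b]
        if common:
            scores[(a, b)] = len(common)
    # Highest count first; ties are broken alphabetically by pair.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy network, for illustration only.
toy_network = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"bob", "cat"},
    "eve": set(),
}

for pair, shared in mutual_friend_suggestions(toy_network):
    print(pair, shared)
```

This heuristic relies only on graph topology, in contrast to the non-topological attribute values that this paper works with.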
Link prediction outside the social network domain can have multiple uses, such as:
a) Recommendation and relevance prediction in e-commerce [3]
b) Protein interaction prediction in life sciences [2]
c) Identifying hidden groups of terrorists or criminals using link prediction in the security domain [4]
The link prediction problem is relevant to different scenarios, and several algorithms have been proposed in recent years to solve it. One common approach to the link prediction problem is to use a supervised learning algorithm. This approach was introduced by Liben-Nowell and Kleinberg in 2003 [6], who studied the usefulness of graph topological features by testing them on bibliographic data sets. In 2006, the work was extended to identifying hidden groups of terrorists by Hasan et al. [4], and since then several other researchers have implemented this approach. Most of the solutions these researchers proposed were tested on bibliographic or co-authorship data sets [4], [6], [7], and [8]. In 2009, Chen et al. [1] described several algorithms used by IBM on its internal social network, which enables its employees to connect with each other. Song et al. used matrix factorization to estimate the similarity between nodes in real-life social networks such as Facebook and MySpace [9]. In 2011, W. J. Cukierski et al. [10] extracted 94 distinct graph features. Using a Random Forests classifier, they achieved impressive results in predicting links on Flickr datasets.
Here we propose a novel approach for predicting a link (friendship) between two individuals (nodes) in a social network: using OSN (Online Social Network) data, we predict the possibility of a link by applying a Naïve Bayes classifier to the attribute values of nodes in existing links.

II. PROBLEM STATEMENT
Classification of links in social network can be done on different types of node data:

a) Topological Attribute Data b) Node Interaction Data c) Non-Topological Attribute Data
All of the above types of node data can be used to classify links. The classification helps in predicting the possibility of a connection between two unconnected nodes. Most research to date has been done on topological attribute data and node interaction data.
In this paper, we propose a classification mechanism based on non-topological attribute data. The dataset used for experimentation is from Facebook™. We use a Naïve Bayes classifier to classify the existing links and use the classification to predict a link between two nodes.

A. Online Social Network
The social network considered in this paper is Facebook. Facebook is an online social interaction and networking service. A user above 13 years of age can create an account on Facebook. On Facebook, a user can make friends with other Facebook users. A user can post anything on her/his wall (a representation of profile space) or on a friend's wall. A user can "Like" or "comment" on her/his own posts or on posts by friends. The posts, "Like" actions and comments can be termed public interaction between users. All public interactions performed by a user are visible to all users on that user's timeline.
Besides public interaction, a user can have private interaction with a friend. The possible means of private communication are chatting and inbox messaging. All private communications are confidential and are visible only to the two users between whom the interaction has taken place. These public and private interactions between users are termed node interaction data and can be used to predict friendship.
A user stores profile information at the time of registration with Facebook. Profile information on Facebook can range from First Name, Last Name, Date of Birth, Gender, Religion, Home Town, Current City and Relationship Status to Work, Work History and Education History information. Over time, a user can update, add or delete profile information. A user can also restrict the visibility of this information, from "Public" to "Friends Only" to "Only Me" to any other specific friends group available. Because the visibility of data is governed selectively by the user and nearly all the attributes are optional, attribute values are very often blank.

B. Dataset Preparation
A sample subset of Facebook was used as the dataset. This dataset was extracted from Facebook using a Facebook app named "FBNetworkAnalysis". The URL for this app is "https://app.facebook.com/mytestappfbabhi". The data was collected from 7637 users. In the context of this analysis, users are represented as nodes, and a pair of friends is represented as two nodes joined by an edge. A friendship is represented as a link between two users. A user can mark a link as "Friend", "Cousin" (or any other relative) or "Spouse"/"Significant Other". This relationship is taken as the name/type of the link. The link name/type is not considered in this analysis.
All the profile information made available by a user is considered as the node's attributes, and the analysis of the links is done using the values of the node attributes of the users (nodes) in a link (edge).

C. Data Representation
The Facebook sample data set is represented as a graph with a finite number of nodes and a finite number of connections. A connection (link) can only be established when a friend request is sent by one user (node) and accepted by another user (node). Mutual acceptance by both nodes makes the link undirected. There is only one link between two nodes; the link type can differ depending on the nodes in consideration ("Friend", "Relative" or "Spouse"), so there are no multi-edges.
If two individuals have the same school populated in the School attribute, the weight of having a friendship is 0.0235561; if the schools populated in the School attribute for the two individuals are different, the weight of having a friendship is 0.0032468 instead.
If either of the individuals does not have the School attribute populated or shared, then no weight is added for the School attribute in the friendship score.
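The School attribute rule above can be sketched as a small lookup. This is only an illustration: the function name `school_weight` and the profile dictionaries are hypothetical, and only the two weights quoted in the text are used. How the per-attribute weights are combined into the final FSCORE is not shown here.

```python
# Weights quoted in the text for the School attribute.
SCHOOL_WEIGHTS = {"same": 0.0235561, "different": 0.0032468}


def school_weight(profile_a, profile_b):
    """Return the School weight for a pair of profiles.

    Returns None when either profile has no School value populated or
    shared, meaning the attribute adds nothing to the friendship score.
    """
    school_a = profile_a.get("school")
    school_b = profile_b.get("school")
    if not school_a or not school_b:
        return None
    return SCHOOL_WEIGHTS["same" if school_a == school_b else "different"]


# Hypothetical profiles, for illustration only.
print(school_weight({"school": "North High"}, {"school": "North High"}))
print(school_weight({"school": "North High"}, {"school": "East High"}))
print(school_weight({"school": "North High"}, {"gender": "F"}))
```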

H. Friendship Score (FSCORE)
If a link is present between two nodes, a and b, then: [ ]
where FNT(a, b) is a function of the non-topological attributes of a and b. If a link has not formed between two nodes, a and b, then the friendship score needs to be calculated using non-topological attribute data. The FSCORE is calculated as:
If two individuals (non-friends) have only Same Gender and no other attributes populated, the probability of friendship is 0.0047786, and hence the friendship score for future friendship is 0.0047786. Similarly, if two individuals only have Same Location, then their friendship score is 0.0106998.

IV. MATHEMATICAL MODEL FOR DATASET WITH MISSING DATA
The FSCORE or attribute weight calculated here uses data in which attribute values are missing in many node elements. Also, for the social data under consideration, the features/attributes cannot all be assumed to be independent of each other. Since the features depend on each other, the Naïve Bayesian distributed probability equation cannot be used here. Instead, we propose the use of a weighted average.

∑ ∑
Where:
i: Attributes/features, e.g. Gender (if no value is associated with an attribute in either or both of the nodes a and b, then that attribute is not taken up in the calculation of FSCORE)
j: The different attribute values possible, e.g. Same Gender, Different Gender (no gender available is also a valid possible value, but it is already excluded from the equation because such attributes are eliminated while collating the final set of i's)
The FSCORE in equation 1 differs from the one in equation 2 because of the difference in dataset. The dataset used in equation 2 is the subset (nodes with no missing values for the gender, location, school and favorite athlete attributes) of the one used for equation 1.
If complete data is available, the number of attribute combinations to store and to update on link formation grows with the number of attributes in consideration. Maintaining and updating the data for all combinations of attributes becomes cumbersome. In the case of n attributes, 2^n combinations need to be maintained. Excluding some of the attributes from the final test set may require maintaining different combinations separately.
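To make the growth in combinations concrete, the short Python sketch below enumerates every same/different match pattern for a list of attributes. The function name `match_patterns` is hypothetical; the four attribute names are the ones used in the paper's complete-data subset.

```python
from itertools import product


def match_patterns(attributes):
    """Enumerate every same/different pattern over the given attributes."""
    return [
        dict(zip(attributes, combo))
        for combo in product(("same", "different"), repeat=len(attributes))
    ]


attributes = ["gender", "location", "school", "favorite_athlete"]
patterns = match_patterns(attributes)
print(len(patterns))  # 2 ** 4 patterns for four attributes
print(patterns[0])
```

Each additional attribute doubles the number of patterns for which link counts must be stored and updated.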
What we propose for this is an equation of approximation.

$$\mathrm{FSCORE} \approx P(SG \mid SL)\, P(SL \mid SS)\, P(SS \mid SFA)\, P(SFA) \qquad (3)$$
On the same data set used in equation 2:
We consider the relation between two attributes and link it to the next attribute in consideration. In the above example, we combine the probability of Same Gender given Same Location with the probability of Same Location given Same School and the probability of Same School given Same Favorite Athlete, along with the probability of a link having Same Favorite Athlete. This is done on the data set that has complete data and no missing values. The nearness of the FSCORE values in equations 2 and 3 confirms that the approximation works well with the proposed formula.
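The chaining idea can be checked numerically with a minimal Python sketch. It rests on two assumptions that the text does not state outright: that the four terms are multiplied together, and that all probabilities are taken over existing friend links only. The counts are the training-set friend counts the paper reports for the complete-data subset; the key encoding and the helper `n_links` are hypothetical. The numbers produced are shares of existing links with a given match pattern, so they are not the paper's FSCORE values.

```python
# Friend counts per match pattern, in the order
# (gender, location, school, favorite athlete); "S" = same, "D" = different.
COUNTS = {
    "SSSS": 381, "SSSD": 572, "SSDS": 612, "SSDD": 964,
    "SDSS": 423, "SDSD": 237, "SDDS": 447, "SDDD": 524,
    "DSSS": 144, "DSSD": 208, "DSDS": 212, "DSDD": 1102,
    "DDSS": 68, "DDSD": 83, "DDDS": 208, "DDDD": 310,
}


def n_links(pattern):
    """Count links matching a pattern such as 'SS??' ('?' matches either value)."""
    return sum(
        count
        for key, count in COUNTS.items()
        if all(p in ("?", k) for p, k in zip(pattern, key))
    )


total = n_links("????")

# Exact share of links where all four attributes match.
exact = n_links("SSSS") / total

# Chained approximation: P(SG|SL) * P(SL|SS) * P(SS|SFA) * P(SFA).
chain = (
    n_links("SS??") / n_links("?S??")
    * n_links("?SS?") / n_links("??S?")
    * n_links("??SS") / n_links("???S")
    * n_links("???S") / total
)

print(f"exact={exact:.5f} chain={chain:.5f}")
```

Under these assumptions the chained product needs only pairwise counts rather than a count for every full combination, and on this data it lands close to the exact share.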

VI. CONCLUSIONS AND RECOMMENDATION
A relationship is formed on the basis of different parameters, and we have tried to quantify the parameters for relationship building from existing link/relationship data, as described in the paper. The possibility of a relationship (FSCORE) can be derived using the model proposed in this paper.
FSCORE is an effective way of predicting the possibility of a relationship/link between two nodes using the non-topological attribute values of nodes. The significance and weight of non-topological attributes are determined by the existing links and by the recurrence of a value pattern for these attributes in existing links. FSCORE can be used to calculate the cost of connecting to a distant node in a graph. FSCORE can provide a measure of strength between two unconnected nodes in order to make decisions or predictions in a different set of problems in a graph network. FSCORE can also provide a factor to help identify/quantify connected nodes. FSCORE can be used to compare and rate a relation between connected or unconnected nodes as stronger or weaker than other relations.
In a graph network, if a link of reference is to be invoked or an optimized path for traversal has to be identified, then FSCORE can provide a quantitative value for analysis between two connected or unconnected nodes. FSCORE can be used as a relationship cost parameter in similar graph network problems. FSCORE is calculated using non-topological attribute values between nodes and can be coupled with topological attribute data to improve the prediction possibility.

fbG = (U, L), where:
U (or U(fbG)) is a set of nodes
L (or L(fbG)) is a set of links, each of which is a set of two nodes (undirected)
Two nodes that are associated with a link are adjacent nodes. Let n = |U| and m = |L|.
The neighbor set of each node u is N(u) = {v | uv ∈ L}.
The degree of user u is d(u) = |N(u)|.

D. Naïve Bayes Classifier
The Naïve Bayes classifier depends on Bayes' theorem:

$$P(c_j \mid d) = \frac{P(d \mid c_j)\, P(c_j)}{P(d)}$$

Where,
P(c_j | d): Probability of instance d being in class c_j (this is what we will be computing)
P(d | c_j): Probability of generating instance d given class c_j (we can imagine that being in class c_j causes you to have feature d with some probability)
P(c_j): Probability of occurrence of class c_j (this is just how frequent the class c_j is in our database)
P(d): Probability of instance d occurring (this is just how frequent the instance d is in our database)

And [ ]
Where,
P(c_j | d): Probability of instance d being in class c_j
[ ]: Existing links having instance d in class c_j
P(d): Probability of instance d occurring

To simplify the task, naïve Bayesian classifiers assume attributes have independent distributions, and thereby estimate:

$$P(d \mid c_j) = P(d_1 \mid c_j) \times P(d_2 \mid c_j) \times \cdots \times P(d_n \mid c_j)$$

where P(d_1 | c_j) is the probability of class c_j generating the observed value of feature 1, P(d_2 | c_j) is the probability of class c_j generating the observed value of feature 2, and so on.

Advantages of Naïve Bayes Classifier:
Fast to train (single scan

According to the above, the FSCORE for two nodes with Same Gender (SG), Same Location (SL), Same School (SS) and Same Favorite Athlete (SFA), and no other attribute value populated, in a network where there are lots of nodes with missing attribute values, is as follows.
When no nodes are missing attribute data, every attribute value either matches or does not match between two nodes. In such cases FSCORE can be calculated differently. In the training set, the number of friends with:

SG + SL + SS + SFA = 381
SG + SL + SS + DFA = 572
SG + SL + DS + SFA = 612
SG + SL + DS + DFA = 964
SG + DL + SS + SFA = 423
SG + DL + SS + DFA = 237
SG + DL + DS + SFA = 447
SG + DL + DS + DFA = 524
DG + SL + SS + SFA = 144
DG + SL + SS + DFA = 208
DG + SL + DS + SFA = 212
DG + SL + DS + DFA = 1102
DG + DL + SS + SFA = 68
DG + DL + SS + DFA = 83
DG + DL + DS + SFA = 208
DG + DL + DS + DFA = 310

[ ] (2)