Social Network Link Prediction using Semantics Deep Learning

Social networks have attracted an enormous number of users within only a few years, and link mining has become a key research track in this area. It has drawn the attention of many analysts as a powerful means of understanding the relations between nodes in social circles. Many data sets of current interest are best described as collections of interrelated, linked objects. The main challenge faced by analysts is handling the structure that links these objects. For this purpose, we design a new comprehensive model that combines link mining techniques with semantics to perform link mining on structured data sets. To our knowledge, past work has investigated such structured data sets using this technique. We extracted real-time post data from one of the well-known SN platforms using different tools and examined the community's behavior toward those posts. We verified our model using diverse classifiers, and the derived outcomes are encouraging.
Keywords: Link prediction system; post analysis; semantic similarity; data analysis; social network analysis; dictionary; cosimilar links


I. INTRODUCTION
A social network is an online platform that people use to build informal communities or social relations with others who share similar or different interests, views, real-life links, and experiences. It enables discussion and interaction with other people online. Examples of SN platforms are Facebook, Instagram, and Snapchat. It is a modern way to model the relations between people in a group or community. Social network analysis (SNA) is a growing field that emerged from sociology [1]. Predicting links within a network is one of its main goals. For example, to identify who plays the most "important" role in a circle, we can model the social network as links between individuals, where a link between two individuals indicates a connection such as working on the same task. Once the network is built, it can be used to gather information about individuals, such as the most active user, common interests, followers, and likes.
The study of this valuable information in social networks has opened the door for research in which different models of analysis are used, such as sentiment analysis, link prediction, and semantic analysis, and many tools exist to give a clear picture of the analysis results.
Initially, most research on social networks was done by social scientists and psychologists; more recently, computer scientists have contributed a great deal. SNA is now used for different research purposes as the underlying conceptual model [2].
To represent relationships in a network, nodes and edges are used: nodes represent persons, and an edge represents a relation between two nodes. Edge data can be lost for many reasons, e.g., partial information gathering methods, ambiguity of links, or source restrictions [3]. Changes over short periods cause many problems and raise challenging questions such as:
• For how long will two nodes remain linked?
• Is the link between two nodes formed through others?
• Are people who are not currently linked expected to become linked at some later point?
The case we address in this study is predicting a future relationship between two nodes, given that no relationship exists between them in the current state. Predicting such changes with high accuracy is therefore important for the future of social networks.
Data can be extracted from social networks using different techniques, which typically expose only limited information about nodes because of privacy. Complete information about nodes cannot be gathered without permission.
In this paper, we propose a method to predict relationships between people that may appear or fade over time, based on their behavior on the social network. To explore such behavior of nodes, we extracted specific data from one of the well-known social platforms using various tools. Section II covers the related work, and Section III covers the methodology and the proposed framework, followed by the experimental setup and results.

II. LITERATURE REVIEW
Social networks have received extensive attention in analysis work. SNA has been extended to study varied applications such as malicious networks [4], online societies, and professional groups, among other networks. Numerous workshops were held in the mid-1990s to bring together the artificial intelligence (AI) and link analysis communities. A 1998 meeting on AI and link analysis focused directly, for the first time, on extending AI techniques to relational data [5]. Structural link analysis based on profiles and groups addressed the problems of predicting, categorizing, and labeling friend relations in SNs by applying a feature construction approach [6].
In [7], [8], the authors cover advances in probabilistic models, manifold learning, and deep learning. This motivates longer-term open questions about appropriate objectives for learning good representations, about computing representations (i.e., inference), and about the geometrical connections between representation learning, density estimation, and manifold learning. Related work uses links to predict the classes or attributes of entities [9], [10].
Other work uses locations in a high-dimensional space. In [11], the authors identified a set of features that are key to superior performance in a supervised learning setup and explained the effectiveness of the features through their class density distributions.
The work in [12] presented a link prediction method based on comparing nodes. It makes significant use of node attributes, considering the similarity of nodes in terms of age, gender, etc.
The work in [13] uses the SN as a way to seek and obtain customized, reliable health advice from peers anywhere and at any time, by tracing dental health information sought and obtained on Twitter. In [14], the authors proposed a link prediction model that can forecast links that may appear or vanish later. The model has been applied effectively in two distinct domains (health care and a stock market).
In [15], the authors use a distinctive approach to semantic analysis called the Wikipedia Link Vector Model (WLVM), which uses only the hyperlink structure of Wikipedia instead of its full text.
A more recent trend is finding online as well as offline friends, as done in [16], which helps users identify offline contacts and find new groups online by using a machine learning classifier. The classifier identifies missing links even when tested on the hard problem of classifying links between individuals who have at least one common friend.
The approach in [17] suggests that combining topological structure with node attributes improves link prediction. For this purpose, the authors used the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) to optimize the weights of nodes.
In [18], a max-margin learning technique for nonparametric latent feature models is used. The authors connect max-margin learning and Bayesian nonparametrics to discover latent structures for link prediction.
In [19], the authors examined the novel problem of predicting negative links using only positive links. They proposed a principled framework, NeLP, which exploits positive links and content-centric interactions to predict negative links.
In [20], the authors revisit weighting techniques and propose two new weighting methods for link prediction, min-flow and multiplicative. They also find that the two methods give different prediction results on data sets.
In [21], the authors studied Twitter and a location-based SN to examine the multilayer links that may exist across the two networks between pairs of users, and found that pairs of users connected on both platforms have a larger shared neighborhood than users connected on only one platform.
In [22], the authors showed that bi-directional links are more useful than uni-directional links when testing on real data sets. For this purpose, they proposed a new direction randomization technique to study the role of direction in predicting links in a network.
In [23], the researchers proposed a CF-based web service that provides predictions for various items by using the ratings and opinions that people provide on Facebook and other social media sites.
In [24], the authors used different algorithms, such as support vector machines and decision trees, together with different topological structures to check whether a link exists between two nodes, and also examined the link load and the link type.
In [25], the authors calculated the similarity between two homepages from their different features and computed the likelihood that the pages are related to each other.
Popescul and Ungar [26] proposed a statistical learning model for reference prediction, in which the model learns link prediction from database queries involving joins, aggregations, and selections.

III. PROPOSED STUDY AND DESIGN
The previous section described the different studies that have addressed link prediction by considering various factors. This section gives a comprehensive view of the proposed study and the design of the link prediction framework. For our research, we targeted a highly active social platform and performed semantic analysis on it.
The proposed approach consists of two frameworks, post analysis and post kind analysis, to predict links. The theme of the proposed work is to explore page followers' behavior toward page posts in the form of likes and comments, to check the kind of each post, and to check page trends in order to predict links between page followers.

A. Conceptual Framework for Data Selection
Fig. 1 gives a conceptual framework for data extraction. Data collection is one of the major tasks, as there is neither a publicly available database nor a tool that extracts the data directly. For our research, we therefore selected publicly available Facebook pages and extracted the post-related data from these pages.
Facebook was chosen as the data source because of the diversity of its data, as it provides rich functionality alongside a huge audience. Other social networks provide different functionality according to user interest; e.g., on LinkedIn the users are closely connected, while Instagram contains only image and video data.
For data collection, we used different methods and techniques to build a properly consolidated dataset. After extracting the data, we applied different scripts and procedures to produce a clean, validated dataset, as the extracted data contains a large amount of raw information. The preprocessing step has several stages, such as removing stop words, white space, emojis, URLs, etc. This process required considerable effort and time.
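The cleaning stages listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list, the regular expressions, and the function name preprocess are hypothetical and are not the scripts used in this study.

```python
import re
from typing import List

# Hypothetical stop-word list; the list actually used in the study is not given.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "on", "for"}

# Matches http(s) links and bare www. links.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Matches anything that is not a lowercase ASCII letter or white space
# (emojis, digits, punctuation, hashtag signs).
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def preprocess(post: str) -> List[str]:
    """Return the cleaned tokens of a single post."""
    text = post.lower()
    text = URL_RE.sub(" ", text)         # drop URLs first, before punctuation goes
    text = NON_LETTER_RE.sub(" ", text)  # drop emojis, digits, punctuation
    tokens = text.split()                # split() also collapses extra white space
    return [t for t in tokens if t not in STOP_WORDS]
```

Removing URLs before the other symbols matters, because stripping punctuation first would leave URL fragments behind as ordinary words.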

B. Analysis
Our first approach to link prediction uses the semantic framework on the posts data and derives relationships between page followers from it. Fig. 2 describes the semantic similarity kernel framework applied to the dataset. After preprocessing, we applied the semantic kernel to the targeted dataset.
First, we applied the TF-IDF method to obtain the term-document matrix, as it captures the frequency of the input words in the dataset and the term-by-term correlation. It helps to identify which post is most similar to another post on the same page. We then applied the KNN approach with the semantic kernel to determine the relationships in the dataset.
The semantic kernel also incorporates the GVSM (Generalized Vector Space Model) [27], which takes term vectors that are linearly independent and gives results in the form of term-by-term relationships. Suppose X is a matrix that consists of n documents and m terms; then GVSM gives the semantic kernel (1). In this expression, K is the Gram matrix of the rows and G is the Gram matrix of the columns. G must be positive semidefinite and must represent the inner products of the term vectors.
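To make the pipeline concrete, the following minimal Python sketch builds a TF-IDF matrix, forms a GVSM-style kernel, and finds the nearest post. It rests on assumptions that the text does not confirm: X is taken as documents by terms, the kernel is taken in the commonly used form X G X^T with G = X^T X, the IDF is smoothed by adding 1, and every post is assumed to contain at least one token. All function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Build a document-by-term TF-IDF matrix X (one row per post)."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumption); the +1 keeps every weight positive.
    idf = {t: math.log(n_docs / doc_freq[t]) + 1.0 for t in vocab}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        rows.append([tf[t] * idf[t] for t in vocab])
    return rows, vocab


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    b_t = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_t] for row in a]


def gvsm_kernel(x):
    """Assumed GVSM form: X G X^T, where G = X^T X is the term Gram matrix."""
    x_t = transpose(x)
    g = matmul(x_t, x)
    return matmul(matmul(x, g), x_t)


def nearest_post(kernel, i):
    """Index of the post most similar to post i (cosine-normalised kernel, 1-NN)."""
    def cosine(j):
        return kernel[i][j] / math.sqrt(kernel[i][i] * kernel[j][j])
    return max((j for j in range(len(kernel)) if j != i), key=cosine)
```

For example, with docs = [["cat", "sat"], ["cat", "sat", "mat"], ["dog", "ran"]], the first two posts share terms, so nearest_post returns each as the other's neighbour, while the third post has zero kernel value with both.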

C. Link Prediction based on Post Kind Analysis
Our second proposed method for link prediction is post kind analysis. Facebook page followers can comment on posts, and posts contain URLs, emojis, digits, hashtags, etc. Comments and likes depend on the nature of the post, and user behavior varies from post to post: a user who liked one post of a page will not necessarily like all of the page's posts. For our framework, we first converted the preprocessed posts dataset into a matrix and compared that matrix with the dictionary [28], [29]. As a result, we obtain three categories of post text, as shown in Fig. 3. We formed classes of users for these categories to predict the relations between them. This is also useful for identifying the page trend: the more optimistic posts a page has, the more positive the page is.
Suppose XYZ is a dataset of Facebook posts, which we categorize by comparing it with the dictionary. Suppose there are two posts, A and B, and both belong to the optimistic category. If 20 page followers responded to post A and 30 page followers responded to post B, then on the basis of these responses we predict the links that might develop over time. This also helps to identify and predict links in the sense that, if two posts have the same nature, a follower of post A is expected to respond to post B as well.
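The idea of post kind analysis can be illustrated with a small Python sketch. The word lists below are a hypothetical stand-in for the dictionary [28], [29], the scoring rule (positive hits minus negative hits) is an assumption, and the function names categorize and predict_links are invented for illustration.

```python
from itertools import combinations

# Hypothetical mini-dictionary standing in for the real sentiment dictionary.
POSITIVE = {"good", "great", "happy", "love", "win"}
NEGATIVE = {"bad", "sad", "hate", "loss", "angry"}


def categorize(tokens):
    """Assign a post to one of three categories by counting dictionary hits."""
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def predict_links(posts, followers):
    """Candidate links between followers of different posts in the same category.

    posts:     {post_id: list of tokens}
    followers: {post_id: set of user ids who liked or commented}
    Returns a set of unordered user pairs (frozensets).
    """
    by_category = {}
    for post_id, tokens in posts.items():
        by_category.setdefault(categorize(tokens), []).append(post_id)

    links = set()
    for post_ids in by_category.values():
        for a, b in combinations(post_ids, 2):
            for u in followers.get(a, set()):
                for v in followers.get(b, set()):
                    if u != v:  # a user cannot be linked to themselves
                        links.add(frozenset((u, v)))
    return links
```

With two positive posts A and B, every follower of A is paired with every follower of B as a candidate link, which mirrors the A and B example above; followers of a negative post are not paired with them.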

IV. EXPERIMENTAL DESIGN AND SETUP
This section gives complete details on the collection of the data set and presents the complete results.

A. Data Collection and Analysis
This study covers the phases required to address the problem of link prediction based on semantic analysis and deep learning.
For data collection, we used different methods and techniques to build a properly consolidated dataset. The gathered data contains multiple columns, of which we selected only those that met our selection condition. Data preprocessing was applied to the targeted data to remove raw and unstructured content. The preprocessing of the data was very time-consuming.

B. Targeted Pages
There are many Facebook pages related to movies, games, education, celebrities, news, books, poetry, online shopping, and more. We can categorize these pages on the basis of their content. For experimentation, we targeted the four categories shown in Table I.
Initially, for each category, we extracted the data of two different pages, where data extraction consists of two parts.
IDs are gathered using Netvizz [30], which is a Facebook application. It supports data extraction and data collection from different parts of Facebook. For the selected page IDs, we extracted the data from Facebook using Facepager [31] and the R tool. Facepager can fetch publicly available data from Twitter, Facebook, and web pages. It takes page and group IDs and allows users to access the data. It lets users retrieve Facebook posts, albums, and pictures, along with their metadata in the form of likes, shares, comments, tags, etc. For this work, we chose only page posts, as mentioned above. We created new databases for the targeted pages, retrieved the posts and their metadata, and stored them in CSV form, as shown in Fig. 4.

C. Preprocessing of Data
The extracted dataset was not in clean form, as the posts were written in different styles; e.g., "good" was written as "gud". Intelligent replacements of such words were made with scripts and algorithms to clean the data, which yields text free of irrelevant content and reduces the size of the dataset for better processing. Preprocessing consists of several steps, each with its own importance.
Step 1: Remove the stop words (the most frequently used words), as they are not directly related to the content, and remove extra spaces, digits, etc. from the text.
Step 4: Intelligent replacement of words.
We used C++, Java scripts, and the R tool for this purpose. The transformation of the dataset was very time-consuming and was another difficult task. The preprocessing results are shown in Fig. 5, which shows that the extracted Scrolls dataset contains 18% raw data and that only 32% was useful.
After preprocessing, we applied the semantic kernel to the selected dataset, which sequentially compares the page posts with each other. Page followers' behavior varies from post to post, even on the same page: if a follower reacts positively to one post on the page, he or she will not necessarily react similarly to other posts. The semantic kernel measures similarity, which helps to identify which posts have the same content. We assigned temporary numbers to the post IDs for easier processing of the dataset, and in the graph representation phase we again used the original IDs. This first framework gives content-based post similarity. Results of the following form are produced on the Scrolls dataset, as shown in the figures. We examined how every post correlates with the other posts; a few posts create direct relations, and a few generate co-similar relations. We picked one node (post) from the posts dataset and checked its co-similarity with the other nodes (posts). Suppose we select node (post) 8, as shown in Fig. 6. Checking its similarity with the other posts leads to posts 4 and 15, as shown in Fig. 7 and Fig. 10, respectively. Continuing this example, the following relations are formed.
Post 4 has the highest co-similarity with post 34, as shown in Fig. 8, and post 34 has the highest co-similarity with post 27 and vice versa, as presented in Fig. 9. We end the process here, as both posts are bi-relational. Post 15 has the highest co-similarity with post 48 and vice versa, so we end the process here as well, as both posts are bi-relational, as shown in Fig. 11.
Here the nodes are 8, 4, 34, 27, 15, and 48. Direct relations are created between these nodes, and from these directed relations we find the co-similar links among the nodes. To represent the results of this proposed framework, we used the graph technique, as shown in Fig. 12. We picked one node, called the starting node, and checked which post is most correlated with it. We did the same with the second node and continued this process until two posts formed a bidirectional link. Using these graphs, we find the co-similar links and predict links on the basis of these co-similar links. Fig. 14
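The chain-following procedure described in this subsection (start at one post, move to its most similar post, and stop when two posts select each other) can be sketched as follows. The similarity values are made up for illustration and only reproduce the order 8, 4, 34, 27 from the example; the function names and the dictionary-of-dictionaries layout are assumptions.

```python
def most_similar(sim, i):
    """Post with the highest similarity to post i (excluding i itself)."""
    return max((j for j in sim[i] if j != i), key=lambda j: sim[i][j])


def co_similar_chain(sim, start):
    """Follow most-similar links from `start` until a bidirectional pair is met."""
    path = [start]
    seen = {start}
    current = start
    while True:
        nxt = most_similar(sim, current)
        if nxt in seen:
            # Safety stop for ties or asymmetric scores that would loop.
            return path
        path.append(nxt)
        seen.add(nxt)
        if most_similar(sim, nxt) == current:
            # current and nxt pick each other: the pair is bi-relational.
            return path
        current = nxt


# Hypothetical symmetric similarity scores between four posts.
SIM = {
    8: {4: 0.6, 34: 0.2, 27: 0.1},
    4: {8: 0.6, 34: 0.7, 27: 0.3},
    34: {8: 0.2, 4: 0.7, 27: 0.9},
    27: {8: 0.1, 4: 0.3, 34: 0.9},
}
```

Calling co_similar_chain(SIM, 8) walks from post 8 to post 4, then to post 34, and stops at post 27, because posts 34 and 27 are each other's most similar post. Note that the sketch follows only the single most similar post at each step, whereas the example in the text follows two branches from post 8.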

D. Second Method
After comparing the posts of Scrolls with the dictionary, the following results are produced. On the basis of these positive and negative posts, we predicted that a follower of one post may form a relation with a follower of another post in the same category. The same experiment was done on all other categories of Facebook pages. For the experiment, we took 62 posts; the analysis placed 9 posts in the positive category and 9 in the negative category, and all the remaining posts were neutral, as shown in Table II. Fig. 18 shows the expected types of links.

V. CONCLUSION
The theme of this work is to exploit social networks to predict the nature of relationships among users who are not directly connected. For this purpose, similar pages from a well-known social network were selected, data was gathered according to the nature of the work, and results were obtained using our proposed framework. Moreover, our proposed framework shows the contribution of the semantic approach.
The proposed framework also involves a dictionary to find the nature of each post, which plays a vital role in categorizing posts as well as the links among users. For future work, a comprehensive tool should be developed that is capable of exploiting the publicly available data from SNs, so that links can be created among users who belong to different networks. Moreover, the data can be used to monitor the activities of users on a certain page or group. In addition, this activity will also enable us to find groups or pages with public sentiment. Finally, the timing information in our approach can help to find spam pages or groups, as well as spam users, across any social network.
diagonal matrix whose terms are the diagonal elements of . The semantic kernel corresponds to different estimates of K; it acts as a clustering approach and indicates which terms are more correlated. Suppose XYZ is a Facebook page and that, after applying the semantic kernel to its posts, the terms of post A match the terms of post D. If A has 20 comments from page followers and D has 40, then we can predict that followers of post A may form relations with followers of post D in the future, as both posts have the same content.

Fig. 3. A conceptual framework for post kind analysis.


Page followers' likes on posts; query time and its duration; page followers' comments on posts; follower IDs; post creation time, date, etc.

Fig. 12. Results of the semantic analysis framework for the Scrolls example above.

Fig. 13. Followers of these posts, where post followers are attached to post IDs. The first half of each ID is the page ID, and the other half is the ID of the person who posted on the page, e.g., 141290822651720__781519135295549.
Fig. 13 shows the post followers who commented on these posts. A similar experiment was done on all the above-mentioned pages, and the results are encouraging. The graph is the most commonly used technique for network analysis and provides an actual representation of a network. A graph consists of nodes (V) and edges (E): G = (V, E)
and 15 signifies the co-similar links of Humans of Pakistan.

Fig. 17. Co-similar link of Human of Kinnaird, where post followers are attached to post IDs.

Fig. 18. Possible links created between page followers based on the dictionary results.