Deep Neural Network-based Relationship Identification Framework to Discriminate Fake Profile Over Social Media

Abstract—The growing use of social media for personal, business and political propaganda activities has also attracted a growing amount of anti-social activity. Anti-social elements gain a wider platform to spread negativity by hiding their identity behind fake profiles. In this paper, an analytical and methodological user identification framework is developed that binds implicit and explicit link relationships over the end-user's graphical perspective in order to identify malicious users, their communal information and sockpuppet nodes. In addition, this work applies a deep neural network approach over the graphical and linguistic perspective of the end-user to classify profiles as malicious, fake or genuine. The approach also helps identify the tradeoff between the similarity of node attributes and the density of connections when classifying identical profiles as sockpuppets over social media.


I. INTRODUCTION
Social media has entered many areas of our lives: of 7.5 billion people globally, 3.1 billion are active on social media. Many activities, such as communication, entertainment, political campaigning, and shopping, are carried out on social media platforms [1]. As a result, huge volumes of data are generated continuously on these platforms. The spread and popularity of social media have attracted the attention of antisocial elements. These people, unfortunately, use social media for scams or cyberbullying through fake accounts.
Ungenuine user profiles opened for mischievous purposes on social networks such as Facebook, Twitter and LinkedIn are called fake accounts. Fake accounts are usually opened out of a lack of trust, fear or a wish to hide from someone, to protect oneself from the potential loss of important news, or to access information while hidden. Apart from this, fake accounts are also opened in celebrities' names to gather followers, run ad campaigns, run negative campaigns about a brand, or obtain personal and profile information of users. Given the credibility and global expansion of social media, fake accounts opened in the names of individuals or companies can pose a major problem.
Domenico et al. [2] state that false profiles on social networks are those that do not comply with the terms and conditions established by the platform: they do not belong to real people, they do not belong to the person they indicate, and they pretend to be existing real profiles. They also distinguish fake manual or artisan profiles (created by people) and those generated and manipulated manually and automatically (bots or robots). They mention that there are different types of "tasks" of a fake profile: stalker, cyberbullying, gamers, spammers, pornography, digital reputation, media manipulation and cybercrime.
There are different categories of fake profiles, generated for different purposes. Some of them (gamers or stalkers) may be harmless. Others have a clear intention of causing damage or seeking financial gain through insults, extortion, threats, scams and, in the worst cases, corruption and grooming of minors.
Recently, researchers have applied classification approaches to detect fake accounts on social media. However, due to a lack of implicit graphical and linguistic information [3], [4] about the end node, these approaches have not achieved significant results. Yet the linguistic pattern and geo-communal information of end-users are crucial characteristics for identifying end-user behaviour patterns.
Graphical communal characteristics depend on the implicit and explicit link relationships. The explicit link relationship is easily extracted from the graph structure, whereas extracting the implicit link relationship is a challenging task. Mining the linguistic and behavioural patterns of user-generated content and actions, such as like, dislike, follow, comment and share, makes it possible to extract the implicit graph structure.
In this paper, an analytical and methodological user identification framework is developed that binds implicit and explicit link relationships over the end-user's graphical perspective in order to identify malicious users, their communal information and sockpuppet nodes. In addition, this work applies a deep neural network approach over the graphical and linguistic perspective of the end-user to classify profiles as malicious, fake or genuine. The approach also helps identify the tradeoff between the similarity of node attributes and the density of connections for influence maximization.
The paper is organized as follows. The second section reviews the relevant literature and briefly covers social media analysis and fake account detection programs. The third section describes the algorithms we developed and used. The fourth section presents the evaluation results, and the last section gives results and suggestions.
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 12, No. 3, 2021

II. RELATED WORK
Social networking has become an increasingly important application in recent years, because of its unique ability to enable social contact over the internet for geographically dispersed users. A social network can be represented as a graph, in which nodes represent users, and links represent the connections between users.
The purpose of the literature survey is to understand the diverse and dynamic nature of social media data for feature extraction, covering the misuse of fake profiles for review spam on social media [1][2][3][4][5][6][7] and the detection of fake-review-spreading communities [8,10].
In addition, a total of eight articles (published from 2016 to 2019) are summarized in Table 1, which contains six columns. The second column gives the main task of each article. The third column gives the method used. The fourth column gives the method and algorithm used for account verification in different applications. The sixth column gives the name and source of the data sets used to evaluate the different methodologies.
In another bot study, Cresci et al. [5] developed a behaviour model inspired by biological DNA to detect spambots in social networks. By changing different parameters of the genetic algorithm, they determined how advanced bots escape detection techniques. In another study, Galindo et al. [6] examined political bots in the General Elections. The accounts considered in the study, drawn from three different data sets, are grouped as bot or human. To classify the data set, features such as the age and location of the relevant Twitter accounts, the length of the user step, the sickness per tweet, and the time between two retweets were used. AdaBoost, logistic regression, support vector machine and naive Bayes were tested as classification algorithms, and logistic regression worked best among them. Based on the data in Chasma, the authors report that bots retweet slightly less than real accounts. Bots also include more external URLs in their tweets than original accounts.
Ruiz et al. [7] claim that, when detecting bots on Twitter, the follower-to-friend ratio will not always give correct results, because bots can unfollow accounts that do not automatically follow back. Instead, they observe that the text in the tweets of bot accounts is more uniform than that of real accounts, and they use text entropy to measure similarity. The study also deals with the methods used to access Twitter: most human accounts use the web or mobile application, while bots use other applications such as the API. They also state that human accounts have more complex timing behaviour than bots and cyborgs. In this study, they use multiple classification methods to label accounts in the Twitter social network as bot or human. The process of updating has been carried out. By applying feature extraction techniques to the data set, it has been prepared for the dilution process.
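The text-entropy idea can be illustrated with a minimal sketch. It assumes word-level Shannon entropy over whitespace-separated tokens, which may differ from the exact measure used in [7]; the function name `word_entropy` and the two example strings are hypothetical.

```python
import math
from collections import Counter


def word_entropy(text: str) -> float:
    """Shannon entropy, in bits, of the word distribution of a text.

    Assumption: words are lowercased, whitespace-separated tokens.
    Lower entropy means more repetitive (more uniform) text.
    """
    words = text.lower().split()
    if not words:
        return 0.0
    counts = Counter(words)
    total = len(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical examples: a repetitive bot-like text and a varied human-like text.
uniform_text = "buy now buy now buy now buy now"
varied_text = "heading to the lake with friends this weekend"

print(word_entropy(uniform_text))  # two equally frequent words
print(word_entropy(varied_text))   # eight distinct words
```

Under this assumption, the repetitive text scores lower than the varied one, which is the kind of signal the uniformity argument relies on.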

III. PROPOSED WORK
A graphical, linguistic and social-theory-based relationship identification framework (RIF) is developed to identify malicious end-users over social media, as shown in Fig. 1. This framework amalgamates the linguistic, temporal and contextual ethics of user-generated content with profile and graphical information.
The RIF framework extracts a feature vector to delineate user behaviours and a similarity index over social media. It classifies identical profiles belonging to similar users via the Jaccard coefficient over the linguistic patterns of tweets, and provides linguistic, temporal and contextual meaning to develop a mathematical model for classifying identical profiles as sockpuppets over social media.
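As a minimal sketch of the Jaccard coefficient over tweet text, the snippet below compares the word sets of two tweets. The tokenisation (lowercased whitespace split) and the helper names `tokens` and `jaccard` are assumptions for illustration; the framework's actual linguistic features are not specified at this level of detail.

```python
def tokens(tweet: str) -> set:
    """Assumed tokenisation: lowercased, whitespace-separated words."""
    return set(tweet.lower().split())


def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |A ∩ B| / |A ∪ B|; defined as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical tweets from two profiles suspected of being the same user.
tweet_1 = "great deal click here"
tweet_2 = "great deal click now"

similarity = jaccard(tokens(tweet_1), tokens(tweet_2))
print(similarity)  # 3 shared words out of 5 distinct words
```

A value close to 1 indicates near-identical wording, which is the kind of evidence a sockpuppet check would weigh alongside profile and graph features.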

A. Data Extraction
The RIF framework analyzes and extracts user patterns from the user-generated content, profile and graphical information of social media users. This approach combines social media mining concepts and theories with natural language processing to extract the communal intersection of user-generated and profile content from social media.

B. User Feature Vector
The RIF framework examines and correlates the user profile ($u_f$) and generated data ($c_f$) with the graphical perspective ($g_f$) of social media data.
The taxonomy of user features includes profile-, content- and graph-based features, as shown in Fig. 2. In this work, profile-based features comprise validation of profile information, such as whether a suspicious user profile is verified, the profile age, the profile cover and the profile picture. Content-based features include temporal and contextual validation of user-generated data, grammatical quality, and the emotional context of surfing nature, as shown in Fig. 3.
The temporal taxonomy comprises the time interval between tweets ($t_g$) and retweets ($rt_g$) and their frequencies ($t_f$, $rt_f$). Contextual content includes the term and document frequency of user tweets, whereas the linguistic feature reflects the standard of the language script, and sensitivity captures the susceptibility of the user while tweeting.
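The temporal features can be sketched as follows. This is an illustration under assumptions: the interval is taken as the mean gap between consecutive posts and the frequency as posts per hour over the observed span; the function name `temporal_features` and the sample timestamps are hypothetical. The same function would apply to tweets or retweets.

```python
from datetime import datetime


def temporal_features(timestamps):
    """Return the mean gap (seconds) between consecutive posts and posts per hour.

    Assumption: frequency is measured over the span from first to last post.
    """
    ordered = sorted(timestamps)
    if len(ordered) < 2:
        return {"mean_gap_s": 0.0, "per_hour": 0.0}
    gaps = [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
    span_hours = (ordered[-1] - ordered[0]).total_seconds() / 3600
    per_hour = len(ordered) / span_hours if span_hours else 0.0
    return {"mean_gap_s": sum(gaps) / len(gaps), "per_hour": per_hour}


# Hypothetical tweet times for one account.
posts = [
    datetime(2021, 3, 1, 10, 0),
    datetime(2021, 3, 1, 10, 10),
    datetime(2021, 3, 1, 10, 30),
    datetime(2021, 3, 1, 11, 0),
]
features = temporal_features(posts)
print(features)
```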
Graph-based features include validation of the structural and relational nature of the end-user, such as the number of friends, followers, friend distribution, etc.
www.ijacsa.thesai.org

C. Textual Feature
Textual features of social media user-generated content are classified into three classes on the basis of content, reviewer profile and network dimension, as shown in Fig. 4. The RIF framework examines and correlates the following review-, reviewer- and network-centric features.

D. Relationship Identification
After identifying the profile and textual features of an end-user as the seed profile, relationship identification employs balance theory to extract hidden relationships between other similar profiles and the seed profile as implicit link relationships. For instance, consider $g(v, r_e)$ as a social media graph having 11 user nodes and 9 relationship edges, as shown in Fig. 5. After applying the balance theory of SMM, two hidden implicit relationships are extracted over graph $g(v, r_e)$, as shown by the red line in Fig. 6.
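One common reading of balance theory is that a missing edge in a triad takes the sign that keeps the triad balanced (the product of the two known signs). The sketch below applies that reading to a small hypothetical signed graph; it is not the graph of Fig. 5, and the function name `infer_implicit_links` and the edge encoding are assumptions.

```python
from itertools import combinations


def infer_implicit_links(edges):
    """Infer missing signed links from common neighbours.

    edges: dict mapping frozenset({u, v}) -> +1 (positive) or -1 (negative).
    For each unconnected pair (a, c) sharing a neighbour b, the inferred sign
    is sign(a, b) * sign(b, c), which keeps the triad balanced.
    Assumption: if several common neighbours disagree, the first one found wins.
    """
    neighbours = {}
    for pair in edges:
        u, v = tuple(pair)
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    implicit = {}
    for b, nbrs in neighbours.items():
        for a, c in combinations(sorted(nbrs), 2):
            pair = frozenset((a, c))
            if pair in edges:
                continue  # already an explicit link
            sign = edges[frozenset((a, b))] * edges[frozenset((b, c))]
            implicit.setdefault(pair, sign)
    return implicit


# Hypothetical explicit links: A-B and B-C positive, C-D negative.
explicit = {
    frozenset(("A", "B")): +1,
    frozenset(("B", "C")): +1,
    frozenset(("C", "D")): -1,
}
implicit = infer_implicit_links(explicit)
print(implicit)
```

Here A and C are inferred to be positively linked through B, while B and D are inferred to be negatively linked through C.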
After extracting the hidden relationships, nodes are hierarchically differentiated according to their implicit status, derived through status theory. After applying status theory, the node colours over the clique change: the degree of brightness of a node's colour shows its hidden implicit status over the clique, as shown in Fig. 7.
Simultaneously, the effects transmitted over the graph are extracted as explicit characteristics through the Influence, Homophily and Confounding correlation theories. Through influence, a higher-status communal node changes the belongingness of its lower-status nodes to its own community. Homophily builds the belongingness of nodes with similar characteristics to the same community. Confounding arises when an online forum creates an environment that makes individuals similar.
After extracting implicit information from social media through social theory, NCF generates the vertex degree vector and reachability matrix, as shown in equations 6.16 and 6.17.
Here, $n_{vd}$ represents the node degree vector and $n_d^i$ is the number of nodes having degree $i$ in the desired clique structure, whereas $node_{rm}$ represents the node reachability square matrix of dimension $n \times n$ and $r_{v_i,v_j}$ is the modular distance between nodes $v_i$ and $v_j$:

$$node_{rm} = \left[ r_{v_i,v_j} \right]_{n \times n}$$

After extracting the node feature vector and matrix, multiplication of the vertex degree vector and the node reachability matrix returns $A_{i,j}$ as the highest-influence node. Simultaneously, the K-means algorithm builds communities of similar nodes, using the Jaccard coefficient as the similarity index and $A_{i,j}$ as the initial point.
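The two structures defined above can be sketched on a toy graph. This sketch assumes that the "modular distance" $r_{v_i,v_j}$ is the hop count of the shortest path, which the text does not confirm, and it stops at building the vector and the matrix rather than reproducing the multiplication that yields $A_{i,j}$. The function names and the example graph are hypothetical.

```python
from collections import Counter, deque


def degree_count_vector(adj):
    """Entry i is the number of nodes whose degree is i."""
    if not adj:
        return []
    degrees = Counter(len(nbrs) for nbrs in adj.values())
    return [degrees.get(i, 0) for i in range(max(degrees) + 1)]


def reachability_matrix(adj):
    """n x n matrix of shortest-path hop counts (assumed distance measure).

    Unreachable pairs are left as infinity. Returns (node order, matrix).
    """
    nodes = sorted(adj)
    index = {v: k for k, v in enumerate(nodes)}
    inf = float("inf")
    matrix = [[inf] * len(nodes) for _ in nodes]
    for src in nodes:
        row = matrix[index[src]]
        row[index[src]] = 0
        queue = deque([src])
        while queue:  # breadth-first search from src
            u = queue.popleft()
            for w in adj[u]:
                if row[index[w]] == inf:
                    row[index[w]] = row[index[u]] + 1
                    queue.append(w)
    return nodes, matrix


# Hypothetical path graph a - b - c - d.
graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
degree_vector = degree_count_vector(graph)
order, distances = reachability_matrix(graph)
print(degree_vector)
print(order, distances)
```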

IV. ENVIRONMENTAL SETUP AND RESULT ANALYSIS
The comparative analysis presents interesting and useful facts regarding the state of the art in malicious account classification techniques. The DNN-based RIF framework is evaluated against basic stand-alone classifiers, namely Random Forest (RF), Bagging Classifier, J48 Classifier, Random Tree and Logistic Regression, over two social media data sets with different interaction and structural anomalies: the Crude data set and the Cresci Collaborators (CCDS) data set. The Crude data set [10] has 6824 profiles (fake + genuine), 59153788 tweets, 4899493 followers, 16236669 likes, a listed count of 67976 and 1367 shared URLs. CCDS [11] has 3474 genuine accounts, 8377522 genuine tweets, 991 fake accounts and 1610176 fake tweets.
Performance evaluation of Random Forest (RF) for malicious account classification with and without user feature and social theory is described in Table I. The RF algorithm achieves 67.09%, 66.98% and 68.12% precision over the Crude data set and 80.21%, 78.45% and 81.78% over the CCDS data set with user feature, social theory, and the fusion of both, respectively, as shown in Table II and Fig. 8(a). RF performance improves after rectifying network information by user feature, social theory, and the fusion of both: precision improves by 1.53%, 1.36% and 3.09% over Crude and by 33.02%, 30.10% and 35.62% over CCDS, as shown in Fig. 8(b).
For recall, the RF algorithm achieves 67.34%, 66.14% and 68.92% over the Crude data set and 78.41%, 74.24% and 79.12% over the CCDS data set with user feature, social theory, and the fusion of both, respectively, as shown in Table II and Fig. 8(c). Recall improves by 3.09%, 1.26% and 5.51% over Crude and by 41.15%, 33.65% and 35.62% over CCDS, as shown in Fig. 8(d).
For F1-Score, the RF algorithm achieves 67.84%, 65.91% and 69.46% over the Crude data set and 78.9%, 76.14% and 79.98% over the CCDS data set with user feature, social theory, and the fusion of both, respectively, as shown in Table II and Fig. 8(e). F1-Score improves by 4.19%, 1.23% and 6.68% over Crude and by 38.37%, 33.53% and 40.27% over CCDS, as shown in Fig. 8(f).
For accuracy, the RF algorithm achieves 92.94%, 92.1% and 93.45% over the Crude data set and 95.78%, 94.56% and 96.2% over the CCDS data set with user feature, social theory, and the fusion of both, respectively, as shown in Table II and Fig. 8(g). Accuracy improves by 1.41%, 0.49% and 1.96% over Crude and by 5.46%, 4.12% and 5.92% over CCDS, as shown in Fig. 8(h).
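The improvement percentages appear to be relative gains over a baseline without the added features. This is an inference from the reported numbers, not something the text states: dividing each RF precision score on Crude by one plus its improvement gives roughly the same implied baseline. The sketch below shows that arithmetic; the function names are hypothetical.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


def implied_baseline(new: float, improvement_pct: float) -> float:
    """Baseline score implied by a score and its relative improvement."""
    return new / (1.0 + improvement_pct / 100.0)


# RF precision on the Crude data set: (score %, reported improvement %).
rf_crude_precision = [(67.09, 1.53), (66.98, 1.36), (68.12, 3.09)]
baselines = [round(implied_baseline(s, g), 2) for s, g in rf_crude_precision]
print(baselines)  # the three implied baselines agree to two decimals
```

That the three values coincide supports the relative-gain reading, under which a large percentage improvement on CCDS reflects a low baseline rather than a large absolute score.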
The Bagging algorithm achieves 66.59%, 65.19% and 67.82% precision over the Crude data set and 75.22%, 74.61% and 76.15% over the CCDS data set with user feature, social theory, and the fusion of both, respectively, as shown in Table III and Fig. 9(a). Bagging performance improves after rectifying network information by user feature, social theory, and the fusion of both: precision improves by 3.43%, 1.26% and 5.34% over Crude and by 42.25%, 41.09% and 44.01% over CCDS, as shown in Fig. 9(b).
For recall, Bagging achieves 66.69%, 64.85% and 67.98% over the Crude data set and 73.16%, 70.58% and 73.89% over the CCDS data set with user feature, social theory, and the fusion of both, respectively, as shown in Table III and Fig. 9(c). Recall improves by 5.79%, 2.87% and 7.84% over Crude and by 49.80%, 44.51% and 51.29% over CCDS, as shown in Fig. 9(d).
For F1-Score, Bagging achieves 65.24%, 64.12% and 68.72% over the Crude data set and 73.65%, 71.25% and 75.28% over the CCDS data set with user feature, social theory, and the fusion of both, respectively, as shown in Table III and Fig. 9(e). F1-Score improves by 17.85%, 15.82% and 24.13% over Crude and by 47.80%, 42.99% and 51.07% over CCDS, as shown in Fig. 9(f).
For accuracy, Bagging achieves 92.74%, 91.42% and 93.14% over the Crude data set and 95.55%, 92.18% and 95.98% over the CCDS data set with user feature, social theory, and the fusion of both, respectively, as shown in Table III and Fig. 9(g). Accuracy improves by 1.98%, 0.53% and 2.42% over Crude and by 6.83%, 3.06% and 7.31% over CCDS, as shown in Fig. 9(h).
The J48 algorithm achieves 63.52%, 62.78% and 64.15% precision over the Crude data set and 70.6%, 64.52% and 72.82% over the CCDS data set with user feature, social theory, and the fusion of both, respectively, as shown in Table IV and Fig. 10(a). J48 performance improves after rectifying network information by user feature, social theory, and the fusion of both: precision improves by 4.75%, 3.53% and 5.79% over Crude and by 52.19%, 39.08% and 56.97% over CCDS, as shown in Fig. 10(b).
For recall, J48 achieves 62.02%, 59.69% and 62.84% over the Crude data set and 69.23%, 65.82% and 71.56% over the CCDS data set with user feature, social theory, and the fusion of both, respectively, as shown in Table IV and Fig. 10(c). Recall improves by 5.14%, 1.19% and 6.53% over Crude and by 64.52%, 56.42% and 70.06% over CCDS, as shown in Fig. 10(d).
For F1-Score, J48 achieves 61.73%, 58.36% and 62.84% over the Crude data set and 69.19%, 66.58% and 70.69% over the CCDS data set with user feature, social theory, and the fusion of both, respectively, as shown in Table IV and Fig. 10(e). F1-Score improves by 19.68%, 13.14% and 21.13% over Crude and by 58.73%, 52.74% and 62.17% over CCDS, as shown in Fig. 10(f).
For accuracy, J48 achieves 91.87%, 91.25% and 93.14% over the Crude data set and 93.95%, 92.56% and 94.64% over the CCDS data set with user feature, social theory, and the fusion of both, respectively, as shown in Table IV and Fig. 10(g). Accuracy improves by 1.77%, 1.09% and 3.18% over Crude and by 6.50%, 4.92% and 7.28% over CCDS, as shown in Fig. 10(h).
The Random tree algorithm achieves 64.67%, 64.08% and 65.84% precision over the Crude data set and 72.76%, 70.28% and 73.58% over the CCDS data set with user feature, social theory, and the fusion of both, respectively, as shown in Table V and Fig. 11(a). Random tree performance improves after rectifying network information by user feature, social theory, and the fusion of both: precision improves by 1.57%, 0.64% and 0.41% over Crude and by 41.04%, 36.23% and 42.62% over CCDS, as shown in Fig. 11(b).
For recall, Random tree achieves 65.13%, 64.56% and 66.18% over the Crude data set and 50.86%, 49.86% and 52.69% over the CCDS data set with user feature, social theory, and the fusion of both, respectively, as shown in Table V and Fig. 11(c). Recall improves by 4.59%, 3.68% and 6.28% over Crude and by 7.48%, 5.37% and 11.35% over CCDS, as shown in Fig. 11(d).
For F1-Score, Random tree achieves 63.78%, 62.86% and 64.27% over the Crude data set and 54.15%, 52.85% and 55.28% over the CCDS data set with user feature, social theory, and the fusion of both, respectively, as shown in Table V and Fig. 11(e). F1-Score improves by 16.79%, 15.11% and 17.69% over Crude and by 11.97%, 9.28% and 14.31% over CCDS, as shown in Fig. 11(f).
For accuracy, Random tree achieves 72.76%, 70.28% and 73.78% over the Crude data set and 94.84%, 92.42% and 95.58% over the CCDS data set with user feature, social theory, and the fusion of both, respectively, as shown in Table V and Fig. 11(g). Accuracy improves by 0.98%, 0.44% and 2.01% over Crude and by 6.49%, 3.77% and 7.32% over CCDS, as shown in Fig. 11(h).
The Logistic regression algorithm achieves 60.92%, 59.85% and 62.56% precision over the Crude data set and 63.18%, 61.56% and 64.58% over the CCDS data set with user feature, social theory, and the fusion of both, respectively, as shown in Table VI and Fig. 12(a). Logistic regression performance improves after rectifying network information by user feature, social theory, and the fusion of both: precision improves by 15.01%, 12.99% and 18.10% over Crude and by 10.44%, 7.60% and 12.88% over CCDS, as shown in Fig. 12(b).
For recall, the Logistic regression algorithm achieves 61.54%, 60.21% and 63.08% over the Crude data set and 82.54%, 79.95% and 84.12% over the CCDS data set with user feature, social theory, and the fusion of both, respectively, as shown in Table VI and Fig. 12(c). Recall improves by 15.20%, 12.71% and 18.08% over Crude and by 79.32%, 73.69% and 82.75% over CCDS, as shown in Fig. 12(d).
For F1-Score, the Logistic regression algorithm achieves 61.11%, 60.48% and 62.46% over the Crude data set and 82.72%, 80.56% and 83.86% over the CCDS data set with user feature, social theory, and the fusion of both, respectively, as shown in Table VI and Fig. 12(e). F1-Score improves by 15.02%, 13.83% and 17.56% over Crude and by 46.56%, 42.74% and 48.58% over CCDS, as shown in Fig. 12(f).
For accuracy, the Logistic regression algorithm achieves 90.17%, 89.23% and 91.47% over the Crude data set and 84.67%, 83.41% and 85.98% over the CCDS data set with user feature, social theory, and the fusion of both, respectively, as shown in Table VI and Fig. 12(g). Accuracy improves by 2.11%, 1.04% and 3.58% over Crude and by 10.32%, 8.68% and 12.03% over CCDS, as shown in Fig. 12(h).
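For reference, the four metrics reported in this section follow from the confusion-matrix counts of a binary fake-versus-genuine classifier. The sketch below uses the standard definitions with the fake class as positive; the counts are hypothetical and are not taken from the Crude or CCDS experiments.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Precision, recall, F1-Score and accuracy from confusion-matrix counts.

    tp/fp/fn/tn: true positives, false positives, false negatives, true negatives,
    with the fake class treated as positive (an assumption of this sketch).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts with far more genuine than fake accounts.
metrics = classification_metrics(tp=80, fp=20, fn=20, tn=880)
print(metrics)
```

With imbalanced classes such as these, accuracy can sit well above precision and recall, so the four metrics are best read together.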

V. CONCLUSION
An Online Social Network (OSN) is a network hub where people with similar interests or real-world relationships interact. As the popularity of OSNs increases, the related security and privacy issues are also rising. Fake and clone profiles create dangerous security problems for social network users. Cloning of user profiles is one serious threat, in which an existing user's details are stolen to create duplicate profiles that are then misused to damage the identity of the original profile owner. Such profiles can even launch threats like phishing, stalking and spamming. A fake profile is a profile created in the name of a person or a company that does not really exist in social media, in order to carry out malicious activities. In this paper, a graphical, linguistic and social-theory-based relationship identification framework (RIF) is developed to identify malicious end-users over social media. This framework amalgamates the linguistic, temporal and contextual ethics of user-generated content with profile and graphical information. The RIF framework extracts a feature vector to delineate user behaviours and a similarity index over social media. It classifies identical profiles belonging to similar users via the Jaccard coefficient over the linguistic patterns of tweets, and provides linguistic, temporal and contextual meaning to develop a mathematical model for classifying identical profiles as sockpuppets over social media. The RIF framework achieves a maximum of 82.49% precision, 87.76% recall, 86.19% F1-Score and 98.54% accuracy. It also gains a maximum improvement of 27.73% in precision, 66.56% in recall, 55.92% in F1-Score and 14.61% in accuracy.