Link Prediction Schemes Contra Weisfeiler-Leman Models

Link prediction is of particular interest to the data mining and machine learning communities. Until recently, all approaches to the problem used embedding-based methods which leverage either node similarities or latent group memberships towards link prediction. Chen and Zhang recently developed a class of non-embedding approaches called Weisfeiler-Leman (WL) models. WL-models extract subgraphs around links and then encode subgraph patterns via adjacency matrices using the so-called Palette-WL algorithm. A training stage then learns nonlinear graph topological features for link prediction. Chen and Zhang compared two WL-models – a linear regression model ("WLLR") and a neural network model ("WLNM") – against 12 different common link prediction schemes. In this paper, all author claims are validated for WLLR. Additionally, WLLR is tested against 22 additional embedding-based link prediction techniques arising from common neighbor-, path- and random walk-based schemes. WLLR is shown not to be superior when calculable. In fact, in 80% of the datasets where comparisons were possible, one of our added implementations proved superior.

Keywords—Weisfeiler-Leman; link prediction; machine learning; linear regression; common neighbor; path-based; random walk; stochastic block; matrix factorization


I. INTRODUCTION
Improvement in effective link prediction has been of particular interest to the data mining and machine learning communities. Much of this interest arises from diverse real-world applications.
Particular applications include friend recommendation in social networks [2], product recommendation in e-commerce [3], knowledge graph completion [4], finding interactions between proteins [5], and recovering missing reactions in metabolic networks [6]. Additional interest arises from a central challenge facing researchers: determining which method is best for a particular situation, especially when each scheme is grounded in a particular heuristic.
Heuristics range in complexity from the more complex (e.g. stochastic block models [5], probabilistic matrix factorization [7]) to the more simplistic (e.g. common neighbors (CN) [1], Katz index [9]). Heuristics of mid-level complexity include methods which calculate node proximity scores via network topologies or random walks. Amongst the diverse methods which exist, the following two challenges have always persisted.
1) Heuristic complexity does not often translate into corresponding performance. The more simplistic heuristics often work well, are more interpretable, and scale better. The Katz and CN indices are prime examples. The latter asserts higher link probability as the number of common neighbors increases and is reasonably accurate with respect to links on social networks.
2) All known heuristics lack universal applicability to different kinds of networks. CN is again a prime example: its performance on electrical grids and biological networks is quite poor [10] notwithstanding its excellent aforementioned successes. Resistance distance (RD) is a converse example: it performs poorly where CN thrives [11]. A study of over 20 different heuristics found flaws in each, making none a universally effective performance model [10].

This paper is organized as follows. In Section II, a high-level overview of link prediction is presented. In Section III, we present Palette-WL and the implementation of WL-models. Section IV details the many specific link prediction models implemented in this paper along with the key results that were obtained. Section V gives a conclusion, while Section VI presents directions for future work. The tail end of this work, following the References section, includes an Appendix where the full set of computational results from this paper is presented in various tables.

II. OVERVIEW OF LINK PREDICTION
Historically, link prediction models have been feature-based (a.k.a. embedding-based), arising either from (1) topological features or (2) latent features.
1) Topological feature models. These models leverage node similarities, either locally or globally. Topological models do not perform well when similarity scores fail to capture the network formation mechanisms. Common neighbor-based methods (e.g. CN [1], Adamic-Adar [2]), path-based methods (e.g. Katz [9]), and random walk-based methods (e.g. PageRank [1]) all fall within this category. A breakdown of each of these categories is given in Section IV.
2) Latent feature models. These models assume that latent groups exist for nodes and that links are determined by group memberships. Latent models extract group memberships via the low-rank decomposition of a network adjacency matrix [3] or via training which fits probabilistic models [5]. Given these models' focus on individual nodes, a central weakness is their limited ability to explain how networks are formed. Popular methods include ranking methods [17], learning-to-rank methods [17], matrix factorization [16], and stochastic block methods [5], [18]. This paper implements methods from the latter two.
Weisfeiler-Leman models for link prediction are not feature-based. The ideas which motivated their development arose from two research areas related to graph classification: the design of efficient graph kernels [14], [19], and effective graph labeling schemes [15] arising from impositions of vertex orderings. Niepert et al. [15], in particular, focus on orderings towards defining receptive fields around node pixels; the fields are then used to learn a convolutional neural network for graph classification. WL-models [40] work by instead extracting subgraphs around links rather than node pixels, and by focusing on link prediction rather than on graph classification. WL-models are new to the link prediction landscape, having only been formally published in the recent Chen-Zhang paper [40] which this paper tests. Chen and Zhang, in particular, implement two key WL-models, WLLR and WLNM, along with a third, Palette-WLNM, an extension of WLNM. Among the first two, WLNM is superior. Area Under the receiver operating characteristic Curve (AUC) results are listed in Table I, split in two parts, for five datasets and twelve non-WL methods. In short, WLNM outperforms nine state-of-the-art link prediction methods developed by heuristic means (e.g. Katz, PageRank, SimRank, etc.) and three latent feature models (a stochastic block model and two matrix factorization methods); a full explanation of all methods will be given in Section IV. WLLR is less successful, but nonetheless a strong adversarial method. Palette-WLNM is tested elsewhere in their paper, but is not considered in this present paper given this paper's focus on WLLR.
The five datasets used above are USAir, NS, PB, Yeast, and C.ele. USAir is a network of US airlines. NS is a collaboration network of researchers who publish papers on network science. PB is a network of US political blogs. Yeast is a protein-protein interaction network in yeast. C.ele is a neural network of C. elegans. All evaluation methods (CN, Jac, AA, etc.) will be described in full detail in Section IV.

III. PALETTE-WL AND THE IMPLEMENTATION OF WL-MODELS
WL-models operate in three steps:
1) Extract enclosing subgraphs: extracts a subgraph around each target link.
2) Encode subgraph patterns: via adjacency matrices with vertex ordering given by Palette-WL.
3) Training: learns nonlinear graph topological features for link prediction.
To understand each step, a review of graph labeling functions, along with the base WL-algorithm, is in order. A graph labeling function is a map L: V → C from vertices V to an ordered set of colors C = {1, ..., n}. C uniquely determines the vertex order in an adjacency matrix whenever L is one-to-one. The WL-algorithm ("WL"), then, is a color refinement algorithm which iteratively updates vertex colors on a particular graph labeling function, specified below, until a fixed point is reached. (Palette-WL, discussed later, further ensures that the converged function is one-to-one.) WL specifically works by iteratively augmenting vertex labels using neighbors' labels. It then compresses augmented labels into new "signature" labels until convergence. At first, all vertices are set to the same color "1". Each vertex gets its new signature string by concatenating its own color and the sorted colors of its immediate neighbors. Vertices are then sorted by the ascending order of their signature strings and assigned new colors 1, 2, 3, .... Vertices with the same signature strings get the same color. WL is formally sketched below; see Fig. 1 for an example of the process.
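To make the refinement loop concrete, the following minimal Python sketch implements the base WL procedure just described (our illustration, not the authors' code); the adjacency-list input format is an assumption made for the example:

```python
# A minimal sketch of the base WL color refinement loop described above.
# Assumes an undirected graph given as an adjacency list: {vertex: set(neighbors)}.
def wl_refine(adj):
    """Iteratively refine vertex colors until a fixed point is reached."""
    colors = {v: 1 for v in adj}                      # all vertices start with color 1
    while True:
        # Signature: own color followed by the sorted colors of the neighbors.
        sigs = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                for v in adj}
        # Sort the distinct signatures and map them to new colors 1, 2, 3, ...
        ranking = {s: i + 1 for i, s in enumerate(sorted(set(sigs.values())))}
        new_colors = {v: ranking[sigs[v]] for v in adj}
        if new_colors == colors:                      # fixed point: colors are stable
            return colors
        colors = new_colors

# Example: on the path a-b-c, the endpoints share one color, the center gets another.
print(wl_refine({'a': {'b'}, 'b': {'a', 'c'}, 'c': {'b'}}))
```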
We now fully outline the three-step (1-3) process for implementing a WL-model. Step 1 extracts K-vertex enclosing subgraphs via the "Extract Enclosing Subgraphs Algorithm" noted above. Step 2 then encodes subgraph patterns via adjacency matrices with vertex ordering given by Palette-WL. Fig. 2 gives an overview of these steps in action.
Step 3, training, is via any viable machine learning algorithm. The authors use a neural network, WLNM, to achieve superior results. They also train via linear regression, in a method called WLLR. Chen and Zhang show, via mathematical proof, that the Palette-WL graph labeling function converges in at most K iterations to a one-to-one function for a graph with K vertices. Furthermore, the function is color-order preserving: vertices' color orderings are preserved from state to state. Both of these facts enable WL-models to successfully predict links.
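For illustration, a rough end-to-end sketch of the three-step pipeline is given below. It reuses wl_refine from the sketch above, and both the subgraph extraction and the vertex ordering are simplified stand-ins (plain breadth-first growth and ordering by WL color), not the authors' exact algorithms; padding for subgraphs smaller than K is also omitted:

```python
# A rough sketch of the three-step WLLR pipeline (extract, encode, train),
# under the simplifications stated above. Requires wl_refine from the earlier sketch.
import numpy as np
from sklearn.linear_model import LinearRegression

K = 10  # number of vertices per enclosing subgraph

def enclosing_subgraph(adj, x, y, k=K):
    """Step 1 (stand-in): grow a vertex set around link (x, y) breadth-first."""
    verts, frontier = [x, y], [x, y]
    while len(verts) < k and frontier:
        nxt = []
        for v in frontier:
            for u in sorted(adj[v]):
                if u not in verts and u not in nxt:
                    nxt.append(u)
        verts.extend(nxt)
        frontier = nxt
    return verts[:k]

def encode(adj, verts):
    """Step 2 (stand-in): adjacency matrix under a labeling-based vertex order."""
    sub = {v: adj[v] & set(verts) for v in verts}     # induced subgraph
    colors = wl_refine(sub)                           # ordering proxy, not Palette-WL
    order = sorted(verts, key=lambda v: (colors[v], v))
    A = np.array([[1.0 if u in adj[v] else 0.0 for u in order] for v in order])
    return A[np.triu_indices(len(order), k=1)]        # upper triangle as features

# Step 3: fit linear regression on encoded positive (1) and negative (0) links.
# X and y would come from sampled existing/nonexistent links of the training graph:
model = LinearRegression()
# model.fit(X, y); model.predict(...) then scores candidate links.
```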

IV. ASSESSMENT OF WL-MODELS
We use Area Under the receiver operating characteristic Curve (AUC) to measure results. AUC measures the probability that a randomly chosen missing link is given a higher score than a randomly chosen nonexistent link. More precisely, if among n independent comparisons there are n′ times where the missing link has the higher score and n″ times where the two have the same score, then the AUC value is AUC = (n′ + 0.5n″)/n.
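A minimal Monte Carlo sketch of this estimate (function and argument names are ours):

```python
import random

def auc(scores_missing, scores_nonexistent, n=10000):
    """Estimate AUC by n random comparisons, per the formula above."""
    higher = same = 0
    for _ in range(n):
        a = random.choice(scores_missing)      # score of a sampled missing link
        b = random.choice(scores_nonexistent)  # score of a sampled nonexistent link
        if a > b:
            higher += 1
        elif a == b:
            same += 1
    return (higher + 0.5 * same) / n
```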

A. Assessment Methods
In our experiments we compare the authors' WL-model implemented with linear regression (WLLR) against methods from four traditional link prediction areas: common neighbor-based methods, path-based methods, random walk-based methods, and latent feature-based methods. For each test, we compute the AUC value and tabulate the results. The Appendix includes tables with all of our results. We outline the algorithms used in assessment below. Our implementation is motivated largely by two recent papers [38], [39].

B. Common Neighbor (CN)-based Methods
For a node x, let (x) be the set of neighbors of x.The idea is that two nodes x and y are more likely to share a link if they have many common neighbors.The most basic measure CN(x,y), defined to be | (x)   (y)|, asserts this.Note that if A is the adjacency matrix for the corresponding graph, then CN(x, y) = A 2 (x, y).Now let  be a scaler for a link measure M  so that M  (x,y) = CN(x, y)/, and let dx denote the degree of node x.For various choices of , different CN-measures, noted in Table II  The Jaccard Index (Jac) [23] gives the probability that x and y are adjacent, given an edge of either x or y.The Salton Index (SltOn) [20] is often also called cosine similarity in the literature.Sørensen Index (Sor) [24] is primarily used for ecological data.Hub Promoted Index (HPI) [25] is used to quantify the topological overlap of pairs of substrates in metabolic network; under this scheme, links adjacent to hubs are more likely to be assigned to be assigned high scores.Hub Depressed Index (HDI) is the complementary measure to HPI.Finally, the Leicht-Holme-Newman Index (LHN) [22] assigns high similarity to node pairs that have many common neighbors compared to the expected number of such neighbors; in particular, d x  d y is proportional to the expected number of common neighbors of nodes x and y in the configuration model [26].
Three related CN-based methods we considered are the Preferential Attachment Index (PA), the Adamic-Adar Index (AA), and the Resource Allocation Index (RA); these are noted in Table III. PA is motivated by a preferential attachment mechanism which asserts that the probability that a newly added link connects x and y is proportional to d_x · d_y. PA is often used to quantify the functional significance of links subject to various network-based dynamics such as percolation [27], synchronization [28], and transportation [29]. AA is a refinement of the simple counting of common neighbors; it assigns less-connected neighbors more weight [2]. RA [8] is motivated by resource allocation dynamics on complex networks. Suppose x and y are not directly connected and x can send a resource to y, with common neighbors playing the role of transmitters. If each transmitter has a unit of resource and distributes it equally to all its neighbors, then RA is the amount of resource y receives from x.
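Sketches of these three indices under the same adjacency-set representation:

```python
from math import log

def pa(adj, x, y):
    return len(adj[x]) * len(adj[y])       # preferential attachment: d_x * d_y

def aa(adj, x, y):
    # Adamic-Adar: common neighbors weighted down by the log of their degree.
    # (A common neighbor of x and y has degree >= 2, so log(d_z) > 0.)
    return sum(1 / log(len(adj[z])) for z in adj[x] & adj[y])

def ra(adj, x, y):
    # Resource allocation: each common neighbor z forwards 1/d_z of its resource.
    return sum(1 / len(adj[z]) for z in adj[x] & adj[y])
```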
We also considered three local naïve Bayes methods, based on common neighbors, the Adamic-Adar index, and resource allocation, respectively. These are listed as LNBCN, LNBAA, and LNBRA in the tables provided in the Appendix.
In our experimental runs on the five data sets, we were able to validate all of the KDD results [40] on the measures that were used in that paper: CN, Jac, AA, RA, and PA. See the Appendix, Tables VI, VII and VIII. Numerical values were rarely the same, but sufficiently close. In addition, in each of our experimental runs inclusive of all additional measures, the RA index generally performs best, while the AA, CN, and LNBAA indices follow closely behind in best overall performance. This "best performance" neglects comparisons against the WLNM test runs. Because WLNM outperforms even AA and CN, WLNM still provides superior performance according to the KDD data. A caveat is that we only ran the WL-model with linear regression, WLLR, which fared worse against the various CN-measures above. In fact, almost without exception, all 13 common neighbor methods implemented exhibited superior performance over WLLR on the NS, PB, and Yeast data sets. Exceptions occurred with PA, Jac, LHN, and LNBRA on certain data sets.

C. Path-based Methods
The Katz Index [9] is based on an ensemble of all paths. It is a sum over the collection of all paths, with a damping factor β providing shorter paths more weight. Letting A be the adjacency matrix, Katz(x, y) = Σ_{l≥1} β^l · |paths⟨l⟩(x, y)| = ((I − βA)⁻¹ − I)(x, y), where paths⟨l⟩(x, y) denotes the set of paths of length l between x and y. The Local Path Index (LP) [8], [33] takes local paths into additional consideration beyond CN and is defined as LP(x, y) = (A² + εA³)(x, y), where ε is a free parameter; note that when ε = 0, the index is just CN. A more expanded version allows for n sum terms, and as n → ∞ the index becomes Katz. Experimental results show that the optimal n is positively correlated with the average shortest distance of the network [33].
The Leicht-Holme-Newman Index (LHNII) [22] is a variant of the Katz index and is based on the concept that two nodes are similar if their immediate neighbors are themselves similar. A self-consistent matrix formulation is S = φAS + ψI, with solution S = ψ(I − φA)⁻¹ = ψ(I + Katz), where φ and ψ are free parameters controlling the balance between the two components of the similarity. A formulation useful for computations is S = 2mλ₁D⁻¹(I − βA/λ₁)⁻¹D⁻¹, where λ₁ is the largest eigenvalue of A, m is the total number of edges in the network, D is the degree matrix, and β is a free parameter. The choice of β depends on the investigated network; smaller β assigns more weight to shorter paths.
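Illustrative numpy sketches of the three path-based indices follow; the parameter defaults are our own choices, and isolated nodes are assumed absent:

```python
import numpy as np

def katz(A, beta=0.01):
    """Katz similarity matrix: sum of beta^l A^l = (I - beta*A)^-1 - I.
    Converges only when beta < 1/lambda_1."""
    n = A.shape[0]
    return np.linalg.inv(np.eye(n) - beta * A) - np.eye(n)

def local_path(A, eps=0.01):
    """Local Path index: A^2 + eps * A^3."""
    A2 = A @ A
    return A2 + eps * (A2 @ A)

def lhn2(A, beta=0.9):
    """LHNII computational form: 2*m*lambda_1 * D^-1 (I - beta*A/lambda_1)^-1 D^-1."""
    n = A.shape[0]
    deg = A.sum(axis=1)
    m = deg.sum() / 2.0                       # number of edges
    lam1 = np.max(np.linalg.eigvalsh(A))      # largest eigenvalue of A
    D_inv = np.diag(1.0 / deg)
    core = np.linalg.inv(np.eye(n) - (beta / lam1) * A)
    return 2 * m * lam1 * D_inv @ core @ D_inv
```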
The KDD paper [40] only implements Katz with β = 0.01. In our experiments we also run Katz with β = 0.001, as well as LocalPath, and LHNII with β = 0.9, 0.95, and 0.99. See the Appendix, Tables IX and X. In our implementations, on four data sets (USAir, NS, PB, and Yeast), Katz (both versions) and LocalPath always exhibited superior performance to WLLR, with the exception of Katz (β = 0.01) on USAir. For the NS dataset, LHNII exhibited the top performance over WLLR among all methods discussed in this section.

D. Random Walk-based Methods
Resistance Distance (RD) is often called Average Commute Time (ACT) in other contexts and equals s(x, y) + s(y, x), where s(x, y) denotes the average number of steps required by a random walker starting from node x to reach node y. Using the pseudoinverse L⁺ of the Laplacian matrix L = D − A, this quantity is easily computable as m(L⁺(x, x) + L⁺(y, y) − 2L⁺(x, y)) [11], [30]. The ACT score is defined as the reciprocal of this quantity with m = 1 in order to ensure that two nodes are more similar whenever they have a smaller average commute time.
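A short numpy sketch of this computation:

```python
import numpy as np

def act_similarity(A):
    """ACT scores: reciprocal of L+(x,x) + L+(y,y) - 2L+(x,y), with m = 1."""
    L = np.diag(A.sum(axis=1)) - A            # Laplacian L = D - A
    Lp = np.linalg.pinv(L)                    # Moore-Penrose pseudoinverse
    d = np.diag(Lp)
    commute = d[:, None] + d[None, :] - 2 * Lp
    with np.errstate(divide="ignore"):
        return 1.0 / commute                  # the diagonal becomes inf; ignore it
```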
The PageRank (PR) algorithm [13] may be directly applied using Random Walk with Restart (RWR). Consider a random walker starting from node x, who iteratively moves to a random neighbor with probability c and returns to node x with probability 1 − c. Let p_xy denote the probability that the random walker is located at node y in the steady state. Then the vector p_x = (p_x1, p_x2, ...)ᵀ is given by p_x = c · Pᵀp_x + (1 − c) · e_x, where e_x is the n × 1 vector in Euclidean n-space which is 1 at entry x and zero elsewhere, and P is the transition matrix with P(x, y) = 1/d_x if x and y are connected and 0 otherwise. The solution, p_x = (1 − c)(I − cPᵀ)⁻¹e_x, is used in the code for this project. RWR is then defined as RWR(x, y) = p_xy + p_yx.

SimRank (SR) [31] is defined in a self-consistent way, similar to LHN, as SR(x, y) = α · (Σ_{z∈Γ(x)} Σ_{z′∈Γ(y)} SR(z, z′))/(d_x · d_y), where SR(x, x) = 1 and α ∈ [0, 1] is a decay factor. The underlying idea is that two nodes are similar if they are connected to similar nodes. From a random-walk perspective, SR measures how soon two random walkers, respectively starting from nodes x and y, are expected to meet at a certain node.
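A sketch of the RWR computation, solving for all starting nodes at once (it assumes no isolated nodes):

```python
import numpy as np

def rwr_scores(A, c=0.85):
    """RWR via p_x = (1-c)(I - c P^T)^-1 e_x; RWR(x, y) = p_xy + p_yx."""
    n = A.shape[0]
    P = A / A.sum(axis=1, keepdims=True)      # row-stochastic transition matrix
    # Stacking all e_x as columns gives the identity, so column x of Q is p_x.
    Q = (1 - c) * np.linalg.inv(np.eye(n) - c * P.T)
    return Q + Q.T                            # entry (x, y) is p_xy + p_yx
```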
Cosine Similarity based on L⁺ (Cos+) [30] is also defined using the pseudoinverse L⁺ of the Laplacian matrix L = D − A: Cos+(x, y) = L⁺(x, y)/√(L⁺(x, x) · L⁺(y, y)).

Local Random Walk with step s (LRWs) [34] measures the similarity between nodes x and y when a random walker is initially put on node x and proceeds for s steps. The density vector is defined by V_x(0) = e_x and V_x(t + 1) = Pᵀ · V_x(t) for t ≥ 0. Then LRWs(x, y) = init(x) · V_xy(s) + init(y) · V_yx(s), where init is the initial configuration function. In [34], Liu and Lü determine init(x) by node degree so that init(x) = d_x/m. Superposed Random Walk with step s (SRWs) [34] is similar to the LRW index and is defined as SRWs(x, y) = Σ_{1≤i≤s} LRWi(x, y).
Here a random walker is continuously released at the starting point, resulting in a higher similarity between the target node and nearby nodes. The Matrix Forest Index (MFI) [32] is defined by MFI = (I + L)⁻¹. MFI gives the ratio of the number of spanning rooted forests (such that nodes x and y belong to the same tree rooted at x) to all spanning rooted forests of the network.
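A compact sketch of LRW and SRW under the density-vector recursion above (no isolated nodes assumed):

```python
import numpy as np

def lrw_srw(A, s=3):
    """Local and Superposed Random Walk score matrices after s steps."""
    n = A.shape[0]
    deg = A.sum(axis=1).astype(float)
    m = deg.sum() / 2.0
    P = A / deg[:, None]                       # transition matrix
    q = deg / m                                # init(x) = d_x / m, per the text
    V = np.eye(n)                              # columns are V_x(0) = e_x
    srw = np.zeros((n, n))
    for _ in range(s):
        V = P.T @ V                            # V_x(t+1) = P^T V_x(t)
        # Entry (x, y): init(x) * V_xy(t) + init(y) * V_yx(t)
        lrw = q[:, None] * V.T + q[None, :] * V
        srw += lrw                             # SRW superposes LRW over the steps
    return lrw, srw
```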
Transfer Similarity with CN (TS) is defined, using a parametrized version of MFI, by TS = (I + λ · CN)⁻¹ · CN. The KDD paper [40] only implements RD, PR, and SR. In our experiments we also implemented Cos+, RWR with β = 0.95, LRW with s = 3, 4, 5, SRW with s = 3, 4, 5, MFI, and TS. See the Appendix, Tables XI, XII, and XIII. In our implementations, random walk-based methods could not be calculated on two datasets (NS, Yeast) due to memory limitations. On the other three sets (USAir, PB, and C.ele), every random walk method was superior to the WLLR model, with the exceptions of RD, SR, and TS on all sets, and RWR 0.95 and MFI on USAir.

E. Latent Feature-based Methods
Latent (present participle of lateo, "lie hidden") feature-based models attempt to recover hidden features which are then used to predict graph links.
The Stochastic Block Model (SBM) [5], [35]-[37] partitions nodes into groups; the probability that two nodes are connected depends solely on the groups to which they belong.
Let M be a partition of the nodes, let Q_ab denote the probability that a node in group a and a node in group b are connected (with Q_ab = Q_ba), and let c_ab denote the number of connections (i.e. edges) between groups a and b. The likelihood L(A|M) of the observed structure A given M is therefore L(A|M) = Π_{a≤b} Q_ab^{c_ab}(1 − Q_ab)^{n_ab − c_ab}, where n_ab is the number of possible connections between groups a and b. From this, SBM(x, y) [21] is defined via Bayes' Theorem as the expected connection probability between the groups of x and y over the set Ω of all partitions, where the prior p(M) is a constant assuming no prior knowledge about the model.
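As one concrete (and simplified) reading of this likelihood, the sketch below evaluates it for a single fixed partition with the maximum-likelihood estimate Q_ab = c_ab/n_ab plugged in; the full SBM score additionally averages over the partition set Ω, which is omitted here:

```python
import numpy as np
from math import comb

def sbm_log_likelihood(A, labels):
    """Log-likelihood of adjacency A under one partition; labels[i] is node i's group."""
    groups = sorted(set(labels))
    members = {g: [i for i, l in enumerate(labels) if l == g] for g in groups}
    ll = 0.0
    for i, a in enumerate(groups):
        for b in groups[i:]:
            if a == b:
                n_ab = comb(len(members[a]), 2)           # possible within-group links
                c_ab = A[np.ix_(members[a], members[a])].sum() / 2
            else:
                n_ab = len(members[a]) * len(members[b])  # possible between-group links
                c_ab = A[np.ix_(members[a], members[b])].sum()
            if n_ab == 0:
                continue
            q = c_ab / n_ab                               # MLE of Q_ab
            if 0 < q < 1:                                 # skip degenerate terms
                ll += c_ab * np.log(q) + (n_ab - c_ab) * np.log(1 - q)
    return ll
```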
With Matrix Factorization (MF) [39], some entries in A are unknown. MF attempts to approximate A, using only the known entries, by two low-dimensional matrices so that A ≈ FGᵀ, with F being N × K, G being N × K, and K being the number of latent features. The squared error is thus given by e_ij² = (a_ij − Σ_{k≤K} f_ik·g_kj)². A regularization term is added to avoid over-fitting, so that instead e_ij² = (a_ij − Σ_{k≤K} f_ik·g_kj)² + (β/2)·Σ_{k≤K}(f_ik² + g_kj²). The goal is to minimize the sum of all squared errors to obtain optimal F and G. The gradient at the current values is calculated via partial differentiation with respect to f_ik and separately with respect to g_kj. Weights are then updated in the direction opposite the gradient, giving the rules f_ik′ = f_ik + α(2e_ij·g_kj − β·f_ik) and g_kj′ = g_kj + α(2e_ij·f_ik − β·g_kj), which are applied iteratively until the error converges to a minimum. Implicit in the above formulation is that squared errors can only be computed on known entries of A, so each a_ij used must be in the training set.
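A compact sketch of these update rules; α, β, K, and the iteration count are illustrative choices, not the values used in the experiments:

```python
import numpy as np

def mf_fit(A, known, K=8, alpha=0.01, beta=0.02, iters=500, seed=0):
    """Fit A ~ F @ G.T by gradient descent on the known entries only.
    `known` is a list of (i, j) index pairs in the training set."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    F = rng.normal(scale=0.1, size=(n, K))
    G = rng.normal(scale=0.1, size=(n, K))
    for _ in range(iters):
        for i, j in known:                    # only observed entries contribute
            e = A[i, j] - F[i] @ G[j]         # e_ij = a_ij - sum_k f_ik g_kj
            fi = F[i].copy()                  # use pre-update values in both rules
            F[i] += alpha * (2 * e * G[j] - beta * F[i])
            G[j] += alpha * (2 * e * fi - beta * G[j])
    return F, G
```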
Between the two methods, our results showed that SBM generally provided better results with respect to link prediction. See the Appendix, Table XIV. For the USAir and C.ele data sets, SBM outperformed WLLR. For the PB and C.ele datasets, MF likewise outperformed WLLR.

F. Summary
Each of the aforementioned methods was implemented with various parameters as outlined in the Appendix. In particular, our implementation extends the KDD authors' implementation by testing all data sets on additional common neighbor schemes (13 instead of 5), path-based methods (6 instead of 1), and random walk-based methods (13 instead of 3). We also implemented all of the authors' latent feature models. On the five datasets tested, all author data on these methods were able to be validated. In our implementation, a WL-model is run as linear regression and noted as WLLR. WLLR did not exhibit superior performance in any run for which comparisons were possible. Amongst the comparisons possible, in 80% of these, one of our implementations (not implemented in the KDD paper) demonstrated superior performance. In particular, WLLR AUC results for NS, PB, Yeast, and C.ele were 0.865, 0.838, 0.860, and 0.804, respectively; the corresponding best results, for LHNII (all cases), LRW3, LNBAA, and LRW3, were 0.9690, 0.9367, 0.8990, and 0.9197, respectively. For USAir, the WLLR AUC score was 0.930, and RA demonstrated superior performance with an AUC score of 0.9540. Common neighbor methods were superior in 40% of the cases. Random walk-based methods were superior in another 40% of the cases. Finally, the path-based methods were superior in the remaining 20% of cases. Table IV demonstrates these results.

V. CONCLUSION
Chen and Zhang [40] develop a novel WL link prediction scheme which, according to experimental data, to date best predicts links in real-world graph data. An innovative feature is that link formation mechanisms are learned, not assumed. WL link prediction schemes work by encoding enclosing subgraphs as adjacency matrices. Encoding occurs via the authors' modification of the Weisfeiler-Leman (WL) algorithm [12] from graph theory. The authors' modification, Palette-WL, labels vertices according to their structural roles in the subgraph and preserves the subgraph's intrinsic directionality. Training on adjacency matrices then learns a predictive model. When training is done on the authors' neural network, the model is called WLNM, i.e. WL Neural Machine; this requires advanced support for parallelism or distributed computing. When training is done with linear regression, the model is called WLLR, i.e. WL Linear Regression, and this was the focus of our work; forthcoming work will tackle WLNM.
We extended the authors' implementations by testing all data sets on additional CN-, path- and random walk-based schemes. In particular, we implemented 32 such schemes compared to the nine from the KDD authors. We also implemented all of the authors' latent feature models. On five data sets, all author data on these methods were validated. The WLNM model still demonstrates superiority even when all additional schemes are implemented, according to data provided by the KDD authors. The linear regression version of the model, WLLR, on the other hand, was not superior when calculable. In fact, in 80% of the datasets where comparisons were possible, one of our added implementations proved superior.

VI. FUTURE WORK
Given the successes in validating the results from [40] for WLLR and in demonstrating a multitude of results which supplant that model, our next step will be to implement and validate WLNM. That is, the current work is limited in scope to only the WLLR model. Therefore, the goal will be to determine if the WLNM model can also be supplanted. We will also run all experiments on the three additional data sets (Power, Router, and E.coli) which the authors test in [40] and which require advanced support for parallelism or distributed computing.
Other opportunities for future work also abound. For instance, in what ways might the graph coloring scheme be applied to other algorithms in data science (e.g. clustering)? Also, as the authors' algorithm pertains only to undirected graphs, how might WLNM be modified in order to apply to directed graphs? It is also plausible that for specific classes of graphs, a pared set of calculations might suffice, leading to manageable neural network computations; one direction for future work would be to identify such classes and prove optimal computational bounds. Another line of work would be to determine whether there might be a class of circumstances in which a heuristic method (or an embedding model) may provide a better result. If so, what properties would such a class or model have, mathematically, and could an example be found in nature?

TABLE III .
THREE RELATED CN-BASED MEASURES

TABLE VI .
COMMON NEIGHBOR METHODS I
e. Preferential Attachment (PA)

TABLE VII .
COMMON NEIGHBOR METHODS II*

TABLE VIII .
COMMON NEIGHBOR METHODS III
b. Local Naïve Bayes method with Common Neighbor (LNBCN)
c. Local Naïve Bayes method with Adamic-Adar Index (LNBAA)
d. Local Naïve Bayes method with Resource Allocation (LNBRA)

TABLE XI .
RANDOM-WALK BASED METHODS I
a. Resistance Distance, or Average Commute Time (RD)
b. PageRank, or Random Walk with Restart, with damping factor d = 0.85 (PR)
c. SimRank with decay factor 0.6 (SR)

TABLE XII .
RANDOM-WALK BASED METHODS II*

TABLE XIV .
LATENT FEATURE BASED METHODS
b. Matrix Factorization with classification loss function (MF-c)