Supervised Random Walk for link prediction

Last week in our reading group at LinkedIn we discussed "Supervised Random Walks: Predicting and Recommending Links in Social Networks" by L. Backstrom and J. Leskovec, which appeared in ACM Internation Conference on Web Search and Data Mining (WSDM), 2011.

This is quite interesting paper that talks about link prediction problem. Link prediction is a fundamental problem in a graph or social networks to predicts edges/links that don't exist currently but may appear in future. Applications of link prediction includes (1) "People You May Know" feature in social networks such as LinkedIn or Facebook, (2) predict links in protein-protein interaction, (3) suggest links to bloggers to add, (4) predict collaboration, etc.

In the past there has been two main approaches to link prediction in a graph: (1) classification potential edges as positive or negative using various features from the graph; and (2) random walk on the graph to rank nodes by assigning probability to each node.

In the first approach, various features such as node (e.g., age, gender), edge (e.g., interaction, activity), graph (e.g., the number of common friends) are used to predict links. Past edges can be used as labeled training data to classify potential edges as positive or negative edges in the future.

In the second approach, random walks are used to assign probabilities to nodes, and rank nodes based on these probabilities. Higher the rank, more likely it is to be an edge between the starting node and that node. Random walks have been used to find out important nodes in a graph (like Pagerank) and also used for finding similar nodes in a graph. Random walk with restarts is one example of this approach.

This paper talks about how to combine these two approaches into one by building a model to bias a random walk in a graph, and use the random walk for link prediction. Node and edge (e.g., the number of common friends) features are used to learn transition probabilities between edges. And then random walk with restart is run to find out a probability score for each node. Nodes are ordered by these probability score to find out the top nodes that are likely to be connected with the starting node in the future.

This is quite interesting approach that does not need complicated feature engineering to find out graph features, but this approach is very compute intensive since random walk takes multiple iterations of graph traversal, and if the graph is huge (say 100s of millions of nodes) then it will take a long time to run. In my experience, frequently running link prediction pipeline is more important then marginal improvement in link prediction accuracy since online social graph changes continuously.

Nevertheless, this paper brings very interesting ideas to combine the classification approach and the random walk approach for link prediction.

References

Supervised Random Walks: Predicting and Recommending Links in Social Networks, L. Backstrom and J. Leskovec. In Proceedings of the 4th ACM international conference on Web search and data mining.
Link Prediction Problem for Social Networks, D. Liben-Nowell and J. Kleinberg. In Proceedings of 12th International Conference on Information and Knowledge Management (CIKM), 2003.
Predicting Positive and Negative Links in Online Social Networks, J. Leskovec, D. Huttenlocher and J. Kleinberg. In Proceedings of 19th International World Wide Web Conference (WWW), 2010.