Reinventing People You May Know at LinkedIn

People You May Know (PYMK) recommends other members to connect with, allowing members to grow their networks, and it is one of the most recognizable features at LinkedIn. PYMK is responsible for building more than 50% of LinkedIn’s professional graph. The two main challenges in building People You May Know are machine learning and scale.

In terms of machine learning, the basic problem behind PYMK is link prediction over a social graph, that is, finding edges that exist in real life but are not yet reflected in LinkedIn’s professional graph. At its heart is a binary classification problem: predicting whether two people know each other or not. PYMK uses a logistic regression model to combine hundreds of features into that prediction, and the models are trained with LinkedIn’s open-sourced large-scale machine learning library on hundreds of millions of samples.
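As a rough illustration of how a logistic regression model turns pair features into a probability, here is a minimal sketch. The feature names, weights, and bias are hypothetical values chosen for the example; the production model combines hundreds of features, and its weights are learned from training data rather than written by hand.

```python
import math

# Hypothetical weights for illustration only; a real model learns these
# from labeled pairs of members.
WEIGHTS = {"common_connections": 0.4, "same_company": 1.2, "same_school": 0.8}
BIAS = -3.0


def connection_probability(features, weights=WEIGHTS, bias=BIAS):
    """Logistic regression score: sigmoid of a weighted sum of pair features.

    Features missing from `weights` contribute nothing to the score.
    """
    z = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


weak_pair = {"common_connections": 1, "same_company": 0, "same_school": 0}
strong_pair = {"common_connections": 10, "same_company": 1, "same_school": 0}
print(connection_probability(weak_pair), connection_probability(strong_pair))
```

With these assumed weights, the pair with more common connections and a shared company receives the higher probability, which is the ordering a ranking step would rely on.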

There are many features, or signals, used for predicting whether two people know each other. For example, one of the first things to look at is friends-of-friends, or triangle closing. If Alice knows Bob and Bob knows Carol, then Alice and Carol might know each other, since they have one common connection. As the number of common connections increases, the likelihood of two people knowing each other increases. After closing these triangles, PYMK scores the resulting candidate pairs using other features such as overlapping organizations (for example, same company, same school, same group), geographical distance, age difference, etc.
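The triangle-closing step described above can be sketched on a small in-memory graph. This is only an illustration: the adjacency-set representation and the function name friends_of_friends are assumptions made for the example, and the real system runs this kind of computation as large distributed jobs.

```python
from collections import Counter


def friends_of_friends(graph, member):
    """Return {candidate: number_of_common_connections} for a member.

    `graph` maps each member to the set of their direct connections.
    Candidates are second-degree members: reachable through a connection,
    but not the member themselves and not already connected.
    """
    direct = graph.get(member, set())
    counts = Counter()
    for friend in direct:
        for candidate in graph.get(friend, set()):
            if candidate != member and candidate not in direct:
                counts[candidate] += 1
    return counts


graph = {
    "alice": {"bob", "dave"},
    "bob": {"alice", "carol"},
    "carol": {"bob", "dave"},
    "dave": {"alice", "carol"},
}
candidates = friends_of_friends(graph, "alice")
print(candidates.most_common())
```

Here Alice is not connected to Carol, but both Bob and Dave are, so Carol appears as a candidate with two common connections. The count itself can then serve as one feature among many when the pair is scored.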

There are many interesting modeling challenges in feature engineering. For example, as part of our research to understand how two people working in the same organization know each other, we built a novel model that factors in the time of joining and departing an organization, the size of the organization, and the likelihood of knowing each other in an organization (as some organizations are more social than others), as published in WWW’13. The logic is simple: the affinity between two members who worked together in a small organization for 10 years is greater than the affinity between members who've worked together for only a few months.
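To make the intuition concrete, here is a deliberately simple sketch that combines time overlap with organization size. It is not the model from the WWW’13 paper; the formula (months of overlap divided by organization size) and all names are assumptions chosen only to show how these inputs could interact.

```python
def overlap_months(start_a, end_a, start_b, end_b):
    """Months during which both members were at the organization.

    Tenures are given as month indices (for example, months since some
    fixed reference date). Non-overlapping tenures yield 0.
    """
    return max(0, min(end_a, end_b) - max(start_a, start_b))


def org_affinity(start_a, end_a, start_b, end_b, org_size):
    """Hypothetical affinity: longer overlap and smaller organizations score higher."""
    return overlap_months(start_a, end_a, start_b, end_b) / org_size


# Ten years together at a 20-person organization.
long_small = org_affinity(0, 120, 0, 120, org_size=20)
# Three months of overlap at the same organization.
short_small = org_affinity(0, 120, 117, 200, org_size=20)
print(long_small, short_small)
```

Under this assumed formula, the ten-year overlap scores far higher than the three-month overlap, and the same overlap at a much larger organization would score lower.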

Another interesting modeling challenge is how to incorporate user feedback through impression discounting, that is, discounting the PYMK results that users have seen and ignored (see our KDD’14 paper for more details). The intuition is simple: a PYMK result that a user has seen without connecting has effectively been ignored, so it should be moved lower in the ranking of PYMK results.
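A minimal sketch of the idea follows, assuming a simple exponential decay per ignored impression. The decay factor and function names are hypothetical, and the discounting approach described in the KDD’14 paper is not reproduced here; this only shows the re-ranking effect.

```python
def discounted_score(score, ignored_impressions, decay=0.8):
    """Lower a candidate's score for each time it was shown and ignored.

    `decay` is an assumed constant in (0, 1]; smaller values discount harder.
    """
    return score * (decay ** ignored_impressions)


def rerank(candidates, decay=0.8):
    """Sort (member, score, ignored_impressions) tuples by discounted score."""
    return sorted(
        candidates,
        key=lambda c: discounted_score(c[1], c[2], decay),
        reverse=True,
    )


candidates = [("carol", 0.9, 5), ("erin", 0.6, 0)]
ranked = rerank(candidates)
print([name for name, _, _ in ranked])
```

Carol starts with the higher raw score, but after five ignored impressions her discounted score falls below Erin's, so Erin is shown first.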

In terms of scale, the PYMK system processes hundreds of terabytes of data and hundreds of billions of potential connections daily, and pushes new PYMK results every day. Because PYMK looks at the second-degree network (connections of connections), the amount of data to process grows much faster than the site itself. This poses a unique scaling challenge: we need to keep optimizing the PYMK system to handle such high growth in data processing while still refreshing PYMK results every day. LinkedIn has built a big data ecosystem to address the scaling challenges in PYMK, including systems such as the Voldemort key-value store, Azkaban for Hadoop workflow management, Apache Kafka for streaming, Apache DataFu for simplifying data processing in Pig on Hadoop, and Cubert for efficient joins and data processing.
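A back-of-the-envelope calculation shows why second-degree processing outpaces site growth. Each member with d connections sits in the middle of d * (d - 1) / 2 two-hop paths, so the work grows with the square of the connection count. The member counts and degrees below are made-up round numbers, not LinkedIn figures.

```python
def two_hop_paths(degrees):
    """Count two-hop paths: each member with d connections links d*(d-1)/2 pairs."""
    return sum(d * (d - 1) // 2 for d in degrees)


# Hypothetical network: 1,000 members with 10 connections each.
before = two_hop_paths([10] * 1000)
# The site doubles in members, and members also double their connections.
after = two_hop_paths([20] * 2000)
print(before, after, after / before)
```

In this toy case the number of members only doubles, while the number of two-hop paths to consider grows by more than a factor of eight.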

There is a lot of interesting ongoing work to improve People You May Know further, for example:

Large-scale distributed machine learning
Large-scale social graph processing
Network A/B testing

Stay tuned!

(Interestingly, People You May Know was invented at LinkedIn.)