Influence of First Connections for a New Employee on Growth and Retention

Image: https://media.licdn.com/mpr/mpr/AAEAAQAAAAAAAAlkAAAAJDIxZDM1N2NjLTY1YjYtNDdiNC05MGNiLWQwNGJmMzY4N2Q4Mw.jpg

Social networks matter in engaging a new member of a community. Are there patterns in a new employee's initial connections that influence future retention at the company? How does a new employee network or connect with other employees of the company? Is there any similarity among the company networks of new employees?

Despite the importance of these questions, there is little understanding of how the ego-network (the connections of an individual and the network of connections among them) of a new member evolves within a community. We recently analyzed LinkedIn’s data and published a detailed study addressing these questions at the World Wide Web (WWW) Conference:

Influence of First Steps in a Community on Ego-Network: Growth, Diversity, and Engagement

In this post I will discuss some of the highlights of the paper, including growth of the network in a new company, diversity of the network, and retention of new employees over time.

We use the top 500 companies on LinkedIn (by average degree), which include more than 1 million members and more than 100 thousand new employees in 2013. To our knowledge, this is the largest analysis of company networks and employee behavior.

Connecting initially with senior, well-connected people results in longer retention

First, we looked at whether there is any relationship between a new employee's initial connections after joining a company and retention at that company. We examined the network size and seniority of each new employee's first ten connections inside the company, and checked whether the new employee still worked for the same company after 1.5 years. We found that if the initial connections are more senior and have larger networks, the new employee is less likely to leave the company early.
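To make the measurement concrete, here is a minimal sketch of how the first ten connections could be summarized. The input shapes (`connections` as timestamped pairs, `degree` and `seniority` as lookup tables) are assumptions made for illustration, not the format used in the study.

```python
from statistics import mean

def first_connection_profile(connections, degree, seniority, k=10):
    """Summarize the first k in-company connections of a new employee.

    connections: list of (timestamp, member_id) pairs (hypothetical format).
    degree, seniority: dicts mapping member_id to a number (hypothetical).
    """
    # Sort by timestamp and keep the k earliest connections.
    first_k = [member for _, member in sorted(connections)[:k]]
    return {
        "avg_degree": mean(degree[m] for m in first_k),
        "avg_seniority": mean(seniority[m] for m in first_k),
    }
```

These two averages can then be compared between employees who stayed for 1.5 years and those who left earlier.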

A more diverse initial network implies a more diverse and larger network over time

Second, is there any relationship between the network status of a new employee's initial connections and the employee's future network status? We computed the average degree and functional group diversity in the ego-network formed by the first 10 connections, and then the degree and functional group diversity in the final ego-network after 1.5 years. We found that a more diverse network of initial connections implies a more diverse and larger network over time.
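One way to put a number on functional group diversity is Shannon entropy over the functional groups of the connections. This is an assumption for illustration only; the post does not say which diversity measure the paper uses.

```python
import math
from collections import Counter

def group_diversity(groups):
    """Shannon entropy (in bits) of a list of functional-group labels.

    0.0 means every connection is in the same group; higher means more diverse.
    Entropy is an assumed stand-in for the paper's diversity measure.
    """
    counts = Counter(groups)
    total = sum(counts.values())
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy
```

For example, ten connections all in engineering score 0, while connections split evenly between engineering and sales score 1 bit.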

Image: https://media.licdn.com/mpr/mpr/shrinknp_800_800/AAEAAQAAAAAAAAglAAAAJDhhMDVjOWEyLTVmZjctNDQ5Zi1hYzFjLWQzNTIwYmRkYTEyMA.jpg

Network growth through triadic-closure led by high-degree people

Finally, we analyzed how the network grows for new employees. We found that new employees grow their networks through triad-closing propagation led by high-degree people.

Image: https://media.licdn.com/mpr/mpr/shrinknp_800_800/AAEAAQAAAAAAAAewAAAAJDUzYzAwODg5LTg3M2EtNDdiOS1hNWU4LTMzN2MzY2RjMWQ0Yw.jpg

Figure: A larger Wiener index implies that the real network is more viral in its triadic-closure propagation than a random network.
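For readers unfamiliar with it, the Wiener index of a graph is the sum of shortest-path distances over all pairs of nodes. A minimal sketch, assuming a connected, undirected graph stored as an adjacency dict (the representation is my choice for the example):

```python
from collections import deque

def wiener_index(adj):
    """Sum of shortest-path distances over all unordered node pairs.

    adj: dict mapping node -> iterable of neighbors (undirected, connected).
    """
    total = 0
    for source in adj:
        # Breadth-first search gives hop distances from source.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    return total // 2  # every pair was counted once from each end

# A star (one hub, broadcast-like) versus a path (long chain) on 5 nodes.
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
```

The path has the larger index (20 versus 16 for the star), which matches the intuition that long chains of propagation are more "viral" than a single hub broadcasting to everyone.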

For more details check out our paper:

Influence of First Steps in a Community on Ego-Network: Growth, Diversity, and Engagement, Atef Chaudhury, Myunghwan Kim, and Mitul Tiwari. In the Proceedings of the 25th International Conference Companion on World Wide Web (WWW), April 2016.

Reinventing People You May Know at LinkedIn

People You May Know (PYMK) recommends other people to connect with, allowing members to grow their network, and it’s one of the most recognizable features at LinkedIn. PYMK is responsible for building more than 50% of LinkedIn’s professional graph. The two main challenges in building People You May Know are machine learning and scale.

In terms of machine learning, the basic problem behind PYMK is link prediction over the social graph, that is, figuring out missing edges on the social graph that are present in real life but may not yet be reflected in LinkedIn’s professional graph. At the heart of it is a binary classification problem: predicting whether two people know each other. PYMK uses a logistic regression model for this binary classification problem to combine hundreds of features, and uses LinkedIn’s open-sourced large-scale machine learning library to train models on hundreds of millions of samples.
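As a rough illustration of what "logistic regression combining features" means here, the sketch below scores a candidate pair from a handful of features. The feature names, weights, and bias are made up for the example; the real model uses hundreds of features and learned weights.

```python
import math

# Hypothetical feature names and weights, chosen only for illustration.
WEIGHTS = {"common_connections": 0.4, "same_company": 1.2, "same_school": 0.8}
BIAS = -3.0

def connection_probability(features, weights=WEIGHTS, bias=BIAS):
    """Logistic regression score: sigmoid of a weighted sum of the features."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))
```

With these toy weights, a pair with five common connections at the same company scores higher than a pair with one common connection and no shared organization.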

There are many features or signals used for predicting whether two people know each other. For example, one of the first things to look at is friends-of-friends, or triangle closing. If Alice knows Bob and Bob knows Carol, then Alice and Carol might know each other since they have one common connection. As the number of common connections increases, the likelihood of two people knowing each other increases. After closing these triangles, PYMK scores such candidate pairs using other features such as overlapping organizations (for example, same company, same school, same group), geographical distance, age difference, etc.
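The triangle-closing step can be sketched in a few lines. This is a toy, in-memory version, assuming the graph is a dict from member to a set of connections; it is not how the production system generates candidates.

```python
from collections import Counter

def friends_of_friends(graph, member):
    """Return (candidate, common_connection_count) pairs, most common first.

    graph: dict mapping member -> set of direct connections (assumed format).
    """
    direct = graph[member]
    counts = Counter()
    for friend in direct:
        for candidate in graph[friend]:
            # Skip the member themselves and people they already know.
            if candidate != member and candidate not in direct:
                counts[candidate] += 1
    return counts.most_common()

graph = {
    "alice": {"bob", "dave"},
    "bob": {"alice", "carol"},
    "carol": {"bob", "dave"},
    "dave": {"alice", "carol", "erin"},
    "erin": {"dave"},
}
```

Here `friends_of_friends(graph, "alice")` ranks Carol (two common connections, Bob and Dave) ahead of Erin (one common connection).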

There are many interesting modeling challenges in feature engineering. For example, as part of our research to understand how two people working in the same organization know each other, we built a novel model factoring in the time of joining and departing an organization, the size of the organization, and the likelihood of knowing each other in an organization (as some organizations are more social than others), as published in WWW’13. The logic is simple: the affinity between two members who worked together in a small organization for 10 years is greater than that between members who've worked together for only a few months.

Another interesting modeling challenge is how to incorporate user feedback through impression discounting, that is, discounting the PYMK results that are seen by users and ignored (see our KDD’14 paper for more details). The intuition is simple: PYMK results that were seen by users but did not lead to a connection have effectively been ignored and should be lowered in the ranking.
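To give a feel for the idea, here is a deliberately simple sketch in which each ignored impression multiplies the score by a constant factor. The exponential form and the `decay` value are assumptions for illustration; the KDD’14 paper describes the actual discounting model.

```python
def discounted_score(score, ignored_impressions, decay=0.8):
    """Lower a score once for every time the result was shown and ignored."""
    return score * decay ** ignored_impressions

def rerank(scores, impressions, decay=0.8):
    """Order candidates by discounted score, highest first.

    scores: dict candidate -> model score.
    impressions: dict candidate -> number of ignored impressions (default 0).
    """
    return sorted(
        scores,
        key=lambda c: discounted_score(scores[c], impressions.get(c, 0), decay),
        reverse=True,
    )
```

For instance, a candidate with score 0.9 that has been ignored three times drops to about 0.46 and falls below a fresh candidate with score 0.8.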

In terms of scale, the PYMK system processes hundreds of terabytes of data and hundreds of billions of potential connections daily, and pushes new PYMK results every day. As PYMK looks at the second-degree network (connections of connections), the data to be processed grows much faster than the site itself. This poses a unique scaling challenge: we need to keep optimizing the PYMK system to handle this growth while still refreshing PYMK results every day. LinkedIn has built a big data ecosystem for addressing scaling challenges in PYMK, including systems such as the Voldemort key-value store, Azkaban for Hadoop workflow management, Apache Kafka for streaming, Apache DataFu for simplifying data processing in Hadoop Pig, Cubert for efficient joins and data processing, etc.

There is a lot of interesting ongoing work to improve People You May Know further, for example:

Large-scale distributed machine learning
Large-scale social graph processing
Network A/B testing

Stay tuned!

(Interestingly, People You May Know was invented at LinkedIn.)

Growth Diffusion at LinkedIn via Cascading Invitations

Figure 1: Example LinkedIn signup cascade

Many popular websites such as LinkedIn power their growth through guest invitations from existing members to non-members. New members who join can also send such guest invitations, resulting in a cascade of membership growth at a large scale. What does such a cascade of membership growth look like? How viral are these cascades?

We recently analyzed LinkedIn’s growth through such guest invitations to address these questions, in the largest structural analysis of cascading growth diffusion, and published our work at the 24th International World Wide Web Conference (WWW) 2015:

Global Diffusion via Cascading Invitations: Structure, Growth, and Homophily

In this post I will discuss some of the highlights of the paper including growth through cascading invitations, structure and homophily in the cascade.

LinkedIn is the largest professional network, with more than 360 million members. LinkedIn membership has grown through warm signups, which result from guest invitations, and through direct signups at the site without an invitation. A significant fraction of members joined LinkedIn through cascading guest invitations that resulted in warm signups. The cascading signups can be organized into a collection of trees: each time a member signs up directly, she becomes the root of a new tree, and every user who signs up by accepting an invitation from a member A becomes a child of member A in A's tree. One example of such a tree is shown (at the top) in Figure 1. We find that LinkedIn’s signup cascade trees are huge and very viral (compared to previously studied diffusion phenomena), and that members remain active for a long time in sending guest invitations, resulting in more warm signups.
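The tree construction described above is easy to sketch on toy data. The input format (a dict from each member to their inviter, with `None` for a direct signup) is an assumption for the example, and it relies on every inviter having signed up before their invitees, so there are no cycles.

```python
from collections import Counter

def cascade_tree_sizes(inviter_of):
    """Map each root (direct signup) to the number of members in its tree.

    inviter_of: dict member -> inviter, or None if the member signed up
    directly (hypothetical format).
    """
    def root_of(member):
        # Walk up the chain of inviters until reaching a direct signup.
        while inviter_of[member] is not None:
            member = inviter_of[member]
        return member

    return Counter(root_of(member) for member in inviter_of)

signups = {"a": None, "b": "a", "c": "b", "d": None, "e": "d"}
```

In this example `cascade_tree_sizes(signups)` finds two trees: one rooted at "a" with three members and one rooted at "d" with two. At LinkedIn's scale this walk would need to be done with a distributed job rather than in memory.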

Figure 2: Pattern of LinkedIn’s cascade trees growth over time

How do LinkedIn’s cascade trees grow in size over time? In Figure 2 we plot the growth of the 1000 biggest cascade trees on LinkedIn. We see a surprisingly robust growth pattern in these cascade trees (and in all the trees as well). We also observe that the number of cascade trees is growing over time at a deliberate pace. In short, we observe a persistent, parallel increase in both the number and the size of cascade trees, which contributes to warm signup growth at LinkedIn.

How does growth diffusion affect the characteristics of members present in the cascade trees? In addition to analyzing the structure of growth cascade trees, we also examine the interaction between the characteristics of the members present in the trees and the structure of the trees. We find that geography and industry play an important role in the cascade, with inviter and invitee showing similarity on both. Figure 3 compares within-tree similarity with between-tree similarity for members on country, region, industry, engagement, and seniority among growth cascade trees. We observe that within-tree similarity on the country and industry dimensions is much higher than between-tree similarity.

Figure 3: Within-tree and between-tree similarity (homophily) on country, region, industry, engagement, and seniority among growth cascade trees.
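One simple way to read "within-tree" versus "between-tree" similarity is as the fraction of member pairs that share an attribute value, computed separately for pairs in the same tree and pairs in different trees. The sketch below does exactly that on toy data; the pairwise definition is an assumption for illustration and may differ from the measure used in the paper. It also assumes at least one pair of each kind exists.

```python
from itertools import combinations

def within_and_between_similarity(tree_of, attribute):
    """Fraction of same-tree pairs and different-tree pairs sharing an attribute.

    tree_of: dict member -> tree id; attribute: dict member -> value
    (both hypothetical formats).
    """
    within, between = [], []
    for a, b in combinations(tree_of, 2):
        same_value = attribute[a] == attribute[b]
        (within if tree_of[a] == tree_of[b] else between).append(same_value)
    return sum(within) / len(within), sum(between) / len(between)
```

Enumerating all pairs is quadratic, so on real data one would sample pairs instead.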

Surprisingly, similarity between inviter and invitee is not sufficient to explain the within-tree similarity we observe. We find that higher-order Markov models, in which a node’s characteristics depend not only on the parent but on ancestors as well, produce a level of similarity and homophily that closely matches the observed data, as shown in Figure 4.

Figure 4: Root-guessing experiment where we are trying to predict the country of root node based on the country of a given node.

This is the largest growth diffusion study that we are aware of. For more details, check out our paper:

Global Diffusion via Cascading Invitations: Structure, Growth, and Homophily, Ashton Anderson, Daniel Huttenlocher, Jon Kleinberg, Jure Leskovec, and Mitul Tiwari. In the Proceedings of the 24th International World Wide Web Conference (WWW), May 2015.

Organizational Overlap on Social Networks and its Applications

Online social networks have become important tools for networking, communication, sharing, and discovery. A considerable challenge these networks face is the fact that an online social network is partially observed: two individuals might know each other, but may not have established a connection on the site.

Therefore, link prediction and recommendations are important tasks for any online social network. We published a paper in the 22nd International World Wide Web Conference (WWW), May 2013 that describes how we developed a novel organizational overlap model for link prediction between two users in a network:

Organizational Overlap on Social Networks and its Applications

In this post, I’ll briefly discuss some of the highlights from the paper, including link prediction, recommendations, and community detection.

Link prediction and recommendations

Social network sites use recommendation systems such as LinkedIn’s ‘People You May Know’ to enable a significant number of link creations.

People You May Know at LinkedIn

A basic problem in network analysis is predicting links for partially observed networks, that is, given a snapshot of connections at time t, can we predict links at time t+1? On any online social network, two members might know each other but may not have established a connection on the site. Link prediction and recommendations help address this problem and create a more complete social graph to improve user involvement.

Link Prediction: Given a snapshot of a network at time t, can we predict links at time t+1

As part of our research to understand edge affinity between users, we built a novel model factoring in the time of joining and departing an organization. The logic is simple: the affinity between two members who worked together in an organization for 10 years is greater than members who've worked together for only a few months. We built a mathematical model based on this organizational time overlap and validated the model with LinkedIn’s social network data.
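The core quantity behind this intuition is how long two members' stays at the same organization overlapped. Here is a minimal sketch of that one ingredient; the full model in the paper goes further, and the function name and date-based inputs are my own choices for the example.

```python
from datetime import date

def overlap_days(start_a, end_a, start_b, end_b):
    """Number of days two members were at the same organization together."""
    latest_start = max(start_a, start_b)
    earliest_end = min(end_a, end_b)
    # A negative difference means the two stays never overlapped.
    return max(0, (earliest_end - latest_start).days)
```

For example, a member who stayed from January 2000 to January 2010 and one who joined in October 2009 overlapped for only 92 days, so this signal would treat their affinity as weak.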

We used this model to predict existing edges on LinkedIn and two other public networks and found that this method’s top-5 prediction accuracy was 42% better than Common Neighbor and Adamic-Adar based link prediction. We also showed empirically that our model works for diverse organizations such as companies, schools, and online groups.
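For reference, the two baselines mentioned above are simple to state. Common Neighbor counts shared connections, and Adamic-Adar weights each shared connection by the inverse log of its degree, so that a shared connection with few connections counts for more. A small sketch, assuming the graph is a dict from node to a set of neighbors:

```python
import math

def common_neighbors(graph, u, v):
    """Number of neighbors shared by u and v."""
    return len(graph[u] & graph[v])

def adamic_adar(graph, u, v):
    """Sum of 1 / log(degree) over the neighbors shared by u and v.

    A shared neighbor is connected to both u and v, so its degree is at
    least 2 and the logarithm is always positive.
    """
    return sum(1.0 / math.log(len(graph[z])) for z in graph[u] & graph[v])

toy = {
    "u": {"x", "y"},
    "v": {"x", "y"},
    "x": {"u", "v"},
    "y": {"u", "v", "w"},
    "w": {"y"},
}
```

Both scores depend only on graph structure, which is why a model that also uses organizational time overlap can add information they miss.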

Detecting communities

Detecting communities within an organization is another important challenge. On most online social networks, a user can follow an entity to receive updates on it within a personalized news feed. For example, members can follow a company on LinkedIn and receive company updates.

Following a company on LinkedIn

To recommend entities for a member to follow, we look at entities the member's community is already following. Simply using the entire organization yields inferior results, as most organizations are diverse and contain several orthogonal groups (for example, sales, marketing, engineering) and subgroups (for example, front-end, database, machine learning).

As another example, consider a news feed generated by online activity and how its volume can quickly overwhelm a user. A key feature in ranking a news feed is to promote an update if the member is in the same community as the originator of the update.

Community Detection: Given a network, detect communities among the nodes in the network

The organizational overlap model also works well for detecting communities within an organization. It is usually hard to evaluate the quality of communities because of a lack of ground truth. We used an indirect method to evaluate the quality: intuitively, information should propagate faster within a community, so we measured the quality of detected communities by the speed of information propagation within them.

We evaluated detected communities within the LinkedIn network by the propagation speed of company follows and sharing activity. Results show that communities detected by our method are up to 66% better than communities detected by only links in terms of the propagation speed of shared articles, and 15% better in terms of the propagation speed of company follows.

Beyond MapReduce

Google announced that they are not using MapReduce anymore: "Google dumps MapReduce in favor of new hyper-scale analytics system". MapReduce has been a simple abstraction that has made large-scale data processing easier, scalable, and fault-tolerant. However, the MapReduce paradigm does not work well for many use cases such as stream processing, iterative computation, graph processing, real-time analytics, etc. Here is another blog post on this announcement: "The elephant was a trojan horse: on the death of map reduce at Google".

Summary of a few papers from SIGIR 2012 - Part I

(photo from: http://www.city-data.com/picfilesv/picv32970.php)

Here is a short summary of a few papers from SIGIR 2012:

Adaptation of the Concept Hierarchy Model with Search Logs for Query Recommendation on Intranets by Ibrahim Adeyanju, Dawei Song, M-Dyaa Albakour, Udo Kruschwitz, Anne De Roeck and Maria Fasli. This paper talks about enhancing query suggestions on intranets. It combines a concept hierarchy model with query-flow-graph-based query suggestions. First, the paper builds a hierarchical clustering of intranet documents to create a concept hierarchy. Then, for a given query, it finds candidates from the concept hierarchy and uses query logs to adapt the candidate query suggestions based on past user clicks. Here are more details about the overall project and related papers: http://autoadapt.essex.ac.uk/tiki/tikiwiki-3.0/tiki-index.php
An Exploration of Ranking Heuristics in Mobile Local Search by Yuanhua Lv, Dimitrios Lymberopoulos, Qiang Wu. This paper presents an in-depth analysis of local search features such as the user's location, the ratings and number of reviews for a business, and the user's profile and personal preferences, and how each of these features affects the click rate on results. It also talks about incorporating the category of businesses (used by local search engines such as Yelp, Google local, and Bing local) in ranking results. The paper describes a machine learning approach to combine these signals to predict click rate.
Detecting Quilted Web Pages at Scale by Marc Najork. Web spam detection is a serious issue in improving the quality of search results. This paper talks about an algorithm for detecting 'quilted' web pages (web pages that are stitched together by combining content from other web pages). The algorithm takes a corpus of web pages as input and outputs a set of quilted web pages along with the source pages used in those quilted web pages. The algorithm first extracts patch grams by finding k-grams that are not too popular (occur in at most m web pages) and occur in at least one web page. Then, for each document, the algorithm finds its patch grams and the source documents (other than the document under consideration) containing those patch grams.
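To make the patch-gram idea from the last paper more concrete, here is a toy, in-memory sketch based only on my summary above, not on the paper's actual implementation. It uses word k-grams, and the `min_pages=2` default is my own assumption: a gram that appears on only one page cannot point to a source page.

```python
from collections import defaultdict

def patch_grams(pages, k=3, max_pages=2, min_pages=2):
    """Map each rare-but-shared word k-gram to the set of pages containing it.

    pages: dict page_id -> text. All parameter defaults are illustrative.
    """
    pages_with = defaultdict(set)
    for page_id, text in pages.items():
        words = text.split()
        for i in range(len(words) - k + 1):
            pages_with[tuple(words[i:i + k])].add(page_id)
    return {
        gram: ids
        for gram, ids in pages_with.items()
        if min_pages <= len(ids) <= max_pages
    }

def sources_for(page_id, grams):
    """Other pages that share at least one patch gram with page_id."""
    sources = set()
    for ids in grams.values():
        if page_id in ids:
            sources |= ids - {page_id}
    return sources

pages = {"q": "a b c d x y z", "s1": "a b c d", "s2": "x y z w"}
```

In this example page "q" is stitched from "s1" and "s2", and `sources_for("q", patch_grams(pages))` recovers both of them.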

Also, the industrial track was very interesting, and I will post a summary soon. I presented on Related Searches at LinkedIn, which I described in an earlier post: related searches at LinkedIn blogpost.

On leap second bug

Last Saturday's "leap second" adjustment caused issues with many online sites: "Leap second bug wreaks havoc across web".

Google's SRE team posted a nice blog post on how they fixed the leap second issue: "Time, technology and leaping seconds". They use a "leap smear", where they change the duration of each second reported by NTP depending on whether a "leap second" is being added or subtracted. I wonder whether similar techniques could be applied to Stratum 0/1 NTP servers so that the rest of us don't have to worry about leap seconds in the future?
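To illustrate the smearing idea, the sketch below spreads one leap second over a window using a smooth cosine ramp, so the reported time never jumps. The ramp shape and the window length are assumptions for the example; check the Google post for the exact scheme they used.

```python
import math

def smear_offset(elapsed, window, leap=1.0):
    """Seconds of the leap already absorbed `elapsed` seconds into the window.

    Rises smoothly from 0 at the start of the window to `leap` at the end.
    Use a negative `leap` for a subtracted leap second.
    """
    t = min(max(elapsed, 0.0), window)  # clamp to the smear window
    return leap * (1.0 - math.cos(math.pi * t / window)) / 2.0
```

With a hypothetical 20-hour window (72,000 seconds), half of the leap second has been absorbed at the 10-hour mark, and each reported second along the way is only slightly longer than a real one.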

igraph: a nice graph visualization package in R

Trying out igraph, a graph visualization package in R. Looks promising.

A paper with a good summary of ordinal regression models

A paper with a good summary of ordinal regression models: "Regression models for ordinal responses: A review of methods and applications"

The first method described is the "Cumulative Logit Model" or "Proportional Odds Model". In this model, the log of the cumulative odds of the ordinal response is expressed as a linear combination of the independent variables, offset by thresholds on the ordinal values. This model is used by Yehuda Koren and Joe Sill in their OrdRec paper (RecSys'11) to extend SVD++ for ordinal feedback on items.
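Written out, one common form of the cumulative logit model for an ordinal response Y with J levels looks like the following. The symbols are my own notation (x for the independent variables, beta for their coefficients, theta_j for the thresholds), and some texts use a plus sign instead of the minus sign in front of the linear term.

```latex
\log \frac{P(Y \le j \mid x)}{P(Y > j \mid x)} = \theta_j - \beta^{\top} x,
\qquad j = 1, \dots, J-1,
\qquad \theta_1 \le \theta_2 \le \dots \le \theta_{J-1}
```

The same coefficient vector is shared across all J-1 thresholds, which is where the name "proportional odds" comes from.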

Mitul Tiwari’s Space