Social networks matter in engaging a new member in a community. Are there patterns in a new employee's initial connections at a company that influence future retention at that company? How does a new employee network or connect with other employees of the company? Are there similarities in the company networks that new employees build?
Despite the importance of these questions, there is little understanding of how the ego-network (the connections of an individual and the network of connections between them) of a new member evolves within the community. Recently we analyzed LinkedIn’s data and published a detailed study to answer these questions at the World Wide Web (WWW) Conference:
Influence of First Steps in a Community on Ego-Network: Growth, Diversity, and Engagement
In this post I will discuss some of the highlights of the paper, including the growth of the network in a new company, the diversity of the network, and the retention of new employees over time.
We use the top 500 companies on LinkedIn (by average degree), which include more than 1 million members and more than 100 thousand new employees in 2013. To our knowledge, this is the largest analysis of company networks and employee behavior.
Connecting initially with senior, well-connected people results in longer retention
First, we looked into whether there is any relationship between a new employee's initial connections after joining a company and retention at the company. We looked at the network size and seniority of the first ten connections a new employee makes inside the company. We then checked whether the new employee still worked for the same company after 1.5 years. We found that if the initial connections are more senior and have larger networks, then the new employee is less likely to leave the company early.
A more diverse initial network implies a more diverse and larger network over time
Second, is there any relationship between the network status of a new employee's initial connections and the employee's future network status? We computed the average degree and functional group diversity in the ego-network of the first 10 connections. We then obtained the degree and functional group diversity of the final ego-network after 1.5 years. We found that a more diverse network of initial connections implies a more diverse and larger network over time.
Network growth through triadic-closure led by high-degree people
Finally, we analyzed the growth of the network for new employees. We found that new employees grow their network through triad-closing propagation led by high-degree people.
Figure: A larger Wiener index implies that the real network is more viral in the triadic-closure propagation than a random network.
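For readers unfamiliar with the measure in the figure, here is a minimal sketch of the Wiener index, taken here to be the sum of shortest-path distances over all pairs of nodes. The function name and the two toy graphs are hypothetical illustrations, not the computation from the paper. A chain-like spread scores higher than a star-like broadcast over the same number of nodes.

```python
from collections import deque


def wiener_index(adjacency):
    """Sum of shortest-path distances over all unordered node pairs.

    Assumes an undirected, connected graph given as {node: [neighbors]}.
    """
    total = 0
    for source in adjacency:
        # Breadth-first search gives hop distances from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        total += sum(dist.values())
    # Every pair was counted twice (once from each endpoint).
    return total // 2


# Toy graphs with five nodes each (illustrative only).
star = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

star_index = wiener_index(star)
path_index = wiener_index(path)
```

The path graph yields the larger index, which matches the intuition that propagation passing through many successive hops is more viral than a single hub reaching everyone directly.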
For more details check out our paper:
Influence of First Steps in a Community on Ego-Network: Growth, Diversity, and Engagement, Atef Chaudhury, Myunghwan Kim, and Mitul Tiwari. In the Proceedings of the 25th International Conference Companion on World Wide Web (WWW), April 2016.
People You May Know (PYMK) recommends other people to connect with, allowing members to grow their network, and it’s one of the most recognizable features at LinkedIn. PYMK is responsible for building more than 50% of LinkedIn’s professional graph. The two main challenges in building People You May Know are machine learning and scale.
In terms of machine learning, the basic problem behind PYMK is link prediction over the social graph, that is, figuring out missing edges on the social graph that are present in real life but may not yet be reflected on LinkedIn’s professional graph. At the heart of it is a binary classification problem: predicting whether two people know each other or not. PYMK uses a logistic regression model for the binary classification problem to combine hundreds of features. PYMK uses LinkedIn’s open-sourced large-scale machine learning library to train models on hundreds of millions of samples.
There are many features or signals used for predicting whether two people know each other. For example, one of the first things to look at is friends-of-friends, or triangle closing. If Alice knows Bob and Bob knows Carol, then Alice and Carol might know each other since there is one common connection. As the number of common connections increases, the likelihood of two people knowing each other increases. After closing these triangles, PYMK scores such candidate pairs using other features such as overlapping organizations (for example, same company, same school, same group), geographical distance, age difference, etc.
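The triangle-closing step can be sketched in a few lines. This is only a toy illustration on an in-memory dictionary, assuming an undirected graph stored as sets of neighbors; the function name and the sample members are hypothetical, and the real system works at a very different scale.

```python
from collections import Counter


def friends_of_friends(graph, member):
    """Count common connections between `member` and each second-degree candidate."""
    direct = graph.get(member, set())
    counts = Counter()
    for friend in direct:
        for candidate in graph.get(friend, set()):
            # Skip the member and anyone already connected to them.
            if candidate != member and candidate not in direct:
                counts[candidate] += 1
    # Candidates with more common connections come first.
    return counts.most_common()


# Hypothetical toy graph: each key maps to that member's connections.
graph = {
    "alice": {"bob", "dave"},
    "bob": {"alice", "carol"},
    "carol": {"bob", "dave"},
    "dave": {"alice", "carol", "erin"},
    "erin": {"dave"},
}

candidates = friends_of_friends(graph, "alice")
```

Here Carol shares two connections with Alice (Bob and Dave) and Erin shares one (Dave), so Carol ranks ahead of Erin before any other features are applied.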
There are many interesting modeling challenges in feature engineering. For example, as part of our research to understand how two people working in the same organization know each other, we built a novel model factoring in the time of joining and departing an organization, the size of the organization, and the likelihood of knowing each other in an organization (as some organizations are more social than others), as published in WWW’13. The logic is simple: the affinity between two members who worked together in a small organization for 10 years is greater than that between members who've worked together for only a few months.
Another interesting modeling challenge is how to incorporate user feedback through impression discounting, that is, discounting PYMK results that users have seen and ignored (see our KDD’14 paper for more details). The intuition is simple: PYMK results that were shown to a user and did not lead to a connection have effectively been ignored, and should be lowered in the ranking of PYMK results.
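To make the intuition concrete, here is a deliberately simple sketch of impression discounting. The exponential decay, the `decay` value, and the function names are all assumptions chosen for illustration; they are not the discounting model from the KDD’14 paper.

```python
def discounted_score(score, ignored_impressions, decay=0.8):
    """Shrink a score once for every time the result was shown and ignored.

    The exponential form and the default decay are illustrative assumptions.
    """
    return score * decay ** ignored_impressions


def rerank(candidates, impressions, decay=0.8):
    """Re-sort (name, score) pairs after applying the discount."""
    rescored = [
        (name, discounted_score(score, impressions.get(name, 0), decay))
        for name, score in candidates
    ]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Hypothetical example: Carol scored higher but was already ignored three times.
ranked = rerank([("carol", 0.9), ("erin", 0.7)], {"carol": 3})
```

After discounting, Erin moves ahead of Carol, even though Carol's original score was higher.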
In terms of scale, the PYMK system processes 100s of terabytes of data and 100s of billions of potential connections daily, and pushes new PYMK results every day. Because PYMK looks at the second-degree network (connections of connections), the amount of data to process grows much faster than the site itself. This poses a unique scaling challenge: we need to keep optimizing the PYMK system to handle this growth in data processing while still refreshing PYMK results every day. LinkedIn has built a big data ecosystem to address the scaling challenges in PYMK, including systems such as the Voldemort key-value store, Azkaban for Hadoop workflow management, Apache Kafka for streaming, Apache DataFu for simplifying data processing in Pig on Hadoop, Cubert for efficient joins and data processing, etc.
There is a lot of interesting ongoing work to improve People You May Know further.
Stay tuned!
(Interestingly, People You May Know was invented at LinkedIn.)
Figure 1: Example LinkedIn signup cascade
Many popular websites such as LinkedIn power their growth through guest invitations from existing members to non-members. New members who join can also send such guest invitations, resulting in a cascade of membership growth at a large scale. What does such a cascade of membership growth look like? How viral are these cascades?
Recently we analyzed LinkedIn’s growth through such guest invitations to address these questions, in the largest structural analysis of cascading growth diffusion, and published our work at the 24th International World Wide Web Conference (WWW) 2015:
Global Diffusion via Cascading Invitations: Structure, Growth, and Homophily
LinkedIn is the largest professional network, with more than 360 million members. LinkedIn membership has grown through warm signups, which result from guest invitations, and through direct signups at the site without an invitation. A significant fraction of members joined LinkedIn through cascading guest invitations resulting in warm signups. The cascading signups can be organized into a collection of trees: each time a member signs up directly, she becomes the root of a new tree, and every user who signs up by accepting an invitation from a member A becomes a child of member A in the tree containing member A. One example of such a tree is shown (at the top) in Figure 1. We find that LinkedIn’s signup cascade trees are huge and very viral (compared to previously studied diffusion phenomena), and that members remain active for a long time in sending guest invitations, resulting in more warm signups.
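The tree construction described above can be sketched as follows. This is a minimal illustration assuming each signup record is a `(member, inviter)` pair with `None` as the inviter for direct signups; the record format, function names, and sample data are hypothetical.

```python
from collections import defaultdict


def build_cascade_forest(signups):
    """Group signups into trees: direct signups are roots, warm signups are children."""
    children = defaultdict(list)
    roots = []
    for member, inviter in signups:
        if inviter is None:
            roots.append(member)  # direct signup starts a new tree
        else:
            children[inviter].append(member)  # warm signup hangs under the inviter
    return roots, children


def tree_size(root, children):
    """Number of members in the tree rooted at `root` (iterative traversal)."""
    size = 0
    stack = [root]
    while stack:
        node = stack.pop()
        size += 1
        stack.extend(children.get(node, []))
    return size


# Hypothetical signup log.
signups = [("a", None), ("b", "a"), ("c", "a"), ("d", "b"), ("e", None), ("f", "e")]
roots, children = build_cascade_forest(signups)
sizes = {root: tree_size(root, children) for root in roots}
```

In this toy log, member `a` roots a tree of four members and member `e` roots a tree of two.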
Figure 2: Pattern of LinkedIn’s cascade trees growth over time
How do LinkedIn’s cascade trees grow in size over time? In Figure 2 we plot the growth of the 1000 biggest cascade trees on LinkedIn. We see a surprisingly robust growth pattern in these cascade trees (and in all the trees as well). We also observe that the number of cascade trees is growing over time at a deliberate pace. In short, we observe a persistent, parallel increase in both the number of cascade trees and their size, contributing to warm signup growth at LinkedIn.
How does growth diffusion affect the characteristics of the members present in the cascade trees? In addition to analyzing the structure of growth cascade trees, we also examine the interaction between the characteristics of the members present in the trees and the structure of the trees. We find that geography and industry play an important role in the cascade and that inviters and invitees are similar on these dimensions. Figure 3 shows within-tree similarity and compares it with between-tree similarity on country, region, industry, engagement, and seniority among growth cascade trees. We observe that within-tree similarity on the country and industry dimensions is much higher than between-tree similarity.
Figure 3: Within-tree and between-tree similarity (homophily) on country, region, industry, engagement, and seniority among growth cascade trees.
Figure 4: Root-guessing experiment where we are trying to predict the country of root node based on the country of a given node.
This is the largest growth diffusion study (that we are aware of) and for more details check out our paper:
Online social networks have become important tools for networking, communication, sharing, and discovery. A considerable challenge these networks face is the fact that an online social network is partially observed: two individuals might know each other, but may not have established a connection on the site.
Therefore, link prediction and recommendations are important tasks for any online social network. We published a paper in the 22nd International World Wide Web Conference (WWW), May 2013 that describes how we developed a novel organizational overlap model for link prediction between two users in a network:
Organizational Overlap on Social Networks and its Applications
In this post, I’ll briefly discuss some of the highlights from the paper, including link prediction, recommendations, and community detection.
Social network sites use recommendation systems such as LinkedIn’s ‘People You May Know’ to enable a significant number of link creations.
A basic problem in network analysis is predicting links for partially observed networks, that is, given a snapshot of connections at time t, can we predict links at time t+1? On any online social network, two members might know each other, but may not have established a connection on the site. Link prediction and recommendations help address this problem and create a more complete social graph to improve user involvement.
As part of our research to understand edge affinity between users, we built a novel model factoring in the time of joining and departing an organization. The logic is simple: the affinity between two members who worked together in an organization for 10 years is greater than members who've worked together for only a few months. We built a mathematical model based on this organizational time overlap and validated the model with LinkedIn’s social network data.
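The raw ingredient of that intuition, the amount of time two members spent in the same organization, can be computed as below. This is only a sketch of the overlap itself, assuming each stint is a `(start, end)` pair of dates; the names and dates are hypothetical, and the model in the paper builds considerably more on top of this quantity.

```python
from datetime import date


def overlap_days(stint_a, stint_b):
    """Days during which two (start, end) stints at the same organization overlap."""
    start = max(stint_a[0], stint_b[0])
    end = min(stint_a[1], stint_b[1])
    # A negative difference means the stints never overlapped.
    return max((end - start).days, 0)


# Hypothetical stints at one organization.
alice = (date(2003, 1, 1), date(2013, 1, 1))
bob = (date(2004, 1, 1), date(2014, 1, 1))
carol = (date(2012, 10, 1), date(2013, 6, 1))
dave = (date(2013, 6, 1), date(2015, 1, 1))

long_overlap = overlap_days(alice, bob)
short_overlap = overlap_days(alice, carol)
no_overlap = overlap_days(alice, dave)
```

Alice and Bob overlap for about nine years, Alice and Carol for about three months, and Alice and Dave not at all, which is the ordering the affinity intuition expects.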
We used this model to predict existing edges on LinkedIn and two other public networks and found that this method’s top-5 prediction accuracy was 42% better than Common Neighbor and Adamic-Adar based link prediction. We also showed empirically that our model works for diverse organizations such as companies, schools, and online groups.
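For context, the two baselines named above are standard neighborhood-based scores. The sketch below uses their textbook definitions on a hypothetical toy graph: Common Neighbor counts shared connections, and Adamic-Adar weights each shared connection by the inverse log of its degree, so that a shared connection with few contacts counts for more. It is not the evaluation code from the paper.

```python
import math


def common_neighbors(graph, u, v):
    """Number of connections that u and v share."""
    return len(graph[u] & graph[v])


def adamic_adar(graph, u, v):
    """Sum of 1 / log(degree) over shared connections.

    A shared connection of two distinct nodes has degree >= 2, so log(degree) > 0.
    """
    return sum(1.0 / math.log(len(graph[z])) for z in graph[u] & graph[v])


# Hypothetical undirected toy graph stored as sets of neighbors.
graph = {
    "alice": {"bob", "dave"},
    "bob": {"alice", "carol"},
    "carol": {"bob", "dave"},
    "dave": {"alice", "carol", "erin"},
    "erin": {"dave"},
}

cn_score = common_neighbors(graph, "alice", "carol")
aa_score = adamic_adar(graph, "alice", "carol")
```

Alice and Carol share Bob (degree 2) and Dave (degree 3), so Bob contributes more to the Adamic-Adar score than Dave does.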
Detecting communities within an organization is another important challenge. On most online social networks, a user can follow an entity to receive updates on it within a personalized news feed. For example, members can follow a company on LinkedIn and receive company updates.
To recommend entities for a member to follow, we look at entities the member's community is already following. Simply using the entire organization yields inferior results, as most organizations are diverse and contain several orthogonal groups (for example, sales, marketing, engineering) and subgroups (for example, front-end, database, machine learning).
As another example, consider a news feed generated by online activity and how its volume can quickly overwhelm a user. A key feature in ranking a news feed is to promote an update if the member is in the same community as the originator of the update.
The organizational overlap model also works well for detecting communities within an organization. It is usually hard to evaluate the quality of communities because of a lack of ground truth. We used an indirect method to evaluate the quality: intuitively, information should propagate faster within a community, so we measured the quality of detected communities by the speed of information propagation within them.
We evaluated detected communities within the LinkedIn network by the propagation speed of company follows and sharing activity. Results show that communities detected by our method are up to 66% better than communities detected by only links in terms of the propagation speed of shared articles, and 15% better in terms of the propagation speed of company follows.
For more details, check out the full paper:
Google announced that they are not using MapReduce anymore: "Google dumps MapReduce in favor of new hyper-scale analytics system". MapReduce has been a simple abstraction that has made large-scale data processing easier, scalable, and fault-tolerant. However, the MapReduce paradigm does not work well for many use cases such as stream processing, iterative computation, graph processing, real-time analytics, etc. Here is another blog post on this announcement: "The elephant was a Trojan horse: on the death of Map-Reduce at Google".
(photo from: http://www.city-data.com/picfilesv/picv32970.php)
Here is a short summary of a few papers from SIGIR 2012:
Also, the industrial track was very interesting and I will post a summary soon. I presented on Related Searches at LinkedIn, which I described in an earlier post: related searches at LinkedIn blogpost.
Metaphor builds on a number of signals and filters that capture several dimensions of relatedness in search activity.
Correlation based on clicks: The second signal is based on query-result clicks, that is, search queries that result in clicking the same result. For example, if the search results for the queries “Hadoop” and “MapReduce” have a common result that is clicked often by members, then we can say those two queries are related. LinkedIn’s personalized search results bring added richness to the queries related by this signal.
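The click-based signal can be sketched on a small in-memory log. This is an illustrative simplification, assuming the log is a list of `(query, clicked_result)` pairs and that relatedness is measured by the number of distinct results two queries share; the function name, threshold, and sample log are hypothetical, and the production version runs as map-reduce jobs rather than in a single process.

```python
from collections import defaultdict
from itertools import combinations


def related_by_clicks(click_log, min_shared=1):
    """Pair up queries whose clicks landed on the same results.

    Returns {(query_a, query_b): number of distinct shared clicked results}.
    """
    queries_by_result = defaultdict(set)
    for query, result in click_log:
        queries_by_result[result].add(query)

    shared = defaultdict(int)
    for queries in queries_by_result.values():
        # Sorting makes each pair appear in one canonical order.
        for a, b in combinations(sorted(queries), 2):
            shared[(a, b)] += 1
    return {pair: n for pair, n in shared.items() if n >= min_shared}


# Hypothetical click log.
click_log = [
    ("hadoop", "doc1"), ("mapreduce", "doc1"),
    ("hadoop", "doc2"), ("mapreduce", "doc2"), ("pig", "doc2"),
    ("java", "doc3"),
]

related = related_by_clicks(click_log, min_shared=2)
```

With a threshold of two shared results, only the "hadoop" and "mapreduce" pair survives in this toy log.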
Metaphor, our related search recommendation engine, runs on Hadoop. The query logs and activity tracking data are aggregated from the production systems using Kafka, a publish-subscribe streaming system for event collection and distribution. Metaphor consists of several map-reduce jobs implemented in Java on Hadoop and in Pig, a scripting language on top of Hadoop. Azkaban, a Hadoop workflow management tool, is used for managing Metaphor’s more than 50 map-reduce jobs. The final recommendations are stored in Voldemort, a key-value store, and served in production from the Voldemort store.
We evaluated Metaphor in various ways, using traditional precision-recall measures as well as online A/B testing. Check out our paper [1] for more details.
Metaphor: a system for related search recommendations, Azarias Reda, Yubin Park, Mitul Tiwari, Christian Posse, and Sam Shah. In Proceedings of the 21st International Conference on Information and Knowledge Management (CIKM), October 2012 (to appear).
Last Saturday's "leap second" adjustment caused issues with many online sites: "Leap second bug wreaks havoc across web".
Google's SRE team posted a nice blog post on how they handled the leap second issue: "Time, technology and leaping seconds". They used a "leap smear", where they change the duration of each second reported by NTP depending on whether a "leap second" is being added or subtracted. I wonder whether similar techniques could be applied to Stratum 0/1 NTP servers so that everyone else doesn't have to worry about leap seconds in the future?
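To illustrate the general idea of smearing, here is a small sketch that spreads a one-second adjustment evenly over a window instead of applying it all at once. The linear shape, the 24-hour window, and the function name are assumptions made for illustration; this is not claimed to be the exact function described in Google's post.

```python
def smear_offset(elapsed, window=86400.0, leap=1.0):
    """Portion of the leap adjustment already absorbed `elapsed` seconds into the window.

    `leap` is +1.0 when a second is added and -1.0 when one is removed.
    The linear ramp and 24-hour window are illustrative assumptions.
    """
    fraction = min(max(elapsed / window, 0.0), 1.0)
    return leap * fraction


halfway = smear_offset(43200.0)
finished = smear_offset(86400.0)
negative_halfway = smear_offset(43200.0, leap=-1.0)
```

Halfway through the window, half of the leap second has been absorbed; by the end of the window the full second has been applied, and clients never see a sudden jump.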
Trying out igraph, a graph visualization package in R. Looks promising.
A paper with a good summary of ordinal regression models: "Regression models for ordinal responses: A review of methods and applications"
The first method described is the "Cumulative Logit Model" or "Proportional Odds Model". In this model, the log of the cumulative odds of the dependent variable (the odds of falling at or below a given ordinal level) is expressed as a linear combination of the independent variables together with a threshold for each ordinal value. This model is used by Yehuda Koren and Joe Sill in their OrdRec paper (RecSys'11) to extend SVD++ to ordinal feedback on items.
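For reference, the model is commonly written as below. The notation is an assumption, not taken from the review paper: Y is the ordinal response with J levels, x is the vector of independent variables, the theta values are the ordered thresholds, and beta is the coefficient vector. Sign conventions vary between texts; some write the right-hand side with a plus sign.

```latex
\log \frac{P(Y \le j \mid x)}{P(Y > j \mid x)} = \theta_j - \beta^{\top} x,
\qquad j = 1, \dots, J - 1,
\qquad \theta_1 \le \theta_2 \le \dots \le \theta_{J-1}
```

Because the same coefficient vector is shared across all thresholds, the effect of x on the odds is the same at every cut point, which is where the name "proportional odds" comes from.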
WSDM 2012 list of accepted papers: http://wsdm2012.org/program/detailed.html.
Also, here is a list of papers related to social networks: http://cs2socialnetworks.wordpress.com/2011/12/05/wsdm-2012-accepted-on-social-networks
Looked at a few papers over the weekend.
Attended RecSys 2011 in October this year. Here are the official proceedings: RecSys 2011 proceedings. The conference had many interesting papers, and the industrial track was also pretty good, with presentations from eBay, Facebook, Twitter, Pandora, and Netflix.
Clojure creator Rich Hickey on the virtues of simplicity over easiness: http://www.infoq.com/presentations/Simple-Made-Easy
Nice blog post by @arsatiki on Bayesian A/B testing theory and code: http://arsatiki.posterous.com/bayesian-ab-testing-with-theory-and-code
Last week in our reading group at LinkedIn we discussed "Supervised Random Walks: Predicting and Recommending Links in Social Networks" by L. Backstrom and J. Leskovec, which appeared in the ACM International Conference on Web Search and Data Mining (WSDM), 2011.
This is quite an interesting paper about the link prediction problem. Link prediction is a fundamental problem in graphs and social networks: predicting edges/links that don't exist currently but may appear in the future. Applications of link prediction include (1) the "People You May Know" feature in social networks such as LinkedIn or Facebook, (2) predicting links in protein-protein interaction networks, (3) suggesting links for bloggers to add, (4) predicting collaboration, etc.
In the past there have been two main approaches to link prediction in a graph: (1) classifying potential edges as positive or negative using various features from the graph; and (2) running a random walk on the graph to rank nodes by assigning a probability to each node.
In the first approach, various features such as node features (e.g., age, gender), edge features (e.g., interaction, activity), and graph features (e.g., the number of common friends) are used to predict links. Past edges can be used as labeled training data to classify potential edges as positive or negative edges in the future.
In the second approach, random walks are used to assign probabilities to nodes, and nodes are ranked based on these probabilities. The higher the rank, the more likely it is that there will be an edge between the starting node and that node. Random walks have been used to find important nodes in a graph (as in PageRank) and also to find similar nodes in a graph. Random walk with restarts is one example of this approach.
This paper describes how to combine these two approaches into one by building a model that biases a random walk in a graph, and then using the random walk for link prediction. Node and edge features (e.g., the number of common friends) are used to learn transition probabilities along edges. Then a random walk with restart is run to compute a probability score for each node. Nodes are ordered by these probability scores to find the top nodes that are likely to be connected with the starting node in the future.
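To make the second ingredient concrete, here is a minimal sketch of a plain random walk with restart, computed by repeated propagation of probability mass. It uses uniform transition probabilities over neighbors, which is an assumption for illustration; the paper's contribution is precisely to learn those transition probabilities from features instead. The function name, parameters, and toy graph are hypothetical.

```python
def random_walk_with_restart(graph, start, restart=0.15, iterations=50):
    """Approximate visiting probabilities for a walk that keeps jumping back to `start`.

    Transitions are uniform over neighbors (an illustrative assumption).
    """
    scores = {node: 0.0 for node in graph}
    scores[start] = 1.0
    for _ in range(iterations):
        updated = {node: 0.0 for node in graph}
        for node, mass in scores.items():
            neighbors = graph[node]
            if not neighbors:
                # A node with no neighbors sends its walking mass back to the start.
                updated[start] += (1.0 - restart) * mass
                continue
            share = (1.0 - restart) * mass / len(neighbors)
            for neighbor in neighbors:
                updated[neighbor] += share
        # Total mass is 1, so the restart step returns exactly `restart` to the start node.
        updated[start] += restart
        scores = updated
    return scores


# Hypothetical undirected toy graph.
graph = {
    "alice": ["bob", "dave"],
    "bob": ["alice", "carol"],
    "carol": ["bob", "dave"],
    "dave": ["alice", "carol", "erin"],
    "erin": ["dave"],
}

scores = random_walk_with_restart(graph, "alice")
# Rank only nodes that are not already connected to the starting node.
ranking = sorted(
    (n for n in graph if n != "alice" and n not in graph["alice"]),
    key=scores.get,
    reverse=True,
)
```

In this toy graph Carol outranks Erin as a suggested connection for Alice, since walks from Alice reach Carol through both Bob and Dave.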
This is quite an interesting approach that does not need complicated feature engineering to derive graph features, but it is very compute-intensive, since a random walk takes multiple iterations of graph traversal, and if the graph is huge (say 100s of millions of nodes) it will take a long time to run. In my experience, running the link prediction pipeline frequently is more important than a marginal improvement in link prediction accuracy, since the online social graph changes continuously.
Nevertheless, this paper brings very interesting ideas for combining the classification approach and the random walk approach for link prediction.
I hope I can learn something from the following quotes.
"You can't expect to prevent negative feelings altogether. And you can't expect to experience positive feelings all the time...The Law of Emotional Choice directs us to acknowledge our feelings but also to refuse to get stuck in the negative ones." -- Greg Anderson
"It is common sense to take a method and try it. If it fails, admit it frankly and try another. But above all, try something." -- Franklin D. Roosevelt