Mitul Tiwari's Space

Influence of First Connections for a New Employee on Growth and Retention (June 6, 2016)


Social networks matter in engaging a new member in a community. Are there patterns in a new employee's initial connections that influence future retention at the company? How does a new employee network and connect with other employees of the company? Are there similarities among the company networks of new employees?

Despite the importance of these questions, there is little understanding of how the ego-network (the connections of an individual and the network of connections between them) of a new member of a community evolves within the community. We recently analyzed LinkedIn's data and published a detailed study answering these questions at the World Wide Web (WWW) Conference:

Influence of First Steps in a Community on Ego-Network: Growth, Diversity, and Engagement

In this post I will discuss some highlights of the paper, including the growth of a new employee's network within the company, network diversity, and employee retention over time.

We used the top 500 companies on LinkedIn (by average degree), covering more than 1 million members and more than 100 thousand new employees in 2013. To our knowledge, this is the largest analysis of company networks and employee behavior.

Connecting initially with senior, well-connected people is associated with longer retention

First, we looked into whether there is any relationship between initial connections after joining a company and retention at the company. We looked at the network size and seniority of the first ten connections a new employee makes inside the company, and we checked whether the new employee still worked for the same company after 1.5 years. We found that if the initial connections are more senior and have larger networks, then the new employee is less likely to leave the company early.

A more diverse initial network implies a more diverse and larger network over time

Second, is there any relationship between the network status of a new employee's initial connections and the employee's future network status? We computed the average degree and functional-group diversity of the ego-network formed by the first 10 connections, and compared them with the degree and functional-group diversity of the final ego-network after 1.5 years. We found that a more diverse initial network implies a more diverse and larger network over time.
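To make those two statistics concrete, here is a minimal sketch, assuming each connection is summarized as a (degree, functional group) pair and using Shannon entropy as the diversity measure; the sample data are made up, and this is not the paper's actual pipeline.

```python
import math
from collections import Counter

def ego_network_stats(connections):
    """Average degree and functional-group diversity (Shannon entropy)
    over a list of (degree, functional_group) pairs, e.g. a new
    employee's first 10 connections."""
    avg_degree = sum(deg for deg, _ in connections) / len(connections)
    counts = Counter(group for _, group in connections)
    total = sum(counts.values())
    diversity = -sum((c / total) * math.log(c / total) for c in counts.values())
    return avg_degree, diversity

# Hypothetical first-10 connections of a new employee:
first_ten = [(350, "engineering"), (120, "engineering"), (900, "sales"),
             (45, "marketing"), (210, "engineering"), (75, "hr"),
             (500, "sales"), (60, "engineering"), (150, "design"), (330, "sales")]
print(ego_network_stats(first_ten))
```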

Network growth through triadic-closure led by high-degree people

Finally, we analyzed the growth of the network for new employees. We found that new employees grow their networks through triadic-closure propagation led by high-degree people.

Figure: A larger Wiener index implies that the real network is more viral in the triadic-closure propagation than a random network.
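For reference, the Wiener index of a graph is the sum of shortest-path distances over all node pairs. A small sketch on toy graphs (not our data) shows why deep, chain-like propagation scores higher than a shallow burst:

```python
import networkx as nx

def wiener_index(graph):
    """Sum of shortest-path distances over all unordered node pairs."""
    lengths = dict(nx.all_pairs_shortest_path_length(graph))
    # Each pair is counted twice (u->v and v->u), self-distances are 0.
    return sum(d for src in lengths for d in lengths[src].values()) // 2

chain = nx.path_graph(5)  # deep, viral propagation
star = nx.star_graph(4)   # shallow, one-hop burst
print(wiener_index(chain), wiener_index(star))  # 20 > 16
```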

For more details check out our paper:

Influence of First Steps in a Community on Ego-Network: Growth, Diversity, and Engagement, Atef Chaudhury, Myunghwan Kim, and Mitul Tiwari. In the Proceedings of the 25th International Conference Companion on World Wide Web (WWW), April 2016.


Mitul Tiwari

Reinventing People You May Know at LinkedIn (October 2, 2015)


People You May Know (PYMK) recommends other members to connect with, allowing members to grow their networks, and it is one of the most recognizable features at LinkedIn. PYMK is responsible for building more than 50% of LinkedIn's professional graph. The two main challenges in building People You May Know are machine learning and scale.

In terms of machine learning, the basic problem behind PYMK is link prediction over a social graph, that is, figuring out missing edges on the social graph that are present in real life but may not yet be reflected on LinkedIn's professional graph. At the heart of it is a binary classification problem: predicting whether two people know each other or not. PYMK uses a logistic regression model for this binary classification problem to combine hundreds of features, and it uses LinkedIn's open-sourced large-scale machine learning library to train models on hundreds of millions of samples.
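As a rough illustration of that formulation, here is a minimal sketch using scikit-learn's logistic regression; the production system uses LinkedIn's own large-scale library, and the feature names and data below are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row is a candidate pair of members:
# [common_connections, same_company, geo_distance_km, age_gap]
X = np.array([[12, 1, 5, 2],
              [0, 0, 800, 20],
              [3, 1, 50, 5],
              [1, 0, 2000, 15]])
y = np.array([1, 0, 1, 0])  # 1 = the pair connected, 0 = did not

model = LogisticRegression().fit(X, y)
# Probability that a new candidate pair knows each other:
print(model.predict_proba([[8, 1, 10, 3]])[:, 1])
```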

There are many features, or signals, used to predict whether two people know each other. For example, one of the first things to look at is friends-of-friends, or triangle closing. If Alice knows Bob and Bob knows Carol, then Alice and Carol might know each other, since they have one common connection; as the number of common connections increases, the likelihood of two people knowing each other increases. After closing these triangles, PYMK scores such candidate pairs using other features, such as overlapping organizations (for example, same company, same school, same group), geographical distance, age difference, and so on.
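Here is a minimal sketch (not LinkedIn's implementation) of triangle closing: generate second-degree candidates and rank them by the number of common connections.

```python
from collections import Counter

def fof_candidates(graph, member):
    """Non-connected second-degree candidates for `member`,
    ranked by number of common connections."""
    counts = Counter()
    for friend in graph[member]:
        for fof in graph[friend]:
            if fof != member and fof not in graph[member]:
                counts[fof] += 1  # one more shared connection
    return counts.most_common()

graph = {"alice": {"bob"}, "bob": {"alice", "carol", "dave"},
         "carol": {"bob", "dave"}, "dave": {"bob", "carol"}}
print(fof_candidates(graph, "alice"))  # [('carol', 1), ('dave', 1)]
```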

There are many interesting modeling challenges in feature engineering. For example, as part of our research to understand how two people working in the same organization know each other, we built a novel model factoring in the time of joining and departing an organization, the size of the organization, and the likelihood of knowing each other in an organization (as some organizations are more social than others), as published in WWW'13. The logic is simple: the affinity between two members who worked together in a small organization for 10 years is greater than that between members who've worked together for only a few months.
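The sketch below captures the spirit of the idea with a toy score that grows with overlap time and shrinks with organization size; the WWW'13 paper's actual model is considerably more sophisticated, and all numbers here are illustrative.

```python
def overlap_affinity(join_a, leave_a, join_b, leave_b, org_size):
    """Toy affinity: years of shared tenure, discounted by org size
    (co-workers in small organizations are more likely to know each other)."""
    overlap_years = max(0.0, min(leave_a, leave_b) - max(join_a, join_b))
    return overlap_years / org_size

print(overlap_affinity(2003, 2013, 2003, 2013, org_size=50))      # 10 years, small org
print(overlap_affinity(2013.0, 2013.2, 2013.0, 2013.2, 5000))     # a few months, big org
```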

Another interesting modeling challenge is how to incorporate user feedback through impression discounting, that is, discounting the PYMK results that users have seen and ignored (see our KDD'14 paper for more details). The intuition is simple: PYMK results that users have seen but that did not lead to any connection are being ignored, and they should be ranked lower in the PYMK results.
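A minimal sketch of the idea, assuming an exponential discount; the KDD'14 paper studies several discounting functions, and the decay value here is made up.

```python
def discounted_score(base_score, impressions, decay=0.7):
    """Lower a recommendation's score each time it is shown but ignored."""
    return base_score * (decay ** impressions)

print(discounted_score(0.9, 0))  # fresh result: 0.9
print(discounted_score(0.9, 3))  # seen and ignored 3 times: ~0.31
```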


In terms of scale, the PYMK system processes hundreds of terabytes of data and hundreds of billions of potential connections daily, and pushes fresh PYMK results every day. Since PYMK looks at the second-degree network (connections of connections), the data processing volume grows much faster than the site itself. This poses a unique scaling challenge: we need to keep optimizing the PYMK system to handle this growth in data processing while continuing to refresh PYMK results every day. LinkedIn has built a big data ecosystem to address PYMK's scaling challenges, including the Voldemort key-value store, Azkaban for Hadoop workflow management, Apache Kafka for streaming, Apache DataFu for simplifying data processing in Hadoop Pig, Cubert for efficient joins and data processing, and more.

There is a lot of interesting ongoing work to improve People You May Know further, for example:

  • Large-scale distributed machine learning
  • Large-scale social graph processing
  • Network A/B testing

Stay tuned!

(Interestingly, People You May Know was invented at LinkedIn.)


Mitul Tiwari

Growth Diffusion at LinkedIn via Cascading Invitations (July 5, 2015)

Figure 1: Example LinkedIn signup cascade

Many popular websites, such as LinkedIn, power their growth through guest invitations from existing members to non-members. New members can in turn send such guest invitations, resulting in a large-scale cascade of membership growth. What does such a cascade of membership growth look like? How viral are these cascades?

Recently we analyzed LinkedIn's growth through such guest invitations to address these questions, in the largest structural analysis of cascading growth diffusion to date, and published our work at the 24th International World Wide Web Conference (WWW), 2015:

Global Diffusion via Cascading Invitations: Structure, Growth, and Homophily

In this post I will discuss some of the highlights of the paper, including growth through cascading invitations, and the structure of and homophily in these cascades.

LinkedIn is the largest professional network, with more than 360 million members. LinkedIn's membership has grown through warm signups driven by guest invitations, as well as direct signups at the site without an invitation. A significant fraction of members joined LinkedIn through cascading guest invitations resulting in warm signups. The cascading signups can be organized into a collection of trees: each time a member signs up directly, she becomes the root of a new tree, and every user who signs up by accepting an invitation from member A becomes a child of member A in A's tree. One example of such a tree is shown (at the top) in Figure 1. We find that LinkedIn's signup cascade trees are huge and very viral (compared to previously studied diffusion phenomena), and that members remain active for a long time in sending guest invitations, resulting in more warm signups.
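A minimal sketch of organizing signups into such a forest, with direct signups as roots and invited signups as children of their inviters; the record format is hypothetical.

```python
signups = [
    {"member": "A", "inviter": None},  # direct signup -> root of a new tree
    {"member": "B", "inviter": "A"},
    {"member": "C", "inviter": "A"},
    {"member": "D", "inviter": "B"},
    {"member": "E", "inviter": None},  # another root
]

children, roots = {}, []
for s in signups:
    if s["inviter"] is None:
        roots.append(s["member"])
    else:
        children.setdefault(s["inviter"], []).append(s["member"])

def tree_size(node):
    """Size of the cascade tree rooted at `node`."""
    return 1 + sum(tree_size(c) for c in children.get(node, []))

print({r: tree_size(r) for r in roots})  # {'A': 4, 'E': 1}
```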



Figure 2: Pattern of LinkedIn’s cascade trees growth over time


How do LinkedIn's cascade trees grow in size over time? In Figure 2 we plot the growth of the 1000 biggest cascade trees on LinkedIn. We see a surprisingly robust growth pattern in these cascade trees (and in all the trees as well). We also observe that the number of cascade trees grows over time at a steady pace. In short, we observe a persistent, parallel increase in both the number and the size of cascade trees, contributing to warm signup growth at LinkedIn.

How does growth diffusion affect the characteristics of the members present in the cascade trees? In addition to analyzing the structure of the growth cascade trees, we also connect the characteristics of the members present in the trees with the structure of the trees. We find that geography and industry play an important role in the cascades, showing up as similarity between inviter and invitee. Figure 3 compares within-tree similarity with between-tree similarity on country, region, industry, engagement, and seniority for members in the growth cascade trees. We observe that within-tree similarity on the country and industry dimensions is much higher than between-tree similarity.


Figure 3: Within-tree and between-tree similarity (homophily) on country, region, industry, engagement, and seniority among growth cascade trees.


Surprisingly, similarity between inviter and invitee alone is not sufficient to explain the within-tree similarity we observe. We find that higher-order Markov models, in which a node's characteristics depend not only on its parent but on its ancestors as well, produce a level of similarity and homophily that closely matches the observed data, as shown in Figure 4.
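A toy second-order simulation of the idea, where a child's country depends on both parent and grandparent; the probabilities are illustrative, not the fitted values from the paper.

```python
import random

def child_country(parent, grandparent):
    # Agreement with ancestors makes the attribute "sticky": if parent and
    # grandparent match, the child almost surely matches too.
    if parent == grandparent:
        return parent if random.random() < 0.95 else "other"
    return parent if random.random() < 0.7 else grandparent

random.seed(0)
chain = ["US", "US"]  # (grandparent, parent) at the top of a branch
for _ in range(8):
    chain.append(child_country(chain[-1], chain[-2]))
print(chain)
```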

Figure 4: Root-guessing experiment, where we try to predict the country of the root node based on the country of a given node.


This is the largest growth diffusion study that we are aware of. For more details, check out our paper:


Global Diffusion via Cascading Invitations: Structure, Growth, and Homophily, Ashton Anderson, Daniel Huttenlocher, Jon Kleinberg, Jure Leskovec, and Mitul Tiwari. In the Proceedings of the 24th International World Wide Web Conference (WWW), May 2015.


Mitul Tiwari

Organizational Overlap on Social Networks and its Applications (March 18, 2015)

Online social networks have become important tools for networking, communication, sharing, and discovery. A considerable challenge these networks face is the fact that an online social network is partially observed: two individuals might know each other, but may not have established a connection on the site.

Therefore, link prediction and recommendation are important tasks for any online social network. We published a paper at the 22nd International World Wide Web Conference (WWW), May 2013, that describes how we developed a novel organizational overlap model for link prediction between two users in a network:

Organizational Overlap on Social Networks and its Applications

In this post, I’ll briefly discuss some of the highlights from the paper, including link prediction, recommendations, and community detection.

Link prediction and recommendations

Social network sites use recommendation systems such as LinkedIn’s ‘People You May Know’ to enable a significant number of link creations.

People You May Know at LinkedIn

A basic problem in network analysis is predicting links in partially observed networks: given a snapshot of connections at time t, can we predict links at time t+1? On any online social network, two members might know each other but may not have established a connection on the site. Link prediction and recommendations help address this problem and create a more complete social graph, improving user engagement.

Link Prediction: Given a snapshot of a network at time t, can we predict links at time t+1

As part of our research to understand edge affinity between users, we built a novel model factoring in the time of joining and departing an organization. The logic is simple: the affinity between two members who worked together in an organization for 10 years is greater than members who've worked together for only a few months. We built a mathematical model based on this organizational time overlap and validated the model with LinkedIn’s social network data.

We used this model to predict existing edges on LinkedIn and two other public networks, and found that this method's top-5 prediction accuracy was 42% better than Common Neighbors and Adamic-Adar based link prediction. We also showed empirically that our model works for diverse organizations such as companies, schools, and online groups.
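For context on those baselines, here is a minimal sketch of Common Neighbors and Adamic-Adar scoring for a candidate pair; the toy graph is made up.

```python
import math

def common_neighbors(graph, u, v):
    return len(graph[u] & graph[v])

def adamic_adar(graph, u, v):
    # Shared neighbors with small degree count more than popular hubs.
    return sum(1.0 / math.log(len(graph[w])) for w in graph[u] & graph[v])

graph = {"u": {"a", "b"}, "v": {"a", "b", "c"},
         "a": {"u", "v"}, "b": {"u", "v", "c"}, "c": {"v", "b"}}
print(common_neighbors(graph, "u", "v"), adamic_adar(graph, "u", "v"))
```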

Detecting communities

Detecting communities within an organization is another important challenge. On most online social networks, a user can follow an entity to receive updates on it within a personalized news feed. For example, members can follow a company on LinkedIn and receive company updates.

Following a company on LinkedIn

To recommend entities for a member to follow, we look at entities the member's community is already following. Simply using the entire organization yields inferior results, as most organizations are diverse and contain several orthogonal groups (for example, sales, marketing, engineering) and subgroups (for example, front-end, database, machine learning).

As another example, consider a news feed generated by online activity and how its volume can quickly overwhelm a user. A key feature in ranking a news feed is to promote an update if the member is in the same community as the originator of the update.

Community Detection: Given a network, detect communities among the nodes in the network

The organizational overlap model also works well for detecting communities within an organization. It is usually hard to evaluate the quality of communities because of a lack of ground truth. We used an indirect method to evaluate the quality: intuitively, the speed of information propagation should be faster within a community, so we measured the quality of detected communities by the speed of information propagation within it.

We evaluated detected communities within the LinkedIn network by the propagation speed of company follows and sharing activity. Results show that communities detected by our method are up to 66% better than communities detected by only links in terms of the propagation speed of shared articles, and 15% better in terms of the propagation speed of company follows.

Further reading

For more details, check out the full paper:

  • Organizational Overlap on Social Networks and its Applications, Cho-Jui Hsieh, Mitul Tiwari, Deepak Agarwal, Xinyi (Lisa) Huang, and Sam Shah. In Proceedings of the 22nd International World Wide Web Conference (WWW), May 2013.

Mitul Tiwari

Beyond MapReduce (June 30, 2014)

Google announced that they are no longer using MapReduce: "Google dumps MapReduce in favor of new hyper-scale analytics system". MapReduce has been a simple abstraction that made large-scale data processing easier, scalable, and fault-tolerant. However, the MapReduce paradigm does not work well for many use cases, such as stream processing, iterative computation, graph processing, and real-time analytics. Here is another blog post on this announcement: "The elephant was a trojan horse: on the death of map reduce at Google".


Mitul Tiwari

Summary of a few papers from SIGIR 2012 - Part I (August 27, 2012)

    (photo from: http://www.city-data.com/picfilesv/picv32970.php)

    Here is a short summary of a few papers from SIGIR 2012:

• Adaptation of the Concept Hierarchy Model with Search Logs for Query Recommendation on Intranets by Ibrahim Adeyanju, Dawei Song, M-Dyaa Albakour, Udo Kruschwitz, Anne De Roeck and Maria Fasli. This paper talks about enhancing query suggestions on intranets. It combines a concept hierarchy model with query-flow-graph-based query suggestions. First, the paper describes creating a hierarchical clustering of intranet documents to build a concept hierarchy. Then, for a given query, it finds candidates from the concept hierarchy and uses query logs to adapt the suggestion candidates based on past user clicks. More details about the overall project and related papers are here: http://autoadapt.essex.ac.uk/tiki/tikiwiki-3.0/tiki-index.php
• An Exploration of Ranking Heuristics in Mobile Local Search by Yuanhua Lv, Dimitrios Lymberopoulos, Qiang Wu. This paper describes an in-depth analysis of local search features, such as the user's location, the ratings and number of reviews of a business, and the user's profile and personal preferences, and how each of these features affects the click rate on results. It also talks about incorporating the category of businesses (used by local search engines such as Yelp, Google Local, and Bing Local) in ranking results, and describes a machine learning approach to combine these signals to predict click rate.
• Detecting Quilted Web Pages at Scale by Marc Najork. Web spam detection is a serious issue in improving the quality of search results. This paper presents an algorithm for detecting 'quilted' web pages (web pages stitched together by combining content from other web pages). The algorithm takes a corpus of web pages as input and outputs a set of quilted web pages along with the source pages used in them. It first extracts patch grams by finding k-grams that are not too popular (occurring in at most m web pages) and occur in at least one web page. Then, for each document, the algorithm finds its patch grams and the source documents (other than the document in consideration) containing those patch grams (see the sketch after this list).
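A rough sketch of the patch-gram extraction step, under my reading of the paper; the parameters k and m, the toy corpus, and the helper names are illustrative, not Najork's implementation.

```python
from collections import defaultdict

def patch_grams(docs, k=3, m=2):
    """k-grams occurring in at least one and at most m documents,
    mapped to the set of documents containing them."""
    gram_docs = defaultdict(set)
    for doc_id, text in docs.items():
        words = text.split()
        for i in range(len(words) - k + 1):
            gram_docs[tuple(words[i:i + k])].add(doc_id)
    return {g: d for g, d in gram_docs.items() if 1 <= len(d) <= m}

docs = {"a": "the quick brown fox jumps", "b": "lazy dog sleeps all day",
        "quilt": "the quick brown fox meets lazy dog sleeps all day"}
# Source documents for the suspected quilted page:
for gram, sources in patch_grams(docs).items():
    if "quilt" in sources and len(sources) > 1:
        print(gram, "->", sources - {"quilt"})
```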

Also, the industrial track was very interesting, and I will post a summary soon. I presented on Related Searches at LinkedIn, which I described in an earlier post: related searches at LinkedIn blog post.

Mitul Tiwari

Related Searches at LinkedIn (July 29, 2012)

Search plays an important role in online social networks such as LinkedIn, as it provides an essential mechanism for discovering members and content on the network. Related search recommendation is one way to improve members' search experience in finding relevant results. We recently wrote a paper on Metaphor, the related search recommendation system at LinkedIn [1], which is going to appear in the proceedings of the 21st International Conference on Information and Knowledge Management (CIKM), 2012. I am also giving a presentation on related search at LinkedIn at the SIGIR 2012 conference, the premier information retrieval gathering. I am happy to discuss some of the interesting findings from our paper here.

    Design


    Metaphor builds on a number of signals and filters that capture several dimensions of relatedness in search activity.


Correlation based on time: The first signal is based on collaborative filtering by analyzing searches done in the same session, that is, searches correlated in time are considered related. For example, if members search for "Hadoop" and "big data" in succession, then we can say those two queries are related. One challenge is dealing with popular queries like "Obama", which we handle by damping popular queries with a TF-IDF measure.

Correlation based on clicks: The second signal is based on query-result clicks, that is, search queries that result in clicks on the same result. For example, if search results for the queries "Hadoop" and "MapReduce" have a common result that is often clicked by members, then we can say those two queries are related. LinkedIn's personalized search results bring added richness to the queries related by this signal.


Correlation based on term overlap: The third signal is based on overlapping terms present in the queries. For example, the queries "Jeff LinkedIn" and "LinkedIn CEO" have a common term, "LinkedIn", so we can say those two queries are related. One challenge here is identifying the importance of each term. For example, for the search query "iOS engineer" and two candidate queries "iOS developer" and "mechanical engineer", the candidate "iOS developer" is the better suggestion, even though each candidate shares only one term with the search query. We address this challenge by weighing the importance of terms with a variant of the TF-IDF measure.
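A minimal sketch of TF-IDF-weighted term overlap between a query and candidates; the document frequencies below are made up, and Metaphor uses its own variant of this weighting.

```python
import math

QUERY_LOG_SIZE = 1000
# Hypothetical document frequencies of terms in the query log:
DF = {"ios": 50, "engineer": 400, "developer": 120, "mechanical": 80}

def idf(term):
    return math.log(QUERY_LOG_SIZE / DF.get(term, 1))

def overlap_score(query, candidate):
    shared = set(query.lower().split()) & set(candidate.lower().split())
    return sum(idf(t) for t in shared)

print(overlap_score("iOS engineer", "iOS developer"))        # shares rare "ios"
print(overlap_score("iOS engineer", "mechanical engineer"))  # shares common "engineer"
```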

Length bias: The fourth and final signal is based on the number of terms in the search query and in the candidate related queries. While working on this project, we arrived at an interesting insight: members tend to click on suggested queries that are one term longer than their initial query, which corresponds to refining the initial query. For example, LinkedIn members tend to search for "Hadoop developer" after searching for "Hadoop". We developed a statistical biasing model to give more importance to longer query recommendations and used it for final query ranking.

Implementation


Metaphor, our related search recommendation engine, runs on Hadoop. The query logs and activity tracking data are aggregated from the production systems using Kafka, a publish-subscribe streaming system for event collection and distribution. Metaphor consists of several MapReduce jobs implemented in Hadoop Java and Pig, a scripting language on top of Hadoop. Azkaban, a Hadoop workflow management tool, is used to manage Metaphor's more than 50 MapReduce jobs. The final recommendations are stored in Voldemort, a key-value store system, and served in production from the Voldemort store.


We evaluated Metaphor in various ways, using the traditional precision-recall measure as well as online A/B testing. Check out our paper [1] for more details.

    References


    Metaphor: a system for related search recommendations, Azarias Reda, Yubin Park, Mitul Tiwari, Christian Posse, and Sam Shah. In Proceedings of the 21st International Conference on Information and Knowledge Management (CIKM), October 2012 (to appear).

Mitul Tiwari

On leap second bug (July 4, 2012)

Last Saturday's "leap second" adjustment caused issues for many online sites: "Leap second bug wreaks havoc across web".

Google's SRE team posted a nice blog post on how they fixed the leap second issue: "Time, technology and leaping seconds". They use a "leap smear", changing the duration of each second reported by NTP depending on whether a leap second is being added or subtracted. I wonder whether similar techniques could be applied to Stratum 0/1 NTP servers, so that the rest of us don't have to worry about leap seconds in the future.
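A toy illustration of a linear smear, spreading one extra second evenly across a smear window so each reported second is slightly long; the window length here is made up, not Google's actual parameter.

```python
WINDOW_HOURS = 20
SECONDS = WINDOW_HOURS * 3600

stretch = 1 + 1.0 / SECONDS  # each smeared second, in SI seconds
print(f"each second lasts {stretch:.9f} s")
print(f"total extra time: {SECONDS * (stretch - 1):.6f} s")  # exactly 1 s
```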

Mitul Tiwari

igraph: a nice graph visualization package in R (March 4, 2012)

    Trying out igraph, a graph visualization package in R. Looks promising.

Mitul Tiwari

A paper with a good summary of ordinal regression models (January 29, 2012)

    A paper with a good summary of ordinal regression models: "Regression models for ordinal responses: A review of methods and applications"

The first method described is the "Cumulative Logit Model", or "Proportional Odds Model". In this model, the log odds of the response being at or below a given ordinal value is expressed as a linear combination of the independent variables and a threshold for that ordinal value. This model is used by Yehuda Koren and Joe Sill in their OrdRec paper (RecSys'11) to extend SVD++ for ordinal feedback on items.
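In the standard formulation (my notation, not necessarily the review's), for an ordinal response Y with categories 1 through J and covariates x:

```latex
\log \frac{P(Y \le j \mid \mathbf{x})}{P(Y > j \mid \mathbf{x})}
  = \theta_j - \boldsymbol{\beta}^{\top} \mathbf{x},
\qquad j = 1, \dots, J - 1
```

with ordered thresholds θ_1 ≤ … ≤ θ_{J-1} and a single coefficient vector β shared across all categories, which is what makes the odds "proportional".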

Mitul Tiwari

WSDM accepted papers related to social networks (January 23, 2012)

    WSDM 2012 list of accepted papers: http://wsdm2012.org/program/detailed.html.

    Also, here is a list of papers related to social networks: http://cs2socialnetworks.wordpress.com/2011/12/05/wsdm-2012-accepted-on-social-networks

    Looked at a few papers over the weekend.

• This paper, "Inferring Social Ties across Heterogeneous Networks" by Jie Tang, Tiancheng Lou and Jon Kleinberg, talks about mapping labels of connections from one social network to another.
• The first paper I have seen on daily deals, "Daily Deals: Prediction, Social Diffusion, and Reputational Ramifications", describes a detailed analysis of daily deals data from Groupon and LivingSocial and correlates it with Facebook/Twitter/Yelp data. Interestingly, they found that businesses that appear on daily deal sites gain more reviews on Yelp but lose 10% in average ratings.
Mitul Tiwari

Some notes from RecSys 2011 (December 30, 2011)

I attended RecSys 2011 in October this year. Here are the official proceedings: RecSys 2011 proceedings. The conference had many interesting papers, and the industrial track was also pretty good, with presentations from eBay, Facebook, Twitter, Pandora, and Netflix.

Mitul Tiwari

A talk by Clojure creator Rich Hickey (December 30, 2011)

Clojure creator Rich Hickey on the virtues of simplicity over easiness: http://www.infoq.com/presentations/Simple-Made-Easy

Mitul Tiwari

A blog post by @arsatiki on Bayesian A/B testing (December 30, 2011)

Nice blog post by @arsatiki on Bayesian A/B testing theory and code:

Mitul Tiwari

Supervised Random Walk for link prediction (May 16, 2011)

Last week in our reading group at LinkedIn we discussed "Supervised Random Walks: Predicting and Recommending Links in Social Networks" by L. Backstrom and J. Leskovec, which appeared at the ACM International Conference on Web Search and Data Mining (WSDM), 2011.

This is quite an interesting paper on the link prediction problem. Link prediction is a fundamental problem in graphs and social networks: predicting edges/links that don't exist currently but may appear in the future. Applications of link prediction include (1) the "People You May Know" feature in social networks such as LinkedIn or Facebook, (2) predicting links in protein-protein interaction networks, (3) suggesting links for bloggers to add, and (4) predicting collaboration.

In the past there have been two main approaches to link prediction in a graph: (1) classifying potential edges as positive or negative using various features from the graph; and (2) running a random walk on the graph to rank nodes by assigning a probability to each node.

In the first approach, various features, such as node features (e.g., age, gender), edge features (e.g., interaction, activity), and graph features (e.g., the number of common friends), are used to predict links. Past edges can be used as labeled training data to classify potential edges as positive or negative edges in the future.

In the second approach, random walks are used to assign probabilities to nodes, and nodes are ranked by these probabilities: the higher the rank, the more likely an edge between the starting node and that node. Random walks have been used to find important nodes in a graph (as in PageRank) and also to find similar nodes in a graph. Random walk with restarts is one example of this approach.

This paper talks about how to combine these two approaches by building a model to bias the random walk in a graph, and then using the random walk for link prediction. Node and edge features (e.g., the number of common friends) are used to learn the transition probabilities along edges. Then a random walk with restart is run to compute a probability score for each node, and nodes are ordered by these scores to find the top nodes that are likely to connect with the starting node in the future.
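A minimal sketch of the random walk with restart at the core of this approach; note that the paper's key contribution, learning the edge transition probabilities from node and edge features, is omitted here, and the graph is a toy.

```python
import numpy as np

def rwr_scores(adj, start, alpha=0.15, iters=100):
    """Visit probabilities of a walk that restarts at `start`
    with probability alpha at each step."""
    P = adj / adj.sum(axis=1, keepdims=True)  # row-normalized transitions
    p = np.zeros(adj.shape[0]); p[start] = 1.0
    restart = p.copy()
    for _ in range(iters):
        p = (1 - alpha) * (p @ P) + alpha * restart
    return p

# Toy undirected graph as an adjacency matrix
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
print(rwr_scores(adj, start=0).round(3))  # score candidates for node 0
```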

This is quite an interesting approach that does not need complicated feature engineering for graph features, but it is very compute-intensive, since the random walk takes multiple iterations of graph traversal; if the graph is huge (say, hundreds of millions of nodes), it will take a long time to run. In my experience, running the link prediction pipeline frequently is more important than a marginal improvement in link prediction accuracy, since an online social graph changes continuously.

Nevertheless, this paper brings very interesting ideas for combining the classification approach and the random walk approach to link prediction.

    References

• Supervised Random Walks: Predicting and Recommending Links in Social Networks, L. Backstrom and J. Leskovec. In Proceedings of the 4th ACM International Conference on Web Search and Data Mining (WSDM), 2011.
• The Link Prediction Problem for Social Networks, D. Liben-Nowell and J. Kleinberg. In Proceedings of the 12th International Conference on Information and Knowledge Management (CIKM), 2003.
• Predicting Positive and Negative Links in Online Social Networks, J. Leskovec, D. Huttenlocher and J. Kleinberg. In Proceedings of the 19th International World Wide Web Conference (WWW), 2010.
Mitul Tiwari

Summary of Haystack system paper from Facebook (April 29, 2011)

    This week in our reading group at LinkedIn we read and discussed this paper: Finding a needle in Haystack: Facebook's photo storage.

As the name suggests, the Haystack system was designed and built by Facebook to store and serve photos.

The basic goals of the Haystack system are as follows. First, high throughput and low latency: store all metadata in main memory and save on multiple disk I/Os; they achieve a 4x speedup over the earlier system by doing this. Second, fault tolerance: photos are replicated across multiple data centers. Third, cost-effectiveness: they claim to be 28% cheaper than the earlier system.

In the earlier NAS (Network Attached Storage) photo storage system, the main bottleneck was the 3 disk accesses needed to fetch each photo: one to fetch the location of the corresponding inode, a second to read the inode, and a third to fetch the file data.

The main idea of the Haystack system is the design of the Haystack store, which allows fetching a photo in a single disk access. In the Haystack store, multiple photos are kept in a single large file, so only the offset of the photo within the file and the size of the photo are needed to access it. This reduces the amount of metadata needed per photo, so all photo metadata can be kept in memory. Hence, no disk access is needed to fetch the metadata for a photo, and the photo itself can be fetched from disk in a single disk access.
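A minimal sketch of that single-disk-access idea: an in-memory index maps photo id to (offset, size) within one large store file; the names and layout are illustrative, not Haystack's actual format.

```python
index = {}  # photo_id -> (offset, size), kept entirely in RAM

def append_photo(store, photo_id, data):
    store.seek(0, 2)  # go to the end of the store file
    index[photo_id] = (store.tell(), len(data))
    store.write(data)

def read_photo(store, photo_id):
    offset, size = index[photo_id]  # memory lookup, no disk I/O
    store.seek(offset)
    return store.read(size)         # exactly one disk access

with open("store.dat", "w+b") as store:
    append_photo(store, 42, b"...jpeg bytes...")
    print(read_photo(store, 42))
```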

Overall, I was impressed by the scale of the Haystack photo storage system, which stores more than 20 petabytes in the form of 260 billion photos and serves millions of photos per second at peak.

    Reference:
    • D. Beaver, S. Kumar, H. C. Li, J. Sobel, and P. Vajgel. Finding a needle in Haystack: Facebook’s photo storage. In OSDI ’10
Mitul Tiwari

Effectiveness of Data (April 13, 2010)

I attended a talk by Dr. Peter Norvig last week at "The Analytics Revolution Conference" on "The Unreasonable Effectiveness of Data". He talked about how having a large amount of data changes the outcome: a simple algorithm with more data works much better than a complex algorithm with less data. He showed three examples to support his case: (1) spelling correction; (2) Google image search; and (3) Google translation.

(1) Spelling correction. Dr. Norvig talked about the underlying principles behind Google's spelling correction: how the frequency of terms in web data is used to generate possible suggestions and how to score each suggestion. For a given query w, Google computes the most likely spelling correction c. This is done by computing

argmax_c P(c|w) = argmax_c P(w|c) P(c) (after dropping P(w), which is the same for every candidate c given the query w)

where P(c) is estimated from the number of times c occurs in the web corpus, and P(w|c) is estimated from the edit distance between w and c.

So a simple frequency-based spelling correction algorithm performs well, and in practice works much better than a dictionary-based algorithm. It is also language independent, so it works for any language by just changing the input web data. For more details, see Peter Norvig's blog post on spelling correction.
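A compressed sketch in the spirit of Norvig's well-known corrector, scoring candidates by P(w|c) P(c) with one-edit candidates and corpus frequency; the toy corpus is made up.

```python
from collections import Counter

CORPUS = Counter("the quick brown fox the quick the".split())  # toy web corpus

def edits1(word):
    """All strings one edit away (deletes, transposes, replaces, inserts)."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    return set(
        [L + R[1:] for L, R in splits if R] +
        [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1] +
        [L + c + R[1:] for L, R in splits if R for c in letters] +
        [L + c + R for L, R in splits for c in letters])

def correct(word):
    # Prefer the word itself, then one-edit candidates (higher P(w|c)),
    # breaking ties by corpus frequency (P(c)).
    candidates = ({word} & CORPUS.keys()) or (edits1(word) & CORPUS.keys()) or {word}
    return max(candidates, key=lambda c: CORPUS[c])

print(correct("teh"))  # -> 'the'
```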


2. Image search. Dr. Norvig talked about how Google constructs a similarity graph of crawled images, and how, for a given query, dense clusters of images in the graph are likely to be the best quality images. Here also, a large number of images helps in improving results.


3. Google translation. Google translation has been cited by many Googlers as a major success of using large amounts of data. Dr. Norvig reiterated this point and talked about how simple co-occurrence of phrases from two different languages is used by Google translation to translate between the two languages.

Mitul Tiwari

Organizing the Web around Concepts (October 1, 2009)
During the initial days of the web, directories like Yahoo manually organized the web so that people could find relevant information. As the web grew in size and search engine technology evolved, search engines like Google became the main way to query the web. Today, the next wave is making web navigation easier by reorganizing the internet by topic or concept, and an increasingly meaningful web (which may lead to the Semantic Web) is being built around concepts, as in Freebase, Google Squared, DBLife, and Kosmix topic pages. At Kosmix, we are often asked about the technical philosophy driving this change. Here is a brief overview for the geeks among us.
To start with, what do we mean by concepts? A concept is loosely defined as a set of keywords of interest, for example, the name of a restaurant, a cuisine, an event, or the name of a movie. There are various websites tailored to a particular kind of concept, such as Yelp for restaurants (e.g., Amarin Thai), IMDB for movies (e.g., The Shawshank Redemption), LinkedIn for professional people, and Last.fm for music (e.g., U2).
Why should one care about organizing the web around concepts? There are three main kinds of web pages: search pages, topic/concept pages, and articles. Organizing the web around concepts can benefit each one of them.
Search pages. A search results page for a given query consists of various relevant links with snippets, for example, Google's search results page on "Erykah Badu". Web data organized around concepts can improve search results in two ways. First, a search page can show a set of concepts related to the query, and their relationships to the query; this helps in further refining the query and enables exploration of concepts related to it. Second, a search page can promote the concept-page result for a concept closely matching the query.
Concept/Topic pages. A topic page or concept page organizes information around a concept; for example, consider this music artist page on "Erykah Badu". Such pages can utilize attributes of the concept and show content related to the concept and its attributes, such as albums, music videos, song listings, album reviews, and concerts.
Articles. Articles can include semantic links to the concepts present in the article and promote exploration of those concepts, for example, this page on oil prices.
Given so many benefits of arranging the web around concepts, how can we achieve it? Some of the ways to arrange the web around concepts are as follows.
1. Editorial: An editor can pick a set of concepts of interest, create attributes for the concepts, and organize the data around them. Many sites like IMDB (for movies) have taken this approach. It yields high-quality content, but it is not scalable in terms of the number of concepts.
2. Community: Many sites such as Wikipedia and Yelp have taken this approach, in which a community of users picks concepts, creates the attributes of the concepts, and organizes the data around them. This process scales as the user community grows, but it is hard to build such a community, the approach is susceptible to spam, and its scale is limited. For example, Wikipedia has grown to millions of concepts with a large user base, but its size is still far from the scale of the web.
3. Algorithmic approach: One way to organize the web around concepts is to mine the web for concepts and their attributes, and to link data with concepts. This approach is the most promising in terms of scaling to the size of the web. The steps in this approach are (a) concept extraction, (b) relationship mining, and (c) linking data with concepts.
    (a) Concept Extraction. There are two main methods for concept extraction from web pages, site-specific and category-specific.
In the site-specific method, the structure or semantics of a site is used to extract concepts. Many web sites generate HTML pages from databases through a program, and such pages have similar structure. One can write site-specific rules, or wrappers, to extract interesting data from such web pages, but writing such wrappers is a labor-intensive task. Kushmerick et al. [1] proposed the wrapper induction technique to automatically learn wrapper procedures from samples of such web pages. Recent work by Dalvi et al. [2] extends the wrapper induction technique to dynamic web pages. Another site-specific method is to use natural language processing to understand the semantics of web pages and mine concepts from them.
In the category-specific method, web pages are classified into categories, such as restaurants, shopping, and movies, and category-specific extraction rules are applied. For example: extract the menu, reviews, cuisine, and location for restaurants; extract the price, reviews, and ratings for the shopping category; and extract the actors, director, and ratings for movies. This method is more scalable in terms of the number of web pages than the site-specific method, but it is slightly more error prone, since classification and extraction errors accumulate.
(b) Relationship mining. After extracting interesting concepts, one needs to match them with concepts in the database to create attributes, to grow concepts, and to find relationships between concepts. Some web databases like Freebase provide a substantial number of relationships between Wikipedia concepts.
(c) Linking data with concepts. As mentioned earlier, organizing the web around concepts can improve the experience on search pages, topic pages, and article pages by linking them with concepts.
The algorithmic approach to organizing the web around concepts is somewhat error prone, though it improves as the algorithms for each step improve. However, it is the most promising approach in terms of scaling to the enormous web that exists.
    In short, organizing the web around concepts is a promising area and a stepping stone to bring meaning behind the web data.
    References
    [1] N. Kushmerick, D. S. Weld, R. B. Doorenbos: Wrapper Induction for Information Extraction. IJCAI (1) 1997.
    [2] N. Dalvi, P. Bohannon, F. Sha: Robust web extraction: an approach based on a probabilistic tree-edit model. SIGMOD Conference 2009.
Mitul Tiwari

Deep Web: The Hidden Treasure (August 11, 2009)

    (Iceberg image ©Ralph A. Clevenger)

Experts estimate that search engines can access less than 1% of the data available on the Web, only the tip of the iceberg. Where is the rest of the Internet's data? It's lurking in the Deep Web.

    What is the Deep Web?

The Deep Web is defined as dynamically generated content hidden behind HTML forms that is inaccessible to search engines' crawlers. The Deep Web is also referred to as the hidden or invisible Web.

The Deep Web consists of three key elements: (1) pages and databases accessible only through HTML forms; (2) disconnected pages not accessible to crawlers; and (3) password-protected and subscription-only sites. Some people also include real-time Web data as part of the Deep Web, since it changes so fast that traditional search engines are not able to surface it in their results.

    How Vast is the Deep Web?

According to one study by Michael K. Bergman in 2000, the Deep Web accounted for 7,500 terabytes of data. At that time, search engines could index only tens of terabytes of data. By 2004, a subsequent study by Kevin Chang and his colleagues estimated that the Deep Web had grown to more than 30,000 terabytes of data. At this rate, one can only imagine how vast it must be today, particularly given the ubiquity of the Internet over the past five years. Such an enormous amount of data holds a huge wealth of information; the key is figuring out how to access it.

    Is it Possible to Access the Deep Web?

Absolutely, though it's not easy. There are two main approaches to accessing Deep Web data: run-time integration and off-line indexing.

In the run-time integration approach, one has to build a system that performs the following tasks: first, figure out the appropriate forms that are likely to have results for the given query terms; second, map the query terms suitably to search those forms and integrate the results from the various forms; and third, extract the relevant parts of the results to display. This approach enables a richer experience for users, and sites like Cazoodle.com seem to rely on this method.

But there are some drawbacks to run-time integration. It's extremely difficult to figure out the appropriate forms for the given query terms. In addition, mapping query terms to search those forms and extracting information from the results are highly labor-intensive tasks and not very scalable.

In the off-line indexing approach to accessing Deep Web data, one has to construct a set of queries to search through forms, process the queries through the forms off-line, and index the results. Once the query set is constructed, this approach can reuse the search engine infrastructure for crawling, indexing results, and index serving.

Google has taken this approach to surface Deep Web content. However, algorithmically constructing input values for forms is a non-trivial task. Furthermore, this approach cannot be applied to HTML forms that use the HTTP POST method, since the resulting URLs are all the same: the form inputs are part of the HTTP request rather than the URL.
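A small illustration of why the GET/POST distinction matters for off-line surfacing: with GET, each choice of form inputs yields a distinct, indexable URL; the endpoint and field names below are made up.

```python
from urllib.parse import urlencode

def surfaced_url(action, inputs):
    """Build the distinct URL a GET form submission would produce."""
    return action + "?" + urlencode(inputs)

for cuisine in ["thai", "italian"]:
    print(surfaced_url("http://example.com/restaurants", {"cuisine": cuisine}))
# http://example.com/restaurants?cuisine=thai
# http://example.com/restaurants?cuisine=italian
# A POST form submits to the same URL with inputs in the request body,
# so there is no distinct URL for a crawler to index.
```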

    The Kosmix Approach to the Deep Web

At Kosmix, we surface Deep Web content using a combination of the run-time integration and off-line indexing approaches. At the core of Kosmix's technology are (1) a sophisticated categorization engine that maps a query to the appropriate category; (2) a highly scalable fetching and run-time integration system that fetches data from various sources, integrates it, and provides a rich experience; and (3) an off-line crawling and indexing system that enables scalability.

For example, for a query like "Ravioli", we show nutritional values from Fatsecret.com. Our categorization technology identifies Ravioli as a food query and enables us to surface Deep Web content from Fatsecret.com.

    The Next Hurdle

While invaluable treasures are hiding in the Deep Web, there are significant challenges to reaching this information. The next step for search engines will be to find an easier way to tap into the Deep Web, and to keep up with the Real Time Web.

    My prediction? The Deep Web will force a drastic change in how traditional search engine systems are designed and built.


    References:

    1. Michael K. Bergman. White Paper: The Deep Web: Surfacing Hidden Value. http://brightplanet.com/index.php/white-papers/119.html, 2000.
    2. Kevin Chen-chuan Chang, Bin He, Chengkai Li, Mitesh Patel, and Zhen Zhang. Structured Databases on the Web: Observations and Implications. In SIGMOD 2004.
    3. Cazoodle. http://www.cazoodle.com/docs/Press_Kit.pdf, 2009.
    4. Jayant Madhavan, David Ko, Lucja Kot, Vignesh Ganapathy, Alex Rasmussen, and Alon Halevy. Google's Deep Web crawl. In VLDB 2008.
Mitul Tiwari

Untitled (December 9, 2008)

Kosmix launched a new beta-ish version today. Check out this post explaining what's new in the beta version: "Kosmix goes betaish".

We also announced $20 million in funding from Time Warner and others: "Kosmix adds rocketfuel".

How is Kosmix different from a search engine? Search engines treat the web as a collection of documents. In contrast, we treat the web as a collection of applications, bring together the best web applications relevant to the given query on one page, and present them in a two-dimensional layout. Our categorization technology helps figure out the category of the query and pick the best applications. Check out a few examples: Simpsons, San Francisco, Diabetes, Acura TSX.

Check out Anand's blog for more details.

Mitul Tiwari

iPhone 3G disassembly (July 24, 2008)

iPhone 3G disassembly: http://live.ifixit.com/Guide/First-Look/iPhone3G

Mitul Tiwari

PhD comic on F-1 students (July 12, 2008)

Life of F-1 students, PhD comics sums it all: F-1 students visa process :)

Mitul Tiwari

Hilarious video on Onion (June 5, 2008)

Check out this hilarious video: New Wearable Feedbags Let Americans Eat More, Move Less :) ...

Mitul Tiwari

My comments on "Why the World Needs a New Database System" (May 24, 2008)

Nice post by Anand on Datawocky:
Why the World Needs a New Database System! The post talks about the need for a new database system built on top of commodity hardware that can handle large amounts of data and provides an easy-to-use query language. On the solution front, Anand's post focuses mainly on Aster Data, although many other solutions are mentioned in the comments section.

There is definitely a need for a simple (SQL-like) query language to process large amounts of data distributed over commodity hardware.

I feel Hypertable, an open source version of Bigtable, may be a good way to address this problem. Hypertable uses a query language (HQL, the Hypertable Query Language) that is similar to SQL. Hypertable can be installed over a distributed file system like HDFS or KFS. Since these file systems are fault-tolerant and scale well with the number of nodes in the system, Hypertable is also fault-tolerant and scalable.

I wonder how the Aster Data solution fares with respect to Hypertable.

Mitul Tiwari

$200 barrel of oil? (May 7, 2008)

$200 barrel of oil at the end of the year?

And check this link: for $9995, we can make ethanol in our backyard. So the cycle of waste is complete: use energy to produce crops, use crops to produce energy/ethanol, and continue our energy-hungry lifestyle ...

Mitul Tiwari

Freezing Austin (January 18, 2007)

For the first time in the last 5 years, I have seen so much ice/snow in Austin, thanks to the ice storm. Many people, including me, have been under "house arrest" for the last couple of days. It's so new to Austin that I thought of posting some pics :). Check out more pics here.





Mitul Tiwari

Jack is back ... (January 17, 2007)

Another season (the 6th) of "24" has started with a bang. Four of the twenty-four hours have already passed with action-packed sequences, and Jack, as usual, is all over the place with his self-healing power.

A bit of trivia: if you get involved in a bar fight, and you have the option of choosing between Jack Bauer and Harry Potter for your help, whom would you pick? Of course, Jack Bauer. See this blog for an interesting argument.

    Check out this site for some interesting facts about Jack Bauer. Some of them are really funny. For example,
    • Jack Bauer never retreats, he just attacks in the opposite direction.
    • The city of Los Angeles once named a street after Jack Bauer in gratitude for his saving the city several times. They had to rename it after people kept dying when they tried to cross the street. No one crosses Jack Bauer and lives.
    • When Jack Bauer ran out of ammo, he caught 3 bullets in his chest and used them to reload.
    • Once, someone tried to tell Jack Bauer a "knock knock" joke. Jack Bauer found out who was there, who they worked for, and where the goddamned bomb was.
    • Jack Bauer doesn't need to eat, sleep, or use the bathroom because his organs are afraid of making him angry.
    • Superman is one of the few individuals who could possibly survive a confrontation with Jack Bauer. But that is only because he can fly away.
    • Jack Bauer once forgot where he put his keys. He then spent the next half-hour torturing himself until he gave up the location of the keys.
    • When Jack Bauer plays dodgeball, the ball dodges Jack Bauer.
    • ... :)
Mitul Tiwari

On mood swings (November 19, 2006)

I was feeling a bit low in the middle of the week. After wasting a few hours, I was still not alright. In an attempt to get over it, I started asking questions: why am I feeling low? What triggered the feeling? How should I cope with this? What do others do in this situation? And I started getting some answers once I posed the situation in the form of questions. It seems years of computer science problem-solving training have deeply affected my consciousness as well. Do I now need to deal with every situation by posing a relevant question? Not a bad option though :). The following blog post was a good read: http://health.yahoo.com/experts/yeahdave/5837/riders-on-the-storm

    I hope I can learn something from the following quotes.

    "You can't expect to prevent negative feelings altogether. And you can't expect to experience positive feelings all the time...The Law of Emotional Choice directs us to acknowledge our feelings but also to refuse to get stuck in the negative ones." -- Greg Anderson

    "It is common sense to take a method and try it. If it fails, admit it frankly and try another. But above all, try something." --Franklin D. Roosevelt]]>
    Mitul Tiwari
    tag:mitultiwari.posthaven.com,2013:Post/553772 2006-11-12T05:14:00Z 2013-10-08T17:19:59Z movie Borat I recently watched this movie Borat. While watching the movie, I was wondering what a Kazakh would think about the movie. And I was not surprised when I found that there is enough controversy about the movie that I didn't know about (check this out: Borat controversy). The movie is fictional, but I wonder how can the director/producer depict a country (and people from that country) in such a bad way, and also somewhat illogical. For example, the main character could learn driving in American highway and roads, but could not learn some basic things in the daily life.]]> Mitul Tiwari tag:mitultiwari.posthaven.com,2013:Post/553776 2006-04-16T04:44:00Z 2013-10-08T17:19:59Z Untitled


Dhoni, the official driver of the Indian team :).
    Mitul Tiwari