Mitul Tiwari's Space

WSDM accepted papers related to social networks

WSDM 2012 list of accepted papers: http://wsdm2012.org/program/detailed.html.

Also, here is a list of papers related to social networks: http://cs2socialnetworks.wordpress.com/2011/12/05/wsdm-2012-accepted-on-social-networks

Looked at a few papers over the weekend.

This paper, "Inferring Social Ties across Heterogenous Networks" by
Jie Tang, Tiancheng Lou and Jon Kleinberg, talks about mapping labels of connections from one social networks to another.
First paper I have seen on daily deals "Daily Deals: Prediction, Social Diffusion, and Reputational Ramifications" describes detailed analysis of daily deals data from Groupon and Living Socials and correlates the data with Facebook/Twitter/Yelp data. Interestingly that they found that businesses that appear on daily deals site gain more reviews on Yelp by lose average ratings by 10%.

Some notes from RecSys 2011

Attended RecSys 2011 in October this year. Here is the offcial proceedings: RecSys 2011 proceedings. Conference had many interesting papers and industrial track was also pretty good with presentations from Ebay, Facebook, Twitter, Pandora, and Netflix.

Xavier Amatrian posted a nice summary of the conference: Xavier's notes on RecSys 2011.
The best paper award went to a paper by Yehuda Koren and Joe Sill: OrdRec: An ordinal model for predicting personalized item rating distributions. This paper talks about using user feedback, which is given as ordinal ranks of items, in collaborative filtering techniques.
A short paper/poster by Jian Wang, Badrul Sarwar and Neel Sundaresan won the best poster award: Utilizing related products for post-purchase recommendation in e-commerce. This paper talks about post-purchase recommendations at Ebay. In this paper, author describes utilizing product categories, content and behavioral similarities, and temporal aspects (whether the item is available or not) for recommending products after purchase.
Daniel Tunkelang from LinkedIn gave a tutorial on Recommendations as a Conversation with the user.
Workshop on Personalization in Mobile Applications was quite interesting. Christine Outram from MIT gave a talk on "The Copenhagen Wheel" during the workshop. Janne Lindqvist from Rutgers University gave an interesting talk on Foursquare "I'm the Mayor of My House". Hyun-Jun Kim from Samsung gave a talk on "EmoSens: Affective Entity Scoring, A Novel Service Recommendation Framework for Mobile Platform", which was about recognizing emotions from smart phone touch, accelerometer, gyroscope sensors and recommendations based on that.
Facebook's Rajat Raina gave a talk on "Learning Relevance in a Large-scale Social Network". He talked about how relevant ads are shown to the users. Facebook uses Open Directory concepts to map users and ads into concept space and do ad matching.
Neel Sundaresan from Ebay gave an interesting industrial keynote. Here are notes from Siah: http://openresearch.wordpress.com/2011/10/26/notes-from-neel-sundaresans-keynote-speech-at-recsys-2011/
Twitter's Pankaj Gupta gave a talk on "Recommendations and Discovery at Twitter". He talked about edge predictions used in "Who to Follow". Twitter uses features such as personalized pagerank, the most-common neighbors, top followings, etc. to generate "Who to Follow" recommendations. Interestingly less than 50% of twitter follows come from 2nd degree edges.
Oscar Celma (BMAT) and Paul Lamere (The Echo Nest) gave tutorial on Music Recommendation and Discovery Revisited
Workshop on Recommender Systems and the Social Web was also very interesting.
Here are slides from some of the presentations: http://www.slideshare.net/tag/recsys2011
Tutorial slides are available here: RecSys 2011 tutorials

A talk by Clojure creater Rich Hickey

Clojure creater Rich Hickey on virtues of simplicity over easiness: http://www.infoq.com/presentations/Simple-Made-Easy

A blog post by @arsatiki on Bayesian A/B testing

Nice blog post by @arsatiki on Bayesian A/B testing theory and code: http://arsatiki.posterous.com/bayesian-ab-testing-with-theory-and-code

Supervised Random Walk for link prediction

Last week in our reading group at LinkedIn we discussed "Supervised Random Walks: Predicting and Recommending Links in Social Networks" by L. Backstrom and J. Leskovec, which appeared in ACM Internation Conference on Web Search and Data Mining (WSDM), 2011.

This is quite interesting paper that talks about link prediction problem. Link prediction is a fundamental problem in a graph or social networks to predicts edges/links that don't exist currently but may appear in future. Applications of link prediction includes (1) "People You May Know" feature in social networks such as LinkedIn or Facebook, (2) predict links in protein-protein interaction, (3) suggest links to bloggers to add, (4) predict collaboration, etc.

In the past there has been two main approaches to link prediction in a graph: (1) classification potential edges as positive or negative using various features from the graph; and (2) random walk on the graph to rank nodes by assigning probability to each node.

In the first approach, various features such as node (e.g., age, gender), edge (e.g., interaction, activity), graph (e.g., the number of common friends) are used to predict links. Past edges can be used as labeled training data to classify potential edges as positive or negative edges in the future.

In the second approach, random walks are used to assign probabilities to nodes, and rank nodes based on these probabilities. Higher the rank, more likely it is to be an edge between the starting node and that node. Random walks have been used to find out important nodes in a graph (like Pagerank) and also used for finding similar nodes in a graph. Random walk with restarts is one example of this approach.

This paper talks about how to combine these two approaches into one by building a model to bias a random walk in a graph, and use the random walk for link prediction. Node and edge (e.g., the number of common friends) features are used to learn transition probabilities between edges. And then random walk with restart is run to find out a probability score for each node. Nodes are ordered by these probability score to find out the top nodes that are likely to be connected with the starting node in the future.

This is quite interesting approach that does not need complicated feature engineering to find out graph features, but this approach is very compute intensive since random walk takes multiple iterations of graph traversal, and if the graph is huge (say 100s of millions of nodes) then it will take a long time to run. In my experience, frequently running link prediction pipeline is more important then marginal improvement in link prediction accuracy since online social graph changes continuously.

Nevertheless, this paper brings very interesting ideas to combine the classification approach and the random walk approach for link prediction.

References

Supervised Random Walks: Predicting and Recommending Links in Social Networks, L. Backstrom and J. Leskovec. In Proceedings of the 4th ACM international conference on Web search and data mining.
Link Prediction Problem for Social Networks, D. Liben-Nowell and J. Kleinberg. In Proceedings of 12th International Conference on Information and Knowledge Management (CIKM), 2003.
Predicting Positive and Negative Links in Online Social Networks, J. Leskovec, D. Huttenlocher and J. Kleinberg. In Proceedings of 19th International World Wide Web Conference (WWW), 2010.

Summary of Haystack system paper from Facebook

This week in our reading group at LinkedIn we read and discussed this paper: Finding a needle in Haystack: Facebook's photo storage.

As the name suggests the Haystack system is designed and built by Facebook guys to store and serve photos.

The basic goals of Haystack system are as follows: first, high throughput and low latency, store all metadata in main memory, and save on multiple disk I/Os. They are able to achieve 4x speedup than the earlier system by doing this. Second, fault-tolerance. Replicate photos across multiple data centers. Third, cost-effectiveness. They claim to be 28% cheaper than the earlier system.

In the earlier NAS (Network Attached Storage) photo storage system, the main bottleneck was 3 disk accesses for fetching each photo: one for fetching the location of corresponding inode, second for reading inode, and third for fetching the file data.

The main idea of the Haystack system is the design of Haystack store that lead to fetching of a photo in a single disk access. In the Haystack store, they keep multiple photos in a single large file. They need only the offset of the photo in the file and size of the photo to access a photo. This way they are able to reduce the amount of metadata needed to access a photo, and keep all metadata for photos in memory. Hence, no disk access is needed to fetch the metadata for a photo and the photo can be fetched from the disk in a single disk access.

Overall, impressed by the scale of Haystack photo storage system that stores more than 20 Petabytes in form of 260 billions of photos and serves millions of photos per second at the peak.

Reference:

D. Beaver, S. Kumar, H. C. Li, J. Sobel, and P. Vajgel. Finding a needle in Haystack: Facebook’s photo storage. In OSDI ’10

Effectiveness of data

I attended a talk by Dr. Peter Norwig last week at "The Analytics Revolution Conference" on "The Unreasonable Effectiveness of Data". He talked about how having large amount of data changes the results/outcome, how simpler algorithm works much better with more data than a complex algorithm with less data. He showed three examples to support his case: (1) Spelling correction; (2) Google image search; and (3) Google translation.

(1) Spelling correction. Dr. Norwig talked about underlying principles behind Google's spelling correction. He talked about how frequency of terms in web data is used to figure out possible suggestions and how to score each of the suggestion. For a given query w, Google computes the most likely spelling correction for the query. This is done by computing

argmax_c P(c|w) = argmax_c P(w|c) P(c) (after ignoring P(w) which is same for the given query w)

where,

P(c) is computed using the number of times c occurs in the web corpus. P(w|c) is computed using edit distance between w and c.

So a simple frequency based spelling correction algorithm performs well and in practice works much better than dictionary based algorithm. Also, it is language independent so it works for any language by just changing the input web data. For more details, see Peter Norwig's blog post on spelling correction:

2. Image search. Dr. Norwig talked about how Google constructs similarity graph of crawled images, and how for a given query, dense clusters of images in the graph are likely to be the best quality images. Here also large numbers of images help in improving result.

3. Google translation. Google translation has been sited by many Googlers as a major success of using large amount of data. Dr. Norwig reiterated this point, and talked about how simple co-occurrence of phrases from two different languages is used by Google translation to translate between the two languages.

Organizing the Web around Concepts

During the initial days of the web, directories like Yahoo manually organized the web to find the relevant information. As web grew in size and search engine technology evolved, search engines like Google became the main source to query the web. Today, we see the next wave is making the web navigation easier by reorganizing the internet by topic or concept, and increasingly meaningful web (which may lead to Semantic Web) is being built around concepts such as Freebase, Google Squared, DBLife, and Kosmix topic pages. At Kosmix, we often ask the technical philosophy driving this change. Here is a brief overview for the geeks among us.
To start with, what do we mean by concepts? A concept is loosely defined as a set of keywords of interest, for example, the name of a restaurant, cuisine, event, name of a movie, etc. There are various websites tailored to a particular kind of concept such as Yelp for restaurants (e.g., Amarin Thai), IMDB for movies (e.g., The Shawshank Redemption), LinkedIn for professional people, Last.fm for music (e.g., U2), etc.
Why should one care about organizing the web around concepts? There are three main kinds of web pages: search pages, topic/concept pages, and articles. Organizing the web around concepts can benefit each one of them.
Search pages. A search results page for a given query consists of various relevant links with snippets, for example, Google search results pages on "Erykah Badu". Web data around concepts can improve search results in two ways. First, a search page can show a bunch of concepts related to the query, and their relationships to the query. This will help in further refining the query, and enable exploration of concepts related to the query. Second, a search page can promote the concept page result for a concept closely matching the query.
Concept/Topic pages. A topic page or concept page organizes information around a concept, for example consider this music artist page on "Erykah Badu". Such pages can utilize attributes of concepts, and show content related to the concept and its attributes, such as, albums, music videos, songs listing, album reviews, concerts, etc.
Articles. Articles can put semantic links to the concepts present in the article, and promote exploration of concepts present in the article, for example, this page on oil prices.
Given so many benefits of arranging the web around concepts, how can we achieve that? Some of the ways to arrange the web around concepts are as follows.
1. Editorial: An editor can pick a set of interested concepts, create attributes of the concepts, and organize the data around the concepts. Many sites like IMDB (for movies) have taken this approach. This approach gives high quality content but it's not scalable in terms of the number of concepts.
2. Community: Many sites such as Wikipedia and Yelp have taken this approach in which a community of users picks concepts, creates the attributes of the concepts, and organizes the data around the concepts. This process scales as the user community grows, but it is hard to build such community, this approach is susceptible to spam, and scale is limited. For example, Wikipedia has grown to millions of concepts with such a large user base, but it size is still far from the scale of the web.
3. Algorithmic approach: One way to organize the web around concepts is to mine the web for concepts and their attributes, and link data with concepts. This approach is the most promising in terms of scaling to the size of the web. Various steps in this approach are (a) Concept Extraction, (b) Relationship mining, and (c) Linking data with concepts.
(a) Concept Extraction. There are two main methods for concept extraction from web pages, site-specific and category-specific.
In the site specific method, the structure or semantics of a site is used to extract concepts. Many web sites generate HTML pages from the databases through a program, and such pages have similar structure. One can write site specific rules or wrappers to extract interesting data from such web pages, but writing such wrappers is labor intensive task. Kushmerick et. al. have proposed wrapper induction technique to automatically learn wrapper procedures based upon samples of such web pages. A recent work by Dalvi et. al. extends the wrapper induction technique to dynamic web pages. Another site specific method is to use natural language processing to understand semantic of web pages and to mine concepts from web pages.
In the category specific method, web pages are classified into categories, such as, restaurants, shopping, movies, etc., and category specific extraction rules are applied. For example, extract menu, reviews, cuisine, location for restaurants; extract price, reviews, ratings for shopping category; and extract actors, director, ratings for movies. This method is more scalable in terms of the number of web pages compared to the site specific method, but slightly more error prone since classification and extraction errors accumulate.
(b) Relationship mining. After extracting interesting concepts, one needs to match them with concepts in the database to create attributes, to grow concepts, and to find relationships between concepts. Some web databases like Freebase provide substantial amount of relationships between Wikipedia concepts.
(c) Linking data with concepts. As mentioned earlier, organizing web around concepts can benefit experience with search pages, topic pages, and article pages by linking them with concepts.
The algorithmic approach to organizing the web around concepts is somewhat error prone, though it improves as algorithms for a particular step improves. However, it is most promising in terms of scaling to enormous web that exists.
In short, organizing the web around concepts is a promising area and a stepping stone to bring meaning behind the web data.
References
[1] N. Kushmerick, D. S. Weld, R. B. Doorenbos: Wrapper Induction for Information Extraction. IJCAI (1) 1997.
[2] N. Dalvi, P. Bohannon, F. Sha: Robust web extraction: an approach based on a probabilistic tree-edit model. SIGMOD Conference 2009.

Deep Web: The Hidden Treasure

(Iceberg image ©Ralph A. Clevenger)

Experts estimate that search engines can access less than 1% of the data available on the Web, only the tip of the iceberg. Where is the rest of Internet’s data? It’s lurking in the Deep Web.

What is the Deep Web?

The Deep Web is defined as dynamically generated content hidden behind HTML forms that is inaccessible for search engines' crawlers. Deep Web is also referred as the hidden or invisible Web.

The Deep Web consists of three key elements: (1) pages and databases accessible only through HTML forms; (2) disconnected pages not accessible to crawlers; and (3) password protected and subscription only sites. Some people also include real time Web data as a part of the Deep Web, since it's changing so fast that traditional search engines are not able to surface it in their results.

How Vast is the Deep Web?

According to one study by Michael K. Bergman in 2000, the Deep Web accounted for 7,500 terabytes of data. At that time, search engines could index only 10s of terabytes of data. By 2004, a subsequent study by Kevin Chang and his colleagues estimated that the Deep Web had grown to more than 30,000 terabytes of data. At this rate, one can only imagine how vast it must be today, particularly given the ubiquity of the Internet over the past five years. Such
an enormous amount of data has huge wealth of information—the key is figuring out how to access it.

Is it Possible to Access the Deep Web?

Absolutely-though it’s not easy. There are two main approaches to accessing Deep Web data: run-time integration, and off-line indexing.

In the run-time integration approach, one has to build a system that performs the following tasks: first, figure out the appropriate forms that are likely to have results for the given query terms; second, map the query terms suitably to search those forms and integrate the results from various forms; and third, extract relevant parts of results to display. This approach enables richer experience for users, and sites like Cazoodle.com seems to rely on this method.

But there are some drawbacks to run-time integration. It’s extremely difficult to figure out appropriate forms for the given query terms. In addition, mapping query terms to search those forms and extracting information from the results is highly labor-intensive tasks and not very scalable.

In the off-line indexing approach to access Deep Web data, one has to construct a set of queries to search through forms, process the queries through forms while off-line, and index the result. Once the query set is constructed, this approach can reuse the search engine infrastructure for crawling, indexing results, and index serving.

Google has taken this approach to surface Deep Web content. However, algorithmically constructing input values for forms is a non-trivial task. Furthermore, this approach cannot be applied to HTML forms that use HTTP POST method, since the resulting URLs are the same, and form inputs are part of HTTP request rather then the URL.

The Kosmix Approach to the Deep Web

At Kosmix, we surface Deep Web content by using a combination of run-time integration and off-line indexing approaches. At the core of Kosmix technologies are (1) a sophisticated categorization engine that enables mapping of query to appropriate category; (2) a highly scalable fetching and run time integration system to fetch data from various sources, integrate, and provide rich experience; and (3) an off-line crawling and indexing systems that enables
scalability.

For example, for a query like "Ravioli", we show nutritional values from Fatsecret.com. Our categorization technology enables us to identify Ravioli as a food query, and enables us to surface Deep Web content from Fatsecret.com.

The Next Hurdle

While invaluable treasures are hiding behind the Deep Web, there are significant challenges to solving the problem of reaching this information. The next step for search engines will be to find an easier way to tap into the Deep Web, and to keep up with the Real Time Web.

My prediction? The Deep Web will force a drastic change in how traditional search engine systems are designed and built.

References:

Michael K. Bergman. White Paper: The Deep Web: Surfacing Hidden Value. http://brightplanet.com/index.php/white-papers/119.html, 2000.
Kevin Chen-chuan Chang, Bin He, Chengkai Li, Mitesh Patel, and Zhen Zhang. Structured Databases on the Web: Observations and Implications. In SIGMOD 2004.
Cazoodle. http://www.cazoodle.com/docs/Press_Kit.pdf, 2009.
Jayant Madhavan, David Ko, Lucja Kot, Vignesh Ganapathy, Alex Rasmussen, and Alon Halevy. Google's Deep Web crawl. In VLDB 2008.

Untitled

Kosmix launched a new beta-ish version today. Check out this post explaining what's new in the beta version: "Kosmix goes betaish".

We also announced $20 million funding from Time-warner and others: "Kosmix adds rocketfuel".

How is Kosmix different from a search engine? Search engines treat web as a collection of documents. In contrast, we treat web as a collection of applications, and bring together the best of the web applications relevant to the given query on one page, and present them in a two dimensional layout. Our categorization technology helps in figuring out the category of the query, and picking the best applications. Check out a few examples: Simpsons, San Francisco, Diabetes, Acura TSX.

Check out Anand's blog for more details.