Effectiveness of data

I attended a talk by Dr. Peter Norvig last week at "The Analytics Revolution Conference" on "The Unreasonable Effectiveness of Data". He talked about how having large amounts of data changes the results, and how a simple algorithm with more data often works much better than a complex algorithm with less data. He showed three examples to support his case: (1) spelling correction; (2) Google image search; and (3) Google translation.

(1) Spelling correction. Dr. Norvig talked about the underlying principles behind Google's spelling correction: how the frequency of terms in web data is used to generate possible suggestions, and how each suggestion is scored. For a given query w, Google computes the most likely spelling correction c for the query. This is done by computing

argmax_c P(c|w) = argmax_c P(w|c) P(c)    (the P(w) term is dropped since it is the same for every candidate c given the query w)

where,

P(c) is estimated from the number of times c occurs in the web corpus, and P(w|c) is estimated from the edit distance between w and c.

So a simple frequency-based spelling correction algorithm performs well and in practice works much better than a dictionary-based algorithm. It is also language independent, so it works for any language just by changing the input web data. For more details, see Peter Norvig's blog post on spelling correction.
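To make this concrete, here is a minimal Python sketch of the frequency-based approach, along the lines of Norvig's blog post. The corpus file name (corpus.txt) is hypothetical; its word counts stand in for P(c), and preferring candidates within one or two edits of the query is a crude stand-in for P(w|c).

import re
from collections import Counter

# Word counts from a large plain-text corpus (file name is hypothetical);
# the bigger the corpus, the better P(c) approximates real usage.
WORDS = Counter(re.findall(r'[a-z]+', open('corpus.txt').read().lower()))
TOTAL = sum(WORDS.values())

def P(c):
    # P(c): relative frequency of candidate c in the corpus.
    return WORDS[c] / TOTAL

def edits1(word):
    # All strings one edit away from word (delete, transpose, replace, insert).
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    # Keep only candidates that actually appear in the corpus.
    return {w for w in words if w in WORDS}

def correction(w):
    # Prefer smaller edit distance (a rough proxy for P(w|c)), then pick the
    # candidate with the highest corpus frequency P(c).
    candidates = (known([w]) or known(edits1(w)) or
                  known(e2 for e1 in edits1(w) for e2 in edits1(e1)) or [w])
    return max(candidates, key=P)

For example, correction('speling') would return 'spelling' as long as 'spelling' is frequent in the corpus.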


(2) Image search. Dr. Norvig talked about how Google constructs a similarity graph over crawled images, and how, for a given query, dense clusters of images in that graph are likely to be the best-quality results. Here too, a large number of images helps improve the results.
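The talk did not go into implementation details, so the following is only my rough sketch of the general idea, loosely in the spirit of PageRank over an image similarity graph, not Google's actual pipeline. It assumes precomputed feature vectors for the images (how those features are extracted is out of scope) and ranks images by how much random-walk weight they accumulate, so that images inside dense, mutually similar clusters come out on top.

import numpy as np

def rank_images(features, sim_threshold=0.8, damping=0.85, iters=50):
    # Rank images by centrality in a similarity graph built from their
    # feature vectors; images in dense clusters score highest.
    features = np.asarray(features, dtype=float)
    n = len(features)

    # Cosine similarity between every pair of images.
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)

    # Keep only edges above the threshold, then row-normalize into a
    # random-walk transition matrix over the graph.
    adj = np.where(sim >= sim_threshold, sim, 0.0)
    row_sums = adj.sum(axis=1, keepdims=True)
    trans = np.divide(adj, row_sums, out=np.zeros_like(adj), where=row_sums > 0)

    # PageRank-style power iteration: weight concentrates in dense clusters.
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):
        scores = (1 - damping) / n + damping * (trans.T @ scores)

    return np.argsort(-scores)  # image indices, best candidates first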


(3) Google translation. Google translation has been cited by many Googlers as a major success of using large amounts of data. Dr. Norvig reiterated this point and talked about how simple co-occurrence statistics between phrases in two different languages are used by Google translation to translate between them.
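As a toy illustration of the co-occurrence idea (my own sketch, not Google's system): given a hypothetical sentence-aligned parallel corpus, count how often each source-language phrase co-occurs with each target-language phrase and normalize the counts. With enough data, genuine translations tend to dominate, although a real system also needs word alignment and a target-language model on top of raw co-occurrence.

from collections import Counter, defaultdict

def build_phrase_table(parallel_pairs, max_len=3):
    # Crude phrase co-occurrence scores from sentence-aligned text: every
    # source n-gram is counted as co-occurring with every target n-gram in
    # the same sentence pair, then counts are normalized per source phrase.
    def ngrams(sentence):
        words = sentence.lower().split()
        return {' '.join(words[i:i + n])
                for n in range(1, max_len + 1)
                for i in range(len(words) - n + 1)}

    cooc = defaultdict(Counter)              # cooc[src][tgt] = joint count
    for src, tgt in parallel_pairs:
        for s in ngrams(src):
            cooc[s].update(ngrams(tgt))

    table = {}
    for s, targets in cooc.items():
        total = sum(targets.values())
        table[s] = {t: count / total for t, count in targets.items()}
    return table

# Toy usage (hypothetical two-sentence "corpus"): inspect the top-scoring
# candidate translations for the source phrase 'maison'.
pairs = [("la maison bleue", "the blue house"),
         ("la maison", "the house")]
table = build_phrase_table(pairs)
print(sorted(table["maison"].items(), key=lambda kv: -kv[1])[:3])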