Organizing the Web around Concepts

During the initial days of the web, directories like Yahoo manually organized the web to help people find relevant information. As the web grew in size and search engine technology evolved, search engines like Google became the main way to query the web. Today, the next wave is making web navigation easier by reorganizing the internet around topics or concepts, and an increasingly meaningful web (which may eventually lead to the Semantic Web) is being built around concepts by efforts such as Freebase, Google Squared, DBLife, and Kosmix topic pages. At Kosmix, we are often asked about the technical philosophy driving this change. Here is a brief overview for the geeks among us.
To start with, what do we mean by concepts? A concept is loosely defined as a set of keywords of interest, for example, the name of a restaurant, a cuisine, an event, the name of a movie, etc. Various websites are tailored to a particular kind of concept, such as Yelp for restaurants (e.g., Amarin Thai), IMDB for movies (e.g., The Shawshank Redemption), LinkedIn for professionals, Last.fm for music (e.g., U2), etc.
Why should one care about organizing the web around concepts? There are three main kinds of web pages: search pages, topic/concept pages, and articles. Organizing the web around concepts can benefit each one of them.
Search pages. A search results page for a given query consists of various relevant links with snippets, for example, the Google search results page for "Erykah Badu". Web data organized around concepts can improve search results in two ways. First, a search page can show a set of concepts related to the query, along with their relationships to the query. This helps in further refining the query and enables exploration of concepts related to the query. Second, a search page can promote the concept page for a concept that closely matches the query.
Concept/Topic pages. A topic page or concept page organizes information around a concept; for example, consider this music artist page on "Erykah Badu". Such pages can utilize attributes of concepts and show content related to the concept and its attributes, such as albums, music videos, song listings, album reviews, concerts, etc.
Articles. Articles can include semantic links to the concepts they mention, promoting exploration of those concepts; see, for example, this page on oil prices.
Given so many benefits of organizing the web around concepts, how can we achieve it? Here are some of the ways.
1. Editorial: An editor picks a set of concepts of interest, creates attributes for those concepts, and organizes the data around them. Many sites like IMDB (for movies) have taken this approach. It yields high-quality content, but it is not scalable in terms of the number of concepts.
2. Community: Many sites such as Wikipedia and Yelp have taken this approach, in which a community of users picks concepts, creates their attributes, and organizes the data around them. This process scales as the user community grows, but it is hard to build such a community, the approach is susceptible to spam, and the scale is still limited. For example, Wikipedia has grown to millions of concepts with a very large user base, but its size is still far from the scale of the web.
3. Algorithmic: One way to organize the web around concepts is to mine the web for concepts and their attributes, and then link data with those concepts. This approach is the most promising in terms of scaling to the size of the web. Its main steps are (a) Concept Extraction, (b) Relationship Mining, and (c) Linking Data with Concepts.
(a) Concept Extraction. There are two main methods for concept extraction from web pages, site-specific and category-specific.
In the site-specific method, the structure or semantics of a site is used to extract concepts. Many websites generate HTML pages from databases through a program, so such pages share a similar structure. One can write site-specific rules, or wrappers, to extract interesting data from these pages, but writing such wrappers is a labor-intensive task. Kushmerick et al. [1] proposed wrapper induction to automatically learn wrapper procedures from samples of such pages. Recent work by Dalvi et al. [2] extends wrapper induction to dynamic web pages. Another site-specific method is to use natural language processing to understand the semantics of web pages and mine concepts from them.
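To make this concrete, here is a minimal sketch of a hand-written site-specific wrapper in Python. It assumes a hypothetical restaurant site whose detail pages all follow the same template; the markup, field names, and patterns below are illustrative, not taken from any real site.
```python
import re

# Hypothetical page template: a restaurant listing site that renders every
# detail page from the same database-backed template. Because the structure
# is fixed, a few hand-written patterns (a "wrapper") can pull out the fields.
SAMPLE_PAGE = """
<h1 class="biz-name">Amarin Thai</h1>
<span class="cuisine">Thai</span>
<div class="address">174 W Santa Clara St, San Jose, CA</div>
<span class="rating">4.0</span>
"""

# Site-specific extraction rules: one regex per attribute of the concept.
WRAPPER_RULES = {
    "name":    r'<h1 class="biz-name">(.*?)</h1>',
    "cuisine": r'<span class="cuisine">(.*?)</span>',
    "address": r'<div class="address">(.*?)</div>',
    "rating":  r'<span class="rating">(.*?)</span>',
}

def extract_concept(html, rules):
    """Apply a site-specific wrapper to one page and return a concept record."""
    record = {}
    for attribute, pattern in rules.items():
        match = re.search(pattern, html, re.DOTALL)
        if match:
            record[attribute] = match.group(1).strip()
    return record

print(extract_concept(SAMPLE_PAGE, WRAPPER_RULES))
# {'name': 'Amarin Thai', 'cuisine': 'Thai', 'address': '174 W Santa Clara St, San Jose, CA', 'rating': '4.0'}
```
Wrapper induction automates exactly this step: instead of writing the rules by hand, the system learns them from a handful of labeled sample pages.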
In the category-specific method, web pages are classified into categories, such as restaurants, shopping, movies, etc., and category-specific extraction rules are applied. For example, extract the menu, reviews, cuisine, and location for restaurants; price, reviews, and ratings for the shopping category; and actors, director, and ratings for movies. This method is more scalable in terms of the number of web pages than the site-specific method, but it is somewhat more error prone, since classification and extraction errors accumulate.
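A toy classify-then-extract pipeline might look like the sketch below. The keyword lists, categories, and field names are illustrative assumptions, not a production classifier.
```python
# Classify the page from simple keyword evidence, then dispatch to that
# category's field list. All keyword sets and field names are placeholders.
CATEGORY_KEYWORDS = {
    "restaurant": {"menu", "cuisine", "reservations"},
    "movie":      {"director", "cast", "showtimes"},
    "shopping":   {"price", "add to cart", "in stock"},
}

CATEGORY_FIELDS = {
    "restaurant": ["menu", "reviews", "cuisine", "location"],
    "movie":      ["actors", "director", "ratings"],
    "shopping":   ["price", "reviews", "ratings"],
}

def classify(page_text):
    """Pick the category whose keywords appear most often in the page."""
    text = page_text.lower()
    scores = {
        category: sum(text.count(word) for word in words)
        for category, words in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def fields_to_extract(page_text):
    """Classify the page, then return the category-specific fields to extract."""
    category = classify(page_text)
    return (category, CATEGORY_FIELDS.get(category, []))

print(fields_to_extract("Authentic Thai cuisine. See our menu and make reservations."))
# ('restaurant', ['menu', 'reviews', 'cuisine', 'location'])
```
The accumulation of errors is visible even here: if the classifier picks the wrong category, every downstream extraction rule is applied to the wrong kind of page.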
(b) Relationship mining. After extracting interesting concepts, one needs to match them with concepts already in the database in order to create attributes, grow concepts, and find relationships between concepts. Some web databases, like Freebase, provide a substantial number of relationships between Wikipedia concepts.
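Here is a tiny sketch of the matching step, assuming a Freebase-style store of (subject, predicate, object) triples held in memory; the triples and the normalization are deliberately simplistic.
```python
# A toy concept store in the spirit of Freebase-style triples.
TRIPLES = [
    ("Erykah Badu", "genre", "Neo soul"),
    ("Erykah Badu", "album", "Baduizm"),
    ("Baduizm", "released", "1997"),
]

def normalize(name):
    """Crude name normalization so 'erykah  badu' matches 'Erykah Badu'."""
    return " ".join(name.lower().split())

def match_and_relate(extracted_name, triples):
    """Match an extracted concept to the store and collect its relationships."""
    key = normalize(extracted_name)
    related = [(p, o) for s, p, o in triples if normalize(s) == key]
    return related or None

print(match_and_relate("erykah  badu", TRIPLES))
# [('genre', 'Neo soul'), ('album', 'Baduizm')]
```
In practice, matching requires much more than string normalization (aliases, disambiguation, context), but the shape of the problem is the same.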
(c) Linking data with concepts. As mentioned earlier, organizing the web around concepts can improve the experience on search pages, topic pages, and article pages by linking them with concepts.
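As a rough illustration of the article case, here is a sketch that wraps the first mention of each known concept in a link to a topic page. The concept dictionary and URLs are hypothetical.
```python
import re

# A tiny dictionary of known concepts and (hypothetical) topic-page URLs.
CONCEPT_PAGES = {
    "oil prices":  "https://example.com/topic/oil-prices",
    "Erykah Badu": "https://example.com/topic/erykah-badu",
}

def link_concepts(article_text, concept_pages):
    """Wrap the first mention of each known concept in an HTML link."""
    linked = article_text
    for concept, url in concept_pages.items():
        pattern = re.compile(re.escape(concept), re.IGNORECASE)
        linked = pattern.sub(f'<a href="{url}">{concept}</a>', linked, count=1)
    return linked

print(link_concepts("Rising oil prices dominated the news this week.", CONCEPT_PAGES))
```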
The algorithmic approach to organizing the web around concepts is somewhat error prone, though it improves as the algorithms for each step improve. However, it is the most promising approach for scaling to the enormous web that exists today.
In short, organizing the web around concepts is a promising area and a stepping stone toward bringing meaning to web data.
References
[1] N. Kushmerick, D. S. Weld, R. B. Doorenbos: Wrapper Induction for Information Extraction. IJCAI (1) 1997.
[2] N. Dalvi, P. Bohannon, F. Sha: Robust web extraction: an approach based on a probabilistic tree-edit model. SIGMOD Conference 2009.

Deep Web: The Hidden Treasure


(Iceberg image ©Ralph A. Clevenger)

Experts estimate that search engines can access less than 1% of the data available on the Web, only the tip of the iceberg. Where is the rest of the Internet's data? It's lurking in the Deep Web.

What is the Deep Web?

The Deep Web is defined as dynamically generated content, hidden behind HTML forms, that is inaccessible to search engine crawlers. The Deep Web is also referred to as the hidden or invisible Web.

The Deep Web consists of three key elements: (1) pages and databases accessible only through HTML forms; (2) disconnected pages not reachable by crawlers; and (3) password-protected and subscription-only sites. Some people also include real-time Web data as part of the Deep Web, since it changes so fast that traditional search engines are not able to surface it in their results.

How Vast is the Deep Web?

According to one study by Michael K. Bergman in 2000, the Deep Web accounted for 7,500 terabytes of data. At that time, search engines could index only tens of terabytes of data. By 2004, a subsequent study by Kevin Chang and his colleagues estimated that the Deep Web had grown to more than 30,000 terabytes of data. At this rate, one can only imagine how vast it must be today, particularly given the ubiquity of the Internet over the past five years. Such an enormous amount of data holds a huge wealth of information; the key is figuring out how to access it.

Is it Possible to Access the Deep Web?

Absolutely, though it's not easy. There are two main approaches to accessing Deep Web data: run-time integration and off-line indexing.

In the run-time integration approach, one builds a system that performs the following tasks: first, figure out the appropriate forms that are likely to have results for the given query terms; second, map the query terms onto those forms and integrate the results from the various forms; and third, extract the relevant parts of the results to display. This approach enables a richer experience for users, and sites like Cazoodle.com seem to rely on this method.
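Here is a minimal sketch of the run-time integration flow, under some strong assumptions: a hand-curated registry of sources with known query parameters, a stand-in topic guesser in place of a real categorization engine, and hypothetical endpoints and parsers throughout.
```python
from urllib.parse import urlencode

# Hypothetical registry of Deep Web sources: which topics each covers,
# how to fill its search form, and how to pull out the useful part of a result.
SOURCES = [
    {
        "name": "restaurant_reviews",
        "topics": {"restaurant", "food"},
        "search_url": "https://example.com/search?",   # placeholder endpoint
        "param": "q",
        "parse": lambda body: body[:200],               # stand-in result extractor
    },
]

def guess_topics(query):
    """Very rough topic guess; a real system would use a categorization engine."""
    food_words = {"ravioli", "thai", "restaurant", "menu"}
    return {"restaurant", "food"} if set(query.lower().split()) & food_words else set()

def integrate_at_runtime(query, fetch):
    """Pick matching sources, map the query onto each form, and merge results."""
    topics = guess_topics(query)
    results = []
    for source in SOURCES:
        if topics & source["topics"]:
            url = source["search_url"] + urlencode({source["param"]: query})
            body = fetch(url)               # fetch() is injected (e.g. urllib, requests)
            results.append((source["name"], source["parse"](body)))
    return results

# Fake fetcher so the sketch runs without network access:
print(integrate_at_runtime("ravioli", fetch=lambda url: f"results page for {url}"))
```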

But there are drawbacks to run-time integration. It's extremely difficult to figure out the appropriate forms for a given query's terms. In addition, mapping query terms onto those forms and extracting information from the results are highly labor-intensive tasks, which makes the approach hard to scale.

In the off-line indexing approach to accessing Deep Web data, one constructs a set of queries to run through the forms, processes those queries off-line, and indexes the results. Once the query set is constructed, this approach can reuse the existing search engine infrastructure for crawling, indexing, and index serving.

Google has taken this approach to surface Deep Web content. However, algorithmically constructing input values for forms is a non-trivial task. Furthermore, this approach cannot be applied to HTML forms that use the HTTP POST method, since the resulting URLs are all the same: the form inputs are part of the HTTP request body rather than the URL.
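The sketch below illustrates the surfacing idea for a GET form, assuming the form's action URL, input name, and a seed list of probe values are already known (real systems mine candidate values algorithmically). Every probe produces a distinct, crawlable URL, which is exactly what a POST form would not give us. The URL and values are made up.

```python
from urllib.parse import urlencode

FORM_ACTION = "https://example.com/recipes/search"   # hypothetical GET form
INPUT_NAME = "dish"
PROBE_VALUES = ["ravioli", "pad thai", "paella"]      # seed values; real systems mine these

def surface_urls(action, input_name, values):
    """Turn (form, probe value) pairs into crawlable result-page URLs."""
    return [f"{action}?{urlencode({input_name: v})}" for v in values]

for url in surface_urls(FORM_ACTION, INPUT_NAME, PROBE_VALUES):
    print(url)   # hand these to the regular crawl/index pipeline
```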

The Kosmix Approach to the Deep Web

At Kosmix, we surface Deep Web content by using a combination of the run-time integration and off-line indexing approaches. At the core of Kosmix's technology are (1) a sophisticated categorization engine that maps a query to the appropriate category; (2) a highly scalable fetching and run-time integration system that fetches data from various sources, integrates it, and provides a rich experience; and (3) an off-line crawling and indexing system that enables scalability.

For example, for a query like "Ravioli", we show nutritional values from Fatsecret.com. Our categorization technology identifies Ravioli as a food query, which enables us to surface Deep Web content from Fatsecret.com.
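A toy version of that routing step is sketched below; the category keywords and the category-to-source mapping are illustrative stand-ins, not Kosmix's actual engine.

```python
# Classify the query, then pick a Deep Web source for that category.
CATEGORY_HINTS = {
    "food":  {"ravioli", "pasta", "pizza", "recipe"},
    "movie": {"trailer", "showtimes", "imdb"},
}

CATEGORY_SOURCES = {
    "food":  "Fatsecret.com (nutrition facts)",
    "movie": "showtimes provider",
}

def route_query(query):
    """Map a query to a category, then to the Deep Web source to surface."""
    words = set(query.lower().split())
    for category, hints in CATEGORY_HINTS.items():
        if words & hints:
            return category, CATEGORY_SOURCES[category]
    return None, None

print(route_query("Ravioli"))   # ('food', 'Fatsecret.com (nutrition facts)')
```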

The Next Hurdle

While invaluable treasures are hiding in the Deep Web, there are significant challenges in reaching this information. The next step for search engines will be to find an easier way to tap into the Deep Web, and to keep up with the Real-Time Web.

My prediction? The Deep Web will force a drastic change in how traditional search engine systems are designed and built.


References:

  1. Michael K. Bergman. White Paper: The Deep Web: Surfacing Hidden Value. http://brightplanet.com/index.php/white-papers/119.html, 2000.
  2. Kevin Chen-chuan Chang, Bin He, Chengkai Li, Mitesh Patel, and Zhen Zhang. Structured Databases on the Web: Observations and Implications. In SIGMOD 2004.
  3. Cazoodle. http://www.cazoodle.com/docs/Press_Kit.pdf, 2009.
  4. Jayant Madhavan, David Ko, Lucja Kot, Vignesh Ganapathy, Alex Rasmussen, and Alon Halevy. Google's Deep Web crawl. In VLDB 2008.