Deep Web: The Hidden Treasure

(Iceberg image ©Ralph A. Clevenger)

Experts estimate that search engines can access less than 1% of the data available on the Web, only the tip of the iceberg. Where is the rest of Internet’s data? It’s lurking in the Deep Web.

What is the Deep Web?

The Deep Web is defined as dynamically generated content hidden behind HTML forms that is inaccessible for search engines' crawlers. Deep Web is also referred as the hidden or invisible Web.

The Deep Web consists of three key elements: (1) pages and databases accessible only through HTML forms; (2) disconnected pages not accessible to crawlers; and (3) password protected and subscription only sites. Some people also include real time Web data as a part of the Deep Web, since it's changing so fast that traditional search engines are not able to surface it in their results.

How Vast is the Deep Web?

According to one study by Michael K. Bergman in 2000, the Deep Web accounted for 7,500 terabytes of data. At that time, search engines could index only 10s of terabytes of data. By 2004, a subsequent study by Kevin Chang and his colleagues estimated that the Deep Web had grown to more than 30,000 terabytes of data. At this rate, one can only imagine how vast it must be today, particularly given the ubiquity of the Internet over the past five years. Such
an enormous amount of data has huge wealth of information—the key is figuring out how to access it.

Is it Possible to Access the Deep Web?

Absolutely-though it’s not easy. There are two main approaches to accessing Deep Web data: run-time integration, and off-line indexing.

In the run-time integration approach, one has to build a system that performs the following tasks: first, figure out the appropriate forms that are likely to have results for the given query terms; second, map the query terms suitably to search those forms and integrate the results from various forms; and third, extract relevant parts of results to display. This approach enables richer experience for users, and sites like Cazoodle.com seems to rely on this method.

But there are some drawbacks to run-time integration. It’s extremely difficult to figure out appropriate forms for the given query terms. In addition, mapping query terms to search those forms and extracting information from the results is highly labor-intensive tasks and not very scalable.

In the off-line indexing approach to access Deep Web data, one has to construct a set of queries to search through forms, process the queries through forms while off-line, and index the result. Once the query set is constructed, this approach can reuse the search engine infrastructure for crawling, indexing results, and index serving.

Google has taken this approach to surface Deep Web content. However, algorithmically constructing input values for forms is a non-trivial task. Furthermore, this approach cannot be applied to HTML forms that use HTTP POST method, since the resulting URLs are the same, and form inputs are part of HTTP request rather then the URL.

The Kosmix Approach to the Deep Web

At Kosmix, we surface Deep Web content by using a combination of run-time integration and off-line indexing approaches. At the core of Kosmix technologies are (1) a sophisticated categorization engine that enables mapping of query to appropriate category; (2) a highly scalable fetching and run time integration system to fetch data from various sources, integrate, and provide rich experience; and (3) an off-line crawling and indexing systems that enables
scalability.

For example, for a query like "Ravioli", we show nutritional values from Fatsecret.com. Our categorization technology enables us to identify Ravioli as a food query, and enables us to surface Deep Web content from Fatsecret.com.

The Next Hurdle

While invaluable treasures are hiding behind the Deep Web, there are significant challenges to solving the problem of reaching this information. The next step for search engines will be to find an easier way to tap into the Deep Web, and to keep up with the Real Time Web.

My prediction? The Deep Web will force a drastic change in how traditional search engine systems are designed and built.

References:

Michael K. Bergman. White Paper: The Deep Web: Surfacing Hidden Value. http://brightplanet.com/index.php/white-papers/119.html, 2000.
Kevin Chen-chuan Chang, Bin He, Chengkai Li, Mitesh Patel, and Zhen Zhang. Structured Databases on the Web: Observations and Implications. In SIGMOD 2004.
Cazoodle. http://www.cazoodle.com/docs/Press_Kit.pdf, 2009.
Jayant Madhavan, David Ko, Lucja Kot, Vignesh Ganapathy, Alex Rasmussen, and Alon Halevy. Google's Deep Web crawl. In VLDB 2008.