Searching the Web
The Deep Web
The concept of the Deep Web is not surprising to me: I’ve had a bunch of pages up on the Web that I haven’t bothered or wanted to have indexed by crawler-based search engines, and I’ve even used robots.txt and Sitemap files to control which parts of my content crawlers see. But I had never actually thought about the value of “hidden” content, and I didn’t realise that Sitemaps were a sneaky tool (well, the Sitemap generators are!) for accessing more of the Deep Web, because they inform search engines of more URLs than their crawlers would be able to find on their own.
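To make the robots.txt and Sitemap interplay a bit more concrete, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content and the example.com URLs are made up for illustration; they are not my real files.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site (an assumption, not a real file).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: http://example.com/sitemap.xml
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text directly, without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# A well-behaved crawler asks before fetching each URL.
private_ok = rules.can_fetch("*", "http://example.com/private/notes.html")
public_ok = rules.can_fetch("*", "http://example.com/blog/index.html")

# Sitemap lines advertise URLs a crawler might never reach by following links.
sitemaps = rules.site_maps()  # site_maps() needs Python 3.8 or later

print(private_ok, public_ok, sitemaps)
```

The Disallow rule keeps polite crawlers out of one directory, while the Sitemap line does the opposite job: it hands the crawler a list of URLs to visit.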
Now, I’ll point out that as a Linux user, I didn’t have access to Copernic or Sherlock, so I used Web-based metasearch tools instead.
When I first read about the Deep Web in this course, I wondered how the content was accessible to anyone other than the owner. In my research, I came across Turbo10, a Deep Net search tool that returns Web content as well as other hidden Internet resources. Turbo10 is different to many other search engines because it performs the relevance ranking, topic clustering and result merging in the client browser rather than on the server (this is done in the interests of speed and is achieved by way of asynchronous data transfer). It turns out that Turbo10 may use a technique called federated searching, whereby programmed “adapters” automatically connect to topical Deep Web search engines, searching and extracting results from the Deep Net. Turbo10 has provided a great paper that explains the mechanics of searching the Deep Net.
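The general idea of federated searching can be sketched in a few lines of Python. This is only my illustration of the technique, not Turbo10's actual code: the adapter functions, the canned results and the 0 to 1 relevance scores are all assumptions, and real adapters would query remote search forms.

```python
from typing import Callable, NamedTuple


class Result(NamedTuple):
    title: str
    url: str
    score: float  # relevance reported by the source, assumed to be in 0..1


# An adapter takes a query and returns results from one deep-web source.
Adapter = Callable[[str], list[Result]]  # list[...] syntax needs Python 3.9+


def library_adapter(query: str) -> list[Result]:
    # Stand-in for a real catalogue search; returns canned data.
    return [
        Result("PM methods handbook", "http://library.example/pm", 0.9),
        Result("Agile overview", "http://library.example/agile", 0.4),
    ]


def journal_adapter(query: str) -> list[Result]:
    # Stand-in for a second topical source; also canned data.
    return [
        Result("PM methods handbook", "http://library.example/pm", 0.7),
        Result("Methodology survey", "http://journal.example/survey", 0.8),
    ]


def federated_search(query: str, adapters: list[Adapter]) -> list[Result]:
    """Query every adapter, keep the best-scored copy of each URL, then rank."""
    best: dict[str, Result] = {}
    for adapter in adapters:
        for result in adapter(query):
            current = best.get(result.url)
            if current is None or result.score > current.score:
                best[result.url] = result
    return sorted(best.values(), key=lambda r: r.score, reverse=True)


merged = federated_search(
    "project management methodologies", [library_adapter, journal_adapter]
)
for result in merged:
    print(result.score, result.title, result.url)
```

The merging and ranking step is the part Turbo10 is described as pushing into the browser; here it simply runs in one process.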
To the actual task:
Google Search for “project management methodologies”: top result - Project Management Methodologies, about 3,600,000 results in total.
Turbo10 Search for “project management methodologies”: top result - Project Management Methodologies, 20 results in total.
I personally don’t think either search produced great results: a project management body would have carried more authority than an individual project manager and would hold more relevance than a book review. Turbo10 was quite disappointing, actually, in the number of results it returned.
Another great paper on searching the Deep Web is Using the Deep Web: A How-To Guide For IT Professionals.
Boolean Searching
To get the biggest result set, use OR between keywords. Note that this returns the combined result sets of every keyword joined by the operator, so a page only needs to match one of them.
Google Search for “project management OR methodologies”: top result - Project Management - Wikipedia.
If you need all keywords in your results, you’re better off using the AND operator or no operator at all (since AND is the default operator). I think the most useful searching techniques are the minus (-) operator, which removes results that include the terms you specify (e.g. project management methodologies -agile), and quoting, which requires the exact phrase to appear in each result (e.g. “I have of late, but wherefore I know not”).
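One way to picture these operators is as set operations over the pages that match each term. The sketch below uses a tiny made-up corpus and naive substring matching, so it is an analogy for how the operators combine results rather than a description of how Google actually evaluates a query.

```python
# Hypothetical mini-corpus: document id -> text (invented for illustration).
DOCS = {
    1: "project management methodologies compared",
    2: "agile project management methodologies",
    3: "garden management tips",
    4: "research methodologies in biology",
}


def matching(term: str) -> set[int]:
    """Ids of documents containing the term (or exact phrase) as a substring."""
    needle = term.lower()
    return {doc_id for doc_id, text in DOCS.items() if needle in text.lower()}


# OR: union, the biggest result set.
either = matching("management") | matching("methodologies")

# AND (the default): intersection, every keyword must appear.
both = matching("management") & matching("methodologies")

# Minus: set difference, drop anything mentioning the excluded term.
not_agile = both - matching("agile")

# Quoted phrase: the words must appear together, in order.
phrase = matching("project management methodologies")

print(either, both, not_agile, phrase)
```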
To obtain result sets originating only from university sources, it is best to use the Advanced Search option. Yahoo allows you to set specific Top Level Domains to search within: this is the best approach. Simply enter the search phrase, then limit the search to .edu domains.
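The same restriction can usually be typed straight into the search box. The sketch below assembles a query string from a phrase, excluded terms and a domain filter; the build_query helper is hypothetical, and the site: operator and the Google search URL are my assumptions about the query syntax rather than something the Advanced Search form guarantees.

```python
from urllib.parse import urlencode


def build_query(phrase: str, exclude: tuple[str, ...] = (), site: str = "") -> str:
    """Assemble a search-box query: exact phrase, minus terms, domain filter."""
    parts = [f'"{phrase}"']
    parts.extend(f"-{term}" for term in exclude)
    if site:
        # Assumed operator: limits results to the given domain or TLD.
        parts.append(f"site:{site}")
    return " ".join(parts)


query = build_query(
    "project management methodologies", exclude=("agile",), site=".edu"
)

# urlencode takes care of escaping the quotes, spaces and colon.
url = "https://www.google.com/search?" + urlencode({"q": query})

print(query)
print(url)
```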
Further references:
Librarian Search Guide
Great Boolean search cheat sheet
Dogpile
Ask
Organising Search Information
I used OpenOffice Calc (a spreadsheet tool) and The Gimp to record my search information. These are great open source tools that I use regularly. I built a similar system to the one you see in the screenshot below for recording website bugs for CybaSumo.com.
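If you would rather script the record keeping, the same kind of log can be written as a CSV file that Calc opens directly. This is a minimal sketch: the column names are my own invention, and the two rows simply reuse the searches from the task above (the Google total was approximate).

```python
import csv
import io

# Hypothetical column layout; adjust to whatever you want to track.
FIELDS = ["engine", "query", "top_result", "total_results"]

rows = [
    {
        "engine": "Google",
        "query": "project management methodologies",
        "top_result": "Project Management Methodologies",
        "total_results": 3600000,  # "about 3,600,000" as reported
    },
    {
        "engine": "Turbo10",
        "query": "project management methodologies",
        "top_result": "Project Management Methodologies",
        "total_results": 20,
    },
]


def to_csv(records: list[dict]) -> str:
    """Serialise the search log to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = to_csv(rows)
print(csv_text)
```

Writing csv_text to a file with a .csv extension is enough for a spreadsheet tool to pick it up.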


August 28th, 2008 at 3:15 pm
For another Deep Web project, see http://www.isen.org and http://blog.isen.org
August 28th, 2008 at 7:17 pm
Thanks, Matt.