Enormous expanses of the Internet are unreachable with standard web search engines. This book provides the key to finding these hidden resources by identifying how to uncover and use invisible web resources. Mapping the invisible Web, when and how to use it, assessing the validity of the information, and the future of Web searching are topics covered in detail. Only 16 percent of Net-based information can be located using a general search engine. The other 84 percent is what is referred to as the invisible Web—made up of information stored in databases. Unlike pages on the visible Web, information in databases is generally inaccessible to the software spiders and crawlers that compile search engine indexes. As Web technology improves, more and more information is being stored in databases that feed into dynamically generated Web pages. The tips provided in this resource will ensure that those databases are exposed and Net-based research will be conducted in the most thorough and effective manner.
Publisher: Information Today, Inc.
Edition description: 1 ED
Product dimensions: 6.00(w) x 9.50(h) x 1.01(d)
About the Author
Gary Price is a reference librarian at George Washington University. He lives in Vienna, Virginia. Chris Sherman is the director of the guide to Web searching on About.com and president of Searchwise, a consulting firm. He is the author of the CD-ROM Handbook and a frequent contributor to Online magazine. He lives in Los Angeles. Danny Sullivan works for SearchEngineWatch.com.
Read an Excerpt
The Invisible Web
Uncovering Information Sources Search Engines Can't See
By Chris Sherman, Gary Price
Information Today, Inc.
Copyright © 2001 Chris Sherman and Gary Price
All rights reserved.
The Internet and the Visible Web
To understand the Web in the broadest and deepest sense, to fully partake of the vision that I and my colleagues share, one must understand how the Web came to be.
— Tim Berners-Lee, Weaving the Web
Most people tend to use the words "Internet" and "Web" interchangeably, but they're not synonyms. The Internet is a networking protocol (set of rules) that allows computers of all types to connect to and communicate with other computers on the Internet. The Internet's origins trace back to a project sponsored by the U.S. Defense Advanced Research Projects Agency (DARPA) in 1969 as a means for researchers and defense contractors to share information (Kahn, 2000).
The World Wide Web (Web), on the other hand, is a software protocol that runs on top of the Internet, allowing users to easily access files stored on Internet computers. The Web was created in 1990 by Tim Berners-Lee, a computer programmer working for the European Organization for Nuclear Research (CERN). Prior to the Web, accessing files on the Internet was a challenging task, requiring specialized knowledge and skills. The Web made it easy to retrieve a wide variety of files, including text, images, audio, and video by the simple mechanism of clicking a hypertext link.
The primary focus of this book is on the Web — and more specifically, the parts of the Web that search engines can't see. To fully understand the phenomenon called the Invisible Web, it's important to first understand the fundamental differences between the Internet and the Web.
In this chapter, we'll trace the development of some of the early Internet search tools, and show how their limitations ultimately spurred the popular acceptance of the Web. This historical background, while fascinating in its own right, lays the foundation for understanding why the Invisible Web could arise in the first place.
How the Internet Came to Be
Up until the mid-1960s, most computers were stand-alone machines that did not connect to or communicate with other computers. In 1962 J.C.R. Licklider, a professor at MIT, wrote a paper envisioning a globally connected "Galactic Network" of computers (Leiner, 2000). The idea was far-out at the time, but it caught the attention of Larry Roberts, a project manager at the U.S. Defense Department's Advanced Research Projects Agency (ARPA). In 1966 Roberts submitted a proposal to ARPA that would allow the agency's numerous and disparate computers to be connected in a network similar to Licklider's Galactic Network.
Roberts' proposal was accepted, and work began on the "ARPANET," which would in time become what we know as today's Internet. The first "node" on the ARPANET was installed at UCLA in 1969 and gradually, throughout the 1970s, universities and defense contractors working on ARPA projects began to connect to the ARPANET.
In 1973 the U.S. Defense Advanced Research Projects Agency (DARPA) initiated another research program to allow networked computers to communicate transparently across multiple linked networks. Whereas the ARPANET was just one network, the new project was designed to be a "network of networks." According to Vint Cerf, widely regarded as one of the "fathers" of the Internet, "This was called the Internetting project and the system of networks which emerged from the research was known as the 'Internet'" (Cerf, 2000).
It wasn't until the mid-1980s, with the simultaneous explosion in the use of personal computers and the widespread adoption of a universal standard of Internet communication called Transmission Control Protocol/Internet Protocol (TCP/IP), that the Internet became widely available to anyone desiring to connect to it. Other government agencies fostered the growth of the Internet by contributing communications "backbones" that were specifically designed to carry Internet traffic. By the late 1980s, the Internet had grown from its initial network of a few computers to a robust communications network supported by governments and commercial enterprises around the world.
Despite this increased accessibility, the Internet was still primarily a tool for academics and government contractors well into the early 1990s. As more and more computers connected to the Internet, users began to demand tools that would allow them to search for and locate text and other files on computers anywhere on the Net.
Early Net Search Tools
Although sophisticated search and information retrieval techniques date back to the late 1950s and early '60s, these techniques were used primarily in closed or proprietary systems. Early Internet search and retrieval tools lacked even the most basic capabilities, primarily because it was thought that traditional information retrieval techniques would not work well on an open, unstructured information universe like the Internet.
Accessing a file on the Internet was a two-part process. First, you needed to establish a direct connection to the remote computer where the file was located using a terminal emulation program called Telnet. Then you needed to use another program, called a File Transfer Protocol (FTP) client, to fetch the file itself. For many years, to access a file it was necessary to know both the address of the computer and the exact location and name of the file you were looking for. There were no search engines or other file-finding tools like the ones we're familiar with today.
Thus, "search" often meant sending a request for help to an e-mail message list or discussion forum and hoping some kind soul would respond with the details you needed to fetch the file you were looking for. The situation improved somewhat with the introduction of "anonymous" FTP servers, which were centralized file servers specifically intended for enabling the easy sharing of files. The servers were anonymous because they were not password protected; anyone could simply log on and request any file on the system.
Files on FTP servers were organized in hierarchical directories, much like files are organized in hierarchical folders on personal computer systems today. The hierarchical structure made it easy for the FTP server to display a directory listing of all the files stored on the server, but you still needed good knowledge of the contents of the FTP server. If the file you were looking for didn't exist on the FTP server you were logged into, you were out of luck.
The first true search tool for files stored on FTP servers was called Archie, created in 1990 by a small team of systems administrators and graduate students at McGill University in Montreal. Archie was the prototype of today's search engines, but it was primitive and extremely limited compared to what we have today. Archie roamed the Internet searching for files available on anonymous FTP servers, downloading directory listings of every anonymous FTP server it could find. These listings were stored in a central, searchable database called the Internet Archives Database at McGill University, and were updated monthly.
Although it represented a major step forward, the Archie database was still extremely primitive: searches were limited to a specific file name or to computer programs that performed specific functions. Nonetheless, it proved extremely popular. According to Peter Deutsch, who headed up the McGill University Archie team, nearly 50 percent of Internet traffic to Montreal in the early '90s was Archie related.
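The kind of file-name lookup described above can be illustrated with a minimal sketch. It is not Archie's actual implementation; it only assumes that directory listings gathered from several FTP servers are merged into one index keyed by file name. The server names, paths, and function names below are hypothetical.

```python
from collections import defaultdict


def build_index(listings):
    """Map each file name to the (server, path) locations where it appears.

    `listings` maps a server name to the file paths found in that server's
    directory listing (hypothetical data, not real Archie output).
    """
    index = defaultdict(list)
    for server, paths in listings.items():
        for path in paths:
            # Keep only the last path component: the file name itself.
            filename = path.rsplit("/", 1)[-1]
            index[filename.lower()].append((server, path))
    return index


def search(index, name):
    """Return every known location of a file with exactly this name."""
    return sorted(index.get(name.lower(), []))


# Hypothetical listings from two anonymous FTP servers.
listings = {
    "ftp.example.edu": ["/pub/tools/zip.tar", "/pub/docs/readme.txt"],
    "ftp.example.org": ["/mirror/zip.tar"],
}
index = build_index(listings)
print(search(index, "zip.tar"))
```

A search succeeds only when the exact file name is already known, which mirrors the limitation noted above: an index of this kind says nothing about what a file contains.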
"In the brief period following the release of Archie, there was an explosion of Internet-based research projects, including WWW, Gopher, WAIS, and others" (Deutsch, 2000).
"Each explored a different area of the Internet information problem space, and each offered its own insights into how to build and deploy Internet-based services," wrote Deutsch. The team licensed Archie to others, with the first shadow sites launched in Australia and Finland in 1992. The Archie network reached a peak of 63 installations around the world by 1995.
Gopher, an alternative to Archie, was created by Mark McCahill and his team at the University of Minnesota in 1991 and was named for the university's mascot, the Golden Gopher. Gopher essentially combined the Telnet and FTP protocols, allowing users to click hyperlinked menus to access information on demand without resorting to additional commands. Using a series of menus that allowed the user to drill down through successively more specific categories, users could ultimately access the full text of documents, graphics, and even music files, though not integrated in a single format. Gopher made it easy to browse for information on the Internet.
According to Gopher creator McCahill, "Before Gopher there wasn't an easy way of having the sort of big distributed system where there were seamless pointers between stuff on one machine and another machine. You had to know the name of this machine and if you wanted to go over here you had to know its name.
"Gopher takes care of all that stuff for you. So navigating around Gopher is easy. It's point and click typically. So it's something that anybody could use to find things. It's also very easy to put information up so a lot of people started running servers themselves and it was the first of the easy-to-use, no muss, no fuss, you can just crawl around and look for information tools. It was the one that wasn't written for techies."
Gopher's "no muss, no fuss" interface was an early precursor of what later evolved into popular Web directories like Yahoo!. "Typically you set this up so that you can start out with [a] sort of overview or general structure of a bunch of information, choose the items that you're interested in to move into a more specialized area and then either look at items by browsing around and finding some documents or submitting searches," said McCahill.
A problem with Gopher was that it was designed to provide a listing of files available on computers in a specific location — the University of Minnesota, for example. While Gopher servers were searchable, there was no centralized directory for searching all other computers that were both using Gopher and connected to the Internet, or "Gopherspace" as it was called. In November 1992, Fred Barrie and Steven Foster of the University of Nevada System Computing Services group solved this problem, creating a program called Veronica, a centralized Archie-like search tool for Gopher files. In 1993 another program called Jughead added keyword search and Boolean operator capabilities to Gopher search.
Popular legend has it that Archie, Veronica and Jughead were named after cartoon characters. Archie in fact is shorthand for "Archives." Veronica was likely named after the cartoon character (she was Archie's girlfriend), though it's officially an acronym for "Very Easy Rodent-Oriented Net-Wide Index to Computerized Archives." And Jughead (Archie and Veronica's cartoon pal) is an acronym for "Jonzy's Universal Gopher Hierarchy Excavation and Display," after its creator, Rhett "Jonzy" Jones, who developed the program while at the University of Utah Computer Center.
A third major search protocol developed around this time was Wide Area Information Servers (WAIS). Developed by Brewster Kahle and his colleagues at Thinking Machines, WAIS worked much like today's metasearch engines. The WAIS client resided on your local machine, and allowed you to search for information on other Internet servers using natural language, rather than using computer commands. The servers themselves were responsible for interpreting the query and returning appropriate results, freeing the user from the necessity of learning the specific query language of each server.
WAIS used an extension to a standard protocol called Z39.50 that was in wide use at the time. In essence, WAIS provided a single computer-to-computer protocol for searching for information. This information could be text, pictures, voice, or formatted documents. The quality of the search results was a direct result of how effectively each server interpreted the WAIS query.
All of the early Internet search protocols represented a giant leap over the awkward access tools provided by Telnet and FTP. Nonetheless, they still dealt with information as discrete data objects. And these protocols lacked the ability to make connections between disparate types of information — text, sounds, images, and so on — to form the conceptual links that transformed raw data into useful information. Although search was becoming more sophisticated, information on the Internet lacked popular appeal. In the late 1980s, the Internet was still primarily a playground for scientists, academics, government agencies, and their contractors.
Fortunately, at about the same time, a software engineer in Switzerland was tinkering with a program that eventually gave rise to the World Wide Web. He called his program Enquire Within Upon Everything, borrowing the title from a book of Victorian advice that provided helpful information on everything from removing stains to investing money.
Enquire Within Upon Everything
"Suppose all the information stored on computers everywhere were linked, I thought. Suppose I could program my computer to create a space in which anything could be linked to anything. All the bits of information in every computer at CERN, and on the planet, would be available to me and to anyone else. There would be a single, global information space.
"Once a bit of information in that space was labeled with an address, I could tell my computer to get it. By being able to reference anything with equal ease, a computer could represent associations between things that might seem unrelated but somehow did, in fact, share a relationship. A Web of information would form."
— Tim Berners-Lee, Weaving the Web
The Web was created in 1990 by Tim Berners-Lee, who at the time was a contract programmer at the European Organization for Nuclear Research (CERN) high-energy physics laboratory in Geneva, Switzerland. The Web was a side project Berners-Lee took on to help him keep track of the mind-boggling diversity of people, computers, research equipment, and other resources that are de rigueur at a massive research institution like CERN. One of the primary challenges faced by CERN scientists was the very diversity that gave the lab its strength. The lab hosted thousands of researchers every year, arriving from countries all over the world, each speaking different languages and working with unique computing systems. And since high-energy physics research projects tend to spawn huge amounts of experimental data, a program that could simplify access to information and foster collaboration was something of a Holy Grail.
Berners-Lee had been tinkering with programs that allowed relatively easy, decentralized linking capabilities for nearly a decade before he created the Web. He had been influenced by the work of Vannevar Bush, who served as Director of the Office of Scientific Research and Development during World War II. In a landmark paper called "As We May Think," Bush proposed a system he called MEMEX, "a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility" (Bush, 1945).
The materials stored in the MEMEX would be indexed, of course, but Bush aspired to go beyond simple search and retrieval. The MEMEX would allow the user to build conceptual "trails" as he moved from document to document, creating lasting associations between different components of the MEMEX that could be recalled at a later time. Bush called this "associative indexing ... the basic idea of which is a provision whereby any item may be caused at will to select immediately and automatically another. This is the essential feature of the MEMEX. The process of tying two items together is the important thing."
In Bush's visionary writings, it's easy for us to see the seeds of what we now call hypertext. But it wasn't until 1965 that Ted Nelson actually described a computerized system that would operate in a manner similar to what Bush envisioned. Nelson called his system "hypertext" and described the next-generation MEMEX in a system he called Xanadu.
Nelson's project never achieved enough momentum to have a significant impact on the world. Another twenty years would pass before Xerox implemented the first mainstream hypertext program, called NoteCards, in 1985. A year later, Owl Ltd. created a program called Guide, which functioned in many respects like a contemporary Web browser, but lacked Internet connectivity.
Excerpted from The Invisible Web by Chris Sherman, Gary Price. Copyright © 2001 Chris Sherman and Gary Price. Excerpted by permission of Information Today, Inc.
All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.
Table of Contents
Chapter 1: The Internet and the Visible Web
Chapter 2: Information Seeking on the Visible Web
Chapter 3: Specialized and Hybrid Search Tools
Chapter 4: The Invisible Web
Chapter 5: Visible or Invisible?
Chapter 6: Using the Invisible Web
Chapter 7: Case Studies
Chapter 8: The Future: Revealing the Invisible Web
Chapter 9: The Best of the Invisible Web
Chapter 10: Art and Architecture
Chapter 11: Bibliographies and Library Catalogs
Chapter 12: Business and Investing
Chapter 13: Computers and Internet
Chapter 14: Education
Chapter 15: Entertainment
Chapter 16: Government Information and Data
Chapter 17: Health and Medical Information
Chapter 18: U.S. and World History
Chapter 19: Legal and Criminal Resources
Chapter 20: News and Current Events
Chapter 21: Searching for People
Chapter 22: Public Records
Chapter 23: Real-Time Information
Chapter 24: Reference
Chapter 25: Science
Chapter 26: Social Sciences
Chapter 27: Transportation
About the Authors
Most Helpful Customer Reviews
Classic book on the deep web. A bit dated but good for background reading.
What a great book, essential reading for anyone who uses the web to find information. It explains, clearly and concisely, what the invisible web is (the 80% of the web that is NOT indexed by search engines) and why that material is 'invisible'. In addition, the book has 19 (!) chapters with descriptions of invisible web resources on topics ranging from health/medicine to news to science. This book should be on every web searcher's desk.
I think that this book is so good. How good is it, you might ask? Well, let me tell you, it is so good that, like, I know it is so good even before reading it. Can't wait til it's released so that I can see how good it actually is.