The Search Lounge

11/08/2004

Wayback Machine

Wayback Machine (part of the Internet Archive)
Type of Engine:
Not so much an algorithmically based engine as it is an access point to archived versions of web sites. And an essential tool for searchers.
Overall: Very good.
If this engine were a drink it would be…Cola. I can’t make my favorite drink, Jack and Coke, without it, but I don’t drink it by itself. Just like you can’t search just with the Wayback Machine, but if you mix it with your favorite search engine you’ll be a happy customer.

Intro
Although the Internet Archive has been around for 8 years, and although they’re not really a search engine per se, I love what they do so much that I wanted to write about them for the Lounge. Their goals are lofty, inspiring, and unique. In my opinion they are one of the most important sites on the web. Founded in 1996 by Brewster Kahle, the Internet Archive is a public nonprofit organization whose goal is to create and keep regularly scheduled snapshots of as many web sites as possible. The Wayback Machine is the interface for viewing these stored versions of web sites. The Archive also archives movies, audio files, and books.

Kahle has stated that his organization’s goal is to store everything. “It could be one of the greatest achievements of all time” (2004). It is a clear and powerful mission statement, and it has vastly significant consequences for future generations. We cannot begin to fathom how different the last 1,500 years would have been had Alexandria’s Library been preserved and its knowledge not lost. The same can be said for other information that has been lost. “The early manuscripts at the Library of Alexandria were burned, much of early printing was not saved, and many early films were recycled for their silver content” (Kahle, 1996). Although 1,500 years from now scholars will probably not be very interested in a personal web page dedicated to someone’s dog, there are countless other web sites that do contain valuable information on many subjects. Kahle does not want history to repeat itself by society losing valuable information that’s stored only on the web. For preserving books, he has even gone so far as to suggest that every book in the Library of Congress could be scanned. He estimates he can scan and digitize all 28,000,000 books for $10 each, which puts universal access to all the knowledge in the Library of Congress at around $280,000,000. And in terms of the web, he says it “is growing at about 20 terabytes of compressed data a month, which is manageable” (2004). OK, if he says so.

The Internet Archive has been very successful in taking snapshots of millions of web sites, but there is still the major challenge of providing access to it all. Currently they have addressed this by creating an interface called the Wayback Machine that lets users view archived versions of web pages. Just type in a URL and all the archived versions of the site will be presented. The Internet Archive has said more than once that it wants to create a textual search interface to its archive, but no such interface currently exists for the public. This is the main area where the Internet Archive needs to improve. The Wayback Machine, although incredibly powerful, needs to be augmented by text searching so that users can locate archived web sites by topic. Of course that’s no easy thing to do, but since IA is affiliated with Alexa (also founded by Kahle), maybe Alexa can share its indexing capabilities. Easier said than done.

UI & Features
There’s really not much to say. You just type in a URL, hit “Take Me Back”, and there you go. On the Advanced Search page there are a few more options, such as merging aliases (a.k.a. de-duplicating), where yahoo.com and yahoo.com/index will be mapped to each other. There’s also a function to compare two snapshots, but unfortunately this wasn’t working for me.
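If you’d rather skip the form, here is a minimal Python sketch of the same lookup. It assumes the archive’s address pattern of web.archive.org/web/, then a timestamp (or * for the full list of snapshots), then the original URL. That pattern and the function names are my own additions for illustration, not something spelled out on the Wayback Machine’s search page, so treat it as a sketch rather than the official way in.

```python
from urllib.parse import urlsplit

# Assumed address pattern; not taken from the Wayback Machine's own search form.
WAYBACK_PREFIX = "http://web.archive.org/web"


def _with_scheme(site: str) -> str:
    """Add http:// when the reader typed a bare host name such as www.mlb.com."""
    if not urlsplit(site).scheme:
        return "http://" + site
    return site


def wayback_listing_url(site: str) -> str:
    """Build the address that lists every archived version of `site`."""
    return f"{WAYBACK_PREFIX}/*/{_with_scheme(site)}"


def wayback_snapshot_url(site: str, timestamp: str) -> str:
    """Build the address of the snapshot nearest a YYYYMMDD[hhmmss] timestamp."""
    if not (timestamp.isdigit() and 4 <= len(timestamp) <= 14):
        raise ValueError("timestamp must be 4 to 14 digits, e.g. 20001009")
    return f"{WAYBACK_PREFIX}/{timestamp}/{_with_scheme(site)}"


if __name__ == "__main__":
    # Full list of snapshots for a site.
    print(wayback_listing_url("http://www.mlb.com"))
    # Snapshot nearest a given day (the date here is only an example).
    print(wayback_snapshot_url("www.mlb.com", "20001009"))
```

Pasting either printed address into a browser should land you on the same pages the “Take Me Back” button does, assuming the pattern above still holds.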

Query Examples
Go as obscure as you want and there’s a good chance the Wayback Machine will find it. First I tried a search for a Bukowski page I know of. The first snapshot is from March 1999, and there have been periodic snapshots since then.

I then tried something less obscure, http://www.mlb.com, as in Major League Baseball. The first archived page is from December 22, 1996. For 1996-1999 there are only 1 to 5 snapshots per year. In 2000 things started to kick in, and since 2001 there have been snapshots on almost a monthly basis. But let’s take a closer look just for fun. It turns out that mlb.com was originally owned by Morgan, Lewis and Bockius, a law firm. Then, beginning with the October 9, 2000 snapshot, it becomes the homepage for Major League Baseball.

It’s not clear to me why on some days there are multiple snapshots, but it doesn’t really bother me.

Conclusion
If the Wayback Machine ever goes live with a good searching interface it will be incredibly potent. Not only would you be able to target specific web sites by URL, but you’d also be able to search archived versions of all that valuable content that has been lost from the web. Imagine the power of that. It’s the Internet version of a library with its online catalog plus access to the content of all the books the library ever had.

Knowing that the Internet Archive exists and is working quietly behind the scenes makes it easier for me to sleep at night.
