Make an Internet archive for the ages
Twenty-eight years ago this month, I first ventured onto the World Wide Web. First among my friends – and among the first thousand web users anywhere in the world – I explored a very small universe of content.
But how did web users find anything before Google, a tool that would not appear for several more years?
I started with a list of websites hosted at CERN, methodically clicking one link after another, exploring each website as it loaded in my “browser” and then moving on to the next. By the time I had finished this list, another had popped up, this one hosted at the National Center for Supercomputing Applications (NCSA). Almost as famous as CERN, NCSA gave the world its first widely used browser, Mosaic. The NCSA list overlapped a lot with the CERN list, but a few new sites were appearing at the bottom every day, so I spent a day or two visiting all the sites on this list that I had not already seen.
In seven days, I was done. I had surfed the entire web.
For a few months, I managed to keep up with new websites as they appeared on the NCSA list, proud to stay up to date with this amazing new technology. But by the end of February 1994, more sites were being added to the list every day than I could find the time to explore. Soon after, the list's maintainer threw in the towel – the web’s exponential growth meant that no archivist could hope to keep pace.
In early 1994, two enterprising Stanford University students (no, not Larry and Sergey) set up Jerry and David’s Guide to the World Wide Web, a part-time project that quickly became the first of the Internet’s “unicorns”: Yahoo! Taking a librarian’s approach to the early web's overabundance of good things, Yahoo! asked you to choose your category, then your sub-category, and maybe even your sub-sub- and sub-sub-sub-categories, leaving you with an organized list of websites to review at your leisure, each dedicated to your sub-sub-sub-topic of interest.
It took 18 months for exponential growth to overwhelm Yahoo!; each sub-sub-sub-sub-sub-category now produced a list of sites that was too long to wade through. At this point, I started to keep lists of links – “bookmarks” – like a breadcrumb trail to guide me back to my favorite sites. When that list got long enough, I curated the best of the best, gathered them into a list named Stones, Stars and Gold, and put it on a page on my own site.
Number of websites on the Internet over time
- 1991 (World Wide Web invented): 10
- 1994 (Yahoo! launched): around 3,000
- 1998 (Google launched): around 2.4 million
- 2004 (Facebook launched): around 51.6 million
- Today: around 1.7 billion
When I visit this list today and work methodically from top to bottom, only about one-fifth of the links load the pages they pointed to in 1995. Most of them lead to nothing at all, or to something that has the same name but is completely different. In less than a generation, my snapshot of the early web – very personal, specific, and meaningful – has almost entirely rotted away.
The term “link rot” may not be new – the concept dates back to the first decade of the web – but most people won’t know that the web was designed to do its best to prevent untimely link death. The Uniform Resource Locator, or URL, was defined by web creator Sir Tim Berners-Lee as “immutable”: it must not change. A URL is assigned once, as a pointer to a page, photo, or podcast, and that’s it. That URL always points to those bits. At least, that’s the theory. Unfortunately, immutable URLs immediately went into the too-hard basket. From that moment, the rot set in.
Brewster Kahle saw the problem almost immediately. In 1996, the co-inventor of WAIS (Wide Area Information Server) founded the Internet Archive and began a methodical backup of the entire web. “How can you cite a document if the documents disappear every 44 days?” he asked, using his web archive to power something called the Wayback Machine – a technology designed to stop the rot. Enter a dead URL into the Wayback Machine and it will show you all of its backups of that webpage, going back to when it began its first crawl 25 years ago.
Running my 80%-dead list of favorite links from 1995 through the Wayback Machine, I find that many (probably most) of these websites can be salvaged. The links themselves may be dead, but the pages and images those links pointed to persist. If I wanted, I could recreate my page with links that point into the Wayback Machine, bringing the list back to life. Yet that may not be enough to prevent a more pernicious form of link rot.
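For readers who want to try this on their own bookmarks, here is a minimal sketch of how such a rewrite might look in Python. It assumes the Wayback Machine's public replay address pattern (`https://web.archive.org/web/<timestamp>/<original URL>`) and its availability lookup endpoint; the function names and the sample bookmarks are hypothetical, and the sketch only builds URLs without contacting the archive.

```python
from urllib.parse import urlencode

# Assumed public Wayback Machine address patterns (not taken from the article).
WAYBACK_REPLAY = "https://web.archive.org/web"
WAYBACK_AVAILABLE = "https://archive.org/wayback/available"


def wayback_url(original_url: str, timestamp: str = "1995") -> str:
    """Build a replay URL asking for the capture closest to `timestamp`.

    `timestamp` is 1 to 14 digits of YYYYMMDDhhmmss; a bare year is allowed.
    """
    return f"{WAYBACK_REPLAY}/{timestamp}/{original_url}"


def availability_query(original_url: str, timestamp: str = "1995") -> str:
    """Build a lookup URL that reports whether any capture exists."""
    query = urlencode({"url": original_url, "timestamp": timestamp})
    return f"{WAYBACK_AVAILABLE}?{query}"


def revive_links(links, timestamp: str = "1995") -> dict:
    """Map each (possibly dead) link to its archived counterpart."""
    return {link: wayback_url(link, timestamp) for link in links}


if __name__ == "__main__":
    # Hypothetical bookmarks standing in for a real 1995 list.
    bookmarks = ["http://example.com/", "http://example.org/page.html"]
    for old, new in revive_links(bookmarks).items():
        print(old, "->", new)
```

Whether a given page was actually captured is a separate question: the replay URL only asks for the nearest snapshot, so a careful rebuild would check each link against the availability endpoint before publishing it.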
A recent paper from a group of researchers based in the United States shows that even a good web backup can miss the point. “Where Did the Web Archive Go?” details the fate of four web archives (the Internet Archive, thankfully, not among them) that changed their own URLs during a 14-month period between 2017 and 2019. While well-intentioned, these changes broke many URLs pointing to content in these archives. An archive is good; it is certainly better than losing the data. But an archive that does not provide immutable URLs for its data, well, that’s peak link rot.
We are all generating so much data all the time now – on smartphones, wearable devices, Zoom calls and so on – that archiving is no longer a luxury. Without an archive, we lose our connection to our digital past. I learned this viscerally when I searched online for resources about the First International Conference on the World Wide Web, which took place at CERN in May 1994, and which I attended. There is very little documentation of the event, and only a few photos, for one of the most important events in computing history: the web’s “big bang” moment. Why? The answer is almost too obvious: the conference took place before the web took off. The medium we use to record, commemorate and share our experiences simply did not exist yet. It was only built afterwards, by the more than 300 researchers who attended the conference.
The shadow cast by this absence has shown me that if we are not very careful, we could lose our connection to our past. The data may survive somewhere, but could be so difficult to locate that most of us would simply resign ourselves to a kind of perpetual digital amnesia. In his novel 1984, George Orwell wrote: “He who controls the past controls the future.” I would say that those who forget the past don’t have much of a future.
Everything we add to the archive daily – photos shared on Facebook, movies uploaded to YouTube, rants posted on Twitter and so on – means that this threat affects almost all of us. What can we do? We can demand immutability in perpetuity. Any organization that publishes to the web must ensure that even when it overhauls its systems, the data will remain available and accessible forever, via the same URLs. We cannot let our history rot. It doesn’t have to happen, and it shouldn’t. Not if we want to be able to understand how we got here and where we are going.