A Visit to the Physical Internet Archive
While I was in San Francisco for the AI Engineer Summit earlier this month, I took the opportunity to visit the Internet Archive: the actual physical archive in the California town of Richmond, about a twenty-minute drive from San Francisco.
I’d bought a ticket to “go behind-the-scenes at the physical archive” on Wednesday, Oct. 11, and I arrived just before the start time of 6 p.m. I was glad I hadn’t arrived any earlier, since the location of the physical archive was (unsurprisingly) a warehouse in an industrial part of Richmond. There didn’t seem to be anything else to do in the area.
I had instructed the Uber driver to drop me off at a car park with an Internet Archive sign. But as I looked around, I couldn’t see a public entrance to the warehouse. There were a few other confused-looking internet history nerds standing around, so we awkwardly introduced ourselves and discussed whether we were in the right place. Eventually, a couple of people at the end of the street, about 200 yards away, spotted us and waved us over.
It turned out a group of people had already made themselves comfortable inside the main building, drinking complimentary cokes, beers or mineral water, and eating finger food. The crowd was a mix of older people (perhaps from the generation that worked in Silicon Valley during the 1960s and 70s) and younger geeks (my guess is that many were either librarians or professional webheads — me being an example of the latter).
When the tour began about half an hour later, thirty or forty people gathered in front of an enthusiastic red-shirted man with thinning gray hair. He was, of course, the founder of the Internet Archive, Brewster Kahle. At first, I was surprised he would be conducting the tour himself, but it soon became clear that Kahle lives and breathes the mission of the Internet Archive. He began by showing us the shipping containers full of old books and other materials, while reeling off some facts (“the Internet Archive is a nonprofit library; we started it 27 years ago, 1996”).
Later in the tour, Kahle eagerly demonstrated the book-scanning machine, pointed out stacks of boxes gifted to the archive (full of books, videos, disks, records, cassettes, and other media), and stood to the side proudly while his film archivists told us how they convert vintage home videos into high-res digital files. It was a fascinating look into the day-to-day operations of the Internet Archive, which is staffed by a number of friendly and probably liberal-minded Californians — including Brewster’s son, Caslon.
What the Internet Archive Stores
The Internet Archive is perhaps best known for its Wayback Machine, which debuted in 2001 and has been archiving web pages since 1996. “We collect about a billion URLs every day, just kind of an astonishingly large number,” said Kahle during his tour. “And it’s now two and a half trillion URLs in the Wayback Machine collection, these old web pages. And it’s queried about six or seven thousand times a second.”
But the physical archive, as its informal name suggests, is a repository of physical media — books, catalogs, old computer disks, film, records and cassette tapes, and much more. When a new piece of media comes in, the Internet Archive staff first decide whether it’s a duplicate of something they already have — a process they call “deduping.” If it is a dupe, it’s discarded or given away. If not, it is digitized and then the physical item is stored. (As an aside, the Internet Archive says it only makes available digital copies of a book if it owns the physical copy.)
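The intake flow described above can be sketched in a few lines of Python. This is only an illustration. The `Item` type, the function names, and the matching rule (a normalized title plus media type) are hypothetical, since the tour did not cover how the Internet Archive actually identifies duplicates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    title: str
    media_type: str  # e.g. "book", "78rpm", "cassette"


def catalog_key(item: Item) -> tuple[str, str]:
    # Hypothetical matching rule: title compared without regard to case
    # or extra whitespace, combined with the media type.
    return (" ".join(item.title.lower().split()), item.media_type)


def intake(item: Item, holdings: set[tuple[str, str]]) -> str:
    """Return the action taken for a newly arrived item."""
    key = catalog_key(item)
    if key in holdings:
        # Duplicate of something already held.
        return "discard or give away"
    # New to the collection: record it, then digitize and store.
    holdings.add(key)
    return "digitize, then store"
```

Calling `intake` twice with the same title and media type returns "digitize, then store" the first time and "discard or give away" the second; the same title in a different media type counts as a new item under this assumed rule.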
“We’ve been digitizing books now since the early 2000s,” said Kahle, “and we ended up building our own book scanners.” He added that the IA digitizes “about a million books a year” and has digitized on the order of 7 or 8 million books in total (on its about page, the IA says it has “41 million books and texts,” so the majority of those must be text items other than books).
As for music, it’s a media type that has historically had multiple formats — LPs, CDs, cassettes, MP3, etc. Kahle was particularly enthusiastic about 78 RPM records, which he said were around from about 1900 to 1950. “There are maybe 2 or 3 million of them,” he said, “[and] we’ve digitized about 450,000.”
“We’re trying to basically do all the media types,” continued Kahle. “And what I’ve been finding is that the time that […] things have been becoming obsolete, it’s happening faster and faster. […] Not only do you not have access to the same things; even if you have access, it’s not presented to you in such a way that you actually use it.”
Note: If you’re interested in donating items to the Internet Archive, check this web page for a list of media types it is currently accepting.
How the Internet Archive Keeps Going
Someone in the tour group asked Kahle how often the IA needs to buy new servers, to store this constant influx of new media.
“Continuously,” he replied. “We buy a new rack pair — because it always comes in a pair — every two months [or] three months. […] In one rack, you can put around five petabytes now.”
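Taking Kahle’s figures at face value, a rough calculation gives a sense of scale. The assumptions here are mine rather than his: that each rack in a pair holds about five petabytes, and that the purchase interval stays between two and three months. The result is raw capacity bought, not net growth of the collection, since some of it may go to redundancy or to replacing older hardware.

```python
PB_PER_RACK = 5          # "around five petabytes" per rack, from the quote
RACKS_PER_PURCHASE = 2   # racks are bought in pairs


def petabytes_per_year(months_between_purchases: float) -> float:
    """Raw capacity purchased per year at a fixed purchase interval."""
    purchases_per_year = 12 / months_between_purchases
    return purchases_per_year * RACKS_PER_PURCHASE * PB_PER_RACK


low = petabytes_per_year(3)   # one rack pair every three months
high = petabytes_per_year(2)  # one rack pair every two months
print(f"roughly {low:.0f} to {high:.0f} PB of new raw capacity per year")
```

Under those assumptions the estimate comes out at roughly 40 to 60 petabytes a year.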
Of course, the IA has been in the news this year because of legal attacks from both the book publishing industry and the music industry (the latter regarding the 78 RPM records project). Kahle made several sniping comments about these legal challenges during the tour, but it was clear they had taken a toll on the IA. “That’s still going through the courts,” he sighed, regarding the book publishers’ lawsuit, “and it’s incredibly expensive.”
So how does the IA survive? Kahle said that the IA runs mainly on donations, from 110,000 individuals averaging about $5 per person, as well as “foundations giving us serious amounts of money.” The IA also offers subscription services to libraries and other organizations.
“We also survive by, well, not spending a lot,” he added. “I mean, you notice the servers don’t have any air conditioning, right? If it gets hot, we just open the windows. So, it’s green. But it’s also inexpensive.”