DNA: The Long-term Data Storage Format that Will Never Go Obsolete
Digital archivists have been worried about the ephemeral nature of digital storage for some time now. How can you trust vital documents to a storage technology that will most likely be obsolete within a decade or two? Now some researchers are investigating the use of nature’s own digital storage mechanism, DNA (deoxyribonucleic acid), for long-term data retention.
At the Linux Foundation’s Vault storage conference, held last week in Raleigh, North Carolina, European Bioinformatics Institute (EBI) researcher Nick Goldman talked about the feasibility of using DNA as a long-term storage format, a talk timely not only because it was at a storage conference, but also because Monday is DNA Day.
“We’re talking really long-term. It’s not just about making storage available tomorrow, but making it available in a year’s time, or in 10 years time, a 100 years time or a 1,000 years time,” Goldman said.
DNA, he explained, has some natural advantages as a storage format. It is compact. It lasts a long time. It’s inexpensive to maintain. Best of all, DNA is the hereditary material for humans, which means, as long as humans are alive, there will be DNA readers.
“It will never be obsolete,” Goldman said of DNA. “Every couple of years, a new device comes along. But the medium will always be readable, as long as there are humans who are concerned about their own health.”
Essentially DNA is made of four constituent molecules, nucleotides known in shorthand as A, C, G, and T. In genomes, they are tethered together in a chain molecule stretching into billions of bases, like Lego blocks. The genetic code is captured by the ordering of the letters.
“Running through the 3.5 billion years of evolution on earth, DNA is being used in the genomes to hold the information about all the processes and molecules needed for every cell in the living organism,” Goldman said.
The double helix arrangement of the genome provides a sturdy copying format as well. It consists of two complementary chains bound together, which provides stability and an easy way to copy the DNA information. “Break the helix open, use each side as a template to make a new copy, and now you’ve doubled the number of copies you have,” Goldman said.
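The template-based copying Goldman describes can be sketched in a few lines. This is only an illustration, assuming the standard base pairing (A with T, C with G) and ignoring strand direction; the function names `complement` and `replicate` are made up for this example.

```python
# Standard base pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the strand that would pair with `strand`, base by base."""
    return "".join(COMPLEMENT[base] for base in strand)


def replicate(strand_a: str, strand_b: str):
    """Break the helix open and use each side as a template for a new partner."""
    helix_1 = (strand_a, complement(strand_a))
    helix_2 = (complement(strand_b), strand_b)
    return helix_1, helix_2


original = "ATGCCGTA"
partner = complement(original)
copy_1, copy_2 = replicate(original, partner)
print(copy_1)
print(copy_2)
```

Because each strand fully determines its partner, both resulting helices carry the same information as the one that was split.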
We’ve developed good technologies to read genomes. A new generation of genome-reading machines small enough to be held in the hand is coming to market soon. Put a DNA sample into a solution, put a drop of the solution in the machine, and then get the results back through a USB connection to a computer. Writing DNA is not as cost-effective yet (current estimates run about $12,000 per MB), but if it is scaled up to mass production, those costs should come down as well.
In theory, there is no reason why any binary code couldn’t be represented as a series of DNA fragments. To test DNA as a storage mechanism, Goldman’s group came up with a basic code to convert any binary set of 0s and 1s so they can be represented in a series of DNA fragments. The encoding algorithm, Goldman admitted, was “real undergrad stuff,” devised just to show the encoding would be possible.
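To make the idea concrete, here is a minimal sketch of one possible binary-to-DNA mapping. It is not the code Goldman's group used; the two-bits-per-base table and the function names are assumptions chosen purely for illustration.

```python
# Hypothetical mapping: each pair of bits becomes one base.
# 00 -> A, 01 -> C, 10 -> G, 11 -> T
BASES = "ACGT"


def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as four bases, most significant bit pair first."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)


def dna_to_bytes(dna: str) -> bytes:
    """Reverse the mapping: every four bases become one byte."""
    if len(dna) % 4 != 0:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(dna), 4):
        byte = 0
        for base in dna[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


sequence = bytes_to_dna(b"Hi")
print(sequence)                # CAGACGGC
print(dna_to_bytes(sequence))  # b'Hi'
```

A naive mapping like this can produce long runs of the same base (for example, a zero byte becomes `AAAA`), which matters for the reading problems described later in the article.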
They encoded onto DNA fragments a number of different digital objects, including a photograph of the institute’s home campus, an audio file of Martin Luther King’s “I Have a Dream” speech, a PDF copy of James Watson and Francis Crick’s famous paper describing the helical structure of DNA, and a text copy of all of Shakespeare’s sonnets. They were then able to decode the fragments and recover all this information perfectly.
This experiment was, to Goldman’s knowledge, the first time information was encoded on DNA and successfully decoded with no errors.
It was all ones and zeros, of course, but Goldman’s biologist peers were amazed nonetheless. “You can store pictures. You can store sound as well? This is amazing!” they told Goldman.
The DNA, which encoded less than a single MB of data, was written by a company called Agilent. It arrived at Goldman’s office in a test tube and appeared to be no more than a film of dust around the inside of the container. “DNA itself at room temperature is just dust. It is almost invisible in the quantities we were working with,” Goldman said.
Similar work was going on at the same time. A Harvard University research group, working with Technicolor, encoded one of the first science fiction movies ever made, 1902’s “A Trip to the Moon.” Researchers from the University of Washington and Microsoft recently encoded a series of pictures, many of them of cats, on DNA.
Goldman’s team did encounter some challenges. The readers had some difficulty interpreting fragments that contained long runs of the same nucleotide, so the encoding algorithm was reworked so that there would be no repeating elements. The researchers also invented an indexing system to establish the order in which the DNA fragments would have to be read.
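One way to get both properties is sketched below. It assumes "no repeating elements" means no two adjacent bases are identical, and it is an illustration rather than the team's actual algorithm: data is rewritten in base 3, and each base-3 digit selects one of the three nucleotides that differ from the previous one. The one-byte index prefix, the chunk size, and all function names are hypothetical.

```python
BASES = "ACGT"
TRITS_PER_BYTE = 6  # 3**6 = 729, enough to cover all 256 byte values


def byte_to_trits(byte: int) -> list:
    """Write one byte as six base-3 digits, most significant first."""
    trits = []
    for _ in range(TRITS_PER_BYTE):
        byte, trit = divmod(byte, 3)
        trits.append(trit)
    return trits[::-1]


def encode(data: bytes, prev: str = "A") -> str:
    """Each trit picks one of the three bases that differ from the previous base."""
    out = []
    for byte in data:
        for trit in byte_to_trits(byte):
            choices = [b for b in BASES if b != prev]
            prev = choices[trit]
            out.append(prev)
    return "".join(out)


def decode(dna: str, prev: str = "A") -> bytes:
    """Recover the trits from the base-to-base transitions, then the bytes."""
    trits = []
    for base in dna:
        choices = [b for b in BASES if b != prev]
        trits.append(choices.index(base))
        prev = base
    out = bytearray()
    for i in range(0, len(trits), TRITS_PER_BYTE):
        value = 0
        for trit in trits[i:i + TRITS_PER_BYTE]:
            value = value * 3 + trit
        out.append(value)
    return bytes(out)


def to_fragments(data: bytes, chunk_size: int = 4) -> list:
    """Split data into chunks and prefix each with a one-byte index (max 256 fragments)."""
    fragments = []
    for index, start in enumerate(range(0, len(data), chunk_size)):
        fragments.append(encode(bytes([index]) + data[start:start + chunk_size]))
    return fragments


def from_fragments(fragments: list) -> bytes:
    """Decode fragments in any order and reassemble them by index."""
    decoded = sorted((decode(f) for f in fragments), key=lambda chunk: chunk[0])
    return b"".join(chunk[1:] for chunk in decoded)


message = b"\x00\x00\xff\xff hello DNA"
fragments = to_fragments(message)
shuffled = fragments[::-1]  # a test tube has no inherent ordering
print(from_fragments(shuffled) == message)
```

The price of avoiding repeats in this sketch is density: each base carries one of three choices instead of one of four.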
Goldman is part of an early-stage start-up that plans to commercialize the DNA data storage process. To do this on a commercial scale, write performance would still need to improve by three to four orders of magnitude. This is entirely feasible, given that DNA reading has improved by six orders of magnitude in the past 15 years.
More sophisticated encoding techniques would also be needed. Fountain codes would be a good match for the format, Goldman said.
“The idea here is that information is being spat out in a system is a bit like a fountain, and you set a bucket under there to catch some of the transmissions. If the code works well, you can reconstruct the whole message fairly quickly,” Goldman said. “This works well with the DNA model where you sequence some DNA strands until you got the whole message.”
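A toy version of the fountain idea can be written in a few dozen lines. This is a simplified sketch, not a production code and not anything Goldman's group has published: the degree distribution is an arbitrary assumption (real LT codes use a carefully tuned one), and the names `make_droplet` and `peel` are invented here. Each "droplet" is the XOR of a random handful of message chunks, and the receiver keeps collecting droplets until the whole message can be peeled out.

```python
import random


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def make_droplet(chunks, rng):
    """XOR a random subset of chunks; return (indices used, combined payload)."""
    degree = min(rng.choice([1, 1, 2, 2, 3, 4]), len(chunks))  # toy distribution
    indices = rng.sample(range(len(chunks)), degree)
    payload = bytes(len(chunks[0]))
    for i in indices:
        payload = xor_bytes(payload, chunks[i])
    return set(indices), payload


def peel(droplets, n_chunks):
    """Peeling decoder: return the chunk list, or None if more droplets are needed."""
    recovered = {}
    pending = [[set(indices), payload] for indices, payload in droplets]
    progress = True
    while progress and len(recovered) < n_chunks:
        progress = False
        for drop in pending:
            indices, payload = drop
            # Remove the contribution of chunks that are already known.
            for i in list(indices):
                if i in recovered:
                    payload = xor_bytes(payload, recovered[i])
                    indices.remove(i)
            drop[1] = payload
            # A droplet covering exactly one unknown chunk reveals that chunk.
            if len(indices) == 1:
                recovered[indices.pop()] = payload
                progress = True
    if len(recovered) < n_chunks:
        return None
    return [recovered[i] for i in range(n_chunks)]


message = b"DNA storage, caught in a bucket."
chunk_size = 4
chunks = [message[i:i + chunk_size].ljust(chunk_size, b"\0")
          for i in range(0, len(message), chunk_size)]

rng = random.Random(0)
bucket = []
result = None
while result is None and len(bucket) < 1000:
    bucket.append(make_droplet(chunks, rng))  # "sequence" one more strand
    result = peel(bucket, len(chunks))

print(len(bucket), "droplets collected")
print(b"".join(result)[:len(message)])
```

The decoder does not care which droplets it receives, only that it gets enough of them, which is what makes the approach a natural fit for sampling strands from a test tube.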
Wisdom for the Ages
Capturing man’s knowledge for the ages is a surprisingly difficult thing to do, and the digital age is not helping matters any.
The formula for making Damascus steel, a steel made in the Middle East and renowned for its strength, is largely lost to us now. “This in the past was a useful product, and we just lost the information,” Goldman said.
What other information do we have now that we’d want to keep for hundreds of years? Chances are we are not keeping it. How many of us still have readers for the floppy drives that we saved data on only 20 years ago? The longevity of optical disks may be only a few decades at most.
For longevity, you can’t beat the old-school recording formats, even if they don’t have the searchability or information densities offered by their newer counterparts. England still uses vellum, or calfskin, to record its laws. Goldman also offered the example of a series of stone slabs housed at the Temple of Confucius in Beijing, which were used in the 17th century to record poems and are still legible today.
DNA offers some appealing attributes for long-term storage. It is information-dense, so it takes up very little space. A test tube brimming with DNA would hold the equivalent of a million CD-ROMs.
Another distinct advantage to DNA is that it lasts for a really, really long time. We’ve sequenced DNA from woolly mammoths that died more than 20,000 years ago, from bison that died over 60,000 years ago and from horses that died several hundred thousand years ago. In all these cases, the DNA has degraded over time, but there was still enough left to reconstruct the entire genome.
“And those weren’t carefully prepared samples. They were just horses that dropped dead somewhere cold,” Goldman said. “You don’t need to keep rewriting it. It just sits there as long as you like.”
To keep DNA intact, it needs to be in a fairly dry and cool environment. Messages we want to save for thousands of years could be stored, for instance, in something like the Svalbard Global Seed Vault, an international collaborative project to store seeds to preserve biological diversity.
Copying DNA is not an inherently complex process, Goldman explained. You don’t need hundreds of disk drives, or anything to plug them into, to make a copy. Instead, you just subject a test tube of DNA to a polymerase chain reaction (PCR), a chemical process modeled on one that takes place within living cells. “You run it for two minutes and get two copies. Then you run it for another two minutes you get four copies then run it for another minute, and you get eight copies,” Goldman said.
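The doubling Goldman describes is simple exponential growth. The sketch below assumes an idealized reaction in which every cycle exactly doubles the number of copies; the function names are hypothetical.

```python
def copies_after(cycles: int, starting_copies: int = 1) -> int:
    """Number of copies after a given number of ideal doubling cycles."""
    return starting_copies * 2 ** cycles


def cycles_needed(target_copies: int, starting_copies: int = 1) -> int:
    """Smallest number of doubling cycles that reaches the target."""
    cycles = 0
    copies = starting_copies
    while copies < target_copies:
        copies *= 2
        cycles += 1
    return cycles


print(copies_after(3))           # 8, matching the two, four, eight in the quote
print(cycles_needed(1_000_000))  # 20 cycles for a million copies
```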
Best of all, as long as humans are interested in their well-being, DNA will never be a format that goes out of style.
“Because we will always be reading DNA. There will always be a reader,” Goldman said.