There’s a lot of data stored on Facebook, and much of it is its users’ own content. That content is the most important asset on the service, and users need to believe it’s secure; otherwise they won’t share. Getting storage right is critical – and it’s helping define how Facebook designs its data centers.
In the early days, says Facebook storage engineer Jeff Qin in a talk at the Storage Visions conference, the storage system grew on standard filers, which needed 10 I/O operations to save a photo and wasted several more on directory traversals. Company hack days added features like photo-tagging, but the first real change to the service was the deployment of its own Haystack storage service in 2010. A RAID-6 storage service with global replication for photographs, Haystack uses a single IOP per photo request.
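The trick behind that single IOP is well known: instead of one file per photo, append photos to one large log file and keep a small in-memory index of where each photo lives. As a rough illustration only – the class and names below are invented for this sketch, not Facebook’s actual Haystack code – the idea looks like this:

```python
# Illustrative sketch of a Haystack-style photo store (hypothetical
# names; not Facebook's implementation). Photos are appended to a
# single log file, and an in-memory index maps photo ID to
# (offset, size), so a read is one seek-and-read rather than a chain
# of directory lookups.
import os

class PhotoStore:
    def __init__(self, path):
        self.path = path
        self.index = {}           # photo_id -> (offset, size)
        open(path, "ab").close()  # create the log file if missing

    def put(self, photo_id, data):
        with open(self.path, "ab") as f:
            offset = f.tell()     # append position = current end of log
            f.write(data)
        self.index[photo_id] = (offset, len(data))

    def get(self, photo_id):
        offset, size = self.index[photo_id]
        with open(self.path, "rb") as f:
            f.seek(offset)        # the single I/O per photo request
            return f.read(size)
```

A filesystem of individual photo files would pay for directory and inode lookups on every request; here the index lives in memory, so the disk only ever does the one read.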
With photo storage one of the biggest demands on the service, it’s important for Facebook to understand just what users are doing with photos, as how they’re viewed and shared determines how photo storage needs to be designed. It turns out that photograph access cools down quickly, with initial views ten times higher than after 18 months.
Keeping all photos in expensive, fast storage quickly becomes a waste of performance, requiring unnecessary power and cooling.
Understanding that led to a redesign of Facebook’s storage architecture: the Haystack service keeps hot photos, while F4, a slower tier, provides warm storage for images that haven’t quite stopped being viewed, before the extra copies are deleted and the images move to an offline cold storage facility. This way there’s performance for the short period when friends and family are sharing images with each other – usually after an event – and a focus on durability in the long term, when it’s likely to be just you wanting to access old images. If an image becomes popular again, it can be moved out of the cold storage tier for a short period, until its renewed popularity fades.
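A policy like that boils down to a simple routing decision per photo. The sketch below is purely illustrative – the thresholds, view counts and tier names are assumptions for the example, not figures Qin gave:

```python
# Hedged sketch of a hot/warm/cold tiering decision. The 90-day,
# 18-month and 100-views-per-week thresholds are invented for
# illustration; only the three-tier shape comes from the article.
def pick_tier(days_since_upload, views_last_week):
    """Route a photo to hot (Haystack), warm (F4) or cold storage."""
    if views_last_week > 100:
        return "hot"    # a suddenly popular old photo moves back up
    if days_since_upload < 90:
        return "hot"    # fresh uploads stay on fast storage
    if days_since_upload < 540:
        return "warm"   # ~18 months: views have cooled tenfold
    return "cold"       # offline archive, durability over speed
```

The point of the exercise is that the decision is cheap to make and to revisit, so photos can drift down the tiers as they cool and jump back up if they become popular again.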
Facebook’s cold storage facility is very different from a traditional data center. Instead of being designed for performance, it’s designed above all to be energy efficient and for what Qin describes as “great data durability”. That efficiency means no UPS, no redundant power supplies and no generators; just rack after rack after rack of storage. The facility was built from scratch in eighteen months, right down to the custom software it runs. It’s also a lot quieter than a typical data center, as the service spins down unused drives to save energy.
Custom servers built using designs donated to the Open Compute Project currently fill Facebook’s data centers. With its own hardware for storage, compute, and networking, it’s now focusing on new designs that will let it scale for another billion users.
That’s where the “Disaggregated Rack” comes into play. Instead of servers that bundle compute, memory, flash storage and HDD storage in one box, Facebook’s disaggregated server model splits those components across separate racks, allowing it to tune them for specific services and to use what Qin calls “smarter hardware refreshes” to extend useful life. Separating the resources means mixes of compute, memory and storage from different racks can be combined – for example, to deliver a set of servers that can run Hadoop. As loads and usage change, the balance of components that power a service can be changed, keeping inefficiencies to a minimum.
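One way to picture disaggregation is as a bin-packing question: how many logical servers of a given shape can the shared component pools back? The numbers and service profiles below are invented for the sketch – the article gives no figures – but they show why decoupling the ratios matters:

```python
# Illustrative sketch of the disaggregation idea: logical servers are
# assembled from shared pools of components rather than fixed boxes.
# All pool sizes and service profiles here are hypothetical.
POOLS = {"cpu_cores": 10_000, "ram_gb": 80_000, "hdd_tb": 50_000}

# Per-server resource mix each service wants (invented numbers):
# a Hadoop node is storage-heavy, a cache node is memory-heavy.
PROFILES = {
    "hadoop": {"cpu_cores": 16, "ram_gb": 128, "hdd_tb": 48},
    "cache":  {"cpu_cores": 8,  "ram_gb": 256, "hdd_tb": 0},
}

def max_units(service):
    """How many logical servers of this shape the pools can supply."""
    needs = PROFILES[service]
    return min(POOLS[r] // n for r, n in needs.items() if n > 0)
```

With fixed one-size-fits-all boxes, the cache service would strand every disk in its servers; with pooled components, those disks stay available to Hadoop, and a hardware refresh can replace just the CPU racks while the storage racks live on.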
Qin notes that the key to this approach is faster networking, with the latest technologies used to build Facebook’s first fabric-based data center in Iowa. The system is designed to work at speeds up to the network card line rate – though it’s not yet operating at that speed, as the service doesn’t need the bandwidth. Qin expects this approach to extend the life of storage modules, as Facebook can swap out memory and CPU on a different, faster, schedule.
But what does Facebook want from storage vendors? Qin made a surprising point:
“What Facebook wants is cold flash.”
In order to support its usage model, which has minimal writes and a lot of reads, Facebook wants flash vendors (though as Qin points out, any other solid state memory technology will do) to deliver what he describes as “The worst flash memory possible. Dense and low cost. Long writes, low endurance and lower IOPS/TB are all OK.”
Cold flash isn’t the only technology Facebook is investigating for its storage requirements. It’s also looking at using optical disks as part of its cold storage archive. There are plenty of good reasons for using optical storage archives: it’s persistent, long lasting and dense. Qin talked about disks with 10 to 15 year life expectancies, and densities of over 1PB in a rack – and 2 to 4PB in a rack in a few years. “We need to be agnostic to media types for cold storage,” he said.
Building web-scale storage platforms is a complex task, and Facebook’s lessons – and its future directions – aren’t for everyone. The most useful thing to learn from Facebook is not to assume you need the same storage architecture as everyone else: look at what actually matters for what you’re building and, no matter how many standard principles it contradicts, put together an architecture that gives you what you need.
Feature image via Flickr Creative Commons.