When Scaling Data Infrastructure, Sometimes Greed Is Good
The best tech debates are never truly settled. Think tabs vs. spaces, emacs vs. vim. To those we can now add the argument about the best way to scale data infrastructure. Should you meet demand by adding more resources and extending your footprint (scale-out), or increasing performance within your existing footprint (scale-up)?
For the longest time, it wasn’t even a question. Scale-up ruled by default. Improving performance meant adding memory, buying heftier CPUs or plugging in faster, more efficient storage arrays.
Google changed everything in 1998 by running what quickly became the world’s most popular Web application on racks of cheap commodity servers. If it needed to scale, it just slapped more servers into a data center. When one of its servers stopped working, Google’s architecture routed around the damage, much like the Internet itself.
Unlike scale-up, this scale-out approach had no theoretical limit. At some point, you couldn’t buy a faster CPU, but you could always add another server. Maybe it wasn’t the absolute best and fastest. Maybe it wasn’t fully utilized. But it was there to field the traffic when you needed it.
Compare this with AltaVista, one of the most popular early Web search engines, conceived as a showcase for DEC’s 64-bit Alpha processors. Today it’s a footnote in the history of the Internet. Google and its horizontal scalability buried its vertically scaled competitor.
Other developments swung the pendulum even further toward scale-out. Java, introduced in 1995, and containers (such as the 2001 Linux VServer, and other developments that led up to Docker in 2013) made code easily portable. Google, Facebook and Yahoo open-sourced their internal technology for lashing computing resources together. Virtualization blurred the line between servers and software, while public clouds like AWS, Azure and GCP made it easy to spin up an infinite number of virtual machines.
According to the new conventional wisdom, scale-up was for suckers. Smart, forward-thinking IT shops were the ones where everything was virtualized, distributed and software-defined — whether it made sense or not. One group of Microsoft researchers observed, “In today’s world nobody gets fired for adopting a cluster solution.” In 2015, a Harvard Business Review article could pose the question, “Does hardware even matter anymore?”
The thing about pendulums is: they swing back. More than a decade into the cloud era, the additional costs, trade-offs and dangers involved with scale-out are becoming hard to ignore, while new technologies are making scale-up strategies more viable. Scale-out isn’t going anywhere, but there’s every sign that scale-up is primed for a comeback.
The most obvious downside to scale-out architectures is performance. By definition, scaling up with high-performance hardware provides better performance server-for-server than scaling out with commodity hardware. Meanwhile, software has grown far more effective at tapping the resources available to it. Back in the early 2000s, tools for exploiting multiple CPU cores were still in their infancy, so only the most technologically sophisticated organizations could take full advantage of the latest and greatest hardware. For most, it was easier to just spin up another server. But the infrastructure for getting the most out of multicore processors, NUMA architectures, hyperthreading and multithreading has vastly improved, and at this point scale-up offers an unmistakable performance advantage over scale-out.
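The utilization point is easy to see in miniature. Here is a minimal sketch (in Python, purely illustrative; the `burn` workload and the idea of one task per core are assumptions, not anything from a specific product) of what "keeping every core of a dense node busy" looks like:

```python
import multiprocessing as mp

def burn(n):
    """A stand-in CPU-bound task (hypothetical workload)."""
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    cores = mp.cpu_count()  # modern dense nodes report 32, 64 or more
    with mp.Pool(processes=cores) as pool:
        # One chunk of work per core: scale-up only pays off if
        # the software can keep all of them busy at once.
        results = pool.map(burn, [100_000] * cores)
    print(f"{cores} cores, {len(results)} results")
```

Frameworks that pin a shard of work to each core follow the same basic shape; software that can't do this leaves most of a big box idle, which is exactly the problem early-2000s tooling had.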
“People all too often simply accepted the pain and suffering their large horizontal-scale deployments were causing.”
Similarly, in the move to commoditization, Java was put forth as the solution to portable, run-anywhere deployment. But Java turned out to be far from the portability panacea it was purported to be. While it made deploying code much easier, it also bottlenecked performance. For demanding applications, running in the Java Virtual Machine turned out to be a positive liability. For databases, it meant lower speeds, unpredictable latency spikes (most notoriously from the dreaded garbage collection pauses) and a cap on how far you could scale. Beyond a certain point, the limits of the JVM became the limits of your application. The developers of Opera discovered this when they tried to keep millions of browsers in sync using Cassandra, with its Java-based code. No matter how far they scaled out their clusters, they couldn’t keep processes from dropping and constant garbage-collection issues from cropping up.
The moral: scale-out may be great at spinning up new resources, but it doesn’t do you any good if the resources don’t perform the way you need.
An even more serious objection to scale-out is the burden of management. Scaling out means multiplying the number of entities that need tending. If you think of each node as a spinning plate and maintenance as the way you keep the plates spinning, it’s obvious that the more plates you have in the air, the more likely one will come down. Even with techniques like multiple replicas and global distribution, losing nodes is painful.
Will a crash be fatal? Probably not. Scale-out architectures are designed for failure. That doesn’t mean the cost is zero. When the plates come down, there’s still a lot of ceramic on the floor. To use another metaphor, building a scale-out architecture is like signing up for a constant low-grade fever. You aren’t entirely bedridden, down and out for the count, but every day is filled with headaches and pain. With hundreds or even thousands of nodes to manage, the laws of probability mean that some device in the cluster is nearing or already past its MTTF. Either that, or Garbage Collection has caused a node to stall (again). As the original Cassandra white paper itself explained, “Dealing with failures in an infrastructure comprised of thousands of components is our standard mode of operation; there are always a small but significant number of server and network components that are failing at any given time. As such, the software systems need to be constructed in a manner that treats failures as the norm rather than the exception.”
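The “laws of probability” claim is easy to quantify with a back-of-envelope sketch (the 0.1% per-node daily failure probability below is an assumed, illustrative figure, not a measured one):

```python
def p_any_failure(nodes: int, p_node: float) -> float:
    """Probability that at least one of `nodes` independent nodes
    fails, given each fails with probability `p_node`."""
    return 1 - (1 - p_node) ** nodes

# Assume a 0.1% chance that any given node fails on a given day.
for n in (10, 100, 1000):
    print(f"{n:>4} nodes: {p_any_failure(n, 0.001):.1%} chance of a failure today")
```

At 1,000 nodes, even that modest 0.1% per-node rate makes a failure somewhere in the cluster more likely than not on any given day (roughly 63%), which is why the Cassandra paper treats failure as the norm rather than the exception.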
While there is some core wisdom to this — failure can be stochastically modeled and architected around — in practice, people all too often simply accepted the pain and suffering their large horizontal-scale deployments were causing, even as their system’s “temperature” continued to rise to dangerous levels. Node bounces. Failed restores. Constant tweaks and diving saves.
Then there’s security, the real eye-opener. Scaling out means increasing your attack surface, which creates more opportunities for the bad guys. The more nodes you need to monitor, the harder they are to defend. That’s just intuitive.
But scale-out creates even more insidious security problems that we’re only now beginning to appreciate. Due to flaws in Intel’s CPU architecture, tens of millions of processors are vulnerable to so-called “side-channel” attacks like Meltdown, Spectre and ZombieLoad. These exploits allow rogue processes to spy on other processes through a sort of peephole in the CPU. Good software design doesn’t help: the flaw is in the hardware itself, so any shared architecture is vulnerable. The problem is masked, but not eliminated, by containers, VMs and multitenant software design. Many scale-out servers are, in fact, dense nodes (32 CPUs and up). With cloud deployments and virtualization, you may not even realize you are on a large shared cluster, leaving you vulnerable to neighbors you don’t even know you have.
The best defense against side-channel exploits is to run single-tenant processes on bare-metal servers, thereby minimizing your exposure to noisy (and potentially nosy) neighbors. The trouble is, public cloud providers encourage multitenancy to get full use out of each server. Many critical applications are also now designed to scale out rather than up, so they get little benefit from running on a big bare-metal box. To make matters worse, any applications written in Java will use only a fraction of the server’s full resources, creating yet another incentive for server virtualization to soak up those extra resources.
So you have vulnerable hardware, unoptimized applications and cloud providers that want to pack as many users as possible onto their servers — in other words, perfect conditions for side-channel attacks.
The problem isn’t going away. A new generation of CPUs may block the latest exploits, but shared architecture will always be vulnerable to new ones. For their own protection, organizations will have to stop scaling out by default, which in turn will make “greedy” applications more attractive for their ability to use every ounce of available compute.
To be clear, scale-out is popular for good reason, but scale-up is also here to stay. Core counts are only going up over time. Users should not shackle their big data architecture to any system that cannot take full advantage of the raw processing power and capacity of denser nodes. By having fewer, denser nodes, you also minimize your attack surface and lower your administration costs. And if getting every ounce of productivity from those big boxes is important to you, pause and consider whether virtualization is the right thing to be doing with those dense nodes.
The real point is: you shouldn’t have to pick one or the other. That’s a fool’s choice. The data architectures you deploy today should be able to scale up and out. Scale up to get maximum efficiency and performance out of each box you deploy, and scale out by deploying as many boxes as you need. If, in this day and age, you are choosing technology that forces you to do one or the other, and doesn’t support both, it’s time to look for new data infrastructure.
Feature image via Pixabay.