Scaling systems can be a mighty challenge, even using software built specifically for the task.
“Scaling is painful,” admitted David Greenberg, who described his experience implementing an Apache Mesos cluster manager at hedge fund investment firm Two Sigma in a talk at the QCon developer conference, taking place in New York this week. Greenberg now works as a consultant to Mesosphere, which offers commercial support for Mesos.
Five years back, Two Sigma needed to build a highly scalable system for testing new predictive models. After looking at software packages designed for distributed computing, such as Hadoop and OpenStack, the company settled on a Mesos container-based system and acquired a few hard-earned lessons on the way.
Why did a hedge fund need cluster computing? Turns out that predicting the future requires a lot of processing power. The company buys and sells financial instruments based on automated algorithms. In many cases the algorithms need to be shaped by machine learning, running through multiple iterations of the same problem to sculpt the best approach. The larger the simulation, the more accurately the model could be tuned.
A typical simulation job would require anywhere from 10 to 100GB of RAM and 1 to 20 CPUs, and could take anywhere from 15 minutes to a few hours.
Through much of the last decade, the company did this work using a Java Virtual Machine-based system that served researchers well. They would submit a job along with the set of resources, such as the number of CPUs, required to complete it. They also had access to a rich set of monitoring and querying tools to inspect their jobs as they were carried out.
The system worked very well for about a dozen researchers. As more joined the ranks of Two Sigma, and as their models grew more voluminous, the system started presenting some problems.
For one, not every researcher would get full and equal access to the cluster. At first, the cluster was available on a first-come, first-served basis, which some researchers gamed by arriving early in the morning and scheduling large jobs, which kept other jobs from running.
Another issue was workload isolation. Each JVM workload was set to use a predetermined set of resources. However, as researchers started bringing in richer financial toolsets, such as Python's pandas library, the amount of required memory started to spike, which led to the machines swapping processes in and out of memory, slowing their performance to a crawl.
So in 2012, the firm started looking for a way to scale operations more gracefully. At first, the engineering team looked at Hadoop, along with the then-newly released YARN scheduler. There was a dearth of documentation, however. And Greenberg had heard, through conference hallway chats, that Hadoop, despite its reputation as a big data platform, could cough up issues when scaling beyond a few hundred nodes. OpenStack was considered for its superior isolation capabilities, though this stack offered only rudimentary scheduling capabilities.
They had also looked at a number of schedulers for supercomputers, including Torque, Moab, and Slurm. At first, such software would seem to be a fit. After all, what are supercomputers but vast collections of individual nodes? But these schedulers were better suited to running large jobs across multiple servers, rather than running multiple smaller jobs on single nodes.
By 2012, both Airbnb and Twitter were using Mesos. Twitter, in fact, had a 20,000-node Mesos cluster in production. That was a good sign. Also, the software was supported by Mesosphere, which turned out to be invaluable for helping Two Sigma add new features, such as a command line interface and support for security.
“Mesos scaled really well, it supported batch jobs and containers,” Greenberg said.
The company had to do a lot of work to prepare for the new Mesos-based architecture, however. Many applications had to be refactored, which is a job that is both complex and offers little in the way of immediate benefits. But the work is important because it sets the stage for faster advancement.
The company needed more sophisticated scheduling algorithms than Mesos could offer through its Dominant Resource Fairness (DRF) allocator. So, the company extended DRF with Cumulative Resource Share (CRS), embedded into a home-built scheduler called Cook, which dynamically adjusts demands against resources.
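To make the DRF idea concrete, here is a minimal Python sketch of its core rule: each user's "dominant share" is the largest fraction of any single cluster resource that user holds, and the next offer goes to the user with the smallest dominant share. This illustrates the general technique only; it is not Mesos's allocator or Two Sigma's Cook or CRS code, and the user names, capacities, and function names are all hypothetical.

```python
from typing import Dict

Resources = Dict[str, float]


def dominant_share(allocated: Resources, capacity: Resources) -> float:
    """Return the largest fraction of any one cluster resource a user holds."""
    return max(allocated.get(name, 0.0) / total for name, total in capacity.items())


def next_user(allocations: Dict[str, Resources], capacity: Resources) -> str:
    """Pick the user with the smallest dominant share (ties broken by name)."""
    return min(
        sorted(allocations),
        key=lambda user: dominant_share(allocations[user], capacity),
    )


# Hypothetical cluster and users, for illustration only.
capacity = {"cpus": 100.0, "mem_gb": 1000.0}
allocations = {
    "alice": {"cpus": 20.0, "mem_gb": 100.0},  # dominant resource: CPUs (20%)
    "bob": {"cpus": 5.0, "mem_gb": 400.0},     # dominant resource: memory (40%)
}

if __name__ == "__main__":
    for user, held in allocations.items():
        print(user, dominant_share(held, capacity))
    print("next offer goes to:", next_user(allocations, capacity))
```

In this example the memory-heavy user is treated as holding the larger share even though that user runs on far fewer CPUs, which is the kind of multi-resource fairness a first-come, first-served queue cannot provide.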
Also, they learned that standard debugging techniques don’t really work for distributed systems. When something goes wrong, it’s not like you can just SSH into a machine and scan the logs, Greenberg pointed out. Monitoring is needed.
For this job, Two Sigma built its own monitoring system, called Satellite (named after the indie-rock band Six Finger Satellite). For the base, they turned to Riemann, which Greenberg described as a “Swiss Army Knife for connecting to many sources.” You can script specific actions with Riemann, which allowed the development team to set alerts, such as one that fires when more than 20 percent of the machines are doing heavy memory swapping. The company uses PagerDuty to alert administrators during off-hours. They also turned to the Elasticsearch search engine and its Kibana visualization package for visualizations, ad-hoc queries and dashboards.
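The swapping alert described above boils down to a threshold check over the whole cluster. The following Python sketch shows only that logic, assuming some collector has already flagged each host as swapping heavily or not. It is not Riemann configuration (Riemann is scripted in Clojure) and not Satellite's code; the host names and function names are hypothetical.

```python
from typing import Dict

# Threshold from the talk: alert when more than 20 percent of machines swap heavily.
SWAP_ALERT_FRACTION = 0.20


def swapping_fraction(heavy_swap_by_host: Dict[str, bool]) -> float:
    """Fraction of hosts currently flagged as swapping heavily."""
    if not heavy_swap_by_host:
        return 0.0
    return sum(heavy_swap_by_host.values()) / len(heavy_swap_by_host)


def should_alert(
    heavy_swap_by_host: Dict[str, bool],
    threshold: float = SWAP_ALERT_FRACTION,
) -> bool:
    """True only when the swapping fraction strictly exceeds the threshold."""
    return swapping_fraction(heavy_swap_by_host) > threshold


if __name__ == "__main__":
    # Hypothetical snapshot: 3 of 10 hosts are swapping heavily.
    snapshot = {f"node-{i}": i < 3 for i in range(10)}
    print(swapping_fraction(snapshot), should_alert(snapshot))
```

Alerting on the cluster-wide fraction rather than on each host keeps one misbehaving machine from paging an administrator, while still catching the cluster-wide memory pressure that had slowed the earlier JVM-based system.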
Not So Fast…
Despite Greenberg’s glowing recommendation, not everyone at QCon was so enamored of Mesos.
In another talk, Chien Huey, a DevOps engineer at the XO Group media company, walked through an evaluation of DCOS (Data Center Operating System), an open-source package combining Apache Mesos and the Marathon orchestration framework, both maintained by Mesosphere.
The XO Group was looking for an alternative to running containers (within virtual machines) in Amazon Web Services’ Elastic Beanstalk service, the costs of which were mounting for the company, as were worries about lock-in. DCOS had some limitations, however.
There were many good things about DCOS, Huey noted. The software makes it really easy to set up a cluster, providing a near turn-key experience. It also provides easy installation of complex applications, such as Apache Cassandra or Jenkins.
But Huey found that the software also fell short of what the XO Group was looking for. Huey found no way to automate container operations that used private container registries. Mesosphere does offer a manual workaround for working with private registries, so Huey developed a script that would refresh the needed encryption key every 12 hours.
Huey was also not impressed with Mesosphere’s approach to auto-scaling, which involves installing an agent on the master node. “What if that node goes down?” he asked. Other companies have found workarounds, however. Netflix, for instance, has developed Fenzo to help with this issue.
Mesosphere is a sponsor of The New Stack.