Twitter’s IT Chargeback System Sets the Stage for End-to-End Service Lifecycle Management
Twitter is finding profit in trusting the old reporter’s adage, “follow the money.”
Having moved to a microservices-based service architecture, the company has set up a chargeback system that lets the managers of services know how much it costs the company to run those services, while also providing valuable usage information to the managers of the infrastructure that is used to run these services. The upshot is that the chargeback system is helping Twitter improve the utilization and efficiency of its infrastructure.
“When we started this, there was a lot of skepticism as to how this would work,” admitted Twitter cloud platform manager Micheal Benedict, who presented the system at the LinuxCon North America 2016 conference, held last month in Toronto. Benedict is part of the platform engineering team, which is responsible for building out the frameworks, libraries and services that make up Twitter.
Since the launch of the chargeback system, the company has seen a 33 percent increase in cores utilized against cores reserved across the company’s compute platform. Over time, the chargeback system will incorporate all of Twitter’s computational resources.
The feedback the system produces each month also incentivizes engineers to make their systems more efficient. “The natural response of the engineers is to build services efficiently,” explained Twitter software engineer Vinu Charanya, who also gave the presentation.
Beyond the immediate utility of fiscally tracking resource utilization, the system is also setting the stage for a cloud management platform that could run Twitter services on multiple public clouds as well as internal ones, using the metadata of the chargeback setup to rationalize everything back to the project manager.
As far as new technologies go, IT chargeback is not sexy. But if enterprises do adopt microservice architectures, then precise consumption billing for the resources on a per-department level will be a must. And Twitter is finding that such close accounting could lead to other significant benefits as well.
The Twitter Gear
At the time, Twitter was built on Ruby on Rails, and, at its peak, could serve 3,200 tweets per second (a small percentage of today’s Twitter traffic). During the crush of World Cup activity, the company added more servers, and engineers furiously tweaked the tuning of the Ruby VM and deployment tooling. Still, the site went down multiple times during the surge.
In the subsequent re-architecture, routing, presentation and logic were all uncoupled from one another. Different features of Twitter were teased apart, with each standing alone as a discrete service.
“We built services that focused on specific aspects of the Twitter product,” Benedict said. Overall, there are “tens of thousands” of services running on Twitter these days.
The part of the site that shows which users a user follows is a stand-alone service, for instance. In this way, if the who-follows-who section goes down, the whole site would not be impacted. Rather, just that section would be unavailable.
By its very nature, Twitter has extremely spiky and unpredictable traffic, which surges whenever a global news event occurs and is discussed by the service’s hundreds of millions of users, explained Twitter Vice President of Engineering Chris Pinkham, in an earlier presentation caught on YouTube. On an average day, the entire Twitter infrastructure may generate over a trillion transactions, as each individual transaction, such as posting a Tweet, may result in multiple internal system transactions.
To run these services, the company maintains a set of core infrastructure services for both compute and storage resources. On the compute side, there is Apache Hadoop for batch analysis jobs, and Aurora and Mesos to orchestrate long-running services. Mesos is used for host management and Aurora is used for job scheduling. On the storage side, the company uses Manhattan, a general-purpose, low-latency, high-throughput data store.
With Twitter divided neatly up into a set of discrete services, the next step for the company was to better understand how much of the infrastructure each of the services is using.
Enter IT chargeback. Chargeback “tracks infrastructure usage on a per-engineering team basis, and charges each owner usage costs accordingly,” Charanya said. With chargeback, each service owner gets a detailed report of how many infrastructure resources its service is consuming.
“We envisioned a single pane of glass for developers to request and manage all of their project and infrastructure identifiers,” Charanya said. The team built out a platform that they hope will become the company’s source of truth for resources, not only allowing developers to procure resources but to handle other duties as well, such as service-to-service authentication.
The initial challenge of building chargeback was that none of the infrastructure resources were defined in a consistent way, nor were there many ways of identifying who the users of these resources were, Charanya explained.
To build out a chargeback system Twitter needed four things, according to Charanya:
Service Identity: A canonical way to identify a service across the infrastructure. A centralized system would need to be integrated with the system to create and manage identifiers.
Resource Catalog: The development team would need to work with the infrastructure teams to identify the resources, so they could be published and consumed by developers. The group would also need to establish the total cost of ownership for running each resource over time, establishing entity models that can be used to calculate the costs.
Metering: A way to track usage of each resource by service identifier, with all the results shipped to a central location by way of a standard extract, transform and load (ETL) pipeline, so the usage numbers can be collected and then consolidated.
Service Metadata: A way to keep track of service and other metadata.
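To make the four pieces concrete, here is a minimal sketch in Python of how a service identifier, a resource catalog and metered usage might fit together. It is an illustration only, not Twitter’s implementation: the class names, the "cpu_core_day" metric and every price and quantity are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """A catalog entry: one billable resource offered by a provider."""
    provider: str     # e.g. "aurora" (illustrative)
    name: str         # hypothetical metric name
    unit_cost: float  # made-up price per unit


@dataclass(frozen=True)
class UsageRecord:
    """One metered data point, keyed by the canonical service identifier."""
    service_id: str
    provider: str
    resource: str
    quantity: float


def consolidate(records):
    """Sum raw usage per (service_id, provider, resource), as an ETL step might."""
    totals = defaultdict(float)
    for r in records:
        totals[(r.service_id, r.provider, r.resource)] += r.quantity
    return dict(totals)


def price(totals, catalog):
    """Join consolidated usage against the catalog to get a cost per service."""
    costs = defaultdict(float)
    for (service_id, provider, resource), quantity in totals.items():
        costs[service_id] += quantity * catalog[(provider, resource)].unit_cost
    return dict(costs)


# Hypothetical catalog and usage data.
catalog = {("aurora", "cpu_core_day"): Resource("aurora", "cpu_core_day", 2.0)}
records = [
    UsageRecord("timeline", "aurora", "cpu_core_day", 10.0),
    UsageRecord("timeline", "aurora", "cpu_core_day", 5.0),
    UsageRecord("search", "aurora", "cpu_core_day", 4.0),
]
totals = consolidate(records)
costs = price(totals, catalog)
```

Keying every record by the same service identifier is what lets usage from different providers be joined in one place.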
The infrastructure providers were given interfaces and APIs to hook their resources into. For developers, whether they want to request Hadoop services or manage storage, the chargeback system provides the interface.
Determining the costs of the resources was particularly challenging. The system calculates the prices from “the bottom-up,” from the bare metal servers, Charanya said. And the price would change over time, as the price of the raw materials, such as storage, changes.
The costs for server-per-day, for instance, incorporate all capital expenditures (capex) and operating expenditures (opex), including costs of the machines, licensing costs for any proprietary software, headroom or reserve capacity, inefficiencies, and human costs to operate these servers.
With the server-per-day price established, the team could then work up the stack to calculate the prices for other, more complex resources, such as Aurora or Hadoop. The system can then create reports for each engineering team, showing usage.
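The bottom-up pricing described here can be sketched as simple arithmetic. The following Python example is a rough illustration under stated assumptions: the formula (amortized capex plus daily opex, divided by the share of capacity that can actually be sold) is one plausible reading of the description, not Twitter’s published model, and all function names and figures are invented.

```python
def server_day_cost(capex, lifetime_days, opex_per_day, sellable_fraction):
    """Cost of one usable server-day.

    Capex is amortized over the server's lifetime and added to daily opex.
    Dividing by the sellable fraction spreads headroom and inefficiency
    across the capacity that can actually be charged for.
    """
    if not 0 < sellable_fraction <= 1:
        raise ValueError("sellable_fraction must be in (0, 1]")
    raw = capex / lifetime_days + opex_per_day
    return raw / sellable_fraction


def core_day_cost(server_day, cores_per_server, platform_overhead):
    """Derive a higher-level unit price (one core for one day) from a server-day.

    platform_overhead is a hypothetical markup covering the cost of running
    the platform itself, e.g. 0.25 for 25 percent.
    """
    return server_day * (1 + platform_overhead) / cores_per_server


# Made-up figures: $3,650 server, five-year life, $2/day opex, half sellable.
usable = server_day_cost(3650.0, 1825, 2.0, 0.5)
per_core = core_day_cost(usable, 32, 0.25)
```

Because the higher-level price is derived from the server-day price, a change in hardware cost flows up the stack automatically.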
Some infrastructure resources are more challenging to monitor than others. Aurora may have a small number of metrics, such as CPU, GPU and memory usage, whereas Manhattan has a larger set of “offerings,” including additional configurations such as clusters and machines.
Overall, the team established 500 metrics for nine infrastructure services. Collecting them has thus far generated about 100 million rows in a Vertica database.
Future of Chargeback
On the first day of every month, each Twitter engineering team covered by chargeback gets a chargeback report. The report includes a list of all the infrastructure services that the team’s project is using. The reports include the estimated cost for using these infrastructure tools.
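A monthly report of this kind could be assembled roughly as follows. This Python sketch assumes a hypothetical service-to-team mapping (the service metadata) and a flat unit cost per provider; the team names, prices and report layout are invented for illustration.

```python
from collections import defaultdict


def monthly_bills(usage, unit_costs, service_owner):
    """Roll usage up to the owning team.

    usage: iterable of (service_id, provider, quantity) tuples.
    Returns {team: {provider: cost}}.
    """
    bills = defaultdict(lambda: defaultdict(float))
    for service_id, provider, quantity in usage:
        team = service_owner[service_id]
        bills[team][provider] += quantity * unit_costs[provider]
    return {team: dict(items) for team, items in bills.items()}


def format_bill(team, items):
    """Render one team's bill as plain text, one line per provider."""
    lines = [f"Chargeback report for {team}"]
    for provider in sorted(items):
        lines.append(f"  {provider}: ${items[provider]:,.2f}")
    lines.append(f"  total: ${sum(items.values()):,.2f}")
    return "\n".join(lines)


# Hypothetical inputs.
service_owner = {"timeline": "home-team", "search": "search-team"}
unit_costs = {"aurora": 0.5, "manhattan": 2.0}
usage = [
    ("timeline", "aurora", 100.0),
    ("timeline", "manhattan", 40.0),
    ("search", "aurora", 20.0),
]
bills = monthly_bills(usage, unit_costs, service_owner)
report = format_bill("home-team", bills["home-team"])
```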
The chargeback system also provides a profit-and-loss (P&L) report to infrastructure managers, including a margin of profitability. “The goal is to have zero margins. We don’t want anyone to have a profit or a loss,” Benedict said.
The P&L statement provides some good signals as to what is happening on the infrastructure, Benedict said. For instance, Twitter can tell if an infrastructure team has done some optimizing that allows for more people to use the same set of capacity, which would result in a lift in “profit.” In these cases, Benedict’s team can readjust the unit costs downward, passing on the savings to customers.
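The zero-margin goal and the downward price adjustment can be illustrated with a few lines of Python. This is a sketch under the assumption that margin is measured as profit divided by income and that the new unit cost is set so the same usage would break even; the function names and figures are hypothetical.

```python
def margin(income, expense):
    """Profit margin as a fraction of income; 0.0 means break-even."""
    if income == 0:
        return 0.0
    return (income - expense) / income


def rebalanced_unit_cost(expense, units_sold):
    """Unit price at which the same usage would bring the margin back to zero."""
    if units_sold <= 0:
        raise ValueError("units_sold must be positive")
    return expense / units_sold


# Made-up month: 1,000 units sold at $1.00 each, but running them cost only $800.
current_margin = margin(1000.0, 800.0)
new_unit_cost = rebalanced_unit_cost(800.0, 1000.0)
```

In this made-up case the provider shows a 20 percent “profit,” so the unit cost would drop from $1.00 to $0.80.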
“Income” for the infrastructure is calculated by a series of steps. The service provider, when brought on board, defines for the chargeback system what role the infrastructure resource plays in helping run that service. A team maps back to a service. “A chargeback bill for a team essentially becomes an expense,” Benedict said. And that figure, in turn, becomes income for the infrastructure provider.
The server infrastructure, for instance, gets the majority of its “income” from Aurora and Mesos. This income is also an expense for the Aurora/Mesos team, Charanya further explained. The Aurora/Mesos team may work to cut their costs, while the infrastructure teams work to improve their efficiency.
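The way one party’s expense becomes another’s income can be modeled as a simple two-sided ledger. The Python sketch below is only an illustration of that bookkeeping idea, with invented party names and amounts; the article does not say how Twitter stores these figures.

```python
from collections import defaultdict


def post_charges(charges):
    """Record each charge as an expense for the consumer and income for the provider.

    charges: iterable of (consumer, provider, amount) tuples.
    Returns {party: {"income": total, "expense": total}}.
    """
    ledger = defaultdict(lambda: {"income": 0.0, "expense": 0.0})
    for consumer, provider, amount in charges:
        ledger[consumer]["expense"] += amount
        ledger[provider]["income"] += amount
    return dict(ledger)


# Hypothetical chain: service teams pay Aurora/Mesos, which pays the server team.
charges = [
    ("timeline-team", "aurora-mesos", 300.0),
    ("search-team", "aurora-mesos", 200.0),
    ("aurora-mesos", "servers", 450.0),
]
ledger = post_charges(charges)
```

Here the hypothetical Aurora/Mesos team takes in 500 and spends 450, which is the kind of gap the P&L report would surface.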
The system also produces a budgeting report, for the financing services, which validates its own spend calculations against this budget.
Other reports are generated for other users: The infrastructure teams want to know who their top teams are, and how much their resources are being used. The service centers learn about which services they rely on, and how much they are being used. The executives can be apprised of trends, and how the big projects are coming along.
This chargeback system eliminates the need for infrastructure owners to build self-service portals for their own services. And thanks to the wealth of metadata being captured, the system should also be able to offer reports on the latencies of an infrastructure resource.
While chargeback may sound more like old-school accounting than new stack engineering, the discipline could actually set the stage for capabilities that many hyperscale companies are still struggling to manifest.
For instance, the chargeback system should help Twitter develop an end-to-end service lifecycle management process. “We felt there is a need to manage service lifecycle as a whole,” Benedict said. “We want to enable people to write services in a quick, easy way, get the relevant resources, and then launch, deploy, and subsequently monitor and kill the services after the entire lifecycle of services.”
With service lifecycle management in place, “every engineer can go to a single place, and view the services they own and the projects they own, to request resources, to request identifiers, to manage access control. We also let them view their bills and utilization reports, and finally facilitate deployment,” Benedict said.
The chargeback system and service lifecycle manager could also play a vital role in developing a single platform, codenamed Kite, that could provide service builders the resources they require from multiple public clouds, as well as internal resources, all from a central location.
Such a platform, Benedict explained, would offer developers access to cloud services such as virtual machines while remaining agnostic to the individual cloud. It could also facilitate the lifecycle of an entire Twitter project.