A Tale of Two Times: How News Publishers Address Change in the Data Center
It is no secret that the newspaper industry in the United States continues to weather a storm of revenue declines. Last June, the Pew Research Center reported that average weekday circulation for daily print newspapers in America fell 10 percent in 2016 compared with 2015. It’s also no secret that online advertising is absorbing these revenues, while generating new revenue along the way: At about the same time as the Pew report, the Interactive Advertising Bureau reported that digital advertising revenues for U.S.-based media in the first calendar quarter of 2017 alone climbed 23 percent year-over-year, to $19.6 billion collectively.
Newspapers, such as they continue to be called, are also in the digital media business. But until recently, their digital business units have typically operated separately from their print units. For the sake of their own survival, newspaper publishers need to re-establish themselves as news publishers across all media. This means those separate business units must consolidate, and that means their technology platforms have to come together, immediately.
If “integration” means welding together these units’ old software into aggregates, that’s not an option. But it may mean joining their disparate data warehouses, or at the very least, giving employees and journalists the appearance of having done so.
“We’ve been heavily disrupted by technology,” said Nick Rockwell, chief technology officer of The New York Times Company, speaking with The New Stack, “and yet we’ve been a little bit conservative about our adoption. And I think this is a time when it’s necessary and important for us to actually push ourselves out of our comfort zone, and be a bit on the cutting edge because the advantages are so profound and actually so well-suited to what we do that it can actually have a massive impact on our business. If we’re not benefiting from the positive aspects of technological disruption, then we’re hurting ourselves.”
“What we wanted was a way to unify all of our different data sources into a single view,” said Alejandro Cantarero, vice president for data at the Los Angeles Times Media Group of Tronc (the company formerly known as Tribune Publishing), “which is something that hadn’t been achieved at the company before.”
On the other coast, the NYT Company has actually seen revenue growth on its digital side, last month reporting an 11 percent gain year-over-year in digital advertising revenue, attributed to an eye-opening 69 percent annual growth rate for digital subscribers. That is not yet enough to offset the decline in print revenue, which dropped 20 percent in the same period, leaving the company overall with a 9 percent year-over-year drop for the quarter.
For one reason or another, both companies have a clear mandate to reorient their platform focus around a single, digital base. But their technology strategies for doing so are quite different from one another.
Tronc’s Old Monolith Is Now Its Safety Net
“We rebuilt the big data platform from scratch,” said Tronc’s Cantarero.
“There basically wasn’t a big data platform. When I joined the company early last year, it was to build out what would be our big data solution, and create data science as a core competency of the company.”
Arguably any newspaper’s most valuable data store is managed by its content management system (CMS). Last March, Tronc had already begun its mission to migrate its CMS to Arc, a platform built at the Washington Post. So whatever data platform the Tronc engineering team would build needed to be effectively decoupled from the CMS, relying on its data feed but not bound permanently to it.
“When building from scratch, I’m a big fan of jumping ahead,” said Cantarero in his interview with us. “I don’t want to be years behind. If you build off of the good, stable stuff, by the time you get your whole platform up and running, you’re going to be way behind the cutting edge. I think it’s been clear for a while that microservices is the way everything is going, and there’s lots of advantages to that.”
The Tronc team’s strategy involved establishing a data-driven platform immediately, with the intent to add machine learning capabilities to the system in stages later. Without much time to evaluate options, it adopted Mesosphere Data Center Operating System (DC/OS) 1.0 the very week it was released. With DC/OS in place, it would execute an extraordinary kind of reverse migration, engineered as a way to stay safe while it effectively tested its new code in production: It first migrated the old Tribune Company’s entire data platform to conventional AWS EC2-based virtual machines. Then the team created Mesos frameworks to take over certain data-driven tasks from that platform in stages.
“We were able to replicate something that took four engineers a year-and-a-half to build, in a matter of weeks at Tronc,” said Cantarero. Kubernetes, he noted, might have been considered as an option if it had evolved back then to the point it’s at today. At his previous position with another firm, he told us, he and his colleagues found themselves building shims for Kubernetes, to enable functionality that was on its roadmap but had not yet been released. Mesos, and by extension DC/OS, had evolved into a complete platform earlier.
But even with a fully evolved Kubernetes as an option, said Tronc senior data engineer Matt Chapman, he and his colleagues would probably have rejected it, because they would have had to run Spark and Cassandra (vital ingredients for Tronc’s analytics and forthcoming ML operations) in containers rather than natively. Chapman told us that containerized Spark, in his view, suffered from performance issues. And he noted rather pointedly that running almost any service under the control of Docker was not an option.
“I’m not a big fan of the Docker daemon, so I’d love to run without that,” said Chapman. “I value being closer to the metal with Mesos, being able to run lightweight, native containerization. We’re dealing with something where we want to get down to microsecond latency, to analyze web traffic in real time, [and] for certain things that are speed-dependent, like Kafka. We do immediate content recommendations for, what’s the very next thing you’re going to read.”
With new services being implemented in stages, Tronc can fall back to its old platform — which is still available through EC2 — if it encounters problems. So the old monolith now serves as a safety net. Chapman and Cantarero admitted that, on occasion, they have had to pull services out of DC/OS and fall back to the EC2 platform, especially when DC/OS is upgraded and the new configuration doesn’t mesh well with Tronc’s added code.
The NYT Avoids Administrative Hassle with Google’s BigQuery
The New York Times Company’s digital strategy was inspired by the serverless concept as it was explained to CTO Nick Rockwell: running code without an operating system. It’s a looser definition than how it’s often presented in The New Stack, as a Functions-as-a-Service platform.
“Any environment that’s self-scaling, self-reliant, and available,” as Rockwell defined it for us, “where you don’t have to deal with the problems of scaling and availability.”
The New Stack profiled Rockwell and his digital team last April, just after they had decided to move NYT’s data stores off of their data warehouse, bypassing Hadoop, into the public cloud with Google BigQuery. Since that time, Rockwell has embraced the looser definition of serverlessness, decomposing existing applications and rebuilding their counterparts on Google App Engine.
“How to iterate through change is one of the core challenges of technology management,” asserted Rockwell in his most recent interview with us. “I think the real answers tend to be specific to each individual situation. But for us, if you have a fairly modular, overall application architecture, it’s pretty easy to start moving pieces.”
The NYT digital team and its management did investigate the possibility of running its operations in a containerized environment. They did decide to utilize Google Kubernetes Engine for some functions. But like Tronc, NYT decided against managing its own containerized platform, which would have involved placing Kubernetes at the center of its management strategy. One reason, according to Rockwell: Managing Kafka — which is critical to NYT’s current stack — entirely in-house has been too time-consuming, with too little value generated in the process.
“If you’re subscribing to a particular cloud and their way of doing things,” said the CTO, “a lot of that work of how to decompose services is done for you. You have to go along with it. I find that to be powerful, because one of the great sources of effort and inefficiency in software development is, all of the choices we have to make as developers. Every time you sit down to do a project, if you’re starting from a blank sheet of paper, you have to make every foundational technology choice over again. If you’re working in a constrained environment — like, say, App Engine — most of those choices are already made for you, and you have no choice but to get down to the work of creating the actual logic that you actually need.”
The contribution of open source to the modernization of infrastructure, Rockwell believes, has been valuable. But that’s coming to an end, he told us, as Amazon, Microsoft, and Google step into the space and start making decisions that development teams would probably be making anyway, only consuming more time in the process.
“We’re trending to use less open source in the future than we have in the past,” said Rockwell, “because instead, we’ll be relying on things like BigQuery, Pub/Sub, Cloud Spanner, App Engine, et cetera… We run Kafka ourselves, and I don’t really want to do that. We’ll be looking at Confluent-managed Kafka in the next year; we’ll also look at whether a simpler queue like Pub/Sub can do, or can evolve to do, what we need. Because frankly, I don’t really want to run an open source application and be on the hook for managing a complex one like Kafka, if I can avoid it.”
Although the NYT Company and Tronc adopted two different approaches to rapidly transitioning their respective data platforms, both firms’ IT teams saw Kubernetes and containerization as a fork in the road. And rather than taking it, they detoured in order to avoid taking on a larger management workload. There may yet be a lesson here for container platform developers looking to win over larger businesses with smaller IT teams: Fully managed options may end up winning them over, and moving them onto platforms that, under the hood, do use containers. But the customers won’t know that, and may not care.
Feature image: Photograph of The New York Times press room circa 1942 originally published by the U.S. War Department, now in the public domain.