“Most of your bottlenecks, especially at the beginning levels of a small website, a startup, are at the SQL database level,” says Tung Nguyen, vice president of engineering for the Bleacher Report sports website.
In the fifth episode of The New Stack @ Scale, we hear about Bleacher Report’s incredible journey from outgrowing “awful” blog software running on a single VM, to becoming a leading sports-oriented digital property that brings in more than 80 million unique viewers per month.
Nguyen was grilled by The New Stack editor Alex Williams and Fredric Paul, New Relic’s editor-in-chief. Also on hand was Lee Atchison, who drew on his experience scaling at New Relic (as its principal cloud architect and advocate) and on his time at AWS, where he created Elastic Beanstalk and led the team that managed the migration of Amazon’s retail platform from a monolith to an SOA-based architecture.
The New Stack @ Scale is our monthly podcast, sponsored by New Relic, that examines the various issues accompanying dynamic services and systems.
Nguyen’s story at Bleacher Report began in 2007 when it was bringing in about 50,000 unique viewers per month.
“In those days, 50,000 uniques was not a small number,” said Williams. “How were you managing it then?”
“Terribly disorganized,” admitted Nguyen. “I was tasked to build an engineering team and build a site while the founders were out there trying to get VC money.”
“At that time, we were located on a little VM machine out in Australia, a company called RimuHosting,” he said. “Every single night, around midnight, they would have a reboot of their systems, and it would take our site out for about thirty minutes. I didn’t really know about this until Pingdom came out. Then Pingdom came out, I hooked it up, and every single night, I got a page in the middle of the night. So, we quickly addressed that, and moved out of RimuHosting.”
“With 50,000 uniques, that is large — it’s not small — but it’s not the level of scale, and doesn’t require the high-availability needs that we now have,” he said. “I talked to a couple of companies — I believe Rackspace was one of them. The other company I spoke to at the time was Engine Yard, and Engine Yard was a startup at the time — their own adventure and their own journey just happened to coincide with ours perfectly.”
Nguyen couldn’t even recall the name of the blogging software Bleacher Report was running at the time, “but, they got to 50,000 uniques with this terrible software. They had to outsource the development team, and I eventually hired a couple of guys and brought it in-house.”
“We were just running this little WordPress-like blog, and we were starting to work on the Rails version of the site. Today, we still run our user-facing site on Rails, but there’s different components. It’s not just one big component anymore — that’s been a long journey,” he recalled. “Anyway, we started off with one big monolithic Rails application — this is Rails 0.2 days.”
One early problem was a footer that listed common searchable site tags, which turned out to contain “a Ruby loop that produced 560 queries, or so,” said Nguyen. “I reduced it to one, and then I cached it, and then suddenly everybody was like, ‘Oh, wow, our site is fast.’”
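The pattern Nguyen describes, one query per item inside a loop, is commonly called an N+1 query. The sketch below is only an illustration of that general fix, not Bleacher Report’s actual code: it uses Python and an in-memory SQLite database rather than Rails, and the table names, sample data, and function names are all assumptions made for the example.

```python
import sqlite3
from functools import lru_cache

# Hypothetical schema and data, invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE taggings (tag_id INTEGER, article_id INTEGER);
    INSERT INTO tags VALUES (1, 'nfl'), (2, 'nba'), (3, 'mlb');
    INSERT INTO taggings VALUES
        (1, 10), (1, 11), (2, 12), (3, 13), (3, 14), (3, 15);
""")

queries = []  # records every SQL statement sent to the database


def run(sql, params=()):
    """Execute a query and log it so the number of round trips is visible."""
    queries.append(sql)
    return conn.execute(sql, params).fetchall()


def footer_tags_slow():
    """N+1 pattern: one query for the tag list, then one more per tag."""
    result = []
    for tag_id, name in run("SELECT id, name FROM tags ORDER BY id"):
        rows = run("SELECT COUNT(*) FROM taggings WHERE tag_id = ?", (tag_id,))
        result.append((name, rows[0][0]))
    return result


@lru_cache(maxsize=1)
def footer_tags_fast():
    """Single aggregated query; the result is cached after the first call."""
    rows = run("""
        SELECT t.name, COUNT(g.article_id)
        FROM tags t
        LEFT JOIN taggings g ON g.tag_id = t.id
        GROUP BY t.id, t.name
        ORDER BY t.id
    """)
    return tuple(rows)


if __name__ == "__main__":
    footer_tags_slow()
    print("slow version queries:", len(queries))
    queries.clear()
    footer_tags_fast()
    footer_tags_fast()  # served from the cache, no database access
    print("fast version queries:", len(queries))
```

With three tags the slow version issues four queries, while the fast version issues one on the first call and none afterward. In a real application the cache would also need to be invalidated when tags change, which this sketch leaves out.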
Atchison remembered those early Rails days, too. “Rails was really good for getting applications up and running quickly and conveniently. But efficiently? Not so much. And not because of Ruby, but because it did a very, very poor job, especially early on, in doing any sort of optimization of SQL queries.”
“It usually comes down to pretty simple things,” Nguyen explained. “You identify the bottlenecks, then you address them.”
“It just takes some kind of experience to do that,” he continued. “I wish there were a trade school to do this. I wish there was a book. I wish there was a conference. But you just have to learn through some blood, sweat, and tears, unfortunately.”
“I remember the benchmarks around Ruby were like 40 or 50 requests per second. Python was around there, too; Perl was around there, too; PHP was around there, too,” he said.
“So, I just wanted to reiterate that point, that Ruby itself is not really slow, it just depends what you’re trying to do,” he asserted. “It depends what problem you’re trying to solve. Ask that question first; take a step back, look around.”
So what did Bleacher Report see when it started to scale? “We need to expand, we need to build a sales team, we need to build other parts of the organization,” said Nguyen. “And we need more data to make good decisions. So, then we start building an analytics platform internally because Google Analytics just can’t handle what we’re trying to do. So, we start breaking up the application, because it doesn’t make any sense whatsoever to build a big data collector within a Ruby on Rails app.”
“This starts about 2009-2010,” he continued. “At that time we had a team of three, and eventually, we had six.”
“Our transition didn’t start at the platform, really. We did a little bit of breaking-out of our components into a data collection app and an internal analytics tool. We didn’t do Java; we did Sinatra at that point,” he recalled.
“We looked at the data. I sat there, with New Relic, analyzing throughput coming through our site. The throughput indicated to me that most of our requests were actually not web requests; most of our requests were API requests, read-only API requests that delivered JSON payloads, about 70-80% of the time. So, it makes a lot of sense to focus your time and energy on that component. So, that’s what we decided to do.”
“We decided to break up Bleacher Report in two main projects: The main Bleacher Report monolith, which still does most of the things, and then API requests, which serve 70-80 percent of our throughput,” he said. “This was going to be a read-only service, and it still exists today. And this thing has helped us scale tremendously, just making that decision, how we should break that up.”
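The reasoning here, measuring which kind of request dominates throughput before deciding what to split out, can be sketched in a few lines of Python. This is a hypothetical illustration only: the sample request log, the `/api/` path prefix, and the category names are assumptions, and a real analysis would draw on monitoring data such as the New Relic throughput Nguyen mentions.

```python
from collections import Counter

# Hypothetical sample of (path, HTTP method) pairs, invented for this example.
SAMPLE_REQUESTS = [
    ("/api/articles/1", "GET"),
    ("/api/articles/2", "GET"),
    ("/api/scores", "GET"),
    ("/api/scores", "GET"),
    ("/api/teams/5", "GET"),
    ("/api/teams/9", "GET"),
    ("/api/tags", "GET"),
    ("/api/comments", "POST"),
    ("/articles/1", "GET"),
    ("/", "GET"),
]


def classify(path, method):
    """Bucket a request; the '/api/' prefix is an assumed routing convention."""
    if path.startswith("/api/"):
        return "api_read" if method == "GET" else "api_write"
    return "web"


def throughput_share(requests):
    """Return the fraction of total requests that falls into each bucket."""
    counts = Counter(classify(path, method) for path, method in requests)
    total = sum(counts.values())
    return {kind: count / total for kind, count in counts.items()}


if __name__ == "__main__":
    for kind, share in sorted(throughput_share(SAMPLE_REQUESTS).items()):
        print(f"{kind}: {share:.0%}")
```

In this made-up sample, read-only API calls account for 70 percent of requests, the kind of skew that would make a dedicated read-only service the obvious first piece to separate.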
“We didn’t completely adopt all the idealistic SOA principles immediately,” he said. “Again, it depends on where you’re at, and where your company is, and how your engineering team is, and a variety of variables.”
“New Relic has taken a very similar approach,” said Atchison, “in how we moved from our monolithic Rails stack into the distributed system we have nowadays.”
“We took the data intake side, we converted that over to Java, and that was our first real service. But since then, we’ve built a whole infrastructure, a whole ecosystem of services, as we’ve been trying to separate and decompose this monolithic Rails app over time,” Atchison said.
“The responsibilities a team goes through in order to make that happen have changed a lot. Now you have development teams worried about scale, you have development teams worried about how many instances of the service need to run, and where should they run, and availability issues,” he explained. “And you start seeing those additional responsibilities as burdens in many respects on development teams.”
“It’s created an environment where you can build an organization that can scale around a scaling application,” said Atchison. “It’s one thing to talk about scaling an application from the standpoint of the amount of traffic that it takes, but the application itself is scaling from the standpoint of how large it is, what it’s doing, the features it provides, et cetera.”
“That scaling can also be a problem,” he observed. “So, this move to services can ease that because you end up creating a better model where individual ownership of a section of the application is owned by individual teams with strong boundaries, strong barriers put up between them, in order to provide your solid API, solid [service-level agreements], between teams.”
Nguyen and Atchison agreed that scaling challenges never seem to end. “It doesn’t matter which problem you solve,” said Nguyen, “it resurfaces itself in a different kind of way, different context. You just keep on going at it.”
“Over the years, I’ve gone through this transition from one service to, let’s say, about five, to eight, to about 20 of them,” he said. “So, our rubber band is starting to become a tangled mess also.”
“A lot of people see this and say, ‘I like to organize things, it’s not right, I feel uneasy.’ But you need to embrace it. This is actually the pattern of the Internet, to quote Adrian Cockcroft. The Internet actually works as a very resilient network. But the thing about the Internet is, again, you’re not going to be able to describe all the inter-connectivity,” said Nguyen.
“I have no idea when Google deploys. I have no idea when Yahoo deploys, or Facebook deploys, but it works,” he summarized. “And it works because, generally, you empower these companies, or you empower these small, two-pizza teams like AWS does, to run and own the full stack.”
“I’m just trying to solve the problems at Bleacher Report, with about 20 different services. In order to do that, I don’t need a conformity monkey that checks every single thing on the Internet. I don’t want to be the government. So, I’m just going to focus on Bleacher Report’s problems. Our problems are large,” he said, “and all you’ve got to do is build systems to automate this process. You can’t fight them with simply processes and human beings. It’s a combination of processes, human beings and automation.”
New Relic is a sponsor of The New Stack.