“For a lot of people, what’s been missing in most companies or organizations is when you build an application, especially an application meant for internal delivery [is] documentation,” observed Kelsey Hightower, who is the staff developer advocate for Google Cloud Platform, in this sixth episode of The New Stack @ Scale, our monthly podcast examining the various issues accompanying dynamic services and systems.
Lack of proper documentation is especially problematic if, in fact, the application is meant to be used by others in the organization.
“You end up with a bunch of intimate knowledge contained within very small groups of people. This is what highlights this world of silos,” Hightower continued, speaking with The New Stack’s Alex Williams and New Relic’s editor-in-chief Fredric Paul. “When there’s a small group of people that know something and they don’t really communicate well with everyone else, we label them as a silo.”
Also in this podcast is an illuminating segment produced by The New Stack correspondent Scott M. Fulton III, in which we hear from Avi Cavale of Shippable, Apcera founder and CEO Derek Collison, and Martin Croker, DevOps Capability Lead at Accenture.
Hightower suggested that writing applications or delivering services internally should be treated “as if they were going to be consumed by a third party paying for a service.”
“They need documentation; they need APIs. They should be able to be consumed without talking to the team that built them,” he said. “I think that goes a long way to accomplish some of the things that DevOps has helped to do.”
“There’s nothing wrong with collaboration,” said Hightower, “but … the bigger win, for me, is we need to define APIs between the various teams that can be consumed in a way that’s scalable.”
By contrast, the DevOps approach of having people embed themselves across all teams is not truly scalable, according to Hightower. “You still run into some issues, or mistakes will still crop up because you’re relying on a bunch of human interactions.”
How does the efficacy of documentation and APIs relate to platform management at scale? At some point, a service level agreement (SLA) needs to represent the standard velocity at which the teams who are sharing the APIs can operate.
“Everyone wants the absolute fastest thing in the world,” Hightower explained. “This is where, cross-functionally, people can say, ‘Look, the fastest networking we have is X. You can’t get more than that unless we re-tool this whole thing.’ It helps everyone in the organization set expectations. Before you go and run up that sales contract to say that we can do X, it may be helpful to know what are the limitations of the hardware that you’re on, what are the limitations of the platform that you’re on.”
Williams noted how quickly all of this can get complicated. “Companies already have existing SLAs for all kinds of issues in matters of managing their own infrastructure. Now we’re getting into additional SLAs that relate to the API management.”
“Some teams have people focused on this as part of their job, as part of monitoring, but include the timings between API calls and SLAs that are happening,” said Hightower. “Let’s say you run into an issue in production and you track it down to a particular set of APIs. That might be the time you initiate that discussion around what should the SLA be.”
“We measure, we fix and we re-measure,” he said.
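The measure, fix and re-measure loop Hightower describes can be sketched in a few lines of Python. This is only an illustration, not anything presented in the podcast: the function name `check_sla`, the 200 ms threshold, the compliance target and the sample latencies are all hypothetical.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class SlaReport:
    p95_ms: float      # 95th-percentile latency of the sample
    compliance: float  # fraction of calls at or under the threshold
    met: bool          # True when compliance reaches the target


def check_sla(latencies_ms, threshold_ms, target=0.99):
    """Compare observed API call latencies with a hypothetical SLA."""
    if len(latencies_ms) < 2:
        raise ValueError("need at least two latency samples")
    within = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    compliance = within / len(latencies_ms)
    # quantiles(..., n=20) returns 19 cut points; the last is the 95th percentile.
    p95 = quantiles(latencies_ms, n=20)[18]
    return SlaReport(p95_ms=p95, compliance=compliance, met=compliance >= target)


if __name__ == "__main__":
    # Hypothetical timings (ms) for one API, before and after a fix.
    before = [120, 135, 110, 480, 125, 130, 610, 118, 122, 128]
    after = [120, 135, 110, 140, 125, 130, 150, 118, 122, 128]
    print("before:", check_sla(before, threshold_ms=200, target=0.9))
    print("after: ", check_sla(after, threshold_ms=200, target=0.9))
```

Running the same check before and after a change is the "re-measure" step: the SLA conversation starts from recorded timings rather than from guesses.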
“The priorities are really important because there are a lot of SLAs,” said Paul. “Missing on some of them may not make a difference to the business, even if you miss by a mile while other areas — even a small deviation can wreak havoc. You need to be able to compare, and focus on the things that really matter.”
“Is this what scale means, then,” asked Williams, “to have these systems in place so you can manage the APIs, so you can manage the SLAs?”
“Yes, exactly,” replied Hightower, “because scale is relative. If you’ve got one user, just one user, you might get to service that user from your iPhone. For you, in particular, that’s all you need to concern yourself with. If you’re in a situation where you can see the trend lines, your user base is growing 20 percent over time, the only way you can actually say that your application is scalable is, are you meeting the demands as they arise? The only way you can tell that is if you’re measuring.”
“You may be at a place where you’re meeting 100 percent of your SLAs, meaning your SLAs should be designed around where your users are happy and not necessarily prohibiting the business to grow. That’s what your SLA should be measured against, not necessarily the same SLA as Google has,” he said.
Hightower asserted that once those SLAs are being met consistently, “then it’s safe to say that your organization is a scalable organization, and you more likely are using a scalable platform.”
Does this point to a theoretical correlation between scale and complexity, Williams wondered.
Not necessarily, Hightower countered. “The reason why there’s so much excitement about the cloud and some of these tools like Mesos Marathon and Kubernetes, is that they’re saying, ‘We’ve encapsulated a lot of the complexity required to move at scale into something you download and install.’”
“So, in order to scale,” asked Williams, “you need to have that complexity abstracted?”
“Or deal with the pain of not,” said Hightower.
“Why wouldn’t you just use an entirely opinionated system, then, that abstracts all that complexity for you, instead of building your own homegrown system?” Williams continued. “Why would you ever even go to that effort?”
“I always look at this as layers,” said Hightower. “Let’s take Heroku, for example. It is completely opinionated. For a lot of people, it works 100%. You write your app, you push it to Heroku, you’re done; all of it is taken care of for you.”
“There are people that have a few use cases that that doesn’t satisfy,” Hightower observed, “so they walk back down the stack a little. They start looking for other things. Maybe Cloud Foundry hits the button for them, maybe it’s App Engine, and maybe they have a little bit more flexibility, so they may walk a little further back and go to Marathon or Kubernetes. For some people that are completely custom, they may only want something like an operating system where they do the rest themselves. There’s never going to be this one platform that works for everybody.”
Apcera and New Relic are sponsors of The New Stack.