Pivotal’s Matt Stine on Cloud-Native Application Architectures
Alex Williams welcomes Pivotal’s Matt Stine, author of the recently published book Migrating to Cloud-Native Application Architectures, to The New Stack Analysts podcast.
For more episodes, check out the podcast section of The New Stack.
The topics of conversation, drawn mainly from Matt’s book, include twelve-factor applications, antifragile, and this era-defining debate on microservices and cloud-native architecture. Co-host Donnie Berkholz contributes as well, noting that, “The game is definitely changing.”
“Every company out there — at least every vendor out there — has gotten the memo that this is something that’s really important,” Donnie says, and he also helps to work through some of the definitions. “They want to enable their customers to build — using cloud-native architectures, using microservices, however you want to call it. ‘Cloud-native architecture’ does imply a bit bigger picture than does ‘microservices.’”
Matt is as articulate and contemplative as they come, so we’ll let his own words speak for themselves from here on out:
“When I use the term ‘cloud-native application architectures,’” says Matt, “what I’m trying to do is grab a hold of a bunch of different topics that are being discussed concurrently — the microservices conversation that’s going on; the twelve-factor application concept that Cloud Foundry, Heroku and others are talking about; and, this emphasis on building applications directly for cloud.
“In our case at Pivotal we’re talking about building for a platform. Some people call it ‘PaaS’ — my colleague Andrew Shafer hates ‘PaaS,’ so we talk about platform a lot — but this idea of self-service agile infrastructure, and the idea that your contract between your application code that you’re writing, and the platform that you’re trying to interact with, is not brokered by a ticketing system and human beings, but it’s brokered by you making calls to some API.
“That bridges over into this general topic of API-based collaboration, and creating contracts that you can verify with running code, as opposed to conversations and documents that get passed around, and that becomes the currency of the way applications work with each other, the way they talk to services, and the way they talk to the platform.
“Lastly, this concept that Nassim Taleb talked about in the book Antifragile — and a lot of the stuff he talks about in the book is out of scope for this conversation — but the idea that there’s a concept of systems that assumes that failure events are constantly going to occur.
“When Netflix started moving into AWS, AWS was well-known for causing a lot of failures in systems, and a lot of companies suffered because of that. So they worked from this central tenet that, ‘Every component in the system should be able to fail, and is going to fail, and is going to continually fail, and so we need to design our system such that it will be able to continue providing service to customers, even in the event of one or multiple components disappearing, or having poor quality of service, whatever that may be.’
“Then they take that to the edge and say, ‘Not only are we going to assume failures are going to happen, we’re going to inject them purposefully in the system, and then use the learnings we garner from that to make the system stronger.’ So, you end up needing that far more in these distributed systems, and that carries us back to microservices. We’re typically doing this now in the cloud, so you have all these topic areas that are sort of independent conversations, but at the same time they seem to be converging on the same people talking about them and the same types of architectures being built, and so I just decided to take the cue from Adrian Cockcroft and start talking about ‘cloud-native app architectures,’ and apply that as an umbrella term to encompass all these different things, as opposed to ‘microservices,’ which is arguably a more popular word to fling around but doesn’t really capture the sum total of all the things that I think are important.”
On the Gravity of the Present Architectural Shift:
“One of the important things to note about microservices and cloud-native and these topic areas, is that what they represent is the first major architectural shift that we’ve seen after we started having the DevOps and continuous delivery conversations. I’d made that observation mentally, but I hadn’t really put it into those terms, and then I was sitting in Neal Ford’s microservices presentation at a conference, just to see what his take on this topic was. He made that statement and it really crystallized for me.
“We started thinking about better ways to actually deliver value to our customers, and do that quicker, more continuously, and tighten feedback loops. A bunch of people are having those types of conversations and inventing language to describe how that might look. At the same time, the people who are having the conversations are day-to-day practitioners who are out there trying to build systems and get things done.
“A lot of companies are independently converging. They usually started with a monolith, and then that became too difficult to work with, and to continue to have the velocity they had when they were smaller and leaner, so they start breaking this thing apart. That ‘strangling the monolith’ phrase from the book I got from reading a short article that [Martin] Fowler wrote about ‘strangler applications,’ that wasn’t about this topic but was very analogous to this topic, and then reading several early blog reports about companies. The one I talked about in the book is SoundCloud, and how they moved from this monolithic architecture to microservices, and slowly but surely other companies are starting to write these experience reports and post-mortems of what they did to go from one to the other — why they did it, what things worked, and what things didn’t work.
“What you’ve seen is that the same types of patterns have emerged, and people have started talking about them enough, that now we want to start to put general labels on these things. Some of the terms are old. ‘Twelve-factor’ has been around for a while, but only recently has it become something that a lot of people are talking about. I hadn’t even seen the word ‘twelve-factor’ on the software conference circuit for years, and then this year, on the NFJS tour that I speak on, I have a session and there’s another speaker that has a session speaking specifically about how to build applications like this. Every software conference I’ve gone to in 2015, half of the agenda is talking about microservices.
“A large customer of ours is now trying to scale the stuff out to a much larger development organization, and there you have a huge number of people who are heads-down, day-in-and-day-out, just trying to get their jobs done. Most of them aren’t entrenched in the conversation the way we are. Now, I’m basically teaching people a brand new language to use to talk about application development before we can ever learn the tools and platforms and frameworks and patterns that we have to use to actually build things that way. There’s a big crossing of the chasm that we have to do at this point, going from these early-adopting groups to the larger majority of people who are trying to deliver code.”
On Established Companies Re-tooling as Software Companies:
“The way people see their identity is certainly starting to shift, [judging from] the number of companies that we’re talking to at Pivotal, or who are now approaching us, saying, ‘Okay, I see what’s happening in these verticals that are sitting next to me, and I’m starting to get the idea that it’s about to happen in mine, and I feel like we need to start to reinvent ourselves as a technology company, not because the technology itself is superior, but because we need to start looking at how we can go much faster — not just deliver stuff faster, but validate ideas faster.’
“I was having this conversation this week: ‘two of the primary reasons why I’m interested in microservices is because they allow me to decentralize the way we get work done and decentralize the governance of that work.’ If you take that to its logical conclusion, you almost create this world in which teams are building services, and the services that survive are determined by natural selection. When you make it possible for groups to act autonomously, then you can stand up teams very quickly and say, ‘Go validate this idea,’ rather than, ‘We’re going to spend six months planning and gaining funding for a project that is not going to deliver anything for two years, and we have no idea until two years is over if it’s even a good idea to build the thing that we set out to build.’ So, companies are starting to realize that that mode of operation is what’s going to eventually get them killed.”
Regarding Twelve-factor Applications:
“12-factor dot net is a marketing tool for Heroku. They obviously wanted to help people to build apps that will run well on their platform, so that people will run more apps on their platform. But it was, in my mind, one of the first places where a set of architecture or design principles for writing code was described that represented a loose contract between an application and the platform on which it’s running.
“Here’s a platform that says, ‘You hand me your code, and here are the things that I’m going to do to take that code and transform it into an app that’s going to run somewhere, and I’m going to keep it healthy, and I’m going to allow you to scale it horizontally, and I’m going to allow you to gain visibility into how this application’s behaving. What are the events that are coming out of it? What are the performance metrics that are coming out of it?’ They go through these twelve factors to say, ‘If you design your app like this then it’s going to run well in this environment.’ When we started doing Cloud Foundry version two, we also decided to use Heroku’s buildpack model, and in doing that we started to tune Cloud Foundry very well for these twelve-factor apps.
“One might look at the way Docker guides application packaging and think, ‘Now I’ve got my application package as this Docker image, and I’m going to take that Docker image and I’m going to run it on some platform, whether that’s Cloud Foundry, or Kubernetes, or Mesos, or Amazon — whatever you pick doesn’t really matter — the way that those platforms are going to hook into your Docker container once it’s spun up, and give information to it in terms of configuration, and get information out of it in terms of events and logs, and how those things will be scaled, they all start to look very similar.
“If you’re building for one of those target platforms which you could bundle under this label of ‘cloud platforms,’ then this twelve-factor idea is step one. If you want to get code running and running well in an environment like this, that’s what you have to do.”
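To make that "step one" a little more concrete: two of the twelve factors map directly onto the hooks Matt describes, namely configuration handed to the app through environment variables and logs emitted as an event stream on standard output. The sketch below is a minimal Python illustration of those two factors only. The variable names `PORT` and `DATABASE_URL`, their defaults, and the helper names `load_config` and `log_event` are assumptions chosen for the example, not something taken from the podcast or from any particular platform.

```python
import os
import sys


def load_config(environ=os.environ):
    """Read settings from the environment instead of a bundled config file."""
    return {
        # PORT and DATABASE_URL are assumed names; the defaults are
        # placeholders for local development only.
        "port": int(environ.get("PORT", "8080")),
        "database_url": environ.get("DATABASE_URL", "sqlite:///:memory:"),
    }


def log_event(message, stream=sys.stdout):
    """Write one event per line to stdout and leave routing to the platform."""
    stream.write(message + "\n")
    stream.flush()


if __name__ == "__main__":
    config = load_config()
    log_event(f"starting on port {config['port']}")
```

Because the app never reads a file path or opens a log file itself, the same build can be moved between environments by changing only what the platform injects.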
“The way we’re talking about things now is much more focused on the applications and services, and their purpose, and what they’re trying to do, and we almost want to think about the infrastructure underneath that as a utility model: I plug my laptop into the wall, and I assume that I’m going to get power, but I don’t really think about how that power is generated and brought to me. I just know that this is going to work. The PaaS, or the self-service agile infrastructure, is trying to bring that to me as an application developer and application runner. ‘Here’s my code. I don’t really care how you get this running, or how you stitch it together and make it able to talk to all the things that it needs to talk to. As a developer of this business service, I don’t really have time to think about all the pieces underneath it.’
“Part of that is APIs, in that I’m making an API call to give you my code and some metadata to go run that thing, and I’m making another API call to scale that thing, and then I’m assuming that that thing is going to monitor my processes, and if my processes fail, then it’s going to restart those processes for me. Those are the types of things that I need to even begin to have something that might feel like an antifragile infrastructure.
“What’s ‘antifragile?’ The way Taleb describes it is: first, define ‘fragile.’ Fragile is: you put stress on something and it tends to break or fail. The opposite of that is not ‘something that doesn’t break when you put stress on it;’ it’s, in his mind, ‘something that gets better when you put stress on it.’ How would we build a software system that, when we put stress on it, that system is actually going to get better? Think about the types of stressors that we put on software systems; one of the biggest ones is change. The requirements change, business changes, the market changes, and we need the software to change with it.
“Traditionally, the way we responded in infrastructure and operations is: change breaks things, so we’re going to make change harder, make the change process more disciplined and more process-driven, and that’s going to slow things down, and if we do that, then we’re going to build these robust systems that, when we stress them, they don’t break. They might not break when we put too much load on, or when a server goes down and we can fail-over, but if I want to add a new feature, all of a sudden that type of stressor is forbidden. The stressor the business most wants to place on the systems in this new world of ‘we must innovate or we die’ is change.
“I think we have not progressed as a discipline enough yet that we can actually build systems that truly are antifragile, in and of themselves — that when I start to beat on these things they actually, in a sense, improve themselves. I don’t think we’ve got that; there’s a certain missing component that we just don’t have yet. I use the ‘antifragile’ term very loosely to grab from that topic area an idea that, ‘it would be great if we could build software that way, so how do we start thinking that way, and how do we start building things?’ Chaos-monkey-type-stuff is one step in the right direction of saying, ‘If change is risky because change breaks things, then I’m going to make it so that I know where my weak points are, and I enhance my ability to respond better when things attack those weak points, and supposedly, over time, the system doesn’t get better because the system has the ability to improve itself, but we actually give ourselves better data to work from, that we can use to build a better system that’s more resilient to these types of changes.
“Dick Gabriel wrote about systems that behave, in a sense, like the human immune system. If your immune system hasn’t ever seen an attacker before, it’s going to be quite weak and not know how to deal with infection. One of the ways that we help our kids to not get sick is: we get them around kids that are sick.
“The idea then is not to say, ‘Okay, we have to build our systems that way.’ I don’t think we have the tools to do it yet. But at least, let’s start thinking about what it would take to build systems like this, because if we could achieve that, that would be a very powerful tool in our toolkit.
“Joe Armstrong, who created Erlang, gave this really great talk several years ago. I believe the title of it was ‘How to Build Systems that Run Forever.’ He walks through several of these areas that are very similar to what we’ve been talking about today, and ultimately they end up being the core tenets of how Erlang works. The idea of ‘component failing’ should not ever cause the system to stop working, because if that’s ever true, then you can have a very small piece of the puzzle start behaving badly and that can cause the larger type of service that you’re trying to deliver to not be deliverable.”
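The restart behavior Matt expects from the platform, and the principle that one failing component should not stop the system, can be sketched in a few lines. This is a toy, in-process illustration in Python, not Erlang's supervisor mechanism and not how any specific platform monitors processes. The names `supervise` and `make_flaky_worker` and the restart budget of three are hypothetical.

```python
def supervise(worker, max_restarts=3):
    """Run worker(); if it raises, restart it up to max_restarts times.

    Returns (result, restarts_used). Re-raises the last error once the
    restart budget is exhausted.
    """
    restarts = 0
    while True:
        try:
            return worker(), restarts
        except Exception:
            if restarts >= max_restarts:
                raise
            restarts += 1


def make_flaky_worker(failures_before_success):
    """Hypothetical worker that crashes a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def worker():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise RuntimeError("simulated crash")
        return "served"

    return worker


if __name__ == "__main__":
    print(supervise(make_flaky_worker(2)))
```

The caller sees a served request and a restart count rather than the crash itself, which is the kind of data Matt suggests feeding back into a more resilient design.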
On State and the Split-brain Problem:
“One of the elephants in the room right now is that a lot of stuff that we’re talking about from a cloud-native perspective, a lot of the patterns that we’re describing, they have this core idea that the services that you’re building are stateless, and yet we know that you cannot build a system that is completely devoid of state. There’s always going to be state somewhere, and you shift it around and you move it, but eventually you have to find out, ‘How do I deal with the place that the state is going to live?’
“If there’s state, there are things that are going to read state, and that’s fine, but there are also things that are going to write state, and you want to make sure that those writes are coordinated appropriately, so that the data set is actually consistent. There is a spectrum of consistency between complete consistency and eventual consistency, but you want to not have chaos, so you have some amount of coordination.
“One of the ways that we do that is by electing leaders. If you have a leader election algorithm that says, ‘The leader’s going to coordinate the writes,’ and you introduce a network partition, and one side of the system thinks the leader’s gone, while the other side of the system thinks that the leader’s still there, that leader keeps operating. The other side elects a new leader, and now you have essentially two brains in the system, both of which think they’re in control, but there’s this network partition that is separating them sufficiently so that reality is obscured, and that creates a whole host of problems, some of which are extremely difficult to recover from.”
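The scenario Matt describes can be reproduced with a toy model. The Python sketch below partitions a hypothetical five-node cluster and compares a naive election, where each side simply picks a leader from the nodes it can reach, with a majority-quorum rule. The quorum rule is a common mitigation added here for illustration and is not something Matt prescribes in the conversation. The node names, the "lowest name wins" tie-break, and the function names are all assumptions; real consensus protocols involve considerably more than this.

```python
def elect_naive(partition):
    """Each side picks a leader from whatever nodes it can still reach."""
    return min(partition)


def elect_with_quorum(partition, cluster_size):
    """Only a side holding a strict majority of the cluster may elect a leader."""
    if len(partition) * 2 > cluster_size:
        return min(partition)
    return None


# Hypothetical five-node cluster split by a network partition.
cluster = ["a", "b", "c", "d", "e"]
side_one, side_two = ["a", "b"], ["c", "d", "e"]

# Naive election: both sides end up with a leader (split brain).
naive_leaders = [elect_naive(side_one), elect_naive(side_two)]

# Quorum election: the minority side refuses to elect anyone.
quorum_leaders = [
    elect_with_quorum(side_one, len(cluster)),
    elect_with_quorum(side_two, len(cluster)),
]

if __name__ == "__main__":
    print("naive:", naive_leaders)
    print("quorum:", quorum_leaders)
```

With the quorum rule the minority side stops accepting coordinated writes, trading availability on that side for a single writer overall.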
On Problems and Solutions:
“[In ‘The Eight Fallacies of Distributed Computing’, the authors provide] eight assumptions that people make when they first get into this realm of building distributed applications. All of these assumptions are absolutely wrong, but if you make them, and you build a system that way, at some point the fact that you made a wrong assumption about reliability and consistency — the fact that you’ve distributed your system so you have components that are flying around out there doing different things, and then you have network interconnects, and we realize the network is not reliable — it introduces latency, and we don’t have infinite bandwidth, and it’s not secure.
“The network itself can cause all sorts of problems for the system. But also, because you’ve separated the components of the system, and now they’re independently deployable, that means that they can come and go, and move around, and change. Even if the network keeps working, the thing that you’re talking to on the other side of that network connection might actually be different, or fail, or change.
“As a developer living in a system like this, you cannot not think about these things, but they introduce a lot of overhead into what you’re building. One of the things we’re trying to do at Pivotal is to build frameworks and tools and platforms that introduce a lot of the well-known, well-described, boilerplate solutions to dealing with these fallacies, into your systems — I use it loosely, but, ‘as-a-service’ — so that you can go back again to focusing on the business code that you’re trying to write and deliver value with.”
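One widely described boilerplate answer to "the network is not reliable" is the circuit breaker: after repeated failures, stop calling the remote service for a while and serve a fallback instead. Matt does not name a specific pattern or product here, so the Python sketch below is only an assumed example of the kind of solution he means, not a Pivotal API. The class name, the threshold of three failures, and the fallback callable are hypothetical, and the usual timed reset ("half-open" state) is left out to keep it short.

```python
class CircuitBreaker:
    """Stop calling a failing dependency after too many consecutive errors."""

    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.failures = 0  # consecutive failures seen so far

    @property
    def is_open(self):
        """True once the failure threshold has been reached."""
        return self.failures >= self.failure_threshold

    def call(self, func, fallback):
        """Call func(); on failure or an open circuit, return fallback()."""
        if self.is_open:
            return fallback()
        try:
            result = func()
        except Exception:
            self.failures += 1
            return fallback()
        self.failures = 0  # a success resets the count
        return result


if __name__ == "__main__":
    breaker = CircuitBreaker()
    print(breaker.call(lambda: "live data", lambda: "cached data"))
```

Wrapping remote calls this way keeps the failure handling out of the business code, which is the separation Matt is arguing for.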
Pivotal is a sponsor of The New Stack.