Infrastructure as Code or Cloud Platforms — You Decide!
Let’s compare two prevalent approaches to cloud infrastructure management. The first is what we broadly classify as Infrastructure as Code (IaC), where engineers use programming or scripting languages to build a set of scripts that achieve the desired topology on a cloud platform. Terraform, CloudFormation, Chef, Puppet and Ansible are some popular tools.
This technology consists of a language to write scripts, plus a controller that can run the scripts. Once satisfied with the result, the user would save the scripts in a code repository. Subsequently, if a change is to be made, then the files would be edited and the same process repeated.
The second category would be a cloud orchestrator or platform. This would typically be a thin abstraction over native cloud APIs that would interface with the user as a web service, and the user would connect to the service (via UI or API) and build the cloud topology within that web service itself.
The topology built will be applied by the orchestrator and saved in its own database. The user does not need to explicitly save the configuration. When an update has to be made, the user will again log in to the system and make changes.
For smaller-scale use cases, a platform may be too heavy. But at scale, the IaC approach tends to morph into an in-house platform. A better strategy, in this case, is to use an off-the-shelf platform that can be enhanced with IaC scripts when customization is required. Megascale data centers like those belonging to Facebook and Netflix are a different ballgame and are not considered in this context.
The fundamental value that a platform-based approach provides is what we call “long-running context.” People may also call this a “project” or a “tenant.” A context could map to, say, an application or an environment like demo, test, prod or a developer sandbox. When making updates to the topology, the user always operates in this context. The platform would save the updates in its own database within this context before applying the same to the cloud. In short: You are always guaranteed that what is present in this database is what is applied to the cloud.
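As a rough illustration of the idea, here is a minimal sketch of a long-running context in Python. Everything in it is hypothetical (the `ContextStore` class, the `apply_fn` callback, the in-memory dictionary standing in for the platform database and for the cloud); it only shows the save-then-apply discipline described above, and assumes that a failed apply is handled by restoring the previous saved state.

```python
class ContextStore:
    """Hypothetical platform core: one saved topology per named context."""

    def __init__(self, apply_fn):
        self._saved = {}            # context name -> last saved topology
        self._apply_fn = apply_fn   # callable that pushes a topology to the cloud

    def update(self, context, changes):
        previous = self._saved.get(context, {})
        topology = {**previous, **changes}
        self._saved[context] = topology          # save within the context first
        try:
            self._apply_fn(context, topology)    # then apply the same to the cloud
        except Exception:
            # Assumption: on failure, restore the old record so the
            # database and the cloud do not diverge.
            self._saved[context] = previous
            raise
        return dict(topology)

    def get(self, context):
        return dict(self._saved.get(context, {}))


# A dictionary stands in for the real cloud in this sketch.
cloud = {}
store = ContextStore(lambda ctx, topo: cloud.__setitem__(ctx, dict(topo)))
store.update("prod", {"web_instances": 3})
store.update("prod", {"db_size": "large"})
```

A user who returns to the "prod" context days later reads the saved topology with `store.get("prod")` rather than reconstructing it from scripts.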
In the IaC approach, such a context is not provided natively and is left to the user. Typically this would translate to something like “Which scripts need to be run for which context?” or maybe a folder in the code base that represents a configuration for a given tenant or project. Defining the context as a collection of code is harder because many of the scripts might be common across tenants. So most likely it comes down to the developers’ understanding of the code base.
A platform is a more declarative approach to the problem: it requires little or no coding, because the system generates the code from the user’s intent without requiring knowledge of low-level implementation details. Meanwhile, in the case of IaC, any change requires a good understanding of the code base, especially when operating at scale. In the platform approach, a user can come back and log in to the same context a few days later and continue where they left off without having to dig deep into the code to understand what was done before.
Difference Between the Code Base and What Is Applied to the Cloud
The second fundamental difference between the two is that IaC is a multistep process (write the script, run it and merge it in the repo), while a platform is a one-step process (log in to the context and make the change). With IaC, it is possible that the user might update and run a script, but forget or postpone saving it in the repository. Meanwhile, another engineer could have made changes to the code base for their own side of the topology and merged them. Since many pieces of code are shared between the two use cases, the first developer might find themselves in a conflict which, even if resolved by merging the code, lands them in a situation where what was run in the cloud is not what is in the repo. The developer now has to re-run the merged code to validate it, at the risk of causing a regression. To avoid that risk, the merged script first has to be tested in a QA environment.
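The gap between the repository and the cloud can be made concrete with a small sketch. The `find_drift` function and the sample settings below are hypothetical, and the sketch assumes both sides can be flattened into simple key-value dictionaries, which real tooling would have to derive from the code base and the cloud APIs.

```python
def find_drift(repo_config, applied_config):
    """Return {key: (repo_value, applied_value)} for every setting that differs.

    A key that is absent on one side is reported with None for that side.
    """
    drift = {}
    for key in sorted(set(repo_config) | set(applied_config)):
        in_repo = repo_config.get(key)
        in_cloud = applied_config.get(key)
        if in_repo != in_cloud:
            drift[key] = (in_repo, in_cloud)
    return drift


# What was merged in the repo versus what an engineer actually ran.
repo = {"web_instances": 2, "queue": "orders"}
applied = {"web_instances": 3, "queue": "orders", "cache": "redis"}
drift = find_drift(repo, applied)
```

Here `drift` reports the unmerged instance count and the cache that exists only in the cloud, which is exactly the state the multistep process allows.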
All the ‘Other’ Stuff
IaC tools will enable deployments, but there is so much more to running infrastructure for cloud software. We need an application-provisioning mechanism, a way to collect and segregate logs and metrics per application, health monitoring and alerting, an audit trail, and an authentication system to manage user access to the infrastructure. Several tools are available to solve these individual problems, but they need to be stitched together and integrated into an application context. Kubernetes, Splunk, CloudWatch, SignalFx, Sentry, ELK and OAuth providers are all examples of these tools. But the developer needs a coherent “platform” to bring all this together if they want to operate at a reasonable scale. This brings us to our next point.
Much of IaC Is Basically a Homegrown Cloud Platform
When talking to engineers, we often hear the argument that Infrastructure as Code combined with Bash scripts, or even regular programming languages like Go, Java and Python, provides all the hooks necessary to overcome the above challenges. Of course, I agree. With this sort of code, you can build anything. However, you might be building the same kind of platform that already exists. Why not start from an existing platform and add customization through scripts?
The second argument I have heard is that Infrastructure as Code is more flexible and allows for deep customization, while with a platform you might have to wait for the vendor to provide the same support. I think that, as technology has progressed to the point where cars are driving themselves (once thought to be little more than pure fantasy!), platforms are far more advanced than they are given credit for and have machine-generation techniques that satisfy most, if not all, use cases. Plus, a good platform would not block a user from customizing the part that is beyond its own scope via scripting tools. A well-designed platform should provide the right hooks to consume scripts written outside the platform itself. Hence this argument does not justify building a code base for the majority of the tasks that are standard.
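One way to picture such hooks is the sketch below. It is not any particular vendor’s API: the `Platform` class, the `post_deploy` hook name and the `add_compliance_tag` script are all made up for illustration, and the "deployment" is just a dictionary update.

```python
class Platform:
    """Hypothetical platform that runs user-supplied scripts at named hook points."""

    def __init__(self):
        self._hooks = {}  # hook name -> list of callables

    def register_hook(self, name, script):
        self._hooks.setdefault(name, []).append(script)

    def deploy(self, topology):
        # Standard work that the platform handles on its own.
        result = dict(topology, deployed=True)
        # Customization beyond the platform's scope, written outside it.
        for script in self._hooks.get("post_deploy", []):
            result = script(result)
        return result


def add_compliance_tag(topology):
    # A user-written script; the platform knows nothing about its contents.
    return dict(topology, tags=["pci"])


platform = Platform()
platform.register_hook("post_deploy", add_compliance_tag)
outcome = platform.deploy({"service": "billing"})
```

The standard path stays inside the platform, and only the part that is specific to one team lives in a script.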
‘There Is No Platform That Fits Our Needs’
This is also a common argument. And I agree: A good platform should strive to solve this prevalent problem. At DuploCloud, we believe we have built a platform that addresses the majority of the use cases while giving developers the ability to integrate policies created and managed outside the system.
‘The San Mateo Line!’
A somewhat surprising argument in favor of building homegrown platforms is that it is simply a very cool project for an engineer to tackle — especially if those engineers are from a systems background. I live in Silicon Valley and have found a very interesting trend while talking to customers specifically in this area.
When we talk to infrastructure engineers, we find that they have a stronger urge to build platforms in-house, and they are quite clear that they are building a “platform” for their respective organizations and are not, as they would consider it, “scripting.” For such companies, customization is the common argument against off-the-shelf tools, while hybrid cloud and on-premises are important use cases. Open source components like Kubernetes, Consul, etc., are common, and thus I frequently hear the assertion that the wheel is not being reinvented. Yet the size of the team and the time allocated for the solution are substantial. In some cases, the focus on building the platform overshadows the core business product that the company is supposed to sell. While not entirely scientific, I tend to see these companies south of San Mateo.
Meanwhile, the engineering talent at companies north of San Mateo building purely software-as-a-service applications is full stack. The applications use so much native cloud software (S3, DynamoDB, Amazon Simple Queue Service (SQS), Amazon Simple Notification Service (SNS)) that it’s hard to be hybrid. They are happy to give the container to Amazon Elastic Container Service (Amazon ECS) via API or UI to deploy it. They find no joy in either deploying or learning about Kubernetes. Hence, in-house customization is far less common and far less deep.
How many times, and by how many people, will the same code be written to achieve the same result? Time to market will eventually prevail.