A Practical Step-by-Step Approach to Building a Platform
In my previous article, I discussed the concept of a platform in the context of cloud native application development. In this article, I will dig into the journey of a platform engineering team and outline a step-by-step approach to building such a platform. It is important to note that building a platform should be treated no differently than building any other product, as the platform is ultimately developed for internal users.
Therefore, all the software development life cycle (SDLC) practices and methodologies typically employed in product development are equally applicable to platform building. This includes understanding end users’ pain points and needs, assembling a dedicated team with a product owner, defining a minimum viable product (MVP), devising an architecture/design, implementing and testing the platform, deploying it and ensuring its continuous evolution beyond the MVP stage.
Step 1: Define Clear Goals
Before starting to build a platform, it is important to determine if the organization actually needs one and what is driving the need for it. Additionally, it is crucial to establish clear goals for the platform and define criteria for measuring its success. Identifying the specific business goals and outcomes that the platform will address is essential to validate its necessity.
While the benefits of reducing cognitive load for developers, providing self-serve infrastructure and improving the developer experience are obvious, it is important to understand the organization’s unique challenges and pain points and how the platform can address them. Some common business goals include the following:
- Accelerating application modernization through shared Kubernetes infrastructure.
- Reducing costs by consolidating infrastructure and tools.
- Addressing skill-set gaps through automation and self-serve infrastructure.
- Improving product delivery times by reducing developer toil.
Step 2: Discover Landscape and Identify Use Cases
Once platform teams establish high-level business goals, the next step in the platform development process is to understand the current technology landscape of the organization. Platform teams must develop a thorough understanding of their existing infrastructure and their future infrastructure needs, applications, services, frameworks and tools. They must also understand how internal teams are structured, how skilled those teams are with tools like Terraform and which SDLC tools they use. This can be done through inventory audits and a series of discovery calls and interviews with application teams, business units and other potential platform users.
Through the discovery process, platform teams must identify the challenges that the internal teams face with the current services and tools, deriving the use cases for the platform based on the pain points of the internal users. The use cases can range from simple ones, such as creating self-serve development environments, to more complex ones, such as single-pane-of-glass administration for infrastructure management and application deployment. The following are several discovery items:
- Current infrastructure (e.g., public clouds, private clouds)
- Kubernetes distributions in use (Amazon EKS, AKS, GKE, upstream Kubernetes)
- Managed services (databases, storage, registry, etc.)
- CI/CD methodologies currently in use
- Security tools
- SDLC tools
- Internal teams and their structure, to inform RBAC, clear isolation boundaries and team-specific workflows
- HA/DR requirements
- Applications and services in use, plus common frameworks and technology stacks (Python, Java, Go, React, etc.), to create standard templates, catalogs and documentation
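Discovery findings are easier to turn into use cases when they are recorded in a consistent shape. The following is a minimal sketch, not a prescribed tool: the `TeamProfile` fields, team names and pain points are all hypothetical. It counts how many teams reported each pain point so that the most widespread ones surface first.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class TeamProfile:
    """Discovery findings for one application team (all fields are illustrative)."""
    name: str
    clouds: list[str]
    kubernetes_distributions: list[str]
    ci_cd_tools: list[str]
    languages: list[str]
    pain_points: list[str] = field(default_factory=list)


def rank_pain_points(profiles: list[TeamProfile]) -> list[tuple[str, int]]:
    """Count how many teams reported each pain point, most common first."""
    counts: Counter = Counter()
    for profile in profiles:
        # A set avoids counting a pain point twice for the same team.
        counts.update(set(profile.pain_points))
    return counts.most_common()


# Hypothetical interview results from three teams.
profiles = [
    TeamProfile("payments", ["AWS"], ["Amazon EKS"], ["GitLab"], ["Java"],
                ["slow environment provisioning", "no central logging"]),
    TeamProfile("search", ["AWS"], ["Amazon EKS"], ["GitLab"], ["Go"],
                ["slow environment provisioning"]),
    TeamProfile("web", ["AWS"], ["upstream Kubernetes"], ["Jenkins"], ["Python"],
                ["slow environment provisioning", "no central logging"]),
]

ranked = rank_pain_points(profiles)
for pain_point, team_count in ranked:
    print(f"{team_count} team(s): {pain_point}")
```

A spreadsheet would serve the same purpose. What matters is that every interview captures the same fields so that findings are comparable across teams.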
Step 3: Define the Product Roadmap
The use cases gathered during the discovery process should inform the roadmap for the platform. This roadmap should outline the MVP requirements necessary to build an initial platform that can demonstrate its value. Platform teams may initially focus on one or two use cases, prioritizing those that potentially benefit the largest group of internal users.
It is recommended to start by piloting the MVP with a small group of internal users, application teams or business units to gather feedback and make improvements. As the platform becomes more robust, it can be expanded to serve a broader range of users and address additional use cases. The following are several example user stories from cloud native application development projects:
- As a developer, I want to create a CI pipeline to compile my code and create artifacts. (CI as a Service and Registry as a Service)
- As a developer, I want to create a sandbox environment and deploy my application to the sandbox for testing. (Environment as a Service)
- As a developer, I want to deploy my applications into Kubernetes clusters. (Deployment as a Service)
- As a developer, I want access to application logs and metrics to troubleshoot product issues.
- As an SRE, I want to create and manage cloud environments and Kubernetes clusters compliant with my organization’s security and governance policies.
- As a FinOps practitioner, I want to create chargeback reports and allocate costs to various business units. (Cost management as a Service)
- As a security engineer, I want to consistently apply network security and Open Policy Agent (OPA) policies across the Kubernetes infrastructure. I also want to see policy violations and access logs in the central SIEM platform. (Network and OPA policy management as a Service)
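The MVP selection described in this step, focusing on one or two use cases that benefit the most internal users, can be sketched as a simple ranking. This is only an illustration: the `UseCase` type and the user counts are made up, and a real roadmap would also weigh effort, risk and dependencies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseCase:
    """A candidate platform capability (names and numbers are illustrative)."""
    name: str
    users_benefiting: int


def select_mvp(use_cases: list[UseCase], limit: int = 2) -> list[UseCase]:
    """Return the `limit` use cases that benefit the most internal users."""
    ranked = sorted(use_cases, key=lambda uc: uc.users_benefiting, reverse=True)
    return ranked[:limit]


# Hypothetical backlog derived from the user stories above.
backlog = [
    UseCase("CI as a Service", 120),
    UseCase("Environment as a Service", 200),
    UseCase("Deployment as a Service", 180),
    UseCase("Cost management as a Service", 15),
]

mvp = select_mvp(backlog)
print([uc.name for uc in mvp])
```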
Step 4: Build the Platform
Building the platform involves developing the automation backend to provide the infrastructure, services and tools that internal users need in a self-serve manner. The self-serve interface can vary from Jenkins pipelines to Terraform modules to a Backstage-based internal developer portal (IDP) to a custom portal.
The backend involves automating tasks such as creating cloud environments, provisioning Kubernetes clusters, creating Kubernetes namespaces, deploying workloads in Kubernetes and exposing application logs and metrics. Care must be taken to apply the organization’s security, governance and compliance policies as platform teams automate these tasks. The following simple technology stack is assumed for the example organization:
- Infrastructure: AWS
- Kubernetes: Amazon EKS
- Registry: Amazon ECR
- CI/CD: GitLab for CI and Argo CD for application deployment
- Databases: Amazon RDS for PostgreSQL, Amazon ElastiCache for Redis
- Observability: Amazon OpenSearch Service, Prometheus and Grafana for metrics, Opsgenie for alerts
- Security: Okta for SSO, Palo Alto Prisma Cloud
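Whatever the stack, backend automation typically checks each self-serve request against the organization’s policies before anything is created. The following is a minimal sketch for the namespace-creation task. The required labels are a hypothetical policy, while the name check follows the Kubernetes rule that namespace names must be valid RFC 1123 DNS labels. The function only builds the manifest; applying it to a cluster is left to whatever deployment tooling the platform uses.

```python
import json
import re

# Kubernetes requires namespace names to be valid RFC 1123 DNS labels:
# lowercase alphanumerics or '-', starting and ending with an alphanumeric.
DNS_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")

# Hypothetical organizational policy: every namespace carries these labels.
REQUIRED_LABELS = ("team", "environment", "cost-center")


def build_namespace_manifest(name: str, labels: dict[str, str]) -> dict:
    """Return a Namespace manifest, or raise ValueError if a guardrail fails."""
    if len(name) > 63 or not DNS_LABEL.fullmatch(name):
        raise ValueError(f"invalid namespace name: {name!r}")
    missing = [key for key in REQUIRED_LABELS if key not in labels]
    if missing:
        raise ValueError(f"missing required labels: {missing}")
    return {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": name, "labels": dict(labels)},
    }


manifest = build_namespace_manifest(
    "payments-dev",
    {"team": "payments", "environment": "dev", "cost-center": "cc-1234"},
)
print(json.dumps(manifest, indent=2))
```

Rejecting a bad request at this stage gives the developer immediate feedback and keeps non-compliant resources from ever reaching the cluster.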
The example organization runs workloads in the AWS cloud. All stateless application workloads are containerized and run in Amazon EKS clusters. Workloads use Amazon RDS for PostgreSQL as the database and Amazon ElastiCache for Redis as the cache. The initial user stories are:
- Create an AWS environment consisting of a separate AWS account, a VPC, an IAM role, security groups, an Amazon RDS for PostgreSQL instance and an Amazon ElastiCache cluster.
- Create an EKS cluster with add-ons required for security, governance and compliance.
- Download the kubeconfig file.
- Create a Kubernetes namespace.
- Deploy workload.
The following steps implement these user stories using Backstage and the Rafay Backstage plugins:
- Install the Backstage app and configure Postgres.
- Configure authentication using Backstage’s auth provider.
- Set up Backstage catalog to ingest organization data from LDAP.
- Set up Backstage to load and discover entities using GitHub integration.
- Create a blueprint in the Rafay console to define a baseline set of software components required by the organization (cost profiles, monitoring, ingress controllers, network security and OPA policies, etc.).
- Install the Rafay frontend and backend plugins in the Backstage app.
- Use template actions provided by the Rafay backend plugin to add software templates for creating services.
- Create a Cluster template with the ‘rafay:create-cluster’ action and supply the blueprint and other configuration either from user input or from defaults defined in cluster-config.yaml.
- Create Namespace and Workload templates using the ‘rafay:create-namespace’ and ‘rafay:create-workload’ actions.
- Import UI widgets from the Rafay frontend plugin to create component pages for services and resources developed through templates (EntityClusterInfo, EntityClusterPodList, EntityNamespaceInfo, EntityWorkloadInfo, etc.).
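To make the Cluster template step more concrete, the following sketch builds the general shape of a Backstage software template as a Python dictionary. Backstage templates are normally written in YAML, and the dictionary mirrors that structure. The `apiVersion`, `kind` and `spec.steps` layout follow the Backstage scaffolder format, but the input keys passed to ‘rafay:create-cluster’ (`clusterName`, `blueprint`), the default values and the helper names are assumptions for illustration, not the plugin’s documented schema. The sketch also shows one way to overlay user input on defaults, standing in for the values that would live in cluster-config.yaml.

```python
import json

# Hypothetical defaults; in the article these would live in cluster-config.yaml.
CLUSTER_DEFAULTS = {"blueprint": "org-baseline", "region": "us-west-2", "nodeCount": 3}


def resolve_cluster_config(user_input: dict) -> dict:
    """Overlay user-supplied values on the defaults; None means 'not provided'."""
    provided = {k: v for k, v in user_input.items() if v is not None}
    return {**CLUSTER_DEFAULTS, **provided}


def build_cluster_template() -> dict:
    """Return a Backstage scaffolder Template entity as a plain dict."""
    return {
        "apiVersion": "scaffolder.backstage.io/v1beta3",
        "kind": "Template",
        "metadata": {"name": "create-cluster", "title": "Create a Kubernetes cluster"},
        "spec": {
            "owner": "platform-team",
            "type": "resource",
            "parameters": [{
                "title": "Cluster details",
                "required": ["clusterName"],
                "properties": {
                    "clusterName": {"type": "string", "title": "Cluster name"},
                    "blueprint": {"type": "string", "title": "Blueprint"},
                },
            }],
            "steps": [{
                "id": "create-cluster",
                "name": "Create cluster",
                "action": "rafay:create-cluster",
                # Input keys are assumptions, not the plugin's documented schema.
                "input": {
                    "clusterName": "${{ parameters.clusterName }}",
                    "blueprint": "${{ parameters.blueprint }}",
                },
            }],
        },
    }


template = build_cluster_template()
config = resolve_cluster_config({"blueprint": None, "nodeCount": 5})
print(json.dumps(template, indent=2))
print(config)
```

The Rafay plugin documentation should be consulted for the inputs each action actually accepts before adapting a template like this.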
[Figure: screens in the Backstage developer portal after the implementation]
While this is a simple representation of a platform built using Backstage and the Rafay Backstage plugins, the actual platform may need to address many other use cases, which may require a larger effort. Similarly, platform teams may use a different interface and automation backend for building the platform.
Treat the Platform as a Product
When embarking on the journey of building a platform, it is essential to treat the platform as a product and follow a systematic approach similar to any other product development. After setting clear goals, invest time in thoroughly discovering and understanding the organization’s technological landscape, identifying current pain points and gathering requirements from internal users. Based on these findings, a roadmap for the platform should be defined, setting clear milestones and establishing success criteria for each milestone.
Building such a platform requires consideration of various factors, including current and future infrastructure needs, application deployment, security, operating models, cost management, developer experience, and shared services and tools. Conducting a build versus buy analysis helps determine which parts of the platform should be built internally and which open source and commercial tools can be leveraged. Most platforms ultimately combine all three: internally built components, open source tools and commercial tools. It is crucial to treat internal users as the platform’s customers, continuously seeking their feedback and iteratively improving the platform to ensure its success.