Can the Internal Developer Portal Solve Alert Chaos?
Platform engineering is gaining major traction, and 2023 is looking like another record year for the topic. Organizations are re-examining the developer experience and how platform engineering can improve it. The focus is on reducing the cognitive load that results from the shift-left approach and the increasing complexity of modern application development, making developers more productive and self-sufficient.
Alerts are numerous and spread across many tools, often without much context, driving up cognitive load and, worse, alert fatigue. Can alerts be dealt with using the same platform engineering tools, namely the internal developer portal? Or should alert fatigue and the proliferation of alerting tools be left outside the domain of platform engineering?
An often-overlooked superpower that internal developer portals have is the ability to unify tools and practices that may not be tied to application development but to a more general “taking care of business.” This can be DevSecOps, bringing vulnerabilities and incidents into a package catalog, FinOps and more. This happens by virtue of the software catalog acting like a huge metadata store of anything related to developers and DevOps, providing the right information in context.
Last but not least, internal developer portals can play a role in unifying alerts from many tools into one central console, improving the developer experience and ensuring alerts are treated as they should be, in context.
The Software Catalog Connects Everything
According to Gartner, internal developer portals are the most mature tool in the emerging platform engineering space. In the portal, developers can access developer self-service actions, such as creating cloud resources, scaffolding microservices, performing Day-2 actions, setting up ephemeral environments and more. They can also use the software catalog to view abstractions of DevOps and software development life-cycle (SDLC) assets.
The core of the internal developer portal is the software catalog. Software catalogs come in all shapes and sizes: some are driven by CI/CD data, some by cloud resource data and others by GitOps data. The flexibility of the software catalog allows the definition of any catalog entity whose data sheds light on how to work better, such as a running service in production. It all depends on the use case. Check out these internal developer portal templates to see the types of catalogs that can be created.
There’s a main theme here: The software catalog is a metadata store that contains everything a developer needs, and then shows developers abstractions of the data. It reduces cognitive load through redaction and whitelisting, while still allowing search on all metadata, all within role-based access control (RBAC) policies. We’ve shown how this works in a Kubernetes software catalog, essentially providing developers with a central point from which to access data about all things related to the software development life cycle.
What would be the use of unified alert data in the software catalog?
A Single Pane of Glass for All Alerts
DevOps tools, cloud resources and monitoring tools fire endless alerts, whether from Prometheus, Grafana, Datadog, Sentry, AWS, Coralogix, Splunk and others.
There are benefits to bringing alerts into a graph software catalog. When an alert for a specific resource occurs, you can immediately understand the ripple effect it has on other software catalog entities. For example, if you have an alert on a cloud resource (max memory), you can easily understand the affected microservices that are using this cloud resource and identify risks accordingly.
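To make the ripple-effect idea concrete, here is a minimal Python sketch of walking relations in a graph-shaped catalog to find the services affected by a cloud resource alert. All resource and service names here are illustrative, not taken from any real catalog.

```python
# Hypothetical in-memory sketch of a graph software catalog:
# relations map a cloud resource to the services that depend on it.
catalog_relations = {
    "prod-redis-cache": ["checkout-service", "cart-service"],
    "prod-postgres-main": ["orders-service"],
}

def affected_services(alert_resource: str) -> list[str]:
    """Walk the relation graph to find services impacted by a resource alert."""
    return catalog_relations.get(alert_resource, [])

# An alert on max memory for the Redis cache surfaces both dependent services:
print(affected_services("prod-redis-cache"))  # ['checkout-service', 'cart-service']
```

In a real portal the relations would come from the catalog's API rather than a hard-coded dictionary, but the traversal idea is the same: start from the alerted resource and follow its edges to assess the blast radius.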
Many platform engineering leaders ask me whether they can add those alerts into the internal developer portal. It’s not that they are looking for yet another alert console. What they want to achieve is in-context alerts for developers in one central place without traversing many DevOps tools. It means that troubleshooting is no longer alert-oriented but oriented to the actual service or cloud resource that is affected. This is what the software catalog is all about.
The internal developer portal provides developers with a layer of visibility into the SDLC of their applications. Alerts affect the business and thus need to be looked at from the “resource” point of view: they should enrich the data developers see about services and resources, not the other way around.
An alert for a cloud resource will also show the related service, its environment and the service or resource owner. This means that developers don’t need to check many alerting tools, and more importantly, developer cognitive load is reduced since there is no need to hunt for the affected services, entities, etc.
Let’s see how this is implemented in practice:
In the internal developer portal, we set up a blueprint for alerts. A blueprint is the generic building block in Port. It represents assets that can be managed in Port, such as microservices, environments, packages, clusters, databases and many more. Blueprints can be used to represent any software catalog asset. Below you can see the properties defined for the alert blueprint.
In this case, we chose the following properties for alerts:
- Category (the alert type) — such as infrastructure, security or system health
- Severity level — error, warning and info
- Status — closed, acknowledged and open
- Total number of alert occurrences
- Source — a link to the relevant system with all additional information about the alert. The source can be any relevant system such as Snyk, PagerDuty, Sentry and more, all presented under one blueprint.
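The properties above can be sketched as a simple data model. This is a hypothetical Python rendering for illustration only; it is not Port's actual blueprint schema, which is defined in the portal itself.

```python
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    ERROR = "error"
    WARNING = "warning"
    INFO = "info"

class Status(Enum):
    OPEN = "open"
    ACKNOWLEDGED = "acknowledged"
    CLOSED = "closed"

@dataclass
class Alert:
    category: str        # e.g. "infrastructure", "security", "system health"
    severity: Severity
    status: Status
    occurrences: int     # total number of alert occurrences
    source: str          # link to the originating system (Snyk, PagerDuty, Sentry, ...)

# An example alert entity (all values illustrative):
alert = Alert(
    category="infrastructure",
    severity=Severity.ERROR,
    status=Status.OPEN,
    occurrences=3,
    source="https://example.pagerduty.com/incidents/123",
)
```

Modeling severity and status as closed enumerations mirrors how a blueprint constrains property values, so every alert ingested from any tool is normalized to the same shape.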
We then created a relation between the “Alerts” blueprint and the “Running Service” blueprint so that each alert is linked to its relevant service and environment, providing context and an inkling of the blast radius of the alert.
We also added “Mirror Properties” (properties that are based on relations) to display additional information in the “Alert” blueprint that will give us more context to better understand the alert:
- The service and environment in which the alert occurs
- Environment type, for example EKS, GKE
- The on-call for the service from PagerDuty (the “incident owner”)
- The service code owners from GitHub
- Investigate — links to the relevant monitoring dashboard for the specific service, such as Grafana
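A minimal sketch of how mirror properties work: fields from the related running service entity are copied onto the alert's view, so the alert arrives with its context already attached. The service registry, field names and values below are all hypothetical.

```python
# Hypothetical registry of running service entities the alert relates to.
running_services = {
    "checkout-prod": {
        "environment": "production",
        "environment_type": "EKS",          # e.g. EKS, GKE
        "on_call": "alice",                 # incident owner, from PagerDuty
        "code_owners": ["@team-checkout"],  # from GitHub
        "dashboard": "https://grafana.example.com/d/checkout",  # investigate link
    },
}

MIRRORED_FIELDS = ["environment", "environment_type", "on_call",
                   "code_owners", "dashboard"]

def enrich_alert(alert: dict) -> dict:
    """Attach mirrored service context to an alert via its relation."""
    service = running_services[alert["service"]]
    return {**alert, **{field: service[field] for field in MIRRORED_FIELDS}}

enriched = enrich_alert({"id": "alrt-1", "service": "checkout-prod",
                         "severity": "error"})
```

The developer sees one enriched entity instead of hunting through PagerDuty, GitHub and Grafana separately for the same context.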
This generates alert entities in the software catalog:
We can now drill down into each of them:
In the blueprint, we also defined developer self-service actions that can be performed on alerts, from acknowledging to rolling back and restarting services.
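Conceptually, a self-service action is a named operation dispatched against an alert entity. The following is a hypothetical dispatcher sketch; in a real portal these actions would trigger actual workflows (a deployment rollback, a service restart) rather than mutate a dictionary.

```python
# Hypothetical self-service actions operating on an alert entity (a dict).
def acknowledge(alert: dict) -> dict:
    """Mark the alert as acknowledged."""
    return {**alert, "status": "acknowledged"}

def rollback(alert: dict) -> dict:
    """Record a rollback request; a real portal would trigger a pipeline here."""
    return {**alert, "last_action": "rollback"}

ACTIONS = {"acknowledge": acknowledge, "rollback": rollback}

def run_action(name: str, alert: dict) -> dict:
    """Dispatch a named self-service action against an alert."""
    if name not in ACTIONS:
        raise ValueError(f"unknown self-service action: {name}")
    return ACTIONS[name](alert)
```

Keeping actions behind a single dispatcher is what lets the portal expose them uniformly, with RBAC deciding who may invoke which action.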
Using Scorecards on Top of Alert Data Is a Force Multiplier for Workflow Automation
Internal developer portals let you define scorecards on top of any software catalog entity. Scorecards let you set and track standards and key performance indicators on anything from production readiness, DORA metrics, Kubernetes standards and more. Scorecards help developers with “you build it, you own it” because they measure and visualize what’s important.
Scorecards provide developers with a bird’s-eye view of the resources they own as engineers (services, environments, etc.). Traditional scorecards can check whether an owner is assigned or whether a cluster meets a minimum version. Scorecards work well for alerts since they “grade” the alerts on the different entities in the software catalog. Adding this data provides a deeper understanding of the health and quality of a resource.
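A grading rule can be as simple as mapping open alert severities to a level. This sketch assumes hypothetical Bronze/Silver/Gold tiers and a simple severity rule; real scorecard rules would be configured in the portal, not coded by hand.

```python
def alert_scorecard(alerts: list[dict]) -> str:
    """Grade an entity by its open alerts: Gold is clean, Bronze has open errors."""
    open_alerts = [a for a in alerts if a["status"] == "open"]
    if any(a["severity"] == "error" for a in open_alerts):
        return "Bronze"
    if any(a["severity"] == "warning" for a in open_alerts):
        return "Silver"
    return "Gold"

# A service with one open error alert grades Bronze:
print(alert_scorecard([{"status": "open", "severity": "error"}]))  # Bronze
```

Because the grade is derived from catalog data, it updates automatically as alerts open and close, with no manual bookkeeping by the team.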
In turn, scorecard data associated with a specific software catalog entity can be used in a CI pipeline, for instance, to block certain create actions when an entity’s alert scorecard is poor.
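Such a CI gate might look like the following sketch. The tier names and threshold are assumptions for illustration; in practice the level would be fetched from the portal's API before the pipeline step runs.

```python
def meets_standard(level: str, required: str = "Silver") -> bool:
    """True if the entity's alert scorecard level meets the required bar."""
    order = ["Bronze", "Silver", "Gold"]  # hypothetical tiers, worst to best
    return order.index(level) >= order.index(required)

def can_run_create_action(entity_scorecard_level: str) -> bool:
    """A CI step could call this to block a create action on a poorly graded entity."""
    return meets_standard(entity_scorecard_level)
```

The pipeline step simply fails (or skips the create action) when `can_run_create_action` returns `False`, turning the scorecard from a passive report into an enforced policy.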