Observability Design and Architecture: The Fundamentals
This is Part 1 of a two-part series on how observability design and architecture need to start with the fundamentals: keeping people and process at the center of it all. Part 2 is about the technology aspects.
Gaining observability into modern IT environments is a challenge that an increasing number of enterprises are finding difficult to overcome. Complex data, high data volume and velocity, the rise of cloud computing, and rapidly changing application platforms such as Kubernetes make creating and getting value from the observability pipeline difficult without the right people, processes and technology. Complexity has never been higher. How will enterprises architect and design their observability tools to overcome these challenges and gain observability into the modern enterprise?
Everything Still Starts with People and Process
As a technology person, I originally believed tools should solve everything, but I discovered that people and processes are still the foundation for significant change in any IT function. I learned the hard way that I should have spent more time putting sustainable processes in place to support long-term success. Learn from my mistakes and spend as much time on people and process as you would on the right hardware and network design.
Let’s discuss common foundational steps to put observability teams in position to succeed.
Successful logging needs more than just the observability team. It takes a data/record governance team, an architecture team, legal support and a security team, all reporting high enough in the organization that their concerns reach senior leadership. When a development team turns off logging because it is too much work, senior leadership has to get involved and work with development leaders to resolve the issue. Observability is a strategic goal and needs to be treated as such across the enterprise. The observability team also needs governance support to keep up with ever-changing rules around how data is managed in the United States and around the world. The rise of data privacy regulation is almost overwhelming, so the observability team needs a partnership with professionals who understand the issues and prioritize solving problems and removing roadblocks.
Process starts with data governance, which is required for logging to succeed in any mid-size to large enterprise. The logging team cannot personally work with every developer, OS owner, tool owner or anyone else who generates logs. The enterprise needs clear rules for:
- Data formats.
- How data formats are changed.
- How new devices are added to logging.
- How logging is funded.
- How data is managed once it is logged.
- How data rules and regulations are shared with the logging team.
Metrics, events, logs and traces (MELT) are the most volatile data in most enterprises, and without clear rules, data issues quickly get out of control. Quality processes with a strong governance team can bring order out of the chaos.
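Clear data-format rules only matter if they are enforced where data enters the pipeline. As a minimal sketch of that idea, here is a validator that checks incoming JSON log events against a required field set; the field names (`timestamp`, `source`, `severity`, `message`) are illustrative assumptions, not a standard, and a real governance team would define its own schema.

```python
import json

# Hypothetical minimum schema a governance team might mandate for every
# log event entering the pipeline; these field names are illustrative.
REQUIRED_FIELDS = {"timestamp", "source", "severity", "message"}

def validate_event(raw: str) -> tuple[bool, str]:
    """Return (ok, reason) for a single JSON-encoded log event."""
    try:
        event = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"not valid JSON: {exc}"
    missing = REQUIRED_FIELDS - event.keys()
    if missing:
        return False, f"missing required fields: {sorted(missing)}"
    return True, "ok"

ok, reason = validate_event(
    '{"timestamp": "2024-01-01T00:00:00.000Z", "source": "web01",'
    ' "severity": "info", "message": "service started"}'
)
```

A check like this, run at the ingest tier, turns a governance document into something the pipeline can actually enforce: noncompliant events get rejected or quarantined instead of silently polluting downstream storage.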
Examples of How People and Process Matter
Enterprises can have hundreds of log formats to manage across every device, appliance, and internal and third-party application. Every format adds overhead for the people, processes and technology involved, makes compliance with data regulations harder and makes searching more difficult for users.
Let’s Talk About Syslog…
Many applications and appliances use syslog as the default logging format. Unfortunately, many vendors routinely slap lipstick on a pig and call it “syslog.”
- Pushing structured data over UDP is not syslog.
- A continuous stream of arbitrary data over TCP is not syslog.
- Pushing arbitrary data over UDP or TCP is not syslog!
Syslog has well-defined standards that, if followed, work like magic in logging tools. To all the vendors reading this: Please, for the love of logging, follow the syslog standards below!
- RFC 3164 – https://datatracker.ietf.org/doc/html/rfc3164
- RFC 5424 – https://datatracker.ietf.org/doc/html/rfc5424
Lastly, do not get me started on timestamps. There should be consequences for timestamps without millisecond accuracy.
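To make the point concrete, here is a sketch of building an RFC 5424-compliant message, millisecond timestamp included. The hostname and app name are made-up examples; the format itself (PRI = facility × 8 + severity, VERSION 1, an RFC 3339 timestamp, and the NILVALUE `-` for unused fields) comes straight from the standard.

```python
from datetime import datetime, timezone

def rfc5424_message(facility: int, severity: int, hostname: str,
                    app_name: str, msg: str) -> str:
    """Build a minimal RFC 5424 syslog message.

    PRI is facility * 8 + severity; VERSION is always 1. Unused fields
    (PROCID, MSGID, STRUCTURED-DATA) use the NILVALUE "-".
    """
    pri = facility * 8 + severity
    # RFC 5424 timestamps are RFC 3339; trim microseconds to milliseconds.
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3] + "Z"
    return f"<{pri}>1 {ts} {hostname} {app_name} - - - {msg}"

# Facility 16 (local0), severity 6 (informational) => PRI 134.
print(rfc5424_message(16, 6, "web01.example.com", "checkout", "order accepted"))
```

A message shaped like this parses cleanly in any standards-aware logging tool, with no per-vendor parsing rules required, which is exactly the magic the standard was written to deliver.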
Data Privacy Regulations
Complexity for a multinational enterprise is growing due to distance and ever-expanding data privacy regulations such as the European Union’s General Data Protection Regulation (GDPR), Singapore’s Personal Data Protection Act (PDPA) and Japan’s Act on the Protection of Personal Information (APPI). Data privacy laws can carry significant penalties for non-compliance. Having a strong governance team with access to legal is crucial to know the rules and how to keep the enterprise compliant. It can be easy to make mistakes.
For example: Imagine you process transactions using consumers’ personally identifiable information (PII) from several small countries, but the processing actually occurs in another country. Each country supplying PII may have the right to require the company to sign a data-sharing agreement that lists the nationality of everyone with access to that data.
So the agreement is between the company, the business unit that processes the transactions and the country with the data privacy regulation. You think your regulations are covered, but alas, you forgot your L1 help desk in India. The country fines the company more than it earned there last year and threatens to bar it from future business if it happens again.
Small mistakes add up and data privacy regulations are a focus area the observability team cannot manage by itself. Support from the governance team is required to be successful in a fast-changing and ever-complicated regulatory environment.
The Rise of Kubernetes
Application complexity is exploding as enterprises adopt platforms like Kubernetes. Enterprises need the ability to manage legacy logging and next-generation application platforms like Kubernetes at the same time. Planning and a strong toolset are required to keep legacy tools operating while transitioning to modern cloud native platforms.
According to the Kubernetes and Cloud Native Operations report, 45.6% of respondents use Kubernetes in production although only 15.7% use Kubernetes exclusively.
This is good data, considering this survey self-selects for forward-thinking enterprises interested in Kubernetes. The transition is happening, but slowly. Enterprises must make the choice to build or buy the right software to help manage this transition. Your enterprise and application architects should get the observability team involved early in the Kubernetes journey.
There are a number of facets to Kubernetes logging, including security, application and operational logging. Your Kubernetes platform has to be factored in as well: managed offerings on AWS, Azure and other clouds each differ slightly, so those details have to be taken into account. I cannot stress enough how important it is to partner with your security team to make sure you have good coverage. Kubernetes is a powerful platform that is changing the way applications are managed, but its many layers, general complexity and quickly evolving tooling create potential security gaps that enterprises need a plan to address.
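One practical pattern that works across Kubernetes platforms is having applications emit one structured JSON object per line to stdout, since Kubernetes captures container stdout/stderr for node-level collectors to pick up. The sketch below assumes that pattern; the helper name, field names and label values are all illustrative, not a standard.

```python
import json
import sys
from datetime import datetime, timezone

def log(level: str, message: str, **fields) -> None:
    """Emit one structured JSON log line to stdout.

    Kubernetes captures container stdout/stderr, so one JSON object per
    line is easy for a node-level log agent to parse, enrich with pod
    metadata and route through the observability pipeline.
    """
    record = {
        # Millisecond-precision UTC timestamp, per the earlier complaint.
        "ts": datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
        "level": level,
        "msg": message,
        **fields,  # arbitrary structured context (app, pod, request ID...)
    }
    sys.stdout.write(json.dumps(record) + "\n")

# Example usage with illustrative label values.
log("info", "pod started", app="checkout", pod="checkout-7d9f")
```

Keeping the application side this simple pushes routing, enrichment and platform-specific handling down to the collection layer, which is where per-cloud differences are easiest to absorb.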
The most important factors in the long-term success of your observability practice are people and process. Organizational fundamentals are still your foundation for success. You have to do the heavy lifting of coordinating your internal stakeholders to set up your organization for long-term success.
In Part 2, we’ll talk about how you can architect your infrastructure to support observability. We’ll explore how to scale your capacity, plan for failure and avoid being trapped by the capacity management fallacy.