Since selling his previous company, Wercker, to Oracle, Micha Hernandez van Leuffen has turned his attention to the difficulties surrounding incident response in massive distributed systems — and among distributed teams.
His new Amsterdam-based company, called Fiberplane, offers a real-time collaboration tool for site reliability engineers, where they can pull in data from observability and other tools to get to the heart of problems.
“When we did Wercker, which was a container-native CI/CD platform, quite a complex distributed system… we were spinning up a lot of instances, a lot of containers, orchestrating these on Kubernetes. But it was hard to debug. So, when we needed to figure something out and did some debugging, it was always going back and forth between metrics, logs, traces, different dashboards, which I always call sort of this treasure hunt — figuring out what is really going on,” he said.
“We were doing these microservices, spread across multiple layers of abstraction — it’s very hard to figure out what’s going on underneath the covers. And then the dashboard as a form factor to use for debugging was not a great fit. Because dashboards are kind of great when you know what to measure, the known knowns, right? You set them up in advance under the presumption that you know what will go wrong.
“But in the world of microservices and distributed systems, and, of course, Kubernetes, these unknowns are way more prevalent. So that got me thinking we need a different form factor, one that is more explorative in nature. And integrates with all these different systems that you use,” he explained.
Built for Teams
Van Leuffen maintains that DevOps and site reliability engineers lack the kind of purpose-built collaboration tools available to other types of teams.
“It was always hard to collaborate around an incident because you kind of need to be in the same time zone, or at least wake up around the same time and sort of go through this treasure hunt together,” van Leuffen said in an episode of The New Stack Makers last September.
“When it comes to Google Docs or Notion in the world of productivity, obviously, Figma in the design space, but DevOps hasn’t really benefited from this move towards collaborative software,” he said in an interview.
Tools around observability and monitoring remain siloed environments designed for the individual user, rather than teams, he says.
“After interviewing 35+ companies, it is clear that the status quo for incident management is a mix of screenshots, log pasting and a lot of noise in the Slack #outages channel paired with the occasional Google doc. Disparate tooling has led to an increase in cognitive load paired with alert fatigue. We can do better,” a company blog post states.
Rust and Wasm
Fiberplane takes inspiration from the data science world with a Jupyter Notebook-type interactive platform that integrates with existing observability tools.
The technology is built in Rust and WebAssembly (Wasm). To resolve conflicts when multiple people are typing at the same time in a Fiberplane notebook and adding data including graphs or tables, it uses Operational Transformation, the same algorithm used by Google Docs.
“As multiple people might be collaborating inside of a notebook, the state of the notebook might diverge. The logic to deal with these conflicts in operation lives both on the server as well as the client. As such, it makes sense to implement Operational Transforms once, using the same codebase, instead of duplicating it across the client and server, probably using different programming languages,” van Leuffen said.
“As such, we chose Wasm — written in Rust, as our API is also Rust — as the way to implement this functionality, saving quite some development time and reducing the risk of inconsistencies. It’s a great fit for Wasm.”
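The conflict resolution described above can be illustrated with a minimal Rust sketch. This is not Fiberplane’s code: the `Insert` type, the `site` tie-breaker and the function names are assumptions made for illustration, and the sketch handles only concurrent text insertions, not deletions or rich cells such as graphs and tables.

```rust
/// Hypothetical operation type: insert `text` at character offset `pos`.
/// `site` identifies the author and is used only to break ties.
#[derive(Clone, Debug, PartialEq)]
pub struct Insert {
    pub pos: usize,
    pub text: String,
    pub site: u32,
}

/// Rewrite `op` so that it can be applied after `other` has already been
/// applied. If `other` inserted text before `op`'s position (or at the same
/// position with a lower site id), `op` shifts right by the inserted length.
pub fn transform(op: &Insert, other: &Insert) -> Insert {
    let other_goes_first =
        other.pos < op.pos || (other.pos == op.pos && other.site < op.site);
    let mut out = op.clone();
    if other_goes_first {
        out.pos += other.text.chars().count();
    }
    out
}

/// Apply an insert to a document, treating `pos` as a character offset.
/// Positions past the end of the document append to it.
pub fn apply(doc: &str, op: &Insert) -> String {
    let byte_idx = doc
        .char_indices()
        .nth(op.pos)
        .map(|(i, _)| i)
        .unwrap_or(doc.len());
    let mut out = String::with_capacity(doc.len() + op.text.len());
    out.push_str(&doc[..byte_idx]);
    out.push_str(&op.text);
    out.push_str(&doc[byte_idx..]);
    out
}

/// Simulate two sites that each apply their own edit first and then the
/// transformed remote edit. Returns the final document seen by each site.
pub fn converge(doc: &str, a: &Insert, b: &Insert) -> (String, String) {
    let seen_by_a = apply(&apply(doc, a), &transform(b, a));
    let seen_by_b = apply(&apply(doc, b), &transform(a, b));
    (seen_by_a, seen_by_b)
}
```

Starting from the text "cpu spike", if one user inserts "high " at offset 0 while another appends " at 09:00" at offset 9, both sites end up with "high cpu spike at 09:00" regardless of the order in which the edits arrive. Because `transform` is a pure function, the same code could in principle be compiled natively for a server and to Wasm for a browser, which is the deduplication argument made in the quote.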
Fiberplane integrates with data from Elastic, Loki and Prometheus, as well as terminal output from CLIs such as kubectl (Kubernetes) and awscli (Amazon Web Services). The output of commands run locally gets piped into a notebook, where the data can be filtered and manipulated.
“These are real charts, real logs, in real time. … So that allows you to sort of start drawing conclusions and correlate events like, ‘Hey, I see this spike in this chart. Let me zoom in on it. What are the logs that go along with that specific service? Let me have a look at that.’”
This is data from your observability stack — monitoring, tracing, logging.
“This data might come from the server but could also be reachable from a user’s browser. To solve this, we implemented Fiberplane Providers, our plugin model, that can live both in the browser as well as inside the user’s network,” he said.
“This allows the plugins to be implemented in any language that targets Wasm, increasing overall performance, plus the Wasm sandboxing model increases security.”
To run Wasm code in multiple environments, the company created fp-bindgen, a bindings generator for writing plugins in Rust that can run both in the browser and in a Rust environment, improving interoperability between Rust and TypeScript. The company has open sourced fp-bindgen.
In addition, Fiberplane lets teams codify a template for certain scenarios, with the charts and logs ready to go, the queries to be executed and actionable intelligence to deduce what’s going on.
The technology is in private beta, though the company is working toward a September general release.
“You go in and all of a sudden you have all of the history of everything that’s been going on and you can see who other people are in the room … And, right away, you’re up to speed on what’s going on, and you’re able to be more valuable to your team.”
AWS and Oracle are sponsors of The New Stack.