Contributed / Technology /

Introducing Honeycomb: A New Debugging Service to Address Flawed Systems

11 Apr 2017 7:00am, by

Stephany Wilkes
Stephany Wilkes, Ph.D., spent 20 years in various engineering roles before leaving the tech industry in 2015. She writes, works as an independent researcher, and shears sheep. At our request, Honeycomb.io commissioned Stephany to write this introduction to the company and its technology.

If online systems have proven one truism over and over again, it is that most of them screw up in some way at least some of the time. DevOps, engineering, and support teams can be forgiven for not knowing this because they cannot: The systems used for identifying and surfacing these problems are themselves flawed.

This is the conundrum that Honeycomb, a simple yet powerful tool for debugging modern software, addresses. It empowers engineering teams to solve problems that average metric- and log-based tools cannot surface, and helps answer questions that, even today, remain mysterious: Is my stuff working? If not, why not?

Honeycomb.io, which launches its SaaS service into full general availability this week, provides full stack observability, for event driven debugging. The company is designed from the ground up to debug live production software, consume event data from any source with any data model, and make collaboration a central part of problem-solving.

Honeycomb is built on the notion of an event-focused view of the software stack. In a nutshell, it offers genuine, full-stack observability, from clients (mobile, IoT), to your own services (APIs, other microservices), to software written by other people (databases, message buses, web servers, proxies, containers, schedulers, and so on). It allows you to ask and answer questions across any subset of the stack, whether you wrote it or not.

Existing tools make it difficult to surface and diagnose the unanticipated, unpredictable problems we see more of every day, for a few different reasons. They:

  • Make baked-in assumptions about end users’ technical environments;
  • Rely on predefined indexes and schemas;
  • Cannot efficiently handle high granularity data (cardinality);
  • Are not designed for exploratory, iterative, or real-time analysis;
  • Are not fundamentally collaborative.

Let’s take a look at each of these shortcomings and how Honeycomb addresses them.

Honeycomb Is Environment Agnostic

Many tools make assumptions about customers’ technical environments, which makes it difficult for these systems to look beyond their preconceived notions of what is important. In addition, the cost of initial instrumentation can be high, which makes such systems expensive and time-consuming to roll out and change.

Honeycomb, by contrast, is environment agnostic and integrations are fast and cheap. This means, among other things, that it is easy to send more and different data to Honeycomb. Honeycomb provides helpers for various languages, platforms, and software, but it speaks JSON and only JSON. “Much of the richness and detail of an app lives between the nodes and components,” said Charity Majors, Honeycomb co-founder and CEO. “It doesn’t make sense to start with anything but a lingua franca for all services and teams.”

Honeycomb Does Not Rely on Predefined Indexes and Schemas

Many tools depend on indexes and schemas, which are by definition predefined: They require you to predict what queries you will need to run in the future. This is particularly problematic during technical crises. When something goes haywire and engineers need to attack and destroy problems, it either takes too long or is impossible for them to answer questions they didn’t think of in advance but urgently need to answer.

As Majors puts it, “Schemas add friction and rigidity to your development flow. This is a very bad thing because it means you are trying to predict the unpredictable. There are so many times that weird, adjacent facts — that don’t quite add up — end up casting light on a complex issue. Honeycomb captures all those details, without making you slow down and consider whether it’s truly important enough to pay the cost of a migration, or whether it can really be squished into some existing data model or not.”

With Honeycomb, engineers never reach a point where they can’t ask more questions.

The purple line displays a single slow query.

This image elaborates on the first image, shown above. Here, Honeycomb digs into slow queries specifically and looks at the full event data in its Data Mode. Data Mode examines the full data for each event represented in the chart and provides a way to obtain the full context from actual events sent to Honeycomb. The term Playlist (top right, “Add to Playlist) describes collections of user-defined queries that are saved in Honeycomb for later use or use by others on a team.

Majors adds that “Some metrics shops try to imitate this approach using ‘tags’ on their metrics, but this approach is severely limited by the way they store their data. You can only have tags that number in the tens, or maybe the low hundreds before it falls over. This rules out having a tag per host ID, or build ID, or user ID, which is one of the top complaints of those platforms’ users. With Honeycomb, every metric is also effectively a tag you can search by.”

Many tools cannot handle having “user_id” as a dimension, because of the high cardinality caused by having a large number of users. Honeycomb is built to do exactly this, however; user_id as a dimension is a direct result of Majors and Yen’s experience supporting 60,000 apps that served millions of users at Parse.

The same issue also plagues dashboards, which only show what they are told to show, and often hide problems behind graphs and averages. After you have a handle on what’s important, dashboards can be helpful. They are not, however, a great way to figure out what’s important in the first place, nor can they do necessary investigative work.

Christine Yen, Honeycomb co-founder and chief technology officer, said, “I spent years at Parse (subsequently acquired by Facebook) trying to tell developers that Parse Analytics, based on pre-aggregated metrics and canned reports, was enough to understand what was going on with their app. Then I turned around and used internal Facebook tools, like Scuba, because they could do real-time, ad-hoc analysis of arbitrary datasets.”

Yen and Majors, who met at Parse, want to enable all engineers to do the same, to take matters into their own hands and dive into their own data.

Honeycomb Surfaces Fine-Grained Data

Conventional tools do not provide sufficiently fine resolution to pin problems down to very specific cases.

Say, for example, that a customer who makes up 0.1 percent of traffic reports that something is slow: Their system serves a million or so queries and 1,000 of these look sluggish. A typical first response for support is to check the 99th percentile (P99) response time. Of course, P99 will not include this 0.1 percent customer, so the event will be lost.

Even if P99 were high, however, it would not give a clue as to which requests are slow, and it is both extremely important and very difficult to know which 1,000 are problematic and what they have in common. Typical tools only indicate that there are 1,000 such problematic things, sending engineers spelunking through log files, the only real data available. But, in Majors’ experience, “Using raw log lines is a messy, expensive, wasteful, and a hard-to-enforce discipline.”

Take the example of an API erroring, semi-reliably, for a small number of customers, yet not enough to raise any system-wide alarms. Support folks need to know who these customers are and what they have in common, but Honeycomb is the only system that will tell them with any expedience.

Honeycomb Is Really Real-Time

Many tools claim the real-time moniker but, when it comes down to it, most are not. Honeycomb is. Engineers and support folks can ask any question in real time and get an answer in sub-seconds or seconds (literally), not minutes, hours, or days.

Honeycomb Enables Collaborative Debugging

Majors and Yen believe that “Debugging is a social activity,” that it is most efficient and successful when done as a team, drawing on many levels of local expertise. Majors said, “It must empower your least experienced or skilled performers to benefit from your top producers and both from each other’s efforts.”

Honeycomb gives all developers on a team a way to understand and share what’s going on throughout the system and to do so in a way more senior developers may understand implicitly. Engineers can share different data views to explain things to each other, including views that look backward.

Weekly Review shows all activity across a team in Honeycomb. Being able to see and share work is core to how Honeycomb is meant to be used because it creates a mechanism for every engineer to rise to the skill level of the best engineer in a specific area, such as MongoDB operation.

As for “Is my stuff working?” inquiring team members can, for instance, see the timeframe for the release of their code and receive a factual answer to share with others. Most tools provide engineers with basic information like “Before my code went out, a query took 7 milliseconds; afterward, it took 1 millisecond. Yes, it’s working.” Honeycomb, however, can tell an engineer “The specific query I worked on took 7 milliseconds,” data that can be washed out in more typical system-wide averages.

Here is a single kind of event (in this case, an http 202 status code) as viewed across different builds over time. The number of those events rises in the last build.

Because Honeycomb allows for different kinds of questions, it describes — at a much deeper level of granularity — exactly what is happening and why.

Feature image via Pixabay.


A digest of the week’s most important stories & analyses.

View / Add Comments