Maybe Building a DIY Logging Tool Is Not the Best Idea

Welcome to The New Stack Makers: Scaling New Heights, a series of interviews, conducted by Scalyr CEO Christine Heckart, that cover the challenges engineering managers have faced when scaling architectures to support the demands of the business.
The idea behind building a logging analysis tool is fairly simple. That is, until it’s time to scale it across multiple teams and manage it beyond day two; then things become a bit more complicated. Still, there is always the temptation to build a logging platform because, well, how complicated could it be?
Scaling New Heights Episode #4 – Maybe Building a DIY Logging Tool Is Not the Best Idea
Scalyr Field Chief Technology Officer Anthony Johnson knows this story well. Johnson worked at Ellie Mae, a financial services company, and before that he supported U.S. senators and their aides with a messaging system for legislation and communications.
In his work with the government, Johnson said, the team needed metrics from the logs. Over one July 4 weekend, he wrote a program in C. It auto-updated and emitted metrics that ended up in a round-robin database, which served as a store for event data, time series, and other metrics. The data was visualized with Cacti.
The team used the system to fix outages, Johnson said. Ultimately, though, it was a proprietary system that only Johnson had written, and none of the other people he worked with knew C.
“So at the end of the day, you know, I’m sure that the product went away, got bought by another company,” he said.
At Ellie Mae, Johnson built two services: one using the Amazon Web Services (AWS) Elasticsearch Service, followed by a logging tool that he and his team built themselves using Elastic directly.
“We had scaling challenges with AWS Elasticsearch service,” Johnson said. “There were limits on how many nodes we could have, there were limits on the URL I could present to my users. And they’ve (AWS) solved a lot of these problems today. But again, I’m limited by my vendor.”
So they built their own platform using Elastic, but “it took months and months and months,” Johnson said.
“We built it, you know, it was successful, and it ran great until it didn’t run great,” he said. “And, so we were at around two terabytes a day, you know, probably about 50 to 60 data nodes. We really started running into a lot of problems. Problems with cluster management, problems with data latency coming into the system.”
How many engineers have assumed that building custom software is simple? Then comes the start of the project, the launch, and all the work that follows on day two and beyond. Building custom software may make sense for large software makers. But at a company making software for the government or the financial industry, the interest is less in building logging tools and more in finding the best way to support the organization’s mission and goals.
Amazon Web Services is a sponsor of The New Stack.