
Four Node.js Gotchas that Operations Teams Should Know about

31 Oct 2016 8:36am, by Daniel Khan

This contributed piece is from a speaker at Node.js Interactive North America, an event offering an in-depth look at the future of Node.js from the developers who are driving the code forward, taking place in Austin, TX from Nov. 29 — Dec. 2.

Daniel Khan (@dkhan) has over 17 years of experience as full stack developer, architect and technical lead in the field of web engineering. As technology lead, Daniel is driving the support for Node.js performance monitoring at Dynatrace.

There is no doubt that Node.js is one of the fastest growing platforms today. It can be found at start-ups and enterprises throughout all industries from high-tech to healthcare.

A lot of people have written about the reasons for its popularity and why it has made sense in “digital transformation” efforts. But when you implement Node.js, do you have to replace your mainframes and legacy software with a shiny new Node.js-based microservice architecture?

Let’s zoom out and walk in the shoes of those who oversee the whole digital value chain: operation and performance teams. What challenges do operation and performance teams face today when they begin to implement Node.js? Does it require an entire gutting of their system?


Figure 1: How Node.js Is Used by Enterprises

New Tier / New Paradigm / New Challenges

In many cases, Node.js acts as a new tier that augments the enterprise stack and connects it with new offerings. It’s the fast moving technology at the edge of the system.

Figure 2: The Number of Tiers, the Number of Stakeholders and the Complexity Grew Exponentially over the Last 30 Years

One often-embraced benefit of Node.js is that it enables teams to move much faster. Add microservices, and suddenly there are multiple deployments per day instead of one every few weeks. For many enterprises this introduces a new paradigm and requires changes to processes that affect other parts of the organization, particularly the teams in charge of availability and performance: the operations and performance teams.

These teams aren’t staffed with Node.js experts, and they don’t have to be. They are driven by metrics like mean time to repair (MTTR), and their main concern is finding the root cause of performance degradations and outages fast. How can these teams make sure the transition to Node.js goes smoothly without hurting their bottom line? How can they keep their systems humming?

Below we’ve listed out a few common Node.js problems that occur when you introduce it in the enterprise, and how best to manage and solve these problems.

Top Node.js Problems and How to Track Them Down

Node.js applications in enterprise scenarios are rather simple.

Common use cases are:

  • Fetching data from backends.
  • Performing authentication for incoming requests.
  • Rendering views.

Node.js uses Google V8, the JavaScript engine of Chrome, as its runtime, together with a library called libuv that provides an event loop for performing asynchronous tasks. All of that is abstracted away from the user behind a well-defined JavaScript API, so not much can go wrong: for example, there is no way to introduce thread-locking issues, and tracking down root causes is generally easier than on other platforms.

Still, some typical problem sources need to be watched closely.

1. Memory Leaks

When it comes to runtime behavior, Node.js is similar to Java: it runs as a long-running process and is therefore prone to memory leaks of all kinds. As on other platforms, memory leaks materialize as steadily growing heap usage, eventually causing a crash when the maximum allocatable heap is exhausted. Often this is accompanied by high garbage-collector churn as the runtime desperately tries to free memory.

Figure 3: Progression of a Memory Leak

Possible causes can be as simple as large objects hooked to the root scope and hence never freed. But there are also more difficult cases caused by so-called closures (functions that retain their enclosing scope), which make it hard for the garbage collector to release those dependencies. There are also cases where the host simply has too little memory configured, so the garbage collector does not run in time.

To track down memory leaks, heap dumps are the tool of choice. There are several modules that export V8 hooks to JavaScript. Using them, it is fairly easy to trigger a dump whenever certain memory thresholds are exceeded. Here is an example that uses simple anomaly detection and utilizes the module v8-profiler to create dump files that can be consumed by Chrome Developer Tools.

Figure 4: Heap Dump Analysis with Chrome Developer Tools

2. CPU Problems

Node.js runs JavaScript in a single thread, so it's not a good fit for CPU-heavy operations. While the CPU is occupied, for example by transforming a large chunk of JSON, no other requests can be handled.
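The effect is easy to reproduce; in this small sketch, a zero-millisecond timer cannot fire while a large synchronous JSON.parse holds the thread:

```javascript
// Sketch: synchronous work blocks the single thread, so even an
// already-due timer has to wait until the parse is finished.
const big = JSON.stringify(Array.from({ length: 500000 }, (_, i) => ({ i })));

let timerFired = false;
setTimeout(() => { timerFired = true; }, 0);

const start = Date.now();
JSON.parse(big);                 // occupies the one and only thread
const blockedMs = Date.now() - start;

// Still false here: the timer callback is queued but cannot run
// while this synchronous code is executing.
console.log(`parse blocked for ~${blockedMs} ms; timer fired: ${timerFired}`);
```

In a web application, every queued incoming request waits in exactly the same way the timer does here.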

Figure 5: Performance Degradation Caused by CPU Congestion

Netflix — a big Node.js shop — had such a problem when an automated script created routes without disposing of the old ones, causing the routing table to fill up over time. At some point, discovering the right function to call for an incoming request took so much time that it severely affected performance. Read their blog post about that.

Node.js comes with hooks out of the box to switch on CPU sampling; the data produced by the sampler can then be consumed by various tools, making it rather easy to find out where time is spent.

Like for memory introspection, there are several ways to capture CPU samples from within JavaScript to analyze them in various tools.

Here is an example that uses v8-profiler again, this time to capture CPU sampling data and find out what was on the CPU during a given time slice.

Figure 6: A Sunburst Chart Created with D3.js Shows the Distribution of CPU Time; More than 25 Percent Is Attributed to Finding the Right Function to Call for a Route in a Huge Routing Table

3. Back Pressure

When Node.js acts as a glue tier connecting different parts of the stack, problems further down the stack may surface first in Node.js. Back pressure occurs when Node.js dispatches requests to slow backends. While Node.js has excellent capabilities for performing outbound requests, slow backends can congest the machinery waiting for those requests to come back. Degraded performance and even exceptions can be the result.

The metric to look at in this case is the number of dispatched vs. the number of returning requests at any given time.
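A crude version of that metric can be kept in-process; trackBackendCall below is a hypothetical wrapper name, and the alert threshold is illustrative:

```javascript
// Sketch: gauge of dispatched-but-unreturned backend requests.
let inFlight = 0;

async function trackBackendCall(requestFn) {
  inFlight++;                      // dispatched
  try {
    return await requestFn();      // any promise-returning request
  } finally {
    inFlight--;                    // returned (or failed)
  }
}

// Periodically compare the gauge against an illustrative threshold:
setInterval(() => {
  if (inFlight > 100) {
    console.warn(`Possible back pressure: ${inFlight} requests in flight`);
  }
}, 5000).unref();
```

A steadily climbing gauge means requests are being dispatched faster than the backend can answer them, which is exactly the condition shown in Figure 7.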

Figure 7: Back Pressure Occurs When Requests Pile Up in Node.js Because a Backend Is Replying Too Slowly

Such problems can only be tracked down to their root cause with a monitoring solution that traces transactions across all tiers and provides metrics about inter-tier communication. Every major vendor in the APM space today provides agents that monitor requests going into and out of Node.js.

4. Security

Node.js offers a huge repository of small, composable modules. Using the Node.js package manager (npm), adding modules to a project takes a matter of seconds. Well-known frameworks like hapi or Express build on them, and it would be highly inefficient to relinquish their use completely.

Still, every module installed is third-party code. It can be poorly maintained and contain bugs that are never fixed or, even worse, security issues. Before using a module, a developer should always check its quality and consider whether the functionality is trivial enough to implement in-house instead.

To tackle the problem, many enterprises also run their own, private npm repository where only packages that went through some auditing process can be found.

Tools like the Node Security Platform or Snyk can streamline this process by using exploit databases to find and fix possible security issues in installed modules.

Outlook

The Node.js diagnostics and post-mortem working groups focus solely on ways to extend and unify the tracing and debugging capabilities within Node.js.

A few highlights from them include:

  • A new tracing facility is around the corner that will allow low-overhead, process-level tracing.
  • There are current initiatives to unify the way core dumps can be analyzed.
  • With async-hooks, there will finally be a generic way to accomplish long stack traces and transactional tracing through callbacks.

Given the current pace of development and how actively the community is driving performance topics, Node.js enterprise capabilities will make another leap in 2017.

Summary

Node.js applications are very often small and not complex, yet communication between tiers, memory leaks and CPU congestion can still cause issues. Luckily, the platform isn't a black box, and for every problem there are ways to introspect running applications to find the root cause.

Monitoring is a topic the Node.js project takes seriously, and upcoming releases will introduce additional ways to trace, debug and monitor Node.js, adding even more capabilities to fix problems fast.

The next time your development team wants to implement Node.js, have no fear, Ops.

