Programming Languages

BFFs: Developing NPR’s Voice User Interface with Node.js

24 Sep 2018 6:00am

“OK, Google, ask Alexa to check if Siri can recommend to Cortana a movie to watch with Bixby.”

2018 has been the year of The Rise of the Virtual Assistants. Some of the biggest companies in the world are vying to place their “smart speakers” in your home: Amazon’s Alexa, Google’s Assistant, Microsoft’s Cortana, Apple’s HomePod. Though voice UI is far from a solved problem, these devices have still been selling fast and furiously — even faster than the iPhone did in its first year.

National Public Radio has been at work in the voice UI space from the very start. Consumers aren’t really buying radios these days — those dollars are being redirected to in-home smart devices. NPR is eager to be one of the voice-directed destinations available when users utter those words of in-home invocation, like “Siri” or “Alexa” or “OK, Google…” Nara Kasbergen is a senior full stack web developer at NPR and one of the nonprofit broadcast company’s five-member voice UI development team. As a preview for her presentation at next month’s Node+JS Interactive conference, Kasbergen spoke with The New Stack about her work helping to build a next-generation voice UI framework for NPR.

So: smart speakers, or virtual assistants? What is the best name for this tech, anyway?

Well, “smart speakers” is more like the marketing term. Really, they aren’t all that smart, at least not yet. And the word “speaker” makes them sound like furniture, meant to just fade into the background. But these aren’t some passive thing just sitting there on your end table: their manufacturers want them to take a very active role in your life, to help you do things, to increase your connection with and awareness of their brand, and so on. So developers working in voice UI prefer to call them “virtual assistants.” Although the truly active assistance part, where you can say, “Alexa, make me a haircut appointment” or “make restaurant reservations,” is only just now becoming possible.

For right now, though, studies have shown that the main active use case is people using them to listen to and catch up on the news while they’re getting ready for work in the morning. We absolutely wanted to have a presence on all of these devices so you can say, “Alexa, play NPR” — and something actually happens.

So, how do you even prototype a screenless interface?

When I tell people what kind of work I do, they often respond with, “You must be doing some interesting stuff with machine learning and natural language recognition.” Unfortunately, that is not true. All of that happens out in the cloud.

Here is how it goes: you speak to your device. The device itself does some very basic processing and then sends it on to the cloud service that powers your particular device (at this point we are pretty much talking about Alexa or Google). The cloud service does a whole bunch of natural language processing that translates your speech into a standard-format JSON request that gets sent to your code. Your code — which can live in the cloud or on bare metal — takes that JSON blob, does whatever the original request is asking, and then returns a response, also in a standard, predictable JSON format. The cloud service sends that back to the device, and the device responds to the user with speech, playing audio and so on.
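As a rough sketch of that flow, the shapes below are simplified and illustrative — the real Alexa and Google payloads carry many more fields and differ between platforms:

```typescript
// Simplified, illustrative shapes; real Alexa/Google JSON is larger and
// platform-specific. The names here are assumptions, not the actual formats.
interface VoiceRequest {
  requestType: 'LaunchRequest' | 'IntentRequest';
  intentName?: string;            // e.g. a hypothetical 'PlayNewscastIntent'
  slots?: Record<string, string>; // parameters parsed from the utterance
}

interface VoiceResponse {
  outputSpeech: string; // what the device should say
  audioUrl?: string;    // optional audio stream for the device to play
  endSession: boolean;
}

// Your code sits in the middle: JSON request in, JSON response out.
function handleRequest(req: VoiceRequest): VoiceResponse {
  if (req.requestType === 'LaunchRequest') {
    return { outputSpeech: 'Welcome to NPR.', endSession: false };
  }
  return {
    outputSpeech: 'Playing the latest newscast.',
    audioUrl: 'https://example.com/newscast.mp3', // placeholder URL
    endSession: true,
  };
}
```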

There are official SDKs produced by Amazon and Google for developing voice UIs, and they’re pretty nice to work with. They take a lot of the boilerplate out of parsing the big blobs of JSON that are the requests, and also out of producing the big blobs of JSON that are the responses. Our main job, then, is the part in between.

Overall the code for all this is not very hard at all. It actually makes for pretty boring code samples. If you have ever written a very basic Express server that takes in a request and returns a response, that is all this is. There’s no machine learning, there’s no AI, nothing like that. For us the biggest challenges have been balancing platform limitations and user expectations — that was the thing keeping us up at night.
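In that spirit, here is a minimal sketch of the request-in/response-out pattern, using Node’s built-in http module in place of Express to stay dependency-free; the intent name and speech strings are made up for illustration:

```typescript
import * as http from 'node:http';

// Takes the parsed JSON request body and returns the JSON response body.
// In practice the platform SDKs do most of this parsing for you.
function respond(body: { intentName?: string }): { outputSpeech: string } {
  return body.intentName === 'PlayNPR' // hypothetical intent name
    ? { outputSpeech: 'Tuning in to your local NPR station.' }
    : { outputSpeech: "Sorry, I didn't catch that." };
}

// The Express-style wrapper: read the request body, reply with JSON.
const server = http.createServer((req, res) => {
  let raw = '';
  req.on('data', (chunk) => (raw += chunk));
  req.on('end', () => {
    const response = respond(JSON.parse(raw || '{}'));
    res.setHeader('Content-Type', 'application/json');
    res.end(JSON.stringify(response));
  });
});

// server.listen(3000); // wire up however your hosting platform expects
```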

You characterize your actual role here as “building backends for frontends”?

When we first got started, working on the very first skill we were trying to build for Alexa, we realized we had actually built a backend for a frontend. Which we immediately started calling our BFF. As in, the front end was Alexa and the back end was this “real” API we were consuming — real in the sense that it was already out there, production-ready. In between was really where all the business logic for our app was living; the code we were writing for Alexa was really just taking in a request, figuring out from that request what API call it needed to make…and then producing some JSON response for the Alexa service.

When we were ready to expand this work to Google Assistant, we happily realized we did not need to rewrite that same business logic all over again. Instead, we realized, we could build two BFFs — one for Alexa and one for Google — but otherwise about 60 percent of the code is shared. The shared code is the code that goes out to this “real” API, which is where all the business logic lives. The code that is different is the platform-specific code that takes in the requests, which are in different formats, and produces the responses, which are also in different formats. Basically everything else is the same, though, so we decided to keep it all in one codebase. Now we just produce different builds using Gulp, package those builds into a zip file and upload them to Lambda.

The way we were able to do this was to use a pretty old concept which is totally not sexy and cool anymore, but worked really, really well for this use case: the Model-View-Controller pattern. The controller is the shared code, which makes the call to the API and then figures out whether it got an error or a valid response. The view is the part that is different for Alexa vs. Google Assistant — code that takes in the model and produces the individualized JSON format for each of those — while the model is a class that we called GenericResponseModel. It has fields for all the common properties that both Alexa and Google Assistant support, things like output speech, which is what the speaker says:

class GenericResponseModel {
    public audioUrl = '';
    public outputSpeech = '';
    public repromptSpeech = '';
    public cardContent = '';
}
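To illustrate the split, here is a hedged sketch of the two “views” over that model — the class is repeated so the snippet is self-contained, and the JSON field names are simplified approximations, not the exact platform formats:

```typescript
class GenericResponseModel {
    public audioUrl = '';
    public outputSpeech = '';
    public repromptSpeech = '';
    public cardContent = '';
}

// Alexa view: maps the shared model onto an Alexa-style response shape.
function toAlexaResponse(model: GenericResponseModel) {
  return {
    version: '1.0',
    response: {
      outputSpeech: { type: 'PlainText', text: model.outputSpeech },
      reprompt: {
        outputSpeech: { type: 'PlainText', text: model.repromptSpeech },
      },
    },
  };
}

// Google Assistant view: same model, different JSON shape.
function toGoogleResponse(model: GenericResponseModel) {
  return {
    expectUserResponse: true,
    richResponse: {
      items: [{ simpleResponse: { textToSpeech: model.outputSpeech } }],
    },
  };
}

// The shared controller builds one model; each BFF applies its own view.
const model = new GenericResponseModel();
model.outputSpeech = 'Here is the latest NPR newscast.';
```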

Where does Node.js enter into the equation?

Going serverless makes a great deal of sense for any of these applications because it is so much easier to develop along these pre-existing paths. Which makes Node.js the clear winner, because it’s the only language right now supported on every single serverless platform, including Azure Functions.
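On a serverless platform like Lambda, the entry point is little more than an async function. A minimal sketch, with illustrative field names rather than the real platform formats:

```typescript
// Illustrative Lambda-style entry point; the platform SDKs generate
// something similar for you. 'event' is the JSON request from the
// voice cloud service, and the return value is the JSON response.
interface SkillEvent {
  request: { type: string };
}

async function handler(event: SkillEvent) {
  // In a real deployment this function would be exported as the
  // Lambda handler and do real work (API calls, session handling).
  if (event.request.type === 'LaunchRequest') {
    return {
      response: {
        outputSpeech: { type: 'PlainText', text: 'Welcome to NPR.' },
      },
    };
  }
  return {
    response: { outputSpeech: { type: 'PlainText', text: 'Goodbye.' } },
  };
}
```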

Node has been great to work with, partly because the SDKs for the platforms we are primarily developing for right now are robust, which made our lives easier. But also because, for us at NPR, it’s the language we all feel the most fluent in. Even though we only had two developers on this voice UI project, there are 20 other devs in the department, so using Node means everyone can participate in our code reviews. Or, if we shuffle people around on projects, they will have no problem picking up our stack because they’re already comfortable and familiar.

Speaking of stack…

Node.js with TypeScript, Jest for testing. We deploy to AWS Lambda, using CloudFormation to set that up, and use DynamoDB as our cloud database storage provider. CloudWatch for logging, and Google Analytics to collect info. That is an interesting choice, because GA is designed for websites, but we used the universal analytics SDK for Node as well and just sent everything as custom events. Which makes our marketing department really happy, because they’re comfortable with GA.

Open source opportunities to get involved with voice UI development?

Friends active in the open source community have been asking me what they could build: what is missing from my toolkit? Would a formalized framework for voice UI development be helpful, for example? My answer is that a framework would not be helpful, because the code is not all that hard and the existing SDKs are already so good at taking away the boilerplate and providing abstractions. What we struggle with is actually QA: we do write unit tests, but at the end of the day we don’t have confidence that those unit tests definitively prove the voice assistant is always going to do what you want it to do. We all came from web development, mostly front end, and we are used to writing end-to-end tests using a tool like Selenium, or a high-level wrapper like Nightwatch.js. Having those lets you prove that, when you roll out a new feature or fix a bug, you’re not going to break everything else.

We don’t have anything like that for voice UI development right now. Whenever we build something new or make a change, we basically have to get in a room with an Alexa or a Google Home and run through all these test scripts. So if you are looking for an idea for your next open source project or startup, feel free to take this one and let me know what you build!

The Linux Foundation, which runs the Node+JS Interactive conference, is a sponsor of The New Stack.

Feature image via Pixabay.

