Thanks to its deep learning toolkit, Microsoft is making huge strides in computer-based speech recognition.
Just this September, a Microsoft research team achieved an error rate of 6.3 percent on the Switchboard speech recognition benchmark, meaning the software misinterpreted just 6.3 percent of all the words it “heard.” The researchers used a recurrent neural network architecture called long short-term memory (LSTM).
Less than a month later, training on a 30,000-word library, they were able to get that down to 5.9 percent, about the same percentage of incorrect words that professional transcribers made on the same phone call recordings. It was the very first time that a computer had been able to recognize the words in a conversation as well as people can.
It was “a historic moment,” said Microsoft’s chief speech scientist Xuedong Huang, who founded the speech recognition team at Microsoft in 1993.
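Benchmark figures like these are conventionally reported as a word error rate: the number of word substitutions, deletions and insertions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. The sketch below shows that calculation on a made-up sentence pair. The function name and the sample transcripts are illustrative assumptions, and this is not the scoring tool used for the Switchboard results.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1   # extra word in the hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts, invented for illustration only.
reference = "how do you make a pumpkin pie"
hypothesis = "how do you bake pumpkin pie"
print(f"word error rate: {word_error_rate(reference, hypothesis):.3f}")
```

Here one word is substituted and one is dropped out of seven reference words. Because insertions count as errors too, a word error rate can in principle exceed 100 percent.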
The deep learning algorithm Microsoft used can be found in the recently released version 2 of Microsoft’s CNTK library, which used to be called the Computational Network Toolkit but, as of version 2, is called the Cognitive Toolkit.
The beta of CNTK 2 improves performance, lets developers use it with Python as well as C++ to make it more widely relevant and gets a new name to show that Microsoft believes its deep learning framework is ready for a lot more than AI research.
“The acronym stays the same but the name reflects the higher aspiration of what we’re trying to do for cognitive computing and supporting Microsoft Cognitive Services.”
“Many of the AI services Microsoft has are now created using CNTK. Cognitive Toolkit is the secret weapon for Microsoft to create cognitive services like Skype Translator and many other AI breakthroughs like speech recognition that has now reached human parity for conversational speech.”
Cognitive Toolkit started as a framework for speech recognition, using not only the usual GPUs to speed up deep learning, but unusually, letting you take advantage of multiple GPUs on multiple machines to do distributed, massive scale deep learning. That way you don’t lose performance or accuracy when you work with bigger datasets.
Ready for Production
With version 2, Cognitive Toolkit goes from a research tool to something you can use in a production system, Huang said. “Microsoft has been using it for internal workloads. It’s not only Cognitive Services that have been created using CNTK but many other production-ready models. This is a commercially proven tool, it’s been proven in big production systems; it’s not just a tool used to create a toy problem.”
The voice recognition in Cortana is now created using Cognitive Toolkit, and the Cortana team says it’s increased their productivity almost ten-fold. “Before they adopted it, they felt like they were driving a Volkswagen; after they switched it’s like a Ferrari,” Huang said.
Microsoft’s speech services team is using Cognitive Toolkit not just for speech recognition but to create more accurate acoustic models, so they can understand what you’re saying in a noisy environment like a party, a bus or an open-plan office. They’re also using long short-term memory, and the improvements will show up in Cortana as well as Skype Translator.
One reason that Microsoft moved CNTK from its original, academic-only release on Codeplex to full open source on GitHub was to expand it to additional workloads beyond speech, starting with image recognition, but without losing the impressive performance. The speech APIs and the Custom Recognition Intelligent Service (CRIS) in Microsoft Cognitive Services (a set of REST APIs you can call to use pre-built machine learning algorithms in your code) were built with the Cognitive Toolkit. CRIS lets you create your own custom acoustic models by uploading samples from difficult environments along with transcriptions.
Bing uses Cognitive Toolkit to discover “latent connections” in search terms to find better results — if you type “how do you make a pumpkin pie” you’re looking for recipes, even though you didn’t type that in. That kind of natural language understanding is quite different from speech recognition, and it needs a massive dataset to work on.
“No other solution allows us to scale learning to large data sets in GPU clusters as easily,” Clemens Marschner, a principal software development engineer who works on Bing relevance, said.
Natural language understanding is also driving a new customer support system that Microsoft is trying out, under the codename Skyline. The chat bot looks at what the customer says and suggests links to fix the problem; it was good enough to let 25 percent of users in the trial fix their own problem, rather than the usual 12 percent. If a human agent needs to step in to work on a complex problem, the bot summarizes the fault and the conversation so far, so the agent doesn’t need to annoy the customer by asking all the same questions again.
Python and Performance
Most of the commercial production models built with Cognitive Toolkit were done in CNTK 1, but Huang noted, “The guts are identical — but we have new flexibility in CNTK 2.”
One of the advantages of Cognitive Toolkit is the way you describe deep networks — which are usually very complex — as nodes on a directed computational graph with inputs and outputs; once you’ve described a network, all the computation to learn the network parameters is taken care of automatically. Because you don’t need to derive gradients or hand-code the interactions between variables for back-propagation, you can create complex computational networks by composing simple building blocks.
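To make the idea of a computational graph concrete, here is a minimal sketch in plain Python of a graph whose nodes record how they were computed, so gradients can be worked out automatically by walking the graph backwards. It does not use Cognitive Toolkit, and the `Node`, `sigmoid` and `backward` names are assumptions made up for this illustration; a real toolkit does the same bookkeeping over tensors and on GPUs.

```python
import math


class Node:
    """One node in a directed computational graph."""

    def __init__(self, value, parents=()):
        self.value = value
        # Each entry is (parent_node, d(this node) / d(parent)).
        self.parents = parents
        self.grad = 0.0

    def __add__(self, other):
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))


def sigmoid(x):
    s = 1.0 / (1.0 + math.exp(-x.value))
    return Node(s, ((x, s * (1.0 - s)),))


def backward(output):
    """Fill in .grad for every node that feeds into `output`."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)  # parents are appended before their consumers

    visit(output)
    output.grad = 1.0
    # Walk from the output back to the inputs, applying the chain rule.
    for node in reversed(order):
        for parent, local_grad in node.parents:
            parent.grad += node.grad * local_grad


# Compose simple building blocks: y = sigmoid(w * x + b)
w, x, b = Node(2.0), Node(3.0), Node(-5.0)
y = sigmoid(w * x + b)
backward(y)
print(y.value, w.grad, x.grad, b.grad)
```

Nobody derived the gradient of `y` with respect to `w` by hand here; each building block only knows its own local derivative, and the backward pass combines them.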
The BrainScript network description language introduced in CNTK 1.5 lets you express very deep nets, beam decoding and other complex structures using infix operators, nested variables, function definitions, recursive function calls, arrays, and even lambdas. There’s a library of standard components that cover state of the art machine learning models like Deep Residual Nets for Image Recognition and Sequence-to-Sequence with Attention, and readers for easily inputting text and speech for deep learning training.
And now you can call all of that with Python, instead of having to use C++.
“This was a major adoption barrier for CNTK in the past,” he explained. “Using C++ for enterprise AI; that’s not a problem, people are familiar with C++. But for the open source community, we needed Python and this beta offers native Python support. It’s the language they’re familiar with; Python is easier to understand, easier to evaluate, it’s an interpretive language. Often they already have existing code using Python and when they add deep learning, they just want to augment what they have instead of switching from Python to C++. For the first time, we are bringing performance and ease of use in a more balanced way, because it can be integrated into other environments more efficiently.”
Python support will make working with reinforcement learning easier, since the majority of reinforcement learning libraries are written in Python. That’s a style of machine learning where the agent learns the best way to perform a task (anything from playing a game to navigating through a space) using trial and error, receiving rewards when it gets something right. Often it’s used as part of a more complex machine learning system; the Microsoft customer support agent uses both long short-term memory and other supervised deep learning methods, plus reinforcement learning to keep improving its results. The rewards can be explicit feedback from the human agent or the reactions of the customer: leaving the chat if they’re frustrated or thanking the bot if the information is useful.
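As a rough illustration of learning by trial, error and reward, the sketch below uses an epsilon-greedy bandit, one of the simplest reinforcement learning setups. Imagine three candidate help links, where a reward of 1 means the customer found the link useful and 0 means they left. The success rates, the function name and the parameters are all invented for this example; it says nothing about how Microsoft's support bot is actually built.

```python
import random


def run_bandit(success_rates, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent choosing among actions with unknown success rates."""
    rng = random.Random(seed)
    n_actions = len(success_rates)
    counts = [0] * n_actions        # how often each action was tried
    estimates = [0.0] * n_actions   # running average reward per action

    for step in range(steps):
        if step < n_actions:
            action = step                      # try every action once first
        elif rng.random() < epsilon:
            action = rng.randrange(n_actions)  # explore: pick at random
        else:
            # exploit: pick the action that looks best so far
            action = max(range(n_actions), key=lambda a: estimates[a])

        # Simulated feedback: 1.0 if the (hypothetical) customer was helped.
        reward = 1.0 if rng.random() < success_rates[action] else 0.0

        counts[action] += 1
        # Incremental update of the average reward for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]

    return estimates, counts


estimates, counts = run_bandit([0.1, 0.9, 0.3])
best = max(range(len(estimates)), key=lambda a: estimates[a])
print("estimated usefulness per link:", [round(e, 2) for e in estimates])
print("link the agent now prefers:", best)
```

The agent is never told which link is best. It keeps a small amount of exploration so that it can recover from an unlucky early impression, and otherwise repeats what has been rewarded.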
You’ll get the same performance using Python, and you might see a performance boost with CNTK 2.
“Compared to the previous version, it delivers almost two times performance boost in scaling to eight Pascal GPUs in an NVIDIA DGX-1,” said Ian Buck, general manager of the Accelerated Computing Group at NVIDIA.
That depends on which version you’re upgrading from, noted Huang. “CNTK 1 has been updated almost every month.” Version 1.5 introduced a parallel processing technique called Block Momentum that significantly reduced communication costs so you could scale parallel training across a large number of GPUs spanning multiple machines. On a 64-GPU cluster, that improved performance by a factor of more than 50. Version 2 is an improvement over that, although if you’re already using v1.8 the performance increase will be incremental.
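The communication saving is easier to see in a toy example. In the sketch below, each simulated worker takes several local training steps before anything is exchanged, the local models are then averaged once per block, and the averaged change is smoothed with a momentum term before it is applied to the shared model. This is a simplified illustration loosely modeled on the published descriptions of the technique, not CNTK's implementation; the function name, the one-parameter "model", the worker targets and the default settings are all assumptions.

```python
def train_with_block_momentum(worker_targets, blocks=200, local_steps=5,
                              local_lr=0.1, block_momentum=None, block_lr=1.0):
    """Toy data-parallel training of a single parameter.

    Worker k minimizes 0.5 * (w - target_k) ** 2 on its own data shard, so the
    best shared parameter is the mean of the targets.
    """
    n_workers = len(worker_targets)
    if block_momentum is None:
        # Illustrative default that grows with the worker count (an assumption).
        block_momentum = 1.0 - 1.0 / n_workers

    global_w = 0.0
    delta = 0.0       # smoothed block-level update
    sync_rounds = 0   # how many times the workers had to communicate

    for _block in range(blocks):
        local_models = []
        for target in worker_targets:
            w = global_w                      # start from the shared model
            for _step in range(local_steps):  # no communication in this loop
                grad = w - target
                w -= local_lr * grad
            local_models.append(w)

        # One synchronization per block instead of one per step.
        averaged = sum(local_models) / n_workers
        sync_rounds += 1

        block_update = averaged - global_w
        delta = block_momentum * delta + block_lr * block_update
        global_w += delta

    return global_w, sync_rounds


final_w, sync_rounds = train_with_block_momentum([1.0, 2.0, 3.0, 6.0])
print(f"shared parameter: {final_w:.6f} after {sync_rounds} synchronizations")
```

With five local steps per block, the workers exchange models a fifth as often as they would if they synchronized after every step, and the momentum term is there to compensate for the averaging, which on its own shrinks each block's update.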
Cognitive Toolkit’s performance is already impressive, though. Researchers at Hong Kong Baptist University are running regular benchmarks on the most popular deep learning toolkits (CNTK, TensorFlow, Caffe and Torch), testing popular workloads: fully connected and recurrent neural networks and two convolutional neural network architectures, AlexNet and ResNet.
“CNTK 2 remains the fastest deep learning toolkit for distributed deep learning,” claimed Huang, “and I want to highlight the word distributed. Even on a single GPU, CNTK offers the fastest performance on both fully connected and recurrent networks. On AlexNet, Caffe is, not surprisingly, the fastest; on ResNet, Torch is fastest. But CNTK, even on a single GPU, is the fastest toolkit for two out of the four. If you compare it with TensorFlow, on all four workloads CNTK is faster now: AlexNet, ResNet, recurrent networks and fully connected networks, even on a single GPU. And when you scale up beyond one machine, that’s where Cognitive Toolkit really shines because many other tools can’t even do that; Caffe is only designed for one machine with multiple GPUs. CNTK is the fastest performing distributed deep learning network tool.”
In fact, the latest version of the benchmarks shows that “CNTK is on par with TensorFlow and Torch on ResNet,” according to the researchers. “As for RNNs… CNTK achieves the best performance for all available settings.”
For many developers, the easiest way to get those multi-GPU systems will be the new Azure N-series VMs that use NVIDIA Tesla K80 GPUs; they’re still in preview but you can use Cognitive Toolkit on them already. “In fact with Azure GPU, we support not only CNTK but TensorFlow, Torch and Caffe,” explained Huang. “If you want to run a small task on a single machine with multiple GPUs you can use any of those tools, but if you want to be serious about big data, scaling out to multiple GPUs on multiple machines, CNTK is the only one that offers that performance.”
When the N-series VMs move into general availability, there will be a gallery image with Cognitive Toolkit already installed, and easier ways to scale out across multiple VMs. “Right now, you have to set up CNTK and run it on one VM; you can manage multiple VMs but it’s tedious, you have to use the command line. As we get the integration finished, it will be much easier to manage the distributed behavior. We’ll rely on Azure Batch to make scheduling much simpler once we are ready to launch the whole service. Azure GPU and CNTK together offer flexibility and ease of use; that will give the whole AI community a powerful toolkit to amplify AI for whatever they do.”