Beyond the Switchboard: The Current State of the Art in Speech Recognition
Thanks to techniques such as deep learning, speech recognition keeps getting more accurate; depending on who you ask and how you measure it, deep-learning-based speech recognition systems might now be better than human transcription. Developers can call APIs that bring speech recognition to their own apps and services, but is it as good as the research results make it sound?
In 1996, the error rate for speech recognition was over 43 percent. By 2001, that was down to about 20 percent; outside of limited vocabularies like numbers, two words in a ten-word sentence would be wrong. But unless users were prepared to invest a few months training recognition systems, the error rate stayed much the same for the next decade, plateauing around 15 percent. The hand-crafted generative speech models used for recognition weren’t producing any more significant improvements. Researchers from Geoffrey Hinton’s team at the University of Toronto then worked first with Microsoft Research, and later with Google and IBM, on using deep feedforward neural networks instead, and error rates started dropping again.
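The error rates quoted here are word error rates: the number of word substitutions, deletions and insertions needed to turn a system's transcript into the reference transcript, divided by the number of words in the reference. A minimal Python sketch of that calculation follows; the function name `word_error_rate` and the example sentences are invented for illustration, and benchmark scoring tools apply text normalization rules that this sketch leaves out.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] = edits needed to turn the reference words seen so far
    # into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Invented ten-word example in which two words are misrecognized.
reference = "please call my office and leave a message for me"
hypothesis = "please call my office and leaf a massage for me"
wer = word_error_rate(reference, hypothesis)
print(f"{wer:.0%}")  # prints 20%
```

Two wrong words out of ten gives the 20 percent figure from 2001; because insertions also count as errors, the rate can in principle exceed 100 percent.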
When research scientist Francoise Beaufays started working at Google in 2005, speech recognition was still treated as science fiction, she told the New Stack. “We started by building small products, like a 411 service completely powered by voice recognition.” They were also using the older, statistical speech recognition models. “Part of the improvement was gaining access to more data and training with it, but it was also about evolving the technology,” she explained. The biggest breakthrough in this period was moving to neural networks while keeping the latency low enough to give results quickly.
In speech recognition, there are three models that work together: the acoustic model, the pronunciation model, and the language model. The acoustic model takes the waveform of speech, chops it up into small fragments, and figures out each sound that the person is speaking. The pronunciation model takes those sounds and strings them together to make words, and the language model takes the words and strings them together to make sentences. These models are computed together into one big storage graph.
“When you speak a sentence, the sentence is pushed through the graph and we try and find the path of highest accuracy; the sequence of words that we think with the highest confidence is what you meant to say,” she said.
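To make the idea of pushing a sentence through the graph concrete, here is a toy Python sketch, not Google's implementation. Every number, sound label and word in it is invented, and the names `acoustic_scores`, `pronunciations`, `bigram_probs` and `best_path` are hypothetical. It combines scores from the three models and keeps the word sequence with the highest total.

```python
import itertools
import math

# Acoustic model output (invented numbers): for each stretch of audio,
# candidate sound sequences with the probability the model assigns them.
acoustic_scores = [
    {"AY": 0.9, "AH": 0.1},
    {"S IY": 0.7, "S EY": 0.3},
    {"DH AH": 1.0},
    {"S IY": 0.8, "T IY": 0.2},
]

# Pronunciation model: which words each sound sequence can spell.
pronunciations = {
    "AY": ["I", "eye"],
    "AH": ["a"],
    "S IY": ["see", "sea"],
    "S EY": ["say"],
    "DH AH": ["the"],
    "T IY": ["tea"],
}

# Language model: probability of a word given the previous word.
# Word pairs not listed here get a small floor probability.
bigram_probs = {
    ("<s>", "I"): 0.5,
    ("I", "see"): 0.3,
    ("I", "say"): 0.2,
    ("see", "the"): 0.4,
    ("the", "sea"): 0.2,
    ("the", "tea"): 0.1,
}
UNSEEN = 1e-4


def best_path(acoustic_scores, pronunciations, bigram_probs):
    """Return the word sequence with the highest combined log score."""
    # Expand each stretch of audio into (word, acoustic log-probability) options.
    options = [
        [
            (word, math.log(prob))
            for sounds, prob in segment.items()
            for word in pronunciations[sounds]
        ]
        for segment in acoustic_scores
    ]
    best_words, best_score = None, -math.inf
    # Trying every path only works for a toy graph this small.
    for path in itertools.product(*options):
        words = [word for word, _ in path]
        score = sum(log_prob for _, log_prob in path)
        for previous, current in zip(["<s>"] + words, words):
            score += math.log(bigram_probs.get((previous, current), UNSEEN))
        if score > best_score:
            best_words, best_score = words, score
    return best_words, best_score


words, score = best_path(acoustic_scores, pronunciations, bigram_probs)
print(" ".join(words))  # prints: I see the sea
```

The acoustic model alone cannot tell "see" from "sea" because they sound identical; the language model scores are what settle on "I see the sea". Production systems search far larger graphs and prune unlikely paths rather than enumerating all of them.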
Moving to neural networks didn’t necessarily improve recognition accuracy immediately; “what it created was a path for a bunch of innovations,” she explained. “After the initial launch, every six to nine months we come up with a new architecture, a new type of neural network that’s more powerful, more efficient, less [susceptible to] noise.” The same continual development has improved error rates across the industry, although measuring that turns out to be tricky.
The standard test for measuring the accuracy of speech recognition uses a fairly old dataset called NIST 2000 Switchboard: recordings of telephone conversations between strangers, about common topics like sports and politics. “This is the standard speech set used by all researchers for the last 20 or so years,” Microsoft Technical Fellow Xuedong Huang explained to the New Stack. “It’s natural conversations that people had when they talked over the phone; they’re not talking to their own family, so they’re using standard English.”
In 2016, a Microsoft Research team got the error rate on the Switchboard corpus down to 5.9 percent. “That matched the human error rate when we hired a professional transcriber to transcribe the same data,” Huang told us.
Since then, there’s been some debate about how accurate human transcription can be. IBM used four teams of people doing multiple transcriptions of each sentence; combining their work got the error rate as low as 5.1 percent (IBM reported an error rate of 5.5 percent using IBM Watson). “When you have a group of people working together, the new error rate for humans is a historical low of 5.1 percent, and we just reported that with software we can match a group of people,” Huang said.
Huang puts the improvement down to the speed of the latest version of the Microsoft Cognitive Toolkit (CNTK) for deep learning, which lets the team find more effective models more quickly. “It’s amazingly fast, and we were able to run experiments two to three times faster than we used to. That improved our productivity; we experimented with a thousand different models to get the final set of models that got us to 5.1 percent.”
But what does matching human performance on Switchboard mean for the speech recognition systems users and developers get access to? “This is a good milestone, but it by no means implies we’ve solved all the problems,” he pointed out. The advantage of benchmarking with Switchboard is an “apples to apples” comparison.
Beyond the Switchboard
Yet Google doesn’t benchmark its speech recognition against Switchboard because “those tests are not relevant to our product and our users,” Beaufays said. “We have our own test set and we try to beat that, but comparing across companies is difficult. The audio is not representative of how people are talking; that material is not relevant to us. We have our own product and we optimize for that.”
Google product manager Dan Aharon quoted the 4.9 percent error rate that was announced at Google I/O this year and noted that as Google has made speech recognition available to customers as an API, it’s had to be adapted for customers who didn’t have the same needs as Google.
“It became obvious pretty quickly that large segments of customers were interested in using speech for things different from what Google has traditionally done and our initial model was not well suited for those kinds of use cases,” Aharon noted. “Over the last year or so we’ve worked to improve that and we’ve made a significant R&D effort to get better technology tuned for phone call transcription, for transcribing long-form audio and for multiple speakers.”
Both Microsoft and Google let customers add their own vocabulary, like product names, to improve accuracy; Microsoft also supports custom audio models for specific locations or age groups. Because YouTube has products aimed at younger users, the Google team trained a system that was adapted to their voices.
“They not only have a different speech range but often choppy ways of expressing themselves that could force a different interpretation of the words if we didn’t pay attention to that,” Aharon explained. But rather than supporting custom models, “we found merging that with the overall system gave us a speech recognition system that works well for young and adult people.” (That’s used for Google Home but for legal reasons, the Google Cloud speech API doesn’t support apps directed at children under 13.)
Similarly, Microsoft’s 5.1 percent error rate came from training the system on more speech data than just Switchboard or the CallHome project, which collected and transcribed voice recordings in the early 1990s by offering US college students free long-distance calls. That’s more like colloquial everyday speech, but it’s still more than 20 years old. Google has been working with machine learning training company Appen (which also helped IBM with its group transcription test) to collect recordings from people with particularly challenging accents, covering Scottish dialects and the rural Midwest, and Microsoft has its own data sets too.
Huang didn’t put figures on the accuracy of the Cognitive Services speech API, the new Harman Kardon Invoke speaker that uses Cortana speech recognition, or the PowerPoint add-in that both transcribes and translates your presentation, other than saying the error rate would be higher. “We share the same training tool between CNTK and the production system, and we use more data than for Switchboard benchmarking, but this is on demand and real-time; Switchboard is not. Also, far field is more challenging than a close-talk microphone because the signal-to-noise ratio is worse.”
Even though they were made on older phones, the Switchboard recordings were done with handsets that put the microphone close to the mouth. To recognize speech across the room, speakers like the Invoke use multiple microphones; Google Home has two, Amazon Echo seven. “Noise is a challenge: is it manageable? Yes, if you have the right microphone. This is a beamforming microphone array and we make sense of multiple beam forms to get the right person and we use signal processing to enhance the source of speech and make the overall accuracy better,” Huang explained. “With a beamforming array microphone, it’s close to near-field accuracy. With a PC open microphone, it’s not as good.”
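The simplest form of beamforming is known as delay-and-sum: shift each microphone's signal so the wanted voice lines up across channels, then average, so the voice adds up while uncorrelated noise partly cancels. The Python sketch below illustrates that idea only and is not the method used in any of the products mentioned. It assumes the per-microphone delays are already known, and the tone standing in for a voice, the noise level and the names `delay_and_sum` and `mean_square_error` are all invented for the example.

```python
import math
import random


def delay_and_sum(channels, delays):
    """Align each channel by its known delay (in samples) and average them."""
    length = min(len(channel) - delay for channel, delay in zip(channels, delays))
    return [
        sum(channel[n + delay] for channel, delay in zip(channels, delays))
        / len(channels)
        for n in range(length)
    ]


def mean_square_error(signal, target):
    """Average squared difference between a signal and the clean target."""
    return sum((s - t) ** 2 for s, t in zip(signal, target)) / len(signal)


random.seed(0)
num_samples = 2000
# Stand-in for a voice: a 200 Hz tone sampled at 16 kHz.
speech = [math.sin(2 * math.pi * 200 * n / 16000) for n in range(num_samples)]

# Hypothetical four-microphone array: the voice reaches each microphone a few
# samples later than the previous one, and each adds its own independent noise.
delays = [0, 3, 6, 9]
channels = []
for delay in delays:
    delayed = [0.0] * delay + speech
    channels.append([sample + random.gauss(0, 0.5) for sample in delayed])

enhanced = delay_and_sum(channels, delays)
single_mic_error = mean_square_error(channels[0], speech)
beamformed_error = mean_square_error(enhanced, speech)
print(f"single microphone noise power: {single_mic_error:.3f}")
print(f"beamformed noise power:        {beamformed_error:.3f}")
```

With four microphones and independent noise, the averaged signal should carry roughly a quarter of the noise power of a single microphone. Real arrays have to estimate the delays from the audio itself and cope with echoes and competing talkers, which this sketch ignores.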
The PowerPoint system needs a close-talk microphone; “this isn’t 5.1 percent error rate because it’s real-time,” Huang noted, “but it’s intelligent; it can learn from the PowerPoint content and if you talk about what’s in the slide, it can be further enhanced. It’s not the most sophisticated neural network, it’s not as good as the 5.1 percent research system, but it’s still highly usable.”
The Real Test Is Using It
Beaufays (who has a noticeable French accent) also suggested that the real test is whether people find speech recognition useful. “It’s really hard to assess the quality of a speech recognition system. In a real context, it’s really hard and quite subjective. If you have a person speaking to the phone and it’s inaudible because they have the music on full volume, expecting speech recognition to get all of that is a grey area.”
She noted that reaction from users is reflected in the adoption numbers: more than 10 percent of search is now made by voice, and not long ago it was less than 1 percent. Instead of typing on their phone, users are taking that leap of faith.
And Huang highlighted challenges beyond just the error rate. “When we communicate as people, the performance drop from close talk near field to far field is not as dramatic as with machines. Machines do very well, almost as good as humans, if it’s close talk. If you have a little bit of an accent, the tolerance level of machines today is not as good as humans; for a three-year-old kid, it’s going to be disastrous. If you have very far field conversations, people have this amazing ability to zoom in, to focus their attention and ignore everything else.” That’s called the cocktail party effect, and Mitsubishi Electric recently demonstrated a system that can distinguish two or three voices in a crowd.
That’s more a symptom of a bigger underlying issue, Huang suggested. “Most important of all is the understanding; humans use intelligent contextual information like body language and eye contact to really enhance our understanding [of speech]. People aren’t trained to transcribe every single word; even with grammatical mistakes and if the sentence isn’t perfect, we understand it based on the context. That’s the biggest gap where machines are far from getting close to human capability.”