Q&A: Dialpad’s Etienne Manderscheid on the Power of Voice AI
Providing excellent service during customer phone calls can give any business an edge over its competitors. But what if artificial intelligence were added to the mix? New developments in natural language processing (NLP) and speech recognition are enabling companies not only to transcribe phone calls in real time, but also to apply data analytics to those interactions, gleaning insights that help train sales and support staff beyond standard scripts.
To find out more about how AI is changing the industry, we spoke to Etienne Manderscheid, vice president of machine learning at Dialpad, a cloud-based business communications platform. Prior to Dialpad, Etienne co-founded TalkIQ, a startup specializing in real-time speech recognition and natural language processing technologies, which was acquired by Dialpad back in May 2018. Etienne also has a PhD in computational neuroscience, and was pivotal in helping Dialpad to develop VoiceAI, which allows users to take a quantitative and actionable approach when analyzing conversations. VoiceAI is featured in a wide range of Dialpad products.
Dialpad has been described as a “cloud-based phone system.” Could you explain a little bit about what that means?
Well, legacy communication systems (like those offered by Cisco) have large, extensive installations that live only on-premises. To push features, they have to go and actually make physical changes to that hardware. So it doesn’t allow for fast development cycles and it doesn’t let you scale rapidly. If you have a fast-growing team, you would have to upgrade your whole hardware installation, which might take four or six months. The idea with Dialpad is that we are entirely account-based, similar to Slack, so you can scale very easily and deploy very quickly, and our onboarding times and our feature development times are excellent. Being cloud-based is really what customers expect in this day and age, and we’ve had a lot of success growing with companies like Uber, Motorola and other Bay Area-based companies.
Could you tell us a bit about how you got started on this path, taking you from academia to starting TalkIQ, and then joining Dialpad?
When I started my PhD in computational neuroscience, deep learning AI was not at all what it is today. At the time, the choice was either to study computers and learn about what AI was then, which I also did, or to specialize in neuroscience, learn how the human brain does it, and bring that into AI. I chose the second approach, because what was being done with computers at the time was very limited. My thought was that the brain does all these amazing things, so how can that be translated into AI? In fact, a lot of the most recent breakthroughs in AI, like attentional models, which have become state-of-the-art recently, actually take inspiration from how the brain does things.
Four years ago, my colleagues and I co-founded the startup TalkIQ, and our vision was to improve conversations at scale. If you look at specific verticals like sales and support, there is a lack of analytics that would help managers understand what is going on inside their calls. We decided to use speech recognition and NLP to give them those analytics so they can do targeted coaching and understand the voice of the customer. Those are now things that we are building as VoiceAI, inside of Dialpad.
So the technology tries to find patterns in voice calls between customers and businesses?
That’s right. We detect what we call “key moments” in conversations — for example, a price objection, which can happen in any sales conversation. So we have models that detect those, as well as many other key moments. For a manager, being able to see how different agents are performing in terms of the various key moments, and relating those to the outcomes of the deals, helps them deliver much more targeted coaching. We also have features that help share organizational knowledge. Once it is identified that some agents are doing well on some things, this information can be put into call libraries that others can learn directly from, rather than going through months and months of trial and error.
How did you solve the problems that might arise when a customer speaks with an accent or uses nonstandard vocabulary (like slang or technical jargon)?
We collect a lot of training data from people with diverse accents so that we can do acoustic modeling and language modeling that is going to represent all of our customers. There is a lot of data that we clean, label and then use to do acoustic modeling and language modeling adaptations. There is customization in our language model and the acoustic model that gives better results, compared to a generic, off-the-shelf solution.
How does that work exactly?
In language modeling, human-generated and machine-generated transcripts can be used to train the model further. There is a process of capturing the keywords that customers are using, such as product names, names of competitors, and names of people within those organizations. Those are important words to capture correctly, and that’s a challenge for a generic speech recognition system. We have ways for customers to tell us what those words are. We also use web scraping to infer many of those words ourselves, and then we have to add those to our speech recognition system.
What we’ve had to do is create automated and semi-automated systems that infer the pronunciation of an unknown word, working out its phonetic makeup from its letters. We also have a loop that uses human verification in order to make sure we are correctly capturing those words and putting them in our dictionaries.
Essentially, you don’t want the model to create its own data and use it for its own training without any kind of supervision, because that’s kind of like asking the model to hallucinate something, and to then use it as ground truth. You want to have some opportunities for a human to say, “Actually, this is wrong.” You do want to have a conversation between the machine and human intelligence, in a sense.
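The loop described here, where a machine proposes a pronunciation and a person confirms or corrects it, can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the letter-to-phoneme table is a toy stand-in for a trained model, and the `Lexicon` class and its method names are hypothetical, not Dialpad's actual system.

```python
from dataclasses import dataclass, field

# Toy spelling-to-sound rules (hypothetical). A production system would
# use a trained grapheme-to-phoneme model rather than a lookup table.
DIGRAPHS = {"ch": "CH", "sh": "SH", "th": "TH", "ph": "F"}
LETTERS = {
    "a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "f": "F",
    "g": "G", "h": "HH", "i": "IH", "j": "JH", "k": "K", "l": "L",
    "m": "M", "n": "N", "o": "AA", "p": "P", "q": "K", "r": "R",
    "s": "S", "t": "T", "u": "AH", "v": "V", "w": "W", "x": "K S",
    "y": "Y", "z": "Z",
}


def guess_pronunciation(word):
    """Return a rough phoneme list inferred from the spelling alone."""
    word = word.lower()
    phones = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            phones.extend(DIGRAPHS[pair].split())
            i += 2
        elif word[i] in LETTERS:
            phones.extend(LETTERS[word[i]].split())
            i += 1
        else:
            i += 1  # skip digits, hyphens and other non-letters
    return phones


@dataclass
class Lexicon:
    approved: dict = field(default_factory=dict)  # used by the recognizer
    pending: dict = field(default_factory=dict)   # waiting for a human

    def propose(self, word):
        """Machine step: queue a guessed pronunciation for review."""
        if word not in self.approved:
            self.pending[word] = guess_pronunciation(word)

    def review(self, word, accept, correction=None):
        """Human step: accept the guess, replace it, or discard it."""
        guess = self.pending.pop(word)
        if accept:
            self.approved[word] = guess
        elif correction is not None:
            self.approved[word] = correction
```

The point of the design is that nothing reaches `approved` without passing through `review`, so a wrong machine guess never becomes ground truth on its own.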
Can you explain these models a bit further?
So these models are solving a language modeling task on gigabytes of Internet text. This helps them acquire not only linguistic knowledge but also real-world knowledge that makes them much better for inference. That’s the starting point. What we do is we pop off the top layers of the deep neural networks. We keep a lot of the knowledge in the deep layers and we train it to solve a deeper task, which is much more adapted to the tasks that we need to solve. You’re re-orienting it to a different task, training it on a different task, and you keep the deep neural layers that contain a lot of knowledge.
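As a rough illustration of keeping the pretrained layers and replacing only the top, the sketch below freezes a small "base" layer and trains a new output layer on top of it. It is a toy under stated assumptions: the fixed `BASE_WEIGHTS` stand in for layers learned on Internet text, the four training examples are invented, and a real system would use a deep learning framework rather than plain Python.

```python
import math

# Stand-in for the pretrained lower layers (hypothetical values).
# These weights are never updated during fine-tuning.
BASE_WEIGHTS = [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]]


def base_features(x):
    """Run the input through the frozen layer to get hidden features."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)))
            for row in BASE_WEIGHTS]


def predict(x, w, b):
    """New top layer: logistic output over the frozen features."""
    h = base_features(x)
    z = sum(wi * hi for wi, hi in zip(w, h)) + b
    return 1.0 / (1.0 + math.exp(-z))


def train_head(data, epochs=200, lr=0.5):
    """Train only the new top layer with stochastic gradient descent."""
    w = [0.0] * len(BASE_WEIGHTS)
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            h = base_features(x)
            error = predict(x, w, b) - y  # gradient of the log loss
            w = [wi - lr * error * hi for wi, hi in zip(w, h)]
            b -= lr * error
    return w, b


# Invented examples for the new task: (input vector, label).
DATA = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([2.0, 0.5], 1), ([0.2, 1.5], 0)]

w, b = train_head(DATA)
for x, y in DATA:
    print(x, y, round(predict(x, w, b), 3))
```

Only `w` and `b` change during training, which mirrors the idea of keeping the knowledge in the lower layers while re-orienting the network to a new task.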
A lot of NLP is deep learning, and most of the speech recognition is deep learning. We also use heuristic models, so that we can quickly build proofs of concept. That’s another lesson we learned at TalkIQ: it’s an excellent idea to build a very simple baseline model to first validate that everyone is on the same page and doing the right thing, and to then follow up with the deep learning-based model.
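A simple baseline of the kind mentioned here might be nothing more than a handful of phrases checked against a few labeled utterances. The sketch below is an assumption-laden example: the phrase list, the labeled sentences, and the function names are all invented for illustration, using the "price objection" key moment from earlier in the interview.

```python
import re

# Hypothetical phrase list for a "price objection" key moment.
PRICE_OBJECTION = re.compile(
    r"\b(too expensive|over (our|my) budget|can't afford|cheaper)\b",
    re.IGNORECASE,
)


def is_price_objection(utterance):
    """Heuristic baseline: flag the utterance if any phrase matches."""
    return PRICE_OBJECTION.search(utterance) is not None


def precision_recall(examples):
    """Score the baseline against (text, label) pairs."""
    tp = fp = fn = 0
    for text, label in examples:
        predicted = is_price_objection(text)
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented labeled examples, including cases the heuristic gets wrong.
LABELED = [
    ("That is too expensive for us.", True),
    ("This is over our budget this quarter.", True),
    ("The cost is more than we planned.", True),      # missed by the phrases
    ("Is there a cheaper flight to Denver?", False),  # false alarm
    ("Thanks, talk to you next week.", False),
]

precision, recall = precision_recall(LABELED)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Even a crude baseline like this forces agreement on what counts as the key moment and gives a number that a later deep learning model has to beat.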
How will this technology change how business communications are done?
One example is real-time recommendations, which we developed for Dialpad Sell, the outbound call center for sales teams. Salespeople want to go and win deals. They don’t want to study scripts that probably need to be updated. Instead, here is an easy system where you can configure a real-time recommendation to trigger on certain words or on certain pre-defined moments, and offer a suggestion. So that’s a practical way that we can help organizations train, actually.
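A configurable trigger of this sort could look roughly like the following. This is a minimal sketch, not Dialpad Sell's implementation: the `Trigger` and `Recommender` names, the phrases, and the suggestions are hypothetical, and it assumes transcript text arrives one chunk at a time from a speech recognizer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trigger:
    name: str
    phrases: tuple   # lowercase phrases that fire this trigger
    suggestion: str  # text shown to the salesperson


class Recommender:
    """Checks each new piece of transcript against configured triggers."""

    def __init__(self, triggers):
        self.triggers = list(triggers)
        self.fired = set()  # each trigger fires at most once per call

    def on_transcript(self, text):
        """Return suggestions triggered by the latest transcript chunk."""
        lowered = text.lower()
        suggestions = []
        for trigger in self.triggers:
            if trigger.name in self.fired:
                continue
            if any(phrase in lowered for phrase in trigger.phrases):
                self.fired.add(trigger.name)
                suggestions.append(trigger.suggestion)
        return suggestions


# Hypothetical configuration a sales manager might set up.
recommender = Recommender([
    Trigger("pricing", ("too expensive", "discount"),
            "Mention the annual plan."),
    Trigger("competitor", ("other vendor",),
            "Highlight onboarding time."),
])

for chunk in ["Hi, thanks for calling.",
              "Honestly it seems too expensive.",
              "Is there a discount?"]:
    print(chunk, "->", recommender.on_transcript(chunk))
```

Firing each trigger only once per call is a design choice made here to avoid repeating the same suggestion; a real product might handle that differently.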
We’ve already got a number of companies that have enabled VoiceAI. They get real-time transcriptions and live sentiment on all of their calls. In a call center, it used to be that the manager would randomly pick a call and listen to it. Probably it’s going okay, maybe there’s something to say about it. But wouldn’t it be much nicer if the system alerted you when a call is going wrong? Using the sentiment analysis framework, we’ve developed a learning system which tells managers, “Hey, this call doesn’t have very good sentiment, do you want to barge in?” We are creating that connection for them.
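One plausible way to turn live sentiment into a manager alert is to average recent scores and flag the call when the average drops below a threshold. The sketch below assumes each utterance already has a sentiment score between -1 and 1 from some upstream model; the class name, window size, and threshold are hypothetical choices, not details from the interview.

```python
from collections import deque


class SentimentMonitor:
    """Flags a call when recent sentiment stays low."""

    def __init__(self, window=3, threshold=-0.3):
        self.scores = deque(maxlen=window)  # keeps only the latest scores
        self.threshold = threshold

    def update(self, score):
        """Add one utterance score; return True if an alert should fire."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough history yet
        average = sum(self.scores) / len(self.scores)
        return average < self.threshold


monitor = SentimentMonitor()
for score in [0.2, -0.1, -0.4, -0.6, -0.7]:
    if monitor.update(score):
        print(f"Alert: sentiment is low (latest score {score})")
```

Averaging over a short window means a single negative remark does not page the manager, while a sustained downturn does.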
What are some of the future directions that the technology will take?
We are focusing on improving conversations in general, as nowadays there’s much more digital interaction. People are getting less and less feedback on facial micro-expressions and other biological cues than before. We need feedback. We are building a system where people can deliver feedback, but where AI can help infer and extrapolate where the feedback is absent, making sure that there is a mutual understanding between the participants in the conversation — especially in cross-cultural interactions — so that everybody’s voice is heard.
Feature Image by CSTRSK from Pixabay; Other images: Dialpad