To truly automate customer support, virtual agents need to be omnichannel - available not just through chat interfaces on a website, but also through voice interfaces, such as conversational IVRs in the contact center.
Most chatbot (NLU) platforms are purely text-based, which means that in order to be deployed in a voice environment, an additional layer must be added to handle the conversion between voice and text. On the input side, speech is converted to text via automatic speech recognition (ASR), and the transcript is then routed to the chatbot. On the output side, the chatbot response must be converted to speech via text-to-speech (TTS), which is then played back to the end user.
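The voice layer described above can be sketched as a simple pipeline. The three components here are placeholder stand-ins, not a real vendor API: `transcribe`, `detect_intent`, and `synthesize` are hypothetical, with toy implementations so the flow is runnable.

```python
def transcribe(audio: bytes) -> str:
    # Placeholder ASR: pretend the audio payload already carries its transcript.
    return audio.decode("utf-8")

def detect_intent(text: str) -> str:
    # Placeholder text-based chatbot: a tiny lookup of intent responses.
    responses = {"check my payroll": "Your payslip is available online."}
    return responses.get(text.lower(), "Sorry, I didn't catch that.")

def synthesize(text: str) -> bytes:
    # Placeholder TTS: return the response as raw bytes instead of audio.
    return text.encode("utf-8")

def handle_turn(audio_in: bytes) -> bytes:
    # Input side: speech -> text; middle: text bot; output side: text -> speech.
    return synthesize(detect_intent(transcribe(audio_in)))

print(handle_turn(b"check my payroll"))
```

The point of the sketch is the shape of the pipeline: the chatbot in the middle only ever sees text, so any ASR error upstream lands directly on the NLU model.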
Many chatbot platforms (Dialogflow, Lex, LUIS...) include this functionality transparently, which simplifies omnichannel deployment for bots built on them. However, precisely because this is done transparently, bot developers have little control over the ASR process, which can be an issue in cases where it doesn't work correctly.
Training high-performance ASR models is a complex task, requiring large quantities of training data with voices from a wide range of speakers, covering differences in pitch, tone, prosody, accent, and pronunciation. As a result, these systems are trained to handle general-purpose vocabulary, and can sometimes have difficulties when dealing with specialized language in a specific vertical or domain.
What can bot developers do in cases like these when they don't have access to the ASR model internals? At Bitext, we are pioneering the use of synthetic chatbot training data to compensate for errors in the ASR output.
We begin by carefully analyzing incorrect ASR output to detect common patterns. For example, if a spoken term like "payroll" is consistently transcribed as "pearl", we can systematically regenerate the training data for payroll-related intents to include "pearl" as a variant, thereby increasing the intent detection accuracy of the chatbot.
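This regeneration step can be illustrated with a minimal sketch. The confusion table and training utterances below are illustrative examples, not actual Bitext data or tooling: each utterance containing a known mistranscribed term gets an extra variant with the ASR's output substituted in.

```python
# Known ASR mistranscriptions, mined from analysis of incorrect output.
# Illustrative pair from the article: "payroll" heard as "pearl".
ASR_CONFUSIONS = {"payroll": ["pearl"]}

def expand_with_asr_variants(utterances):
    """Return the original utterances plus one variant per known confusion."""
    expanded = list(utterances)
    for utt in utterances:
        for term, variants in ASR_CONFUSIONS.items():
            if term in utt:
                expanded.extend(utt.replace(term, v) for v in variants)
    return expanded

training = ["when is payroll processed", "show my payroll details"]
print(expand_with_asr_variants(training))
```

Because the NLU model is retrained on both the correct term and its mistranscription, the intent is matched even when the ASR layer gets the word wrong.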
Similarly, if an ASR system is having trouble recognizing a particular accent and transcribing certain sounds incorrectly, we can generate training data that takes these systematic errors into account, so the bot will understand "fa" and "pack" as variants of "far" and "park", respectively.
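Accent-driven errors can be handled the same way, but at the sound level rather than the word level. A minimal sketch, assuming hypothetical substitution rules for the "far"/"fa" and "park"/"pack" examples above (real rules would be derived from systematic analysis of the ASR output for that accent):

```python
import re

# Illustrative sound-level substitution rules for a non-rhotic accent,
# where a final "r" sound is dropped or altered in the transcript.
SOUND_RULES = [
    (re.compile(r"\bfar\b"), "fa"),
    (re.compile(r"\bpark\b"), "pack"),
]

def accent_variants(utterance):
    """Generate one training-data variant per matching substitution rule."""
    variants = []
    for pattern, replacement in SOUND_RULES:
        if pattern.search(utterance):
            variants.append(pattern.sub(replacement, utterance))
    return variants

print(accent_variants("how far is the park"))
```

Applying such rules across a whole intent's utterances yields training data that anticipates the accent's systematic errors, so the bot understands the mistranscribed forms as variants of the intended words.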
If you want to get your voice-based conversational AI agents working with the same accuracy as your text-based chatbots, "giz a ring will ya"?