Eleven Best Practices, Tips, and Tricks for using Speech Recognition and Virtual Agent Bots with Voice Calling on the Twilio CPaaS Platform
Conversational bot designers and developers – as well as callers into speech-enabled Interactive Voice Response (IVR) and Virtual Agents, alike – are continually asking themselves the same questions: “Why doesn’t this bot understand me? What more does it need to be able to understand what I just said to it?”
While AI-based Automatic Speech Recognition (ASR) can be challenging (especially in noisy environments) and involves inherent accuracy and latency trade-offs, there are ways to improve speech recognition performance. This post covers best practices that maximize the odds of a superior automated self-service experience with Twilio.
Twilio’s Recommendations for improving <Gather><Speech> Recognition in an IVR
By implementing the following tips and recommendations, you can increase the likelihood that Google’s ASR used by Twilio will recognize spoken text correctly and that the customer’s application (or Twilio Studio Flow) can take the appropriate next action to attempt to confirm or validate the caller’s input. These best practices will minimize disturbance to the caller, delivering a more conversational IVR or customer engagement experience (or automated self-help experience). All of this will reduce caller frustration and ensure better overall efficiency and cost performance of any Twilio customer’s IVR system.
1. Rely on the (ever-improving) powers of mobile devices
Twilio recommends that customers utilize the mobile phone’s microphone for improved audio quality, along with the noise-canceling features already available on devices themselves.
To minimize outside noise interference, we recommend using the phone's handset mode rather than speakerphone mode to capture user input. By reducing the impact of background noise on speech recognition, noise-canceling microphones and acoustic echo cancellation can significantly enhance recognition accuracy.
2. Choose a high-quality PSTN connection provider
You cannot recognize speech on calls that do not get successfully connected to your app. Twilio has high-quality, reliable interconnects with multiple providers, serving both Inbound and Outbound calling use cases (including Number Porting) – at cloud/elastic scale – the world over. Don’t let poor connectivity foil your attempts to engage and service your customers!
3. Leverage "Hints" in <Gather> Verb to the max
Include all the possible inputs that a user may speak as part of the "Hints" in the <Gather> verb. Add as many as you like into your code; there is no scaling penalty for running an app with 1 or 10 hints vs. 99 (we allow hundreds – here are the current limits). Adding these will steer the recognizer toward the inputs you expect and increase the likelihood of accurate recognition.
Here are some examples of supported class tokens by language in Twilio’s Docs and Google’s Docs:
- $ADDRESSNUM (street number), $STREET (street name), and $POSTALCODE
- $MONEY (amount with currency unit)
- $OPERAND (numeric)
- DTMF, etc.
In this next example, we use the Class Token $OOV_CLASS_DIGIT_SEQUENCE because the requested account number consists only of digits. When <Gather> completes, Twilio sends the result to the action URL.
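A minimal sketch of such a <Gather>, built here with Python's standard library rather than Twilio's helper library. The prompt wording and the /account-number path are placeholders chosen for illustration, not values from Twilio.

```python
import xml.etree.ElementTree as ET


def build_account_number_gather(action_url: str) -> str:
    """Return TwiML that asks the caller to speak a numeric account number."""
    response = ET.Element("Response")
    gather = ET.SubElement(
        response,
        "Gather",
        {
            "input": "speech",
            # Class token named in the text: bias recognition toward digit sequences.
            "hints": "$OOV_CLASS_DIGIT_SEQUENCE",
            # Twilio requests this URL with the result once <Gather> completes.
            "action": action_url,
            "method": "POST",
        },
    )
    say = ET.SubElement(gather, "Say")
    say.text = "Please say your account number."
    return ET.tostring(response, encoding="unicode")


# "/account-number" is a hypothetical route in your own application.
print(build_account_number_gather("/account-number"))
```

The same hints attribute also accepts plain comma-separated phrases, so class tokens and literal words can be mixed as needed.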
Digits
Temperature
Phone Number
Street Address
Something you define
Using hints to discern between relevant and irrelevant homonyms depending on use case, (e.g., between “chicken” and “checking,”) is one strategy; re-prompting based on the available relevant choices (e.g., “I think you said ‘checking’ not savings – is that correct?”) is another. A third strategy is to use a virtual agent “bot” that can make probabilistic statistical “informed guesses” (i.e., use predictive AI) to pick the best choice from among relevant alternatives, such as our integration with Dialogflow CX.
4. Use Enhanced Speech Recognition, and pick the Twilio <Gather><Speech> Google ASR Speech model best suited for your use case
The enhanced attribute instructs <Gather> to use a premium speech model that will improve the accuracy of transcription results. The premium speech model is only supported with the phone_call speechModel. The premium phone_call model was built using thousands of hours of training data. It ensures 54% fewer errors when transcribing phone conversations when compared to the basic phone_call model.
The following TwiML instructs <Gather> to use the premium phone_call model:
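As a hedged sketch, a TwiML document of that shape might look like the string below. It is held in Python and parsed with the standard library only to confirm that it is well-formed; the action path and the prompt wording are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# The action path and prompt wording are placeholders, not values from Twilio.
ENHANCED_GATHER_TWIML = """\
<Response>
  <Gather input="speech" enhanced="true" speechModel="phone_call" action="/handle-speech">
    <Say>How can we help you today?</Say>
  </Gather>
</Response>
"""

# Parse the document to confirm it is well-formed XML before serving it.
gather = ET.fromstring(ENHANCED_GATHER_TWIML).find("Gather")
print(gather.get("enhanced"), gather.get("speechModel"))  # prints: true phone_call
```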
<Gather> will ignore the enhanced attribute if any speechModel other than phone_call is used.
For most use cases related to Voice input when collecting short individual utterances from an English-speaking user, Twilio recommends using the enhanced phone_call model with speechTimeout set to auto. Twilio recommends this over Google’s default speech model because phone_call is the speech model best suited for use cases where you'd expect to receive queries such as voice commands or voice search.
In languages other than English, for better endpointing (i.e., lower latency start of speech recognition), experimental_utterances may be a better choice. For more on those experimental models, see below.
Twilio’s Experimental speech models are designed to give access to Google’s latest speech technology and machine learning research for some more specialized use cases. They can provide higher accuracy for speech recognition versus other available models, depending upon use case and language. However, some features that are supported by other available speech models are not yet supported by the experimental models, such as confidence scores (more on that, below).
Of special note, the experimental_utterances model is best suited for short utterances of only a few seconds in length for languages other than English. It’s especially useful for trying to capture commands or other single-shot directed speech use cases (e.g., "press 0 or say 'support' to speak with an agent" in non-English languages). Alternatively, the speech model numbers_and_commands might also work for such cases.
The experimental_conversations model supports longer and spontaneous speech and conversations. For example, it is useful for responses to a prompt like "tell us why you're calling today," or capturing the transcript of an interactive session, or longer spoken messages in the 60-second snippets that <Gather> supports.
Both experimental_conversations and experimental_utterances values for speechModel support the set of languages listed here.
One final but especially important point when it comes to building speech models into your ASR application: you can change the speech model multiple times within a single TwiML application, over the course of multiple questions or prompts, to best suit the type of speech input you’re expecting for each question or prompt. That is, you can specify the speech model, hints, etc., for each individual <Gather> in a TwiML app, to optimize the accuracy of your speech results.
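One way to picture per-prompt tuning is a small lookup table that maps each question to its own <Gather> settings. The sketch below uses only Python's standard library; the prompt IDs, prompt wording, hint phrases, and /answers/... paths are all hypothetical, while the speechModel values are the ones discussed in this section.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-prompt settings; tune models and hints to your own questions.
PROMPTS = {
    "menu": {
        "say": "Say billing, support, or sales.",
        "attrs": {
            "speechModel": "numbers_and_commands",
            "hints": "billing, support, sales",
        },
    },
    "reason": {
        "say": "Tell us why you're calling today.",
        "attrs": {"speechModel": "experimental_conversations"},
    },
}


def gather_for(prompt_id: str) -> str:
    """Build one <Gather> whose speech settings match the given prompt."""
    prompt = PROMPTS[prompt_id]
    response = ET.Element("Response")
    attrs = {
        "input": "speech",
        "action": f"/answers/{prompt_id}",  # hypothetical route per prompt
        **prompt["attrs"],
    }
    gather = ET.SubElement(response, "Gather", attrs)
    ET.SubElement(gather, "Say").text = prompt["say"]
    return ET.tostring(response, encoding="unicode")


for prompt_id in PROMPTS:
    print(gather_for(prompt_id))
```

Keeping the settings in data rather than in code makes it easier to adjust one prompt's model or hints without touching the others.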
5. Engineer your prompts to encourage natural and clear speech input
Encourage users to speak naturally and avoid rushing during interactions with the IVR system. Natural speech patterns improve the accuracy of speech recognition.
First, you’ll want prompts to be long enough for Twilio and Google to be ready for the speech input – but not so long that the user is put off. Telling a user what or how to speak – or giving examples – isn’t a bad idea.
In addition, the prompting questions you ask should follow one of two approaches:
- Keep the question sufficiently narrow in scope that a generalized speech recognition engine has a decent chance of recognizing the answers from amongst a limited set of possible valid ones (for example, by using plenty of verbal cues, such as “you can say things like ‘account balance’, or ask when your local branch is open”).
- If a very broad question is the right starting point for conversations with your customers, consider using other, more structured tools for managing detected intents, utterances, and phrases. Additionally, take advantage of those tools’ auto-generation of training data (phrases) and management of homonyms.
In short, if the set of answers and possible actions is small and short, building your own bot with ASR tools alone is a great idea. If the list of answers and actions is long, getting successful recognitions and correct routing and answers can be complicated, so consider also using a predictive AI bot-building tool like Twilio’s <Virtual Agent> connector using Google Dialogflow CX in addition to speech recognition.
6. Keep it clean (if you want)
The profanityFilter attribute of <Gather> specifies whether Twilio should filter profanities out of your speech recognition results and transcription. This attribute defaults to true, which replaces all but the initial character in each filtered profane word with asterisks. You can also use Twilio Voice Intelligence and recorded transcripts to detect customer sentiment for later flagging or Segment profile updating.
7. Offer DTMF as Backup
Provide Dual-Tone Multi-frequency (DTMF), also known as “touch tones,” as an alternative input method when speech recognition fails. This allows users to input responses using the keypad if needed.
The input attribute allows you to specify which inputs (DTMF or speech) Twilio should accept – the default input for <Gather> is dtmf, but you can set input to dtmf, speech, or dtmf speech.
If you’re expecting DTMF but the input from the caller might be speech, see the “Hints” section above in tip #3. You can set the number of digits you expect from your caller by including numDigits in <Gather>.
If you set dtmf speech for your input, the first detected input (speech or dtmf) will take precedence. If speech is detected first, finishOnKey (finish on a specified DTMF key press) will be ignored.
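The sketch below pairs a <Gather> that accepts either input type with a small handler that reads whichever result Twilio reports to the action URL. It assumes the result arrives in the Digits or SpeechResult request parameters; the prompt, the hint, the /menu-choice path, and the function name are hypothetical.

```python
from typing import Mapping, Optional, Tuple

# Hypothetical prompt and action path; the attributes are the ones named above.
SUPPORT_MENU_TWIML = """\
<Response>
  <Gather input="dtmf speech" numDigits="1" hints="support" action="/menu-choice">
    <Say>Press 0 or say support to speak with an agent.</Say>
  </Gather>
</Response>
"""


def read_caller_input(params: Mapping[str, str]) -> Optional[Tuple[str, str]]:
    """Return (source, value) for whichever input Twilio reported, or None."""
    digits = params.get("Digits", "").strip()
    if digits:
        return ("dtmf", digits)
    speech = params.get("SpeechResult", "").strip()
    if speech:
        return ("speech", speech)
    return None  # neither keypad nor speech input arrived


print(read_caller_input({"Digits": "0"}))
print(read_caller_input({"SpeechResult": "Support."}))
```

Returning the input source alongside the value lets the calling code normalize speech (for example, lowercasing and stripping punctuation) without touching keypad digits.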
8. Stream it
If your use case does NOT require multiple real-time call orchestration steps, consider using Twilio Media Streams to send speech data to an external speech recognition provider, such as those available through the Twilio Marketplace.
Twilio Marketplace speech recognition partners can, for example, develop vocabularies optimized for certain industry verticals or use cases, or optimize for longer speech recognition “batching,” leading to improved recognition accuracy and performance in your application built as a Twilio customer. But, do note that Media Streams doesn’t yet support DTMF – that’s coming in a future version of Media Streams.
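As a rough sketch, forking call audio to an external recognizer can be expressed with the <Start> verb and <Stream> noun. The WebSocket URL below is a placeholder for your provider's endpoint, and the prompt and pause length are assumptions chosen only to keep the call open while audio streams.

```python
import xml.etree.ElementTree as ET


def start_stream_twiml(websocket_url: str) -> str:
    """Return TwiML that forks call audio to a WebSocket and then prompts the caller."""
    response = ET.Element("Response")
    start = ET.SubElement(response, "Start")
    # Audio is sent to this endpoint while the rest of the TwiML keeps executing.
    ET.SubElement(start, "Stream", {"url": websocket_url})
    ET.SubElement(response, "Say").text = "Please tell us how we can help."
    # Keep the call alive long enough for the caller to answer (assumed 30 seconds).
    ET.SubElement(response, "Pause", {"length": "30"})
    return ET.tostring(response, encoding="unicode")


# "wss://speech.example.com/audio" is a hypothetical provider endpoint.
print(start_stream_twiml("wss://speech.example.com/audio"))
```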
9. Leverage confidence scoring in the prompting application
When the caller finishes speaking or entering digits (or the timeout is reached), Twilio will make an HTTP request to the URL that the action attribute takes as a value. Twilio may send some extra parameters with its request after the <Gather> ends.
If you specify speech as an input with input="speech", Twilio will also include a Confidence parameter value along with the recognized speech result. Confidence contains a confidence score between 0.0 and 1.0 (that is, from 0% to 100% confidence in the result). A higher confidence score means a better likelihood that the transcribed speech result is accurate.
After <Gather> ends and Twilio sends its request to your action URL, if the Confidence score is present you can act on the result. Speech Recognition will never explicitly tell you it didn’t recognize a word; you need to infer that from the Confidence score.
For example, you could re-prompt the user whenever a result falls below a specific Confidence threshold (e.g., < 0.2), repeating until the input is recognized. Re-prompting after low Confidence scores, rather than simply moving forward with an empty or low-confidence speech recognition result, can avoid 500 errors sent back from your programmatic endpoint – or frustrating end users.
Using re-prompting cleverly (for instance, while applying hints, and with a more specific, constrained choice re-prompt) to select from amongst the available relevant choices is a great “combination” strategy, combining this tip with tip 3 above.
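A minimal sketch of that combination, assuming the action request carries SpeechResult and Confidence parameters as described above. The 0.2 cutoff is the example value from the text; the function names, the checking/savings choices, and the /account-type path are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Mapping

CONFIDENCE_THRESHOLD = 0.2  # example cutoff from the text; tune it per prompt


def is_low_confidence(params: Mapping[str, str],
                      threshold: float = CONFIDENCE_THRESHOLD) -> bool:
    """True when the speech result is empty or scored below the threshold."""
    if not params.get("SpeechResult", "").strip():
        return True
    raw = params.get("Confidence")
    if raw is None or raw == "":
        return False  # no score was sent, so the score alone cannot reject it
    try:
        return float(raw) < threshold
    except ValueError:
        return True  # an unparseable score is treated as untrustworthy


def respond(params: Mapping[str, str]) -> str:
    """Re-prompt with constrained choices and hints, or confirm the result."""
    response = ET.Element("Response")
    if is_low_confidence(params):
        gather = ET.SubElement(response, "Gather", {
            "input": "speech",
            "hints": "checking, savings",
            "action": "/account-type",  # hypothetical route
        })
        ET.SubElement(gather, "Say").text = (
            "Sorry, I didn't catch that. Please say checking or savings."
        )
    else:
        ET.SubElement(response, "Say").text = f"You said {params['SpeechResult']}."
    return ET.tostring(response, encoding="unicode")


print(respond({"SpeechResult": "chicken", "Confidence": "0.12"}))
```

Treating a missing score as "cannot judge" rather than as zero matters for speech models that do not return confidence scores; otherwise every result from those models would trigger a re-prompt.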
10. Don’t exhaust your callers’ patience
After a reasonable number of retries – likely two or three at most – try another tactic.
After a certain number of failures, consider transferring the call to a live agent, who may be better able to cope with noisy, indistinct, or unexpected input. Studio and Dialogflow CX make this straightforward with a “Live Agent Handoff” option configurable on the Studio widget. Or, if a customer’s response or question is sufficiently off-script but you still wish to handle it with an automated agent, consider doing a voice-enabled generative AI search for an answer to their query.
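For a hand-built TwiML app (outside Studio or Dialogflow CX), one hedged way to cap retries is to carry an attempt counter in the action URL and switch tactics once it reaches the limit. Everything named here is hypothetical: the limit of three, the agent phone number, the routes, and the function itself.

```python
import xml.etree.ElementTree as ET

MAX_ATTEMPTS = 3  # "two or three at most", per the text
AGENT_NUMBER = "+15555550100"  # hypothetical live-agent number


def twiml_after_attempt(attempt: int, understood: bool) -> str:
    """Continue, re-prompt, or hand off to an agent after the given attempt."""
    response = ET.Element("Response")
    if understood:
        # Input was usable: continue with the next step of the call flow.
        ET.SubElement(response, "Redirect").text = "/next-step"
    elif attempt >= MAX_ATTEMPTS:
        # Retries exhausted: stop re-prompting and transfer to a person.
        ET.SubElement(response, "Say").text = "Let me connect you with an agent."
        ET.SubElement(response, "Dial").text = AGENT_NUMBER
    else:
        # Try again, offering the keypad as well and counting the attempt.
        gather = ET.SubElement(response, "Gather", {
            "input": "dtmf speech",
            "action": f"/answer?attempt={attempt + 1}",
        })
        ET.SubElement(gather, "Say").text = (
            "Sorry, please say or key in your choice again."
        )
    return ET.tostring(response, encoding="unicode")


print(twiml_after_attempt(attempt=3, understood=False))
```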
11. Implement 2FA or other post-processing techniques to deal with near-homonyms
Though its capabilities are getting better quickly, ASR struggles mightily with homonyms: words or phonemes that sound alike, but have different meanings. In particular, alphanumerics – for example, an insurance policy number, bank account number, or patient ID with both letters and numbers in it – can be extremely problematic.
Unfortunately, these are also quite commonly needed in self-service (IVR) automation use cases where ASR is used in delivering Notifications.
One solution to this is to use a combination of tools, such as Twilio Verify for Two-Factor Authentication (2FA), and to request only a portion of a mixed alphanumeric number. For example, you could ask for the last four digits, or only the numeric section of an ID.
With part of an ID, you can then verify via a text message that the system has looked up not only the correct account number but also that the system is talking to the correct person. Other (post-processing) solutions involve using 2FA along with some combination of the above techniques: prompting upon getting a low-confidence recognition score, prompt engineering (to be more specific around letters used in the reprompt), and so on.
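The post-processing half of this approach could look something like the sketch below, which pulls the digits out of a speech result and compares them with the last four digits of an ID on file. It assumes the recognizer may return digits either as numerals or as English words; the helper names and the sample ID are hypothetical, and the follow-up 2FA step with Twilio Verify is not shown.

```python
import re

# Spoken forms a caller might use for each digit ("oh" is a common stand-in for zero).
WORD_TO_DIGIT = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def spoken_digits(speech_result: str) -> str:
    """Collapse a transcript such as 'four 5 six seven' into '4567'."""
    tokens = re.findall(r"[a-z]+|\d", speech_result.lower())
    return "".join(t if t.isdigit() else WORD_TO_DIGIT.get(t, "") for t in tokens)


def matches_last_four(speech_result: str, id_on_file: str) -> bool:
    """Compare what was heard with the last four digits of the stored ID."""
    heard = spoken_digits(speech_result)
    on_file = "".join(ch for ch in id_on_file if ch.isdigit())[-4:]
    return len(heard) == 4 and heard == on_file


# "AB12-34567" is a made-up mixed alphanumeric ID.
print(matches_last_four("four 5 six seven", "AB12-34567"))  # prints: True
```

Requiring exactly four digits means a transcript with stray words that map to digits simply fails the match, which is a natural trigger for a more specific re-prompt.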
Maximizing your chance of success with Automatic Speech Recognition applications
Hopefully, this post has given you some valuable hints, tips, and tricks to architect your speech recognition application for success. By implementing these best practices, you’ll find yourself with happier users – both customers and agents, alike!
Once you’ve implemented the best practices, read our next post in this series about using Google Dialogflow CX’s Virtual Agent Bot.
Russ Kahan is the Principal Product Manager for <Gather> Speech Recognition, Dialogflow Virtual Agents, Media Streams and SIPREC at Twilio. He’s enjoyed programming voice apps and conversing with robots since sometime back in the late nineties – when this stuff was still called “CTI,” for “Computer Telephony Integration” – but he also enjoys real-world pursuits like scouting, skiing, swimming, and mountain biking with his kids. Reach him at rkahan [at] twilio.com.
Jeff Foster is a Software Engineer on Twilio's Programmable Voice team, and he’s been working on Speech Recognition at Twilio for the last 6 years – including the original Dialogflow prototype implementations more than 2 years ago. He can be reached at jfoster [at] twilio.com.