How ISVs Can Add Voice as a Channel to their AI Platform

July 22, 2024
Written by
Reviewed by
Paul Kamp
Twilion

Virtual assistants are becoming increasingly widespread as AI technology evolves at a rapid pace. But building scalable AI agents is complex. In addition to building the agent’s logic, you have to design your offering around conversation pacing, interruption handling, tone, and accents, and strike the right balance between listening and speaking to deliver an optimized user experience. Your technology will change the way we interact with apps and services.

As an ISV, you serve hundreds (if not thousands, or more) of customers. Creating an elegant and scalable platform designed to accommodate these customers is crucial. When adding voice to your AI platform, the stakes are even higher because you also have to account for network reliability, jitter, global presence, and latency. By the end of this blog post, you will have a better understanding of how to integrate Twilio Voice with your virtual assistant.

Prerequisites

Twilio Super Network

Your foundation is the most critical component of your virtual assistant. We at Twilio have invested years building a vast and resilient communications network that spans the globe – so you don’t have to. Our Super Network ensures that your AI platform can connect with users anywhere in the world, providing a local presence through regional data centers and local phone numbers, which reduces latency and improves call quality.

Quality and reliability: Your AI platform benefits from optimized call routing and carrier redundancy, which keep interactions reliable even during peak traffic.

Global infrastructure: Twilio Voice can accommodate your growth without compromising performance. Twilio also differentiates itself through its commitment to user privacy, security, and ethical considerations.

Now that you understand how our network can support your expansion and provide the best voice experience for your customers, let’s look at the voice products you will need to add voice as a channel to your virtual assistant.

Product Briefs

In this section, I will give a brief overview of the products you will need to build scalable voice solutions for your AI platform.

  • Phone numbers: Twilio provides phone numbers in a wide range of countries, enabling businesses to establish a local presence in multiple locations.
  • Programmable Voice: lets you make and receive phone calls programmatically.
  • Media Streams: provides real-time access to a call’s audio over WebSockets, so you can process and analyze the media as it arrives.
  • Twilio <Say>: a text-to-speech (TTS) TwiML verb with three tiers (Basic, Standard, and Premium). The TTS catalog includes voices from Amazon Polly (GA) and Google (Public Beta), and supports SSML.
  • Twilio Enhanced: experimental models designed to give access to the latest speech technology and machine learning research, which can provide higher speech recognition accuracy than other models.
  • Recordings: record customer conversations for quality assurance or compliance.
  • Dialogflow CX Onboarding Guide: Our guidance and walkthrough to integrate your Twilio application with your Google Dialogflow CX virtual agent.
  • Automated Dialogflow Bot Creation: Best practices for creating a virtual agent using the API, which allows you to automate the process.
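As a quick illustration of the <Say> verb, the sketch below builds a TwiML response with Python’s standard library rather than a Twilio helper library. It assumes the Polly.Joanna voice and the en-US language; swap in any voice from the TTS catalog that matches your tier. The greeting text is a placeholder.

```python
import xml.etree.ElementTree as ET


def say_twiml(text: str, voice: str = "Polly.Joanna", language: str = "en-US") -> str:
    """Build a TwiML document that speaks `text` with the given TTS voice.

    The default voice and language are illustrative assumptions.
    """
    response = ET.Element("Response")
    say = ET.SubElement(response, "Say", voice=voice, language=language)
    say.text = text  # ElementTree escapes characters such as & and < for us
    body = ET.tostring(response, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>' + body


print(say_twiml("Hi! How can I help you today?"))
```

Because TwiML verbs execute in order, a <Say> greeting like this can come before the verb that hands the call over to your assistant.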

Architecture

Ideally, your implementation will look something like the diagram below.

Suggested Architecture to integrate Twilio Voice with an ISV AI Platform

Let’s go through a typical call flow:

  1. Your end customer will place a call to a Twilio phone number.
  2. Each customer has a separate subaccount, which may contain one or several phone numbers. Whenever a call comes in, Twilio sends an HTTP request to your application asking for instructions on how to handle the call.
  3. Your application responds by starting a bidirectional stream, which lets you receive the raw audio from Twilio.
  4. At this point, you can capture the end customer’s intent, and this is where your magic happens. Because the stream is bidirectional, you can also send audio back to the end user.
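Steps 2 and 3 of this flow can be sketched as a single webhook handler. This is a minimal, framework-free Python sketch: it takes the webhook’s POST parameters as a dict, uses the AccountSid parameter (the account that owns the called number, here the customer’s subaccount) to look up the customer, and returns TwiML that opens a bidirectional stream with <Connect><Stream>. TENANTS, STREAM_URL, and the "tenant" parameter name are hypothetical; plug the function into whichever web framework you use.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from each customer's subaccount SID to your internal tenant ID.
TENANTS = {"AC_CUSTOMER_SUBACCOUNT_SID": "acme-dental"}

# Hypothetical WebSocket endpoint on your AI platform that receives the audio.
STREAM_URL = "wss://ai.example.com/media"


def handle_incoming_call(form: dict) -> str:
    """Return TwiML for the incoming-call webhook; `form` holds Twilio's POST params."""
    tenant = TENANTS.get(form.get("AccountSid", ""))
    response = ET.Element("Response")
    if tenant is None:
        # Unknown subaccount: apologize and end the call instead of streaming.
        ET.SubElement(response, "Say").text = "Sorry, this number is not configured."
        ET.SubElement(response, "Hangup")
    else:
        # <Connect><Stream> opens a bidirectional Media Stream to your server.
        connect = ET.SubElement(response, "Connect")
        stream = ET.SubElement(connect, "Stream", url=STREAM_URL)
        # Custom parameters reach your WebSocket server in the "start" message.
        ET.SubElement(stream, "Parameter", name="tenant", value=tenant)
    return ET.tostring(response, encoding="unicode")
```

Keying on the subaccount keeps one webhook URL for every customer while still letting your platform load the right agent per tenant.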

Let’s take a look at how to configure your phone number and Media Streams.

Receive raw audio from Twilio Voice

Here is a high-level description of what needs to be configured in the Twilio Console to receive the raw audio stream in your application.

1. Set up a Twilio account and obtain your Twilio credentials. You will need the Account SID and Auth Token to authenticate your API requests.

Account SID and Auth Token in the Twilio Console

2. Buy a new phone number (you can also do this via the API).
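If you provision numbers for many customers, the purchase can be scripted against the REST API: search the AvailablePhoneNumbers resource, then POST the chosen number to IncomingPhoneNumbers. This stdlib-only sketch builds both requests without sending them. The country, area code, and phone number are placeholders, and _request is a hypothetical helper. To place a number in a customer’s subaccount, build the requests with that subaccount’s credentials.

```python
import base64
import urllib.parse
import urllib.request

API_BASE = "https://api.twilio.com/2010-04-01"


def _request(account_sid, auth_token, path, params=None, method="GET"):
    """Build an authenticated request under /Accounts/{account_sid}."""
    creds = base64.b64encode(f"{account_sid}:{auth_token}".encode()).decode()
    query = urllib.parse.urlencode(params or {})
    url = f"{API_BASE}/Accounts/{account_sid}{path}"
    data = None
    if method == "GET" and query:
        url += "?" + query  # search filters go in the query string
    elif method == "POST":
        data = query.encode()  # resource parameters go in the form body
    req = urllib.request.Request(url, data=data, method=method)
    req.add_header("Authorization", f"Basic {creds}")
    return req


def search_local_numbers(sid, token, country="US", area_code="415"):
    """List local numbers available for purchase in an area code."""
    return _request(sid, token, f"/AvailablePhoneNumbers/{country}/Local.json",
                    {"AreaCode": area_code})


def buy_number(sid, token, phone_number):
    """Purchase a number returned by the search (E.164 format)."""
    return _request(sid, token, "/IncomingPhoneNumbers.json",
                    {"PhoneNumber": phone_number}, method="POST")
```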

Twilio Buy a Number screen.

3. In the Twilio Console, navigate to Phone Numbers and select your Twilio phone number. Under “When a call comes in”, configure the webhook URL to point to the endpoint on your AI server that handles incoming calls. Set the rest of the voice configuration, such as the URL for the TwiML response and the HTTP method, to match your virtual assistant’s requirements.
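The same setting can be applied through the API, which helps when you manage numbers for many customers. This stdlib-only sketch builds a POST to the IncomingPhoneNumbers/{Sid}.json resource with the VoiceUrl and VoiceMethod parameters. The number SID and webhook URL are placeholders, and the request is built but not sent.

```python
import base64
import urllib.parse
import urllib.request


def configure_voice_webhook(account_sid, auth_token, number_sid, voice_url, voice_method="POST"):
    """Build a request that points a number's incoming-call webhook at your AI server."""
    url = (f"https://api.twilio.com/2010-04-01/Accounts/{account_sid}"
           f"/IncomingPhoneNumbers/{number_sid}.json")
    body = urllib.parse.urlencode({"VoiceUrl": voice_url, "VoiceMethod": voice_method}).encode()
    req = urllib.request.Request(url, data=body, method="POST")
    # HTTP Basic auth with the Account SID and Auth Token.
    creds = base64.b64encode(f"{account_sid}:{auth_token}".encode()).decode()
    req.add_header("Authorization", f"Basic {creds}")
    return req
```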

Webhook URL for Voice

4. Set up Media Stream events: when a call comes in, Twilio makes an HTTP request to your webhook URL, and your TwiML response opens the media stream to your server. On your server, you can process the audio in real time to perform speech recognition, sentiment analysis, keyword extraction, or anything else specific to your AI platform. Because a Media Stream started with <Connect><Stream> is bidirectional, you can also send your response back to the user. For inspiration, feel free to use this GitHub repository.
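Once the stream is open, Twilio sends JSON messages over the WebSocket (connected, start, media, stop, and mark events). Each media payload carries base64-encoded 8 kHz μ-law audio. The transport-agnostic sketch below shows one way to handle these messages. It buffers about one second of caller audio, passes it to a hypothetical agent callable (a stand-in for your speech recognition and response logic), and replies with outbound media and mark messages. It also builds a clear message, which drops audio Twilio has buffered for playback and is one way to handle interruptions. The class name, buffer size, and mark name are assumptions; feed it messages from the WebSocket library of your choice.

```python
import base64
import json
from typing import Callable, Optional


class MediaStreamSession:
    """Per-call handler for Media Streams messages (no networking included)."""

    CHUNK_BYTES = 8000  # ~1 s of 8 kHz, 8-bit mu-law audio (an assumed batch size)

    def __init__(self, agent: Callable[[bytes], Optional[bytes]]):
        # Hypothetical: `agent` takes mu-law audio and returns mu-law reply audio or None.
        self.agent = agent
        self.stream_sid = None
        self.params = {}
        self.buffer = bytearray()

    def handle(self, raw: str) -> list:
        """Process one inbound message; return outbound JSON messages to send."""
        msg = json.loads(raw)
        event = msg.get("event")
        out = []
        if event == "start":
            self.stream_sid = msg["start"]["streamSid"]
            # Values from <Parameter> elements in your TwiML arrive here.
            self.params = msg["start"].get("customParameters", {})
        elif event == "media":
            self.buffer.extend(base64.b64decode(msg["media"]["payload"]))
            if len(self.buffer) >= self.CHUNK_BYTES:
                reply = self.agent(bytes(self.buffer))
                self.buffer.clear()
                if reply:
                    payload = base64.b64encode(reply).decode()
                    out.append(self._send({"event": "media", "media": {"payload": payload}}))
                    # Twilio echoes the mark back once playback reaches this point.
                    out.append(self._send({"event": "mark", "mark": {"name": "reply-done"}}))
        elif event == "stop":
            self.buffer.clear()
        return out

    def barge_in(self) -> str:
        """Call when the caller starts talking to cut off queued reply audio."""
        return self._send({"event": "clear"})

    def _send(self, payload: dict) -> str:
        return json.dumps({**payload, "streamSid": self.stream_sid})
```

Keeping the protocol logic separate from the WebSocket transport makes it easy to unit test and to reuse across server frameworks.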

Running and testing your integration

Since you will be doing this at scale you will need to consider the following scenarios:

  1. How many concurrent calls can your application support? Is your infrastructure ready to handle that call volume?
  2. Simulate real-world usage scenarios, including varied call durations and call patterns. Include edge cases such as rapid call starts and stops and unexpected disconnections.
  3. Use monitoring tools like Voice Insights to understand KYC, KYT, regulatory compliance, business ROI metrics, and troubleshooting data.
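For the first question, Little’s law gives a rough starting point: average concurrent calls equal the call arrival rate multiplied by the average call duration. The sketch below applies it and also estimates inbound Media Streams bandwidth. It assumes 8 kHz, 8-bit μ-law audio (64 kbit/s), inflated by about 4/3 for base64 encoding, and ignores JSON and WebSocket framing. The peak rate, average duration, and 1.5x headroom factor are placeholder assumptions; replace them with your own traffic data and confirm the result with load tests.

```python
import math


def estimate_capacity(peak_calls_per_hour: float, avg_call_minutes: float,
                      headroom: float = 1.5) -> dict:
    """Rough sizing via Little's law: L = arrival rate x average time in system."""
    arrivals_per_minute = peak_calls_per_hour / 60
    avg_concurrent = arrivals_per_minute * avg_call_minutes
    provision_for = math.ceil(avg_concurrent * headroom)
    # 8,000 samples/s x 8 bits = 64 kbit/s of raw audio per direction;
    # base64 inflates it by ~4/3 (JSON and WebSocket framing ignored).
    inbound_mbps = provision_for * 64 * 4 / 3 / 1000
    return {"avg_concurrent": avg_concurrent,
            "provision_for": provision_for,
            "inbound_mbps": inbound_mbps}


print(estimate_capacity(peak_calls_per_hour=600, avg_call_minutes=4))
# {'avg_concurrent': 40.0, 'provision_for': 60, 'inbound_mbps': 5.12}
```

Treat the output as a floor for your load tests, not a final answer, since real traffic is burstier than an average rate suggests.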

Conclusion

Congratulations! Now you are familiar with the products that you will need in order to add voice as a channel to your AI platform. We have several ISVs providing solutions like sentiment analysis for support and sales teams, omnichannel virtual assistants, and even real-time transcriptions using our Dialogflow integration.

At the time of publication of this blog post, we are working on a product for ISVs that is currently in pilot. This solution will enable you to reduce latency to less than a second and provide a more human-like experience. To learn more, feel free to reach out to your account executive.

Allison Torres is a Solutions Engineer for the ISV team. She is passionate about collaborating closely with customer product teams to expedite their roadmap and drive innovation.