ConversationRelay Application and Architecture for Voice AI Applications Built on AWS
Twilio is laser-focused on innovating with Artificial Intelligence. ConversationRelay, which we launched at our SIGNAL London event, is our latest release in this effort.
When I first heard about ConversationRelay a few months ago, I knew right away that we were building something awesome. ConversationRelay, which makes building voice AI integrations straightforward, is now in Public Beta. It’s one way Twilio will help our customers leverage AI to build better and more personalized experiences for their customers. The high level value proposition of AI-backed-voice-agents is clear enough, but how will ConversationRelay help our customers?
I’ll answer that, and more, in this blog post.
Below, I’ll detail an example reference application and architecture for building enterprise applications based on Twilio ConversationRelay. The reference application is built using AWS and their serverless products, but the architecture provides a framework for building solutions on other cloud providers. But first, I’ll give you an overview of ConversationRelay and explain why, where, and how it’ll help you build your AI-powered agents.
Let’s dive in.
Where ConversationRelay fits in voice applications
Historically, voice applications have been more challenging to build than messaging (SMS, WhatsApp, Chat) applications because voice is a synchronous channel that cannot be easily broken into an event-based architecture. For voice AI applications, we builders need to convert streaming voice into text, stream that text to an AI application, stream back the response, and finally convert the response back to voice, all while handling session state, interruptions, speaker detection, and noise cancellation, and while minimizing latency. This is no trivial task!
Here is an architectural diagram showing the complex systems that a voice AI application needs to manage over a synchronous voice call where latency is of paramount importance!
How does this look with ConversationRelay?
Here’s how ConversationRelay – circled – simplifies the architecture:
With ConversationRelay, enterprises can focus on building the components that are true differentiators for their business: business and application logic and the interactions with their LLM(s) of choice. Twilio provides a simple interface for your applications while handling the complexities of speech-to-text, text-to-speech, interruptions, and more – all while providing the scalability expected from Twilio.
Are you excited now, too? Great, but there is more. Let’s dig deeper into that interface that ConversationRelay provides for your business voice AI Applications.
Interfacing your application with ConversationRelay
As mentioned above, voice calls are synchronous and voice applications need to maintain a constant connection. Typically this means that voice AI applications need to maintain a continuous session state as well as streams to both the caller and the application (including to the LLM). This adds additional complexity to the diagrams above.
Ideally, enterprises would be able to build voice applications using event-based paradigms similar to what they are able to do with messaging channel applications and frankly most other enterprise applications. Thankfully, ConversationRelay helps in this regard!
Consider the diagram below:
When a caller speaks, Twilio ConversationRelay uses a speech-to-text (STT) provider to convert the speech into text and sends that completed text to the business application in a message over a websocket connection. In the diagram above, speech from the caller arrives at your application as “events” to be processed (websocket messages). Your application handles these events and can stream text responses from your application (and LLM) back to Twilio ConversationRelay as websocket messages, where the streamed text is converted into speech by a text-to-speech (TTS) provider.
This is the best of all worlds. Twilio handles the voice call connection, orchestration and session management, and speech-to-text and text-to-speech conversions (through our third party providers, such as Google Speech-to-Text and Amazon Polly). Your application can consume events and then stream back responses.
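To make the event flow concrete, here is a minimal sketch of how an application might route inbound ConversationRelay messages and build the outbound text messages. The message fields (`type`, `voicePrompt`, `digit`, `token`, `last`) reflect my reading of the ConversationRelay websocket documentation, so verify them against the current docs. The `generateReply` callback is a hypothetical stand-in for an LLM call, and it returns chunks synchronously only to keep the sketch short; a real application would stream them as they arrive.

```javascript
// Sketch only: route ConversationRelay websocket messages as events.
// Field names are assumptions based on Twilio's ConversationRelay docs.

// Turn LLM text chunks into outbound "text" messages.
// The final message is flagged with last: true to mark the end of the reply.
function toTextTokens(chunks) {
  const parts = chunks.filter((chunk) => chunk.length > 0);
  return parts.map((token, index) => ({
    type: "text",
    token,
    last: index === parts.length - 1,
  }));
}

// Handle one inbound websocket message and return the messages to send back.
// generateReply is hypothetical: it takes the caller's transcribed speech
// and returns an array of text chunks.
function handleRelayMessage(rawMessage, generateReply) {
  const message = JSON.parse(rawMessage);
  switch (message.type) {
    case "setup":
      // Sent once when the session starts; a good place to load session state.
      return [];
    case "prompt":
      // The caller finished speaking; voicePrompt holds the transcribed text.
      return toTextTokens(generateReply(message.voicePrompt));
    case "interrupt":
      // The caller spoke over the response; nothing to send in this sketch.
      return [];
    case "dtmf":
      // The caller pressed a key on the dial pad.
      return toTextTokens([`You pressed ${message.digit}.`]);
    default:
      return [];
  }
}

// Usage: a fake "LLM" that echoes the caller.
const echoLlm = (text) => ["You said: ", text];
const inbound = JSON.stringify({ type: "prompt", voicePrompt: "Two tacos, please", last: true });
for (const outbound of handleRelayMessage(inbound, echoLlm)) {
  console.log(JSON.stringify(outbound));
}
```

Keeping the routing in a pure function like this makes it easy to test without a live call; the websocket send and the LLM stream can be wired in around it.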
The final element to cover in this introduction is latency. None of this works if the latency exceeds what is expected in normal human conversation. ConversationRelay combines Twilio’s connections to best-of-breed ASR and TTS providers into a tool set capable of powering whatever AI-backed voice application your organization wants to build.
Reference Application
Now, let’s turn our attention to building a reference application using ConversationRelay. We’ll use OpenAI as our LLM, and build our application architecture on top of AWS.
ConversationRelay Reference Application architecture
Here’s the application we’ll be building together:
The diagram includes notes, but let’s dive deeper into each one.
- Twilio Voice Infrastructure: Leverage Twilio’s industry-leading CPaaS capabilities to connect to your customers. The inbound voice handler is routed to a REST API that will establish the ConversationRelay session using Programmable Voice (and the Twilio Markup Language, or TwiML).
- TwiML Response: Respond to a new inbound call using TwiML. This initial step can use user context to personalize the session for the user. For example, language and voice type can be set for each caller and the model and prompt can be set and customized in real time utilizing customer history and preferences.
- Datasource: Production applications clearly need capable databases to maintain state. The reference application uses DynamoDB.
- ConversationRelay: The TwiML response described above establishes a unique websocket session and is ready to convert speech-to-text and send the text to your application. It’s also ready to receive inbound streams of text to reply to the caller using the specified text-to-speech provider.
- Websocket API: ConversationRelay sends converted text from the caller to your application, where you can handle the converted text as events. LLMs will stream text responses back to your application where you can, in turn, stream the text “chunks” back to ConversationRelay to be converted into speech.
- Business Application: This reference application uses AWS Lambdas and has examples on how to handle tool calling (including multiple parallel tool calls).
- LLM: The reference application calls OpenAI but the actual interface is in a single file that could be converted for a different LLM.
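As an illustration of the TwiML Response step above, here is a minimal sketch of building the TwiML that connects a call to ConversationRelay. The `<Connect><ConversationRelay>` structure and the `url`, `welcomeGreeting`, and `language` attributes are from Twilio's TwiML documentation as I understand it; the function names and the `wss://example.com/relay` URL are hypothetical, and the reference application may build its response differently.

```javascript
// Sketch only: build a TwiML response that starts a ConversationRelay session.

// Escape characters that are not allowed inside an XML attribute value.
function escapeXmlAttribute(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// websocketUrl: where ConversationRelay should connect (your websocket API).
// params: per-caller settings, rendered as attributes on <ConversationRelay>.
function buildConversationRelayTwiml(websocketUrl, params = {}) {
  const attributes = { url: websocketUrl, ...params };
  const attributeText = Object.entries(attributes)
    .map(([name, value]) => `${name}="${escapeXmlAttribute(value)}"`)
    .join(" ");
  return (
    '<?xml version="1.0" encoding="UTF-8"?>' +
    `<Response><Connect><ConversationRelay ${attributeText} /></Connect></Response>`
  );
}

// Usage: personalize the greeting and language for one caller.
console.log(
  buildConversationRelayTwiml("wss://example.com/relay", {
    welcomeGreeting: "Welcome back! Ready to order?",
    language: "en-US",
  })
);
```

Because the attributes are plain data, they can be looked up per caller (for example from the datasource) before the TwiML is returned.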
This all sounds great, but does it work?
It sure does! Let’s take a quick video tour before proceeding…
Let’s get started…
About the repo
This repo is broken into three top level folders.
The decoupled-architecture folder breaks the major pieces of the application into separate CloudFormation “stacks”. This follows best practices, and is included in the repo in case your organization wants to follow a similar pattern. Most people will want to start with a stack in the single-stack-solutions folder.
The prompts folder has a few different “use-cases” and the prompts are saved at this level so that they can be referenced in other places.
Finally the single-stack-solutions folder contains individual solutions in single CloudFormation “stacks”. These solutions are the best place to start because they are easier to get up and running quickly.
This blog is going to show you how to spin up the restaurant-ordering use case from the single-stack-solutions folder.
Want to watch a video of the installation instead?
Prerequisites
This is not a beginner level build! You should have some knowledge of AWS, serverless computing, and programming before continuing.
- Twilio Account. If you don’t yet have one, you can sign up for a free account here.
- A phone number provisioned in your Twilio Account.
- AWS Account with permissions to provision API Gateways, Lambdas, Step Functions, S3 buckets, IAM Roles & Policies, and SNS topics. You can sign up for an account here.
- AWS CLI installed with AWS credentials configured.
- AWS SAM CLI installed.
- Node.js installed on your computer.
- OpenAI - An OpenAI account and API Key to make calls to their Chat Completion API.
Let’s Build it!
1. Download the Code for this Application
Download the code from this repo, and then open up the folder in your preferred development environment.
The repo contains everything needed to deploy multiple solutions. Again, this blog post is going to spin up the restaurant-ordering app in the single-stack-solutions folder so navigate there first. Here is the correct location in my IDE:
First, we need to install a couple of node packages. From a terminal window in single-stack-solutions/restaurant-ordering, run the following commands to install some libraries:
2. Enter your API credentials
In order to make API calls, you will need to enter your credentials. For the purpose of this blog post and to get you up and running quickly, you can enter your credentials directly into the yaml file, but for best practices (and certainly for production use), save your credentials using methods approved by your organization.
Open up the file template.yaml in the parent directory. This yaml file contains the instructions needed to provision the AWS resources.
In the template.yaml file, use FIND and enter OPENAI.
Uncomment the line that looks like this:
…and replace YOUR-OPENAI-API-KEY with your actual OpenAI API key.
Now, comment out the line that looks like this (that is, put a ‘#’ in front of it):
…that line uses AWS Secrets Manager to securely store your credentials.
You also need to do the same for TWILIO_ACCOUNT_SID, TWILIO_AUTH_TOKEN, SENDGRID_API_KEY, and TWILIO_EMAIL_FROM_ADDRESS. Search for those terms in the yaml file, uncomment the lines that allow you to enter the values directly into the yaml file, and comment out the lines that use AWS Secrets Manager. These values are needed if you want to send SMS messages or emails from your application. They are not required to spin up this application. Enter fake data for these values if you want to skip this functionality for now.
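If you want a quick sanity check that the values made it through, a sketch like the following could run at startup. It assumes the values from the yaml file are exposed to your code as environment variables with the same names; that, along with the `OPENAI_API_KEY` name and the `checkCredentials` helper, is my assumption rather than something taken from the repo.

```javascript
// Sketch only: report which credentials are missing or still placeholders.
// Assumes the template exposes these values as environment variables.

// Assumed name for the OpenAI key; check template.yaml for the real one.
const REQUIRED_CREDENTIALS = ["OPENAI_API_KEY"];

// Only needed if you want to send SMS messages or emails.
const OPTIONAL_CREDENTIALS = [
  "TWILIO_ACCOUNT_SID",
  "TWILIO_AUTH_TOKEN",
  "SENDGRID_API_KEY",
  "TWILIO_EMAIL_FROM_ADDRESS",
];

function checkCredentials(env) {
  // A value counts as set if it is a non-empty string that is not a
  // leftover "YOUR-..." placeholder such as YOUR-OPENAI-API-KEY.
  const isSet = (name) =>
    typeof env[name] === "string" &&
    env[name].trim() !== "" &&
    !env[name].startsWith("YOUR-");
  return {
    missingRequired: REQUIRED_CREDENTIALS.filter((name) => !isSet(name)),
    missingOptional: OPTIONAL_CREDENTIALS.filter((name) => !isSet(name)),
  };
}

// Usage: check the current process environment.
const credentialReport = checkCredentials(process.env);
if (credentialReport.missingRequired.length > 0) {
  console.warn("Missing required credentials:", credentialReport.missingRequired.join(", "));
}
```

Note that fake values entered to skip the SMS and email features will pass this check, since the sketch only detects empty or placeholder values.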
3. Deploy Code
With those settings in place, we are ready to deploy! From a terminal window, be sure you are in the single-stack-solutions/restaurant-ordering directory, and run:
This command goes through the yaml file template.yaml and prepares the stack to be deployed.
In order to deploy the SAM application to your AWS account, you need to be sure that you have the proper AWS credentials configured. Follow these instructions. This application uses your locally saved AWS profile to deploy to your AWS account.
Once you have authenticated into your AWS account, you can run:
Note that the command references aws-profile.profile in order to authenticate and deploy to your AWS account.
This will start an interactive command prompt session to set basic configurations and then deploy all of your resources via a stack in CloudFormation. Here are the answers to enter after running that command (except, substitute your AWS Region of choice):
After answering the last questions, SAM will create a changeset that lists all of the resources that will be deployed. Answer “y” to the last question to have AWS actually start to create the resources.
The SAM command prompt will let you know when it has finished deploying all of the resources. You can then go to CloudFormation in your AWS Console and browse through the new stack you just created. All of the Lambdas, Lambda Layers, API Gateways, IAM Roles, and SNS topics are created automatically. (IaC – Infrastructure as Code – is awesome!)
You should see a new stack called CR-RESTAURANT-ORDERING:
Click into that stack, go to the Outputs tab, and copy the value for TwimlAPI, as you will need it in the next step! It should look like this:
4. Connect your Twilio phone number
With the value copied from the previous step, go to your Twilio Console and navigate to the phone number that you want to use for this application. In the Voice Handler section under A call comes in, select webhook and then paste in the value copied above to connect your Twilio phone number to the application you just spun up.
The correct Twilio page section will look like this:
5. Load the configuration details
Now, we need to load the configuration data that will power your voice AI application.
Run this command from the restaurant-ordering folder:
This will create an item in your DynamoDB instance that will be used to configure each caller session. It is worth taking a look at this item in your DynamoDB Console.
In your AWS Console, navigate to DynamoDB, select TABLES, and then select the table called CR-RESTAURANT-ORDERING-ConversationRelayAppDatabase. Lastly, click on EXPLORE TABLE ITEMS.
Once there, you should find and click on an item with a primary key called restaurantOrderingUseCase. The item will look like this:
Hopefully, you recognize the attribute names and see how they can be used. The ConversationRelayParams object can control how your application interacts with Twilio. The prompt, dtmfHandlers, and tools attributes connect your application.
It is important to note that while these values are saved in this item, they can be changed in real time as needed to build fully personalized experiences.
For example, in the attributes above you could:
- Inject order history into the prompt if it exists and have the LLM ask if they want to place the same order that they previously placed.
- Change the language in the ConversationRelayParams if it is different from the default.
- Add or remove DTMF handlers for special customers or circumstances.
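The personalization ideas above can be sketched as a small function that layers caller-specific changes on top of the default configuration item. The attribute names `ConversationRelayParams`, `prompt`, and `dtmfHandlers` come from the item described above, but their exact shapes here (a plain object keyed by digit for `dtmfHandlers`, for instance) and every field on the `caller` object are hypothetical, chosen only for illustration.

```javascript
// Sketch only: personalize a session from the default configuration item.
// The shapes of defaultConfig and caller are assumptions for illustration.
function personalizeSession(defaultConfig, caller = {}) {
  // Copy the nested objects so the shared default item is never mutated.
  const session = {
    ...defaultConfig,
    ConversationRelayParams: { ...defaultConfig.ConversationRelayParams },
    dtmfHandlers: { ...defaultConfig.dtmfHandlers },
  };

  // Inject order history into the prompt if it exists.
  if (caller.lastOrder) {
    session.prompt +=
      `\n\nThe caller previously ordered: ${caller.lastOrder}. ` +
      "Ask whether they would like to place the same order again.";
  }

  // Override the language when the caller prefers a non-default one.
  if (caller.language) {
    session.ConversationRelayParams.language = caller.language;
  }

  // Add extra DTMF handlers for special customers or circumstances.
  Object.assign(session.dtmfHandlers, caller.extraDtmfHandlers || {});

  return session;
}

// Usage with made-up data.
const defaultItem = {
  prompt: "You are a friendly restaurant ordering assistant.",
  ConversationRelayParams: { language: "en-US" },
  dtmfHandlers: { 1: "repeat-menu" },
};
const personalized = personalizeSession(defaultItem, {
  lastOrder: "one large pepperoni pizza",
  language: "es-US",
  extraDtmfHandlers: { 0: "transfer-to-manager" },
});
console.log(personalized.ConversationRelayParams.language);
```

Copying instead of mutating matters here because the same default item is used to configure every caller session.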
Make a call!
You should now be able to call your Twilio phone number and start talking with your new Voice AI Application!
Try changing the prompt and ConversationRelay parameters directly from the DynamoDB console. Inspect the session items generated by your calls to see how this application stitches together conversations.
Cleanup
To avoid any undesired costs, you can delete the application CloudFormation Stack from the AWS Console. Select the stack and the DELETE option as shown below:
Deploy to production
While you can get this system working pretty quickly, it is not ready for your production environment. This blog post and repo are intended to inspire and help you start building awesome AI-backed voice applications.
Conclusion
In this post, I introduced Twilio’s ConversationRelay from a technical perspective. ConversationRelay makes it straightforward to deploy AI-backed voice applications by handling session orchestration (including interruption handling), and by managing best of breed third party providers for speech-to-text and text-to-speech while minimizing latency.
I then walked through a reference architecture and application built on AWS to show how enterprises could build their own AI-backed voice applications. I demonstrated a build using OpenAI’s SDK to connect to their LLMs, though your enterprise could choose a different strategy. You can quickly connect a Twilio phone number to this new application and start interacting with your application and LLM.
Twilio is thrilled to launch ConversationRelay, and we are even more excited to see the awesome AI-backed voice applications that our customers build!
Bonus Material
- Check out a video demo I recorded: Deploy Apartment Search Single Stack Solution
Dan Bartlett has been building web applications since the first dotcom wave. The core principles from those days remain the same but these days you can build cooler things faster. He can be reached at dbartlett [at] twilio.com.