ConversationRelay Application and Architecture for Voice AI Applications Built on AWS

November 20, 2024
Written by Dan Bartlett
Reviewed by Paul Kamp, Twilion

Twilio is laser-focused on innovating with Artificial Intelligence. ConversationRelay, which we launched at our SIGNAL London event, is our latest release in this effort.

When I first heard about ConversationRelay a few months ago, I knew right away that we were building something awesome. ConversationRelay, which makes building voice AI integrations straightforward, is now in Public Beta. It’s one way Twilio will help our customers leverage AI to build better and more personalized experiences for their customers. The high-level value proposition of AI-backed voice agents is clear enough, but how will ConversationRelay help our customers?

I’ll answer that, and more, in this blog post.

Below, I’ll detail an example reference application and architecture for building enterprise applications based on Twilio ConversationRelay. The reference application is built using AWS and their serverless products, but the architecture provides a framework for building solutions on other cloud providers. But first, I’ll give you an overview of ConversationRelay and explain why, where, and how it’ll help you build your AI-powered agents.

Let’s dive in.

Where ConversationRelay fits in voice applications

Historically, voice applications have been more challenging to build than messaging (SMS, WhatsApp, Chat) applications because voice is a synchronous channel that cannot easily be broken into an event-based architecture. For voice AI applications, we builders need to convert streaming voice into text, stream that text to an AI application, stream back the response, and finally convert the response back to voice, all while handling session state, interruptions, speaker detection, noise cancellation, and minimizing latency. This is no trivial task!

Here is an architectural diagram showing the complex systems that a voice AI application needs to manage over a synchronous voice call where latency is of paramount importance!

Diagram showing a business application processing caller interactions using multiple technologies like ASR and text-to-speech.

How does this look with ConversationRelay?

Here’s how ConversationRelay – circled – simplifies the architecture:

Diagram showing Twilio handling caller interactions with ASR, Text-to-Speech, and routing to business logic or LLM using ConversationRelay.

With ConversationRelay, enterprises can focus on building the components that are true differentiators for their business: business and application logic and the interactions with their LLM(s) of choice. Twilio provides a simple interface for your applications while handling the complexities of speech-to-text, text-to-speech, interruptions, and more – all while providing the scalability expected from Twilio.

Which TTS and ASR Providers does ConversationRelay use?

Twilio is working with several providers and allows you to select providers, voices, and languages through the Twilio console as you build ConversationRelay sessions. Twilio will give our customers choices from best of breed providers while optimizing the connections to these providers to drive low latency conversations. Refer to the docs for the most recent list of providers available.

Are you excited now, too? Great, but there is more. Let’s dig deeper into that interface that ConversationRelay provides for your business voice AI Applications.

Interfacing your application with ConversationRelay

As mentioned above, voice calls are synchronous and voice applications need to maintain a constant connection. Typically this means that voice AI applications need to maintain continuous session state as well as streams to both the caller and the application (including to the LLM). This adds complexity to the diagrams above.

Ideally, enterprises would be able to build voice applications using event-based paradigms similar to what they are able to do with messaging channel applications and frankly most other enterprise applications. Thankfully, ConversationRelay helps in this regard!

Consider the diagram below:

Diagram showing Twilio's ConversationRelay using WebSocket to handle caller speech and send responses

When a caller speaks, Twilio ConversationRelay uses a speech-to-text (ASR) provider to convert the speech into text and sends that completed text to the business application in a message over a websocket connection. In the diagram above, speech from the caller arrives at your application as “events” to be processed (websocket messages). Your application handles these events and can stream text responses back from your application (and LLM) as websocket messages to Twilio ConversationRelay, where the streamed text is converted into speech.

This is the best of all worlds. Twilio handles the voice call connection, orchestration and session management, and speech-to-text and text-to-speech conversions (through our third party providers, such as Google Speech-to-Text and Amazon Polly). Your application can consume events and then stream back responses.
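In practice, this means your websocket handler reduces to dispatching on message types. Here is a minimal sketch in Node.js; the message and frame shapes ("setup", "prompt", "text") follow the patterns in the ConversationRelay docs, but treat the exact field names as illustrative rather than authoritative.

```javascript
// Minimal sketch of a ConversationRelay websocket message dispatcher.
// Message and frame shapes here are illustrative; check the current
// ConversationRelay docs for the authoritative schema.
function handleRelayMessage(rawMessage, send) {
  const msg = JSON.parse(rawMessage);

  switch (msg.type) {
    case "setup":
      // First message on the socket: carries call metadata (e.g., callSid).
      return { event: "session-started", callSid: msg.callSid };

    case "prompt":
      // Completed speech-to-text from the caller. A real handler would call
      // the LLM here and stream tokens back; this sketch sends a canned reply.
      send(JSON.stringify({ type: "text", token: "One moment, please.", last: true }));
      return { event: "prompt-handled", voicePrompt: msg.voicePrompt };

    case "interrupt":
      // The caller spoke over the TTS playback: stop streaming the response.
      return { event: "interrupted" };

    default:
      return { event: "ignored", type: msg.type };
  }
}
```

Everything your application does, from tool calls to LLM round trips, hangs off these event handlers.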

The final element to cover in this introduction is latency. None of this works if the latency exceeds what is expected in normal human conversation. ConversationRelay combines Twilio’s connections to best-of-breed ASR and TTS providers into a tool set capable of powering whatever AI-backed voice application your organization wants to build.
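Streaming is a big part of keeping latency low: rather than waiting for the LLM’s full completion, each text chunk is forwarded to ConversationRelay as it arrives. The sketch below shows that mapping over an already-collected array of OpenAI-style chunks for clarity (in a live app they arrive as a stream); the outbound frame shape is illustrative.

```javascript
// Map streamed LLM chunks (OpenAI Chat Completions delta format) into the
// "text" frames ConversationRelay converts to speech. Frame fields are
// illustrative; consult the docs for the authoritative schema.
function chunksToRelayFrames(chunks) {
  const frames = [];
  for (const chunk of chunks) {
    const token = chunk.choices?.[0]?.delta?.content ?? "";
    if (token) frames.push({ type: "text", token, last: false });
  }
  // A final frame with last: true tells TTS the utterance is complete.
  frames.push({ type: "text", token: "", last: true });
  return frames;
}
```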

Reference Application

Now, let’s turn our attention to building a reference application using ConversationRelay. We’ll use OpenAI as our LLM, and build our application architecture on top of AWS.

ConversationRelay Reference Application architecture

Here’s the application we’ll be building together:

Reference App relay for ConversationRelay on AWS

The diagram includes notes, but let’s dive deeper into each one.

  1. Twilio Voice Infrastructure: Leverage Twilio’s industry-leading CPaaS capabilities to connect to your customers. The inbound voice handler is routed to a REST API that will establish the ConversationRelay session using Programmable Voice (and the Twilio Markup Language, or TwiML).
  2. TwiML Response: Respond to a new inbound call using TwiML. This initial step can use user context to personalize the session for the user. For example, language and voice type can be set for each caller and the model and prompt can be set and customized in real time utilizing customer history and preferences.
  3. Datasource: Production applications clearly need capable databases to maintain state. The reference application uses DynamoDB.
  4. ConversationRelay: The TwiML from #2 establishes a unique websocket session and is ready to convert speech-to-text and send the text to your application. It’s also ready to receive inbound streams of text to reply to the caller using the specified text-to-speech provider.
  5. Websocket API: ConversationRelay sends converted text from the caller to your application, where you can handle the converted text as events. LLMs will stream text responses back to your application where you can, in turn, stream the text “chunks” back to ConversationRelay to be converted into speech.
  6. Business Application: This reference application uses AWS Lambdas and has examples on how to handle tool calling (including multiple parallel tool calls).
  7. LLM: The reference application calls OpenAI but the actual interface is in a single file that could be converted for a different LLM.
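To make step 2 concrete, here is a hedged sketch of the TwiML a handler could return to start a session. The `<Connect><ConversationRelay>` pairing and the url attribute follow the ConversationRelay docs; the other attributes and the helper function itself are illustrative, so check the current TwiML reference before relying on them.

```javascript
// Build the TwiML that hands an inbound call to ConversationRelay (step 2).
// The url must point at the websocket API from step 5; welcomeGreeting and
// language are examples of per-caller personalization.
function buildConnectTwiml(websocketUrl, caller = {}) {
  const language = caller.language || "en-US";
  const greeting = caller.greeting || "Hello! How can I help you today?";
  return (
    '<?xml version="1.0" encoding="UTF-8"?>' +
    "<Response><Connect>" +
    `<ConversationRelay url="${websocketUrl}" ` +
    `welcomeGreeting="${greeting}" language="${language}" />` +
    "</Connect></Response>"
  );
}
```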

This all sounds great, but does it work?

It sure does! Let’s take a quick video tour before proceeding…

Let’s get started…

About the repo

This repo is broken into three top level folders.

The decoupled-architecture folder breaks the major pieces of the application into separate CloudFormation “stacks”. This follows best practices, and is included in the repo in case your organization wants to follow a similar pattern. Most people will want to start with a stack in the single-stack-solutions folder.

The prompts folder has a few different “use-cases” and the prompts are saved at this level so that they can be referenced in other places.

Finally the single-stack-solutions folder contains individual solutions in single CloudFormation “stacks”. These solutions are the best place to start because they are easier to get up and running quickly.

This blog is going to show you how to spin up the restaurant-ordering use case from the single-stack-solutions folder.

Want to watch a video of the installation instead?

Prerequisites

This is not a beginner level build! You should have some knowledge of AWS, serverless computing, and programming before continuing.

  • Twilio Account. If you don’t yet have one, you can sign up for a free account here.
  • A phone number provisioned in your Twilio Account.
  • AWS Account with permissions to provision API Gateways, Lambdas, step functions, S3 buckets, IAM Roles & Policies, and SNS topics. You can sign up for an account here.
  • AWS CLI installed with AWS credentials configured.
  • AWS SAM CLI installed.
  • Node.js installed on your computer.
  • OpenAI - An OpenAI account and API Key to make calls to their Chat Completion API.

Let’s Build it!

1. Download the Code for this Application

Download the code from this repo, and then open up the folder in your preferred development environment.

Screenshot of the GitHub repository page for the Conversation-Relay-Serverless project.

The repo contains everything needed to deploy multiple solutions. Again, this blog post is going to spin up the restaurant-ordering app in the single-stack-solutions folder so navigate there first. Here is the correct location in my IDE:

Screenshot showing a folder structure of a restaurant ordering system project with various sub-folders and files.

First, we need to install a few node packages. From a terminal window in single-stack-solutions/restaurant-ordering, run the following commands to install the libraries:

$ npm --prefix ./layers/layer-cr-open-ai-client/nodejs install
$ npm --prefix ./layers/layer-cr-open-sendgrid-email-client/nodejs install
$ npm --prefix ./layers/layer-cr-twilio-client/nodejs install

2. Enter your API credentials

In order to make API calls, you will need to enter your credentials. For the purpose of this blog post and to get you up and running quickly, you can enter your credentials directly into the yaml file, but for best practices (and certainly for production use), save your credentials using methods approved by your organization.

Open up the file template.yaml in the parent directory. This yaml file contains the instructions needed to provision the AWS resources.

In the template.yaml file, use FIND to search for OPENAI.

Uncomment the line that looks like this:

# OPENAI_API_KEY: "YOUR-OPENAI-API-KEY"

…and replace YOUR-OPENAI-API-KEY with your actual OPENAI API Key.

Now, comment out the line that looks like this (that is, put a ‘#’ in front of it):

OPENAI_API_KEY: '{{resolve:secretsmanager:CR_RESTAURANT_ORDERING:SecretString:OPENAI_API_KEY}}'

…that line uses AWS Secrets Manager to securely store your credentials.

You also need to do the same for TWILIO_ACCOUNT_SID, TWILIO_AUTH_TOKEN, SENDGRID_API_KEY, and TWILIO_EMAIL_FROM_ADDRESS. Search for those terms in the yaml file, uncomment the lines that allow you to enter the values directly into the yaml file, and comment out the lines that use AWS Secrets Manager. These values are needed if you want to send SMS messages or emails from your application. They are not required to spin up this application. Enter fake data for these values if you want to skip this functionality for now.
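After those edits, the relevant environment variables might look like the fragment below. This is an illustrative sketch only; match the indentation and surrounding keys already in the repo's template.yaml, and note that the values shown are placeholders.

```yaml
# Illustrative fragment; values are placeholders. For production, keep the
# Secrets Manager lines instead of inline credentials.
OPENAI_API_KEY: "sk-your-openai-api-key"
# OPENAI_API_KEY: '{{resolve:secretsmanager:CR_RESTAURANT_ORDERING:SecretString:OPENAI_API_KEY}}'
TWILIO_ACCOUNT_SID: "ACxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
TWILIO_AUTH_TOKEN: "your-auth-token"
SENDGRID_API_KEY: "SG.your-sendgrid-key"
TWILIO_EMAIL_FROM_ADDRESS: "orders@example.com"
```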

3. Deploy Code

With those settings in place, we are ready to deploy! From a terminal window, be sure you are in the single-stack-solutions/restaurant-ordering directory, and run:

$ sam build

This command goes through the yaml file template.yaml and prepares the stack to be deployed.

In order to deploy the SAM application to your AWS account, you need to be sure that you have the proper AWS credentials configured. Follow these instructions. This application uses your locally saved AWS profile to deploy to your AWS account.

The file called aws-profile.profile in the root directory of this repo needs to be set with your profile! Then run:

$ sam deploy --stack-name CR-RESTAURANT-ORDERING --template template.yaml --profile $(cat ../../aws-profile.profile) --capabilities CAPABILITY_NAMED_IAM

Note that the command references aws-profile.profile in order to authenticate and deploy to your AWS account.

This will start an interactive command prompt session to set basic configurations and then deploy all of your resources via a stack in CloudFormation. Here are the answers to enter after running that command (except, substitute your AWS Region of choice):

Configuring SAM deploy
======================
        Looking for config file [samconfig.toml] :  Not found
        Setting default arguments for 'sam deploy'
        =========================================
        Stack Name [CR-RESTAURANT-ORDERING]: CR-RESTAURANT-ORDERING
        AWS Region [us-east-1]: us-east-1 
        #Shows you resources changes to be deployed and require a 'Y' to initiate deploy
        Confirm changes before deploy [y/N]: N
        #SAM needs permission to be able to create roles to connect to the resources in your template
        Allow SAM CLI IAM role creation [Y/n]: Y
        #Preserves the state of previously provisioned resources when an operation fails
        Disable rollback [y/N]: N
        CallSetupFunction has no authentication. Is this okay? [y/N]: y
        Save arguments to configuration file [Y/n]: Y
        SAM configuration file [samconfig.toml]: 
        SAM configuration environment [default]:

After answering the last question, SAM will create a changeset that lists all of the resources that will be deployed. Answer “y” to the final prompt to have AWS actually start to create the resources.

Previewing CloudFormation changeset before deployment
======================================================
Deploy this changeset? [y/N]:

The SAM command prompt will let you know when it has finished deploying all of the resources. You can then go to your AWS Console and CloudFormation and browse through the new stack you just created. All of the Lambdas, Lambda Layers, API Gateways, IAM Roles, and SNS topics are created automatically. (IaC – Infrastructure as Code – is awesome!)

You should see a new stack called CR-RESTAURANT-ORDERING:

AWS CloudFormation dashboard showing stack CR-RESTAURANT-ORDERING with status UPDATE_COMPLETE.

Click into that stack and then go into the Outputs tab and copy the value for TwimlAPI as you will need it in the next step! It should look like this:

Screenshot of AWS CloudFormation stack outputs page showing Twilio API URL and related details.

4. Connect your Twilio phone number

With the value copied from the previous step, go to your Twilio Console and navigate to the phone number that you want to use for this application. In the Voice Handler section under A call comes in, select webhook and then paste in the value copied above to connect your Twilio phone number to the application you just spun up.

The correct Twilio page section will look like this:

Voice configuration page with a red arrow pointing to an empty URL field for webhook setup.

5. Load the configuration details

Now, we need to load the configuration data that will power your voice AI application.

Run this command from the restaurant-ordering folder:

$ aws dynamodb put-item --table-name CR-RESTAURANT-ORDERING-ConversationRelayAppDatabase --item "$(node ./configuration/dynamo-loaders/restaurantOrderingUseCase.js | cat)" --profile $(cat ../../aws-profile.profile)

This will create an item in your DynamoDB instance that will be used to configure each caller session. It is worth taking a look at this item in your DynamoDB Console.

In your AWS Console, navigate to DynamoDB and then select TABLES, and then the table called CR-RESTAURANT-ORDERING-ConversationRelayAppDatabase. Lastly click on EXPLORE TABLE ITEMS.

Once there you should find and click on an item with a primary key called restaurantOrderingUseCase. The item will look like this:

Screenshot of the DynamoDB configuration item for the restaurant ordering use case, featuring multiple settings and parameters.

Hopefully, you recognize the attribute names and see how they can be used. The ConversationRelayParams object can control how your application interacts with Twilio. The prompt, dtmfHandlers, and tools attributes connect your application.

It is important to note that while these values are saved in this item, they can be changed in real time as needed to build fully personalized experiences.

For example, in the attributes above you could:

  • Inject order history into the prompt if it exists and have the LLM ask if they want to place the same order that they previously placed.
  • Change the language in the ConversationRelayParams if it is different from the default.
  • Add or remove DTMF handlers for special customers or circumstances.
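A hedged sketch of what that per-caller merge could look like, assuming a configuration item shaped like the one above (the attribute names and the helper itself are illustrative, not part of the repo):

```javascript
// Merge a caller's stored history and preferences into the default use-case
// configuration before the session starts. Attribute names (prompt,
// conversationRelayParams) mirror the DynamoDB item described above but
// are illustrative.
function personalizeSession(defaults, caller) {
  // Deep-copy so the stored defaults are never mutated.
  const session = JSON.parse(JSON.stringify(defaults));

  // Inject order history so the LLM can offer to repeat the last order.
  if (caller.lastOrder) {
    session.prompt +=
      `\nThis caller's last order was: ${caller.lastOrder}. ` +
      "Ask whether they would like the same order again.";
  }

  // Override the default language if the caller prefers another one.
  if (caller.language) {
    session.conversationRelayParams.language = caller.language;
  }

  return session;
}
```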

Make a call!

You should now be able to call your Twilio phone number and start talking with your new Voice AI Application!

Try changing the prompt and ConversationRelay parameters directly from the DynamoDB console. Inspect the session items generated by your calls to see how this application stitches together conversations.

Cleanup

To avoid any undesired costs, you can delete the application CloudFormation Stack from the AWS Console. Select the stack and the DELETE option as shown below:

Screen showing AWS CloudFormation stacks with one stack selected and options to delete and update.

Deploy to production

While you can get this system working pretty quickly, it is not ready for your production environment. This blog post and repo are intended to inspire you and help you start building awesome AI-backed voice applications.

Conclusion

In this post, I introduced Twilio’s ConversationRelay from a technical perspective. ConversationRelay makes it straightforward to deploy AI-backed voice applications by handling session orchestration (including interruption handling), and by managing best of breed third party providers for speech-to-text and text-to-speech while minimizing latency.

I then walked through a reference architecture and application built on AWS to show how enterprises could build their own AI-backed voice applications. I demonstrated a build using OpenAI’s SDK to connect to their LLMs, though your enterprise could choose a different strategy. You can quickly connect a Twilio phone number to this new application and start interacting with your application and LLM.

Twilio is thrilled to launch ConversationRelay, and we are even more excited to see the awesome AI-backed voice applications that our customers build!

Bonus Material

Dan Bartlett has been building web applications since the first dotcom wave. The core principles from those days remain the same but these days you can build cooler things faster. He can be reached at dbartlett [at] twilio.com.