Voice AI: Building Voice Bots with Twilio's ConversationRelay

December 11, 2024

Written by

Hao Wang

Twilion

Reviewed by

Twilion

Voice bots are revolutionizing customer service, virtual assistants, and various other voice-driven applications. Whether you're building a chatbot that answers common customer queries or a personalized assistant for your users, the underlying technology is crucial for delivering seamless, natural interactions. Enter Twilio's ConversationRelay, a tool that provides WebSocket connections, Speech-to-Text (STT), and Text-to-Speech (TTS) with ultra-low latency, enabling you to quickly create high-performance voice bots.

In this post, we'll walk you through how to leverage Twilio's ConversationRelay to build your own voice bot, using minimal code while integrating with OpenAI to generate conversational prompts. You'll learn how to set up the app, customize its behavior, and deploy it to a production environment, all while taking advantage of Twilio's powerful API features.

What is Twilio ConversationRelay?

Twilio's ConversationRelay is a service designed to simplify the process of building voice bots. It handles many of the complex aspects of voice interaction, including converting speech to text and text to speech, with an emphasis on low-latency responses. Instead of manually building all the components for real-time audio processing, ConversationRelay offers a seamless solution that integrates WebSocket connections, allowing voice data to be sent and received in real-time.

Key features of ConversationRelay include:

WebSocket Connections: ConversationRelay streams text between Twilio and your application via a WebSocket, enabling responsive, real-time conversations. You can then choose the LLM to generate conversational responses.
Low Latency: ConversationRelay processes and delivers responses with minimal delay—typically around 1 second—ensuring fast and natural interactions.
Speech-to-Text (STT) and Text-to-Speech (TTS): These built-in capabilities enable you to easily convert spoken input into text and generate spoken output from text, all integrated within the app.

These features make Twilio's ConversationRelay an excellent choice for developers who want to create advanced voice bot applications without the complexity of building each component from scratch. The diagram below shows the high-level architecture and flow of ConversationRelay. You can find more details in this blog post.

Diagram of customer interaction with a virtual agent using WebSocket API and backend integration.

Key Features of the ConversationRelay Sample App

The ConversationRelay Sample App demonstrates how to utilize Twilio’s capabilities alongside Airtable and OpenAI. Here are its standout features:

Low-Latency Responses: Thanks to ConversationRelay’s streaming, the app delivers responses in near real-time, enhancing conversational flow.

Demo of ConversationRelay APIs: The app provides an end-to-end demonstration of how to utilize ConversationRelay’s core APIs.

Customizable Prompts with Airtable: Modify conversation prompts directly in Airtable for fast, code-free updates to your bot's behavior.

Function Tools: Includes prebuilt integrations for common use cases like:

Weather Information: Fetch real-time weather data using OpenWeatherMap.
Order Placement: Simulate order confirmation and send SMS notifications.
Dynamic Language Switching: Change the conversation language during a call.

Set Up the ConversationRelay Sample App

To start building your voice bot, follow these steps to set up your environment.

Prerequisites

Accounts with Twilio, OpenAI, and Airtable.
Node.js and npm installed.
ngrok downloaded and configured for local hosting.
A Fly.io account to deploy your service online.
An OpenWeatherMap account to fetch real-time weather.

First, make sure you have Node.js and npm installed. Then clone the repository, change into the project folder, and install the necessary dependencies with the commands below:

git clone https://github.com/midshipman/owl-shoes
cd owl-shoes
npm install

Configure Environment Variables

Copy the .env.example file to .env and configure your API keys, including Twilio’s and OpenAI’s credentials, to allow the app to access the necessary services.

Configure Airtable by importing the sample Airtable table or creating your own with the same fields. Ensure the table is named "builder" and that your API keys are correctly configured in the .env file.

Use ngrok to expose your local server to the internet. This is necessary for Twilio to send incoming voice data to your local app during development. Do not forget to copy the public URL (e.g., abc123.ngrok.io) into the SERVER variable in your .env file.

ngrok http 3000
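A mistake that is easy to make at this step is pasting the full ngrok URL, scheme included, into SERVER. The snippet below is a minimal sketch of how the hostname could be normalized and turned into the two URLs used later in this post. It assumes SERVER is meant to hold a bare hostname, which is how the TwiML example further down uses it; the helper names normalizeServer and buildUrls are hypothetical and not part of the sample app.

```javascript
// Sketch: turn the SERVER value from .env into the URLs the app needs.
// Assumption: SERVER should hold a bare hostname such as "abc123.ngrok.io".
// normalizeServer and buildUrls are hypothetical helper names.
function normalizeServer(raw) {
  if (typeof raw !== 'string' || raw.trim() === '') {
    throw new Error('SERVER is not set; copy your ngrok hostname into .env');
  }
  // Drop an accidental scheme ("https://", "wss://") and any trailing slashes.
  return raw.trim().replace(/^[a-z]+:\/\//i, '').replace(/\/+$/, '');
}

function buildUrls(raw) {
  const host = normalizeServer(raw);
  return {
    webhook: `https://${host}/incoming`, // voice URL configured on the phone number
    relay: `wss://${host}/sockets`,      // WebSocket URL passed to ConversationRelay
  };
}

console.log(buildUrls('https://abc123.ngrok.io/'));
```

In a real server you would pass process.env.SERVER instead of the literal string.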

Start the server in development mode using nodemon, which will automatically restart the server as you make changes to your code. Type the command npm run dev in your terminal.

Configure Twilio for incoming calls by connecting a phone number using the Twilio Console. You can also use the Twilio CLI:

twilio phone-numbers:update [your-twilio-number] --voice-url=https://your-server.ngrok.io/incoming

You can now place a call to the Twilio number and have a conversation with the voice bot.

Customize the Bot via Airtable

In this sample app, the bot is prompted to act as the Owl Shoes assistant and help callers find the best shoes. You can adjust the bot’s responses by updating records in Airtable. With this no-code option, you can tweak the prompt to make your own bot and also modify the GPT and ConversationRelay parameters.

Edit Prompts: Update the fields in Airtable to modify what the bot says or how it reacts during calls. The last updated record will be loaded.
Use Airtable Forms: Create a form to allow non-technical team members to add or update prompts easily.
Change Parameters: Change language and voice settings used by the voice bot.

You can customize the bot's behavior and user experience using the following fields in the table:

Prompt: Defines the bot's role, tasks, and tone.
User Profile, Orders, Inventory: Personalize interactions by tailoring responses to user-specific details.
Model: Choose the GPT model for generating responses.
Voice: Select the voice used for text-to-speech output.
Language: Set the bot's language, applied to both speech-to-text and text-to-speech functions.
Transcription Provider: Choose the speech-to-text provider (currently supports Google and Deepgram).
SPIChangeSTT: Enable dynamic language changes during a conversation when requested.
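To make the "last updated record" rule concrete, here is a rough sketch of how a list of rows could be reduced to a single settings object. The record shape ({ fields: ... }) and the "Updated" timestamp field are assumptions made for illustration, as are the function names; check the sample table and the app's own Airtable code for the real names.

```javascript
// Sketch: pick the most recently updated record and map it to bot settings.
// Assumptions: records look like { fields: {...} } and carry an ISO timestamp
// in a field called "Updated". latestRecord and toBotConfig are hypothetical.
function latestRecord(records) {
  if (!Array.isArray(records) || records.length === 0) {
    throw new Error('No records found in the "builder" table');
  }
  return records.reduce((latest, record) =>
    Date.parse(record.fields.Updated) > Date.parse(latest.fields.Updated) ? record : latest
  );
}

function toBotConfig(record) {
  const f = record.fields;
  return {
    prompt: f.Prompt,
    model: f.Model,
    voice: f.Voice,
    language: f.Language,
    transcriptionProvider: f['Transcription Provider'],
    spiChangeSTT: Boolean(f.SPIChangeSTT),
  };
}

const sampleRecords = [
  { fields: { Updated: '2024-12-01T10:00:00Z', Prompt: 'Older prompt', Language: 'en-US' } },
  { fields: { Updated: '2024-12-10T10:00:00Z', Prompt: 'Newer prompt', Language: 'en-GB', SPIChangeSTT: true } },
];
console.log(toBotConfig(latestRecord(sampleRecords)));
```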

Add Monitoring & Logging

Track your app’s logs and latency using its built-in monitoring and logging interface:

Access the logs at https://your-server-address/monitor to view conversation data and debug issues.
Logs include details and timestamps for incoming calls, ConversationRelay messages, and GPT responses.

Deploy the app to production with Fly.io

Change the app name in fly.toml to a globally unique value. Then use the following commands to deploy the app with the Fly.io CLI:

fly launch
fly deploy

Update the SERVER variable in .env with the Fly.io hostname assigned to your app. Then import the secrets from your .env file into the deployed app:

fly secrets import < .env

Understand ConversationRelay and Voice Bot Mechanics

You can configure ConversationRelay using TwiML as shown below or dynamically adjust settings during the session using SPI messages over the WebSocket.

An initial TwiML configuration might look like this. The app dynamically retrieves attributes from Airtable whenever an incoming call is received.

<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <ConversationRelay
      url="wss://mywebsocketserver.com/websocket"
      language="en-GB"
      transcriptionProvider="deepgram"
      speechModel="nova2-general"
      ttsProvider="google"
      voice="en-GB-Journey-F"/>
  </Connect>
</Response>
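Since the attributes come from Airtable at call time, the TwiML has to be assembled in code. The following is a minimal sketch of one way to do that with plain string building. It is not the sample app's actual implementation, and escapeXml and buildTwiml are hypothetical names. Escaping attribute values matters here because a prompt-like field or a URL with query parameters could otherwise produce invalid XML.

```javascript
// Escape characters that are not allowed inside an XML attribute value.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Sketch: build the <ConversationRelay> TwiML from a settings object.
// Empty or missing settings are skipped so Twilio's defaults apply.
function buildTwiml(url, settings) {
  const attrs = Object.entries({ url, ...settings })
    .filter(([, value]) => value !== undefined && value !== null && value !== '')
    .map(([name, value]) => `${name}="${escapeXml(value)}"`)
    .join(' ');
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<Response>',
    '  <Connect>',
    `    <ConversationRelay ${attrs}/>`,
    '  </Connect>',
    '</Response>',
  ].join('\n');
}

const twiml = buildTwiml('wss://mywebsocketserver.com/websocket', {
  language: 'en-GB',
  transcriptionProvider: 'deepgram',
  speechModel: 'nova2-general',
  ttsProvider: 'google',
  voice: 'en-GB-Journey-F',
});
console.log(twiml);
```

The Twilio Node.js helper library can also generate TwiML; string building is used here only to keep the sketch free of dependencies.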

Change the transcription and TTS language during the session by sending the SPI message below:

{
  "type": "language",
  "ttsLanguage": "en-US",
  "transcriptionLanguage": "en-US"
}

You can find more information in the ConversationRelay documentation.
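To give a feel for the WebSocket side, here is a rough sketch of parsing messages from ConversationRelay and wrapping LLM output tokens for the way back. Only the "language" message above comes from this post. The other message types and fields used here ("setup", "prompt" with voicePrompt, "interrupt", and outgoing "text" with token and last) are written from our recollection of the ConversationRelay documentation and should be treated as assumptions to verify. The function names are hypothetical.

```javascript
// Sketch: classify an incoming ConversationRelay message.
// Assumption: message types and field names follow the ConversationRelay docs.
function parseIncoming(raw) {
  const message = JSON.parse(raw);
  switch (message.type) {
    case 'setup': // sent once when the call connects
      return { kind: 'setup', callSid: message.callSid };
    case 'prompt': // transcribed caller speech
      return { kind: 'prompt', text: message.voicePrompt };
    case 'interrupt': // caller spoke over the bot
      return { kind: 'interrupt' };
    default:
      return { kind: 'ignored', type: message.type };
  }
}

// Sketch: wrap streamed LLM tokens as outgoing "text" messages,
// flagging the final token so text-to-speech knows the reply is complete.
function toTextMessages(tokens) {
  return tokens.map((token, index) =>
    JSON.stringify({ type: 'text', token, last: index === tokens.length - 1 })
  );
}

console.log(parseIncoming('{"type":"prompt","voicePrompt":"Do you have size nine?"}'));
console.log(toTextMessages(['Yes,', ' we do.']));
```

Sending tokens as they arrive from the LLM, rather than waiting for the full reply, is what keeps the perceived latency low.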

Understand how language switching works

To change the language during a conversation, say a phrase such as "Can you speak in French?" to the bot. The GPT model will recognize the intent and trigger the changeLanguage tool implemented in this app. This tool sends an SPI message (as shown above) to ConversationRelay to switch the language.

Make sure the SPIChangeSTT flag is enabled in Airtable; the flag is provided for debugging purposes. When it is disabled, the voice bot can still respond in French, but the transcription language remains unchanged, so you need to keep speaking English for accurate recognition.
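The effect of the flag can be sketched as follows. This is an illustration under one assumption, not the app's actual changeLanguage code: we assume that with the flag off the tool simply omits transcriptionLanguage from the message, which would leave speech-to-text on the original language. The function name buildLanguageMessage is hypothetical.

```javascript
// Sketch: build the "language" SPI message sent over the WebSocket.
// Assumption: when spiChangeSTT is false, only the TTS language is switched.
function buildLanguageMessage(language, spiChangeSTT) {
  const message = { type: 'language', ttsLanguage: language };
  if (spiChangeSTT) {
    message.transcriptionLanguage = language;
  }
  return JSON.stringify(message);
}

console.log(buildLanguageMessage('fr-FR', true));
console.log(buildLanguageMessage('fr-FR', false));
```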

Note that ConversationRelay assigns a default voice for each language. For instance, if you start with en-US in a male voice and switch to French, the voice may default to a female voice in French. To maintain consistency, specify voices directly in the TwiML configuration as below. Currently, changing the voice via SPI messages is not supported.

The TwiML below ensures consistent use of a male voice.

<ConversationRelay url="wss://${process.env.SERVER}/sockets" voice="en-GB-Journey-D" language="en-GB">
  <Language code="fr-FR" ttsProvider="google" voice="fr-FR-Neural2-B" />
  <Language code="es-ES" ttsProvider="google" voice="es-ES-Neural2-B" />
</ConversationRelay>

Create Effective Prompts for Voice Bots

A well-designed voice bot goes beyond just delivering information—it creates a natural, engaging, and seamless experience for users. Unlike text-based interactions, voice communication relies heavily on tone, brevity, and adaptability to ensure clarity and connection. These style guidelines are essential to make the bot feel more human, build trust, and enhance user satisfaction, especially in dynamic, real-time conversations.

Here are the style guidelines used in this application, serving as a foundation for creating your own unique tone and style.

Keep It Voice-Friendly: Responses should be brief, clear, and conversational—avoid visual elements like lists or symbols.
Use a Warm Tone: Speak in a friendly, relatable manner, using light humor or empathy when appropriate.
Personalize Responses: Leverage user profiles and history for tailored interactions (e.g., referencing past purchases).
Be Flexible: Adapt to the user's pace, respond to interruptions, and rephrase for clarity when needed.
Show Empathy: Acknowledge frustrations and emphasize the user’s value to the brand.
Stay Role-Focused: Stick to your defined role and redirect conversations creatively if asked to do something beyond your scope.
Ensure Smooth Flow: Keep responses natural, role-appropriate, and relevant to maintain a human-like, seamless conversation.

In addition to the general style guidelines, you can set specific rules for GPT that tailor the response format for voice-based conversations and avoid the confusion caused by written conventions. This app uses the response format prompt below. Depending on the language and use case, you can add further rules to refine the response format for voice interactions.

Response Format:

Be Conversational: Use natural, spoken language that’s concise and easy to follow.
Avoid Special Characters: Replace symbols with descriptive words (e.g., "plus" for "+").
Simplify Punctuation: Stick to periods and commas; avoid complex punctuation like semicolons.
Emphasize Verbally: Use repetition or descriptive language instead of formatting (e.g., bold or caps).
List Items Verbally: Use "first," "next", etc., instead of bullet points or numbers.
Spell Out Rare Terms: Expand abbreviations or acronyms unless commonly spoken (e.g., "NASA").
Address Web or Email Properly: Say "dot" for "." and "at" for "@" when referencing URLs or emails.
Handle Numbers Smartly: Spell out one to ten; use numerals for larger values.
Adapt to Errors: If automatic speech recognition (ASR) struggles, guess the intent and respond naturally. When clarification is needed, use colloquial phrases like "pardon" or "didn’t catch that," avoiding technical terms like "transcription error." Never repeat yourself.
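The rules above are enforced through the prompt, and a model can occasionally ignore them. As an optional safety net, which is our suggestion and not something the sample app is known to do, you could clean the text in code before it goes to text-to-speech. The sketch below covers only a few of the rules (special characters, markdown emphasis, semicolons), and the function name toSpeakable is hypothetical.

```javascript
// Symbols that should be spoken as words, per the response format rules.
const SPOKEN_SYMBOLS = [
  [/\+/g, ' plus '],
  [/&/g, ' and '],
  [/@/g, ' at '],
  [/%/g, ' percent '],
];

// Sketch: make LLM output friendlier for text-to-speech.
function toSpeakable(text) {
  let out = text
    .replace(/[*#`]/g, '') // strip markdown emphasis and heading characters
    .replace(/;/g, ',');   // simplify punctuation
  for (const [pattern, word] of SPOKEN_SYMBOLS) {
    out = out.replace(pattern, word);
  }
  return out
    .replace(/\s+/g, ' ')          // collapse the extra spaces added above
    .replace(/\s+([.,!?])/g, '$1') // no space before punctuation
    .trim();
}

console.log(toSpeakable('**Great** choice; that is 20% off + free socks.'));
```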

Add More Functions to Extend Functionality

OpenAI's function calling feature allows the model to interact with external tools, APIs, and services dynamically. By invoking predefined functions, the model can retrieve real-time information, execute complex operations, or interface with external systems. This makes it a powerful platform for creating interactive and context-aware applications.

In this app, each tool is wrapped as a plain function, which makes it convenient to add your own. Functions are modular and reusable, and they extend the app's capabilities by allowing the voice bot to interact with external services or execute specific tasks.

To add more functions, define the new tool in the functions/function-manifest.js file. Then, create a new file under the functions folder with the exact name of the function. You can implement the logic by referencing how the getWeather function is structured and adapting it for your own tool.
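As an illustration, the sketch below shows what a new tool could look like. The manifest entry uses the OpenAI tools format for function calling, but the exact layout of functions/function-manifest.js in the sample app may differ, so compare it against the existing getWeather entry. The getStoreHours tool, its placeholder data, and the callTool dispatcher are all hypothetical.

```javascript
// Manifest entry in the OpenAI function-calling ("tools") format.
// getStoreHours is a hypothetical tool used only for illustration.
const tools = [
  {
    type: 'function',
    function: {
      name: 'getStoreHours',
      description: 'Look up opening hours for an Owl Shoes store.',
      parameters: {
        type: 'object',
        properties: {
          city: { type: 'string', description: 'City where the store is located' },
        },
        required: ['city'],
      },
    },
  },
];

// Implementation; in the app this would live in functions/getStoreHours.js.
const implementations = {
  // Placeholder data: a real tool would query a store database or API.
  getStoreHours: ({ city }) => JSON.stringify({ city, hours: '9am to 6pm' }),
};

// Sketch: run the tool the model asked for and return a string result
// that can be sent back to the model as the tool output.
function callTool(name, argsJson) {
  const declared = tools.some((tool) => tool.function.name === name);
  const fn = implementations[name];
  if (!declared || typeof fn !== 'function') {
    return JSON.stringify({ error: `Unknown tool: ${name}` });
  }
  return fn(JSON.parse(argsJson));
}

console.log(callTool('getStoreHours', '{"city":"Austin"}'));
```

Keeping the function name identical in the manifest and the implementation is what lets a dispatcher like this find the right code.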

Bonus: Add Personalization with Segment

In this app, user profiles and order history are stored in Airtable and remain static. For dynamic user data, you can use Segment to store and retrieve profiles and events, such as order summaries. Utilize the helper functions in segment-service.js: use addUser() to link a new phone number as an ID for a new user profile, and addEvent() to log a new order. Once the events are recorded, you can read them and incorporate them into GPT prompts to personalize the conversation.
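Once events are available, folding them into the prompt can be as simple as the sketch below. The event shape used here (event, timestamp, properties.item) and the "Order Placed" event name are assumptions for illustration; the data returned by the helpers in segment-service.js may look different. The function name summarizeOrders is hypothetical.

```javascript
// Sketch: turn recent order events into a sentence for the GPT prompt.
// Assumption: events look like { event, timestamp, properties: { item } }.
function summarizeOrders(events, limit = 3) {
  const lines = events
    .filter((e) => e.event === 'Order Placed')
    .sort((a, b) => Date.parse(b.timestamp) - Date.parse(a.timestamp)) // newest first
    .slice(0, limit)
    .map((e) => `${e.properties.item} on ${e.timestamp.slice(0, 10)}`);
  return lines.length === 0
    ? 'The caller has no previous orders.'
    : `Recent orders: ${lines.join(', ')}.`;
}

const sampleEvents = [
  { event: 'Order Placed', timestamp: '2024-11-02T09:30:00Z', properties: { item: 'trail runners' } },
  { event: 'Page Viewed', timestamp: '2024-11-20T12:00:00Z', properties: {} },
  { event: 'Order Placed', timestamp: '2024-12-01T15:45:00Z', properties: { item: 'leather boots' } },
];
console.log(summarizeOrders(sampleEvents));
```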

What's next for ConversationRelay applications?

By leveraging Twilio’s ConversationRelay, developers can create voice bots that are responsive, versatile, and easy to customize. The integration with Airtable adds even more flexibility, enabling non-technical team members to update conversation flows and prompts with minimal effort.

Explore the ConversationRelay Sample App, experiment with different integrations, and deploy your bot for real-world use cases. The possibilities are endless—start building your voice-first applications today!

Hao Wang is a Principal Solutions Engineer at Twilio, dedicated to empowering customers to maximize the potential of Twilio’s suite of products. With a strong passion for emerging technologies and voice AI, Hao is always exploring innovative ways to drive impactful solutions.
