Build a Voice AI Application With Twilio ConversationRelay and BentoML

January 08, 2025
Written by
Sherlock Xu
Contributor
Opinions expressed by Twilio contributors are their own

Building a voice AI application traditionally requires combining multiple components, including speech-to-text (STT), text-to-speech (TTS), real-time streaming, and large language models (LLMs). Managing these elements while maintaining low latency and delivering high-quality interactions can be a significant challenge.

In this tutorial, I’ll show you how to simplify this process by using Twilio ConversationRelay and BentoML. By the end of this tutorial, you will have a voice AI application that offers:

  • Real-time voice conversations accessible via a phone call to a Twilio number
  • Interruption handling, enabling users to redirect conversations
  • Streaming responses with minimal latency
  • Customizable serving logic with an open-source LLM deployed via BentoML
  • Production-ready deployment in the cloud that can scale with traffic

Prerequisites

Make sure you have the following:

  • A Twilio account with an available phone number. Sign up for free here.
  • A BentoCloud account for model deployment. Sign up for free here.
  • Python 3.11 or later installed on your machine
  • Git for cloning the project repository

Build the application

Let’s break the development process into manageable steps. I’ll start with the architecture, move on to the setup, and then explain how to handle real-time conversations.

Understand the architecture

ConversationRelay simplifies voice AI development by managing critical features like STT, TTS, real-time streaming, and interruption handling. It eliminates much of the complexity while providing scalability.

With ConversationRelay handling the voice interaction layer, the remaining challenge is implementing and deploying the AI application. This is where BentoML comes in - an open-source framework that simplifies model serving, application packaging, and production deployment.

When building the voice AI application, BentoML helps handle two crucial aspects in particular:

  • Automatically handle the hosting and deployment of a WebSocket server, providing seamless real-time communication between Twilio and the LLM. This saves developers from the complexity of setting up and maintaining WebSocket infrastructure themselves.
  • Allow developers to easily switch between different LLMs for testing or deploying updates in production. This can be done without requiring changes to the existing infrastructure, enabling rapid experimentation.

Here’s how ConversationRelay and BentoML work together in this project:

Project Architecture
The architecture of the voice AI application
The architecture of the voice AI application

Workflow:

  • A user calls the Twilio phone number configured for the application.
  • Twilio triggers the /chat/start_call endpoint, which generates TwiML instructions.
  • The TwiML response includes <ConversationRelay>, which tells Twilio to establish a WebSocket connection with the BentoML Service at /chat/ws.
  • Twilio converts user voice input to text and streams it to BentoML.
  • The BentoML Service passes the transcribed text to Llama 3.1 70B.
  • The model generates a response, streamed back to Twilio in real time via the BentoML Service.
  • Twilio performs TTS conversion on the response and plays the audio back to the user.

Set up the project

Clone the repository. It contains everything you need for this project.

git clone https://github.com/bentoml/BentoTwilioConversationRelay.git
cd BentoTwilioConversationRelay

Install BentoML and log in to BentoCloud.

pip install bentoml
bentoml cloud login

If you want to run this project locally, install all the dependencies:

pip install -r requirements.txt

Configure the BentoML Service

The application’s core logic is in the service.py file. Let’s walk through its main components.

First, configure the constants that control the model's behavior. Feel free to adjust them as needed.

MAX_TOKENS = 8192
MAX_NEW_TOKENS = 2048
SYSTEM_PROMPT = """You are a helpful, respectful and honest assistant. You can hear and speak. You are chatting with a user over voice. Your voice and personality should be warm and engaging, with a lively and playful tone, full of charm and energy. The content of your responses should be conversational, nonjudgmental, and friendly.
Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."""
MODEL_ID = "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4"

Here I am using Llama 3.1 70B. Note that BentoML makes it straightforward to switch between AI models, whether for testing different configurations or deploying updates in production. This does not require you to overhaul your deployment infrastructure.

Then, create a FastAPI application as an entry point for managing call requests and WebSocket connections. In this project, BentoML converts your model and custom code into two API endpoints:

  • /chat/start_call: An HTTP endpoint for initiating Twilio voice calls. It is triggered when someone calls the Twilio number. It gets the BentoCloud Deployment URL environment variable and generates a TwiML response that uses <ConversationRelay> to establish a WebSocket connection.
  • /chat/ws: A WebSocket endpoint for real-time bidirectional communication, supporting voice streaming. You will configure it later within the BentoML Service.
from fastapi import FastAPI, WebSocket
app = FastAPI()
@app.post("/start_call")
async def start_call():
    print("POST TwiML")
    service_url = os.environ.get("BENTOCLOUD_DEPLOYMENT_URL")
    assert(service_url)
    if service_url.startswith("http"):
        from urllib.parse import urlparse
        service_url = urlparse(service_url).netloc
    tmpl = """<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <ConversationRelay url="wss://{service_url}/chat/ws" welcomeGreeting="Hi! I'm Jane from Bento M L. Just chat with me!!"></ConversationRelay>
  </Connect>
</Response>
    """
    return HTMLResponse(content=tmpl.format(service_url=service_url), media_type="application/xml")

Next, define the core BentoML Service.

import bentoml
@bentoml.service(
    traffic={"timeout": 360},
    resources={
        "gpu": 1,                       # Number of GPUs
        "gpu_type": "nvidia-a100-80gb", # GPU type on BentoCloud
    },
)
@bentoml.mount_asgi_app(app, path="/chat")
class TwilioChatBot:

Use the @bentoml.service decorator to mark a Python class (TwilioChatBot) as a BentoML Service. You can configure parameters like timeout and GPU resources to use in the cloud. For Llama-3.1-70B-Instruct-AWQ-INT4, I recommend you select an NVIDIA A100 80GB GPU.

Additionally, use the @bentoml.mount_asgi_app decorator to mount the FastAPI application at the /chat path. It allows the /start_call endpoint to be served together with the BentoML Service.

Within the Service, define your business logic and add custom APIs.

from bentoml.models import HuggingFaceModel
class TwilioChatBot:
    # Load the model from the class scope
    # Return the downloaded model path as a string
    model_ref = HuggingFaceModel(MODEL_ID)
    def __init__(self):
        from transformers import AutoTokenizer
        from vllm import AsyncEngineArgs, AsyncLLMEngine
        # Use vLLM as the inference backend
        engine_args = AsyncEngineArgs(
            model=self.model_ref,
            max_model_len=MAX_TOKENS,
            enable_prefix_caching=True,
        )
        self.engine = AsyncLLMEngine.from_engine_args(engine_args)
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_ref)

First, load Llama 3.1 70B from Hugging Face by instantiating a HuggingFaceModel class from bentoml.models. This API provides an efficient mechanism for loading AI models to accelerate model deployment. The model is downloaded during image building rather than Service startup. It is cached and will be mounted directly into containers, thus reducing cold start time and improving scaling performance.

Next, define the WebSocket endpoint /ws. It handles real-time communication between Twilio and the LLM via BentoML.

class TwilioChatBot:
    ...
    @app.websocket("/ws")
    async def websocket_endpoint(self, websocket: WebSocket):
        await websocket.accept()
        # Queue for managing input data
        queue = asyncio.queues.Queue()
            ...
        # Get replies from LLM and updating the message history
        # This function could be canceled by user interruption
        async def llm_request(message):
            ...
        # Read input
        async def read_from_socket(websocket: WebSocket):
            async for data in websocket.iter_json():
                queue.put_nowait(data)
        # Fetch data from queue, launching llm tasks and
        # canceling them when interrupted
        async def get_data_and_process():
            ...
        await asyncio.gather(read_from_socket(websocket), get_data_and_process())

See the complete code for more details.

Deploy the application to BentoCloud

BentoCloud is an AI Inference platform that provides optimized infrastructure to build and scale AI applications. It allows you to keep AI deployments in your own VPC for better data control.

After building the Service with BentoML, deploy it to BentoCloud via bentoml deploy. Use the -n flag to specify the name.

bentoml deploy . -n chatbot-app-demo

Wait for it to be ready. Once deployed, navigate to the Deployment details page and copy the Endpoint URL.

BentoCloud interface showing deployment details for chatbot-app-demo in the Deployments tab.

Voice AI application deployed on BentoCloud

Configure Twilio

Go to your Twilio Console and select the phone number you want to use. Under Voice Configuration, select Webhook for A call comes in and paste your BentoCloud endpoint URL + /chat/start_call in the URL field:

Screenshot of Twilio Voice Configuration settings showing routing and webhook URL options.
Twilio voice configuration
Twilio voice configuration

Test the application

Now, make a call to your Twilio phone number to test your voice AI application. You’ll hear a greeting and can begin interacting with your AI in real time. The assistant supports dynamic interruptions, allowing you to redirect the conversation at any moment.

Check out the demo video:

Conclusion

In this tutorial, you built a production-ready voice AI application using ConversationRelay and BentoML. The combination handles all the complexities of voice processing while giving you full control over your LLM implementation and deployment.

Want to take it further? All the source code of this project is available here. You can easily customize it with your own prompts, RAG, or tool use workflow.

Sherlock Xu is a Content Strategist at BentoML. An open-source enthusiast and active technical blogger, Sherlock is keen on creating clear and engaging content about AI technologies. You can follow him on LinkedIn, X, and GitHub.