Build an AI Voice Assistant with Twilio Voice, OpenAI’s Realtime API, and Python

October 01, 2024
Written by Paul Kamp (Twilion)

Today, our friends at OpenAI launched their awesome Realtime API. Exposing the multimodal capabilities of their GPT-4o model, the Realtime API enables direct Speech to Speech, or S2S, functionality.

S2S models promise to improve latency, partially by avoiding separate speech-to-text (STT) and text-to-speech (TTS) steps. That means we can build applications that offer fluid AI conversations that feel just like human interaction – and we’re thrilled to provide one in this launch integration in collaboration with OpenAI.

In this tutorial, I’ll show you how to build an AI voice assistant using Twilio Voice and the OpenAI Realtime API, powered by Python and the FastAPI web framework. We’ll set up a Twilio Media Stream server to receive audio from a phone call, process it using the OpenAI Realtime API, and then send the AI’s audio response back to Twilio and on to the caller. Once you build it, you’ll be able to talk to your assistant, ask it for facts and jokes, and whatever else you can imagine!

Let’s build it.

OpenAI is rolling out Realtime API Access incrementally. Please watch their site for updates.

This app is also available as a prebuilt application on Code Exchange. You can find it here.

Prerequisites

To follow along with this tutorial, you will first need:

  • Python 3.9+. (I used version 3.9.13 to build the tutorial)
  • A Twilio account. If you don’t have one, you can sign up for a free trial here.
  • A Twilio number with Voice capabilities. Here are instructions to purchase a phone number.
  • An OpenAI account and an OpenAI API Key. You can sign up here.
  • OpenAI Realtime API access. Check here for more information.
  • (Optional) ngrok or another tunneling solution to expose your local server to the internet for testing. Download ngrok here.

Ensure you have the above ready before moving forward – then, let’s go…

Set up the Realtime API speech-to-speech Python project

In these next steps, I’ll walk through setting up our project, installing the dependencies we’ll need, and writing the server code. I’ll go step by step, and try to explain the interesting parts.

(Alternatively, you can find our repository here.)

Step 1: Initialize the project

First, let's set up a new Python project and create a virtual environment so we don’t clutter up things on your development machine. On your command line, enter the following:

mkdir speech-assistant-openai-realtime-api-python
cd speech-assistant-openai-realtime-api-python
python3 -m venv venv
source venv/bin/activate
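
Those activation commands assume macOS or Linux. If you’re on Windows, assuming the same venv directory, run this in the Command Prompt instead (or venv\Scripts\Activate.ps1 in PowerShell):

venv\Scripts\activate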

Step 2: Install dependencies

Next, we need to install the required dependencies for the project. Run this command – I’ll explain in a second:

pip install fastapi uvicorn python-dotenv twilio websockets

We’ll need the websockets library to open the WebSocket connection to OpenAI (FastAPI handles the Twilio-facing WebSocket for us), python-dotenv to read our environment variables, and twilio to structure our instructions to Twilio.

fastapi is the Python web framework I built this tutorial with – other popular choices in the Python community are Flask, Django, and Pyramid.

We’ll use uvicorn as our server. It’s a lightweight ASGI server that’s great for asynchronous applications – as I think you’ll agree after testing this!
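
If you’d like a reproducible setup, you could also capture these dependencies in a requirements.txt – here’s a minimal, unpinned sketch (run pip freeze to pin the exact versions pip just installed):

fastapi
uvicorn
python-dotenv
twilio
websockets

Then anyone can install them all at once with pip install -r requirements.txt.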

Step 3: Create the project files

Now we’ll create a file named main.py for our main code and server logic, and a .env file to store our OpenAI API Key. (You can learn more about this method in our Python Environment Variables post).

Step 3.1: Create the main.py File

Run this command:

touch main.py

Step 3.2: Create the .env File

First, create the .env file:

touch .env

Then, using your text editor, open the file and add your OpenAI Realtime API key:

OPENAI_API_KEY=your_openai_api_key_here

(Of course, please swap your key in where I wrote your_openai_api_key_here!)

Step 4: Write the Server Code

You’ve got your scaffolding ready now.

We'll build up the server code in multiple steps. Each step will include the relevant code, then I’ll do my best to provide a brief explanation of the trickier parts of the code.

Step 4.1: Import dependencies and load environment variables

At the top of the main.py file, we import the required modules and then set up and load the environment variables from our .env file.

Paste the following code at the top of your main.py:

import os
import json
import base64
import asyncio
import websockets
from fastapi import FastAPI, WebSocket, Request
from fastapi.responses import HTMLResponse, JSONResponse
from fastapi.websockets import WebSocketDisconnect
from twilio.twiml.voice_response import VoiceResponse, Connect, Say, Stream
from dotenv import load_dotenv

load_dotenv()

# Configuration
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')  # requires OpenAI Realtime API access
PORT = int(os.getenv('PORT', 5050))

Step 4.2: Define constants and initialize FastAPI

Next, we define constants for the system message, the AI response voice, and events to log. We also initialize the FastAPI app.

Here's what you should paste next in your file:

SYSTEM_MESSAGE = (
    "You are a helpful and bubbly AI assistant who loves to chat about "
    "anything the user is interested in and is prepared to offer them facts. "
    "You have a penchant for dad jokes, owl jokes, and rickrolling – subtly. "
    "Always stay positive, but work in a joke when appropriate."
)
VOICE = 'alloy'
LOG_EVENT_TYPES = [
    'response.content.done', 'rate_limits.updated', 'response.done',
    'input_audio_buffer.committed', 'input_audio_buffer.speech_stopped',
    'input_audio_buffer.speech_started', 'session.created'
]

app = FastAPI()

if not OPENAI_API_KEY:
    raise ValueError('Missing the OpenAI API key. Please set it in the .env file.')

Here, the SYSTEM_MESSAGE configures the behavior and personality of the AI. Feel free to mix it up using your own instructions!

The VOICE constant controls the AI’s voice for responses. At launch, you can choose alloy (like I have here), echo, or shimmer.

Finally, LOG_EVENT_TYPES determines which events from the OpenAI API we want to log. See OpenAI’s Realtime API documentation for more details.

We also initialize a FastAPI application instance and check for the presence of the OpenAI API key.

Step 4.3: Define Routes for Incoming Calls and the Root Endpoint

Next, we define two routes: a root route at / to check whether the server is running (we won’t use it in the final demo, but it’s a handy sign of life during testing), and an /incoming-call route that handles incoming calls and returns TwiML instructions to Twilio.

Paste this into main.py:

@app.get("/", response_class=JSONResponse)
async def index_page():
    return {"message": "Twilio Media Stream Server is running!"}

@app.api_route("/incoming-call", methods=["GET", "POST"])
async def handle_incoming_call(request: Request):
    """Handle incoming call and return TwiML response to connect to Media Stream."""
    response = VoiceResponse()
    # <Say> punctuation to improve text-to-speech flow
    response.say("Please wait while we connect your call to the A. I. voice assistant, powered by Twilio and the Open-A.I. Realtime API")
    response.pause(length=1)
    response.say("O.K. you can start talking!")
    host = request.url.hostname
    connect = Connect()
    connect.stream(url=f'wss://{host}/media-stream')
    response.append(connect)
    return HTMLResponse(content=str(response), media_type="application/xml")

The /incoming-call route handles incoming calls from Twilio, responding with TwiML instructions, a special dialect of XML that lets Twilio know how to handle our call. We’re using the Twilio Python Helper library here to make the code simpler.

This particular TwiML response instructs the caller to wait, then tells Twilio to connect to our /media-stream WebSocket endpoint. Feel free to play with how it works.
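
For reference, here’s roughly the TwiML the helper library renders for this route – your host will differ, and the exact attribute layout may vary slightly by library version:

<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Say>Please wait while we connect your call to the A. I. voice assistant, powered by Twilio and the Open-A.I. Realtime API</Say>
    <Pause length="1"/>
    <Say>O.K. you can start talking!</Say>
    <Connect>
        <Stream url="wss://your-host/media-stream" />
    </Connect>
</Response>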

Step 4.4: Handle WebSocket connections for Twilio Media Streams and OpenAI

In the next bit of code, we will set up the WebSocket route for Media Streams and connect to both the Twilio and OpenAI WebSockets. This code is long, so I'll explain some interesting things we’re doing right after the block.

Paste this code below the route definitions:

@app.websocket("/media-stream")
async def handle_media_stream(websocket: WebSocket):
    """Handle WebSocket connections between Twilio and OpenAI."""
    print("Client connected")
    await websocket.accept()
    async with websockets.connect(
        'wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-10-01',
        extra_headers={
            "Authorization": f"Bearer {OPENAI_API_KEY}",
            "OpenAI-Beta": "realtime=v1"
        }
    ) as openai_ws:
        await send_session_update(openai_ws)
        stream_sid = None
        async def receive_from_twilio():
            """Receive audio data from Twilio and send it to the OpenAI Realtime API."""
            nonlocal stream_sid
            try:
                async for message in websocket.iter_text():
                    data = json.loads(message)
                    if data['event'] == 'media' and openai_ws.open:
                        audio_append = {
                            "type": "input_audio_buffer.append",
                            "audio": data['media']['payload']
                        }
                        await openai_ws.send(json.dumps(audio_append))
                    elif data['event'] == 'start':
                        stream_sid = data['start']['streamSid']
                        print(f"Incoming stream has started {stream_sid}")
            except WebSocketDisconnect:
                print("Client disconnected.")
                if openai_ws.open:
                    await openai_ws.close()
        async def send_to_twilio():
            """Receive events from the OpenAI Realtime API, send audio back to Twilio."""
            nonlocal stream_sid
            try:
                async for openai_message in openai_ws:
                    response = json.loads(openai_message)
                    if response['type'] in LOG_EVENT_TYPES:
                        print(f"Received event: {response['type']}", response)
                    if response['type'] == 'session.updated':
                        print("Session updated successfully:", response)
                    if response['type'] == 'response.audio.delta' and response.get('delta'):
                        # Audio from OpenAI
                        try:
                            audio_payload = base64.b64encode(base64.b64decode(response['delta'])).decode('utf-8')
                            audio_delta = {
                                "event": "media",
                                "streamSid": stream_sid,
                                "media": {
                                    "payload": audio_payload
                                }
                            }
                            await websocket.send_json(audio_delta)
                        except Exception as e:
                            print(f"Error processing audio data: {e}")
            except Exception as e:
                print(f"Error in send_to_twilio: {e}")
        await asyncio.gather(receive_from_twilio(), send_to_twilio())

The /media-stream websocket endpoint will handle the connection from Twilio (during the phone call). After that, we do some work to proxy audio between the two websockets.

Connect to the OpenAI Realtime API

We establish a WebSocket connection to the OpenAI Realtime API:

  • websockets.connect(...): this code connects to the OpenAI Realtime API using the provided endpoint and headers, which include the OpenAI API key (and beta flag - see their documentation for more).
  • send_session_update(openai_ws): This sends the initial session update configuration to OpenAI after establishing the connection. It’s where we pass some of the constants defined in the section above – but I’ll explain in the section below.

Proxy audio between Twilio and OpenAI

The receive_from_twilio coroutine listens for audio data from Twilio, processes it, and sends it to OpenAI. Its counterpart send_to_twilio listens for response.audio.delta events from OpenAI and sends them back to Twilio (logging other event types – the ones you control in the LOG_EVENT_TYPES constant – to the command line).
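
To make that proxying concrete, here are abridged sketches of the JSON frames flowing in each direction (field subsets only, with placeholder values – see Twilio’s Media Streams docs and OpenAI’s Realtime API docs for the full schemas). From Twilio to our server, a start event arrives, followed by a stream of media frames:

{"event": "start", "start": {"streamSid": "MZ...", "mediaFormat": {"encoding": "audio/x-mulaw", "sampleRate": 8000, "channels": 1}}}
{"event": "media", "media": {"payload": "<base64 G.711 u-law audio>"}, "streamSid": "MZ..."}

From our server to OpenAI, receive_from_twilio forwards each payload as:

{"type": "input_audio_buffer.append", "audio": "<base64 payload from the media frame above>"}

Coming back, send_to_twilio picks out response.audio.delta events and wraps each delta in a Twilio media frame keyed to the same streamSid, exactly as in the code above.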

Step 4.5: Send Session Update to OpenAI

Finally, we define the function to send a session update to the OpenAI WebSocket. (This is what we called in the section above.)

Paste this at the end of your main.py:

async def send_session_update(openai_ws):
    """Send session update to OpenAI WebSocket."""
    session_update = {
        "type": "session.update",
        "session": {
            "turn_detection": {"type": "server_vad"},
            "input_audio_format": "g711_ulaw",
            "output_audio_format": "g711_ulaw",
            "voice": VOICE,
            "instructions": SYSTEM_MESSAGE,
            "modalities": ["text", "audio"],
            "temperature": 0.8,
        }
    }
    print('Sending session update:', json.dumps(session_update))
    await openai_ws.send(json.dumps(session_update))

This function sends the initial configuration for the OpenAI Realtime API session. I’m only showing you a few possible settings (see more here). Here’s what’s happening:

  • Turn Detection: Enables server-side Voice Activity Detection (VAD), which controls how the AI knows when to respond.
  • Audio Formats: Specifies input and output audio formats. G.711 ulaw is supported by Twilio.
  • Voice and Instructions: Sets the AI's voice and behavioral instructions. You can change the SYSTEM_MESSAGE in the constant section.
  • Modalities: Enables text and audio response capabilities.
  • Temperature: Influences the balance between predictability and creativity in generated text. Lower temperatures prefer more deterministic outputs, while higher temperatures foster diversity, innovation, and more varied responses.
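
If you want to experiment with turn detection, server_vad accepts a few tuning fields. Here’s a hypothetical variation you could drop into the session dictionary above – the field names follow OpenAI’s beta documentation at the time of writing, so double-check them against the current docs before relying on this:

"turn_detection": {
    "type": "server_vad",
    "threshold": 0.6,             # how strong the speech signal must be to count as talking
    "prefix_padding_ms": 300,     # audio to include before detected speech begins
    "silence_duration_ms": 700    # silence to wait before ending the caller's turn
},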

Step 4.6: Prepare the server

Finally, we add the server's entry point to start the FastAPI server and listen on the specified port. Paste this at the end of main.py:

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=PORT)

Step 5: Run the server

If you followed along properly, it’s time! Run the server with:

uvicorn main:app --host 0.0.0.0 --port 5050

If everything is set up correctly, you should see a message similar to mine:

INFO:     Started server process [6143]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:5050 (Press CTRL+C to quit)

We’re getting really close now! Just a few more steps and you can place a phone call.

Finish your setup

Step 6: Use ngrok to expose your local server

Twilio needs instructions on how to handle incoming calls – that’s the TwiML we discussed above – but first it needs a public URL where it can reach your server to fetch that TwiML!

I’ll provide instructions for ngrok in this post. You can find other reverse proxy and tunneling options, and some notes on using them, here.

Download and install ngrok if you haven’t yet, then run the following command. If you have changed the port from 5050, be sure to also update it here:

ngrok http 5050

Here’s roughly how mine looked after running the command – a sketch of typical ngrok console output, with the Forwarding URL you’ll need on the last line (yours will differ):
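
Session Status                online
Account                       your-account (Plan: Free)
Version                       3.x.x
Web Interface                 http://127.0.0.1:4040
Forwarding                    https://ad745c4093d9.ngrok.app -> http://localhost:5050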

Step 7: Configure Twilio

We’re so close now my fingers are heating up. It’s time to work on the Twilio side.

Open the Twilio Console, then find your Voice-enabled number.

Under Voice & Fax on that screen, set the A CALL COMES IN webhook to your ngrok Forwarding URL (https://ad745c4093d9.ngrok.app in my case), appending /incoming-call. For example, I entered https://ad745c4093d9.ngrok.app/incoming-call.

Okay, hit Save. We’re ready!

Test your setup!

Make sure your ngrok session is still running and your server is up. Now, make a call to your Twilio number using a cell phone or landline.

The server should handle the call, deliver the introductory messages we added, and then connect the OpenAI Realtime API with the Twilio Media Stream WebSocket. Start talking – you should hear the AI's response in real-time! Have a great chat.

Session lengths are limited to 15 minutes during the OpenAI Realtime API beta.

Common issues and troubleshooting

If your setup isn’t working (but your server is still running), check these points first:

  • Is ngrok running? Ensure the current Forwarding URL appears in the Voice Configuration under A Call Comes In – on the free plan, the URL changes every time you restart ngrok.
  • Are there Twilio errors? You can debug Twilio errors in a few ways - there’s more info in this article.
  • Is there something in your server logs? Ensure that your server is running without errors – a quick sanity check is sketched just below this list.
  • Is your code calling OpenAI correctly? The events we log via LOG_EVENT_TYPES (such as session.created) should appear in your console shortly after a call connects.
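
For that server sanity check, hit the root route we defined in Step 4.3 while the server is running:

curl http://localhost:5050/

You should get back {"message":"Twilio Media Stream Server is running!"}. If that works but calls still fail, the problem is more likely in your Twilio webhook configuration or the OpenAI connection.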

Conclusion

And there you have it – you just successfully built an interactive AI voice application using Twilio Voice, Media Streams, and the OpenAI Realtime API in Python. You now have a low-latency, interactive voice assistant you can talk to anytime. You’re ready to add your business logic and guardrails, productize, and then scale this solution – and we can’t wait to see you do it.

Happy building!


Paul Kamp is the Technical Editor-in-Chief of the Twilio Blog. He had a lot of fun talking to OpenAI’s Realtime API while building this tutorial – and even more fun letting his daughters talk. You can reach him at pkamp [at] twilio.com.