Add Token Streaming and Interruption Handling to a Twilio Voice OpenAI Integration

April 15, 2025
Written by Amanda Lange
Reviewed by Paul Kamp, Twilion

ConversationRelay from Twilio allows you to build real-time, human-friendly voice applications for conversations with any AI Large Language Model, or LLM. It opens a WebSocket so you can integrate any AI API you choose with Twilio Voice, allowing for a fluid, event-based interaction and fast two-way connection.

Previously, we looked at a basic Node.js solution, showing how ConversationRelay can help you create a conversation with a friendly AI powered by OpenAI.

We also talked about some weaknesses of that integration. Because OpenAI generates the full text before you hear it spoken aloud, it has no sense of how much of that text you actually heard. If you verbally interrupt the response, the AI has already stored everything it planned to say; its memory does not account for the point at which you cut it off, which can confuse an end user later in the conversation.

We can improve how our AI tracks the voice conversation by adding some additional code. We’ll also add Token Streaming to your application to reduce response latency by letting speech start before the AI finishes generating its reply. Let’s get started now, before we’re interrupted!

Prerequisites

To deploy this tutorial you will need:

  1. Node.js installed on your machine
  2. A Twilio phone number (Sign up for Twilio here)
  3. Your IDE of choice (such as Visual Studio Code)
  4. The ngrok tunneling service (or other tunneling service)
  5. An OpenAI Account to generate an API Key
  6. A phone to place your outgoing call to Twilio

This tutorial assumes you have already set up the Quickstart project in the previous tutorial. If not, you can get that Quickstart code right away by cloning it from the git repository.

Add Token Streaming

But first, what exactly is Token Streaming?

Think of a "token" as the smallest unit of text produced by the AI. Typically this is a word or part of a word, but it can also be a punctuation mark, number, or special character. When you stream tokens, the server returns each token one by one as the AI model generates them, rather than waiting for the AI’s entire response before converting the text to speech.

Streaming allows us to eliminate a lot of latency from our voice AI app. We’re also going to use the returned tokens to track where a caller interrupted the AI, so we can better inform OpenAI how much a user “heard” from the response.
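
For example, with the OpenAI Node.js library, each streamed chunk carries only a small delta of the reply rather than the whole message. Simplified, a single chunk looks roughly like this (only the fields we use later are shown; the values are illustrative):

// One chunk from openai.chat.completions.create({ ..., stream: true })
{
  object: "chat.completion.chunk",
  choices: [
    {
      delta: { content: "Hello" }, // the next piece of text, or empty on some chunks
      finish_reason: null          // set to "stop" on the final chunk
    }
  ]
}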

Let's add token streaming to your server.js file.

Our demo repository has branches showing every step of the process. If you get stuck, follow along with Step Two: Streaming Tokens and Step Three: Conversation Tracking. A future blog post will discuss Tool or Function Calling.

Once you have the previous tutorial prepared, open your existing server.js file and look for the function called aiResponse. That function currently looks like this:

async function aiResponse(messages) {
  let completion = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: messages,
  });
  return completion.choices[0].message.content;
}

You will be changing this function to support streaming. Rename the function to aiResponseStream and replace it with this new block:

async function aiResponseStream(messages, ws) {
  const stream = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: messages,
    stream: true,
  });
  console.log("Received response chunks:");
  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content || "";
    // Send each token
    console.log(content);
    ws.send(JSON.stringify({
      type: "text",
      token: content,
      last: false,
    }));
  }
  // Send the final "last" token when streaming completes
  ws.send(JSON.stringify({
    type: "text",
    token: "",
    last: true,
  }));
  console.log("Assistant response complete.");
}

When we set stream: true, OpenAI sends the response incrementally, one chunk at a time. For each token we get from OpenAI, we send it to ConversationRelay with last: false, signifying that more tokens are coming. When the response is done, we send one final text message with last: true so ConversationRelay knows not to expect any more tokens.
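
As a concrete illustration, a short reply such as "Hi there!" would be relayed to ConversationRelay as a sequence of messages roughly like this (the exact token boundaries depend on the model):

{ "type": "text", "token": "Hi", "last": false }
{ "type": "text", "token": " there", "last": false }
{ "type": "text", "token": "!", "last": false }
{ "type": "text", "token": "", "last": true }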

Next, make sure you call your new function correctly. Look for this block of code in server.js:

const response = await aiResponse(conversation);
conversation.push({ role: "assistant", content: response });
ws.send(
  JSON.stringify({
    type: "text",
    token: response,
    last: true,
  })
);
console.log("Sent response:", response);

Replace all of that with this single call:

aiResponseStream(conversation, ws);

That's it: now the token streaming should be working correctly.

Add conversation tracking and interruption handling

If you test the application now, streaming will be working. This should lower the latency of your application: try asking your app to “give me a long, compound sentence with many commas” and note how much sooner you hear a voice. You might also notice that the output in your developer console looks different. Instead of a long pause followed by one large chunk of text, you'll see the tokens appearing one at a time. There's an example in the screenshot below.

Command line interface showing a text response about readiness to assist.

This is a good first step. But you still have not changed the way the AI reacts to interruptions. The next step is to track the conversation so that the AI has context for where the interruption occurred.

You will be making a few changes to the code to account for the entire conversation.

The first change is to the same function you altered in the previous step. Here is the revised code. The new lines collect each token into an assistantSegments array and, once the response is complete, append the full assistant reply to the conversation stored for the call.

async function aiResponseStream(conversation, ws) {
  const stream = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: conversation,
    stream: true,
  });
  const assistantSegments = [];
  console.log("Received response chunks:");
  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content || "";
    // Send each token
    console.log(content);
    ws.send(JSON.stringify({
      type: "text",
      token: content,
      last: false,
    }));
    assistantSegments.push(content);
  }
  // Send the final "last" token when streaming completes
  ws.send(JSON.stringify({
    type: "text",
    token: "",
    last: true,
  }));
  console.log("Assistant response complete.");
  const sessionData = sessions.get(ws.callSid);
  sessionData.conversation.push({ role: "assistant", content: assistantSegments.join("") });
  console.log("Final accumulated response:", JSON.stringify(assistantSegments.join("")));
}
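
One thing to double-check: this version of the function looks up per-call state in a sessions Map keyed by call SID. If your server.js from the previous tutorial does not already declare one, add it near the top of the file, for example:

// Per-call state, keyed by callSid: { conversation: [...] }
const sessions = new Map();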

This code accumulates the streamed tokens into the conversation. Underneath the above code, erase the existing code for your application and instead add the following. Compared with the previous version, the setup and prompt cases now store and use per-call session data, and there is a new interrupt case.

const fastify = Fastify();
fastify.register(fastifyWs);
fastify.get("/twiml", async (request, reply) => {
  reply.type("text/xml").send(
    `<?xml version="1.0" encoding="UTF-8"?>
    <Response>
      <Connect>
        <ConversationRelay url="${WS_URL}" welcomeGreeting="${WELCOME_GREETING}" />
      </Connect>
    </Response>`
  );
});
fastify.register(async function (fastify) {
  fastify.get("/ws", { websocket: true }, (ws, req) => {
    ws.on("message", async (data) => {
      const message = JSON.parse(data);
      switch (message.type) {
        case "setup":
          const callSid = message.callSid;
          console.log("Setup for call:", callSid);
          ws.callSid = callSid;
          sessions.set(callSid, {conversation: [{ role: "system", content: SYSTEM_PROMPT }]});
          break;
        case "prompt":
          console.log("Processing prompt:", message.voicePrompt);
          const sessionData = sessions.get(ws.callSid);
          sessionData.conversation.push({ role: "user", content: message.voicePrompt });
          aiResponseStream(sessionData.conversation, ws);
          break;
        case "interrupt":
          console.log("Handling interruption; last utterance: ", message.utteranceUntilInterrupt);
          handleInterrupt(ws.callSid, message.utteranceUntilInterrupt);
          break;
        default:
          console.warn("Unknown message type received:", message.type);
          break;
      }
    });
    ws.on("close", () => {
      console.log("WebSocket connection closed");
      sessions.delete(ws.callSid);
    });
  });
});

Finally, add the function that will handle the interruptions.

You might be wondering why you need additional code to handle interruptions when, verbally, interruptions seemed to work fine in the previous version. Remember that ConversationRelay 1) detects verbal interruptions and 2) pauses the text-to-speech readout when it detects one. It does not manage the conversation history in your LLM's context, which is why some additional code is needed.
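
For reference, the interrupt message that ConversationRelay sends over the WebSocket includes the portion of the response the caller actually heard. Based on the fields our handler uses, it looks roughly like this (other fields may also be present, and the value shown is illustrative):

{
  "type": "interrupt",
  "utteranceUntilInterrupt": "Four score and seven years ago our fathers"
}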

Use this code in your server.js, replacing the code at the end of the file:

function handleInterrupt(callSid, utteranceUntilInterrupt) {
  const sessionData = sessions.get(callSid);
  const conversation = sessionData.conversation;
  let updatedConversation = [...conversation];
  const interruptedIndex = updatedConversation.findIndex(
    (message) =>
      message.role === "assistant" &&
      message.content.includes(utteranceUntilInterrupt),
  );
  if (interruptedIndex !== -1) {
    const interruptedMessage = updatedConversation[interruptedIndex];
    const interruptPosition = interruptedMessage.content.indexOf(
      utteranceUntilInterrupt,
    );
    const truncatedContent = interruptedMessage.content.substring(
      0,
      interruptPosition + utteranceUntilInterrupt.length,
    );
    updatedConversation[interruptedIndex] = {
      ...interruptedMessage,
      content: truncatedContent,
    };
    updatedConversation = updatedConversation.filter(
      (message, index) =>
        !(index > interruptedIndex && message.role === "assistant"),
    );
  }
  sessionData.conversation = updatedConversation;
  sessions.set(callSid, sessionData);
}
try {
  fastify.listen({ port: PORT });
  console.log(`Server running at http://localhost:${PORT} and wss://${DOMAIN}/ws`);
} catch (err) {
  fastify.log.error(err);
  process.exit(1);
}

This is the code that handles interruptions more gracefully for your streaming application. The handleInterrupt function is called only when a voice interaction is interrupted. When it runs, it finds the point at which the response was cut off and updates your local record of the conversation (in sessionData.conversation). The next time you pass that conversation back to OpenAI in aiResponseStream(conversation, ws), the AI has context for where the interruption occurred.
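
To make the effect concrete, here is an illustrative before-and-after of the stored conversation, assuming the caller cut the AI off partway through a recitation (the values are made up for the example):

// Before handleInterrupt(callSid, "Four score and seven years ago")
// [
//   { role: "system",    content: SYSTEM_PROMPT },
//   { role: "user",      content: "Recite the Gettysburg Address." },
//   { role: "assistant", content: "Four score and seven years ago our fathers brought forth..." }
// ]

// After handleInterrupt, the assistant message is truncated to what was heard:
// [
//   { role: "system",    content: SYSTEM_PROMPT },
//   { role: "user",      content: "Recite the Gettysburg Address." },
//   { role: "assistant", content: "Four score and seven years ago" }
// ]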

If you got lost making any of these changes, the final version of the server.js file is here.

Testing your AI application

You're now ready to test your application, using the same steps as before. Start by going into your terminal and opening a tunnel with ngrok:

ngrok http 8080

ngrok will provide you with a unique URL for your locally running server. Copy that URL and add it to the .env file using this line:

NGROK_URL="1234abcd.ngrok.app"

Replace the placeholder with the domain from your ngrok URL. Note that you do not include the scheme (the “https://” or “http://”) in the environment variable.
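
If you want to verify that the variable is wired up, the Quickstart server builds its URLs from this value roughly like the following sketch (the exact names may differ slightly in your copy of server.js):

// Reads the ngrok domain from .env and builds the WebSocket URL used in the TwiML
const DOMAIN = process.env.NGROK_URL;
const WS_URL = `wss://${DOMAIN}/ws`;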

Go to the terminal, and, in the folder where your application is located, run your server using:

node server

Go into your Twilio console, and look for the phone number that you registered.

Set the configuration under A call comes in with the Webhook option as shown below.

In the URL space, add your ngrok URL (this time including the “https://”), and follow that up with /twiml for the correct endpoint.

Finally, set the HTTP option on the right to GET.

A screenshot showing Twilio console call service

Save your configurations in the console and dial your registered Twilio number to test. Your AI voice greeting will pick up the line, and now you can begin a conversation with your AI.

During testing, try interrupting the conversation with your AI. Ask a question that is likely to produce a longer answer, such as "Recite the Gettysburg Address" or "What happened in the year 1997?" Cut your AI off mid-sentence with an interruption, then ask it where it left off in the conversation.

Towards more robust voice conversations with AI

Congratulations on building a more robust AI application! The next post will review Tool or Function Calling so your application can interface with additional APIs. This will provide more functionality to your users and make a truly robust AI voice application!

Let's build something amazing!

 

Amanda Lange is a .NET Engineer of Technical Content. She is here to teach how to create great things using C# and .NET programming. She can be reached at amlange [at] twilio.com.