Integrate OpenAI's ChatGPT with Twilio Programmable Voice and Functions

March 23, 2023
Written by
Reviewed by

T-800 from the movie Terminator 2: Judgement Day explaining that his CPU is a neural net processor is a learning computer and that he learns from humans as he has contact with them while John Connor touches his face in disbelief.

Using ChatGPT to power an interactive voice chatbot is not just a novelty, it can be a way to get useful business intelligence information while reserving dedicated, expensive, and single-threaded human agents for conversations that only humans can help with. These days people talk to, listen to, and collaborate with robots all the time, but you know what's cooler than interacting with one robot? Interacting with three!

In this post, we'll show you how to use Twilio's native speech recognition and Amazon Polly Neural text-to-speech capabilities with ChatGPT to create a voice-activated chatbot, all hosted entirely on Twilio's serverless Functions environment. You'll also use the Call Event API to parse what callers are asking about and view the responses from the bot which allows us to unlock the rich first-party data captured in these interactions and send the data to customer engagement platforms like Segment where you can use it to build customer profiles, understand customer preferences, and create the personalized experiences customers expect.

Want to give this a demo a whirl before diving in? Call 1-989-4OPENAI (467-3624) to test it out!

Robot #1: Decoding the human voice using Speech Recognition

Twilio's speech recognition using the TwiML <Gather> verb is a powerful tool that turns words spoken on a phone call into text. It offers excellent accuracy, low latency, and support for numerous languages and dialects. Historically, Twilio developers have used speech recognition as a way to navigate interactive voice response (IVRs) and other self-service automation workflows, but with the release of new experimental speech models, the only limit is ✨your imagination ✨.

Robot #2: Giving your robot a voice with Amazon Polly Neural Voices

With the TwiML <Say> verb, Twilio provides a text-to-speech (TTS) function that uses Amazon Polly voices which leverage deep learning to synthesize human-like speech. Polly's neural voices offer a more natural and lifelike sound, providing an engaging listening experience for users. With support for multiple languages, a wide range of voices, and SSML support, Twilio text-to-speech allows you to customize your chatbot's voice to match your brand's identity.

Robot #3: OpenAI's ChatGPT Conversational Companion

ChatGPT is an advanced language model developed by OpenAI, capable of generating human-like text based on given input. It can understand context, provide relevant responses, and even engage in creative tasks like writing stories or poems. By leveraging the OpenAI API, developers can integrate this AI directly into their applications, offering users a more interactive and engaging experience.

Secret Sauce: Twilio Functions

How will you get these three robots to talk to each other, and to your callers? By using Twilio Functions. Beyond merely giving you the ability to get a proof of concept up and running without needing to spin up a server of your own, Functions provide auto scaling capabilities, enhanced security, and reduced latency by running your code inside Twilio. Of course, if you've got your own server rattling around someplace you can make a couple small edits to the Javascript and it will run in your node.js environment, no sweat.

Now that you've got the ingredients, let's check out the recipe in two flavors: CLI and GUI.

Robots dancing and Herbie Hancock singing from the music video for the song Rockit.

Pre-req yourself before you wreck yourself

Before diving into the integration process, you'll need the following:

Next Day's Function

First let's get our backend in order. You'll start by creating a new Serverless project. Since you installed the awesome open-source Serverless Toolkit, you can do this in one line by passing the command into your terminal:

twilio serverless:init <project-name>

Replace <project-name> with a name of your liking; I'm naming mine three-robot-rhumba.

Terminal window showing the initialization of a Twilio Serverless project

Go ahead and cd into the directory for your project and let's update the .env file to provide your Twilio auth token as AUTH_TOKEN, and your OpenAI API Key as OPENAI_API_KEY. Your Twilio account SID should be auto populated. Ensure your .env file looks like the following (with the XXXXX placeholders replaced with its respective keys):

ACCOUNT_SID=XXXXX
AUTH_TOKEN=XXXXX
OPENAI_API_KEY=XXXXX

Since Twilio Serverless Functions are just Node.js apps, you can add dependencies using any package manager that writes to package.json; I'm using npm because I'm basic. Navigate back to your terminal and enter the following to install the OpenAI NPM package:

npm install openai@3.3.0

With your environment variables set and your dependencies added you can get down to business. You're going to create two Functions: a /transcribe Function that uses Twilio speech recognition to turn your spoken words into text that ChatGPT can understand using the TwiML <Gather> verb, and a /respond Function that takes the text generated by the speech recognition, sends it over to the OpenAI API, and passes the response to Twilio's Amazon Polly Neural-powered text-to-speech engine using the TwiML <Say> verb.

To create a new Function, open up the functions folder in your project directory and create a JavaScript file. Create the /transcribe and /respond Functions by creating a transcribe.js and respond.js file in the folder.

Lost in Transcription

Now let's open up transcribe.js and add the following code:

exports.handler = function(context, event, callback) {
    // Create a TwiML Voice Response object to build the response
    const twiml = new Twilio.twiml.VoiceResponse();

    // If no previous conversation is present, or if the conversation is empty, start the conversation
    if (!event.request.cookies.convo) {
        // Greet the user with a message using AWS Polly Neural voice
        twiml.say({
                voice: 'Polly.Joanna-Neural',
            },
            "Hey! I'm Joanna, a chatbot created using Twilio and ChatGPT. What would you like to talk about today?"
        );
    }

    // Listen to the user's speech and pass the input to the /respond Function
    twiml.gather({
        speechTimeout: 'auto', // Automatically determine the end of user speech
        speechModel: 'experimental_conversations', // Use the conversation-based speech recognition model
        input: 'speech', // Specify speech as the input type
        action: '/respond', // Send the collected input to /respond 
    });

    // Create a Twilio Response object
    const response = new Twilio.Response();

    // Set the response content type to XML (TwiML)
    response.appendHeader('Content-Type', 'application/xml');

    // Set the response body to the generated TwiML
    response.setBody(twiml.toString());

    // If no conversation cookie is present, set an empty conversation cookie
    if (!event.request.cookies.convo) {
        response.setCookie('convo', '', ['Path=/']); 
    }

    // Return the response to Twilio
    return callback(null, response);
};

If you're new to Functions let me walk you through what's happening here. The /transcribe Function creates a TwiML generating voice response based on Twilio's Node.js helper library, starts a conversation if none exists, listens for user input, and passes that input along with the conversation history to the /respond endpoint for further processing.

In line 6 the application checks to see if a cookie called convo exists. If it doesn't, or if it does but it's empty, you can take that to mean the conversation hasn't started yet, so you will kick off an initial greeting using the TwiML <Say> verb.

Next, the twiml.gather method is used to capture user input. The parameters for gather are:

  • speechTimeout: 'auto': Automatically determines when the user has stopped speaking, can be set to a positive integer, but auto is best for this use case
  • speechModel: "experimental_conversations": Uses a speech recognition model optimized for conversational use cases
  • input: 'speech': Sets the input type to speech and ignores any keypresses (DTMF)
  • action: '/respond': Pass the user's speech input along with the conversation history to the /respond endpoint

Now, you need to create a way to create the convo cookie so the /respond Function has a place to store the conversation history which will get passed between the OpenAI API and our Polly Neural Voices, and that means the app needs need to initialize a Twilio.Response(); object, and does so on line 25.

You can't pass both the Twilio.twiml.VoiceResponse(); and the Twilio.Response(); back to the handler, so you'll need to use the response you just created to append a header to our requests and set the TwiML you generated via <Gather> to the body on lines 28 and 31 respectively.

Once that's done you can set the cookie using response.setCookie(); on line 35 before you pass the response back to the handler for execution by our Serverless infrastructure on line 39. Go ahead and save this file and close it.

Call and Response

Next, let's open up respond.js and add the following code:

// Import the OpenAI module
const { OpenAI } = require("openai");

// Define the main function for handling requests
exports.handler = async function(context, event, callback) {
  // Set up the OpenAI API with the API key from your environment variables
   const openai = new OpenAI({
    apiKey: context.OPENAI_API_KEY, 
  });


  // Set up the Twilio VoiceResponse object to generate the TwiML
  const twiml = new Twilio.twiml.VoiceResponse();

  // Initiate the Twilio Response object to handle updating the cookie with the chat history
  const response = new Twilio.Response();

  // Parse the cookie value if it exists
  const cookieValue = event.request.cookies.convo;
  const cookieData = cookieValue ? JSON.parse(decodeURIComponent(cookieValue)) : null;

  // Get the user's voice input from the event
  let voiceInput = event.SpeechResult;

  // Create a conversation object to store the dialog and the user's input to the conversation history
  const conversation = cookieData?.conversation || [];
  conversation.push({role: 'user', content: voiceInput});

  // Get the AI's response based on the conversation history
  const aiResponse = await createChatCompletion(conversation);


  // Add the AI's response to the conversation history
  conversation.push(aiResponse);

  // Limit the conversation history to the last 20 messages; you can increase this if you want but keeping things short for this demonstration improves performance
  while (conversation.length > 20) {
      conversation.shift();
  }

  // Generate some <Say> TwiML using the cleaned up AI response
  twiml.say({
          voice: "Polly.Joanna-Neural",
      },
      aiResponse
  );

  // Redirect to the Function where the <Gather> is capturing the caller's speech
  twiml.redirect({
          method: "POST",
      },
      `/transcribe`
  );

  // Since we're using the response object to handle cookies we can't just pass the TwiML straight back to the callback, we need to set the appropriate header and return the TwiML in the body of the response
  response.appendHeader("Content-Type", "application/xml");
  response.setBody(twiml.toString());

  // Update the conversation history cookie with the response from the OpenAI API
  const newCookieValue = encodeURIComponent(JSON.stringify({
      conversation
  }));
  response.setCookie('convo', newCookieValue, ['Path=/']);

  // Return the response to the handler
  return callback(null, response);

  // Function to generate the AI response based on the conversation history
  async function generateAIResponse(conversation) {
      const messages = formatConversation(conversation);
      return await createChatCompletion(messages);
  }

  // Function to create a chat completion using the OpenAI API
  async function createChatCompletion(messages) {
      try {
        // Define system messages to model the AI
        const systemMessages = [{
                role: "system",
                content: 'You are a creative, funny, friendly and amusing AI assistant named Joanna. Please provide engaging but concise responses.'
            },
            {
                role: "user",
                content: 'We are having a casual conversation over the telephone so please provide engaging but concise responses.'
            },
        ];
        messages = systemMessages.concat(messages);

        const chatCompletion = await openai.chat.completions.create({
            messages: messages,
            model: 'gpt-4',
            temperature: 0.8, // Controls the randomness of the generated responses. Higher values (e.g., 1.0) make the output more random and creative, while lower values (e.g., 0.2) make it more focused and deterministic. You can adjust the temperature based on your desired level of creativity and exploration.
              max_tokens: 100, // You can adjust this number to control the length of the generated responses. Keep in mind that setting max_tokens too low might result in responses that are cut off and don't make sense.
              top_p: 0.9, // Set the top_p value to around 0.9 to keep the generated responses focused on the most probable tokens without completely eliminating creativity. Adjust the value based on the desired level of exploration.
              n: 1, // Specifies the number of completions you want the model to generate. Generating multiple completions will increase the time it takes to receive the responses.
        });

          return chatCompletion.choices[0].message.content;

      } catch (error) {
          console.error("Error during OpenAI API request:", error);
          throw error;
      }
  }
}

As with above, here's a guided tour as to what exactly is going down in this code. It starts by importing the required modules (line 2) and defining the main function to handle requests (line 5).

Lines 7-8 set up the OpenAI API with the API key, while line 11 creates the Twilio Voice Response object that will generate TwiML for turning the ChatGPT responses into speech for callers. Line 14 initiates the Twilio Response object to update the conversation history cookie and set the headers and body so the TwiML is passed in the response.

Lines 17-20 parse the cookie value if it exists, and line 23 retrieves the user's voice input from the SpeechResult event received from the /transcribe Function. Lines 26-27 create a conversation variable to store the dialog and add the user's input to the conversation history.

Line 30 generates the AI's response based on the conversation history, and line 33 cleans the AI response by removing any unnecessary role names (assistant, Joanna, user). Line adds the cleaned AI response to the conversation history.

Lines 39-41 limit the conversation history to the last 10 messages to improve performance and keep the cookie a reasonable size while also giving the chatbot enough context to deliver useful responses. You can increase (or decrease) this if you'd like, but remember the stored history is getting passed to the OpenAI API with every request, so the larger this gets, the more tokens your application is chewing through. Lines 44-48 generate the <Say> TwiML using the cleaned AI response, and lines 51-55 redirect the call to the /transcribe function where the <Gather> is capturing the caller's speech.

As with /transcribe we need to use response to deliver the TwiML and lines 58-59 set the appropriate header and return the TwiML in the body of the response. Lines 62-67 update the conversation history cookie with the response from the OpenAI API, and line 70 returns the response to the handler.

The generateAIResponse function (lines 73-76) formats the conversation and creates a chat completion using the OpenAI API. The createChatCompletion function (lines 81-88) sends a request to the OpenAI API to generate a response using the GPT-3.5-turbo model and specified parameters. If we get a 500 from the OpenAI API we don't want to just drop the conversation on the floor, so we will handle an error from the API with <Say> and a redirecting the conversation back to the /transcribe function on lines 90-107.

It's also possible that the request to the OpenAI just plain old fashioned times out, so a catch has been added between lines 109-131 to handle timeouts gracefully, and redirect to /transcribe to try again.

Finally, the formatConversation function (lines 136-158) formats the conversation history into a format that the OpenAI API can understand by alternating between assistant and user roles.

Now your code is updated, your dependencies set, and your environment variables are configured, you're ready to deploy. With Twilio Serverless, it couldn't be easier… it's just a single command!

twilio serverless:deploy

Once the deployment wraps up you can now use the Functions you created to capture spoken input from a caller, convert it to text which is sent to the ChatGPT API, and play the response back to the caller in the form of AI-generated speech. Three robots, working together, just for you!

Electronic music band Kraftwerk performing live

The Model

The example above uses the gpt-3.5-turbo model from OpenAI. This is a good (and cheap!) model for development purposes and proof-of-concepts, but you may find that other models are better for specific use cases. GPT-4 was just released in limited beta, and even here at Twilio we haven't had a chance to really kick the tires on it yet, but based on the developer livestream published with the announcement, it looks to be a significant upgrade over the 3.5 version that has been blowing people's minds for the past couple months.

One additional consideration, GPT-3 is superseded by 3.5 (and naturally 4) but the models in GPT-3 expose fine tuning capabilities that the more advanced models lack at the time of writing. For example, even though the training data is older, you can get quicker responses by using curie, and can employ things like sentiment analysis and response styles. Choose your own adventure.

Rikki Don't Lose That Number

Now that you have your Function deployed and ready to party you can test it by making a call, but first, you'll need to configure a phone number to use the Functions you just created, and the CLI provides a quick and easy way to do so. Type the following into your terminal to list the phone numbers on your account (we are assuming you followed the pre-req and acquired a number beforehand).

twilio phone-numbers:list

You'll get back a list of the phone number SIDs, phone numbers, and friendly names on your account. You can use either the SID or the full E.164-formatted phone number for your request:

twilio phone-numbers:update <PN SID> –voice-url=<The URL for the /transcribe Function>
Obi-wan Kenobi from the movie Star Wars saying that he's never used a command line interface and R2-D2 responding with sad beeps.

If CLIs don't pass the vibe check, you can do all of the things we describe above directly in Twilio Console. First, in the left hand nav on the Develop tab go to the Functions and Assets section and click on Services. Click Create Service.

Twilio Voice ChatGPT Demo Create Service

Give your Service a name, I'll call mine voice-chatgpt-demo and click Next.

Twilio Voice ChatGPT Demo Name Service

You'll now see the Console view for your Service, with Functions, Assets, Environment Variables, and Dependencies in the left hand nav, and a text editor and console for editing your code and monitoring the logs. First thing you'll want to do is get your environment variables configured, so click on Environment Variables in the bottom right corner.

Twilio Voice ChatGPT Demo Service Config

Your Twilio account SID and auth token are pre-populated as env variables, so all you need to is add your OpenAI API key; the sample code will refer to it as OPENAI_API_KEY so if you're looking for a zero edit copy/paste experience, make sure you name it the same. Click Add when you're done.

Twilio Voice ChatGPT Demo Add Environment Variables

Next, you will need to update our dependencies to include the OpenAI npm module so you can make requests to the OpenAPI API. Click Dependencies, and enter openai in the Module textbox and 3.3.0 in the Version textbox. Don't forget to click Add.

Now you can get cracking with creating the Functions. You're going to create two: /transcribe which will do all the heavy lifting from a speech recognition perspective, and /respond which will pass the transcribed text to the ChatGPT API and read the response to the caller using an Amazon Polly Neural text-to-speech voice.

Click the Add button and select Add Function from the dropdown to create a new Function and name it /transcribe.

Twilio Voice ChatGPT Demo Add Function

Replace the contents of the new function with the code snippet here and hit Save. Your factory fresh Function should look like this when complete:

Twilio Voice ChatGPT Demo Transcribe Function

Next create another Function and call this one /respond. Replace the contents of this new Function with the code snippet here and hit Save again. If you'd like a guided tour of what precisely is going on in these examples, check out the CLI section of this post where we walk through the code in greater detail.

Next click the Deploy button, and your saved Functions will be deployed and can now be used in your incoming phone number configuration. Click the three vertical dots next to /transcribe to Copy URL. We will need this in a second.

Twilio Voice ChatGPT Demo Phone Number Setup

On the Develop tab navigate to the Phone Numbers section, then Manage, and choose Active Numbers. Once you've found a number you want to use, scroll down to the Voice & Fax section and under A call comes in select Function. For Service, select the Function service you just created which we named voice-chatgpt-demo. Then choose ui for Environment and lastly choose /transcribe for the Function Path since this is where your phone call should route to first.

Now give your newly configured phone number a call to test everything out!

No disassemble

Johnny Five from the movie Short Circuit saying "No disassemble" and rolling away distraught

One particularly great thing about this integration is that both the inputs and the responses are available to you as a developer; the speech recognition text from the person in the form of the SpeechResult parameter that gets passed to the /respond Function, and the ChatGPT-derived responses in the form of the <Say> TwiML that gets executed on the calls. This means these conversations aren't business intelligence closed boxes, and even though these Functions are executing in the Twilio Serverless environment, you can still get your hands on the content of the conversation using the Call Events API. Here's how to grab the details using the Twilio CLI:

twilio api:core:calls:events:list --call-sid <The call SID you are interested in> -o json

Using this API you can retrieve the requests, responses, and associated parameters and pump them directly into your internal systems to do things like provide agents a heads-up about what the caller had been asking about before they were connected, or using the data to decorate your customer profiles in a customer data platform like Segment.

Robot Rock

Daft Punk playing bass guitar and drums from the music video for the song Get Lucky

Publications like McKinsey and Forbes are already opining on how generative AI technologies like ChatGPT can be used to solve business problems, so now that you have three robots working for you, what can you actually do with an integration like this? How about a frontline IT desktop support agent? Make ChatGPT search Google so your expensive IT department doesn't have to, and in the event your caller and ChatGPT can't figure it out, connect the call to your agents. Have long hold times for your healthcare specialists? Instead of playing callers smooth jazz, offer them the wisdom of ChatGPT for common non-critical ailments while they wait.

Wrap it up

So there you have it: with the help of Twilio's speech recognition, Twilio Functions, Amazon Polly Neural voices, and OpenAI's API, you've now created your very own interactive voice chatbot. Definitely keep an eye on this space and be on the lookout for advances in conversational AI and chatbot capabilities that you can leverage using Twilio.

Michael Carpenter (aka MC) is a telecom API lifer who has been making phones ring with software since 2001. As a Product Manager for Programmable Voice at Twilio, the Venn Diagram of his interests is the intersection of APIs, SIP, WebRTC, and mobile SDKs. He also knows a lot about Depeche Mode. Hit him up at mc (at) twilio.com or LinkedIn.

A huge debt of gratitude to Dhruv Patel's positively prescient post on How to Call an AI Friend Using GPT-3 with Twilio Voice and Functions. Dhruv also provided a technical review of the code in this post. Dhruv Patel is a Developer on Twilio’s Developer Voices team and you can find him working in a coffee shop with a glass of cold brew, or he can be reached at dhrpatel [at] twilio.com or LinkedIn.