Transcribe audio messages with Twilio for WhatsApp and OpenAI Speech to Text

May 01, 2023
Written by
Néstor Campos
Contributor
Opinions expressed by Twilio contributors are their own

Not so long ago, you could have a conversation using your phone by either sending an SMS or making a phone call. Both have their benefits and drawbacks. These days, most messaging applications also let you send voice messages, which combine benefits of both. With voice messages, you can have an asynchronous conversation like SMS but still hear the inflections and emotions like a phone call.

Depending on the messaging application and the region, voice messaging is quite popular, and you can take advantage of this in your application when building Twilio SMS and WhatsApp applications. In this tutorial, you'll learn how to receive audio messages from WhatsApp and transcribe the audio using OpenAI Speech to Text.

You'll be using WhatsApp in this tutorial, but the code also works when audio messages are sent over MMS.

Prerequisites

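You will need the following for your development environment: the .NET SDK, a Twilio account, an OpenAI account, ngrok, FFmpeg, and a phone with WhatsApp installed.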

You can find the source code of this tutorial in this GitHub repository.

What is OpenAI Speech to Text?

OpenAI Speech to Text is the API provided by OpenAI to transform audio into text. It supports transcription in different languages as well as translation (for now, only into English). It accepts audio in various formats (such as MP3 and MP4) with a maximum file size of 25 MB.
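Because of the size limit, it can help to check a file before uploading it. The following is a minimal sketch and not part of the original project: the helper name CanUploadToSpeechToText is hypothetical, "25 MB" is assumed to mean 25 * 1024 * 1024 bytes, and the list of formats mirrors the ones this tutorial treats as supported later on.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Assumption: "25 MB" is read as 25 * 1024 * 1024 bytes.
const long maxUploadBytes = 25L * 1024 * 1024;

// Formats this tutorial treats as supported by the transcription API.
var supportedFormats = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
{
    "mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"
};

// Hypothetical helper: true when the file name and size look acceptable for upload.
bool CanUploadToSpeechToText(string fileName, long sizeInBytes)
{
    var extension = Path.GetExtension(fileName).TrimStart('.');
    return supportedFormats.Contains(extension)
        && sizeInBytes > 0
        && sizeInBytes <= maxUploadBytes;
}

Console.WriteLine(CanUploadToSpeechToText("voice.mp3", 1_000_000));   // True
Console.WriteLine(CanUploadToSpeechToText("voice.ogg", 1_000_000));   // False: ogg is not in the list
Console.WriteLine(CanUploadToSpeechToText("voice.wav", 30_000_000));  // False: larger than the limit
```

A check like this only looks at the file name and size. It does not inspect the audio data itself, so the API can still reject a file that passes.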

Create and set up the .NET Project

Open a shell and create an empty ASP.NET Core web project using the .NET CLI:

dotnet new web -o TwilioWhatsAppOpenAI
cd TwilioWhatsAppOpenAI

Install the Twilio SDK and the Twilio helper library for ASP.NET Core, which will help you send and receive WhatsApp messages:

dotnet add package Twilio
dotnet add package Twilio.AspNet.Core

Receive audio messages

Update the Program.cs file with the following code:

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddHttpClient();
builder.Services.AddControllers();

var app = builder.Build();

app.MapControllers();

app.Run();

Next, you will create the controller where you will process each incoming message. Create a file MessageController.cs and add the following code:

using Microsoft.AspNetCore.Mvc;
using Twilio.AspNet.Core;
using Twilio.TwiML;

namespace TwilioWhatsAppOpenAI;

[Route("[controller]")]
public class MessageController : TwilioController
{
    private readonly HttpClient httpClient;

    public MessageController(HttpClient httpClient)
    {
        this.httpClient = httpClient;
    }

    [HttpPost]
    public async Task<IActionResult> Index(CancellationToken ct)
    {
        var response = new MessagingResponse();
        var form = await Request.ReadFormAsync(ct);
        var numMedia = int.Parse(form["NumMedia"].ToString());

        if (numMedia == 0)
        {
            response.Message("Please send an audio file.");
            return TwiML(response);
        }

        if (numMedia > 1)
        {
            response.Message("You can only send one audio file at a time.");
            return TwiML(response);
        }

        var mediaUrl = form["MediaUrl0"].ToString();
        var contentType = form["MediaContentType0"].ToString();
        if (!contentType.StartsWith("audio/"))
        {
            response.Message("You can only send audio files.");
            return TwiML(response);
        }

        await DownloadAudioFile(mediaUrl, contentType, ct);
        response.Message("Audio file was received");
        return TwiML(response);
    }

    private async Task DownloadAudioFile(string mediaUrl, string contentType, CancellationToken ct)
    {
        // If you enable Basic Auth on your Twilio SMS Media, then use Basic Auth on your HTTP request 
        // where username and password are Account SID and Auth Token, or API Key SID and API Key Secret.
        var fileResponse = await httpClient.GetAsync(mediaUrl, ct);
        await using var audioFileStream = await fileResponse.Content.ReadAsStreamAsync(ct);

        var format = contentType.Substring(6); // remove 'audio/' prefix
        var fileName = Path.ChangeExtension(Path.GetFileName(mediaUrl),format); 
        await using var localFileStream = System.IO.File.Open(fileName, FileMode.CreateNew);
        await audioFileStream.CopyToAsync(localFileStream, ct);
    }
}

The Index method accepts the HTTP request sent by Twilio when a message comes in. Twilio submits the webhook data as an HTTP form, so the action reads the form and extracts the relevant fields for retrieving the attached media, if any.

The action will only accept a single audio file. In any other case, an error message is sent in response using Messaging TwiML.
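The validation rules described above can also be written as a standalone function, which makes them easier to test in isolation. This is a minimal sketch under stated assumptions: ValidateMedia is a hypothetical helper, a plain dictionary stands in for the webhook form, and int.TryParse is used so that a missing or malformed NumMedia field produces an error message instead of an exception.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: returns an error message for the sender, or null when
// the webhook fields describe exactly one audio attachment.
string? ValidateMedia(IReadOnlyDictionary<string, string> fields)
{
    // int.TryParse avoids an exception when NumMedia is missing or malformed.
    fields.TryGetValue("NumMedia", out var rawNumMedia);
    if (!int.TryParse(rawNumMedia, out var numMedia) || numMedia <= 0)
        return "Please send an audio file.";

    if (numMedia > 1)
        return "You can only send one audio file at a time.";

    fields.TryGetValue("MediaContentType0", out var contentType);
    if (contentType is null || !contentType.StartsWith("audio/", StringComparison.OrdinalIgnoreCase))
        return "You can only send audio files.";

    return null;
}

// A dictionary standing in for the form that Twilio posts to the webhook.
var form = new Dictionary<string, string>
{
    ["NumMedia"] = "1",
    ["MediaContentType0"] = "audio/ogg"
};
Console.WriteLine(ValidateMedia(form) ?? "OK"); // OK
```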

Twilio doesn't actually pass the media file via the webhook request. Instead, it passes the URL where Twilio stored the media file, and the action sends an HTTP request to download the file and store it to disk.

By default, Twilio will not require any authentication to download the message media. You can follow these steps to enable Basic Authentication on message media. If you do, you'll need to update the code to include Basic Authentication.
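If you do enable Basic Authentication, the download request needs an Authorization header. The following is a minimal sketch rather than the tutorial's own code: CreateMediaRequest is a hypothetical helper, and the URL and credentials are placeholders.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;

// Hypothetical helper: builds a GET request for a media URL with a Basic
// Authorization header made from an Account SID and Auth Token
// (or an API Key SID and API Key Secret).
HttpRequestMessage CreateMediaRequest(string mediaUrl, string username, string password)
{
    var credentials = Convert.ToBase64String(Encoding.UTF8.GetBytes($"{username}:{password}"));
    var message = new HttpRequestMessage(HttpMethod.Get, mediaUrl);
    message.Headers.Authorization = new AuthenticationHeaderValue("Basic", credentials);
    return message;
}

// Placeholder values for illustration only; load real credentials from configuration.
var mediaRequest = CreateMediaRequest("https://example.com/media/ME123", "ACxxxxxxxx", "your_auth_token");
Console.WriteLine(mediaRequest.Headers.Authorization?.Scheme); // Basic
```

In the controller, you would then call httpClient.SendAsync(mediaRequest, ct) in place of httpClient.GetAsync(mediaUrl, ct).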

After storing the audio file on disk, the action will respond with a success message using TwiML.

Now, run your project and continue with the next steps while the project is running:

dotnet run

Set up the Twilio Sandbox for WhatsApp

To send WhatsApp messages through your Twilio account, you need to create a WhatsApp Sender. For testing and developing locally, however, you can use the Twilio Sandbox for WhatsApp, as you will in this tutorial.

To get to the WhatsApp sandbox, click "Messaging" in the left-side menu of the Twilio console (if you don't see it, click "Explore Products", which displays the list of available products, including "Messaging"). Then open the "Try it out" submenu and click "Send a WhatsApp message".

Side menu in the Twilio console, highlighting the Messaging > Try it out > "Send a WhatsApp message" menu item.

Next, follow the instructions on the screen: send the pre-defined message to the indicated number through WhatsApp. This allows the Sandbox number to send messages to your own WhatsApp number. If you want to send messages to other numbers, the people who own those numbers will have to complete this same step.

Twilio Sandbox for WhatsApp console for sending test messages, initializing the process with a test message

After that, you will receive a message in response confirming the Sandbox is configured.

Confirmation message on WhatsApp indicating that the number is available to be used in test mode.

Now you are able to send messages to the Sandbox number and receive messages from the Sandbox number.

Make your webhook public with ngrok for testing

Your API needs to be publicly accessible for Twilio to send the message webhook requests to your application. That's why you'll use ngrok to create a secure tunnel between your locally running API and ngrok's public forwarding URL.

Leave your .NET application running and open a separate shell. In the new shell, run ngrok with the following command, specifying the HTTP URL that your application is listening on:

ngrok http http://localhost:<port>

Alternatively, if your application also listens on HTTPS, you can create an ngrok tunnel that forwards requests to the HTTPS URL. For this, you'll first need to create a free ngrok account, and then configure the ngrok auth token.

Copy the Forwarding HTTPS address that ngrok created for you, as you will use it in the Twilio Sandbox for WhatsApp console.

Result of creating an ngrok tunnel in console. The output shows an HTTP and HTTPS Forwarding URL.

In the Twilio console, go to the Twilio Sandbox for WhatsApp page, open the “Sandbox settings” section, and set the “When a message comes in” endpoint to the URL generated by ngrok, including the /Message path.

The Sandbox settings tab, on the Twilio Sandbox for WhatsApp console. The Sandbox configuration form has two text boxes. A text box "When a message comes in" filled out with the ngrok forwarding URL with the /Message path, and a text box "Status callback URL" which is left empty.

Every time you stop and start an ngrok tunnel, ngrok will generate a new Forwarding URL for you. This means you'll need to update the Sandbox Configuration form with the new Forwarding URL whenever it changes.

Test the project

To test, open your conversation with the Sandbox number and send an audio message using WhatsApp by pressing and holding the microphone button and speaking your message.

Audio message sent to the Twilio Sandbox using WhatsApp.

In a few seconds, you will see the message confirming that the audio was received by the endpoint.

WhatsApp conversation where an audio message was sent, and the response says "Audio file was received".

Convert unsupported audio formats using FFmpeg

OpenAI's transcription API does not support all audio formats. This is a problem for WhatsApp in particular, which sends audio recordings as OGG files, a format that OpenAI does not support at the time of writing. To work around this, you'll use FFmpeg and the FFMpegCore library to convert the audio from unsupported formats to the supported WAV format.

First, make sure you have installed FFmpeg on your machine and that it is in the PATH environment variable. Then, leave ngrok running and stop the running ASP.NET Core application by pressing Ctrl + C. Finally, add the FFMpegCore NuGet package:

dotnet add package FFMpegCore
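If you are unsure whether FFmpeg is reachable through the PATH environment variable, you can check from .NET before relying on it. This is a minimal sketch, not part of the original project: FindOnPath is a hypothetical helper that looks for the executable in each PATH directory.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper: returns the full path of an executable found on PATH, or null.
string? FindOnPath(string executableName)
{
    var pathVariable = Environment.GetEnvironmentVariable("PATH") ?? string.Empty;

    // Windows executables carry an .exe extension; other platforms usually have none.
    var candidates = OperatingSystem.IsWindows()
        ? new[] { executableName + ".exe", executableName }
        : new[] { executableName };

    return pathVariable
        .Split(Path.PathSeparator, StringSplitOptions.RemoveEmptyEntries)
        .SelectMany(directory => candidates.Select(name => Path.Combine(directory, name)))
        .FirstOrDefault(File.Exists);
}

var ffmpegPath = FindOnPath("ffmpeg");
Console.WriteLine(ffmpegPath is null
    ? "FFmpeg was not found on the PATH."
    : $"FFmpeg found at {ffmpegPath}");
```

Running a check like this once at application startup surfaces a missing FFmpeg installation earlier than waiting for the first conversion to fail.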

Now, update the using directives at the top of MessageController.cs so that they include the FFMpegCore namespaces:

using Microsoft.AspNetCore.Mvc;
using FFMpegCore;
using FFMpegCore.Pipes;
using Twilio.AspNet.Core;
using Twilio.TwiML;

Then, replace the DownloadAudioFile method with the one below, and add the rest of the code after the DownloadAudioFile method:

private async Task DownloadAudioFile(string mediaUrl, string contentType, CancellationToken ct)
{
    var (audioStream, format) = await GetAudioStream(mediaUrl, contentType, ct);
    await using (audioStream)
    {
        var fileName = Path.ChangeExtension(Path.GetFileName(mediaUrl),format); 
        await using var localFileStream = System.IO.File.Open(fileName, FileMode.CreateNew);
        await audioStream.CopyToAsync(localFileStream, ct);
    }
}

private static readonly HashSet<string> SupportedContentTypes = new()
{
    "mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"
};

private async Task<(Stream audioStream, string format)> GetAudioStream(
    string mediaUrl, 
    string contentType, 
    CancellationToken ct
)
{
    // If you enable Basic Auth on your Twilio SMS Media, then use Basic Auth on your HTTP request 
    // where username and password are Account SID and Auth Token, or API Key SID and API Key Secret.
    var fileResponse = await httpClient.GetAsync(mediaUrl, ct);
    var audioFileStream = await fileResponse.Content.ReadAsStreamAsync(ct);

    var format = contentType.Substring(6);
    if (SupportedContentTypes.Contains(format))
    {
        return (audioFileStream, format);
    }

    await using (audioFileStream)
    {
        var wavAudioStream = new MemoryStream();
        await ConvertMediaUsingFfmpeg(
            input: audioFileStream, inputFormat: format,
            output: wavAudioStream, outputFormat: "wav"
        );
        wavAudioStream.Seek(0, SeekOrigin.Begin);
        return (wavAudioStream, "wav");
    }
}

private async Task ConvertMediaUsingFfmpeg(Stream input, string inputFormat, Stream output, string outputFormat)
{
    await FFMpegArguments
        .FromPipeInput(new StreamPipeSource(input), options => options
            .ForceFormat(inputFormat))
        .OutputToPipe(new StreamPipeSink(output), options => options
            .ForceFormat(outputFormat))
        .ProcessAsynchronously();
}

This code will download the file just like before, but if the format is not in the SupportedContentTypes set, the audio is converted to WAV format using FFmpeg before it is stored on disk.
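The code above derives the format by removing the first six characters of the content type, which assumes the value is exactly "audio/" followed by the format. If a content type ever arrives with parameters, such as "audio/ogg; codecs=opus", a more defensive approach is to parse it first. The following is a minimal sketch; GetAudioFormat is a hypothetical helper, and the parameterized content type is an illustrative input rather than something this tutorial observed from Twilio.

```csharp
using System;
using System.Net.Http.Headers;

// Hypothetical helper: extracts the subtype ("ogg") from a content type such as
// "audio/ogg" or "audio/ogg; codecs=opus". Returns null when it is not audio.
string? GetAudioFormat(string contentType)
{
    if (!MediaTypeHeaderValue.TryParse(contentType, out var parsed) || parsed.MediaType is null)
        return null;

    const string prefix = "audio/";
    return parsed.MediaType.StartsWith(prefix, StringComparison.OrdinalIgnoreCase)
        ? parsed.MediaType.Substring(prefix.Length).ToLowerInvariant()
        : null;
}

Console.WriteLine(GetAudioFormat("audio/ogg"));                   // ogg
Console.WriteLine(GetAudioFormat("audio/ogg; codecs=opus"));      // ogg
Console.WriteLine(GetAudioFormat("image/png") ?? "(not audio)");  // (not audio)
```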

Feel free to verify the new code by starting the application again and sending another audio file.

Transcribe audio with OpenAI

Create an OpenAI API key

You need to generate an API key with an OpenAI account to use the Speech to Text service. To do this, log in to your account, open the account options (on the right side), and click "View API keys".

OpenAI home page, with different examples available, documentation and account options.

On the displayed page, click on the "Create new secret key" button, which will display a modal with the secret key. You will not see this secret again, so make sure you copy it somewhere safe, as you'll need it in the next section. API keys are secret, so make sure to keep them private, don't share them with others, and don't check them into source control.

Page displayed with the secret key to use in the API, with the option to copy it to the clipboard.

Install an OpenAI library

To start using the OpenAI API, you must first add the secret key to the project using user secrets. To do this, run the following commands in the root directory of the project:

dotnet user-secrets init
dotnet user-secrets set "OpenAIServiceOptions:ApiKey" "<OpenAI Secret Key>"

Replace <OpenAI Secret Key> with the secret key copied previously.

At the time of writing, OpenAI doesn't have an official library for .NET, but there are several community libraries that make it easier to integrate with OpenAI's APIs. In this tutorial, you'll be using the Betalgo.OpenAI library.

Install the library by adding it as a NuGet package using the .NET CLI:

dotnet add package Betalgo.OpenAI

Then, add the OpenAI service to ASP.NET Core's dependency injection container, by editing the Program.cs file:

using OpenAI.GPT3.Extensions;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddHttpClient();
builder.Services.AddControllers();
builder.Services.AddOpenAIService();

Now that you have installed and configured the OpenAI library, you are going to pass the audio data from Twilio to OpenAI's transcription API. The transcription API will return the text from the audio, which you'll send back to the user via WhatsApp.

First, import the following namespaces for the OpenAI library that will be necessary:

using Microsoft.AspNetCore.Mvc;
using FFMpegCore;
using FFMpegCore.Pipes;
using Twilio.AspNet.Core;
using Twilio.TwiML;
using OpenAI.GPT3.Interfaces;
using OpenAI.GPT3.ObjectModels;
using OpenAI.GPT3.ObjectModels.RequestModels;
using OpenAI.GPT3.ObjectModels.ResponseModels;

Next, update the constructor for the MessageController to receive the OpenAI service:

public class MessageController : TwilioController
{
    private readonly HttpClient httpClient;
    private readonly IOpenAIService openAIService;

    public MessageController(HttpClient httpClient, IOpenAIService openAIService)
    {
        this.httpClient = httpClient;
        this.openAIService = openAIService;
    }

Previously, the application would download the audio file from Twilio's API and then store it to disk. However, now that you'll upload the audio data to OpenAI's API, you can pass the audio data through directly without storing it to disk first.

Delete the DownloadAudioFile method and add the TranscribeAudio method:

private async Task<string> TranscribeAudio(Stream audioStream, string format)
{
    AudioCreateTranscriptionRequest audioRequest = new()
    {
        Model = Models.WhisperV1,
        FileStream = audioStream,
        FileName = $"sample.{format}"
    };

    AudioCreateTranscriptionResponse audioResponse = await openAIService.Audio.CreateTranscription(audioRequest);
    if (audioResponse.Successful) return audioResponse.Text;

    throw new Exception(string.Format(
        "Error occurred transcribing audio using OpenAI Whisper API. Code {0}: {1}",
        audioResponse.Error?.Code,
        audioResponse.Error?.Message
    ));
}

The TranscribeAudio method selects the AI model to use and sends the audio stream through the Betalgo.OpenAI library, which forwards it to OpenAI. If OpenAI succeeds in transcribing, the transcription is returned. Otherwise, an exception is thrown with the error message from OpenAI's API.
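Since TranscribeAudio throws when the API reports an error, an unhandled exception would leave the sender without a reply. One way to handle this is to catch the exception and answer with a fallback message. This is a minimal sketch under stated assumptions: TranscribeOrFallback and FailingTranscription are hypothetical names, and the failing delegate only simulates an API error.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper: runs a transcription delegate and returns a fallback
// message instead of letting the exception escape.
async Task<string> TranscribeOrFallback(Func<Task<string>> transcribe, string fallbackMessage)
{
    try
    {
        return await transcribe();
    }
    catch (Exception exception)
    {
        // In a real application, log through ILogger instead of the console.
        Console.Error.WriteLine($"Transcription failed: {exception.Message}");
        return fallbackMessage;
    }
}

// Stand-in for TranscribeAudio that always fails, to show the fallback path.
Task<string> FailingTranscription() =>
    Task.FromException<string>(new Exception("simulated API error"));

var reply = await TranscribeOrFallback(FailingTranscription, "Sorry, I could not transcribe that audio.");
Console.WriteLine(reply); // Sorry, I could not transcribe that audio.
```

In the controller, the delegate would be () => TranscribeAudio(audioStream, format), and the returned string would go into response.Message.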

Finally, update the Index action so that it calls the GetAudioStream method to retrieve the audio, and then passes the audio stream to the TranscribeAudio method, and finally responds with the transcription as a TwiML message:

public async Task<IActionResult> Index(CancellationToken ct)
{
    var response = new MessagingResponse();
    var form = await Request.ReadFormAsync(ct);
    var numMedia = int.Parse(form["NumMedia"].ToString());

    if (numMedia == 0)
    {
        response.Message("Please send an audio file.");
        return TwiML(response);
    }

    if (numMedia > 1)
    {
        response.Message("You can only send one audio file at a time.");
        return TwiML(response);
    }

    var mediaUrl = form["MediaUrl0"].ToString();
    var contentType = form["MediaContentType0"].ToString();
    if (!contentType.StartsWith("audio/"))
    {
        response.Message("You can only send audio files.");
        return TwiML(response);
    }

    var (audioStream, format) = await GetAudioStream(mediaUrl, contentType, ct);
    await using (audioStream)
    {
        var transcription = await TranscribeAudio(audioStream, format);
        response.Message($"Transcription for audio: {transcription}");
        return TwiML(response);
    }
}

Test the project

To test the updated application, run the project again:

dotnet run

Finally, send another voice message using WhatsApp, wait a few seconds, and you should receive the transcription of your audio message as a response:

Result in WhatsApp of sending audio and receiving the transcribed text through OpenAI.

And with that, you have an audio-to-text transcriber using OpenAI through WhatsApp, thanks to Twilio.

Future improvements

This is a great start, but you can improve the solution further.

Additional resources

Send and Receive Media Messages with the Twilio API for WhatsApp

OpenAI Speech-To-Text Quickstart - You can explore basic examples with OpenAI and supported languages.

OpenAI libraries - Libraries created by OpenAI and the community in different languages to use the different services available.

FFmpeg - A complete, cross-platform solution to record, convert and stream audio and video.

Convert audio from one format to another using FFmpeg and .NET - A tutorial walking you through how to install and use FFmpeg from .NET applications using the FFMpegCore library.

Source Code to this tutorial on GitHub - You can find the source code for this project at this GitHub repository. Use it to compare solutions if you run into any issues.

Néstor Campos is a software engineer, tech founder, and Microsoft Most Valuable Professional (MVP), working on different types of projects, especially with Web applications. He has had to receive files from emails automatically through SendGrid Inbound Parse because he did not have access to the original repository of the data in some projects.