Transcribe phone calls in real time with Twilio, Vosk, and ASP.NET Core

May 15, 2023
Written by Mark Heath

In this tutorial, you are going to learn how to implement live transcription of phone calls to text. The phone calls will be routed through a Twilio phone number, and you will use the Media Streams API to stream the incoming audio to a small WebSocket server built using ASP.NET Core. You’ll use the open source NAudio .NET library for audio format conversion, and the transcription will be performed by Vosk, a lightweight open-source speech recognition engine with support for over 20 natural languages. Vosk runs locally on your computer, and offers bindings for several programming languages, including C#.

Prerequisites

To work on this tutorial, you will need:

  • A Twilio account (a free trial account is fine)
  • The .NET SDK
  • A phone you can use to call your Twilio number

This tutorial will also make use of ngrok to expose your locally running server to the internet.

Add a Twilio phone number

Your first task is to add a phone number to your Twilio account. This is the number that will receive the phone calls to transcribe.

Log in to the Twilio Console, select “Phone Numbers”, and then click on the “Buy a number” button to buy a Twilio number. Note that if you have a free account, you will be using your trial credit for this purchase.

On the “Buy a Number” page, select your country and check “Voice” in the “Capabilities” field. If you’d like to request a number from your region, you can enter your area code prefix in the “Number” field.

Twilio buy a number form

Click the “Search” button to see what numbers are available, and then click “Buy” for the number you like from the results. After you confirm your purchase, write down your new phone number and click the “Close” button.

Create the ASP.NET Core project

Your next task is to create an ASP.NET Core project to act as your web server. Create a new folder, then use the dotnet new command from the .NET SDK to create a minimal ASP.NET Core project, and lastly, change directories into the new project folder.

dotnet new web -o TranscribeTwilioVoiceVosk
cd TranscribeTwilioVoiceVosk

The first two lines of the Program.cs file set up a default configuration for your web application.

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

You will need to make a few modifications to enable forwarded headers and WebSockets, so replace those two initial lines in Program.cs with the following code:

using Microsoft.AspNetCore.HttpOverrides;

var builder = WebApplication.CreateBuilder(args);
builder.Services.Configure<ForwardedHeadersOptions>(
    options => options.ForwardedHeaders = ForwardedHeaders.All
);

var app = builder.Build();
app.UseForwardedHeaders();
app.UseWebSockets();

app.MapGet("/", () => "Hello World!");

app.Run();

Create the Voice endpoint

It is possible to configure your Twilio phone number to send webhook requests to your server whenever someone calls it, using the Incoming Voice Call webhook. Twilio expects you to respond to their webhook with instructions in TwiML format, which Twilio will then execute on your behalf. Using the TwiML <Connect> verb and <Stream> noun, you can request that the audio of the call be streamed to you over a WebSocket, and you can also stream audio back if you'd like. In this step, you will create a new endpoint (/voice) designed to receive the Incoming Voice Call webhook.

The dotnet new web command that you used to create the project will have already created a single endpoint, which responds to an HTTP GET request with the phrase “Hello World!”.

app.MapGet("/", () => "Hello World!");

You can delete this if you want, although it might actually be useful to test the ngrok tunnel you’re going to use later, so I left it in my sample code.
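For example, once the ngrok tunnel from later in this tutorial is running, you could check it with a quick request; the subdomain below is a placeholder for your own forwarding URL:

curl https://your-subdomain.ngrok-free.app/

If the tunnel and the server are both up, this prints “Hello World!”.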

Add another endpoint to your Program.cs file, just before the call to app.Run. This one will respond to the HTTP POST method, so use the MapPost method, and the path will be /voice.

app.MapPost("/voice", (HttpRequest request) =>
{
    // … will connect the call audio to a WebSocket here
});

app.Run();

The /voice endpoint will make use of the Twilio SDK and the Twilio helper library for ASP.NET Core, so add references to the Twilio and Twilio.AspNet.Core NuGet packages with the following commands:

dotnet add package Twilio
dotnet add package Twilio.AspNet.Core

Now you can use the VoiceResponse class, along with the TwiML extension method from the helper library, to generate a webhook response that tells Twilio to stream the audio to a WebSocket endpoint, which will be at the /stream path (you’ll create that soon).

First, add the following two using namespace statements at the top of Program.cs:

using Microsoft.AspNetCore.HttpOverrides;
using Twilio.AspNet.Core;
using Twilio.TwiML;

Then, to configure details of where the connected audio should be sent, you can use the Connect class. Here’s the full code for the /voice endpoint.

app.MapPost("/voice", (HttpRequest request) =>
{
    var response = new VoiceResponse();
    var connect = new Twilio.TwiML.Voice.Connect();
    connect.Stream(url: $"wss://{request.Host}/stream");
    response.Append(connect);
    return Results.Extensions.TwiML(response);
});
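For a call routed through the ngrok tunnel you'll set up later, the TwiML this endpoint returns will look something like the following, with the placeholder hostname replaced by your own:

<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="wss://your-subdomain.ngrok-free.app/stream" />
  </Connect>
</Response>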

When you tunnel your web application to the internet using ngrok, request.Host would normally be localhost, but because of the forwarded headers middleware you added, it will be the hostname provided by ngrok, which is what you need in this case.

Create the Stream endpoint

Next, you need to create the /stream endpoint. You do this using the MapGet method, first checking that the incoming request is a WebSocket request. If it is, you can use AcceptWebSocketAsync to accept the WebSocket connection. After accepting it, pass the WebSocket to a method named Echo, which will handle all incoming data, along with context.RequestServices. The context.RequestServices property is a scoped IServiceProvider which you'll use to retrieve services from the DI container.

Here’s the code for the stream endpoint:

app.MapGet("/stream", async context =>
{
    if (context.WebSockets.IsWebSocketRequest)
    {
        using var webSocket = await context.WebSockets.AcceptWebSocketAsync();
        await Echo(webSocket, context.RequestServices);
    }
    else
    {
        context.Response.StatusCode = StatusCodes.Status400BadRequest;
    }
});

Receive data from the WebSocket

In the previous step you passed the WebSocket on to a function called Echo. This is where you will ultimately receive the audio and pass it on to Vosk. But first you need to actually receive and interpret the data coming in from the WebSocket.

The first step is to receive incoming data on the WebSocket with the ReceiveAsync method. In the example below, a buffer size of 4KB is used. Then, so long as the connection is not being closed, you can parse the incoming buffer as JSON into a JsonDocument.

The JSON should have a property called “event”, and this can have the value “connected”, “start”, “media”, or “stop”. For now, in the example shown below, a simple console message is logged for each of those events.
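For reference, a trimmed-down “media” message looks roughly like this - the values are illustrative, and the authoritative schema is in the Twilio Media Streams documentation:

{
  "event": "media",
  "sequenceNumber": "3",
  "media": {
    "track": "inbound",
    "chunk": "2",
    "timestamp": "5",
    "payload": "base64-encoded mu-law audio"
  },
  "streamSid": "MZ..."
}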

Having processed the events, you then call ReceiveAsync again to get the next message from the WebSocket.

async Task Echo(
    WebSocket webSocket,
    IServiceProvider serviceProvider
)
{
    var appLifetime = serviceProvider.GetRequiredService<IHostApplicationLifetime>();

    var buffer = new byte[1024 * 4];
    var receiveResult = await webSocket.ReceiveAsync(new ArraySegment<byte>(buffer), CancellationToken.None);

    while (!receiveResult.CloseStatus.HasValue &&
           !appLifetime.ApplicationStopping.IsCancellationRequested)
    {
        using var jsonDocument = JsonSerializer.Deserialize<JsonDocument>(buffer.AsSpan(0, receiveResult.Count));

        var eventMessage = jsonDocument.RootElement.GetProperty("event").GetString();

        switch (eventMessage)
        {
            case "connected":
                Console.WriteLine("Event: connected");
                break;
            case "start":
                Console.WriteLine("Event: start");
                break;
            case "media":
                Console.WriteLine("Event: media"); // optional - there will be a lot of these messages!
                break;
            case "stop":
                Console.WriteLine("Event: stop");
                break;
        }

        receiveResult = await webSocket.ReceiveAsync(new ArraySegment<byte>(buffer), CancellationToken.None);
    }
}

You should also clean up connections properly, so after the while loop in the Echo method, call CloseAsync if the WebSocket has been disconnected or the host application is stopping.

    if (receiveResult.CloseStatus.HasValue)
    {
        await webSocket.CloseAsync(
            receiveResult.CloseStatus.Value,
            receiveResult.CloseStatusDescription,
            CancellationToken.None);
    }
    else if (appLifetime.ApplicationStopping.IsCancellationRequested)
    {
        await webSocket.CloseAsync(
            WebSocketCloseStatus.EndpointUnavailable,
            "Server shutting down",
            CancellationToken.None);
    }

In order for the code you’ve written so far to compile, you will need to ensure that you’ve added the following using statements at the top of your Program.cs file.

using Microsoft.AspNetCore.HttpOverrides;
using System.Net.WebSockets;
using System.Text.Json;
using Twilio.AspNet.Core;
using Twilio.TwiML;

Handle the events

There are four events that the Echo method you created earlier is listening for. These are “connected”, “start”, “media”, and “stop”. Fortunately, for this demo you don’t really need to do anything in the “connected”, “start”, and “stop” cases - the demo app just logs these events to the console.

If you need to access a stream identifier for the incoming audio stream, you can read it from the “streamSid” property of the JSON, as shown here.

case "start":
    Console.WriteLine("Event: start");
    var streamSid = jsonDocument.RootElement.GetProperty("streamSid").GetString();
    Console.WriteLine($"StreamId: {streamSid}");
    break;

The most important event for transcription is the “media” event. The JSON payload for this event includes the actual audio, encoded as a Base64 string.

This means that in the Echo function, you can update the handler for the “media” event with the following code to extract the audio from the JSON and decode it into a byte array.

case "media":
   var payload = jsonDocument.RootElement.GetProperty("media").GetProperty("payload").GetString();
   byte[] data = Convert.FromBase64String(payload);
   break;

Convert audio format

At this point you have the incoming audio in a byte array, but unfortunately it is not in the correct format to pass directly to Vosk. That’s because the incoming audio from Twilio is encoded at an 8kHz sample rate using the “mu-law” algorithm, which stores each audio “sample” as a single byte. Vosk expects the audio at a 16kHz sample rate, with each sample encoded as a linear PCM 16-bit integer (a “short” in C# terminology).

This means that two conversions are required - first to decode the mu-law samples into 16-bit integers, and then to “resample” the audio from 8kHz to 16kHz. For example, a 160-byte mu-law buffer decodes to 160 16-bit samples (320 bytes), and resampling from 8kHz to 16kHz doubles that to 320 samples (640 bytes) - so every input byte produces four output bytes.

You can perform both conversions using the NAudio open source library (which I happen to be the author of!). Start by adding the NAudio.Core NuGet package with the following command:

dotnet add package NAudio.Core

NAudio offers a wide variety of options for dealing with codecs and resampling, but many of those make use of Windows APIs under the hood. To keep this code cross-platform compatible, you can use the WdlResamplingSampleProvider which is a fully managed implementation.

To create a suitable audio converter that performs both tasks, paste the following code into a new C# file called AudioConverter.cs.

using NAudio.Codecs;
using NAudio.Wave;
using NAudio.Wave.SampleProviders;

class AudioConverter
{
    private readonly BufferedWaveProvider bufferedWaveProvider;
    private readonly IWaveProvider outputProvider;
    private readonly byte[] outputBuffer;
    private readonly WaveBuffer outputWaveBuffer;

    public AudioConverter()
    {
        bufferedWaveProvider = new BufferedWaveProvider(new WaveFormat(8000, 1));
        var resampler = new WdlResamplingSampleProvider(bufferedWaveProvider.ToSampleProvider(), 16000);
        outputProvider = new WaveFloatTo16Provider(resampler.ToWaveProvider());
        outputBuffer = new byte[16000 * 2]; // one second of audio should be plenty
        outputWaveBuffer = new WaveBuffer(outputBuffer);
    }

    public (short[], int) ConvertBuffer(byte[] input)
    {
        var samples = input.Length;

        // decode mu-law at 8kHz into linear 16-bit PCM at 8kHz
        for (int i = 0; i < input.Length; i++)
        {
            outputWaveBuffer.ShortBuffer[i] = MuLawDecoder.MuLawToLinearSample(input[i]);
        }

        bufferedWaveProvider.AddSamples(outputWaveBuffer.ByteBuffer, 0, samples * 2);
        var convertedBytes = samples * 4; // 2 bytes per sample, and twice as many samples at 16kHz
        var outRead = outputProvider.Read(outputBuffer, 0, convertedBytes);
        return (outputWaveBuffer.ShortBuffer, outRead / 2);
    }
}

To explain briefly what’s happening: an audio “pipeline” is created in the constructor to perform the sample rate conversion (which happens using 32-bit floating point audio samples) and then convert the audio back into the 16-bit samples that Vosk is expecting.

Then in the ConvertBuffer method, the incoming audio is decoded from mu-law using the MuLawDecoder, and passed into the start of the pipeline by adding it to the BufferedWaveProvider. Then the audio is read out of the end of the pipeline, with the WaveBuffer NAudio helper class simplifying the task of re-interpreting a byte array as an array of 16-bit integers.

In your Program.cs file, add the AudioConverter as a scoped service to the DI container, after where you added the ForwardedHeadersOptions to the DI container:

builder.Services.Configure<ForwardedHeadersOptions>(
    options => options.ForwardedHeaders = ForwardedHeaders.All
);
builder.Services.AddScoped<AudioConverter>();

Then retrieve an AudioConverter in the Echo method.

async Task Echo(
    WebSocket webSocket,
    IServiceProvider serviceProvider
)
{
    var appLifetime = serviceProvider.GetRequiredService<IHostApplicationLifetime>();
    var audioConverter = serviceProvider.GetRequiredService<AudioConverter>();

Now that you have an audio format converter, convert the audio in the "media" switch case:

case "media":
    var payload = jsonDocument.RootElement.GetProperty("media").GetProperty("payload").GetString();
    byte[] data = Convert.FromBase64String(payload);
    var (converted, convertedLength) = audioConverter.ConvertBuffer(data);
    break;

Configure the Vosk recognizer

Next, you need to add support for using Vosk. First, add a reference to the Vosk NuGet package:

dotnet add package Vosk

You will also need to provide some models for speech recognition. There are several models available to download, and for this demo I selected vosk-model-small-en-us-0.15 due to its small size. The demo project also configures a model for speaker identification, for which I chose vosk-model-spk-0.4.

After downloading the models, you need to unpack them into a folder in the root of your project directory. Here you can see the contents of my model folder (for the main model) and my model-spk folder (which is the speaker identification model). Make sure your folders are named the same.
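If you prefer the command line, you could download and unpack the models with something like the following - the URLs are the standard ones from the Vosk models page, so verify them there if a download fails:

curl -LO https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip
curl -LO https://alphacephei.com/vosk/models/vosk-model-spk-0.4.zip
unzip vosk-model-small-en-us-0.15.zip && mv vosk-model-small-en-us-0.15 model
unzip vosk-model-spk-0.4.zip && mv vosk-model-spk-0.4 model-spk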

The folder structure of the unzipped model and model-spk files

Next, add the following using import for Vosk:

using Microsoft.AspNetCore.HttpOverrides;
using System.Net.WebSockets;
using System.Text.Json;
using Twilio.AspNet.Core;
using Twilio.TwiML;
using Vosk;

Now you are ready to create a VoskRecognizer. You'll add the VoskRecognizer to the DI container as a scoped service, so you can later retrieve a new instance for every WebSocket. Do this in the Program.cs file, just after adding the AudioConverter. The code snippet below creates a Vosk Model and SpkModel (pointing to the two folders you just unpacked), and passes the model into the VoskRecognizer. There is also a parameter that tells the VoskRecognizer to expect audio with a sample rate of 16kHz.

The VoskRecognizer is also shown here being configured with a speaker model, as well as setting the logging level and the maximum number of alternative words. If you would like to adjust any of these settings, you can learn more about what they do on the Vosk documentation site.

builder.Services.AddScoped<AudioConverter>();
builder.Services.AddScoped<VoskRecognizer>(_ =>
{
    // You can set to -1 to disable logging messages
    Vosk.Vosk.SetLogLevel(0);
    var model = new Model("model");
    var spkModel = new SpkModel("model-spk");
    var recognizer = new VoskRecognizer(model, 16000.0f);
    recognizer.SetSpkModel(spkModel);
    recognizer.SetMaxAlternatives(0);
    recognizer.SetWords(true);
    return recognizer;
});

Then retrieve a VoskRecognizer in the Echo method.

async Task Echo(
    WebSocket webSocket,
    IServiceProvider serviceProvider
)
{
    var appLifetime = serviceProvider.GetRequiredService<IHostApplicationLifetime>();
    var audioConverter = serviceProvider.GetRequiredService<AudioConverter>();
    var recognizer = serviceProvider.GetRequiredService<VoskRecognizer>();

Pass Audio into Vosk

Now that the audio is in the correct format, you can pass it to Vosk.

After converting the audio, make a call to AcceptWaveform on the VoskRecognizer you retrieved earlier. This will return true when Vosk has finished recognizing an entire phrase, and false if it only has a “partial” result so far. The following code will access the result or partial result, deserialize the JSON string to a JsonDocument, and access the part of the JSON containing the recognized text.

var (converted, convertedLength) = audioConverter.ConvertBuffer(data);
if (recognizer.AcceptWaveform(converted, convertedLength))
{
    var json = recognizer.Result();
    var jsonDoc = JsonSerializer.Deserialize<JsonDocument>(json);
    Console.WriteLine(jsonDoc.RootElement.GetProperty("text").GetString());
}
else
{
    var json = recognizer.PartialResult();
    var jsonDoc = JsonSerializer.Deserialize<JsonDocument>(json);
    //Console.WriteLine(jsonDoc.RootElement.GetProperty("partial").GetString());
}
break;

Note that partial results only contain the recognized string so far, and arrive very frequently, which is why the Console.WriteLine for the partial result is commented out. It may still be useful if you want to live-update a display showing the phrase as it is being recognized.

The full result contains additional useful information such as the timing of each word and the confidence that it has in the transcription of each word. In the example above, the “text” property is written to the console as it contains the full text of the recognized phrase.
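To give an idea of the difference: a partial result is just a single property, such as {"partial": "the quick brown"}, while a full result (with SetWords enabled) includes per-word details, roughly in this shape - the values here are illustrative:

{
  "result": [
    { "conf": 1.0, "start": 0.5, "end": 0.9, "word": "hello" },
    { "conf": 0.97, "start": 0.9, "end": 1.4, "word": "world" }
  ],
  "text": "hello world"
}

When a speaker model is configured, the result also contains a speaker vector (“spk”) that can be used to help distinguish speakers.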

After writing all this code, your Program.cs file should look something like:

using Microsoft.AspNetCore.HttpOverrides;
using System.Net.WebSockets;
using System.Text.Json;
using Twilio.AspNet.Core;
using Twilio.TwiML;
using Vosk;

var builder = WebApplication.CreateBuilder(args);
builder.Services.Configure<ForwardedHeadersOptions>(
    options => options.ForwardedHeaders = ForwardedHeaders.All
);
builder.Services.AddScoped<AudioConverter>();
builder.Services.AddScoped<VoskRecognizer>(_ =>
{
    // You can set to -1 to disable logging messages
    Vosk.Vosk.SetLogLevel(0);
    var model = new Model("model");
    var spkModel = new SpkModel("model-spk");
    var recognizer = new VoskRecognizer(model, 16000.0f);
    recognizer.SetSpkModel(spkModel);
    recognizer.SetMaxAlternatives(0);
    recognizer.SetWords(true);
    return recognizer;
});

var app = builder.Build();
app.UseForwardedHeaders();
app.UseWebSockets();

app.MapGet("/", () => "Hello World!");

app.MapPost("/voice", (HttpRequest request) =>
{
    var response = new VoiceResponse();
    var connect = new Twilio.TwiML.Voice.Connect();
    connect.Stream(url: $"wss://{request.Host}/stream");
    response.Append(connect);
    return Results.Extensions.TwiML(response);
});

app.MapGet("/stream", async (HttpContext context, IHostApplicationLifetime appLifetime) =>
{
    if (context.WebSockets.IsWebSocketRequest)
    {
        using var webSocket = await context.WebSockets.AcceptWebSocketAsync();
        await Echo(webSocket, context.RequestServices);
    }
    else
    {
        context.Response.StatusCode = StatusCodes.Status400BadRequest;
    }
});

async Task Echo(
    WebSocket webSocket,
    IServiceProvider serviceProvider
)
{
    var appLifetime = serviceProvider.GetRequiredService<IHostApplicationLifetime>();
    var audioConverter = serviceProvider.GetRequiredService<AudioConverter>();
    var recognizer = serviceProvider.GetRequiredService<VoskRecognizer>();

    var buffer = new byte[1024 * 4];
    var receiveResult = await webSocket.ReceiveAsync(new ArraySegment<byte>(buffer), CancellationToken.None);

    while (!receiveResult.CloseStatus.HasValue &&
           !appLifetime.ApplicationStopping.IsCancellationRequested)
    {
        using var jsonDocument = JsonSerializer.Deserialize<JsonDocument>(buffer.AsSpan(0, receiveResult.Count));
        var eventMessage = jsonDocument.RootElement.GetProperty("event").GetString();

        switch (eventMessage)
        {
            case "connected":
                Console.WriteLine("Event: connected");
                break;
            case "start":
                Console.WriteLine("Event: start");
                var streamSid = jsonDocument.RootElement.GetProperty("streamSid").GetString();
                Console.WriteLine($"StreamId: {streamSid}");
                break;
            case "media":
                var payload = jsonDocument.RootElement.GetProperty("media").GetProperty("payload").GetString();
                byte[] data = Convert.FromBase64String(payload);
                var (converted, convertedLength) = audioConverter.ConvertBuffer(data);
                if (recognizer.AcceptWaveform(converted, convertedLength))
                {
                    var json = recognizer.Result();
                    var jsonDoc = JsonSerializer.Deserialize<JsonDocument>(json);
                    Console.WriteLine(jsonDoc.RootElement.GetProperty("text").GetString());
                }
                else
                {
                    var json = recognizer.PartialResult();
                    var jsonDoc = JsonSerializer.Deserialize<JsonDocument>(json);
                    //Console.WriteLine(jsonDoc.RootElement.GetProperty("partial").GetString());
                }
                break;
            case "stop":
                Console.WriteLine("Event: stop");
                break;
        }

        receiveResult = await webSocket.ReceiveAsync(new ArraySegment<byte>(buffer), CancellationToken.None);
    }

    if (receiveResult.CloseStatus.HasValue)
    {
        await webSocket.CloseAsync(
            receiveResult.CloseStatus.Value,
            receiveResult.CloseStatusDescription,
            CancellationToken.None);
    }
    else if (appLifetime.ApplicationStopping.IsCancellationRequested)
    {
        await webSocket.CloseAsync(
            WebSocketCloseStatus.EndpointUnavailable,
            "Server shutting down",
            CancellationToken.None);
    }
}

app.Run();

Test with ngrok

At this point, you can run your web server. However, it will only be running locally on your development machine, and so it is not possible for the Twilio webhook to reach it directly. Of course, you could deploy your web app to a host running in the cloud, but for testing purposes you can use a tunnel service like ngrok which exposes an endpoint on the public internet that will proxy through to your local application.

You can install ngrok by following the instructions on their site.

If ngrok asks you for an auth token, you can get one by signing up for a free account. You can then register your auth token with the ngrok config add-authtoken command.
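For example, replacing the placeholder with your own token:

ngrok config add-authtoken YOUR_AUTHTOKEN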

Having installed ngrok, you need to find out which port your local ASP.NET Core web app is listening on. You can find this either by looking in your launchSettings.json file, or simply by observing the port number displayed when you start the application.

Start your application using the .NET CLI and look for the HTTP port number:

dotnet run

In the example shown below, the server is listening on port 5045.

Output of dotnet run command showing which port the server is listening on

To establish an ngrok tunnel, run the following command, replacing 5045 with your port number.

ngrok http 5045

When ngrok starts up, it will generate a URL that you need to copy. In the example shown below, the URL ends with “.ngrok-free.app”.

Output of ngrok showing forwarding URL

Then log in to the Twilio Console, navigate to the number you purchased, and in the “Configure” tab, paste in the ngrok forwarding URL with the suffix /voice. The HTTP method should be set to “HTTP POST”. Then save your changes.

Twilio phone number "configure" tab, showing the webhook URL and HTTP method filled in correctly

Every time you run ngrok you will get a different forwarding URL, so you will need to update this every time you test your application.

Now you are ready to test the whole application end to end. Make a phone call to the Twilio number you purchased. If you are using a Twilio trial account, you will first hear a message and be required to press a key, but then you can start speaking and the audio will be forwarded via ngrok to your locally running web server.

Every time Vosk recognizes a phrase, it will print it out to the console. Even though I selected the smallest Vosk recognition model, the quality of recognition was quite impressive. Below is an example of the output from a test call I made.

Output of the ASP.NET Core web app showing some recognized text

Conclusion

In this tutorial, you’ve learned how the Twilio Media Streams API allows you to send the audio from a phone call through to an ASP.NET Core web application. This lets you do any processing you like with that audio; in this example, you passed it on to the Vosk transcription engine.

The full code for the demo application is available at my transcribe-twilio-vosk GitHub repository, and if you’d like to take this project a bit further, here are a few ideas for you to try out.

  • Remove the need for ngrok, by hosting your web app in the cloud. For example, you might use Azure App Service. Make sure the hosting platform you choose supports WebSockets.
  • Use the NAudio WaveFileWriter to save the incoming audio to a WAV file. This could be done before or after converting from mu-law to PCM (see the sketch after this list).
  • Explore the Twilio Media Streams voice response capability for your application to speak back to the caller. You could even try passing the transcribed text into ChatGPT and speaking the answer back to the caller!
  • Explore using other Vosk models for higher quality recognition or to support additional languages.
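As a starting point for the WAV file idea above, here is a minimal sketch rather than production code: the file name is arbitrary, and in a real application you would create the writer when the “start” event arrives and dispose it on “stop”. It saves the 16kHz PCM audio that is passed to Vosk, using the converted and convertedLength variables from the “media” case.

// create once per stream, for example when the "start" event arrives
var writer = new WaveFileWriter("call-audio.wav", new WaveFormat(16000, 1));

// in the "media" case, after calling audioConverter.ConvertBuffer:
var pcmBytes = new byte[convertedLength * 2];
Buffer.BlockCopy(converted, 0, pcmBytes, 0, pcmBytes.Length);
writer.Write(pcmBytes, 0, pcmBytes.Length);

// when the "stop" event arrives (or the WebSocket closes):
writer.Dispose();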

Finally, I’d like to thank Niels Swimberghe and Miguel Grinberg whose sample code was very useful in building this demo.

Mark Heath is a software architect working for NICE, creating digital evidence management systems in Azure. He is a Microsoft MVP and has created over 20 training courses for Pluralsight. He also enjoys combining his interest in audio and music with programming, and is the creator of the NAudio open source audio library. You can keep up with what Mark is doing via his blog at markheath.net