Transcribe phone calls in real-time using C# .NET with AssemblyAI and Twilio

September 19, 2024

In this tutorial, you’ll build an application that transcribes a phone call to text in real-time using C# .NET. When someone calls your Twilio phone number, you will use the Media Streams API to stream the voice call audio to your WebSocket ASP.NET Core server. Your server will pass the voice audio to AssemblyAI's Streaming Speech-to-Text service to get the text back live.

Prerequisites

You'll need these things to follow along:

  • An AssemblyAI account (upgraded, see below)
  • A Twilio account with a voice-enabled Twilio phone number
  • The .NET SDK installed on your machine
  • ngrok to tunnel your local server to the internet

You can experiment with most of AssemblyAI's APIs on the free tier, but the real-time transcription feature requires an upgraded account, so upgrade your account before continuing. You can upgrade under Billing by adding at least $10 of credit using a credit card.

Create a WebSocket server for Twilio media streams

First, create an empty ASP.NET Core project. Open your terminal and run the following commands:

dotnet new web --output transcriber-media-streams
cd transcriber-media-streams

Then run the following command to add the Twilio helper library for ASP.NET Core:

dotnet add package Twilio.AspNet.Core

Next, update the Program.cs file with the following C# code:

using Microsoft.AspNetCore.HttpOverrides;
using System.Net.WebSockets;
using System.Text;
using System.Text.Json;
using Twilio.AspNet.Core;
using Twilio.TwiML;
var builder = WebApplication.CreateBuilder(args);
builder.Services.Configure<ForwardedHeadersOptions>(
    options => options.ForwardedHeaders = ForwardedHeaders.All
);
var app = builder.Build();
// Forwarded headers ensure the ngrok host is set on the HTTP request, instead of localhost
app.UseForwardedHeaders();
app.UseWebSockets();
app.MapGet("/", () => "Hello World!");
app.MapPost("/voice", (HttpRequest request) =>
{
    var response = new VoiceResponse();
    response.Say("Speak to see your audio transcribed in the console.");
    var connect = new Twilio.TwiML.Voice.Connect();
    connect.Stream(url: $"wss://{request.Host}/stream");
    response.Append(connect);
    return Results.Extensions.TwiML(response);
});
app.Run();

The code above responds to an HTTP POST request at /voice with the following TwiML:

<Response>
  <Say>
    Speak to see your audio transcribed in the console.
  </Say>
  <Connect>
    <Stream url='wss://<your-host>/stream' />
  </Connect>
</Response>

This TwiML tells Twilio to read a message to the caller using the <Say> verb, and then to start a media stream that connects to your WebSocket server using the <Connect> verb.

Next, add the following WebSocket server code before app.Run();:

app.MapGet("/stream", async context =>
{
    if (context.WebSockets.IsWebSocketRequest)
    {
        using var webSocket = await context.WebSockets.AcceptWebSocketAsync();
        await TranscribeStream(webSocket, context.RequestServices);
    }
    else
    {
        context.Response.StatusCode = StatusCodes.Status400BadRequest;
    }
});
async Task TranscribeStream(
    WebSocket webSocket,
    IServiceProvider serviceProvider
)
{
    var appLifetime = serviceProvider.GetRequiredService<IHostApplicationLifetime>();
    var buffer = new byte[1024 * 4];
    var receiveResult = await webSocket.ReceiveAsync(new ArraySegment<byte>(buffer), CancellationToken.None);
    while (!receiveResult.CloseStatus.HasValue &&
           !appLifetime.ApplicationStopping.IsCancellationRequested)
    {
        using var jsonDocument = JsonSerializer.Deserialize<JsonDocument>(buffer.AsSpan(0, receiveResult.Count));
        var eventMessage = jsonDocument.RootElement.GetProperty("event").GetString();
        switch (eventMessage)
        {
            case "connected":
                app.Logger.LogInformation("Twilio media stream connected");
                break;
            case "start":
                app.Logger.LogInformation("Twilio media stream started");
                break;
            case "media":
                var payload = jsonDocument.RootElement.GetProperty("media").GetProperty("payload").GetString();
                app.Logger.LogInformation("Media: {Media}", payload);
                break;
            case "stop":
                app.Logger.LogInformation("Twilio media stream stopped");
                break;
        }
        receiveResult = await webSocket.ReceiveAsync(new ArraySegment<byte>(buffer), CancellationToken.None);
    }
    if (receiveResult.CloseStatus.HasValue)
    {
        await webSocket.CloseAsync(
            receiveResult.CloseStatus.Value,
            receiveResult.CloseStatusDescription,
            CancellationToken.None);
    }
    else if (appLifetime.ApplicationStopping.IsCancellationRequested)
    {
        await webSocket.CloseAsync(
            WebSocketCloseStatus.EndpointUnavailable,
            "Server shutting down",
            CancellationToken.None);
    }
}

The code above accepts WebSocket connections at /stream and handles the different media stream messages that Twilio sends.
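The messages Twilio sends over the media stream WebSocket are JSON. For illustration, abbreviated versions of the four event types handled above could look roughly like this (SIDs and payloads are placeholders, and several fields are omitted):

```json
{"event": "connected", "protocol": "Call", "version": "1.0.0"}
{"event": "start", "start": {"streamSid": "MZXXXX", "callSid": "CAXXXX", "mediaFormat": {"encoding": "audio/x-mulaw", "sampleRate": 8000, "channels": 1}}, "streamSid": "MZXXXX"}
{"event": "media", "media": {"track": "inbound", "chunk": "1", "timestamp": "5", "payload": "fn5+fn5+..."}, "streamSid": "MZXXXX"}
{"event": "stop", "stop": {"callSid": "CAXXXX"}, "streamSid": "MZXXXX"}
```

Note the mediaFormat in the start message: the audio is 8000 Hz mu-law, which matters when you configure the transcriber later in this tutorial.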

That's all the code you’ll need to implement the Twilio part of this application. Now you can test the application.

Run the application with the following command in your terminal:

dotnet run

For Twilio to be able to reach your server, you need to make your application publicly accessible. Open a separate shell and run the following command to tunnel your locally running server to the internet using ngrok. Replace <YOUR_ASPNET_URL> with the localhost URL that your ASP.NET Core application prints to the console.

ngrok http <YOUR_ASPNET_URL>

Now copy the Forwarding URL that the ngrok command outputs. It should look something like this: https://d226-71-163-163-158.ngrok-free.app.

Go to the Twilio Console, navigate to your active phone numbers, and click on your Twilio phone number.

Update the Voice Configuration so that Twilio sends a webhook to your ngrok forwarding URL when a call comes in:

  • A call comes in: Webhook
  • URL: your ngrok forwarding URL suffixed with /voice
  • HTTP: HTTP POST

Scroll to the bottom of the page and click Save configuration.

As a result of this configuration, when someone calls your Twilio number, Twilio will send a webhook to your ngrok URL, which will pass the HTTP request to your ASP.NET Core application. Your application will respond with the TwiML instructions you wrote earlier.

Call your Twilio phone number, say a few words, and hang up.

Then, observe the output on your terminal where you ran the application.

info: TwilioVoice[0]
      Twilio media stream connected
info: TwilioVoice[0]
      Twilio media stream started
info: TwilioVoice[0]
      Media: fn5+fn5+fn5+fn19fX5+fn5/f39/f35+fn5+fn59fX1+fn9/f39//39+fn1+fn5+fn59fX59fn5+fn5+fn5///7+///+/v7+/f7+/////39+fn5/f35+fn5+fn5+fn5///7+/v7/f39+fn5+fn5/f39/f/7+/v7/f3//f35/f35/f3///39/f////v////7+////f359fX1+fn5+fn5+fw==
info: TwilioVoice[0]
      Media: f35+//9/fn59fX5+fn59fX1+fn59fn5+fX5+f35+fn5+fn19fn5+fn5+fn5+fn9///7///7+/v7///7+/v7+/39+fn5+fn59fX1+f39+fn5+f39+fn5+fn5+fn5+f3/+/v9/f39//39///7/f///f39+fn19fX18fX1+fn5///7+/v9/f39/f3///v7+/39+fn5+f39+fn5+fX19fX19fQ==
info: TwilioVoice[0]
      Media: fX5+fX19fHx8fHt8fX19fX19fX19fX1+fn5+fX1+fn19fn5/f3///v5/////f39/fn59fn5+fn59fn9/fn19fX19fn5+fn//f3///v7+/v7+/v7+/f39/f39/P39/f3+///+/v/+/v7+/v7/f///fn///v9/f39+fn///v7+/v7///9///7+/v7+/v7+/v7+/v/+/v7+/v//f39+fn5/fw==
info: TwilioVoice[0]
      Twilio media stream stopped

You'll see logs for the different media stream events, along with a flood of Media messages containing the base64-encoded audio.

Great job! You finished one half of the puzzle. Now it's time to solve the other half.

Transcribe media stream using AssemblyAI real-time transcription

You're already receiving the audio from the Twilio voice call. Now, you must forward the audio to AssemblyAI's real-time transcription service to turn it into text.

Stop the running application on your terminal, but leave the ngrok tunnel running.

Add the AssemblyAI .NET SDK to your project:

dotnet add package AssemblyAI

Open the Program.cs file and add a using statement to import the AssemblyAI.Realtime namespace:

using Microsoft.AspNetCore.HttpOverrides;
using System.Net.WebSockets;
using System.Text;
using System.Text.Json;
using AssemblyAI.Realtime;
using Twilio.AspNet.Core;
using Twilio.TwiML;

The real-time transcriber needs your AssemblyAI API key to authenticate. You can find your API key on your AssemblyAI dashboard. Run the following commands to store the API key using the .NET secrets manager.

dotnet user-secrets init
dotnet user-secrets set AssemblyAI:ApiKey "<YOUR_API_KEY>"

Next, configure the RealtimeTranscriber to be created by the dependency injection container. The real-time transcriber should be configured with a sample rate of 8000 Hz and mu-law encoding, as Twilio sends the audio in that format.

var builder = WebApplication.CreateBuilder(args);
builder.Services.Configure<ForwardedHeadersOptions>(
    options => options.ForwardedHeaders = ForwardedHeaders.All
);
builder.Services.AddTransient<RealtimeTranscriber>(provider =>
{
    var config = provider.GetRequiredService<IConfiguration>();
    var realtimeTranscriber = new RealtimeTranscriber
    {
        ApiKey = config["AssemblyAI:ApiKey"]!,
        SampleRate = 8000,
        Encoding = AudioEncoding.PcmMulaw
    };
    return realtimeTranscriber;
});
var app = builder.Build();

Now, update the WebSocket handler to pass the audio from Twilio to AssemblyAI and log the transcripts.

async Task TranscribeStream(
    WebSocket webSocket,
    IServiceProvider serviceProvider
)
{
    var appLifetime = serviceProvider.GetRequiredService<IHostApplicationLifetime>();
    var realtimeTranscriber = serviceProvider.GetRequiredService<RealtimeTranscriber>();
    var transcriptTexts = new SortedDictionary<int, string>();
    string BuildTranscript()
    {
        var stringBuilder = new StringBuilder();
        foreach (var word in transcriptTexts.Values)
        {
            stringBuilder.Append($"{word} ");
        }
        return stringBuilder.ToString();
    }
    realtimeTranscriber.SessionBegins.Subscribe(sessionBegins =>
        app.Logger.LogInformation(
            "RealtimeTranscriber session begins with ID {SessionId} until {ExpiresAt}",
            sessionBegins.SessionId,
            sessionBegins.ExpiresAt
        )
    );
    realtimeTranscriber.ErrorReceived.Subscribe(error =>
        app.Logger.LogError("RealtimeTranscriber error: {error}", error)
    );
    realtimeTranscriber.Closed.Subscribe(closeEvent =>
        app.Logger.LogWarning(
            "RealtimeTranscriber closed with status {Code}, reason: {Reason}",
            closeEvent.Code,
            closeEvent.Reason
        )
    );
    realtimeTranscriber.PartialTranscriptReceived.Subscribe(partialTranscript =>
    {
        if (string.IsNullOrEmpty(partialTranscript.Text)) return;
        transcriptTexts[partialTranscript.AudioStart] = partialTranscript.Text;
        var transcript = BuildTranscript();
        Console.Clear();
        Console.WriteLine(transcript);
    });
    realtimeTranscriber.FinalTranscriptReceived.Subscribe(finalTranscript =>
    {
        transcriptTexts[finalTranscript.AudioStart] = finalTranscript.Text;
        var transcript = BuildTranscript();
        Console.Clear();
        Console.WriteLine(transcript);
    });
    await realtimeTranscriber.ConnectAsync().ConfigureAwait(false);
    var buffer = new byte[1024 * 4];
    var receiveResult = await webSocket.ReceiveAsync(new ArraySegment<byte>(buffer), CancellationToken.None);
    while (!receiveResult.CloseStatus.HasValue &&
           !appLifetime.ApplicationStopping.IsCancellationRequested)
    {
        using var jsonDocument = JsonSerializer.Deserialize<JsonDocument>(buffer.AsSpan(0, receiveResult.Count));
        var eventMessage = jsonDocument.RootElement.GetProperty("event").GetString();
        switch (eventMessage)
        {
            case "connected":
                app.Logger.LogInformation("Twilio media stream connected");
                break;
            case "start":
                app.Logger.LogInformation("Twilio media stream started");
                break;
            case "media":
                var payload = jsonDocument.RootElement.GetProperty("media").GetProperty("payload").GetString();
                byte[] audio = Convert.FromBase64String(payload);
                await realtimeTranscriber.SendAudioAsync(audio).ConfigureAwait(false);
                break;
            case "stop":
                app.Logger.LogInformation("Twilio media stream stopped");
                break;
        }
        receiveResult = await webSocket.ReceiveAsync(new ArraySegment<byte>(buffer), CancellationToken.None);
    }
    if (receiveResult.CloseStatus.HasValue)
    {
        await webSocket.CloseAsync(
            receiveResult.CloseStatus.Value,
            receiveResult.CloseStatusDescription,
            CancellationToken.None);
    }
    else if (appLifetime.ApplicationStopping.IsCancellationRequested)
    {
        await webSocket.CloseAsync(
            WebSocketCloseStatus.EndpointUnavailable,
            "Server shutting down",
            CancellationToken.None);
    }
    await realtimeTranscriber.CloseAsync();
}

The code above requests a RealtimeTranscriber from the dependency injection container, subscribes to the lifecycle and transcript events, and connects to the real-time transcription service. When Twilio sends audio data, the audio is decoded from base64 and passed to the real-time service using realtimeTranscriber.SendAudioAsync(audio).

The real-time transcription service uses a two-phase transcription strategy, broken into partial and final transcripts. Partial transcripts are returned immediately as you send audio. Final transcripts are returned at the end of an "utterance" (usually a pause in speech); at that point, the service finalizes the results and returns a higher-accuracy transcript with punctuation and casing added. Check out the AssemblyAI documentation on Streaming Speech-to-Text to learn more about the transcripts and the data sent with them.
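For illustration, a partial and a final transcript message for the same utterance might look roughly like this (abbreviated; field names follow the service's snake_case wire format, while the SDK exposes them as PascalCase properties such as AudioStart and Text):

```json
{"message_type": "PartialTranscript", "audio_start": 1200, "audio_end": 2150, "text": "hello how are you"}
{"message_type": "FinalTranscript", "audio_start": 1200, "audio_end": 2600, "text": "Hello, how are you?", "punctuated": true, "text_formatted": true}
```

Note that both messages share the same audio_start, which is what the transcript-merging code below relies on.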

The partial and final transcript events both update the transcriptTexts dictionary using each transcript's audio start time as the key and the transcript text as the value. This way, the partial transcript text gets overwritten by the final transcript of the same utterance. The BuildTranscript local function concatenates the dictionary values into a single string.
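As a standalone sketch of that merge strategy (the timestamps and texts below are hypothetical; in the app, the keys come from each transcript's AudioStart property):

```csharp
using System.Text;

// Transcript texts keyed by the start time (ms) of the utterance they belong to.
var transcriptTexts = new SortedDictionary<int, string>();

string BuildTranscript()
{
    var stringBuilder = new StringBuilder();
    foreach (var text in transcriptTexts.Values)
    {
        stringBuilder.Append($"{text} ");
    }
    return stringBuilder.ToString();
}

// A partial transcript arrives for the utterance starting at 1200 ms...
transcriptTexts[1200] = "hello how are you";
// ...and the final transcript for the same utterance overwrites it,
// because both share the same key.
transcriptTexts[1200] = "Hello, how are you?";
// A later utterance gets its own key; SortedDictionary keeps time order.
transcriptTexts[3400] = "I'm fine, thanks.";

Console.WriteLine(BuildTranscript()); // Hello, how are you? I'm fine, thanks.
```

Because the dictionary is sorted by key, the rebuilt transcript always reads in chronological order, no matter which order the events arrive in.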

Test the application

That's all the code you need to write. Let's test it out. Start the ASP.NET Core application (leave ngrok running), and give your Twilio phone number a call. As you speak in the call, you'll see your words printed on the console.

If the real-time transcription service returns an error, make sure you have an upgraded account with sufficient funds. The AssemblyAI docs provide a list of error codes and their meanings.

Extending your AssemblyAI Application

In this tutorial, you learned how to create a WebSocket-enabled ASP.NET Core application that handles Twilio media streams to receive the audio of a Twilio voice call, and how to transcribe that audio to text in real-time using AssemblyAI's Streaming Speech-to-Text.

You can build on this to create many types of voice applications. For example, you could pass the final transcript to a Large Language Model (LLM) to generate a response, then use a text-to-speech service to turn the response text into audio.

You can inform the LLM that there are specific actions that the caller can take. You can then ask the LLM to identify the action the caller prefers based on their final transcript and execute that action.

We can't wait to see what you build! Let us know!

Niels Swimberghe is a Belgian-American software engineer, a developer educator at AssemblyAI, and a Microsoft MVP. Contact Niels on Twitter @RealSwimburger and follow Niels’ blog on .NET, Azure, and web development at swimburger.net .