Build a Soundboard using GCP Speech-To-Text, Twilio Voice Media Streams, and ASP.NET Core

February 27, 2023
Written by
Volkan Paksoy

Twilio Media Streams gives programmers access to the raw audio of a phone call in real time. This allows you to process the media and enhance your applications by running sentiment analysis, speech recognition, and more. In this tutorial, you will learn how to receive the raw audio via WebSockets, transcribe the call using Google Cloud's Speech-to-Text service, and play audio files based on the caller's commands.

Prerequisites

You'll need the following for this tutorial:

  • A free Twilio account with a voice-capable phone number
  • A Google Cloud (GCP) account
  • The .NET SDK and an IDE or code editor of your choice
  • Git, to clone the starter project
  • ngrok, to expose your local application to the public internet

Set up GCP Speech-to-Text

To use the Speech-to-Text API, you must enable it in the Google Cloud console. If you have never used GCP, you can log in to your Google account, go to the free trial start page, and click the Start free button.

You will be asked to enter your personal information through a 2-step process. Once you've completed the process, you should gain access to $300 in free credits, valid for 90 days.

To start developing your application, you will need to create a project in Google Cloud. In my account, Google automatically created a new project called "My First Project". If you don't have this or would like to create a brand new one, go to Menu > IAM & Admin > Create a Project.

You should see the new project creation screen with a default name already chosen for you. You can change it to your liking:

New project page showing the default project name automatically populated and No organization selected as Location

Click the Create button to finish creating the project. To switch between projects, click the project name next to the Google Cloud logo and browse your projects.

Screen showing the project name next to the GCP logo. User clicked on the project name and opened a dialog with title "Select a project" and shows all the projects in the account and a New Project button on top

The dialog also has a New Project button which you can use to create new projects.

Once you've created and selected your project, go to the Cloud Speech-to-Text API product page and click Enable.

Cloud Speech-to-Text API product page showing Enable and Try This API buttons

You should see a notification advising you to create credentials. Click Create Credentials.

Notification telling the user to create credentials to use the API. It shows a Create Credentials button on the right.

In the Which API you are using section, select Cloud Speech-to-Text API if it's not already selected.

Which API you are using section showing Cloud Speech-to-Text API selected

In the What data will you be accessing section, select Application data.

Select No, I'm not using them as the answer to the Are you planning to use this API with Compute Engine… question and click Next.

On the Service account details page, give your service account a name such as transcribe-twilio-call.

The Service Account ID should be automatically populated based on the name you chose.

Click Create and Continue.

The rest of the settings are optional, so you can click Done and complete the process.

To use this service account, you will need credentials. While still on the Cloud Speech-to-Text API page, click Credentials in the left menu to switch to the Credentials tab.

Credentials tab shown in the middle of the Cloud Speech-to-Text API page

Scroll down to the Service Accounts section and click your account.

Switch to the Keys section and click Add Key → Create new key.

Select JSON if not selected already, and click Create.

Create private key window showing JSON and P12 options. JSON is selected. It also shows the Create button at the bottom.

This should start a download of your private key in a JSON file.

Copy this file to a safe location and set the GOOGLE_APPLICATION_CREDENTIALS environment variable by running the command appropriate to your system.

export GOOGLE_APPLICATION_CREDENTIALS={ PATH TO YOUR JSON FILE }
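The command above is for bash-like shells. If you're using PowerShell on Windows, a session-scoped equivalent (it only applies to the current terminal session) would be:

$env:GOOGLE_APPLICATION_CREDENTIALS="{ PATH TO YOUR JSON FILE }"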

Twilio Media Streams

In a Twilio voice application, you can use the Stream verb to receive raw audio streams from a live phone call over WebSockets in near real-time.

WebSockets

A WebSocket is a protocol for bidirectional communication between a client (such as a web browser) and a server over a single, long-lived connection. WebSockets allow for real-time, two-way communication between the client and server and can be used for a variety of applications such as online gaming, chat applications, and data streaming. Unlike traditional HTTP connections, which are request-response based, WebSockets provide a full-duplex communication channel for continuous, real-time data exchange.
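To make the full-duplex idea concrete, here is a minimal client-side sketch using .NET's built-in ClientWebSocket (the URL is a placeholder; this snippet is not part of the sample project):

using System.Net.WebSockets;
using System.Text;

using var client = new ClientWebSocket();
await client.ConnectAsync(new Uri("wss://example.com/socket"), CancellationToken.None);

// Send a text frame; the server can push frames back at any time over the same connection.
await client.SendAsync(
    new ArraySegment<byte>(Encoding.UTF8.GetBytes("hello")),
    WebSocketMessageType.Text,
    endOfMessage: true,
    CancellationToken.None);

// Receive a frame pushed by the server over the same long-lived connection.
var buffer = new byte[4096];
var result = await client.ReceiveAsync(new ArraySegment<byte>(buffer), CancellationToken.None);
Console.WriteLine(Encoding.UTF8.GetString(buffer, 0, result.Count));

In this tutorial, your ASP.NET Core application plays the server role and Twilio acts as the client.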

Stream WebSocket Messages

In the Twilio Stream WebSocket, each message is sent in a JSON string. There are different message types, and to identify the message type, first, you need to parse the JSON and check the value of the event field.

The possible event types for the WebSocket messages coming from Twilio are:

  • connected: The first message sent once a WebSocket connection is established.
  • start: This message contains important metadata about the stream and is sent immediately after the connected message. It is only sent once at the start of the Stream.
  • media: This message type encapsulates the raw audio data.
  • stop: This message will be sent when the Stream is stopped or the call has ended.
  • mark: The mark event is sent only during bidirectional streaming using the <Connect> verb. It is used to track or label when media has completed.

The possible event types for the WebSocket messages you send back to Twilio are listed below, followed by an abbreviated example of the message format:

  • media: To send media back to Twilio, you must provide a similarly formatted media message. The payload must be encoded audio/x-mulaw with a sample rate of 8000 and base64 encoded. The audio can be of any size.
  • mark: Send a mark event message after sending a media event message to be notified when the audio that you have sent has been completed.
  • clear: Send the clear event message if you would like to interrupt the audio that you have sent in previous media event messages.
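For orientation, here is an abbreviated media message showing only the fields used in this tutorial. Messages in both directions share this shape; the messages Twilio sends also include additional metadata fields not shown here:

{
  "event": "media",
  "streamSid": "{STREAM SID}",
  "media": {
    "payload": "{BASE64-ENCODED MULAW AUDIO}"
  }
}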

In the demo project, you will learn more about the other fields that are used in these messages.

WAVE File Format Analysis

The telephony standard for audio is 8-bit PCM mono µ-law (MULAW) with a sampling rate of 8 kHz. The payload of the media message should not contain the audio file type header bytes, so it's essential to understand the WAV file header fields and strip them off before sending the audio data to the caller.

A standard WAV file header comprises the following fields:

Positions | Sample Value | Description
1 - 4 | "RIFF" | Marks the file as a RIFF file. Characters are each 1 byte long.
5 - 8 | File size (integer) | Size of the overall file minus 8 bytes, as a 32-bit integer. Typically, you'd fill this in after creation.
9 - 12 | "WAVE" | File type header. For our purposes, it always equals "WAVE".
13 - 16 | "fmt " | Format chunk marker. Includes trailing null.
17 - 20 | 16 | Length of the format data listed above.
21 - 22 | 1 | Type of format (1 is PCM) - 2-byte integer.
23 - 24 | 2 | Number of channels - 2-byte integer.
25 - 28 | 44100 | Sample rate - 32-bit integer. Common values are 44100 (CD) and 48000 (DAT). Sample rate = number of samples per second, or Hertz.
29 - 32 | 176400 | (Sample Rate * BitsPerSample * Channels) / 8.
33 - 34 | 4 | (BitsPerSample * Channels) / 8. Possible values: 1 = 8-bit mono, 2 = 8-bit stereo/16-bit mono, 4 = 16-bit stereo.
35 - 36 | 16 | Bits per sample.
37 - 40 | "data" | "data" chunk header. Marks the beginning of the data section.
41 - 44 | File size (data) | Size of the data section.

(Source: https://docs.fileformat.com/audio/wav/)

A WAVE file is a collection of different types of chunks. The fmt chunk is required, and it contains parameters describing the waveform.

Now open bird.wav (one of the audio files included in the sample project you'll clone shortly) in a hex editor and review the file. Note that the file length is 54,084 bytes.

Hex editor showing contents of bird.wav

Here you can see the fields:

Positions | Bytes | Value | Explanation
1 - 4 | 52 49 46 46 | "RIFF" | As expected
5 - 8 | 00 00 D3 3C | 54,076 (little-endian), which is the file length minus 8 | As expected
9 - 12 | 57 41 56 45 | "WAVE" | As expected
13 - 16 | 66 6D 74 20 | "fmt" with trailing space | As expected
17 - 20 | 12 00 00 00 | Length of the fmt chunk data: 18 | As expected. Can be 16, 18, or 40
21 - 22 | 00 07 | Type of format: Mulaw (7) | As expected
23 - 24 | 00 01 | Number of channels: 1 | As expected
25 - 28 | 00 00 1F 40 | Sample rate: 8000 | As expected
29 - 32 | 00 00 1F 40 | (Sample Rate * Bits per sample * Channels) / 8 = (8000 * 8 * 1) / 8 = 8000 | As expected
33 - 34 | 00 01 | (BitsPerSample * Channels) / 8 = (8 * 1) / 8 = 1 | As expected
35 - 36 | 00 08 | 8 bits per sample | As expected
37 - 38 | 00 00 | Size of the extension: 0 | Expected the start of the data section here instead. Microsoft Windows Media Player will not play non-PCM data (e.g. µ-law data) if the fmt chunk does not have the extension size field (cbSize) or a fact chunk is not present.
39 - 42 | 66 61 63 74 | "fact" | Optional fact chunk
43 - 46 | 00 00 00 04 | 4 = size of the fact chunk data |
47 - 50 | 00 00 D3 0A | 54,026 = chunk data, equal to the size of the data section | Fact chunk explanation
51 - 54 | 64 61 74 61 | "data" | As expected, except it starts at byte 51 because of the fact chunk
55 - 58 | 00 00 D3 0A | Size of the data: 54,026 | As expected

As you can see, the actual file header diverges slightly from the standard header description.

The takeaways from this analysis are:

  • The audio data starts after the first 58 bytes. You will skip those bytes in the demo and only send the audio data to the caller.
  • You may encounter different header lengths (for example, files without the optional fact chunk), and consequently may need to adjust the number of bytes to skip; otherwise, you may hear distorted audio on the phone. A more robust approach is sketched right after this list.
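As a rough illustration of how you could avoid hardcoding the offset, here is a minimal sketch (not part of the sample project) that scans the file for the "data" chunk marker and returns the position just past its 4-byte size field. It assumes a well-formed RIFF/WAVE file:

// Hypothetical helper: find where the audio data starts by locating the "data" chunk
// instead of assuming a fixed 58-byte header.
static int FindAudioDataOffset(byte[] wavBytes)
{
    var marker = System.Text.Encoding.ASCII.GetBytes("data");
    // Chunks start after the 12-byte RIFF header ("RIFF" + file size + "WAVE").
    for (var i = 12; i <= wavBytes.Length - 8; i++)
    {
        if (wavBytes[i] == marker[0] && wavBytes[i + 1] == marker[1] &&
            wavBytes[i + 2] == marker[2] && wavBytes[i + 3] == marker[3])
        {
            // Skip the 4-byte "data" marker and the 4-byte chunk size that follows it.
            return i + 8;
        }
    }
    throw new InvalidDataException("No data chunk found in the WAV file.");
}

For bird.wav, this returns 58, matching the analysis above.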

Now that you understand the WAV format better, proceed to the next section to implement the project to play audio files to a phone call.

Sample Project: Animal Soundboard

The project requires some audio files to function properly. The easiest way to set up the starter project is by cloning the sample GitHub repository.

Open a terminal, change to the directory you want to download the project, and run the following command:

git clone https://github.com/Dev-Power/play-audio-to-a-phone-call-using-media-streams.git --branch starter-project

The project can be found in the src\PlayAudioUsingMediaStreams subfolder. Open the project in your IDE.

The starter project comes with 2 controllers: IncomingCallController and AnimalSoundboardController. IncomingCallController currently only plays back a simple message to test your setup. You will implement AnimalSoundboardController as you go along.

It also comes with 4 WAV files that will be used in the project.

Open another terminal and run ngrok like this:

ngrok http http://localhost:5214

For Twilio to know where to send webhook requests, you need to update the webhook settings on your Twilio phone number.

Go to the Twilio Console. Select your account, and then click Phone Numbers → Manage → Active Numbers on the left pane. (If Phone Numbers isn't on the left pane, click on Explore Products and then on Phone Numbers.)

Click on the phone number you want to use for your project and scroll down to the Voice section.

Under the A Call Comes In label, set the dropdown to Webhook, the text field next to it to the ngrok Forwarding URL suffixed with the /IncomingCall path, the next dropdown to HTTP POST, and click Save. It should look like this:

Twilio Console showing the incoming call webhook set to ngrok forwarding URL followed by /IncomingCall. Save button is highlighted in the image and needs to be clicked to update the settings.

Note that you have to use HTTPS as the protocol when setting the webhook URL.

In the terminal, run the following command:

dotnet run

Call your Twilio number, and you should hear the message "If you can hear this, your setup works!" played back to you.

After you've confirmed you can receive calls in your application, update the code in the Index method of the IncomingCallController with the code below:

var response = new VoiceResponse();
response.Say("Say animal names to hear their sounds.");

var connect = new Connect();
connect.Stream(
    name: "Animal Soundboard", 
    url: Url.Action(
        action: "Get", 
        controller: "AnimalSoundboard",
        values: null,
        protocol: "wss"
    )
);
response.Append(connect);

Console.WriteLine(response.ToString());
return TwiML(response);

This update replaces the message and adds the Stream verb, nested inside a Connect verb, to the output. It also prints the response before sending it back, so you can see the TwiML you created, which looks like this:

<?xml version="1.0" encoding="utf-8"?>
<Response>
  <Say>Say animal names to hear their sounds.</Say>
  <Connect>
    <Stream name="Animal Soundboard" url="wss://{YOUR NGROK URL}/animalsoundboard"></Stream>
  </Connect>
</Response>

In the demo, you will receive raw user audio and play animal sounds back depending on the commands you receive, so you have to maintain a synchronous bi-directional connection. This is why you use the Connect verb instead of the Start verb, which is asynchronous and immediately continues with the next TwiML instruction. You can read more about TwiML stream verbs here.
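For comparison, a stream opened with the Start verb would look roughly like this (shown only for contrast; it is not used in this project):

<?xml version="1.0" encoding="utf-8"?>
<Response>
  <Start>
    <Stream name="Animal Soundboard" url="wss://{YOUR NGROK URL}/animalsoundboard"></Stream>
  </Start>
  <Say>This plays immediately because Start does not block.</Say>
</Response>

A stream started this way only forks the call audio to your WebSocket; you can't send audio back to the caller over it, which is why it doesn't fit this project.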

Now, it's time to implement the web socket. The first version will just echo the user's voice back. Update the AnimalSoundboardController with the code below:

using System.Net.WebSockets;
using Microsoft.AspNetCore.Mvc;
using Twilio.AspNet.Core;

namespace PlayAudioUsingMediaStreams.WebApi.Controllers;

[ApiController]
[Route("[controller]")]
public class AnimalSoundboardController : Controller
{
    public async Task Get()
    {
        if (HttpContext.WebSockets.IsWebSocketRequest)
        {
            using var webSocket = await HttpContext.WebSockets.AcceptWebSocketAsync();
            await Soundboard(webSocket);
        }
        else
        {
            HttpContext.Response.StatusCode = StatusCodes.Status400BadRequest;
        }
    }
    
    private async Task Soundboard(WebSocket webSocket)
    {
        var buffer = new byte[1024 * 4];
        var receiveResult = await webSocket.ReceiveAsync(
            new ArraySegment<byte>(buffer), CancellationToken.None);

        while (!receiveResult.CloseStatus.HasValue)
        {
            await webSocket.SendAsync(
                new ArraySegment<byte>(buffer, 0, receiveResult.Count),
                receiveResult.MessageType,
                receiveResult.EndOfMessage,
                CancellationToken.None);

            receiveResult = await webSocket.ReceiveAsync(
                new ArraySegment<byte>(buffer), CancellationToken.None);
        }

        await webSocket.CloseAsync(
            receiveResult.CloseStatus.Value,
            receiveResult.CloseStatusDescription,
            CancellationToken.None);
    }
}

Before you run the application, you have to modify the Program.cs and add WebSocket support as shown below:

app.MapControllers();

app.UseWebSockets();

app.Run();

Re-run the application and call your Twilio phone number again. You should hear yourself on the phone as you speak.

This version of the code reads the streaming data from the web socket, and as long as the connection is open, it sends the same data back to the user.

This is how you can access the raw audio of a phone call. This primitive version does not look inside the messages, even though everything you receive over the WebSocket is a JSON string.

In the next version, you will parse the messages as well. Before that, you'll need some supporting services.

To model the sounds, create a new file called Sound.cs and update its contents like this:

namespace PlayAudioUsingMediaStreams.WebApi;

public class Sound
{
    public string Name { get; set; }
    public List<string> Keywords { get; set; }
    public string AudioDataAsBase64 { get; set; }
}

Every sound has a name, a list of keywords, and the audio data. In this project, you will use animal names as keywords, but the application can be used for any group of sounds.

Create a new folder called Services and a new file under it called SoundService.cs. Update the code as below:

namespace PlayAudioUsingMediaStreams.WebApi.Services;

public class SoundService
{
    private const string AudioRoot = "../../audio";
    private const int WavHeaderBytesToSkip = 58;
    
    private List<Sound> _sounds = new()
    {
        new() { Name = "dog", Keywords = new List<string> { "dog", "canine", "pooch", "hound" } },
        new() { Name = "cat", Keywords = new List<string> { "cat", "kitty", "kitten" } },
        new() { Name = "bird", Keywords = new List<string> { "bird" } },
        new() { Name = "elephant", Keywords = new List<string> { "elephant" } },
    };

    public SoundService()
    {
        // Load all files into memory once to avoid constant disk access
        foreach (var sound in _sounds)
        {
            var audioFilePath = $"{AudioRoot}/{sound.Name}.wav";
            var rawAudioData = File.ReadAllBytes(audioFilePath);
            
            // Skip the header bytes while copying
            var tempAudioData = new byte[rawAudioData.Length - WavHeaderBytesToSkip];
            Array.Copy(rawAudioData, WavHeaderBytesToSkip, tempAudioData, 0, tempAudioData.Length);

            sound.AudioDataAsBase64 = Convert.ToBase64String(tempAudioData);
        }
    }

    public bool TryFindSoundByKeyword(string keyword, out Sound sound)
    {
        sound = _sounds.FirstOrDefault(s => s.Keywords.Contains(keyword));
        return sound != null;
    }
}

At construction, the service initializes all the sound objects. It loads the audio data into memory so the sounds can be played in rapid succession without reading the files from disk repeatedly.

It also handles skipping the WAV header bytes, as discussed in the previous section.
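As a quick illustration of how the lookup behaves (assuming the four WAV files are reachable at the AudioRoot path), you could use the service like this:

using PlayAudioUsingMediaStreams.WebApi.Services;

var soundService = new SoundService();

// "kitty" is one of the keywords registered for the "cat" sound.
if (soundService.TryFindSoundByKeyword("kitty", out var sound))
{
    Console.WriteLine($"Matched sound: {sound.Name}"); // prints "cat"
}

In the finished application, the keyword will come from the speech recognition results rather than a hardcoded string.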

You also need to identify the keywords that the caller utters. To achieve this, you'll use the Google Speech-to-Text service. Install the SDK by running the following command:

dotnet add package Google.Cloud.Speech.V1

Under the Services folder, create a new file called SpeechRecognitionService.cs with the following contents:

using Google.Api.Gax.Grpc;
using Google.Cloud.Speech.V1;
using Google.Protobuf;

namespace PlayAudioUsingMediaStreams.WebApi.Services;

public class SpeechRecognitionService
{
    private StreamingRecognitionConfig _streamingConfig = new()
    {
        Config = new RecognitionConfig
        {
            Encoding = RecognitionConfig.Types.AudioEncoding.Mulaw,
            SampleRateHertz = 8000,
            LanguageCode = "en-US",
            EnableWordConfidence = true,
            UseEnhanced = true
        },
        InterimResults = true
    };

    private SpeechClient _speechClient;

    private SpeechClient.StreamingRecognizeStream _streamingRecognizeStream;

    public SpeechRecognitionService(SpeechClient speechClient)
    {
        _speechClient = speechClient;
    }
    
    public async Task<AsyncResponseStream<StreamingRecognizeResponse>> InitStream()
    {
        _streamingRecognizeStream = _speechClient.StreamingRecognize();
        await _streamingRecognizeStream.WriteAsync(new StreamingRecognizeRequest
        {
            StreamingConfig = _streamingConfig,
        });

        return _streamingRecognizeStream.GetResponseStream();
    }

    public async Task SendAudio(string payload)
    {
        await _streamingRecognizeStream.WriteAsync(new StreamingRecognizeRequest
        {
            AudioContent = ByteString.FromBase64(payload)
        });
    }
}

This service wraps the Google speech client and manages the streaming recognition call. When you first create the stream, you write the recognition configuration, as shown in the InitStream method. This returns the response stream; from that point on, you only write audio data to the stream via the SendAudio method.

To be able to use these services with dependency injection, add them to the IoC container in Program.cs:

builder.Services.AddSwaggerGen();

builder.Services.AddSpeechClient();
builder.Services.AddTransient<SoundService>();
builder.Services.AddTransient<SpeechRecognitionService>();

var app = builder.Build();

Make sure to add the using statement to the top of the file as well:

using PlayAudioUsingMediaStreams.WebApi.Services;

Finally, update the AnimalSoundboardController as shown below:

using System.Net.WebSockets;
using System.Text;
using System.Text.Json;
using Google.Api.Gax.Grpc;
using Google.Cloud.Speech.V1;
using Microsoft.AspNetCore.Mvc;
using PlayAudioUsingMediaStreams.WebApi.Services;

namespace PlayAudioUsingMediaStreams.WebApi.Controllers;

[ApiController]
[Route("[controller]")]
public class AnimalSoundboardController : Controller
{
    private readonly SoundService _soundService;
    private readonly SpeechRecognitionService _speechRecognitionService;
    private readonly IHostApplicationLifetime _applicationLifetime;

    public AnimalSoundboardController(
        SoundService soundService,
        SpeechRecognitionService speechRecognitionService,
        IHostApplicationLifetime applicationLifetime
    )
    {
        _soundService = soundService;
        _speechRecognitionService = speechRecognitionService;
        _applicationLifetime = applicationLifetime;
    }

    [HttpGet]
    public async Task Get()
    {
        if (HttpContext.WebSockets.IsWebSocketRequest)
        {
            using var webSocket = await HttpContext.WebSockets.AcceptWebSocketAsync();
            await Soundboard(webSocket);
        }
        else
        {
            HttpContext.Response.StatusCode = StatusCodes.Status400BadRequest;
        }
    }

    private async Task Soundboard(WebSocket webSocket)
    {
        string streamSid = null;
        var buffer = new byte[1024 * 4];
        var receiveResult = await webSocket.ReceiveAsync(new ArraySegment<byte>(buffer), CancellationToken.None);

        await using var speechRecognitionStream = await _speechRecognitionService.InitStream();
        while (!receiveResult.CloseStatus.HasValue &&
               !_applicationLifetime.ApplicationStopping.IsCancellationRequested)
        {
            using var jsonDocument = JsonDocument.Parse(Encoding.UTF8.GetString(buffer, 0, receiveResult.Count));
            var eventMessage = jsonDocument.RootElement.GetProperty("event").GetString();

            switch (eventMessage)
            {
                case "connected":
                    Console.WriteLine("Event: connected");
                    break;
                case "start":
                    Console.WriteLine("Event: start");
                    streamSid = jsonDocument.RootElement.GetProperty("streamSid").GetString();
                    Console.WriteLine($"StreamId: {streamSid}");

                    // Do not await task, leave this task running in the background for the duration of the websocket connection
                    var _ = ListenForSpeechRecognition(webSocket, streamSid, speechRecognitionStream)
                        .ConfigureAwait(false);
                    break;
                case "media":
                    var payload = jsonDocument.RootElement.GetProperty("media").GetProperty("payload").GetString();
                    await _speechRecognitionService.SendAudio(payload);
                    break;
                case "stop":
                    Console.WriteLine("Event: stop");
                    break;
            }

            receiveResult = await webSocket.ReceiveAsync(new ArraySegment<byte>(buffer), CancellationToken.None);
        }

        if (receiveResult.CloseStatus.HasValue)
        {
            await webSocket.CloseAsync(
                receiveResult.CloseStatus.Value,
                receiveResult.CloseStatusDescription,
                CancellationToken.None);
        }
        else if (_applicationLifetime.ApplicationStopping.IsCancellationRequested)
        {
            await webSocket.CloseAsync(
                WebSocketCloseStatus.EndpointUnavailable,
                "Server shutting down",
                CancellationToken.None);
        }
    }

    private async Task ListenForSpeechRecognition(
        WebSocket webSocket,
        string streamSid,
        AsyncResponseStream<StreamingRecognizeResponse> speechRecognitionStream
    )
    {
        while (await speechRecognitionStream.MoveNextAsync())
        {
            var word = speechRecognitionStream.Current?.Results.FirstOrDefault()
                ?.Alternatives.FirstOrDefault()
                ?.Words.FirstOrDefault();
            if (word == null) continue;

            Console.WriteLine($"Word: [{word.Word}]. Confidence: {word.Confidence:N2}");
            if (word.Confidence < 0.5)
            {
                Console.WriteLine($"Low confidence. Skipping the word [{word.Word}]");
                continue;
            }

            var utterance = word.Word.Trim().ToLower();
            if (!_soundService.TryFindSoundByKeyword(utterance, out var soundToPlay))
            {
                continue;
            }

            Console.WriteLine($"Animal detected: {soundToPlay.Name}");

            var mediaMessage = new
            {
                streamSid,
                @event = "media",
                media = new
                {
                    payload = soundToPlay.AudioDataAsBase64
                }
            };

            var rawJson = JsonSerializer.Serialize(mediaMessage);
            var responseBuffer = Encoding.UTF8.GetBytes(rawJson);

            await webSocket.SendAsync(
                new ArraySegment<byte>(responseBuffer, 0, responseBuffer.Length),
                WebSocketMessageType.Text,
                true,
                CancellationToken.None);
        }
    }
}

As mentioned before, you're now parsing the JSON messages:

var jsonDocument = JsonDocument.Parse(Encoding.UTF8.GetString(buffer, 0, receiveResult.Count))

The first action is to determine the message type. You achieve this by reading the event property:

string eventMessage = jsonDocument.RootElement.GetProperty("event").GetString();

As you saw at the beginning of the article, there are different types of events. In the application, you're interested in 3 of them:

  • connected: Confirms that the WebSocket connection has been established. In this demo, the Google speech stream is already initialized just before the receive loop, so this event is only logged.
  • start: You receive the unique stream identifier (streamSid) in this message. This ID must be stored so you can send audio back to the caller. This is also where the background task that listens for speech recognition results is started.
  • media: Whenever you receive a media message, you parse the payload and send it to Google for speech recognition.

The final important update is in ListenForSpeechRecognition, where the sound and speech recognition services work together to detect whether the caller uttered a keyword. When that happens, you prepare a new media message, serialize it to JSON, and send it back over the WebSocket:

var mediaMessage = new
{
    streamSid, 
    @event = "media", 
    media = new
    {
        payload = soundToPlay.AudioDataAsBase64
    }
};
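If you also want to know when Twilio has finished playing the audio you sent, you can follow the media message with a mark message, as described earlier in the message types list. A minimal sketch (not part of the sample project; the label is arbitrary) would be built, serialized, and sent the same way:

var markMessage = new
{
    streamSid,
    @event = "mark",
    mark = new
    {
        // Arbitrary label; Twilio sends back a mark event with this name once playback completes.
        name = soundToPlay.Name
    }
};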

To test the final version, rerun your application and call your Twilio number.

Speak some of the keywords, and you should hear the corresponding animals' sounds:

Terminal showing the output of the application. Keywords as they are uttered, the confidence of the speech recognition and the animal detected.

How to Add More Audio

If you enjoyed this little project and would like to add more sounds, here's how I created the stock sounds:

Go to the BBC Sound Effects website.

Search for the animal you're looking for, click the download button, and select wav as the file format.

BBC Sound Effects search results showing lion was searched and the download icon clicked. It shows wav and mp3 as available file formats to download.

Once you've downloaded the file, go to the G711 File Converter.

Locate your file by clicking Browse.

Select u-Law WAV as the output format and click Submit.

Click on the link of the converted file to download it.

Most audio files are too long to play one after another quickly. I use Audacity to open the files and copy just the part I'm interested in.

Audacity showing the audio loaded a portion of it is selected to be exported

Once you've selected the portion you want, click File → Export → Export Selected Audio to save it as a separate file.

Conclusion

In this tutorial, you learned how WebSockets work and how to use them in a voice application to establish a two-way audio connection with the caller. You also learned more about Media Streams and the audio format standard for telephony. You used all this knowledge to implement a project that plays audio files to the caller based on their commands. This project shows that you can access both the raw audio and partial transcriptions of a call, which you can now use in your own projects.

If you'd like to keep learning, I recommend taking a look at these articles:

Volkan Paksoy is a software developer with more than 15 years of experience, focusing mainly on C# and AWS. He’s a home lab and self-hosting fan who loves to spend his personal time developing hobby projects with Raspberry Pi, Arduino, LEGO and everything in-between. You can follow his personal blogs on software development at devpower.co.uk and cloudinternals.net.