Talking Texts with .NET Core, Cognitive Services and Azure Storage

August 10, 2018

Imagine you are driving along in your car and your phone beeps, letting you know that a text message has come in. We all know it’s beyond dangerous to read a message whilst driving (it’s a pet hate of mine when I see people doing it), so why not have your text message phoned through to you? Hands-free, of course!

This post will show you how to create talking texts using Twilio, .NET Core, Cognitive Services and Azure Storage.

We will build an application that converts incoming SMS messages into speech using the Speech Service (currently in preview) from Microsoft Cognitive Services. We will then use Twilio to call your mobile and play the speech recording.

Let’s get started.

What you will need

I will be developing this solution in Visual Studio 2017 on Windows.  However, you can certainly use VS Code for cross-platform development.  Either is available from the Microsoft site.

If you would like to see a full integration of Twilio APIs in a .NET Core application then check out this free 5-part video series I created. It’s separate from this blog post tutorial but will give you a full run-down of many APIs at once.

Overview

We are going to write a fair bit of code today so let’s have a look at an overview of what we need to do.

  • When a text message comes into Twilio, Twilio will make an HTTP POST request to our application endpoint.
  • Our application will then call a method in the Controller that will get the body of the incoming text and pass it along to the TextToSpeechService.
  • The TextToSpeechService will then use the body to make a call to the Azure Speech Service, which will return an MP3.
  • We will then save this MP3 to Azure Blob Storage and return the path of the MP3 to the Controller.
  • We then use the Twilio NuGet package to create a TwiML response which instructs Twilio what to do for the call.  In this case, play our soundbite and hang up.

Download the Outline Project

To get you started I have created an outline project which you will need to download or clone from GitHub.  The completed project can be found on the completed branch.

The outline project has three folders:

  • Controllers: this folder has the default ValuesController and the one we shall be editing, the SpeechController.
  • Services: this folder has both the AuthenticationService and TextToSpeechService files as well as their respective interfaces.
  • Models: this folder has all the models needed to map your appsettings.json as well as the models needed for the requests.

Let’s restore the NuGet packages to ensure we have them all downloaded, then build and run the project to make sure all is in order.

The project comes with a default controller from the dotnet template called ValuesController, which I leave in for debugging purposes. Browse to /api/values and we should see ["value1","value2"] displayed in the browser.
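For reference, the GET action on the template’s ValuesController is what produces that output:

[HttpGet]
public ActionResult<IEnumerable<string>> Get()
{
    // default action from the dotnet webapi template
    return new string[] { "value1", "value2" };
}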

Configuring App Settings

We will need to map the various keys from Azure, Twilio and Cognitive Services into the appsettings.json file.  These keys are sensitive so I suggest using User Secrets to help prevent you committing them to a public repository by accident.  Check out this blog post on User Secrets in a .NET Core Web App if you’re new to working with User Secrets.
Add the following configuration to your secrets.json file, inserting your own values in the relevant places.

"TwilioAccount": {
    "AccountSid": "ACCOUNT_SID",
    "AuthToken": "AUTH_TOKEN"
  },
  "CsAccount": {
    "SubscriptionKey": "COGNITIVE_SERVICES_API_KEY"
  },
  "StorageCreds": {
    "Account": "STORAGE_ACCOUNT_NAME",
    "Key": "STORAGE_ACCOUNT_API_KEY" 
  }

The structure above should match the appsettings.json file, but remember the appsettings.json file gets checked in to source control, so leave the values blank there or use a reminder such as “value set in user secrets”.
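For example, the checked-in appsettings.json could look like this, with reminders in place of the real values:

{
  "TwilioAccount": {
    "AccountSid": "value set in user secrets",
    "AuthToken": "value set in user secrets"
  },
  "CsAccount": {
    "SubscriptionKey": "value set in user secrets"
  },
  "StorageCreds": {
    "Account": "value set in user secrets",
    "Key": "value set in user secrets"
  }
}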

If you look in the Startup.cs file, you will see where I have mapped the app settings to our values.

...
 services.Configure<TwilioAccount>(Configuration.GetSection("TwilioAccount"));
 services.Configure<CsAccount>(Configuration.GetSection("CsAccount"));
 services.Configure<StorageCreds>(Configuration.GetSection("StorageCreds"));
...

We can then inject these settings into any class using IOptions and the Options Pattern.
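As a quick sketch, here is how the TwilioAccount settings could be injected into one of our classes (the class name is just for illustration):

using Microsoft.Extensions.Options;

public class ExampleService
{
    private readonly TwilioAccount _twilioAccount;

    // IOptions<TwilioAccount> is resolved by the built-in DI container
    // using the mapping configured in Startup.cs above
    public ExampleService(IOptions<TwilioAccount> twilioOptions)
    {
        _twilioAccount = twilioOptions.Value;
    }
}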

Fetching an Auth token from Cognitive Services

To enable us to talk to the Speech Service, we will need to be issued an auth token by the Azure Speech Service.
To do this, let’s update the AuthenticationService.cs in the Services folder. Add the following code, where we create a new HTTP request with the token API URI and our Azure Speech Service subscription key to receive our auth token:

...
public async Task<string> FetchTokenAsync()
{
  using (var client = new HttpClient())
  {
     client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", _subscriptionKey);
     var uriBuilder = new UriBuilder(FetchTokenUri);

     var result = await client.PostAsync(uriBuilder.Uri.AbsoluteUri, null);
     return await result.Content.ReadAsStringAsync();
  }
}
...
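The FetchTokenUri value comes with the outline project. If you need to set it yourself, it is the token-issuing endpoint for your region; for example, assuming the westus region:

// Token-issuing endpoint for Cognitive Services; substitute your own region.
private const string FetchTokenUri =
    "https://westus.api.cognitive.microsoft.com/sts/v1.0/issueToken";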

Converting Text to Speech

Next, we will update the service that handles the conversion of text to speech via an HTTP request to Speech Services. Go to the TextToSpeechService.cs file and add the following code:

public async Task<HttpSpeechResponse> GetSpeech(string body, string from)
{
   var response = new HttpSpeechResponse();
   //below is the endpoint I was given when I added Speech Services, you can substitute it 
   //for the one for your region: 
   //https://eastasia.tts.speech.microsoft.com/cognitiveservices/v1 
   //https://northeurope.tts.speech.microsoft.com/cognitiveservices/v1
   var endpoint = "https://westus.tts.speech.microsoft.com/cognitiveservices/v1";
   var token = await _authenticationService.FetchTokenAsync();
   using (var client = new HttpClient())
   {
       client.DefaultRequestHeaders.Add("X-Microsoft-OutputFormat", "audio-16khz-128kbitrate-mono-mp3");
       client.DefaultRequestHeaders.Add("User-Agent", "autotexter");

       client.DefaultRequestHeaders.Add("Authorization", token);

       var uriBuilder = new UriBuilder(endpoint);

       var text = $@"
              <speak version='1.0' xmlns=""http://www.w3.org/2001/10/synthesis"" xml:lang='en-US'>
                <voice  name='Microsoft Server Speech Text to Speech Voice (en-GB, Susan, Apollo)'>
                   You had a text message from {from}
                    <break time = ""100ms"" /> The message was
                    <break time=""100ms""/> {body}
                </voice> 
              </speak>
       ";

       var content = new StringContent(text, Encoding.UTF8, "application/ssml+xml");

       var result = await client
                    .PostAsync(uriBuilder.Uri.AbsoluteUri, content)
                    .ConfigureAwait(false);

       response.Code = result.StatusCode;
       if (result.IsSuccessStatusCode)
       {
         // add code to save the soundbite here
       }
       return response;
   }
}

In the above, I have created a variable called text and assigned it a Speech Synthesis Markup Language (SSML) string, passing in the from number and the message body text. You can have fun and play around with things like speed and pronunciation, or even change the voice and accent of the speaker.
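For example, wrapping the message body in a prosody element slows the speech down (the rate value below is just an illustration):

<voice name='Microsoft Server Speech Text to Speech Voice (en-GB, Susan, Apollo)'>
  <prosody rate='-20%'>
    <break time='100ms'/> {body}
  </prosody>
</voice>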

If you go to the Startup.cs file, you will see our two new services have already been added, ready for .NET Core’s built-in dependency injection to pick up.

...
services.AddScoped<IAuthenticationService, AuthenticationService>();
services.AddScoped<ITextToSpeechService, TextToSpeechService>();
services.AddMvc().SetCompatibilityVersion(CompatibilityVersion.Version_2_1);
...

I have chosen to configure my services as Scoped as I want the instance to be around for the lifetime of the request.  You can read more on the service registration options on the Microsoft documentation.
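As a quick comparison, here is a sketch of the three built-in lifetimes using one of our services:

// Transient: a new instance every time the service is requested
services.AddTransient<ITextToSpeechService, TextToSpeechService>();

// Scoped: one instance per HTTP request (what we use here)
services.AddScoped<ITextToSpeechService, TextToSpeechService>();

// Singleton: a single instance for the lifetime of the application
services.AddSingleton<ITextToSpeechService, TextToSpeechService>();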

Saving your new Soundbite

Speech Services returns the audio as a byte stream, and now we need to store it someplace. We will be using an Azure Storage Blob.

Create a private method in the TextToSpeechService.cs that will write the MP3 to the blob and then call that in our public method. This private method will return the path to the newly stored item, then we can pass that forward to Twilio.

public async Task<HttpSpeechResponse> GetSpeech(string body, string from)
{
...
        if (result.IsSuccessStatusCode)
        {
           // read the MP3 bytes from the response and buffer them in memory
           var stream = await result.Content.ReadAsStreamAsync();

           using (MemoryStream byteArray = new MemoryStream())
           {
              await stream.CopyToAsync(byteArray);

              response.Path = await StoreSoundbite(byteArray.ToArray())
                            .ConfigureAwait(false);
           }
        }
         return response;
}

private async Task<string> StoreSoundbite(byte[] soundBite)
{
    // generate a random file name with an .mp3 extension for the new blob
    var blobPath = "PATH_TO_YOUR_BLOB_STORAGE";
    var name = Path.GetRandomFileName();
    var filename = Path.ChangeExtension(name, ".mp3");
    var urlString = blobPath + filename;

    var creds = new StorageCredentials(_storageCreds.Account, _storageCreds.Key);
    var blob = new CloudBlockBlob(new Uri(urlString), creds);
    blob.Properties.ContentType = "audio/mpeg";

    if (!(await blob.ExistsAsync().ConfigureAwait(false)))
    {
        await blob
              .UploadFromByteArrayAsync(soundBite, 0, soundBite.Length)
              .ConfigureAwait(false);
    }

    return urlString;
}
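If the outline project doesn’t already reference them, StorageCredentials and CloudBlockBlob come from the WindowsAzure.Storage NuGet package:

using Microsoft.WindowsAzure.Storage.Auth;  // StorageCredentials
using Microsoft.WindowsAzure.Storage.Blob;  // CloudBlockBlob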
 

You will need to construct a PATH_TO_YOUR_BLOB_STORAGE URI to match your Azure Storage URL and container name.

You can find the Azure Storage URL from within the portal when you click on your storage resource.

Your container name is just the name of your storage container and can be found under Containers within the storage resource. Note that Twilio will need to fetch the soundbite over the public internet, so make sure the container’s access level allows public read access for blobs (or serve the file another way, such as via a SAS URL).

It should look something like this:

//https://your-storage-name.blob.core.windows.net/your-container-name/
//e.g.
//https://<STORAGE_NAME>.blob.core.windows.net/<CONTAINER_NAME>/

Interfacing with Twilio

We need to update our SpeechController.cs to accept a POST from Twilio that will kick off our conversion of text to speech.

First, add the API endpoint that Twilio uses as the webhook for when an SMS comes in.
We will only map the incoming message SID from Twilio to the TwilioResponse, as that is all we need to pass on to the next stage.

In the response, we tell Twilio that it needs to initiate a new voice call and we pass it a URI, containing the incoming message SID, that will tell Twilio what we require in the voice call.

We return an empty content so Twilio knows that we don’t want to reply to the incoming text.

...
[HttpPost]
[Route("voice")]
public async Task<IActionResult> Voice([FromForm]TwilioResponse twilioResponse)
{
    // Replace SITE_URL with your public-facing URL (e.g. your ngrok URL),
    // including the https:// scheme, so Twilio can reach the instructions.
    var siteUrl = "SITE_URL";
    await CallResource.CreateAsync(
        to: new PhoneNumber("YOUR_TELEPHONE_NUMBER"),
        from: "TWILIO_NUMBER",
        url: new Uri($"{siteUrl}/api/speech/call/{twilioResponse.MessageSid}"),
        method: "GET");
    return Content("");
} 
...
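Note that the Twilio client must be initialised with your credentials before CallResource.CreateAsync can succeed. One way to do this, assuming the outline project doesn’t already, is in the controller’s constructor using the injected TwilioAccount settings:

public SpeechController(IOptions<TwilioAccount> twilioAccount, ITextToSpeechService textToSpeechService)
{
    _textToSpeechService = textToSpeechService;
    TwilioClient.Init(twilioAccount.Value.AccountSid, twilioAccount.Value.AuthToken);
}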

Now we need to add the API endpoint that Twilio will call after the webhook above.  The call will expect the message SID from the route so, using annotations, we can set that up.
The code below makes a call to Twilio, using the message SID to fetch the incoming message body.  It then passes the message body to the TextToSpeechService, which creates the soundbite and returns the URI of the stored soundbite.

We then use the Twilio helper library to create a TwiML response telling Twilio to play our soundbite, passing in the returned URI, and then hang up.

...
[HttpGet]
[Route("call/{messageSid}")]
public async Task<TwiMLResult> Call([FromRoute]string messageSid)
{
    var message = await MessageResource.FetchAsync(pathSid: messageSid);
    var response = await _textToSpeechService
                .GetSpeech(message.Body, message.From.ToString());
    var twiml = new VoiceResponse();
    twiml.Play(new Uri(response.Path));
    twiml.Hangup();
    return TwiML(twiml);
}    
...

Wait… That was a lot of code!

It certainly was, so let’s just recap what we have done.
When a text message comes into Twilio, Twilio will make an HTTP POST request to our Voice action in the SpeechController.
Our application will then create a new call and pass in a URI to the instructions for the call.  That URI will be the route to the Call action on the SpeechController and it will pick up the message Sid off the route.
With this message Sid we can fetch all the details of the text message and then pass them into the TextToSpeechService which in turn, returns the URI of the stored soundbite.
We then use the helper library to create a TwiML response which instructs Twilio on what to do for the call.  In this case, play our soundbite and hang up.

Setting up ngrok

We can use ngrok to test our endpoint rather than deploy to a server, as it creates a public facing URL that maps to our project running locally.

Once installed, run the following line in your command line to start ngrok, replacing <PORT_NUMBER> with the port your localhost is running on.

> ngrok http <PORT_NUMBER> -host-header="localhost:<PORT_NUMBER>"

You will then see output from ngrok that includes your public-facing forwarding URL.

Copy the public-facing URL and update the SITE_URL in the SpeechController with it.

Let’s run the solution, either by pressing Run in the IDE or with dotnet run in the CLI.

Now you can do a quick check using the default ValuesController we left in from the template.
Enter the ngrok URL https://<SUBDOMAIN>.ngrok.io/api/values into your browser and you should see ["value1","value2"] displayed once again.

Setting up the Twilio webhook and trying it out

To enable Twilio to make the initial request that fires off our string of events, we need to set up the webhook.

Go to the Twilio console and find the number that you created for this project.  Next, take your ngrok API endpoint https://<SUBDOMAIN>.ngrok.io/api/speech/voice and paste it into the A MESSAGE COMES IN section.

You should now be able to send yourself a text message and then shortly after receive a phone call which plays the text-to-speech soundbite. Give yourself a high-five – that was a lot of code!

What Next?

There is so much you can do now!  Perhaps you can create a lookup of all your contacts using Azure table storage and cross-reference it with the incoming text and have Speech Services tell you the name of the sender. Or you could write a webjob that clears out your blob storage on a regular basis.  You could even extract the code that creates the Cognitive Services Auth token into an Azure Function and re-use it across multiple apps!

Let me know what you come up with and feel free to get in touch with any questions. I can’t wait to see what you build!