Live Transcribing Phone Calls using Twilio Media Streams and Google Speech-to-Text

September 12, 2019
Written by

With Twilio Media Streams, you can now extend the capabilities of your Twilio-powered voice application with real time access to the raw audio stream of phone calls. For example, we can build tools that transcribe the speech from a phone call live into a browser window, run sentiment analysis of the speech on a phone call or even use voice biometrics to identify individuals.

This blog post will guide you step-by-step through transcribing speech from a phone call into text, live in the browser using Twilio and Google Speech-to-Text with Node.js.

If you want to skip the step-by-step instructions, you can clone my GitHub repository and follow the README to get set up, or if you prefer video, check out the video walkthrough here.

Requirements

Before we can get started, you’ll need to make sure to have:

Setting up the Local Server

Twilio Media Streams use the WebSocket API to live stream the audio from the phone call to your application. Let’s get started by setting up a server that can handle WebSocket connections.

Open your terminal, create a new project folder, and create an index.js file inside it.

$ mkdir twilio-streams
$ cd twilio-streams
$ touch index.js

To handle HTTP requests we will use Node's built-in http module and Express. For WebSocket connections we will use ws, a lightweight WebSocket library for Node.js that can act as both client and server.

In the terminal run these commands to install ws and Express:

$ npm install ws express

Open your index.js file and add the following code to set up your server.

const WebSocket = require("ws");
const express = require("express");
const app = express();
const server = require("http").createServer(app);
const wss = new WebSocket.Server({ server });

// Handle Web Socket Connection
wss.on("connection", function connection(ws) {
  console.log("New Connection Initiated");
});

//Handle HTTP Request
app.get("/", (req, res) => res.send("Hello World"));

console.log("Listening at Port 8080");
server.listen(8080);

Save and run index.js with node index.js. Open your browser and navigate to http://localhost:8080. Your browser should show Hello World.

Hello World in the Browser

Now that we know HTTP requests are working, let’s test our WebSocket connection. Open your browser’s console and run this command:

var connection = new WebSocket('ws://localhost:8080')

If you go back to the terminal you should see a log saying New Connection Initiated.

Connect to WebSocket Server from browser

Setting up Phone Calls

Let’s set up our Twilio number to connect to our WebSocket server.

First we need to modify our server to handle the WebSocket messages that will be sent from Twilio when our phone call starts streaming. There are four main message events we want to listen for: connected, start, media and stop.

  • Connected: When Twilio makes a successful WebSocket connection to a server
  • Start: When Twilio starts streaming Media Packets
  • Media: Encoded Media Packets (This is the Raw Audio)
  • Stop: When streaming ends the stop event is sent.
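To make the four events a little more concrete, here is a minimal sketch of how each message might be told apart once it is parsed. The sample messages are abbreviated, and every SID and payload in them is a made-up placeholder; describeMessage is a hypothetical helper used only for illustration.

```javascript
// Abbreviated sample messages; every SID and payload here is a placeholder.
const sampleMessages = [
  '{"event":"connected","protocol":"Call","version":"1.0.0"}',
  '{"event":"start","streamSid":"MZ0000","start":{"callSid":"CA0000","tracks":["inbound"]}}',
  '{"event":"media","streamSid":"MZ0000","media":{"track":"inbound","payload":"/v7+/w=="}}',
  '{"event":"stop","streamSid":"MZ0000"}'
];

// Turn one raw WebSocket message into a short human-readable summary.
function describeMessage(raw) {
  const msg = JSON.parse(raw);
  switch (msg.event) {
    case "connected":
      return "connected";
    case "start":
      return `start ${msg.streamSid}`;
    case "media":
      return `media (${msg.media.payload.length} base64 chars)`;
    case "stop":
      return "stop";
    default:
      return `unknown event: ${msg.event}`;
  }
}

sampleMessages.forEach(raw => console.log(describeMessage(raw)));
```

The default branch is there so that an event type we have not planned for shows up in the log instead of being silently ignored.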

Modify your index.js file to log messages when each of these messages arrive at our server.

const WebSocket = require("ws");
const express = require("express");
const app = express();
const server = require("http").createServer(app);
const wss = new WebSocket.Server({ server });

// Handle Web Socket Connection
wss.on("connection", function connection(ws) {
  console.log("New Connection Initiated");

  ws.on("message", function incoming(message) {
    const msg = JSON.parse(message);
    switch (msg.event) {
      case "connected":
        console.log(`A new call has connected.`);
        break;
      case "start":
        console.log(`Starting Media Stream ${msg.streamSid}`);
        break;
      case "media":
        console.log(`Receiving Audio...`)
        break;
      case "stop":
        console.log(`Call Has Ended`);
        break;
    }
  });

});

//Handle HTTP Request
app.get("/", (req, res) => res.send("Hello World"));

console.log("Listening at Port 8080");
server.listen(8080);

Now we need to set up our Twilio number to start streaming audio to our server. We can control what happens when we call our Twilio number using TwiML. We’ll create an HTTP route that returns TwiML instructing Twilio to stream audio from the call to our server.

Add the following POST route to your index.js file.

const WebSocket = require("ws");
const express = require("express");
const app = express();
const server = require("http").createServer(app);
const wss = new WebSocket.Server({ server });

// Handle Web Socket Connection
wss.on("connection", function connection(ws) {
  console.log("New Connection Initiated");

  ws.on("message", function incoming(message) {
    const msg = JSON.parse(message);
    switch (msg.event) {
      case "connected":
        console.log(`A new call has connected.`);
        break;
      case "start":
        console.log(`Starting Media Stream ${msg.streamSid}`);
        break;
      case "media":
        console.log(`Receiving Audio...`)
        break;
      case "stop":
        console.log(`Call Has Ended`);
        break;
    }
  });

});

//Handle HTTP Request
app.get("/", (req, res) => res.send("Hello World"));

app.post("/", (req, res) => {
  res.set("Content-Type", "text/xml");

  res.send(`
    <Response>
      <Start>
        <Stream url="wss://${req.headers.host}/"/>
      </Start>
      <Say>I will stream the next 60 seconds of audio through your websocket</Say>
      <Pause length="60" />
    </Response>
  `);
});

console.log("Listening at Port 8080");
server.listen(8080);
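The POST route builds the stream URL from the Host header of the incoming request, so whichever public host Twilio used to reach the webhook is also where it opens the WebSocket. As a rough sketch of that idea in isolation, the hypothetical buildTwiml helper below takes a host name and returns the same TwiML; the host shown is the placeholder used elsewhere in this post.

```javascript
// Build the TwiML response for a given public host name.
function buildTwiml(host) {
  return `
    <Response>
      <Start>
        <Stream url="wss://${host}/"/>
      </Start>
      <Say>I will stream the next 60 seconds of audio through your websocket</Say>
      <Pause length="60" />
    </Response>
  `.trim();
}

// Placeholder host; in the route this value comes from req.headers.host.
console.log(buildTwiml("xxxxxxxx.ngrok.io"));
```

Pulling the template into a function like this makes it easy to check the generated XML without placing a call.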

For Twilio to connect to your local server we need to expose the port to the internet. The easiest way to do that is using the Twilio CLI. Open a new Terminal to continue.

First, let’s buy a phone number. In your terminal run the following command. I have used the GB country code to buy a mobile number, but feel free to change this for a number local to you. Hold on to the number’s Friendly Name once the response is returned.

$ twilio phone-numbers:buy:mobile --country-code GB

Finally, let’s update the phone number to point to our localhost URL. We need to use ngrok to create a tunnel to our localhost port and expose it to the internet. In a new terminal window run the following command:

$ ngrok http 8080

You should get output with a forwarding address like the one below. Copy the URL to the clipboard, making sure you copy the https URL.

Forwarding                    https://xxxxxxxx.ngrok.io -> http://localhost:8080

Running ngrok in terminal and copying https URL

Back in the terminal window where we bought our Twilio number, let’s update our phone number to make an HTTP POST request to our server.

Run the following command:

$ twilio phone-numbers:update $TWILIO_NUMBER --voice-url https://xxxxxxxx.ngrok.io

Head over to a new terminal window and run your index.js file. Now call your Twilio phone number and you should hear the following prompt, “I will stream the next 60 seconds of audio through your websocket”. The terminal should be logging Receiving Audio…

Receiving Audio in Terminal

NOTE: If your log doesn’t match the expected output, make sure that you have at least two terminals running: one running your server (index.js) and one running ngrok.

Transcribing Speech into Text

At this point we have audio from our call streaming to our server. Today, we’ll be using Google Cloud Platform’s Speech-to-Text API to transcribe the voice data from the phone call.

There is some setup that we need to do before we get started.

  1. Install and initialize the Cloud SDK
  2. Set up a new GCP Project
    • Create or select a project.
    • Enable the Google Speech-to-Text API for that project.
    • Create a service account.
    • Download a private key as JSON.
  3. Set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the file path of the JSON file that contains your service account key. This variable only applies to your current shell session, so if you open a new session, set the variable again.
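Because the variable has to be set again in every new shell session, it is easy to start the server without it. The snippet below is a small optional sketch of a startup check; credentialsProblem is a hypothetical helper, not part of the Google library, and it only confirms that the variable is set and points at an existing file.

```javascript
const fs = require("fs");

// Return a description of the problem, or null when the setup looks usable.
function credentialsProblem(env) {
  const keyPath = env.GOOGLE_APPLICATION_CREDENTIALS;
  if (!keyPath) return "GOOGLE_APPLICATION_CREDENTIALS is not set";
  if (!fs.existsSync(keyPath)) return `No key file found at ${keyPath}`;
  return null;
}

const problem = credentialsProblem(process.env);
console.log(problem || "Google credentials look OK");
```

It does not validate the contents of the key file, so a passing check still leaves room for authentication errors from the API itself.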

Run the following command to install the Google Cloud Speech-to-Text client libraries.

$ npm install --save @google-cloud/speech

Now let’s use it in our code.

First we’ll include the Speech Client from the Google Speech-to-Text library, then we’ll configure a Transcription Request. To get live transcription results, make sure you set interimResults to true. I have also set the language code to en-GB, but feel free to set yours to a different language and region.

const WebSocket = require("ws");
const express = require("express");
const app = express();
const server = require("http").createServer(app);
const wss = new WebSocket.Server({ server });

//Include Google Speech to Text
const speech = require("@google-cloud/speech");
const client = new speech.SpeechClient();

//Configure Transcription Request
const request = {
  config: {
    encoding: "MULAW",
    sampleRateHertz: 8000,
    languageCode: "en-GB"
  },
  interimResults: true
};

// Handle Web Socket Connection
wss.on("connection", function connection(ws) {
  console.log("New Connection Initiated");

  ws.on("message", function incoming(message) {
    const msg = JSON.parse(message);
    switch (msg.event) {
      case "connected":
        console.log(`A new call has connected.`);
        break;
      case "start":
        console.log(`Starting Media Stream ${msg.streamSid}`);
        break;
      case "media":
        console.log(`Receiving Audio...`)
        break;
      case "stop":
        console.log(`Call Has Ended`);
        break;
    }
  });

});

//Handle HTTP Request
app.get("/", (req, res) => res.send("Hello World"));

app.post("/", (req, res) => {
  res.set("Content-Type", "text/xml");

  res.send(`
    <Response>
      <Start>
        <Stream url="wss://${req.headers.host}/"/>
      </Start>
      <Say>I will stream the next 60 seconds of audio through your websocket</Say>
      <Pause length="60" />
    </Response>
  `);
});

console.log("Listening at Port 8080");
server.listen(8080);
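The MULAW encoding and 8000 Hz sample rate in the request describe the audio Twilio sends: each media message carries a base64-encoded chunk of mu-law audio. As a rough illustration of what is inside one of those payloads, the sketch below decodes a chunk and works out how much audio it holds. The 160-byte packet is a made-up example, and decodePayload and durationMs are hypothetical helpers.

```javascript
const SAMPLE_RATE_HZ = 8000; // matches sampleRateHertz in the request config

// Decode one base64 media payload into raw mu-law bytes.
function decodePayload(payload) {
  return Buffer.from(payload, "base64");
}

// mu-law uses one byte per sample, so bytes / sample rate gives the duration.
function durationMs(audio) {
  return (audio.length * 1000) / SAMPLE_RATE_HZ;
}

// Placeholder packet: 160 bytes of 0xff, which is silence in mu-law.
const payload = Buffer.alloc(160, 0xff).toString("base64");
const audio = decodePayload(payload);
console.log(`${audio.length} bytes = ${durationMs(audio)} ms of audio`);
```

If the encoding or sample rate in the request did not match the incoming audio, the transcription would most likely come back empty or garbled, so these two settings are worth double checking first when results look wrong.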

Now let’s create a new stream to send audio from our server to the Google API. We will call it the recognizeStream and we will write our audio packets from our phone call to this stream. When the call has ended we will call .destroy() to end the stream.

Edit your code to include these changes.

const WebSocket = require("ws");
const express = require("express");
const app = express();
const server = require("http").createServer(app);
const wss = new WebSocket.Server({ server });

//Include Google Speech to Text
const speech = require("@google-cloud/speech");
const client = new speech.SpeechClient();

//Configure Transcription Request
const request = {
  config: {
    encoding: "MULAW",
    sampleRateHertz: 8000,
    languageCode: "en-GB"
  },
  interimResults: true
};

// Handle Web Socket Connection
wss.on("connection", function connection(ws) {
  console.log("New Connection Initiated");

  let recognizeStream = null;

  ws.on("message", function incoming(message) {
    const msg = JSON.parse(message);
    switch (msg.event) {
      case "connected":
        console.log(`A new call has connected.`);

        // Create Stream to the Google Speech to Text API
        recognizeStream = client
          .streamingRecognize(request)
          .on("error", console.error)
          .on("data", data => {
            console.log(data.results[0].alternatives[0].transcript);
          });
        break;
      case "start":
        console.log(`Starting Media Stream ${msg.streamSid}`);
        break;
      case "media":
        // Write Media Packets to the recognize stream
        recognizeStream.write(msg.media.payload);
        break;
      case "stop":
        console.log(`Call Has Ended`);
        recognizeStream.destroy();
        break;
    }
  });
});

//Handle HTTP Request
app.get("/", (req, res) => res.send("Hello World"));

app.post("/", (req, res) => {
  res.set("Content-Type", "text/xml");

  res.send(`
    <Response>
      <Start>
        <Stream url="wss://${req.headers.host}/"/>
      </Start>
      <Say>I will stream the next 60 seconds of audio through your websocket</Say>
      <Pause length="60" />
    </Response>
  `);
});

console.log("Listening at Port 8080");
server.listen(8080);
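The data handler above reads data.results[0].alternatives[0] directly, which would throw if a response ever arrived without a result. Here is a minimal, more defensive sketch, assuming the response shape used above plus the isFinal flag that marks a result as no longer interim. extractTranscript is a hypothetical helper and the sample responses are made up.

```javascript
// Pull the top transcript out of one streaming response, or null if absent.
function extractTranscript(data) {
  const result = data && data.results && data.results[0];
  const alternative = result && result.alternatives && result.alternatives[0];
  if (!alternative) return null;
  return { text: alternative.transcript, isFinal: Boolean(result.isFinal) };
}

// Made-up responses for illustration.
const interim = { results: [{ alternatives: [{ transcript: "hello wor" }] }] };
const final = {
  results: [{ alternatives: [{ transcript: "hello world" }], isFinal: true }]
};

console.log(extractTranscript(interim));
console.log(extractTranscript(final));
console.log(extractTranscript({ results: [] }));
```

Inside the data callback you could call this helper and skip logging whenever it returns null.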

Restart your server, call your Twilio phone number and start talking down the phone. You should see interim transcription results begin to appear in your terminal.

Live Transcription from phone call in Terminal

Sending Live Transcription to the Browser

One of the benefits of using WebSockets is that we can broadcast messages to other clients, including browsers.

Let’s modify our code to broadcast our interim transcription results to all connected clients. We’ll also modify the GET route: rather than sending ‘Hello World’, let’s send an HTML file. We will also need the path module, so don’t forget to require it.

Modify your index.js file like below.

const WebSocket = require("ws");
const express = require("express");
const app = express();
const server = require("http").createServer(app);
const wss = new WebSocket.Server({ server });
const path = require("path");

//Include Google Speech to Text
const speech = require("@google-cloud/speech");
const client = new speech.SpeechClient();

//Configure Transcription Request
const request = {
  config: {
    encoding: "MULAW",
    sampleRateHertz: 8000,
    languageCode: "en-GB"
  },
  interimResults: true
};

// Handle Web Socket Connection
wss.on("connection", function connection(ws) {
  console.log("New Connection Initiated");

  let recognizeStream = null;

  ws.on("message", function incoming(message) {
    const msg = JSON.parse(message);
    switch (msg.event) {
      case "connected":
        console.log(`A new call has connected.`);
        // Create Stream to the Google Speech to Text API
        recognizeStream = client
          .streamingRecognize(request)
          .on("error", console.error)
          .on("data", data => {
            console.log(data.results[0].alternatives[0].transcript);
            wss.clients.forEach(client => {
              if (client.readyState === WebSocket.OPEN) {
                client.send(
                  JSON.stringify({
                    event: "interim-transcription",
                    text: data.results[0].alternatives[0].transcript
                  })
                );
              }
            });
          });
        break;
      case "start":
        console.log(`Starting Media Stream ${msg.streamSid}`);
        break;
      case "media":
        // Write Media Packets to the recognize stream
        recognizeStream.write(msg.media.payload);
        break;
      case "stop":
        console.log(`Call Has Ended`);
        recognizeStream.destroy();
        break;
    }
  });

});

//Handle HTTP Request
app.get("/", (req, res) => res.sendFile(path.join(__dirname, "/index.html")));

app.post("/", (req, res) => {
  res.set("Content-Type", "text/xml");

  res.send(`
    <Response>
      <Start>
        <Stream url="wss://${req.headers.host}/"/>
      </Start>
      <Say>I will stream the next 60 seconds of audio through your websocket</Say>
      <Pause length="60" />
    </Response>
  `);
});

console.log("Listening at Port 8080");
server.listen(8080);
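The broadcast step can be looked at on its own: serialize the message once, then send it to every client whose socket is still open. Below is a rough sketch using stand-in client objects so it runs without a server. broadcast is a hypothetical helper, and in the real code the clients would come from wss.clients.

```javascript
const OPEN = 1; // same numeric value as WebSocket.OPEN

// Send one JSON message to every open client and return how many were sent.
function broadcast(clients, event, text) {
  const message = JSON.stringify({ event, text });
  let sent = 0;
  clients.forEach(client => {
    if (client.readyState === OPEN) {
      client.send(message);
      sent += 1;
    }
  });
  return sent;
}

// Stand-in clients for illustration only.
const received = [];
const fakeClients = new Set([
  { readyState: OPEN, send: m => received.push(m) },
  { readyState: 3, send: m => received.push(m) } // 3 = CLOSED
]);

console.log(broadcast(fakeClients, "interim-transcription", "hello"), received);
```

Note that wss.clients contains every connection to the server, so the socket Twilio opened for the call receives these messages as well as the browser.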

Let’s set up a web page to handle the interim transcriptions and display them in the browser.

Create a new file, index.html and include the following:

<!DOCTYPE html>
<html>
  <head>
    <title>Live Transcription with Twilio Media Streams</title>
  </head>
  <body>
    <h1>Live Transcription with Twilio Media Streams</h1>
    <h3>
      Call your Twilio Number, start talking and watch your words magically
      appear.
    </h3>
    <p id="transcription-container"></p>
    <script>
      document.addEventListener("DOMContentLoaded", event => {
        const webSocket = new WebSocket("ws://localhost:8080");
        webSocket.onmessage = function(msg) {
          const data = JSON.parse(msg.data);
          if (data.event === "interim-transcription") {
            document.getElementById("transcription-container").innerHTML =
              data.text;
          }
        };
      });
    </script>
  </body>
</html>
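The page above connects to ws://localhost:8080, which is fine while the browser runs on the same machine as the server. If you later open the page through a different host, such as your ngrok URL, one option is to derive the WebSocket address from the page's own location. Here is a minimal sketch of that idea; socketUrlFor is a hypothetical helper, and in the browser you would pass it window.location.

```javascript
// Map a page location to the matching WebSocket URL on the same host.
function socketUrlFor(location) {
  const scheme = location.protocol === "https:" ? "wss:" : "ws:";
  return `${scheme}//${location.host}/`;
}

// Plain objects stand in for window.location so this runs outside a browser.
console.log(socketUrlFor({ protocol: "http:", host: "localhost:8080" }));
console.log(socketUrlFor({ protocol: "https:", host: "xxxxxxxx.ngrok.io" }));
```

Switching to wss for https pages matters because browsers block insecure WebSocket connections from pages loaded over HTTPS.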

Restart your server, load localhost:8080 in your browser then give your Twilio phone number a call and watch your words begin to appear in your browser.

Live Transcription from phone call in Browser

Wrapping up

Congratulations! You can now harness the power of Twilio Media Streams to extend your voice applications. Now that you have live transcription, try translating the text with Google’s Translate API to create live speech translation, or run sentiment analysis on the audio stream to work out the emotions behind the speech.

If you have any questions, feedback or just want to show me what you build, feel free to reach out to me: