Real-Time Phone Call Transcription with Node.js, AssemblyAI, and Twilio
In this tutorial, you will build an application that transcribes a phone call to text in real-time. When someone calls your Twilio phone number, you will use the Media Streams API to stream the voice call audio to your WebSocket server. Your server will pass the voice audio to AssemblyAI's real-time transcription service to get the text back live.
Prerequisites
You'll need these things to follow along:
- A Twilio account
- A Twilio phone number
- Experience with the Twilio Voice webhook and TwiML
- The ngrok CLI (or alternative tunnel service)
- An upgraded AssemblyAI account
You can experiment with AssemblyAI's APIs on the free tier, but the real-time transcription feature requires an upgraded account, so make sure you have upgraded your account before continuing.
Create a WebSocket server for Twilio media streams
You'll need to create a Node.js project and add some modules to build your application.
First, open up your terminal and run the following commands to create a Node.js project:
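The original commands aren't preserved here, but a standard way to scaffold the project (the folder name is just an example) would be:

```shell
mkdir twilio-transcriber
cd twilio-transcriber
npm init -y
```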
Then run the following command to add the necessary NPM dependencies:
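Based on the dependency list below, the install command would be:

```shell
npm install express ws
```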
- express: This is a web framework for Node.js. Express makes it easier to route incoming HTTP requests and send back HTTP responses.
- ws: This is a WebSocket client and server library for Node.js. You'll use ws to create a WebSocket server to which Twilio media streams will connect.
Open the package.json file in your preferred IDE and add the following property:
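The property in question is "type" set to "module" (shown here in a trimmed-down package.json):

```json
{
  "type": "module"
}
```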
This tells Node.js that you'll be using the ES module syntax for importing and exporting modules and not the CommonJS syntax.
Next, create a file named server.js with the following code:
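The original listing isn't reproduced here; a minimal sketch of server.js consistent with the description follows. The port, the exact `<Say>` wording, and building the WebSocket URL from the request host are assumptions:

```javascript
import express from 'express';

const app = express();

// Respond to GET requests with a plain-text message.
app.get('/', (req, res) => res.send('Twilio media stream transcriber'));

// Respond to POST requests (Twilio's incoming-call webhook) with TwiML.
app.post('/', (req, res) => {
  res.type('xml').send(
    `<Response>
       <Say>Speak to see your speech transcribed in the console</Say>
       <Connect>
         <Stream url='wss://${req.headers.host}' />
       </Connect>
     </Response>`
  );
});

// Capture the HTTP server so a WebSocket server can share it later.
const server = app.listen(3000);
console.log('Listening on port 3000');
```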
The code above responds to HTTP GET requests with "Twilio media stream transcriber", and to HTTP POST requests with the following TwiML:
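The exact `<Say>` message from the original listing isn't preserved; a representative TwiML response would be:

```xml
<Response>
  <Say>Speak to see your speech transcribed in the console</Say>
  <Connect>
    <Stream url='wss://yourdomain.ngrok-free.app' />
  </Connect>
</Response>
```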
This TwiML tells Twilio to speak a message to the caller using the <Say> verb, and then to create a media stream that connects to your WebSocket server using the <Connect> verb.
Next, add the following WebSocket server code before console.log('Listening on port 3000');:
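A sketch of that WebSocket server code, assuming the Express app is started with `const server = app.listen(3000)` and that `WebSocketServer` is imported from the ws package at the top of the file:

```javascript
// Requires: import { WebSocketServer } from 'ws';
const wss = new WebSocketServer({ server });

wss.on('connection', (ws) => {
  console.log('New connection initiated');

  ws.on('message', (data) => {
    // Twilio sends media stream events as JSON messages.
    const message = JSON.parse(data);
    switch (message.event) {
      case 'connected':
        console.log('Twilio media stream connected');
        break;
      case 'start':
        console.log('Twilio media stream started');
        break;
      case 'media':
        // These arrive many times per second while the caller speaks.
        console.log('Twilio media message received');
        break;
      case 'stop':
        console.log('Twilio media stream stopped');
        break;
    }
  });
});
```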
The code above starts a WebSocket server and handles the different media stream messages that Twilio will send.
This is all the code you’ll need to implement the Twilio part of this application. Let's try it out.
Run the application by running the following command on your terminal:
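Assuming the entry point is server.js as created above:

```shell
node server.js
```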
For Twilio to be able to reach your server, you need to make your application publicly accessible. Open a different shell and run the following command to tunnel your locally running server to the internet using ngrok:
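Since the application listens on port 3000, the ngrok command would be:

```shell
ngrok http 3000
```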
Now copy the Forwarding URL that the ngrok command outputs. It should look something like this: https://d226-71-163-163-158.ngrok-free.app.
Go to the Twilio Console, navigate to your active phone numbers, and click on your Twilio phone number.
Update the Voice Configuration so that when a call comes in, Twilio sends a webhook to your ngrok forwarding URL using HTTP POST.
Scroll to the bottom of the page and click Save configuration.
Call your Twilio phone number, say a couple of words, and hang up.
Then, observe the output shown on your terminal where you ran the application.
You'll see logs for the different media stream events; in particular, you'll be bombarded with a large number of media messages.
Great job! You've finished one half of the puzzle; now let's solve the other half.
Transcribe the media stream using AssemblyAI's real-time transcription
You're already receiving the audio from the Twilio voice call. Now, you have to forward the audio to AssemblyAI's real-time transcription service to turn the audio into text.
You'll need a couple more NPM packages. Stop the running application on your terminal and add the packages using the following command:
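Based on the package descriptions below, the install command would be:

```shell
npm install assemblyai dotenv
```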
- The assemblyai module is the JavaScript SDK for AssemblyAI. The SDK makes it easier to interact with AssemblyAI's APIs.
- dotenv loads secrets from the .env file into the process's environment variables.
Open the server.js file and update the imports at the top with the following highlighted lines:
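The updated import block could look like this; note that the exact export name for the real-time service (`RealtimeService` here, matching the prose below) may differ depending on your assemblyai SDK version:

```javascript
import express from 'express';
import { WebSocketServer } from 'ws';
import { RealtimeService } from 'assemblyai';
import 'dotenv/config';
```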
When you import dotenv/config, the dotenv module loads secrets from the .env file and adds them to the process's environment variables. Create that .env file in the root of your project with the following contents, and replace <ASSEMBLYAI_API_KEY> with your AssemblyAI API key. You can find your API key on your AssemblyAI dashboard.
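The .env file only needs a single entry:

```
ASSEMBLYAI_API_KEY=<ASSEMBLYAI_API_KEY>
```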
In the incoming connection handler for the WebSocket server, update the code to pass the audio to the RealtimeService and print the transcripts to the console.
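The original listing isn't preserved; the sketch below shows one way the updated handler could look, assuming the `server` variable and imports from earlier. The SDK class, event, and option names (`RealtimeService`, `transcript.partial`, `transcript.final`, `sendAudio`, `encoding`, `sampleRate`) may differ by assemblyai SDK version:

```javascript
const wss = new WebSocketServer({ server });

wss.on('connection', (ws) => {
  console.log('New connection initiated');

  // Configure the realtime service to match Twilio's audio format:
  // mu-law encoded audio at 8 kHz.
  const transcriber = new RealtimeService({
    apiKey: process.env.ASSEMBLYAI_API_KEY,
    encoding: 'pcm_mulaw',
    sampleRate: 8000,
  });

  // Resolves once the transcription session has begun.
  const transcriberConnectionPromise = transcriber.connect();

  transcriber.on('transcript.partial', (partialTranscript) => {
    // Silence produces partial transcripts with empty text; skip those.
    if (!partialTranscript.text) return;
    console.clear();
    console.log(partialTranscript.text);
  });

  transcriber.on('transcript.final', (finalTranscript) => {
    console.clear();
    console.log(finalTranscript.text);
  });

  ws.on('message', async (data) => {
    const message = JSON.parse(data);
    switch (message.event) {
      case 'connected':
        console.log('Twilio media stream connected');
        break;
      case 'start':
        console.log('Twilio media stream started');
        break;
      case 'media':
        // Wait for the realtime session before forwarding audio.
        await transcriberConnectionPromise;
        transcriber.sendAudio(Buffer.from(message.media.payload, 'base64'));
        break;
      case 'stop':
        console.log('Twilio media stream stopped');
        await transcriber.close();
        break;
    }
  });
});
```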
Let's take a deeper look at the significant parts of the code.
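First, the construction and connection of the realtime service. This is a sketch; the class and option names may vary by assemblyai SDK version:

```javascript
// Twilio media streams deliver mu-law encoded audio at 8 kHz,
// so configure the realtime service to match.
const transcriber = new RealtimeService({
  apiKey: process.env.ASSEMBLYAI_API_KEY,
  encoding: 'pcm_mulaw',
  sampleRate: 8000,
});

// The returned promise resolves once the transcription session has begun.
const transcriberConnectionPromise = transcriber.connect();
```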
The code above creates a new RealtimeService, passing in the API key you configured in .env and setting the encoding and sampleRate to match those of Twilio media streams. Finally, you connect to the real-time transcription service using .connect(), which returns a promise. The promise resolves when the service is ready and the transcription session has begun.
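Next, the media-message handling inside the WebSocket message handler. This fragment is a sketch and relies on the `message`, `transcriber`, and `transcriberConnectionPromise` variables introduced above:

```javascript
case 'media':
  // Make sure the realtime session is open before sending audio.
  await transcriberConnectionPromise;
  // Twilio delivers the audio as a base64-encoded string.
  transcriber.sendAudio(Buffer.from(message.media.payload, 'base64'));
  break;
```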
When Twilio sends a media message, the server first makes sure the connection to the real-time service has been established, then turns the audio data into a buffer, and finally sends it to the real-time service.
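Finally, the transcript handlers. This is a sketch; the event names (`transcript.partial`, `transcript.final`) may differ by assemblyai SDK version:

```javascript
transcriber.on('transcript.partial', (partialTranscript) => {
  // Silence produces partial transcripts with empty text;
  // skip those so the console isn't cleared needlessly.
  if (!partialTranscript.text) return;
  console.clear();
  console.log(partialTranscript.text);
});

transcriber.on('transcript.final', (finalTranscript) => {
  console.clear();
  console.log(finalTranscript.text);
});
```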
The code above prints both partial and final transcripts to the console, but clears the console first so that newly spoken words appear appended to the end of the current line until an utterance is finished. Partial transcripts can have empty text when audio containing silence is sent; in that case, nothing is printed and the console isn't cleared.
Test the application
That's all the code you need to write. Let's test it out. Restart the Node.js application (leave ngrok running), and give your Twilio phone number a call. As you talk in the call, you'll see the words you're saying printed on the console.
Conclusion
You learned how to create a WebSocket server that handles Twilio media streams so you can receive the audio of a Twilio voice call. You then passed the audio of the media stream to AssemblyAI's real-time transcription service to turn speech into text and print the text to the console.
You can build on top of this to create many types of voice applications. For example, you could pass the final transcript to a Large Language Model (LLM) to generate a response, then use a text-to-speech service to turn the response text into audio.
Or you could tell an LLM that there are certain actions that the caller can take, and ask the LLM which action the caller wants based on their final transcript, then execute that action.
We can't wait to see what you build! Let us know!
Niels Swimberghe is a Belgian-American software engineer, a developer educator at AssemblyAI, and a Microsoft MVP. Contact Niels on Twitter @RealSwimburger and follow Niels' blog on .NET, Azure, and web development at swimburger.net.