Offline Transcription and TTS using Vosk and Bark

March 27, 2025
Written by
Matt Coser
Twilion
Reviewed by
Paul Kamp
Twilion
Christopher Konopka
Contributor
Opinions expressed by Twilio contributors are their own

Speech recognition and text-to-speech (TTS) are crucial parts of any phone tree, call center, or telephony app. Traditional methods for building these features have one thing in common – they rely on external API calls that require third-party access and transmitting your recording files over the Internet. Don’t get me wrong – this process works great, and Twilio’s secure connections for products such as Voice and AI Assistants are discussed on our subprocessors page.

Out of curiosity though: is it possible to generate TTS and transcribe recordings locally, using AI? In this article, we will look at using Vosk and Bark for offline speech recognition and generation.

Prerequisites

Before starting, be sure to have the following ready:

Recording and Transcribing call audio with Twilio

A Call Recording is an audio file containing part or all of the RTP media transmitted during a Call.

With Twilio, customers can record and transcribe phone calls by setting certain parameters when using <Record>, <Dial>, Conferences, <Transcription>, or Outbound API.

You can also generate dynamic messages and responses on a call with <Say> and <Play>.

Recordings can be modified during an in-progress call, so the recording time and call time may not be the same.

<Record>

Setting the transcribe attribute to true will send the recorded audio to our transcription provider, and the transcription will be returned.

A transcribeCallback URL is optional, but a good idea for tracking the status of the Transcription.

<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Record transcribe="true" transcribeCallback="/handle_transcribe"/>
</Response>

When the transcription has been processed (or if it fails for some reason) the transcribeCallback URL will be hit with a payload containing the text, status, and other related info.
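The shape of that callback handler can be sketched in a few lines. This is a hedged example: the field names (TranscriptionStatus, TranscriptionText, RecordingSid) come from Twilio's transcribe callback payload, and the parsing function here is a hypothetical helper you would wire into whatever web framework serves your /handle_transcribe route.

```python
# Hypothetical helper for a transcribeCallback endpoint: pulls the useful
# fields out of the POST body Twilio sends when a transcription finishes.
def parse_transcription_callback(form):
    return {
        "status": form.get("TranscriptionStatus"),    # "completed" or "failed"
        "text": form.get("TranscriptionText"),         # the transcription itself
        "recording_sid": form.get("RecordingSid"),
    }

# Example payload shaped like Twilio's callback:
payload = {
    "TranscriptionStatus": "completed",
    "TranscriptionText": "ahoy this is a connectivity test",
    "RecordingSid": "RExxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
}
info = parse_transcription_callback(payload)
```

From here you could log the text, store it in a database, or kick off downstream analysis whenever `info["status"]` comes back as completed.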

<Dial>

Outgoing Calls made using <Dial> can be recorded in a few different ways, and the Recording file can be transcribed after the call completes using an external service.

<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Dial record="record-from-answer" recordingStatusCallback="/recording" recordingStatusCallbackEvent="completed" />
</Response>

Conferences

Conference legs made using <Conference> TwiML can be recorded by setting the record attribute.

Similarly, Conferences made using the Conference Participant API can be recorded in different ways using the record, conferenceRecord, recordTrack, and other attributes. It can get complex, so test various configurations to find what works for you.

In any case, transcriptions of Conference recordings would need to be made after the fact as well.

<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Dial>
    <Conference record="record-from-start"
                recordingStatusCallback="/recording">
      LoveTwilio
    </Conference>
  </Dial>
</Response>
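If you generate TwiML programmatically, the same document can be built with the standard library's ElementTree (Twilio's own helper library offers an equivalent via VoiceResponse). A minimal sketch, with the room name and callback URL as placeholders:

```python
# Building the <Conference> TwiML above with ElementTree (stdlib).
import xml.etree.ElementTree as ET

def conference_twiml(room, callback_url):
    response = ET.Element("Response")
    dial = ET.SubElement(response, "Dial")
    conference = ET.SubElement(dial, "Conference", {
        "record": "record-from-start",
        "recordingStatusCallback": callback_url,
    })
    conference.text = room
    return ET.tostring(response, encoding="unicode")

twiml = conference_twiml("LoveTwilio", "/recording")
```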

Media Streams

From the docs: “Media Streams gives access to the raw audio stream of your Programmable Voice calls by sending the live audio stream of the call to a destination of your choosing using WebSockets.”

A common implementation of real-time call transcription uses Media Streams in conjunction with an external speech recognition provider. This demo code integrates the Google Cloud Speech API with the WebSocket server receiving the forked RTP media on an ongoing call.

<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Start>
        <Stream name="Example Audio Stream" url="wss://example.com/audiostream" />
    </Start>
    <Say>The stream has started.</Say>
</Response>
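On the receiving end, each WebSocket frame is a small JSON message. A hedged sketch of the unpacking step, under the assumption (per the Media Streams docs) that "media" events carry base64-encoded 8 kHz mu-law audio in media.payload:

```python
# Extract raw audio bytes from a Media Streams WebSocket frame.
import base64
import json

def extract_audio(frame_json):
    """Return the audio bytes from a "media" event, or b"" for
    non-media events (start, stop, mark)."""
    frame = json.loads(frame_json)
    if frame.get("event") != "media":
        return b""
    # The payload is 8 kHz mu-law; convert it to 16-bit linear PCM
    # (e.g., with audioop.ulaw2lin) before feeding most recognizers.
    return base64.b64decode(frame["media"]["payload"])
```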

Voice Intelligence Transcripts

Voice Intelligence leverages artificial intelligence and machine learning technologies to transcribe and analyze recorded conversations.

[Image: Diagram explaining Twilio's Voice Intelligence process, from data sources to integrations for various applications.]

Check out the docs for a deeper dive on Transcripts.

<ConversationRelay>

Twilio customers can now pipe call audio over websockets to their own LLM provider for transcription, translation, and much more. The <ConversationRelay> noun, under the <Connect> verb, supports a wide array of languages and voices, and is a great AI-powered enhancement to your call flows.

<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect action="https://twiml-app.com/connect_action">
    <ConversationRelay url="wss://mywebsocketserver.com/websocket" welcomeGreeting="Ahoy! This is a connectivity test." />
  </Connect>
</Response>

Check out the <ConversationRelay> Onboarding guide (and some cool blog posts) for more information.
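The exchange over that WebSocket can be sketched as follows. This is a hedged example: the message types and field names ("setup", "prompt", voicePrompt, and the "text"/token/last reply shape) follow the onboarding guide, so verify them against the current docs before relying on this.

```python
# Minimal ConversationRelay message handler sketch: echo the caller's
# speech back as a spoken reply.
import json

def handle_relay_message(message_json):
    message = json.loads(message_json)
    if message.get("type") == "setup":
        return None  # first message on connect; stash message.get("callSid") if needed
    if message.get("type") == "prompt":
        caller_said = message.get("voicePrompt", "")
        reply = f"You said: {caller_said}"  # swap in your LLM call here
        return json.dumps({"type": "text", "token": reply, "last": True})
    return None
```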

Marketplace

The Marketplace offers customization and flexibility by integrating external providers and functionality for operations like transcribing calls.

For example, the IBM Watson Speech to Text add-on will transcribe <Dial> and other types of recordings automatically, and hit a status callback URL when done.

[Image: Checklist with options for Record Verb Recordings, Outgoing Call Recordings, Conference Recordings, and Dial Verb Recordings.]

TTS with Twilio

<Say>

TTS, or Text-to-Speech, is a technology that converts written text into spoken words. <Say> uses TTS to generate audio of given text on the fly, which can be played on a live call.

The Basic voices, man and woman, are free to use.

<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Say voice="woman">Hello!</Say>
</Response>

You can use Google and Amazon Polly voices with the <Say> verb for a small additional cost.

<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Say voice="Polly.Mathieu" language="fr-FR">Bonjour! Je m'appelle Mathieu.</Say>
</Response>

Furthermore, the Standard Voices enable use of Speech Synthesis Markup Language (SSML) to modify speed and volume, or to interpret different strings in different ways.

<Response>
  <Say voice="Polly.Joanna">
    Prosody can be used to change the way words sound. The following words are
    <prosody volume="x-loud"> quite a bit louder than the rest of this passage.
    </prosody> Each morning when I wake up, <prosody rate="x-slow">I speak slowly and
    deliberately until I have my coffee.</prosody> I can also change the pitch of my voice
    using prosody. Do you like <prosody pitch="+5%"> speech with a pitch higher,</prosody>
    or <prosody pitch="-10%"> is a lower pitch preferable?</prosody>
  </Say>
</Response>
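If you assemble SSML strings in code, a tiny formatting helper keeps the tags consistent. This is a hypothetical helper for illustration, not part of any Twilio library:

```python
# Wrap text in an SSML <prosody> tag, emitting only the attributes given.
def prosody(text, volume=None, rate=None, pitch=None):
    attrs = "".join(f' {name}="{value}"' for name, value in
                    (("volume", volume), ("rate", rate), ("pitch", pitch))
                    if value is not None)
    return f"<prosody{attrs}>{text}</prosody>"

snippet = prosody("quite a bit louder", volume="x-loud")
```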

<Play>

Some organizations opt for recording human speech or including music/sound effects in their phone app to reinforce a seamless customer experience across all interactions. Once the recording is stored somewhere, the <Play> TwiML verb can be invoked to “play” these media files on a live call.

<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Play loop="10">https://api.twilio.com/cowbell.mp3</Play>
</Response>

Playing media files like this is less flexible, since the recordings need to be made ahead of time. However, you can prepare pre-recorded numbers, letters, and common words to play dynamically for account numbers, addresses, or certain phrases.

<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Say>Your number is </Say>
    <Play>myapp.com/four.wav</Play>
    <Play>myapp.com/three.wav</Play>
    <Play>myapp.com/eight.wav</Play>
    <Play>myapp.com/seven.wav</Play>
</Response>
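Generating that digit-by-digit TwiML is a natural job for a loop. A minimal sketch, assuming pre-recorded files named zero.wav through nine.wav hosted at some base URL (the myapp.com paths above are placeholders):

```python
# Turn a number into digit-by-digit <Play> verbs using ElementTree (stdlib).
import xml.etree.ElementTree as ET

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

def digits_twiml(number, base_url):
    response = ET.Element("Response")
    say = ET.SubElement(response, "Say")
    say.text = "Your number is "
    for ch in str(number):
        play = ET.SubElement(response, "Play")
        play.text = f"{base_url}/{DIGIT_WORDS[int(ch)]}.wav"
    return ET.tostring(response, encoding="unicode")

twiml = digits_twiml(4387, "https://myapp.com")
```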

Building Local Options

Speech Recognition with Vosk

Vosk is a speech recognition toolkit that incorporates machine learning models trained to convert spoken language into written text. During inference, Vosk processes audio input to extract features like phonemes and matches them against the learned patterns to predict the most likely words and phrases being spoken.

To transcribe Call Recordings locally, we should create a few pieces of logic to handle different stages of the process.

First, we need to save the Recording files locally. When a recording has been processed and is ready to download, the RecordingStatusCallback URL will be hit. We can use this opportunity to download the Call Recording locally, and delete it from Twilio’s stores once the local download has been validated. This keeps your data local, and helps avoid storage costs.

My plumb-bob node/express app has a built-in custom RecordingStatusCallback endpoint that takes care of this.

const https = require('https');
const fs = require('fs');

app.post('/recording-download-delete', (request, response) => {
    const rec_status = request.body.RecordingStatus;
    const rec_sid = request.body.RecordingSid;
    const rec_url = request.body.RecordingUrl;
    const rec_filename = `recordings/${rec_sid}.wav`;
    async function deleteRecording(rsid) {
        await client.recordings(rsid).remove();
    }
    if (!rec_url) {
        return response.status(400).json({ error: 'RecordingUrl is required' });
    }
    if (rec_status === 'completed') {
        const file = fs.createWriteStream(rec_filename);
        const complete_url = `https://${accountSid}:${authToken}@api.twilio.com/2010-04-01/Accounts/${accountSid}/Recordings/${rec_sid}.wav`;
        https.get(complete_url, function (res) {
            res.pipe(file);
            // Only delete from Twilio and respond once the local file is written
            file.on('finish', function () {
                file.close();
                deleteRecording(rec_sid);
                response.status(200).send();
            });
        });
        return;
    }
    return response.status(500).json({ error: 'an error has occurred...' });
});

This route is not fit for production use, but it illustrates the desired logic for demonstration purposes.

  1. This endpoint is hit with the Recording Status payload
  2. The recording is downloaded locally
  3. The Recording is deleted from Twilio’s cloud storage

Alternatively, recordings can be downloaded in batches by periodically looping through the Recordings Resource, and performing the download/delete functions one by one.

import os
from twilio.rest import Client

account_sid = os.environ["TWILIO_ACCOUNT_SID"]
auth_token = os.environ["TWILIO_AUTH_TOKEN"]
client = Client(account_sid, auth_token)

recordings = client.recordings.list()
for record in recordings:
    download(record.media_url)  # your download helper
    delete(record.sid)          # your delete helper
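The download and delete helpers in that loop are left to you. A hedged sketch of what they might look like, assuming the requests library for the HTTP fetch and the media URL shape used by Twilio's REST API:

```python
# Hypothetical download/delete helpers for the batch loop above.
import os

def build_media_url(account_sid, recording_sid):
    return (f"https://api.twilio.com/2010-04-01/Accounts/{account_sid}"
            f"/Recordings/{recording_sid}.wav")

def download(account_sid, auth_token, recording_sid, dest_dir="recordings"):
    import requests  # third-party; pip install requests
    url = build_media_url(account_sid, recording_sid)
    resp = requests.get(url, auth=(account_sid, auth_token))
    resp.raise_for_status()
    path = os.path.join(dest_dir, f"{recording_sid}.wav")
    with open(path, "wb") as f:
        f.write(resp.content)
    return path

def delete(client, recording_sid):
    client.recordings(recording_sid).remove()
```

Validate that the downloaded file is intact before calling delete, since the removal from Twilio's storage is permanent.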

Either way, once you have recordings to transcribe locally, Vosk comes with a simple vosk-transcriber command line tool.

vosk-transcriber -i recording.wav -o recording_transcription.txt

The output file is simple, but gives us what we asked for: the text of the recording.

[Image: A terminal window with a text file named recording_transcription.txt displaying the text "the hello this is a connectivity test".]

This example is a bit more complex, outputting a timestamped JSON.

import wave
import json
from vosk import Model, KaldiRecognizer
# Load the Vosk model
model = Model('models/vosk-model-en-us-0.22')  # Ensure you replace this with the path to your Vosk model
# Open the audio file
with wave.open('recordings/RExxxxx.wav', 'rb') as wf:
    # Initialize the recognizer
    rec = KaldiRecognizer(model, wf.getframerate())
    rec.SetWords(True)  # Enable word-level timestamps
    results = []
    text_segments = []
    while True:
        # Read audio data
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        # Process the data through the recognizer
        if rec.AcceptWaveform(data):
            res = json.loads(rec.Result())
            if 'text' in res:
                results.append(res)
    # Fetch the final result
    final_res = json.loads(rec.FinalResult())
    if 'text' in final_res:
        results.append(final_res)
# Process results
for res in results:
    if 'result' in res:
        words = res['result']
        start_time = words[0]['start']
        end_time = words[-1]['end']
        content = ' '.join([word['word'] for word in words])
        text_segments.append({
            'content': content,
            'start': start_time,
            'end': end_time,
            'words': words
        })
# Output the final JSON
with open('transcription.json', 'w') as f:
    json.dump(text_segments, f, indent=2)
print("Transcription saved.")

It begins by loading a pre-trained Vosk model and opening an audio file for reading. The KaldiRecognizer is initialized and the Vosk model is fed the audio data in chunks. If the recognizer accepts the audio chunk, it parses the recognized result into JSON to extract the transcribed text and its word-level timing information. Once the entire audio is processed, the transcription results are structured with each segment's text content, start time, and end time.

Finally, the transcription with all segments is saved in a JSON file.

[Image: Screenshot of a JSON file showing a transcription of text with timestamps and confidence scores.]

Videogrep, a tool originally designed for making video supercuts, uses Vosk under the hood and can be used to further analyze the transcript. For example, the ngrams command shows common words or phrases. An n-gram is a sequence of words in a specific order.

% videogrep -i recording.wav --ngrams 1
hello 1
this 1
is 1
a 1
connectivity 1
test 1

N-gram analysis can enhance customer service and improve decision-making processes in telephony applications by providing insights into communication patterns and customer interactions.

  • Sentiment Analysis - By examining frequent n-grams (sequences of words) associated with positive or negative sentiments, companies can gauge customer satisfaction and sentiment.
  • Keyword Spotting - Identifying common phrases or keywords can help in detecting recurring themes or topics in calls, such as frequently asked questions or common routing issues.
  • Fraud Detection - Unusual or suspicious n-grams can be flagged to identify potential fraudulent activities or calls.
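The counting behind that output is straightforward to reproduce yourself. A minimal n-gram counter over transcript text, similar in spirit to videogrep's --ngrams output, shown here for bigrams:

```python
# Count n-grams in a transcript string.
from collections import Counter

def ngrams(text, n):
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

counts = ngrams("ahoy this is a connectivity test this is a test", 2)
```

From a counter like this, the sentiment, keyword, and fraud analyses above boil down to matching frequent n-grams against curated phrase lists.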

TTS with Bark

From the GitHub readme:

“Bark is a transformer-based text-to-audio model created by Suno.”

A transformer-based model uses transformer architecture, a type of neural network architecture used for processing text and other sequential data.

Bark converts text input into audio output, generating spoken language or other audio forms from written text. A lot of services perform the same function, but Bark works locally, using your machine's resources instead of someone else’s computer in the sky.

This example from the readme ‘just works’.

from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav
from IPython.display import Audio
# download and load all models
preload_models()
# generate audio from text
text_prompt = """
     Ahoy! This is a connectivity test. 
"""
audio_array = generate_audio(text_prompt, history_prompt="v2/en_speaker_5")
# save audio to disk
write_wav("bark_generation.wav", SAMPLE_RATE, audio_array)

Upon the initial execution, the models will be downloaded. Subsequent runs will be faster as the models are already in place.

From there, try messing around with different voices and settings. Once the file is generated, you can upload it to Assets or your own storage to be used with <Play>!

Your results may vary in consistency and hilariousness, but fine tuning is worth a shot.

TTS with Voice Puppet

Voice Puppet (from the same dev as videogrep mentioned above) uses an audio file of speech to ‘clone’ the voice for use in generating text to speech.

This can come in handy if you want a specific voice but you have too many recordings for a human to reliably provide at scale. For instance, if your phone app is designed to send various announcements or alerts, you may need to constantly make individual recordings yourself. With Voice Puppet, you can use an example recording as a source, and dynamically generate the speech each time.

To get started, grab the code and install with pip.

Depending on your setup, you may need to pin the torch version to something below 2.6, since PyTorch 2.6 changed the default of torch.load to weights_only=True, which can break loading older model checkpoints.

This is okay for the purposes of this demonstration, but please do your own research before using it in a production setting.

In the pyproject.toml file, change line 9 from this:

"torch>=2.2.1",

To this:

"torch==2.2.1",

Then, make sure you have the build package installed.

pip install build

To build the package, in your project's root directory (the one containing pyproject.toml), run:

python -m build

This will generate a dist/ directory containing the wheel and source distribution.

Install with

pip install dist/your_package_name.whl

I tried this out by recording my own voice, and pitch shifting it down for fun.

% voice_puppet --clone input-voice.wav --text "Ahoy! This is a connectivity test." --output "cloned-output.wav"

The resulting file can again be uploaded to the Internet to be used with <Play>.

Conclusion

Twilio offers a wide variety of cutting-edge solutions such as <Say>, Voice Intelligence Transcripts, ConversationRelay, and more for adding speech recognition and text-to-speech (TTS) functionality to your application. That said, in experimenting with Vosk, Bark, and other open source technologies, we've shown that it's possible to transcribe and generate audio locally as well. The underlying technologies have been around for decades, but recent developments and accelerated research have made AI-powered telephony features more accessible and easier to use than ever.

Please browse these additional resources for more information.

Matt Coser is a Senior Field Security Engineer at Twilio. His focus is on telecom security, and empowering Twilio’s customers to build safely. Contact him on LinkedIn to connect and discuss more.