Turn Voice Recordings into Shareable Videos with Python and FFmpeg
In this tutorial, we are going to learn how to build an application with Python and FFmpeg that will allow us to turn voice recordings into cool videos that can be easily shared on social media.
At the end of the tutorial we will have turned a voice recording into a video that looks similar to the following:
Tutorial requirements
To follow this tutorial you are going to need the following components:
- One or more voice recordings that you want to convert to videos. Programmable Voice recordings stored in your Twilio account work great for this tutorial.
- Python 3.6+ installed.
- FFmpeg version 4.3.1 or newer installed.
Creating the project structure
In this section, we will create our project directory, and inside this directory, we will create sub-directories where we will store the recordings, images, fonts, and videos that will be used in this tutorial. Lastly, we will create the Python file that will contain the code that will allow us to use FFmpeg to create and edit a video.
Open a terminal window and enter the following commands to create the project directory and move into it:
Use the following commands to create four subdirectories:
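The same directory layout can also be created from Python rather than from the shell. The following is only a minimal sketch of that alternative: the helper name create_project_dirs() and the idea of passing the project root as an argument are assumptions, and only the four subdirectory names come from this tutorial.

```python
from pathlib import Path

# Subdirectory names used throughout this tutorial.
SUBDIRS = ("images", "fonts", "videos", "recordings")


def create_project_dirs(project_root):
    """Create the project root and its four subdirectories (hypothetical helper)."""
    root = Path(project_root)
    created = []
    for name in SUBDIRS:
        path = root / name
        # parents=True also creates the root; exist_ok=True makes reruns harmless.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling create_project_dirs(".") from inside an existing project directory creates any of the four subdirectories that are still missing.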
The images directory is where we will store the background images of our videos. Download this image, and store it in the images directory with the name bg.png. This image was originally downloaded from Freepik.com.
In the fonts directory we will store font files used to write text in our videos. Download this font, and store it in the fonts directory with the name LeagueGothic-CondensedRegular.otf. This font was originally downloaded from fontsquirrel.com.
The videos directory will contain videos and animations that will be added on top of the background image. Download this video of a spinning record with the Twilio logo in the center, and store it in the videos directory with the name spinningRecord.mp4. The source image used in this video was downloaded from flaticon.com.
The recordings directory is where we will store the voice recordings that will be turned into videos. Add one or more voice recordings of your own to this directory.
Now that we have created all the directories needed, open your favorite code editor and create a file named main.py in the top-level directory of the project. This file will contain the code responsible for turning our recordings into videos.
If you don’t want to follow every step of the tutorial, you can get the complete project source code here.
Turning an audio file into a video
In this section, we are going to add the code that will allow us to turn a recording into a video that shows the recording’s sound waves.
We are going to use FFmpeg to generate a video from an audio file. To call FFmpeg and related programs from Python, we are going to use Python’s subprocess module.
Running a command
Add the following code inside the main.py file:
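A minimal sketch of such a function, written to match the description that follows. It assumes the command is passed as a single shell string and that only standard output is returned; error output is captured but otherwise ignored. The universal_newlines argument is used so that the sketch also runs on Python 3.6.

```python
import subprocess


def run_command(command):
    """Run a shell command, print its standard output, and return it."""
    # shell=True lets us pass the whole command as one string.
    # universal_newlines=True decodes the captured output to str.
    p = subprocess.run(
        command,
        shell=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    print("stdout:\n{}".format(p.stdout))
    return p.stdout
```

Because the command goes through the shell, any file name placed in it should be free of spaces or quoted first, for example with shlex.quote().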
In the block of code above, we have imported the subprocess module and created a run_command() function. As the name suggests, this function is responsible for running the command that is passed as its argument. When the command completes, we print the output and also return it to the caller.
Obtaining the duration of a recording
Add the following code below the run_command() function:
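A minimal sketch of this step, under a few assumptions: the exact ffprobe flags shown here are one common way to print only the duration and may differ from the tutorial's own command, build_duration_command() is a hypothetical helper added to keep the command easy to inspect, and run_command() is repeated in compact form only so that the snippet runs on its own (in main.py, keep a single definition).

```python
import shlex
import subprocess


def run_command(command):
    """Compact stand-in for the run_command() in main.py."""
    p = subprocess.run(command, shell=True, stdout=subprocess.PIPE,
                       stderr=subprocess.PIPE, universal_newlines=True)
    return p.stdout


def build_duration_command(rec_path):
    """Return an ffprobe command that prints only the duration in seconds."""
    return (
        "ffprobe -v quiet -show_entries format=duration "
        "-of csv=p=0 -i {}".format(shlex.quote(rec_path))
    )


def get_rec_duration(rec_name):
    """Return the duration of a file in ./recordings as a string of seconds."""
    rec_path = "./recordings/{}".format(rec_name)
    # strip() drops the trailing newline that ffprobe prints after the value.
    rec_duration = run_command(build_duration_command(rec_path)).strip()
    print("rec duration", rec_duration)
    return rec_duration
```

If ffprobe is missing or the file cannot be read, this sketch returns an empty string rather than raising an error, so it is worth checking the value before using it.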
Here, we created a function named get_rec_duration(). This function is responsible for retrieving the duration of a recording. The function receives a recording name (rec_name) as an argument, which is prepended with the name of the recordings directory and stored in the rec_path local variable.
We build a command string that uses ffprobe, a program that is part of FFmpeg, to get the duration of the recording. We call the run_command() function with this command and store the value returned in rec_duration.
Lastly, we print and then return the recording duration we obtained.
We need the recording duration so that the video generated from the recording can be given the same duration.
Converting audio to video
Add the following code below the get_rec_duration() function:
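A minimal sketch of this function, built from the options explained in the rest of this section. Several details are assumptions rather than the tutorial's own values: the waveform color (white), the drawbox geometry (a full-width band, 150 pixels high, vertically centered), the explicit [1:v] label on the background image, and the hypothetical build_waveform_command() helper. run_command() is repeated in compact form only so that the snippet runs on its own.

```python
import shlex
import subprocess


def run_command(command):
    """Compact stand-in for the run_command() in main.py."""
    p = subprocess.run(command, shell=True, stdout=subprocess.PIPE,
                       stderr=subprocess.PIPE, universal_newlines=True)
    return p.stdout


def build_waveform_command(rec_path, bg_image_path, rec_duration, video_path):
    """Return the FFmpeg command that renders the waveform video (hypothetical helper)."""
    filtergraph = (
        # Draw the audio of input 0 as a 1280x150 waveform, labeled [fg].
        "[0:a]showwaves=s=1280x150:mode=cline:colors=white[fg];"
        # Darken a centered horizontal band of the background image (input 1), labeled [bg].
        "[1:v]drawbox=x=0:y=(ih-150)/2:w=iw:h=150:color=black@0.8:t=fill[bg];"
        # Center the waveform on top of the band.
        "[bg][fg]overlay=format=auto:x=(W-w)/2:y=(H-h)/2"
    )
    return (
        "ffmpeg -y -i {rec} -loop 1 -i {bg} -t {duration} "
        "-filter_complex {graph} -map 0:a "
        "-c:v libx264 -preset fast -crf 18 -c:a aac -shortest {out}"
    ).format(
        rec=shlex.quote(rec_path),
        bg=shlex.quote(bg_image_path),
        duration=str(rec_duration).strip(),
        graph=shlex.quote(filtergraph),
        out=shlex.quote(video_path),
    )


def turn_audio_to_video(rec_name, rec_duration):
    """Render ./recordings/<rec_name> as a waveform video and return the video name."""
    rec_path = "./recordings/{}".format(rec_name)
    bg_image_path = "./images/bg.png"
    video_name = "video_with_sound_waves.mp4"
    command = build_waveform_command(
        rec_path, bg_image_path, rec_duration, "./videos/{}".format(video_name))
    print(run_command(command))
    return video_name
```

Keeping the filtergraph in its own variable and quoting it as a whole avoids having to escape the brackets, semicolons, and parentheses for the shell by hand.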
The turn_audio_to_video() function will turn recordings into videos showing the recordings’ sound waves. The function takes as arguments the recording name (rec_name) and the recording duration (rec_duration).
The FFmpeg command that generates the video from the audio uses the recording path (rec_path), the path to a background image (bg_image_path), and the output filename for the video (video_name).
Let’s take a closer look at the FFmpeg command:
The -y option tells ffmpeg to overwrite the output file if it exists on disk.
The -i option specifies the inputs. In this case, we have 2 input files: the recording file, rec_path, and the image that we are using as a background, which is stored in bg_image_path.
The -loop option generates a video by repeating (looping) the input file(s). Here we are looping over our image input in bg_image_path. The default value is 0 (don’t loop), so we set it to 1 (loop) to repeat this image in all the video frames.
The -t option specifies a duration in seconds, or using the "hh:mm:ss[.xxx]" syntax. Here we are using the recording duration (rec_duration) value to set the duration of our output video.
The -filter_complex option allows us to define a complex filtergraph, one with an arbitrary number of inputs and/or outputs. This is a complex option that takes a number of arguments, discussed below.
First, we use the showwaves filter to convert the voice recording, referenced as [0:a], to video output. The s parameter is used to specify the video size for the output, which we set to 1280x150. The mode parameter defines how the audio waves are drawn. The available values are: point, line, p2p, and cline. The colors parameter specifies the color of the waveform. The waveform drawing is assigned the label [fg].
We use the drawbox filter to draw a colored box on top of our background image to help the waveform stand out. The x and y parameters specify the top left corner coordinates of the box, while w and h set its width and height. The color parameter configures the color of the box to black with an opacity of 80%. The t parameter sets the thickness of the box border. By setting the value to fill we create a solid box.
To complete the definition of this filter we use overlay to put the waveform drawing on top of the black box. The overlay filter is configured with format, which sets the pixel format automatically, and x and y, which specify the coordinates at which the overlay will be placed in the video frame. We use some math to specify that the overlay should be placed in the center of our video.
The -map option is used to choose which streams from the input(s) should be included or excluded in the output(s). We choose to add all streams of our recording to our output video.
The -c:v option is used to encode a video stream with a certain codec. We are telling FFmpeg to use the libx264 encoder.
The -preset option selects a collection of options that provide a certain trade-off between encoding speed and compression ratio. We are using the fast option here, but feel free to change the preset to a slower one (better compression) or a faster one (worse compression) if you like.
The -crf option stands for constant rate factor. Rate control decides how many bits will be used for each frame. This will determine the file size and also the quality of the output video. A value of 18 is recommended to obtain visually lossless quality.
The -c:a option is used to encode an audio stream with a certain codec. We are encoding the audio with the AAC codec.
The -shortest option tells FFmpeg to stop writing the output when the shortest of the input streams ends.
The ./videos/{video_name} argument at the end of the command specifies the path of our output file.
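The centering math mentioned for the overlay filter can be checked with plain arithmetic. In overlay expressions, W and H are the width and height of the main (background) input, while w and h are the width and height of the overlaid input, so the usual centering expressions are x=(W-w)/2 and y=(H-h)/2. The sketch below assumes that these are the expressions used and that the background is 1280x720; centered_offset() is a hypothetical helper for illustration only.

```python
def centered_offset(main_size, overlay_size):
    """Return the (x, y) top-left corner that centers the overlay in the main frame."""
    main_w, main_h = main_size
    over_w, over_h = overlay_size
    # Same arithmetic as the overlay expressions x=(W-w)/2 and y=(H-h)/2.
    return (main_w - over_w) / 2, (main_h - over_h) / 2


# A 1280x150 waveform on a 1280x720 background starts at x=0, y=285.
print(centered_offset((1280, 720), (1280, 150)))
```

With these sizes, a box drawn with drawbox at y=285 and a height of 150 sits exactly behind the centered waveform.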
In case you are curious, here is what all the FFmpeg waveform modes discussed above do and how they look.
Point draws a point for each sample:
Line draws a vertical line for each sample:
P2p draws a point for each sample and a line between them:
Cline draws a centered vertical line for each sample. This is the one we are using in this tutorial:
Add the following code below the turn_audio_to_video() function:
In this newly introduced code, we have a function named main(). In it we store the recording name in a variable named rec_name. You should update this line to include the name of your own voice recording file.
After that, we call the get_rec_duration() function to get the recording duration.
Then, we call the turn_audio_to_video() function with the recording name and duration, and store the value returned in a variable named video_with_sound_waves.
Lastly, we call the main() function to run the whole process. Remember to replace the value of the rec_name variable with the name of the recording you want to process.
Go back to your terminal, and run the following command to generate the video:
Look for a file named video_with_sound_waves.mp4 in the videos directory, open it and you should see something similar to the following:
Adding a video on top of the background
In this section, we are going to add a video of a spinning record in the bottom left corner of the generated video. The video that we are going to add is stored in the file named spinningRecord.mp4 in the videos directory.
Go back to your code editor, open the main.py file, and add the following code below the turn_audio_to_video() function:
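A minimal sketch of this function, using the sizes and coordinates given in the explanation that follows. The hypothetical build_spinning_record_command() helper and the use of shlex.quote() are assumptions, and run_command() is repeated in compact form only so that the snippet runs on its own.

```python
import shlex
import subprocess


def run_command(command):
    """Compact stand-in for the run_command() in main.py."""
    p = subprocess.run(command, shell=True, stdout=subprocess.PIPE,
                       stderr=subprocess.PIPE, universal_newlines=True)
    return p.stdout


def build_spinning_record_command(video_path, record_path, rec_duration, out_path):
    """Return the FFmpeg command that overlays the looping record (hypothetical helper)."""
    filtergraph = (
        # Shrink the spinning record (input 1) to 200x200, labeled [fg].
        "[1:v]scale=w=200:h=200[fg];"
        # Bring the waveform video (input 0) to 1280x720, labeled [bg].
        "[0:v]scale=w=1280:h=720[bg];"
        # Top-left corner at x=25, y=H-225: a 25 px margin at the left and bottom.
        "[bg][fg]overlay=x=25:y=H-225"
    )
    return (
        "ffmpeg -y -i {video} -stream_loop -1 -i {record} -t {duration} "
        "-filter_complex {graph} "
        "-c:v libx264 -preset fast -crf 18 -c:a copy {out}"
    ).format(
        video=shlex.quote(video_path),
        record=shlex.quote(record_path),
        duration=str(rec_duration).strip(),
        graph=shlex.quote(filtergraph),
        out=shlex.quote(out_path),
    )


def add_spinning_record(video_name, rec_duration):
    """Overlay spinningRecord.mp4 on ./videos/<video_name> and return the new name."""
    new_video_name = "video_with_spinning_record.mp4"
    command = build_spinning_record_command(
        "./videos/{}".format(video_name),
        "./videos/spinningRecord.mp4",
        rec_duration,
        "./videos/{}".format(new_video_name),
    )
    print(run_command(command))
    return new_video_name
```

Note that -stream_loop is an input option, so it has to appear before the -i of the file it applies to, here the spinning record.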
Here, we have created a function named add_spinning_record(). This function will be responsible for adding the spinningRecord.mp4 video on top of the video showing sound waves. It takes as arguments the name of the video generated earlier (video_name) and the recording duration (rec_duration).
This function also runs FFmpeg. Here is the command in detail:
The command above has the following options:
The -y, -t, -c:v, -preset, and -crf options are the same as in the FFmpeg command that generated the audio waves.
The -i option was also used before, but in this case, we have 2 videos as input files: the video file generated in the previous step, and the spinning record video file.
The -stream_loop option allows us to set the number of times an input stream should be looped. A value of 0 means to disable looping, while -1 means to loop infinitely. We set the spinning record video to loop infinitely. On its own this would make FFmpeg encode the output video indefinitely, but since we also specified the duration of the output video, FFmpeg will stop encoding the video when it reaches this duration.
The -filter_complex option also has the same function as before, but here we have two videos as input files: the video created in the previous section, [0:v], and the spinning record video, [1:v].
The filter first uses scale to resize the spinning record video so that it has 200x200 dimensions and assigns it the [fg] label. We then use the scale filter again to set the video created in the previous section to a 1280x720 size, with the [bg] label. And finally, we use the overlay filter to put the spinning record video on top of the video created in the previous section, at the coordinates x=25 and y=H-225 (H stands for the video height).
The -c:a option was also introduced in the previous section, but in this case, we use the special value copy to tell ffmpeg to copy the audio stream from the source video without re-encoding it.
The final part of the command, ./videos/{new_video_name}, sets the path of our output file.
Replace the code inside the main() function with the following, which adds the call to the add_spinning_record() function:
Run the following command in your terminal to generate a video:
Look for a file named video_with_spinning_record.mp4 in the videos directory, open it and you should see something similar to the following:
Adding text to video
In this section, we are going to add a title on the top portion of the video. As part of this we are going to learn how to use FFmpeg to draw text and change its color, size, font, and position.
Go back to your code editor, open the main.py file, and add the following code below the add_spinning_record() function:
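A minimal sketch of this function, based on the drawtext parameters explained below. The font size (90), box border width (5), vertical position (y=100), default title text, and function signature are all assumptions, as is the hypothetical build_text_command() helper. The sketch rejects titles containing characters that drawtext treats specially instead of trying to escape them, and run_command() is repeated in compact form only so that the snippet runs on its own.

```python
import shlex
import subprocess

FONT_PATH = "./fonts/LeagueGothic-CondensedRegular.otf"


def run_command(command):
    """Compact stand-in for the run_command() in main.py."""
    p = subprocess.run(command, shell=True, stdout=subprocess.PIPE,
                       stderr=subprocess.PIPE, universal_newlines=True)
    return p.stdout


def build_text_command(video_path, out_path, title):
    """Return the FFmpeg command that draws a centered title (hypothetical helper)."""
    # Quotes, colons, backslashes, and percent signs need escaping in drawtext.
    if any(ch in title for ch in "':\\%"):
        raise ValueError("title contains characters that need drawtext escaping")
    drawtext = (
        "drawtext=fontfile={font}:text='{title}':fontcolor=black:fontsize=90:"
        "box=1:boxcolor=white@0.5:boxborderw=5:"
        # w is the frame width and text_w the rendered text width.
        "x=(w-text_w)/2:y=100"
    ).format(font=FONT_PATH, title=title)
    return "ffmpeg -y -i {video} -vf {vf} -c:a copy {out}".format(
        video=shlex.quote(video_path),
        vf=shlex.quote(drawtext),
        out=shlex.quote(out_path),
    )


def add_text_to_video(video_name, title="Turning a voice recording into a video"):
    """Draw the title on ./videos/<video_name> and return the new video name."""
    new_video_name = "video_with_text.mp4"
    command = build_text_command(
        "./videos/{}".format(video_name),
        "./videos/{}".format(new_video_name),
        title,
    )
    print(run_command(command))
    return new_video_name
```

The expression x=(w-text_w)/2 centers the text horizontally whatever its length, so the title can be changed without adjusting the position.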
Here, we have created a function named add_text_to_video(), which invokes a new FFmpeg command to draw the text. Let’s take a closer look at the FFmpeg command:
The -y and -c:a options are used exactly as before.
The -i option, which defines the inputs, now has only one input file, the video file generated in the previous section.
The -vf option allows us to create a simple filtergraph and use it to filter the stream. Here we use the drawtext filter to draw the text on top of the video, with a number of parameters: fontfile is the font file to be used for drawing text, text defines the text to draw (feel free to change it to your liking), fontcolor sets the text color to black, fontsize sets the text size, box enables a box around the text, boxcolor sets the color of this box to white with a 50% opacity, boxborderw sets the width of the box border, and x and y set the position within the video where the text is to be printed. We used a little math to draw the text centered.
The ./videos/{new_video_name} argument at the end sets the output file, just like in the previous FFmpeg commands.
Replace the code inside the main() function with the following version, which adds the title step:
Go back to your terminal, and run the following command to generate a video with a title:
Look for a file named video_with_text.mp4 in the videos directory, open it and you should see something similar to the following:
Conclusion
In this tutorial, we learned how to use some of the advanced options in FFmpeg to turn a voice recording into a video that can be shared on social media. I hope this encourages you to learn more about FFmpeg.
The code for the entire application is available in the following repository: https://github.com/CSFM93/twilio-turn-recording-to-video.
Carlos Mucuho is a Mozambican geologist turned developer who enjoys using programming to bring ideas into reality. https://github.com/CSFM93.