SRTP and You: A Deep Dive into Encrypted VoIP Communications
Time to read: 11 minutes
RTP, or Real-time Transport Protocol, is used by Twilio (and others) for transmitting audio information for SIP calls. SRTP is Secure RTP, or RTP that has been encrypted. By design, only the parties that originally negotiated the SIP session can decrypt and listen to the encrypted RTP media. Anyone else who intercepts it cannot make sense of it or replay it.
In this post, we will discuss:
- How SRTP Works
- Why encrypted media is cool
- Overcoming potential obstacles and overhead
- How to set up SRTP with Twilio
- Implementation considerations
How does SRTP work?
If you understand HTTPS, then you will totally get SRTP. If not, let’s start by reviewing the basics.
With Twilio, SRTP keys are exchanged over signaling that is protected by TLS, which uses a ‘handshake’ that looks something like this:
The client and server exchange keys, which are unique to the current session, and use them to encrypt/decrypt the data that is being transferred between them.
SRTP uses Advanced Encryption Standard (AES) as the default cipher, with two primary modes:
- AES-CM (counter mode) - VoIP standard
- AES-f8 - used in 3G data networks
For more details, you can check out Twilio’s glossary entry on TLS. But for now, all you need to know is that audio encrypted with SRTP can only be deciphered while the call is in progress, and only by the client and server that negotiated the call to begin with.
The sound of success
While diagrams are nice, it always helps to listen to call examples to fully understand what is happening on the wire.
This is an outbound call made using Twilio SIP Domains. The sound we hear is the RTP stream of the call.
Fun! We can hear the destination IVR, music on hold, the caller’s voice, jitter artifacts, and everything else that makes up an RTP stream.
This is a recording of the same exact call flow, but with Secure Media enabled.
Unintelligible static. Of course, it didn’t sound like hot burning garbage while the call was in progress, as both parties were able to decrypt the traffic in real time.
It is only when an attacker goes to replay a captured SRTP stream that it sounds like melting ketchup packets.
Behind the MoH
Now that we can hear the power of encryption, we can look deeper at the technical differences between RTP and SRTP.
RTP packet
The RTP spec describes how to packetize digital audio for phone calls. When a SIP session is established, the client and server agree on the RTP packet time, amongst other things.
This ‘ptime’ is the duration in milliseconds that each digital audio packet represents.
In this example of a Twilio SIP Domain call, the ptime value of 20 can be found in the SDP of the 200 OK:
This means the digital audio data will be encoded, dissected, and transmitted in 20 millisecond chunks.
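As a rough illustration of that arithmetic, the sketch below pulls the ptime attribute out of an SDP body and works out how many samples each packet carries. The SDP text is a made-up example rather than the capture from this call, the function names are illustrative, and the 8000 Hz clock rate assumes G.711 PCMU.

```python
import re

# Hypothetical SDP body for illustration; not taken from the actual capture.
EXAMPLE_SDP = """v=0
o=- 0 0 IN IP4 203.0.113.10
s=-
c=IN IP4 203.0.113.10
t=0 0
m=audio 10000 RTP/AVP 0
a=rtpmap:0 PCMU/8000
a=ptime:20
"""


def parse_ptime_ms(sdp: str, default_ms: int = 20) -> int:
    """Return the a=ptime value in milliseconds, or a default if absent."""
    match = re.search(r"^a=ptime:(\d+)\s*$", sdp, flags=re.MULTILINE)
    return int(match.group(1)) if match else default_ms


def samples_per_packet(clock_rate_hz: int, ptime_ms: int) -> int:
    """Number of audio samples carried by one packet."""
    return clock_rate_hz * ptime_ms // 1000


ptime = parse_ptime_ms(EXAMPLE_SDP)
samples = samples_per_packet(8000, ptime)
print(ptime, samples)  # 20 160
```

With PCMU's one byte per sample, that works out to 160 bytes of audio payload per packet, and the RTP timestamp advances by 160 from one packet to the next.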
The timestamps for each RTP packet confirm this ptime. We can also clearly see Wireshark correctly identifying each packet as RTP in the protocol column.
Looking closer at a single 20ms unencrypted RTP packet, we can see the header fields and payload in clear text:
The hexdump of the payload of this particular RTP packet is all ffs. Each ff byte is all 1s in binary, and in G.711 PCMU (μ-law) encoding a byte of ff decodes to a zero-amplitude sample. In the context of a SIP call, all ffs indicates the 20ms slice of audio was digitally silent.
Take a look at the waveform for this call. In the green section, audio information is present, and visualized by amplitude changes over time. The red section is flat, indicating the absence of audio information. Nothing to hear here.
Digital silence is distinct from perceived silence. For example, if I simply stop speaking on a call, audio information will still be present; i.e. background noise from the room, line noise from the hardware, etc. On most SIP devices, the mute function will produce this digital silence.
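To make the "all ffs means silence" idea concrete, here is a small sketch of G.711 μ-law decoding, assuming the call used PCMU. The function name is illustrative, and the steps follow the commonly used reference approach for μ-law expansion rather than anything specific to Twilio.

```python
def ulaw_decode(byte: int) -> int:
    """Expand one G.711 mu-law byte into a signed 16-bit style PCM sample."""
    u = ~byte & 0xFF              # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


# A 20 ms PCMU payload from a muted device: 160 bytes of 0xff.
silent_payload = b"\xff" * 160
decoded = [ulaw_decode(b) for b in silent_payload]
print(set(decoded))  # {0}
```

Every sample comes out as zero amplitude, which is why the waveform is perfectly flat during mute.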
tl;dr - we know this RTP packet is unencrypted, since we can visually decipher the payload and listen to the audio content of the packet.
SRTP packet
SRTP is represented in Wireshark as UDP, since Wireshark can no longer automatically identify the stream as RTP once Secure Media is enabled.
Since we know what the packet is, we can instruct Wireshark to decode this UDP as RTP.
The header is in cleartext as expected, but the payload is encrypted.
Most humans can’t even read unencrypted hex dumps. So, without listening, how can we tell if the RTP in this packet is actually encrypted?
This example was generated in a controlled environment, and the same mute function was used as in the above unencrypted RTP example. Based on the timestamp, this packet is known to be from a section of audio where the SIP device was producing digital silence. However, since the ffs have been encrypted, we can’t visually interpret the payload.
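One rough way to check this without listening is to measure how evenly the payload bytes are distributed. The sketch below computes Shannon entropy per byte. It is only a heuristic (loud, compressed audio can also look fairly random), and the "encrypted" payload here is stand-in random data from os.urandom, not a real SRTP capture.

```python
import math
import os
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


# Digital silence in PCMU: one repeated byte value, so no entropy at all.
silent_payload = b"\xff" * 160

# Stand-in for an encrypted payload (assumption: ciphertext looks random).
encrypted_like_payload = os.urandom(160)

print(shannon_entropy(silent_payload))
print(shannon_entropy(encrypted_like_payload))
```

A payload that is known to be muted audio yet scores close to the maximum for its length is a strong hint that it has been encrypted.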
Of course, the more obvious way to tell if SRTP is being used is to check SIP signaling. We can see the crypto attributes being offered for negotiation in the SDP of the SIP INVITE being sent from my device. These are the cipher suites my device supports.
The 200 OK sent by Twilio contains the cipher suite that was agreed upon, and will be used to encrypt the RTP.
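For readers who have not seen one, a crypto attribute is a single SDP line. The sketch below parses the layout described in RFC 4568 (tag, crypto suite, then an inline base64 key). The key material is generated on the spot purely for illustration, the function name is hypothetical, and the 16-byte key plus 14-byte salt split assumes one of the AES_CM_128 suites.

```python
import base64


def parse_crypto_attribute(line: str) -> dict:
    """Parse an SDP 'a=crypto' line (RFC 4568 layout) into its main parts."""
    prefix = "a=crypto:"
    if not line.startswith(prefix):
        raise ValueError("not a crypto attribute")
    fields = line[len(prefix):].split()
    if len(fields) < 3:
        raise ValueError("expected tag, suite, and key parameters")
    tag, suite, key_params = fields[0], fields[1], fields[2]
    method, _, key_info = key_params.partition(":")
    if method != "inline":
        raise ValueError("unsupported key method")
    # Optional lifetime and MKI fields may follow the key, separated by '|'.
    key_and_salt = base64.b64decode(key_info.split("|")[0])
    return {
        "tag": int(tag),
        "suite": suite,
        # Assumption: AES_CM_128 suites use a 16-byte key and 14-byte salt.
        "master_key": key_and_salt[:16],
        "master_salt": key_and_salt[16:],
    }


# Made-up key material for the example; never reuse a key like this.
example_key = base64.b64encode(bytes(range(30))).decode("ascii")
example_line = f"a=crypto:1 AES_CM_128_HMAC_SHA1_80 inline:{example_key}"

parsed = parse_crypto_attribute(example_line)
print(parsed["suite"], len(parsed["master_key"]), len(parsed["master_salt"]))
# AES_CM_128_HMAC_SHA1_80 16 14
```

Because the key travels inside the SDP, this style of negotiation is only as safe as the signaling that carries it.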
RTP header vs RTP payload
Now, SRTP specifically refers to the encryption of the RTP payload only. The payload is the part of an RTP packet that contains the digital audio information. With SRTP, the header is authenticated, but not actually encrypted, which means sensitive information could still potentially be exposed.
The main components of the RTP header are:
- Payload Type - The encoding of the RTP packet (G.711 PCMU, opus, etc.)
- Sequence Number - integer that increments with each RTP data packet sent. Can be utilized to detect and smooth jitter and packet loss.
- Timestamp - begins as a random starting number, and increments by the number of samples in each packet, i.e. the sampling rate multiplied by the packetization time (8000 Hz x 20 ms = 160 for G.711).
- Synchronization Source ID (SSRC) - A randomly chosen number identifying the source of the RTP stream, so the source can be told apart without relying on its network address.
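Since these fields stay in cleartext even with SRTP, they are easy to pull out of a captured packet. Here is a minimal sketch that unpacks the 12-byte fixed RTP header laid out in RFC 3550. The packet bytes are constructed by hand for the example, and the class and function names are illustrative.

```python
import struct
from typing import NamedTuple


class RtpHeader(NamedTuple):
    version: int
    padding: bool
    extension: bool
    csrc_count: int
    marker: bool
    payload_type: int
    sequence_number: int
    timestamp: int
    ssrc: int


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Unpack the 12-byte fixed RTP header (network byte order)."""
    if len(packet) < 12:
        raise ValueError("RTP fixed header is 12 bytes")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return RtpHeader(
        version=b0 >> 6,
        padding=bool(b0 & 0x20),
        extension=bool(b0 & 0x10),
        csrc_count=b0 & 0x0F,
        marker=bool(b1 & 0x80),
        payload_type=b1 & 0x7F,
        sequence_number=seq,
        timestamp=ts,
        ssrc=ssrc,
    )


# Hand-built example: version 2, payload type 0 (PCMU), plus a silent payload.
example_packet = struct.pack("!BBHII", 0x80, 0, 1000, 160000, 0x1234ABCD) + b"\xff" * 160
header = parse_rtp_header(example_packet)
print(header.version, header.payload_type, header.sequence_number)  # 2 0 1000
```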
Keeping the header unencrypted is critical for proper routing, so SRTP only covers authenticating the header’s association with the payload, which aids in replay protection.
In most cases the default header information is not considered sensitive, unlike the associated digital audio payload.
“Since the server in the cloud belongs to a third party service provider… the endpoints do not want to risk their personal or corporate information... To achieve that, the double layer of security that is needed cannot be established without any meaningful modification of SRTP.”
When using VBR Codecs, the rate information can be seen unencrypted in the RTP header. This can pose a risk, as a savvy eavesdropper could deduce the content of speech based on the rate information or speech level.
From RFC 6562: Guidelines for the Use of Variable Bit Rate Audio with Secure RTP:
“In the worst case, using the rate information to recognize a prerecorded message knowing the set of all possible messages would lead to near-perfect accuracy.”
Furthermore, RFC 6464 outlines a header extension which can contain audio level information, regardless of encoding.
“Such an attacker might be able to infer information about the conversation, possibly with phoneme-level resolution. In scenarios where this is a concern, additional mechanisms MUST be used to protect the confidentiality of the header extension.“
Telecom admins are advised to use padding in conjunction with VBR codecs, especially in use cases with structured conversations like with an IVR system.
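As a sketch of what padding means at the packet level: RFC 3550 lets a sender append padding bytes to the payload, set the padding bit in the header, and store the padding count in the final byte. The helpers below pad every payload to one fixed length so packet size stops tracking the codec's bit rate. The function names and the target length are assumptions for illustration, and a real stack would apply this before SRTP encryption.

```python
from typing import Tuple


def pad_rtp_payload(payload: bytes, target_len: int) -> Tuple[bytes, bool]:
    """Pad a payload to target_len; returns (payload, padding_bit)."""
    pad_len = target_len - len(payload)
    if pad_len < 0:
        raise ValueError("payload longer than target length")
    if pad_len > 255:
        raise ValueError("padding count must fit in one byte")
    if pad_len == 0:
        return payload, False
    # Zero filler, with the last byte holding the total padding count.
    return payload + bytes(pad_len - 1) + bytes([pad_len]), True


def strip_rtp_padding(padded: bytes, padding_bit: bool) -> bytes:
    """Reverse pad_rtp_payload on the receiving side."""
    if not padding_bit:
        return padded
    pad_len = padded[-1]
    if pad_len == 0 or pad_len > len(padded):
        raise ValueError("invalid padding count")
    return padded[:-pad_len]


short_frame = b"\x01\x02\x03"
padded, padding_bit = pad_rtp_payload(short_frame, 10)
print(len(padded), padding_bit, padded[-1])  # 10 True 7
```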
RFC 6904 describes a future where these headers can be selectively encrypted based on their content or likelihood of containing sensitive info. This is not standard practice yet though.
Why we encrypt
Eavesdropping
The above recordings were obtained by ARP Poisoning myself, and using Wireshark to capture and reconstruct the SIP/RTP packets.
Capturing and analyzing SIP traffic is an essential troubleshooting skill for any network engineer or VoIP technician. However, bad actors also leverage packet capture tools in attempts to gain access to your data, or completely disable your infrastructure. Capturing network traffic for nefarious purposes is known as eavesdropping or packet sniffing.
The threat of eavesdropping is ever present, and difficult to avoid entirely, especially on wireless networks. It’s OK if you can’t install a Faraday cage at your home or office. Using SRTP essentially renders sniffed packets useless, which mitigates the risk of data exposure that eavesdropping poses.
Replay attack
A replay attack occurs when a bad actor replays network packets they have nefariously captured via eavesdropping.
In this video, David Bombal demonstrates how to specifically intercept and replay RTP packets.
If SRTP were enabled, an attacker could still eavesdrop, but would not be able to conduct a replay attack.
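Part of how a receiver resists replays is by remembering which packet indexes it has already accepted. The sketch below shows the general sliding-window idea, assuming a simple integer packet index. The class name and window size are illustrative, and a real SRTP stack would track the full index (rollover counter plus sequence number) and only update the window after the auth tag checks out.

```python
class ReplayWindow:
    """Sliding-window duplicate detector over increasing packet indexes."""

    def __init__(self, size: int = 64) -> None:
        self.size = size
        self.highest = -1   # highest index accepted so far
        self.seen = 0       # bit i set means index (highest - i) was accepted

    def accept(self, index: int) -> bool:
        """Return True if the packet is new, False if replayed or too old."""
        if index > self.highest:
            shift = index - self.highest
            mask = (1 << self.size) - 1
            self.seen = ((self.seen << shift) | 1) & mask
            self.highest = index
            return True
        offset = self.highest - index
        if offset >= self.size:
            return False    # fell out of the window: treat as a replay
        if self.seen & (1 << offset):
            return False    # already accepted once
        self.seen |= 1 << offset
        return True


window = ReplayWindow()
print([window.accept(i) for i in (0, 1, 1, 5, 3, 3)])
# [True, True, False, True, True, False]
```

Late but legitimate packets (index 3 arriving after 5) still get through once, while an exact repeat is dropped.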
Replay attacks should not necessarily be confined to the context of replaying network packets. Attackers can replay information they capture from eavesdropping. For example, a fraudster may conduct social engineering attacks, and pose as an authorized individual using information gleaned from eavesdropping on phone calls. It is important to consider enabling SRTP, and to follow anti-fraud best practices for calls where any remotely sensitive information is discussed.
Payload integrity
As mentioned above, SRTP headers are authenticated, but not encrypted. In other words, while the SRTP header is sent in the clear, the receiver can validate that the sender’s headers are actually associated with the encrypted payloads they precede.
The sender runs the full SRTP packet contents through a keyed hash function (HMAC-SHA1 by default), along with the session authentication key, which produces a digest called the auth tag. The sender then appends the auth tag to the end of the encrypted payload, and sends the fully constructed SRTP packet to the receiver.
The receiver chips off the auth tag sent to them, does the same HMAC-SHA1 digest generation, and compares the two values. If they match, the plaintext header is associated with the encrypted payload.
However, an attacker can modify SRTP packets without the receiver knowing when the sender uses a weak and/or vulnerable message authentication method. It is best to start with the defaults listed in the SRTP RFC (RFC 3711), and use HMAC-SHA1, with a session authentication key length of 160 bits, and a resulting authentication tag length of 80 bits.
Needless to say, using a zero-length authentication tag should absolutely be avoided.
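The tag-and-compare flow can be sketched with Python's standard library. This assumes the RFC 3711 defaults (HMAC-SHA1, a 160-bit authentication key, an 80-bit tag, and the rollover counter appended to the authenticated data). The keys and "encrypted" payload are random stand-ins, the function names are illustrative, and a real implementation derives the authentication key from the master key rather than generating it directly.

```python
import hashlib
import hmac
import os

TAG_LEN = 10  # 80-bit authentication tag


def compute_auth_tag(auth_key: bytes, authenticated: bytes, roc: int = 0) -> bytes:
    """HMAC-SHA1 over header + encrypted payload + rollover counter, truncated."""
    message = authenticated + roc.to_bytes(4, "big")
    return hmac.new(auth_key, message, hashlib.sha1).digest()[:TAG_LEN]


def protect(auth_key: bytes, header: bytes, encrypted_payload: bytes) -> bytes:
    """Append the auth tag to a header plus already-encrypted payload."""
    body = header + encrypted_payload
    return body + compute_auth_tag(auth_key, body)


def verify(auth_key: bytes, packet: bytes) -> bool:
    """Recompute the tag on the receiving side and compare in constant time."""
    body, tag = packet[:-TAG_LEN], packet[-TAG_LEN:]
    return hmac.compare_digest(tag, compute_auth_tag(auth_key, body))


auth_key = os.urandom(20)                      # stand-in 160-bit session auth key
header = bytes([0x80, 0x00]) + bytes(10)       # stand-in 12-byte cleartext header
encrypted_payload = os.urandom(160)            # stand-in for real ciphertext

packet = protect(auth_key, header, encrypted_payload)
tampered = bytes([packet[0] ^ 0x01]) + packet[1:]  # flip one header bit

print(verify(auth_key, packet))    # True
print(verify(auth_key, tampered))  # False
```

Flipping a single bit in the cleartext header is enough to make verification fail, which is exactly the association between header and payload described above.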
SRTP and the PSTN
The Public Switched Telephone Network, by nature, does not support SRTP and in some cases is infamously unencrypted.
SRTP applies to IP-based media, such as SIP calls that run over the Internet.
So, if your SIP calls hit the Public Switched Telephone Network (PSTN), the media will undoubtedly be unencrypted at some point, even with SRTP configured.
The PSTN’s main protocol, SS7, does utilize digital signatures, and the comms between cell phones and towers are (mostly) secure.
Whenever possible, be sure to work with your telecom provider to understand their security policies, their response to eavesdropping threats, and the risk of data exposure over their network. At SIGNAL in 2017, B Byrne, the Head of Product for Authy, discussed how exposed vulnerabilities in the SS7 network fundamentally changed how the telecom industry approaches security.
On the other hand, media will stay encrypted with SRTP as long as the call hops through SIP B2BUAs over the Internet. So a business doing all their calling through SIP infrastructure, even if not physically colocated, can still leverage SRTP to keep media encrypted on every hop.
Too much overhead?
I know exactly what you are thinking. “The quality of a VoIP call is directly related to the conditions present while the call is in progress. Since encryption uses valuable computational resources, voice quality will surely suffer when SRTP is enabled!”
Some users choose to forgo SRTP due to resource constraints or implementation complexities, but this may be misguided.
Researchers at Towson University performed a study on the processing overhead of SRTP in various environments. The results “indicate that SRTP adds negligible overhead to VoIP processing and has no observable effect on VoIP quality.”
Twilio provides a Voice Insights dashboard and REST API so you can monitor voice quality in your voice application, and compare your own metrics when toggling Secure Media.
Using SRTP with Twilio
Twilio Programmable Voice and Elastic SIP Trunking both support SRTP. Configuration on our end couldn’t be easier.
Let’s run through how to set it up!
Elastic SIP Trunking
https://www.twilio.com/docs/sip-trunking#securetrunking
Log into Console, and click on the Trunk you wish to secure.
Under Features, click the toggle to enable Secure Trunking.
SIP Domains
https://www.twilio.com/docs/voice/api/secure-media
Log into Console, and click the SIP Domain you wish to secure.
Under Secure Media, click the toggle to enable.
Don’t forget to Save!
Like most things at Twilio, there is an API for that!
Elastic SIP Trunking
https://www.twilio.com/docs/sip-trunking/api/trunk-resource#update-a-trunk-resource
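As a hedged sketch of what such an API call might look like, the snippet below builds (but does not send) an HTTP request that turns on Secure Trunking. The endpoint shape and the Secure parameter name reflect my reading of the Trunk resource docs linked above, so please confirm them there. The SIDs and token are placeholders, and the function name is hypothetical.

```python
import base64
import urllib.parse
import urllib.request


def build_secure_trunk_request(account_sid: str, auth_token: str,
                               trunk_sid: str, secure: bool = True) -> urllib.request.Request:
    """Build a POST that updates a trunk's secure setting (not sent here)."""
    # Assumption: Trunk update endpoint and 'Secure' parameter per the linked docs.
    url = f"https://trunking.twilio.com/v1/Trunks/{trunk_sid}"
    body = urllib.parse.urlencode({"Secure": "true" if secure else "false"}).encode("ascii")
    request = urllib.request.Request(url, data=body, method="POST")
    credentials = base64.b64encode(f"{account_sid}:{auth_token}".encode("utf-8")).decode("ascii")
    request.add_header("Authorization", f"Basic {credentials}")
    request.add_header("Content-Type", "application/x-www-form-urlencoded")
    return request


# Placeholder credentials and SID for illustration only.
request = build_secure_trunk_request("ACxxxxxxxx", "your_auth_token", "TKxxxxxxxx")
print(request.get_method(), request.full_url, request.data)
```

Sending it would be a matter of passing the request to urllib.request.urlopen, or using one of Twilio's helper libraries instead.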
Outbound Calls
On outbound calls, simply append ;transport=tls as a parameter at the end of the SIP URI.
<Dial><Sip> TwiML
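Here is a minimal sketch of generating that TwiML with Python's standard library. The SIP URI is a placeholder, and the helper name is made up for this example. Twilio's helper libraries can produce the same XML, so treat this as one possible approach.

```python
import xml.etree.ElementTree as ET


def build_secure_dial_twiml(sip_uri: str) -> str:
    """Return TwiML that dials a SIP URI with the transport=tls parameter."""
    # Only append the parameter if the URI does not already set a transport.
    uri = sip_uri if ";transport=" in sip_uri else sip_uri + ";transport=tls"
    response = ET.Element("Response")
    dial = ET.SubElement(response, "Dial")
    sip = ET.SubElement(dial, "Sip")
    sip.text = uri
    return ET.tostring(response, encoding="unicode")


twiml = build_secure_dial_twiml("sip:alice@example.com")
print(twiml)
# <Response><Dial><Sip>sip:alice@example.com;transport=tls</Sip></Dial></Response>
```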
Signaling over TLS is also available, but is typically configured and negotiated separately. At Twilio however, when you enable Secure Media you are also requiring SIP signaling to be sent over TLS. Packet capture files are not available in Twilio Console for encrypted calls like they are for unencrypted SIP calls.
That’s it on the Twilio side! Be sure to configure SRTP on your PBX according to the vendor’s instructions.
Implementation considerations
As seen above, enabling Secure Media on the Twilio side is very straightforward. However, you will need to do some configuration on your end as well.
Each SIP device is like a snowflake, with its own unique characteristics, interface, terminology, config schema, functionality, usability, supportability, limitations, etc. These complexities are compounded by the network conditions and overall environment the device is operating within.
In other words, there is no one singular way to enable SRTP that works for every user on every device.
pjsua
pjsua (pjsip user agent), is a command line SIP softphone that comes bundled with the pjsip install. It is great for testing, automation, embedded devices, and impressing your friends.
You won’t find a familiar telephone UI with pjsua. Instead, endpoints register via a config file, outbound calls can be scripted in almost any language, and SIP signaling can be generated via commands. For the most part, it ‘just works’, but to utilize SRTP, the source code must explicitly be built with tls mode enabled.
If OpenSSL libraries are not installed and/or in appropriate working order on the target machine, the build may succeed but SRTP will still not function.
Bria 5
Bria by Counterpath is a leading softphone application. SRTP is only available with the paid subscription models. However, with the proper access, enabling TLS and SRTP is only a few clicks away.
SIPp
With SIPp, call flows and SIP signaling can be configured with XML, which brings a TwiML-esque feel to SIP Trunking call control.
However, the official build of SIPp does not support SRTP. At all. Luckily, ankitonweb's forked version does.
Poly (previously Polycom)
Most Poly phones have the option to enable SRTP on every call to/from the device, or per Registration.
The most common method is to use the device’s UI. Again, every model is different, but the Polycom forums outline one way.
However, this setting can also be enabled by provisioning a configuration file.
FreePBX
Within FreePBX, SRTP must be enabled in both the General SIP Settings, and within the settings for the Extension(s) you wish to encrypt media on.
The exact steps are described on the FreePBX wiki.
Conclusion
SRTP is awesome, yet woefully underutilized. There is almost no reason to NOT use SRTP with your Twilio SIP Domain and/or Elastic SIP Trunking setup. The added security and peace of mind far outweigh any potential overhead. Furthermore, any imperfections in the spec do not render encrypted payloads useless.
Even if you are leaving SIP Land and dancing with the PSTN, SRTP is your best mitigation against bad actors sniffing, modifying, or replaying your sensitive communications.
Check out our docs, and let us know about your experiences in our Community Forums. We can’t wait to not see what you encrypt!
Matt Coser is a Senior Field Security Engineer at Twilio. His focus is on telecom security, and empowering Twilio’s customers to build safely. Hit him up on LinkedIn to connect and discuss more.