Adding Dominant Speaker Detection for Twilio Programmable Video with TypeScript

June 01, 2021
Written by
Jamie Corkhill
Contributor
Opinions expressed by Twilio contributors are their own

In this article, you’ll learn to use TypeScript and Twilio Programmable Video to build a video chatting application with a dominant speaker display. You’ll use an existing base project making use of the Twilio Programmable Video JavaScript SDK (for front-end video) and the Twilio Node Helper Library (for back-end authentication) and retrofit it to support dominant speaker detection.

This article is a continuation of my last one, Add Muting and Unmuting Capability to your Twilio Programmable Video App with TypeScript, and it will build off the “adding-mute-unmute” branch of this GitHub Repository. To see the final code, visit the “adding-dominant-speaker-detection” branch.

Twilio Programmable Video is a suite of tools for building real-time video apps that scale as you grow, from free 1:1 chats with WebRTC to larger group rooms with many participants. You can sign up for a free Twilio account to get started using Programmable Video.

TypeScript is an extension of pure JavaScript - a “superset” if you will - and adds static typing to the language. It enforces type safety, makes code easier to reason about, and permits the implementation of classic patterns in a more “traditional” manner. As a language extension, all JavaScript is valid TypeScript, and TypeScript is compiled down to JavaScript.

Parcel is a blazing-fast, zero-configuration web application bundler that supports hot module replacement and bundles and transforms your assets. You’ll use it in this article to work with TypeScript on the client without having to worry about transpilation, bundling, or configuration.

Requirements

  • Node.js - Consider using a tool like nvm to manage Node.js versions.
  • A Twilio Account for Programmable Video. If you are new to Twilio, you can create a free account. If you sign up using this link, we’ll both get $10 in free Twilio credit when you upgrade your account.

Project Configuration

Download the project files and install dependencies

Begin by cloning the “adding-mute-unmute” branch of the accompanying GitHub Repository with the command below:

git clone -b adding-mute-unmute --single-branch https://github.com/JamieCorkhill/Twilio-Video-Series

Then, install dependencies in both the client and the server project:

cd Twilio-Video-Series
cd client && npm i
cd ../server && npm i

Configure Environment Variables

The server directory contains a small Express Application which is used to manage identity and authentication for users joining rooms (by generating tokens). Before the server will function correctly, you’ll need to specify three environment variables corresponding to your Twilio Account SID, your Twilio API Key, and your Twilio API Key Secret. You can see how the Twilio Server Library uses them within your Express Application to generate access tokens in the AccessToken constructor call here:

/**
 * Generates a video token for a given identity and room name.
 */
export function generateToken(req: Request, res: Response) {
    const dto = req.body as IGenerateVideoTokenRequestDto;

    // Generate an access token for the given identity.
    const token = new AccessToken(
        config.twilio.ACCOUNT_SID,
        config.twilio.API_KEY,
        config.twilio.API_SECRET,
        { identity: dto.identity }
    );

    // Grant access to Twilio Video capabilities.
    const grant = new VideoGrant({ room: dto.roomName });
    token.addGrant(grant);

    return res.send({ token: token.toJwt() });
}

This function can be found in server/src/api/controller.ts. See my article Get Started with Twilio Programmable Video Authentication and Identity using TypeScript or the relevant section of the documentation to learn more about Access Tokens.

If you are not already there from the prior step, navigate into the server folder and create a new folder within it entitled env. Within that, create a file with the name dev.env. The commands below will perform these steps:

cd server
mkdir env
touch env/dev.env

Add the following variables to dev.env.

TWILIO_ACCOUNT_SID=[Your Key]
TWILIO_API_KEY=[Your Key]
TWILIO_API_SECRET=[Your Key]

You can find your Account SID on the Twilio Console and you can create your API Key and API Secret here. Add these keys in their respective locations, overwriting the [Your Key] placeholder in its entirety each time.

Note that on the API dashboard of the Console, your API key will be referred to as the API SID. Also, be sure to take note of your API Key Secret before navigating away from the page - you won’t be able to access it again.
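A missing value, or a [Your Key] placeholder left in dev.env, tends to surface only later, when token generation fails. The sketch below shows one way to check the three variables up front. The readTwilioConfig() helper and the TwilioConfig shape are hypothetical and not part of the base project, and how the project itself loads dev.env is not shown here.

```typescript
// Hypothetical shape for the three values stored in dev.env.
interface TwilioConfig {
    ACCOUNT_SID: string;
    API_KEY: string;
    API_SECRET: string;
}

/**
 * Reads the three Twilio variables from an environment-like object and
 * throws if any is missing, empty, or still set to the [Your Key] placeholder.
 */
function readTwilioConfig(env: Record<string, string | undefined>): TwilioConfig {
    const read = (name: string): string => {
        const value = env[name]?.trim();
        if (!value || value === '[Your Key]') {
            throw new Error(`Missing environment variable: ${name}`);
        }
        return value;
    };

    return {
        ACCOUNT_SID: read('TWILIO_ACCOUNT_SID'),
        API_KEY: read('TWILIO_API_KEY'),
        API_SECRET: read('TWILIO_API_SECRET'),
    };
}
```

In a Node.js process, a call such as readTwilioConfig(process.env) at startup would fail fast instead of at the first token request.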

With this, you’ve completed all necessary configuration for the project, and can move on to implementing dominant speaker detection.

Update the Client

You’ll alert participants of who the current dominant speaker is by displaying a highlighted outline/bounding box around their video stream. To accomplish this, add the .dominant-speaker CSS class shown below to the <style> block of index.html:

<!DOCTYPE html>
<html lang="en">
<head>
    <title>Twilio Video Development Demo</title>

    <style>
        .media-container {
            display: flex;
        }

        .media-container > * + * {
            margin-left: 1.5rem;
        }

        .dominant-speaker {
            outline: 3px solid green;
        }
    </style>
</head>
<body>
    <div class="media-container">
        <div id="local-media-container"></div> 
        <div id="remote-media-container"></div>
    </div>
    
    <div>
        <input id="room-name-input" type="text" placeholder="Room Name"/>
        <input id="identity-input" type="text" placeholder="Your Name"/>
        <button id="join-button">Join Room</button>
        <button id="leave-button">Leave Room</button>
        <button id="mute-unmute-audio-button">Mute Audio</button>
        <button id="mute-unmute-video-button">Mute Video</button>
    </div>
    
    <script src="./src/video.ts"></script>
</body>
</html>

I chose green and 3 pixels, but you can pick any styles you want.

Next, move to the video.ts file in the src folder, find the attachTrack() function (which is about halfway down the file near line 270), and replace it as follows:

/**
 * Attaches a remote track within a parent element for the particular participant.
 * 
 * @param track 
 * The remote track to attach.
 * @param participantIdentity
 * The identity of the participant who owns the track; used as the wrapper's ID.
 */
function attachTrack(track: RemoteAudioTrack | RemoteVideoTrack, participantIdentity: string) {
    // Look up the wrapper by the identity value itself, not a string literal.
    let participantParentElement = document.getElementById(participantIdentity);

    // First track for this participant: create the wrapper and add it to the page.
    if (!participantParentElement) {
        participantParentElement = document.createElement('div');
        participantParentElement.id = participantIdentity;
        remoteMediaContainer.appendChild(participantParentElement);
    }

    participantParentElement.appendChild(track.attach());
}

Originally, this function would only attach the track to the DOM within your remoteMediaContainer. With the update, it now creates a wrapper <div/> the first time it sees a participant, stores both of that participant’s tracks inside it, and sets the ID of the <div/> to the participant’s identity. This will allow you to more easily grab a reference to an individual participant’s container for styling, etc.
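The “find the wrapper or create it” step can be sketched without the DOM. In this minimal model, a Map stands in for the remote media container and an array of track names stands in for each wrapper. The getOrCreateContainer() helper and the identities 'alice' and 'bob' are purely illustrative and do not appear in the project.

```typescript
// In-memory stand-in for the remote media container: one entry per participant.
const containers = new Map<string, string[]>();

/**
 * Returns the track list for a participant, creating it on first use.
 * This mirrors the lookup-then-create step of attachTrack().
 */
function getOrCreateContainer(participantIdentity: string): string[] {
    let container = containers.get(participantIdentity);
    if (!container) {
        container = [];
        containers.set(participantIdentity, container);
    }
    return container;
}

// Both of Alice's tracks land in the same wrapper; Bob gets his own.
getOrCreateContainer('alice').push('audio');
getOrCreateContainer('alice').push('video');
getOrCreateContainer('bob').push('audio');
```

The lookup key has to be the identity value. If every lookup used one fixed key that is never stored, each track would end up in its own new wrapper.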

You can now update the onTrackSubscribed() function (which is located on line 150) to pass through the identity of the participant in the attachTrack() call:

/**
 * Triggers when a remote track is subscribed to.
 * 
 * @param track 
 * The remote track
 * @param participant
 * The remote participant who published the track.
 */
function onTrackSubscribed(track: RemoteTrack, participant: RemoteParticipant) {
    attachTrackEnabledDisabledHandlers(track, participant);

    if (!trackExistsAndIsAttachable(track))
        return;

    attachTrack(track, participant.identity);
}

There was a typo in the previous article of this series which called the attachTrackEnabledDisabledHandlers() function attachTrackEnabledAndDisabledHandlers(). That has since been fixed, but if you have any issues, be sure to remove the “and” from the function definition if it has it. You can find the starting code for this article without this issue here.

You can do the same in the attachAttachableTracksForRemoteParticipant() function:

/**
 * Attaches all attachable published tracks from the remote participant.
 * 
 * @param participant 
 * The remote participant whose published tracks should be attached.
 */
function attachAttachableTracksForRemoteParticipant(participant: RemoteParticipant) {
    participant.tracks.forEach(publication => {
        if (!publication.isSubscribed)
            return;

        if (!trackExistsAndIsAttachable(publication.track))
            return;

        attachTrack(publication.track, participant.identity);
    });
}

After that, create a new function underneath manageTracksForRemoteParticipant() but above attachAttachableTracksForRemoteParticipant() called onDominantSpeakerChanged() as shown:

/**
 * Manages displaying the dominant speaker.
 * 
 * @param participant 
 * The participant who is now the dominant speaker, or null if there is none.
 */
function onDominantSpeakerChanged(participant: Participant | null) {
    document.querySelectorAll('.dominant-speaker')
        .forEach(item => item.classList.remove('dominant-speaker'));

    // The SDK may report that there is no dominant speaker; then only the reset applies.
    if (!participant)
        return;

    document.getElementById(participant.identity)?.classList.add('dominant-speaker');
}

This function will first remove any active .dominant-speaker classes from any participants to “reset” the state of the application, and will then apply the .dominant-speaker class to the active speaker. You’re able to query the DOM based on participant.identity thanks to the updates you made in the attachTrack() function earlier.
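The reset-then-apply logic can be exercised on its own with an in-memory model, which is handy for a quick unit test. In this sketch, each participant wrapper is represented by a Set of class names keyed by identity. The highlightDominantSpeaker() function and the sample identities are hypothetical, and treating null as “no dominant speaker” is an assumption about how the handler might be called.

```typescript
/**
 * Given the class sets of all participant wrappers (keyed by identity),
 * clears any existing highlight and applies it to the new dominant speaker.
 * A null speaker simply clears the highlight.
 */
function highlightDominantSpeaker(
    wrappers: Map<string, Set<string>>,
    dominantIdentity: string | null
): void {
    // Reset: remove the class from every wrapper.
    wrappers.forEach(classes => classes.delete('dominant-speaker'));

    // Apply: highlight the new speaker, if there is one and it has a wrapper.
    if (dominantIdentity !== null) {
        wrappers.get(dominantIdentity)?.add('dominant-speaker');
    }
}

const wrappers = new Map<string, Set<string>>([
    ['alice', new Set<string>()],
    ['bob', new Set<string>()],
]);

highlightDominantSpeaker(wrappers, 'alice');
highlightDominantSpeaker(wrappers, 'bob'); // alice is cleared, bob is highlighted
```

Because the reset runs first on every call, at most one wrapper carries the class at any time.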

The second statement of the function uses a potentially peculiar-looking syntax: ?. (a question mark followed by a period). This is a JavaScript feature known as Optional Chaining, and it permits you to read properties or call methods on a value which may be null or undefined.

You can’t guarantee that document.getElementById() will return a valid reference to a DOM Element that has a classList property (its return type allows null, since there may be no element with that ID as far as the TypeScript Compiler is concerned). The Optional Chaining syntax here will short circuit the entire operation should the reference returned from getElementById() be nullish (null or undefined), which will save you from trying to access properties on a null value.
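To see the short-circuit behavior in isolation, here is a small sketch that runs outside the browser. The ElementLike interface and the findById() function are hypothetical stand-ins for a DOM element and document.getElementById(), and the identities are made up for the example.

```typescript
// Minimal stand-in for a DOM element: only the part used here.
interface ElementLike {
    classList: Set<string>;
}

// Mimics document.getElementById(): returns null when no element matches.
function findById(elements: Map<string, ElementLike>, id: string): ElementLike | null {
    return elements.get(id) ?? null;
}

const elements = new Map<string, ElementLike>([
    ['alice', { classList: new Set<string>() }],
]);

// Element exists: the class is added.
findById(elements, 'alice')?.classList.add('dominant-speaker');

// No such element: the whole expression evaluates to undefined and add() never runs.
const result = findById(elements, 'carol')?.classList.add('dominant-speaker');
```

Without the ?. operator, the second call would throw a TypeError at runtime, and the TypeScript Compiler would reject it under strict null checks.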

While you’re here, ensure the Participant type has been imported from twilio-video at the top import statement if your editor/IDE doesn’t import it automatically.

With the event handler complete, you now need to wire it up. The Twilio Programmable Video JavaScript SDK emits a dominantSpeakerChanged event, described here, which you can listen for. As its payload, it provides a reference to the participant who has transitioned into becoming the dominant speaker.

Pass a dominantSpeaker: true flag to the connect function and wire up the event handler from the room object in the onJoinClick() function towards the top of the video.ts file as shown below:

/**
 * Triggers when the join button is clicked.
 */
async function onJoinClick() {
    const roomName = roomNameInput.value;
    const identity = identityInput.value;
    room = await connect(await tokenRepository.getToken(roomName, identity), {
        name: roomName,
        audio: true,
        video: { width: 640 },
        dominantSpeaker: true
    });

    // Attach the remote tracks of participants already in the room.
    room.participants.forEach(
        participant => manageTracksForRemoteParticipant(participant)
    );

    // Wire-up event handlers.
    room.on('participantConnected', onParticipantConnected);
    room.on('participantDisconnected', onParticipantDisconnected);
    room.on('dominantSpeakerChanged', onDominantSpeakerChanged);
    window.onbeforeunload = () => room.disconnect();
    
    toggleInputs();
}

For more insight into options that can be passed into the connect() function, view the ConnectOptions reference. For more insight into the Twilio Client Library in general, visit the SDK documentation.

With this, your project is complete!

Running the Application

Before you can demo the project, you’ll need to start the Express Application from inside the server folder with:

cd server
npm start

As mentioned earlier, this server is used for generating tokens for participants. In order to test the dominant speaker detection across different machines, you can utilize ngrok to temporarily tunnel your localhost service to the public Internet with a public URL.

Make a note of the port displayed after running npm start above - it’ll likely be 3000.

In another terminal window, run npx ngrok http -host-header=rewrite 3000. This command will temporarily install ngrok and tunnel HTTP connections between localhost:3000 and a public URL. The “-host-header=rewrite” flag is there to solve some reported CORS issues. You should see output like what’s shown below:

screenshot of ngrok response

Pick the last HTTPS link, and paste it inside client/token-repository.ts as follows:

import axios from "axios";

/**
 * Creates an instance of a token repository.
 */
function makeTokenRepository() {
    return {
        /**
         * Provides an access token for the given room name and identity.
         */
        async getToken(roomName: string, identity: string) {
            const response = await axios.post<{ token: string }>('https://ba53a0f35b16.ngrok.io/create-token', {
                roomName,
                identity
            });

            return response.data.token;
        }
    }
}

/**
 * An instance of a token repository.
 */
export const tokenRepository = makeTokenRepository();

In doing so, you’ll allow your client to access the Express Application from another machine. Be sure to append the /create-token route on the end of the URL so you can reach the correct endpoint (which is the function for generating tokens which you saw earlier).
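Since the tunnel URL will typically differ each time you restart ngrok, it can help to keep the base URL in one place and build the endpoint from it. The following is a minimal sketch. The buildTokenUrl() helper and the example.ngrok.io address are hypothetical, and only the /create-token route comes from the project.

```typescript
/**
 * Joins a tunnel base URL and the /create-token route, tolerating
 * trailing slashes on the base URL.
 */
function buildTokenUrl(baseUrl: string): string {
    return `${baseUrl.replace(/\/+$/, '')}/create-token`;
}

// Placeholder address; substitute the HTTPS URL that ngrok prints for you.
const tokenUrl = buildTokenUrl('https://example.ngrok.io/');
```

The resulting string could then be passed to axios.post() in getToken() in place of the hard-coded URL.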

Next, you can open another terminal window and build the client application by navigating into the client folder and running Parcel:

cd client
parcel index.html

Your client should start running on port 1234, or any other port of Parcel’s choosing. In order to make the client accessible from other devices too, open one more terminal window and tunnel through ngrok once more:

npx ngrok http 1234

Make note of the HTTPS URL, and try visiting it across other devices, accepting the relevant permissions if prompted, or share it with other participants. You should notice the dominant speaker detection automatically kick in for each participant who speaks or has an elevated level of ambient environment noise.

Conclusion

In this article, you learned how to perform dominant speaker detection via the Twilio Client Library with TypeScript for Twilio Programmable Video. To view this project’s source code, visit the “adding-dominant-speaker-detection” branch at its GitHub Repository.

Jamie is an 18-year-old software developer located in Texas. He has particular interests in enterprise architecture (DDD/CQRS/ES), writing elegant and testable code, and Physics and Mathematics. He is currently working on a startup in the business automation and tech education space, and when not behind a computer, he enjoys reading and learning.