How to Build a Multimodal WhatsApp Bot with Twilio, Gemini, and PHP

September 10, 2024
Written by
Elijah Asaolu
Contributor
Opinions expressed by Twilio contributors are their own

Unless you live under a rock, you have almost certainly heard about or interacted with recent technologies like ChatGPT and Google Gemini. Behind the scenes, these technologies are powered by Large Language Models (LLMs). Recent updates are making them even more powerful, as they are now multimodal, which means they can natively process media inputs (images, audio, video). So, there's no better time than now to integrate this technology into powerful messaging systems.

In this tutorial, we'll go over what multimodal AI systems are and explore how to build a multimodal WhatsApp bot using PHP, Twilio, and Google Gemini's API. By the end of the tutorial, we'll have Gemini integrated with our WhatsApp bot, giving it the ability to respond to text, image, and audio input. Our final output will look like the image below.

Screenshot of a WhatsApp conversation with the bot built using Twilio API and Gemini API.

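Prerequisites

The following prerequisites, all of which are used later in this tutorial, are required to follow along:

  • PHP with Composer installed
  • A Twilio account
  • A Google Cloud account with the Google Cloud CLI installed
  • ngrok installed
  • A phone with WhatsApp installed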

What is Multimodal AI?

Modality in Artificial Intelligence (AI) and Machine Learning (ML) refers to the input/output channel that an AI/ML model can process, such as text, images, audio, etc. In the same sense, a Multimodal AI is one that can process information from multiple channels, like text, images, audio, and even video, all at once.

Before now, most AI models were unimodal, meaning they could only handle one data type. For example, an image recognition model could only analyze pictures, not accompanying text. Even models that seemed multimodal were mostly converting data from one format (like an image) to another (like text) before processing it. However, one downside of this approach is that it can lead to a loss of information.

Google's Gemini model is changing this narrative with its truly native multimodal models. These models can process various input combinations, including text, images, and audio, without relying on conversions. In the next sections, we'll integrate this model with Twilio's Programmable Messaging API to build a powerful WhatsApp bot that can process different types of input.

Build a Multimodal WhatsApp bot With Twilio and Gemini API

Let’s get started by creating a new folder for our project and changing into it, by running the commands below.

mkdir multimodal-whatsapp-bot
cd multimodal-whatsapp-bot

Next, open the project folder in your favorite text editor or IDE and create a new .env file. Then, copy the following into the file.

TWILIO_SID="PASTE_YOUR_TWILIO_ACCOUNT_SID"
TWILIO_AUTH_TOKEN="PASTE_YOUR_TWILIO_AUTH_TOKEN"
TWILIO_WHATSAPP_NUMBER="PASTE_YOUR_TWILIO_WHATSAPP_NUMBER"
GCLOUD_PROJECT_ID="YOUR_PROJECT_ID"
GCLOUD_ACCESS_TOKEN="PASTE_YOUR_ACCESS_TOKEN"

We'll proceed with retrieving the necessary credentials for building our application and storing them in this file.

Retrieve your Twilio credentials and set up your WhatsApp Sandbox

Sign in to the Twilio Console. Immediately after you're logged in, your Account SID and Auth Token should be displayed in the Account Info panel, as shown in the screenshot below.

Twilio Console showing the user's account information

Copy these values and paste them into your .env file in place of the respective placeholders.

Next, let's set up the WhatsApp Sandbox. Normally, before we can start integrating with WhatsApp, we'd need to request WhatsApp API access, which is usually a long process that takes days or weeks. However, with the Twilio WhatsApp Sandbox, we can start right away.

In the Twilio Console, navigate to Messaging > Try it out and select Send a WhatsApp message. As a first-time user, you'll need to activate your Sandbox; afterward, you should see a new page with the connection instructions, as shown in the image below.

Twilio Console showing instructions for activating the WhatsApp Sandbox

Send a message to the displayed phone number using WhatsApp as instructed, or scan the QR code instead. Once you're done, you should get a response acknowledging your connection. Next, append the phone number in the From field — minus the "whatsapp:" prefix — to your .env file, replacing PASTE_YOUR_TWILIO_WHATSAPP_NUMBER.

With that done, since we'll be downloading media programmatically, we need to disable the default authentication on media access. In the Twilio Console, go to Messaging > Settings > General. Once on this page, you will be asked to complete a verification. After completing the verification, scroll down to HTTP Basic Authentication for media access and disable this setting, as shown below, if it's not already disabled.

Screenshot of the Twilio Console highlighting where to disable HTTP Basic Authentication for media access

And that’s all we need to do in the Twilio Console for now.

Retrieve your Gemini API credentials

The Gemini API used in this tutorial is accessed through Google Cloud, so sign in to your Google Cloud Console and create a new project. During project creation, you should see your new Project ID, displayed as shown below.

Screenshot of the Google Cloud Console interface for creating a new project.

Alternatively, if you have an existing project, you can select it from your projects list to view its Project ID. Copy your Project ID and use it to replace YOUR_PROJECT_ID in .env. Next, enable the Vertex AI API, making sure the selected project is the one you created earlier.

Next, we'll need to generate a new Google Cloud access token. With the Google Cloud CLI, sign in to your account by running the command below.

gcloud auth login

Once signed in, configure the CLI to use the project you created earlier by running the following command — after replacing YOUR_PROJECT_ID with your project ID.

gcloud config set project YOUR_PROJECT_ID

Finally, run the following command to generate a new Google Cloud access token.

gcloud auth print-access-token

After running this command, you should see a long access token printed on the console. Copy this value into .env, in place of PASTE_YOUR_ACCESS_TOKEN.

With all these updates, your .env file is now complete.
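Before moving on, it can help to confirm that no placeholder was left behind. The following is a minimal sketch rather than part of the tutorial's project: findUnsetCredentials() is a hypothetical helper, and it assumes the placeholders follow the pattern used in the .env template (values beginning with PASTE_YOUR_, or the literal YOUR_PROJECT_ID). The key names are the ones the tutorial's code reads with getenv().

```php
<?php

// Hypothetical helper (not part of the tutorial's files): returns the names of
// required settings that are missing, empty, or still hold a template placeholder.
function findUnsetCredentials(array $env, array $requiredKeys): array
{
    $unset = [];
    foreach ($requiredKeys as $key) {
        $value = trim((string) ($env[$key] ?? ''));
        $isPlaceholder = strpos($value, 'PASTE_YOUR_') === 0 || $value === 'YOUR_PROJECT_ID';
        if ($value === '' || $isPlaceholder) {
            $unset[] = $key;
        }
    }
    return $unset;
}

// The names the tutorial's code reads with getenv().
$requiredKeys = [
    'TWILIO_SID',
    'TWILIO_AUTH_TOKEN',
    'TWILIO_WHATSAPP_NUMBER',
    'GCLOUD_PROJECT_ID',
    'GCLOUD_ACCESS_TOKEN',
];

// Example values for illustration only; in the app you could pass the
// values loaded from .env instead.
$example = [
    'TWILIO_SID' => 'ACxxxxxxxx',
    'TWILIO_AUTH_TOKEN' => 'PASTE_YOUR_TWILIO_AUTH_TOKEN',
    'TWILIO_WHATSAPP_NUMBER' => '+15551234567',
    'GCLOUD_PROJECT_ID' => 'my-example-project',
];

print_r(findUnsetCredentials($example, $requiredKeys));
```

With the example values above, the helper reports TWILIO_AUTH_TOKEN (still a placeholder) and GCLOUD_ACCESS_TOKEN (missing entirely).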

Your Google Cloud access token is short-lived and typically expires after 60 minutes for security reasons. When this happens, you'll have to generate a new one, as we did earlier, and update your .env file. To avoid refreshing it by hand, you'll need to set up application default credentials.
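For local development, one simple workaround is to ask the gcloud CLI for a fresh token whenever the script runs. The sketch below is not application default credentials, only a convenience for experimenting. fetchGcloudAccessToken() is a hypothetical helper, and it assumes that gcloud is installed, already signed in, and available on the PATH of the PHP process. The optional $runCommand parameter exists only so the function can be tried without gcloud.

```php
<?php

// Hypothetical helper: obtains a short-lived access token by running the gcloud CLI.
// Assumes `gcloud` is installed, signed in, and on the PATH of the PHP process.
// $runCommand is an optional stand-in for shell_exec(), used here for demonstration.
function fetchGcloudAccessToken(?callable $runCommand = null): string
{
    $runCommand = $runCommand ?? static function (string $command) {
        return shell_exec($command);
    };

    $output = $runCommand('gcloud auth print-access-token');
    $token = is_string($output) ? trim($output) : '';

    if ($token === '') {
        throw new RuntimeException('Could not obtain an access token from gcloud.');
    }

    return $token;
}

// Demonstration with a stand-in runner, so no real credentials are needed here.
$token = fetchGcloudAccessToken(static function (string $command): string {
    return "example-token\n";
});

echo $token, PHP_EOL;
```

Running a CLI command on every request is slow, so this is best kept to development. In gemini.php, the result could replace the value read from GCLOUD_ACCESS_TOKEN.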

With all the credentials retrieved, let’s dive right into building our bot.

Building the Bot

Here's how the bot will work: we'll create a function that sends a request to Gemini's API, passing our message and a media file, if there is one. Afterward, we'll create a webhook endpoint that listens for any message sent to the bot on WhatsApp, passes this message to our Gemini function, and sends the response back to the user.

To achieve all this, we’ll need to install the following packages:

  • guzzlehttp/guzzle: To make HTTP requests in our PHP application
  • vlucas/phpdotenv: To read environment variables
  • twilio/sdk: To interact with Twilio's Programmable Messaging API

Run the following command in your project's root directory to install these packages.

composer require guzzlehttp/guzzle vlucas/phpdotenv twilio/sdk

Next, create two new files: gemini.php and bot.php. Then, paste the following code inside the gemini.php file.

<?php

require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Exception\RequestException;
use Dotenv\Dotenv;

$dotenv = Dotenv::createUnsafeImmutable(__DIR__);
$dotenv->load();

function generateContentFromGemini($text, $fileUri = null, $mimeType = null)
{
    $location = 'us-central1';
    $modelId = 'gemini-1.0-pro-vision-001';
    $accessToken = getenv('GCLOUD_ACCESS_TOKEN');
    $projectId = getenv('GCLOUD_PROJECT_ID');
    $url = "https://{$location}-aiplatform.googleapis.com/v1/projects/{$projectId}/locations/{$location}/publishers/google/models/{$modelId}:generateContent";
    $client = new Client();
    $parts = [];

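    if ($fileUri && $mimeType) {
        // Download the media from Twilio and embed it in the request as base64.
        $mediaData = file_get_contents($fileUri);
        if ($mediaData === false) {
            return "Could not download the attached media.";
        }
        $parts[] = [
            "inline_data" => [
                "mimeType" => $mimeType,
                "data" => base64_encode($mediaData)
            ]
        ];
    }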

    if ($text) {
        $parts[] = ["text" => $text];
    }

    $body = [
        "contents" => [
            [
                "role" => "user",
                "parts" => $parts,
            ]
        ]
    ];

    try {
        $response = $client->post($url, [
            'headers' => [
                'Content-Type' => 'application/json',
                'Authorization' => 'Bearer ' . $accessToken
            ],
            'json' => $body
        ]);
        
        $data = json_decode($response->getBody(), true);

        return (isset($data['candidates'][0]['content']['parts'][0]['text'])) 
            ? $data['candidates'][0]['content']['parts'][0]['text']
            : "No text found in the response.";
        
    } catch (RequestException $e) {
        return $e->getMessage() . ($e->hasResponse() ? $e->getResponse()->getBody()->getContents() : '');
    }
}

The code above defines a function named generateContentFromGemini() that accepts a required $text parameter, and optional $fileUri and $mimeType parameters if we want to attach a media URL to our message. We configured our preferred server location, as well as the Gemini model we want to use; you can also replace this value with any of the supported models highlighted in the image below.

Table listing supported Gemini model numbers with their corresponding version details.

Furthermore, we:

  • Retrieve our credentials from the environment variables

  • Pass them to the Gemini API URL

  • Read the file from the provided URL (if any is provided)

  • Convert it to a base64-encoded string

  • And pass it to Gemini as inline_data

Along with the text, we send a request with all these parameters to our constructed Gemini API URL, parse the response, and then return the message, or an error if any occurred. You can also see a list of all parameters supported by Gemini API in the documentation.
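To make the request and response shapes concrete, here is a minimal sketch that can be run on its own, without calling the API. Both functions are hypothetical helpers written for illustration: buildGeminiRequestBody() mirrors how generateContentFromGemini() assembles its payload, and extractGeminiText() is a variation that joins every text part of the first candidate instead of reading only the first part. The sample response is hand-written in the shape the tutorial's code expects, not captured from the API.

```php
<?php

// Hypothetical helper mirroring how generateContentFromGemini() builds its payload.
// $rawBytes stands in for the downloaded media; pass null for a text-only message.
function buildGeminiRequestBody(string $text, ?string $rawBytes = null, ?string $mimeType = null): array
{
    $parts = [];

    if ($rawBytes !== null && $mimeType !== null) {
        $parts[] = [
            'inline_data' => [
                'mimeType' => $mimeType,
                'data' => base64_encode($rawBytes),
            ],
        ];
    }

    if ($text !== '') {
        $parts[] = ['text' => $text];
    }

    return ['contents' => [['role' => 'user', 'parts' => $parts]]];
}

// Hypothetical helper: joins every text part of the first candidate.
// Returns null when the response carries no text at all.
function extractGeminiText(array $data): ?string
{
    $parts = $data['candidates'][0]['content']['parts'] ?? [];
    $texts = [];
    foreach ($parts as $part) {
        if (isset($part['text'])) {
            $texts[] = $part['text'];
        }
    }
    return $texts === [] ? null : implode('', $texts);
}

// Illustrative input only: "abc" stands in for real image bytes.
$requestBody = buildGeminiRequestBody('What is in this picture?', 'abc', 'image/jpeg');
echo json_encode($requestBody, JSON_PRETTY_PRINT), PHP_EOL;

// A hand-written response in the shape the tutorial's code reads.
$sampleResponse = [
    'candidates' => [
        ['content' => ['parts' => [['text' => 'A slice '], ['text' => 'of pizza.']]]],
    ],
];
echo extractGeminiText($sampleResponse), PHP_EOL;
```

Keeping the payload construction separate from the HTTP call makes it possible to inspect exactly what would be sent before spending an API request on it.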

To proceed, let’s create our webhook endpoint by pasting the following code inside the bot.php file.

<?php

use Dotenv\Dotenv;
use Twilio\Rest\Client;
use GuzzleHttp\Exception\RequestException;

require 'vendor/autoload.php';
require 'gemini.php';

$dotenv = Dotenv::createUnsafeImmutable(__DIR__);
$dotenv->load();

function listenToWhatsAppReplies($request)
{
    $from = $request['From'];
    $body = $request['Body'];
    $mediaUrl = $request['MediaUrl0'] ?? null;
    $mimeType = $request['MediaContentType0'] ?? null;

    try {
        // $mediaUrl and $mimeType are null when the message has no attachment.
        $message = generateContentFromGemini($body, $mediaUrl, $mimeType);
        sendWhatsAppMessage($message, $from);
    } catch (RequestException $e) {
        sendWhatsAppMessage($e->getMessage(), $from);
    }
}

function sendWhatsAppMessage($message, $recipient)
{
    $twilio_whatsapp_number = getenv('TWILIO_WHATSAPP_NUMBER');
    $account_sid = getenv("TWILIO_SID");
    $auth_token = getenv("TWILIO_AUTH_TOKEN");

    $client = new Client($account_sid, $auth_token);

    return $client->messages->create("$recipient", [
        'from' => "whatsapp:$twilio_whatsapp_number",
        'body' => $message
    ]);
}

if ($_SERVER['REQUEST_METHOD'] === 'POST') {
    $request = $_POST;
    listenToWhatsAppReplies($request);
    http_response_code(200);
    echo 'Message processed';
} else {
    http_response_code(405);
    echo 'Method not allowed';
}

Here, we created two functions: listenToWhatsAppReplies() and sendWhatsAppMessage().

The sendWhatsAppMessage() function accepts a $message and a $recipient parameter. It retrieves your Twilio credentials from the environment variables and uses the Twilio SDK to send the $message to the $recipient.
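Note that sendWhatsAppMessage() prepends "whatsapp:" to the number stored in .env, which is why the number is saved without that prefix. As a defensive sketch, a small helper could add the prefix only when it is missing. toWhatsAppAddress() is a hypothetical name, and the phone numbers below are placeholders.

```php
<?php

// Hypothetical helper: returns a WhatsApp-style address, adding the "whatsapp:"
// prefix only when it is not already present.
function toWhatsAppAddress(string $number): string
{
    $number = trim($number);
    return strpos($number, 'whatsapp:') === 0 ? $number : 'whatsapp:' . $number;
}

// Placeholder numbers for illustration.
echo toWhatsAppAddress('+15551234567'), PHP_EOL;
echo toWhatsAppAddress('whatsapp:+15551234567'), PHP_EOL;
```

Both calls produce the same address, so the bot would keep working even if the prefix were accidentally left in the .env value.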

The listenToWhatsAppReplies() function receives a web request as its parameter, reads the data from this request, and sends it to the generateContentFromGemini() function we created earlier.

When you receive a WhatsApp message, Twilio will send the message data to your webhook URL, which typically includes:

  • From: The sender of the message

  • Body: The body of the message

  • MediaUrl0: If media is attached to the message, Twilio will send the media URL as MediaUrl0

  • MediaContentType0: The content type or mime-type of the attached media, e.g., image/jpeg.

Now, we're passing this data to our generateContentFromGemini() function, and then the response generated from Gemini is sent back to the user via the sendWhatsAppMessage() function. Furthermore, we configured our app to listen to POST requests and then invoke the listenToWhatsAppReplies() function.
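To see how these fields look in practice, here is a minimal sketch that reads them from an array shaped like $_POST. parseIncomingMessage() is a hypothetical helper, and the sample payloads are hand-written with placeholder numbers and a placeholder URL. It falls back to safe defaults, so a missing field does not trigger an undefined index notice.

```php
<?php

// Hypothetical helper: picks out the fields the bot uses from Twilio's POST data,
// falling back to defaults when a field is absent.
function parseIncomingMessage(array $post): array
{
    return [
        'from' => $post['From'] ?? '',
        'body' => trim($post['Body'] ?? ''),
        'mediaUrl' => $post['MediaUrl0'] ?? null,
        'mimeType' => $post['MediaContentType0'] ?? null,
    ];
}

// Hand-written sample payloads; the numbers and URL are placeholders.
$textOnly = parseIncomingMessage([
    'From' => 'whatsapp:+15551234567',
    'Body' => 'Hello there',
]);

$withImage = parseIncomingMessage([
    'From' => 'whatsapp:+15551234567',
    'Body' => '',
    'MediaUrl0' => 'https://example.com/media/pizza.jpg',
    'MediaContentType0' => 'image/jpeg',
]);

var_dump($textOnly['mediaUrl']); // NULL: no media attached
echo $withImage['mimeType'], PHP_EOL;
```

The second payload shows a message with an attachment and an empty body, a case worth handling because generateContentFromGemini() only adds a text part when the text is non-empty.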

At this stage, we’re done with developing the application, so start it by running the command below.

php -S localhost:5000 -t .

This command will start the app on port 5000, so our bot endpoint will be accessible at http://localhost:5000/bot.php. However, our webhook URL in the Twilio Console has to be a live URL, as Twilio cannot access our localhost content. For this reason, we'll make our app accessible on the public internet with ngrok, as outlined in the next section.

Configure Webhook

With ngrok installed and your app running on port 5000, execute the following command in a new terminal tab or session.

ngrok http 5000

Running this command will create a secure tunnel to your local server running on port 5000 and generate a public URL that allows external access to the app. You should see an image similar to the one below.

A terminal window displaying an active ngrok session status, including account, region, version, and connection details.

Copy the generated URL and append the bot endpoint to it, e.g., https://3c10-102-89-47-96.ngrok-free.app/bot.php. This will serve as our webhook URL. Next, head back to the browser tab where you have the WhatsApp Sandbox open and switch to the Sandbox Settings tab. Paste your ngrok webhook URL in the "When a message comes in" field, set the accompanying Method field to POST, as shown below, and click Save.

Twilio WhatsApp Sandbox configuration settings with webhook and sandbox participant information.

And we're done!

Test that the application works

Send a message to the Twilio WhatsApp Sandbox number using WhatsApp, and you should get a response from Gemini. You can also try sending it a media file, as shown in the screenshot below.

Chat screenshot discussing a sliced pepperoni pizza held by hand.

Or, you could even ask programming questions, as in the following example.

Chat screen showing a conversation with Twilio Bot including topics on a pizza description and PHP code questions.

That’s not the end!

Add a custom instruction

Custom instructions are a recent update that makes LLMs like Gemini and ChatGPT even more powerful than before. These instructions are like a lead-in context that you give the model before any further interaction. This way, you can control its behavior and how it responds. 

For example, you can add an instruction to make Gemini respond in Spanish for all subsequent requests. You can also tell it your preferred name and have it address you by that name. Even more practically, you could create custom instructions that include details about your organization and ultimately create a customer support chatbot; the applications are limitless.

To add custom instructions, we only need to update the contents array to include a new entry with the role "model" and our instructions, as shown below.

[
    "role" => "model",
    "parts" => [
        ["text" => "You are Automate Inc. Bot, an expert in all things automation. Your responses must be concise. Refer to the user as 'Champ' in every interaction."]
    ],
]

In this example, we are giving our bot a custom name, telling it that it's an expert in automation, and specifying how it should respond and how it should refer to the user. To make this take effect, open the gemini.php file and update the $body array so that it looks like the following.

$body = [
    "contents" => [
        [
            "role" => "user",
            "parts" => $parts,
        ],
        [
            "role" => "model",
            "parts" => [
                ["text" => "You are Automate Inc. Bot, an expert in all things automation. Your responses must be concise. Refer to the user as 'Champ' in every interaction."]
            ],
        ]
    ],
];

Now, we can ask the bot questions about automation, and it refers to us as "Champ!"

Twilio bot explaining benefits of robotic process automation in a WhatsApp chat.
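As an alternative to adding a "model" entry to contents, the Vertex AI generateContent request also defines a top-level systemInstruction field intended for this kind of guidance. The sketch below shows what such a body could look like. It is not the approach used in this tutorial, buildBodyWithSystemInstruction() is a hypothetical helper, and it assumes the model you select accepts system instructions, which not every model does, so check the model's documentation first.

```php
<?php

// Hypothetical helper: builds a request body that carries the custom instruction in a
// top-level "systemInstruction" field rather than as an extra entry in "contents".
// Assumption: the chosen Gemini model accepts system instructions.
function buildBodyWithSystemInstruction(array $parts, string $instruction): array
{
    return [
        'systemInstruction' => [
            'parts' => [
                ['text' => $instruction],
            ],
        ],
        'contents' => [
            [
                'role' => 'user',
                'parts' => $parts,
            ],
        ],
    ];
}

$body = buildBodyWithSystemInstruction(
    [['text' => 'How can I automate my weekly reports?']],
    "You are Automate Inc. Bot, an expert in all things automation. Your responses must be concise. Refer to the user as 'Champ' in every interaction."
);

echo json_encode($body, JSON_PRETTY_PRINT), PHP_EOL;
```

With this shape, the contents array holds only the actual conversation, which keeps the instruction separate from the user's message.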

That's how to build a multimodal WhatsApp bot with Twilio, Gemini, and PHP

In this tutorial, we covered what it means for an AI/ML model to be multimodal. We explored how Gemini is a new multimodal AI that can natively process text, image, and audio input. We then integrated the Gemini API with Twilio to build a multimodal WhatsApp bot.

Finally, we covered how to add custom instructions to the bot so as to give it additional context and customize how it responds to users. All the code used in this tutorial can be found in this GitHub repository. Thanks for reading!

Elijah Asaolu is a technical writer and software engineer. He frequently enjoys writing technical articles to share his skills and experience with other developers.