How to Build a Multimodal WhatsApp Bot with Twilio, Gemini, and PHP
Unless you live under a rock, you've almost certainly heard about or interacted with recent technologies like ChatGPT and Google Gemini. Behind the scenes, these technologies are powered by Large Language Models (LLMs). Recent updates have made them even more powerful: they are now multimodal, which means they can natively process media inputs (images, audio, video). So there's no better time to integrate this technology into powerful messaging systems.
In this tutorial, we'll go over what multimodal AI systems are and then build a multimodal WhatsApp bot using PHP, Twilio, and Google's Gemini API. By the end of the tutorial, we'll have Gemini integrated with our WhatsApp bot, giving it the ability to respond to text, image, and audio input. Our final output will look like the image below.
Prerequisites
The following prerequisites are required to follow along with the tutorial:
- Ngrok installed
- PHP 8.2 and Composer
- A Google Cloud account with a billing account that has an active payment method
- The Google Cloud CLI installed
- A Twilio account (either free or paid). If you are new to Twilio, click here to create a free account
- A Twilio phone number
What is Multimodal AI?
Modality in Artificial Intelligence (AI) and Machine Learning (ML) refers to the input/output channel that an AI/ML model can process, such as text, images, audio, etc. In the same sense, a Multimodal AI is one that can process information from multiple channels, like text, images, audio, and even video, all at once.
Before now, most AI models were unimodal, meaning they could only handle one data type. For example, an image recognition model could only analyze pictures, not accompanying text. Even models that seemed multimodal were mostly converting data from one format (like an image) to another (like text) before processing it. One downside of this approach is that it can lead to a loss of information.
Google's Gemini is changing this narrative with natively multimodal models. These models can process various input combinations, including text, images, and audio, without relying on conversions. In the next sections, we'll integrate Gemini with Twilio's Programmable Messaging API to build a powerful WhatsApp bot that can process different types of input.
Build a Multimodal WhatsApp bot With Twilio and Gemini API
Let’s get started by creating a new folder for our project and changing into it, by running the commands below.
Next, open the project folder in your favorite text editor or IDE and create a new .env file. Then, copy the following into the file.
We'll proceed with retrieving the necessary credentials for building our application and storing them in this file.
Retrieve your Twilio credentials and set up your WhatsApp Sandbox
Sign in to the Twilio Console. Immediately after you're logged in, your Account SID and Auth Token should be displayed in the Account Info panel, as shown in the screenshot below.
Copy these values and paste them into your .env file in place of the respective placeholders.
Next, let's set up the WhatsApp Sandbox. Normally, before we can start integrating with WhatsApp, we'd need to request WhatsApp API access, which is usually a long process that takes days or weeks. However, with the Twilio WhatsApp Sandbox, we can start right away.
In the Twilio Console, navigate to Messaging > Try it out and select Send a WhatsApp message. As a first-time user, you'll need to activate your Sandbox; afterward, you should see a new page with the connection instructions, as shown in the image below.
Send a message to the displayed phone number using WhatsApp as instructed, or scan the QR code instead. Once you're done, you should get a response acknowledging your connection. Next, add the phone number in the From field, minus the "whatsapp:" prefix, to your .env file, replacing PASTE_YOUR_TWILIO_WHATSAPP_NUMBER.
With that done, since we'll be processing media programmatically, we need to disable the default authentication for media access. In the Twilio Console, go to Messaging > Settings > General. Once on this page, you will be asked to complete a verification. After completing the verification, scroll down to HTTP Basic Authentication for media access and disable this setting, as shown below, if it's not already disabled.
And that’s all we need to do in the Twilio Console for now.
Retrieve your Gemini API credentials
The Gemini API is available through Google Cloud, so sign in to your Google Cloud Console and create a new project. During project creation, you should see your new Project ID, displayed as shown below.
Alternatively, if you have an existing project, you can select the project from your projects list to preview its Project ID. Copy your Project ID and, in .env, use it to replace YOUR_PROJECT_ID. Next, enable the Vertex AI API, making sure the selected project is the one you created earlier.
Next, we'll need to generate a new Google Cloud access token. With the Google Cloud CLI, sign in to your account by running gcloud auth login.
Once signed in, configure the CLI to use the project you created earlier by running gcloud config set project YOUR_PROJECT_ID, after replacing YOUR_PROJECT_ID with your project ID.
Finally, run gcloud auth print-access-token to generate a new Google Cloud access token.
After running this command, you should see a long access token printed on the console. Copy this value into .env, in place of PASTE_YOUR_ACCESS_TOKEN.
With all these updates, your .env file is now complete.
With all the credentials retrieved, let’s dive right into building our bot.
Building the Bot
Here's how the bot will work: we'll create a function that sends a request to Gemini's API, passing our message and a media file if there is one. Afterward, we'll create a webhook endpoint that listens for any message sent to the bot on WhatsApp, passes this message to our Gemini function, and sends the response back to the user.
To achieve all this, we’ll need to install the following packages:
- guzzlehttp/guzzle: To make HTTP requests in our PHP application
- vlucas/phpdotenv: To read environment variables
- twilio/sdk: To interact with Twilio's messaging API
Run composer require guzzlehttp/guzzle vlucas/phpdotenv twilio/sdk in your project's root directory to install these packages.
Next, create two new files: gemini.php and bot.php. Then, paste the following code inside the gemini.php file.
The code above defines a function named generateContentFromGemini() that accepts a required $text parameter, and optional $fileUri and $mimeType parameters if we want to attach a media URL to our message. We configured our preferred server location, as well as the Gemini model we want to use; you can also replace this value with any of the supported models highlighted in the image below.
Furthermore, we:
- Retrieve our credentials from the environment variables
- Pass them to the Gemini API URL
- Read the file from the provided URL (if any is provided)
- Convert it to a base64-encoded string
- And pass it to Gemini as inline_data
Along with the text, we send a request with all these parameters to our constructed Gemini API URL, parse the response, and then return the message, or an error if any occurred. You can also see a list of all parameters supported by Gemini API in the documentation.
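As a rough illustration of the steps above, here is a minimal sketch of how the request could be assembled. It is not the exact code of generateContentFromGemini(): the helper names buildGeminiUrl(), buildGeminiBody(), and extractGeminiText() are hypothetical, and it assumes the Vertex AI generateContent REST endpoint. The HTTP call itself (made with Guzzle in this tutorial) is left out so the sketch runs without any dependencies.

```php
<?php
declare(strict_types=1);

/**
 * Build the Vertex AI generateContent URL (assumed endpoint shape).
 * Location and model are supplied by the caller; this sketch does not
 * assume which ones the tutorial's code uses.
 */
function buildGeminiUrl(string $projectId, string $location, string $model): string
{
    return sprintf(
        'https://%s-aiplatform.googleapis.com/v1/projects/%s/locations/%s/publishers/google/models/%s:generateContent',
        $location,
        $projectId,
        $location,
        $model
    );
}

/**
 * Build the request body: a single user turn holding the text part and,
 * if media bytes are given, a base64-encoded inline_data part.
 */
function buildGeminiBody(string $text, ?string $fileBytes = null, ?string $mimeType = null): array
{
    $parts = [['text' => $text]];

    if ($fileBytes !== null && $mimeType !== null) {
        $parts[] = [
            'inline_data' => [
                'mime_type' => $mimeType,
                'data' => base64_encode($fileBytes),
            ],
        ];
    }

    return ['contents' => [['role' => 'user', 'parts' => $parts]]];
}

/**
 * Return the generated text from a decoded response array,
 * or null when the expected fields are missing.
 */
function extractGeminiText(array $response): ?string
{
    return $response['candidates'][0]['content']['parts'][0]['text'] ?? null;
}
```

For example, buildGeminiUrl('my-project', 'us-central1', 'MODEL_ID') uses placeholder values only. With Guzzle, the body would then be posted to that URL as JSON, with the access token from .env sent in an Authorization: Bearer header.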
To proceed, let’s create our webhook endpoint by pasting the following code inside the bot.php file.
Here, we created two functions, listenToWhatsAppReplies() and sendWhatsAppMessage().
The sendWhatsAppMessage() function accepts a $message and a $recipient parameter. It retrieves your Twilio credentials from the environment variables and uses the Twilio SDK to send the $message passed to this function to the $recipient.
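To make the sending step more concrete, here is a minimal sketch, not the tutorial's exact function. The names toWhatsAppAddress() and sendWhatsAppMessageSketch() are hypothetical, and the Twilio client is passed in as a parameter (rather than built from the environment variables) so the sketch can run without the SDK installed. It assumes the sender number is stored without the "whatsapp:" prefix, as set up in .env earlier.

```php
<?php
declare(strict_types=1);

/**
 * Add the "whatsapp:" channel prefix unless it is already present.
 */
function toWhatsAppAddress(string $number): string
{
    return str_starts_with($number, 'whatsapp:') ? $number : 'whatsapp:' . $number;
}

/**
 * Send $message to $recipient through a Twilio client.
 * $twilio is expected to behave like Twilio\Rest\Client, i.e. expose
 * $twilio->messages->create($to, $options).
 */
function sendWhatsAppMessageSketch(object $twilio, string $from, string $recipient, string $message): void
{
    $twilio->messages->create(
        toWhatsAppAddress($recipient),
        [
            'from' => toWhatsAppAddress($from),
            'body' => $message,
        ]
    );
}
```

In the real application, $twilio would be a Twilio\Rest\Client created with the Account SID and Auth Token from .env.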
The listenToWhatsAppReplies() function receives a web request as its parameter, reads the data from this request, and sends it to the generateContentFromGemini() function we created earlier.
When the bot receives a WhatsApp message, Twilio sends the message data to your webhook URL. This data typically includes:
- From: The sender of the message
- Body: The body of the message
- MediaUrl0: If media is attached to the message, Twilio will send the media URL as MediaUrl0
- MediaContentType0: The content type or MIME type of the attached media, e.g., image/jpeg.
Now, we're passing this data to our generateContentFromGemini() function, and then the response generated from Gemini is sent back to the user via the sendWhatsAppMessage() function. Furthermore, we configured our app to listen to POST requests and then invoke the listenToWhatsAppReplies() function.
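As an illustration of how the webhook fields listed above might be read, here is a small sketch. The parseIncomingMessage() helper is hypothetical and is not part of the tutorial's bot.php; it takes the POST data as an argument (normally $_POST) so it can be exercised on its own.

```php
<?php
declare(strict_types=1);

/**
 * Normalize the fields Twilio posts to the webhook.
 * Media fields are returned as null when no media is attached.
 */
function parseIncomingMessage(array $post): array
{
    $mediaUrl = $post['MediaUrl0'] ?? '';
    $hasMedia = $mediaUrl !== '';

    return [
        'from' => $post['From'] ?? '',
        'body' => $post['Body'] ?? '',
        'mediaUrl' => $hasMedia ? $mediaUrl : null,
        'mimeType' => $hasMedia ? ($post['MediaContentType0'] ?? null) : null,
    ];
}
```

The returned values map directly onto the parameters described earlier: body, mediaUrl, and mimeType would be passed to generateContentFromGemini(), and from would be the recipient of the reply.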
At this stage, we're done developing the application, so start it with PHP's built-in web server by running php -S localhost:5000.
This command will start the app on port 5000, so our bot endpoint will be accessible at http://localhost:5000/bot.php. However, our webhook URL in the Twilio Console has to be a live URL, as Twilio cannot access our localhost content. For this reason, we'll make our app accessible on the public internet with ngrok, as outlined in the next section.
Configure Webhook
With ngrok installed and your app running on port 5000, run ngrok http 5000 in a new terminal tab or session.
Running this command will create a secure tunnel to your local server running on port 5000 and generate a public URL that allows external access to the app. You should see an image similar to the one below.
Copy the generated URL and append the bot endpoint to it, e.g., https://3c10-102-89-47-96.ngrok-free.app/bot.php. This will serve as our webhook URL. Next, head back to the tab where you have the WhatsApp Sandbox open and switch to the Sandbox Settings tab. Paste your ngrok webhook URL in the "When a message comes in" field and set the accompanying Method field to POST, as shown below. Then, click Save.
And we're done!
Test that the application works
Send a message to the Twilio WhatsApp Sandbox number using WhatsApp, and you should get a response from Gemini. You can also try sending it a media file, as shown in the screenshot below.
Or, you could even ask programming questions, as in the following example.
That’s not the end!
Add a custom instruction
Custom instructions are a recent update that makes LLMs like Gemini and ChatGPT even more powerful than before. These instructions are like a lead-in context that you give the model before any further interaction. This way, you can control its behavior and how it responds.
For example, you can add an instruction to make Gemini respond in Spanish for all subsequent requests. You can give it a name of your choosing and have it address you by a preferred name. Even more practically, you could create custom instructions that include details about your organization and ultimately create a customer support chatbot; the applications are limitless.
To add custom instructions, we only need to update the content array to include new content with the role model and our instructions, as shown below.
In this example, we are giving our bot a custom name, telling it that it's an expert in automation, and specifying how it should respond and how it should refer to the user. To make this take effect, open the gemini.php file from earlier and update the $body array so that it looks like the following.
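A minimal sketch of that change, following the description above (an entry with the role model placed ahead of the user's turn), could look like this. The buildBodyWithInstructions() helper and the instruction text are hypothetical; only the "Champ" form of address comes from this tutorial, and BOT_NAME is a placeholder for whatever name you choose.

```php
<?php
declare(strict_types=1);

/**
 * Build a request body whose first entry carries the custom instructions
 * under the "model" role, followed by the user's own parts.
 */
function buildBodyWithInstructions(string $instructions, array $userParts): array
{
    return [
        'contents' => [
            ['role' => 'model', 'parts' => [['text' => $instructions]]],
            ['role' => 'user', 'parts' => $userParts],
        ],
    ];
}

// Hypothetical instruction text; adjust the name, expertise, and tone as needed.
$instructions = 'Your name is BOT_NAME. You are an expert in automation. '
    . 'Keep answers short and always address the user as Champ.';

$body = buildBodyWithInstructions($instructions, [['text' => 'How can I automate my backups?']]);
```

The user parts are the same text and optional media parts that were already being sent, so the rest of the function does not need to change.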
Now, we can ask the bot questions about automation, and it refers to us as "Champ!"
That's how to build a multimodal WhatsApp bot with Twilio, Gemini, and PHP
In this tutorial, we covered what it means for an AI/ML model to be multimodal. We explored how Gemini is a new multimodal AI that can natively process text, image, and audio input. We then integrated the Gemini API with Twilio to build a multimodal WhatsApp bot.
Finally, we covered how to add custom instructions to the bot so as to give it additional context and customize how it responds to users. All the code used in this tutorial can be found in this GitHub repository. Thanks for reading!
Elijah Asaolu is a technical writer and software engineer. He frequently enjoys writing technical articles to share his skills and experience with other developers.