AI Agent for Sending Emails Through WhatsApp Voice Notes Using Twilio, Sendgrid, .NET

April 08, 2025
Written by
Jacob Snipes
Contributor
Opinions expressed by Twilio contributors are their own


Imagine leaving a voice note on WhatsApp and having it instantly transcribed and delivered as a well-structured email—no typing, no hassle. Voice messages capture tone, intent, and urgency in a way that text alone often fails to convey. Now, what if you could transform them into professional emails effortlessly?

By integrating Twilio's API, the .NET framework, and AI-driven speech recognition and enhancement using AssemblyAI and OpenAI, this system converts WhatsApp voice notes into text, refines them, and sends them via email—bridging the gap between casual voice messaging and formal business communication. Whether for customer support, internal updates, or professional correspondence, this solution enhances productivity, accessibility, and engagement in a truly innovative way.

Let's dive into the technical bit.

Prerequisites

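To follow this tutorial, you will need the following:

Install the following:

  • The .NET SDK
  • Visual Studio Code

Sign up for the services below to obtain the credentials your application needs:

  • Twilio (WhatsApp messaging)
  • SendGrid (email delivery)
  • AssemblyAI (speech transcription)
  • OpenAI (content enhancement)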

With all these checking out, let’s roll.

Create the .NET application

In this section, you'll set up your project structure using the .NET CLI. This structure forms the foundation of your application, separating concerns between the API and the core business logic.

Find the full source code on GitHub for reference.

Set Up the Project Structure

Open Visual Studio Code and head to the terminal. Run the following commands to create a main directory, initialize a solution, and set up two projects: one for your API and the other one for your core business logic.

# Create main directory
mkdir VoiceToEmail
cd VoiceToEmail
# Create solution and projects
dotnet new sln
dotnet new webapi -n VoiceToEmail.API
dotnet new classlib -n VoiceToEmail.Core
# Add projects to solution
dotnet sln add VoiceToEmail.API/VoiceToEmail.API.csproj
dotnet sln add VoiceToEmail.Core/VoiceToEmail.Core.csproj
# Add project reference from API to Core
cd VoiceToEmail.API
dotnet add reference ../VoiceToEmail.Core/VoiceToEmail.Core.csproj
# Install required packages
dotnet add package OpenAI
dotnet add package Twilio
dotnet add package SendGrid
dotnet add package AssemblyAI
dotnet add package Microsoft.EntityFrameworkCore.SQLite
dotnet add package Swashbuckle.AspNetCore

This set of commands initializes your solution with two projects. The API project handles HTTP requests and controllers, while the Core project contains the models and interfaces used across the application. The packages, installed into the API project, add client libraries for OpenAI, Twilio, SendGrid, and AssemblyAI, along with the Entity Framework Core SQLite provider for data persistence and Swashbuckle for Swagger documentation.

With the project structure ready, move on to create your core models and interfaces.

Create Core Models and Interfaces

In this section, you define the VoiceMessage model, with properties for the sender and recipient email addresses, audio URL, transcribed text, enhanced content, creation timestamp, and status. This model tracks the voice message’s journey from transcription to email sending. In the VoiceToEmail.Core folder, create a new Models folder and add the file VoiceMessage.cs. Update it with the code below.

namespace VoiceToEmail.Core.Models;
public class VoiceMessage
{
    public Guid Id { get; set; }
    public string SenderEmail { get; set; }
    public string RecipientEmail { get; set; }
    public string AudioUrl { get; set; }
    public string TranscribedText { get; set; }
    public string EnhancedContent { get; set; }
    public DateTime CreatedAt { get; set; }
    public string Status { get; set; }
}

This model encapsulates all the necessary data for a voice message, including unique identifiers, timestamps, and various stages of processing, that is, transcription and content enhancement.

Next, create a new file in the same Models folder called WhatsAppMessage.cs to define the WhatsApp message model. Update it with the code below:

namespace VoiceToEmail.Core.Models;
public class WhatsAppMessage
{
    public string? MessageSid { get; set; }
    public string? From { get; set; }
    public string? To { get; set; }
    public string? Body { get; set; }
    public int NumMedia { get; set; }
    public Dictionary<string, string> MediaUrls { get; set; } = new();
}
public class ConversationState
{
    public string PhoneNumber { get; set; } = string.Empty;
    public string? PendingVoiceNoteUrl { get; set; }
    public bool WaitingForEmail { get; set; }
    public DateTime LastUpdated { get; set; } = DateTime.UtcNow;
    // True when the state has not been updated for more than 24 hours
    public bool IsStale => DateTime.UtcNow.Subtract(LastUpdated).TotalHours > 24;
}

The WhatsAppMessage class represents an incoming WhatsApp message, capturing essential details like the sender's phone number, message body, number of media attachments, and media URLs. The ConversationState class helps manage the conversation flow and track pending actions for each user interaction.
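The IsStale property is defined on ConversationState but is not used elsewhere in the listings shown here. As a minimal sketch of how it could be used, the snippet below removes stale entries from a dictionary of states. The StateCleanup helper is hypothetical and not part of the tutorial's code, and the model is a trimmed copy so the example compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Demo data: one fresh entry and one last updated 25 hours ago.
var states = new Dictionary<string, ConversationState>
{
    ["whatsapp:+15550001"] = new ConversationState { PhoneNumber = "whatsapp:+15550001" },
    ["whatsapp:+15550002"] = new ConversationState
    {
        PhoneNumber = "whatsapp:+15550002",
        LastUpdated = DateTime.UtcNow.AddHours(-25)
    }
};

int removed = StateCleanup.RemoveStale(states);
Console.WriteLine($"Removed {removed}, remaining {states.Count}"); // Removed 1, remaining 1

// Hypothetical helper, not part of the tutorial's code.
public static class StateCleanup
{
    // Removes every entry whose IsStale flag is set and returns how many were removed.
    public static int RemoveStale(IDictionary<string, ConversationState> states)
    {
        // Collect the keys first; a dictionary cannot be modified while it is enumerated.
        var staleKeys = states.Where(kv => kv.Value.IsStale).Select(kv => kv.Key).ToList();
        foreach (var key in staleKeys)
        {
            states.Remove(key);
        }
        return staleKeys.Count;
    }
}

// Trimmed copy of the tutorial's model so the sketch is self-contained.
public class ConversationState
{
    public string PhoneNumber { get; set; } = string.Empty;
    public string? PendingVoiceNoteUrl { get; set; }
    public bool WaitingForEmail { get; set; }
    public DateTime LastUpdated { get; set; } = DateTime.UtcNow;
    public bool IsStale => DateTime.UtcNow.Subtract(LastUpdated).TotalHours > 24;
}
```

If you apply this idea to a dictionary that several requests share, run the cleanup while holding the same lock that guards the other reads and writes.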

Create Service Interfaces

Next, you will define the interfaces that outline the responsibilities of the services in your application: transcription, content enhancement, email sending, and WhatsApp message handling. Still in the VoiceToEmail.Core directory, create an Interfaces folder for them. Create a new file named ITranscriptionService.cs and update it with the code below. The listing also contains IContentService and IEmailService; you can keep all three in this file or move them into the files named in the comments, repeating the namespace declaration in each.

namespace VoiceToEmail.Core.Interfaces;
public interface ITranscriptionService
{
    Task<string> TranscribeAudioAsync(byte[] audioData);
}
// IContentService.cs
public interface IContentService
{
    Task<string> EnhanceContentAsync(string transcribedText);
}
// IEmailService.cs
public interface IEmailService
{
    Task SendEmailAsync(string to, string subject, string content);
}

These interfaces decouple the implementation details from the rest of your codebase, allowing you to easily swap out service implementations if needed.
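To illustrate the swap, here is a minimal sketch of a stand-in implementation that returns a canned transcript instead of calling an external API, which can be handy in tests. FakeTranscriptionService is a hypothetical name, and the interface is repeated only so the example compiles on its own.

```csharp
using System;
using System.Threading.Tasks;

// Any code that depends on ITranscriptionService can receive the stub instead of the real service.
ITranscriptionService service = new FakeTranscriptionService("Send this to jane@example.com");
string text = await service.TranscribeAudioAsync(new byte[] { 1, 2, 3 });
Console.WriteLine(text); // Send this to jane@example.com

// Same shape as the tutorial's interface, repeated so the sketch is self-contained.
public interface ITranscriptionService
{
    Task<string> TranscribeAudioAsync(byte[] audioData);
}

// Hypothetical test double: ignores the audio and returns a fixed transcript.
public class FakeTranscriptionService : ITranscriptionService
{
    private readonly string _cannedText;

    public FakeTranscriptionService(string cannedText) => _cannedText = cannedText;

    public Task<string> TranscribeAudioAsync(byte[] audioData) => Task.FromResult(_cannedText);
}
```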

Now, add another file to the Interfaces folder and name it IWhatsAppService.cs. Update it with the code below.

using VoiceToEmail.Core.Models;
namespace VoiceToEmail.Core.Interfaces;
public interface IWhatsAppService
{
    Task<string> HandleIncomingMessageAsync(WhatsAppMessage message);
}

The IWhatsAppService interface defines how incoming WhatsApp messages will be processed. This abstraction lets you focus on handling the specifics of message parsing and response generation.

Implement the Services

With your interfaces in place, it's time to implement the actual business logic. In this section, you'll create the service classes that interact with external APIs: AssemblyAI for voice transcription, OpenAI for turning the transcribed text into a polished email, and SendGrid for sending the email.

WhatsAppService

In VoiceToEmail.API, create a Services folder for the service implementations. Start by creating the WhatsAppService.cs file. This service will parse and process incoming WhatsApp messages. Update it with the code below.

using System.Net.Http.Headers;
using Twilio;
using VoiceToEmail.Core.Interfaces;
using VoiceToEmail.Core.Models;
namespace VoiceToEmail.API.Services;
public class WhatsAppService : IWhatsAppService
{
    private readonly IConfiguration _configuration;
    private readonly ITranscriptionService _transcriptionService;
    private readonly IContentService _contentService;
    private readonly IEmailService _emailService;
    private readonly HttpClient _httpClient;
    private readonly ILogger<WhatsAppService> _logger;
    private static readonly Dictionary<string, ConversationState> _conversationStates = new();
    private static readonly object _stateLock = new();
    public WhatsAppService(
        IConfiguration configuration,
        ITranscriptionService transcriptionService,
        IContentService contentService,
        IEmailService emailService,
        HttpClient httpClient,
        ILogger<WhatsAppService> logger)
    {
        _configuration = configuration;
        _transcriptionService = transcriptionService;
        _contentService = contentService;
        _emailService = emailService;
        _httpClient = httpClient;
        _logger = logger;
        // Initialize Twilio client
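        // InvalidOperationException is used because ArgumentNullException's
        // single-string constructor expects a parameter name, not a message
        var accountSid = configuration["Twilio:AccountSid"] ?? 
            throw new InvalidOperationException("Twilio:AccountSid configuration is missing");
        var authToken = configuration["Twilio:AuthToken"] ?? 
            throw new InvalidOperationException("Twilio:AuthToken configuration is missing");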
        // Set up HTTP client authentication for Twilio media downloads
        var authString = Convert.ToBase64String(
            System.Text.Encoding.ASCII.GetBytes($"{accountSid}:{authToken}"));
        _httpClient.DefaultRequestHeaders.Authorization = 
            new AuthenticationHeaderValue("Basic", authString);
        TwilioClient.Init(accountSid, authToken);
        _logger.LogInformation("WhatsAppService initialized successfully");
    }
    public async Task<string> HandleIncomingMessageAsync(WhatsAppMessage message)
    {
        try
        {
            _logger.LogInformation("Processing incoming message from {From}", message.From);
            ConversationState state;
            lock (_stateLock)
            {
                if (!_conversationStates.TryGetValue(message.From!, out state!))
                {
                    state = new ConversationState { PhoneNumber = message.From! };
                    _conversationStates[message.From!] = state;
                    _logger.LogInformation("Created new conversation state for {From}", message.From);
                }
            }
            // If waiting for email address
            if (state.WaitingForEmail && !string.IsNullOrEmpty(message.Body))
            {
                return await HandleEmailProvided(message.Body, state);
            }
            // If it's a voice note
            if (message.NumMedia > 0 && message.MediaUrls.Any())
            {
                return await HandleVoiceNote(message.MediaUrls.First().Value, state);
            }
            // Default response
            return "Please send a voice note to convert it to email, or type an email address if requested.";
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, "Error processing incoming message from {From}", message.From);
            throw;
        }
    }
    private async Task<string> HandleVoiceNote(string mediaUrl, ConversationState state)
    {
        try
        {
            _logger.LogInformation("Downloading voice note from {MediaUrl}", mediaUrl);
            // Download the voice note
            byte[] voiceNote;
            try
            {
                voiceNote = await _httpClient.GetByteArrayAsync(mediaUrl);
                _logger.LogInformation("Successfully downloaded voice note ({Bytes} bytes)", voiceNote.Length);
            }
            catch (HttpRequestException ex)
            {
                _logger.LogError(ex, "Failed to download media from Twilio. URL: {MediaUrl}, Status: {Status}", 
                    mediaUrl, ex.StatusCode);
                throw;
            }
            // Transcribe the voice note
            var transcription = await _transcriptionService.TranscribeAudioAsync(voiceNote);
            _logger.LogInformation("Successfully transcribed voice note");
            // Extract email address if present
            var emailAddress = ExtractEmailAddress(transcription);
            if (emailAddress != null)
            {
                // Generate and send email
                var enhancedContent = await _contentService.EnhanceContentAsync(transcription);
                await _emailService.SendEmailAsync(emailAddress, "Voice Note Transcription", enhancedContent);
                _logger.LogInformation("Email sent successfully to {EmailAddress}", emailAddress);
                return "Your voice note has been converted and sent as an email! ✉️";
            }
            else
            {
                // Store voice note URL and wait for email
                state.PendingVoiceNoteUrl = mediaUrl;
                state.WaitingForEmail = true;
                _logger.LogInformation("Waiting for email address from user");
                return "I couldn't find an email address in your message. Please reply with the email address where you'd like to send this message.";
            }
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, "Error processing voice note");
            throw;
        }
    }
    private async Task<string> HandleEmailProvided(string emailText, ConversationState state)
    {
        try
        {
            var emailAddress = ExtractEmailAddress(emailText);
            if (emailAddress == null)
            {
                _logger.LogWarning("Invalid email address provided: {EmailText}", emailText);
                return "That doesn't look like a valid email address. Please try again.";
            }
            if (state.PendingVoiceNoteUrl == null)
            {
                _logger.LogWarning("No pending voice note found for {PhoneNumber}", state.PhoneNumber);
                return "Sorry, I couldn't find your voice note. Please send it again.";
            }
            _logger.LogInformation("Processing pending voice note for {EmailAddress}", emailAddress);
            // Download and process the pending voice note
            var voiceNote = await _httpClient.GetByteArrayAsync(state.PendingVoiceNoteUrl);
            var transcription = await _transcriptionService.TranscribeAudioAsync(voiceNote);
            var enhancedContent = await _contentService.EnhanceContentAsync(transcription);
            // Send the email
            await _emailService.SendEmailAsync(emailAddress, "New Message Delivered via Voice-to-Text", enhancedContent);
            // Reset state
            state.PendingVoiceNoteUrl = null;
            state.WaitingForEmail = false;
            _logger.LogInformation("Successfully processed voice note and sent email to {EmailAddress}", emailAddress);
            return "Your voice note has been converted and sent as an email! ✉️";
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, "Error handling email provision");
            throw;
        }
    }
    private string? ExtractEmailAddress(string text)
    {
        // Simple regex for email extraction
        var match = System.Text.RegularExpressions.Regex.Match(text, @"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}");
        return match.Success ? match.Value : null;
    }
}

During initialization, the constructor retrieves Twilio credentials from the configuration and initializes the Twilio client. If the credentials are missing, an exception is thrown. It also sets up authentication headers for the HTTP client, allowing secure media downloads from Twilio.
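The header setup can be looked at in isolation. The sketch below builds the same kind of Basic authentication header from an account SID and auth token, then decodes it to show what the server receives. The BasicAuth helper name and the credential values are placeholders for illustration only.

```csharp
using System;
using System.Net.Http.Headers;
using System.Text;

// Placeholder credentials; real values come from configuration.
var header = BasicAuth.Create("ACxxxxxxxx", "your_auth_token");

Console.WriteLine(header.Scheme); // Basic
// Decoding the parameter gives back "sid:token", the pair the server verifies.
Console.WriteLine(Encoding.ASCII.GetString(Convert.FromBase64String(header.Parameter!)));

// Hypothetical helper mirroring the header construction in the constructor.
public static class BasicAuth
{
    public static AuthenticationHeaderValue Create(string accountSid, string authToken)
    {
        // Basic auth is the Base64 encoding of "username:password".
        byte[] raw = Encoding.ASCII.GetBytes($"{accountSid}:{authToken}");
        return new AuthenticationHeaderValue("Basic", Convert.ToBase64String(raw));
    }
}
```

Base64 is an encoding, not encryption, so the header is only protected when the request travels over HTTPS.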

The main method HandleIncomingMessageAsync() processes incoming WhatsApp messages by first checking whether the user is expected to provide an email address. If so, the message body is handled as email input. Otherwise, if the message contains a voice note, it is downloaded, transcribed, and processed. If neither condition is met, the service responds by prompting the user to send a voice note or, if one was requested, an email address.

If a voice note is detected, the service downloads the media from Twilio, transcribes the voice note, and attempts to extract an email address from the transcription. If an email address is found, the content is enhanced using IContentService before being sent to that address. If no email address is found, the service stores the voice note's media URL, marks the conversation as waiting for an email address, and asks the user to reply with one. This conversation state ensures that when the address arrives, the service knows which voice note to associate it with.

When a user provides an email address after sending a voice note, the service validates it using a regular expression. If it is valid, the pending voice note is downloaded again, transcribed, enhanced, and sent to the given address, and the conversation state is reset. If no pending voice note exists, the user is informed and asked to resend it.

The email extraction process uses a regular expression to find the first email-like pattern in a message. If the user's reply contains no valid address, the service responds with an error message and prompts the user to try again. In every case, the method returns a reply string, such as a confirmation message, that is sent back to the user through Twilio's WhatsApp channel.
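To see what the pattern does and does not catch, the sketch below runs the same regular expression against a few sample strings. The EmailExtractor wrapper and the sample sentences are illustrative. Whether a spoken address is transcribed with an "@" sign or as words depends on the transcription output, so the third sample shows the fallback case.

```csharp
using System;
using System.Text.RegularExpressions;

string[] samples =
{
    "Please send this to jane.doe@example.com today",
    "My address is bob@example.com.",
    "send it to jane at example dot com"
};

foreach (var sample in samples)
{
    Console.WriteLine(EmailExtractor.Extract(sample) ?? "(no match)");
}
// Prints:
// jane.doe@example.com
// bob@example.com
// (no match)

// Hypothetical wrapper around the pattern used by ExtractEmailAddress.
public static class EmailExtractor
{
    private static readonly Regex Pattern =
        new Regex(@"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}");

    // Returns the first email-like substring, or null when there is none.
    public static string? Extract(string text)
    {
        var match = Pattern.Match(text);
        return match.Success ? match.Value : null;
    }
}
```

The second sample shows that a sentence-ending period is not included in the match. The third shows why the service asks for a typed address when none is found.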

TranscriptionService

This service handles audio transcription by uploading the audio data to AssemblyAI, creating a transcription request, and polling until the transcription is complete. Go ahead and create a TranscriptionService.cs file. Update it with the code below.

using System.Text.Json.Serialization;
using VoiceToEmail.Core.Interfaces;
namespace VoiceToEmail.API.Services;
public class TranscriptionService : ITranscriptionService
{
    private readonly ILogger<TranscriptionService> _logger;
    private readonly HttpClient _httpClient;
    private readonly string _apiKey;
    public TranscriptionService(
        IConfiguration configuration,
        HttpClient httpClient,
        ILogger<TranscriptionService> logger)
    {
        _logger = logger;
        _httpClient = httpClient;
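        // InvalidOperationException: the single-string ArgumentNullException
        // constructor expects a parameter name, not a message
        _apiKey = configuration["AssemblyAI:ApiKey"] ?? 
            throw new InvalidOperationException("AssemblyAI:ApiKey configuration is missing");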
        _httpClient.DefaultRequestHeaders.Add("Authorization", _apiKey);
        _httpClient.BaseAddress = new Uri("https://api.assemblyai.com/v2/");
    }
    public async Task<string> TranscribeAudioAsync(byte[] audioData)
    {
        try
        {
            _logger.LogInformation("Starting audio transcription with AssemblyAI");
            // Upload the audio file
            using var audioContent = new ByteArrayContent(audioData);
            audioContent.Headers.Add("Content-Type", "application/octet-stream");
            var uploadResponse = await _httpClient.PostAsync("upload", audioContent);
            if (!uploadResponse.IsSuccessStatusCode)
            {
                var errorContent = await uploadResponse.Content.ReadAsStringAsync();
                _logger.LogError("Upload failed with status {Status}. Response: {Response}", 
                    uploadResponse.StatusCode, errorContent);
                throw new HttpRequestException($"Upload failed with status {uploadResponse.StatusCode}");
            }
            var uploadResult = await uploadResponse.Content.ReadFromJsonAsync<UploadResponse>();
            if (uploadResult?.upload_url == null)
            {
                _logger.LogError("Upload response missing upload_url. Response: {Response}", 
                    await uploadResponse.Content.ReadAsStringAsync());
                throw new InvalidOperationException("Failed to get upload URL from response");
            }
            _logger.LogInformation("Audio file uploaded successfully. Creating transcription request");
            // Create transcription request
            var transcriptionRequest = new TranscriptionRequest
            {
                audio_url = uploadResult.upload_url,
                language_detection = true
            };
            var transcriptionResponse = await _httpClient.PostAsJsonAsync("transcript", transcriptionRequest);
            if (!transcriptionResponse.IsSuccessStatusCode)
            {
                var errorContent = await transcriptionResponse.Content.ReadAsStringAsync();
                _logger.LogError("Transcription request failed with status {Status}. Response: {Response}", 
                    transcriptionResponse.StatusCode, errorContent);
                throw new HttpRequestException($"Transcription request failed with status {transcriptionResponse.StatusCode}");
            }
            var transcriptionResult = await transcriptionResponse.Content
                .ReadFromJsonAsync<TranscriptionResponse>();
            if (transcriptionResult?.id == null)
            {
                _logger.LogError("Transcription response missing ID. Response: {Response}", 
                    await transcriptionResponse.Content.ReadAsStringAsync());
                throw new InvalidOperationException("Failed to get transcript ID from response");
            }
            // Poll for completion
            int attempts = 0;
            const int maxAttempts = 60; // 1 minute timeout
            while (attempts < maxAttempts)
            {
                var pollingResponse = await _httpClient.GetAsync($"transcript/{transcriptionResult.id}");
                if (!pollingResponse.IsSuccessStatusCode)
                {
                    var errorContent = await pollingResponse.Content.ReadAsStringAsync();
                    _logger.LogError("Polling failed with status {Status}. Response: {Response}", 
                        pollingResponse.StatusCode, errorContent);
                    throw new HttpRequestException($"Polling failed with status {pollingResponse.StatusCode}");
                }
                var pollingResult = await pollingResponse.Content
                    .ReadFromJsonAsync<TranscriptionResponse>();
                if (pollingResult?.status == "completed")
                {
                    if (string.IsNullOrEmpty(pollingResult.text))
                    {
                        throw new InvalidOperationException("Received empty transcription text");
                    }
                    _logger.LogInformation("Transcription completed successfully");
                    return pollingResult.text;
                }
                if (pollingResult?.status == "error")
                {
                    var error = pollingResult.error ?? "Unknown error";
                    _logger.LogError("Transcription failed: {Error}", error);
                    throw new Exception($"Transcription failed: {error}");
                }
                _logger.LogInformation("Waiting for transcription to complete. Current status: {Status}", 
                    pollingResult?.status);
                attempts++;
                await Task.Delay(1000);
            }
            throw new TimeoutException("Transcription timed out after 60 seconds");
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, "Error during transcription");
            throw;
        }
    }
    private class UploadResponse
    {
        [JsonPropertyName("upload_url")]
        public string? upload_url { get; set; }
    }
    private class TranscriptionRequest
    {
        [JsonPropertyName("audio_url")]
        public string? audio_url { get; set; }
        [JsonPropertyName("language_detection")]
        public bool language_detection { get; set; }
    }
    private class TranscriptionResponse
    {
        [JsonPropertyName("id")]
        public string? id { get; set; }
        [JsonPropertyName("status")]
        public string? status { get; set; }
        [JsonPropertyName("text")]
        public string? text { get; set; }
        [JsonPropertyName("error")]
        public string? error { get; set; }
    }
}

The TranscriptionService class is responsible for handling voice note transcription using AssemblyAI. It takes an audio file as a byte array, uploads it to AssemblyAI, requests a transcription, and continuously polls until the transcription is completed. Once transcribed, it returns the text output. If any step fails, the service logs the error and throws an exception. This service is essential for converting WhatsApp voice messages into email-friendly text before being sent.

The class implements the ITranscriptionService interface, ensuring it follows a contract for transcription-related functionality. It uses dependency injection to receive an HttpClient for making API requests, an ILogger for logging, and an IConfiguration from which the API key is loaded. If the key is missing, an exception is thrown so that the service never attempts unauthenticated API requests.

The core functionality is implemented in TranscribeAudioAsync(byte[] audioData), which follows three key steps. First, it uploads the audio file to AssemblyAI, receiving a unique upload URL in response. If the upload fails, it logs the error and throws an exception. Second, it creates a transcription request using the upload URL and submits it to AssemblyAI. If the request is unsuccessful, it logs an error and stops execution. Third, the service enters a polling loop where it checks the transcription status every second for up to 60 seconds. If the transcription is completed, it returns the extracted text. If an error occurs, it logs the issue and throws an exception. If the timeout limit is reached, it throws a timeout exception.
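The polling step follows a general pattern: fetch, check for a terminal state, wait, and give up after a fixed number of attempts. A minimal generic sketch of that pattern is shown below. The Poller helper is hypothetical, and a queue of status strings stands in for the repeated HTTP calls so the example runs without network access.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Simulated status sequence standing in for repeated GET transcript/{id} calls.
var statuses = new Queue<string>(new[] { "queued", "processing", "completed" });

string final = await Poller.PollUntilAsync(
    fetch: () => Task.FromResult(statuses.Dequeue()),
    isDone: status => status == "completed" || status == "error",
    maxAttempts: 60,
    delay: TimeSpan.FromMilliseconds(10)); // the service waits one second between checks

Console.WriteLine(final); // completed

// Hypothetical helper, not part of the tutorial's code.
public static class Poller
{
    // Calls fetch until isDone returns true, waiting between attempts.
    // Throws TimeoutException when maxAttempts is exhausted.
    public static async Task<T> PollUntilAsync<T>(
        Func<Task<T>> fetch, Func<T, bool> isDone, int maxAttempts, TimeSpan delay)
    {
        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            T result = await fetch();
            if (isDone(result))
            {
                return result;
            }
            await Task.Delay(delay);
        }
        throw new TimeoutException($"No terminal result after {maxAttempts} attempts");
    }
}
```

Because each attempt also spends time on the request itself, the real wall-clock limit is somewhat longer than the attempt count multiplied by the delay.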

To support this process, the service defines helper classes to structure JSON responses from AssemblyAI. The UploadResponse class stores the upload URL returned after an audio file is uploaded. The TranscriptionRequest class defines the JSON structure for transcription requests, including the audio_url and an option for language_detection. The TranscriptionResponse class stores the transcription ID, status (queued, processing, completed, or error), the final transcribed text, and any error messages.
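The mapping between JSON field names and C# properties can be tried out in isolation with System.Text.Json. In the sketch below, the JSON payload is a made-up example shaped like the fields the service reads, not a captured AssemblyAI response. The classes use PascalCase property names to show that the JsonPropertyName attribute is what ties them to the snake_case fields.

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical payload; the "extra" field has no matching property and is ignored by default.
string json = "{\"id\":\"abc123\",\"status\":\"completed\",\"text\":\"Hello there\",\"extra\":42}";

var response = JsonSerializer.Deserialize<TranscriptionResponse>(json);
Console.WriteLine($"{response?.Id} | {response?.Status} | {response?.Text}");
// abc123 | completed | Hello there

// Serializing a request emits the names given in the attributes (audio_url, language_detection).
var request = new TranscriptionRequest { AudioUrl = "https://example.com/audio", LanguageDetection = true };
Console.WriteLine(JsonSerializer.Serialize(request));

public class TranscriptionRequest
{
    [JsonPropertyName("audio_url")]
    public string? AudioUrl { get; set; }

    [JsonPropertyName("language_detection")]
    public bool LanguageDetection { get; set; }
}

public class TranscriptionResponse
{
    [JsonPropertyName("id")]
    public string? Id { get; set; }

    [JsonPropertyName("status")]
    public string? Status { get; set; }

    [JsonPropertyName("text")]
    public string? Text { get; set; }

    [JsonPropertyName("error")]
    public string? Error { get; set; }
}
```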

This service ensures seamless voice-to-text conversion and integrates efficiently with the WhatsApp message handling workflow.

Next, create the ContentService.cs file in the Services folder.

ContentService

The ContentService class is responsible for enhancing transcribed text by transforming it into a professional email format. It integrates with OpenAI's API to generate well-structured and formal emails from voice transcriptions. This service ensures that voice-to-text conversions are polished and properly formatted before being sent via email. Update it with the code below.

using OpenAI.Chat;
using VoiceToEmail.Core.Interfaces;
public class ContentService : IContentService
{
    private readonly ChatClient _client;
    public ContentService(IConfiguration configuration)
    {
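        // Fail fast if the key is missing from configuration
        string apiKey = configuration["OpenAI:ApiKey"] ??
            throw new InvalidOperationException("OpenAI:ApiKey configuration is missing");
        // Initialize the ChatClient with your model (e.g. "o3-mini")
        _client = new ChatClient(model: "o3-mini-2025-01-31", apiKey: apiKey);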
    }
    public async Task<string> EnhanceContentAsync(string transcribedText)
    {
        var messages = new List<ChatMessage>
        {
            new SystemChatMessage(
                "Transform the following message into a professional email. " +
                "Maintain the core message but make it more formal and well-structured. " +
                "Add appropriate greeting and closing."
            ),
            new UserChatMessage(transcribedText)
        };
        var response = await _client.CompleteChatAsync(messages);
        return response.Value.Content.Last().Text.Trim();
    }
}

The class implements the IContentService interface, establishing a structured approach to processing content. It is initialized in the constructor by retrieving an API key from the configuration settings and using it to create a new instance of ChatClient with a specified model ("o3-mini-2025-01-31"). This setup readies the service for interacting with OpenAI's chat-based API.

The primary work is performed in the EnhanceContentAsync(string transcribedText) method. Here, a list of chat messages is created with two key components:

  • A system message that instructs the AI to transform the given message into a professional email. It directs the AI to keep the core content intact while formalizing the structure and adding a proper greeting and closing.
  • A user message that carries the original transcribed text.

The ChatClient then processes these messages asynchronously by calling CompleteChatAsync(messages). Finally, the method takes the text of the last content part in the AI's response, trims any extra whitespace, and returns this polished version of the transcription as a professional email.

Proceed to create the EmailService.cs file.

EmailService

The EmailService class is responsible for sending emails using SendGrid, ensuring that transcriptions and enhanced content reach the intended recipients. Update it with the code below.

using SendGrid;
using SendGrid.Helpers.Mail;
using VoiceToEmail.Core.Interfaces;
public class EmailService : IEmailService
{
    private readonly SendGridClient _client;
    private readonly string _fromEmail;
    private readonly string _fromName;
    public EmailService(IConfiguration configuration)
    {
        // InvalidOperationException: the single-string ArgumentNullException
        // constructor expects a parameter name, not a message
        var apiKey = configuration["SendGrid:ApiKey"] ?? 
            throw new InvalidOperationException("SendGrid:ApiKey configuration is missing");
        _client = new SendGridClient(apiKey);
        _fromEmail = configuration["SendGrid:FromEmail"] ?? 
            throw new InvalidOperationException("SendGrid:FromEmail configuration is missing");
        _fromName = configuration["SendGrid:FromName"] ?? 
            throw new InvalidOperationException("SendGrid:FromName configuration is missing");
    }
    public async Task SendEmailAsync(string to, string subject, string content)
    {
        var from = new EmailAddress(_fromEmail, _fromName);
        var toAddress = new EmailAddress(to);
        var msg = MailHelper.CreateSingleEmail(
            from,
            toAddress,
            subject,
            content,
            $"<div style='font-family: Arial, sans-serif;'>{content}</div>"
        );
        var response = await _client.SendEmailAsync(msg);
        if (!response.IsSuccessStatusCode)
        {
            throw new Exception($"Failed to send email: {response.StatusCode}");
        }
    }
}

It implements IEmailService, enforcing a contract for email-sending functionality. The service is initialized with a SendGridClient, which requires an API key, and it retrieves the sender's email address and name from the configuration settings. If any of these configurations are missing, the constructor throws an exception to prevent misconfigured email sending.

The primary method, SendEmailAsync(string to, string subject, string content), handles email dispatch. It first constructs the sender and recipient email addresses using SendGrid’s EmailAddress class. Then, it uses MailHelper.CreateSingleEmail() to create an email message, formatting the content with HTML styling for better readability. The email is then sent via _client.SendEmailAsync(msg), and if the response indicates failure, an exception is thrown to signal the issue.
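One detail to be aware of: the HTML version wraps the content in a div as-is, so line breaks in the generated email collapse when rendered, and characters such as < or & are interpreted as markup. As a hedged sketch of one way to handle this, the hypothetical helper below encodes the text and converts line breaks before wrapping it. It is an optional variation, not what the listing above does.

```csharp
using System;
using System.Net;

string body = EmailHtml.ToHtmlBody("Hi <Bob> & team,\nSee you at 5.");
Console.WriteLine(body);

// Hypothetical helper, not part of the tutorial's code.
public static class EmailHtml
{
    public static string ToHtmlBody(string plainText)
    {
        // Encode first so markup-like characters in the text are displayed literally.
        string encoded = WebUtility.HtmlEncode(plainText);
        // Then turn line breaks into <br> so paragraphs survive HTML rendering.
        string withBreaks = encoded.Replace("\r\n", "\n").Replace("\n", "<br>");
        return $"<div style='font-family: Arial, sans-serif;'>{withBreaks}</div>";
    }
}
```

The result could be passed as the last argument to MailHelper.CreateSingleEmail() in place of the interpolated string, while the plain-text argument stays unchanged.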

This service ensures reliable email delivery, making it an essential component for sending professionally formatted transcriptions via email.

Move forward to the controllers.

Implement the Controller

The MessageController class handles HTTP requests for processing voice messages and converting them into text-based emails. Create a Controllers folder in VoiceToEmail.API and create Controllers.cs. Update it with the code below.

using Microsoft.AspNetCore.Mvc;
using VoiceToEmail.Core.Interfaces;
namespace VoiceToEmail.API.Controllers;
[ApiController]
[Route("api/[controller]")]
public class MessageController : ControllerBase
{
    private readonly ITranscriptionService _transcriptionService;
    private readonly IContentService _contentService;
    private readonly IEmailService _emailService;
    private readonly ILogger<MessageController> _logger;
    public MessageController(
        ITranscriptionService transcriptionService,
        IContentService contentService,
        IEmailService emailService,
        ILogger<MessageController> logger)
    {
        _transcriptionService = transcriptionService;
        _contentService = contentService;
        _emailService = emailService;
        _logger = logger;
    }
    [HttpPost]
    public async Task<IActionResult> SendMessage(IFormFile audioFile, string recipientEmail)
    {
        try
        {
            if (audioFile == null || audioFile.Length == 0)
                return BadRequest("Audio file is required");
            using var memoryStream = new MemoryStream();
            await audioFile.CopyToAsync(memoryStream);
            var audioData = memoryStream.ToArray();
            _logger.LogInformation("Starting transcription");
            var transcribedText = await _transcriptionService.TranscribeAudioAsync(audioData);
            _logger.LogInformation("Enhancing content");
            var enhancedContent = await _contentService.EnhanceContentAsync(transcribedText);
            _logger.LogInformation("Sending email");
            await _emailService.SendEmailAsync(
                recipientEmail,
                "Voice Message Transcription",
                enhancedContent
            );
            var response = new
            {
                TranscribedText = transcribedText,
                EnhancedContent = enhancedContent,
                RecipientEmail = recipientEmail,
                Status = "Completed"
            };
            return Ok(response);
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, "Error processing voice message");
            return StatusCode(500, "An error occurred while processing your message");
        }
    }
}

This code uses dependency injection to access ITranscriptionService, IContentService, and IEmailService. The controller exposes a POST endpoint that accepts an audio file and a recipient's email address as input.

When a request is received, the controller first validates the uploaded audio file to ensure it exists and has content. It then reads the audio data into memory and calls TranscribeAudioAsync() to obtain a text-based transcription. The transcribed text is further enhanced using EnhanceContentAsync(), which formats it into a professional email. Finally, SendEmailAsync() is called to send the processed transcription to the specified recipient.

Throughout the process, the controller logs key steps, ensuring issues can be traced and debugged effectively. If any errors occur, they are logged, and the API responds with a 500 Internal Server Error. On success, the response includes the original transcription, enhanced content, recipient email, and a status message.
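If you want to exercise this endpoint directly, a request along the following lines should work. This is a sketch under a few assumptions: the API is listening on http://localhost:5168, the audio is an OGG file, and recipientEmail is bound from the query string (the usual [ApiController] inference for a simple string parameter). BuildRequest is a hypothetical helper name.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;

// Hypothetical helper: builds the multipart request the SendMessage action expects.
static HttpRequestMessage BuildRequest(string baseUrl, byte[] audio, string recipientEmail)
{
    var file = new ByteArrayContent(audio);
    file.Headers.ContentType = new MediaTypeHeaderValue("audio/ogg"); // assumed audio format

    // The form field name must match the action's parameter name, "audioFile".
    var form = new MultipartFormDataContent();
    form.Add(file, "audioFile", "voice-note.ogg");

    // Assumption: the simple string parameter is read from the query string.
    var url = $"{baseUrl}/api/message?recipientEmail={Uri.EscapeDataString(recipientEmail)}";
    return new HttpRequestMessage(HttpMethod.Post, url) { Content = form };
}

var request = BuildRequest("http://localhost:5168", new byte[] { 1, 2, 3 }, "recipient@example.com");
Console.WriteLine($"{request.Method} {request.RequestUri!.AbsoluteUri}");

// To actually send it (requires the API to be running and real audio bytes):
// using var http = new HttpClient();
// var response = await http.SendAsync(request);
// Console.WriteLine(await response.Content.ReadAsStringAsync());
```

The placeholder bytes above are not valid audio, so sending them as-is would be expected to fail at the transcription step and produce the controller's 500 response.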

Proceed to WhatsAppController to bind it with the webhook.

WhatsAppController

The WhatsAppController class serves as the webhook endpoint for Twilio’s WhatsApp API, allowing WhatsApp messages to be processed in real-time. Create a WhatsAppController.cs file and update it with the code below.

using Microsoft.AspNetCore.Mvc;
using VoiceToEmail.Core.Models;
using VoiceToEmail.Core.Interfaces;
namespace VoiceToEmail.API.Controllers;
[ApiController]
[Route("api/[controller]")]
public class WhatsAppController : ControllerBase
{
    private readonly IWhatsAppService _whatsAppService;
    private readonly ILogger<WhatsAppController> _logger;
    public WhatsAppController(
        IWhatsAppService whatsAppService,
        ILogger<WhatsAppController> logger)
    {
        _whatsAppService = whatsAppService;
        _logger = logger;
    }
    // Test endpoint to verify routing
    [HttpGet]
    public IActionResult Test()
    {
        _logger.LogInformation("Test endpoint hit at: {time}", DateTime.UtcNow);
        return Ok("WhatsApp endpoint is working!");
    }
    // Main webhook endpoint for Twilio
    [HttpPost]
    public async Task<IActionResult> Webhook([FromForm] Dictionary<string, string> form)
    {
        try
        {
            _logger.LogInformation("Webhook received at: {time}", DateTime.UtcNow);
            // Log all incoming form data
            foreach (var item in form)
            {
                _logger.LogInformation("Form data - {Key}: {Value}", item.Key, item.Value);
            }
            // Create WhatsApp message from form data
            var message = new WhatsAppMessage
            {
                MessageSid = form.GetValueOrDefault("MessageSid"),
                From = form.GetValueOrDefault("From"),
                To = form.GetValueOrDefault("To"),
                Body = form.GetValueOrDefault("Body"),
                NumMedia = int.Parse(form.GetValueOrDefault("NumMedia", "0"))
            };
            // Process media if present
            for (int i = 0; i < message.NumMedia; i++)
            {
                var mediaUrl = form.GetValueOrDefault($"MediaUrl{i}");
                var mediaContentType = form.GetValueOrDefault($"MediaContentType{i}");
                if (!string.IsNullOrEmpty(mediaUrl))
                {
                    message.MediaUrls[mediaContentType] = mediaUrl;
                    _logger.LogInformation("Media found - URL: {MediaUrl}, Type: {MediaType}", 
                        mediaUrl, mediaContentType);
                }
            }
            // Process message and get response
            var response = await _whatsAppService.HandleIncomingMessageAsync(message);
            _logger.LogInformation("Response generated: {Response}", response);
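            // Create and return TwiML response; escape the reply so characters such as & or < don't break the XML
            var safeResponse = System.Security.SecurityElement.Escape(response);
            var twimlResponse = $@"<?xml version=""1.0"" encoding=""utf-8""?>
<Response>
    <Message>{safeResponse}</Message>
</Response>";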
            return Content(twimlResponse, "application/xml");
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, "Error processing webhook: {ErrorMessage}", ex.Message);
            // Return a basic TwiML response even in case of error
            var errorResponse = $@"<?xml version=""1.0"" encoding=""utf-8""?>
<Response>
    <Message>Sorry, there was an error processing your message. Please try again.</Message>
</Response>";
            return Content(errorResponse, "application/xml");
        }
    }
}

The controller is initialized with IWhatsAppService, which handles incoming messages, and an ILogger for logging.

A GET request to the controller verifies that the WhatsApp webhook is accessible, responding with a basic success message. The primary functionality, however, lies in the POST webhook endpoint, which Twilio calls when a message is received. This method extracts form data from the incoming request, logs the details, and constructs a WhatsAppMessage object containing metadata such as the message sender, recipient, text body, and any media attachments.

If media such as voice notes is present, the controller iterates through the media URLs and adds them to the message object. The fully constructed message is then passed to HandleIncomingMessageAsync(), which determines the appropriate response based on the message content. The response is formatted into TwiML (Twilio Markup Language) and returned to Twilio, ensuring the sender receives an appropriate reply.

If any errors occur during processing, the controller logs them and returns a fallback TwiML response apologizing for the issue. This controller is the entry point for WhatsApp voice messages: it receives them from Twilio and hands them to WhatsAppService to be transcribed and sent as emails.
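To see what the webhook receives without involving WhatsApp, you can build a form-encoded body with the same field names the Webhook action reads. The sketch below is an approximation of a Twilio request, not a capture of one: every value is a placeholder, and BuildWebhookForm is a hypothetical helper name.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;

// Hypothetical helper: form fields mirroring the ones the Webhook action reads.
// All values are placeholders; a real request would come from Twilio.
static FormUrlEncodedContent BuildWebhookForm(string mediaUrl)
{
    return new FormUrlEncodedContent(new Dictionary<string, string>
    {
        ["MessageSid"] = "SMxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
        ["From"] = "whatsapp:+15550001111",
        ["To"] = "whatsapp:+15550002222",
        ["Body"] = "",
        ["NumMedia"] = "1",
        ["MediaUrl0"] = mediaUrl,
        ["MediaContentType0"] = "audio/ogg"
    });
}

var form = BuildWebhookForm("https://example.com/voice-note.ogg");

// Print the encoded body to inspect what the controller would receive.
Console.WriteLine(await form.ReadAsStringAsync());

// To post it to a locally running API (URL and port are assumptions):
// using var http = new HttpClient();
// var reply = await http.PostAsync("http://localhost:5168/api/whatsapp", form);
// Console.WriteLine(await reply.Content.ReadAsStringAsync()); // TwiML reply
```

Because the media URL is a placeholder, posting this to the running API would most likely exercise the error path and return the fallback TwiML message, which is still a useful check that routing and form binding work.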

Configure Application

The JSON configuration file defines essential API keys, logging settings, and service credentials required for the VoiceToEmail API to function properly. It provides structured environment settings for external services such as OpenAI, SendGrid, Twilio, and AssemblyAI. Update the appsettings.json as shown below. Remember to replace the placeholders with real credential values.

For detailed instructions on obtaining API keys and service credentials (OpenAI, SendGrid, Twilio, and AssemblyAI), please refer to the Prerequisites section.

Never expose API keys in public repositories or share them online. To keep your credentials secure:

  • Add appsettings.json to your .gitignore file to prevent accidental uploads.
  • Use environment variables instead of storing sensitive data in plain text for deployment purposes.
  • Consider using a secrets management tool for better security.

{
  "OpenAI": {
    "ApiKey": "OPENAI_API_KEY"
  },
  "SendGrid": {
    "ApiKey": "SENDGRID_API_KEY",
    "FromEmail": "SENDGRID_FROM_EMAIL",
    "FromName": "SENDGRID_FROM_NAME"
  },
  "Logging": {
    "LogLevel": {
      "Default": "Information",
      "Microsoft.AspNetCore": "Warning"
    }
  },
  "Twilio": {
    "AccountSid": "TWILIO_ACCOUNT_SID",
    "AuthToken": "TWILIO_AUTH_TOKEN",
    "WhatsAppNumber": "TWILIO_WHATSAPP_NUMBER"
  },
  "AssemblyAI": {
    "ApiKey": "ASSEMBLYAI_API_KEY"
  },
  "AllowedHosts": "*"
}

The OpenAI section contains the ApiKey, which is used in ContentService to enhance transcribed text by converting it into a professionally formatted email. The SendGrid section includes an ApiKey for authentication, along with FromEmail and FromName, which define the sender’s email address and display name when sending emails. These settings allow EmailService to manage email delivery reliably.

For logging, the configuration specifies different logging levels. By default, general logs are set to "Information", ensuring key events are recorded, while ASP.NET Core framework logs are filtered to "Warning" to prevent excessive log noise. This improves monitoring and debugging by focusing on important events.

The Twilio section holds credentials for handling WhatsApp messages, including the AccountSid and AuthToken, which authenticate API requests. Additionally, WhatsAppNumber specifies the Twilio WhatsApp sender number used for messaging, allowing seamless integration with WhatsAppService. You can use the phone number of a phone that you have access to with WhatsApp installed. If you want instructions on setting up a registered sender for WhatsApp, review the documentation.

The AssemblyAI section contains an ApiKey required for speech-to-text transcription, enabling TranscriptionService to process voice messages and convert them into text-based content. This API plays a crucial role in ensuring accurate transcriptions before they are formatted and sent via email.

Finally, the AllowedHosts setting is configured as *, which lets the application respond to requests addressed to any host name (the filter is applied to the request's Host header). This is convenient during development; in production, restrict it to the host names your application actually serves.
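As a small illustration of the environment-variable advice above, the sketch below shows how a variable can override a value from appsettings.json. It builds a configuration by hand so it can run on its own; in the actual application, WebApplication.CreateBuilder already adds the environment-variable source by default. The variable is set in-process purely for demonstration, and BuildConfiguration is a hypothetical helper name.

```csharp
using System;
using Microsoft.Extensions.Configuration;

// Later sources override earlier ones, so an environment variable
// wins over the value stored in appsettings.json.
static IConfiguration BuildConfiguration() =>
    new ConfigurationBuilder()
        .AddJsonFile("appsettings.json", optional: true)
        .AddEnvironmentVariables()
        .Build();

// Set in-process only for demonstration; normally your shell or host sets it.
// A double underscore in the variable name maps to the ":" section separator.
Environment.SetEnvironmentVariable("SendGrid__ApiKey", "value-from-environment");

var config = BuildConfiguration();
Console.WriteLine(config["SendGrid:ApiKey"]); // prints "value-from-environment"
```

With this approach, the placeholders in appsettings.json can stay in the file while the real keys are supplied by the environment the application runs in.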

Next, update the Program.cs with the code below to bundle everything in our application together and bring it to life.

using Microsoft.OpenApi.Models;
using VoiceToEmail.API.Services;
using VoiceToEmail.Core.Interfaces;
var builder = WebApplication.CreateBuilder(args);
// Add services to the container.
builder.Services.AddControllers().AddXmlSerializerFormatters();
builder.Services.AddEndpointsApiExplorer();
// Add Swagger services
builder.Services.AddSwaggerGen(c =>
{
    c.SwaggerDoc("v1", new OpenApiInfo { Title = "VoiceToEmail API", Version = "v1" });
});
// Configure HttpClient for Twilio with a named client
builder.Services.AddHttpClient("TwilioClient", client =>
{
    client.Timeout = TimeSpan.FromMinutes(2); // Increased timeout for media downloads
});
// Register services
builder.Services.AddHttpClient();
builder.Services.AddScoped<ITranscriptionService, TranscriptionService>();
builder.Services.AddScoped<IContentService, ContentService>();
builder.Services.AddScoped<IEmailService, EmailService>();
builder.Services.AddScoped<IWhatsAppService>(sp =>
{
    var config = sp.GetRequiredService<IConfiguration>();
    var transcriptionService = sp.GetRequiredService<ITranscriptionService>();
    var contentService = sp.GetRequiredService<IContentService>();
    var emailService = sp.GetRequiredService<IEmailService>();
    var logger = sp.GetRequiredService<ILogger<WhatsAppService>>();
    var httpClientFactory = sp.GetRequiredService<IHttpClientFactory>();
    var httpClient = httpClientFactory.CreateClient("TwilioClient");
    return new WhatsAppService(
        config,
        transcriptionService,
        contentService,
        emailService,
        httpClient,
        logger
    );
});
// Add CORS for development
builder.Services.AddCors(options =>
{
    options.AddPolicy("AllowAll",
        policy =>
        {
            policy
                .AllowAnyOrigin()
                .AllowAnyMethod()
                .AllowAnyHeader();
        });
});
// Add logging
builder.Services.AddLogging(logging =>
{
    logging.AddConsole();
    logging.AddDebug();
});
var app = builder.Build();
// Configure the HTTP request pipeline.
if (app.Environment.IsDevelopment())
{
    app.UseSwagger();
    app.UseSwaggerUI(c =>
    {
        c.SwaggerEndpoint("/swagger/v1/swagger.json", "VoiceToEmail API V1");
    });
    app.UseDeveloperExceptionPage();
    app.UseCors("AllowAll");
}
app.UseRouting();
app.UseAuthorization();
app.UseEndpoints(endpoints =>
{
    endpoints.MapControllers();
});
// Log application startup
var logger = app.Services.GetRequiredService<ILogger<Program>>();
logger.LogInformation("Application started. Environment: {Environment}", 
    app.Environment.EnvironmentName);
app.Run();

The Program.cs file is responsible for configuring and starting the VoiceToEmail API application. It initializes dependency injection, configures services, sets up HTTP request handling, and enables logging and debugging features.

The application is built on ASP.NET Core and uses the minimal hosting model: top-level statements in Program.cs, with controller-based endpoints rather than Minimal API route handlers. It begins by creating a WebApplicationBuilder, which is used to configure services and middleware. The AddControllers() method enables API controller functionality, and AddXmlSerializerFormatters() adds support for XML serialization. Additionally, AddEndpointsApiExplorer() is registered to facilitate API endpoint discovery for tools like Swagger.

To improve developer experience, the application integrates Swagger for API documentation. The AddSwaggerGen() method registers the Swagger document generator, and in the Development environment UseSwagger() and UseSwaggerUI() serve the generated document and an interactive UI for exploring the endpoints. The application also defines a CORS (Cross-Origin Resource Sharing) policy that permits requests from any origin; like Swagger, it is applied only in Development, making it easier to test the API from different front-end clients. The application then builds and configures the request pipeline, enabling routing, authorization, and endpoint mapping for API controllers.

Finally, logging is enabled using ILogger, ensuring that important events such as application startup and environment configuration are recorded. The application then runs the server, making the API available for handling requests. This file acts as the entry point for the entire system, bringing together all the services and middleware required to process voice messages and convert them into emails.
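The named HttpClient registered in Program.cs can be tried in isolation. The following sketch uses a standalone ServiceCollection instead of the web host (an assumption made to keep it self-contained) and compares the "TwilioClient" configuration with an unnamed client.

```csharp
using System;
using System.Net.Http;
using Microsoft.Extensions.DependencyInjection;

var services = new ServiceCollection();

// Same registration as in Program.cs: a named client with a longer timeout.
services.AddHttpClient("TwilioClient", client =>
{
    client.Timeout = TimeSpan.FromMinutes(2);
});

using var provider = services.BuildServiceProvider();
var factory = provider.GetRequiredService<IHttpClientFactory>();

var twilioClient = factory.CreateClient("TwilioClient");
var defaultClient = factory.CreateClient();

Console.WriteLine(twilioClient.Timeout);  // 00:02:00
Console.WriteLine(defaultClient.Timeout); // 00:01:40 (HttpClient's default of 100 seconds)
```

Only clients created with the "TwilioClient" name get the two-minute timeout, which is why Program.cs passes that name to CreateClient when constructing WhatsAppService.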

Test the Application

To verify that the WhatsApp Voice-to-Email API is working correctly, follow these steps:

Navigate to the project directory, then build and run the application using the commands below:

cd VoiceToEmail.API
dotnet build
dotnet run

This starts the server and displays the URL it is listening on, for example http://localhost:5168. Feel free to configure a different port number; if you do, use that port in the ngrok command as well.

Since Twilio needs a publicly accessible URL for its WhatsApp Webhook, use ngrok to expose your local server to the internet. Run the command below to start ngrok.

ngrok http 5168

Once ngrok is running, it generates a public forwarding URL (for example, https://randomstring.ngrok.io), as displayed in the screenshot below.

Ngrok
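Before configuring the sandbox, it can help to confirm that the forwarding URL reaches your application. The sketch below calls the GET test action on WhatsAppController. The base URL is a placeholder that you would replace with your own forwarding URL, and TestEndpoint is a hypothetical helper name.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical helper: builds the URL of the GET test action on WhatsAppController.
static Uri TestEndpoint(string baseUrl) =>
    new Uri(new Uri(baseUrl.TrimEnd('/') + "/"), "api/whatsapp");

// Placeholder value; replace it with your own ngrok forwarding URL.
var endpoint = TestEndpoint("https://randomstring.ngrok.io");
Console.WriteLine($"Checking {endpoint.AbsoluteUri}");

using var http = new HttpClient();
try
{
    // The Test() action is expected to reply with "WhatsApp endpoint is working!".
    Console.WriteLine(await http.GetStringAsync(endpoint));
}
catch (HttpRequestException ex)
{
    Console.WriteLine($"Endpoint not reachable: {ex.Message}");
}
catch (TaskCanceledException)
{
    Console.WriteLine("Endpoint not reachable: the request timed out.");
}
```

If the check fails, confirm that the application is still running and that ngrok is forwarding to the same port.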

Head over to the Twilio Try WhatsApp page. Make sure the phone number you listed in the settings file is connected to the WhatsApp sandbox by sending the join code you see to the provided number. Then click the Sandbox Settings tab and configure the sandbox as follows.

  • In the When a message comes in field, paste the ngrok forwarding URL and append /api/whatsapp, since that is the route where the webhook requests are processed.
  • Set the Method to POST.

Click the Save button to confirm the configuration. Confirm you are on the right track as depicted in the screenshot below.

Webhook endpoint

Your application is now connected to the sandbox. Now, send a voice note to test the transcription, personalization, and email delivery process. If no email address is picked up from the voice note, you can either send another voice note containing the email address or send the email address as a text message. Check the screenshot below for the application flow.

Whatsapp Voice to email

After sending a voice note, you should receive the resulting email, delivered through Twilio SendGrid. As the screenshot shows, the email contains a personalized version of the voice message's transcription. Remember that in EnhanceContentAsync you instructed the AI to rewrite the contents of your voice note with email headers and appropriate professional language, so the email won't match what you said aloud word for word. If you are happy with the results, verify the rest of the output. If not, you can always alter the AI prompt to adjust the generated email.

Transcribed Personalized Email

Verifying the Output

  • Check the Email – Ensure the email was delivered successfully to the intended recipient.
  • Confirm the Transcription Accuracy – Compare the email's text with the original voice note. Minor errors may occur due to background noise or unclear speech.
  • Ensure Proper Formatting – The email should have a clear subject, structured content, and proper paragraph formatting.
  • Check the Sender Details – Verify that the email is sent from the configured SendGrid email address.

Points to check out

  • If the transcription is inaccurate, feel free to try other transcription tools.
  • If the email was not received, verify your SendGrid API key and email configuration.

At this point, the application has successfully processed and transcribed the voice note and personalized it into an email.

What's Next for Your WhatsApp Voice-to-Email System?

You've built a robust, agentic AI-powered WhatsApp voice-to-email application that:

  • Receives voice notes via Twilio WhatsApp and processes them in real-time.
  • Transcribes audio using AssemblyAI for accurate speech-to-text conversion.
  • Enhances transcriptions with OpenAI to create well-structured, professional emails.
  • Delivers transcriptions via SendGrid, ensuring seamless email communication.

Possible use cases:

AI for the Visually Impaired – Develop voice-driven systems that transcribe, summarize, and deliver emails, helping visually impaired users navigate digital communication more efficiently.

CRM and Business Integration – Connect with customer relationship management (CRM) systems to log voice interactions and automate business workflows.

Automated Response System – Implement AI-driven auto-replies for WhatsApp messages, providing instant feedback or call-to-action responses.

Multi-Language Support – Expand the transcription service to detect and process multiple languages, enabling global accessibility.

Bonus materials for insights and to further your skills:

Jacob automates applications using Twilio for communication, OpenAI for brain-like processing, and vector databases for intelligent search. He loves seeing real-world problems solved with AI and the agentic wave. Check out more of his work on GitHub.