Model Context Protocol: Real-World Performance Testing of Twilio Alpha's MCP Server
Time to read: 9 minutes
You’ve probably already heard some hype (and skepticism) around Model Context Protocol (MCP) for AI agents. MCP was originally introduced by Anthropic as a standard way for AI agents to discover and use tools more flexibly. I won’t rehash the basics here, but if you're new to MCP or want a primer on what it is and how it works, check out our introductory blog post.
What I will do is dive into the numbers behind that hype. Over the last few months, interest in MCP has exploded: Google Trends shows a sharp rise in searches for "Model Context Protocol," with a significant spike following OpenAI’s support announcement. Developer forums are buzzing, GitHub repos are multiplying, and more tools are adding MCP support weekly.
So I set out to answer the obvious next question: Does using MCP actually help agents perform better on real-world tasks, or is it just another overhyped spec? I ran a series of hands-on tests using Twilio’s own MCP implementation to find out.


How I tested MCP (Twilio Style)
To get some answers, I set up a simple task-based experiment. We used a popular open-source VS Code extension called Cline (v3.8.4) as our AI agent and hooked it up to Twilio Alpha’s new MCP server. We picked three common Twilio tasks for the agent to perform – things a Twilio developer might do:
- Purchase a phone number.
- Create a TaskRouter activity.
- Create a TaskRouter queue with a filter.
For each task, we ran the agent using two different implementations:
- No MCP (Control): The agent solved the task without any MCP help by using its usual built-in tools (like file search, terminal execution, web search, or its trained knowledge of Twilio’s docs).
- MCP Enabled: The agent had access to the Twilio MCP Server, which exposed live Twilio API tools for the agent to discover and call at runtime (a sketch of how this is wired into Cline follows this list).
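For context, wiring the MCP server into Cline comes down to one entry in Cline’s MCP settings (in practice a JSON file managed from the extension). Here’s a rough sketch of that entry, written as a TypeScript object for readability; the package name and credential variable names are assumptions rather than something copied from Twilio’s docs, so check the server’s README for the exact invocation.

```typescript
// Rough shape of an MCP server entry in Cline's settings file.
// The package name and credential variables below are assumptions,
// not copied from Twilio's docs; verify them against the server's README.
const mcpServers = {
  twilio: {
    command: "npx",
    args: ["-y", "@twilio-alpha/mcp"], // assumed package name
    env: {
      // Placeholder credentials; use an API key/secret (never a live auth token) in real configs.
      TWILIO_ACCOUNT_SID: "ACxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
      TWILIO_API_KEY: "SKxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
      TWILIO_API_SECRET: "your-api-secret",
    },
  },
};
```

Once an entry like that is in place, the client discovers the server’s tools on startup and can call them mid-task without any custom wrapper code.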
We repeated each task a minimum of ten times for each implementation to account for variability (AI agents are non-deterministic, after all). We used the same AI model for all runs (Anthropic’s Claude 3.7 Sonnet) to ensure any differences came from MCP, not the model.
Then we logged the following metrics for every run (a sketch of the per-run record appears after this list):
- Time to complete the task
- Number of API calls made
- Number of human interactions needed to complete the task
- Tokens used (input, output, cache reads, cache writes)
- Cost to complete the task
- Success rate
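To make that concrete, here’s a minimal sketch of the per-run record behind those metrics and how we compared the two implementations. The field names are our own labels for this experiment, not part of Cline or any Twilio API.

```typescript
// Our own per-run record for this experiment (labels are ours, not an API).
interface RunMetrics {
  durationSeconds: number;   // how long the task took
  apiCalls: number;          // Twilio API calls made
  humanInteractions: number; // clarifying questions asked of the user
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;
  cacheWriteTokens: number;
  costUsd: number;           // Anthropic bill for the run
  succeeded: boolean;        // averaged, this gives the success rate
}

// Average one field across runs (booleans coerce to 0/1, so averaging
// `succeeded` yields the success rate).
function average(runs: RunMetrics[], field: keyof RunMetrics): number {
  return runs.reduce((sum, run) => sum + Number(run[field]), 0) / runs.length;
}

// Relative change of the MCP runs versus the control runs, in percent
// (e.g. roughly -20.5 for "tasks completed ~20.5% faster").
function percentChange(
  control: RunMetrics[],
  mcp: RunMetrics[],
  field: keyof RunMetrics
): number {
  const base = average(control, field);
  return ((average(mcp, field) - base) / base) * 100;
}
```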
Results: Faster and more reliable – but a bit more expensive
After crunching the numbers, here’s what we found for MCP-enabled runs versus control:
- Tasks completed ~20.5% faster on average. Agents using MCP finished the tasks quicker than those without.
- ~19.2% fewer API calls were needed. MCP helped the agent be more efficient in the number of times it hit Twilio’s APIs.
- Slightly fewer interactions with the user (-3.2%). In most cases, the agent needed one fewer back-and-forth question to get the job done.
- On average, 6.3% fewer AI tokens were consumed, which is a modest reduction in the amount of “thinking” the model had to do.
- Success rate jumped to 100%. Every MCP-assisted run succeeded, compared to a ~92.3% success rate without MCP. (In other words, the non-MCP agent occasionally failed one of the tasks, while the MCP agent nailed them all.)


Sounds great so far, right? It actually worked better than we expected… for the most part.
Here’s the catch: using MCP wasn’t free. (As my high school econ teacher used to say: “There’s no such thing as a free lunch.”) MCP delivered noticeable speed and reliability gains—but it came with added cost. In fact, it introduced some overhead:
- ~28.5% more cache reads and ~53.7% more cache writes under the hood. The MCP server led the agent to pull in a lot more cached context data.
- Cost increased by about 27.5% on average. This is the “uh-oh” part – those extra tokens and operations translated to a higher Anthropic API bill per task.
We were optimistic about seeing major gains—and in many ways, we did. But once we looked into the cache numbers, things got a bit more nuanced. In hindsight, the added cost makes sense: more context means more tokens, and that means higher cost. While shaving off a few seconds per task is a solid win, it’s worth asking whether that trade-off is justified at scale. That led us to dig deeper into what’s really driving the increase in cached context.


Rethinking context: When more isn’t always better
Before we discuss the specifics of cached tokens, it's worth stepping back and asking a broader question: How much is each additional token of context actually worth?
As developers, we're often tempted to load our agents with as much information as possible—docs, specs, tool definitions—assuming that more context leads to better results. But our testing suggests that there's a point of diminishing returns. Adding context improves performance… up to a point. After that, you're paying for more tokens without seeing a proportional quality or success rate boost.
In the MCP and AI agents world, it’s becoming increasingly important to think critically about context efficiency. Not just “Did the task succeed?”, but “How much context did it take to get there, and was that worth the cost?”
Digging into the cache: Why did costs go up?
The culprit behind the higher cost is how MCP handles context caching. Essentially, MCP-enabled agents pull in a lot of reference data (like API specs and tool definitions) and keep it handy as “cached” context. Our runs saw a big jump in these cached tokens. Even though Anthropic recently updated their pricing so cached tokens don’t count as heavily as before, our MCP agent still pushed a ton of data into the context cache.
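To see why that matters for the bill, here’s a back-of-the-envelope cost model built on the token usage fields that Anthropic’s Messages API reports. The per-million-token rates are the Claude 3.7 Sonnet list prices as we understand them at the time of writing; treat them as placeholders and check current pricing before relying on the math.

```typescript
// Rough per-run cost model. Rates are USD per million tokens and are
// placeholders based on Claude 3.7 Sonnet list pricing at the time of
// writing; check Anthropic's current price sheet before relying on them.
const RATES_PER_MTOK = {
  input: 3.0,
  output: 15.0,
  cacheWrite: 3.75, // cache writes are billed above plain input tokens...
  cacheRead: 0.3,   // ...while cache reads are billed far below them
};

// Usage fields as reported on Anthropic Messages API responses.
interface Usage {
  input_tokens: number;
  output_tokens: number;
  cache_creation_input_tokens: number; // "cache writes" in our metrics
  cache_read_input_tokens: number;     // "cache reads" in our metrics
}

function runCostUsd(u: Usage): number {
  const per = (rate: number) => rate / 1_000_000;
  return (
    u.input_tokens * per(RATES_PER_MTOK.input) +
    u.output_tokens * per(RATES_PER_MTOK.output) +
    u.cache_creation_input_tokens * per(RATES_PER_MTOK.cacheWrite) +
    u.cache_read_input_tokens * per(RATES_PER_MTOK.cacheRead)
  );
}
```

Because cache writes sit at the expensive end of that table, the jump in cache writes we measured moves the bill noticeably even when plain input and output tokens barely change.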
Why so many cached tokens? We have a few hunches:
- Client behavior (Cline): Cline is most likely loading a large chunk, if not all, of the MCP server’s provided context at the start, even if the task only needs part of it.
- Server content: Our Twilio MCP server might be offering too many tools or too much info. If the agent gets a whole kitchen sink of Twilio API endpoints and data, that’s a lot to chew on even for a simple task.
We plan to experiment with different MCP client and server configurations and other tasks to pinpoint the cause. (Is it just a Twilio thing, or do all MCP servers cause this caching surge?)
The good news is that token costs continue to fall, and future iterations might handle caching more efficiently. For now, though, it’s clear that MCP makes the agent smarter and faster at the cost of consuming more tokens. There’s reason to believe this cost burden will shrink over time: just as LLM costs have dropped dramatically since ChatGPT launched, there’s a good chance MCP-related costs will follow a similar trajectory.
Anthropic, for example, has already rolled out token-saving updates that reduce the impact of cached context. These include prompt caching optimizations, cache-aware rate limits, and changes that remove prompt cache reads from input token rate limits. As models and infrastructure continue to improve, the overhead from extra MCP context could shrink significantly.
So while MCP isn’t free today, the economics may look very different in the future.
One task example: Buying a phone number with MCP
Let’s zoom in on the first task – purchasing a phone number – to see MCP’s impact up close. This task was straightforward, and both the MCP and non-MCP agents succeeded every time. Yet their behavior wasn’t identical:
- With MCP, the phone purchase task finished ~21.6% faster (about 17 seconds saved).
- The agent made 25% fewer API calls thanks to MCP guiding it directly to the right Twilio API calls.
- It needed slightly fewer clarifications from the user.
- However, it ended up using 12.7% more base tokens overall than the non-MCP approach.
- Average cache reads increased by 7.06%, and cache writes surged by 49%—the biggest driver of the added token cost.
- As a result, the cost for that task was ~23.5% higher with MCP than without.
- Both approaches had a 100% success rate.


In this case, MCP didn’t need to fix any failures (the non-MCP agent already did fine), but it made the process more direct. The trade-off was extra token usage loading Twilio’s API context, which bumped up the cost. If you’re thinking “Is saving ~17 seconds worth spending 23% more?”, you’re asking the same question we did. The answer probably depends on how critical that time save is for your use case (and whether those tokens cost a lot in absolute terms).
On the other hand, one of our other tasks (creating a TaskRouter activity) told a different story: the non-MCP agent failed a couple of times because it wasn’t sure how to use that API, while the MCP agent succeeded every time. In a scenario like that, MCP’s reliability boost might justify the cost.
Generalist vs. Specialist: When does MCP help most?
It’s important to note what kind of knowledge MCP is adding.
Twilio’s APIs are public and well-documented. Many large language models are already trained on that documentation. So a general-purpose Twilio MCP server might not blow you away with huge improvements, because the underlying LLM mostly already knows how to work with Twilio (at least at a basic level). That could explain why our gains in speed and API calls, while solid, were not mind-boggling.
Now, imagine a more niche or proprietary scenario – say an internal company API or a particular workflow that an AI wouldn’t know about off-hand. This is where MCP could shine. By spinning up an MCP server with just the tools and context for your specific use case, you’re giving the AI agent a tailored toolbox it never had before. It can discover and use functions it didn’t originally learn, without you hard-coding every step. In cases like that, MCP might save the day (and the AI agent from a lot of trial-and-error).
In short: general APIs that the AI agent has seen in its training (like Twilio’s) likely yield smaller MCP benefits, whereas niche APIs or multi-step workflows could see bigger wins. For example, a custom “Customer Service MCP” that combines Twilio’s messaging and TaskRouter APIs with your company’s ticket system might greatly streamline an AI agent’s ability to solve support tickets.
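As a thought experiment, here’s roughly what the server side of that could look like using the MCP TypeScript SDK (@modelcontextprotocol/sdk). The lookup_ticket tool and the ticket system behind it are hypothetical stand-ins for whatever internal API your agent can’t already reach.

```typescript
// Minimal sketch of a niche, purpose-built MCP server. The SDK usage
// follows @modelcontextprotocol/sdk's stdio-server pattern; the
// "lookup_ticket" tool and its backing ticket system are hypothetical.
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "customer-service-mcp", version: "0.1.0" });

server.tool(
  "lookup_ticket",
  { ticketId: z.string().describe("Internal support ticket ID") },
  async ({ ticketId }) => {
    // In a real server, call your internal ticket system here.
    const summary = `Ticket ${ticketId}: status=open, channel=SMS`; // placeholder data
    return { content: [{ type: "text", text: summary }] };
  }
);

// Cline (or any MCP client) launches this process and talks to it over stdio.
await server.connect(new StdioServerTransport());
```

The point isn’t the specific tool; it’s that the agent now discovers a capability it was never trained on, carrying only the context that capability needs.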
Tips to make the most of MCP
If you’re considering using MCP for your AI agent, here are a few tips we picked up from our testing:
- Don’t overload the context. Include only the API details and data your agent actually needs for the task at hand. Avoid dumping an entire OpenAPI spec or massive docs if the task is narrow. Extraneous info likely leads to token bloat with no benefit.
- Use tool filtering. Twilio’s MCP server lets you filter which APIs or endpoints to expose (e.g. via --services or --tags flags). Take advantage of that! If your agent only needs to send texts and buy phone numbers, it doesn’t need the whole Twilio API catalog in its head (a filtered configuration sketch follows this list).
- Monitor token usage. Keep an eye on how many tokens your agent is using with MCP vs without. If you see huge spikes from caching or context, try trimming the context or simplifying what you pass via MCP.
- Experiment with models and settings. We used one model (Anthropic’s Claude 3.7) – your mileage may vary with GPT-4 or other models. Also, different MCP server configurations (or servers for different services) could behave differently. Test small before scaling up.
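If you go the filtering route, scoping the earlier server entry down might look like the sketch below. The --services and --tags flag names come from Twilio’s MCP server; the values passed to them here are placeholders, so look up the exact service and tag identifiers in the server’s README.

```typescript
// Same Cline-style server entry as before, scoped down with the
// server's filtering flags. The --services / --tags values are
// placeholders; check the server's README for the real identifiers.
const filteredTwilioServer = {
  command: "npx",
  args: [
    "-y",
    "@twilio-alpha/mcp",              // assumed package name, as above
    "--services", "messaging",        // hypothetical service identifier
    "--tags", "IncomingPhoneNumbers", // hypothetical tag value
  ],
};
```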
Key takeaways for AI builders
- Quick wins for prototyping: If you’re just getting an AI agent off the ground, MCP can seriously speed up integration with external APIs. It’s much easier to plug in an MCP server than to write a bunch of custom API wrappers and prompts from scratch. You’ll get something working faster, which is great for proving out ideas.
- Not a silver bullet at scale: As your project grows, you might hit limits with MCP. The extra context can become overhead, and you have less direct control over what the agent is doing. Many builders may start with MCP for convenience, but later transition to custom tool integrations once they know exactly what their agent needs.
- Consider the trade-offs: Our tests show a clear pattern. MCP makes things more efficient (faster runs, fewer API calls) and more reliable, but at the cost of more tokens and money. For some, the reliability (100% success) and speed will be worth it; for others, the cost increase might be a deal-breaker. It depends on your use case and budget.
- Best for new or niche capabilities: Use MCP to give your agent abilities it wouldn’t otherwise have. MCP can feed the AI agent precisely what it needs if an API is obscure or if it requires a complex workflow. However, if the AI agent already knows how to handle something well, MCP may not add as much value.
Go and experiment!
We’ve shared what we found, but this is just the start. We encourage you to try MCP in your own projects and see what happens. Spin up an MCP server (check out Twilio’s open-source one on GitHub), run your favorite agent through its paces, and measure the difference. Share your results with us and the community – the more data points we have, the better we can all understand where and when MCP makes sense over custom tools.
We were pretty excited by how well MCP performed, even with the quirks. There’s still plenty to explore (different APIs, other models, larger sample sizes), and we’re not done learning. So give it a go. You might discover that MCP is precisely what your AI agent needed – or that you’re better off without it. Either way, we’d love to hear about it. Keep experimenting, keep measuring, and keep building!
Noah Mogil is our resident AI tinkerer and Embedded Solutions Engineer on Twilio’s Emerging Tech & Innovation team. He collaborates closely with customers and internal stakeholders to bring big ideas to life by utilizing and extending Twilio’s platform.