We Threw 4,000 Tools at Anthropic's New Tool Search. Here's What Happened.

Eric Gustin
DECEMBER 3, 2025
2 MIN READ
THOUGHT LEADERSHIP

TL;DR: Anthropic's new Tool Search is a step in the right direction, but if you're running 4,000+ tools across multiple services, it might not be ready for prime time.


The promise

Anthropic's Tool Search promises to let Claude "access thousands of tools without consuming its context window." Music to our ears. At Arcade, we maintain thousands of agent-optimized tools across Gmail, Slack, GitHub, HubSpot, Salesforce, and dozens more platforms. If anyone was going to stress-test this feature, it was us.

So we did! Source code and full results →

The setup

We loaded 4,027 tools into Anthropic's beta and ran 25 straightforward tasks. The kind of requests your agent should nail 100% of the time on smaller tool sets:

  • "Send an email to my colleague about the project update."
  • "Post a message to the #general channel in Slack."
  • "Schedule a meeting for tomorrow at 2pm."

Nothing tricky. No ambiguous edge cases. Just everyday agentic workflows.

We tested both of Anthropic's built-in search modes:

# Regex-based search
search_tool = [{"type": "tool_search_tool_regex_20251119", "name": "tool_search_tool_regex"}]

# BM25-based search
search_tool = [{"type": "tool_search_tool_bm25_20251119", "name": "tool_search_tool_bm25"}]
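A request using one of these search tools could be assembled roughly as follows. This is a minimal sketch, not the harness used for this benchmark. It assumes that tools are kept out of the context window by setting a `defer_loading` flag on each definition, which should be verified against Anthropic's current documentation. `build_tools` and the sample Gmail tool schema are hypothetical and exist only for illustration.

```python
from typing import Any

# Search tool definitions, as listed above.
REGEX_SEARCH = {"type": "tool_search_tool_regex_20251119", "name": "tool_search_tool_regex"}
BM25_SEARCH = {"type": "tool_search_tool_bm25_20251119", "name": "tool_search_tool_bm25"}


def build_tools(search_tool: dict[str, Any], catalog: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return the search tool followed by every catalog tool marked as deferred.

    Assumption: `defer_loading: True` is the flag that keeps a tool out of the
    context window until the search tool surfaces it.
    """
    # Copy each definition so the original catalog is left unmodified.
    deferred = [{**tool, "defer_loading": True} for tool in catalog]
    return [search_tool, *deferred]


# Hypothetical tool definition, for illustration only.
catalog = [
    {
        "name": "Gmail_SendEmail",
        "description": "Send an email from the user's Gmail account.",
        "input_schema": {
            "type": "object",
            "properties": {"recipient": {"type": "string"}, "body": {"type": "string"}},
            "required": ["recipient", "body"],
        },
    }
]

tools = build_tools(BM25_SEARCH, catalog)
```

The resulting list would be passed as the `tools` argument of a Messages API call with the tool search beta enabled. Swapping `BM25_SEARCH` for `REGEX_SEARCH` switches the search mode without touching the catalog.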

Then we checked: did the correct tool even appear in the top-K results?

The results

Search Mode    Avg Success Rate
Regex          56% (14/25)
BM25           64% (16/25)

To keep this as fair as possible, we tested only the retrieval success rate: whether the right tool showed up in the search results. We didn't test whether Claude would then select that tool or fill in its parameters correctly.
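The retrieval metric described above can be sketched as a top-K hit check averaged over tasks. This is an illustrative reconstruction under assumptions, not the benchmark's actual code: the `Task` type, the function names, the value of `k`, and the sample search results (including the `Placeholder_*` tool names) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    prompt: str
    expected_tool: str


def is_hit(expected_tool: str, retrieved: list[str], k: int) -> bool:
    """True if the expected tool name is among the first k retrieved names."""
    return expected_tool in retrieved[:k]


def success_rate(tasks: list[Task], results: dict[str, list[str]], k: int) -> float:
    """Fraction of tasks whose expected tool appears in the top-k results."""
    hits = sum(is_hit(t.expected_tool, results.get(t.prompt, []), k) for t in tasks)
    return hits / len(tasks)


# Illustrative data only, not the benchmark's actual search output.
tasks = [
    Task("Schedule a meeting for tomorrow at 2pm.", "GoogleCalendar_CreateEvent"),
    Task("Send an email to my colleague about the project update.", "Gmail_SendEmail"),
]
results = {
    tasks[0].prompt: ["GoogleCalendar_CreateEvent", "Placeholder_OtherTool"],
    tasks[1].prompt: ["Placeholder_UnrelatedTool"],
}

rate = success_rate(tasks, results, k=5)  # one hit out of two tasks
```

Applied to 25 tasks, 14 hits works out to 14/25 = 56% and 16 hits to 16/25 = 64%, which is how the figures in the table are derived.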

Where it worked and where it struggled

Tool search handled some requests flawlessly:

  • ✅ GoogleCalendar_CreateEvent
  • ✅ GoogleDocs_CreateBlankDocument
  • ✅ Github_CreateIssue
  • ✅ Spotify_PlayTrackByName
  • ✅ Salesforce_CreateContact
  • ✅ MicrosoftTeams_SendMessageToChannel

However, it struggled to retrieve some of the most common tools:

  • ❌ Gmail_SendEmail - Couldn't find "send email" in a Gmail prompt
  • ❌ Slack_SendMessage - Missed "post a message to Slack"
  • ❌ Zendesk_CreateTicket - Ticket creation? Never heard of it
  • ❌ ClickUp_CreateTask - Task creation tools exist. Just not in the results.
  • ❌ Youtube_SearchVideos - Returned Youtube_SearchForVideos instead. Close, but no cigar.

When "send an email" can't find Gmail_SendEmail, there's still work to do.
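The YouTube case above is a near miss rather than a complete miss, and a scoring script can make that distinction visible for manual review. The sketch below is one possible way to do it, assuming tool names alone are compared. The `classify` helper and the 0.85 similarity cutoff are hypothetical choices, and the benchmark itself counted this case as a failure.

```python
import difflib


def classify(expected: str, retrieved: list[str], cutoff: float = 0.85) -> str:
    """Label a retrieval as 'hit', 'near_miss', or 'miss' by tool name."""
    if expected in retrieved:
        return "hit"
    # get_close_matches keeps candidates whose SequenceMatcher similarity
    # ratio to `expected` is at least `cutoff`.
    if difflib.get_close_matches(expected, retrieved, n=1, cutoff=cutoff):
        return "near_miss"
    return "miss"


# The expected tool was absent, but a similarly named tool was returned.
label = classify("Youtube_SearchVideos", ["Youtube_SearchForVideos"])
```

Under strict exact-name matching, the same result still scores as a failure. The extra label only flags cases where the catalog may contain overlapping tools worth inspecting by hand.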

What this means

This is certainly a move in the right direction. The architecture is sound: defer loading tools into the model’s context window to sidestep the long-standing context-bloat problem, and instead discover them just-in-time, keeping interactions with a model lightweight. And especially important to enterprises: the token savings are real. 

But ~60% retrieval accuracy isn't ready for prime time when you're building agents that need to reliably take real-world actions. Enterprises need to be able to trust the results of their agents. And having roughly four in ten tool searches fail before you even get to selection and parameterization doesn’t instill that trust.

We believe that Anthropic has identified a real problem, and we’re happy to see progress made in this space. Arcade is committed to delivering the MCP runtime and agent-optimized tools that help enterprises deploy agents that can take actions reliably for any model and for any number of tools. While our customers have already been able to improve the reliability of their production agents through Arcade, stay tuned for some exciting updates that will continue to push the boundaries of what’s possible.   


Ready to build? Get started with Arcade →
