We Threw 4,000 Tools at Anthropic's New Tool Search. Here's What Happened.

Eric Gustin
DECEMBER 3, 2025
2 MIN READ
THOUGHT LEADERSHIP

TL;DR: Anthropic's new Tool Search is a step in the right direction, but if you're running 4,000+ tools across multiple services, it might not be ready for prime time.


The promise

Anthropic's Tool Search promises to let Claude "access thousands of tools without consuming its context window." Music to our ears. At Arcade, we maintain thousands of agent-optimized tools across Gmail, Slack, GitHub, HubSpot, Salesforce, and dozens more platforms. If anyone was going to stress-test this feature, it was us.

So we did! Source code and full results →

The setup

We loaded 4,027 tools into Anthropic's beta and ran 25 straightforward tasks. The kind of requests your agent should nail 100% of the time on smaller tool sets:

  • "Send an email to my colleague about the project update."
  • "Post a message to the #general channel in Slack."
  • "Schedule a meeting for tomorrow at 2pm."

Nothing tricky. No ambiguous edge cases. Just everyday agentic workflows.

We tested both of Anthropic's built-in search modes:

# Regex-based search
search_tool = [{"type": "tool_search_tool_regex_20251119", "name": "tool_search_tool_regex"}]

# BM25-based search
search_tool = [{"type": "tool_search_tool_bm25_20251119", "name": "tool_search_tool_bm25"}]
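As a rough sketch of how either search tool might sit alongside a large catalog in a request, the helper below builds a `tools` list: one search tool followed by every catalog tool marked as deferred. The `defer_loading` flag is an assumption based on Anthropic's beta documentation at the time of writing and should be verified against the current docs. `build_tools` and the sample catalog entry are hypothetical, used only for illustration.

```python
# The two built-in search tool definitions named in this post.
SEARCH_TOOLS = {
    "regex": {"type": "tool_search_tool_regex_20251119", "name": "tool_search_tool_regex"},
    "bm25": {"type": "tool_search_tool_bm25_20251119", "name": "tool_search_tool_bm25"},
}


def build_tools(search_mode: str, catalog: list[dict]) -> list[dict]:
    """Return one search tool plus every catalog tool flagged as deferred.

    Assumption: `defer_loading: True` is the flag that keeps a tool's full
    definition out of the context window until the search tool surfaces it.
    """
    search_tool = SEARCH_TOOLS[search_mode]
    deferred = [{**tool, "defer_loading": True} for tool in catalog]
    return [search_tool, *deferred]


# Hypothetical one-tool catalog; a real run would pass thousands of entries.
example_catalog = [
    {
        "name": "Gmail_SendEmail",
        "description": "Send an email from the user's Gmail account.",
        "input_schema": {"type": "object", "properties": {}},
    }
]

tools = build_tools("bm25", example_catalog)
print([tool["name"] for tool in tools])
```

The resulting list would be passed as the `tools` argument of a request. The original catalog entries are copied rather than mutated, so the same catalog can be reused for both search modes.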

Then we checked: did the correct tool even appear in the top-K results?

The results

Search Mode    Avg Success Rate
Regex          56% (14/25)
BM25           64% (16/25)

To keep this as fair as possible, we measured retrieval only: whether the right tool showed up in the search results. We didn't test whether Claude would then select that tool or fill in its parameters correctly.
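A retrieval-only check of this kind can be sketched as a hit-at-K test: a task counts as a success if the expected tool name appears anywhere in the first K search results. The function names, the value of K, and the sample result lists below (including `GoogleCalendar_ListEvents`) are illustrative assumptions, not the benchmark's actual code or data.

```python
def hit_at_k(expected: str, results: list[str], k: int) -> bool:
    """True if the expected tool name is among the first k search results."""
    return expected in results[:k]


def success_rate(cases: list[tuple[str, list[str]]], k: int) -> float:
    """Fraction of (expected_tool, search_results) cases that are hits at k."""
    if not cases:
        raise ValueError("need at least one case")
    hits = sum(hit_at_k(expected, results, k) for expected, results in cases)
    return hits / len(cases)


# Hypothetical search outputs for two tasks: one exact hit, one near miss.
sample_cases = [
    ("GoogleCalendar_CreateEvent", ["GoogleCalendar_CreateEvent", "GoogleCalendar_ListEvents"]),
    ("Youtube_SearchVideos", ["Youtube_SearchForVideos"]),
]

rate = success_rate(sample_cases, k=5)
print(f"{rate:.0%}")
```

The reported figures follow the same arithmetic: 14/25 = 56% for regex and 16/25 = 64% for BM25.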

Where it worked and where it struggled

Tool search handled some requests flawlessly:

  • ✅ GoogleCalendar_CreateEvent
  • ✅ GoogleDocs_CreateBlankDocument
  • ✅ Github_CreateIssue
  • ✅ Spotify_PlayTrackByName
  • ✅ Salesforce_CreateContact
  • ✅ MicrosoftTeams_SendMessageToChannel

However, it struggled to retrieve some of the most common tools:

  • ❌ Gmail_SendEmail - Couldn't find "send email" in a Gmail prompt
  • ❌ Slack_SendMessage - Missed "post a message to Slack"
  • ❌ Zendesk_CreateTicket - Ticket creation? Never heard of it
  • ❌ ClickUp_CreateTask - Task creation tools exist. Just not in the results.
  • ❌ Youtube_SearchVideos - Returned Youtube_SearchForVideos instead. Close, but no cigar.

When "send an email" can't find Gmail_SendEmail, there's still work to do.
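When reviewing failures like these, it can help to separate near misses (a similarly named tool came back) from outright misses (nothing relevant came back). The sketch below does that with Python's standard `difflib`. It is an analysis aid only: the benchmark above counted a near miss as a failure, and the `classify` helper and its 0.8 similarity threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def classify(expected: str, results: list[str], threshold: float = 0.8) -> str:
    """Label a retrieval outcome as 'hit', 'near_miss', or 'miss'.

    A near miss means no exact match was returned, but at least one result's
    name has a similarity ratio to the expected name at or above `threshold`.
    """
    if expected in results:
        return "hit"
    best = max(
        (SequenceMatcher(None, expected, name).ratio() for name in results),
        default=0.0,
    )
    return "near_miss" if best >= threshold else "miss"


print(classify("Youtube_SearchVideos", ["Youtube_SearchForVideos"]))
print(classify("Gmail_SendEmail", ["GoogleCalendar_CreateEvent"]))
```

A high share of near misses would point toward naming or description overlap in the catalog, while outright misses point toward the search itself.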

What this means

This is certainly a move in the right direction. The architecture is sound: rather than loading every tool into the model's context window up front, tools are deferred and discovered just in time, which sidesteps the long-standing context-bloat problem and keeps each interaction with the model lightweight. And, especially important to enterprises, the token savings are real.
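To make the deferred-loading idea concrete, here is a toy sketch: the full catalog lives outside the prompt, a cheap search picks a handful of candidates, and only those definitions are loaded. The naive word-overlap scoring is a stand-in chosen for brevity, not how Anthropic's regex or BM25 search works, and every name here (`Tool`, `search`, `load`) is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Tool:
    name: str
    description: str
    input_schema: dict


def search(catalog: dict[str, Tool], query: str, k: int = 3) -> list[str]:
    """Return up to k tool names, ranked by how many query words occur
    as substrings of the tool's lowercased name and description."""
    words = set(query.lower().split())
    scored = []
    for tool in catalog.values():
        text = f"{tool.name} {tool.description}".lower()
        score = sum(1 for word in words if word in text)
        if score:
            scored.append((score, tool.name))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


def load(catalog: dict[str, Tool], names: list[str]) -> list[Tool]:
    """Fetch full definitions only for the tools the search surfaced."""
    return [catalog[name] for name in names]


catalog = {
    tool.name: tool
    for tool in [
        Tool("Gmail_SendEmail", "Send an email via Gmail", {"type": "object"}),
        Tool("Slack_SendMessage", "Post a message to a Slack channel", {"type": "object"}),
    ]
}

names = search(catalog, "send an email", k=1)
loaded = load(catalog, names)
print(names)
```

Only the tools in `loaded` would need to enter the context window. The rest of the catalog stays out of the prompt, which is where the token savings come from.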

But ~60% retrieval accuracy isn't ready for prime time when you're building agents that need to reliably take real-world actions. Enterprises need to be able to trust the results of their agents. And having roughly 4 in 10 tool searches fail before you even get to selection and parameterization doesn't instill that trust.

We believe Anthropic has identified a real problem, and we're happy to see progress in this space. Arcade is committed to delivering the MCP runtime and agent-optimized tools that help enterprises deploy agents that can take actions reliably, for any model and any number of tools. Our customers have already been able to improve the reliability of their production agents through Arcade, so stay tuned for some exciting updates that will continue to push the boundaries of what's possible.


Ready to build? Get started with Arcade →
