Anthropic Tool Search Test: 4,000 Tools, 60% Success

TL;DR: Anthropic's new Tool Search is a step in the right direction, but if you're running 4,000+ tools across multiple services, it might not be ready for prime time.

The promise

Anthropic's Tool Search promises to let Claude "access thousands of tools without consuming its context window." Music to our ears. At Arcade, we maintain thousands of agent-optimized tools across Gmail, Slack, GitHub, HubSpot, Salesforce, and dozens more platforms. If anyone was going to stress-test this feature, it was us.

So we did! Source code and full results →

The setup

We loaded 4,027 tools into Anthropic's beta and ran 25 straightforward tasks. The kind of requests your agent should nail 100% of the time on smaller tool sets:

"Send an email to my colleague about the project update."
"Post a message to the #general channel in Slack."
"Schedule a meeting for tomorrow at 2pm."

Nothing tricky. No ambiguous edge cases. Just everyday agentic workflows.

We tested both of Anthropic's built-in search modes:

# Regex-based search
search_tool = [{"type": "tool_search_tool_regex_20251119", "name": "tool_search_tool_regex"}]

# BM25-based search
search_tool = [{"type": "tool_search_tool_bm25_20251119", "name": "tool_search_tool_bm25"}]

Then we checked: did the correct tool even appear in the top-K results?
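As a rough illustration of that check, here is a minimal sketch of an exact-match "hit in top-K" score. The helper names (`hit_at_k`, `success_rate`), the value of `k`, and the second name in the Google Calendar result list are assumptions made for this example; they are not taken from the published test harness.

```python
from typing import Iterable


def hit_at_k(expected: str, returned: Iterable[str], k: int = 5) -> bool:
    """Return True if the expected tool name is among the first k results (exact match)."""
    return expected in list(returned)[:k]


def success_rate(cases: list[tuple[str, list[str]]], k: int = 5) -> float:
    """Fraction of cases whose expected tool appears in the top-k results."""
    hits = sum(hit_at_k(expected, returned, k) for expected, returned in cases)
    return hits / len(cases)


# Illustrative (expected tool, returned tools) pairs; k=5 is an assumed cutoff.
cases = [
    ("GoogleCalendar_CreateEvent", ["GoogleCalendar_CreateEvent", "GoogleCalendar_UpdateEvent"]),
    ("Youtube_SearchVideos", ["Youtube_SearchForVideos"]),  # a near miss still counts as a miss
]

rate = success_rate(cases)
print(f"Retrieval success: {rate:.0%}")
```

Scoring on exact names is deliberately strict: a similarly named tool in the results is treated as a failure, not as partial credit.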

The results

Search Mode	Avg Success Rate
Regex	56% (14/25)
BM25	64% (16/25)

To keep this as fair as possible, we measured retrieval only: whether the right tool showed up in the search results. We didn't test whether Claude would then select that tool or fill in its parameters correctly.
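For reference, the percentages in the table and the roughly 60% headline figure follow directly from the hit counts. A short sketch of the arithmetic, using only the numbers reported above (the variable names are illustrative):

```python
# Hits out of 25 tasks per search mode, as reported in the results table.
results = {"Regex": (14, 25), "BM25": (16, 25)}

rates = {mode: hits / total for mode, (hits, total) in results.items()}
average = sum(rates.values()) / len(rates)

for mode, rate in rates.items():
    print(f"{mode}: {rate:.0%}")
print(f"Average across modes: {average:.0%}")
```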

Where it worked and where it struggled

Tool search handled some requests flawlessly:

✅ GoogleCalendar_CreateEvent
✅ GoogleDocs_CreateBlankDocument
✅ Github_CreateIssue
✅ Spotify_PlayTrackByName
✅ Salesforce_CreateContact
✅ MicrosoftTeams_SendMessageToChannel

However, it struggled to retrieve some of the most common tools:

❌ Gmail_SendEmail - Couldn't find "send email" in a Gmail prompt
❌ Slack_SendMessage - Missed "post a message to Slack"
❌ Zendesk_CreateTicket - Ticket creation? Never heard of it
❌ ClickUp_CreateTask - Task creation tools exist. Just not in the results.
❌ Youtube_SearchVideos - Returned Youtube_SearchForVideos instead. Close, but no cigar.

When "send an email" can't find Gmail_SendEmail, there's still work to do.

What this means

This is certainly a move in the right direction. The architecture is sound: defer loading tools into the model’s context window to sidestep the long-standing context-bloat problem, and instead discover them just-in-time, keeping interactions with a model lightweight. And especially important to enterprises: the token savings are real.
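To make the deferred-loading idea concrete, here is a minimal sketch of how a tool list might be assembled for such a request. It assumes the beta marks deferred tools with a `defer_loading` flag on each tool definition; treat that field name, the `build_tools` helper, and the example Gmail schema as assumptions for illustration, and check Anthropic's current documentation before relying on them. The sketch only builds the payload and makes no API call.

```python
def build_tools(tool_defs: list[dict], search_type: str = "tool_search_tool_bm25_20251119") -> list[dict]:
    """Return a tools list: one search tool plus every other tool marked as deferred."""
    # The search tool's name is its type without the trailing date suffix.
    search_tool = {"type": search_type, "name": search_type.rsplit("_", 1)[0]}
    # Copy each definition so the caller's dicts are not mutated.
    deferred = [{**tool, "defer_loading": True} for tool in tool_defs]  # assumed flag name
    return [search_tool, *deferred]


# Hypothetical tool definition, shown only to illustrate the shape of the payload.
gmail_send = {
    "name": "Gmail_SendEmail",
    "description": "Send an email through Gmail.",
    "input_schema": {
        "type": "object",
        "properties": {"recipient": {"type": "string"}, "body": {"type": "string"}},
        "required": ["recipient", "body"],
    },
}

tools = build_tools([gmail_send])
print([tool["name"] for tool in tools])
```

Under these assumptions, only the search tool's definition is sent to the model up front; the deferred definitions are expanded when a search returns them, which is where the token savings come from.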

But ~60% retrieval accuracy isn't ready for prime time when you're building agents that need to reliably take real-world actions. Enterprises need to be able to trust the results of their agents. And having roughly 40% of tool searches fail before you even get to selection and parameterization doesn’t instill that trust.

We believe that Anthropic has identified a real problem, and we’re happy to see progress made in this space. Arcade is committed to delivering the MCP runtime and agent-optimized tools that help enterprises deploy agents that can take actions reliably for any model and for any number of tools. While our customers have already been able to improve the reliability of their production agents through Arcade, stay tuned for some exciting updates that will continue to push the boundaries of what’s possible.

Ready to build? Get started with Arcade →

