Anthropic 𝗷𝘂𝘀𝘁 𝗿𝗲𝗹𝗲𝗮𝘀𝗲𝗱 𝗮 𝗕𝗶𝗯𝗹𝗲 𝗳𝗼𝗿 𝗔𝗜 𝗔𝗴𝗲𝗻𝘁 𝗗𝗲𝘃𝗲𝗹𝗼𝗽𝗲𝗿𝘀

It walks through how they built their multi-agent research system:

1. 𝗕𝗲𝘆𝗼𝗻𝗱 𝗦𝗶𝗻𝗴𝗹𝗲 𝗔𝗴𝗲𝗻𝘁𝘀

➜ Single agents hit context and time walls on complex research tasks; bigger models aren't the solution, orchestrated coordination is

➜ When one Claude model stalls on a broad query like "list every board member across the S&P 500 IT companies," parallel subagents with separate context windows can explore different angles simultaneously (rough sketch below)
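
Roughly, the fan-out looks like this. This is a toy Python sketch, not Anthropic's code: run_subagent is a hypothetical stand-in for a real model/tool call, and the alphabetical query split is made up.

```python
import asyncio

async def run_subagent(objective: str) -> str:
    # Hypothetical stand-in for a real model/tool call. Each subagent starts
    # from a fresh message history, i.e. its own context window.
    context = [{"role": "user", "content": objective}]
    await asyncio.sleep(0.1)  # placeholder for actual search + reasoning work
    return f"findings for: {objective} ({len(context)} message(s) of private context)"

async def research(query: str) -> list[str]:
    # Split one broad query into narrower angles and explore them in parallel.
    angles = [
        f"{query} -- companies A-H",
        f"{query} -- companies I-Q",
        f"{query} -- companies R-Z",
    ]
    return await asyncio.gather(*(run_subagent(a) for a in angles))

if __name__ == "__main__":
    print(asyncio.run(research("board members of S&P 500 IT companies")))
```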

2. 𝗧𝗼𝗸𝗲𝗻 𝗘𝗰𝗼𝗻𝗼𝗺𝗶𝗰𝘀

➜ Multi-agent systems consume ~15× more tokens than standard chat, but that spend is what drives results: token usage alone explained about 80% of performance variance on BrowseComp evaluations

➜ The trade-off is clear: spend more tokens to unlock capabilities that single agents fundamentally cannot achieve (back-of-envelope math below)
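
To make the trade-off concrete, here's some back-of-envelope math. The token counts and price are my own illustrative assumptions, not figures from the post; only the ~15× multiplier is theirs.

```python
# Illustrative cost comparison; every number except the ~15x multiplier is an assumption.
chat_tokens = 20_000                   # assumed tokens for a typical single-agent session
multi_agent_tokens = chat_tokens * 15  # ~15x multiplier reported for multi-agent runs
price_per_million = 5.00               # assumed blended $/million tokens

chat_cost = chat_tokens / 1_000_000 * price_per_million
multi_cost = multi_agent_tokens / 1_000_000 * price_per_million

print(f"single agent: {chat_tokens:,} tokens  ~${chat_cost:.2f}")
print(f"multi-agent : {multi_agent_tokens:,} tokens  ~${multi_cost:.2f}")
```

Which is why the pattern makes sense mainly for high-value research tasks, not everyday chat.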

3. 𝗢𝗿𝗰𝗵𝗲𝘀𝘁𝗿𝗮𝘁𝗼𝗿 𝗣𝗮𝘁𝘁𝗲𝗿𝗻

➜ A lead agent analyzes queries, develops strategy, and spawns specialized subagents with focused objectives

➜ Each subagent operates independently with its own tools and context window, then returns distilled insights, preventing the chaos of unstructured multi-agent interactions (minimal sketch below)
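
A minimal sketch of that loop, assuming the anthropic Python SDK and a placeholder model id. This is my simplified reading of the pattern, not Anthropic's actual implementation, and the subagents run sequentially here for brevity.

```python
import anthropic

client = anthropic.Anthropic()        # assumes ANTHROPIC_API_KEY is set
MODEL = "claude-sonnet-4-20250514"    # placeholder model id

def ask(system: str, prompt: str) -> str:
    # One stateless call: each agent keeps its own context window.
    reply = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        system=system,
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.content[0].text

def research(query: str) -> str:
    # 1. Lead agent analyzes the query and plans focused sub-tasks.
    plan = ask(
        "You are a lead research agent. Break the query into 2-4 focused "
        "sub-tasks, one per line, each with a clear objective and output format.",
        query,
    )
    sub_tasks = [line for line in plan.splitlines() if line.strip()]

    # 2. Each subagent works independently and returns only distilled findings.
    findings = [
        ask("You are a research subagent. Return only key findings as bullets.", task)
        for task in sub_tasks
    ]

    # 3. Lead agent synthesizes the compressed findings into one answer.
    return ask(
        "Synthesize these subagent findings into a single coherent answer.",
        f"Query: {query}\n\nFindings:\n\n" + "\n\n".join(findings),
    )
```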

4. 𝗜𝗻𝘁𝗲𝗹𝗹𝗶𝗴𝗲𝗻𝘁 𝗖𝗼𝗺𝗽𝗿𝗲𝘀𝘀𝗶𝗼𝗻

➜ Traditional RAG uses static chunk retrieval, while multi-agent research uses adaptive, multi-step search

➜ Subagents act as intelligent filters, exploring different aspects in parallel and compressing vast amounts of information into key insights for the lead agent (contrast sketched below)
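
Here's that contrast as a toy sketch. search and summarize are hypothetical stand-ins for a real search tool and a real LLM call.

```python
def search(query: str) -> list[str]:
    # Hypothetical search tool: returns document snippets for a query.
    return [f"document about '{query}' #{i}" for i in range(3)]

def summarize(objective: str, docs: list[str]) -> str:
    # Hypothetical LLM call: compresses many documents into a few insights.
    return f"key insights on '{objective}', distilled from {len(docs)} documents"

def static_rag(query: str) -> list[str]:
    # One-shot retrieval: whichever chunks match the original query, verbatim.
    return search(query)

def agentic_research(objective: str, max_steps: int = 3) -> str:
    # Adaptive search: refine the query as understanding improves, then
    # return only the compressed insights, never the raw documents.
    docs, query = [], objective
    for step in range(max_steps):
        docs += search(query)
        query = f"{objective} (refined after step {step + 1})"  # stand-in for LLM-driven refinement
    return summarize(objective, docs)

print(static_rag("acme corp board"))        # raw chunks go straight into context
print(agentic_research("acme corp board"))  # only distilled insights reach the lead agent
```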

5. 𝗖𝗼𝗼𝗿𝗱𝗶𝗻𝗮𝘁𝗶𝗼𝗻 𝗖𝗼𝗺𝗽𝗹𝗲𝘅𝗶𝘁𝘆

➜ Coordination complexity grows rapidly, with early agents making errors like spawning 50 subagents for simple queries or scouring endlessly for nonexistent sources

➜ Success requires embedding clear heuristics: delegate with explicit objectives and output formats, scale effort to complexity (1 subagent for simple facts, 10+ for open-ended research), and start broad before narrowing based on findings (heuristics sketched below)
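
Those heuristics can be made explicit in code. A rough sketch: the numbers and the SubagentTask fields are my assumptions about what an explicit delegation spec might contain, loosely following the "1 agent for facts, 10+ for open research" guidance.

```python
from dataclasses import dataclass

@dataclass
class SubagentTask:
    objective: str       # what exactly to find out
    output_format: str   # e.g. "bullet list, one source URL per claim"
    tool_budget: int     # rough cap on tool calls for this subagent

def plan_effort(query_type: str) -> int:
    # Scale subagent count with query complexity instead of always fanning out wide.
    return {"simple_fact": 1, "comparison": 3, "open_research": 10}.get(query_type, 3)

def delegate(query: str, query_type: str) -> list[SubagentTask]:
    n = plan_effort(query_type)
    return [
        SubagentTask(
            objective=f"Angle {i + 1} of {n} for: {query}",  # placeholder decomposition
            output_format="bulleted key findings with sources",
            tool_budget=5,
        )
        for i in range(n)
    ]

print(len(delegate("What year was Anthropic founded?", "simple_fact")))     # -> 1
print(len(delegate("Map the agent-framework landscape", "open_research")))  # -> 10
```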

6. 𝗣𝗿𝗼𝗱𝘂𝗰𝘁𝗶𝗼𝗻 𝗥𝗲𝗹𝗶𝗮𝗯𝗶𝗹𝗶𝘁𝘆

➜ Agents are stateful and non-deterministic; small changes cascade into large behavioral shifts that can't be fixed with simple restarts

➜ Solutions include checkpointing progress, full execution tracing, rainbow deployments for live updates, and robust retry logic (checkpoint/retry sketch below)
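
A minimal checkpoint-and-retry sketch, as my own illustration: do_task is a hypothetical stand-in for one unit of agent work, and the JSON file is just the simplest possible state store.

```python
import json
import time
from pathlib import Path

CHECKPOINT = Path("research_state.json")  # hypothetical checkpoint location

def do_task(task: str) -> str:
    # Hypothetical stand-in for one subagent step (an LLM or tool call).
    return f"completed: {task}"

def run_with_retries(task: str, attempts: int = 3) -> str:
    # Retry the failing step with backoff instead of restarting the whole run.
    for attempt in range(attempts):
        try:
            return do_task(task)
        except Exception:
            time.sleep(2 ** attempt)
    raise RuntimeError(f"task failed after {attempts} attempts: {task}")

def resume_run(tasks: list[str]) -> dict:
    state = json.loads(CHECKPOINT.read_text()) if CHECKPOINT.exists() else {"done": []}
    for task in tasks:
        if task in state["done"]:
            continue                              # resume where the previous run stopped
        run_with_retries(task)
        state["done"].append(task)
        CHECKPOINT.write_text(json.dumps(state))  # checkpoint after each completed step
    return state
```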

7. 𝗢𝘂𝘁𝗰𝗼𝗺𝗲 𝗘𝘃𝗮𝗹𝘂𝗮𝘁𝗶𝗼𝗻

➜ Unlike traditional software with fixed paths, agents take different valid routes to the same goal, making standard evaluation methods inadequate

➜ Effective evaluation uses LLM judges for scalable assessment plus human reviewers for edge cases, focusing on whether agents achieved the right outcomes through reasonable processes (judge sketch below)
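
A sketch of an LLM-as-judge pass, assuming the anthropic Python SDK, a placeholder model id, and a rubric I made up for illustration; real criteria and score scales would come from your own eval design.

```python
import json
import anthropic

client = anthropic.Anthropic()        # assumes ANTHROPIC_API_KEY is set
MODEL = "claude-sonnet-4-20250514"    # placeholder model id

# Illustrative criteria: grade the outcome and the process, not the exact path taken.
RUBRIC = ["factual_accuracy", "completeness", "source_quality", "process_reasonableness"]

def judge(query: str, report: str) -> dict:
    prompt = (
        f"Query:\n{query}\n\nAgent report:\n{report}\n\n"
        f"Score each criterion from 0.0 to 1.0 and return only a JSON object "
        f"with these keys: {', '.join(RUBRIC)}."
    )
    reply = client.messages.create(
        model=MODEL,
        max_tokens=500,
        system="You are a strict research evaluator. Respond with JSON only.",
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(reply.content[0].text)  # brittle if the model adds prose; fine for a sketch
```

Borderline or surprising scores then go to a human reviewer rather than being trusted blindly.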

8. 𝗔𝘀𝘆𝗻𝗰𝗵𝗿𝗼𝗻𝗼𝘂𝘀 𝗙𝘂𝘁𝘂𝗿𝗲

➜ Current systems run subagents synchronously: the lead waits for all to complete, which simplifies coordination but limits parallelism

➜ Asynchronous execution, where agents spawn new agents mid-task, promises major performance gains but introduces challenges in state consistency and result merging (toy example below)
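
A toy asyncio version of that idea, where a subagent can spawn further subagents mid-task and the lead acts on whichever finishes first. Everything here is a hypothetical sketch, and it deliberately glosses over exactly the state-consistency and result-merging problems the post flags.

```python
import asyncio
import random

async def subagent(objective: str, spawn, depth: int = 0) -> str:
    # Hypothetical subagent: may decide mid-task that it needs more help.
    await asyncio.sleep(random.random())                 # stand-in for search/LLM work
    if depth == 0:
        spawn(f"follow-up on: {objective}", depth + 1)   # spawn a new agent mid-task
    return f"findings for: {objective}"

async def lead(query: str) -> list[str]:
    pending: set[asyncio.Task] = set()

    def spawn(objective: str, depth: int = 0) -> None:
        pending.add(asyncio.create_task(subagent(objective, spawn, depth)))

    spawn(f"broad survey of: {query}")
    spawn(f"background on: {query}")

    findings = []
    while pending:
        # Act on whichever subagent finishes first instead of waiting for all of them.
        done, _ = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
        for task in done:
            pending.discard(task)
            findings.append(task.result())
    return findings

print(asyncio.run(lead("S&P 500 IT board members")))
```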

P.S. Check out my profile for more resources on AI Agents 👋


This post was originally shared on LinkedIn.