Lessons Learned from Building Agents

These insights draw from building AgenticSeek and my current work at CNRS. A core principle: protect your time first, then your system. Agent failures at step 54 after hours of execution are expensive—design for recovery from day one.

  1. Build a resilient provider layer first. LLM providers fail for myriad reasons: rate limits, server errors, authentication expiry, transient network issues. Your provider layer must handle retries with exponential backoff, graceful degradation between models, and comprehensive logging for traceability (a retry-and-fallback sketch follows this list). Without this foundation, debugging becomes nearly impossible.
  2. Cache and mock everything. Don't wait for long-running agent executions to test changes. Implement response caching and mock interfaces so you can replay specific steps deterministically (see the caching sketch after the list). This investment pays for itself within days of active development.
  3. Design for failure, not success. Complex agent systems will fail. Assume it. Add monitoring hooks, self-checks, watchdog timers, and recovery flows (a watchdog sketch appears after the list). Consider agent-supervising-agent patterns where one agent monitors another's progress and intervenes on stagnation.
  4. Evaluate MCP integrations carefully. Many Model Context Protocol implementations in open-source ecosystems are immature. Often, a focused custom control loop tailored to your specific domain outperforms generic integrations in both reliability and latency.
  5. Minimize context aggressively. Use compression techniques, split work across multiple specialized agents, and strip unnecessary tokens (including verbose internal reasoning traces). Larger context windows often degrade practical performance due to attention dilution.
  6. Maintain goal visibility. Agents drift. Periodically remind the system of its objective, or implement a supervisor agent that can detect deviation and issue corrective prompts (a minimal reminder sketch follows the list). Explicit goal-state tracking prevents many failure modes.
  7. Implement circuit breakers for recursion. Agents are prone to subtle infinite loops: repeatedly reformulating the same question, cycling between states, or expanding without convergence. Add hard iteration limits and pattern detection (e.g., semantic similarity between consecutive outputs) with automatic escalation; a circuit-breaker sketch follows the list.
  8. Prefer fewer, richer tools. Each additional tool increases cognitive load and decision latency. Design tools that return comprehensive, well-structured outputs rather than many single-purpose utilities. Clear documentation in tool descriptions is essential.
  9. Cap tool outputs and encourage incremental use. Tools should never return unbounded payloads. Implement output limits (e.g., ≤8k tokens) and design interfaces that encourage stepwise reasoning rather than monolithic operations; a truncation sketch follows the list.
  10. Prototype with the best model, then optimize. Start with the strongest available model to validate behavior and establish a performance baseline. Once stable, systematically test if smaller or local models can achieve equivalent results for cost-effective deployment.
  11. Choose models for instruction adherence. Not all models handle structured outputs and tool use reliably. From extensive testing, certain model families consistently struggle with format adherence, leading to planner failures and production issues. Validate format compliance rigorously before deployment.
  12. Use local models for bulk testing. Reduce iteration costs by running comprehensive test suites on local or low-cost models. Reserve expensive API calls for final validation and edge cases.
  13. Keep humans in the loop for high-stakes decisions. AGI is not here. Design review checkpoints where humans validate agent decisions with significant consequences. Build clear escalation paths for uncertain scenarios.
  14. Implement comprehensive provenance tracking. Record not just decisions but the complete context: prompt versions, model identifiers, temperature settings, tool versions, and timestamps (see the provenance sketch below). Reproducibility and auditability are essential for scientific and production applications.
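
To make item 1 concrete, here is a minimal retry-and-fallback sketch. It assumes a hypothetical `call_model(model, messages)` callable that raises an exception on any provider failure; the model names and the exception handling are illustrative and should be adapted to your actual SDK.

```python
import logging
import random
import time

log = logging.getLogger("provider")

def call_with_fallback(call_model, messages, models=("primary", "fallback"),
                       max_retries=4, base_delay=1.0):
    """Try each model in order, retrying transient failures with exponential backoff."""
    for model in models:
        for attempt in range(max_retries):
            try:
                return call_model(model, messages)
            except Exception as exc:  # in practice, catch your SDK's specific error types
                delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)  # backoff + jitter
                log.warning("call to %s failed (%s); retry %d/%d in %.1fs",
                            model, exc, attempt + 1, max_retries, delay)
                time.sleep(delay)
        log.error("model %s exhausted retries; degrading to next model", model)
    raise RuntimeError("all providers failed")
```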
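
For item 2, a simple disk cache keyed on a hash of the request makes replay deterministic. The `call_model` callable and the `.llm_cache` directory are illustrative assumptions, and responses are assumed to be JSON-serializable.

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".llm_cache")  # illustrative location
CACHE_DIR.mkdir(exist_ok=True)

def cached_call(call_model, model, messages):
    """Return a stored response for a previously seen request, otherwise call and store it."""
    key = hashlib.sha256(json.dumps([model, messages], sort_keys=True).encode()).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())   # deterministic replay of an earlier step
    response = call_model(model, messages)    # assumed to return JSON-serializable data
    path.write_text(json.dumps(response))
    return response
```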
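
For item 3, a watchdog can be as small as a timer that is reset on every completed step and escalates when the agent stalls. The `on_stall` callback stands in for whatever recovery or alerting flow fits your system.

```python
import threading

class Watchdog:
    """Invoke `on_stall` if no heartbeat arrives within `timeout` seconds."""
    def __init__(self, timeout: float, on_stall):
        self.timeout = timeout
        self.on_stall = on_stall
        self._timer = None

    def heartbeat(self):
        """Call after every completed agent step to reset the stall timer."""
        if self._timer:
            self._timer.cancel()
        self._timer = threading.Timer(self.timeout, self.on_stall)
        self._timer.daemon = True
        self._timer.start()

    def stop(self):
        if self._timer:
            self._timer.cancel()
```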
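
For item 6, a lightweight way to fight drift is to re-inject the objective into the conversation every few agent turns. The role/content message format below is purely illustrative, and the sketch is deliberately minimal; a supervisor agent would replace the fixed schedule with actual deviation detection.

```python
def maybe_remind(history, objective, every_n=5):
    """Re-inject the objective as a system message every N assistant turns to counter drift."""
    assistant_turns = sum(1 for msg in history if msg["role"] == "assistant")
    if assistant_turns and assistant_turns % every_n == 0:
        history.append({"role": "system",
                        "content": f"Reminder: the original objective is: {objective}"})
    return history
```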
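
For item 7, a circuit breaker only needs two signals: a hard iteration cap and a similarity check between consecutive outputs. The sketch below uses `difflib` as a cheap stand-in for similarity; embedding-based semantic similarity is the better choice in practice.

```python
from difflib import SequenceMatcher

class CircuitBreaker:
    """Halt an agent loop on a hard iteration cap or near-identical consecutive outputs."""
    def __init__(self, max_iterations=25, similarity_threshold=0.95):
        self.max_iterations = max_iterations
        self.similarity_threshold = similarity_threshold
        self.iterations = 0
        self.last_output = None

    def should_halt(self, output: str) -> bool:
        """Return True if the loop should stop and escalate."""
        self.iterations += 1
        if self.iterations >= self.max_iterations:
            return True
        if self.last_output is not None:
            ratio = SequenceMatcher(None, self.last_output, output).ratio()
            if ratio >= self.similarity_threshold:
                return True  # the agent is repeating itself without converging
        self.last_output = output
        return False
```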
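
For item 9, output capping can start as a crude character budget derived from a rough tokens-to-characters heuristic (about four characters per token); swap in a real tokenizer if you need precision.

```python
def truncate_output(text: str, max_tokens: int = 8000, chars_per_token: int = 4) -> str:
    """Clamp a tool result to an approximate token budget and flag the truncation."""
    limit = max_tokens * chars_per_token
    if len(text) <= limit:
        return text
    return text[:limit] + "\n[output truncated: re-run with a narrower query for more]"
```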
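
For item 14, provenance can be a plain append-only JSONL log with one record per decision. The fields shown mirror the list above; extend them as your audit requirements demand.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class ProvenanceRecord:
    step: int
    prompt_version: str
    model: str
    temperature: float
    tool_versions: dict
    decision: str
    timestamp: str = ""

    def __post_init__(self):
        if not self.timestamp:
            self.timestamp = datetime.now(timezone.utc).isoformat()

def log_provenance(record: ProvenanceRecord, path="provenance.jsonl"):
    """Append one audit record per agent decision to a JSONL file."""
    with open(path, "a") as fh:
        fh.write(json.dumps(asdict(record)) + "\n")
```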

The recurring theme: invest in tooling upfront. Mocks, caches, recovery mechanisms, and monitoring may seem like overhead early in a project, but they compound in value as complexity grows. Agent systems are inherently difficult to debug—design for observability from the start.