Lessons learned from building agents
These notes draw on my work on AgenticSeek and my current role at CNRS. A practical priority: tackle time bottlenecks early. Waiting hours for an agent run to reach step 54 and then crash wastes an enormous amount of time; design the system so it avoids failures, recovers from them, and can replay past runs.
- Implement a solid provider first. Add robust error handling, traceability, and recovery. LLM providers fail for many reasons (quota limits, server errors, auth problems); make the provider layer resilient (see the retry sketch after this list).
- Cache and mock everything. Don't wait for long runs to debug: provide mock responses and a local cache so you can reproduce runs and skip expensive steps (see the caching sketch after this list).
- Build for failure, not success. Complex systems will fail; assume it. Add monitoring, self-checks, watchdogs, and recovery flows (e.g., one agent supervising another).
- Don't overhype public MCPs. Many of the MCP (Model Context Protocol) servers published on GitHub are immature; a small custom control loop tailored to your needs is often faster to implement and more reliable.
- Minimize context. Compress history, split work across multiple agents, and remove unnecessary tokens (e.g., verbose internal thinking traces). A bigger context often degrades practical performance (see the history-trimming sketch after this list).
- Keep the goal visible. Remind the agent of its objective or run a supervisor agent that can intervene when it drifts.
- Circuit breakers for recursion. Agents love subtle infinite loops: add hard step limits and pattern detection (e.g., repeated reformulations of the same question), and break or escalate when a loop is detected (see the loop-breaker sketch after this list).
- Minimize domain-specific tools. The more tools you expose, the more confused the agent gets about their ordering and usage. Prefer fewer, more capable tools that return richer, well-structured results.
- Limit tool output and encourage step-by-step use. Tools should never return massive payloads (cap outputs, e.g., ≤ 8k tokens) and should be designed to nudge the model toward incremental, stepwise reasoning (see the output-cap sketch after this list).
- Use the best model first, then scale down. Prototype with the strongest model to validate behavior, then test whether cheaper or local models can reproduce it for cost-effective deployment and heavy testing.
- OpenAI models are crap for agents. In my personal experience, OpenAI models are terrible at following instructions properly. They often fail to respect a simple format and frequently refuse, leading to planner failures, tool-use failures, and ultimately production failures.
- Use cheap/local models for heavy testing. Reduce cost by running bulk tests on local or low-cost models to iterate quickly before moving to higher-cost inference.
- Keep a human in the loop. AGI is not here — humans should review high-impact decisions and intervene when agents show brittle or risky behavior.
- Provenance tracking. Record not just decisions but also prompt versions, the model used, temperature, and other parameters so you can reproduce and audit agent behavior (see the provenance sketch after this list).
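
The sketches below illustrate a few of the points above. They are minimal Python sketches using assumed names and signatures, not AgenticSeek's actual code.

First, the resilient provider layer: wrap whatever SDK call you use (here an assumed `call_model` callable) with retries and exponential backoff so transient quota and server errors don't kill a run.

```python
import logging
import random
import time

logger = logging.getLogger("provider")


class ProviderError(Exception):
    """Stand-in for anything the backend can throw (quota, server, auth...)."""


def call_with_retries(call_model, prompt, max_attempts=5, base_delay=1.0):
    """Call an LLM provider with exponential backoff and jitter.

    `call_model` is any callable taking a prompt string and returning text;
    it stands in for whatever SDK you actually use.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return call_model(prompt)
        except ProviderError as exc:
            if attempt == max_attempts:
                logger.error("provider failed after %d attempts: %s", attempt, exc)
                raise
            # Exponential backoff with a little jitter to avoid thundering herds.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5)
            logger.warning("attempt %d failed (%s); retrying in %.1fs", attempt, exc, delay)
            time.sleep(delay)
```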
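
For caching and mocking, a thin wrapper keyed on a hash of the request lets you replay earlier runs and debug offline without hitting the provider. The class and file layout here are assumptions, not an existing API.

```python
import hashlib
import json
from pathlib import Path


class CachingProvider:
    """Wraps a real provider; replays cached responses so debugging a step-54
    crash does not mean re-running the first 53 LLM calls."""

    def __init__(self, call_model, cache_dir="llm_cache", mock_only=False):
        self.call_model = call_model      # real provider callable (assumed: prompt -> text)
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(exist_ok=True)
        self.mock_only = mock_only        # if True, never hit the real provider

    def _key(self, model, prompt):
        return hashlib.sha256(f"{model}\n{prompt}".encode()).hexdigest()

    def complete(self, model, prompt):
        path = self.cache_dir / f"{self._key(model, prompt)}.json"
        if path.exists():
            return json.loads(path.read_text())["response"]
        if self.mock_only:
            return "[MOCK RESPONSE]"      # deterministic stand-in for offline tests
        response = self.call_model(prompt)
        path.write_text(json.dumps({"model": model, "prompt": prompt, "response": response}))
        return response
```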
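
For context minimization, one simple tactic is trimming conversation history to a rough token budget while always keeping the system message. The message format and the 4-characters-per-token heuristic are assumptions; summarizing old turns is a reasonable alternative to dropping them.

```python
def trim_history(messages, max_tokens=4000, chars_per_token=4):
    """Keep the system message plus the most recent messages that fit a rough
    token budget; older turns are simply dropped.

    `messages` is assumed to be a list of {"role": ..., "content": ...} dicts.
    """
    budget = max_tokens * chars_per_token
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    kept, used = [], sum(len(m["content"]) for m in system)
    for msg in reversed(rest):            # walk from newest to oldest
        cost = len(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return system + list(reversed(kept))  # restore chronological order
```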
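
For circuit breakers, a small guard that enforces a hard step budget and flags near-identical repeated actions is often enough to catch reformulation loops. The thresholds below are arbitrary defaults.

```python
from collections import deque


class LoopBreaker:
    """Trips when the agent exceeds a hard step budget or keeps emitting
    near-identical actions, a common symptom of a reformulation loop."""

    def __init__(self, max_steps=50, window=6, repeat_threshold=3):
        self.max_steps = max_steps
        self.window = deque(maxlen=window)   # recent normalized actions
        self.repeat_threshold = repeat_threshold
        self.steps = 0

    def check(self, action_text):
        """Call once per agent step; raises RuntimeError to break or escalate."""
        self.steps += 1
        if self.steps > self.max_steps:
            raise RuntimeError(f"step budget exceeded ({self.max_steps})")
        normalized = " ".join(action_text.lower().split())
        self.window.append(normalized)
        if self.window.count(normalized) >= self.repeat_threshold:
            raise RuntimeError("repeated action detected, likely an infinite loop")
```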
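
For capping tool output, a truncation wrapper keeps payloads under the token budget and tells the model what was dropped, nudging it to issue narrower follow-up queries. The character-based token estimate is a rough heuristic, not an exact tokenizer.

```python
def cap_tool_output(text, max_tokens=8000, chars_per_token=4):
    """Truncate a tool result to roughly `max_tokens` (the <= 8k-token cap above).

    Token count is approximated as ~4 characters per token; swap in your
    model's tokenizer if you need precision.
    """
    max_chars = max_tokens * chars_per_token
    if len(text) <= max_chars:
        return text
    dropped = len(text) - max_chars
    return text[:max_chars] + (
        f"\n[... output truncated, {dropped} characters dropped; "
        "ask for a narrower query ...]"
    )
```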
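
For provenance tracking, one record per LLM call written to an append-only JSONL file captures the prompt version, model, temperature, and parameters needed to reproduce and audit a run. Field names here are illustrative.

```python
import dataclasses
import json
import time
from dataclasses import dataclass


@dataclass
class CallRecord:
    """One provenance entry per LLM call: enough to replay and audit it later."""
    prompt_version: str       # e.g., a git hash or template version of the prompt
    model: str
    temperature: float
    prompt: str
    response: str
    timestamp: float = dataclasses.field(default_factory=time.time)


def log_call(record, path="provenance.jsonl"):
    """Append the record as one JSON line; JSONL keeps the log greppable and replayable."""
    with open(path, "a") as fh:
        fh.write(json.dumps(dataclasses.asdict(record)) + "\n")
```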
Short, pragmatic guidance: protect your time first, then your system. Build tooling (mocks, caches, and recovery) up-front — it pays back many hours during development and debugging.