LLM-as-Judge is Evolving. Meet Agent-as-Judge.

LLM-as-Judge was a breakthrough: using LLMs to evaluate LLM outputs at scale. But as we move from simple chatbots to complex agents, the paradigm is breaking. Agents do multi-step reasoning, execute tools, and make decisions over time. Evaluating just the final output isn’t enough anymore. You cannot judge a journey by looking only at the destination. We need judges that can investigate the process, verify claims, and adapt their criteria. ...

February 10, 2026 · 6 min · Dmytro Kovalchuk

How Do You Actually Evaluate an AI Research Agent?

We’re building expert AI research agents at Grep.ai — think due diligence reports, business research, compliance checks. The kind of work where you need depth, accuracy, and real sources. Getting the agent to run was the easy part. Making sure it’s actually good? That’s where things get interesting. The problem with “it works”: our initial eval setup was basic. Does the agent complete the task? Does it return a report? Does it cite sources? Check, check, check. ...

January 14, 2025 · 4 min · Dmytro Kovalchuk