Agent-as-Judge

LLM-as-Judge was a breakthrough: it let us use LLMs to evaluate LLM outputs at scale. But as we move from simple chatbots to complex agents, the paradigm is breaking down. Agents reason over multiple steps, execute tools, and make decisions over time, so evaluating just the final output isn’t enough anymore. You cannot judge a journey by looking only at the destination. We need judges that can investigate the process, verify claims, and adapt their criteria. ...