How Do You Actually Evaluate an AI Research Agent?
We’re building expert AI research agents at Grep.ai: due diligence reports, business research, compliance checks. The kind of work where you need depth, accuracy, and real sources. Getting the agent to run was the easy part. Making sure it’s actually good? That’s where things get interesting.

The problem with “it works”

Our initial eval setup was basic: does the agent complete the task? Does it return a report? Does it cite sources? Check, check, check. ...
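To make the shortcoming concrete, here is a minimal sketch of what those binary checks amount to. The function name and report shape are assumptions for illustration, not Grep.ai’s actual code:

```python
# Hypothetical sketch of the binary "it works" checks described above.
# The report structure (status / text / sources keys) is assumed.

def check_report(report: dict) -> dict:
    """Naive pass/fail checks: task completed, report returned, sources cited.
    Each check is binary, which is exactly the problem: a shallow report
    with one weak citation still passes all three."""
    return {
        "completed": report.get("status") == "done",
        "has_report": bool(report.get("text", "").strip()),
        "cites_sources": len(report.get("sources", [])) > 0,
    }

result = check_report({
    "status": "done",
    "text": "Acme Corp appears solvent.",
    "sources": ["https://example.com/filing"],
})
print(result)  # every check passes, yet the report says almost nothing
```

Checks like these tell you the agent produced *something*, not whether the something is deep, accurate, or well-sourced.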