Evaluation is catching up to agentic behavior: reliability, not just likability.