Tracking AI existential risk. Auto-aggregated headlines. Human-curated analysis.
AGGREGATING 47 SOURCES · UPDATED LIVE
Research

Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

Zac Boring May 14, 2026 1 min read
Read original source →

Agent benchmarks have become the de facto measure of frontier AI competence, guiding model selection, investment, and deployment. However, reward hacking, where agents maximize a score without performing the intended task, emerges spontaneously in frontier models without overfitting. We argue that benchmarks must be secure by design. From past incidents of reward hacks, we derive a taxonomy of eight recurring flaw patterns and compile them into the

By Hao Wang, Hanchen Li, Qiuyang Mang, Alvin Cheung, Koushik Sen, Dawn Song

Read the full article at ArXiv cs.AI →