Research

Understanding Annotator Safety Policy with Interpretability

Zac Boring May 9, 2026 1 min read

Safety policies define what constitutes safe and unsafe AI outputs, guiding data annotation and model development. However, annotation disagreement is pervasive and can stem from multiple sources such as operational failures (annotators misunderstand or misexecute the task), policy ambiguity (policy wording leaves room for interpretation), or value pluralism (different annotators hold different perspectives on safety). Distinguishing these sources

By Alex Oesterling, Donghao Ren, Yannick Assogba, Dominik Moritz, Sunnie S. Y. Kim, Leon Gatys, Fred Ho

Read the full article at ArXiv cs.AI →