
RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking

Zac Boring, May 13, 2026

Offline-to-online reinforcement learning (RL) improves sample efficiency by leveraging pre-collected datasets prior to online interaction. A key challenge, however, is learning an accurate critic in large state-action spaces with limited dataset coverage. To mitigate harmful updates from value overestimation, prior methods impose pessimism by down-weighting out-of-distribution (OOD) actions relative to dataset actions. While effective, this essent
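
To make the pessimism used by prior methods concrete, here is a minimal sketch, assuming a penalty in the style of conservative Q-learning (CQL), a prior method the text above does not name. It is not RankQ's ranking objective. The function name `conservative_penalty` and its array arguments are hypothetical, chosen only for illustration.

```python
import numpy as np

def conservative_penalty(q_sampled, q_dataset):
    """CQL-style pessimism term (illustrative assumption, not RankQ itself).

    q_sampled: shape (batch, n_actions), critic values for sampled,
               possibly out-of-distribution, actions.
    q_dataset: shape (batch,), critic values for the dataset actions.
    """
    q_sampled = np.asarray(q_sampled, dtype=float)
    q_dataset = np.asarray(q_dataset, dtype=float)

    # Numerically stable log-sum-exp over the action axis: a soft maximum
    # of the critic's values on the sampled actions.
    row_max = q_sampled.max(axis=1, keepdims=True)
    soft_max = row_max[:, 0] + np.log(np.exp(q_sampled - row_max).sum(axis=1))

    # Minimizing this term lowers values on sampled actions and raises
    # values on dataset actions.
    return float((soft_max - q_dataset).mean())

# Higher critic values on a sampled action produce a larger penalty.
low = conservative_penalty([[0.0, 0.0]], [0.0])
high = conservative_penalty([[5.0, 0.0]], [0.0])
```

Added to a standard critic loss with some weight, a term like this pushes the critic to rank dataset actions above sampled ones. With two sampled actions valued at 0 and a dataset action valued at 0, the penalty is log 2.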

By Andrew Choi, Wei Xu

Read the full article at arXiv cs.AI →