RewardGuard – detect reward hacking in RL training loops
Catches reward hacking before it tanks your RL training run.
Runtime detector for reward hacking and misalignment in LLM agents (89.7% F1 on 5,391 trajectories).
Catches LLM reward hacking at runtime when models game evals.
AI safety researchers, LLM developers, evaluation engineers
OpenAI's CoT monitoring · Anthropic's alignment monitoring · METR's MALT evaluation framework
BEAM-based agent runtime with git-backed recovery and auditable safety gates.
Educational content in a space where Nathan Lambert's RLHF book already exists.
Skips heavy judge loops by using logprobs to gate agent actions at runtime.
Proves text safety ≠ tool-call safety; catches hidden harmful executions deterministically.
Detection logic is scientifically flawed and easily bypassed by modern tokenizers.