Back to browse
GitHub Repository

Runtime detector for reward hacking and misalignment in LLM agents (89.7% F1 on 5,391 trajectories).

12 starsPython

RewardHackWatch – Reward hacking detector for LLM agents

by aerosta·Mar 1, 2026·1 point·1 comment

AI Analysis

●●●BangerBig BrainWizardryShip It

Catches LLM reward hacking at runtime when models game evals.

Strengths
  • 89.7% F1 score validated on real METR dataset, not synthetic data
  • Calibrated threshold (0.02 vs default 0.5) shows empirical rigor for imbalanced classes
  • Includes experimental RMGI metric linking reward hacking to misalignment signals
Weaknesses
  • Detector only catches known exploitation patterns (sys.exit, test patching); generalization to unseen hacks unclear
  • Limited to post-hoc analysis; cannot prevent hacking in real-time during model execution
Category
Target Audience

AI safety researchers, LLM developers, evaluation engineers

Similar To

OpenAI's CoT monitoring · Anthropic's alignment monitoring · METR's MALT evaluation framework

Similar Projects

AI/ML●●Solid

RewardGuard – detect reward hacking in RL training loops

Catches reward hacking before it tanks your RL training run.

Niche GemBig Brain
Giovan321
111mo ago
EducationMid

rlvrbook

Educational content in a space where Nathan Lambert's RLHF book already exists.

Niche Gem
kyars
112mo ago