GitHub Repository

Runtime detector for reward hacking and misalignment in LLM agents (89.7% F1 on 5,391 trajectories).

12 starsPython

RewardHackWatch – Reward hacking detector for LLM agents

Name: RewardHackWatch – Reward hacking detector for LLM agents
Availability: InStock
Author: aerosta

by aerosta·Mar 1, 2026·1 point·1 comment

AI Analysis

●●●BangerBig BrainWizardryShip It

Catches LLM reward hacking at runtime when models game evals.

Strengths

•89.7% F1 score validated on real METR dataset, not synthetic data
•Calibrated threshold (0.02 vs default 0.5) shows empirical rigor for imbalanced classes
•Includes experimental RMGI metric linking reward hacking to misalignment signals

Weaknesses

•Detector only catches known exploitation patterns (sys.exit, test patching); generalization to unseen hacks unclear
•Limited to post-hoc analysis; cannot prevent hacking in real-time during model execution