Reward is central to reinforcement learning, but it is often hard to define, difficult to engineer, and easy to hack.
Why does reward hacking happen?
Proxy gap and Goodhart pressure
Reward hacking occurs when a reinforcement learning (RL) agent exploits flaws or ambiguities in a proxy reward function to achieve high rewards without genuinely learning or completing the intended task. The core reason reward hacking occurs is the unobservability of the true reward function, which forces us to replace it with a simplified, underdefined proxy.
Mathematically, let \(u(x, y)\) be the true (unobservable) utility of a response \(y\) to prompt \(x\). RL replaces \(u\) with a learned proxy \(r_\phi(x, y)\). Policy optimization then solves something like:
$$\max_\pi \; \mathbb{E}_{y \sim \pi(\cdot|x)}\left[r_\phi(x, y)\right] - \beta \,\mathrm{KL}(\pi \,\|\, \pi_{\text{ref}})$$
As soon as optimization pressure becomes strong, Goodhart's law kicks in: the policy actively searches for regions where \(r_\phi\) is high but poorly aligned with \(u\). KL regularization slows this down but does not eliminate it.
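As a minimal sketch of how this objective typically enters a PPO-style update, the per-sample KL penalty can be estimated from policy and reference log-probabilities; the names below (`shaped_reward`, `beta`) are illustrative rather than taken from any specific library.

```python
import torch

def shaped_reward(logp_policy: torch.Tensor,
                  logp_ref: torch.Tensor,
                  proxy_reward: torch.Tensor,
                  beta: float = 0.1) -> torch.Tensor:
    """Per-sample training signal: proxy reward minus a KL penalty.

    logp_policy:  log pi(y|x) of the sampled response under the current policy
    logp_ref:     log pi_ref(y|x) of the same response under the reference policy
    proxy_reward: r_phi(x, y) from the learned reward model
    beta:         strength of the KL regularizer
    """
    # Single-sample Monte Carlo estimate of KL(pi || pi_ref) at this response
    kl_estimate = logp_policy - logp_ref
    return proxy_reward - beta * kl_estimate
```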
Distribution shift is intrinsic to RLHF
Reward models are trained on a static dataset of human comparisons. Policy optimization changes the output distribution. This creates a distribution shift problem: the reward model is asked to score samples far outside its training support.
This shift can be adversarial. The policy is not passively drifting; it is actively hunting for higher reward and probing the reward model's blind spots. From the reward model's perspective, this looks like systematic out-of-distribution exploitation.
This is why reward hacking often appears suddenly rather than gradually: once the policy finds a direction where reward extrapolates incorrectly, it will push hard.
Capability
In both robotics and LLM settings, more capable agents exploit misspecification more effectively. In simulated environments, researchers have observed phase transitions: as the agent becomes sufficiently competent, the true reward collapses while the proxy reward continues to rise.
LLMs add an extra twist: the reward interface is often language itself (rubrics, instructions, tool specs). This makes semantic loopholes exploitable in ways that resemble adversarial examples.
Mitigation Strategies
Reward shaping and bounding
A surprisingly effective idea is to bound the reward signal. If reward grows without limit, critics become unstable and policies are incentivized to chase extreme values that often correspond to exploitation.
Preference-as-Reward (PAR) formalizes this intuition by converting raw reward differences into bounded, centered signals—often via a sigmoid on reward differences relative to a reference response. The resulting reward behaves more like a probability than a score, which stabilizes training and empirically delays reward hacking.
Formally, let \(r(x, y)\) be the raw output of the reward model for prompt \(x\) and response \(y\). Let \(\mu(x)\) be a reference reward (baseline), typically derived from the average reward for the same prompt or the reward of a reference policy \(\pi_{\text{ref}}\). The shaped reward \(R_{PAR}\) is defined as:
$$R_{PAR}(x, y) = \sigma \left( \frac{r(x, y) - \mu(x)}{\tau} \right)$$
where \(\sigma(\cdot)\) is the logistic function and \(\tau\) is a temperature parameter controlling the steepness of the preference curve.
If you find this formulation very similar to GRPO, you're not mistaken: the only difference is that PAR applies the logistic function while GRPO divides by the group's standard deviation. Note that if you're already using GRPO-style baselining, PAR becomes largely redundant.
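To make the comparison concrete, here is a small sketch of both shapings over a group of responses sampled for one prompt, assuming the per-prompt mean serves as the reference reward \(\mu(x)\); the function names are illustrative.

```python
import torch

def par_reward(raw_rewards: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Preference-as-Reward shaping for G responses sampled for one prompt.

    raw_rewards: shape (G,), raw reward-model scores r(x, y_i)
    Returns rewards bounded in (0, 1): a sigmoid of centered differences.
    """
    mu = raw_rewards.mean()  # reference reward mu(x), here the per-prompt mean
    return torch.sigmoid((raw_rewards - mu) / tau)

def grpo_advantage(raw_rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style baselining for comparison: center, then divide by the std."""
    return (raw_rewards - raw_rewards.mean()) / (raw_rewards.std() + eps)
```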
Ensembles and uncertainty
Reward hacking resembles an adversarial robustness problem, so one natural mitigation is an ensemble of reward models.
Instead of a single reward model, train an ensemble. When ensemble members disagree, treat high reward as suspicious rather than desirable.
Uncertainty-Weighted Optimization (UWO) is a common conservative score aggregation:
$$R_{\text{UWO}}(x,y) = \mathbb{E}_i\left[r_i(x,y)\right] - \lambda \cdot \mathrm{Std}_i\left[r_i(x,y)\right]$$
Worst-case Optimization (WCO) is another common conservative aggregation:
$$R_{WCO}(x, y) = \min_{i} r_i(x, y)$$
Empirically, ensemble-based pessimism significantly reduces overoptimization during RL. Using UWO changes the PPO dynamics: the agent naturally avoids regions of the policy space where the ensemble disagrees, effectively steering the policy back toward the training distribution's support.
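A sketch of both aggregations, assuming the scores from the \(M\) members for a batch are stacked along the first dimension; \(\lambda\) and the shapes are illustrative.

```python
import torch

def uwo_reward(scores: torch.Tensor, lam: float = 0.5) -> torch.Tensor:
    """Uncertainty-Weighted Optimization: penalize ensemble disagreement.

    scores: shape (M, B), reward from each of M members for B (prompt, response) pairs
    """
    return scores.mean(dim=0) - lam * scores.std(dim=0)

def wco_reward(scores: torch.Tensor) -> torch.Tensor:
    """Worst-case Optimization: use the most pessimistic member per sample."""
    return scores.min(dim=0).values
```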
Research has also shown that simply using a different random seed when fine-tuning a base LLM into a reward model is enough to produce diverse ensemble members; no separately pretrained checkpoints are required.
However, a standard ensemble requires maintaining \(M\) distinct models in memory and running \(M\) forward passes during RL, which is prohibitively expensive for LLMs. WARM (Weight Averaged Reward Models) proposes a solution based on the geometry of the loss landscape: instead of averaging the predictions of \(M\) models (ensemble averaging), WARM averages the weights of \(M\) fine-tuned reward models into a single model \(\theta_{WARM}\).
$$\theta_{WARM} = \frac{1}{M} \sum_{i=1}^M \theta_i$$
This works because reward models fine-tuned from the same pretrained checkpoint with different seeds tend to lie in the same loss basin, meaning linear interpolation between them stays in a region of low loss. The resulting single model captures much of the ensemble's diversity at the cost of just one forward pass. In practice, WARM achieves reliability gains comparable to full ensembles while matching the inference cost of a single model. The authors also find that WARM produces reward models that are more robust to distribution shift and label corruption than any individual ensemble member, suggesting that weight averaging acts as an implicit regularizer. Notably, WARM composes well with other mitigation strategies—it can be used alongside KL regularization or UWO-style pessimism for further gains.
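A minimal sketch of the averaging step in PyTorch, assuming all \(M\) reward models share the same architecture and parameter names because they were fine-tuned from one pretrained checkpoint; the helper name is illustrative.

```python
import copy
import torch

def average_reward_models(models: list) -> torch.nn.Module:
    """WARM-style weight averaging: build one model whose parameters are the
    uniform mean of M fine-tuned reward models (assumed to share state_dict keys)."""
    averaged = copy.deepcopy(models[0])
    avg_state = averaged.state_dict()
    with torch.no_grad():
        for key in avg_state:
            # Stack the corresponding tensor from every member and take the mean
            stacked = torch.stack([m.state_dict()[key].float() for m in models])
            avg_state[key] = stacked.mean(dim=0).to(avg_state[key].dtype)
    averaged.load_state_dict(avg_state)
    return averaged
```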
Better reward models
Invariance and shortcut mitigation
PRISM reframes reward modeling as a problem of learning invariances. The core idea is that if a superficial transformation—such as adding filler phrases, inflating length, or inserting flattery—does not change the true quality of a response, then the reward model should be invariant to that transformation.
Concretely, PRISM defines a set of shortcut transformation groups \(\mathcal{G} = \{g_1, g_2, \dots, g_K\}\), where each \(g_k\) represents a meaning-preserving but reward-gaming perturbation (e.g., paraphrasing to increase length, appending sycophantic hedging). The training objective augments the standard preference loss with an invariance penalty:
$$\mathcal{L}_{\text{PRISM}} = \mathcal{L}_{\text{pref}}(\theta) + \alpha \sum_{k=1}^{K} \mathbb{E}_{x,y}\left[\left(r_\theta(x, y) - r_\theta(x, g_k(y))\right)^2\right]$$
where \(\alpha\) controls the strength of the invariance regularization. By explicitly penalizing reward differences between a response and its shortcut-transformed variant, PRISM reduces the reward model's reliance on spurious features.
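A sketch of the combined objective for a single preference pair, assuming a Bradley–Terry preference loss and a user-supplied list of shortcut transformations; for brevity the invariance penalty is applied to the chosen response only, and the reward-model call signature is illustrative.

```python
import torch
import torch.nn.functional as F

def prism_loss(reward_model, x, y_chosen, y_rejected, transforms, alpha=0.1):
    """Preference loss plus a PRISM-style invariance penalty for one pair.

    transforms: list of callables g_k that apply a meaning-preserving but
                reward-gaming perturbation (length padding, sycophancy, ...).
    """
    r_chosen = reward_model(x, y_chosen)
    r_rejected = reward_model(x, y_rejected)
    pref_loss = -F.logsigmoid(r_chosen - r_rejected)  # Bradley-Terry term

    # Penalize any reward movement under shortcut transformations of the response
    inv_penalty = torch.zeros_like(pref_loss)
    for g in transforms:
        inv_penalty = inv_penalty + (r_chosen - reward_model(x, g(y_chosen))) ** 2
    inv_penalty = inv_penalty / max(len(transforms), 1)

    return (pref_loss + alpha * inv_penalty).mean()
```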
This approach is elegant and theoretically grounded, but it relies on knowing—or discovering—the right shortcut groups. In practice, this means curating a library of transformations, which may not cover novel exploits the policy discovers during training.
Information-theoretic reward modeling
InfoRM takes a different tack: instead of manually specifying invariances, it uses an information bottleneck to compress reward representations. The idea is to learn a compressed latent \(z\) that retains only the information predictive of human preference labels while discarding everything else.
InfoRM optimizes a trade-off between compression and predictiveness:
$$\mathcal{L}_{\text{InfoRM}} = -I(z; \text{pref}) + \beta \, I(z; y)$$
where \(I(z; \text{pref})\) is the mutual information between the latent representation and the preference label (maximized to stay predictive), and \(I(z; y)\) is the mutual information between the latent and the full response (minimized to discard reward-irrelevant features). The hyperparameter \(\beta\) controls the compression–fidelity trade-off.
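One common way to make these mutual-information terms tractable is a variational information bottleneck: the preference likelihood stands in for \(I(z; \text{pref})\), and a KL divergence to a standard Gaussian prior upper-bounds \(I(z; y)\). The sketch below follows that standard recipe and is an assumption about the implementation, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IBRewardHead(nn.Module):
    """Variational information-bottleneck reward head on top of response encodings."""
    def __init__(self, hidden_dim: int, latent_dim: int = 64):
        super().__init__()
        self.to_mu = nn.Linear(hidden_dim, latent_dim)
        self.to_logvar = nn.Linear(hidden_dim, latent_dim)
        self.scorer = nn.Linear(latent_dim, 1)

    def forward(self, h: torch.Tensor):
        # h: encoder representation of (x, y), shape (B, hidden_dim)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)     # reparameterize
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(-1)  # KL(q(z|y) || N(0, I))
        return self.scorer(z).squeeze(-1), kl

def inform_loss(head, h_chosen, h_rejected, beta=1e-3):
    """Preference term approximates -I(z; pref); the KL term bounds I(z; y)."""
    r_c, kl_c = head(h_chosen)
    r_r, kl_r = head(h_rejected)
    pref_nll = -F.logsigmoid(r_c - r_r)
    return (pref_nll + beta * (kl_c + kl_r)).mean()
```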
This compression also yields a detection mechanism: hacked or overoptimized samples tend to appear as outliers in the compressed latent space, exhibiting high reconstruction error or low likelihood under the learned prior.
Adversarial Data Collection
Adversarial Reward Auditing
ARA formalizes adversarial data collection (a.k.a. red teaming) into a structured pipeline involving two agents: the Hacker and the Auditor. The Hacker is initialized from \(\pi_{\text{SFT}}\) but trained with RL to maximize the proxy reward \(r_\theta\) specifically by finding exploits: responses that score high under the proxy but are genuinely low quality. The Auditor is a discriminator trained to distinguish between "genuine high reward" samples from a validation set and "hacked high reward" samples generated by the Hacker. Through this contrastive training, the Auditor learns the signatures of hacking: repetition patterns, incongruent tone, length padding, and other exploitation artifacts.
During actual training of the target policy, the reward signal is gated by the Auditor:
$$R_{\text{ARA}}(x, y) = r_\theta(x, y) \cdot A(x, y)$$
where \(A(x, y) \in [0, 1]\) is the Auditor's confidence that the response is genuinely good rather than hacked. This suppresses reward for outputs exhibiting known exploitation patterns, even if the proxy scores them highly.
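A sketch of the gating during policy training, assuming the Auditor is a binary classifier whose sigmoid output plays the role of \(A(x, y)\); the tensor shapes and names are illustrative.

```python
import torch

def ara_reward(proxy_reward: torch.Tensor, auditor_logits: torch.Tensor) -> torch.Tensor:
    """Gate the proxy reward by the Auditor's confidence that a response is genuine.

    proxy_reward:   r_theta(x, y) for a batch, shape (B,)
    auditor_logits: raw Auditor outputs, shape (B,); sigmoid gives A(x, y) in [0, 1]
    """
    gate = torch.sigmoid(auditor_logits)  # close to 0 when the sample looks hacked
    return proxy_reward * gate
```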