In Parts 1 and 2, we established what offline RL is, its potential benefits, and why it remains largely unused despite these benefits. The challenges are both practical (engineering complexity, reward annotation burden, brittle O2O transitions) and fundamental (quadratic error accumulation over the horizon).
In this final part, we examine recent research that directly addresses these obstacles.
Addressing the Curse of Horizon
The \(H^2\) error accumulation in TD learning is perhaps the most fundamental barrier to scaling offline RL. Two recent approaches attack this problem from different angles.
Horizon Reduction: SHARSA and n-step Methods
Park et al. (2025) conducted a systematic study of offline RL scaling with datasets up to 1B transitions—1000× larger than typical offline RL benchmarks. Their findings are striking: standard offline RL methods (IQL, CRL, SAC+BC) completely fail on complex, long-horizon tasks even with massive datasets. Performance saturates far below optimal, regardless of model size or hyperparameter tuning. The remedy they study is horizon reduction, applied in two ways.
Value horizon reduction: Use n-step returns instead of 1-step TD. This reduces the number of recursive updates by a factor of n, directly attacking the \(H^2\) term (see the sketch below).
Policy horizon reduction: Use hierarchical policies that decompose long-horizon goals into shorter subgoal-reaching problems. A high-level policy \(\pi^h(w|s,g)\) outputs subgoals; a low-level policy \(\pi^\ell(a|s,w)\) executes them. This hierarchical decomposition is also the standard approach in modern robotics.
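To make the value-horizon idea concrete, here is a minimal sketch of a 1-step versus an n-step SARSA-style target. It assumes rewards and Q-values come from the same logged trajectory, with `q_values[k]` holding \(Q(s_{t+k}, a_{t+k})\); the function names and array layout are illustrative, not the paper's code.

```python
import numpy as np

def one_step_td_target(rewards, q_values, gamma):
    """1-step TD target: r_t + gamma * Q(s_{t+1}, a_{t+1}).

    On a length-H trajectory, errors in the bootstrapped Q-values must
    survive ~H recursive updates before they reach the initial state.
    """
    return rewards[0] + gamma * q_values[1]


def n_step_sarsa_target(rewards, q_values, gamma, n):
    """n-step SARSA target: n real rewards plus one bootstrapped tail value.

    Bootstrapping only every n steps divides the number of recursions
    (and the compounding of value errors) roughly by n.
    """
    n = min(n, len(rewards))                      # truncate at trajectory end
    discounts = gamma ** np.arange(n)
    return float(np.dot(discounts, rewards[:n]) + gamma ** n * q_values[n])
```

With n = 50 on a 1,000-step trajectory, value information reaches the start in roughly 20 bootstrapped hops instead of 1,000.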
These ideas come together in SHARSA (Scalable Horizon-Aware RSA), which combines:
- Flow-matching behavioral cloning for expressive policy representation
- n-step SARSA for value learning (reduces value horizon)
- Hierarchical policy structure (reduces policy horizon)
- Rejection sampling for high-level policy extraction (avoids ill-defined gradients in state space; see the sketch after this list)
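To illustrate that last ingredient, here is a sketch of rejection-sampling-style policy extraction, simplified to best-of-N selection. The proposer `bc_subgoal_sampler` (standing in for a flow-matching BC model over subgoals) and the scorer `q_high` (a learned high-level value) are assumed names, not SHARSA's actual interfaces.

```python
import numpy as np

def extract_subgoal(state, goal, bc_subgoal_sampler, q_high, num_candidates=32):
    """Pick a subgoal by sampling candidates from a behavior-cloned proposer
    and keeping the one the learned high-level value scores best.

    Candidates come from the BC model, so they stay in-distribution, and
    no gradient ever has to be taken with respect to raw state space.
    """
    candidates = [bc_subgoal_sampler(state, goal) for _ in range(num_candidates)]
    scores = np.array([q_high(state, w, goal) for w in candidates])
    return candidates[int(np.argmax(scores))]
```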
The results make the case: on tasks where standard offline RL achieves 0% success even with 1B samples, SHARSA achieves 50-100% success. The key insight is that horizon reduction isn't just a nice-to-have; it's necessary for scaling.
Transitive RL: Divide and Conquer Value Learning
Transitive RL (TRL) exploits a divide-and-conquer structure specific to goal-conditioned RL.
The key observation: in goal-conditioned RL, temporal distances satisfy a triangle inequality:
$$d^*(s, g) \leq d^*(s, w) + d^*(w, g)$$
This enables a transitive Bellman update:
$$V(s, g) \leftarrow \max_{w \in S} V(s, w) \cdot V(w, g)$$
Instead of propagating value one step at a time (TD) or summing over entire trajectories (MC), TRL combines two equal-sized trajectory segments. In theory, this reduces the number of biased recursions from \(O(H)\) to \(O(\log H)\).
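One way to see why the product form is consistent with the triangle inequality: assume values are parameterized as \(V^*(s, g) = \gamma^{d^*(s, g)}\), as is standard for discounted sparse-reward goal reaching (an assumption about the setup, not a detail stated above). Then

$$V^*(s, w) \cdot V^*(w, g) = \gamma^{d^*(s, w) + d^*(w, g)} \leq \gamma^{d^*(s, g)} = V^*(s, g)$$

with equality whenever \(w\) lies on a shortest path from \(s\) to \(g\), so maximizing over \(w\) recovers \(V^*(s, g)\) exactly.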
The practical challenge is the \(\max_{w}\) operator: naively maximizing over all possible subgoals leads to catastrophic overestimation. TRL addresses this with three mechanisms (sketched together below):
- In-sample maximization: Only consider subgoals that appear in dataset trajectories
- Expectile regression: Soft approximation of the max operator
- Distance-based reweighting: Focus more on shorter trajectory chunks (solve smaller subproblems first)
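A rough sketch of how these three pieces might fit into a single loss term. The candidate set, the expectile-as-soft-max approximation, and the \(\gamma^{j-i}\) chunk weighting are illustrative assumptions, not TRL's reference implementation.

```python
import numpy as np

def expectile_loss(pred, targets, tau=0.9):
    """Asymmetric squared loss; with tau > 0.5 it behaves like a soft max
    over the targets instead of a mean."""
    diff = targets - pred
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return np.mean(weight * diff ** 2)


def transitive_value_loss(V_pred, V_target, traj_states, i, j, tau=0.9, gamma=0.99):
    """Divide-and-conquer loss for V(s_i, s_j) on one logged trajectory (j > i + 1).

    - In-sample maximization: candidate waypoints w are restricted to states
      the trajectory actually visits between i and j.
    - Expectile regression: regressing toward all candidate products with
      tau > 0.5 softly approximates the max over w.
    - Distance-based reweighting: shorter chunks get larger weight, so small
      subproblems are solved before long ones.
    """
    s, g = traj_states[i], traj_states[j]
    candidates = traj_states[i + 1 : j]                       # in-sample subgoals
    targets = np.array([V_target(s, w) * V_target(w, g) for w in candidates])
    chunk_weight = gamma ** (j - i)
    return chunk_weight * expectile_loss(V_pred, targets, tau)
```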
On long-horizon tasks (3000+ steps), TRL outperforms both TD and MC baselines.
The limitation: TRL currently applies only to goal-conditioned, deterministic settings. Extending it to general reward-based RL and stochastic environments remains open.
Fixing Offline-to-Online Transitions
As discussed in Part 2, the offline-to-online (O2O) transition is often where offline RL falls apart. Several recent methods directly target this problem.
RLPD: Just Use Off-Policy RL (With Care)
Ball et al. (2023) ask a simple question: can we just use existing off-policy methods to leverage offline data during online learning, without specialized offline RL pretraining?
The answer is yes, but with important caveats. Their method, RLPD (RL with Prior Data), shows that naively adding offline data to SAC's replay buffer performs poorly. However, a minimal set of changes makes it work:
- Symmetric sampling: Sample 50% from offline data, 50% from online buffer
- Layer normalization: Stabilizes learning with mixed data distributions
- Large ensembles: Use 10 Q-functions instead of 2
The insight is that you don't need complex offline RL machinery—standard off-policy RL can work if you handle the distribution mismatch carefully.
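A minimal sketch of symmetric sampling, assuming simple list-like replay buffers; the names are placeholders, and the layer normalization and 10-critic ensemble live inside the SAC update itself rather than here.

```python
import numpy as np

def sample_symmetric_batch(offline_buffer, online_buffer, batch_size=256, rng=None):
    """Draw half of each training batch from offline data and half from the
    online replay buffer, so early online training is anchored by prior data
    without being dominated by it.
    """
    rng = rng or np.random.default_rng()
    half = batch_size // 2
    offline_idx = rng.integers(len(offline_buffer), size=half)
    online_idx = rng.integers(len(online_buffer), size=batch_size - half)
    return [offline_buffer[i] for i in offline_idx] + \
           [online_buffer[i] for i in online_idx]
```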
Cal-QL: Calibrated Value Functions for Fine-Tuning
Nakamoto et al. (2023) identify a specific failure mode: conservative offline RL methods (like CQL) produce value estimates that are too pessimistic. During online fine-tuning, the agent wastes samples "unlearning" this excessive conservatism before it can improve.
Cal-QL (Calibrated Q-Learning) solves this with a simple modification: ensure the learned Q-values provide a lower bound on the true value of the learned policy, but an upper bound on the behavior policy's value. This "calibration" means:
- The agent correctly knows its current policy is better than the data
- It doesn't massively underestimate its own performance
Implementation-wise, it's a one-line change to CQL. Empirically, Cal-QL outperforms prior methods on 9/11 fine-tuning benchmarks.
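In stripped-down form, that change looks roughly like the following; `ref_value` stands in for an estimate of the behavior policy's value (e.g. a Monte Carlo return from the dataset), and the surrounding CQL machinery is heavily simplified.

```python
import numpy as np

def cql_penalty(q_policy_actions, q_data_actions):
    """CQL-style regularizer: push down Q at policy actions,
    push up Q at dataset actions."""
    return np.mean(q_policy_actions) - np.mean(q_data_actions)


def cal_ql_penalty(q_policy_actions, q_data_actions, ref_value):
    """Cal-QL-style regularizer: identical, except Q-values are never pushed
    below the behavior policy's value estimate, so fine-tuning does not have
    to unlearn excess pessimism first."""
    calibrated = np.maximum(q_policy_actions, ref_value)   # the one-line change
    return np.mean(calibrated) - np.mean(q_data_actions)
```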
WSRL: No Need to Retain Offline Data
Kumar et al. (2024) challenge a common assumption: that you need to keep training on offline data during online fine-tuning. This is undesirable because:
- Training on large offline datasets is slow
- Continued pessimism constrains performance improvement
WSRL (Warm-Start RL) shows that retaining offline data is unnecessary—but you need a proper transition strategy. The key insight: the "dip" in performance at fine-tuning onset comes from distribution mismatch between offline data and initial online rollouts.
WSRL's solution is remarkably simple: a warmup phase that collects a small number of rollouts from the pretrained policy before switching to pure online RL. This "recalibrates" the Q-function to the online distribution, after which you can completely discard offline data.
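A sketch of that warmup phase, assuming a Gymnasium-style environment API and placeholder names; the point is only that a brief rollout phase with the frozen pretrained policy seeds the online buffer before standard online RL begins.

```python
def warm_start(env, pretrained_policy, online_buffer, warmup_steps=5_000):
    """Collect a small number of transitions with the frozen pretrained policy.

    These rollouts recalibrate the Q-function to the online distribution;
    afterwards the offline dataset can be discarded entirely.
    """
    obs, _ = env.reset()
    for _ in range(warmup_steps):
        action = pretrained_policy(obs)
        next_obs, reward, terminated, truncated, _ = env.step(action)
        online_buffer.append((obs, action, reward, next_obs, terminated))
        obs = env.reset()[0] if (terminated or truncated) else next_obs
    return online_buffer
```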
Results show WSRL learns faster and achieves higher asymptotic performance than methods that retain offline data.
PORL: Fine-Tuning With Just the Policy
Xiao et al. (2025) address a different limitation: existing O2O methods require pretrained Q-functions. But what if you only have a policy—say, from behavior cloning or imitation learning?
PORL (Policy-Only RL) fine-tunes using only the pretrained policy, initializing Q-functions from scratch during online learning. Counter-intuitively, this can work better than using pretrained Q-functions because:
- You avoid the pessimistic bias baked into offline Q-functions
- A randomly initialized Q-function doesn't actively discourage exploration of OOD actions
PORL opens a new path for fine-tuning BC policies directly with RL, without needing to first convert to an offline RL formulation.
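In outline, the initialization looks something like this (placeholder names, not the paper's code):

```python
def porl_setup(pretrained_actor, make_critic, num_critics=2):
    """PORL-style warm start: reuse only the pretrained policy.

    The actor comes from BC / imitation learning; the critics are freshly
    initialized, so no offline pessimism is inherited and out-of-distribution
    actions are not actively suppressed during online exploration.
    """
    actor = pretrained_actor                     # the only pretrained component
    critics = [make_critic() for _ in range(num_critics)]
    return actor, critics
```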
So, is offline RL ready for prime time? Not yet. SHARSA and TRL still can't fully solve the hardest benchmarks even with billion-scale data. The O2O methods work well on standard benchmarks but haven't been validated at the scale and complexity of real autonomous systems.
But for the first time, we have principled explanations for why offline RL fails at scale, and methods that directly target those failure modes. The gap between offline RL's promise and its practice is narrowing, but it is still there.