My name is Haytham ElFadeel. I am a founder, self-taught machine learning scientist, engineering executive, and former alpine climber and cyclist.

Adaptive Policy Optimization

GRPO and RLOO offer a memory-efficient alternative to PPO for training large language models by eliminating the need for a learned value function. However, we identify four sources of bias and instability in GRPO—length bias, difficulty normalization bias, finite-sampling bias, and unbounded advantage scaling—that harm training stability and generalization. We propose Adaptive Policy Optimization (APO), an unbiased and adaptive method that corrects these issues. Experiments on mathematical reasoning benchmarks show that APO improves training stability and token efficiency, and outperforms GRPO by up to 8%.
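
To make the group-relative setup concrete, below is a minimal sketch of how a GRPO-style advantage is typically computed for one prompt, which shows where the length and difficulty-normalization effects enter. The function names are illustrative, and the mean-centered variant is only one possible correction under these assumptions, not APO itself.

```python
import numpy as np

def grpo_advantages(rewards, lengths, eps=1e-6):
    """GRPO-style group-relative advantages for one prompt.

    rewards: scalar reward per sampled completion.
    lengths: completion lengths in tokens.
    Dividing by the group std rescales advantages per prompt (the
    difficulty-normalization effect), and repeating the same advantage
    over every token couples the update strength to completion length.
    """
    rewards = np.asarray(rewards, dtype=float)
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)
    return [np.full(int(n), a) for a, n in zip(adv, lengths)]

def mean_centered_advantages(rewards, lengths):
    """Illustrative alternative (not APO): drop the std division so easy
    and hard prompts are not implicitly re-weighted by their reward spread."""
    rewards = np.asarray(rewards, dtype=float)
    adv = rewards - rewards.mean()
    return [np.full(int(n), a) for a, n in zip(adv, lengths)]
```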

Progress-Aware Reward Model

RLVR improves LLM reasoning but suffers from severe credit assignment issues due to sparse, end-of-trajectory outcome rewards. Process reward models partially address this but rely on costly human supervision or noisy Monte Carlo estimates. We propose a hierarchical credit assignment framework that decomposes feedback across outcome, step, and token levels. Our method introduces a progress reward measuring changes in the probability of reaching a correct answer after each reasoning step, and an entropy-weighted advantage that modulates token-level updates based on model uncertainty.
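
As a rough illustration of the two ingredients, the sketch below assumes that some verifier or value estimator already provides a per-step probability of eventually reaching a correct answer; the function names and the entropy normalization are illustrative, not the exact formulation.

```python
import math

def progress_rewards(step_success_probs, p0):
    """Per-step progress reward: the change in the estimated probability
    of eventually reaching a correct answer after each reasoning step.

    step_success_probs: [p_1, ..., p_T], estimates of P(correct | prefix)
    after each step; p0 is the estimate before any reasoning step.
    """
    rewards, prev = [], p0
    for p in step_success_probs:
        rewards.append(p - prev)
        prev = p
    return rewards

def entropy_weight(token_probs, eps=1e-12):
    """Normalized entropy of one token's predictive distribution (assumes at
    least two classes), usable as a multiplier on that token's advantage so
    the update strength reflects model uncertainty."""
    h = -sum(p * math.log(p + eps) for p in token_probs)
    return h / math.log(len(token_probs))  # in [0, 1]
```

In this sketch, a step that raises the success estimate from 0.3 to 0.5 earns a progress reward of 0.2, while a step that derails the solution is penalized by the corresponding drop.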

On Offline Reinforcement Learning

Part 1: What is offline RL, how does it compare to behavior cloning, and does it work?

Part 2: Why offline RL is not widely used - the challenges of offline RL.

Part 3: Recent advancements that could make offline RL more feasible.

Mitigations for Reward Hacking

Reward is central to reinforcement learning, but it is often hard to define, hard to engineer, and easy to hack.

Towards Tokenization-Free Language Models: Research Trends

Tokenization is useful because it significantly decreases inference and training costs. However, tokenization has several disadvantages: vulnerability to adversarial attacks, worse performance on number- and character-level tasks, biases, and additional complexity.

On the Credit Assignment Problem

Unlike behavior cloning, which copies the average demonstration as-is, reinforcement learning aims to reinforce good actions and discourage bad ones. To do that effectively, we need to know which actions were good, which were bad, and which were irrelevant. This is essentially the credit assignment problem (CAP). This essay examines three dimensions of uncertainty—depth, breadth, and density—and presents a taxonomy of methods for addressing them.

Are Scaling Laws Slowing Down?

Recent discussions suggest that Scaling Laws in AI might be slowing. Is that true, and does this mean innovation is hitting a ceiling?

Preference RL Is Hard: Whose Preference?

Reinforcement Learning from Human Feedback is the dominant paradigm for aligning large language models to human preferences and values. At its core lies a simple idea: learn what humans prefer by having them compare outputs, then train a model to maximize those preferences. But beneath this simple idea is a fundamental problem—whose preferences are we actually capturing?

Representational Efficiency in Deep Learning

Massive amounts of effort and money have gone into scaling (compute, data, parameters) and into architectural and efficiency tweaks. All of this is valuable, but it also raises an uncomfortable question:

If our deep neural networks are already “universal approximators,” why does it still feel like we are brute-forcing representation—sometimes achieving good performance without learning good structure? Why do LLMs need essentially all the text on the web to learn basic math? Why does it take millions of training images to learn to generate a hand with five fingers?
This essay argues that the bottleneck is representational efficiency more than anything else. The headline claim of “universal approximation” is correct but incomplete.

Decoupled Transformer

How much Attention is really needed?

Knowledge Distillation - Can a student exceed its teacher's performance?

Knowledge distillation (KD) transfers information from teacher models to a student, and is often used in a same-capacity setting to improve a model by distilling dark knowledge. We formalize same-size logit distillation with temperature scaling, analyze why naïve same-size KD is frequently teacher-bounded, and propose Reliability-Weighted Knowledge Distillation (WKD), a per-example weighting scheme that downweights incorrect teachers while preserving logit scale. Experiments on SQuAD v2.0 show that WKD improves F1 and EM over standard KD.
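
A minimal sketch of per-example weighted logit distillation under simple assumptions: the teacher is judged reliable when its argmax matches the gold label, and unreliable teachers are downweighted rather than dropped. The weighting rule and the wrong_weight parameter are illustrative, not necessarily WKD's exact scheme.

```python
import torch
import torch.nn.functional as F

def weighted_kd_loss(student_logits, teacher_logits, labels, T=2.0, wrong_weight=0.1):
    """Per-example weighted logit distillation (illustrative sketch).

    student_logits, teacher_logits: [batch, num_classes]
    labels: [batch] gold class indices
    Examples where the teacher's argmax disagrees with the gold label are
    downweighted instead of being trusted at full strength.
    """
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    # Per-example KL(teacher || student), scaled by T^2 so gradient
    # magnitudes stay comparable to the hard-label loss.
    kl = F.kl_div(log_p_student, p_teacher, reduction="none").sum(-1) * T * T
    teacher_correct = teacher_logits.argmax(-1).eq(labels).float()
    weights = teacher_correct + wrong_weight * (1.0 - teacher_correct)
    return (weights * kl).mean()
```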

ROaD-Electra: Robustly Optimized and Distilled Electra

ROaD-Electra (Robustly Optimized and Distilled Electra) is a state-of-the-art encoder-only foundational language model trained with multitask pre-training and knowledge distillation to improve transformer performance, generalization, and robustness.

Part 1: Introducing ROaD-Electra.

Part 2: Improving ROaD-Electra with multitask pre-training and knowledge distillation.

Gaussian Label Smoothing

Label Smoothing regularizes models by replacing one-hot targets with soft labels, reducing over-confidence and improving generalization. However, uniform smoothing assumes all incorrect classes are equally likely, which is poorly suited to sequence tasks with positional labels. Gaussian Label Smoothing (GLS) addresses this by replacing the uniform noise distribution with a Gaussian centered on the gold position, assigning higher probability to nearby tokens and better reflecting task structure.
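
A minimal sketch of the idea, assuming a task with positional labels such as span-start prediction; epsilon and sigma are illustrative hyperparameters.

```python
import numpy as np

def gaussian_smoothed_targets(gold_position, seq_len, epsilon=0.1, sigma=1.0):
    """Soft targets for one positional label.

    Instead of spreading the smoothing mass epsilon uniformly over all
    positions, spread it with a Gaussian centered on the gold position,
    so nearby positions receive more probability than distant ones.
    """
    positions = np.arange(seq_len)
    weights = np.exp(-0.5 * ((positions - gold_position) / sigma) ** 2)
    weights /= weights.sum()
    one_hot = np.eye(seq_len)[gold_position]
    return (1.0 - epsilon) * one_hot + epsilon * weights  # sums to 1
```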

How the Academic Division of AI Could Limit AGI Progress

A recurring failure mode in AI research is task isolation: we optimize narrowly scoped problems (language, vision, planning, or control) as if these faculties were separable in the real world. This produces systems that are statistically competent on benchmarks yet brittle under distribution shift, weak at commonsense inference, and poor at connecting perception to action.
