On Offline RL - Part 3

Recent advancements that could make offline RL more feasible

My 2025 So Far (part 1): Reasoning LLMs, MLA, and Summiting Everest

My sabbatical so far has been great. I considered starting a startup, did research on reasoning in LLMs and Multi-Head Latent Attention (MLA), built an autonomous SWE agent with a model I trained, and climbed Everest and Ama Dablam. Here are some quick updates.

On Offline RL - Part 2

If Offline RL is great, why is it not widely used?

On Offline RL - Part 1

What is Offline RL? How is it different from BC? How does Offline RL work?

Mitigations for Reward Hacking

Reward is central to reinforcement learning, but it is often hard to define and engineer, and easy to hack.

Towards Tokenization-Free Language Models: Research Trends

Tokenization is useful because it significantly decreases inference and training costs. However, it also has several disadvantages.
This article explores the problems of tokenization and delves into emerging research on tokenization-free approaches.

On the Credit Assignment Problem

Unlike behavior cloning, which copies the average demonstration as-is, reinforcement learning aims to reinforce good actions and discourage bad ones. To do that effectively, we need to know which actions were good, which were bad, and which were irrelevant. This, in essence, is the credit assignment problem (CAP).

So the scaling law is slowing… So what…

Recent discussions suggest that Scaling Laws in AI might be slowing. But does this mean innovation is hitting a ceiling?
This article explores the scaling law, laws governing advancement in technology, and human nature…

Preference RL Is Hard: Whose Preference?

Reinforcement Learning from Human Feedback is the dominant paradigm for aligning large language models to human preferences and values. At its core lies a simple idea: learn what humans prefer by having them compare outputs, then train a model to maximize those preferences. But beneath this simple idea lies a fundamental problem: whose preferences are we actually capturing?

Enhancing LLMs with high-quality, diverse datasets, and factuality

Can we close the gap to GPT-4 with a smaller model?
A SoTA LLM at home, powered by high-quality data, diverse datasets and objectives, and fine-tuned for factuality.

Representational Efficiency in Deep Learning

Despite massive scaling, neural networks often rely on brute force, learning performance without efficiently capturing structure. The real bottleneck isn’t expressivity but representational efficiency—how compactly and learnably structure can be encoded. Universal approximation guarantees possibility, not practical efficiency in data, parameters, or training.

Decoupled Transformer

How much Attention is really needed?

Knowledge Distillation Part 1

Part 1 - Large Model to Large Model: Can a student model exceed the teacher's performance?

ROaD-Electra: Robustly Optimized and Distilled Electra 2

Building the best base-sized Transformer model for MRC, NLI, and NLU.

ROaD-Electra: Robustly Optimized and Distilled Electra 1

Exploring multi-task pre-training and a new variant of Knowledge Distillation to build new state-of-the-art Transformer models for MRC, NLI, and NLU.

New Models and Old Tricks

It seems that tricks (e.g., data augmentation, label smoothing, Mixout) are approaching their limits for improving SoTA models on SQuAD 2.0 and NQA.

Fast way to count zeros

Population count is the procedure of counting the number of ones in a bit string. It is used in many applications such as Hamming distance, cardinality counts in Bitarray, Binary Neural Networks, and m...
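
For a rough intuition, here is a minimal Python sketch of population count and of counting zeros as width minus popcount; the `popcount` helper name is my own for illustration, and the post itself covers faster, lower-level approaches.

```python
def popcount(x: int) -> int:
    """Count the number of set bits (ones) in the binary representation of x."""
    count = 0
    while x:
        x &= x - 1  # clear the lowest set bit (Kernighan's trick)
        count += 1
    return count

# Counting zeros in a fixed-width bit string is then width - popcount(x).
assert popcount(0b10110100) == 4
assert 8 - popcount(0b10110100) == 4  # zeros in an 8-bit string
```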

What is understanding - AI Perspective

In 2013 I faced this question. I took a week off from my startup and thought about it: what is understanding? Here is what I came up with…

How the academic division of AI limits AI progress

One of the fundamental problems with current AI approaches is that they typically throw out much of the structure of the world before they start.

Gaussian Label Smoothing

The generalization of neural networks can often be improved by label smoothing, which uses soft targets that are a weighted average of the hard targets and the uniform distribution over labels. ...
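
As a reference point, here is a small sketch of standard (uniform) label smoothing, i.e. soft targets as a weighted average of one-hot targets and the uniform distribution; the `smooth_labels` helper is my own illustrative name, and the Gaussian variant discussed in the post replaces the uniform component with a different weighting.

```python
import numpy as np

def smooth_labels(hard_targets: np.ndarray, epsilon: float = 0.1) -> np.ndarray:
    """hard_targets: (batch, num_classes) one-hot array."""
    num_classes = hard_targets.shape[-1]
    uniform = np.full_like(hard_targets, 1.0 / num_classes, dtype=float)
    return (1.0 - epsilon) * hard_targets + epsilon * uniform

one_hot = np.eye(4)[[2]]        # single example, class 2 of 4
print(smooth_labels(one_hot))   # [[0.025 0.025 0.925 0.025]]
```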
