My 2025 So Far (part 1): Reasoning LLMs, MLA, and Summiting Everest

My sabbatical so far has been great. I considered starting a startup, did research on reasoning in LLMs and Multi-Head Latent Attention (MLA), built an autonomous SWE agent with a model I trained, and climbed Everest and Ama Dablam. Here are some quick updates.

Towards Tokenization-Free Language Models: Research Trends

Tokenization is useful because it significantly decreases inference and training costs. However, tokenization has several disadvantages.
This article explores the problems of tokenization and delves into emerging research on tokenization-free approaches.

So the scaling law is slowing… So what…

Recent discussions suggest that scaling laws in AI might be slowing. But does this mean innovation is hitting a ceiling?
This article explores the scaling law, the laws governing advancement in technology, and human nature…

Enhancing LLMs with high-quality, diverse datasets, and factuality

Can we close the gap to GPT-4 with a smaller model?
A SoTA LLM at home, powered by high-quality data, diverse datasets and objectives, and fine-tuned for factuality.

Decoupled Transformer

How much Attention is really needed?

Knowledge Distillation Part 1

Part 1 - Large Model to Large Model: Can a student model exceed the teacher's performance?

ROaD-Electra: Robustly Optimized and Distilled Electra 2

Building the best Base-sized Transformer model for MRC, NLI, and NLU.

ROaD-Electra: Robustly Optimized and Distilled Electra 1

Exploring Multi-Task pre-training and a new variant of Knowledge Distillation to build new state-of-the-art Transformer models for MRC, NLI, and NLU.

New Models and Old Tricks

It seems tricks (e.g., Data Augmentation, Label Smoothing, Mixout) are approaching their limits for improving SoTA models on SQuAD 2.0 and NQA.

Fast way to count zeros

Population count is the operation of counting the number of ones in a bit string. It is used in many applications, such as Hamming distance, cardinality counts in bit arrays, Binary Neural Networks, and m...
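
As a minimal sketch in Python (illustrative only, not code from the post; `popcount` and `hamming_distance` are assumed names), the count can be computed with Kernighan's clear-the-lowest-set-bit trick, and Hamming distance then falls out of an XOR followed by a popcount:

```python
def popcount(x: int) -> int:
    """Count the one bits in a non-negative integer."""
    count = 0
    while x:
        x &= x - 1  # Kernighan's trick: clears the lowest set bit each pass
        count += 1
    return count

def hamming_distance(a: int, b: int) -> int:
    """Bits that differ between a and b: XOR, then count the ones."""
    return popcount(a ^ b)

assert popcount(0b1011) == 3
assert hamming_distance(0b1011, 0b1001) == 1  # the two differ in one bit
# Python 3.10+ ships the same operation built in as int.bit_count().
```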

What is understanding - An AI Perspective

In 2013 I faced this question. I took a week off from my startup and thought it through: what is understanding? Here is what I came up with…

How the academic division of AI limits AI progress

One of the fundamental problems with current AI approaches is that they typically throw out much of the structure of the world before they start.

Gaussian Label Smoothing

The generalization of neural networks can often be improved by label smoothing, which uses soft targets that are a weighted average of the hard targets and the uniform distribution over labels. ...
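
For reference, the standard scheme the excerpt describes is y_smooth = (1 - eps) * y + eps / K over K classes; here is a minimal NumPy sketch (illustrative only; the post's Gaussian variant is not shown, and `eps` is an assumed hyperparameter name):

```python
import numpy as np

def smooth_labels(one_hot: np.ndarray, eps: float = 0.1) -> np.ndarray:
    """Mix one-hot targets with the uniform distribution over labels."""
    k = one_hot.shape[-1]  # number of classes
    return (1.0 - eps) * one_hot + eps / k

hard = np.array([0.0, 1.0, 0.0, 0.0])
print(smooth_labels(hard))  # [0.025 0.925 0.025 0.025]
```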
