My 2025 So Far (part 1): Reasoning LLMs, MLA, and Summiting Everest

My sabbatical so far has been great. I considered starting a startup, did research on reasoning in LLMs and Multi-Head Latent Attention (MLA), built an autonomous SWE agent with a model I trained, and climbed Everest and Ama Dablam. Here are some quick updates.

Towards Tokenization-Free Language Models: Research Trends

Tokenization is useful because it significantly decreases inference and training costs. However, tokenization has several disadvantages.
This article explores the problems of tokenization and delves into emerging research on tokenization-free approaches.

So the scaling law is slowing… So what…

Recent discussions suggest that scaling laws in AI might be slowing. But does this mean innovation is hitting a ceiling?
This article explores the scaling law, the laws governing advancement in technology, and human nature…

Enhancing LLMs with high-quality, diverse datasets, and factuality

Can we close the gap to GPT-4 with a smaller model?
A SoTA LLM at home, powered by high-quality data, diverse datasets and objectives, and fine-tuned for factuality.

Decoupled Transformer

How much Attention is really needed?

Knowledge Distillation Part 1

Part 1 - Large Model to Large Model: Can a student model exceed the teacher's performance?

ROaD-Electra: Robustly Optimized and Distilled Electra 2

Building the best Base-sized Transformer model for MRC, NLI, and NLU.

ROaD-Electra: Robustly Optimized and Distilled Electra 1

Exploring Multi-Task pre-training and a new variant of Knowledge Distillation to build new state-of-the-art Transformer models for MRC, NLI, and NLU.

New Models and Old Tricks

It seems tricks (e.g., Data Augmentation, Label Smoothing, Mixout) are approaching their limits for improving SoTA models on SQuAD 2.0 and NQA.

Fast way to count zeros

Population count is the operation of counting the number of ones in a bit string. It is used in many applications, such as Hamming distance, cardinality counts in bit arrays, Binary Neural Networks, and m...
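
As a minimal sketch in Python (illustrative only, not code from the post; `popcount` and `hamming_distance` are assumed names), the count can be computed with Kernighan's clear-the-lowest-set-bit trick, and Hamming distance then falls out of an XOR followed by a popcount:

```python
def popcount(x: int) -> int:
    """Count the one bits in a non-negative integer."""
    count = 0
    while x:
        x &= x - 1  # Kernighan's trick: clears the lowest set bit each pass
        count += 1
    return count

def hamming_distance(a: int, b: int) -> int:
    """Bits that differ between a and b: XOR, then count the ones."""
    return popcount(a ^ b)

assert popcount(0b1011) == 3
assert hamming_distance(0b1011, 0b1001) == 1  # the two differ in one bit
# Python 3.10+ ships the same operation built in as int.bit_count().
```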

What is understanding - An AI Perspective

In 2013 I faced this question. I took a week off from my startup and thought it through: what is understanding? Here is what I came up with…

How the academic division of AI limits AI progress

One of the fundamental problems with current AI approaches is that they typically throw out much of the structure of the world before they start.

Gaussian Label Smoothing

The generalization of neural networks can often be improved by label smoothing, which uses soft targets that are a weighted average of the hard targets and the uniform distribution over labels. ...
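
For reference, the standard scheme the excerpt describes is y_smooth = (1 - eps) * y + eps / K over K classes; here is a minimal NumPy sketch (illustrative only; the post's Gaussian variant is not shown, and `eps` is an assumed hyperparameter name):

```python
import numpy as np

def smooth_labels(one_hot: np.ndarray, eps: float = 0.1) -> np.ndarray:
    """Mix one-hot targets with the uniform distribution over labels."""
    k = one_hot.shape[-1]  # number of classes
    return (1.0 - eps) * one_hot + eps / k

hard = np.array([0.0, 1.0, 0.0, 0.0])
print(smooth_labels(hard))  # [0.025 0.925 0.025 0.025]
```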
