My sabbatical so far has been great. I considered starting a startup, did research on reasoning in LLMs and Multi-Head Latent Attention (MLA), built an autonomous SWE agent with a model I trained, and climbed Everest and Ama Dablam. Here are some quick updates.
Reasoning in LLMs - H1 2025
Inducing reasoning in LLMs, which is a form of inference-time scaling, is all the rage now. If you’re new to this, this article provides an introduction.
I first investigated inference-time scaling in 2023, before it was mentioned explicitly in papers. The idea came as a natural extension of two earlier works, (1) Training Compute-Optimal Large Language Models and (2) Symphony: Learning Realistic and Diverse Agents for Autonomous Driving Simulation, and of my work in the AV space (i.e., scaling and improving AV behavior/agent simulation).
Fast forward to late 2024: I became interested in the economics of reasoning in models like DeepSeek-R1, so I decided to investigate it. I mainly sought to answer the following questions:
- How can we reduce the number of reasoning tokens? Can we make the LLM automatically adjust its reasoning budget—from 0 to a few thousand—based on the question complexity?
- For distillation-based training, is one form of reasoning trace better than the others? And what is the best training strategy to induce reasoning: SFT, DPO, or KTO?
(1) Reducing Overthinking in LLMs:
Models like DeepSeek-R1 and its derivatives (e.g., DeepSeek-R1-Distill-Qwen-32B) can generate several thousand reasoning tokens before answering, even for simple questions like ‘What is 5 + 2?’.
I used DeepSeek-R1-Distill-Qwen-32B in all my experiments. Initially, I investigated prompting, SFT, and DPO to reduce overthinking. While DPO showed strong results, I was aiming for more. After investigating DPO’s behavior, I came up with an alternative objective that takes accuracy and generation length into account directly and formulated it as a PPO/GRPO-style optimization. This approach, which I call Length & Accuracy Policy Optimization (LAPO), performed significantly better at reducing overthinking while maintaining model performance, with a hyperparameter for controlling the length-accuracy trade-off.
Let’s describe the process and the approaches:
Prompt: I instructed the model to solve the problem as quickly as possible.
SFT, DPO, and LAPO: I sampled 16 answers for each prompt. The training data consisted of 15K questions (5K coding from APPS and TACO; 10K math from the AIME, MATH, and Olympiad subsets of NuminaMATH).
- SFT: I selected the two shortest correct solutions per problem. Training was done using a standard SFT pipeline.
- DPO: I selected the two shortest correct responses as preferred and the longest as rejected (see the sketch below).
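Here is a minimal sketch of how the SFT and DPO examples could be assembled from the 16 sampled answers per prompt; the function, field names, and the `is_correct` checker are illustrative, not my exact pipeline.

```python
# Illustrative selection of SFT/DPO training examples from sampled answers.
def build_examples(prompt, samples, is_correct):
    """samples: list of dicts like {"text": str, "num_tokens": int}."""
    # Keep only correct answers, ordered from shortest to longest.
    correct = sorted(
        (s for s in samples if is_correct(prompt, s["text"])),
        key=lambda s: s["num_tokens"],
    )
    if len(correct) < 2:
        return None  # skip prompts without enough correct solutions

    longest = max(samples, key=lambda s: s["num_tokens"])

    # SFT: the two shortest correct solutions.
    sft_examples = [{"prompt": prompt, "completion": s["text"]} for s in correct[:2]]
    # DPO: shortest correct responses preferred, longest response rejected.
    dpo_examples = [
        {"prompt": prompt, "chosen": s["text"], "rejected": longest["text"]}
        for s in correct[:2]
    ]
    return sft_examples, dpo_examples
```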
For evaluation, I used AIME2024, MATH500, and GPQA-Diamond.
LAPO:
Thinking from first principles, we can formulate our objective as a constrained optimization problem: we want to reduce the solution length of the policy model relative to that of the reference model, subject to the constraint that the model’s accuracy stays the same or improves. In simple terms, this could look like this:
Maximize: (Num_of_Token(ref_y) / Num_of_Token(y)) - 1
Subject to: Accuracy(x, y) >= Accuracy(x, ref_y)
Instead of handling the constraint in the optimization directly, we incorporate the accuracy term into the loss function. Since Num_of_Token and Accuracy are not differentiable, we solve this with a policy-gradient approach, sampling from the pre-collected data exactly like DPO. The new objective looks like this:
[(Num_of_Token(ref_y) / Num_of_Token(y)) - 1] + λ(Accuracy(x, y) − Accuracy(x, ref_y))
To finish things up, we apply clipping to the final loss function like in the PPO algorithm.
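Here is a minimal sketch of what the LAPO objective could look like in PyTorch, under the assumption that log-probabilities are summed per response and that the pre-collected samples come from the reference model; all names are illustrative.

```python
import torch

def lapo_loss(policy_logprobs, ref_logprobs, y_len, ref_len, y_acc, ref_acc,
              lam=0.5, clip_eps=0.2):
    """Illustrative LAPO objective (names and signature are assumptions).
    - policy_logprobs / ref_logprobs: summed token log-probs of the sampled
      response under the current policy and the frozen reference model.
    - y_len / ref_len: number of generated tokens in the sampled vs. reference response.
    - y_acc / ref_acc: correctness (1.0 / 0.0) of the sampled vs. reference response.
    """
    # Reward: relative length reduction plus lambda-weighted accuracy difference.
    reward = (ref_len / y_len - 1.0) + lam * (y_acc - ref_acc)

    # PPO-style importance ratio between the current policy and the reference
    # (the data was sampled offline, DPO-style, from the reference model).
    ratio = torch.exp(policy_logprobs - ref_logprobs)

    # Clipped surrogate objective as in PPO; we maximize it, so return the negative.
    unclipped = ratio * reward
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * reward
    return -torch.min(unclipped, clipped).mean()
```

The λ hyperparameter is the knob mentioned above: larger values weight accuracy preservation more heavily, while smaller values push harder on length reduction.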
Here are the results:
| Model | AIME2024 Length | AIME2024 Accuracy | MATH500 Length | MATH500 Accuracy | GPQA-Diamond Length | GPQA-Diamond Accuracy |
|---|---|---|---|---|---|---|
| Baseline (DeepSeek-R1-Distill-Qwen-32B) | 9178 | 72.6 | 2013 | 90.8 | 5208 | 62.6 |
| Prompt | 7430 (-19%) | 72.4 | 1891 (-6%) | 90.7 | 4239 (-19%) | 61.4 |
| SFT | 8395 (-8.5%) | 72.5 | 1931 (-4%) | 90.7 | 4395 (-16%) | 61.5 |
| DPO | 4981 (-46%) | 72.5 | 1543 (-23%) | 90.9 | 2322 (-55%) | 62.3 |
| LAPO (ours) | 4143 (-55%) | 72.6 | 1322 (-34%) | 91.2 | 1867 (-64%) | 62.8 |
Note (update): After I concluded my research, the Berkeley Sky Computing Lab proposed reducing overthinking by rewriting the responses using another LLM, improving the sampling for DPO/SimPO training, and adding a length-normalization term to the SimPO optimization objective. However, I believe the GRPO/PPO optimization objective is more powerful, and it includes a tunable hyperparameter to control the trade-off between length and accuracy.
(2) Which Reasoning Traces to Use for Distillation:
When I first started investigating overthinking, I began by distilling from a very large LM (DeepSeek-R1) into the Qwen 2.5 family of models (3B, 7B, 14B, and 32B). Initially, I rewrote the responses using another LLM (e.g., Qwen and GPT-4) to remove wrong steps, redundant double-checking, and unnecessary tokens. My hypothesis: high-quality, structured traces would help teach efficient reasoning.
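For illustration, such a compression step can be a single rewrite call to another LLM. This is a hedged sketch using the OpenAI Python client; the model name and prompt wording are placeholders rather than my exact setup.

```python
# Illustrative trace-compression step via an OpenAI-compatible chat API.
from openai import OpenAI

client = OpenAI()

REWRITE_PROMPT = (
    "Rewrite the following reasoning trace. Remove wrong steps, redundant "
    "double-checking, and filler tokens, but keep every step needed to reach "
    "the final answer.\n\nQuestion:\n{question}\n\nTrace:\n{trace}"
)

def compress_trace(question: str, trace: str, model: str = "gpt-4") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": REWRITE_PROMPT.format(question=question, trace=trace)}],
        temperature=0.0,
    )
    return response.choices[0].message.content
```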
It didn’t work as well as I expected. While models trained with SFT (similar to DeepSeek’s distillation approach but at a significantly smaller scale) did improve, the improvements were not substantial. Then a pattern started to emerge:
- Smaller LLMs (3B and 7B) improved more with the compressed reasoning traces than with the original traces.
- Bigger LLMs (32B) improved more with the original reasoning traces than with the compressed ones.
- Compressed traces did significantly reduce overthinking compared to the original traces.
This is, of course, not a comprehensive study of reasoning traces; I didn’t have time to pursue other important questions like:
- What is the impact of teacher model size on student performance?
- How does mixing traces from multiple models affect training based on question complexity?
- How does the source (e.g., Gemini vs. DeepSeek) influence results?
In part 2, I will present results for the remaining questions:
- What is the best training strategy to induce reasoning—SFT vs. DPO vs. KTO?
- Can we make the LLM automatically adjust its reasoning budget—from 0 to a few thousand—based on the question complexity?
Multi-Head Latent Attention (MLA) - H1 2025
Multi-Head Latent Attention (MLA) is a variant of the standard Multi-Head Attention (MHA) mechanism, designed to improve scalability, especially for long or high-dimensional inputs. MLA addresses the memory bottleneck caused by the Key-Value (KV) cache: while MHA computes attention using distinct Query (Q), Key (K), and Value (V) projections for each attention head and stores the full KV matrices, MLA compresses the KV information into a smaller latent vector. This compressed representation is then used to reconstruct the keys and values when needed, significantly reducing the memory footprint. Essentially, MLA trades increased computation during the attention calculation for reduced memory footprint and bandwidth requirements, enabling models to handle longer sequences and larger batch sizes with improved performance. Additionally, the latent vectors act as an information bottleneck, which potentially improves the generalization and expressiveness of the attention mechanism. For more info, refer to the DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model paper. The main question I was trying to answer in my research:
Given MLA’s advantages, can we retrofit trained models with MLA to reduce memory and improve performance?
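Before getting into the retrofit, here is a toy sketch of the MLA idea described above: only a small latent is cached, and keys and values are reconstructed from it on the fly. The dimensions and names are illustrative, and RoPE and the projection-absorption trick are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMLA(nn.Module):
    """Toy sketch of MLA-style KV compression (dims illustrative; RoPE omitted)."""
    def __init__(self, d_model=1024, n_heads=8, d_latent=128):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        # Down-project hidden states into a small latent; only this is cached.
        self.kv_down = nn.Linear(d_model, d_latent)
        # Up-project the latent back into per-head keys and values when needed.
        self.k_up = nn.Linear(d_latent, d_model)
        self.v_up = nn.Linear(d_latent, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, kv_cache=None):
        b, t, _ = x.shape
        latent = self.kv_down(x)                       # (b, t, d_latent), cached
        if kv_cache is not None:
            latent = torch.cat([kv_cache, latent], dim=1)
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        o = F.scaled_dot_product_attention(q, k, v, is_causal=kv_cache is None)
        o = o.transpose(1, 2).reshape(b, t, -1)
        return self.out(o), latent                     # latent is the new KV cache
```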
I used Qwen 2.5 7B to ensure correctness and feasibility, then applied the procedure to Qwen 2.5 32B. I replaced the GQA layers with MLA layers, targeting a 50% reduction in KV cache. Replacing the attention layers poses two challenges:
- Positional Embedding: In the original GQA layers, every query-key head carries its own Rotary Positional Embedding (RoPE). I opted to concentrate the positional signal in K into a small subset of dimensions. The remaining dimensions contain little positional content; we drop their RoPE and merge them with V for low-rank decomposition. Once RoPE is isolated, the key up-projection can be absorbed into the query projection exactly as in the DeepSeek paper, which enables seamless MLA conversion.
- Training: Since the new MLA layers are untrained, we first have to train them to reproduce the output of the old (GQA) layers while freezing the rest of the model. After some experimentation, I landed on a three-step approach (a sketch of the first step follows below):
- 25% of the training compute: We train each MLA layer independently to imitate its GQA counterpart using an MSE loss, while freezing everything else (e.g. MLP). This phase bootstraps the attention weights; both attention layers are executed in the forward pass to compute the MSE loss.
- 50% of the training compute: We remove the old GQA layers and fine-tune the MLA layers while freezing everything else (e.g. MLP).
- 25% of the training compute: We fine-tune the entire model using a standard diverse instruction-following dataset.
The dataset I used for the 3 steps is HuggingFace SmolTalk.
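As referenced above, here is a minimal sketch of the first step: layer-wise imitation of the frozen GQA layers with an MSE loss. The layer interfaces are simplified and illustrative; real attention modules also take attention masks and position ids.

```python
import torch
import torch.nn.functional as F

def step1_loss(hidden_states, frozen_gqa_layers, new_mla_layers):
    """hidden_states[i]: input to attention layer i for the current batch.
    Both layer lists are assumed to be callables (b, t, d_model) -> (b, t, d_model)."""
    total = 0.0
    for h, gqa, mla in zip(hidden_states, frozen_gqa_layers, new_mla_layers):
        with torch.no_grad():
            target = gqa(h)      # output of the frozen original GQA attention
        pred = mla(h)            # output of the trainable MLA replacement
        total = total + F.mse_loss(pred, target)
    return total
```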
Here are the results using Qwen 2.5 32B:
- 48% memory reduction.
- 35% faster token generation.
- +1.5% on MATH
- +1.2% on MMLU-Pro
- +2.0% on HumanEval+
- +2.1% on IFEval
This confirms that MLA is superior to GQA in memory footprint, bandwidth, and performance.
Everest
One of my sabbatical objectives was to conclude my alpine climbing career, which began about five years ago.
After climbing Ama Dablam, a 22,349ft technical rock mountain in the Himalayas, in October 2024, I rested and then restarted my training specifically for Everest. As with all of my expeditions, I trained mostly at my home gym, plus 3,000ft hikes in Cupertino with weights, and hiked the 4,000ft Pyramid Peak near Lake Tahoe several times when it was cold and snowy. Before leaving for Nepal, I stayed in Fairplay, CO (11,600ft) for about a week and climbed Mt. Pennsylvania (13,013ft) to acclimatize.
Climbing Everest took ~42 days, including the 40mi trek to Everest Base Camp. The climbing itself was as hard as I expected, and yes, it’s harder than Denali; now I can settle this debate once and for all 😀. Probably the hardest day was going from base camp to Camp 1 during the rotation, because that day was long and extremely hot (yes, hotter than Denali).
I summited Everest on May 19th at 6:45 am. Like the other 8,000ers I climbed, I used oxygen above 24,000ft at a normal flow rate of 1 to 2 liters/min. It goes without saying that this wouldn’t have been possible without the Sherpas and the base camp team.
I’m happy to finally hang up my ice axe and retire from mountaineering after five years of a fulfilling and successful career, with about 20 expeditions, climbs, and solo ascents behind me.