Highlights:

  • Introduces AdamHD (AdamHuberDecay), a drop-in replacement for AdamW with Huber regularization.
  • Boosts pre-training efficiency for large language models like GPT-2 and GPT-3.
  • Achieves 10–15% faster convergence and 20–30% memory savings via post-training magnitude pruning.
  • Provides stronger robustness to gradient outliers and large-batch regimes.

TLDR:

Researchers Fu-Ming Guo and Yingfang Fan proposed AdamHD, a new optimization algorithm that replaces traditional weight decay with a Huber-based regularization scheme. It delivers faster convergence, improved sparsity, and enhanced stability when pre-training large generative models like GPT-3.

Optimizing large language models is one of the most computationally demanding steps in artificial intelligence research. Adaptive optimizers such as AdamW have been the backbone of transformer pre-training for architectures like GPT-2 and GPT-3, but their decoupled weight decay amounts to a quadratic \( \ell_2 \) penalty, which can over-penalize large-magnitude parameters and destabilize updates when gradients are noisy or outlier-heavy. Addressing these issues, researchers Fu-Ming Guo ([link](https://arxiv.org/search/cs?searchtype=author&query=Guo,+F)) and Yingfang Fan ([link](https://arxiv.org/search/cs?searchtype=author&query=Fan,+Y)) have introduced a new optimizer, AdamHD, presented in the paper **AdamHD: Decoupled Huber Decay Regularization for Language Model Pre-Training**, released in November 2025.

AdamHD, short for AdamHuberDecay, redefines how parameter decay is handled by replacing the traditional squared decay function with a smooth Huber regularizer. Parameters are decayed quadratically while their magnitude stays below a threshold δ, but only linearly once they exceed it. This hybrid strategy caps the decay force applied to extreme weights while behaving like standard weight decay for well-behaved parameters. The key benefits include bounded regularization gradients, improved stability under noisy updates, and a stronger push toward sparsity for oversized weights.
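For concreteness, the standard Huber penalty applied elementwise to a weight \( \theta \), and its gradient, can be written as follows (the paper's exact scaling and parameterization may differ):

\[
H_\delta(\theta) =
\begin{cases}
\tfrac{1}{2}\,\theta^2, & |\theta| \le \delta,\\[4pt]
\delta\!\left(|\theta| - \tfrac{1}{2}\delta\right), & |\theta| > \delta,
\end{cases}
\qquad
\frac{\partial H_\delta}{\partial \theta} =
\begin{cases}
\theta, & |\theta| \le \delta,\\[4pt]
\delta\,\operatorname{sign}(\theta), & |\theta| > \delta.
\end{cases}
\]

The gradient is what drives the decay step: it grows linearly with the weight up to \( \delta \) and is capped at \( \pm\delta \) beyond it, which is what bounds the regularization force exerted on outlier weights.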

The experimental results are striking. On large-scale GPT-2 and GPT-3 pre-training tasks, AdamHD converged 10–15% faster in wall-clock time, reduced validation perplexity by as much as 4 points, and improved downstream task performance by up to 4.7%. Notably, the resulting weight distributions were markedly sparser, enabling 20–30% memory savings through magnitude pruning, all without hyperparameter tuning beyond the defaults used for AdamW. These gains suggest that AdamHD could become a new standard for large-scale foundation model training as researchers push the limits of transformer scalability.

From a technical perspective, the authors derived a closed-form Huber decay update rule that integrates seamlessly with existing Adam-family optimizers at just \( O(1) \) computational overhead. This means developers can adopt AdamHD as a direct replacement in current pipelines with minimal engineering modification. The paper further includes rigorous theoretical analysis bounding expected parameter norms under noisy gradient conditions, underscoring its mathematical elegance and robustness for massive distributed learning environments.
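To make the "drop-in replacement" claim concrete, below is a minimal PyTorch-style sketch of a single parameter update with decoupled Huber decay. It is not the authors' reference implementation: the moment estimates follow the standard AdamW recipe, the hyperparameter names and defaults (e.g. `huber_delta`) are placeholders, and only the final decay term differs from AdamW, clamping the weight to \( \pm\delta \) before applying the decoupled decay step.

```python
import torch

@torch.no_grad()
def adam_huber_decay_step(param, exp_avg, exp_avg_sq, step, *,
                          lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                          weight_decay=0.1, huber_delta=1e-2):
    """One parameter update with decoupled Huber decay (illustrative sketch).

    The moment updates below are the standard AdamW recipe; only the final
    decay term differs, using the bounded Huber gradient instead of the raw
    weight. Hyperparameter names and defaults are assumptions, not the
    paper's reference values.
    """
    grad = param.grad
    beta1, beta2 = betas

    # Standard Adam first/second moment estimates with bias correction.
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    bias_correction1 = 1 - beta1 ** step
    bias_correction2 = 1 - beta2 ** step
    denom = (exp_avg_sq / bias_correction2).sqrt().add_(eps)
    param.addcdiv_(exp_avg / bias_correction1, denom, value=-lr)

    # Decoupled Huber decay: the penalty gradient equals the weight itself
    # while |w| <= delta and is capped at +/- delta beyond that, so the
    # decay force on outlier weights stays bounded.
    decay_grad = param.clamp(min=-huber_delta, max=huber_delta)
    param.add_(decay_grad, alpha=-lr * weight_decay)
```

Because the decay gradient is just a clamp of the weight tensor, each step adds only a constant amount of elementwise work on top of AdamW, consistent with the \( O(1) \) overhead the authors report.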

Source:

Original research paper: Fu-Ming Guo, Yingfang Fan. ‘AdamHD: Decoupled Huber Decay Regularization for Language Model Pre-Training.’ arXiv:2511.14721 [cs.LG], 18 November 2025. https://doi.org/10.48550/arXiv.2511.14721
