The Implicit Bias of Adam and Muon on Smooth Homogeneous Neural Networks

New research establishes that momentum-based optimizers, including Adam and Muon, exhibit an implicit bias toward maximizing specific geometric margins when training smooth homogeneous neural networks. The study proves these algorithms converge to Karush-Kuhn-Tucker points of margin maximization problems, with experiments confirming that the choice of optimizer determines whether the ℓ₂ margin, the ℓ∞ margin, or a hybrid-norm margin is maximized. This work bridges implicit bias theory with the analysis of momentum optimizers, explaining why different algorithms select different solutions from identical training data.

Implicit Bias of Momentum Optimizers in Homogeneous AI Models Revealed

New research provides a unifying theoretical framework for understanding the implicit bias of momentum-based optimizers, such as MomentumGD and Adam, when training homogeneous machine learning models. The study, detailed in the preprint arXiv:2602.16340v2, extends prior work on steepest descent to prove that these popular algorithms inherently drive models toward solutions that maximize specific geometric margins, a critical factor influencing generalization. This finding establishes a direct link between an optimizer's algorithmic mechanics and the properties of the solution it ultimately learns.

Extending Steepest Descent Theory to Momentum Methods

The research first generalizes existing implicit bias results for standard steepest descent to its normalized variant, even when paired with a decaying learning rate schedule. The core result shows that for smooth homogeneous models, a class that covers many modern deep neural network architectures, momentum-driven algorithms approximate the trajectories of steepest descent under specific conditions. This proves they share a similar bias toward converging to Karush-Kuhn-Tucker (KKT) points of an associated margin maximization problem.
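
For orientation, the margin maximization problem referenced here is usually stated as follows in the implicit bias literature (this is a sketch of the standard formulation; the preprint's exact statement and regularity conditions may differ):

$$
\min_{\theta} \; \tfrac{1}{2}\,\|\theta\|^{2}
\quad \text{subject to} \quad
y_i \, f(\theta; x_i) \;\ge\; 1 \quad \text{for all } i,
$$

where $f(\theta; x)$ is the output of the homogeneous model, $(x_i, y_i)$ are the training examples, and $\|\cdot\|$ is the norm induced by the optimizer ($\ell_2$ for MomentumGD, $\ell_\infty$ for Signum, the spectral norm for Muon). The KKT points of this constrained problem are the candidate maximum-margin solutions that the optimizer trajectories are shown to approach in direction.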

Concretely, the analysis shows that Muon (using the spectral norm), MomentumGD (using the $\ell_2$ norm), and Signum (using the $\ell_\infty$ norm) all exhibit this implicit directional bias when a decaying learning rate is applied. The work further extends to Adam (without its stability constant), clarifying its tendency to maximize the $\ell_\infty$ margin, and to hybrid optimizers like Muon-Signum and Muon-Adam, which maximize a composite norm.
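
To make the optimizer-to-norm correspondence concrete, the sketch below (illustrative only, not the authors' code) implements one normalized update for each geometry: an $\ell_2$-normalized momentum step (MomentumGD), a sign step (Signum, and Adam without its stability constant), and a spectral-norm step that orthogonalizes the momentum matrix, which is the direction Muon targets.

```python
import numpy as np

def l2_step(m):
    # MomentumGD direction: normalized steepest descent w.r.t. the l2 norm
    return m / (np.linalg.norm(m) + 1e-12)

def sign_step(m):
    # Signum / Adam-without-stability-constant direction: steepest descent w.r.t. the l_inf norm
    return np.sign(m)

def spectral_step(M):
    # Muon-style direction for a weight matrix: steepest descent w.r.t. the spectral norm,
    # obtained by replacing the momentum matrix with the nearest semi-orthogonal matrix U V^T.
    # (The actual Muon optimizer approximates this with a Newton-Schulz iteration, not an SVD.)
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

# Toy single update: accumulate momentum, then step along the norm-specific direction.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))   # weight matrix
G = rng.standard_normal((4, 3))   # its (loss) gradient
M = np.zeros_like(W)
beta, eta_t = 0.9, 0.01           # momentum coefficient and (decaying) learning rate
M = beta * M + (1 - beta) * G
W -= eta_t * spectral_step(M)     # swap in l2_step(M) or sign_step(M) for the other geometries
```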

Experimental Validation and Practical Implications

The theoretical conclusions are strongly supported by experimental evidence. The researchers' tests confirm that the specific geometric margin maximized—whether $\ell_2$, $\ell_\infty$, or a hybrid—is fundamentally determined by the choice of optimizer, not just the model architecture. This provides a principled explanation for why different optimization algorithms can lead to models with varying generalization performance and robustness from the same training data.
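
One simple way to probe this experimentally is to track the normalized margin of the model in each geometry as training proceeds. The helper below is a minimal, illustrative sketch (the function name, toy data, and choice of norms are assumptions, not the authors' code) for an order-$L$ homogeneous model, whose normalized margin is conventionally defined as $\min_i y_i f(\theta; x_i) / \|\theta\|^{L}$:

```python
import numpy as np

def normalized_margin(scores, theta, order, norm="l2"):
    """Normalized margin min_i y_i f(theta; x_i) / ||theta||^order.

    scores : array of y_i * f(theta; x_i) over the training set
    theta  : flattened parameter vector
    order  : degree of homogeneity of the model
    norm   : which geometry ("l2" or "linf") to measure the margin in
    """
    if norm == "l2":
        scale = np.linalg.norm(theta, 2)
    elif norm == "linf":
        scale = np.linalg.norm(theta, np.inf)
    else:
        raise ValueError(f"unsupported norm: {norm}")
    return float(np.min(scores) / scale ** order)

# Example: compare the l2- and linf-normalized margins of the same parameters.
rng = np.random.default_rng(1)
theta = rng.standard_normal(1000)
scores = rng.uniform(0.5, 2.0, size=128)   # stand-in for y_i * f(theta; x_i)
print(normalized_margin(scores, theta, order=2, norm="l2"))
print(normalized_margin(scores, theta, order=2, norm="linf"))
```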

This work synthesizes and significantly expands two major lines of inquiry: the study of implicit bias in homogeneous models and the analysis of momentum-based optimizers in linear settings. By bridging this gap, it offers a more complete picture of optimization dynamics in deep learning, moving beyond mere loss minimization to explain *which* solution an algorithm selects from a potentially infinite set.

Why This Matters for AI Development

  • Predictable Algorithmic Bias: The research provides a formal framework for predicting the type of solution (defined by its maximized margin) that momentum optimizers will converge to, aiding in algorithm selection for desired model properties.
  • Beyond Loss Curves: It underscores that understanding training requires analyzing an optimizer's implicit geometric bias, not just its speed in reducing loss, to explain final model behavior.
  • Unified Optimization Theory: The findings create a cohesive theoretical link between classical steepest descent analysis and modern adaptive momentum methods used in virtually all deep learning pipelines today.
