Implicit Bias of Momentum Optimizers in Homogeneous AI Models Revealed
New theoretical research provides a unifying framework for understanding the implicit bias of momentum-based optimizers, such as MomentumGD and Adam, when training homogeneous machine learning models. The study, detailed in the preprint arXiv:2602.16340v2, extends prior work on steepest descent to prove that these popular algorithms inherently steer optimization towards solutions that maximize specific geometric margins, a property previously established primarily for simpler optimizers. This finding offers a crucial bridge between optimization theory and modern deep learning practice, explaining why different optimizers can converge to distinct solutions even when starting from the same initialization.
From Steepest Descent to Momentum-Based Trajectories
The research first establishes a foundation by extending known results on the implicit bias of standard steepest descent in homogeneous models (where scaling the parameters by a factor c scales the output by c^L for some fixed degree L) to a more general setting. The authors prove that normalized steepest descent, even when paired with a learning rate schedule, retains its bias towards Karush-Kuhn-Tucker (KKT) points of a corresponding margin maximization problem. This sets the stage for analyzing more complex, momentum-driven algorithms.
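To make the setup concrete, the following NumPy sketch (illustrative, not code from the paper) shows a normalized steepest descent step under the two norms that recur throughout the analysis; the small `eps` term is a hypothetical numerical safeguard.

```python
import numpy as np

def normalized_steepest_descent_step(theta, grad, lr, norm="l2"):
    """One normalized steepest descent step with respect to the chosen norm.

    The direction solves argmax_{||v|| <= 1} <grad, v>, so the step length
    is set entirely by the learning rate schedule."""
    eps = 1e-12  # hypothetical safeguard against division by zero
    if norm == "l2":
        direction = grad / (np.linalg.norm(grad) + eps)  # unit l2 direction
    elif norm == "linf":
        direction = np.sign(grad)  # l_inf steepest direction is the sign vector
    else:
        raise ValueError(f"unknown norm: {norm}")
    return theta - lr * direction
```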
The core theoretical breakthrough demonstrates that for smooth homogeneous models, momentum steepest descent algorithms behave as approximate steepest descent trajectories under a decaying learning rate. This allows the authors to formally characterize their implicit bias: they prove that MomentumGD (using the ℓ₂ norm) is biased towards maximizing the ℓ₂ margin, while Signum (using the ℓ∞ norm) and a variant of Adam (run without its ε stability constant) are biased towards maximizing the ℓ∞ margin.
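For readers who want the update rules spelled out, here is a minimal NumPy sketch of the two momentum updates as commonly stated; the exact way the paper normalizes the momentum buffer may differ from this simplification.

```python
import numpy as np

def momentum_gd_step(theta, grad, buf, lr, beta=0.9):
    """MomentumGD sketch: l2-normalized momentum direction (tied to the l2 margin)."""
    buf = beta * buf + (1.0 - beta) * grad
    direction = buf / (np.linalg.norm(buf) + 1e-12)
    return theta - lr * direction, buf

def signum_step(theta, grad, buf, lr, beta=0.9):
    """Signum sketch: sign of the momentum buffer (tied to the l_inf margin)."""
    buf = beta * buf + (1.0 - beta) * grad
    return theta - lr * np.sign(buf), buf
```

Under a decaying learning rate, the result above says these momentum trajectories stay close to the corresponding normalized steepest descent trajectories, which is what lets the margin characterization carry over.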
Hybrid Norms and Experimental Validation
The analysis is further extended to hybrid optimizers that combine components from different algorithms. The study introduces Muon-Signum and Muon-Adam, showing that these variants maximize a margin under a hybrid norm that blends spectral and ℓ∞ normalization. This reveals a nuanced landscape in which the optimizer's underlying norm directly dictates which margin the final solution will favor.
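The Muon component can be read as steepest descent under the spectral norm: it replaces the raw momentum matrix with an approximation of its nearest orthogonal factor. The sketch below, with an illustrative Newton-Schulz `orthogonalize` helper and assumed coefficients, conveys the idea for matrix-shaped parameters; it is not the paper's implementation.

```python
import numpy as np

def orthogonalize(M, steps=5):
    """Approximate the orthogonal polar factor UV^T of M via Newton-Schulz.

    Coefficients here are the textbook 1.5 / -0.5 pair; Muon implementations
    tune them, so treat this as illustrative."""
    X = M / (np.linalg.norm(M) + 1e-12)  # Frobenius scaling keeps singular values <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X

def muon_signum_step(W, grad, buf, lr, beta=0.9):
    """Hybrid sketch: spectral-norm (Muon) step on a weight matrix; a full
    Muon-Signum optimizer would apply sign updates to the remaining parameters."""
    buf = beta * buf + (1.0 - beta) * grad
    return W - lr * orthogonalize(buf), buf
```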
Extensive experiments were conducted to corroborate the theoretical findings. The results consistently show that the identity of the maximized margin is not an artifact of the model but is intrinsically tied to the optimizer's design. For instance, training the same homogeneous model with MomentumGD versus Signum leads to solutions with maximized ℓ₂ and ℓ∞ margins, respectively, validating the theory that the optimizer itself is a primary driver of this implicit bias.
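An experiment of this kind boils down to tracking normalized margins over training. The following sketch shows that bookkeeping under the standard definition, assuming a flattened parameter vector, a model of homogeneity degree L, and precomputed per-example values y_i·f(x_i; θ); the names are illustrative rather than taken from the paper.

```python
import numpy as np

def normalized_margins(theta, outputs, degree_L):
    """Normalized margins min_i y_i f(x_i; theta) / ||theta||^L for two norms.

    `theta` is the flattened parameter vector, `outputs` holds the per-example
    quantities y_i * f(x_i; theta), and `degree_L` is the homogeneity degree."""
    raw = outputs.min()
    margin_l2 = raw / (np.linalg.norm(theta, 2) ** degree_L)
    margin_linf = raw / (np.linalg.norm(theta, np.inf) ** degree_L)
    return margin_l2, margin_linf
```

On this view, the reported experiments amount to observing that the ℓ₂ quantity grows under MomentumGD while the ℓ∞ quantity grows under Signum, exactly as the theory predicts.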
Why This Matters for AI Development
- Predictable Optimization Outcomes: This work provides a formal explanation for why different optimizers yield different solutions, allowing practitioners to select an optimizer based on the desired solution property (e.g., a large ℓ₂ margin for generalization).
- Bridging Theory and Practice: It successfully extends rigorous theoretical analysis from classical steepest descent and linear models to the momentum-based optimizers (Adam, Signum) ubiquitous in modern deep learning.
- Informed Algorithm Design: Understanding the implicit bias towards specific margin maximization problems can guide the development of new, more effective optimization algorithms with tailored convergence properties.
- Enhanced Model Interpretation: The findings add a layer of interpretability to the training process, framing optimization not just as loss minimization but as a targeted search for solutions with particular geometric characteristics.
Overall, this research significantly advances the theoretical understanding of optimization in machine learning. By proving that momentum-based optimizers in homogeneous models inherit and specialize the implicit bias properties of steepest descent, it unifies earlier lines of work and provides a powerful lens through which to analyze and engineer training dynamics in complex AI systems.