Implicit Bias of Momentum Optimizers in Homogeneous AI Models Revealed
New research provides a unifying theoretical framework for understanding the implicit bias of momentum-based optimizers in homogeneous machine learning models. The study, detailed in the preprint arXiv:2602.16340v2, extends foundational work on gradient descent to prove that popular algorithms like MomentumGD, Signum, and Adam inherently steer models toward solutions that maximize specific geometric margins. This finding formally connects optimization dynamics to generalization performance, revealing that the choice of optimizer fundamentally dictates the type of solution found, even when starting from the same initialization.
Extending Steepest Descent Theory to Momentum Methods
The research first establishes a broader foundation by extending existing implicit bias results for standard steepest descent. The authors prove that these bias properties hold for normalized steepest descent, even when incorporating a decaying learning rate schedule. This crucial extension bridges the gap to more complex, momentum-based algorithms commonly used in modern deep learning.
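For reference, normalized steepest descent with respect to a general norm $\|\cdot\|$ and a decaying step size can be written in the following standard form (the notation below is a generic textbook formulation, not necessarily the paper's):

$$
\theta_{t+1} = \theta_t - \eta_t \, \Delta_t, \qquad \Delta_t \in \arg\max_{\|v\| \le 1} \langle \nabla L(\theta_t), v \rangle, \qquad \eta_t \to 0.
$$

Choosing the $\ell_2$ norm recovers normalized gradient descent, the $\ell_\infty$ norm recovers sign descent, and the spectral norm on matrix parameters yields a Muon-style orthogonalized update.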
The core breakthrough demonstrates that for smooth homogeneous models—a class that includes linear models and deep neural networks with homogeneous activation functions—momentum algorithms approximate steepest descent trajectories. Under a decaying learning rate, algorithms like Muon (using spectral norm), MomentumGD (using $\ell_2$ norm), and Signum (using $\ell_\infty$ norm) are shown to have an implicit bias toward Karush-Kuhn-Tucker (KKT) points of their corresponding margin maximization problem. This means each optimizer inherently seeks solutions that maximize a different norm-based margin on the training data.
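For an $L$-homogeneous model $f$ (meaning $f(c\theta; x) = c^L f(\theta; x)$ for all $c > 0$), the norm-dependent margin maximization problem is commonly stated in the constrained form below; the precise statement in the paper may differ in normalization, so treat this as a standard template from the homogeneous-model literature:

$$
\min_{\theta} \ \|\theta\| \quad \text{subject to} \quad y_i \, f(\theta; x_i) \ge 1 \quad \text{for all training examples } (x_i, y_i),
$$

where $\|\cdot\|$ is the norm tied to the optimizer: $\ell_2$ for MomentumGD, $\ell_\infty$ for Signum, and the spectral norm for Muon. The KKT-point guarantee means the normalized iterates converge to a direction satisfying the first-order optimality conditions of this problem, which is weaker than reaching its global optimum but still pins down the geometry of the solution.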
Adam and Hybrid Optimizers Also Exhibit Norm-Specific Bias
The analysis is further extended to the widely used Adam optimizer, studied with its stability constant epsilon set to zero. The researchers prove that Adam maximizes the $\ell_\infty$ margin, aligning its implicit bias with that of Signum. This formalizes an intuitive property of the algorithm's update rule, which normalizes each gradient coordinate by a running estimate of its magnitude.
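To see informally why Adam without the stability constant behaves like Signum, compare minimal sketches of the two update rules; the variable names and hyperparameters below are illustrative, not taken from the paper:

```python
import numpy as np

def signum_step(grad, m, beta=0.9, lr=1e-3):
    """Signum: heavy-ball momentum on the gradient, then the elementwise sign."""
    m = beta * m + (1 - beta) * grad
    return -lr * np.sign(m), m

def adam_step_no_eps(grad, m, v, t, beta1=0.9, beta2=0.999, lr=1e-3):
    """Adam with epsilon = 0: each coordinate is divided by a running estimate
    of its own magnitude, so every coordinate of the update has roughly unit
    scale -- the same l_inf-style normalization that the sign function applies."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    return -lr * m_hat / np.sqrt(v_hat), m, v
```

When the gradient changes slowly across steps, $\hat{m}_t / \sqrt{\hat{v}_t}$ approaches $\pm 1$ in each coordinate, matching the $\pm 1$ entries of the Signum step; this is the informal sense in which both optimizers share an $\ell_\infty$-flavored update and, per the paper, the same $\ell_\infty$ margin bias.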
Furthermore, the study introduces and analyzes hybrid optimizers, Muon-Signum and Muon-Adam. These algorithms are shown to maximize a hybrid norm, blending characteristics of their component optimizers. This demonstrates that the implicit bias is not a fixed property but can be deliberately engineered by combining elements from different optimization strategies, opening new avenues for designing optimizers with tailored generalization properties.
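As one concrete illustration of what a hybrid norm can look like (this particular construction is an assumption chosen for exposition, not a definition quoted from the paper), suppose the parameters split into a matrix block $W$ updated by Muon and a remaining block $b$ updated by Signum. Block-wise normalized steepest descent of that kind corresponds to the norm

$$
\|(W, b)\| = \max\!\big(\|W\|_{\mathrm{op}}, \ \|b\|_{\infty}\big),
$$

whose unit ball constrains the spectral norm of the matrix block and the $\ell_\infty$ norm of the remaining parameters simultaneously; the induced margin maximization problem then trades off both quantities rather than a single global norm.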
Experimental Validation and Broader Implications
The theoretical claims are corroborated by the experimental results. The experiments show that which margin is maximized, whether $\ell_2$, $\ell_\infty$, spectral, or a hybrid, depends directly on the choice of optimizer rather than on the model architecture alone. This work unifies two previously distinct lines of research: the study of steepest descent in homogeneous models and the analysis of momentum-based optimizers in linear models.
From an expert perspective, this research provides a critical lens for algorithm selection in deep learning. It moves beyond viewing optimizers merely as tools for faster convergence, framing them as architects of a model's final solution and, by extension, its generalization behavior. Understanding this implicit bias is essential for interpreting model outcomes and designing training regimens aligned with specific performance goals.
Why This Matters for AI Development
- Predictable Generalization: The work provides a theoretical basis for predicting how different optimizers will influence a model's final solution and its ability to generalize to new data, linking optimization dynamics directly to model performance.
- Informed Optimizer Selection: Practitioners can now select optimizers not just for convergence speed, but with an understanding of the specific type of solution (e.g., $\ell_2$ vs. $\ell_\infty$ margin maximizer) they will implicitly promote.
- Path to Engineered Optimizers: By proving that hybrid optimizers like Muon-Signum maximize a hybrid norm, the research lays the groundwork for designing next-generation optimizers with bespoke implicit biases tailored for specific tasks or data characteristics.
- Unifying Theoretical Framework: This study creates a cohesive bridge between classical optimization theory and modern deep learning practice, offering a unified explanation for the behavior of a broad class of widely used algorithms.