Implicit Bias of Momentum Optimizers in Homogeneous AI Models Revealed
New research provides a unifying theoretical framework for understanding the implicit bias of momentum-based optimizers such as MomentumGD, Signum, and Adam when training homogeneous machine learning models. The study (arXiv:2602.16340v2) extends prior work on steepest descent to prove that these popular algorithms inherently steer models towards solutions that maximize specific geometric margins, a factor closely tied to generalization performance. This finding establishes a direct link between the choice of optimizer and the type of solution found, offering a principled explanation for these methods' empirical success beyond mere convergence speed.
From Steepest Descent to Momentum Trajectories
The analysis begins by extending established results on the implicit bias of steepest descent in homogeneous models—where scaling the parameters by a factor c scales the outputs by c to the power of the model's homogeneity degree—to a more general setting. The researchers first prove that normalized steepest descent, even under a learning rate schedule, retains the same bias towards Karush-Kuhn-Tucker (KKT) points of a corresponding margin-maximization problem. This foundational step bridges the gap to momentum-based methods.
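To make the setting concrete, the following is a minimal sketch of the standard formulation used in this line of work; the symbols f, θ, L, and the norm ‖·‖ are generic notation assumed here rather than taken from the paper:

```latex
% L-homogeneity: scaling the parameters scales the outputs polynomially.
f(c\,\theta;\, x) = c^{L} f(\theta;\, x) \quad \text{for all } c > 0.

% Steepest descent w.r.t. a norm \|\cdot\| on separable data is biased towards
% KKT points of the norm-constrained margin-maximization problem
\min_{\theta} \; \tfrac{1}{2}\,\|\theta\|^{2}
  \quad \text{s.t.} \quad y_i\, f(\theta;\, x_i) \ge 1 \;\; \forall i,

% equivalently, maximizing the normalized margin
\max_{\theta \neq 0} \; \frac{\min_i \, y_i\, f(\theta;\, x_i)}{\|\theta\|^{L}}.
```

The extension described above shows that this KKT characterization survives both normalization of the update and a learning rate schedule.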
The core result shows that for smooth homogeneous models, momentum algorithms follow approximate steepest descent trajectories under a decaying learning rate. This equivalence holds for MomentumGD (implicitly maximizing the ℓ₂ margin), Signum (maximizing the ℓ∞ margin), and Muon (maximizing the spectral-norm margin). The analysis further extends to Adam (without its stability constant ε), which also maximizes the ℓ∞ margin, and to hybrid variants such as Muon-Signum and Muon-Adam, which maximize the margin with respect to a composite norm.
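As a rough illustration of the update rules being compared, here is a minimal NumPy sketch of the MomentumGD, Signum, and (idealized) Muon steps; the names `grad_fn`, `lr`, `beta`, and `buf` are placeholders for this sketch, not taken from the paper:

```python
import numpy as np

def momentum_gd_step(theta, buf, grad_fn, lr=0.1, beta=0.9):
    """Heavy-ball / MomentumGD: step along the raw momentum direction."""
    g = grad_fn(theta)
    buf = beta * buf + g                 # accumulate momentum
    theta = theta - lr * buf             # l2-type step -> bias towards the l2 margin
    return theta, buf

def signum_step(theta, buf, grad_fn, lr=0.1, beta=0.9):
    """Signum: step along the sign of the momentum (steepest descent w.r.t. l_inf)."""
    g = grad_fn(theta)
    buf = beta * buf + (1 - beta) * g
    theta = theta - lr * np.sign(buf)    # l_inf-type step -> bias towards the l_inf margin
    return theta, buf

def muon_step(theta_mat, buf, grad_fn, lr=0.1, beta=0.9):
    """Idealized Muon for a matrix parameter: orthogonalize the momentum
    (via SVD here; practical implementations use a Newton-Schulz iteration),
    i.e. steepest descent w.r.t. the spectral norm."""
    g = grad_fn(theta_mat)
    buf = beta * buf + g
    u, _, vt = np.linalg.svd(buf, full_matrices=False)
    theta_mat = theta_mat - lr * (u @ vt)  # spectral-norm step -> spectral margin
    return theta_mat, buf
```

In each case the step direction is the steepest-descent direction for the named norm; the paper's claim is that, with a decaying learning rate, the momentum buffer tracks the gradient closely enough for the corresponding margin bias to carry over.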
Experimental Validation and Practical Implications
The theoretical claims are corroborated by experiments, which show, both visually and quantitatively, that which margin is maximized is determined by the choice of optimizer. For instance, training the same homogeneous model with Signum versus MomentumGD leads to final solutions with distinctly different geometric properties, consistent with their predicted ℓ∞ and ℓ₂ biases, respectively. This work unifies two previously separate lines of inquiry: the study of steepest descent in homogeneous models and the analysis of momentum-based optimizers in linear models.
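One way to probe this kind of prediction in practice is to compare normalized margins of the trained parameters. The sketch below assumes a scalar-output model `f(theta, X)` with homogeneity degree `L` and ±1 labels `y`; it is an illustrative check, not the paper's experimental protocol:

```python
import numpy as np

def normalized_margins(theta, f, X, y, L=1):
    """Return the l2- and l_inf-normalized margins of a trained homogeneous model.

    theta : flat parameter vector
    f     : callable f(theta, X) -> per-example scores
    X, y  : data and +/-1 labels
    L     : degree of homogeneity, i.e. f(c*theta, x) = c**L * f(theta, x)
    """
    raw_margin = np.min(y * f(theta, X))                       # unnormalized margin
    l2_margin = raw_margin / np.linalg.norm(theta, 2) ** L
    linf_margin = raw_margin / np.linalg.norm(theta, np.inf) ** L
    return l2_margin, linf_margin
```

Under the predictions above, parameters trained with MomentumGD should score relatively higher on the ℓ₂-normalized margin, and parameters trained with Signum on the ℓ∞-normalized one.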
Why This Matters for AI Development
This research provides crucial insights for both machine learning theorists and practitioners:
- Predictable Optimization Bias: The choice of optimizer (e.g., Adam vs. SGD with momentum) is not neutral; it actively selects the type of "simple" or "large-margin" solution the model converges to, directly impacting generalization.
- Unified Theoretical Framework: It offers a cohesive explanation for the behavior of diverse modern optimizers under one theoretical umbrella, moving beyond analyzing them solely through the lens of convergence rates.
- Informed Algorithm Selection: Understanding which norm an optimizer implicitly maximizes lets developers choose one whose geometric bias (ℓ∞, ℓ₂, spectral, or a composite norm) matches the properties they want in the final model.
- Foundation for New Optimizers: The analysis methodology paves the way for designing next-generation optimizers with tailored implicit biases for specific tasks or model architectures.