The Hidden Width of Deep ResNets: Tight Error Bounds and Phase Diagram

A mathematical analysis reveals that as residual networks (ResNets) grow infinitely deep, their training dynamics converge to a unique limit described by a Neural Mean Ordinary Differential Equation (ODE). The regime depends on the parameter α in the residual-block scale Θ_D(α/(LM)): α = Θ_D(1) yields maximal feature learning with an error bound of O_D(1/L + 1/√(LM)), while α → ∞ yields lazy, NTK-like training. For two-layer perceptron blocks, a residual scale of order √D/(LM) is both necessary and sufficient for expressive training.

Deep Residual Networks Converge to Infinite-Width Dynamics, Revealing New Training Regimes

A groundbreaking mathematical analysis reveals that as residual networks (ResNets) grow infinitely deep, their gradient-based training dynamics converge to a unique limit, behaving as if they were infinitely wide regardless of their actual hidden width. This convergence is governed by a Neural Mean Ordinary Differential Equation (ODE), with the scaling of the residual blocks determining whether the network operates in a maximally expressive "feature learning" regime or a simpler "lazy training" regime. The research, presented in the paper arXiv:2509.10167v2, provides a novel stochastic approximation framework for understanding ResNet training, fundamentally linking depth, width, and initialization scale.

The Dual Regimes of Infinite-Depth ResNets

The core finding establishes that, for a fixed embedding dimension D, the training dynamics of a ResNet converge to a unique limit as its depth L diverges. This limit is described by a Mean ODE, and which regime emerges depends critically on the scaling parameter α in the residual-block scale Θ_D(α/(LM)).

When α = Θ_D(1), that is, when α stays constant relative to D, the network enters a regime of maximal local feature updates. Here, the Mean ODE is non-linearly parameterized, allowing rich, adaptive feature learning. After a fixed number of gradient steps, the error between the finite-depth ResNet's output and its infinite-depth limit is bounded by O_D(1/L + 1/√(LM)).
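
To make the parameterization concrete, here is a minimal PyTorch sketch of a depth-L ResNet whose residual branches carry the α/(LM) factor. The two-layer tanh blocks and PyTorch's default initialization are illustrative assumptions, not the paper's exact setup.

```python
# A minimal sketch (not the paper's code) of a depth-L ResNet whose
# residual branches are scaled by alpha/(L*M), as discussed above.
import torch
import torch.nn as nn

class ScaledResNet(nn.Module):
    def __init__(self, D: int, M: int, L: int, alpha: float = 1.0):
        super().__init__()
        self.scale = alpha / (L * M)  # alpha = Theta_D(1): feature-learning regime
        self.blocks = nn.ModuleList(
            nn.Sequential(
                nn.Linear(D, M, bias=False),  # lift to hidden width M
                nn.Tanh(),
                nn.Linear(M, D, bias=False),  # project back to embedding dim D
            )
            for _ in range(L)
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            h = h + self.scale * block(h)  # small residual update per layer
        return h

net = ScaledResNet(D=16, M=64, L=256, alpha=1.0)
out = net(torch.randn(8, 16))  # a batch of 8 input embeddings
```

Because each residual update is of size O(1/L), the depth-L recursion behaves like an Euler discretization of an ODE in the embedding space, which matches the Mean ODE picture above.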

In contrast, if the scaling parameter α → ∞, the model enters a lazy training regime. In this regime, the limiting Mean ODE becomes linearly parameterized, resembling the behavior of models in the Neural Tangent Kernel (NTK) paradigm, where features do not evolve significantly during training. The study also derives a convergence rate for this lazy regime, completing the theoretical picture.
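
For background, "linearly parameterized" refers to the standard NTK linearization, which is well-known prior theory rather than a result of this paper: near the initialization θ₀, the network is approximated by its first-order expansion f(x; θ) ≈ f(x; θ₀) + ⟨∇_θ f(x; θ₀), θ − θ₀⟩, so training reduces to kernel regression with the tangent kernel K(x, x′) = ⟨∇_θ f(x; θ₀), ∇_θ f(x′; θ₀)⟩, and the features induced at initialization barely move.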

Precise Scaling for Two-Layer Perceptron Blocks

The researchers provide a precise analysis for a common architecture: ResNets with two-layer perceptron blocks. They identify the exact residual scaling required to reach the expressive, feature-learning regime, proving that a scale of order √D/(LM) is both necessary and sufficient for maximal local feature updates.

Under this optimal scaling, they establish a high-probability error bound between the finite ResNet and its limiting dynamics: O(1/L + √D / √(LM)). This result explicitly shows how the embedding dimension D influences the convergence rate, a crucial insight for designing very deep networks.
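
As a rough numerical companion (our construction, not the paper's experiment), the sketch below builds such two-layer perceptron blocks under the √D/(LM) scale and measures how far the hidden features move after a single gradient step; the toy data, learning rate, and Gaussian initialization are assumptions made for illustration.

```python
# An informal probe of the "maximal feature update" property: build
# two-layer perceptron blocks with the sqrt(D)/(L*M) residual scale,
# take one SGD step on a toy regression loss, and measure how much
# the final hidden features move.
import torch

def feature_update(D=32, M=128, L=256, lr=0.1, seed=0):
    torch.manual_seed(seed)
    Ws = [torch.randn(M, D, requires_grad=True) for _ in range(L)]  # hidden layers
    Vs = [torch.randn(D, M, requires_grad=True) for _ in range(L)]  # projections
    scale = (D ** 0.5) / (L * M)
    x, y = torch.randn(8, D), torch.randn(8, D)  # toy batch and targets

    def features(x):
        h = x
        for W, V in zip(Ws, Vs):
            h = h + scale * torch.tanh(h @ W.T) @ V.T
        return h

    h0 = features(x).detach()                  # features at initialization
    loss = ((features(x) - y) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        for p in Ws + Vs:
            p -= lr * p.grad                   # one plain SGD step
    return ((features(x).detach() - h0).norm() / h0.norm()).item()

print(feature_update())  # relative movement of the output features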

A Novel Stochastic Framework and Empirical Validation

The convergence proofs are built on a novel mathematical perspective. The authors show that, due to the randomness of standard initializations, the forward and backward passes through a ResNet act as a stochastic approximation of certain mean ODEs. Furthermore, they demonstrate that propagation of chaos, the asymptotic independence of hidden units, preserves this stochastic behavior throughout the training dynamics, ensuring convergence to the deterministic limit.
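
The flavor of this stochastic-approximation argument can be seen in a scalar toy model (our construction, far simpler than the paper's setting): a depth-L residual recursion with i.i.d. random blocks tracks the Euler solution of the corresponding mean ODE as L grows.

```python
# Toy illustration of the stochastic-approximation view: the recursion
# h <- h + tanh(w*h)/L with i.i.d. w ~ N(1, 1) tracks the mean ODE
# dh/dt = E_w[tanh(w*h)] on t in [0, 1] as the depth L grows.
import numpy as np

rng = np.random.default_rng(0)
w_mc = rng.normal(1.0, 1.0, size=100_000)      # Monte Carlo sample for the mean drift
h0 = 0.5                                       # shared initial condition

# Reference: fine Euler solve of the mean ODE.
h_ode, n_steps = h0, 2_000
for _ in range(n_steps):
    h_ode += np.tanh(w_mc * h_ode).mean() / n_steps

def gap(L, trials=100):
    """Mean |h_L - h_ode| over independent draws of the L random blocks."""
    errs = []
    for _ in range(trials):
        h = h0
        for w in rng.normal(1.0, 1.0, size=L):
            h += np.tanh(w * h) / L            # one residual "layer"
        errs.append(abs(h - h_ode))
    return float(np.mean(errs))

for L in (16, 64, 256, 1024):
    print(L, gap(L))                           # shrinks roughly like 1/sqrt(L)
```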

The theoretical rates are not merely asymptotic; the authors verify empirically that all derived bounds, including the dependencies on L, M, and D, are tight, confirming the practical relevance of their findings for understanding and engineering deep neural networks.
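
One simple way to probe such tightness claims, continuing the toy sketch above (reusing its gap function and NumPy import), is to fit the decay exponent of the error against depth on a log-log scale; note that the paper's actual experiments also vary M and D, which this toy does not.

```python
# Fit the empirical decay exponent of the depth gap from the sketch above;
# for that toy the slope should come out near -0.5, i.e. a 1/sqrt(L) rate.
Ls = np.array([16, 32, 64, 128, 256, 512])
gaps = np.array([gap(L) for L in Ls])
slope, _ = np.polyfit(np.log(Ls), np.log(gaps), deg=1)
print(f"empirical rate ~ L^{slope:.2f}")
```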

Why This Matters for AI and Machine Learning

  • Unifies Depth and Width: This work formally bridges two major architectural axes, showing that infinite depth can induce infinite-width behavior and thereby simplifying the analysis of ultra-deep networks.
  • Guides Architecture Design: The identified scaling rule O(√D / (LM)) provides a precise blueprint for initializing residual blocks to ensure networks operate in a rich feature-learning regime, avoiding the limitations of lazy training.
  • Advances Theoretical Foundations: By framing ResNet training as a stochastic approximation to a Mean ODE, the research offers a powerful new analytical tool that moves beyond the NTK paradigm to explain feature learning in deep, finite-width models.
  • Confirms Empirical Practice: The findings mathematically substantiate widely used initialization and scaling heuristics in deep learning, grounding practical engineering in rigorous theory.
