The Hidden Width of Deep ResNets: Tight Error Bounds and Phase Diagram

A mathematical analysis reveals that as residual networks (ResNets) grow infinitely deep, their training dynamics converge to a unique limit described by a Neural Mean Ordinary Differential Equation (ODE). The regime depends on the parameter α in the residual-block scale Θ_D(α/(LM)): α = Θ_D(1) yields maximal feature learning with an error bound of O_D(1/L + 1/√(LM)), while α → ∞ yields lazy, NTK-like training. For two-layer perceptron blocks, a residual scale of order √D/(LM) is both necessary and sufficient for expressive training.

Deep Residual Networks Converge to Infinite-Width Dynamics, Revealing New Training Regimes

A groundbreaking mathematical analysis reveals that as residual networks (ResNets) grow infinitely deep, their gradient-based training dynamics converge to a unique limit, behaving as if they were infinitely wide regardless of their actual hidden width. This convergence is governed by a Neural Mean Ordinary Differential Equation (ODE), with the scaling of the residual blocks determining whether the network operates in a maximally expressive "feature learning" regime or a simpler "lazy training" regime. The research, presented in the paper arXiv:2509.10167v2, provides a novel stochastic approximation framework for understanding ResNet training, fundamentally linking depth, width, and initialization scale.

The Dual Regimes of Infinite-Depth ResNets

The core finding establishes that, for a fixed embedding dimension D, the training dynamics of a ResNet converge to a unique limit as its depth L diverges. This limit is described by a Mean ODE, and which regime emerges depends critically on the scaling parameter α in the residual-block scale Θ_D(α/(LM)).

When α = Θ_D(1), that is, when α stays constant relative to D, the network enters a regime of maximal local feature updates. Here, the Mean ODE is non-linearly parameterized, allowing rich, adaptive feature learning. After a fixed number of gradient steps, the error between the finite-depth ResNet's output and its infinite-depth limit is bounded by O_D(1/L + 1/√(LM)).
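
To make the parameterization concrete, here is a minimal PyTorch sketch of a depth-L ResNet whose residual branches carry the α/(LM) factor. The two-layer tanh blocks and PyTorch's default initialization are illustrative assumptions, not the paper's exact setup.

```python
# A minimal sketch (not the paper's code) of a depth-L ResNet whose
# residual branches are scaled by alpha/(L*M), as discussed above.
import torch
import torch.nn as nn

class ScaledResNet(nn.Module):
    def __init__(self, D: int, M: int, L: int, alpha: float = 1.0):
        super().__init__()
        self.scale = alpha / (L * M)  # alpha = Theta_D(1): feature-learning regime
        self.blocks = nn.ModuleList(
            nn.Sequential(
                nn.Linear(D, M, bias=False),  # lift to hidden width M
                nn.Tanh(),
                nn.Linear(M, D, bias=False),  # project back to embedding dim D
            )
            for _ in range(L)
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            h = h + self.scale * block(h)  # small residual update per layer
        return h

net = ScaledResNet(D=16, M=64, L=256, alpha=1.0)
out = net(torch.randn(8, 16))  # a batch of 8 input embeddings
```

Because each residual update is of size O(1/L), the depth-L recursion behaves like an Euler discretization of an ODE in the embedding space, which matches the Mean ODE picture above.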

In contrast, if the scaling parameter α → ∞, the model enters a lazy training regime. In this regime, the limiting Mean ODE becomes linearly parameterized, resembling the behavior of models in the Neural Tangent Kernel (NTK) paradigm, where features do not evolve significantly during training. The study also derives a convergence rate for this lazy regime, completing the theoretical picture.
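
For background, "linearly parameterized" refers to the standard NTK linearization, which is well-known prior theory rather than a result of this paper: near the initialization θ₀, the network is approximated by its first-order expansion f(x; θ) ≈ f(x; θ₀) + ⟨∇_θ f(x; θ₀), θ − θ₀⟩, so training reduces to kernel regression with the tangent kernel K(x, x′) = ⟨∇_θ f(x; θ₀), ∇_θ f(x′; θ₀)⟩, and the features induced at initialization barely move.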

Precise Scaling for Two-Layer Perceptron Blocks

The researchers provide a precise analysis for a common architecture: ResNets with two-layer perceptron blocks. They identify the exact residual scaling required to reach the expressive, feature-learning regime, proving that a scale of order √D/(LM) is both necessary and sufficient for maximal local feature updates.

Under this optimal scaling, they establish a high-probability error bound between the finite ResNet and its limiting dynamics: O(1/L + √D / √(LM)). This result explicitly shows how the embedding dimension D influences the convergence rate, a crucial insight for designing very deep networks.
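
As a rough numerical companion (our construction, not the paper's experiment), the sketch below builds such two-layer perceptron blocks under the √D/(LM) scale and measures how far the hidden features move after a single gradient step; the toy data, learning rate, and Gaussian initialization are assumptions made for illustration.

```python
# An informal probe of the "maximal feature update" property: build
# two-layer perceptron blocks with the sqrt(D)/(L*M) residual scale,
# take one SGD step on a toy regression loss, and measure how much
# the final hidden features move.
import torch

def feature_update(D=32, M=128, L=256, lr=0.1, seed=0):
    torch.manual_seed(seed)
    Ws = [torch.randn(M, D, requires_grad=True) for _ in range(L)]  # hidden layers
    Vs = [torch.randn(D, M, requires_grad=True) for _ in range(L)]  # projections
    scale = (D ** 0.5) / (L * M)
    x, y = torch.randn(8, D), torch.randn(8, D)  # toy batch and targets

    def features(x):
        h = x
        for W, V in zip(Ws, Vs):
            h = h + scale * torch.tanh(h @ W.T) @ V.T
        return h

    h0 = features(x).detach()                  # features at initialization
    loss = ((features(x) - y) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        for p in Ws + Vs:
            p -= lr * p.grad                   # one plain SGD step
    return ((features(x).detach() - h0).norm() / h0.norm()).item()

print(feature_update())  # relative movement of the output features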

A Novel Stochastic Framework and Empirical Validation

The convergence proofs are built on a novel mathematical perspective. The authors show that, due to the randomness of standard initializations, the forward and backward passes through a ResNet act as a stochastic approximation of certain mean ODEs. Furthermore, they demonstrate that propagation of chaos, the asymptotic independence of hidden units, preserves this stochastic behavior throughout the training dynamics, ensuring convergence to the deterministic limit.
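
The flavor of this stochastic-approximation argument can be seen in a scalar toy model (our construction, far simpler than the paper's setting): a depth-L residual recursion with i.i.d. random blocks tracks the Euler solution of the corresponding mean ODE as L grows.

```python
# Toy illustration of the stochastic-approximation view: the recursion
# h <- h + tanh(w*h)/L with i.i.d. w ~ N(1, 1) tracks the mean ODE
# dh/dt = E_w[tanh(w*h)] on t in [0, 1] as the depth L grows.
import numpy as np

rng = np.random.default_rng(0)
w_mc = rng.normal(1.0, 1.0, size=100_000)      # Monte Carlo sample for the mean drift
h0 = 0.5                                       # shared initial condition

# Reference: fine Euler solve of the mean ODE.
h_ode, n_steps = h0, 2_000
for _ in range(n_steps):
    h_ode += np.tanh(w_mc * h_ode).mean() / n_steps

def gap(L, trials=100):
    """Mean |h_L - h_ode| over independent draws of the L random blocks."""
    errs = []
    for _ in range(trials):
        h = h0
        for w in rng.normal(1.0, 1.0, size=L):
            h += np.tanh(w * h) / L            # one residual "layer"
        errs.append(abs(h - h_ode))
    return float(np.mean(errs))

for L in (16, 64, 256, 1024):
    print(L, gap(L))                           # shrinks roughly like 1/sqrt(L)
```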

The theoretical rates are not merely asymptotic; the authors verify empirically that all derived bounds, including the dependencies on L, M, and D, are tight, confirming the practical relevance of their findings for understanding and engineering deep neural networks.
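
One simple way to probe such tightness claims, continuing the toy sketch above (reusing its gap function and NumPy import), is to fit the decay exponent of the error against depth on a log-log scale; note that the paper's actual experiments also vary M and D, which this toy does not.

```python
# Fit the empirical decay exponent of the depth gap from the sketch above;
# for that toy the slope should come out near -0.5, i.e. a 1/sqrt(L) rate.
Ls = np.array([16, 32, 64, 128, 256, 512])
gaps = np.array([gap(L) for L in Ls])
slope, _ = np.polyfit(np.log(Ls), np.log(gaps), deg=1)
print(f"empirical rate ~ L^{slope:.2f}")
```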

Why This Matters for AI and Machine Learning

  • Unifies Depth and Width: This work formally bridges two major architectural axes, showing that infinite depth can induce infinite-width behavior and thereby simplifying the analysis of ultra-deep networks.
  • Guides Architecture Design: The identified scaling rule O(√D / (LM)) provides a precise blueprint for initializing residual blocks to ensure networks operate in a rich feature-learning regime, avoiding the limitations of lazy training.
  • Advances Theoretical Foundations: By framing ResNet training as a stochastic approximation to a Mean ODE, the research offers a powerful new analytical tool that moves beyond the NTK paradigm to explain feature learning in deep, finite-width models.
  • Confirms Empirical Practice: The findings mathematically substantiate widely used initialization and scaling heuristics in deep learning, grounding practical engineering in rigorous theory.
