FAST: Topology-Aware Frequency-Domain Distribution Matching for Coreset Selection

FAST is a novel DNN-free distribution-matching coreset selection framework that formulates the task as a graph-constrained optimization problem. It uses a frequency-domain metric to capture the full data distribution, achieving an average 9.12% accuracy gain over existing methods while reducing power consumption by 96.57% and delivering a 2.2x speedup. The method overcomes the limitations of traditional metrics like MSE and MMD by employing an Attenuated Phase-Decoupled Characteristic Function Distance and a Progressive Discrepancy-Aware Sampling strategy.

FAST: A New AI Framework Dramatically Cuts Energy Use in Model Training

Researchers have introduced a groundbreaking method for compressing massive AI training datasets, promising to slash the immense computational and energy costs of developing deep neural networks. The framework, named FAST, is the first DNN-free distribution-matching coreset selection technique, formulated as a graph-constrained optimization problem. It leverages a novel frequency-domain metric to capture the full data distribution, achieving an average 9.12% accuracy gain over existing methods while reducing power consumption by a staggering 96.57% and delivering a 2.2x speedup.

Coreset selection is a critical technique for sustainable AI, aiming to distill large datasets into small, representative subsets for efficient training. Current approaches fall into two flawed categories: model-dependent DNN-based methods that introduce architectural bias, and heuristic DNN-free methods that lack theoretical guarantees. A fundamental challenge has been the inability to explicitly ensure distributional equivalence between the coreset and the original data, as continuous distribution matching is considered incompatible with discrete sampling.

The Limitations of Current Metrics and the FAST Solution

Prevailing metrics like Mean Squared Error (MSE), Kullback-Leibler (KL) divergence, and Maximum Mean Discrepancy (MMD) fail to accurately capture higher-order statistical moments, leading to suboptimal and unrepresentative coresets. The FAST framework directly addresses this core limitation by reformulating the selection task through the lens of spectral graph theory.
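The moment-blindness problem can be seen in a minimal example. The sketch below (an illustration, not code from the paper) compares a Gaussian with a Rademacher (coin-flip) distribution: both have mean 0 and variance 1, so any metric built on low-order moments sees almost no gap, while the empirical characteristic function, which encodes the full distribution, separates them clearly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two distributions with identical mean (0) and variance (1)
# but different fourth moments: Gaussian kurtosis is 3, Rademacher is 1.
gauss = rng.normal(0.0, 1.0, 100_000)
rade = rng.choice(np.array([-1.0, 1.0]), 100_000)

# A metric that only compares low-order moments sees almost no gap:
moment_gap = abs(gauss.mean() - rade.mean()) + abs(gauss.var() - rade.var())

# The empirical characteristic function E[exp(i*t*x)] separates them:
t = 2.0
cf_gap = abs(np.exp(1j * t * gauss).mean() - np.exp(1j * t * rade).mean())

print(f"moment gap: {moment_gap:.3f}, CF gap at t=2: {cf_gap:.3f}")
```

Here `moment_gap` is near zero (pure sampling noise), while `cf_gap` is an order of magnitude larger, illustrating why a frequency-domain metric can distinguish distributions that MSE- or low-order-moment-based metrics conflate.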

At its heart, FAST employs the Characteristic Function Distance (CFD) to measure distributional similarity in the frequency domain. Because the characteristic function uniquely determines a distribution, matching it captures, in principle, all moments of the data rather than just the first two. However, the researchers identified a critical flaw in naive CFD implementations: a "vanishing phase gradient" in medium- and high-frequency regions that hampers optimization.
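In its simplest empirical form (a sketch of the general idea, not the paper's exact estimator), the CFD averages the squared gap between the two datasets' empirical characteristic functions over a set of sampled frequency vectors:

```python
import numpy as np

def empirical_cf(x, freqs):
    """Empirical characteristic function: mean of exp(i<t, x>) over samples.

    x: (n, d) samples; freqs: (k, d) frequency vectors -> (k,) complex values.
    """
    return np.exp(1j * x @ freqs.T).mean(axis=0)

def cfd(x, y, freqs):
    """Squared gap between the two empirical CFs, averaged over frequencies."""
    return np.mean(np.abs(empirical_cf(x, freqs) - empirical_cf(y, freqs)) ** 2)

# Two samples from the same distribution score lower than a shifted one:
rng = np.random.default_rng(0)
freqs = rng.normal(0.0, 1.0, size=(64, 2))
a = rng.normal(0.0, 1.0, size=(5000, 2))
b = rng.normal(0.0, 1.0, size=(5000, 2))
c = rng.normal(2.0, 1.0, size=(5000, 2))
```

A coreset that drives this quantity toward zero matches the original data in distribution, not merely in mean, which is what distinguishes CFD-style matching from MSE- or MMD-style objectives.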

Innovative Advances: Attenuated CFD and Progressive Sampling

To solve the vanishing-gradient problem, the team developed an Attenuated Phase-Decoupled CFD. The enhanced metric separates the characteristic function into amplitude and phase components and attenuates the phase contribution in the frequency regions where its gradient would otherwise vanish, stabilizing training and enabling reliable distribution matching across all frequency bands.
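One plausible form of this idea (the exact attenuation and decoupling used by the authors is not given here, so the weighting below is an assumption for illustration) decomposes each empirical characteristic function into amplitude and phase, then down-weights the phase term by the CF magnitudes, since at medium and high frequencies the magnitude decays toward zero and the phase becomes pure noise:

```python
import numpy as np

def apd_cfd(x, y, freqs):
    # Empirical characteristic functions of both sample sets
    cf_x = np.exp(1j * x @ freqs.T).mean(axis=0)
    cf_y = np.exp(1j * y @ freqs.T).mean(axis=0)

    # Decouple amplitude and phase
    amp_x, amp_y = np.abs(cf_x), np.abs(cf_y)
    phase_gap = np.angle(cf_x * np.conj(cf_y))  # wrapped phase difference

    # Amplitude term is well-conditioned at every frequency
    amp_term = (amp_x - amp_y) ** 2

    # Phase term, attenuated by the CF magnitudes: where |cf| ~ 0 the
    # phase is noise, and this weight smoothly suppresses its otherwise
    # unstable contribution. (Attenuation form is an assumption.)
    phase_term = amp_x * amp_y * phase_gap ** 2

    return np.mean(amp_term + phase_term)

# A mean shift leaves CF amplitudes unchanged but rotates the phase,
# so the phase term is what detects it:
rng = np.random.default_rng(0)
freqs = rng.normal(0.0, 1.0, size=(64, 2))
x = rng.normal(0.0, 1.0, size=(4000, 2))
x2 = rng.normal(0.0, 1.0, size=(4000, 2))
y = rng.normal(1.5, 1.0, size=(4000, 2))  # same shape, shifted mean
```

The design point is that the amplitude term always yields usable gradients, while the phase term is trusted only where the characteristic function is large enough for phase to be meaningful.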

Furthermore, to ensure robust and efficient convergence, they designed a Progressive Discrepancy-Aware Sampling (PDAS) strategy. This intelligent scheduler progressively selects frequencies from low to high, ensuring the global data structure is preserved before refining local details. This approach allows for accurate matching using fewer frequencies, preventing overfitting and significantly boosting computational efficiency.
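The low-to-high frequency scheduling can be sketched as follows. Note that the linear bandwidth schedule and the random-swap search below are illustrative assumptions standing in for the paper's graph-constrained optimizer; only the coarse-to-fine ordering of frequencies reflects the described PDAS strategy.

```python
import numpy as np

def cfd(x, y, freqs):
    # Mean squared gap between empirical characteristic functions
    cf = lambda s: np.exp(1j * s @ freqs.T).mean(axis=0)
    return np.mean(np.abs(cf(x) - cf(y)) ** 2)

def select_coreset(data, m, n_steps=5, n_freqs=32, sigma_max=3.0, seed=0):
    """Toy PDAS-style selector: swap search under a frequency schedule
    that widens from low to high bands (schedule and swap moves are
    illustrative, not the paper's exact solver)."""
    rng = np.random.default_rng(seed)
    n, d = data.shape
    idx = rng.choice(n, size=m, replace=False)  # random initial coreset
    for step in range(n_steps):
        # Early steps draw narrow-band (low) frequencies to lock in the
        # global structure; later steps admit higher frequencies that
        # refine local detail.
        sigma = sigma_max * (step + 1) / n_steps
        freqs = rng.normal(0.0, sigma, size=(n_freqs, d))
        base = cfd(data[idx], data, freqs)
        for j in range(m):
            trial = idx.copy()
            trial[j] = rng.integers(n)  # propose replacing one point
            score = cfd(data[trial], data, freqs)
            if score < base:            # keep swaps that shrink the CFD
                idx, base = trial, score
    return idx

data = np.random.default_rng(1).normal(0.0, 1.0, size=(500, 2))
coreset_idx = select_coreset(data, m=20)
```

Because early steps only evaluate a narrow band of low frequencies, each iteration is cheap and the search settles the global shape of the distribution before spending any budget on fine detail, which is the efficiency argument the PDAS design makes.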

Unprecedented Performance and Efficiency Gains

In extensive benchmarking, FAST demonstrated superior performance across the board. The average accuracy improvement of 9.12% over state-of-the-art methods underscores its effectiveness in selecting high-quality, representative data. The efficiency metrics are even more compelling, with a 96.57% reduction in power consumption and a 2.2x average speedup compared to baseline coreset methods. These figures highlight FAST's dual strength in enhancing model performance while drastically reducing the environmental and operational costs of AI training.

Why This Matters: Key Takeaways

  • Breaks the Coreset Paradigm: FAST is the first DNN-free method to successfully enforce explicit distributional equivalence via frequency-domain matching, moving beyond flawed heuristics and model-biased approaches.
  • Solves a Fundamental Technical Hurdle: It overcomes the "vanishing phase gradient" in characteristic function analysis with its novel Attenuated Phase-Decoupled CFD, enabling stable optimization.
  • Delivers Practical, Scalable Efficiency: The Progressive Discrepancy-Aware Sampling strategy ensures fast, reliable convergence, making high-quality coreset selection feasible for massive datasets.
  • Enables Greener AI: The dramatic reductions in energy use (96.57%) and training time (2.2x speedup) address the growing sustainability crisis in large-scale machine learning.
