FAST: A New Paradigm for Energy-Efficient AI Training via Distribution-Matching Coresets
A groundbreaking new framework named FAST promises to dramatically reduce the massive computational and energy costs of training deep neural networks (DNNs) by fundamentally rethinking how to select the most representative data. By formulating coreset selection as a graph-constrained optimization problem and employing a novel frequency-domain metric, FAST achieves superior data compression with full theoretical guarantees, setting a new state-of-the-art for efficiency and performance.
The research, detailed in the paper "FAST: DNN-Free Distribution-Matching Coreset Selection via Spectral Graph Theory and Characteristic Function Distance" (arXiv:2511.19476v3), directly addresses critical flaws in existing methods. Current approaches are either DNN-based, tying the selection to a particular model and its biases, or DNN-free, relying on heuristics without ensuring the selected subset truly matches the original data's statistical distribution. This lack of explicit distributional equivalence has been a major bottleneck, as traditional metrics like MSE, KL divergence, and MMD fail to capture complex, higher-order statistical discrepancies.
Overcoming the Distribution-Matching Challenge
The core innovation of FAST is its dual-pronged theoretical and methodological advance. First, it grounds the discrete sampling problem in spectral graph theory, allowing for rigorous optimization. Second, it adopts the Characteristic Function Distance (CFD), which operates in the frequency domain to capture a complete picture of a distribution, including all moments. However, the team discovered a critical flaw in a naive CFD implementation: a "vanishing phase gradient" issue that renders medium and high-frequency information unusable.
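For readers unfamiliar with the metric, the sketch below shows what a plain empirical characteristic function and a naive CFD estimate look like in NumPy. The function names, the random-frequency sampling, and the squared-modulus weighting are illustrative assumptions, not the paper's exact formulation; this naive form is the kind of baseline in which the phase-gradient problem arises.

```python
import numpy as np

def empirical_cf(data, freqs):
    """Empirical characteristic function phi(t) = E[exp(i * t . x)],
    evaluated at each frequency vector in `freqs`.
    data: (n, d) array of samples; freqs: (k, d) array of frequencies."""
    # (n, k) matrix of inner products t . x, then average the complex exponentials over samples
    proj = data @ freqs.T
    return np.exp(1j * proj).mean(axis=0)

def naive_cfd(full_data, coreset, freqs):
    """Naive characteristic-function distance: mean squared modulus of the
    difference between the two empirical CFs at the sampled frequencies.
    (Illustrative only; FAST's attenuated, phase-decoupled variant differs.)"""
    diff = empirical_cf(full_data, freqs) - empirical_cf(coreset, freqs)
    return np.mean(np.abs(diff) ** 2)

# Example: score a random coreset against the full set at random low frequencies
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))                # hypothetical full dataset
S = X[rng.choice(1000, 100, replace=False)]   # candidate coreset
T = rng.normal(scale=0.5, size=(64, 8))       # sampled frequency vectors
print(naive_cfd(X, S, T))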
To solve this, the researchers developed an Attenuated Phase-Decoupled CFD. This enhanced metric effectively disentangles and preserves crucial phase information across all frequency bands, enabling accurate, full-spectrum distribution matching for the first time in a coreset context. This theoretical breakthrough is what makes truly representative, DNN-free coreset selection feasible.
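The paper's precise definition of the Attenuated Phase-Decoupled CFD is not reproduced here, but the following sketch (reusing `empirical_cf` from the snippet above) conveys the general idea: compare amplitude and phase separately, and attenuate the phase term where the amplitude is small so that phase information in higher frequency bands stays usable rather than being swamped by noise. The exponent `alpha` and the specific weighting are assumptions for illustration.

```python
def phase_decoupled_cfd(full_data, coreset, freqs, alpha=1.0):
    """Sketch of an amplitude/phase-decoupled CFD (assumed form, not the
    paper's exact metric): amplitudes and phases of the two empirical CFs
    are compared separately, with the phase term attenuated where the
    amplitude is small."""
    cf_full = empirical_cf(full_data, freqs)
    cf_core = empirical_cf(coreset, freqs)
    amp_full, amp_core = np.abs(cf_full), np.abs(cf_core)
    amplitude_term = (amp_full - amp_core) ** 2
    # Phase difference wrapped to (-pi, pi], attenuated by the full-data amplitude
    phase_diff = np.angle(cf_full * np.conj(cf_core))
    phase_term = (amp_full ** alpha) * phase_diff ** 2
    return np.mean(amplitude_term + phase_term)
```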
Progressive Sampling for Optimal Convergence
Selecting the optimal subset requires intelligent optimization. The FAST framework incorporates a clever Progressive Discrepancy-Aware Sampling strategy. This technique schedules frequency selection from low to high, analogous to a painter first sketching a broad outline before adding fine details.
This progression ensures the coreset first captures the global structural patterns of the data (low-frequency information) before refining local details (high-frequency information). This method not only enables accurate matching with fewer sampled frequencies but also prevents overfitting, leading to more robust and generalizable coresets.
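As a rough illustration of the low-to-high scheduling idea, the sketch below samples frequency vectors at progressively larger scales and uses a simple greedy-swap optimizer as a stand-in for the paper's graph-constrained optimization. The stage count, frequency scales, and swap heuristic are all assumptions; only the progression from coarse to fine frequencies mirrors the strategy described above.

```python
def progressive_sampling(full_data, rng, coreset_size=100,
                         n_stages=4, freqs_per_stage=32, swaps_per_stage=50):
    """Sketch of progressive, discrepancy-aware selection (assumed mechanics):
    frequencies are drawn from low to high scales across stages, so global
    structure is matched before local detail is refined."""
    n, d = full_data.shape
    idx = rng.choice(n, coreset_size, replace=False)   # initial random coreset
    for stage in range(n_stages):
        scale = 0.25 * 2 ** stage                      # low frequencies first, then higher bands
        freqs = rng.normal(scale=scale, size=(freqs_per_stage, d))
        best = phase_decoupled_cfd(full_data, full_data[idx], freqs)
        for _ in range(swaps_per_stage):
            # Propose replacing one coreset point with a point outside the coreset
            out, cand = rng.integers(coreset_size), rng.integers(n)
            if cand in idx:
                continue
            trial = idx.copy()
            trial[out] = cand
            score = phase_decoupled_cfd(full_data, full_data[trial], freqs)
            if score < best:                           # keep swaps that reduce the discrepancy
                idx, best = trial, score
    return idx

# Example (reusing X and rng from the first snippet):
# core_idx = progressive_sampling(X, rng, coreset_size=100)
```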
Unprecedented Gains in Accuracy and Efficiency
Extensive experimental validation confirms FAST's superiority. The framework significantly outperformed all existing state-of-the-art coreset selection methods across multiple benchmarks, achieving an average accuracy gain of 9.12%. Beyond accuracy, its efficiency gains are staggering.
When compared to other baseline coreset methods, FAST demonstrated a 96.57% reduction in power consumption and achieved a 2.2x average speedup in the training pipeline. These figures underscore its dual value: it not only produces better models but does so with a fraction of the computational cost and carbon footprint.
Why This Matters: Key Takeaways
- Eliminates Architectural Bias: As a DNN-free framework, FAST selects coresets independent of any specific model architecture, providing unbiased, portable data subsets for any downstream task.
- Provides Theoretical Guarantees: It moves beyond heuristic methods by offering a rigorous optimization framework grounded in spectral graph theory, ensuring distributional equivalence.
- Captures Full Data Distribution: The novel Attenuated Phase-Decoupled CFD metric captures higher-order statistical moments missed by traditional metrics like KL divergence or MMD, leading to more representative coresets.
- Drives Sustainable AI: The dramatic reduction in energy use (over 96%) addresses the growing environmental concerns associated with large-scale AI training, making model development more sustainable.
- Accelerates Research & Development: By compressing datasets without sacrificing performance, FAST can drastically reduce experiment iteration times, accelerating innovation in machine learning.