FAST: Topology-Aware Frequency-Domain Distribution Matching for Coreset Selection

FAST: A New AI Framework Dramatically Cuts Energy Use in Model Training

Researchers have introduced a novel AI framework, FAST, that promises to change how large datasets are prepared for training deep neural networks. By formulating coreset selection as a graph-constrained optimization problem and employing a new frequency-domain metric, FAST achieves strong data compression with theoretical guarantees. The method significantly outperforms existing techniques, delivering an average 9.12% accuracy gain while slashing power consumption by 96.57% and speeding up selection by an average of 2.2x.

The core challenge in coreset selection—creating a small, representative subset of a large dataset—has been balancing efficiency with distributional fidelity. Traditional DNN-based methods are often biased by the model architecture they are tied to, while DNN-free heuristic approaches lack rigorous mathematical foundations. Furthermore, common metrics like Mean Squared Error (MSE) or Maximum Mean Discrepancy (MMD) fail to capture complex, higher-order statistical differences between the original data and the selected coreset.
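
To make the limitation concrete, here is a toy example (ours, not from the paper): two one-dimensional samples with identical mean and variance but different skew. A comparison of low-order moments reports essentially zero discrepancy, while the empirical characteristic functions separate clearly.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 50_000)        # symmetric: mean 0, variance 1
b = rng.exponential(1.0, 50_000) - 1.0  # skewed:    mean 0, variance 1

# The first two moments match almost exactly...
print(abs(a.mean() - b.mean()), abs(a.var() - b.var()))  # both ~0

# ...but the empirical characteristic functions clearly differ.
t = np.linspace(0.5, 3.0, 8)            # a handful of test frequencies
cf = lambda x: np.exp(1j * np.outer(t, x)).mean(axis=1)
print(np.abs(cf(a) - cf(b)).max())      # far from 0
```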

Bridging the Theory-Practice Gap with Spectral Graph Theory

FAST addresses these limitations head-on by being the first DNN-free framework to explicitly enforce distributional equivalence. It grounds the discrete sampling problem in spectral graph theory, treating the dataset as a graph where data points are nodes. The selection task is then formulated as a constrained optimization problem on this graph, providing a robust theoretical backbone absent in prior heuristic methods.
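
The paper's exact graph construction isn't reproduced here, but the standard spectral setup it builds on can be sketched in a few lines: treat each data point as a node in a k-nearest-neighbor graph and form the unnormalized graph Laplacian. The helper name `dataset_laplacian` and the choice of k are our illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def dataset_laplacian(X, k=10):
    """k-NN graph over the data points and its unnormalized Laplacian."""
    A = kneighbors_graph(X, n_neighbors=k, mode="connectivity")
    A = ((A + A.T) > 0).astype(float)        # symmetrize the adjacency
    deg = np.asarray(A.sum(axis=1)).ravel()  # node degrees
    return np.diag(deg) - A.toarray()        # L = D - A

X = np.random.default_rng(1).normal(size=(300, 16))
L = dataset_laplacian(X)
# For a 0/1 selection indicator s, the quadratic form s @ L @ s counts cut
# edges, which is how graph topology can constrain which subsets are admissible.
print(L.shape, np.allclose(L, L.T))
```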

To measure distributional match, the team moved beyond traditional metrics to the Characteristic Function Distance (CFD). The CFD operates in the frequency domain, comparing characteristic functions, that is, the Fourier transforms of the underlying distributions. Because a characteristic function uniquely determines its distribution, the CFD captures complete statistical information, including the higher-order moments that MSE and KL divergence miss.
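
In practice, the CFD between the full dataset and a candidate coreset can be estimated by evaluating their empirical characteristic functions at a set of sampled frequencies. A minimal sketch, assuming Gaussian-sampled frequencies and an unweighted squared discrepancy (the paper's exact weighting may differ):

```python
import numpy as np

def empirical_cf(X, T):
    """Empirical characteristic function of samples X at frequencies T."""
    return np.exp(1j * T @ X.T).mean(axis=1)   # (m, n) -> (m,) complex

def cfd(X, Y, m=256, seed=0):
    """Squared discrepancy between the empirical CFs of X and Y."""
    T = np.random.default_rng(seed).normal(size=(m, X.shape[1]))
    diff = empirical_cf(X, T) - empirical_cf(Y, T)
    return float(np.mean(np.abs(diff) ** 2))
```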

Overcoming Frequency-Domain Challenges

However, the researchers identified a critical flaw in a naive application of CFD: a "vanishing phase gradient" in the medium- and high-frequency bands, which impedes effective optimization. Their solution is an Attenuated Phase-Decoupled CFD. This enhanced metric decouples the phase of the characteristic function from its amplitude and attenuates the phase term at higher frequencies, stabilizing optimization across the full spectrum.
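
The paper's exact formula isn't given here, but the idea admits a simple sketch: compare amplitudes directly, compare phases through a wrapped phase difference, and damp the phase term at high frequencies where its gradient vanishes. The Gaussian attenuation and the name `apd_cfd` are our illustrative assumptions.

```python
import numpy as np

def apd_cfd(X, Y, T, lam=0.5):
    """Amplitude term plus a frequency-attenuated phase term (illustrative)."""
    cf = lambda Z: np.exp(1j * T @ Z.T).mean(axis=1)
    cx, cy = cf(X), cf(Y)
    amp = (np.abs(cx) - np.abs(cy)) ** 2          # magnitude mismatch
    phase = np.angle(cx * np.conj(cy))            # wrapped phase difference
    atten = np.exp(-lam * np.sum(T**2, axis=1))   # damp high-frequency phase
    return float(np.mean(amp + atten * phase**2))
```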

For efficient and accurate optimization, the team also designed a Progressive Discrepancy-Aware Sampling (PDAS) strategy. PDAS schedules which frequencies to match, starting with low frequencies to capture the global data structure before progressively incorporating higher frequencies to refine local details. This approach prevents overfitting, enables precise distribution matching with fewer frequency components, and speeds convergence.
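
A toy rendition of the coarse-to-fine idea (a deliberate simplification, not the published algorithm): widen the frequency band stage by stage and, within each stage, greedily swap coreset points whenever a swap shrinks the CF gap to the full dataset.

```python
import numpy as np

def progressive_select(X, size, stages=(0.5, 1.0, 2.0), swaps=200, seed=0):
    """Refine a random coreset stage by stage, low frequencies first."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size, replace=False)  # initial coreset

    def cf_gap(sel, T, target):
        cfs = np.exp(1j * T @ X[sel].T).mean(axis=1)
        return np.mean(np.abs(cfs - target) ** 2)

    for sigma in stages:                            # low -> high bands
        T = rng.normal(scale=sigma, size=(128, X.shape[1]))
        target = np.exp(1j * T @ X.T).mean(axis=1)  # full-data CF
        for _ in range(swaps):                      # cheap greedy swaps
            new = int(rng.integers(len(X)))
            if new in idx:                          # keep indices unique
                continue
            cand = idx.copy()
            cand[rng.integers(size)] = new
            if cf_gap(cand, T, target) < cf_gap(idx, T, target):
                idx = cand
    return idx

X = np.random.default_rng(1).normal(size=(2_000, 8))
coreset = X[progressive_select(X, size=100)]
```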

Unprecedented Performance and Efficiency Gains

In extensive benchmarking against state-of-the-art coreset selection methods, FAST demonstrated commanding leads. The average accuracy improvement of 9.12% underscores its ability to select more informative and representative data subsets. The efficiency metrics are even more striking, highlighting the framework's potential for sustainable AI development.

The 96.57% reduction in power consumption translates to massive energy savings, a critical concern as the computational footprint of AI grows. Coupled with the 2.2x average speedup, FAST presents a compelling solution for deploying efficient machine learning on resource-constrained devices and in large-scale industrial training environments.

Why This Matters: Key Takeaways

  • Breakthrough in Efficiency: FAST sets a new standard for energy-efficient AI training, cutting power use by over 96% and more than doubling the speed of data preparation.
  • Theoretically Sound Compression: It provides the first DNN-free, distribution-matching framework with rigorous grounding in spectral graph theory, solving a core theoretical gap in the field.
  • Superior Model Performance: By better capturing full data distributions, coresets selected by FAST lead to trained models that are, on average, over 9% more accurate than those using previous methods.
  • Practical Algorithmic Innovation: The introduction of Attenuated Phase-Decoupled CFD and Progressive Discrepancy-Aware Sampling solves key optimization challenges, making high-fidelity coreset selection computationally feasible.
