Local Shapley: Model-Induced Locality and Optimal Reuse in Data Valuation

The research paper "Local Shapley: Efficient Data Valuation via Model-Induced Locality" introduces a paradigm-shifting method to overcome the prohibitive computational cost of Shapley value-based data valuation. By leveraging the inherent locality of modern machine learning models, this work reframes the valuation problem, offering a path to practical, scalable data attribution for large datasets—a critical need as AI development becomes increasingly data-centric.

Key Takeaways

  • The paper formalizes the concept of model-induced locality, where only a small, model-defined subset of training data (the "support set") influences a given prediction.
  • It proves that Shapley value computation can be projected onto these support sets without loss of fidelity when locality is exact, fundamentally changing the problem structure.
  • The authors introduce two algorithms: LSMR, an optimal algorithm that trains each influential subset exactly once, and LSMR-A, a scalable Monte Carlo estimator for larger support sets.
  • Experiments across model families (KNN, trees, GNNs) demonstrate the methods achieve substantial reductions in required model retrainings and runtime while preserving high valuation accuracy.
  • The work establishes an information-theoretic lower bound on retraining operations, governed by the number of distinct influential subsets, not the exponential coalition space.

Redefining Data Valuation Through Model Locality

The core innovation of "Local Shapley" is its formal recognition and exploitation of a structural property common to many modern predictors. For a given test instance, the model's computational pathway—such as the k-nearest neighbors in KNN, the activating leaf in a decision tree, or the receptive field in a Graph Neural Network (GNN)—defines a small, relevant subset of the training data. The paper proves that when this locality is exact, the Shapley value for a data point is zero if it lies outside this "support set" for the given prediction. This allows the valuation computation to be projected solely onto the overlapping family of these influential subsets, moving away from the traditional, intractable paradigm of enumerating all possible coalitions of the entire dataset.
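To make the projection concrete, here is a minimal sketch (our illustration, not the paper's code) for a k-NN model under the exact-locality assumption: the support set of a test point is its k nearest training points, every point outside it gets a Shapley value of zero, and exact Shapley values need only enumerate the 2^k coalitions inside the support set rather than 2^N over the whole dataset. The function names and the generic `utility` callback are our own assumptions.

```python
import numpy as np
from itertools import combinations
from math import comb

def knn_support_set(X_train, x_test, k):
    """Indices of the k nearest training points: the model-defined support set."""
    dists = np.linalg.norm(X_train - x_test, axis=1)
    return np.argsort(dists)[:k]

def shapley_on_support(support, utility):
    """Exact Shapley values restricted to the support set.

    `utility(coalition)` scores a coalition (a frozenset of indices).
    Points outside `support` are implicitly valued at zero, so the loop
    enumerates 2^k coalitions instead of 2^N.
    """
    n = len(support)
    values = {}
    for i in support:
        others = [j for j in support if j != i]
        total = 0.0
        for size in range(n):  # coalitions of each size drawn from the others
            for S in combinations(others, size):
                # Standard Shapley weight |S|!(n-|S|-1)!/n!, rewritten via comb
                w = 1.0 / (n * comb(n - 1, size))
                total += w * (utility(frozenset(S) | {i}) - utility(frozenset(S)))
        values[i] = total
    return values
```

With a toy utility that just counts coalition members, symmetry and efficiency force every support-set point to the same value, which is a quick sanity check on the weights.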

This reframing transforms Shapley evaluation from a combinatorial nightmare into a structured data processing problem. The theoretical contribution establishes that the intrinsic complexity is no longer tied to the dataset size (N) but to the number of distinct influential subsets (M) present across different test queries. The authors provide an information-theoretic lower bound, showing that any correct algorithm must perform at least M model retrainings, setting a clear target for optimal efficiency.
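The retrain-once idea behind LSMR's optimality can be illustrated with a small caching sketch. This is our own hypothetical wrapper, not the paper's implementation: each distinct influential subset triggers exactly one "retraining," and repeated subsets across test queries hit the cache, so the retraining count tracks M rather than the coalition space.

```python
class CachedUtility:
    """Illustrative retrain-once cache (names are ours, not the paper's API).

    `retrain_fn` maps a coalition (frozenset of training indices) to a
    utility score, standing in for an expensive model retraining.
    """

    def __init__(self, retrain_fn):
        self.retrain_fn = retrain_fn
        self.cache = {}
        self.retrain_count = 0

    def __call__(self, coalition):
        key = frozenset(coalition)
        if key not in self.cache:
            self.retrain_count += 1  # one retraining per *distinct* subset
            self.cache[key] = self.retrain_fn(key)
        return self.cache[key]
```

Querying the same influential subset twice, even via differently ordered collections, costs only one retraining, matching the lower bound of M retrainings for M distinct subsets.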

Industry Context & Analysis

Data valuation, or quantifying the contribution of individual data points to a model's performance, is a foundational challenge with implications for data marketplaces, fairness auditing, and dataset curation. The Shapley value from cooperative game theory is the gold-standard, principled method for this task. However, its adoption has been severely limited by computational intractability; exact calculation is #P-hard, requiring the evaluation of 2^N coalitions for N data points. This has confined practical use to tiny datasets or necessitated approximations like Monte Carlo Shapley or the Data Shapley approximation (Ghorbani & Zou, 2019), which still require thousands of model retrainings and scale poorly.
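For contrast with the baselines above, a minimal permutation-sampling estimator in the spirit of Data Shapley (Ghorbani & Zou, 2019) looks roughly as follows. The `utility` callback stands in for retraining and evaluating a model on a coalition, which is exactly the expensive step whose count Local Shapley drives down; function names and parameters here are our own illustration.

```python
import random

def monte_carlo_shapley(points, utility, n_permutations=200, seed=0):
    """Permutation-sampling Shapley estimate over all N points.

    Each permutation incurs N utility evaluations (model retrainings),
    so thousands of retrainings accumulate quickly on real models.
    """
    rng = random.Random(seed)
    values = {p: 0.0 for p in points}
    for _ in range(n_permutations):
        perm = list(points)
        rng.shuffle(perm)
        coalition, prev = set(), utility(frozenset())
        for p in perm:
            coalition.add(p)
            cur = utility(frozenset(coalition))
            values[p] += cur - prev  # marginal contribution in this ordering
            prev = cur
    return {p: v / n_permutations for p, v in values.items()}
```

Even this simple estimator makes the cost structure plain: the number of utility calls scales with both N and the number of sampled permutations, independent of any locality in the model.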

The "Local Shapley" approach represents a significant departure from these global approximation methods. Unlike post-hoc influence estimates inferred from gradients or heuristics in web-scale pretraining pipelines, or model-agnostic explainability tools such as Truera's, this method is deeply integrated with the model's internal mechanics. It is conceptually closer to influence-function methods, but it operates on discrete subsets defined by the model's architecture rather than on continuous, Hessian-based approximations.

The practical impact is substantial. For a tree-based model like XGBoost (a library with over 27k GitHub stars), evaluating a test instance only requires analyzing the data points that fall into the same leaf. In a KNN model, valuation is confined to the k neighbors. This aligns with the industry's shift towards more efficient, specialized training paradigms. The runtime of LSMR and LSMR-A is governed by the number of distinct influential subsets (M), which in practice is often orders of magnitude smaller than the dataset size (N). For example, in a dataset of 10,000 points, the full coalition space contains 2^10,000 subsets, while Local Shapley might only need to process a few hundred unique support sets.
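The gap between M and N can be seen with a toy depth-1 decision stump, sketched below under our own illustrative names and threshold: each leaf's training members form one support set, and M is just the number of distinct leaves actually hit by the test queries.

```python
def leaf_id(x, threshold=0.5):
    """Route a scalar input to one of two leaves of a hypothetical stump."""
    return 0 if x < threshold else 1

def distinct_support_sets(train, test, threshold=0.5):
    """Count M, the number of distinct influential subsets hit by the queries.

    The support set of a leaf is the set of training points routed to it;
    many test queries sharing a leaf share one support set, so M << N.
    """
    leaves = {}
    for i, x in enumerate(train):
        leaves.setdefault(leaf_id(x, threshold), set()).add(i)
    hit = {frozenset(leaves.get(leaf_id(x, threshold), frozenset())) for x in test}
    return len(hit)
```

Here three test queries that land in two leaves yield M = 2 support sets to process, regardless of how many training points or queries there are in total.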

What This Means Going Forward

This research opens the door to practical, exact Shapley-based data valuation for a wide class of models where locality is a defining feature. Industries that rely on tree-based models (finance, healthcare) and graph neural networks (social networks, recommendation systems) stand to benefit immediately, enabling them to audit training data for bias, identify mislabeled examples, or establish data provenance for regulatory compliance. Data marketplaces could use this to price individual data points based on their provable contribution to specific model predictions, moving beyond bulk pricing.

The immediate next step is the extension of this "locality" principle to broader model classes. The paper focuses on models with discrete, well-defined support sets. A critical area for future work is adapting these concepts to large deep neural networks like GPT-4 or Stable Diffusion, where the notion of a "support set" is less clear-cut but could be approximated via attention mechanisms or gradient-based saliency. Furthermore, as the industry pushes for more interpretable and auditable AI, techniques like Local Shapley provide a mathematically rigorous tool for dataset debugging and curation, which is essential for improving benchmark performance on tasks like MMLU or HumanEval.

Watch for this work to catalyze a new subfield of efficient data attribution. The core insight—that model structure dictates valuation complexity—will likely inspire similar "algorithm-aware" accelerations for other foundational but costly AI operations. As models and datasets grow, the ability to efficiently attribute value and influence will become not just a technical curiosity, but a commercial and ethical imperative.
