Local Shapley: Model-Induced Locality and Optimal Reuse in Data Valuation

Local Shapley is a novel computational framework for data valuation that exploits model-induced locality to drastically reduce complexity. It introduces the LSMR and LSMR-A algorithms, which require model training only on influential data subsets, transforming the problem from exponential 2^N coalition enumeration into structured processing over a much smaller family of subsets. The method yields exact results under locality conditions and unbiased approximations with exponential concentration guarantees.

Researchers have introduced a novel, computationally efficient framework for data valuation, a critical task in machine learning that determines the value of individual data points in a training set. By exploiting the inherent locality of modern predictive models, this work fundamentally reframes the problem, offering a pathway to make Shapley value-based data valuation practical for large-scale, real-world applications.

Key Takeaways

  • The new Local Shapley framework leverages the fact that, for a given prediction, only a small subset of training data is influential, drastically reducing the computational complexity of data valuation.
  • It introduces two algorithms: LSMR, an optimal algorithm that retrains the model on each distinct influential subset only once, and LSMR-A, a scalable Monte Carlo estimator for larger support sets.
  • The method is proven to be exact under conditions of model-induced locality and unbiased with exponential concentration guarantees for its approximate version.
  • Experiments across multiple model families (KNN, trees, GNNs) demonstrate substantial reductions in required model retraining and significant speedups while maintaining high fidelity to the true Shapley values.
  • The work establishes an information-theoretic lower bound on retraining operations, governed by the number of distinct influential subsets, not the exponential coalition space.

Redefining Data Valuation with Model-Induced Locality

The core innovation of this research is the formalization of model-induced locality. The authors observe that for a given test instance, a model's prediction is typically influenced by only a small, structured subset of the training data. For a k-Nearest Neighbors model, this is the set of k closest points; for a decision tree, it's the data points that fall into the same leaf node; and for a Graph Neural Network, it's the nodes within a specific receptive field.
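To make the notion concrete, here is a small sketch (not from the paper) of how support sets could be extracted for two of these model families with scikit-learn; variable names such as `knn_support` and `tree_support` are ours, not the authors'.

```python
# Illustrative sketch of "support sets" for two model families.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))
y_train = X_train[:, 0] + 0.1 * rng.normal(size=200)
x_test = rng.normal(size=(1, 5))

# KNN: the support set is the indices of the k nearest training points.
k = 5
nn = NearestNeighbors(n_neighbors=k).fit(X_train)
knn_support = nn.kneighbors(x_test, return_distance=False)[0]

# Decision tree: the support set is the training points that land in the
# same leaf as the test instance.
tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_train, y_train)
test_leaf = tree.apply(x_test)[0]
tree_support = np.where(tree.apply(X_train) == test_leaf)[0]

print("KNN support set:", knn_support)
print("Tree support set size:", len(tree_support))
```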

The researchers define these as support sets and prove a critical theorem: Shapley value computation can be projected exactly onto the union of these support sets when locality is perfect. This transforms the problem from an intractable enumeration over 2^N coalitions (where N is the dataset size) into a structured processing task over a family of overlapping, but far fewer, influential subsets. The theoretical contribution establishes that the intrinsic complexity of Local Shapley is governed by the number of distinct influential subsets, providing a new information-theoretic lower bound.
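As a rough illustration of what this projection buys, the sketch below enumerates coalitions only inside a single support set, assuming perfect locality so that points outside the support contribute nothing to the prediction's utility. The function name `local_shapley` and the toy `utility` callable are hypothetical stand-ins for the paper's prediction-level utility, not its actual pseudocode.

```python
# Minimal sketch: exact Shapley values restricted to a support set,
# assuming perfect model-induced locality. Enumeration is over
# 2^|support| coalitions instead of 2^N.
from itertools import combinations
from math import comb

def local_shapley(support, utility):
    support = list(support)
    s = len(support)
    values = {i: 0.0 for i in support}
    for i in support:
        others = [j for j in support if j != i]
        for r in range(len(others) + 1):
            # Standard Shapley weight |C|!(s-1-|C|)!/s! = 1 / (s * C(s-1, r))
            weight = 1.0 / (s * comb(s - 1, r))
            for coalition in combinations(others, r):
                c = frozenset(coalition)
                values[i] += weight * (utility(c | {i}) - utility(c))
    return values

# Toy utility: a coalition's value is just its size (for demonstration).
vals = local_shapley(support={3, 17, 42}, utility=lambda c: len(c))
print(vals)  # each point gets 1.0 by symmetry
```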

Industry Context & Analysis

This work tackles the central bottleneck preventing the widespread adoption of Shapley value-based data valuation: its #P-hard computational complexity. Existing approximations, like Data Shapley or the TMC-Shapley algorithm, still require a massive number of model retrainings on random subsets of data, which is prohibitively expensive for large models. For context, evaluating a dataset of just 1,000 points using a naive Monte Carlo approach could require tens of thousands of retrainings. In contrast, the Local Shapley approach directly attacks this cost by identifying and reusing computations only on the subsets that mathematically matter.
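For intuition about where that cost comes from, here is a rough sketch of permutation-based Monte Carlo estimation in the spirit of TMC-Shapley; `train_and_score` is a hypothetical placeholder for one full retraining plus validation scoring, so every call to it counts as one retraining.

```python
# Rough sketch of permutation-based Monte Carlo Shapley (TMC-Shapley spirit),
# illustrating the retraining cost that Local Shapley attacks.
import random

def mc_shapley(n_points, train_and_score, n_permutations=100):
    values = [0.0] * n_points
    empty_utility = train_and_score(frozenset())      # baseline v(empty set)
    for t in range(1, n_permutations + 1):
        perm = list(range(n_points))
        random.shuffle(perm)
        prev_utility = empty_utility
        coalition = set()
        for i in perm:
            coalition.add(i)
            utility = train_and_score(frozenset(coalition))   # one retraining per step
            values[i] += (utility - prev_utility - values[i]) / t  # incremental mean
            prev_utility = utility
    return values

# Roughly n_permutations * n_points retrainings overall: for 1,000 points,
# even a modest permutation budget lands in the tens of thousands.
```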

The technical implication is a paradigm shift from "global" to "local" data valuation. Unlike methods that treat the model as a black box and sample coalitions randomly, this approach is white-box, leveraging knowledge of the model's internal computational pathway (e.g., tree structure, graph connectivity). This mirrors a broader trend in explainable AI (XAI) where model-specific explanations (like TreeSHAP for tree ensembles) are far more efficient than model-agnostic ones (like KernelSHAP).

The performance claims are significant. If a model's support sets are small—imagine a random forest where each test point is influenced by data in only a few leaves—the LSMR algorithm could reduce retraining operations from exponential to linear in the number of distinct supports. For scenarios with larger supports, the LSMR-A estimator's runtime is determined by the number of distinct sampled subsets, not the total number of Monte Carlo draws. This is a major efficiency gain, as many random draws will sample redundant subsets. The guarantee of remaining unbiased with exponential concentration is crucial for providing reliable valuations in production settings.
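The reuse idea behind that gain can be illustrated with simple memoization: cache utilities keyed by the distinct subset, so the retraining count tracks distinct coalitions rather than raw draws. This is only a sketch of the caching principle, not the paper's LSMR-A pseudocode.

```python
# Sketch of computation reuse: only distinct coalitions trigger retraining.
from functools import lru_cache

def make_cached_utility(train_and_score):
    stats = {"retrainings": 0}

    @lru_cache(maxsize=None)
    def utility(coalition):            # coalition must be hashable (frozenset)
        stats["retrainings"] += 1      # cache misses only: one retraining each
        return train_and_score(coalition)

    return utility, stats

# Usage: wrap the expensive trainer once, then sample coalitions freely.
utility, stats = make_cached_utility(lambda c: len(c))  # toy stand-in trainer
for coalition in [frozenset({1, 2}), frozenset({1, 2}), frozenset({3})]:
    utility(coalition)
print(stats["retrainings"])  # 2 distinct subsets -> 2 "retrainings", not 3
```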

What This Means Going Forward

The immediate beneficiaries of this research are organizations dealing with data marketplaces, data curation, and debugging machine learning pipelines. For platforms like Google's Dataset Search or emerging data-as-a-service models, practical data valuation is essential for pricing, provenance, and quality assessment. This method could enable valuation for datasets used to train large, complex models like GNNs for drug discovery or recommendation systems, where traditional methods are computationally impossible.

We can expect to see this methodology integrated into popular explainability libraries. SHAP (SHapley Additive exPlanations), a ubiquitous library with over 11.5k stars on GitHub, already implements model-specific optimizations. The Local Shapley framework provides a formal foundation to extend these optimizations in a principled way, potentially leading to a new class of "LocalSHAP" explainers for tree-based models, GNNs, and other locally interpretable architectures.

Looking ahead, the key trend to watch is the convergence of data-centric AI and efficient explainability. As the industry shifts focus from model architecture to data quality, tools for understanding data's contribution become paramount. The next frontier will be applying these local valuation principles to massive foundation models. While the exact locality assumption may not hold for a dense transformer model on a broad task, the core idea—identifying and valuing the most influential subsets of the pretraining corpus—could inform crucial research into dataset pruning, copyright attribution, and bias mitigation at scale.
