Local Shapley: Model-Induced Locality and Optimal Reuse in Data Valuation

Researchers from the University of Texas at Austin have introduced a groundbreaking method, Local Shapley, that dramatically accelerates the computationally prohibitive task of data valuation for machine learning models. This work fundamentally reframes the problem by exploiting the inherent locality of model predictions, shifting the computational burden from enumerating all possible training data subsets to processing only the small, influential subsets that actually affect a given prediction. The advance promises to make rigorous data valuation—critical for debugging, fairness auditing, and data marketplace pricing—practically feasible for complex modern models like graph neural networks (GNNs) and ensembles.

Key Takeaways

  • The core innovation is Local Shapley, a framework that leverages the fact that only a small, model-defined subset of training data (the "support set") influences any single prediction, drastically reducing the computational space for Shapley value calculation.
  • The researchers prove an information-theoretic lower bound on retraining operations and introduce LSMR (Local Shapley via Model Reuse), an optimal algorithm that trains each unique influential subset exactly once.
  • For larger support sets, they develop LSMR-A, a Monte Carlo estimator that remains unbiased and achieves exponential concentration, with runtime that scales with the number of distinct sampled subsets rather than the total sample count (see the sketch after this list).
  • Experiments across model families (KNN, Random Forests, GNNs) demonstrate that the method achieves substantial reductions in required model retrainings and runtime speedups while preserving high valuation fidelity compared to exact or global approximation methods.
  • This work transforms Shapley-based data valuation from a purely theoretical or small-scale tool into a practical component of the modern ML pipeline for complex, non-differentiable models.
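
To make the reuse idea concrete, below is a minimal, hypothetical sketch in the spirit of LSMR-A: permutations are sampled only over a point's support set, and each distinct coalition's utility is computed once and cached, so the expensive retraining cost tracks distinct subsets rather than total samples. The function names and the `utility` interface are assumptions for illustration, not the paper's implementation.

```python
import random

def local_shapley_mc(support_set, utility, n_permutations=200, seed=0):
    """Monte Carlo Shapley over a support set, reusing each distinct coalition.

    support_set: indices of the training points that influence the prediction
    utility: maps a frozenset of indices to the performance of a model
             retrained on exactly those points (the expensive call to reuse)
    """
    rng = random.Random(seed)
    cache = {}  # frozenset -> utility; each distinct subset is trained once

    def cached_utility(coalition):
        key = frozenset(coalition)
        if key not in cache:
            cache[key] = utility(key)  # the only place retraining happens
        return cache[key]

    phi = {i: 0.0 for i in support_set}
    points = list(support_set)
    for _ in range(n_permutations):
        rng.shuffle(points)
        coalition, prev = [], cached_utility(())
        for i in points:
            coalition.append(i)
            cur = cached_utility(coalition)
            phi[i] += cur - prev  # marginal contribution of point i
            prev = cur
    retrainings = len(cache)  # grows with distinct subsets, not with samples
    return {i: v / n_permutations for i, v in phi.items()}, retrainings
```

Because many permutations over a small support set revisit the same coalition prefixes, `retrainings` is bounded by 2^|support set| but in practice sits far below the n_permutations × |support set| utility calls a naive sampler would issue, which is the reuse behavior described in the takeaways above.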

Redefining Data Valuation Through Model-Induced Locality

The fundamental challenge of data valuation via Shapley values is its #P-hard computational complexity. Calculating the exact Shapley value for a single training data point requires evaluating the model's performance with every possible subset of the training set, necessitating an exponential number of model retrainings. Existing acceleration techniques, such as Monte Carlo approximation or gradient-based methods for differentiable models, remain "global"—they still operate over the entire, exponentially large coalition space. The key insight of this research is that for a given test instance, a modern predictor's output is typically determined by a very small, structurally defined portion of the training data.
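
For reference, the standard data-Shapley formulation (a textbook statement, not new to this paper) makes the cost explicit. The value of a training point i in dataset D under a utility v, where v(S) is typically the validation performance of a model retrained on subset S, is:

```latex
\phi_i \;=\; \sum_{S \subseteq D \setminus \{i\}}
  \frac{|S|!\,\bigl(|D| - |S| - 1\bigr)!}{|D|!}
  \Bigl( v(S \cup \{i\}) - v(S) \Bigr)
```

The sum ranges over all 2^{|D|-1} subsets excluding i, which is where the exponential retraining count comes from; model-induced locality, described next, effectively replaces D in this sum with the far smaller support set of a given test point.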

The authors formalize this as model-induced locality. For a given test point, the model's computational pathway defines a support set—examples that are truly influential. In a k-Nearest Neighbors (KNN) model, this is the set of k nearest neighbors; in a decision tree or Random Forest, it's the training points that fall into the same leaf node(s); in a Graph Neural Network (GNN), it's the nodes within the model's receptive field. The pivotal theoretical contribution is proving that Shapley value computation can be projected exactly onto these support sets when locality is exact, and approximated robustly when it is not. This reframes the problem from one of exhaustive coalition enumeration to one of structured data processing over families of overlapping, influential subsets.
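
To make the support-set notion concrete, here is a minimal sketch of extraction for the two simplest cases above, using standard scikit-learn query APIs; the helper names are illustrative, not the paper's interface.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.ensemble import RandomForestClassifier

def knn_support_set(X_train, x_test, k=5):
    """Support set for a KNN model: indices of the k nearest neighbors,
    the only training points that can influence this prediction."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, idx = nn.kneighbors(x_test.reshape(1, -1))
    return set(idx[0].tolist())

def forest_support_set(forest: RandomForestClassifier, X_train, x_test):
    """Support set for a fitted Random Forest: training points that land
    in the same leaf as the test point in at least one tree."""
    train_leaves = forest.apply(X_train)               # (n_train, n_trees)
    test_leaves = forest.apply(x_test.reshape(1, -1))  # (1, n_trees)
    same_leaf = (train_leaves == test_leaves).any(axis=1)
    return set(np.flatnonzero(same_leaf).tolist())
```

When locality is exact, every training point outside these sets has zero marginal contribution for this test point, which is what licenses projecting the Shapley computation onto the support set.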

Industry Context & Analysis

This research arrives at a critical juncture for MLOps and responsible AI. Data valuation is increasingly recognized as essential—not just for academic curiosity but for practical, high-stakes applications. Companies like Google and Microsoft use internal data valuation to prioritize quality data collection and identify mislabeled examples. Startups envisioning data marketplaces need it to price training data fairly. Regulatory frameworks demand explainability for automated decisions, which hinges on understanding data influence. However, the field has been bottlenecked by computation. Prior state-of-the-art methods, like Data Shapley or its Monte Carlo approximations, remain impractically slow for large datasets and complex models, often limiting real-world application to small-scale linear models or tiny datasets.

The Local Shapley approach represents a paradigm shift. Unlike gradient-based influence function methods (e.g., TracIn), which are specific to differentiable models and can be unstable, Local Shapley provides a principled, model-agnostic framework grounded in cooperative game theory. It also contrasts sharply with global Shapley approximations. For instance, while the permutation-based approximation in the SHAP library (~43.8k GitHub stars) reduces the number of sampled coalitions, it still implicitly ranges over the entire dataset. Local Shapley's subset-centric view changes what the computation scales with: its complexity is governed not by the dataset size N, but by the number of distinct influential subsets, which is often orders of magnitude smaller.

The technical implications are profound for non-differentiable model classes, which are dominant in many production systems. Tree-based models (e.g., XGBoost, Random Forests) power countless recommendation and fraud detection systems. The authors' demonstration on Random Forests shows LSMR can reduce required retrainings from an exponential function of dataset size to a linear function of the number of distinct leaf memberships—a monumental reduction. For GNNs, a rapidly growing field with applications in drug discovery and social network analysis where explainability is paramount, this is one of the first feasible methods for rigorous data valuation, moving beyond simpler node or edge attribution techniques.
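
As a rough illustration of why the retraining count collapses for forests, one can count how many distinct support sets a batch of test queries actually induces. This is only a coarse proxy for the distinct leaf memberships that govern LSMR's cost (the exact accounting is the paper's), reusing the hypothetical `forest_support_set` helper sketched earlier.

```python
def count_distinct_support_sets(forest, X_train, X_test):
    """Count the distinct support sets induced by a batch of test points.
    An LSMR-style scheme pays per distinct pattern, however many test
    points share it, rather than per subset of the full training set."""
    patterns = {frozenset(forest_support_set(forest, X_train, x))
                for x in X_test}
    return len(patterns)
```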

What This Means Going Forward

The immediate beneficiaries of this work are ML engineers and researchers working with complex, non-differentiable models who require auditable and explainable AI systems. Industries reliant on Random Forests for high-stakes decisions (finance, healthcare) and those adopting GNNs (biotech, cybersecurity) now have a viable path to implement rigorous data valuation. This enables practical workflows for detecting training set biases, identifying corrupted labels, and curating high-quality core datasets—tasks that were previously intractable for these model families.

Looking ahead, this research opens several new avenues. First, it will spur the development of integrated data valuation tools within major ML frameworks. We can expect future versions of libraries like SHAP or Captum to incorporate locality principles for tree and graph models. Second, it provides a formal framework for "data attribution" in data marketplaces. A platform could use LSMR to dynamically price the contribution of individual data points to a model's performance on specific validation criteria, creating a more nuanced and fair data economy. Finally, the concept of model-induced locality is likely to influence other areas of explainable AI (XAI). Similar principles could be applied to accelerate other coalitional game-based explanations or to improve the efficiency of interpreting large, compound models.

The critical watchpoint will be the community's adoption and extension of the core support set concept for emerging model architectures, such as transformers and large language models (LLMs). Defining the "support set" for a single prediction from a 100-billion-parameter LLM is an open and monumental challenge, but the Local Shapley framework provides a new mathematical lens through which to attack it. If successful, it could eventually help answer pressing questions about what specific data in a massive pretraining corpus is responsible for a model's specific capability or failure.
