The research paper "Local Shapley: Efficient Data Valuation via Model-Induced Locality" introduces a method for making the computationally prohibitive Shapley value practical at the scale of modern machine learning. By exploiting the locality inherent in many predictors, it reframes the valuation problem and offers an efficient way to identify which training data points are most valuable for a specific prediction, a capability central to debugging, data curation, and fairness auditing.
Key Takeaways
- The paper formalizes the concept of model-induced locality, where only a small, structured subset of training data (the "support set") influences a given prediction in models like KNN, decision trees, and GNNs.
- It proves that Shapley value computation can be accurately projected onto these support sets, fundamentally changing the problem from exponential coalition enumeration to structured processing of overlapping subsets.
- The authors introduce two algorithms: LSMR, an optimal deterministic method that trains each unique influential subset exactly once, and LSMR-A, a scalable Monte Carlo estimator for larger supports.
- Experiments show the methods achieve substantial reductions in required model retraining and significant speedups while maintaining high fidelity to exact Shapley values.
- The work establishes an information-theoretic lower bound on retraining operations, governed by the number of distinct influential subsets, providing a theoretical foundation for the efficiency gains.
Redefining Data Valuation with Model-Induced Locality
The core innovation of Local Shapley is its formalization of a practical observation: for a given test input, a modern ML model's prediction is typically determined by a small, structured portion of the training dataset. The authors define this as the support set, which varies by model architecture. For a k-Nearest Neighbors (KNN) model, it is the k nearest training points; for a decision tree or random forest, it is the set of training points that fall into the same leaf node; and for a Graph Neural Network (GNN), it is the receptive field—the subgraph of neighboring nodes.
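To make the notion of a support set concrete, here is a minimal sketch (not the paper's code) of how one could extract it for two of the model families mentioned above, assuming scikit-learn; the function names are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.tree import DecisionTreeClassifier

def knn_support_set(X_train, x_test, k=5):
    """Indices of the k training points that determine a KNN prediction."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, idx = nn.kneighbors(x_test.reshape(1, -1))
    return set(idx[0])

def tree_support_set(tree, X_train, x_test):
    """Indices of training points that fall into the same leaf as the test point."""
    test_leaf = tree.apply(x_test.reshape(1, -1))[0]
    train_leaves = tree.apply(X_train)
    return set(np.where(train_leaves == test_leaf)[0])

# Toy usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = rng.integers(0, 2, size=200)
x = rng.normal(size=4)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
print(knn_support_set(X, x), tree_support_set(tree, X, x))
```

For a random forest or a GNN the same idea applies, but the support set is assembled from multiple leaves or from the receptive-field subgraph rather than a single query.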
The pivotal theoretical contribution is the proof that Shapley value computation for a data point can be restricted to its support set without loss of accuracy when locality is exact. This transforms the valuation problem's structure. Instead of evaluating a data point's contribution across all 2^N possible coalitions of N training points, the computation is projected onto the family of subsets induced by the overlapping support sets of all training points. This reframing is what enables the dramatic computational savings.
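The projection idea can be illustrated with a short sketch. This is not the paper's LSMR algorithm (which additionally avoids redundant retraining across points); it only shows what restricting the computation to a support set buys: the Shapley value of point i is computed by enumerating the 2^(|support|-1) coalitions inside its support set rather than 2^(N-1) coalitions over the whole dataset. Here `utility` is assumed to be any user-supplied function mapping a subset of training indices to a score, e.g. validation accuracy of a model retrained on that subset.

```python
from itertools import combinations
from math import comb

def local_shapley_value(i, support, utility):
    """Exact Shapley value of point i, with coalitions restricted to its support set."""
    others = [j for j in support if j != i]
    k = len(others)
    value = 0.0
    for size in range(k + 1):
        # Standard Shapley weight |S|!(k-|S|)!/(k+1)! for a coalition of this size
        weight = 1.0 / ((k + 1) * comb(k, size))
        for S in combinations(others, size):
            value += weight * (utility(set(S) | {i}) - utility(set(S)))
    return value
```

With a support set of, say, 5 points, this needs at most 2^5 utility evaluations per point, independent of the total dataset size.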
Industry Context & Analysis
This research directly tackles the primary bottleneck preventing the widespread adoption of Shapley-based data valuation in industry: computational intractability. Exact Shapley computation requires O(2^N) utility evaluations, making it infeasible beyond a few dozen training points. Existing approximations, like Data Shapley or TMC-Shapley, use Monte Carlo methods to sample coalitions but remain "global" in nature, still requiring a vast number of model retrainings across the entire dataset. For context, evaluating a single point's value on the CIFAR-10 dataset (50,000 training images) with these methods can require thousands of retrainings of a deep neural network, a prohibitively expensive process.
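For reference, the permutation-sampling estimator that global methods such as TMC-Shapley build on looks roughly like the sketch below (truncation and convergence checks omitted). Each sampled permutation walks the entire training set, and every prefix requires a fresh model fit, which is exactly what makes the global approach so expensive.

```python
import numpy as np

def monte_carlo_shapley(n_points, utility, n_permutations=100, seed=0):
    """Permutation-sampling Shapley estimate over the full training set."""
    rng = np.random.default_rng(seed)
    values = np.zeros(n_points)
    for _ in range(n_permutations):
        perm = rng.permutation(n_points)
        coalition, prev_score = set(), utility(set())
        for i in perm:                      # one model retraining per prefix
            coalition.add(int(i))
            score = utility(coalition)
            values[i] += score - prev_score
            prev_score = score
    return values / n_permutations
```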
The Local Shapley approach is architecturally aware in a way previous methods are not. It recognizes that the valuation question "How valuable is this training point?" is intrinsically linked to the model's internal mechanics. Unlike a model-agnostic method like Data Shapley, which treats the model as a black-box oracle, Local Shapley leverages the white-box structure of the predictor's computational pathway. This is analogous to the shift in explainable AI (XAI) from model-agnostic, perturbation-based methods like LIME toward white-box, gradient-based methods such as Integrated Gradients or DeepSHAP, which exploit model internals for greater efficiency and fidelity.
The performance claims are grounded in a rigorous information-theoretic lower bound. The authors prove that the intrinsic complexity is governed by the number of distinct influential subsets, not the dataset size. This has profound implications. In a KNN model with k=5, the number of distinct 5-point neighbor sets that actually occur is vastly smaller than the number of possible 5-point combinations from a large dataset, since many points share identical neighbor sets. This theoretical insight justifies the observed exponential speedups. For industries relying on interpretable models like gradient-boosted trees (e.g., finance with XGBoost or healthcare with clinical risk models), this could turn routine data valuation from a research curiosity into a practical tool.
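A small illustration of the counting argument, assuming scikit-learn (not taken from the paper): the number of distinct k-neighbor sets that occur is bounded above by N, and is smaller still when points share neighbor sets, whereas the number of possible 5-point combinations grows combinatorially.

```python
import numpy as np
from math import comb
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 8))
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
_, idx = nn.kneighbors(X)                           # each row: a point and its neighbors
support_sets = {frozenset(row[1:]) for row in idx}  # drop the query point itself
print(len(support_sets), "distinct support sets vs", comb(len(X), k), "possible 5-subsets")
```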
What This Means Going Forward
The immediate beneficiaries of this work are organizations that rely on structured, non-deep learning models for high-stakes decisions and already prioritize model interpretability. Sectors like finance (credit scoring), insurance (risk assessment), and regulated healthcare can now practically implement Shapley-based data valuation to audit their training sets for biases, errors, or poisoned data points. This moves data valuation from a theoretical fairness tool to an operational debugging and compliance instrument.
A key trend this accelerates is the shift from model-centric to data-centric AI. As championed by Andrew Ng, this philosophy argues that high-quality data is more crucial than marginal model architecture improvements. Local Shapley provides a computationally feasible metric to quantify data quality at the individual example level, enabling active learning and data curation pipelines to be guided by principled valuation, not just heuristic rules.
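A hedged sketch of how per-example values could feed such a curation pipeline: rank training points by their estimated value and drop the lowest-valued fraction before retraining. The `point_values` array is assumed to come from a Local Shapley style estimator; the function name is illustrative.

```python
import numpy as np

def prune_low_value_points(X, y, point_values, keep_fraction=0.9):
    """Keep the top `keep_fraction` of training points ranked by estimated value."""
    order = np.argsort(point_values)          # ascending: least valuable first
    cutoff = int(len(order) * (1 - keep_fraction))
    keep = order[cutoff:]
    return X[keep], y[keep]
```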
Looking ahead, the major challenge and opportunity will be extending this "locality" principle to large foundation models and deep neural networks. The paper's experiments include Multi-Layer Perceptrons (MLPs), where locality is less explicit. Future research will need to define meaningful, efficient support sets for transformers and diffusion models—perhaps through attention head analysis or feature activation clustering. If successful, it could unlock valuation for the massive, web-scale datasets used to train models like GPT-4 or Stable Diffusion, answering critical questions about data provenance, copyright, and the marginal value of one more scraped webpage. The race to efficiently value training data is just beginning, and Local Shapley has set a compelling new direction.