A Law of Data Reconstruction for Random Features (And Beyond)

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: random features, data reconstruction, memorization, deep learning theory, privacy, high-dimensional statistics
Abstract:

Large-scale deep learning models are known to memorize parts of the training set. In machine learning theory, memorization is often framed as interpolation or label fitting, and classical results show that this can be achieved when the number of parameters p in the model is larger than the number of training samples n. In this work, we consider memorization from the perspective of data reconstruction, demonstrating that this can be achieved when p is larger than dn, where d is the dimensionality of the data. More specifically, we show that, in the random features model, when p ≫ dn, the subspace spanned by the training samples in feature space gives sufficient information to identify the individual samples in input space. Our analysis suggests an optimization method to reconstruct the dataset from the model parameters, and we demonstrate that this method performs well on various architectures (random features, two-layer fully-connected and deep residual networks). Our results reveal a law of data reconstruction, according to which the entire training dataset can be recovered as p exceeds the threshold dn.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper establishes a 'law of data reconstruction' for random features models, showing that training data can be recovered when the number of parameters p exceeds dn (data dimensionality times sample count). It resides in the 'Gradient-Based Theoretical Analysis' leaf under 'Theoretical Foundations and Identifiability', alongside one sibling paper (Gradient Reconstruction Provably). This leaf is notably sparse, containing only two papers within a broader taxonomy of 50 works, suggesting the paper targets a relatively underexplored theoretical niche focused on formal identifiability conditions rather than empirical attack development.

The taxonomy reveals that most reconstruction research concentrates on practical attack methods (Direct Parameter-Based Reconstruction, Gradient Inversion Attacks) and defense mechanisms (Differential Privacy Bounds), with fewer works establishing theoretical foundations. Neighboring leaves include 'Implicit Bias and Margin Maximization' (four papers exploiting gradient descent properties) and 'Bayesian and Probabilistic Frameworks' (one paper on inverse problems). The paper's focus on random features and parameter-dimensionality thresholds distinguishes it from these adjacent directions, which emphasize optimization dynamics or statistical inference rather than capacity-based reconstruction limits.

Among the 30 candidates examined, the theoretical contributions (the law of data reconstruction, and Theorems 1 and 2) show no clear refutation across their 10 candidates each, suggesting limited prior work on this specific threshold phenomenon. However, the optimization method for reconstruction faces overlap: 2 of its 10 candidates appear to refute its novelty, indicating that algorithmic approaches to parameter-based data recovery have received more attention. This pattern aligns with the taxonomy structure, where attack algorithms dominate the literature while theoretical characterizations of reconstruction regimes remain less developed.

Based on this limited search scope (top-30 semantic matches), the theoretical framing around the dn threshold appears relatively novel, though the optimization method connects to a more established algorithmic literature. The analysis does not cover exhaustive citation networks or domain-specific venues, so additional related work may exist outside the examined candidates. The sparse population of the theoretical analysis leaf suggests the paper addresses a gap in formal reconstruction theory.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 30
Refutable papers: 2

Research Landscape Overview

Core task: Reconstructing training data from model parameters. This field examines whether and how an adversary can recover sensitive training examples by inspecting learned model weights or gradients. The taxonomy organizes research into several main branches: Theoretical Foundations and Identifiability explores fundamental limits and conditions under which reconstruction is possible, often through gradient-based analysis (e.g., Gradient Reconstruction Provably[8], Law of Data Reconstruction[0]); Attack Methods and Algorithms develops practical techniques for extracting training data across diverse settings, from federated learning to graph neural networks (e.g., Informed Adversaries Reconstruction[1], Stealing Training Graphs[3]); Defense Mechanisms and Privacy Guarantees investigates countermeasures such as differential privacy and bounds on reconstruction risk (e.g., Bounding DP-SGD Reconstruction[5], Bounding Private Learning[13]); and Related Privacy and Security Contexts covers adjacent concerns like model inversion, unlearning vulnerabilities, and membership inference (e.g., Unlearning Reconstruction Attacks[6], Model Inversion Landscape[24]).

Recent work highlights tensions between theoretical guarantees and practical attack efficacy. Many studies focus on gradient-based reconstruction in federated or distributed settings, where even a single gradient update can leak substantial information about individual training samples. Others examine reconstruction from final model weights, exploring how overparameterization or specific architectures enable data recovery. Law of Data Reconstruction[0] sits within the Gradient-Based Theoretical Analysis cluster, providing foundational insights into when and why reconstruction succeeds, closely aligned with Gradient Reconstruction Provably[8].
In contrast, works like ReCIT[4] and Loki[7] emphasize algorithmic innovations for practical attacks, while Bounding DP-SGD Reconstruction[5] and Deconstructing Data Reconstruction[14] investigate the interplay between privacy mechanisms and reconstruction risk. Open questions remain around scaling these attacks to large models, understanding the role of model architecture, and designing defenses that balance utility with rigorous privacy.

Claimed Contributions

Law of data reconstruction for random features

The authors establish a theoretical threshold showing that entire training datasets can be reconstructed from random features models when the number of parameters p exceeds dn, where d is data dimensionality and n is the number of training samples. This reveals a fundamental law governing when data reconstruction becomes feasible.

10 retrieved papers
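As a hedged numerical illustration of the claimed threshold (a minimal sketch assuming a tanh random-features map φ(x) = tanh(Wx) with Gaussian W; the variable and function names are ours, not the paper's): the span of the training features only becomes informative about individual samples once p is large. For small p, the features of any probe point already lie in the span, whereas for p ≫ dn only the training samples themselves project onto it with (near-)zero residual.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 4  # input dimension and number of training samples

def features(X, W):
    # Random-features map: each column x of X is mapped to tanh(W @ x) in R^p.
    return np.tanh(W @ X)

def span_residual(phi_probe, Phi_train):
    # Relative norm of the component of phi_probe orthogonal to span(Phi_train).
    proj, *_ = np.linalg.lstsq(Phi_train, phi_probe, rcond=None)
    r = phi_probe - Phi_train @ proj
    return np.linalg.norm(r) / np.linalg.norm(phi_probe)

X = rng.standard_normal((d, n))        # training set (columns are samples)
x_new = rng.standard_normal((d, 1))    # a point NOT in the training set

for p in (n, 10 * d * n):              # under- vs over-parameterized regime
    W = rng.standard_normal((p, d)) / np.sqrt(d)
    Phi = features(X, W)
    res_train = span_residual(features(X[:, :1], W), Phi)
    res_new = span_residual(features(x_new, W), Phi)
    print(f"p={p:3d}  residual(train)={res_train:.2e}  residual(new)={res_new:.2e}")
```

The contrast between the two regimes is the informal content of the threshold: with p = n the span is all of feature space and carries no information, while with p = 10dn the span separates training samples from other points.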
Theoretical characterization via Theorems 1 and 2

The authors prove that, under sufficient over-parameterization, if the feature representations of the training samples lie in the span of the reconstructed samples' features, then the reconstructed samples must be close to the original training data. Theorem 2 extends this to show that the reconstructed samples are distinct, ruling out duplicates.

10 retrieved papers
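A hedged sketch of the span criterion behind this characterization (an illustrative tanh random-features setup of our own, not the authors' proof): in the over-parameterized regime, perturbing a training sample pushes its feature vector out of the training-feature span, so any candidate whose features remain in the span must stay close to the data.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 4
p = 10 * d * n                          # over-parameterized: p >> d*n

X = rng.standard_normal((d, n))         # training samples (columns)
W = rng.standard_normal((p, d)) / np.sqrt(d)
Phi = np.tanh(W @ X)                    # p x n training feature matrix

def rel_residual(x):
    # How far tanh(W x) sticks out of span(Phi), relative to its norm.
    phi = np.tanh(W @ x)
    coef, *_ = np.linalg.lstsq(Phi, phi, rcond=None)
    r = phi - Phi @ coef
    return np.linalg.norm(r) / np.linalg.norm(phi)

v = rng.standard_normal(d)
v /= np.linalg.norm(v)                  # random unit perturbation direction
residuals = {eps: rel_residual(X[:, 0] + eps * v) for eps in (0.0, 0.05, 0.5)}
for eps, r in residuals.items():
    print(f"perturbation {eps:4.2f} -> relative residual {r:.2e}")
```

The residual vanishes at the training sample itself and grows with the perturbation, which is the mechanism by which "features in the span" forces closeness to the data in this toy setting.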
Optimization method for data reconstruction

The authors develop a practical reconstruction algorithm based on minimizing the norm of the projection of the trained parameters onto the orthogonal complement of the span of the reconstructed samples' features. They demonstrate that this method successfully recovers training data across random features models, two-layer fully-connected networks, and deep residual networks when p exceeds dn.

10 retrieved papers
Can Refute
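The reconstruction objective described above can be sketched as a toy in code (our illustrative setup: a tanh random-features map and a trained parameter vector assumed to lie in the span of the training features, here simulated as θ = Φc; all names are hypothetical, not the paper's): the loss is the squared norm of θ's projection onto the orthogonal complement of the candidate samples' feature span, so the true training set attains (near-)zero loss while a random candidate set does not.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 3, 4
p = 10 * d * n                                   # p >> d*n regime

X = rng.standard_normal((d, n))                  # true training samples
W = rng.standard_normal((p, d)) / np.sqrt(d)     # frozen random weights

def feats(Xc):
    return np.tanh(W @ Xc)                       # p x n feature matrix

# Trained parameters: simulated as a combination of training features,
# i.e. theta lies in the span of the training-feature matrix.
theta = feats(X) @ rng.standard_normal(n)

def loss(Xc):
    # Squared norm of theta's component orthogonal to span(feats(Xc)).
    Phi = feats(Xc)
    coef, *_ = np.linalg.lstsq(Phi, theta, rcond=None)
    return float(np.sum((theta - Phi @ coef) ** 2))

loss_true = loss(X)                               # true data -> ~0
loss_rand = loss(rng.standard_normal((d, n)))     # random guess -> clearly > 0
print(f"loss at true data:   {loss_true:.2e}")
print(f"loss at random data: {loss_rand:.2e}")
```

In practice one would minimize `loss` over the candidate matrix (e.g. with automatic differentiation and gradient descent); this sketch only verifies that the true dataset is a global minimizer of the objective in the toy setting.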

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Law of data reconstruction for random features


Contribution

Theoretical characterization via Theorems 1 and 2


Contribution

Optimization method for data reconstruction
