Mean Estimation from Coarse Data: Characterizations and Efficient Algorithms

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: high-dimensional statistics, algorithmic statistics, computational learning theory, coarse observations, mean estimation, linear regression, friction
Abstract:

Coarse data arise when learners observe only partial information about samples; namely, a set containing the sample rather than its exact value. This occurs naturally through measurement rounding, sensor limitations, and lag in economic systems. We study Gaussian mean estimation from coarse data, where each true sample x is drawn from a d-dimensional Gaussian distribution with identity covariance but is revealed only through the set of a partition containing x. When the coarse samples carry too little information, the mean cannot be uniquely recovered from the observed samples (i.e., the problem is not identifiable). Recent work by Fotakis et al. (2021) established that sample-efficient mean estimation is possible when the unknown mean is identifiable and the partition consists only of convex sets. Moreover, they showed that without convexity, mean estimation becomes NP-hard. However, two fundamental questions remained open:

  1. When is the mean identifiable under convex partitions?
  2. Is computationally efficient estimation possible under identifiability and convex partitions?

This work resolves both questions. First, we give a geometric characterization of when a convex partition is identifiable, showing that identifiability fails exactly when the convex sets form parallel "slabs" in some direction. Second, we give the first polynomial-time algorithm for finding ε-accurate estimates of the Gaussian mean given coarse samples from an unknown convex partition, matching the optimal Õ(d/ε²) sample complexity. Our results have direct applications to robust machine learning, particularly robustness to observation rounding. As a concrete example, we derive a sample- and computationally efficient algorithm for linear regression with market friction, a canonical problem in economics where exact prices are unobserved and one only sees a range containing the price (Rosett, 1959).
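As a toy illustration of the coarse observation model (not the paper's algorithm), the sketch below simulates one-dimensional Gaussian samples that are revealed only through a unit-width rounding partition, and recovers the mean with a plain EM loop for interval-censored Gaussian data. All names, the partition width, and the iteration count are illustrative assumptions.

```python
import math
import random

def npdf(z):  # standard normal density
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def ncdf(z):  # standard normal CDF
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def coarsen(x, width=1.0):
    """Reveal only the partition cell (here: a rounding interval) containing x."""
    lo = math.floor(x / width) * width
    return (lo, lo + width)

def em_mean(intervals, iters=60):
    """EM for the mean of N(mu, 1) given interval observations [a, b]."""
    mu = 0.0
    for _ in range(iters):
        total = 0.0
        for a, b in intervals:
            alpha, beta = a - mu, b - mu
            z = ncdf(beta) - ncdf(alpha)
            # conditional mean E[x | x in [a, b]] under the current mu
            total += mu + (npdf(alpha) - npdf(beta)) / z
        mu = total / len(intervals)  # M-step: average of conditional means
    return mu

random.seed(0)
true_mu = 0.7
samples = [random.gauss(true_mu, 1.0) for _ in range(20000)]
intervals = [coarsen(x) for x in samples]
print(round(em_mean(intervals), 2))  # close to true_mu
```

Here the rounding cells are convex (intervals), so by the paper's setting the mean remains identifiable and a simple likelihood-based scheme can recover it.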

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper addresses Gaussian mean estimation when observations are revealed only through coarse partitions rather than exact values. It sits in the 'Coarse Partition-Based Mean Estimation' leaf, which contains only two papers total (including the original work). This represents a relatively sparse research direction within the broader taxonomy of 36 papers across quantization, distributed estimation, and functional data methods. The sibling paper focuses on particle filtering applications, suggesting limited prior work directly tackling the theoretical foundations of convex partition-based estimation.

The taxonomy reveals that neighboring branches emphasize different observation models: 'Quantization-Based Mean Estimation' (13 papers) studies discrete-level mappings with known quantizer designs, while 'Distributed Mean Estimation' (4 papers) addresses communication-constrained protocols across multiple agents. The original work diverges by considering arbitrary convex partitions without assuming structured quantization or distributed settings. The scope note for the leaf explicitly excludes point quantization and distributed protocols, positioning this work at the intersection of geometric constraints (convexity) and identifiability theory rather than communication or bit-budget optimization.

Among 22 candidates examined, the contribution on polynomial-time algorithms shows overlap with 2 refutable candidates from 10 examined, suggesting some prior computational work exists in related settings. The geometric identifiability characterization examined 2 candidates with no refutations, indicating this theoretical angle may be less explored. The linear regression extension examined 10 candidates without refutations, pointing to potential novelty in applying coarse-data frameworks to regression. The limited search scope (top-K semantic matches) means these statistics reflect nearby literature rather than exhaustive coverage of all estimation theory.

Given the sparse leaf structure and the limited 22-candidate search, the work appears to occupy a relatively underexplored niche bridging partition geometry and statistical identifiability. The computational contribution faces some prior overlap, while the identifiability characterization and regression application show fewer direct precedents among examined papers. A broader search beyond semantic neighbors might reveal additional connections to convex geometry or interval-censored data literature not captured here.

Taxonomy

Core-task Taxonomy Papers: 36
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 2

Research Landscape Overview

Core task: Gaussian mean estimation from coarse data. This field addresses the challenge of estimating the mean of a Gaussian distribution when observations are degraded by quantization, partitioning, or communication constraints. The taxonomy reveals several main branches that reflect different facets of this problem. Quantization-Based Mean Estimation focuses on scenarios where continuous observations are mapped to discrete levels, often exploring optimal quantizer design and the resulting estimation error, as seen in works like Quantized Gaussian Dither[9] and Optimal Scalar Quantization[35]. Distributed Mean Estimation Under Communication Constraints examines settings where multiple agents must collaborate under bandwidth limitations, with studies such as Quantized Message Passing[3] and Geometric Communication Bounds[34] addressing trade-offs between communication cost and accuracy. Coarse Partition-Based Mean Estimation considers data that arrive in broad bins or intervals rather than precise quantization levels. Meanwhile, Continuous and Functional Mean Estimation extends the problem to infinite-dimensional or functional data, Robust and High-Dimensional Mean Estimation tackles outliers and scaling challenges, and Application-Specific Mean Estimation applies these ideas to domains like viral load tracking or service metrics.

Several active lines of work highlight contrasting emphases and open questions. One thread investigates the interplay between quantization granularity and estimation performance, with papers like MMSE Quantized Nonasymptotic[33] and One-Bit Mean Estimation[7] pushing the limits of minimal-bit regimes. Another explores adaptive and data-dependent quantizers, as in Adaptive Quantizers[36] and Quantization Adaptation Estimation[32], which adjust binning strategies based on observed statistics.
The original paper, Coarse Mean Estimation[0], sits within the Coarse Partition-Based branch and shares thematic ground with Bernoulli Particle Filter[20], both addressing estimation from relatively crude partitions of the observation space. Compared to the fine-grained quantization focus of One-Bit Mean Estimation[7] or the communication-centric perspective of Quantized Message Passing[3], Coarse Mean Estimation[0] emphasizes scenarios where data are inherently grouped into coarse bins, raising questions about optimal partition design and the fundamental limits of inference under such constraints.

Claimed Contributions

Geometric characterization of identifiability for convex partitions

The authors characterize exactly when Gaussian mean estimation from coarse data is identifiable under convex partitions. They prove that a convex partition is non-identifiable if and only if almost all sets in the partition are parallel slabs in some direction, combining tools from optimal transport and variance reduction inequalities.

2 retrieved papers
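The slab condition above can be checked numerically. In the sketch below (a hypothetical illustration, not taken from the paper), the partition cells are parallel slabs orthogonal to a unit direction v, so the probability of each cell depends on the mean only through its projection onto v; two means that agree on that projection induce identical coarse-sample distributions, which is exactly the failure of identifiability.

```python
import math

def ncdf(z):  # standard normal CDF
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def slab_prob(a, b, v, mu):
    """P(a <= <v, x> < b) for x ~ N(mu, I) and a unit direction v.
    The projection <v, x> is a 1-D Gaussian with mean <v, mu> and variance 1."""
    m = sum(vi * mi for vi, mi in zip(v, mu))
    return ncdf(b - m) - ncdf(a - m)

v = (1.0, 0.0)                                # slab direction
slabs = [(-2, -1), (-1, 0), (0, 1), (1, 2)]   # cells {x : a <= x_1 < b}
mu1 = (0.3, 0.0)
mu2 = (0.3, 5.0)   # same projection on v, very different orthogonal component
for a, b in slabs:
    # cell probabilities agree, so coarse samples cannot tell mu1 from mu2
    assert abs(slab_prob(a, b, v, mu1) - slab_prob(a, b, v, mu2)) < 1e-12
```

Any convex partition that is not of this parallel-slab form leaves enough directional information to pin the mean down, per the characterization claimed above.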
First polynomial-time algorithm for coarse Gaussian mean estimation

The authors provide the first computationally efficient algorithm for estimating the Gaussian mean from coarse samples under identifiable convex partitions. The algorithm runs in polynomial time and achieves the same sample complexity as prior sample-efficient but computationally inefficient methods.

10 retrieved papers (Can Refute)
Sample- and computationally-efficient algorithm for linear regression with market friction

As a concrete application, the authors develop an efficient algorithm for linear regression with market friction, where exact prices are unobserved and only ranges containing prices are available. This addresses a canonical economics problem dating back to Rosett (1959).

10 retrieved papers
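To make the friction setting concrete, here is a minimal stdlib-only sketch of interval-censored linear regression in the spirit of Rosett's model: the latent price y = θ·w + noise is never observed, only a unit interval containing it, and θ is fit by gradient ascent on the interval log-likelihood. The data-generating choices, step sizes, and routine names are assumptions for illustration, not the authors' algorithm.

```python
import math
import random

def npdf(z): return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)
def ncdf(z): return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def fit_theta(data, steps=300, lr=0.2):
    """Gradient ascent on the interval-censored log-likelihood
    sum_i log( Phi(b_i - theta*w_i) - Phi(a_i - theta*w_i) ),
    with unit noise variance and scalar feature w."""
    theta = 0.0
    n = len(data)
    for _ in range(steps):
        g = 0.0
        for w, a, b in data:
            m = theta * w
            alpha, beta = a - m, b - m
            z = ncdf(beta) - ncdf(alpha)
            # d/d(theta) of log z, using d(ncdf)/dm = -npdf
            g += (npdf(alpha) - npdf(beta)) / z * w
        theta += lr * g / n
    return theta

random.seed(1)
true_theta = 1.5
data = []
for _ in range(4000):
    w = random.uniform(-2, 2)
    y = true_theta * w + random.gauss(0, 1)  # latent exact price, never observed
    lo = math.floor(y)                       # only a unit price range is seen
    data.append((w, lo, lo + 1.0))
print(round(fit_theta(data), 2))  # close to true_theta
```

This mirrors the abstract's application: each observation is a convex set (an interval) containing the true price, and the fit still recovers the regression coefficient.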

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Geometric characterization of identifiability for convex partitions


Contribution

First polynomial-time algorithm for coarse Gaussian mean estimation


Contribution

Sample- and computationally-efficient algorithm for linear regression with market friction
