Low-Pass Filtering Improves Behavioral Alignment of Vision Models

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: behavioral alignment; cognitive science; CLIP; computer vision; shape bias; error consistency
Abstract:

Despite their impressive performance on computer vision benchmarks, Deep Neural Networks (DNNs) still fall short of adequately modeling human visual behavior, as measured by error consistency and shape bias. Recent work hypothesized that behavioral alignment can be drastically improved through generative, rather than discriminative, classifiers, with far-reaching implications for models of human vision.

Here, we instead show that the increased alignment of generative models can be largely explained by a seemingly innocuous resizing operation in the generative model, which effectively acts as a low-pass filter. In a series of controlled experiments, we show that removing high-frequency spatial information from discriminative models like CLIP drastically increases their behavioral alignment. Simply blurring images at test time, rather than training on blurred images, achieves a new state-of-the-art score on the model-vs-human benchmark, halving the current alignment gap between DNNs and human observers. Furthermore, low-pass filters are likely optimal, which we demonstrate by directly optimizing filters for alignment. To contextualize the performance of optimal filters, we compute the frontier of all Pareto-optimal solutions to the benchmark, which was previously unknown.
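
The test-time blurring described above can be sketched in a few lines. Below is a minimal NumPy version, assuming a Gaussian low-pass filter applied to each image before it reaches an unmodified classifier; the helper names and the choice sigma=2.0 are illustrative, not the paper's tuned values.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1-D Gaussian kernel truncated at 3*sigma, normalized to sum to 1."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def blur_image(img, sigma=2.0):
    """Low-pass filter a 2-D grayscale image by separable Gaussian
    convolution (rows, then columns). np.convolve zero-pads the borders,
    which is acceptable for this illustration."""
    k = gaussian_kernel(sigma)
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, rows)

# Blurring a noise image preserves its shape but strips high-frequency
# energy, which shows up as a sharp drop in pixel variance.
rng = np.random.default_rng(0)
img = rng.random((64, 64))
blurred = blur_image(img, sigma=2.0)
```

At evaluation time, one would apply such a blur to every benchmark image while leaving the model weights untouched.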

We explain our findings by observing that the frequency spectrum of optimal Gaussian filters roughly matches the spectrum of band-pass filters implemented by the human visual system. We show that the contrast sensitivity function, describing the inverse of the contrast threshold required for humans to detect a sinusoidal grating as a function of spatiotemporal frequency, is approximated well by Gaussian filters of a specific width.
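
The frequency-domain view behind this match can be sketched directly: a Gaussian blur with spatial standard deviation sigma has amplitude response exp(-2 * pi^2 * sigma^2 * f^2), a smooth low-pass roll-off. In the sketch below, sigma_deg = 0.05 degrees of visual angle is an illustrative value, not the width fitted in the paper.

```python
import numpy as np

def gaussian_mtf(freq_cpd, sigma_deg):
    """Amplitude (modulation transfer) response of a Gaussian low-pass
    filter with spatial std sigma_deg (degrees of visual angle) at
    spatial frequency freq_cpd (cycles per degree):
    |G(f)| = exp(-2 * pi^2 * sigma^2 * f^2)."""
    return np.exp(-2.0 * np.pi ** 2 * sigma_deg ** 2 * np.asarray(freq_cpd) ** 2)

# The response is exactly 1 at DC and decays monotonically with
# frequency, so high spatial frequencies are attenuated most.
freqs = np.array([0.0, 1.0, 2.0, 4.0, 8.0, 16.0])
resp = gaussian_mtf(freqs, sigma_deg=0.05)
```

Note that the human contrast sensitivity function itself is band-pass (sensitivity also drops at very low frequencies), so a Gaussian can match only its high-frequency limb, which is the part relevant to removing high-frequency information.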

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes that low-pass filtering at test time drastically improves behavioral alignment between deep neural networks and human visual perception, offering an alternative explanation for generative models' superior alignment. It resides in the 'Test-Time Filtering for Human Alignment' leaf, which contains only two papers total. This is a notably sparse research direction within the broader taxonomy of 18 papers across 16 leaf nodes, suggesting the specific focus on test-time frequency filtering for human alignment remains relatively unexplored compared to other frequency-based vision approaches.

The taxonomy reveals several neighboring directions that contextualize this work. The sibling leaf 'Training-Time Robustness to Blur' explores incorporating blur during training rather than at inference, while 'Self-Supervised Alignment with Augmentation' uses variable filtering as a learning signal. Adjacent branches address frequency filtering for entirely different objectives: 'Frequency-Based Generation and Synthesis' targets image quality in generative models, and 'Domain Adaptation and Transfer Learning' applies frequency methods to cross-domain robustness. The paper's focus on test-time intervention for behavioral metrics distinguishes it from these training-centric or task-specific approaches.

Among the nine candidates examined, all three contributions show evidence of overlap with prior work. For the core claim that test-time filtering improves alignment, one candidate was examined and found refutable. For the alternative explanation of Imagen's alignment, two candidates were examined, one of which was refutable. For the Pareto-optimal frontier computation, six candidates were examined, yielding one refutable match and five unclear cases. Because the search scope was limited to nine papers rather than an exhaustive survey, these statistics reflect overlap within a narrow semantic neighborhood, not comprehensive field coverage. The frontier computation appears most novel, given fewer clear refutations among the examined candidates.

Based on the top-nine semantic matches examined, the work appears to occupy a sparsely populated research direction with some precedent in neighboring areas. The analysis captures immediate semantic neighbors but cannot assess whether more distant literature addresses similar ideas through different terminology or framing. The taxonomy structure suggests test-time filtering for alignment is less crowded than training-time or generation-focused frequency methods, though the small candidate pool limits confidence in this assessment.

Taxonomy

Core-task Taxonomy Papers: 18
Claimed Contributions: 3
Contribution Candidate Papers Compared: 9
Refutable Papers: 3

Research Landscape Overview

Core task: improving the behavioral alignment of vision models through low-pass filtering. The field explores how frequency-domain manipulations, particularly the suppression of high-frequency components, can make vision models behave more consistently with human perception and biological vision systems. The taxonomy organizes work into several main branches: Direct Behavioral Alignment via Frequency Filtering focuses on test-time or training-time filtering strategies to improve human-like responses; Frequency-Based Generation and Synthesis examines how frequency decomposition aids image creation and enhancement tasks; Domain Adaptation and Transfer Learning investigates frequency filtering as a tool for cross-domain robustness; Multimodal Representation and Contrastive Learning studies frequency manipulations in joint embedding spaces; and Specialized Vision Applications applies these ideas to tracking, HDR imaging, and other targeted problems.

Representative works such as Robustness to Blur[2] and Removing High Frequency Information[14] illustrate how simple low-pass operations can yield surprising gains in model reliability, while methods like Adaptive Low-Pass Guidance[1] and Multi-Frequency Contrastive Decoding[4] show more sophisticated frequency-aware architectures. A particularly active line of work centers on test-time filtering for human alignment, where models apply frequency adjustments at inference without retraining. Low-Pass Behavioral Alignment[0] sits squarely in this cluster, proposing that low-pass filtering at test time can close the gap between model predictions and human judgments. This approach contrasts with training-time methods like Prompt Gradient Alignment[3], which bakes frequency-aware objectives into the learning process, and with architectural innovations such as Frequency-Augmented Mixture Experts[5], which embed frequency decomposition directly into model design.

A key open question is whether simple post-hoc filtering suffices or whether deeper integration, through training objectives or architectural changes, yields more robust alignment. Low-Pass Behavioral Alignment[0] argues for the former, demonstrating that lightweight test-time interventions can be surprisingly effective, while neighboring work like Removing High Frequency Information[14] explores similar filtering strategies in related contexts.

Claimed Contributions

Low-pass filtering at test time drastically increases behavioral alignment

The authors demonstrate that applying low-pass filters to images during evaluation—rather than during training—substantially improves the behavioral alignment of vision models with human observers. This simple test-time transformation increases error consistency and shape bias across multiple model architectures.

1 retrieved paper
Can Refute

Alternative explanation for Imagen's behavioral alignment

The authors propose that Imagen's high behavioral alignment stems from its resizing operation (which acts as a low-pass filter) rather than its generative training objective. This challenges the hypothesis that generative models are necessary for human-like vision.

2 retrieved papers
Can Refute

Computation of Pareto-optimal frontier for model-vs-human benchmark

The authors compute the frontier of Pareto-optimal solutions for the model-vs-human benchmark, establishing the theoretical ceiling performance and revealing the fundamental trade-off between out-of-distribution accuracy and error consistency with humans.

6 retrieved papers
Can Refute
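
The Pareto-frontier contribution can be illustrated with a generic sketch. Given a set of (OOD accuracy, error consistency) points where higher is better on both axes, the frontier is the subset not dominated by any other point. The function below is a hypothetical helper over a fixed point set; the paper's computation over all possible responses to the benchmark is a harder problem, and this sketch only conveys the dominance criterion.

```python
def pareto_front(points):
    """Return the points not dominated by any other point, where
    (x2, y2) dominates (x, y) if it is >= on both axes and > on at
    least one."""
    front = []
    for i, (x, y) in enumerate(points):
        dominated = any(
            x2 >= x and y2 >= y and (x2 > x or y2 > y)
            for j, (x2, y2) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((x, y))
    return front

# Hypothetical (OOD accuracy, error consistency) pairs for five models:
models = [(0.90, 0.40), (0.70, 0.70), (0.50, 0.90), (0.60, 0.50), (0.80, 0.30)]
front = pareto_front(models)
# (0.60, 0.50) is dominated by (0.70, 0.70), and (0.80, 0.30) by (0.90, 0.40),
# so neither lies on the frontier.
```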

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Low-pass filtering at test time drastically increases behavioral alignment

The authors demonstrate that applying low-pass filters to images during evaluation—rather than during training—substantially improves the behavioral alignment of vision models with human observers. This simple test-time transformation increases error consistency and shape bias across multiple model architectures.

Contribution

Alternative explanation for Imagen's behavioral alignment

The authors propose that Imagen's high behavioral alignment stems from its resizing operation (which acts as a low-pass filter) rather than its generative training objective. This challenges the hypothesis that generative models are necessary for human-like vision.

Contribution

Computation of Pareto-optimal frontier for model-vs-human benchmark

The authors compute the frontier of Pareto-optimal solutions for the model-vs-human benchmark, establishing the theoretical ceiling performance and revealing the fundamental trade-off between out-of-distribution accuracy and error consistency with humans.