PixNerd: Pixel Neural Field Diffusion

ICLR 2026 Conference Submission, Anonymous Authors
pixel diffusion model
Abstract:

The current success of diffusion transformers is built on the compressed latent space shaped by a pre-trained variational autoencoder (VAE). However, this two-stage training paradigm inevitably introduces accumulated errors and decoding artifacts. To avoid these problems, researchers have returned to pixel-space modeling, but at the cost of complicated cascade pipelines and increased token complexity. Motivated by the simple yet effective diffusion transformer architectures in latent space, we propose to model pixel-space diffusion with a large-patch diffusion transformer and to employ neural fields to decode these large patches, yielding a streamlined, single-stage, end-to-end solution that we coin the pixel neural field diffusion transformer (PixNerd). Thanks to its efficient neural field representation, PixNerd achieves 1.93 FID on ImageNet 256×256 with nearly 8× lower latency, without any complex cascade pipeline or VAE. We also extend the PixNerd framework to text-to-image applications: our PixNerd-XXL/16 achieves a competitive 0.73 overall score on the GenEval benchmark and 80.9 overall score on the DPG benchmark.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 26
Claimed Contributions: 3
Contribution Candidate Papers Compared: 13
Refutable Papers: 1

Research Landscape Overview

Core task: pixel-space diffusion modeling with neural fields. This emerging area combines diffusion generative models with neural field representations to synthesize and manipulate continuous visual data. The taxonomy reveals several complementary directions. Direct Pixel-Space Neural Field Generation methods, such as PixNerd[0] and Image Neural Field[18], learn to generate neural field parameters end-to-end while diffusing in pixel or coordinate space, enabling flexible resolution and continuous outputs. Latent-Space Neural Field Diffusion approaches like NeuralField-LDM[11] and HyperDiffusion[24] instead encode neural fields into compact latent codes before applying diffusion, trading some direct pixel control for efficiency and scalability. Neural Radiance Field Editing and Synthesis (e.g., DreamEditor[3], ViCA-NeRF[2]) focuses on manipulating 3D scene representations, while Texture Synthesis on 3D Surfaces with Diffusion (TexFusion[4], Single Mesh Diffusion[7]) targets surface appearance. Diffusion Models on Continuous Function Spaces (Diffusion Probabilistic Fields[5], Diff-INR[15]) explore the theoretical grounding of diffusing over function-valued distributions, and Domain-Specific Neural Field Diffusion Applications extend these ideas to medical imaging (Accelerated MRI[20]) and scientific data (Spatiotemporal Turbulence[21]). A central tension runs between end-to-end pixel-space methods and latent-space strategies: the former preserve fine-grained control and interpretability, while the latter achieve faster sampling and better scalability for high-dimensional scenes.

PixNerd[0] exemplifies the direct pixel-space philosophy, generating neural field weights through a diffusion process that operates close to the final rendering, much like Image Neural Field[18]. This contrasts with latent approaches such as NeuralField-LDM[11], which compress neural fields into lower-dimensional codes before diffusion, sacrificing some pixel-level transparency for computational gains. Meanwhile, works like Diffusion Probabilistic Fields[5] provide a rigorous function-space perspective that underpins both paradigms. PixNerd[0] thus sits squarely in the Direct Pixel-Space branch, sharing conceptual ground with Image Neural Field[18] in prioritizing continuous, resolution-agnostic generation without an intermediate latent bottleneck.

Claimed Contributions

PixNerd: Pixel Neural Field Diffusion Transformer

The authors introduce PixNerd, a novel architecture that combines large-patch diffusion transformers with neural field decoding for pixel-space image generation. This approach replaces the traditional linear projection with a patch-wise implicit neural field head, enabling efficient single-stage end-to-end training without requiring VAEs or cascade pipelines.

1 retrieved paper
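To make the interface change concrete, the following is a minimal sketch (hypothetical dimensions and layer names, not the authors' implementation): a standard DiT-style head maps each patch token back to pixels through one linear projection, whereas a neural-field head maps the same token to the parameters of a tiny per-patch MLP that will later be queried at every pixel coordinate.

```python
import torch

B, N, D, patch = 2, 4, 64, 16    # hypothetical: 4 tokens, each a 16x16 pixel patch
hidden = torch.randn(B, N, D)    # last hidden states of the diffusion transformer

# Standard DiT-style output head: one linear projection per token back to pixels.
linear_head = torch.nn.Linear(D, patch * patch * 3)
pixels = linear_head(hidden)                      # (B, N, 16*16*3)

# PixNerd-style output head: the token instead parameterises a small two-layer
# MLP (a neural field) queried per pixel inside the patch, so fine detail is
# not bottlenecked by a single linear map. Biases are omitted for brevity.
hdim, in_dim = 32, 2 + 3                          # (x, y) coords + noisy RGB
field_head = torch.nn.Linear(D, in_dim * hdim + hdim * 3)
field_params = field_head(hidden)                 # (B, N, per-patch MLP weights)
```

The token dimension D, hidden width, and the choice of a two-layer MLP here are illustrative; only the overall interface (token → per-patch decoder parameters) follows the paper's description.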
Patch-wise adaptive neural field head for large-patch decoding

The authors design a patch-wise adaptive neural field head whose weights are predicted by the diffusion transformer's last hidden features. For each pixel within a patch, local coordinates and noisy pixel values are encoded and fed into a neural field MLP to predict diffusion velocity, addressing the challenge of learning fine details with large-patch configurations.

2 retrieved papers
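The decoding step described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions (a bias-free two-layer MLP, a randomly initialised hypernetwork, and made-up tensor shapes), not the authors' code: per-patch MLP weights are predicted from the transformer's last hidden features, then each pixel's local coordinates and noisy values are fed through that MLP to predict the diffusion velocity.

```python
import torch

def neural_field_decode(hidden, noisy_patch, patch=16, coord_dim=2, hdim=32):
    """Sketch of a patch-wise adaptive neural field head.

    hidden:      (B, N, D) last transformer hidden states, one token per patch.
    noisy_patch: (B, N, patch*patch, 3) noisy RGB values inside each patch.
    Returns per-pixel velocity predictions of shape (B, N, patch*patch, 3).
    """
    B, N, D = hidden.shape
    in_dim = coord_dim + 3                  # local (x, y) coords + noisy RGB
    n_params = in_dim * hdim + hdim * 3     # two-layer MLP, biases omitted

    # A hypernetwork maps hidden states to per-patch MLP weights; randomly
    # initialised here, but learned end-to-end in the paper's setup.
    hyper = torch.nn.Linear(D, n_params)
    w = hyper(hidden)                                         # (B, N, n_params)
    w1 = w[..., : in_dim * hdim].view(B, N, in_dim, hdim)
    w2 = w[..., in_dim * hdim :].view(B, N, hdim, 3)

    # Normalised local pixel coordinates within each patch, in [-1, 1].
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, patch), torch.linspace(-1, 1, patch), indexing="ij"
    )
    coords = torch.stack([xs, ys], dim=-1).reshape(1, 1, patch * patch, coord_dim)
    coords = coords.expand(B, N, -1, -1)

    # Query the per-patch MLP at every pixel: encode (coords, noisy values),
    # then apply the predicted weights to get the velocity prediction.
    feat = torch.cat([coords, noisy_patch], dim=-1)           # (B, N, P*P, in_dim)
    h = torch.relu(torch.einsum("bnpi,bnih->bnph", feat, w1))
    velocity = torch.einsum("bnph,bnho->bnpo", h, w2)         # (B, N, P*P, 3)
    return velocity
```

The batched `einsum` calls apply a different set of MLP weights to each patch, which is what makes the head "adaptive": detail within a large patch is decoded by weights conditioned on that patch's token.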
Competitive performance on class-to-image and text-to-image benchmarks

The authors demonstrate that PixNerd achieves competitive results on both class-conditional and text-to-image generation tasks. On ImageNet 256×256, PixNerd-XL/16 obtains 1.93 FID with computational demands similar to latent diffusion models, while PixNerd-XXL/16 achieves strong scores on GenEval and DPG benchmarks for text-to-image generation.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
