PixNerd: Pixel Neural Field Diffusion
Overview
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce PixNerd, a novel architecture that combines large-patch diffusion transformers with neural field decoding for pixel-space image generation. It replaces the linear projection that conventionally decodes each token into patch pixels with a patch-wise implicit neural field head, enabling efficient single-stage, end-to-end training without VAEs or cascaded pipelines.
The authors design a patch-wise adaptive neural field head whose weights are predicted by the diffusion transformer's last hidden features. For each pixel within a patch, local coordinates and noisy pixel values are encoded and fed into a neural field MLP to predict diffusion velocity, addressing the challenge of learning fine details with large-patch configurations.
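The decoding step described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names, the sinusoidal coordinate encoding, the two-layer bias-free ReLU MLP, and the single hypernetwork matrix `w_hyper` are all assumptions made for the example.

```python
import numpy as np

def local_coords(patch_size):
    """Normalized (x, y) coordinates in [-1, 1] for every pixel of one patch."""
    axis = np.linspace(-1.0, 1.0, patch_size)
    yy, xx = np.meshgrid(axis, axis, indexing="ij")
    return np.stack([xx.ravel(), yy.ravel()], axis=-1)  # (P*P, 2)

def encode(coords, noisy_pixels, num_freqs):
    """Concatenate a sinusoidal coordinate encoding (assumed) with the noisy pixel values."""
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi        # (F,)
    angles = coords[:, :, None] * freqs                  # (P*P, 2, F)
    pe = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    pe = pe.reshape(coords.shape[0], -1)                 # (P*P, 4F)
    return np.concatenate([pe, noisy_pixels], axis=-1)   # (P*P, 4F + C)

def neural_field_head(hidden, noisy_patches, w_hyper, patch_size, mlp_hidden, num_freqs=4):
    """Predict a per-pixel velocity for each patch.

    hidden:        (N, D) last hidden features, one row per patch token
    noisy_patches: (N, P*P, C) noisy pixel values of each patch
    w_hyper:       (D, in_dim*mlp_hidden + mlp_hidden*C) hypothetical projection
                   that turns a token feature into that patch's MLP weights
    """
    n_tokens, _, channels = noisy_patches.shape
    in_dim = 4 * num_freqs + channels
    coords = local_coords(patch_size)

    # Each token predicts the weights of its own small MLP.
    params = hidden @ w_hyper
    split = in_dim * mlp_hidden
    w1 = params[:, :split].reshape(n_tokens, in_dim, mlp_hidden)
    w2 = params[:, split:].reshape(n_tokens, mlp_hidden, channels)

    velocity = np.empty_like(noisy_patches)
    for i in range(n_tokens):
        feats = encode(coords, noisy_patches[i], num_freqs)  # per-pixel inputs
        h = np.maximum(feats @ w1[i], 0.0)                   # ReLU (assumed activation)
        velocity[i] = h @ w2[i]                              # (P*P, C)
    return velocity
```

Because the MLP weights come from the token feature, two patches with identical noisy pixels can still decode differently, which is what makes the head patch-wise adaptive.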
The authors demonstrate that PixNerd achieves competitive results on both class-conditional and text-to-image generation tasks. On ImageNet 256×256, PixNerd-XL/16 obtains 1.93 FID with computational demands similar to latent diffusion models, while PixNerd-XXL/16 achieves strong scores on GenEval and DPG benchmarks for text-to-image generation.
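One way to read the claim of similar computational demands is through sequence length. The short sketch below counts transformer tokens under two assumptions that are not stated in the text above: that the "/16" suffix denotes a patch size of 16 pixels (the usual DiT naming convention), and that the latent diffusion baseline uses an 8× downsampling VAE with patch size 2. The helper `num_tokens` is hypothetical.

```python
def num_tokens(image_size: int, downsample: int, patch_size: int) -> int:
    """Number of transformer tokens for a square image.

    downsample is the spatial reduction applied before patchifying
    (1 for pixel space, 8 for an assumed 8x VAE latent space).
    """
    if image_size % downsample:
        raise ValueError("image_size must be divisible by downsample")
    side = image_size // downsample
    if side % patch_size:
        raise ValueError("downsampled side must be divisible by patch_size")
    return (side // patch_size) ** 2

# Pixel space with large patches (assumed patch size 16).
pixel_tokens = num_tokens(256, downsample=1, patch_size=16)
# Latent space with an assumed 8x VAE and patch size 2.
latent_tokens = num_tokens(256, downsample=8, patch_size=2)

print(pixel_tokens, latent_tokens)  # 256 256
```

Under these assumptions both setups feed the transformer the same number of tokens, so the attention cost is comparable even though the pixel-space model skips the VAE.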
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[18] Image Neural Field Diffusion Models
Contribution Analysis
Detailed comparisons for each claimed contribution
PixNerd: Pixel Neural Field Diffusion Transformer
The authors introduce PixNerd, a novel architecture that combines large-patch diffusion transformers with neural field decoding for pixel-space image generation. It replaces the linear projection that conventionally decodes each token into patch pixels with a patch-wise implicit neural field head, enabling efficient single-stage, end-to-end training without VAEs or cascaded pipelines.
[38] Latent Diffusion Transformer with Local Neural Field as PDE Surrogate Model
Patch-wise adaptive neural field head for large-patch decoding
The authors design a patch-wise adaptive neural field head whose weights are predicted by the diffusion transformer's last hidden features. For each pixel within a patch, local coordinates and noisy pixel values are encoded and fed into a neural field MLP to predict diffusion velocity, addressing the challenge of learning fine details with large-patch configurations.
Competitive performance on class-to-image and text-to-image benchmarks
The authors demonstrate that PixNerd achieves competitive results on both class-conditional and text-to-image generation tasks. On ImageNet 256×256, PixNerd-XL/16 obtains 1.93 FID with computational demands similar to latent diffusion models, while PixNerd-XXL/16 achieves strong scores on GenEval and DPG benchmarks for text-to-image generation.