Conjuring Semantic Similarity

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Meaning Representation, Semantic Similarity, Diffusion Model
Abstract:

The semantic similarity between sample expressions measures the distance between their latent 'meaning'. These meanings are themselves typically represented by textual expressions. We propose a novel approach whereby the semantic similarity among textual expressions is based not on other expressions they can be rephrased as, but rather on the imagery they evoke. While this is not possible with humans, generative models allow us to easily visualize and compare generated images, or their distribution, evoked by a textual prompt. Therefore, we characterize the semantic similarity between two textual expressions simply as the distance between the image distributions they induce, or 'conjure'. We show that, by choosing the Jeffreys divergence between the reverse-time diffusion stochastic differential equations (SDEs) induced by each textual expression, this distance can be computed directly via Monte-Carlo sampling. Our method contributes a novel perspective on semantic similarity that not only aligns with human-annotated scores, but also opens up new avenues for the evaluation of text-conditioned generative models, while offering better interpretability of their learnt representations.
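To make the computation concrete: for two reverse-time diffusion SDEs that share the same diffusion coefficient and differ only in their text-conditioned drift, the path-space KL divergence reduces (via Girsanov's theorem) to an expected squared difference of the conditional noise predictions along trajectories sampled from one process, and the Jeffreys divergence symmetrizes the two directed terms. Below is a minimal Monte-Carlo sketch of that reduction, not the authors' implementation: `eps_model`, `encode`, and `sample` are assumed placeholder interfaces to a text-conditioned diffusion model, a DDPM-style linear noise schedule is assumed, and schedule-dependent weighting factors are folded into the estimate for brevity.

```python
import torch

def make_alpha_bar(n_timesteps=1000, beta_start=1e-4, beta_end=2e-2):
    # Cumulative signal-retention schedule for a DDPM-style VP diffusion
    # (an assumption; the paper's exact schedule may differ).
    betas = torch.linspace(beta_start, beta_end, n_timesteps)
    return torch.cumprod(1.0 - betas, dim=0)

@torch.no_grad()
def jeffreys_divergence_mc(eps_model, encode, sample, prompt_a, prompt_b,
                           n_mc=64, n_timesteps=1000):
    """Monte-Carlo sketch of the Jeffreys divergence between the
    reverse-time SDEs conditioned on two prompts.

    Placeholder interfaces (assumptions, not a real API):
      encode(prompt)         -> text embedding
      sample(emb)            -> an image x0 drawn from the model given emb
      eps_model(x_t, t, emb) -> the model's noise prediction at time t
    """
    alpha_bar = make_alpha_bar(n_timesteps)
    emb_a, emb_b = encode(prompt_a), encode(prompt_b)

    def directed_kl(emb_src, emb_other):
        total = 0.0
        for _ in range(n_mc):
            x0 = sample(emb_src)          # trajectory endpoint from the source process
            t = torch.randint(0, n_timesteps, (1,)).item()
            noise = torch.randn_like(x0)
            a = alpha_bar[t]
            x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * noise  # forward-noise x0 to time t
            # Drift mismatch of the two reverse SDEs at (x_t, t): squared
            # difference of the conditional noise predictions.
            diff = eps_model(x_t, t, emb_src) - eps_model(x_t, t, emb_other)
            total += diff.pow(2).sum().item()
        return total / n_mc

    # Jeffreys divergence symmetrizes the two directed KL estimates.
    return directed_kl(emb_a, emb_b) + directed_kl(emb_b, emb_a)
```

In practice one would plug in a diffusers-style UNet and text encoder for the placeholders, batch the noise predictions, and reuse the same sampled images for both directed terms to reduce variance.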

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes measuring semantic similarity between textual expressions by comparing distributions of images generated by text-conditioned diffusion models, using the Jeffreys divergence between reverse-time SDEs. It resides in the 'Generative Model-Based Distribution Comparison' leaf, which contains only two papers, including this one. This represents a sparse research direction within the broader taxonomy of 46 papers across multiple branches. The sibling paper in this leaf also explores distribution-based similarity but may differ in technical approach or scope, suggesting this is an emerging rather than a crowded area.

The taxonomy reveals that most related work falls into neighboring branches: cross-modal embedding methods that learn joint text-image representations, and text-to-image generation quality assessment approaches that evaluate alignment without distribution comparison. The 'Retrieval-Based Visual Denotation Similarity' leaf offers an alternative distribution-based approach using retrieved rather than generated images. The paper's position bridges generative modeling and semantic similarity measurement, diverging from embedding-based methods that dominate the 'Cross-Modal Embedding and Alignment' branch with its multiple active sub-areas including image-text matching and multimodal distributional semantics.

Among 21 candidates examined across the three contributions, none was identified as clearly refuting the work. For the core contribution of visually-grounded semantic similarity, 10 candidates were examined with no refutations found. The Jeffreys divergence computation method was compared against only 1 candidate, while the evaluation-framework contribution was checked against 10 candidates, again with no overlapping prior work detected. Within this limited search scope (top-K semantic matches plus citation expansion), the contributions appear novel in the examined literature, though the small candidate pool and the sparse taxonomy leaf may reflect an under-explored research direction rather than exhaustive validation.

The analysis covers a focused but limited literature sample, constrained by the semantic-search methodology and the field's apparent sparseness in this specific direction. The absence of refuting work among the 21 candidates, combined with the two-paper taxonomy leaf, suggests either genuine novelty or insufficient prior exploration. The approach's distinctiveness (using generative distributions rather than embeddings or retrieval) may explain both its isolation in the taxonomy and the difficulty of finding closely comparable prior work within the search scope.

Taxonomy

Core-task Taxonomy Papers: 46
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 0

Research Landscape Overview

Core task: measuring semantic similarity between textual expressions through generated image distributions. The field encompasses diverse approaches to understanding and quantifying semantic relationships by bridging language and vision. The taxonomy reveals several major branches: some focus directly on image-distribution-based similarity measurement, comparing how different text prompts produce varying visual outputs; others emphasize cross-modal embedding and alignment, learning joint representations where semantically similar texts and images occupy nearby positions in a shared space. Additional branches address text-to-image generation quality, semantic representation learning from multimodal contexts, and application-driven systems that leverage these techniques for tasks like rumor detection or accessibility testing. Works such as Visual Semantic Reasoning[3] and Multimodal Distributional Semantics[16] illustrate how grounding language in visual data can enrich semantic understanding, while methods like Semantic Similarity Distance[10] and Enhanced Semantic Similarity[5] explore computational frameworks for quantifying these relationships.

Particularly active lines of work contrast direct generative approaches with embedding-based methods. Generative model-based distribution comparison, where Conjuring Semantic Similarity[0] resides, leverages the distributional properties of generated images to infer semantic closeness between prompts, an approach that naturally captures nuanced visual interpretations. This contrasts with embedding techniques like Semantic Similarity Embedding[21] or cross-modal alignment frameworks such as Scene Graph Alignment[14], which map texts and images into unified vector spaces for similarity computation.

Conjuring Semantic Similarity[0] sits within the generative distribution comparison cluster, closely related to Semantic Similarity Distance[10], which also examines distributional divergence. However, while Semantic Similarity Distance[10] may emphasize metric properties, Conjuring Semantic Similarity[0] appears to focus on harnessing the generative process itself as a semantic probe. Open questions remain about how these distribution-based methods scale, handle ambiguity, and compare to embedding approaches in capturing fine-grained semantic distinctions.

Claimed Contributions

Visually-grounded semantic similarity for text-conditioned diffusion models

The authors introduce a method to measure semantic similarity between textual expressions by comparing the distributions of images they generate in text-conditioned diffusion models, rather than comparing text representations directly. This provides a visually-grounded notion of meaning.

10 retrieved papers

Computable distance via Jeffreys divergence between reverse-time SDEs

The authors derive a tractable algorithm for computing semantic similarity by using the Jeffreys divergence between the SDEs governing diffusion processes conditioned on different text prompts, which can be estimated through Monte-Carlo sampling.

1 retrieved paper

First method for evaluating semantic alignment of diffusion models with humans

The authors present the first approach to quantitatively assess how well the semantic space learned by text-conditioned diffusion models aligns with human-annotated semantic similarity, enabling new evaluation paradigms for these models (a minimal evaluation sketch follows this list).

10 retrieved papers
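To make the evaluation-framework contribution concrete, here is a minimal sketch, not the authors' protocol, of how semantic alignment with human judgments could be scored: compute the model-induced distance for each sentence pair in an STS-style benchmark and rank-correlate it with the human scores. The names `pairs` and `model_distance` are hypothetical placeholders; `model_distance` could be the Jeffreys-divergence estimator sketched after the abstract.

```python
import numpy as np
from scipy.stats import spearmanr

def semantic_alignment(pairs, model_distance):
    """Rank-correlate model-induced distances with human similarity scores.

    `pairs` is assumed to be an STS-style benchmark of
    (sentence_a, sentence_b, human_score) triples, and `model_distance`
    a callable returning a divergence-based distance; both names are
    hypothetical placeholders.
    """
    human = np.array([score for _, _, score in pairs])
    # Negate distances so that larger values mean "more similar",
    # matching the orientation of human similarity scores.
    model = np.array([-model_distance(a, b) for a, b, _ in pairs])
    rho, pvalue = spearmanr(model, human)
    return rho, pvalue
```

Spearman's rank correlation is a natural choice here because it is invariant to monotone rescaling of the divergence, which in the sketch above is only defined up to schedule-dependent weighting.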

Core Task Comparisons

Comparisons with papers in the same taxonomy category.

Contribution Analysis

Detailed comparisons for each claimed contribution:

1. Visually-grounded semantic similarity for text-conditioned diffusion models: 10 candidate papers examined; none identified as refuting the contribution.
2. Computable distance via Jeffreys divergence between reverse-time SDEs: 1 candidate paper examined; none identified as refuting the contribution.
3. First method for evaluating semantic alignment of diffusion models with humans: 10 candidate papers examined; none identified as refuting the contribution.