Conjuring Semantic Similarity

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Meaning Representation, Semantic Similarity, Diffusion Model
Abstract:

The semantic similarity between sample expressions measures the distance between their latent 'meaning'. These meanings are themselves typically represented by textual expressions. We propose a novel approach whereby the semantic similarity among textual expressions is based not on other expressions they can be rephrased as, but rather on the imagery they evoke. While this is not possible with humans, generative models allow us to easily visualize and compare generated images, or their distribution, evoked by a textual prompt. Therefore, we characterize the semantic similarity between two textual expressions simply as the distance between the image distributions they induce, or 'conjure'. We show that, by choosing the Jeffreys divergence between the reverse-time diffusion stochastic differential equations (SDEs) induced by each textual expression, this distance can be computed directly via Monte-Carlo sampling. Our method contributes a novel perspective on semantic similarity that not only aligns with human-annotated scores, but also opens up new avenues for the evaluation of text-conditioned generative models, while offering better interpretability of their learnt representations.
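To make the computation concrete: for two reverse-time diffusion SDEs that share the same diffusion coefficient and differ only in their text-conditioned drift, the path-space KL divergence reduces (via Girsanov's theorem) to an expected squared difference of the conditional noise predictions along trajectories sampled from one process, and the Jeffreys divergence symmetrizes the two directed terms. Below is a minimal Monte-Carlo sketch of that reduction, not the authors' implementation: `eps_model`, `encode`, and `sample` are assumed placeholder interfaces to a text-conditioned diffusion model, a DDPM-style linear noise schedule is assumed, and schedule-dependent weighting factors are folded into the estimate for brevity.

```python
import torch

def make_alpha_bar(n_timesteps=1000, beta_start=1e-4, beta_end=2e-2):
    # Cumulative signal-retention schedule for a DDPM-style VP diffusion
    # (an assumption; the paper's exact schedule may differ).
    betas = torch.linspace(beta_start, beta_end, n_timesteps)
    return torch.cumprod(1.0 - betas, dim=0)

@torch.no_grad()
def jeffreys_divergence_mc(eps_model, encode, sample, prompt_a, prompt_b,
                           n_mc=64, n_timesteps=1000):
    """Monte-Carlo sketch of the Jeffreys divergence between the
    reverse-time SDEs conditioned on two prompts.

    Placeholder interfaces (assumptions, not a real API):
      encode(prompt)         -> text embedding
      sample(emb)            -> an image x0 drawn from the model given emb
      eps_model(x_t, t, emb) -> the model's noise prediction at time t
    """
    alpha_bar = make_alpha_bar(n_timesteps)
    emb_a, emb_b = encode(prompt_a), encode(prompt_b)

    def directed_kl(emb_src, emb_other):
        total = 0.0
        for _ in range(n_mc):
            x0 = sample(emb_src)          # trajectory endpoint from the source process
            t = torch.randint(0, n_timesteps, (1,)).item()
            noise = torch.randn_like(x0)
            a = alpha_bar[t]
            x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * noise  # forward-noise x0 to time t
            # Drift mismatch of the two reverse SDEs at (x_t, t): squared
            # difference of the conditional noise predictions.
            diff = eps_model(x_t, t, emb_src) - eps_model(x_t, t, emb_other)
            total += diff.pow(2).sum().item()
        return total / n_mc

    # Jeffreys divergence symmetrizes the two directed KL estimates.
    return directed_kl(emb_a, emb_b) + directed_kl(emb_b, emb_a)
```

In practice one would plug in a diffusers-style UNet and text encoder for the placeholders, batch the noise predictions, and reuse the same sampled images for both directed terms to reduce variance.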

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes measuring semantic similarity between textual expressions by comparing distributions of images generated by text-conditioned diffusion models, using the Jeffreys divergence between reverse-time SDEs. It resides in the 'Generative Model-Based Distribution Comparison' leaf, which contains only two papers, including this one. This represents a sparse research direction within the broader taxonomy of 46 papers across multiple branches. The sibling paper in this leaf also explores distribution-based similarity but may differ in technical approach or scope, suggesting this is an emerging rather than a crowded area.

The taxonomy reveals that most related work falls into neighboring branches: cross-modal embedding methods that learn joint text-image representations, and text-to-image generation quality assessment approaches that evaluate alignment without distribution comparison. The 'Retrieval-Based Visual Denotation Similarity' leaf offers an alternative distribution-based approach using retrieved rather than generated images. The paper's position bridges generative modeling and semantic similarity measurement, diverging from embedding-based methods that dominate the 'Cross-Modal Embedding and Alignment' branch with its multiple active sub-areas including image-text matching and multimodal distributional semantics.

Among 21 candidates examined across the three contributions, none was identified as clearly refuting the work. For the core contribution of visually-grounded semantic similarity, 10 candidates were examined with no refutations found. The Jeffreys divergence computation method was compared against only 1 candidate, while the evaluation-framework contribution was checked against 10 candidates, again with no overlapping prior work detected. Within this limited search scope (top-K semantic matches plus citation expansion), the contributions appear novel in the examined literature, though the small candidate pool and the sparse taxonomy leaf may reflect an under-explored research direction rather than exhaustive validation.

The analysis covers a focused but limited literature sample, constrained by the semantic-search methodology and the field's apparent sparseness in this specific direction. The absence of refuting work among the 21 candidates, combined with the two-paper taxonomy leaf, suggests either genuine novelty or insufficient prior exploration. The approach's distinctiveness (using generative distributions rather than embeddings or retrieval) may explain both its isolation in the taxonomy and the difficulty of finding closely comparable prior work within the search scope.

Taxonomy

Core-task Taxonomy Papers: 46
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 0

Research Landscape Overview

Core task: measuring semantic similarity between textual expressions through generated image distributions. The field encompasses diverse approaches to understanding and quantifying semantic relationships by bridging language and vision. The taxonomy reveals several major branches: some focus directly on image-distribution-based similarity measurement, comparing how different text prompts produce varying visual outputs; others emphasize cross-modal embedding and alignment, learning joint representations where semantically similar texts and images occupy nearby positions in a shared space. Additional branches address text-to-image generation quality, semantic representation learning from multimodal contexts, and application-driven systems that leverage these techniques for tasks like rumor detection or accessibility testing. Works such as Visual Semantic Reasoning[3] and Multimodal Distributional Semantics[16] illustrate how grounding language in visual data can enrich semantic understanding, while methods like Semantic Similarity Distance[10] and Enhanced Semantic Similarity[5] explore computational frameworks for quantifying these relationships.

Particularly active lines of work contrast direct generative approaches with embedding-based methods. Generative model-based distribution comparison, where Conjuring Semantic Similarity[0] resides, leverages the distributional properties of generated images to infer semantic closeness between prompts, an approach that naturally captures nuanced visual interpretations. This contrasts with embedding techniques like Semantic Similarity Embedding[21] or cross-modal alignment frameworks such as Scene Graph Alignment[14], which map texts and images into unified vector spaces for similarity computation.

Conjuring Semantic Similarity[0] sits within the generative distribution comparison cluster, closely related to Semantic Similarity Distance[10], which also examines distributional divergence. However, while Semantic Similarity Distance[10] may emphasize metric properties, Conjuring Semantic Similarity[0] appears to focus on harnessing the generative process itself as a semantic probe. Open questions remain about how these distribution-based methods scale, handle ambiguity, and compare to embedding approaches in capturing fine-grained semantic distinctions.

Claimed Contributions

Visually-grounded semantic similarity for text-conditioned diffusion models

The authors introduce a method to measure semantic similarity between textual expressions by comparing the distributions of images they generate in text-conditioned diffusion models, rather than comparing text representations directly. This provides a visually-grounded notion of meaning.

10 retrieved papers

Computable distance via Jeffreys divergence between reverse-time SDEs

The authors derive a tractable algorithm for computing semantic similarity by using the Jeffreys divergence between the SDEs governing diffusion processes conditioned on different text prompts, which can be estimated through Monte-Carlo sampling.

1 retrieved paper

First method for evaluating semantic alignment of diffusion models with humans

The authors present the first approach to quantitatively assess how well the semantic space learned by text-conditioned diffusion models aligns with human-annotated semantic similarity, enabling new evaluation paradigms for these models (a minimal evaluation sketch follows this list).

10 retrieved papers
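To make the evaluation-framework contribution concrete, here is a minimal sketch, not the authors' protocol, of how semantic alignment with human judgments could be scored: compute the model-induced distance for each sentence pair in an STS-style benchmark and rank-correlate it with the human scores. The names `pairs` and `model_distance` are hypothetical placeholders; `model_distance` could be the Jeffreys-divergence estimator sketched after the abstract.

```python
import numpy as np
from scipy.stats import spearmanr

def semantic_alignment(pairs, model_distance):
    """Rank-correlate model-induced distances with human similarity scores.

    `pairs` is assumed to be an STS-style benchmark of
    (sentence_a, sentence_b, human_score) triples, and `model_distance`
    a callable returning a divergence-based distance; both names are
    hypothetical placeholders.
    """
    human = np.array([score for _, _, score in pairs])
    # Negate distances so that larger values mean "more similar",
    # matching the orientation of human similarity scores.
    model = np.array([-model_distance(a, b) for a, b, _ in pairs])
    rho, pvalue = spearmanr(model, human)
    return rho, pvalue
```

Spearman's rank correlation is a natural choice here because it is invariant to monotone rescaling of the divergence, which in the sketch above is only defined up to schedule-dependent weighting.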

Core Task Comparisons

Comparisons with papers in the same taxonomy category.

Contribution Analysis

Detailed comparisons for each claimed contribution:

1. Visually-grounded semantic similarity for text-conditioned diffusion models: 10 candidate papers examined; none identified as refuting the contribution.
2. Computable distance via Jeffreys divergence between reverse-time SDEs: 1 candidate paper examined; none identified as refuting the contribution.
3. First method for evaluating semantic alignment of diffusion models with humans: 10 candidate papers examined; none identified as refuting the contribution.