Flow-Based Alignment of Uni-Modal Vision and Text Encoders for Few-Shot Image Classification

ICLR 2026 Conference SubmissionAnonymous Authors

OpenReview Score: 6.5 Download Report PDF

few-shot classificationvision-language modelsCLIP adaptationalignment of uni-modal encodersflow matching

Few-shot classification with vision–language models remains challenging, particularly when relying on multi-modal encoders such as CLIP that are restricted to paired image–text data. We introduce FSF, a framework that leverages arbitrary uni-modal encoders—including vision or text models that were pretrained on broad or domain-specific corpora—and aligns them for cross-modal classification. FSF first applies a closed-form orthogonal Procrustes map to align image and text embeddings while preserving their geometry, and then trains a lightweight flow-matching prior that regularizes adaptation in the few-shot regime. At inference, images are classified by cosine similarity in the aligned feature space between query embeddings and mapped class prototypes. Experiments on standard benchmarks, ImageNet variants, and VinDr-CXR, a large-scale chest X-ray benchmark, show that FSF is able to leverage stronger or specialized encoders, achieving competitive or superior accuracy compared to recent adaptation methods.

Abstract:

Disclaimer

This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.

NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.

If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces FSF, a framework that aligns independently trained uni-modal vision and text encoders for few-shot classification using orthogonal Procrustes mapping and a flow-matching prior. It resides in the 'Geometric Feature Alignment' leaf of the taxonomy, which contains only three papers total (including this one). This leaf sits within the broader 'Feature Space Alignment and Projection' branch, indicating a relatively sparse research direction compared to more crowded areas like 'Prompt-Based Adaptation Methods' (with multiple multi-paper subcategories) or 'Architectural Adaptation Methods'.

The taxonomy reveals that FSF's closest neighbors are other geometric alignment approaches (Selective Subspace Projection, Sketch Person Re-ID) and flow-based methods in a sibling leaf (Flow-Based and Generative Alignment, with two papers). The broader 'Feature Space Alignment and Projection' branch contrasts with prompt-based methods that optimize learnable tokens and architectural methods that insert adapter modules. FSF's emphasis on closed-form geometric transformations plus generative priors positions it at the intersection of deterministic alignment (Geometric Feature Alignment) and probabilistic modeling (Flow-Based Alignment), bridging two sparse subcategories within a moderately populated parent branch.

Among 29 candidates examined, the analysis found three refutable pairs, all concentrated in the 'Orthogonal Procrustes alignment' contribution (10 candidates examined, 3 refutable). The FSF framework itself (9 candidates, 0 refutable) and the flow-matching prior (10 candidates, 0 refutable) appear more novel within this limited search scope. The Procrustes component faces clearer prior work overlap, suggesting that the geometric alignment technique is less distinctive than the overall framework design or the integration of flow-based regularization for few-shot adaptation.

Given the sparse taxonomy leaf (three papers) and the limited search scale (29 candidates), FSF appears to occupy a relatively under-explored niche combining geometric and generative alignment strategies. However, the Procrustes mapping component shows measurable overlap with existing geometric methods, indicating that novelty concentrates more in the framework's integration of flow-matching priors and its application to arbitrary uni-modal encoders rather than in the alignment technique itself.

Taxonomy

Core-task Taxonomy Papers

Claimed Contributions

Contribution Candidate Papers Compared

Refutable Paper

Research Landscape Overview

Core task: Aligning independently trained vision and text encoders for few-shot image classification. The field has evolved around the challenge of adapting large-scale vision-language models—such as CLIP—to downstream tasks with minimal labeled data. The taxonomy reveals several major branches: Prompt-Based Adaptation Methods explore learnable textual or visual prompts to steer pretrained encoders without full retraining (e.g., Learning to Prompt[3], Cross-Coupled Prompt[5]); Architectural Adaptation Methods introduce lightweight modules like adapters (CLIP-Adapter[4], TCFF-Adapter[27]) to bridge modality gaps; Feature Space Alignment and Projection focuses on geometric transformations and subspace techniques (Selective Subspace Projection[24], Flow-Based Alignment[0]) to harmonize vision and text embeddings; Contrastive and Metric Learning refines similarity measures and prototypes (Proto-CLIP[18], SimCLIP[22]); Training-Free and Cache-Based Methods leverage precomputed features for zero-shot or few-shot scenarios (Black Box Adaptation[8]); Multimodal Information Fusion combines cross-modal cues (Multimodal Retrieval Fusion[26], ProFusion[28]); Domain-Specific and Specialized Applications address niche settings like medical imaging (Hierarchical Contrastive Medical[41]) or hyperspectral data (Cross-Domain Hyperspectral[29]); Robustness and Generalization Enhancement targets distribution shifts and domain adaptation (Domain Aligned CLIP[49]); and Foundational and Theoretical Frameworks provide conceptual underpinnings (Representation Learning Few-Shot[36]). A particularly active line of work centers on geometric feature alignment, where methods explicitly model the structure of embedding spaces to reduce modality discrepancies. Flow-Based Alignment[0] sits within this cluster, emphasizing continuous transformations to warp vision features toward text representations, contrasting with discrete projection approaches like Selective Subspace Projection[24] that identify low-dimensional subspaces for alignment. Nearby works such as Sketch Person Re-ID[23] and SGVA-CLIP[1] also manipulate feature geometry but differ in their application domains and the degree of supervision required. Another vibrant theme involves prompt-based adaptation, where learnable tokens (Learning to Prompt[3], Causal Interventional Prompt[25]) offer parameter-efficient tuning, trading off simplicity for expressiveness compared to architectural interventions. Open questions persist around the trade-off between alignment complexity and generalization: while geometric methods can capture fine-grained structure, they risk overfitting in extreme few-shot regimes, whereas prompt-based techniques may struggle with severe domain shifts. Flow-Based Alignment[0] contributes to this landscape by proposing a flexible, continuous alignment strategy that balances geometric fidelity with computational efficiency, positioning itself as a middle ground between rigid projections and fully adaptive architectures.

Claimed Contributions

FSF framework for aligning uni-modal vision and text encoders

9 retrieved papers

The authors propose FSF, a modular framework that enables flexible alignment of independently pretrained vision and text encoders for few-shot image classification. Unlike existing methods that rely on jointly trained multi-modal encoders like CLIP, FSF can work with arbitrary uni-modal encoders from different domains or pretraining regimes.

9 retrieved papers

Closed-form Orthogonal Procrustes alignment for cross-modal embeddings

Can Refute

10 retrieved papers

The method uses a training-free Orthogonal Procrustes solution to align text and image feature spaces through a semi-orthogonal linear map. This closed-form alignment preserves within-modality geometric structure while enabling cross-modal comparison without requiring gradient-based optimization.

10 retrieved papers

Can Refute

Lightweight flow-matching prior for few-shot adaptation

10 retrieved papers

The authors introduce a parameter-efficient flow-matching module that learns continuous-time velocity fields between image and text embeddings in the aligned space. This flow-based prior provides expressive non-linear modeling capacity while remaining efficient enough for few-shot learning scenarios.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

[23] A Theory-Inspired Framework for Few-Shot Cross-Modal Sketch Person Re-Identification PDF

Yunpeng Gong, Yongjie Hou, Jiangming Shi, Kim Long Diep, Min Jiang (2025)

[24] Selective Vision-Language Subspace Projection for Few-shot CLIP PDF

Xingyu Zhu, Beier Zhu, Yi Tan, Shuo Wang, Yanbin Hao, Hanwang Zhang, H. Zhang (2024) • ACM Multimedia

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

FSF framework for aligning uni-modal vision and text encoders

[9] Multimodality helps unimodality: Cross-modal few-shot learning with multimodal models PDF

Cannot Refute

[71] Multimodal Representation Alignment for Cross-modal Information Retrieval PDF

Cannot Refute

[72] Assessing and Learning Alignment of Unimodal Vision and Language Models PDF

Cannot Refute

[73] Bidirectional cross-modal knowledge exploration for video recognition with pre-trained vision-language models PDF

Cannot Refute

[74] VLCDoC: Vision-language contrastive pre-training model for cross-modal document classification PDF

Cannot Refute

[76] Cross-modal incongruity aligning and collaborating for multi-modal sarcasm detection PDF

Cannot Refute

[77] Fusion: Fully integration of vision-language representations for deep cross-modal understanding PDF

Cannot Refute

[78] Empowering Unsupervised Domain Adaptation with Large-scale Pre-trained Vision-Language Models PDF

Cannot Refute

[79] Expanding large pre-trained unimodal models with multimodal information injection for image-text multimodal classification PDF

Cannot Refute

Contribution

Closed-form Orthogonal Procrustes alignment for cross-modal embeddings

[61] Normalization of language embeddings for cross-lingual alignment PDF

Can Refute

[65] Is Cross-Modal Information Retrieval Possible Without Training? PDF

Can Refute

[68] When Embedding Models Meet: Procrustes Bounds and Applications PDF

Can Refute

[62] Efficientdreamer: High-fidelity and robust 3d creation via orthogonal-view diffusion priors PDF

Cannot Refute

[63] Dual similarity enhanced hybrid orthogonal fusion for multimodal named entity recognition PDF

Cannot Refute

[64] Supervised discrete online hashing for large-scale cross-modal retrieval PDF

Cannot Refute

[66] Latent structure-oriented asymmetric hashing for cross-modal retrieval PDF

Cannot Refute

[67] Discrete semantic embedding hashing for scalable cross-modal retrieval PDF

Cannot Refute

[69] Cross-Modal Index Alignment: Bridging Vision and Language in Neural Retrieval Architectures PDF

Cannot Refute

[70] BLIP-FusePPO: A Vision-Language Deep Reinforcement Learning Framework for Lane Keeping in Autonomous Vehicles PDF

Cannot Refute

Contribution

Lightweight flow-matching prior for few-shot adaptation

[51] In-N-On: Scaling Egocentric Manipulation with in-the-wild and on-task Data PDF

Cannot Refute

[52] Balancing Signal and Variance: Adaptive Offline RL Post-Training for VLA Flow Models PDF

Cannot Refute

[53] Ost: Refining text knowledge with optimal spatio-temporal descriptor for general video recognition PDF

Cannot Refute

[54] Yingsound: Video-guided sound effects generation with multi-modal chain-of-thought controls PDF

Cannot Refute

[55] Motion Manifold Flow Primitives for Task-Conditioned Trajectory Generation Under Complex Task-Motion Dependencies PDF

Cannot Refute

[56] Exploring Cross-Modal Flows for Few-Shot Learning PDF

Cannot Refute

[57] On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations PDF

Cannot Refute

[58] Learning by Neighbor-Aware Semantics, Deciding by Open-form Flows: Towards Robust Zero-Shot Skeleton Action Recognition PDF

Cannot Refute

[59] Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better PDF

Cannot Refute

[60] DiG-Flow: Discrepancy-Guided Flow Matching for Robust VLA Models PDF

Cannot Refute

Flow-Based Alignment of Uni-Modal Vision and Text Encoders for Few-Shot Image Classification

Overview

Overall Novelty Assessment

Taxonomy

Research Landscape Overview

Claimed Contributions

Core Task Comparisons

[23] A Theory-Inspired Framework for Few-Shot Cross-Modal Sketch Person Re-Identification PDF

[24] Selective Vision-Language Subspace Projection for Few-shot CLIP PDF

Contribution Analysis

FSF framework for aligning uni-modal vision and text encoders

[9] Multimodality helps unimodality: Cross-modal few-shot learning with multimodal models PDF

[71] Multimodal Representation Alignment for Cross-modal Information Retrieval PDF

[72] Assessing and Learning Alignment of Unimodal Vision and Language Models PDF

[73] Bidirectional cross-modal knowledge exploration for video recognition with pre-trained vision-language models PDF

[74] VLCDoC: Vision-language contrastive pre-training model for cross-modal document classification PDF

[76] Cross-modal incongruity aligning and collaborating for multi-modal sarcasm detection PDF

[77] Fusion: Fully integration of vision-language representations for deep cross-modal understanding PDF

[78] Empowering Unsupervised Domain Adaptation with Large-scale Pre-trained Vision-Language Models PDF

[79] Expanding large pre-trained unimodal models with multimodal information injection for image-text multimodal classification PDF

Closed-form Orthogonal Procrustes alignment for cross-modal embeddings

[61] Normalization of language embeddings for cross-lingual alignment PDF

[65] Is Cross-Modal Information Retrieval Possible Without Training? PDF

[68] When Embedding Models Meet: Procrustes Bounds and Applications PDF

[62] Efficientdreamer: High-fidelity and robust 3d creation via orthogonal-view diffusion priors PDF

[63] Dual similarity enhanced hybrid orthogonal fusion for multimodal named entity recognition PDF

[64] Supervised discrete online hashing for large-scale cross-modal retrieval PDF

[66] Latent structure-oriented asymmetric hashing for cross-modal retrieval PDF

[67] Discrete semantic embedding hashing for scalable cross-modal retrieval PDF

[69] Cross-Modal Index Alignment: Bridging Vision and Language in Neural Retrieval Architectures PDF

[70] BLIP-FusePPO: A Vision-Language Deep Reinforcement Learning Framework for Lane Keeping in Autonomous Vehicles PDF

Lightweight flow-matching prior for few-shot adaptation

[51] In-N-On: Scaling Egocentric Manipulation with in-the-wild and on-task Data PDF

[52] Balancing Signal and Variance: Adaptive Offline RL Post-Training for VLA Flow Models PDF

[53] Ost: Refining text knowledge with optimal spatio-temporal descriptor for general video recognition PDF

[54] Yingsound: Video-guided sound effects generation with multi-modal chain-of-thought controls PDF

[55] Motion Manifold Flow Primitives for Task-Conditioned Trajectory Generation Under Complex Task-Motion Dependencies PDF

[56] Exploring Cross-Modal Flows for Few-Shot Learning PDF

[57] On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations PDF

[58] Learning by Neighbor-Aware Semantics, Deciding by Open-form Flows: Towards Robust Zero-Shot Skeleton Action Recognition PDF

[59] Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better PDF

[60] DiG-Flow: Discrepancy-Guided Flow Matching for Robust VLA Models PDF

Table of Contents