Flow-Based Alignment of Uni-Modal Vision and Text Encoders for Few-Shot Image Classification

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: few-shot classification, vision-language models, CLIP adaptation, alignment of uni-modal encoders, flow matching
Abstract:

Few-shot classification with vision–language models remains challenging, particularly when relying on multi-modal encoders such as CLIP that are restricted to paired image–text data. We introduce FSF, a framework that leverages arbitrary uni-modal encoders (including vision or text models pretrained on broad or domain-specific corpora) and aligns them for cross-modal classification. FSF first applies a closed-form orthogonal Procrustes map that aligns image and text embeddings while preserving their geometry, then trains a lightweight flow-matching prior that regularizes adaptation in the few-shot regime. At inference, images are classified by cosine similarity in the aligned feature space between query embeddings and mapped class prototypes. Experiments on standard benchmarks, ImageNet variants, and VinDr-CXR, a large-scale chest X-ray benchmark, show that FSF can leverage stronger or specialized encoders, achieving accuracy competitive with or superior to recent adaptation methods.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces FSF, a framework that aligns independently trained uni-modal vision and text encoders for few-shot classification using orthogonal Procrustes mapping and a flow-matching prior. It resides in the 'Geometric Feature Alignment' leaf of the taxonomy, which contains only three papers total (including this one). This leaf sits within the broader 'Feature Space Alignment and Projection' branch, indicating a relatively sparse research direction compared to more crowded areas like 'Prompt-Based Adaptation Methods' (with multiple multi-paper subcategories) or 'Architectural Adaptation Methods'.

The taxonomy reveals that FSF's closest neighbors are other geometric alignment approaches (Selective Subspace Projection, Sketch Person Re-ID) and flow-based methods in a sibling leaf (Flow-Based and Generative Alignment, with two papers). The broader 'Feature Space Alignment and Projection' branch contrasts with prompt-based methods that optimize learnable tokens and architectural methods that insert adapter modules. FSF's emphasis on closed-form geometric transformations plus generative priors positions it at the intersection of deterministic alignment (Geometric Feature Alignment) and probabilistic modeling (Flow-Based Alignment), bridging two sparse subcategories within a moderately populated parent branch.

Among 29 candidates examined, the analysis found three refutable pairs, all concentrated in the 'Orthogonal Procrustes alignment' contribution (10 candidates examined, 3 refutable). The FSF framework itself (9 candidates, 0 refutable) and the flow-matching prior (10 candidates, 0 refutable) appear more novel within this limited search scope. The Procrustes component faces clearer prior work overlap, suggesting that the geometric alignment technique is less distinctive than the overall framework design or the integration of flow-based regularization for few-shot adaptation.

Given the sparse taxonomy leaf (three papers) and the limited search scale (29 candidates), FSF appears to occupy a relatively under-explored niche combining geometric and generative alignment strategies. However, the Procrustes mapping component shows measurable overlap with existing geometric methods, indicating that novelty concentrates more in the framework's integration of flow-matching priors and its application to arbitrary uni-modal encoders rather than in the alignment technique itself.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 3

Research Landscape Overview

Core task: Aligning independently trained vision and text encoders for few-shot image classification. The field has evolved around the challenge of adapting large-scale vision-language models such as CLIP to downstream tasks with minimal labeled data. The taxonomy reveals several major branches:

- Prompt-Based Adaptation Methods: learnable textual or visual prompts steer pretrained encoders without full retraining (Learning to Prompt[3], Cross-Coupled Prompt[5]).
- Architectural Adaptation Methods: lightweight modules such as adapters bridge modality gaps (CLIP-Adapter[4], TCFF-Adapter[27]).
- Feature Space Alignment and Projection: geometric transformations and subspace techniques harmonize vision and text embeddings (Selective Subspace Projection[24], Flow-Based Alignment[0]).
- Contrastive and Metric Learning: refined similarity measures and prototypes (Proto-CLIP[18], SimCLIP[22]).
- Training-Free and Cache-Based Methods: precomputed features for zero-shot or few-shot scenarios (Black Box Adaptation[8]).
- Multimodal Information Fusion: combined cross-modal cues (Multimodal Retrieval Fusion[26], ProFusion[28]).
- Domain-Specific and Specialized Applications: niche settings such as medical imaging (Hierarchical Contrastive Medical[41]) or hyperspectral data (Cross-Domain Hyperspectral[29]).
- Robustness and Generalization Enhancement: distribution shifts and domain adaptation (Domain Aligned CLIP[49]).
- Foundational and Theoretical Frameworks: conceptual underpinnings (Representation Learning Few-Shot[36]).

A particularly active line of work centers on geometric feature alignment, where methods explicitly model the structure of embedding spaces to reduce modality discrepancies.
Flow-Based Alignment[0] sits within this cluster, emphasizing continuous transformations to warp vision features toward text representations, contrasting with discrete projection approaches like Selective Subspace Projection[24] that identify low-dimensional subspaces for alignment. Nearby works such as Sketch Person Re-ID[23] and SGVA-CLIP[1] also manipulate feature geometry but differ in their application domains and the degree of supervision required. Another vibrant theme involves prompt-based adaptation, where learnable tokens (Learning to Prompt[3], Causal Interventional Prompt[25]) offer parameter-efficient tuning, trading off simplicity for expressiveness compared to architectural interventions. Open questions persist around the trade-off between alignment complexity and generalization: while geometric methods can capture fine-grained structure, they risk overfitting in extreme few-shot regimes, whereas prompt-based techniques may struggle with severe domain shifts. Flow-Based Alignment[0] contributes to this landscape by proposing a flexible, continuous alignment strategy that balances geometric fidelity with computational efficiency, positioning itself as a middle ground between rigid projections and fully adaptive architectures.

Claimed Contributions

FSF framework for aligning uni-modal vision and text encoders

The authors propose FSF, a modular framework that enables flexible alignment of independently pretrained vision and text encoders for few-shot image classification. Unlike existing methods that rely on jointly trained multi-modal encoders like CLIP, FSF can work with arbitrary uni-modal encoders from different domains or pretraining regimes.

9 retrieved papers
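The inference rule the abstract attributes to FSF, nearest class prototype by cosine similarity in the aligned space, can be sketched as follows. This is a toy illustration, not the paper's implementation; the function name, dimensions, and data are assumptions.

```python
import numpy as np

def cosine_classify(query, prototypes):
    """Nearest-prototype classification by cosine similarity (toy sketch)."""
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return (q @ p.T).argmax(axis=1)          # index of best-matching class

# toy aligned space: 3 class prototypes, 2 queries near classes 0 and 2
prototypes = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
queries = np.array([[0.9, 0.1], [-0.8, 0.05]])
print(cosine_classify(queries, prototypes))  # → [0 2]
```

Because both sides are L2-normalized, the dot product equals the cosine similarity, so the argmax picks the prototype with the smallest angle to each query.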
Closed-form Orthogonal Procrustes alignment for cross-modal embeddings

The method uses a training-free Orthogonal Procrustes solution to align text and image feature spaces through a semi-orthogonal linear map. This closed-form alignment preserves within-modality geometric structure while enabling cross-modal comparison without requiring gradient-based optimization.

10 retrieved papers
Can Refute
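The closed-form alignment described above can be sketched with the standard SVD-based Procrustes solution. This is a minimal illustration under assumed dimensions and an assumed mapping direction (text to image space); the paper's exact formulation may differ.

```python
import numpy as np

def procrustes_align(T, V):
    """Procrustes-style alignment: a semi-orthogonal map W (W.T @ W = I)
    taking text embeddings toward paired image embeddings, obtained from
    the SVD of the cross-covariance. Sketch only."""
    U, _, Vt = np.linalg.svd(T.T @ V, full_matrices=False)
    return U @ Vt  # shape (d_text, d_img), orthonormal columns

# toy paired embeddings: 5 classes, text dim 8, image dim 6 (all assumed)
rng = np.random.default_rng(0)
T = rng.normal(size=(5, 8))   # one text prototype per class
V = rng.normal(size=(5, 6))   # matching image prototypes
W = procrustes_align(T, V)
mapped = T @ W                # text prototypes mapped into the image space
print(W.shape, np.allclose(W.T @ W, np.eye(6)))  # → (8, 6) True
```

The semi-orthogonality constraint is what preserves within-modality geometry: distances and angles among the mapped text prototypes are unchanged by W, consistent with the claim above.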
Lightweight flow-matching prior for few-shot adaptation

The authors introduce a parameter-efficient flow-matching module that learns continuous-time velocity fields between image and text embeddings in the aligned space. This flow-based prior provides expressive non-linear modeling capacity while remaining efficient enough for few-shot learning scenarios.

10 retrieved papers
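The flow-matching idea above, learning a continuous-time velocity field between image and text embeddings, can be sketched with the common straight-line conditional path. Everything here is a toy stand-in: the data are synthetic, and a linear model replaces the paper's (unspecified) lightweight module.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
# hypothetical paired embeddings in the aligned space (toy data)
x0 = rng.normal(size=(256, d))   # image embeddings (source)
x1 = x0 + 1.5                    # matching text embeddings (shifted target)

# tiny linear velocity field v(x, t) = [x, t, 1] @ theta
theta = np.zeros((d + 2, d))
lr = 0.2
for _ in range(2000):
    t = rng.uniform(size=(256, 1))
    xt = (1 - t) * x0 + t * x1   # straight-line probability path
    target = x1 - x0             # conditional velocity along that path
    feats = np.concatenate([xt, t, np.ones_like(t)], axis=1)
    grad = feats.T @ (feats @ theta - target) / len(xt)  # MSE gradient
    theta -= lr * grad

# Euler-integrate the learned field to transport image -> text embeddings
x = x0.copy()
for k in range(10):
    t = np.full((len(x), 1), k / 10)
    x += 0.1 * np.concatenate([x, t, np.ones_like(t)], axis=1) @ theta
print(np.abs(x - x1).mean() < 0.1)  # → True
```

The training loop regresses the model onto the conditional velocity (here constant, since the path is linear); integrating the learned field then transports source embeddings toward their targets, which is the sense in which such a prior can regularize few-shot adaptation.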

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
