"vision-language models" Papers
125 papers found • Page 1 of 3
𝕏-Sample Contrastive Loss: Improving Contrastive Learning with Sample Similarity Graphs
Vlad Sobal, Mark Ibrahim, Randall Balestriero et al.
Aligning by Misaligning: Boundary-aware Curriculum Learning for Multimodal Alignment
Hua Ye, Hang Ding, Siyuan Chen et al.
Aligning Visual Contrastive Learning Models via Preference Optimization
Amirabbas Afzali, Borna Khodabandeh, Ali Rasekh et al.
AmorLIP: Efficient Language-Image Pretraining via Amortization
Haotian Sun, Yitong Li, Yuchen Zhuang et al.
Attention! Your Vision Language Model Could Be Maliciously Manipulated
Xiaosen Wang, Shaokang Wang, Zhijin Ge et al.
A Visual Leap in CLIP Compositionality Reasoning through Generation of Counterfactual Sets
Zexi Jia, Chuanwei Huang, Yeshuang Zhu et al.
Beyond Words: Augmenting Discriminative Richness via Diffusions in Unsupervised Prompt Learning
Hairui Ren, Fan Tang, He Zhao et al.
CogCoM: A Visual Language Model with Chain-of-Manipulations Reasoning
Ji Qi, Ming Ding, Weihan Wang et al.
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
Di Zhang, Jingdi Lei, Junxian Li et al.
Cross-Modal and Uncertainty-Aware Agglomeration for Open-Vocabulary 3D Scene Understanding
Jinlong Li, Cristiano Saltori, Fabio Poiesi et al.
CrypticBio: A Large Multimodal Dataset for Visually Confusing Species
Georgiana Manolache, Gerard Schouten, Joaquin Vanschoren
DAMO: Decoding by Accumulating Activations Momentum for Mitigating Hallucinations in Vision-Language Models
Kaishen Wang, Hengrui Gu, Meijun Gao et al.
Disentanglement Beyond Static vs. Dynamic: A Benchmark and Evaluation Framework for Multi-Factor Sequential Representations
Tal Barami, Nimrod Berman, Ilan Naiman et al.
DualCnst: Enhancing Zero-Shot Out-of-Distribution Detection via Text-Image Consistency in Vision-Language Models
Fayi Le, Wenwu He, Chentao Cao et al.
EA3D: Online Open-World 3D Object Extraction from Streaming Videos
Xiaoyu Zhou, Jingqi Wang, Yuang Jia et al.
Few-Shot Image Quality Assessment via Adaptation of Vision-Language Models
Xudong Li, Zihao Huang, Yan Zhang et al.
GaussianProperty: Integrating Physical Properties to 3D Gaussians with LMMs
Xinli Xu, Wenhang Ge, Dicong Qiu et al.
Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling
Tsung-Han (Patrick) Wu, Heekyung Lee, Jiaxin Ge et al.
GenIR: Generative Visual Feedback for Mental Image Retrieval
Diji Yang, Minghao Liu, Chung-Hsiang Lo et al.
GeoRanker: Distance-Aware Ranking for Worldwide Image Geolocalization
Pengyue Jia, Seongheon Park, Song Gao et al.
Glance2Gaze: Efficient Vision-Language Models from Glance Fusion to Gaze Compression
Juan Chen, Honglin Liu, Yingying Ao et al.
Grounding 3D Object Affordance with Language Instructions, Visual Observations and Interactions
He Zhu, Quyu Kong, Kechun Xu et al.
Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs
Hao Fang, Changle Zhou, Jiawei Kong et al.
GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training
Tong Wei, Yijun Yang, Junliang Xing et al.
GTR-Loc: Geospatial Text Regularization Assisted Outdoor LiDAR Localization
Shangshu Yu, Wen Li, Xiaotian Sun et al.
INTER: Mitigating Hallucination in Large Vision-Language Models by Interaction Guidance Sampling
Xin Dong, Shichao Dong, Jin Wang et al.
Locality-Aware Zero-Shot Human-Object Interaction Detection
Sanghyun Kim, Deunsol Jung, Minsu Cho
LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization
Zhenpeng Huang, Jiaqi Li, Zihan Jia et al.
MiCo: Multi-image Contrast for Reinforcement Visual Reasoning
Xi Chen, Mingkang Zhu, Shaoteng Liu et al.
MIP against Agent: Malicious Image Patches Hijacking Multimodal OS Agents
Lukas Aichberger, Alasdair Paren, Guohao Li et al.
MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models
Wenbo Hu, Jia-Chen Gu, Zi-Yi Dou et al.
One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
Zheyu Zhang, Ziqi Pang, Shixing Chen et al.
ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation
Haoyu Fu, Diankun Zhang, Zongchuang Zhao et al.
Paint by Inpaint: Learning to Add Image Objects by Removing Them First
Navve Wasserman, Noam Rotstein, Roy Ganz et al.
RA-TTA: Retrieval-Augmented Test-Time Adaptation for Vision-Language Models
Youngjun Lee, Doyoung Kim, Junhyeok Kang et al.
Realistic Test-Time Adaptation of Vision-Language Models
Maxime Zanella, Clément Fuchs, Christophe De Vleeschouwer et al.
RobotSmith: Generative Robotic Tool Design for Acquisition of Complex Manipulation Skills
Chunru Lin, Haotian Yuan, Yian Wang et al.
RSCC: A Large-Scale Remote Sensing Change Caption Dataset for Disaster Events
Zhenyuan Chen, Chenxi Wang, Ningyu Zhang et al.
SaFiRe: Saccade-Fixation Reiteration with Mamba for Referring Image Segmentation
Zhenjie Mao, Yuhuan Yang, Chaofan Ma et al.
SCAN: Bootstrapping Contrastive Pre-training for Data Efficiency
Yangyang Guo, Mohan Kankanhalli
Semantic Temporal Abstraction via Vision-Language Model Guidance for Efficient Reinforcement Learning
Tian-Shuo Liu, Xu-Hui Liu, Ruifeng Chen et al.
Stepping Out of Similar Semantic Space for Open-Vocabulary Segmentation
Yong Liu, Song-Li Wu, Sule Bai et al.
TaiwanVQA: Benchmarking and Enhancing Cultural Understanding in Vision-Language Models
Hsin Yi Hsieh, Shang-Wei Liu, Chang-Chih Meng et al.
Teaching Human Behavior Improves Content Understanding Abilities Of VLMs
Somesh Singh, Harini S I, Yaman Singla et al.
Temporal Chain of Thought: Long-Video Understanding by Thinking in Frames
Anurag Arnab, Ahmet Iscen, Mathilde Caron et al.
Text to Sketch Generation with Multi-Styles
Tengjie Li, Shikui Tu, Lei Xu
Towards Cross-modal Backward-compatible Representation Learning for Vision-Language Models
Young Kyun Jang, Ser-Nam Lim
Tri-MARF: A Tri-Modal Multi-Agent Responsive Framework for Comprehensive 3D Object Annotation
Jusheng Zhang, Yijia Fan, Zimo Wen et al.
UPRE: Zero-Shot Domain Adaptation for Object Detection via Unified Prompt and Representation Enhancement
Xiao Zhang, Fei Wei, Yong Wang et al.
Vision-centric Token Compression in Large Language Model
Ling Xing, Alex Jinpeng Wang, Rui Yan et al.