"multimodal video understanding" Papers
4 papers found
ConViS-Bench: Estimating Video Similarity Through Semantic Concepts
Benedetta Liberatori, Alessandro Conti, Lorenzo Vaquero et al.
NeurIPS 2025, poster, arXiv:2509.19245
1 citation
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
Xuehai He, Weixi Feng, Kaizhi Zheng et al.
ICLR 2025, poster, arXiv:2406.08407
34 citations
PreFM: Online Audio-Visual Event Parsing via Predictive Future Modeling
Xiao Yu, Yan Fang, Yao Zhao et al.
NeurIPS 2025, oral, arXiv:2505.23155
1 citation
Exploiting Auxiliary Caption for Video Grounding
Hongxiang Li, Meng Cao, Xuxin Cheng et al.
AAAI 2024, paper, arXiv:2301.05997
14 citations