Yongshuo Zong

Papers: 4

Total Citations: 3

Papers (4)

Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels

What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models

Fool Your (Vision and) Language Model with Embarrassingly Simple Permutations

Safety Fine-Tuning at (Almost) No Cost: A Baseline for Vision Large Language Models