Jianwei Yang
35
Papers
2,038
Total Citations
Papers (35)
Hierarchical Question-Image Co-Attention for Visual Question Answering
NeurIPS 2016 (arXiv)
1,702
citations
LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models
ECCV 2024 (arXiv)
114
citations
Matryoshka Multimodal Models
ICLR 2025 (arXiv)
58
citations
Visual In-Context Prompting
CVPR 2024
52
citations
VCoder: Versatile Vision Encoders for Multimodal Large Language Models
CVPR 2024
48
citations
ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
ICML 2025
44
citations
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
CVPR 2025
15
citations
Pix2Gif: Motion-Guided Diffusion for GIF Generation
ECCV 2024 (arXiv)
5
citations
Unified Contrastive Learning in Image-Text-Label Space
CVPR 2022 (arXiv)
0
citations
Learning Customized Visual Models With Retrieval-Augmented Knowledge
CVPR 2023 (arXiv)
0
citations
GLIGEN: Open-Set Grounded Text-to-Image Generation
CVPR 2023 (arXiv)
0
citations
Generalized Decoding for Pixel, Image, and Language
CVPR 2023 (arXiv)
0
citations
Embodied Amodal Recognition: Learning to Move to Perceive Objects
ICCV 2019
0
citations
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding
ICCV 2021 (arXiv)
0
citations
TACo: Token-Aware Cascade Contrastive Learning for Video-Text Alignment
ICCV 2021 (arXiv)
0
citations
Dynamic DETR: End-to-End Object Detection With Dynamic Attention
ICCV 2021
0
citations
Learning To Generate Scene Graph From Natural Language Supervision
ICCV 2021 (arXiv)
0
citations
A Simple Framework for Open-Vocabulary Segmentation and Detection
ICCV 2023 (arXiv)
0
citations
Best of Both Worlds: Transferring Knowledge from Discriminative Learning to a Generative Visual Dialog Model
NeurIPS 2017 (arXiv)
0
citations
Magma: A Foundation Model for Multimodal AI Agents
CVPR 2025
0
citations
Is Your World Simulator a Good Story Presenter? A Consecutive Events-Based Benchmark for Future Long Video Generation
CVPR 2025
0
citations
SITE: towards Spatial Intelligence Thorough Evaluation
ICCV 2025
0
citations
Joint Unsupervised Learning of Deep Representations and Image Clusters
CVPR 2016
0
citations
Neural Baby Talk
CVPR 2018 (arXiv)
0
citations
VinVL: Revisiting Visual Representations in Vision-Language Models
CVPR 2021 (arXiv)
0
citations
Grounded Language-Image Pre-Training
CVPR 2022 (arXiv)
0
citations
RegionCLIP: Region-Based Language-Image Pretraining
CVPR 2022 (arXiv)
0
citations
Cross-channel Communication Networks
NeurIPS 2019
0
citations
Focal Attention for Long-Range Interactions in Vision Transformers
NeurIPS 2021
0
citations
Focal Modulation Networks
NeurIPS 2022
0
citations
ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models
NeurIPS 2022
0
citations
K-LITE: Learning Transferable Visual Models with External Knowledge
NeurIPS 2022
0
citations
Segment Everything Everywhere All at Once
NeurIPS 2023
0
citations
LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day
NeurIPS 2023
0
citations
Learning from Rich Semantics and Coarse Locations for Long-tailed Object Detection
NeurIPS 2023
0
citations