Jianwei Yang

35 Papers
2,038 Total Citations

Papers (35)

Hierarchical Question-Image Co-Attention for Visual Question Answering

NeurIPS 2016 (arXiv)
1,702
citations

LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models

ECCV 2024 (arXiv)
114
citations

Matryoshka Multimodal Models

ICLR 2025 (arXiv)
58
citations

Visual In-Context Prompting

CVPR 2024
52
citations

VCoder: Versatile Vision Encoders for Multimodal Large Language Models

CVPR 2024
48
citations

ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding

ICML 2025
44
citations

Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion

CVPR 2025
15
citations

Pix2Gif: Motion-Guided Diffusion for GIF Generation

ECCV 2024 (arXiv)
5
citations

Unified Contrastive Learning in Image-Text-Label Space

CVPR 2022 (arXiv)
0
citations

Learning Customized Visual Models With Retrieval-Augmented Knowledge

CVPR 2023 (arXiv)
0
citations

GLIGEN: Open-Set Grounded Text-to-Image Generation

CVPR 2023 (arXiv)
0
citations

Generalized Decoding for Pixel, Image, and Language

CVPR 2023 (arXiv)
0
citations

Embodied Amodal Recognition: Learning to Move to Perceive Objects

ICCV 2019
0
citations

Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding

ICCV 2021 (arXiv)
0
citations

TACo: Token-Aware Cascade Contrastive Learning for Video-Text Alignment

ICCV 2021 (arXiv)
0
citations

Dynamic DETR: End-to-End Object Detection With Dynamic Attention

ICCV 2021
0
citations

Learning To Generate Scene Graph From Natural Language Supervision

ICCV 2021 (arXiv)
0
citations

A Simple Framework for Open-Vocabulary Segmentation and Detection

ICCV 2023 (arXiv)
0
citations

Best of Both Worlds: Transferring Knowledge from Discriminative Learning to a Generative Visual Dialog Model

NeurIPS 2017 (arXiv)
0
citations

Magma: A Foundation Model for Multimodal AI Agents

CVPR 2025
0
citations

Is Your World Simulator a Good Story Presenter? A Consecutive Events-Based Benchmark for Future Long Video Generation

CVPR 2025
0
citations

SITE: towards Spatial Intelligence Thorough Evaluation

ICCV 2025
0
citations

Joint Unsupervised Learning of Deep Representations and Image Clusters

CVPR 2016
0
citations

Neural Baby Talk

CVPR 2018 (arXiv)
0
citations

VinVL: Revisiting Visual Representations in Vision-Language Models

CVPR 2021 (arXiv)
0
citations

Grounded Language-Image Pre-Training

CVPR 2022 (arXiv)
0
citations

RegionCLIP: Region-Based Language-Image Pretraining

CVPR 2022 (arXiv)
0
citations

Cross-channel Communication Networks

NeurIPS 2019
0
citations

Focal Attention for Long-Range Interactions in Vision Transformers

NeurIPS 2021
0
citations

Focal Modulation Networks

NeurIPS 2022
0
citations

ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models

NeurIPS 2022
0
citations

K-LITE: Learning Transferable Visual Models with External Knowledge

NeurIPS 2022
0
citations

Segment Everything Everywhere All at Once

NeurIPS 2023
0
citations

LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day

NeurIPS 2023
0
citations

Learning from Rich Semantics and Coarse Locations for Long-tailed Object Detection

NeurIPS 2023
0
citations