Jianfeng Gao

54 Papers

1,831 Total Citations

Papers (54)

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs

Is Self-Repair a Silver Bullet for Code Generation?

Visual In-Context Prompting

GUI-World: A Video Benchmark and Dataset for Multimodal GUI-oriented Understanding

DataGen: Unified Synthetic Dataset Generation via Large Language Models

Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion

Vector-ICL: In-context Learning with Continuous Vector Representations

Generative Adapter: Contextualizing Language Models in Parameters with A Single Forward Pass

Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation

Tactical Rewind: Self-Correction via Backtracking in Vision-And-Language Navigation

Object-Driven Text-To-Image Synthesis via Adversarial Training

Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-Training

VinVL: Revisiting Visual Representations in Vision-Language Models

Grounded Language-Image Pre-Training

RegionCLIP: Region-Based Language-Image Pretraining

WebQA: Multihop and Multimodal QA

Unified Contrastive Learning in Image-Text-Label Space

Learning Customized Visual Models With Retrieval-Augmented Knowledge

GLIGEN: Open-Set Grounded Text-to-Image Generation

Generalized Decoding for Pixel, Image, and Language

Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding

TACo: Token-Aware Cascade Contrastive Learning for Video-Text Alignment

Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks

End-to-end Learning of LDA by Mirror-Descent Back Propagation over a Deep Architecture

From Captions to Visual Concepts and Back

SITE: towards Spatial Intelligence Thorough Evaluation

Position: TrustLLM: Trustworthiness in Large Language Models

Magma: A Foundation Model for Multimodal AI Agents

Stacked Attention Networks for Image Question Answering

StyleNet: Generating Attractive Visual Captions With Styles

Semantic Compositional Networks for Visual Captioning

Language-Based Image Editing With Recurrent Attentive Models

StoryGAN: A Sequential Conditional GAN for Story Visualization

Generating Informative and Diverse Conversational Responses via Adversarial Information Maximization

Navigating with Graph Representations for Fast and Scalable Decoding of Neural Language Models

M-Walk: Learning to Walk over Graphs using Monte Carlo Tree Search

Unified Language Model Pre-training for Natural Language Understanding and Generation

Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer

Focal Attention for Long-Range Interactions in Vision Transformers

Focal Modulation Networks

ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models

Fault-Aware Neural Code Rankers

K-LITE: Learning Transferable Visual Models with External Knowledge

Few-shot Task-agnostic Neural Architecture Search for Distilling Large Language Models

Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone

GLIPv2: Unifying Localization and Vision-Language Understanding

Localized Symbolic Knowledge Distillation for Visual Commonsense Models

Bridging Discrete and Backpropagation: Straight-Through and Beyond

Segment Everything Everywhere All at Once

LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day

Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models

Guiding Large Language Models via Directional Stimulus Prompting

Augmenting Language Models with Long-Term Memory