Yong Jae Lee

Papers: 45

Total Citations: 180

Papers (45)

ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts

X-Fusion: Introducing New Modality to Frozen Large Language Models

Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs

Removing Distributional Discrepancies in Captions Improves Image-Text Alignment

Edit One for All: Interactive Batch Image Editing

Yo’Chameleon: Personalized Vision and Language Generation

Track and Segment: An Iterative Unsupervised Approach for Video Object Proposals

Track and Transfer: Watching Videos to Simulate Strong Human Supervision for Weakly-Supervised Object Detection

Identifying First-Person Camera Wearers in Third-Person Videos

Weakly-Supervised Visual Grounding of Phrases With Linguistic Structures

Interspecies Knowledge Transfer for Facial Keypoint Detection

Cross-Domain Self-Supervised Multi-Task Feature Learning Using Synthetic Imagery

HPLFlowNet: Hierarchical Permutohedral Lattice FlowNet for Scene Flow Estimation on Large-Scale Point Clouds

FineGAN: Unsupervised Hierarchical Disentanglement for Fine-Grained Object Generation and Discovery

You Reap What You Sow: Using Videos to Generate High Precision Object Proposals for Weakly-Supervised Object Detection

MixNMatch: Multifactor Disentanglement and Encoding for Conditional Image Generation

Don't Judge an Object by Its Context: Learning to Overcome Contextual Bias

Instance-Aware, Context-Focused, and Memory-Efficient Weakly Supervised Object Detection

Progressive Temporal Feature Alignment Network for Video Inpainting

Few-Shot Image Generation via Cross-Domain Correspondence

The Two Dimensions of Worst-Case Training and Their Integrated Effect for Out-of-Domain Generalization

GIRAFFE HD: A High-Resolution 3D-Aware Generative Model

Learning Customized Visual Models With Retrieval-Augmented Knowledge

GLIGEN: Open-Set Grounded Text-to-Image Generation

Generalized Decoding for Pixel, Image, and Language

Towards Universal Fake Image Detectors That Generalize Across Generative Models

Discovering the Spatial Extent of Relative Attributes

Hide-And-Seek: Forcing a Network to Be Meticulous for Weakly-Supervised Object and Action Localization

Identity From Here, Pose From There: Self-Supervised Disentanglement and Generation of Objects Using Unlabeled Videos

YOLACT: Real-Time Instance Segmentation

Collaging Class-Specific GANs for Semantic Image Synthesis

A Sentence Speaks a Thousand Images: Domain Generalization through Distilling CLIP with Language Guidance

Masked Discrimination for Self-Supervised Learning on Point Clouds

Contrastive Learning for Diverse Disentangled Foreground Generation

FlowWeb: Joint Image Set Alignment by Weaving Consistent, Pixel-Wise Correspondences

CuRe: Cultural Gaps in the Long Tail of Text-to-Image Systems

LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models

Customizing Domain Adapters for Domain Generalization

Improved Baselines with Visual Instruction Tuning

Elastic-InfoGAN: Unsupervised Disentangled Representation Learning in Class-Imbalanced Data

ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models

Visual Instruction Inversion: Image Editing via Image Prompting

What Knowledge Gets Distilled in Knowledge Distillation?

Segment Everything Everywhere All at Once

Visual Instruction Tuning