Li Fei-Fei

Papers: 68

Total Citations: 595

Papers (68)

Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

Learning Semantic Relationships for Better Action Retrieval in Images

ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image

Re-thinking Temporal Search for Long-Form Video Understanding

BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation

Repurposing 2D Diffusion Models with Gaussian Atlas for 3D Generation

Image Retrieval Using Scene Graphs

Fine-Grained Recognition Without Part Annotations

Social LSTM: Human Trajectory Prediction in Crowded Spaces

Recurrent Attention Models for Depth-Based Person Identification

End-To-End Learning of Action Detection From Frame Glimpses in Videos

Detecting Events and Key Actors in Multi-Person Videos

DenseCap: Fully Convolutional Localization Networks for Dense Captioning

Visual7W: Grounded Question Answering in Images

A Hierarchical Approach for Generating Descriptive Image Paragraphs

Knowledge Acquisition for Visual Question Answering via Iterative Querying

Jointly Learning Energy Expenditures and Activities Using Egocentric Multimodal Signals

Unsupervised Visual-Linguistic Reference Resolution in Instructional Videos

Unsupervised Learning of Long-Term Motion Dynamics for Videos

CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning

Learning to Learn From Noisy Web Videos

Scene Graph Generation by Iterative Message Passing

Image Generation From Scene Graphs

Social GAN: Socially Acceptable Trajectories With Generative Adversarial Networks

Finding "It": Weakly-Supervised Reference-Aware Visual Grounding in Instructional Videos

Referring Relationships

Iterative Visual Reasoning Beyond Convolutions

What Makes a Video a Video: Analyzing Temporal Information in Video Understanding Models and Datasets

Thoracic Disease Identification and Localization With Limited Supervision

Auto-DeepLab: Hierarchical Neural Architecture Search for Semantic Image Segmentation

Scene Memory Transformer for Embodied Agents in Long-Horizon Tasks

Information Maximizing Visual Question Generation

DenseFusion: 6D Object Pose Estimation by Iterative Dense Fusion

D3TW: Discriminative Differentiable Dynamic Time Warping for Weakly Supervised Action Alignment and Segmentation

Peeking Into the Future: Predicting Future Person Activities and Locations in Videos

Composing Text and Image for Image Retrieval - an Empirical Odyssey

Neural Task Graphs: Generalizing to Unseen Tasks From a Single Video Demonstration

Action Genome: Actions As Compositions of Spatio-Temporal Scene Graphs

Greedy Hierarchical Variational Autoencoders for Large-Scale Video Prediction

Metadata Normalization

Scalable Differential Privacy With Sparse Network Finetuning

ObjectFolder 2.0: A Multisensory Object Dataset for Sim2Real Transfer

Rethinking Architecture Design for Tackling Data Heterogeneity in Federated Learning

Revisiting the "Video" in Video-Language Understanding

The ObjectFolder Benchmark: Multisensory Learning With Neural and Real Objects

The Language of Motion: Unifying Verbal and Non-verbal Language of 3D Human Motion

RGB-W: When Vision Meets Wireless

Learning Temporal Embeddings for Complex Video Analysis

Love Thy Neighbors: Image Annotation by Exploiting Image Metadata

Visual Semantic Planning Using Deep Successor Representations

Dense-Captioning Events in Videos

Fine-Grained Recognition in the Wild: A Multi-Task Domain Adaptation Approach

Inferring and Executing Programs for Visual Reasoning

Characterizing and Improving Stability in Neural Style Transfer

Scene Graph Prediction With Limited Labels

Situational Fusion of Visual Representation for Visual Navigation

Rendering Humans from Object-Occluded Monocular Videos

Procedure Planning in Instructional Videos

RubiksNet: Learnable 3D-Shift for Efficient Video Action Recognition

PrivHAR: Recognizing Human Actions from Privacy-Preserving Lens

Improving Image Classification With Location Context

Flow to the Mode: Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization

WorldScore: Unified Evaluation Benchmark for World Generation

Chain of Code: Reasoning with a Language Model-Augmented Code Emulator

Best of Both Worlds: Human-Machine Collaboration for Object Annotation

Deep Visual-Semantic Alignments for Generating Image Descriptions

MentorNet: Learning Data-Driven Curriculum for Very Deep Neural Networks on Corrupted Labels

Distributed Asynchronous Optimization with Unbounded Delays: How Slow Can You Go?