Juan Carlos Niebles

47 Papers

3,191 Total Citations

2 Affiliations

Affiliations

Salesforce

Stanford University

Papers (47)

ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding

ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding

Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization

End-to-End Joint Semantic Segmentation of Actors and Actions in Video

Re-thinking Temporal Search for Long-Form Video Understanding

UniEgoMotion: A Unified Model for Egocentric Motion Reconstruction, Forecasting, and Generation

Taming generative video models for zero-shot optical flow extraction

ViUniT: Visual Unit Tests for More Robust Visual Programming

Finding "It": Weakly-Supervised Reference-Aware Visual Grounding in Instructional Videos

What Makes a Video a Video: Analyzing Temporal Information in Video Understanding Models and Datasets

D3TW: Discriminative Differentiable Dynamic Time Warping for Weakly Supervised Action Alignment and Segmentation

Peeking Into the Future: Predicting Future Person Activities and Locations in Videos

Neural Task Graphs: Generalizing to Unseen Tasks From a Single Video Demonstration

Action Genome: Actions As Compositions of Spatio-Temporal Scene Graphs

Spatio-Temporal Graph for Video Captioning With Knowledge Distillation

Few-Shot Video Classification via Temporal Alignment

Metadata Normalization

Home Action Genome: Cooperative Compositional Action Understanding

Align and Prompt: Video-and-Language Pre-Training With Entity Prompts

Revisiting the "Video" in Video-Language Understanding

ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding

Procedure-Aware Pretraining for Instructional Video Understanding

Mask-Free OVIS: Open-Vocabulary Instance Segmentation Without Manual Mask Annotations

Dense-Captioning Events in Videos

Visual Forecasting by Imitating Dynamics in Natural Sequences

Learning Temporal Action Proposals With Fewer Labels

Imitation Learning for Human Pose Prediction

TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild

Detecting Human-Object Relationships in Videos

Learning Privacy-Preserving Optics for Human Pose Estimation

Procedure Planning in Instructional Videos

RubiksNet: Learnable 3D-Shift for Efficient Video Action Recognition

PrivHAR: Recognizing Human Actions from Privacy-Preserving Lens

Open Vocabulary Object Detection with Pseudo Bounding-Box Labels

Deformer: Dynamic Fusion Transformer for Robust Hand Pose Estimation

On the Relationship Between Visual Attributes and Convolutional Networks

Robust Manhattan Frame Estimation From a Single RGB-D Image

Fast Temporal Activity Proposals for Efficient Detection of Human Actions in Untrimmed Videos

A Hierarchical Pose-Based Approach to Complex Action Understanding Using Dictionaries of Actionlets and Motion Poselets

Unsupervised Visual-Linguistic Reference Resolution in Instructional Videos

Agent-Centric Risk Assessment: Accident Anticipation and Risky Region Localization

SST: Single-Stream Temporal Action Proposals

Learning to Decompose and Disentangle Representations for Video Prediction

MOMA: Multi-Object Multi-Actor Activity Parsing

MOMA-LRG: Language-Refined Graphs for Multi-Object Multi-Actor Activity Parsing

Temporally Disentangled Representation Learning under Unknown Nonstationarity

UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild