Jitendra Malik

Papers: 94

Total Citations: 3,257

Papers (94)

Hypercolumns for Object Segmentation and Fine-Grained Localization

Learning to Poke by Poking: Experiential Learning of Intuitive Physics

NeurIPS 2016

Learning a Multi-View Stereo Machine

NeurIPS 2017

Sequential Modeling Enables Scalable Learning for Large Vision Models

Perceiving 3D Human-Object Spatial Arrangements from a Single Image in the Wild

Estimating Body and Hand Motion in an Ego-sensed World

An Empirical Study of Autoregressive Pre-training from Videos

Scaling Properties of Diffusion Models For Perceptual Tasks

Reconstructing People, Places, and Cameras

Depth From Shading, Defocus, and Correspondence Using Light-Field Angular Coherence

Category-Specific Object Reconstruction From a Single Image

Virtual View Networks for Object Reconstruction

Learning to Segment Moving Objects in Videos

Aligning 3D Models to RGB-D Images of Cluttered Scenes

Cross Modal Distillation for Supervision Transfer

Iterative Instance Segmentation

Human Pose Estimation With Iterative Error Feedback

Feedback Networks

Cognitive Mapping and Planning for Visual Navigation

Multi-View Supervision for Single-View Reconstruction via Differentiable Ray Consistency

Learning Shape Abstractions by Assembling Volumetric Primitives

Factoring Shape, Pose, and Layout From the 2D Image of a 3D Scene

Multi-View Consistency as Supervisory Signal for Learning Shape and Pose Prediction

Taskonomy: Disentangling Task Transfer Learning

From Lifestyle Vlogs to Everyday Interactions

AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions

End-to-End Recovery of Human Shape and Pose

Gibson Env: Real-World Perception for Embodied Agents

Learning Individual Styles of Conversational Gesture

Learning Independent Object Motion From Unlabelled Stereoscopic Videos

Learning 3D Human Dynamics From Video

Non-Adversarial Image Synthesis With Generative Latent Nearest Neighbors

Robust Learning Through Cross-Task Consistency

Open-World Instance Segmentation: Exploiting Pseudo Ground Truth From Learned Pairwise Affinity

ABO: Dataset and Benchmarks for Real-World 3D Object Understanding

Human Mesh Recovery From Multiple Shots

Tracking People by Predicting 3D Appearance, Location and Pose

MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition

Reversible Vision Transformers

PONI: Potential Functions for ObjectGoal Navigation With Interaction-Free Learning

Coupling Vision and Proprioception for Navigation of Legged Robots

MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

Differentiable Stereopsis: Meshes From Multiple Views Using Differentiable Rendering

Ego4D: Around the World in 3,000 Hours of Egocentric Video

On the Benefits of 3D Pose and Tracking for Human Action Recognition

Decoupling Human and Camera Motion From Videos in the Wild

Multiview Compressive Coding for 3D Reconstruction

Learning to See by Moving

Pose Induction for Novel Object Categories

Amodal Completion and Size Constancy in Natural Scenes

Contextual Action Recognition With R*CNN

Actions and Attributes From Wholes and Parts

DeepBox: Learning Objectness With Convolutional Networks

What Will Happen Next? Forecasting Player Moves in Sports Videos

Diverse Image Synthesis From Semantic Layouts via Conditional IMLE

3D Scene Graph: A Structure for Unified Semantics, 3D Space, and Camera

SlowFast Networks for Video Recognition

Predicting 3D Human Dynamics From Video

ShapeMask: Learning to Segment Novel Objects by Refining Shape Priors

Habitat: A Platform for Embodied AI Research

Mesh R-CNN

From Goals, Waypoints & Paths to Long Term Human Trajectory Forecasting

Multiscale Vision Transformers

Omnidata: A Scalable Pipeline for Making Multi-Task Mid-Level Vision Datasets From 3D Scans

Reconstructing Hand-Object Interactions in the Wild

Humans in 4D: Reconstructing and Tracking Humans with Transformers

Navigating to Objects Specified by Images

Long-term Human Motion Prediction with Scene Context

It is not the Journey but the Destination: Endpoint Conditioned Trajectory Prediction

Side-Tuning: A Baseline for Network Adaptation via Additive Side Networks

Shape and Viewpoint without Keypoints

Recurrent Network Models for Human Dynamics

Poly-Autoregressive Prediction for Modeling Interactions

Dr2Net: Dynamic Reversible Dual-Residual Networks for Memory-Efficient Finetuning

Reconstructing Hands in 3D with Transformers

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

xT: Nested Tokenization for Larger Context in Large Images

Deformable Part Models are Convolutional Neural Networks

Finding Action Tubes

Viewpoints and Keypoints

Visual Memory for Robust Path Following

Approximate Feature Collisions in Neural Nets

3D Shape Reconstruction from Vision and Touch

Habitat 2.0: Training Home Assistants to Rearrange their Habitat

SEAL: Self-supervised Embodied Active Learning using Exploration and 3D Consistency

Active 3D Shape Reconstruction from Vision and Touch

Tracking People with 3D Representations

Squeezeformer: An Efficient Transformer for Automatic Speech Recognition

Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?

MAViL: Masked Audio-Video Learners

Speculative Decoding with Big Little Decoder

EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding

Fast k-Nearest Neighbour Search via Dynamic Continuous Indexing

Fast k-Nearest Neighbour Search via Prioritized DCI