Lorenzo Torresani

Papers: 41

Total Citations: 150

Papers (41)

Video ReCap: Recursive Captioning of Hour-Long Videos

Learning to Inpaint for Image Compression

NeurIPS 2017

Step Differences in Instructional Video

Learning to Segment Referred Objects from Narrated Egocentric Videos

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

DeepEdge: A Multi-Scale Bifurcated Deep Network for Top-Down Contour Detection

Semantic Segmentation With Boundary Neural Fields

Convolutional Random Walk Networks for Semantic Image Segmentation

BIMBA: Selective-Scan Compression for Long-Range Video Question Answering

A Closer Look at Spatiotemporal Convolutions for Action Recognition

What Makes a Video a Video: Analyzing Temporal Information in Video Understanding Models and Datasets

Video Modeling With Correlation Networks

Classifying, Segmenting, and Tracking Object Instances in Video with Mask Propagation

Listen to Look: Action Recognition by Previewing Audio

Beyond Short Clips: End-to-End Video-Level Learning With Collaborative Memories

Vx2Text: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs

Long-Short Temporal Contrastive Learning of Video Transformers

Learning To Recognize Procedural Activities With Distant Supervision

Deformable Video Transformer

Ego4D: Around the World in 3,000 Hours of Egocentric Video

HierVL: Learning Hierarchical Video-Language Embeddings

Relational Space-Time Query in Long-Form Videos

Egocentric Video Task Translation

High-for-Low and Low-for-High: Efficient Boundary Detection From Deep Object Features and its Applications to High-Level Vision

Learning Spatiotemporal Features With 3D Convolutional Networks

DistInit: Learning Video Representations Without a Single Labeled Video

Video Classification With Channel-Separated Convolutional Networks

SCSampler: Sampling Salient Clips From Video for Efficient Action Recognition

HACS: Human Action Clips and Segments Dataset for Recognition and Temporal Localization

Learning to Ground Instructional Articles in Videos through Narrations

Ego-Only: Egocentric Action Detection without Exocentric Transferring

Detect-and-Track: Efficient Pose Estimation in Videos

VITED: Video Temporal Evidence Distillation

Enrich and Detect: Video Temporal Grounding with Multimodal LLMs

Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization

Learning Temporal Pose Estimation from Sparsely-Labeled Videos

STAR-Caps: Capsule Networks with Straight-Through Attentive Routing

Self-Supervised Learning by Cross-Modal Audio-Video Clustering

COBE: Contextualized Object Embeddings from Narrated Instructional Video

Ego4D Goal-Step: Toward Hierarchical Understanding of Procedural Activities

HT-Step: Aligning Instructional Articles with How-To Videos