Arsha Nagrani

28 Papers
327 Total Citations

Papers (28)

On Scaling Up a Multilingual Vision and Language Model

CVPR 2024
254
citations

VicTR: Video-conditioned Text Representations for Activity Recognition

CVPR 2024
36
citations

AutoAD III: The Prequel – Back to the Pixels

CVPR 2024
33
citations

Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation

ICCV 2025
3
citations

Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks

CVPR 2025
1
citations

Seeing Voices and Hearing Faces: Cross-Modal Biometric Matching

CVPR 2018 (arXiv)
0
citations

Speech2Action: Cross-Modal Supervision for Action Recognition

CVPR 2020 (arXiv)
0
citations

Localizing Visual Sounds the Hard Way

CVPR 2021 (arXiv)
0
citations

Look Before You Speak: Visually Contextualized Utterances

CVPR 2021 (arXiv)
0
citations

End-to-End Generative Pretraining for Multimodal Video Captioning

CVPR 2022 (arXiv)
0
citations

AVFormer: Injecting Vision Into Frozen Speech Models for Zero-Shot AV-ASR

CVPR 2023 (arXiv)
0
citations

AutoAD: Movie Description in Context

CVPR 2023 (arXiv)
0
citations

EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition

ICCV 2019
0
citations

Composable Augmentation Encoding for Video Representation Learning

ICCV 2021 (arXiv)
0
citations

Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval

ICCV 2021 (arXiv)
0
citations

AutoAD II: The Sequel - Who, When, and What in Movie Audio Description

ICCV 2023
0
citations

Verbs in Action: Improving Verb Understanding in Video-Language Models

ICCV 2023 (arXiv)
0
citations

UnLoc: A Unified Framework for Video Localization Tasks

ICCV 2023 (arXiv)
0
citations

Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos

ECCV 2020
0
citations

Learning Audio-Video Modalities from Image Captions

ECCV 2022
0
citations

TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency

ECCV 2022
0
citations

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

CVPR 2023 (arXiv)
0
citations

Flexible Frame Selection for Efficient Video Reasoning

CVPR 2025
0
citations

MINERVA: Evaluating Complex Video Reasoning

ICCV 2025
0
citations

Streaming Dense Video Captioning

CVPR 2024
0
citations

MoReVQA: Exploring Modular Reasoning Models for Video Question Answering

CVPR 2024
0
citations

Attention Bottlenecks for Multimodal Fusion

NeurIPS 2021
0
citations

VidChapters-7M: Video Chapters at Scale

NeurIPS 2023
0
citations