Arsha Nagrani
28
Papers
327
Total Citations
Papers (28)
On Scaling Up a Multilingual Vision and Language Model
CVPR 2024
254
citations
VicTR: Video-conditioned Text Representations for Activity Recognition
CVPR 2024
36
citations
AutoAD III: The Prequel – Back to the Pixels
CVPR 2024
33
citations
Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation
ICCV 2025
3
citations
Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks
CVPR 2025
1
citations
Seeing Voices and Hearing Faces: Cross-Modal Biometric Matching
CVPR 2018
0
citations
Speech2Action: Cross-Modal Supervision for Action Recognition
CVPR 2020
0
citations
Localizing Visual Sounds the Hard Way
CVPR 2021
0
citations
Look Before You Speak: Visually Contextualized Utterances
CVPR 2021
0
citations
End-to-End Generative Pretraining for Multimodal Video Captioning
CVPR 2022
0
citations
AVFormer: Injecting Vision Into Frozen Speech Models for Zero-Shot AV-ASR
CVPR 2023
0
citations
AutoAD: Movie Description in Context
CVPR 2023
0
citations
EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition
ICCV 2019
0
citations
Composable Augmentation Encoding for Video Representation Learning
ICCV 2021
0
citations
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
ICCV 2021
0
citations
AutoAD II: The Sequel - Who, When, and What in Movie Audio Description
ICCV 2023
0
citations
Verbs in Action: Improving Verb Understanding in Video-Language Models
ICCV 2023
0
citations
UnLoc: A Unified Framework for Video Localization Tasks
ICCV 2023
0
citations
Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos
ECCV 2020
0
citations
Learning Audio-Video Modalities from Image Captions
ECCV 2022
0
citations
TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency
ECCV 2022
0
citations
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
CVPR 2023
0
citations
Flexible Frame Selection for Efficient Video Reasoning
CVPR 2025
0
citations
MINERVA: Evaluating Complex Video Reasoning
ICCV 2025
0
citations
Streaming Dense Video Captioning
CVPR 2024
0
citations
MoReVQA: Exploring Modular Reasoning Models for Video Question Answering
CVPR 2024
0
citations
Attention Bottlenecks for Multimodal Fusion
NeurIPS 2021
0
citations
VidChapters-7M: Video Chapters at Scale
NeurIPS 2023
0
citations