David Harwath

Papers: 10

Total Citations: 314

Papers (10)

Unsupervised Learning of Spoken Language with Visual Context

SyllableLM: Learning Coarse Semantic Units for Speech Language Models

Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos

SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos

VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models

Learning Words by Drawing Images

Spoken Moments: Learning Joint Audio-Visual Representations From Video Descriptions

Everything at Once - Multi-Modal Fusion Transformer for Video Retrieval

Multimodal Clustering Networks for Self-Supervised Learning From Unlabeled Videos

BAT: Learning to Reason about Spatial Sounds with Large Language Models