Hilde Kuehne

Papers: 31

Total Citations: 103

Papers (31)

Grounding Everything: Emerging Localization Properties in Vision-Language Transformers

LeGrad: An Explainability Method for Vision Transformers via Feature Formation Sensitivity

Teaching VLMs to Localize Specific Objects from In-context Examples

CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment

Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks

NeuralNetwork-Viterbi: A Framework for Weakly Supervised Video Learning

Unsupervised Learning of Action Classes With Continuous Temporal Embedding

Found a Reason for me? Weakly-supervised Grounded Visual Question Answering using Capsules

Everything at Once - Multi-Modal Fusion Transformer for Video Retrieval

Unsupervised Domain Generalization by Learning a Bridge Across Domains

Video Test-Time Adaptation for Action Recognition

Detector-Free Weakly Supervised Grounding by Separation

Generalized and Incremental Few-Shot Learning by Explicit Learning and Calibration Without Forgetting

Multimodal Clustering Networks for Self-Supervised Learning From Unlabeled Videos

MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge

Learning by Sorting: Self-supervised Learning with Group Ordering Constraints

Preserving Modality Structure Improves Multi-Modal Learning

In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval

CycDA: Unsupervised Cycle Domain Adaptation to Learn from Image to Video

Weakly Supervised Grounding for VQA in Vision-Language Transformers

Learning Situation Hyper-Graphs for Video Question Answering

VideoGEM: Training-free Action Grounding in Videos

What When and Where? Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions

Weakly Supervised Action Learning With RNN Based Fine-To-Coarse Modeling

Action Sets: Weakly Supervised Action Segmentation Without Ordering Constraints

More Is Less: Learning Efficient Video Representations by Big-Little Network and Depthwise Temporal Aggregation

Learning with Algorithmic Supervision via Continuous Relaxations

Deep Differentiable Logic Gate Networks

How Transferable are Video Representations Based on Synthetic Data?

Learning Human Action Recognition Representations Without Real Humans

What a MESS: Multi-Domain Evaluation of Zero-Shot Semantic Segmentation