Salman Khan

23
Papers
216
Total Citations

Papers (23)

Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery

CVPR 2024
78
citations

Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning

CVPR 2024
34
citations

VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos

CVPR 2025
30
citations

GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks

ICCV 2025
24
citations

Composed Video Retrieval via Enriched Context and Discriminative Embeddings

CVPR 2024
20
citations

O-TPT: Orthogonality Constraints for Calibrating Test-time Prompt Tuning in Vision-Language Models

CVPR 2025
9
citations

GroupMamba: Efficient Group-Based Visual State Space Model

CVPR 2025
6
citations

AURELIA: Test-time Reasoning Distillation in Audio-Visual LLMs

ICCV 2025
6
citations

MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks

NeurIPS 2025
5
citations

TAViS: Text-bridged Audio-Visual Segmentation with Foundation Models

ICCV 2025
2
citations

Beyond Simple Edits: Composed Video Retrieval with Dense Modifications

ICCV 2025
1
citations

Hierarchical Visual Prompt Learning for Continual Video Instance Segmentation

ICCV 2025
1
citations

GLaMM: Pixel Grounding Large Multimodal Model

CVPR 2024
0
citations

All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages

CVPR 2025
0
citations

EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues

CVPR 2025
0
citations

Bidirectional Reciprocative Information Communication for Few-Shot Semantic Segmentation

ICML 2024
0
citations

Interpretable Zero-Shot Learning with Locally-Aligned Vision-Language Model

ICCV 2025
0
citations

LawDIS: Language-Window-based Controllable Dichotomous Image Segmentation

ICCV 2025
0
citations

VQA4CIR: Boosting Composed Image Retrieval with Visual Question Answering

AAAI 2025
0
citations

S3A: Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment

AAAI 2024
0
citations

GeoChat: Grounded Large Vision-Language Model for Remote Sensing

CVPR 2024
0
citations

VideoGrounding-DINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding

CVPR 2024
0
citations

Visual-Augmented Dynamic Semantic Prototype for Generative Zero-Shot Learning

CVPR 2024
0
citations