Salman Khan

Papers: 69

Total Citations: 215

Papers (69)

Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery

Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning

VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos

GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks

Composed Video Retrieval via Enriched Context and Discriminative Embeddings

O-TPT: Orthogonality Constraints for Calibrating Test-time Prompt Tuning in Vision-Language Models

AURELIA: Test-time Reasoning Distillation in Audio-Visual LLMs

GroupMamba: Efficient Group-Based Visual State Space Model

MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks

TAViS: Text-bridged Audio-Visual Segmentation with Foundation Models

Beyond Simple Edits: Composed Video Retrieval with Dense Modifications

GLaMM: Pixel Grounding Large Multimodal Model

Bidirectional Reciprocative Information Communication for Few-Shot Semantic Segmentation

Striking the Right Balance With Uncertainty

Semi-Supervised Learning for Few-Shot Image-to-Image Translation

CycleISP: Real Image Restoration via Improved Data Synthesis

A Self-supervised Approach for Adversarial Robustness

AnimalWeb: A Large-Scale Hierarchical Dataset of Annotated Animal Faces

iTAML: An Incremental Task-Agnostic Meta-learning Approach

Towards Open World Object Detection

Exploring Complementary Strengths of Invariant and Equivariant Representations for Few-Shot Learning

Multi-Stage Progressive Image Restoration

OW-DETR: Open-World Detection Transformer

Burst Image Restoration and Enhancement

Restormer: Efficient Transformer for High-Resolution Image Restoration

Energy-Based Latent Aligner for Incremental Learning

Spatio-Temporal Relation Modeling for Few-Shot Action Recognition

Self-Supervised Video Transformer

PromptCAL: Contrastive Affinity Learning via Auxiliary Prompts for Generalized Novel Category Discovery

Burstormer: Burst Image Restoration and Enhancement Transformer

Discriminative Co-Saliency and Background Mining Transformer for Co-Salient Object Detection

Person Image Synthesis via Denoising Diffusion Model

Bridging Precision and Confidence: A Train-Time Loss for Calibrating Object Detection

MaPLe: Multi-Modal Prompt Learning

Vita-CLIP: Video and Text Adaptive CLIP via Multimodal Prompting

Fine-Tuned CLIP Models Are Efficient Video Learners

Gated Multi-Resolution Transfer Network for Burst Restoration and Enhancement

Adversarial Defense by Restricting the Hidden Space of Deep Neural Networks

Transductive Learning for Zero-Shot Object Detection

Gaussian Affinity for Max-Margin Class Imbalanced Learning

Ground-to-Aerial Image Geo-Localization With a Hard Exemplar Reweighting Triplet Loss

Orthogonal Projection Loss

Discriminative Region-Based Multi-Label Zero-Shot Learning

Handwriting Transformers

On Generating Transferable Targeted Perturbations

Self-regulating Prompts: Foundational Model Adaptation without Forgetting

Towards Instance-adaptive Inference for Federated Learning

Diverse Data Augmentation with Diffusions for Effective Test-time Prompt Tuning

Generative Multiplane Neural Radiance for 3D-Aware Image Generation

SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications

Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition

Multi-grained Temporal Prototype Learning for Few-shot Video Object Segmentation

Fixing Localization Errors to Improve Image Classification

Learning Enriched Features for Real Image Restoration and Enhancement

Class-Agnostic Object Detection with Multi-modal Transformer

DoodleFormer: Creative Sketch Drawing with Transformers

Video Instance Segmentation via Multi-Scale Spatio-Temporal Split Attention Transformer

OpenLDN: Learning to Discover Novel Classes for Open-World Semi-Supervised Learning

Learning Disentanglement with Decoupled Labels for Vision-Language Navigation

All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages

EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues

Interpretable Zero-Shot Learning with Locally-Aligned Vision-Language Model

LawDIS: Language-Window-based Controllable Dichotomous Image Segmentation

Hierarchical Visual Prompt Learning for Continual Video Instance Segmentation

VQA4CIR: Boosting Composed Image Retrieval with Visual Question Answering

S3A: Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment

GeoChat: Grounded Large Vision-Language Model for Remote Sensing

VideoGrounding-DINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding

Visual-Augmented Dynamic Semantic Prototype for Generative Zero-Shot Learning