Yu-Gang Jiang

Papers: 69
Total Citations: 654

Papers (69)

NuScenes-QA: A Multi-Modal Visual Question Answering Benchmark for Autonomous Driving

AAAI 2024 (arXiv)
266
citations

SimDA: Simple Diffusion Adapter for Efficient Video Generation

CVPR 2024
106
citations

Adversarial Prompt Tuning for Vision-Language Models

ECCV 2024
33
citations

CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation

ICCV 2025 (arXiv)
33
citations

OmniViD: A Generative Framework for Universal Video Understanding

CVPR 2024
29
citations

Doubly Abductive Counterfactual Inference for Text-based Image Editing

CVPR 2024
25
citations

AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction

ICCV 2025
24
citations

MotionFollower: Editing Video Motion via Score-Guided Diffusion

ICCV 2025
22
citations

PromptFusion: Decoupling Stability and Plasticity for Continual Learning

ECCV 2024
21
citations

AdaDiff: Adaptive Step Selection for Fast Diffusion Models

AAAI 2025
19
citations

LRANet: Towards Accurate and Efficient Scene Text Detection with Low-Rank Approximation

AAAI 2024 (arXiv)
17
citations

BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks

ICLR 2025
16
citations

Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image

ECCV 2024
12
citations

Out of Length Text Recognition with Sub-String Matching

AAAI 2025
7
citations

DuMo: Dual Encoder Modulation Network for Precise Concept Erasure

AAAI 2025
7
citations

Learning to Rank Patches for Unbiased Image Redundancy Reduction

CVPR 2024
6
citations

REDUCIO! Generating 1K Video within 16 Seconds using Extremely Compressed Motion Latents

ICCV 2025
5
citations

AIM: Additional Image Guided Generation of Transferable Adversarial Attacks

AAAI 2025
3
citations

FaceA-Net: Facial Attribute-Driven ID Preserving Image Generation Network

AAAI 2025
1
citations

Achieving More with Less: Additive Prompt Tuning for Rehearsal-Free Class-Incremental Learning

ICCV 2025
1
citations

From Holistic to Localized: Local Enhanced Adapters for Efficient Visual Instruction Fine-Tuning

ICCV 2025
1
citations

Unlearnable Clusters: Towards Label-Agnostic Unlearnable Examples

CVPR 2023 (arXiv)
0
citations

ResFormer: Scaling ViTs With Multi-Resolution Training

CVPR 2023 (arXiv)
0
citations

SVFormer: Semi-Supervised Video Transformer for Action Recognition

CVPR 2023 (arXiv)
0
citations

Look Before You Match: Instance Understanding Matters in Video Object Segmentation

CVPR 2023 (arXiv)
0
citations

Masked Video Distillation: Rethinking Masked Feature Modeling for Self-Supervised Video Representation Learning

CVPR 2023 (arXiv)
0
citations

Bi-Directional Feature Fusion Generative Adversarial Network for Ultra-High Resolution Pathological Image Virtual Re-Staining

CVPR 2023
0
citations

Enhancing the Self-Universality for Transferable Targeted Attacks

CVPR 2023 (arXiv)
0
citations

Prototypical Residual Networks for Anomaly Detection and Localization

CVPR 2023 (arXiv)
0
citations

MSMDFusion: Fusing LiDAR and Camera at Multiple Scales With Multi-Depth Seeds for 3D Object Detection

CVPR 2023 (arXiv)
0
citations

StyleAdv: Meta Style Adversarial Training for Cross-Domain Few-Shot Learning

CVPR 2023 (arXiv)
0
citations

Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding

CVPR 2023 (arXiv)
0
citations

Multi-Scale Deep Learning Architectures for Person Re-Identification

ICCV 2017 (arXiv)
0
citations

Revisiting Adversarial Robustness Distillation: Robust Soft Labels Make Student Better

ICCV 2021 (arXiv)
0
citations

Motion Guided Region Message Passing for Video Captioning

ICCV 2021
0
citations

VideoLT: Large-Scale Long-Tailed Video Recognition

ICCV 2021 (arXiv)
0
citations

Implicit Temporal Modeling with Learnable Alignment for Video Recognition

ICCV 2023 (arXiv)
0
citations

MRN: Multiplexed Routing Network for Incremental Multilingual Text Recognition

ICCV 2023 (arXiv)
0
citations

Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos

ECCV 2020
0
citations

Hierarchical Visual-Textual Graph for Temporal Activity Localization via Language

ECCV 2020
0
citations

Semi-Supervised Single-View 3D Reconstruction via Prototype Shape Priors

ECCV 2022
0
citations

Semi-Supervised Vision Transformers

ECCV 2022
0
citations

Efficient Video Transformers with Spatial-Temporal Token Selection

ECCV 2022
0
citations

MORE: Multi-Order RElation Mining for Dense Captioning in 3D Scenes

ECCV 2022
0
citations

DSOD: Learning Deeply Supervised Object Detectors From Scratch

ICCV 2017 (arXiv)
0
citations

VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks

ICCV 2025
0
citations

SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition

ICCV 2025
0
citations

Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation

ICCV 2025
0
citations

IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves

ICCV 2025
0
citations

Comprehensive Multi-Modal Prototypes Are Simple and Effective Classifiers for Vast-Vocabulary Object Detection

AAAI 2025
0
citations

Instance-Aware Multi-Camera 3D Object Detection with Structural Priors Mining and Self-Boosting Learning

AAAI 2024 (arXiv)
0
citations

MotionEditor: Editing Video Motion via Content-Aware Diffusion

CVPR 2024
0
citations

Harnessing Object and Scene Semantics for Large-Scale Video Understanding

CVPR 2016
0
citations

Weakly Supervised Dense Video Captioning

CVPR 2017 (arXiv)
0
citations

Dual Skipping Networks

CVPR 2018 (arXiv)
0
citations

Hyperbolic Visual Embedding Learning for Zero-Shot Recognition

CVPR 2020
0
citations

Sketch-BERT: Learning Sketch Bidirectional Encoder Representation From Transformers by Self-Supervised Learning of Sketch Gestalt

CVPR 2020
0
citations

FM2u-Net: Face Morphological Multi-Branch Network for Makeup-Invariant Face Verification

CVPR 2020
0
citations

Clean-Label Backdoor Attacks on Video Recognition Models

CVPR 2020 (arXiv)
0
citations

Towards Bridging Event Captioner and Sentence Localizer for Weakly Supervised Dense Event Captioning

CVPR 2021
0
citations

Balanced Contrastive Learning for Long-Tailed Visual Recognition

CVPR 2022
0
citations

Cross-Modal Transferable Adversarial Attacks From Images to Videos

CVPR 2022 (arXiv)
0
citations

BEVT: BERT Pretraining of Video Transformers

CVPR 2022 (arXiv)
0
citations

ObjectFormer for Image Manipulation Detection and Localization

CVPR 2022 (arXiv)
0
citations

AdaViT: Adaptive Vision Transformers for Efficient Image Recognition

CVPR 2022 (arXiv)
0
citations

LiteEval: A Coarse-to-Fine Framework for Resource Efficient Video Recognition

NeurIPS 2019
0
citations

OmniVL: One Foundation Model for Image-Language and Video-Language Tasks

NeurIPS 2022
0
citations

Multi-Prompt Alignment for Multi-Source Unsupervised Domain Adaptation

NeurIPS 2023
0
citations

Learning from Rich Semantics and Coarse Locations for Long-tailed Object Detection

NeurIPS 2023
0
citations