Yixiao Ge

43
Papers
317
Total Citations

Papers (43)

SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models

CVPR 2024
139
citations

ST-LLM: Large Language Models Are Effective Temporal Learners

ECCV 2024
124
citations

Scalable Image Tokenization with Index Backpropagation Quantization

ICCV 2025
16
citations

Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities

CVPR 2024
11
citations

GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers

ICCV 2025
9
citations

Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation

CVPR 2025
7
citations

HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding

ICML 2025
6
citations

Cached Transformers: Improving Transformers with Differentiable Memory Cache

AAAI 2024 (arXiv)
5
citations

Low-Rank Approximation for Sparse Attention in Multi-Modal LLMs

CVPR 2024
0
citations

UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio Video Point Cloud Time-Series and Image Recognition

CVPR 2024
0
citations

ViT-Lens: Towards Omni-modal Representations

CVPR 2024
0
citations

Mutual CRF-GNN for Few-Shot Learning

CVPR 2021
0
citations

Refining Pseudo Labels With Clustering Consensus Over Generations for Unsupervised Object Re-Identification

CVPR 2021 (arXiv)
0
citations

DivCo: Diverse Conditional Image Synthesis via Contrastive Generative Adversarial Network

CVPR 2021 (arXiv)
0
citations

Bridging Video-Text Retrieval With Multiple Choice Questions

CVPR 2022 (arXiv)
0
citations

Object-Aware Video-Language Pre-Training for Retrieval

CVPR 2022 (arXiv)
0
citations

Accelerating Vision-Language Pretraining With Free Language Modeling

CVPR 2023 (arXiv)
0
citations

All in One: Exploring Unified Video-Language Pre-Training

CVPR 2023 (arXiv)
0
citations

Learning Transferable Spatiotemporal Representations From Natural Script Knowledge

CVPR 2023 (arXiv)
0
citations

RILS: Masked Visual Reconstruction in Language Semantic Space

CVPR 2023 (arXiv)
0
citations

Progressive Correspondence Pruning by Consensus Learning

ICCV 2021 (arXiv)
0
citations

Online Pseudo Label Generation by Hierarchical Cluster Dynamics for Adaptive Person Re-Identification

ICCV 2021
0
citations

Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection

ICCV 2023 (arXiv)
0
citations

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

ICCV 2023
0
citations

VoCo-LLaMA: Towards Vision Compression with Large Language Models

CVPR 2025
0
citations

Exploring Model Transferability through the Lens of Potential Energy

ICCV 2023 (arXiv)
0
citations

Self-supervising Fine-grained Region Similarities for Large-scale Image Localization

ECCV 2020
0
citations

Mc-BEiT: Multi-Choice Discretization for Image BERT Pre-training

ECCV 2022
0
citations

Not All Models Are Equal: Predicting Model Transferability in a Self-Challenging Fisher Space

ECCV 2022
0
citations

MILES: Visual BERT Pre-training with Injected Language Semantics for Video-Text Retrieval

ECCV 2022
0
citations

BoxSnake: Polygonal Instance Segmentation with Box Supervision

ICCV 2023 (arXiv)
0
citations

ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models

CVPR 2025
0
citations

Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos

ICCV 2025
0
citations

AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction

ICCV 2025
0
citations

BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning

CVPR 2024
0
citations

YOLO-World: Real-Time Open-Vocabulary Object Detection

CVPR 2024
0
citations

Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis

CVPR 2024
0
citations

SEED-Bench: Benchmarking Multimodal Large Language Models

CVPR 2024
0
citations

FD-GAN: Pose-guided Feature Distilling GAN for Robust Person Re-identification

NeurIPS 2018
0
citations

Self-paced Contrastive Learning with Hybrid Memory for Domain Adaptive Object Re-ID

NeurIPS 2020
0
citations

Mix-of-Show: Decentralized Low-Rank Adaptation for Multi-Concept Customization of Diffusion Models

NeurIPS 2023
0
citations

Meta-Adapter: An Online Few-shot Learner for Vision-Language Model

NeurIPS 2023
0
citations

GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction

NeurIPS 2023
0
citations