Yixiao Ge

Papers: 43

Total Citations: 317

Papers (43)

SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models

ST-LLM: Large Language Models Are Effective Temporal Learners

Scalable Image Tokenization with Index Backpropagation Quantization

Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities

GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers

Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation

HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding

Cached Transformers: Improving Transformers with Differentiable Memory Cache

Low-Rank Approximation for Sparse Attention in Multi-Modal LLMs

UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition

ViT-Lens: Towards Omni-modal Representations

Mutual CRF-GNN for Few-Shot Learning

Refining Pseudo Labels With Clustering Consensus Over Generations for Unsupervised Object Re-Identification

DivCo: Diverse Conditional Image Synthesis via Contrastive Generative Adversarial Network

Bridging Video-Text Retrieval With Multiple Choice Questions

Object-Aware Video-Language Pre-Training for Retrieval

Accelerating Vision-Language Pretraining With Free Language Modeling

All in One: Exploring Unified Video-Language Pre-Training

Learning Transferable Spatiotemporal Representations From Natural Script Knowledge

RILS: Masked Visual Reconstruction in Language Semantic Space

Progressive Correspondence Pruning by Consensus Learning

Online Pseudo Label Generation by Hierarchical Cluster Dynamics for Adaptive Person Re-Identification

Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

VoCo-LLaMA: Towards Vision Compression with Large Language Models

Exploring Model Transferability through the Lens of Potential Energy

Self-supervising Fine-grained Region Similarities for Large-scale Image Localization

Mc-BEiT: Multi-Choice Discretization for Image BERT Pre-training

Not All Models Are Equal: Predicting Model Transferability in a Self-Challenging Fisher Space

MILES: Visual BERT Pre-training with Injected Language Semantics for Video-Text Retrieval

BoxSnake: Polygonal Instance Segmentation with Box Supervision

ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models

Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos

AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction

BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning

YOLO-World: Real-Time Open-Vocabulary Object Detection

Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis

SEED-Bench: Benchmarking Multimodal Large Language Models

FD-GAN: Pose-guided Feature Distilling GAN for Robust Person Re-identification

Self-paced Contrastive Learning with Hybrid Memory for Domain Adaptive Object Re-ID

Mix-of-Show: Decentralized Low-Rank Adaptation for Multi-Concept Customization of Diffusion Models

Meta-Adapter: An Online Few-shot Learner for Vision-Language Model

GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction