Yixiao Ge

Papers: 19

Total Citations: 317

Papers (19)

SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models

ST-LLM: Large Language Models Are Effective Temporal Learners

Scalable Image Tokenization with Index Backpropagation Quantization

Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities

GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers

Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation

HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding

Cached Transformers: Improving Transformers with Differentiable Memory Cache

Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos

BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning

YOLO-World: Real-Time Open-Vocabulary Object Detection

ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models

Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis

SEED-Bench: Benchmarking Multimodal Large Language Models

Low-Rank Approximation for Sparse Attention in Multi-Modal LLMs

UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition

ViT-Lens: Towards Omni-modal Representations

AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction

VoCo-LLaMA: Towards Vision Compression with Large Language Models