Renrui Zhang

26
Papers
1,269
Total Citations

Papers (26)

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

CVPR 2025
858
citations

OneTracker: Unifying Visual Object Tracking with Foundation Models and Efficient Tuning

CVPR 2024
118
citations

MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency

ICML 2025
88
citations

Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation

AAAI 2024
58
citations

From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning

ICCV 2025
28
citations

No Time to Train: Empowering Non-Parametric Networks for Few-shot 3D Scene Segmentation

CVPR 2024
27
citations

PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions

ICLR 2025
26
citations

FM-OV3D: Foundation Model-Based Cross-Modal Knowledge Blending for Open-Vocabulary 3D Detection

AAAI 2024
22
citations

Cloud-Device Collaborative Learning for Multimodal Large Language Models

CVPR 2024
18
citations

Detect Anything 3D in the Wild

ICCV 2025
12
citations

Lumina-T2X: Scalable Flow-based Large Diffusion Transformer for Flexible Resolution Generation

ICLR 2025
8
citations

Lift3D Policy: Lifting 2D Foundation Models for Robust 3D Robotic Manipulation

CVPR 2025
6
citations

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

ICML 2024
0
citations

Let's Verify and Reinforce Image Generation Step by Step

CVPR 2025
0
citations

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

ICML 2024
0
citations

Chimera: Improving Generalist Model with Domain-Specific Experts

ICCV 2025
0
citations

TAR3D: Creating High-Quality 3D Assets via Next-Part Prediction

ICCV 2025
0
citations

Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs

ICCV 2025
0
citations

MM-Mixing: Multi-Modal Mixing Alignment for 3D Understanding

AAAI 2025
0
citations

LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding

AAAI 2025
0
citations

Parsing All Adverse Scenes: Severity-Aware Semantic Segmentation with Mask-Enhanced Cross-Domain Consistency

AAAI 2024
0
citations

Gradient-based Parameter Selection for Efficient Fine-Tuning

CVPR 2024
0
citations

Continual-MAE: Adaptive Distribution Masked Autoencoders for Continual Test-Time Adaptation

CVPR 2024
0
citations

ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation

CVPR 2024
0
citations

NTO3D: Neural Target Object 3D Reconstruction with Segment Anything

CVPR 2024
0
citations

SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models

ICML 2024
0
citations