Renrui Zhang
43 Papers
1,269 Total Citations
Papers (43)
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
CVPR 2025
858
citations
OneTracker: Unifying Visual Object Tracking with Foundation Models and Efficient Tuning
CVPR 2024
118
citations
MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency
ICML 2025
88
citations
Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation
AAAI 2024 (arXiv)
58
citations
From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning
ICCV 2025
28
citations
No Time to Train: Empowering Non-Parametric Networks for Few-shot 3D Scene Segmentation
CVPR 2024
27
citations
PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions
ICLR 2025
26
citations
FM-OV3D: Foundation Model-Based Cross-Modal Knowledge Blending for Open-Vocabulary 3D Detection
AAAI 2024 (arXiv)
22
citations
Cloud-Device Collaborative Learning for Multimodal Large Language Models
CVPR 2024
18
citations
Detect Anything 3D in the Wild
ICCV 2025
12
citations
Lumina-T2X: Scalable Flow-based Large Diffusion Transformer for Flexible Resolution Generation
ICLR 2025
8
citations
Lift3D Policy: Lifting 2D Foundation Models for Robust 3D Robotic Manipulation
CVPR 2025
6
citations
MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI
ICML 2024
0
citations
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
ICML 2024
0
citations
PointCLIP: Point Cloud Understanding by CLIP
CVPR 2022 (arXiv)
0
citations
Prompt, Generate, Then Cache: Cascade of Foundation Models Makes Strong Few-Shot Learners
CVPR 2023 (arXiv)
0
citations
Starting From Non-Parametric Networks for 3D Point Cloud Analysis
CVPR 2023 (arXiv)
0
citations
Learning 3D Representations From 2D Pre-Trained Models via Image-to-Point Masked Autoencoders
CVPR 2023 (arXiv)
0
citations
iQuery: Instruments As Queries for Audio-Visual Sound Separation
CVPR 2023 (arXiv)
0
citations
EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding
CVPR 2023 (arXiv)
0
citations
Let's Verify and Reinforce Image Generation Step by Step
CVPR 2025
0
citations
MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection
ICCV 2023 (arXiv)
0
citations
PointCLIP V2: Prompting CLIP and GPT for Powerful 3D Open-world Learning
ICCV 2023 (arXiv)
0
citations
Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior Refinement
ICCV 2023 (arXiv)
0
citations
SparseMAE: Sparse Training Meets Masked Autoencoders
ICCV 2023
0
citations
Exploring Resolution and Degradation Clues As Self-Supervised Signal for Low Quality Object Detection
ECCV 2022
0
citations
Frozen CLIP Models Are Efficient Video Learners
ECCV 2022
0
citations
Tip-Adapter: Training-Free Adaption of CLIP for Few-Shot Classification
ECCV 2022
0
citations
PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection
CVPR 2023 (arXiv)
0
citations
Chimera: Improving Generalist Model with Domain-Specific Experts
ICCV 2025
0
citations
TAR3D: Creating High-Quality 3D Assets via Next-Part Prediction
ICCV 2025
0
citations
Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs
ICCV 2025
0
citations
MM-Mixing: Multi-Modal Mixing Alignment for 3D Understanding
AAAI 2025
0
citations
LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding
AAAI 2025
0
citations
Parsing All Adverse Scenes: Severity-Aware Semantic Segmentation with Mask-Enhanced Cross-Domain Consistency
AAAI 2024
0
citations
Gradient-based Parameter Selection for Efficient Fine-Tuning
CVPR 2024
0
citations
Continual-MAE: Adaptive Distribution Masked Autoencoders for Continual Test-Time Adaptation
CVPR 2024
0
citations
ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation
CVPR 2024
0
citations
NTO3D: Neural Target Object 3D Reconstruction with Segment Anything
CVPR 2024
0
citations
SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models
ICML 2024
0
citations
Dual-stream Network for Visual Recognition
NeurIPS 2021
0
citations
Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training
NeurIPS 2022
0
citations
JourneyDB: A Benchmark for Generative Image Understanding
NeurIPS 2023
0
citations