Ruimao Zhang
32
Papers
1,136
Total Citations
Papers (32)
WorldSimBench: Towards Video Generation Models as World Simulators
ICML 2025
806
citations
SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models
CVPR 2024
139
citations
MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception
CVPR 2024
76
citations
Open-World Human-Object Interaction Detection via Multi-modal Prompts
CVPR 2024
31
citations
ScaMo: Exploring the Scaling Law in Autoregressive Motion Generation Model
CVPR 2025
24
citations
F-HOI: Toward Fine-grained Semantic-Aligned 3D Human-Object Interactions
ECCV 2024
22
citations
X4D-SceneFormer: Enhanced Scene Understanding on 4D Point Cloud Videos through Cross-Modal Knowledge Transfer
AAAI 2024
11
citations
RoboFactory: Exploring Embodied Agent Collaboration with Compositional Constraints
ICCV 2025
11
citations
DriveGEN: Generalized and Robust 3D Detection in Driving via Controllable Text-to-Image Diffusion Generation
CVPR 2025
10
citations
FreeMan: Towards Benchmarking 3D Human Pose Estimation under Real-World Conditions
CVPR 2024
6
citations
Differentiable Learning-to-Group Channels via Groupable Convolutional Neural Networks
ICCV 2019
0
citations
End-to-End Dense Video Captioning With Parallel Decoding
ICCV 2021
0
citations
InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds Through Instance Multi-Level Contextual Referring
ICCV 2021
0
citations
SupFusion: Supervised LiDAR-Camera Fusion for 3D Object Detection
ICCV 2023
0
citations
Neural Interactive Keypoint Detection
ICCV 2023
0
citations
Towards Content-Independent Multi-Reference Super-Resolution: Adaptive Pattern Matching and Feature Aggregation
ECCV 2020
0
citations
Weakly Supervised Object Localization via Transformer with Implicit Spatial Calibration
ECCV 2022
0
citations
2DPASS: 2D Priors Assisted Semantic Segmentation on LiDAR Point Clouds
ECCV 2022
0
citations
Exemplar Normalization for Learning Deep Representation
CVPR 2020
0
citations
SEED-Bench: Benchmarking Multimodal Large Language Models
CVPR 2024
0
citations
HumanTOMATO: Text-aligned Whole-body Motion Generation
ICML 2024
0
citations
Deep Structured Scene Parsing by Learning With Image Descriptions
CVPR 2016
0
citations
SSN: Learning Sparse Switchable Normalization via SparsestMax
CVPR 2019
0
citations
DeepFashion2: A Versatile Benchmark for Detection, Pose Estimation, Segmentation and Re-Identification of Clothing Images
CVPR 2019
0
citations
Towards Photo-Realistic Virtual Try-On by Adaptively Generating-Preserving Image Content
CVPR 2020
0
citations
Parser-Free Virtual Try-On via Distilling Appearance Flows
CVPR 2021
0
citations
Semantic Human Parsing via Scalable Semantic Transfer Over Multiple Label Domains
CVPR 2023
0
citations
Once a MAN: Towards Multi-Target Attack via Learning Multi-Target Adversarial Network Once
ICCV 2019
0
citations
Let Images Give You More: Point Cloud Cross-Modal Training for Shape Analysis
NeurIPS 2022
0
citations
AMOS: A Large-Scale Abdominal Multi-Organ Benchmark for Versatile Medical Image Segmentation
NeurIPS 2022
0
citations
Motion-X: A Large-scale 3D Expressive Whole-body Human Motion Dataset
NeurIPS 2023
0
citations
Discovering Intrinsic Spatial-Temporal Logic Rules to Explain Human Actions
NeurIPS 2023
0
citations