Ruimao Zhang

32
Papers
1,136
Total Citations

Papers (32)

WorldSimBench: Towards Video Generation Models as World Simulators

ICML 2025
806
citations

SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models

CVPR 2024
139
citations

MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception

CVPR 2024
76
citations

Open-World Human-Object Interaction Detection via Multi-modal Prompts

CVPR 2024
31
citations

ScaMo: Exploring the Scaling Law in Autoregressive Motion Generation Model

CVPR 2025
24
citations

F-HOI: Toward Fine-grained Semantic-Aligned 3D Human-Object Interactions

ECCV 2024
22
citations

X4D-SceneFormer: Enhanced Scene Understanding on 4D Point Cloud Videos through Cross-Modal Knowledge Transfer

AAAI 2024
11
citations

RoboFactory: Exploring Embodied Agent Collaboration with Compositional Constraints

ICCV 2025
11
citations

DriveGEN: Generalized and Robust 3D Detection in Driving via Controllable Text-to-Image Diffusion Generation

CVPR 2025
10
citations

FreeMan: Towards Benchmarking 3D Human Pose Estimation under Real-World Conditions

CVPR 2024
6
citations

Differentiable Learning-to-Group Channels via Groupable Convolutional Neural Networks

ICCV 2019
0
citations

End-to-End Dense Video Captioning With Parallel Decoding

ICCV 2021
0
citations

InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds Through Instance Multi-Level Contextual Referring

ICCV 2021
0
citations

SupFusion: Supervised LiDAR-Camera Fusion for 3D Object Detection

ICCV 2023
0
citations

Neural Interactive Keypoint Detection

ICCV 2023
0
citations

Towards Content-Independent Multi-Reference Super-Resolution: Adaptive Pattern Matching and Feature Aggregation

ECCV 2020
0
citations

Weakly Supervised Object Localization via Transformer with Implicit Spatial Calibration

ECCV 2022
0
citations

2DPASS: 2D Priors Assisted Semantic Segmentation on LiDAR Point Clouds

ECCV 2022
0
citations

Exemplar Normalization for Learning Deep Representation

CVPR 2020
0
citations

SEED-Bench: Benchmarking Multimodal Large Language Models

CVPR 2024
0
citations

HumanTOMATO: Text-aligned Whole-body Motion Generation

ICML 2024
0
citations

Deep Structured Scene Parsing by Learning With Image Descriptions

CVPR 2016
0
citations

SSN: Learning Sparse Switchable Normalization via SparsestMax

CVPR 2019
0
citations

DeepFashion2: A Versatile Benchmark for Detection, Pose Estimation, Segmentation and Re-Identification of Clothing Images

CVPR 2019
0
citations

Towards Photo-Realistic Virtual Try-On by Adaptively Generating-Preserving Image Content

CVPR 2020
0
citations

Parser-Free Virtual Try-On via Distilling Appearance Flows

CVPR 2021
0
citations

Semantic Human Parsing via Scalable Semantic Transfer Over Multiple Label Domains

CVPR 2023
0
citations

Once a MAN: Towards Multi-Target Attack via Learning Multi-Target Adversarial Network Once

ICCV 2019
0
citations

Let Images Give You More: Point Cloud Cross-Modal Training for Shape Analysis

NeurIPS 2022
0
citations

AMOS: A Large-Scale Abdominal Multi-Organ Benchmark for Versatile Medical Image Segmentation

NeurIPS 2022
0
citations

Motion-X: A Large-scale 3D Expressive Whole-body Human Motion Dataset

NeurIPS 2023
0
citations

Discovering Intrinsic Spatial-Temporal Logic Rules to Explain Human Actions

NeurIPS 2023
0
citations