Yi Zhu

29 Papers
66 Total Citations

Papers (29)

EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

CVPR 2025
44 citations

rStar-Coder: Scaling Competitive Code Reasoning with a Large-Scale Verified Dataset

NeurIPS 2025 (arXiv)
22 citations

Weakly Supervised Instance Segmentation Using Class Peak Response

CVPR 2018 (arXiv)
0 citations

Towards Universal Representation for Unseen Action Recognition

CVPR 2018 (arXiv)
0 citations

Learning Instance Activation Maps for Weakly Supervised Instance Segmentation

CVPR 2019
0 citations

Improving Semantic Segmentation via Video Propagation and Label Relaxation

CVPR 2019
0 citations

Vision-Language Navigation With Self-Supervised Auxiliary Reasoning Tasks

CVPR 2020 (arXiv)
0 citations

Vision-Dialog Navigation by Exploring Cross-Modal Memory

CVPR 2020 (arXiv)
0 citations

Domain Consensus Clustering for Universal Domain Adaptation

CVPR 2021
0 citations

SOON: Scenario Oriented Object Navigation With Graph-Based Exploration

CVPR 2021 (arXiv)
0 citations

Learning Canonical F-Correlation Projection for Compact Multiview Representation

CVPR 2022
0 citations

ADAPT: Vision-Language Navigation With Modality-Aligned Action Prompts

CVPR 2022 (arXiv)
0 citations

Soft Proposal Networks for Weakly Supervised Object Localization

ICCV 2017 (arXiv)
0 citations

CrossCLR: Cross-Modal Contrastive Learning for Multi-Modal Video Representations

ICCV 2021 (arXiv)
0 citations

VidTr: Video Transformer Without Convolutions

ICCV 2021 (arXiv)
0 citations

Self-Motivated Communication Agent for Real-World Vision-Dialog Navigation

ICCV 2021
0 citations

CrossNorm and SelfNorm for Generalization Under Distribution Shifts

ICCV 2021 (arXiv)
0 citations

MixReorg: Cross-Modal Mixed Patch Reorganization is a Good Mask Learner for Open-World Semantic Segmentation

ICCV 2023 (arXiv)
0 citations

Motion-Guided Masking for Spatiotemporal Representation Learning

ICCV 2023 (arXiv)
0 citations

Towards Geospatial Foundation Models via Continual Pretraining

ICCV 2023 (arXiv)
0 citations

Motion-Excited Sampler: Video Adversarial Attack with Sparked Prior

ECCV 2020
0 citations

Selective Sparse Sampling for Fine-Grained Image Recognition

ICCV 2019
0 citations

CAP-Net: A Unified Network for 6D Pose and Size Estimation of Categorical Articulated Parts from a Single RGB-D Image

CVPR 2025
0 citations

Blending Anti-Aliasing into Vision Transformer

NeurIPS 2021
0 citations

Progressive Coordinate Transforms for Monocular 3D Object Detection

NeurIPS 2021
0 citations

CoupAlign: Coupling Word-Pixel with Sentence-Mask Alignments for Referring Image Segmentation

NeurIPS 2022
0 citations

Earthformer: Exploring Space-Time Transformers for Earth System Forecasting

NeurIPS 2022
0 citations

Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition

NeurIPS 2023
0 citations

PreDiff: Precipitation Nowcasting with Latent Diffusion Models

NeurIPS 2023
0 citations