Yi Yang

Papers: 55

Total Citations: 359

Papers (55)

MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis

Knowledge-Enhanced Dual-stream Zero-shot Composed Image Retrieval

Deep SE(3)-Equivariant Geometric Reasoning for Precise Placement Tasks

Clustering Propagation for Universal Medical Image Segmentation

LSK3DNet: Towards Effective and Efficient 3D Perception with Large Sparse Kernels

Controllable Navigation Instruction Generation with Chain of Thought Prompting

DreamRenderer: Taming Multi-Instance Attribute Control in Large-Scale Text-to-Image Models

BVINet: Unlocking Blind Video Inpainting with Zero Annotations

EnergyMoGen: Compositional Human Motion Generation with Energy-Based Diffusion Model in Latent Space

Learning from One Continuous Video Stream

VideoUFO: A Million-Scale User-Focused Dataset for Text-to-Video Generation

Clustering for Protein Representation Learning

Imagine and Seek: Improving Composed Image Retrieval with an Imagined Proxy

Autonomous LLM-Enhanced Adversarial Attack for Text-to-Motion

Scene Map-based Prompt Tuning for Navigation Instruction Generation

DroneSplat: 3D Gaussian Splatting for Robust 3D Reconstruction from In-the-Wild Drone Imagery

DiffVsgg: Diffusion-Driven Online Video Scene Graph Generation

NeRF Is a Valuable Assistant for 3D Gaussian Splatting

MeshLLM: Empowering Large Language Models to Progressively Understand and Generate 3D Mesh

From Image to Video: An Empirical Study of Diffusion Representations

Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models

GraphMimic: Graph-to-Graphs Generative Modeling from Videos for Policy Learning

SparseDiT: Token Sparsification for Efficient Diffusion Transformer

Silence is Golden: Leveraging Adversarial Examples to Nullify Audio Control in LDM-based Talking-Head Generation

ZeroMamba: Exploring Visual State Space Model for Zero-Shot Learning

LLM Agents Can Be Choice-Supportive Biased Evaluators: An Empirical Study

TDDBench: A Benchmark for Training data detection

Dual Reciprocal Learning of Language-based Human Motion Understanding and Generation

Adapting Text-to-Image Generation with Feature Difference Instruction for Generic Image Restoration

Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion

UniGlyph: Unified Segmentation-Conditioned Diffusion for Precise Visual Text Synthesis

BrainGuard: Privacy-Preserving Multisubject Image Reconstructions from Brain Activities

Improving Context Understanding in Multimodal Large Language Models via Multimodal Composition Learning

DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval

Stitching Segments and Sentences towards Generalization in Video-Text Pre-training

Interpretable3D: An Ad

MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs

Volumetric Environment Representation for Vision-Language Navigation

TIP-I2V: A Million-Scale Real Text and Image Prompt Dataset for Image-to-Video Generation

TAPNext: Tracking Any Point (TAP) as Next Token Prediction

Neural Clustering based Visual Representation Learning

CapHuman: Capture Your Moments in Parallel Universes

MaGS: Reconstructing and Simulating Dynamic 3D Objects with Mesh-adsorbed Gaussian Splatting

Psychometry: An Omnifit Model for Image Reconstruction from Human Brain Activity

Entangled View-Epipolar Information Aggregation for Generalizable Neural Radiance Fields

SKDream: Controllable Multi-view and 3D Generation with Arbitrary Skeletons

MS-DETR: Efficient DETR Training with Mixed Supervision

SIFU: Side-view Conditioned Implicit Function for Real-world Usable Clothed Human Reconstruction

VISTA-LLAMA: Reducing Hallucination in Video Language Models via Equal Distance to Visual Tokens

Underwater Visual SLAM with Depth Uncertainty and Medium Modeling

Towards Human-like Virtual Beings: Simulating Human Behavior in 3D Scenes

Gaussian-based World Model: Gaussian Priors for Voxel-Based Occupancy Prediction and Future Motion Prediction

From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment

DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent)

Hierarchical Event Memory for Accurate and Low-latency Online Video Temporal Grounding