Zhuowen Tu

Papers: 14

Total Citations: 270

Papers (14)

BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions

Dolfin: Diffusion Layout Transformers without Autoencoder

Bayesian Diffusion Models for 3D Shape Reconstruction

Enhancing Vision-Language Pre-training with Rich Supervisions

YOLO-Count: Differentiable Object Counting for Text-to-Image Generation

Lay-Your-Scene: Natural Scene Layout Generation with Diffusion Transformers

Ground-V: Teaching VLMs to Ground Complex Instructions in Pixels

Open-World Dynamic Prompt and Continual Visual Representation Learning

Restoration by Generation with Constrained Priors

DepR: Depth Guided Single-view Scene Reconstruction with Instance-level Diffusion

Non-autoregressive Sequence-to-Sequence Vision-Language Models

On the Scalability of Diffusion-based Text-to-Image Generation

TokenCompose: Text-to-Image Diffusion with Token-level Supervision

HOIDiffusion: Generating Realistic 3D Hand-Object Interaction Data