Dahua Lin

41 Papers

2,696 Total Citations

Papers (41)

VBench: Comprehensive Benchmark Suite for Video Generative Models

Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering

OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation

SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction

VideoBooth: Diffusion-based Video Generation with Image Prompts

Unified Human-Scene Interaction via Prompted Chain-of-Contacts

GPT4Point: A Unified Framework for Point-Language Understanding and Generation

Long Context Tuning for Video Generation

LEGION: Learning to Ground and Explain for Synthetic Image Detection

Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction

UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios

SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation

MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models

IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations

Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLMs

Horizon-GS: Unified 3D Gaussian Splatting for Large-Scale Aerial-to-Ground Scenes

GenDoP: Auto-regressive Camera Trajectory Generation as a Director of Photography

Utilize the Flow Before Stepping into the Same River Twice: Certainty Represented Knowledge Flow for Refusal-Aware Instruction Tuning

Keyframe-Guided Creative Video Inpainting

HiFlow: Training-free High-Resolution Image Generation with Flow-Aligned Guidance

Multi-identity Human Image Animation with Structural Video Diffusion

VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization

Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data

EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI

Mixing Expert Knowledge: Bring Human Thoughts Back To the Game of Go

OneLLM: One Framework to Align All Modalities with Language

GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation

Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

3DTopia-XL: Scaling High-quality 3D Asset Generation via Primitive Diffusion

From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models

X-Prompt: Generalizable Auto-Regressive Visual Learning with In-Context Prompting

Towards Text-guided 3D Scene Composition

Cinematic Behavior Transfer via NeRF-based Differentiable Filming

SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree

HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting

Visual-RFT: Visual Reinforcement Fine-Tuning

MM-IFEngine: Towards Multimodal Instruction Following

ByTheWay: Boost Your Text-to-Video Generation Model to Higher Quality in a Training-free Way

Conical Visual Concentration for Efficient Large Vision-Language Models

MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving

Linear Alignment: A Closed-form Solution for Aligning Human Preferences without Tuning and Feedback