Dahua Lin

133 Papers

3,162 Total Citations

Papers (133)

VBench: Comprehensive Benchmark Suite for Video Generative Models

Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering

OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation

SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction

Contrastive Learning for Image Captioning

NeurIPS 2017

Recognize Complex Events From Static Images by Fusing Deep Channels

VideoBooth: Diffusion-based Video Generation with Image Prompts

Unified Human-Scene Interaction via Prompted Chain-of-Contacts

Sep-Stereo: Visually Guided Stereophonic Audio Generation by Associating Source Separation

GPT4Point: A Unified Framework for Point-Language Understanding and Generation

Long Context Tuning for Video Generation

LEGION: Learning to Ground and Explain for Synthetic Image Detection

Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction

Online Multi-modal Person Search in Videos

UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios

SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation

MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models

IDArb: Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations

Learn to Propagate Reliably on Noisy Affinity Graphs

Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLMs

Horizon-GS: Unified 3D Gaussian Splatting for Large-Scale Aerial-to-Ground Scenes

GenDoP: Auto-regressive Camera Trajectory Generation as a Director of Photography

HiFlow: Training-free High-Resolution Image Generation with Flow-Aligned Guidance

Keyframe-Guided Creative Video Inpainting

Utilize the Flow Before Stepping into the Same River Twice: Certainty Represented Knowledge Flow for Refusal-Aware Instruction Tuning

Multi-identity Human Image Animation with Structural Video Diffusion

Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data

VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization

Adapting Object Detectors via Selective Cross-Domain Alignment

Libra R-CNN: Towards Balanced Learning for Object Detection

Learning a Unified Classifier Incrementally via Rebalancing

Self-Supervised Learning via Conditional Motion Propagation

Learning to Cluster Faces on an Affinity Graph

Region Proposal by Guided Anchoring

Hybrid Task Cascade for Instance Segmentation

IRLAS: Inverse Reinforcement Learning for Architecture Search

FineGym: A Hierarchical Video Dataset for Fine-Grained Action Understanding

Self-Supervised Scene De-Occlusion

Intra- and Inter-Action Understanding via Temporal Action Parsing

When NAS Meets Robustness: In Search of Robust Architectures Against Adversarial Attacks

A Local-to-Global Approach to Multi-Modal Movie Scene Segmentation

Learning to Cluster Faces via Confidence and Connectivity Estimation

DSNAS: Direct Neural Architecture Search Without Parameter Retraining

Open Compound Domain Adaptation

Prime Sample Attention in Object Detection

Visually Informed Binaural Audio Generation without Binaural Audios

Scene-Aware Generative Network for Human Motion Synthesis

Adversarial Robustness Under Long-Tailed Distribution

Cylindrical and Asymmetrical 3D Convolution Networks for LiDAR Segmentation

Seesaw Loss for Long-Tailed Instance Segmentation

Towards Evaluating and Training Verifiably Robust Neural Networks

TransRank: Self-Supervised Video Representation Learning via Ranking-Based Transformation Recognition

OCSampler: Compressing Videos to One Clip With Single-Step Sampling

Towards Diverse and Natural Scene-Aware 3D Human Motion Synthesis

SwinTextSpotter: Scene Text Spotting via Better Synergy Between Text Detection and Text Recognition

Revisiting Skeleton-Based Action Recognition

Multi-Level Logit Distillation

OmniCity: Omnipotent City Understanding With Multi-Level and Multi-View Images

MV-JAR: Masked Voxel Jigsaw and Reconstruction for LiDAR-Based Self-Supervised Pre-Training

Controllable Mesh Generation Through Sparse Latent Point Diffusion Models

OmniObject3D: Large-Vocabulary 3D Object Dataset for Realistic Perception, Reconstruction and Generation

RIFormer: Keep Your Vision Backbone Effective but Removing Token Mixer

Grid-Guided Neural Radiance Fields for Large Urban Scenes

Be Your Own Prada: Fashion Synthesis With Structural Coherence

Temporal Action Detection With Structured Segment Networks

Towards Diverse and Natural Image Descriptions via a Conditional GAN

Recursive Visual Sound Separation Using Minus-Plus Net

CARAFE: Content-Aware ReAssembly of FEatures

Convolutional Sequence Generation for Skeleton-Based Action Synthesis

A Graph-Based Framework to Bridge Movies and Synopses

Online Hyper-Parameter Learning for Auto-Augmentation Strategy

Vision Transformer With Progressive Sampling

BlockPlanner: City Block Generation With Vectorized Graph Representation

3D Building Reconstruction From Monocular Remote Sensing Images

MatrixCity: A Large-scale City Dataset for City-scale Neural Rendering and Beyond

SynBody: Synthetic Dataset with Layered Human Models for 3D Human Perception and Modeling

3DTopia-XL: Scaling High-quality 3D Asset Generation via Primitive Diffusion

DNA-Rendering: A Diverse Neural Actor Repository for High-Fidelity Human-Centric Rendering

Scene as Occupancy

AssetField: Assets Mining and Reconfiguration in Ground Feature Plane Representation

Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos

V3Det: Vast Vocabulary Visual Detection Dataset

Learning Human Dynamics in Autonomous Driving Scenarios

Exploiting Deep Generative Prior for Versatile Image Restoration and Manipulation

Distribution-Balanced Loss for Multi-Label Classification in Long-Tailed Datasets

Side-Aware Boundary Localization for More Precise Object Detection

MovieNet: A Holistic Dataset for Movie Understanding

A Unified Framework for Shot Type Classification Based on Subject Centric Lens

Motion Guided 3D Pose Estimation from Videos

Omni-sourced Webly-supervised Learning for Video Recognition

Caption-Supervised Face Recognition: Training a State-of-the-Art Face Model without Manual Annotation

Placepedia: Comprehensive Place Understanding with Multi-Faceted Annotations

Monocular 3D Object Detection with Depth from Motion

Static and Dynamic Concepts for Self-Supervised Video Representation Learning

BungeeNeRF: Progressive Neural Radiance Field for Extreme Multi-Scale Scene Rendering

Improving Pixel-based MIM by Reducing Wasted Modeling Capability

Conical Visual Concentration for Efficient Large Vision-Language Models

ByTheWay: Boost Your Text-to-Video Generation Model to Higher Quality in a Training-free Way

MM-IFEngine: Towards Multimodal Instruction Following

Visual-RFT: Visual Reinforcement Fine-Tuning

SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree

X-Prompt: Generalizable Auto-Regressive Visual Learning with In-Context Prompting

Mixing Expert Knowledge: Bring Human Thoughts Back To the Game of Go

EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI

OneLLM: One Framework to Align All Modalities with Language

GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation

Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models

Towards Text-guided 3D Scene Composition

Cinematic Behavior Transfer via NeRF-based Differentiable Filming

HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting

MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving

Linear Alignment: A Closed-form Solution for Aligning Human Preferences without Tuning and Feedback

PolyNet: A Pursuit of Structural Diversity in Very Deep Networks

Detecting Visual Relationships With Deep Relational Networks

Discover and Learn New Objects From Documentaries

UntrimmedNets for Weakly Supervised Action Recognition and Detection

Unifying Identification and Context Learning for Person Recognition

Unsupervised Feature Learning via Non-Parametric Instance Discrimination

Low-Latency Video Semantic Segmentation

Learning Globally Optimized Object Detector via Policy Gradient

Recognize Actions by Disentangling Components of Dynamics

Optimizing Video Object Detection via a Scale-Time Lattice

Trajectory Convolution for Action Recognition

A Neural Compositional Paradigm for Image Captioning

Policy Continuation with Hindsight Inverse Dynamics

Few-Shot Object Detection via Association and DIscrimination

Generative Occupancy Fields for 3D Surface-Aware Image Synthesis

Balanced Chamfer Distance as a Comprehensive Metric for Point Cloud Completion

Semi-Supervised Semantic Segmentation via Gentle Teaching Assistant

Audio-Driven Co-Speech Gesture Video Generation

RenderMe-360: A Large Digital Asset Library and Benchmarks Towards High-fidelity Head Avatars

POPQORN: Quantifying Robustness of Recurrent Neural Networks