Song Bai

46 Papers

463 Total Citations

Papers (46)

DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing

General Object Foundation Model for Images and Videos at Scale

Regional Homogeneity: Towards Learning Transferable Universal Adversarial Perturbations Against Defenses

DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data

Describe, Adapt and Combine: Empowering CLIP Encoders for Open-set 3D Object Retrieval

Re-Ranking via Metric Fusion for Object Retrieval and Person Re-Identification

Learning Attraction Field Representation for Robust Line Segment Detection

Improving Transferability of Adversarial Examples With Input Diversity

Holistically-Attracted Wireframe Parsing

Neural Architecture Search for Lightweight Non-Local Networks

Multi-Shot Temporal Event Localization: A Benchmark

SwiftNet: Real-Time Video Object Segmentation

Anchor-Free Person Search

Mimicking the Oracle: An Initial Phase Decorrelation Approach for Class Incremental Learning

An Empirical Study of End-to-End Temporal Action Detection

Knowledge Distillation As Efficient Pre-Training: Faster Convergence, Higher Data-Efficiency, and Better Transferability

Fourier Document Restoration for Robust Document Dewarping and Recognition

TransMix: Attend To Mix for Vision Transformers

YouMVOS: An Actor-Centric Multi-Shot Video Object Segmentation Dataset

DanceTrack: Multi-Object Tracking in Uniform Appearance and Diverse Motion

PLA: Language-Driven Open-Vocabulary 3D Scene Understanding

InstMove: Instance Motion for Object-Centric Video Segmentation

Ensemble Diffusion for Retrieval

Asymmetric Non-Local Neural Networks for Semantic Segmentation

Anchor Diffusion for Unsupervised Video Object Segmentation

CenterNet: Keypoint Triplets for Object Detection

View N-Gram Network for 3D Object Retrieval

Learn to Scale: Generating Multipolar Normalized Density Maps for Crowd Counting

Symmetry-Constrained Rectification Network for Scene Text Recognition

Prior-Aware Neural Network for Partially-Supervised Multi-Organ Segmentation

PlaneTR: Structure-Guided Transformers for 3D Plane Recovery

Versatile Transition Generation with Image-to-Video Diffusion

SRFormer: Permuted Self-Attention for Single Image Super-Resolution

Corner Proposal Network for Anchor-free, Two-stage Object Detection

XingGAN for Person Image Generation

Explicit Occlusion Reasoning for Multi-Person 3D Human Pose Estimation

Language Matters: A Weakly Supervised Vision-Language Pre-training Approach for Scene Text Detection and Spotting

Contextual Text Block Detection towards Scene Text Understanding

SeqFormer: Sequential Transformer for Video Instance Segmentation

In Defense of Online Models for Video Instance Segmentation

MOSE: A New Dataset for Video Object Segmentation in Complex Scenes

TimeExpert: An Expert-Guided Video LLM for Video Temporal Grounding

GIFT: A Real-Time and Scalable 3D Shape Search Engine

Scalable Person Re-Identification on Supervised Smoothed Manifold

Triplet-Center Loss for Multi-View 3D Object Retrieval

Mixed Samples as Probes for Unsupervised Model Selection in Domain Adaptation