Kai Chen

75 Papers

584 Total Citations

Papers (75)

A Task is Worth One Word: Learning with Task Prompts for High-Quality Versatile Image Inpainting

OMG-Seg: Is One Model Good Enough For All Segmentation?

RTMO: Towards High-Performance One-Stage Real-Time Multi-Person Pose Estimation

MagicDrive-V2: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control

Gaining Wisdom from Setbacks: Aligning Large Language Models via Mistake Analysis

EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

DetDiffusion: Synergizing Generative and Perceptive Models for Enhanced Data Generation and Perception

Towards Language-Driven Video Inpainting via Multimodal Large Language Models

UVEB: A Large-scale Benchmark and Baseline Towards Real-World Underwater Video Enhancement

Implicit Concept Removal of Diffusion Models

Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLMs

Rethinking Verification for LLM Code Generation: From Generation to Testing

DuMo: Dual Encoder Modulation Network for Precise Concept Erasure

RepeatLeakage: Leak Prompts from Repeating as Large Language Model Is a Good Repeater

Memory Decoder: A Pretrained, Plug-and-Play Memory for Large Language Models

PatchScaler: An Efficient Patch-Independent Diffusion Model for Image Super-Resolution

Contact Map Transfer with Conditional Diffusion Model for Generalizable Dexterous Grasp Generation

SocialMOIF: Multi-Order Intention Fusion for Pedestrian Trajectory Prediction

Differentiable Model Scaling using Differentiable Topk

Can AI Assistants Know What They Don't Know?

Discover and Learn New Objects From Documentaries

Optimizing Video Object Detection via a Scale-Time Lattice

Libra R-CNN: Towards Balanced Learning for Object Detection

Region Proposal by Guided Anchoring

Hybrid Task Cascade for Instance Segmentation

Prime Sample Attention in Object Detection

Positional Encoding As Spatial Inductive Bias in GANs

Seesaw Loss for Long-Tailed Instance Segmentation

Learning To Identify Correct 2D-2D Line Correspondences on Sphere

TransRank: Self-Supervised Video Representation Learning via Ranking-Based Transformation Recognition

OCSampler: Compressing Videos to One Clip With Single-Step Sampling

Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation

Revisiting Skeleton-Based Action Recognition

GCFSR: A Generative and Controllable Face Super Resolution Method Without Facial and GAN Priors

Group R-CNN for Weakly Semi-Supervised Object Detection With Points

LAVT: Language-Aware Vision Transformer for Referring Image Segmentation

Mixed Autoencoder for Self-Supervised Visual Representation Learning

RIFormer: Keep Your Vision Backbone Effective but Removing Token Mixer

Dense Distinct Query for End-to-End Object Detection

CARAFE: Content-Aware ReAssembly of FEatures

SGPA: Structure-Guided Prior Adaptation for Category-Level 6D Object Pose Estimation

MultiSiam: Self-Supervised Multi-Instance Siamese Representation Learning for Autonomous Driving

Learning Icosahedral Spherical Probability Map Based on Bingham Mixture Model for Vanishing Point Estimation

Learning Shape Primitives via Implicit Convexity Regularization

Robo3D: Towards Robust and Reliable 3D Perception against Corruptions

Improving Pixel-based MIM by Reducing Wasted Modeling Capability

UMC: A Unified Bandwidth-efficient and Multi-resolution based Collaborative Perception Framework

Deep Fusion Transformer Network with Weighted Vector-Wise Keypoints Voting for Robust 6D Object Pose Estimation

Side-Aware Boundary Localization for More Precise Object Detection

Dense Siamese Network for Dense Unsupervised Learning

CODA: A Real-World Road Corner Case Dataset for Object Detection in Autonomous Driving

Sim-to-Real 6D Object Pose Estimation via Iterative Self-Training for Robotic Bin Picking

Consistent-Teacher: Towards Reducing Inconsistent Pseudo-Targets in Semi-Supervised Object Detection

Hybrid Reciprocal Transformer with Triplet Feature Alignment for Scene Graph Generation

TAPT: Test-Time Adversarial Prompt Tuning for Robust Inference in Vision-Language Models

Auto Cherry-Picker: Learning from High-quality Generative Data Driven by Language

Information Density Principle for MLLM Benchmarks

MotionShot: Adaptive Motion Transfer across Arbitrary Objects for Text-to-Video Generation

Mixing Expert Knowledge: Bring Human Thoughts Back To the Game of Go

DocVision: a Seamless, Cross-Device Immersive Active Reading Framework for Digital Academic Literature

Social Recommendation via Graph-Level Counterfactual Augmentation

Semantic-guided Masked Mutual Learning for Multi-modal Brain Tumor Segmentation with Arbitrary Missing Modalities

LLM-DR: A Novel LLM-Aided Diffusion Model for Rule Generation on Temporal Knowledge Graphs

Promptable Anomaly Segmentation with SAM Through Self-Perception Tuning

Parallel Beam Search Algorithms for Domain-Independent Dynamic Programming

Everything2Motion: Synchronizing Diverse Inputs via a Unified Framework for Human Motion Synthesis

EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI

PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models

Make-It-Vivid: Dressing Your Animatable Biped Cartoon Characters from Text

From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models

K-Net: Towards Unified Image Segmentation

Few-Shot Object Detection via Association and DIscrimination

Deliberated Domain Bridging for Domain Adaptive Semantic Segmentation

Segment Any Point Cloud Sequences by Distilling Vision Foundation Models

GlyphControl: Glyph Conditional Control for Visual Text Generation