Jing Shao

Papers: 32

Total Citations: 947

Papers (32)

WorldSimBench: Towards Video Generation Models as World Simulators

MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception

REEF: Representation Encoding Fingerprints for Large Language Models

T2ISafety: Benchmark for Assessing Fairness, Toxicity, and Privacy in Image Generation

EndoBench: A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis

NeurIPS 2025, arXiv

Exploring Disentangled Feature Representation Beyond Face Identification

Practical Block-Wise Neural Network Architecture Generation

Avatar-Net: Multi-Scale Zero-Shot Style Transfer by Feature Decoration

Improving Referring Expression Grounding With Cross-Modal Attention-Guided Erasing

Semantics Disentangling for Text-To-Image Generation

Video Generation From Single Semantic Label Map

Context and Attribute Grounded Dense Captioning

ForgeryNet: A Versatile Benchmark for Comprehensive Forgery Analysis

Siamese DETR

HydraPlus-Net: Attentive Deep Features for Pedestrian Analysis

Orientation Invariant Feature Embedding and Spatial Temporal Regularization for Vehicle Re-Identification

CAMP: Cross-Modal Adaptive Message Passing for Text-Image Retrieval

CelebA-Spoof: Large-Scale Face Anti-Spoofing Dataset with Rich Annotations

Thinking in Frequency: Face Forgery Detection by Mining Frequency-aware Clues

Learning Connectivity of Neural Networks from a Topological Perspective

Benchmarking Omni-Vision Representation through the Lens of Visual Realms

Towards Accurate Binary Neural Networks via Modeling Contextual Dependencies

PalGAN: Image Colorization with Palette Generative Adversarial Networks

X-Learner: Learning Cross Sources and Tasks for Universal Visual Representation

Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization

SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Models

Deeply Learned Attributes for Crowded Scene Understanding

Slicing Convolutional Neural Network for Crowd Video Understanding

Spindle Net: Person Re-Identification With Human Body Region Guided Feature Decomposition and Fusion

Learning to Predict Layout-to-image Conditional Convolutions for Semantic Image Synthesis

ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning

LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark