Hongtao Xie

Papers: 19

Total Citations: 132

Papers (19)

Towards Balanced Alignment: Modal-Enhanced Semantic Modeling for Video Moment Retrieval

DiffAM: Diffusion-based Adversarial Makeup Transfer for Facial Privacy Protection

PosterMaker: Towards High-Quality Product Poster Generation with Accurate Text Rendering

Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models

Mask^2DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation

AlignZeg: Mitigating Objective Misalignment for Zero-shot Semantic Segmentation

CAPability: A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness

SynTab-LLaVA: Enhancing Multimodal Table Understanding with Decoupled Synthesis

GRIP: A Graph-Based Reasoning Instruction Producer

NeurIPS 2025, arXiv

IDseq: Decoupled and Sequentially Detecting and Grounding Multi-Modal Media Manipulation

DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations

CLIP-Adapted Region-to-Text Learning for Generative Open-Vocabulary Semantic Segmentation

Invisible Watermarks, Visible Gains: Steering Machine Unlearning with Bi-Level Watermarking Design

GestureHYDRA: Semantic Co-speech Gesture Synthesis via Hybrid Modality Diffusion Transformer and Cascaded-Synchronized Retrieval-Augmented Generation

SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition

IGD: Instructional Graphic Design with Multimodal Layer Generation

Forensic-MoE: Exploring Comprehensive Synthetic Image Detection Traces with Mixture of Experts

OTE: Exploring Accurate Scene Text Recognition Using One Token

Choose What You Need: Disentangled Representation Learning for Scene Text Recognition, Removal and Editing