Hongtao Xie

19 Papers
132 Total Citations

Papers (19)

Towards Balanced Alignment: Modal-Enhanced Semantic Modeling for Video Moment Retrieval

AAAI 2024 (arXiv)
40 citations

DiffAM: Diffusion-based Adversarial Makeup Transfer for Facial Privacy Protection

CVPR 2024
36 citations

PosterMaker: Towards High-Quality Product Poster Generation with Accurate Text Rendering

CVPR 2025
21 citations

Hybrid-Level Instruction Injection for Video Token Compression in Multi-modal Large Language Models

CVPR 2025
14 citations

Mask^2DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation

CVPR 2025
8 citations

AlignZeg: Mitigating Objective Misalignment for Zero-shot Semantic Segmentation

ECCV 2024
5 citations

CAPability: A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness

NeurIPS 2025
3 citations

SynTab-LLaVA: Enhancing Multimodal Table Understanding with Decoupled Synthesis

CVPR 2025
2 citations

GRIP: A Graph-Based Reasoning Instruction Producer

NeurIPS 2025 (arXiv)
2 citations

IDseq: Decoupled and Sequentially Detecting and Grounding Multi-Modal Media Manipulation

AAAI 2025
1 citation

DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations

CVPR 2024
0 citations

CLIP-Adapted Region-to-Text Learning for Generative Open-Vocabulary Semantic Segmentation

ICCV 2025
0 citations

Invisible Watermarks, Visible Gains: Steering Machine Unlearning with Bi-Level Watermarking Design

ICCV 2025
0 citations

GestureHYDRA: Semantic Co-speech Gesture Synthesis via Hybrid Modality Diffusion Transformer and Cascaded-Synchronized Retrieval-Augmented Generation

ICCV 2025
0 citations

SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition

ICCV 2025
0 citations

IGD: Instructional Graphic Design with Multimodal Layer Generation

ICCV 2025
0 citations

Forensic-MoE: Exploring Comprehensive Synthetic Image Detection Traces with Mixture of Experts

ICCV 2025
0 citations

OTE: Exploring Accurate Scene Text Recognition Using One Token

CVPR 2024
0 citations

Choose What You Need: Disentangled Representation Learning for Scene Text Recognition, Removal and Editing

CVPR 2024
0 citations