🧬 Applications

Document Understanding

Understanding document images and layouts

45 papers · 306 total citations
Feb '24 to Jan '26: 42 papers
Also includes: document understanding, document analysis, ocr, document ai, layout analysis

Top Papers

#1

General Object Foundation Model for Images and Videos at Scale

Junfeng Wu, Yi Jiang, Qihao Liu et al.

CVPR 2024
79
citations
#2

DocFormerv2: Local Features for Document Understanding

Srikar Appalaraju, Peng Tang, Qi Dong et al.

AAAI 2024 · arXiv:2306.01733
visual document understanding, multi-modal transformer, local-feature alignment, document information extraction, +4
58
citations
#3

CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation

Hui Zhang, Dexiang Hong, Yitong Wang et al.

ICCV 2025 · arXiv:2412.03859
layout-to-image generation, multimodal diffusion transformers, siamese network architecture, creative layout planning, +1
33
citations
#4

DocRes: A Generalist Model Toward Unifying Document Image Restoration Tasks

Jiaxin Zhang, Dezhi Peng, Chongyu Liu et al.

CVPR 2024
29
citations
#5

SSMG: Spatial-Semantic Map Guided Diffusion Model for Free-Form Layout-to-Image Generation

Chengyou Jia, Minnan Luo, Zhuohang Dang et al.

AAAI 2024
26
citations
#6

M2Doc: A Multi-Modal Fusion Approach for Document Layout Analysis

Ning Zhang, Hiuyi Cheng, Jiayu Chen et al.

AAAI 2024
14
citations
#7

RoDLA: Benchmarking the Robustness of Document Layout Analysis Models

Yufan Chen, Jiaming Zhang, Kunyu Peng et al.

CVPR 2024
13
citations
#8

FreeFlux: Understanding and Exploiting Layer-Specific Roles in RoPE-Based MMDiT for Versatile Image Editing

Tianyi Wei, Yifan Zhou, Dongdong Chen et al.

ICCV 2025
12
citations
#9

Knowledge-Enhanced Historical Document Segmentation and Recognition

En-Hao Gao, Yu-Xuan Huang, Wen-Chao Hu et al.

AAAI 2024
7
citations
#10

A Simple yet Effective Layout Token in Large Language Models for Document Understanding

Zhaoqing Zhu, Chuwei Luo, Zirui Shao et al.

CVPR 2025 · arXiv:2503.18434
document understanding, layout token integration, positional encoding scheme, cross-modality learning, +3
7
citations
#11

"Principal Components" Enable A New Language of Images

Xin Wen, Bingchen Zhao, Ismail Elezi et al.

ICCV 2025
6
citations
#12

Text Grouping Adapter: Adapting Pre-trained Text Detector for Layout Analysis

Tianci Bi, Xiaoyi Zhang, Zhizheng Zhang et al.

CVPR 2024
5
citations
#13

From Elements to Design: A Layered Approach for Automatic Graphic Design Composition

Jiawei Lin, Shizhao Sun, Danqing Huang et al.

CVPR 2025 · arXiv:2412.19712
graphic design composition, multimodal graphic elements, layer planning, large multimodal models, +3
5
citations
#14

LayerD: Decomposing Raster Graphic Designs into Layers

Tomoyuki Suzuki, Kang-Jun Liu, Naoto Inoue et al.

ICCV 2025
3
citations
#15

Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding

Han Xiao, Yina Xie, Guanxin Tan et al.

CVPR 2025
3
citations
#16

Enhancing Document Understanding with Group Position Embedding: A Novel Approach to Incorporate Layout Information

Yuke Zhu, Yue Zhang, Dongdong Liu et al.

ICLR 2025
2
citations
#17

Graph-based Document Structure Analysis

Yufan Chen, Ruiping Liu, Junwei Zheng et al.

ICLR 2025
2
citations
#18

Design Graph Guided Element Importance-Aware Layout Generation with Multi-Modality Cascade Transformer

Qiuyun Zhang, Bin Guo, Lina Yao et al.

AAAI 2024
1
citations
#19

FICGen: Frequency-Inspired Contextual Disentanglement for Layout-driven Degraded Image Generation

Wenzhuang Wang, Yifan Zhao, Mingcan Ma et al.

ICCV 2025
1
citations
#20

InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions

Ryota Tanaka, Taichi Iki, Kyosuke Nishida et al.

AAAI 2024 · arXiv:2401.13313
visual document understanding, instruction-based models, multimodal large language models, zero-shot learning, +4
not collected
#21

Referring Image Editing: Object-level Image Editing via Referring Expressions

Chang Liu, Xiangtai Li, Henghui Ding

CVPR 2024
not collected
#22

Visual Layout Composer: Image-Vector Dual Diffusion Model for Design Layout Generation

Mohammad Amin Shabani, Zhaowen Wang, Difan Liu et al.

CVPR 2024
not collected
#23

HRVDA: High-Resolution Visual Document Assistant

Chaohu Liu, Kun Yin, Haoyu Cao et al.

CVPR 2024
not collected
#24

Nougat: Neural Optical Understanding for Academic Documents

Lukas Blecher, Guillem Cucurull Preixens, Thomas Scialom et al.

ICLR 2024
not collected
#25

ADOPD: A Large-Scale Document Page Decomposition Dataset

Jiuxiang Gu, Xiangxi Shi, Jason Kuen et al.

ICLR 2024
not collected
#26

Decomposition of Graphic Design with Unified Multimodal Model

Hui Nie, Zhao Zhang, Yutao Cheng et al.

ICML 2025
not collected
#27

Compositional Image Decomposition with Diffusion Models

Jocelin Su, Nan Liu, Yanbo Wang et al.

ICML 2024
image decomposition, diffusion models, unsupervised learning, scene composition, +2
not collected
#28

DOGR: Towards Versatile Visual Document Grounding and Referring

Yinan Zhou, Yuxin Chen, Haokun Lin et al.

ICCV 2025
not collected
#29

A Token-level Text Image Foundation Model for Document Understanding

Tongkun Guan, Zining Wang, Pei Fu et al.

ICCV 2025
not collected
#30

Can Machines Understand Composition? Dataset and Benchmark for Photographic Image Composition Embedding and Understanding

Zhaoran Zhao, Peng Lu, Anran Zhang et al.

CVPR 2025
not collected
#31

OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations

Linke Ouyang, Yuan Qu, Hongbin Zhou et al.

CVPR 2025
not collected
#32

Empowering LLMs to Understand and Generate Complex Vector Graphics

XiMing Xing, Juncheng Hu, Guotao Liang et al.

CVPR 2025
not collected
#33

Document Haystacks: Vision-Language Reasoning Over Piles of 1000+ Documents

Jun Chen, Dannong Xu, Junjie Fei et al.

CVPR 2025
not collected
#34

AutoPresent: Designing Structured Visuals from Scratch

Jiaxin Ge, Zora Zhiruo Wang, Xuhui Zhou et al.

CVPR 2025
not collected
#35

UnZipLoRA: Separating Content and Style from a Single Image

Chang Liu, Viraj Shah, Aiyu Cui et al.

ICCV 2025
not collected
#36

Image Intrinsic Scale Assessment: Bridging the Gap Between Quality and Resolution

Vlad Hosu, Lorenzo Agnolucci, Daisuke Iso et al.

ICCV 2025
not collected
#37

Edicho: Consistent Image Editing in the Wild

Qingyan Bai, Hao Ouyang, Yinghao Xu et al.

ICCV 2025
not collected
#38

LACONIC: A 3D Layout Adapter for Controllable Image Creation

Léopold Maillard, Tom Durand, Adrien RAMANANA RAHARY et al.

ICCV 2025
not collected
#39

ADCD-Net: Robust Document Image Forgery Localization via Adaptive DCT Feature and Hierarchical Content Disentanglement

KA WONG, Jicheng Zhou, Haiwei Wu et al.

ICCV 2025
not collected
#40

SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion

Ahmed Nassar, Matteo Omenetti, Maksym Lysak et al.

ICCV 2025
not collected
#41

LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer

Ning Yu, Chia-Chih Chen, Zeyuan Chen et al.

ECCV 2024
not collected
#42

Layout-Corrector: Alleviating Layout Sticking Phenomenon in Discrete Diffusion Model

Shoma Iwai, Atsuki Osanai, Shunsuke Kitada et al.

ECCV 2024
not collected
#43

Taming CLIP for Fine-grained and Structured Visual Understanding of Museum Exhibits

Ada-Astrid Balauca, Danda Paudel, Kristina Toutanova et al.

ECCV 2024
not collected
#44

GLIC: General Format Learned Image Compression

MingSheng Zhou, MingMing Kong

AAAI 2025
not collected
#45

Table of Contents

AAAI 2024 · arXiv:2212.02896
table of contents extraction, document understanding, multimodal feature fusion, hierarchical relationship parsing, +4
not collected