🧬Applications

Document Understanding

Understanding document images and layouts

45 papers, 306 total citations


Feb '24 to Jan '26: 42 papers

Top Conferences

CVPR: 15, ICCV: 13, AAAI: 8, ICLR: 4, ECCV: 3, ICML: 2

Top Papers

#1

General Object Foundation Model for Images and Videos at Scale

Junfeng Wu, Yi Jiang, Qihao Liu et al.

DocFormerv2: Local Features for Document Understanding

Srikar Appalaraju, Peng Tang, Qi Dong et al.

AAAI 2024, arXiv:2306.01733

visual document understanding, multi-modal transformer, local-feature alignment, document information extraction, +4

58 citations

#3

CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation

Hui Zhang, Dexiang Hong, Yitong Wang et al.

ICCV 2025, arXiv:2412.03859

layout-to-image generation, multimodal diffusion transformers, siamese network architecture, creative layout planning, +1

33 citations

#4

DocRes: A Generalist Model Toward Unifying Document Image Restoration Tasks

Jiaxin Zhang, Dezhi Peng, Chongyu Liu et al.

SSMG: Spatial-Semantic Map Guided Diffusion Model for Free-Form Layout-to-Image Generation

Chengyou Jia, Minnan Luo, Zhuohang Dang et al.

M2Doc: A Multi-Modal Fusion Approach for Document Layout Analysis

Ning Zhang, Hiuyi Cheng, Jiayu Chen et al.

RoDLA: Benchmarking the Robustness of Document Layout Analysis Models

Yufan Chen, Jiaming Zhang, Kunyu Peng et al.

FreeFlux: Understanding and Exploiting Layer-Specific Roles in RoPE-Based MMDiT for Versatile Image Editing

Tianyi Wei, Yifan Zhou, Dongdong Chen et al.

Knowledge-Enhanced Historical Document Segmentation and Recognition

En-Hao Gao, Yu-Xuan Huang, Wen-Chao Hu et al.

A Simple yet Effective Layout Token in Large Language Models for Document Understanding

Zhaoqing Zhu, Chuwei Luo, Zirui Shao et al.

CVPR 2025, arXiv:2503.18434

document understanding, layout token integration, positional encoding scheme, cross-modality learning, +3

7 citations

#11

"Principal Components" Enable A New Language of Images

Xin Wen, Bingchen Zhao, Ismail Elezi et al.

Text Grouping Adapter: Adapting Pre-trained Text Detector for Layout Analysis

Tianci Bi, Xiaoyi Zhang, Zhizheng Zhang et al.

From Elements to Design: A Layered Approach for Automatic Graphic Design Composition

Jiawei Lin, Shizhao Sun, Danqing Huang et al.

CVPR 2025, arXiv:2412.19712

graphic design composition, multimodal graphic elements, layer planning, large multimodal models, +3

5 citations

#14

LayerD: Decomposing Raster Graphic Designs into Layers

Tomoyuki Suzuki, Kang-Jun Liu, Naoto Inoue et al.

Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding

Han Xiao, Yina Xie, Guanxin Tan et al.

Enhancing Document Understanding with Group Position Embedding: A Novel Approach to Incorporate Layout Information

Yuke Zhu, Yue Zhang, Dongdong Liu et al.

Graph-based Document Structure Analysis

Yufan Chen, Ruiping Liu, Junwei Zheng et al.

Design Graph Guided Element Importance-Aware Layout Generation with Multi-Modality Cascade Transformer

Qiuyun Zhang, Bin Guo, Lina Yao et al.

FICGen: Frequency-Inspired Contextual Disentanglement for Layout-driven Degraded Image Generation

Wenzhuang Wang, Yifan Zhao, Mingcan Ma et al.

InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions

Ryota Tanaka, Taichi Iki, Kyosuke Nishida et al.

AAAI 2024, arXiv:2401.13313

visual document understanding, instruction-based models, multimodal large language models, zero-shot learning, +4

Citations: not collected

#21

Referring Image Editing: Object-level Image Editing via Referring Expressions

Chang Liu, Xiangtai Li, Henghui Ding

Visual Layout Composer: Image-Vector Dual Diffusion Model for Design Layout Generation

Mohammad Amin Shabani, Zhaowen Wang, Difan Liu et al.

HRVDA: High-Resolution Visual Document Assistant

Chaohu Liu, Kun Yin, Haoyu Cao et al.

Nougat: Neural Optical Understanding for Academic Documents

Lukas Blecher, Guillem Cucurull Preixens, Thomas Scialom et al.

ADOPD: A Large-Scale Document Page Decomposition Dataset

Jiuxiang Gu, Xiangxi Shi, Jason Kuen et al.

Decomposition of Graphic Design with Unified Multimodal Model

Hui Nie, Zhao Zhang, Yutao Cheng et al.

Compositional Image Decomposition with Diffusion Models

Jocelin Su, Nan Liu, Yanbo Wang et al.

ICML 2024

image decomposition, diffusion models, unsupervised learning, scene composition, +2

Citations: not collected

#28

DOGR: Towards Versatile Visual Document Grounding and Referring

Yinan Zhou, Yuxin Chen, Haokun Lin et al.

A Token-level Text Image Foundation Model for Document Understanding

Tongkun Guan, Zining Wang, Pei Fu et al.

Can Machines Understand Composition? Dataset and Benchmark for Photographic Image Composition Embedding and Understanding

Zhaoran Zhao, Peng Lu, Anran Zhang et al.

OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations

Linke Ouyang, Yuan Qu, Hongbin Zhou et al.

Empowering LLMs to Understand and Generate Complex Vector Graphics

XiMing Xing, Juncheng Hu, Guotao Liang et al.

Document Haystacks: Vision-Language Reasoning Over Piles of 1000+ Documents

Jun Chen, Dannong Xu, Junjie Fei et al.

AutoPresent: Designing Structured Visuals from Scratch

Jiaxin Ge, Zora Zhiruo Wang, Xuhui Zhou et al.

UnZipLoRA: Separating Content and Style from a Single Image

Chang Liu, Viraj Shah, Aiyu Cui et al.

Image Intrinsic Scale Assessment: Bridging the Gap Between Quality and Resolution

Vlad Hosu, Lorenzo Agnolucci, Daisuke Iso et al.

Edicho: Consistent Image Editing in the Wild

Qingyan Bai, Hao Ouyang, Yinghao Xu et al.

LACONIC: A 3D Layout Adapter for Controllable Image Creation

Léopold Maillard, Tom Durand, Adrien RAMANANA RAHARY et al.

ADCD-Net: Robust Document Image Forgery Localization via Adaptive DCT Feature and Hierarchical Content Disentanglement

KA WONG, Jicheng Zhou, Haiwei Wu et al.

SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion

Ahmed Nassar, Matteo Omenetti, Maksym Lysak et al.

LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer

Ning Yu, Chia-Chih Chen, Zeyuan Chen et al.

Layout-Corrector: Alleviating Layout Sticking Phenomenon in Discrete Diffusion Model

Shoma Iwai, Atsuki Osanai, Shunsuke Kitada et al.

Taming CLIP for Fine-grained and Structured Visual Understanding of Museum Exhibits

Ada-Astrid Balauca, Danda Paudel, Kristina Toutanova et al.

GLIC: General Format Learned Image Compression

MingSheng Zhou, MingMing Kong
