PSMBench: A Benchmark and Dataset for Evaluating LLMs Extraction of Protocol State Machines from RFC Specifications

#2219 of 5858 papers in NeurIPS 2025

Top Authors

Zilin Shen, Xinyu Luo, Imtiaz Karim, Elisa Bertino

Topics

protocol state machines, RFC specifications, long-context reasoning, automated security analysis, semantic fuzzy-matching, graph structure generation, technical prose interpretation, protocol testing

Abstract

Accurately extracting protocol state machines (PSMs) from the long, densely written Request for Comments (RFC) standards that govern Internet-scale communication remains a bottleneck for automated security analysis and protocol testing. In this paper, we introduce RFC2PSM, the first large-scale dataset that pairs 1,580 pages of cleaned RFC text with 108 manually validated states and 297 transitions covering 14 widely deployed protocols spanning the data-link, transport, session, and application layers. Built on this corpus, we propose PsmBench, a benchmark that (i) feeds chunked RFC text to an LLM, (ii) prompts the model to emit a machine-readable PSM, and (iii) scores the output with structure-aware, semantic fuzzy-matching metrics that reward partially correct graphs. A comprehensive baseline study of nine state-of-the-art open and commercial LLMs reveals a persistent state–transition gap: models identify many individual states (up to $0.82$ F1) but struggle to assemble coherent transition graphs ($\leq 0.38$ F1), highlighting challenges in long-context reasoning, alias resolution, and action/event disambiguation. We release the dataset, evaluation code, and all model outputs as open source, providing a fully reproducible starting point for future work on reasoning over technical prose and generating executable graph structures. RFC2PSM and PsmBench aim to catalyze cross-disciplinary progress toward LLMs that can interpret and verify the protocols that keep the Internet safe.
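
To make step (iii) concrete, the following is a minimal sketch of how a fuzzy-matching F1 over states and transitions might be computed. It is not the PsmBench implementation: the abstract does not specify the metrics, so the function names (`similarity`, `fuzzy_f1`, `transition_similarity`), the greedy one-to-one matching, the 0.8 threshold, and the use of `difflib.SequenceMatcher` as a stand-in for a semantic matcher are all assumptions made for illustration. The TCP-style state names are likewise illustrative and are not taken from the dataset.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for a semantic matcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def transition_similarity(t1, t2) -> float:
    """A transition is (source, event, target); its score is the weakest field match."""
    return min(similarity(x, y) for x, y in zip(t1, t2))


def fuzzy_f1(predicted, reference, sim, threshold=0.8):
    """Greedily match each prediction to at most one reference item, then compute F1.

    The threshold is an arbitrary illustrative choice, not a published value.
    """
    unmatched = list(reference)
    hits = 0
    for p in predicted:
        # Pick the closest still-unmatched reference item, if any remain.
        best = max(unmatched, key=lambda r: sim(p, r), default=None)
        if best is not None and sim(p, best) >= threshold:
            unmatched.remove(best)
            hits += 1
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical model output compared against a hypothetical ground truth.
ref_states = ["CLOSED", "LISTEN", "SYN_SENT", "ESTABLISHED"]
pred_states = ["CLOSED", "LISTEN", "SYN-SENT", "ESTAB"]
state_f1 = fuzzy_f1(pred_states, ref_states, similarity)

ref_transitions = [
    ("CLOSED", "passive OPEN", "LISTEN"),
    ("LISTEN", "rcv SYN", "SYN_RCVD"),
]
pred_transitions = [("CLOSED", "passive OPEN", "LISTEN")]
transition_f1 = fuzzy_f1(pred_transitions, ref_transitions, transition_similarity)
```

In this sketch a transition counts as matched only if its source, event, and target all match, so a single misnamed state or event invalidates the whole edge. That may be one reason transition scores sit well below state scores, consistent with the gap the abstract reports.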
