Prompt Engineering
Designing effective prompts for LLMs
Top Papers
From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline
Tianle Li, Wei-Lin Chiang, Evan Frick et al.
Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers
Chenglei Si, Diyi Yang, Tatsunori Hashimoto
Knowledge Graph Prompting for Multi-Document Question Answering
Yu Wang, Nedim Lipka, Ryan A. Rossi et al.
Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge
Jiayi Ye, Yanbo Wang, Yue Huang et al.
BooookScore: A systematic exploration of book-length summarization in the era of LLMs
Yapei Chang, Kyle Lo, Tanya Goyal et al.
JudgeBench: A Benchmark for Evaluating LLM-Based Judges
Sijun Tan, Siyuan Zhuang, Kyle Montgomery et al.
SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal
Tinghao Xie, Xiangyu Qi, Yi Zeng et al.
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian et al.
AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs
Anselm Paulus, Arman Zharmagambetov, Chuan Guo et al.
LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory
Di Wu, Hongwei Wang, Wenhao Yu et al.
ToolACE: Winning the Points of LLM Function Calling
Weiwen Liu, Xu Huang, Xingshan Zeng et al.
Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control
Longtao Zheng, Rundong Wang, Xinrun Wang et al.
Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents
Hanrong Zhang, Jingyuan Huang, Kai Mei et al.
OR-Bench: An Over-Refusal Benchmark for Large Language Models
Jiaxing Cui, Wei-Lin Chiang, Ion Stoica et al.
Curiosity-driven Red-teaming for Large Language Models
Zhang-Wei Hong, Idan Shenfeld, Johnson (Tsun-Hsuan) Wang et al.
Eliciting Human Preferences with Language Models
Belinda Li, Alex Tamkin, Noah Goodman et al.
Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis
Ziyue Jiang, Jinglin Liu, Yi Ren et al.
Enhancing Job Recommendation through LLM-Based Generative Adversarial Networks
Yingpeng Du, Di Luo, Rui Yan et al.
Programming Refusal with Conditional Activation Steering
Bruce W. Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy et al.
PromptTTS 2: Describing and Generating Voices with Text Prompt
Yichong Leng, Zhifang Guo, Kai Shen et al.
AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents
Ke Yang, Yao Liu, Sapana Chaudhary et al.
Does Refusal Training in LLMs Generalize to the Past Tense?
Maksym Andriushchenko, Nicolas Flammarion
Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge
Swarnadeep Saha, Xian Li, Marjan Ghazvininejad et al.
Causal Order: The Key to Leveraging Imperfect Experts in Causal Inference
Aniket Vashishtha, Abbavaram Gowtham Reddy, Abhinav Kumar et al.
DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM
Yixuan Wu, Yizhou Wang, Shixiang Tang et al.
V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning
Hang Hua, Yunlong Tang, Chenliang Xu et al.
Learning How Hard to Think: Input-Adaptive Allocation of LM Computation
Mehul Damani, Idan Shenfeld, Andi Peng et al.
How efficient is LLM-generated code? A rigorous & high-standard benchmark
Ruizhong Qiu, Weiliang Zeng, James Ezick et al.
ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance
Chunwei Wang, Guansong Lu, Junwei Yang et al.
Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement
Jaehun Jung, Faeze Brahman, Yejin Choi
MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs
Yusu Qian, Hanrong Ye, Jean-Philippe Fauconnier et al.
PREFER: Prompt Ensemble Learning via Feedback-Reflect-Refine
Chenrui Zhang, Lin Liu, Chuyuan Wang et al.
HSEvo: Elevating Automatic Heuristic Design with Diversity-Driven Harmony Search and Genetic Algorithm Using LLMs
Pham Vu Tuan Dat, Long Doan, Huynh Thi Thanh Binh
Agents' Room: Narrative Generation through Multi-step Collaboration
Fantine Huot, Reinald Kim Amplayo, Jennimaria Palomaki et al.
MathAttack: Attacking Large Language Models towards Math Solving Ability
Zihao Zhou, Qiufeng Wang, Mingyu Jin et al.
Evaluating the Evaluator: Measuring LLMs' Adherence to Task Evaluation Instructions
Bhuvanashree Murugadoss, Christian Poelitz, Ian Drosos et al.
PAD: Personalized Alignment of LLMs at Decoding-time
Ruizhe Chen, Xiaotian Zhang, Meng Luo et al.
Proactive Agent: Shifting LLM Agents from Reactive Responses to Active Assistance
Yaxi Lu, Shenzhi Yang, Cheng Qian et al.
Adversarial Prompt Tuning for Vision-Language Models
Jiaming Zhang, Xingjun Ma, Xin Wang et al.
Preference Optimization for Reasoning with Pseudo Feedback
Fangkai Jiao, Geyang Guo, Xingxing Zhang et al.
SCALM: Detecting Bad Practices in Smart Contracts Through LLMs
Zongwei Li, Xiaoqi Li, Wenkai Li et al.
Latent Space Chain-of-Embedding Enables Output-free LLM Self-Evaluation
Yiming Wang, Pei Zhang, Baosong Yang et al.
Modeling Future Conversation Turns to Teach LLMs to Ask Clarifying Questions
Michael Zhang, W. Bradley Knox, Eunsol Choi
Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs
Mantas Mazeika, Xuwang Yin, Rishub Tamirisa et al.
Open-World Human-Object Interaction Detection via Multi-modal Prompts
Jie Yang, Bingliang Li, Ailing Zeng et al.
Prompt Compression with Context-Aware Sentence Encoding for Fast and Improved LLM Inference
Barys Liskavets, Maxim Ushakov, Shuvendu Roy et al.
Meta-Prompting for Automating Zero-shot Visual Recognition with LLMs
Muhammad Jehanzeb Mirza, Leonid Karlinsky, Wei Lin et al.
Soft Prompt Generation for Domain Generalization
Shuanghao Bai, Yuedi Zhang, Wanqi Zhou et al.
What Makes Large Language Models Reason in (Multi-Turn) Code Generation?
Kunhao Zheng, Juliette Decugis, Jonas Gehring et al.
Self-Boosting Large Language Models with Synthetic Preference Data
Qingxiu Dong, Li Dong, Xingxing Zhang et al.
LAMM: Label Alignment for Multi-Modal Prompt Learning
Jingsheng Gao, Jiacheng Ruan, Suncheng Xiang et al.
Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking
Benjamin Feuer, Micah Goldblum, Teresa Datta et al.
CAD-GPT: Synthesising CAD Construction Sequence with Spatial Reasoning-Enhanced Multimodal LLMs
Siyu Wang, Cailian Chen, Xinyi Le et al.
LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?
Zihan Zheng, Zerui Cheng, Zeyu Shen et al.
Cascade Prompt Learning for Visual-Language Model Adaptation
Ge Wu, Xin Zhang, Zheng Li et al.
MMQA: Evaluating LLMs with Multi-Table Multi-Hop Complex Questions
Jian Wu, Linyi Yang, Dongyuan Li et al.
Semantics-Adaptive Activation Intervention for LLMs via Dynamic Steering Vectors
Weixuan Wang, Jingyuan Yang, Wei Peng
Unknown Prompt, the only Lacuna: Unveiling CLIP's Potential for Open Domain Generalization
Mainak Singha, Ankit Jha, Shirsha Bose et al.
Truthful Aggregation of LLMs with an Application to Online Advertising
Ermis Soumalias, Michael Curry, Sven Seuken
ClinicalLab: Aligning Agents for Multi-Departmental Clinical Diagnostics in the Real World
Weixiang Yan, Haitian Liu, Tengxiao Wu et al.
Diverse Preference Learning for Capabilities and Alignment
Stewart Slocum, Asher Parker-Sartori, Dylan Hadfield-Menell
Enhancing Decision-Making for LLM Agents via Step-Level Q-Value Models
Yuanzhao Zhai, Tingkai Yang, Kele Xu et al.
SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering
Zouying Cao, Yifei Yang, Hai Zhao
Unveiling the Ignorance of MLLMs: Seeing Clearly, Answering Incorrectly
Yexin Liu, Zhengyang Liang, Yueze Wang et al.
Reducing Tool Hallucination via Reliability Alignment
Hongshen Xu, Zichen Zhu, Lei Pan et al.
Efficiently Scaling LLM Reasoning Programs with Certaindex
Yichao Fu, Junda Chen, Siqi Zhu et al.
Mechanism Design for LLM Fine-tuning with Multiple Reward Models
Haoran Sun, Yurong Chen, Siwei Wang et al.
Investigating Non-Transitivity in LLM-as-a-Judge
Yi Xu, Laura Ruis, Tim Rocktäschel et al.
Customizing Language Model Responses with Contrastive In-Context Learning
Xiang Gao, Kamalika Das
Boosting of Thoughts: Trial-and-Error Problem Solving with Large Language Models
Sijia Chen, Baochun Li, Di Niu
Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy
Joonhyun Jeong, Seyun Bae, Yeonsung Jung et al.
Generating Novel Leads for Drug Discovery Using LLMs with Logical Feedback
Shreyas Bhat Brahmavar, Ashwin Srinivasan, Tirtharaj Dash et al.
OVOR: OnePrompt with Virtual Outlier Regularization for Rehearsal-Free Class-Incremental Learning
Wei-Cheng Huang, Chun-Fu Chen, Hsiang Hsu
One-stage Prompt-based Continual Learning
Youngeun Kim, Yuhang Li, Priyadarshini Panda
Memory Injection Attacks on LLM Agents via Query-Only Interaction
Shen Dong, Shaochen Xu, Pengfei He et al.
MMedPO: Aligning Medical Vision-Language Models with Clinical-Aware Multimodal Preference Optimization
Kangyu Zhu, Peng Xia, Yun Li et al.
LLMs Can Plan Only If We Tell Them
Bilgehan Sel, Ruoxi Jia, Ming Jin
Controllable Navigation Instruction Generation with Chain of Thought Prompting
Xianghao Kong, Jinyu Chen, Wenguan Wang et al.
BEST-Route: Adaptive LLM Routing with Test-Time Optimal Compute
Dujian Ding, Ankur Mallick, Shaokun Zhang et al.
RocketEval: Efficient automated LLM evaluation via grading checklist
Tianjun Wei, Wei Wen, Ruizhi Qiao et al.
Get an A in Math: Progressive Rectification Prompting
Zhenyu Wu, Meng Jiang, Chao Shen
ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems
Xiangyuan Xue, Zeyu Lu, Di Huang et al.
Grounding by Trying: LLMs with Reinforcement Learning-Enhanced Retrieval
Sheryl Hsu, Omar Khattab, Chelsea Finn et al.
Security Attacks on LLM-based Code Completion Tools
Wen Cheng, Ke Sun, Xinyu Zhang et al.
Language Guided Skill Discovery
Seungeun Rho, Laura Smith, Tianyu Li et al.
Learning to Learn Better Visual Prompts
Fengxiang Wang, Wanrong Huang, Shaowu Yang et al.
Compound Text-Guided Prompt Tuning via Image-Adaptive Cues
Hao Tan, Jun Li, Yizhuang Zhou et al.
EffiBench-X: A Multi-Language Benchmark for Measuring Efficiency of LLM-Generated Code
Yuhao Qing, Boyu Zhu, Mingzhe Du et al.
Detecting High-Stakes Interactions with Activation Probes
Alex McKenzie, Urja Pawar, Phil Blandfort et al.
xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation
Qingchen Yu, Zifan Zheng, Shichao Song et al.
An Engorgio Prompt Makes Large Language Model Babble on
Jianshuo Dong, Ziyuan Zhang, Qingjie Zhang et al.
Citations and Trust in LLM Generated Responses
Yifan Ding, Matthew Facciani, Ellen Joyce et al.
Active Evaluation Acquisition for Efficient LLM Benchmarking
Yang Li, Jie Ma, Miguel Ballesteros et al.
Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLMs
Xinyu Fang, Zhijian Chen, Kai Lan et al.
Can Watermarked LLMs be Identified by Users via Crafted Prompts?
Aiwei Liu, Sheng Guan, Yiming Liu et al.
MemSim: A Bayesian Simulator for Evaluating Memory of LLM-based Personal Assistants
Zeyu Zhang, Quanyu Dai, Luyu Chen et al.
CyberPal.AI: Empowering LLMs with Expert-Driven Cybersecurity Instructions
Matan Levi, Yair Allouche, Daniel Ohayon et al.
SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks
Hwiwon Lee, Ziqi Zhang, Hanxiao Lu et al.
Prompt Risk Control: A Rigorous Framework for Responsible Deployment of Large Language Models
Thomas Zollo, Todd Morrill, Zhun Deng et al.
ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting
Shaofei Cai, Zihao Wang, Kewei Lian et al.