Prompt Engineering
Designing effective prompts for LLMs
Top Papers
From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline
Tianle Li, Wei-Lin Chiang, Evan Frick et al.
Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers
Chenglei Si, Diyi Yang, Tatsunori Hashimoto
Knowledge Graph Prompting for Multi-Document Question Answering
Yu Wang, Nedim Lipka, Ryan A. Rossi et al.
Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge
Jiayi Ye, Yanbo Wang, Yue Huang et al.
BooookScore: A systematic exploration of book-length summarization in the era of LLMs
Yapei Chang, Kyle Lo, Tanya Goyal et al.
JudgeBench: A Benchmark for Evaluating LLM-Based Judges
Sijun Tan, Siyuan Zhuang, Kyle Montgomery et al.
SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal
Tinghao Xie, Xiangyu Qi, Yi Zeng et al.
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian et al.
AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs
Anselm Paulus, Arman Zharmagambetov, Chuan Guo et al.
LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory
Di Wu, Hongwei Wang, Wenhao Yu et al.
ToolACE: Winning the Points of LLM Function Calling
Weiwen Liu, Xu Huang, Xingshan Zeng et al.
Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control
Longtao Zheng, Rundong Wang, Xinrun Wang et al.
Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents
Hanrong Zhang, Jingyuan Huang, Kai Mei et al.
OR-Bench: An Over-Refusal Benchmark for Large Language Models
Jiaxing Cui, Wei-Lin Chiang, Ion Stoica et al.
Curiosity-driven Red-teaming for Large Language Models
Zhang-Wei Hong, Idan Shenfeld, Johnson (Tsun-Hsuan) Wang et al.
Eliciting Human Preferences with Language Models
Belinda Li, Alex Tamkin, Noah Goodman et al.
Mega-TTS 2: Boosting Prompting Mechanisms for Zero-Shot Speech Synthesis
Ziyue Jiang, Jinglin Liu, Yi Ren et al.
Enhancing Job Recommendation through LLM-Based Generative Adversarial Networks
Yingpeng Du, Di Luo, Rui Yan et al.
Programming Refusal with Conditional Activation Steering
Bruce W. Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy et al.
PromptTTS 2: Describing and Generating Voices with Text Prompt
Yichong Leng, Zhifang Guo, Kai Shen et al.
AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents
Ke Yang, Yao Liu, Sapana Chaudhary et al.
Does Refusal Training in LLMs Generalize to the Past Tense?
Maksym Andriushchenko, Nicolas Flammarion
Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge
Swarnadeep Saha, Xian Li, Marjan Ghazvininejad et al.
Causal Order: The Key to Leveraging Imperfect Experts in Causal Inference
Aniket Vashishtha, Abbavaram Gowtham Reddy, Abhinav Kumar et al.
DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM
Yixuan Wu, Yizhou Wang, Shixiang Tang et al.
V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning
Hang Hua, Yunlong Tang, Chenliang Xu et al.
Learning How Hard to Think: Input-Adaptive Allocation of LM Computation
Mehul Damani, Idan Shenfeld, Andi Peng et al.
How efficient is LLM-generated code? A rigorous & high-standard benchmark
Ruizhong Qiu, Weiliang Zeng, James Ezick et al.
ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance
Chunwei Wang, Guansong Lu, Junwei Yang et al.
Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement
Jaehun Jung, Faeze Brahman, Yejin Choi
MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs
Yusu Qian, Hanrong Ye, Jean-Philippe Fauconnier et al.
PREFER: Prompt Ensemble Learning via Feedback-Reflect-Refine
Chenrui Zhang, Lin Liu, Chuyuan Wang et al.
HSEvo: Elevating Automatic Heuristic Design with Diversity-Driven Harmony Search and Genetic Algorithm Using LLMs
Pham Vu Tuan Dat, Long Doan, Huynh Thi Thanh Binh
Agents' Room: Narrative Generation through Multi-step Collaboration
Fantine Huot, Reinald Kim Amplayo, Jennimaria Palomaki et al.
MathAttack: Attacking Large Language Models towards Math Solving Ability
Zihao Zhou, Qiufeng Wang, Mingyu Jin et al.
Evaluating the Evaluator: Measuring LLMs' Adherence to Task Evaluation Instructions
Bhuvanashree Murugadoss, Christian Poelitz, Ian Drosos et al.
PAD: Personalized Alignment of LLMs at Decoding-time
Ruizhe Chen, Xiaotian Zhang, Meng Luo et al.
Proactive Agent: Shifting LLM Agents from Reactive Responses to Active Assistance
Yaxi Lu, Shenzhi Yang, Cheng Qian et al.
Adversarial Prompt Tuning for Vision-Language Models
Jiaming Zhang, Xingjun Ma, Xin Wang et al.
Preference Optimization for Reasoning with Pseudo Feedback
Fangkai Jiao, Geyang Guo, Xingxing Zhang et al.
SCALM: Detecting Bad Practices in Smart Contracts Through LLMs
Zongwei Li, Xiaoqi Li, Wenkai Li et al.
Latent Space Chain-of-Embedding Enables Output-free LLM Self-Evaluation
Yiming Wang, Pei Zhang, Baosong Yang et al.
Modeling Future Conversation Turns to Teach LLMs to Ask Clarifying Questions
Michael Zhang, W. Bradley Knox, Eunsol Choi
Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs
Mantas Mazeika, Xuwang Yin, Rishub Tamirisa et al.
Open-World Human-Object Interaction Detection via Multi-modal Prompts
Jie Yang, Bingliang Li, Ailing Zeng et al.
Prompt Compression with Context-Aware Sentence Encoding for Fast and Improved LLM Inference
Barys Liskavets, Maxim Ushakov, Shuvendu Roy et al.
Meta-Prompting for Automating Zero-shot Visual Recognition with LLMs
Muhammad Jehanzeb Mirza, Leonid Karlinsky, Wei Lin et al.
Soft Prompt Generation for Domain Generalization
Shuanghao Bai, Yuedi Zhang, Wanqi Zhou et al.
What Makes Large Language Models Reason in (Multi-Turn) Code Generation?
Kunhao Zheng, Juliette Decugis, Jonas Gehring et al.
Self-Boosting Large Language Models with Synthetic Preference Data
Qingxiu Dong, Li Dong, Xingxing Zhang et al.
LAMM: Label Alignment for Multi-Modal Prompt Learning
Jingsheng Gao, Jiacheng Ruan, Suncheng Xiang et al.
Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking
Benjamin Feuer, Micah Goldblum, Teresa Datta et al.
CAD-GPT: Synthesising CAD Construction Sequence with Spatial Reasoning-Enhanced Multimodal LLMs
Siyu Wang, Cailian Chen, Xinyi Le et al.
LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?
Zihan Zheng, Zerui Cheng, Zeyu Shen et al.
Cascade Prompt Learning for Visual-Language Model Adaptation
Ge Wu, Xin Zhang, Zheng Li et al.
MMQA: Evaluating LLMs with Multi-Table Multi-Hop Complex Questions
Jian Wu, Linyi Yang, Dongyuan Li et al.
Semantics-Adaptive Activation Intervention for LLMs via Dynamic Steering Vectors
Weixuan Wang, Jingyuan Yang, Wei Peng
Unknown Prompt, the only Lacuna: Unveiling CLIP's Potential for Open Domain Generalization
Mainak Singha, Ankit Jha, Shirsha Bose et al.
Truthful Aggregation of LLMs with an Application to Online Advertising
Ermis Soumalias, Michael Curry, Sven Seuken
ClinicalLab: Aligning Agents for Multi-Departmental Clinical Diagnostics in the Real World
Weixiang Yan, Haitian Liu, Tengxiao Wu et al.
Diverse Preference Learning for Capabilities and Alignment
Stewart Slocum, Asher Parker-Sartori, Dylan Hadfield-Menell
Enhancing Decision-Making for LLM Agents via Step-Level Q-Value Models
Yuanzhao Zhai, Tingkai Yang, Kele Xu et al.
SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering
Zouying Cao, Yifei Yang, Hai Zhao
Unveiling the Ignorance of MLLMs: Seeing Clearly, Answering Incorrectly
Yexin Liu, Zhengyang Liang, Yueze Wang et al.
Reducing Tool Hallucination via Reliability Alignment
Hongshen Xu, Zichen Zhu, Lei Pan et al.
Efficiently Scaling LLM Reasoning Programs with Certaindex
Yichao Fu, Junda Chen, Siqi Zhu et al.
Mechanism Design for LLM Fine-tuning with Multiple Reward Models
Haoran Sun, Yurong Chen, Siwei Wang et al.
Investigating Non-Transitivity in LLM-as-a-Judge
Yi Xu, Laura Ruis, Tim Rocktäschel et al.
Customizing Language Model Responses with Contrastive In-Context Learning
Xiang Gao, Kamalika Das
Boosting of Thoughts: Trial-and-Error Problem Solving with Large Language Models
Sijia Chen, Baochun Li, Di Niu
Playing the Fool: Jailbreaking LLMs and Multimodal LLMs with Out-of-Distribution Strategy
Joonhyun Jeong, Seyun Bae, Yeonsung Jung et al.
Generating Novel Leads for Drug Discovery Using LLMs with Logical Feedback
Shreyas Bhat Brahmavar, Ashwin Srinivasan, Tirtharaj Dash et al.
OVOR: OnePrompt with Virtual Outlier Regularization for Rehearsal-Free Class-Incremental Learning
Wei-Cheng Huang, Chun-Fu Chen, Hsiang Hsu
One-stage Prompt-based Continual Learning
Youngeun Kim, Yuhang Li, Priyadarshini Panda
Memory Injection Attacks on LLM Agents via Query-Only Interaction
Shen Dong, Shaochen Xu, Pengfei He et al.
MMedPO: Aligning Medical Vision-Language Models with Clinical-Aware Multimodal Preference Optimization
Kangyu Zhu, Peng Xia, Yun Li et al.
LLMs Can Plan Only If We Tell Them
Bilgehan Sel, Ruoxi Jia, Ming Jin
Controllable Navigation Instruction Generation with Chain of Thought Prompting
Xianghao Kong, Jinyu Chen, Wenguan Wang et al.
BEST-Route: Adaptive LLM Routing with Test-Time Optimal Compute
Dujian Ding, Ankur Mallick, Shaokun Zhang et al.
RocketEval: Efficient automated LLM evaluation via grading checklist
Tianjun Wei, Wei Wen, Ruizhi Qiao et al.
Get an A in Math: Progressive Rectification Prompting
Zhenyu Wu, Meng Jiang, Chao Shen
ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems
Xiangyuan Xue, Zeyu Lu, Di Huang et al.
Grounding by Trying: LLMs with Reinforcement Learning-Enhanced Retrieval
Sheryl Hsu, Omar Khattab, Chelsea Finn et al.
Security Attacks on LLM-based Code Completion Tools
Wen Cheng, Ke Sun, Xinyu Zhang et al.
Language Guided Skill Discovery
Seungeun Rho, Laura Smith, Tianyu Li et al.
Learning to Learn Better Visual Prompts
Fengxiang Wang, Wanrong Huang, Shaowu Yang et al.
Compound Text-Guided Prompt Tuning via Image-Adaptive Cues
Hao Tan, Jun Li, Yizhuang Zhou et al.
EffiBench-X: A Multi-Language Benchmark for Measuring Efficiency of LLM-Generated Code
Yuhao Qing, Boyu Zhu, Mingzhe Du et al.
Detecting High-Stakes Interactions with Activation Probes
Alex McKenzie, Urja Pawar, Phil Blandfort et al.
xFinder: Large Language Models as Automated Evaluators for Reliable Evaluation
Qingchen Yu, Zifan Zheng, Shichao Song et al.
An Engorgio Prompt Makes Large Language Model Babble on
Jianshuo Dong, Ziyuan Zhang, Qingjie Zhang et al.
Citations and Trust in LLM Generated Responses
Yifan Ding, Matthew Facciani, Ellen Joyce et al.
Active Evaluation Acquisition for Efficient LLM Benchmarking
Yang Li, Jie Ma, Miguel Ballesteros et al.
Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLMs
Xinyu Fang, Zhijian Chen, Kai Lan et al.
Can Watermarked LLMs be Identified by Users via Crafted Prompts?
Aiwei Liu, Sheng Guan, Yiming Liu et al.
MemSim: A Bayesian Simulator for Evaluating Memory of LLM-based Personal Assistants
Zeyu Zhang, Quanyu Dai, Luyu Chen et al.
CyberPal.AI: Empowering LLMs with Expert-Driven Cybersecurity Instructions
Matan Levi, Yair Allouche, Daniel Ohayon et al.
SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks
Hwiwon Lee, Ziqi Zhang, Hanxiao Lu et al.
Prompt Risk Control: A Rigorous Framework for Responsible Deployment of Large Language Models
Thomas Zollo, Todd Morrill, Zhun Deng et al.
ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting
Shaofei Cai, Zihao Wang, Kewei Lian et al.