Large Language Models

LLMs including GPT, LLaMA, and scaling laws

100 papers, 27,818 total citations

Feb '24 to Jan '26: 3,147 papers

Top Conferences

ICLR: 66, CVPR: 8, AAAI: 8, ECCV: 8, NeurIPS: 7, ICML: 2

Top Papers

#1

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Zhe Chen, Jiannan Wu, Wenhai Wang et al.

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Chaoyou Fu, Peixian Chen, Yunhang Shen et al.

MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

Pan Lu, Hritik Bansal, Tony Xia et al.

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

Yujia Qin, Shihao Liang, Yining Ye et al.

A Generalist Agent

Jackie Kay, Sergio Gómez Colmenarejo, Mahyar Bordbar et al.

LISA: Reasoning Segmentation via Large Language Model

Xin Lai, Zhuotao Tian, Yukang Chen et al.

WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct

Haipeng Luo, Qingfeng Sun, Can Xu et al.

MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

Longhui Yu, Weisen Jiang, Han Shi et al.

Language Model Beats Diffusion - Tokenizer is key to visual generation

Lijun Yu, José Lezama, Nitesh Bharadwaj Gundavarapu et al.

Benchmarking Large Language Models in Retrieval-Augmented Generation

Jiawei Chen, Hongyu Lin, Xianpei Han et al.

AAAI 2024, arXiv:2309.01431

retrieval-augmented generation, large language models, noise robustness, negative rejection, +4 more

458 citations

#11

SALMONN: Towards Generic Hearing Abilities for Large Language Models

Changli Tang, Wenyi Yu, Guangzhi Sun et al.

Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation

Yangsibo Huang, Samyak Gupta, Mengzhou Xia et al.

YaRN: Efficient Context Window Extension of Large Language Models

Bowen Peng, Jeffrey Quesnelle, Honglu Fan et al.

Causal Reasoning and Large Language Models: Opening a New Frontier for Causality

Chenhao Tan, Robert Ness, Amit Sharma et al.

Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models

Zhang Li, Biao Yang, Qiang Liu et al.

Prometheus: Inducing Fine-Grained Evaluation Capability in Language Models

Seungone Kim, Jamin Shin, Yejin Cho et al.

Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs

Suyu Ge, Yunan Zhang, Liyuan Liu et al.

Large Language Models Are Not Robust Multiple Choice Selectors

Chujie Zheng, Hao Zhou, Fandong Meng et al.

OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation

Qidong Huang, Xiaoyi Dong, Pan Zhang et al.

Generative Verifiers: Reward Modeling as Next-Token Prediction

Lunjun Zhang, Arian Hosseini, Hritik Bansal et al.

OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models

Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang et al.

OpenChat: Advancing Open-source Language Models with Mixed-Quality Data

Guan Wang, Sijie Cheng, Xianyuan Zhan et al.

Scaling and evaluating sparse autoencoders

Leo Gao, Tom Dupre la Tour, Henk Tillman et al.

ICLR 2025, arXiv:2406.04093

sparse autoencoders, language model interpretability, feature extraction, k-sparse autoencoders, +4 more

298 citations

#24

SliceGPT: Compress Large Language Models by Deleting Rows and Columns

Saleh Ashkboos, Maximilian Croci, Marcelo Gennari do Nascimento et al.

Safety Alignment Should be Made More Than Just a Few Tokens Deep

Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu et al.

NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models

Gengze Zhou, Yicong Hong, Qi Wu

AAAI 2024, arXiv:2305.16986

vision-and-language navigation, large language models, instruction-following navigation, zero-shot action prediction, +4 more

276 citations

#27

DreamLLM: Synergistic Multimodal Comprehension and Creation

Runpei Dong, Chunrui Han, Yuang Peng et al.

Mixture-of-Agents Enhances Large Language Model Capabilities

Junlin Wang, Jue Wang, Ben Athiwaratkun et al.

Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers

Chenglei Si, Diyi Yang, Tatsunori Hashimoto

Provable Robust Watermarking for AI-Generated Text

Xuandong Zhao, Prabhanjan Ananth, Lei Li et al.

Understanding the Effects of RLHF on LLM Generalisation and Diversity

Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis et al.

Large Language Models as Tool Makers

Tianle Cai, Xuezhi Wang, Tengyu Ma et al.

On Scaling Up a Multilingual Vision and Language Model

Xi Chen, Josip Djolonga, Piotr Padlewski et al.

Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large Language Models in Knowledge Conflicts

Jian Xie, Kai Zhang, Jiangjie Chen et al.

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier et al.

GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting

Kai Zhang, Sai Bi, Hao Tan et al.

The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh Vahid et al.

AnomalyGPT: Detecting Industrial Anomalies Using Large Vision-Language Models

Zhaopeng Gu, Bingke Zhu, Guibo Zhu et al.

AAAI 2024, arXiv:2308.15366

industrial anomaly detection, vision-language models, anomaly localization, few-shot learning, +4 more

240 citations

#39

SaProt: Protein Language Modeling with Structure-aware Vocabulary

Jin Su, Chenchen Han, Yuyang Zhou et al.

Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models

Jimeng Sun, Shubhendu Trivedi, Zhen Lin

Language Models Represent Space and Time

Wes Gurnee, Max Tegmark

Sequential Modeling Enables Scalable Learning for Large Vision Models

Yutong Bai, Xinyang Geng, Karttikeya Mangalam et al.

mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

Jiabo Ye, Haiyang Xu, Haowei Liu et al.

RECOMP: Improving Retrieval-Augmented LMs with Context Compression and Selective Augmentation

Fangyuan Xu, Weijia Shi, Eunsol Choi

Listen, Think, and Understand

Yuan Gong, Hongyin Luo, Alexander Liu et al.

Generative Representational Instruction Tuning

Niklas Muennighoff, Hongjin Su, Liang Wang et al.

R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

Jingyi Zhang, Jiaxing Huang, Huanjin Yao et al.

Zhongjing: Enhancing the Chinese Medical Capabilities of Large Language Model through Expert Feedback and Real-World Multi-Turn Dialogue

Songhua Yang, Hanjie Zhao, Senbin Zhu et al.

AAAI 2024, arXiv:2308.03549

large language models, chinese medicine, expert feedback, multi-turn dialogue, +4 more

204 citations

#49

Solving Challenging Math Word Problems Using GPT-4 Code Interpreter with Code-based Self-Verification

Aojun Zhou, Ke Wang, Zimu Lu et al.

Why Do Multi-Agent LLM Systems Fail?

Mert Cemri, Melissa Z Pan, Shuyi Yang et al.

NeurIPS 2025, arXiv:2503.13657

multi-agent llm systems, failure pattern analysis, system failure taxonomy, llm-as-a-judge, +3 more

188 citations

#51

Think before you speak: Training Language Models With Pause Tokens

Sachin Goyal, Ziwei Ji, Ankit Singh Rawat et al.

MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models

Xin Liu, Yichen Zhu, Jindong Gu et al.

ECCV 2024, arXiv:2311.17600

multimodal large language models, safety evaluation, image-based manipulations, adversarial attacks, +4 more

183 citations

#53

Inverse Scaling: When Bigger Isn't Better

Joe Cavanagh, Andrew Gritsevskiy, Najoung Kim et al.

ReLoRA: High-Rank Training Through Low-Rank Updates

Vladislav Lialin, Sherin Muckatira, Namrata Shivagunde et al.

On the Reliability of Watermarks for Large Language Models

John Kirchenbauer, Jonas Geiping, Yuxin Wen et al.

LLaVA-UHD: an LMM Perceiving any Aspect Ratio and High-Resolution Images

Zonghao Guo, Ruyi Xu, Yuan Yao et al.

G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model

Jiahui Gao, Renjie Pi, Jipeng Zhang et al.

Can Large Language Models Infer Causation from Correlation?

Zhijing Jin, Jiarui Liu, Zhiheng Lyu et al.

Is Self-Repair a Silver Bullet for Code Generation?

Theo X. Olausson, Jeevana Priya Inala, Chenglong Wang et al.

Self-contradictory Hallucinations of Large Language Models: Evaluation, Detection and Mitigation

Niels Mündler, Jingxuan He, Slobodan Jenko et al.

The Unreasonable Ineffectiveness of the Deeper Layers

Andrey Gromov, Kushal Tirumala, Hassan Shapourian et al.

Can LLMs Keep a Secret? Testing Privacy Implications of Language Models via Contextual Integrity Theory

Niloofar Mireshghallah, Hyunwoo Kim, Xuhui Zhou et al.

MUSE: Machine Unlearning Six-Way Evaluation for Language Models

Weijia Shi, Jaechan Lee, Yangsibo Huang et al.

ICLR 2025, arXiv:2407.06460

machine unlearning, language models, privacy leakage, verbatim memorization, +4 more

157 citations

#64

DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

Guangxuan Xiao, Jiaming Tang, Jingwei Zuo et al.

BooookScore: A systematic exploration of book-length summarization in the era of LLMs

Yapei Chang, Kyle Lo, Tanya Goyal et al.

Training Language Models to Reason Efficiently

Daman Arora, Andrea Zanette

JudgeBench: A Benchmark for Evaluating LLM-Based Judges

Sijun Tan, Siyuan Zhuang, Kyle Montgomery et al.

Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for LLM Problem-Solving

Yangzhen Wu, Zhiqing Sun, Shanda Li et al.

ICLR 2025

inference scaling laws, compute-optimal inference, large language models, test-time scaling, +4 more

146 citations

#69

WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild

Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu et al.

ICLR 2025, arXiv:2406.04770

large language models, automated evaluation framework, real-world user queries, pairwise comparison metrics, +3 more

142 citations

#70

Physics of Language Models: Part 3.2, Knowledge Manipulation

Zeyuan Allen-Zhu, Yuanzhi Li

SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal

Tinghao Xie, Xiangyu Qi, Yi Zeng et al.

ICLR 2025, arXiv:2406.14598

safety refusal evaluation, large language models, fine-grained taxonomies, linguistic augmentations, +4 more

141 citations

#72

Linearity of Relation Decoding in Transformer Language Models

Evan Hernandez, Arnab Sen Sharma, Tal Haklay et al.

AlphaEdit: Null-Space Constrained Knowledge Editing for Language Models

Junfeng Fang, Houcheng Jiang, Kun Wang et al.

Scaling Diffusion Language Models via Adaptation from Autoregressive Models

Shansan Gong, Shivam Agarwal, Yizhe Zhang et al.

Omni-MATH: A Universal Olympiad Level Mathematic Benchmark for Large Language Models

Bofei Gao, Feifan Song, Zhe Yang et al.

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

Jonas Geiping, Sean McLeish, Neel Jain et al.

Task Contamination: Language Models May Not Be Few-Shot Anymore

Changmao Li, Jeffrey Flanigan

AAAI 2024, arXiv:2312.16337

task contamination, few-shot learning, zero-shot learning, language model evaluation, +3 more

130 citations

#78

LongVLM: Efficient Long Video Understanding via Large Language Models

Yuetian Weng, Mingfei Han, Haoyu He et al.

ECCV 2024, arXiv:2404.03384

video understanding, large language models, hierarchical token merging, long video processing, +4 more

128 citations

#79

SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research

Liangtai Sun, Yang Han, Zihan Zhao et al.

AAAI 2024, arXiv:2308.13149

large language models, scientific evaluation benchmark, multi-disciplinary evaluation, bloom's taxonomy, +4 more

127 citations

#80

GSVA: Generalized Segmentation via Multimodal Large Language Models

Zhuofan Xia, Dongchen Han, Yizeng Han et al.

LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model

Dilxat Muhtar, Zhenshi Li, Feng Gu et al.

Adapting Large Language Models via Reading Comprehension

Daixuan Cheng, Shaohan Huang, Furu Wei

ST-LLM: Large Language Models Are Effective Temporal Learners

Ruyang Liu, Chen Li, Haoran Tang et al.

GenSim: Generating Robotic Simulation Tasks via Large Language Models

Lirui Wang, Yiyang Ling, Zhecheng Yuan et al.

Layer by Layer: Uncovering Hidden Representations in Language Models

Oscar Skean, Md Rifat Arefin, Dan Zhao et al.

SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

Xeron Du, Yifan Yao, Kaijing Ma et al.

NeurIPS 2025, arXiv:2502.14739

llm evaluation, graduate-level knowledge, specialized disciplines, benchmark construction, +4 more

118 citations

#87

TTRL: Test-Time Reinforcement Learning

Yuxin Zuo, Kaiyan Zhang, Li Sheng et al.

Teaching Arithmetic to Small Transformers

Nayoung Lee, Kartik Sreenivasan, Jason Lee et al.

SpikeGPT: Generative Pre-trained Language Model with Spiking Neural Networks

Rui-Jie Zhu, Qihang Zhao, Jason Eshraghian et al.

ToolACE: Winning the Points of LLM Function Calling

Weiwen Liu, Xu Huang, Xingshan Zeng et al.

ShapeLLM: Universal 3D Object Understanding for Embodied Interaction

Zekun Qi, Runpei Dong, Shaochen Zhang et al.

LLMCarbon: Modeling the End-to-End Carbon Footprint of Large Language Models

Ahmad Faiz, Sotaro Kaneda, Ruhan Wang et al.

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Jan Betley, Daniel Tan, Niels Warncke et al.

WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning

Zehan Qi, Xiao Liu, Iat Long Iong et al.

VideoLLM-online: Online Video Large Language Model for Streaming Video

Joya Chen, Zhaoyang Lv, Shiwei Wu et al.

Tamper-Resistant Safeguards for Open-Weight LLMs

Rishub Tamirisa, Bhrugu Bharathi, Long Phan et al.

ICLR 2025, arXiv:2408.00761

tamper-resistant safeguards, open-weight llms, model weight tampering, refusal safeguards, +3 more

108 citations

#97

Mitigating Large Language Model Hallucinations via Autonomous Knowledge Graph-Based Retrofitting

Xinyan Guan, Yanjiang Liu, Hongyu Lin et al.

AAAI 2024, arXiv:2311.13314

knowledge graph integration, large language model hallucination, factual knowledge retrieval, autonomous knowledge verification, +2 more

108 citations

#98

LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

Di Wu, Hongwei Wang, Wenhao Yu et al.

Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference

Han Zhao, Min Zhang, Wei Zhao et al.

A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models

Haoran Xu, Young Jin Kim, Amr Mohamed Nabil Aly Aly Sharaf et al.

ICLR 2024

105 citations
