Most Cited 2025 "contour tracing" Papers
22,274 papers found • Page 2 of 112
Conference
MoBA: Mixture of Block Attention for Long-Context LLMs
Enzhe Lu, Zhejun Jiang, Jingyuan Liu et al.
CBraMod: A Criss-Cross Brain Foundation Model for EEG Decoding
Jiquan Wang, Sha Zhao, Zhiling Luo et al.
DEIM: DETR with Improved Matching for Fast Convergence
Shihua Huang, Zhichao Lu, Xiaodong Cun et al.
Deconstructing Denoising Diffusion Models for Self-Supervised Learning
Xinlei Chen, Zhuang Liu, Saining Xie et al.
DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation
Yuang Peng, Yuxin Cui, Haomiao Tang et al.
ColPali: Efficient Document Retrieval with Vision Language Models
Manuel Faysse, Hugues Sibille, Tony Wu et al.
LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias
Haian Jin, Hanwen Jiang, Hao Tan et al.
ImgEdit: A Unified Image Editing Dataset and Benchmark
Yang Ye, Xianyi He, Zongjian Li et al.
When Attention Sink Emerges in Language Models: An Empirical View
Xiangming Gu, Tianyu Pang, Chao Du et al.
SVDQuant: Absorbing Outliers by Low-Rank Component for 4-Bit Diffusion Models
Muyang Li, Yujun Lin, Zhekai Zhang et al.
OmniEdit: Building Image Editing Generalist Models Through Specialist Supervision
Cong Wei, Zheyang Xiong, Weiming Ren et al.
RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete
Yuheng Ji, Huajie Tan, Jiayu Shi et al.
Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency
Jianwen Jiang, Chao Liang, Jiaqi Yang et al.
Not All Language Model Features Are One-Dimensionally Linear
Josh Engels, Eric Michaud, Isaac Liao et al.
Theoretical guarantees on the best-of-n alignment policy
Ahmad Beirami, Alekh Agarwal, Jonathan Berant et al.
DimensionX: Create Any 3D and 4D Scenes from a Single Image with Decoupled Video Diffusion
Wenqiang Sun, Shuo Chen, Fangfu Liu et al.
Improving Diffusion Inverse Problem Solving with Decoupled Noise Annealing
Bingliang Zhang, Wenda Chu, Julius Berner et al.
Remarkable Robustness of LLMs: Stages of Inference?
Vedang Lad, Jin Hwa Lee, Wes Gurnee et al.
MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency
Dongzhi Jiang, Renrui Zhang, Ziyu Guo et al.
OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models
gaojie lin, Jianwen Jiang, Jiaqi Yang et al.
Kolmogorov-Arnold Transformer
Xingyi Yang, Xinchao Wang
RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics
Chan Hee Song, Valts Blukis, Jonathan Tremblay et al.
Learning Smooth and Expressive Interatomic Potentials for Physical Property Prediction
Xiang Fu, Brandon Wood, Luis Barroso-Luque et al.
Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining
Rosie Zhao, Alexandru Meterez, Sham M. Kakade et al.
LiveBench: A Challenging, Contamination-Limited LLM Benchmark
Colin White, Samuel Dooley, Manley Roberts et al.
AdaIR: Adaptive All-in-One Image Restoration via Frequency Mining and Modulation
Yuning Cui, Syed Waqas Zamir, Salman Khan et al.
Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
Weifeng Lin, Xinyu Wei, Ruichuan An et al.
RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval
Di Liu, Meng Chen, Baotong Lu et al.
Making Text Embedders Few-Shot Learners
Chaofan Li, Minghao Qin, Shitao Xiao et al.
Remasking Discrete Diffusion Models with Inference-Time Scaling
Guanghan Wang, Yair Schiff, Subham Sahoo et al.
TimeMixer++: A General Time Series Pattern Machine for Universal Predictive Analysis
Shiyu Wang, Jiawei LI, Xiaoming Shi et al.
Vision-LSTM: xLSTM as Generic Vision Backbone
Benedikt Alkin, Maximilian Beck, Korbinian Pöppel et al.
Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation
Yang Tian, Sizhe Yang, Jia Zeng et al.
AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation
Jiafei Duan, Wilbert Pumacay, Nishanth Kumar et al.
Robust Watermarking Using Generative Priors Against Image Editing: From Benchmarking to Advances
Shilin Lu, Zihan Zhou, Jiayou Lu et al.
DriveDreamer4D: World Models Are Effective Data Machines for 4D Driving Scene Representation
Guosheng Zhao, Chaojun Ni, Xiaofeng Wang et al.
MambaIRv2: Attentive State Space Restoration
Hang Guo, Yong Guo, Yaohua Zha et al.
Training-free Camera Control for Video Generation
Chen Hou, Zhibo Chen
Stable Virtual Camera: Generative View Synthesis with Diffusion Models
Jensen Zhou, Hang Gao, Vikram Voleti et al.
d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning
Siyan Zhao, Devaansh Gupta, Qinqing Zheng et al.
Unlocking Guidance for Discrete State-Space Diffusion and Flow Models
Hunter Nisonoff, Junhao Xiong, Stephan Allenspach et al.
Soft Merging of Experts with Adaptive Routing
Haokun Liu, Muqeeth Mohammed, Colin Raffel
DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models
Chengke Zou, Xingang Guo, Rui Yang et al.
OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning
Shihao Wang, Zhiding Yu, Xiaohui Jiang et al.
DepthFM: Fast Generative Monocular Depth Estimation with Flow Matching
Ming Gui, Johannes Schusterbauer, Ulrich Prestel et al.
Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving
Yong Lin, Shange Tang, Bohan Lyu et al.
Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs
Minh Nguyen, Andrew Baker, Clement Neo et al.
General-Reasoner: Advancing LLM Reasoning Across All Domains
Xueguang Ma, Qian Liu, Dongfu Jiang et al.
MV-DUSt3R+: Single-Stage Scene Reconstruction from Sparse Views In 2 Seconds
Zhenggang Tang, Yuchen Fan, Dilin Wang et al.
WebDancer: Towards Autonomous Information Seeking Agency
Jialong Wu, Baixuan Li, Runnan Fang et al.
LMFusion: Adapting Pretrained Language Models for Multimodal Generation
Weijia Shi, Xiaochuang Han, Chunting Zhou et al.
Point Cloud Mamba: Point Cloud Learning via State Space Model
Tao Zhang, Haobo Yuan, Lu Qi et al.
Consistency Models Made Easy
Zhengyang Geng, Ashwini Pokle, Weijian Luo et al.
MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs
jiarui zhang, Mahyar Khayatkhoei, Prateek Chhikara et al.
Teaching Large Language Models to Regress Accurate Image Quality Scores Using Score Distribution
Zhiyuan You, Xin Cai, Jinjin Gu et al.
CFG++: Manifold-constrained Classifier Free Guidance for Diffusion Models
Hyungjin Chung, Jeongsol Kim, Geon Yeong Park et al.
Real-Time Video Generation with Pyramid Attention Broadcast
Xuanlei Zhao, Xiaolong Jin, Kai Wang et al.
WebPilot: A Versatile and Autonomous Multi-Agent System for Web Task Execution with Strategic Exploration
Yao Zhang, Zijian Ma, Yunpu Ma et al.
AnalogCoder: Analog Circuit Design via Training-Free Code Generation
Yao Lai, Sungyoung Lee, Guojin Chen et al.
MM-EMBED: UNIVERSAL MULTIMODAL RETRIEVAL WITH MULTIMODAL LLMS
Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi et al.
OWL: Optimized Workforce Learning for General Multi-Agent Assistance in Real-World Task Automation
Mengkang Hu, Yuhang Zhou, Wendong Fan et al.
Simplicity Prevails: Rethinking Negative Preference Optimization for LLM Unlearning
Chongyu Fan, Jiancheng Liu, Licong Lin et al.
RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation
Sergio Gómez Colmenarejo, Jost Springenberg, Jose Enrique Chen et al.
Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting
Zilong (Ryan) Wang, Zifeng Wang, Long Le et al.
AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers
Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian et al.
Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models
Javier Ferrando, Oscar Obeso, Senthooran Rajamanoharan et al.
A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders
David Chanin, James Wilken-Smith, Tomáš Dulka et al.
Generalization v.s. Memorization: Tracing Language Models’ Capabilities Back to Pretraining Data
Xinyi Wang, Antonis Antoniades, Yanai Elazar et al.
GraphRouter: A Graph-based Router for LLM Selections
Tao Feng, Yanzhen Shen, Jiaxuan You
Language models scale reliably with over-training and on downstream tasks
Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar et al.
Right Question is Already Half the Answer: Fully Unsupervised LLM Reasoning Incentivization
Qingyang Zhang, Haitao Wu, Changqing Zhang et al.
VerilogCoder: Autonomous Verilog Coding Agents with Graph-based Planning and Abstract Syntax Tree (AST)-based Waveform Tracing Tool
Chia-Tung Ho, Haoxing Ren, Brucek Khailany
Dissecting Adversarial Robustness of Multimodal LM Agents
Chen Wu, Rishi Shah, Jing Yu Koh et al.
Eliciting Human Preferences with Language Models
Belinda Li, Alex Tamkin, Noah Goodman et al.
Diffusion-Based Planning for Autonomous Driving with Flexible Guidance
Yinan Zheng, Ruiming Liang, Kexin ZHENG et al.
Beyond Autoregression: Discrete Diffusion for Complex Reasoning and Planning
Jiacheng Ye, Jiahui Gao, Shansan Gong et al.
APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay
Akshara Prabhakar, Zuxin Liu, Ming Zhu et al.
FakeShield: Explainable Image Forgery Detection and Localization via Multi-modal Large Language Models
Zhipei Xu, Xuanyu Zhang, Runyi Li et al.
ChatTime: A Unified Multimodal Time Series Foundation Model Bridging Numerical and Textual Data
Chengsen Wang, Qi Qi, Jingyu Wang et al.
MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine
Renrui Zhang, Xinyu Wei, Dongzhi Jiang et al.
The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning
Xinyu Zhu, Mengzhou Xia, Zhepei Wei et al.
Improved Techniques for Optimization-Based Jailbreaking on Large Language Models
Xiaojun Jia, Tianyu Pang, Chao Du et al.
MV-Adapter: Multi-View Consistent Image Generation Made Easy
Zehuan Huang, Yuan-Chen Guo, Haoran Wang et al.
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
Yilun Zhao, Lujing Xie, Haowei Zhang et al.
OGBench: Benchmarking Offline Goal-Conditioned RL
Seohong Park, Kevin Frans, Benjamin Eysenbach et al.
MMTEB: Massive Multilingual Text Embedding Benchmark
Kenneth Enevoldsen, Isaac Chung, Imene Kerboua et al.
Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
Sreyan Ghosh, Arushi Goel, Jaehyeon Kim et al.
InstructRAG: Instructing Retrieval-Augmented Generation via Self-Synthesized Rationales
Zhepei Wei, Wei-Lin Chen, Yu Meng
Multimodal Autoregressive Pre-training of Large Vision Encoders
Enrico Fini, Mustafa Shukor, Xiujun Li et al.
Divide, Conquer and Combine: A Training-Free Framework for High-Resolution Image Perception in Multimodal Large Language Models
Wenbin Wang, Liang Ding, Minyan Zeng et al.
Offline Actor-Critic for Average Reward MDPs
William Powell, Jeongyeol Kwon, Qiaomin Xie et al.
REPA-E: Unlocking VAE for End-to-End Tuning of Latent Diffusion Transformers
Xingjian Leng, Jaskirat Singh, Yunzhong Hou et al.
MaskBit: Embedding-free Image Generation via Bit Tokens
Mark Weber, Lijun Yu, Qihang Yu et al.
Language Models Learn to Mislead Humans via RLHF
Jiaxin Wen, Ruiqi Zhong, Akbir Khan et al.
Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents
Saaket Agashe, Kyle Wong, Vincent Tu et al.
MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine
Yunfei Xie, Ce Zhou, Lang Gao et al.
Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation
Fanqing Meng, Jiaqi Liao, Xinyu Tan et al.
dKV-Cache: The Cache for Diffusion Language Models
Xinyin Ma, Runpeng Yu, Gongfan Fang et al.
Planning in Natural Language Improves LLM Search for Code Generation
Evan Wang, Federico Cassano, Catherine Wu et al.
MeshAnything V2: Artist-Created Mesh Generation with Adjacent Mesh Tokenization
Yiwen Chen, Yikai Wang, Yihao Luo et al.
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data and Metric Perspectives
Shaoyuan Xie, Lingdong Kong, Yuhao Dong et al.
Programming Refusal with Conditional Activation Steering
Bruce W. Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy et al.
UniTok: a Unified Tokenizer for Visual Generation and Understanding
Chuofan Ma, Yi Jiang, Junfeng Wu et al.
BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games
Davide Paglieri, Bartłomiej Cupiał, Samuel Coward et al.
UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics
Xi Chen, Zhifei Zhang, He Zhang et al.
Weak to Strong Generalization for Large Language Models with Multi-capabilities
Yucheng Zhou, Jianbing Shen, Yu Cheng
VideoDPO: Omni-Preference Alignment for Video Diffusion Generation
Runtao Liu, Haoyu Wu, Zheng Ziqiang et al.
A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility
Andreas Hochlehnert, Hardik Bhatnagar, Vishaal Udandarao et al.
EasyControl: Adding Efficient and Flexible Control for Diffusion Transformer
Yuxuan Zhang, Yirui Yuan, Yiren Song et al.
Fine-tuning can cripple your foundation model; preserving features may be the solution
Philip Torr, Puneet Dokania, Jishnu Mukhoti et al.
Accelerating Diffusion Transformers with Token-wise Feature Caching
Chang Zou, Xuyang Liu, Ting Liu et al.
DiT4Edit: Diffusion Transformer for Image Editing
Kunyu Feng, Yue Ma, Bingyuan Wang et al.
Improving Text-to-Image Consistency via Automatic Prompt Optimization
Melissa Hall, Michal Drozdzal, Oscar Mañas et al.
Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension
Yongdong Luo, Xiawu Zheng, Guilin Li et al.
Adaptive Keyframe Sampling for Long Video Understanding
Xi Tang, Jihao Qiu, Lingxi Xie et al.
Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
Luo, Xue Yang, Wenhan Dou et al.
Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model
Zhen Ye, Peiwen Sun, Jiahe Lei et al.
Scaling Test-Time Compute Without Verification or RL is Suboptimal
Amrith Setlur, Nived Rajaraman, Sergey Levine et al.
One-Minute Video Generation with Test-Time Training
Jiarui Xu, Shihao Han, Karan Dalal et al.
DriveTransformer: Unified Transformer for Scalable End-to-End Autonomous Driving
Xiaosong Jia, Junqi You, Zhiyuan Zhang et al.
CatVTON: Concatenation Is All You Need for Virtual Try-On with Diffusion Models
Zheng Chong, Xiao Dong, Haoxiang Li et al.
VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents
Xiao Liu, Tianjie Zhang, Yu Gu et al.
Cradle: Empowering Foundation Agents towards General Computer Control
Weihao Tan, Wentao Zhang, Xinrun Xu et al.
Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence
Diankun Wu, Fangfu Liu, Yi-Hsin Hung et al.
HAMSTER: Hierarchical Action Models for Open-World Robot Manipulation
Yi Li, Yuquan Deng, Jesse Zhang et al.
GuardAgent: Safeguard LLM Agents via Knowledge-Enabled Reasoning
Zhen Xiang, Linzhi Zheng, Yanjie Li et al.
Does Refusal Training in LLMs Generalize to the Past Tense?
Maksym Andriushchenko, Nicolas Flammarion
AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents
Ke Yang, Yao Liu, Sapana Chaudhary et al.
Scaling Laws for Precision
Tanishq Kumar, Zachary Ankner, Benjamin Spector et al.
Augmenting Math Word Problems via Iterative Question Composing
Haoxiong Liu, Yifan Zhang, Yifan Luo et al.
History-Guided Video Diffusion
Kiwhan Song, Boyuan Chen, Max Simchowitz et al.
CycleResearcher: Improving Automated Research via Automated Review
Yixuan Weng, Minjun Zhu, Guangsheng Bao et al.
Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models
Fushuo Huo, Wenchao Xu, Zhong Zhang et al.
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation
Cheng Yang, Chufan Shi, Yaxin Liu et al.
Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding
Duo Zheng, Shijia Huang, Liwei Wang
Reasoning with Latent Thoughts: On the Power of Looped Transformers
Nikunj Saunshi, Nishanth Dikkala, Zhiyuan Li et al.
Key-Point-Driven Data Synthesis with Its Enhancement on Mathematical Reasoning
Yiming Huang, Xiao Liu, Yeyun Gong et al.
SIDA: Social Media Image Deepfake Detection, Localization and Explanation with Large Multimodal Model
Zhenglin Huang, Jinwei Hu, Yiwei He et al.
SWE-smith: Scaling Data for Software Engineering Agents
John Yang, Kilian Lieret, Carlos Jimenez et al.
ELLA-V: Stable Neural Codec Language Modeling with Alignment-Guided Sequence Reordering
Yakun Song, Zhuo Chen, Xiaofei Wang et al.
MagicPIG: LSH Sampling for Efficient LLM Generation
Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye et al.
MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequences
Canyu Zhao, Mingyu Liu, Wen Wang et al.
SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?
Samuel Miserendino, Michele Wang, Tejal Patwardhan et al.
UniScene: Unified Occupancy-centric Driving Scene Generation
Bohan Li, Jiazhe Guo, Hongsi Liu et al.
GameFactory: Creating New Games with Generative Interactive Videos
Jiwen Yu, Yiran Qin, Xintao Wang et al.
Arithmetic Without Algorithms: Language Models Solve Math with a Bag of Heuristics
Yaniv Nikankin, Anja Reusch, Aaron Mueller et al.
Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling
Hritik Bansal, Arian Hosseini, Rishabh Agarwal et al.
Thinkless: LLM Learns When to Think
Gongfan Fang, Xinyin Ma, Xinchao Wang
RandAR: Decoder-only Autoregressive Visual Generation in Random Orders
Ziqi Pang, Tianyuan Zhang, Fujun Luan et al.
HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation
Tianwei Lin, Wenqiao Zhang, Sijing Li et al.
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
Aleksandar Makelov, Georg Lange, Neel Nanda
FreDF: Learning to Forecast in the Frequency Domain
Hao Wang, Lichen Pan, Yuan Shen et al.
Why do LLMs attend to the first token?
Federico Barbero, Alvaro Arroyo, Xiangming Gu et al.
ImageFolder: Autoregressive Image Generation with Folded Tokens
Xiang Li, Kai Qiu, Hao Chen et al.
LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models
Marwa Abdulhai, Isadora White, Charlie Snell et al.
DSBench: How Far Are Data Science Agents from Becoming Data Science Experts?
Liqiang Jing, Zhehui Huang, Xiaoyang Wang et al.
UMA: A Family of Universal Models for Atoms
Brandon Wood, Misko Dzamba, Xiang Fu et al.
Simple Guidance Mechanisms for Discrete Diffusion Models
Yair Schiff, Subham Sahoo, Hao Phung et al.
StreamDiffusion: A Pipeline-level Solution for Real-Time Interactive Generation
Akio Kodaira, Chenfeng Xu, Toshiki Hazama et al.
Boosting Consistency in Story Visualization with Rich-Contextual Conditional Diffusion Models
Fei Shen, Hu Ye, Sibo Liu et al.
ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation
Haoyu Fu, Diankun Zhang, Zongchuang Zhao et al.
SF3D: Stable Fast 3D Mesh Reconstruction with UV-unwrapping and Illumination Disentanglement
Mark Boss, Zixuan Huang, Aaryaman Vasishta et al.
SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Models
Yongting Zhang, Lu Chen, Guodong Zheng et al.
Learning Dynamics of LLM Finetuning
YI REN, Danica Sutherland
StableAnimator: High-Quality Identity-Preserving Human Image Animation
Shuyuan Tu, Zhen Xing, Xintong Han et al.
MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding
Ranajoy Sadhukhan, Jian Chen, Zhuoming Chen et al.
CSGO: Content-Style Composition in Text-to-Image Generation
Peng Xing, Haofan Wang, Yanpeng Sun et al.
SafeDiffuser: Safe Planning with Diffusion Probabilistic Models
Wei Xiao, Johnson (Tsun-Hsuan) Wang, Chuang Gan et al.
GoT: Unleashing Reasoning Capability of MLLM for Visual Generation and Editing
Rongyao Fang, Chengqi Duan, Kun Wang et al.
Image and Video Tokenization with Binary Spherical Quantization
Yue Zhao, Yuanjun Xiong, Philipp Krähenbühl
Animate-X: Universal Character Image Animation with Enhanced Motion Representation
Shuai Tan, Biao Gong, Xiang Wang et al.
Repetition Improves Language Model Embeddings
Jacob Springer, Suhas Kotha, Daniel Fried et al.
Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation
Tiansheng Huang, Sihao Hu, Fatih Ilhan et al.
CATCH: Channel-Aware Multivariate Time Series Anomaly Detection via Frequency Patching
Xingjian Wu, Xiangfei Qiu, Zhengyu Li et al.
Go-with-the-Flow: Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise
Ryan Burgert, Yuancheng Xu, Wenqi Xian et al.
DIFIX3D+: Improving 3D Reconstructions with Single-Step Diffusion Models
Jay Zhangjie Wu, Yuxuan Zhang, Haithem Turki et al.
RazorAttention: Efficient KV Cache Compression Through Retrieval Heads
Hanlin Tang, Yang Lin, Jing Lin et al.
See What You Are Told: Visual Attention Sink in Large Multimodal Models
Seil Kang, Jinyeong Kim, Junhyeok Kim et al.
Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation
Hyungjoo Chae, Namyoung Kim, Kai Ong et al.
FBRT-YOLO: Faster and Better for Real-Time Aerial Image Detection
Yao Xiao, Tingfa Xu, Yu Xin et al.
AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Teun van der Weij, Felix Hofstätter, Oliver Jaffe et al.
Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos
Linyi Jin, Richard Tucker, Zhengqi Li et al.
Perception-R1: Pioneering Perception Policy with Reinforcement Learning
En Yu, Kangheng Lin, Liang Zhao et al.
DriveArena: A Closed-loop Generative Simulation Platform for Autonomous Driving
Xuemeng Yang, Licheng Wen, Tiantian Wei et al.
Matryoshka Multimodal Models
Mu Cai, Jianwei Yang, Jianfeng Gao et al.
Towards Semantic Equivalence of Tokenization in Multimodal LLM
Shengqiong Wu, Hao Fei, Xiangtai Li et al.
Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning
Wenwen Zhuang, Xin Huang, Xiantao Zhang et al.
LoRA vs Full Fine-tuning: An Illusion of Equivalence
Reece Shuttleworth, Jacob Andreas, Antonio Torralba et al.
RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness
Tianyu Yu, Haoye Zhang, Qiming Li et al.
Scaling Transformers for Low-Bitrate High-Quality Speech Coding
Julian Parker, Anton Smirnov, Jordi Pons et al.
MUSt3R: Multi-view Network for Stereo 3D Reconstruction
Yohann Cabon, Lucas Stoffl, Leonid Antsfeld et al.
Align3R: Aligned Monocular Depth Estimation for Dynamic Videos
Edward LOO, Tianyu HUANG, Peng Li et al.
Long Context Tuning for Video Generation
Yuwei Guo, Ceyuan Yang, Ziyan Yang et al.
MJ-Bench: Is Your Multimodal Reward Model Really a Good Judge for Text-to-Image Generation?
Zhaorun Chen, Zichen Wen, Yichao Du et al.
Simple is Effective: The Roles of Graphs and Large Language Models in Knowledge-Graph-Based Retrieval-Augmented Generation
Mufei Li, Siqi Miao, Pan Li
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
Hanshi Sun, Li-Wen Chang, Wenlei Bao et al.
What Matters When Repurposing Diffusion Models for General Dense Perception Tasks?
Guangkai Xu, yongtao ge, Mingyu Liu et al.
Cut the Crap: An Economical Communication Pipeline for LLM-based Multi-Agent Systems
Guibin Zhang, Yanwei Yue, Zhixun Li et al.
ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning
Mingyang Chen, Linzhuang Sun, Tianpeng Li et al.
Frame Context Packing and Drift Prevention in Next-Frame-Prediction Video Diffusion Models
Lvmin Zhang, Shengqu Cai, Muyang Li et al.