Luo

49 papers · 160 total citations

Papers (49)

Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training

CVPR 2025 · arXiv · 68 citations

Preserving Diversity in Supervised Fine-Tuning of Large Language Models

ICLR 2025 · arXiv · 33 citations

REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models

ECCV 2024 · arXiv · 10 citations

Uncertainty-aware sign language video retrieval with probability distribution modeling

ECCV 2024 · arXiv · 10 citations

Last-Iterate Convergence Properties of Regret-Matching Algorithms in Games

ICLR 2025 · arXiv · 7 citations

Latent Chain-of-Thought for Visual Reasoning

NeurIPS 2025 · arXiv · 7 citations

Simultaneous Swap Regret Minimization via KL-Calibration

NeurIPS 2025 · arXiv · 6 citations

Segment, Lift and Fit: Automatic 3D Shape Labeling from 2D Prompts

ECCV 2024 · arXiv · 6 citations

FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression

CVPR 2025 · arXiv · 4 citations

WorldWeaver: Generating Long-Horizon Video Worlds via Rich Perception

NeurIPS 2025 · arXiv · 4 citations

WeakMCN: Multi-task Collaborative Network for Weakly Supervised Referring Expression Comprehension and Segmentation

CVPR 2025 · arXiv · 3 citations

Attention! Your Vision Language Model Could Be Maliciously Manipulated

NeurIPS 2025 · arXiv · 2 citations

DSAS: A Universal Plug-and-Play Framework for Attention Optimization in Multi-Document Question Answering

NeurIPS 2025 · arXiv · 0 citations

RTop-K: Ultra-Fast Row-Wise Top-K Selection for Neural Network Acceleration on GPUs

ICLR 2025 · arXiv · 0 citations

SysBench: Can LLMs Follow System Message?

ICLR 2025 · 0 citations

Real-World Reinforcement Learning of Active Perception Behaviors

NeurIPS 2025 · arXiv · 0 citations

EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models

NeurIPS 2025 · arXiv · 0 citations

Chameleon: A Data-Efficient Generalist for Dense Visual Prediction in the Wild

ECCV 2024 · arXiv · 0 citations

SAMRefiner: Taming Segment Anything Model for Universal Mask Refinement

ICLR 2025 · arXiv · 0 citations

Interpreting Global Perturbation Robustness of Image Models using Axiomatic Spectral Importance Decomposition

ICLR 2025 · arXiv · 0 citations

Towards Multiple Character Image Animation Through Enhancing Implicit Decoupling

ICLR 2025 · arXiv · 0 citations

DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search

ICLR 2025 · arXiv · 0 citations

Self-diffusion for Solving Inverse Problems

NeurIPS 2025 · arXiv · 0 citations

Knowing Your Target: Target-Aware Transformer Makes Better Spatio-Temporal Video Grounding

ICLR 2025 · arXiv · 0 citations

Text-Aware Real-World Image Super-Resolution via Diffusion Model with Joint Segmentation Decoders

NeurIPS 2025 · arXiv · 0 citations

You Only Learn One Query: Learning Unified Human Query for Single-Stage Multi-Person Multi-Task Human-Centric Perception

ECCV 2024 · arXiv · 0 citations

SPAZER: Spatial-Semantic Progressive Reasoning Agent for Zero-shot 3D Visual Grounding

NeurIPS 2025 · arXiv · 0 citations

PixArt-Sigma: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation

ECCV 2024 · 0 citations

Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping

ICLR 2025 · arXiv · 0 citations

Adapting to Stochastic and Adversarial Losses in Episodic MDPs with Aggregate Bandit Feedback

NeurIPS 2025 · arXiv · 0 citations

UniCTokens: Boosting Personalized Understanding and Generation via Unified Concept Tokens

NeurIPS 2025 · arXiv · 0 citations

Differentiable extensions with rounding guarantees for combinatorial optimization over permutations

NeurIPS 2025 · arXiv · 0 citations

Removing Rows and Columns of Tokens in Vision Transformer enables Faster Dense Prediction without Retraining

ECCV 2024 · 0 citations

Unleashing Hour-Scale Video Training for Long Video-Language Understanding

NeurIPS 2025 · arXiv · 0 citations

On Inductive Biases That Enable Generalization in Diffusion Transformers

NeurIPS 2025 · arXiv · 0 citations

Scalable Decision-Making in Stochastic Environments through Learned Temporal Abstraction

ICLR 2025 · arXiv · 0 citations

VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning

ICLR 2025 · arXiv · 0 citations

When Pedestrian Detection Meets Multi-Modal Learning: Generalist Model and Benchmark Dataset

ECCV 2024 · arXiv · 0 citations

DViN: Dynamic Visual Routing Network for Weakly Supervised Referring Expression Comprehension

CVPR 2025 · 0 citations

When GNNs meet symmetry in ILPs: an orbit-based feature augmentation approach

ICLR 2025 · arXiv · 0 citations

Unlocking Multimodal Mathematical Reasoning via Process Reward Model

NeurIPS 2025 · arXiv · 0 citations

FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities

NeurIPS 2025 · arXiv · 0 citations

FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference

ICLR 2025 · arXiv · 0 citations

Geometric Algorithms for Neural Combinatorial Optimization with Constraints

NeurIPS 2025 · arXiv · 0 citations

Multi-Agent Collaboration via Evolving Orchestration

NeurIPS 2025 · arXiv · 0 citations

Geometry-Aware Approaches for Balancing Performance and Theoretical Guarantees in Linear Bandits

ICLR 2025 · arXiv · 0 citations

Don’t Forget the Enjoin: FocalLoRA for Instruction Hierarchical Alignment in Large Language Models

NeurIPS 2025 · 0 citations

CodeMerge: Codebook-Guided Model Merging for Robust Test-Time Adaptation in Autonomous Driving

NeurIPS 2025 · arXiv · 0 citations

MobileNetV4: Universal Models for the Mobile Ecosystem

ECCV 2024 · arXiv · 0 citations