Conghui He

Papers: 28

Total Citations: 647

Papers (28)

OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation

OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?

LEGION: Learning to Ground and Explain for Synthetic Image Detection

Cross-view image geo-localization with Panorama-BEV Co-Retrieval Network

Dataset Distillation with Neural Characteristic Function: A Minmax Perspective

UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios

Large Language Models Meet Symbolic Provers for Logical Reasoning Evaluation

SG-BEV: Satellite-Guided BEV Fusion for Cross-View Semantic Segmentation

MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models

Where am I? Cross-View Geo-localization with Natural Language Descriptions

Image Over Text: Transforming Formula Recognition Evaluation with Character Detection Matching

Multi-step Visual Reasoning with Visual Tokens Scaling and Verification

Efficient Multi-modal Large Language Models via Progressive Consistency Distillation

VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos

Utilize the Flow Before Stepping into the Same River Twice: Certainty Represented Knowledge Flow for Refusal-Aware Instruction Tuning

PersFormer: 3D Lane Detection via Perspective Transformer and the OpenLane Benchmark

OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations

Leveraging BEV Paradigm for Ground-to-Aerial Image Synthesis

OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation

VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis

3D Building Reconstruction from Monocular Remote Sensing Images with Multi-level Supervisions

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

OmniCity: Omnipotent City Understanding With Multi-Level and Multi-View Images

Think Twice Before Driving: Towards Scalable Decoders for End-to-End Autonomous Driving

Influence Selection for Active Learning

3D Building Reconstruction From Monocular Remote Sensing Images

V3Det: Vast Vocabulary Visual Detection Dataset

Conical Visual Concentration for Efficient Large Vision-Language Models