Tong Lu

Papers: 28

Total Citations: 2,560

Papers (28)

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?

The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World

CG-Bench: Clue-grounded Question Answering Benchmark for Long Video Understanding

Docopilot: Improving Multimodal Models for Document-Level Understanding

EgoExoBench: A Benchmark for First- and Third-person View Video Understanding in MLLMs

RepKPU: Point Cloud Upsampling with Kernel Point Representation and Deformation

Temporal Action Localization by Structured Maximal Sums

Shape Robust Text Detection With Progressive Scale Expansion Network

InternImage: Exploring Large-Scale Vision Foundation Models With Deformable Convolutions

Efficient and Accurate Arbitrary-Shaped Text Detection With Pixel Aggregation Network

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction Without Convolutions

TAM: Temporal Adaptive Module for Video Recognition

Adaptive Graph Convolution for Point Cloud Analysis

Memory-and-Anticipation Transformer for Online Action Understanding

FB-BEV: BEV Representation from Forward-Backward View Transformations

DDP: Diffusion Model for Dense Visual Prediction

AE TextSpotter: Learning Visual and Linguistic Representation for Ambiguous Text Spotting

SeedFormer: Patch Seeds Based Point Cloud Completion with Upsample Transformer

BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

Panoptic SegFormer: Delving Deeper Into Panoptic Segmentation With Transformers

MOERL: When Mixture-of-Experts Meet Reinforcement Learning for Adverse Weather Image Restoration

Deconfound Semantic Shift and Incompleteness in Incremental Few-shot Semantic Segmentation

AVSegFormer: Audio-Visual Segmentation with Transformer

CRA-PCN: Point Cloud Completion with Intra- and Inter-level Cross-Resolution Transformers

Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications

Spectrum-to-Kernel Translation for Accurate Blind Image Super-Resolution

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks