Lijuan Wang

51
Papers
269
Total Citations

Papers (51)

ShowUI: One Vision-Language-Action Model for GUI Visual Agent

CVPR 2025
123
citations

MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning

CVPR 2024
49
citations

MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

ICLR 2025
34
citations

ART: Anonymous Region Transformer for Variable Multi-Layer Transparent Image Generation

CVPR 2025
20
citations

SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation

ICLR 2025
17
citations

Tuning Timestep-Distilled Diffusion Model Using Pairwise Sample Optimization

ICLR 2025
14
citations

Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning

NeurIPS 2025
12
citations

DisCo: Disentangled Control for Realistic Human Dance Generation

CVPR 2024
0
citations

Segment and Caption Anything

CVPR 2024
0
citations

Completing Visual Objects via Bridging Generation and Segmentation

ICML 2024
0
citations

StrokeNUWA: Tokenizing Strokes for Vector Graphic Synthesis

ICML 2024
0
citations

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

ICML 2024
0
citations

Large Scale Incremental Learning

CVPR 2019
0
citations

Rethinking Classification and Localization for Object Detection

CVPR 2020
0
citations

VinVL: Revisiting Visual Representations in Vision-Language Models

CVPR 2021
0
citations

End-to-End Human Pose and Mesh Reconstruction with Transformers

CVPR 2021
0
citations

M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-Training

CVPR 2021
0
citations

TAP: Text-Aware Pre-Training for Text-VQA and Text-Caption

CVPR 2021
0
citations

DAP: Detection-Aware Pre-Training With Weak Supervision

CVPR 2021
0
citations

Grounded Language-Image Pre-Training

CVPR 2022
0
citations

Cross-Modal Representation Learning for Zero-Shot Action Recognition

CVPR 2022
0
citations

SwinBERT: End-to-End Transformers With Sparse Attention for Video Captioning

CVPR 2022
0
citations

An Empirical Study of Training End-to-End Vision-and-Language Transformers

CVPR 2022
0
citations

Injecting Semantic Concepts Into End-to-End Image Captioning

CVPR 2022
0
citations

Scaling Up Vision-Language Pre-Training for Image Captioning

CVPR 2022
0
citations

Adaptive Human Matting for Dynamic Videos

CVPR 2023
0
citations

An Empirical Study of End-to-End Video-Language Transformers With Masked Visual Modeling

CVPR 2023
0
citations

ReCo: Region-Controlled Text-to-Image Generation

CVPR 2023
0
citations

LAVENDER: Unifying Video-Language Understanding As Masked Language Modeling

CVPR 2023
0
citations

Generalized Decoding for Pixel, Image, and Language

CVPR 2023
0
citations

Neural Voting Field for Camera-Space 3D Hand Pose Estimation

CVPR 2023
0
citations

Weakly Supervised Video Emotion Detection and Prediction via Cross-Modal Temporal Erasing Network

CVPR 2023
0
citations

Compressing Visual-Linguistic Model via Knowledge Distillation

ICCV 2021
0
citations

End-to-End Semi-Supervised Object Detection With Soft Teacher

ICCV 2021
0
citations

Mesh Graphormer

ICCV 2021
0
citations

Equivariant Similarity for Vision-Language Foundation Models

ICCV 2023
0
citations

Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks

ECCV 2020
0
citations

A Simple Approach and Benchmark for 21,000-Category Object Detection

ECCV 2022
0
citations

UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling

ECCV 2022
0
citations

Non-Contrastive Learning Meets Language-Image Pre-Training

CVPR 2023
0
citations

LiVOS: Light Video Object Segmentation with Gated Linear Matching

CVPR 2025
0
citations

Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension

ICCV 2025
0
citations

SITE: towards Spatial Intelligence Thorough Evaluation

ICCV 2025
0
citations

ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning

ICCV 2025
0
citations

MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos

CVPR 2024
0
citations

Training Diffusion Models Towards Diverse Image Generation with Reinforcement Learning

CVPR 2024
0
citations

NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis

NeurIPS 2022
0
citations

K-LITE: Learning Transferable Visual Models with External Knowledge

NeurIPS 2022
0
citations

Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone

NeurIPS 2022
0
citations

GLIPv2: Unifying Localization and Vision-Language Understanding

NeurIPS 2022
0
citations

Segment Everything Everywhere All at Once

NeurIPS 2023
0
citations