Lijuan Wang
51
Papers
269
Total Citations
Papers (51)
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
CVPR 2025
123
citations
MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning
CVPR 2024
49
citations
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
ICLR 2025
34
citations
ART: Anonymous Region Transformer for Variable Multi-Layer Transparent Image Generation
CVPR 2025
20
citations
SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation
ICLR 2025
17
citations
Tuning Timestep-Distilled Diffusion Model Using Pairwise Sample Optimization
ICLR 2025
14
citations
Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning
NeurIPS 2025
12
citations
DisCo: Disentangled Control for Realistic Human Dance Generation
CVPR 2024
0
citations
Segment and Caption Anything
CVPR 2024
0
citations
Completing Visual Objects via Bridging Generation and Segmentation
ICML 2024
0
citations
StrokeNUWA: Tokenizing Strokes for Vector Graphic Synthesis
ICML 2024
0
citations
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
ICML 2024
0
citations
Large Scale Incremental Learning
CVPR 2019
0
citations
Rethinking Classification and Localization for Object Detection
CVPR 2020
0
citations
VinVL: Revisiting Visual Representations in Vision-Language Models
CVPR 2021
0
citations
End-to-End Human Pose and Mesh Reconstruction with Transformers
CVPR 2021
0
citations
M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-Training
CVPR 2021
0
citations
TAP: Text-Aware Pre-Training for Text-VQA and Text-Caption
CVPR 2021
0
citations
DAP: Detection-Aware Pre-Training With Weak Supervision
CVPR 2021
0
citations
Grounded Language-Image Pre-Training
CVPR 2022
0
citations
Cross-Modal Representation Learning for Zero-Shot Action Recognition
CVPR 2022
0
citations
SwinBERT: End-to-End Transformers With Sparse Attention for Video Captioning
CVPR 2022
0
citations
An Empirical Study of Training End-to-End Vision-and-Language Transformers
CVPR 2022
0
citations
Injecting Semantic Concepts Into End-to-End Image Captioning
CVPR 2022
0
citations
Scaling Up Vision-Language Pre-Training for Image Captioning
CVPR 2022
0
citations
Adaptive Human Matting for Dynamic Videos
CVPR 2023
0
citations
An Empirical Study of End-to-End Video-Language Transformers With Masked Visual Modeling
CVPR 2023
0
citations
ReCo: Region-Controlled Text-to-Image Generation
CVPR 2023
0
citations
LAVENDER: Unifying Video-Language Understanding As Masked Language Modeling
CVPR 2023
0
citations
Generalized Decoding for Pixel, Image, and Language
CVPR 2023
0
citations
Neural Voting Field for Camera-Space 3D Hand Pose Estimation
CVPR 2023
0
citations
Weakly Supervised Video Emotion Detection and Prediction via Cross-Modal Temporal Erasing Network
CVPR 2023
0
citations
Compressing Visual-Linguistic Model via Knowledge Distillation
ICCV 2021
0
citations
End-to-End Semi-Supervised Object Detection With Soft Teacher
ICCV 2021
0
citations
Mesh Graphormer
ICCV 2021
0
citations
Equivariant Similarity for Vision-Language Foundation Models
ICCV 2023
0
citations
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
ECCV 2020
0
citations
A Simple Approach and Benchmark for 21,000-Category Object Detection
ECCV 2022
0
citations
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling
ECCV 2022
0
citations
Non-Contrastive Learning Meets Language-Image Pre-Training
CVPR 2023
0
citations
LiVOS: Light Video Object Segmentation with Gated Linear Matching
CVPR 2025
0
citations
Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension
ICCV 2025
0
citations
SITE: towards Spatial Intelligence Thorough Evaluation
ICCV 2025
0
citations
ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning
ICCV 2025
0
citations
MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos
CVPR 2024
0
citations
Training Diffusion Models Towards Diverse Image Generation with Reinforcement Learning
CVPR 2024
0
citations
NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis
NeurIPS 2022
0
citations
K-LITE: Learning Transferable Visual Models with External Knowledge
NeurIPS 2022
0
citations
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
NeurIPS 2022
0
citations
GLIPv2: Unifying Localization and Vision-Language Understanding
NeurIPS 2022
0
citations
Segment Everything Everywhere All at Once
NeurIPS 2023
0
citations