Kevin Lin

17 Papers

57 Total Citations

Papers (17)

MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning

BizGen: Advancing Article-level Visual Text Rendering for Infographics Generation

LiVOS: Light Video Object Segmentation with Gated Linear Matching

Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension

ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning

DisCo: Disentangled Control for Realistic Human Dance Generation

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

End-to-End Human Pose and Mesh Reconstruction with Transformers

Cross-Modal Representation Learning for Zero-Shot Action Recognition

SwinBERT: End-to-End Transformers With Sparse Attention for Video Captioning

Adaptive Human Matting for Dynamic Videos

An Empirical Study of End-to-End Video-Language Transformers With Masked Visual Modeling

ReCo: Region-Controlled Text-to-Image Generation

LAVENDER: Unifying Video-Language Understanding As Masked Language Modeling

Neural Voting Field for Camera-Space 3D Hand Pose Estimation

Mesh Graphormer

Equivariant Similarity for Vision-Language Foundation Models