Yueting Zhuang

37 Papers

135 Total Citations

Papers (37)

HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation

VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM

Let LRMs Break Free from Overthinking via Self-Braking Tuning

Benchmarking Multimodal CoT Reward Model Stepwise by Visual Program

Janus-Pro-R1: Advancing Collaborative Visual Comprehension and Generation via Reinforcement Learning

Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models

HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data

Auto-Encoding Morph-Tokens for Multimodal LLM

Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning

Hierarchical Recurrent Neural Encoder for Video Representation With Application to Captioning

Zero-Shot Recognition Using Dual Visual-Semantic Mapping Paths

Self-Supervised Spatiotemporal Learning via Video Clip Order Prediction

Counterfactual Samples Synthesizing for Robust Visual Question Answering

Unsupervised Reinforcement Learning of Transferable Meta-Skills for Embodied Navigation

Label Matching Semi-Supervised Object Detection

Slimmable Domain Adaptation

Learning To Learn by Jointly Optimizing Neural Architecture and Weights

Compositional Temporal Grounding With Structured Variational Cross-Graph Correspondence Learning

Deeply-Learned Part-Aligned Representations for Person Re-Identification

Semi-Supervised Active Learning for Semi-Supervised Models: Exploit Adversarial Examples With Graph-Based Virtual Labels

Adaptive Hierarchical Graph Reasoning With Semantic Coherence for Video-and-Language Inference

Learning in Imperfect Environment: Multi-Label Classification with Long-Tailed Distribution and Partial Labels

Gradient-Regulated Meta-Prompt Learning for Generalizable Vision-Language Models

Visually-Prompted Language Model for Fine-Grained Scene Graph Generation in an Open World

Unsupervised Prompt Tuning for Text-Driven Object Detection

AnyEdit: Mastering Unified High-Quality Image Editing for Any Idea

STEP: Enhancing Video-LLMs’ Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training

Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining

2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining

Mastering Collaborative Multi-modal Data Selection: A Focus on Informativeness, Uniqueness, and Representativeness

Data Shunt: Collaboration of Small and Large Models for Lower Costs and Better Performance

MacNet: Transferring Knowledge from Machine Comprehension to Sequence-to-Sequence Models

Pixel-Level Cycle Association: A New Perspective for Domain Adaptive Semantic Segmentation

Learning to Generate Visual Questions with Noisy Supervision

Fine-Grained Semantically Aligned Vision-Language Pre-Training

HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face

Zero-shot Visual Relation Detection via Composite Visual Cues from Large Language Models