Jiebo Luo

81 Papers

324 Total Citations

Papers (81)

SurgicalSAM: Efficient Class Promptable Surgical Instrument Segmentation

Adaptive Offline Quintuplet Loss for Image-Text Matching

V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning

OpenS2V-Nexus: A Detailed Benchmark and Million-Scale Dataset for Subject-to-Video Generation

FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity

FineMatch: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction

HOIGen-1M: A Large-scale Dataset for Human-Object Interaction Video Generation

Ouroboros-Diffusion: Exploring Consistent Content Generation in Tuning-free Long Video Diffusion

Mixture of Weak and Strong Experts on Graphs

INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance

Improving Pairwise Ranking for Multi-Label Image Classification

Deep Multimodal Representation Learning From Temporal Data

Learning to Generate Time-Lapse Videos Using Multi-Stage Dynamic Generative Adversarial Networks

VizWiz Grand Challenge: Answering Visual Questions From Blind People

DOTA: A Large-Scale Dataset for Object Detection in Aerial Images

End-to-End Convolutional Semantic Embeddings

Gaussian Temporal Awareness Networks for Action Localization

Spatio-Temporal Video Re-Localization by Warp LSTM

AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations Rather Than Data

Attentive Relational Networks for Mapping Images to Scene Graphs

Unsupervised Image Captioning

Looking for the Devil in the Details: Learning Trilinear Attention Sampling Network for Fine-Grained Image Recognition

Foreground-Aware Image Inpainting

Revisiting Local Descriptor Based Image-To-Class Measure for Few-Shot Learning

DuDoNet: Dual Domain Network for CT Metal Artifact Reduction

Multiview 2D/3D Rigid Registration via a Point-Of-Interest Network for Tracking and Triangulation

Fine-Grained Image-to-Image Transformation Towards Visual Recognition

On Vocabulary Reliance in Scene Text Recognition

Learning a Weakly-Supervised Video Actor-Action Segmentation Model With a Wise Selection

Self-Supervised Domain-Aware Generative Network for Generalized Zero-Shot Learning

TransMatch: A Transfer-Learning Scheme for Semi-Supervised Few-Shot Learning

ArtFlow: Unbiased Image Style Transfer via Reversible Neural Flows

Structured Multi-Level Interaction Network for Video Moment Localization via Language Query

Improving OCR-Based Image Captioning by Incorporating Geometrical Relationship

Group-aware Label Transfer for Domain Adaptive Person Re-identification

TAP: Text-Aware Pre-Training for Text-VQA and Text-Caption

Localized Adversarial Domain Generalization

SpaceEdit: Learning a Unified Editing Space for Open-Domain Image Color Editing

Stand-Alone Inter-Frame Attention in Video Models

Self-Sustaining Representation Expansion for Non-Exemplar Class-Incremental Learning

Automatic Relation-Aware Graph Network Proliferation

AnchorFormer: Point Cloud Completion From Discriminative Nodes

QuantArt: Quantizing Image Style Transfer Towards High Visual Fidelity

Event-Guided Person Re-Identification via Sparse-Dense Complementary Learning

Meta-Causal Learning for Single Domain Generalization

Stare at What You See: Masked Image Modeling Without Reconstruction

Semantic Video Entity Linking Based on Visual Content and Metadata

Learning From Noisy Labels With Distillation

Learning Multi-Attention Convolutional Neural Network for Fine-Grained Image Recognition

A Fast and Accurate One-Stage Approach to Visual Grounding

Joint Syntax Representation Learning and Visual Cue Translation for Video Captioning

Identity-Preserving Text-to-Video Generation by Frequency Decomposition

Learning Bias-Invariant Representation by Cross-Sample Mutual Information Minimization

Learning Conditional Knowledge Distillation for Degraded-Reference Image Quality Assessment

SAT: 2D Semantics Assisted Training for 3D Visual Grounding

Procedure Planning in Instructional Videos via Contextual Modeling and Model-Based Policy Learning

PromptCap: Prompt-Guided Image Captioning for VQA with GPT-3

Spatial-Aware Token for Weakly Supervised Object Localization

Grounding 3D Object Affordance from 2D Interactions in Images

Learning to Localize Actions from Moments

TuiGAN: Learning Versatile Image-to-Image Translation with Two Unpaired Images

Structured Landmark Detection via Topology-Adapting Deep Graph Learning

Improving One-stage Visual Grounding by Recursive Sub-query Construction

Example-Guided Image Synthesis using Masked Spatial-Channel Attention and Self-Supervision

Dynamic Dual-Attentive Aggregation Learning for Visible-Infrared Person Re-Identification

Image Inpainting with Cascaded Modulation GAN and Object-Aware Training

Large-Scale Tag-Based Font Retrieval With Generative Feature Learning

Exploring the Adversarial Vulnerabilities of Vision-Language-Action Models in Robotics

Latent-Reframe: Enabling Camera Control for Video Diffusion Models without Training

OmniPaint: Mastering Object-Oriented Editing via Disentangled Insertion-Removal Inpainting

Aligning Global Semantics and Local Textures in Generative Video Enhancement

Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution

DanceCamera3D: 3D Camera Movement Synthesis with Music and Dance

Multi-Task Deep Visual-Semantic Embedding for Video Thumbnail Selection

TGIF: A New Dataset and Benchmark on Animated GIF Description

Image Captioning With Semantic Attention

Learning Deep Bilinear Transformation for Fine-grained Image Representation

Learning Semantic-aware Normalization for Generative Adversarial Networks

Probing Inter-modality: Visual Parsing with Self-Attention for Vision-and-Language Pre-training

Multi-modal Dependency Tree for Video Captioning

Wyze Rule: Federated Rule Dataset for Rule Recommendation Benchmarking