Qi Wu

78 Papers

627 Total Citations

1 Affiliation

Affiliations

Carnegie Mellon University

Papers (78)

NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models

Object-and-Action Aware Model for Visual Language Navigation

Context-I2W: Mapping Images to Context-Dependent Words for Accurate Zero-Shot Composed Image Retrieval

3DGUT: Enabling Distorted Cameras and Secondary Rays in Gaussian Splatting

Continual Self-supervised Learning: Towards Universal Multi-modal Medical Data Representation Learning

Motion-Agent: A Conversational Framework for Human Motion Generation with LLMs

WebVLN: Vision-and-Language Navigation on Websites

PairAug: What Can Augmented Image-Text Pairs Do for Radiology?

General Scene Adaptation for Vision-and-Language Navigation

Invariant Random Forest: Tree-Based Model Solution for OOD Generalization

The Causal Impact of Credit Lines on Spending Distributions

Sparse Bayesian Deep Learning for Cross Domain Medical Image Reconstruction

KPA-Tracker: Towards Robust and Real-Time Category-Level Articulated Object 6D Pose Tracking

G-NeRF: Geometry-enhanced Novel View Synthesis from Single-View Images

Dynamic Inertial Poser (DynaIP): Part-Based Motion Dynamics Learning for Enhanced Human Pose Estimation with Sparse Inertial Sensors

Decomposing Disease Descriptions for Enhanced Pathology Detection: A Multi-Aspect Vision-Language Pre-training Framework

ModaVerse: Efficiently Transforming Modalities with LLMs

What Value Do Explicit High Level Concepts Have in Vision to Language Problems?

Ask Me Anything: Free-Form Visual Question Answering Based on Knowledge From External Sources

The VQA-Machine: Learning How to Use Existing Vision Algorithms to Answer New Questions

Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments

Parallel Attention: A Unified Framework for Visual Object Discovery Through Dialogs and Queries

Are You Talking to Me? Reasoned Visual Dialog Generation Through Adversarial Learning

Learning Semantic Concepts and Order for Image and Sentence Matching

Visual Question Answering With Memory-Augmented Networks

Visual Grounding via Accumulated Attention

Neighbourhood Watch: Referring Expression Comprehension via Language-Guided Graph Attention Networks

Mind Your Neighbours: Image Annotation With Metadata Neighbourhood Graph Co-Attention Networks

What's to Know? Uncertainty as a Guide to Asking Goal-Oriented Questions

Say As You Wish: Fine-Grained Control of Image Caption Generation With Abstract Scene Graphs

Cops-Ref: A New Dataset and Task on Compositional Referring Expression Comprehension

Gold Seeker: Information Gain From Policy Distributions for Goal-Oriented Vision-and-Language Reasoning

Intelligent Home 3D: Automatic 3D-House Design From Linguistic Descriptions Only

REVERIE: Remote Embodied Visual Referring Expression in Real Indoor Environments

Fine-Grained Video-Text Retrieval With Hierarchical Graph Reasoning

Sketch, Ground, and Refine: Top-Down Dense Video Captioning

Towards Accurate Text-Based Image Captioning With Content Diversity Exploration

Jo-SRC: A Contrastive Approach for Combating Noisy Labels

Room-and-Object Aware Knowledge Reasoning for Remote Embodied Referring Expression

Non-Salient Region Object Mining for Weakly Supervised Semantic Segmentation

VLN↻BERT: A Recurrent Vision-and-Language BERT for Navigation

V2C: Visual Voice Cloning

Bridging the Gap Between Learning in Discrete and Continuous Environments for Vision-and-Language Navigation

MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-Based Visual Question Answering

Maintaining Reasoning Consistency in Compositional Visual Question Answering

HOP: History-and-Order Aware Pre-Training for Vision-and-Language Navigation

Learning To Dub Movies via Hierarchical Prosody Models

S3C: Semi-Supervised VQA Natural Language Explanation via Self-Critical Learning

The Road To Know-Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation

AerialVLN: Vision-and-Language Navigation for UAVs

VLN-PETL: Parameter-Efficient Transfer Learning for Vision-and-Language Navigation

Scaling Data Generation in Vision-and-Language Navigation

Identity-Consistent Aggregation for Video Object Detection

ShapeScaffolder: Structure-Aware 3D Shape Generation from Text

March in Chat: Interactive Prompting for Remote Embodied Referring Expression

NeRF-LOAM: Neural Implicit Representation for Large-Scale Incremental LiDAR Odometry and Mapping

Soft Expert Reward Learning for Vision-and-Language Navigation

Length-Controllable Image Captioning

Semantic Equivalent Adversarial Data Augmentation for Visual Question Answering

UniMiSS: Universal Medical Self-Supervised Learning via Breaking Dimensionality Barrier

A Simple and Robust Correlation Filtering Method for Text-Based Person Search

Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval

Reason-before-Retrieve: One-Stage Reflective Chain-of-Thoughts for Training-Free Zero-Shot Composed Image Retrieval

Missing Target-Relevant Information Prediction with World Model for Accurate Zero-Shot Composed Image Retrieval

EnvPoser: Environment-aware Realistic Human Motion Estimation from Sparse Observations with Uncertainty Modeling

SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts

COSMO: Combination of Selective Memorization for Low-cost Vision-and-Language Navigation

MFL-Owner: Ownership Protection for Multi-modal Federated Learning via Orthogonal Transform Watermark

Realistic Noise Synthesis with Diffusion Models

Distributionally Robust Policy Evaluation and Learning for Continuous Treatment with Observational Data

Augmented Commonsense Knowledge for Remote Object Grounding

Parsimonious Quantile Regression of Financial Asset Tail Dynamics via Sequential Learning

Cross-sectional Learning of Extremal Dependence among Financial Assets

Language and Visual Entity Relationship Graph for Agent Navigation

Landmark-RxR: Solving Vision-and-Language Navigation with Fine-Grained Alignment Supervision

Debiased Visual Question Answering from Feature and Sample Perspectives

Learning Distinct and Representative Modes for Image Captioning

LoRA: A Logical Reasoning Augmented Dataset for Visual Question Answering