Xiaojuan Qi

Papers: 84

Total Citations: 579

Papers (84)

SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes

RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding

VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection

V-IRL: Grounding Virtual Intelligence in Real Life

Mixture Compressor for Mixture-of-Experts LLMs Gains More

Can OOD Object Detectors Learn from Foundation Models?

Total-Decom: Decomposed 3D Scene Reconstruction with Minimal Interaction

ObjectMover: Generative Object Movement with Video Prior

SaCo Loss: Sample-wise Affinity Consistency for Vision-Language Pre-training

Deformable Radial Kernel Splatting

"Principal Components" Enable A New Language of Images

A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning

Equipping Vision Foundation Model with Mixture of Experts for Out-of-Distribution Detection

Pyramid Scene Parsing Network

GeoNet: Geometric Neural Network for Joint Depth and Surface Normal Estimation

Referring Image Segmentation via Recurrent Refinement Networks

Semi-Parametric Image Synthesis

3D Motion Decomposition for RGBD Future Dynamic Scene Synthesis

Global Texture Enhancement for Fake Face Detection in the Wild

ManiGAN: Text-Guided Image Manipulation

Unifying Training and Inference for Panoptic Segmentation

3D-to-2D Distillation for Indoor Scene Parsing

PAConv: Position Adaptive Convolution With Dynamic Kernel Assembling on Point Clouds

ST3D: Self-Training for Unsupervised Domain Adaptation on 3D Object Detection

Fully Convolutional Networks for Panoptic Segmentation

One Thing One Click: A Self-Training Approach for Weakly Supervised 3D Semantic Segmentation

TWIST: Two-Way Inter-Label Self-Training for Semi-Supervised 3D Instance Segmentation

Voxel Field Fusion for 3D Object Detection

Towards Implicit Text-Guided 3D Shape Generation

Slot-VPS: Object-Centric Representation Learning for Video Panoptic Segmentation

HINT: Hierarchical Neuron Concept Explainer

Progressive End-to-End Object Detection in Crowded Scenes

Knowledge Distillation As Efficient Pre-Training: Faster Convergence, Higher Data-Efficiency, and Better Transferability

Video Demoireing With Relation-Based Temporal Consistency

Stratified Transformer for 3D Point Cloud Segmentation

MarS3D: A Plug-and-Play Motion-Aware Model for Semantic Segmentation on Multi-Scan 3D Point Clouds

PLA: Language-Driven Open-Vocabulary 3D Scene Understanding

VoxelNeXt: Fully Sparse VoxelNet for 3D Object Detection and Tracking

Understanding Imbalanced Semantic Segmentation Through Neural Collapse

LargeKernel3D: Scaling Up Kernels in 3D Sparse CNNs

Command-Driven Articulated Object Understanding and Manipulation

Semantic Segmentation With Object Clique Potential

3D Graph Neural Networks for RGBD Semantic Segmentation

Improved Techniques for Training Adaptive Deep Networks

AGSS-VOS: Attention Guided Single-Shot Video Object Segmentation

Aggregation With Feature Detection

Re-Distributing Biased Pseudo Labels for Semi-Supervised Semantic Segmentation: A Baseline Investigation

Texture Generation on 3D Meshes with Point-UV Diffusion

Learning a Room with the Occ-SDF Hybrid: Signed Distance Function Mingled with Occupancy Aids Scene Representation

Parametric Classification for Generalized Category Discovery: A Baseline Study

IST-Net: Prior-Free Category-Level Pose Estimation with Implicit Space Transformation

Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video

Domain-invariant Stereo Matching Networks

Few-shot Action Recognition with Permutation-invariant Attention

CN: Channel Normalization For Point Cloud Recognition

Memory Selection Network for Video Propagation

Towards Efficient and Scale-Robust Ultra-High-Definition Image Demoiréing

DODA: Data-Oriented Sim-to-Real Domain Adaptation for 3D Semantic Segmentation

Multimodal Transformer for Automatic 3D Annotation and Object Detection

Hybrid Neural Rendering for Large-Scale Scenes With Motion Blur

Learning from Neighbors: Category Extrapolation for Long-Tail Learning

UniScene: Unified Occupancy-centric Driving Scene Generation

Holistic Tokenizer for Autoregressive Image Generation

Aligning Effective Tokens with Video Anomaly in Large Language Models

Mixture-of-Scores: Robust Image-Text Data Valuation via Three Lines of Code

DiST-4D: Disentangled Spatiotemporal Diffusion with Metric Depth for 4D Driving Scene Generation

How Far are AI-generated Videos from Simulating the 3D Visual World: A Learned 3D Evaluation Approach

EscherNet: A Generative Model for Scalable View Synthesis

Classes Are Not Equal: An Empirical Study on Image Recognition Fairness

How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval?

DCAN: Deep Contour-Aware Networks for Accurate Gland Segmentation

Multi-Scale Patch Aggregation (MPA) for Simultaneous Detection and Segmentation

Image Inpainting via Generative Multi-column Convolutional Neural Networks

Controllable Text-to-Image Generation

Lightweight Generative Adversarial Networks for Text-Guided Image Manipulation

Spatial Pruned Sparse Convolution for Efficient 3D Object Detection

Prototypical VoteNet for Few-Shot 3D Point Cloud Object Detection

Self-Supervised Visual Representation Learning with Semantic Grouping

Unifying Voxel-based Representation with Transformer for 3D Object Detection

Towards Efficient 3D Object Detection with Knowledge Distillation

Rethinking Resolution in the Context of Efficient Video Recognition

Data Pruning via Moving-one-Sample-out

CL-NeRF: Continual Learning of Neural Radiance Fields for Evolving Scene Representation

CoDet: Co-occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection