Xi Chen

Papers: 39

Total Citations: 598

Affiliations: 1

Affiliations

Google Research

Papers (39)

On Scaling Up a Multilingual Vision and Language Model

OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation

UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics

PolyVoice: Language Models for Speech to Speech Translation

Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation

EnvGS: Modeling View-Dependent Appearance with Environment Gaussian

ViLLa: Video Reasoning Segmentation with Large Language Model

GOPlan: Goal-conditioned Offline Reinforcement Learning by Planning with Learned Models

MotionStone: Decoupled Motion Intensity Modulation with Diffusion Transformer for Image-to-Video Generation

Dual-Window Multiscale Transformer for Hyperspectral Snapshot Compressive Imaging

NoT: Federated Unlearning via Weight Negation

Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding

ObjectMover: Generative Object Movement with Video Prior

Online Video Understanding: OVBench and VideoChat-Online

Asynchronous Federated Clustering with Unknown Number of Clusters

Exploiting Symmetric Temporally Sparse BPTT for Efficient RNN Training

Unleashing Diffusion Transformers for Visual Correspondence by Modulating Massive Activations

ROSE: Remove Objects with Side Effects in Videos

Less or More From Teacher: Exploiting Trilateral Geometry For Knowledge Distillation

PlayerOne: Egocentric World Simulator

Understanding the Training Speedup from Sampling with Approximate Losses

MangaNinja: Line Art Colorization with Precise Reference Following

EIDT-V: Exploiting Intersections in Diffusion Trajectories for Model-Agnostic, Zero-Shot, Training-Free Text-to-Video Generation

UniGraspTransformer: Simplified Policy Distillation for Scalable Dexterous Robotic Grasping

DiffDoctor: Diagnosing Image Diffusion Models Before Treating

GenieBlue: Integrating both Linguistic and Multimodal Capabilities for Large Language Models on Mobile Devices

Zero-shot Denoising via Neural Compression: Theoretical and algorithmic framework

TC-LLaVA: Rethinking the Transfer of LLava from Image to Video Understanding with Temporal Considerations

VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding

HFF-Tracker: A Hierarchical Fine-grained Fusion Tracker for Referring Multi-Object Tracking

Reverse Region-to-Entity Annotation for Pixel-Level Visual Entity Linking

Disentangled Modeling of Preferences and Social Influence for Group Recommendation

The Distributional Reward Critic Framework for Reinforcement Learning Under Perturbed Rewards

Decoupling Metacognition from Cognition: A Framework for Quantifying Metacognitive Ability in LLMs

Calibrated One Round Federated Learning with Bayesian Inference in the Predictive Space

AnyDoor: Zero-shot Object-level Image Customization

PredToken: Predicting Unknown Tokens and Beyond with Coarse-to-Fine Iterative Decoding

Bagged Deep Image Prior for Recovering Images in the Presence of Speckle Noise

Rethinking Generative Large Language Model Evaluation for Semantic Comprehension