Most Cited ICCV "text-guided synthesis" Papers
2,701 papers found • Page 2 of 14
TinyViM: Frequency Decoupling for Tiny Hybrid Vision Mamba
Xiaowen Ma, Zhen-Liang Ni, Xinghao Chen
EVEv2: Improved Baselines for Encoder-Free Vision-Language Models
Haiwen Diao, Xiaotong Li, Yufeng Cui et al.
Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation and Reconstruction
Yuanhao Cai, He Zhang, Kai Zhang et al.
FastVAR: Linear Visual Autoregressive Modeling via Cached Token Pruning
Hang Guo, Yawei Li, Taolin Zhang et al.
DiT4SR: Taming Diffusion Transformer for Real-World Image Super-Resolution
Zheng-Peng Duan, Jiawei Zhang, Xin Jin et al.
VEGGIE: Instructional Editing and Reasoning Video Concepts with Grounded Generation
Shoubin Yu, Difan Liu, Ziqiao Ma et al.
LoRA-FAIR: Federated LoRA Fine-Tuning with Aggregation and Initialization Refinement
Jieming Bian, Lei Wang, Letian Zhang et al.
Meta-Unlearning on Diffusion Models: Preventing Relearning Unlearned Concepts
Hongcheng Gao, Tianyu Pang, Chao Du et al.
3DRealCar: An In-the-wild RGB-D Car Dataset with 360-degree Views
Xiaobiao Du, Yida Wang, Haiyang Sun et al.
Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation
Yuqing Wang, Zhijie Lin, Yao Teng et al.
DreamRenderer: Taming Multi-Instance Attribute Control in Large-Scale Text-to-Image Models
Dewei Zhou, Mingwei Li, Zongxin Yang et al.
FlowTok: Flowing Seamlessly Across Text and Image Tokens
Ju He, Qihang Yu, Qihao Liu et al.
Boosting MLLM Reasoning with Text-Debiased Hint-GRPO
Qihan Huang, Weilong Dai, Jinlong Liu et al.
VSP: Diagnosing the Dual Challenges of Perception and Reasoning in Spatial Planning Tasks for MLLMs
Qiucheng Wu, Handong Zhao, Michael Saxon et al.
Compression of 3D Gaussian Splatting with Optimized Feature Planes and Standard Video Codecs
Soonbin Lee, Fangwen Shu, Yago Sanchez de la Fuente et al.
OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation
Junyuan Zhang, Qintong Zhang, Bin Wang et al.
IFAdapter: Instance Feature Control for Grounded Text-to-Image Generation
Yinwei Wu, Xianpan Zhou, Bing Ma et al.
Learning Precise Affordances from Egocentric Videos for Robotic Manipulation
Li, Nikolaos Tsagkas, Jifei Song et al.
ViLLa: Video Reasoning Segmentation with Large Language Model
Rongkun Zheng, Lu Qi, Xi Chen et al.
Magic Insert: Style-Aware Drag-and-Drop
Nataniel Ruiz, Yuanzhen Li, Neal Wadhwa et al.
USP: Unified Self-Supervised Pretraining for Image Generation and Understanding
Xiangxiang Chu, Renda Li, Yong Wang
AllTracker: Efficient Dense Point Tracking at High Resolution
Adam Harley, Yang You, Yang Zheng et al.
CAD-Assistant: Tool-Augmented VLLMs as Generic CAD Task Solvers
Dimitrios Mallis, Ahmet Karadeniz, Sebastian Cavada et al.
UniCombine: Unified Multi-Conditional Combination with Diffusion Transformer
Haoxuan Wang, Jinlong Peng, Qingdong He et al.
EmbodiedOcc: Embodied 3D Occupancy Prediction for Vision-based Online Scene Understanding
Yuqi Wu, Wenzhao Zheng, Sicheng Zuo et al.
V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding
Junqi Ge, Ziyi Chen, Jintao Lin et al.
FOLDER: Accelerating Multi-Modal Large Language Models with Enhanced Performance
Haicheng Wang, Zhemeng Yu, Gabriele Spadaro et al.
DexVLG: Dexterous Vision-Language-Grasp Model at Scale
Jiawei He, Danshi Li, Xinqiang Yu et al.
UniOcc: A Unified Benchmark for Occupancy Forecasting and Prediction in Autonomous Driving
Yuping Wang, Xiangyu Huang, Xiaokang Sun et al.
Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency
Tianqi Liu, Zihao Huang, Zhaoxi Chen et al.
VLIPP: Towards Physically Plausible Video Generation with Vision and Language Informed Physical Prior
Xindi Yang, Baolu Li, Yiming Zhang et al.
DreamRelation: Relation-Centric Video Customization
Yujie Wei, Shiwei Zhang, Hangjie Yuan et al.
WonderPlay: Dynamic 3D Scene Generation from a Single Image and Actions
Zizhang Li, Hong-Xing Yu, Wei Liu et al.
VLM4D: Towards Spatiotemporal Awareness in Vision Language Models
Shijie Zhou, Alexander Vilesov, Xuehai He et al.
AffordDexGrasp: Open-set Language-guided Dexterous Grasp with Generalizable-Instructive Affordance
Yilin Wei, Mu Lin, Yuhao Lin et al.
An Empirical Study of Autoregressive Pre-training from Videos
Jathushan Rajasegaran, Ilija Radosavovic, Rahul Ravishankar et al.
GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling
Pinxin Liu, Luchuan Song, Junhua Huang et al.
OmniPaint: Mastering Object-Oriented Editing via Disentangled Insertion-Removal Inpainting
Yongsheng Yu, Ziyun Zeng, Haitian Zheng et al.
FoundIR: Unleashing Million-scale Training Data to Advance Foundation Models for Image Restoration
Hao Li, Xiang Chen, Jiangxin Dong et al.
POMATO: Marrying Pointmap Matching with Temporal Motions for Dynamic 3D Reconstruction
Songyan Zhang, Yongtao Ge, Jinyuan Tian et al.
Where am I? Cross-View Geo-localization with Natural Language Descriptions
Junyan Ye, Honglin Lin, Leyan Ou et al.
LMM4LMM: Benchmarking and Evaluating Large-multimodal Image Generation with LMMs
Jiarui Wang, Huiyu Duan, Yu Zhao et al.
Beyond the Destination: A Novel Benchmark for Exploration-Aware Embodied Question Answering
Kaixuan Jiang, Yang Liu, Weixing Chen et al.
UnZipLoRA: Separating Content and Style from a Single Image
Chang Liu, Viraj Shah, Aiyu Cui et al.
Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM
Han Wang, Yuxiang Nie, Yongjie Ye et al.
GaussianFlowOcc: Sparse and Weakly Supervised Occupancy Estimation using Gaussian Splatting and Temporal Flow
Simon Boeder, Fabian Gigengack, Benjamin Risse
Efficient Track Anything
Yunyang Xiong, Chong Zhou, Xiaoyu Xiang et al.
Generating Multi-Image Synthetic Data for Text-to-Image Customization
Nupur Kumari, Xi Yin, Jun-Yan Zhu et al.
I2VControl: Disentangled and Unified Video Motion Synthesis Control
Wanquan Feng, Tianhao Qi, Jiawei Liu et al.
Perceive, Understand and Restore: Real-World Image Super-Resolution with Autoregressive Multimodal Generative Models
Hongyang Wei, Shuaizheng Liu, Chun Yuan et al.
LONG3R: Long Sequence Streaming 3D Reconstruction
Zhuoguang Chen, Minghui Qin, Tianyuan Yuan et al.
VLRMBench: A Comprehensive and Challenging Benchmark for Vision-Language Reward Models
Jiacheng Ruan, Wenzhen Yuan, Xian Gao et al.
V2M4: 4D Mesh Animation Reconstruction from a Single Monocular Video
Jianqi Chen, Biao Zhang, Xiangjun Tang et al.
BillBoard Splatting (BBSplat): Learnable Textured Primitives for Novel View Synthesis
David Svitov, Pietro Morerio, Lourdes Agapito et al.
Harnessing Vision Foundation Models for High-Performance, Training-Free Open Vocabulary Segmentation
Yuheng Shi, Minjing Dong, Chang Xu
Multi-Granular Spatio-Temporal Token Merging for Training-Free Acceleration of Video LLMs
Jeongseok Hyun, Sukjun Hwang, Su Ho Han et al.
MDP3: A Training-free Approach for List-wise Frame Selection in Video-LLMs
Hui Sun, Shiyin Lu, Huanyu Wang et al.
Online Reasoning Video Segmentation with Just-in-Time Digital Twins
Yiqing Shen, Bohan Liu, Chenjia Li et al.
Referring to Any Person
Qing Jiang, Lin Wu, Zhaoyang Zeng et al.
FreeFlux: Understanding and Exploiting Layer-Specific Roles in RoPE-Based MMDiT for Versatile Image Editing
Tianyi Wei, Yifan Zhou, Dongdong Chen et al.
Fine-grained Abnormality Prompt Learning for Zero-shot Anomaly Detection
Jiawen Zhu, Yew-Soon Ong, Chunhua Shen et al.
UnrealZoo: Enriching Photo-realistic Virtual Worlds for Embodied AI
Fangwei Zhong, Kui Wu, Churan Wang et al.
Unleashing Vecset Diffusion Model for Fast Shape Generation
Zeqiang Lai, Yunfei Zhao, Zibo Zhao et al.
LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos
Chin-Yang Lin, Cheng Sun, Fu-En Yang et al.
Know "No" Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP
Junsung Park, Jungbeom Lee, Jongyoon Song et al.
Find Any Part in 3D
Ziqi Ma, Yisong Yue, Georgia Gkioxari
DyWA: Dynamics-adaptive World Action Model for Generalizable Non-prehensile Manipulation
Jiangran Lyu, Ziming Li, Xuesong Shi et al.
DSO: Aligning 3D Generators with Simulation Feedback for Physical Soundness
Ruining Li, Chuanxia Zheng, Christian Rupprecht et al.
Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving
Yue Li, Meng Tian, Zhenyu Lin et al.
Unsupervised Visual Chain-of-Thought Reasoning via Preference Optimization
Kesen Zhao, Beier Zhu, Qianru Sun et al.
When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning
Junwei Luo, Yingying Zhang, Xue Yang et al.
Adversarial Distribution Matching for Diffusion Distillation Towards Efficient Image and Video Synthesis
Yanzuo Lu, Yuxi Ren, Xin Xia et al.
UniMLVG: Unified Framework for Multi-view Long Video Generation with Comprehensive Control Capabilities for Autonomous Driving
Rui Chen, Zehuan Wu, Yichen Liu et al.
Semi-supervised Concept Bottleneck Models
Lijie Hu, Tianhao Huang, Huanyi Xie et al.
TREAD: Token Routing for Efficient Architecture-agnostic Diffusion Training
Felix Krause, Timy Phan, Ming Gui et al.
Growing a Twig to Accelerate Large Vision-Language Models
Zhenwei Shao, Mingyang Wang, Zhou Yu et al.
LBM: Latent Bridge Matching for Fast Image-to-Image Translation
Clément Chadebec, Onur Tasar, Sanjeev Sreetharan et al.
Back on Track: Bundle Adjustment for Dynamic Scene Reconstruction
Weirong Chen, Ganlin Zhang, Felix Wimbauer et al.
NuPlanQA: A Large-Scale Dataset and Benchmark for Multi-View Driving Scene Understanding in Multi-Modal Large Language Models
Sung-Yeon Park, Can Cui, Yunsheng Ma et al.
AIGI-Holmes: Towards Explainable and Generalizable AI-Generated Image Detection via Multimodal Large Language Models
Ziyin Zhou, Yunpeng Luo, Yuanchen Wu et al.
RealGeneral: Unifying Visual Generation via Temporal In-Context Learning with Video Models
Yijing Lin, Mengqi Huang, Shuhan Zhuang et al.
BVINet: Unlocking Blind Video Inpainting with Zero Annotations
Zhiliang Wu, Kerui Chen, Kun Li et al.
LUDVIG: Learning-Free Uplifting of 2D Visual Features to Gaussian Splatting Scenes
Juliette Marrie, Romain Menegaux, Michael Arbel et al.
InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models
Cong Wei, Yujie Zhong, Yingsen Zeng et al.
MagicID: Hybrid Preference Optimization for ID-Consistent and Dynamic-Preserved Video Customization
Hengjia Li, Lifan Jiang, Xi Xiao et al.
VPO: Aligning Text-to-Video Generation Models with Prompt Optimization
Jiale Cheng, Ruiliang Lyu, Xiaotao Gu et al.
Perception-as-Control: Fine-grained Controllable Image Animation with 3D-aware Motion Representation
Yingjie Chen, Yifang Men, Yuan Yao et al.
Detect Anything 3D in the Wild
Hanxue Zhang, Haoran Jiang, Qingsong Yao et al.
RoboFactory: Exploring Embodied Agent Collaboration with Compositional Constraints
Yiran Qin, Li Kang, Xiufeng Song et al.
MaterialMVP: Illumination-Invariant Material Generation via Multi-view PBR Diffusion
Zebin He, Mx Yang, Shuhui Yang et al.
GWM: Towards Scalable Gaussian World Models for Robotic Manipulation
Guanxing Lu, Baoxiong Jia, Puhao Li et al.
Derm1M: A Million-scale Vision-Language Dataset Aligned with Clinical Ontology Knowledge for Dermatology
Siyuan Yan, Ming Hu, Yiwen Jiang et al.
Creation-MMBench: Assessing Context-Aware Creative Intelligence in MLLMs
Xinyu Fang, Zhijian Chen, Kai Lan et al.
Beyond Losses Reweighting: Empowering Multi-Task Learning via the Generalization Perspective
Hoang Phan, Tung Lam Tran, Quyen Tran et al.
Learning Few-Step Diffusion Models by Trajectory Distribution Matching
Yihong Luo, Tianyang Hu, Jiacheng Sun et al.
CoA-VLA: Improving Vision-Language-Action Models via Visual-Text Chain-of-Affordance
Jinming Li, Yichen Zhu, Zhibin Tang et al.
MoMa-Kitchen: A 100K+ Benchmark for Affordance-Grounded Last-Mile Navigation in Mobile Manipulation
Pingrui Zhang, Xianqiang Gao, Yuhan Wu et al.
RoboTron-Mani: All-in-One Multimodal Large Model for Robotic Manipulation
Feng Yan, Fanfan Liu, Yiyang Huang et al.
XTrack: Multimodal Training Boosts RGB-X Video Object Trackers
Yuedong Tan, Zongwei Wu, Yuqian Fu et al.
CityGS-X: A Scalable Architecture for Efficient and Geometrically Accurate Large-Scale Scene Reconstruction
Yuanyuan Gao, Hao Li, Jiaqi Chen et al.
Mobile Video Diffusion
Haitam Ben Yahia, Denis Korzhenkov, Ioannis Lelekas et al.
CharaConsist: Fine-Grained Consistent Character Generation
Mengyu Wang, Henghui Ding, Jianing Peng et al.
An OpenMind for 3D Medical Vision Self-supervised Learning
Tassilo Wald, Constantin Ulrich, Jonathan Suprijadi et al.
MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild
Xi Fang, Jiankun Wang, Xiaochen Cai et al.
LoftUp: Learning a Coordinate-Based Feature Upsampler for Vision Foundation Models
Haiwen Huang, Anpei Chen, Volodymyr Havrylov et al.
ObjectMate: A Recurrence Prior for Object Insertion and Subject-Driven Generation
Daniel Winter, Asaf Shul, Matan Cohen et al.
MaskControl: Spatio-Temporal Control for Masked Motion Synthesis
Ekkasit Pinyoanuntapong, Muhammad Usama Saleem, Korrawe Karunratanakul et al.
SeaS: Few-shot Industrial Anomaly Image Generation with Separation and Sharing Fine-tuning
Zhewei Dai, Shilei Zeng, Haotian Liu et al.
Effective Training Data Synthesis for Improving MLLM Chart Understanding
Yuwei Yang, Zeyu Zhang, Yunzhong Hou et al.
RoboTron-Drive: All-in-One Large Multimodal Model for Autonomous Driving
Zhijian Huang, Chengjian Feng, Baihui Xiao et al.
CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation
Lin Sun, Jiale Cao, Jin Xie et al.
MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm
Ziyan Guo, Zeyu Hu, Na Zhao et al.
SuperEdit: Rectifying and Facilitating Supervision for Instruction-Based Image Editing
Ming Li, Xin Gu, Fan Chen et al.
Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma?
Tianyuan Qu, Longxiang Tang, Bohao Peng et al.
FramePainter: Endowing Interactive Image Editing with Video Diffusion Priors
Yabo Zhang, Xinpeng Zhou, Yihan Zeng et al.
DreamDance: Animating Human Images by Enriching 3D Geometry Cues from 2D Poses
Yatian Pang, Bin Zhu, Bin Lin et al.
UniVG: A Generalist Diffusion Model for Unified Image Generation and Editing
Tsu-Jui Fu, Yusu Qian, Chen Chen et al.
EmotiCrafter: Text-to-Emotional-Image Generation based on Valence-Arousal Model
Shengqi Dang, Yi He, Long Ling et al.
How Far are AI-generated Videos from Simulating the 3D Visual World: A Learned 3D Evaluation Approach
Chirui Chang, Jiahui Liu, Zhengzhe Liu et al.
HiP-AD: Hierarchical and Multi-Granularity Planning with Deformable Attention for Autonomous Driving in a Single Decoder
Yingqi Tang, Zhuoran Xu, Zhaotie Meng et al.
3D Mesh Editing using Masked LRMs
William Gao, Dilin Wang, Yuchen Fan et al.
Momentum-GS: Momentum Gaussian Self-Distillation for High-Quality Large Scene Reconstruction
Jixuan Fan, Wanhua Li, Yifei Han et al.
Gaussian Variation Field Diffusion for High-fidelity Video-to-4D Synthesis
Bowen Zhang, Sicheng Xu, Chuxin Wang et al.
ICE-Bench: A Unified and Comprehensive Benchmark for Image Creating and Editing
Yulin Pan, Xiangteng He, Chaojie Mao et al.
Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding
Yue Fan, Xiaojian Ma, Rongpeng Su et al.
SplatTalk: 3D VQA with Gaussian Splatting
Anh Thai, Kyle Genova, Songyou Peng et al.
Rectifying Magnitude Neglect in Linear Attention
Qihang Fan, Huaibo Huang, Yuang Ai et al.
LATINO-PRO: LAtent consisTency INverse sOlver with PRompt Optimization
Alessio Spagnoletti, Jean Prost, Andres Almansa et al.
Synthetic Video Enhances Physical Fidelity in Video Synthesis
Qi Zhao, Xingyu Ni, Ziyu Wang et al.
DiTFastAttnV2: Head-wise Attention Compression for Multi-Modality Diffusion Transformers
Hanling Zhang, Rundong Su, Zhihang Yuan et al.
AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation
Moayed Haji-Ali, Willi Menapace, Aliaksandr Siarohin et al.
SuperDec: 3D Scene Decomposition with Superquadrics Primitives
Elisabetta Fedele, Boyang Sun, Francis Engelmann et al.
Learning Robust Stereo Matching in the Wild with Selective Mixture-of-Experts
Yun Wang, Longguang Wang, Chenghao Zhang et al.
Reangle-A-Video: 4D Video Generation as Video-to-Video Translation
Hyeonho Jeong, Suhyeon Lee, Jong Ye
Flash-VStream: Efficient Real-Time Understanding for Long Video Streams
Haoji Zhang, Yiqin Wang, Yansong Tang et al.
FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal Large Language Models via Visual Registers
Renshan Zhang, Rui Shao, Gongwei Chen et al.
Generative Gaussian Splatting: Generating 3D Scenes with Video Diffusion Priors
Katja Schwarz, Norman Müller, Peter Kontschieder
X-Dancer: Expressive Music to Human Dance Video Generation
Zeyuan Chen, Hongyi Xu, Guoxian Song et al.
3DGraphLLM: Combining Semantic Graphs and Large Language Models for 3D Scene Understanding
Tatiana Zemskova, Dmitry Yudin
Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding
Yiming Zhang, Zhuokai Zhao, Zhaorun Chen et al.
ViSpeak: Visual Instruction Feedback in Streaming Videos
Shenghao Fu, Qize Yang, Yuan-Ming Li et al.
UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence
Jie Feng, Shengyuan Wang, Tianhui Liu et al.
Dense Policy: Bidirectional Autoregressive Learning of Actions
Yue Su, Xinyu Zhan, Hongjie Fang et al.
Audio-visual Controlled Video Diffusion with Masked Selective State Spaces Modeling for Natural Talking Head Generation
Fating Hong, Zunnan Xu, Zixiang Zhou et al.
ROADWork: A Dataset and Benchmark for Learning to Recognize, Observe, Analyze and Drive Through Work Zones
Anurag Ghosh, Shen Zheng, Robert Tamburo et al.
Harnessing Massive Satellite Imagery with Efficient Masked Image Modeling
Fengxiang Wang, Hongzhen Wang, Di Wang et al.
SILO: Solving Inverse Problems with Latent Operators
Ron Raphaeli, Sean Man, Michael Elad
Contrastive Flow Matching
George Stoica, Vivek Ramanujan, Xiang Fan et al.
GENMO: A GENeralist Model for Human MOtion
Jiefeng Li, Jinkun Cao, Haotian Zhang et al.
MINERVA: Evaluating Complex Video Reasoning
Arsha Nagrani, Sachit Menon, Ahmet Iscen et al.
RANKCLIP: Ranking-Consistent Language-Image Pretraining
Yiming Zhang, Zhuokai Zhao, Zhaorun Chen et al.
Large Multi-modal Models Can Interpret Features in Large Multi-modal Models
Kaichen Zhang, Yifei Shen, Bo Li et al.
Quadratic Gaussian Splatting: High Quality Surface Reconstruction with Second-order Geometric Primitives
Ziyu Zhang, Binbin Huang, Hanqing Jiang et al.
Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM
Yatai Ji, Jiacheng Zhang, Jie Wu et al.
WSI-LLaVA: A Multimodal Large Language Model for Whole Slide Image
Yuci Liang, Xinheng Lyu, Meidan Ding et al.
Beyond Walking: A Large-Scale Image-Text Benchmark for Text-based Person Anomaly Search
Shuyu Yang, Yaxiong Wang, Li Zhu et al.
MagicColor: Multi-instance Sketch Colorization
Yinhan Zhang, Yue Ma, Bingyuan Wang et al.
AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs
Sanjoy Chowdhury, Sayan Nag, Subhrajyoti Dasgupta et al.
Nautilus: Locality-aware Autoencoder for Scalable Mesh Generation
Yuxuan Wang, Xuanyu Yi, Haohan Weng et al.
Decoupled Diffusion Sparks Adaptive Scene Generation
Yunsong Zhou, Naisheng Ye, William Ljungbergh et al.
Benchmarking Multimodal CoT Reward Model Stepwise by Visual Program
Minghe Gao, Xuqi Liu, Zhongqi Yue et al.
Advancing Textual Prompt Learning with Anchored Attributes
Zheng Li, Yibing Song, Ming-Ming Cheng et al.
TAR3D: Creating High-Quality 3D Assets via Next-Part Prediction
Xuying Zhang, Yutong Liu, Yangguang Li et al.
A Recipe for Generating 3D Worlds from a Single Image
Katja Schwarz, Denis Rozumny, Samuel Rota Bulò et al.
Spectral Image Tokenizer
Carlos Esteves, Mohammed Suhail, Ameesh Makadia
Make Me Happier: Evoking Emotions Through Image Diffusion Models
Qing Lin, Jingfeng Zhang, Yew-Soon Ong et al.
Visual Test-time Scaling for GUI Agent Grounding
Tiange Luo, Lajanugen Logeswaran, Justin Johnson et al.
RoMo: Robust Motion Segmentation Improves Structure from Motion
Lily Goli, Sara Sabour, Mark Matthews et al.
Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs
Yikang Zhou, Tao Zhang, Shilin Xu et al.
WildSAT: Learning Satellite Image Representations from Wildlife Observations
Rangel Daroya, Elijah Cole, Oisin Mac Aodha et al.
PanSt3R: Multi-view Consistent Panoptic Segmentation
Lojze Zust, Yohann Cabon, Juliette Marrie et al.
OMNI-DC: Highly Robust Depth Completion with Multiresolution Depth Integration
Yiming Zuo, Willow Yang, Zeyu Ma et al.
GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers
Shijie Ma, Yuying Ge, Teng Wang et al.
No Pose at All: Self-Supervised Pose-Free 3D Gaussian Splatting from Sparse Views
Ranran Huang, Krystian Mikolajczyk
Di[M]O: Distilling Masked Diffusion Models into One-step Generator
Yuanzhi Zhu, Xi Wang, Stéphane Lathuilière et al.
TACO: Taming Diffusion for in-the-wild Video Amodal Completion
Ruijie Lu, Yixin Chen, Yu Liu et al.
Generalized and Efficient 2D Gaussian Splatting for Arbitrary-scale Super-Resolution
Du Chen, Liyi Chen, Zhengqiang Zhang et al.
RomanTex: Decoupling 3D-aware Rotary Positional Embedded Multi-Attention Network for Texture Synthesis
Yifei Feng, Mx Yang, Shuhui Yang et al.
SliderSpace: Decomposing the Visual Capabilities of Diffusion Models
Rohit Gandikota, Zongze Wu, Richard Zhang et al.
AnimateAnyMesh: A Feed-Forward 4D Foundation Model for Text-Driven Universal Mesh Animation
Zijie Wu, Chaohui Yu, Fan Wang et al.
LoRA.rar: Learning to Merge LoRAs via Hypernetworks for Subject-Style Conditioned Image Generation
Donald Shenaj, Ondrej Bohdal, Mete Ozay et al.
Fine-structure Preserved Real-world Image Super-resolution via Transfer VAE Training
Qiaosi Yi, Shuai Li, Rongyuan Wu et al.
DIVE: Taming DINO for Subject-Driven Video Editing
Yi Huang, Wei Xiong, He Zhang et al.
NAVER: A Neuro-Symbolic Compositional Automaton for Visual Grounding with Explicit Logic Reasoning
Zhixi Cai, Fucai Ke, Simindokht Jahangard et al.
VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos
Jiashuo Yu, Yue Wu, Meng Chu et al.
HiMTok: Learning Hierarchical Mask Tokens for Image Segmentation with Large Multimodal Model
Tao Wang, Changxu Cheng, Lingfeng Wang et al.
Fine-Tuning Visual Autoregressive Models for Subject-Driven Generation
Jiwoo Chung, Sangeek Hyun, Hyunjun Kim et al.
FixTalk: Taming Identity Leakage for High-Quality Talking Head Generation in Extreme Cases
Shuai Tan, Bill Gong, Bin Ji et al.
GeoSplatting: Towards Geometry Guided Gaussian Splatting for Physically-based Inverse Rendering
Kai Ye, Chong Gao, Guanbin Li et al.
EMD: Explicit Motion Modeling for High-Quality Street Gaussian Splatting
Xiaobao Wei, Qingpo Wuwu, Zhongyu Zhao et al.
VideoOrion: Tokenizing Object Dynamics in Videos
Yicheng Feng, Yijiang Li, Wanpeng Zhang et al.
NormalCrafter: Learning Temporally Consistent Normals from Video Diffusion Priors
Yanrui Bin, Wenbo Hu, Haoyuan Wang et al.
QuantCache: Adaptive Importance-Guided Quantization with Hierarchical Latent and Layer Caching for Video Generation
Junyi Wu, Zhiteng Li, Zheng Hui et al.
Chimera: Improving Generalist Model with Domain-Specific Experts
Tianshuo Peng, Mingsheng Li, Jiakang Yuan et al.
MultiADS: Defect-aware Supervision for Multi-type Anomaly Detection and Segmentation in Zero-Shot Learning
Ylli Sadikaj, Hongkuan Zhou, Lavdim Halilaj et al.
Knowledge Distillation with Refined Logits
Wujie Sun, Defang Chen, Siwei Lyu et al.
NeuralSVG: An Implicit Representation for Text-to-Vector Generation
Sagi Polaczek, Yuval Alaluf, Elad Richardson et al.
CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games
Peng Chen, Pi Bu, Yingyao Wang et al.
HUMOTO: A 4D Dataset of Mocap Human Object Interactions
Jiaxin Lu, Chun-Hao Huang, Uttaran Bhattacharya et al.
Turbo2K: Towards Ultra-Efficient and High-Quality 2K Video Synthesis
Jingjing Ren, Wenbo Li, Zhongdao Wang et al.