Most Cited CVPR "morton-order scanning" Papers
5,589 papers found • Page 1 of 28
Conference
Improved Baselines with Visual Instruction Tuning
Haotian Liu, Chunyuan Li, Yuheng Li et al.
DETRs Beat YOLOs on Real-time Object Detection
Yian Zhao, Wenyu Lv, Shangliang Xu et al.
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Zhe Chen, Jiannan Wu, Wenhai Wang et al.
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
Xiang Yue, Yuansheng Ni, Kai Zhang et al.
Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data
Lihe Yang, Bingyi Kang, Zilong Huang et al.
4D Gaussian Splatting for Real-Time Dynamic Scene Rendering
Guanjun Wu, Taoran Yi, Jiemin Fang et al.
VBench: Comprehensive Benchmark Suite for Video Generative Models
Ziqi Huang, Yinan He, Jiashuo Yu et al.
DUSt3R: Geometric 3D Vision Made Easy
Shuzhe Wang, Vincent Leroy, Yohann Cabon et al.
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Chaoyou Fu, Yuhan Dai, Yongdong Luo et al.
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
Kunchang Li, Yali Wang, Yinan He et al.
LISA: Reasoning Segmentation via Large Language Model
Xin Lai, Zhuotao Tian, Yukang Chen et al.
Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction
Ziyi Yang, Xinyu Gao, Wen Zhou et al.
VILA: On Pre-training for Visual Language Models
Ji Lin, Danny Yin, Wei Ping et al.
Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation
Li Hu
YOLO-World: Real-Time Open-Vocabulary Object Detection
Tianheng Cheng, Lin Song, Yixiao Ge et al.
Wonder3D: Single Image to 3D using Cross-Domain Diffusion
Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin et al.
SuGaR: Surface-Aligned Gaussian Splatting for Efficient 3D Mesh Reconstruction and High-Quality Mesh Rendering
Antoine Guédon, Vincent Lepetit
CogAgent: A Visual Language Model for GUI Agents
Wenyi Hong, Weihan Wang, Qingsong Lv et al.
Mip-Splatting: Alias-free 3D Gaussian Splatting
Zehao Yu, Anpei Chen, Binbin Huang et al.
Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering
Tao Lu, Mulin Yu, Linning Xu et al.
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
Qinghao Ye, Haiyang Xu, Jiabo Ye et al.
VGGT: Visual Geometry Grounded Transformer
Jianyuan Wang, Minghao Chen, Nikita Karaev et al.
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
Shengbang Tong, Zhuang Liu, Yuexiang Zhai et al.
One-step Diffusion with Distribution Matching Distillation
Tianwei Yin, Michaël Gharbi, Richard Zhang et al.
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
Boyuan Chen, Zhuo Xu, Sean Kirmani et al.
Diffusion Model Alignment Using Direct Preference Optimization
Bram Wallace, Meihua Dang, Rafael Rafailov et al.
pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction
David Charatan, Sizhe Lester Li, Andrea Tagliasacchi et al.
VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models
Haoxin Chen, Yong Zhang, Xiaodong Cun et al.
SplaTAM: Splat Track & Map 3D Gaussians for Dense RGB-D SLAM
Nikhil Keetha, Jay Karhade, Krishna Murthy Jatavallabhula et al.
Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
Sicong Leng, Hang Zhang, Guanzheng Chen et al.
RepViT: Revisiting Mobile CNN From ViT Perspective
Ao Wang, Hui Chen, Zijia Lin et al.
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
Enxin Song, Wenhao Chai, Guanhong Wang et al.
Gaussian Splatting SLAM
Hidenobu Matsuki, Riku Murai, Paul Kelly et al.
Generative Multimodal Models are In-Context Learners
Quan Sun, Yufeng Cui, Xiaosong Zhang et al.
FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects
Bowen Wen, Wei Yang, Jan Kautz et al.
Structured 3D Latents for Scalable and Versatile 3D Generation
Jianfeng XIANG, Zelong Lv, Sicheng Xu et al.
AnyDoor: Zero-shot Object-level Image Customization
Xi Chen, Lianghua Huang, Yu Liu et al.
GLaMM: Pixel Grounding Large Multimodal Model
Hanoona Rasheed, Muhammad Maaz, Sahal Shaji Mullappilly et al.
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks
Bin Xiao, Haiping Wu, Weijian Xu et al.
Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models
Zhang Li, Biao Yang, Qiang Liu et al.
HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models
Tianrui Guan, Fuxiao Liu, Xiyang Wu et al.
OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
Qidong Huang, Xiaoyi Dong, Pan Zhang et al.
InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning
Jing Shi, Wei Xiong, Zhe Lin et al.
GS-SLAM: Dense Visual SLAM with 3D Gaussian Splatting
Chi Yan, Delin Qu, Dong Wang et al.
TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding
Shuhuai Ren, Linli Yao, Shicheng Li et al.
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
Jihan Yang, Shusheng Yang, Anjali W. Gupta et al.
LangSplat: 3D Language Gaussian Splatting
Minghan Qin, Wanhua Li, Jiawei ZHOU et al.
Compact 3D Gaussian Representation for Radiance Field
Joo Chan Lee, Daniel Rho, Xiangyu Sun et al.
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
Peng Jin, Ryuichi Takanobu, Cai Zhang et al.
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
Tianyu Yu, Yuan Yao, Haoye Zhang et al.
DrivingGaussian: Composite Gaussian Splatting for Surrounding Dynamic Autonomous Driving Scenes
Xiaoyu Zhou, Zhiwei Lin, Xiaojun Shan et al.
Analyzing and Improving the Training Dynamics of Diffusion Models
Tero Karras, Miika Aittala, Jaakko Lehtinen et al.
Rewrite the Stars
Xu Ma, Xiyang Dai, Yue Bai et al.
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace et al.
V?: Guided Visual Search as a Core Mechanism in Multimodal LLMs
Penghao Wu, Saining Xie
Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives
Kristen Grauman, Andrew Westbury, Lorenzo Torresani et al.
Poly Kernel Inception Network for Remote Sensing Detection
Xinhao Cai, Qiuxia Lai, Yuwei Wang et al.
Feature 3DGS: Supercharging 3D Gaussian Splatting to Enable Distilled Feature Fields
Shijie Zhou, Haoran Chang, Sicheng Jiang et al.
Text-to-3D using Gaussian Splatting
Zilong Chen, Feng Wang, Yikai Wang et al.
GaussianEditor: Swift and Controllable 3D Editing with Gaussian Splatting
Yiwen Chen, Zilong Chen, Chi Zhang et al.
Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation
Bingxin Ke, Anton Obukhov, Shengyu Huang et al.
Splatter Image: Ultra-Fast Single-View 3D Reconstruction
Stanislaw Szymanowicz, Christian Rupprecht, Andrea Vedaldi
PhysGaussian: Physics-Integrated 3D Gaussians for Generative Dynamics
Tianyi Xie, Zeshun Zong, Yuxing Qiu et al.
MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model
Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew et al.
PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding
Zhen Li, Mingdeng Cao, Xintao Wang et al.
GeoChat: Grounded Large Vision-Language Model for Remote Sensing
Kartik Kuckreja, Muhammad Sohail Danish, Muzammal Naseer et al.
DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing
Yujun Shi, Chuhui Xue, Jun Hao Liew et al.
UniDepth: Universal Monocular Metric Depth Estimation
Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis et al.
Video-P2P: Video Editing with Cross-attention Control
Shaoteng Liu, Yuechen Zhang, Wenbo Li et al.
SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes
Yihua Huang, Yangtian Sun, Ziyi Yang et al.
Spacetime Gaussian Feature Splatting for Real-Time Dynamic View Synthesis
Zhan Li, Zhang Chen, Zhong Li et al.
MoMask: Generative Masked Modeling of 3D Human Motions
chuan guo, Yuxuan Mu, Muhammad Gohar Javed et al.
Rethinking FID: Towards a Better Evaluation Metric for Image Generation
Sadeep Jayasumana, Srikumar Ramalingam, Andreas Veit et al.
ReconFusion: 3D Reconstruction with Diffusion Priors
Rundi Wu, Ben Mildenhall, Philipp Henzler et al.
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
Chengyue Wu, Xiaokang Chen, Zhiyu Wu et al.
One-2-3-45++: Fast Single Image to 3D Objects with Consistent Multi-View Generation and 3D Diffusion
Minghua Liu, Ruoxi Shi, Linghao Chen et al.
LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching
Yixun Liang, Xin Yang, Jiantao Lin et al.
InceptionNeXt: When Inception Meets ConvNeXt
Weihao Yu, Pan Zhou, Shuicheng Yan et al.
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action
Jiasen Lu, Christopher Clark, Sangho Lee et al.
DeepCache: Accelerating Diffusion Models for Free
Xinyin Ma, Gongfan Fang, Xinchao Wang
TransNeXt: Robust Foveal Visual Perception for Vision Transformers
Dai Shi
Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers
Zi-Xin Zou, Zhipeng Yu, Yuan-Chen Guo et al.
DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision
Lu Ling, Yichen Sheng, Zhi Tu et al.
SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution
Rongyuan Wu, Tao Yang, Lingchen Sun et al.
OmniGen: Unified Image Generation
Shitao Xiao, Yueze Wang, Junjie Zhou et al.
MambaVision: A Hybrid Mamba-Transformer Vision Backbone
Ali Hatamizadeh, Jan Kautz
Reconstructing Hands in 3D with Transformers
Georgios Pavlakos, Dandan Shan, Ilija Radosavovic et al.
VTimeLLM: Empower LLM to Grasp Video Moments
Bin Huang, Xin Wang, Hong Chen et al.
On Scaling Up a Multilingual Vision and Language Model
Xi Chen, Josip Djolonga, Piotr Padlewski et al.
Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving
Yuqi Wang, Jiawei He, Lue Fan et al.
LMDrive: Closed-Loop End-to-End Driving with Large Language Models
Hao Shao, Yuxuan Hu, Letian Wang et al.
Continuous 3D Perception Model with Persistent State
Qianqian Wang, Yifei Zhang, Aleksander Holynski et al.
Emu Edit: Precise Image Editing via Recognition and Generation Tasks
Shelly Sheynin, Adam Polyak, Uriel Singer et al.
EvalCrafter: Benchmarking and Evaluating Large Video Generation Models
Yaofang Liu, Xiaodong Cun, Xuebo Liu et al.
RoMa: Robust Dense Feature Matching
Johan Edstedt, Qiyu Sun, Georg Bökman et al.
An Edit Friendly DDPM Noise Space: Inversion and Manipulations
Inbar Huberman-Spiegelglas, Vladimir Kulikov, Tomer Michaeli
GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models
Taoran Yi, Jiemin Fang, Junjie Wang et al.
EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything
Yunyang Xiong, Balakrishnan Varadarajan, Lemeng Wu et al.
CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models
Qingqing Zhao, Yao Lu, Moo Jin Kim et al.
SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery
Xin Guo, Jiangwei Lao, Bo Dang et al.
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio Video Point Cloud Time-Series and Image Recognition
Xiaohan Ding, Yiyuan Zhang, Yixiao Ge et al.
DNGaussian: Optimizing Sparse-View 3D Gaussian Radiance Fields with Global-Local Depth Normalization
Jiahe Li, Jiawei Zhang, Xiao Bai et al.
GaussianAvatars: Photorealistic Head Avatars with Rigged 3D Gaussians
Shenhan Qian, Tobias Kirschstein, Liam Schoneveld et al.
Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild
Fanghua Yu, Jinjin Gu, Zheyuan Li et al.
Sequential Modeling Enables Scalable Learning for Large Vision Models
Yutong Bai, Xinyang Geng, Karttikeya Mangalam et al.
GaussianShader: 3D Gaussian Splatting with Shading Functions for Reflective Surfaces
Yingwenqi Jiang, Jiadong Tu, Yuan Liu et al.
HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models
Nataniel Ruiz, Yuanzhen Li, Varun Jampani et al.
Style Aligned Image Generation via Shared Attention
Amir Hertz, Andrey Voynov, Shlomi Fruchter et al.
OpenEQA: Embodied Question Answering in the Era of Foundation Models
Arjun Majumdar, Anurag Ajay, Xiaohan Zhang et al.
FreeU: Free Lunch in Diffusion U-Net
Chenyang Si, Ziqi Huang, Yuming Jiang et al.
SinSR: Diffusion-Based Image Super-Resolution in a Single Step
Yufei Wang, Wenhan Yang, Xinyuan Chen et al.
MACE: Mass Concept Erasure in Diffusion Models
Shilin Lu, Zilan Wang, Leyang Li et al.
Style Injection in Diffusion: A Training-free Approach for Adapting Large-scale Diffusion Models for Style Transfer
Jiwoo Chung, Sangeek Hyun, Jae-Pil Heo
Compressed 3D Gaussian Splatting for Accelerated Novel View Synthesis
Simon Niedermayr, Josef Stumpfegger, rüdiger westermann
EMCAD: Efficient Multi-scale Convolutional Attention Decoding for Medical Image Segmentation
Md Mostafijur Rahman, Mustafa Munir, Radu Marculescu
VastGaussian: Vast 3D Gaussians for Large Scene Reconstruction
Jiaqi Lin, Zhihao Li, Xiao Tang et al.
MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers
Yawar Siddiqui, Antonio Alliegro, Alexey Artemov et al.
Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model
Kai Yang, Jian Tao, Jiafei Lyu et al.
Honeybee: Locality-enhanced Projector for Multimodal LLM
Junbum Cha, Woo-Young Kang, Jonghwan Mun et al.
3DGS-Avatar: Animatable Avatars via Deformable 3D Gaussian Splatting
Zhiyin Qian, Shaofei Wang, Marko Mihajlovic et al.
Infinity∞: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis
Jian Han, Jinlai Liu, Yi Jiang et al.
OneLLM: One Framework to Align All Modalities with Language
Jiaming Han, Kaixiong Gong, Yiyuan Zhang et al.
COLMAP-Free 3D Gaussian Splatting
Yang Fu, Sifei Liu, Amey Kulkarni et al.
ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding
Le Xue, Ning Yu, Shu Zhang et al.
GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians
Liangxiao Hu, Hongwen Zhang, Yuxiang Zhang et al.
PixelLM: Pixel Reasoning with Large Multimodal Model
Zhongwei Ren, Zhicheng Huang, Yunchao Wei et al.
CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation
Seokju Cho, Heeseong Shin, Sunghwan Hong et al.
MambaOut: Do We Really Need Mamba for Vision?
Weihao Yu, Xinchao Wang
Holodeck: Language Guided Generation of 3D Embodied AI Environments
Yue Yang, Fan-Yun Sun, Luca Weihs et al.
GS-IR: 3D Gaussian Splatting for Inverse Rendering
Zhihao Liang, Qi Zhang, Ying Feng et al.
Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters
Jiazuo Yu, Yunzhi Zhuge, Lu Zhang et al.
Gaussian-Flow: 4D Reconstruction with Dynamic 3D Gaussian Particle
Youtian Lin, Zuozhuo Dai, Siyu Zhu et al.
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
Bo He, Hengduo Li, Young Kyun Jang et al.
Putting the Object Back into Video Object Segmentation
Ho Kei Cheng, Seoung Wug Oh, Brian Price et al.
Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
Jingfeng Yao, Bin Yang, Xinggang Wang
RMT: Retentive Networks Meet Vision Transformers
Qihang Fan, Huaibo Huang, Mingrui Chen et al.
ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation
Xiaoqi Li, Mingxu Zhang, Yiran Geng et al.
InstanceDiffusion: Instance-level Control for Image Generation
XuDong Wang, Trevor Darrell, Sai Saketh Rambhatla et al.
Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass
Jianing "Jed" Yang, Alexander Sax, Kevin Liang et al.
StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On
Jeongho Kim, Gyojung Gu, Minho Park et al.
Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding
Jin-Chuan Shi, Miao Wang, Haobin Duan et al.
BioCLIP: A Vision Foundation Model for the Tree of Life
Samuel Stevens, Jiaman Wu, Matthew Thompson et al.
RealNet: A Feature Selection Network with Realistic Synthetic Anomaly for Anomaly Detection
Ximiao Zhang, Min Xu, Xiuzhuang Zhou
SplattingAvatar: Realistic Real-Time Human Avatars with Mesh-Embedded Gaussian Splatting
Zhijing Shao, Wang Zhaolong, Zhuang Li et al.
4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling
Sherwin Bahmani, Ivan Skorokhodov, Victor Rong et al.
DreamAvatar: Text-and-Shape Guided 3D Human Avatar Generation via Diffusion Models
Yukang Cao, Yan-Pei Cao, Kai Han et al.
Align Your Gaussians: Text-to-4D with Dynamic 3D Gaussians and Composed Diffusion Models
Huan Ling, Seung Wook Kim, Antonio Torralba et al.
RichDreamer: A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-3D
Lingteng Qiu, Guanying Chen, Xiaodong Gu et al.
Photo-SLAM: Real-time Simultaneous Localization and Photorealistic Mapping for Monocular Stereo and RGB-D Cameras
Huajian Huang, Longwei Li, Hui Cheng et al.
Logit Standardization in Knowledge Distillation
Shangquan Sun, Wenqi Ren, Jingzhi Li et al.
Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?
Zhiqi Li, Zhiding Yu, Shiyi Lan et al.
Compositional Chain-of-Thought Prompting for Large Multimodal Models
Chancharik Mitra, Brandon Huang, Trevor Darrell et al.
GauHuman: Articulated Gaussian Splatting from Monocular Human Videos
Shoukang Hu, Tao Hu, Ziwei Liu
Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
Zeyi Sun, Ye Fang, Tong Wu et al.
UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs
Yanwu Xu, Yang Zhao, Zhisheng Xiao et al.
WHAM: Reconstructing World-grounded Humans with Accurate 3D Motion
Soyong Shin, Juyong Kim, Eni Halilaj et al.
HIVE: Harnessing Human Feedback for Instructional Visual Editing
Shu Zhang, Xinyi Yang, Yihao Feng et al.
GaussianEditor: Editing 3D Gaussians Delicately with Text Instructions
Junjie Wang, Jiemin Fang, Xiaopeng Zhang et al.
GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis
Shunyuan Zheng, Boyao ZHOU, Ruizhi Shao et al.
Equivariant Multi-Modality Image Fusion
Zixiang Zhao, Haowen Bai, Jiangshe Zhang et al.
DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving
Bencheng Liao, Shaoyu Chen, haoran yin et al.
StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text
Roberto Henschel, Levon Khachatryan, Hayk Poghosyan et al.
MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision
Ruicheng Wang, Sicheng Xu, Cassie Lee Dai et al.
Grounded Text-to-Image Synthesis with Attention Refocusing
Quynh Phung, Songwei Ge, Jia-Bin Huang
InstructDiffusion: A Generalist Modeling Interface for Vision Tasks
Zigang Geng, Binxin Yang, Tiankai Hang et al.
Multi-Modal Hallucination Control by Visual Information Grounding
Alessandro Favero, Luca Zancato, Matthew Trager et al.
Multi-Scale 3D Gaussian Splatting for Anti-Aliased Rendering
Zhiwen Yan, Weng Fei Low, Yu Chen et al.
DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos
Wenbo Hu, Xiangjun Gao, Xiaoyu Li et al.
Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models
Haoning Wu, Zicheng Zhang, Erli Zhang et al.
DreamVideo: Composing Your Dream Videos with Customized Subject and Motion
Yujie Wei, Shiwei Zhang, Zhiwu Qing et al.
NVILA: Efficient Frontier Visual Language Models
Zhijian Liu, Ligeng Zhu, Baifeng Shi et al.
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin et al.
ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
Mu Cai, Haotian Liu, Siva Mustikovela et al.
Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding
Yan Shu, Zheng Liu, Peitian Zhang et al.
HUGS: Human Gaussian Splats
Muhammed Kocabas, Jen-Hao Rick Chang, James Gabriel et al.
Navigation World Models
Amir Bar, Gaoyue Zhou, Danny Tran et al.
MMA-Diffusion: MultiModal Attack on Diffusion Models
Yijun Yang, Ruiyuan Gao, Xiaosen Wang et al.
Osprey: Pixel Understanding with Visual Instruction Tuning
Yuqian Yuan, Wentong Li, Jian liu et al.
Make Pixels Dance: High-Dynamic Video Generation
Yan Zeng, Guoqiang Wei, Jiani Zheng et al.
Infrared Small Target Detection with Scale and Location Sensitivity
Qiankun Liu, Rui Liu, Bolun Zheng et al.
From Slow Bidirectional to Fast Autoregressive Video Diffusion Models
Tianwei Yin, Qiang Zhang, Richard Zhang et al.
Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications
Yuwen Xiong, Zhiqi Li, Yuntao Chen et al.
Gaussian Head Avatar: Ultra High-fidelity Head Avatar via Dynamic Gaussians
Yuelang Xu, Benwang Chen, Zhe Li et al.
Wavelet-based Fourier Information Interaction with Frequency Diffusion Adjustment for Underwater Image Restoration
Chen Zhao, Weiling Cai, Chenyu Dong et al.
Optimal Transport Aggregation for Visual Place Recognition
Sergio Izquierdo, Javier Civera
Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed
Yifan Wang, Xingyi He, Sida Peng et al.
GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation
Tong Wu, Guandao Yang, Zhibing Li et al.
XFeat: Accelerated Features for Lightweight Image Matching
Guilherme Potje, Felipe Cadar, André Araujo et al.
GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control
Xuanchi Ren, Tianchang Shen, Jiahui Huang et al.
OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM
Yutao Hu, Tianbin, Quanfeng Lu et al.
3DGStream: On-the-Fly Training of 3D Gaussians for Efficient Streaming of Photo-Realistic Free-Viewpoint Videos
Jiakai Sun, Han Jiao, Guangyuan Li et al.
SpatialTracker: Tracking Any 2D Pixels in 3D Space
Yuxi Xiao, Qianqian Wang, Shangzhan Zhang et al.
SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models
Yuzhou Huang, Liangbin Xie, Xintao Wang et al.
Neural Video Compression with Feature Modulation
Jiahao Li, Bin Li, Yan Lu
Forgery-aware Adaptive Transformer for Generalizable Synthetic Image Detection
Huan Liu, Zichang Tan, Chuangchuang Tan et al.
Transcending Forgery Specificity with Latent Space Augmentation for Generalizable Deepfake Detection
Zhiyuan Yan, Yuhao Luo, Siwei Lyu et al.
DisCo: Disentangled Control for Realistic Human Dance Generation
Tan Wang, Linjie Li, Kevin Lin et al.
HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting
Hongyu Zhou, Jiahao Shao, Lu Xu et al.
Probing the 3D Awareness of Visual Foundation Models
Mohamed El Banani, Amit Raj, Kevis-kokitsi Maninis et al.
MegaSaM: Accurate, Fast and Robust Structure and Motion from Casual Dynamic Videos
Zhengqi Li, Richard Tucker, Forrester Cole et al.