🧬Vision Recognition

Action Recognition

Recognizing actions in videos

100 papers2,401 total citations
Compare with other topics
Mar '24 Feb '26445 papers
Also includes: action recognition, activity recognition, video action, temporal action

Top Papers

#1

Video-P2P: Video Editing with Cross-attention Control

Shaoteng Liu, Yuechen Zhang, Wenbo Li et al.

CVPR 2024
309
citations
#2

Move as You Say Interact as You Can: Language-guided Human Motion Generation with Scene Affordance

Zan Wang, Yixin Chen, Baoxiong Jia et al.

CVPR 2024
78
citations
#3

Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection

Zhiwei Yang, Jing Liu, Peng Wu

CVPR 2024
70
citations
#4

Learning to Act without Actions

Dominik Schmidt, Minqi Jiang

ICLR 2024
69
citations
#5

HARDVS: Revisiting Human Activity Recognition with Dynamic Vision Sensors

Xiao Wang, Zongzhen Wu, Bo Jiang et al.

AAAI 2024arXiv:2211.09648
human activity recognitiondynamic vision sensorsevent camerasspatial-temporal feature learning+3
64
citations
#6

Open-Vocabulary Video Anomaly Detection

Peng Wu, Xuerong Zhou, Guansong Pang et al.

CVPR 2024
64
citations
#7

Koala: Key Frame-Conditioned Long Video-LLM

Reuben Tan, Ximeng Sun, Ping Hu et al.

CVPR 2024
62
citations
#8

Animate-X: Universal Character Image Animation with Enhanced Motion Representation

Shuai Tan, Biao Gong, Xiang Wang et al.

ICLR 2025
59
citations
#9

Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly Detectors

Nicolae Ristea, Florinel Croitoru, Radu Tudor Ionescu et al.

CVPR 2024
55
citations
#10

End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames

Shuming Liu, Chenlin Zhang, Chen Zhao et al.

CVPR 2024
51
citations
#11

ReMoS: 3D Motion-Conditioned Reaction Synthesis for Two-Person Interactions

Anindita Ghosh, Rishabh Dabral, Vladislav Golyanik et al.

ECCV 2024arXiv:2311.17057
3d motion synthesishuman motion synthesisdenoising diffusion modelstwo-person interactions+4
50
citations
#12

Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification

Pingping Zhang, Yuhao Wang, Yang Liu et al.

CVPR 2024
49
citations
#13

Articulate-Anything: Automatic Modeling of Articulated Objects via a Vision-Language Foundation Model

Long Le, Jason Xie, William Liang et al.

ICLR 2025arXiv:2410.13882
vision-language modelsarticulated object modeling3d asset generationmesh retrieval+4
42
citations
#14

Universal Actions for Enhanced Embodied Foundation Models

Jinliang Zheng, Jianxiong Li, Dongxiu Liu et al.

CVPR 2025
42
citations
#15

Prototypical Calibrating Ambiguous Samples for Micro-Action Recognition

Kun Li, Dan Guo, Guoliang Chen et al.

AAAI 2025
41
citations
#16

VicTR: Video-conditioned Text Representations for Activity Recognition

Kumara Kahatapitiya, Anurag Arnab, Arsha Nagrani et al.

CVPR 2024
36
citations
#17

AA-CLIP: Enhancing Zero-Shot Anomaly Detection via Anomaly-Aware CLIP

wenxin ma, Xu Zhang, Qingsong Yao et al.

CVPR 2025
33
citations
#18

SpikePoint: An Efficient Point-based Spiking Neural Network for Event Cameras Action Recognition

Hongwei Ren, Yue ZHOU, Xiaopeng LIN et al.

ICLR 2024
32
citations
#19

REACTO: Reconstructing Articulated Objects from a Single Video

Chaoyue Song, Jiacheng Wei, Chuan-Sheng Foo et al.

CVPR 2024
32
citations
#20

Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction

Rui Qian, Shuangrui Ding, Xiaoyi Dong et al.

CVPR 2025arXiv:2501.03218
video large language modelsactive real-time interactionstreaming video processingdisentangled system architecture+4
31
citations
#21

PREGO: Online Mistake Detection in PRocedural EGOcentric Videos

Alessandro Flaborea, Guido M. D&amp, #x27 et al.

CVPR 2024
30
citations
#22

VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation

Hanzhi Chen, Boyang Sun, Anran Zhang et al.

CVPR 2025
29
citations
#23

ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding

Yiyang Zhou, Yangfan He, Yaofeng Su et al.

NeurIPS 2025arXiv:2506.01300
video understandingreward-driven agentsmulti-agent frameworkvision-language models+4
29
citations
#24

NARUTO: Neural Active Reconstruction from Uncertain Target Observations

Ziyue Feng, Huangying Zhan, Zheng Chen et al.

CVPR 2024
27
citations
#25

Navigating Open Set Scenarios for Skeleton-Based Action Recognition

Kunyu Peng, Cheng Yin, Junwei Zheng et al.

AAAI 2024arXiv:2312.06330
skeleton-based action recognitionopen set recognitioncross-modality alignmentdistance-based classification+3
26
citations
#26

CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos

Xinhao Liu, Jintong Li, Yicheng Jiang et al.

CVPR 2025arXiv:2411.17820
25
citations
#27

SCD-Net: Spatiotemporal Clues Disentanglement Network for Self-Supervised Skeleton-Based Action Recognition

Cong Wu, Xiao-Jun Wu, Josef Kittler et al.

AAAI 2024arXiv:2309.05834
skeleton-based action recognitioncontrastive learningspatiotemporal disentanglementmasked image modeling+4
24
citations
#28

Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes

Yaoting Wang, Peiwen Sun, Dongzhan Zhou et al.

ECCV 2024arXiv:2407.10957
audio-visual segmentationreference segmentationmultimodal perceptionmultimodal-cue expressions+2
23
citations
#29

PALM: Predicting Actions through Language Models

Sanghwan Kim, Daoji Huang, Yongqin Xian et al.

ECCV 2024arXiv:2311.17944
egocentric visionlong-term action anticipationaction recognition modelvision-language model+3
21
citations
#30

Weakly-Supervised Temporal Action Localization by Inferring Salient Snippet-Feature

Wu Yun, Mengshi Qi, Chuanming Wang et al.

AAAI 2024arXiv:2303.12332
weakly-supervised temporal action localizationsalient snippet-feature inferencepseudo label generationtemporal structure exploitation+3
21
citations
#31

DyFADet: Dynamic Feature Aggregation for Temporal Action Detection

Le Yang, Ziwei Zheng, Yizeng Han et al.

ECCV 2024
21
citations
#32

ElasticTok: Adaptive Tokenization for Image and Video

Wilson Yan, Volodymyr Mnih, Aleksandra Faust et al.

ICLR 2025
21
citations
#33

VideoMAC: Video Masked Autoencoders Meet ConvNets

Gensheng Pei, Tao Chen, Xiruo Jiang et al.

CVPR 2024
20
citations
#34

Learning to Predict Activity Progress by Self-Supervised Video Alignment

Gerard Donahue, Ehsan Elhamifar

CVPR 2024
20
citations
#35

AMEGO: Active Memory from long EGOcentric videos

Gabriele Goletto, Tushar Nagarajan, Giuseppe Averta et al.

ECCV 2024
19
citations
#36

HR-Pro: Point-Supervised Temporal Action Localization via Hierarchical Reliability Propagation

Huaxin Zhang, Xiang Wang, Xiaohao Xu et al.

AAAI 2024arXiv:2308.12608
temporal action localizationpoint-supervised learninghierarchical reliability propagationsnippet-level discrimination+3
19
citations
#37

Improving Video Segmentation via Dynamic Anchor Queries

Yikang Zhou, Tao Zhang, Xiangtai Li et al.

ECCV 2024
19
citations
#38

TASTE-Rob: Advancing Video Generation of Task-Oriented Hand-Object Interaction for Generalizable Robotic Manipulation

Hongxiang Zhao, Xingchen Liu, Mutian Xu et al.

CVPR 2025
19
citations
#39

MoST: Motion Style Transformer Between Diverse Action Contents

Boeun Kim, Jungho Kim, Hyung Jin Chang et al.

CVPR 2024
18
citations
#40

Vid2Avatar-Pro: Authentic Avatar from Videos in the Wild via Universal Prior

Chen Guo, Junxuan Li, Yash Kant et al.

CVPR 2025
18
citations
#41

Masked Video and Body-worn IMU Autoencoder for Egocentric Action Recognition

Mingfang Zhang, Yifei Huang, Ruicong Liu et al.

ECCV 2024
17
citations
#42

HaWoR: World-Space Hand Motion Reconstruction from Egocentric Videos

Jinglei Zhang, Jiankang Deng, Chao Ma et al.

CVPR 2025
17
citations
#43

Adapting Short-Term Transformers for Action Detection in Untrimmed Videos

Min Yang, gaohuan, Ping Guo et al.

CVPR 2024
17
citations
#44

Semi-supervised Active Learning for Video Action Detection

Ayush Singh, Aayush J Rana, Akash Kumar et al.

AAAI 2024arXiv:2312.07169
semi-supervised active learningvideo action detectionspatio-temporal localizationinformative sample selection+3
16
citations
#45

Evidential Active Recognition: Intelligent and Prudent Open-World Embodied Perception

Lei Fan, Mingfu Liang, Yunxuan Li et al.

CVPR 2024
16
citations
#46

Revealing Key Details to See Differences: A Novel Prototypical Perspective for Skeleton-based Action Recognition

Hongda Liu, Yunfan Liu, Min Ren et al.

CVPR 2025
16
citations
#47

Align Before Adapt: Leveraging Entity-to-Region Alignments for Generalizable Video Action Recognition

Yifei Chen, Dapeng Chen, Ruijin Liu et al.

CVPR 2024
16
citations
#48

An Empirical Study of Autoregressive Pre-training from Videos

Jathushan Rajasegaran, Ilija Radosavovic, Rahul Ravishankar et al.

ICCV 2025
15
citations
#49

SAFE: Multitask Failure Detection for Vision-Language-Action Models

Qiao Gu, Yuanliang Ju, Shengxiang Sun et al.

NeurIPS 2025
15
citations
#50

Animal Avatars: Reconstructing Animatable 3D Animals from Casual Videos

Remy Sabathier, David Novotny, Niloy Mitra

ECCV 2024
15
citations
#51

UniMD: Towards Unifying Moment Retrieval and Temporal Action Detection

Yingsen Zeng, Yujie Zhong, Chengjian Feng et al.

ECCV 2024arXiv:2404.04933
moment retrievaltemporal action detectiontask fusion learningunified video understanding+3
14
citations
#52

Online Reasoning Video Segmentation with Just-in-Time Digital Twins

Yiqing Shen, Bohan Liu, Chenjia Li et al.

ICCV 2025
14
citations
#53

Event-Adapted Video Super-Resolution

Zeyu Xiao, Dachun Kai, Yueyi Zhang et al.

ECCV 2024
14
citations
#54

Video Anomaly Detection with Motion and Appearance Guided Patch Diffusion Model

Hang Zhou, Jiale Cai, Yuteng Ye et al.

AAAI 2025
14
citations
#55

Referring Atomic Video Action Recognition

Kunyu Peng, Jia Fu, Kailun Yang et al.

ECCV 2024
14
citations
#56

Ready-to-React: Online Reaction Policy for Two-Character Interaction Generation

Zhi Cen, Huaijin Pi, Sida Peng et al.

ICLR 2025
14
citations
#57

ActionPiece: Contextually Tokenizing Action Sequences for Generative Recommendation

Yupeng Hou, Jianmo Ni, Zhankui He et al.

ICML 2025
14
citations
#58

Exploring More from Multiple Gait Modalities for Human Identification

Dongyang Jin, Chao Fan, Weihua Chen et al.

AAAI 2025
13
citations
#59

EvSign: Sign Language Recognition and Translation with Streaming Events

Pengyu Zhang, Hao Yin, Zeren Wang et al.

ECCV 2024arXiv:2407.12593
13
citations
#60

Perception-as-Control: Fine-grained Controllable Image Animation with 3D-aware Motion Representation

Yingjie Chen, Yifang Men, Yuan Yao et al.

ICCV 2025arXiv:2501.05020
13
citations
#61

On the Utility of 3D Hand Poses for Action Recognition

Md Salman Shamil, Dibyadip Chatterjee, Fadime Sener et al.

ECCV 2024
13
citations
#62

CaRDiff: Video Salient Object Ranking Chain of Thought Reasoning for Saliency Prediction with Diffusion

Yunlong Tang, Gen Zhan, Li Yang et al.

AAAI 2025
13
citations
#63

Real Appearance Modeling for More General Deepfake Detection

Jiahe Tian, Yu Cai, Xi Wang et al.

ECCV 2024
deepfake detectiongeneralizable detectionreal appearance modelingface disturbance+2
12
citations
#64

RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos

Tanveer Hannan, Mohaiminul Islam, Thomas Seidl et al.

ECCV 2024
12
citations
#65

DailyDVS-200: A Comprehensive Benchmark Dataset for Event-Based Action Recognition

Qi Wang, Zhou Xu, Yuming Lin et al.

ECCV 2024
12
citations
#66

CLIMB-ReID: A Hybrid CLIP-Mamba Framework for Person Re-Identification

Chenyang Yu, Xuehu Liu, Jiawen Zhu et al.

AAAI 2025
12
citations
#67

Towards a Universal Synthetic Video Detector: From Face or Background Manipulations to Fully AI-Generated Content

Rohit Kundu, Hao Xiong, Vishal Mohanty et al.

CVPR 2025
12
citations
#68

CrossGLG: LLM Guides One-shot Skeleton-based 3D Action Recognition in a Cross-level Manner

Tingbing Yan, Wenzheng Zeng, Yang Xiao et al.

ECCV 2024
12
citations
#69

HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding

Shehreen Azad, Vibhav Vineet, Yogesh S. Rawat

CVPR 2025arXiv:2503.08585
12
citations
#70

ExpertAF: Expert Actionable Feedback from Video

Kumar Ashutosh, Tushar Nagarajan, Georgios Pavlakos et al.

CVPR 2025
11
citations
#71

Efficient Few-Shot Action Recognition via Multi-Level Post-Reasoning

Cong Wu, Xiao-Jun Wu, Linze Li et al.

ECCV 2024
11
citations
#72

Skeleton-based Group Activity Recognition via Spatial-Temporal Panoramic Graph

Zhengcen Li, Xinle Chang, Yueran Li et al.

ECCV 2024
11
citations
#73

Frame Order Matters: A Temporal Sequence-Aware Model for Few-Shot Action Recognition

Bozheng Li, Mushui Liu, Gaoang Wang et al.

AAAI 2025
11
citations
#74

SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories

Muzhi Zhu, Yuzhuo Tian, Hao Chen et al.

CVPR 2025
11
citations
#75

Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning

Baoqi Pei, Yifei Huang, Jilan Xu et al.

ICLR 2025arXiv:2503.00986
11
citations
#76

SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos

Changan Chen, Kumar Ashutosh, Rohit Girdhar et al.

CVPR 2024
11
citations
#77

MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild

Xi Fang, Jiankun Wang, Xiaochen Cai et al.

ICCV 2025
10
citations
#78

Gazing Into Missteps: Leveraging Eye-Gaze for Unsupervised Mistake Detection in Egocentric Videos of Skilled Human Activities

Michele Mazzamuto, Antonino Furnari, Yoichi Sato et al.

CVPR 2025
10
citations
#79

UCF-Crime-DVS: A Novel Event-Based Dataset for Video Anomaly Detection with Spiking Neural Networks

Yuanbin Qian, Shuhan Ye, Chong Wang et al.

AAAI 2025
10
citations
#80

TACO: Taming Diffusion for in-the-wild Video Amodal Completion

Ruijie Lu, Yixin Chen, Yu Liu et al.

ICCV 2025
9
citations
#81

Live and Learn: Continual Action Clustering with Incremental Views

Xiaoqiang Yan, Yingtao Gan, Yiqiao Mao et al.

AAAI 2024arXiv:2404.07962
multi-view action clusteringcontinual learningincremental camera viewsconsensus partition matrix+2
9
citations
#82

AFF-ttention! Affordances and Attention models for Short-Term Object Interaction Anticipation

Lorenzo Mur Labadia, Ruben Martinez-Cantin, Jose J Guerrero et al.

ECCV 2024
9
citations
#83

ActionVOS: Actions as Prompts for Video Object Segmentation

LIANGYANG OUYANG, Ruicong Liu, Yifei Huang et al.

ECCV 2024arXiv:2407.07402
referring video object segmentationegocentric visionaction-aware segmentationvideo object segmentation+2
9
citations
#84

Limitations in Employing Natural Language Supervision for Sensor-Based Human Activity Recognition - And Ways to Overcome Them

Harish Haresamudram, Apoorva Beedu, Mashfiqui Rabbi et al.

AAAI 2025
9
citations
#85

HO-Cap: A Capture System and Dataset for 3D Reconstruction and Pose Tracking of Hand-Object Interaction

Jikai Wang, Qifan Zhang, Yu-Wei Chao et al.

NeurIPS 2025
9
citations
#86

ProMotion: Prototypes As Motion Learners

Yawen Lu, Dongfang Liu, Qifan Wang et al.

CVPR 2024
9
citations
#87

Efficient 3D Recognition with Event-driven Spike Sparse Convolution

Xuerui Qiu, Man Yao, Jieyuan Zhang et al.

AAAI 2025
9
citations
#88

Avatar Fingerprinting for Authorized Use of Synthetic Talking-Head Videos

Ekta Prashnani, Koki Nagano, Shalini De Mello et al.

ECCV 2024
8
citations
#89

Task-Aware Encoder Control for Deep Video Compression

Xingtong Ge, Jixiang Luo, XINJIE ZHANG et al.

CVPR 2024
8
citations
#90

A Multimodal, Multi-Task Adapting Framework for Video Action Recognition

Mengmeng Wang, Jiazheng Xing, Boyuan Jiang et al.

AAAI 2024arXiv:2401.11649
video action recognitionvision-language pretrained modelsparameter-efficient fine-tuningmultimodal adapters+3
8
citations
#91

Instruction-based Image Manipulation by Watching How Things Move

Mingdeng Cao, Xuaner Zhang, Yinqiang Zheng et al.

CVPR 2025
8
citations
#92

Towards Scene Graph Anticipation

Rohith Peddi, Saksham Singh, Saurabh . et al.

ECCV 2024
8
citations
#93

MammAlps: A Multi-view Video Behavior Monitoring Dataset of Wild Mammals in the Swiss Alps

Valentin Gabeff, Haozhe Qi, Brendan Flaherty et al.

CVPR 2025
8
citations
#94

Generalizable Sensor-Based Activity Recognition via Categorical Concept Invariant Learning

Di Xiong, Shuoyuan Wang, Lei Zhang et al.

AAAI 2025
7
citations
#95

KinMo: Kinematic-aware Human Motion Understanding and Generation

Pengfei Zhang, Pinxin Liu, Pablo Garrido et al.

ICCV 2025
7
citations
#96

Language Model Guided Interpretable Video Action Reasoning

Ning Wang, Guangming Zhu, Hongsheng Li et al.

CVPR 2024
7
citations
#97

Elucidating the Hierarchical Nature of Behavior with Masked Autoencoders

Lucas Stoffl, Andy Bonnetto, Stéphane D'Ascoli et al.

ECCV 2024
masked autoencodershierarchical behavior analysisaction segmentationmotion capture data+4
7
citations
#98

HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation

Trong-Thuan Nguyen, Pha Nguyen, Jackson Cothren et al.

CVPR 2025
7
citations
#99

MimeQA: Towards Socially-Intelligent Nonverbal Foundation Models

Hengzhi Li, Megan Tjandrasuwita, Yi R. (May) Fung et al.

NeurIPS 2025arXiv:2502.16671
nonverbal social understandingvideo question answeringsocial reasoningvideo large language models+4
7
citations
#100

ManiVideo: Generating Hand-Object Manipulation Video with Dexterous and Generalizable Grasping

Youxin Pang, Ruizhi Shao, Jiajun Zhang et al.

CVPR 2025arXiv:2412.16212
7
citations