🧬Vision Recognition

Action Recognition

Recognizing actions in videos

100 papers2,481 total citations
Compare with other topics
Feb '24 Jan '26471 papers
Also includes: action recognition, activity recognition, video action, temporal action

Top Papers

#1

Video-P2P: Video Editing with Cross-attention Control

Shaoteng Liu, Yuechen Zhang, Wenbo Li et al.

CVPR 2024
309
citations
#2

Move as You Say Interact as You Can: Language-guided Human Motion Generation with Scene Affordance

Zan Wang, Yixin Chen, Baoxiong Jia et al.

CVPR 2024
78
citations
#3

Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection

Zhiwei Yang, Jing Liu, Peng Wu

CVPR 2024
70
citations
#4

Learning to Act without Actions

Dominik Schmidt, Minqi Jiang

ICLR 2024
69
citations
#5

HARDVS: Revisiting Human Activity Recognition with Dynamic Vision Sensors

Xiao Wang, Zongzhen Wu, Bo Jiang et al.

AAAI 2024arXiv:2211.09648
human activity recognitiondynamic vision sensorsevent camerasspatial-temporal feature learning+3
64
citations
#6

Open-Vocabulary Video Anomaly Detection

Peng Wu, Xuerong Zhou, Guansong Pang et al.

CVPR 2024
64
citations
#7

Koala: Key Frame-Conditioned Long Video-LLM

Reuben Tan, Ximeng Sun, Ping Hu et al.

CVPR 2024
62
citations
#8

Animate-X: Universal Character Image Animation with Enhanced Motion Representation

Shuai Tan, Biao Gong, Xiang Wang et al.

ICLR 2025
59
citations
#9

Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly Detectors

Nicolae Ristea, Florinel Croitoru, Radu Tudor Ionescu et al.

CVPR 2024
55
citations
#10

End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames

Shuming Liu, Chenlin Zhang, Chen Zhao et al.

CVPR 2024
51
citations
#11

ReMoS: 3D Motion-Conditioned Reaction Synthesis for Two-Person Interactions

Anindita Ghosh, Rishabh Dabral, Vladislav Golyanik et al.

ECCV 2024arXiv:2311.17057
3d motion synthesishuman motion synthesisdenoising diffusion modelstwo-person interactions+4
51
citations
#12

Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification

Pingping Zhang, Yuhao Wang, Yang Liu et al.

CVPR 2024
49
citations
#13

Universal Actions for Enhanced Embodied Foundation Models

Jinliang Zheng, Jianxiong Li, Dongxiu Liu et al.

CVPR 2025
42
citations
#14

Articulate-Anything: Automatic Modeling of Articulated Objects via a Vision-Language Foundation Model

Long Le, Jason Xie, William Liang et al.

ICLR 2025
42
citations
#15

Prototypical Calibrating Ambiguous Samples for Micro-Action Recognition

Kun Li, Dan Guo, Guoliang Chen et al.

AAAI 2025
41
citations
#16

VicTR: Video-conditioned Text Representations for Activity Recognition

Kumara Kahatapitiya, Anurag Arnab, Arsha Nagrani et al.

CVPR 2024
36
citations
#17

AA-CLIP: Enhancing Zero-Shot Anomaly Detection via Anomaly-Aware CLIP

wenxin ma, Xu Zhang, Qingsong Yao et al.

CVPR 2025
33
citations
#18

SpikePoint: An Efficient Point-based Spiking Neural Network for Event Cameras Action Recognition

Hongwei Ren, Yue ZHOU, Xiaopeng LIN et al.

ICLR 2024
32
citations
#19

REACTO: Reconstructing Articulated Objects from a Single Video

Chaoyue Song, Jiacheng Wei, Chuan-Sheng Foo et al.

CVPR 2024
32
citations
#20

VideoGrain: Modulating Space-Time Attention for Multi-Grained Video Editing

Xiangpeng Yang, Linchao Zhu, Hehe Fan et al.

ICLR 2025arXiv:2502.17258
diffusion modelsvideo editingattention mechanismmulti-grained editing+4
31
citations
#21

ExACT: Language-guided Conceptual Reasoning and Uncertainty Estimation for Event-based Action Recognition and More

Jiazhou Zhou, Xu Zheng, Yuanhuiyi Lyu et al.

CVPR 2024
31
citations
#22

Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction

Rui Qian, Shuangrui Ding, Xiaoyi Dong et al.

CVPR 2025
31
citations
#23

PREGO: Online Mistake Detection in PRocedural EGOcentric Videos

Alessandro Flaborea, Guido M. D&amp, #x27 et al.

CVPR 2024
30
citations
#24

ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding

Yiyang Zhou, Yangfan He, Yaofeng Su et al.

NeurIPS 2025
29
citations
#25

VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation

Hanzhi Chen, Boyang Sun, Anran Zhang et al.

CVPR 2025
29
citations
#26

NARUTO: Neural Active Reconstruction from Uncertain Target Observations

Ziyue Feng, Huangying Zhan, Zheng Chen et al.

CVPR 2024
27
citations
#27

Navigating Open Set Scenarios for Skeleton-Based Action Recognition

Kunyu Peng, Cheng Yin, Junwei Zheng et al.

AAAI 2024arXiv:2312.06330
skeleton-based action recognitionopen set recognitioncross-modality alignmentdistance-based classification+3
26
citations
#28

CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos

Xinhao Liu, Jintong Li, Yicheng Jiang et al.

CVPR 2025
25
citations
#29

EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Videos Generation

Xiaofeng Wang, Kang Zhao, Feng Liu et al.

NeurIPS 2025arXiv:2411.08380
egocentric video generationvideo-action datasetkinematic controlaction annotations+4
25
citations
#30

SCD-Net: Spatiotemporal Clues Disentanglement Network for Self-Supervised Skeleton-Based Action Recognition

Cong Wu, Xiao-Jun Wu, Josef Kittler et al.

AAAI 2024arXiv:2309.05834
skeleton-based action recognitioncontrastive learningspatiotemporal disentanglementmasked image modeling+4
24
citations
#31

Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes

Yaoting Wang, Peiwen Sun, Dongzhan Zhou et al.

ECCV 2024
23
citations
#32

PALM: Predicting Actions through Language Models

Sanghwan Kim, Daoji Huang, Yongqin Xian et al.

ECCV 2024
21
citations
#33

DyFADet: Dynamic Feature Aggregation for Temporal Action Detection

Le Yang, Ziwei Zheng, Yizeng Han et al.

ECCV 2024
21
citations
#34

ElasticTok: Adaptive Tokenization for Image and Video

Wilson Yan, Volodymyr Mnih, Aleksandra Faust et al.

ICLR 2025
21
citations
#35

Weakly-Supervised Temporal Action Localization by Inferring Salient Snippet-Feature

Wu Yun, Mengshi Qi, Chuanming Wang et al.

AAAI 2024arXiv:2303.12332
weakly-supervised temporal action localizationsalient snippet-feature inferencepseudo label generationtemporal structure exploitation+3
21
citations
#36

VideoMAC: Video Masked Autoencoders Meet ConvNets

Gensheng Pei, Tao Chen, Xiruo Jiang et al.

CVPR 2024
20
citations
#37

Learning to Predict Activity Progress by Self-Supervised Video Alignment

Gerard Donahue, Ehsan Elhamifar

CVPR 2024
20
citations
#38

AMEGO: Active Memory from long EGOcentric videos

Gabriele Goletto, Tushar Nagarajan, Giuseppe Averta et al.

ECCV 2024
19
citations
#39

TASTE-Rob: Advancing Video Generation of Task-Oriented Hand-Object Interaction for Generalizable Robotic Manipulation

Hongxiang Zhao, Xingchen Liu, Mutian Xu et al.

CVPR 2025
19
citations
#40

HR-Pro: Point-Supervised Temporal Action Localization via Hierarchical Reliability Propagation

Huaxin Zhang, Xiang Wang, Xiaohao Xu et al.

AAAI 2024arXiv:2308.12608
temporal action localizationpoint-supervised learninghierarchical reliability propagationsnippet-level discrimination+3
19
citations
#41

Improving Video Segmentation via Dynamic Anchor Queries

Yikang Zhou, Tao Zhang, Xiangtai Li et al.

ECCV 2024
19
citations
#42

MoST: Motion Style Transformer Between Diverse Action Contents

Boeun Kim, Jungho Kim, Hyung Jin Chang et al.

CVPR 2024
18
citations
#43

Vid2Avatar-Pro: Authentic Avatar from Videos in the Wild via Universal Prior

Chen Guo, Junxuan Li, Yash Kant et al.

CVPR 2025
18
citations
#44

Masked Video and Body-worn IMU Autoencoder for Egocentric Action Recognition

Mingfang Zhang, Yifei Huang, Ruicong Liu et al.

ECCV 2024
17
citations
#45

HaWoR: World-Space Hand Motion Reconstruction from Egocentric Videos

Jinglei Zhang, Jiankang Deng, Chao Ma et al.

CVPR 2025
17
citations
#46

Adapting Short-Term Transformers for Action Detection in Untrimmed Videos

Min Yang, gaohuan, Ping Guo et al.

CVPR 2024
17
citations
#47

Semi-supervised Active Learning for Video Action Detection

Ayush Singh, Aayush J Rana, Akash Kumar et al.

AAAI 2024arXiv:2312.07169
semi-supervised active learningvideo action detectionspatio-temporal localizationinformative sample selection+3
16
citations
#48

Evidential Active Recognition: Intelligent and Prudent Open-World Embodied Perception

Lei Fan, Mingfu Liang, Yunxuan Li et al.

CVPR 2024
16
citations
#49

Revealing Key Details to See Differences: A Novel Prototypical Perspective for Skeleton-based Action Recognition

Hongda Liu, Yunfan Liu, Min Ren et al.

CVPR 2025
16
citations
#50

Align Before Adapt: Leveraging Entity-to-Region Alignments for Generalizable Video Action Recognition

Yifei Chen, Dapeng Chen, Ruijin Liu et al.

CVPR 2024
16
citations
#51

An Empirical Study of Autoregressive Pre-training from Videos

Jathushan Rajasegaran, Ilija Radosavovic, Rahul Ravishankar et al.

ICCV 2025
15
citations
#52

Animal Avatars: Reconstructing Animatable 3D Animals from Casual Videos

Remy Sabathier, David Novotny, Niloy Mitra

ECCV 2024
15
citations
#53

SAFE: Multitask Failure Detection for Vision-Language-Action Models

Qiao Gu, Yuanliang Ju, Shengxiang Sun et al.

NeurIPS 2025
15
citations
#54

Event-Adapted Video Super-Resolution

Zeyu Xiao, Dachun Kai, Yueyi Zhang et al.

ECCV 2024
14
citations
#55

Video Anomaly Detection with Motion and Appearance Guided Patch Diffusion Model

Hang Zhou, Jiale Cai, Yuteng Ye et al.

AAAI 2025
14
citations
#56

Online Reasoning Video Segmentation with Just-in-Time Digital Twins

Yiqing Shen, Bohan Liu, Chenjia Li et al.

ICCV 2025
14
citations
#57

Ready-to-React: Online Reaction Policy for Two-Character Interaction Generation

Zhi Cen, Huaijin Pi, Sida Peng et al.

ICLR 2025
14
citations
#58

UniMD: Towards Unifying Moment Retrieval and Temporal Action Detection

Yingsen Zeng, Yujie Zhong, Chengjian Feng et al.

ECCV 2024
14
citations
#59

Referring Atomic Video Action Recognition

Kunyu Peng, Jia Fu, Kailun Yang et al.

ECCV 2024
14
citations
#60

ActionPiece: Contextually Tokenizing Action Sequences for Generative Recommendation

Yupeng Hou, Jianmo Ni, Zhankui He et al.

ICML 2025
14
citations
#61

CaRDiff: Video Salient Object Ranking Chain of Thought Reasoning for Saliency Prediction with Diffusion

Yunlong Tang, Gen Zhan, Li Yang et al.

AAAI 2025
13
citations
#62

EvSign: Sign Language Recognition and Translation with Streaming Events

Pengyu Zhang, Hao Yin, Zeren Wang et al.

ECCV 2024
13
citations
#63

Perception-as-Control: Fine-grained Controllable Image Animation with 3D-aware Motion Representation

Yingjie Chen, Yifang Men, Yuan Yao et al.

ICCV 2025
13
citations
#64

On the Utility of 3D Hand Poses for Action Recognition

Md Salman Shamil, Dibyadip Chatterjee, Fadime Sener et al.

ECCV 2024
13
citations
#65

Exploring More from Multiple Gait Modalities for Human Identification

Dongyang Jin, Chao Fan, Weihua Chen et al.

AAAI 2025
13
citations
#66

CrossGLG: LLM Guides One-shot Skeleton-based 3D Action Recognition in a Cross-level Manner

Tingbing Yan, Wenzheng Zeng, Yang Xiao et al.

ECCV 2024
12
citations
#67

CLIMB-ReID: A Hybrid CLIP-Mamba Framework for Person Re-Identification

Chenyang Yu, Xuehu Liu, Jiawen Zhu et al.

AAAI 2025
12
citations
#68

DailyDVS-200: A Comprehensive Benchmark Dataset for Event-Based Action Recognition

Qi Wang, Zhou Xu, Yuming Lin et al.

ECCV 2024
12
citations
#69

Real Appearance Modeling for More General Deepfake Detection

Jiahe Tian, Yu Cai, Xi Wang et al.

ECCV 2024
12
citations
#70

AdaManip: Adaptive Articulated Object Manipulation Environments and Policy Learning

Yuanfei Wang, Xiaojie Zhang, Ruihai Wu et al.

ICLR 2025arXiv:2502.11124
articulated object manipulationadaptive manipulation policy3d visual diffusionimitation learning+4
12
citations
#71

Towards a Universal Synthetic Video Detector: From Face or Background Manipulations to Fully AI-Generated Content

Rohit Kundu, Hao Xiong, Vishal Mohanty et al.

CVPR 2025
12
citations
#72

RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos

Tanveer Hannan, Mohaiminul Islam, Thomas Seidl et al.

ECCV 2024
12
citations
#73

HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding

Shehreen Azad, Vibhav Vineet, Yogesh S. Rawat

CVPR 2025
12
citations
#74

ExpertAF: Expert Actionable Feedback from Video

Kumar Ashutosh, Tushar Nagarajan, Georgios Pavlakos et al.

CVPR 2025
11
citations
#75

Efficient Few-Shot Action Recognition via Multi-Level Post-Reasoning

Cong Wu, Xiao-Jun Wu, Linze Li et al.

ECCV 2024
11
citations
#76

Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning

Baoqi Pei, Yifei Huang, Jilan Xu et al.

ICLR 2025
11
citations
#77

Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation

Shuangrui Ding, Rui Qian, Haohang Xu et al.

ECCV 2024arXiv:2311.17893
self-supervised learningvideo object segmentationdino-pretrained transformersspatio-temporal correspondence+3
11
citations
#78

Frame Order Matters: A Temporal Sequence-Aware Model for Few-Shot Action Recognition

Bozheng Li, Mushui Liu, Gaoang Wang et al.

AAAI 2025
11
citations
#79

Skeleton-based Group Activity Recognition via Spatial-Temporal Panoramic Graph

Zhengcen Li, Xinle Chang, Yueran Li et al.

ECCV 2024
11
citations
#80

SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories

Muzhi Zhu, Yuzhuo Tian, Hao Chen et al.

CVPR 2025
11
citations
#81

SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos

Changan Chen, Kumar Ashutosh, Rohit Girdhar et al.

CVPR 2024
11
citations
#82

MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild

Xi Fang, Jiankun Wang, Xiaochen Cai et al.

ICCV 2025
10
citations
#83

UCF-Crime-DVS: A Novel Event-Based Dataset for Video Anomaly Detection with Spiking Neural Networks

Yuanbin Qian, Shuhan Ye, Chong Wang et al.

AAAI 2025
10
citations
#84

RoMo: Robust Motion Segmentation Improves Structure from Motion

Lily Goli, Sara Sabour, Mark Matthews et al.

ICCV 2025arXiv:2411.18650
motion segmentationstructure from motioncamera calibrationoptical flow+4
10
citations
#85

Gazing Into Missteps: Leveraging Eye-Gaze for Unsupervised Mistake Detection in Egocentric Videos of Skilled Human Activities

Michele Mazzamuto, Antonino Furnari, Yoichi Sato et al.

CVPR 2025
10
citations
#86

TACO: Taming Diffusion for in-the-wild Video Amodal Completion

Ruijie Lu, Yixin Chen, Yu Liu et al.

ICCV 2025
9
citations
#87

Live and Learn: Continual Action Clustering with Incremental Views

Xiaoqiang Yan, Yingtao Gan, Yiqiao Mao et al.

AAAI 2024arXiv:2404.07962
multi-view action clusteringcontinual learningincremental camera viewsconsensus partition matrix+2
9
citations
#88

ActionVOS: Actions as Prompts for Video Object Segmentation

LIANGYANG OUYANG, Ruicong Liu, Yifei Huang et al.

ECCV 2024
9
citations
#89

Limitations in Employing Natural Language Supervision for Sensor-Based Human Activity Recognition - And Ways to Overcome Them

Harish Haresamudram, Apoorva Beedu, Mashfiqui Rabbi et al.

AAAI 2025
9
citations
#90

HO-Cap: A Capture System and Dataset for 3D Reconstruction and Pose Tracking of Hand-Object Interaction

Jikai Wang, Qifan Zhang, Yu-Wei Chao et al.

NeurIPS 2025
9
citations
#91

ProMotion: Prototypes As Motion Learners

Yawen Lu, Dongfang Liu, Qifan Wang et al.

CVPR 2024
9
citations
#92

AFF-ttention! Affordances and Attention models for Short-Term Object Interaction Anticipation

Lorenzo Mur Labadia, Ruben Martinez-Cantin, Jose J Guerrero et al.

ECCV 2024
9
citations
#93

Efficient 3D Recognition with Event-driven Spike Sparse Convolution

Xuerui Qiu, Man Yao, Jieyuan Zhang et al.

AAAI 2025
9
citations
#94

Avatar Fingerprinting for Authorized Use of Synthetic Talking-Head Videos

Ekta Prashnani, Koki Nagano, Shalini De Mello et al.

ECCV 2024
8
citations
#95

Towards Scene Graph Anticipation

Rohith Peddi, Saksham Singh, Saurabh . et al.

ECCV 2024
8
citations
#96

Instruction-based Image Manipulation by Watching How Things Move

Mingdeng Cao, Xuaner Zhang, Yinqiang Zheng et al.

CVPR 2025
8
citations
#97

A Multimodal, Multi-Task Adapting Framework for Video Action Recognition

Mengmeng Wang, Jiazheng Xing, Boyuan Jiang et al.

AAAI 2024arXiv:2401.11649
video action recognitionvision-language pretrained modelsparameter-efficient fine-tuningmultimodal adapters+3
8
citations
#98

MammAlps: A Multi-view Video Behavior Monitoring Dataset of Wild Mammals in the Swiss Alps

Valentin Gabeff, Haozhe Qi, Brendan Flaherty et al.

CVPR 2025
8
citations
#99

Task-Aware Encoder Control for Deep Video Compression

Xingtong Ge, Jixiang Luo, XINJIE ZHANG et al.

CVPR 2024
8
citations
#100

Re-HOLD: Video Hand Object Interaction Reenactment via adaptive Layout-instructed Diffusion Model

Yingying Fan, Quanwei Yang, Kaisiyuan Wang et al.

CVPR 2025arXiv:2503.16942
video hand object interactionhuman-object interaction generationlayout-instructed diffusion modelhand synthesis+3
8
citations