Chen Sun

Papers: 52

Total Citations: 352

Papers (52)

Actor-centric Relation Network

AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?

HyperFree: A Channel-adaptive and Tuning-free Foundation Model for Hyperspectral Remote Sensing Imagery

Solving New Tasks by Adapting Internet Video Knowledge

Dense Video Object Captioning from Disjoint Supervision

Self-Correcting Self-Consuming Loops for Generative Model Training

Potential Based Diffusion Motion Planning

ProNet: Learning to Propose Object-Specific Boxes for Cascaded Neural Networks

MotiF: Making Text Count in Image Animation with Motion Focal Loss

Large Scale Fine-Grained Categorization and Domain-Specific Transfer Learning

AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions

The iNaturalist Species Classification and Detection Dataset

Relational Action Forecasting

Composing Text and Image for Image Retrieval - an Empirical Odyssey

Hyperspectral Image Reconstruction Using a Deep Spatial-Spectral Prior

DNU: Deep Non-Local Unrolling for Computational Spectral Imaging

Speech2Action: Cross-Modal Supervision for Action Recognition

VectorNet: Encoding HD Maps and Agent Dynamics From Vectorized Representation

HDMapGen: A Hierarchical Graph Generative Model of High Definition Maps

Multiview Transformers for Video Recognition

REVEAL: Retrieval-Augmented Visual-Language Pre-Training With Multi-Source Multimodal Knowledge Memory

How Can Objects Help Action Recognition?

Automatic Concept Discovery From Parallel Text and Visual Corpora

Revisiting Unreasonable Effectiveness of Data in Deep Learning Era

VQS: Linking Segmentations to Questions and Answers for Supervised Attention in VQA and Question-Focused Semantic Segmentation

TURN TAP: Temporal Unit Regression Network for Temporal Action Proposals

TALL: Temporal Activity Localization via Language Query

VideoBERT: A Joint Model for Video and Language Representation Learning

Composable Augmentation Encoding for Video Representation Learning

DenseTNT: End-to-End Trajectory Prediction From Dense Goal Sets

Episodic Transformer for Vision-and-Language Navigation

ViViT: A Video Vision Transformer

Learning Temporal Dynamics From Cycles in Narrated Video

Unified Graph Structured Models for Video Understanding

Multi-modal Transformer for Video Retrieval

Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos

Learning Audio-Video Modalities from Image Captions

TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency

Speed/Accuracy Trade-Offs for Modern Convolutional Object Detectors

Motion Prompting: Controlling Video Generation with Motion Trajectories

How Can Objects Help Video-Language Understanding?

End-to-End Spatio-Temporal Action Localisation with Video Transformers

Pixel-Aligned Language Model

Unsupervised learning of object structure and dynamics from videos

What Makes for Good Views for Contrastive Learning?

Discrete-Valued Neural Communication

Attention Bottlenecks for Multimodal Fusion

Trajectory balance: Improved credit assignment in GFlowNets

AVIS: Autonomous Visual Information Seeking with Large Language Model Agent

Does Visual Pretraining Help End-to-End Reasoning?

Goal-Conditioned Predictive Coding for Offline Reinforcement Learning

Contrastive Retrospection: honing in on critical steps for rapid learning and generalization in RL