Jie Tang

Papers: 30

Total Citations: 1,836

Papers (30)

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

LVBench: An Extreme Long Video Understanding Benchmark

KoLA: Carefully Benchmarking World Knowledge of Large Language Models

VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents

Bilateral Propagation Network for Depth Completion

Scaling Speech-Text Pre-training with Synthetic Interleaved Data

CATANet: Efficient Content-Aware Token Aggregation for Lightweight Image Super-Resolution

Sketch and Refine: Towards Fast and Accurate Lane Detection

SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models

TriSampler: A Better Negative Sampling Principle for Dense Retrieval

Small Language Model Makes an Effective Long Text Extractor

Towards Efficient Exact Optimization of Language Model Alignment

AutoLUT: LUT-Based Image Super-Resolution with Automatic Sampling and Adaptive Residual Learning

MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models

VPO: Aligning Text-to-Video Generation Models with Prompt Optimization

CogAgent: A Visual Language Model for GUI Agents

Residual Feature Aggregation Network for Image Super-Resolution

BodyGAN: General-Purpose Controllable Neural Human Body Generation

Robust Object Modeling for Visual Tracking

Bandit Learning with Implicit Feedback

CogLTX: Applying BERT to Long Texts

A Matrix Chernoff Bound for Markov Chains and Its Application to Co-occurrence Matrices

Graph Random Neural Networks for Semi-Supervised Learning on Graphs

CogView: Mastering Text-to-Image Generation via Transformers

Adaptive Diffusion in Graph Neural Networks

A Hierarchical Reinforcement Learning Based Optimization Framework for Large-scale Dynamic Pickup and Delivery Problems

UFC-BERT: Unifying Multi-Modal Controls for Conditional Image Synthesis

CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers

Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos

ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation