Rongrong Ji

138 Papers

1,842 Total Citations

Papers (138)

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

Multiple Expert Brainstorming for Domain Adaptive Person Re-identification

Autoregressive Queries for Adaptive Tracking with Spatio-Temporal Transformers

Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation

Enabling Deep Residual Networks for Weakly Supervised Object Detection

AffineQuant: Affine Transformation Quantization for Large Language Models

Towards General Visual-Linguistic Face Forgery Detection

AccDiffusion: An Accurate Method for Higher-Resolution Image Generation

Attention Disturbance and Dual-Path Constraint Network for Occluded Person Re-identification

CamoTeacher: Dual-Rotation Consistency Learning for Semi-Supervised Camouflaged Object Detection

VTON-HandFit: Virtual Try-on for Arbitrary Hand Pose Guided by Hand Priors Embedding

DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model

FlashSloth: Lightning Multimodal Large Language Models via Embedded Visual Compression

Determining Layer-wise Sparsity for Large Language Models Through a Theoretical Perspective

UniPTS: A Unified Framework for Proficient Post-Training Sparsity

Few-Shot Image Quality Assessment via Adaptation of Vision-Language Models

From Objects to Events: Unlocking Complex Visual Understanding in Object Detectors via LLM-guided Symbolic Reasoning

Evaluating and Analyzing Relationship Hallucinations in Large Vision-Language Models

SAM as the Guide: Mastering Pseudo-Label Refinement in Semi-Supervised Referring Expression Segmentation

CaM: Cache Merging for Memory-efficient LLMs Inference

Fast Text-to-3D-Aware Face Generation and Manipulation via Direct Cross-modal Mapping and Geometric Regularization

ERQ: Error Reduction for Post-Training Quantization of Vision Transformers

Integrating Global Context Contrast and Local Sensitivity for Blind Image Quality Assessment

Adaptive Feature Selection for No-Reference Image Quality Assessment by Mitigating Semantic Noise Sensitivity

Towards 3D Object Detection With Bimodal Deep Boltzmann Machines Over RGBD Imagery

Understanding Image Structure via Hierarchical Shape Parsing

Cross-Modality Binary Code Learning via Fusion Similarity Hashing

GVCNN: Group-View Convolutional Neural Networks for 3D Shape Recognition

Modulated Convolutional Networks

GroupCap: Group-Based Image Captioning With Structured Relevance and Diversity Constraints

Generative Adversarial Learning Towards Fast Weakly Supervised Detection

Cyclic Guidance for Weakly Supervised Joint Detection and Segmentation

Circulant Binary Convolutional Networks: Enhancing the Performance of 1-Bit DCNNs With Circulant Back Propagation

Towards Optimal Structured CNN Pruning via Generative Adversarial Learning

Exploiting Kernel Sparsity and Entropy for Interpretable CNN Compression

Towards Visual Feature Translation

Pyramidal Person Re-IDentification via Multi-Loss Dynamic Training

HRank: Filter Pruning Using High-Rank Feature Map

Salience-Guided Cascaded Suppression Network for Person Re-Identification

Projection & Probability-Driven Black-Box Attack

Cogradient Descent for Bilinear Optimization

AD-Cluster: Augmented Discriminative Clustering for Domain Adaptive Person Re-Identification

Siamese Box Adaptive Network for Visual Tracking

Filter Grafting for Deep Neural Networks

Rethinking Performance Estimation in Neural Architecture Search

Multi-Task Collaborative Network for Joint Referring Expression Comprehension and Segmentation

One-Shot Adversarial Attacks on Visual Tracking With Dual Attention

Noise-Aware Fully Webly Supervised Object Detection

Toward Joint Thing-and-Stuff Mining for Weakly Supervised Panoptic Segmentation

Towards Compact CNNs via Collaborative Compression

Discover Cross-Modality Nuances for Visible-Infrared Person Re-Identification

Image-to-Image Translation via Hierarchical Style Disentanglement

Beyond Max-Margin: Class Margin Equilibrium for Few-Shot Object Detection

Removing the Background by Adding the Background: Towards Background Robust Self-Supervised Video Representation Learning

RSTNet: Captioning With Adaptive Attention on Visual and Non-Visual Words

DIFNet: Boosting Visual Information Flow for Image Captioning

Active Teacher for Semi-Supervised Object Detection

Boosting Crowd Counting via Multifaceted Attention

Neural Architecture Search With Representation Mutual Information

Training-Free Transformer Architecture Search

IntraQ: Learning Synthetic Images With Intra-Class Heterogeneity for Zero-Shot Network Quantization

RefCLIP: A Universal Teacher for Weakly Supervised Referring Expression Comprehension

You Only Segment Once: Towards Real-Time Panoptic Segmentation

Solving Oscillation Problem in Post-Training Quantization Through a Theoretical Perspective

Meta Architecture for Point Cloud Analysis

STAR Loss: Reducing Semantic Ambiguity in Facial Landmark Detection

Clover: Towards a Unified Video-Language Alignment and Fusion Model

Discriminator-Cooperated Feature Map Distillation for GAN Compression

DistilPose: Tokenized Pose Regression With Heatmap Distillation

RefTeacher: A Strong Baseline for Semi-Supervised Referring Expression Comprehension

Top Rank Supervised Binary Coding for Visual Search

Multinomial Distribution Learning for Effective Neural Architecture Search

Universal Adversarial Perturbation via Prior Driven Uncertainty Approximation

Structured Modeling of Joint Deep Feature and Prediction Refinement for Salient Object Detection

Universal Perturbation Attack Against Image Retrieval

Bayesian Optimized 1-Bit CNNs

Scoot: A Perceptual Metric for Facial Sketches

Architecture Disentanglement for Deep Neural Networks

TRAR: Routing the Attention Spans in Transformer for Visual Question Answering

ReCU: Reviving the Dead Weights in Binary Neural Networks

EC-DARTS: Inducing Equalized and Consistent Optimization Into DARTS

Aha! Adaptive History-Driven Attack for Decision-Based Black-Box Models

Seminar Learning for Click-Level Weakly Supervised Semantic Segmentation

Parallel Detection-and-Segmentation Learning for Weakly Supervised Instance Segmentation

Occlude Them All: Occlusion-Aware Attention Network for Occluded Person Re-ID

Pseudo-label Alignment for Semi-supervised Instance Segmentation

AutoDiffusion: Training-Free Optimization of Time Steps and Architectures for Automated Diffusion Model Acceleration

DiffRate: Differentiable Compression Rate for Efficient Vision Transformers

Category-aware Allocation Transformer for Weakly Supervised Object Localization

X-Mesh: Towards Fast and Accurate Text-driven 3D Stylization via Dynamic Textual Guidance

SMMix: Self-Motivated Image Mixing for Vision Transformers

Automatic Network Pruning via Hilbert-Schmidt Independence Criterion Lasso under Information Bottleneck Principle

Anti-Bandit Neural Architecture Search for Model Defense

API-Net: Robust Generative Classifier via a Single Discriminator

SSCGAN: Facial Attribute Editing via Style Skip Connections

Interpretable Neural Network Decoupling

PAMS: Quantized Super-Resolution via Parameterized Max Scale

Improving Face Recognition from Hard Samples via Distribution Distillation Loss

Black-Box Dissector: Towards Erasing-Based Hard-Label Model Stealing Attack

ECO-TR: Efficient Correspondences Finding via Coarse-to-Fine Refinement

Fine-Grained Data Distribution Alignment for Post-Training Quantization

Privacy-Preserving Face Recognition with Learnable Privacy Budgets in Frequency Domain

An Information Theoretic Approach for Attention-Driven Face Forgery Detection

PixelFolder: An Efficient Progressive Pixel Synthesis Network for Image Generation

Dynamic Dual Trainable Bounds for Ultra-Low Precision Super-Resolution Networks

ARM: Any-Time Super-Resolution Method

SeqTR: A Simple Yet Universal Network for Visual Grounding

InterFormer: Real-time Interactive Image Segmentation

SVFR: A Unified Framework for Generalized Video Face Restoration

AIGI-Holmes: Towards Explainable and Generalizable AI-Generated Image Detection via Multimodal Large Language Models

Inter2Former: Dynamic Hybrid Attention for Efficient High-Precision Interactive Segmentation

Semantic Alignment and Reinforcement for Data-Free Quantization of Vision Transformers

OracleFusion: Assisting the Decipherment of Oracle Bone Script with Structurally Constrained Semantic Typography

Learning Interleaved Image-Text Comprehension in Vision-Language Large Models

Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference

Learning Image Demoireing from Unpaired Real Data

PortraitBooth: A Versatile Portrait Model for Fast Identity-preserved Personalization

GraCo: Granularity-Controllable Interactive Segmentation

FocSAM: Delving Deeply into Focused Objects in Segmenting Anything

Aligning and Prompting Everything All at Once for Universal Visual Perception

DS-VLM: Diffusion Supervision Vision Language Model

Polybasic Speculative Decoding Through a Theoretical Perspective

Outlier-aware Slicing for Post-Training Quantization in Vision Transformer

X-Oscar: A Progressive Framework for High-quality Text-guided 3D Animatable Avatar Generation

FreeAnchor: Learning to Match Anchors for Visual Object Detection

Information Competing Process for Learning Diversified Representations

Variational Structured Semantic Inference for Diverse Image Captioning

UWSOD: Toward Fully-Supervised-Level Capacity Weakly Supervised Object Detection

Rotated Binary Neural Network

Revisiting Discriminator in GAN Compression: A Generator-discriminator Cooperative Compression Scheme

Learning Best Combination for Efficient N:M Sparsity

Make Sharpness-Aware Minimization Stronger: A Sparsified Perturbation Approach

PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining

Improving Adversarial Robustness via Information Bottleneck Distillation

Discover and Align Taxonomic Context Priors for Open-world Semi-Supervised Learning

Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models

Parameter and Computation Efficient Transfer Learning for Vision-Language Pre-trained Models

CAPro: Webly Supervised Learning with Cross-modality Aligned Prototypes