Ishan Misra

41 Papers
63 Total Citations

Papers (41)

VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation

CVPR 2024
36
citations

Generating Multi-Image Synthetic Data for Text-to-Image Customization

ICCV 2025
14
citations

Generating Illustrated Instructions

CVPR 2024
7
citations

LLMs can see and hear without any training

ICML 2025
6
citations

Seeing Through the Human Reporting Bias: Visual Classifiers From Noisy Human-Centric Labels

CVPR 2016
0
citations

Cross-Stitch Networks for Multi-Task Learning

CVPR 2016
0
citations

From Red Wine to Red Tomato: Composition With Context

CVPR 2017
0
citations

Learning by Asking Questions

CVPR 2018
0
citations

ClusterFit: Improving Generalization of Visual Representations

CVPR 2020
0
citations

Self-Supervised Learning of Pretext-Invariant Representations

CVPR 2020
0
citations

In Defense of Grid Features for Visual Question Answering

CVPR 2020
0
citations

Audio-Visual Instance Discrimination with Cross-Modal Agreement

CVPR 2021
0
citations

Robust Audio-Visual Instance Discrimination

CVPR 2021
0
citations

3D Spatial Recognition Without Spatially Labeled 3D

CVPR 2021
0
citations

Omnivore: A Single Model for Many Visual Modalities

CVPR 2022
0
citations

Masked-Attention Mask Transformer for Universal Image Segmentation

CVPR 2022
0
citations

GeneCIS: A Benchmark for General Conditional Image Similarity

CVPR 2023
0
citations

OmniMAE: Single Model Masked Pretraining on Images and Videos

CVPR 2023
0
citations

ImageBind: One Embedding Space To Bind Them All

CVPR 2023
0
citations

Cut and Learn for Unsupervised Object Detection and Instance Segmentation

CVPR 2023
0
citations

Learning Video Representations From Large Language Models

CVPR 2023
0
citations

Self-Supervised Learning From Images With a Joint-Embedding Predictive Architecture

CVPR 2023
0
citations

Cut, Paste and Learn: Surprisingly Easy Synthesis for Instance Detection

ICCV 2017
0
citations

3D-RelNet: Joint Object and Relational Network for 3D Prediction

ICCV 2019
0
citations

Scaling and Benchmarking Self-Supervised Visual Representation Learning

ICCV 2019
0
citations

An End-to-End Transformer Model for 3D Object Detection

ICCV 2021
0
citations

Semi-Supervised Learning of Visual Features by Non-Parametrically Predicting View Assignments With Support Samples

ICCV 2021
0
citations

Self-Supervised Pretraining of 3D Features on Any Point-Cloud

ICCV 2021
0
citations

Emerging Properties in Self-Supervised Vision Transformers

ICCV 2021
0
citations

Space-Time Crop & Attend: Improving Cross-Modal Video Representation Learning

ICCV 2021
0
citations

The Effectiveness of MAE Pre-Pretraining for Billion-Scale Pretraining

ICCV 2023
0
citations

MOST: Multiple Object Localization with Self-Supervised Transformers for Object Discovery

ICCV 2023
0
citations

Detecting Twenty-Thousand Classes Using Image-Level Supervision

ECCV 2022
0
citations

Masked Siamese Networks for Label-Efficient Learning

ECCV 2022
0
citations

MDETR - Modulated Detection for End-to-End Multi-Modal Understanding

ICCV 2021
0
citations

InstanceDiffusion: Instance-level Control for Image Generation

CVPR 2024
0
citations

FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis

CVPR 2024
0
citations

Watch and Learn: Semi-Supervised Learning for Object Detectors From Video

CVPR 2015
0
citations

Unsupervised Learning of Visual Features by Contrasting Cluster Assignments

NeurIPS 2020
0
citations

Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers

NeurIPS 2021
0
citations

A Data-Augmentation Is Worth A Thousand Samples: Analytical Moments And Sampling-Free Training

NeurIPS 2022
0
citations