Ishan Misra

Papers: 41

Total Citations: 63

Papers (41)

VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation

Generating Multi-Image Synthetic Data for Text-to-Image Customization

Generating Illustrated Instructions

LLMs can see and hear without any training

Seeing Through the Human Reporting Bias: Visual Classifiers From Noisy Human-Centric Labels

Cross-Stitch Networks for Multi-Task Learning

From Red Wine to Red Tomato: Composition With Context

Learning by Asking Questions

ClusterFit: Improving Generalization of Visual Representations

Self-Supervised Learning of Pretext-Invariant Representations

In Defense of Grid Features for Visual Question Answering

Audio-Visual Instance Discrimination with Cross-Modal Agreement

Robust Audio-Visual Instance Discrimination

3D Spatial Recognition Without Spatially Labeled 3D

Omnivore: A Single Model for Many Visual Modalities

Masked-Attention Mask Transformer for Universal Image Segmentation

GeneCIS: A Benchmark for General Conditional Image Similarity

OmniMAE: Single Model Masked Pretraining on Images and Videos

ImageBind: One Embedding Space To Bind Them All

Cut and Learn for Unsupervised Object Detection and Instance Segmentation

Learning Video Representations From Large Language Models

Self-Supervised Learning From Images With a Joint-Embedding Predictive Architecture

Cut, Paste and Learn: Surprisingly Easy Synthesis for Instance Detection

3D-RelNet: Joint Object and Relational Network for 3D Prediction

Scaling and Benchmarking Self-Supervised Visual Representation Learning

An End-to-End Transformer Model for 3D Object Detection

Semi-Supervised Learning of Visual Features by Non-Parametrically Predicting View Assignments With Support Samples

Self-Supervised Pretraining of 3D Features on Any Point-Cloud

Emerging Properties in Self-Supervised Vision Transformers

Space-Time Crop & Attend: Improving Cross-Modal Video Representation Learning

The Effectiveness of MAE Pre-Pretraining for Billion-Scale Pretraining

MOST: Multiple Object Localization with Self-Supervised Transformers for Object Discovery

Detecting Twenty-Thousand Classes Using Image-Level Supervision

Masked Siamese Networks for Label-Efficient Learning

MDETR - Modulated Detection for End-to-End Multi-Modal Understanding

InstanceDiffusion: Instance-level Control for Image Generation

FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis

Watch and Learn: Semi-Supervised Learning for Object Detectors From Video

Unsupervised Learning of Visual Features by Contrasting Cluster Assignments

Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers

A Data-Augmentation Is Worth A Thousand Samples: Analytical Moments And Sampling-Free Training