Cordelia Schmid

DataDream: Few-shot Guided Dataset Generation

arXiv 2024

Towards Generalizable Vision-Language Robotic Manipulation: A Benchmark and LLM-guided 3D Policy

arXiv 2024

Towards Zero-Shot Multimodal Machine Translation

arXiv 2024

Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

CVPR 2023

Verbs in Action: Improving verb understanding in video-language models

ICCV 2023

PolarNet: 3D Point Clouds for Language-Guided Robotic Manipulation

arXiv 2023

CoVR-2: Automatic Data Construction for Composed Video Retrieval

arXiv 2023

Modular Visual Question Answering via Code Generation

arXiv 2023

POCO: 3D Pose and Shape Estimation with Confidence

arXiv 2023

Waffling around for Performance: Visual Classification with Random Words and Broad Concepts

ICCV 2023

Bridging the Gap between Model Explanations in Partially Annotated Multi-label Classification

CVPR 2023

Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

arXiv 2022

TubeDETR: Spatio-Temporal Video Grounding with Transformers

CVPR 2022

Learning to Answer Visual Questions from Web Videos

arXiv 2022

WALDO: Future Video Synthesis using Object Layer Decomposition and Parametric Flow Prediction

ICCV 2023