Yong Jae Lee

Unified Spatio-Temporal Token Scoring for Efficient Video VLMs

arXiv 2026

Your Embedding Model is SMARTer Than You Think

arXiv 2026

MuRF: Unlocking the Multi-Scale Potential of Vision Foundation Models

arXiv 2026

Reasoning-Augmented Representations for Multimodal Retrieval

arXiv 2026

Relational Visual Similarity

arXiv 2025

Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark

arXiv 2025

See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

arXiv 2025

Revisiting Active Speaker Detection: An In-the-Wild Benchmark for Generalization and Robustness

arXiv 2025

CuRe: Cultural Gaps in the Long Tail of Text-to-Image Systems

ICCV 2025

LLM Inference Unveiled: Survey and Roofline Model Insights

arXiv 2024

LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

arXiv 2024

Yo'LLaVA: Your Personalized Language and Vision Assistant

arXiv 2024

Matryoshka Multimodal Models

arXiv 2024

TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models

arXiv 2024

CounterCurate: Enhancing Physical and Semantic Visio-Linguistic Compositional Reasoning via Counterfactual Examples

arXiv 2024

VGBench: Evaluating Large Language Models on Vector Graphics Understanding and Generation

arXiv 2024

Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos
