Ranjay Krishna

Papers
56

Attribution

Affiliations & profile
Semantic Scholar

Authored papers

56

MolmoAct2: Action Reasoning Models for Real-world Deployment

arXiv 2026

2026

MolmoSpaces: A Large-Scale Open Ecosystem for Robot Navigation and Manipulation

arXiv 2026

2026

AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning

arXiv 2026

2026

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

arXiv 2026

2026

TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics

arXiv 2026

2026

WildDet3D: Scaling Promptable 3D Detection in the Wild

arXiv 2026

2026

VFIG: Vectorizing Complex Figures in SVG with Vision-Language Models

arXiv 2026

2026

VLS: Steering Pretrained Robot Policies via Vision-Language Models

arXiv 2026

2026

Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration?

arXiv 2026

2026

Recurrent-Depth VLA: Implicit Test-Time Compute Scaling of Vision-Language-Action Models via Latent Iterative Reasoning

arXiv 2026

2026

PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning

arXiv 2026

2026

Unified Spatio-Temporal Token Scoring for Efficient Video VLMs

arXiv 2026

2026

Video-Based Reward Modeling for Computer-Use Agents

arXiv 2026

2026

Reinforced Visual Perception with Tools

arXiv 2025

2025

Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation

arXiv 2025

2025

Contrastive Flow Matching

ICCV 2025

2025

On the Trustworthiness of Generative Foundation Models: Guideline, Assessment, and Perspective

arXiv 2025

2025

Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations

arXiv 2025

2025

PointArena: Probing Multimodal Grounding Through Language-Guided Pointing

arXiv 2025

2025

Spatial Mental Modeling from Limited Views

arXiv 2025

2025

MolmoAct: Action Reasoning Models that can Reason in Space

arXiv 2025

2025

Spurious Rewards: Rethinking Training Signals in RLVR

arXiv 2025

2025

VideoMolmo: Spatio-Temporal Grounding Meets Pointing

arXiv 2025

2025

Mull-Tokens: Modality-Agnostic Latent Thinking

arXiv 2025

2025

OlmoEarth: Stable Latent Image Modeling for Multimodal Earth Observation

arXiv 2025

2025

ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning

arXiv 2025

2025

Explain Before You Answer: A Survey on Compositional Visual Reasoning

arXiv 2025

2025

Seeking and Updating with Live Visual Knowledge

arXiv 2025

2025

MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation

arXiv 2025

2025

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

CVPR 2025 1

2024

NVILA: Efficient Frontier Visual Language Models

CVPR 2025 1

2024

One Diffusion to Generate Them All

CVPR 2025 1

2024

THE COLOSSEUM: A Benchmark for Evaluating Generalization for Robotic Manipulation

arXiv 2024

2024

SAT: Dynamic Spatial Aptitude Training for Multimodal Language Models

arXiv 2024

2024

Interleaved Scene Graphs for Interleaved Text-and-Image Generation Assessment

arXiv 2024

2024

ImageInWords: Unlocking Hyper-Detailed Image Descriptions

arXiv 2024

2024

Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps

arXiv 2024

2024

Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions

arXiv 2024

2024

Negative Token Merging: Image-based Adversarial Feature Guidance

arXiv 2024

2024

TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action

arXiv 2024

2024

m&m's: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks

arXiv 2024

2024

Efficient Inference of Vision Instruction-Following Models with Elastic Cache

arXiv 2024

2024

ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models

arXiv 2024

2024

Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass

arXiv 2024

2024

The Unmet Promise of Synthetic Training Images: Using Retrieved Real Images Performs Better

arXiv 2024

2024

DataComp: In search of the next generation of multimodal datasets

NeurIPS 2023 11

2023

Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

arXiv 2023

2023

Holodeck: Language Guided Generation of 3D Embodied AI Environments

CVPR 2024 1

2023

SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality

2023

TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering

ICCV 2023 1

2023

Quilt-1M: One Million Image-Text Pairs for Histopathology

2023

Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias

2023

EcoAssistant: Using LLM Assistant More Affordably and Accurately

arXiv 2023

2023

MIMIC: Masked Image Modeling with Image Correspondences

arXiv 2023

2023

Improving Interpersonal Communication by Simulating Audiences with Language Models

arXiv 2023

2023

Mind Your Outliers! Investigating the Negative Impact of Outliers on Active Learning for Visual Question Answering

ACL 2021 5

2021

Affiliations

No known affiliations.

Frequent co-authors

10

from 56 papers