Ranjay Krishna
- Papers: 56
Authored papers
MolmoAct2: Action Reasoning Models for Real-world Deployment
arXiv 2026
MolmoSpaces: A Large-Scale Open Ecosystem for Robot Navigation and Manipulation
arXiv 2026
AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning
arXiv 2026
Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
arXiv 2026
TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics
arXiv 2026
WildDet3D: Scaling Promptable 3D Detection in the Wild
arXiv 2026
VFIG: Vectorizing Complex Figures in SVG with Vision-Language Models
arXiv 2026
VLS: Steering Pretrained Robot Policies via Vision-Language Models
arXiv 2026
Theory of Space: Can Foundation Models Construct Spatial Beliefs through Active Exploration?
arXiv 2026
Recurrent-Depth VLA: Implicit Test-Time Compute Scaling of Vision-Language-Action Models via Latent Iterative Reasoning
arXiv 2026
PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning
arXiv 2026
Unified Spatio-Temporal Token Scoring for Efficient Video VLMs
arXiv 2026
Video-Based Reward Modeling for Computer-Use Agents
arXiv 2026
Reinforced Visual Perception with Tools
arXiv 2025
Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation
arXiv 2025
Contrastive Flow Matching
ICCV 2025
On the Trustworthiness of Generative Foundation Models: Guideline, Assessment, and Perspective
arXiv 2025
Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations
arXiv 2025
PointArena: Probing Multimodal Grounding Through Language-Guided Pointing
arXiv 2025
Spatial Mental Modeling from Limited Views
arXiv 2025
MolmoAct: Action Reasoning Models that can Reason in Space
arXiv 2025
Spurious Rewards: Rethinking Training Signals in RLVR
arXiv 2025
VideoMolmo: Spatio-Temporal Grounding Meets Pointing
arXiv 2025
Mull-Tokens: Modality-Agnostic Latent Thinking
arXiv 2025
OlmoEarth: Stable Latent Image Modeling for Multimodal Earth Observation
arXiv 2025
ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
arXiv 2025
Explain Before You Answer: A Survey on Compositional Visual Reasoning
arXiv 2025
Seeking and Updating with Live Visual Knowledge
arXiv 2025
MMMG: a Comprehensive and Reliable Evaluation Suite for Multitask Multimodal Generation
arXiv 2025
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
CVPR 2025 1
NVILA: Efficient Frontier Visual Language Models
CVPR 2025 1
One Diffusion to Generate Them All
CVPR 2025 1
THE COLOSSEUM: A Benchmark for Evaluating Generalization for Robotic Manipulation
arXiv 2024
SAT: Dynamic Spatial Aptitude Training for Multimodal Language Models
arXiv 2024
Interleaved Scene Graphs for Interleaved Text-and-Image Generation Assessment
arXiv 2024
ImageInWords: Unlocking Hyper-Detailed Image Descriptions
arXiv 2024
Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps
arXiv 2024
Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions
arXiv 2024
Negative Token Merging: Image-based Adversarial Feature Guidance
arXiv 2024
TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action
arXiv 2024
m&m's: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks
arXiv 2024
Efficient Inference of Vision Instruction-Following Models with Elastic Cache
arXiv 2024
ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models
arXiv 2024
Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass
arXiv 2024
The Unmet Promise of Synthetic Training Images: Using Retrieved Real Images Performs Better
arXiv 2024
DataComp: In search of the next generation of multimodal datasets
NeurIPS 2023 11
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
arXiv 2023
Holodeck: Language Guided Generation of 3D Embodied AI Environments
CVPR 2024 1
SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality
TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering
ICCV 2023 1
Quilt-1M: One Million Image-Text Pairs for Histopathology
Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias
EcoAssistant: Using LLM Assistant More Affordably and Accurately
arXiv 2023
MIMIC: Masked Image Modeling with Image Correspondences
arXiv 2023
Improving Interpersonal Communication by Simulating Audiences with Language Models
arXiv 2023
Mind Your Outliers! Investigating the Negative Impact of Outliers on Active Learning for Visual Question Answering
ACL 2021 5
Frequent co-authors
10 from 56 papers