Caiming Xiong
VP of AI Research at Salesforce, leading work on enterprise LLMs, CodeGen, XGen, BLIP-2, and agent benchmarks like WebArena and AgentInstruct.
- Role: researcher
- Currently at: Salesforce Research / Salesforce AI Research
- Twitter: twitter.com/CaimingXiong
- Scholar: scholar.google.com/citations
- Papers: 99
Authored papers (99)
AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration
arXiv 2026
AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery
arXiv 2026
Rethinking Memory Mechanisms of Foundation Agents in the Second Half: A Survey
arXiv 2026
CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?
arXiv 2026
The Illusion of Certainty: Decoupling Capability and Calibration in On-Policy Distillation
arXiv 2026
VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
arXiv 2026
Least-Loaded Expert Parallelism: Load Balancing An Imbalanced Mixture-of-Experts
arXiv 2026
Future Optical Flow Prediction Improves Robot Control & Video Generation
arXiv 2026
OSWorld-Verified: A Cleaner, More Reliable Computer-Use Benchmark
blog
Synthesizing Agentic Data for Web Agents with Progressive Difficulty Enhancement Mechanisms
arXiv 2025
BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset
arXiv 2025
ActionStudio: A Lightweight Framework for Data and Training of Large Action Models
arXiv 2025
Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis
arXiv 2025
A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce
arXiv 2025
Large Language Models Post-training: Surveying Techniques from Alignment to Reasoning
arXiv 2025
Meta-Design Matters: A Self-Design Multi-Agent System
arXiv 2025
On the Trustworthiness of Generative Foundation Models: Guideline, Assessment, and Perspective
arXiv 2025
Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models
arXiv 2025
Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding
arXiv 2025
CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions
arXiv 2025
Reward-Guided Speculative Decoding for Efficient LLM Reasoning
arXiv 2025
MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers
arXiv 2025
LoCoBench: A Benchmark for Long-Context Large Language Models in Complex Software Engineering
arXiv 2025
Demystifying Domain-adaptive Post-training for Financial LLMs
arXiv 2025
Fractured Chain-of-Thought Reasoning
arXiv 2025
Scalable Chain of Thoughts via Elastic Reasoning
arXiv 2025
Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning
arXiv 2025
VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents
arXiv 2025
Enterprise Deep Research: Steerable Multi-Agent Deep Research for Enterprise Analytics
arXiv 2025
LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering
arXiv 2025
GTA1: GUI Test-time Scaling Agent
arXiv 2025
Promptomatix: An Automatic Prompt Optimization Framework for Large Language Models
arXiv 2025
MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models
arXiv 2025
UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG
arXiv 2025
UserRL: Training Interactive User-Centric Agent via Reinforcement Learning
arXiv 2025
Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels
arXiv 2025
UserBench: An Interactive Gym Environment for User-Centric Agents
arXiv 2025
CoDA: Coding LM via Diffusion Adaptation
arXiv 2025
Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math
arXiv 2025
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
NeurIPS
Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions
arXiv 2024
AgentOhana: Design Unified Data and Training Pipeline for Effective Agent Learning
arXiv 2024
TrustLLM: Trustworthiness in Large Language Models
arXiv 2024
GIFT-Eval: A Benchmark For General Time Series Forecasting Model Evaluation
arXiv 2024
FaithEval: Can Your Language Model Stay Faithful to Context, Even If "The Moon is Made of Marshmallows"
arXiv 2024
Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems
arXiv 2024
Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?
arXiv 2024
TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action
arXiv 2024
How Much are Large Language Models Contaminated? A Comprehensive Survey and the LLMSanitize Library
arXiv 2024
PerfCodeGen: Improving Performance of LLM Generated Code with Execution Feedback
arXiv 2024
ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models
arXiv 2024
GReaTer: Gradients over Reasoning Makes Smaller Language Models Strong Prompt Optimizers
arXiv 2024
ThinK: Thinner Key Cache by Query-Driven Pruning
arXiv 2024
FOFO: A Benchmark to Evaluate LLMs' Format-Following Capability
arXiv 2024
StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs
arXiv 2024
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
arXiv 2024
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens
arXiv 2024
AgentLite: A Lightweight Library for Building and Advancing Task-Oriented LLM Agent System
arXiv 2024
Automatic Curriculum Expert Iteration for Reliable LLM Reasoning
arXiv 2024
CodeGen2: Lessons for Training LLMs on Programming and Natural Languages
arXiv 2023
ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding
CVPR 2024 1
LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond
arXiv 2023
UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild
Improved Online Conformal Prediction via Strongly Adaptive Online Learning
arXiv 2023
ChatGPT's One-year Anniversary: Are Open-Source Large Language Models Catching up?
arXiv 2023
Did You Read the Instructions? Rethinking the Effectiveness of Task Definitions in Instruction Learning
arXiv 2023
Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization
arXiv 2023
Preference-grounded Token-level Guidance for Language Model Fine-tuning
Unlocking Anticipatory Text Generation: A Constrained Approach for Large Language Models Decoding
arXiv 2023
Lemur: Harmonizing Natural Language and Code for Language Agents
arXiv 2023
DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection for Conversational AI
arXiv 2023
OpenAgents: An Open Platform for Language Agents in the Wild
arXiv 2023
XGen-7B Technical Report
arXiv 2023
BOLAA: Benchmarking and Orchestrating LLM-augmented Autonomous Agents
arXiv 2023
Embrace Divergence for Richer Insights: A Multi-document Summarization Benchmark and a Case Study on Summarizing Diverse Information from News Articles
arXiv 2023
LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer
arXiv 2022
Converse: A Tree-Based Modular Task-Oriented Dialogue System
arXiv 2022
FOLIO: Natural Language Reasoning with First-Order Logic
arXiv 2022
Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation
arXiv 2022
Prompt-Tuning Can Be Much Better Than Fine-Tuning on Cross-lingual Understanding With Multilingual Language Models
arXiv 2022
Binding Language Models in Symbolic Languages
arXiv 2022
UnifiedSKG: Unifying and Multi-Tasking Structured Knowledge Grounding with Text-to-Text Language Models
arXiv 2022
DocNLI: A Large-scale Dataset for Document-level Natural Language Inference
Findings (ACL) 2021 8
QAFactEval: Improved QA-Based Factual Consistency Evaluation for Summarization
NAACL 2022 7
MixQG: Neural Question Generation with Mixed Answer Types
Findings (NAACL) 2022 7
Robustness Gym: Unifying the NLP Evaluation Landscape
NAACL 2021 4
BookSum: A Collection of Datasets for Long-form Narrative Summarization
arXiv 2021
GraPPa: Grammar-Augmented Pre-Training for Table Semantic Parsing
arXiv 2020
SummEval: Re-evaluating Summarization Evaluation
arXiv 2020
BERTology Meets Biology: Interpreting Attention in Protein Language Models
ICLR 2021 1
Bridging Textual and Tabular Data for Cross-Domain Text-to-SQL Semantic Parsing
Findings of the Association for Computational Linguistics 2020
CTRLsum: Towards Generic Controllable Text Summarization
ERASER: A Benchmark to Evaluate Rationalized NLP Models
SParC: Cross-Domain Semantic Parsing in Context
Evaluating the Factual Consistency of Abstractive Text Summarization
EMNLP 2020 11
The Natural Language Decathlon: Multitask Learning as Question Answering
Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning
Non-Autoregressive Neural Machine Translation
Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning
Frequent co-authors
10 from 99 papers