CUCo: An Agentic Framework for Compute and Communication Co-design

Year: 2026
Full text: arxiv.org/abs/2603.02376 (CC-BY-4.0)

Abstract

Computation and communication in distributed LLM training and inference are traditionally optimized in isolation. Expert-crafted systems such as DeepEP, FLUX, and TokenWeave show the potential of co-design, but they require deep systems expertise and hardware-specific tuning. CUCo is an agentic framework that automates compute-communication co-design of CUDA kernels. It combines a structured design-space formalization with two agents: a correctness-first fast-path agent that produces reliable baselines, and an evolution-driven slow-path agent that searches for high-performance strategies. CUCo achieves up to 1.57x speedup across four multi-GPU workloads, and on a DeepSeek-V3 MoE layer it discovers a two-stream overlap strategy that hides dispatch behind local compute. The LLM inference cost is under $10 per workload.
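
To make "hides dispatch behind local compute" concrete, here is a minimal timing sketch of a two-stream overlap. It is an idealized model, not CUCo's implementation: it assumes dispatch runs on one stream while local expert compute runs on another, that the two do not contend for resources, and that remote-token compute starts only once both have finished. The class `LayerTimes`, the function names, and all timings are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerTimes:
    """Hypothetical per-layer timings in milliseconds (illustrative only)."""
    dispatch_ms: float        # communication: sending tokens to remote experts
    local_compute_ms: float   # expert compute on tokens already on this GPU
    remote_compute_ms: float  # expert compute on tokens that arrive via dispatch


def serial_ms(t: LayerTimes) -> float:
    """Baseline: dispatch, then local compute, then remote compute, one after another."""
    return t.dispatch_ms + t.local_compute_ms + t.remote_compute_ms


def two_stream_ms(t: LayerTimes) -> float:
    """Idealized overlap: dispatch and local compute run concurrently on two streams.

    Remote compute needs both the dispatched tokens and the compute stream,
    so it starts after the slower of the two finishes.
    """
    return max(t.dispatch_ms, t.local_compute_ms) + t.remote_compute_ms


def speedup(t: LayerTimes) -> float:
    """Ratio of serial time to overlapped time under this model."""
    return serial_ms(t) / two_stream_ms(t)


if __name__ == "__main__":
    # Made-up numbers: dispatch is shorter than local compute, so it is fully hidden.
    example = LayerTimes(dispatch_ms=2.0, local_compute_ms=3.0, remote_compute_ms=3.0)
    print(f"serial:     {serial_ms(example):.2f} ms")
    print(f"two-stream: {two_stream_ms(example):.2f} ms")
    print(f"speedup:    {speedup(example):.2f}x")
```

Under this model the gain is bounded by how much local compute is available to cover the dispatch: when dispatch takes longer than local compute, only part of it is hidden and the remainder stays on the critical path.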