The computational integrity of remote large language model (LLM) serving cannot be taken for granted. For conventional deep neural networks (DNNs), the existing TEE-shielded DNN partitioning (TSDP) approach uses a Trusted Execution Environment (TEE) to compute non-linear components and to verify the integrity of linear components offloaded to an untrusted GPU. However, directly applying TSDP to Transformer-based LLMs incurs significant TEE computation and TEE-GPU communication overhead. This paper presents Communication-efficient TEE-GPU Attention (VeriAttn) for accelerating verifiable LLM inference. VeriAttn offloads both the linear and non-linear computations of attention to the GPU, while the TEE performs verification. For prefill, VeriAttn uses a two-level pipeline to overlap data movement, TEE pre-/post-processing, and GPU computation. For decoding, when the key-value cache exceeds available GPU memory, VeriAttn partitions attention across the TEE and GPU to reduce repeated key-value transfers. Evaluation on an Intel TDX platform shows that, relative to TSDP, VeriAttn achieves speedups of 2.60-3.38× during prefill with 6k-token prompts and 3.86-5.42× during decoding with 10k-token outputs.
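The abstract does not say how the TEE verifies linear components returned by the GPU. As an illustration only, the sketch below uses Freivalds' algorithm, a standard probabilistic check for outsourced matrix multiplication; treating it as the verification scheme here is an assumption, and all function names are hypothetical. It uses small integer matrices so that equality is exact.

```python
import random

def matmul(a, b):
    """Plain matrix product, standing in for work offloaded to the untrusted GPU."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def matvec(m, v):
    """Matrix-vector product, the only operation the verifier needs."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]

def freivalds_check(a, b, c, rounds=16, rng=None):
    """Return True if c is probably equal to a @ b.

    Each round costs O(n^2) instead of the O(n^3) of recomputing the product.
    A wrong c passes a single round with probability at most 1/2.
    """
    rng = rng or random.Random()
    for _ in range(rounds):
        # Random 0/1 vector with one entry per column of b (and of c).
        r = [rng.randint(0, 1) for _ in range(len(b[0]))]
        # Compare a @ (b @ r) with c @ r without forming a @ b.
        if matvec(a, matvec(b, r)) != matvec(c, r):
            return False
    return True

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c = matmul(a, b)                      # honest result
tampered = [row[:] for row in c]
tampered[0][0] += 1                   # a single corrupted entry
```

With these inputs, `freivalds_check(a, b, c)` accepts the honest result, while the tampered matrix is rejected with overwhelming probability as `rounds` grows. A real deployment would work on quantized or finite-field values rather than Python integers.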
Communication-Efficient Verifiable Attention for LLM Inference
- Year
- 2026
Attribution
- Abstract & full text
- arxiv.org/abs/2606.16352