Towards smaller, faster decoder-only transformers: Architectural variants and their implications

Three new decoder-only transformer variants (ParallelGPT, LinearlyCompressedGPT, and ConvCompressedGPT) maintain code generation performance while reducing model sizes and improving training efficiency.

Year
2024
Venue
arXiv 2024
Authors
2
Hosting
Abstract only (ARXIV-DEFAULT)

Attribution

Abstract & full text
arxiv.org/abs/2404.14462v4 (ARXIV-DEFAULT)
TL;DR
Semantic Scholar

Abstract

In recent years, research on Large Language Models (LLMs) has grown exponentially, focusing predominantly on models built on the transformer architecture introduced by [1] and further developed through the decoder-only variants of [2]. Contemporary efforts in this field primarily aim to enhance model capabilities by scaling up both the architecture and the volume of training data. However, work on reducing model sizes while preserving their efficacy remains scant. In this study, we introduce three modifications to the decoder-only transformer architecture: ParallelGPT (pgpt), LinearGPT (lgpt), and ConvGPT (cgpt). These variants demonstrate performance comparable to the conventional architecture in language generation, yet benefit from reduced model sizes and faster training. We open-source the model weights and the complete codebase for these implementations to support further research.
