The infrastructure powering IBM's Gen AI model development

Year: 2024
Venue: arXiv 2024
Authors: 146

Full text: arxiv.org/abs/2407.05467v2

Abstract

AI Infrastructure plays a key role in the speed and cost-competitiveness of developing and deploying advanced AI models. The current demand for powerful AI infrastructure for model training is driven by the emergence of generative AI and foundational models, where on occasion thousands of GPUs must cooperate on a single training job for the model to be trained in a reasonable time. Delivering efficient and high-performing AI training requires an end-to-end solution that combines hardware, software and holistic telemetry to cater for multiple types of AI workloads. In this report, we describe IBM's hybrid cloud infrastructure that powers our generative AI model development. This infrastructure includes (1) Vela: an AI-optimized supercomputing capability directly integrated into the IBM Cloud, delivering scalable, dynamic, multi-tenant and geographically distributed infrastructure for large-scale model training and other AI workflow steps and (2) Blue Vela: a large-scale, purpose-built, on-premises hosting environment that is optimized to support our largest and most ambitious AI model training tasks. Vela provides IBM with the dual benefit of high performance for internal use along with the flexibility to adapt to an evolving commercial landscape. Blue Vela provides us with the benefits of rapid development of our largest and most ambitious models, as well as future-proofing against the evolving model landscape in the industry. Taken together, they provide IBM with the ability to rapidly innovate in the development of both AI models and commercial offerings.

Authors

David Cox, Yikang Shen, Mayank Mishra, Rameswar Panda, Brian Belgodere, Gaoyuan Zhang, I-Hsin Chung, Ido Levy, Ruchir Puri, Swaminathan Sundararaman, Robert Guthrie, Shubham Sharma, Talia Gershon, Seetharami Seelam, Milton Bonilla, Lan Hoang, Danny Barnett, Apoorve Mohan, Ming-Hung Chen, Lixiang Luo, Robert Walkup, Constantinos Evangelinos, Shweta Salaria, Marc Dombrowa, Yoonho Park, Apo Kayi, Liran Schour, Alim Alim, Ali Sydney, Pavlos Maniotis, Laurent Schares, Bernard Metzler, Bengi Karacali-Akyamac, Sophia Wen, Tatsuhiro Chiba, Sunyanan Choochotkaew, Takeshi Yoshimura, Claudia Misale, Tonia Elengikal, Kevin O Connor, Zhuoran Liu, Richard Molina, Lars Schneidenbach, James Caden, Christopher Laibinis, Carlos Fonseca, Vasily Tarasov, Frank Schmuck, Scott Guthridge, Jeremy Cohn, Marc Eshel, Paul Muench, Runyu Liu, William Pointer, Drew Wyskida, Bob Krull, Ray Rose, Brent Wolfe, William Cornejo, John Walter, Colm Malone, Clifford Perucci, Frank Franco, Nigel Hinds, Bob Calio, Pavel Druyan, Robert Kilduff, John Kienle, Connor McStay, Andrew Figueroa, Matthew Connolly, Edie Fost, Gina Roma, Jake Fonseca, Michele Payne, Ryan Schenkel, Amir Malki, Lion Schneider, Aniruddha Narkhede, Shekeba Moshref, Alexandra Kisin, Olga Dodin, Bill Rippon, Henry Wrieth, John Ganci, Johnny Colino, Donna Habeger-Rose, Rakesh Pandey, Aditya Gidh, Dennis Patterson, Samsuddin Salmani, Rambilas Varma, Rumana Rumana, Aditya Gaur, Aditya Prasad, Matt Stallone, Dakshi Agrawal, Drew Thorstensen, Joel Belog, Brent Tang, Saurabh Kumar Gupta, Amitabha Biswas, Anup Maheshwari, Eran Gampel, Jason Van Patten, Matthew Runion, Sai Kaki, Yigal Bogin, Brian Reitz, Steve Pritko, Shahan Najam, Surya Nambala, Radhika Chirra, Rick Welp, Frank DiMitri, Felipe Telles, Amilcar Arvelo, King Chu, Ed Seminaro, Andrew Schram, Felix Eickhoff, William Hanson, Eric Mckeever, Michael Light, Dinakaran Joseph, Piyush Chaudhary, Piyush Shivam, Puneet Chaudhary, Wesley Jones, Chris Bostic, Rezaul Islam, Steve Duersch, Wayne Sawdon, John Lewars, Matthew Klos, Michael Spriggs, Bill McMillan, George Gao, Ashish Kamra, Gaurav Singh, Marc Curry, Tushar Katarki, Joe Talerico, Zenghui Shi, Sai Sindhur Malleni, Erwan Gallen