April 19, 2024
  • Marking a major investment in Meta’s AI future, we are announcing two 24k GPU clusters. We are sharing details on the hardware, network, storage, design, performance, and software that help us extract high throughput and reliability for various AI workloads. We use this cluster design for Llama 3 training.
  • We are strongly committed to open compute and open source. We built these clusters on top of Grand Teton, OpenRack, and PyTorch and continue to push open innovation across the industry.
  • This announcement is one step in our ambitious infrastructure roadmap. By the end of 2024, we’re aiming to continue growing our infrastructure build-out, which will include 350,000 NVIDIA H100 GPUs as part of a portfolio that will feature compute power equivalent to nearly 600,000 H100s.

To lead in developing AI means leading investments in hardware infrastructure. Hardware infrastructure plays an important role in AI’s future. Today, we’re sharing details on two versions of our 24,576-GPU data center scale cluster at Meta. These clusters support our current and next generation AI models, including Llama 3, the successor to Llama 2, our publicly released LLM, as well as AI research and development across GenAI and other areas.

A peek into Meta’s large-scale AI clusters

Meta’s long-term vision is to build artificial general intelligence (AGI) that is open and built responsibly so that it can be widely available for everyone to benefit from. As we work towards AGI, we have also worked on scaling our clusters to power this ambition. The progress we make towards AGI creates new products, new AI features for our family of apps, and new AI-centric computing devices.

While we’ve had a long history of building AI infrastructure, we first shared details on our AI Research SuperCluster (RSC), featuring 16,000 NVIDIA A100 GPUs, in 2022. RSC has accelerated our open and responsible AI research by helping us build our first generation of advanced AI models. It played and continues to play an important role in the development of Llama and Llama 2, as well as advanced AI models for applications ranging from computer vision, NLP, and speech recognition, to image generation, and even coding.

Under the hood

Our newer AI clusters build upon the successes and lessons learned from RSC. We focused on building end-to-end AI systems with a major emphasis on researcher and developer experience and productivity. The efficiency of the high-performance network fabrics within these clusters, some of the key storage decisions, combined with the 24,576 NVIDIA Tensor Core H100 GPUs in each, allow both cluster versions to support models larger and more complex than could be supported in the RSC and pave the way for advancements in GenAI product development and AI research.

Network

At Meta, we handle hundreds of trillions of AI model executions per day. Delivering these services at a large scale requires a highly advanced and flexible infrastructure. Custom designing much of our own hardware, software, and network fabrics allows us to optimize the end-to-end experience for our AI researchers while ensuring our data centers operate efficiently.

With this in mind, we built one cluster with a remote direct memory access (RDMA) over converged Ethernet (RoCE) network fabric solution based on the Arista 7800 with Wedge400 and Minipack2 OCP rack switches. The other cluster features an NVIDIA Quantum2 InfiniBand fabric. Both of these solutions interconnect 400 Gbps endpoints. With these two, we are able to assess the suitability and scalability of these different types of interconnect for large-scale training, giving us more insights that will help inform how we design and build even bigger, scaled-up clusters in the future. Through careful co-design of the network, software, and model architectures, we have successfully used both RoCE and InfiniBand clusters for large GenAI workloads (including our ongoing training of Llama 3 on our RoCE cluster) without any network bottlenecks.
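From the training framework’s point of view, both fabrics are reached through NCCL. As a minimal sketch (not our production configuration), the snippet below shows how a PyTorch job might be pointed at a RoCE or InfiniBand fabric with standard NCCL environment variables before the process group is created; the interface names, GID index, and the FABRIC switch are illustrative assumptions.

```python
# Minimal sketch (not a production config): steering NCCL toward a RoCE or
# InfiniBand fabric before creating the PyTorch process group.
import os

import torch
import torch.distributed as dist

FABRIC = os.environ.get("FABRIC", "roce")  # hypothetical switch between the two cluster types

if FABRIC == "roce":
    # RoCEv2 traffic typically uses a non-default GID index on the RDMA NIC.
    os.environ.setdefault("NCCL_IB_GID_INDEX", "3")
    os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # NIC used for bootstrap traffic
else:
    # InfiniBand: restrict NCCL to the IB host channel adapters.
    os.environ.setdefault("NCCL_IB_HCA", "mlx5")

dist.init_process_group(backend="nccl")  # rank/world size come from the launcher (e.g., torchrun)
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Quick sanity check that collectives actually traverse the fabric.
t = torch.ones(1, device="cuda")
dist.all_reduce(t)
print(f"rank {dist.get_rank()}: all_reduce result = {t.item()}")
```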

Compute

Both clusters are built using Grand Teton, our in-house-designed, open GPU hardware platform that we’ve contributed to the Open Compute Project (OCP). Grand Teton builds on many generations of AI systems that integrate power, control, compute, and fabric interfaces into a single chassis for better overall performance, signal integrity, and thermal performance. It provides rapid scalability and flexibility in a simplified design, allowing it to be quickly deployed into data center fleets and easily maintained and scaled. Combined with other in-house innovations like our Open Rack power and rack architecture, Grand Teton allows us to build new clusters in a way that is purpose-built for current and future applications at Meta.

We have been openly designing our GPU hardware platforms beginning with our Big Sur platform in 2015.

Storage

Storage plays an important role in AI training, and yet it is one of the least talked-about aspects. As GenAI training jobs become more multimodal over time, consuming large amounts of image, video, and text data, the need for data storage grows rapidly. The need to fit all that data storage into a performant, yet power-efficient footprint doesn’t go away though, which makes the problem more interesting.

Our storage deployment addresses the data and checkpointing needs of the AI clusters via a home-grown Linux Filesystem in Userspace (FUSE) API backed by a version of Meta’s ‘Tectonic’ distributed storage solution optimized for Flash media. This solution enables thousands of GPUs to save and load checkpoints in a synchronized fashion (a challenge for any storage solution) while also providing the flexible, high-throughput, exabyte-scale storage required for data loading.
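To give a sense of what synchronized checkpointing looks like at the framework level, here is a minimal sketch using the torch.distributed.checkpoint save API available in recent PyTorch releases to write sharded checkpoints to a FUSE-mounted path. The mount point and helper function are hypothetical; the Tectonic-backed FUSE layer itself is internal to Meta.

```python
# Minimal sketch: synchronized, sharded checkpointing with torch.distributed.checkpoint.
# The FUSE mount point below is hypothetical; Meta's Tectonic-backed FUSE layer is internal.
import torch.distributed as dist
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint import FileSystemWriter


def save_checkpoint(model, step, root="/mnt/genai-fuse/checkpoints"):
    """Every rank writes its own shard; the library coordinates the save across ranks."""
    state_dict = {"model": model.state_dict(), "step": step}
    dcp.save(state_dict, storage_writer=FileSystemWriter(f"{root}/step_{step}"))
    dist.barrier()  # make the synchronized save point explicit before training resumes
```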

We have also partnered with Hammerspace to co-develop and land a parallel network file system (NFS) deployment to meet the developer experience requirements for this AI cluster. Among other benefits, Hammerspace enables engineers to perform interactive debugging for jobs using thousands of GPUs, as code changes are immediately accessible to all nodes within the environment. When paired together, the combination of our Tectonic distributed storage solution and Hammerspace enables fast iteration velocity without compromising on scale.

The storage deployments in our GenAI clusters, both Tectonic- and Hammerspace-backed, are based on the YV3 Sierra Point server platform, upgraded with the latest high-capacity E1.S SSD we can procure in the market today. Aside from the higher SSD capacity, the number of servers per rack was customized to achieve the right balance of throughput capacity per server, rack count reduction, and associated power efficiency. Using the OCP servers as Lego-like building blocks, our storage layer is able to flexibly scale to future requirements in this cluster, as well as in future, bigger AI clusters, while being fault-tolerant to day-to-day infrastructure maintenance operations.

Performance

One of the principles we have in building our large-scale AI clusters is to maximize performance and ease of use simultaneously, without compromising one for the other. This is an important principle in creating best-in-class AI models.

As we push the limits of AI systems, the best way we can test our ability to scale up our designs is to simply build a system, optimize it, and actually test it (while simulators help, they only go so far). In this design journey, we compared the performance seen in our small clusters with that of our large clusters to see where our bottlenecks are. In the graph below, AllGather collective performance is shown (as normalized bandwidth on a 0-100 scale) when a large number of GPUs are communicating with each other at message sizes where roofline performance is expected.
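This kind of measurement is in the spirit of a standard collective microbenchmark. As a rough sketch, not the exact methodology behind the graph, the snippet below times AllGather at a large message size and converts the result to bus bandwidth, similar to how nccl-tests reports it; the message size and iteration counts are illustrative assumptions.

```python
# Rough sketch of an AllGather bandwidth microbenchmark (message size and
# iteration counts are illustrative, not the exact methodology used here).
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
rank, world = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(rank % torch.cuda.device_count())

msg_bytes = 256 * 1024 * 1024  # large message, near the bandwidth roofline
shard = torch.empty(msg_bytes // 4, dtype=torch.float32, device="cuda")
out = [torch.empty_like(shard) for _ in range(world)]

for _ in range(5):  # warm-up iterations
    dist.all_gather(out, shard)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
iters = 20
start.record()
for _ in range(iters):
    dist.all_gather(out, shard)
end.record()
torch.cuda.synchronize()

elapsed_s = start.elapsed_time(end) / 1000 / iters
# Bus bandwidth: each rank sends and receives (world - 1) shards per AllGather.
bus_gbps = msg_bytes * (world - 1) / elapsed_s / 1e9
if rank == 0:
    print(f"AllGather bus bandwidth: {bus_gbps:.1f} GB/s")
```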

Our out-of-the-box performance for large clusters was initially poor and inconsistent compared to optimized small cluster performance. To address this, we made several changes to how our internal job scheduler schedules jobs with network topology awareness; this resulted in latency benefits and minimized the amount of traffic going to the upper layers of the network. We also optimized our network routing strategy in combination with NVIDIA Collective Communications Library (NCCL) changes to achieve optimal network utilization. These changes helped our large clusters achieve the same great, expected performance as our small clusters.
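To make the topology-awareness idea concrete, here is a toy sketch of grouping a job’s hosts by pod and rack so that adjacent ranks communicate within the lower tiers of the network. The field names and data shapes are hypothetical; our internal scheduler and topology service work differently in their details.

```python
# Toy sketch of topology-aware placement: keep each rack's hosts contiguous in
# rank order so most collective traffic stays below the spine layer.
# Field names ("pod", "rack") are hypothetical, for illustration only.
from collections import defaultdict


def order_hosts_by_topology(hosts):
    """hosts: list of dicts like {"name": "host-17", "rack": "r04", "pod": "p1"}."""
    groups = defaultdict(list)
    for h in hosts:
        groups[(h["pod"], h["rack"])].append(h["name"])
    ordered = []
    for key in sorted(groups):  # walk pods, then racks, keeping each rack contiguous
        ordered.extend(sorted(groups[key]))
    return ordered
```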

In the figure, we see that small cluster performance (overall communication bandwidth and utilization) reaches 90%+ out of the box, but unoptimized large cluster performance has very poor utilization, ranging from 10% to 90%. After we optimize the full system (software, network, etc.), we see large cluster performance return to the ideal 90%+ range.

In addition to software changes targeting our internal infrastructure, we worked closely with teams authoring training frameworks and models to adapt to our evolving infrastructure. For example, NVIDIA H100 GPUs open up the possibility of leveraging new data types such as 8-bit floating point (FP8) for training. Fully utilizing bigger clusters required investments in additional parallelization techniques, and new storage solutions provided opportunities to highly optimize checkpointing across thousands of ranks to run in hundreds of milliseconds.
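As one example of what FP8 compute can look like at the framework level, the sketch below uses NVIDIA Transformer Engine’s fp8_autocast on an H100; the layer size and scaling recipe are illustrative assumptions rather than a production training configuration.

```python
# Hedged sketch of FP8 compute on H100 via NVIDIA Transformer Engine's fp8_autocast.
# Layer size and scaling recipe are illustrative, not a production training setup.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.HYBRID)

layer = te.Linear(4096, 4096, bias=True, params_dtype=torch.bfloat16)
x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)  # the GEMM runs in FP8 on H100 Tensor Cores

y.float().pow(2).mean().backward()  # gradients flow back through the FP8 layer
```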

We also recognize debuggability as one of the major challenges in large-scale training. Identifying a problematic GPU that is stalling an entire training job becomes very difficult at a large scale. We’re building tools such as desync debug, or a distributed collective flight recorder, to expose the details of distributed training and help identify issues in a much faster and easier way.
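For reference, recent PyTorch releases expose related debugging hooks through environment variables. The hedged sketch below shows how desync debugging and the NCCL flight recorder can be turned on for a job, with the caveat that the exact variable names and defaults depend on the PyTorch version you run.

```python
# Hedged sketch: enabling PyTorch's NCCL desync debugging and collective "flight
# recorder" before the process group is created. Variable names/values are
# illustrative; check the docs for the PyTorch release you are running.
import os

os.environ["TORCH_NCCL_DESYNC_DEBUG"] = "1"          # report ranks that fall out of step
os.environ["TORCH_NCCL_TRACE_BUFFER_SIZE"] = "2000"  # ring buffer of recent collectives
os.environ["TORCH_NCCL_DUMP_ON_TIMEOUT"] = "1"       # dump the recorded trace on a hang

import torch.distributed as dist

dist.init_process_group(backend="nccl")  # the NCCL watchdog picks these settings up at init
```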

Finally, we’re continuing to evolve PyTorch, the foundational AI framework powering our AI workloads, to make it ready for tens, or even hundreds, of thousands of GPUs in training. We have identified multiple bottlenecks in process group initialization, and reduced the startup time from sometimes hours down to minutes.

Commitment to open AI innovation

Meta maintains its commitment to open innovation in AI software and hardware. We believe open-source hardware and software will always be a valuable tool to help the industry solve problems at large scale.

Today, we continue to support open hardware innovation as a founding member of OCP, where we make designs like Grand Teton and Open Rack available to the OCP community. We also continue to be the largest and primary contributor to PyTorch, the AI software framework that is powering a large chunk of the industry.

We also continue to be committed to open innovation in the AI research community. We’ve launched the Open Innovation AI Research Community, a partnership program for academic researchers to deepen our understanding of how to responsibly develop and share AI technologies, with a particular focus on LLMs.

An open approach to AI is not new for Meta. We’ve also launched the AI Alliance, a group of leading organizations across the AI industry focused on accelerating responsible innovation in AI within an open community. Our AI efforts are built on a philosophy of open science and cross-collaboration. An open ecosystem brings transparency, scrutiny, and trust to AI development and leads to innovations that everyone can benefit from, built with safety and responsibility top of mind.

The future of Meta’s AI infrastructure

These two AI training cluster designs are part of our larger roadmap for the future of AI. By the end of 2024, we’re aiming to continue growing our infrastructure build-out, which will include 350,000 NVIDIA H100s as part of a portfolio that will feature compute power equivalent to nearly 600,000 H100s.

As we look to the future, we recognize that what worked yesterday or today may not be sufficient for tomorrow’s needs. That’s why we are constantly evaluating and improving every aspect of our infrastructure, from the physical and virtual layers to the software layer and beyond. Our goal is to create systems that are flexible and reliable enough to support fast-evolving new models and research.