SAN JOSE, CALIF.—Market leader Nvidia unveiled its new generation of GPU technology, designed to accelerate training and inference of generative AI. The new platform is called Blackwell, after game theorist David Harold Blackwell, and will replace the previous generation, Hopper.
“Clearly, AI has hit the point where every application in the industry can benefit by applying generative AI to improve how we make PowerPoints, write documents, understand our data and ask questions of it,” Ian Buck, VP and general manager of Nvidia’s hyperscale and HPC computing business, told EE Times. “It’s such an incredibly valuable tool that the world can’t build up infrastructure fast enough to meet the promise, and make it accessible, affordable and ubiquitous.”
The B200, two reticle-sized GPU dies on a new custom TSMC 4NP process node with 192 GB of HBM3e memory, will supersede the H100 as the state-of-the-art AI accelerator in the data center. The GB200, or “Grace Blackwell,” is the new Grace Hopper: the same Grace Arm-based CPU combined with two B200s. There is also a B100, a single-die version of Blackwell that will primarily be used to replace Hopper systems where the same form factor is required.
B200’s two CoWoS-mounted dies are connected by a 10-TB/s NV-HBI (high-bandwidth interconnect) link.
“That fabric is not just a network, the fabric of the GPU extends from every core and every memory, across the two dies, into every core, which means software sees one fully coherent GPU,” Buck said. “There’s no locality, no programming differences – there is just one big GPU.”
B200 will offer 2.5× the FLOPS of H100 at the same precision, but it also supports lower-precision formats, including FP6 and FP4. A second-generation version of the transformer engine reduces precision as far as possible during inference and training to maximize throughput.
Buck described how hardware support for dynamic scaling meant the first-generation transformer engine could dynamically adjust scale and bias while maintaining accuracy as far as possible, on a layer-by-layer basis. The transformer engine effectively “does the bookkeeping,” he said.
“For the next step [in the calculation], where do you need to move the tensor in that dynamic range to keep everything in range? If you fall out, you’re out of range,” he said. “We have to predict it…the transformer engine looks all the way back, a thousand [operations] back in history to project where it needs to dynamically move the tensor so that the forward calculation stays within range.”
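Nvidia has not published the transformer engine’s internal bookkeeping, but the behavior Buck describes, keeping a history of observed magnitudes and projecting a scale for the next operation, can be sketched roughly as follows. The window length, the FP8 E4M3 limit and the class name are illustrative assumptions, not Blackwell specifics.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

class DelayedScaler:
    """Rough sketch of history-based dynamic scaling for a low-precision tensor."""

    def __init__(self, history_len=1000):
        self.history_len = history_len  # "a thousand [operations] back in history"
        self.amax_history = []          # max |value| observed at each past step

    def record(self, tensor):
        self.amax_history.append(float(np.max(np.abs(tensor))))
        self.amax_history = self.amax_history[-self.history_len:]

    def scale(self):
        # Project a scale so the values expected at the next step stay in range.
        predicted_amax = max(self.amax_history) if self.amax_history else 1.0
        return FP8_E4M3_MAX / max(predicted_amax, 1e-12)

    def quantize(self, tensor):
        s = self.scale()
        q = np.clip(tensor * s, -FP8_E4M3_MAX, FP8_E4M3_MAX)  # hardware would cast to FP8 here
        return q, s  # the scale is kept so downstream math can undo it
```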
For the Blackwell generation, the transformer engine has been upgraded to enable micro-scaling not just at the tensor level, but for elements within the tensor. Groups of “tens of elements” can now have different scaling factors, with that level of granularity supported in hardware down to FP4.
“With Blackwell, I can have a separate range for every group of elements within the tensor, and that’s how I can go below FP8 down to 4-bit representation,” Buck said. “Blackwell has hardware to do this micro-scaling…so now the transformer engine is tracking every tensor in every layer, but also every group of elements within the tensor.”
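A minimal sketch of what per-group micro-scaling means numerically is below. The group size of 32 and the FP4 E2M1 maximum of 6.0 are assumptions for illustration (the article only says groups of “tens of elements”), and real hardware rounds to an FP4 grid rather than merely clipping.

```python
import numpy as np

FP4_MAX = 6.0     # maximum magnitude of an FP4 E2M1 value
GROUP_SIZE = 32   # elements sharing one scale factor (illustrative)

def microscale_quantize(tensor):
    """Give each small group of elements its own scale factor instead of
    one scale for the whole tensor. Assumes tensor.size % GROUP_SIZE == 0."""
    groups = tensor.reshape(-1, GROUP_SIZE)
    amax = np.abs(groups).max(axis=1, keepdims=True)
    scales = FP4_MAX / np.maximum(amax, 1e-12)                # one scale per group
    quantized = np.clip(groups * scales, -FP4_MAX, FP4_MAX)   # hardware also rounds to the FP4 grid
    return quantized, scales

def microscale_dequantize(quantized, scales, shape):
    return (quantized / scales).reshape(shape)
```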
Communication
With B200 boosting performance 2.5× over H100, where do Nvidia’s 25×-30× performance claims come from? The key is in communication for large generative AI models, Buck said.
While earlier generative AI models were a single monolithic transformer, today’s largest generative AI models use a technique called mixture of experts (MoE). With MoE, layers are composed of multiple mini-layers, or experts, each more focused on particular tasks. A router model decides which of these experts to use for any given MoE layer. Models like Gemini, Mixtral and Grok are built this way.
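The router is typically a small learned layer that scores the experts and sends each token to the best-scoring few. A bare-bones illustration, with the expert count and top-k chosen arbitrarily, might look like this:

```python
import numpy as np

NUM_EXPERTS = 8  # arbitrary for illustration; production models vary
TOP_K = 2        # experts consulted per token

def moe_layer(tokens, router_weights, experts):
    """tokens: (n, d) activations; router_weights: (d, NUM_EXPERTS);
    experts: list of NUM_EXPERTS feed-forward functions, each (d,) -> (d,)."""
    logits = tokens @ router_weights                       # score every expert for every token
    logits -= logits.max(axis=1, keepdims=True)            # numerical stability
    weights = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    chosen = np.argsort(logits, axis=1)[:, -TOP_K:]        # route each token to its top experts

    out = np.zeros_like(tokens)
    for i, token in enumerate(tokens):
        for e in chosen[i]:
            out[i] += weights[i, e] * experts[e](token)    # combine expert outputs by router weight
    return out
```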
The trouble is these models are so large that somewhere between four and 32 individual experts are likely being run on separate GPUs. Communication between them becomes the bottleneck; all-to-all and all-reduce operations are required to combine results from different experts. While large attention and feedforward layers in monolithic transformers are often split across multiple GPUs, the problem is particularly acute for MoE models.
Hopper has eight GPUs per NVLink (short-range, chip-to-chip communication) domain at 900 GB/s, but when moving from, say, eight to 16 experts, half the communication has to go over InfiniBand (used for server-to-server communication) at only 100 GB/s.
“So if your data center has Hoppers, the best you can do is half of your time is going to be spent on experts communicating, and when that’s happening, the GPUs are sitting idle—you’ve built a billion-dollar data center and at best, it’s only 50% utilized,” Buck said. “This is a problem for modern generative AI. It’s do-able—people do it—but it’s something we wanted to solve in the Blackwell generation.”
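A back-of-envelope model using only the bandwidth figures above shows why the slower link dominates once expert traffic leaves the NVLink domain. The traffic volume is a placeholder, and the 50/50 split follows the eight-to-16-expert example.

```python
# Bandwidth figures from the article (GB/s)
NVLINK_HOPPER = 900   # within an eight-GPU NVLink domain
INFINIBAND = 100      # between servers

traffic_gb = 1.0  # placeholder volume of expert-to-expert traffic per step

# With 16 experts on eight-GPU Hopper domains, roughly half the all-to-all
# traffic has to leave the NVLink domain and cross InfiniBand instead.
t_nvlink = (traffic_gb / 2) / NVLINK_HOPPER
t_ib = (traffic_gb / 2) / INFINIBAND

print(f"{t_ib / (t_nvlink + t_ib):.0%} of communication time is on the slow link")
# ~90%: the InfiniBand half of the traffic sets the pace, so the GPUs wait on it.
```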
For Blackwell, Nvidia doubled NVLink speeds to 1800 GB/s per GPU, and extended NVLink domains to 72 GPUs in the same rack. Nvidia’s NVL72 rack-scale system, also announced at GTC, has 36 Grace Blackwells, for a total of 72 B200 GPUs.
Nvidia also built a new switch chip, NVLink Switch, with 144 NVLink ports and a non-blocking switching capacity of 14.4 TB/s. There are 18 of these switches in the NVL72 rack, with an all-to-all network topology, meaning every GPU in the rack can talk to every other GPU in the rack at the full bidirectional bandwidth of 1800 GB/s, 18× what it would have been over InfiniBand.
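Taken at face value, the quoted figures are mutually consistent if each NVLink port runs at 100 GB/s and each GPU holds one link to each of the rack’s 18 switch chips; the per-port rate and the one-link-per-switch layout are inferences here, not details Nvidia stated in the interview.

```python
ports_per_switch = 144
per_port_gb_s = 100       # inferred: 14.4 TB/s divided by 144 ports
links_per_gpu = 18        # assumption: one link to each of the rack's 18 switch chips

print(ports_per_switch * per_port_gb_s)   # 14400 GB/s = 14.4 TB/s switching capacity
print(links_per_gpu * per_port_gb_s)      # 1800 GB/s per GPU, matching the NVLink figure
```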
“We crushed it,” Buck said.
The new switches can also do math. They support Nvidia’s scalable hierarchical aggregation and reduction protocol (Sharp) technology, which can perform certain kinds of simple arithmetic in the switch. This means the same data doesn’t have to be sent to different endpoints multiple times, which reduces the time spent communicating.
“If we need to add tensors or something like that, we don’t even have to bother the GPUs any more, we can do that in the network, giving it an effective bandwidth for all-reduce operations of 3600 GB/s,” Buck said. “That’s how we get to 30 times faster.”
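The rough reason in-network reduction doubles effective all-reduce bandwidth: without it, each GPU’s links carry the data roughly twice (once to combine partial sums, once to broadcast the result), while with the switch doing the addition each GPU sends its partial once and receives the result once. A toy model using the article’s 1800-GB/s figure; the protocol details are simplified assumptions, not Nvidia’s actual implementation.

```python
NVLINK_BLACKWELL = 1800  # GB/s per GPU, from the article

def allreduce_gb_moved_per_gpu(tensor_gb, in_network_reduction):
    # Without in-switch math (e.g. a ring all-reduce), each GPU's links carry the
    # data roughly twice: once to combine partial sums, once to broadcast the result.
    # With the switch doing the addition, each GPU sends its partial once and
    # receives the reduced tensor once.
    return tensor_gb if in_network_reduction else 2 * tensor_gb

tensor_gb = 1.0
time_with_sharp = allreduce_gb_moved_per_gpu(tensor_gb, True) / NVLINK_BLACKWELL
effective_bw = 2 * tensor_gb / time_with_sharp
print(effective_bw)  # 3600 GB/s: halving the traffic doubles the effective all-reduce bandwidth
```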
B200 GPUs can run in a 1000-W power envelope with air cooling, but with liquid cooling, they can run at 1200 W. The jump to liquid cooling was not necessarily about wanting to boost the power supply to each GPU, Buck said.
“The reason for liquid cooling is for the NVL72, we wanted to build a bigger NVLink domain,” he said. “We couldn’t build a bigger PCB, so we built it rack scale. We could do that with multiple racks, but to do fast signaling, we’d have to go to optics…that would be a lot of transceivers. It would need another 20 kW of power, and it would be six times more expensive to do that versus copper, which is a direct connection to the GPU SerDes.”
Copper’s reach is shorter than that of optics, limited to around a meter, so the GPUs have to be close together in the same rack.
“Within the rack, the two compute trays are sandwiched between the switches; it wouldn’t work if you did a top-of-rack NVLink switch, because the distance from the bottom to the top of the rack wouldn’t be able to run with 1800 GB/s or 200 Gb/s SerDes—it’s too far,” Buck said. “We move the NVSwitch to the middle, we can do everything in 200 Gb/s SerDes, all in copper, six times lower cost for 72 GPUs. That’s why liquid cooling is so important—we have to do everything within a meter.”
Trillion-parameter models can now be deployed on a single rack, reducing overall cost. Buck said that versus the same performance with Hopper GPUs, Grace Blackwell can do it with 25× less power and at 25× lower cost.
“What that means is that trillion-parameter generative AI will be everywhere—it’ll democratize AI,” he said. “Every company will have access to that level of AI interactivity, capability, creativity…I’m super excited.”