The yield-killing design flaw in Nvidia's Blackwell GPU was fixed months ago, and a refined version of the B100/B200 processors is about to enter mass production. Jensen Huang, Nvidia's CEO, admitted this week that the flaw was entirely Nvidia's own doing and said that the company's production partner TSMC helped fix it in a timely manner, according to Reuters.
"We had a design flaw in Blackwell, it was functional, but the design flaw caused the yield to be low," Huang said. "It was 100% Nvidia's fault."
When the first reports about the design flaw emerged, some media outlets reported that TSMC was to blame and suggested that this might be causing strain between Nvidia and its foundry partner. According to Huang, this was not the case: Nvidia's own miscalculations caused the problem. Huang also dismissed reports of tensions between the two companies as "fake news."
Nvidia's Blackwell B100 and B200 GPUs link their two chiplets using TSMC's CoWoS-L packaging technology, which relies on a redistribution layer (RDL) interposer equipped with local silicon interconnect (LSI) bridges to enable data transfer rates of about 10 TB/s. The placement of these bridges is critical. However, a reported mismatch in the thermal expansion properties among the GPU chiplets, LSI bridges, RDL interposer, and motherboard substrate caused the system to warp and fail, and Nvidia reportedly had to modify the top metal layers and bumps of the GPU silicon to improve production yields. While the company did not disclose specific details about the fix, it did mention that new masks were required.
Yield-killing problems and major functionality issues (errata) are not unheard of in the semiconductor world. Typically, companies fix them by modifying a metal layer (or two) and calling it a new stepping. Case in point: Intel's Sapphire Rapids reportedly had 500 bugs, and the company released around a dozen steppings to fix them all (five were base respins). Every new stepping takes around three months to complete (including identifying the problem, fixing it, and producing a new version of the chip), so the speed at which Nvidia and TSMC fixed the Blackwell GPU is pretty impressive.
The now-fixed Blackwell GPUs for AI and supercomputers will enter mass production in late October and should start shipping early next year (which will still fall within Nvidia's fiscal year 2025, as the company's fiscal year ends in late January).
That said, Nvidia disclosed earlier this year that, to meet demand for its Blackwell GPUs among major cloud service providers such as AWS, Google, and Microsoft, it would still have to ship some of the initial low-yield Blackwell processors in 2024. It remains unclear how many Blackwell GPUs will be shipped to data centers in 2024.