Update 4/4/2024 6:15am PT: We have been notified that the refined and 'clarified' U.S. sanctions do not impact Nvidia's existing sanctions-compliant GPUs for China, specifically the H20 and RTX 4090D.
The new document includes "Corrections and Clarifications" on the export controls, and some of the language was confusing and misinterpreted, by us and other sites. Specifically, the document details "adjusted peak performance" (APP) and "weighted teraflops" (WT), with a limit of 70 TFLOPS or less. We have received additional information from Nvidia on the restrictions and clarifications, and the short summary is that the sanctions-compliant H20 and 4090D GPUs are not impacted.
The reasons that the 4090D isn't affected have to do with the definitions. First, the guidelines are for computer systems, not individual GPUs, and more specifically they are for systems with memory coherence. As an example, a 4-way DGX H100 system would fall under this classification.
An email from Nvidia states: "Processor combinations share memory when any processor is capable of accessing any memory location in the system through the hardware transmission of cache lines or memory words, without the involvement of any software mechanism, which may be achieved using “electronic assemblies” specified in 4A003.c, z.1, or z.3."
The other important detail is that the "adjusted peak performance" applies to FP64 throughput, and it's "weighted" because the value gets scaled based on whether it's a vector processor or a scalar (non-vector) processor. In other words, FP64 done via vector units like Nvidia Tensor cores is different from FP64 done via a CPU running 64-bit calculations. (That's a simplification, as CPUs can also include vector units.)
To determine the "weighted teraflops" and "adjusted peak performance," take the aggregate FP64 throughput of the system. Then multiply by 0.9 for vector processors or by 0.3 for non-vector processors. So going back to the 4-way DGX H100 as an example, the H100 SXM variant of the GPU has 67 teraflops of vector FP64 throughput. Four of them in aggregate would deliver 268 teraflops, and multiplied by 0.9 gives 241.2 — well above the 70 weighted teraflops limit. And of course, the HGX H100 would have already been restricted even prior to the more recent updates.
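To make that arithmetic concrete, here is a minimal Python sketch of the weighting as we understand it from the description above. The function and constant names are our own illustrative choices, not anything taken from the official document, and the sketch ignores the finer points of the regulation (such as what qualifies as a memory-coherent system).

```python
# Weighting factors and limit as described in the text above.
VECTOR_WEIGHT = 0.9       # applied to vector processors
NON_VECTOR_WEIGHT = 0.3   # applied to non-vector (scalar) processors
WT_LIMIT = 70.0           # weighted teraflops limit


def weighted_teraflops(fp64_tflops_each, processor_count, is_vector):
    """Aggregate FP64 throughput of a system, scaled by the weighting factor.

    Hypothetical helper: a simplified reading of the rule, not an
    official formula.
    """
    weight = VECTOR_WEIGHT if is_vector else NON_VECTOR_WEIGHT
    aggregate = fp64_tflops_each * processor_count
    return aggregate * weight


# The 4-way H100 SXM example: 67 TFLOPS of vector FP64 per GPU.
h100_system_wt = weighted_teraflops(67.0, 4, is_vector=True)
print(round(h100_system_wt, 1))        # 241.2
print(h100_system_wt > WT_LIMIT)       # True
```

The per-GPU figure and the processor count are the only inputs, so swapping in a different system is a one-line change.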
So, what has actually changed? Succinctly, not much. These are not new export controls or restrictions but rather an addendum to attempt to clarify the official "speed limits." The RTX 4090D for its part hardly offers any FP64 throughput, only 1.15 TFLOPS, though it still comes close to the 4,800 TPP limit.
Original unedited article (which misinterpreted the 'clarifications' described above):
The United States government has revised its Chinese semiconductor export restrictions to encompass more high-performance hardware. Specifically, any semiconductor chip offering over 70 "Weighted TeraFLOPS" of performance is now banned from export to China without a license. This lowered limit now includes Nvidia's Chinese-exclusive RTX 4090D "Dragon" graphics card.
The RTX 4090D was made specifically to comply with the U.S. China export bans several months back. The RTX 4090 exceeded the 4,800 Total Processing Power (TPP) limit by 10%, and so Nvidia created the 4090D to come in below that limit (it lands at 4,707 TPP). Amazingly, the new 70 TFLOPS limit is only 5% lower than the RTX 4090D's 73.5 TFLOPS performance figure.
While this change was seemingly inevitable, we have to question whether it's even meaningful. After the launch of the RTX 4090D, the U.S. government warned Nvidia that its tactics wouldn't go unnoticed, and it has now moved to ban Nvidia's China-exclusive GPU. But does a 5% reduction in the GPU 'speed limit' even matter, and if so, what happens when Nvidia makes a new GPU that comes in below that limit?
The RTX 4090D is a cut-down variant of the RTX 4090, featuring 14,592 CUDA cores and a 425W TBP. Compared to the RTX 4090, with its 16,384 CUDA cores and 450W TBP, the RTX 4090D has 10.9% fewer CUDA cores and a 5.6% lower TBP. All other core specifications remain the same between the two. The only exception is the base clock, which has been brought up slightly to 2.28 GHz from 2.23 GHz.
According to other websites that have tested the card, the RTX 4090D is roughly 10% slower than the RTX 4090 in AI workloads and only 5% slower in gaming. Ironically, Nvidia never fully "locked" the RTX 4090D, enabling Chinese gamers and professionals to overclock the RTX 4090D to RTX 4090 performance levels.
The RTX 4090D was expressly designed for the purpose of complying with America's China export restrictions. These laws were put in place to prevent China and non-NATO countries from acquiring too much computing power — particularly AI processing power — for security reasons. These sanctions have been repeatedly changed over the past few years, first targeting data center chips like the Nvidia A100 and Nvidia H100, but later the RTX 4090 fell victim to the restrictions as it was 10% "too fast."
The current metric used to calculate the maximum allowed performance is known as TPP or Total Processing Power. It is calculated by taking the maximum compute throughput for a given bit-depth, in TFLOPS (or TOPS for integer work), and multiplying it by the number of bits. For the RTX 4090, TPP is 660.8 * 8 = 5,286 for FP8 work running on the Tensor cores (sparsity doesn't count).
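As a quick illustration of that calculation, here is a short Python sketch using the figures quoted in this article. The function name and the constant are our own, and this is a simplified reading of the metric rather than the official definition.

```python
TPP_LIMIT = 4800  # Total Processing Power limit cited in this article


def total_processing_power(throughput, bit_width):
    """Throughput in TFLOPS (or TOPS for integer work) times bits per operand.

    Hypothetical helper for illustration only.
    """
    return throughput * bit_width


# RTX 4090: 660.8 TFLOPS of dense FP8 on the Tensor cores (no sparsity).
rtx_4090_tpp = total_processing_power(660.8, 8)
print(round(rtx_4090_tpp, 1))                    # 5286.4
print(round(rtx_4090_tpp / TPP_LIMIT, 2))        # 1.1, i.e. about 10% over

# The RTX 4090D's quoted figure of 4,707 TPP sits just under the limit.
print(4707 < TPP_LIMIT)                          # True
```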
The new regulations apparently change the defined limit to include "Weighted TeraFLOPS" but neglect to clearly define what that means. Based on the language, however, we assume this refers to FP32 TFLOPS. The RTX 4090 for reference offers 82.6 TFLOPS of compute, while the RTX 4090D drops that to 73.5 TFLOPS, and the next step down for Nvidia's consumer GPUs is the RTX 4080 Super at 'only' 52.2 TFLOPS. Note also that these repeatedly lowered limits are starting to encroach on AMD's RX 7900 XTX, which offers 61.4 TFLOPS of compute.
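Under that assumed FP32 reading (which the update at the top of this article corrects), the comparison is simple enough to sketch in a few lines of Python. The dictionary and variable names are ours, and the throughput figures are the ones quoted above.

```python
ASSUMED_FP32_LIMIT = 70.0  # TFLOPS, under the original article's assumption

# FP32 throughput figures quoted in this article, in TFLOPS.
fp32_tflops = {
    "RTX 4090": 82.6,
    "RTX 4090D": 73.5,
    "RX 7900 XTX": 61.4,
    "RTX 4080 Super": 52.2,
}

# Cards that would exceed the limit if it applied to FP32 throughput.
over_limit = [name for name, tflops in fp32_tflops.items()
              if tflops > ASSUMED_FP32_LIMIT]
print(over_limit)  # ['RTX 4090', 'RTX 4090D']

# How far the assumed limit sits below the RTX 4090D, as a percentage.
gap_pct = (fp32_tflops["RTX 4090D"] - ASSUMED_FP32_LIMIT) / fp32_tflops["RTX 4090D"] * 100
print(round(gap_pct, 1))  # 4.8
```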
While the government doesn't specifically name the RTX 4090D as the reason for the new restrictions, it's a safe bet that the card will be discontinued in the near future. Nvidia might keep the 4090D around and rebrand it for a Western audience... or it might just come out with a new GPU that once again complies with the sanctions shenanigans. Let's call it the RTX 4090 DD "Double Dragon" and give it 108 Streaming Multiprocessors and 13,824 CUDA cores, and perhaps a 400W TGP — and most importantly, 69.7 TFLOPS of compute. Then it would once again become compliant, shift tens of thousands of units into China, and probably result in yet another cut to the allowed export performance.
The new restrictions will reportedly go into effect on April 4, 2024. Nvidia has not announced any response to the lower limits, though of course it will comply with them. But fundamentally, if 80 TFLOPS was too fast, and now 70 TFLOPS is also too fast, at some point the U.S. needs to set a hard limit and stick with it — or it will end up 'banning' GPUs that have long since been discontinued. It's doubtful the restrictions are even fully effective, as there are still plenty of Chinese customers hungry for GPUs, and the hardware likely continues to flow into the country through indirect means.