Demand for more computing power in the data center is growing at a staggering pace, and AMD has revealed that it has had serious inquiries to build single AI clusters packing a whopping 1.2 million GPUs or more.
AMD's admission comes from a lengthy discussion The Next Platform had with Forrest Norrod, AMD's EVP and GM of the Datacenter Solutions Group, about the future of AMD in the data center. One of the most eye-opening responses was about the biggest AI training cluster that someone is seriously considering.
When asked if the company has fielded inquiries for clusters as large as 1.2 million GPUs, Norrod replied that the assessment was virtually spot on.
Morgan: What’s the biggest AI training cluster that somebody is serious about – you don’t have to name names. Has somebody come to you and said with MI500, I need 1.2 million GPUs or whatever.
Forrest Norrod: It’s in that range? Yes.
Morgan: You can’t just say “it’s in that range.” What’s the biggest actual number?
Forrest Norrod: I am dead serious, it is in that range.
Morgan: For one machine.
Forrest Norrod: Yes, I’m talking about one machine.
Morgan: It boggles the mind a little bit, you know?
1.2 million GPUs is an absurd number (mind-boggling, as the exchange above puts it). AI training clusters today are typically built with a few thousand GPUs connected via a high-speed interconnect across a handful of server racks. By contrast, an AI cluster with 1.2 million GPUs seems virtually impossible.
We can only imagine the pitfalls someone would need to overcome to build an AI cluster with over a million GPUs, but latency, power, and the inevitability of hardware failures are a few factors that immediately come to mind.
AI workloads are extremely sensitive to latency, particularly tail latency and outliers, where certain data transfers take much longer than others and stall the entire job. Additionally, today's supercomputers already have to mitigate GPU and other hardware failures that, at their scale, occur every few hours. Those issues would become far more pronounced when scaling to roughly 30X the size of today's largest known clusters. And that's before we even touch on the nuclear power plant-sized power delivery such an audacious goal would require.
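To put rough numbers on that, here is a quick back-of-the-envelope sketch in Python. The per-GPU power draw and mean time between failures used below are illustrative assumptions for the sake of the math, not figures from AMD or the interview:

```python
# Back-of-the-envelope math for a hypothetical 1.2 million GPU cluster.
# The per-GPU figures below are illustrative assumptions, not vendor data.

NUM_GPUS = 1_200_000

# Assume roughly 1 kW per accelerator once host, networking, and cooling
# overhead are folded in (modern data center GPUs alone draw 500-1,000 W).
WATTS_PER_GPU = 1_000
total_power_mw = NUM_GPUS * WATTS_PER_GPU / 1e6
print(f"Estimated power draw: {total_power_mw:,.0f} MW")  # ~1,200 MW, i.e. gigawatt scale

# Assume a mean time between failures of 50,000 hours (~5.7 years) per GPU.
# At cluster scale, the expected gap between *some* GPU failing collapses.
MTBF_HOURS_PER_GPU = 50_000
cluster_failure_interval_min = MTBF_HOURS_PER_GPU / NUM_GPUS * 60
print(f"Expected time between GPU failures: {cluster_failure_interval_min:.1f} minutes")  # ~2.5 minutes
```

Even under fairly generous assumptions, a 1.2 million GPU machine would see some GPU fail every few minutes and draw on the order of a gigawatt, roughly the output of a single nuclear reactor.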
Even the most powerful supercomputers in the world don't scale to millions of GPUs. For instance, the fastest operational supercomputer right now, Frontier, "only" has 37,888 GPUs.
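For context, the ratio between the figure Norrod describes and Frontier's publicly reported GPU count works out to roughly 32x, consistent with the "roughly 30X" framing above; a trivial sanity check:

```python
# Ratio of the rumored 1.2 million GPU cluster to Frontier's reported GPU count.
RUMORED_CLUSTER_GPUS = 1_200_000
FRONTIER_GPUS = 37_888

print(f"Scale factor vs. Frontier: {RUMORED_CLUSTER_GPUS / FRONTIER_GPUS:.1f}x")  # ~31.7x
```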
The goal of million-GPU clusters speaks to the seriousness of the AI race shaping the 2020s. If such a machine is within the realm of possibility, someone will try to build it for the sake of greater AI processing power. Norrod didn't say which organization is considering a system of this scale, but he did mention that "very sober people" are contemplating spending tens to hundreds of billions of dollars on AI training clusters, which is why million-GPU clusters are being considered at all.