One of the main challenges in building high-end AI data centers is connecting servers so that tens of thousands of GPUs work in concert without problems, which makes network interconnects as important as the GPUs themselves. For its Colossus supercomputer, which now has 100,000 of Nvidia's Hopper processors and will expand to 200,000 H100 and H200 GPUs in the coming months, xAI chose Nvidia's Spectrum-X Ethernet.
Nvidia's Spectrum-X platform includes the Spectrum SN5600 Ethernet switch, which is built on the Spectrum-4 switch ASIC and supports port speeds of up to 800 Gb/s. The platform works with Nvidia's BlueField-3 SuperNICs to move the massive data flows required for AI training quickly and efficiently. With Spectrum-X, Colossus sustains 95% data throughput and virtually eliminates network latency issues and packet loss, allowing it to operate smoothly at an unprecedented scale.
Nvidia says that traditional Ethernet would struggle at this scale, often suffering from heavy congestion and low data throughput. By contrast, Spectrum-X's adaptive routing, congestion control, and performance isolation technologies address these issues and keep the environment stable and fast.
"AI is becoming mission-critical and requires increased performance, security, scalability and cost-efficiency," said Gilad Shainer, senior vice president of networking at Nvidia. "The Nvidia Spectrum-X Ethernet networking platform is designed to provide innovators such as xAI with faster processing, analysis and execution of AI workloads, and in turn accelerates the development, deployment and time to market of AI solutions."
Even with 100,000 Hopper GPUs, xAI's Colossus is already one of the world's most powerful supercomputers for AI training. It was nevertheless built in just 122 days, a sharp contrast with typical timelines for systems of this size, which often run to many months or even years. That speed extended to bring-up as well: training began 19 days after the first hardware was delivered and installed.
It remains to be seen how long xAI will take to install another 100,000 Hopper GPUs, but it is safe to say that this will be the world's most powerful AI supercomputer for a while, at least until Microsoft and Oracle deploy their Blackwell-based systems.
"Colossus is the most powerful training system in the world," said Elon Musk on X. "Nice work by xAI team, NVIDIA and our many partners/suppliers."