Elon Musk's xAI has brought the world's most powerful AI training system online. The Colossus supercomputer uses as many as 100,000 Nvidia H100 GPUs for training and is set to double in size with another 100,000 GPUs, a mix of Nvidia H100 and H200, in the coming months.
"This weekend, the xAI team brought our Colossus 100K H100 training cluster online," Elon Musk wrote in an X post. "From start to finish, it was done in 122 days. Colossus is the most powerful AI training system in the world. Moreover, it will double in size to 200K (50K H200s) in a few months."
According to Michael Dell, CEO of Dell Technologies, his company developed and assembled the Colossus system quickly, a reflection of the considerable experience the server maker has accumulated deploying AI servers during the last few years' AI boom.
Elon Musk and his companies have been busy making supercomputer-related announcements recently. In late August, Tesla announced its Cortex AI cluster featuring 50,000 Nvidia H100 GPUs and 20,000 of Tesla's wafer-sized Dojo AI chips. Even before that, in late July, xAI kicked off AI training on the Memphis Supercluster, comprising 100,000 liquid-cooled H100 GPUs. Such a supercomputer likely needs at least 150 MW of power: the 100,000 H100 GPUs alone draw around 70 MW (roughly 700 W each), before counting host CPUs, networking, storage, and cooling.
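As a rough sanity check on those figures, here is a minimal back-of-the-envelope sketch; the 700 W per-GPU draw and the roughly 2x facility overhead factor are illustrative assumptions, not numbers confirmed by xAI.

```python
# Back-of-the-envelope power estimate for a 100,000-GPU H100 cluster.
# Assumptions (illustrative only): 700 W board power per H100 SXM GPU,
# and ~2.1x overhead for host CPUs, networking, storage, and cooling.

GPU_COUNT = 100_000
GPU_TDP_W = 700            # assumed per-GPU board power
OVERHEAD_FACTOR = 2.1      # assumed facility overhead multiplier

gpu_power_mw = GPU_COUNT * GPU_TDP_W / 1_000_000
facility_power_mw = gpu_power_mw * OVERHEAD_FACTOR

print(f"GPUs alone:     ~{gpu_power_mw:.0f} MW")       # ~70 MW
print(f"Whole facility: ~{facility_power_mw:.0f} MW")  # ~147 MW
```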
Although all of these clusters are formally operational and even training AI models, it is unclear how many GPUs are actually online today. First, it takes time to debug and optimize the settings of such superclusters. Second, xAI needs to ensure the clusters get enough power: while Elon Musk's company has been using 14 diesel generators to power its Memphis supercomputer, those were still not enough to feed all 100,000 H100 GPUs.
xAI's training of the Grok version 2 large language model (LLM) required up to 20,000 Nvidia H100 GPUs, and Musk has predicted that future versions, such as Grok 3, will need even more, potentially around 100,000 Nvidia H100 GPUs. To that end, xAI needs its vast data centers to train Grok 3 and then run inference on the model.