We recently got a glimpse of what $1 billion worth of AI GPUs looks like when Elon Musk shared a brief video tour of Cortex, Tesla's AI training supercomputer currently under construction at the company’s Giga Texas plant.
More recently, Musk took to his social media platform to announce that Colossus, a new training cluster of 100,000 Nvidia H100 GPUs, is now up and running.
Musk claims that Colossus is "the most powerful AI training system in the world" and that it was built "from start to finish" in just 122 days. That's quite an achievement. Servers for the xAI cluster were reportedly provided by Dell and Supermicro, with the cost of the project estimated at between $3 billion and $4 billion.
"This weekend, the @xAI team brought our Colossus 100k H100 training cluster online. From start to finish, it was done in 122 days. Colossus is the most powerful AI training system in the world. Moreover, it will double in size to 200k (50k H200s) in a few months. Excellent…" (Elon Musk, September 2, 2024)
Where does Colossus get its name?
Tom’s Hardware notes, “Although all of these clusters are formally operational and even training AI models, it is entirely unclear how many are actually online today. First, it takes some time to debug and optimize the settings of those superclusters. Second, X needs to ensure that they get enough power, and while Elon Musk's company has been using 14 diesel generators to power its Memphis supercomputer, they were still not enough to feed all 100,000 H100 GPUs.”
The Colossus system is poised to eventually double in capacity, with plans to incorporate an additional 100,000 GPUs: 50,000 H100 units and 50,000 of Nvidia's next-gen H200 chips. The supercluster will primarily be used to train xAI's Grok-3, the company's latest, most advanced AI model. We've yet to see any mention of storage for the new system, but it will need to be huge.
The naming of the new supercomputer has raised more than a few eyebrows, however, with people noting that it shares its name with a 1970 sci-fi movie (based on a 1966 novel by D.F. Jones) about a supercomputer that becomes sentient after being given control of the US nuclear arsenal. Things, predictably, go horribly wrong for humanity.
Both the novel and film explore timely themes of AI autonomy, the dangers of relinquishing control to machines, and the ethical implications of artificial intelligence. It’s possible that Musk wasn’t aware of this when the name was chosen for his new AI training system, and it might have been selected purely to emphasize the sheer scale of the supercluster. Then again, given Musk's track record, it wouldn’t be surprising if the reference was entirely intentional and he knows exactly what he’s doing.