There is a common misconception that one GPU cloud is very much like another. But that is simply not the case. They are built with different technologies and architectures, and they come with their own sets of challenges, advantages, and disadvantages.
The most advanced AI cloud operators today are developing new blueprints for GPU data centers that deploy NVIDIA H100s in Kubernetes or other virtualized environments to reach new levels of performance for AI processing.
To the customer, the specs look basically the same. AI cloud computing service providers boast about NVIDIA HGX H100 arrays and 3.2 terabits per second of InfiniBand bandwidth, but that is because they are all using the same network cards. If every cloud looks the same from a technical standpoint, customers will make decisions based on price.
But technical specifications alone do not tell the whole story. You can buy a Toyota Corolla with 100 kilowatts of power, or a Mercedes with 100 kilowatts of power, but they are not the same thing. The build quality is different, the cost is different, and the user experience is different.
It is the same with data centers. If the head of finance oversaw the architecture, we would likely end up with the Toyota Corolla of data centers, and that is fine for some, but given the choice, most organizations are going to pick the Mercedes. A data center built with cost savings at its heart might work for some customers, but it will be slower, offer less cloud storage, and may even be less secure.
GPU clouds
The construction of GPU clouds varies enormously between data centers. A common misconception is that AI infrastructure can simply be built to the NVIDIA DGX reference architecture. But that is the easy part; it is the minimum viable baseline. How far organizations go beyond it is the differentiator. AI cloud providers are building highly differentiated solutions through the application of management and storage networks that can dramatically accelerate the productivity of AI computing.
The deployment of GPU data centers as the infrastructure for AI is a complex and challenging task that requires a deep understanding of how to balance technologies to maximize throughput. High-quality management and security systems have an obvious impact on the customer experience.
Another major factor governing the performance of AI clouds is the storage architecture. Using dynamically assigned WEKA architectures, NVMe (non-volatile memory express) disks and GPUDirect Storage can improve speed of execution by up to 100% for certain workloads, such as training the Large Language Models (LLMs) used in machine learning.
WEKA’s data platform offers unmatched performance and scalability, particularly in feeding data to large-scale GPU environments. By transforming stagnant data silos into dynamic data pipelines, it effortlessly powers data-starved GPUs, enabling them to operate with up to 20 times greater efficiency and sustainability.
Storage access
How fast you can access storage is critical in AI because you are dealing with very large data sets made up of many small chunks of data. You could be looking at 100 billion pieces of data spread across a network. Compare that with digital media, where you are working with a few thousand assets at most, even if each one could be hundreds of gigabytes; it is a very different profile. Traditional hard disks provide good speeds for digital media, whereas an AI workload is highly random by comparison: you need to take one gigabyte here and one gigabyte there and do that millions of times a second, as the rough arithmetic below illustrates.
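As a back-of-the-envelope sketch (the bandwidth and request sizes here are assumptions chosen for the arithmetic, not measurements), sustaining the same bandwidth with small random reads demands orders of magnitude more I/O operations per second than streaming a handful of large media assets:

```python
# Illustrative arithmetic only: how many I/O operations per second (IOPS)
# are needed to sustain a given bandwidth at different request sizes.
def required_iops(bandwidth_gb_per_s: float, request_size_bytes: int) -> float:
    return bandwidth_gb_per_s * 1e9 / request_size_bytes

# Digital media profile: a few large sequential reads (assume 64 MB requests).
print(round(required_iops(10, 64 * 1024**2)))   # ~149 IOPS to sustain 10 GB/s

# AI training profile: many small random reads (assume 128 KB requests).
print(round(required_iops(10, 128 * 1024)))     # ~76,294 IOPS for the same 10 GB/s
```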
Another important difference between AI architecture and traditional storage models is the absence of a requirement to cache data. Everything is done by direct request. The GPUs talk directly to the disks across the network; they do not go through the CPUs or the TCP/IP stack. The GPUs are directly connected to the network fabric, bypassing most of the network layers and going straight to the storage. It removes network lag.
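As a sketch of what "GPUs talking directly to the disks" can look like in code, the snippet below reads a file straight into GPU memory using NVIDIA's GPUDirect Storage via the KvikIO Python bindings. The file path and buffer size are hypothetical, and the exact KvikIO API may vary between releases, so treat this as an illustration of the pattern rather than a recipe:

```python
import cupy
import kvikio  # Python bindings for NVIDIA cuFile / GPUDirect Storage

# Allocate the destination buffer directly in GPU memory.
buf = cupy.empty(256 * 1024 * 1024, dtype=cupy.uint8)  # 256 MB; size is illustrative

# Read straight from NVMe into GPU memory: the DMA path bypasses the CPU
# bounce buffer rather than staging data through host RAM.
with kvikio.CuFile("/data/shard-000.bin", "r") as f:  # hypothetical path
    nbytes = f.read(buf)

print(f"read {nbytes} bytes directly into GPU memory")
```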
AI infrastructure architecture
AI infrastructure architecture needs to be designed to maximize compute power for the coming wave of AI workloads. Furthermore, network architectures should be designed to be completely uncontended. Plenty of organizations will promise that, but you need a provider that has overprovisioned to deliver that level of guarantee.
Major AI users such as Tesla and Meta are designing cloud infrastructure to meet the needs of different applications where the AI cloud architectures can be dynamically optimized for specific workloads. But most cloud providers do not have the luxury of knowing exactly what they’re building for.
Returning to the automotive analogy, most modern transport networks in major cities around the world were not built with today's volumes of traffic in mind. Indeed, the issue with building a data center around a current or even projected target is that data centers reach capacity sooner than you think. Clouds need to be both overprovisioned and extremely scalable.
If you don’t know exactly what you’re building for, you just need to build the biggest, fastest, most secure and easiest-to-use platform possible. To optimize throughput, data centers need a highly distributed storage architecture, with hundreds of disks generating tens of millions of input/output operations per second (IOPS) across your servers.
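To put "tens of millions of IOPS" in context, here is a minimal sketch of how a distributed pool aggregates; the disk count, per-disk IOPS and utilization figure are assumptions for illustration only, not a specification of any particular cloud:

```python
# Illustrative only: aggregate random-read capability of a distributed NVMe pool.
disks = 400                 # assumed number of NVMe drives in the pool
iops_per_disk = 500_000     # assumed 4K random-read IOPS per enterprise NVMe drive
utilisation = 0.5           # assume half the theoretical ceiling is achievable in practice

aggregate_iops = disks * iops_per_disk * utilisation
print(f"~{aggregate_iops / 1e6:.0f} million IOPS")  # ~100 million IOPS
```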
Supporting infrastructure
GPU clouds are also reliant on supporting infrastructure. For instance, if you're running Kubernetes, you need control-plane (master) nodes, coordination nodes, nodes to ingress the data, and nodes simply to log into so that you can run dashboards. The cloud provider needs to provision very substantial amounts of non-GPU compute in the same region, as the sketch below suggests.
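As a small illustration of that supporting layer, the sketch below uses the official Kubernetes Python client to count nodes by role; it assumes the cluster follows the common node-role.kubernetes.io labelling convention, which a given environment may not:

```python
from collections import Counter
from kubernetes import client, config  # official Kubernetes Python client

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster
v1 = client.CoreV1Api()

roles = Counter()
for node in v1.list_node().items:
    labels = node.metadata.labels or {}
    node_roles = [key.split("/", 1)[1] for key in labels
                  if key.startswith("node-role.kubernetes.io/")]
    roles.update(node_roles or ["worker"])

# Control-plane, ingress, login/dashboard and GPU worker pools all show up here.
print(dict(roles))
```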
Building genuine clouds is neither easy nor cheap. A lot of data center providers call themselves ‘cloud’, but what they really offer is more like a managed hardware environment. It is certainly less risky from a financial perspective to sign organizations up to multi-year contracts and then build out a facility that meets the demands of the contract, and there are some benefits, particularly around security and performance. But it is not cloud.
Cloud is self-service and API-driven: you log in, click a button, and you have access to the processing power you need for as long as you need it. There are a lot of organizations out there that don’t have the resources or the requirement for ongoing data center support; they might only need the processing power for a short time, and cloud gives them that option. NexGen Cloud is democratizing AI by opening access to shared high-performance architectures.
A final consideration, and one that is becoming increasingly important, is energy consumption. Organizations of all sizes are being asked not only to monitor but also to reduce their emissions, not just by customers and society at large, but also from a regulatory perspective. Google and Microsoft recently announced an agreement with Nucor for a clean energy initiative to power data centers and ultimately reach Net Zero for AI processing. ESG performance is also proving to be a critical metric in terms of shareholder value, and AI is incredibly power-hungry.
Ultimately, organizations should partner with a provider they can rely on: a partner that can offer guidance and provide engineering and support. Businesses using cloud infrastructure are doing so to concentrate on their own core differentiators. They are not in the business of running cloud AI infrastructure; they want convenience, security, and reliability, and genuine cloud provides all of that on-demand.