TechRadar
James Capell

"It's trust, execution, and quality" — AMD head outlines the challenges for data centers, and how it offers the ideal end to end solution

(Image: AMD EPYC Turin 3D model)

AMD’s recent Advancing AI event saw the release of new 5th generation EPYC processors, Instinct MI325X accelerators, and new networking technology, among a range of announcements.

These new products aim to ease the issues facing ageing data centers, including managing dynamic workloads, energy efficiency, and space constraints.

I got the chance to sit down with Dan McNamara, AMD's Senior Vice President and General Manager of the Server Business Unit, to talk about what the future holds for data centers and HPC, why AMD has been so successful at expanding its market share in data centers, and how networking technology could be the breakthrough data centers and HPC systems need to push the advancement of AI even further.

Congratulations on a 34% market share by revenue. What do you attribute this success to?

When you re-enter a market like we did in 2017, it’s about three things: trust, execution, and quality. Customers really need to know you’re going to get them products. Then it’s about perf per watt per dollar. That’s what we’ve driven our roadmap on, 100%.

When we’re looking at a new program or new product, the first thing (Dr) Lisa (Su, AMD CEO) asks is 'show me the progression'. So, the economic value has to be there. Then it’s easy to adopt.

We’ve gotten better with every generation. We’ve just hit our stride. Milan was a really big inflection point for us. Naples and Rome were really good and broke into the cloud, but Milan really expanded the aperture for us across enterprise and cloud.

How do you plan on continuing this growth?

Now, we are no longer a contender; we are viewed as the leader. So, that’s what got us here. But what got you here doesn’t get you there, right? So, for me it’s about three things. Extend the product leadership: perf per core, per socket, density, energy efficiency, TCO, all that, across a very broad set of cloud and enterprise workloads. Then, address the two major hot topics: AI and the refresh of ageing fleets.

The cycles have been elongated. I was with a customer a week ago and he told me 'over 50% of my fleet is four years old'. So, driving refresh and consolidation is critical, but with that comes the question: how do you help them with the refresh and also help them with software licensing? Broadcom has created a lot of disruption with their new pricing model. There is also AI. The CPU is critical in this mixed workload environment.

It’s very rare that you see someone say, 'I’m going to pick that server CPU because I’m going to run AI all day on it'. If that’s the case, we recommend that they use accelerators. Anything above 30% AI workloads should be put on accelerators. So if it’s 80/20, with 80% being traditional workloads and 20% being AI, we win. We’ve shown that we’re optimised for general purpose.
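
As a rough sketch of the placement rule McNamara describes: the 30% figure is his, but the function and everything else below are illustrative assumptions, not an AMD tool.

```python
# Hypothetical sketch of the workload-placement heuristic from the
# interview: mixed fleets stay on general-purpose CPUs until AI exceeds
# roughly 30% of the workload mix, at which point accelerators make sense.

def recommend_platform(ai_fraction: float) -> str:
    """Return a platform recommendation for a given AI share of workloads.

    ai_fraction: share of total workloads that are AI (0.0-1.0).
    The 0.30 threshold is the figure quoted in the interview; the
    function itself is an illustrative assumption.
    """
    if not 0.0 <= ai_fraction <= 1.0:
        raise ValueError("ai_fraction must be between 0.0 and 1.0")
    if ai_fraction > 0.30:
        return "accelerators (GPUs) for the AI share"
    return "general-purpose server CPUs"

# The 80/20 mix from the interview lands on the CPU side of the threshold.
print(recommend_platform(0.20))  # general-purpose server CPUs
print(recommend_platform(0.50))  # accelerators (GPUs) for the AI share
```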

For AI workloads, do you agree with the statement, reported by data scientists working on LUMI, that we’re now seeing diminishing returns on compute and need to focus on improving the network?

The network is super critical. It’s kind of like memory bandwidth; network bandwidth is very similar. You have to feed the cores. It’s similar on the network between the GPUs, on both the back end and the front end. The front end is important, but the back end is really critical to keep these GPUs all clustered and the data flowing. When LUMI was built they didn’t have a back-end network; it uses a coherent cache interface between the CPU and the GPU. GPUs are for parallel processing, so it’s super critical. This is why we are building an AI NIC to expand that.
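
A back-of-the-envelope sketch of the 'feed the cores' point: a GPU becomes network-bound when moving data between GPUs takes longer than computing on it. Every number below is an illustrative assumption, not a LUMI or AMD specification.

```python
# Toy check of whether a training step is compute- or network-bound.
# All figures are invented for illustration.

gradient_bytes = 2 * 70e9    # e.g. 70B parameters in fp16 (assumption)
step_flops = 1e15            # hypothetical compute per training step
gpu_flops = 1.3e15           # hypothetical accelerator peak FLOPS
net_bw = 400e9 / 8           # 400 Gb/s back-end link, in bytes/s

t_compute = step_flops / gpu_flops    # ~0.77 s of compute
t_network = gradient_bytes / net_bw   # ~2.8 s to move gradients

print(f"compute {t_compute:.2f}s vs network {t_network:.2f}s -> "
      f"{'network-bound' if t_network > t_compute else 'compute-bound'}")
```

Under these made-up numbers the GPU idles waiting on the network, which is the case for investing in back-end bandwidth rather than more raw compute.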

We are going to break through. We had dinner with a big hyperscaler on Monday and even they’re amazed that there is no end in sight for compute. It’s exciting.

On getting more compute, what impact do export bans have on the development of chips, if any?

We believe because of our architecture, we can adhere to that and still service other regions like China. Keep in mind, they’re going to solve the problem themselves. Ultimately they’re going to have to. At this time, anyone who is building a GPU or accelerator is finding a way to come in under the requirements that have been set.

So, it’s not stifling?

No.

OK, so back to AI workloads. AI workloads require a lot more energy than traditional workloads. Whose headache should this be, and is this a consideration when developing chips?

I really believe that it starts at the country level. Let’s face it: you don’t want to fall that far behind in technology, I don’t care what country you’re from. So, I think it starts there. Then I think it’s the data center providers. It’s a big issue. In the US you cannot even go across state lines. If you want to build out AI compute from California to Phoenix, it’ll take you a year to get some form of agreement. Crossing state lines with power transmission is very, very hard. So, that’s a problem the US and other countries need to solve.

On performance and efficiency, I think they go hand in hand. If you look at what we did with EPYC, we increased our performance per watt, but we also increased the TDP on the overall chip. So yes, performance went up and power consumption went up, but it’s like when you run a business: you want your top line going up at a much steeper slope than your expenditure. So, that’s the way I look at performance and efficiency.
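
A toy numerical version of that framing, with invented figures rather than AMD’s: if performance rises faster than power, perf per watt still improves even though absolute TDP grows.

```python
# Illustrative numbers only (not AMD's). Performance doubles while
# TDP rises ~43%, so perf-per-watt improves despite higher power.

old = {"perf": 100.0, "tdp_w": 280.0}   # hypothetical previous generation
new = {"perf": 200.0, "tdp_w": 400.0}   # hypothetical new generation

old_ppw = old["perf"] / old["tdp_w"]    # ~0.357 perf/W
new_ppw = new["perf"] / new["tdp_w"]    # 0.500 perf/W

print(f"perf:   +{new['perf'] / old['perf'] - 1:.0%}")    # +100%
print(f"power:  +{new['tdp_w'] / old['tdp_w'] - 1:.0%}")  # +43%
print(f"perf/W: {old_ppw:.3f} -> {new_ppw:.3f} "
      f"(+{new_ppw / old_ppw - 1:.0%})")                  # +40%
```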

Is it easier to push that top line higher or keep that bottom line lower?

It varies. So, with Turin, very interesting, I’ve had some of the hyperscalers say 'I don’t want more perf, I want lower power and lower cost'. I have others saying 'I want perf per dollar and I’ll eat some of the power'. So it varies with the strategy of the provider. At the end of the day we’re providing chips, but they’re providing the services.

We can do either, and I can give you an example. For perf-to-price, to reduce the overall price we lowered the TDP to a point where we were happy with the perf, and that reduced the overall cost. The other one is a little harder, because it’s purely 'I don’t want more perf, give me the same at lower cost'. We show customers: here is where you are on the curve, and if you want to operate here, we can do it; if you want to operate there, we can do that too.
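
A minimal sketch of what picking a point on that curve might look like. The square-root perf-versus-power scaling and all the numbers are stand-in assumptions for illustration; the real curves are part-specific.

```python
# Hypothetical perf-vs-TDP curve: performance typically scales
# sub-linearly with power, so lowering TDP trades a small amount of
# perf for a larger power (and cost) saving. Curve shape is assumed.

def perf_at_tdp(tdp_w: float, base_tdp: float = 400.0,
                base_perf: float = 100.0, exponent: float = 0.5) -> float:
    """Relative performance at a given TDP, assuming perf ~ power**0.5."""
    return base_perf * (tdp_w / base_tdp) ** exponent

# Three candidate operating points a customer might choose between.
for tdp in (400, 320, 240):
    print(f"{tdp:3d} W -> {perf_at_tdp(tdp):5.1f} perf "
          f"({perf_at_tdp(tdp) / perf_at_tdp(400.0):.0%} of peak)")
```

Dropping from 400 W to 320 W costs about 11% of performance in this toy model, which is the kind of trade the 'lower power, lower cost' customers in the interview are asking for.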

