AMD CEO Lisa Su on the AI revolution and competing with Nvidia

Photo illustration by Alex Parkin / The Verge

At this year’s Code Conference, the CEO of one of the world’s largest computer chip companies discusses competing with Nvidia’s leading GPU, AI regulation, and the global supply chain.

Sep 29, 2023 - 21:02

Today, we’re bringing you something a little different. The Code Conference was this week, and we had a great time talking live onstage with all of our guests. We’ll be sharing a lot of these conversations here in the coming days, and the first one we’re sharing is my chat with Dr. Lisa Su, the CEO of AMD.

Lisa and I spoke for half an hour, and we covered an incredible number of topics, especially about AI and the chip supply chain. These past few years have seen a global chip shortage, exacerbated by the pandemic, and now, coming out of it, there’s suddenly another big spike in demand thanks to everyone wanting to run AI models. The balance of supply and demand is overall in a pretty good place right now, Lisa told us, with the notable exception of these high-end GPUs powering all of the large AI models that everyone’s running.

The hottest GPU in the game is Nvidia’s H100 chip. But AMD is working to compete with a new chip called the MI300, which Lisa told us should be as fast as the H100. There’s also a lot of work being done in software so that developers can move easily between Nvidia and AMD. So we got into that.

You’ll also hear Lisa talk about what companies are doing to increase manufacturing capacity. The CHIPS and Science Act that recently passed is a great step toward building chip manufacturing here in the United States, but Lisa told us it takes a long time to bring up that supply. So I wanted to know how AMD is looking to diversify this supply chain and make sure it has enough capacity to meet all of this new demand.

Finally, Lisa answered questions from the amazing Code audience and talked a lot about how much AMD is using AI inside the company right now. It’s more than you think, although Lisa did say AI is not going to be designing chips all by itself anytime soon.

Okay, Dr. Lisa Su, CEO of AMD. Here we go.

This transcript has been lightly edited for length and clarity.

Lisa Su: Hello, hello. Nice to see you.

Nilay Patel: Nice to see you.

Thank you for having me.

I have a ton to talk about — 500 cards’ worth of questions. We’re going to be here all night. But let’s start with something exciting. AMD made some news today in the AI market. What’s going on?

Well, I can say, first of all, the theme of this whole conference, AI, is the theme of everything in tech these days. And when we look at all of the opportunities for computing to really advance AI, that’s really what we’re working on. So yes, today, we did have an announcement this morning from a company, a startup called Lamini, a great company that we’ve been working with, some of the top researchers in large language models.

And the key for everyone is, when I talk to CEOs, people are all asking, “I know I need to pay attention to AI. I know I need to do something. But what do I do? It’s so complicated. There are so many different factors.” And with these foundational models like Llama, which are great foundational models, many enterprises actually want to customize those models with their own data and ensure that you can do that in your private environment and for your application. And that’s what Lamini does.

They actually customize models, fine-tune models for enterprises, and they operate on AMD GPUs. And so that was a cool thing. And we spent a bit of time with them, quite a bit of time with them, really optimizing the software and the applications to make it as easy as possible to develop these enterprise, fine-tuned models.

I want to talk about that software in depth. I think it’s very interesting where we’re abstracting the different levels of software development away from the hardware. But I want to come back to that.

I want to begin broadly with the chip market. We’re exiting a period of pretty incredible constraint in chips across every process node. Where do you think we are now?

It’s interesting. I’ve been in the semiconductor business for, I don’t know, the last 30 years, and for the longest time, people didn’t really even understand what semiconductors were or where they fit in the overall supply chain and where they were necessary in applications. I think the last few years, especially with the pandemic-driven demand and everything that we’re doing with AI, people now are really focused on semiconductors.

I think there has been a tremendous cycle. One, a cycle where we needed a lot more chips than we had, and then a cycle where we had too many of some. But at the end of the day, I think the fact is semiconductors are essential to so many applications. And particularly for us, what we’re focused on are the most complex, the highest performance, the bleeding edge of semiconductors. And I would say that there’s tremendous growth in the market.

What do you think the bottleneck is now? Is it the cutting edge? Is it at the older process nodes, which is what we were hearing in the middle of the chip shortage?

I think the industry as a whole has really come together as an ecosystem to put a lot of capacity on for the purposes of ensuring that we do satisfy overall demand. So in general, I would say that the supply / demand balance is in a pretty good place, with perhaps the exception of GPUs. If you need GPUs for large language model training and inference, they’re probably tight right now. A little bit tight.

Lisa’s got some in the back if you need some.

But look, the truth is we absolutely are putting a tremendous amount of effort getting the entire supply chain ramped up. These are some of the most complex devices in the world — hundreds of billions of transistors, lots of advanced technology. But absolutely ramping up supply overall.

The CHIPS and Science Act passed last year, a massive investment in this country in fabs. AMD is obviously the largest fabless semiconductor company in the world. Has that had a noticeable effect yet, or are we still waiting for that to come to fruition?

I do think that if you look at the CHIPS and Science Act and what it’s doing for the semiconductor industry in the United States, it’s really a fantastic thing. I have to say, hats off to Gina Raimondo and everything that the Commerce Department is doing with industry. These are long lead time things. The semiconductor ecosystem in the US needed to be built five years ago. It is expanding now, especially at the leading edge, but it’s going to take some time.

So I don’t know that we feel the effects right now. But one of the things that we always believe is the more you invest over the longer term, you’ll see those effects. So I’m excited about onshore capacity. I’m also really excited about some of the investments in our national research infrastructure because that’s also extremely important for long-term semiconductor strength and leadership.

AMD’s results speak for themselves. You’re selling a lot more chips than you were a few years ago. Where have you found that supply? Are you still relying on TSMC while you wait for these new fabs to come up?

Again, when you look at the business that we’re in, it’s pushing the bleeding edge of technology. So we’re always on the most advanced node and trying to get the next big innovation out there. And there’s a combination of both process technology, manufacturing, design, design systems. We are very happy with our partnership with TSMC. They are the best in the world with advanced and leading-edge technologies.

They’re it, right? Can you diversify away from them?

I think the key is geographical diversity, Nilay. So when you think about geographical diversity, and by the way, this is true no matter what. Nobody wants to be in the same place because there are just natural risks that happen. And that’s where the CHIPS and Science Act has actually been helpful because there are now significant numbers of manufacturing plants being built in the US. They’re actually going to start production over the next number of quarters, and we will be active in having some of our manufacturing here in the United States.

I talked to Intel CEO Pat Gelsinger when he broke ground in Ohio. They’re trying to become a foundry. He said very confidently to me, “I would love to have an AMD logo on the side of one of these fabs.” How close is he to making that a reality?

Well, I would say this. I would say that from onshore manufacturing, we are certainly looking at lots and lots of opportunities. I think Pat has a very ambitious plan, and I think that’s there. I think we always look at who are the best manufacturing partners, and what’s most important to us is someone who’s really dedicated to the bleeding edge of technology.

Is there a competitor in the market to TSMC on that front?

There’s always competition in the market. TSMC is certainly very good. Samsung is certainly making a lot of investments. You mentioned Intel. I think there are some activities in Japan as well to bring up advanced manufacturing. So there are lots of different options.

Last question on this thread, and then I do want to talk to you about AI. There has been a lot of noise recently about Huawei. They put out a seven-nanometer chip. This is either an earth-shattering geopolitical event or it’s bullshit. What do you think it is?

Let’s see. I don’t know that I would call it an earth-shattering geopolitical event. Look, I think there’s no question that technology is considered a national security importance. And from a US standpoint, I think we want to ensure that we keep that lead. Again, I think the US government has spent a lot of time on this aspect.

The way I look at these things is we are a global company. China’s an important market for us. We do sell to China more consumer-related goods versus other things, and there’s an opportunity there for us to really have a balanced approach into how we deal with some of these geopolitical matters.

Do you think that there was more supply available at TSMC because Huawei got kicked out of the game?

I think TSMC has put a tremendous amount of supply on the table. I mean, if you think about the CapEx that’s happened over the last three or four years, it’s there because we all need more chips. And when we need more chips, the investment is there. Now chips are more expensive as a result, and that’s part of the ecosystem that we’ve built out.

Let’s talk about that part of it. So you mentioned GPUs are constrained. The Nvidia H100, there’s effectively a black market for access to these chips. You have some chips, you’re coming out with some new ones. You just announced Lamini’s training fully on your chips. Have you seen opportunity to disrupt this market because Nvidia supply is so constrained?

I would take a step back, Nilay, and just talk about what’s happening in the AI market because it’s incredible what’s happening. If you think about the technology trends that we’ve seen over the last 10 or 20 years — whether you’re talking about the internet or the mobile phone revolution or how PCs have changed things — AI is 10 times, 100 times, more than that in terms of how it’s impacting everything that we do.

So if you talk about enterprise productivity, if you talk about personal productivity or society, what we can do from a productivity standpoint, it’s that big. So the fact that there’s a shortage of GPUs, I think it’s not surprising because people recognize how important the technology is. Now, we’re in such early innings of how AI and especially generative AI is coming to market that I view this as a 10-year cycle that we’re talking about, not how many GPUs can you get in the next two to four quarters.

We are excited about our road map. I think with high-performance computing, I would call generative AI the killer app for high-performance computing. You need more and more and more. And as good as today’s large language model is, it can still get better if you continue to increase the training performance and the inference performance.

And so that’s what we do. We build the most complex chips. We do have a new one coming out. It’s called MI300 if you want the code name there, and it’s going to be fantastic. It’s targeted at large language model training as well as large language model inference. Do we see opportunity? Yes. We see significant opportunity, and it’s not just in one place. The idea of the cloud guys are the only users, that’s not true. There’s going to be a lot of enterprise AI. A lot of startups have tremendous VC backing around AI as well. And so we see opportunity across all those spaces.

So MI300?

MI300, you got it.

Performance-wise, is this going to be competitive with the H100 or exceed the H100?

It is definitely going to be competitive from training workloads, and in the AI market, there’s no one-size-fits-all as it relates to chips. There are some that are going to be exceptional for training. There are some that are going to be exceptional for inference, and that depends on how you put it together.

What we’ve done with MI300 is we’ve built an exceptional product for inference, especially large language model inference. So when we look going forward, much of what work is done right now is companies training and deciding what their models are going to be. But going forward, we actually think inference is going to be a larger market, and that plays well into some of what we’ve designed MI300 for.

If you look at what Wall Street thinks Nvidia’s moat is, it’s CUDA, it’s the proprietary software stack, it’s the long-running relationships with developers. You have ROCm, which is a little different. Do you think that that’s a moat that you can overcome with better products or with a more open approach? How are you going about attacking that?

I’m not a believer in moats when the market is moving as fast as it is. When you think about moats, it’s more mature markets where people are not really wanting to change things a lot. When you look at generative AI, it’s moving at an incredible pace. The progress that we’re making in a few months in a regular development environment might’ve taken a few years. And software in particular, our approach is an open software approach.

There’s actually a dichotomy. If you look at people who have developed software over the last five, seven, or eight years, they’ve tended to use… let’s call it, more hardware-specific software. It was convenient. There weren’t that many choices out there, and so that’s what people did. When you look at going forward, actually what you find is everyone’s looking for the ability to build hardware-agnostic software because people want choice. Frankly, people want choice. People want to use their older infrastructure. People want to ensure that they’re able to move from one infrastructure to another infrastructure. And so they’re building on these higher levels of software. Things like PyTorch, for example, which tends to be that hardware-agnostic capability.

So I do think the next 10 years are going to be different from the last 10 as it relates to how do you develop within AI. And I think we’re seeing that across the industry and the ecosystem. And the benefit of an open approach is that there’s no one company that has all of the ideas. So the more we’re able to bring the ecosystem together, we get to take advantage of all of those really, really smart developers who want to accelerate AI learning.

PyTorch is a big deal, right? This is the language that all these models are actually coded in. I talk to a bunch of cloud CEOs. They don’t love their dependency on Nvidia as much as anybody doesn’t love being dependent on any one vendor. Is this a place where you can go work with those cloud providers and say, “We’re going to optimize our chips for PyTorch and not CUDA,” and developers can just run on PyTorch and pick whichever is best optimized?

That’s exactly it. So if you think about what PyTorch is trying to do — and it really is trying to be that sort of hardware-agnostic layer — one of the major milestones that we’ve come up with is on PyTorch 2.0, AMD was qualified on day one. And what that means is anybody who runs CUDA on PyTorch right now, it will run on AMD out of the box because we’ve done the work there. And frankly, it’ll run on other hardware as well.

But our goal is “may the best chip win.” And the way you do that is to make the software much more seamless. And it’s PyTorch, but it’s also Jax. It’s also some of the tools that OpenAI is bringing in with Triton. There are lots of different tools and frameworks that people are bringing forward that are hardware-agnostic. There are a bunch of people who are also doing “build your own” types of things. So I do think this is the wave of the future for AI software.

Are you building custom chips for any of these companies?

We have the capability of building custom chips. And the way I think about it is the time to build custom chips is actually when you get very high volume applications going forward. So I do believe there will be custom chips over the next number of years. The other piece that’s also interesting is you need all different types of engines for AI. So we spend a lot of time talking about big GPUs because that’s what’s needed for training large language models. But you’re also going to see ASICs for some… let’s call it, more narrow applications. You’re also going to see AI in client chips. So I’m pretty excited about that as well in terms of just how broad AI will be incorporated into chips across all of the market segments.

I’ve got Kevin Scott, CTO of Microsoft, here tomorrow. So I’ll ask you this question so I can chase him down with it. If, say, Microsoft wanted to diversify Azure and put more AMD in there and be invisible to customers, is that possible right now?

Well, first of all, I love Kevin Scott. He’s a great guy, and we have a tremendous partnership with Microsoft across both the cloud as well as the Windows environment. I think you should ask him the question. But I think if you were to ask him or if you were to ask a bunch of other cloud manufacturers, they would say it’s absolutely possible. Yes, it takes work. It takes work that we each have to put in, but it’s much less work than you might have imagined because people are actually writing code at the higher-level frameworks. And we believe that this is the wave of the future for AI programming.

Let me connect this to an end-user application just for a second. We’re talking about things that are very much raising the cost curve: a lot of smart people doing a lot of work to develop for really high-end GPUs on the cutting-edge process nodes. Everything’s just getting more expensive, and you see how the consumer applications are expensive: $25 a month, $30 a seat for Microsoft Office with Copilot. When do you come down the cost curve that brings those consumer prices down?

It’s a great, great question. I do believe that the value that you get with gen AI in terms of productivity will absolutely be proven out. So yes, the cost of these infrastructures is high right now, but the productivity that you get on the other side is also exciting. We’re deploying AI internally within AMD, and it’s such a high priority because, if I can get chips out faster, that’s huge productivity.

Do you trust it? Do you have your people checking the work that AI is doing, or do you trust it?

Sure. Look, we’re all experimenting, right? We’re in the very, very early stages of building the tools and the infrastructure so that we can deploy. But the fact is it saves us time, whether we’re designing chips, whether we’re testing chips, whether we’re validating chips. It saves us time, and time is money in our world.

But back to your question about when do you get to the other side of the curve. I think that’s why it’s so important to think about AI broadly and not just in the cloud. So if you think about how the ecosystem will look a few years from now, you would imagine a place where, yes, you have the cloud infrastructures training these largest foundational models, but you’re also going to have a bunch of AI at the edge. And whether it’s in your PC or it’s in your phone, you’re going to be able to do local AI. And there, it is cheaper, it is faster, and it is actually more private when you do that. And so, that’s this idea of AI everywhere and how it can really enhance the way we’re deploying.

That brings me to open source and, honestly, to the idea of how we will regulate this. So there’s a White House meeting, everyone participates, great. Everyone’s very proud of each other. You think about how you will actually enforce AI regulation. And it’s okay, you can probably tell AWS or Azure not to run certain work streams. “Don’t do these things.” And that seems fine. Can you tell AMD to not let certain things happen on the chips for somebody running an open-source model on Linux on their laptop?

I think it is something that we all take very seriously. The technology has so much upside in terms of what it can do from a productivity and a discovery standpoint, but there’s also safety in AI. And I do think that, as large companies, we have a responsibility. If you think about the two things around data privacy as well as just overall ensuring as these models are developed that they’re developed to the best of our ability without too much bias. We’re going to make mistakes. The industry as a whole will not be perfect here. But I think there is clarity around its importance and that we need to do it together and that there needs to be a public / private partnership to make it happen.

I can’t remember anyone’s name, so I’d be a horrible politician. But let’s pretend I’m a regulator. I’m going to do it. And I say, “Boy, I really don’t want these kids using any model to develop chemical weapons. And I need to figure out where to land that enforcement.” I can definitely tell Azure, “Don’t do that.” But a kid with an AMD chip in a Dell laptop running Linux, I have no mechanism of enforcement except to tell you to make the chip not do it. Would you accept that regulation?

I don’t think there’s a silver bullet. It’s not, “I can make the chip not do it.” It’s “I can make the combination of the chip and the model and have some safeguards in place.” And we’re absolutely willing to be at that table to help that happen.

You would accept that kind of regulation, that the chip will be constrained?

Yes, I would accept an opportunity for us to look at what are the safeguards that we would need to put in place.

I think this is going to be one of the most complicated... I don’t think we expect our chips to be limited in what we can do, and it feels like this is a question we have to ask and answer.

Let me say again, it’s not the chip by itself. Because in general, chips have broad capability. It’s the chips plus the software and the models. Particularly on the model side, what you do in terms of safeguards.

We could start lining up for questions. I’ve just got a couple more for you. You’re in the PS5; you’re in the Xbox. There’s a view of the world that says cloud gaming is the future of all things. That might be great for you because you’ll be in their data centers, too. But do you see that shift underway? Is that for real, or are we still doing console generations?

It’s so interesting. Gaming is everywhere. Gaming is everywhere in every form factor. There’s been this long conversation about: is this the end of console gaming? And I don’t see it. I see PC gaming strong, I see console gaming strong, and I see cloud gaming also having legs. And they all need similar types of technology, but they obviously use it in different ways.

Audience Q&A

Nilay Patel: Please introduce yourself.

Alan Lee: Hi, Lisa. Alan Lee, Analog Devices. One and a half years after the Xilinx acquisition, how do you see adaptive computing playing out in AI?

Lisa Su: First of all, it’s nice to see you, Alan. I think, first of all, the Xilinx acquisition was an acquisition we completed about 18 months ago — fantastic acquisition. Brought a lot of high-performance IP with adaptive computing IP. And I do see that particularly on these AI engines, engines that are optimized for data flow architectures, that’s one of the things that we were able to bring in as part of Xilinx. That’s actually the IP that is now going into PCs.

And so we see significant IP usage there. And together, as we go forward, I have this belief that there’s no one computer that is the right one. You actually need the right computing for the right applications. So whether it’s CPUs or GPUs or FPGAs or adaptive SoCs, you need all of those. And that’s the ecosystem that we’re bringing together.

NP: This tall gentleman over here.

Casey Newton: Hi, Casey Newton from Platformer. I wanted to return to Nilay’s question about regulation. Someday, it’s sad to say, but somebody might try to acquire a bunch of your GPUs for the express purpose of doing harm — training a large language model for that purpose. And so I wonder what sort of regulations, if any, do you think government should place around who gets access to large numbers of GPUs and what size training runs they’re allowed to do.

LS: That’s a good question. I don’t think we know the answer to that, particularly in terms of how to regulate. Our goal is, again, within all of the export controls that are out there, because GPUs are export controlled, that we follow those regulations. There are the biggest and the next level of GPUs that are there. I think the key is, again, as I said, it’s a combination of both chip and model development that really comes about. And we’re active at those tables and talking about how to do those things. I think we want to ensure that we are very protective of the highest-performing GPUs. But also, it’s an important market where lots of people want access.

Daniel Vestergaard: Hi, I’m Daniel from DR [Danmarks Radio]. To return to something you talked about earlier because everyone here is thinking about implementing AI in their internal workflows — and it’s just so interesting to hear about your thoughts because you have access to the chips and deep machine learning knowledge. Can you specify a bit, what are you using AI internally for in the chip-making process? Because this might point us in the right direction.

LS: Thanks for the question. I think every business is looking at how to implement AI. So for us, for example, there are the engineering functions and the non-engineering: sales, marketing, data analytics, lead generation. Those are all places where AI can be very useful. On the engineering side, we look at it in terms of how can we build chips faster. So they help us with design, they help us with test generation, they help us with manufacturing diagnostics.

Back to Nilay’s question, do I trust it to build a chip with no humans involved? No, of course not. We have lots of engineers. I think copilot functions in particular are actually fairly easy to adopt. Pure generative AI, we need to check and make sure that it works. But it’s a learning process. And the key, I would say, is there’s lots of experimentation, and fast cycles of learning are important. So we actually have dedicated teams that are spending their time looking at how we bring AI into our company development processes as fast as possible.

Jay Peters: Hi, Jay Peters with The Verge. Apple seems to be making a much bigger push in how its devices, and particularly its M-series chips, are really good for AAA gaming. Are you worried about Apple on that front at all?

NP: They told me the iPhone 15 Pro is the world’s best game console. And that’s why it’s “Pro.” It’s a very confusing situation.

LS: I don’t know about that. I would say, look, as I said earlier, gaming is such an important application when you think about entertainment and what we’re doing with it. I always think about all competition. But from my standpoint, it’s how do we get... It’s not just the hardware; it’s really how do we get the gaming ecosystem. People want to be able to take their games wherever and play with their friends and on different platforms. Those are options that we have with the gaming ecosystem today. We’re going to continue to push the envelope on the highest-performing PCs and console chips. And I think we’re going to be pretty good.

NP: I have one more for you. If you listen to Decoder, you know I love asking people about decisions. Chip CEOs have to make the longest-range decisions of basically anybody I can think of. What’s the longest-term bet you’re making right now?

LS: We are definitely designing for the five-plus-year cycle. I talked to you today about MI300. We made some of those architectural decisions four or five years ago. And the thought process there was, “Hey, where’s the world going? What kind of computing do you need?” Being very ambitious in our goals and what we were trying to do. So we’re pretty excited about what we’re building for the next five years.

NP: What’s a bet you’re making right now?

LS: We’re betting on what the next big thing in AI is.

NP: Okay. Thank you, Lisa.

LS: Alright.

NP: I did my best.