Analysis: As Nvidia Takes AI Victory Lap, AMD Doubles The Trouble For Intel

by nlqip
June 17, 2024

Both Nvidia and AMD have significantly sped up development and now plan to release data center accelerator chips every year instead of every two years. That is creating double the trouble for Intel, which is trying to catch up with its Gaudi AI chips but remains behind the performance curve.

If Nvidia’s GTC event in March was the main set for its supergroup of AI chips, networking gear, server platforms and software, this month’s Computex event was the chip design giant’s encore, where it revealed plans to continue dominating the accelerated computing market with more powerful processors and ever-expanding ecosystem support.

While many have said we are in the early innings of AI adoption, the Taiwanese industry event nevertheless served as a victory lap of sorts for Nvidia, which surpassed semiconductor stalwart Intel last year in annual revenue thanks to skyrocketing sales of its data center GPUs and platforms, driven by an arms race of generative AI development.

After revealing its next-generation Blackwell data center GPU architecture and promising a massive leap in performance and efficiency for AI computing at GTC, Nvidia used Computex 2024 to make several announcements that it hopes will convince businesses to stay within its ecosystem and make greater investments.

It revealed an expanded road map with Blackwell successors set to launch over the next three years. It announced a plan to make Blackwell GPUs more accessible through a variety of modular server designs when the architecture launches later this year. And it announced massive ecosystem support of its Nvidia Inference Microservices, which could further engrain the company in how AI applications are developed.

The plan to follow up Blackwell with updated GPU architectures in 2025, then 2026, then 2027 is part of a new one-year release cadence Nvidia announced last year, representing an acceleration of the company’s previous strategy of releasing new GPUs every two years.

“Our basic philosophy is very simple: build the entire data center scale, disaggregate it and sell it to you in parts on a one-year rhythm, and we push everything to the technology limits,” Nvidia CEO Jensen Huang said in his Computex keynote earlier this month.

The AI chip giant sped up development in the face of growing competition, including from AMD, which responded earlier this month with its own hastened road map that will see the chip designer, like Nvidia, release data center accelerator chips every year instead of every two years. This will start with the Instinct MI325X GPU due out later this year, with successors to follow in 2025 and 2026 to keep up with Nvidia’s increasingly powerful chips.

“Nvidia quite candidly stepped on their accelerator pedal, and when they saw that—‘Holy crap, AMD has got a real part [in the Instinct MI300X]; they’re going to be a real competitor’—they very deliberately stepped on the accelerator, trying to block us and everybody else out. And so we’re responding to that as well,” Forrest Norrod, AMD’s top data center executive, told CRN in an interview earlier this month.

On this front, Intel is investing significant resources into the upcoming launch of its Gaudi 3 chip, which is set to arrive in air-cooled versions in the third quarter with the support of Dell Technologies, Hewlett Packard Enterprise, Lenovo, Supermicro, Asus, Gigabyte and other major server vendors. The company will then follow up with the most powerful versions of Gaudi 3, made possible by liquid cooling, in the fourth quarter.

But while Gaudi 3 has significantly greater ecosystem support than Intel’s two-year-old Gaudi 2 chip, the chipmaker may face a challenge because it expects Gaudi 3 to mainly compete with Nvidia’s H100 GPU, which launched in 2022, and, to some extent, the larger-memory H200 successor that recently started shipping.

This is an issue because, by the time Gaudi 3 starts finding its way into servers, Nvidia will be close to doing the same—if not already doing so—with its first Blackwell GPU designs, which are expected to provide levels of performance and efficiency that are magnitudes greater than what the Hopper architecture enables for the H100 and H200.

This year’s fourth quarter is also when AMD expects to start shipping its Instinct MI325X GPU, which the company believes will provide solid competition against Nvidia’s H200 and Blackwell GPUs. Crucially, the MI325X will feature an HBM3e high-bandwidth memory capacity of 288 GB, which is an area that has become critical in generative AI computing due to the increasingly gigantic models that are used to enable cutting-edge capabilities.

The Instinct MI325X could pose another problem for Intel. Its memory capacity is 50 percent higher than the HBM3 capacity of AMD’s MI300X, which already surpasses Gaudi 3’s high-bandwidth memory capacity by 50 percent and has now been shipping for several months with support from Dell Technologies, Hewlett Packard Enterprise, Lenovo and Supermicro. In other words, AMD plans to start shipping an accelerator chip with more than double the memory of Intel’s Gaudi 3 around the same time, and the MI325X will be compatible with server platforms that have already been qualified by major OEMs.
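
The capacity comparison can be checked with a few lines of arithmetic. The sketch below takes the 288 GB figure and the two 50 percent gaps stated in the article and works backward to the implied capacities. The variable names are illustrative, and the MI300X and Gaudi 3 values are back-calculated here rather than quoted from vendor spec sheets.

```python
# Figures stated in the article.
mi325x_gb = 288   # HBM3e capacity of AMD's Instinct MI325X
step = 1.5        # each "50 percent higher" gap

# Derived values (back-calculated, not quoted from a spec sheet).
mi300x_gb = mi325x_gb / step   # MI325X is 50% above MI300X
gaudi3_gb = mi300x_gb / step   # MI300X is 50% above Gaudi 3

ratio = mi325x_gb / gaudi3_gb
print(f"MI300X: {mi300x_gb:.0f} GB, Gaudi 3: {gaudi3_gb:.0f} GB")
print(f"MI325X vs. Gaudi 3: {ratio:.2f}x")
```

Compounding the two gaps gives 1.5 x 1.5 = 2.25, so the MI325X would carry a little more than twice the high-bandwidth memory of Gaudi 3.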

Intel, on the other hand, doesn’t plan to launch the successor to Gaudi 3 until late 2025. Code-named Falcon Shores, the next-gen chip is expected to combine the Xe GPU architecture that appeared in its Max Series GPU with its Gaudi technology, but the company only arrived at that design decision after making multiple changes from its original plan, delaying the chip from its original 2024 launch window.

When Intel does launch Falcon Shores, Nvidia will likely be focusing on ramping Blackwell Ultra, the successor to Blackwell, while AMD turns its sights to the next-gen Instinct MI400. At that point, Intel had better hope that it can position Falcon Shores competitively.

The road to a commercially successful data center accelerator product has been a long one for Intel, which first tried to tackle the category with Xeon Phi in the early 2010s, then Nervana in the late 2010s, only to then focus on Xe GPUs and Gaudi chips until the company decided last year to merge their road maps starting with Falcon Shores.

It could take a while longer for Intel to get there. During the company’s first-quarter earnings call in April, CEO Pat Gelsinger said the company expected to make more than $500 million this year from Gaudi chips. That pales in comparison to the $4 billion AMD has forecast for data center GPU revenue in 2024 and, to a much greater extent, the $19.4 billion Nvidia made from data center compute products in the first quarter alone.

Then there’s the fact that there are other companies building competitive data center accelerator chips, from major cloud service providers like Amazon Web Services, Microsoft Azure and Google Cloud to startups like Cerebras Systems. A recent report from research firm TechInsights estimated that Google’s data center processor revenue last year was just behind Intel’s mainly thanks to its Tensor Processing Unit chips.

To make Gaudi 3 a winner, Intel is betting that demand for AI computing will remain insatiable for the next year and that there will be enough time in the market between the availability of Gaudi 3 systems and systems with next-gen chips from rivals so that the company can win customer deals against Nvidia’s H100 and H200 plus AMD’s MI300X.

The semiconductor giant is also placing major emphasis on the economic benefits of Gaudi 3, having taken the unprecedented step of publicly revealing the $125,000 price of its eight-chip Gaudi 3 platform and saying that it will cost only two-thirds of the estimated price of Nvidia’s eight-GPU HGX H100 platform. The touted economic benefits also stem from Gaudi 3’s ability to deliver higher performance per watt than the H100 when running inference on large language models that serve up long responses.
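
The pricing claim implies a rough figure for the Nvidia platform. The back-of-the-envelope sketch below assumes that "two-thirds" is exact and that the platform price splits evenly across the eight chips. Both are simplifications, since a platform price also covers the baseboard and other components, and the resulting number is an implied estimate, not a price quoted by Nvidia.

```python
from fractions import Fraction

gaudi3_platform_usd = 125_000   # stated price of the eight-chip Gaudi 3 platform
price_ratio = Fraction(2, 3)    # Gaudi 3 platform vs. estimated HGX H100 price

# Implied estimate for the eight-GPU HGX H100 platform.
implied_hgx_h100_usd = gaudi3_platform_usd / price_ratio

# Naive per-accelerator split; ignores baseboard and other platform costs.
gaudi3_per_chip_usd = gaudi3_platform_usd / 8

print(f"Implied HGX H100 estimate: ${float(implied_hgx_h100_usd):,.0f}")
print(f"Gaudi 3 platform price per chip: ${gaudi3_per_chip_usd:,.0f}")
```

Under those assumptions, the estimate Intel is comparing against works out to roughly $187,500 for the HGX H100 platform.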

“Customers are looking for high-performance, cost-effective GenAI training and inferencing solutions, and they started to turn to alternatives like Gaudi. They want choice. They want open software and hardware solutions and time-to-market solutions at dramatically lower [total cost of ownership],” Gelsinger said in his Computex keynote earlier this month.

Intel has also pledged to differentiate from the proprietary nature of Nvidia’s products by building an open enterprise AI platform with the support of big names like VMware, Red Hat and SAP alongside smaller independent software vendors. With Gaudi 3 set to use Ethernet to scale to more than a thousand systems, the company is also backing the development of an open “Ultra Ethernet” standard for future AI data centers.

“Customers are asking for open technology. They don’t want proprietary islands in their data center for their AI solutions,” Gelsinger said at Computex.

Nevertheless, Intel has found itself in a challenging situation, mainly due to Nvidia’s continuing dominance of the profit-rich data center accelerator space, now exacerbated by AMD’s decision to follow the market leader with a faster release cadence of products. And it’s a situation the semiconductor giant must overcome while juggling several other important initiatives as part of Gelsinger’s grand comeback plan.
