In depth: Who can replace Nvidia?

"We are always only 30 days away from bankruptcy" is a catchphrase of Nvidia founder Jensen Huang (Huang Renxun).
It is an odd thing for the head of Nvidia to say: as the leader in gaming and artificial intelligence chips, the company already earns profits its rivals cannot match. But Nvidia did once run short of funds and come to the brink of bankruptcy.
The AI boom that began at the end of 2022 has rewarded this company and its strong sense of crisis, sending Nvidia's revenue and profit soaring. In the most recent fiscal quarter, Nvidia's total revenue was $22.1 billion, up 265% year-on-year and 22% quarter-on-quarter, and net profit was $12.3 billion, up 769% year-on-year. For the past fiscal year, Nvidia's revenue reached $60.9 billion, up 126% year-on-year, and net profit was $29.8 billion, up 581% from the previous year.
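The growth rates quoted above imply the year-ago baselines and the net margins. The following is a minimal sketch of that arithmetic in Python, using only the figures reported in this article; the function and variable names are illustrative, and the results are rounded estimates rather than reported numbers.

```python
# All amounts are in billions of US dollars, as quoted in the article.

def prior_value(current, yoy_growth_pct):
    """Back out the year-ago value implied by a year-on-year growth rate."""
    return current / (1 + yoy_growth_pct / 100)

quarter_revenue, quarter_profit = 22.1, 12.3
year_revenue, year_profit = 60.9, 29.8

# Revenue one year earlier, implied by the quoted growth rates.
prior_quarter_revenue = prior_value(quarter_revenue, 265)
prior_year_revenue = prior_value(year_revenue, 126)

# Net margin: the share of revenue that ends up as net profit.
quarter_net_margin = quarter_profit / quarter_revenue
year_net_margin = year_profit / year_revenue

print(f"Implied year-ago quarterly revenue: ${prior_quarter_revenue:.2f}B")
print(f"Implied prior fiscal-year revenue:  ${prior_year_revenue:.2f}B")
print(f"Quarterly net margin: {quarter_net_margin:.1%}")
print(f"Fiscal-year net margin: {year_net_margin:.1%}")
```

On these figures, quarterly revenue a year earlier works out to roughly $6 billion, and more than half of the latest quarter's revenue became net profit.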
On February 23, Nvidia's market value surged and briefly surpassed $2 trillion, making it the third-largest listed company and the most valuable chip company in the world, and leaving Intel, once the world's strongest chip company, struggling to catch up.
An almost perfect business model
The once little-known Nvidia invented the GPU (graphics processing unit), supported the booming gaming industry, and made a fortune from cryptocurrency mining.
With the arrival of the AI boom, its business has received another major boost. Today Nvidia holds more than 80% of the market for AI accelerated computing, and it has dug a wide, deep "moat" to protect its lead. Its success is enviable.
Specifically, GPUs designed for gaming excel at image processing, scientific computing, and similar applications, which makes them naturally suited to workloads such as AI computing that process large amounts of data in parallel.
Twenty years ago, Nvidia also began investing in CUDA, the foundational software layer for programming and tuning GPUs, to reduce the complexity of processing data on GPUs and to build an ecosystem. However, CUDA is proprietary to Nvidia, which means developers cannot freely modify it.
Through an acquisition, Nvidia also obtained the networking technology needed to move data within server clusters, an interconnect that is now essential for training AI models. After more than a decade of development, Nvidia's GPUs have become AI infrastructure.
In addition, as a chip design company, Nvidia outsources chip manufacturing and related work to external foundries such as TSMC and Samsung, taking the semiconductor industry's division of labor to its extreme: it can always use the most competitive manufacturing process. This means Nvidia will not repeat Intel's major mistake of repeatedly failing to deliver promised updates to its manufacturing technology.
As a result, Nvidia offers the best chips, the best networking technology, and the best software. Huang has said that what matters most in an AI system is not the cost of the hardware components but the cost of training and running AI applications. From this perspective, Huang believes Nvidia has no rival in cost-effectiveness.
From a business perspective, Nvidia's current model is almost impeccable. The GPU industry has already moved from a field of startups to one dominated by giants, with only Nvidia and AMD remaining. As semiconductors advance rapidly, the technical and capital barriers keep rising. Compared with Nvidia, which carried its technology for rendering game graphics over into AI computing, companies starting from scratch face many difficulties.
However, Nvidia's monopoly in AI computing leaves many dissatisfied, and competitors are striving to break its dominance. Customers also want a second source of AI chips. For all their advantages, Nvidia's GPUs can consume too much power and be complex to program when used for AI. From startups to other chip makers and technology giants, Nvidia has no shortage of challengers.
Chip giants striving to catch up
AMD, an established chip maker, is regarded as Nvidia's closest competitor in terms of capability.
As Nvidia's long-time rival in gaming chips, AMD also has its own AI processors and has built long-term partnerships with data center operators hungry for computing power.
AMD had already been planning its next-generation AI strategy, including mergers, acquisitions, and departmental restructuring. The emergence of generative AI has pushed the company to expand its product lineup further: the MI300 chip released last December was designed specifically for complex AI models, with 153 billion transistors, 192GB of memory, and 5.3TB per second of memory bandwidth. Those figures are approximately 2 times, 2.4 times, and 1.6 times those of the H100, Nvidia's strongest AI chip, respectively.
On the software side, AMD is counting on the open-source nature of its ROCm software and on more convenient migration tools, which translate CUDA applications into code the MI300 can run, in an attempt to attract Nvidia's customers.
Compared with Nvidia, AMD is starting almost from scratch in the cloud AI chip market, which means its AI chip business can grow quite quickly. Large customers are also willing to try AMD chips: OpenAI, the developer of ChatGPT, has said it will use the MI300 for some model training. In the previous quarter, the MI300 pushed AMD's total data center GPU revenue past $400 million, making it the fastest-growing product in the company's history.
AMD CEO Lisa Su (Su Zifeng) predicts that global sales of AI chips will reach $400 billion by 2027, far above last year's roughly $40 billion, and AMD will need to win part of that market from Nvidia. Analysts estimate that, over time, AMD's share of the AI chip market could reach 20%.
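To put that forecast in perspective, the sketch below works out the compound annual growth rate it implies. It assumes that "last year" means 2023, so the market grows tenfold over four years, and it applies the analysts' 20% share to the 2027 market purely as an illustration; the article does not tie the 20% estimate to any particular year.

```python
# All amounts are in billions of US dollars, as quoted in the article.

def implied_cagr(start, end, years):
    """Compound annual growth rate needed to grow from start to end."""
    return (end / start) ** (1 / years) - 1

market_last_year = 40      # approximate AI chip sales "last year" (assumed to be 2023)
market_2027 = 400          # forecast AI chip sales in 2027
years = 2027 - 2023        # assumption: a four-year horizon

cagr = implied_cagr(market_last_year, market_2027, years)

# Illustrative only: a 20% share applied to the forecast 2027 market.
amd_sales_at_20_percent = 0.20 * market_2027

print(f"Implied annual growth: {cagr:.1%}")
print(f"20% of the 2027 market: ${amd_sales_at_20_percent:.0f}B")
```

Under these assumptions the market would need to grow by roughly 78% a year, and a 20% share of it would be worth about $80 billion.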
Intel is also unwilling to fall behind in AI chips and has begun to regroup.
Over the past year, Intel has pushed back against the claim that generative AI can run only on Nvidia chips, vigorously promoting the performance of its Gaudi 2 chips in third-party tests and telling customers they now have a choice that frees them from a closed chip ecosystem.
Gaudi 2, Intel's most advanced mass-produced AI accelerator, lags Nvidia's latest H100 in performance: each H100 is estimated to deliver roughly 3.6 times the performance of a Gaudi 2. However, Intel argues that the Gaudi 2 costs less and that its price advantage narrows the price-performance gap with the H100. Intel also has its own networking technology and software comparable to Nvidia's CUDA.
Intel is now adjusting its GPU strategy to catch up with Nvidia in cloud AI. Last year, Intel announced that it would merge its Habana Labs and data center GPU divisions and launch a new platform, Falcon Shores, in 2025 to further strengthen its AI chip design capabilities.
Both AMD and Intel have made acquisitions in recent years to strengthen their AI products.
In 2022, AMD spent $35 billion to acquire the programmable chip company Xilinx. Chips designed by Xilinx can be reprogrammed after manufacturing and can be used for AI computing. Intel acquired the Israeli AI startup Habana Labs for approximately $2 billion in 2019, and Intel's current AI chips come mainly from the Habana Labs division.
Startups taking a different path
In purely commercial terms, the GPU market, after all its ups and downs, may no longer have room for new companies. Even so, a group of startups is looking for ways to build chips better suited to AI than Nvidia's GPUs.
These companies have chosen the ASIC (application-specific integrated circuit) architecture as their way into cloud-based AI computing.
The idea behind an ASIC is to fix certain algorithms in hardware, which lowers chip complexity and development difficulty and raises efficiency on specific tasks, at the cost of less generality and flexibility than a GPU. Representative startups building on the ASIC architecture include Cerebras, Groq, and Graphcore; in China, there are Cambricon, Enflame, Bitmain, and others.
These chips go by a dazzling variety of names. Groq, which has recently drawn attention, has launched a chip called the LPU (Language Processing Unit) for large-model inference. According to the test results and promotional material Groq has released, an AI question-answering bot running on Groq LPUs responds much faster than ChatGPT, which runs on GPUs.
However, observers have pointed out that the LPU in its current configuration does not offer a significant advantage. First, the LPU can currently be used only for inference and does not support training large models, so AI companies that need to train large models must still buy Nvidia GPUs. In addition, the LPU uses an expensive, low-capacity special memory chip, which erases any cost advantage.
Jia Yangqing, formerly Alibaba's chief AI scientist, estimates that, taking the Llama-2 70b model as an example, the limited memory capacity means far more Groq LPUs are required than H100s. At the same data throughput, Groq's hardware cost is 40 times that of the H100, and its energy cost is 10 times as high.
Competing with Nvidia has been far from smooth sailing for chip startups. Given Nvidia's market dominance, high operating costs and uncertain business prospects can easily push them into trouble.
Graphcore, a startup dubbed "the UK's Nvidia", is a case in point.
Graphcore has launched an AI chip called the IPU (Intelligence Processing Unit), positioning it against Nvidia. Graphcore has previously demonstrated that, in some AI models running on its IPUs, question-answering bots generate answers fast enough to flood the screen, which suggests the chip is competitive to a degree.
However, customers still prefer to buy Nvidia GPUs and are reluctant to pay for Graphcore IPUs.
As a result, the company struggled to open up larger markets and turn a profit even during last year's wave of enthusiasm for artificial intelligence. Graphcore's 2022 financial report, released last October, showed that pre-tax losses rose 11% year-on-year to £161 million. According to media reports, Graphcore is in talks with large technology companies about a sale.
In the long run, the biggest challenge for chip startups remains building a software ecosystem that can rival Nvidia's. On this front, large technology companies hoping to break free of their dependence on Nvidia may have a better chance.
As the early AI craze fades and giant companies join the competition, venture capital's enthusiasm for AI chip startups is cooling, and the opportunities for startups in AI chips are narrowing.
A greater threat
In fact, for Nvidia, the greater threat may come from its largest group of customers.
Amazon, Google, Microsoft, and Meta all use Nvidia's products in their data centers and buy them in huge volumes. Executives at all of these companies told investors on recent earnings calls that they plan to raise capital expenditure this year and spend it directly on Nvidia's AI chips.
In the global cloud market, Amazon AWS, Microsoft Azure, and Google Cloud together hold the majority share. According to market research firm Synergy Research Group, global enterprise spending on cloud services rose nearly 20% year-on-year to $74 billion in the fourth quarter of 2023, with AWS, Azure, and Google Cloud holding market shares of 31%, 24%, and 11%, respectively.
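As a rough illustration of what those shares mean in dollars, the sketch below multiplies the quarterly spending total by each provider's share. It uses only the Synergy Research Group figures quoted above; the per-provider amounts are implied estimates, not numbers reported by the companies, and the variable names are illustrative.

```python
# Q4 2023 enterprise cloud spending, in billions of US dollars (as quoted).
total_spend = 74.0

# Market shares as quoted in the article.
shares = {"AWS": 0.31, "Azure": 0.24, "Google Cloud": 0.11}

# Spending implied for each provider by its share of the total.
implied_spend = {name: total_spend * share for name, share in shares.items()}

# Combined share of the three largest providers, and what is left for everyone else.
combined_share = sum(shares.values())
rest_of_market = total_spend * (1 - combined_share)

for name, spend in implied_spend.items():
    print(f"{name}: about ${spend:.1f}B")
print(f"Top three combined: {combined_share:.0%}")
print(f"All other providers: about ${rest_of_market:.1f}B")
```

The three providers together account for about two-thirds of the market, which is why their chip purchasing decisions matter so much to Nvidia.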
These wealthy technology companies are capable of designing their own AI chips for their data centers, and in fact they have already done so.
Google was the first, launching the TPU (Tensor Processing Unit) in 2016, a chip optimized specifically for AI computing that is now in its fifth generation. On the strength of its most advanced AI model, Gemini, and its open model, Gemma, Google is now trying to promote the TPU to outside customers.
Amazon AWS, the world's largest cloud computing company by market share, has launched two series of AI chips since 2018, Trainium for AI training and Inferentia for inference, along with a supporting software tool, Neuron. AWS has also developed its own networking, storage, and computing systems in the cloud, partially replacing Nvidia's AI systems.
Microsoft has also joined in. Last November, at its own technology conference, Microsoft unveiled Maia 100, a self-developed cloud chip for AI training and inference. Maia 100 is built on a 5-nanometer process and has 105 billion transistors. Microsoft said the chip is customized for its cloud, maximizing the hardware efficiency of Microsoft Cloud and meeting the needs of large-model AI computing such as GPT.
The new chips launched by these large technology companies show that they can compete with Nvidia in semiconductor hardware and can even design AI chips tailored to their own needs.
However, in the generative AI arms race now under way among technology companies, the immature ecosystems and limited production volumes of their in-house chips make it hard to replace Nvidia's chips at scale, and a shortage of GPUs could be fatal. So although the tech giants are putting a great deal of effort into designing their own hardware, they will continue to rely on Nvidia for some time.
