A well-timed strike ahead of Nvidia's earnings report? This unicorn is pushing hard into AI inference, claiming the world's fastest speed without using HBM
海角七号
Posted on 2024-8-28 15:16:08
After the market close on Wednesday local time, Nvidia will release its Q2 report, the last heavyweight earnings release of the season, and global investors are on edge. The day before (August 27 local time), the US AI processor-chip unicorn Cerebras Systems unveiled what it calls the world's fastest AI inference service, built on its own chip-based computing system and claimed to be 10 to 20 times faster than systems built on Nvidia's H100 GPU.
Nvidia GPUs currently dominate both AI training and inference. Since launching its first AI chip in 2019, Cerebras has focused on selling AI chips and computing systems, aiming to challenge Nvidia in the field of AI training.
According to a report by the American technology outlet The Information, OpenAI's revenue is expected to reach $3.4 billion this year, driven largely by its AI inference services. With the inference market that large, Cerebras co-founder and CEO Andrew Feldman said the company needs to claim a place in it as well.
By launching an AI inference service, Cerebras is not only opening up its AI chips and computing systems, but also mounting a broad attack on Nvidia with a second revenue curve based on usage. The goal, Feldman said, is to "steal enough market share from Nvidia to make them angry."
Fast and cheap
Cerebras' AI inference service offers significant advantages in both speed and cost. Measured in output tokens per second, Feldman said, Cerebras' AI inference is 20 times faster than the inference services run by cloud providers such as Microsoft Azure and Amazon AWS.
At the press conference, Feldman ran Cerebras' and Amazon AWS's inference services side by side. Cerebras completed the inference and output almost instantly, at a processing speed of 1,832 tokens per second, while AWS took several seconds to finish at only 93 tokens per second.
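The roughly 20x claim follows directly from the two throughput figures demonstrated above; a quick sanity check:

```python
# Throughput figures from the live demo (tokens per second).
cerebras_tps = 1832
aws_tps = 93

speedup = cerebras_tps / aws_tps
print(f"Speedup: {speedup:.1f}x")  # about 19.7x, i.e. roughly 20x
```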
Faster inference, Feldman said, makes real-time interactive voice responses possible, and allows chaining multiple rounds of results, more external sources, and longer documents to produce more accurate and relevant answers, a qualitative leap for AI inference.
Beyond speed, Cerebras also claims a large cost advantage. Feldman stated that Cerebras' inference service delivers 100 times better price-performance than AWS and others. Running Meta's open-source Llama 3.1 70B large language model, for example, Cerebras charges only 60 cents per million tokens, versus $2.90 per million tokens for the same service from typical cloud providers.
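Note that the listed prices alone differ by only about 4.8x; the 100x figure presumably folds in the roughly 20x throughput advantage as well (price-performance rather than price). A small sketch of the raw cost comparison, using the per-million-token prices quoted above:

```python
# Per-million-token prices quoted in the article (USD).
cerebras_price = 0.60   # Cerebras, Llama 3.1 70B
cloud_price = 2.90      # typical cloud provider, same model

tokens = 1_000_000_000  # cost of one billion output tokens
cerebras_cost = tokens / 1_000_000 * cerebras_price
cloud_cost = tokens / 1_000_000 * cloud_price

print(f"Cerebras: ${cerebras_cost:,.0f}, typical cloud: ${cloud_cost:,.0f}")
print(f"Price ratio: {cloud_price / cerebras_price:.1f}x")
```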
56 times the area of the largest current GPU
Cerebras' inference service is fast and cheap because of the design of its WSE-3 chip, the third-generation processor Cerebras launched in March this year. It is enormous: it takes up nearly an entire 12-inch wafer, is larger than a book, and measures about 462.25 square centimeters, 56 times the area of the largest current GPU.
Unlike Nvidia's designs, the WSE-3 does not use separate high-bandwidth memory (HBM) that must be accessed over an interface. Instead, it embeds memory directly on the chip.
Thanks to its sheer size, the WSE-3 carries up to 44 GB of on-chip memory, nearly 900 times that of the Nvidia H100, with 7,000 times the H100's memory bandwidth.
Memory bandwidth, Feldman stated, is the fundamental factor limiting language-model inference performance. By integrating logic and memory on one giant chip, with massive on-chip memory and extremely high memory bandwidth, Cerebras can process data and generate inference results at a speed GPUs cannot reach.
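Why bandwidth bounds decode speed can be sketched with a rough roofline-style estimate: generating each token requires streaming every model weight through the compute units once, so peak single-user tokens per second is roughly memory bandwidth divided by model size in bytes. The bandwidth and precision figures below are illustrative assumptions, not official vendor specifications:

```python
# Back-of-envelope decode ceiling: each generated token reads all weights once,
# so tokens/s is bounded by (memory bandwidth) / (model size in bytes).
# Illustrative assumptions, not vendor specs.
model_params = 70e9          # Llama 3.1 70B parameter count
bytes_per_param = 2          # assume 16-bit weights
model_bytes = model_params * bytes_per_param

hbm_bw = 3.35e12             # ~3.35 TB/s, an H100-class HBM figure (assumed)
sram_bw = 21e15              # ~21 PB/s, wafer-scale on-chip SRAM (assumed)

for name, bw in [("HBM GPU", hbm_bw), ("wafer-scale SRAM", sram_bw)]:
    # Upper bound per user; real systems batch requests and fall below this.
    print(f"{name}: ~{bw / model_bytes:.0f} tokens/s per user (upper bound)")
```

Under these assumptions the HBM-bound ceiling lands in the tens of tokens per second per user, while on-chip SRAM raises it by orders of magnitude, which is consistent with the demo numbers above.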
Beyond its speed and cost advantages, the WSE-3 chip is equally capable at both AI training and inference, performing well across a wide range of AI tasks.
According to the plan, Cerebras will build AI inference data centers in multiple locations and charge for inference capacity by request volume. Meanwhile, it will also try to sell the WSE-3-based CS-3 computing system to cloud service providers.
CandyLake.com is an information-publishing platform and provides only information-storage services.
Disclaimer: the views in this article are the author's own; they do not represent the position of CandyLake.com and do not constitute advice. Please treat them with caution.
You may also like
- Nvidia debuts its first two new AI hardware models ahead of its third-quarter report
- Over 10,000 Nvidia Blackwell chips delivered; Jensen Huang responds to tariff issues
- Financial analysis: Nvidia's Q4 guidance falls short of the highest expectations; the stock fell more than 5% after the close
- Global Finance: Markets watch Nvidia's results; the three major New York stock indexes fluctuated on the 20th
- Nvidia's Q4 guidance fell short of the highest expectations, and its stock fell more than 5% after hours
- Nvidia's third-quarter revenue reached $35.082 billion
- Nvidia's growth slows and Jensen Huang steps in to reassure the market! Analyst: investors underestimate demand for Blackwell chips
- Nvidia's Q4 guidance falls short of the highest expectations, with the stock dropping over 5% after the close
- Stock price soars 33%! Snowflake outshines Nvidia; analysts: AI software outperforming semiconductors may be the trend
- The three major US stock indexes closed higher, with the Dow Jones Industrial Average up more than 1%; Nvidia's stock hit a new intraday high