
In the AI chip market long dominated by Nvidia, there had been little news for some time. But just after the Spring Festival ushering in the Year of the Dragon, a startup called Groq snatched the title of "fastest" AI inference chip from the incumbent.
Groq claims that the inference performance of its LPU (Language Processing Unit) is 10 times that of Nvidia's GPU (Graphics Processing Unit), at one tenth the cost. Running large models on its self-developed LPU, Groq generates output at nearly 500 tokens per second (a token being the smallest unit of text), far exceeding GPT-3.5's roughly 40 tokens per second.
This sparked widespread discussion on social media. On February 19th, Groq opened a product trial to users, and speed is the most immediate impression left by the open-source large models it accelerates. One user logged into the company's website, typed in a question, and got an answer almost instantly, at 278 tokens per second. Some commented, "It responds faster than I can blink."
However, while Groq's LPU delivers fast inference, that speed is also very expensive, costing far more than general-purpose GPUs. The LPU is also somewhat specialized: at present it runs only two open-source models, Mixtral 8x7B-32K and Llama 2-70B-4K. Regarding which models Groq's LPU will support in the future, a reporter from China Business Daily contacted the company for an interview, but no response had been received as of publication.
Zhang Guobin, CEO of Electronic Innovation Network, told reporters: "Any artificial intelligence algorithm can run on Nvidia's H100, but only Mixtral and Llama 2 can run on Groq's LPU. The LPU is an ASIC (application-specific chip) that can only be applied to specific models; its versatility is poor and its cost-effectiveness is not high. It is not worth hyping, lest it mislead AI chip companies about their development direction."
Shocking the world with "speed"
Although some have called it the "fastest large model in history," Groq clarified: "We are not a large language model. Our LPU inference engine is a new type of end-to-end processing unit system that provides the fastest inference for computationally intensive applications with a sequential component, such as AI language applications."
Groq was reportedly founded in 2016 by Jonathan Ross, one of the creators of Google's first-generation Tensor Processing Unit (TPU). He believes chip design should draw inspiration from software-defined networking (SDN).
Ross has said that Groq exists to eliminate the wealth gap and help everyone in the AI community thrive. He added that inference is crucial to that goal, because speed is what turns developers' ideas into business solutions and life-changing applications.
In 2021, Groq raised $300 million in a round led by the well-known investors Tiger Global Management and D1 Capital, bringing its total funding to $367 million.
At the SC23 high-performance computing conference in 2023, Groq generated replies at more than 280 tokens per second, breaking the performance record for Llama 2-70B inference. In January 2024, Groq took part in public benchmarking for the first time and posted outstanding results on Anyscale's LLMPerf leaderboard, far ahead of other GPU-based cloud service providers.
On February 13th, Groq again topped the latest LLM benchmark on ArtificialAnalysis.ai, beating eight competitors on key performance indicators such as latency and throughput. Its throughput was four times that of other inference services, while its fees were less than one third of what Mistral itself charges.
The core of Groq's innovation lies in its LPU, which aims to accelerate AI models, including language models like ChatGPT, at unprecedented speed. According to Groq's official website, the LPU (Language Processing Unit) is a new type of end-to-end processing unit system that provides the fastest inference for computationally intensive applications with a sequential component, such as large language models (LLMs).
Why is the LPU so much faster than a GPU for LLMs and generative AI? Groq's website explains that the LPU is designed to overcome the two bottlenecks of LLMs: compute density and memory bandwidth. For LLMs, the LPU offers greater computing capacity than a GPU or CPU, which shortens the time needed to compute each word and lets text sequences be generated faster. In addition, by eliminating the external memory bottleneck, the LPU inference engine can deliver performance on LLMs that is orders of magnitude higher than a GPU's.
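To make the memory-bandwidth argument concrete, here is a minimal back-of-envelope sketch. It assumes batch-size-1 autoregressive decoding of an INT8-quantized 70-billion-parameter model and an H100-class HBM bandwidth of roughly 3.35 TB/s; both are illustrative assumptions rather than figures from Groq or this article.

```python
# Back-of-envelope: in batch-size-1 autoregressive decoding, every generated token
# must stream all model weights through memory once, so tokens/s is roughly bounded
# by memory_bandwidth / model_size_in_bytes. Figures below are assumptions.

params = 70e9            # a 70B-parameter model, as in Llama 2-70B
bytes_per_param = 1      # assume INT8 weights
model_bytes = params * bytes_per_param          # ~70 GB of weights

hbm_bw = 3.35e12         # ~3.35 TB/s, assumed H100-class HBM bandwidth
gpu_bound = hbm_bw / model_bytes
print(f"HBM-bound ceiling: ~{gpu_bound:.0f} tokens/s per sequence")   # ~48 tokens/s
```

Under these assumptions, a single HBM-bound GPU tops out at roughly 48 tokens per second per stream, which is why keeping weights entirely in on-chip SRAM, as the LPU does across many chips at 80 TB/s each, raises the ceiling so dramatically.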
The cost of being fast is a bit high
It is worth noting that, unlike GPUs that rely on high-bandwidth memory (HBM), Groq's LPU uses SRAM for data storage. This design is not in itself a breakthrough: Baidu's Kunlun chip and the UK's Graphcore reportedly use similar on-chip memory approaches.
In addition, the Groq LPU is built on a new tensor streaming processor (TSP) architecture, in which memory units are interleaved with vector and matrix deep learning functional units, exploiting the inherent parallelism of machine learning workloads to accelerate inference.
Alongside computation, each TSP also has built-in network switching, allowing it to exchange data directly with other TSPs without relying on external networking equipment. This design improves the system's parallelism and efficiency.
Groq supports model inference through a range of machine learning frameworks, including PyTorch, TensorFlow, and ONNX, but the LPU inference engine cannot be used for ML training.
As for what makes the Groq chip unique, k_zeros, an investor close to Groq, explained on his X account that the LPU operates differently from a GPU: it uses a temporal instruction set computer architecture rather than the SIMD (single instruction, multiple data) model used by GPUs. This design spares the chip from having to repeatedly reload data from HBM the way GPUs do.
The Groq chip uses SRAM, which is roughly 20 times faster than the memory used by GPUs. This also helps it sidestep HBM shortages and cut costs, since HBM supply depends not only on Samsung and SK Hynix but also on TSMC's CoWoS packaging technology.
Further details show that Groq's chip is built on a 14nm process and carries 230MB of SRAM to guarantee memory bandwidth, with on-chip memory bandwidth of up to 80TB/s. On the compute side, the chip delivers 750 TOPS of integer (INT8) performance and 188 TFLOPS of floating-point (FP16) performance.
Once the initial shock wore off, many industry experts found that Groq's speed comes at a steep price.
Jia Yangqing, a former Facebook AI scientist and former vice president of technology at Alibaba, pointed out that the Groq LPU's memory capacity is very small (230MB). A simple calculation shows that running a 70-billion-parameter model would require 305 Groq cards, versus 8 Nvidia H100 cards. At current prices, this means that at the same throughput, the hardware cost of the Groq LPU is about 40 times that of the H100, and its energy cost about 10 times.
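Jia Yangqing's card count can be reproduced with simple arithmetic. The sketch below assumes INT8 weights (one byte per parameter), so a 70-billion-parameter model occupies about 70 GB, and it ignores the KV cache and activations; the 80 GB HBM capacity used for the H100 side is a standard spec rather than a figure from the article.

```python
import math

# How many 230 MB-SRAM cards does it take just to hold the weights of a 70B model?
# Assumes INT8 weights (1 byte per parameter); KV cache and activations are ignored.

params = 70e9
model_gb = params * 1 / 1e9                 # ~70 GB of INT8 weights

sram_per_lpu_gb = 0.230                     # 230 MB of SRAM per Groq card
lpu_cards = math.ceil(model_gb / sram_per_lpu_gb)
print(f"LPU cards needed for weights alone: {lpu_cards}")   # 305

# For comparison, a single 80 GB H100 can hold the same weights; the article's
# comparison uses an 8-card H100 server as the practical deployment unit.
```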
Chip expert Yao Jinxin (known as "Uncle J") told reporters that, for the same computing power using INT8 inference, Groq's solution would require a cluster of 9 servers containing 72 chips, whereas with the H100, roughly two 8-card servers would deliver the same computing power. At that point the INT8 computing power reaches 64P, and more than 80 7B-parameter models can be deployed simultaneously. From a cost perspective, 9 Groq servers also cost far more than 2 H100 servers.
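The compute-parity numbers can be checked in the same rough way. The sketch below uses the 750 TOPS INT8 figure quoted earlier for the Groq chip, assumes 8 chips per Groq server (implied by the 72-chip total), and assumes roughly 4,000 sparse INT8 TOPS per H100; the H100 figure is an assumption, not something stated in the article.

```python
# Rough compute parity: 9 Groq servers (72 chips) vs two 8-card H100 servers (16 GPUs).
# 750 TOPS INT8 per Groq chip is the published spec cited above; 4,000 TOPS is an
# assumed sparse INT8 figure for the H100.

groq_chips = 9 * 8                          # 72 LPUs
groq_int8_tops = 750
h100_gpus = 2 * 8                           # 16 H100s
h100_int8_tops = 4000

groq_total_p = groq_chips * groq_int8_tops / 1000   # ~54P
h100_total_p = h100_gpus * h100_int8_tops / 1000    # ~64P, matching the figure above
print(f"Groq cluster: ~{groq_total_p:.0f}P INT8; H100 cluster: ~{h100_total_p:.0f}P INT8")
```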
On third-party websites, accelerator cards equipped with Groq chips are priced at over $20,000 (roughly 150,000 RMB), below the H100's $25,000 to $30,000. In short, Groq's architecture pairs small memory with large compute: the limited amount of content each card can hold is matched with extremely high computing power, which is what makes it so fast. The flip side is that this extreme speed rests on limited per-card throughput, so matching the H100's overall throughput requires many more cards.
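Combining the quoted card prices with the ~305-card estimate above gives a rough, hardware-only cost ratio. The sketch counts cards only, using the midpoint of the quoted H100 price range, so it comes in below the roughly 40x system-level figure Jia Yangqing cited, which presumably also reflects servers, interconnect, and power.

```python
# Card-only cost comparison at equal model capacity, using prices quoted in the article.
# Excludes chassis, interconnect, and power, so it understates the full system gap.

lpu_card_price = 20_000        # USD, third-party listing cited above
h100_card_price = 27_500       # USD, midpoint of the quoted $25,000-$30,000 range

lpu_cost = 305 * lpu_card_price            # ~$6.1M for the 305-card estimate
h100_cost = 8 * h100_card_price            # ~$220k for an 8-card H100 deployment
print(f"Card-only cost ratio: ~{lpu_cost / h100_cost:.0f}x")   # roughly 28x
```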
LPU is a bit specialized
It should be pointed out that Groq currently supports only three open-source large models: Mixtral 8x7B-32K, Llama 2-70B-4K, and Mistral 7B-8K. The first two have been opened for use and adapted to run on its compiler.
Regarding this, Zhang Guobin said: "Any artificial intelligence algorithm can run on Nvidia's H100, but only Mixtral and Llama 2 can run on Groq's LPU. If large model companies want to use Groq's products, they first need to define their requirements and specifications, then carry out functional verification, and only then can a usable product be delivered."
Zhang Guobin noted that Groq's LPU is a chip purpose-built for large models, so its speed is hardly surprising. "It is faster, more efficient, and saves on electricity, which makes it fairly cost-effective. There should be a future market for it, for example in intelligent agents and portable devices that run large models," he said.
Even so, Zhang Guobin said he is not optimistic about the LPU, because its limitations are too great and it works only with specific models. "It may support more large models in the future, but it will still fall short of general-purpose chips. Some tests I have seen show that its accuracy is insufficient." He used a metaphor to explain the accuracy problem: it is as if, in a city with complex traffic, the LPU collected the direction of everyone's morning commute, used software to set the traffic signals, switched off all the lights along one road, and then let every vehicle heading the same way simply drive straight down it.
"It is an ASIC chip that can only be applied to specific models, with poor versatility and low cost-effectiveness. It is not worth hype to avoid misleading chip companies in the field of artificial intelligence in their development direction." Zhang Guobin also stated that artificial intelligence needs to penetrate into various industries, and it is not possible to have an ASIC for every scenario. It is better to have a universal GPU, but in fact, it requires an AI processor that can be used in multiple scenarios.
Groq's business model targets large systems and enterprise deployments; it does not sell individual cards or chips. Because it owns the entire technology stack from chip to system and there are no intermediaries, it can offer a price advantage per token. In an interview at the end of 2023, Ross said that, given GPU shortages and high costs, he believes in Groq's growth potential: "In 12 months we can deploy 100,000 LPUs, and in 24 months we can deploy 1 million LPUs."
Which is better, general-purpose or specialized? Time will tell. In the meantime, on February 22nd local time in the United States, Nvidia's shares closed at $785.38, up 16.4%, driven by an earnings report that beat expectations. Its market value surged by $273.3 billion (roughly RMB 2 trillion) in a single day, the largest single-day market value gain in US stock market history.
Nvidia's overnight gain in market value was equivalent to adding an entire Netflix or Adobe, and approached half a JPMorgan Chase or two Goldman Sachs. Its market value reached a new all-time high approaching $2 trillion, making it the world's third most valuable company after Microsoft and Apple.