
As the parameter scale of large models grows, demand for computing power is rising exponentially. At the 2024 Baidu Cloud Intelligence Conference held on September 25, Shen Dou, executive vice president of Baidu Group and president of Baidu AI Cloud Business Group, said that the well-known scaling law for large models still holds. The law states that model performance improves as parameters, computing power, and dataset size increase, and, he said, "soon, more 100,000-card computing power clusters will appear."
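The scaling law Shen Dou refers to is commonly written as a power law in parameters and training data. A minimal sketch of that shape follows; the coefficients are the published Chinchilla (Hoffmann et al., 2022) fits, used purely to illustrate the curve, and are not Baidu's numbers:

```python
# Illustrative power-law scaling of pretraining loss:
#   L(N, D) = E + A / N**alpha + B / D**beta
# Coefficients below are the Chinchilla fitted values, shown only to
# demonstrate the shape of the law -- they are assumptions here, not
# anything reported by Baidu.
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def loss(n_params: float, n_tokens: float) -> float:
    """Predicted pretraining loss for N parameters and D training tokens."""
    return E + A / n_params**alpha + B / n_tokens**beta

# Loss falls monotonically as either parameters or data grow:
for n in (1e9, 1e10, 1e11):  # 1B -> 100B parameters, fixed 1T tokens
    print(f"N={n:.0e} params, D=1e12 tokens -> loss = {loss(n, 1e12):.3f}")
```

Under this form, performance gains never stop but flatten toward the irreducible term `E`, which is why ever-larger clusters are needed for each further increment.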
According to Shen Dou, demand for model training from customers has surged over the past year. "The industrial adoption of large models is accelerating in 2024," he said. "On the Qianfan large model platform, the Wenxin large models are now invoked more than 700 million times a day, and the platform has helped users fine-tune 30,000 large models and develop more than 700,000 enterprise-level applications."
Growing demand for large model training means ever-larger computing clusters, while expectations of continuously falling inference costs keep rising. These trends, Shen Dou said, place higher demands on the stability and efficiency of GPU management. On September 25, Baidu upgraded its AI heterogeneous computing platform to Baige 4.0, which can deploy and manage 100,000-card clusters.
Shen Dou said GPU computing clusters have three characteristics: extreme scale, extreme density, and extreme interconnection. GPU procurement alone for a 10,000-card cluster can cost billions of yuan. Building computing resources is not simply a matter of buying GPUs and wiring them together, he emphasized; it requires substantial technology. GPU chip models are more diverse and harder to manage, GPUs must perform massive amounts of parallel computation, and data transfer volumes and speed requirements keep climbing. The Baige platform therefore needs to support heterogeneous chips, high-speed interconnects, and efficient storage.
Shen Dou also noted that managing a 100,000-card cluster is fundamentally different from managing a 10,000-card one. First, at the physical level, a 100,000-card cluster occupies roughly 100,000 square meters, about the area of 14 standard football fields. Second, in energy terms, its servers consume roughly 3 million kilowatt-hours of electricity per day, comparable to the daily residential electricity consumption of Beijing's Dongcheng District. These space and energy demands far exceed what traditional data center deployment methods can handle, and splitting the deployment across regions introduces major networking challenges. In addition, GPU failures in a 100,000-card cluster are very frequent, so keeping the proportion of effective training time high becomes a new challenge.
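The reported 3 million kWh/day figure can be sanity-checked with back-of-the-envelope arithmetic. The per-GPU power draw and data-center overhead factor below are assumed values chosen for illustration, not figures from the article:

```python
# Back-of-the-envelope check of the ~3 million kWh/day figure for a
# 100,000-card cluster. The per-card draw and PUE are assumptions.
gpus = 100_000
kw_per_gpu = 1.0   # assumed average draw per accelerator incl. host server
pue = 1.25         # assumed power usage effectiveness (cooling etc. overhead)

avg_power_mw = gpus * kw_per_gpu * pue / 1000   # average draw in megawatts
kwh_per_day = gpus * kw_per_gpu * pue * 24      # energy per day in kWh

print(f"average draw  = {avg_power_mw:.0f} MW")
print(f"daily energy  = {kwh_per_day / 1e6:.1f} million kWh")
```

With these assumptions the cluster draws about 125 MW continuously, which works out to 3 million kWh per day, consistent with the figure Shen Dou cited.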
To address these challenges, Shen Dou said, Baige 4.0 provides a congestion-free HPN high-performance network at the 100,000-card scale, 10 ms-level ultra-high-precision network monitoring, and minute-level fault recovery for 100,000-card clusters. "Baige 4.0 is designed for deploying 100,000-card clusters. Today it already has mature capabilities for deploying and managing them, aiming to overcome these new challenges and provide a continuously leading computing platform for the whole industry," Shen Dou said.
Baidu is not alone: more and more tech giants are upgrading their computing infrastructure to meet the demands of large AI models. In early September, Elon Musk announced that Colossus, the AI training supercluster built by his startup xAI, had officially come online with 100,000 Nvidia H100 GPUs, with plans to double the GPU count in the coming months. At the Yunqi Conference on September 19, 2024, Alibaba Cloud likewise stated that GPU-based AI computing power will be the dominant computing paradigm in the future, and that it is upgrading its AI infrastructure across chips, servers, networking, storage, cooling, power supply, data centers, and other areas.