Pika raises funding and Kwai's Kling goes live: why is Apple's AI product "burning the cold stove"?

Apple Inc. (AAPL.US) launched an AI product called Apple Intelligence at its WWDC developer conference, yet its stock closed down 1.91% that day. Interestingly, on June 11th the Sora index (8841756.WI) tracked by Wind rose 1.55%.
Why is there such a difference?
Apple chose to steer clear of the currently hot video models and instead launched AI updates that focus more on text. The rise of domestic concept stocks, by contrast, is closely tied to the recent popularity of text-to-video models. Abroad, the high-profile AI video generation company Pika has completed a new round of financing, raising a total of 80 million US dollars in its Series B, and the company's valuation will exceed 470 million US dollars. In China, the "Kling" video generation model from Kwai (1024.HK) has officially launched, adopting a technical route similar to Sora's.
In the eyes of many industry insiders, Apple's focus on integrating text AI rather than video is driven mainly by considerations such as cost and practicality.
Apple avoids Sora's "battle zone"
The built-in large language model that Apple launched lets iPhone, iPad, and Mac understand and generate language and images. By connecting to ChatGPT, Siri gains a semantic retrieval function and can intelligently search photos, calendars, files, emails, and other content. Users can also access most of ChatGPT's functions without registering.
Ming-Chi Kuo, an analyst at TF International Securities, wrote in a brief review that Apple's newly released Apple Intelligence suite demonstrates the advantages of ecosystem integration and interface design. It is very practical for users, but for investors it is merely icing on the cake; they are hoping to see original, must-have features.
Han Xu, Chief Researcher at ModelBest, told reporters that from the perspective of operating-system integration, Apple mainly needs AI to understand human intentions and call system-level interfaces. These requirements do not fully align with Sora's starting point; they are better matched by large models that take multimodal input and produce text output. Models like Sora that generate images or videos are currently better suited to integration with applications, especially visual processing software.
Why didn't Apple join Sora's "battle"?
A person at an AIGC video application company told reporters that, from a product and business perspective, Apple will only focus on areas that are relatively mature and offer a clearer, more visible return on investment. In mobile hardware interaction, text has more usage scenarios. From R&D investment to actual inference costs, text is also the more cost-effective field given Apple's current technological accumulation.
Another industry technician said that today's LLM (large language model) services have basically reached breakeven in text, but not necessarily in text-to-image or text-to-video. This is also an important reason why Apple did not integrate video AIGC capabilities at WWDC.
In contrast to Apple's approach, the domestic large-model market currently has high expectations for video. In April this year, Professor Zhu Jun, vice dean of the Institute for Artificial Intelligence at Tsinghua University and co-founder and chief scientist of Shengshu Technology, released Vidu, described as China's first video model, on behalf of Tsinghua University and Shengshu Technology. More recently, the "Kling" video model launched by Kwai also sparked heated discussion.
The reporter used the text of Sora's showcase videos as prompts and fed them to Kwai's "Kling" to generate videos for comparison. Take "a girl walking down a Tokyo street" as an example: Sora's video at the time contained errors such as deformed legs, legs crossing and swapping positions incorrectly, and the right leg stepping forward twice in a row. Kwai's "Kling" shows similar problems.
TF Securities believes that Kwai's improvements to its 3D VAE + DiT architecture in computing power, model, and data quality show that it can achieve commercial results. At the same time, customizable duration and aspect ratio greatly improve the usability of the generated material. Although it falls short of Sora in understanding some complex semantics, there is little difference in simpler scenarios.
Multimodality becomes an opportunity in China's large-model race
An excellent video generation model needs to account for four core elements: model design, data assurance, computational efficiency, and the expansion of model capabilities.
Regarding Sora's immaturity, OpenAI has stated that Sora may struggle to accurately simulate the physics of complex scenes, may not understand causal relationships, may confuse spatial details in a prompt, and may have difficulty precisely depicting events that unfold over time, such as following a specific camera trajectory.
But this seems to be a common problem. Wang Changhu, founder of Aishi Technology, previously said that current video models learn physics directly from video data, but real videos often contain so much information that it is hard to learn each physical law accurately on its own. Feeding the model 3D modeling information, such as for human hands and animal tails, as constraints alongside the visual input can assist the model's learning and improve results.
The Kling large model adopts a native text-to-video technical route, replacing the combination of an image generation model and a temporal module. At present, mainstream video generation models usually follow Stable Diffusion in using a 2D VAE for spatial compression when encoding to and decoding from latent space, but for video this leaves significant information redundancy. The Kwai large-model team therefore developed its own 3D VAE network, seeking a balance between training performance and output quality. In addition, for modeling temporal information, the team designed a 3D Attention mechanism as its spatio-temporal modeling module.
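The redundancy argument can be made concrete with a little arithmetic. The sketch below is only illustrative: it counts latent positions for a clip when each frame is compressed separately (a 2D VAE) versus when the time axis is compressed as well (a 3D VAE). The function name, the 8x spatial factor, the 4x temporal factor, and the clip dimensions are all assumptions chosen for the example, not figures published by Kwai.

```python
def latent_positions(frames, height, width, spatial_factor, temporal_factor=1):
    """Count latent grid positions for a video clip.

    spatial_factor: downsampling applied to height and width (assumed value).
    temporal_factor: downsampling applied along time; 1 means each frame is
    encoded independently, as with a per-frame 2D VAE (assumed value).
    """
    latent_frames = -(-frames // temporal_factor)  # ceiling division
    latent_height = height // spatial_factor
    latent_width = width // spatial_factor
    return latent_frames * latent_height * latent_width


# Hypothetical clip: 120 frames at 720x1280.
per_frame_2d = latent_positions(120, 720, 1280, spatial_factor=8)
spatio_temporal_3d = latent_positions(120, 720, 1280, spatial_factor=8,
                                      temporal_factor=4)

print(per_frame_2d)                       # 1728000
print(spatio_temporal_3d)                 # 432000
print(per_frame_2d / spatio_temporal_3d)  # 4.0
```

Under these assumed factors, compressing along time cuts the latent sequence by the temporal factor, which matters because attention cost in a DiT-style model grows with the number of latent positions.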
Tang Jiayu, CEO of Shengshu Technology, said that research on multimodal large models is still at an early stage and the technology is not yet mature. This differs from the red-hot field of language models, where foreign players are already a generation ahead. Rather than slogging it out over language models, Tang Jiayu believes multimodality is an important opportunity for domestic teams to gain a foothold in the large-model race. Zhou Zhifeng, a partner at Qiming Venture Partners, takes a similar view, saying that today's large models have gradually moved from pure language toward multimodal exploration.
Lin Yonghua, Vice President and Chief Engineer of the Beijing Academy of Artificial Intelligence (BAAI), told Yicai reporters that China has some chance of leapfrogging competitors in the multimodal field, but the success factors for multimodal models remain computing power, algorithms, and data. At the algorithmic level there is currently no significant gap between Chinese and American teams, and the industry still has ways to address computing power constraints. Obtaining massive amounts of high-quality data, however, remains very difficult.
