Stunning for a day, then a flop? Google's 6-minute "Gemini" demo video revealed to have been edited
白云追月素
Posted on 2023-12-8 20:26:37
After Bard's botched debut at the beginning of the year, Google launched its large model Gemini on December 7th, Beijing time, and released a series of dazzling demonstration videos. Can Gemini take on GPT-4 this time?
The most striking of these is a 6-minute demonstration in which Gemini comments on what the tester is doing, such as drawing and performing magic tricks, and interacts with the tester in real time. Judging by the video alone, Gemini's understanding appears to reach human level.
"From the content of the demonstration alone, Gemini's video understanding is undoubtedly the most advanced available today," a large-model algorithm engineer in Beijing said in an interview with a Beijing News Shell Finance reporter. "This ability comes from Gemini natively incorporating a large amount of video data during training and supporting video understanding at the architecture level."
However, just one day after the release, many users found in their own tests that Gemini's video understanding was not as fluid as in the demonstration. Google quickly published a blog post explaining the multimodal interaction behind the demo video, all but acknowledging that the effect was achieved with still images and multiple prompts. Some netizens also noticed an important disclaimer in the demo video: for the purposes of the demonstration, latency has been reduced and Gemini's outputs have been shortened.
Nevertheless, in the eyes of many professionals, Google has finally launched a large model that can compete with OpenAI. As an established artificial intelligence company, Google has deep reserves, and Gemini is set to become a strong competitor to GPT.
Where was it edited? How does the demonstration video differ from reality?
"Have you watched the demo video of Google's latest large model? The multimodal switching is a qualitative change, especially in the map game, where a person might not react that fast." On December 7th, Mr. Liu, a website developer, sent the demonstration video to a Shell Finance reporter.
In this demonstration video of Google's large model Gemini, which thrilled many practitioners, the tester took out a piece of paper and Gemini immediately replied, "You took out a piece of paper." As the tester drew curves and colored them in, Gemini immediately "understood" and kept commenting on the tester's actions: "You are drawing a curve. It looks like a bird. It is a duck, but blue ducks are not common; most ducks are brown. The Chinese word for duck is pronounced 'yazi', and Chinese has four tones." When the tester placed a blue rubber duck on a world map, Gemini spotted it at once. "This duck has been placed in the middle of the sea. There aren't many ducks here," it said.
The tester then began to interact with Gemini through gestures. When the tester made the hand shapes for scissors and paper, Gemini "answered": "You're playing rock, paper, scissors." After that, Gemini even guessed the eagle and the dog that the tester imitated with their hands.
However, a Shell Finance reporter found many traces of editing in this video. In the rock, paper, scissors segment, for example, the tester's movements while throwing each hand shape were clearly cut. In response, Google published a blog post offering "Q&A and clarification": given a picture of the "paper" gesture alone, Gemini answered, "I see a right hand, with the palm open and the five fingers spread apart"; given a picture of the "rock" gesture, Gemini answered, "A person knocking on a door"; given a picture of the "scissors" gesture, Gemini answered, "I see a hand with the index and middle fingers extended." Only when the three pictures were put together and Gemini was asked "What do you think I'm doing?" did it answer, "You're playing rock, paper, scissors."
So although Gemini's answers are genuine, real-world use may not be as smooth as the demonstration video suggests.
Source: "Gemini" demonstration video released by Google.
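The difference between the two prompting styles described in Google's clarification can be sketched in a few lines of Python. This is only an illustration of how the requests are shaped, not Google's API: the `ImagePart` and `TextPart` types, the file names, and the per-frame question "What do you see?" are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class ImagePart:
    """A still frame, identified here only by a hypothetical file name."""
    path: str


@dataclass
class TextPart:
    """A piece of text sent alongside the images."""
    text: str


Part = Union[ImagePart, TextPart]


def single_frame_prompts(frames: List[str], question: str) -> List[List[Part]]:
    """One request per frame: the model sees each hand shape in isolation."""
    return [[ImagePart(f), TextPart(question)] for f in frames]


def combined_prompt(frames: List[str], question: str) -> List[Part]:
    """One request holding every frame, followed by a single question."""
    parts: List[Part] = [ImagePart(f) for f in frames]
    parts.append(TextPart(question))
    return parts


frames = ["paper.jpg", "rock.jpg", "scissors.jpg"]  # hypothetical file names
separate = single_frame_prompts(frames, "What do you see?")  # assumed question
together = combined_prompt(frames, "What do you think I'm doing?")
print(len(separate), "separate requests;", len(together), "parts in the combined request")
```

In the first style the model can only describe one hand at a time; only the combined request gives it the context needed to name the game.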
How was the multimodal capability forged?
After this demonstration, many industry insiders also acknowledge that Google has indeed taken a step forward in catching up with OpenAI. Before ChatGPT emerged, Google had long held a leading position in artificial intelligence, but ChatGPT's success put it under heavy pressure. In February of this year Google launched Bard as its answer to ChatGPT, and since that debut went wrong it had lacked a large model good enough to boost morale.
With Gemini, Google has at least shown distinctive strengths in multimodal understanding. "Gemini is a natively multimodal large model, which means it was multimodal from training onward. Google already has a strong ecosystem in search, long-form video, online documents, and more. In addition, Google has a great many graphics cards and several times OpenAI's computing power. Now it is burning through its reserves to catch up with OpenAI," a large-model practitioner who graduated from Tsinghua University with a degree in automation told a Shell Finance reporter.
Specifically, Gemini comes in three versions: Gemini Ultra, the largest and most capable; Gemini Pro (the "large cup"), suited to a wide range of tasks; and Gemini Nano (the "medium cup"), intended for specific tasks and mobile devices.
Beyond its multimodal abilities, Gemini also performs well in areas such as text comprehension and coding. On the MMLU (Massive Multitask Language Understanding) benchmark, Gemini Ultra not only surpassed GPT-4 but also outperformed human experts. A Shell Finance reporter visited Google DeepMind's official website and found the phrase "Witness Gemini - Our Most Capable Large Model" on the homepage.
At present, users can try Gemini Pro through Google Bard, but Shell Finance reporters found that it is available only in some regions. Tests by overseas netizens show that Gemini accepts both images and text as input. According to those tests, Gemini Pro and GPT-4V, which is also multimodal, each have their strengths across many questions, and Gemini Pro was not outclassed by GPT-4V.
"Based on my observation, Gemini's text ability is still slightly inferior to GPT-4's, but Google's technical strength remains in the first tier," said the large-model algorithm engineer quoted above.
He told a Shell Finance reporter that giving a large model the "multimodal ability" to understand images, video, and sound can be seen, technically, as extending the image understanding module of LLaVA (a multimodal pre-trained model) to video and speech, and adding video and audio data during training. "In effect, Gemini is the first to build video and speech understanding into a large model, which verifies that both are feasible in a large model."
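The engineer's description can be pictured with a toy sketch. LLaVA projects the output of an image encoder into the language model's embedding space; the code below assumes, purely for illustration, that video and audio encoders get the same treatment. It says nothing about Gemini's actual architecture: the dimensions, the random weights, and the `make_projection` and `project` helpers are all hypothetical.

```python
import random
from typing import List

Vector = List[float]


def make_projection(in_dim: int, out_dim: int, seed: int) -> List[Vector]:
    """Random in_dim x out_dim matrix standing in for a trained projection layer."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(out_dim)] for _ in range(in_dim)]


def project(features: List[Vector], weights: List[Vector]) -> List[Vector]:
    """Map each encoder feature vector into the language model's embedding space."""
    out_dim = len(weights[0])
    return [
        [sum(f[i] * weights[i][j] for i in range(len(f))) for j in range(out_dim)]
        for f in features
    ]


EMBED_DIM = 8  # toy width of the language model's token embeddings

# Toy encoder outputs: 4 image patches (dim 6), 3 video frames (dim 5), 2 audio chunks (dim 4).
image_feats = [[0.1] * 6 for _ in range(4)]
video_feats = [[0.2] * 5 for _ in range(3)]
audio_feats = [[0.3] * 4 for _ in range(2)]
text_embeds = [[0.0] * EMBED_DIM for _ in range(5)]  # 5 text tokens, already embedded

# One projection per modality (assumed); LLaVA itself only has the image one.
sequence = (
    project(image_feats, make_projection(6, EMBED_DIM, seed=0))
    + project(video_feats, make_projection(5, EMBED_DIM, seed=1))
    + project(audio_feats, make_projection(4, EMBED_DIM, seed=2))
    + text_embeds
)
print(len(sequence), "tokens of width", len(sequence[0]))
```

Once every modality has been mapped to vectors of the same width, the language model can process them as one sequence, which is the sense in which the image module is "extended" to video and speech.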
"Overall, Google's large-model release meets expectations. Every technical point in Gemini has already been validated in academia, and the corresponding papers can be found. Going forward, personal assistants will be a very attractive use case. Compared with large language models, multimodal large models can act as assistants that listen, see, speak, and draw, much more like a human," the large-model algorithm engineer told a Shell Finance reporter.
Beijing News Shell Finance reporter Luo Yidan
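CandyLake.com is an information publishing platform and provides information storage services only.
Disclaimer: The views in this article are solely those of the author and do not represent the position of CandyLake.com. This article does not constitute advice; please treat it with caution.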