Stunning one day, debunked the next? Google's 6-minute "Gemini" demo video revealed to have been edited
白云追月素
Posted on 2023-12-08 20:26:37
After Bard's debut "crash" at the beginning of the year, Google launched its large model Gemini on December 7, Beijing time, and released a series of dazzling demonstration videos. Can Gemini take on GPT-4 this time?
The most striking of these demonstrations is a roughly four-minute video in which, as a tester paints, performs magic tricks, and carries out other actions, Gemini comments and interacts with the tester in real time. Judging by the performance in the video alone, Gemini's understanding appears to approach human level.
"Judging from the demonstration alone, Gemini's video-understanding ability is undoubtedly at the current state of the art," an algorithm engineer working on a large model in Beijing told a Beijing News Shell Finance reporter. "This ability comes from Gemini naturally adding a large amount of video data during training and supporting video understanding in its architecture."
However, just one day after the release, many users found during testing that Gemini's video comprehension was not as smooth as in the demonstration. Google quickly published a blog post explaining the multimodal interaction in the demo video, all but acknowledging that the effect was achieved with static images and multiple prompts. In addition, some netizens noticed an important disclaimer in the demonstration videos: to reduce latency, Gemini's output had also been shortened.
Nevertheless, in the eyes of many professionals, Google has finally launched a large model that can compete with OpenAI's. As an established artificial-intelligence company with deep technical reserves, Google has positioned Gemini as a strong competitor to GPT.
Where was it edited? How does the demonstration video differ from reality?
"Have you seen the demo video of Google's latest large model? The multimodal switching is a qualitative leap; in the map segment especially, a person might not even react that fast." On December 7, Mr. Liu, a website developer, sent the demonstration video to a Shell Finance reporter.
In this demonstration video of Google's large model Gemini, which excited many practitioners, the tester takes out a piece of paper and Gemini immediately replies, "You took out a piece of paper." As the tester draws curves and colors them in, Gemini follows along with the tester's actions: "You're drawing a curve... it looks like a bird... it's a duck, but a blue duck is unusual; most ducks are brown. The Chinese word for duck is 'yazi', and Chinese has four tones." When the tester places a blue rubber duck on a world map, Gemini notices at once: "The duck has been placed in the middle of the ocean; there aren't many ducks there."
Afterwards, the tester began to interact with Gemini using gestures. When the tester made scissors and paper gestures, Gemini answered, "You're playing rock-paper-scissors." Gemini even correctly guessed hand-shadow imitations of an eagle and a dog.
However, a Shell Finance reporter found many traces of editing in this video; in the rock-paper-scissors segment, for example, the tester's fist gesture was clearly cut out. In response, Google published a blog post offering "Q&A and clarification": given a picture of the "paper" gesture, Gemini answers, "I see a right hand with the palm open and five fingers spread"; given a picture of the "rock" gesture, Gemini answers, "a person knocking on a door"; given a picture of the "scissors" gesture, Gemini answers, "I see a hand with the index and middle fingers extended." Only when the three pictures are put together and Gemini is asked "What do you think I'm doing?" does it answer, "You're playing rock-paper-scissors."
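The workflow Google describes, feeding several still frames together with one combining text prompt rather than streaming live video, can be sketched as assembling a single multimodal request. The `build_request` helper and the payload shape below are hypothetical, for illustration only; they do not reflect the real Gemini API.

```python
# Hypothetical sketch: combining still frames with one text prompt,
# as described in Google's clarification (not the real Gemini API).
def build_request(frames, question):
    """Pack one image part per frame, then a final text part."""
    parts = [{"type": "image", "data": f} for f in frames]
    parts.append({"type": "text", "data": question})
    return {"contents": parts}

# Three separate gestures only become "rock-paper-scissors"
# when sent together with the combining question.
req = build_request(
    ["rock.png", "paper.png", "scissors.png"],
    "What do you think I'm doing? Hint: it's a game.",
)
print(len(req["contents"]))  # 4 parts: 3 images + 1 text
```

The point of the sketch is that the "real-time" impression in the video corresponds, per Google's blog, to a batched multi-image prompt like this one.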
So although Gemini's answers are genuine, actual use may not be as fluid as the demonstration video suggests.
Source: "Gemini" demonstration video released by Google.
How was the multimodal capability built?
Even so, through this demonstration many industry insiders acknowledge that Google has taken a real step toward catching up with OpenAI. Before ChatGPT emerged, Google had long led the field of artificial intelligence, but ChatGPT's success put it under heavy pressure. In February of this year Google launched Bard as a ChatGPT rival, but after that debut stumbled, the company lacked a model impressive enough to lift morale.
With Gemini, Google has at least demonstrated distinct strengths in multimodal understanding. "Gemini is a natively multimodal large model, meaning it was multimodal during training. Google already has a strong ecosystem in search, long-form video, online documents, and more. On top of that, Google has many graphics cards and several times OpenAI's computing power. Now it is going all out to catch up with OpenAI," a large-model practitioner who studied automation at Tsinghua University told Shell Finance reporters.
Specifically, the Gemini model comes in three versions: Gemini Ultra, the largest and most capable; Gemini Pro, suited to a wide range of tasks; and Gemini Nano, intended for specific tasks and mobile devices.
Beyond its multimodal abilities, Gemini also performs well in areas such as text comprehension and coding. On the MMLU multitask language-understanding benchmark, Gemini Ultra not only surpassed GPT-4 but even exceeded human experts. A Shell Finance reporter visited Google DeepMind's official website and found the homepage headlined with "Gemini, our most capable large model."
At present, users can try Gemini Pro through Google Bard, though Shell Finance reporters found the capability is available only in some regions. In tests by foreign netizens, users could submit both images and text to Gemini. Judging by those results, Gemini Pro and GPT-4V, which also has multimodal capabilities, each have their strengths across many questions, and Gemini Pro was not overwhelmed by GPT-4V.
"From what I have observed, Gemini's text ability is still slightly inferior to GPT-4's, but Google's technical strength remains in the first tier," said the aforementioned large-model algorithm engineer.
He told a Shell Finance reporter that, technically, giving a large model the "multimodal ability" to understand images, video, and sound can be viewed as extending the image-understanding module of LLaVA (a multimodal pre-trained model) to video and speech, and adding extra video and audio data during training. "This actually proves, for the first time, that video and speech understanding can be incorporated into a large model, verifying the feasibility of both."
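The LLaVA-style recipe the engineer describes, encoding each modality with its own encoder and projecting the result into the language model's token-embedding space, can be sketched in plain NumPy. The dimensions and the single linear projection here are illustrative assumptions, not Gemini's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not Gemini's real dimensions).
d_vision, d_model = 1024, 4096   # vision-encoder dim -> LLM embedding dim
n_frames, n_text = 8, 16         # video frames and text tokens

# A learned linear projection maps each frame embedding into the
# language model's token-embedding space (the LLaVA-style adapter).
W_proj = rng.standard_normal((d_vision, d_model)) * 0.02

frame_feats = rng.standard_normal((n_frames, d_vision))  # from a vision encoder
text_embeds = rng.standard_normal((n_text, d_model))     # from the LLM's embedder

visual_tokens = frame_feats @ W_proj                     # (n_frames, d_model)

# The LLM then processes one interleaved sequence of visual and text tokens.
sequence = np.concatenate([visual_tokens, text_embeds], axis=0)
print(sequence.shape)  # (24, 4096)
```

Extending the same idea to audio would mean adding an audio encoder and a second projection; the language model itself only ever sees a sequence of vectors in its own embedding space.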
"Overall, the release of Google's large model met expectations; every technical element of Gemini had already been validated in academia, with corresponding papers to be found. Going forward, the personal assistant will be a very attractive scenario: compared with large language models, a multimodal large model can act as an assistant that listens, sees, speaks, and draws, much more like a human," the algorithm engineer told the Shell Finance reporter.
Beijing News Shell Finance reporter Luo Yidan
CandyLake.com is an information publishing platform and provides only information storage services.
Disclaimer: The views in this article are the author's own. They do not represent the position of CandyLake.com and do not constitute advice; please treat them with caution.