Stunning one day, debunked the next? Google's 6-minute "Gemini" demo video revealed to have been edited
白云追月素
Posted on 2023-12-8 20:26:37
After Bard stumbled at its debut at the beginning of the year, Google launched its large model Gemini on December 7, Beijing time, and released a series of dazzling demonstration videos. Can Gemini take on GPT-4 this time?
The most striking of these is a roughly four-minute demonstration in which, as the tester draws, performs magic tricks, and carries out other actions, Gemini comments and interacts with the tester in real time. Judging from the performance in the video alone, Gemini's understanding appears to approach a human level.
"From the content of the demonstration alone, Gemini's video understanding ability undoubtedly reaches the most advanced level at present." The algorithm engineer of a large model in Beijing said in an interview with the New Beijing News and Shell Finance reporter, "This ability comes from Gemini naturally adding a large amount of video data during training and supporting video understanding in architecture."
However, just one day after the release, many users found in their own testing that Gemini's video understanding was not as smooth as in the demonstration. Google quickly published a blog post explaining the multimodal interaction process behind the demo video, all but acknowledging that the effect was achieved with static images and multiple prompts. Some netizens also noticed an important disclaimer in the demo video: to reduce latency, Gemini's outputs had been shortened.
Nevertheless, in the eyes of many professionals, Google has finally launched a large model that can compete with OpenAI's. As an established artificial intelligence company with deep technical reserves, Google makes Gemini a strong rival to GPT.
Where was the video edited? How does the demonstration differ from the real behavior?
"Have you watched the video demonstration of Google's latest big model? Multimodal switching is a qualitative change, especially when playing game maps, people may not be able to react." On December 7th, Mr. Liu, a website developer, sent a demonstration video to a reporter from Beike Finance.
In this demo video, which excited many practitioners, the tester takes out a piece of paper and Gemini immediately responds, "You took out a piece of paper." As the tester draws curves and colors them in, Gemini keeps "understanding" and narrates along with the tester's actions: "You are drawing a curve; it looks like a bird; it is a duck, but blue ducks are not common, most ducks are brown; the Chinese word for duck is pronounced 'yazi', and Chinese has four tones." When the tester places a blue rubber duck on a world map, Gemini notices at once: "The duck has been placed in the middle of the ocean; there are not usually many ducks there."
The tester then begins to interact with Gemini through gestures. When the tester makes the scissors and paper gestures, Gemini "answers" that "you're playing rock-paper-scissors." Gemini even goes on to guess the eagle and dog shapes the tester mimics with his hands.
However, a Shell Finance reporter found many traces of editing in the video; in the rock-paper-scissors segment, for instance, the tester's fist gesture was clearly cut out. In response, Google published a blog post with "Q&A and clarification": when shown a picture of the "paper" gesture alone, Gemini answers "I see a right hand, palm open and fingers spread"; when shown a picture of the "rock" gesture, it answers "a person knocking on a door"; when shown a picture of the "scissors" gesture, it answers "I see a hand with the index and middle fingers extended." Only when the three pictures are put together and Gemini is asked "What do you think I'm doing?" does it answer "You're playing rock-paper-scissors."
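For readers curious how the workflow Google describes looks in practice, the sketch below shows, in Python, how several still images can be sent together with a single text prompt, assuming access to Google's google-generativeai SDK and a Gemini vision model. It is a minimal, hypothetical example; the file names, model name, and API key are placeholders, and the article itself contains no code.

import google.generativeai as genai  # Google's public Gemini SDK
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-pro-vision")

# Three still frames, as in Google's clarification: paper, rock, scissors.
frames = [Image.open(name) for name in ("paper.jpg", "rock.jpg", "scissors.jpg")]

# The images are sent together with one text prompt, not as a live video stream.
response = model.generate_content(frames + ["What do you think I'm doing?"])
print(response.text)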
In other words, although Gemini's answers are genuine, the real experience may not be as smooth as the demonstration video suggests.
Source: "Gemini" demonstration video released by Google.
How was the multimodal capability built?
After this demonstration, many industry insiders also acknowledge that Google has taken a real step toward catching up with OpenAI. Before ChatGPT appeared, Google had long been a leader in artificial intelligence, but ChatGPT's success put it under heavy pressure. In February this year Google launched a rival to ChatGPT, but after that debut flopped, it had lacked a large model strong enough to lift morale.
With Gemini, Google has at least shown distinctive strengths in multimodal understanding. "Gemini is a natively multimodal large model, meaning it was multimodal from training onward. Google already has strong ecosystems in search, long-form video, online documents, and more. It also has a large number of GPUs, several times OpenAI's computing power, and is now going all out to catch up with OpenAI," a large-model practitioner who studied automation at Tsinghua University told Shell Finance.
Specifically, Gemini comes in three versions: Gemini Ultra, the largest and most capable; Gemini Pro, suited to a wide range of tasks; and Gemini Nano, intended for specific tasks and mobile devices.
Beyond its multimodal abilities, Gemini also performs well in text understanding, coding, and other areas. On the MMLU multitask language understanding benchmark, Gemini Ultra not only surpassed GPT-4 but even exceeded human experts. A Shell Finance reporter visited Google DeepMind's official website and found the line "Witness Gemini, our most capable large model" displayed on the homepage.
At present, users can try Gemini Pro through the Google Bard interface, though Shell Finance reporters found it is available only in some regions. In tests by netizens abroad, users can feed Gemini both images and text. Judging from the results, Gemini Pro and GPT-4V, which also has multimodal capabilities, each have their strengths across many questions, and Gemini Pro is not overwhelmed by GPT-4V.
"Based on my observation, Gemini's ability in text is still slightly inferior to GPT4, but Google's technological strength is still in the first tier," said the algorithm engineer for the aforementioned large model.
He told Shell Finance that, technically, giving a large model the "multimodal ability" to understand images, video, and sound can be seen as extending the image-understanding module of LLaVA (a multimodal pre-trained model) to video and speech and adding extra video and audio data during training. "This is effectively the first time video and speech understanding have been incorporated into a large model, verifying that both are feasible within a large model."
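To make the engineer's analogy more concrete, the sketch below shows, in a hypothetical PyTorch snippet, the LLaVA-style idea he describes: features from image, video, or audio encoders are projected into the language model's embedding space and concatenated with text tokens. The class name, dimensions, and encoder choices are illustrative assumptions, not details Google has disclosed about Gemini.

import torch
import torch.nn as nn

class MultimodalAdapter(nn.Module):
    # Projects non-text features into the LLM's token embedding space,
    # the same trick LLaVA uses for images, extended here to audio.
    def __init__(self, image_dim=1024, audio_dim=768, llm_dim=4096):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, llm_dim)
        self.audio_proj = nn.Linear(audio_dim, llm_dim)

    def forward(self, text_embeds, image_feats=None, audio_feats=None):
        parts = []
        if image_feats is not None:   # e.g. per-frame ViT features for video
            parts.append(self.image_proj(image_feats))
        if audio_feats is not None:   # e.g. speech-encoder features
            parts.append(self.audio_proj(audio_feats))
        parts.append(text_embeds)     # text tokens are already in LLM space
        # The fused sequence is then fed to the language model as usual.
        return torch.cat(parts, dim=1)

# Example shapes: 16 video frames, 50 audio frames, 12 text tokens.
adapter = MultimodalAdapter()
fused = adapter(torch.randn(1, 12, 4096),
                torch.randn(1, 16, 1024),
                torch.randn(1, 50, 768))
print(fused.shape)  # torch.Size([1, 78, 4096])

Training on additional video and audio data then teaches the language model to make use of these projected tokens, which is essentially what the engineer means by "adding extra video and audio data during training".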
"Overall, the release of the Google big model meets expectations, and every technical point of Gemini has been validated in the academic community before, and corresponding papers can be found. In the future, personal assistants will be a very attractive scene. Compared to big language models, multimodal big models can play the role of assistants who can listen, see, speak, and draw, more like a human." This big model algorithm engineer told a reporter from Shell Finance.
Beijing News Shell Finance reporter Luo Yidan