Nvidia and other tech giants exposed for illegally using YouTube data from 170,000 videos to train models

Posted by 六月清晨搅 on 2024-7-17 15:00:42

According to media reports, several large tech companies, including Apple, NVIDIA, Salesforce, and Anthropic, have been found to have trained their AI models on data taken from YouTube, Google's video platform, without authorization. The companies used a dataset supplied by a third party that contains a large volume of subtitle text scraped from YouTube, in violation of YouTube's ban on unauthorized scraping of content from the platform. According to the report, these companies all used a dataset called "YouTube Subtitles" in training their AI models. The dataset is 5.7 GB in size and contains 489 million words from 173,500 videos across more than 48,000 YouTube channels. It consists of plain-text video subtitles, including both subtitles uploaded by video creators and text automatically transcribed by YouTube. In addition to English, the subtitles often come with translations into languages such as Japanese, German, and Arabic.

