首页 News 正文

According to media reports, some large tech companies, including Apple, NVIDIA, Salesforce, and Anthropic, have been exposed for using unauthorized data from Google's video website YouTube to train their AI models. These companies used a dataset provided by a third party, which contained a large amount of video subtitle text crawled from YouTube, violating YouTube's ban on unauthorized content crawling from the platform. The report points out that these tech companies all use a dataset called "YouTube Subtitles" when training their AI models, which is 5.7GB in size and contains 489 million words from 173500 videos across over 48000 channels on YouTube. This dataset consists of pure text for video subtitles, including parts uploaded by video bloggers and automatically transcribed text from YouTube. In addition to English, it usually comes with translations for languages such as Japanese, German, and Arabic.
您需要登录后才可以回帖 登录 | 立即注册

本版积分规则

六月清晨搅 注册会员
  • 粉丝

    0

  • 关注

    0

  • 主题

    30