OpenAI has received a strong warning from YouTube CEO Neal Mohan against using videos from the platform to train its AI models, such as Sora and ChatGPT. The warning comes amid possible violations of YouTube’s terms of service and growing concern about where training data comes from. The question of how these state-of-the-art AI systems source their training data has prompted a wider discussion about ethical AI research and the obligations of tech companies.
Exploring YouTube’s concerns
Mira Murati’s recent interview adds another layer of uncertainty to the already blurry picture of AI training practices. In an interview with The Wall Street Journal about a month ago, OpenAI’s CTO was unable to say clearly where Sora’s training data came from. Although it is unclear whether YouTube videos were or are being used for training, YouTube CEO Neal Mohan has now fired what may be a warning shot by informing OpenAI that using videos from the platform in this way is prohibited.
In an interview with Emily Chang for Bloomberg Originals, Mohan said that YouTube’s terms of service prohibit downloading material such as transcripts or video clips, and that doing so would be a clear breach of those terms. These, he said, are the rules for content on the platform. Google, YouTube’s parent company, has been developing its own multimodal AI, Gemini, which also relies on training data. Mohan said that Google follows each creator’s individual contract with YouTube when deciding whether to use content from the platform.
Mohan stated,
“It does not allow for things like transcripts or video bits to be downloaded, and that is a clear violation of our terms of service. Those are the rules of the road in terms of content on our platform.”
Source: Bloomberg
Mohan also added,
“Google adheres to YouTube’s individual contracts with creators before deciding whether to use videos from the platform.”
Source: Bloomberg
A closer look at Murati’s comments shows how serious the copyright and attribution issue is. Her phrase “publicly available data” suggests that OpenAI’s Sora may collect anything on the internet, including YouTube videos and social media posts. Yet it is highly unlikely that the license terms for all content published on YouTube permit this kind of use.
Enforcing copyright on the internet is a difficult task in itself. At the same time, OpenAI’s Sora would have access to that content and could profit from it, in addition to using it for educational purposes.
OpenAI’s CTO is not the only one reluctant to discuss the datasets used to train Sora. The company as a whole rarely names its sources. Sora’s technical report says only that training text-to-video generation systems requires a large number of videos with corresponding text captions, without saying where those videos come from.
Because these companies may not have the legal right to use the data, their lack of transparency may be the first indication that they are trying to avoid legal trouble.