One does not have to pay a subscription fee to use GPT-4o. Instead, users pay with their data. Like a black hole, GPT-4o increases in mass by sucking up any and all material that gets too close, accumulating every piece of information that users enter, whether in the form of text, audio files, or images.
GPT-4o gobbles up not only users’ own information but also third-party data that are revealed during interactions with the AI service. Suppose you want a summary of a New York Times article. You take a screenshot and share it with GPT-4o, which reads the screenshot and generates the requested summary within seconds. For you, the interaction is over. But OpenAI is now in possession of all the copyrighted material from the screenshot you provided, and it can use that information to train and enhance its model.
OpenAI is not alone. In the past year, many firms – including Microsoft, Meta, Google, and X – have quietly updated their privacy policies in ways that potentially allow them to collect user data and apply it to train generative AI models. Though leading AI companies have already faced numerous lawsuits in the United States over their unauthorised use of copyrighted content for this purpose, their appetite for data remains as voracious as ever. After all, the more they obtain, the better they can make their models.
The problem for leading AI firms is that high-quality training data has become increasingly scarce. In late 2021, OpenAI was so desperate for more data that it reportedly transcribed over a million hours of YouTube videos, violating the platform’s rules. (Google, YouTube’s parent company, has not pursued legal action against OpenAI, possibly to avoid accountability for its own harvesting of YouTube videos, the copyrights for which are owned by their creators.)
With GPT-4o, OpenAI is trying a different approach, leveraging a large and growing user base – drawn in by the promise of free service – to crowdsource massive amounts of multimodal data. This approach mirrors a well-known tech-platform business model: charge users nothing for services, from search engines to social media, while profiting from app tracking and data harvesting – what Harvard professor Shoshana Zuboff famously called “surveillance capitalism.”
To be sure, users can prohibit OpenAI from using their “chats” with GPT-4o for model training. But the obvious way to do this – on ChatGPT’s settings page – automatically turns off the user’s chat history, causing users to lose access to their past conversations. There is no discernible reason why these two functions should be linked, other than to discourage users from opting out of model training.
If users want to opt out of model training without losing their chat history, they must first figure out that another way exists, as OpenAI highlights only the first option. They must then navigate OpenAI’s privacy portal – a multi-step process. Simply put, OpenAI has made sure that opting out carries significant transaction costs, in the hope that users will not do it.
Even if users consent to the use of their data for AI training, consent alone would not guard against copyright infringement, because users are providing data that they may not actually own. Their interactions with GPT-4o thus have spillover effects on the creators of the content being shared – what economists call “externalities.” In this sense, consent means little.
While OpenAI’s crowdsourcing activities could lead to copyright violations, holding the company – or others like it – accountable will be no easy feat. AI-generated output rarely looks like the data that informed it, which makes it difficult for copyright holders to know for certain whether their content was used in model training. Moreover, a firm might be able to claim ignorance: users provided the content during interactions with its services, so how can the company know where they got it from?
Creators and publishers have employed a number of methods to keep their content from being sucked into the AI-training black hole. Some have introduced technological solutions to block data scraping. Others have updated their terms of service to prohibit the use of their content for AI training. Last month, Sony Music – one of the world’s largest record labels – sent letters to more than 700 generative-AI companies and streaming platforms, warning them not to use its content without explicit authorisation.
But as long as OpenAI can exploit the “user-provided” loophole, such efforts will be in vain. The only credible way to address GPT-4o’s externality problem is for regulators to limit AI firms’ ability to collect and use the data their users share.
• Multimodal capabilities and user data collection
GPT-4o integrates text, voice, and visual capabilities, offering a significantly faster user experience. However, instead of charging subscription fees, OpenAI collects extensive user data, including text, audio files, and images, as a form of payment.
• Third-party data harvesting
GPT-4o collects not only users’ personal data but also third-party data shared during interactions, such as copyrighted material from screenshots, using it to train and improve its AI models.
• Industry-wide data collection practices
Many major tech firms, including Microsoft, Meta, Google, and X, have updated their privacy policies to allow extensive data collection for AI training, despite facing legal challenges over the unauthorised use of copyrighted content.
• Challenges of opting out
Users can opt out of having their data used for AI training, but OpenAI’s privacy settings make the process cumbersome: the obvious opt-out is linked to losing chat history, and the alternative requires navigating a multi-step privacy portal, discouraging many users from opting out at all.
• Regulatory and ethical concerns
The practice of crowdsourcing data raises concerns about copyright infringement and ethical data use. While some content creators have tried to block data scraping and updated terms of service, regulators need to step in to limit AI firms’ data collection capabilities to address these issues effectively.
— Project Syndicate
- Angela Huyue Zhang, Associate Professor of Law and Director of the Philip K.H. Wong Center for Chinese Law at the University of Hong Kong, is the author of High Wire: How China Regulates Big Tech and Governs Its Economy.
- S Alex Yang is Professor of Management Science and Operations at London Business School.