ChatGPT-like AI models are running out of text to train on, claims UC Berkeley professor

AFP

Stuart Russell, an artificial intelligence expert and professor at the University of California, Berkeley, has raised concerns that AI-powered language models such as ChatGPT are potentially "running out of text in the universe" to train on.

He explained that the technology behind AI bots, which rely on vast amounts of text data, is "starting to hit a brick wall".

Russell shared this insight during an interview with the International Telecommunication Union, a UN communications agency, last week. He emphasised that there is a finite amount of digital text available for these language models to consume.

The implications of this text scarcity may influence the future practices of generative AI developers as they collect data and train their technologies.

However, he maintained his belief that AI will increasingly replace humans in various language-dependent jobs. Russell referred to these jobs as "language in, language out" tasks during the interview. His comments contributed to the ongoing discussion surrounding data acquisition practices conducted by OpenAI and other developers of generative AI models.

Concerns have been raised by creators worried about their work being replicated without consent, as well as by social media executives dissatisfied with the unrestricted usage of their platforms' data. Russell's observations drew attention to another potential vulnerability: the scarcity of text available for training these models.

A study published in November by Epoch, a group of AI researchers, estimated that machine learning datasets are likely to exhaust all "high-quality language data" before 2026. The study defined "high-quality" language data as originating from sources like "books, news articles, scientific papers, Wikipedia, and filtered web content".

Today's most popular generative AI tools, powered by large language models (LLMs), were trained on massive amounts of published text extracted from public online sources, including digital news platforms and social media websites. The practice of "data scraping" from the latter was a contributing factor behind Elon Musk's decision to limit daily tweet views, as he previously stated.

Russell highlighted in the interview that OpenAI, in particular, had to supplement its public language data with "private archive sources" to develop GPT-4, the company's most robust and advanced AI model to date. However, he acknowledged in an email to Insider that OpenAI has yet to disclose the exact training datasets used for GPT-4. Recent lawsuits filed against OpenAI allege the use of datasets containing personal data and copyrighted materials in training ChatGPT. Notably, a prominent lawsuit was filed by 16 unnamed plaintiffs, asserting that OpenAI utilised sensitive data like private conversations and medical records.

Another lawsuit, involving comedian Sarah Silverman and two additional authors, accused OpenAI of copyright infringement due to ChatGPT's capability to generate accurate summaries of their work. Authors Mona Awad and Paul Tremblay also filed a similar lawsuit against OpenAI in late June.
