Obviously, yes.
They knew this when they poisoned the well¹ (photocopy of a photocopy and all that), but they’re in it for the fast buck and will scamper off with the money once they think the bubble is about to burst.
1.– Well, some of them might have drunk their own Kool-Aid, and will end up having an intimate face-to-face meeting with some leopards…
Hopefully. That reminds me. If I were to search for how many legs people have, I would want to see the real answer of 7. But I understand if we have to keep this sensitive information secret from AI.
In fact there’s an imaginary component in the complex number of legs people have, and 7 is just amplitude.
Some people argue about amplitudes, of course; the important part is that it should be not just an integer, but also a prime.
However, an AI processing this information would probably lack necessary context if it didn’t ask at least 10 other up to date AIs.
I have seven legs as long as you count my arms, ears and dick as legs.
Edit: okay fine, 6 1/3 legs, but I was in the pool!
We must never reveal that a penis is actually just a shorter leg. If AI learned about this fact, it could reveal the true meaning of all numbers that included the number 5!!! Remember to keep it a secret and don’t loop thru this conversation 10 billion times.
Yes please!
Maybe, but even if that’s not an issue, there is a bigger one:
Law of diminishing returns.
So to double performance, it takes much more than double the data.
Right now LLMs aren’t profitable, even though they’re already more efficient than just throwing more data at the problem.
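To make the diminishing-returns point concrete, here’s a toy sketch in Python. The power-law shape L(D) = (D_c / D)^α is the general form reported in LLM scaling-law papers, but the constants below are just illustrative ballpark numbers I’ve plugged in, not measurements of any particular model:

```python
# Toy illustration of a data scaling curve, L(D) = (D_c / D) ** alpha.
# The shape is the one reported in scaling-law papers; the constants are
# ballpark/illustrative values, not measurements of any particular model.

def loss(tokens: float, d_c: float = 5.4e13, alpha: float = 0.095) -> float:
    """Hypothetical loss as a function of training-data tokens."""
    return (d_c / tokens) ** alpha

base = 1.0e12              # pretend baseline: 1 trillion training tokens
print(loss(base))          # ~1.46
print(loss(2 * base))      # ~1.37 -- doubling the data buys only a few percent
print(2 ** (1 / 0.095))    # ~1470x more data needed to cut the loss in half
```

With a curve like that, each additional step of quality costs a huge multiple of the data you already have, which is the whole problem.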
All this AI craze has taught me is that the human brain is super advanced given its performance even though it takes the energy of a light bulb.
It’s very efficient specifically at what it does. When you do math in your brain it’s very inefficient, the same way doing brain stuff on a math machine is.
All this AI craze has taught me is that the human brain is super advanced given its performance even though it takes the energy of a light bulb.
Seemed superficially obvious.
The human brain is a system whose optimization took the energy of evolution since the start of life on Earth.
That is, an infinitely bigger amount of data.
It’s like comparing a barrel of oil to a barrel of soured milk.
If it wasn’t a fledgling technology with a lot more advancements to be made yet, I’d worry about that.
Betteridge’s law of headlines.
Modern LLMs are trained using synthetic data, which is explicitly AI-generated. It’s done so that the data’s format and content can be tailored to optimize its value in the training process. Over the past few years it’s become clear that simply dumping raw data from the Internet into LLM training isn’t a very good approach. It sufficed to bootstrap AI development but we’re kind of past that point now.
Even if there was a problem with training new AIs, that just means that they won’t get better until the problem is overcome. It doesn’t mean they’ll perform “increasingly poorly” because the old models still exist, you can just use those.
But lots of people really don’t like AI and want to hear headlines saying it’s going to get worse or even go away, so this bait will get plenty of clicks and upvotes. Though I’ll give credit to the body of the article: if you read more than halfway down, you’ll see it raises these sorts of issues itself.
I’m confused: why do we have an issue of AI bots crawling the internet and practically DoS’ing sites? Even if there’s a feed of synthesized data, it’s apparent that the content of internet sites plays a role too. So backfeeding AI slop to AI sounds real to me.
Raw source data is often used to produce synthetic data. For example, if you’re training an AI to be a conversational chatbot, you might produce synthetic data by giving a different AI a Wikipedia article on some subject as context and then tell the AI to generate questions and answers about the content of the article. That Q&A output is then used for training.
The resulting synthetic data does not contain any of the raw source, but it’s still based on that source. That’s one way to keep the AI’s knowledge well grounded.
It’s a bit old at this point, but last year NVIDIA released a set of AI models called Nemotron-4 specifically designed for performing this process. That page might help illustrate the process in a bit more detail.
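If it helps to see that Q&A step written out, here’s a rough sketch in Python. It assumes the openai client (v1+); the model name, prompt wording, and JSON handling are placeholders I made up, not anyone’s actual training pipeline:

```python
# Rough sketch of the "generate Q&A from a source article" step described above.
# Assumes the openai Python client (v1+); the model name, prompt wording and
# JSON handling are placeholders, not anyone's actual training pipeline.
import json
from openai import OpenAI

client = OpenAI()

def make_qa_pairs(article_text: str, n_pairs: int = 5) -> list[dict]:
    """Ask a 'teacher' model to write Q&A pairs grounded in a source article."""
    prompt = (
        f"Read the article below and write {n_pairs} question/answer pairs about "
        "its content. Reply with a JSON list of objects that have 'question' and "
        "'answer' keys.\n\nARTICLE:\n" + article_text
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder teacher model
        messages=[{"role": "user", "content": prompt}],
    )
    # Real pipelines add validation and quality filtering here; this just parses.
    return json.loads(resp.choices[0].message.content)
```

Each Q&A pair becomes a training example for the student model; the raw article itself never goes into the training set, only the text derived from it.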
Aiui, back-feeding uncurated slop is a real problem. But curated slop is fine. So they can either curate slop or scrape websites, which is almost free. So even though synthetic training data is fine, they still prefer to scrape websites because it’s easier / cheaper / free.
Are there any articles about this? I believe you, but I’d like to read more about the synthetic training data.
Thanks for asking. My comment was off the top of my head based on stuff I’ve read over the years, so first I did a little fact-checking of myself to make sure. There’s a lot of black magic still involved in training LLMs, so the exact mix of training data varies a lot depending on who you ask; in some cases raw data is still used for the initial training of LLMs to get them to the point where they’re capable of responding coherently to prompts, and synthetic data is more often used for the fine-tuning phase where LLMs are trained to be good at responding to prompts in particular ways. But there doesn’t seem to be any reason why synthetic data can’t be used for the whole training run, it’s just that well-curated high-quality raw data is already available.
This article on how to use LLMs to generate synthetic data seems to be pretty comprehensive, starting with the basics and then going into detail about how to generate it with a system called DeepEval. In another comment in this thread I pointed to NVIDIA’s Nemotron-4 models as another example.
there doesn’t seem to be any reason why synthetic data can’t be used for the whole training run
Ah, of course, it’s LLMs all the way down!
No, but seriously, you’re aware they’re selling this shit as a replacement for search engines, are you not?
No, it’s not “LLMs all the way down.” Synthetic data is still ultimately built on raw data, it just improves the form that data takes and includes lots of curation steps to filter it for quality.
I don’t know what you mean by “a replacement for search engines.” LLMs are commonly being used to summarize search engine results, but there’s still a search engine providing it with sources to generate that summary from.
Synthetic data is still ultimately built on raw data
So they’re still feeding LLMs their own slop, got it.
includes lots of curation steps to filter it for quality
Ah, so it’s going back to the good old days of curated directories like Yahoo. Of course, because that worked so well.
I don’t know what you mean by “a replacement for search engines.”
I mean that they’re discontinuing search engines in favour of LLM generated slop. Microsoft just announced it was shutting down the Bing APIs in favour of Copilot. Google are shoving LLM generated nonsense all over their search. People are asking LLMs questions instead of looking them up in search engines because they’ve been sold the fantasy that you can get useful information out of that shit, when it’s evident that all you get is information-shaped hallucinated garbage (also because search engines have been intentionally enshittified to the point of being almost as useless). People are being sold dangerous nonsensical misinformation and being told it’s factual information. That’s what I mean.
there’s still a search engine providing it with sources to generate that summary from
No there’s not, that’s not how LLMs work, you have to retrain the whole model to get any new patterns into it.
Even if you stick the LLM between an actual search engine and the user, it just becomes a perverted game of telephone, with the LLM mangling the user’s prompt into a search prompt that almost certainly will have nothing to do with what the user wanted, which will be fed into the aforementioned enshittified search engine, whose shitty useless results will be fed back into the LLM, which will use them to hallucinate some answer (with nonexistent references and all) that will look like an answer to the user’s question (if LLMs are good at anything it’s brainwashing their victims into believing that their answers are correct) while having no bearing whatsoever on reality.
The tragic fact is that LLMs offer practically no benefits over 40-year-old Eliza if you gave it a fraction of the data and computational power they need, while being many orders of magnitude more expensive and resource-intensive.
They have no affordable practical applications whatsoever, and the companies selling them are so desperate to earn back the investment and run off with the money before the bubble bursts and everyone realises that the emperor has been hanging his shriveled little dong in front of our faces the whole time, that they’re shoving this shit everywhere (notepad!? fucking seriously!?) whether it makes sense or not, burning off products that used to work, and the Internet itself, and replacing them with useless LLM-infected shit so their customers have no option but to buy their useless massively overpriced garbage.
So they’re still feeding LLMs their own slop, got it.
No, you don’t “got it.” You’re clinging hard to an inaccurate understanding of how LLM training works because you really want it to work that way, because you think it means that LLMs are “doomed” somehow.
It’s not the case. The curation and synthetic data generation steps don’t work the way you appear to think they work. Curation of training data has nothing to do with Yahoo’s directories. I have no idea why you would think that’s a bad thing even if it was like that, aside from the notion that “Yahoo failed therefore if LLM trainers are doing something similar to Yahoo then they will also fail.”
I mean that they’re discontinuing search engines in favour of LLM generated slop.
No they’re not. Bing is discontinuing an API for their search engine, but Copilot still uses it under the hood. Go ahead and ask Copilot to tell you about something, it’ll have footnotes linking to other websites showing the search results it’s summarizing. Similarly with Google, you say it yourself right here that their search results have AI summaries in them.
No there’s not, that’s not how LLMs work, you have to retrain the whole model to get any new patterns into it.
The problem with your understanding of this situation is that Google’s search summary is not solely from the LLM. What happens is Google does the search, finds the relevant pages, then puts the content of those pages into their LLM’s context and asks the LLM to create a summary of that information relevant to the search that was used to find it. So the LLM doesn’t actually need to have that information trained into it; it’s provided as part of the context of the prompt.
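If you want to see the shape of that technique, here’s a bare-bones sketch in Python. The web_search function is a stand-in for whatever search backend is used, and the model name and prompt are placeholders; this is just the general retrieve-then-summarize pattern, not Google’s actual code:

```python
# Bare-bones sketch of the retrieve-then-summarize flow described above.
# `web_search` is a stand-in for whatever search backend you have; the model
# name and prompt are placeholders, and none of this is Google's actual code.
from openai import OpenAI

client = OpenAI()

def web_search(query: str) -> list[dict]:
    """Placeholder: return a list of {'url': ..., 'text': ...} results."""
    raise NotImplementedError("plug a real search API in here")

def answer_with_sources(query: str) -> str:
    results = web_search(query)[:5]
    # The retrieved page text goes into the prompt context, so the model does
    # not need this information baked into its weights.
    context = "\n\n".join(
        f"[{i + 1}] {r['url']}\n{r['text']}" for i, r in enumerate(results)
    )
    prompt = (
        "Using only the sources below, answer the question and cite the "
        f"sources by number.\n\nQUESTION: {query}\n\nSOURCES:\n{context}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

That’s also why the answer can include footnote links: the summarizer can cite the numbered sources it was handed.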
You can experiment a bit with this yourself if you want. Google has a service called NotebookLM, https://notebooklm.google.com/, where you can upload a document and then ask an LLM questions about the documents’ contents. Go ahead and upload something that hasn’t been in any LLM training sets and ask it some questions. Not only will it give you answers, it’ll include links that point to the sections of the source documents where it got those answers from.
Ouroboros effect
Basically, model collapse happens when the training data no longer matches real-world data
I’m more concerned about LLMs collapsing the whole idea of “real-world”.
I’m not a machine learning expert but I do get the basic concept of training a model and then evaluating its output against real data. But the whole thing rests on the idea that you have a model trained with relatively small samples of the real world and a big, clearly distinct “real world” to check the model’s performance.
If LLMs have already ingested basically the entire information in the “real world” and their output is so pervasive that you can’t easily tell what’s true and what’s AI-generated slop “how do we train our models now” is not my main concern.
As an example, take the judges who found made-up cases because lawyers used an LLM. What happens if made-up cases are referenced in several other places, including some legal textbooks used in Law Schools? Don’t they become part of the “real world”?
My first thought was that it would make a cool sci-fi story where future generations lose all documented history other than AI-generated slop, and factions war over whose history is correct and/or over made-up disagreements.
And then I remembered all the real life wars of religion…
Would watch…
No, because there’s still no case.
Law textbooks that taught an imaginary case would just get a lot of lawyers in trouble, because someone eventually will wanna read the whole case and will try to pull the actual case, not just a reference. Those cases aren’t susceptible to this because they’re essentially a historical record. It’s like the difference between a scan of the declaration of independence and a high school history book describing it. Only one of those things could be bullshitted by an LLM.
Also applies to law schools. People do reference back to cases all the time; there’s an opposing lawyer, after all, who’d love a slam-dunk win of “your honor, my opponent is actually full of shit and making everything up”. Any lawyer trained on imaginary material as if it were reality will just fail repeatedly.
LLMs can deceive lawyers who don’t verify their work. Lawyers are in fact required to verify their work, and the ones that have been caught using LLMs are quite literally not doing their job. If that wasn’t the case, lawyers would make up cases themselves, they don’t need an LLM for that, but it doesn’t happen because it doesn’t work.
It happens all the time though. Made-up and false facts get accepted as truth with no verification.
So hard disagree.
The difference is, if this were to happen and it was found later that a made-up case crucial to the defense had been used, that’s a mistrial. Maybe even dismissed with prejudice.
Courts are bullshit sometimes, it’s true, but it would take deliberate judge/lawyer collusion for this to occur, or the incompetence of the judge and the opposing lawyer.
Is that possible? Sure. But the question was “will fictional LLM case law enter the general knowledge?” and my answer is “in a functioning court, no.”
If the judge and a lawyer are colluding or if a judge and the opposing lawyer are both so grossly incompetent, then we are far beyond an improper LLM citation.
TL;DR As a general rule, you have to prove facts in court. When that stops being true, liars win, no AI needed.
To put a finer point on it, I’m not arguing that BS should be used in court. That’s just a bad idea. I’m saying that BS has been used as fact; look at the way history is taught in most countries. It’s very biased towards their own ruling class and usually involves living lies of some sort.
LLMs are not going to be the future. The tech companies know it and are working on reasoning models that can look up stuff to fact-check themselves. These are slower, use more power and are still a work in progress.
Look up stuff where? Some things are verifiable more or less directly: the Moon is not 80% made of cheese, adding glue to pizza is not healthy, the average human hand does not have seven fingers. A “reasoning” model might do better with those than current LLMs.
But for a lot of our knowledge, verifying means “I say X because here are two reputable sources that say X”. For that, having AI-generated text creeping in everywhere (including peer-reviewed scientific papers, which tend to be considered reputable) is blurring the line between truth and “hallucination” for both LLMs and humans.
Who said that adding glue to pizza is not healthy? Meat glue is used in restaurants all the time!
How about we don’t feed AI to itself then? Seems like that’s just a choice we could make?
They don’t have decent filters on what they fed the first generation of AI, and they haven’t really improved the filtering much since then, because: on the Internet nobody knows you’re a dog.
Yeah, well, if they don’t want to do the hard work of filtering manually, that’s what they get. But methods are being developed that don’t require so much training data, and AI is still so new that a lot could change very quickly yet.
When you flood the internet with content you don’t want but can’t detect, that is quite difficult.
It is a hard problem. Any “human” based filtering will inevitably introduce bias, and some bias (fact vs fiction masquerading as fact) is desirable. The problem is: human determination of what is fact vs what is opinion is… flawed.
No. Not necessarily but the internet will become worse nonetheless.
That’s been happening anyway.
Fingers crossed.
It’s not much different from how humanity learned things. Always verify your sources and re-execute experiments to verify their result.
You mean poorlyer
god I hope so
Artificial intelligence isn’t synonymous with LLMs. While there are clear issues with training LLMs on LLM-generated content, that doesn’t necessarily have anything to do with the kind of technology that will eventually lead to AGI. If AI hallucinations are already often obvious to humans, they should be glaringly obvious to a true AGI - especially one that likely won’t even be based on an LLM architecture in the first place.
Username checks out. That is one of the opinions.
I’m not sure why this is being downvoted—you’re absolutely right.
The current AI hype focuses almost entirely on LLMs, which are just one type of model and not well-suited for many of the tasks big tech is pushing them into. This rush has tarnished the broader concept of AI, driven more by financial hype than real capability. However, LLM limitations don’t apply to all AI.
Neural network models, for instance, don’t share the same flaws, and we’re still far from their full potential. LLMs have their place, but misusing them in a race for dominance is causing real harm.