It's hilarious because I'm researching something really mundane but very specific, with only a handful of white papers available. But now ChatGPT is going to the same forums I go to and feeding me back old assumptions I had, which I've since discovered to be incorrect. Like I've totally fucked that up for myself.
You would have to train new AI models to recognize and ignore other AI content. But that would be an admission that AI content is useless and can’t be trusted.
Can it also ignore useless and untrustworthy content created by humans? There's a lot of that around too.
It would be hilarious if they had to hire a bunch of humans to check AI sources and “facts.” I hope that’s happening.
Didn’t you hear? Fact checking is a radical left commie trans plot against FREEDOM
They rarely mention the humans trying to make use of the polluted landscape.
Everyone talks about vibecoding this, vibecoding that. No one talks about how hard it's become to find non-AI-slop tutorials for development stuff.
Odd URL… Here's the original: https://futurism.com/chatgpt-polluted-ruined-ai-development
Nice detail to use when searching the internet btw:
“But if you’re collecting data before 2022 you’re fairly confident that it has minimal, if any, contamination from generative AI,” he added. “Everything before the date is ‘safe, fine, clean,’ everything after that is ‘dirty.’”
Try running searches restricted to pre-2022 results, at least for older info, to reduce the odds of AI-generated noise.
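A minimal sketch of that cutoff in Python, assuming your scraped records carry an ISO-8601 "published" timestamp (the field name here is made up for illustration):

```python
from datetime import datetime, timezone

# Cutoff from the quote above: anything published before 2022 is
# treated as "clean", anything after as potentially AI-contaminated.
CUTOFF = datetime(2022, 1, 1, tzinfo=timezone.utc)

def is_clean(record: dict) -> bool:
    """Return True if the record predates the generative-AI boom.

    Assumes each record carries an ISO-8601 'published' timestamp
    (hypothetical field); records without one are treated as dirty
    to stay on the safe side.
    """
    ts = record.get("published")
    if ts is None:
        return False
    return datetime.fromisoformat(ts) < CUTOFF

docs = [
    {"url": "https://example.com/a", "published": "2021-06-01T00:00:00+00:00"},
    {"url": "https://example.com/b", "published": "2023-03-15T00:00:00+00:00"},
]
clean = [d for d in docs if is_clean(d)]
print([d["url"] for d in clean])  # only the 2021 document survives
```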
Anyway, kinda funny to see that these generators may be producing enough noise to make producing more noise somewhat harder. Hopefully this doesn't also impact more productive AI development, such as what's used in scientific research and the like, as that would genuinely suck.
Edit:
Revised the generators "have produced" phrasing to "may be producing" to better reflect the lack of concrete info regarding generative-AI data pollution, as someone else pointed out. As they note: "Now, it's not clear to what extent model collapse will be a problem, but if it is a problem, and we've contaminated this data environment, cleaning is going to be prohibitively expensive, probably impossible," he told The Register.
There's nothing in the article, the Register article, or any of the references that actually claims the data has been polluted.
It's based on speculation made years ago.
The plus side of actually useful LLM/AI applications is that the data is usually a small, curated subset, and it would have to be tested anyway since it has to be used in the real world. I think the main mainstream use of LLM/AI is on small datasets like that, rather than the race for the holy grail of "general" AI.
Fuck. Will this next epoch retrospectively be considered a dark age, not bc of disinformation, but bc after 2022 we were gibbering morons?
It makes me think about how low-background steel has become a precious commodity. Steel that was made prior to the first atomic bombs has a unique value because it's uncontaminated by radioactive fallout.
We have archives of the internet from before AI as we currently know it came into widespread use. It seems like future LLM designers are all going to need to be very crafty about their sources of data and not just ingest everything they crawl.
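As a rough sketch of that, you could mine pre-2022 captures through the Wayback Machine's public CDX API (a real endpoint; the end-of-2021 cutoff is just the one from the quote upthread):

```python
import json
import urllib.parse
import urllib.request

# Query the Wayback Machine's CDX API for captures of a page made
# before 2022, i.e. from archives that predate the generative-AI boom.
params = urllib.parse.urlencode({
    "url": "example.com",
    "to": "20211231",   # only captures up to the end of 2021
    "output": "json",
    "limit": "5",
})
with urllib.request.urlopen(f"https://web.archive.org/cdx/search/cdx?{params}") as resp:
    rows = json.load(resp)

if rows:
    header, captures = rows[0], rows[1:]  # first row is the field names
    for cap in captures:
        snap = dict(zip(header, cap))
        # Each capture can be fetched via its timestamped Wayback URL.
        print(f"https://web.archive.org/web/{snap['timestamp']}/{snap['original']}")
```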
Nice.
wat
Meh, it’s all good enough.
/s
Well, they're very wrong about one point: all datasets are intentionally polluted to poison scraping AI.
They'll either have to pay for expanded datasets or limit themselves to ones that are ethically sourced.