I think your average geek used to be, like, somewhat academic and erudite and into arcane knowledge, and had some level of good faith in wanting to engage in discussion.
Now it’s all frauds and absolutely braindead Elon stans and crypto dipshits and conservative freaks and people who enjoy and defend watching big tech destroy everything.
https://huggingface.co/datasets/defunct-datasets/the_pile_books3
https://web.archive.org/web/20220522050247/https://huggingface.co/datasets/the_pile_books3
I emphasize “well known” because it was literally in the description when it was initially uploaded to the internet. It was always right out in front that this was all the ebooks from the private torrent tracker Bibliotik. Shawn Presser/books3 never lied about where it came from. As you can see from the archive.org link, that description of its sourcing was on the page in May 2022.
Bibliotik is a well known private tracker for ebooks and even peddles tools for removing DRM from ebooks. So, arguably, not only are the books pirated, but at some point a criminal DMCA violation occurred when the DRM was stripped from them. So OpenAI’s willingness to use it without question to get their company started should be evidence that they’re not concerned about where the data came from or about getting it in more legal ways.
Thank you for the links and reading!