caption

a screenshot of the text:

Tech companies argued in comments on the website that the way their models ingested creative content was innovative and legal. The venture capital firm Andreessen Horowitz, which has several investments in A.I. start-ups, warned in its comments that any slowdown for A.I. companies in consuming content “would upset at least a decade’s worth of investment-backed expectations that were premised on the current understanding of the scope of copyright protection in this country.”

underneath the screenshot is the “Oh no! Anyway” meme, featuring two pictures of Jeremy Clarkson saying “Oh no!” and “Anyway”

screenshot (copied from this mastodon post) is of a paragraph of the NYT article “The Sleepy Copyright Office in the Middle of a High-Stakes Clash Over A.I.”

  • OmnipotentEntity@beehaw.org
    10 months ago

    What LLMs and other models are doing is analogous to reading a book and writing a book report.

    It is purported to be analogous to that. But given that it can also reproduce nearly entire articles word for word from a short prompt, the analogy you are drawing is clearly flawed. That article and many others are inside the LLM, encoded in the weights and biases of the network. They have been copied into the network in encoded form and can be retrieved.

    The Pile is 825GiB of text. GPT-4 is about 400 billion parameters (OpenAI has not published the count), and at 2 bytes per parameter that is 800GB, or about 745GiB, of data. There’s certainly enough redundancy in whatever corpus they’re using to just memorize the entire thing and still have sufficient network capacity left over to actually make some sense of it.
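
    To make the size comparison easy to check, here is a rough sketch of the arithmetic. The 400 billion parameter count and the 2 bytes per parameter are the figures assumed in this comment, not confirmed numbers, and `model_size_bytes` is just an illustrative helper name. The sketch also keeps decimal gigabytes (GB) and binary gibibytes (GiB) separate, since the two differ by about 7%.

```python
GB = 10**9   # decimal gigabyte
GIB = 2**30  # binary gibibyte


def model_size_bytes(n_params: int, bytes_per_param: int = 2) -> int:
    """Raw storage needed for the weights, ignoring any file-format overhead."""
    return n_params * bytes_per_param


# Assumed figures from the comment above, not confirmed values.
params = 400_000_000_000   # "about 400 billion parameters"
size = model_size_bytes(params)

# The Pile's stated size, converted to bytes.
pile_bytes = 825 * GIB

print(f"model weights: {size / GB:.0f} GB = {size / GIB:.1f} GiB")
print(f"The Pile:      {pile_bytes / GIB:.0f} GiB")
print(f"weights / corpus: {size / pile_bytes:.2f}")
```

    Under those assumptions the weights come out a little smaller than the raw corpus, so the two are the same order of magnitude, which is the point of the comparison.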