• @jmcs@discuss.tchncs.de
    English
    26
    1 year ago

    I guess they will get to analyze OpenAI’s dataset during discovery. I bet OpenAI didn’t have authorization to use even 1% of the content they used.

    • @maynarkh@feddit.nl
      English
      15
      1 year ago

      That’s why they don’t feel they can operate in the EU: the EU will require AI companies to publish which datasets they trained their solutions on.

    • @Jaded@lemmy.dbzer0.com
      English
      7
      1 year ago

      Things might change, but right now you simply don’t need anyone’s authorization.

      Hopefully it doesn’t change: only a handful of companies have the data, or the funds to buy it, so a change would kill any kind of open-source or low-priced endeavour.

      • @Flaky@iusearchlinux.fyi
        English
        4
        1 year ago

        FWIW, Common Crawl - a free, openly available dataset of crawled web pages - was used by OpenAI for GPT-3 as well as for EleutherAI’s GPT-NeoX. Maybe for GPT-3.5/ChatGPT as well, but they’ve been quiet about that.