• tal
    link
    fedilink
    English
    arrow-up
    1
    ·
    edit-2
    1 day ago

    You could always just do reverse search on the open dataset to see if it’s an exact copy (or over a threshold).

    True, but “exact copy” almost certainly isn’t going to be what gets produced – and you can have a derivative work that isn’t an exact copy of the original, just generate something that looks a lot like part of the original. Like, you’d want to have a pretty good chance of finding a derivative work.

    And that would mean that anyone who generates a model to would need to provide access their training corpus, which is gonna be huge – the models, which themselves are large, are a tiny fraction the size of the training set – and I’m sure that some people generating models aren’t gonna want to provide all of their training corpus.