LLM scrapers are taking down FOSS projects’ infrastructure, and it’s getting worse.

  • sudo@programming.dev · 1 day ago

    What’s confusing the hell out of me is: why are they bothering to scrape the git blame page? Just download the entire git repo and feed that into your LLM!
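
    A minimal sketch of that idea (the repo URL and file name are placeholders): clone once, then read blame locally instead of hammering the web UI.

    ```ts
    // Sketch: clone once, then read blame locally instead of scraping pages.
    import { execFileSync } from "node:child_process";

    const repoUrl = "https://example.com/some/project.git"; // placeholder
    const dest = "./project";

    // One clone replaces thousands of per-file blame page requests.
    execFileSync("git", ["clone", "--quiet", repoUrl, dest]);

    // Full blame for any file, straight from the local object store.
    const blame = execFileSync("git", ["-C", dest, "blame", "README.md"], {
      encoding: "utf8",
    });
    console.log(blame.split("\n").length, "blame lines, zero page loads");
    ```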

    9/10 times the best solution is to block nonresidential IPs. Residential proxies exist, but they’re far more expensive than cloud proxies, and providers will ask questions. Residential proxies are sketch AF and basically guarded like munitions. Some rookie LLM maker isn’t going to figure that out.
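
    The gist of the block, as a sketch: resolve the client IP to an ASN (say, against a local GeoLite2-ASN database) and reject the known cloud networks. asnForIp is a stub here, and the ASN list is a tiny illustrative sample.

    ```ts
    // Sketch: resolve client IP -> ASN, reject known cloud networks.
    const CLOUD_ASNS = new Set<number>([
      16509, // Amazon (AWS)
      15169, // Google
      8075,  // Microsoft
      14061, // DigitalOcean
      16276, // OVH
      24940, // Hetzner
    ]);

    function asnForIp(ip: string): number | undefined {
      // Stub: in production, query a GeoLite2-ASN database or similar here.
      return undefined;
    }

    function isLikelyDatacenter(ip: string): boolean {
      const asn = asnForIp(ip);
      return asn !== undefined && CLOUD_ASNS.has(asn);
    }
    ```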

    Anubis also sounds trivial to beat. If it’s just crunching numbers and not attempting to fingerprint the browser, then it’s just a case of feeding the page into Playwright and moving on.
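
    As a sketch of how little that takes (the URL is a placeholder; a real run may need an extra wait for the post-challenge redirect):

    ```ts
    // Sketch: let a real browser run the challenge script, then take the page.
    import { chromium } from "playwright";

    const url = "https://git.example.com/blame/README.md"; // placeholder

    const browser = await chromium.launch();
    const page = await browser.newPage();

    // The challenge page crunches its hashes, sets its cookie, and redirects;
    // the scraper just waits out the one-time delay.
    await page.goto(url, { waitUntil: "networkidle" });

    const html = await page.content();
    await browser.close();
    console.log(`${html.length} bytes scraped`);
    ```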

    • refalo@programming.dev · edited · 1 day ago

      I don’t like the approach of banning nonresidential IPs. I think it’s discriminatory and unfairly blocks out corporate/VPN users and others we might not even be thinking about. I realize there is a bot problem but I wish there was a better solution. Maybe purely proof-of-work solutions will get more popular or something.

      • sudo@programming.dev · 20 hours ago

        Proof of Work is a terrible solution because it assumes computational costs are a significant expense for scrapers compared to proxy costs. It’ll never come close to costing the same as residential proxies, and meanwhile every smartphone user will be complaining about your website draining their battery.
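
        Back-of-the-envelope, with every number below an assumed ballpark figure rather than a measurement:

        ```ts
        // All figures are illustrative assumptions, not measurements.
        const vcpuCostPerHour = 0.04;    // assumed cloud vCPU price, USD
        const powSecondsPerPage = 1;     // assumed challenge difficulty
        const powCostPerPage = (vcpuCostPerHour / 3600) * powSecondsPerPage;

        const proxyCostPerGB = 5;        // assumed residential proxy price, USD
        const pageSizeGB = 100e3 / 1e9;  // assumed ~100 KB per page
        const proxyCostPerPage = proxyCostPerGB * pageSizeGB;

        console.log(`PoW:   $${powCostPerPage.toExponential(2)} per page`);   // ~1.1e-5
        console.log(`Proxy: $${proxyCostPerPage.toExponential(2)} per page`); // ~5.0e-4
        // Under these assumptions the proxy bill is ~45x the PoW compute,
        // while every legitimate phone still pays the full battery cost.
        ```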

        You can do something like only challenging data center IPs, but you’ll have to do better than Proof-of-Work. Canvas fingerprinting would work.
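
        Client-side it’s roughly this (a minimal sketch; real deployments hash many more rendering quirks):

        ```ts
        // Minimal browser-side canvas fingerprint: rendering differs subtly
        // across GPU/driver/font stacks, and headless browsers tend to
        // cluster on a handful of well-known hashes.
        async function canvasFingerprint(): Promise<string> {
          const canvas = document.createElement("canvas");
          canvas.width = 240;
          canvas.height = 60;
          const ctx = canvas.getContext("2d")!;
          ctx.textBaseline = "top";
          ctx.font = "16px Arial";
          ctx.fillStyle = "#f60";
          ctx.fillRect(10, 10, 120, 30);
          ctx.fillStyle = "#069";
          ctx.fillText("fingerprint \u{1F343}", 4, 20);

          // Hash the rendered pixels; the digest is the fingerprint.
          const bytes = new TextEncoder().encode(canvas.toDataURL());
          const digest = await crypto.subtle.digest("SHA-256", bytes);
          return [...new Uint8Array(digest)]
            .map((b) => b.toString(16).padStart(2, "0"))
            .join("");
        }
        ```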

        • refalo@programming.dev · 5 hours ago

          > Proof of Work is a terrible solution

          Hard disagree, because:

          > it assumes computational costs are a significant expense for scrapers compared to proxy costs

          The assumption is correct. PoW has been proven to significantly reduce bot traffic… meanwhile the mere existence of residential proxies has exploded the availability of easy bot campaigns.
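
          For concreteness, this is the kind of challenge at issue: a hashcash-style search for a nonce whose SHA-256 digest has enough leading zero bits (the general shape Anubis uses; the difficulty here is illustrative).

          ```ts
          // Hashcash-style PoW: find a nonce whose SHA-256 digest has at
          // least `zeroBits` leading zero bits. Difficulty 16 averages ~65k
          // hashes: milliseconds for one visitor, but it adds up fast
          // across millions of scraper requests.
          import { createHash } from "node:crypto";

          function solve(seed: string, zeroBits: number): number {
            for (let nonce = 0; ; nonce++) {
              const digest = createHash("sha256").update(seed + nonce).digest();
              let bits = 0;
              for (const byte of digest) {
                if (byte === 0) { bits += 8; continue; }
                bits += Math.clz32(byte) - 24; // leading zero bits in this byte
                break;
              }
              if (bits >= zeroBits) return nonce;
            }
          }

          console.log(solve("challenge-seed", 16));
          ```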

          > Canvas fingerprinting would work.

          Demonstrably false… people already do this with abysmal results. Need to visit a clownflare site? Endless captcha loops. No thanks.