Conversation

Notices

K. Ryabitsev (monsieuricon@social.kernel.org)'s status on Friday, 11-Oct-2024 00:31:56 JST
    AI farms can just up and die, please. Add overscraping to the long list of public resources being ruined by commercial greed -- to join overfishing, overlogging, and overgrazing.
翠星石 (suiseiseki@freesoftwareextremist.com)'s status on Friday, 11-Oct-2024 00:31:54 JST
@monsieuricon Comparing the waste of processor cycles to public resources like a forest is questionable, as a processor doesn't wear any more from processing 1 billion cycles a second than from 3.5 billion, while a forest runs out of trees very quickly and likely forever.

It seems the only effective method to curtail that would be to get your lawyers to send the scrapers Cease & Desist letters, noting that you're not clueless, you know that they're clearly planning on infringing the GPLv2, and you'll be enforcing your copyright if it happens.
K. Ryabitsev (monsieuricon@social.kernel.org)'s status on Friday, 11-Oct-2024 00:31:55 JST
      And no, it's not easy to "just throttle them" when it's thousands of IPs all coming from public cloud subnets with user-agents matching common modern browsers.
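The arithmetic behind that point is worth spelling out. Here is a minimal sketch with hypothetical numbers (none of these figures come from social.kernel.org): once the fleet is large enough, each individual IP's request rate falls below anything a per-IP throttle could catch without also blocking real browsers.

```python
# Illustrative arithmetic only: why per-IP throttling fails against a
# distributed scraper fleet. All numbers here are hypothetical.

total_scrape_rate = 200   # aggregate scraper requests/sec across the fleet
fleet_size = 5000         # distinct cloud IPs observed (hypothetical)
browser_burst = 20        # requests/sec a real browser produces loading one page

per_ip_rate = total_scrape_rate / fleet_size
print(f"per-IP scraper rate: {per_ip_rate:.3f} req/s")   # 0.040 req/s

# Any per-IP limit strict enough to catch 0.04 req/s would block every
# legitimate visitor, whose page loads burst hundreds of times higher.
assert per_ip_rate < browser_burst
```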
翠星石 (suiseiseki@freesoftwareextremist.com)'s status on Friday, 11-Oct-2024 01:19:39 JST
      @monsieuricon Throttling is not what you want - you want to temporarily null route such scrapers until they go away.

Too bad git and apt are very bursty, meaning you cannot set a ratelimit low enough to stop scrapers that auto-adjust their rate while also never blocking people who merely run `git clone` (I was wondering why `git clone` wasn't working until I realized that even a high ratelimit wasn't high enough).

The only thing that seems to reveal those scrapers is the hours they spend connected scraping - so maybe a rate plus connection-time ratelimit could work (see the sketch after this notice) - too bad you can pull off a lot of scraping in a few hours.
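A minimal sketch of that rate-plus-connection-time idea, assuming a combined-format nginx access log; the log path, regex, and thresholds are all assumptions for illustration, not a tested tool. It flags IPs that sustain a modest rate over many hours (a scraper's signature) rather than a short burst (a `git clone`'s), and prints Linux null-route commands for review:

```python
#!/usr/bin/env python3
# Hypothetical sketch: flag IPs that both sustain a steady request rate AND
# stay active for hours, then emit null-route commands for manual review.
import re
from collections import defaultdict
from datetime import datetime

LOG = "/var/log/nginx/access.log"   # assumed combined-format log
TIME_FMT = "%d/%b/%Y:%H:%M:%S"
LINE = re.compile(r'^(\S+) \S+ \S+ \[([^ ]+) [^\]]+\]')

first_seen, last_seen, hits = {}, {}, defaultdict(int)

with open(LOG) as f:
    for line in f:
        m = LINE.match(line)
        if not m:
            continue
        ip, ts = m.group(1), datetime.strptime(m.group(2), TIME_FMT)
        first_seen.setdefault(ip, ts)
        last_seen[ip] = ts
        hits[ip] += 1

for ip, n in hits.items():
    active = (last_seen[ip] - first_seen[ip]).total_seconds()
    rate = n / active if active else float("inf")
    # A `git clone` is a short, intense burst; a scraper is a long, steady trickle.
    if active > 3 * 3600 and rate > 0.5:   # thresholds are illustrative only
        print(f"ip route add blackhole {ip}/32  # {n} hits over {active/3600:.1f} h")
```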
翠星石 (suiseiseki@freesoftwareextremist.com)'s status on Friday, 11-Oct-2024 01:25:07 JST
      @wowaname It's much worse.

LLM scrapers scrape every single page and every single file, changing useragents and IPs, auto-adjusting their scrape rate to avoid ratelimits, and even setting a useragent that is "" (which shows up as "-" in nginx logs) or literally "-", which causes you to inadvertently 403 the wrong useragent, or even accidentally 403 all of them (see the sketch after this notice).

Meanwhile, crawlers seem to at least identify themselves with a crawler useragent, which you can 403, or which at least crawl at a rate that uses a negligible amount of bandwidth on any half-decent connection.

      I believe that you're probably being hit by LLM scrapers that are pretending to be spiders.
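The empty-versus-"-" useragent trap above is easy to reproduce. A minimal sketch, where the function is a stand-in for nginx's log behaviour of substituting "-" for an empty or missing variable in the combined log format:

```python
# Sketch of the useragent trap described above. nginx renders an empty or
# absent $http_user_agent as "-" in its default log format, so three
# different clients become indistinguishable in the access log.

def logged_user_agent(header_value):
    """Stand-in for how a combined-format nginx log renders the UA field."""
    return header_value if header_value else "-"

print(logged_user_agent(None))  # header absent         -> "-"
print(logged_user_agent(""))    # header present, empty -> "-"
print(logged_user_agent("-"))   # literal "-" header    -> "-"

# A blocklist built from log entries matching "-" therefore cannot tell these
# apart, which is how you end up 403ing the wrong clients, or all of them.
```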
opal (wowaname@freesoftwareextremist.com)'s status on Friday, 11-Oct-2024 01:25:08 JST
      @monsieuricon is it genuinely worse than all the misbehaved spiders we've always had? i'm asking because i haven't been able to keep any websites up this year to monitor my logs
