GNU social JP
GNU social JP is a GNU social server in Japan.

Conversation

Notices

  • Russ Garrett (russss@chaos.social)'s status on Monday, 07-Apr-2025 21:51:05 JST
    in reply to Harry Wood

    @harry_wood oh it's probably LLMs or something LLM-adjacent. But I doubt it's the big AI players who are responsible for the excessive scraping.

    In conversation about a month ago from chaos.social

    clacke likes this.
  • Harry Wood (harry_wood@en.osm.town)'s status on Monday, 07-Apr-2025 21:51:06 JST
    in reply to Russ Garrett

    @russss Are you saying the data isn't even necessarily being used for training LLMs? The problem is just correlated with the rise of LLMs, because LLMs are making it a lot easier to write scrapers (and I guess ChatGPT will also happily advise on how to bypass mitigations).

    In conversation about a month ago
  • Russ Garrett (russss@chaos.social)'s status on Monday, 07-Apr-2025 21:51:07 JST
    in reply to Harry Wood

    @harry_wood I don't think it's the big AI companies that are scraping excessively; it's random people who probably got an LLM to write their scraper bots. To make it more confusing, in some cases they're cloning the user-agents of the major AI bots.

    In conversation about a month ago
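The user-agent cloning Russ describes works because the User-Agent header is entirely self-declared: any client can send the exact string a well-known crawler advertises. A minimal Python sketch of the trick (the target URL is a placeholder and the GPTBot version string is an assumption, shown only to illustrate why user-agent filtering proves so little):

    # Illustrative only: a scraper claiming to be OpenAI's GPTBot simply
    # by copying its advertised User-Agent string (version assumed).
    import urllib.request

    CLONED_UA = ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); "
                 "compatible; GPTBot/1.0; +https://openai.com/gptbot")

    req = urllib.request.Request(
        "https://example.com/",  # placeholder target
        headers={"User-Agent": CLONED_UA},
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.status, "-", len(resp.read()), "bytes served to a fake GPTBot")

Because of this, a server that blocks (or exempts) traffic by user-agent alone cannot tell the real crawler from an imitator.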
  • Harry Wood (harry_wood@en.osm.town)'s status on Monday, 07-Apr-2025 21:51:08 JST

    Vicious criticism of LLMs from this sysadmin who has to deal with their scrapers: https://drewdevault.com/2025/03/17/2025-03-17-Stop-externalizing-your-costs-on-me.html
    The LLM scraper problem seems surprising to me. The makers of the big new generative AI systems are mostly big-tech firms. Don't they value their reputation, or even the reputation of the AI concept overall, too highly to commission these cowboys to do their scraping? But maybe they've already decided that, given the copyright risk, it's best done at arm's length via shady intermediaries.

    In conversation about a month ago

    Attachments

    1. Please stop externalizing your costs directly into my face

    clacke repeated this.
  • Harry Wood (harry_wood@en.osm.town)'s status on Friday, 25-Apr-2025 17:32:10 JST

    More on the LLM scraper problem: https://go-to-hellman.blogspot.com/2025/03/ai-bots-are-destroying-open-access.html by @gluejar. "Thousands of developer hours are being spent on defense against the dark bots and those hours are lost to us forever. We'll never see the wonderful projects and features they would have come up with in that time."

    In conversation about 22 days ago

    Attachments

    1. AI bots are destroying Open Access
       There's a war going on on the Internet. AI companies with billions to burn are hard at work destroying the websites of libraries, archives, ...

    Haelwenn /элвэн/ :triskell: and clacke like this.
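On the defensive side alluded to above: because user-agents can be cloned, the usual way operators distinguish a genuine crawler from an imitator is forward-confirmed reverse DNS, alongside the IP ranges the major crawler operators publish. A rough sketch in Python; the helper name and sample IP are illustrative assumptions, not anything taken from this thread:

    # Forward-confirmed reverse DNS: resolve the client IP to a hostname,
    # require it to sit under the vendor's domain, then resolve that
    # hostname back and check it maps to the same IP.
    import socket

    def crawler_claim_checks_out(ip, vendor_suffixes):
        try:
            host, _aliases, _addrs = socket.gethostbyaddr(ip)  # IP -> name
        except OSError:
            return False
        if not host.endswith(vendor_suffixes):  # hostname under vendor domain?
            return False
        _name, _al, addrs = socket.gethostbyname_ex(host)  # name -> IPs
        return ip in addrs  # forward-confirmed

    # A request claiming to be Googlebot should reverse-resolve under
    # googlebot.com or google.com:
    print(crawler_claim_checks_out("66.249.66.1", (".googlebot.com", ".google.com")))

A check like this only unmasks imitators; it does nothing about the volume from scrapers that never claimed to be a known bot in the first place.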


GNU social JP is a social network, courtesy of the GNU social JP administrator. It runs on GNU social, version 2.0.2-dev, available under the GNU Affero General Public License.

All GNU social JP content and data are available under the Creative Commons Attribution 3.0 license.