GNU social JP
GNU social JP is a Japanese GNU social server.
Conversation

  1. brennen (brennen@federation.p1k3.com)'s status on Wednesday, 25-Mar-2026 02:32:16 JST

    my mental model of the scraping load on most of the public web is basically:

    people want to train models, want them trained on the latest stuff, and there is money for doing this. there is little incentive to be efficient or responsible about it, so we wind up with a bunch of crawlers just absolutely going to town.

    do you actually work in this field and know better than me? am i missing something important?

    • Evan Prodromou (evan@cosocial.ca)'s status on Wednesday, 25-Mar-2026 02:32:14 JST

      @brennen I used to work for Wikimedia Foundation and I do know that they have huge downloadable versions of all the wikis, as well as enterprise support packages for specific users. There's also a good API (well, several good APIs) for reading data. So, there's really not a good reason to scrape Wikipedia directly.

    • Evan Prodromou (evan@cosocial.ca)'s status on Wednesday, 25-Mar-2026 05:18:20 JST

      @brennen doh! Of course.

      I think the only thing that slows them down is rate limiting.

    • brennen (brennen@federation.p1k3.com)'s status on Wednesday, 25-Mar-2026 05:18:21 JST, in reply to Evan Prodromou

      @evan i currently work for WMF, and i concur re: reasons. why i am asking about this anyway is left as an exercise for the reader. :)

      that said, the question could probably just as well have been framed to cover code forges, issue trackers, forums, blogs, etc. wikipedia may be a special case in some sense because it's a particularly rich / high profile resource, but anecdotally the effect is widespread.

GNU social JP is a social network, courtesy of GNU social JP管理人 (the GNU social JP administrator). It runs on GNU social, version 2.0.2-dev, available under the GNU Affero General Public License.

All GNU social JP content and data are available under the Creative Commons Attribution 3.0 license.