To understand what's going on here, know these things:

- OpenAI is the company that makes ChatGPT.
- A spider is a kind of bot that autonomously crawls the web and sucks up web pages.
- robots.txt is a standard text file that most web sites use to inform spiders whether or not they have permission to crawl the site; basically a No Trespassing sign for robots.
- OpenAI's spider is ignoring robots.txt (very rude!).
- The web.sp.am site is a research honeypot created to trap ill-behaved spiders, consisting of billions of nonsense garbage pages that look like real content to a dumb robot.
- OpenAI is training its newest ChatGPT model on this incredibly lame content, having consumed over 3 million pages and counting...

It's absurd and horrifying at the same time.
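For the curious: here's a minimal sketch of how the robots.txt handshake is *supposed* to work, using Python's standard-library parser. The robots.txt content and the example.com URLs are made up for illustration; GPTBot is the user-agent name OpenAI publishes for its crawler. A well-behaved spider checks this file before fetching anything.

```python
from urllib.robotparser import RobotFileParser

# An illustrative robots.txt that bars OpenAI's crawler (GPTBot)
# from the whole site while allowing everyone else.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A polite spider asks before crawling; a rude one just doesn't call this.
print(parser.can_fetch("GPTBot", "https://example.com/page"))         # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/page"))   # True
```

The catch, as the post above shows, is that robots.txt is purely voluntary: nothing technically stops a spider from skipping the check and crawling anyway.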
https://static.toot.community/media_attachments/files/112/268/194/879/706/394/original/b2d500df292cc186.png