John Levine writes:

Anyone got a contact at OpenAI. They have a spider problem.

As I think I have mentioned before, I have the world's lamest content farm at https://www.web.sp.am/. Click on a link or two and you'll get the idea.

Unfortunately, GPTBot has found it and has not gotten the idea. It has fetched over 3 million pages today. Before someone tells me to fix my robots.txt, this is a content farm so rather than being one web site with 6,859,000,000 pages, it is 6,859,000,000 web sites each with one page. Of those 3 million page fetches, 1.8 million were for robots.txt.

It's not like it's hard to figure out what's going on since the pages all look nearly the same, and they're all on the same IP address with the same wildcard SSL certificate.

Amazon's spider got stuck there a month or two ago but fortunately I was able to find someone to pass the word and it stopped. Got any contacts at OpenAI?

R's,
John

PS: If you were wondering what they're using to train GPT-5, well, now you know.
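
A rough illustration of why robots.txt does not help in this setup: robots.txt applies per hostname, so a crawler that obeys it has to request the file once for every new host it meets. The sketch below is a simplified model, not GPTBot's actual logic, and the `example.com` hostnames are made up. It counts only the first robots.txt request per host, so it does not account for the even higher share of robots.txt fetches reported in the message.

```python
from urllib.parse import urlsplit


def count_fetches(urls):
    """Return (robots_fetches, page_fetches) for a crawler that
    requests robots.txt once per previously unseen hostname."""
    seen_hosts = set()
    robots_fetches = 0
    page_fetches = 0
    for url in urls:
        host = urlsplit(url).hostname
        if host not in seen_hosts:
            # First visit to this host: fetch its robots.txt first.
            seen_hosts.add(host)
            robots_fetches += 1
        page_fetches += 1
    return robots_fetches, page_fetches


# One web site with many pages (hypothetical URLs).
single_site = [f"https://example.com/page{i}" for i in range(1000)]

# Many one-page web sites under a wildcard domain (hypothetical URLs).
many_sites = [f"https://site{i}.example.com/" for i in range(1000)]

print(count_fetches(single_site))  # (1, 1000)
print(count_fetches(many_sites))   # (1000, 1000)
```

In the one-site case a single robots.txt covers every page. In the one-page-per-host case each page costs an extra robots.txt request, and no single rule can tell the crawler to stay away from all of the hosts at once.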

https://static.toot.community/media_attachments/files/112/268/194/518/126/354/original/cdafe8dbb67a8acb.png

Notices where this attachment appears

Sam pausiert (weirdmustard@toot.community)'s status on Sunday, 14-Apr-2024 19:04:17 JST

Apparently OpenAI simply keeps stealing, even when you actively exclude them from your websites.
Edit: Yes, good grief, of course you don't exclude them; you tell the crawler "Do not crawl here". The point is: the spiders ignore that directive.
