In honor of the dawn of the age of web crawling being for LLMs instead of search engines, I have deleted the robots.txt for https://wookieepedia.org/.
Let the floodgates open. Crawl away, my friends, crawl away.
I feel like an important bit of context for the previous post is that the links work.
ALL the links work.
@inthehands wait a minute there... do I hear echoes of A Plan For Spam?
@jmeowmeow To that, I can only say mvuoooo ruuoau raoaaauvuaua ruuauawoaaou rroaaaavoouoa voiahouoaoa noa, wuoaa mouuoaruaourv ruaouu ruiu
@voltagex HRRAOOOOOWAOOO!
@inthehands OUTSTANDING.
@inthehands (loud laughter). Let the Wookiee win!
@inthehands you might also want to match on /*
@voltagex No way, gotta keep the site organized
There’s not a way to explicitly submit one’s site to OpenAI for crawling, right?
Like…they just do secretive mass crawls on their own schedule, I assume?
@inthehands it…. It still works….
@corycarson Amazing
@corycarson It took some cajoling, but I got GPT to start speaking Wookiee:
@inthehands MUST GO DEEPER
@corycarson The thing is, that is so shockingly similar to the style of “Wookiee” that Wookieepedia uses that it’s hard to believe Wookieepedia wasn’t the training source. But until a few minutes ago, the site’s robots.txt disallowed crawlers for everything except the homepage….
@talby On the home page, IIRC, but yeah
@inthehands I think LLMs mostly use https://commoncrawl.org/ rather than crawling the web themselves. The Internet Archive's Wayback Machine uses Common Crawl as a source and has Wookieepedia so I think it's likely in there already.