I haven't thought "I should try to build my *own* web spider, then maybe I could find things" since... well, since 1998.
:/
Not a bad idea! My (vaguely) related idea is to fork a Fediverse app / make a browser plugin that caches and indexes only the Fediverse posts I've browsed - whether on my timeline, on the explore page, or wherever. Then I could search the content I've had access to, and I don't feel like I'd be violating anyone's privacy by caching and indexing, exclusively for my own personal use, content I've already been allowed to view.
Obviously, it'd raise other problems if I started crawling and indexing content for public use, but I think using a computer to augment my own fallible memory is acceptable, so I can find the posts I wanted to remember two weeks later.
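For concreteness, here's a minimal sketch of what the personal-index half of that could look like, assuming the capture side (a plugin forwarding viewed posts somewhere local) already exists. The table layout and function names are illustrative assumptions, not any existing Fediverse API; it just leans on SQLite's FTS5 full-text search, which ships with most SQLite builds.

```python
# A sketch of the "index only what I've seen" idea. All names here are
# hypothetical; nothing is crawled -- posts are stored only when shown to me.
import sqlite3

def open_index(path="seen_posts.db"):
    """Open (or create) a local full-text index of viewed posts."""
    db = sqlite3.connect(path)
    # FTS5 is included in most SQLite builds; if yours lacks it,
    # a plain table with LIKE queries would work, just more slowly.
    db.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS posts USING fts5("
        "post_id, author, content, tags)"
    )
    return db

def remember(db, post_id, author, content, tags):
    """Store a post I've actually been shown."""
    db.execute("INSERT INTO posts VALUES (?, ?, ?, ?)",
               (post_id, author, content, " ".join(tags)))
    db.commit()

def recall(db, query):
    """Full-text search over only the posts I've already seen."""
    return db.execute(
        "SELECT post_id, author FROM posts WHERE posts MATCH ?", (query,)
    ).fetchall()
```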
Well, I'm thinking of doing something a little smaller and more targeted, like this:
https://sauropods.win/@futurebird/113744151630008623
Because making a proper full web spider is a massive project. And even my small idea could be too big.
@futurebird I've recently been thinking about what it would take to run my own spider... for the first time in about 25 years. The search results I'm getting lately are so bad that a DIY spider might actually improve the situation for me.
The problem is only partly that Google has gotten so much worse. It's also that SEO, botspam, LLM spam, and affiliate-link spam have gotten so good that it's functionally impossible to algorithmically filter them out of the results. So just running your own spider is unlikely to matter much.
@JessTheUnstill @futurebird @tsturm This is what I wish instances would do, optionally - also warning you, if a post you view this way was deleted, that further publication (even off-platform, except as evidence of abuse) may result in moderation action against your account.
To remove & externalise bookmark dependency from browsers, I’ve resorted to manually collecting & curating links as I find them, with personal notes+tags reminding me why they’re of interest. They’re always 100% searchable & findable.
Given the inconsiderate, effectively-DDoS behavior of AI scraper bots, adding to that melee with more robo-indexing may not produce a usable search index - https://mastodon.social/@dahukanna/113741237599333856
Importantly, this database would grow over time; it wouldn't be focused on "what's new" ... Basically, I have a high level of trust in the way people #onhere associate hashtags with links, and I think that'd be a great way to find things.
In fact I do it manually often enough, but it's time-consuming. I just want all of the links sometimes.
I think so, yes. Basically I want a database of every single link that's been posted to *my* feed. It would also contain any hashtags used with the link, plus the post ID so I can go back and see the context.
Next I'd strip out all of the "big sites" and focus more on the obscure.
Then if I'm curious about, say, #fossils, I would get links mentioned in that context.
And if #fossils is used with the tag #crinoids often, I could move laterally and find more links.
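As a concrete sketch of that database, here's one way the tag-and-link store, the "strip out the big sites" filter, and the lateral #fossils-to-#crinoids hop could fit together. The schema, the function names, and the BIG_SITES list are all my own assumptions for illustration, not any existing Fediverse API.

```python
# A sketch of the per-feed link database described above.
import sqlite3
from urllib.parse import urlparse

# Hypothetical blocklist for "strip out all of the big sites".
BIG_SITES = {"youtube.com", "twitter.com", "amazon.com"}

def is_big_site(url):
    host = urlparse(url).hostname or ""
    return any(host == s or host.endswith("." + s) for s in BIG_SITES)

def open_db(path="feed_links.db"):
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS links ("
               "url TEXT, tag TEXT, post_id TEXT)")
    return db

def add_post(db, post_id, urls, tags):
    """Record every (url, tag) pair from one post in my feed,
    skipping links to the big sites."""
    keep = [u for u in urls if not is_big_site(u)]
    db.executemany("INSERT INTO links VALUES (?, ?, ?)",
                   [(u, t, post_id) for u in keep for t in tags])
    db.commit()

def links_for(db, tag):
    """All links ever posted to my feed alongside a given hashtag,
    with the post ID so I can go back and see the context."""
    return db.execute("SELECT DISTINCT url, post_id FROM links "
                      "WHERE tag = ?", (tag,)).fetchall()

def related_tags(db, tag):
    """Tags that co-occur with `tag` in the same posts, so I can
    move laterally, e.g. from #fossils to #crinoids."""
    return db.execute(
        "SELECT l2.tag, COUNT(*) AS n FROM links l1 "
        "JOIN links l2 ON l1.post_id = l2.post_id "
        "WHERE l1.tag = ? AND l2.tag != ? "
        "GROUP BY l2.tag ORDER BY n DESC", (tag, tag)
    ).fetchall()
```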
… extract links from within the post and links to the source post?
I'm thinking of something much more modest: