@hipsterelectron Anyways, I’m definitely a Hobbyist Slash Professionally Adjacent nerd rather than an expert on any of this stuff, but loving the conversation, thank you!
@hipsterelectron yeah, I think there’s absolutely truth to some of the broader ideas about languages and language-like things having consistently occurring qualities and structures, and the *nature of human existence and social experience* means we’re likely to produce statistically similar things across lots of languages, but it seems like a matter of the tool fitting the need, rather than THE NEED BEING SHAPED LIKE X, INHERENTLY, GENETICALLY, etc
@hipsterelectron I’ve always tended towards the idea that language representations without experience/understanding of the language, and of the purpose the language is being used for, are kind of a “teach a horse math” trick: it can be impressive but struggles badly at solving the actual problems we use language for
@hipsterelectron having done a terrifying mess of large-scale parsing work on a crawler/scraper project for clients, I have to agree. Fascinating project, I’m gonna give it a look!
@hipsterelectron Which… well, I suppose that IS exactly the load of caveats you mention, lol.
I think the reason I find it intriguing is that those questions of language/meaning/communication are all really important to me and have been long before LLMs, but unfortunately there’s little cultural appetite or capacity for them, which means the really critical nuance is collapsed to “But I asked it! And it answered right!” anecdotes
@hipsterelectron Yeah, like, I’d say that “making stuff up” is a category error; it’s not a question of invention or imagination but of… slavish devotion to training data and syntactic/semantic patterns without regard for pragmatics or more complex or nuanced meaning that’s more than skin deep. Thus “shortening” rather than “summarizing” — it won’t simply invent unrelated text, but there is no assurance that what it produces is in fact “the theme” or the important elements at all
@hipsterelectron the best description I’ve ever managed is that LLMs are really good at *shortening*, but not *summarizing*. The more neutral, factual, and verbose the original, the more useful that is — but that’s a heck of a caveat.
@hipsterelectron @ireneista @adrienne the PDF.js project is actually an interesting one to experiment with; among other things it handles a lot of the document-level stuff, and lets you hook your own logic in to manage the page by page conversion of a PDF to text: “here’s a pile of text and graphic objects with metadata about each one, feel free to iterate them and give us back a string when you’re done!” Etc
@ireneista @hipsterelectron @adrienne yyyyyup. PDF really truly is a standard meant to reproduce a visual design; PDF to text, even without OCR, uses wild techniques like “get the XY coordinates of every word on the page and extrapolate ‘sentences’ using hope and heuristics”
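That XY-coordinate heuristic can be sketched roughly like this — a minimal, illustrative take, assuming items shaped like what pdfjs-dist’s `page.getTextContent()` returns (each item has a `str` and a `transform` matrix whose last two entries are the X/Y page position); the function name and the line-grouping rule are my own, not anything PDF.js provides:

```javascript
// Reconstruct "lines" of text from positioned PDF text items.
// Item shape modeled on pdfjs-dist getTextContent(): { str, transform }.
function itemsToText(items) {
  // Group items by (rounded) Y coordinate — the "hope" part: words on
  // the same visual line usually share a baseline, but not always.
  const lines = new Map();
  for (const item of items) {
    const x = item.transform[4];
    const y = Math.round(item.transform[5]);
    if (!lines.has(y)) lines.set(y, []);
    lines.get(y).push({ x, str: item.str });
  }
  // PDF coordinates grow upward from the bottom of the page, so sort
  // lines by descending Y, then each line left-to-right by X.
  return [...lines.entries()]
    .sort((a, b) => b[0] - a[0])
    .map(([, words]) =>
      words.sort((a, b) => a.x - b.x).map((w) => w.str).join(" ")
    )
    .join("\n");
}

// Mock items; a real call would feed in page.getTextContent().items
const items = [
  { str: "world", transform: [1, 0, 0, 1, 120, 700] },
  { str: "Hello", transform: [1, 0, 0, 1, 72, 700] },
  { str: "Second line", transform: [1, 0, 0, 1, 72, 680] },
];
```

Even this toy version shows where it breaks: multi-column layouts, rotated text, and slightly uneven baselines all defeat the “same Y means same sentence” assumption.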
This is the kind of article that’s incredibly rewarding to both create and consume: it breaks down multiple complex problems, explains the pros and cons of different paths forward, and advocates without presenting a straw-man of the other options. That is a really impressive achievement. https://front-end.social/@jensimmons/113346886761140404