@trochee@dair-community.social @skinnylatte@hachyderm.io (I'm not a polyglot, but thankfully I'm not strictly a monoglot, either) I'm pretty much going with the idea that most assumptions I would make about language based on my knowledge aren't going to hold up, especially in languages that aren't as widely spoken or read, which is sort of where I would want to pay special attention.
Hmmmm. I wonder if there's a language that is both A. "underserved" by technical tools and B. rather difficult to tokenize? Sounds like a number of languages already fill condition B... and probably fill condition A.
Is there a better... mmm, model, either in the computational sense or otherwise, with which to approach breaking up the text? Or could tokenization work "in theory", and it's just that not enough work has been done yet?
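(Not part of the original post, but a minimal sketch of one thing "difficult to tokenize" can mean in practice: naive whitespace tokenization works passably for English and falls flat for languages that don't put spaces between words. The Japanese and Thai example sentences are my own illustrations, not from the post.)

```python
# Minimal sketch: why whitespace-based tokenization breaks down for
# languages that don't mark word boundaries with spaces.

english = "the cat sat on the mat"
japanese = "猫がマットの上に座った"   # roughly "the cat sat on the mat"
thai = "แมวนั่งบนเสื่อ"               # same sentence in Thai

for text in (english, japanese, thai):
    tokens = text.split()  # naive whitespace tokenization
    print(len(tokens), tokens)

# english  -> 6 tokens, one per word
# japanese / thai -> 1 "token" each: the whole sentence, since there is
# no whitespace to split on. Doing better needs a dictionary/statistical
# segmenter or a learned subword model (BPE, unigram LM, etc.), and those
# are only as good as the training data available for the language,
# which is exactly the "underserved" problem.
```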
Asta [AMP] (aud@fire.asta.lgbt), Thursday, 03-Apr-2025 14:31:25 JST