Asta [AMP] (aud@fire.asta.lgbt)'s status on Thursday, 03-Apr-2025 14:10:37 JST

heeeeeey #python cats!
anyone know of a decent multi-language text tokenizer?
To be clear: I am explicitly not looking to use it for generative-AI or other [slop/scab/labor theft] purposes.
Not sure of the specific terms I need to be looking up, frankly, since I'm mostly just finding Python's built-in tokenize library, which is for tokenizing Python source code rather than natural language.
Thank you!
#techPosting
Adrianna Tan (skinnylatte@hachyderm.io)'s status on Thursday, 03-Apr-2025 14:10:36 JST
@aud nltk?
Asta [AMP] (aud@fire.asta.lgbt)'s status on Thursday, 03-Apr-2025 14:27:41 JST
@trochee@dair-community.social @skinnylatte@hachyderm.io Yeah, I'm trying to think of how I would construct a reverse index at the database/data model level, and I want to not bake in assumptions about the language at this level.
So having a quick run-through of what's considered "best in class", library-wise, for language handling should give me an idea of what the input and output need to look like (and explicitly, how I should store them, etc.).
Jeremy Kahn (trochee@dair-community.social)'s status on Thursday, 03-Apr-2025 14:27:41 JST
I'm sure that @skinnylatte knows this too -- as a far more developed polyglot than me -- but things get _really_ weird when you tokenize orthographies that don't use whitespace at all (or almost never):
Chinese and Japanese are the obvious ones (they have pretty good but not well-standardized tokenizers), but Thai is also super unreliable about whitespace/word boundaries, and there are a bunch of South Asian scripts with similar confusion.
Jeremy Kahn (trochee@dair-community.social)'s status on Thursday, 03-Apr-2025 14:27:43 JST
I mean, assuming that "split on whitespace, drop punctuation" isn't good enough -- which it isn't, except for English.
Asta [AMP] (aud@fire.asta.lgbt)'s status on Thursday, 03-Apr-2025 14:27:45 JST
@skinnylatte@hachyderm.io ooooh, this looks extremely promising and probably exactly what I need! Thank you!
Jeremy Kahn (trochee@dair-community.social)'s status on Thursday, 03-Apr-2025 14:27:45 JST
nltk is my first tool out of the box when doing personal NLP projects, too.
Adrianna Tan (skinnylatte@hachyderm.io)'s status on Thursday, 03-Apr-2025 14:31:24 JST
@aud @trochee oh yes, there's that too. Honestly I need to research it more, but just knowing how some of those languages work (and Chinese), it's hard to break a word apart into different components the way you would for... tokenizing. Haha
Asta [AMP] (aud@fire.asta.lgbt)'s status on Thursday, 03-Apr-2025 14:31:25 JST
@trochee@dair-community.social @skinnylatte@hachyderm.io (I'm not a polyglot, but thankfully I'm not strictly a monoglot, either.) I'm pretty much going in with the idea that most assumptions I'd make about language from my own knowledge won't hold up, especially for languages that aren't as widely spoken or read, which is exactly where I'd want to pay special attention.
Hmmmm. I wonder if there's a language that is both A. "underserved" by technical tools and B. rather difficult to tokenize? Sounds like a number of languages already fill condition B... and probably fill condition A as well.
Is there a better... mmm, model, either in the computational sense or otherwise, with which to approach breaking up the text? Or could tokenization work "in theory", and it's just that not enough work has been done?
Asta [AMP] (aud@fire.asta.lgbt)'s status on Thursday, 03-Apr-2025 14:31:25 JST
@trochee@dair-community.social @skinnylatte@hachyderm.io (to phrase my question maybe more accurately: is tokenization a concept basically born out of how text is written in Romance/Germanic/etc. languages, and is it not so appropriate to try to model certain South Asian languages with it?)
Asta [AMP] (aud@fire.asta.lgbt)'s status on Thursday, 03-Apr-2025 14:42:15 JST
@skinnylatte@hachyderm.io @trochee@dair-community.social Now that I think about it, if you don't structure your data model on the assumption that your phrase will be "a set of tokens that are strings" matching against "tokens that are strings", that opens up the field pretty widely. If your phrase is instead an image, and your 'documents' are also images, well... (I'm not sure that'd be easy to 'reverse index', but nothing precludes you from doing it. Even a poor attempt at 'tokenizing' images in this manner, so to speak, would likely yield some vaguely useful results.)
Adrianna Tan (skinnylatte@hachyderm.io)'s status on Thursday, 03-Apr-2025 14:42:15 JST
@aud @trochee I would argue that Mandarin almost always has pretty clear meaning from single words as well. You would be able to guess the content quite accurately, I think.
Asta [AMP] (aud@fire.asta.lgbt)'s status on Thursday, 03-Apr-2025 14:42:16 JST
@skinnylatte@hachyderm.io @trochee@dair-community.social haaaad a feeling, hah.
Well, and even with English, 'tokenization' is a pretty poor concept for capturing meaning.*
Hmmmmmmm... perhaps what I should do, then, is not bake in any assumption that I'll even be doing something like tokenization, but instead model the relationship between 'phrase' and 'document' as lightly as possible for constructing a reverse index, so that I can use whatever set of tools is appropriate for each language.
* big asterisk on this part because I suspect this is a very large and very deep can of worms and I should be careful what I say here, because even what is meant by meaning itself is probably rather contextual and...
Adrianna Tan (skinnylatte@hachyderm.io)'s status on Thursday, 03-Apr-2025 14:48:19 JST
@trochee @aud oh yeah, true. Thanks folks for the banter, it’s going to give me a bit of a push to research more.
Btw, PSU has a multilingual NLP/ML lab.
Jeremy Kahn (trochee@dair-community.social)'s status on Thursday, 03-Apr-2025 14:48:20 JST
I think that is often the case, but that may be an artifact of the language-culture having a deep love for folk etymology.
Gaoxing means happy and can be understood as "tall + prosper" (?), and while I can sorta see how they fit together to mean "glad" (it's not any weirder than "in high spirits" in English), it's a bit of a stretch to say that the meaning is compositional.