Asta [AMP] (aud@fire.asta.lgbt)'s status on Thursday, 03-Apr-2025 14:10:37 JST

heeeeeey #python cats!
anyone know of a decent multi-language text tokenizer?
To be clear: I am explicitly not looking to use it for generative-AI or other [slop/scab/labor theft] purposes.
Not sure of the specific terms I need to be looking up, frankly, since I'm mostly just finding Python's built-in tokenize library, which is for tokenizing Python source code rather than natural language.
Thank you!
#techPosting
Adrianna Tan (skinnylatte@hachyderm.io)'s status on Thursday, 03-Apr-2025 14:10:36 JST
@aud nltk?
Asta [AMP] (aud@fire.asta.lgbt)'s status on Thursday, 03-Apr-2025 14:27:41 JST
@trochee@dair-community.social @skinnylatte@hachyderm.io Yeah, I'm trying to think of how I would construct a reverse index at the database/data model level, and I want to not bake in assumptions about the language at this level.
So having a quick run-through of what's considered "best in class", library-wise, for language handling should give me an idea of what the input and output need to look like (and explicitly, how I should store them, etc.).
Jeremy Kahn (trochee@dair-community.social)'s status on Thursday, 03-Apr-2025 14:27:41 JST
I'm sure that @skinnylatte knows this too -- as a far more developed polyglot than me -- but things get _really_ weird when you tokenize orthographies that don't use whitespace at all (or almost never):
Chinese and Japanese are the obvious ones (they have pretty good but not well-standardized tokenizers), but Thai is also super unreliable about whitespace/word boundaries, and there are a bunch of South Asian scripts with similar confusion.
Jeremy Kahn (trochee@dair-community.social)'s status on Thursday, 03-Apr-2025 14:27:43 JST
I mean, assuming that "split on whitespace, drop punctuation" isn't good enough -- which it isn't, except for English.
Asta [AMP] (aud@fire.asta.lgbt)'s status on Thursday, 03-Apr-2025 14:27:45 JST
@skinnylatte@hachyderm.io ooooh, this looks extremely promising and probably exactly what I need! Thank you!
Jeremy Kahn (trochee@dair-community.social)'s status on Thursday, 03-Apr-2025 14:27:45 JST
nltk is my first tool out of the box when doing personal NLP projects, too.
Adrianna Tan (skinnylatte@hachyderm.io)'s status on Thursday, 03-Apr-2025 14:31:24 JST
@aud @trochee oh yes, there's that too. Honestly I need to research it more, but just knowing how some of those languages work (and Chinese), it's hard to break a word apart into different components the way you would for... tokenizing. Haha
Asta [AMP] (aud@fire.asta.lgbt)'s status on Thursday, 03-Apr-2025 14:31:25 JST
@trochee@dair-community.social @skinnylatte@hachyderm.io (I'm not a polyglot, but thankfully I'm not strictly a monoglot, either.) I'm pretty much going in with the idea that most assumptions I'd make about language from my own knowledge won't hold up, especially for languages that aren't as widely spoken or read, which is exactly where I'd want to pay special attention.
Hmmmm. I wonder if there's a language that is both A. "underserved" by technical tools and B. rather difficult to tokenize? Sounds like a number of languages already fill condition B... and probably fill condition A as well.
Is there a better... mmm, model, either in the computational sense or otherwise, with which to approach breaking up the text? Or could tokenization work "in theory", and it's just that not enough work has been done?
Asta [AMP] (aud@fire.asta.lgbt)'s status on Thursday, 03-Apr-2025 14:31:25 JST
@trochee@dair-community.social @skinnylatte@hachyderm.io (to phrase my question maybe more accurately: is tokenization a concept basically born out of how text is written in Romance/Germanic/etc. languages, and is it not so appropriate to try to model certain South Asian languages with it?)
Asta [AMP] (aud@fire.asta.lgbt)'s status on Thursday, 03-Apr-2025 14:42:15 JST
@skinnylatte@hachyderm.io @trochee@dair-community.social Now that I think about it, if you don't structure your data model on the assumption that your phrase will be "a set of tokens that are strings" matching against "tokens that are strings", that opens up the field pretty widely. If your phrase is instead an image, and your 'documents' are also images, well... (I'm not sure that'd be easy to 'reverse index', but nothing precludes you from doing it. Even a poor attempt at 'tokenizing' images in this manner, so to speak, would likely yield some vaguely useful results.)
Adrianna Tan (skinnylatte@hachyderm.io)'s status on Thursday, 03-Apr-2025 14:42:15 JST
@aud @trochee I would argue that Mandarin almost always has pretty clear meaning from single words as well. You would be able to guess the content quite accurately, I think.
Asta [AMP] (aud@fire.asta.lgbt)'s status on Thursday, 03-Apr-2025 14:42:16 JST
@skinnylatte@hachyderm.io @trochee@dair-community.social haaaad a feeling, hah.
Well, and even with English, 'tokenization' is a pretty poor concept for capturing meaning.*
Hmmmmmmm... perhaps what I should do, then, is not bake in any assumption that I'll even be doing something like tokenization, but instead model the relationship between 'phrase' and 'document' as lightly as possible for constructing a reverse index, so that I can use whatever set of tools is appropriate for each language.
* big asterisk on this part because I suspect this is a very large and very deep can of worms and I should be careful what I say here, because even what is meant by meaning itself is probably rather contextual and...
Adrianna Tan (skinnylatte@hachyderm.io)'s status on Thursday, 03-Apr-2025 14:48:19 JST
@trochee @aud oh yeah, true. Thanks folks for the banter, it’s going to give me a bit of a push to research more.
Btw, PSU has a multilingual NLP/ML lab.
Jeremy Kahn (trochee@dair-community.social)'s status on Thursday, 03-Apr-2025 14:48:20 JST
I think that is often the case, but that may be an artifact of the language-culture having a deep love for folk etymology.
Gaoxing means happy and can be understood as "tall + prosper" (?), and while I can sorta see how they fit together to mean "glad" (it's not any weirder than "in high spirits" in English), it's a bit of a stretch to say that the meaning is compositional.