Conversation

Notices

  1. Asta [AMP] (aud@fire.asta.lgbt)'s status on Thursday, 03-Apr-2025 14:10:37 JST

    heeeeeey #python cats!

    anyone know of a decent multi-language text tokenizer?

    To be clear: I am explicitly not looking to use it for generative-AI or other [slop/scab/labor theft] purposes.

    Not sure of the specific terms I need to be looking up, frankly, since I'm mostly just finding Python's built-in tokenize library, which seems to be focused just on Python code.

    Thank you!

    #techPosting

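As a side note on the standard-library module mentioned above: tokenize lexes Python source code, not natural-language text, which is why it doesn't fit here. A minimal sketch (the snippet being tokenized is made up):

```python
# The stdlib tokenize module splits Python *source code* into NAME, OP,
# NUMBER, ... tokens; it is not a natural-language tokenizer.
import io
import tokenize

for tok in tokenize.generate_tokens(io.StringIO("total = price * 2\n").readline):
    print(tok.type, repr(tok.string))
```
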
    • Adrianna Tan (skinnylatte@hachyderm.io)'s status on Thursday, 03-Apr-2025 14:10:36 JST

      @aud nltk?

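A minimal sketch of trying the suggestion above (assumes nltk is installed; the punkt tokenizer models need a one-time download, and newer NLTK releases may ask for "punkt_tab" instead):

```python
import nltk

nltk.download("punkt")  # one-time tokenizer-model download
from nltk.tokenize import word_tokenize

print(word_tokenize("heeeeeey #python cats! anyone know of a decent tokenizer?"))
# e.g. ['heeeeeey', '#', 'python', 'cats', '!', 'anyone', ...]
# note: the language= parameter only covers a handful of mostly European languages
```
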
    • Adrianna Tan (skinnylatte@hachyderm.io)'s status on Thursday, 03-Apr-2025 14:27:39 JST
      in reply to
      • Jeremy Kahn

      @trochee @aud oh yeah, pretty bad for South Asian languages

    • Asta [AMP] (aud@fire.asta.lgbt)'s status on Thursday, 03-Apr-2025 14:27:41 JST
      in reply to
      • Adrianna Tan
      • Jeremy Kahn

      @trochee@dair-community.social @skinnylatte@hachyderm.io Yeah, I'm trying to think of how I would construct a reverse index at the database/data model level, and I want to not bake in assumptions about the language at this level.

      So having a quick run-through of what's considered "best in class", library-wise, for language handling should give me an idea of what the input and output need to look like (and explicitly, how I should store them, etc.).

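A rough sketch of what keeping that data model language-agnostic might look like: the index only ever sees opaque token strings, and how tokens are produced is injected rather than assumed. All names here are hypothetical, not from the thread:

```python
# Hypothetical sketch of a language-agnostic inverted ("reverse") index:
# the tokenizer is a pluggable callable, so per-language tools can be
# swapped in later without changing the index itself.
from collections import defaultdict
from typing import Callable, Iterable

Tokenizer = Callable[[str], Iterable[str]]

class ReverseIndex:
    def __init__(self, tokenize: Tokenizer) -> None:
        self.tokenize = tokenize
        self.postings: dict[str, set[str]] = defaultdict(set)

    def add(self, doc_id: str, text: str) -> None:
        for token in self.tokenize(text):
            self.postings[token].add(doc_id)

    def lookup(self, phrase: str) -> set[str]:
        # Documents that contain every token of the phrase.
        token_sets = [self.postings.get(t, set()) for t in self.tokenize(phrase)]
        return set.intersection(*token_sets) if token_sets else set()

# Naive whitespace tokenizer purely for demonstration.
index = ReverseIndex(tokenize=str.split)
index.add("doc1", "decent multi-language text tokenizer")
index.add("doc2", "python tokenize library")
print(index.lookup("text tokenizer"))   # {'doc1'}
```
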
    • Jeremy Kahn (trochee@dair-community.social)'s status on Thursday, 03-Apr-2025 14:27:41 JST
      in reply to
      • Adrianna Tan

      @aud

      I'm sure that @skinnylatte knows this too -- as a far more developed polyglot than me -- but things get _really_ weird when you tokenize orthographies that don't use whitespace at all (or almost never):

      Chinese and Japanese are the obvious ones (that have pretty good but not well-standardized tokenizers), but Thai is also super unreliable about whitespace/word boundaries, and there are a bunch of South Asian scripts with similar confusion

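For reference on the segmenters mentioned above, a sketch using two commonly used third-party tools (not from the thread; it assumes the jieba and fugashi packages plus a MeCab dictionary such as unidic-lite are installed):

```python
# Hypothetical demo of word segmentation for scripts without whitespace
# word boundaries. Assumes: pip install jieba fugashi unidic-lite
import jieba                 # Chinese word segmentation
from fugashi import Tagger   # Japanese morphological analysis (MeCab wrapper)

print(list(jieba.cut("我来到北京清华大学")))   # e.g. ['我', '来到', '北京', '清华大学']

tagger = Tagger()
print([word.surface for word in tagger("日本語の文を分かち書きする")])
```
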
    • Jeremy Kahn (trochee@dair-community.social)'s status on Thursday, 03-Apr-2025 14:27:43 JST
      in reply to
      • Adrianna Tan

      @aud @skinnylatte

      I mean, assuming that "split on whitespace, drop punctuation" isn't good enough -- which it isn't, except for English

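A tiny illustration of that point (a sketch; the example strings are made up): the one-liner works passably for English and collapses for text without whitespace word boundaries.

```python
import re

def naive_tokens(text: str) -> list[str]:
    # "split on whitespace, drop punctuation" in one line
    return re.findall(r"\w+", text)

print(naive_tokens("Hey, #python cats!"))   # ['Hey', 'python', 'cats']
print(naive_tokens("我来到北京清华大学"))     # ['我来到北京清华大学'] -- one giant "token"
```
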
    • Asta [AMP] (aud@fire.asta.lgbt)'s status on Thursday, 03-Apr-2025 14:27:45 JST
      in reply to
      • Adrianna Tan

      @skinnylatte@hachyderm.io ooooh, this looks extremely promising and probably exactly what I need! Thank you!

    • Jeremy Kahn (trochee@dair-community.social)'s status on Thursday, 03-Apr-2025 14:27:45 JST
      in reply to
      • Adrianna Tan

      @aud @skinnylatte

      nltk is my first tool out of the box when doing personal NLP projects too

    • Adrianna Tan (skinnylatte@hachyderm.io)'s status on Thursday, 03-Apr-2025 14:31:24 JST
      in reply to
      • Jeremy Kahn

      @aud @trochee oh yes, there’s that too. honestly I need to research it more, but just knowing how some of those languages work (and Chinese), it’s hard to break apart a word into different components the way you would for... tokenizing. Haha

    • Asta [AMP] (aud@fire.asta.lgbt)'s status on Thursday, 03-Apr-2025 14:31:25 JST
      in reply to
      • Adrianna Tan
      • Jeremy Kahn

      @trochee@dair-community.social @skinnylatte@hachyderm.io (I'm not a polyglot, but thankfully I'm not strictly a monoglot, either) I'm pretty much going with the idea that most assumptions I would make about language based on my knowledge aren't going to hold up, especially in languages that aren't as widely spoken or read, which is sort of where I would want to pay special attention.

      Hmmmm. I wonder if there's a language that is both A. "underserved" by technical tools and B. rather difficult to tokenize? Sounds like a number of languages already fill condition B... and probably fill condition A.

      Is there a better... mmm, model, either in the computational sense or otherwise, with which to approach how to break up the text? Or could tokenization work "in theory", and it's just that not enough work has been done?

    • Asta [AMP] (aud@fire.asta.lgbt)'s status on Thursday, 03-Apr-2025 14:31:25 JST
      in reply to
      • Adrianna Tan
      • Jeremy Kahn

      @trochee@dair-community.social @skinnylatte@hachyderm.io (to maybe more accurately phrase my question: is tokenization a concept basically born out of how text is written in Romance/Germanic/etc. languages, and is it not so appropriate to try and model certain South Asian languages with it?)

    • Asta [AMP] (aud@fire.asta.lgbt)'s status on Thursday, 03-Apr-2025 14:42:15 JST
      in reply to
      • Adrianna Tan
      • Jeremy Kahn

      @skinnylatte@hachyderm.io @trochee@dair-community.social Now that I think about it, if you don't structure your data model on the assumption that your phrase will be "a set of tokens that are strings" matching against "tokens that are strings", that opens up the field pretty widely. If your phrase is instead an image, and your 'documents' are also images, well... (not sure that'd be easy to 'reverse index', but nothing precludes you from doing it. Even a poor attempt at 'tokenizing' images in this manner, so to speak, would likely yield some vaguely useful results).

    • Adrianna Tan (skinnylatte@hachyderm.io)'s status on Thursday, 03-Apr-2025 14:42:15 JST
      in reply to
      • Jeremy Kahn

      @aud @trochee I would argue that Mandarin almost always has pretty clear meaning from single words as well. You would be able to guess the content quite accurately, I think.

    • Asta [AMP] (aud@fire.asta.lgbt)'s status on Thursday, 03-Apr-2025 14:42:16 JST
      in reply to
      • Adrianna Tan
      • Jeremy Kahn

      @skinnylatte@hachyderm.io @trochee@dair-community.social haaaad a feeling, hah.

      Well, and even with English, 'tokenization' is a pretty poor concept for capturing meaning.*

      Hmmmmmmm... perhaps what I should do, then, is to not bake in any assumptions that I'll even be doing something like tokenization, but instead model the relationship between 'phrase' and 'document' as lightly as possible for constructing a reverse index, so that I can use whatever set of tools is appropriate for that language.

      * big asterisk on this part because I suspect this is a very large and very deep can of worms and I should be careful what I say here, because even what is meant by meaning itself is probably rather contextual and...

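One way the "model the phrase/document relationship lightly" idea might look in practice: pick an analyzer per language rather than assuming tokenization at all, with character bigrams as a language-agnostic fallback (a trick some CJK full-text search setups use). All names here are hypothetical sketches, not from the thread:

```python
# Hypothetical sketch: analyzers are chosen per language; anything the
# index stores is just whatever strings the chosen analyzer emits.
from typing import Callable, Iterable

def char_bigrams(text: str) -> list[str]:
    # Language-agnostic fallback: overlapping two-character units.
    stripped = "".join(text.split())
    return [stripped[i:i + 2] for i in range(len(stripped) - 1)]

ANALYZERS: dict[str, Callable[[str], Iterable[str]]] = {
    "en": str.split,        # whitespace is (roughly) fine for English
    "und": char_bigrams,    # undetermined language: fall back to bigrams
}

def analyze(text: str, lang: str = "und") -> list[str]:
    return list(ANALYZERS.get(lang, char_bigrams)(text))

print(analyze("decent text tokenizer", "en"))
print(analyze("我来到北京清华大学"))   # ['我来', '来到', '到北', '北京', ...]
```
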
    • Adrianna Tan (skinnylatte@hachyderm.io)'s status on Thursday, 03-Apr-2025 14:48:19 JST
      in reply to
      • Jeremy Kahn

      @trochee @aud oh yeah, true. Thanks folks for the banter, it’s going to give me a bit of a push to research more.

      Btw, PSU has a multilingual NLP/ML lab.

    • Jeremy Kahn (trochee@dair-community.social)'s status on Thursday, 03-Apr-2025 14:48:20 JST
      in reply to
      • Adrianna Tan

      @skinnylatte

      I think that is often the case, but that may be an artifact of the language-culture having a deep love for folk-etymology

      gaoxing means happy and can be understood as "tall+prosper" (?), and while I can sorta see how they fit together to mean "glad" (it's not any weirder than "in high spirits" in English), it's a bit of a stretch to say that the meaning is compositional.

      @aud

