I'm sure that @skinnylatte knows this too -- as a far more developed polyglot than me -- but things get _really_ weird when you tokenize orthographies that don't use whitespace at all (or hardly ever do):
Chinese and Japanese are the obvious ones (they have pretty good but not well-standardized segmenters), but Thai is also super unreliable about whitespace -- spaces tend to mark phrases rather than words -- and a bunch of Southeast Asian scripts (Lao, Khmer, Burmese) have the same ambiguity about word boundaries
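
Concretely, here's a tiny Python sketch of what goes wrong -- the sample sentences and the library names in the comments are just illustrative, not anyone's canonical setup:

```python
# Naive whitespace tokenization vs. scripts that don't mark word boundaries.

samples = {
    "English": "the cat sat on the mat",
    "Chinese": "猫坐在垫子上",            # no spaces between words
    "Japanese": "猫がマットの上に座った",  # no spaces between words
    "Thai": "แมวนั่งบนเสื่อ",              # spaces (when present) mark phrases, not words
}

for lang, text in samples.items():
    tokens = text.split()  # split on whitespace, the "default" assumption
    print(f"{lang}: {len(tokens)} token(s) -> {tokens}")

# English comes back as 6 word tokens; the others come back as one big
# undifferentiated chunk, so you need a language-specific segmenter
# (e.g. jieba for Chinese, MeCab/fugashi for Japanese, pythainlp for Thai)
# -- and different segmenters will disagree about where the words are.
```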