I'm sure that @skinnylatte knows this too -- as a far more developed polyglot than me -- but things get _really_ weird when you tokenize orthographies that don't use whitespace at all (or hardly ever do):
Chinese and Japanese are the obvious ones (they have pretty good but not well-standardized segmenters), but Thai is also super unreliable about whitespace -- spaces tend to mark phrases rather than words -- and a bunch of Southeast Asian scripts (Lao, Khmer, Burmese) have the same ambiguity about word boundaries
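
Concretely, here's a tiny Python sketch of what goes wrong -- the sample sentences and the library names in the comments are just illustrative, not anyone's canonical setup:

```python
# Naive whitespace tokenization vs. scripts that don't mark word boundaries.

samples = {
    "English": "the cat sat on the mat",
    "Chinese": "猫坐在垫子上",            # no spaces between words
    "Japanese": "猫がマットの上に座った",  # no spaces between words
    "Thai": "แมวนั่งบนเสื่อ",              # spaces (when present) mark phrases, not words
}

for lang, text in samples.items():
    tokens = text.split()  # split on whitespace, the "default" assumption
    print(f"{lang}: {len(tokens)} token(s) -> {tokens}")

# English comes back as 6 word tokens; the others come back as one big
# undifferentiated chunk, so you need a language-specific segmenter
# (e.g. jieba for Chinese, MeCab/fugashi for Japanese, pythainlp for Thai)
# -- and different segmenters will disagree about where the words are.
```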