@aardrian @odddev @thilo @aral
My approach would be this:
If any non-Latin characters are present, tokenise the text. For each non-Latin token, use a pre-defined hash table to rewrite each symbol to its Latin equivalent (if one exists). If the result is a purely Latin token, lemmatise it to determine whether it's an existing word in the post's language. If so, read the natural word instead of the non-Latin token.
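A minimal Python sketch of that idea — the tiny confusables table and word set here are stand-ins for the real pre-defined hash table (e.g. something derived from Unicode's confusables data) and for a proper lemmatiser plus dictionary:

```python
# Hand-rolled stand-in for the pre-defined symbol-to-Latin hash table.
# A real table would be far larger (mathematical alphanumerics,
# small caps, fullwidth forms, etc.).
CONFUSABLES = {
    "𝗵": "h", "𝗲": "e", "𝗹": "l", "𝗼": "o",  # mathematical sans-serif bold
}

# Hypothetical stand-in for lemmatisation + dictionary lookup
# in the post's language.
KNOWN_WORDS = {"hello", "world"}

def latinise(token):
    """Rewrite each symbol to its Latin equivalent.
    Returns None if any symbol has no mapping and is not already Latin."""
    out = []
    for ch in token:
        if ch in CONFUSABLES:
            out.append(CONFUSABLES[ch])
        elif ch.isascii() and ch.isalpha():
            out.append(ch.lower())
        else:
            return None
    return "".join(out)

def readable(token):
    """Return the natural word for a non-Latin token when the rewrite
    yields a known word; otherwise leave the token untouched."""
    if token.isascii():
        return token  # purely Latin token: nothing to do
    candidate = latinise(token)
    if candidate is not None and candidate in KNOWN_WORDS:
        return candidate  # read the natural word instead
    return token

print(readable("𝗵𝗲𝗹𝗹𝗼"))  # hello
```

In practice the dictionary check is the important guard: it stops the rewrite from mangling tokens that merely contain decorative symbols without spelling a real word.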