Somehow I had missed this. AI is going fine.
> On May 14, Tianle Cai, a PhD student at Princeton University studying inference efficiency in large language models like those that power such chatbots, accessed GPT-4o’s public token library and pulled a list of the 100 longest Chinese tokens the model uses to parse and compress Chinese prompts. [...] The longest token, lasting 10.5 Chinese characters, literally means “_free Japanese porn video to watch.” Oops.
https://www.technologyreview.com/2024/05/17/1092649/gpt-4o-chinese-token-polluted/
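
For the curious, here is roughly how one could reproduce that kind of lookup. This is a minimal sketch, not Cai's actual script: it assumes the third-party `tiktoken` package and its `o200k_base` encoding (the one GPT-4o uses), the helper names are my own, and the "is this Chinese?" check only covers the basic CJK Unified Ideographs block.

```python
from typing import Iterable, List


def is_cjk(ch: str) -> bool:
    # Assumption: "Chinese" means the basic CJK Unified Ideographs block only.
    return "\u4e00" <= ch <= "\u9fff"


def longest_chinese_tokens(vocab: Iterable[bytes], top_n: int = 100) -> List[str]:
    """Return the top_n longest tokens made up entirely of CJK characters."""
    candidates = []
    for raw in vocab:
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            # Many BPE tokens are fragments of multi-byte characters; skip them.
            continue
        stripped = text.strip()  # tokens often carry a leading space
        if stripped and all(is_cjk(c) for c in stripped):
            candidates.append(text)
    # Longest first, measured in characters without surrounding whitespace.
    candidates.sort(key=lambda t: len(t.strip()), reverse=True)
    return candidates[:top_n]


def load_gpt4o_vocab() -> List[bytes]:
    # Imported lazily so the filter above works without tiktoken installed.
    import tiktoken

    enc = tiktoken.get_encoding("o200k_base")
    return enc.token_byte_values()
```

Calling `longest_chinese_tokens(load_gpt4o_vocab())` should give a list comparable to the one described in the article, though the exact contents depend on how "Chinese token" is defined.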