@skinnylatte@hachyderm.io @trochee@dair-community.social haaaad a feeling, hah.
Well, and even with English, 'tokenization' is a pretty poor concept for capturing meaning.*
Hmmmmmmm... perhaps what I should do, then, is not bake in any assumption that I'll even be doing something like tokenization, but instead model the relationship between 'phrase' and 'document' as lightly as possible for constructing a reverse index, so that I can use whatever set of tools is appropriate for that language.
* big asterisk on this part because I suspect this is a very large and very deep can of worms and I should be careful what I say here, because even what is meant by meaning itself is probably rather contextual and...
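To make the 'phrase' and 'document' idea a bit more concrete, here's a rough sketch of what I have in mind, not a real design. The index only stores phrase -> document ids, and whatever turns text into phrases gets passed in per language. Every name here (`ReverseIndex`, `PhraseExtractor`, `whitespace_phrases`, `char_bigrams`) is made up for illustration, and the two extractors are deliberately naive stand-ins for proper language-specific tools.

```python
from collections import defaultdict
from typing import Callable, Iterable

# Hypothetical type: anything that turns raw text into "phrases".
# The index makes no assumption that this is tokenization.
PhraseExtractor = Callable[[str], Iterable[str]]


class ReverseIndex:
    """Maps phrase -> set of document ids, and nothing more."""

    def __init__(self) -> None:
        self._postings: dict[str, set[str]] = defaultdict(set)

    def add(self, doc_id: str, text: str, extract: PhraseExtractor) -> None:
        # The caller picks the extractor that suits the document's language.
        for phrase in extract(text):
            self._postings[phrase].add(doc_id)

    def lookup(self, phrase: str) -> set[str]:
        # Return a copy so callers cannot mutate the index by accident.
        return set(self._postings.get(phrase, ()))


def whitespace_phrases(text: str) -> list[str]:
    # Naive stand-in for a space-delimited language.
    return text.lower().split()


def char_bigrams(text: str) -> list[str]:
    # Naive stand-in for a language written without spaces.
    compact = "".join(text.split())
    return [compact[i:i + 2] for i in range(len(compact) - 1)]
```

Usage would look something like `index.add("en-1", "Tokenization is hard", whitespace_phrases)` for one document and `index.add("ja-1", "東京都に住む", char_bigrams)` for another, with both landing in the same index.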
Asta [AMP] (aud@fire.asta.lgbt)'s status on Thursday, 03-Apr-2025 14:42:16 JST