Why does this PHP construct:
normalizer_normalize( $search_string, \Normalizer::FORM_D );
convert ÖÖÖ to OOO, but keep ÅÅÅ as ÅÅÅ ... WTF?! 🤔
@joho Because ö is a diacritic while å is a letter
@joho NFD (#Unicode Normalization Form Canonical Decomposition) should fully decompose the strings, so Ö should become O + combining diaeresis, and Å (and Å) would be A + combining ring above.
NFC (...Canonical Composition) is usually more compact: it recombines into precomposed characters, so Ö stays an Ö, O + diaeresis becomes an Ö, and an Å becomes an Å.
I would expect "FORM_D" to be NFD, but I am not a #PHP programmer.
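A quick way to check what NFD and NFC actually produce is to dump the UTF-8 bytes before and after. The sketch below assumes the intl extension is loaded; the variable names are only illustrative. It normalizes Ö (U+00D6), Å (U+00C5) and the Ångström sign (U+212B), which looks identical to Å but is a separate code point.

```php
<?php
// Requires the intl extension, which provides the Normalizer class.
$precomposed = "\u{00D6}\u{00C5}"; // Ö and Å as single code points
$angstrom    = "\u{212B}";         // ANGSTROM SIGN, renders like Å

$nfd = Normalizer::normalize($precomposed, Normalizer::FORM_D);
$nfc = Normalizer::normalize($nfd, Normalizer::FORM_C);

// bin2hex() shows the raw UTF-8 bytes, so invisible differences become visible.
echo bin2hex($precomposed), "\n"; // c396c385
echo bin2hex($nfd), "\n";         // 4fcc8841cc8a  (O + U+0308, A + U+030A)
echo bin2hex($nfc), "\n";         // c396c385

// The Ångström sign decomposes to the same A + ring sequence.
$angstromNfd = Normalizer::normalize($angstrom, Normalizer::FORM_D);
echo bin2hex($angstromNfd), "\n"; // 41cc8a
```

If the output for Å is not `41cc8a` after NFD, the input is probably not the string it appears to be, and dumping the hex of the original input is the next thing to look at.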
@joho @heiglandreas @thanius @lpwaterhouse I rolled my own transliteration, using RFC 1345 as a base, once.
I do not recommend doing that (not only because the RFC is severely outdated now, but also because the output turns into garbage).
Yes, transliteration is the way to go in this case, which is what I'm doing now.
Thanks for all the advice, and pointers in the right direction.
The data is stored in an SQL database. I've started to encrypt the (sensitive parts of) data at rest. So I need to do in-memory comparisons and sorting.
Normally, I would compare w/all umlauts, etc, but in this particular case, I want to get a match on "vårsol" when I'm searching for "vårsol" or "varsol". And this matching is, after decryption, done in the application layer.
(And I don't want to use specific database functionality to handle all this.)
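For the in-memory matching described here, one possible approach is to compare folded keys instead of the raw strings: decompose with NFD, then drop the combining marks. This is a minimal sketch, assuming the intl extension and PHP 7.4 or later; `fold_key()` and the sample rows are hypothetical, not code from this thread.

```php
<?php
// Sketch only: fold_key() is a hypothetical helper, not part of PHP or intl.
// Requires the intl extension for Normalizer.
function fold_key(string $s): string
{
    // Decompose: "å" becomes "a" + U+030A (combining ring above).
    $decomposed = Normalizer::normalize($s, Normalizer::FORM_D);
    if ($decomposed === false) {
        return $s; // not valid UTF-8, leave untouched
    }
    // Drop the nonspacing marks (\p{Mn}) that decomposition exposed.
    $stripped = preg_replace('/\p{Mn}+/u', '', $decomposed);
    return $stripped ?? $s;
}

$needle = 'varsol';
$rows   = ["v\u{00E5}rsol", "h\u{00F6}st", "vinter"]; // vårsol, höst, vinter

// Match on the folded key, but keep the original (decrypted) value.
$matches = array_values(array_filter(
    $rows,
    fn (string $row): bool => fold_key($row) === fold_key($needle)
));

// Sort by folded key so "vårsol" sorts as if it were "varsol".
usort($rows, fn (string $a, string $b): int => strcmp(fold_key($a), fold_key($b)));
```

This only handles characters that have a canonical decomposition. Letters such as ø have none, so they pass through unchanged, which is one reason transliteration may be the better fit.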
@joho But wouldn't transliteration be more what you are looking for?
'Cause normalization only changes how a Unicode character is encoded internally. So an 'Ä' should always 'look' the same, but the underlying bytes (the hex dump) might be different.
But transliteration converts from something into something else. And in your case you want to compare kind of based on ASCII if I see that correctly.
Feel free to check out https://andreas.heigl.org/2021/06/23/transliter-what/
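To make the difference concrete, here is a small sketch of transliteration with the intl extension's `Transliterator` class. It assumes the installed ICU version supports the `Any-Latin; Latin-ASCII` transform ID; `to_ascii()` is a hypothetical wrapper, and the exact output for less common characters can vary between ICU versions.

```php
<?php
// Sketch only: to_ascii() is a hypothetical wrapper around intl's Transliterator.
// Assumes the ICU build behind intl knows the 'Any-Latin; Latin-ASCII' transform.
function to_ascii(string $s): string
{
    $transliterator = Transliterator::create('Any-Latin; Latin-ASCII');
    if ($transliterator === null) {
        return $s; // transform not available, leave the input untouched
    }
    $out = $transliterator->transliterate($s);
    return $out === false ? $s : $out;
}

echo to_ascii("v\u{00E5}rsol"), "\n";            // vårsol
echo to_ascii("\u{00D6}\u{00D6}\u{00D6}"), "\n"; // ÖÖÖ
```

Unlike normalization, this deliberately changes the text, so it is better suited to building a comparison key than to rewriting the stored data.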
@heiglandreas I didn't do that part, I'm just looking at the output, which is what I need to be correct.
But @thanius and @lpwaterhouse may be onto something here.
Maybe I'll just stick to transliteration then. I'm probably overworking the code, but I hate to leave things to "chance" when I develop.
@joho Stupid question perhaps: Why are you using normalization when the output just needs to look correct?
What problem are you trying to solve?
/cc @thanius @lpwaterhouse
@joho And what do the hex values actually say?
@joho Wasn't there something with locale? Or the underlying ICU version? Something knocks from deep down in my mind....