Conversation

Joaquim Homrighausen (joho@mastodon.online)'s status on Friday, 31-Jan-2025 19:55:58 JST

Why does this PHP construct:
normalizer_normalize( $search_string, \Normalizer::FORM_D );
convert ÖÖÖ to OOO, but keep ÅÅÅ as ÅÅÅ ... WTF?! 🤔
#programming #php #wtf #utf #utf8

- Tobias Hellgren (thanius@mastodon.chuggybumba.com)'s status on Friday, 31-Jan-2025 19:55:58 JST
  
  @joho Because ö is a diacritic while å is a letter
  
- Peter Krefting (nafmo@social.vivaldi.net)'s status on Friday, 31-Jan-2025 23:11:27 JST
  
  @joho NFD (#Unicode Normalization Form Canonical Decomposition) should fully decompose the strings, so Ö should become O + combining diaeresis, and Å (and Å) would be A + combining ring above.
  NFC (...Canonical Composition) is usually more compact: it recombines into precomposed characters, so Ö stays an Ö, O + diaeresis becomes an Ö, and an Å becomes an Å.
  I would expect "FORM_D" to be NFD, but I am not a #PHP programmer.
  
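One way to check what NFD really produces is to dump the bytes instead of trusting how the result is rendered. The following is a minimal sketch, assuming PHP 7 or later with the intl extension loaded; the sample string and the variable names are illustrative and not taken from the thread.

```php
<?php
// Precomposed input: U+00D6 (Ö) and U+00C5 (Å), written as escapes so the
// encoding of this source file cannot influence the result.
$input = "\u{00D6}\u{00C5}";

// Canonical decomposition, then canonical composition of the result.
$nfd = normalizer_normalize($input, Normalizer::FORM_D);
$nfc = normalizer_normalize($nfd, Normalizer::FORM_C);

echo bin2hex($input), "\n"; // c396c385
echo bin2hex($nfd), "\n";   // 4fcc8841cc8a  (O + U+0308, A + U+030A)
echo bin2hex($nfc), "\n";   // c396c385      (recomposed)
```

In this sketch both letters decompose the same way, into a base letter plus a combining mark. If a display shows a bare O for one and an intact Å for the other, the difference may come from how the combining marks are rendered or processed afterwards rather than from the normalized data, though the thread does not confirm the cause.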
- Peter Krefting (nafmo@social.vivaldi.net)'s status on Saturday, 01-Feb-2025 00:05:44 JST
  
  @joho @heiglandreas @thanius @lpwaterhouse I rolled my own transliteration, using RFC 1345 as a base, once.
  I do not recommend doing that (not only because the RFC is severely outdated now, but also because the output turns into garbage).
  
- Joaquim Homrighausen (joho@mastodon.online)'s status on Saturday, 01-Feb-2025 00:05:45 JST
  
  @heiglandreas
  Yes, transliteration is the way to go in this case, which is what I'm doing now.
  Thanks for all the advice, and pointers in the right direction.
  @thanius @lpwaterhouse
  @nafmo
  
- Joaquim Homrighausen (joho@mastodon.online)'s status on Saturday, 01-Feb-2025 00:05:46 JST
  
  @heiglandreas
  The data is stored in an SQL database. I've started to encrypt the (sensitive parts of) data at rest. So I need to do in-memory comparisons and sorting.
  Normally, I would compare w/all umlauts, etc, but in this particular case, I want to get a match on "vårsol" when I'm searching for "vårsol" or "varsol". And this matching is, after decryption, done in the application layer.
  (And I don't want to use specific database functionality to handle all this.)
  @thanius @lpwaterhouse
  
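The kind of loose in-memory match described here ("vårsol" found by "varsol") can be sketched by decomposing both strings and dropping the combining marks before comparing. This is only an illustration, assuming the intl and mbstring extensions are available; `fold_for_search` is a hypothetical helper name, not something from the thread.

```php
<?php
// Hypothetical helper (not from the thread): fold a string for loose matching.
function fold_for_search(string $text): string
{
    // Decompose, so "å" becomes "a" followed by U+030A (combining ring above).
    $decomposed = normalizer_normalize($text, Normalizer::FORM_D);
    if ($decomposed === false) {
        // Normalization failed (for example on invalid UTF-8); use the raw input.
        $decomposed = $text;
    }

    // Remove combining marks (Unicode category Mn) and keep the base letters.
    // preg_replace returns null on invalid UTF-8, hence the fallback.
    $stripped = preg_replace('/\p{Mn}+/u', '', $decomposed) ?? $decomposed;

    // Lower-case so the comparison is also case-insensitive.
    return mb_strtolower($stripped, 'UTF-8');
}

$stored = "v\u{00E5}rsol"; // "vårsol", as it might look after decryption
var_dump(fold_for_search($stored) === fold_for_search('varsol')); // bool(true)
var_dump(fold_for_search($stored) === fold_for_search($stored));  // bool(true)
```

This only folds letters that have a canonical decomposition. Characters such as ø or ß have none and pass through unchanged, which is one reason a transliteration step can be the better fit.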
- Alerta! Alerta! (heiglandreas@phpc.social)'s status on Saturday, 01-Feb-2025 00:05:46 JST
  in reply to Tobias Hellgren and Lawrence Pritchard Waterhouse
  @joho But wouldn't transliteration be more what you are looking for?
  'Cause normalization just handles how a Unicode character is stored internally. So an 'Ä' should always 'look' the same, but the underlying hex code might be different.
  Transliteration, on the other hand, converts from one thing into something else. And in your case you want to compare more or less on an ASCII basis, if I see that correctly.
  Feel free to check out https://andreas.heigl.org/2021/06/23/transliter-what/
  /cc @thanius @lpwaterhouse
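The difference between normalization and transliteration can be illustrated with the intl extension's `Transliterator` class. This is a sketch, not the approach anyone in the thread posted: it assumes the installed ICU build provides the `Latin-ASCII` transform, and the rule string and variable names are illustrative.

```php
<?php
// Assumption: the ICU library behind ext/intl ships the "Latin-ASCII" transform.
// The rule chain converts to Latin script, folds to ASCII, then lower-cases.
$translit = Transliterator::create('Any-Latin; Latin-ASCII; Lower()');
if ($translit === null) {
    throw new RuntimeException('Transliterator could not be created: ' . intl_get_error_message());
}

// Both the search term and the stored value go through the same transform.
$needle = $translit->transliterate('Varsol');
$stored = $translit->transliterate("V\u{00E5}rsol"); // "Vårsol"

var_dump($needle === $stored);
```

Applying one transform to both sides keeps the comparison symmetric, so it does not matter whether the user typed the diacritics or not.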
- Joaquim Homrighausen (joho@mastodon.online)'s status on Saturday, 01-Feb-2025 00:05:48 JST
  
  @heiglandreas I didn't do that part, I'm just looking at the output, which is what I need to be correct.
  But @thanius and @lpwaterhouse may be onto something here.
  Maybe I'll just stick to transliteration then. I'm probably overworking the code, but I hate to leave things to "chance" when I develop.
  
- Alerta! Alerta! (heiglandreas@phpc.social)'s status on Saturday, 01-Feb-2025 00:05:48 JST
  in reply to Tobias Hellgren and Lawrence Pritchard Waterhouse
  @joho Stupid question perhaps: Why are you using normalization when the output just needs to look correct?
  What problem are you trying to solve?
  /cc @thanius @lpwaterhouse
  
- Alerta! Alerta! (heiglandreas@phpc.social)'s status on Saturday, 01-Feb-2025 00:05:49 JST
  
  @joho And what do the HEX characters actually say?
  
- Alerta! Alerta! (heiglandreas@phpc.social)'s status on Saturday, 01-Feb-2025 00:05:50 JST
  
  @joho Wasn't there something with locale? Or the underlying ICU version? Something knocks from deep down in my mind....
  