Conversation

Notices

  1. Rich Felker (dalias@hachyderm.io)'s status on Monday, 02-Jun-2025 09:14:48 JST

    I talk about Unicode a lot (or at least more than the average person) - sometimes in a good light, sometimes not so much - but I'd like to take a moment to revisit why it's so dear to me, especially for folks who came along later and missed out on the world before Unicode.

    1/N

    • Rich Felker (dalias@hachyderm.io)'s status on Monday, 02-Jun-2025 09:21:15 JST

      Up until basically the end of the 20th century, computers represented textual information differently depending on your political locality.

      The same bytes could mean different things depending on where you were (and that was usually implicit), the set of languages you could represent at one time was determined by whom the prevailing political and economic forces deemed relevant for you to communicate with, and less-well-represented languages couldn't be represented at all without font hacks that just replaced letter shapes.
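
      What that looked like in practice, as a minimal sketch using Python 3's bundled legacy codecs: one byte string, three equally "valid" readings.

      ```python
      # The same four bytes decoded under three legacy code pages.
      # Every decode succeeds; only the reader's assumed locale
      # decides what the text "is".
      raw = bytes([0xC4, 0xC5, 0xCE, 0xD8])

      print(raw.decode("koi8_r"))     # 'день'  (Russian for "day")
      print(raw.decode("cp1252"))     # 'ÄÅÎØ'  (a Western European reading)
      print(raw.decode("iso8859_7"))  # 'ΔΕΞΨ'  (a Greek reading)
      ```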

      2/N

    • Rich Felker (dalias@hachyderm.io)'s status on Monday, 02-Jun-2025 09:26:33 JST

      As much as there were pains, and not everyone got quite what they wanted, Unicode was an immense triumph, both *technical* and *political*, over that system.

      3/N

    • Rich Felker (dalias@hachyderm.io)'s status on Monday, 02-Jun-2025 09:30:36 JST

      Nowadays it's normal that you can enter text in whatever language and writing system you want basically anywhere (feel free to troll my replies with cursed places you still can't 🤬), but being able to do that felt revolutionary up through at least 2010 or so.

      4/N

    • Rich Felker (dalias@hachyderm.io)'s status on Monday, 02-Jun-2025 09:33:23 JST
      in reply to LisPi

      @lispi314 It required a very painful "consensus" (not quite) process to actually make it happen though, and it really came close to not happening. The alignment of UCS and Unicode, with both sides fixing mistakes, was a really big deal. So I don't think it was entirely inevitable. If we were in that position today, Google, Facebook, Apple, and Tencent would probably all have their own different incompatible things.

    • LisPi (lispi314@udongein.xyz)'s status on Monday, 02-Jun-2025 09:33:25 JST
      @dalias I think that silly prior issue with representation was tolerated solely because of the cost and weakness of earlier machines.

      It seems to me that it was nearly inevitable that, once computers became cheap enough, someone would get annoyed enough at the status quo to do something about it.
    • Rich Felker (dalias@hachyderm.io)'s status on Monday, 02-Jun-2025 09:53:28 JST
      in reply to LisPi

      @lispi314 Yes, the early history of this problem space was based on *non-unification*: just tagging every piece of text with the legacy encoding it was in, with escape sequences to switch between them. That was the Emacs way (MULE).

      This would have left us with a world where text in most languages was locked to the political locality it was written in (it didn't compare equal to, or match searches for, the same thing written in another form).
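
      A concrete taste of that escape-sequence world, sketched with Python 3's standard iso2022_jp codec (MULE generalized the same idea inside Emacs):

      ```python
      # ISO-2022-JP switches character sets mid-stream with escape
      # sequences: ESC $ B enters JIS X 0208, ESC ( B returns to ASCII.
      data = "abc 日本語 xyz".encode("iso2022_jp")
      print(data)  # b'abc \x1b$BF|K\\8l\x1b(B xyz'

      # The same text in another legacy encoding is a different byte
      # string entirely, so a naive byte-level search finds nothing:
      print("日本語".encode("euc_jp") in data)  # False
      ```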

    • LisPi (lispi314@udongein.xyz)'s status on Monday, 02-Jun-2025 09:53:29 JST
      in reply to LisPi
      @dalias Emacs also has support for multiple representations, of course.
    • LisPi (lispi314@udongein.xyz)'s status on Monday, 02-Jun-2025 09:53:31 JST
      @dalias > If we were in that position today, Google, Facebook, Apple, and Tencent would probably all have their own different incompatible things.

      Whether we would have ended up with one representation might be a different story.

      Also, GNU Emacs had its input method fairly early on; I wouldn't discount some representation alternative also being pushed into the GNU project.
    • Rich Felker (dalias@hachyderm.io)'s status on Monday, 02-Jun-2025 11:17:58 JST
      in reply to LisPi

      @lispi314 There are all kinds of technical reasons that's immensely difficult and wouldn't have happened. Whatever you think is difficult with Unicode, it's orders of magnitude harder to solve these problems when you have unbounded non-locality, lack of any real specification of what is equivalent to what, etc. - and it was hard enough to motivate people to solve the problems even when given a technical framework to make it tractable.

    • LisPi (lispi314@udongein.xyz)'s status on Monday, 02-Jun-2025 11:17:59 JST
      @dalias > This would have left us with a world where text in most languages was locked to the political locality it was written in (didn't compare equal or match searches for same thing written in another form).

      Literally the only reason that "locking" was effective was that computers were too weak to have all of them stored and working. That state of affairs didn't last all that much longer.

      grep currently benefits from everything using the same encoding, but there's no particular reason a grep extension with (customizable) conversion rules to search across encodings couldn't be made (a rough sketch follows below).

      Databases could even more easily keep track of such things, though they would probably choose, for internal representation, something capable of representing everything else.
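
      One possible shape of that grep extension, as a hypothetical sketch; the function name and candidate-encoding list are invented for illustration, and real use would also need the equivalence rules dalias mentions:

      ```python
      # Hypothetical cross-encoding "grep": re-encode the pattern into
      # each candidate legacy encoding and scan at the byte level.
      CANDIDATES = ("utf-8", "koi8_r", "cp1251", "iso8859_5")

      def grep_any_encoding(pattern: str, data: bytes):
          for codec in CANDIDATES:
              try:
                  needle = pattern.encode(codec)
              except UnicodeEncodeError:
                  continue  # pattern not representable in this encoding
              pos = data.find(needle)
              if pos != -1:
                  yield codec, pos

      data = "день".encode("koi8_r")
      for codec, pos in grep_any_encoding("день", data):
          print(f"match as {codec} at byte {pos}")  # match as koi8_r at byte 0
      ```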
    • cliffle@hachyderm.io's status on Monday, 02-Jun-2025 11:25:40 JST

      @dalias Unicode: it's complex, it's imperfect, it's frustrating, and it's _so much better_ than anything that came before. 💯

    • Rich Felker (dalias@hachyderm.io)'s status on Monday, 02-Jun-2025 11:31:52 JST
      in reply to LisPi

      @lispi314 For instance, the whole reason a normalizing comparison in Unicode is even possible in constant space and linear time is some serious engineering effort.

      It's easy to look at the code and data representation problem I'm working on right now and 🤬 at Unicode for being so difficult, but the reality is that it would be *impossible* with any of the legacy page-switching proposals being pushed before (and during the introduction of) it. Everything would need dynamic allocation of painful data structures and/or quadratic-time operations all over the place.
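
      What a normalizing comparison has to equate, illustrated with Python's unicodedata module. Note that Python's normalize() materializes whole normalized strings; the constant-space, linear-time streaming comparison is the engineering feat being described, not this sketch:

      ```python
      import unicodedata

      # Two canonically equivalent spellings of the same word:
      a = "caf\u00e9"   # precomposed U+00E9
      b = "cafe\u0301"  # 'e' + U+0301 COMBINING ACUTE ACCENT

      print(a == b)  # False: different code point sequences
      print(unicodedata.normalize("NFC", a) ==
            unicodedata.normalize("NFC", b))  # True: same text
      ```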

    • Brian Campbell (unlambda@hachyderm.io)'s status on Monday, 02-Jun-2025 12:21:44 JST

      @dalias Unicode is really an amazing project.

      It's solving a problem with a tremendous amount of essential complexity, and much of the complexity of Unicode is simply a reflection of that: there is a lot of complexity in all of the world's writing systems, and providing a uniform way to include them all in a single format inherits every bit of it.

      But there are definitely some pieces of accidental complexity. One of the big ones can be seen as either accidental or essential depending on your point of view: the need to provide easy ways to migrate from legacy encodings and, during the migration, to round-trip. In some ways this is an essential complexity of a universal character set, but in the grander scheme of things it's an accidental complexity of the fact that we got here via all of those legacy character sets.

      And then there are the places where accidental complexity was introduced by Unicode itself. One of the biggest sources I like to think of as Unicode's original sin: thinking that 16 bits was enough for a universal character set. That encouraged treating Unicode as simply a 16-bit "wide" character encoding, and drove things like aggressive Han unification, which nearly caused Unicode to fail in a substantial portion of the world.

      It didn't take long to realize the errors and fix them, but unfortunately some of the damage had already been done: Unicode APIs had been introduced with 16-bit wide characters, which then had to be adapted to UTF-16; all the complexities of endianness in using UTF-16 as a transfer format; and so on.

      UTF-8 was a brilliant design that helped with a lot of this, but so much accidental complexity was introduced by that early error that I think it's fair to call the original 16-bit UCS-2 Unicode's original sin, and the biggest driver of complexity and drag on adoption of Unicode.

      Anyhow, overall, Unicode is an amazing project and a huge win. It had its missteps and it's not perfect, but it's so much better than the plethora of legacy character sets and encodings that it has almost entirely replaced at this point.
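
      The legacy of that 16-bit assumption is easy to demonstrate (a small Python 3 sketch): any character beyond U+FFFF needs a surrogate pair in UTF-16 plus a byte-order decision, while UTF-8 has a single byte-order-free form:

      ```python
      ch = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the 16-bit BMP

      print(ch.encode("utf-16-be").hex(" "))  # d8 34 dd 1e  (surrogate pair)
      print(ch.encode("utf-16-le").hex(" "))  # 34 d8 1e dd  (same pair, byte-swapped)
      print(ch.encode("utf-8").hex(" "))      # f0 9d 84 9e  (one unambiguous form)
      ```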

    • Rich Felker (dalias@hachyderm.io)'s status on Monday, 02-Jun-2025 12:26:27 JST
      in reply to Brian Campbell

      @unlambda Yeah, the need to round-trip led to a lot of the otherwise non-essential complexity, from double-encoded characters to encoding, as single characters, things that should really have been 2 or more characters.

      And trying to be 16-bit (this was largely a Microsoft agenda, with destroying all existing text-based protocols and data formats as part of the goal) was indeed also a huge unforced error that still causes problems. Only the resolution with UCS/ISO, and the introduction of UTF-8, averted disaster and likely failure here.
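
      Two standard illustrations of that round-trip baggage (examples chosen here for illustration, not taken from the thread), using Python's unicodedata:

      ```python
      import unicodedata

      # U+2126 OHM SIGN exists only so legacy charsets convert
      # losslessly; canonical normalization folds it into U+03A9
      # GREEK CAPITAL LETTER OMEGA.
      print("\u2126" == "\u03a9")                                # False
      print(unicodedata.normalize("NFC", "\u2126") == "\u03a9")  # True

      # U+FB01 LATIN SMALL LIGATURE FI was arguably never "a
      # character"; compatibility normalization expands it to the
      # two characters it stands for.
      print(unicodedata.normalize("NFKC", "\ufb01"))  # 'fi'
      ```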
