GNU social JP
GNU social JP is a Japanese GNU social server.
Conversation

Notices

    Wolf480pl (wolf480pl@mstdn.io)'s status on Saturday, 29-Mar-2025 20:35:50 JST

    Just realized it's impossible to use UCS-2 (UTF-16) for passing arguments to unix programs, because arguments are nul-terminated, and in UCS-2, almost every other byte is zero...

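The zero-byte problem described above is easy to see directly; a minimal sketch in Python (used here purely for illustration):

```python
# UTF-16/UCS-2 stores each ASCII character in two bytes, one of them zero,
# so a NUL-terminated C string (or an execve() argument) is cut short.
arg = "ls"
utf16 = arg.encode("utf-16-le")   # little-endian, no BOM
utf8 = arg.encode("utf-8")

print(utf16)        # b'l\x00s\x00' -- an embedded NUL right after the 'l'
print(0 in utf16)   # True: argv handling would stop at the second byte
print(0 in utf8)    # False: UTF-8 keeps ASCII text free of zero bytes
```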
    翠星石 (suiseiseki@freesoftwareextremist.com)'s status on Saturday, 29-Mar-2025 20:35:49 JST
      @wolf480pl Unix programs? Those are GNU's Not Unix programs sir.

      UTF-16 is a useless format, as it's a multibyte encoding that almost doubles the storage size of text, unless all you are encoding is Chinese characters.

      Just use UTF-8 - it's ASCII compatible and you can pass it to whatever program and it will work unless the program does something stupid.

      If you have some UTF-16 encoded files, you can convert them to UTF-8 with GNU iconv.
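The conversion step mentioned above (GNU iconv) has a close analogue in Python's codec machinery; a sketch of the same re-encoding job, not a substitute for iconv itself:

```python
# Re-encode UTF-16 data as UTF-8, the same job `iconv -f UTF-16 -t UTF-8`
# performs on files. Python's "utf-16" codec honours a BOM and picks the
# byte order automatically.
def utf16_to_utf8(data: bytes) -> bytes:
    return data.decode("utf-16").encode("utf-8")

utf16_bytes = "héllo".encode("utf-16")   # BOM + two bytes per character
print(utf16_to_utf8(utf16_bytes))        # b'h\xc3\xa9llo'
```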
    翠星石 (suiseiseki@freesoftwareextremist.com)'s status on Saturday, 29-Mar-2025 20:46:04 JST, in reply to :umu: :umu:
      @a1ba @wolf480pl >It's double the size for most things (most things are ASCII)
      >It's somewhat faster to decode.
      ????

      Thinking about the differences between the variable encodings of UTF-8 and UTF-16, I don't see how either is meaningfully faster to decode than the other.
    :umu: :umu: (a1ba@suya.place)'s status on Saturday, 29-Mar-2025 20:46:05 JST, in reply to 翠星石
      @Suiseiseki @wolf480pl utf-16 is somewhat faster to decode. It doesn't even have to be Chinese; that's true even for Cyrillic text and the upper half of Latin-1.

      But then it's still double the size on everything that's ASCII.

      Just use the right tools to achieve the goal.
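The size claims traded in this exchange are quick to check; a sketch in Python comparing encoded lengths per script:

```python
# Bytes per sample string: UTF-16 doubles ASCII, ties UTF-8 on Cyrillic
# (both need two bytes per character), and beats UTF-8 on CJK text.
samples = {"ascii": "hello", "cyrillic": "привет", "cjk": "你好"}
for name, text in samples.items():
    print(name, len(text.encode("utf-8")), len(text.encode("utf-16-le")))
# ascii 5 10
# cyrillic 12 12
# cjk 6 4
```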
    iced depresso (icedquinn@blob.cat)'s status on Saturday, 29-Mar-2025 21:02:28 JST, in reply to 翠星石, :umu: :umu:, divVerent
      @divVerent @Suiseiseki @wolf480pl @a1ba utf-16 was propped up by microsoft and sun as a misguided attempt to get out of how unicode turns string indexing from O(1) to O(n). the idea was that if you just use 16-bit cells then you are back to being able to reach an arbitrary rune at an arbitrary index.

      this is false because diacritics still exist in utf-16. and utf-16 STILL has characters it cannot represent in a single cell (the ones outside the Basic Multilingual Plane), so you STILL have to perform local checks to see if you are about to slice directly into a rune at the wrong place.

      basically unicode sucks and some corporate coders tried to get around it and made everything suck even more.
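The point about characters outside the BMP can be made concrete; a small Python check (the treble-clef codepoint is just an arbitrary non-BMP example):

```python
# Codepoints above U+FFFF do not fit in one 16-bit cell; UTF-16 encodes
# them as a surrogate pair, so fixed-width 16-bit indexing breaks there.
clef = "\U0001D11E"                        # MUSICAL SYMBOL G CLEF
units = len(clef.encode("utf-16-le")) // 2 # number of 16-bit code units
print(ord(clef) > 0xFFFF)  # True: outside the Basic Multilingual Plane
print(units)               # 2: a surrogate pair, not a single cell
```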
    divVerent (divverent@social.vivaldi.net)'s status on Saturday, 29-Mar-2025 21:02:31 JST, in reply to 翠星石, :umu: :umu:
      @Suiseiseki @wolf480pl @a1ba "It depends". UTF-16 is definitely faster to decode because you have fewer loop iterations for the same string (8-bit and 16-bit RAM reads are about the same speed on the CPU).

      HOWEVER, especially when all codepoints are ASCII, UTF-16 uses twice the memory bandwidth. And that hurts too.

      So, ultimately depends on the character set / language used.

    iced depresso (icedquinn@blob.cat)'s status on Saturday, 29-Mar-2025 21:03:45 JST, in reply to 翠星石, iced depresso, :umu: :umu:, divVerent
      @divVerent @Suiseiseki @a1ba @wolf480pl this assumes you care about being correct. if you don't, and evidently companies in current year do not, then SHRUG: as long as you normalize the input and confine everything to the BMP, it kind of works.
    iced depresso (icedquinn@blob.cat)'s status on Saturday, 29-Mar-2025 21:06:48 JST, in reply to 翠星石, :umu: :umu:, divVerent
      @wolf480pl @Suiseiseki @divVerent @a1ba yes :cat_sad:
    Wolf480pl (wolf480pl@mstdn.io)'s status on Saturday, 29-Mar-2025 21:06:49 JST, in reply to 翠星石, iced depresso, :umu: :umu:, divVerent

      @icedquinn @Suiseiseki @divVerent @a1ba ok but like

      Did Unicode contain surrogates and modifier codepoints at the time when UTF-16 was designed?

    iced depresso (icedquinn@blob.cat)'s status on Saturday, 29-Mar-2025 21:09:41 JST, in reply to 翠星石, :umu: :umu:, divVerent
      @wolf480pl @Suiseiseki @divVerent @a1ba no. it's more that they coded for precursor formats https://www.ibm.com/docs/en/i/7.4?topic=unicode-ucs-2-its-relationship-utf-16 and then tried to upgrade by gesticulating wildly.

      (i wrote a whole utf-8 module once upon a horrible time)

      Attachments

      1. UCS-2 and its relationship to Unicode (UTF-16)
         The UCS-2 standard, an early version of Unicode, is limited to 65,535 characters. However, the data processing industry needs over 94,000 characters; the UCS-2 standard has been superseded by the Unicode UTF-16 standard.
    Wolf480pl (wolf480pl@mstdn.io)'s status on Saturday, 29-Mar-2025 21:09:42 JST, in reply to 翠星石, iced depresso, :umu: :umu:, divVerent

      @icedquinn @Suiseiseki @divVerent @a1ba
      So I can't even blame this on the Unicode Consortium's scope creep ;_;

    iced depresso (icedquinn@blob.cat)'s status on Saturday, 29-Mar-2025 21:23:15 JST, in reply to 翠星石, :umu: :umu:, divVerent
      @divVerent @Suiseiseki @wolf480pl @a1ba back when i had that code i basically made several accessors and iterators to deal with it. you told it if you were dealing with graphemes, or just code points, and it had heuristics and loops to check the nearest safe split point at a given byte.

      i don't think i still have that C code after all these years. shame, since it would have been neat resume fodder before chatgpt.
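The "nearest safe split point" heuristic described above can be sketched in Python: back up past combining marks so a diacritic is never separated from its base character (codepoints only; full grapheme clusters need more than this):

```python
import unicodedata

def safe_split_index(s: str, i: int) -> int:
    # Move the split point left while it lands on a combining mark, so the
    # mark stays attached to the base character on its left.
    while i > 0 and unicodedata.combining(s[i]):
        i -= 1
    return i

s = "cafe\u0301 au lait"        # 'e' followed by COMBINING ACUTE ACCENT
print(safe_split_index(s, 4))   # 4 points at the accent -> backs up to 3
print(safe_split_index(s, 2))   # 2 points at 'f', already safe -> 2
```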
    divVerent (divverent@social.vivaldi.net)'s status on Saturday, 29-Mar-2025 21:23:16 JST, in reply to 翠星石, iced depresso, :umu: :umu:

      @icedquinn @Suiseiseki @wolf480pl @a1ba TBH diacritics are less of an issue - most operations on strings can easily work on a per-codepoint basis, such as word wrapping - you just need to handle diacritics and other combining codepoints as if they're a word character.

      And for stuff like line length computation, you need to take the different per-character width of your font into account anyway.

      What's really annoying is string comparing, as you now have to apply a normalization first...

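The normalization problem in that last point is easy to demonstrate; a minimal Python sketch using the standard unicodedata module:

```python
import unicodedata

# The same text can arrive composed (NFC) or decomposed (NFD); a naive
# codepoint-by-codepoint comparison sees two different strings.
composed = "caf\u00e9"      # 'é' as one codepoint
decomposed = "cafe\u0301"   # 'e' plus COMBINING ACUTE ACCENT

print(composed == decomposed)                    # False
print(unicodedata.normalize("NFC", composed) ==
      unicodedata.normalize("NFC", decomposed))  # True after normalizing
```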

GNU social JP is a social network, courtesy of GNU social JP管理人. It runs on GNU social, version 2.0.2-dev, available under the GNU Affero General Public License.

All GNU social JP content and data are available under the Creative Commons Attribution 3.0 license.