GNU social JP
GNU social JP is a Japanese GNU social server.
Conversation

Notices

  1. luna, only carbon now (luna@pony.social)'s status on Thursday, 13-Mar-2025 21:00:42 JST

    C++ friends, is there a standard way to iterate over unicode code points (not code units) in a string (or i guess a u8string)?

    edit: yes i know how to decode utf8 manually, my query is about the stl specifically

    • Rich Felker (dalias@hachyderm.io)'s status on Thursday, 13-Mar-2025 21:00:30 JST
      in reply to

      @luna *If* the configured locale is using UTF-8 as its encoding, the standard multibyte interfaces in C or C++ will do this. mblen() returns the amount to advance by in bytes.

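      A minimal sketch of the approach Rich describes, with the assumptions made explicit: the active LC_CTYPE locale must use UTF-8, and `wchar_t` must be wide enough to hold a full scalar value (true on glibc, not on Windows, where it is 16-bit). `mbrtowc` advances one multibyte character at a time, like `mblen`, but also yields the decoded value:

```cpp
#include <cassert>
#include <clocale>
#include <cwchar>
#include <string>
#include <vector>

// Decode a multibyte string into code points using the standard C
// multibyte interfaces (usable from C++ too). Only correct when the
// active locale's encoding is UTF-8 and wchar_t holds full scalar
// values -- both are assumptions here, not portable guarantees.
std::vector<char32_t> codepoints(const std::string& s) {
    std::vector<char32_t> out;
    std::mbstate_t st{};
    const char* p = s.data();
    std::size_t left = s.size();
    while (left > 0) {
        wchar_t wc;
        std::size_t n = std::mbrtowc(&wc, p, left, &st);
        if (n == (std::size_t)-1 || n == (std::size_t)-2)
            break;          // invalid or truncated sequence
        if (n == 0) n = 1;  // mbrtowc returns 0 for an embedded NUL
        out.push_back(static_cast<char32_t>(wc));
        p += n;
        left -= n;
    }
    return out;
}
```

      A caller must select a UTF-8 locale first, e.g. `std::setlocale(LC_CTYPE, "C.UTF-8")`, or the byte-at-a-time "C" locale semantics apply instead.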
    • Rich Felker (dalias@hachyderm.io)'s status on Thursday, 13-Mar-2025 21:07:44 JST
      in reply to
      • Michael T Babcock
      • Peter Brett

      @krans @mikebabcock Are D800-DFFF "codepoints"? I don't think so, but I usually use the unambiguous term "Unicode scalar values" where they're clearly excluded.

    • Peter Brett (krans@mastodon.me.uk)'s status on Thursday, 13-Mar-2025 21:07:46 JST
      in reply to
      • Michael T Babcock

      @mikebabcock Quick guide to Unicode terminology:

      - code units: the in-memory elements of the text encoding, i.e. bytes for UTF-8, 32-bit integers for UTF-32, etc
      - codepoints: the numbers in the range 0–0x10FFFF that are mapped to abstract characters
      - graphemes: the smallest functional units of a script, formed from one or more codepoints
      - grapheme clusters: the things people usually would describe as "a character" for the purpose of cursor motion, "the number of characters," etc.

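      To make the terminology concrete with one example: "é" spelled as e + U+0301 COMBINING ACUTE ACCENT is three UTF-8 code units, two codepoints, and one grapheme cluster. A minimal decoder sketch mapping code units to codepoints (illustrative only -- it does not validate overlong forms, surrogates, or truncated input):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Map UTF-8 code units (bytes) to codepoints. The lead byte's high
// bits give the sequence length; each continuation byte (10xxxxxx)
// contributes six more bits. No validation -- sketch only.
std::vector<std::uint32_t> decode_utf8(const std::string& s) {
    std::vector<std::uint32_t> cps;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t len;
        std::uint32_t cp;
        if (b < 0x80)      { len = 1; cp = b; }         // 0xxxxxxx
        else if (b < 0xE0) { len = 2; cp = b & 0x1F; }  // 110xxxxx
        else if (b < 0xF0) { len = 3; cp = b & 0x0F; }  // 1110xxxx
        else               { len = 4; cp = b & 0x07; }  // 11110xxx
        for (std::size_t k = 1; k < len; ++k)           // 10xxxxxx
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        cps.push_back(cp);
        i += len;
    }
    return cps;
}
```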
    • Michael T Babcock (mikebabcock@floss.social)'s status on Thursday, 13-Mar-2025 21:07:47 JST
      in reply to
      • Peter Brett

      @krans oh okay, my reversal, I'm sorry. As a Python programmer we just call those characters, because Python innately differentiates between characters and encodings. My C++ knowledge is 10 years out of date, alas, so I'm not helpful, but good luck!

    • Peter Brett (krans@mastodon.me.uk)'s status on Thursday, 13-Mar-2025 21:07:48 JST
      in reply to
      • Michael T Babcock

      @mikebabcock Those are code units

    • Michael T Babcock (mikebabcock@floss.social)'s status on Thursday, 13-Mar-2025 21:07:49 JST
      in reply to

      @luna@pony.so basically, iterating bytes in UTF-8 or words in UTF/UCS-16?

    • Rich Felker (dalias@hachyderm.io)'s status on Thursday, 13-Mar-2025 21:11:33 JST
      in reply to
      • Michael T Babcock
      • Peter Brett

      @krans @mikebabcock Are unassigned values "mapped to a character"? 🤪

      Sorry, not picking on you, just pointing out that the definitions here are subtle & sometimes painful. Not gratuitously, but intrinsically.

    • Peter Brett (krans@mastodon.me.uk)'s status on Thursday, 13-Mar-2025 21:11:35 JST
      in reply to
      • Michael T Babcock
      • Rich Felker

      @dalias Yes, if it's not mapped to a character it's not a codepoint. Sorry, my wording was ambiguous.

      @mikebabcock

    • Rich Felker (dalias@hachyderm.io)'s status on Thursday, 13-Mar-2025 21:20:04 JST
      in reply to
      • Michael T Babcock
      • Peter Brett

      @krans @mikebabcock Nope, a UTF is defined as a bijection between the Unicode Scalar Values and some subset of the possible sequences of code units. Thus UTFs can't/don't represent numbers in the surrogate range but do represent & round-trip noncharacter things like 0xFFFF.

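      The distinction Rich draws can be written down directly: codepoints span 0 through 0x10FFFF, and the Unicode scalar values are that range minus the surrogate block D800–DFFF. A tiny illustrative sketch (the predicate names are hypothetical):

```cpp
#include <cstdint>

// Codepoints are 0..0x10FFFF; scalar values additionally exclude the
// UTF-16 surrogate range D800..DFFF. Only scalar values are
// representable in a UTF -- noncharacters like U+FFFF still qualify.
constexpr bool is_codepoint(std::uint32_t v) {
    return v <= 0x10FFFF;
}
constexpr bool is_scalar_value(std::uint32_t v) {
    return is_codepoint(v) && !(v >= 0xD800 && v <= 0xDFFF);
}
```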
    • Peter Brett (krans@mastodon.me.uk)'s status on Thursday, 13-Mar-2025 21:20:06 JST
      in reply to
      • Michael T Babcock
      • Rich Felker

      @dalias I thought surrogates were USVs but not codepoints? @mikebabcock

    • Rich Felker (dalias@hachyderm.io)'s status on Saturday, 15-Mar-2025 01:12:56 JST
      in reply to
      • Michael T Babcock
      • Peter Brett

      @mikebabcock @krans What does that mean precisely though? (IOW what do you mean by "encoding system of UCS"?)

      UTF-8 was originally conceived without a lot of rigor as an encoding of 31-bit numbers with non-unique encodings, but that was quickly realized to be a mistake and fixed. The other UTFs, and the unified definition of a UTF (which also includes GB18030!), were developed more rigorously, and involve the concept of USVs.

    • Michael T Babcock (mikebabcock@floss.social)'s status on Saturday, 15-Mar-2025 01:12:57 JST
      in reply to
      • Rich Felker
      • Peter Brett

      @dalias @krans I prefer to think of UTF as an encoding system of UCS, as that's how it was designed even though sometimes it has other side-effects.

    • Rich Felker (dalias@hachyderm.io)'s status on Saturday, 15-Mar-2025 02:04:13 JST
      in reply to
      • Michael T Babcock
      • Peter Brett

      @mikebabcock @krans 16 bit code units were a stupid idea but we're stuck with them all over the place thanks to Windows and Java.

      Anything modern uses UTF-8.

    • Michael T Babcock (mikebabcock@floss.social)'s status on Saturday, 15-Mar-2025 02:04:15 JST
      in reply to
      • Rich Felker
      • Peter Brett

      @krans @dalias aside, this is just more proof that the terminology needs revision.
      The fact that Unicode is just a numbered list of possible visual language thingies that can sometimes be combined to make other logical language thingies and those numbers can be encoded in a bunch of different ways is already complex enough for most people.
      (Never mind 16+ bit encodings having endian issues)

    • Peter Brett (krans@mastodon.me.uk)'s status on Saturday, 15-Mar-2025 02:04:27 JST
      in reply to
      • Michael T Babcock
      • Rich Felker

      @mikebabcock Human language is complex. As far as I can tell, most of the complexity in Unicode arises from scripts being inherently complex; the remainder is due to providing a migration path from older encoding forms.

      I haven't found any complexity in Unicode that I didn't (grudgingly) agree was necessary, apart from emoji…

      @dalias

    • Rich Felker (dalias@hachyderm.io)'s status on Saturday, 15-Mar-2025 02:07:16 JST
      in reply to
      • Michael T Babcock
      • Peter Brett

      @mikebabcock @krans Oh please this has been debunked so many times. Ultimately because compression makes it irrelevant in most contexts where size matters, but also, ideographic languages have a much higher *base* information density. 3 UTF-8 bytes of kanji typically contain as much information as 3-8 bytes of Latin script.

    • Michael T Babcock (mikebabcock@floss.social)'s status on Saturday, 15-Mar-2025 02:07:17 JST
      in reply to
      • Rich Felker
      • Peter Brett

      @dalias @krans the result of which has been that languages that primarily use ASCII benefit greatly in byte-count from using UTF-8 as an encoding system, where languages like Japanese (iirc) end up using 3 bytes in UTF-8 but only two in 16-bit encodings.

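      The byte counts being compared can be checked directly. A small sketch with two hypothetical helpers (`utf8_bytes`/`utf16_bytes` are names introduced here, not library functions): 日 (U+65E5) costs 3 bytes in UTF-8 but 2 in UTF-16, while ASCII "day" costs 3 versus 6.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Byte cost of a string in each encoding form. std::string holds
// UTF-8 code units (1 byte each); std::u16string holds UTF-16 code
// units (2 bytes each).
inline std::size_t utf8_bytes(const std::string& s) {
    return s.size();
}
inline std::size_t utf16_bytes(const std::u16string& s) {
    return s.size() * sizeof(char16_t);
}
```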
    • Michael T Babcock (mikebabcock@floss.social)'s status on Saturday, 15-Mar-2025 02:07:19 JST
      in reply to
      • Rich Felker
      • Peter Brett

      @dalias @krans so you have a system that uses arbitrarily large numbers. You can store those numbers as very large words or dwords or you can encode them into smaller serially-decoded parcels.
      UTF does that.
      Each UTF-8 byte is either the start of an encoded value or a continuation byte, distinguished by its high bits. This means the first 128 characters (0–127) of ASCII and UTF-8 match, by the way.
      Small numbers? fewer bytes to encode. Large numbers? more bytes. UTF=variable. UCS=fixed.

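      The variable-length scheme described above, as a sketch encoder (illustrative only: it does not reject surrogates or values above 0x10FFFF, which a real UTF-8 encoder must):

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode one scalar value as UTF-8: small numbers get fewer bytes,
// large numbers more. ASCII values map to themselves as single bytes.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp < 0x80) {                        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {              // 3 bytes: 1110xxxx 10xxxxxx x2
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                // 4 bytes: 11110xxx 10xxxxxx x3
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```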


GNU social JP is a social network, courtesy of GNU social JP管理人. It runs on GNU social, version 2.0.2-dev, available under the GNU Affero General Public License.

Creative Commons Attribution 3.0 All GNU social JP content and data are available under the Creative Commons Attribution 3.0 license.