C++ friends, is there a standard way to iterate over unicode code points (not code units) in a string (or i guess a u8string)?
edit: yes i know how to decode utf8 manually, my query is about the stl specifically
@luna *If* the configured locale is using UTF-8 as its encoding, the standard multibyte interfaces in C or C++ will do this. mblen() returns the amount to advance by in bytes.
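A minimal sketch of that mblen() approach, assuming a UTF-8 locale has been configured first (the function name is mine, not a standard one):

```cpp
#include <cassert>
#include <clocale>
#include <cstdlib>
#include <cstring>

// Count the multibyte characters in a NUL-terminated string by stepping
// mblen() through it. In a UTF-8 locale, one multibyte character is one
// code point.
std::size_t count_codepoints_mb(const char *s) {
    std::mblen(nullptr, 0);                 // reset mblen's internal state
    std::size_t count = 0;
    const char *end = s + std::strlen(s);
    while (s < end) {
        int n = std::mblen(s, end - s);     // bytes in the next character
        if (n <= 0) break;                  // invalid or incomplete sequence
        ++count;
        s += n;
    }
    return count;
}
```

For example, after `std::setlocale(LC_ALL, "C.UTF-8")`, the 6-byte string "héllo" counts as 5 code points; whether a "C.UTF-8" locale is actually installed is platform-dependent, which is exactly the *If* above.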
@krans @mikebabcock Are D800-DFFF "codepoints"? I don't think so, but I usually use the unambiguous term "Unicode scalar values" where they're clearly excluded.
@mikebabcock Quick guide to Unicode terminology:
- code units: the in-memory elements of the text encoding, i.e. bytes for UTF-8, 16-bit integers for UTF-16, 32-bit integers for UTF-32, etc.
- codepoints: the numbers in the range 0–0x10FFFF that are mapped to abstract characters
- graphemes: the smallest functional units of a script, formed from one or more codepoints
- grapheme clusters: the things people usually would describe as “a character” for the purpose of cursor motion, “the number of characters,” etc.
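The first two distinctions can be made concrete in a few lines of C++. The example below spells the UTF-8 bytes out explicitly so it doesn't assume any particular source-file encoding:

```cpp
#include <string>

// The same text, 'e' followed by U+0301 COMBINING ACUTE ACCENT ("é"),
// viewed at two of the levels above.
inline const std::u32string codepoints = U"e\u0301"; // UTF-32: one code unit per codepoint
inline const std::string    codeunits  = "e\xCC\x81"; // the same text as UTF-8 bytes
// codepoints.size() == 2 (two codepoints), codeunits.size() == 3 (three
// UTF-8 code units, i.e. bytes), and both represent a single grapheme
// cluster, which the standard library offers no API to count.
```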
@krans oh okay, my mistake, I'm sorry. As a Python programmer we just call those characters because Python innately differentiates between characters and encodings. My C++ knowledge is 10 years out of date, alas, so I'm not helpful, but good luck!
@mikebabcock Those are code units
@luna@pony.so basically, iterating bytes in UTF-8 or 16-bit words in UTF-16/UCS-2?
@krans @mikebabcock Are unassigned values "mapped to a character"? 🤪
Sorry, not picking on you, just pointing out that the definitions here are subtle & sometimes painful. Not gratuitously, but intrinsically.
@dalias Yes, if it's not mapped to a character it's not a codepoint. Sorry, my wording was ambiguous.
@krans @mikebabcock Nope, a UTF is defined as a bijection between the Unicode Scalar Values and some subset of the possible sequences of code units. Thus UTFs can't/don't represent numbers in the surrogate range but do represent & round-trip noncharacter things like 0xFFFF.
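That definition can be made concrete with a small encoder sketch (my own illustration, the function name is mine): it refuses the surrogate range but happily round-trips noncharacters such as U+FFFF.

```cpp
#include <optional>
#include <string>

// Encode one Unicode Scalar Value as UTF-8, or nullopt if the input is
// not a USV (a surrogate, or above U+10FFFF).
std::optional<std::string> encode_utf8(char32_t cp) {
    if (cp >= 0xD800 && cp <= 0xDFFF) return std::nullopt; // surrogate: not a USV
    if (cp > 0x10FFFF) return std::nullopt;                // outside Unicode
    if (cp < 0x80)    return std::string{char(cp)};
    if (cp < 0x800)   return std::string{char(0xC0 | (cp >> 6)),
                                         char(0x80 | (cp & 0x3F))};
    if (cp < 0x10000) return std::string{char(0xE0 | (cp >> 12)),
                                         char(0x80 | ((cp >> 6) & 0x3F)),
                                         char(0x80 | (cp & 0x3F))};
    return std::string{char(0xF0 | (cp >> 18)),
                       char(0x80 | ((cp >> 12) & 0x3F)),
                       char(0x80 | ((cp >> 6) & 0x3F)),
                       char(0x80 | (cp & 0x3F))};
}
```

So `encode_utf8(0xD800)` yields nothing, while the noncharacter U+FFFF encodes to the bytes EF BF BF and decodes back, exactly the bijection-over-USVs property described above.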
@dalias I thought surrogates were USVs but not codepoints? @mikebabcock
@mikebabcock @krans What does that mean precisely though? (IOW what do you mean by "encoding system of UCS"?)
UTF-8 was originally conceived without a lot of rigor as an encoding of 31-bit numbers with non-unique encodings, but that was quickly realized to be a mistake and fixed. The other UTFs, and the unified definition of a UTF (which also includes GB18030!), were developed more rigorously, and involve the concept of USVs.
@dalias @krans I prefer to think of UTF as an encoding system of UCS, as that's how it was designed even though sometimes it has other side-effects.
@mikebabcock @krans 16 bit code units were a stupid idea but we're stuck with them all over the place thanks to Windows and Java.
Anything modern uses UTF-8.
@krans @dalias aside, this is just more proof that the terminology needs revision.
The fact that Unicode is just a numbered list of possible visual language thingies that can sometimes be combined to make other logical language thingies and those numbers can be encoded in a bunch of different ways is already complex enough for most people.
(Never mind 16+ bit encodings having endian issues)
@mikebabcock Human language is complex. As far as I can tell, most of the complexity in Unicode arises from scripts being inherently complex; the remainder is due to providing a migration path from older encoding forms.
I haven't found any complexity in Unicode that I didn't (grudgingly) agree was necessary, apart from emoji…
@mikebabcock @krans Oh please this has been debunked so many times. Ultimately because compression makes it irrelevant in most contexts where size matters, but also, ideographic languages have a much higher *base* information density. 3 UTF-8 bytes of kanji typically contain as much information as 3-8 bytes of Latin script.
@dalias @krans the result of which has been that languages that primarily use ASCII benefit greatly in byte-count from using UTF-8 as an encoding system, whereas languages like Japanese (iirc) end up using 3 bytes per character in UTF-8 but only two in 16-bit encodings.
@dalias @krans so you have a system that uses arbitrarily large numbers. You can store those numbers as very large words or dwords or you can encode them into smaller serially-decoded parcels.
UTF does that.
Each UTF-8 byte is either a plain ASCII byte, the start of a multi-byte sequence, or a continuation byte, distinguished by its high bits (0xxxxxxx, 11xxxxxx, and 10xxxxxx respectively). This means the first 128 characters of ASCII and UTF-8 match, by the way.
Small numbers? fewer bytes to encode. Large numbers? more bytes. UTF=variable. UCS=fixed.
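That decode loop can be sketched in a few lines (my own illustration, with no error recovery beyond rejecting malformed input): the lead byte's high bits give the sequence length, and each continuation byte contributes 6 more bits.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Decode a UTF-8 byte sequence into codepoints; returns an empty vector
// on any malformed input.
std::vector<char32_t> decode_utf8(std::string_view in) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < in.size();) {
        unsigned char b = in[i];
        int len; char32_t cp;
        if      (b < 0x80)           { len = 1; cp = b; }        // ASCII
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; } // 2-byte lead
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; } // 3-byte lead
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; } // 4-byte lead
        else return {};                        // stray continuation byte
        if (i + len > in.size()) return {};    // truncated sequence
        for (int k = 1; k < len; ++k) {
            unsigned char c = in[i + k];
            if ((c & 0xC0) != 0x80) return {}; // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);       // 6 payload bits per byte
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

For example, the bytes 41 C3 A9 E3 81 82 decode to the three codepoints U+0041, U+00E9, U+3042: small number, one byte; larger numbers, more bytes.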