C++ friends, is there a standard way to iterate over unicode code points (not code units) in a string (or i guess a u8string)?
edit: yes i know how to decode utf8 manually, my query is about the stl specifically
@luna *If* the configured locale is using UTF-8 as its encoding, the standard multibyte interfaces in C or C++ will do this. mblen() returns the amount to advance by in bytes.
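A minimal sketch of that mblen() approach, assuming a UTF-8 locale has been configured first (the function name is mine, not a standard one):

```cpp
#include <cassert>
#include <clocale>
#include <cstdlib>
#include <cstring>

// Count the multibyte characters in a NUL-terminated string by stepping
// mblen() through it. In a UTF-8 locale, one multibyte character is one
// code point.
std::size_t count_codepoints_mb(const char *s) {
    std::mblen(nullptr, 0);                 // reset mblen's internal state
    std::size_t count = 0;
    const char *end = s + std::strlen(s);
    while (s < end) {
        int n = std::mblen(s, end - s);     // bytes in the next character
        if (n <= 0) break;                  // invalid or incomplete sequence
        ++count;
        s += n;
    }
    return count;
}
```

For example, after `std::setlocale(LC_ALL, "C.UTF-8")`, the 6-byte string "héllo" counts as 5 code points; whether a "C.UTF-8" locale is actually installed is platform-dependent, which is exactly the *If* above.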
@krans @mikebabcock Are D800-DFFF "codepoints"? I don't think so, but I usually use the unambiguous term "Unicode scalar values" where they're clearly excluded.
@mikebabcock Quick guide to Unicode terminology:
- code units: the in-memory elements of the text encoding, i.e. bytes for UTF-8, 16-bit integers for UTF-16, 32-bit integers for UTF-32, etc.
- codepoints: the numbers in the range 0–0x10FFFF that are mapped to abstract characters
- graphemes: the smallest functional units of a script, formed from one or more codepoints
- grapheme clusters: the things people usually would describe as “a character” for the purpose of cursor motion, “the number of characters,” etc.
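The first two distinctions can be made concrete in a few lines of C++. The example below spells the UTF-8 bytes out explicitly so it doesn't assume any particular source-file encoding:

```cpp
#include <string>

// The same text, 'e' followed by U+0301 COMBINING ACUTE ACCENT ("é"),
// viewed at two of the levels above.
inline const std::u32string codepoints = U"e\u0301"; // UTF-32: one code unit per codepoint
inline const std::string    codeunits  = "e\xCC\x81"; // the same text as UTF-8 bytes
// codepoints.size() == 2 (two codepoints), codeunits.size() == 3 (three
// UTF-8 code units, i.e. bytes), and both represent a single grapheme
// cluster, which the standard library offers no API to count.
```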
@krans oh okay, my mistake, I'm sorry. As a Python programmer we just call those characters because Python innately differentiates between characters and encodings. My C++ knowledge is 10 years out of date, alas, so I'm not helpful, but good luck!
@mikebabcock Those are code units
@luna@pony.so basically, iterating bytes in UTF-8 or 16-bit words in UTF-16/UCS-2?
@krans @mikebabcock Are unassigned values "mapped to a character"? 🤪
Sorry, not picking on you, just pointing out that the definitions here are subtle & sometimes painful. Not gratuitously, but intrinsically.
@dalias Yes, if it's not mapped to a character it's not a codepoint. Sorry, my wording was ambiguous.
@krans @mikebabcock Nope, a UTF is defined as a bijection between the Unicode Scalar Values and some subset of the possible sequences of code units. Thus UTFs can't/don't represent numbers in the surrogate range but do represent & round-trip noncharacter things like 0xFFFF.
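That definition can be made concrete with a small encoder sketch (my own illustration, the function name is mine): it refuses the surrogate range but happily round-trips noncharacters such as U+FFFF.

```cpp
#include <optional>
#include <string>

// Encode one Unicode Scalar Value as UTF-8, or nullopt if the input is
// not a USV (a surrogate, or above U+10FFFF).
std::optional<std::string> encode_utf8(char32_t cp) {
    if (cp >= 0xD800 && cp <= 0xDFFF) return std::nullopt; // surrogate: not a USV
    if (cp > 0x10FFFF) return std::nullopt;                // outside Unicode
    if (cp < 0x80)    return std::string{char(cp)};
    if (cp < 0x800)   return std::string{char(0xC0 | (cp >> 6)),
                                         char(0x80 | (cp & 0x3F))};
    if (cp < 0x10000) return std::string{char(0xE0 | (cp >> 12)),
                                         char(0x80 | ((cp >> 6) & 0x3F)),
                                         char(0x80 | (cp & 0x3F))};
    return std::string{char(0xF0 | (cp >> 18)),
                       char(0x80 | ((cp >> 12) & 0x3F)),
                       char(0x80 | ((cp >> 6) & 0x3F)),
                       char(0x80 | (cp & 0x3F))};
}
```

So `encode_utf8(0xD800)` yields nothing, while the noncharacter U+FFFF encodes to the bytes EF BF BF and decodes back, exactly the bijection-over-USVs property described above.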
@dalias I thought surrogates were USVs but not codepoints? @mikebabcock
@mikebabcock @krans What does that mean precisely though? (IOW what do you mean by "encoding system of UCS"?)
UTF-8 was originally conceived without a lot of rigor as an encoding of 31-bit numbers with non-unique encodings, but that was quickly realized to be a mistake and fixed. The other UTFs, and the unified definition of a UTF (which also includes GB18030!), were developed more rigorously, and involve the concept of USVs.
@dalias @krans I prefer to think of UTF as an encoding system of UCS, as that's how it was designed even though sometimes it has other side-effects.
@mikebabcock @krans 16 bit code units were a stupid idea but we're stuck with them all over the place thanks to Windows and Java.
Anything modern uses UTF-8.
@krans @dalias aside, this is just more proof that the terminology needs revision.
The fact that Unicode is just a numbered list of possible visual language thingies that can sometimes be combined to make other logical language thingies and those numbers can be encoded in a bunch of different ways is already complex enough for most people.
(Never mind 16+ bit encodings having endian issues)
@mikebabcock Human language is complex. As far as I can tell, most of the complexity in Unicode arises from scripts being inherently complex; the remainder is due to providing a migration path from older encoding forms.
I haven't found any complexity in Unicode that I didn't (grudgingly) agree was necessary, apart from emoji…
@mikebabcock @krans Oh please this has been debunked so many times. Ultimately because compression makes it irrelevant in most contexts where size matters, but also, ideographic languages have a much higher *base* information density. 3 UTF-8 bytes of kanji typically contain as much information as 3-8 bytes of Latin script.
@dalias @krans the result of which has been that languages that primarily use ASCII benefit greatly in byte-count from using UTF-8 as an encoding system, whereas languages like Japanese (iirc) end up using 3 bytes per character in UTF-8 but only two in 16-bit encodings.
@dalias @krans so you have a system that uses arbitrarily large numbers. You can store those numbers as very large words or dwords or you can encode them into smaller serially-decoded parcels.
UTF does that.
Each UTF-8 byte is either a plain ASCII byte, the start of a multi-byte sequence, or a continuation byte, distinguished by its high bits (0xxxxxxx, 11xxxxxx, and 10xxxxxx respectively). This means the first 128 characters of ASCII and UTF-8 match, by the way.
Small numbers? fewer bytes to encode. Large numbers? more bytes. UTF=variable. UCS=fixed.
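That decode loop can be sketched in a few lines (my own illustration, with no error recovery beyond rejecting malformed input): the lead byte's high bits give the sequence length, and each continuation byte contributes 6 more bits.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Decode a UTF-8 byte sequence into codepoints; returns an empty vector
// on any malformed input.
std::vector<char32_t> decode_utf8(std::string_view in) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < in.size();) {
        unsigned char b = in[i];
        int len; char32_t cp;
        if      (b < 0x80)           { len = 1; cp = b; }        // ASCII
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; } // 2-byte lead
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; } // 3-byte lead
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; } // 4-byte lead
        else return {};                        // stray continuation byte
        if (i + len > in.size()) return {};    // truncated sequence
        for (int k = 1; k < len; ++k) {
            unsigned char c = in[i + k];
            if ((c & 0xC0) != 0x80) return {}; // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);       // 6 payload bits per byte
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

For example, the bytes 41 C3 A9 E3 81 82 decode to the three codepoints U+0041, U+00E9, U+3042: small number, one byte; larger numbers, more bytes.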