Trying to figure out how to print a UTF32 character in C and so far the answer seems to be "you can't"
Eniko Fox (eniko@peoplemaking.games)'s status on Sunday, 08-Dec-2024 19:59:45 JST Eniko Fox -
Rich Felker (dalias@hachyderm.io)'s status on Sunday, 08-Dec-2024 19:59:44 JST Rich Felker @eniko On conforming implementations, printf("%lc", unicode_codepoint_val);
-
Rich Felker (dalias@hachyderm.io)'s status on Sunday, 08-Dec-2024 20:00:50 JST Rich Felker @eniko The myth that this is hard is entirely Microsoft's implementation being gratuitously and intentionally broken.
-
Rich Felker (dalias@hachyderm.io)'s status on Sunday, 08-Dec-2024 20:44:53 JST Rich Felker @eniko wint_t, but default promotions from wchar_t should be fine.
-
Eniko Fox (eniko@peoplemaking.games)'s status on Sunday, 08-Dec-2024 20:44:54 JST Eniko Fox @dalias what type is unicode_codepoint_val
-
Rich Felker (dalias@hachyderm.io)'s status on Sunday, 08-Dec-2024 20:51:20 JST Rich Felker @eniko Because Windows is wrong. If wchar_t is too narrow for full Unicode you're not allowed to support all of Unicode. C explicitly forbids "multi wchar_t chars" (thus UTF-16) which they do because they insisted on contradicting the experts in the early 90s who told them 16 bits wasn't enough and got themselves stuck. C11 strongly prefers wchar_t numeric vals be UCS codepoints (there's a macro that tells you this) and unless I'm misremembering, C23 requires it.
Haelwenn /элвэн/ :triskell: likes this. -
Eniko Fox (eniko@peoplemaking.games)'s status on Sunday, 08-Dec-2024 20:51:21 JST Eniko Fox @dalias everything i've found tells me not to use wchar_t because it is unclear what width it's going to be
-
Lulu · לולו (lulu@hachyderm.io)'s status on Sunday, 08-Dec-2024 20:57:57 JST Lulu · לולו The fact that UTF-16 can't die is just wild.
-
Eniko Fox (eniko@peoplemaking.games)'s status on Sunday, 08-Dec-2024 20:58:48 JST Eniko Fox @dalias ok so then how do i support printing cross platform 32-bit unicode code points
-
Rich Felker (dalias@hachyderm.io)'s status on Sunday, 08-Dec-2024 20:58:48 JST Rich Felker @eniko With modern Windows, you can set the locale codepage to UTF-8 and it should just work doing everything in UTF-8, not touching wchar_t. Arguably this is the best way to do things, but it doesn't respect legacy Unix systems with non-UTF-8 encodings. Modern C also has char32_t (always UTF-32), which can be used if you're worried the system wchar_t is broken like on Windows, but what you can easily do with it is limited.
-
Rich Felker (dalias@hachyderm.io)'s status on Sunday, 08-Dec-2024 21:16:46 JST Rich Felker @eniko 7.30.1 from C23:
-
Eniko Fox (eniko@peoplemaking.games)'s status on Sunday, 08-Dec-2024 21:16:47 JST Eniko Fox @dalias from what I read char32_t isn't actually guaranteed to be utf32 and also I couldn't find a way to print it
-
Rich Felker (dalias@hachyderm.io)'s status on Sunday, 08-Dec-2024 21:30:05 JST Rich Felker @eniko Unfortunately the only way to print it is c32rtomb to convert it to a multibyte char string (in any reasonable setup this is UTF-8) in the current locale encoding.
-
Rich Felker (dalias@hachyderm.io)'s status on Sunday, 08-Dec-2024 21:44:05 JST Rich Felker @kittylyst @lulu @eniko Getting rid of Java? 😈
-
Ben Evans (kittylyst@mastodon.social)'s status on Sunday, 08-Dec-2024 21:44:06 JST Ben Evans @lulu @dalias @eniko Java's internal representation for non-ASCII strings is UTF-16 and it's not immediately clear how that could be changed. So I think it'll be around for the foreseeable future.
-
Ben Evans (kittylyst@mastodon.social)'s status on Sunday, 08-Dec-2024 21:45:57 JST Ben Evans @dalias @lulu @eniko Number of active server JVMs in the wild continues to increase, having doubled in ~6 years IIRC.
-
Rich Felker (dalias@hachyderm.io)'s status on Sunday, 08-Dec-2024 21:45:57 JST Rich Felker -
Rich Felker (dalias@hachyderm.io)'s status on Sunday, 08-Dec-2024 21:57:01 JST Rich Felker @eniko C23 now mandates that.
-
Eniko Fox (eniko@peoplemaking.games)'s status on Sunday, 08-Dec-2024 21:57:02 JST Eniko Fox @dalias i found https://beej.us/guide/bgc/html/split/unicode-wide-characters-and-all-that.html earlier and it says:
are values in these stored in UTF-16 or UTF-32? Depends on the implementation.
But you can test to see if they are. If the macros __STDC_UTF_16__ or __STDC_UTF_32__ are defined (to 1) it means the types hold UTF-16 or UTF-32, respectively.