|C Locale Braindeath||AnonUser|
Stumbled upon an mpv commit that's about C locale handling. It's worth reading the entire thing.
Fix that libarchive fails to return filenames for UTF-8/UTF-16 entries.
The reason is that it uses locales and all that garbage, and mpv does
not set a locale.
Both C locales and wchar_t are shitfucked retarded legacy braindeath. If
the C/POSIX standard committee had actually competent members, these
would have been deprecated or removed long ago. (I mean, they managed to
remove gets().) To justify this emotional outbreak potentially insulting
to unknown persons, I will write a lot of text. Those not comfortable
with toxic language should pretend this is a religious text.
C locales are supposed to be a way to support certain languages and
cultures more easily. One example is character codepages. Back when UTF-8
was not invented yet, there were only 255 possible characters, which is
not enough for anything but English and some European languages. So they
decided to make the meaning of a character dependent on the current
codepage. The locale (LC_CTYPE specifically) determines what character
encoding is currently used.
Of course nowadays, this is legacy nonsense. Everything uses UTF-8 for
"char", and what doesn't is broken and terrible anyway. But the old ways
stayed with us, and the stupidity of it as well.
C locales were utterly moronic even when they were invented. The locale
(via setlocale()) is global state, and global state is not a reasonable
way to do anything. It will break libraries, or well modularized code.
(The latter would be forced to strictly guard all entrypoints to
set/restore locales, assuming a single-threaded world.)
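To make that guard concrete, here is a rough sketch of what such an
entrypoint might look like. parse_double_guarded() is a made-up name, not
something from mpv, and the whole thing is only valid under the
single-threaded assumption mentioned above.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical entrypoint guard: parse a number with C-locale semantics,
 * then put the caller's locale back. Only safe in a single-threaded
 * program, because setlocale() changes process-global state. */
int parse_double_guarded(const char *s, double *out)
{
    /* The string returned by setlocale() may be overwritten by the next
     * setlocale() call, so keep a private copy of it. */
    const char *current = setlocale(LC_ALL, NULL);
    if (!current)
        return -1;
    size_t len = strlen(current) + 1;
    char *saved = malloc(len);
    if (!saved)
        return -1;
    memcpy(saved, current, len);

    setlocale(LC_ALL, "C");
    char *end;
    *out = strtod(s, &end);
    setlocale(LC_ALL, saved);   /* restore whatever was active before */
    free(saved);

    return end == s ? -1 : 0;   /* -1 if nothing could be parsed */
}
```

Every public function of a library would need this dance, which is the
overhead the commit message complains about.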
On top of that, setting a locale randomly changes the semantics of a
bunch of standard functions. If a function respects locale, you suddenly
can't rely on it to behave the same on all systems. Some behavior can
come as a surprise, and of course it will be dependent on the region of
the user (it doesn't help that most software is US-centric, and the US
locale is almost like the C locale, i.e. almost what you expect).
Idiotically, locales were not just used to define the current character
encoding, but the concept was used for a whole lot of things, e.g.
whether numbers should use "," or "." as decimal separator. The latter
issue is actually much worse, because it breaks basic string conversion
or parsing of numbers for the purpose of interacting with file formats.
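A small sketch of one possible workaround when writing numbers into a
file format: format with snprintf() and then patch whatever decimal
separator the active locale produced back to ".". format_double_c() is a
hypothetical helper, and it assumes nobody calls setlocale() concurrently,
since localeconv() reads the same global state.

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format v with "%g", then force '.' as the decimal
 * separator regardless of LC_NUMERIC. Returns 0 on success, -1 if the
 * buffer is too small or formatting failed. */
int format_double_c(char *buf, size_t size, double v)
{
    int n = snprintf(buf, size, "%g", v);
    if (n < 0 || (size_t)n >= size)
        return -1;

    /* Ask the active locale which separator snprintf() just used. */
    const char *dp = localeconv()->decimal_point;
    size_t dplen = strlen(dp);
    if (dplen == 0 || strcmp(dp, ".") == 0)
        return 0;               /* already in the expected form */

    char *pos = strstr(buf, dp);
    if (!pos)
        return 0;               /* no fractional part, "inf", "nan", ... */

    /* Overwrite the separator with '.' and close the gap if the locale's
     * separator was longer than one byte. */
    *pos = '.';
    memmove(pos + 1, pos + dplen, strlen(pos + dplen) + 1);
    return 0;
}
```

The reverse direction (parsing) has the same problem with strtod().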
Much can be said about how retarded locales are, even beyond what I just
wrote, or will write below. They are so hilariously misdesigned and
insufficient, I can't even fathom how this shit was _standardized_. (In
any case, that meant everyone was forced to implement it.) Many C
functions can't even do it correctly. For example, the character set
encoding can be a multibyte encoding (not just UTF-8, but awful garbage
like Shift JIS (sometimes called SHIT JIZZ)), yet functions like
toupper() can return only 1 byte. Or just take the fact that the locale
API tries to define standard paper sizes (LC_PAPER) or telephone number
formatting (LC_TELEPHONE). Who the fuck uses this, or would ever use
this?
But the badness doesn't stop here. At some point, they invented threads.
And they put absolutely no thought into how threads should interact with
locales. So they kept locales as global state. Because obviously, you
want to be able to change the semantics of basic string processing
functions _while_ they're running, right? (Any thread can call
setlocale() at any time, and it's supposed to change the locale of all
threads.)
At this point, how the fuck are you supposed to do anything correctly?
You can't even temporarily switch the locale with setlocale(), because
it would asynchronously fuckup the other threads. All you can do is to
enforce a convention not to set anything but the C locale (this is what
mpv does), or to duplicate standard functions using code that doesn't
query locale (this is what e.g. libass does, a close dependency of mpv).
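For illustration, the "duplicate the standard functions" approach could
look roughly like the sketch below. These are not libass's actual
functions, just hypothetical ASCII-only stand-ins for toupper(),
tolower() and strcasecmp(), assuming an ASCII-compatible execution
character set.

```c
/* Locale-independent case mapping: only touches ASCII letters and leaves
 * every other byte value (including UTF-8 lead/continuation bytes) alone. */
int ascii_toupper(int c)
{
    return (c >= 'a' && c <= 'z') ? c - 'a' + 'A' : c;
}

int ascii_tolower(int c)
{
    return (c >= 'A' && c <= 'Z') ? c - 'A' + 'a' : c;
}

/* Case-insensitive comparison with ASCII semantics only. Returns <0, 0
 * or >0 like strcmp(). */
int ascii_strcasecmp(const char *a, const char *b)
{
    for (;; a++, b++) {
        int ca = ascii_tolower((unsigned char)*a);
        int cb = ascii_tolower((unsigned char)*b);
        if (ca != cb || ca == 0)
            return ca - cb;
    }
}
```

Since none of this consults the locale, it behaves the same no matter
what another thread does with setlocale().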
Imagine they had done this for certain other things. Like errno, with
all the brokenness of the locale API. This simply wouldn't have worked,
shit would just have been too broken. So they didn't. But locales give a
delicious sweet spot of brokenness, where things are broken enough to
cause neverending pain, but not broken enough that enough effort would
have been spent to fix it completely.
On that note, standard C11 actually can't stringify an error value. It
does define strerror(), but it's not thread safe, even though C11
supports threads. The idiots could just have defined it to be thread
safe. Even if your libc is horrible enough that it can't return string
literals, it could just use some thread local buffer. Because C11 does
define thread local variables. But hey, why care about details, if you
can just create a shitty standard?
(POSIX defines strerror_r(), which "solves" this problem, while still
not making strerror() thread safe.)
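The thread local buffer idea can be sketched on top of strerror_r().
thread_strerror() is a hypothetical name. One extra wrinkle, not mentioned
in the commit message: glibc ships a GNU strerror_r() that returns char *
next to the POSIX one that returns int, so this sketch picks the right
handling with C11 _Generic. It assumes a C11 compiler and a POSIX-ish
libc.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* POSIX variant: returns 0 when the message was written to buf. */
static const char *result_xsi(int rc, char *buf, size_t size, int errnum)
{
    if (rc != 0)
        snprintf(buf, size, "Unknown error %d", errnum);
    return buf;
}

/* GNU variant: returns the message itself, which may or may not be buf. */
static const char *result_gnu(char *msg, char *buf, size_t size, int errnum)
{
    (void)buf; (void)size; (void)errnum;
    return msg;
}

/* Hypothetical thread-safe strerror(): every thread gets its own buffer,
 * so the returned string stays valid until the same thread calls this
 * function again. */
const char *thread_strerror(int errnum)
{
    static _Thread_local char buf[128];
    /* The controlling expression of _Generic is not evaluated, so
     * strerror_r() runs exactly once, in the call on the last lines. */
    return _Generic(strerror_r(errnum, buf, sizeof(buf)),
                    int: result_xsi,
                    char *: result_gnu)(strerror_r(errnum, buf, sizeof(buf)),
                                        buf, sizeof(buf), errnum);
}
```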
Anyway, back to threads. The interaction of locales and threads makes no
sense. Why would you make locales process global? Who even wanted it to
work this way? Who decided that it should keep working this way, despite
being so broken (and certainly causing implementation difficulties in
libc)? Was it just a fucked up psychopath?
Several decades later, the moronic standard committees noticed that this
was (still is) kind of a bad situation. Instead of fixing the situation,
they added more garbage on top of it. (Probably for the sake of
"compatibility"). Now there is a set of new functions, which allow you
to override the locale for the current thread. This means you can
temporarily override and restore the locale on all entrypoints of your
code (like you could with setlocale(), before threads were invented).
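A minimal sketch of that per-thread override, assuming a libc that
implements the POSIX.1-2008 functions newlocale(), uselocale() and
freelocale() (glibc does; some platforms want an extra header).
parse_double_c() is a hypothetical name.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <locale.h>
#include <stdlib.h>

/* Hypothetical helper: run strtod() with C-locale semantics by
 * overriding the locale of the calling thread only. Other threads keep
 * whatever locale they had. */
int parse_double_c(const char *s, double *out)
{
    locale_t c_loc = newlocale(LC_ALL_MASK, "C", (locale_t)0);
    if (c_loc == (locale_t)0)
        return -1;                      /* allocation or lookup failed */

    /* uselocale() returns the previous per-thread setting, which may be
     * the special LC_GLOBAL_LOCALE handle. */
    locale_t previous = uselocale(c_loc);
    char *end;
    *out = strtod(s, &end);
    uselocale(previous);                /* restore before freeing */
    freelocale(c_loc);

    return end == s ? -1 : 0;           /* -1 if nothing could be parsed */
}
```

Creating and destroying a locale object on every call is wasteful; a
real implementation would probably cache it somewhere.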
And of course not all operating systems or libcs implement this. For
example, I'm pretty sure Microsoft doesn't. (Microsoft got to fuck it up
as usual, and only provides _configthreadlocale(). This is shitfucked on
its own, because it's GLOBAL STATE to configure that GLOBAL STATE should
not be GLOBAL STATE, i.e. completely broken garbage, because it requires
agreement over all modules/libraries what behavior should be used. I
mean, sure, making setlocale() affect only the current thread would have
been the reasonable behavior. Making this behavior configurable isn't,
because you can't rely on what behavior is active.)
POSIX showed some minor decency by at least introducing some variations
of standard functions, which have a locale argument (e.g. toupper_l()).
You just pass the locale which you want to be used, and don't have to do
the set locale/call function/restore locale nonsense. But OF COURSE they
fucked this up too. In no less than 2 ways:
- There is no statically available handle for the C locale, so you have
to initialize and store it somewhere, which makes it harder to make
utility functions safe, that call locale-affected standard functions
and expect C semantics. The easy solution, using pthread_once() and a
global variable with the created locale, will not be easily accepted
by pedantic assholes, because they'll worry about allocation failure,
or leaking the locale when using this in library code (and then
unloading the library). Or you could have complicated library
init/uninit functions, which bring a big load of their own mess.
Same for automagic DLL constructors/destructors.
- Not all functions have a variant that takes a locale argument, and
they missed even some important ones, like snprintf() or strtod() WHAT
THE FUCK WHAT THE FUCK WHAT THE FUCK WHAT THE FUCK WHAT THE FUCK WHAT
THE FUCK WHAT THE FUCK WHAT THE FUCK WHAT THE FUCK
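The "easy solution" from the first point could be sketched like this:
pthread_once() plus a global locale_t, then the _l variant of the
function. toupper_c() and the other names are hypothetical, and the
locale object is deliberately never freed, which is exactly the leak the
commit message says pedants will object to.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <ctype.h>
#include <locale.h>
#include <pthread.h>

static pthread_once_t c_locale_once = PTHREAD_ONCE_INIT;
static locale_t c_locale;       /* stays (locale_t)0 if creation failed */

static void init_c_locale(void)
{
    c_locale = newlocale(LC_ALL_MASK, "C", (locale_t)0);
}

/* Hypothetical utility: toupper() with C semantics, independent of the
 * global or per-thread locale. c must be EOF or representable as
 * unsigned char, as with toupper() itself. */
int toupper_c(int c)
{
    pthread_once(&c_locale_once, init_c_locale);
    if (c_locale == (locale_t)0)        /* fall back to plain ASCII */
        return (c >= 'a' && c <= 'z') ? c - 'a' + 'A' : c;
    return toupper_l(c, c_locale);
}
```

This only helps for functions that have an _l variant in the first
place, which is the second point of the list.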
I would like to know why it took so long to standardize a half-assed
solution, that, apart from being conceptually half-assed, is even
incomplete and insufficient. The obvious way to fix this would have
been to:
- deprecate the entire locale API and their use, and make it a NOP
- make UTF-8 the standard character type
- make the C locale behavior the default
- add new APIs that explicitly take locale objects
- provide an emulation layer, that can be used to transparently build
legacy code without breaking them
But this wouldn't have been "compatible", and the apparently incompetent
standard committees would have never accepted this. As if anyone
actually used this legacy garbage, except other legacy garbage. Oh yeah,
and let's care a lot about legacy compatibility, and let's not care at
all about modern code that either has to suffer from this, or subtly
breaks when the wrong locales are active.
Last but not least, the UTF-8 locale name is apparently not even
standardized. At the moment I'm trying to use "C.UTF-8", which is
apparently glibc _and_ Debian specific. Got to use every opportunity to
make correct usage of UTF-8 harder. What luck that this commit is only
for some optional relatively obscure mpv feature.
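Since there is no portable name, one way to cope is to try a list of
candidate names and take the first one the system accepts. The sketch
below is not what the mpv commit does; open_first_locale() is a
hypothetical helper, and any candidate other than "C.UTF-8" is my own
guess at what a system might have installed.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: return a locale object whose LC_CTYPE category
 * comes from the first name in the list that newlocale() accepts, or
 * (locale_t)0 if none of them exists. Categories other than LC_CTYPE
 * come from the default POSIX locale. The caller owns the result and
 * must release it with freelocale(). */
locale_t open_first_locale(const char *const *names, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        locale_t loc = newlocale(LC_CTYPE_MASK, names[i], (locale_t)0);
        if (loc != (locale_t)0)
            return loc;
    }
    return (locale_t)0;
}
```

A caller might pass something like {"C.UTF-8", "en_US.UTF-8"} and treat
a (locale_t)0 result as "no UTF-8 locale available on this system".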
Why is the C locale not UTF-8? Why did POSIX not standardize a UTF-8
locale? Well, according to something I heard a few years ago, they're
considering disallowing UTF-8 as locale, because UTF-8 would violate
certain invariants expected by C or POSIX. (But I'm not sure if I
remember this correctly - probably better not to rage about it.)