[vlc-devel] Summary of VLC and unicode
Rémi Denis-Courmont
rdenis at simphalempin.com
Wed Nov 15 17:04:34 CET 2006
Hello everyone,
It's been over a year since libVLC was switched to UTF-8 internally, and
it's now getting kind of usable again. Still, there have been quite a few
changes since the last summary, so let's make a new one.
As a general rule, all character strings passed to LibVLC should be in
UTF-8. This includes, but is far from limited to, the playlist and the
whole "chain" from the input/access to the outputs. We pretty much needed
Unicode anyway; the only options were UTF-8 and wide characters. We picked
UTF-8 because, as disruptive as it has been, it remains much less
disruptive than wide characters. Also, wide characters mean different
things on different platforms (e.g. UCS-2LE on old Windows, UTF-16LE on
modern Windows, UTF-32 on glibc), are a pain to use in POSIX
environments (including Mac OS X and Linux, which most of the devs are
using), and might have needed lots of libc replacements for the wc*
functions on some platforms.
The original approach was to rely on ToLocale, FromLocale and
LocaleFree (and ToLocaleDup, FromLocaleDup and free) whenever a
character string in the OS native representation was needed. It turned
out to be a big mistake on Windows, which uses no fewer than three
different representations internally: UTF-16 in the kernel and file
system (Unicode/wide), ACP (ANSI Code Page), which is a locale-dependent
8-bit character set (CP1252 in the West), and even "OEM" (cp437 in the
US, cp850 in Western Europe). Unfortunately, (To|From)Locale were using
ACP, which prevents the use of Unicode code points outside the range of
the local charset, even though the file system and native Win32 support
them. This means VLC cannot "reach" some files.
So we now have a fairly comprehensive (but not yet complete) set of
higher-level wrappers that accept UTF-8 character strings to represent
file names, or text to be output to the console. At the time of writing,
trunk includes:
* utf8_open, utf8_fopen, utf8_stat, utf8_lstat, utf8_mkdir,
utf8_opendir, utf8_readdir, utf8_scandir for file system operations,
* utf8_fprintf, utf8_vfprintf for console output (or actually output to
any file that needs the local charset).
All of these functions behave like their non-"utf8_"-prefixed equivalents,
with the exception that they expect UTF-8 character strings.
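
To illustrate the idea behind these wrappers, here is a minimal sketch.
It is not the code in trunk: the name sketch_utf8_fopen is made up for
this example, and the non-Windows branch simply assumes that the file
system takes UTF-8 byte strings as-is, which is a simplification.

```c
#include <stdio.h>

#ifdef _WIN32
# include <windows.h>
# include <wchar.h>
#endif

/* Hypothetical wrapper, for illustration only (not the VLC implementation):
 * opens a file whose name is given as a UTF-8 string. */
FILE *sketch_utf8_fopen(const char *utf8_name, const char *mode)
{
#ifdef _WIN32
    wchar_t wname[MAX_PATH], wmode[8];

    /* UTF-8 -> UTF-16. A return value of 0 means the conversion failed
     * (invalid input or buffer too small). */
    if (!MultiByteToWideChar(CP_UTF8, 0, utf8_name, -1, wname, MAX_PATH)
     || !MultiByteToWideChar(CP_UTF8, 0, mode, -1, wmode, 8))
        return NULL;

    /* The wide variant is not limited by the ANSI code page. */
    return _wfopen(wname, wmode);
#else
    /* Assumption: the file system accepts UTF-8 byte strings directly. */
    return fopen(utf8_name, mode);
#endif
}
```

The caller only ever sees UTF-8; the platform-specific conversion stays
inside the wrapper, which is the whole point of the utf8_* family.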
We also have EnsureUTF8, which forces a string into UTF-8 (it replaces
invalid character sequences with '?'), and IsUTF8, which checks whether
the string is valid UTF-8 (it returns NULL if not). These are useful when
reading UTF-8 from untrusted sources, such as the network, before
injecting the data into VLC.
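
For readers who want to see what this kind of check involves, here is a
rough sketch. The names sketch_is_utf8 and sketch_ensure_utf8 are
hypothetical and the code is not taken from VLC. It replaces each
offending byte with '?', and the return value of sketch_ensure_utf8 is a
choice made for this example only.

```c
#include <stddef.h>

/* Returns the length (1..4) of the valid UTF-8 sequence starting at p, or 0
 * if the bytes at p do not form one. Rejects overlong forms, surrogates and
 * code points above U+10FFFF. Relies on NUL termination: a NUL is never a
 * continuation byte, so the loop stops before reading past the end. */
static size_t utf8_seq_len(const unsigned char *p)
{
    unsigned char c = p[0];
    unsigned long cp, min;
    size_t len, i;

    if (c < 0x80)
        return 1;
    else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; min = 0x80; }
    else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; min = 0x800; }
    else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; min = 0x10000; }
    else
        return 0; /* stray continuation byte or invalid lead byte */

    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return 0;
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    return len;
}

/* Hypothetical: returns str if it is valid UTF-8, NULL otherwise. */
const char *sketch_is_utf8(const char *str)
{
    const unsigned char *p = (const unsigned char *)str;

    while (*p != 0) {
        size_t n = utf8_seq_len(p);
        if (n == 0)
            return NULL;
        p += n;
    }
    return str;
}

/* Hypothetical: overwrites every invalid byte with '?' in place and
 * returns str (return convention chosen for this sketch). */
char *sketch_ensure_utf8(char *str)
{
    unsigned char *p = (unsigned char *)str;

    while (*p != 0) {
        size_t n = utf8_seq_len(p);
        if (n == 0) {
            *p = '?';
            n = 1;
        }
        p += n;
    }
    return str;
}
```

Since '?' is a single ASCII byte, the replacement never changes the
length of the string, so it can be done in place.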
ToLocale and FromLocale should NEVER be used for file system operations
anymore, except in code that is not aimed at Windows. And even then
it's better to use the wrappers; for instance, Mac OS X has some
conversion logic for its directory listing. When using the Win32 API
manually, always use the Unicode version (the one ending with 'W' instead
of 'A'). If you need to convert to UTF-8 for use in LibVLC, you can use
FromWide (Win32 only!). Of course, it's better to resort to the
wrappers, otherwise you always need a specific Windows implementation.
Please refer to the various pieces of code using Win32
SHGetFolderPath() for reference implementations.
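
As an illustration of what a wide-to-UTF-8 conversion does, here is a
portable sketch. It is not the FromWide code; sketch_from_utf16 is a
made-up name, and on Win32 one would normally let
WideCharToMultiByte() with CP_UTF8 do this work instead.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: converts a NUL-terminated UTF-16 string to a newly
 * allocated UTF-8 string. Returns NULL on allocation failure or on an
 * unpaired surrogate. The caller releases the result with free(). */
char *sketch_from_utf16(const uint16_t *wide)
{
    size_t n = 0, i;
    unsigned char *q;
    char *out;

    while (wide[n] != 0)
        n++;

    /* One UTF-16 unit yields at most 3 UTF-8 bytes (a surrogate pair is
     * 2 units and yields 4 bytes), so 3 * n + 1 bytes always suffice. */
    out = malloc(3 * n + 1);
    if (out == NULL)
        return NULL;

    q = (unsigned char *)out;
    for (i = 0; i < n; i++) {
        uint32_t cp = wide[i];

        if (cp >= 0xD800 && cp <= 0xDBFF) {          /* high surrogate */
            uint32_t lo = wide[i + 1];   /* at worst the terminating NUL */
            if (lo < 0xDC00 || lo > 0xDFFF) {
                free(out);
                return NULL;
            }
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
            i++;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {   /* lone low surrogate */
            free(out);
            return NULL;
        }

        if (cp < 0x80) {
            *q++ = (unsigned char)cp;
        } else if (cp < 0x800) {
            *q++ = (unsigned char)(0xC0 | (cp >> 6));
            *q++ = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            *q++ = (unsigned char)(0xE0 | (cp >> 12));
            *q++ = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            *q++ = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            *q++ = (unsigned char)(0xF0 | (cp >> 18));
            *q++ = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            *q++ = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            *q++ = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    *q = '\0';
    return out;
}
```

Going through UTF-16 and the 'W' functions keeps every file name
representable, which is exactly what the ACP-based 'A' functions cannot
guarantee.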
Now for the good (?) news: in trunk, the only parts that still use the
*Locale VLC API are:
- ncurses (if you are brave, you can try porting to wide ncurses...)
- GnomeVFS (which is not aimed at Win32 anyway)
- GnuTLS (no support from the GnuTLS library :( )
- skins2
- WxWidgets (only on non-Win32)
I have however only checked the core and plugins, so there might be
problems with browser plugins and bindings. Also, if you intend to
write new plugins, please keep this in mind.
In addition to this, I remind you that we now have the
GetFallbackEncoding() API, which returns the ANSI code page used by
Windows for the given locale, to be passed to iconv. For instance, if
your system is configured to run in German, it returns "CP1252". This
is convenient when handling subtitles or various kinds of text that are
typically aimed at Windows users.
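
As a rough sketch of how such an encoding name can be fed to iconv, here
is a small helper. The name sketch_to_utf8 is hypothetical and this is
not VLC code; the encoding string is simply passed in by the caller, for
example the "CP1252" mentioned above.

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: converts a NUL-terminated string from the given
 * legacy encoding (e.g. "CP1252") to a newly allocated UTF-8 string.
 * Returns NULL on failure. The caller releases the result with free(). */
char *sketch_to_utf8(const char *text, const char *from_encoding)
{
    iconv_t cd = iconv_open("UTF-8", from_encoding);
    if (cd == (iconv_t)-1)
        return NULL;

    size_t inleft = strlen(text);
    size_t outsize = 4 * inleft + 1;   /* generous upper bound */
    char *out = malloc(outsize);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *inp = (char *)text;   /* iconv() takes char **, input is not modified */
    char *outp = out;
    size_t outleft = outsize - 1;

    size_t ret = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (ret == (size_t)-1) {
        free(out);
        return NULL;
    }

    *outp = '\0';
    return out;
}
```

With a German subtitle line stored in CP1252, the byte 0xDF (sharp s)
comes out as the two UTF-8 bytes 0xC3 0x9F.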
--
Rémi Denis-Courmont