[vlc-devel] Summary of VLC and unicode

Rémi Denis-Courmont rdenis at simphalempin.com
Wed Nov 15 17:04:34 CET 2006


	Hello everyone,

It has been over a year since LibVLC was switched to UTF-8 internally, 
and it's now getting kind of usable again. Still, there have been quite 
a few changes since the last summary, so let's make a new one.

As a general rule, all character strings passed to LibVLC should be in 
UTF-8. This includes but is far from limited to the playlist, and all 
the "chain" from the input/access to the outputs. We pretty much needed 
Unicode anyway; the only options were UTF-8 and wide char. We picked 
UTF-8 because, as disruptive as it has been, it remains much less 
disruptive than wide characters. Also, wide characters mean different 
things on different platforms (e.g. UCS-2LE on old Windows, UTF-16LE on 
modern Windows, UTF-32 on glibc), are a pain to use in a POSIX 
environment (including Mac OS X and Linux, which most of the devs are 
using), and might have needed lots of replacements for the libc wc* 
functions on some platforms.

The original approach has been to rely on ToLocale, FromLocale and 
LocaleFree (and ToLocaleDup, FromLocaleDup and free) whenever a 
character string in the OS native representation was needed. It turned 
out to be a big mistake on Windows, which uses no less than three 
different representations internally: UTF-16 in the kernel and file 
system (Unicode/wide), ACP (ANSI Code Page), which is a locale-dependent 
8-bit character set (CP1252 in the West), and even "OEM" (cp437 in the 
US, cp850 in Western Europe). Unfortunately, (To|From)Locale were using 
ACP, which prevents usage of Unicode code points outside the range of 
the local charset, while the filesystem and the native Win32 API support 
them. This means VLC could not "reach" files whose names contain such 
characters.

So we now have a fairly comprehensive (but not yet complete) set of 
higher-level wrappers that accept UTF-8 character strings to represent 
filenames, or text to be output to the console. At the time of writing, 
trunk includes:

* utf8_open, utf8_fopen, utf8_stat, utf8_lstat, utf8_mkdir, 
utf8_opendir, utf8_readdir, utf8_scandir for file system operations,

* utf8_fprintf, utf8_vfprintf for console output (or actually output to 
any file that needs the local charset).

All of these functions behave like their non "utf8_"-prefixed 
equivalents, with the exception that they expect UTF-8 character strings.
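
To illustrate the idea behind these wrappers, here is a minimal sketch of 
what an fopen-style one could look like. It is not the code in trunk: the 
name sketch_utf8_fopen and the fixed buffer sizes are made up for this 
example, and the non-Windows branch simply assumes that the local charset 
is already UTF-8 (a real wrapper would convert to the locale charset 
there).

```c
#include <stdio.h>

#ifdef _WIN32
# include <windows.h>
# include <wchar.h>
#endif

/* Hypothetical helper for illustration, NOT VLC's actual utf8_fopen. */
FILE *sketch_utf8_fopen(const char *path_utf8, const char *mode)
{
#ifdef _WIN32
    wchar_t wpath[MAX_PATH];
    wchar_t wmode[8];

    /* Convert UTF-8 to UTF-16 and use the wide ("W"-style) call, so that
     * the ANSI code page never gets involved. MultiByteToWideChar
     * returns 0 on failure (invalid input or buffer too small). */
    if (!MultiByteToWideChar(CP_UTF8, 0, path_utf8, -1, wpath, MAX_PATH)
     || !MultiByteToWideChar(CP_UTF8, 0, mode, -1, wmode, 8))
        return NULL;
    return _wfopen(wpath, wmode);
#else
    /* Assumption: the local charset is UTF-8, so no conversion is done. */
    return fopen(path_utf8, mode);
#endif
}
```

The caller only ever sees UTF-8; the platform-specific part stays inside 
the wrapper.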

We also have EnsureUTF8, which forces a string into UTF-8 (replacing 
invalid character sequences with '?'), and IsUTF8, which checks whether 
the string is valid UTF-8 (it returns NULL if not). These are useful when 
reading UTF-8 from untrusted sources, such as the network, before 
injecting the data into VLC.
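
For those who wonder what such checks involve, here is a rough sketch. 
The function names are hypothetical stand-ins and the real EnsureUTF8 and 
IsUTF8 may well differ in details; in particular, this version replaces 
every offending byte with its own '?'.

```c
#include <stddef.h>

/* Returns the length (1 to 4) of the valid UTF-8 sequence starting at s,
 * or 0 if the bytes at s do not form one. s must not point at the NUL. */
static size_t utf8_seq_len(const unsigned char *s)
{
    unsigned char c = s[0];
    unsigned long cp, min;
    size_t len, i;

    if (c < 0x80)
        return 1;
    else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; min = 0x80; }
    else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; min = 0x800; }
    else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; min = 0x10000; }
    else
        return 0; /* stray continuation byte or invalid lead byte */

    for (i = 1; i < len; i++) {
        /* Also stops at the terminating NUL (truncated sequence). */
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    /* Reject overlong forms, UTF-16 surrogates and values > U+10FFFF. */
    if (cp < min || (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
        return 0;
    return len;
}

/* Hypothetical stand-in for EnsureUTF8: fixes the string in place. */
char *sketch_ensure_utf8(char *str)
{
    unsigned char *p = (unsigned char *)str;

    while (*p != '\0') {
        size_t n = utf8_seq_len(p);
        if (n == 0) {
            *p = '?';
            n = 1;
        }
        p += n;
    }
    return str;
}

/* Hypothetical stand-in for IsUTF8: returns str if valid, else NULL. */
const char *sketch_is_utf8(const char *str)
{
    const unsigned char *p = (const unsigned char *)str;

    while (*p != '\0') {
        size_t n = utf8_seq_len(p);
        if (n == 0)
            return NULL;
        p += n;
    }
    return str;
}
```

A typical use would be to run the "ensure" variant on anything received 
from the network (stream titles, metadata and so on) before handing it to 
the playlist.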


ToLocale and FromLocale should NEVER be used for filesystem operations 
anymore, except in code that is not aimed at Windows. And even then 
it's better to use the wrappers; for instance, Mac OS X has some 
conversion logic for its directory listing. When using the Win32 API 
manually, always use the Unicode version (the one ending with 'W' 
instead of 'A'). If you need to convert the result to UTF-8 for use in 
LibVLC, you can use FromWide (Win32 only!). Of course, it's better to 
resort to the wrappers, otherwise you always need a specific Windows 
implementation. Please refer to the various pieces of code using Win32 
SHGetFolderPath() for reference implementations.
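
As a rough illustration of the conversion a FromWide-style helper has to 
perform, here is a portable sketch that turns a UTF-16 string (what 
wchar_t holds on Win32) into UTF-8. The name sketch_utf16_to_utf8 is made 
up and this is not the LibVLC implementation; on Win32 itself the usual 
route is WideCharToMultiByte with CP_UTF8. Replacing unpaired surrogates 
with '?' is a choice made for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: converts a NUL-terminated UTF-16 string into a
 * newly allocated UTF-8 string (release it with free()).
 * Returns NULL if the allocation fails. */
char *sketch_utf16_to_utf8(const uint16_t *in)
{
    size_t n = 0, i;
    unsigned char *p;
    char *out;

    while (in[n] != 0)
        n++;

    /* One UTF-16 unit needs at most 3 UTF-8 bytes; a surrogate pair
     * (2 units) needs 4, so 3 bytes per unit is always enough. */
    out = malloc(3 * n + 1);
    if (out == NULL)
        return NULL;
    p = (unsigned char *)out;

    for (i = 0; i < n; i++) {
        uint32_t cp = in[i];

        if (cp >= 0xD800 && cp <= 0xDBFF
         && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            /* High + low surrogate: combine into one code point.
             * in[i + 1] is at worst the terminating 0, so this is safe. */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            i++;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = '?'; /* unpaired surrogate */
        }

        if (cp < 0x80) {
            *p++ = (unsigned char)cp;
        } else if (cp < 0x800) {
            *p++ = (unsigned char)(0xC0 | (cp >> 6));
            *p++ = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            *p++ = (unsigned char)(0xE0 | (cp >> 12));
            *p++ = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            *p++ = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            *p++ = (unsigned char)(0xF0 | (cp >> 18));
            *p++ = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            *p++ = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            *p++ = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    *p = '\0';
    return out;
}
```

The point is that nothing is lost on the way: every file name the 'W' 
functions can return has a UTF-8 representation, which is not true of 
the ANSI code page.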

Now for the good (?) news: in trunk, the only parts that still use the 
*Locale VLC API are:
- ncurses (if you are brave, you can try porting to wide ncurses...)
- GnomeVFS (which is not aimed at Win32 anyway)
- GnuTLS (no support from the GnuTLS library :( )
- skins2
- wxWidgets (only on non-Win32)

I have however only checked the core and plugins, so there might be 
problems with browser plugins and bindings. Also, if you intend to 
write new plugins, please keep this in mind.


In addition to this, I remind you that we now have the 
GetFallbackEncoding() API that returns the ANSI code page used by 
Windows for the given locale, to be passed to iconv. For instance, if 
your system is configured to run in German, it returns "CP1252". This 
is convenient when handling subtitles or various kinds of text that are 
typically aimed at Windows users.
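
To give an idea of what such a lookup amounts to, here is a small sketch. 
The function name, its signature (taking a language code such as "de" or 
"ru_RU") and the partial table are assumptions made for this example; 
GetFallbackEncoding() itself may be implemented quite differently.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: maps a language code to the ANSI code page that
 * Windows typically uses for it. Only a handful of languages are listed;
 * everything else falls back to CP1252. */
const char *sketch_fallback_encoding(const char *lang)
{
    static const struct {
        const char *lang;
        const char *codepage;
    } table[] = {
        { "cs", "CP1250" }, { "hu", "CP1250" }, { "pl", "CP1250" },
        { "bg", "CP1251" }, { "ru", "CP1251" }, { "uk", "CP1251" },
        { "el", "CP1253" },
        { "tr", "CP1254" },
        { "he", "CP1255" },
        { "ar", "CP1256" },
        { "ja", "CP932" },
        { "ko", "CP949" },
    };
    size_t i;

    /* Only the two-letter language part is compared ("ru_RU" -> "ru"). */
    for (i = 0; i < sizeof (table) / sizeof (table[0]); i++)
        if (strncmp(lang, table[i].lang, 2) == 0)
            return table[i].codepage;

    return "CP1252"; /* Western European default */
}
```

The returned name would then be used as the source charset for iconv, 
i.e. as the second argument of iconv_open, with "UTF-8" as the target.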

-- 
Rémi Denis-Courmont