[vlc-devel] Re: Summary of VLC and unicode
Jean-Paul Saman
jean-paul.saman at planet.nl
Thu Nov 16 22:15:59 CET 2006
Rémi Denis-Courmont wrote:
> Hello everyone,
>
> It's been over a year since libVLC was switched to UTF-8 internally, and
> it's now getting kind of usable again. Still, there have been quite a few
> changes since the last summary, so let's make a new one.
May I suggest also putting this e-mail on the developers website? It is a
good overview of the functions and purpose of this Unicode API.
Gtz,
Jean-Paul Saman.
>
> As a general rule, all character strings passed to LibVLC should be in
> UTF-8. This includes but is far from limited to the playlist, and all
> the "chain" from the input/access to the outputs. We pretty much needed
> Unicode anyway; the only options were UTF-8 and wide char. We picked
> UTF-8 because, as disruptive as it has been, it remains much less
> disruptive than wide characters. Also, wide characters mean different
> things on different platforms (e.g. UCS-2LE on old Windows, UTF-16LE on
> modern Windows, UTF-32 on glibc), are a pain to use in POSIX
> environments (including Mac OS X and Linux, which most of the devs are
> using), and might have needed lots of replacements for libc wc*
> functions on some platforms.
>
> The original approach has been to rely on ToLocale, FromLocale and
> LocaleFree (and ToLocaleDup, FromLocaleDup and free) whenever a
> character string in the OS native representation was needed. It turned
> out to be a big mistake on Windows, which uses no less than three
> different representations internally: UTF-16 in the kernel and file
> system (Unicode/wide), ACP (ANSI Code Page) which is a locale-dependent
> 8-bit character set (CP1252 in the West), and even "OEM" (cp437 in the
> US, cp850 in Western Europe). Unfortunately, (To|From)Locale were using
> ACP, which prevents usage of Unicode code points outside the range of
> the local charset, even though the filesystem and native Win32 API
> support them. This means VLC cannot "reach" some files.
>
> So we now have a comprehensive (but not yet complete) set of higher-level
> wrappers that accept UTF-8 character strings to represent filenames, or
> text to be output to the console. At the time of writing, trunk
> includes:
>
> * utf8_open, utf8_fopen, utf8_stat, utf8_lstat, utf8_mkdir,
> utf8_opendir, utf8_readdir, utf8_scandir for file system operations,
>
> * utf8_fprintf, utf8_vfprintf for console output (or actually output to
> any file that needs the local charset).
>
> All of these functions behave like their non-"utf8_"-prefixed equivalents,
> with the exception that they expect UTF-8 character strings.
>
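A minimal sketch of what such a wrapper could look like for fopen. This is
not the code in trunk: the names sketch_utf8_fopen and sketch_to_wide are
made up for illustration, and the non-Windows branch simply assumes the
locale already uses UTF-8 for filenames instead of converting.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#ifdef _WIN32
# include <windows.h>

/* Convert a NUL-terminated UTF-8 string to a newly allocated UTF-16 string.
 * Returns NULL on invalid input or allocation failure; caller frees. */
static wchar_t *sketch_to_wide(const char *utf8)
{
    int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                  utf8, -1, NULL, 0);
    if (len <= 0)
        return NULL;

    wchar_t *wide = malloc((size_t)len * sizeof (wchar_t));
    if (wide != NULL
     && MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            utf8, -1, wide, len) == 0)
    {
        free(wide);
        wide = NULL;
    }
    return wide;
}
#endif

/* Hypothetical stand-in for a utf8_fopen-style wrapper: same contract as
 * fopen(), except that the path is always expected to be UTF-8. */
FILE *sketch_utf8_fopen(const char *utf8_path, const char *mode)
{
    assert(utf8_path != NULL && mode != NULL);
#ifdef _WIN32
    /* Go through the wide ("W") flavour so that the ANSI code page is
     * never involved. */
    wchar_t *wpath = sketch_to_wide(utf8_path);
    wchar_t *wmode = sketch_to_wide(mode);
    FILE *stream = NULL;

    if (wpath != NULL && wmode != NULL)
        stream = _wfopen(wpath, wmode);
    free(wpath);
    free(wmode);
    return stream;
#else
    /* Assumption: the filename encoding of the locale is UTF-8. A real
     * wrapper would convert to the locale charset here. */
    return fopen(utf8_path, mode);
#endif
}
```

Calling sketch_utf8_fopen("caf\xc3\xa9.txt", "r") then reads the same from
the caller's side on every platform, which is the whole point of the
wrappers.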
> We also have EnsureUTF8, which forces a string into UTF-8 (replacing
> invalid character sequences with '?'), and IsUTF8, which checks whether
> the string is valid UTF-8 (returns NULL if not). These are useful when
> reading UTF-8 from untrusted sources, such as the network, before
> injecting the data into VLC.
>
>
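To make the validation idea concrete, here is a rough sketch of a checker
and an in-place fixer. The names sketch_is_utf8 and sketch_ensure_utf8 are
hypothetical, and so are the details: that the checker returns its argument
on success, and that the fixer replaces one offending byte at a time. The
real EnsureUTF8 and IsUTF8 may differ on both points.

```c
#include <assert.h>
#include <stddef.h>

/* Length (1 to 4) of the valid UTF-8 sequence starting at s, or 0 if the
 * bytes at s do not form one. Never reads past the terminating NUL. */
static size_t sketch_utf8_seq_len(const unsigned char *s)
{
    unsigned char c = s[0];
    unsigned long cp, min;
    size_t len;

    if (c < 0x80)
        return 1;
    else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; min = 0x80; }
    else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; min = 0x800; }
    else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; min = 0x10000; }
    else
        return 0; /* stray continuation byte or invalid lead byte */

    for (size_t i = 1; i < len; i++)
    {
        if ((s[i] & 0xC0) != 0x80) /* also stops at the NUL terminator */
            return 0;
        cp = (cp << 6) | (s[i] & 0x3F);
    }

    /* Reject overlong forms, UTF-16 surrogates and out-of-range values. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    return len;
}

/* Returns str if it is valid UTF-8, NULL otherwise. */
const char *sketch_is_utf8(const char *str)
{
    assert(str != NULL);
    for (const unsigned char *p = (const unsigned char *)str; *p != '\0'; )
    {
        size_t n = sketch_utf8_seq_len(p);
        if (n == 0)
            return NULL;
        p += n;
    }
    return str;
}

/* Overwrites every byte that does not start a valid sequence with '?'. */
char *sketch_ensure_utf8(char *str)
{
    assert(str != NULL);
    for (unsigned char *p = (unsigned char *)str; *p != '\0'; )
    {
        size_t n = sketch_utf8_seq_len(p);
        if (n == 0)
        {
            *p = '?';
            n = 1;
        }
        p += n;
    }
    return str;
}
```

A string received from the network would go through sketch_ensure_utf8()
once, right at the boundary, so that everything behind it can rely on
valid UTF-8.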
> ToLocale and FromLocale should NEVER be used for filesystem operations
> anymore, except in code that is not aimed at Windows. And even then
> it's better to use the wrappers; for instance, Mac OS X has some
> conversion logic for its directory listing. When using the Win32 API
> manually, always use the Unicode version (the one ending with 'W' instead
> of 'A'). If you need to convert to UTF-8 for use in LibVLC, you can use
> FromWide (Win32 only!). Of course, it's better to resort to the
> wrappers; otherwise you always need a Windows-specific implementation.
> Please refer to the various pieces of code using Win32
> SHGetFolderPath() for reference implementations.
>
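For those wondering what a FromWide-style conversion involves, a portable
sketch follows. It is only an illustration: sketch_from_utf16 is a made-up
name, it takes uint16_t code units instead of wchar_t so that it also builds
outside Windows, and mapping unpaired surrogates to '?' is my own choice,
not necessarily what FromWide does.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Converts a NUL-terminated UTF-16 string into a newly allocated UTF-8
 * string. Returns NULL on allocation failure; caller frees. */
char *sketch_from_utf16(const uint16_t *wide)
{
    assert(wide != NULL);

    size_t n = 0;
    while (wide[n] != 0)
        n++;

    /* One UTF-16 code unit never needs more than 3 UTF-8 bytes (a
     * surrogate pair is 2 units for 4 bytes). */
    char *out = malloc(3 * n + 1);
    if (out == NULL)
        return NULL;

    char *p = out;
    for (size_t i = 0; i < n; i++)
    {
        uint32_t cp = wide[i];

        if (cp >= 0xD800 && cp <= 0xDBFF
         && wide[i + 1] >= 0xDC00 && wide[i + 1] <= 0xDFFF)
        {   /* valid surrogate pair: combine into one code point */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (wide[i + 1] - 0xDC00);
            i++;
        }
        else if (cp >= 0xD800 && cp <= 0xDFFF)
            cp = '?'; /* unpaired surrogate */

        if (cp < 0x80)
            *p++ = (char)cp;
        else if (cp < 0x800)
        {
            *p++ = (char)(0xC0 | (cp >> 6));
            *p++ = (char)(0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000)
        {
            *p++ = (char)(0xE0 | (cp >> 12));
            *p++ = (char)(0x80 | ((cp >> 6) & 0x3F));
            *p++ = (char)(0x80 | (cp & 0x3F));
        }
        else
        {
            *p++ = (char)(0xF0 | (cp >> 18));
            *p++ = (char)(0x80 | ((cp >> 12) & 0x3F));
            *p++ = (char)(0x80 | ((cp >> 6) & 0x3F));
            *p++ = (char)(0x80 | (cp & 0x3F));
        }
    }
    *p = '\0';
    return out;
}
```

On Win32 the input would typically be the buffer filled in by a 'W'
function, and the result is what gets handed to LibVLC.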
> Now for the good (?) news: in trunk, the only parts that still use the
> *Locale VLC API are:
> - ncurses (if you are brave, you can try porting to wide ncurses...)
> - GnomeVFS (which is not aimed at Win32 anyway)
> - GnuTLS (no support from the GnuTLS library :( )
> - skins2
> - WxWidgets (only on non-Win32)
>
> I have however only checked the core and plugins, so there might be
> problems with browser plugins and bindings. Also, if you intend to
> write new plugins, please keep this in mind.
>
>
> In addition to this, I remind you that we now have the
> GetFallbackEncoding() API that returns the ANSI code page used by
> Windows for the given locale, to be passed to iconv. For instance, if
> your system is configured to run in German, it returns "CP1252". This
> is convenient when handling subtitles or various kinds of text that are
> typically aimed at Windows users.
>
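As an illustration of the iconv step, a small sketch that converts legacy
text to UTF-8. The helper name sketch_to_utf8 is hypothetical, and the
encoding name is passed in by the caller; in VLC it would be the string
returned by GetFallbackEncoding(), for which "CP1252" stands in below.

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Converts NUL-terminated text from the given encoding into a newly
 * allocated UTF-8 string. Returns NULL on failure; caller frees. */
char *sketch_to_utf8(const char *text, const char *from_encoding)
{
    assert(text != NULL && from_encoding != NULL);

    iconv_t cd = iconv_open("UTF-8", from_encoding);
    if (cd == (iconv_t)-1)
        return NULL; /* encoding not known to this iconv */

    size_t inleft = strlen(text);
    /* Generous guess at the output size; if it is ever too small, iconv
     * fails with E2BIG and we return NULL instead of overflowing. */
    size_t outleft = 4 * inleft;
    char *out = malloc(outleft + 1);
    if (out == NULL)
    {
        iconv_close(cd);
        return NULL;
    }

    char *inptr = (char *)text; /* iconv wants char **, input is not modified */
    char *outptr = out;
    size_t rc = iconv(cd, &inptr, &inleft, &outptr, &outleft);
    iconv_close(cd);

    if (rc == (size_t)-1)
    {   /* invalid or unconvertible input */
        free(out);
        return NULL;
    }
    *outptr = '\0';
    return out;
}
```

A subtitle line read from a file without any charset information would be
passed as text, with the fallback encoding as from_encoding.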
--
This is the vlc-devel mailing-list, see http://www.videolan.org/vlc/
To unsubscribe, please read http://developers.videolan.org/lists.html