[vlc-devel] Re: Non-western character encoding

Måns Rullgård mru at inprovide.com
Sun Mar 12 17:54:03 CET 2006


Rémi Denis-Courmont <rem at videolan.org> writes:

> On Sunday 12 March 2006 15:49, Måns Rullgård wrote:
>> How is "the local character encoding" determined?
>
> It comes from LC_ALL, LC_CTYPE or LANG. The mapping is 
> in /usr/share/i18n/SUPPORTED.

No such file on my system.

>> If LC_ALL, LC_CTYPE or LANG (checked in that order) specifies an
>> encoding, that should be used.  If none is specified, the best that
>> can be done is to choose a default for each locale.  The user should
>> always have an option to override the default should s/he wish to.
>
> I have to disagree here. I don't believe Japanese subtitles 
> automagically change from Shift-JIS to EUC-JP as they are downloaded on 
> a Linux system.

No, but the user might convert them manually.  It only takes a few
seconds.

> Japanese Windows users use the CP932 variant of Shift-JIS, so Japanese
> subtitles are in Shift-JIS/CP932. There is no point in trying to
> decode these as EUC-JP, even if that is the encoding for the ja_JP C
> library locale on Linux.
>
> And I *know* that French subtitles don't automagically get converted 
> from Latin-1/CP1252 to UTF-8 just because my system's LC_CTYPE is 
> fr_FR.UTF-8 instead of fr_FR.

I see what you are getting at.  The thing is, with no other indication
of encoding (e.g. specified in an HTTP header) the best guess is still
to use the locale settings.  I usually convert any files I intend to
access often to UTF-8, unless they have some built-in means of
indicating what encoding they use.  It generally simplifies things
having all files in the same encoding.
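
To make the locale-based guess concrete, here is a minimal sketch in
Python of the lookup order described above. The function name
locale_codeset is made up for illustration, and the parsing assumes the
usual language[_territory][.codeset][@modifier] form of locale names.
It is not what VLC or the C library actually does, just the idea.

```python
def locale_codeset(env):
    """Return the codeset named by LC_ALL, LC_CTYPE or LANG, or None.

    `env` is any mapping of environment variables, e.g. os.environ.
    Hypothetical helper for illustration only.
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if not value:
            continue  # unset or empty: try the next variable
        # Strip an optional @modifier, then look for ".codeset".
        base = value.split("@", 1)[0]
        if "." in base:
            return base.split(".", 1)[1]
        # The first variable that is set decides the locale; if it names
        # no codeset, the caller has to pick a per-locale default.
        return None
    return None
```

Called as locale_codeset(os.environ), this returns None when the locale
names no encoding, which is exactly the case where a per-locale default
(or something smarter) is needed.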

> What I do believe is that we get a much bigger rate of matching encoding 
> by looking at the local system language (i.e. the first part of LC_ALL 
> or LANG), rather than by using the local system charset. In fact, it 
> makes almost all subtitles work, while they would otherwise almost 
> always fail. If you don't believe me, just try to use subtitles from 
> some western language that has lots of accents (French, German, 
> Swedish...) with a pre-[14724] VLC on a Linux system using a UTF-8 (as 
> in LANG=??_??.UTF-8) locale variant for said language. And feel the 
> pain.
>
> The current approach is just utterly broken (except on Windows). The 
> proposed approach brings VLC subtitle decoding on Linux & company to 
> the same, much higher, “success” rate of VLC on Windows.

The problem is that there is little correlation between the encoding
of the files and the system locale setting.  Your method will still
fail if my locale is en_US (or sv_SE or de_DE) and I try to watch a
movie with sjis subtitles (not that I'm very likely to do that).

A better idea might be to guess the encoding based on the language of
the subtitles if this is known.

And an override option should be present, whatever other methods are
used.
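
Roughly, the precedence I have in mind could be sketched like this in
Python. Everything here is an assumption for illustration: the function
name, the parameter names, the final "utf-8" fallback, and the tiny
table, whose entries are only the encodings mentioned in this thread
(CP932 for Japanese, CP1252 for western languages).

```python
# Illustrative defaults only; a real table would need many more entries.
LANGUAGE_DEFAULTS = {
    "ja": "cp932",   # Japanese subtitles mostly come from Windows users
    "fr": "cp1252",
    "de": "cp1252",
    "sv": "cp1252",
}


def pick_subtitle_encoding(override=None, subtitle_language=None,
                           locale_encoding=None):
    """Choose an encoding: user override, then language, then locale."""
    if override:
        return override  # the user always wins
    if subtitle_language:
        # Accept "fr", "fr_FR" or "fr-FR" and keep only the language part.
        lang = subtitle_language.replace("-", "_").split("_", 1)[0].lower()
        if lang in LANGUAGE_DEFAULTS:
            return LANGUAGE_DEFAULTS[lang]
    # Nothing better known: fall back to the locale, then to UTF-8
    # (the last fallback is an arbitrary choice for this sketch).
    return locale_encoding or "utf-8"
```

The point is that the language of the subtitles, not of the system,
drives the guess, so sjis subtitles still work under an en_US locale
as long as the subtitle language is known.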

> And then, we might also consider UTF-8 autodetection (à la 
> irssi>=0.8.10), though I have yet to find any UTF-8 subtitle file.

Why don't you just create one with iconv?
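
For testing, the same conversion and a simple autodetection can be
sketched in a few lines of Python. This is only one plausible way to
detect UTF-8 (try a strict decode, fall back to a legacy encoding); I
am not claiming it is how irssi does it. The function names and the
cp1252 fallback are assumptions for the example.

```python
def looks_like_utf8(data):
    """Return True if the bytes decode cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def decode_subtitles(data, legacy_encoding="cp1252"):
    """Decode as UTF-8 when valid, otherwise as the legacy encoding."""
    if looks_like_utf8(data):
        return data.decode("utf-8")
    # Undecodable bytes are replaced rather than raising an error.
    return data.decode(legacy_encoding, errors="replace")


# Build a Latin-1 sample and transcode it, which is the same job
# iconv does when converting from Latin-1 to UTF-8.
text = "Déjà vu, ça va être génial"
latin1_bytes = text.encode("latin-1")
utf8_bytes = latin1_bytes.decode("latin-1").encode("utf-8")
```

Accented Latin-1 text almost never happens to be valid UTF-8, which is
why the strict decode works as a detector. Pure ASCII input passes the
test too, which is harmless since it decodes the same either way.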

-- 
Måns Rullgård
mru at inprovide.com

-- 
This is the vlc-devel mailing-list, see http://www.videolan.org/vlc/
To unsubscribe, please read http://developers.videolan.org/lists.html


