[vlc-devel] Re: Non-western character encoding

Måns Rullgård mru at inprovide.com
Sun Mar 12 21:38:46 CET 2006


Rémi Denis-Courmont <rem at videolan.org> writes:

>> On Sunday 12 March 2006 at 17:54, Måns Rullgård wrote:
>> Rémi Denis-Courmont <rem at videolan.org> writes:
>> > It comes from LC_ALL, LC_CTYPE or LANG. The mapping is
>> > in /usr/share/i18n/SUPPORTED.
>>
>> No such file on my system.
>
> Debian specific, I believe.

Could be.  I'm not running debian.

>> I see what you are getting at.  The thing is, with no other
>> indication of encoding (e.g. specified in an HTTP header) the best
>> guess is still to use the locale settings.  I usually convert any
>> files I intend to access often to utf-8, unless they have some
>> builtin means of indicating what encoding they use.  It generally
>> simplifies things having all files in the same encoding.
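
For what it's worth, the charset a locale implies can be read directly
with nl_langinfo(CODESET), no Debian-specific table needed.  Rough,
untested sketch:

#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

int main(void)
{
    /* Honour LC_ALL / LC_CTYPE / LANG from the environment */
    setlocale(LC_CTYPE, "");

    /* e.g. "UTF-8", "ISO-8859-1", "EUC-JP", ... */
    printf("locale charset: %s\n", nl_langinfo(CODESET));
    return 0;
}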
>
> Does it make sense to try to use the file as UTF-8 when iconv tells
> you it is not possible?

Of course not.
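
For that matter, iconv itself is enough to answer the question: run an
identity conversion over the buffer and see whether it reports an
illegal sequence.  Rough, untested sketch (is_valid_utf8 is just a name
I made up):

#include <errno.h>
#include <iconv.h>
#include <stdbool.h>
#include <stddef.h>

/* Made-up helper: true if buf is well-formed UTF-8, determined by
 * asking iconv for a UTF-8 -> UTF-8 conversion and checking whether
 * it reports an illegal sequence. */
static bool is_valid_utf8(const char *buf, size_t len)
{
    iconv_t cd = iconv_open("UTF-8", "UTF-8");
    if (cd == (iconv_t)-1)
        return false;

    char *in = (char *)buf;
    bool ok = true;

    while (len > 0)
    {
        char out[4096];
        char *outp = out;
        size_t outleft = sizeof(out);

        if (iconv(cd, &in, &len, &outp, &outleft) == (size_t)-1)
        {
            if (errno == E2BIG)
                continue;       /* output buffer full, just go around */
            ok = false;         /* EILSEQ or EINVAL: not valid UTF-8 */
            break;
        }
    }
    iconv_close(cd);
    return ok;
}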

> Does it make sense to force the user into using the advanced file
> opening dialog, and go through the advanced subtitles setting to
> define the subtitle encoding manually, when we could simply try
> UTF-8 and fall back to CP1252, given we know his/her locale is one
> of a pretty finite list of Western languages?

Now you're talking about trying several encodings until we find one
that seems to work.  Including all common encodings used with the
language of the current locale makes sense.
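
Something like a per-language table, keyed on the language part of the
locale, would do; each candidate would then be checked with iconv as in
the sketch above.  Rough sketch; candidate_encodings is made up and the
table entries are only examples:

#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table of the encodings commonly used for a language,
 * in the order we would try them.  The entries are examples only. */
struct lang_encodings
{
    const char *lang;           /* ISO 639 language code, NULL = default */
    const char *encodings[4];   /* candidates, NULL-terminated */
};

static const struct lang_encodings fallbacks[] =
{
    { "fr", { "UTF-8", "CP1252",    NULL } },
    { "de", { "UTF-8", "CP1252",    NULL } },
    { "ru", { "UTF-8", "CP1251",    "KOI8-R", NULL } },
    { "ja", { "UTF-8", "SHIFT-JIS", "EUC-JP", NULL } },
    { NULL, { "UTF-8", "CP1252",    NULL } },   /* default */
};

/* Return the candidate list for the current locale's language
 * (assumes setlocale(LC_CTYPE, "") has been called already). */
static const char *const *candidate_encodings(void)
{
    const char *loc = setlocale(LC_CTYPE, NULL);  /* e.g. "ru_RU.KOI8-R" */
    size_t i;

    for (i = 0; fallbacks[i].lang != NULL; i++)
        if (loc != NULL && strncmp(loc, fallbacks[i].lang, 2) == 0)
            break;
    return fallbacks[i].encodings;
}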

> The average user expects his/her subtitles to work, provided they are
> in his/her language and the file only differs from the video by its
> extension. That's how it works on Windows!

The average user often expects things that are next to impossible.
Also keep in mind that the average user uses Windows with a latin1
locale.  Pretending everything is latin1 is likely to work in the
majority of cases.  I was, however, under the impression that we
wanted to do things properly.

>> The problem is that there is little correlation between the encoding
>> of the files and the system locale setting. 
>
> This is untrue. Most if not all text files will be encoded either
> according to the locale setting or to the Windows ACP for their
> language (or to whatever standard said Windows ACP is derived from
> and is compatible with).

Every language seems to have at least two, often more, frequently used
encodings.  The proper solution is probably to detect which of these
is used in each case.  The encodings for western European languages
are usually easy to distinguish.  The situation is worse for Asian
scripts where interpreting a file using the wrong encoding will often
give valid, but meaningless, characters.
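
Telling UTF-8 apart from latin1/CP1252, for instance, only takes a
byte-level check; any non-ASCII text that passes it is almost certainly
UTF-8.  Rough, untested sketch (looks_like_utf8 is a made-up helper):

#include <stdbool.h>
#include <stddef.h>

/* Rough check: does the buffer consist only of well-formed UTF-8
 * sequences?  Overlong forms and surrogates are not rejected, which is
 * good enough for telling UTF-8 apart from latin1/CP1252. */
static bool looks_like_utf8(const unsigned char *p, size_t len)
{
    while (len > 0)
    {
        size_t n;

        if      (p[0] < 0x80)           n = 1;  /* plain ASCII */
        else if ((p[0] & 0xE0) == 0xC0) n = 2;  /* 2-byte sequence */
        else if ((p[0] & 0xF0) == 0xE0) n = 3;  /* 3-byte sequence */
        else if ((p[0] & 0xF8) == 0xF0) n = 4;  /* 4-byte sequence */
        else return false;              /* stray continuation byte */

        if (n > len)
            return false;               /* truncated sequence at the end */
        for (size_t i = 1; i < n; i++)
            if ((p[i] & 0xC0) != 0x80)  /* must be 10xxxxxx */
                return false;

        p   += n;
        len -= n;
    }
    return true;
}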

>> Your method will still fail if my locale is en_US (or sv_SE or de_DE)
>> and I try to watch a movie with sjis subtitles (not that I'm very
>> likely to do that).
>
> In that rare particular case, you can go through the complicated 
> encoding setting. Is that a reason for forcing you to do it in the most 
> common case, though?

Would moving the encoding selection to the simple file open dialog be
viable?

>> A better idea might be to guess the encoding based on the language of
>> the subtitles if this is known.
>
> Provided we have an AI for language recognition...

OK, so the file doesn't specify it.

>> And an override option should be present, whatever other methods are
>> used.
>
> An override option *is* present. Did I ever say I wanted to remove it? 

No.

> I'm only arguing we should have more clever defaults.

Good.

-- 
Måns Rullgård
mru at inprovide.com
