[vlc-devel] [PATCH 1/2] lua: add a guess_encoding function to convert Latin1 to UTF8

Rémi Denis-Courmont remi at remlab.net
Tue Aug 27 19:39:00 CEST 2013


On Tuesday 27 August 2013 19:23:22 Edward Wang wrote:
> On Tue, Aug 27, 2013 at 7:20 PM, Rémi Denis-Courmont <remi at remlab.net> wrote:
> > That is falling back to Latin-1 encoding (FromLatin1()) from UTF-8
> > encoding (strdup()) if the sequence passes UTF-8 validation (IsUTF8()).
> > 
> > (Note that any byte sequence is valid Latin-1.)
> 
> IsUTF8() checks for UTF8 validation, correct?

IsUTF8() checks whether the byte sequence can be decoded without error to a 
sequence of Unicode code points using the rules for UTF-8 decoding.

> Therefore, if IsUTF8() says that the string is valid UTF-8, we should
> be able to use it as valid UTF8.

A byte sequence that passes UTF-8 validation will also pass validation as any 
ISO 8859-x encoding and a number of others, even though they may decode to 
different Unicode character sequences. Certain valid UTF-8 sequences are also 
valid UTF-16 sequences and vice versa. And so on.

Basically, there can be false positives.
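
To illustrate, here is a minimal Python sketch (not VLC code; Python's 
built-in codecs stand in for the C helpers). The same bytes validate under 
more than one encoding but decode to different code points:

```python
# Illustration only: Python's built-in codecs, not VLC's IsUTF8()/FromLatin1().
data = b"\xc3\xa9"  # the UTF-8 encoding of U+00E9 (e with acute accent)

as_utf8 = data.decode("utf-8")      # one code point: U+00E9
as_latin1 = data.decode("latin-1")  # two code points: U+00C3, U+00A9

# Plain ASCII is valid UTF-8, and an even number of such bytes is also
# valid UTF-16: the pair 0x61 0x62 reads as the single code unit U+6261.
as_utf16 = b"ab".decode("utf-16-le")

print([hex(ord(c)) for c in as_utf8])
print([hex(ord(c)) for c in as_latin1])
print([hex(ord(c)) for c in as_utf16])
```

None of the three decodes raises an error, so validation alone cannot tell 
which interpretation was intended.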

> If not, then I think that there is a bug in IsUTF8().

There is no way to unambiguously determine if a valid UTF-8 byte sequence 
should be decoded as UTF-8 or Latin-1. We assume UTF-8 because it is 
statistically much more likely owing to the design of the UTF-8 coding.
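
The reason is that every non-ASCII character in UTF-8 is a lead byte followed 
by one or more continuation bytes in the range 0x80-0xBF, a pattern that 
ordinary Latin-1 text rarely produces by accident. A rough Python sketch of 
the heuristic follows; guess_decode() is a hypothetical name for 
illustration, not the function from the patch:

```python
def guess_decode(raw: bytes) -> str:
    """Hypothetical helper: prefer UTF-8, fall back to Latin-1."""
    try:
        # Strict decoding raises on any malformed UTF-8 sequence.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 never fails: each byte maps to exactly one code point.
        return raw.decode("latin-1")


# 0xE9 with no continuation byte is invalid UTF-8, so Latin-1 is used.
from_latin1 = guess_decode(b"caf\xe9")
# 0xC3 0xA9 is a well-formed two-byte UTF-8 sequence, so UTF-8 is used.
from_utf8 = guess_decode(b"caf\xc3\xa9")
```

Both calls yield the same four-character string. The false positive would be 
Latin-1 text that really did contain the two characters U+00C3 U+00A9 in a 
row: it would be decoded as UTF-8 instead.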

-- 
Rémi Denis-Courmont
http://www.remlab.net/



