[vlc-devel] [PATCH 1/2] lua: add a guess_encoding function to convert Latin1 to UTF8
remi at remlab.net
Tue Aug 27 19:39:00 CEST 2013
On Tuesday 27 August 2013 19:23:22 Edward Wang wrote:
> On Tue, Aug 27, 2013 at 7:20 PM, Rémi Denis-Courmont <remi at remlab.net> wrote:
> > That is using the UTF-8 encoding (strdup()) if the sequence passes
> > UTF-8 validation (IsUTF8()), and falling back to the Latin-1 encoding
> > (FromLatin1()) otherwise.
> > (Note that any byte sequence is valid Latin-1.)
> IsUTF8() checks for UTF8 validation, correct?
IsUTF8() checks whether the byte sequence can be decoded without error to a
sequence of Unicode code points using the rules for UTF-8 decoding.
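
As a rough illustration of that check, here is a minimal sketch in Python. It is not VLC's IsUTF8() (which is C code); it only assumes that Python's strict UTF-8 decoder is a fair stand-in for "can be decoded without error". The name is_utf8 is made up for this example.

```python
def is_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 without error.

    Stand-in for the check described above, not VLC's implementation.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

For example, is_utf8(b"caf\xc3\xa9") is true, while is_utf8(b"caf\xe9") is false because a lone 0xE9 byte is a truncated multi-byte sequence.
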
> Therefore, if IsUTF8() says that the string is valid UTF-8, we should
> be able to use it as valid UTF8.
A byte sequence that passes UTF-8 validation will also pass validation as any
ISO 8859-x encoding and a number of others, even though they may decode to
different Unicode character sequences. Certain valid UTF-8 sequences are also
valid UTF-16 sequences and vice versa. And so on.
Basically, there can be false positives.
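
To make the ambiguity concrete, here is a small Python sketch. The byte values are picked for illustration and do not come from the patch; the point is only that the same bytes decode cleanly under several encodings, to different text.

```python
# Bytes chosen for illustration; they are not taken from the patch.
data = b"\xc3\xa9"

as_utf8 = data.decode("utf-8")      # one code point: U+00E9
as_latin1 = data.decode("latin-1")  # two code points: U+00C3 U+00A9

# Plain ASCII letters are valid UTF-8, and an even number of them also
# decodes as UTF-16 (little-endian here) without error.
ascii_pair = b"ab"
as_utf16le = ascii_pair.decode("utf-16-le")  # one code point: U+6261

print(repr(as_utf8), repr(as_latin1), repr(as_utf16le))
```

None of these decodes raises an error, so validation alone cannot tell which reading the sender intended.
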
> If not, then I think that there is a bug in IsUTF8().
There is no way to unambiguously determine if a valid UTF-8 byte sequence
should be decoded as UTF-8 or Latin-1. We assume UTF-8 because it is
statistically much more likely owing to the design of the UTF-8 coding.
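
A hedged sketch of that heuristic, again in Python rather than VLC's C: guess_decode is a hypothetical helper that mirrors the strdup()/FromLatin1() choice discussed in this thread, and count_valid_high_pairs is a made-up experiment showing how rarely arbitrary non-ASCII bytes happen to form valid UTF-8.

```python
def guess_decode(data: bytes) -> str:
    """Hypothetical helper: prefer UTF-8, fall back to Latin-1."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Every byte maps to a code point in Latin-1, so this cannot fail.
        return data.decode("latin-1")


def count_valid_high_pairs() -> int:
    """Count two-byte strings of non-ASCII bytes that are valid UTF-8."""
    valid = 0
    for first in range(0x80, 0x100):
        for second in range(0x80, 0x100):
            try:
                bytes([first, second]).decode("utf-8")
            except UnicodeDecodeError:
                continue
            valid += 1
    return valid
```

Only 1920 of the 16384 possible pairs of non-ASCII bytes pass (lead bytes 0xC2 to 0xDF followed by a continuation byte 0x80 to 0xBF), and the fraction shrinks as the text gets longer. So Latin-1 text containing non-ASCII characters usually fails UTF-8 validation, and a pass is strong, though not conclusive, evidence of UTF-8.
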