[vlc-devel] [RFC] EIT character sets conversion
rem at videolan.org
Fri Aug 31 21:30:44 CEST 2007
On Friday 31 August 2007, Sigmund Augdal wrote:
> The specs say strings starting with \x10 are "ISO/IEC 8859", and in
> this case the two following bytes are said to be an unsigned
> big-endian 16-bit integer to be used as an index into a table listed
> in the spec. This table has 16 entries, where 0 is listed as reserved
> and 1-15 are listed as meaning the corresponding 8859 variant. No entries
> are given above this. I assume this is in order to make backwards
> compatible extensions to the spec possible in the future, but no such
> thing exists today to the best of my knowledge.
So basically, Latin-10/ISO_8859-16 is currently not allowed. Not that
anybody uses it in real life anyway ;-)
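
For illustration, the \x10 rule quoted above could be sketched in Python
roughly as follows. The function name iso8859_charset is hypothetical, and
the sketch assumes, as the quoted text describes, that index 0 is reserved
and only 1-15 are defined. It returns a charset name and does not decode
anything.

```python
import struct


def iso8859_charset(data: bytes) -> str:
    """Return the charset name selected by a 0x10-prefixed string.

    Hypothetical helper: assumes a 0x10 marker byte followed by an
    unsigned big-endian 16-bit index into the ISO/IEC 8859 table.
    """
    if len(data) < 3 or data[0] != 0x10:
        raise ValueError("not a 0x10-prefixed string")
    # ">H" means big-endian unsigned 16-bit integer.
    (index,) = struct.unpack(">H", data[1:3])
    if not 1 <= index <= 15:
        # 0 is reserved and nothing above 15 is defined,
        # so an index of 16 (ISO-8859-16) is rejected here.
        raise ValueError(f"reserved or undefined table index: {index}")
    return f"ISO-8859-{index}"
```

The returned name would then be handed to whatever converter is in use;
the text itself starts at the fourth byte.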
> Strings starting with \x11 are specified to mean ISO/IEC 10646-1 - Basic
> Multilingual Plane. I don't know what that means, but apparently I
> have earlier interpreted it as meaning UTF-16, feel free to correct
> if my assumption was wrong.
BMP is the subset of Unicode code points that fit into a single 16-bit
code unit when using UTF-16. This is commonly known as "UCS-2" (but some
people say UCS-2 = UTF-16). However, that does not say what the endianness is
supposed to be, which is what I am interested in. Maybe the spec
assumes Big Endian all over the place? In that case, we should
pass "UTF-16BE" rather than "UTF-16" to iconv.
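
To make the endianness concern concrete, here is a small Python sketch.
Python's codec names stand in for the iconv charset names, and the sample
bytes are made up. It assumes the payload after the \x11 marker is
big-endian with no byte order mark, which is exactly the open question.

```python
# Hypothetical payload: "Hi" as two big-endian 16-bit code units, no BOM.
payload = b"\x00H\x00i"

# Naming the byte order explicitly gives the intended text.
as_be = payload.decode("utf-16-be")

# The same bytes read as little-endian decode without error,
# but to entirely different characters (U+4800 and U+6900).
as_le = payload.decode("utf-16-le")

print(as_be == "Hi")
print(as_le == as_be)
```

A converter asked for plain "UTF-16" has to pick a byte order when there
is no BOM, so naming the order explicitly removes the guesswork.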
> \x15: UTF-8 encoding of ISO/IEC 10646-1 - Basic Multilingual Plane
BMP again. When reading, it's ok to use UTF-8, of which UTF-8 BMP is a
subset. In theory, if we were writing such a string, we would have to
discard non-BMP code points.
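
A minimal sketch of that writing case, in Python. The helper name
encode_utf8_bmp is hypothetical, and silently dropping the offending code
points is only one possible policy (replacing them would be another).

```python
def encode_utf8_bmp(text: str) -> bytes:
    """Encode text as UTF-8, dropping code points outside the BMP.

    Hypothetical helper: anything above U+FFFF is discarded silently.
    """
    bmp_only = "".join(ch for ch in text if ord(ch) <= 0xFFFF)
    return bmp_only.encode("utf-8")
```

For example, encode_utf8_bmp("a\U0001F600b") keeps only the two ASCII
letters, since U+1F600 lies outside the BMP.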