[vlc-devel] [RFC] EIT character sets conversion
Rémi Denis-Courmont
rem at videolan.org
Wed Sep 5 21:14:01 CEST 2007
On Tuesday 04 September 2007 01:57:07, you wrote:
> Apparently I was under that assumption. I have never set out to learn
> how UTF-16 works...
BMP UTF-16 is pretty trivial: it assigns one "code point" to each character,
and represents each code point as a 16-bit word. But of course, that is
endianness-dependent (contrary to UTF-8), which is why iconv has UTF-16LE and
UTF-16BE.
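
To make the byte order point concrete, here is a small sketch using Python's
built-in codecs (not iconv itself; the sample character is arbitrary):

```python
# One BMP character, serialised three ways.
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, code point U+00E9

utf16_be = ch.encode("utf-16-be")  # most significant byte first
utf16_le = ch.encode("utf-16-le")  # least significant byte first
utf8 = ch.encode("utf-8")          # plain byte sequence, no byte-order variants

print(utf16_be.hex())  # 00e9
print(utf16_le.hex())  # e900
print(utf8.hex())      # c3a9
```

The same 16-bit word comes out as two different byte sequences, while UTF-8
has only one form.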
> This spec does not make any assumptions about endianness that I know
> of, it most often sticks a uimsbf (meaning
> unsigned integer most significant bit first) next to any numerical field.
> That said I don't think there are any little endian fields mentioned in the
> entire spec, so one could perhaps deduce such an assumption...
It's probably UTF-16BE then.
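
A minimal sketch of what that guess means in practice, again in Python. The
helper name and the sample bytes are made up for illustration, not taken from
VLC or from a real stream:

```python
def decode_ucs2_field(payload: bytes) -> str:
    """Decode a 16-bit-per-character text field, assuming big-endian order."""
    return payload.decode("utf-16-be")

raw = b"\x00A\x00R\x00T\x00E"    # hypothetical sample payload
as_be = decode_ucs2_field(raw)   # the big-endian guess
as_le = raw.decode("utf-16-le")  # the wrong guess, for comparison

print(as_be)         # ARTE
print(as_le == as_be)  # False
```

Guessing the wrong byte order does not fail, it silently yields unrelated
(here CJK) characters, so the assumption has to come from the spec.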
> > > \x15: UTF-8 encoding of ISO/IEC 10646-1 - Basic Multilingual Plane
> >
> > ^^^^^^^^^^^^^^^^^^^^^^^^
> > BMP again. When reading, it's ok to use UTF-8, of which UTF-8 BMP is a
> > subset. In theory, if we were writing such a string, we would have to
> > discard non-BMP code points.
>
> I think actually the ts muxer can produce some structures containing
> strings that should follow these rules. I have no idea whatsoever
> what this code does about encodings.
As long as nobody tries to use a non-BMP character, the bug will not trigger
anyway. There are probably no TV channels in a language that cannot be
written within the BMP.
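
If we ever wanted to be strict when writing, discarding non-BMP code points
could look roughly like this. It is a Python sketch only; the function names
are hypothetical and this is not what the ts muxer currently does:

```python
def to_bmp_utf8(text: str) -> bytes:
    """Drop code points above U+FFFF, then encode the rest as UTF-8."""
    bmp_only = "".join(c for c in text if ord(c) <= 0xFFFF)
    return bmp_only.encode("utf-8")

def encode_dvb_utf8(text: str) -> bytes:
    """Prefix the BMP-only UTF-8 bytes with the \\x15 selector byte."""
    return b"\x15" + to_bmp_utf8(text)

print(to_bmp_utf8("A\U0001F4FAB"))    # b'AB'
print(encode_dvb_utf8("Caf\u00e9"))   # b'\x15Caf\xc3\xa9'
```

Silently dropping characters is only one possible policy; replacing them with
a placeholder would be another.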
As for the ISO 6937 stuff, if there actually are compliant streams as well as
non-compliant streams using ISO-8859-1, we are pretty much screwed: the two
encodings are mutually incompatible, and I could not find a satisfactory
detection heuristic.
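
To illustrate the incompatibility: in ISO 6937 the bytes 0xC1-0xCF are
non-spacing diacritics that precede the base letter, whereas in ISO-8859-1
the same bytes are precomposed letters. The Python sketch below uses a
deliberately partial table (four diacritics, ASCII base letters only) and a
made-up function name; it is not a complete codec:

```python
import unicodedata

# Partial, illustrative mapping of ISO 6937 non-spacing diacritic bytes
# to Unicode combining marks. Not a full ISO 6937 table.
ISO6937_DIACRITICS = {
    0xC1: "\u0300",  # grave accent
    0xC2: "\u0301",  # acute accent
    0xC3: "\u0302",  # circumflex
    0xC8: "\u0308",  # diaeresis
}

def decode_iso6937_sketch(data: bytes) -> str:
    out = []
    pending = ""  # combining mark waiting for its base letter
    for b in data:
        if b in ISO6937_DIACRITICS:
            pending = ISO6937_DIACRITICS[b]
        else:
            # Assumption: anything else is ASCII; other bytes become U+FFFD.
            base = chr(b) if b < 0x80 else "\ufffd"
            out.append(base + pending)
            pending = ""
    return unicodedata.normalize("NFC", "".join(out))

raw = b"caf\xc2e"
print(decode_iso6937_sketch(raw))  # café
print(raw.decode("latin-1"))       # cafÂe
```

Both readings are valid byte sequences in their respective encodings, which
is why the bytes alone do not tell you which one the stream meant.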
--
Rémi Denis-Courmont