[vlc-devel] [RFC] EIT character sets conversion

Wed Sep 5 21:14:01 CEST 2007

On Tuesday 04 September 2007 01:57:07, you wrote:
> Apparently I was under that assumption. I have never set out to learn
> how UTF-16 works...

BMP UTF-16 is pretty trivial: it assigns one "code point" to each character,
and represents each code point as a single 16-bit word. But of course, that is
endianness-dependent (unlike UTF-8), which is why iconv has UTF-16LE and
UTF-16BE.
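
To illustrate the byte-order point, here is a minimal Python sketch (the sample
character is an arbitrary choice of mine, not something taken from the spec)
showing the same BMP code point serialized both ways, next to UTF-8:

```python
# One BMP character, serialized three ways.
ch = "\u00e9"  # U+00E9, LATIN SMALL LETTER E WITH ACUTE

be = ch.encode("utf-16-be")   # most significant byte first
le = ch.encode("utf-16-le")   # least significant byte first
u8 = ch.encode("utf-8")       # byte-oriented, no byte-order variants

print(be.hex(), le.hex(), u8.hex())  # 00e9 e900 c3a9
```

The two UTF-16 forms carry the same 16-bit word with the bytes swapped, whereas
UTF-8 has only one possible byte sequence.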

> This spec does not make any assumptions about endianness that I know
> of, it most often sticks a uimsbf (meaning
> unsigned integer most significant bit first) next to any numerical field.
> That said I don't think there are any little endian fields mentioned in the
> entire spec, so one could perhaps deduce such an assumption...

It's probably UTF-16BE then.
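
A rough sketch of what that guess means when reading, assuming the payload
really is big-endian (the sample bytes are made up for illustration, not taken
from a real stream):

```python
# Hypothetical payload: two 16-bit words, 0x0054 and 0x0056.
payload = b"\x00\x54\x00\x56"

as_be = payload.decode("utf-16-be")  # reads the words as U+0054, U+0056
as_le = payload.decode("utf-16-le")  # reads the words as U+5400, U+5600

print(as_be)  # TV
print([hex(ord(c)) for c in as_le])  # ['0x5400', '0x5600']
```

Guessing the wrong byte order does not necessarily raise an error; it can
quietly yield unrelated characters instead.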

> > > \x15: UTF-8 encoding of ISO/IEC 10646-1 - Basic Multilingual Plane
> >
> >                                             ^^^^^^^^^^^^^^^^^^^^^^^^
> > BMP again. When reading, it's ok to use UTF-8, of which UTF-8 BMP is a
> > subset. In theory, if we were writing such a string, we would have to
> > discard non-BMP code points.
>
> I think actually the ts muxer can produce some structures containing
> strings that should follow these rules. I have no idea whatsoever
> what this code does about encodings.

As long as nobody tries to use a non-BMP character, the bug will not trigger
anyway. There are probably no TV channels in a language that is not
represented within the BMP.
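
For the writing side, discarding non-BMP code points could look roughly like
the sketch below. The helper name `strip_non_bmp`, the "?" replacement and the
sample channel name are my own assumptions for illustration; this is not what
the ts muxer currently does. The 0x15 prefix is the table selector quoted
above.

```python
def strip_non_bmp(text: str, replacement: str = "?") -> str:
    """Replace every code point above U+FFFF (outside the BMP)."""
    return "".join(c if ord(c) <= 0xFFFF else replacement for c in text)


# Hypothetical channel name containing U+1F4FA, which lies outside the BMP.
name = "News \U0001F4FA HD"
safe = strip_non_bmp(name)

# Prefix with the 0x15 selector byte, then encode the BMP-only text as UTF-8.
encoded = b"\x15" + safe.encode("utf-8")

print(safe)     # News ? HD
print(encoded)  # b'\x15News ? HD'
```

Whether to replace or silently drop such characters is a policy choice; the
sketch replaces them so the loss stays visible.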

As for the ISO 6937 stuff, if there actually are compliant streams as well as
non-compliant streams using ISO-8859-1, we are pretty much screwed: the two
encodings are mutually incompatible, and I could not find a satisfactory
detection heuristic.

-- 
Rémi Denis-Courmont