[vlc-devel] [RFC] EIT character sets conversion
fenrir at via.ecp.fr
Fri Aug 31 14:26:20 CEST 2007
On Fri, Aug 31, 2007, Rémi Denis-Courmont wrote:
> I have a few doubts concerning EITConvertToUTF8 (from
> modules/demux/ts.c). I have no access to the relevant specifications,
> nor to real-life streams using that.
French TNT uses such descriptors and EITConvertToUTF8 is needed for
> First, if the "string" starts with \x10\x00, it appears we assume the
> third byte codes the number of an ISO_8859 character set. Is there any
> reason why this is limited to the range 1-15? As of now, there is also
> ISO_8859-16 (a.k.a. "Latin-10"), and who knows if more will not be
> Second, if the string starts with \x11, we assume the rest is a sequence
> of UTF-16. That being noted, iconv recognizes three different kinds of
> UTF-16. I am not sure, but I believe "UTF-16" needs a Byte-Order-Mark at
> the beginning, otherwise "UTF-16LE" and "UTF-16BE" must be used when
> the byte endianness is arbitrarily specified.
It is described in EN 300 468 (DVB: Specification for Service Information in
DVB Systems), Annex A (Selection of character table).
Here is an extract:
* if the first byte of the text field has a value in the range "0x20" to "0xFF" then this and all subsequent bytes in the text item are coded using the default character coding table (table 00 - Latin alphabet) of figure A.1;
* if the first byte of the text field has a value in the range "0x01" to "0x0F" then the remaining bytes in the text item are coded in accordance with the character coding tables which are given in table A.3;
* if the first byte of the text field has a value "0x10" then the following two bytes carry a 16-bit value (uimsbf) N to indicate that the remaining data of the text field is coded using the character code table specified by ISO Standard 8859, parts 1 to 9;
* if the first byte of the text field has a value "0x11" then the remaining bytes in the text item are coded in pairs in accordance with the Basic Multilingual Plane of ISO/IEC 10646-1;
* if the first byte of the text field has a value "0x12" then the remaining bytes in the text item are coded in accordance with the Korean Character Set KSC5601-1987;
* if the first byte of the text field has a value "0x13" then the remaining bytes in the text item are coded in accordance with the Simplified Chinese Character Set GB-2312-1980;
* if the first byte of the text field has a value "0x14" then the remaining bytes in the text item are coded in accordance with the Big5 subset of ISO/IEC 10646-1 for use with Traditional Chinese.
* Values for the first byte of "0x15" to "0x1F" are reserved for future use.
Figure A.1: This table is a superset of ISO/IEC 6937 with the addition of the Euro symbol.
Table A.3:
First byte      Character code table
0x01            ISO/IEC 8859-5
0x02            ISO/IEC 8859-6
0x03            ISO/IEC 8859-7
0x04            ISO/IEC 8859-8
0x05            ISO/IEC 8859-9
0x06            ISO/IEC 8859-10
0x07            ISO/IEC 8859-11
0x08            ISO/IEC 8859-12 (see bibliography)
0x09            ISO/IEC 8859-13
0x0A            ISO/IEC 8859-14
0x0B            ISO/IEC 8859-15
0x0C to 0x0F    reserved for future use
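To make the selection rules of the extract concrete, here is a minimal sketch in C of how the first byte(s) could be mapped to a character set. It is not the code from modules/demux/ts.c: the names dvb_text_coding_parse, struct dvb_text_coding, enum dvb_charset and DVB_ISO8859_MAX_PART are hypothetical, and the upper bound of 9 for the 0x10 case simply follows the "parts 1 to 9" wording quoted above. The first byte 0x00, which the extract does not cover, is treated as unsupported.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names, for illustration only. */
enum dvb_charset {
    DVB_CS_DEFAULT,  /* table 00 of figure A.1 (superset of ISO/IEC 6937) */
    DVB_CS_ISO8859,  /* part number stored in iso8859_part */
    DVB_CS_BMP,      /* ISO/IEC 10646-1 BMP, two bytes per character */
    DVB_CS_KSC5601,  /* Korean, KSC5601-1987 */
    DVB_CS_GB2312,   /* Simplified Chinese, GB-2312-1980 */
    DVB_CS_BIG5      /* Traditional Chinese, Big5 subset */
};

struct dvb_text_coding {
    enum dvb_charset charset;
    unsigned iso8859_part; /* only meaningful for DVB_CS_ISO8859 */
    size_t skip;           /* number of selector bytes before the text */
};

/* Assumption: follow the extract literally ("parts 1 to 9"). */
#define DVB_ISO8859_MAX_PART 9u

/* Returns 0 on success, -1 for reserved values or truncated input.
 * On failure the contents of *out are unspecified. */
static int dvb_text_coding_parse(const unsigned char *buf, size_t len,
                                 struct dvb_text_coding *out)
{
    assert(out != NULL);
    if (buf == NULL || len == 0)
        return -1;

    unsigned char first = buf[0];
    out->iso8859_part = 0;
    out->skip = 1;

    if (first >= 0x20) {
        /* No selector byte: the text starts immediately. */
        out->charset = DVB_CS_DEFAULT;
        out->skip = 0;
        return 0;
    }
    if (first >= 0x01 && first <= 0x0B) {
        /* Table A.3: 0x01 -> 8859-5 ... 0x0B -> 8859-15, i.e. part = byte + 4. */
        out->charset = DVB_CS_ISO8859;
        out->iso8859_part = first + 4u;
        return 0;
    }
    switch (first) {
    case 0x10: {
        /* Two more bytes carry N, most significant byte first (uimsbf). */
        if (len < 3)
            return -1;
        unsigned n = ((unsigned)buf[1] << 8) | buf[2];
        if (n < 1 || n > DVB_ISO8859_MAX_PART)
            return -1;
        out->charset = DVB_CS_ISO8859;
        out->iso8859_part = n;
        out->skip = 3;
        return 0;
    }
    case 0x11: out->charset = DVB_CS_BMP;     return 0;
    case 0x12: out->charset = DVB_CS_KSC5601; return 0;
    case 0x13: out->charset = DVB_CS_GB2312;  return 0;
    case 0x14: out->charset = DVB_CS_BIG5;    return 0;
    default:
        /* 0x00, 0x0C to 0x0F and 0x15 to 0x1F: unspecified or reserved. */
        return -1;
    }
}
```

A caller would convert the bytes from buf + skip onwards, using whatever conversion name its iconv implementation accepts for the selected set. Raising DVB_ISO8859_MAX_PART is the only change needed if a wider range of parts has to be accepted for the 0x10 case.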
For the first case, in real life ISO 8859-1 is used instead of ISO 6937.
(Commentary from modules/access/dvb/en50221.c where you will find a the
About UTF-16, I have never seen it used. UTF-16 allows a marker to be
inserted at the start to specify LE or BE, so I hope that broadcasters use it.
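
As a rough illustration of the marker check, the sketch below looks at the first two bytes after the 0x11 selector and picks an explicit byte order. The function name utf16_pick_variant is hypothetical, and falling back to big-endian when no marker is present is my assumption; the extract quoted above does not state a byte order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns a conversion name with explicit byte order
 * and stores in *skip the number of marker bytes to drop (0 or 2). */
static const char *utf16_pick_variant(const unsigned char *buf, size_t len,
                                      size_t *skip)
{
    assert(skip != NULL);
    *skip = 0;

    if (buf != NULL && len >= 2) {
        /* U+FEFF encoded big-endian is FE FF, little-endian is FF FE. */
        if (buf[0] == 0xFE && buf[1] == 0xFF) {
            *skip = 2;
            return "UTF-16BE";
        }
        if (buf[0] == 0xFF && buf[1] == 0xFE) {
            *skip = 2;
            return "UTF-16LE";
        }
    }
    /* Assumption: no marker means big-endian. */
    return "UTF-16BE";
}
```

Passing an explicit "UTF-16BE" or "UTF-16LE" name to the converter, with the marker bytes skipped, avoids depending on how a given iconv treats a plain "UTF-16" input that has no marker.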