[vlc-devel] [RFC] EIT character sets conversion

Fri Aug 31 14:26:20 CEST 2007

Hi,

On Fri, Aug 31, 2007, Rémi Denis-Courmont wrote:
> I have a few doubts concerning EITConvertToUTF8 (from 
> modules/demux/ts.c). I have no access to the relevant specifications, 
> neither to real-life streams using that.
 French TNT uses such descriptors and EITConvertToUTF8 is needed for
them.

> First, if the "string" starts with \x10\x00, it appears we assume the 
> third byte codes the number of an ISO_8859 character set. Is there any 
> reason why this is limited to the range 1-15? As of now, there is also 
> ISO_8859-16 (a.k.a. "Latin-10"), and who knows if more will not be 
> added.
> 
> Second, if the string starts with \x11, we assume the rest is a sequence 
> of UTF-16. That being noted, iconv reckons three different kind of 
> UTF-16. I am not sure, but I believe "UTF-16" needs a Byte-Order-Mark at 
> the beginning, otherwise "UTF-16LE" and "UTF16-BE" must be used when 
> the byte endianess is arbitrarily specified.
 It is described in EN 300 468 (DVB: Specification of Service Information in
DVB Systems), Annex A (Selection of Character table).
Here is an extract:
----
 * if the first byte of the text field has a value in the range "0x20" to "0xFF" then this and all subsequent bytes in the text item are coded using the default character coding table (table 00 - Latin alphabet) of figure A.1;
 * if the first byte of the text field has a value in the range "0x01" to "0x0F" then the remaining bytes in the text item are coded in accordance with the character coding tables which are given in table A.3;
 * if the first byte of the text field has a value "0x10" then the following two bytes carry a 16-bit value (uimsbf) N to indicate that the remaining data of the text field is coded using the character code table specified by ISO Standard 8859, parts 1 to 9;
 * if the first byte of the text field has a value "0x11" then the remaining bytes in the text item are coded in pairs in accordance with the Basic Multilingual Plane of ISO/IEC 10646-1 [8];
 * if the first byte of the text field has a value "0x12" then the remaining bytes in the text item are coded in accordance with the Korean Character Set KSC5601-1987 [17];
 * if the first byte of the text field has a value "0x13" then the remaining bytes in the text item are coded in accordance with the Simplified Chinese Character Set GB-2312-1980;
 * if the first byte of the text field has a value "0x14" then the remaining bytes in the text item are coded in accordance with the Big5 subset of ISO/IEC 10646-1 [8] for use with Traditional Chinese.
 * Values for the first byte of "0x15" to "0x1F" are reserved for future use.

A.1: This table is a superset of ISO/IEC 6937 [9] with addition of the Euro symbol.
A.3:
 First Byte  Character code table
   0x01      ISO/IEC 8859-5 [31]
   0x02      ISO/IEC 8859-6 [30]
   0x03      ISO/IEC 8859-7 [29]
   0x04      ISO/IEC 8859-8 [28]
   0x05      ISO/IEC 8859-9 [27]
   0x06      ISO/IEC 8859-10 [26]
   0x07      ISO/IEC 8859-11 [25]
   0x08      ISO/IEC 8859-12 (see bibliography)
   0x09      ISO/IEC 8859-13 [24]
   0x0A      ISO/IEC 8859-14 [23]
   0x0B      ISO/IEC 8859-15 [22]
   0x0C to 0x0F reserved for future use
----
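
For illustration, the selection rules in the extract above could be sketched
in C roughly as follows. This is not the code from ts.c or en50221.c:
dvb_text_charset is a hypothetical helper name, and the charset strings are
assumed iconv spellings that should be checked against the local iconv. The
byte order for the 0x11 case is not settled by the extract, so the name used
there is only a placeholder.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper (not the VLC function): choose a charset name from the
 * leading byte(s) of a DVB text field, following the EN 300 468 extract.
 * On success, writes the name into `name`, stores the number of prefix bytes
 * to skip in *skip and returns 0. Returns -1 for reserved, unsupported or
 * truncated input. All charset names are assumed iconv spellings.
 */
static int dvb_text_charset(const unsigned char *buf, size_t len,
                            char *name, size_t name_size, size_t *skip)
{
    if (len == 0)
        return -1;

    unsigned char first = buf[0];

    if (first >= 0x20) {
        /* Default table 00 (superset of ISO 6937); no prefix byte to skip. */
        snprintf(name, name_size, "ISO6937");
        *skip = 0;
        return 0;
    }
    if (first == 0x08)
        return -1; /* ISO 8859-12 was never published, nothing to map to */
    if (first >= 0x01 && first <= 0x0B) {
        /* Table A.3: 0x01 -> 8859-5 ... 0x0B -> 8859-15. */
        snprintf(name, name_size, "ISO-8859-%d", first + 4);
        *skip = 1;
        return 0;
    }
    if (first == 0x10) {
        /* Two following bytes carry N (big-endian), the ISO 8859 part. */
        if (len < 3)
            return -1;
        unsigned n = ((unsigned)buf[1] << 8) | buf[2];
        if (n < 1 || n > 9) /* the extract only lists parts 1 to 9 */
            return -1;
        snprintf(name, name_size, "ISO-8859-%u", n);
        *skip = 3;
        return 0;
    }

    const char *fixed;
    switch (first) {
    case 0x11: fixed = "UTF-16"; break; /* placeholder, byte order open */
    case 0x12: fixed = "EUC-KR"; break; /* assumed name for KSC5601-1987 */
    case 0x13: fixed = "GB2312"; break;
    case 0x14: fixed = "BIG5";   break;
    default:   return -1;               /* 0x00, 0x0C-0x0F, 0x15-0x1F */
    }
    snprintf(name, name_size, "%s", fixed);
    *skip = 1;
    return 0;
}
```

A caller would pass the returned name to iconv_open() and convert starting at
buf + skip. Whether the 0x10 case should accept N above 9 is left open here,
since the extract only mentions parts 1 to 9.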

Notes:
 For the first case, in real life ISO 8859-1 is used instead of ISO 6937.
(Comment from modules/access/dvb/en50221.c, where you will find the
original function.)
 About UTF-16, I have never seen it used. UTF-16 allows a byte-order mark
at the start to specify LE or BE, so I hope that broadcasters use it.
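
A minimal sketch of how the byte order could be picked from such a marker
before calling iconv. utf16_charset is a hypothetical helper, not existing
VLC code, and falling back to big-endian when no marker is present is only an
assumption on my side, not something I have checked against real streams.

```c
#include <stddef.h>

/*
 * Hypothetical helper: return an explicit-endianness charset name for a
 * UTF-16 payload (the bytes after the 0x11 prefix). If a byte-order mark is
 * found, *skip is set to 2 so the caller can step over it; otherwise *skip
 * is 0 and big-endian is assumed.
 */
static const char *utf16_charset(const unsigned char *buf, size_t len,
                                 size_t *skip)
{
    *skip = 0;
    if (len >= 2) {
        if (buf[0] == 0xFE && buf[1] == 0xFF) { /* U+FEFF stored as BE */
            *skip = 2;
            return "UTF-16BE";
        }
        if (buf[0] == 0xFF && buf[1] == 0xFE) { /* U+FEFF stored as LE */
            *skip = 2;
            return "UTF-16LE";
        }
    }
    return "UTF-16BE"; /* assumption: no marker means big-endian */
}
```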

-- 
fenrir