[vlc-devel] [RFC] EIT character sets conversion
Laurent Aimar
fenrir at via.ecp.fr
Fri Aug 31 14:26:20 CEST 2007
Hi,
On Fri, Aug 31, 2007, Rémi Denis-Courmont wrote:
> I have a few doubts concerning EITConvertToUTF8 (from
> modules/demux/ts.c). I have access neither to the relevant
> specifications nor to real-life streams that use it.
French TNT uses such descriptors and EITConvertToUTF8 is needed for
them.
> First, if the "string" starts with \x10\x00, it appears we assume the
> third byte codes the number of an ISO_8859 character set. Is there any
> reason why this is limited to the range 1-15? As of now, there is also
> ISO_8859-16 (a.k.a. "Latin-10"), and who knows if more will not be
> added.
>
> Second, if the string starts with \x11, we assume the rest is a sequence
> of UTF-16. That being noted, iconv recognizes three different kinds of
> UTF-16. I am not sure, but I believe "UTF-16" needs a Byte-Order-Mark at
> the beginning, otherwise "UTF-16LE" or "UTF-16BE" must be used when
> the byte endianness is arbitrarily specified.
It is described in EN 300 468 (DVB: Specification for Service Information in
DVB Systems), Annex A (Selection of character table).
Here is an extract:
----
* if the first byte of the text field has a value in the range "0x20" to "0xFF" then this and all subsequent bytes in the text item are coded using the default character coding table (table 00 - Latin alphabet) of figure A.1;
* if the first byte of the text field has a value in the range "0x01" to "0x0F" then the remaining bytes in the text item are coded in accordance with the character coding tables which are given in table A.3;
* if the first byte of the text field has a value "0x10" then the following two bytes carry a 16-bit value (uimsbf) N to indicate that the remaining data of the text field is coded using the character code table specified by ISO Standard 8859, parts 1 to 9;
* if the first byte of the text field has a value "0x11" then the remaining bytes in the text item are coded in pairs in accordance with the Basic Multilingual Plane of ISO/IEC 10646-1 [8];
* if the first byte of the text field has a value "0x12" then the remaining bytes in the text item are coded in accordance with the Korean Character Set KSC5601-1987 [17];
* if the first byte of the text field has a value "0x13" then the remaining bytes in the text item are coded in accordance with the Simplified Chinese Character Set GB-2312-1980;
* if the first byte of the text field has a value "0x14" then the remaining bytes in the text item are coded in accordance with the Big5 subset of ISO/IEC 10646-1 [8] for use with Traditional Chinese.
* Values for the first byte of "0x15" to "0x1F" are reserved for future use.
A.1: This table is a superset of ISO/IEC 6937 [9] with addition of the Euro symbol.
Table A.3:
First byte      Character code table
0x01            ISO/IEC 8859-5 [31]
0x02            ISO/IEC 8859-6 [30]
0x03            ISO/IEC 8859-7 [29]
0x04            ISO/IEC 8859-8 [28]
0x05            ISO/IEC 8859-9 [27]
0x06            ISO/IEC 8859-10 [26]
0x07            ISO/IEC 8859-11 [25]
0x08            ISO/IEC 8859-12 (see bibliography)
0x09            ISO/IEC 8859-13 [24]
0x0A            ISO/IEC 8859-14 [23]
0x0B            ISO/IEC 8859-15 [22]
0x0C to 0x0F    reserved for future use
----
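To make the selection rules in the extract concrete, here is a rough Python sketch. It is not the code from ts.c: the names select_charset, decode_text and max_part are made up for illustration, and the codec names used for 0x11 to 0x14 are my assumptions (the extract gives no byte order for 0x11, and euc_kr, gb2312 and big5 are simply the nearest Python codecs). It also treats the default table as ISO 8859-1, which is a simplification since table 00 is based on ISO/IEC 6937.

```python
import codecs

# Table A.3 as quoted above: first byte -> ISO/IEC 8859 part number.
TABLE_A3 = {0x01: 5, 0x02: 6, 0x03: 7, 0x04: 8, 0x05: 9, 0x06: 10,
            0x07: 11, 0x08: 12, 0x09: 13, 0x0A: 14, 0x0B: 15}

# Selectors 0x11..0x14. These codec names are assumptions, not spec names;
# in particular big-endian for 0x11 is a guess.
MULTIBYTE = {0x11: "utf_16_be", 0x12: "euc_kr", 0x13: "gb2312", 0x14: "big5"}


def _iso8859(part):
    """Return the codec name for ISO 8859-<part>, or None if Python has none."""
    name = "iso8859_%d" % part
    try:
        codecs.lookup(name)
    except LookupError:
        return None
    return name


def select_charset(text, max_part=9):
    """Return (codec_name, payload_offset); (None, 0) if reserved or unknown."""
    if not text:
        return None, 0
    first = text[0]
    if first >= 0x20:
        # Default table: approximated by ISO 8859-1 (assumption, see lead-in).
        return "iso8859_1", 0
    if first in TABLE_A3:
        name = _iso8859(TABLE_A3[first])
        return (name, 1) if name else (None, 0)
    if first == 0x10 and len(text) >= 3:
        # The next two bytes carry a 16-bit big-endian value N (uimsbf).
        part = (text[1] << 8) | text[2]
        name = _iso8859(part) if 1 <= part <= max_part else None
        return (name, 3) if name else (None, 0)
    if first in MULTIBYTE:
        return MULTIBYTE[first], 1
    # 0x00, 0x0C..0x0F and 0x15..0x1F: not assigned in the extract.
    return None, 0


def decode_text(text, max_part=9):
    """Decode a text field to str, or return None if the table is unknown."""
    name, offset = select_charset(text, max_part)
    if name is None:
        return None
    return text[offset:].decode(name, errors="replace")
```

The default max_part=9 follows the "parts 1 to 9" wording of the extract. Calling decode_text(data, max_part=15) would accept the wider 1-15 range mentioned in the question, so the limit is a policy choice rather than something the selector byte forces.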
Notes:
For the first case, in real life ISO 8859-1 is used instead of ISO 6937.
(This comment comes from modules/access/dvb/en50221.c, where you will find
the original function.)
About UTF-16, I have never seen it used. UTF-16 allows a byte-order mark to
be inserted at the start to specify LE or BE, so I hope that broadcasters
use it.
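For what it is worth, the byte-order question could be handled along these lines. This is only a sketch: decode_ucs2 is a hypothetical name, and falling back to big-endian when no mark is present is my assumption, not something the extract states.

```python
import codecs


def decode_ucs2(payload, default="utf_16_be"):
    """Decode the bytes that follow the 0x11 selector byte."""
    # Honour an explicit byte-order mark if the broadcaster sent one.
    if payload.startswith(codecs.BOM_UTF16_BE):
        return payload[2:].decode("utf_16_be", errors="replace")
    if payload.startswith(codecs.BOM_UTF16_LE):
        return payload[2:].decode("utf_16_le", errors="replace")
    # No mark: use an explicit byte order (assumed big-endian here) rather
    # than Python's plain "utf_16" codec, which would then fall back to the
    # host's native order.
    return payload.decode(default, errors="replace")
```

Naming the byte order explicitly in the fallback keeps the result the same on every host, which is the concern raised about plain "UTF-16".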
--
fenrir