<br><br><div><span class="gmail_quote">On 8/30/07, <b class="gmail_sendername">Rémi Denis-Courmont</b> &lt;<a href="mailto:rem@videolan.org">rem@videolan.org</a>&gt; wrote:</span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Hello,<br><br>I have a few doubts concerning EITConvertToUTF8 (from<br>modules/demux/ts.c). I have no access to the relevant specifications,<br>nor to real-life streams using that.<br><br>First, if the "string" starts with \x10\x00, it appears we assume the
<br>third byte codes the number of an ISO_8859 character set. Is there any<br>reason why this is limited to the range 1-15? As of now, there is also<br>ISO_8859-16 (a.k.a. "Latin-10"), and who knows if more will not be
<br>added.</blockquote><div>The spec says that strings starting with \x10 are "ISO/IEC 8859", and that in this case the two <br>following bytes are an unsigned big-endian 16-bit integer used as an index <br>into a table listed in the spec. This table has 16 entries, where 0 is listed as reserved and 1-15
<br>are listed as meaning the corresponding 8859 variants. No entries are given above 15. I assume <br>this is to make backwards-compatible extensions to the spec possible in the future, but <br>no such extension exists today, to the best of my knowledge.
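<br><br>To make the \x10 case concrete, here is a minimal sketch of the lookup as described above. It is not the code from modules/demux/ts.c: the function name dvb_iso8859_prefix and its signature are made up for illustration, and the "ISO_8859-n" spelling is only an assumption about what the charset converter accepts.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper, NOT the actual VLC code.
 * If buf starts with 0x10 followed by a valid table index, write the
 * charset name (e.g. "ISO_8859-5") into name[] and return the number
 * of prefix bytes to skip (always 3). Return 0 if the prefix is
 * absent, truncated, or the index is reserved/undefined. */
static size_t dvb_iso8859_prefix(const unsigned char *buf, size_t len,
                                 char *name, size_t name_size)
{
    if (len < 3 || buf[0] != 0x10)
        return 0;

    /* The two bytes after 0x10 form an unsigned big-endian 16-bit index. */
    unsigned idx = ((unsigned)buf[1] << 8) | buf[2];

    /* 0 is reserved; only 1..15 are defined according to this thread. */
    if (idx < 1 || idx > 15)
        return 0;

    /* Assumed naming scheme; adjust to whatever the converter expects. */
    snprintf(name, name_size, "ISO_8859-%u", idx);
    return 3;
}
```

With the bytes 0x10 0x00 0x05 this yields "ISO_8859-5" and a skip of 3, while an index of 0 or 16 is rejected. Note that ISO 8859-12 was never published, so a caller may want to reject index 12 as well.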
<br> </div><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Second, if the string starts with \x11, we assume the rest is a sequence<br>of UTF-16. That being noted, iconv recognizes three different kinds of
<br>UTF-16. I am not sure, but I believe "UTF-16" needs a Byte-Order-Mark at<br>the beginning, otherwise "UTF-16LE" and "UTF-16BE" must be used when<br>the byte endianness is arbitrarily specified.
</blockquote><div><br>Strings starting with \x11 are specified to mean ISO/IEC 10646-1 - Basic Multilingual Plane. I don't<br>know exactly what that means, but apparently I interpreted it as UTF-16 earlier; feel free to
<br>correct me if my assumption was wrong.<br><br>Furthermore, the spec defines strings starting with values all the way up to \x15. These are as follows:<br>\x12: KSC5601-1987 - Korean Character Set<br>\x13: GB2312-1980 - Simplified Chinese Character
<br>\x14: Big5 subset of ISO/IEC 10646-1 - Traditional Chinese<br>\x15: UTF-8 encoding of ISO/IEC 10646-1 - Basic Multilingual Plane<br><br>I think I may not have had the most recent spec when I wrote that code, so some of these might be missing from it.
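<br><br>As a rough sketch, the first-byte values listed above could be mapped to iconv-style charset names along these lines. The function name dvb_charset_from_first_byte is hypothetical, and the charset names are assumptions: the exact spelling depends on the iconv implementation, and choosing "UTF-16BE" for \x11 assumes big-endian data without a Byte-Order-Mark, which I have not verified against the spec.

```c
#include <stddef.h>

/* Hypothetical lookup, NOT the actual VLC code.
 * Map the first byte of a DVB string to an iconv-style charset name.
 * Returns NULL for any value this sketch does not cover (including
 * 0x10, which needs the extra two-byte table index). */
static const char *dvb_charset_from_first_byte(unsigned char first)
{
    switch (first) {
    case 0x11: return "UTF-16BE"; /* assumption: BMP, big-endian, no BOM */
    case 0x12: return "EUC-KR";   /* assumption: closest name for KSC5601-1987 */
    case 0x13: return "GB2312";   /* GB2312-1980, Simplified Chinese */
    case 0x14: return "BIG5";     /* Big5 subset, Traditional Chinese */
    case 0x15: return "UTF-8";    /* already UTF-8, no conversion needed */
    default:   return NULL;       /* not handled here */
    }
}
```

Naming an explicit byte order for \x11 avoids relying on a Byte-Order-Mark, which is the concern raised about plain "UTF-16" with iconv. Whether big-endian is the right choice still needs to be checked against the spec or real streams.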
<br><br></div>If you are interested, you can download the spec from <a href="http://www.etsi.org">www.etsi.org</a>. It is called EN 300 468, and the relevant section is <br>in Annex A.<br><br>Regards<br><br>Sigmund<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Help wanted.<br><br>--<br>Rémi Denis-Courmont<br><a href="http://www.remlab.net/">http://www.remlab.net/</a><br><br>_______________________________________________<br>vlc-devel mailing list<br>To unsubscribe or modify your subscription options:
<br><a href="http://mailman.videolan.org/listinfo/vlc-devel">http://mailman.videolan.org/listinfo/vlc-devel</a><br><br><br></blockquote></div><br>