[vlc] Re: I'm implementing subtitle charset-detector by using firefox's library
Rémi Denis-Courmont
rem at videolan.org
Tue Feb 20 20:28:58 CET 2007
On Tuesday 20 February 2007 19:58, Tsai Dung-Bang wrote:
> Rémi Denis-Courmont wrote:
> > Well, I don't know. It's enough to discriminate UTF-8 from the
> > local character encoding in any case that I've dealt with. The
> > underlying assumption is that valid UTF-8 byte sequences are indeed
> > UTF-8 - and in particular valid US-ASCII sequences are US-ASCII
> > (let's assume nobody uses QP, base64 or UTF-7 subtitles).
>
> It seems that some local charset words fall within the 1920 two-byte
> UTF-8 codes (110yyyyy 10zzzzzz). So if we do not have enough data, we
> cannot distinguish a local charset from raw data.
Yeah, there are valid UTF-8 sequences that are actually Latin-1, but
that is rather unlikely over an entire sentence (except when Latin-1
people talk about UTF-8, but that's not a very common subject in
movies).
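
For illustration, the validity check that this heuristic relies on
could look roughly like the sketch below. It is not VLC's actual code:
the function name looks_like_utf8 is made up for this example, and it
only tests well-formedness. The bytes C3 A9 pass the test (they are "é"
in UTF-8) even though they could also be the two Latin-1 characters
"Ã©", which is exactly the ambiguity discussed above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not VLC's code): returns true if buf[0..len) is
 * well-formed UTF-8. Pure US-ASCII input is accepted as well. */
static bool looks_like_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len)
    {
        unsigned char c = buf[i];
        size_t extra;          /* number of continuation bytes expected */
        unsigned long cp, min; /* decoded code point, smallest legal value */

        if (c < 0x80)
        {   /* US-ASCII byte */
            i++;
            continue;
        }
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else
            return false;      /* stray continuation byte or invalid lead */

        if (len - i <= extra)
            return false;      /* sequence truncated at end of buffer */

        for (size_t k = 1; k <= extra; k++)
        {
            if ((buf[i + k] & 0xC0) != 0x80)
                return false;  /* not a 10zzzzzz continuation byte */
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }

        /* Reject overlong forms, UTF-16 surrogates and out-of-range values */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;

        i += extra + 1;
    }
    return true;
}
```

A lone Latin-1 "é" (byte E9) followed by ASCII fails this test, which
is why the check works well in practice on whole sentences.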
> And in this architecture, how can we add support for the BOM (Byte
> Order Mark)? Lots of Unicode subtitles have this.
The real question is: do ALL Unicode subtitles have either a BOM or an
externally specified explicit charset (demux meta-infos or *.utf
filename extension)? If that is the case, the current UTF-8
autodetection is essentially useless; we can simply read the BOM. The
good news is that it is in fact likely that all Unicode subs have a
BOM. Besides, we ALREADY have BOM detection (in src/input/stream.c),
and it should already work with the subtitle demuxer.
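
As a rough sketch of what reading the BOM amounts to (this is NOT the
code from src/input/stream.c; the function name bom_charset and its
signature are invented for this example), one can compare the first
bytes of the stream against the well-known UTF-8 and UTF-16 marks:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not VLC's code): returns the charset name implied
 * by a leading Byte Order Mark, or NULL if there is none. On return,
 * *skip holds the number of BOM bytes to discard before parsing. */
static const char *bom_charset(const unsigned char *buf, size_t len,
                               size_t *skip)
{
    *skip = 0;

    if (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0)
    {
        *skip = 3;
        return "UTF-8";
    }
    if (len >= 2 && memcmp(buf, "\xFF\xFE", 2) == 0)
    {
        *skip = 2;
        return "UTF-16LE";
    }
    if (len >= 2 && memcmp(buf, "\xFE\xFF", 2) == 0)
    {
        *skip = 2;
        return "UTF-16BE";
    }
    return NULL; /* no BOM: fall back to other means */
}
```

UTF-32 marks are left out of this sketch on purpose; a UTF-32LE BOM
starts with the same two bytes as the UTF-16LE one, so handling it
would need an extra length check first.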
> I think the "UTF-8 subtitles autodetection" option under
> "Input/Codec->Other Codec->Subtitles" in the GUI should work like
> this: when you set Subtitles text encoding to Big5 but load a UTF-8
> subtitle file, VLC autodetects it by using the codepage range
> algorithm. But it seems that is not how it works.
VLC assumes that if you selected Big5, you knew what you were doing.
UTF-8 autodetection currently is only enabled with the "Default"
character set. But if BOM detection is sufficient, then UTF-8
autodetection can be removed from the subsdec decoder.
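
To make the current rule concrete, here is a minimal sketch of the
decision as described above. It is not the subsdec code: pick_charset
and its parameters are hypothetical, and the caller is assumed to have
already run a UTF-8 validity check on the subtitle text.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper (not VLC's code).
 * configured:    the user's charset setting, "Default" or NULL if unset
 * valid_utf8:    result of a UTF-8 validity check on the subtitle text
 * local_charset: the local character encoding to fall back to */
static const char *pick_charset(const char *configured, bool valid_utf8,
                                const char *local_charset)
{
    /* An explicit choice (e.g. "Big5") is trusted as-is: no autodetection */
    if (configured != NULL && strcmp(configured, "Default") != 0)
        return configured;

    /* Only with "Default" does UTF-8 autodetection kick in */
    return valid_utf8 ? "UTF-8" : local_charset;
}
```

With "Big5" selected, a UTF-8 file is still decoded as Big5, which is
the behaviour you observed.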
--
Rémi Denis-Courmont
http://www.remlab.net/