[vlc] Re: I'm implementing subtitle charset-detector by using firefox's library

Rémi Denis-Courmont rem at videolan.org
Tue Feb 20 20:28:58 CET 2007


On Tuesday 20 February 2007 19:58, Tsai Dung-Bang wrote:
> Rémi Denis-Courmont wrote:
> > Well, I don't know. It's enough to discriminate UTF-8 from the
> > local character encoding in any case that I've dealt with. The
> > underlying assumption is that valid UTF-8 byte sequences are indeed
> > UTF-8 - and in particular valid US-ASCII sequences are US-ASCII
> > (let's assume nobody uses QP, base64 or UTF-7 subtitles).
>
> It seems that some local charset words fall within the 1920 two-byte
> UTF-8 codes (110yyyyy 10zzzzzz). So if we do not have enough data, we
> cannot identify a local charset from the raw data.

Yeah, there are valid UTF-8 sequences that are actually Latin-1, but 
that is rather unlikely over an entire sentence (except when Latin-1 
people talk about UTF-8, but that's not a very common subject in 
movies).
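
To make the ambiguity concrete, here is a minimal sketch of a strict 
UTF-8 validity check. The name is_valid_utf8 is hypothetical and this 
is not the code VLC uses; it only illustrates the assumption that a 
buffer which validates as UTF-8 is treated as UTF-8.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not VLC code: returns true if buf[0..len) is
 * well-formed UTF-8 (no overlong forms, no surrogates). */
static bool is_valid_utf8(const uint8_t *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        uint8_t c = buf[i];
        size_t n;          /* number of continuation bytes expected */
        uint32_t cp, min;  /* decoded code point, smallest legal value */

        if (c < 0x80) {    /* plain US-ASCII byte */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            n = 1; cp = c & 0x1F; min = 0x80;     /* 110yyyyy */
        } else if ((c & 0xF0) == 0xE0) {
            n = 2; cp = c & 0x0F; min = 0x800;    /* 1110xxxx */
        } else if ((c & 0xF8) == 0xF0) {
            n = 3; cp = c & 0x07; min = 0x10000;  /* 11110www */
        } else {
            return false;  /* stray continuation or invalid lead byte */
        }

        if (len - i <= n)
            return false;  /* sequence truncated at end of buffer */

        for (size_t k = 1; k <= n; k++) {
            if ((buf[i + k] & 0xC0) != 0x80)
                return false;  /* not a 10zzzzzz continuation byte */
            cp = (cp << 6) | (buf[i + k] & 0x3F);
        }

        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;  /* overlong, out of range, or surrogate */

        i += n + 1;
    }
    return true;
}
```

With this check, the Latin-1 bytes for "café" (63 61 66 E9) are 
rejected, while the two Latin-1 bytes C3 A9 ("Ã©") are accepted, 
because they also happen to be the UTF-8 encoding of "é". That second 
case is the kind of false positive discussed above.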

> And in this architecture, how can we add support for the BOM (Byte
> Order Mark)? Lots of Unicode subtitles have one.

The real question is: do ALL Unicode subtitles have either a BOM or an 
externally specified explicit charset (demux meta-infos or *.utf 
filename extension)? If that is the case, the current UTF-8 
autodetection is essentially useless; we can simply read the BOM. The 
good news is that it is in fact likely that all Unicode subs have a 
BOM. Besides, we ALREADY have BOM detection (in src/input/stream.c), 
and it should already work with the subtitle demuxer.
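
For illustration only, reading a BOM could look roughly like the 
sketch below. The function detect_bom is a hypothetical name and is 
not the code in src/input/stream.c; the sketch also ignores UTF-32, 
whose little-endian BOM starts with the same two bytes as UTF-16LE.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not VLC code: returns the charset implied by a
 * leading BOM, or NULL if there is none. *skip receives the number of
 * BOM bytes to drop before parsing the text. */
static const char *detect_bom(const uint8_t *buf, size_t len, size_t *skip)
{
    *skip = 0;

    if (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0) {
        *skip = 3;
        return "UTF-8";
    }
    if (len >= 2 && memcmp(buf, "\xFF\xFE", 2) == 0) {
        *skip = 2;
        return "UTF-16LE";
    }
    if (len >= 2 && memcmp(buf, "\xFE\xFF", 2) == 0) {
        *skip = 2;
        return "UTF-16BE";
    }
    return NULL;  /* no BOM: charset must come from elsewhere */
}
```

A caller would run this once on the first bytes of the subtitle file 
and fall back to another charset source only when it returns NULL.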

> I thought the "Input/Codec->Other Codec->Subtitles" GUI option "UTF-8
> subtitles autodetection" would work like this: when you set
> "Subtitles text encoding" to Big5 but load a UTF-8 subtitle file,
> VLC autodetects it using the codepage range algorithm. But that does
> not seem to be the case.

VLC assumes that if you selected Big5, you knew what you were doing. 
UTF-8 autodetection is currently only enabled with the "Default" 
character set. But if BOM detection is sufficient, then UTF-8 
autodetection can be removed from the subsdec decoder.
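
As a rough sketch of that rule (pick_charset and its parameters are 
made-up names, and this is not the actual subsdec code): an explicit 
user choice is trusted as is, and the UTF-8 check only applies when 
the setting is left at "Default".

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper, not VLC code.
 * user_charset:  the configured subtitle text encoding, or "Default"
 * valid_utf8:    whether the subtitle text validated as UTF-8
 * local_charset: assumed fallback for the user's locale */
static const char *pick_charset(const char *user_charset, bool valid_utf8,
                                const char *local_charset)
{
    /* An explicit choice such as "Big5" is never second-guessed. */
    if (strcmp(user_charset, "Default") != 0)
        return user_charset;

    /* Only with "Default" does UTF-8 autodetection take effect. */
    return valid_utf8 ? "UTF-8" : local_charset;
}
```

So a UTF-8 file loaded with Big5 selected is still decoded as Big5, 
which matches the behavior described in the question.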

-- 
Rémi Denis-Courmont
http://www.remlab.net/