[vlc-devel] Re: [vlc] I'm implementing subtitle charset-detector by using firefox's library
Rémi Denis-Courmont
rem at videolan.org
Tue Feb 20 18:24:21 CET 2007
Hello,
On Tuesday 20 February 2007 15:53, Tsai Dung-Bang wrote:
> I'm implementing charset-detector by using firefox's library
> it seems that psz_subtitle only stores one line of srt file
> so it does not have enough words for detector.... any suggestion?
Well, I don't know. It's enough to discriminate UTF-8 from the local
character encoding in any case that I've dealt with. The underlying
assumption is that valid UTF-8 byte sequences are indeed UTF-8 - and in
particular valid US-ASCII sequences are US-ASCII (let's assume nobody
uses QP, base64 or UTF-7 subtitles).
I think this works fine for UTF-8 against the entire ISO-8859 series,
and it definitely works for UTF-8 against Latin character sets. It also
seems quite good for Shift-JIS, EUC-KR, GB18030 and Big5 on the Asian side.
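
To illustrate the heuristic described above, here is a minimal sketch in C,
assuming a single NUL-terminated subtitle line as input. The names
is_valid_utf8() and guess_line_charset() are hypothetical and are not taken
from the VLC tree; the fallback charset is simply passed in by the caller.

```c
#include <stdbool.h>

/* Returns true if the NUL-terminated string is well-formed UTF-8.
 * Overlong forms, UTF-16 surrogates and code points above U+10FFFF
 * are rejected. Pure US-ASCII input is trivially valid. */
static bool is_valid_utf8(const char *str)
{
    const unsigned char *p = (const unsigned char *)str;

    while (*p != 0) {
        unsigned char c = *p++;
        unsigned cont;      /* number of continuation bytes expected */
        unsigned long cp;   /* code point decoded so far */
        unsigned long min;  /* smallest code point allowed for this length */

        if (c < 0x80)
            continue;       /* US-ASCII byte */
        else if ((c & 0xE0) == 0xC0) { cont = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { cont = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { cont = 3; cp = c & 0x07; min = 0x10000; }
        else
            return false;   /* stray continuation byte or invalid lead byte */

        while (cont-- > 0) {
            if ((*p & 0xC0) != 0x80)
                return false;   /* truncated or malformed sequence */
            cp = (cp << 6) | (*p++ & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;   /* overlong form, out of range, or surrogate */
    }
    return true;
}

/* Hypothetical helper: treat the line as UTF-8 if it validates,
 * otherwise assume the caller-supplied fallback charset. */
static const char *guess_line_charset(const char *line, const char *fallback)
{
    return is_valid_utf8(line) ? "UTF-8" : fallback;
}
```

With this sketch, a line such as "caf\xC3\xA9" is reported as UTF-8, while the
ISO-8859-1 bytes "caf\xE9" fail validation and get the fallback charset.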
Reading more than one line at a time is not attractive because it makes
the whole thing A LOT more complicated (you have to probe multiple lines
ahead of what you need). Moreover, we have to support subtitles for
streaming content too.
> If there are more data for detector, it will be more accurate for
> detector.
I would start by trying to fix the GetFallbackLanguage() implementation
rather than building a big gas factory (an over-engineered contraption)
if I were you. You can always build the factory later if it is really
required.
--
Rémi Denis-Courmont
http://www.remlab.net/