[vlc-devel] Re: [vlc] I'm implementing subtitle charset-detector by using firefox's library

Tue Feb 20 18:24:21 CET 2007

	Hello,

On Tuesday 20 February 2007 15:53, Tsai Dung-Bang wrote:
> I'm implementing charset-detector by using firefox's library
> it seems that psz_subtitle only stores one line of srt file
> so it does not have enough words for detector.... any suggestion?

Well, I don't know. It's enough to discriminate UTF-8 from the local 
character encoding in every case that I've dealt with. The underlying 
assumption is that valid UTF-8 byte sequences are indeed UTF-8, and in 
particular that valid US-ASCII sequences are US-ASCII (let's assume 
nobody uses QP, base64 or UTF-7 subtitles).

I think this works fine for UTF-8 against the entire ISO-8859 series, 
and it definitely works for UTF-8 against Latin character sets. It also 
seems quite good for Shift-JIS, EUC-KR, GB18030 and Big5 on the Asian 
side.
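
A minimal sketch of that heuristic in C might look like the following. 
The names is_valid_utf8() and guess_charset() are hypothetical and are 
not VLC functions; the fallback string is whatever the local character 
encoding happens to be. The check is strict (it rejects overlong forms, 
surrogates and code points above U+10FFFF), which is an assumption about 
how picky one wants to be, not a description of the existing code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Return true if the n bytes at s are well-formed UTF-8 (RFC 3629).
 * Pure US-ASCII input is trivially valid. */
bool is_valid_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;

    while (i < n) {
        unsigned char c = s[i];
        size_t len;         /* total length of this sequence in bytes */
        unsigned long cp;   /* code point being decoded */
        unsigned long min;  /* smallest code point allowed for len */

        if (c < 0x80) {     /* single US-ASCII byte */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            len = 2; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            len = 3; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            len = 4; cp = c & 0x07; min = 0x10000;
        } else {
            return false;   /* stray continuation byte or invalid lead */
        }

        if (n - i < len)
            return false;   /* sequence truncated at end of line */

        for (size_t k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return false;   /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }

        /* Reject overlong forms, surrogates and out-of-range values. */
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;

        i += len;
    }
    return true;
}

/* Valid UTF-8 is taken to be UTF-8; anything else is assumed to be in
 * the fallback (local) character encoding. */
const char *guess_charset(const unsigned char *line, size_t n,
                          const char *fallback)
{
    return is_valid_utf8(line, n) ? "UTF-8" : fallback;
}
```

A single accented character in a legacy 8-bit encoding is usually enough 
to make the line invalid as UTF-8, which is why one line at a time tends 
to be sufficient for this particular decision.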

Reading more than one line at a time is a problem because it makes the 
whole thing A LOT more complicated (you have to probe multiple lines 
ahead of what you need). Moreover, we have to support subtitles for 
streaming content too.

> If there are more data for detector, it will be more accurate for
> detector.

I would start trying to fix the GetFallbackLanguage() implementation 
rather than start a big gas factory if I were you. You can always build 
the factory later if it is really required.

-- 
Rémi Denis-Courmont
http://www.remlab.net/