[vlc-devel] [PATCH 0/1] Subtitles: Encoding detection using uchardet
Salah-Eddin Shaban
salah at videolan.org
Wed Sep 5 09:13:32 CEST 2018
Hello,
A couple of notes here:
- The subtitle demuxer reads the whole subtitle file using
vlc_stream_ReadLine. At first I considered doing the detection and
conversion to UTF-8 there, since vlc_stream_ReadLine already converts
UTF-16. That was not possible: for one thing, conversion from
UTF-16 has to be done even when uchardet is not compiled in (because
of #304), and for another, encoding detection is not reliable when done
line by line.
- I was not sure about the use of strlen when calling
uchardet_handle_data, but the problem with null bytes seems to have
affected only UTF-16 (#304 again), and null-terminated strings are
being used anyway in the subtitle demuxer and later converted to UTF-8
without issue. It should be no different for uchardet.
- At first I didn't do UTF-8 conversion in the demuxer; I just set the
detected encoding on the subtitle ES (fmt.subs.psz_encoding) and let
the subtitle decoder do the conversion. But uchardet sometimes
misidentifies the encoding (in my tests that often happened with
Windows-1256 subtitle files), so it's better to try converting the
whole file in the demuxer and bail out if the conversion fails. The
subtitle decoder can then try the conversion again using the encoding
specified in preferences.
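
To illustrate the last point, here is a minimal sketch of the "convert the
whole buffer or bail out" idea. It is not the code from the patch: it uses
plain POSIX iconv rather than VLC's own wrappers, the helper name
to_utf8_or_null is made up for this example, and the uchardet detection step
is left out. The `from` argument stands in for whatever charset name the
detector reported.

```c
#include <assert.h>
#include <iconv.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of the patch): convert the whole buffer
 * from charset `from` to UTF-8. Returns a malloc'ed, NUL-terminated
 * string, or NULL if the charset name is unknown or the input contains a
 * byte sequence that is invalid or incomplete in that charset. */
static char *to_utf8_or_null(const char *from, const char *in, size_t len)
{
    assert(from != NULL && in != NULL);

    if (len > (SIZE_MAX - 1) / 4)
        return NULL;                      /* size computation would overflow */

    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1)
        return NULL;                      /* unknown charset name */

    /* Generous bound: 4 output bytes per input byte, plus the terminator.
     * If it is ever too small, iconv fails with E2BIG and we bail out. */
    size_t out_size = len * 4 + 1;
    char *out = malloc(out_size);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *src = (char *)in;               /* iconv() wants char **, input is only read */
    size_t src_left = len;
    char *dst = out;
    size_t dst_left = out_size - 1;       /* keep room for the terminator */

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &dst, &dst_left);   /* flush shift state */
    iconv_close(cd);

    if (rc == (size_t)-1) {               /* wrong guess: drop everything */
        free(out);
        return NULL;
    }

    *dst = '\0';
    return out;
}
```

A NULL result here corresponds to the bail-out case described above: the
caller keeps the original bytes untouched, so a later stage can retry with
a different encoding.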
Salah-Eddin Shaban (1):
  Subtitles: Encoding detection using uchardet

 configure.ac              | 18 +++++++++
 modules/demux/Makefile.am |  5 +++
 modules/demux/subtitle.c  | 95 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 118 insertions(+)
--
2.13.7