[vlc-devel] [PATCH 0/1] Subtitles: Encoding detection using uchardet

Salah-Eddin Shaban salah at videolan.org
Wed Sep 5 09:13:32 CEST 2018


A couple of notes here:

- The subtitle demuxer reads the whole subtitle file using
  vlc_stream_ReadLine. At first I considered doing the detection and
  conversion to UTF-8 there, since vlc_stream_ReadLine already converts
  UTF-16. That was not possible for two reasons: conversion from UTF-16
  has to be done even when uchardet is not compiled in (because of
  #304), and encoding detection is not reliable when done line by line.

- I was not sure about using strlen when calling
  uchardet_handle_data, but the problem with null bytes seems to have
  affected only UTF-16 (#304 again), and null-terminated strings are
  already used in the subtitle demuxer and later converted to UTF-8
  without issue. It should be no different for uchardet.

- At first I didn't do UTF-8 conversion in the demuxer; I just set the
  detected encoding on the subtitle ES (fmt.subs.psz_encoding) and let
  the subtitle decoder do the conversion. But uchardet sometimes
  mis-identifies the encoding (in my tests that often happened with
  Windows-1256 subtitle files), so it's better to try converting the
  whole file in the demuxer and bail out if the conversion fails. The
  subtitle decoder can then try the conversion again using the encoding
  specified in preferences.
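
To illustrate the "convert the whole file, bail out on failure" idea
from the last note, here is a minimal sketch. It is not the code from
the patch: the helper name convert_all_to_utf8 is hypothetical, it
calls POSIX iconv directly, and it assumes the input is a
null-terminated buffer whose length can be taken with strlen (so it
would not work for UTF-16 input). The 4x output bound is an assumed
generous limit for common legacy encodings, not a guarantee.

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not taken from the patch: convert a whole
 * null-terminated buffer from `charset` to UTF-8 in one pass.
 * Returns a malloc'ed string on success, or NULL if the charset is
 * unknown or any part of the input cannot be converted. */
static char *convert_all_to_utf8(const char *in, const char *charset)
{
    assert(in != NULL && charset != NULL);

    iconv_t cd = iconv_open("UTF-8", charset);
    if (cd == (iconv_t)-1)
        return NULL;                    /* charset not supported */

    size_t in_left = strlen(in);
    size_t out_size = in_left * 4 + 1;  /* assumed generous bound */
    char *out = malloc(out_size);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *in_ptr = (char *)in;          /* iconv takes a non-const pointer */
    char *out_ptr = out;
    size_t out_left = out_size - 1;     /* keep room for the terminator */

    size_t ret = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);

    /* Any invalid or incomplete sequence means the guessed charset
     * was probably wrong: drop the partial result and report failure. */
    if (ret == (size_t)-1 || in_left != 0) {
        free(out);
        return NULL;
    }

    *out_ptr = '\0';
    return out;
}
```

A caller would pass the charset name reported by the detector and, on
a NULL return, leave the text untouched so that a later stage can
retry with the encoding configured in preferences.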

Salah-Eddin Shaban (1):
  Subtitles: Encoding detection using uchardet

 configure.ac              | 18 +++++++++
 modules/demux/Makefile.am |  5 +++
 modules/demux/subtitle.c  | 95 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 118 insertions(+)


