[vlc-devel] [PATCH 2/3] Detect subtitles charset using uchardet
Rémi Denis-Courmont
remi at remlab.net
Mon Apr 8 20:10:31 CEST 2019
On Sunday, 7 April 2019, 21:38:25 EEST, pertuleha at gmail.com wrote:
> +#ifdef SUBTITLE_C_NEED_MERGE_TEXT
> +
> +static char * MergeTxtLines( text_t *txt ) {
> + char *psz_merged = malloc( 1 );
> + size_t i_merged_len = 0;
> + psz_merged[i_merged_len] = '\0';
> +
> + TextResetLine( txt );
> + for ( char *psz_line = TextGetLine( txt );
> + NULL != psz_line;
> + psz_line = TextGetLine( txt ) ) {
> +
> + size_t i_line_len = strlen( psz_line );
> +
> + psz_merged = realloc( psz_merged, i_merged_len + i_line_len + 1 );
> + if ( NULL == psz_merged ) {
> + return NULL;
> + }
> +
> + /* strcat( (dst + dst_len), src ) instead of simple strcat( dst, src )
> + optimizes text concat to O(N) instead of O(N^2) */
Well... no. It's still quadratic (horribly slow), because realloc() is called
once per line and may copy the whole buffer internally each time.
Besides, using strcat() is completely pointless if you already know the offset
of the end of the destination string.
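
For illustration only, here is a minimal sketch of one way to avoid both
problems: grow the buffer geometrically, so that the copying done inside
realloc() stays linear overall, and use memcpy() at the known offset instead of
strcat(). The function name merge_lines(), its array-of-strings input and the
initial capacity of 64 are assumptions made for this example. They stand in for
the TextGetLine() loop and are not part of the patch or of VLC.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not VLC API): concatenates n_lines strings into one
 * newly allocated, NUL-terminated buffer. Returns NULL on allocation
 * failure; the caller frees the result. */
char *merge_lines(const char *const *lines, size_t n_lines)
{
    assert(lines != NULL || n_lines == 0);

    size_t cap = 64;  /* arbitrary initial capacity */
    size_t len = 0;   /* invariant: len < cap, leaving room for the NUL */
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    for (size_t i = 0; i < n_lines; i++) {
        size_t line_len = strlen(lines[i]);

        if (line_len >= cap - len) {
            /* Double the capacity until the line and the NUL fit, so the
             * number of realloc() calls is logarithmic in the final size. */
            size_t new_cap = cap;
            while (line_len >= new_cap - len) {
                if (new_cap > SIZE_MAX / 2) {
                    free(buf);
                    return NULL;
                }
                new_cap *= 2;
            }
            /* Keep the old pointer so it can be freed if realloc() fails. */
            char *tmp = realloc(buf, new_cap);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap = new_cap;
        }

        /* The end offset is known, so copy there directly. */
        memcpy(buf + len, lines[i], line_len);
        len += line_len;
    }

    buf[len] = '\0';
    return buf;
}
```

Assigning the result of realloc() to a temporary also avoids leaking the old
buffer when the call fails, which the quoted loop would do.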
> + strcat( (psz_merged + i_merged_len), psz_line );
> + i_merged_len += i_line_len;
> + }
> + TextResetLine( txt );
> +
> + return psz_merged;
> +}
> +
> +#endif /* SUBTITLE_C_NEED_MERGE_TEXT */
> +
> +
> +#ifdef HAVE_UCHARDET
> +
> +static char * DetectCharset( text_t *txt ) {
> + uchardet_t ud = uchardet_new();
> +
> + /* subtitles lines are merged because
> + uchardet's full-text result is better than line-by-line result */
> + char *psz_text = MergeTxtLines( txt );
> +
> + uchardet_handle_data( ud, psz_text, strlen( psz_text ) );
> + uchardet_data_end( ud );
> +
> + char *psz_detected_charset = (char *) uchardet_get_charset( ud );
> + if ( 0 == strcmp( psz_detected_charset, "" )
> + || 0 == strcmp (psz_detected_charset, "ASCII" ) ) {
> +
> + psz_detected_charset = NULL;
> + } else {
> + /* uchardet's result will be freed on uchardet_delete() => strdup */
> + psz_detected_charset = strdup( psz_detected_charset );
> + }
> +
> + uchardet_delete( ud );
> +
> + return psz_detected_charset;
> +}
> +
> +#endif /* HAVE_UCHARDET */
--
雷米‧德尼-库尔蒙
http://www.remlab.net/