[vlc-devel] [PATCH 2/3] Detect subtitles charset using uchardet

Rémi Denis-Courmont remi at remlab.net
Mon Apr 8 20:10:31 CEST 2019


On Sunday, 7 April 2019, 21:38:25 EEST, pertuleha at gmail.com wrote:
> +#ifdef SUBTITLE_C_NEED_MERGE_TEXT
> +
> +static char * MergeTxtLines( text_t *txt ) {
> +    char *psz_merged = malloc( 1 );
> +    size_t i_merged_len = 0;
> +    psz_merged[i_merged_len] = '\0';
> +
> +    TextResetLine( txt );
> +    for ( char *psz_line = TextGetLine( txt );
> +          NULL != psz_line;
> +          psz_line = TextGetLine( txt ) ) {
> +
> +        size_t i_line_len = strlen( psz_line );
> +
> +        psz_merged = realloc( psz_merged, i_merged_len + i_line_len + 1 );
> +        if ( NULL == psz_merged ) {
> +            return NULL;
> +        }
> +
> +        /* strcat( (dst + dst_len), src ) instead of simple strcat( dst, src )
> +           optimizes text concat to O(N) instead of O(N^2) */

Well... no. It's still quadratic in the worst case (horribly slow), because
realloc() may copy the whole buffer internally every time it is called.

Besides, using strcat() is completely pointless if you know the offset to the 
end of the destination string.
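
For illustration, here is a minimal sketch of one way to avoid both problems,
assuming the lines are available as a plain array of C strings. The function
merge_lines() and its parameters are hypothetical and not part of the patch or
of VLC. It measures the total length first, allocates once, and then copies
each line to the known end offset with memcpy() rather than strcat():

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not from the patch: concatenates `count` strings
 * into one newly allocated buffer. Returns NULL on allocation failure;
 * the caller frees the result. */
static char *merge_lines(const char *const *lines, size_t count)
{
    size_t total = 0;

    /* First pass: measure, so that the buffer is allocated exactly once. */
    for (size_t i = 0; i < count; i++)
        total += strlen(lines[i]);

    char *merged = malloc(total + 1);
    if (merged == NULL)
        return NULL;

    /* Second pass: the end offset is known, so memcpy() is enough. */
    size_t offset = 0;
    for (size_t i = 0; i < count; i++) {
        size_t len = strlen(lines[i]);
        memcpy(merged + offset, lines[i], len);
        offset += len;
    }
    merged[offset] = '\0';

    return merged;
}
```

Each input byte is read twice and written once, so the cost is linear in the
total text size. Growing a single buffer geometrically (doubling its capacity)
would be another way to get amortized linear behaviour if two passes are not
practical.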

> +        strcat( (psz_merged + i_merged_len), psz_line );
> +        i_merged_len += i_line_len;
> +    }
> +    TextResetLine( txt );
> +
> +    return psz_merged;
> +}
> +
> +#endif /* SUBTITLE_C_NEED_MERGE_TEXT */
> +
> +
> +#ifdef HAVE_UCHARDET
> +
> +static char * DetectCharset( text_t *txt ) {
> +    uchardet_t ud = uchardet_new();
> +
> +    /* subtitles lines are merged because
> +       uchardet's full-text result is better than line-by-line result */
> +    char *psz_text = MergeTxtLines( txt );
> +
> +    uchardet_handle_data( ud, psz_text, strlen( psz_text ) );
> +    uchardet_data_end( ud );
> +
> +    char *psz_detected_charset = (char *) uchardet_get_charset( ud );
> +    if ( 0 == strcmp( psz_detected_charset, "" )
> +         || 0 == strcmp (psz_detected_charset, "ASCII" ) ) {
> +
> +        psz_detected_charset = NULL;
> +    } else {
> +        /* uchardet's result will be freed on uchardet_delete() => strdup */
> +        psz_detected_charset = strdup( psz_detected_charset );
> +    }
> +
> +    uchardet_delete( ud );
> +
> +    return psz_detected_charset;
> +}
> +
> +#endif /* HAVE_UCHARDET */


-- 
雷米‧德尼-库尔蒙
http://www.remlab.net/




