[vlc-commits] [Git][videolan/vlc][master] 10 commits: demux/subtitle: clarify comment

Jean-Baptiste Kempf gitlab at videolan.org
Fri Jun 18 07:09:18 UTC 2021



Jean-Baptiste Kempf pushed to branch master at VideoLAN / VLC


Commits:
550cd952 by Lyndon Brown at 2021-06-18T06:50:04+00:00
demux/subtitle: clarify comment

this does not just apply to srt files, it is done for all formats handled
by this demuxer. it also needed additional clarification.

- - - - -
eb07c5f2 by Lyndon Brown at 2021-06-18T06:50:04+00:00
demux/subtitle: remove pointless statement

we find the period for the extension and overwrite it with a NUL so that we
can then find the second-to-last period, which should precede the substring
we want to extract. there is absolutely no point in restoring the period
afterwards in a working copy which we are just about to destroy.

the comment did not even make any sense.

- - - - -
d582bf98 by Lyndon Brown at 2021-06-18T06:50:04+00:00
demux/subtitle: add missing alloc check

- - - - -
04aed075 by Lyndon Brown at 2021-06-18T06:50:04+00:00
demux/subtitle: minor reorganisation of filename language extraction

(non-functional)

prepares for handling an alternate common pattern; removes unnecessary
variable (we can reuse the `psz_tmp` var now instead of also having
`psz_language_begin`); better readability.

- - - - -
0984f11c by Lyndon Brown at 2021-06-18T06:50:04+00:00
demux/subtitle: use only filename for filename language substr extraction

...and thus prevent some ugly failures.

(the string given to the function is the full filepath not just the filename).

the attempt to determine the language from subtitle filenames is based upon
a single common pattern - PATH/filename.LANG.ext. whilst it works just fine
for this pattern, it is not the only pattern commonly used, for instance
PATH/Subs/x_LANG.ext (where 'x' is an integer).

in such cases where the period for the extension is the only one in the
filename, the function could produce an ugly result should any directory in
the path happen to contain a period (if not, NULL would be returned). it
would incorrectly capture a chunk of the path as part of the substring
extraction, producing results like "FOOBAR/Subs/1_English" (or worse) which
then end up as the language name displayed under the subtitle menu and
elsewhere.

this commit strips the string processed down to filename only and thus
prevents such ugliness. the next commit will introduce proper handling for
the just mentioned alternate common pattern.
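
as an illustration of the failure: given "/media/FOO.BAR/Subs/1_English.srt",
cutting the extension and taking everything after the last remaining period
yields "BAR/Subs/1_English". a minimal sketch of the path-stripping step
follows. `SKETCH_DIR_SEP` and `sketch_filename_only` are hypothetical names
used only here (the actual change uses `DIR_SEP_CHAR`), and a '/' separator
is assumed.

```c
#include <string.h>

/* Hypothetical stand-in for VLC's DIR_SEP_CHAR; assumes '/' separators. */
#define SKETCH_DIR_SEP '/'

/* Returns a pointer to the filename component inside psz_path.
 * Nothing is allocated: the result aliases the input string. */
static const char *sketch_filename_only(const char *psz_path)
{
    const char *psz_sep = strrchr(psz_path, SKETCH_DIR_SEP);
    return (psz_sep == NULL) ? psz_path : psz_sep + 1;
}
```

with the path removed, a period in a directory name can no longer be mistaken
for the one that precedes the language substring.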

- - - - -
0dd41106 by Lyndon Brown at 2021-06-18T06:50:04+00:00
demux/subtitle: handle PATH/Subs/1_English.srt type filename lang extraction

the only pattern handled was PATH/filename.LANG.ext. another common one is
PATH/Subs/x_LANG.ext which this adds handling for.

this simply relies upon falling back to the substring after the last
underscore when no substring after a period can be found.

we do not explicitly require the second pattern to only occur in files
found under a 'Subs' subdir, since it is not certain that there is value in
implementing such a restriction.
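
a rough sketch of the two-pattern lookup, operating on a bare filename.
`sketch_dup` and `sketch_lang_from_filename` are hypothetical names;
`sketch_dup` merely stands in for `strdup` so the sketch stays within ISO C.
note that the diff as posted calls `strchr` for the underscore, i.e. the
first underscore rather than the last; the sketch follows the diff, and the
two only differ for names containing more than one underscore.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: portable replacement for strdup(). */
static char *sketch_dup(const char *psz)
{
    size_t i_len = strlen(psz) + 1;
    char *psz_copy = malloc(i_len);
    if (psz_copy != NULL)
        memcpy(psz_copy, psz, i_len);
    return psz_copy;
}

/* Takes a bare filename (path already removed) and returns a newly
 * allocated candidate language substring, or NULL. Caller frees. */
static char *sketch_lang_from_filename(const char *psz_fname)
{
    char *psz_ret = NULL;
    char *psz_work = sketch_dup(psz_fname);
    if (psz_work == NULL)
        return NULL;

    char *psz_tmp = strrchr(psz_work, '.'); /* find the extension */
    if (psz_tmp != NULL)
    {
        *psz_tmp = '\0'; /* cut it off */

        /* First pattern: filename.LANG.ext */
        psz_tmp = strrchr(psz_work, '.');
        /* Fallback pattern: x_LANG.ext (first underscore, as in the diff) */
        if (psz_tmp == NULL)
            psz_tmp = strchr(psz_work, '_');

        if (psz_tmp != NULL)
            psz_ret = sketch_dup(psz_tmp + 1);
    }

    free(psz_work);
    return psz_ret;
}
```

a name matching neither pattern, such as "movie.srt", gives NULL, so the
caller simply leaves the language unset.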

- - - - -
58ce4b96 by Lyndon Brown at 2021-06-18T06:50:04+00:00
demux/subtitle: prepare for lang detection via codec properties

at least one subtitle format handled by this demuxer may hold the language
as a property specified within the file. we should allow the parser to
extract and use that as an alternative to the filename based substring
extraction. this sets things up to allow the parser functions to provide
that extracted property string.

- - - - -
17b2c064 by Lyndon Brown at 2021-06-18T06:50:04+00:00
demux/subtitles: clarify debug message

the substring obtained from filename extraction in some cases is perfect
but in other cases may not be a language at all, just some portion of the
filename. stating 'detected language FOO' is a bit odd if it turns out to
not actually be a language name that we've extracted. let's fix that by
clarifying what we've actually retrieved, and thus distinguish the less
reliable filename extraction result from the likely more reliable property
available in some subtitle files.

also, enclose the string in quotes in both cases: for the filename based
case simply because it makes sense, and for the property case because the
value may be a language code, as it is for ASS/SSA.
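
a small sketch of the resulting precedence and wording: the property string,
when present, wins and is reported as a detected language, while the filename
substring is only reported as a possibility. `sketch_format_lang_msg` is a
hypothetical helper that writes into a buffer instead of calling `msg_Dbg`;
the message texts are the ones used in this series.

```c
#include <stdio.h>

/* Formats the debug message for whichever source supplied a string.
 * Returns snprintf()'s result, or 0 with an empty buffer if neither did. */
static int sketch_format_lang_msg(char *psz_buf, size_t i_size,
                                  const char *psz_prop_lang,
                                  const char *psz_fname_lang,
                                  const char *psz_location)
{
    if (psz_prop_lang != NULL) /* property from the file: more reliable */
        return snprintf(psz_buf, i_size,
                        "detected language '%s' of subtitle: %s",
                        psz_prop_lang, psz_location);

    if (psz_fname_lang != NULL) /* filename substring: only a guess */
        return snprintf(psz_buf, i_size,
                        "selected '%s' as possible filename language "
                        "substring of subtitle: %s",
                        psz_fname_lang, psz_location);

    if (i_size > 0)
        psz_buf[0] = '\0';
    return 0;
}
```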

- - - - -
49c7098e by Lyndon Brown at 2021-06-18T06:50:04+00:00
demux/subtitles: capture language attribute from SSA/ASS files

... for language identification.

this info property has been supported by libass since v0.10.0. it is currently
a 2-char iso-639-1 code.

libass commit adding support:
https://github.com/libass/libass/commit/c979365946b2dc2499ede862b6f7da15f9bc0ed1

discussion about enhancing the attribute to support 3-char iso-639-2 codes,
possibly bcp-47: https://github.com/libass/libass/issues/404
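
a minimal sketch of capturing the property from a single line, using the same
`sscanf` pattern as this commit. `sketch_parse_ssa_language` is a hypothetical
name. the scanset has no width limit, so the sketch refuses any output buffer
smaller than the line itself.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 and fills psz_out when psz_line is a "Language:" info line
 * with a value; returns 0 otherwise. */
static int sketch_parse_ssa_language(const char *psz_line,
                                     char *psz_out, size_t i_out_size)
{
    /* The captured text is always shorter than the line, so a buffer at
     * least as large as the line cannot overflow. */
    if (i_out_size < strlen(psz_line) + 1)
        return 0;

    return sscanf(psz_line, "Language: %[^\r\n]", psz_out) == 1;
}
```

the value is passed through as-is; whether it is a valid iso-639-1 code is
not checked here.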

- - - - -
d7d8cff6 by Lyndon Brown at 2021-06-18T06:50:04+00:00
demux/subtitles: avoid unnecessary allocations for SSA/ASS

we only need to allocate the `psz_text` buffer when handling `Dialogue`
and `Language` lines. restricting allocation to lines beginning with 'D'
or 'L' is a simple way of avoiding most/all that are unnecessary.
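
the check amounts to a first-character test before the `malloc`; a sketch,
with `sketch_line_needs_text_buffer` as a hypothetical name:

```c
#include <stdbool.h>

/* True only for lines that could be "Dialogue:" or "Language:" lines,
 * judged by their first character alone. */
static bool sketch_line_needs_text_buffer(const char *psz_line)
{
    return psz_line[0] == 'D' || psz_line[0] == 'L';
}
```

other lines starting with 'D' or 'L' would still get a buffer, which is why
this avoids most rather than necessarily all unnecessary allocations.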

- - - - -


1 changed file:

- modules/demux/subtitle.c


Changes:

=====================================
modules/demux/subtitle.c
=====================================
@@ -128,6 +128,7 @@ typedef struct
     vlc_tick_t  i_microsecperframe;
 
     char        *psz_header; /* SSA */
+    char        *psz_lang;
 
     struct
     {
@@ -318,6 +319,7 @@ static int Open ( vlc_object_t *p_this )
     p_sys->subtitles.p_array  = NULL;
 
     p_sys->props.psz_header         = NULL;
+    p_sys->props.psz_lang           = NULL;
     p_sys->props.i_microsecperframe = VLC_TICK_FROM_MS(40);
     p_sys->props.jss.b_inited       = false;
     p_sys->props.mpsub.b_inited     = false;
@@ -686,15 +688,20 @@ static int Open ( vlc_object_t *p_this )
     if( p_sys->subtitles.i_count > 0 )
         p_sys->i_length = p_sys->subtitles.p_array[p_sys->subtitles.i_count-1].i_stop;
 
-    /* Stupid language detection in the filename */
-    char * psz_language = get_language_from_filename( p_demux->psz_filepath );
-
-    if( psz_language )
+    if( p_sys->props.psz_lang )
     {
-        fmt.psz_language = psz_language;
-        msg_Dbg( p_demux, "detected language %s of subtitle: %s", psz_language,
+        fmt.psz_language = p_sys->props.psz_lang;
+        p_sys->props.psz_lang = NULL;
+        msg_Dbg( p_demux, "detected language '%s' of subtitle: %s", fmt.psz_language,
                  p_demux->psz_location );
     }
+    else
+    {
+        fmt.psz_language = get_language_from_filename( p_demux->psz_filepath );
+        if( fmt.psz_language )
+            msg_Dbg( p_demux, "selected '%s' as possible filename language substring of subtitle: %s",
+                     fmt.psz_language, p_demux->psz_location );
+    }
 
     char *psz_description = var_InheritString( p_demux, "sub-description" );
     if( psz_description && *psz_description )
@@ -1232,12 +1239,25 @@ static int  ParseSSA( vlc_object_t *p_obj, subs_properties_t *p_props,
          * Dialogue: Layer#,0:02:40.65,0:02:41.79,Wolf main,Cher,0000,0000,0000,,Et les enregistrements de ses ondes delta ?
          */
 
-        /* The output text is - at least, not removing numbers - 18 chars shorter than the input text. */
-        psz_text = malloc( strlen(s) );
-        if( !psz_text )
-            return VLC_ENOMEM;
+        psz_text = NULL;
+        if( s[0] == 'D' || s[0] == 'L' )
+        {
+            /* The output text is always shorter than the input text. */
+            psz_text = malloc( strlen(s) );
+            if( !psz_text )
+                return VLC_ENOMEM;
+        }
 
-        if( sscanf( s,
+        /* Try to capture the language property */
+        if( s[0] == 'L' &&
+            sscanf( s, "Language: %[^\r\n]", psz_text ) == 1 )
+        {
+            free( p_props->psz_lang ); /* just in case of multiple instances */
+            p_props->psz_lang = psz_text;
+            psz_text = NULL;
+        }
+        else if( s[0] == 'D' &&
+            sscanf( s,
                     "Dialogue: %15[^,],%d:%d:%d.%d,%d:%d:%d.%d,%[^\r\n]",
                     temp,
                     &h1, &m1, &s1, &c1,
@@ -2416,24 +2436,37 @@ static int ParseSCC( vlc_object_t *p_obj, subs_properties_t *p_props,
     return VLC_SUCCESS;
 }
 
-/* Matches filename.xx.srt */
+/* Tries to extract language from common filename patterns PATH/filename.LANG.ext
+   and PATH/Subs/x_LANG.ext (where 'x' is an integer). */
 static char * get_language_from_filename( const char * psz_sub_file )
 {
     char *psz_ret = NULL;
-    char *psz_tmp, *psz_language_begin;
+    char *psz_tmp;
+
+    if( !psz_sub_file )
+        return NULL;
 
-    if( !psz_sub_file ) return NULL;
-    char *psz_work = strdup( psz_sub_file );
+    /* Remove path */
+    const char *psz_fname = strrchr( psz_sub_file, DIR_SEP_CHAR );
+    psz_fname = (psz_fname == NULL) ? psz_sub_file : psz_fname + 1;
 
-    /* Removing extension, but leaving the dot */
-    psz_tmp = strrchr( psz_work, '.' );
+    char *psz_work = strdup( psz_fname );
+    if( !psz_work )
+        return NULL;
+
+    psz_tmp = strrchr( psz_work, '.' ); /* Find extension */
     if( psz_tmp )
     {
-        psz_tmp[0] = '\0';
-        psz_language_begin = strrchr( psz_work, '.' );
-        if( psz_language_begin )
-            psz_ret = strdup(++psz_language_begin);
-        psz_tmp[0] = '.';
+        psz_tmp[0] = '\0'; /* Remove it */
+
+        /* Get substr after next last period - hopefully our language string */
+        psz_tmp = strrchr( psz_work, '.' );
+        /* Otherwise try substr after last underscore for alternate pattern */
+        if( !psz_tmp )
+            psz_tmp = strchr( psz_work, '_' );
+
+        if( psz_tmp )
+            psz_ret = strdup(++psz_tmp);
     }
 
     free( psz_work );



View it on GitLab: https://code.videolan.org/videolan/vlc/-/compare/a9c6d334482db5bc28b9a4b8b499dea9cb1bddc0...d7d8cff680540cdf9fbdb38276ae7c7e99035860
