[vlc-devel] [PATCH] Introducing Intelligent Approximate playlist search

Rémi Denis-Courmont remi at remlab.net
Sat Mar 16 19:43:52 CET 2013


	Hello,

Comments

Le samedi 16 mars 2013 20:08:37, Abhiram Ravi a écrit :
> diff --git a/src/playlist/playlist_internal.h
> b/src/playlist/playlist_internal.h index 9b6c000..ced9e52 100644
> @@ -145,12 +146,20 @@ int playlist_DeleteItem( playlist_t * p_playlist,
> playlist_item_t *, bool); void ResetCurrentlyPlaying( playlist_t
> *p_playlist, playlist_item_t *p_cur ); void ResyncCurrentIndex( playlist_t
> *p_playlist, playlist_item_t *p_cur );
> 
> +/**********************************************************************
> + * Intelligent-Approximate Search Helpers
> + **********************************************************************/
> +int levenshtein_distance ( const char* psz_string_1, const char*
> psz_string_2 ); +
> +#define REG_SPLIT N_("()[ ,.-]_") //The string of characters which are
> used as delimiters for splitting the meta and query strings
> +#define
> MAX_ERROR 1 //Maximum allowed error in Levenshtein Distance, to turn up in
> playlist search results
> +//TODO: Add a command line parameter/option to
> GUI and allow user to select filter level i.e MAX_ERROR

Please wrap comments to 80 characters. I know modern displays can print much 
longer lines, but they tire the human eyes.

Also those macros do not need to be in the header file if it's used only in a 
single source module.

> /**
>   * @}
>   */
> 
> -#define PLAYLIST_DEBUG 1
>  //#undef PLAYLIST_DEBUG2
> +#define PLAYLIST_DEBUG 1

Please keep that change to your computer.

> 
>  #ifdef PLAYLIST_DEBUG
>   #define PL_DEBUG( ... ) msg_Dbg( p_playlist, __VA_ARGS__ )
> diff --git a/src/playlist/search.c b/src/playlist/search.c
> index 0c711ca..fe950c1 100644
> --- a/src/playlist/search.c
> +++ b/src/playlist/search.c
> @@ -5,6 +5,7 @@
>   * $Id$
>   *
>   * Authors: Clément Stenac <zorglub at videolan.org>
> + *             Abhiram Ravi <abhiram.ravi.s at gmail.com>
>   *
>   * This program is free software; you can redistribute it and/or modify it
>   * under the terms of the GNU Lesser General Public License as published
> by @@ -29,6 +30,78 @@
>  #include <vlc_playlist.h>
>  #include <vlc_charset.h>
>  #include "playlist_internal.h"
> +#include <string.h>
> +
> +/*************************************************************************
> ** + * Levenshtein Distance between two strings
> +
> **************************************************************************
> */ +
> +/**
> + * A dynamic programming algorithm to
> + * calculate the levenshtein distance between any
> + * two strings ( used for matching query and playlist items )
> + * @param psz_string_1: The first string
> + * @param psz_string_2: The second string
> + * @return the edit distance between the two strings
> + */
> +int levenshtein_distance ( const char* psz_string_1, const char*
> psz_string_2 )

Missing static keyword. Could use shorter variable identifiers.

> +{
> +    // Converting the strings to lower case
> +    for ( char *p = psz_string_1; *p; ++p )
> *p=*p>0x40&&*p<0x5b?*p|0x60:*p;
> +    for ( char *p = psz_string_2; *p; ++p
> ) *p=*p>0x40&&*p<0x5b?*p|0x60:*p;

This breaks pointer constancy. The compiler should warn you already.

More worryingly, that code seems to break non-English case-insensitive search, 
as supported by vlc_strcasestr().

> +
> +    char char_1, char_2;
> +    int i_cost;
> +    int i_length_1 = strlen ( psz_string_1 );
> +    int i_length_2 = strlen ( psz_string_2 );

The canonical type for strlen() return is size_t.

> +
> +    if ( i_length_1 == 0 )
> +        return i_length_2;
> +    if ( i_length_2 == 0 )
> +        return i_length_1;
> +
> +    int** iarray_distance = malloc ( ( i_length_1 + 1 ) * sizeof ( int* )
> );

Either use xmalloc() if you are lazy, or check for errors.

> +    for ( int i = 0; i < i_length_1 + 1; i++ )
> +    {
> +        iarray_distance [i] = malloc ( ( i_length_2 + 1 ) * sizeof ( int )
> );
> +    }

Could you not allocate a single bidimensional matrix in a single malloc()?

> +    int i;
> +    for ( i = 0; i <= i_length_1; i++ )
> +        iarray_distance [i] [0] = i;
> +    for ( i = 0; i <= i_length_2; i++ )
> +        iarray_distance [0] [i] = i;
> +    for ( i = 1; i <= i_length_1; i++ )

Use C99/C++-style for loops...

> +    {
> +        char_1 = psz_string_1 [i - 1];
> +        int j;
> +        for ( j = 1; j <= i_length_2; j++ )
> +        {
> +            char_2 = psz_string_2 [j - 1];
> +            if ( char_1 == char_2 )
> +                i_cost = 0;
> +            else
> +                i_cost = 1;
> +
> +            iarray_distance [i] [j] = 10000000; //TODO:define this
> somewhere +            if ( iarray_distance [i - 1] [j] + 1 <
> iarray_distance [i] [j] ) +                iarray_distance [i] [j] =
> iarray_distance [i - 1] [j] + 1; +            if ( iarray_distance [i] [j
> - 1] + 1 < iarray_distance [i] [j] ) +                iarray_distance [i]
> [j] = iarray_distance [i] [j - 1] + 1; +            if ( iarray_distance
> [i - 1] [j - 1] + i_cost < iarray_distance [i] [j] ) +               
> iarray_distance [i] [j] = iarray_distance [i - 1] [j - 1] + i_cost; +     
>   }
> +    }
> +    int answer = iarray_distance [i_length_1] [i_length_2];
> +
> +    // Freeing the memory

...like you do here.

> +    for ( int i = 0; i < i_length_1 + 1; i++ )
> +    {
> +        free ( iarray_distance [i] );
> +    }
> +    free ( iarray_distance );
> +
> +    return answer;
> +}
> 
>  /*************************************************************************
> ** * Item search functions
> @@ -85,6 +158,38 @@ playlist_item_t* playlist_ItemGetByInput( playlist_t *
> p_playlist, * Live search handling
>  
> **************************************************************************
> */
> 
> +/*
> + * Split a given sentence based on a string of characters to use for the
> split + * \param psz_split_destination - the array of words obtained after
> splitting + * \param psz_sentence - the input string
> + * \param split_char_string - the string of characters to use for the
> split + * @return - the number of words in the sentence
> + * */
> +static int split_into_words ( char*** psz_split_destination, const char*
> psz_sentence, const char* split_char_string ) +{
> +    // Making a copy to work with strtok
> +    char* psz_sentence_copy = malloc ( ( strlen ( psz_sentence ) + 1 ) *
> sizeof ( char ) );
> +    strcpy ( psz_sentence_copy, psz_sentence );

We have strdup().

> +    const char* psz_buf;
> +
> +    // Allocating space for the required number of words
> +    char** psz_sentence_split = ( char ** ) malloc ( strlen (
> psz_sentence_copy ) * sizeof ( char* ) );
> +    psz_buf = strtok ( psz_sentence_copy, REG_SPLIT );

strtok() is not reentrant. Use strtok_r(), strsep() or whatever instead.

> +
> +    int i_count = 0;
> +    // Running strtok and copying all the words obtained to a new split
> string array +    while ( psz_buf != NULL )
> +    {
> +        int i_buflen = strlen ( psz_buf );
> +        psz_sentence_split [i_count] = ( char* ) malloc ( ( i_buflen + 1 )
> * sizeof ( char ) ); +        strcpy ( psz_sentence_split [i_count],
> psz_buf );
> +        psz_buf = strtok ( NULL, split_char_string );
> +        i_count++;
> +    }
> +    *psz_split_destination = psz_sentence_split;
> +    return i_count;
> +
> +}
>  /**
>   * Enable all items in the playlist
>   * @param p_root: the current root item
> @@ -103,6 +208,9 @@ static void playlist_LiveSearchClean( playlist_item_t
> *p_root )
> 
>  /**
>   * Enable/Disable items in the playlist according to the search argument
> + * Every string is split into words (split based on REG_SPLIT) and an item
> is + * enabled only if all the words in the query string have found a + *
> match(approximated by MAX_ERROR) in the metadata.
>   * @param p_root: the current root item
>   * @param psz_string: the string to search
>   * @return true if an item match
> @@ -111,12 +219,20 @@ static bool playlist_LiveSearchUpdateInternal(
> playlist_item_t *p_root, const char *psz_string, bool b_recursive ) {
>      int i;
> +    // Splitting the query words into strings for separate word-matching
> +    char** psz_query_split = NULL;
> +    int i_query_split = split_into_words( &psz_query_split, psz_string,
> REG_SPLIT ); +
> +    // The bit vector of size = number of words in query string - used to
> ensure that ALL the +    // words in the query string match before
> enabling the item in the playlist +    bool* bitvector_matches = malloc (
> i_query_split * sizeof ( bool ) ); +
>      bool b_match = false;
>      for( i = 0 ; i < p_root->i_children ; i ++ )
>      {
>          bool b_enable = false;
>          playlist_item_t *p_item = p_root->pp_children[i];
> -        // Go recurssively if their is some children
> +        // Go recursively if there are some children
>          if( b_recursive && p_item->i_children >= 0 &&
>              playlist_LiveSearchUpdateInternal( p_item, psz_string, true )
> ) {
> @@ -133,14 +249,135 @@ static bool playlist_LiveSearchUpdateInternal(
> playlist_item_t *p_root, const char *psz_title = vlc_meta_Get(
> p_item->p_input->p_meta, vlc_meta_Title ); if( !psz_title )
>                      psz_title = p_item->p_input->psz_name;
> +
>                  const char *psz_album = vlc_meta_Get(
> p_item->p_input->p_meta, vlc_meta_Album ); const char *psz_artist =
> vlc_meta_Get( p_item->p_input->p_meta, vlc_meta_Artist ); -               
> b_enable = ( psz_title && vlc_strcasestr( psz_title, psz_string ) ) || -  
>                         ( psz_album && vlc_strcasestr( psz_album,
> psz_string ) ) || -                           ( psz_artist &&
> vlc_strcasestr( psz_artist, psz_string ) ); +
> +                // Initializing the bit vector to 0s
> +                for ( int i = 0; i < i_query_split; i++ )
> +                {
> +                    bitvector_matches [i] = 0;
> +                }

Use memset() or directly calloc() for zeroed allocation.

> +
> +                // First check if the words of the query are direct
> substrings, and enable matches for +                // the corresponding
> query words
> +                for ( int i = 0; i < i_query_split; i++ )
> +                {
> +                    bitvector_matches[i] |=
> +                            ( ( psz_title && vlc_strcasestr( psz_title,
> psz_query_split[i] ) ) || +                            ( psz_album &&
> vlc_strcasestr( psz_album, psz_query_split[i] ) ) || +                    
>        ( psz_artist && vlc_strcasestr( psz_artist, psz_query_split[i] ) )
> ); +                }
> +
> +                // Matching the query words and each word of the title
> string. +                // Uses approximate matching with the help of
> levenshtein distance. +                if ( psz_title )
> +                {
> +                    char** psz_title_split = NULL;
> +                    int i_title_split = split_into_words(
> &psz_title_split, psz_title, REG_SPLIT ); +
> +                    for ( int i = 0; i < i_query_split; i++ )
> +                    {
> +                        for ( int j = 0; j < i_title_split; j++ )
> +                        {
> +                            bitvector_matches[i] |= ( psz_title_split[j]
> && levenshtein_distance( psz_title_split[j], psz_query_split[i] ) <=
> MAX_ERROR ) ; +                        }
> +                    }
> +
> +                    for ( int i = 0; i < i_title_split; i++ )
> +                    {
> +                        free ( psz_title_split[i] );
> +                    }
> +                    free ( psz_title_split );
> +                }
> +                // Similarly matching approximately, the query words and
> each word of the album string. +                if ( psz_album )
> +                {
> +                    char** psz_album_split = NULL;
> +                    int i_album_split = split_into_words(
> &psz_album_split, psz_album, REG_SPLIT ); +
> +                    for ( int i = 0; i < i_query_split; i++ )
> +                    {
> +                        for ( int j = 0; j < i_album_split; j++ )
> +                        {
> +                            bitvector_matches[i] |= ( psz_album_split[j]
> && levenshtein_distance( psz_album_split[j], psz_query_split[i] ) <=
> MAX_ERROR ) ; +                        }
> +                    }
> +                    for ( int i = 0; i < i_album_split; i++ )
> +                    {
> +                        free ( psz_album_split[i] );
> +                    }
> +                    free ( psz_album_split );
> +                }
> +                // Similarly matching approximately, the query words and
> each word of the artist string. +                if ( psz_artist )
> +                {
> +                    char** psz_artist_split = NULL;
> +                    int i_artist_split = split_into_words(
> &psz_artist_split, psz_artist, REG_SPLIT ); +
> +                    for ( int i = 0; i < i_query_split; i++ )
> +                    {
> +                        for ( int j = 0; j < i_artist_split; j++ )
> +                        {
> +                            bitvector_matches[i] |= ( psz_artist_split[j]
> && levenshtein_distance( psz_artist_split[j], psz_query_split[i] ) <=
> MAX_ERROR ) ; +                        }
> +                    }
> +                    for ( int i = 0; i < i_artist_split; i++ )
> +                    {
> +                        free ( psz_artist_split[i] );
> +                    }
> +                    free ( psz_artist_split );
> +                }

These code snippets seem repetitive enough to become a separate function.

> +
> +                // Now we check if all the query words have matched.
> +                bool b_super_match = 1;
> +                for ( int i = 0; i < i_query_split; i++ )
> +                {
> +                    b_super_match &= bitvector_matches[i];
> +                }
> +                // Enabled if all the query words matched.
> +                b_enable |= b_super_match;
> +
>              }
>              else
> -                b_enable = p_item->p_input->psz_name && vlc_strcasestr(
> p_item->p_input->psz_name, psz_string ); +            {
> +                // If there is no meta, then we just consider the name
> +                const char* psz_title = p_item->p_input->psz_name;
> +
> +                // Initializing the bit vector to 0s
> +                for ( int i = 0; i < i_query_split; i++ )
> +                {
> +                    bitvector_matches [i] = 0;
> +                }
> +
> +                if ( psz_title )
> +                {
> +                    char** psz_title_split = NULL;
> +                    int i_title_split = split_into_words(
> &psz_title_split, psz_title, REG_SPLIT ); +
> +                    for ( int i = 0; i < i_query_split; i++ )
> +                    {
> +                        for ( int j = 0; j < i_title_split; j++ )
> +                        {
> +                            bitvector_matches [i] |= ( psz_title_split[j]
> && levenshtein_distance( psz_title_split[j], psz_query_split[i] ) <=
> MAX_ERROR ) ; +                        }
> +                    }
> +
> +                    for ( int i = 0; i < i_title_split; i++ )
> +                    {
> +                        free ( psz_title_split[i] );
> +                    }
> +                    free ( psz_title_split );
> +                }
> +                // Now we check if all the query words have matched.
> +                bool b_super_match = 1;
> +                for ( int i = 0; i < i_query_split; i++ )
> +                {
> +                    b_super_match &= bitvector_matches[i];
> +                }
> +                // Enabled if all the query words matched.
> +                b_enable |= b_super_match;
> +            }
>              vlc_mutex_unlock( &p_item->p_input->lock );
>          }
> 
> @@ -151,6 +388,7 @@ static bool playlist_LiveSearchUpdateInternal(
> playlist_item_t *p_root,
> 
>          b_match |= b_enable;
>     }
> +   free ( bitvector_matches );
>     return b_match;
>  }

-- 
Rémi Denis-Courmont
http://www.remlab.net/



More information about the vlc-devel mailing list