[vlc-devel] [PATCH] Introducing Intelligent Approximate playlist search

Abhiram Ravi abhiram.ravi.s at gmail.com
Sat Mar 16 19:08:37 CET 2013


VLC currently uses plain substring matching in the playlist search engine.
This has many disadvantages, and is too strict a search scheme.

Here I introduce a new intelligent approximate search algorithm. The query string and all the metadata are split into words. For every word in the query string, we search for an approximate match in the items' metadata. (This approximation is calculated based on the Levenshtein Edit distance between the two words). Substrings are also checked. Only those items for which all query words have a match turn up in the resulting filtered playlist. 
The current maximum error (for approximate matching) is set to 1. In other words  if we search for say, "Turn", then songs/items with the word "burn", "turk", "urn" etc, also turn up in the result, because they differ in edit distance from the word "Turn" by  atmost 1 letter. Also if we search for say, "Glass castle", then an item with name say "Castle of Glass" turns up in the result, because this item had a match for all the query  words i.e "Glass" and "castle". 
Note that "Castle of Glass" would also turn up in the result if the query string was "Glask Cattle" or "Glass cattle", because each query word has an approximate match (with error <= 1 in this case for each word). 
Also, all permutations of the query words give the same  result. For example, searching for "burn it down" and "Down burn it" would give the same result. So the user can just enter words that he knows, and our search will take care of the rest. For example, if the query was "burn skies", then the song "Burning in the skies" would turn up in the result! 
This will also be very helpful if the user makes any typing errors. (If the user searches for "Cata;yst", his favorite song "The Catalyst" will still turn up in the resulting playlist :D ). Also, adding redundant spaces to the search query will not affect the search. 
P.S : I am an aspiring VLC GSoC 2013 candidate. This is my first patch :D.
IRC nick on #videolan: Hydrodam

---
 src/playlist/playlist_internal.h |   11 +-
 src/playlist/search.c            |  248 +++++++++++++++++++++++++++++++++++++-
 2 files changed, 253 insertions(+), 6 deletions(-)

diff --git a/src/playlist/playlist_internal.h b/src/playlist/playlist_internal.h
index 9b6c000..ced9e52 100644
--- a/src/playlist/playlist_internal.h
+++ b/src/playlist/playlist_internal.h
@@ -6,6 +6,7 @@
  *
  * Authors: Samuel Hocevar <sam at zoy.org>
  *          Clément Stenac <zorglub at videolan.org>
+ *          Abhiram Ravi <abhiram.ravi.s at gmail.com>
  *
  * This program is free software; you can redistribute it and/or modify it
  * under the terms of the GNU Lesser General Public License as published by
@@ -145,12 +146,20 @@ int playlist_DeleteItem( playlist_t * p_playlist, playlist_item_t *, bool);
 void ResetCurrentlyPlaying( playlist_t *p_playlist, playlist_item_t *p_cur );
 void ResyncCurrentIndex( playlist_t *p_playlist, playlist_item_t *p_cur );
 
+/**********************************************************************
+ * Intelligent-Approximate Search Helpers
+ **********************************************************************/
+int levenshtein_distance ( const char* psz_string_1, const char* psz_string_2 );
+
+#define REG_SPLIT N_("()[ ,.-]_") //The string of characters which are used as delimiters for splitting the meta and query strings
+#define MAX_ERROR 1 //Maximum allowed error in Levenshtein Distance, to turn up in playlist search results
+//TODO: Add a command line parameter/option to GUI and allow user to select filter level i.e MAX_ERROR
 /**
  * @}
  */
 
-#define PLAYLIST_DEBUG 1
 //#undef PLAYLIST_DEBUG2
+#define PLAYLIST_DEBUG 1
 
 #ifdef PLAYLIST_DEBUG
  #define PL_DEBUG( ... ) msg_Dbg( p_playlist, __VA_ARGS__ )
diff --git a/src/playlist/search.c b/src/playlist/search.c
index 0c711ca..fe950c1 100644
--- a/src/playlist/search.c
+++ b/src/playlist/search.c
@@ -5,6 +5,7 @@
  * $Id$
  *
  * Authors: Clément Stenac <zorglub at videolan.org>
+ *             Abhiram Ravi <abhiram.ravi.s at gmail.com>
  *
  * This program is free software; you can redistribute it and/or modify it
  * under the terms of the GNU Lesser General Public License as published by
@@ -29,6 +30,78 @@
 #include <vlc_playlist.h>
 #include <vlc_charset.h>
 #include "playlist_internal.h"
+#include <string.h>
+
+/***************************************************************************
+ * Levenshtein Distance between two strings
+ ***************************************************************************/
+
+/**
+ * A dynamic programming algorithm to
+ * calculate the levenshtein distance between any
+ * two strings ( used for matching query and playlist items )
+ * @param psz_string_1: The first string
+ * @param psz_string_2: The second string
+ * @return the edit distance between the two strings
+ */
+int levenshtein_distance ( const char* psz_string_1, const char* psz_string_2 )
+{
+    // Converting the strings to lower case
+    for ( char *p = psz_string_1; *p; ++p ) *p=*p>0x40&&*p<0x5b?*p|0x60:*p;
+    for ( char *p = psz_string_2; *p; ++p ) *p=*p>0x40&&*p<0x5b?*p|0x60:*p;
+
+    char char_1, char_2;
+    int i_cost;
+    int i_length_1 = strlen ( psz_string_1 );
+    int i_length_2 = strlen ( psz_string_2 );
+
+    if ( i_length_1 == 0 )
+        return i_length_2;
+    if ( i_length_2 == 0 )
+        return i_length_1;
+
+    int** iarray_distance = malloc ( ( i_length_1 + 1 ) * sizeof ( int* ) );
+    for ( int i = 0; i < i_length_1 + 1; i++ )
+    {
+        iarray_distance [i] = malloc ( ( i_length_2 + 1 ) * sizeof ( int ) );
+    }
+    int i;
+    for ( i = 0; i <= i_length_1; i++ )
+        iarray_distance [i] [0] = i;
+    for ( i = 0; i <= i_length_2; i++ )
+        iarray_distance [0] [i] = i;
+    for ( i = 1; i <= i_length_1; i++ )
+    {
+        char_1 = psz_string_1 [i - 1];
+        int j;
+        for ( j = 1; j <= i_length_2; j++ )
+        {
+            char_2 = psz_string_2 [j - 1];
+            if ( char_1 == char_2 )
+                i_cost = 0;
+            else
+                i_cost = 1;
+
+            iarray_distance [i] [j] = 10000000; //TODO:define this somewhere
+            if ( iarray_distance [i - 1] [j] + 1 < iarray_distance [i] [j] )
+                iarray_distance [i] [j] = iarray_distance [i - 1] [j] + 1;
+            if ( iarray_distance [i] [j - 1] + 1 < iarray_distance [i] [j] )
+                iarray_distance [i] [j] = iarray_distance [i] [j - 1] + 1;
+            if ( iarray_distance [i - 1] [j - 1] + i_cost < iarray_distance [i] [j] )
+                iarray_distance [i] [j] = iarray_distance [i - 1] [j - 1] + i_cost;
+        }
+    }
+    int answer = iarray_distance [i_length_1] [i_length_2];
+
+    // Freeing the memory
+    for ( int i = 0; i < i_length_1 + 1; i++ )
+    {
+        free ( iarray_distance [i] );
+    }
+    free ( iarray_distance );
+
+    return answer;
+}
 
 /***************************************************************************
  * Item search functions
@@ -85,6 +158,38 @@ playlist_item_t* playlist_ItemGetByInput( playlist_t * p_playlist,
  * Live search handling
  ***************************************************************************/
 
+/*
+ * Split a given sentence based on a string of characters to use for the split
+ * \param psz_split_destination - the array of words obtained after splitting
+ * \param psz_sentence - the input string
+ * \param split_char_string - the string of characters to use for the split
+ * @return - the number of words in the sentence
+ * */
+static int split_into_words ( char*** psz_split_destination, const char* psz_sentence, const char* split_char_string )
+{
+    // Making a copy to work with strtok
+    char* psz_sentence_copy = malloc ( ( strlen ( psz_sentence ) + 1 ) * sizeof ( char ) );
+    strcpy ( psz_sentence_copy, psz_sentence );
+    const char* psz_buf;
+
+    // Allocating space for the required number of words
+    char** psz_sentence_split = ( char ** ) malloc ( strlen ( psz_sentence_copy ) * sizeof ( char* ) );
+    psz_buf = strtok ( psz_sentence_copy, REG_SPLIT );
+
+    int i_count = 0;
+    // Running strtok and copying all the words obtained to a new split string array
+    while ( psz_buf != NULL )
+    {
+        int i_buflen = strlen ( psz_buf );
+        psz_sentence_split [i_count] = ( char* ) malloc ( ( i_buflen + 1 ) * sizeof ( char ) );
+        strcpy ( psz_sentence_split [i_count], psz_buf );
+        psz_buf = strtok ( NULL, split_char_string );
+        i_count++;
+    }
+    *psz_split_destination = psz_sentence_split;
+    return i_count;
+
+}
 /**
  * Enable all items in the playlist
  * @param p_root: the current root item
@@ -103,6 +208,9 @@ static void playlist_LiveSearchClean( playlist_item_t *p_root )
 
 /**
  * Enable/Disable items in the playlist according to the search argument
+ * Every string is split into words (split based on REG_SPLIT) and an item is
+ * enabled only if all the words in the query string have found a
+ * match(approximated by MAX_ERROR) in the metadata.
  * @param p_root: the current root item
  * @param psz_string: the string to search
  * @return true if an item match
@@ -111,12 +219,20 @@ static bool playlist_LiveSearchUpdateInternal( playlist_item_t *p_root,
                                                const char *psz_string, bool b_recursive )
 {
     int i;
+    // Splitting the query words into strings for separate word-matching
+    char** psz_query_split = NULL;
+    int i_query_split = split_into_words( &psz_query_split, psz_string, REG_SPLIT );
+
+    // The bit vector of size = number of words in query string - used to ensure that ALL the
+    // words in the query string match before enabling the item in the playlist
+    bool* bitvector_matches = malloc ( i_query_split * sizeof ( bool ) );
+
     bool b_match = false;
     for( i = 0 ; i < p_root->i_children ; i ++ )
     {
         bool b_enable = false;
         playlist_item_t *p_item = p_root->pp_children[i];
-        // Go recurssively if their is some children
+        // Go recursively if there are some children
         if( b_recursive && p_item->i_children >= 0 &&
             playlist_LiveSearchUpdateInternal( p_item, psz_string, true ) )
         {
@@ -133,14 +249,135 @@ static bool playlist_LiveSearchUpdateInternal( playlist_item_t *p_root,
                 const char *psz_title = vlc_meta_Get( p_item->p_input->p_meta, vlc_meta_Title );
                 if( !psz_title )
                     psz_title = p_item->p_input->psz_name;
+
                 const char *psz_album = vlc_meta_Get( p_item->p_input->p_meta, vlc_meta_Album );
                 const char *psz_artist = vlc_meta_Get( p_item->p_input->p_meta, vlc_meta_Artist );
-                b_enable = ( psz_title && vlc_strcasestr( psz_title, psz_string ) ) ||
-                           ( psz_album && vlc_strcasestr( psz_album, psz_string ) ) ||
-                           ( psz_artist && vlc_strcasestr( psz_artist, psz_string ) );
+
+                // Initializing the bit vector to 0s
+                for ( int i = 0; i < i_query_split; i++ )
+                {
+                    bitvector_matches [i] = 0;
+                }
+
+                // First check if the words of the query are direct substrings, and enable matches for
+                // the corresponding query words
+                for ( int i = 0; i < i_query_split; i++ )
+                {
+                    bitvector_matches[i] |=
+                            ( ( psz_title && vlc_strcasestr( psz_title, psz_query_split[i] ) ) ||
+                            ( psz_album && vlc_strcasestr( psz_album, psz_query_split[i] ) ) ||
+                            ( psz_artist && vlc_strcasestr( psz_artist, psz_query_split[i] ) ) );
+                }
+
+                // Matching the query words and each word of the title string.
+                // Uses approximate matching with the help of levenshtein distance.
+                if ( psz_title )
+                {
+                    char** psz_title_split = NULL;
+                    int i_title_split = split_into_words( &psz_title_split, psz_title, REG_SPLIT );
+
+                    for ( int i = 0; i < i_query_split; i++ )
+                    {
+                        for ( int j = 0; j < i_title_split; j++ )
+                        {
+                            bitvector_matches[i] |= ( psz_title_split[j] && levenshtein_distance( psz_title_split[j], psz_query_split[i] ) <= MAX_ERROR ) ;
+                        }
+                    }
+
+                    for ( int i = 0; i < i_title_split; i++ )
+                    {
+                        free ( psz_title_split[i] );
+                    }
+                    free ( psz_title_split );
+                }
+                // Similarly matching approximately, the query words and each word of the album string.
+                if ( psz_album )
+                {
+                    char** psz_album_split = NULL;
+                    int i_album_split = split_into_words( &psz_album_split, psz_album, REG_SPLIT );
+
+                    for ( int i = 0; i < i_query_split; i++ )
+                    {
+                        for ( int j = 0; j < i_album_split; j++ )
+                        {
+                            bitvector_matches[i] |= ( psz_album_split[j] && levenshtein_distance( psz_album_split[j], psz_query_split[i] ) <= MAX_ERROR ) ;
+                        }
+                    }
+                    for ( int i = 0; i < i_album_split; i++ )
+                    {
+                        free ( psz_album_split[i] );
+                    }
+                    free ( psz_album_split );
+                }
+                // Similarly matching approximately, the query words and each word of the artist string.
+                if ( psz_artist )
+                {
+                    char** psz_artist_split = NULL;
+                    int i_artist_split = split_into_words( &psz_artist_split, psz_artist, REG_SPLIT );
+
+                    for ( int i = 0; i < i_query_split; i++ )
+                    {
+                        for ( int j = 0; j < i_artist_split; j++ )
+                        {
+                            bitvector_matches[i] |= ( psz_artist_split[j] && levenshtein_distance( psz_artist_split[j], psz_query_split[i] ) <= MAX_ERROR ) ;
+                        }
+                    }
+                    for ( int i = 0; i < i_artist_split; i++ )
+                    {
+                        free ( psz_artist_split[i] );
+                    }
+                    free ( psz_artist_split );
+                }
+
+                // Now we check if all the query words have matched.
+                bool b_super_match = 1;
+                for ( int i = 0; i < i_query_split; i++ )
+                {
+                    b_super_match &= bitvector_matches[i];
+                }
+                // Enabled if all the query words matched.
+                b_enable |= b_super_match;
+
             }
             else
-                b_enable = p_item->p_input->psz_name && vlc_strcasestr( p_item->p_input->psz_name, psz_string );
+            {
+                // If there is no meta, then we just consider the name
+                const char* psz_title = p_item->p_input->psz_name;
+
+                // Initializing the bit vector to 0s
+                for ( int i = 0; i < i_query_split; i++ )
+                {
+                    bitvector_matches [i] = 0;
+                }
+
+                if ( psz_title )
+                {
+                    char** psz_title_split = NULL;
+                    int i_title_split = split_into_words( &psz_title_split, psz_title, REG_SPLIT );
+
+                    for ( int i = 0; i < i_query_split; i++ )
+                    {
+                        for ( int j = 0; j < i_title_split; j++ )
+                        {
+                            bitvector_matches [i] |= ( psz_title_split[j] && levenshtein_distance( psz_title_split[j], psz_query_split[i] ) <= MAX_ERROR ) ;
+                        }
+                    }
+
+                    for ( int i = 0; i < i_title_split; i++ )
+                    {
+                        free ( psz_title_split[i] );
+                    }
+                    free ( psz_title_split );
+                }
+                // Now we check if all the query words have matched.
+                bool b_super_match = 1;
+                for ( int i = 0; i < i_query_split; i++ )
+                {
+                    b_super_match &= bitvector_matches[i];
+                }
+                // Enabled if all the query words matched.
+                b_enable |= b_super_match;
+            }
             vlc_mutex_unlock( &p_item->p_input->lock );
         }
 
@@ -151,6 +388,7 @@ static bool playlist_LiveSearchUpdateInternal( playlist_item_t *p_root,
 
         b_match |= b_enable;
    }
+   free ( bitvector_matches );
    return b_match;
 }
 
-- 
1.7.9.5




More information about the vlc-devel mailing list