Damerau–Levenshtein Distance, Lua Implementation
I stumbled across Levenshtein distance today and had to try my hand at writing an implementation in Lua. I choose the slightly more complex Damerau–Levenshtein distance, and I think it turned out pretty well.
Some notes of interest:
- Complexity is O( (#t+1) * (#s+1) ) when lim isn’t specified.
- This function can be used to compare array-like tables as easily as strings.
- This function is case sensitive when comparing strings.
- Using this function to compare against a dictionary of 250,000 words took about 0.6 seconds on my machine for the word “Teusday”, around 10 seconds for very poorly spelled words. Both tests used lim.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 | --[[ Function: EditDistance Finds the edit distance between two strings or tables. Edit distance is the minimum number of edits needed to transform one string or table into the other. Parameters: s - A *string* or *table*. t - Another *string* or *table* to compare against s. lim - An *optional number* to limit the function to a maximum edit distance. If specified and the function detects that the edit distance is going to be larger than limit, limit is returned immediately. Returns: A *number* specifying the minimum edits it takes to transform s into t or vice versa. Will not return a higher number than lim, if specified. Example: :EditDistance( "Tuesday", "Teusday" ) -- One transposition. :EditDistance( "kitten", "sitting" ) -- Two substitutions and a deletion. returns... :1 :3 Notes: * Complexity is O( (#t+1) * (#s+1) ) when lim isn't specified. * This function can be used to compare array-like tables as easily as strings. * The algorithm used is Damerau–Levenshtein distance, which calculates edit distance based off number of subsitutions, additions, deletions, and transpositions. * Source code for this function is based off the Wikipedia article for the algorithm <http://en.wikipedia.org/w/index.php?title=Damerau%E2%80%93Levenshtein_distance&oldid=351641537>. * This function is case sensitive when comparing strings. * If this function is being used several times a second, you should be taking advantage of the lim parameter. * Using this function to compare against a dictionary of 250,000 words took about 0.6 seconds on my machine for the word "Teusday", around 10 seconds for very poorly spelled words. Both tests used lim. Revisions: v1.00 - Initial. ]] function EditDistance( s, t, lim ) local s_len, t_len = #s, #t -- Calculate the sizes of the strings or arrays if lim and math.abs( s_len - t_len ) >= lim then -- If sizes differ by lim, we can stop here return lim end -- Convert string arguments to arrays of ints (ASCII values) if type( s ) == "string" then s = { string.byte( s, 1, s_len ) } end if type( t ) == "string" then t = { string.byte( t, 1, t_len ) } end local min = math.min -- Localize for performance local num_columns = t_len + 1 -- We use this a lot local d = {} -- (s_len+1) * (t_len+1) is going to be the size of this array -- This is technically a 2D array, but we're treating it as 1D. Remember that 2D access in the -- form my_2d_array[ i, j ] can be converted to my_1d_array[ i * num_columns + j ], where -- num_columns is the number of columns you had in the 2D array assuming row-major order and -- that row and column indices start at 0 (we're starting at 0). for i=0, s_len do d[ i * num_columns ] = i -- Initialize cost of deletion end for j=0, t_len do d[ j ] = j -- Initialize cost of insertion end for i=1, s_len do local i_pos = i * num_columns local best = lim -- Check to make sure something in this row will be below the limit for j=1, t_len do local add_cost = (s[ i ] ~= t[ j ] and 1 or 0) local val = min( d[ i_pos - num_columns + j ] + 1, -- Cost of deletion d[ i_pos + j - 1 ] + 1, -- Cost of insertion d[ i_pos - num_columns + j - 1 ] + add_cost -- Cost of substitution, it might not cost anything if it's the same ) d[ i_pos + j ] = val -- Is this eligible for tranposition? if i > 1 and j > 1 and s[ i ] == t[ j - 1 ] and s[ i - 1 ] == t[ j ] then d[ i_pos + j ] = min( val, -- Current cost d[ i_pos - num_columns - num_columns + j - 2 ] + add_cost -- Cost of transposition ) end if lim and val < best then best = val end end if lim and best >= lim then return lim end end return d[ #d ] end |