I stumbled across Levenshtein distance today and had to try my hand at writing an implementation in Lua. I choose the slightly more complex Damerauâ€“Levenshtein distance, and I think it turned out pretty well.

Some notes of interest:

- Complexity is O( (#t+1) * (#s+1) ) when lim isn’t specified.
- This function can be used to compare array-like tables as easily as strings.
- This function is case sensitive when comparing strings.
- Using this function to compare against a dictionary of 250,000 words took about 0.6 seconds on my machine for the word “Teusday”, around 10 seconds for very poorly spelled words. Both tests used lim.

[ccn_lua]–[[

Function: EditDistance

Finds the edit distance between two strings or tables. Edit distance is the minimum number of

edits needed to transform one string or table into the other.

Parameters:

s – A *string* or *table*.

t – Another *string* or *table* to compare against s.

lim – An *optional number* to limit the function to a maximum edit distance. If specified

and the function detects that the edit distance is going to be larger than limit, limit

is returned immediately.

Returns:

A *number* specifying the minimum edits it takes to transform s into t or vice versa. Will

not return a higher number than lim, if specified.

Example:

:EditDistance( “Tuesday”, “Teusday” ) — One transposition.

:EditDistance( “kitten”, “sitting” ) — Two substitutions and a deletion.

returns…

:1

:3

Notes:

* Complexity is O( (#t+1) * (#s+1) ) when lim isn’t specified.

* This function can be used to compare array-like tables as easily as strings.

* The algorithm used is Damerauâ€“Levenshtein distance, which calculates edit distance based

off number of subsitutions, additions, deletions, and transpositions.

* Source code for this function is based off the Wikipedia article for the algorithm

* This function is case sensitive when comparing strings.

* If this function is being used several times a second, you should be taking advantage of

the lim parameter.

* Using this function to compare against a dictionary of 250,000 words took about 0.6

seconds on my machine for the word “Teusday”, around 10 seconds for very poorly

spelled words. Both tests used lim.

Revisions:

v1.00 – Initial.

]]

function EditDistance( s, t, lim )

local s_len, t_len = #s, #t — Calculate the sizes of the strings or arrays

if lim and math.abs( s_len – t_len ) >= lim then — If sizes differ by lim, we can stop here

return lim

end

— Convert string arguments to arrays of ints (ASCII values)

if type( s ) == “string” then

s = { string.byte( s, 1, s_len ) }

end

if type( t ) == “string” then

t = { string.byte( t, 1, t_len ) }

end

local min = math.min — Localize for performance

local num_columns = t_len + 1 — We use this a lot

local d = {} — (s_len+1) * (t_len+1) is going to be the size of this array

— This is technically a 2D array, but we’re treating it as 1D. Remember that 2D access in the

— form my_2d_array[ i, j ] can be converted to my_1d_array[ i * num_columns + j ], where

— num_columns is the number of columns you had in the 2D array assuming row-major order and

— that row and column indices start at 0 (we’re starting at 0).

for i=0, s_len do

d[ i * num_columns ] = i — Initialize cost of deletion

end

for j=0, t_len do

d[ j ] = j — Initialize cost of insertion

end

for i=1, s_len do

local i_pos = i * num_columns

local best = lim — Check to make sure something in this row will be below the limit

for j=1, t_len do

local add_cost = (s[ i ] ~= t[ j ] and 1 or 0)

local val = min(

d[ i_pos – num_columns + j ] + 1, — Cost of deletion

d[ i_pos + j – 1 ] + 1, — Cost of insertion

d[ i_pos – num_columns + j – 1 ] + add_cost — Cost of substitution, it might not cost anything if it’s the same

)

d[ i_pos + j ] = val

— Is this eligible for tranposition?

if i > 1 and j > 1 and s[ i ] == t[ j – 1 ] and s[ i – 1 ] == t[ j ] then

d[ i_pos + j ] = min(

val, — Current cost

d[ i_pos – num_columns – num_columns + j – 2 ] + add_cost — Cost of transposition

)

end

if lim and val < best then
best = val
end
end
if lim and best >= lim then

return lim

end

end

return d[ #d ]

end[/ccn_lua]

Gist of the same source code.