Suppose for the purposes of our example we have string
the_string
of length n, and we're trying to determine if string
the_substring
of length m is found within
the_string
.
The straightforward approach in many languages would be to use a
find()
or
indexOf()
function on the string. It might look like this:
substring_found_at_index = the_string.index(the_substring)
However, if no such method exists, the straightforward approach would be to just scan all the substrings and compare them against
the_substring
until you find a match. Even if the aforementioned function exists, it likely uses the same strategy:
def find_substring_pos(the_string, the_substring)
(0..(the_string.length-1)).each do |i|
this_sub = the_string[i,the_substring.length]
return i if this_sub == the_substring
end
return nil
end
That is an O(n) function, which is normally fast enough.
Even though I'm one of the guys who camps out in line so I can be one of the first to say "don't prematurely optimize your code," there are situations where the most straightforward way to program something just doesn't work. One of those situations is where you have a long string (or set of data), and you will need to do many comparisons over it looking for substrings. Although you'll find it in many cases, an example of the need for this I've seen relatively recently occurs in bioinformatics, when searching through an organism's genome for specific subsequences. (Can you think of any other examples you've seen?)
In that case, with m much smaller than a very large n, O(m * log n) represents a significant improvement over O(n) (or worst case m*n). We can get there with a
suffix array.
Of course building the suffix array takes some time - so much so that if we had to build it for each comparison, we're better off with the straightforward approach. But the idea is that we'll build it once, and reuse it many times, amortizing the cost out to "negligible" over time.
The idea of the suffix array is that you store every suffix of a string (and it's position) in a sorted array. This way, you can do a binary search for the substring in log n time. After that, you just need to compare to see if
the_substring
is there, and if so, return the associated index.
The Wikipedia page linked above uses the example of "abracadabra." The suffix array would store each of these suffixes, in order:
a
abra
abracadabra
acadabra
adabra
bra
bracadabra
cadabra
dabra
ra
racadabra
Below is an implementation of a suffix array in Ruby. You might want to write a more efficient sort algorithm, as I'm not sure what approach
Enumerable#sort takes. Also, you might want to take into account the
ability to get all substrings, not just the first one to be found.
class SuffixArray
def initialize(the_string)
@the_string = the_string
@suffix_array = Array.new
#build the suffixes
last_index = the_string.length-1
(0..last_index).each do |i|
the_suffix = the_string[i..last_index]
the_position = i
# << is the append (or push) operator for arrays in Ruby
@suffix_array << { :suffix=>the_suffix, :position=>the_position }
end
#sort the suffix array
@suffix_array.sort! { |a,b| a[:suffix] <=> b[:suffix] }
end
def find_substring(the_substring)
#uses typical binary search
high = @suffix_array.length - 1
low = 0
while(low <= high)
mid = (high + low) / 2
this_suffix = @suffix_array[mid][:suffix]
compare_len = the_substring.length-1
comparison = this_suffix[0..compare_len]
if comparison > the_substring
high = mid - 1
elsif comparison < the_substring
low = mid + 1
else
return @suffix_array[mid][:position]
end
end
return nil
end
end
sa = SuffixArray.new("abracadabra")
puts sa.find_substring("ac") #outputs 3
Thoughts, corrections, and improvements are always appreciated.
Update: Thanks to Walter's comment below, the return statement above has been corrected.
Hey! Why don't you make your life easier and subscribe to the full post
or short blurb RSS feed? I'm so confident you'll love my smelly pasta plate
wisdom that I'm offering a no-strings-attached, lifetime money back guarantee!
Leave a comment
Leave a comment