The Suffix Array

home | about | contact | privacy statement

Posted by Sam on Dec 24, 2007 at 04:52 PM UTC - 6 hrs

Suppose for the purposes of our example we have string the_string of length n, and we're trying to determine if string the_substring of length m is found within the_string.

The straightforward approach in many languages would be to use a find() or indexOf() function on the string. It might look like this:

substring_found_at_index = the_string.index(the_substring)

However, if no such method exists, the straightforward approach would be to just scan all the substrings and compare them against the_substring until you find a match. Even if the aforementioned function exists, it likely uses the same strategy:

def find_substring_pos(the_string, the_substring)

  (0..(the_string.length-1)).each do |i|

    this_sub = the_string[i,the_substring.length]

    return i if this_sub == the_substring

  end

  return nil

end

That is an O(n) function, which is normally fast enough.

Even though I'm one of the guys who camps out in line so I can be one of the first to say "don't prematurely optimize your code," there are situations where the most straightforward way to program something just doesn't work. One of those situations is where you have a long string (or set of data), and you will need to do many comparisons over it looking for substrings. Although you'll find it in many cases, an example of the need for this I've seen relatively recently occurs in bioinformatics, when searching through an organism's genome for specific subsequences. (Can you think of any other examples you've seen?)

In that case, with m much smaller than a very large n, O(m * log n) represents a significant improvement over O(n) (or worst case m*n). We can get there with a suffix array.

Of course building the suffix array takes some time - so much so that if we had to build it for each comparison, we're better off with the straightforward approach. But the idea is that we'll build it once, and reuse it many times, amortizing the cost out to "negligible" over time.

The idea of the suffix array is that you store every suffix of a string (and it's position) in a sorted array. This way, you can do a binary search for the substring in log n time. After that, you just need to compare to see if the_substring is there, and if so, return the associated index.

The Wikipedia page linked above uses the example of "abracadabra." The suffix array would store each of these suffixes, in order:

a
abra
abracadabra
acadabra
adabra
bra
bracadabra
cadabra
dabra
ra
racadabra

Below is an implementation of a suffix array in Ruby. You might want to write a more efficient sort algorithm, as I'm not sure what approach Enumerable#sort takes. Also, you might want to take into account the ability to get all substrings, not just the first one to be found.

class SuffixArray

  def initialize(the_string)

    @the_string = the_string

    @suffix_array = Array.new

    #build the suffixes 

    last_index = the_string.length-1

    (0..last_index).each do |i|

      the_suffix = the_string[i..last_index]

      the_position = i

      # << is the append (or push) operator for arrays in Ruby

      @suffix_array << { :suffix=>the_suffix, :position=>the_position }

    end

    #sort the suffix array

    @suffix_array.sort! { |a,b| a[:suffix] <=> b[:suffix] }

  end

  def find_substring(the_substring)

    #uses typical binary search

    high = @suffix_array.length - 1

    low = 0

    while(low <= high)

      mid = (high + low) / 2

      this_suffix = @suffix_array[mid][:suffix]

      compare_len = the_substring.length-1

      comparison = this_suffix[0..compare_len]

      if   comparison > the_substring

        high = mid - 1

      elsif comparison < the_substring

        low = mid + 1

      else 

        return @suffix_array[mid][:position]

      end

    end

    return nil

  end

end

sa = SuffixArray.new("abracadabra")

puts sa.find_substring("ac") #outputs 3

Thoughts, corrections, and improvements are always appreciated.

Update: Thanks to Walter's comment below, the return statement above has been corrected.

Hey! Why don't you make your life easier and subscribe to the full post or short blurb RSS feed? I'm so confident you'll love my smelly pasta plate wisdom that I'm offering a no-strings-attached, lifetime money back guarantee!

Last modified on Jul 11, 2008 at 10:19 AM UTC - 6 hrs

Comments

Leave a comment

See this question
http://stackoverflow.com/questions/5322428/finding...

Posted by biorelated on Mar 16, 2011 at 02:59 AM UTC - 6 hrs

Thanks for pointing it out to me. I answered the question there, and also added some code to have it return an array of indices that match. That code is on github: https://github.com/codeodor/miscellany/blob/master...

Posted by Sammy Larbi on Mar 16, 2011 at 03:04 PM UTC - 6 hrs

Leave a comment

Topics
.NET (19)
AI/Machine Learning (14)
Answers To 100 Interview Questions (10)
Bioinformatics (2)
Business (1)
C and Cplusplus (6)
cfrails (22)
ColdFusion (78)
Customer Relations (15)
Databases (3)
DRY (18)
DSLs (11)
Future Tech (5)
Games (5)
Groovy/Grails (8)
Hardware (1)
IDEs (9)
Java (38)
JavaScript (4)
Linux (2)
Lisp (1)
Mac OS (4)
Management (15)
MediaServerX (1)
Miscellany (76)
OOAD (37)
Productivity (11)
Programming (168)
Programming Quotables (9)
Rails (31)
Ruby (67)
Save Your Job (58)
scriptaGulous (4)
Software Development Process (23)
TDD (41)
TDDing xorblog (6)
Tools (5)
Web Development (8)
Windows (1)
With (1)
YAGNI (10)

Resources
Agile Manifesto & Principles
Principles Of OOD
ColdFusion
CFUnit
Ruby
Ruby on Rails
JUnit

RSS 2.0: Full Post | Short Blurb
Subscribe by email: