Our second meeting of the
UH Code Dojo was just as fun as the first. This time, we decided to switch languages from Java to Ruby. And although we started quite slowly (we came unprepared and undecided on a problem and a language), we pretty much finished the anagram problem.
Now, I mentioned it was slow at first - because we were trying to decide on a problem. I'm guessing we spent about 30-45 minutes just looking over
Ruby Quiz and then moving on to
Pragmatic Dave's Code Kata. We learned from our experience though, and resolved to determine before-hand what we would do in the future. In any case, we finally decided on
anagrams as our problem, and one of us mentioned something to the effect of "I don't know about you all, but I use Java too much at work." Of course, there's not much you can say to an argument like that - Ruby it was!
Since
we found ourselves violating YAGNI at the first meeting, we decided to do a little more discussion of the problem before we started coding. One of the initial paths we explored was looping over each letter and generating every possible combination of letters, from 1 to n (n being the length of the input). We then realized that would need a variable number of nested loops, so we moved on to recursion. After that, we explored trying to use
yield
in conjunction with recursion, in an isolated environment. I don't recall the reasoning behind it, but whatever it was, we were starting to discover that when we passed that fork on the road a few minutes back, we took the path that led to the cannibals. (As a side note, if you're unfamiliar:
yield
sits in a function, which takes a closure as an argument, and runs the code in the closure -- I think that's a simple way of putting it, anyway).
After smelling the human-stew awaiting us, we backtracked a little and started looking for another idea. Enter idea number two: I'm not sure how to describe it in words, so I'll just show the code now and try to explain it afterwards:
char_count = Array.new(26).fill(0)
dictionary = ['blah', 'lab', 'timmy', 'in', 'hal', 'rude', 'open']
word = "BlAhrina".downcase
word.each_byte { |x| char_count[x - ?a] += 1 }
dictionary.each do |entry|
char_count2 = char_count.clone
innocent = true
entry.each_byte do |letter|
index = letter - ?a
if char_count2[index] > 0
char_count2[index] -= 1
else
innocent = false
break
end
end
puts entry if innocent
end
That's it: quite simple. First we initialize an array with a cell corresponding to each letter of the alphabet. Each cell holds a number, which represents the number of times that letter is used in out input, called
word
. These cells are set by using the line
word.each_byte {...}
.
Then for each entry in the dictionary, we do something similar: loop through each letter. If the total count for each letter goes to 0, we have a match (and in our case, simply print it to the standard output device). It's really a simple, elegant solution, and I think we're all glad we didn't go down the painful path of recursion. It would be fairly trivial to add a file which contained a real dictionary, and loop over that instead of each word in our array, but we didn't have one handy (nor did we happen to notice that Dave had provided one). And it would have been just an extra step on top of that to find all the anagrams in the dictionary.
I know this is just a silly little problem that you're not likely to encounter in real life, but it shows how even the simplest of problems can be tough without some thought, and I found it to be great practice. In particular, one problem we had was with trying to use
TDD. Although we spent some time looking for tests we could write, and ways to test, and we even wrote an empty test thinking about what to put in there - none of us seemed to find anything to test. Now that we see the solution, it's fairly easy to figure out how to test it, but trying to drive the design with the test was proving fruitless for us. Dave alluded to this on the Kata page:
Apart from having some fun with words, this kata should make you think somewhat about algorithms. The simplest algorithms to find all the anagram combinations may take inordinate amounts of time to do the job. Working though alternatives should help bring the time down by orders of magnitude. To give you a possible point of comparison, I hacked a solution together in 25 lines of Ruby. It runs on the word list from my web site in 1.5s on a 1GHz PPC. It’s also an interesting exercise in testing: can you write unit tests to verify that your code is working correctly before setting it to work on the full dictionary.
I didn't read that before we had started (in fact, it wasn't until we had finished that anyone noticed it), but as you can tell, this exercise performed as promised. Our solution was under 25 lines, and while we didn't test it on his word list, I think our results would have been comparable (in fact, I wouldn't be surprised if we had the same basic solution he did).
Thoughts anybody?
Hey! Why don't you make your life easier and subscribe to the full post
or short blurb RSS feed? I'm so confident you'll love my smelly pasta plate
wisdom that I'm offering a no-strings-attached, lifetime money back guarantee!
Leave a comment
A better way of doing this might be
#Compare words that are sorted by characters
dictionary.each do |entry|
word.split('').sort == entry.split('').sort return entry
end
If you can obtain the dictionary words already sorted by characters, this can be even faster to search with radix sort (some performance tuning with custom C code underneath Ruby? / I don't know anything about that part but I would assume that it is possible).
Posted by
Dat Chu
on Mar 30, 2007 at 08:05 PM UTC - 6 hrs
I really like that approach - it *greatly* simplifies that loop. Good call! But, how does it find words with length less than the length of our word (in our case, "BlAhrina")? (I may just be missing something)
Also, I'm not completely sure how getting the dictionary with each word sorted by the characters would work out. I understand how it would be faster, but I think we'd need 2 dictionaries, and then some way to map the sorted version to the normal one (otherwise, how do we know what the entry was?)
Could you explain further, or is my thinking correct?
Posted by
Sam
on Mar 31, 2007 at 09:38 AM UTC - 6 hrs
Maybe my definition of anagram is wrong but I thought that blah and blahrina are not anagram of each other.
However, we can subtract the array elements (Class Array has this method built in, I believe) from each other to determine if one has all the elements of another.
To obtain the sorted dictionary, we can perform multi-thread/process sorting to obtain it. Then serialize and save it on the disk or something for easy loading in the future.
Posted by
Dat Chu
on Mar 31, 2007 at 03:25 PM UTC - 6 hrs
On looking more into this, array difference behavior cannot be used like I mentioned. I believe you might need to modify it a bit.
Posted by
Dat Chu
on Mar 31, 2007 at 06:52 PM UTC - 6 hrs
No, you are right about the definition of anagram, according to Wikipedia, you should use "all the original letters exactly once." However, a couple of dictionaries I checked are a bit more vague, though none explicitly permit words with less length.
I guess we just decided to be more liberal in our definition of anagram =). This would certainly work for "real" anagrams then (well, I haven't tested it, but it certainly looks sound, and it is much more elegant than looping through each letter!)
Posted by
Sam
on Apr 01, 2007 at 12:54 PM UTC - 6 hrs
And on your difference thing - I think I have seen something like that before. Maybe for our next problem we can do something that uses it so we can get a better idea.
I'm finalizing the decision tonight and should have it sent to the group before Monday afternoon...
Posted by
Sam
on Apr 01, 2007 at 12:55 PM UTC - 6 hrs
Leave a comment