My Secret Life as a Spaghetti Coder
Code Reuse Does Not Mean Copy and Paste
Pay attention - I'm only going to say this a few times. DRY was the most important programming principle I've ever learned.

Was there a major turning point in your software development career? One occurred for me (I often half-joke) when I learned that "code reuse" did not mean copy and paste.

The technique of taking prior art, text, or symbols and rearranging them into something new and valuable may work for artists and spammers (or not, if you're Tristan Tzara at a 1920s Surrealist rally), but it's no way to write a program.

Cut-up, and pasted back together.

(The reuse part is fine of course. It would be moronic (though sometimes it's not) to build systems from scratch when previously written components could have been reused. Even in the exceptional case, it's not often you'll need to rewrite everything.

Can you imagine the absurdity of academic research publications if they were unable to build upon prior findings?

Whatever, enough of the justification. I hope we're in agreement that copy-and-paste code-reuse is about the most evil thing you can do to maintenance programmers. If you want to punish them, you don't send them to hell. You duplicate as much buggy code as you can. Even better if it looks like it has the reproduction ability of the fruit fly.)

Or, you know, like rabbits.

So back to copy-and-paste reuse. It's not what people mean when they say you should reuse code, or when they tell you to write code that is reusable. It took me a while to become cognizant of that fact.

Because of my experience in the land of cut-and-paste, I've always wanted to write a program that would root out code that isn't "DRY," and then just point and laugh. Something to help it dry off. A towel for your code, if you will.

Because of that, I was pretty excited when Giles Bowkett announced Towelie, a Ruby library for keeping your code DRY. I wanted to hack away at it that weekend. Unfortunately, Hurricane Ike had other plans for me.

Ike from above

Ike from below, just before the SHTF.

However, I did get to take a look at the source code before our power went out, and I had an email discussion with Giles after the power came back on.

Illustrating the conversation with a small screen cap of some email...

I wanted to share with you some of the ideas we talked about in our discussion (his email used with permission, of course).

Three Types of Repetition To Detect
The way I see it, there are three types of duplication to identify (I'm not claiming there are only three, just that I only thought of three).
  1. Duplicate methods, which Towelie already identifies.

  2. Methods that duplicate only part of one another. I'm not sure what Towelie identifies here. I know it looks at the ParseTree, but the specs show only exact duplicate methods. It could be extended to find exactly duplicated regions fairly easily.

    1. Something harder to find (but worthwhile, in my opinion) would be duplicate code which is only a part of a method, and which is not an exact copy.

  3. Duplication of result, where the methods may be doing the exact same thing in a different manner. We can easily check return values of two functions given the same input over a few discrete cases to assign probabilities of duplication. We can also compare state of potentially affected objects.

    Doing so would amount to comparing member variables of the objects that were passed in to the method, as well as the object the method belongs to (checking whether changes were made to each and, if so, whether they are the same changes). Limiting ourselves to that type of analysis would be doable and not very time consuming.
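
To make the third type a bit more concrete, here's a minimal sketch of comparing return values over a few discrete cases. The helper names (result_agreement, outcome) and the sample lambdas are made up for illustration, and it only looks at return values and raised errors, not at changes in object state:

```ruby
# Hypothetical helper: estimates how likely two callables are "duplicates of
# result" by feeding both the same sample inputs and comparing what comes back.
def result_agreement(first, second, samples)
  return 0.0 if samples.empty?

  matches = samples.count do |args|
    outcome(first, args) == outcome(second, args)
  end
  matches.to_f / samples.size
end

# Runs a callable and captures either its return value or the error class it raised.
def outcome(callable, args)
  [:returned, callable.call(*args)]
rescue StandardError => e
  [:raised, e.class]
end

sum_with_reduce = ->(numbers) { numbers.reduce(0) { |total, n| total + n } }
sum_with_loop = lambda do |numbers|
  total = 0
  numbers.each { |n| total += n }
  total
end
first_element = ->(numbers) { numbers.first }

# Each sample is one argument list; here every method takes a single array.
samples = [[[1, 2, 3]], [[]], [[10]], [[-4, 4]]]

puts result_agreement(sum_with_reduce, sum_with_loop, samples)  # 1.0
puts result_agreement(sum_with_reduce, first_element, samples)  # 0.25
```

A score of 1.0 over a handful of samples is only a hint that two methods might be duplicates, not proof. The quality of the samples matters far more than the arithmetic.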
Duplication of Results
Giles correctly pointed out that determining "duplication of results" is fairly easy, and people are already doing that in the test generation world:
You mean you want to give two methods the same input and then determine if they return the same output? That part is easy, you can do that with a code block which auto-generates tests or specs.
...
Regarding the auto-generated testing, you could just throw the kitchen sink at legacy methods and see which ones barf. E.g.

lambda{maybe_this_takes_a_string("why not")}.should_not raise_error

And I think that code would be both useful and funny, like flog or heckle. Weird how testing tools can be witty. But I don't think that would necessarily get you output you could actually do very much with.
I'm not convinced of the kitchen sink approach in unDRY detection either. If you send everything you can think of or find, then the time complexity is no longer polynomial, growing combinatorially with respect to the number of methods, the number of arguments, and the number of types in the system.

Given enough time, it would work. But since you're calling each method with each combination of arguments possible from the space of all objects, my best premature optimization guess is that it would get intractable for the usage I'm interested in.
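
To put rough numbers on that growth, here's a back-of-the-envelope sketch. The arities and type counts are invented, and it assumes just one representative value per type, so a real run would be worse:

```ruby
# Counts the calls needed to try every combination of candidate argument types
# against every method. `arities` holds one (hypothetical) arity per method.
def kitchen_sink_calls(arities, type_count)
  arities.sum { |arity| type_count**arity }
end

puts kitchen_sink_calls([1, 2, 3], 10)   # 10 + 100 + 1000 = 1110
puts kitchen_sink_calls([1, 2, 3], 100)  # 100 + 10_000 + 1_000_000 = 1_010_100
```

Three small methods and a hundred types already mean about a million calls, before comparing any results between methods.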

I don't necessarily care for generating tests or finding duplicate code within seconds or milliseconds, but lower minutes would be a requirement, potentially as part of a build process.

Instantaneous would be awesome for running as part of my test suite (the one I run every few minutes) but I expect code duplication to be entered slowly, so running it less frequently might not be a problem. I'd rather run it every time, if possible though. After all, I heard something good about TATFT.

To get it where I think it would be most useful, you'd need to do some static analysis to help narrow down the type of arguments that can be sent to a particular method. Doing so may provide some clues. However, what might be more interesting is building a dynamic observer to see what happens when objects are created and their methods are run (would tell us what types it can accept).
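
As a rough illustration of the dynamic observer idea, here's a sketch using Ruby's built-in TracePoint (it assumes Ruby 2.6 or later for TracePoint#parameters). The observe_argument_types helper and the Greeter class are hypothetical, and it only sees methods defined in Ruby, not ones implemented in C:

```ruby
# Hypothetical observer: records which argument classes each method is
# actually called with while a block of ordinary code runs.
def observe_argument_types
  seen = Hash.new { |hash, key| hash[key] = [] }

  trace = TracePoint.new(:call) do |tp|
    # Only look at named required/optional positional parameters.
    names = tp.parameters
              .select { |kind, name| name && %i[req opt].include?(kind) }
              .map(&:last)
    classes = names.map { |name| tp.binding.local_variable_get(name).class }
    # Keep each distinct combination of argument classes once.
    seen["#{tp.defined_class}##{tp.method_id}"] |= [classes]
  end

  trace.enable { yield }
  seen
end

class Greeter
  def greet(name, times)
    Array.new(times) { "hello #{name}" }
  end
end

observed = observe_argument_types do
  greeter = Greeter.new
  greeter.greet("sam", 2)
  greeter.greet(:giles, 1)
end

observed.each do |method_name, signatures|
  puts "#{method_name}: #{signatures.inspect}"
end
```

Run against a test suite, a record like this could narrow the candidate inputs for each method to the types it has actually been seen accepting.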

I don't have any idea how I'd go about doing either of those things, but an idea Giles floated was to hack Rubinius for doing the dynamic observation. It would be worth looking at if you agree that finding "duplication of results" and limiting the running time are important.

Methods with Partial Duplication
In its first release, Towelie only detected entirely duplicate methods. I figured it would be easy enough to extend its usage of ParseTree to dig a bit deeper and find parts of methods that were duplicated. Asking Giles about it, he agreed and went in a little more depth about the challenges (I added emphasis and formatting):
I'm probably going to have Towelie go inside methods and find duplicate bits of code. Was just looking at that today, in fact. But: can't guarantee it'll work, and the drawback is that you've got these trees, if you go recursive enough you'll be comparing them on the element-by-element level, where you'll find craploads of duplication which is utterly meaningless. So extracting useful information is the tricky part there.

Duplicated methods are just a nice easy place to start - obviously if you have exact duplicates in your code base, the next step there from a DRY perspective is easy. In addition to extracting duplicate blocks, I also want Towelie to be able to recognize that the methods in its current test data only differ by one literal value. That's actually relatively easy - you can do recursive tests for equality, collect the differences, and then determine whether the differences represent literals. No problem. "Easy" in the developer sense, of course, which translates in real life to "theoretically possible and I have a vague plan."

Finding near-duplicate code fragments within a method - if I get the other stuff working it may become possible to find this, currently it'd be a shitload of work.
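
The "only differ by one literal value" check Giles describes could look something like the sketch below. This is not Towelie's code. The trees are hand-written nested arrays in the general style of ParseTree s-expressions, and the node names (:lit, :str, :call, :lvar) are assumptions for illustration:

```ruby
# Node types treated as literals in this sketch (an assumption, loosely
# modelled on ParseTree-style s-expressions).
LITERAL_NODES = %i[lit str].freeze

# Walks two trees in parallel and returns the pairs of subtrees that differ.
def tree_differences(left, right)
  return [] if left == right

  both_nodes = left.is_a?(Array) && right.is_a?(Array)
  same_shape = both_nodes && left.size == right.size && left.first == right.first
  # Stop at literal nodes so a whole literal is reported as one difference.
  return [[left, right]] if !same_shape || LITERAL_NODES.include?(left.first)

  left.zip(right).flat_map { |l, r| tree_differences(l, r) }
end

def literal?(node)
  node.is_a?(Array) && LITERAL_NODES.include?(node.first)
end

# True when the trees differ, but only in literal values.
def differ_only_by_literals?(left, right)
  differences = tree_differences(left, right)
  !differences.empty? && differences.all? { |l, r| literal?(l) && literal?(r) }
end

sales_tax_body  = [:call, [:lvar, :amount], :*, [:lit, 0.05]]
luxury_tax_body = [:call, [:lvar, :amount], :*, [:lit, 0.25]]
shipping_body   = [:call, [:lvar, :amount], :+, [:lit, 4.99]]

p tree_differences(sales_tax_body, luxury_tax_body)          # [[[:lit, 0.05], [:lit, 0.25]]]
p differ_only_by_literals?(sales_tax_body, luxury_tax_body)  # true
p differ_only_by_literals?(sales_tax_body, shipping_body)    # false
```

Only method bodies are compared here. Comparing whole definitions would also flag the differing method names, which are not literals.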
That problem of noise brings up the question: what do we consider duplication?

If I have a method "return x+y" versus one whose body is just "x + y" should I consider that as repetition? In the case of one-liners, I'd say yes. But would I say in-line addition is repetition in a general sense? Probably not.

I'd consider counting the number of consecutive matching lines, or measuring how far apart the matching lines are, in determining if something is duplicated. You could normalize it by dividing by the length of the smaller method, or perhaps something more complex.
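
One way to read that heuristic, as a minimal sketch: find the longest run of consecutive identical lines shared by two method bodies and divide by the line count of the shorter one. The method names and the two sample bodies are made up for illustration:

```ruby
# Strips whitespace and drops blank lines so formatting does not hide a match.
def significant_lines(source)
  source.lines.map(&:strip).reject(&:empty?)
end

# Length of the longest run of consecutive identical lines shared by both
# line lists (longest-common-substring dynamic programming over lines).
def longest_shared_run(first_lines, second_lines)
  best = 0
  previous = Array.new(second_lines.size + 1, 0)

  first_lines.each do |line|
    current = Array.new(second_lines.size + 1, 0)
    second_lines.each_with_index do |other, index|
      next unless line == other

      current[index + 1] = previous[index] + 1
      best = [best, current[index + 1]].max
    end
    previous = current
  end
  best
end

# Hypothetical score: longest shared run divided by the shorter method's length.
def duplication_score(first_source, second_source)
  first_lines = significant_lines(first_source)
  second_lines = significant_lines(second_source)
  shorter = [first_lines.size, second_lines.size].min
  return 0.0 if shorter.zero?

  longest_shared_run(first_lines, second_lines).to_f / shorter
end

report_body = <<~RUBY
  rows = fetch_rows(query)
  rows = rows.reject(&:archived?)
  rows = rows.sort_by(&:created_at)
  render_report(rows)
RUBY

export_body = <<~RUBY
  log("starting export")
  rows = fetch_rows(query)
  rows = rows.reject(&:archived?)
  rows = rows.sort_by(&:created_at)
  write_csv(rows)
RUBY

puts duplication_score(report_body, export_body)  # 0.75
```

Where to set the threshold for "this is duplication" is a judgment call. A one-line match between two ten-line methods scores 0.1 and is probably noise.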

Heuristics such as these can help in determining what is duplicate, and in finding interleaved or "almost" duplicate code. I wouldn't expect our DRYer to identify things that use (0..(arr.length-1)).each {...} versus arr.each_index. On the contrary, I was thinking more of code that was duplicated by copy and paste, but where the codepaster introduced a new variable in that frame as well.
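
Here's a sketch of one narrow slice of that case: pasted code that is identical except for renamed variables. It uses Ripper from Ruby's standard library to tokenize (and Array#filter_map, so Ruby 2.7 or later). The normalized_tokens helper and the snippets are hypothetical. It treats every identifier that doesn't follow a "." as renameable, which is a blunt assumption, and it won't catch a paste where an extra variable was added:

```ruby
require "ripper"

IGNORED_TOKENS = %i[on_sp on_nl on_ignored_nl on_comment].freeze

# Returns the token texts of a snippet with identifiers replaced by numbered
# placeholders in order of first appearance, so consistent renames compare equal.
# Identifiers that directly follow a "." are kept, since those are method calls.
def normalized_tokens(source)
  placeholders = {}
  previous_type = nil

  Ripper.lex(source).filter_map do |_position, type, text|
    next if IGNORED_TOKENS.include?(type)

    renameable = type == :on_ident && previous_type != :on_period
    previous_type = type
    renameable ? (placeholders[text] ||= "id#{placeholders.size + 1}") : text
  end
end

original = <<~RUBY
  total = 0
  items.each { |item| total += item.price }
  total
RUBY

pasted = <<~RUBY
  sum = 0
  products.each { |product| sum += product.price } # copied from the cart code
  sum
RUBY

different = <<~RUBY
  sum = 0
  products.each { |product| sum += product.weight }
  sum
RUBY

puts normalized_tokens(original) == normalized_tokens(pasted)     # true
puts normalized_tokens(original) == normalized_tokens(different)  # false
```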

Putting the question to you all
How important is the DRY principle to you? Does repetitive code warrant having a tool to report its existence, or are you and your team doing just fine without it? Most importantly, how would you go about detecting duplicated code, especially if you were to programmatically try to do it?

(Note on the title: The opportunity for three "tions" in a row could not be passed up for the DRYer title of "Ideas for (Repeti + Detec + Automa) * tion and The Importance of DRY" (assuming the Distributive Property of Strings holds))



Comments

@sammy, i think (bells go off here) i'm doing a good job with keeping things DRY. I know where I have some opportunities where I rushed some code through, but my biggest concern is not what I code, but what I inherit.

i am currently working on a process that the previous approach was, make a copy of the file and change the query. that was done 96 times (that are active... unknown how many orphaned files are in the dir). that is a lot of duplication that i am proud to rid.

i think it would perhaps be a good tool to help people think differently. there may be places that i don't realize i am duplicating efforts. at the same time, i may need a tip that says, "if you change these 3 pieces you'll be more abstract and reduce 4 methods to 1".

i would say this is mostly dependent on the developer, and their skill level.

Posted by shag on Oct 01, 2008 at 11:52 AM UTC - 6 hrs

To detect methods that produce the same output, why not just run a static code analysis? Should not be too hard to do an abstract interpretation of two methods and record the side effects & ret value, and then compare that.

Posted by Adrian Kuhn on Oct 01, 2008 at 06:36 PM UTC - 6 hrs

@shag - I feel your pain and your pride! I probably created that system a time or two.

@Adrian - Thanks for commenting.

Something like that is easy enough to do in Java or C#, where we explicitly declare the type of a variable. My question was aimed more at languages where that's not the case - where we could be throwing /anything/ at methods to see if it sticks.

Does that change your response any?

Posted by Sammy Larbi on Oct 03, 2008 at 11:03 AM UTC - 6 hrs

It can be done in dynamic languages. It is actually simple in dynamic languages :) I have run similar analysis for Smalltalk, which turned out to be fairly "simple" using an abstract interpreter. It might require some more work in Ruby though, since your syntax is more complex and since Ruby is not shipped with a ready-to-subclass Ruby interpreter written in Ruby (at least as far as I know).

Another solution might be to invoke the method with "recorder objects". In Smalltalk this is realized using an Object that subclasses from nil and thus does not understand any messages, and then you can record all sends in the method_missing hook. However, I do not see how one could achieve to record any messages sent to self or global constants with that approach.

Posted by Adrian Kuhn on Oct 04, 2008 at 06:11 PM UTC - 6 hrs

s/actually simple in/actually simpler in/

Posted by Adrian Kuhn on Oct 04, 2008 at 06:12 PM UTC - 6 hrs

Adrian - Smalltalk's abilities are certainly the stuff of legend. I've not worked in it myself, but I know that having a self-interpreter adds tons of power.

Ruby does not ship with one, but there is a project, Rubinius, whose goal is just that. It is mentioned in the post, but I didn't think to use it in the way you're talking about here.

Thanks for the great idea. Do you have any recommendations on Smalltalk projects to look at, perhaps even the one you mentioned you worked on?

Posted by Sammy Larbi on Oct 06, 2008 at 08:19 PM UTC - 6 hrs

Hi Sammy,

I just wanted to point out that in Java, this sort of thing is possible with CPD and Checkstyle.

CPD: http://pmd.sourceforge.net/cpd.html
Checkstyle: http://checkstyle.sourceforge.net/5.x/config_dupli...

Both of those utilities can be easily integrated into a maven build as a plugin as described here: http://docs.codehaus.org/display/MAVENUSER/MavenPl...

It's pretty sweet.

Take care, and thanks for the content!

Damian

Posted by Damian Carrillo on Oct 09, 2008 at 01:18 PM UTC - 6 hrs

Dear Sammy, please apologize that I cannot yet reply to your request. I am busy preparing for OOPSLA, I will polish and publish my two related projects (ie RBCrawler and TeachableObeject) in the week after.

Posted by Adrian Kuhn on Oct 14, 2008 at 07:48 PM UTC - 6 hrs

@Damian: Thanks for posting those resources.

@Adrian: Take your time, and good luck! =)

Posted by Sammy Larbi on Oct 15, 2008 at 07:10 AM UTC - 6 hrs
