Posted by Sam on Oct 01, 2008 at 06:25 AM UTC - 5 hrs
Code Reuse Does Not Mean Copy and Paste
Pay attention - I'm only going to say this a few times. DRY
was the most important programming principle I've ever learned.
Was there a major turning point in your software development career? One occurred for me (I often half-joke)
when I learned that "code reuse" did not mean copy and paste.
(The reuse part is fine, of course. It would be moronic (but sometimes not) to build systems by rewriting
previously written components that could have been reused. Even in the exceptional case, it's not
often you'll need to rewrite everything.
Can you imagine the absurdity of academic research publications if they were unable to build upon prior
findings?
Whatever, enough of the justification. I hope we're in agreement that copy-and-paste code-reuse
is about the most evil thing you can do
to maintenance programmers. If you want to punish them, you don't send them to hell.
You duplicate as much buggy code as you can. Even better if it looks like it has the reproduction ability of the fruit fly.)
So back to copy-and-paste reuse. It's not what people mean when they say you should reuse code, or
when they tell you to write code
that is reusable. It took me a while to
become cognizant of that fact.
Because of my experience in the land of cut-and-paste,
I've always wanted to write a program that would root out code that isn't "DRY," and then just point and
laugh. Something to help it dry off. A towel for your code, if you will.
However, I did get to take a
look at the source code before our power went out, and I had an email discussion with Giles after the power
came back on.
I wanted to share with you some of the ideas we talked about in our discussion (his email used with permission, of course).
Three Types of Repetition To Detect
The way I see it, there are three types of duplication to identify
(I'm not claiming there are only three, just that I only thought of three).
1. Duplicate methods, which Towelie already identifies.
2. Methods which contain only some duplication of each other.
   I'm not sure what Towelie identifies here. I know it looks at the ParseTree,
   but the specs show only exact duplicate methods. It could be extended to find exact
   duplicated regions fairly easily.
   Something harder to find (but worthwhile, in my opinion) would be duplicate
   code which is only a part of a method, and which is not exact.
3. Duplication of result, where the methods may be doing
   the exact same thing in a different manner. We can easily check return values of two functions
   given the same input over a few discrete cases to assign probabilities of duplication. We can
   also compare the state of potentially affected objects.
   Doing so would amount to comparing member variables of objects that were passed in to the
   method as well as the object the method belongs to (checking if changes to each were made, and if so, whether they are the same changes).
   Limiting ourselves to that type of analysis would be doable and not very time consuming.
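To make the return-value idea concrete, here is a minimal Ruby sketch. It assumes the two candidates can be wrapped as callables and that a handful of sample argument lists is on hand. The names `agreement_ratio` and `outcome` are hypothetical and are not part of Towelie. It only looks at return values (and raised error classes), not at the state comparison described above.

```ruby
# Runs a callable and captures either its return value or the error class,
# so that two methods raising the same error on an input count as agreeing.
def outcome(callable, args)
  [:ok, callable.call(*args)]
rescue StandardError => e
  [:error, e.class]
end

# Hypothetical helper: fraction of sample inputs on which two callables agree.
# A ratio near 1.0 hints at duplication of result; it proves nothing.
def agreement_ratio(first, second, inputs)
  return 0.0 if inputs.empty?

  matches = inputs.count do |args|
    outcome(first, args) == outcome(second, args)
  end
  matches.to_f / inputs.size
end

sum_with_loop = lambda do |numbers|
  total = 0
  numbers.each { |n| total += n }
  total
end
sum_with_inject = ->(numbers) { numbers.inject(0) { |acc, n| acc + n } }
largest         = ->(numbers) { numbers.max }

# Each entry is one argument list (here, a single array argument).
samples = [[[1, 2, 3]], [[]], [[5]], [[-1, 1]]]

puts agreement_ratio(sum_with_loop, sum_with_inject, samples) # 1.0
puts agreement_ratio(sum_with_loop, largest, samples)         # 0.25
```

The single-element sample shows why a few discrete cases only give a probability: a sum and a maximum agree on `[5]`, so a careless choice of inputs could make unrelated methods look identical.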
Duplication of Results
Giles correctly pointed out that determining "duplication of results" is fairly easy, and people
are already doing that in the test generation world:
You mean you want to give two methods the same input and then
determine if they return the same output? That part is easy, you can
do that with a code block which auto-generates tests or specs.
...
Regarding the auto-generated testing, you could just throw the kitchen
sink at legacy methods and see which ones barf. E.g.
And I think that code would be both useful and funny, like flog or
heckle. Weird how testing tools can be witty. But I don't think that
would necessarily get you output you could actually do very much with.
I'm not convinced of the kitchen sink approach in unDRY detection either.
If you send everything you can think of or find, then the time complexity is no longer polynomial,
growing combinatorially with respect to the number of methods, the number of arguments, and the number of types
in the system.
Given enough time, it would work.
But since you're calling each method with each combination of arguments possible from the space of
all objects, my best premature optimization guess is that it would get intractable for the
usage I'm interested in.
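Some rough arithmetic shows how quickly the kitchen sink fills up. The sketch below assumes every parameter of every method is tried with every candidate value, so a method with k parameters needs (number of candidates) to the power k calls. The helper name `kitchen_sink_calls` and the 200-method code base are made up for illustration.

```ruby
# Hypothetical helper: how many calls a brute-force run would make if every
# parameter of every method were tried with every candidate value.
# `arities` holds one parameter count per method.
def kitchen_sink_calls(arities, candidate_count)
  arities.sum { |arity| candidate_count**arity }
end

# Sanity check against a real enumeration: 5 values over 2 parameters.
candidates = [nil, 0, "", [], {}]
puts candidates.repeated_permutation(2).to_a.size # 25
puts kitchen_sink_calls([2], candidates.size)     # 25

# A made-up code base: 200 methods taking 0 to 3 parameters each.
arities = [0] * 20 + [1] * 80 + [2] * 70 + [3] * 30
[10, 100, 1_000].each do |count|
  puts "#{count} candidate values: #{kitchen_sink_calls(arities, count)} calls"
end
```

Most of the total comes from the handful of three-parameter methods, which is why narrowing the candidate values per parameter matters more than trimming the number of methods.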
I don't necessarily care about generating tests or finding duplicate code within seconds
or milliseconds, but a run time in the low minutes would be a requirement, potentially as part of a build process.
Instantaneous would be awesome for running as part of my test suite (the one I run every few minutes)
but I expect code duplication to be introduced slowly, so running it less frequently might not be a
problem. I'd rather run it every time, if possible though. After all, I heard something good about
TATFT.
To get it where I think it would be most useful, you'd need to do some static analysis
to help narrow down the type of arguments that can be sent to a particular method. Doing so may provide
some clues. However, what might be more interesting is building a dynamic observer to see
what happens when objects are created and their methods are run (which would tell us what types they can accept).
I don't have any idea how I'd go about doing either of those things, but an idea Giles floated was to hack
Rubinius for doing the dynamic observation. It would be worth looking
at if you agree that finding "duplication of results" and limiting the running time are important.
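As one possible shape for the dynamic observer, here is a sketch that uses Ruby's built-in `TracePoint` rather than a hacked Rubinius. That is a swapped-in technique, not what Giles suggested, and it assumes Ruby 2.6 or later for `TracePoint#parameters`. The helper name `observe_argument_types` and the `Greeter` class are hypothetical, and only required positional parameters are recorded to keep the sketch small.

```ruby
require "set"

# Hypothetical recorder: runs the block and returns a hash that maps
# "Class#method" to the set of argument-class lists seen while it ran.
def observe_argument_types
  seen = Hash.new { |hash, key| hash[key] = Set.new }

  trace = TracePoint.new(:call) do |tp|
    # tp.parameters yields [kind, name] pairs; keep required positionals only.
    names = tp.parameters.select { |kind, _name| kind == :req }
                         .map { |_kind, name| name }
                         .compact
    # Methods without required parameters tell us nothing here, so skip them.
    next if names.empty?

    classes = names.map { |name| tp.binding.local_variable_get(name).class }
    seen["#{tp.defined_class}##{tp.method_id}"] << classes
  end

  trace.enable { yield }
  seen
end

class Greeter
  def greet(name, punctuation)
    "Hello, #{name}#{punctuation}"
  end
end

observed = observe_argument_types do
  greeter = Greeter.new
  greeter.greet("Sam", "!")
  greeter.greet(:giles, "?")
end

observed.each do |method_name, signatures|
  signatures.each { |classes| puts "#{method_name}(#{classes.join(', ')})" }
end
```

The recorded class lists could then feed the result comparison, so that each method is only called with the kinds of arguments it has actually been seen to receive.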
Methods with Partial Duplication
In its first release, Towelie only detected entirely duplicate methods. I figured it would be easy enough
to extend its usage of ParseTree to dig a bit deeper and find parts of methods that were duplicated.
When I asked Giles about it, he agreed and went into a little more depth about
the challenges (I added emphasis and formatting):
I'm probably going to have Towelie go inside methods
and find duplicate bits of code. Was just looking at that today, in
fact. But: can't guarantee it'll work, and the drawback is that you've
got these trees, if you go recursive enough you'll be comparing them
on the element-by-element level, where you'll find craploads of
duplication which is utterly meaningless. So extracting useful
information is the tricky part there.
Duplicated methods are just a
nice easy place to start - obviously if you have exact duplicates in
your code base, the next step there from a DRY perspective is easy. In
addition to extracting duplicate blocks, I also want Towelie to be
able to recognize that the methods in its current test data only
differ by one literal value. That's actually relatively easy - you can
do recursive tests for equality, collect the differences, and then
determine whether the differences represent literals. No problem.
"Easy" in the developer sense, of course, which translates in real
life to "theoretically possible and I have a vague plan."
Finding near-duplicate code fragments within a method - if I get the
other stuff working it may become possible to find this, currently
it'd be a shitload of work.
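The "differ by one literal value" check Giles describes (recursive equality, collect the differences, see whether they are literals) might look something like the sketch below. It is not Towelie's code. It assumes trees are plain nested arrays, loosely in the ParseTree style, with literals written as `[:lit, value]` or `[:str, value]`. The trees are hand-written so the example runs without a parser, and all method names are hypothetical.

```ruby
# Assumed node shapes: a literal is [:lit, value] or [:str, value];
# everything else is a symbol or a nested array.
LITERAL_TYPES = [:lit, :str].freeze

def literal?(node)
  node.is_a?(Array) && node.size == 2 && LITERAL_TYPES.include?(node.first)
end

# Walks two trees in step and collects every pair of subtrees that differ.
# Literals are compared whole, so a changed value shows up as one difference.
def tree_differences(left, right)
  return [] if left == right
  return [[left, right]] if literal?(left) || literal?(right)
  return [[left, right]] unless left.is_a?(Array) && right.is_a?(Array)
  return [[left, right]] unless left.size == right.size

  left.zip(right).flat_map { |l, r| tree_differences(l, r) }
end

# True when the trees differ and every difference is literal against literal.
def differ_only_by_literals?(left, right)
  differences = tree_differences(left, right)
  !differences.empty? && differences.all? { |l, r| literal?(l) && literal?(r) }
end

# Hand-written bodies for x + 1, x + 2 and x - 1.
add_one  = [:call, [:lvar, :x], :+, [:array, [:lit, 1]]]
add_two  = [:call, [:lvar, :x], :+, [:array, [:lit, 2]]]
subtract = [:call, [:lvar, :x], :-, [:array, [:lit, 1]]]

p tree_differences(add_one, add_two)             # [[[:lit, 1], [:lit, 2]]]
puts differ_only_by_literals?(add_one, add_two)  # true
puts differ_only_by_literals?(add_one, subtract) # false
```

Two methods that pass this check are candidates for a single method with the literal turned into a parameter.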
That problem of noise brings up the question: what do we consider duplication?
If I have a method "return x+y" versus one whose body is just "x + y", should I consider that as
repetition? In the case of one-liners, I'd say yes. But would I say in-line addition is
repetition in a general sense? Probably not.
I'd consider counting the number of consecutive duplicated lines, or the distance between duplicated lines,
in determining whether something is duplicated. You could normalize the count by dividing by the length of the
smaller method, or perhaps use something more complex.
Heuristics such as these can help in determining what is duplicate, and in finding interleaved or "almost" duplicate code.
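Here is a minimal sketch of the simplest version of that heuristic: find the longest run of consecutive identical lines in two method bodies and divide by the length of the shorter body. The method names and the sample bodies are hypothetical, and comparing stripped text lines is an assumption made for brevity. A real tool would more likely compare parse tree nodes.

```ruby
# Strips indentation and blank lines so formatting changes do not hide a match.
def normalized_lines(source)
  source.lines.map(&:strip).reject(&:empty?)
end

# Length of the longest run of consecutive lines shared by both bodies
# (the longest-common-substring table, applied to lines instead of characters).
def longest_shared_run(first, second)
  longest = 0
  previous = Array.new(second.size + 1, 0)

  first.each do |line|
    current = Array.new(second.size + 1, 0)
    second.each_with_index do |other, index|
      next unless line == other

      current[index + 1] = previous[index] + 1
      longest = [longest, current[index + 1]].max
    end
    previous = current
  end
  longest
end

# Hypothetical score in 0.0..1.0: shared run divided by the shorter body.
def duplication_score(source_a, source_b)
  first = normalized_lines(source_a)
  second = normalized_lines(source_b)
  shorter = [first.size, second.size].min
  return 0.0 if shorter.zero?

  longest_shared_run(first, second).to_f / shorter
end

report_a = <<~BODY
  total = 0
  orders.each { |order| total += order.price }
  tax = total * 0.08
  puts total + tax
BODY

report_b = <<~BODY
  log("starting")
  total = 0
  orders.each { |order| total += order.price }
  tax = total * 0.08
  shipping = 5
  puts total + tax + shipping
BODY

puts duplication_score(report_a, report_b) # 0.75
```

A threshold on the score (and a minimum run length) would be one way to keep one-line coincidences such as in-line addition out of the report.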
I wouldn't expect our DRYer to identify things that use (0..(arr.length-1)).each {...} versus
arr.each_index. Rather, I was thinking more of code that is duplicated by copy and paste,
but where the codepaster introduced a new variable in that frame as well.
Putting the question to you all
How important is the DRY principle to you? Does
repetitive code warrant having a tool to report its existence, or are you and your team doing just fine without it?
Most importantly, how would you go about detecting duplicated code, especially if you were to programmatically try to do it?
(Note on the title: The opportunity for three "tions" in a row could not be passed up for the DRYer title of
"Ideas for (Repeti + Detec + Automa) * tion and The Importance of DRY" (assuming the Distributive Property of Strings holds))
Posted by Sam on Sep 25, 2008 at 10:38 AM UTC - 5 hrs
A bit off topic from programming, but it fits in with the "future tech" or "random technology at my whim" about which I claim the right to post.
Anyway, I thought you all might appreciate it.
A rap song entitled Astrobiology. It's not "bad rap" where they're trying to be funny with rhymes. It's actually pretty good, if you ask me.
Posted by Sam on Sep 19, 2008 at 02:32 PM UTC - 5 hrs
Chad Fowler describes the problem:
What I've noticed since coming back from India is that in America we are
so focused on ourselves that we don't even take the time to learn about our
teammates from other parts of the United States. What's the special food in
Minnesota? What do Arizonans do on the weekends in their nonexistent
winters? The United States is a diverse place, and we don't even bother to
learn about our own diverse culture, much less the cultures of people on
the outside.
I don't want to get into the merits of whether or not Americans are inward-looking and selfish. The
important part here is that as the world becomes smaller, nation-states are losing importance as
cultural and political boundaries, and we're increasingly exposed to (and exposing ourselves to) new
people from unfamiliar places. We can choose to embrace this, or fight it.
It's pointless to fight it.
You can try to change the world. Or, you could think of changing yourself.
Luckily, it's fairly easy to turn this to your advantage: just show people that you care about them
as humans, not just colleagues.
If I have to depend on someone to get something
done for me or to deliver a piece of software that I have to successfully
integrate with, I'm going to have much better luck if that person feels I
respect them and if they respect me. Would you respect someone who
wouldn't even bother to learn how to pronounce your name?
...
If you show your teammates that
you are interested in them as people, you will form tighter bonds and, on
the whole, do better work.
On the contrary, you could be an ass - perhaps without even realizing it:
As I got to know our team members in India, I often heard them say that I
wasn't like the typical American manager. When I asked what they meant,
those who felt comfortable enough would say, You actually take an interest
in us. Most of you are just angry and short with us.
I had a fellow student at school say the same thing to me. He asked, "Are you natively American?" It was
eye opening to think that he had been treated so poorly by other Americans that he had to ask me if I am one
(I am).
Incidentally, this is a tactic in getting anyone to like you generally, so if you have friends, you don't
need to learn any new skills. In this case, it may be even easier because you know
tons of things exist that you don't know about them: just pick a couple and ask about them. All Chad
had to do was say "Hello, my name is Chad" in their native language.
It's about making a small effort, that's all.
Meta
This week marks the end of the Save Your Job series,
at least as far as following each chapter of Chad Fowler's book, My Job Went To India.
Why? Well, because it's the last one in the book. I'll still post to that category as things come up,
but it's not likely to be weekly.
This is a book that, in my opinion, is a must-read for software developers, and it's so short you can read it
multiple times - to remind yourself as you slip back into old habits, or to reinvigorate interest in goals
you set for yourself in times past.
In any case, I hope you've enjoyed the weekly series, and more than anything else, I hope you got something
useful from it. It was useful to me!
While I thoroughly enjoyed Chad's book, I must say I'm glad to be done with the series. I've been wanting
to free up some time to do some more technical things, like playing with my new Arduino Diecimila.
The justification for paying out the ass for just a few minutes of your proctologist's time is that they need to pay their insane educational loans. And don't forget about how hard they worked in school that long just to learn their asscraft.
"Still," we always say, "do we really need the doctor? My nurse did everything I needed done, and I could have told you I have strep throat and needed some penicillin." Even if we don't trust the medical support staff to make the decisions, we have ways of making those decisions without a doctor.
As medical expert systems become better, we might expect the doctor to become obsolete. From my vantage point, I'm unable to see what about a doctor could be better than inputting a list of symptoms to a machine and getting back the most likely diagnoses. The machine could ask questions to further refine the results. It could even consult a list of prescriptions (drugs, therapies, surgeries, et cetera), cross reference it with your medical profile (or DNA, when we have medicines tailored to individual genomes), and give you advice on what to do next.
Some people might even think an application like that would be better than human doctors.
Under such a system, it's unlikely all human doctors would become obsolete. For example, we'll always have a need for the maverick hacker doctor who can think outside the box.
But in most cases, most doctors are going to continue to diagnose the same diseases and prescribe the same (potentially biased) treatments to each patient who comes in. It's a factory, and we're on the conveyor belts.
Of course, even if we were to have the capability to replace most doctors (my belief is that we do), I don't think most people would feel comfortable consulting a machine about their problems. Dr. Sbaitso only gets us so far. We want the comfort of another human telling us what's wrong with our health, not a heartless machine.
Still, I think it would be interesting to see the results of large scale experiments pitting man vs. machine in the field of medicine. How much more successful is one over the other? Is that success due only to non-life-threatening conditions? Would that benefit to society be outweighed if one failed more often on the catastrophic problems of a few individuals?
I'm not a doctor, and I can't tell you exactly what they bring to the table. Could be something I've completely overlooked, or something we're unlikely to know unless we're in that industry. But that's how it looks to me. I'd like to confirm or disprove that hypothesis with a test.
Posted by Sam on Sep 12, 2008 at 08:25 AM UTC - 5 hrs
If we accept the notion that we need to figure out how to work with outsourcing
because it's more likely to increase than decrease or stagnate, then it would be beneficial for us to become
"Distributed Software Development Experts" (Fowler, pg 169).
To do that, you need to overcome challenges associated
with non-colocated teams that exceed those experienced by teams who work in the same geographic location.
Chad lists a few of them in this week's advice from
My Job Went To India (I'm not quoting):
Communication bandwidth is lower when it's not face to face. Most will be done through email,
so most of it will suck comparatively.
Being in (often widely) different time zones means synchronous communication is limited to few overlapping
hours of work. If you get stuck and need an answer, you stay stuck until you're in one of those overlaps.
That sucks.
Language and cultural barriers contribute to dysfunctional communication. You might need an accent-to-accent
translator to desuckify things.
Because of poor communication, we could find ourselves in situations where we don't know what each other
is doing. That leads to duplicative work in some cases, and undone work in others. Which leads to
more sucking for your team.
The bad news is that there's a lot of potential to suck. The good news is there's already a model
for successful and unsuccessful geographically distributed projects: those of open source.
You can learn in the trenches by participating. You can find others' viewpoints on successes and
failures by asking them directly, or by reviewing
open source project case studies.
Try to think about the differences and be creative with ways to address them.
Doing that means you'll be better equipped to cope with the challenges inherent
in outsourced development. And it puts you miles ahead of your bitchenmoaning colleagues who end
up trying to subvert the outsourcing model.