My Secret Life as a Spaghetti Coder
Being a programmer, when I see something repetitive that I can automate, I normally opt to do so. We know that one way to save your job is by automating, but another is to know when not to automate. It sounds obvious, but when you get into the habit of rooting out all duplication of effort, you can lose sight of the fact that sometimes, it costs more to automate something than to do it by hand.

I came across such a situation the other day.

When populating a database with static content, it's tempting to make the spider do it all. But judiciously mixing in some human data entry may prove faster.

In this case I was working with static content on a website that wanted to go dynamic. It wasn't just a case of writing a spider to follow all the links and dump all the HTML into a database - there was some structure to the data, and the database would need to reflect it.

The data formed a hierarchy. For simplicity's sake, let's say there were three levels to the tree: departments, sections, and products. At the top, there were very few departments. In the middle, there were several sections per department. And at the bottom, there were many products in each section.

Each level of the hierarchy is different - so you'll need at least three spider/parser/scrapers. Within each level, most of the content is fairly uniform, but there are some special cases to consider. We can also assume each level requires roughly the same amount of effort in writing an automaton to process its data.

It's natural to start at the top (for me, anyway -- you are free to differ), since you can use that spider to collect not only the content for each department, but the links to the section pages as well. Then you'll write the version for the sections, which collects the content there and the links to the products. Finally, you get to the bulk of the data, which is contained in the products. (And don't forget the special cases in each level!)

But that's the wrong way to proceed.

You ought to start at the bottom, where you get the most return on your investment first. (Or at least skip the top level.) Spidering each level to collect links to the lower levels is exceedingly easy. It's the parsing and the special cases in the rest of the content that make each level a challenge.
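To give a feel for how little the link-collecting part involves, here is a minimal sketch using only Python's standard library. The `LinkCollector` class, the `collect_links` helper, the sample markup, and the example.com URLs are all made up for illustration; the real site's pages would be fetched over HTTP and would certainly look different.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Turn relative links into absolute ones so the next level
            # of the spider can fetch them directly.
            self.links.append(urljoin(self.base_url, href))


def collect_links(html, base_url):
    """Return every link found in the given HTML, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical department page listing its sections (assumed markup).
sample_page = """
<ul>
  <li><a href="/garden/tools.html">Tools</a></li>
  <li><a href="/garden/seeds.html">Seeds</a></li>
</ul>
"""
section_links = collect_links(sample_page, "http://example.com/garden/")
```

Notice that nothing here touches the actual content of the page. That part, with its special cases, is where the hours go.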

Since there are so few cases at the top level, you can input that data by hand quicker than you can write the automation device. It may not be fun, but it saves a few hours of your (and your customer's) time.
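If you want to put rough numbers on that decision, a back-of-the-envelope comparison might look like the sketch below. Every figure in it (the page counts, ten minutes of typing per page, six hours to write each scraper) is invented for illustration, and `cheaper_by_hand` is a hypothetical helper, not something from the actual project. The only thing taken from the story is the assumption that each level costs about the same to automate.

```python
def cheaper_by_hand(page_count, minutes_per_page_by_hand, hours_to_automate):
    """True when entering every page manually takes less time than
    writing the scraper for that level."""
    manual_hours = page_count * minutes_per_page_by_hand / 60
    return manual_hours < hours_to_automate


# Invented page counts for each level of the hierarchy.
levels = {"departments": 5, "sections": 40, "products": 2000}

# Assumed costs: 10 minutes per page by hand, 6 hours per scraper.
decisions = {
    name: cheaper_by_hand(count, minutes_per_page_by_hand=10, hours_to_automate=6)
    for name, count in levels.items()
}

for name, by_hand in decisions.items():
    print(name, "-> enter by hand" if by_hand else "-> automate")
```

With these made-up numbers, only the departments come out cheaper by hand. Plug in your own estimates before trusting the answer.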


