Being a programmer, when I see something repetitive that I can automate, I normally opt to do so.
We know that one way to
save your job is by
automating, but another is to know when
not to automate. It sounds obvious, but when you get
into the habit of rooting out all duplication of effort, you can lose sight of the fact that sometimes, it costs more
to automate something than to do it by hand.
I came across such a situation the other day.
.
In this case I was working with static content on a website that wanted to go dynamic. It wasn't just a case of
writing a spider to follow all the links and dump all the HTML into a database - there was some structure to the
data, and the database would need reflect it.
In this case, there was a hierarchy of data. For simplicity's sake, let's say there were three levels to the tree:
departments, sections, and products. At the top we have very few departments. In the middle, there are several
sections per department. And there are many products in each section.
Each level of the hierarchy is different - so you'll need at least three spider/parser/scrapers. Within each level,
most of the content is fairly uniform, but there are some special cases to consider. We can also assume each
level requires roughly the same amount of effort in writing an automaton to process it's data.
It's natural to start at the top (for me, anyway -- you are free to differ), since you can use that spider to
collect not only the content for each department, but the links to the section pages as well. Then you'll
write the version for the sections which collect the content there and the links to the products. Finally, you get
to the bulk of the data which is contained in the products. (And don't forget the special cases in each level!)
But that's the wrong way to proceed.
You ought to start at the bottom, where you get the most return on your investment first. (Or at least skip the top
level.) Spidering each level to collect links to the lower levels is exceedingly easy. It's the parsing and special
cases in the rest of the content that makes each level a challenge.
Since there are so few cases at the top level, you can input that data by hand quicker than you can write the automation
device. It may not be fun, but it saves a few hours of you (and your customer's) time.
Carry on now, nothing to see here.
Hey! Why don't you make your life easier and subscribe to the full post
or short blurb RSS feed? I'm so confident you'll love my smelly pasta plate
wisdom that I'm offering a no-strings-attached, lifetime money back guarantee!
Leave a comment
There are no comments for this entry yet.
Leave a comment