Sunday, July 16, 2006

Reminder: reading from disk is expensive

M_01898Diddlbiker is working on a little project at work. What the project consists of is not important, but in one step of the process I have to find transportation links between two locations.
There is all kind of theory written over this, but in this case, the problem was very simple: I have rail links in table 'A', and truck links in table 'B'. What is the best routing option from Houston to New Jersey? Straight over Newark or is railing to Philadelpia so much cheaper that the extra trucking cost doesn't matter.
So, for each route there are about thirty rail links to choose from, and there is only one truck link that connects the rail link to the final destination. So what's the big deal? I have to do it for 200,000 entries, and that takes some time.
The initial approach was a brute-force pure-SQL attempt. Well, that didn't go over that well.
The second approach was a VBA approach. Each link would start with a recordset of all rail links originating from it's starting point, for each rail link I can than look up (another recordset) the truck link, and when I all have them, find the cheapest one. That worked, but it took an amazing amount of time - after 45 minutes of processing, only about 1% of the entries was done!
From here I had three options:

  1. Intelligent approach: go and study and find an optimal way of doing it. I don't think there is a lot to be gained here - I'm only investigation 30 options per link, after all

  2. Reduced data: instead of 30 options per link, I could reduce the rail set by 30% by predeterming what the 10 links are that are closest to the final destination. I wouldn't want to go below 10 - 5 or 6 links might be the closest location, but different stations, and you really want to see if routing over a different location works. It would reduce processing time by 70% - but processing would still take around 20 to 30 hours.

  3. Caching: a lot of time is spent reading the truck rates, and there aren't that many of them (about 110,000 to choose from). Writing a simple cache wasn't that hard, and the result was amazing - total processing time went down to 45 minutes.

One of the nice things about VBA is that you get a pretty good Dictionary for free (it's in the Scripting Runtime Library). A small hashing function was written in minutes.

Total time invested: about one hour
Total time saved: about 80

Yay me!

No comments: