Friday, October 20, 2006

Think Big, Program Less

Today I was programming an indexing application in Java. It starts from a specified path and creates a Hash table with all files in that path and all folders underneath. Its purpose is to find duplicate files even if their filename is different. The code is pretty simple but I made one critical mistake: I didn't stop and think how my procedures would behave in a large scale. That cost me about an hour of debugging (more like my head banging against the wall).

Here is the problem: Although the code is correct, it recursively does it all in a single method. This means the Garbage Collector, Java's memory freeing mechanism, won't do anything until this entity is no longer in use. As a result no memory is being freed during the application's operation. This is very hard to notice when using small files as a test bench but what happens when you have to index a couple of dozen of gigabytes? I'll tell you what happens: withing the first seconds of runtime, the application consumes all available memory and the Virtual Machine crashes. If you feed (the beast) with more memory it will simply grow bigger before crashing but will never finish. Only if you could provide virtual memory equal to the total size of the files to index, the application would complete its job but that's impossible and of course a very very very very very very bad idea!

The solution: design a new method explicitly for hashing the files, one at a time. So for every file, you invoke that method, load its contents in memory, digest them, unload the method, release its contents (for the appetite of the Garbage Collector) and return the result. So simple! And again let me stress out that the unsuspected developer would consider the two approaches as equal.

I was pretty sure my application would work the first time and was about to release it through my web site when, just for the fun of it, decided to check out if I had any duplicate MP3s. Out of pure luck I discovered that my code would behave very badly (or if you like, would not behave at all) under real-life conditions.

What I've learned from this is to think out of the box (at least try) and try different angles when designing something. My point of view may be entirely different that yours and I have to take all factors into account if I expect my programs to function properly in systems besides my own :)
Oh, there's another useful point that comes out here: you may be using a high-level language but you must never forget your computer's architecture and capabilities. In this case, don't forget about memory management just because Garbage Collector does it for you.

P.S.: This reminded me of a major bug caused by the overflow of a common “int i” temporary variable in an “average number calculation” implementation. I remember pointing out that certain programming “habits” should be revised to avoid (at best) the embarrassment. You can find the story here.

No comments: