One of the unfortunate side effects of Moore’s Law is that the immense amount of computing power at our fingertips masks many horribly inefficient practices. For example, something I have done for a very long time is use a sort, uniq, sort pipeline to tally things up. Say you have a CSV of users where the 2nd column is the country, and you want a quick count of how many users you have from each country:
$ cat u | awk -F, '{print $2}' | sort | uniq -c | sort -n
1 ca
1 mx
14 in
16 us
Obviously, for a short list that first sort is not expensive at all, and even on a file with 2 million records my current laptop finishes in 5 seconds, despite the massive sort. Still, this could be done with an O(n) algorithm like so (you could do it in Awk, of course, but I’m more fluent in Perl):
$ cat u | perl -ne 'chomp; my $k = (split(/,/))[1]; $c{$k}++; END { foreach my $i (sort { $c{$a} <=> $c{$b} } keys %c) { printf "%10d %s\n", $c{$i}, $i; } }'
1 ca
1 mx
14 in
16 us
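For those who prefer it spelled out, here is the same O(n) tally written as a commented script (a sketch that reads the CSV on standard input):
#!/usr/bin/perl
use strict;
use warnings;

# One pass over the input to tally: O(n) over the records, plus a final
# sort over the handful of distinct countries rather than over every line.
my %count;
while (my $line = <STDIN>) {
    chomp $line;
    my $country = (split(/,/, $line))[1];
    $count{$country}++;
}

foreach my $country (sort { $count{$a} <=> $count{$b} } keys %count) {
    printf "%10d %s\n", $count{$country}, $country;
}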
(Those of you who have been through a job interview with me will recognize this question… few of you got it right.)
Here’s a different example: When I was in high school the student government ran an annual fundraiser by doing a “computer dating” event: everyone would fill out a short multiple choice survey, and those would be sent off to some company to generate reports matching people.
One day, the computer teacher obliquely asked me if I could write a program that could take a list of people and randomly put together lists and generate reports. I knew what he was hinting at: let’s skip hiring the company and generate fake reports ourselves. I wrote that code, but, being a curious sort (and not entirely comfortable with the deception), I decided to try doing it the “right” way. The basic algorithm is simply to take a string of responses and match it against everyone else in the list, take the top ten matches and generate a report. Simple, right?
I worked on that program through the summer (keep in mind I was 16, writing in BASIC on an Apple ][), and I got it working. When we finally ran it on the full data set (a few hundred surveys), it sat and churned for over a week. [A personal note: since I was nursing this through, I saw every report slowly trickling out; my name came up on exactly one of those reports, and at number 7, at that. I was disappointed but not surprised.] We all know the same thing would nowadays be a few dozen lines of code and would run in seconds, despite being an O(n^2) algorithm.
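For illustration, a rough sketch of that kind of matching in Perl, with made-up names and a simplified scoring rule (count identical answers); the real thing was BASIC on an Apple ][:
#!/usr/bin/perl
use strict;
use warnings;

# Each survey is a string of single-character answers, e.g. "ABCDA".
# Score a pair by counting positions where the answers agree.
sub score {
    my ($x, $y) = @_;
    my $s = 0;
    for my $i (0 .. length($x) - 1) {
        $s++ if substr($x, $i, 1) eq substr($y, $i, 1);
    }
    return $s;
}

# Hypothetical data: name => answer string.
my %surveys = (
    Alice => 'ABCDA',
    Bob   => 'ABCCA',
    Carol => 'BBCDA',
);

# Compare every person against everyone else (the O(n^2) part),
# then report the top ten matches for each.
for my $person (sort keys %surveys) {
    my @ranked = sort { $b->[1] <=> $a->[1] }
                 map  { [ $_, score($surveys{$person}, $surveys{$_}) ] }
                 grep { $_ ne $person } keys %surveys;
    my @top = @ranked[0 .. ($#ranked < 9 ? $#ranked : 9)];
    print "$person: ", join(', ', map { "$_->[0] ($_->[1])" } @top), "\n";
}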
So, this leads to the current problem. I was generating a report and I noticed that it was taking several minutes to run. That wasn’t a big deal, but my curiosity got the best of me, so I broke out a profiler and found this:
Holy cow! It makes sense, as the source file has almost two million records and I was parsing the date on every record. Who would have thought simple date parsing could take up so much time? I don’t know how to make the date parsing itself more efficient, but I did realize that I only need the date in some limited cases, so I changed the code to parse the date only when I actually need to compare timestamps. Switching to this lazy evaluation brought me into better alignment with the three virtues. The result:
Instead of parsing 1.8 million date entries, I now only parsed 82k.
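The change amounted to something like this sketch, using Time::Piece and a made-up record layout (the actual report code is more involved): carry the raw timestamp string along and parse it only for the records that actually need a comparison.
use strict;
use warnings;
use Time::Piece;

# Before: every record's timestamp was parsed up front.
# After: parse only when a comparison is actually needed.
my $cutoff = time() - 86400;    # e.g. only care about the last day

while (my $line = <>) {
    chomp $line;
    my ($id, $raw_ts, $status) = split /,/, $line;

    # Most records are filtered out on cheap fields first...
    next unless $status eq 'ERROR';

    # ...so only the few survivors pay for date parsing.
    my $ts = Time::Piece->strptime($raw_ts, '%Y-%m-%d %H:%M:%S')->epoch;
    print "recent error: $id\n" if $ts >= $cutoff;
}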
Now, in my case, this code will never have to scale much beyond the current dataset, but if it did, there would be a catastrophe waiting to happen here.
The moral of the story is that we should all pay attention to these seemingly minor decisions and think about what happens when the dataset scales up by orders of magnitude. That expensive sort or needless date parsing could hurt someday. Think about how your code will scale. Profile your code! You may discover things you didn’t expect.
The only good thing that spammers do for us is to give us something to laugh at. The latest case was a comment in my moderation queue which started like this:
{I have|I’ve} been {surfing|browsing} online more than {three|3|2|4} hours today, yet I never found any interesting article like yours.
It continued on for quite a while, but I’ll spare you. Clearly someone screwed up their automation. Just for grins, I wanted to see if I could write a one-liner to process this input, and, voilà!
perl -pe 's(\{(.+?)\})(@a=split(/\|/, $1); $a[rand(1+$#a)])ge' < spam
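For readability, the same substitution written out as a small script (the same logic, just not crammed onto one line):
#!/usr/bin/perl -p
# For each {option1|option2|...} group, keep one option chosen at random.
s{\{(.+?)\}}{
    my @options = split /\|/, $1;
    $options[rand @options];
}ge;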
I’ve replaced a spammer with a one-liner! If only that meant they’d go away.
Some years ago I ran into a piece of code which shocked me, and in the time since then I have realized that it exemplified a lot of what is wrong with software. Sadly, I have since lost the code, so here is an approximation:
unless (open(F, "/some/important/file"))
{
# We don't want to scare the users with an error message
# warn "Unable to read config file";
}
Am I the only one who is outraged by this? Which is scarier to a user: getting an error message when a genuine error has occurred, or having the software plod on, producing ever stranger and more nonsensical errors that cascade from the initial problem? For example, imagine the following code further on:
my $req = $http->request($config->{url});
die "Unable to contact web server $config->{url}\n" unless $req;
The config structure was empty because the file could not be read due to the earlier problem, so the error message simply says “Unable to contact web server”. Now you are led to believe that the problem is with some unspecified web server. How much time will you waste trying to track that down?
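For contrast, a minimal sketch of the behavior I would rather see: fail loudly at the point where the real problem occurs, with the actual reason attached.
use strict;
use warnings;

# If the config file cannot be read, say so right away and stop;
# every error message after this point would only mislead.
my $config_file = '/some/important/file';
open(my $fh, '<', $config_file)
    or die "Unable to read config file $config_file: $!\n";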
So which is worse, “scared” or confused and frustrated?