I used to rant on here about awful error messages, but I kind of gave up as the problem kept getting worse. It seemed that error messages were getting less informative and error handling in code was getting worse. (I complained about this years ago) But here’s a cautionary tale about what the cost of this sloppiness can be.
So I inherited an internal service from another team, and one day I go to deploy an update, and the service will not start. I barely understand this thing or the language it is written in, and now I have to debug it!
At first I think the service is stuck trying to contact some external service which is gone. But after studying the error messages I find a cyclical nature to the errors: “starting service”, “starting subsystem x”, “starting subsystem y”, etc., “starting service”, repeating endlessly. So it is not hanging but crashing repeatedly with no error messages.
Now I come to the horrifying realization: If my service should go down for any reason, it will not start up again. I cannot patch the OS, or do anything which would cause a restart. If that happens, I will now have a crisis on my hands, and I don’t even know what is happening.
I poked around further and discovered that the on-disk logs had one more error message. I suppose this error did not make it out to the logging server before the service crashed. Sounds like another bug to me. But this error message was bewildering: “java.sql.SQLException: An attempt by a client to checkout a Connection has timed out.
” That makes no sense. First off, which database? The service connects to several. Secondly, the time between the previous log entry and this one is around 2 seconds; no sane person would set a timeout that short. I found a timeout setting which was, bewilderingly, set to 1 second! I changed it to 10 seconds. Now the time gap in the log was 20 seconds. I am not sure why it is off by a factor of two. Perhaps there are two attempts to connect, or there is a straight up math error.
Clearly there is something going wrong while connecting to the database, but that makes no sense: the database is running, and nothing has changed in any of the servers or network settings. That I know of. Better test it. The MySQL command line client gets through just fine. But maybe the library being used is at fault, so now I have to learn enough of this language to write a simple database client. Several hours later I have some code working. It works fine in every relevant context. So the “timeout” is, as I thought, a lie.
I dig through the stack trace and it seems whatever is going wrong is happening right here:
Connection out = driver().connect( jdbcUrl, properties );
if (out == null)
throw new SQLException("Apparently, jdbc URL '" + jdbcUrl + "' is not valid for the underlying " +
"driver [" + driver() + "].");
From the stack trace it is clear the exception is being thrown, but somewhere along the line that error message, as uninformative as it is (i.e. there is no indication of why it failed), is getting lost or suppressed.
Fortunately, one of the previous maintainers pointed out a way to enable more detailed logging. Miraculously, it revealed this error message:
java.sql.SQLException: The server time zone value 'UTC' is unrecognized or represents more than one time zone. You must configure either the server or JDBC driver (via the serverTimezone configuration property) to use a more specifc time zone value if you want to utilize time zone support.
So a fatal error is happening when connecting to the database. But that fatal error is suppressed unless debugging output is enabled. Let’s just pause and re-read the last sentence. Furthermore, the message from the other exception shown above is getting dropped along the way. But regardless a helpful article on Stack Overflow which pointed out a simple solution to add “serverTimezone=UTC” to the URL. Problem solved!
Two great mysteries remain: What changed in order to cause the timezone disagreement? My best guess is that an OS update changed some obscure timezone setting in such a way as to cause one of the servers to become confused about their timezone. The other mystery is why this fatal error was only sometimes fatal, and that “sometimes” became more and more likely as time went on, so early on, the service would restart repeatedly but then come up after some number of retries, eventually this became infinite.
But regardless, this illustrates the cost of sloppy error handling and uninformative error messages. I spent two full weeks clawing through source code trying to find the actual problem (the whole time in a panic that my production servers might get restarted and my service would be entirely down with no way to bring it back). But once the real error message was exposed the problem was fixed in minutes. That’s two weeks lost because someone couldn’t be bothered to do proper error reporting.
I first learned to program when the primary mechanism of error handling was to check the return value of each function call and react appropriately. This often led to awful code where functions are called with no checking, and things mysteriously fail several steps removed from the actual error. Therefore I learned to add in lots of error handling so that any error messages would be very specific and easy to fix, to the point that I was once told I had too much error handling. Theoretically more modern programming languages with their try/catch syntax would help, but I am not convinced of that. syntactic sugar is no replacement for programmer discipline. Here’s a case in point, what’s the difference between this:
open(F, "$confdir/config")
$conf = parseconfig(<F>);
dbconnect($conf->{db}{uri}) or die;
and this:
try {
open(F, "$confdir/config")
$conf = parseconfig(<F>);
dbconnect($conf->{db}{uri});
catch {
die;
}
I can think of at least 6 ways this could fail, all yielding the exact same error, in either version. Don’t tell me you’ve never seen code like either one of those!
But, moving on from hypotheticals, here’s a simplified version of my code:
use Search::Elasticsearch;
my $e = Search::Elasticsearch->new( nodes => [$esurl]);
die "Error: unable to connect to $esurl\n" unless $e->ping;
print "connected to $esurl\n";
Running it gets this error:
[NoNodes] ** No nodes are available: [https://es.example.com:9200], called from sub Search::Elasticsearch::Role::Client::Direct::__ANON__ at foo.pl line 31.
Hmmm… neither my die or print generated any output; it is crashing the entire program while inside ping(). So this is my fault, I didn’t realize this module was written assuming try/catch blocks would be used, but doing that doesn’t really change much. The error message is the same; I still don’t know why it is failing. The server at the url works perfectly fine. So I dig in with the Perl debugger, and finally narrow down the crash to this line in HTTP::Tiny:
_croak(qq{$type URL must be in format http[s]://[auth@]<host>:<port>/\n});
That looks like a potentially helpful error message! It’s too bad it is getting lost somewhere along the way. Going up the call stack I come to this:
return HTTP::Tiny->new( %args, %{ $self->handle_args } );
No error handling, no try/catch, they just blindly create the HTTP::Tiny object and hope it all works. There are a number of try/catch blocks further up the stack, but the error message gets lost along the way, and by the time my catch happens, it is gone. I did discover that enabling detailed logging via “use Log::Any::Adapter qw(Stderr);” did yield that error message:
http_proxy URL must be in format http[s]://[auth@]<host>:<port>/
at /usr/local/share/perl5/Search/Elasticsearch/Cxn/HTTPTiny.pm line 86.
Proxy?! There shouldn’t be a proxy set! After some searching I finally find that someone kludged /etc/environment with the proxy. Obviously that was a bad idea. I never could figure out what was wrong with the proxy url, but since it should not have happened, and I had wasted at least an hour on this, I stopped digging.
So what went wrong here? First the error message says that the proxy url is invalid, but it doesn’t show me what that value was. Secondly, that real error message got lost somewhere along the way. This sort of sloppiness is what leads to many of the awful error messages I have documented here previously (like this), and trying to catch the exceptions doesn’t fix sloppy code.
So I just ran into a piece of code which looked something like this
request = http.request(someurl)
if (request.status == 200)
{
mystuff = request.body.foo
}
I showed this to my wife, who knows almost nothing about code. She said “what if it gets a status other than 200?” Despite the fact that her one and only coding class was in high school and she barely remembers it, she has managed to do better than the veteran who wrote that code! I responded, “Well, at least they are checking for the status rather than blindly proceeding!” I should not have said that. Fifteen minutes later, in the same file, I ran into code which was like this:
request = http.request(someurl)
mystuff = request.body.foo
Larry Niven once said “That’s the thing about people who think they hate computers … What they really hate is lousy programmers.”
So I create numerous SVN replicas and since it takes several steps to do this, I have it automated all the icky bits in a script. Usually it worked fine, but this time, the whole thing mysteriously fails with this:
svn: E165001: Revprop change blocked by pre-revprop-change hook (exit code 255) with no output.
Let’s translate “with no output”: “somewhere along the line a programmer neglected to detect and/or issue an error message”. A fatal error never fails with no output unless someone, somewhere, screwed up the error handling code.
So I run the propset command manually… works fine. I run it with the debugger… all is well, but the propset command fails. I run the propset command in another window… works fine. Now, from in the debugger I run the propset with strace. Buried in the output I find this:
18704 chdir(".") = -1 EACCES (Permission denied)
18704 exit_group(-1) = ?
Sure enough! I had su’ed to the repository owner id, but my personal home directory was locked up:
$ ls
ls: cannot open directory .: Permission denied
One thing to note is that there is no attempt to issue an error message between the chdir() and the exit_group()! I wonder which would cost more: the programmer adding one line of code to issue an error message, or me spending half an hour figuring out this problem?
I had a whole bunch of error message screenshots saved and as I was looking through, I realized two things: first, even though there are only 2 dozen of them I don’t know if I have the energy to compose witty comments about each without becoming entirely demoralized. Secondly, I began to notice some themes. Therefore, I am going to take all the error messages I have and roughly categorize them. Since there is often overlap I am going to present it as a table, a sort of bingo card, indicating the categorization.
The first category I call “Sorry, not sorry”: inauthentic apologies for things that the author of the error message is likely responsible for in the first place.
The next category I call “Funny, not funny”: instead of giving inauthentic apologies for our screw up, we will try to distract from it with a “cute” saying (“aw, snap”) or a frowny face icon. I am not laughing, just stop.
The next category is what I call “helpfully not helpful”: the error message gives some excess, though unhelpful, detail with the error message. You still don’t know what went wrong but you spent twice as long reading the verbiage. Also useless suggestions like “try again later” or “retry” fall into this bucket.
The category “dunno” covers most of these errors: I don’t know what happened (even though I am coding an exception block), so I will just feign ignorance. It’s best when these are in passive voice, and extra credit for using the word “something” in the error.
The next category, which rarely comes up, is when there are clues as to what went wrong. I myself have (unintentionally) written errors like this with code like this:
warn "Error: unable to open file $file\n";
Of course, if $file is blank you get:
Error: unable to open file
Someone with some coding experience may pick up that a filename belonged there. Putting quotes around the file would have at least given a hint that I got an empty filename (and failed to sanitize my inputs). Note: I never said I was blameless in this error message hall of shame.
Here we go:
error | notsorry | notfunny | nothelp | dunno | clues |
---|---|---|---|---|---|
x | x | x | x | ||
x | x | ||||
x | x | ||||
x | x | ||||
x | x | ||||
x | |||||
x | x | ||||
x | x | x | |||
x | x | ||||
x | |||||
x | x | ||||
x | |||||
x | x | ||||
x | x | ||||
x | x | ||||
x | x | ||||
x | x | ||||
x | x | ||||
x | x | x | x | ||
x | x | ||||
x | |||||
x |
How many of those have you seen? Perhaps I should have a giveaway for the first person to have personally seen every single one. I don’t know what sort of prize it would be. Maybe we could sit down and share some whiskey… I feel like I need it after looking at all those.
I’m glad there could be a reason for this error. But, on the other hand, it may just be random, like the lottery in reverse.
Surely you cannot be unsure!
In all my years of programming I’ve never had an operation return an uncertain result. I’ve had my share of unexpected results, but that’s just sloppy programming somewhere along the line. Perhaps it is returning a result of type SchrödingersCat.
I recently upgraded my workstation and, once again, had to change music players. Though this is just as well since Banshee’s 20 seconds of unresponsiveness after changing tracks was pretty intolerable. So now I tried Clementine. It seems like a reasonably good player, but on certain tracks I get this:
Presumably the tracks in question were MP3, but knowing the track/file name and the plugin it wanted would be most helpful in locating the exact plugin I need. After installing a whole bunch of packages with different codecs nothing changed.
In my youth I used to think that free software would ultimately be better quality. So much for that.
The guessing games continue. This time we get to guess as to why the remote desktop viewer was unable to connect.
Perhaps the host was down? Perhaps my password was wrong? Perhaps there was a protocol error or incompatability? Perhaps the remote service was down? Perhaps my breath? Perhaps, perhaps, perhaps…
I haven’t been writing about error messages in a while. But this is not because these horrible error messages are hard to find, but rather I’m overwhelmed by them. We are all steeped in incomprehensible, unhelpful error messages every day and it just starts to become part of the background noise of our lives. It’s like blogging about drops of water in the river.
So in hopes of getting this going again, here’s one of the worst ones I’ve seen in a long time:
The “details” link is a nice touch, I’m not sure why I didn’t capture what happened when I clicked that; I can’t remember what is said, but I am certain it was utterly unhelpful.
Here’s an error message I first remember seeing a few decades ago. Sadly I just ran into it again. It seems guessing games never go out of style.
Error while changing the NIS password.
The NIS password has not been changed on somenisserver.
For many years I was forced to use Windows on my work computers. I almost always listen to music as I work, so I needed a music player. I quickly grew dissatisfied with Windows Media Player, and after some searching I found Foobar2000, which I really liked. A utilitarian, but capable, interface which did most everything I needed and did it pretty well.
A few years ago I switched my desktop machine to Linux. Hurrah! But now the music player again became an issue. I tried Rhythmbox, Banshee and Amarok. None of them were very good. Nothing close to Foobar2000. All seemed to suffer from the same issues: lots of pointless graphical flourish, confusing and/or dumbed down interfaces, poor functionality, and lousy error handling. By process of elimination, I ended up with Banshee. A grudging choice which I regret on most days.
A main source of pain is the Podcast extension. It suffers from my usual pet peeves: no indication what, if anything, is happening and poor to non-existent error reporting. Here’s a prime example: I hit the button to check for new podcasts. Nothing seems to happen for a few minutes, but then some new episodes pop in. But they don’t download. I right click on a new episode, and select download. Nothing. I wait. Still nothing. So I go digging. After several minutes of searching I find a log file buried under my home directory, it contains this:
[Warn 09:24:14.993] HttpDownloadTask The Permaculture Podcast - Honeybees with Dr. Dennis vanEngelsdorp Error: System.Net.WebException: The remote server returned an error: (404) Not Found.
So, it actually did attempt to download, but didn’t bother to say anything. When that download failed, it quietly tucked the error into an obscure log file, and on top of that labeled it as “Warn”. I tend to be of the opinion that when I ask for something to be done, and it cannot be, that is an “Error”, not a “Warning”. But even so, the error message doesn’t mention the URL or whether it was going through a proxy, without those I still don’t know why it happened. As I recall I continued experimenting with settings until I figured out that the proxy was the issue.
I didn’t become a software engineer so I could spend my days pushing upstream against issues like this.
Years ago one of my co-workers complained that my code had “too much error handling”. I was astonished, but said little in defence since I was the new guy on the team. Looking back on this, years later, I am bothered by this attitude. It is easy to write code that works correctly when everything it depends upon works correctly. Given the complexity of modern software and hardware, there are an endless number of things which can fail.
Therefore, error handling becomes the most critical part of the code. We have to code with the assumption that anything can fail. In my experience, it will, sooner or later. When the failure does happen, it must be dealt with in a reasonable manner. Ideally that would be some sort of self-healing, retrying in the case of transient issues, and, failing that a useful and comprehensive error message.
I first started writing this post at least 4 years ago, and in the meantime it has become apparent that my point of view is the minority amongst programmers. Silent failures, incomprehensible error messages, and, crashes are a daily part of life amongst the recent wave of gadgetry. But I guess the plus side is it gives me something to complain about here.
I noted this error, but forgot to note where I got it. I’m pretty sure it was IBM’s support site:
There is an issue with your login that needs to be addressed before you can successfully login to SR. Please contact the SR help desk.
So I can’t log into the SR system due to an “issue”, and to get it resolved, I need to file an SR. My head is spinning, just like the logic.
I often rant about lousy error messages, but it seems that a growing (and maybe more worrisome) phenomenon is silent failures. Here’s an example I just ran into with ClearCase (this is not a new bug, I first encountered it over 10 years ago):
$ ct desc -l vob:. | grep MITTR
SLC0314:C:\views\MITTR.vws [uuid 107553f5.340c4e00.bd2c.3f:52:42:42:31:7b]
$ ct rmview -force -uuid 107553f5.340c4e00.bd2c.3f:52:42:42:31:7b
Removed references to view "SLC0314:C:\views\MITTR.vws" from VOB "/vobs/factcheck".
$ ct desc -l vob:. | grep MITTR
SLC0314:C:\views\MITTR.vws [uuid 107553f5.340c4e00.bd2c.3f:52:42:42:31:7b]
What exactly is happening is unimportant, what you can see is that there was an entry in the “desc” output, then I tried to remove it, which the output indicates happened, but double-checking indicates nothing was done. It_ failed_, and lied about it.
My assertion that this is a defect was ignored by IBM Tech Support. I guess their definition of “defect” is different from the rest of us.
I could be wrong, but it seems obvious to me that this sort of failure is a classic case of someone not checking for errors, that is, leaving off the last part of something like this:
rmviewuuid($uuid, $vob) or warn "Error: failed to remove view $uuid from $vob\n";
While built-in exception handling (à la java) can help, I’ve seen plenty of code like this (even in Java textbooks):
try { vob.rmview(uuid) } catch { }
Back when I started programming, little bits of sloppiness like this weren’t a big deal. It was often not hard to see what was going wrong when a 1000 line program was running. But when you have millions of lines of code, often spread amongst numerous libraries, in many different interlocking programs, the problems grow exponentially. As we all experience daily.
This is a fun one from svnsync:
Synchronizing repository some-replica...
Transmitting file data ......................................!
.......................svnsync: Error! while replaying commit
What’s even better is that a Google search reveals a number of people running into this exact error. Sadly, few of them got replies, and those that did get replies did not get helpful replies. Certainly no solutions.
Mashing the error message on the previous line is a nice touch as it makes it harder to spot the error message, especially when there are numerous periods (thousands, in my case).
I’m a firm believer that every program has to have the ability to enable debugging output, as that could compensate for such poor error messages. But in this case, no luck. Maybe the java stack trace isn’t so bad after all.
Is what I get for dutifully installing all updates?
Telling me that the install failed twice is nice, but does not hide the fact that I still don’t know why. I feel like I’m playing a guessing game with my 3 year old daughter: “I’m not telling!”
I know that if I go to IT they’ll want to reimage my machine. At first their predilection for reimaging bugged me as I thought more effort should go into figuring out the root cause. But over the years I’ve come to sympathize with their position. Usually after hours of probing you run into a brick wall, an OS bug or some silent failure and without source code there’s no way around it. And figuring out a root cause only helps if there’s some hope that the problem may be fixed. So why bother?
Anyway, I’ve got a Linux CD which could fix the issue once and for all.