One of the unfortunate side effects of Moore’s Law is that the immense amount of computing power at our fingertips masks many horribly inefficient practices. For example, something I have done for a very long time is use a sort, uniq, sort pipeline to tally things up. Say you have a CSV of users whose second column is the country, and you want a quick count of how many users you have from each country:
$ cat u | awk -F, '{print $2}' | sort | uniq -c | sort -n
1 ca
1 mx
14 in
16 us
Obviously for a short list that first sort is not expensive at all, and even on my current laptop I can do the same thing on a file with 2 million records and it finishes in 5 seconds, despite the massive sort. But this could be done with an O(n) algorithm like so (you could do this with Awk, of course, but I’m more fluent in Perl):
$ cat u | perl -lne '$k = (split /,/)[1]; $c{$k}++; END { for my $i (sort { $c{$a} <=> $c{$b} } keys %c) { printf "%10d %s\n", $c{$i}, $i } }'
1 ca
1 mx
14 in
16 us
(Those who have been through a job interview with me will recognize this question… few of you got it right.)
Here’s a different example: When I was in high school the student government ran an annual fundraiser by doing a “computer dating” event: everyone would fill out a short multiple choice survey, and those would be sent off to some company to generate reports matching people.
One day, the computer teacher obliquely asked me if I could write a program that could take a list of people and randomly put together lists and generate reports. I knew what he was hinting at: let’s skip hiring the company and generate fake reports ourselves. I wrote that code, but, being a curious sort (and not entirely comfortable with the deception), I decided to try doing it the “right” way. The basic algorithm is simply to take a string of responses and match it against everyone else in the list, take the top ten matches and generate a report. Simple, right?
I worked on that program through the summer (keep in mind I was 16, writing in BASIC on an Apple ][), but I got it working. When we finally ran it on the full data set (a few hundred surveys), it sat and ran for over a week. [A personal note: since I was nursing this through, I saw every report slowly trickling out; my name came up on exactly one of those reports, and at number 7, at that. I was disappointed but not surprised.] We all know the same thing would, nowadays, be a few dozen lines of code and would run in seconds, despite being an O(n^2) algorithm.
So, this leads to the current problem. I was generating a report and I noticed that it was taking several minutes to run. That wasn’t a big deal, but my curiosity got the best of me, so I broke out a profiler and found this:
Holy cow! It makes sense, as the source file has almost two million records, and I am parsing the date on every record. Who would have thought simple date parsing could take up so much time? I don’t know what I can do to make date parsing itself more efficient, but I did realize that I only need the date in some limited cases, so I changed the code to parse dates only at the moment I actually need to compare timestamps. This lazy evaluation brings me into better alignment with the three virtues. The result:
Instead of parsing 1.8 million date entries, I now only parsed 82k.
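My actual report code is in Perl and not shown here, but the lazy-evaluation trick can be sketched in a few lines of Python (the Record class, date format, and counts below are all invented for illustration):

```python
from datetime import datetime

class Record:
    """Keep the raw date string; only pay the parsing cost when a
    comparison actually needs the parsed value."""
    parse_count = 0  # instrumentation: how many parses actually happen

    def __init__(self, raw_date):
        self.raw_date = raw_date
        self._parsed = None

    @property
    def date(self):
        if self._parsed is None:
            Record.parse_count += 1
            self._parsed = datetime.strptime(self.raw_date, "%Y-%m-%d")
        return self._parsed

records = [Record("2022-01-%02d" % (i % 28 + 1)) for i in range(1000)]
# Only the records we actually compare get parsed.
recent = [r for r in records[:50] if r.date.day > 14]
print(Record.parse_count)  # 50, not 1000
```

The parse happens on first access and is cached, so records whose dates are never compared cost nothing.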
Now, in my case, my code will never have to scale much beyond the current dataset, but if it did, there would be a catastrophe waiting to happen.
The moral of the story is that we should all think about these seemingly minor decisions we make and think about what happens when the dataset scales up by orders of magnitude. That expensive sort or needless date parsing could hurt someday. Think about how your code will scale. Profile your code! You may discover things you didn’t expect.
I used to rant on here about awful error messages, but I kind of gave up as the problem kept getting worse. It seemed that error messages were getting less informative and error handling in code was getting worse. (I complained about this years ago) But here’s a cautionary tale about what the cost of this sloppiness can be.
So I inherited an internal service from another team, and one day I go to deploy an update, and the service will not start. I barely understand this thing or the language it is written in, and now I have to debug it!
At first I think the service is stuck trying to contact some external service which is gone. But after studying the error messages I find a cyclical nature to the errors: “starting service”, “starting subsystem x”, “starting subsystem y”, etc., “starting service”, repeating endlessly. So it is not hanging but crashing repeatedly with no error messages.
Now I come to the horrifying realization: If my service should go down for any reason, it will not start up again. I cannot patch the OS, or do anything which would cause a restart. If that happens, I will now have a crisis on my hands, and I don’t even know what is happening.
I poked around further and discovered that the on-disk logs had one more error message. I suppose this error did not make it out to the logging server before the service crashed. Sounds like another bug to me. But this error message was bewildering: “java.sql.SQLException: An attempt by a client to checkout a Connection has timed out.” That makes no sense. First off, which database? The service connects to several. Secondly, the time between the previous log entry and this one is around 2 seconds; no sane person would set a timeout that short. I found a timeout setting which was, bewilderingly, set to 1 second! I changed it to 10 seconds. Now the time gap in the log was 20 seconds. I am not sure why it is off by a factor of two. Perhaps there are two attempts to connect, or there is a straight-up math error.
Clearly there is something going wrong while connecting to the database, but that makes no sense: the database is running, and nothing has changed in any of the servers or network settings. That I know of. Better test it. The MySQL command line client gets through just fine. But maybe the library being used is at fault, so now I have to learn enough of this language to write a simple database client. Several hours later I have some code working. It works fine in every relevant context. So the “timeout” is, as I thought, a lie.
I dig through the stack trace and it seems whatever is going wrong is happening right here:
Connection out = driver().connect(jdbcUrl, properties);
if (out == null)
    throw new SQLException("Apparently, jdbc URL '" + jdbcUrl + "' is not valid for the underlying " +
                           "driver [" + driver() + "].");
From the stack trace it is clear the exception is being thrown, but somewhere along the line that error message, as uninformative as it is (i.e. there is no indication of why it failed), is getting lost or suppressed.
Fortunately, one of the previous maintainers pointed out a way to enable more detailed logging. Miraculously, it revealed this error message:
java.sql.SQLException: The server time zone value 'UTC' is unrecognized or represents more than one time zone. You must configure either the server or JDBC driver (via the serverTimezone configuration property) to use a more specifc time zone value if you want to utilize time zone support.
So a fatal error is happening when connecting to the database. But that fatal error is suppressed unless debugging output is enabled. Let’s just pause and re-read that sentence. Furthermore, the message from the other exception shown above is getting dropped along the way. But regardless, a helpful article on Stack Overflow pointed out a simple solution: add “serverTimezone=UTC” to the URL. Problem solved!
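For anyone hitting the same error: the fix amounts to appending that property to the JDBC connection URL, something like this (host, port, and database name are placeholders, not my actual settings):

```
jdbc:mysql://db.example.com:3306/mydb?serverTimezone=UTC
```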
Two great mysteries remain. What changed in order to cause the timezone disagreement? My best guess is that an OS update changed some obscure timezone setting in such a way as to cause one of the servers to become confused about its timezone. The other mystery is why this fatal error was only sometimes fatal, and why that “sometimes” became more and more likely as time went on: early on, the service would restart repeatedly but then come up after some number of retries; eventually the retries became infinite.
But regardless, this illustrates the cost of sloppy error handling and uninformative error messages. I spent two full weeks clawing through source code trying to find the actual problem (the whole time in a panic that my production servers might get restarted and my service would be entirely down with no way to bring it back). But once the real error message was exposed the problem was fixed in minutes. That’s two weeks lost because someone couldn’t be bothered to do proper error reporting.
I first learned to program when the primary mechanism of error handling was to check the return value of each function call and react appropriately. This often led to awful code where functions are called with no checking, and things mysteriously fail several steps removed from the actual error. Therefore I learned to add lots of error handling, so that any error messages would be very specific and easy to fix, to the point that I was once told I had too much error handling. Theoretically, more modern programming languages with their try/catch syntax would help, but I am not convinced of that: syntactic sugar is no replacement for programmer discipline. Here’s a case in point: what’s the difference between this:
open(F, "$confdir/config");
$conf = parseconfig(<F>);
dbconnect($conf->{db}{uri}) or die;
and this:
try {
    open(F, "$confdir/config");
    $conf = parseconfig(<F>);
    dbconnect($conf->{db}{uri});
} catch {
    die;
};
I can think of at least 6 ways this could fail, all yielding the exact same error, in either version. Don’t tell me you’ve never seen code like either one of those!
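For contrast, here is a sketch (in Python rather than Perl, with a hypothetical config loader standing in for the open/parseconfig/dbconnect sequence above) of what checking each step buys you: every failure mode gets its own specific message.

```python
import sys
import tempfile

def load_config(path):
    """Each failure mode gets its own message: which file, which line,
    which setting. The format and names here are invented."""
    try:
        f = open(path)
    except OSError as e:
        sys.exit(f"cannot open config '{path}': {e.strerror}")
    conf = {}
    for n, line in enumerate(f, 1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if "=" not in line:
            sys.exit(f"cannot parse config '{path}': bad syntax at line {n}")
        key, value = line.split("=", 1)
        conf[key.strip()] = value.strip()
    if "db.uri" not in conf:
        sys.exit(f"config '{path}' has no db.uri setting")
    return conf

with tempfile.NamedTemporaryFile("w", suffix=".conf", delete=False) as tmp:
    tmp.write("db.uri = mysql://db.example.com/mydb\n")
print(load_config(tmp.name)["db.uri"])  # mysql://db.example.com/mydb
```

It is more typing, but when something fails at 3 a.m., the message tells you exactly which of those six ways it failed.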
But, moving on from hypotheticals, here’s a simplified version of my code:
use Search::Elasticsearch;
my $e = Search::Elasticsearch->new( nodes => [$esurl]);
die "Error: unable to connect to $esurl\n" unless $e->ping;
print "connected to $esurl\n";
Running it gets this error:
[NoNodes] ** No nodes are available: [https://es.example.com:9200], called from sub Search::Elasticsearch::Role::Client::Direct::__ANON__ at foo.pl line 31.
Hmmm… neither my die nor my print generated any output; it is crashing the entire program from inside ping(). So this is my fault: I didn’t realize this module was written assuming try/catch blocks would be used. But doing that doesn’t really change much; the error message is the same, and I still don’t know why it is failing. The server at that URL works perfectly fine. So I dig in with the Perl debugger, and finally narrow down the crash to this line in HTTP::Tiny:
_croak(qq{$type URL must be in format http[s]://[auth@]<host>:<port>/\n});
That looks like a potentially helpful error message! It’s too bad it is getting lost somewhere along the way. Going up the call stack I come to this:
return HTTP::Tiny->new( %args, %{ $self->handle_args } );
No error handling, no try/catch: they just blindly create the HTTP::Tiny object and hope it all works. There are a number of try/catch blocks further up the stack, but the error message gets lost along the way, and by the time my catch happens, it is gone. I did discover that enabling detailed logging via “use Log::Any::Adapter qw(Stderr);” yielded that error message:
http_proxy URL must be in format http[s]://[auth@]<host>:<port>/
at /usr/local/share/perl5/Search/Elasticsearch/Cxn/HTTPTiny.pm line 86.
Proxy?! There shouldn’t be a proxy set! After some searching I finally found that someone had kludged the proxy into /etc/environment. Obviously that was a bad idea. I never could figure out what was wrong with the proxy URL, but since it should not have been set in the first place, and I had already wasted at least an hour on this, I stopped digging.
So what went wrong here? First, the error message says that the proxy URL is invalid, but it doesn’t show what that value was. Secondly, that real error message got lost somewhere along the way. This sort of sloppiness is what leads to many of the awful error messages I have documented here previously (like this), and trying to catch the exceptions doesn’t fix sloppy code.
So I just ran into a piece of code which looked something like this:
request = http.request(someurl)
if (request.status == 200)
{
mystuff = request.body.foo
}
I showed this to my wife, who knows almost nothing about code. She said “what if it gets a status other than 200?” Despite the fact that her one and only coding class was in high school and she barely remembers it, she has managed to do better than the veteran who wrote that code! I responded, “Well, at least they are checking for the status rather than blindly proceeding!” I should not have said that. Fifteen minutes later, in the same file, I ran into code which was like this:
request = http.request(someurl)
mystuff = request.body.foo
Larry Niven once said “That’s the thing about people who think they hate computers … What they really hate is lousy programmers.”
[I started writing this in 2016 and unearthed it amongst some old drafts. But 6 years have only intensified my feelings here, so here it is updated and finished]
I’m sure you’ve all heard Arthur C. Clarke’s statement that “Any sufficiently advanced technology is indistinguishable from magic.” But I had an exchange which convinced me of a variation on that: “Any sufficiently hyped technology is indistinguishable from religion.”
The case in point was a discussion with someone who seemed to think that Git was the only version control system which had checkin hooks. And after informing him that every modern version control system, indeed, every one of them I’ve used in the last two decades, has such things (in one way or another), he repeated the same thing later on in the conversation, as if unable to process this new piece of information which contradicted established dogma.
Another interaction was with someone who asserted that Git was “more secure”. When I questioned him as to exactly how it is more secure, he was unable to articulate anything meaningful. Then I pointed out that it was trivial to forge checkins (and even demonstrated it in front of him by doing a checkin in his name), but this didn’t faze him, and he returned, mindlessly, to his original point.
I have nothing against Git on the whole; I use it myself every day. I have minor gripes with it, mainly having to do with the arcane, counter-intuitive interface (like this). But my biggest gripe is the religious fervor with which it is hyped: the irrational, one-size-fits-all, Maslow’s-Hammer-wielding, be-all and end-all, the perfect final pinnacle of version control. For ever and ever. Amen!
There have been many “religious wars” in the software world over the years. Long ago I defected to the Emacs camp, so I know it well. But with all of those wars, there were always competing technologies; differing views on how to approach a problem. But in this case, there is only one left, all others have been shouted down into irrelevancy. At the time of Git’s ascendance, I was managing at least 6 different version control systems, and I thought this would just be one more to add to the mix. But I was mystified as my team was quickly sidelined as everyone mindlessly rushed to Git.
I am a believer in using the right tool for the right job, and Git is certainly the right tool in a lot of cases, but not every problem is a nail. Sometimes you need different tools.
I notice that I wrote about this earlier, but I am now taking the long view of this: every ascendant technology eventually declines as the next shiny thing attracts everyone’s attention. I look forward to the day something new comes along and pushes Git to the sidelines.
So I create numerous SVN replicas, and since it takes several steps to do this, I automated all the icky bits in a script. Usually it worked fine, but this time the whole thing mysteriously failed with this:
svn: E165001: Revprop change blocked by pre-revprop-change hook (exit code 255) with no output.
Let’s translate “with no output”: “somewhere along the line a programmer neglected to detect and/or issue an error message”. A fatal error never fails with no output unless someone, somewhere, screwed up the error handling code.
So I run the propset command manually… works fine. I run it with the debugger… all is well, but the propset command fails. I run the propset command in another window… works fine. Now, from in the debugger I run the propset with strace. Buried in the output I find this:
18704 chdir(".") = -1 EACCES (Permission denied)
18704 exit_group(-1) = ?
Sure enough! I had su’ed to the repository owner id, but my personal home directory was locked up:
$ ls
ls: cannot open directory .: Permission denied
One thing to note is that there is no attempt to issue an error message between the chdir() and the exit_group()! I wonder which would cost more: the programmer adding one line of code to issue an error message, or me spending half an hour figuring out this problem?
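That missing line costs almost nothing to write. A sketch in Python of what the hook runner could have done (the function name is mine):

```python
import os
import sys

def chdir_or_die(path):
    """The one line of code the hook skipped: when chdir fails,
    say which path failed and why, before exiting."""
    try:
        os.chdir(path)
    except OSError as e:
        sys.exit(f"error: cannot chdir to '{path}': {e.strerror}")

chdir_or_die(".")  # the happy path is just as silent as before
```

With that in place, the “no output” failure above would instead have printed the permission-denied error and saved the half hour.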
[Disclaimer and credit: I do not remember where I first heard about the brush turkey, but I suspect it was Douglas Adams during a book signing for his book Last Chance to See. The details of the brush turkey are exaggerated and maybe even wrong, but this is to serve a narrative purpose, so any zoologists can sit down, this parable isn’t for you.]
The brush turkey is very prone to boredom. Every time she lays an egg, she thinks to herself, “Sitting on these eggs for several weeks is going to be really boring. There has to be a better way. I have an idea! I will build a compost pile on top of the eggs; that will keep the eggs warm and I can go have fun.” She gets up and runs all over the place gathering organic matter and piles it up on top of the clutch of eggs. She has to gather quite a bit in order to get the temperature high enough to incubate the eggs. But finally she is finished! Now she can relax!
After a while she thinks: “I better double check the temperature.” She sticks her head into the pile. “That feels too cool, let me gather more material”. She runs around gathering more material to add to the pile.
A few hours later she checks again. “Oh, no! It’s too hot, I’ll have to take some of the material off”. More running around.
She repeats this, constantly running around adding and removing material from the pile to keep the temperature just right. Day in and day out for several weeks.
Those with ears, let them hear!
As a programmer the urge to automate tasks is constant. However, there are many times when the effort of automating the task may be far greater than just doing the task manually. Let’s say you have a task which takes you 15 minutes, but with some automation you could reduce that to 5. So you spend a day writing a program to do the automation. But you only do that task twice a month. It will take years to break even.
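The break-even arithmetic from that example is worth writing out:

```python
minutes_saved_per_run = 15 - 5    # automation saves 10 minutes per task
runs_per_month = 2
automation_cost = 8 * 60          # "a day" of work, in minutes

saved_per_month = minutes_saved_per_run * runs_per_month  # 20 minutes
months_to_break_even = automation_cost / saved_per_month
print(months_to_break_even)  # 24.0 -- two full years
```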
You would have been better off sitting on that egg!
I had a whole bunch of error message screenshots saved, and as I was looking through them, I realized two things: first, even though there are only two dozen of them, I don’t know if I have the energy to compose witty comments about each without becoming entirely demoralized; secondly, I began to notice some themes. Therefore, I am going to take all the error messages I have and roughly categorize them. Since there is often overlap, I am going to present it as a table, a sort of bingo card, indicating the categorization.
The first category I call “Sorry, not sorry”: inauthentic apologies for things that the author of the error message is likely responsible for in the first place.
The next category I call “Funny, not funny”: instead of giving inauthentic apologies for our screw up, we will try to distract from it with a “cute” saying (“aw, snap”) or a frowny face icon. I am not laughing, just stop.
The next category is what I call “helpfully not helpful”: the error message gives some excess, though unhelpful, detail with the error message. You still don’t know what went wrong but you spent twice as long reading the verbiage. Also useless suggestions like “try again later” or “retry” fall into this bucket.
The category “dunno” covers most of these errors: I don’t know what happened (even though I am coding an exception block), so I will just feign ignorance. It’s best when these are in passive voice, and extra credit for using the word “something” in the error.
The next category, which rarely comes up, is when the error actually contains clues as to what went wrong. I myself have (unintentionally) written errors like this, with code like this:
warn "Error: unable to open file $file\n";
Of course, if $file is blank you get:
Error: unable to open file
Someone with some coding experience may pick up that a filename belonged there. Putting quotes around the filename would have at least given a hint that I got an empty filename (and failed to sanitize my inputs). Note: I never said I was blameless in this error message hall of shame.
Here we go:
error | notsorry | notfunny | nothelp | dunno | clues |
---|---|---|---|---|---|
x | x | x | x | ||
x | x | ||||
x | x | ||||
x | x | ||||
x | x | ||||
x | |||||
x | x | ||||
x | x | x | |||
x | x | ||||
x | |||||
x | x | ||||
x | |||||
x | x | ||||
x | x | ||||
x | x | ||||
x | x | ||||
x | x | ||||
x | x | ||||
x | x | x | x | ||
x | x | ||||
x | |||||
x |
How many of those have you seen? Perhaps I should have a giveaway for the first person to have personally seen every single one. I don’t know what sort of prize it would be. Maybe we could sit down and share some whiskey… I feel like I need it after looking at all those.
Years ago one of my co-workers complained that my code had “too much error handling”. I was astonished, but said little in defence since I was the new guy on the team. Looking back on this, years later, I am bothered by this attitude. It is easy to write code that works correctly when everything it depends upon works correctly. Given the complexity of modern software and hardware, there are an endless number of things which can fail.
Therefore, error handling becomes the most critical part of the code. We have to code with the assumption that anything can fail; in my experience, it will, sooner or later. When the failure does happen, it must be dealt with in a reasonable manner. Ideally that would be some sort of self-healing, retrying in the case of transient issues, and, failing that, a useful and comprehensive error message.
I first started writing this post at least 4 years ago, and in the meantime it has become apparent that my point of view is in the minority amongst programmers. Silent failures, incomprehensible error messages, and crashes are a daily part of life amongst the recent wave of gadgetry. But I guess the plus side is it gives me something to complain about here.
When I was a child, my parents would often tell me to repeat what they just told me, since I usually wasn’t paying attention. Now I have to do the same thing with my own daughter. Payback time, it seems.
But this blog entry isn’t about parenting, it’s about error messages.
I was just writing some code and realized that an important rule when writing error messages is to repeat back what the user said. There are many violations of this rule, the first one that comes to mind is this one from Windows:
The system cannot find the path specified.
That error may be comprehensible if you just typed a command, but as part of a script, it will be entirely useless. Obviously, the pathname needs to be displayed (of course, we still don’t know what was being done, or why).
This becomes even more important when a user-specified value is modified in some way. For example, I had a command line argument which could take a list. After breaking the list apart, I needed to validate the entries in the list. If I found anything invalid I could have simply given the error “invalid parameter”. Useless! Rather, I filtered out the valid values and then printed out the offending ones: “invalid parameters: a,b,c”.
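A minimal sketch of that filtering (the parameter names here are invented; my actual list was different):

```python
VALID = {"cpu", "mem", "disk", "net"}  # hypothetical allowed values

def check_params(arg):
    """Validate a comma-separated list, echoing back exactly which
    user-supplied values were rejected."""
    values = arg.split(",")
    bad = [v for v in values if v not in VALID]
    if bad:
        raise ValueError("invalid parameters: " + ",".join(bad))
    return values

try:
    check_params("cpu,a,mem,b,c")
except ValueError as e:
    print(e)  # invalid parameters: a,b,c
```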
Now, repeat what I just said!
NOTE: I started writing this on 22-Jul-2004, transferred it to my personal wiki on 15-Oct-2006, and the last edit was on 4-Feb-2011 (except for an update and formatting fixes).
I have been working with ClearCase since 1994 and have become very familiar with its problems and shortcomings. I am using this page to accumulate a list of what is wrong, broken, or sub-optimal with ClearCase. This page has been written gradually over several years, often when I was in a bad mood after running into a problem. There are a number of good features of ClearCase which are not included in this page, but that information is readily available from IBM marketing.
Update: I attended the IBM Rational User’s conference in Jun 2007, and it appears that some of these problems are finally getting addressed. Version 8.0 should be sweet. I just hope I can hold out until mid 2009.
Another Update: It is now mid 2010 and no sign of version 8.0, and version 7.1 broke the installer such that we have yet to upgrade. My hope was obviously misplaced.
Final Update (July 2022): A reorg at my company moved me out of the team maintaining ClearCase so I no longer touch it.
If I had a nickel for every time someone complained to me about ClearCase performance I could have retired by now. The network architecture of ClearCase assumes that all users will be accessing the vob server via a high-speed local LAN. This is because most ClearCase operations require a huge number of round-trips between the vob server and the client.
I did some rough measurements of the packets exchanged during common operations and found that a simple “desc” operation takes over 100 round-trips, a “checkout” takes over 500 round-trips, and a “checkin” requires over 1000 round-trips.
I also did a comparison of creating a snapshot view versus doing an initial checkout from SVN of an identical source tree. Subversion took about 49 round trips, but ClearCase did 117-144. Due to this latency difference it took 19 seconds to pull the source from Google Code (through an https proxy), but it took 30 seconds to pull the source from a neighboring site over the intranet.
Clearly, even the slightest increase in latency between these hosts will mean a huge performance degradation. According to It’s the Latency, Stupid!, the theoretical minimum latency between machines on opposite shores of the USA is 42ms; in Siebel it seems to be about 62ms. That translates to a minimum checkin time of 62 seconds, and that does not account for any processing time on any of the involved machines.
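Multiplying the round-trip counts measured above by that latency gives the network-time floor for each operation (ignoring all processing time on either end):

```python
latency_ms = 62  # measured cross-country round-trip latency
round_trips = {"desc": 100, "checkout": 500, "checkin": 1000}

for op, trips in round_trips.items():
    # pure network time: round trips are serialized, so they add up
    print(f"{op}: at least {trips * latency_ms / 1000:.0f} seconds")
# desc: at least 6 seconds
# checkout: at least 31 seconds
# checkin: at least 62 seconds
```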
Both albd and the lock manager are single-threaded. This means that for a large user population you must have multiple servers in order to get reasonable performance. Update: it appears that the lock manager has been fixed in ClearCase 7.0.
Access control is very limited. It uses the old Unix model: user/group/other. If you have a vob which must be restricted to the members of two different groups, you will be in trouble. The suggestion always given by IBM is to create different regions for different user populations, but that suffers from the same multi-group issue, not to mention that a machine’s region can be changed.
I implemented rudimentary access control by applying an ACL on the vob storage directory. This prevents Windows users from mounting the vobs. Unfortunately, since vob mounting on Unix is done by root, those ACLs are ignored.
Update: Version 8.0 should include ACLs! Version 7.0.1 has a group to region mapping mechanism which is a reasonable stop-gap for CCRC until then.
In one sense this is the greatest feature of ClearCase: creation of a view (a.k.a. workspace) is a constant-time operation, i.e. creating a view for a 1MB source tree takes the same amount of time as for a 1TB source tree. Most source control systems require you to have a copy of every file on your local disk, which can be prohibitive for large source trees, both in terms of time and space.
But, here’s the rub: this means ClearCase lets you avoid careful segmentation/componentization of a product; instead, developers can throw everything into one big source tree. But who cares? Since dynamic views are so cheap, it doesn’t matter, right? Wrong! When the source tree becomes so big that snapshots are no longer possible, there are big downsides:
Like dynamic views, config specs are a mixed bag. They are a powerful and incredibly flexible mechanism for specifying what versions you want to look at. I find that understanding config specs is the central piece of knowledge you must have to effectively use ClearCase.
The problem with this is that it gives you a lot of rope. Plenty to hang yourself, though with enough slack that you won’t notice until much later (usually after a lot of damage is done).
So, people often hack their config specs, usually in an effort to avoid a merge, for example something like this:
element * .../mybranch/LATEST
element * .../otherbranch/LATEST
element * .../anotherbranch/LATEST
...
As long as those three branches are on non-overlapping sets of files, which are based on the same code base (e.g. label) this will usually work fine. But as soon as there is overlap, files must be merged. So much for avoiding the merge. Except now the situation is worse, now the merge must be carefully done in the right order. If, using the example above, the file foo.c has been changed on all three branches, a merge must take place from “anotherbranch” to “otherbranch”, and then a merge must take place from “otherbranch” to “mybranch”. Until that is done the source tree is out of whack.
Another reason people will modify config specs is to “fix” them. For example, a whole team is using a config spec which has a timestamp. Someone notices that the timestamp has no timezone and is thus ambiguous. That person “fixes” the timestamp, but is now out of sync with the rest of the team. Then when that person branches a file, it may be branched off the wrong version. For this reason, I always tell people that consistency is more important than correctness.
The other problem with config specs is that they are the sole documentation of the relationship between a branch and the code base.
The ClearCase Web interface has been included with the product for many years and is still severely limited. One of these limitations is that interactive triggers will not work. It seems like this would simply have been a matter of making “clearprompt” understand that it is being run via the web interface and interoperate with it, but they didn’t bother with that (see the “Triggers” section for further criticism of clearprompt). Upon testing with our extensive set of triggers (only one of which is “interactive”), we found that “describe” does not work, so their documentation is dead wrong: it’s not “interactive” triggers that won’t work but, indeed, triggers that call almost any external ClearCase command. I know that numerous companies use triggers for policy enforcement; to throw all those out the window in order to use the ClearCase Web interface is absurd.
Update: Version 7.0.1 seems to fix this so that almost all triggers work (those that modify the source file are said not to work). Even “clearprompt” seems to work.
It’s really too bad, because had AJAX been around when this was written they could have made a fairly nice interface, I’ll bet.
What a fantastic idea! Replace the clunky web interface (see above) with a small Java application which talks to the same server, and gives you decent performance even over a slow WAN connection. Unfortunately, the idea was kind of half-baked. There are tons of bugs, among them:
Some of these problems may be mitigated by 7.1, but it appears the server has been entirely rewritten in a way that will likely hamper upgrade efforts.
In any given version control system, a “trigger” could be run on the client or on the server. Subversion took the latter option; ClearCase, the former. Both approaches have their downsides, but running them on the client has several severe ones.
The first is that security is nearly impossible in an environment where users can manipulate their workstations. For example, many years ago, I had a trigger which simply ran “false” in order to make the operation fail. To get around it, one clever person replaced /bin/false with /bin/true; fortunately, he forgot to put it back, which is how we caught him. Had this person been a bit more careful, there would have been no way of knowing how his checkin got in despite the trigger.
The second downside is portability. There are a number of platforms on which the trigger must run, and throwing Windows into the mix makes this an even greater challenge. This leaves several options, all bad: maintain a separate trigger script for each platform; write the trigger in a language available on every platform (in practice, Perl); or have the client invoke the trigger remotely on a server.
The first option would quickly become a nightmare of keeping duplicate scripts in sync in all but the simplest of triggers. The third option is obviously a huge performance hit, and in a system which is already renowned for slowness, would be extremely unwise.
I took the second option, and wrote an elaborate trigger infrastructure to work around all the platform foibles and perl anachronisms. It’s about 4000 lines of perl (including perldoc). But, even so, there are a myriad of ways in which a trigger can fail.
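For flavor, here is a minimal sketch of the kind of policy logic a checkin trigger ends up containing. ClearCase passes trigger context through CLEARCASE_* environment variables (CLEARCASE_PNAME is the path of the element being operated on); the policy itself, the function name, and the fallback file name are all invented for this illustration:

```shell
# Hypothetical core of a pre-checkin trigger: refuse anything that
# isn't a C source file. ClearCase would supply CLEARCASE_PNAME;
# the "demo.c" fallback just makes the sketch runnable anywhere.
check_checkin() {
  case "$1" in
    *.c|*.h) return 0 ;;                              # allow C sources
    *) echo "checkin of $1 refused" >&2; return 1 ;;  # block the rest
  esac
}
check_checkin "${CLEARCASE_PNAME:-demo.c}" && echo "checkin allowed"
```

A real trigger would be attached with “cleartool mktrtype”, and the hard part, as noted above, is making even logic this simple behave identically on every platform.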
Now the good part of client-side triggers is that it is more scalable. It is better to have a trigger running on each person’s machine rather than have all of them running on a server.
Displaying good error messages is very difficult, since the UI on Windows loses the output generated by the triggers, as does the web interface (though I think 7.x improved this). Therefore I took to using “clearprompt”, but it is very ill suited for displaying long (a.k.a. informative) messages: the text ends up wrapped in odd ways and is often chopped off. Furthermore, you can’t select text from it (say, a URL).
Oh, and I discovered a serious problem many years ago. When you perform a triggerable action, ClearCase searches for all applicable triggers and builds a list, during which time the vob is in a semi-locked state. If you have hundreds of triggers, this can cause all kinds of problems. Admittedly, it was not smart to have that many triggers, and it was easily fixed.
Also, see Web Interface section above.
In pre-MultiSite days, if you had development at multiple sites, someone was going to be stuck accessing a vob via a WAN, which is unacceptably slow (see the performance section above). MultiSite promised to fix that by allowing vobs to be replicated between sites, such that each site would have local copies of each vob. It sounded wonderful, and my employer at the time (Informix) was lobbying hard for this product and was one of the first to deploy it.
Sadly, there was a hitch: mastership. MultiSite makes a key assumption: that any given branch will only ever be modified at a single site.
In all my years I have never seen such a situation, and over the years teams have only become more widely distributed. As such, “mastership” was troublesome for administrators, and confusing for users.
In order to mitigate this, explicit mastership was introduced (in v3, I think), so that mastership of a branch could be moved around on a per-file basis. This is an improvement given the following assumption: that any given file only needs to be modified by one team at a time.
Strike two. There are always files that multiple teams need to modify. Furthermore, this sort of mastership is confusing.
Next, request mastership was introduced, which allowed users to request mastership for a given branch or branch instance. This seems like a good idea, but there are several problems:
Here’s a different problem: when packets are being imported, each action has to be replayed. Normally this is quick… but if the packet contains 50,000 mklabel commands, your MultiSite queues become jammed. (See the Labeling section.)
And another issue: The entire vob database and source pools are replicated, even though it is rare for a remote site to use more than a few branches/versions. 90% of what’s being replicated is of no interest to a given site. As I understand it, Perforce has a better replication strategy where local replicas simply cache what is used locally, which would be a much smarter way of doing things.
There is no formal relationship between a branch and its base point; that key bit of information lives only in the config spec. So, given a branch name, there is no way to find out the base of the branch without asking someone. Guessing is a sure way to run into trouble.
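A typical branch config spec illustrates the point (branch and label names invented): the label rule below is the only record anywhere that “dev” was cut from “C1”, and nothing stops it from drifting out of date or being lost with the spec:

```
element * CHECKEDOUT
element * .../dev/LATEST
# the next line is the sole documentation of dev's base point
element * C1 -mkbranch dev
element * /main/LATEST -mkbranch dev
```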
Now, this is actually a feature since it means that the base point can be changed, which is a great optimization of the merge process. For example, given the following branch structure:
            dev ------o
           /
main -----o-----------o
          C1          C2
So, this means the “dev” branch was created based on the “C1” checkpoint. Let’s assume that 100 files have been changed on the “dev” branch, but on main 1000 files have been changed between “C1” and “C2”. If you do a merge from “C2” to the “dev” branch (which seems an intuitive way of rebasing), you will bring 1000 files into your “dev” branch, which means that another 1000 files will need to be merged from then on. However, if you first change your config spec to base “dev” on “C2” and then do a merge from “C2”, you have done the same thing, except you will only merge files which have been changed on both branches (which will be 100 or fewer).
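In config spec terms, the rebase amounts to editing a single rule before merging (a sketch using the labels from the diagram above):

```
# before the rebase: dev is based on C1
element * C1 -mkbranch dev

# after: rebase dev onto C2, then merge from C2; only files changed
# on dev itself (100 or fewer) will need an actual merge
element * C2 -mkbranch dev
```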
While the merge tools with ClearCase are some of the best I have seen, there are several key shortcomings:
Labels are a linear-time operation (O(n), for CS types), that is, the time taken is proportional to the number of elements being labeled. The fastest labeling rate I ever saw was about 25 files per second. For small source trees this is irrelevant, but for large ones it is insanely slow (see the Dynamic View section about large source trees). This can be mitigated to a certain degree by running the mklabel commands in parallel.
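The parallel approach can be sketched with a plain xargs pipeline. Here `echo` stands in for the real `cleartool` invocation so the sketch is runnable without ClearCase, the file names are invented, and in practice the list would come from something like `cleartool find`:

```shell
# Batch the file list and run mklabel in parallel (-P 2 means two
# concurrent invocations; `echo` stands in for `cleartool` here).
printf '%s\n' src/a.c src/b.c src/c.c src/d.c \
  | xargs -n 2 -P 2 echo cleartool mklabel -replace REL1
```

Note that `xargs -P` is a GNU/BSD extension, and, as the next paragraph explains, speeding up labeling this way just moves the bottleneck into the MultiSite queues.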
Furthermore, in a MultiSite environment, the update packets containing these mklabel commands clog things up since MultiSite replays these events at about the same pace they took to run in the first place. This clog can be made worse by labeling in parallel as suggested above.
Of course, using timestamps in a config spec can work just as well as a label, providing the engineering managers are willing to accept such a thing.
There should be a new type of label which is really just a config spec excerpt, which would, in turn, contain a branch and a timestamp.
When view profiles first appeared (in v3, I think), it seemed like they might address some problems with config specs. However, after working with them for several years, they seem like more of a hindrance than a help. First off, they are not portable to Unix; what is truly astounding about this mistake is that simply using forward slashes instead of backslashes would have done the trick (portability is further hindered by the software’s inability to handle Unix line endings). Secondly, there is no command line interface, which means that if you want your build scripts to use the same view profile that developers use, you have to cook up a wrapper to do so (my ClearCase::ConfigSpec perl module does this). It is nice that view profiles give you an easy, graphical way to create branches and deliver changes from them; however, this UI is missing a key feature: rebasing! How could such an essential feature have been forgotten?
It also astounds me that a product that specializes in version control would write the view profile mechanism so that it is exceedingly hard to incorporate into a VOB. I spent a lot of time figuring out how to check in view profiles and distribute them to all sites.
Also, if a view is associated with a view profile, all relevant vobs should be automatically mounted when the view is started. A great feature! However, it doesn’t work much of the time, and no errors are recorded as to why.
Another problem is that the automatically generated private branch config specs use the old, cumbersome -mkbranch modifiers rather than the mkbranch rule. Furthermore, they neglected to include the “-override” modifier, which would have greatly simplified how the private branch config specs are set up.
It seems to me that the view profile mechanism was written by someone who knew nothing of Windows/Unix portability, version control, the command line, recent config spec features or typical branch/merge techniques.
Of course, this all raises the question of why they didn’t simply extend the config spec mechanism to include vob lists and the like. They extended it for snapshot views…
As noted above, snapshot views can be slow to populate due to the number of round trips required. For large source trees, it can be prohibitively slow. With some systems (like Visual SourceSafe) this could be mitigated by running many updates in parallel on different directories, but, unfortunately, that is out of the question for ClearCase, as the snapshot view update is single-threaded and will not permit more than one update to be running at a time.
When I first read about clearmake and derived objects, I thought it was one of the cleverest ideas I had seen. However, I have never been able to use it in practice, since the only way to do so is to rewrite all your makefiles in its own generic dialect of Make.
Furthermore, the disk space requirements can be rather onerous, since a single old view can cause many old derived objects to be retained. So, to avoid an explosion of disk usage, frequent audits have to be run and users constantly pestered about old views.
TBD… line endings
The region synchronizer is dumb: it doesn’t understand when to use -gpath/-hpath for a vob on a NAS, and, worst of all, it is a Windows-only UI, which means I had to write a custom script to do automated vob tagging.
In ClearCase v4 a new “Scheduler” system was introduced, which purported to “fix” many of the problems with cron. In practice, however, it is a very cumbersome system. The first problem is that current job status and schedule information are mashed together into one “configuration file”, which makes version control of these files very tricky (it is odd for a company which specializes in version control to prevent its use). The job numbers are problematic and redundant (why not just use a job name?). Creating a new job is tricky, as there are so many entries to set up. You cannot tell from the sched file what command will be run; that is stored in another file which cannot be modified via the “sched” command! It appears as though this is a system which expects to be manipulated via a GUI, but in the years that this system has been in existence, no such GUI has surfaced.
Most version control systems I know of (e.g. CVS, SVN, VSS, RCS, SCCS, &c.) will expand certain keywords (like $Header$) inside text files to contain information about the version of the file being looked at. This is essential for identifying which files/versions contributed to a given version of a product. ClearCase, however, has no such thing. It is often suggested to implement this via a trigger. The problem is that such a trigger will cause any non-trivial merge to be a conflicting merge, since the same line has been modified on both branches.
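A trigger-based implementation boils down to a substitution like the following at checkin time (the path and version number here are invented; a real trigger would get them from ClearCase). It is precisely this rewritten line, different on every branch, that turns every merge into a conflicting one:

```shell
# What a keyword-expansion trigger would do: splice version info
# into the $Header$ keyword. Path and version are hypothetical.
echo 'static const char *id = "$Header$";' \
  | sed 's|\$Header\$|$Header: /vobs/proj/foo.c /main/dev/3 $|'
# prints: static const char *id = "$Header: /vobs/proj/foo.c /main/dev/3 $";
```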
Some have then suggested that a type manager be set up to help with this. This is a good idea except for one thing: there is no mechanism for deploying type managers, so every client needs to have the new type manager installed by hand. That’s not happening with over 1000 clients.
Such a type manager should have been a stock part of ClearCase from the beginning.
I brought up a big issue with type managers in the previous section. Another problem is that the type manager mechanism confuses two separate concepts: how the versions are stored, and how differences will be presented.
Case in point: the “ms_word” type manager is based on “file”, which stores full copies of every version. Old versions of Word documents are rarely used, so devoting all that disk space to them is dumb. I could convert the element type to “binary_delta_file”, but that would lose the MS Word diff magic.
This is not unique to ClearCase, by any means; programmers should be ashamed of the poorly written, uninformative or downright misleading error messages that have become commonplace. Any error message should answer the usual set of questions: who? what? when? how? why? That means it should include all relevant file names, the reason for the failure, identity information (if relevant), and, ideally, some hint as to how to fix it.
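As a sketch of what answering those questions might look like in practice (the function, file name, and wording are all invented for illustration), an error report should at minimum name the thing that failed, the reason, and a hint:

```shell
# A minimal error reporter: says what failed, on which file, why,
# and what to check. All names here are hypothetical.
fail() {
  printf "ERROR: cannot open '%s': %s (check the path and permissions)\n" \
    "$1" "$2" >&2
}
fail /etc/app.conf "No such file or directory"
```

Even this trivial message answers “what”, “which file”, “why”, and hints at “how to fix”, which is more than most of the messages in the hall of shame below manage.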
Here is a little error message “hall of shame”:
I had high hopes for the post-7.0 versions after attending the IBM Rational User’s conference in June 2007, though their predictions of future releases were overly optimistic. But upon getting version 7.1, I found the old installer had been replaced by a new one which is a confusing mess. All the initial documentation seemed to assume that installations would be done via a UI, which seems to indicate they forgot that most people have headless servers. The documentation on how to set up release areas and “silent installs” is hopelessly scattered and confusing. It is telling that the best information I have found about this comes from outside IBM.
This is a classic case of not following the maxim “if it ain’t broke, don’t fix it.” The old installer may have been a bit clunky, but it worked! I’m betting this is the work of some clueless pointy-haired IBM executive who demanded that ClearCase be brought into conformance with other IBM products. To what degree this effort has distracted engineers from actually improving the product remains to be seen: a year after 7.1 was released, my team has only just gotten an installation to work, and we are nowhere near figuring out how to deploy it to production servers.
cleartool: Error: Nonmastered checkouts are not permitted in unreplicated VOBs; pathname: "L:/ccperf_sdc78322svod_tfisher//cm_test_sun/somefile.cpp"
Here’s my take on the state of ClearCase: around the time that Rational took over ClearCase (1998, I think), the core product stagnated entirely. There have been no significant bug fixes or improvements since that time; note that most of the problems I mention above have been there for 10 years. A few new things were added, but they all seemed botched, in that they left part of the product out or didn’t support all platforms (plenty of examples above). When IBM took over, I had hoped that they would shake things up, and it seems that in the last year or so they have. But I fear it may be too late. I have seen many teams abandon ClearCase out of frustration, and competitive products (e.g. Subversion) pop up in the meantime, and, to be honest, I’m not entirely sad about that, since I am tired of being the messenger that everybody shoots at.
Some years ago I ran into a piece of code which shocked me, and in the time since then I have realized that it exemplified a lot of what is wrong with software. Sadly, I have since lost the code, so here is an approximation:
unless (open(F, "/some/important/file"))
{
    # We don't want to scare the users with an error message
    # warn "Unable to read config file";
}
Am I the only one who is outraged by this? Which is scarier to a user: getting an error message when a genuine error has occurred, or letting the software plod on, producing ever stranger and more nonsensical errors which cascade from the initial problem? For example, imagine the following code further on:
my $req = $http->request($config->{url});
die "Unable to contact web server $config->{url}\n" unless $req;
The config structure was empty because the file could not be read due to the earlier problem, so the error message simply says “Unable to contact web server”. You are now led to believe that the problem is with some unspecified web server. How much time will you waste trying to track that down?
So which is worse, “scared” or confused and frustrated?
There’s an old joke told many years ago by those who didn’t like Unix:
Ken Thompson has an automobile which he helped design. Unlike most automobiles, it has neither speedometer, nor gas gauge, nor any of the numerous idiot lights which plague the modern driver. Rather, if the driver makes any mistake, a giant “?” lights up in the center of the dashboard. “The experienced driver”, he says, “will usually know what’s wrong.”
I’m sure the early versions of ed inspired this. Though in those days, when every byte counted, a certain level of terseness was understandable. And the software was simple enough that there were a limited number of things which could be going wrong.
But now our computers are orders of magnitude bigger and more complicated. We have layer upon layer of drivers, libraries and applications, which nobody can understand in their entirety. And we still have a giant “?” lighting up on our dashboard. The combination of sloppy (or nonexistent) error handling and poor error reporting means that we all encounter incomprehensible or meaningless out-of-context error messages on a regular basis. Increasingly, I feel that this is the key problem with computers these days: we expend much of our time, energy and morale on the struggle of figuring out what the latest incomprehensible error message means.
Therefore, I will be devoting some time here to cataloging terrible error messages I run into and some of the bad programming practices that lead to them. I thought I should provide some warning (and context) before I vent my spleen.